## Looking Back and Forward for Open-Source Mathematics Software (2014)

When a year ends people make lists. I can only guess that several people are currently busy with writing “The 5 most revised papers on eprint ” and “The 8 best IACR flagship conference rump session presentations of 2014”. Since all the good lists are taken, my list has to be a little bit more personal. Alas, here is my list of stuff that happened in open-source computational mathematics in 2014 around me. That is, below I list what developments happened in 2014 and try to provide an outlook for 2015 (so that I can come back in a year to notice that nothing played out as planned).

If you are interested in any of the projects below feel invited to get involved. Also, if you are student and you are interested in working on one of the (bigger) projects listed below over the summer, get in touch: we could try to turn it into a Google Summer of Code 2015 project.

## meataxe64

Meataxe64 is a large software development project to produce programs for working at high performance with large matrices over finite fields.

At the lowest level, the aim is to work modulo primes (only), using grease (much like “four Russians”) to reduce the amount of work, to use vectorized code in x86 assembler (SSE/AVX) to do the basic operations and to have short rows and few columns so that matrices fit suitably into the various levels of cache.  The objective is to run as fast as possible with as little use of real-memory bandwidth as possible.

At a middle level, the aim is to use linear functions to work with extension fields, and to chop the matrices up so that the lowest level can operate.

At a higher level, the aim is to make effective use of a multi-core environment, building on the advantage that the cache-friendly lower level provides to ensure that many cores can be used effectively.  The thread-farm looks after the messy signals, locks and thread handling.

It is hoped soon that a layer will be added to take a matrix that fits on disk but not in memory to extend the possible scale of operations further.

Finally I dream that a fault-tolerant distributed system can be build on top of this to handle matrices of gargantuan proportions, but this lies some considerable way into the future.

Go read the development blog, I certainly learned a lot from Richard Parker whenever we talked.

## Three sweet but short postdocs in France

The HPAC project has three one-year postdoc positions available:

Three research positions (postdoc or research engineer), offered by the French ANR project HPAC  (High Performance Algebraic Computation), are open.

Title: High Performance Algebraic Computing

Keywords: parallel computing, computer algebra, linear algebra, C/C++ programming

Locations:

• Grenoble, France (LIG-MOAIS, LJK-CASYS),
• Lyon, France (LIP-AriC),
• Paris, France (LIP6-PolSys),

Starting date: between June 2014 and January 2015

Type of position: 3 postdoc or research engineer positions of 1 year each

Detailed descriptions:

General Context:

The ambition of the project HPAC is to provide international reference high-performance libraries for exact linear algebra and algebraic systems on multi-processor architectures and to influence parallel programming approaches for algebraic computing. It focuses on the design of new parallel algorithms and building blocks dedicated to exact linear algebra routines. These blocks will then be used for the parallelization of the sequential code of the LinBox and FGb libraries, state of the art for exact linear algebra and polynomial systems solving, and used in many computer algebra systems. The project combines several areas of expertise: parallel runtime and language, exact,
symbolic and symbolic/numeric algorithmic, and software engineering.

Profile of the positions:

We are seeking for candidates with solid expertise in software library design and developments (e.g. C, C++, OpenMP, Autotools, versioning,…) with preferably good background on mathematical software and computer algebra algorithmic. The main outcome of the work will depend on the type of the position (postdoc or engineer) and include code development in open-source C/C++ libraries such as LinBox, FGb, Kaapi and research publications in international journals or conferences.

Each location is seeking for candidates matching with the following keywords:

• Lyon: (contact: Gilles….@ens-lyon.fr) High performance/parallel computer algebra, symbolic and mixed symbolic-numeric linear algebra,  validated computation, high performance Euclidean lattice computation, lattice basis reduction.
• Grenoble: (contact: Jean-Guill…@imag.fr) Library design and development, LinBox, Sage, XKaapi, parallel exact linear algebra, work-stealing and data-flow tasks.
• Paris: (contact: Jean-Charl…@groebner.org) Polynomial system solving, Gröbner basis computations, parallel exact linear algebra, algebraic cryptanalysis, distributed computing.

Feel free to exchange with the contact person of each site for further information.

## Lazy Modulus Switching for the BKW Algorithm on LWE

our paper (with Jean-Charles FaugèreRobert Fitzpatrick and Ludovic Perret) on solving small secret LWE faster just hit ePrint (and was accepted for presentation at PKC 2014)

Abstract. Some recent constructions based on LWE do not sample the secret uniformly at random but rather from some distribution which produces small entries. The most prominent of these is the binary-LWE problem where the secret vector is sampled from {0, 1}* or {-1, 0, 1}*. We present a variant of the BKW algorithm for binary-LWE and other small secret variants and show that this variant reduces the complexity for solving binary-LWE. We also give estimates for the cost of solving binary-LWE instances in this setting and demonstrate the advantage of this BKW variant over standard BKW and lattice reduction techniques applied to the SIS problem. Our variant can be seen as a combination of the BKW algorithm with a lazy variant of modulus switching which might be of independent interest.

The code used to produce experimental data is available on bitbucket, source code to compute our complexity estimations is also available. Slides for a presentation discussing this work are also available on bitbucket.

## Enrico Bertolazzi’s linear algebra code over GF(2) available

Enrico made the code (if the link doesn’t work search for his name on Research Gate) for his LU factorisation code over GF(2) available online under the GPL. This is an implement of the algorithm described by Anna and him in Fast matrix decomposition in F2 and for which they give timings in that paper (discussed a bit here). I had to fix some includes to make it compile on my box, but nothing major. I can also confirm the impressive performance of their software (I ran testRankComputation).

Continue reading “Enrico Bertolazzi’s linear algebra code over GF(2) available”

## Sage 5.10

Sage 5.10 was released earlier today. It has the following goodies I particularly care about:

## Faster Dense Linear Algebra over GF(2)

TL;DR: We updated M4RI to the most recent upstream release which is better suited for modern CPUs.

After Enrico Bertolazzi and Anna Rimoldi kicked out butts with their pre-print we went to work to re-tune M4RI. That is, I don’t agree with their premise that their new algorithm is the (main) cause of their impressive performance. As a result M4RI got considerably faster on modern CPUs.

Here’s a comparison of Sage 5.8 (which has the same performance characteristics as 5.9 for this stuff) and Sage 5.10. Sage 5.8 goes first:

sage: A = random_matrix(GF(2),2^14, 2^14)
sage: B = random_matrix(GF(2),2^14, 2^14)
sage: %time A*B
CPU times: user 4.46 s, sys: 0.02 s, total: 4.48 s
Wall time: 4.50 s

sage: %time A.echelonize()
CPU times: user 2.53 s, sys: 0.00 s, total: 2.53 s
Wall time: 2.54 s


Now Sage 5.10 which is 1.22 times faster for multiplication and 1.17 times faster for elimination in this particular benchmark.

sage: A = random_matrix(GF(2),2^14, 2^14)
sage: B = random_matrix(GF(2),2^14, 2^14)
sage: %time A*B
CPU times: user 3.61 s, sys: 0.04 s, total: 3.65 s
Wall time: 3.66 s

sage: %time A.echelonize()
CPU times: user 2.16 s, sys: 0.00 s, total: 2.17 s
Wall time: 2.17 s


For comparision, Magma 2.15-10 takes 4.5 seconds for this multiplication and Magma 2.18-7 takes 5 seconds on the same machine. See here for details on the M4RI update.

## Faster Dense Linear Algebra over GF(2^e)

TL;DR: Improvements over GF(2) have a knock-on effect on GF(2^e) and we upgraded M4RIE to the newest upstream release which extends the supported degree size up to $e \leq 16$

M4RIE recently dropped its dependency on Givaro and extended the degrees it supports up to 16. Sage 5.10 updates to this new release and extends the finite field size that is covered by M4RIE to $\mathbb{F}_{2^16}$. This means a huge performance improvements for dense linear algebra over $\mathbb{F}_{2^e}$ for $8 < e \leq 16$. Note, however, that these cases are not fully optimised yet, so that it’s not the fastest implementation  – in this range – yet. Sage 5.8 first:

sage: A = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: B = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: %time A*B
CPU times: user 32.07 s, sys: 0.48 s, total: 32.56 s
Wall time: 32.67 s
10000 x 10000 dense matrix over Finite Field in a of size 2^8

sage: A = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: B = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: %time A*B # Sage 5.8 uses generic Python code to do this
CPU times: user 339.02 s, sys: 0.70 s, total: 339.72 s
Wall time: 340.86 s
1000 x 1000 dense matrix over Finite Field in a of size 2^12


Now, Sage 5.10 which is 1.16 times and 1420 times faster respectively for these benchmarks.

sage: A = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: B = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: %time A*B # knock-on effect from GF(2) improvements
CPU times: user 27.42 s, sys: 0.62 s, total: 28.04 s
Wall time: 28.14 s
10000 x 10000 dense matrix over Finite Field in a of size 2^8

sage: A = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: B = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: %time A*B # new code in M4RIE
CPU times: user 0.23 s, sys: 0.01 s, total: 0.24 s
Wall time: 0.24 s
1000 x 1000 dense matrix over Finite Field in a of size 2^12


For comparison, Magma 2.15-10 takes 3.79 seconds and Magam 2.18-7 takes 0.16 seconds for the latter benchmark. This highlights that M4RIE isn’t what it should be yet in that range (see here for details).