# meataxe64

Meataxe64 is a large software development project to produce programs for working at high performance with large matrices over finite fields.

At the lowest level, the aim is to work modulo primes (only), using grease (much like “four Russians”) to reduce the amount of work, to use vectorized code in x86 assembler (SSE/AVX) to do the basic operations and to have short rows and few columns so that matrices fit suitably into the various levels of cache.  The objective is to run as fast as possible with as little use of real-memory bandwidth as possible.

At a middle level, the aim is to use linear functions to work with extension fields, and to chop the matrices up so that the lowest level can operate.

At a higher level, the aim is to make effective use of a multi-core environment, building on the advantage that the cache-friendly lower level provides to ensure that many cores can be used effectively.  The thread-farm looks after the messy signals, locks and thread handling.

It is hoped soon that a layer will be added to take a matrix that fits on disk but not in memory to extend the possible scale of operations further.

Finally I dream that a fault-tolerant distributed system can be build on top of this to handle matrices of gargantuan proportions, but this lies some considerable way into the future.

Go read the development blog, I certainly learned a lot from Richard Parker whenever we talked.

# Sage 5.10

Sage 5.10 was released earlier today. It has the following goodies I particularly care about:

## Faster Dense Linear Algebra over GF(2)

TL;DR: We updated M4RI to the most recent upstream release which is better suited for modern CPUs.

After Enrico Bertolazzi and Anna Rimoldi kicked out butts with their pre-print we went to work to re-tune M4RI. That is, I don’t agree with their premise that their new algorithm is the (main) cause of their impressive performance. As a result M4RI got considerably faster on modern CPUs.

Here’s a comparison of Sage 5.8 (which has the same performance characteristics as 5.9 for this stuff) and Sage 5.10. Sage 5.8 goes first:

sage: A = random_matrix(GF(2),2^14, 2^14)
sage: B = random_matrix(GF(2),2^14, 2^14)
sage: %time A*B
CPU times: user 4.46 s, sys: 0.02 s, total: 4.48 s
Wall time: 4.50 s

sage: %time A.echelonize()
CPU times: user 2.53 s, sys: 0.00 s, total: 2.53 s
Wall time: 2.54 s


Now Sage 5.10 which is 1.22 times faster for multiplication and 1.17 times faster for elimination in this particular benchmark.

sage: A = random_matrix(GF(2),2^14, 2^14)
sage: B = random_matrix(GF(2),2^14, 2^14)
sage: %time A*B
CPU times: user 3.61 s, sys: 0.04 s, total: 3.65 s
Wall time: 3.66 s

sage: %time A.echelonize()
CPU times: user 2.16 s, sys: 0.00 s, total: 2.17 s
Wall time: 2.17 s


For comparision, Magma 2.15-10 takes 4.5 seconds for this multiplication and Magma 2.18-7 takes 5 seconds on the same machine. See here for details on the M4RI update.

## Faster Dense Linear Algebra over GF(2^e)

TL;DR: Improvements over GF(2) have a knock-on effect on GF(2^e) and we upgraded M4RIE to the newest upstream release which extends the supported degree size up to $e \leq 16$

M4RIE recently dropped its dependency on Givaro and extended the degrees it supports up to 16. Sage 5.10 updates to this new release and extends the finite field size that is covered by M4RIE to $\mathbb{F}_{2^16}$. This means a huge performance improvements for dense linear algebra over $\mathbb{F}_{2^e}$ for $8 < e \leq 16$. Note, however, that these cases are not fully optimised yet, so that it’s not the fastest implementation  – in this range – yet. Sage 5.8 first:

sage: A = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: B = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: %time A*B
CPU times: user 32.07 s, sys: 0.48 s, total: 32.56 s
Wall time: 32.67 s
10000 x 10000 dense matrix over Finite Field in a of size 2^8

sage: A = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: B = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: %time A*B # Sage 5.8 uses generic Python code to do this
CPU times: user 339.02 s, sys: 0.70 s, total: 339.72 s
Wall time: 340.86 s
1000 x 1000 dense matrix over Finite Field in a of size 2^12


Now, Sage 5.10 which is 1.16 times and 1420 times faster respectively for these benchmarks.

sage: A = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: B = random_matrix(GF(2^8,'a'),10^4, 10^4)
sage: %time A*B # knock-on effect from GF(2) improvements
CPU times: user 27.42 s, sys: 0.62 s, total: 28.04 s
Wall time: 28.14 s
10000 x 10000 dense matrix over Finite Field in a of size 2^8

sage: A = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: B = random_matrix(GF(2^12,'a'),10^3, 10^3)
sage: %time A*B # new code in M4RIE
CPU times: user 0.23 s, sys: 0.01 s, total: 0.24 s
Wall time: 0.24 s
1000 x 1000 dense matrix over Finite Field in a of size 2^12


For comparison, Magma 2.15-10 takes 3.79 seconds and Magam 2.18-7 takes 0.16 seconds for the latter benchmark. This highlights that M4RIE isn’t what it should be yet in that range (see here for details).

# Sage 4.8 is out

If you care about the stuff I care about (and why else would you read this blog?) you might get excited about a few changes in Sage.

## Efficient linear algebra for $\mathbb{F}_{2^e}$

The very first non-trivial patch I ever produced for Sage was about interfacing with NTL for dense linear algebra over $\mathbb{F}_{2^e}$ (I was interested in algebraic attacks against AES at the time). Here’s William’s reply:

Your NTL patch worked perfectly for me first try. I tried more benchmarks (on Pentium-M 1.8Ghz).

[…]

This is pretty good; vastly better than what’s was in SAGE by default, and way better than PARI. Note that MAGMA is much faster though (nearly 8 times faster):

[…]

MAGMA uses (1) […] and (2) a totally different algorithm for computing the echelon form. […] As far as I know, the MAGMA method is not implemented anywhere in the open source world But I’d love to be wrong about that… or even remedy that.

Well, that was 2006. Fast forward to the year 2011 and we get the following timings for computing the reduced row echelon form of a 1,000 x 1,000 matrix over $\mathbb{F}_{256}$: Sage 4.7.2 takes 36.53 seconds , NTL 5.4.2 takes 31.06 seconds and Magma 2.15 does it in 0.87 seconds. So essentially, the situation didn’t change at all for the better.

With Sage 4.8 this situation changes dramatically  and we get that Sage performs this computation in 0.08 seconds, that’s 450 times faster than Sage 4.7.2. This is because M4RIE was merged in Sage 4.8. Hence, Sage is now (in some cases by far) fastest system to do linear algebra with dense matrices over $\mathbb{F}_{2^e}$ for $1 \leq e \leq 8$  and usually also for $9 \leq e \leq 10$.

## Efficient linear algebra for $\mathbb{F}_{p}$

One can tell a similar story for $\mathbb{F}_p$ for, say, small to medium sized primes $p$. In Sage 4.7.2 it took 1.12* seconds to multiply two 1,000 x 1,000 matrices over $\mathbb{F}_{251}$ (although you always had the option to call LinBox explicitly which was way faster but took more memory). With Sage 4.8 the same computation takes *0.16 seconds. For comparison, Magma 2.15 takes 0.22 seconds. So here again Sage moved from poor performance to best in class performance between 4.7.2 and 4.8 simply by making proper use of available libraries.

## Viable Alternative yet?

Overall, the story for dense linear algebra in Sage for small finite fields $\mathbb{F}_q$  is as follows.

 $q$ Implementation Comments $2$ M4RI Fastest implementation or equal performance depending on platform $3,5,7 \dots$ LinBox Decent performance, but faster implementations are known in the literature. Also, Magma is a bit faster on my machine. prime < $2^{23}$ LinBox Fastest implementation or equal performance depending on platform. $2^e$ for $2 \leq e \leq 8$ M4RIE Fastest $p^e$ for $p>2$ Generic Very poor performance, but some work is being done.

So, once we fix that last row Sage finally achieves “viable alternative” quality when it comes to dense linear algebra over $\mathbb{F}_{q}$ if $q$ is $q < 2^{16}$.