# M4RIE Paper

I’ve been writing up the ideas that went into the M4RIE library for dense linear algebra over small extensions of $\mathbb{F}_2$. I think it is now in a state to be readable enough to up a PDF of  the current draft online. Hence, here it is. While the paper does explain what we mean by “Travolta tables” it doesn’t explain why we call them that way … but the image below does:

# Talk about M4RI and M4RIE

I just finished with my talk at SIAM AG11 about the M4RI and the M4RIE libraries  … here are the slides.

# GAP for dense linear algebra over GF(2^e)

For some strange reason it hardly every occurs to me to benchmark GAP when it comes to linear algebra over GF(2^e). Turns out, that’s a really bad idea as it seems to be the best open-source implementation for such things. For example, computing the reduced row echelon form of a 4,000 x 4,000 random matrix over $\mathbb{F}_{16}$ takes 16.94s in GAP 4.4.12 but 3267s in Sage and 1005s in NTL 5.4.2. So much much faster than other established open-source projects out there.

I hope libgap becomes a reality at some point in the future.

PS: For comparison, Magma does the same task in in 4.67s and M4RIE in 0.71s.

# Asymptotically fast Gaussian elimination for M4RIE

I guess one perk of being on vacation is that one can get more work done. Hence, I finished implementing PLE decomposition (formerly known as PLS decomposition) and hence asymptotically fast Gaussian elimination. The implementation uses two matrix representations: mzd_slice_t and mzed_t. The former is optimised for implementing Karatsuba multiplication (cf., here) and the latter is optimised for using Travolta tables (cf., here). That is, multiplication is based on Karatsuba while the PLE base case is based on Travolta tables. The same applies to TRSM where the base case is also implemented using Travolta tables.

There is still a lot to be done:

• Both TRSM and PLE base cases use only one Travolta table while Gaussian elimination and multiplication use 6 and 8 in parallel respectively.
• There way too much copying going on. For example, I was lazy and implemented TRSM upper left with respect to matrices which do not start at offset 0 with respect to a machine word by simply copying the whole matrix out. Hence, we’re wasting memory where we shouldn’t.
• PLE isn’t cache efficient yet and I assume that the code isn’t very good for sparse-ish matrices (cf. the journey in M4RI improving this)
Still, the results are quite nice already. Below, the left hand side plots the wall time of computing the reduced row echelon form of $m \times (m + 1000)$ random matrices over $\mathbb{F}_{2^e}$. The right hand side expresses efficiency by dividing the running time by $2mnr^{\omega-2}$, i.e., the number of operations (fingers crossed I didn’t screw up the constant).The fact, that we’re not there yet, is clearly expressed by the poor performance of the asymptotically fast code (“PLE”) for small dimensions ($< 3,000$).
Finally, it should be mentioned, that the green plot (“Travolta”) already is considerably faster than Magma, which – as far as I know – is the fastest other implementation of this kind of operation over these fields. Below, Travolta based elimination is compared with Magma’s implementation, where blueness $b$ expresses a speed-up factor of $2^b$ (redness the other way around).The code is on bitbucket.

# Talk about M4RI and M4RIE

I just finished giving a talk about the M4RI & the M4RIE libraries for dense linear algebra over $\mathbb{F}_2$ and $\mathbb{F}_{2^e}$ respectively. I’m in LORIA, Nancy right now visiting the CARAMEL team, btw. Well, here are the slides.

# Karatsuba Multiplication of Dense Matrices over GF(2^e)

Elements in $GF(2^e)$ can be represented as $c_0 a^0 + c_1 a^1 + \dots + c_{e-1} a^{e-1}$ where $a$ is a root of the primitive polynomial of $GF(2^e)$ and $c_i$ are elements in $GF(2)$. Thus, we can rewrite a matrix over $GF(2^e)$ as a list of $e$ matrices over $GF(2)$ which contain the coefficients for each degree of $a$. To put it in another way: we can turn the matrix with polynomial entries into a polynomial with matrix coefficients. In Bitslicing and the Method of Four Russians Over Larger Finite Fields Tomas J. Boothby and Robert Bradshaw (both Sage developers from Seattle by the way) proposed to use this representation to perform matrix multiplication.

#+HTML <!–more–>

As an example consider $GF(2^2)$ with the primitive polynomial $x^2 + x + 1$. Suppose we want to compute $C = AB$. The matrix $A$ can be represented as $A_0x + A_1$ and the matrix $B$ can be represented as $B_0x + B_1$. Their product is $C = A_0B_0x^2 + (A_0B_1 + A_1B_0)x + A_1B_1.$

Finally, reduction modulo $x^2 + x + 1$ gives

$(A_0B_0 + A_0B_1 + A_1B_0)x + A_1B_1 + A_0B_0$.

This product thus costs 4 multiplications over $GF(2)$ and three additions over $GF(2)$.

Optionally, this last expression can be rewritten as

$((A_0 + A_1)(B_0 + B_1) + A_1B_1)x + A_1B_1 + A_0B_0$

using the divide and conquer trick Karatsuba introduced in the 60s. Thus this multiplication costs 3 matrix multiplications over  $GF(2)$ and 4 matrix additions over $GF(2)$.

To see why this strategy for matrix multiplication might be efficient, consider the following table which shows

1. the CPU time for multiplying dense 4,000 x 4,000 matrices over $GF(2^e)$,
2. to how many matrix multiplications over $GF(2)$ this corresponds,
3. how many GF(2) multiplications naive polynomial multiplication would need and
4. how many Karatsuba needs.
 e CPU time GF(2) factor naive Karatsuba 2 0.664 9.222 4 3 3 1.084 16.937 9 4 1.288 17.889 16 9 5 3.456 50.824 25 6 4.396 64.647 36 7 6.080 84.444 49 8 11.117 163.471 64 27 9 37.842 556.473 81 10 47.143 620.261 100

All times were taken on my Intel i7 powering my Macbook Pro using the M4RI and M4RIE libraries for $GF(2)$ and $GF(2^e)$ respectively.

So much for the theoretical complexity. Yesterday, I implemented this multiplication routine in M4RIE for $GF(2^2)$. The tables below give the CPU time for multiplication of dense $n \times n$ matrices over $GF(2^2)$

1. for the old M4RIE Travolta + Strassen code,
2. the new Karatsuba based code and
3. how long three matrix multiplications over $GF(2)$ take.

The second column includes the time needed for converting back and forth between the M4RIE matrix layout and the bitsliced representation needed for Karatsuba. The first table is again for my Macbook Pro’s Intel i7 clocked at 2.6Ghz:

 n Strassen + Travolta Karatsuba 3 * GF(2) CPU time 1000 0.012 0.012 0.012 2000 0.068 0.052 0.024 3000 0.224 0.136 0.096 4000 0.648 0.280 0.336 5000 1.144 0.520 0.432 6000 1.952 0.984 1.008 7000 3.272 1.444 1.632 8000 4.976 2.076 2.484 9000 6.444 2.784 2.628 10000 8.761 3.668 3.528

The second table is for an old AMD Opteron 885 clocked at 2.6Ghz :

 n Strassen + Travolta Karatsuba 3 * GF(2) CPU time 1000 0.100 0.036 0.024 2000 0.724 0.108 0.084 3000 2.076 0.336 0.264 4000 4.996 0.736 0.636 5000 9.509 1.360 1.116 6000 14.989 2.416 2.340 7000 22.985 3.276 3.204 8000 35.302 4.788 4.692 9000 47.695 6.304 5.892 10000 64.712 9.101 7.980

A third table for a new Opteron 8439 SE (redhaw):

 n Strassen + Travolta Karatsuba 3 * GF(2) CPU time 1000 0.020 0.020 0.060 2000 0.140 0.070 0.030 3000 0.470 0.200 0.150 4000 1.120 0.480 0.390 5000 2.090 0.870 0.690 6000 3.490 1.500 1.260 7000 5.440 2.270 1.950 8000 8.050 3.230 2.850 9000 10.710 4.560 4.140 10000 14.580 5.770 5.190

Ignoring measurement imprecisions (especially in the first table) it is clear that this new approach is much faster than the old one implemented in M4RIE especially on the Opteron. However, it seems on the Opteron our conversion between M4RIE and Karatsuba has a considerable cost, input how to improve that would be very welcome since I’m out of ideas for now. To compare with Magma: for the $10,000 \times 10,000$ case Magma takes 9.18 seconds on the i7 and 11.8 seconds on the Opteron 858. I didn’t compare with Tom and Robert’s implementation but I expect them to essentially be at 3 matrix multiplications over $GF(2)$ since they have no conversion overhead.

I should point out though that the code for $GF(2^2)$ is the least efficient in M4RIE since I only implemented 8 parallel Travolta table which means that over $GF(2^2)$ only $8 \cdot 2 = 16$ bits are dealt with at each step in the inner loop. We could use more tables to fill up L1 and we could also implement Kronrod’s method aka M4RM for $GF(2^2)$.

While I expect that we could catch up to Karatsuba at least on Intel CPUs over $GF(2^2)$, I assume that the Karatsuba timings are close to optimal since matrix multiplication in M4RI seems to be close to optimal at least without further tricks being discovered and 3 matrix multiplications seems to be the best one can do for degree two polynomials.

I guess I’ll implement Karatsuba for $GF(2^3)$ and $GF(2^4)$, but I’m not sure I can be asked to do it for bigger fields if I don’t figure out a nice generic way of implementing it where I don’t have to write code for each minimal polynomial etc.

# M4RIE new beta releases

I just uploaded new releases of M4RI and M4RIE to the M4RI website. Not much changed except that I implemented Strassen multiplication in M4RIE. The results are visualised below, where blue means M4RIE wins and red means Magma wins. The first two plots are for a Xeon (eno at UW) and the last two for an Opteron (prai243 at RHUL).

I’ve also updated the Sage package http://trac.sagemath.org/sage_trac/ticket/9562 such that it is easier to play around with the new code.

# M4RIE

My coding spring project here at Sage Days 24 was/is to write a new library for dense linear algebra over small extensions of GF(2).

The main idea is to represent matrices over GF(2^e) internally as M4RI matrices over GF(2) in a row major flat representation. This allows to  re-use many of the M4RI functions such as addition of rows and matrices, stacking, augmenting etc.

For the elimination and multiplication base  cases we use a simple trick inspired by M4RI: Assume we found a pivot, now we need to rescale it and then add the right multiple of it to every row with a nonzero entry in the same column. If we have an n x n matrix we expect to perform n^2 multiplications and n^2 additions per pivot. However, if the field is small then we can precompute all possible multiplies of our pivot row and store them in a table. Then, we use this table as a lookup to perform the elimination of other rows. This costs 2^k * n multiplications and  n^2 additions per pivot. Note that M4RI has efficient functions to perform the look-up + elimination part.

We haven’t implemented any asymptotically fast methods yet but I hope to finish with Strassen multiplication this week. Asymptotically fast elimination requires slightly more infrastructure (TRSM, taking care of  offsets which are not zero % 64 etc.) If anybody is interested in helping out that would be  greatly appreciated. Also, I guess one should also try M4RI (the algorithm) for GF(2^2).

Some first results

In these plots red means that Magma is faster by a factor of e^abs(red) and blue means that our code is faster by a factor of e^abs(blue). More details can be found on the SD24 wiki page.

To get a sense for absolute numbers,  here are some preliminary benchmarks for row echelon forms of dense  random n x n matrices over GF(2^4)

|| n    || Sage 4.5 || NTL/2 || Magma || M4RIE (old)|| M4RIE (new) ||
|| 1000 ||    49.49 || 18.84 || 0.090 || 0.097      || 0.046       ||
|| 2000 ||   429.05 || 149.11|| 0.510 || 0.529      || 0.28        ||
|| 3000 ||  1494.33 ||526.57 || 1.640 || 2.315      || 1.00        ||

Bottom line it is much much faster (> 1,000x) than what we have in Sage right now (which is the generic implementation). Also note that switching to NTL would not improve things much.

I should point out that the strategy for multiplication in Tom and  Robert’s paper Bitslicing and the Method of Four Russians Over Larger Finite Fields is likely to be better.  Judging from the timings in that paper we are about a factor of two  behind them. I plan to implement/port their very cool trick for finite  extension fields at some point in the future. The trick is limited to  multiplication as far as I understand it thus it might make still sense to consider my representation for the elimination base case etc.