I’ve been writing up the ideas that went into the M4RIE library for dense linear algebra over small extensions of GF(2). I think it is now in a state readable enough to put a PDF of the current draft online. Hence, here it is. While the paper does explain what we mean by “Travolta tables”, it doesn’t explain why we call them that … but the image below does:


# GAP for dense linear algebra over GF(2^e)

For some strange reason it hardly ever occurs to me to benchmark GAP when it comes to linear algebra over GF(2^e). Turns out, that’s a really bad idea, as it seems to be the best open-source implementation for such things. For example, computing the reduced row echelon form of a 4,000 x 4,000 random matrix over such a field takes 16.94s in GAP 4.4.12 but 3267s in Sage and 1005s in NTL 5.4.2. So GAP is much, much faster than the other established open-source projects out there.

I hope libgap becomes a reality at some point in the future.

PS: For comparison, Magma does the same task in 4.67s and M4RIE in 0.71s.

# Asymptotically fast Gaussian elimination for M4RIE

I guess one perk of being on vacation is that one can get more work done. Hence, I finished implementing PLE decomposition (formerly known as PLS decomposition) and thus asymptotically fast Gaussian elimination. The implementation uses two matrix representations: **mzd_slice_t** and **mzed_t**. The former is optimised for implementing Karatsuba multiplication (cf. here) and the latter for using Travolta tables (cf. here). That is, multiplication is based on Karatsuba, while the PLE base case is based on Travolta tables. The same applies to TRSM, where the base case is also implemented using Travolta tables.

There is still a lot to be done:

- Both TRSM and PLE base cases use only one Travolta table while Gaussian elimination and multiplication use 6 and 8 in parallel respectively.
- There is way too much copying going on. For example, I was lazy and implemented upper-left TRSM for matrices which do not start at offset 0 within a machine word by simply copying the whole matrix out. Hence, we’re wasting memory where we shouldn’t.
- PLE isn’t cache efficient yet and I assume the code isn’t very good for sparse-ish matrices (cf. the journey in M4RI improving this).

That we are not quite *there* yet is clearly expressed by the poor performance of the asymptotically fast code (“PLE”) for small dimensions. In the plots (omitted here), *blue*ness expresses a speed-up factor of the new code over the old and *red*ness the other way around. The code is on bitbucket.

# Talk about M4RI and M4RIE

I just finished giving a talk about the M4RI & M4RIE libraries for dense linear algebra over GF(2) and GF(2^e) respectively. I’m at LORIA, Nancy right now, visiting the CARAMEL team, btw. Well, here are the slides.

# Karatsuba Multiplication of Dense Matrices over GF(2^e)

Elements in GF(2^e) can be represented as polynomials $a_{e-1}x^{e-1} + \dots + a_1x + a_0$, where $x$ is a root of the primitive polynomial of GF(2^e) and the $a_i$ are elements in GF(2). Thus, we can rewrite a matrix over GF(2^e) as a list of $e$ matrices over GF(2) which contain the coefficients for each degree of $x$. To put it another way: we can turn the matrix with polynomial entries into a polynomial with matrix coefficients. In *Bitslicing and the Method of Four Russians Over Larger Finite Fields*, Tomas J. Boothby and Robert Bradshaw (both Sage developers from Seattle, by the way) proposed to use this representation to perform matrix multiplication.
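As a toy illustration of this rewriting (this is not M4RIE’s API; the encoding and names are my own), here is how a matrix over GF(2^2) can be split into its two GF(2) coefficient matrices and recombined. An element $a \cdot x + b$ of GF(2^2) is encoded as the integer $2a + b$:

```python
# Hypothetical sketch, not M4RIE code: "bitslice" a GF(2^2) matrix.
# An element a*x + b of GF(2^2) is encoded as the integer 2*a + b, so
# A = A0*x + A1 entrywise, with A0 and A1 matrices over GF(2).

def bitslice(A):
    """Split a GF(2^2) matrix into (A0, A1) with A = A0*x + A1."""
    A0 = [[(entry >> 1) & 1 for entry in row] for row in A]  # x coefficients
    A1 = [[entry & 1 for entry in row] for row in A]         # constant terms
    return A0, A1

def unslice(A0, A1):
    """Recombine the two GF(2) coefficient matrices."""
    return [[2 * a + b for a, b in zip(r0, r1)] for r0, r1 in zip(A0, A1)]
```

For example, `bitslice([[2, 3], [1, 0]])` yields the x-coefficient matrix `[[1, 1], [0, 0]]` and the constant matrix `[[0, 1], [1, 0]]`.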


As an example consider GF(2^2) with the primitive polynomial $x^2 + x + 1$. Suppose we want to compute $C = A \cdot B$. The matrix $A$ can be represented as $A = A_0x + A_1$ and the matrix $B$ can be represented as $B = B_0x + B_1$, where the $A_i$ and $B_i$ are matrices over GF(2). Their product is $C = A_0B_0x^2 + (A_0B_1 + A_1B_0)x + A_1B_1.$

Finally, reduction modulo $x^2 + x + 1$ gives

$C = (A_0B_0 + A_0B_1 + A_1B_0)x + A_0B_0 + A_1B_1.$

This product thus costs four matrix multiplications over GF(2) and three matrix additions over GF(2).

Optionally, this last expression can be rewritten as

$C = ((A_0 + A_1)(B_0 + B_1) + A_1B_1)x + A_0B_0 + A_1B_1$

using the divide-and-conquer trick Karatsuba introduced in the 60s. Thus this multiplication costs three matrix multiplications over GF(2) and four matrix additions over GF(2).
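A quick sanity check of the two formulas (a toy sketch with list-of-lists GF(2) matrices, nothing like M4RIE’s packed C internals) confirms that the three-multiplication Karatsuba variant agrees with the four-multiplication schoolbook variant:

```python
# Toy GF(2) matrix arithmetic (matrices as lists of 0/1 lists); this is
# an illustration only, not M4RIE code.

def madd(X, Y):
    """Matrix addition over GF(2), i.e. entrywise XOR."""
    return [[x ^ y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mmul(X, Y):
    """Matrix multiplication over GF(2)."""
    return [[sum(X[i][k] & Y[k][j] for k in range(len(Y))) & 1
             for j in range(len(Y[0]))] for i in range(len(X))]

def mul_schoolbook(A0, A1, B0, B1):
    """C = (A0*x + A1)(B0*x + B1) mod x^2 + x + 1, four multiplications."""
    C0 = madd(madd(mmul(A0, B0), mmul(A0, B1)), mmul(A1, B0))
    C1 = madd(mmul(A0, B0), mmul(A1, B1))
    return C0, C1  # C = C0*x + C1

def mul_karatsuba(A0, A1, B0, B1):
    """Same product with three multiplications (Karatsuba)."""
    P0 = mmul(A0, B0)
    P1 = mmul(A1, B1)
    P2 = mmul(madd(A0, A1), madd(B0, B1))
    return madd(P2, P1), madd(P0, P1)
```

Karatsuba trades one of the (expensive) matrix multiplications for two extra (cheap) matrix additions, which is exactly the trade-off the table below quantifies.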

To see why this strategy for matrix multiplication might be efficient, consider the following table, which shows

- the CPU time for multiplying dense 4,000 x 4,000 matrices over GF(2^e),
- how many matrix multiplications over GF(2) this corresponds to,
- how many GF(2) multiplications naive polynomial multiplication would need and
- how many Karatsuba needs.

| e  | CPU time (s) | GF(2) factor | naive | Karatsuba |
|----|--------------|--------------|-------|-----------|
| 2  | 0.664        | 9.222        | 4     | 3         |
| 3  | 1.084        | 16.937       | 9     |           |
| 4  | 1.288        | 17.889       | 16    | 9         |
| 5  | 3.456        | 50.824       | 25    |           |
| 6  | 4.396        | 64.647       | 36    |           |
| 7  | 6.080        | 84.444       | 49    |           |
| 8  | 11.117       | 163.471      | 64    | 27        |
| 9  | 37.842       | 556.473      | 81    |           |
| 10 | 47.143       | 620.261      | 100   |           |

All times were taken on the *Intel i7* powering my MacBook Pro, using the M4RI and M4RIE libraries for GF(2) and GF(2^e) respectively.

So much for the theoretical complexity. Yesterday, I implemented this multiplication routine in M4RIE for GF(2^2). The tables below give the CPU time for multiplication of dense matrices over GF(2^2)

- for the old M4RIE Travolta + Strassen code,
- for the new Karatsuba-based code and
- how long three matrix multiplications over GF(2) take.

The second column includes the time needed for converting back and forth between the M4RIE matrix layout and the bitsliced representation needed for Karatsuba. The first table is again for my MacBook Pro’s Intel i7 clocked at 2.6GHz:

| n     | Strassen + Travolta (s) | Karatsuba (s) | 3 × GF(2) (s) |
|-------|-------------------------|---------------|---------------|
| 1000  | 0.012                   | 0.012         | 0.012         |
| 2000  | 0.068                   | 0.052         | 0.024         |
| 3000  | 0.224                   | 0.136         | 0.096         |
| 4000  | 0.648                   | 0.280         | 0.336         |
| 5000  | 1.144                   | 0.520         | 0.432         |
| 6000  | 1.952                   | 0.984         | 1.008         |
| 7000  | 3.272                   | 1.444         | 1.632         |
| 8000  | 4.976                   | 2.076         | 2.484         |
| 9000  | 6.444                   | 2.784         | 2.628         |
| 10000 | 8.761                   | 3.668         | 3.528         |

The second table is for an old AMD Opteron 885 clocked at 2.6GHz:

| n     | Strassen + Travolta (s) | Karatsuba (s) | 3 × GF(2) (s) |
|-------|-------------------------|---------------|---------------|
| 1000  | 0.100                   | 0.036         | 0.024         |
| 2000  | 0.724                   | 0.108         | 0.084         |
| 3000  | 2.076                   | 0.336         | 0.264         |
| 4000  | 4.996                   | 0.736         | 0.636         |
| 5000  | 9.509                   | 1.360         | 1.116         |
| 6000  | 14.989                  | 2.416         | 2.340         |
| 7000  | 22.985                  | 3.276         | 3.204         |
| 8000  | 35.302                  | 4.788         | 4.692         |
| 9000  | 47.695                  | 6.304         | 5.892         |
| 10000 | 64.712                  | 9.101         | 7.980         |

A third table is for a new Opteron 8439 SE (redhaw):

| n     | Strassen + Travolta (s) | Karatsuba (s) | 3 × GF(2) (s) |
|-------|-------------------------|---------------|---------------|
| 1000  | 0.020                   | 0.020         | 0.060         |
| 2000  | 0.140                   | 0.070         | 0.030         |
| 3000  | 0.470                   | 0.200         | 0.150         |
| 4000  | 1.120                   | 0.480         | 0.390         |
| 5000  | 2.090                   | 0.870         | 0.690         |
| 6000  | 3.490                   | 1.500         | 1.260         |
| 7000  | 5.440                   | 2.270         | 1.950         |
| 8000  | 8.050                   | 3.230         | 2.850         |
| 9000  | 10.710                  | 4.560         | 4.140         |
| 10000 | 14.580                  | 5.770         | 5.190         |

Ignoring measurement imprecision (especially in the first table), it is clear that this new approach is much faster than the old one implemented in M4RIE, especially on the Opteron. However, it seems that on the Opteron our conversion between the M4RIE and Karatsuba representations has a considerable cost; input on how to improve that would be very welcome, since I’m out of ideas for now. To compare with Magma: the same computation takes 9.18 seconds on the i7 and 11.8 seconds on the Opteron 885. I didn’t compare with Tom and Robert’s implementation, but I expect them to essentially be at three matrix multiplications over GF(2), since they have no conversion overhead.

I should point out, though, that the GF(2^2) code is the least efficient in M4RIE, since I only implemented 8 parallel Travolta tables, which means that over GF(2^2) only a few bits are dealt with at each step in the inner loop. We could use more tables to fill up L1 and we could also implement Kronrod’s method, aka M4RM, for GF(2^2).

While I expect that we could catch up to Karatsuba at least on Intel CPUs over GF(2^2), I assume that the Karatsuba timings are close to optimal: matrix multiplication in M4RI seems to be close to optimal (at least without further tricks being discovered), and three matrix multiplications seem to be the best one can do for degree-two polynomials.

I guess I’ll implement Karatsuba for the next couple of small fields, but I’m not sure I can be bothered to do it for bigger fields unless I figure out a nice generic way of implementing it where I don’t have to write code for each minimal polynomial etc.

# M4RIE new beta releases

I just uploaded new releases of M4RI and M4RIE to the M4RI website. Not much changed except that I implemented Strassen multiplication in M4RIE. The results are visualised below, where blue means M4RIE wins and red means Magma wins. The first two plots are for a Xeon (eno at UW) and the last two for an Opteron (prai243 at RHUL).

I’ve also updated the Sage package http://trac.sagemath.org/sage_trac/ticket/9562 such that it is easier to play around with the new code.

# M4RIE

My coding sprint project here at Sage Days 24 was/is to write a new library for dense linear algebra over small extensions of GF(2).

The main idea is to represent matrices over GF(2^e) internally as M4RI matrices over GF(2) in a row-major flat representation. This allows us to re-use many of the M4RI functions, such as addition of rows and matrices, stacking, augmenting, etc.

For the elimination and multiplication base cases we use a simple trick inspired by M4RI: assume we found a pivot; now we need to rescale it and then add the right multiple of it to every row with a nonzero entry in the same column. If we have an n x n matrix, we expect to perform n^2 multiplications and n^2 additions per pivot. However, if the field is small, then we can precompute all possible multiples of our pivot row and store them in a table. Then, we use this table as a lookup to perform the elimination of the other rows. For a field with 2^k elements this costs 2^k * n multiplications and n^2 additions per pivot. Note that M4RI has efficient functions to perform the look-up + elimination part.
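Here is a toy Python sketch of that table trick over GF(2^2) (my own encoding and function names; M4RIE’s actual implementation is C working on packed machine words via M4RI). Elements $a \cdot x + b$ are encoded as integers $2a + b$, so field addition is XOR on the encoding:

```python
# Illustrative sketch of table-based elimination over GF(2^2), not M4RIE code.
# Elements a*x + b of GF(2^2) are encoded as 2*a + b; addition is XOR.

def gf4_mul(a, b):
    """Multiply two GF(2^2) elements modulo x^2 + x + 1."""
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    c1 = (a1 & b1) ^ (a1 & b0) ^ (a0 & b1)  # coefficient of x
    c0 = (a0 & b0) ^ (a1 & b1)              # constant term (x^2 = x + 1)
    return 2 * c1 + c0

GF4_INV = {1: 1, 2: 3, 3: 2}  # multiplicative inverses in GF(2^2)

def reduced_echelon_form(M):
    """Reduced row echelon form over GF(2^2) using pivot-row tables."""
    rows, cols, r = len(M), len(M[0]), 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = GF4_INV[M[r][c]]
        M[r] = [gf4_mul(inv, a) for a in M[r]]  # rescale pivot row
        # Precompute all 2^2 multiples of the pivot row once ...
        table = [[gf4_mul(s, a) for a in M[r]] for s in range(4)]
        # ... then eliminate every other row with one lookup + one XOR pass.
        for i in range(rows):
            if i != r and M[i][c]:
                M[i] = [a ^ t for a, t in zip(M[i], table[M[i][c]])]
        r += 1
    return M
```

The inner loop never multiplies: each row elimination is a table lookup followed by a row addition (XOR), which is exactly the part M4RI already does fast.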

We haven’t implemented any asymptotically fast methods yet, but I hope to finish Strassen multiplication this week. Asymptotically fast elimination requires slightly more infrastructure (TRSM, taking care of offsets which are not zero mod 64, etc.). If anybody is interested in helping out, that would be greatly appreciated. Also, I guess one should try M4RI (the algorithm) for GF(2^2).

**Some first results**


In these plots red means that Magma is faster by a factor of e^abs(red) and blue means that our code is faster by a factor of e^abs(blue). More details can be found on the SD24 wiki page.

To get a sense of absolute numbers, here are some preliminary benchmarks for row echelon forms of dense random n x n matrices over GF(2^4) (all times in seconds):

| n    | Sage 4.5 | NTL/2  | Magma | M4RIE (old) | M4RIE (new) |
|------|----------|--------|-------|-------------|-------------|
| 1000 | 49.49    | 18.84  | 0.090 | 0.097       | 0.046       |
| 2000 | 429.05   | 149.11 | 0.510 | 0.529       | 0.28        |
| 3000 | 1494.33  | 526.57 | 1.640 | 2.315       | 1.00        |

Bottom line: it is much, much faster (> 1,000x) than what we have in Sage right now (which is the generic implementation). Also note that switching to NTL would not improve things much.

I should point out that the strategy for multiplication in Tom and Robert’s paper *Bitslicing and the Method of Four Russians Over Larger Finite Fields* is likely to be better. Judging from the timings in that paper, we are about a factor of two behind them. I plan to implement/port their very cool trick for finite extension fields at some point in the future. The trick is limited to multiplication as far as I understand it, thus it might still make sense to consider my representation for the elimination base case etc.