GAP for dense linear algebra over GF(2^e)

For some strange reason it hardly ever occurs to me to benchmark GAP when it comes to linear algebra over GF(2^e). Turns out, that’s a really bad idea since it seems to be the best open-source implementation for such things. For example, computing the reduced row echelon form of a 4,000 x 4,000 random matrix over \mathbb{F}_{16} takes 16.94s in GAP 4.4.12, but 3267s in Sage and 1005s in NTL 5.4.2. So it is much, much faster than the other established open-source projects out there.
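
For reference, the Sage side of this comparison boils down to something like the following (a rough sketch of the benchmark rather than the exact script, and timings obviously depend on the machine):

    # Sage
    K = GF(16, 'a')
    A = random_matrix(K, 4000, 4000)
    t = cputime()
    E = A.echelon_form()      # reduced row echelon form
    print(cputime(t))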

I hope libgap becomes a reality at some point in the future.

PS: For comparison, Magma does the same task in 4.67s and M4RIE in 0.71s.

Asymptotically fast Gaussian elimination for M4RIE

I guess one perk of being on vacation is that one can get more work done. Hence, I finished implementing PLE decomposition (formerly known as PLS decomposition) and thus asymptotically fast Gaussian elimination. The implementation uses two matrix representations: mzd_slice_t and mzed_t. The former is optimised for implementing Karatsuba multiplication (cf. here) and the latter is optimised for using Travolta tables (cf. here). That is, multiplication is based on Karatsuba, while the PLE base case is based on Travolta tables. The same applies to TRSM, where the base case is also implemented using Travolta tables.

There is still a lot to be done:

  • Both the TRSM and PLE base cases use only one Travolta table, while Gaussian elimination and multiplication use 6 and 8 in parallel, respectively.
  • There is way too much copying going on. For example, I was lazy and implemented TRSM upper left for matrices which do not start at offset 0 with respect to a machine word by simply copying the whole matrix out. Hence, we’re wasting memory where we shouldn’t.
  • PLE isn’t cache efficient yet, and I assume that the code isn’t very good for sparse-ish matrices (cf. the journey in M4RI improving this).
Still, the results are quite nice already. Below, the left-hand side plots the wall time of computing the reduced row echelon form of m \times (m + 1000) random matrices over \mathbb{F}_{2^e}. The right-hand side expresses efficiency by dividing the running time by 2mnr^{\omega-2}, i.e., the number of operations (fingers crossed I didn’t screw up the constant). The fact that we’re not there yet is clearly expressed by the poor performance of the asymptotically fast code (“PLE”) for small dimensions (< 3,000).
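
(For the curious, the normalisation used for the right-hand side is simply the following; \omega \approx 2.807, i.e., Strassen-like multiplication, is an assumption on my part.)

    def efficiency(wall_time, m, n, r, omega=2.807):
        # divide the measured wall time by 2*m*n*r^(omega-2),
        # i.e., (roughly) the number of field operations
        return wall_time / (2.0 * m * n * r**(omega - 2))
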
Finally, it should be mentioned that the green plot (“Travolta”) is already considerably faster than Magma, which – as far as I know – is the fastest other implementation of this kind of operation over these fields. Below, Travolta-based elimination is compared with Magma’s implementation, where blueness b expresses a speed-up factor of 2^b (redness the other way around). The code is on bitbucket.

Karatsuba Multiplication of Dense Matrices over GF(2^e)

Elements in GF(2^e) can be represented as c_0 a^0 + c_1 a^1 + \dots + c_{e-1} a^{e-1}, where a is a root of the primitive polynomial of GF(2^e) and the c_i are elements of GF(2). Thus, we can rewrite a matrix over GF(2^e) as a list of e matrices over GF(2) which contain the coefficients for each degree of a. To put it another way: we can turn the matrix with polynomial entries into a polynomial with matrix coefficients. In Bitslicing and the Method of Four Russians Over Larger Finite Fields, Tomas J. Boothby and Robert Bradshaw (both Sage developers from Seattle, by the way) proposed using this representation to perform matrix multiplication.
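
As a toy illustration (in Sage, and nothing to do with M4RIE’s internal data structures), the following slices a matrix over GF(2^2) into its two coefficient matrices over GF(2) and puts it back together:

    # Sage: rewrite a matrix over GF(2^2) as two matrices over GF(2)
    K = GF(4, 'a'); a = K.gen()
    A = random_matrix(K, 4, 4)

    # A0 holds the coefficients of a, A1 the constant coefficients
    A0 = matrix(GF(2), 4, 4, [c.polynomial()[1] for c in A.list()])
    A1 = matrix(GF(2), 4, 4, [c.polynomial()[0] for c in A.list()])

    # the matrix with polynomial entries equals the polynomial with matrix coefficients
    assert A == A0.change_ring(K)*a + A1.change_ring(K)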


As an example, consider GF(2^2) with the primitive polynomial x^2 + x + 1. Suppose we want to compute C = AB. The matrix A can be represented as A_0x + A_1 and the matrix B as B_0x + B_1. Their product is C = A_0B_0x^2 + (A_0B_1 + A_1B_0)x + A_1B_1.

Finally, reduction modulo x^2 + x + 1 gives

(A_0B_0 + A_0B_1 + A_1B_0)x + A_1B_1 + A_0B_0.

This product thus costs four multiplications over GF(2) and three additions over GF(2).

Optionally, this last expression can be rewritten as

((A_0 + A_1)(B_0 + B_1) + A_1B_1)x + A_1B_1 + A_0B_0

using the divide-and-conquer trick Karatsuba introduced in the 1960s. Thus, this multiplication costs three matrix multiplications over GF(2) and four matrix additions over GF(2).
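
Here is a quick Sage sketch of this Karatsuba-style product for GF(2^2), checked against plain matrix multiplication over the field (the slicing is the naive one from above, nothing clever):

    # Sage: multiply matrices over GF(2^2) using three multiplications over GF(2)
    K = GF(4, 'a'); a = K.gen()   # a is a root of x^2 + x + 1
    n = 50
    A = random_matrix(K, n, n); B = random_matrix(K, n, n)

    A0 = matrix(GF(2), n, n, [c.polynomial()[1] for c in A.list()])
    A1 = matrix(GF(2), n, n, [c.polynomial()[0] for c in A.list()])
    B0 = matrix(GF(2), n, n, [c.polynomial()[1] for c in B.list()])
    B1 = matrix(GF(2), n, n, [c.polynomial()[0] for c in B.list()])

    P0 = A0*B0                    # three multiplications over GF(2) ...
    P1 = A1*B1
    P2 = (A0 + A1)*(B0 + B1)

    # ... and four additions, after reduction modulo x^2 + x + 1
    C = (P2 + P1).change_ring(K)*a + (P1 + P0).change_ring(K)
    assert C == A*B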

To see why this strategy for matrix multiplication might be efficient, consider the following table which shows

  1. the CPU time for multiplying dense 4,000 x 4,000 matrices over GF(2^e),
  2. to how many matrix multiplications over GF(2) this corresponds,
  3. how many GF(2) multiplications naive polynomial multiplication would need and
  4. how many Karatsuba needs.
|| e  || CPU time (s) || GF(2) multiplications || naive || Karatsuba ||
|| 2  ||  0.664 ||   9.222 ||   4 ||  3 ||
|| 3  ||  1.084 ||  16.937 ||   9 ||    ||
|| 4  ||  1.288 ||  17.889 ||  16 ||  9 ||
|| 5  ||  3.456 ||  50.824 ||  25 ||    ||
|| 6  ||  4.396 ||  64.647 ||  36 ||    ||
|| 7  ||  6.080 ||  84.444 ||  49 ||    ||
|| 8  || 11.117 || 163.471 ||  64 || 27 ||
|| 9  || 37.842 || 556.473 ||  81 ||    ||
|| 10 || 47.143 || 620.261 || 100 ||    ||
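
The last two columns are just e^2 versus the cost of applying Karatsuba recursively; a small sanity check (assuming e is a power of two, which matches the 3, 9, 27 entries above):

    def naive_count(e):
        # schoolbook polynomial multiplication: e^2 multiplications over GF(2)
        return e*e

    def karatsuba_count(e):
        # Karatsuba applied recursively; assumes e is a power of two
        return 1 if e == 1 else 3*karatsuba_count(e//2)

    for e in (2, 4, 8):
        print(e, naive_count(e), karatsuba_count(e))   # (2, 4, 3), (4, 16, 9), (8, 64, 27)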

All times were taken on the Intel i7 powering my Macbook Pro, using the M4RI and M4RIE libraries for GF(2) and GF(2^e), respectively.

So much for the theoretical complexity. Yesterday, I implemented this multiplication routine in M4RIE for GF(2^2). The tables below give the CPU time for multiplication of dense n \times n matrices over GF(2^2)

  1. for the old M4RIE Travolta + Strassen code,
  2. the new Karatsuba based code and
  3. how long three matrix multiplications over GF(2) take.

The Karatsuba timings include the time needed for converting back and forth between the M4RIE matrix layout and the bitsliced representation needed for Karatsuba. The first table is again for my Macbook Pro’s Intel i7 clocked at 2.6GHz:

|| n     || Strassen + Travolta (s) || Karatsuba (s) || 3 * GF(2) (s) ||
|| 1000  || 0.012 || 0.012 || 0.012 ||
|| 2000  || 0.068 || 0.052 || 0.024 ||
|| 3000  || 0.224 || 0.136 || 0.096 ||
|| 4000  || 0.648 || 0.280 || 0.336 ||
|| 5000  || 1.144 || 0.520 || 0.432 ||
|| 6000  || 1.952 || 0.984 || 1.008 ||
|| 7000  || 3.272 || 1.444 || 1.632 ||
|| 8000  || 4.976 || 2.076 || 2.484 ||
|| 9000  || 6.444 || 2.784 || 2.628 ||
|| 10000 || 8.761 || 3.668 || 3.528 ||

The second table is for an old AMD Opteron 885 clocked at 2.6GHz:

|| n     || Strassen + Travolta (s) || Karatsuba (s) || 3 * GF(2) (s) ||
|| 1000  ||  0.100 || 0.036 || 0.024 ||
|| 2000  ||  0.724 || 0.108 || 0.084 ||
|| 3000  ||  2.076 || 0.336 || 0.264 ||
|| 4000  ||  4.996 || 0.736 || 0.636 ||
|| 5000  ||  9.509 || 1.360 || 1.116 ||
|| 6000  || 14.989 || 2.416 || 2.340 ||
|| 7000  || 22.985 || 3.276 || 3.204 ||
|| 8000  || 35.302 || 4.788 || 4.692 ||
|| 9000  || 47.695 || 6.304 || 5.892 ||
|| 10000 || 64.712 || 9.101 || 7.980 ||

A third table for a new Opteron 8439 SE (redhaw):

|| n     || Strassen + Travolta (s) || Karatsuba (s) || 3 * GF(2) (s) ||
|| 1000  ||  0.020 || 0.020 || 0.060 ||
|| 2000  ||  0.140 || 0.070 || 0.030 ||
|| 3000  ||  0.470 || 0.200 || 0.150 ||
|| 4000  ||  1.120 || 0.480 || 0.390 ||
|| 5000  ||  2.090 || 0.870 || 0.690 ||
|| 6000  ||  3.490 || 1.500 || 1.260 ||
|| 7000  ||  5.440 || 2.270 || 1.950 ||
|| 8000  ||  8.050 || 3.230 || 2.850 ||
|| 9000  || 10.710 || 4.560 || 4.140 ||
|| 10000 || 14.580 || 5.770 || 5.190 ||

Ignoring measurement imprecisions (especially in the first table), it is clear that this new approach is much faster than the old one implemented in M4RIE, especially on the Opteron. However, it seems that on the Opteron our conversion between M4RIE and Karatsuba has a considerable cost; input on how to improve that would be very welcome since I’m out of ideas for now. To compare with Magma: for the 10,000 \times 10,000 case Magma takes 9.18 seconds on the i7 and 11.8 seconds on the Opteron 885. I didn’t compare with Tom and Robert’s implementation, but I expect them to be essentially at three matrix multiplications over GF(2) since they have no conversion overhead.

I should point out, though, that the code for GF(2^2) is the least efficient in M4RIE since I only implemented 8 parallel Travolta tables, which means that over GF(2^2) only 8 \cdot 2 = 16 bits are dealt with at each step in the inner loop. We could use more tables to fill up L1 and we could also implement Kronrod’s method (aka M4RM) for GF(2^2).

While I expect that we could catch up to Karatsuba at least on Intel CPUs over GF(2^2), I assume that the Karatsuba timings are close to optimal: matrix multiplication in M4RI seems to be close to optimal (at least without further tricks being discovered), and three multiplications seems to be the best one can do for extension degree two.

I guess I’ll implement Karatsuba for GF(2^3) and GF(2^4), but I’m not sure I can be bothered to do it for bigger fields unless I figure out a nice generic way of implementing it where I don’t have to write code for each minimal polynomial etc.

M4RIE new beta releases

I just uploaded new releases of M4RI and M4RIE to the M4RI website. Not much changed except that I implemented Strassen multiplication in M4RIE. The results are visualised below, where blue means M4RIE wins and red means Magma wins. The first two plots are for a Xeon (eno at UW) and the last two for an Opteron (prai243 at RHUL).

I’ve also updated the Sage package http://trac.sagemath.org/sage_trac/ticket/9562 such that it is easier to play around with the new code.

M4RIE

My coding sprint project here at Sage Days 24 was/is to write a new library for dense linear algebra over small extensions of GF(2).

The main idea is to represent matrices over GF(2^e) internally as M4RI matrices over GF(2) in a row-major flat representation. This allows us to re-use many of the M4RI functions, such as addition of rows and matrices, stacking, augmenting, etc.

For the elimination and multiplication base cases we use a simple trick inspired by M4RI: assume we found a pivot; now we need to rescale it and then add the right multiple of it to every row with a nonzero entry in the same column. If we have an n x n matrix, we expect to perform n^2 multiplications and n^2 additions per pivot. However, if the field is small, we can precompute all possible multiples of our pivot row and store them in a table. Then we use this table as a lookup to perform the elimination of the other rows. This costs 2^e * n multiplications and n^2 additions per pivot. Note that M4RI has efficient functions to perform the lookup + elimination part.
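
To make the trick concrete, here is a small pure-Python sketch (names like gf_mul and eliminate_column are made up for this post, this is not M4RIE code), with elements of GF(2^e) encoded as integers below 2^e, so that addition is just XOR:

    def gf_mul(x, y, e, modulus):
        """Multiply x and y in GF(2^e) defined by the given modulus polynomial,
        encoded as an integer (e.g. 0b111 for x^2 + x + 1)."""
        r = 0
        while y:
            if y & 1:
                r ^= x
            y >>= 1
            x <<= 1
            if x >> e:            # reduce once the degree reaches e
                x ^= modulus
        return r

    def eliminate_column(rows, pivot_idx, col, e, modulus):
        """Add the right multiple of rows[pivot_idx] to every other row so that
        column col becomes zero outside the pivot row (pivot entry assumed to be 1)."""
        pivot = rows[pivot_idx]
        # precompute all 2^e multiples of the pivot row once (2^e * n multiplications) ...
        table = [[gf_mul(m, x, e, modulus) for x in pivot] for m in range(2**e)]
        # ... then eliminating a row is a single table lookup plus one row addition (XOR)
        for i, row in enumerate(rows):
            if i == pivot_idx:
                continue
            multiple = table[row[col]]
            rows[i] = [a ^ b for a, b in zip(row, multiple)]

    # tiny example over GF(2^2) with modulus x^2 + x + 1
    rows = [[1, 2, 3],
            [2, 1, 1],
            [3, 3, 2]]
    eliminate_column(rows, 0, 0, 2, 0b111)
    print(rows)   # -> [[1, 2, 3], [0, 2, 0], [0, 2, 0]]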

We haven’t implemented any asymptotically fast methods yet, but I hope to finish Strassen multiplication this week. Asymptotically fast elimination requires slightly more infrastructure (TRSM, taking care of offsets which are not zero modulo 64, etc.). If anybody is interested in helping out, that would be greatly appreciated. Also, I guess one should also try M4RI (the algorithm) for GF(2^2).

Some first results

In these plots red means that Magma is faster by a factor of e^abs(red) and blue means that our code is faster by a factor of e^abs(blue). More details can be found on the SD24 wiki page.

To get a sense for absolute numbers, here are some preliminary benchmarks for row echelon forms of dense random n x n matrices over GF(2^4):

|| n    || Sage 4.5 || NTL/2 || Magma || M4RIE (old)|| M4RIE (new) ||
|| 1000 ||    49.49 || 18.84 || 0.090 || 0.097      || 0.046       ||
|| 2000 ||   429.05 || 149.11|| 0.510 || 0.529      || 0.28        ||
|| 3000 ||  1494.33 ||526.57 || 1.640 || 2.315      || 1.00        ||

Bottom line: it is much, much faster (> 1,000x) than what we have in Sage right now (which is the generic implementation). Also note that switching to NTL would not improve things much.

I should point out that the strategy for multiplication in Tom and Robert’s paper Bitslicing and the Method of Four Russians Over Larger Finite Fields is likely to be better. Judging from the timings in that paper, we are about a factor of two behind them. I plan to implement/port their very cool trick for finite extension fields at some point in the future. The trick is limited to multiplication as far as I understand it, thus it might still make sense to consider my representation for the elimination base case etc.