Karatsuba Multiplication of Dense Matrices over GF(2^e)

Elements in GF(2^e) can be represented as c_0 a^0 + c_1 a^1 + \dots + c_{e-1} a^{e-1}, where a is a root of the primitive polynomial used to construct GF(2^e) and the c_i are elements of GF(2). Thus, we can rewrite a matrix over GF(2^e) as a list of e matrices over GF(2) which contain the coefficients for each degree of a. To put it another way: we can turn the matrix with polynomial entries into a polynomial with matrix coefficients. In "Bitslicing and the Method of Four Russians Over Larger Finite Fields", Tomas J. Boothby and Robert Bradshaw (both Sage developers from Seattle, by the way) proposed to use this representation to perform matrix multiplication.
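
For instance, over GF(2^2) we have

$\begin{pmatrix} a & 1 \\ a+1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix} a + \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$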


As an example, consider GF(2^2) with the primitive polynomial x^2 + x + 1. Suppose we want to compute C = AB. The matrix A can be represented as A_0x + A_1 and the matrix B as B_0x + B_1. Their product is $C = A_0B_0x^2 + (A_0B_1 + A_1B_0)x + A_1B_1.$

Finally, reduction modulo x^2 + x + 1, i.e. substituting x^2 = x + 1, gives

(A_0B_0 + A_0B_1 + A_1B_0)x + A_1B_1 + A_0B_0.

Computing the product this way thus costs four matrix multiplications and three matrix additions over GF(2).
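
To make this concrete, here is a rough sketch of this schoolbook version in terms of M4RI's matrices over GF(2) (mzd_mul computes a product, mzd_addmul accumulates one, mzd_add adds). This is only meant to illustrate the idea; the actual M4RIE routine additionally has to deal with M4RIE's own packed matrix layout.

```c
#include <m4ri/m4ri.h>

/* C = C0*x + C1 = A*B over GF(2^2), with A = A0*x + A1 and B = B0*x + B1,
 * all coefficient matrices being mzd_t matrices over GF(2).
 * Schoolbook version: four multiplications, three additions, with the
 * reduction modulo x^2 + x + 1 folded into the additions.               */
void mul_gf4_schoolbook(mzd_t *C0, mzd_t *C1,
                        const mzd_t *A0, const mzd_t *A1,
                        const mzd_t *B0, const mzd_t *B1) {
  mzd_t *T = mzd_init(A0->nrows, B0->ncols);

  mzd_mul(T,  A0, B0, 0);     /* T  = A0*B0                 */
  mzd_mul(C0, A0, B1, 0);     /* C0 = A0*B1                 */
  mzd_addmul(C0, A1, B0, 0);  /* C0 = A0*B1 + A1*B0         */
  mzd_add(C0, C0, T);         /* C0 = A0*B0 + A0*B1 + A1*B0 */
  mzd_mul(C1, A1, B1, 0);     /* C1 = A1*B1                 */
  mzd_add(C1, C1, T);         /* C1 = A1*B1 + A0*B0         */

  mzd_free(T);
}
```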

Optionally, the reduced expression above can be rewritten as

((A_0 + A_1)(B_0 + B_1) + A_1B_1)x + A_1B_1 + A_0B_0

using the divide-and-conquer trick Karatsuba introduced in the 60s. This way the multiplication costs three matrix multiplications and four matrix additions over GF(2).
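
The corresponding Karatsuba sketch, with the same caveats as above:

```c
#include <m4ri/m4ri.h>

/* Same interface as the schoolbook sketch, but with Karatsuba:
 * three multiplications, four additions.                       */
void mul_gf4_karatsuba(mzd_t *C0, mzd_t *C1,
                       const mzd_t *A0, const mzd_t *A1,
                       const mzd_t *B0, const mzd_t *B1) {
  mzd_t *SA = mzd_init(A0->nrows, A0->ncols);
  mzd_t *SB = mzd_init(B0->nrows, B0->ncols);
  mzd_t *T  = mzd_init(A0->nrows, B0->ncols);

  mzd_add(SA, A0, A1);        /* SA = A0 + A1                  */
  mzd_add(SB, B0, B1);        /* SB = B0 + B1                  */
  mzd_mul(C0, SA, SB, 0);     /* C0 = (A0 + A1)*(B0 + B1)      */
  mzd_mul(T,  A1, B1, 0);     /* T  = A1*B1                    */
  mzd_add(C0, C0, T);         /* C0 = A0*B0 + A0*B1 + A1*B0    */
  mzd_mul(C1, A0, B0, 0);     /* C1 = A0*B0                    */
  mzd_add(C1, C1, T);         /* C1 = A1*B1 + A0*B0            */

  mzd_free(SA); mzd_free(SB); mzd_free(T);
}
```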

To see why this strategy for matrix multiplication might be efficient, consider the following table, which lists

  1. the CPU time for multiplying dense 4,000 \times 4,000 matrices over GF(2^e),
  2. to how many matrix multiplications over GF(2) this corresponds,
  3. how many GF(2) matrix multiplications naive polynomial multiplication would need, and
  4. how many Karatsuba needs.
| e  | CPU time (s) | GF(2) factor | naive | Karatsuba |
|----|--------------|--------------|-------|-----------|
| 2  | 0.664        | 9.222        | 4     | 3         |
| 3  | 1.084        | 16.937       | 9     |           |
| 4  | 1.288        | 17.889       | 16    | 9         |
| 5  | 3.456        | 50.824       | 25    |           |
| 6  | 4.396        | 64.647       | 36    |           |
| 7  | 6.080        | 84.444       | 49    |           |
| 8  | 11.117       | 163.471      | 64    | 27        |
| 9  | 37.842       | 556.473      | 81    |           |
| 10 | 47.143       | 620.261      | 100   |           |
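
For e a power of two, the Karatsuba column is what one gets by applying the trick recursively to the polynomial multiplication: $3^{\log_2 e}$ multiplications over GF(2) instead of $e^2$, i.e. 3 vs. 4, 9 vs. 16 and 27 vs. 64 for e = 2, 4, 8.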

All timings were obtained on the Intel i7 in my MacBook Pro, using the M4RI and M4RIE libraries for GF(2) and GF(2^e) respectively.

So much for the theoretical complexity. Yesterday, I implemented this multiplication routine in M4RIE for GF(2^2). The tables below give the CPU time for multiplication of dense n \times n matrices over GF(2^2)

  1. for the old M4RIE Travolta + Strassen code,
  2. the new Karatsuba based code and
  3. how long three matrix multiplications over GF(2) take.

The Karatsuba column includes the time needed for converting back and forth between the M4RIE matrix layout and the bitsliced representation needed for Karatsuba. The first table (all times are CPU seconds) is again for my MacBook Pro's Intel i7 clocked at 2.6 GHz:

| n     | Strassen + Travolta | Karatsuba | 3 * GF(2) |
|-------|---------------------|-----------|-----------|
| 1000  | 0.012               | 0.012     | 0.012     |
| 2000  | 0.068               | 0.052     | 0.024     |
| 3000  | 0.224               | 0.136     | 0.096     |
| 4000  | 0.648               | 0.280     | 0.336     |
| 5000  | 1.144               | 0.520     | 0.432     |
| 6000  | 1.952               | 0.984     | 1.008     |
| 7000  | 3.272               | 1.444     | 1.632     |
| 8000  | 4.976               | 2.076     | 2.484     |
| 9000  | 6.444               | 2.784     | 2.628     |
| 10000 | 8.761               | 3.668     | 3.528     |

The second table is for an old AMD Opteron 885 clocked at 2.6 GHz:

| n     | Strassen + Travolta | Karatsuba | 3 * GF(2) |
|-------|---------------------|-----------|-----------|
| 1000  | 0.100               | 0.036     | 0.024     |
| 2000  | 0.724               | 0.108     | 0.084     |
| 3000  | 2.076               | 0.336     | 0.264     |
| 4000  | 4.996               | 0.736     | 0.636     |
| 5000  | 9.509               | 1.360     | 1.116     |
| 6000  | 14.989              | 2.416     | 2.340     |
| 7000  | 22.985              | 3.276     | 3.204     |
| 8000  | 35.302              | 4.788     | 4.692     |
| 9000  | 47.695              | 6.304     | 5.892     |
| 10000 | 64.712              | 9.101     | 7.980     |

The third table is for a newer Opteron 8439 SE (redhaw):

| n     | Strassen + Travolta | Karatsuba | 3 * GF(2) |
|-------|---------------------|-----------|-----------|
| 1000  | 0.020               | 0.020     | 0.060     |
| 2000  | 0.140               | 0.070     | 0.030     |
| 3000  | 0.470               | 0.200     | 0.150     |
| 4000  | 1.120               | 0.480     | 0.390     |
| 5000  | 2.090               | 0.870     | 0.690     |
| 6000  | 3.490               | 1.500     | 1.260     |
| 7000  | 5.440               | 2.270     | 1.950     |
| 8000  | 8.050               | 3.230     | 2.850     |
| 9000  | 10.710              | 4.560     | 4.140     |
| 10000 | 14.580              | 5.770     | 5.190     |

Ignoring measurement noise (especially in the first table), it is clear that this new approach is much faster than the old one implemented in M4RIE, especially on the Opteron. However, on the Opteron the conversion between the M4RIE layout and the Karatsuba representation seems to carry a considerable cost; input on how to improve that would be very welcome, since I am out of ideas for now. To compare with Magma: for the 10,000 \times 10,000 case Magma takes 9.18 seconds on the i7 and 11.8 seconds on the Opteron 885. I didn't compare with Tom and Robert's implementation, but I expect it to sit essentially at three matrix multiplications over GF(2) since they have no conversion overhead.

I should point out though that the code for GF(2^2) is the least optimised in M4RIE, since I only implemented 8 parallel Travolta tables, which means that over GF(2^2) only 8 \cdot 2 = 16 bits are dealt with in each step of the inner loop. We could use more tables to fill up the L1 cache, and we could also implement Kronrod's method (aka M4RM) for GF(2^2).
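
For readers not familiar with these tables, here is a rough illustration of the idea in a toy, completely unpacked representation (one byte per GF(2^2) entry; the names and layout below are made up for this post and have nothing to do with the actual bit-packed M4RIE code): a Travolta table for a row b of B stores all four GF(2^2) multiples of b, so scalar multiplication becomes a table lookup, and with eight such tables one pass of the inner loop consumes eight entries of a row of A.

```c
#include <stdint.h>
#include <stddef.h>

#define NCOLS 1024  /* toy fixed row width */

/* GF(2^2) = {0, 1, a, a+1} encoded as 0, 1, 2, 3 with a^2 = a + 1;
 * addition is XOR of the two-bit codes.                            */
static const uint8_t gf4_mul[4][4] = {
  {0, 0, 0, 0},
  {0, 1, 2, 3},
  {0, 2, 3, 1},
  {0, 3, 1, 2},
};

/* Travolta-style table for one row b of B: all four scalar multiples
 * of b, so that v*b is just the lookup T.multiple[v].               */
typedef struct { uint8_t multiple[4][NCOLS]; } gf4_row_table;

void build_table(gf4_row_table *T, const uint8_t b[NCOLS]) {
  for (int v = 0; v < 4; v++)
    for (size_t j = 0; j < NCOLS; j++)
      T->multiple[v][j] = gf4_mul[v][b[j]];
}

/* One step of the inner loop with 8 parallel tables T[0..7], built from
 * rows k..k+7 of B: consumes 8 entries (8*2 = 16 bits) of the current
 * row of A and accumulates into the corresponding row of C.            */
void addmul_step(uint8_t c_row[NCOLS], const uint8_t *a_row, size_t k,
                 const gf4_row_table T[8]) {
  for (int t = 0; t < 8; t++) {
    const uint8_t *m = T[t].multiple[a_row[k + t]];
    for (size_t j = 0; j < NCOLS; j++)
      c_row[j] ^= m[j];  /* addition over GF(2^2) is XOR */
  }
}
```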

While I expect that we could catch up with Karatsuba at least on Intel CPUs over GF(2^2), I assume that the Karatsuba timings are close to optimal: matrix multiplication in M4RI seems close to optimal (at least until further tricks are discovered), and three multiplications seems to be the best one can do when multiplying polynomials with two coefficients each.

I guess I’ll implement Karatsuba for GF(2^3) and GF(2^4), but I’m not sure I can be bothered to do it for bigger fields unless I figure out a nice generic way of implementing it where I don’t have to write code for each minimal polynomial etc.
