Elements in can be represented as where is a root of the primitive polynomial of and are elements in . Thus, we can rewrite a matrix over as a list of matrices over which contain the coefficients for each degree of . To put it in another way: we can turn the matrix with polynomial entries into a polynomial with matrix coefficients. In Bitslicing and the Method of Four Russians Over Larger Finite Fields Tomas J. Boothby and Robert Bradshaw (both Sage developers from Seattle by the way) proposed to use this representation to perform matrix multiplication.

#+HTML <!–more–>

As an example consider with the primitive polynomial . Suppose we want to compute . The matrix can be represented as and the matrix can be represented as . Their product is $C = A_0B_0x^2 + (A_0B_1 + A_1B_0)x + A_1B_1.$

Finally, reduction modulo gives

.

This product thus costs 4 multiplications over and three additions over .

Optionally, this last expression can be rewritten as

using the divide and conquer trick Karatsuba introduced in the 60s. Thus this multiplication costs 3 matrix multiplications over and 4 matrix additions over .

To see why this strategy for matrix multiplication might be efficient, consider the following table which shows

- the CPU time for multiplying dense 4,000 x 4,000 matrices over ,
- to how many matrix multiplications over this corresponds,
- how many GF(2) multiplications naive polynomial multiplication would need and
- how many Karatsuba needs.

e |
CPU time |
GF(2) factor |
naive |
Karatsuba |

2 | 0.664 | 9.222 | 4 | 3 |

3 | 1.084 | 16.937 | 9 | |

4 | 1.288 | 17.889 | 16 | 9 |

5 | 3.456 | 50.824 | 25 | |

6 | 4.396 | 64.647 | 36 | |

7 | 6.080 | 84.444 | 49 | |

8 | 11.117 | 163.471 | 64 | 27 |

9 | 37.842 | 556.473 | 81 | |

10 | 47.143 | 620.261 | 100 |

All times were taken on my *Intel i7* powering my Macbook Pro using the M4RI and M4RIE libraries for and respectively.

So much for the theoretical complexity. Yesterday, I implemented this multiplication routine in M4RIE for . The tables below give the CPU time for multiplication of dense matrices over

- for the old M4RIE Travolta + Strassen code,
- the new Karatsuba based code and
- how long three matrix multiplications over take.

The second column includes the time needed for converting back and forth between the M4RIE matrix layout and the bitsliced representation needed for Karatsuba. The first table is again for my Macbook Pro’s Intel i7 clocked at 2.6Ghz:

n |
Strassen + Travolta |
Karatsuba |
3 * GF(2) CPU time |

1000 | 0.012 | 0.012 | 0.012 |

2000 | 0.068 | 0.052 | 0.024 |

3000 | 0.224 | 0.136 | 0.096 |

4000 | 0.648 | 0.280 | 0.336 |

5000 | 1.144 | 0.520 | 0.432 |

6000 | 1.952 | 0.984 | 1.008 |

7000 | 3.272 | 1.444 | 1.632 |

8000 | 4.976 | 2.076 | 2.484 |

9000 | 6.444 | 2.784 | 2.628 |

10000 | 8.761 | 3.668 | 3.528 |

The second table is for an old AMD Opteron 885 clocked at 2.6Ghz :

n |
Strassen + Travolta |
Karatsuba |
3 * GF(2) CPU time |

1000 | 0.100 | 0.036 | 0.024 |

2000 | 0.724 | 0.108 | 0.084 |

3000 | 2.076 | 0.336 | 0.264 |

4000 | 4.996 | 0.736 | 0.636 |

5000 | 9.509 | 1.360 | 1.116 |

6000 | 14.989 | 2.416 | 2.340 |

7000 | 22.985 | 3.276 | 3.204 |

8000 | 35.302 | 4.788 | 4.692 |

9000 | 47.695 | 6.304 | 5.892 |

10000 | 64.712 | 9.101 | 7.980 |

A third table for a new Opteron 8439 SE (redhaw):

n |
Strassen + Travolta |
Karatsuba |
3 * GF(2) CPU time |

1000 | 0.020 | 0.020 | 0.060 |

2000 | 0.140 | 0.070 | 0.030 |

3000 | 0.470 | 0.200 | 0.150 |

4000 | 1.120 | 0.480 | 0.390 |

5000 | 2.090 | 0.870 | 0.690 |

6000 | 3.490 | 1.500 | 1.260 |

7000 | 5.440 | 2.270 | 1.950 |

8000 | 8.050 | 3.230 | 2.850 |

9000 | 10.710 | 4.560 | 4.140 |

10000 | 14.580 | 5.770 | 5.190 |

Ignoring measurement imprecisions (especially in the first table) it is clear that this new approach is much faster than the old one implemented in M4RIE especially on the Opteron. However, it seems on the Opteron our conversion between M4RIE and Karatsuba has a considerable cost, input how to improve that would be very welcome since I’m out of ideas for now. To compare with Magma: for the case Magma takes 9.18 seconds on the i7 and 11.8 seconds on the Opteron 858. I didn’t compare with Tom and Robert’s implementation but I expect them to essentially be at 3 matrix multiplications over since they have no conversion overhead.

I should point out though that the code for is the least efficient in M4RIE since I only implemented 8 parallel Travolta table which means that over only bits are dealt with at each step in the inner loop. We could use more tables to fill up L1 and we could also implement Kronrod’s method aka M4RM for .

While I expect that we could catch up to Karatsuba at least on Intel CPUs over , I assume that the Karatsuba timings are close to optimal since matrix multiplication in M4RI seems to be close to optimal at least without further tricks being discovered and 3 matrix multiplications seems to be the best one can do for degree two polynomials.

I guess I’ll implement Karatsuba for and , but I’m not sure I can be asked to do it for bigger fields if I don’t figure out a nice generic way of implementing it where I don’t have to write code for each minimal polynomial etc.