Matrix Multiplication over GF(p^e)

After my talk at Sage Days 35 in Warwick (that was in winter 2011) David Harvey had an idea on how to speed up matrix multiplication over \mathbb{F}_{p^n}. We spend some time on this in Warwick and developed this idea further (adding fun stuff like Mixed Integer Programming in the process) but did not get around to do much on this project in the mean time (I have explained the idea at the end of my talk in Mykonos, though).

Just now, in a conversation with Richard Parker I was reminded of this dormant project, i.e., the question of how many multiplications i \mathbb{F}_p it takes to do a multiplication in \mathbb{F}_{p^n}. In particular, I recalled to have written some code for Sage which gives some upper bound to this answer which is better than Karatsuba.

Well, here’s an interactive demo … gosh, I love the Sage cell server.

M4RI 20121224

I have just pushed the button to release M4RI 20121224. The main feature of this release is a considerable performance improvement. It all started with Fast matrix decomposition in F2 by Enrico Bertolazzi and Anna Rimoldi showing up on the arXiv. Here’s the abstract

In this work an efficient algorithm to perform a block decomposition (and so to compute the rank) of large dense rectangular matrices with entries in F2 is presented. Depending on the way the matrix is stored, the operations acting on rows or block of consecutive columns (stored as one integer) should be preferred. In this paper, an algorithm that completely avoids the column permutations is given. In particular, a block decomposition is presented and its running times are compared with the ones adopted into SAGE.

… and that comparison made M4RI (which realises this functionality in Sage) look pretty bad. I did’t (and still don’t) share the implicit assumption that avoiding column swaps was the key ingredient in making this code so much faster than ours. I assume the impressive timings are due to a very efficient base case implementation. Anyway, we sat down  and looked for performance bottlenecks the result of which is 20121224. I actually have no idea whether we caught up to the code described in Enrico’s and Anna’s pre-print as they did not publish their sources.

Still, the performance improvements over 20120613 were worth the trouble. Below two plots of the (normalised) leading constants giving the leading constants for multiplication and elimination respectively (more plots on imgur) That is, it plots the running time divided by n^{2.807} \cdot 10^9. In theory these plots should all have slope 0.

Multiplication on Intel Core i7

PLE on Intel Core i7

Finally, here’s the plot for Fast matrix decomposition in F2 which starts very small but has a rather large slope. That’s why I concluded that the performance stems from a very efficient base case. I should get in touch with Enrio and Anna about this.

Linear Algebra for Gröbner Bases over GF(2): LELA

The Efficient Linear Algebra for Gröbner Basis Computations workshop in Kaiserslautern two weeks ago was a welcome opportunity to finally test out LELA, a library specifically written for linear algebra for Gröbner basis computations including for GF(2). The library implements the “Faugère-Lachartre” algorithm (a similar trick, though less developed, appeared before in PolyBoRi) and uses M4RI for dense parts over GF(2).

So, I ran my benchmark matrices through LELA, discovered a bug in the process, then Bradford returned the favour and discovered a bug in M4RI … Finally, below are the timings. The column PLE is the PLE algorithm as implemented in M4RI, M4RI is the M4RI algorithm as implemented in M4RI, GB is a very naive variant of the algorithm LELA uses and LELA is, well, LELA.

problem m n density PLE M4RI GB LELA
HFE 25 12307 13508 0.076 1.0 0.5 0.8 0.56
HFE 30 19907 29323 0.067 4.7 2.7 4.7 3.42
HFE 35 29969 55800 0.059 19.3 9.2 19.5 13.92
Mutant 26075 26407 0.184 5.7 3.9 2.1 12.07
n=24, m=26 37587 38483 0.038 20.6 21.0 19.3 7.72
n=24, m=26 37576 32288 0.040 18.6 28.4 17.0 4.09
SR(2,2,2,4) c 5640 14297 0.003 0.4 0.2 0.1 0.40
SR(2,2,2,4) c 13665 17394 0.013 2.1 3.0 2.0 1.78
SR(2,2,2,4) c 11606 16282 0.035 1.9 4.4 1.5 0.81
SR(2,2,2,4) 13067 17511 0.008 1.9 2.0 1.3 1.45
SR(2,2,2,4) 12058 16662 0.015 1.5 1.9 1.6 1.01
SR(2,2,2,4) 115834 118589 0.003 528.2 578.5 522.9 48.39

What this table means is that one can expect more than an order of magnitude of speed-up when using LELA – which is dedicated to these computations – instead of M4RI – which does not have the specialised algorithm implemented yet. For very small matrices sometimes M4RI/PLE win, but then not by a large margin. The only row where LELA doesn’t do so good is Mutant, which – btw. – is not an F4 matrix but comes from the MXL2 algorithm.  It is possible that LELA’s sparse data structures are not that well equipped to deal with this rather dense matrix.

I am in the process of implementing the algorithm LELA uses in M4RI and will report updated timings here.

Report: Workshop on Efficient Linear Algebra for Gröbner Basis Computations

As you may know, today is the last day of the wokshop on Efficient Linear Algebra for Gröbner Basis Computations that Christian Eder, Burcin Eröcal, Alexander Dreyer and I organised.

I have to say that I am quite pleased with how the workshop played out. We planned the whole thing to be hands on: people were strongly encouraged to work on projects, i.e., to write code preferably together, in addition to attending talks. Those who attended a Sage Days workshop in the past, will know what workshop format I am referring to. Continue reading

Rank-profile revealing Gaussian elimination and the CUP matrix decomposition

by Claude-Pierre Jeannerod, Clément Pernet, Arne Storjohann is now available on the archive. I like this paper a lot and we also referenced it in both the M4RI elimination paper and the M4RIE paper so three cheers that it’s now available.

Abstract: Transforming a matrix over a field to echelon form, or decomposing the matrix as a product of structured matrices that reveal the rank profile, is a fundamental building block of computational exact linear algebra. This paper surveys the well known variations of such decompositions and transformations that have been proposed in the literature. We present an algorithm to compute the CUP decomposition of a matrix, adapted from the LSP algorithm of Ibarra, Moran and Hui (1982), and show reductions from the other most common Gaussian elimination based matrix transformations and decompositions to the CUP decomposition. We discuss the advantages of the CUP algorithm over other existing algorithms by studying time and space complexities: the asymptotic time complexity is rank sensitive, and comparing the constants of the leading terms, the algorithms for computing matrix invariants based on the CUP decomposition are always at least as good except in one case. We also show that the CUP algorithm, as well as the computation of other invariants such as transformation to reduced column echelon form using the CUP algorithm, all work in place, allowing for example to compute the inverse of a matrix on the same storage as the input matrix.

http://arxiv.org/abs/1112.5717

Efficient dense Gaussian elimination over the field with two elements

Finally, we finished our paper about Gaussian elimination in the M4RI library.

Abstract: In this work we describe an efficient implementation of a hierarchy of algorithms for Gaussian elimination upon dense matrices over the field with two elements (\mathbb{F}_2). We discuss both well-known and new algorithms as well as our implementations in the M4RI library, which has been adopted into Sage. The focus of our discussion is a block iterative algorithm for PLE decomposition which is inspired by the M4RI algorithm. The implementation presented in this work provides considerable performance gains in practice when compared to the previously fastest implementation. We provide performance figures on x86_64 CPUs to demonstrate the alacrity of our approach.

The sources of this document are available on bitbucket. But I also compiled a PDF.

Update: arXiv link.

Challenge matrices

Now, that we have a decent PNG reader/writer in M4RI, it’s much easier to get some challenge matrices out of the library. Below, I list and link a few such matrices as they appear during Gröbner basis computations.

file matrix dimensions density PLE M4RI GB
HFE 25 matrix 5 (5.1M) 12307 x 13508 0.07600 1.03 0.59 0.81
HFE 30 matrix 5 (16M) 19907 x 29323 0.06731 4.79 2.70 4.76
HFE 35 matrix 5 (37M) 29969 x 55800 0.05949 19.33 9.28 19.51
Mutant matrix (39M) 26075 x 26407 0.18497 5.71 3.98 2.10
random n=24, m=26 matrix 3 (30M) 37587 x 38483 0.03832 20.69 21.08 19.36
random n=24_ m=26 matrix 4 (24M) 37576 x 32288 0.04073 18.65 28.44 17.05
SR(2,2,2,4) compressed, matrix 2 (328K) 5640 x 14297 0.00333 0.40 0.29 0.18
SR(2,2,2,4) compressed, matrix 4 (2.4M) 13665 x 17394 0.01376 2.18 3.04 2.04
SR(2,2,2,4) compressed, matrix 5 (2.8M) 11606 x 16282 0.03532 1.94 4.46 1.59
SR(2,2,2,4) matrix 6 (1.4M) 13067 x 17511 0.00892 1.90 2.09 1.38
SR(2,2,2,4) matrix 7 (1.7M) 12058 x 16662 0.01536 1.53 1.93 1.66
SR(2,2,2,4) matrix 9 (36M) 115834 x 118589 0.00376 528.21 578.54 522.98

The first three rows are from GB computations for the hidden field equations cryptosystem (those matrices were provided by Michael Brickenstein). The “mutant” row is a matrix as it appears during a run of the MXL2 algorithm on a random system (I believe). It was contributed by Wael Said. The rows “random n=25,m=26″ are matrices as they appear during a GB computation with PolyBoRi for a random system of equations in 24 variables and 26 equations. The remaining rows are matrices from PolyBoRi computations on small scale AES instances. Those rows which have “compressed” in their description correspond to systems where “linear variables” were eliminate before running the Gröbner basis algorithm.

The last three columns give running times (quite rough ones!) for computing an echelon form (not reduced) using (a) the M4RI algorithm, (b) PLE decomposition and (c) a first implementation of the TRSM for trivial pivots trick. As you can see, currently it’s not straight-forward to pick which strategy to use to eliminate matrices appearing during Gröbner basis computations: the best algorithm to pick varies between different problems and the differences can be dramatic.