I have just pushed the button to release M4RI 20121224. The main feature of this release is a considerable performance improvement. It all started with Fast matrix decomposition in F2 by Enrico Bertolazzi and Anna Rimoldi showing up on the arXiv. Here’s the abstract
In this work an efficient algorithm to perform a block decomposition (and so to compute the rank) of large dense rectangular matrices with entries in F2 is presented. Depending on the way the matrix is stored, the operations acting on rows or block of consecutive columns (stored as one integer) should be preferred. In this paper, an algorithm that completely avoids the column permutations is given. In particular, a block decomposition is presented and its running times are compared with the ones adopted into SAGE.
… and that comparison made M4RI (which realises this functionality in Sage) look pretty bad. I did’t (and still don’t) share the implicit assumption that avoiding column swaps was the key ingredient in making this code so much faster than ours. I assume the impressive timings are due to a very efficient base case implementation. Anyway, we sat down and looked for performance bottlenecks the result of which is 20121224. I actually have no idea whether we caught up to the code described in Enrico’s and Anna’s pre-print as they did not publish their sources.
Still, the performance improvements over 20120613 were worth the trouble. Below two plots of the (normalised) leading constants giving the leading constants for multiplication and elimination respectively (more plots on imgur) That is, it plots the running time divided by . In theory these plots should all have slope 0.
Multiplication on Intel Core i7
PLE on Intel Core i7
Finally, here’s the plot for Fast matrix decomposition in F2 which starts very small but has a rather large slope. That’s why I concluded that the performance stems from a very efficient base case. I should get in touch with Enrio and Anna about this.
by Claude-Pierre Jeannerod, Clément Pernet, Arne Storjohann is now available on the archive. I like this paper a lot and we also referenced it in both the M4RI elimination paper and the M4RIE paper so three cheers that it’s now available.
Abstract: Transforming a matrix over a field to echelon form, or decomposing the matrix as a product of structured matrices that reveal the rank profile, is a fundamental building block of computational exact linear algebra. This paper surveys the well known variations of such decompositions and transformations that have been proposed in the literature. We present an algorithm to compute the CUP decomposition of a matrix, adapted from the LSP algorithm of Ibarra, Moran and Hui (1982), and show reductions from the other most common Gaussian elimination based matrix transformations and decompositions to the CUP decomposition. We discuss the advantages of the CUP algorithm over other existing algorithms by studying time and space complexities: the asymptotic time complexity is rank sensitive, and comparing the constants of the leading terms, the algorithms for computing matrix invariants based on the CUP decomposition are always at least as good except in one case. We also show that the CUP algorithm, as well as the computation of other invariants such as transformation to reduced column echelon form using the CUP algorithm, all work in place, allowing for example to compute the inverse of a matrix on the same storage as the input matrix.
I guess one perk of being on vacation is that one can get more work done. Hence, I finished implementing PLE decomposition (formerly known as PLS decomposition) and hence asymptotically fast Gaussian elimination. The implementation uses two matrix representations: mzd_slice_t and mzed_t. The former is optimised for implementing Karatsuba multiplication (cf., here) and the latter is optimised for using Travolta tables (cf., here). That is, multiplication is based on Karatsuba while the PLE base case is based on Travolta tables. The same applies to TRSM where the base case is also implemented using Travolta tables.
There is still a lot to be done:
- Both TRSM and PLE base cases use only one Travolta table while Gaussian elimination and multiplication use 6 and 8 in parallel respectively.
- There way too much copying going on. For example, I was lazy and implemented TRSM upper left with respect to matrices which do not start at offset 0 with respect to a machine word by simply copying the whole matrix out. Hence, we’re wasting memory where we shouldn’t.
- PLE isn’t cache efficient yet and I assume that the code isn’t very good for sparse-ish matrices (cf. the journey in M4RI improving this)
Still, the results are quite nice already. Below, the left hand side plots the wall time of computing the reduced row echelon form of random matrices over . The right hand side expresses efficiency by dividing the running time by , i.e., the number of operations (fingers crossed I didn’t screw up the constant).The fact, that we’re not there yet, is clearly expressed by the poor performance of the asymptotically fast code (“PLE”) for small dimensions ().
Finally, it should be mentioned, that the green plot (“Travolta”) already is considerably faster than Magma, which – as far as I know – is the fastest other implementation of this kind of operation over these fields. Below, Travolta based elimination is compared with Magma’s implementation, where blueness expresses a speed-up factor of (redness the other way around).The code is on bitbucket.