I am writing this while waiting for my taxi to leave Sage Days 35. Although, I didn’t get much actual coding done, it was great fun and very useful. I met a lot of old friend, new faces and managed to put faces to e-mail addresses.
In terms of coding projects, first, I tried to speed up linear algebra mod p where p is a 32 or 64 bit prime. But it turns out that any trick I could think of could not improve on Frederik’s code. So that didn’t lead anywhere but I allowed me to read some code of FLINT2 (very readable) and admire how carefully it is written.
My other two projects both involved evaluate–pointwise-multiply–interpolate algorithms for fast matrix-matrix products over finite extension fields or for matrices with polynomial coefficients (over prime fields). After my talk on M4RI(E) David Harvey worked out how to improve multiplication over from 17 multiplications over to 15, which then lead to a general approach for with composite . Much of it remains to be implemented (efficiently), but the example indeed shows a 10% speed-up as expected. The code is not clean yet, uses way too much memory and doesn’t deal with the more advanced finite field stuff appropriately. It should end up in M4RIE eventually though.
I also contributed a bit to #12177 which is about a “prime slice” implementation of matrices over . The idea is essentially to represent these matrices as polynomials with matrix coefficients and to use fast polynomial multiplication algorithms for these polynomials. It turns out, this works very well even for small finite fields. Burcin Eröcal did all the coding, I only helped with some discussions. We need to polish the code a lot to be usable, so if you like matrices over head over to #12177 and help out.
“A Sage Days workshop around the theme of Algorithms in Number Theory and FLINT.”
See http://wiki.sagemath.org/SageFlintDays for more information and registration.
PS: I’ll be talking about M4RI(E) … big surprise.
I’ve been writing up the ideas that went into the M4RIE library for dense linear algebra over small extensions of . I think it is now in a state to be readable enough to up a PDF of the current draft online. Hence, here it is. While the paper does explain what we mean by “Travolta tables” it doesn’t explain why we call them that way … but the image below does:
I just finished with my talk at SIAM AG11 about the M4RI and the M4RIE libraries … here are the slides.
For some strange reason it hardly every occurs to me to benchmark GAP when it comes to linear algebra over GF(2^e). Turns out, that’s a really bad idea as it seems to be the best open-source implementation for such things. For example, computing the reduced row echelon form of a 4,000 x 4,000 random matrix over takes 16.94s in GAP 4.4.12 but 3267s in Sage and 1005s in NTL 5.4.2. So much much faster than other established open-source projects out there.
I hope libgap becomes a reality at some point in the future.
PS: For comparison, Magma does the same task in in 4.67s and M4RIE in 0.71s.
I guess one perk of being on vacation is that one can get more work done. Hence, I finished implementing PLE decomposition (formerly known as PLS decomposition) and hence asymptotically fast Gaussian elimination. The implementation uses two matrix representations: mzd_slice_t and mzed_t. The former is optimised for implementing Karatsuba multiplication (cf., here) and the latter is optimised for using Travolta tables (cf., here). That is, multiplication is based on Karatsuba while the PLE base case is based on Travolta tables. The same applies to TRSM where the base case is also implemented using Travolta tables.
There is still a lot to be done:
- Both TRSM and PLE base cases use only one Travolta table while Gaussian elimination and multiplication use 6 and 8 in parallel respectively.
- There way too much copying going on. For example, I was lazy and implemented TRSM upper left with respect to matrices which do not start at offset 0 with respect to a machine word by simply copying the whole matrix out. Hence, we’re wasting memory where we shouldn’t.
- PLE isn’t cache efficient yet and I assume that the code isn’t very good for sparse-ish matrices (cf. the journey in M4RI improving this)
Still, the results are quite nice already. Below, the left hand side plots the wall time of computing the reduced row echelon form of random matrices over . The right hand side expresses efficiency by dividing the running time by , i.e., the number of operations (fingers crossed I didn’t screw up the constant).The fact, that we’re not there yet, is clearly expressed by the poor performance of the asymptotically fast code (“PLE”) for small dimensions ().
Finally, it should be mentioned, that the green plot (“Travolta”) already is considerably faster than Magma, which – as far as I know – is the fastest other implementation of this kind of operation over these fields. Below, Travolta based elimination is compared with Magma’s implementation, where blueness expresses a speed-up factor of (redness the other way around).The code is on bitbucket.