## Sage/FLINT Days aka Sage Days 35

I am writing this while waiting for my taxi to leave Sage Days 35. Although, I didn’t get much actual coding done, it was great fun and very useful. I met a lot of old friend, new faces and managed to put faces to e-mail addresses.

In terms of coding projects, first, I tried to speed up linear algebra mod p where p is a 32 or 64 bit prime. But it turns out that any trick I could think of could not improve on Frederik’s code. So that didn’t lead anywhere but I allowed me to read some code of FLINT2 (very readable) and admire how carefully it is written.

My other two projects both involved evaluate–pointwise-multiply–interpolate algorithms for fast matrix-matrix products over finite extension fields or for matrices with polynomial coefficients (over prime fields).  After my talk on M4RI(E) David Harvey worked out how to improve multiplication over $\mathbb{F}_{2^6}$ from 17 multiplications over $\mathbb{F}_2$ to 15, which then lead to a general approach for $\mathbb{F}_{2^m}$ with composite $m$. Much of it remains to be implemented (efficiently), but the $\mathbb{F}_{2^6}$ example indeed shows a 10% speed-up as expected. The code is not clean yet, uses way too much memory and doesn’t deal with the more advanced finite field stuff appropriately. It should end up in M4RIE eventually though.

I also contributed a bit to #12177 which is about a “prime slice” implementation of matrices over $\mathbb{F}_{p^k}$. The idea is essentially to represent  these matrices as polynomials with matrix coefficients and to use fast polynomial multiplication algorithms for these polynomials. It turns out, this works very well even for small finite fields. Burcin Eröcal did all the coding, I only helped with some discussions. We need to polish the code a lot to be usable, so if you like matrices over $\mathbb{F}_{p^k}$ head over to #12177 and help out.

## The M4RIE library for dense linear algebra over small fields with even characteristic

I finally uploaded a pre-print of the M4RIE paper to the arXiv:

Abstract: In this work, we present the M4RIE library which implements efficient algorithms for linear algebra with dense matrices over $\mathbb{F}_{2^e}$ for $2 \leq e \leq 10$. As the name of the library indicates, it makes heavy use of the M4RI library both directly (i.e., by calling it) and indirectly (i.e., by using its concepts). We provide an open-source GPLv2+ C library for efficient linear algebra over $\mathbb{F}_{2^e}$ for e small. In this library we implemented an idea due to Bradshaw and Boothby which reduces matrix multiplication over $\mathbb{F}_{p^k}$ to a series of matrix multiplications over $\mathbb{F}_p$. Furthermore, we propose a caching technique – Newton-John tables – to avoid finite field multiplications which is inspired by Kronrod’s method (“M4RM”) for matrix multiplication over $\mathbb{F}_2$. Using these two techniques we provide asymptotically fast triangular solving with matrices (TRSM) and PLE-based Gaussian elimination. As a result, we are able to significantly improve upon the state of the art in dense linear algebra over $\mathbb{F}_{2^e}$ with $2 \leq e \leq 10$.

## M4RIE Paper

I’ve been writing up the ideas that went into the M4RIE library for dense linear algebra over small extensions of $\mathbb{F}_2$. I think it is now in a state to be readable enough to up a PDF of  the current draft online. Hence, here it is. While the paper does explain what we mean by “Travolta tables” it doesn’t explain why we call them that way … but the image below does:

## GAP for dense linear algebra over GF(2^e)

For some strange reason it hardly every occurs to me to benchmark GAP when it comes to linear algebra over GF(2^e). Turns out, that’s a really bad idea as it seems to be the best open-source implementation for such things. For example, computing the reduced row echelon form of a 4,000 x 4,000 random matrix over $\mathbb{F}_{16}$ takes 16.94s in GAP 4.4.12 but 3267s in Sage and 1005s in NTL 5.4.2. So much much faster than other established open-source projects out there.

I hope libgap becomes a reality at some point in the future.

PS: For comparison, Magma does the same task in in 4.67s and M4RIE in 0.71s.

## Asymptotically fast Gaussian elimination for M4RIE

I guess one perk of being on vacation is that one can get more work done. Hence, I finished implementing PLE decomposition (formerly known as PLS decomposition) and hence asymptotically fast Gaussian elimination. The implementation uses two matrix representations: mzd_slice_t and mzed_t. The former is optimised for implementing Karatsuba multiplication (cf., here) and the latter is optimised for using Travolta tables (cf., here). That is, multiplication is based on Karatsuba while the PLE base case is based on Travolta tables. The same applies to TRSM where the base case is also implemented using Travolta tables.

There is still a lot to be done:

• Both TRSM and PLE base cases use only one Travolta table while Gaussian elimination and multiplication use 6 and 8 in parallel respectively.
• There way too much copying going on. For example, I was lazy and implemented TRSM upper left with respect to matrices which do not start at offset 0 with respect to a machine word by simply copying the whole matrix out. Hence, we’re wasting memory where we shouldn’t.
• PLE isn’t cache efficient yet and I assume that the code isn’t very good for sparse-ish matrices (cf. the journey in M4RI improving this)
Still, the results are quite nice already. Below, the left hand side plots the wall time of computing the reduced row echelon form of $m \times (m + 1000)$ random matrices over $\mathbb{F}_{2^e}$. The right hand side expresses efficiency by dividing the running time by $2mnr^{\omega-2}$, i.e., the number of operations (fingers crossed I didn’t screw up the constant).The fact, that we’re not there yet, is clearly expressed by the poor performance of the asymptotically fast code (“PLE”) for small dimensions ($< 3,000$).
Finally, it should be mentioned, that the green plot (“Travolta”) already is considerably faster than Magma, which – as far as I know – is the fastest other implementation of this kind of operation over these fields. Below, Travolta based elimination is compared with Magma’s implementation, where blueness $b$ expresses a speed-up factor of $2^b$ (redness the other way around).The code is on bitbucket.