Efficient dense Gaussian elimination over the field with two elements

Finally, we finished our paper about Gaussian elimination in the M4RI library.

Abstract: In this work we describe an efficient implementation of a hierarchy of algorithms for Gaussian elimination upon dense matrices over the field with two elements (\mathbb{F}_2). We discuss both well-known and new algorithms as well as our implementations in the M4RI library, which has been adopted into Sage. The focus of our discussion is a block iterative algorithm for PLE decomposition which is inspired by the M4RI algorithm. The implementation presented in this work provides considerable performance gains in practice when compared to the previously fastest implementation. We provide performance figures on x86_64 CPUs to demonstrate the alacrity of our approach.

The sources of this document are available on bitbucket. But I also compiled a PDF.

Update: arXiv link.

Challenge matrices

Now, that we have a decent PNG reader/writer in M4RI, it’s much easier to get some challenge matrices out of the library. Below, I list and link a few such matrices as they appear during Gröbner basis computations.

file matrix dimensions density PLE M4RI GB
HFE 25 matrix 5 (5.1M) 12307 x 13508 0.07600 1.03 0.59 0.81
HFE 30 matrix 5 (16M) 19907 x 29323 0.06731 4.79 2.70 4.76
HFE 35 matrix 5 (37M) 29969 x 55800 0.05949 19.33 9.28 19.51
Mutant matrix (39M) 26075 x 26407 0.18497 5.71 3.98 2.10
random n=24, m=26 matrix 3 (30M) 37587 x 38483 0.03832 20.69 21.08 19.36
random n=24_ m=26 matrix 4 (24M) 37576 x 32288 0.04073 18.65 28.44 17.05
SR(2,2,2,4) compressed, matrix 2 (328K) 5640 x 14297 0.00333 0.40 0.29 0.18
SR(2,2,2,4) compressed, matrix 4 (2.4M) 13665 x 17394 0.01376 2.18 3.04 2.04
SR(2,2,2,4) compressed, matrix 5 (2.8M) 11606 x 16282 0.03532 1.94 4.46 1.59
SR(2,2,2,4) matrix 6 (1.4M) 13067 x 17511 0.00892 1.90 2.09 1.38
SR(2,2,2,4) matrix 7 (1.7M) 12058 x 16662 0.01536 1.53 1.93 1.66
SR(2,2,2,4) matrix 9 (36M) 115834 x 118589 0.00376 528.21 578.54 522.98

The first three rows are from GB computations for the hidden field equations cryptosystem (those matrices were provided by Michael Brickenstein). The “mutant” row is a matrix as it appears during a run of the MXL2 algorithm on a random system (I believe). It was contributed by Wael Said. The rows “random n=25,m=26” are matrices as they appear during a GB computation with PolyBoRi for a random system of equations in 24 variables and 26 equations. The remaining rows are matrices from PolyBoRi computations on small scale AES instances. Those rows which have “compressed” in their description correspond to systems where “linear variables” were eliminate before running the Gröbner basis algorithm.

The last three columns give running times (quite rough ones!) for computing an echelon form (not reduced) using (a) the M4RI algorithm, (b) PLE decomposition and (c) a first implementation of the TRSM for trivial pivots trick. As you can see, currently it’s not straight-forward to pick which strategy to use to eliminate matrices appearing during Gröbner basis computations: the best algorithm to pick varies between different problems and the differences can be dramatic.

PNG Images FTW

I had the pleasure to attend a meeting of the LinBox developers this week in Raleigh, NC. One of the question that came up was how to exchange and store matrices. For example, matrices for whose “class” one wants to find/implement dedicated algorithms. To give a more concrete example: matrices appearing during Gröbner basis computations have a special structure which allows to reduce them faster than random matrices (cf. this paper or this code). Hence, we’d like to have some sort of format to store such matrices such that we can work on these dedicated algorithms. Of course, such a format should be flexible, human readable (preferably writable as well) and reasonably simple. I’m not sure how the LinBox crew would think about me blogging about details of their meeting, hence I’ll only use this opportunity to plug my own proposal (at least for reasonably dense matrices): PNG images.

The PNG file format definitely is readable, since there exists many viewers for it (you are probably using one right now). The format can also be edited thanks to GIMP and friends. That is, we can draw matrices! Furthermore, PNG allows to use lossless compression, i.e., it can compress the actual image data using GZIP. Finally, it is pretty flexible: it supports between one (grayscale) and four (RGB + alpha) channels and various bit depths per channel (1,2,4,8,16). Hence, it can store between 1 bit and 64 bits per pixel. Thus, for finite fields up cardinality 2^{64} we can store each entry simply as a pixel. If we pick our colour assignment right, the pictures even make sense (using the convention that darker is larger as an integer for example).

Of course, the whole format is fundamentally biased towards dense matrices. In fact, we’ve been using 1-bit PNG images as data storage for M4RI matrices for a while now: both Sage and PolyBoRi use the GD library to write M4RI matrices to disk. However, using GD has some shortcomings such as high memory requirements (the whole image is constructed in RAM before being written to disk). This week, I implemented reading/writing of PNG 1-bit images feature directly in the M4RI library using libpng directly. This allows to save a lot of memory and for some cool other features such as control over the GZIP compression level, custom comments, “unknown chunks” as attachments, etc.

I’ve also conducted some experiments to get an impression how well this format works in terms of storage space and loading time:

file size matrix dimension density loading time on 2.6Ghz i7
2.8M 11606 x 16282 0.03532 0.190s
5.1M 12307 x 13508 0.07600 0.205s
16M 19907 x 29323 0.06731 0.619s
30M 37587 x 38483 0.03832 1.565s
36M 115834 x 118589 0.00376 12.132s
37M 29969 x 55800 0.05949 1.685s
39M 26075 x 26407 0.18497 0.847s

To me, the above table – which lists some matrices from Gröbner basis computations over GF(2) – suggests that the format is reasonably efficient. However, I don’t really have anything to compare with, so my sense might be off. Still, compared to some ASCII based formats out there, it’s pretty competitive, as far as I can tell. Note, however, that the above file sizes were produced using GZIP compression level 9 which takes pretty long to write. Using a lower level (such as the default) produces slightly larger files (about 10%-20% depending on the structure).

Finally, wouldn’t it be very awesome if we could use these pictures when debugging our code? So who speaks GDB’s macro language?

M4RI-20110601

I just just pressed the button and release M4RI version 20110601. This version contains quite a significant number of changes both internally (for developers) and for end users. For example, we changed the internal matrix layout (including the bit order per word). Furthermore, we greatly improved our testing and benchmarking code such that it will be easier to write fast code in the future: it gives much more accurate and fine grained timing information (cycles, cache hits/misses, etc.), with confidence interval for the returned times and all bells and whistles. It’s pretty neat. Finally, making use of these new features we improved performance especially for small operations. Details are given in the release note.

Continue reading “M4RI-20110601”

TRSM with Greasing == TRSM reduced to Matrix Multiplication

Recently, I met Sylvain Lachartre (one of the authors of “Parallel Gaussian elimination for Gröbner bases computations in finite fields“) who mentioned to me that his thesis contains a description of Triangular System Solving with Matrices (TRSM) using Greasing or the M4RM caching trick. Since TRSM is a step in asymptotically fast echelon form computations, I figured I should implement it as well in M4RI (note, I haven’t actually read his thesis, so he might have a much more clever idea than I did). That’s what I did today.

As one can easily see the speed improvement – while noticeable – is not that overwhelming (red is the old code, blue is the new code, which doesn’t use the reduction to matrix multiplication). I think the reason for this is that we implicitly had this algorithm implemented already. Clément Pernet, who implemented all the TRSM code in M4RI, implemented TRSM recursively by reducing to matrix multiplication (in order to make use of our asymptotically fast matrix multiplication routines) using the following scheme:

     __________   ______
     \ U00|    | |      |
      \   |U01 | |      |
       \  |    | |  B0  |
        \ |    | |      |
         \|____| |______|
          \    | |      |
           \U11| |      |
            \  | |  B1  |
             \ | |      |
              \| |______|

Using this cutting up of the matrix, one can compute the “Upper Left” TRSM of U and B as follows: First compute TRSM on U11 and B1. Then multiply U10 times B1 and add it to B0 and finally compute TRSM on U00 and B0. Then, B will contain X such that U*X = B.

Now, if the matrix multiply is implemented using M4RM (which it is in the M4RI library for small dimensions), then the multiplication avoids exactly the same redundant additions of rows from B as a straight-forward TRSM with Greasing would. Of course, one isn’t limited to two blocks for B, but can cut it up into finer stripes. I assume that the actual performance improvements we see with the new code are due to better choices for cutting up B.