At SD16 Clément Pernet and myself have been working on improving the asymptotically fast PLUQ factorisation over GF(2) in M4RI. As mentioned earlier, one of the main problems is that column swaps are pretty expensive compared to many other operations we do. Eventually, we settled for LQUP over PLUQ since it has fewer column swaps overall since it does not compress U. We also improved the base case both w.r.t. to sparse matrices and in general (more Gray code tables are used now) and the column swap performance overall (cf. SD 16 Wiki). The result is noticeable, but we are not quite there yet:

M4RI r284 vs. r292

There are still some places which could be improved so this should get better eventually. Also, we might have another strategy to deal with these sparse-ish/structured matrices. Anyway, the new PLUQ code is at least as fast as M4RI for the structured HFE examples on the M4RI website on my Core2Duo 2.33Ghz notebook (and of course much faster on random examples and on other platforms) The new code is available on BitBucket.


Geometric XL

The paper by Sean Murphy and Maura Paterson has been around for quite some time now (same for my toy implementation). Gröbner basis algorithms and related methods like XL are algebraic in nature. In particular, their complexity is not invariant under a linear change of coordinates. For example consider Cyclic-6

    sage: P.<a,b,c,d,e,f,h> = PolynomialRing(GF(32003))
    sage: I = sage.rings.ideal.Cyclic(P,6).homogenize(h)
    sage: J = Ideal(I.groebner_basis())

The generators of J are a Gröbner basis and we can use this property to find a common root for these generators. Now, consider the same equations but permute the variables in the ring.

    sage: P.<a,b,c,d,e,f,h> = PolynomialRing(GF(32003),order='lex')
    sage: I = sage.rings.ideal.Cyclic(P,6).homogenize(h)
    sage: J = Ideal(I.groebner_basis())
    sage: R = PolynomialRing(GF(32003),P.ngens(),list(reversed(P.variable_names())),order='lex')
    sage: H = Ideal([R(str(f)) for f in J.gens()])

The generators of H do not form a Gröbner basis in R which is just P with its variables reversed. If we are only trying to solve a system of equations choosing the right permutation of variables might have a significant impact on the performance of our Gröbner basis algorithm:

    sage: t = cputime()
    sage: gb = H.groebner_basis('libsingular:std')
    sage: gb[-1].degree()
    sage: cputime(t) # output random-ish

While in this example it is easy to see which variable permutation is the cheapest one, this is not necessarily the case in general. The GeometricXL algorithm is invariant under any linear change of coordinates and has the following property:

Let D be the degree reached by the XL algorithm to solve a given system of equations under the optimal linear change of coordinates. Then GeometricXL will also solve this system of equations for the degree D, without applying this optimal linear change of coordinates first.

The above behaviour holds under two assumptions:

  • the characteristic of the base field K is bigger than D and
  • the system of equations has one over “very few” solution.

To demonstrate this behaviour, consider the synthetic benchmark available here which is a Gröbner basis under a linear change of coordinates:


    sage: e,h = random_example(n=6)

e is the original easy system while h is the “rotated” system

    sage: e.basis_is_groebner()

    sage: max([f.total_degree() for f in e.gens()])

    sage: h.basis_is_groebner()

    sage: max([f.total_degree() for f in h.gens()])

GeometricXL recovers linear factors and thus candidates for common roots at D=2:

    sage: hH = h.homogenize()
    sage: f = GeometricXL(hH, D=2); f.factor(False)
    0.0...s -- 1. D: 2
    * (-1056*x5 - 2964*x4 - 177*x3 + 6206*x2 + 376*x1 + 6257*x0 + h)
    * (2957*x5 - 792*x4 - 4323*x3 - 14408*x2 - 2750*x1 - 8823*x0 + h)

While any Gröbner basis algorithm would have to reach at least degree 64:

    sage: gb = h.groebner_basis('libsingular:slimgb')
    sage: gb[-1].degree()

While my toy implementation is neither robust not efficient, the following table (using the same synthetic benchmark as above) should give some indication that for some problems GeometricXL is a good choice:

Synthetic Benchmark for GeometricXL Algorithm over GF(32003)
n GeometricXL degree GeometricXL time in seconds Singular 3-0-4 time in seconds Magma 2.14 time in seconds Gröbner basis degree
2 2 0.01 0.00 0.00 4
3 2 0.07 0.00 0.00 8
4 2 0.18 0.00 0.00 16
5 2 0.37 0.02 0.01 32
6 2 0.73 0.12 0.04 64
7 2 1.36 0.94 0.09 128
8 2 2.35 9.59 0.56 256
9 2 4.24 117.43 3.79 512
10 2 7.34 28.68 1024
11 2 12.08 205.05 2048
12 2 20.03
13 2 35.53
14 2 53.98
15 2 88.36
16 2 143.16

While the benchmark is synthetical and unrealistic, there are a few problems in multivariate quadratic public key cryptography which might be tackled using an idea similar to the GeometricXL algorithm. Also, note that constructing a GeometricMatrixF5 algorithm is straight forward by replacing the first step of the algorithm.

M4RI API Change and Big Matrices

Finally, I found some time to work on M4RI again: I changed the internal representation of matrices to support more than one malloc() call per matrix, i.e. each matrix is split into blocks of some maximal size. This allows to deal with much bigger matrices under Linux because the kernel often won’t just give you 8GB mapped to consecutive addresses but it will give you 1GB eight times. This limitation bugged me quite some time now, since this also limited the kind of systems I could solve using PolyBoRi with M4RI enabled. The result is available at http://bitbucket.org/malb/m4ri/ but bear in mind that the API changed quite a bit for this (e.g., I renamed the packedmatrix to mzd_t) on the way).

64-bit Ubuntu, Xeon X7400 @2.66GHz
(64-bit, 1 core)
(64-bit, 1 core)
100,000 x 100,000 > 1.16 GB 1078.72 s 429.32 s
200,000 x 200,000 > 4.65 GB 2298.30 s
256,000 x 256,000 > 7.63 GB 8979.33 s 3709.25 s

The above table gives the time to compute the rank of random matrices of the given dimension using the given algorithms on http://geom.math.washington.edu.

A Simple Bitslice Implementation of PRESENT

I had a hunch the other day which required $2^{36}$ plaintext—ciphertext pairs to be tested. While the hunch turned out to be wrong, at least I got a simple
bitslice implementation of PRESENT (in pure C99) out of it. The implementation is far from optimal:

  • it is pure C but critical paths should be pushed down to assembly,
  • it doesn’t use SSE2’s 128-bit instructions but it should, and
  • the S-Box is hardly optimised (it is simply ANF).

Still, it is better than what I found on the web via a quick Google search. On my 2.33 Ghz Core2Duo Macbook Pro it seems to run at 28 cycles per byte for long messages (not counting the key schedule). For comparison, I think the current speed of AES in bit slice mode is like 10 cycles per byte on the Core2. If I need some performance in the future, I might sit down and improve the implementation (such that it doesn’t totally suck) but for now I have to attend to other projects.

Bitslicing and the Method of the Four Russians over Larger Finite Fields

Tom Boothby’s and Robert Bradshaw’s paper on the “Method of the Four Russian” multiplication algorithm over F_3, F_5, F_7, F_{2^2} and F_{2^3} is available as pre-print on the arXiv. If you’re into fast exact linear algebra I highly recommend reading it since it has some really nice ideas in it and is well written.

Abstract. “We present a method of computing with matrices over very small finite fields of size larger than 2. Specifically, we show how the Method of Four Russians can be efficiently adapted to these larger fields, and introduce a row-wise matrix compression scheme that both reduces memory requirements and allows one to vectorize element operations. We also present timings which confirm the efficiency of these methods and exceed the speed of the fastest implementations the authors are aware of.”

New M4RI Release Coming Up

I will probably cut a new M4RI release within a week or so. The main changes are:

  • Asymptotically Fast PLUQ Factorisation [mzd_pluq() and _mzd_pluq_mmpf()] due to Clément Pernet and myself:
    This enables asymptotically fast rank computation, row echelon forms and rank profiles. However, don’t hold your breath yet because the code is not really optimised yet. While it is faster than M4RI for random dense matrices it is slower for sparse-ish and structured matrices (see image below).
  • System Solving [mzd_solve_left()]: Jean-Guillaume Dumas wrote a high-level wrapper around the PLUQ and TRSM routines to provide linear system solving.
  • M4RI Performance Improvement [mzd_echelonize_m4ri()]: A bug was fixed in M4RI which resulted in poor performance of M4RI for sparse-ish matrices (see my blog post for details).
  • Refactoring: A few functions where added, deleted and renamed. This release will break source compatibility.


On a related note: Marco Bodrato came up with a new Strassen-like sequence for multiplying and squaring matrices which provides a small (linear) speed-up for squaring. He also provides a drop-in strassen.c replacement for M4RI-20080521 which implements his new sequence.

M4RI’s and MMPF’s Sensitivity to Density

The last couple of days I’ve been working on improving libM4RI for sparse-ish matrices. The matrices I am talking about here are still represented as dense matrices but have non-uniformly distributed entries. While PLUQ factorisation is still very very (very very) slow for e.g. half rank matrices, things are getting better for M4RI matrix elimination. Continue reading “M4RI’s and MMPF’s Sensitivity to Density”