A Simple Bitslice Implementation of PRESENT

I had a hunch the other day which required $2^{36}$ plaintext–ciphertext pairs to be tested. While the hunch turned out to be wrong, at least I got a simple
bitslice implementation of PRESENT (in pure C99) out of it. The implementation is far from optimal:

  • it is pure C but critical paths should be pushed down to assembly,
  • it doesn’t use SSE2’s 128-bit instructions but it should, and
  • the S-Box is hardly optimised (it is simply the ANF); a naive bitsliced sketch of the general idea follows below.

Still, it is better than what I found on the web via a quick Google search. On my 2.33 GHz Core2Duo MacBook Pro it seems to run at 28 cycles per byte for long messages (not counting the key schedule). For comparison, I think the current speed of AES in bitslice mode is around 10 cycles per byte on the Core 2. If I need more performance in the future, I might sit down and improve the implementation (so that it doesn’t totally suck), but for now I have to attend to other projects.
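To make the bitslicing idea concrete, here is a minimal sketch (my own illustration, not the implementation described above): bit j of 64 independent nibbles is kept in word x[j], so a 4-bit S-box becomes a sequence of word-wise AND/XOR/NOT operations acting on all 64 nibbles at once. The sketch evaluates the S-box naively as a sum of products; a real implementation would replace this with hand-optimised ANF expressions, and the function name and data layout here are made up for illustration.

    #include <stdint.h>

    /* PRESENT S-box (check against the specification before relying on it) */
    static const uint8_t sbox[16] = {0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                                     0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2};

    /* x[j] holds bit j of 64 parallel nibbles; the result goes to y[0..3]. */
    void sbox_bitsliced(const uint64_t x[4], uint64_t y[4]) {
      for (int j = 0; j < 4; j++)
        y[j] = 0;
      for (int v = 0; v < 16; v++) {
        /* indicator word: bit k is set iff the k-th nibble equals v */
        uint64_t ind = ~UINT64_C(0);
        for (int j = 0; j < 4; j++)
          ind &= ((v >> j) & 1) ? x[j] : ~x[j];
        /* scatter S(v) into the output bit planes */
        for (int j = 0; j < 4; j++)
          if ((sbox[v] >> j) & 1)
            y[j] ^= ind;
      }
    }

In this style of slicing the pLayer of PRESENT is essentially free: permuting the 64 state bits only amounts to re-indexing the 64 slice words.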

Bitslicing and the Method of the Four Russians over Larger Finite Fields

Tom Boothby and Robert Bradshaw’s paper on the “Method of the Four Russians” multiplication algorithm over $\mathbb{F}_3$, $\mathbb{F}_5$, $\mathbb{F}_7$, $\mathbb{F}_{2^2}$ and $\mathbb{F}_{2^3}$ is available as a pre-print on the arXiv. If you’re into fast exact linear algebra I highly recommend reading it, since it has some really nice ideas in it and is well written.

Abstract. “We present a method of computing with matrices over very small finite fields of size larger than 2. Specifically, we show how the Method of Four Russians can be efficiently adapted to these larger fields, and introduce a row-wise matrix compression scheme that both reduces memory requirements and allows one to vectorize element operations. We also present timings which confirm the efficiency of these methods and exceed the speed of the fastest implementations the authors are aware of.”
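I don’t know the details of the paper’s compression scheme, but to give a flavour of what “vectorizing element operations” over a field such as $\mathbb{F}_3$ can look like, here is one simple bit-plane packing (purely my own sketch, not taken from the paper): a pair of words (hi, lo) holds 64 elements, with 0 = (0,0), 1 = (0,1) and 2 = (1,0), and addition mod 3 becomes a handful of word-wise Boolean operations.

    #include <stdint.h>

    /* 64 elements of F_3 packed into two bit planes: element k is
     * ((hi >> k) & 1, (lo >> k) & 1) with 0=(0,0), 1=(0,1), 2=(1,0). */
    typedef struct { uint64_t hi, lo; } gf3_word;

    /* component-wise addition mod 3 of 64 packed F_3 elements */
    gf3_word gf3_add(gf3_word x, gf3_word y) {
      gf3_word z;
      z.lo = (x.hi & y.hi) | ((x.lo ^ y.lo) & ~(x.hi | y.hi));
      z.hi = (x.lo & y.lo) | ((x.hi ^ y.hi) & ~(x.lo | y.lo));
      return z;
    }

Negation, and hence multiplication by 2, is just a swap of the two planes in this encoding.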

New M4RI Release Coming Up

I will probably cut a new M4RI release within a week or so. The main changes are:

  • Asymptotically Fast PLUQ Factorisation [mzd_pluq() and _mzd_pluq_mmpf()] due to Clément Pernet and myself:
    This enables asymptotically fast rank computation, row echelon forms and rank profiles. However, don’t hold your breath, because the code is not really optimised yet. While it is faster than M4RI for random dense matrices, it is slower for sparse-ish and structured matrices (see the image below).
  • System Solving [mzd_solve_left()]: Jean-Guillaume Dumas wrote a high-level wrapper around the PLUQ and TRSM routines to provide linear system solving (a usage sketch follows below).
  • M4RI Performance Improvement [mzd_echelonize_m4ri()]: A bug in M4RI which resulted in poor performance for sparse-ish matrices was fixed (see my blog post for details).
  • Refactoring: A few functions were added, deleted and renamed. This release will break source compatibility.

[Figure: PLUQ/M4RI/Magma]
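For orientation, here is a rough usage sketch of the two new entry points (my own example, not taken from the release notes; the names follow recent M4RI releases and the exact signatures and types may differ in the version discussed here):

    #include <stdio.h>
    #include <m4ri/m4ri.h>

    int main(void) {
      rci_t n = 1000;
      mzd_t *A = mzd_init(n, n);
      mzd_t *B = mzd_init(n, 1);
      mzd_randomize(A);
      mzd_randomize(B);

      /* rank via PLUQ factorisation; cutoff 0 picks the library default */
      mzd_t *LU = mzd_copy(NULL, A);
      mzp_t *P = mzp_init(n), *Q = mzp_init(n);
      rci_t rank = mzd_pluq(LU, P, Q, 0);
      printf("rank: %d\n", (int) rank);

      /* solve A*X = B; the solution X overwrites B (A is modified as well) */
      mzd_solve_left(A, B, 0, 1);

      mzp_free(P); mzp_free(Q);
      mzd_free(LU); mzd_free(B); mzd_free(A);
      return 0;
    }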

On a related note: Marco Bodrato came up with a new Strassen-like sequence for multiplying and squaring matrices which provides a small (linear) speed-up for squaring. He also provides a drop-in strassen.c replacement for M4RI-20080521 which implements his new sequence.

M4RI’s and MMPF’s Sensitivity to Density

The last couple of days I’ve been working on improving libM4RI for sparse-ish matrices. The matrices I am talking about here are still represented as dense matrices but have non-uniformly distributed entries. While PLUQ factorisation is still very very (very very) slow for e.g. half rank matrices, things are getting better for M4RI matrix elimination.

Efficient Multiplication of Dense Matrices over GF(2)

“We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (GF(2)). In particular we present our implementation – in the M4RI library – of Strassen-Winograd matrix multiplication and the “Method of the Four Russians” multiplication (M4RM) and compare it against other available implementations. Good performance is demonstrated on AMD’s Opteron and particularly good performance on Intel’s Core 2 Duo. The open-source M4RI library is available stand-alone as well as part of the Sage mathematics software.

In machine terms, addition in GF(2) is logical-XOR, and multiplication is logical-AND, thus a machine word of 64-bits allows one to operate on 64 elements of GF(2) in parallel: at most one CPU cycle for 64 parallel additions or multiplications. As such, element-wise operations over GF(2) are relatively cheap. In fact, in this paper, we conclude that the actual bottlenecks are memory reads and writes and issues of data locality. We present our empirical findings in relation to minimizing these and give an analysis thereof.”
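As a trivial illustration of that point (my own snippet, not code from the paper): adding or element-wise multiplying two rows of 64·n entries over GF(2) costs n word operations.

    #include <stdint.h>
    #include <stddef.h>

    /* c = a + b over GF(2): one XOR per 64 entries */
    void row_add(uint64_t *c, const uint64_t *a, const uint64_t *b, size_t n) {
      for (size_t i = 0; i < n; i++)
        c[i] = a[i] ^ b[i];
    }

    /* c = a * b entry-wise over GF(2): one AND per 64 entries */
    void row_mul(uint64_t *c, const uint64_t *a, const uint64_t *b, size_t n) {
      for (size_t i = 0; i < n; i++)
        c[i] = a[i] & b[i];
    }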

Related News: My shiny new version of Magma 2.14-17 seems to perform better than Magma 2.14-14 for matrix multiplication over $\mathbb{F}_2$ on the Core 2 Duo, so I updated the performance data on the M4RI website. However, the changelog doesn’t mention any improvements in this area. By the way, searching for “Magma 2.14” returns the M4RI website first for me, which feels wrong on so many levels. Finally, M4RI is being packaged for Fedora Core.

GCC 4.3 and -O3

I recently upgraded an Opteron server to Debian/Lenny to get GCC 4.3 for OpenMP reasons. It turned out that my code, namely matrix multiplication as implemented in the M4RI library, ran much slower than when compiled with GCC 4.1. For instance, multiplying two $20,000 \times 20,000$ random matrices took 18.38 seconds with GCC 4.1 but 21.00 seconds with GCC 4.3.1, and multiplying two $32,000 \times 32,000$ random matrices took 70.24 seconds with GCC 4.1 but 80.00 seconds with GCC 4.3.1. Eventually, I checked the high-level changelog and found: “The -ftree-vectorize option is now on by default under -O3. In order to generate code for a SIMD extension, it has to be enabled as well: use -maltivec for PowerPC platforms and -msse/-msse2 for i?86 and x86_64.” However, we don’t use SSE2 on the Opteron since it is slower than the standard instruction set for this application. Passing -fno-tree-vectorize to the compiler fixed the problem. To my surprise, though, -O2 didn’t come with a speed penalty either, so I settled for that. The final timings on my Opteron server are:

64-bit Debian/GNU Linux, 2.6 GHz Opteron (virtualised); times in seconds

Matrix dimension   M4RI GCC 4.3        M4RI GCC 4.3       M4RI GCC 4.1       Magma 2.14-13
                   (64-bit, 4 cores)   (64-bit, 1 core)   (64-bit, 1 core)   (64-bit, 1 core)
20000×20000        6.36                17.81              18.38              18.35
32000×32000        26.65               68.01              70.24              68.01

I suppose the moral of the story is: -O3 isn’t necessarily better than -O2 just because 3>2.

M4RI Website

I finally put together the website for the M4RI library. For those who don’t know M4RI:

“M4RI is a library for fast arithmetic with dense matrices over $\mathbb{F}_2$. It was started by Gregory Bard and is now maintained by Martin Albrecht and Gregory Bard. The name M4RI comes from the first implemented algorithm: The “Method of the Four Russians” inversion algorithm published by Gregory Bard. This algorithm in turn is named after the “Method of the Four Russians” multiplication algorithm which is probably better referred to as Kronrod’s method. M4RI is used by the Sage mathematics software and the PolyBoRi library. M4RI is available under the General Public License Version 2 or later (GPLv2+).

Features of the M4RI library include:

  • basic arithmetic with dense matrices over $\mathbb{F}_2$ (addition, equality testing, stacking, augmenting, sub-matrices, randomisation)
  • asymptotically fast $O(n^{\log_2 7})$ matrix multiplication via the “Method of the Four Russians” (M4RM) & Strassen-Winograd algorithm,
  • asymptotically fast $O(n^3/\log_2 n)$ row echelon form computation and matrix inversion via the “Method of the Four Russians” (M4RI), and
  • support for the x86/x86_64 SSE2 instruction set where available.
  • support for Linux and OS X (GCC), support for Solaris (Sun Studio Express) and support for Windows (Visual Studio 2008 Express).”
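A minimal usage sketch of the two headline operations (my own example; it assumes the function names of recent M4RI releases, so signatures may differ between versions):

    #include <stdio.h>
    #include <m4ri/m4ri.h>

    int main(void) {
      mzd_t *A = mzd_init(4096, 4096);
      mzd_t *B = mzd_init(4096, 4096);
      mzd_randomize(A);
      mzd_randomize(B);

      /* Strassen-Winograd with M4RM as base case; cutoff 0 = library default */
      mzd_t *C = mzd_mul(NULL, A, B, 0);

      /* full row echelon form via M4RI; k = 0 lets the library pick the
         Gray code table size */
      rci_t rank = mzd_echelonize_m4ri(C, 1, 0);
      printf("rank of A*B: %d\n", (int) rank);

      mzd_free(C); mzd_free(B); mzd_free(A);
      return 0;
    }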

Performance-wise it is doing okay but not great. On Intel’s Core2Duo it seems to compare favourably to Magma 2.13, though I don’t have access to Magma 2.14 yet, which improves dense linear algebra over $\mathbb{F}_2$. On AMD’s Opteron, however, it is way behind Magma 2.13. This is possibly due to the 1MB L2 cache of the Opteron vs. the 4MB L2 cache of the Core2Duo.