GCC 4.3 and -O3

I recently upgraded an Opteron server to Debian/Lenny to get GCC 4.3 for OpenMP reasons. It turns out that my code, namely matrix multiplication as implemented in the M4RI library, ran much slower than when compiled with GCC 4.1. For instance, to multiply two $20,000 \times 20,000$ random matrices took 18.38 seconds with GCC 4.1 but 21.00 seconds with GCC 4.3.1 and to multiply two $32,000 \times 32,000$ random matrices took 70.24 seconds with GCC 4.1 but 80.00 second with GCC 4.3.1. Eventually, I checked the highlevel changelog and found: “The -ftree-vectorize option is now on by default under -O3. In order to generate code for a SIMD extension, it has to be enabled as well: use -maltivec for PowerPC platforms and -msse/-msse2 for i?86 and x86_64.” However, we don’t use SSE2 on the Opteron since it is slower than the standard instruction set for this application. Passing -no-tree-vectorize to the compiler fixed the problem. However, to my surprise -O2 didn’t come with a speed penalty either, so I settled for this. The final timings on my Opteron server are:

64-bit Debian/GNU Linux, 2.6Ghz Opteron (Virtualised)
M4RI GCC 4.3
(64-bit, 4 cores)
M4RI GCC 4.3
(64-bit, 1 core)
M4RI GCC 4.1
(64-bit, 1 core)
Magma 2.14-13
(64-bit, 1 core)
20000×20000 6.36 17.81 18.38 18.35
32000×32000 26.65 68.01 70.24 68.01

I suppose the moral of the story is: -O3 isn’t necessarily better than -O2 just because 3>2.

M4RI Website

I finally put together the website for the M4RI library. For those who don’t know M4RI:

“M4RI is a library for fast arithmetic with dense matrices over $\mathbb{F}_2$. It was started by Gregory Bard and is now maintained by Martin Albrecht and Gregory Bard. The name M4RI comes from the first implemented algorithm: The “Method of the Four Russians” inversion algorithm published by Gregory Bard. This algorithm in turn is named after the “Method of the Four Russians” multiplication algorithm which is probably better referred to as Kronrod’s method. M4RI is used by the Sage mathematics software and the PolyBoRi library. M4RI is available under the General Public License Version 2 or later (GPLv2+).

Features of the M4RI library include:

  • basic arithmetic with dense matrices over $\mathbb{F}_2$ (addition, equality testing, stacking, augmenting, sub-matrices, randomisation)
  • asymptotically fast $O(n^{log_27})$ matrix multiplication via the “Method of the Four Russians” (M4RM) & Strassen-Winograd algorithm,
  • asymptotically fast $O(n^{3}/log_2(n))$ row echelon form computation and matrix inversion via the “Method of the Four Russians” (M4RI), and
  • support for the x86/x86_64 SSE2 instruction set where available.
  • support for Linux and OS X (GCC), support for Solaris (Sun Studio Express) and support for Windows (Visual Studio 2008 Express).”

Performance-wise it is doing okay but not great. On Intel’s Core2Duo it seems to compare favourably to Magma 2.13. Though, I don’t have access to Magma 2.14 yet which improves dense linear algebra over $\mathbb{F}_2$. However, on AMD’s Opteron it is way behind Magma 2.13. This is possibly due to the 1MB L2 cache of the Opteron vs. 4MB L2 cache of the Core2Duo.