I just finished giving a talk about the M4RI & the M4RIE libraries for dense linear algebra over and respectively. I’m in LORIA, Nancy right now visiting the CARAMEL team, btw. Well, here are the slides.

# Tag: multicore

## M4RI OpenMP Tuning

Occasionally I get motivated to try to optimise multicore support in M4RI. It has been around for a while but so far no serious attempt was made to optimise it. In the past few days, I optimised some tuning parameters on geom.math which is a 24 core 2.6Ghz Xeon (X7460) machine with 128GB of RAM. Eventually, it turned out that the best strategy seems to be to just pick rather big chunks of of work and statically assign them to individual threads. Below, I give the speed-up so achieved for various numbers of cores and the “TBops”. TBops is a bit of a funny measure and it is motivated by numerical linear algebra. In numerical linear algebra cubic multiplication and elimination methods rule since they are more numerically stable than asymptotically fast approaches. This gives rise to the measure of “Flops” (floating point operations per second), i.e. 2n3=t where n is the matrix dimension and t the running time. The X7460 is advertised at around 20 GFlops. FFLAS/FFPACK provides fast finite field linear algebra based on numerical linear algebra, that is finite field elements are encoded as floating point numbers and ATLAS et al. are then used to perform multiplication using these numbers. To see what speed FFLAS/FFPACK achieves compared to numerical routines, or how efficient the reduction is, the measure FFops is used which stands for finite field operations per second. LinBox (which includes FFLAS/FFPACK) achieves about 8.9 FFops on the X7460 and I assume this number could be pushed higher if ATLAS was told to use more cores. Note that FFLAS/FFPACK uses asymptotically fast methods and thus that the 2n3=t formula only makes sense when considering the reduction to numerical linear algebra. In principle, this reduction also works for F2. In M4RI we do not use the reduction to floating point arithmetic but perform bit arithmetic directly. Yet, it still might make sense to compare its performance to that achievable by FFLAS/FFPACK based on numerical linear algebra. Bit operations per seconds is meant to provide a way to perform this comparison: it is equivalent to FFops when considering F2. TBops in the plots below means terabops.

I guess the speed-up achieved for M4RI is not that bad. For two cores it is actually very close to linear, four cores still isn’t that bad and almost 12x speed-up at peak performance for 16 cores is not that bad. However, it remains to be seen whether we can make it run at peak performance for larger dimensions than 16,000 x 16,000.

While M4RI has its merits (for instance, it is really in-place and thus uses much less memory than asymptotically fast methods) it is not the fastest method available to us. In particular on sage.math it performs much worse than PLS. Thus, the speed-up achieved for PLS is much more relevant. Unfortunately, this speed-up is much worse. The parallelised code used to generate the plots below does two things: MMPF is parallelised similar to M4RI and matrix multiplication is parallelised by splitting the matrix into 4×4 submatrices and computing the 16 sectors of C=AB in parallel. Clearly, even if this turns out to be a good strategy one should for good performance split in as many sectors as cores are available instead of hard-coding 16. Thus, I expect to be able to push the performance up some more. I am not sure though whether I can go beyond 4x. Btw. I assume the “bit operations per seconds” plot will keep on rising since the improved complexity of n^2.807 really matters for higher dimensions.

Finally, what does that all mean in terms of absolute speed? To give an impression I computed the echelon form (not reduced) of a 500,000 x 500,000 dense random matrix over GF(2). Using the M4RI algorithm and one core this takes 45683.61 seconds = 12.7 hours on geom.math. The best result for M4RI I have so far is 8532.43 seconds using 16 cores. Using PLS decomposition and one core only this takes about 9711.42 seconds = 2.7 hours. Using four cores and PLS decomposition one can compute the echelon form of a random dense 500,000 x 500,000 matrix in 3806.28 seconds = 1.05 hours.