Multicore BKZ in FPLLL

There have been a few works recently that give FPLLL a hard time when considering parallelism on multicore systems. That is, they compare FPLLL’s single-core implementation against their multi-core implementations, which is fair enough. However, support for parallel enumeration has existed for a while in the form of fplll-extenum. Motivated by these works we merged that library into FPLLL itself a year ago. However, we didn’t document the multicore performance that this gives us. So here we go.

I ran

for t in 1 2 4 8; do
  ./ -j  $(expr 28 / $t)  -t $t -s 512 -u 80 ./fplll/strategies/default.json

where is from the strategizer and default.json is from a PR against FPLLL. Note that these strategies were not optimised for multicore performance. I ran this on a system with two Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz i.e. 28 cores. The resulting speed-ups are: multicore-bkz-in-fplll-default.png

As you can see, the speed-up is okay for two cores but diminishes as we throw more cores at the problem. I assume that this is partly due to block sizes being relatively small (for larger block sizes G6K – which scales rather well on multiple cores – will be faster). I also suspect that this is partly an artefact of not optimising for multiple cores, i.e. picking the trade-off between enumeration (multicore) and preprocessing (partly single-core due to LLL calls) right. If someone wants to step up and compute some strategies for multiple cores, that’d be neat.

M4RI OpenMP Tuning

Occasionally I get motivated to try to optimise multicore support in M4RI. It has been around for a while but so far no serious attempt was made to optimise it. In the past few days, I optimised some tuning parameters on geom.math which is a 24 core 2.6Ghz Xeon (X7460) machine with 128GB of RAM. Eventually, it turned out that the best strategy seems to be to just pick rather big chunks of of work and statically assign them to individual threads. Below, I give the speed-up so achieved for various numbers of cores and the “TBops”. TBops is a bit of a funny measure and it is motivated by numerical linear algebra. In numerical linear algebra cubic multiplication and elimination methods rule since they are more numerically stable than asymptotically fast approaches. This gives rise to the measure of “Flops” (floating point operations per second), i.e. 2n3=t where n is the matrix dimension and t the running time. The X7460 is advertised at around 20 GFlops. FFLAS/FFPACK provides fast finite field linear algebra based on numerical linear algebra, that is finite field elements are encoded as floating point numbers and ATLAS et al. are then used to perform multiplication using these numbers. To see what speed FFLAS/FFPACK achieves compared to numerical routines, or how efficient the reduction is, the measure FFops is used which stands for finite field operations per second. LinBox (which includes FFLAS/FFPACK) achieves about 8.9 FFops on the X7460 and I assume this number could be pushed higher if ATLAS was told to use more cores. Note that FFLAS/FFPACK uses asymptotically fast methods and thus that the 2n3=t formula only makes sense when considering the reduction to numerical linear algebra. In principle, this reduction also works for F2. In M4RI we do not use the reduction to floating point arithmetic but perform bit arithmetic directly. Yet, it still might make sense to compare its performance to that achievable by FFLAS/FFPACK based on numerical linear algebra. Bit operations per seconds is meant to provide a way to perform this comparison: it is equivalent to FFops when considering F2. TBops in the plots below means terabops.

I guess the speed-up achieved for M4RI is not that bad. For two cores it is actually very close to linear, four cores still isn’t that bad and almost 12x speed-up at peak performance for 16 cores is not that bad. However, it remains to be seen whether we can make it run at peak performance for larger dimensions than 16,000 x 16,000.

While M4RI has its merits (for instance, it is really in-place and thus uses much less memory than asymptotically fast methods) it is not the fastest method available to us. In particular on sage.math it performs much worse than PLS. Thus, the speed-up achieved for PLS is much more relevant. Unfortunately, this speed-up is much worse. The parallelised code used to generate the plots below does two things: MMPF is parallelised similar to M4RI and matrix multiplication is parallelised by splitting the matrix into 4×4 submatrices and computing the 16 sectors of C=AB in parallel. Clearly, even if this turns out to be a good strategy one should for good performance split in as many sectors as cores are available instead of hard-coding 16. Thus, I expect to be able to push the performance up some more. I am not sure though whether I can go beyond 4x. Btw. I assume the “bit operations per seconds” plot will keep on rising since the improved complexity of n^2.807 really matters for higher dimensions.

Finally, what does that all mean in terms of absolute speed? To give an impression I computed the echelon form (not reduced) of a 500,000 x 500,000 dense random matrix over GF(2). Using the M4RI algorithm and one core this takes 45683.61 seconds = 12.7 hours on geom.math. The best result for M4RI I have so far is 8532.43 seconds using 16 cores. Using PLS decomposition and one core only this takes about 9711.42 seconds = 2.7 hours. Using four cores and PLS decomposition one can compute the echelon form of a random dense 500,000 x 500,000 matrix in 3806.28 seconds = 1.05 hours.