In this work we describe an efficient implementation of a hierarchy of algorithms for Gaussian elimination on dense matrices over the field with two elements (F_2). We discuss both well-known and new algorithms, as well as our implementations in the M4RI library, which has been adopted into Sage. The focus of our discussion is a block-iterative algorithm for PLE decomposition inspired by the M4RI algorithm. The implementation presented in this work provides considerable performance gains in practice compared to the previously fastest implementation. We provide performance figures on x86_64 CPUs to demonstrate the speed of our approach.
We describe an efficient implementation of a hierarchy of algorithms for Gaussian elimination of dense matrices over the field with two elements (F_2). In particular, we describe algorithms for PLE decomposition [Jeannerod et al. 2011], which decompose a matrix A as A = P · L · E, where P is a permutation matrix, L is a unit lower triangular matrix and E is a matrix in row echelon form. Since matrix decomposition is an essential building block for solving dense systems of linear and non-linear equations [Lazard 1983], much research has been devoted to improving the asymptotic complexity of such algorithms. This line of research produced decompositions such as PLUQ [Golub and Van Loan 1996], LQUP [Ibarra et al. 1982], LSP [Ibarra et al. 1982] and PLE [Jeannerod et al. 2011], among others. Each of these decompositions can be reduced to matrix-matrix multiplication and hence can be computed in O(n^ω) time, where ω is the exponent of matrix multiplication (2 ≤ ω ≤ 3; Strassen's algorithm yields ω ≈ 2.807). However, in [Jeannerod et al. 2011] it was shown that many of these decompositions are essentially equivalent (one can be produced from another at negligible cost) and that, for almost all applications, the PLE decomposition is at least as efficient in terms of time and memory as any of the others. Hence, in this work we focus on the PLE decomposition, but consider the special case of F_2.
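To make the definition concrete, consider the following small worked example over F_2 (our own illustration, not taken from [Jeannerod et al. 2011]); all arithmetic is modulo 2:

\[
A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{P}
    \cdot
    \underbrace{\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}}_{L}
    \cdot
    \underbrace{\begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{pmatrix}}_{E}
\]

Here P swaps the first two rows, L is unit lower triangular, and E is in row echelon form with rank 2.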
In particular, we propose a new algorithm for block-iterative PLE decomposition and discuss its implementation. We also describe our implementation of previously known algorithms in the M4RI library [Albrecht and Bard 2011].
This article is organised as follows. In the remainder of this section we highlight some computer architecture issues, discuss in-place methods, and define our notation. In Section 2 we give the definitions of reduced row echelon form (RREF) and PLE decomposition. In Section 3 we discuss three algorithms for performing PLE decomposition and their implementation issues; other implementation issues are covered in Section 4. We conclude by giving empirical evidence of the viability of our approach in Section 5. In Appendix A we describe the original M4RI algorithm (for reducing F_2 matrices to either row echelon form or RREF), which initiated this line of research.
The M4RI library implements dense linear algebra over F_2. Our implementation focuses on 64-bit x86 architectures (x86_64), specifically the Intel Core 2 and the AMD Opteron. Thus, we assume in this work that native CPU words have 64 bits. However, it should be noted that our code also runs on 32-bit CPUs and on non-x86 CPUs such as the PowerPC.
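To illustrate the word-oriented layout this implies, the following sketch shows how one can read and update individual entries of a row packed into 64-bit words. The names and the bit order within a word are our own simplifications; M4RI's internal macros differ.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t word;  /* one machine word holds 64 matrix entries */

    /* Read entry j of a bit-packed row (here: bit 0 of a word is the
     * lowest-order bit; this ordering is an assumption of the sketch). */
    static inline int read_entry(const word *row, size_t j)
    {
        return (int)((row[j / 64] >> (j % 64)) & 1);
    }

    /* Add (XOR, i.e. addition over F_2) the bit b into entry j of the row. */
    static inline void add_entry(word *row, size_t j, int b)
    {
        row[j / 64] ^= (word)(b & 1) << (j % 64);
    }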
Element-wise operations over F_2, being mathematically trivial, are relatively cheap compared to memory access. In fact, in this work we demonstrate that the two fastest implementations of dense matrix decomposition over F_2 (the one presented in this work and the one found in Magma [Bosma et al. 1997], due to Allan Steel) perform worse on moderately sparse matrices than on dense ones, despite the fact that fewer field operations are performed. Thus our work provides further evidence that more refined models are needed for estimating the expected efficiency of linear algebra algorithms on modern CPUs and their memory architectures.
An example of such CPU-specific issues is the SSE2 instruction set, which we will have cause to mention in Section 4. This instruction set can XOR (exclusive-OR) two 128-bit vectors with a single instruction. Thus, if we are adding a pair of rows of 128n bits each, we can do so with n instructions. This is a tremendous improvement over the 128n floating-point operations required for the analogous task over the real numbers. It also means that any algorithm which touches individual bits instead of blocks of 128 bits will run about two orders of magnitude slower than one which avoids doing so.
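For concreteness, a row addition along these lines might look as follows. This is a minimal sketch built on the SSE2 intrinsic _mm_xor_si128, not M4RI's actual routine; it assumes both rows are 16-byte aligned and that n is the row length measured in 128-bit words.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* Add (XOR) row src into row dst; both rows consist of n 128-bit
     * words, so each _mm_xor_si128 processes 128 entries of F_2 at once. */
    static void row_add_128(__m128i *dst, const __m128i *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = _mm_xor_si128(dst[i], src[i]);
    }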
By A[i, j] we denote the entry of the matrix A at row i and column j, and by v_i we denote the i-th entry of a vector v. We represent m × m permutation matrices as integer vectors of length m, i.e., in LAPACK style: for each index i, the entry P_i encodes that row i is to be swapped with row P_i when the permutation is applied to the input. For example, the vector P = [0, 2, 2, 4, 4] encodes the permutation that swaps rows 1 and 2 as well as rows 3 and 4. This representation allows permutations to be applied in-place.
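Applying such a permutation vector in place then amounts to a single pass of row swaps. The following sketch illustrates the idea; the function name and the representation of the matrix as an array of row pointers are our own simplifications, not M4RI's actual interface.

    #include <stddef.h>
    #include <stdint.h>

    /* Apply a LAPACK-style permutation vector P of length m to the rows
     * of a matrix, in place: for i = 0, ..., m-1, swap row i with row P[i].
     * Hypothetical sketch; rows[] holds one pointer per bit-packed row. */
    static void apply_row_permutation(uint64_t **rows, const size_t *P, size_t m)
    {
        for (size_t i = 0; i < m; i++) {
            if (P[i] != i) {
                uint64_t *tmp = rows[i];
                rows[i] = rows[P[i]];
                rows[P[i]] = tmp;
            }
        }
    }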
All indices start counting at zero, e.g., A[2, 2] refers to the entry at the intersection of the third row and the third column. All algorithms, lemmas, theorems and propositions in this work are presented for the special case of F_2. Hence, any statement that some algorithm produces a certain output is always understood to apply to this special case, unless stated otherwise.
Matrix triangular decompositions allow the lower triangular and the upper triangular factors to be stored, one above the other, in the same amount of memory as the input matrix: their non-trivial coefficients occupy disjoint areas, since the diagonal of one of them is unit and can therefore be omitted. For two m × n