Efficient Gaussian Elimination on a 2D SIMD Array of Processors without Column Broadcasts


Authors: Mugurel Ionut Andreica

U.P.B. Sci. Bull., Series C, Vol. 71, Iss. 4, 2009, ISSN 1454-234X

EFFICIENT GAUSSIAN ELIMINATION ON A 2D SIMD ARRAY OF PROCESSORS WITHOUT COLUMN BROADCASTS

Mugurel Ionuţ ANDREICA 1

1 Assist., Computer Science and Engineering Department, Faculty of Automatic Control and Computer Science, Politehnica University of Bucharest, Bucharest, Romania, email: mugurel.andreica@cs.pub.ro

This paper presents an efficient method for implementing the Gaussian elimination technique for an n·m (m ≥ n) matrix, using a 2D SIMD array of n·m processors. The described algorithm consists of 2·n-1 = O(n) iterations, which provides an optimal speed-up over the serial version. A particularity of the algorithm is that it only requires broadcasts on the rows of the processor matrix and not on its columns. The paper also presents several extensions and applications of the Gaussian elimination algorithm.

Keywords: parallel Gaussian elimination, SIMD, 2D processor array.

1. Introduction

The Gaussian elimination algorithm applied to an n·m (m ≥ n) matrix A consists of transforming the matrix into an equivalent upper triangular matrix B (i.e. B(i,j)=0, for j<i). At iteration i (1 ≤ i ≤ n), the serial algorithm performs the following steps:

    // the search stage
    find a suitable pair (r,c) (i ≤ r ≤ n, i ≤ c ≤ m), with |A(r,c)|>0
    swap rows i and r in the matrix A
    swap columns i and c in the matrix A
    // the reduction stage
    for j=i+1 to n do
        vaux=A(j,i)/A(i,i)
        for k=i to m do
            A(j,k)=A(j,k)-vaux·A(i,k)

At the end of the algorithm, the matrix A is an upper triangular matrix. A suitable (row, column) pair (r,c) (i ≤ r ≤ n, i ≤ c ≤ m) is usually considered the one with the largest absolute value |A(r,c)|, because of numerical stability reasons. There are cases, however, when any (row, column) pair (r,c) is suitable, as long as |A(r,c)|>0. In such cases, we need to perform the swaps only if A(i,i)=0.

Because of its huge theoretical and practical importance, many parallel approaches for the Gaussian elimination technique have been proposed. In [1, 2, 3], parallel Gaussian elimination implementations using OpenMP and running on multiple processors (cores), as well as distributed implementations using MPI and running on multiple computing nodes, were analyzed. Generating optimal task schedules for Gaussian elimination on MIMD machines was achieved in [4]. In [5], communication efficiency aspects of parallelizing the Gaussian elimination technique were considered. A systolic array implementation for dense matrices over GF(p) (the Galois field) was given in [6]. In [7], a Gaussian elimination algorithm over a synchronous architecture was presented, which is similar in several ways to the solution presented in this paper. Although the algorithm in [7] does not require any kind of broadcast mechanism, it assumes that communication is possible between neighbouring processors located on the same row or column (at most 4 neighbours).
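For reference, the serial algorithm above can be written out in full. The following Python sketch (the name gaussian_elimination is hypothetical; the paper itself gives only pseudocode) uses the largest-absolute-value pivoting rule described above:

```python
def gaussian_elimination(A):
    """Serial Gaussian elimination for an n x m (m >= n) matrix, with the
    (row, column) pivot chosen by largest absolute value."""
    n, m = len(A), len(A[0])
    A = [row[:] for row in A]             # work on a copy
    for i in range(n):
        # the search stage: entry with the largest |A(r,c)|, r >= i, c >= i
        r, c = max(((r, c) for r in range(i, n) for c in range(i, m)),
                   key=lambda rc: abs(A[rc[0]][rc[1]]))
        A[i], A[r] = A[r], A[i]           # swap rows i and r
        for row in A:                     # swap columns i and c
            row[i], row[c] = row[c], row[i]
        if A[i][i] == 0:                  # remaining block is all zero
            continue
        # the reduction stage
        for j in range(i + 1, n):
            vaux = A[j][i] / A[i][i]
            for k in range(i, m):
                A[j][k] -= vaux * A[i][k]
    return A
```

The singular case (all remaining entries zero) simply skips the reduction stage, leaving zero rows at the bottom.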
However, many of the mentioned parallel implementations seem to ignore the possibility that, at the i-th iteration of the algorithm, the entry A(i,i) might be zero. This is because parallelizing the search and swap stage is more difficult. Such a hardness result was obtained in [8]. The time complexity of the serial algorithm is O(n^2·m) (or O(n^3) if m=O(n)) even if we do not search for a suitable entry A(r,c) at every iteration (and, thus, we perform no swaps). However, in the parallel case, the reduction stage is easy to parallelize, while the search and swap stage is not.

This paper proposes a novel approach for implementing the Gaussian elimination for an n·m (m ≥ n) matrix, using a SIMD 2D array of n·m processors (n rows with m processors each). The approach does not try to parallelize the serial algorithm directly, like many of the existing parallel solutions do; instead, it uses a different technique which allows matrix rows to slide past each other and reach their correct position in the final, upper triangular matrix. The solution does not search for the entry with the largest absolute value when setting the value A(i,i) and, because of this, it might not be numerically stable. However, it does permit row reorderings, due to the sliding mechanism, thus working under more realistic conditions than many other existing parallel solutions. Another particularity of the solution described in this paper is that it only requires mechanisms for broadcasting data on a row of processors and not on its columns, while several other parallel Gaussian elimination approaches require both row and column broadcasts. Since the broadcast requirements are reduced, the processor interconnection architecture can be simplified, thus reducing some of the architectural costs.
Section 2 of this paper contains the description of the parallel algorithm and architecture, as well as a formal proof of correctness, based on induction over the number of rows of processors. Section 3 presents some validation results through simulations. In Section 4 several extensions and applications of the Gaussian elimination method are presented and in Section 5 we conclude and discuss future work.

2. Description of the Parallel Algorithm

The n·m processors are arranged into an n·m grid (n rows with m processors per row). Each processor (i,j) (located on row i and column j) is connected to the processor on the row below it and on the same column. Processors on the n-th row are connected to those on the first row. On each row i, processor (i,i) is a special processor and there are extra connections which allow processor (i,i) to broadcast data to all the other processors in its row. Each processor (i,j) has 6 registers: tmp(i,j), tmp2(i,j), f(i,j), cnt(i,j), state(i,j) and state_changed(i,j). tmp(*,*) and tmp2(*,*) are used for temporary storage; f(i,j) is used for storing the final value of the entry on row i and column j of the triangular matrix obtained as a result of the algorithm. cnt(i,j) is a counter and is incremented after each iteration of the algorithm (cnt(i,j) is identical for all the processors and could be implemented as a single shared register, instead of n·m distributed registers). state(i,j) stores the state of the processor. If state(i,j)=1, then the value stored in f(i,j) is the final value of the (i,j) entry of the resulting triangular matrix; otherwise (if state(i,j)=0), f(i,j) contains no meaningful value. The states of all the processors on the same row i are identical; thus, a single shared register for each row could be used. state_changed(i,j) is a boolean register which holds the value true if the state of the processor changed during the current iteration.
Except for these registers, each processor stores its row and column (i and j) in two special read-only registers. In the beginning of the algorithm, entry (i,j) of the initial matrix is stored in tmp(i,j), cnt(*,*) and state(*,*) are set to 0 and state_changed(*,*) are set to false. We will perform 2·n-1 iterations. At each iteration, the tmp value of each processor (i,j) is transferred to the tmp value of the processor (i+1,j) (processor (n,j) transfers the data to processor (1,j)). All the transfers occur simultaneously, as in any SIMD computer. After the data transfer, each processor performs a series of computations, described by the function Compute (see below).

The algorithm works as follows. At each iteration t (1 ≤ t ≤ 2·n-1), only the processors on rows i ≤ t perform meaningful computations (this is enforced by the test cnt(i,j) ≥ i in the pseudocode below). If the state of the processors on row i is 0, then we verify if the current row stored in the tmp(i,*) values could be the final row to be stored on row i of the resulting upper triangular matrix. Processor (i,i) performs this test, by comparing |tmp(i,i)| against 0. If |tmp(i,i)|>0, then we can store the current tmp(i,*) values into the f(i,*) values, because the condition that the entry (i,i) is non-zero is fulfilled. The processor (i,i) broadcasts a changed state announcement to all the other processors on row i. If this announcement is 1 (|tmp(i,i)|>0), then the new state of the processors (i,j) becomes 1 and the tmp(i,*) values are copied into the corresponding f(i,*) values; after this, tmp(i,*) is set to 0. If the announcement is 0, then no more computations are performed by the processors on row i during the current iteration.
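The sliding mechanism can be checked with a small sequential simulation. The following Python sketch (a hypothetical helper, written for clarity rather than performance, and not the SIMD implementation itself) keeps one tmp row per processor row and runs the 2·n-1 iterations:

```python
# Sequential simulation of the sliding-row algorithm: the tmp rows shift
# cyclically downwards, then each active processor row applies the Compute
# logic (reduce the incoming row if state=1, capture it if the pivot is
# non-zero and state=0).
def parallel_ge_simulation(A):
    n, m = len(A), len(A[0])
    tmp = [row[:] for row in A]           # tmp(i,*) of each processor row
    f = [[0.0] * m for _ in range(n)]     # f(i,*): the final triangular matrix
    state = [0] * n
    for t in range(1, 2 * n):             # the 2*n-1 iterations
        tmp = [tmp[-1]] + tmp[:-1]        # simultaneous transfer to row below
        for i in range(n):                # 0-based row index
            if t < i + 1:                 # cnt(i,*) < i: row not active yet
                continue
            if state[i] == 1:             # reduce the incoming row
                c = tmp[i][i] / f[i][i]   # the broadcast value tmp2(i,i)
                tmp[i] = [tmp[i][j] - c * f[i][j] for j in range(m)]
            elif abs(tmp[i][i]) > 0:      # capture the row: state 0 -> 1
                state[i] = 1
                f[i] = tmp[i]
                tmp[i] = [0.0] * m
    return f
```

Rows of processors that remain in state 0 (singular input) leave their f row equal to zero, which matches one of the two choices discussed at the end of this section.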
Compute(processor (i,j)):
    cnt(i,j)=cnt(i,j)+1
    if (cnt(i,j) ≥ i) then
        if (state(i,j)=1) then
            if (i=j) then
                tmp2(i,i)=tmp(i,i)/f(i,i)
                broadcast tmp2(i,i) to all the processors (i,j) on row i, with i ≠ j
            else // i ≠ j
                tmp2(i,j)=the value broadcasted by processor (i,i)
            tmp(i,j)=tmp(i,j)-tmp2(i,j)·f(i,j)
        else // (state(i,j)=0)
            if (i=j) then
                if (|tmp(i,i)|>0) then
                    broadcast true to all the processors (i,j) on row i, with i ≠ j
                    state_changed(i,i)=true
                else
                    broadcast false to all the processors (i,j) on row i, with i ≠ j
            else // i ≠ j
                state_changed(i,j)=the value broadcasted by processor (i,i)
            if (state_changed(i,j)=true) then
                state(i,j)=1
                state_changed(i,j)=false
                f(i,j)=tmp(i,j)
                tmp(i,j)=0

If, at the beginning of the iteration, the state of the processors on row i is 1, then the values stored in tmp(i,*) are processed accordingly. The value tmp2(i,i)=tmp(i,i)/f(i,i) is computed by processor (i,i) and then broadcasted to all the other processors on row i (which store it in their own tmp2 register). Afterwards, each value tmp(i,j) (1 ≤ j ≤ m) is decreased by tmp2(i,j)·f(i,j). As a result of this, tmp(i,i) becomes 0. At the next iteration, the processors on the row below row i receive the decremented tmp(i,*) values of the processors on row i.

We will prove the following statement: the tmp(i,j) values received by a row i of processors at an iteration t ≥ i are zero for 1 ≤ j ≤ i-1. The proof is based on induction over the row i of processors. The base case consists of i=1, where it is obvious that the statement holds. Let's assume that the statement holds for the first i-1 rows of processors; we will now try to prove it for the i-th row, too. Let's analyze all the possible cases. If, at the beginning of the previous iteration, state(i-1,*)=1, then, according to the algorithm, the entry tmp(i-1,i-1) was decremented to 0.
The entries tmp(i-1,j) (1 ≤ j ≤ i-2) were zero and they were further decremented by tmp2(i-1,j)·f(i-1,j). However, the statement is valid for row i-1 and any previous iteration t ≥ i-1 and, thus, also for the iteration t' when the state of row i-1 changed from 0 to 1. At iteration t', the values f(i-1,j) (1 ≤ j ≤ i-2) were set to tmp(i-1,j), which were equal to 0. Thus, we have f(i-1,j)=0 (1 ≤ j ≤ i-2) and, as a consequence, the tmp(i-1,j) (1 ≤ j ≤ i-2) values were, in fact, not decremented at all. So, at the end of the previous iteration, all the tmp(i-1,j) (1 ≤ j ≤ i-1) values were 0, after which they were transferred to the row i of processors. We also have to analyze the case where state(i-1,*)=0 at the beginning of the previous iteration. If, in the previous iteration, the row of processors i-1 changed its state from 0 to 1, then it set all the tmp(i-1,*) values to 0 and these were then received by the next row of processors; thus, the statement holds. The only case left is when, at the beginning of the previous iteration, state(i-1,*)=0 and it does not change to 1. In this case, the tmp(i-1,*) values are not changed – they are transferred to row i as they are. Since the statement holds for the rows up to i-1, we have tmp(i-1,j)=0 (1 ≤ j ≤ i-2). However, because the state of the previous row of processors did not change to 1, we must also have tmp(i-1,i-1)=0. This concludes the proof.

Using the statement proved above, it is easy to conclude that the final values f(i,j) are 0 (1 ≤ j ≤ i-1) and, thus, the obtained matrix is upper triangular. We only have one problem left. It is possible that, after 2·n-1 iterations, some rows of processors are still in state 0. If a row i of processors is still in state 0, then none of the remaining matrix rows have non-zero values in the i-th column (and, thus, the initial matrix is singular).
Considering that every remaining matrix row also has a zero entry in every column j where row j of processors is in state 1, this means that every remaining matrix row is full of zeroes. We have two choices. We can either report that the determinant is zero (in case it should have been non-zero), or we can set f(i,j)=0 for all the processors with state(i,j)=0 after the 2·n-1 iterations.

From the point of view of a SIMD implementation, at the beginning of each iteration, after performing the data transfers, we select all the processors and let them increment the counter cnt(i,j). Then, we select all the processors (i,i) with state(i,i)=1 and cnt(i,i) ≥ i and let them compute the value tmp2(i,i) and broadcast it to the other processors in their row. Afterwards, we select all the processors (i,j) with i ≠ j, state(i,j)=1 and cnt(i,j) ≥ i and let them receive the value broadcasted by the processor (i,i) on their row. After this, we select all the processors (i,j) with state(i,j)=1 and cnt(i,j) ≥ i and let them modify the value tmp(i,j) accordingly. In the second part of the iteration, we select all the processors (i,i) with state(i,i)=0, |tmp(i,i)|>0 and cnt(i,i) ≥ i and let them perform the broadcast of the changed state announcement (1) to their row of processors and the assignment state_changed(i,i)=true. Afterwards, we select all the processors (i,i) with state(i,i)=0, |tmp(i,i)|=0 and cnt(i,i) ≥ i and let them perform the broadcast of the changed state announcement (0). We then select all the processors (i,j) with i ≠ j, state(i,j)=0 and cnt(i,j) ≥ i and let them receive the changed state announcement broadcasted by the processor (i,i) on their row.
After this, we select all the processors (i,j) with state_changed(i,j)=true and cnt(i,j) ≥ i and we let them perform the assignments state(i,j)=1 and f(i,j)=tmp(i,j), the clearance to 0 of tmp(i,j) and the clearance to false of the register state_changed(i,j). A sample architecture on which the previously described algorithm can run is depicted in Fig. 1.

Fig. 1. A sample SIMD 2D array of processors with n=m=4.

3. Validation through Simulations

The described parallel algorithm was implemented using the Parallaxis parallel programming language [9]. A suite of tests was performed, meant to verify the correctness of the parallel implementation. For this purpose, a serial Gaussian elimination algorithm was also implemented and both algorithms were executed on 20 randomly generated n·(n+1) matrices, with n ranging from 1 to 50. Since the parallel and serial implementations are based on slightly different sequences of actions, the outputs of the two algorithms were not identical. However, by interpreting the two output matrices as the (augmented) matrices of two systems of linear equations and solving these systems, we obtained two sets of values of the n variables. After sorting the sets, both algorithms should have obtained the same sequence of values of the variables. During testing, any singular matrix that occurred was discarded and another matrix was generated instead. Furthermore, the determinants of the upper triangular matrices obtained by both algorithms were also computed (the determinant is equal to the product of the elements on the main diagonal of a lower or upper triangular matrix) and their absolute values were compared against each other (because of row reorderings, the sign of the determinant may change, but not its absolute value). Since the obtained matrices may have more than n columns, the determinant was computed considering only the first n columns. In order to accurately compute the determinant, a library for large real numbers was used [10] (because the determinant's value exceeded the standard double or long double types). For each of the 20 matrices, all the tests were passed: the same determinants (in absolute value) and the same values of the variables were obtained.

4. Extensions and Applications of the Gaussian Elimination Method

The function Compute, presented in Section 2, can be extended, for instance, by considering that all the values of the initial matrix are integers and that all the operations (addition, subtraction, multiplication, division) are performed modulo a (prime) number M. It is easy to perform addition, subtraction and multiplication modulo M (we first compute the result R normally, and then take its remainder, i.e. R mod M). The division operation requires the existence of a multiplicative inverse. The multiplicative inverse of a number x is the number x^-1 such that x·x^-1=1 (mod M). x^-1 can be computed by using the extended Euclidean algorithm [11]. A somewhat easier case occurs when M=2. In this case, addition and subtraction are equivalent to xor, multiplication is equivalent to and, and division is easy (since we only divide a number x by 1, the result is always x).

A matrix operation which is related to the Gaussian elimination operation is matrix multiplication. Multiplying two k-by-k (square) matrices can be easily performed in O(k^3) time or, if we use Strassen's algorithm [12], in O(k^2.807) time. Matrix multiplication has applications in many domains. We will consider next an application to Combinatorics. Let's consider the set of sequences whose elements belong to the set {1,…,k}, for which we are given a binary k-by-k "transition" matrix T(i,j).
If T(i,j)=1, then the element j can be located right after element i in the sequence (if T(i,j)=0, it cannot). We want to compute the number of (valid) sequences with n elements (mod M). The straightforward dynamic programming solution is the following. We compute S(l,j) = the number of (valid) sequences with l elements whose last element is j (1 ≤ j ≤ k). We have S(1,j)=1. For l>1, we have S(l,j) = the sum of the values T(i,j)·S(l-1,i) (1 ≤ i ≤ k). The answer is the sum of the values S(n,i) (1 ≤ i ≤ k). The time complexity is O(n·k^2). In order to improve the time complexity (to O(k^3·log(n)+k^2)), we will consider the k-element column vectors SC(l), where SC(l)(j)=S(l,j). We have SC(l)=T·SC(l-1). Thus, SC(n)=T^(n-1)·SC(1). By efficiently raising the transition matrix T to the (n-1)-th power (e.g. by using repeated squaring), we obtain an efficient method of computing the column vector SC(n) and the answer (the sum of the values SC(n)(i), with 1 ≤ i ≤ k). A second efficient method (O(k^4·log(n))) to compute the number of valid sequences of length n is to compute the values S(i,l,j) = the number of sequences of length 2^l for which the first element is i and the last element is j. We have S(i,0,i)=1 and S(i,0,j≠i)=0. For l>0, S(i,l,j) = the sum of the values S(i,l-1,p)·T(p,q)·S(q,l-1,j), with 1 ≤ p ≤ k and 1 ≤ q ≤ k. Afterwards, we write the number n as a sum of powers of 2, i.e. n=2^pow(0)+…+2^pow(B). We will compute the values U(i,l,j) = the number of sequences starting with i, ending with j, and whose length is 2^pow(0)+…+2^pow(l). Obviously, we have U(i,0,j)=S(i,pow(0),j). For l ≥ 1, we have U(i,l,j) = the sum of the values U(i,l-1,p)·T(p,q)·S(q,pow(l),j), with 1 ≤ p ≤ k and 1 ≤ q ≤ k. The answer is the sum of the values U(i,B,j) (1 ≤ i ≤ k and 1 ≤ j ≤ k). In both methods, all the arithmetic operations are performed modulo M.
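The first method can be sketched directly. The following Python version (mat_mult, mat_pow and count_sequences are hypothetical names) performs all operations modulo M; since SC(1) is the all-ones vector, the answer is simply the sum of every entry of T^(n-1), so the code skips the explicit vector product:

```python
def mat_mult(X, Y, M):
    """k-by-k matrix product, modulo M."""
    k = len(X)
    return [[sum(X[i][p] * Y[p][j] for p in range(k)) % M
             for j in range(k)] for i in range(k)]

def mat_pow(T, e, M):
    """T^e modulo M, by repeated squaring: O(k^3 * log(e))."""
    k = len(T)
    R = [[int(i == j) for j in range(k)] for i in range(k)]  # identity
    while e:
        if e & 1:
            R = mat_mult(R, T, M)
        T = mat_mult(T, T, M)
        e >>= 1
    return R

def count_sequences(T, n, M):
    """Number of valid length-n sequences (mod M): sum of SC(n) = T^(n-1)*SC(1),
    i.e. the sum of all entries of T^(n-1)."""
    k = len(T)
    Tn = mat_pow(T, n - 1, M)
    return sum(Tn[i][j] for i in range(k) for j in range(k)) % M
```

For n=1, T^0 is the identity matrix, so the count is k, matching S(1,j)=1 for every j.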
An interesting application of the Gaussian elimination method is the following. We consider N numbers A(i) (0 ≤ A(i) ≤ 2^B-1, 1 ≤ i ≤ N). We want to find a subset {i_1, …, i_k} of {1, …, N}, such that A(i_1) xor A(i_2) xor ... xor A(i_k) is maximum (k may be any number). Let's denote this maximum value by XM. We will compute XM bit by bit (from the bit B-1, the most significant one, down to the bit 0, the least significant one). Let's assume that we computed the values BV(B-1), BV(B-2), ..., BV(i+1) of the bits B-1, ..., i+1 of XM and we now want to compute BV(i). In order to achieve this, we will use the following transformation. We will obtain a system of equations in base 2, having B-i equations and N unknown variables. The N unknown variables, x(1), ..., x(N), can be either 0 or 1. If x(j)=1 (1 ≤ j ≤ N), then j belongs to the subset we want to compute; otherwise, j does not belong to this subset. The coefficients of this system, c(p,q) (i ≤ p ≤ B-1, 1 ≤ q ≤ N), are: c(p,q) = the p-th bit of A(q). This way, equation p has the following structure: c(p,1)·x(1) xor c(p,2)·x(2) xor ... xor c(p,N)·x(N) = c(p,N+1), where c(p,N+1)=BV(p) (i+1 ≤ p ≤ B-1) and c(i,N+1)=1. We will run the Gaussian elimination method on the extended matrix of the system of linear equations (which has B-i rows and N+1 columns), where all the operations are performed modulo 2 (addition and subtraction are equivalent to xor, multiplication is equivalent to and, and division is not necessary). However, we will never consider column N+1 as a candidate for swapping with another column. After performing the Gaussian elimination, the extended matrix of the system (c(*,*)) will have 1 on the main diagonal from row i up to a row q (i-1 ≤ q ≤ B-1) (rows are numbered from i to B-1) and 0 on the rows q+1, ..., B-1.
Under the main diagonal, the matrix will contain only 0 elements. The system of linear equations has a solution if we have c(j,N+1)=0 on all the rows j=q+1, ..., B-1. If the system has a solution, then BV(i)=1; otherwise, BV(i)=0. This way, we can compute XM bit by bit. The subset of indices {i_1, …, i_k} is computed from the solution of the last system of equations. The time complexity of this algorithm is O(B^3·N). We can reduce the time complexity to O(B^2·N), as follows. We notice that the system of equations corresponding to the bit i has the first B-i-1 rows identical to those of the system of equations corresponding to the previous bit (i+1). Thus, we will keep the matrix obtained as a result of the Gaussian elimination performed at the bit i+1, to which we add a new row, corresponding to the bit i (the j-th value of this row is equal to the i-th bit of A(o(j)), where o(j) is the index corresponding to column j, 1 ≤ j ≤ N; initially, o(j)=j, but we have to swap o(j) and o(k) whenever we swap the columns j and k between them; the (N+1)-st value of the row is 1). Let r(i+1) be the row corresponding to the row added when considering the bit i+1. If BV(i+1)=0, then we will set c(r(i+1),N+1)=not c(r(i+1),N+1) (i.e. we change its value into the opposite one). We will also store the first row q(i+1) where c(q(i+1),q(i+1))=0 (i.e. there are no more 1 elements on row q(i+1) or below it). We will reduce the newly added row by subtracting from it all the rows having a 1 on their main diagonal position. Then, if this row contains any 1 elements, we will swap it with the row q(i+1); afterwards, we will swap the column q(i+1) with a column C containing a 1 element on the reduced newly added row, and we will set q(i)=q(i+1)+1. If, after being reduced, the newly added row contains no 1 elements, we set q(i)=q(i+1). Note that this time we considered that the rows were numbered from 1 to B-i.
Checking if BV(i)=1 can be performed in O(B) time (by considering every row h from q(i)+1 to B-i and checking that c(h,N+1)=0). We will set r(i) to the index of the matrix row on which the newly added row is located. By using this improvement, we basically perform only one Gaussian elimination over the whole course of the algorithm. Thus, the time complexity becomes only O(B^2·N). Fig. 2 depicts the extended matrix on which the Gaussian elimination is performed.

Fig. 2. The matrix on which the Gaussian elimination is performed.

A problem which is very similar to the previous one, and yet has a totally different solution, is the following. We consider a sequence of N natural numbers A(i) (0 ≤ A(i) ≤ 2^B-1, 1 ≤ i ≤ N). We want to find a contiguous subsequence A(i), ..., A(j) (A(i), …, A(j) are located on consecutive positions in the sequence), such that A(i) xor A(i+1) xor ... xor A(j) is maximum. We will compute the prefix xors (similar to the well-known prefix sums): X(0)=0 and X(i)=X(i-1) xor A(i) (1 ≤ i ≤ N). Then, we will maintain a trie (a prefix tree), in which we will introduce, one step at a time, these prefix xors. Each prefix xor will be interpreted as a binary string with B elements. The first element of the string will be the most significant bit (the bit B-1), followed by the second most significant bit, and so on, until the least significant bit (the bit 0). Initially, we will introduce in the trie the string corresponding to X(0). We will traverse the sequence from 1 to N, and for each position i, we will compute the largest xor of a contiguous subsequence ending at position i. Let's consider BV(i,B-1), ..., BV(i,0) to be the bits of the largest xor of a contiguous subsequence ending at position i. We will traverse the binary string corresponding to X(i) (which consists of B bits) with an index j, starting from the bit B-1, down to the bit 0. We will denote by X(i,j) the j-th bit of X(i). At the same time, we will maintain a pointer pnod to a node of the trie, initialized to the trie's root. At every bit j, we verify if pnod has an edge labelled with (1-X(i,j)) towards one of its sons. If it does, then BV(i,j)=1 and we set pnod to the son corresponding to the edge labelled with (1-X(i,j)). If pnod does not have such an edge, then BV(i,j)=0 and we set pnod to the son corresponding to the edge labelled with X(i,j). After computing BV(i,*), we will insert the string corresponding to X(i) in the trie. The result will be the largest value among those corresponding to the strings BV(i,*) (interpreted as numbers having B bits). The time complexity of this algorithm is O(N·B).

The problem can be extended by considering the following constraints: the length of the computed subsequence must be at least L and at most U (1 ≤ L ≤ U ≤ N). In this case, we will have to remove some prefix xors from the trie. To be more exact, we will proceed like in the previous algorithm. When we reach the position i and we want to compute BV(i,*), we perform the following action: if i>U, then we remove from the trie the string corresponding to X(i-U-1). We will compute BV(i,*) only if i ≥ L; after computing BV(i,*) (i ≥ L), we will insert into the trie the string corresponding to X(i-L+1). In order to be able to remove strings from a trie, we will store for each trie node pnod the number of strings which contain this node (when we add a new string, we increment by 1 the counters of all the nodes encountered along the path from the root, and when we remove a string, we decrement by 1 the counters of the same nodes visited when the string was inserted). If, at some point, the counter of a node pnod (different from the trie's root) becomes 0, we can remove the edge (labelled with E) between pnod and its parent (this way, pnod's parent will not have pnod as a son or an edge labelled with E anymore). The time complexity in this case stays the same (i.e. O(N·B)). We notice that, although this problem and the previous one were apparently very similar, the algorithmic techniques required for obtaining a solution are quite different (we didn't use any Gaussian elimination for this problem).

Another application of the Gaussian elimination method is the following. We consider an undirected graph with N vertices. There is a light bulb in every vertex. The bulb in vertex i is initially in a state SI(i) (SI(i)=1 means "on", and SI(i)=0 means "off"). Every vertex i also has a cost C(i) ≥ 0 which needs to be paid if we want to modify its state (from on to off, or from off to on). We want to bring every bulb i into a final state SF(i) (which may be identical to SI(i)), by performing a sequence of the following type of actions: we touch the bulb i (and we pay the cost C(i)); as a consequence of this action, the state of the bulb i and of all the neighbouring bulbs are changed (but we do not have to pay anything for the neighbouring bulbs). We want to find a strategy which brings the bulbs to their final states, such that the total cost is minimum. The general case can be solved as follows. We notice that we never have to touch the same bulb twice. We will associate a variable x(i) to every bulb i, which will be either 0 or 1, representing the number of times the bulb i was touched. We will construct a system of linear equations in base 2. We will have an equation for each bulb i (1 ≤ i ≤ N): c(i,1)·x(1) xor c(i,2)·x(2) xor ... xor c(i,N)·x(N) = c(i,N+1).
The coefficients c(i,j) will be 1 for j=i and for those bulbs j for which the edge (i,j) exists in the graph; for the other bulbs j, the coefficients c(i,j) will be 0; c(i,N+1)=SI(i) xor SF(i). We will use the Gaussian elimination technique on this system of equations, where all the operations will be performed modulo 2. Afterwards, the extended matrix c(*,*) of the system will have 1 elements on the main diagonal on the first PR rows (PR is obtained after running the elimination method), and the last N-PR rows will have c(i,j)=0 (PR+1 ≤ i ≤ N, 1 ≤ j ≤ N). If we have c(i,N+1)=1 (for some PR+1 ≤ i ≤ N), then the problem has no solution. Otherwise, we have N-PR free variables – the ones corresponding to the last N-PR columns. We must pay attention that, during the algorithm, the columns may be swapped among each other, such that the last N-PR columns do not necessarily correspond to the initial variables x(PR+1), ..., x(N). For the N-PR free variables we will have to try every possible combination of assigning values to them (there are 2^(N-PR) possibilities overall). For every combination we will compute the values of the other PR variables (the bound variables). Once we compute the values of all the variables x(1), ..., x(N), the cost of the strategy is CS=C(1)·x(1)+...+C(N)·x(N). We will choose the strategy with minimum cost among all the 2^(N-PR) possibilities.

This method can lead to a significant improvement upon a naïve algorithm which tries each of the 2^N possible combinations of assigning values to the N variables. However, the degree of improvement depends on the structure of the graph. For instance, let's assume that the graph consists of P·Q vertices, arranged on P rows and Q columns. Each vertex (i,j) (on row i, column j) is adjacent to the vertices to the north, south, east and west (if these neighbours exist).
In this case, we can consider the 2^Q (or 2^P) possibilities of assigning values to the variables of the vertices on the first row (first column). Let's consider the first row case (the first column case is handled similarly). After assigning values to the variables of the vertices on the first row, we will traverse the other vertices in increasing row order (with i from row 2 to P) and, for equal rows, in increasing column order (with j from 1 to Q). The value of the variable x(i,j) is uniquely determined by the values of the variables computed earlier. If the vertex (i-1,j) is in its final state (considering the values x(*,*) of itself and of all of its neighbours except for (i,j)), then x(i,j)=0; otherwise, x(i,j)=1. If the vertices on the last row are not in their final states (considering the values x(*,*) of themselves and of all of their neighbours), then the assignment of values to the variables of the vertices on the first row did not lead to a solution. Otherwise, we obtained a solution and we compute its cost (like before). Fig. 3 shows the structure of the P×Q graph.

Fig. 3. The structure of the P×Q graph in the first case.

Let's consider now the same graph, where every vertex is adjacent to its (at most) 4 diagonal neighbours. We will color every vertex (i,j) in white, if (i+j) is even, or in black, if (i+j) is odd. This way, we obtain (at most) 2 connected components, such that all the vertices of the same component have the same color. For every component, we will split the vertices (i,j) into groups, according to the values min{i,j} (all the vertices (i,j) with the same value of min{i,j} will belong to the same group). We will consider every possibility of assigning values to the variables of the vertices (i,j) for which min{i,j} is minimum among all the vertices of the component. For each possibility, we will consider all the other groups in the component, in increasing order of min{i,j}.
For each such group, we will traverse the vertices (i,j) for which j ≤ i from the lower-indexed row to the higher-indexed row, along the column j. Then we will traverse the vertices (i,j) for which i ≤ j from the lower-indexed column to the higher-indexed column, along the row i. Every time we reach a vertex (i,j), this vertex will have a neighbour (i',j') for which min{i',j'}=min{i,j}-1 and for which the vertex (i,j) is the last vertex whose variable has not been assigned a value yet. Thus, the variable x(i,j) will be uniquely determined, based on the current state of (i',j') and its desired final state (0, if the two states are the same, and 1 otherwise). Like in the previous case, some initial assignments may not lead to a solution (if some vertices do not end up in their desired final states).

One last case we consider here is that in which every vertex (i,j) is adjacent to all of its (at most) 8 neighbours (on the same row, column, or diagonal). In this case, we will assign values to the variables of the vertices on the first row and column. For each assignment, we will traverse the vertices (i,j) like in the previous case, according to min{i,j}. Basically, we traverse the vertices on column 2 (from lower-indexed rows to higher-indexed rows), followed by the vertices on row 2 (from lower-indexed columns to higher-indexed columns), and so on. Like in the previous case, when we reach a vertex (i,j) during the traversal, this vertex will have at least one neighbour (i',j') (with its variable assigned to some value) for which (i,j) is the only neighbour whose value has not been assigned yet. Thus, x(i,j) will be uniquely determined. In this case, we have 2^(P+Q-1) possibilities for assigning values to the variables of the vertices (i,j) on the first row and column (those with min{i,j}=1).
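The first of the three grid cases (north/south/east/west adjacency, with the first row enumerated) can be sketched as follows; indices are 0-based here and the function name and signature are hypothetical:

```python
from itertools import product

def grid_min_cost(P, Q, SI, SF, C):
    """Minimum cost on the P x Q grid with N/S/E/W adjacency (0-based)."""
    def touches(x, i, j):
        # touches affecting bulb (i, j): itself plus its existing neighbours
        s = x[i][j]
        if i > 0: s += x[i - 1][j]
        if i + 1 < P: s += x[i + 1][j]
        if j > 0: s += x[i][j - 1]
        if j + 1 < Q: s += x[i][j + 1]
        return s
    best = None
    for first in product((0, 1), repeat=Q):   # 2^Q first-row assignments
        x = [list(first)] + [[0] * Q for _ in range(P - 1)]
        for i in range(1, P):
            for j in range(Q):
                # x[i][j] is forced: bulb (i-1, j) must reach its final state;
                # x[i][j] is still 0 here, so touches() correctly omits it
                x[i][j] = (SI[i - 1][j] + touches(x, i - 1, j)
                           + SF[i - 1][j]) % 2
        # the propagation satisfies rows 0..P-2; only the last row can fail
        if all((SI[P - 1][j] + touches(x, P - 1, j)) % 2 == SF[P - 1][j]
               for j in range(Q)):
            cost = sum(C[i][j] * x[i][j] for i in range(P) for j in range(Q))
            best = cost if best is None else min(best, cost)
    return best
```

On a 2×2 grid with all bulbs initially off and all costs 1, every vertex must be touched (each bulb is then flipped three times), giving cost 4.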
A problem related to the previous ones is the following. We consider a matrix with M rows and N columns. Each cell (i,j) of the matrix contains a bulb. The bulb (i,j) is in the initial state SI(i,j) (1=on or 0=off) and must be brought into its final state SF(i,j). The only operations we can perform are: changing the state of all the bulbs on a row i, which costs CL(i) ≥ 0 (1 ≤ i ≤ M), and changing the state of all the bulbs on a column j, which costs CC(j) ≥ 0 (1 ≤ j ≤ N). We want to find a sequence of operations with minimum total cost such that, in the end, every bulb is in its final state.

Like before, we notice that we never need to perform an operation on a given row or column more than once. We will denote by xL(i) (xC(j)) the variables which describe whether we perform (1) or not (0) an operation on the row i (column j). We will consider two cases. In the first case, we consider that we do not perform an operation on row 1: xL(1)=0. We will traverse all the bulbs (1,j) on row 1 and, if SI(1,j)=SF(1,j), then we must not perform an operation on column j (xC(j)=0); otherwise, we must perform an operation on column j (xC(j)=1). Then, we traverse all the bulbs (i,1) on column 1 (i ≥ 2): if SI(i,1) xor xC(1)=SF(i,1), then xL(i)=0; otherwise, xL(i)=1. Then we verify, for every bulb (i,j), if SI(i,j) xor xL(i) xor xC(j)=SF(i,j). If the condition is true for every bulb, then we obtained a solution and we compute its cost: CS=CL(1)·xL(1)+...+CL(M)·xL(M)+CC(1)·xC(1)+...+CC(N)·xC(N). In the second case, we will consider xL(1)=1. In a similar manner we will compute the values xC(j): if SI(1,j) xor xL(1)=SF(1,j), then xC(j)=0; otherwise, xC(j)=1. Then we compute the values xL(i) (i ≥ 2): if SI(i,1) xor xC(1)=SF(i,1), then xL(i)=0; otherwise, xL(i)=1. As before, we verify if we obtained a solution.
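The two-case procedure above can be sketched as follows; indices are 0-based here and the function name and argument layout are assumptions:

```python
def rowcol_min_cost(M, N, SI, SF, CL, CC):
    """Minimum cost using row toggles (costs CL) and column toggles (CC)."""
    best = None
    for xL1 in (0, 1):                       # the two cases: xL[0] = 0 or 1
        xL = [xL1] + [0] * (M - 1)
        # the first row forces every column variable
        xC = [SI[0][j] ^ xL1 ^ SF[0][j] for j in range(N)]
        # the first column forces the remaining row variables
        for i in range(1, M):
            xL[i] = SI[i][0] ^ xC[0] ^ SF[i][0]
        # verify every bulb of the matrix
        if all(SI[i][j] ^ xL[i] ^ xC[j] == SF[i][j]
               for i in range(M) for j in range(N)):
            cost = (sum(CL[i] * xL[i] for i in range(M))
                    + sum(CC[j] * xC[j] for j in range(N)))
            best = cost if best is None else min(best, cost)
    return best
```

The returned value is the minimum over the (at most) two feasible cases, or None when neither case leads to a solution.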
The minimum cost is given by the minimum of the (at most) 2 possible solutions.

A strongly related problem is the following. We have the same input data and objective, but a different set of operations that can be performed. An operation consists of choosing a bulb (i,j) and changing the state of all the bulbs on the row i and the column j (except for the bulb (i,j)). The cost of such an operation is 1; thus, we want to minimize the total number of operations required for bringing every bulb into its final state. An operation in this problem is equivalent to changing the state of a row, followed by that of a column (in the terms of the previous problem). At first, we will consider that we can perform the same operations as in the previous problem. Thus, using that solution, we obtain (at most) two distinct solutions. The solution i (i=1,2) consists of changing the states of the bulbs on the rows L(i,1), ..., L(i,nL(i)) and on the columns C(i,1), ..., C(i,nC(i)). We will try to transform this solution into one which is compatible with the operations allowed in this problem. We will construct pairs of the form (row, column): (L(i,1), C(i,1)), ..., (L(i,K), C(i,K)), where K=min{nL(i), nC(i)}. A pair (L(i,j), C(i,j)) means that the bulb (L(i,j), C(i,j)) is chosen for an operation. If nL(i)>K, then the extra number of rows (nL(i)-K) must be even. We will form the pairs (L(i,K+1), 1), (L(i,K+2), 1), ..., (L(i,nL(i)), 1) (i.e. we choose every bulb (L(i,K+1), 1), ..., (L(i,nL(i)), 1) in an operation). If nC(i)>K, then (nC(i)-K) must be even and we will choose the bulbs (1, C(i,K+1)), ..., (1, C(i,nC(i))) in an operation. We will choose that solution i (i=1,2) for which we obtain the minimum number of operations.
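The pairing construction can be sketched as follows; rows and columns keep the text's 1-based numbering, and the function name is hypothetical:

```python
def pair_operations(L, Cc):
    """Pair toggled rows L and columns Cc into (row, column) bulb operations.
    Returns None when the leftover rows/columns cannot be paired (odd count)."""
    K = min(len(L), len(Cc))
    ops = [(L[t], Cc[t]) for t in range(K)]
    if (len(L) - K) % 2 != 0 or (len(Cc) - K) % 2 != 0:
        return None
    # pair each extra row with column 1: column 1 is toggled an even number
    # of extra times, so those toggles cancel out; symmetrically for columns
    ops += [(r, 1) for r in L[K:]]
    ops += [(1, c) for c in Cc[K:]]
    return ops
```

For example, toggled rows {1, 3, 5} and toggled column {2} become the operations (1,2), (3,1), (5,1): the two extra toggles of column 1 cancel each other.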
The problem of turning light bulbs on and off on a graph (where, when touching a bulb, the states of the bulb and of all of its neighbours change) can be solved without using the Gaussian elimination technique on graphs with special structures, particularly graphs with a tree-like structure, like trees, trees of cycles, or graphs with bounded treewidth (when a suitable tree decomposition is also given). In order to make an informed choice, we first need to recognize the class to which the input graph belongs. Thus, graph recognition algorithms are important in this situation. We will present next a simple graph recognition algorithm for the class of trees of cycles. A tree of cycles is an undirected graph with one of the following attributes:
• it is a cycle of any length K (K ≥ 3)
• it is a graph obtained by attaching a cycle C of some length K (K ≥ 3) to an edge of a tree of cycles CT
Attaching a cycle to an edge e of a graph means replacing an edge of the cycle with the edge e of the graph and replacing the two vertices of the cycle connected by the replaced edge by the two vertices of the graph connected by e. Fig. 4 depicts the operation of attaching a cycle to a tree of cycles. In order to decide whether a given undirected graph is a tree of cycles, we will repeatedly perform the following operation: if there exists a vertex i with degree 2, and its neighbours are j and k, then we remove the vertex i from the graph (together with its two adjacent edges) and we introduce instead the edge j-k, if this edge does not already exist in the graph. If the given graph is a tree of cycles then, in the end, we will only have two vertices left (initially, the number of vertices of the graph must be at least 3).
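The contraction test described above can be sketched as follows, using a queue of degree-2 vertices; the representation of the graph by adjacency sets is our own simplification of the balanced-tree structure the paper suggests:

```python
from collections import deque

def is_tree_of_cycles(n, edges):
    """Contraction test: repeatedly replace a degree-2 vertex by an edge
    between its two neighbours; a tree of cycles reduces to 2 vertices."""
    if n < 3:
        return False
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    q = deque(v for v in range(n) if len(adj[v]) == 2)
    alive = n
    while q:
        i = q.popleft()
        if i not in adj or len(adj[i]) != 2:
            continue                      # already removed or degree changed
        j, k = adj.pop(i)                 # the two neighbours of i
        alive -= 1
        adj[j].discard(i)
        adj[k].discard(i)
        if k not in adj[j]:               # introduce the edge j-k if missing
            adj[j].add(k)
            adj[k].add(j)
        for v in (j, k):                  # neighbours may reach degree 2 now
            if len(adj[v]) == 2:
                q.append(v)
    return alive == 2
```

A plain cycle reduces to two vertices, while a star (no degree-2 vertex) does not; two triangles sharing an edge, a small tree of cycles, also reduce to two vertices.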
In order to perform the described operation, we will use a queue in which we will insert all the vertices which initially have degree 2. Then, we will repeatedly extract vertices from the queue (as long as the queue is not empty). Every time we extract a vertex i from the queue, we perform the operation for i. Thus, there exists the possibility of modifying the degrees of its two neighbours j and k. If, after performing the operation for the vertex i, the vertex j (k) ends up with degree 2 (before the operation it had a degree greater than 2), we insert j (k) at the end of the queue. We now only need an efficient data structure for quickly checking whether two vertices are adjacent in the graph, for finding the two neighbours of a vertex i, and for removing edges from the graph. We can use, for instance, balanced trees, in which we store 2 pairs (i,j) and (j,i) for each edge i-j which (currently) exists in the graph. This way, we can easily check if a given edge exists in the tree, or easily remove a given edge from the tree (by removing its two corresponding pairs). The two neighbours j and k of a vertex i are found by searching for the two immediate successors of the pair (i,-∞). The time complexity is, thus, O(n+m+m·log(m)), where n is the (initial) number of vertices of the graph and m is the (initial) number of edges.

Fig. 4. Attaching a cycle to an edge of a tree of cycles.

5. Conclusions and Future Work

This paper presents a novel approach for implementing the Gaussian elimination technique on a 2D SIMD array of processors. The algorithm consists of 2·n-1 iterations and allows row reorderings, by allowing the matrix rows to slide past each other. Both the (theoretical) formal proofs of correctness and the (practical) results of the simulation tests validated the concept of the algorithm.
Moreover, the paper also introduced several extensions and applications of the Gaussian elimination technique and compared these applications to similar problems in which the Gaussian elimination method was not required for computing the (optimal) solution. As future work, the algorithm should be implemented on a real architecture with n·m processors. Since real-life matrices have sizes which are larger than the number of processors within mainstream parallel computers, a modification of the algorithm in which every processor is responsible for multiple entries of the matrix should be considered. A simple method to achieve this could be to consider that there are n·m virtual processors and each real processor simulates several «geographically clustered» virtual processors.

REFERENCES

[1] S. F. McGinn, R. E. Shaw, "Parallel Gaussian Elimination using OpenMP and MPI", in Proceedings of the 16th International Symposium on High Performance Computing Systems and Applications, 2002, pp. 169-173.
[2] P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, X. S. Li, "Analysis and Comparison of Two General Sparse Solvers for Distributed Memory Computers", in ACM Transactions on Mathematical Software, vol. 27, no. 4, 2001, pp. 388-421.
[3] J. W. Demmel, J. R. Gilbert, X. S. Li, "An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination", in SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 4, 1997, pp. 915-952.
[4] R. Saad, "An Optimal Schedule for Gaussian Elimination on an MIMD Architecture", in Journal of Computational and Applied Mathematics, vol. 185, no. 1, 2006, pp. 91-106.
[5] A. Tiskin, "Communication-Efficient Parallel Gaussian Elimination", in Lecture Notes in Computer Science, vol. 2763, 2003, pp. 369-383.
[6] B. Hochet, P. Quinton, Y. Robert, "Systolic Gaussian Elimination Over GF(p) with Partial Pivoting", in IEEE Transactions on Computers, vol. 38, no. 9, 1989, pp. 1321-1324.
[7] V. Cristea, Algoritmi de prelucrare paralelă, Ed. Matrixrom, Bucureşti, 2002 (in Romanian).
[8] M. Leoncini, "On the Parallel Complexity of Gaussian Elimination with Pivoting", in Journal of Computer and System Sciences, vol. 53, no. 3, 1996, pp. 380-394.
[9] T. Braunl, "Parallaxis-III: a structured data-parallel programming language", in Proceedings of the 16th International Conference on Algorithms and Architectures for Parallel Processing, 1995, pp. 43-52.
[10] D. H. Bailey, H. Yozo, X. S. Li, B. Thompson, "ARPREC: An Arbitrary Precision Computation Package", Technical Report LBNL-53651, Lawrence Berkeley National Laboratory, 2002.
[11] D. R. Stinson, Cryptography: Theory and Practice, CRC Press, 2005.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, 2nd edition, MIT Press and McGraw-Hill, 2001.
[13] V. V. Jinescu, I. Popescu, "Consideration about High-Pressure Reactors with Cylindrical Symmetry", in U.P.B. Scientific Bulletin, Series D, vol. 71, iss. 1, 2009, pp. 45-55.
