A TRANSFORMATION-BASED APPROACH FOR THE DESIGN OF PARALLEL/DISTRIBUTED SCIENTIFIC SOFTWARE: THE FFT*

HARRY B. HUNT, LENORE R. MULLIN, DANIEL J. ROSENKRANTZ†, AND JAMES E. RAYNOLDS‡

Abstract. We describe a methodology for designing efficient parallel and distributed scientific software. This methodology utilizes sequences of mechanizable algebra-based optimizing transformations. In this study, we apply our methodology to the FFT, starting from a high-level algebraic algorithm description. Abstract multiprocessor plans are developed and refined to specify which computations are to be done by each processor. Templates are then created that specify the locations of computations and data on the processors, as well as data flow among processors. Templates are developed in both the MPI and OpenMP programming styles. Preliminary experiments comparing code constructed using our methodology with code from several standard scientific libraries show that our code is often competitive and sometimes performs better. Interestingly, our code handled a larger range of problem sizes on one target architecture.

Keywords: FFT, design methodology, optimizing transformations, message passing, shared memory.

1. Introduction. We present a systematic, algebraically based design methodology for efficient implementation of computer programs optimized over multiple levels of the processor/memory and network hierarchy. Using a common formalism to describe the problem and the partitioning of data over processors and memory levels allows one to mathematically prove the efficiency and correctness of a given algorithm as measured in terms of a set of metrics (such as processor/network speeds, etc.). The approach allows the average programmer to achieve high-level optimizations similar to those used by compiler writers (e.g., the notion of tiling).
* RESEARCH SUPPORTED BY NSF GRANT CCR-0105536.
† Department of Computer Science, University at Albany, State University of New York, Albany, NY 12203.
‡ College of Nanoscale Science and Engineering, University at Albany, State University of New York, Albany, NY 12203.

The approach is similar in spirit to other efforts using libraries of algorithm building blocks based on C++ template classes. In POOMA, for example, expression templates using the Portable Expression Template Engine (PETE) (http://www.acl.lanl.gov.pete) were used to achieve efficient distribution of array indexing over scalar operations [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. As another example, the Matrix Template Library (MTL) [11, 12] is a system that handles dense and sparse matrices, and uses template meta-programs to generate tiled algorithms for dense matrices. For example:

(A + B)_ij = A_ij + B_ij,   (1.1)

can be generalized to the situation in which the multi-dimensional arrays A and B are selected using a vector of indices v. In POOMA, A and B were represented as classes, and expression templates were used as re-write rules to efficiently carry out the translation to scalar operations implied by:

(A + B)_v = A_v + B_v   (1.2)

The approach presented in this paper makes use of A Mathematics of Arrays (MoA) and an indexing calculus (i.e., the ψ calculus) to enable the programmer to develop algorithms using high-level compiler-like optimizations through the ability to algebraically compose and reduce sequences of array operations. As such, the translation from the left-hand side of Eq. 1.2 to the right side is just one of a wide variety of operations that can be carried out using this algebra. In the MoA formalism, the array expression in Eq.
1.2 would be written:

v ψ (A + B) = v ψ A + v ψ B   (1.3)

where we have introduced the psi operator ψ to denote the operation of extracting an element from the multi-dimensional array using the index vector v.

In this paper we give a demonstration of the approach as applied to the creation of efficient implementations of the Fast Fourier Transform (FFT), optimized over multi-processor, multi-memory/network environments. Multi-dimensional data arrays are reshaped through the process of dimension lifting to explicitly add dimensions to enable indexing blocks of data over the various levels of the hierarchy. A sequence of array operations (represented by the various operators of the algebra acting on arrays) is algebraically composed to achieve the Denotational Normal Form (DNF). Then the ψ-calculus is used to reduce the DNF to the ONF (Operational Normal Form), which explicitly expresses the algorithm in terms of loops and operations on indices. The ONF can thus be directly translated into efficient code in the language of the programmer's choice, be it for a hardware or software application.

The application we use as a demonstration vehicle, the Fast Fourier Transform, is of significant interest in its own right. The FFT is one of the most important computational algorithms and its use is pervasive in science and engineering. The work in this paper is a companion to two previous papers [13] in which the FFT was optimized in terms of in-cache operations, leading to factors of two to four speedup in comparison with our previous records. Further background material, including comparisons with library routines, can be found in Refs. [14, 15, 16, 17] and [18]. Our algorithm can be seen as a generalization of similar work aimed at out-of-core optimizations [19].
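Eq. 1.3 can be made concrete with a few lines of Python. This is our illustration only, not the MoA implementation; the helper names psi and add are ours, and arrays are modeled as nested lists.

```python
def psi(v, a):
    """The psi operator: apply index vector v to the multi-dimensional
    array a (nested lists), selecting the addressed element/subarray."""
    for i in v:
        a = a[i]
    return a

def add(a, b):
    """Element-wise addition of two conforming nested-list arrays."""
    if isinstance(a, list):
        return [add(x, y) for x, y in zip(a, b)]
    return a + b

A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
v = (1, 0)
# v psi (A + B) equals (v psi A) + (v psi B), as in Eq. 1.3
assert psi(v, add(A, B)) == psi(v, A) + psi(v, B)  # both sides are 33
```

The point of the identity is that the indexing can be pushed inside the composed array expression, so the scalar operations are performed only on the selected elements.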
Similarly, block decompositions of matrices (in general) are special cases of our reshape-transpose design. Most importantly, our designs are general for any partition size, i.e., not necessarily blocked in squares, and any number of dimensions. Furthermore, our designs use linear transformations from an algebraic specification, and thus they are verified. Thus, by specifying designs (such as Cormen's and others [19]) using these techniques, these designs too could be verified.

The purpose of this paper is not to attempt any serious analysis of the number of cache misses incurred by the algorithm in the spirit of Hong and Kung and others [20, 21, 22]. Rather, we present an algebraic method that achieves (or is competitive with) such optimizations mechanically. Through linear transformations we produce a normal form, the ONF, that is directly implementable in any hardware or software language and is realized in any of the processor/memory levels [23]. Most importantly, our designs are completely general in that, through dimension lifting, we can produce any number of levels in the processor/memory hierarchy.

One objection to our approach is that one might incur an unacceptable performance cost due to the periodic rearrangement of the data that is often needed. This will not, however, be the case if we pre-fetch data before it is needed. The necessity to pre-fetch data also exists in other similar cache-optimized schemes. Our algorithm does what the compiler community calls tiling. Since we have analyzed the loop structures, access patterns, and speeds of the processor/memory levels, pre-fetching becomes a deterministic cost function that can easily be combined with reshape-transpose or tiling operations.

Again, we make no attempt to optimize the algorithm for any particular architecture.
We provide a general algorithm in the form of an Operational Normal Form that allows the user to specify the blocking size at run time. This ONF therefore enables the individual user to choose the blocking size that gives the best performance for any individual machine, assuming this intentional information can be processed by a compiler¹.

It is also important to note the importance of running reproducible and deterministic experiments. Such experiments are only possible when dedicated resources exist and no interrupts or randomness affect memory/cache/communications behavior. This means that multiprocessing and time sharing must be turned off for both operating systems and networks.

Conformal Computing² is the name given by the authors to this algebraic approach to the construction of computer programs for array-based computations in science and engineering. The reader should not be misled by the name. Conformal in this sense is not related to Conformal Mapping or similar constructs from mathematics, although it was inspired by these concepts, in which certain properties are preserved under transformations. In particular, by Conformal Computing we mean a mathematical system that conforms as closely as possible to the underlying structure of the hardware. Further details of the theory, including discussion of MoA and the ψ-calculus, are provided in the appendix.

In this feasibility study, we proceed in a semi-mechanical fashion. At each step of algorithm development, the current version of the algorithm is represented in a code-like form, and a simple basic transformation is applied to this code-like form. Each transformation is easily seen to be mechanizable. We used the algebraic formalism described in Section 1.2 to verify that each transformation produces a semantically equivalent algorithm.
At each step, we use judgment in deciding which transformation to carry out. This judgment is based on an understanding of the goals of the transformations. The following is a more detailed description of our methodology, as carried out in this feasibility study:

1.1. Overview of Methodology.

Phase 1. Obtain a high-level description of the problem. In this study, a MATLAB-like description is used.

Phase 2. Apply a sequence of transformations to optimize the sequential version of the algorithm. The transformations used are of an algebraic nature, can be interpreted in terms of operations on whole arrays, and should not negatively affect subsequent parallelization.

Phase 3. Apply a sequence of transformations to develop a parallel computation plan. Such plans consist of sequential code that indicates which parts of the overall work are to be done by each individual processor. For each iteration of an outer loop, the plan specifies which iterations of inner loops should be performed by each of the processors in a multiprocessor system.

Phase 4. Given a parallel computation plan, apply a sequence of transformations to produce a parallel computation template. Such templates specify a parallel or distributed version of the algorithm by indicating (1) which parts of the overall work are to be done by each individual processor, (2) where data is located, and (3) data movement among processors. Various parallel programming language styles can be used to express such templates. In this study, we use both a message passing per-processor style, motivated by MPI [24], and an all-processor style, motivated by OpenMP [25].

¹ Processing intentional information will be the topic of a future paper.
² The name Conformal Computing© is protected. Copyright 2003, The Research Foundation of State University of New York, University at Albany.
There is also an implicit fifth phase, in which a parallel computation template is transformed into code expressed in a given programming language.

In the future, we envision that scientific programs will be developed in interactive development environments. Such environments will combine human judgment with compiler-like analysis, transformation, and optimization techniques. A programmer will use knowledge of problem semantics to guide the selection of the transformation at each step, but the application of the transformation and verification of its safety will be done mechanically. A given sequence of transformations will stop at a point where a version of the code has been obtained such that subsequent optimizations can be left to an available optimizing compiler. This feasibility study is a first step towards the development of such an interactive program development environment.

1.2. Program Development Via Use of Array and Indexing Algebra. Although not discussed further in this paper, Mullin's Psi Calculus [26, 27, 28, 29] plays an underlying role in this feasibility study. This calculus provides a unified mathematical framework for representing and studying arrays, array operations, and array indexing. It is especially useful for developing algorithms centered around transformations involving array addressing, decomposition, reshaping, distribution, etc. Each of the algorithm transformations carried out here can be expressed in terms of the Psi Calculus, and we used the Psi Calculus to verify the correctness of program transformations in the development of our FFT algorithms.

1.3. Related Results. A general algebraic framework for Fourier and related transforms, including their discrete versions, is discussed in [30].
As discussed in [31, 32], using this framework, many algorithms for the FFT can be viewed in terms of computing tensor product decompositions of the matrix B_L, discussed in Section 2.1 below. Subsequently, a number of additional algorithms for the FFT and related problems have been developed centered around the use of tensor product decompositions [33, 34, 35, 36, 37, 38]. The work done under the acronym FFTW is based on a compiler that generates efficient sequential FFT code that is adapted to a target architecture and specified problem size [39, 40, 41, 42, 43]. A variety of techniques have been used to construct efficient parallel algorithms for the FFT [44, 45, 46, 47, 48, 49].

Previously, we manually developed several distributed algorithms for the FFT [50], experimenting with variants of bit-reversal, weight computations, butterfly computations, and message generation mechanisms. In [50], we report on the evaluation of twelve resulting variant programs for the one-dimensional FFT, evaluated by running a consistent set of experiments. In [14, 15], we compare the performance of our programs with that of several other FFT implementations (FFTW, NAG, ESSL, IMSL).

As in [51], we begin by developing sequential code for the radix 2 FFT, starting from a high-level specification, via a sequence of optimizing algebra-based transformations. This sequential code provides a common starting point for the development of sequential code for the general radix FFT in [51], and for the development of parallel/distributed algorithms presented here. For the convenience of the reader and to provide context for the transformations used here, we repeat this common development in Section 3. Similarly, for the convenience of the reader, we also recall relevant experimental results from [51] in Section 7.
These experiments provide evidence that the design methodology outlined here can already produce competitive code.

2. High-Level Description of Algorithm.

2.1. Basic Algorithm. Our approach to algorithm development begins by expressing the algorithm in a suitable high-level mechanism. This specification may take the form of a high-level statement of the algorithm, possibly annotated with additional specifications, such as the specification of the array of weights for the FFT. Our design and development of efficient implementations of an algorithm for the FFT began with Van Loan's [52] suitably commented high-level MATLAB program for the radix 2 FFT, shown in Figure 2.1.

Input: x in C^n and n = 2^t, where t ≥ 1 is an integer.
Output: The FFT of x.

x ← P_n x                  (1)
for q = 1 to t             (2)
begin                      (3)
  L ← 2^q                  (4)
  r ← n/L                  (5)
  x_{L×r} ← B_L x_{L×r}    (6)
end                        (7)

Here, P_n is an n × n permutation matrix. Letting L* = L/2, and ω_L be the L'th root of unity, matrix

B_L = [ I_{L*}    Ω_{L*} ]
      [ I_{L*}   −Ω_{L*} ],

where Ω_{L*} is a diagonal matrix with values 1, ω_L, ..., ω_L^{L*−1} along the diagonal.

Fig. 2.1. High-level algorithm for the radix 2 FFT

In Line 1, P_n is a permutation matrix that performs the bit-reversal permutation on the n elements of vector x. The reference to x_{L×r} in Line 6 can be regarded as reshaping the n-element array x to be an L × r matrix consisting of r columns, where each column can be viewed as a vector of L elements. This reshaping of x is column-wise, so that each time Line 6 is executed, each pair of adjacent columns of the preceding matrix is concatenated to produce a single column of the new matrix. Line 6 can be viewed as treating each column of this matrix as a pair of vectors, each with L/2 elements, and doing a butterfly computation that combines the two L/2-element vectors in each column to produce a vector with L elements.
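For concreteness, the algorithm of Figure 2.1 can be transliterated into a short runnable sketch. This is a Python illustration of ours rather than the paper's MATLAB program; the helper names are ours, and we adopt the weight convention EXP((2*pi*i*row)/L) used later in Figure 3.8, so the sketch computes the transform with a positive exponent.

```python
import cmath

def bit_reverse(x):
    """Line 1 of Figure 2.1: the bit-reversal permutation P_n (n = 2**t)."""
    n = len(x)
    t = n.bit_length() - 1
    out = [0j] * n
    for k in range(n):
        rev = int(format(k, '0{}b'.format(t))[::-1], 2)
        out[rev] = x[k]
    return out

def fft_radix2(x):
    """Radix-2 FFT following Figure 2.1: after bit reversal, each of the
    t passes combines the two L/2-element halves of every column
    (the butterfly corresponding to multiplication by B_L)."""
    n = len(x)
    t = n.bit_length() - 1
    x = bit_reverse(list(x))
    for q in range(1, t + 1):
        L = 2 ** q
        # diagonal of Omega_{L*}: successive powers of the L-th root of unity
        weight = [cmath.exp(2j * cmath.pi * row / L) for row in range(L // 2)]
        for col in range(0, n, L):          # one column of the L x r reshape
            for row in range(L // 2):
                c = weight[row] * x[col + row + L // 2]
                d = x[col + row]
                x[col + row] = d + c
                x[col + row + L // 2] = d - c
    return x

# Cross-check against a direct O(n^2) evaluation of the same transform
n = 8
data = [complex(k + 1, -k) for k in range(n)]
direct = [sum(data[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
          for k in range(n)]
fast = fft_radix2(data)
assert all(abs(a - b) < 1e-9 for a, b in zip(fast, direct))
```

The cross-check against a direct evaluation of the same sum is only a sanity test of the loop structure, not a performance claim.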
The butterfly computation, corresponding to multiplication of the data matrix x by B_L, combines each pair of L/2-element column vectors from the old matrix into a new L-element vector for each column of the new matrix.

2.2. Components of the Basic Algorithm. By inspection of Figure 2.1, one can identify the following five components of the high-level radix 2 FFT algorithm:

1. The computation of the bit-reversal permutation (Line 1).

2. The computation of the complex weights occurring in the matrices Ω_{L*}. These weights are discussed further in Section 3.1.

3. The butterfly computation that, using the weights, combines two vectors from x, producing a vector with twice as many elements (Line 6). The butterfly computation is scalarized, and subsequently refined in Sections 3.2-3.7. Contiguous memory access gives better performance at all levels of a memory hierarchy. Consequently, making data access during the butterfly computation contiguous is a driving factor throughout this paper in refining the butterfly computation.

4. The strategy for allocating the elements of the reshaped array x (Line 6) to the processors for use in parallel and distributed implementations of the computation loop (the q loop of Lines 2 through 7). Alternative strategies for which processor will compute which values of x during each iteration of the computation loop are discussed in Section 4. The actual location of the data is discussed in Sections 5 and 6.

5. The generation and bundling of messages involving values from x (Lines 2 through 7). Message passing of data in distributed computation is discussed in Section 5.

3. Development of Sequential Code.

3.1. Specification of the Matrix of Weight Constants. The description of the algorithm in Figure 2.1 uses English to describe the constant matrix B_L. This matrix is a large two-dimensional sparse array.
Of more importance for our purposes, this array can be naturally specified by using constructors from the array calculus, with a few dense one-dimensional vectors as the basis. This yields a succinct description of how B_L can be constructed via a composition of array operations. Indeed, B_L is never actually materialized as a dense L × L matrix. Rather, the succinct representation is used to guide the generation of code for the FFT, and the generated code only has multiplications corresponding to multiplication by a nonzero element of B_L.

At the top level, B_L is constructed by the appropriate concatenation of four submatrices. Only the construction of two of these submatrices, namely I_{L*} and Ω_{L*}, need be separately specified. Matrix I_{L*} occurs twice in the decomposition of B_L. Matrix −Ω_{L*} can be obtained from matrix Ω_{L*} by applying element-wise unary minus.

Each of the two submatrices to be constructed is a diagonal matrix. For each of these diagonal matrices, the values along the diagonal are successive powers of a given scalar. Psi Calculus contains a constructor that produces a vector whose elements are consecutive multiples or powers of a given scalar value. There is another constructor diagonalize³ that converts a vector into a diagonal matrix with the values from the vector occurring along the diagonal, in the same order. We specified the matrices I_{L*} and Ω_{L*} by using the vector constructor that produces successive powers, followed by the diagonalize constructor.

Since L = 2^q, matrix B_L is different for each iteration of the loop in Figure 2.1. Accordingly, the specification of I_{L*} and Ω_{L*} is parameterized by L (and hence implicitly by q).

³ The diagonalize operation is itself specified as a composition of more primitive array operations.

3.2. Scalarization of the Matrix Multiplication.
A direct and easily automated scalarization of the matrix multiplication in Line 6 of Figure 2.1 produced code similar⁴ to that given in Figure 3.1. Here weight is a vector of consecutive powers of ω_L. Note that in Figure 3.1, the multiplication by the appropriate constant from the diagonal of matrix Ω_{L*} is done explicitly, using a value from vector weight, and the multiplication by the constant 1 from the diagonal of matrix I_{L*} is done implicitly. Because the scalarized code does an assignment to one element at a time, the code stores the result of the array multiplication into a temporary array xx, and then copies xx into x. The sparseness of matrix B_L is reflected in the assignments to xx(row,col). The right-hand side of these assignment statements is the sum of only the two terms corresponding to the two nonzero elements of the row of B_L involved in the computation of the left-hand side element. Moreover, the "regular" structure of B_L, expressed in terms of diagonal submatrices, provides a uniform way of selecting the two terms.

do col = 0,r-1
  do row = 0,L-1
    if (row < L/2) then
      xx(row,col) = x(row,col) + weight(row)*x(row+L/2,col)
    else
      xx(row,col) = x(row-L/2,col) - weight(row-L/2)*x(row,col)
    end if
  end do
end do

Fig. 3.1. Direct Scalarization of the Matrix Multiplication

3.3. Representing Data as a 1-Dimensional Vector. The data array x in Figure 2.1 is a 2-dimensional array that is reshaped during each iteration. Computationally, however, it is more efficient to store the data values in x as a one-dimensional vector, and to completely avoid performing this reshaping. Psi calculus easily handles avoiding array reshaping, and automates such transformations as the mapping of indices of an element of the 2-dimensional array into the index of the corresponding element of the vector.
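This index mapping can be checked concretely: with the L × r matrix stored column-major in a flat vector, element (row, col) lives at index L*col + row, and doubling L merely reinterprets the same vector. The sketch below is illustrative only; the names are ours.

```python
def flat_index(row, col, L):
    """Column-major address of element (row, col) of an L x r matrix."""
    return L * col + row

n = 16
x = list(range(n))   # data vector; each value equals its own flat index

# View x as a 4 x 4 matrix: column 1 occupies x[4..7]
L = 4
col1 = [x[flat_index(row, 1, L)] for row in range(L)]
assert col1 == [4, 5, 6, 7]

# After L doubles (an 8 x 2 view), each new column is exactly the
# concatenation of two adjacent old columns: no data movement occurred
L2 = 8
new_col0 = [x[flat_index(row, 0, L2)] for row in range(L2)]
old_col0 = [x[flat_index(row, 0, L)] for row in range(L)]
assert new_col0 == old_col0 + col1
```

The final assertion is the reshaping claim in miniature: the column-wise reshape of Figure 2.1 is free when the data is kept as one column-major vector.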
The L × r matrix x_{L×r} is stored as an n-element vector, which we denote as program variable x. The elements of the two-dimensional matrix are envisioned as being stored in column-major order, reflecting the column-wise reshaping occurring in Figure 2.1. Thus, element x_{L×r}(row, col) of x_{L×r} corresponds to element x(L*col+row) of x. Consequently, when L changes, and matrix x_{L×r} is reshaped, no movement of data elements of vector x actually occurs. Replacing each access to an element of the two-dimensional matrix x_{L×r} with an access to the corresponding element of the vector x produces the code shown in Figure 3.2.

As an alternative to the scalarized code of Figure 3.2, the outer loop variable can iterate over the starting index of each column in vector x, using the appropriate stride to increment the loop variable. Instead of using a loop variable col, which ranges from 0 to r-1 with a stride of 1, we use a variable, say col′, such that col′ = L*col, and which consequently has a stride of L. By doing this, we eliminate the multiplication L*col that occurs each time an element of the arrays xx or x is accessed. This form of scalarization produces the code shown in Figure 3.3.

⁴ The data array was not reshaped for each value of q, so references to the data array elements were more complicated than shown in Figure 3.1.

do col = 0,r-1
  do row = 0,L-1
    if (row < L/2) then
      xx(L*col+row) = x(L*col+row) + weight(row)*x(L*col+row+L/2)
    else
      xx(L*col+row) = x(L*col+row-L/2) - weight(row-L/2)*x(L*col+row)
    end if
  end do
end do

Fig. 3.2. Scalarized Code with Data Stored in a Vector

do col′ = 0,n-1,L
  do row = 0,L-1
    if (row < L/2) then
      xx(col′+row) = x(col′+row) + weight(row)*x(col′+row+L/2)
    else
      xx(col′+row) = x(col′+row-L/2) - weight(row-L/2)*x(col′+row)
    end if
  end do
end do

Fig. 3.3.
Striding through the Data Vector

3.4. Elimination of Conditional Statement. The use of the conditional statement that tests row < L/2 in the above code can be eliminated automatically, as follows. We first re-tile the loop structure so that the innermost loop iterates over each pair of data elements that participate in each butterfly combination. To accomplish this re-tiling, we first envision reshaping the two-dimensional array x_{L×r} into a three-dimensional array x_{L/2×2×r}. Under this reshaping, element x_{L×r}(row, col) corresponds to element x_{L/2×2×r}(row mod L/2, ⌊row/(L/2)⌋, col). The middle dimension of x_{L/2×2×r} splits each column of x_{L×r} into the upper and lower parts of the column. Scalarizing Line 6 of Figure 2.1 based on the three-dimensional array x_{L/2×2×r}, and indexing over the third dimension in the outer loop, over the first dimension in the middle loop, and over the second dimension in the innermost loop, produces the code shown in Figure 3.4. To eliminate the conditional statement, we unroll the innermost loop, and produce the code shown in Figure 3.5. When we represent the data array as a one-dimensional vector, the code shown in Figure 3.6 is produced.

3.5. Optimizing the Basic Block. The basic block in the inner loop of the above code has two occurrences of the common subexpression weight(row)*x(col′+row+L/2). We hand-optimized this basic block, to compute this common subexpression only once. This produces more efficient code for the basic block, as shown in Figure 3.7. Incorporating all the transformations described so far, the loop in Figure 2.1 is scalarized as shown in Figure 3.8. In this code, pi and i are Fortran "parameters", i.e., named constants.

3.6. Doing the Butterfly Computation In-Place.
The use of the temporary array xx in the butterfly computation is unnecessary, and can be avoided by the use of a scalar variable to hold the value of x(col′+row). Code incorporating this modification is shown in Figure 3.9, where scalar variable d is used for this purpose.

do col = 0,r-1
  do row = 0,L/2-1
    do group = 0,1
      if (group == 0) then
        xx(row,group,col) = x(row,group,col) + weight(row)*x(row,group+1,col)
      else
        xx(row,group,col) = x(row,group-1,col) - weight(row)*x(row,group,col)
      end if
    end do
  end do
end do

Fig. 3.4. Re-tiled Loop

do col = 0,r-1
  do row = 0,L/2-1
    xx(row,0,col) = x(row,0,col) + weight(row)*x(row,1,col)
    xx(row,1,col) = x(row,0,col) - weight(row)*x(row,1,col)
  end do
end do

Fig. 3.5. Unrolled Inner Loop

3.7. Vectorizing the Butterfly Computation. An alternative coding style for the butterfly computation is to use monolithic vector operations applied to appropriate sections of the data array. This vectorization of the butterfly computation produces the code shown in Figure 3.10⁵. Here, cvec is a one-dimensional array used to store the vector of values assigned to variable c during the iterations of the inner loop (the row loop) in Figure 3.9. Similarly, dvec is a one-dimensional array used to store the vector of values assigned to variable d during the iterations of this inner loop.

4. Development of Plans for Parallel and Distributed Code.

4.1. An Overview of Plans, Templates, and Architectural Issues. We want to generate code for two types of architecture:

• Distributed memory architecture (where each processor has its own local address space).
• Shared memory architecture with substantial amounts of local memory (where there is a common address space that includes the local memories).
To accommodate these architectures, we will develop an appropriate parallel computation plan, followed by appropriate parallel computation templates.

do col′ = 0,n-1,L
  do row = 0,L/2-1
    xx(col′+row) = x(col′+row) + weight(row)*x(col′+row+L/2)
    xx(col′+row+L/2) = x(col′+row) - weight(row)*x(col′+row+L/2)
  end do
end do

Fig. 3.6. Unrolled Inner Loop With Data Stored as a Vector

c = weight(row)*x(col′+row+L/2)
xx(col′+row) = x(col′+row) + c
xx(col′+row+L/2) = x(col′+row) - c

Fig. 3.7. Optimized Basic Block

What we mean by a parallel computation plan is sequential code that indicates which parts of the overall work are to be done by each individual processor. A plan specifies, for each iteration of an outer loop⁶, which iterations of inner loops should be performed by each of the processors of a multiprocessor system. At each point in the parallel computation, we envision that responsibility for the n elements is partitioned among the processors. Our intention is that this responsibility is based on an owner computes rule, namely a given processor is responsible for those data elements for which it executes the butterfly assignment.

What we mean by a template is distributed or parallel "generic code" that indicates not only which parts of the overall work are to be done by each individual processor, but also where data is located, and how data is moved. To convert a template into code, a specific programming language needs to be selected, and more detail needs to be filled in. We use two styles of templates. The first style is a per-processor style, motivated by MPI [24]. The other style is an all-processor style, motivated by OpenMP [25].

⁵ We assume that code can select a set of array elements using a start, stop, and stride mechanism, with a default stride of 1.
It is our intention that in the transformation from an FFT parallel computation plan into a template for a distributed architecture or a shared memory architecture where substantial local memory is available to each processor, each of the processors will hold in its local memory those data elements it is responsible for, and do the butterfly operations on these data elements.

4.2. Splitting the Outer Loop of the Butterfly Computation. Suppose there are m processors, where m is a power of 2. We let psize denote n/m, the number of elements that each processor is responsible for. Our parallel computation plans will assume that psize ≥ m. Now envision the data in the form of a two-dimensional matrix that gets reshaped for each value of q, as in Figure 2.1. During the initial rounds of the computation, for each processor, the value of psize is large enough to hold one or more columns of the two-dimensional data matrix, so we will make each processor responsible for multiple columns, where the total number of elements in these columns equals psize. For each iteration of q, the length of a column doubles, so that at some point, a column of the matrix will have more than psize elements. This first occurs when q equals log2(psize) + 1. We call this value breakpoint. Once q equals breakpoint, each column has more than psize elements. Consequently, we no longer want only one processor to be responsible for an entire column, and so we change plans. Before q equals breakpoint, the number of columns is a multiple of m.

⁶ For the FFT, this outer loop is the q loop of Figure 3.9.

do q = 1,t
  L = 2**q
  do row = 0,L/2-1
    weight(row) = EXP((2*pi*i*row)/L)
  end do
  do col′ = 0,n-1,L
    do row = 0,L/2-1
      c = weight(row)*x(col′+row+L/2)
      xx(col′+row) = x(col′+row) + c
      xx(col′+row+L/2) = x(col′+row) - c
    end do
  end do
  x = xx
end do

Fig. 3.8.
L o op with Optimize d Basic Blo ck do q = 1,t L = 2**q do row = 0,L/2- 1 weight (row) = EXP((2 *pi*i *row)/L) end do do col ′ = 0,n-1, L do row = 0,L/2- 1 c = weight (row)* x(col ′ +row+L /2) d = x(col ′ +row) x(col ′ +row) = d + c x(col ′ +row+L /2) = d - c end do end do end do Fig. 3.9 . L o op with In–Plac e Butterfly Computation F rom br e ak p oint on, there a re fewer than m co lumns, but the n umber of rows is a m ultiple of m . As long as q is less tha n br e akp oint , we can use a blo c k a pproach to the co mpu- tation, with ea ch pro cessor computing the new v alues of several c onse cutive columns of the tw o–dimensional matrix. As s ume tha t the pro cess ors are nu mbered 0 thr ough m − 1. In terms of the one– dimensional vector o f data elements, the co lumns whose new v a lues a re to b e computed by pro cessor p ar e the psize c o nsecutive vector ele- men ts b e g inning with vector ele men t p ∗ psiz e . When q equals br e akp oint , the n umber of elements in each column is 2 * psize , where as we wan t each pr o cessor to co mpute the new v a lue of only psize elements. Thus, when q equals br e akp oint , we need to switch to a differ ent plan. T o facilitate the s witch to a differe nt plan, we first mo dify Figure 3 .9 by splitting the q loo p into t wo separate lo o ps, as shown in Fig ure 4 .1. 4.3. P arallel Com putation Plan When Lo cal M emory is Av ailable. Co n- sider the t wo q lo ops in Figure 4 .1. F or the firs t q lo o p, we will use a blo ck appro ach to splitting the computation a mo ng the pro cessor s, as discussed in Section 4.2. Reca ll that b efore q equals br e akp oint , the num ber o f columns is a m ultiple o f m , and fro m 11 do col ′ = 0,n-1, L cvec(0 :L/2- 1) = weight (0:L/2 -1) * x(col ′ +L/2:c ol ′ +L-1) dvec(0 :L/2- 1) = x(col ′ :col ′ +L/2-1 ) x(col ′ :col ′ +L/2-1 ) = dvec(0:L/2 -1) + cvec(0 :L/2- 1) x(col ′ +L/2:c ol ′ +L-1) = dvec( 0:L/2 -1) - cvec(0 :L/2- 1) end do Fig. 3.10 . 
V ectorizing the Butte rfly Computation do q = 1,brea kpoin t - 1 L = 2**q do row = 0,L/2- 1 weight (row) = EXP((2 *pi*i *row)/L) end do do col ′ = 0,n-1, L do row = 0,L/2- 1 c = weig ht(ro w)*x( col ′ +row+L /2) d = x(co l ′ +row) x(col ′ +row) = d + c x(col ′ +row+L /2) = d - c end do end do end do do q = breakp oint, t L = 2**q do row = 0,L/2- 1 weight (row) = EXP((2 *pi*i *row)/L) end do do col ′ = 0,n-1, L do row = 0,L/2- 1 c = weig ht(ro w)*x( col ′ +row+L /2) d = x(co l ′ +row) x(col ′ +row) = d + c x(col ′ +row+L /2) = d - c end do end do end do Fig. 4.1 . L o op Splitting the Butterfly Computation br e akp oint on, the num b er of rows is a multiple of m . F or the second q lo op, which beg ins with q equal to br e akp oint , w e use a cyclic appr oach, with each pro cessor p computing the new v alues of all the rows j of the tw o–dimensiona l matrix such that j is congr uent to p mo d m . This corresp onds to all the iterations o f the inner lo o p where pr o gram v ariable row is c ongruent to p mo d m . W e expre s s a computation pla n by mo difying each of the q lo ops in Figure 4.1, so that for each v a lue of q , there is a new outer lo op using a new lo o p v ariable p , which ranges between 0 and m − 1. Our inten tion is tha t all the co mputations w ithin the lo op for a given v a lue of p will be perfo r med b y pro cesso r p . Figure 4.2 shows the 12 effect of restructuring ea ch of the tw o q lo ops in Fig ure 4.1 b y introducing a p lo o p to reflect the par allel c o mputation plan. The fir st q lo o p uses a blo ck a pproach for ea ch p , and the second q lo op uses a cy clic appr oach for each p . 
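As a concrete check of the sequential starting point, the passes of Figure 3.9 can be run in executable form. The following Python sketch is illustrative and not part of the paper's development; with the paper's positive-exponent weights EXP((2*pi*i*row)/L), the passes compute the unnormalized transform y(k) = Σⱼ x(j)·e^{2πijk/n}, assuming (as in the standard iterative radix-2 formulation) that the input has first been permuted into bit-reversed index order. The permutation helper below is ours.

```python
import cmath

def bit_reverse_permute(x):
    # Place element i at the position obtained by reversing its t-bit index.
    n = len(x); t = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r = int(format(i, '0%db' % t)[::-1], 2)
        out[r] = x[i]
    return out

def butterfly_fft(x):
    # The q loop of Fig. 3.9, with weights EXP((2*pi*i*row)/L).
    x = list(x)
    n = len(x); t = n.bit_length() - 1
    for q in range(1, t + 1):
        L = 2 ** q
        weight = [cmath.exp(2j * cmath.pi * row / L) for row in range(L // 2)]
        for col in range(0, n, L):
            for row in range(L // 2):
                c = weight[row] * x[col + row + L // 2]
                d = x[col + row]
                x[col + row] = d + c
                x[col + row + L // 2] = d - c
    return x

xs = [complex(v) for v in range(1, 9)]
result = butterfly_fft(bit_reverse_permute(xs))
direct = [sum(xs[j] * cmath.exp(2j * cmath.pi * j * k / 8) for j in range(8))
          for k in range(8)]
assert max(abs(a - b) for a, b in zip(result, direct)) < 1e-9
```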
   do q = 1,breakpoint - 1
      L = 2**q
      do row = 0,L/2-1
         weight(row) = EXP((2*pi*i*row)/L)
      end do
      do p = 0,m-1
         do col′ = p*psize,(p+1)*psize-1,L
            do row = 0,L/2-1
               c = weight(row)*x(col′+row+L/2)
               d = x(col′+row)
               x(col′+row) = d + c
               x(col′+row+L/2) = d - c
            end do
         end do
      end do
   end do
   do q = breakpoint,t
      L = 2**q
      do row = 0,L/2-1
         weight(row) = EXP((2*pi*i*row)/L)
      end do
      do p = 0,m-1
         do col′ = 0,n-1,L
            do row = p,L/2-1,m
               c = weight(row)*x(col′+row+L/2)
               d = x(col′+row)
               x(col′+row) = d + c
               x(col′+row+L/2) = d - c
            end do
         end do
      end do
   end do

Fig. 4.2. Initial Parallel Computation Plan

Each execution of a p loop in Figure 4.2 consists of m iterations. Note that for each value of q, the sets of elements of the x vector accessed by these m iterations are pairwise disjoint. Our intention is that each of these m iterations will be performed by a different processor. Since these processors will be accessing disjoint sets of elements from the x vector, the execution of different iterations of a p loop by different processors can be done in parallel.

Now consider the first q loop in Figure 4.2. For each value of p within this first q loop, the data elements accessed are the psize elements beginning at position p∗psize of x. Since for a given p these are the same set of data elements for every value of q in the outer loop, we can interchange the loop variables q and p in the first q loop. Now consider the second q loop in Figure 4.2. For each value of p within this second q loop, the data elements accessed are the psize elements of x whose position is congruent to p mod m. Since for a given p these are the same set of data elements for every value of q in the outer loop, we can interchange the loop variables q and p in the second q loop.
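The disjointness claim for Figure 4.2 can be checked exhaustively for a small instance. The following Python sketch is an illustrative aside (the helper `touched` and the parameter values are ours; `bp` computes breakpoint = log₂(psize) + 1 for a power-of-2 psize): for every stage q, the m per-processor index sets are pairwise disjoint and together cover all n positions.

```python
t = 6                              # n = 2**t
n, m = 2 ** t, 4                   # assumes m a power of 2 and psize >= m
psize = n // m
bp = psize.bit_length()            # breakpoint = log2(psize) + 1

def touched(q, p, phase):
    # Indices of x read or written by processor p at stage q of Fig. 4.2.
    L = 2 ** q
    if phase == 'block':           # first q loop: consecutive columns
        cols, rows = range(p * psize, (p + 1) * psize, L), range(L // 2)
    else:                          # second q loop: rows congruent to p mod m
        cols, rows = range(0, n, L), range(p, L // 2, m)
    return {c + r + off for c in cols for r in rows for off in (0, L // 2)}

for q in range(1, t + 1):
    phase = 'block' if q < bp else 'cyclic'
    sets = [touched(q, p, phase) for p in range(m)]
    assert sum(len(s) for s in sets) == n          # pairwise disjoint ...
    assert set().union(*sets) == set(range(n))     # ... and jointly exhaustive
```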
This analysis shows that we can interchange the loop variables q and p in each of the two loops in Figure 4.2 to make p the outermost loop variable, resulting in Figure 4.3. Note that a consequence of this loop interchange is that the computation of the weight vector for each value of q is now done for each value of p.

   do p = 0,m-1
      do q = 1,breakpoint - 1
         L = 2**q
         do row = 0,L/2-1
            weight(row) = EXP((2*pi*i*row)/L)
         end do
         do col′ = p*psize,(p+1)*psize-1,L
            do row = 0,L/2-1
               c = weight(row)*x(col′+row+L/2)
               d = x(col′+row)
               x(col′+row) = d + c
               x(col′+row+L/2) = d - c
            end do
         end do
      end do
   end do
   do p = 0,m-1
      do q = breakpoint,t
         L = 2**q
         do row = 0,L/2-1
            weight(row) = EXP((2*pi*i*row)/L)
         end do
         do col′ = 0,n-1,L
            do row = p,L/2-1,m
               c = weight(row)*x(col′+row+L/2)
               d = x(col′+row)
               x(col′+row) = d + c
               x(col′+row+L/2) = d - c
            end do
         end do
      end do
   end do

Fig. 4.3. Local Memory Parallel Computation Plan with Processor Loops Outermost

4.4. Computing Only Needed Weights. In the second q loop in Figure 4.3, each p computes an entire weight vector of L/2 elements. However, for a given p, only those elements of the weight vector whose position is congruent to p mod m are actually used. We can change the plan so that only these L/(2m) elements of the weight vector are computed. The computation of weights in the second q loop would then be as follows.

   do row = p,L/2-1,m
      weight(row) = EXP((2*pi*i*row)/L)
   end do

4.5. Making Weight Vector Access Contiguous. Note that in the second q loop, the weight vector is accessed with consecutive values of the program variable row. These consecutive values of row are strided, with a stride of m. Consequently, the accesses of the weight vector in the second q loop are strided, with a stride of m. Within each iteration of the second q loop we can make the accesses to the weight vector contiguous, as follows.
The key is to use a new weight vector weightcyclic, in which we store only the L/(2m) weight elements needed for a given value of p. Consequently, these needed elements of weightcyclic will be accessed contiguously. The relationship between the weightcyclic and weight vectors is that for each j, 0 ≤ j < L/(2m), weightcyclic(j) will hold the value of weight(m∗j + p), i.e., for each value of row occurring in an inner loop with loop control

   do row = p,L/2-1,m

the value of weight(row) will be held in weightcyclic((row − p)/m). With this change, the computation of weights in the second q loop becomes the following.

   do row = p,L/2-1,m
      weightcyclic((row-p)/m) = EXP((2*pi*i*row)/L)
   end do

We can simplify the loop control and subscript computation for weightcyclic by changing the loop control variable. In the loop which computes the needed weight elements, we use a loop index variable row′, where row′ = (row − p)/m, i.e., row = m∗row′ + p. With this change, the weight computation within the second q loop becomes the following, where row′ has a stride of 1.

   do row′ = 0,L/(2*m)-1,1
      weightcyclic(row′) = EXP((2*pi*i*(m*row′+p))/L)
   end do

Within the butterfly computation, where the loop variable is row, a use of weight(row) becomes a use of weightcyclic((row − p)/m). Figure 4.4 shows the second q loop after incorporating the improvements from Section 4.4 and this section to the computation and access of weights.

4.6. Making Data Access Contiguous. Note that the accesses of data vector x in the second q loop are strided, with a stride of m. We can make access to the needed data in x in the second q loop contiguous, as follows. We introduce a new data vector xcyclic to hold only those data elements accessed for each p.
For each p, the vector xcyclic contains L/m elements from each of the n/L columns, for a total of n/m = psize elements. The relationship between the xcyclic and x vectors is that for each col′, where 0 ≤ col′ ≤ n − 1 and col′ is a multiple of L, and row, where 0 ≤ row < L and row is congruent to p mod m, xcyclic(col′/m + (row − p)/m) will hold the value of x(col′ + row). Consequently, for each value of col′ and row occurring in the butterfly loop, x(col′ + row) will be held in xcyclic(col′/m + (row − p)/m), and x(col′ + row + L/2) will be held in xcyclic(col′/m + (row + L/2 − p)/m).

At the beginning of each iteration of the p loop, the required psize elements of x must be copied into xcyclic, and at the end of each iteration of the p loop, these psize elements must be copied from xcyclic into x. At the template level, this copying will be done by either message passing or assignment statements, depending on the architecture and programming language. Here, where we are developing a computation plan, we indicate this copying abstractly, using function calls.

   do p = 0,m-1
      do q = breakpoint,t
         L = 2**q
         do row′ = 0,L/(2*m)-1,1
            weightcyclic(row′) = EXP((2*pi*i*(m*row′+p))/L)
         end do
         do col′ = 0,n-1,L
            do row = p,L/2-1,m
               c = weightcyclic((row-p)/m)*x(col′+row+L/2)
               d = x(col′+row)
               x(col′+row) = d + c
               x(col′+row+L/2) = d - c
            end do
         end do
      end do
   end do

Fig. 4.4. Second q Loop with Contiguous Access of Needed Weights

We envision array x as containing a "centralized" copy of all the data elements, and xcyclic as containing a subset. The respective subsets of x used by the iterations of the p loop represent a partition of the data elements in x. We envision the n elements of x as being partitioned into m subsets, where each iteration of the p loop copies one of these subsets into xcyclic.
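The index mapping just described has a simple closed form: since col′ is a multiple of L and row ≡ p (mod m), the position col′/m + (row − p)/m equals (col′ + row − p)/m, so xcyclic is exactly the strided slice x(p : n−1 : m). The following Python sketch (illustrative only; parameter values are ours) checks this for one stage of the second q loop.

```python
n, m, p = 64, 4, 2
psize = n // m
L = 2 * psize                      # a stage with q >= breakpoint
x = list(range(n))                 # stand-in data: x(i) = i

# Section 4.6: xcyclic(col'/m + (row-p)/m) holds x(col'+row) for every
# col' that is a multiple of L and every row < L with row ≡ p (mod m).
xcyclic = [None] * psize
for col in range(0, n, L):
    for row in range(p, L, m):
        xcyclic[col // m + (row - p) // m] = x[col + row]

# The copy is exactly the strided slice x(p : n-1 : m).
assert xcyclic == x[p::m]
```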
The copying from x to xcyclic of the psize elements of x whose position in x is congruent to p mod m is expressed using the function call CENTRALIZED_TO_CYCLIC_PARTITIONED(x,xcyclic,m,psize,p). The copying of the psize elements of xcyclic into x is expressed using the function call CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic,x,m,psize,p). Figure 4.5 shows the effect of incorporating these changes into the second q loop. Note that since the stride in the row loop is m, these changes do indeed provide contiguous data access.

We can simplify the loop control and subscript computation of vector elements occurring in Figure 4.5 by changing the loop control variables, as shown in Figure 4.6. In Figure 4.6, we use a variable col′′ to range over all the columns, where col′ = m∗col′′, i.e., col′′ = col′/m. Since the bounds for loop control variable col′ in Figure 4.5 are 0,n-1,L, the bounds for the loop control variable col′′ in Figure 4.6 are 0,psize-1,L/m. Similarly, in Figure 4.6, we use a variable row′, where row = m∗row′ + p, i.e., row′ = (row − p)/m. Consider the bounds of the row loop in the second q loop; the start, stop, stride are p,L/2-1,m. For the corresponding row′ loop, the start becomes 0, and the stop becomes L/(2m) − (p+1)/m. Since 0 ≤ p < m, we can conclude that 0 < (p+1)/m ≤ 1, so the stop can be simplified to L/(2m) − 1. Finally, the stride is 1. So, the bounds for the row′ loop become 0,L/(2*m)-1,1. Consequently, xcyclic(col′′+row′) will contain the value of x(m*col′′+m*row′+p).

4.7. Facilitating Exploitation of Local Memory. Note that in Figure 4.6, there is no data flow involving xcyclic between the iterations of the p loop.
A consequence is that when the plan is subsequently transformed into a template for parallel and distributed code, xcyclic can be a local variable of each processor.

   do p = 0,m-1
      CENTRALIZED_TO_CYCLIC_PARTITIONED(x,xcyclic,m,psize,p)
      do q = breakpoint,t
         L = 2**q
         do row′ = 0,L/(2*m)-1,1
            weightcyclic(row′) = EXP((2*pi*i*(m*row′+p))/L)
         end do
         do col′ = 0,n-1,L
            do row = p,L/2-1,m
               c = weightcyclic((row-p)/m)*xcyclic(col′/m+(row-p)/m+L/(2*m))
               d = xcyclic(col′/m+(row-p)/m)
               xcyclic(col′/m+(row-p)/m) = d + c
               xcyclic(col′/m+(row-p)/m+L/(2*m)) = d - c
            end do
         end do
      end do
      CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic,x,m,psize,p)
   end do

Fig. 4.5. Plan for Second q Loop with Contiguous Data Access

   do p = 0,m-1
      CENTRALIZED_TO_CYCLIC_PARTITIONED(x,xcyclic,m,psize,p)
      do q = breakpoint,t
         L = 2**q
         do row′ = 0,L/(2*m)-1,1
            weightcyclic(row′) = EXP((2*pi*i*(m*row′+p))/L)
         end do
         do col′′ = 0,psize-1,L/m
            do row′ = 0,L/(2*m)-1,1
               c = weightcyclic(row′)*xcyclic(col′′+row′+L/(2*m))
               d = xcyclic(col′′+row′)
               xcyclic(col′′+row′) = d + c
               xcyclic(col′′+row′+L/(2*m)) = d - c
            end do
         end do
      end do
      CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic,x,m,psize,p)
   end do

Fig. 4.6. Plan for Second q Loop with Contiguous Data Access and Simplified Loop Control

A similar transformation to that in Section 4.6 can be done to the first q loop in the plan, although the payoff only occurs subsequently, when templates are constructed from the plan. We can facilitate making access to the data in the first q loop be via a processor-local variable. This can be accomplished by introducing a new data vector xblock to hold the data values accessed by each iteration of the p loop. For each p, we will use the vector xblock to hold the psize elements accessed during that iteration of the p loop.
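Concretely, the xblock copy amounts to slicing out the p-th contiguous block of x. The following Python sketch is an illustrative aside (parameter values are ours) checking that the subscript col′ + row − p∗psize used below in Figure 4.7 fills xblock with exactly that block.

```python
n, m, p, L = 64, 4, 3, 8           # L <= psize: a stage of the first q loop
psize = n // m
x = list(range(n))                 # stand-in data: x(i) = i

# xblock(col' + row - p*psize) holds x(col' + row), so xblock is just the
# p-th contiguous block of x.
xblock = [None] * psize
for col in range(p * psize, (p + 1) * psize, L):
    for row in range(L):           # both halves of each butterfly column
        xblock[col + row - p * psize] = x[col + row]

assert xblock == x[p * psize:(p + 1) * psize]
```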
The relationship between the xblock and x vectors is that for each col′, where 0 ≤ col′ ≤ n − 1 and col′ is a multiple of L, and row, where 0 ≤ row < L, x(col′ + row) will be held in xblock(col′ + row − p∗psize). Figure 4.7 shows the effect of incorporating the use of xblock in the first q loop.

   do p = 0,m-1
      CENTRALIZED_TO_BLOCK_PARTITIONED(x,xblock,m,psize,p)
      do q = 1,breakpoint - 1
         L = 2**q
         do row = 0,L/2-1
            weight(row) = EXP((2*pi*i*row)/L)
         end do
         do col′ = p*psize,(p+1)*psize-1,L
            do row = 0,L/2-1
               c = weight(row)*xblock(col′+row+L/2-p*psize)
               d = xblock(col′+row-p*psize)
               xblock(col′+row-p*psize) = d + c
               xblock(col′+row+L/2-p*psize) = d - c
            end do
         end do
      end do
      BLOCK_PARTITIONED_TO_CENTRALIZED(xblock,x,m,psize,p)
   end do

Fig. 4.7. Plan for First q Loop

We can simplify the loop control and subscript computation of vector elements by changing the loop control variable col′, as shown in Figure 4.8. We use a variable col′′, where col′′ = col′ − p∗psize. So, the bounds for the loop control variable col′′ are 0,psize-1,L. Consequently, xblock(col′′+row) will contain the value of x(col′+row), i.e., of x(col′′+row+p*psize). Figure 4.9 shows the final plan, combining Figures 4.6 and 4.8.

4.8. Flexibility in Setting Breakpoint. Note that the split of the q loop of Figure 3.9 into two separate q loops, as shown in Figure 4.1, is correct for any value of breakpoint. However, for the transformation into the plan shown in Figure 4.2 to be correct, breakpoint must satisfy certain requirements, as follows.

Requirement 1: Consider the first q loop in Figure 4.2. The correctness of this q loop requires that L ≤ psize, i.e., 2^q ≤ psize. Thus, each value of q for which the loop is executed must satisfy q ≤ log₂(psize).
Since the highest value of q for which the loop is executed is breakpoint − 1, the first q loop requires that breakpoint ≤ log₂(psize) + 1.

Requirement 2: Consider the second q loop in Figure 4.2. The correctness of this q loop requires that L/2 ≥ m, i.e., 2^(q−1) ≥ m. Thus, each value of q for which the loop is executed must satisfy q ≥ log₂(m) + 1. Since the lowest value of q for which the loop is executed is breakpoint, the second q loop requires that breakpoint ≥ log₂(m) + 1.

Together, these two requirements imply that

   log₂(m) + 1 ≤ breakpoint ≤ log₂(psize) + 1.

Since we are assuming that m ≤ psize, there is a range of possible allowable values for breakpoint. Thus far in Section 4, we have assumed that breakpoint is set at the high end of its allowable range, namely at log₂(psize) + 1. In fact, breakpoint could be set anywhere within its allowable range. If the plan shown in Figure 4.4 is used, there is an advantage in setting breakpoint to the low end of its allowable range, for the following reason. Moving some values of q from the first q loop to the second q loop would reduce the number of weights that need to be computed by each processor.

   do p = 0,m-1
      CENTRALIZED_TO_BLOCK_PARTITIONED(x,xblock,m,psize,p)
      do q = 1,breakpoint - 1
         L = 2**q
         do row = 0,L/2-1
            weight(row) = EXP((2*pi*i*row)/L)
         end do
         do col′′ = 0,psize-1,L
            do row = 0,L/2-1
               c = weight(row)*xblock(col′′+row+L/2)
               d = xblock(col′′+row)
               xblock(col′′+row) = d + c
               xblock(col′′+row+L/2) = d - c
            end do
         end do
      end do
      BLOCK_PARTITIONED_TO_CENTRALIZED(xblock,x,m,psize,p)
   end do

Fig. 4.8. Plan for First q Loop with Simplified Loop Control

5. Development of Per-Processor Templates for Distributed Memory Architectures.

5.1. Distributed Template Using Local Copy of Data Vector.

5.1.1. Partial Per-Processor Distributed Code Template.
Figure 5.1 shows how the parallel computation plan from Figure 4.3 can be transformed into a partially specified template (subsequently abbreviated as a partial template) for distributed code. (Footnote 7: Any of the improvements from Section 4, such as computing only the needed weights, as discussed in Section 4.4, could be incorporated into the plan prior to this transformation, but for simplicity, here we give a transformation from the plan in Figure 4.3.) The template in Figure 5.1 is intended to represent per-processor code to be executed on each of the m processors. We assume that each processor has a variable, named myid, whose value is the processor number, where the processor number ranges from 0 to m − 1. Thus, each of the m processors, say processor i, has its value of local variable myid equal to i. In Figure 4.3, each p loop has m iterations, with p ranging from 0 to m − 1. In Figure 5.1, variable myid is used to ensure that processor i executes iteration i of each p loop, i.e., the iteration where p equals i.

A SYNCHRONIZE command is inserted between the two q loops to indicate that some form of synchronization between processors is needed, because data computed by each processor in its first q loop is subsequently accessed by all the processors in their second q loop. We call Figure 5.1 a partial template because it does not address the issue of data location or data movement between processors. These issues must be resolved, and the SYNCHRONIZE command between the two q loops must be replaced by statements using an appropriate implementation-language mechanism that permits each processor to proceed past the synchronization point only after all the processors have reached the synchronization point. These statements must ensure that the data needed by each processor is indeed available to that processor.

   do p = 0,m-1
      CENTRALIZED_TO_BLOCK_PARTITIONED(x,xblock,m,psize,p)
      do q = 1,breakpoint - 1
         L = 2**q
         do row = 0,L/2-1
            weight(row) = EXP((2*pi*i*row)/L)
         end do
         do col′′ = 0,psize-1,L
            do row = 0,L/2-1
               c = weight(row)*xblock(col′′+row+L/2)
               d = xblock(col′′+row)
               xblock(col′′+row) = d + c
               xblock(col′′+row+L/2) = d - c
            end do
         end do
      end do
      BLOCK_PARTITIONED_TO_CENTRALIZED(xblock,x,m,psize,p)
   end do
   do p = 0,m-1
      CENTRALIZED_TO_CYCLIC_PARTITIONED(x,xcyclic,m,psize,p)
      do q = breakpoint,t
         L = 2**q
         do row′ = 0,L/(2*m)-1,1
            weightcyclic(row′) = EXP((2*pi*i*(m*row′+p))/L)
         end do
         do col′′ = 0,psize-1,L/m
            do row′ = 0,L/(2*m)-1,1
               c = weightcyclic(row′)*xcyclic(col′′+row′+L/(2*m))
               d = xcyclic(col′′+row′)
               xcyclic(col′′+row′) = d + c
               xcyclic(col′′+row′+L/(2*m)) = d - c
            end do
         end do
      end do
      CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic,x,m,psize,p)
   end do

Fig. 4.9. Combined Plan Using Partitioned Data

5.1.2. Responsibility-Based Distribution of Data Vector on Local Memories. In order to transform the partial template of Figure 5.1 into a (fully specified) template, we need to address the issue of where (i.e., on which processor) data is located, and how data moves between processors.
A straightforward approach is to let each processor have space for an entire data vector x of n elements, but in each of the two q loops only access those psize elements of x that the processor is responsible for.

   do q = 1,breakpoint - 1
      L = 2**q
      do row = 0,L/2-1
         weight(row) = EXP((2*pi*i*row)/L)
      end do
      do col′ = myid*psize,(myid+1)*psize-1,L
         do row = 0,L/2-1
            c = weight(row)*x(col′+row+L/2)
            d = x(col′+row)
            x(col′+row) = d + c
            x(col′+row+L/2) = d - c
         end do
      end do
   end do
   SYNCHRONIZE
   do q = breakpoint,t
      L = 2**q
      do row = 0,L/2-1
         weight(row) = EXP((2*pi*i*row)/L)
      end do
      do col′ = 0,n-1,L
         do row = myid,L/2-1,m
            c = weight(row)*x(col′+row+L/2)
            d = x(col′+row)
            x(col′+row) = d + c
            x(col′+row+L/2) = d - c
         end do
      end do
   end do

Fig. 5.1. Partial Per-Processor Parallel Code Template Obtained from Plan of Figure 4.3

To emphasize that space for all n elements of a vector x is allocated on each processor, we refer to this space on a given processor p as x_p. For a distributed architecture or a shared memory architecture with substantial local memories, the "current" value of each data element will reside on the processor that computes its new value, i.e., is responsible for it. We now consider this assignment of each element of the data array to the processor that is responsible for it.

To start, consider the first q loop in Figure 5.1. Since q ranges between 1 and breakpoint − 1, with breakpoint = log₂(psize) + 1 and L = 2^q, we can conclude that L ranges between 2 and psize. For each processor p, consider the data elements accessed in each iteration of this first q loop. These are data elements x(col′ + row) and x(col′ + row + L/2), where col′ ranges between p∗psize and (p+1)∗psize − 1 in steps of L, and row ranges between 0 and L/2 − 1.
Since L ≤ psize for each iteration of the q loop, the accessed data elements are psize consecutive elements, beginning with element x(p∗psize). These elements can be described as x(p∗psize : ((p+1)∗psize) − 1 : 1). This is a block distribution of the responsibility for array x, corresponding to the block approach to the first p loop, as described in Section 4.3.

Now consider the second q loop in Figure 5.1. In each iteration of this q loop, L/2 ≥ psize. Since psize ≥ m, we can conclude that L/2 ≥ m. For each processor p, consider the data elements accessed in each iteration of the q loop. These are data elements x(col′ + row) and x(col′ + row + L/2), where col′ ranges between 0 and n − 1 in steps of L, and row ranges between p and L/2 − 1 in steps of m. Since L/2 is a multiple of m, so is L, so that col′ ≡ 0 mod m for every value of col′. Also, row ≡ p mod m. Consequently, col′ + row and col′ + row + L/2 are each congruent to p mod m. Thus, the position of every data element accessed by processor p is congruent to p mod m.

Now, note that there are n/L iterations of the col′ loop. Because L/2 ≥ m, for each p and col′, there are L/(2m) iterations of the row loop. Each of these iterations does the butterfly operation on two data elements of array x. Note that

   (n/L) ∗ (L/(2m)) ∗ 2 = n/m = psize.

Thus, for each iteration of the q loop, processor p does an assignment to psize data elements. We now note that these are distinct elements, since for each value of row and col′, different data elements are accessed. In conclusion, the elements accessed by processor p during each iteration of the q loop are the set of elements x(i) such that i ≡ p mod m. Using Fortran90-like start-stop-stride notation, these elements can be described as x(p : n−1 : m).
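This counting argument can be verified directly for a small instance. The following Python sketch is an illustrative aside (parameter values are ours; `bp` encodes breakpoint = log₂(psize) + 1): for every processor p and every stage q of the second q loop, the accessed positions are exactly the cyclic slice x(p : n−1 : m).

```python
n, m = 64, 4
psize = n // m
bp = psize.bit_length()            # breakpoint = log2(psize) + 1
t = n.bit_length() - 1             # n = 2**t

for p in range(m):
    for q in range(bp, t + 1):
        L = 2 ** q
        touched = set()
        for col in range(0, n, L):
            for row in range(p, L // 2, m):
                touched.update((col + row, col + row + L // 2))
        # Exactly the cyclic slice x(p : n-1 : m), psize elements in all.
        assert touched == set(range(p, n, m))
```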
This is a cyclic distribution of the responsibility for array x, corresponding to the cyclic approach to the second p loop, as described in Section 4.3.

5.1.3. Construction of Messages. We now consider the construction of messages between processors for distributed computation. This entails both the bundling of data values into messages, and having both the sender and recipient of each message specify which of its data elements are included in each message. The data can be initially stored in various ways, such as on disks, distributed among the processors, or all on one processor. Each of these cases can be readily handled. Here, we envision that the data is initially centralized on one processor. In describing the details of message passing, we assume that the numbering of processors from 0 to m − 1 is such that initially all the data is on processor 0.

Before the first q loop, as per the block approach, the initially centralized data must be distributed among the processors. The processor with all the data must send every other processor, otherp, the set of psize contiguous data elements beginning with data element x(psize∗otherp), and each such processor must receive these data elements. The recipient processor then stores the received psize data values in the appropriate position in its x vector. A high-level description of message-passing code to perform the block distribution is shown in Figure 5.2. Since this code refers to the variable myid, the same code can be used on each processor. Processor 0 will execute the otherp loop, sending a message to each of the other processors. Each processor other than processor 0 will execute a RECEIVE command. The SEND command sends a message, and has three parameters: the first element to be sent, the number of elements to send, and the processor which should receive the message.
The RECEIVE command receives a message, and has three parameters: where to place the first element received, the number of elements to receive, and the processor which sent the message to be received. In Figure 5.2, we refer to the local data array as x_myid, to emphasize that it is the local space for the data array that is the source and destination of data sent and received by a given processor.

We assume that the SEND command is nonblocking, and uses buffering. Thus, the contents of the message to be sent are copied into a buffer, at which point the statement after the SEND command can be executed. We assume that the RECEIVE command is blocking. Thus, the data is received and stored in the locations specified by the command, at which point the statement after the RECEIVE command can be executed.

   ! DISTRIBUTE_CENTRALIZED_TO_BLOCK(x_myid,m,psize)
   if myid = 0 then
      do otherp = 1,m-1
         SEND(x_myid(psize*otherp),psize,otherp)
      end do
   else
      RECEIVE(x_myid(psize*myid),psize,0)
   endif

Fig. 5.2. Centralized to Block Distribution

Now consider the block to cyclic redistribution that occurs after the first q loop and before the second q loop. Recall that each processor has psize data elements. Of the psize elements on any given processor before the redistribution, psize/m must wind up on each of the processors (including itself) after the redistribution. No message need be sent for the data values that are to wind up on the same processor as before the redistribution. Thus, each processor must send psize/m elements to each of the other processors. A high-level description of message-passing code to perform the block to cyclic redistribution is shown in Figure 5.3.
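The centralized-to-block distribution of Figure 5.2 can be simulated in a few lines. The following Python sketch is an illustrative aside, not a message-passing implementation: each "processor" is a list standing in for its local space x_myid, and a SEND/RECEIVE pair is modeled as a recorded copy between lists.

```python
n, m = 32, 4
psize = n // m
data = list(range(n))              # stand-in data: x(i) = i

# Local spaces: processor 0 starts with all the data (centralized).
x = [[None] * n for _ in range(m)]
x[0][:] = data

# Fig. 5.2: processor 0 sends each other processor its block of psize
# contiguous elements, which the recipient stores at the same positions.
messages = []
for otherp in range(1, m):
    payload = x[0][psize * otherp : psize * (otherp + 1)]
    messages.append((0, otherp, len(payload)))
    x[otherp][psize * otherp : psize * (otherp + 1)] = payload

assert len(messages) == m - 1      # one message per non-central processor
for p in range(m):
    assert x[p][psize * p : psize * (p + 1)] == data[psize * p : psize * (p + 1)]
```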
We assume that the SEND and RECEIVE commands are flexible enough that the first parameter can specify a set of array elements using notation specifying start, stop, and stride. The SEND commands can be nonblocking, because for any given processor, disjoint sets of data elements are involved in any pair of message commands on that processor. An alternative to Figure 5.3 is to first do all the SENDs, and then do all the RECEIVEs.

   ! DISTRIBUTE_BLOCK_TO_CYCLIC(x_myid,m,psize)
   do otherp = 0,m-1
      if otherp ≠ myid then
         SEND(x_myid(psize*myid+otherp:psize*(myid+1)-1:m),psize/m,otherp)
         RECEIVE(x_myid(psize*otherp+myid:psize*(otherp+1)-1:m),psize/m,otherp)
      endif
   end do

Fig. 5.3. Block to Cyclic Redistribution

Another alternative to Figure 5.3 is to first collect all the data at the centralized site, i.e., processor 0, and then do a centralized to cyclic redistribution. This strategy is described in more detail in Figure 5.4, where the block to cyclic redistribution is done as a block to centralized redistribution, followed by a centralized to cyclic redistribution. The approach in Figure 5.4 entails only 2(m − 1) messages, in contrast to the m(m − 1) messages entailed in Figure 5.3. However, the centralized site may become a bottleneck. Moreover, data elements are transmitted twice, to and from the centralized site, rather than only once.

We envision that at the end of the FFT computation, all the data elements are collected at the centralized processor, processor 0. A high-level description of message-passing code to perform this cyclic to centralized redistribution is shown in Figure 5.5.

   ! DISTRIBUTE_BLOCK_TO_CYCLIC(x_myid,m,psize)
   DISTRIBUTE_BLOCK_TO_CENTRALIZED(x_myid,m,psize)
   DISTRIBUTE_CENTRALIZED_TO_CYCLIC(x_myid,m,psize,n)

   ! DISTRIBUTE_BLOCK_TO_CENTRALIZED(x_myid,m,psize)
   if myid = 0 then
      do otherp = 1,m-1
         RECEIVE(x_myid(psize*otherp),psize,otherp)
      end do
   else
      SEND(x_myid(psize*myid),psize,0)
   endif

   ! DISTRIBUTE_CENTRALIZED_TO_CYCLIC(x_myid,m,psize,n)
   if myid = 0 then
      do otherp = 1,m-1
         SEND(x_myid(otherp:n-1:m),psize,otherp)
      end do
   else
      RECEIVE(x_myid(myid:n-1:m),psize,0)
   endif

Fig. 5.4. Block to Cyclic Redistribution Using Centralized Site as Intermediary

   ! DISTRIBUTE_CYCLIC_TO_CENTRALIZED(x_myid,m,psize,n)
   if myid = 0 then
      do otherp = 1,m-1
         RECEIVE(x_myid(otherp:n-1:m),psize,otherp)
      end do
   else
      SEND(x_myid(myid:n-1:m),psize,0)
   endif

Fig. 5.5. Cyclic to Centralized Redistribution

5.1.4. Per-Processor Distributed Code Template. Figure 5.6 shows how the partial parallel code template from Figure 5.1 is transformed into a template that specifies where data is located and how it moves between processors. In Figure 5.6, we refer to the arrays involved in the template as x_myid and weight_myid, to emphasize that space local to a given processor is being accessed by the per-processor template. We assume that scalar variables, such as q and L, are understood to be local variables within each processor. As discussed in Section 5.1.2, the distributed code template will perform a data redistribution from the initial centralized distribution to a block distribution, followed by the first q loop, followed by a block to cyclic redistribution, followed by the second q loop (which begins with q equal to breakpoint), followed by a cyclic to centralized redistribution. The code in Figure 5.6 is executed on each of the m processors. The DISTRIBUTE_BLOCK_TO_CYCLIC command in Figure 5.6 serves as the mechanism implementing the SYNCHRONIZE command in Figure 5.1.
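The trade-off between the direct exchange of Figure 5.3 and the centralized route of Figure 5.4 can be made concrete. The following Python sketch is an illustrative aside (local spaces are lists, and each SEND/RECEIVE pair is modeled as one counted copy): both strategies produce the same cyclic layout, with m(m − 1) messages for the direct exchange and 2(m − 1) via the centralized site.

```python
n, m = 32, 4
psize = n // m
data = list(range(n))              # stand-in data: x(i) = i

def block_layout():
    # Each processor's local space, holding only its block of psize elements.
    x = [[None] * n for _ in range(m)]
    for p in range(m):
        x[p][psize * p : psize * (p + 1)] = data[psize * p : psize * (p + 1)]
    return x

# Fig. 5.3: direct block-to-cyclic exchange.
x = block_layout()
direct_msgs = 0
for p in range(m):                 # sender
    for otherp in range(m):        # receiver
        if otherp != p:
            x[otherp][psize * p + otherp : psize * (p + 1) : m] = x[p][psize * p + otherp : psize * (p + 1) : m]
            direct_msgs += 1

# Fig. 5.4: the same redistribution through the centralized site.
y = block_layout()
via_msgs = 0
for otherp in range(1, m):         # block to centralized
    y[0][psize * otherp : psize * (otherp + 1)] = y[otherp][psize * otherp : psize * (otherp + 1)]
    via_msgs += 1
for otherp in range(1, m):         # centralized to cyclic
    y[otherp][otherp::m] = y[0][otherp::m]
    via_msgs += 1

assert direct_msgs == m * (m - 1) and via_msgs == 2 * (m - 1)
for p in range(m):
    # Either way, processor p ends up holding its cyclic slice x(p : n-1 : m).
    assert x[p][p::m] == data[p::m] == y[p][p::m]
```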
The block-to-cyclic redistribution can be performed either as described in Figure 5.3 or as described in Figure 5.4. In either case, the blocking RECEIVEs do the synchronization.

DISTRIBUTE_CENTRALIZED_TO_BLOCK(x_myid, m, psize)
do q = 1, breakpoint-1
   L = 2**q
   do row = 0, L/2-1
      weight_myid(row) = EXP((2*pi*i*row)/L)
   end do
   do col' = myid*psize, (myid+1)*psize-1, L
      do row = 0, L/2-1
         c = weight_myid(row) * x_myid(col'+row+L/2)
         d = x_myid(col'+row)
         x_myid(col'+row) = d + c
         x_myid(col'+row+L/2) = d - c
      end do
   end do
end do
DISTRIBUTE_BLOCK_TO_CYCLIC(x_myid, m, psize)
do q = breakpoint, t
   L = 2**q
   do row = 0, L/2-1
      weight_myid(row) = EXP((2*pi*i*row)/L)
   end do
   do col' = 0, n-1, L
      do row = myid, L/2-1, m
         c = weight_myid(row) * x_myid(col'+row+L/2)
         d = x_myid(col'+row)
         x_myid(col'+row) = d + c
         x_myid(col'+row+L/2) = d - c
      end do
   end do
end do
DISTRIBUTE_CYCLIC_TO_CENTRALIZED(x_myid, m, psize, n)

Fig. 5.6. Per-Processor Distributed Code Template Based on Partial Template of Figure 5.1

5.2. Distributed Template with Partitioned Data and Contiguous Access.

5.2.1. Development of Template from Plan Using Partitioned Data. Figure 5.7 shows a per-processor template obtained from Figure 4.9's parallel combined computation plan using partitioned data. The template in Figure 5.7 assumes that all n data elements are initially located on processor 0 and stored in array x_0. It also assumes that at the end of the algorithm, all n data elements are to be collected together on processor 0 and stored in array x_0. Each processor p is assumed to have its own local arrays weight_p, xblock_p, weightcyclic_p, and xcyclic_p. The iteration over the p loops in Figure 4.9 is replaced by the execution of the per-processor template in Figure 5.7 on each of the m processors.
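The two-phase butterfly loop nest used in these templates (Figure 5.6) can be simulated in a single process by iterating the per-processor work of all m processors in turn. The sketch below is our illustration, with names of our own choosing: with the template's weight convention EXP(+2*pi*i*row/L) and input in bit-reversed order, the loops compute the unnormalized inverse DFT, which can be checked against a direct summation.

```python
import cmath

def bit_reverse(x):
    """Permute x into bit-reversed index order (the radix-2 DIT input order)."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, '0%db' % bits)[::-1], 2)] for i in range(n)]

def fft_template(x, m, breakpoint_q):
    """Single-process sketch of Figure 5.6's two-phase butterfly loop nest,
    with the message passing elided: phase one iterates over the m block
    ranges, phase two over the m cyclic row ranges, so the arithmetic and
    iteration spaces mirror the per-processor template."""
    x = list(x)
    n = len(x)
    t = n.bit_length() - 1              # n = 2**t
    psize = n // m
    # First q loop: block distribution, "processor" myid owns one block.
    for q in range(1, breakpoint_q):
        L = 2 ** q
        weight = [cmath.exp(2j * cmath.pi * row / L) for row in range(L // 2)]
        for myid in range(m):
            for col in range(myid * psize, (myid + 1) * psize, L):
                for row in range(L // 2):
                    c = weight[row] * x[col + row + L // 2]
                    d = x[col + row]
                    x[col + row] = d + c
                    x[col + row + L // 2] = d - c
    # Second q loop: cyclic distribution, myid takes rows myid, myid+m, ...
    for q in range(breakpoint_q, t + 1):
        L = 2 ** q
        weight = [cmath.exp(2j * cmath.pi * row / L) for row in range(L // 2)]
        for myid in range(m):
            for col in range(0, n, L):
                for row in range(myid, L // 2, m):
                    c = weight[row] * x[col + row + L // 2]
                    d = x[col + row]
                    x[col + row] = d + c
                    x[col + row + L // 2] = d - c
    return x
```

With n = 16, m = 4, and breakpoint chosen so that phase one never crosses a block boundary (L ≤ psize), `fft_template(bit_reverse(x), 4, 3)` reproduces the unnormalized inverse DFT of x.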
A high-level description of message-passing code to perform the four data redistribution commands in Figure 5.7 (centralized-to-block-partitioned, block-partitioned-to-centralized, centralized-to-cyclic-partitioned, and cyclic-partitioned-to-centralized) is shown in Figure 5.8.

DISTRIBUTE_CENTRALIZED_TO_BLOCK_PARTITIONED(x_0, xblock_myid, m, psize)
do q = 1, breakpoint-1
   L = 2**q
   do row = 0, L/2-1
      weight_myid(row) = EXP((2*pi*i*row)/L)
   end do
   do col'' = 0, psize-1, L
      do row = 0, L/2-1
         c = weight_myid(row) * xblock_myid(col''+row+L/2)
         d = xblock_myid(col''+row)
         xblock_myid(col''+row) = d + c
         xblock_myid(col''+row+L/2) = d - c
      end do
   end do
end do
DISTRIBUTE_BLOCK_PARTITIONED_TO_CENTRALIZED(xblock_myid, x_0, m, psize)
DISTRIBUTE_CENTRALIZED_TO_CYCLIC_PARTITIONED(x_0, xcyclic_myid, m, psize)
do q = breakpoint, t
   L = 2**q
   do row' = 0, L/(2*m)-1, 1
      weightcyclic_myid(row') = EXP((2*pi*i*(m*row'+myid))/L)
   end do
   do col'' = 0, psize-1, L/m
      do row' = 0, L/(2*m)-1, 1
         c = weightcyclic_myid(row') * xcyclic_myid(col''+row'+L/(2*m))
         d = xcyclic_myid(col''+row')
         xcyclic_myid(col''+row') = d + c
         xcyclic_myid(col''+row'+L/(2*m)) = d - c
      end do
   end do
end do
DISTRIBUTE_CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic_myid, x_0, m, psize)

Fig. 5.7. Per-Processor Distributed Code Template Obtained from Combined Plan Using Partitioned Data of Figure 4.9

5.2.2. Eliminating Bottleneck in Block to Cyclic Redistribution. In Figure 5.7, the pair of data redistribution commands

DISTRIBUTE_BLOCK_PARTITIONED_TO_CENTRALIZED(xblock_myid, x_0, m, psize)
DISTRIBUTE_CENTRALIZED_TO_CYCLIC_PARTITIONED(x_0, xcyclic_myid, m, psize)

occurring between the two q loops together accomplish a block-partitioned-to-cyclic-partitioned data redistribution.
This two-command sequence can be replaced by the single command

DISTRIBUTE_BLOCK_PARTITIONED_TO_CYCLIC_PARTITIONED(xblock_myid, xcyclic_myid, m, psize)

This combined command can be implemented by doing a block-partitioned-to-centralized redistribution followed by a centralized-to-cyclic-partitioned redistribution, as indicated in Figure 5.7, in a manner analogous to that shown in Figure 5.4. Alternatively, the block-partitioned-to-cyclic-partitioned redistribution can be done directly, as shown in Figure 5.9, which is analogous to Figure 5.3. The direct implementation removes the centralized site as a bottleneck, but increases the number of messages from 2(m − 1) to m(m − 1), as discussed in Section 5.1.3.

!  DISTRIBUTE_CENTRALIZED_TO_BLOCK_PARTITIONED(x_0, xblock_myid, m, psize)
if myid = 0 then
   xblock_myid = x_0(0:psize-1)
   do otherp = 1, m-1
      SEND(x_0(psize*otherp), psize, otherp)
   end do
else
   RECEIVE(xblock_myid, psize, 0)
endif

!  DISTRIBUTE_BLOCK_PARTITIONED_TO_CENTRALIZED(xblock_myid, x_0, m, psize)
if myid = 0 then
   x_0(0:psize-1) = xblock_myid
   do otherp = 1, m-1
      RECEIVE(x_0(psize*otherp), psize, otherp)
   end do
else
   SEND(xblock_myid, psize, 0)
endif

!  DISTRIBUTE_CENTRALIZED_TO_CYCLIC_PARTITIONED(x_0, xcyclic_myid, m, psize)
if myid = 0 then
   xcyclic_myid = x_0(0:n-1:m)
   do otherp = 1, m-1
      SEND(x_0(otherp:n-1:m), psize, otherp)
   end do
else
   RECEIVE(xcyclic_myid, psize, 0)
endif

!  DISTRIBUTE_CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic_myid, x_0, m, psize)
if myid = 0 then
   x_0(0:n-1:m) = xcyclic_myid
   do otherp = 1, m-1
      RECEIVE(x_0(otherp:n-1:m), psize, otherp)
   end do
else
   SEND(xcyclic_myid, psize, 0)
endif

Fig. 5.8. Code for Distribution Commands Occurring in Figure 5.7's Per-Processor Template Using Partitioned Data

6. Development of All-Processor Templates for Shared Memory Architectures with Local Memory Availability.

!
DISTRIBUTE_BLOCK_PARTITIONED_TO_CYCLIC_PARTITIONED(xblock_myid, xcyclic_myid, m, psize)
xcyclic_myid(myid*psize/m : (myid+1)*psize/m - 1) = xblock_myid(myid : psize-1 : m)
do otherp = 0, m-1
   if otherp ≠ myid then
      SEND(xblock_myid(otherp : psize-1 : m), psize/m, otherp)
      RECEIVE(xcyclic_myid(otherp*psize/m), psize/m, otherp)
   endif
end do

Fig. 5.9. Block-Partitioned to Cyclic-Partitioned Redistribution

6.1. Shared Memory Template. The parallel computation plan of Figure 4.3 can be converted into a shared memory template by converting each of the two p loops into a parallel loop, as shown in Figure 6.1. The template in Figure 6.1 is expressed in a style roughly based on OpenMP. We assume that there is a thread for each iteration of a parallel do loop, and that each thread can have data that is private to that thread. Moreover, we assume that each thread is assigned to a processor (with the possibility that multiple threads are assigned to any given processor), and data that is private to a given thread is stored in the local memory of the processor for that thread. In the template, we use the notational convention that x is a shared data array, and that scalar variables, such as q, are private to each thread. We also assume that each thread p has a private array weight_p for the weights used in the execution of that thread. Within each thread, the assignments to scalar variables and to elements of weight_p modify private data. The assignments to elements of shared array x modify shared data, but because we are starting with the computation plan of Figure 4.3, the elements of x accessed by the various threads of a given parallel do loop are disjoint.

6.2. Shared Memory Template Using Private Local Memory. In Figure 6.1, all the accesses to x are to a shared data vector.
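The disjoint-access property of the first parallel do in Figure 6.1 can be sketched with Python threads sharing one array: since each thread updates only the elements of its own block, no locking is needed. This is our own illustration, not the paper's code, and the names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import cmath

def parallel_stage(x, q, m):
    """One iteration space of the first parallel do of Figure 6.1: m threads
    share the array x, and thread p performs the stage-q butterflies of its
    own block only, so the threads write disjoint elements of x."""
    n = len(x)
    psize = n // m
    L = 2 ** q

    def body(p):
        # weight_p is private to the thread, as in the template.
        weight = [cmath.exp(2j * cmath.pi * row / L) for row in range(L // 2)]
        for col in range(p * psize, (p + 1) * psize, L):
            for row in range(L // 2):
                c = weight[row] * x[col + row + L // 2]
                d = x[col + row]
                x[col + row] = d + c
                x[col + row + L // 2] = d - c

    with ThreadPoolExecutor(max_workers=m) as pool:
        list(pool.map(body, range(m)))   # "end parallel do" = join
    return x
```

For q = 1 each pair (a, b) becomes (a + b, a - b), and the result is identical to running the m bodies sequentially, since their index sets are disjoint.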
The template can be modified so that the butterfly operates on private data, as shown in the all-processor template of Figure 6.2, which uses both a shared memory array x and a private memory array x_p. Figure 6.3 shows the details of how the data copying is done. The OpenMP-style all-processor template uses assignment statements within each thread to copy data between shared memory and the private memory of that thread. Figure 6.2 is an all-processor analogue of the per-processor distributed code template shown in Figure 5.6. The data copying via assignment statements in Figure 6.3 is analogous to the data copying via message passing in Figures 5.2, 5.4, and 5.5. An OpenMP-style all-processor template permits a given thread to copy data between shared memory and its own private memory, but does not permit direct copying of data from the private memory of one thread to the private memory of a different thread. Accordingly, the DISTRIBUTE_BLOCK_TO_CYCLIC command from Figure 5.6 must be done as two commands in Figure 6.2 (private block to centralized at the end of the first parallel do, followed by centralized to private cyclic at the beginning of the second parallel do), corresponding to the data movement in Figure 5.4 rather than that in Figure 5.3.

6.3. Shared Memory Template Using Partitioned Data and Contiguous Access.
Figure 6.4 shows an OpenMP-style all-processor template for a shared memory, based on the plan of Figure 4.9.

parallel do p = 0, m-1
   do q = 1, breakpoint-1
      L = 2**q
      do row = 0, L/2-1
         weight_p(row) = EXP((2*pi*i*row)/L)
      end do
      do col' = p*psize, (p+1)*psize-1, L
         do row = 0, L/2-1
            c = weight_p(row) * x(col'+row+L/2)
            d = x(col'+row)
            x(col'+row) = d + c
            x(col'+row+L/2) = d - c
         end do
      end do
   end do
end parallel do
parallel do p = 0, m-1
   do q = breakpoint, t
      L = 2**q
      do row = 0, L/2-1
         weight_p(row) = EXP((2*pi*i*row)/L)
      end do
      do col' = 0, n-1, L
         do row = p, L/2-1, m
            c = weight_p(row) * x(col'+row+L/2)
            d = x(col'+row)
            x(col'+row) = d + c
            x(col'+row+L/2) = d - c
         end do
      end do
   end do
end parallel do

Fig. 6.1. Simple Shared Memory All-Processor Template Obtained from Plan of Figure 4.3

We assume that there is a thread for each processor, and data that is private to a given thread is stored in the local memory of the processor for that thread. In the template, we use the notational convention that x is a shared data array; that weight_p, weightcyclic_p, xblock_p, and xcyclic_p denote private data arrays for thread p; and that scalar variables are private to each thread. We assume that a given thread can access only shared data and its private data. Consequently, the block-to-cyclic redistribution cannot be done by moving data directly from a private array in one thread into a private array in a different thread. In Figure 6.4, the block-to-cyclic redistribution is done in two steps. In the first step, each thread of the first p loop does a block-to-centralized copying of data from its private array xblock_p into shared array x. Consequently, when the first p loop concludes, the data from all the private xblock_p arrays has been copied into shared array x.
In the second step, each thread of the second p loop does a centralized-to-cyclic copying of data from x into its private array xcyclic_p.

parallel do p = 0, m-1
   COPY_CENTRALIZED_TO_PRIVATE_BLOCK(x, x_p, p, psize)
   do q = 1, breakpoint-1
      L = 2**q
      do row = 0, L/2-1
         weight_p(row) = EXP((2*pi*i*row)/L)
      end do
      do col' = p*psize, (p+1)*psize-1, L
         do row = 0, L/2-1
            c = weight_p(row) * x_p(col'+row+L/2)
            d = x_p(col'+row)
            x_p(col'+row) = d + c
            x_p(col'+row+L/2) = d - c
         end do
      end do
   end do
   COPY_PRIVATE_BLOCK_TO_CENTRALIZED(x_p, x, p, psize)
end parallel do
parallel do p = 0, m-1
   COPY_CENTRALIZED_TO_PRIVATE_CYCLIC(x, x_p, p, m, psize)
   do q = breakpoint, t
      L = 2**q
      do row = 0, L/2-1
         weight_p(row) = EXP((2*pi*i*row)/L)
      end do
      do col' = 0, n-1, L
         do row = p, L/2-1, m
            c = weight_p(row) * x_p(col'+row+L/2)
            d = x_p(col'+row)
            x_p(col'+row) = d + c
            x_p(col'+row+L/2) = d - c
         end do
      end do
   end do
   COPY_PRIVATE_CYCLIC_TO_CENTRALIZED(x_p, x, p, m, psize)
end parallel do

Fig. 6.2. All-Processor Shared and Private Memory Template Based on Template of Figure 6.1

Figure 6.5 shows the details of the assignment statements that do the data copying between shared memory and the private memory of each thread.

7. Preliminary Experimental Results. There are various commercial and vendor-specific libraries which include the FFT. The Numerical Algorithms Group (NAG) and Visual Numerics (IMSL) provide and support finely tuned scientific libraries specific to various HPC platforms, e.g. SGI and IBM. The SGI and IBM libraries, SCSL and ESSL, are even more highly tuned, due to their knowledge of proprietary information specific to their respective architectures. We ran experiments comparing our code to that in some of these libraries. We also did comparisons with the FFTW, although these results are harder to interpret because of the planning phase that is
!  COPY_CENTRALIZED_TO_PRIVATE_BLOCK(x, x_p, p, psize)
x_p(psize*p : psize*(p+1)-1 : 1) = x(psize*p : psize*(p+1)-1 : 1)

!  COPY_PRIVATE_BLOCK_TO_CENTRALIZED(x_p, x, p, psize)
x(psize*p : psize*(p+1)-1 : 1) = x_p(psize*p : psize*(p+1)-1 : 1)

!  COPY_CENTRALIZED_TO_PRIVATE_CYCLIC(x, x_p, p, m, psize)
x_p(p : n-1 : m) = x(p : n-1 : m)

!  COPY_PRIVATE_CYCLIC_TO_CENTRALIZED(x_p, x, p, m, psize)
x(p : n-1 : m) = x_p(p : n-1 : m)

Fig. 6.3. Data Copying for All-Processor Shared and Private Memory Template of Figure 6.2

part of the FFTW. Our experiments are reported in [14, 15, 51]. Here, we give results from [15], where more details are available.

7.1. Experimental Environment. Our experiments were run on two systems:
1. An SGI/Cray Origin2000 at NCSA in Illinois, with 48 195MHz R10000 processors and 14GB of memory. The L1 cache size is 64KB (32KB Icache and 32KB Dcache). The Origin2000 has a 4MB L2 cache. The OS is IRIX 6.5.
2. An IBM SP-2 at the Maui High Performance Computing Center, with 32 160MHz P2SC processors and 1GB of memory. The L1 cache size is 160KB (32KB Icache and 128KB Dcache), and there is no L2 cache. The OS is AIX 4.3.

We tested the math libraries available on both machines: on the Origin 2000, the IMSL Fortran Numerical Libraries version 3.01, NAG version Mark 19, and SGI's SCSL library; and on the SP-2, IBM's ESSL library. We also ran the FFTW on both machines.

7.2. Experiments. Experiments on the Origin 2000 were run using bsub, SGI's batch processing environment. Similarly, experiments on the SP-2 were run using the loadleveler batch processing environment. In both cases we used dedicated networks and processors. For each vector size (2^3 to 2^24), experiments were repeated a minimum of three times and averaged. For improved optimizations, vendor compilers were used with the -O3 and -Inline flags.
We used Perl scripts to automatically compile, run, and time all experiments, and to plot our results for the various problem sizes.

7.3. Evaluation of Results. Our results on the Origin 2000 and the SP-2, as shown in Figures 7.1 and 7.2 and Tables 7.1 and 7.2, indicate a performance improvement over standard libraries for many problem sizes, depending on the particular library.

7.3.1. Origin 2000 Results. Performance results for our monolithic FFT, which we call here FFT-UA, indicate a doubling of time when the vector size is doubled, for all vector sizes tried. IMSL's times likewise doubled up to 2^19. At 2^19

(This work was partially supported by the National Computational Science Alliance, and utilized the NCSA SGI/Cray Origin2000. We thank the Maui High Performance Computing Center for access to their IBM SP-2.)

parallel do p = 0, m-1
   COPY_CENTRALIZED_TO_PRIVATE_BLOCK_PARTITIONED(x, xblock_p, m, psize, p)
   do q = 1, breakpoint-1
      L = 2**q
      do row = 0, L/2-1
         weight_p(row) = EXP((2*pi*i*row)/L)
      end do
      do col'' = 0, psize-1, L
         do row = 0, L/2-1
            c = weight_p(row) * xblock_p(col''+row+L/2)
            d = xblock_p(col''+row)
            xblock_p(col''+row) = d + c
            xblock_p(col''+row+L/2) = d - c
         end do
      end do
   end do
   COPY_PRIVATE_BLOCK_PARTITIONED_TO_CENTRALIZED(xblock_p, x, p, m, psize, n)
end parallel do
parallel do p = 0, m-1
   COPY_CENTRALIZED_TO_PRIVATE_CYCLIC_PARTITIONED(x, xcyclic_p, m, psize, p, n)
   do q = breakpoint, t
      L = 2**q
      do row' = 0, L/(2*m)-1, 1
         weightcyclic_p(row') = EXP((2*pi*i*(m*row'+p))/L)
      end do
      do col'' = 0, psize-1, L/m
         do row' = 0, L/(2*m)-1, 1
            c = weightcyclic_p(row') * xcyclic_p(col''+row'+L/(2*m))
            d = xcyclic_p(col''+row')
            xcyclic_p(col''+row') = d + c
            xcyclic_p(col''+row'+L/(2*m)) = d - c
         end do
      end do
   end do
   COPY_PRIVATE_CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic_p, x, m, psize, p, n)
end parallel do

Fig. 6.4.
All-Processor Shared Memory and Private Partitioned Data Template Obtained from Combined Plan of Figure 4.9

there is a 400% degradation in performance, presumably due to a change in the use of the memory hierarchy. For NAG this degradation begins at 2^18. The SGI library (SCSL) does not exhibit this degradation. SCSL may be doing machine-specific optimizations, perhaps using more sophisticated out-of-core techniques similar to those described by Cormen [53], as evidenced by nearly identical performance times for 2^17 and 2^18.

7.3.2. SP-2 Results. FFT-UA outperforms ESSL for vector sizes up to 2^14, except for two cases. For 2^15 and 2^16, the performance is slightly worse. ESSL does increasingly better as the problem size increases. The FFT-UA times continue to double from 2^17 to 2^23, reflecting the uniformity and simplicity of the code. The FFT-UA code is machine-independent, and relies on the efficiency of the compiler used. Presumably, the ESSL code is tuned to the machine architecture, using machine-specific optimizations, and should be expected to perform better.

!  COPY_CENTRALIZED_TO_PRIVATE_BLOCK_PARTITIONED(x, xblock_p, p, psize)
xblock_p = x(psize*p : psize*(p+1)-1 : 1)

!  COPY_PRIVATE_BLOCK_PARTITIONED_TO_CENTRALIZED(xblock_p, x, m, psize, p)
x(psize*p : psize*(p+1)-1 : 1) = xblock_p

!  COPY_CENTRALIZED_TO_PRIVATE_CYCLIC_PARTITIONED(x, xcyclic_p, m, psize, p, n)
xcyclic_p = x(p : n-1 : m)

!  COPY_PRIVATE_CYCLIC_PARTITIONED_TO_CENTRALIZED(xcyclic_p, x, m, psize, p, n)
x(p : n-1 : m) = xcyclic_p

Fig. 6.5. Data Copying for All-Processor Shared Memory and Private Partitioned Data Template of Figure 6.4

[Figure 7.1: plot of percent improvement (-100 to 100) of FFT-UA versus log2 of the vector size (4 to 24) on the Origin 2000, with curves for the SGI library, NAG, IMSL, and FFTW.]

Fig. 7.1. Percent improvement of FFT-UA over library code and FFTW on Origin 2000
The ESSL code failed for problem sizes 2^22 and higher, whereas FFT-UA successfully processed problem sizes through 2^23, four times larger than the maximum handled by ESSL. The FFTW code failed for problem sizes 2^23 and higher.

Table 7.1. Real Execution Times of FFT-UA and Comparative Libraries on Origin 2000 (seconds)

   Size   FFT-UA   IMSL    NAG     SCSL
   2^3    0.019    0.064   0.010   0.065
   2^4    0.018    0.061   0.010   0.047
   2^5    0.018    0.062   0.010   0.065
   2^6    0.017    0.116   0.011   0.073
   2^7    0.019    0.063   0.010   0.068
   2^8    0.018    0.062   0.011   0.105
   2^9    0.017    0.122   0.011   0.069
   2^10   0.021    0.065   0.013   0.056
   2^11   0.021    0.064   0.016   0.058
   2^12   0.021    0.067   0.023   0.067
   2^13   0.022    0.075   0.036   0.065
   2^14   0.024    0.144   0.065   0.066
   2^15   0.030    0.120   0.135   0.110
   2^16   0.040    0.209   0.296   0.080
   2^17   0.065    0.335   0.696   0.072
   2^18   0.126    0.829   3.205   0.075
   2^19   0.238    3.007   9.538   0.096
   2^20   0.442    9.673   18.40   0.143
   2^21   0.884    23.36   38.93   0.260
   2^22   1.910    46.70   92.75   0.396
   2^23   4.014    109.4   187.7   0.671
   2^24   7.550    221.1   442.7   1.396

7.4. Conclusions from Experiments. As evidence of the potential value of the uniform program design methodology outlined here, we constructed a machine-independent, portable solution to the complex problem of faster and bigger FFTs. Reproducible experiments indicate that our designs outperform IMSL in all cases, NAG for sizes greater than 2^11, SCSL for sizes less than 2^18, and ESSL in some cases. Our single, portable, scalable executable of approximately 2,600 bytes also must be compared with the large suite of machine-specific software required by NAG, IMSL, SCSL, and ESSL. The user of these machine-specific libraries must have a deep understanding of the package itself and the system on which it runs. Many of these packages require numerous parameters: fifteen for ESSL, and eight for SCSL.
An incorrect decision for these parameters can result in poor performance and possibly even incorrect results. In comparison, a naive user of FFT-UA need only know the vector size, on any platform.

8. Conclusions, Extensions, and Future Research. We have outlined a semi-mechanical methodology for developing efficient scientific software, centered on interactively developed sequences of algorithm transformations. We systematically improved the efficiency of the sequential expression of the high-level specification of the FFT algorithm, and formulated processor mapping and data decomposition strategies. As a phase of this methodology, abstract plans were constructed to specify which computations are to be done by each processor. Subsequently, templates were created attuned to various high-level architectural paradigms, e.g. shared and distributed memory multiprocessor computations. Parallel/distributed programs can be easily built from templates.

[Figure 7.2: plot of percent improvement (-100 to 100) of FFT-UA versus log2 of the vector size (4 to 24) on the IBM SP-2, with curves for ESSL and FFTW. ESSL fails for N = 2^22 and above; FFTW fails for N = 2^23 and above.]

Fig. 7.2. Percent improvement of FFT-UA over ESSL and FFTW on IBM SP-2

Table 7.2. Real Execution Times of FFT-UA and ESSL on IBM SP-2 (seconds)

   Size   FFT-UA   ESSL      Size   FFT-UA   ESSL
   2^3    0.010    0.013     2^14   0.040    0.043
   2^4    0.010    0.053     2^15   0.070    0.060
   2^5    0.010    0.013     2^16   0.140    0.120
   2^6    0.010    0.013     2^17   0.280    0.160
   2^7    0.010    0.013     2^18   0.580    0.310
   2^8    0.010    0.016     2^19   1.216    0.610
   2^9    0.010    0.010     2^20   2.580    1.276
   2^10   0.010    0.013     2^21   5.553    3.663
   2^11   0.013    0.010     2^22   12.12    Failed
   2^12   0.020    0.020     2^23   25.25    Failed
   2^13   0.033    0.030
Experimental comparisons indicate that our programs can be competitive in performance with software that has been hand-tuned to particular target architectures.

The algorithm variants developed here can be further optimized in several ways, consistent with our underlying methodology. Examples include tuning for shared memory, retiling to match the memory hierarchy, and better overlap of computation and communication. More generally, additional compiler-like analyses, transformations, and optimizations can be used to improve performance. This can be done during the various phases of the methodology. Finally, our proposed methodology for software development can be enhanced by a script-based mechanism for experimentally exploring the efficiency of alternative implementations for different ranges of problem sizes and different computing environments. The results of such experiments should prove useful in evaluating and guiding successive and/or alternative refinements.

The approach we have presented is similar in spirit to other efforts using libraries of algorithm building blocks based on C++ template classes, such as POOMA. Based on the formalism of A Mathematics of Arrays (MoA) and an indexing calculus (the ψ calculus), our approach enables the programmer to develop algorithms using high-level compiler-like optimizations through the ability to algebraically compose and reduce sequences of array operations. The resulting ONF is a prescription for code generation that can be directly translated into efficient code in the language of the programmer's choice, be it for a hardware or software application.

Appendices:

Appendix A. Elements of the theory.

A.1. Indexing and Shapes. The central operation of MoA is the indexing function p ψ A, in which a vector of n integers p is used to select an item of the n-dimensional array A.
The operation is generalized to select a partition of A, so that if q has only k < n components, then q ψ A is an array of dimensionality n − k, and q selects among the possible choices for the first k axes. In MoA, zero-origin indexing is assumed. For example, if A is the 3 by 5 by 4 array

    0  1  2  3
    4  5  6  7
    8  9 10 11
   12 13 14 15
   16 17 18 19

   20 21 22 23
   24 25 26 27
   28 29 30 31
   32 33 34 35
   36 37 38 39

   40 41 42 43
   44 45 46 47
   48 49 50 51
   52 53 54 55
   56 57 58 59

then

   <1> ψ A =  20 21 22 23
              24 25 26 27
              28 29 30 31
              32 33 34 35
              36 37 38 39

   <2 1> ψ A = <44 45 46 47>

   <2 1 3> ψ A = 47

Most of the common array manipulation operations found in languages like Fortran 90, Matlab, ZPL, etc., can be defined from ψ and a few elementary vector operations. We now introduce notation to permit us to define ψ formally and to develop the Psi Correspondence Theorem [54], which is central to the effective exploitation of MoA in array computations.

We will use A, B, ... to denote arrays of numbers (integer, real, or complex). An array's dimensionality will be denoted by d_A and will be assumed to be n if not specified. The shape of an array A, denoted by s_A, is a vector of integers of length d_A, each item giving the length of the corresponding axis. The total number of items in an array, denoted by t_A, is equal to the product of the items of the shape. The subscripts will be omitted in contexts where the meaning is obvious. A full index is a vector of n integers that describes one position in an n-dimensional array. Each item of a full index for A is less than the corresponding item of s_A. There are precisely t_A full indices for an array (due to the zero index origin). A partial index of A is a vector of k integers, 0 ≤ k < n, with each item less than the corresponding item of s_A. We will use a tuple notation (omitting commas) to describe vectors of a fixed length.
For example, <i j k> denotes a vector of length three. <> will denote the empty vector, which is also sometimes written as Θ. For every n-dimensional array A, there is a vector of the items of A, which we denote by the corresponding lower-case letter, here a. The length of the vector of items is t_A. A vector is itself a one-dimensional array, whose shape is the one-item vector holding the length. Thus, for a, the vector of items of A, the shape of a is s_a = <t_A>, and the number of items, or total number of components, of a is t_a = t_A. (We also use τa, δa, and ρa to denote the total number of components, the dimensionality, and the shape of a.)

The precise mapping of A to a is determined by a one-to-one ordering function, gamma (γ). Although the choice of ordering is arbitrary, it is essential in the following that a specific one be assumed. By convention we assume the items of A are placed in a according to the lexicographic ordering of the indices of A. This is often referred to as row major ordering. Many programming languages lay out the items of multidimensional arrays in memory in a contiguous segment using this ordering. Fortran uses the ordering corresponding to a transposed array in which the axes are reversed, i.e. column major ordering.

Scalars are introduced as arrays with an empty shape vector. There are two equivalent ways of describing an array A: (1) by its shape and the vector of items, i.e. A = {s_A, a}, or (2) by its shape and a function that defines the value at every index p. These two forms have been shown to be formally equivalent [55]. We wish to use the second form in defining functions on multidimensional arrays using their Cartesian coordinates (indices). The first form is used in describing address manipulations to achieve effective computation. To complete our notational conventions, we assume that p, q, ...
will be used to denote indices or partial indices, and that u, v, ... will be used to denote arbitrary vectors of integers. To describe the i-th item of a vector a, either a_i or a[i] will be used. If u is a vector of k integers all less than t_A, then a[u] will denote the vector of length k whose items are the items of a at positions u_j, j = 0, ..., k−1.

Before presenting the formal definition of the ψ indexing function, we define a few functions on vectors:

   u ++ v        catenation of vectors u and v
   u + v         item-wise vector addition, assuming t_u = t_v
   u × v         item-wise vector multiplication
   n + u, u + n  addition of a scalar to each item of a vector
   n × u, u × n  multiplication of each item of a vector by a scalar
   ι n           the vector of the first n integers, starting from 0
   π v           a scalar which is the product of the components of v
   k △ u         when k ≥ 0, the vector of the first k items of u (called take);
                 when k < 0, the vector of the last |k| items of u
   k ▽ u         when k ≥ 0, the vector of the t_u − k last items of u (called drop);
                 when k < 0, the vector of the first t_u − |k| items of u
   k θ u         when k ≥ 0, the vector (k ▽ u) ++ (k △ u);
                 when k < 0, the vector (k △ u) ++ (k ▽ u)

Definition A.1. Let A be an n-dimensional array and p a vector of integers. If p is an index of A,

   p ψ A = a[γ(s_A, p)],

where γ(s_A, p) = x_{n−1}, defined by the recurrence x_0 = p_0, x_j = x_{j−1} · s_j + p_j, j = 1, ..., n−1. If p is a partial index of length k < n,

   p ψ A = B,

where the shape of B is s_B = k ▽ s_A and, for every index q of B, q ψ B = (p ++ q) ψ A.

The definition uses the second form of specifying an array to define the result of a partial index. For the index case, the function γ(s, p) is used to convert an index p to an integer giving the location of the corresponding item in the row major order list of items of an array of shape s.
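Definition A.1 can be illustrated directly. The helper below is our own sketch (the name `psi` and the flat-list representation are our choices): it applies the γ recurrence to the leading axes of a row-major flat representation, returning a scalar for a full index and a (shape, items) pair for a partial index. It reproduces the <1>ψA, <2 1>ψA, and <2 1 3>ψA examples of Section A.1.

```python
from functools import reduce
from operator import mul

def psi(q, shape, a):
    """MoA's psi on the row-major flat list `a` of an array with `shape`:
    a full index returns a scalar; a partial index of length k < n returns
    (subshape, flat items of the selected (n-k)-dimensional partition)."""
    offset = 0
    for j, qj in enumerate(q):        # gamma recurrence on the leading axes
        assert 0 <= qj < shape[j]
        offset = offset * shape[j] + qj
    subshape = shape[len(q):]
    size = reduce(mul, subshape, 1)   # items per selected partition
    if not subshape:                  # full index: gamma gives the position
        return a[offset]
    return subshape, a[offset * size:(offset + 1) * size]
```

For the 3 by 5 by 4 array above, `psi([2, 1], [3, 5, 4], list(range(60)))` yields the vector <44 45 46 47>, exactly as in the worked examples.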
The recurrence computation for γ is the one used in most compilers for converting an index to a memory address [56].

Corollary A.2. <> ψ A = A.

The following theorem shows that a ψ selection with a partial index can be expressed as a composition of ψ selections.

Theorem A.3. Let A be an n-dimensional array and p a partial index such that p = q ++ r. Then p ψ A = r ψ (q ψ A).

Proof: The proof is a consequence of the fact that for vectors u, v, w, (u ++ v) ++ w = u ++ (v ++ w). If we extend p to a full index p ++ p′, then

   p′ ψ (p ψ A) = (p ++ p′) ψ A
                = ((q ++ r) ++ p′) ψ A
                = (q ++ (r ++ p′)) ψ A
                = (r ++ p′) ψ (q ψ A)
                = p′ ψ (r ψ (q ψ A)),

so p ψ A = r ψ (q ψ A), which completes the proof.

We can now use ψ to define other operations on arrays. For example, consider definitions of take and drop for multidimensional arrays.

Definition A.4 (take: △). Let A be an n-dimensional array, and k a non-negative integer such that 0 ≤ k < s_0. Then k △ A = B, where s_B = <k> ++ (1 ▽ s_A) and, for every index p of B, p ψ B = p ψ A. (In MoA, △ is also defined for negative integers, and is generalized to any vector u whose absolute-value vector is a partial index of A. The details are omitted here.)

Definition A.5 (reverse: Φ). Let A be an n-dimensional array. Then s_{ΦA} = s_A and, for every integer i, 0 ≤ i < s_0, <i> ψ ΦA = <s_0 − i − 1> ψ A.

This definition of Φ does a reversal of the 0th axis of A. Note also that all operations so far are over the 0th axis. The operator Ω [26] extends operations over all other dimensions.

Example. Consider the evaluation of the following expression using the 3 by 5 by 4 array A introduced in Section A.1:

   <1 2> ψ (2 △ ΦA)        (A.1)
The shape of the result is

2 ▽ s_{(2 △ Φ A)} = 2 ▽ (<2> ++ (1 ▽ s_{ΦA})) = 2 ▽ (<2> ++ (1 ▽ s_A)) = 2 ▽ (<2> ++ <5 4>) = 2 ▽ <2 5 4> = <4>.     (A.2)

The expression can be simplified using the definitions:

<1 2> ψ (2 △ Φ A) = <1 2> ψ Φ A = <2> ψ (<1> ψ Φ A) = <2> ψ (<3 − 1 − 1> ψ A) = <1 2> ψ A     (A.3)

This process of simplifying the expression for an item in terms of its Cartesian coordinates is called Psi Reduction. The operations of MoA have been designed so that all expressions can be reduced to a minimal normal form [26]. The MoA operations defined by ψ are summarized in Fig. A.1.

Appendix B. Higher Order Operations.

Thus far, operations on arrays, such as concatenation, rotation, etc., have been performed over their 0th dimensions. We introduce the higher-order binary operation Ω, which is defined when its left argument is a unary or binary operation and its right argument is a vector describing the dimension upon which operations are to be performed, or which sub-arrays are used in operations. The dimension upon which operations are to be performed is often called the axis of operation. The result of Ω is a unary or binary operation.

B.1. Definition of Ω. Ω is defined whenever its left argument is a unary or binary operation, f or g respectively (f and g include the outcomes of higher-order operations). Ω's right argument is a vector d such that ρd ≡ <1> or ρd ≡ <2>, depending on whether the operation is unary or binary. Commonly, f (or g) will be an operation that determines the shape of its result from the shapes of its arguments, not from the values of their entries; i.e., for all appropriate arguments, ρfξ is determined by ρξ, and ρ(ξ_l g ξ_r) is determined by ρξ_l and ρξ_r.

Definition 0: f Ω d is defined when f is a one-argument function and d ≡ <σ>, with σ ≥ 0.
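Concretely, f Ω <σ> applies f to every rank-σ sub-array of its argument and reassembles the results. The following Python sketch (our own illustrative code on nested row-major lists; omega_unary is not MoA vocabulary) anticipates the formal conditions developed below:

```python
def omega_unary(f, sigma, shape, a):
    # f Omega <sigma> on an array 'a' given as nested row-major lists of
    # the stated shape: apply f to every rank-sigma sub-array of 'a'.
    u = shape[:len(shape) - sigma]   # leading shape; trailing sigma axes go to f

    def sub(prefix, x):
        # Select the sub-array at a (partial) index by successive subscripting.
        for i in prefix:
            x = x[i]
        return x

    def build(prefix, dims):
        # Rebuild the leading dimensions, applying f at each leading index.
        if not dims:
            return f(sub(prefix, a))
        return [build(prefix + (i,), dims[1:]) for i in range(dims[0])]

    return build((), u)
```

For instance, applying sum with σ = 1 to a 2 by 3 array reduces each row, giving a result of shape <2>; applying a vector-reversing f with σ = 1 reverses every row, which is how Ω extends a 0th-axis operation such as Φ over the last axis.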
δ (Dimensionality): returns the number of dimensions of an array.
ρ (Shape): returns a vector of the upper bounds, or sizes, of each dimension of an array.
ιξ (Iota): for a scalar argument ξ^0, returns a vector containing the elements 0 to ξ^0 − 1; for a vector argument ξ^1, returns an array of indices defined by the shape vector ξ^1.
ψ (Psi): the main indexing function of the Psi Calculus, which defines all operations in MoA; returns a scalar if a full index is provided, and a sub-array otherwise.
rav (Ravel): vectorizes a multi-dimensional array based on the array's layout (γ_row, γ_col, γ_sparse, ...).
γ (Gamma): translates indices into offsets, given a shape.
γ′ (Gamma Inverse): translates offsets into indices, given a shape.
s ρ̂ ξ (Reshape): changes the shape vector of an array, possibly affecting its dimensionality; reshape depends on the layout (γ).
π x (Pi): returns the scalar x[0] · x[1] · ... · x[(τx) − 1].
τ (Tau): returns the number of components in an array; τξ ≡ π(ρξ).
ξ_l ++ ξ_r (Catenate): concatenates two arrays over their primary axis.
ξ_l f ξ_r (Point-wise Extension): a data-parallel application of f is performed between all elements of the arrays.
σ f ξ_r, ξ_l f σ (Scalar Extension): the scalar σ is used with every component of the array argument in the data-parallel application of f.
△ (Take): returns a sub-array from the beginning or end of an array, based on its argument being positive or negative.
▽ (Drop): the inverse of Take.
op red (Reduce): reduces an array's dimension by one by applying op over the primary axis of the array.
Φ (Reverse): reverses the components of an array.
Θ (Rotate): rotates, or cyclically shifts, the components of an array.
\ (Transpose): transposes the elements of an array according to a given permutation vector.
Ω (Omega): applies a unary or binary function to its array argument(s), given partitioning information; Ω is used to perform all operations (defined over the primary axis only) over all dimensions.

Fig. A.1. Summary of MoA Operations

For any non-empty array ξ,

f Ω d ξ     (B.1)

is defined provided (i) (δξ) ≥ σ, and provided certain other conditions, stated below, are met. Let

u ≡ (−σ) ▽ ρξ.     (B.2)

We can write

ρξ ≡ u ++ z     (B.3)

where z ≡ (−σ) △ ρξ. We further require (ii) that there exists w such that for 0 ≤* i <* u,

f (i ψ ξ)     (B.4)

is defined and has shape w. The notation 0 ≤* i <* u is a shorthand indicating that the two vectors i and u are compared component by component. With this,

ρ (f Ω d) ξ ≡ u ++ w     (B.5)

and for 0 ≤* i <* u,

i ψ (f Ω d) ξ ≡ f (i ψ ξ).     (B.6)

End of Definition 0

Note that condition (ii) is easily satisfied for common f's.

Definition 1: We similarly define Ω when its function argument is a binary operation g. g Ω d is defined when g is a two-argument function and d ≡ <σ_l σ_r>, with σ_l ≥ 0 and σ_r ≥ 0. For any non-empty arrays ξ_l and ξ_r,

ξ_l (g Ω d) ξ_r     (B.7)

is defined provided (i) (δξ_l) ≥ σ_l and (δξ_r) ≥ σ_r, and provided certain other conditions, stated below, are met. Let ⌊ denote the binary operation minimum, and let

m ≡ ((δξ_l) − σ_l) ⌊ ((δξ_r) − σ_r).     (B.8)

We require that (ii) ((−m) △ (−σ_l) ▽ ρξ_l) ≡ ((−m) △ (−σ_r) ▽ ρξ_r). Let

x ≡ ((−m) △ (−σ_l) ▽ ρξ_l) ≡ ((−m) △ (−σ_r) ▽ ρξ_r),     (B.9)

u ≡ (−m) ▽ (−σ_l) ▽ ρξ_l,     (B.10)

v ≡ (−m) ▽ (−σ_r) ▽ ρξ_r.     (B.11)

Note that u ≡ <> or v ≡ <> (both could be empty). We can now write

ρξ_l ≡ u ++ x ++ y,     (B.12)

and

ρξ_r ≡ v ++ x ++ z     (B.13)

where y ≡ (−σ_l) △ ρξ_l and z ≡ (−σ_r) △ ρξ_r. Any of the vectors above could be empty.
We also require (iii) that there exists a fixed vector w such that for 0 ≤* i <* u, 0 ≤* j <* v, 0 ≤* k <* x,

((i ++ k) ψ ξ_l) g ((j ++ k) ψ ξ_r)     (B.14)

is defined and has shape w. With all this,

ρ (ξ_l (g Ω d) ξ_r) ≡ u ++ v ++ x ++ w     (B.15)

and for 0 ≤* i <* u, 0 ≤* j <* v, 0 ≤* k <* x,

(i ++ j ++ k) ψ (ξ_l (g Ω d) ξ_r) ≡ ((i ++ k) ψ ξ_l) g ((j ++ k) ψ ξ_r).     (B.16)

End of Definition 1

REFERENCES

[1] W. Humphrey, S. Karmesin, F. Bassetti, and J. Reynders. Optimization of data-parallel field expressions in the POOMA framework. In Y. Ishikawa, R. R. Oldehoeft, J. Reynders, and M. Tholburn, editors, Proc. First International Conference on Scientific Computing in Object-Oriented Parallel Environments (ISCOPE '97), volume 1343 of Lecture Notes in Computer Science, pages 185–194, Marina del Rey, CA, December 1997. Springer-Verlag.
[2] S. Karmesin, J. Crotinger, J. Cummings, S. Haney, W. Humphrey, J. Reynders, S. Smith, and T. Williams. Array design and expression evaluation in POOMA II. In D. Caromel, R. R. Oldehoeft, and M. Tholburn, editors, Proc. Second International Symposium on Scientific Computing in Object-Oriented Parallel Environments (ISCOPE '98), volume 1505 of Lecture Notes in Computer Science, Santa Fe, NM, December 1998. Springer-Verlag.
[3] Marnix Vlot. The POOMA architecture. Lecture Notes in Computer Science, 503:365–, 1991.
[4] Rogier Wester and Ben Hulshof. The POOMA operating system. Lecture Notes in Computer Science, 503:396–??, 1991.
[5] John V. W. Reynders, Paul J. Hinker, Julian C. Cummings, Susan R. Atlas, Subhankar Banerjee, William F. Humphrey, Steve R. Karmesin, Katarzyna Keahey, M. Srikant, and Mary-Dell Tholburn. POOMA. In Gregory V. Wilson and Paul Lu, editors, Parallel Programming Using C++. MIT Press, 1996.
[6] David R. Musser and Atul Saini. STL Tutorial and Reference Guide. Addison-Wesley, 1996.
[7] Matthew H. Austern. Generic Programming and the STL: Using and Extending the C++ Standard Template Library. Addison-Wesley, 1998.
[8] Mark Weiss. Algorithms, Data Structures, and Problem Solving with C++. Addison-Wesley, 1996.
[9] John V. W. Reynders, Paul J. Hinker, Julian C. Cummings, Susan R. Atlas, Subhankar Banerjee, William F. Humphrey, Steve R. Karmesin, Katarzyna Keahey, M. Srikant, and Mary-Dell Tholburn. POOMA. In Parallel Programming Using C++. The MIT Press, Cambridge, 1996.
[10] Todd Veldhuizen. Expression templates. C++ Report, 7(5), June 1995. SIGS Books, NY.
[11] A. Lumsdaine and B. McCandless. Parallel extensions to the matrix template library. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing. SIAM, 1997.
[12] A. Lumsdaine. The matrix template library: A generic programming approach to high performance numerical linear algebra. In Proceedings of the International Symposium on Computing in Object-Oriented Parallel Environments, 1998.
[13] Papers I and II of this series have been submitted to the Journal of Computational Physics and are available online at http://trr.albany.edu/documents/TR00004 and http://trr.albany.edu/documents/TR00005, respectively.
[14] L. R. Mullin and S. G. Small. Three easy steps to a faster FFT (no, we don't need a plan). In M. S. Obaidat and F. Davoli, editors, Proceedings of the 2001 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, SPECTS 2001, pages 604–611, Orlando, FL, July 2001.
[15] L. R. Mullin and S. G. Small. Three easy steps to a faster FFT (the story continues ...). In H. R. Arabnia, editor, Proceedings of the International Conference on Imaging Science, Systems, and Technology, CISST 2001, pages 107–113, Las Vegas, NV, June 2001.
[16] L. R. Mullin and S. G. Small. Four easy ways to a faster FFT. Journal of Mathematical Modelling and Algorithms, 1:193, 2002.
[17] L. Mullin. A uniform way of reasoning about array-based computation in radar: Algebraically connecting the hardware/software boundary. Digital Signal Processing, to appear.
[18] J. E. Raynolds and L. R. Mullin. Computer Physics Communications, 170:1, 2005.
[19] T. Cormen. Everything you always wanted to know about out-of-core FFTs but were afraid to ask. COMPASS Colloquia Series, University at Albany, SUNY, 2000.
[20] J.-W. Hong and H. T. Kung. I/O complexity: the red-blue pebble game. 1981.
[21] J. E. Savage. Extending the Hong-Kung model to memory hierarchies. Lecture Notes in Computer Science, 959:270–281, 1995.
[22] Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Technical Report DUKE-TR-1993-02, 1993.
[23] Lenore R. Mullin, Harry B. Hunt III, and Daniel J. Rosenkrantz. A transformation-based approach for the design of parallel/distributed scientific software: the FFT. http://trr.albany.edu/documents/TR00002.
[24] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1994.
[25] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, Los Altos, CA, 2000.
[26] L. M. R. Mullin. A Mathematics of Arrays. PhD thesis, Syracuse University, December 1988.
[27] L. Mullin. The Psi compiler project. In Workshop on Compilers for Parallel Computers. TU Delft, Holland, 1993.
[28] L. R. Mullin, D. Dooling, E. Sandberg, and S. Thibault. Formal methods for scheduling, routing and communication protocol. In Proceedings of the Second International Symposium on High Performance Distributed Computing (HPDC-2). IEEE Computer Society, July 1993.
[29] L. R. Mullin, D. Eggleston, L. J. Woodrum, and W. Rennie. The PGI-PSI project: Preprocessing optimizations for existing and new F90 intrinsics in HPF using compositional symmetric indexing of the Psi calculus. In M. Gerndt, editor, Proceedings of the 6th Workshop on Compilers for Parallel Computers, pages 345–355, Aachen, Germany, December 1996. Forschungszentrum Jülich GmbH.
[30] Douglas F. Elliott and K. Ramamohan Rao. Fast Transforms: Algorithms, Analyses, Applications. Academic Press, Inc., 1982.
[31] R. Tolimieri, M. An, and C. Lu. Algorithms for Discrete Fourier Transform and Convolution. Springer-Verlag, 1989.
[32] R. Tolimieri, M. An, and C. Lu. Mathematics of Multidimensional Fourier Transform Algorithms. Springer-Verlag, 1993.
[33] D. L. Dai, S. K. S. Gupta, S. D. Kaushik, and J. H. Lu. EXTENT: A portable programming environment for designing and implementing high-performance block-recursive algorithms. In Proceedings, Supercomputing '94, pages 49–58, Washington, DC, November 1994. IEEE Computer Society Press.
[34] J. Granata, M. Conner, and R. Tolimieri. Recursive fast algorithms and the role of the tensor product. IEEE Transactions on Signal Processing, 40(12):2921–2930, December 1992.
[35] S. Gupta, C.-H. Huang, P. Sadayappan, and R. Johnson. On the synthesis of parallel programs from tensor product formulas for block recursive algorithms. In Utpal Banerjee, David Gelernter, Alex Nicolau, and David Padua, editors, Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, volume 757 of Lecture Notes in Computer Science, pages 264–280, New Haven, CT, August 1992. Springer-Verlag.
[36] S. K. S. Gupta, C.-H. Huang, P. Sadayappan, and R. W. Johnson. A framework for generating distributed-memory parallel programs for block recursive algorithms. Journal of Parallel and Distributed Computing, 34(2):137–153, May 1996.
[37] Jeremy Johnson, Robert W. Johnson, David A. Padua, and Jianxin Xiong. Searching for the best FFT formulas with the SPL compiler. In S. P. Midkiff et al., editors, Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing 2000 (LCPC 2000), volume 2017 of Lecture Notes in Computer Science, pages 112–126, Yorktown Heights, NY, August 2000. Springer.
[38] Jianxin Xiong, Jeremy Johnson, Robert Johnson, and David Padua. SPL: A language and compiler for DSP algorithms. In Proceedings of the ACM SIGPLAN '01 Conference on Programming Language Design and Implementation, pages 298–308, Snowbird, UT, 2001. ACM Press.
[39] M. Frigo and S. G. Johnson. The fastest Fourier transform in the West. Technical Report MIT-LCS-TR-728, Massachusetts Institute of Technology, September 1997.
[40] M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing, Vol. 3, pages 1381–1384, May 1998.
[41] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation, pages 169–180, Atlanta, GA, May 1999.
[42] Kang Su Gatlin and Larry Carter. Faster FFTs via architecture-cognizance. In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques (PACT '00), pages 249–260, Philadelphia, PA, October 2000. IEEE Computer Society Press.
[43] Dragan Mirković, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Conference Proceedings of the 2000 International Conference on Supercomputing, pages 215–224, Santa Fe, New Mexico, May 2000. ACM SIGARCH.
[44] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A high performance parallel algorithm for 1-D FFT. In Proceedings, Supercomputing '94, pages 34–40, Washington, DC, November 1994. IEEE Computer Society Press.
[45] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Toward a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1–12, May 1993.
[46] Anshul Gupta and Vipin Kumar. The scalability of FFT on parallel computers. IEEE Transactions on Parallel and Distributed Systems, 4(8):922–932, August 1993.
[47] S. K. S. Gupta, C.-H. Huang, P. Sadayappan, and R. W. Johnson. Implementing fast Fourier transforms on distributed-memory multiprocessors using data redistributions. Parallel Processing Letters, 4(4):477–488, December 1994.
[48] Douglas Miles. Compute intensity and the FFT. In Proceedings, Supercomputing '93, pages 676–684, Portland, OR, November 1993. IEEE Computer Society Press.
[49] Pamela Thulasiraman, Kevin B. Theobald, Ashfaq A. Khokhar, and Guang R. Gao. Multithreaded algorithms for fast Fourier transform. In Proceedings of the 12th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA-00), pages 176–185, NY, July 2000. ACM Press.
[50] H. B. Hunt III, L. R. Mullin, and D. J. Rosenkrantz. Experimental construction of a fine-grained polyalgorithm for the FFT. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '99), pages 1641–1647, Las Vegas, NV, June 1999.
[51] L. R. Mullin and S. G. Small. Four easy ways to a faster FFT. Journal of Mathematical Modelling and Algorithms, 1(3):193–214, 2002.
[52] Charles Van Loan. Computational Frameworks for the Fast Fourier Transform. Frontiers in Applied Mathematics. SIAM, 1992.
[53] Thomas H. Cormen and David M. Nicol. Performing out-of-core FFTs on parallel disk systems. Parallel Computing, 24(1):5–20, 1998.
[54] L. Mullin and M. Jenkins. Effective data parallel computation using the Psi calculus. Concurrency: Practice and Experience, September 1996.
[55] M. Jenkins. Research Communications, 1994.
[56] P. Lewis, D. Rosenkrantz, and R. Stearns. Compiler Design Theory. Addison-Wesley, 1976.