QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment



Emmanuel Agullo*, Camille Coti†, Jack Dongarra*, Thomas Herault‡, Julien Langou§

* Dpt of Electrical Engineering and Computer Science, University of Tennessee, 1122 Volunteer Blvd, Claxton Building, Knoxville, TN 37996-3450, USA
† INRIA Saclay-Île de France, F-91893 Orsay, France
‡ Univ Paris Sud, University of Tennessee, LRI, INRIA
§ Dpt of Mathematical and Statistical Sciences, University of Colorado Denver, Campus Box 170, P.O. Box 173364, Denver, Colorado 80217-3364, USA

eagullo@eecs.utk.edu, coti@lri.fr, dongarra@eecs.utk.edu, herault@lri.fr, julien.langou@ucdenver.edu

This work was partly supported by the EC grant for the QosCosGrid project (grant #FP6-2005-IST-5 033883) and by NSF-CCF (grant #881520).

Abstract

Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing, and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization – one of the main dense linear algebra kernels – of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).

I. INTRODUCTION

Grid computing [20] as a utility has reached the mainstream. Many large-scale scientific problems have been successfully solved thanks to the use of computational grids (or, simply, grids). These problems cover a wide range of scientific disciplines, including biology (protein folding [31]), medicine (cure of muscular dystrophy [9]), financial modeling, earthquake simulation, and climate/weather modeling. Such scientific breakthroughs have relied on the tremendous processing power provided by grid infrastructures. For example, the Berkeley Open Infrastructure for Network Computing (BOINC) [3] gathers the processing power of personal computers provided by people volunteering all over the world. This processing power is then made available to researchers through different projects such as Climateprediction.net [2], Rosetta@home [16] and World Community Grid (WCG, http://www.worldcommunitygrid.org). As of September 18, 2009, BOINC had 566,000 active computers (hosts) worldwide, for an average total processing power of 2.4 Pflop/s (source: http://boincstats.com/).
Furthermore, following the supercomputing trends, grid computing infrastructures have successfully exploited emerging hardware technologies. The Folding@Home project [7] – which aims at understanding protein folding, misfolding, and related diseases – achieves 7.9 Pflop/s thanks to a grid exploiting specialized hardware such as graphics processing units (GPUs), multicore chips and IBM Cell processors.

However, conventional supercomputers remain strongly predominant in high-performance computing (HPC) because several limiting factors prevent the use of grids for solving large-scale scientific problems. First of all, security requirements for grids are not completely met in spite of the important efforts in that direction [28]. Second, contrary to their original purpose (the term grid itself is a metaphor for making computer power as easy to access as an electric power grid [20]), grids have not historically been very user-friendly. Third, not all grid infrastructures are dimensioned for HPC, which is only one of the aims of grid computing. Even recent commercial offerings such as Amazon Elastic Compute Cloud (EC2, http://aws.amazon.com/ec2/) are not yet considered mature for HPC because of under-calibrated components [41]. Furthermore, other aspects are still the focus of intensive research, such as service discovery [11], scheduling [4], etc.

But, above all, the major limiting factor to a wider usage of grids by computational scientists to solve large-scale problems is the fact that common dense linear algebra operations do not achieve a performance speed up by using multiple geographical sites of a computational grid, as reported in previous studies [33], [34]. Because those operations are the building blocks of most scientific applications, the immense processing power delivered by grids vanishes. Unless the application presents parallelism at a higher level (most of the applications running on BOINC are actually embarrassingly parallel, i.e., loosely coupled), its performance becomes limited by the processing power of a single geographical site of the grid infrastructure, ruining the ambition to compete against conventional supercomputers.

We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK [12], a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of the ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for factorizing a dense matrix – one of the most important operations in dense linear algebra – in a grid computing environment that overcomes these two bottlenecks. Our approach consists of articulating a recently proposed algorithm (the Communication Avoiding algorithm [17]) with a topology-aware middleware (QCG-OMPI [15]) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites.
In this study, we focus on the QR factorization [23] of a tall and skinny (TS) dense matrix into an orthogonal matrix Q and an upper triangular matrix R, and we discuss how our approach generalizes to all one-sided factorizations (QR, LU and Cholesky) of a general dense matrix (Section IV). Furthermore, we focus on the computation of the triangular factor R and do not explicitly form the orthogonal matrix Q. However, we show that the performance behavior would be similar whether or not we compute Q.

The paper is organized as follows. We present the related work and define the scope of our paper in Section II. In Section III, we present the implementation of a QR factorization of TS matrices that confines intensive communications within the different geographical sites. Section IV discusses a performance model that allows us to understand the basic trends observed in our experimental study (Section V). We conclude and present future work in Section VI.

II. BACKGROUND

We present here the related work. We first describe previous experimental studies of the behavior of dense linear algebra operations in a grid computing environment (Section II-A). We then succinctly present the operation we focus on in this paper, the QR factorization, as it is implemented in ScaLAPACK, a state-of-the-art dense linear algebra library for distributed memory machines (Section II-B). We continue with the introduction of a recently proposed algorithm trading flops for communication (Section II-C). To take advantage, in a grid computing environment, of the limited amount of communication induced by such an algorithm, we need to articulate it with the topology of the grid. We present in Section II-D a middleware enabling this articulation by (i) providing the system topology to the application and even (ii) allowing the application to reserve suitable resources. Such an articulation of the algorithms with the topology is critical in an environment built on top of heterogeneous networks such as a grid. We conclude this review by discussing the scope of this paper (Section II-E).

A. Dense linear algebra on the grid

The idea of performing dense linear algebra operations on the grid is not new; however, success stories are rare in the related bibliography. Libraries that have an MPI [19] interface for handling the communication layer, such as ScaLAPACK or HP Linpack, can be run on a grid by linking them to a grid-enabled implementation of the MPI standard such as MPICH-G2 [29], PACX-MPI [22] or GridMPI (http://www.gridmpi.org). MPI has become the de facto language for programming parallel applications on distributed memory architectures such as clusters. Programmers have gained experience using this programming paradigm throughout the past decade; scientific libraries have been developed and optimized using MPI. As a consequence, it is natural to consider it as a first-choice candidate for programming parallel applications on the grid, in order to benefit from this experience and to be able to port existing applications to the grid. The GrADS project (Software Support for High-Level Grid Application Development, http://www.hipersoft.rice.edu/grads/) had the purpose of simplifying distributed, heterogeneous computing and making grid application development as well as performance tuning for real applications an everyday practice.
Among other accomplishments, large matrices could be factorized thanks to the use of a grid whereas it was impossible to process them on a single cluster because of memory constraints [34], [40]. The resource allocation (number of clusters, etc.) was automatically chosen in order to maximize the performance. However, for matrices that could fit in the (distributed) memory of the nodes of a cluster, the experiments (conducted with ScaLAPACK) showed that the use of a single cluster was optimal [34]. In other words, using multiple geographical sites led to a slow down of the factorization. Indeed, the overhead due to the high cost of inter-cluster communications was not balanced by the benefits of a higher processing power. For the same reason, the EC2 cloud has recently been shown to be inadequate for dense linear algebra [33]. In this latter study, the authors address the question whether cloud computing can reach the Top500 (http://www.top500.org), i.e., the ranked list of the fastest computing systems in the world. Based on experiments conducted with the parallel LU factorization [23] implemented in the HP Linpack benchmark [18], not only did they observe a slow down when using multiple clusters, but they also showed that the financial cost (in dollars) of performance (number of floating-point operations per second, in Gflop/s) increases exponentially with the number of computing cores used, much in contrast to existing scalable HPC systems such as supercomputers.

The HeteroScaLAPACK project (http://hcl.ucd.ie/project/HeteroScaLAPACK) aims at developing a parallel dense linear algebra package for heterogeneous architectures on top of ScaLAPACK. This approach is orthogonal (and complementary) to ours since it focuses on the heterogeneity of the processors [37], whereas we presently aim at mapping the implementation of the algorithm to the heterogeneity of the network (topology) through QCG-OMPI. In our present work, we do not consider the heterogeneity of the processors. Another fundamental difference from HeteroScaLAPACK is that we are using TSQR, an algorithm that is not available in ScaLAPACK.

B. ScaLAPACK's QR factorization

The QR factorization of an M × N real matrix A has the form A = QR, where Q is an M × M real orthogonal matrix and R is an M × N real upper triangular matrix. Provided A is nonsingular, this factorization is essentially unique; that is, it is unique if we impose the diagonal entries of R to be positive. There is a variety of algorithms to obtain a QR factorization from a given matrix, the most well-known arguably being the Gram-Schmidt algorithm. Dense linear algebra libraries have traditionally focused on algorithms based on unitary transformations (Givens rotations or Householder reflections) because they are unconditionally backward stable [23]. Givens rotations are advantageous when zeroing out a few elements of a matrix, whereas Householder transformations are advantageous when zeroing out a vector of a matrix. Therefore, for dense matrices, we consider the QR factorization algorithm based on Householder reflections. The algorithm consists of applying successive elementary Householder transformations of the form H = I − τvv^T, where I is the identity matrix, v is a column reflector and τ is a scaling factor [23].
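To make the R-factor and its essential uniqueness concrete, the following minimal NumPy sketch computes only the R factor of a tall and skinny matrix and normalizes the signs of its diagonal. It is an illustration only; it stands in for the LAPACK/ScaLAPACK Householder kernels discussed in this paper, and the matrix sizes are arbitrary.

```python
import numpy as np

def r_factor(A):
    """Illustrative sketch: R factor of A, with a nonnegative diagonal
    (NumPy's Householder-based QR, not ScaLAPACK's PDGEQRF)."""
    # mode='r' returns only the N-by-N upper triangular factor of a tall matrix.
    R = np.linalg.qr(A, mode='r')
    # Enforce a nonnegative diagonal so that the factorization is unique,
    # as discussed above for a nonsingular A.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return signs[:, None] * R

M, N = 10000, 64               # a tall and skinny matrix (arbitrary sizes)
A = np.random.rand(M, N)
R = r_factor(A)
# Sanity check: A^T A = R^T R up to rounding error.
assert np.allclose(A.T @ A, R.T @ R)
```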
To achieve high performance on modern computers with different levels of cache, the application of the Householder reflections is blocked [39]. In ScaLAPACK [8], b elementary Householder matrices are accumulated within a panel (a block-column) V consisting of b reflectors. The consecutive application of these b reflectors (H_1 H_2 ... H_b) is then constructed all at once using the matrix equality H_1 H_2 ... H_b = I − V T V^T (T is a b × b upper triangular matrix). However, this blocking incurs an additional computational overhead. The overhead is negligible when there is a large number of columns to be updated but is significant when only a few columns are to be updated. Default values in the ScaLAPACK PDGEQRF subroutine are NB=64 and NX=128, where NB is the block size, b, and NX is the cross-over point; blocking is not used if fewer than NX columns are to be updated. PDGEQRF uses PDGEQR2 to perform the panel factorizations. Due to the panel factorization, the algorithm in ScaLAPACK requires one allreduce operation for each column of the initial matrix. In other words, ScaLAPACK uses at least N log2(P) messages to factor an M-by-N matrix.

C. Communication Avoiding QR (CAQR) factorization

In this paper we propose an implementation of the so-called "Communication Avoiding QR" (CAQR) algorithm originally proposed by Demmel et al. [17]. CAQR belongs to the class of (factor panel) / (update trailing matrix) algorithms. For all algorithms in this class, the update phase is entirely dictated by the panel factorization step and is easily parallelizable. Therefore, we only discuss the panel factorization step. The panel factorization in CAQR is based on the "Tall and Skinny QR" factorization algorithm (TSQR) [17]. In contrast to the ScaLAPACK panel factorization algorithm (subroutine PDGEQR2), which requires one allreduce operation per column, TSQR requires one allreduce operation per b columns, where b is an arbitrary block size. The number of communications is therefore divided by b. The volume of communication stays the same. The number of operations on the critical path is increased in TSQR by an additional O(log2(P) N^3) term. TSQR effectively trades communication for flops.

As explained in [17], TSQR is a single complex allreduce operation. The TS matrix is split into P block-rows, called domains; the factorization of a domain is the operation performed on the leaves of the binary tree associated with the reduction. The basic operation used in this allreduce operation is as follows: from two input triangular matrices R1 and R2, stack R1 on top of R2 to form [R1; R2], perform the QR factorization of [R1; R2]; the output R is given by the R-factor of [R1; R2]. One can show that this operation is binary and associative. It is also commutative if one imposes the diagonal of each computed R-factor to have nonnegative entries. As for any reduce operation, the shape of the optimal tree depends on the dimension of the data and the underlying hardware. CAQR with a binary tree has been studied in the parallel distributed context [17] and CAQR with a flat tree has been implemented in the context of out-of-core QR factorization [26]. We note that CAQR with a flat tree also delivers wide parallelism and, for this reason, has been used in the multicore context [10], [30], [36].
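The reduction operation just described is easy to express in a few lines. The following NumPy sketch simulates TSQR serially over P domains with a binary reduction tree; it is not the paper's MPI/ScaLAPACK implementation (Section III), but it shows the leaf operation, the combine operation on stacked R factors, and why the whole computation is a single reduce.

```python
import numpy as np

def local_r(block):
    """Leaf operation: R factor of one domain (a block-row of the TS matrix)."""
    return np.linalg.qr(block, mode='r')

def combine_r(R1, R2):
    """TSQR reduction operation: QR-factorize the two stacked R factors."""
    return np.linalg.qr(np.vstack([R1, R2]), mode='r')

def tsqr_r(A, p):
    """R factor of a tall and skinny A using a binary reduction over p domains
    (serial simulation; the paper maps domains to processes or clusters)."""
    rs = [local_r(block) for block in np.array_split(A, p, axis=0)]
    while len(rs) > 1:                       # one level of the binary tree per pass
        merged = [combine_r(r1, r2) for r1, r2 in zip(rs[0::2], rs[1::2])]
        if len(rs) % 2 == 1:                 # an odd leftover is forwarded upward
            merged.append(rs[-1])
        rs = merged
    return rs[0]

A = np.random.rand(1 << 16, 64)
R = tsqr_r(A, p=8)
R_ref = np.linalg.qr(A, mode='r')
# Both R factors agree up to the signs of their rows.
assert np.allclose(np.abs(R), np.abs(R_ref))
```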
Previous implementations of CAQR have used either a flat tree or a binary tree. One key originality of our present work lies in the fact that our reduction tree is neither binary nor flat: it is tuned for the targeted computational grid, as illustrated in Fig. 2. First we reduce with a binary tree on each cluster; then we reduce the result of each cluster with a second binary tree at the grid level. The binary tree used by ScaLAPACK PDGEQR2 (Fig. 1) minimizes the sum of the inter-cluster messages and the intra-cluster messages. Our tree is designed to minimize the total number of inter-cluster messages.

We now give a brief history of the related algorithmic work and contrast it with the reference work of Demmel et al. [17]. The parallelization of QR factorization algorithms based on Givens rotations and on Householder reflections is a well-studied area in numerical linear algebra. The development of the algorithms has followed architectural trends. In the late 1970s / early 1980s [27], [32], [38], the research focused on algorithms based on Givens rotations, the goal being to extract as much parallelism as possible. We can interpret this sequence of algorithms as scalar implementations, using a flat tree, of the algorithm in Demmel et al. [17]. In the late 1980s, the research shifted gears and presented algorithms based on Householder reflections [35], [14]. The motivation was to use vector computer capabilities. We can interpret all these algorithms as vector implementations, using a flat tree and/or a binary tree, of the algorithm in Demmel et al. [17]. All these algorithms require a number of messages greater than n, the number of columns of the initial matrix A, as in ScaLAPACK. The algorithm in Demmel et al. [17] is a generalization with multiple blocks of columns and a nontrivial reduction operation, which enables one to divide the number of messages of these previous algorithms by the block size, b. Demmel et al. proved that the TSQR and CAQR algorithms induce a minimum amount of communication (under certain conditions, see Section 17 of [17] for more details) and are numerically as stable as the Householder QR factorization.

Fig. 1. Illustration of the ScaLAPACK panel factorization routine on an M-by-3 matrix. It involves one reduction per column for the normalization and one reduction per column for the update (no update for the last column). The reduction tree used by ScaLAPACK is a binary tree. In this example, we have 25 inter-cluster messages (10 for each column but the last, 5 for the last). A tuned reduction tree would have given 10 inter-cluster messages (4 per column but the last, 2 for the last). We note that if process ranks are randomly distributed, the figure can be worse.

Fig. 2. Illustration of the TSQR panel factorization routine on an M-by-3 matrix. It involves only one reduction tree. Moreover, the reduction tree is tuned for the grid architecture. We only have two inter-cluster messages. This number (two) is independent of the number of columns and is obviously optimal: one cannot expect fewer than two inter-cluster communications when the data is spread over three clusters.
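As a complement to Fig. 2, a minimal serial sketch of the grid-tuned tree can make the message count explicit. Assuming the domains are grouped by cluster (the cluster sizes below are arbitrary), each cluster is reduced with its own binary tree and only the per-cluster R factors take part in the grid-level reduction, so the number of inter-cluster combine steps is the number of clusters minus one, independently of the number of columns.

```python
import numpy as np

def r_of(block):
    """Leaf operation: R factor of one domain."""
    return np.linalg.qr(block, mode='r')

def combine(R1, R2):
    """TSQR combine: R factor of the two stacked triangular factors."""
    return np.linalg.qr(np.vstack([R1, R2]), mode='r')

def binary_reduce(rs):
    """Reduce a list of R factors with a binary tree of combine operations."""
    while len(rs) > 1:
        merged = [combine(r1, r2) for r1, r2 in zip(rs[0::2], rs[1::2])]
        if len(rs) % 2 == 1:
            merged.append(rs[-1])
        rs = merged
    return rs[0]

def grid_tuned_tsqr_r(A, domains_per_cluster):
    """R factor of A with the grid-tuned tree of Fig. 2 (serial simulation):
    one binary reduction tree per cluster, then a binary reduction of the
    per-cluster results at the grid level."""
    blocks = iter(np.array_split(A, sum(domains_per_cluster), axis=0))
    cluster_rs = [binary_reduce([r_of(next(blocks)) for _ in range(d)])
                  for d in domains_per_cluster]
    # Each grid-level combine corresponds to one inter-cluster message carrying
    # an N-by-N triangle: #clusters - 1 of them, whatever the number of columns.
    return binary_reduce(cluster_rs), len(domains_per_cluster) - 1

A = np.random.rand(1 << 15, 64)
R, inter_cluster_msgs = grid_tuned_tsqr_r(A, domains_per_cluster=[8, 8, 8, 8])
print(inter_cluster_msgs)   # 3 for four clusters, independent of N
```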
D. Topology-aware MPI middleware for the grid: QCG-OMPI

Programming efficient applications for grids built by federating clusters is challenging, mostly because of the difference in performance between the various networks the application has to use. As seen in the table of Fig. 3(a), there are two orders of magnitude between inter-cluster and intra-cluster latency on a dedicated, nation-wide network, and the difference can reach three or four orders of magnitude on an international, shared network such as the Internet. As a consequence, the application must be adapted to the intrinsically hierarchical topology of the grid. In other words, the communication and computation patterns of the application must match the physical topology of the hardware resources it is executed on.

Fig. 3. Grid'5000 communication characteristics: (a) communication performance on Grid'5000; (b) Grid'5000, a nation-wide experimental testbed (Orsay, Bordeaux, Toulouse, Sophia-Antipolis).

(a) Latency (ms)
               Orsay   Toulouse   Bordeaux   Sophia
    Orsay       0.07       7.97       6.98     6.12
    Toulouse               0.03       9.03     8.18
    Bordeaux                          0.05     7.18
    Sophia                                     0.06

    Throughput (Mb/s)
               Orsay   Toulouse   Bordeaux   Sophia
    Orsay        890         78         90      102
    Toulouse                890         77       90
    Bordeaux                           890       83
    Sophia                                      890

ScaLAPACK, and many of the linear algebra libraries for scientific computing, are programmed in MPI. MPI is fit for homogeneous supercomputers: processes are mostly indistinguishable from one another, and the standard does not specify anything about process / node placement. As a consequence, to efficiently program a parallel application on top of a non-uniform network, typically on top of a hierarchical network like a grid, MPI must be extended to help programmers adapt the communications of the application to the machine. MPICH-G2 [29] introduced the concept of colors to describe the available topology to the application at runtime. Colors can be used directly by MPI routines in order to build topology-aware communicators (the abstraction in MPI that is used to group processes together). However, the application is fully responsible for adapting itself to the topology that is discovered at runtime. This adaptation, and the load balancing that it implies, may be a hard task for the application.

The QosCosGrid system (Quasi-Opportunistic Supercomputing for Complex Systems in Grid Environments, http://www.qoscosgrid.eu) offers a resource-aware grid meta-scheduler that gives the possibility to allocate resources that match requirements expressed in a companion file, called the application's JobProfile, that describes the future communications of the application [6]. The JobProfile defines process groups and requirements on the hardware specifications of the resources that have to be allocated to these processes, such as amount of memory, CPU speed, and network properties between groups of processes, such as latency and bandwidth. As a consequence, the application will always be executed on an appropriate resource topology. It can therefore be developed with a specific topology in mind, for example, under the assumption that a set of processes will be located on the same cluster or on the same multi-core machine. Of course, the more flexibility the programmer gives to the JobProfile and the application, the more chances he gets to let the meta-scheduler find a suitable hardware setup.

The QosCosGrid system features QCG-OMPI, an MPI implementation based on OpenMPI [21] and targeted at computational grids. Besides grid-specific communication features that enable communicating throughout the grid, described in [15], QCG-OMPI can retrieve, at run-time, the topology information provided to the scheduler in the JobProfile.
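The communicator construction that this topology information enables (and that Section III details) follows a standard MPI pattern. The mpi4py sketch below is an illustration only: QCG-OMPI exposes the JobProfile group identifiers through an MPI attribute whose exact key we do not reproduce here, so the sketch assumes the identifier is already available as an integer (read from a hypothetical QCG_GROUP_ID environment variable).

```python
# Minimal mpi4py sketch of the communicator layout described in Sections II-D
# and III. The QCG_GROUP_ID environment variable is a hypothetical stand-in for
# the middleware-specific MPI attribute that carries the JobProfile group id.
import os
from mpi4py import MPI

world = MPI.COMM_WORLD
group_id = int(os.environ.get("QCG_GROUP_ID", "0"))   # one id per cluster (assumed)

# One communicator per cluster: intensive (ScaLAPACK-like) communication stays here.
cluster_comm = world.Split(color=group_id, key=world.Get_rank())

# One communicator gathering the "leader" (local rank 0) of each cluster:
# only the small inter-cluster reduction of R factors travels on it.
is_leader = (cluster_comm.Get_rank() == 0)
leader_comm = world.Split(color=0 if is_leader else MPI.UNDEFINED,
                          key=world.Get_rank())

# cluster_comm -> intra-cluster binary reduction tree
# leader_comm  -> inter-cluster binary reduction tree (leaders only)
```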
We explain in Section III how we have implemented and articulated TSQR with QCG-OMPI in order to take advantage of the topology.

E. Scope

The QR factorization of TS matrices is directly used as a kernel in several important applications of linear algebra. For instance, block-iterative methods need to regularly perform this operation in order to obtain an orthogonal basis for a set of vectors; this step is of particular importance for block eigensolvers (BLOPEX, SLEPc, PRIMME). Currently these packages rely on unstable orthogonalization schemes to avoid too many communications. TSQR is a stable algorithm that achieves the same (low) total number of messages. TSQR can also be used to perform the panel factorization of an algorithm handling general matrices (CAQR). Through simulations, Demmel et al. [17] anticipated that the benefits obtained with TSQR should carry over to CAQR. Said differently, the present study can be viewed as a first step towards the factorization of general matrices on the grid.

Grids aggregate computing power from any kind of resource. However, in some typical grid projects, such as Superlink@Technion, the Lattice project, EDGeS, and the Condor pool at the University of Wisconsin-Madison, a significant part of the power comes from a few institutions featuring environments with a cluster-like setup. In this first work, we focus our study on clusters of clusters, to enable evaluation in a stable and reproducible environment. Porting the work to a general desktop grid remains future work.

Finally, we emphasize that the objective of this paper is to show that we can achieve a performance speed up over the grid with common dense linear algebra operations. To illustrate our claim, we compare our approach against a state-of-the-art library for distributed memory architectures, ScaLAPACK. In order to highlight the differences, we chose to base our approach on ScaLAPACK (see Section III).

III. QCG-TSQR: ARTICULATION OF TSQR WITH QCG-OMPI

We explain in this section how we articulate the TSQR algorithm with QCG-OMPI in order to confine intensive communications within the different geographical sites of the computational grid. The first difference from the TSQR algorithm as presented in Section II-C is that a domain is processed by a call to ScaLAPACK (and not LAPACK as in [17]). By doing so, we may attribute a domain to a group of processes (instead of a single process) jointly performing its factorization. The particular case of one domain per process corresponds to the original TSQR (calls to LAPACK). At the other end of the spectrum, we may associate one domain with each geographical site of the computational grid. The choice of the number of domains impacts performance, as we will illustrate in Section V-D. In all cases, we call our algorithm TSQR (or QCG-TSQR), since it is a single reduce operation based on a binary tree, similar to the algorithm presented in Section II-C. As explained in Section II-D, the first task in developing a QCG-OMPI application consists of defining the kind of topologies expected by the application in a JobProfile.
To get enough flexibility, we request that processes be split into groups of equivalent computing power, with good network connectivity inside each group (low latency, high bandwidth), and we accept a lower network connectivity between the groups. This corresponds to the classical cluster-of-clusters approach, with a constraint on the relative sizes of the clusters to facilitate load balancing. The meta-scheduler then allocates resources in the physical grid that match these requirements. To enable us to complete an exhaustive study of the different kinds of topologies we can get, we also introduced more constraints in the reservation mechanism, depending on the experiment we ran. For each experiment, the set of machines allocated to the job is passed to the MPI middleware, which exposes those groups using two-dimensional arrays of group identifiers (the group identifiers are defined in the JobProfile by the developer). After the initialization, the application retrieves these group identifiers from the system (using a specific MPI attribute) and then creates one MPI communicator per group, using the MPI_Comm_split routine. Once this is done, the TSQR algorithm has knowledge of the topology, which allows it to adapt to the physical setup of the grid.

The choice to introduce a requirement of similar computing power between the groups however introduces constraints on the reservation mechanism. For example, in some experiments discussed later (Section V), only half the cores of some of the machines were allocated in order to fit this requirement. Another possibility would have been to handle load balancing issues at the algorithmic level (and not at the middleware level) in order to relieve this constraint on the JobProfile and thus increase the number of physical setups that would match our needs. In the particular case of TSQR, this is a natural extension; we would only have to adapt the number of rows attributed to each domain as a function of the processing power dedicated to that domain. This alternative approach is future work.

IV. PERFORMANCE MODEL

In Tables I and II, we give the amount of communication and computation required for ScaLAPACK QR2 and TSQR in two different scenarios: first, when only the R-factor is requested (Table I) and, second, when both the R-factor and the Q-factor are requested (Table II). In this model, we assume that a binary tree is used for the reductions and that the network is homogeneous. We recall that the input matrix A is M-by-N and that P is the number of domains. The number of FLOPs is the number of FLOPs on the critical path per domain.

TABLE I. Communication and computation breakdown when only the R-factor is needed.

                     # msg          vol. data exchanged   # FLOPs
    ScaLAPACK QR2    2 N log2(P)    log2(P) (N^2/2)       (2MN^2 - 2/3 N^3)/P
    TSQR             log2(P)        log2(P) (N^2/2)       (2MN^2 - 2/3 N^3)/P + 2/3 log2(P) N^3

Assuming a homogeneous network, the total time of the factorization is then approximated by the formula

    time = β · (# msg) + α · (vol. data exchanged) + γ · (# FLOPs),        (1)

where α is the inverse of the bandwidth, β the latency, and γ the inverse of the floating-point rate of a domain. Although this model is simplistic, it enables us to forecast the basic trends. Note that in the case of TS matrices, we have M ≫ N.
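The trends predicted by Equation (1) can be explored numerically. The following Python sketch evaluates the model with the Table I counts for ScaLAPACK QR2 and TSQR; the values of β, α and γ below are illustrative assumptions (loosely inspired by the inter-cluster numbers of Fig. 3(a)), not measured constants.

```python
import math

def model_time(n_msg, volume, flops, latency, inv_bandwidth, inv_flop_rate):
    """Equation (1): time = beta*(#msg) + alpha*(volume) + gamma*(#FLOPs)."""
    return latency * n_msg + inv_bandwidth * volume + inv_flop_rate * flops

def scalapack_qr2_r(M, N, P):
    """Table I counts (R-factor only) for ScaLAPACK QR2."""
    return (2 * N * math.log2(P),
            math.log2(P) * N**2 / 2,
            (2 * M * N**2 - 2 / 3 * N**3) / P)

def tsqr_r(M, N, P):
    """Table I counts (R-factor only) for TSQR."""
    return (math.log2(P),
            math.log2(P) * N**2 / 2,
            (2 * M * N**2 - 2 / 3 * N**3) / P + 2 / 3 * math.log2(P) * N**3)

# Assumed parameters: 8 ms inter-cluster latency, ~100 Mb/s throughput
# (converted to 8-byte words per second), 1 Gflop/s per domain.
beta, alpha, gamma = 8e-3, 1.0 / (100e6 / 64), 1.0 / 1e9
M, N, P = 33_554_432, 64, 128
for name, counts in (("ScaLAPACK QR2", scalapack_qr2_r(M, N, P)),
                     ("TSQR", tsqr_r(M, N, P))):
    print(name, model_time(*counts, beta, alpha, gamma), "s")
```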
First we observe that the cost to compute both the Q and the R factors is exactly twice the cost of computing R only. Moreover, further theoretical and experimental analysis of the algorithm (see [17]) reveals that the structure of the computation is the same in both cases and that the time to obtain Q is twice the time to obtain R. This leads to Property 1. For brevity, we mainly focus our study on the computation of R only.

Property 1: The time to compute both Q and R is about twice the cost of computing R only.

TABLE II. Communication and computation breakdown when both the Q-factor and the R-factor are needed.

                     # msg          vol. data exchanged   # FLOPs
    ScaLAPACK QR2    4 N log2(P)    2 log2(P) (N^2/2)     (4MN^2 - 4/3 N^3)/P
    TSQR             2 log2(P)      2 log2(P) (N^2/2)     (4MN^2 - 4/3 N^3)/P + 4/3 log2(P) N^3

One of the building blocks of the ScaLAPACK PDGEQR2 implementation and of our TSQR algorithm is the domanial QR factorization of a TS matrix. The domain can be processed by a core, a node or a group of nodes. We cannot expect the performance of our parallel distributed algorithms to be better than that of their domanial kernels. This leads to Property 2. In practice, the performance of the QR factorization of TS matrices obtained from LAPACK/ScaLAPACK on a domain (core, node, small number of nodes) is a small fraction of the peak (equivalently, the floating-point rate 1/γ in Equation 1 is likely to be low).

Property 2: The performance of the factorization of TS matrices is limited by the domanial performance of the QR factorization of TS matrices.

We see that the number of operations is proportional to M while all the communication terms (latency and bandwidth) are independent of M. Therefore, when M increases, the communication time stays constant whereas the domanial computation time increases. This leads to increased performance.

Property 3: The performance of the factorization of TS matrices increases with M.

The number of operations is proportional to N^2 while the number of messages is proportional to N. Therefore, when N increases, the latency term is hidden by the computation term. This leads to better performance. We also note that increasing N enables better performance of the domanial kernel, since it can use Level 3 BLAS when the number of columns is greater than, perhaps, 100. This is Property 4.

Property 4: The performance of the factorization of TS matrices increases with N.

Finally, we see that the latency term is 2 log2(P) for TSQR while it is 2 N log2(P) for ScaLAPACK QR2. On the other hand, the FLOPs term has a non-parallelizable additional 2/3 log2(P) N^3 term for the TSQR algorithm. We see that TSQR effectively trades messages for flops. We expect TSQR to be faster than ScaLAPACK QR2 for N in the mid-range (perhaps between five and a few hundred). For larger N, TSQR will become slower because of the additional flops. This is Property 5. (We note that for large N, one should stop using TSQR and switch to CAQR.)

Property 5: The performance of TSQR is better than ScaLAPACK's for N in the mid-range. When N gets too large, the performance of TSQR deteriorates and ScaLAPACK becomes better.

V. EXPERIMENTAL STUDY

A. Experimental environment

We present an experimental study of the performance of the QR factorization of TS matrices in a grid computing environment.
We conducted our experiments on Grid'5000. This platform is a dedicated, reconfigurable and controllable experimental grid of 13 clusters distributed over 9 cities in France. Each cluster is itself composed of 58 to 342 nodes. The clusters are interconnected through dedicated dark fiber. In total, Grid'5000 roughly gathers 5,000 CPU cores featuring multiple architectures. For the experiments presented in this study, we chose four clusters based on relatively homogeneous dual-processor nodes, ranging from AMD Opteron 246 (2 GHz / 1 MB L2 cache) for the slowest ones to AMD Opteron 2218 (2.6 GHz / 2 MB L2 cache) for the fastest ones, which leads to theoretical peaks ranging from 8.0 to 10.4 Gflop/s per processor. These four clusters are the 93-node cluster in Bordeaux, the 312-node cluster in Orsay, an 80-node cluster in Toulouse, and a 56-node cluster in Sophia-Antipolis. Because these clusters are located in different cities, we will use the terms cluster and geographical site (or site) interchangeably in the following. Nodes are interconnected with a Gigabit Ethernet switch; on each node, the network controller is shared by both processors. On each cluster, we reserved a subset of 32 dual-processor nodes, leading to a theoretical peak of 512.0 to 665.6 Gflop/s per cluster. Since our algorithm is synchronous, to evaluate the proportion of the theoretical peak achieved in a heterogeneous environment, we consider the efficiency of the slowest component as the base for the evaluation. Therefore, the theoretical peak of our grid is equal to 2,048 Gflop/s. A consequence of the constraints on the topology expressed by our implementation in QCG-OMPI (see Section II-D) is that in some experiments, machines with two dual-core processors were booked with the ability to use only 2 cores (out of 4).

The performance of the inter- and intra-cluster communications is shown in Fig. 3(a). Within a cluster, nodes are connected with a Gigabit Ethernet network. Clusters are interconnected with 10 Gb/s dark fibers. The intra-cluster throughput is consistently equal to 890 Mb/s but varies from 61 to 860 Mb/s between clusters. Inter-cluster latency is roughly two orders of magnitude greater than intra-cluster latency. Between two processors of a same node, OpenMPI uses a driver optimized for shared-memory architectures, leading to a 17 µs latency and a 5 Gb/s throughput. One major feature of the Grid'5000 project is the ability of the user to boot her own environment (including the operating system, distribution, libraries, etc.) on all the computing nodes booked for her job. All the nodes were booted under Linux 2.6.30. The tests and benchmarks were compiled with GCC 4.0.3 (flag -O3) and run in dedicated mode (no other user can access the machines). The ScaLAPACK 1.8.0 and GotoBLAS 1.26 libraries were used. Finally, we recall that we focus on the factorization of TS dense large-scale matrices in real double precision, corresponding to up to 16 GB of memory (e.g., a 33,554,432 × 64 matrix in double precision).

B. Tuning of the applications

To achieve high performance across platforms, dense linear algebra applications rely on the Basic Linear Algebra Subprograms (BLAS) [13] to perform basic operations such as vector and matrix multiplication. This design by layers allows one to focus only on the optimization of these basic operations while keeping the underlying numerical algorithms common to all machines.
The behavior of the overall application thus depends on the performance of the BLAS operations – and in particular the matrix multiplication (DGEMM). The Automatically Tuned Linear Algebra Software (ATLAS) [42] library is, for instance, a widely used implementation of BLAS achieving high performance thanks to autotuning methods. It is furthermore possible to take advantage of dual-processor nodes thanks to a multi-threaded implementation of BLAS such as GotoBLAS [24]. We have compared the performance of serial and multi-threaded GotoBLAS DGEMM against ATLAS. Both configurations of GotoBLAS outperformed ATLAS; we thus selected GotoBLAS to conduct our experiments. Another possibility to take advantage of dual-processor nodes is simply to create two processes per node at the application level. For both ScaLAPACK and TSQR, that latter configuration consistently achieved a higher performance. We therefore used two processes per node together with the serial version of GotoBLAS's DGEMM in all the experiments reported in this study. With DGEMM being the fastest kernel (on top of which other BLAS operations are usually built), we obtain a rough practical performance upper bound for our computational grid of about 940 Gflop/s (the ideal case where 256 processors would achieve the performance of DGEMM, i.e., about 3.67 Gflop/s each) out of the 2,048 Gflop/s theoretical peak.

ScaLAPACK implements block-partitioned algorithms. Its performance depends on the partitioning of the matrix into blocks. Preliminary experiments (not reported here) showed that a column-wise 1D-cyclic partition is optimal for processing TS matrices in our environment. We furthermore chose a block size consistently equal to 64 (a better tuning of this parameter as a function of the matrix characteristics would have occasionally improved the performance, but we considered that the possible gain was not worth the degree of complexity introduced in the analysis of the results).

C. ScaLAPACK performance

Figure 4 reports ScaLAPACK performance. In accordance with Property 2, the overall performance of the QR factorization of TS matrices is low (consistently lower than 90 Gflop/s) compared to the practical upper bound of our grid (940 Gflop/s). Even on a single cluster, this ratio is low, since the performance at one site is consistently lower than 70 Gflop/s out of a practical upper bound of 235 Gflop/s. As expected too (properties 3 and 4), the performance increases with the dimensions of the matrix. For matrices of small to moderate height (M ≤ 5,000,000), the fastest execution is consistently the one conducted on a single site. In other words, for those matrices, the use of a grid (two or four sites) induces a drop in performance, confirming previous studies [34], [33], [41]. For very tall matrices (M > 5,000,000), the proportion of computation relative to the amount of communication becomes high enough that the use of multiple sites eventually speeds up the performance (right-most part of the graphs and Property 3). This speed up however hardly surpasses a value of 2.0 when using four sites (Figure 4(b)).

D. QCG-TSQR performance

The performance of TSQR (articulated with QCG-OMPI as described in Section III) depends on the number of domains used. In Figure 5, we report the TSQR performance for the optimum number of domains; we will return later to the effect of the number of domains.
In accordance with Property 2, the overall performance is again only a fraction of the practical upper bound of our grid (940 Gflop/s). But, compared to ScaLAPACK, this ratio is significantly higher, since the factorization of an 8,388,608 × 512 matrix achieves 256 Gflop/s (Figure 5(d)). Again, in accordance with properties 3 and 4, the overall performance increases with the dimensions of the matrix. Thanks to its better performance (Property 5), TSQR also achieves a speed up on the grid for matrices of moderate size. Indeed, for almost all matrices of moderate to great height (M ≥ 500,000), the fastest execution is the one conducted on all four sites.

Fig. 4. ScaLAPACK performance (Gflop/s as a function of the number of rows M, for (a) N = 64, (b) N = 128, (c) N = 256 and (d) N = 512; executions on one site (32 nodes), two sites (64 nodes) and four sites (128 nodes)).

Fig. 5. TSQR performance (Gflop/s as a function of the number of rows M, for (a) N = 64, (b) N = 128, (c) N = 256 and (d) N = 512; executions on one site (32 nodes), two sites (64 nodes) and four sites (128 nodes)).

Furthermore, for very tall matrices (M ≥ 5,000,000), TSQR performance scales almost linearly with the number of sites (a speed up of almost 4.0 is obtained on four sites). This result is the central statement of this paper and validates the thesis that computational grids are a valid infrastructure for solving large-scale problems relying on the QR factorization of TS matrices.

Figure 6 now illustrates the effect of the number of domains per cluster on TSQR performance. Globally, the performance increases with the number of domains.

Fig. 6. Effect of the number of domains per cluster (from 1 to 64) on the performance of TSQR executed on all four sites, for (a) N = 64, (b) N = 128, (c) N = 256 and (d) N = 512 and several matrix heights M.

Fig. 7. Effect of the number of domains (from 1 to 64) on the performance of TSQR executed on a single site, for (a) N = 64 and (b) N = 512 and several matrix heights M.
For very tall matrices (M = 33,554,432), the impact is limited (but not negligible) since there is enough computation to almost mask the effect of communications (Property 3). For very skinny matrices (N = 64), the optimum number of domains for executing TSQR on a single cluster is 64 (Figure 7(a)), corresponding to a configuration with one domain per processor. This optimum choice of the number of domains carries over to executions on multiple clusters, where 64 domains per cluster is optimum too (Figure 6(a)). For the widest matrices studied (N = 512), the optimum number of domains for executing TSQR on a single cluster is 32 (Figure 7(b)), corresponding to a configuration with one domain per node. For those matrices, trading flops for intra-node communications is not worthwhile. This behavior again carries over to executions on multiple sites (Figure 6(d)), where the optimum configuration also corresponds to 32 domains per cluster. This observation illustrates the fact that one should use CAQR and not TSQR for large N, as discussed in Section IV.

E. QCG-TSQR vs ScaLAPACK

Figure 8 compares TSQR performance (still articulated with QCG-OMPI) against ScaLAPACK's. We report the maximum performance out of executions on one, two or four sites.

Fig. 8. TSQR vs ScaLAPACK (Gflop/s as a function of the number of rows M, for (a) N = 64, (b) N = 128, (c) N = 256 and (d) N = 512). For each algorithm, the performance of the optimum configuration (one, two or four sites) is reported.

For instance, the graph of TSQR in Figure 8(a) is thus the convex hull of the three graphs from Figure 5(a). In accordance with Property 5, TSQR consistently achieves a higher performance than ScaLAPACK. For matrices of limited height (M = 131,072), TSQR is optimum when executed on one site (Figure 5(a)). In this case, its superiority over ScaLAPACK comes from better performance within a cluster (Figure 7(a)). For matrices with a larger number of rows (M = 4,194,304), the impact of the number of domains per cluster is less pronounced (Figure 7(a) and Property 3). On the other hand, the matrix is large enough to allow a speed up of TSQR over the grid (Figure 5(a) and Property 3 again) but not of ScaLAPACK (Figure 4(a) and Property 5), hence the superiority of TSQR over ScaLAPACK for that type of matrix. For very tall matrices (M = 33,554,432), the impact of the number of domains per cluster becomes negligible (Figure 7(a) and Property 3). But (i) TSQR achieves a speed up of almost 4.0 on four sites (Figure 5(a)), whereas (ii) ScaLAPACK does not yet achieve such an ideal speed up (Figure 4(a)). Finally, over the whole range of matrix shapes considered, and for different reasons, we have seen that TSQR consistently achieves a significantly higher performance than ScaLAPACK. For not-so-tall and not-so-skinny matrices (left-most part of Figure 8(d)), the gap between the performance of TSQR and ScaLAPACK narrows (Property 5).
One may observe that the time spent in intra-node, then intra-cluster, and finally inter-cluster communications becomes negligible as the dimensions of the matrices increase. For larger matrices (which would not fit in the memory of our machines), we may thus even expect that communications over the grid for ScaLAPACK would become negligible, and thus that TSQR and ScaLAPACK would eventually achieve a similar (scalable) performance (Property 5).

VI. CONCLUSION AND PERSPECTIVES

This paper has revisited the performance behavior of common dense linear algebra operations in a grid computing environment. Contrary to past studies, we have shown that they can achieve a performance speed up by using multiple geographical sites of a computational grid. To do so, we have articulated a recently proposed algorithm (CAQR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. Our experimental study, conducted on the Grid'5000 platform, focused on a particular operation, the QR factorization of TS matrices. We showed that its performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).

We have shown, both theoretically through our models and experimentally, that TSQR is a scalable algorithm on the grid. TSQR is an important algorithm in itself since, given a set of vectors, TSQR is a stable way to generate an orthogonal basis for it. TSQR will come in handy as an orthogonalization scheme for sparse iterative methods (eigensolvers or linear solvers). TSQR is also the panel factorization of CAQR. A natural question is whether CAQR scales as well on the grid. From the models, there is no doubt that CAQR should scale. However, we will need to perform the experiments to confirm this claim. We note that the work and conclusions we have reached here for TSQR/CAQR can be (trivially) extended to TSLU/CALU [25] and to the Cholesky factorization [5].

Our approach is based on ScaLAPACK. However, recent algorithms that better fit emerging architectures would certainly have improved the performance obtained on each cluster and, ultimately, the global performance. For instance, recursive factorizations have been shown to achieve a higher performance on distributed memory machines [17]. Other codes benefit from multicore architectures [1].

If, as discussed in the introduction, the barriers preventing computational grids from competing against supercomputers are multiple, this study shows that the performance of large-scale dense linear algebra applications can scale with the number of geographical sites. We plan to extend this work to the QR factorization of general matrices and then to other one-sided factorizations (Cholesky, LU). Load balancing to take into account the heterogeneity of clusters is another direction to investigate. The use of recursive algorithms to achieve higher performance is to be studied too.

THANKS

The authors thank Laura Grigori for her constructive suggestions.

REFERENCES

[1] E. Agullo, B. Hadri, H. Ltaief, and J. Dongarra. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC, 2009.
[2] Bruce Allen, Carl Christensen, Neil Massey, Tolu Aina, and Miguel Angel Marquina. Network computing with Einstein@Home and Climateprediction.net. Technical report, CERN, Geneva, 11 Jul 2005.
[3] David P. Anderson. BOINC: A system for public-resource computing and storage. In Rajkumar Buyya, editor, GRID, pages 4-10. IEEE Computer Society, 2004.
[4] R. M. Badia, D. Du, E. Huedo, A. Kokossis, I. M. Llorente, R. S. Montero, M. de Palol, R. Sirvent, and C. Vázquez. Integration of GRID superscalar and GridWay metascheduler with the DRMAA OGF standard. In Proceedings of the 14th International Euro-Par Conference, volume 5168 of LNCS, pages 445-455. Springer, 2008.
[5] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication-optimal parallel and sequential Cholesky decomposition: extended abstract. In SPAA, pages 245-252, 2009.
[6] P. Bar, C. Coti, D. Groen, T. Herault, V. Kravtsov, A. Schuster, et al. Running parallel applications with topology-aware grid middleware. In 5th IEEE International Conference on e-Science (eScience'09), December 2009. To appear.
[7] A. L. Beberg, D. L. Ensign, G. Jayachandran, S. Khaliq, and V. S. Pande. Folding@home: Lessons from eight years of volunteer distributed computing. In 8th IEEE International Workshop on High Performance Computational Biology (HiCOMB 2009).
[8] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.
[9] Raphaël Bolze. Analyse et déploiement de solutions algorithmiques et logicielles pour des applications bioinformatiques à grande échelle sur la grille. PhD thesis, École Normale Supérieure de Lyon, October 2008.
[10] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 35:38-53, 2009.
[11] E. Caron, F. Desprez, and C. Tedeschi. Efficiency of tree-structured peer-to-peer service discovery systems. In IPDPS, pages 1-8, 2008.
[12] J. Choi, J. Demmel, I. S. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. W. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers – design issues and performance. In PARA, pages 95-106, 1995.
[13] Jaeyoung Choi, Jack Dongarra, Susan Ostrouchov, Antoine Petitet, David W. Walker, and R. Clinton Whaley. A proposal for a set of parallel basic linear algebra subprograms. In PARA, pages 107-114, 1995.
[14] E. Chu and A. George. QR factorization of a dense matrix on a hypercube multiprocessor. SIAM J. Sci. Stat. Comput., 11(5):990-1028, 1990.
[15] C. Coti, T. Herault, S. Peyronnet, A. Rezmerita, and F. Cappello. Grid services for MPI. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'08), pages 417-424, Lyon, France, May 2008.
[16] R. Das, B. Qian, S. Raman, R. Vernon, J. Thompson, P. Bradley, S. Khare, M. D. Tyka, D. Bhat, D. Chivian, D. E. Kim, W. H. Sheffler, L. Malmström, A. M. Wollacott, C. Wang, I. Andre, and D. Baker. Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins, 69(S8):118-128, September 2007.
[17] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR factorizations. CoRR, arxiv.org/abs/0806.2159, 2008.
[18] Jack Dongarra, Piotr Luszczek, and Antoine Petitet.
The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803-820, 2003.
[19] Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, Department of Computer Science, University of Tennessee, April 1994.
[20] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 2nd edition, 2003.
[21] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, pages 97-104, Budapest, Hungary, September 2004.
[22] E. Gabriel, M. M. Resch, T. Beisel, and R. Keller. Distributed computing in a heterogeneous computing environment. In Proceedings of the 5th European PVM/MPI Users' Group Meeting, volume 1497 of LNCS, pages 180-187. Springer, 1998.
[23] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, USA, 2nd edition, 1989.
[24] K. Goto and R. A. van de Geijn. High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35(1), 2008.
[25] Laura Grigori, James Demmel, and Hua Xiang. Communication avoiding Gaussian elimination. In SC, page 29, 2008.
[26] B. C. Gunter and R. A. van de Geijn. Parallel out-of-core computation and updating of the QR factorization. ACM Trans. Math. Soft., 31(1):60-78, March 2005.
[27] D. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Rev., 20:740-777, 1978.
[28] Marty Humphrey, Mary Thompson, and Keith Jackson. Security for grids. Proceedings of the IEEE, 93(3):644-652, March 2005.
[29] Nicholas T. Karonis, Brian R. Toonen, and Ian T. Foster. MPICH-G2: A grid-enabled implementation of the Message Passing Interface. CoRR, arxiv.org/abs/cs.DC/0206040, 2002.
[30] J. Kurzak and J. J. Dongarra. QR factorization for the CELL processor. Scientific Programming, Special Issue: High Performance Computing with the Cell Broadband Engine, 17(1-2):31-42, 2009.
[31] Stefan M. Larson, Christopher D. Snow, Michael Shirts, and Vijay S. Pande. Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology. arxiv.org/abs/0901.0866, 2009.
[32] R. E. Lord, J. S. Kowalik, and S. P. Kumar. Solving linear algebraic equations on an MIMD computer. J. ACM, 30(1):103-117, 1983.
[33] Jeff Napper and Paolo Bientinesi. Can cloud computing reach the Top500? Technical Report AICES-2009-5, Aachen Institute for Computational Engineering Science, RWTH Aachen, February 2009.
[34] A. Petitet, S. Blackford, J. Dongarra, B. Ellis, G. Fagg, K. Roche, and S. Vadhiyar. Numerical libraries and the grid: The GrADS experiments with ScaLAPACK. Technical Report UT-CS-01-460, ICL, University of Tennessee, April 2001.
[35] A. Pothen and P. Raghavan. Distributed orthogonal factorization: Givens and Householder algorithms. SIAM Journal on Scientific and Statistical Computing, 10:1113-1134, 1989.
[36] G. Quintana-Ortí, E. S. Quintana-Ortí, E. Chan, F. G. Van Zee, and R. A. van de Geijn. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In Proceedings of PDP'08, 2008. FLAME Working Note #24.
[37] R. Reddy and A. Lastovetsky.
HeteroMPI + ScaLAPACK: Towards a ScaLAPACK (dense linear solvers) on heterogeneous networks of computers. Volume 4297 of LNCS, pages 242-253, Bangalore, India, 18-21 December 2006. Springer.
[38] A. Sameh and D. Kuck. On stable parallel linear system solvers. J. ACM, 25(1):81-91, 1978.
[39] R. Schreiber and C. Van Loan. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput., 10(1):53-57, 1989.
[40] Sathish S. Vadhiyar and Jack J. Dongarra. Self adaptivity in grid computing. Concurrency and Computation: Practice and Experience, 2005.
[41] Edward Walker. Benchmarking Amazon EC2 for high-performance scientific computing. USENIX Login, 33(5):18-23, 2008.
[42] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-25, 2001.
