Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems

We evaluate optimized parallel sparse matrix-vector operations for several representative application areas on widespread multicore-based cluster configurations. First the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Beyond the single node, the performance of parallel sparse matrix-vector operations is often limited by communication overhead. Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be achieved by using a dedicated communication thread, which may run on a virtual core. Moreover we identify performance benefits of hybrid MPI/OpenMP programming due to improved load balancing even without explicit communication overlap. We compare performance results for pure MPI, the widely used "vector-like" hybrid programming strategies, and explicit overlap on a modern multicore-based cluster and a Cray XE6 system.

Authors: Gerald Schubert, Holger Fehske, Georg Hager, Gerhard Wellein

Gerald Schubert, Holger Fehske
Institute of Physics, University of Greifswald
Felix-Hausdorff-Str. 6, 17487 Greifswald, Germany

Georg Hager, Gerhard Wellein
Erlangen Regional Computing Center, University of Erlangen-Nuremberg
Martensstr. 1, 91058 Erlangen, Germany

June 14, 2011

1 Introduction

Many problems in science and engineering involve the solution of large eigenvalue problems or extremely sparse systems of linear equations arising from, e.g., the discretization of partial differential equations. Sparse matrix-vector multiplication (spMVM) is the dominant operation in many of those solvers and may easily consume most of the total run time.
A highly efficient, scalable spMVM implementation is thus fundamental, and complements advancements and new developments in the high-level algorithms.

For more than a decade there has been an intense debate about whether the hierarchical structure of current HPC systems needs to be considered in parallel programming, or if pure MPI is sufficient. Hybrid approaches based on MPI+OpenMP have been implemented in codes and kernels for various application areas and compared with traditional MPI implementations. Most results are hardware-specific, and sometimes contradictory. In this paper we analyze hybrid MPI+OpenMP variants of a general parallel spMVM operation. Beyond the naive approach of using OpenMP for parallelization of kernel loops ("vector mode") we also employ a hybrid "task mode" to overcome or mitigate a weakness of standard MPI implementations: the lack of truly asynchronous communication in nonblocking MPI calls. We test our implementation against pure MPI approaches for two application scenarios on an InfiniBand cluster as well as a Cray XE6 system.

Listing 1: CRS sparse matrix-vector multiplication kernel

  do i = 1, Nr
    do j = row_ptr(i), row_ptr(i+1)-1
      C(i) = C(i) + val(j) * B(col_idx(j))
    enddo
  enddo

1.1 Related work

In recent years the performance of various spMVM algorithms has been evaluated by several groups [1, 2, 3]. Covering different matrix storage formats and implementations on various types of hardware, they have reviewed a more or less large number of publicly available matrices and reported on the obtained performance. Scalable parallel spMVM implementations have also been proposed [4, 5], mostly based on an MPI-only strategy. Hybrid parallel spMVM approaches had already been devised before the emergence of multicore processors [6, 7].
Recently a "vector mode" approach could not compete with a scalable MPI implementation for a specific problem on a Cray system [4]. There is no up-to-date literature that systematically investigates novel features like multicore, ccNUMA node structure, and simultaneous multithreading (SMT) for hybrid parallel spMVM.

1.2 Sparse matrix-vector multiplication and node-level performance model

A possible definition of a "sparse" matrix is that the number of its nonzero entries grows only linearly with the matrix dimension; however, not all problems are easily scaled, so in general a sparse matrix may be defined to contain "mainly" zero entries. Since keeping such a matrix in computer memory with all zeros included is usually out of the question, an efficient format to store the nonzeros only is required. The most widely used variant is "Compressed Row Storage" (CRS) [8]. It does not exploit specific features that may emerge from the underlying physical problem like, e.g., block structures, symmetries, etc., but is broadly recognized as the most efficient format for general sparse matrices on cache-based microprocessors. All nonzeros are stored in one contiguous array val(:), row by row, and the starting offsets of all rows are contained in a separate array row_ptr(:). Array col_idx(:) contains the original column index of each matrix entry. A matrix-vector multiplication with a right-hand-side (RHS) vector B(:) and a result vector C(:) can then be written as shown in Listing 1. Here Nr is the number of matrix rows. While arrays C(:) and val(:) are traversed contiguously, access to B(:) is indexed and may potentially cause very low spatial and temporal locality in this data stream. The performance of spMVM operations on a single compute node is often limited by main memory bandwidth. Code balance [9] is thus a good metric to establish a simple performance model.
We assume the average length of the inner (j) loop to be Nnzr = Nnz/Nr, where Nnz is the total number of nonzero matrix entries. Then the contiguous data accesses in the CRS code generate (8 + 4 + 16/Nnzr) bytes of memory traffic for a single inner loop iteration, where the first two contributions come from the matrix val(:) (8 bytes) and the index array col_idx(:) (4 bytes), while the last term reflects the update of C(i) (write allocate + evict). The indirect access pattern to B(:) is determined by the sparsity structure of the matrix and cannot be modeled in general. However, B(:) needs to be loaded at least once from main memory, which adds another 8/Nnzr bytes per inner iteration. Limited cache size and nondiagonal access typically require loading at least parts of B(:) multiple times in a single MVM. This is quantified by a machine- and problem-specific parameter κ: For each additional time that B(:) is loaded from main memory, κ increases by 8/Nnzr. Together with the computational intensity of 2 flops per j iteration the code balance is

  B_CRS = (12 + 24/Nnzr + κ) / 2 bytes/flop = (6 + 12/Nnzr + κ/2) bytes/flop.   (1)

Figure 1: Sparsity patterns of the matrices described in Sect. 1.3.1: (a) HMEp (N = 6 201 600, Nnz = 92 527 872), (b) HMeP (N = 6 201 600, Nnz = 92 527 872), (c) HMrcm (N = 6 201 600, Nnz = 92 527 872), (d) sAMG (N = 22 786 800, Nnz = 160 222 796), (e) UHBR (N = 4 497 520, Nnz = 552 324 672). (a)-(c) describe the same physical system, but use a different numbering of the basis elements. See text for details. Square subblocks have been aggregated and color-coded according to occupancy to improve visibility.

On the node level B_CRS can be used to determine an upper performance limit by measuring the node memory bandwidth (e.g., using the STREAM benchmark) and assuming κ = 0.
Moreover, κ can be determined experimentally from the spMVM floating point performance and the memory bandwidth drawn by the CRS code (see Sect. 2). Since the "slimmest" matrices used here have Nnzr ≈ 7...15, each additional access to B(:) incurs a nonnegligible contribution to the data transfer in those cases.

Note that this simple model neglects performance-limiting aspects beyond bandwidth bottlenecks, like in-cache transfer time, load imbalance, communication and/or synchronization overhead, and the adverse effects of nonlocal memory access across ccNUMA locality domains (LDs).

1.3 Experimental setting

1.3.1 Test matrices

Since the sparsity pattern may have substantial impact on the single node performance and parallel scalability, we have chosen three application areas known to generate extremely sparse matrices.

As a first test case we use a matrix from exact diagonalization of strongly correlated electron-phonon systems in solid state physics. Here generic microscopic models are used to treat both charge (electrons) and lattice (phonons) degrees of freedom in second quantization. Choosing a finite-dimensional basis set, which is the direct product of basis sets for both subsystems (electrons ⊗ phonons), the generic model can be represented by a sparse Hamiltonian matrix. Iterative algorithms such as Lanczos or Jacobi-Davidson are used to compute low-lying eigenstates of the Hamilton matrices, and more recent methods based on polynomial expansion allow for computation of spectral properties [10] or time evolution of quantum states [11]. In all those algorithms, spMVM is the most time-consuming step.
Figure 2: Node topology of the benchmark systems: (a) Intel dual Westmere node with two NUMA locality domains; (b) Cray XE6/AMD dual Magny Cours node with four NUMA locality domains. Dashed boxes indicate sockets.

In this paper we consider the Holstein-Hubbard model (cf. [12] and references therein) and choose six electrons (subspace dimension 400) on a six-site lattice coupled to 15 phonons (subspace dimension 1.55 × 10^4). The resulting matrix of dimension 6.2 × 10^6 is very sparse (Nnzr ≈ 15) and can have two different sparsity patterns, depending on whether the phononic or the electronic basis elements are numbered contiguously (see Figs. 1(a) and (b), respectively). We also applied the well-known "Reverse Cuthill-McKee" (RCM) algorithm [13] to the Hamilton matrix in order to improve spatial locality in the access to the right hand side vector, and to optimize interprocess communication patterns towards near-neighbor exchange. Since the RCM-optimized structure (Fig. 1(c)) showed no performance advantage over the HMeP variant (Fig. 1(b)), either on the node or on the highly parallel level, we will not consider RCM any further in the following.
The second matrix was generated by the adaptive multigrid code sAMG (see [14, 15], and references therein) for the irregular discretization of a Poisson problem on a car geometry. Its matrix dimension is 2.2 × 10^7 with an average of Nnzr ≈ 7 entries per row (see Fig. 1(d)).

The UHBR matrix (see Fig. 1(e)) originates from aeroelastic stability investigations of an ultra-high bypass ratio (UHBR) turbine fan of the German Aerospace Center (DLR) with a linearized Navier-Stokes solver [16]. This solver is part of the parallel simulation system TRACE (Turbo-machinery Research Aerodynamic Computational Environment), which was developed by DLR's Institute for Propulsion Technology. Its matrix dimension is 4.5 × 10^6 with an average of Nnzr ≈ 123 entries per row, making it a rather 'densely populated' sparse matrix in comparison to the other test cases.

For symmetric matrices as considered here it would be sufficient to store the upper triangular matrix elements and perform, e.g., a parallel symmetric CRS spMVM [4]. The data transfer volume is then reduced by almost a factor of two, allowing for a corresponding performance improvement. We do not use this optimization here for two major reasons. First, the discussion of the hybrid parallel vs. MPI-only implementation should not be restricted to the special case of explicitly symmetric matrices. Second, an efficient shared-memory implementation of a symmetric CRS spMVM base routine has been presented only very recently [17].

1.3.2 Test machines

Intel Nehalem EP / Westmere EP: The two Intel platforms represent a "tick" step within Intel's "tick-tock" product strategy.
Both processors only differ in a few microarchitectural details; the most important difference is that Westmere, due to the 32 nm production process, accommodates six cores per socket instead of four while keeping the same L3 cache size per core (2 MB) as Nehalem. The processor chips (Xeon X5550 and X5650) used for the benchmarks run at 2.66 GHz base frequency with "Turbo Mode" and Simultaneous Multithreading (SMT) enabled. A single socket forms its own ccNUMA LD via three DDR3-1333 memory channels (see Fig. 2(a)), allowing for a peak bandwidth of 32 GB/s. We use standard dual-socket nodes that are connected via fully nonblocking QDR InfiniBand (IB) networks. The Intel compiler in version 11.1 and the Intel MPI library in version 4.0.1 were used throughout. Thread-core affinity was controlled with the LIKWID [18] toolkit.

Figure 3: Node-level performance for the test systems: (a) Intel Nehalem EP; (b) Intel Westmere EP and AMD Magny Cours. Effective STREAM triad bandwidth and spMVM performance using the HMeP matrix (bars) are shown. In (a) we also report the measured memory bandwidth for the spMVM operation.

Cray XE6 / AMD Magny Cours: The Cray XE6 system is based on dual-socket nodes with AMD Magny Cours 12-core processors (2.1 GHz Opteron 6172) and the latest Cray "Gemini" interconnect. The internode bandwidth of the 2D torus network is beyond the capability of QDR InfiniBand. The single node architecture depicted in Fig.
2(b) reveals a unique feature of the AMD Magny Cours chip series: The 12-core package comprises two 6-core chips with separate L3 caches and memory controllers, tightly bound by "1.5" HyperTransport (HT) 16x links. Each 6-core unit forms its own NUMA LD via two DDR3-1333 channels, i.e., a two-socket node comprises four NUMA locality domains. In total the AMD design uses eight memory channels, allowing for a theoretical main memory bandwidth advantage of 8/6 over a Westmere node. The Cray compiler in version 7.2.8 was used for the Cray/AMD measurements. In Sect. 3.1 we also show MPI performance results for an older Cray XT4 system based on AMD Opteron "Barcelona" processors.

2 Node-level performance analysis

The basis for each parallel program must be an efficient single core / node implementation. Assuming general sparse matrix structures, the CRS format presented above is very suitable for modern cache-based multicore processors [19]. Even advanced machine-specific optimizations such as nontemporal prefetch instructions for Opteron processors provide only minor benefits [4] and are thus not considered here. A simple OpenMP parallelization of the outermost loop, together with an appropriate NUMA-aware data placement strategy, has proven to provide the best node-level performance. We choose the HMeP, HMEp, and UHBR matrices as reference cases for our performance model.

Intrasocket and intranode spMVM scalability should always be discussed together with effective STREAM triad numbers, which form a practical upper bandwidth limit.¹

¹ Nontemporal stores have been suppressed in the STREAM measurements and the bandwidth numbers reported have been scaled appropriately (× 4/3) to account for the write-allocate transfer.

Figure 3(a) shows the memory bandwidth on the Nehalem EP platform drawn by the STREAM triad
and the spMVM as measured with LIKWID [18]. While the STREAM bandwidth soon saturates within a socket, the spMVM bandwidth and the corresponding GFlop/s numbers still benefit from the use of all cores. This is a typical behavior for codes with (partially) irregular data access patterns. However, the fact that more than 85% of the STREAM bandwidth can be reached with spMVM indicates that our CRS implementation makes good use of the resources. The maximum spMVM performance can be estimated by dividing the memory bandwidth by the code balance (1), using Nnzr = 15 and κ = 0. For a single socket the spMVM draws 18.1 GB/s (STREAM triads: 21.2 GB/s), allowing for a maximum performance of 2.66 GFlop/s (3.12 GFlop/s). Combining the measured performance (2.25 GFlop/s) and bandwidth of the spMVM operation with B_CRS(κ) we find κ = 2.5, i.e., 2.5 additional bytes of memory traffic on B(:) per inner loop iteration (37.3 bytes per row) are required due to limited cache capacity. Thus the complete vector B(:) is loaded six times from main memory to cache, but each element is used Nnzr = 15 times on average. This ratio gets worse if the matrix bandwidth increases. For the HMEp matrix we found κ = 3.79, which translates to a 50% increase in the additional data transfers for B(:). The code balance implies a performance drop of about 10%, which is consistent with our measurements.

The UHBR matrix represents an interesting case, since the average number of nonzeros per row is Nnzr ≈ 123. At a measured spMVM bandwidth of 18.9 GB/s and a performance of 2.99 GFlop/s per socket we arrive at κ = 0.43, which means that each element of the RHS is loaded 8 times; however, it is used 123 times, which leads to the conclusion that the contribution of the RHS to the memory traffic is minor.
We have included this example here because it shows that the data transfer for the RHS may be negligible even if it is loaded many times; Nnzr plays a decisive role. Nevertheless, since this matrix shows perfect scaling in the highly parallel case we will not discuss it any further in this work.

In Fig. 3(b) we summarize the performance characteristics for Intel Westmere and AMD Magny Cours, which both comprise six cores per locality domain. While the AMD system is slower on a single LD, its node-level performance is about 25% higher than on Westmere due to its four LDs per node. Within the domains spMVM saturates at four cores on both architectures, leaving ample room to use the remaining cores for other tasks, like communication (see Sect. 3.5). In the following we will report results for the Westmere and Magny Cours platforms only.

3 Distributed-memory parallelization

3.1 Nonblocking point-to-point communication in MPI

Strong scaling of MPI-parallel spMVM is inevitably limited by communication overhead. Hence, it is vital to find ways to hide communication costs as far as possible. A widely used approach is to employ nonblocking point-to-point MPI calls for overlapping communication with useful work. However, it has been known for a long time that most MPI implementations support progress, i.e., actual data transfer, only when MPI library code is executed by the user process, although the hardware even on standard InfiniBand-based clusters does not hinder truly asynchronous point-to-point communication.²

Very simple benchmark tests can be used to find out whether the nonblocking point-to-point communication calls in an MPI library do actually support truly asynchronous transfers.
Listing 2 (adapted from [9]) shows an example where an MPI_Irecv() operation is set off before a function (do_work()) performs register-only operations for a configurable amount of time. If the nonblocking message transfer overlaps with computation, the overall run time of the code will be constant as long as the time for computation is smaller than the time for message transfer. We have used a constant large message length of 80 MB to get accurate measurements. Note that the results of such tests may depend crucially on tunable parameters like, e.g., the message size for the cross-over from an "eager" to a "rendezvous" protocol, especially for small messages. For the application scenarios described later, most messages are still beyond such limits.

² In fact, dedicated "offload" communication hardware was not unusual in historic supercomputer architectures, the Intel Paragon of the early 1990s being a typical example.

Listing 2: A simple benchmark to determine the capability of the MPI library to perform asynchronous nonblocking point-to-point communication for large messages (receive variant).

  if (rank == 0) {
    stime = MPI_Wtime();
    MPI_Irecv(rbuf, mcount, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    do_work(calctime);
    MPI_Wait(&req, &status);
    etime = MPI_Wtime();
    cout << calctime << " " << etime - stime << endl;
  } else {
    MPI_Send(sbuf, mcount, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
  }

Figure 4 shows overall runtime versus time for computation on the Intel Westmere cluster and the Cray XT4/XE6 systems, respectively. We only report internode results, since no current MPI implementation on any system supports asynchronous nonblocking intranode communication. On the Intel cluster we compared three different MPI implementations: Intel MPI, OpenMPI, and MVAPICH2.
The latter was compiled with the --enable-async-progress flag. OpenMPI 1.5 supports a similar setting, but it is documented to be still under development in the current version (1.5.3), and we were not able to produce a stable configuration with progress threads activated. The results show that only OpenMPI (even without progress threads explicitly enabled) was capable of asynchronous nonblocking communication, albeit only when sending data via a nonblocking send. The nonblocking receive is not asynchronous, however. Comparing the Cray XT4 and XE6 systems, it is striking that only the older XT4 has an MPI implementation that supports asynchronous nonblocking transfers for large messages.

In summary, one must conclude that the naive assumption that "nonblocking" and "asynchronous" are the same thing cannot be upheld for most current MPI implementations; as a consequence, overlapping computation with communication is often a matter of explicit programming.

3.2 MPI-parallel sparse MVM

In the following sections we will contrast the "naive" overlap via nonblocking MPI with an approach that uses a dedicated OpenMP thread for explicitly asynchronous transfers. We adopt the nomenclature from [6] and [9] and distinguish between "vector mode" and "task mode."

Figure 4: Internode results for the nonblocking MPI benchmark on the Westmere-based test cluster and on Cray XT4 and XE6 systems. Unless indicated otherwise, results for nonblocking send and receive are almost identical.
Figure 5: Schematic timeline view of the implemented hybrid kernel versions: (a) no communication/calculation overlap, (b) naive overlap using nonblocking MPI, and (c) explicit overlap by a dedicated communication thread. The abbreviations in the top bars indicate the individual contributions that will be discussed in the following sections: ir (post of MPI_Irecv), cs (local gather (copy) & send), wa (MPI_Waitall), ca (calculation: all elements), lc (calculation: local elements), nl (calculation: nonlocal elements), cp (pure copy), pr (parallel region).

MPI parallelization of spMVM is generally done by distributing the nonzeros (or, alternatively, the matrix rows), the right hand side vector B(:), and the result vector C(:) evenly across MPI processes. Due to off-diagonal nonzeros, every process requires some parts of the RHS vector from other processes to complete its own chunk of the result, and must send parts of its own RHS chunk to others. Note that it is generally difficult to establish good load balancing for computation and communication at the same time. Unless indicated otherwise we use a balanced distribution of nonzeros across the MPI processes here.
At a given number of processes, the resulting communication pattern depends only on the sparsity structure, so the necessary bookkeeping needs to be done only once. The actual spMVM computations can be performed either by a single thread or, if threading is available, by multiple threads inside the MPI process.

3.3 Vector-like parallelization: Vector mode without overlap

Gathering the data to be sent by a process into a contiguous send buffer may be done after the receive has been initiated, potentially hiding the cost of copying (see Fig. 5(a)). We call this naive approach "hybrid vector mode," since it strongly resembles the programming model for vector-parallel computers [6]: The time-consuming (although probably parallel) computation step does not overlap with communication overhead. This is actually how "MPI+OpenMP hybrid programming" is still defined in most publications. The question whether and why using multiple threads per MPI process may improve performance compared to a pure MPI version on the same hardware is not easy to answer. Depending on the problem, different aspects come into play, and there is no general rule [20].

3.4 Vector-like parallelization: Vector mode with naive overlap

As an alternative one may consider hybrid vector mode with nonblocking MPI (see Fig. 5(b)) to potentially overlap communication with the part of spMVM that can be completed using local RHS elements only. After the nonlocal elements have been received, the remaining spMVM operations can be performed. A disadvantage of splitting the spMVM in two parts is that the local result vector must be written twice, incurring additional memory traffic.
The performance model (1) can be modified to account for an additional data transfer of 16/Nnzr bytes per inner loop iteration, leading to a modified code balance of

  B_CRS^split = (6 + 20/Nnzr + κ/2) bytes/flop.   (2)

For Nnzr ≈ 7...15 and assuming κ = 0, one may expect a node-level performance penalty between 15% and 8%, and even less if κ > 0. For simplicity we will also use the term "vector mode" for pure MPI versions with single-threaded computation.

3.5 Task mode with explicit overlap

A safe way to ensure overlap of communication with computation is to use a separate communication thread and leave the computational loops to the remaining threads. We call this "hybrid task mode," because it establishes a functional decomposition of tasks (communication vs. computation) across the resources (see Fig. 5(c)): One thread executes MPI calls only, while all others are used to copy data into send buffers, perform the spMVM with the local RHS elements, and finally (after all communication has finished) do the remaining spMVM parts. Since spMVM saturates at about 3-5 threads per locality domain (as shown in Fig. 3(b)), at least one core per LD is available for communication without adversely affecting node-level performance. On architectures with SMT, like the Intel Westmere, there is also the option of using one compute thread per physical core and binding the communication thread to a logical core. Note that, even with perfect overlap, one may expect the speedup compared to any vector mode to always be less than a factor of two.

Apart from the additional memory traffic due to writing the result vector twice (see Sect. 3.4), another drawback of hybrid task mode is that the standard OpenMP loop worksharing directive cannot be used, since there is no concept of "subteams" in the current OpenMP standard.
Work distribution is thus implemented explicitly, using one contiguous chunk of nonzeros per compute thread.

4 Internode performance results and discussion

In this section we present strong scaling results for the Holstein-Hubbard (both basis numberings) and sAMG matrices. Besides a discussion of the benefits of hybrid task mode we also provide evidence that hybrid vector mode, even without overlap, may improve performance due to better load balancing.

4.1 Basis ordering for the Holstein-Hubbard matrix

Despite the different sparsity patterns of HMeP and HMEp their node-level performance differs only by roughly 10% (HMEp: 3.89 GFlop/s, HMeP: 4.34 GFlop/s on a Westmere EP node). The question arises whether it is possible to choose an appropriate partitioning of the matrix (or, equivalently, a certain number of processes) so that communication overhead is greatly reduced, and whether the basis ordering plays a relevant role.

Figure 6: Cost of the contributions to parallel spMVM vs. number of nodes for HMeP (top row, constant number of nonzeros per process) and HMEp (bottom row, constant number of rows per process) on the Westmere cluster. All physical cores of a node were used, either by MPI processes (vector modes) or additional OpenMP threads (task mode). Results are given in terms of a 'cost' function, which is the product of time required and the number of nodes. See Fig. 5 for a description of the abbreviations. Costs for posting the MPI_Irecv are marginal on this scale.

An analysis of the
Costs for p osting the MP I IRecv are marginal on this scale. individual contributions of the par allel spMVM shows the paramo unt ro le assumed by the sparsity pattern as soo n as comm unica tion becomes an issue. In Fig . 6 we sho w for each nu m be r of no des and one MPI pro ces s pe r core (vector mo des) or one MPI pro cess p er no de (task mo de) the cost for co mputing and co mmunicating (time × num b er of no des); each box with whisk e rs denotes the v a r iation across all pro cesses in the parallel run (10th/90th and 2 5th/75 th p erc e ntiles). The broa dening of the b oxes and whis kers with incr easing no de count is a c onsequence of lo a d im balance; se e Sect. 4.2 .1 for details a b o ut this issue. 4.1.1 V ector mo des The purely computational cost (ca, lc, nl) is r oughly on par in both matr ices, a nd scales almost linearly with the num ber of no des (approximately horiz o ntal trend for the slow est pro cesses in the cos t plot), no matter which v ariant of vector mode is ch osen. In contrast, the cost for communication sp ent in MPI Waital l g r ows with the num b er of pro cessors , which is to be exp ec ted s ince the loc al matrix blo cks b ec o me smaller. F urthermore, the inho mo geneous matrix structures result in increas ing v ariatio ns o f the transferred data volume (and thus communication time). While b oth matrices show this tre nd, there a re particula r node counts for which the communication pa ttern of HMEp is obviously more fav or able (10 , 15, 20). At 10 0 96 192 288 PE rank 0 2 4 time [ms] 0 96 192 288 PE rank 0 24 48 72 0 4 8 12 16 time [ms] 0 24 48 72 0 2 4 6 8 10 0 20 40 60 time [ms] ca wa cs 0 2 4 6 8 10 nl lc Figure 7: Con tr ibutions to the runtime of one spMVM (HMeP) for each MPI pro cess at different nu m be r s of pro cesses (12, 96, a nd 38 4 fro m top to b ottom; s trong scaling) on the W estmere cluster. The first column corres p o nds to vector mode without overlap (see Fig. 
5 (a)) wherea s the seco nd column presents the run times fo r v ecto r mo de with na ive ov erla p (see Fig. 5 (b)). One cor e per pro cess was used throughout. Abbrev ia tions as in Fig. 5. these p oints the nu m be r of co res is commensur able with the diagonal blo ck str ucture of the matrix, the ma jo r ity of the communication happ ens inside the no des, and o nly few larg e messages are passed b etw e e n no de s . But ev e n then the communication cost for HMEp is still larger tha n for HMeP . 4.1.2 T ask mo de In ta sk mode, which was used here with one MPI pr o cess per no de, the o verall cost is dominated by the parallel regio n (pr). The reduced n um ber of communicating MPI pro cesses alleviates the load balancing pro blem in the communication s cheme, which will b e disc ussed in the following section. More imp or tantly , up to ar ound 15 no des the cost for the par allel reg ion is roug hly constant in the HMeP cas e, implying that communication is hidden co mpletely behind computation as discuss ed in Sect. 3.5. Beyond this po int , communication time starts to b ecome dominant at least for some pr o cesses, as indica ted by the slow rise of the top whisker (90 th p erc entile) fo r the par a llel region. Ho wever, w e still expect decen t per fo rmance scaling for this setup. The HMEp matrix, due to its unfav ora ble communication pattern, do es not allow for sufficient overlap. 4.2 T estcase HMeP 4.2.1 Analysis of runtime con tributions in the MPI-parallel case In order to pinpoint the r elev ant p erfor mance-limiting asp ects in the MPI-para llel case we show in Fig. 7 the different co ntributions to the run time of a s ing le spMVM for each MP I pro cess when using o ne o f the tw o v e c tor mo de v ariants, with one pro ces s p er core a nd at 1 2, 96, and 384 pro cesses, resp ectively . In these gra phs the overall r unt ime is a lwa ys given by the highes t bar; v a riations in runtime acro ss pro cess es are a sig n of load imbalance. 
Owing 11 0 8 16 24 32 #nodes 0 10 20 30 40 50 60 performance [GFlop/s] 0 8 16 24 32 #nodes (a) vector mode without overlap (b) vector mode with naive overlap (c) task mode best Cray ! 0 8 16 24 32 #nodes one MPI process per physical core one MPIprocess per NUMA LD one MPI process per node Figure 8: Strong s caling p erfor mance data fo r spMVM with the HMeP matrix (constant nu m be r of nonzero s p er pro ces s) o n the In tel W estmere cluster for different pure MP I and hybrid v ariants (kernel version (a) – (c)) a s in Fig . 5). Sym b o ls on each data set indicate the 50% parallel efficiency po int with r esp ect to the b est s ingle-no de version. The be st v ar iant on the Cray XE6 system is shown fo r reference (see text for details ). to the separation of loc a l from nonlo cal spMVM parts, vector mode with na ive overlap (right column) is alwa y s slower than vector mo de without overlap (left c olumn). The histog rams r eflect roug hly the shap e of the spa rsity pattern for HMeP as shown in Fig. 1: As the n umber of pr o cesses is increase d, “sp eede r s,” i.e., pro cesses that a r e faster than the rest, start to show up mainly at the top and b ottom ends of the matrix (low and high ranks). This imbalance is c hiefly caus ed by a smalle r amount of co mmunication ( MPI_Wa itall ), whereas the contribution of co mputation to the runtime is muc h mor e ba l- anced acro ss all pro cesses. One may thus exp ect that a low er num b er o f MPI pro cesses on the same num b er of core s (i.e., using multiple threads p er pro ce ss) improv es p e rformance due to b etter load bala ncing ev en without explicit overlap. A t larg er pro cess count s, execution time star ts to be do mina ted by commu nication. Hence, even with multiple threads p er MPI pro cess and therefore improved loa d balancing, explicit overlap of communication with the lo cal par t of the spMVM (lab eled “lc” in the right g roup of diagrams in Fig. 
7) is exp ected to s how significant sp eedups. 4.2.2 P erformance resul ts A t one MPI pro c e ss p e r ph ysical core (left panel in Fig. 8), vector mo de with naive ov erlap is alw ays s low er than the v ariant without ov erla p bec a use the additiona l da ta transfer on the result vector cannot b e c omp ensated by ov e rlapping communication with computation. T a s k mo de w as implemen ted here with one communication thread p er MPI pro cess , r unning o n the se c o nd virtual core . In this case, point-to-p oint transfers explicitly o verlap with the lo cal spMVM, leading to a noticea ble p erfo r mance b o os t. One ma y conclude that MP I libr aries with supp or t for pro gress threa ds could follow the same strategy a nd bind those threa ds to un used lo gical cores, thereby a llowing overlap ev en with single-thr e aded user co de. With one MPI pro cess p er NUMA lo cality domain (middle pa ne l) the adv a nt age o f task mo de is even mo re pronounced. Also the plain vector mode without overlap shows some notable sp eedup compa red to the MPI-o nly v e rsion, which was exp ected fro m the discussion of load balancing in the previous section. Since the memory bus of an LD is already saturated with four threads, it do es not make a differ ence whether six w or ker thre ads are used with one communication thread on a virtual c o re, o r whether a complete physical core is devoted to comm unica tion. The same is true with o nly one MPI pro cess (1 2 threads) pe r no de (right panel). The s y mbols in Fig. 8 indicate the 50 % parallel efficiency p oint (with resp ect to the be s t single-no de p er formance as rep o r ted in Fig. 3 (b)) on each data set. In practice one w ould 12 0 96 192 288 PE rank 0 1 2 3 time [ms] 0 96 192 288 PE rank 0 24 48 72 0 4 8 12 time [ms] 0 24 48 72 0 2 4 6 8 10 0 20 40 60 80 time [ms] ca wa cs 0 2 4 6 8 10 nl lc Figure 9: Contributions to the runt ime of one spMVM (sAMG) for ea ch MPI pr o cess. 
Parameters and abbrevia tio ns as in Fig . 7 . not go b eyond this num b er of no des be cause of bad r esource utilization. F or the ma trix and the system under investigation it is clear that task mo de a llows strong scaling to muc h higher levels of par allelism with acceptable para llel efficiency than any v ariant of vector mo de. How ever, even the vector mo de v ariants show a significant per formance adv ant age at multiple threads p er pro cess due to improv ed load balancing. Contrary to ex p e ctations based on the s ingle-no de pe r formance num b er s (Fig. 3 (b)), the Cray XE6 can gener ally not match the p erforma nce of the W estmere cluster at larg er no de counts, with the exc e ption of pure MPI where b oth a re roughly on par (left panel, Cray results for vector mode with na ive o verlap). When using threa de d MPI pr o cesses (middle and r ight panel), tas k mo de per forms b est on the Cr ay system. The adv antage ov er the other kernel v ariants is by far no t as pro nounced as on W estmere, how ever. W e hav e obser ved a strong influence of job top olog y and machine lo ad on the communication per formance over the 2D torus netw ork. Since s pMVM requires significant non-nea rest-neighbo r communication with growing pro ces s counts, the nonblocking fat tree netw o rk on the W estmere cluster seems to be b etter suited for this kind of problem. The pr esented results a re b est v alues obtained on a dedicated XE6 machine. There is also a universal dro p in sca lability b eyond ab out six no des, whic h is largely independent of the particular h y br id mo de . This can b e ascrib ed to a strong decrease in ov era ll internode communication volume when the num b er of no des is small. The effect is somewhat less pronounced for pure MPI, since the ov er head of intrano de messag e passing cannot b e neglected. 4.3 T estcase sAMG 4.3.1 Analysis of runtime con tributions in the MPI-parallel case As shown in Fig. 
9, there is only a slig ht lo ad imbalance for the sAMG matrix even at 3 8 4 pro cesses (low er diagrams). W e thus only exp ect a marginal p er formance b enefit from using threaded MPI pro ces ses. Also the ov erall fraction of communication o verhead is q uite small; task mo de with e x plicit overlap o f communication will hence not lead to a significa nt sp eedup. 13 0 8 16 24 32 #nodes 0 30 60 90 120 performance [GFlop/s] 0 8 16 24 32 #nodes (a) vector mode without overlap (b) vector mode with naive overlap (c) task mode best Cray ! 0 8 16 24 32 #nodes one MPI process per physical core one MPI process per NUMA LD one MPI process per node Figure 1 0: Strong scaling per formance data for spMVM with the sAMG matrix (same v a riants as in Fig. 8). Parallel efficiency is ab ov e 50% for all v ersions up to 32 no des. The Cray system per formed b est in vector mo de without ov erlap for all cases . 4.3.2 P erformance resul ts As exp ected fr om the a nalysis in the previous section, all v ar iants and hybrid mo des (pure MPI, o ne pro cess pe r L D, and one pr o cess pe r node) show simila r s caling behavior on the W estmere cluster, and there is no a dv a ntage of tas k mo de ov er vector mo de witho ut ov er lap or over pure MPI (see Fig. 1 0). This observ a tio n s uppo rts the genera l rule that it makes no sense to co nsider MPI+Op enMP hybrid progra mming if the pure MPI code a lr eady sca les well and b ehaves in acco rdance with a single-no de p erformance mo del. On the Cray XE 6, vector mo de without overlap p erfor ms b est across a ll hybrid mo des , with a significant adv a ntage when running one MPI pr o cess with six threads p er LD. This asp ect is still to b e investigated. 5 Summary and outlo ok W e hav e investigated the per formance prop erties of pure MPI and h ybr id MP I + Op enMP hybrid v a riants of sparse matrix- vector multiplication on tw o m ultico re-base d parallel sys- tems, using matrices with different s parsity patterns. 
The single-node performance analysis on Intel Westmere and AMD Magny Cours processors showed that memory-bound sparse MVM saturates the memory bus of a locality domain already at about four threads, leaving free cores for explicit computation/communication overlap. As most current MPI libraries do not support truly asynchronous point-to-point transfers, explicit overlap enabled substantial performance gains for strong scaling of communication-bound problems. Since the communication thread can run on a virtual core, MPI implementations could use the same strategy for internal "progress threads" and so enable asynchronous progress without changes in MPI-only user code.

We have also identified the clear advantage of using threaded MPI processes, even without explicit communication overlap, in cases where computation is well balanced and load imbalance is caused by the communication pattern.

Acknowledgments

We thank J. Treibig, R. Keller, and T. Schönemeyer for valuable discussions, A. Basermann for providing and supporting the RCM transformation and the UHBR test case, as well as K. Stüben and H. J. Plum for providing and supporting the sAMG test case. We acknowledge financial support from KONWIHR II (project HQS@HPC II) and thank CSCS Manno for granting access to their Cray XE6 system.
