ParMAC: distributed optimisation of nested functions, with application to learning binary autoencoders
Authors: Miguel Á. Carreira-Perpiñán, Mehdi Alizadeh
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu
May 30, 2016

Abstract

Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2014). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing.
We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. To illustrate our general results in a specific algorithm, we develop ParMAC to learn binary autoencoders with application to fast, approximate image retrieval. We implement this using Message Passing Interface (MPI) in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points. The speedups achieved agree well with the prediction of our theoretical speedup model.

1 Introduction: big data, parallel processing and nested models

Serial computing has reached a plateau and parallel, distributed architectures are becoming widely available, from machines with a few cores to cloud computing with thousands of machines. The combination of powerful nested models with large datasets is a key ingredient to solve difficult problems in machine learning, computer vision and other areas, and it underlies recent successes in deep learning (Hinton et al., 2012; Le et al., 2012; Dean et al., 2012). Unfortunately, parallel computation is not easy, and many good serial algorithms do not parallelise well. The cost of communicating (through the memory hierarchy or a network) greatly exceeds the cost of computing, both in time and energy, and will continue to do so for the foreseeable future (Fuller and Millett, 2011; Graham et al., 2004). Thus, good parallel algorithms must minimise communication and maximise computation per machine, while creating sufficiently many subproblems (ideally independent) to benefit from as many machines as possible. The load (in runtime) on each machine should be approximately equal. Faults become more frequent as the number of machines increases, particularly if they are inexpensive machines.
Machines may be heterogeneous and differ in CPU and memory; this is the case with initiatives such as SETI@home, which may become an important source of distributed computation in the future. Big data applications have additional restrictions. The size of the data means it cannot be stored on a single machine, so distributed-memory architectures are necessary. Sending data between machines is prohibitive because of the size of the data and the high communication costs. In some applications, more data is collected than can be stored, so data must be regularly discarded. In others, such as sensor networks, limited battery life and computational power imply that data must be processed locally.

In this paper, we focus on machine learning models of the form y = f_{K+1}(... f_2(f_1(x)) ...), i.e., consisting of a nested mapping from the input x to the output y. Such nested models involve multiple parameterised layers of processing and include deep neural nets (Hinton and Salakhutdinov, 2006), cascades for object recognition in computer vision (Serre et al., 2007; Ranzato et al., 2007) or for phoneme classification in speech processing (Gold and Morgan, 2000; Saon and Chien, 2012), wrapper approaches to classification or regression (Kohavi and John, 1997), and various combinations of feature extraction/learning and preprocessing prior to some learning task. Nested and hierarchical models are ubiquitous in machine learning because they provide a way to construct complex models by the composition of simple layers. However, training nested models is difficult even in the serial case because function composition produces inherently nonconvex functions, which makes gradient-based optimisation difficult and slow, and sometimes inapplicable (e.g. with nonsmooth or discrete layers).
Our starting point is a recently proposed technique to train nested models, the method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2012, 2014). This reformulates the optimisation into an iterative procedure that alternates training submodels independently with coordinating them. It introduces significant model and data parallelism, can often train the submodels using existing algorithms, and has convergence guarantees with differentiable functions to a local stationary point, while it also applies with nondifferentiable or even discrete layers. MAC has been applied to various nested models (Carreira-Perpiñán and Wang, 2014; Wang and Carreira-Perpiñán, 2014; Carreira-Perpiñán and Raziperchikolaei, 2015; Raziperchikolaei and Carreira-Perpiñán, 2016; Carreira-Perpiñán and Vladymyrov, 2015). However, the original papers proposing MAC (Carreira-Perpiñán and Wang, 2012, 2014) did not address how to run MAC on a distributed computing architecture, where communication between machines is far costlier than computation. This paper proposes ParMAC, a parallel, distributed framework to learn nested models using MAC, implements it in Message Passing Interface (MPI) for the problem of learning binary autoencoders (BAs), and demonstrates its ability to train on large datasets and achieve large speedups on a distributed cluster. We first review related work (section 2), describe MAC in general and for BAs (section 3) and introduce the ParMAC model and some extensions of it (section 4). Then, we analyse theoretically ParMAC's parallel speedup (section 5) and convergence (section 6). Finally, we describe our MPI implementation of ParMAC for BAs (section 7) and show experimental results (section 8).
Although our MPI implementation and experiments are for a particular ParMAC algorithm (for binary autoencoders), we emphasise that our contributions (the definition of ParMAC and the theoretical analysis of its speedup and convergence) apply to ParMAC in general, for any situation where MAC applies, i.e., nested functions with K layers.

2 Related work

Distributed optimisation and large-scale machine learning have been steadily gaining interest in recent years. Most work has centred on convex optimisation, particularly when the objective function has the form of empirical risk minimisation (data-fitting term plus regulariser) (Cevher et al., 2014). This includes many important models in machine learning, such as linear regression, LASSO, logistic regression or SVMs. Such work is typically based on stochastic gradient descent (SGD) (Bottou, 2010), coordinate descent (CD) (Wright, 2016) or the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). This has resulted in several variations of parallel SGD (McDonald et al., 2010; Bertsekas, 2011; Zinkevich et al., 2010; Gemulla et al., 2011; Niu et al., 2011), parallel CD (Bradley et al., 2011; Richtárik and Takáč, 2013; Liu and Wright, 2015) and parallel ADMM (Boyd et al., 2011; Ouyang et al., 2013; Zhang and Kwok, 2014). It is instructive to consider the parallel SGD case in some detail. Here, one typically runs SGD independently on data subsets (done by P worker machines), and a parameter server regularly gathers the replica parameters from the workers, averages them and broadcasts them back to the workers. One can show (Zinkevich et al., 2010) that, for a small enough step size and under some technical conditions, the distance to the minimum in objective function value satisfies an upper bound.
The upper bound has a term that decreases as the number of workers P increases, so that parallelisation helps, but it has another term that is independent of P, so that past a certain point parallelisation does not help. In practice, the speedups over serial SGD are generally modest. Also, the theoretical guarantees of parallel SGD are restricted to shallow models, as opposed to deep or nested models, because the composition of functions is nearly always nonconvex. Indeed, parallel SGD can diverge with nonconvex models. The intuitive reason for this is that, with local minima, the average of two workers can have a larger objective value than each of the individual workers, and indeed the average of two minima need not be a minimum. In practice, parallel SGD can give reasonable results with nonconvex models if one takes care to average replica models that are close in parameter space and thus associated with the same optimum (e.g. eliminating "stale" models and other heuristics), but this is not easy (Dean et al., 2012).

Little work has addressed nonconvex models. Most of it has focused on deep nets (Dean et al., 2012; Le et al., 2012). For example, Google's DistBelief (Dean et al., 2012) uses asynchronous parallel SGD, with gradients for the full model computed with backpropagation, to achieve data parallelism (with the caveat above), and some form of model parallelism. The latter is achieved by carefully partitioning the neural net into pieces and allocating them to machines to compute gradients. This is difficult to do and requires a careful match of the neural net structure (number of layers and hidden units, connectivity, etc.) to the target hardware. Although this has managed to train huge nets on huge datasets by using tens of thousands of CPU cores, the speedups achieved were very modest.
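The averaging pitfall can be seen in a one-dimensional toy example (ours, not from the cited works): a nonconvex objective with two global minima whose average is not a minimum.

```python
# Toy nonconvex objective f(w) = (w^2 - 1)^2, with two global minima at w = -1 and w = +1.
def f(w):
    return (w**2 - 1)**2

w1, w2 = -1.0, 1.0        # two workers, each sitting at a (different) global minimum
w_avg = (w1 + w2) / 2     # the parameter-server average of the two replicas
print(f(w1), f(w2), f(w_avg))  # 0.0 0.0 1.0: the average is worse than either worker
```

Here the averaged parameter lands exactly on the local maximum between the two basins, which is the mechanism behind the divergence described above.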
Other work has used similar techniques but for GPUs (Chen et al., 2012; Zhang et al., 2013; Coates et al., 2013; Seide et al., 2014). Another recent trend is on parallel computation abstractions tailored to machine learning, such as Spark (Zaharia et al., 2010), GraphLab (Low et al., 2012), Petuum (Xing et al., 2015) or TensorFlow (Abadi et al., 2015), with the goal of making cloud computing easily available to train machine learning models. Again, this is often based on shallow models trained with gradient-based convex optimisation techniques, such as parallel SGD. Some of these systems implement some form of deep neural nets. Finally, there also exist specific approximation techniques for certain types of large-scale machine learning problems, such as spectral problems, using the Nyström formula or other landmark-based methods (Williams and Seeger, 2001; Bengio et al., 2004; Drineas and Mahoney, 2005; Talwalkar et al., 2013; Vladymyrov and Carreira-Perpiñán, 2013, 2016). ParMAC is specifically designed for nested models, which are typically nonconvex and include deep nets and many other models, some of which have nondifferentiable layers. As we describe below, ParMAC has the advantages of being simple and relatively independent of the target hardware, while achieving high speedups.

3 Optimising nested models using the method of auxiliary coordinates (MAC)

Many machine learning architectures share a fundamental design principle: mathematically, they construct a (deeply) nested mapping from inputs to outputs, of the form f(x; W) = f_{K+1}(... f_2(f_1(x; W_1); W_2) ...; W_{K+1}) with parameters W, such as deep nets or binary autoencoders consisting of multiple processing layers. Such problems are traditionally optimised using methods based on gradients computed using the chain rule.
However, such gradients may sometimes be inconvenient to use, or may not exist (e.g. if some of the layers are nondifferentiable). Also, they are hard to parallelise, because of the inherent sequentiality in the chain rule. The method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2012, 2014) is designed to optimise nested models without using chain-rule gradients while introducing parallelism. It solves an equivalent but in appearance very different problem to the nested one, which affords embarrassing parallelisation. The idea is to break nested functional relationships judiciously by introducing new variables (the auxiliary coordinates) as equality constraints. These are then solved by optimising a penalised function using alternating optimisation over the original parameters (which we call the W step) and over the coordinates (which we call the Z step). The result is a coordination-minimisation (CM) algorithm: the minimisation (W) step updates the parameters by splitting the nested model into independent submodels and training them using existing algorithms, and the coordination (Z) step ensures that corresponding inputs and outputs of submodels eventually match. MAC algorithms have been developed for several nested models so far: deep nets (Carreira-Perpiñán and Wang, 2014), low-dimensional SVMs (Wang and Carreira-Perpiñán, 2014), binary autoencoders (Carreira-Perpiñán and Raziperchikolaei, 2015), affinity-based loss functions for binary hashing (Raziperchikolaei and Carreira-Perpiñán, 2016) and parametric nonlinear embeddings (Carreira-Perpiñán and Vladymyrov, 2015). In this paper we focus mostly on the particular case of binary autoencoders. These define a nonconvex, nondifferentiable problem, yet their MAC algorithm is simple and effective.
It allows us to demonstrate, in an actual implementation in a distributed system, the fundamental properties of ParMAC: how MAC introduces parallelism; how ParMAC keeps the communication between machines low; the use of stochastic optimisation in the W step; and the tradeoff between the different amounts of parallelism in the W and Z steps. It also allows us to test experimentally how good our theoretical model of the speedup is. We first give the detailed MAC algorithm for binary autoencoders, and then generalise it to K > 1 hidden layers.

3.1 Optimising binary autoencoders using MAC

A binary autoencoder (BA) is a usual autoencoder but with a binary code layer. It consists of an encoder h(x) that maps a real vector x ∈ R^D onto a binary code vector with L < D bits, z ∈ {0,1}^L, and a linear decoder f(z) which maps z back to R^D in an effort to reconstruct x. We will call h a binary hash function (see later). Let us write h(x) = s(Ax) (A includes a bias by having an extra dimension x_0 = 1 for each x), where A ∈ R^{L×(D+1)} and s(t) is a step function applied elementwise, i.e., s(t) = 1 if t ≥ 0 and s(t) = 0 otherwise. Given a dataset of D-dimensional patterns X = (x_1, ..., x_N), our objective function, which involves the nested model y = f(h(x)), is the usual least-squares reconstruction error:

    E_BA(h, f) = sum_{n=1}^N ||x_n − f(h(x_n))||^2 .    (1)

Optimising this nonconvex, nonsmooth function is NP-complete. Where the gradients with respect to A do exist they are zero, so optimisation of h using chain-rule gradients does not apply. We introduce as auxiliary coordinates the outputs of h, i.e., the codes for each of the N input patterns, and obtain the following equality-constrained problem:

    min_{h,f,Z} sum_{n=1}^N ||x_n − f(z_n)||^2   s.t.   z_n = h(x_n), z_n ∈ {0,1}^L, n = 1, ..., N.
(2)

Note the codes are binary. We now apply the quadratic-penalty method (it is also possible to apply the augmented Lagrangian method; Nocedal and Wright, 2006) and minimise the following objective function while progressively increasing µ, so the constraints are eventually satisfied:

    E_Q(h, f, Z; µ) = sum_{n=1}^N ( ||x_n − f(z_n)||^2 + µ ||z_n − h(x_n)||^2 )   s.t.   z_n ∈ {0,1}^L, n = 1, ..., N.    (3)

Finally, we apply alternating optimisation over Z and W = (h, f). This results in the following two steps:

• Over Z for fixed (h, f), this is a binary optimisation on NL variables, but it separates into N independent optimisations, each on only L variables, with the form of a binary proximal operator (where we omit the index n):

    min_z ||x − f(z)||^2 + µ ||z − h(x)||^2   s.t.   z ∈ {0,1}^L.

After some transformations, this problem can be solved exactly for small L by enumeration, or approximately for larger L by alternating optimisation over bits, initialised by solving the problem relaxed to [0,1] and truncating its solution (see Carreira-Perpiñán and Raziperchikolaei, 2015 for details).

• Over W = (h, f) for fixed Z, we obtain L + D independent problems: one for each of the L single-bit hash functions (which try to predict Z optimally from X), each solvable by fitting a linear SVM; and one for each of the D linear decoders in f (which try to reconstruct X optimally from Z), each a linear least-squares problem. With linear h and f this simply involves fitting L SVMs to (X, Z) and D linear regressors to (Z, X).

The user must choose a schedule for the penalty parameter µ (a sequence of values 0 < µ_1 < ··· < ∞). This should increase slowly enough that the binary codes can change considerably and explore better solutions before the constraints are satisfied and the algorithm stops.
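For small L, the per-point Z step can be solved exactly by enumerating all 2^L codes. The following numpy sketch illustrates this (it is our own illustration, not the authors' implementation); it assumes a linear decoder f(z) = Wz + b, and the names z_step_enumerate, W and b are ours:

```python
import numpy as np
from itertools import product

def z_step_enumerate(x, h_x, W, b, mu):
    """Exact Z step for one data point by enumerating all 2^L binary codes
    (feasible only for small L):
        min_z ||x - (W z + b)||^2 + mu ||z - h(x)||^2,  z in {0,1}^L."""
    L = W.shape[1]
    best_z, best_obj = None, np.inf
    for bits in product([0.0, 1.0], repeat=L):
        z = np.array(bits)
        obj = np.sum((x - (W @ z + b))**2) + mu * np.sum((z - h_x)**2)
        if obj < best_obj:
            best_z, best_obj = z, obj
    return best_z
```

By construction, the returned code is never worse than the encoder's own output h(x), which is the feasible point the penalty term pulls towards.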
With BAs, MAC stops for a finite value of µ (Carreira-Perpiñán and Raziperchikolaei, 2015). This occurs whenever Z does not change compared to the previous Z step, which gives a practical stopping criterion. Also, in order to generalise well to unseen data, we stop iterating for a µ value not when we (sufficiently) optimise E_Q(h, f, Z; µ), but when the precision of the hash function on a validation set decreases. This is a form of early stopping that guarantees that we improve (or leave unchanged) the initial Z, and besides is faster. We also have to initialise Z. This can be done by running PCA and binarising its result, for example. Fig. 1 gives the MAC algorithm for BAs.

    input X_{D×N} = (x_1, ..., x_N), L ∈ N
    initialise Z_{L×N} = (z_1, ..., z_N) ∈ {0,1}^{L×N}
    for µ = µ_1 < µ_2 < ··· (increasing)
        W step (h): for l = 1, ..., L: h_l ← fit SVM to (X, Z_{l,·})
        W step (f): f ← least-squares fit to (Z, X)
        Z step: for n = 1, ..., N: z_n ← arg min_{z ∈ {0,1}^L} ||x_n − f(z)||^2 + µ ||z − h(x_n)||^2
        if no change in Z and Z = h(X) then stop
    return h, Z = h(X)

Figure 1: Binary autoencoder MAC algorithm.

The BA was proposed as a way to learn good binary hash functions for fast, approximate information retrieval (Carreira-Perpiñán and Raziperchikolaei, 2015). Binary hashing (Grauman and Fergus, 2013) has emerged in recent years as an effective way to do fast, approximate nearest-neighbour searches in image databases. The real-valued, high-dimensional image vectors are mapped onto a binary space with L bits and the search is performed there using Hamming distances, at a vastly faster speed and smaller memory (e.g. N = 10^9 points with D = 500 take 2 TB, but only 8 GB using L = 64 bits, which easily fits in RAM).
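The search in the binary space amounts to XOR plus popcount on packed codes. A minimal sketch of such a Hamming-distance search (our own illustration; the function name and data layout are assumptions, not part of the paper):

```python
import numpy as np

def hamming_search(codes, query, k=5):
    """Brute-force Hamming search over N codes of L = 64 bits stored as uint64.
    XOR sets exactly the bits where code and query differ; counting them
    gives the Hamming distance."""
    x = codes ^ query                                        # differing bits set
    # popcount each uint64 by viewing it as 8 bytes and summing the unpacked bits
    d = np.unpackbits(x.view(np.uint8).reshape(-1, 8), axis=1).sum(axis=1)
    idx = np.argsort(d, kind="stable")[:k]                   # k nearest by Hamming distance
    return idx, d[idx]
```

At L = 64 bits each code is 8 bytes, which gives the 8 GB figure for 10^9 points quoted above (versus 2 TB for 500 single-precision floats per point).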
As shown by Carreira-Perpiñán and Raziperchikolaei (2015), training BAs with MAC beats approximate optimisation approaches such as relaxing the codes or the step function in the encoder, and yields state-of-the-art binary hash functions h in unsupervised problems, improving over established approaches such as iterative quantisation (ITQ) (Gong et al., 2013). In this paper, we focus on linear hash functions because these are, by far, the most used type of hash function in the binary hashing literature, due to the fact that computing the binary codes for a test image must be fast at run time. We also provide an experiment with nonlinear hash functions (an RBF network).

3.2 MAC with K layers

We now consider the more general case of MAC with K hidden layers (Carreira-Perpiñán and Wang, 2012, 2014), inputs x and outputs y (for a BA, x = y). It helps to think of the case of a deep net, and we will use it as a running example, but the ideas apply beyond deep nets. Consider a regression problem of mapping inputs x to outputs y (both high-dimensional) with a deep net f(x), given a dataset of N pairs (x_n, y_n). We minimise the least-squares error (other loss functions are possible):

    E(W) = (1/2) sum_{n=1}^N ||y_n − f(x_n; W)||^2,   f(x; W) = f_{K+1}(... f_2(f_1(x; W_1); W_2) ...; W_{K+1})    (4)

where each layer function has the form f_k(x; W_k) = σ(W_k x), i.e., a linear mapping followed by a squashing nonlinearity (σ(t) applies a scalar function, such as the sigmoid 1/(1 + e^{−t}), elementwise to a vector argument, with output in [0,1]). We introduce one auxiliary variable per data point and per hidden unit and define the following equality-constrained optimisation problem:

    min (1/2) sum_{n=1}^N ||y_n − f_{K+1}(z_{K,n}; W_{K+1})||^2   s.t.   z_{K,n} = f_K(z_{K−1,n}; W_K), ..., z_{1,n} = f_1(x_n; W_1),   n = 1, ..., N.
(5)

Each z_{k,n} can be seen as the coordinates of x_n in an intermediate feature space, or as the hidden unit activations for x_n. Intuitively, by eliminating Z we see this is equivalent to the nested problem (4); we can prove under very general assumptions that both problems have exactly the same minimisers (Carreira-Perpiñán and Wang, 2012). Applying the quadratic-penalty method, we optimise the following function:

    E_Q(W, Z; µ) = (1/2) sum_{n=1}^N ||y_n − f_{K+1}(z_{K,n}; W_{K+1})||^2 + (µ/2) sum_{n=1}^N sum_{k=1}^K ||z_{k,n} − f_k(z_{k−1,n}; W_k)||^2    (6)

over (W, Z) and drive µ → ∞. This defines a continuous path (W*(µ), Z*(µ)) which, under mild assumptions (Carreira-Perpiñán and Wang, 2012), converges to a minimum of the constrained problem (5), and thus to a minimum of the original problem (4). In practice, we follow this path loosely. The quadratic-penalty objective function can be seen as breaking the functional dependences in the nested mapping f and unfolding it over layers. Every squared term involves only a shallow mapping; all variables (W, Z) are equally scaled, which improves the conditioning of the problem; and the derivatives required are simpler: we require no backpropagated gradients over W, and sometimes no gradients at all. We now apply alternating optimisation of the quadratic-penalty objective over Z and W:

W step (submodels). Minimising over W for fixed Z results in a separate minimisation over the weights of each hidden unit: each is a single-layer, single-unit submodel that can be solved with existing algorithms (logistic regression).

Z step (coordinates). Minimising over Z for fixed W separates over the coordinates z_n for each data point n = 1, ..., N, and can be solved using the derivatives with respect to z of the single-layer functions f_1, ...
, f_{K+1} (omitting the subindex n):

    min_z ||y − f_{K+1}(z_K)||^2 + µ sum_{k=1}^K ||z_k − f_k(z_{k−1})||^2.

Thus, the W step results in many independent, single-layer, single-unit submodels that can be trained with existing algorithms, without extra programming cost. The Z step is new and has the form of a "generalised" proximal operator (Rockafellar, 1976; Combettes and Pesquet, 2011). MAC reduces a complex, highly coupled problem (training a deep net) to a sequence of simple, uncoupled problems (the W step) which are coordinated through the auxiliary variables (the Z step). For a large net with a large dataset, this affords an enormous potential for parallel computation.

4 ParMAC: a parallel, distributed computation model for MAC

We now turn to the contribution of this paper, the distributed implementation of MAC algorithms. As we have seen, a specific MAC algorithm depends on the model and objective function and on how the auxiliary coordinates are introduced. We can achieve steps that are closed-form, convex, nonconvex, binary, or others. However, the following always hold: (1) In the Z step, the N subproblems for z_1, ..., z_N are independent, one per data point. Each z_n step depends on all or part of the current model. (2) In the W step, there are M independent submodels, where M depends on the problem. For example, M is the number of hidden units in a deep net, or the number of hash functions and linear decoders in a BA. Each submodel depends on all the data and coordinates (usually, a given submodel depends, for each n, on only a portion of the vector (x_n, y_n, z_n)). We now show how to turn this into a distributed, low-communication ParMAC algorithm, give an MPI implementation of ParMAC for BAs, and discuss the convergence of ParMAC.
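As a concrete illustration of the unfolding (our own sketch, not the authors' code), the following evaluates the quadratic-penalty objective (6) for one data point with sigmoid layers. When the auxiliary coordinates equal the forward-pass activations, every penalty term vanishes and E_Q reduces to the nested error (4), which is the elimination argument used above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nested(x, Ws):
    """Nested model f(x; W) = f_{K+1}(... f_1(x; W_1) ...), sigmoid layers."""
    for W in Ws:
        x = sigmoid(W @ x)
    return x

def E_Q(x, y, zs, Ws, mu):
    """Quadratic-penalty objective (6) for one data point: each squared term
    couples only two consecutive layers (zs = [z_1, ..., z_K])."""
    err = 0.5 * np.sum((y - sigmoid(Ws[-1] @ zs[-1]))**2)   # output term
    prev = x                                                # z_0 = x
    for W, z in zip(Ws[:-1], zs):
        err += 0.5 * mu * np.sum((z - sigmoid(W @ prev))**2)  # penalty for layer k
        prev = z
    return err
```

Note that each term touches only (z_{k−1}, z_k, W_k), which is what makes the W and Z steps decouple.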
Throughout the paper, unless otherwise indicated, we will use the term "machine" to mean a single-CPU processing unit with its own local memory and disk, which can communicate with other machines in a cluster through a network or shared memory.

4.1 Description of ParMAC

The basic idea in ParMAC is as follows. With large datasets in distributed systems, it is imperative to minimise data movement over the network, because the communication time generally far exceeds the computation time in modern architectures. In MAC we have 3 types of data: the original training data (X, Y), the auxiliary coordinates Z, and the model parameters (the submodels). Usually, the latter type is far smaller. In ParMAC, we never communicate training or coordinate data; each machine keeps a disjoint portion of (X, Y, Z) corresponding to a subset of the points. Only model parameters are communicated, during the W step, following a circular topology, which implicitly implements a stochastic optimisation.

Figure 2: ParMAC model with P = 4 machines, M = 12 submodels and N = 40 data points. "w_h" represents the submodels (hash functions and decoders for BAs, hidden unit weight vectors for deep nets). Submodels h, h + M, h + 2M and h + 3M are copies of submodel h, but only one of them is the most currently updated. At the end of the W step all copies are identical.

The model parameters are the hash functions h and the decoder f for BAs, and the weight vector w_h of each hidden unit h for deep nets. Let us see this in detail (refer to fig. 2).
Assume for simplicity we have P identical processing machines, each with its own memory and CPU, which are connected through a network. The machines are connected in a circular (ring) unidirectional topology, i.e., machine 1 → machine 2 → ··· → machine P → machine 1, where "machine p → machine q" means machine p can send data directly to machine q (and we say machine q is the successor of machine p). Call D = {(x_n, y_n, z_n): n ∈ {1, ..., N}} the entire dataset and corresponding coordinates. Each machine p will store a subset D_p = {(x_n, y_n, z_n): n ∈ I_p} such that the subsets are disjoint and their union is the entire data, i.e., the index sets satisfy I_p ∩ I_q = ∅ if p ≠ q and ∪_{p=1}^P I_p = {1, ..., N}.

The Z step is very simple. Before the Z step starts¹, each machine will contain all the (just updated) submodels. This means that in the Z step each machine p processes its auxiliary coordinates {z_n: n ∈ I_p} independently of all other machines, i.e., no communication occurs.

The W step is more subtle. At the beginning of the W step, each machine will contain all the submodels and its portion of the data and (just updated) coordinates. Each submodel must have access to the entire data and coordinates in order to update itself and, since the data cannot leave its home machine, the submodel must go to the data (this contrasts with the intuitive notion of the model sitting in a computer while data arrive and are processed). We achieve this in the circular topology as follows. We assume synchronous processing for simplicity, but in practice one would implement this asynchronously. Assume arithmetic modulo P and an imaginary clock whose period equals the time that any one machine takes to process its portion M/P of the submodels. At each clock tick, the P machines each update a different portion M/P of the submodels.
For example, in fig. 2, at clock tick 1 machine 1 updates submodels 1–3 using its data D_{I_1} (where I_1 = {1, ..., 10}); machine 2 updates submodels 4–6; machine 3 updates submodels 7–9; and machine 4 updates submodels 10–12. This happens in parallel. Then each machine sends the submodels it updated to its successor, also in parallel. In the next tick, each machine updates the submodels it just received, i.e., machine 1 updates submodels 10–12, machine 2 updates submodels 1–3, machine 3 updates submodels 4–6, and machine 4 updates submodels 7–9 (and each machine always uses its own data portion, which never changes). This is repeated until each submodel has visited each machine and thus has been updated with the entire dataset D. This happens after P ticks, and we call this an epoch. This process may be repeated for e epochs in eP ticks. At this time, each machine contains M/P submodels that are finished (i.e., updated e times over the entire dataset), and the remaining M(1 − 1/P) submodels it contains are not finished; indeed, the finished versions of those submodels reside in other machines. Finally, before starting with the Z step, each machine must contain all the (just updated) submodels (i.e., the parameters for the entire nested model). We achieve this² by running a final round of communication without computation, i.e., each machine sends its just-updated submodels to its successor.

¹ Also, the machines need not all start the Z step at the same time. A machine can start the Z step on its data as soon as it has received all the updated submodels in the W step. Likewise, as soon as a machine finishes its Z step, it can start the W step immediately, without waiting for all other machines to finish their Z step. However, in our implementation we consider the W and Z steps as barriers, so that all machines start the W or Z step at the same time.
Thus, after one clock tick, machine p sends M/P final submodels to machine p + 1 and receives M/P submodels from machine p − 1. After P − 1 clock ticks, each machine has received the remaining M(1 − 1/P) submodels that were finished by other machines, hence each machine contains a (redundant) copy of all the current submodels. Fig. 3 illustrates the sequence of operations during one epoch for the example of fig. 2.

In practice, we use an asynchronous implementation. Each machine keeps a queue of submodels to be processed, and repeatedly performs the following operations: extract a submodel from the queue, process it (except in epoch e + 1) and send it to the machine's successor (which will insert it in its queue). If the queue is empty, the machine waits until it is nonempty. The queue of each machine is initialised with the portion of submodels associated with that machine. Each submodel carries a counter that is initially 1 and increases every time it visits a machine. When it reaches Pe, the submodel is in the last machine and the last epoch. When it reaches P(e + 1) − 1, it has undergone e epochs of processing and all machines have a copy of it, so it has finished the W step. Since each submodel is updated as soon as it visits a machine, rather than computing the exact gradient once it has visited all machines and then taking a step, the W step is really carrying out stochastic steps for each submodel. For example, if the update is done by a gradient step, we are actually implementing stochastic gradient descent (SGD) where the minibatches are of size N/P (or smaller, if we subdivide a machine's data portion into minibatches, which should typically be the case in practice). From this point of view, we can regard the W step as doing SGD on each submodel in parallel by having each submodel visit the minibatches in each machine.
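As a concrete illustration of this schedule, the following toy simulation (ours, not the ParMAC implementation: `ring_w_step` is a made-up name, P is assumed to divide M, and "updating" a submodel is reduced to counting the data portions it has seen) reproduces the bookkeeping of the fig. 2 example:

```python
def ring_w_step(P, M, e):
    """Toy synchronous W step on a ring: e epochs of P ticks, then a final
    communication-only round of P - 1 ticks."""
    assert M % P == 0
    per = M // P
    # hold[p] = ids of the submodels currently held (and updated) by machine p
    hold = [list(range(p * per, (p + 1) * per)) for p in range(P)]
    visits = [0] * M                 # data portions each submodel has seen
    for _ in range(e * P):           # e epochs of P update-and-send ticks
        for p in range(P):
            for m in hold[p]:
                visits[m] += 1       # one SGD pass over machine p's portion
        hold = [hold[(p - 1) % P] for p in range(P)]   # send to successor
    # final round: P - 1 communication-only ticks, no computation
    final = [set(h) for h in hold]
    for _ in range(P - 1):
        hold = [hold[(p - 1) % P] for p in range(P)]
        for p in range(P):
            final[p].update(hold[p])
    return visits, final

visits, final = ring_w_step(P=4, M=12, e=2)
print(visits)                                    # every entry is e*P = 8
print(all(f == set(range(12)) for f in final))   # True: all models everywhere
```

After eP update ticks plus P − 1 communication-only ticks, every submodel has been updated e times over the whole dataset and every machine holds a copy of every submodel, as described above.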
In summary, using P machines, ParMAC iterates as follows:

W step: The submodels (hash functions and decoders for BAs) visit each machine. This implies we train them with stochastic gradient descent, where one "epoch" for a submodel corresponds to that submodel having visited all P machines. All submodels are communicated in parallel, asynchronously with respect to each other, in a circular topology. With e epochs, the entire model parameters are communicated e + 1 times. The last round of communication is needed to ensure each machine has the most updated version of the model for the Z step.

Z step: Identical to MAC, each data point's coordinates z_n are optimised independently, in parallel over machines (since each machine contains x_n, y_n, z_n, and all the model parameters). No communication occurs at all.

4.2 A W step with only two rounds of communication

As described, and as implemented in our experiments, running e epochs in the W step requires e rounds of communication (plus a final round). However, we can run e epochs with only 1 round of communication by having a submodel do e consecutive passes within each machine's data. In the example of fig. 2, running e = 2 epochs for submodel w_1 means the following: instead of visiting the data as 1,...,10, 11,...,20, 21,...,30, 31,...,40, 1,...,10, 11,...,20, 21,...,30, 31,...,40, it visits the data as 1,...,10, 1,...,10, 11,...,20, 11,...,20, 21,...,30, 21,...,30, 31,...,40, 31,...,40. (We can also have intermediate schemes, such as doing between 1 and e within-machine passes.) This reduces the amount of shuffling, but that should not be a problem if the data are randomly distributed over machines (and we can still do within-machine shuffling).
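To make the two visiting orders concrete, here is a small generator (a sketch under the fig. 2 sizes, P = 4 machines with 10 points each; `visit_order` and its arguments are our invention, not code from the paper):

```python
def visit_order(portions, e, within_machine_passes=1):
    """Order in which one submodel sees the data: e epochs in total, doing
    `within_machine_passes` consecutive passes over each machine's portion
    before moving on (1 = plain scheme, e = the two-rounds scheme)."""
    assert e % within_machine_passes == 0
    order = []
    for _ in range(e // within_machine_passes):   # rounds of communication
        for portion in portions:
            order += portion * within_machine_passes
    return order

# fig. 2 sizes: 4 machines holding points 1-10, 11-20, 21-30, 31-40
portions = [list(range(10 * p + 1, 10 * p + 11)) for p in range(4)]
plain = visit_order(portions, e=2)                            # 2 rounds
packed = visit_order(portions, e=2, within_machine_passes=2)  # 1 round
print(plain[:12])    # [1, 2, ..., 10, 11, 12]
print(packed[:12])   # [1, 2, ..., 10, 1, 2]
```

Both orders visit every point exactly e times; only the number of rounds of communication (and the amount of shuffling) differs.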
This effectively reduces the total communication in the W step to 2 rounds regardless of the number of epochs e (with the second round needed to ensure each machine has the most updated submodels).

Footnote 2: In MPI, this can be directly achieved with MPI_Alltoall broadcasting, which scatters/gathers data from all members to all members of a group (a complete exchange). However, in this paper we implement it using the circular topology mechanism described.

Figure 3: Illustration of one epoch of the synchronous version of ParMAC's W step for the example of fig. 2 with P = 4 machines and M = 12 submodels (we only show the "model" part of the figure). Each row corresponds to one clock tick (ticks 1–5, machines 1–4), within which each machine computes on its portion of M/P submodels (coloured gold) and sends them to its successor. The last tick (= P + 1) is the start of the next epoch, at which point all submodels (coloured green) have been updated over the entire dataset.

4.3 Extensions of ParMAC

In addition, the ParMAC model offers good potential for data shuffling, load balancing, streaming and fault tolerance, which make it attractive for big data. We describe these next.

Data shuffling: It is well known that shuffling (randomly reordering) the dataset prior to each epoch improves the SGD convergence speed. With distributed systems, this can sometimes be a problem and require data movement across machines. Shuffling is easy in ParMAC. Within a machine, we can simply access the local data (minibatches) in random order at each epoch.
Across machines, we can simply reorganise the circular topology randomly (while still circular) at the beginning of each new epoch (by generating a random permutation and resetting the successor's address of each machine). We could even have each submodel follow a different, random circular topology. However, we do not implement this because it is unlikely to help (since the submodels are independent) and can unbalance the load over machines.

Load balancing: This is simple because the work in both the W and Z steps is proportional to the number of data points N. Indeed, in the W step each submodel must visit every data point once per epoch, so, even if the submodels differ in size, the training time of any submodel is proportional to N. In the Z step, each data point is a separate problem dependent on the current model (which is the same for all points), thus all N problems are formally identical in complexity. Hence, under the assumption that the machines are identical and that each data point incurs the same runtime, load balancing is trivial: the N points are allocated in equal portions of N/P to each machine. If the processing power of machine p is proportional to α_p > 0 (where α_p could represent the clock frequency of machine p, say), then we allocate to machine p a subset of the N points proportional to α_p, i.e., machine p gets N α_p/(α_1 + ··· + α_P) data points. This is done once and for all at loading time.

In practice, we can expect some degradation of the parallel speedup even with identical machines and submodels of the same type. This is because machines do vary for various reasons, e.g. the runtime can be affected by differences in ventilation across machines located in different areas of a data centre, or because machines are running other user processes in addition to the ParMAC optimisation.
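The proportional allocation rule above (machine p gets N α_p/(α_1 + ··· + α_P) points) can be sketched in a few lines. `allocate` is our hypothetical helper, and the way rounding leftovers are distributed is one arbitrary choice:

```python
def allocate(N, alpha):
    """Give machine p roughly N*alpha[p]/sum(alpha) data points; distribute
    the points lost to truncation so the portions sum exactly to N."""
    total = sum(alpha)
    counts = [int(N * a / total) for a in alpha]
    for p in range(N - sum(counts)):   # fewer than len(alpha) leftovers
        counts[p] += 1
    return counts

# two identical machines plus one twice as fast
print(allocate(1000, [1.0, 1.0, 2.0]))   # [250, 250, 500]
```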
Another type of degradation can happen if the submodels differ significantly in runtime (e.g. because there are different types of submodels): the runtime of the W step will be driven by the slow submodels, which become a bottleneck. As discussed in section 5.4, we can group the M submodels into a smaller number M′ < M of approximately equal-size aggregate submodels, for the purpose of estimating the speedup in theory. This need not be the fastest way to schedule the jobs, and in practice we still process the individual submodels asynchronously.

Streaming: Streaming refers to the ability to discard old data from, and add new data to, training over time. This is useful in online learning, or to allow the data to be refreshed, but it may also be necessary when a machine collects more data than it can store. The circular topology allows us to add or remove machines on the fly easily, and this can be used to implement streaming. We consider two forms of streaming: (1) new data are added within a machine (e.g. as this machine collects new data), and likewise old data are discarded within a machine; and (2) new data are added by adding a new machine to the topology, and old data are discarded by removing an existing machine from the topology. Both forms are easily achieved in ParMAC. The first form, within-machine, is trivial: a machine can always add or remove data without any change to the system, because the data for each machine is private and never interacts with other machines other than by updating submodels. Adding or discarding data is done at the beginning of the Z step. Discarding data simply means removing the corresponding {(x_n, y_n, z_n)} from that machine. Adding data means inserting {(x_n, y_n)} in that machine and, if necessary, creating within that machine coordinate values {z_n} (e.g. by applying the nested model to x_n).
We never upload or send any z values over the network. The second form, creating a new machine or removing an existing one, is barely more complicated, assuming some support from the parallel processing library. We describe it conceptually. Imagine we currently have P machines. We can add a new machine, with its own preloaded data {(x_n, y_n)}, as follows. Adding it to the circular topology simply requires connecting it between any two machines (done by setting the address of their successor): before we have "machine p → machine p + 1", afterwards we have "machine p → new machine → machine p + 1". We add it in the W step, making sure it receives a copy of the final model that has just been finished. The easiest way to do this is by inserting it in the topology at the end of the W step, when each machine is simply sending along a copy of the final submodels. In the Z step, we proceed as usual, but with P + 1 machines. Removing a machine is easier. To remove machine p, we do so in the Z step, by reconnecting "machine p − 1 → machine p + 1" and returning machine p to the cluster. That is all. In the subsequent W step, all machines contain the full model, and the submodels will visit the data in each machine, thus not visiting the data in the removed machine.

Fault tolerance: This situation is similar to discarding a machine in streaming, except that the fault can occur at any time and is not intended. We can handle it with a little extra bookkeeping, and again assuming some support from the parallel processing library. Imagine a fault occurs at machine p and we need to remove it. If it happens during the Z step, all we need to do is discard the faulty machine and reconnect the circular topology.
If it happens during the W step, we also discard and reconnect, but in addition we need to rescue the submodels that were being updated in p, which we lose. To do this, we revert to the previously updated copy of them, which resides in the predecessor of p in the circular topology (if there is no predecessor, we are at the beginning of the W step and we can use any copy in any machine). As for the remaining submodels being updated in other machines, some will have already been updated in p (which require no action) and some will not have been updated in p yet (which should not visit p anymore). We can keep track of this information by tagging each submodel with a list of the machines it has not yet visited. At the beginning of the W step the list of each submodel contains {1, ..., P}, i.e., all machines. When this list is empty for a submodel, then that submodel is finished and needs no further updates. Essentially, the robustness of ParMAC to faults comes from its built-in redundancy. In the Z (and W) step, we can do without the data points in one machine because a good model can still be learned from the remaining data points in the other machines. In the W step, we can revert to older copies of the lost submodels residing in other machines.

The asynchronous implementation of ParMAC we described earlier relied on tagging each submodel with a counter in order to know whether it needs processing and communicating. A more general mechanism to run ParMAC asynchronously is to tag each submodel with a list (per epoch) of machines it has to visit. All a machine p needs to do upon receiving a submodel is check its list: if p is not in the list, then the submodel has already visited machine p and been updated with its data, so machine p simply sends it along to its successor without updating it again.
If p is in the list, then machine p updates the submodel, removes p from its list, and sends it along to its successor. This works even if we use a different communication topology for each submodel at each epoch.

5 A theoretical model of the parallel speedup for ParMAC

In this section we give a theoretical model to estimate the computation and communication times and the parallel speedup in ParMAC. Specifically, eq. (12) gives the speedup S(P) as a function of the number of machines P and other parameters, and it seems to agree well with our experiments (section 8.3). In practice, this model can be used to estimate the optimal number of machines P to use, or to explore the effect on the speedup of different parameter settings (e.g. the number of submodels M). Throughout the rest of the paper, we will call "speedup" S(P) the ratio of the runtime using a single machine (i.e., the serial code) vs using P > 1 machines (the parallel code), and "perfect speedup" the case S(P) = P. Our theoretical model applies to the general ParMAC case of K layers, whether differentiable or not; it only assumes that the resulting submodels after introducing auxiliary coordinates are of the same "size," i.e., have the same computation and communication time (this assumption can be relaxed, as we discuss at the end of the section).

We can obtain a quick, rough understanding of the speedup by appealing to (a generalisation of) Amdahl's law (Goedecker and Hoisie, 2001). ParMAC iterates the W and Z steps as follows (where M is the number of submodels and N the number of data points): a W step with M independent problems, followed by a Z step with N ≫ M independent problems. Roughly speaking, the W step has M independent problems so its speedup would be min(M, P), while the Z step has N independent problems so its speedup would be min(N, P) = P (because in practice N ≥ P).
So the overall speedup would be between M and P depending on the relative runtimes of the W and Z steps. This suggests we should expect a nearly perfect speedup S ≈ P for P ≤ M and diminishing returns for P > M. This simplified picture ignores important factors such as the ratio of computation vs communication (which our model will make more precise), but it does capture the basic, qualitative behaviour of the speedup.

5.1 The theoretical model of the speedup

Let us now develop a more precise, quantitative model. Consider a ParMAC algorithm, operating synchronously, such that there are M independent submodels of the same size in the W step, on a dataset with N training points, distributed over P identical machines (each with N/P points). The ParMAC algorithm runs a certain number of iterations, each consisting of a W and a Z step, so if we ignore small overheads (setup and termination), we can estimate the total runtime as proportional to the number of iterations. Hence, we consider a theoretical model of the runtime of one iteration of the ParMAC algorithm, given the following parameters:

• P: number of machines.
• N: number of training points. We assume N > P is divisible by P. This is not a problem because N ≫ P in practice (otherwise, there would be no reason to distribute the optimisation).
• M: number of submodels in the W step. This may be smaller than, equal to or greater than P.
• e: number of epochs in the W step.
• t_r^W: computation time per submodel and data point in the W step. This is the time to process (within the current epoch) one data point by a submodel, i.e., the time to do an SGD update to a weight vector, per data point (if we use minibatches, then this is the time to process one minibatch divided by the size of the minibatch).
• t_c^W: communication time per submodel in the W step.
This is the time to send one submodel from one machine to another, including overheads such as buffering, partitioning into messages or waiting time. We assume communication does not overlap with computation, i.e., a machine can either compute or communicate at a given time but not both. Also, communication involves time spent both by the sender and the receiver; we interpret t_c^W as the time spent by a given machine in first receiving a submodel and then sending it.
• t_r^Z: computation time per data point in the Z step. This is the time to finish one data point entirely, using whatever optimisation algorithm performs the Z step.

P, N, M and e are integers greater than or equal to 1, and t_r^W, t_c^W and t_r^Z are real values greater than 0. This model assumes that t_r^W, t_c^W and t_r^Z are constant and equal for every submodel and data point. In reality, even if the submodels are of the same mathematical form and dimension (e.g. each submodel is a weight vector of a linear SVM of dimension D), the actual times may vary somewhat due to many factors. However, as we will show in section 8.3, the model does agree quite well with the experimentally measured speedups.

Let us compute the runtimes in the W and Z step under these model assumptions. The runtime in the Z step equals the time for any one machine to process its N/P points on all M submodels, i.e.,

    T_Z(P) = M \frac{N}{P} t_r^Z    (7)

since all machines start and end at the same time and do the same amount of computation, without communication. To compute the runtime in the W step, we again consider the synchronous procedure of section 4. At each tick of an imaginary clock, each machine processes its portion M/P of the submodels and sends it to its successor. After P ticks, this concludes one epoch. This is repeated for e epochs, followed by a final round of communication of all the submodels.
If M is not divisible by P, say M = QP + R with Q, R ∈ ℕ and 0 < R < P, we can apply this procedure pretending there are P − R fictitious submodels (see footnote 3), on which machines do useless work. Then, the runtime in each tick is ⌈M/P⌉ (N/P) t_r^W (time for any one machine to process its N/P points on its portion ⌈M/P⌉ of submodels) plus ⌈M/P⌉ t_c^W (time for any one machine to send its portion of submodels). The total runtime of the W step is then Pe times this plus the time of the final round of communication:

    T_W(P) = \left\lceil \frac{M}{P} \right\rceil \left( t_r^W \frac{N}{P} + t_c^W \right) P e + \left\lceil \frac{M}{P} \right\rceil t_c^W P.    (8)

(The final round actually requires P − 1 ticks, but we take it as P ticks to simplify the equation a bit.) Finally, the total runtime T_W(P) + T_Z(P) of one ParMAC iteration (W and Z step) with P machines is:

    T(P) = \frac{MN}{P} t_r^Z + P \left\lceil \frac{M}{P} \right\rceil \left( e \left( t_r^W \frac{N}{P} + t_c^W \right) + t_c^W \right), \quad P > 1    (9)
    T(1) = M N t_r^Z + M N e t_r^W    (10)

where for P = 1 machine we have no communication (t_c^W = 0). Hence, the parallel speedup is

    S(P) = \frac{T(1)}{T(P)} = \frac{ \frac{M/P}{\lceil M/P \rceil} \left( e t_r^W + t_r^Z \right) }{ \frac{1}{P} \left( e t_r^W + \frac{M/P}{\lceil M/P \rceil} t_r^Z \right) + \frac{1}{N} (e+1) t_c^W } = \frac{ \frac{M}{\lceil M/P \rceil} \left( e t_r^W + t_r^Z \right) P }{ \frac{1}{N} (e+1) t_c^W P^2 + e t_r^W P + \frac{M}{\lceil M/P \rceil} t_r^Z }    (11)

which can be written more conveniently as

    S(P) = \frac{ \rho \frac{M}{\lceil M/P \rceil} P }{ \frac{1}{N} P^2 + \rho_2 P + \rho_1 \frac{M}{\lceil M/P \rceil} }    (12)

by defining the following constants:

    \rho_1 = \frac{t_r^Z}{(e+1) t_c^W} \qquad \rho_2 = \frac{e t_r^W}{(e+1) t_c^W} \qquad \rho = \rho_1 + \rho_2 = \frac{e t_r^W + t_r^Z}{(e+1) t_c^W}.    (13)

These constants can be understood as ratios of computation vs communication, independent of the training set size, number of submodels and number of machines. These ratios depend on the actual computation within the W and Z step, and on the performance of the distributed system (computation power of each machine, communication speed over the network or shared memory, efficiency of the parallel processing library that handles the communication between machines).
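The runtime model of eqs. (7)–(10) is straightforward to evaluate numerically. The sketch below (our transcription, with twr, twc, tzr standing for t_r^W, t_c^W, t_r^Z) checks that the ratio T(1)/T(P) agrees with the closed form of eq. (12):

```python
from math import ceil

def runtime(P, N, M, e, twr, twc, tzr):
    """Runtime of one ParMAC iteration (W step + Z step) on P machines."""
    if P == 1:
        return M * N * tzr + M * N * e * twr             # eq. (10): no communication
    Tz = M * (N / P) * tzr                               # eq. (7)
    Tw = ceil(M / P) * (twr * N / P + twc) * P * e + ceil(M / P) * twc * P  # eq. (8)
    return Tw + Tz                                       # eq. (9)

def speedup_eq12(P, N, M, e, twr, twc, tzr):
    """Closed form of eq. (12), via the ratios rho1, rho2 of eq. (13)."""
    rho1 = tzr / ((e + 1) * twc)
    rho2 = e * twr / ((e + 1) * twc)
    g = M / ceil(M / P)
    return (rho1 + rho2) * g * P / (P * P / N + rho2 * P + rho1 * g)

# T(1)/T(P) and eq. (12) should agree for any P > 1, divisible or not
N, M, e, twr, tzr, twc = 10**6, 512, 1, 1.0, 5.0, 1000.0
for P in (3, 32, 100, 512, 2000):
    S = runtime(1, N, M, e, twr, twc, tzr) / runtime(P, N, M, e, twr, twc, tzr)
    print(P, round(S, 2), round(speedup_eq12(P, N, M, e, twr, twc, tzr), 2))
```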
The value of these ratios can vary considerably in practice, but it will typically be much smaller than 1 (say, ρ ∈ [10^{-4}, 1]), because communication is much slower than computation in current computer architectures.

5.2 Analysis of the speedup model

We can characterise the speedup S(P) of eq. (12) in the following three cases:

• If M ≥ P and M is divisible by P, then we can write the speedup as follows:

    S(P) = \frac{1}{\frac{1}{P} + \frac{1}{\rho N}} = \frac{P}{1 + \frac{P}{\rho N}} \le P.    (14)

Here, the function S(P) is independent of M and monotonically increasing with P. It would asymptote to lim_{P→∞} S(P) = ρN, but the expression is only valid up to P = M. From (14) we derive the following condition for perfect speedup to occur (in the limit; see footnote 4):

    S ≈ P ⟺ P ≪ ρN.    (15)

This gives an upper bound on the number P of machines that achieves an approximately perfect speedup. Although ρ is quite small in practice, the value of N is very large (typically millions or greater), otherwise there would be no need to distribute the data. Hence, we expect ρN ≫ 1, so P could be quite large. In fact, the limit on how large P can be does not come from this condition (which assumes P ≤ M anyway) but from the number of submodels, as we will see next. In summary, we conclude that if M ≥ P and M is divisible by P then the speedup S(P) is given by (14), and in practice S(P) ≈ P typically.

Footnote 3: This means that our estimated runtime is an upper bound, because when M is not divisible by P, there may be a better way to organise the computation in the W step that reduces the time when any machine is idle. In practice this is irrelevant because we implement the computation asynchronously. Each machine keeps a queue of incoming submodels it needs to process, from which it repeatedly takes one submodel, processes it and sends it to the machine's successor.
• If M ≥ P and M is not divisible by P, then S(P) is given by the full expression (12), which is studied in appendix A. S(P) is piecewise continuous on M intervals of the form

    \left[ 1, \frac{M}{M-1} \right], \left[ \frac{M}{M-1}, \frac{M}{M-2} \right], \ldots, \left[ \frac{M}{2}, M \right], [M, \infty).    (16)

Within each interval P ∈ [M/k, M/(k−1)] for k = 1, 2, 3, ..., M we have ⌈M/P⌉ = k, and we obtain that S(P) either is monotonically increasing, or is monotonically decreasing, or achieves a single maximum at

    P_k^* = \sqrt{\rho_1 M N / k} \qquad S_k^* = S(P_k^*) = \frac{\rho M / k}{\rho_2 + 2 \sqrt{\rho_1 M / (N k)}}.    (17)

The parallelisation ability in this case is less than if M is divisible by P, since now some machines are idle at some times during the W step.

• If M < P, then we can write the speedup as follows:

    S(P) = \frac{\rho}{\frac{\rho_1}{P} + \frac{\rho_2}{M} + \frac{P}{M N}} = \frac{\rho M}{\rho_2 + \frac{\rho_1 M}{P} + \frac{P}{N}}    (18)

which corresponds to the last interval P ∈ [M, ∞) (for k = 1), over which S is continuous. We obtain that S(P) either is monotonically decreasing (if M ≥ P_1^*), or it increases from P = M up to a single maximum at P = P_1^* and then decreases monotonically, with

    P_1^* = \sqrt{\rho_1 M N} \qquad S_1^* = S(P_1^*) = \frac{\rho M}{\rho_2 + 2 \sqrt{\rho_1 M / N}}.    (19)

As P → ∞ we have that S(P) ≈ ρNM/P → 0 (assuming t_c^W > 0, so ρ < ∞). This decrease of the speedup for large P is caused by the communication overhead in the W step, where P − M machines are idle at each tick. In the impractical case where there is no communication cost (t_c^W = 0, so ρ = ∞), S(P) is actually monotonically increasing and lim_{P→∞} S(P) = S_1^* = (ρ/ρ_2) M > M, so the more machines the larger the speedup, although with diminishing returns. Theorem A.1 shows that S(P) at the beginning of each interval is greater than anywhere before that interval, i.e., S(M/k) > S(P) ∀P < M/k, for k = 1, 2, ..., M.
That is, although the speedup S(P) is not necessarily monotonically increasing for P ≥ 1, it is monotonically increasing on the sequence P ∈ {1, M/(M−1), M/(M−2), ..., M/2, M}. This suggests selecting values of P that make M/P an integer, in particular when M is divisible by P.

Footnote 4: Note that if t_c^W = 0 (no communication overhead) then ρ = ∞ and there is no upper bound in (15), but P ≤ N still holds, because we have to have at least one data point per machine.

Globally maximum speedup S* = max_{P≥1} S(P): This is given by (see appendix A):

• If M ≥ ρ_1 N: S* = M/(1 + M/(ρN)) ≤ M, achieved at P = M.
• If M < ρ_1 N: S* = S_1^* = ρM/(ρ_2 + 2√(ρ_1 M/N)) > M, achieved at P = P_1^* = √(ρ_1 M N) > M.

In practice, with large values of N, the more likely case is S* = S_1^* > M for P = P_1^* > M. In this case, the maximum speedup is achieved using more machines than submodels (even though this means some machines will be idle at some times in the W step), and is bigger than M. Since diminishing returns occur as we approach the maximum, the practically best value of P will be somewhat smaller than P_1^*.

The "large dataset" case: The case where N is large is practically important because the need for distributed optimisation arises mainly from it. Specifically, if we take P ≪ ρ_2 N, the speedup becomes (see appendix A):

    if M is divisible by P: S(P) ≈ P;  if M < P: S(P) ≈ \frac{\rho}{\frac{\rho_1}{P} + \frac{\rho_2}{M}}    (20)

so that the speedup is almost perfect up to P = M, and then it is approximately the weighted harmonic mean of M and P (hence, S(P) is monotonically increasing and between M and P). For P ≫ (ρ_1/ρ_2) M, we have S(P) ≈ (ρ/ρ_2) M > M.

The "dominant Z step" case: If we take t_r^Z ≫ t_r^W, t_c^W, or equivalently ρ ≈ ρ_1 very large, which means the Z step dominates the runtime, then S(P) ≈ P. This is because the Z step parallelises perfectly (as long as P < N).
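The closed-form optimum of eq. (19) can be sanity-checked against a brute-force scan of eq. (12). The sketch below (our transcription, using the fig. 4 parameter values; variable names are ours) does exactly that:

```python
from math import ceil, sqrt

def speedup(P, N, M, e, twr, twc, tzr):
    """S(P) of eq. (12), with rho1, rho2 as defined in eq. (13)."""
    rho1 = tzr / ((e + 1) * twc)
    rho2 = e * twr / ((e + 1) * twc)
    g = M / ceil(M / P)
    return (rho1 + rho2) * g * P / (P * P / N + rho2 * P + rho1 * g)

# Fig. 4's parameters: M = 512 < rho1*N = 2500, so the maximum should be
# S*_1 = rho*M / (rho2 + 2*sqrt(rho1*M/N)) > M at P*_1 = sqrt(rho1*M*N) > M.
N, M, e, twr, tzr, twc = 10**6, 512, 1, 1.0, 5.0, 1000.0
rho1 = tzr / ((e + 1) * twc)
rho2 = e * twr / ((e + 1) * twc)
P_star = sqrt(rho1 * M * N)
S_star = (rho1 + rho2) * M / (rho2 + 2 * sqrt(rho1 * M / N))
best_P = max(range(1, 5000), key=lambda P: speedup(P, N, M, e, twr, twc, tzr))
print(round(P_star), round(S_star))   # 1131 556: optimum uses P >> M machines
print(best_P)                         # the brute-force optimum lands nearby
```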
Transformations that keep the speedup invariant: We can rewrite the speedup of eq. (12) as

    S(P) = \frac{ \rho' \frac{M}{\lceil M/P \rceil} P }{ P^2 + \rho'_2 P + \rho'_1 \frac{M}{\lceil M/P \rceil} }    (21)

with

    \rho' = \rho N = (\rho_1 + \rho_2) N = \frac{N (e t_r^W + t_r^Z)}{(e+1) t_c^W} \qquad \rho'_1 = \rho_1 N = \frac{N t_r^Z}{(e+1) t_c^W} \qquad \rho'_2 = \rho_2 N = \frac{N e t_r^W}{(e+1) t_c^W}    (22)

so that S(P) no longer depends explicitly on N, which has been absorbed into the communication–computation ratios. This means that S(P) depends on the dataset size (N) and computation/communication times (t_r^W, t_r^Z, t_c^W) only through ρ′, ρ′_1 and ρ′_2, and is therefore invariant to parameter transformations that leave these ratios unchanged. Such transformations are the following (where α > 0):

• Scaling N, t_r^W and t_r^Z as αN, (1/α) t_r^W and (1/α) t_r^Z: "larger dataset, faster computation," or "smaller dataset, slower computation."
• Scaling N and t_c^W as αN and α t_c^W: "larger dataset, slower communication," or "smaller dataset, faster communication."
• Scaling t_r^W, t_r^Z and t_c^W as α t_r^W, α t_r^Z and α t_c^W: "faster computation, faster communication," or "slower computation, slower communication."

5.3 Discussion and examples

Fig. 4 plots a "typical" speedup curve S(P), obtained with a realistic choice of parameter values. It displays the prototypical speedup shape we should expect in practice (the experimental speedups of fig. 10 confirm this). For P ≤ M the curve is very close to the perfect speedup S(P) = P, slowly deviating from it as P approaches M. For P > M, the curve continues to increase until it reaches its maximum at P = P_1^*, and decreases thereafter.

Fig. 5 plots S(P) for a wider range of parameter settings. We set the dataset size to a practically small value (N = 50 000), otherwise the curves tend to look like the typical curve from fig. 4.
The parameter settings are representative of different potential practical situations (some more likely than others). We note the following observations:

• Again, the most important observation is that the number of submodels M is the parameter with the most direct effect on the speedup: near-perfect speedups (S ≈ P) occur if M ≥ P; otherwise the speedups are between M and P (and eventually saturate if P ≫ M).

• When the time spent on communication is large in relative terms, the speedup is decreased. This can happen when the runtime of the Z step is low (small t_r^Z), when the communication cost is large (large t_c^W), or with many epochs (large e). Indeed, since the Z step is perfectly parallelisable, any decrease of the speedup should come from the W step.

• Some of the curves display noticeable discontinuities, caused by the function (M/P)/⌈M/P⌉, occurring at values of P of the form M/k for k ∈ {1, ..., M}. At each such value, S(P) is greater than for any smaller value of P, in accordance with theorem A.1. This again suggests selecting values of P that make M/P an integer, in particular when M is a multiple of P (P, 2P, 3P, ...). This achieves the best speedup efficiency in that machines are never idle (in our theoretical model). Also, for fixed P, the function (M/P)/⌈M/P⌉ can take the same value for different values of M (e.g. M = 32 and M = 64 for P = 60). This explains why some curves (for different M) partly overlap.

• The maximum speedup is typically larger than M and occurs for P > M. It is possible for the maximum speedup to be smaller than M (in which case it occurs at P = M); an example appears in fig. 5, row 3, column 2 (for the larger M values). But this happens only when M ≥ ρ_1 N, which requires an unusually small dataset and an impractically large number of submodels.
Generally, we should expect P > M to be beneficial. This is also seen experimentally in fig. 10.

Figure 4: Typical form of the theoretical speedup curve for realistic parameter settings, specifically N = 10^6 data points, M = 512 submodels, e = 1 epoch in the W step, and t_r^W = 1 (this sets the units of time), t_r^Z = 5 and t_c^W = 10^3 (so ρ_1 = 0.0025, ρ_2 = 0.0005 and ρ = 0.003). The plot annotations mark the two regimes: for P ≤ M, S(P) = P/(1 + P/(ρN)) ≈ P; for P > M, S(P) = ρ/(ρ_1/P + ρ_2/M + P/(MN)) ≈ ρ/(ρ_1/P + ρ_2/M). Some of the discontinuities of the curve (where ⌈M/P⌉ is discontinuous) are visible. We mark the values of P such that M is divisible by P (◦) and the maximum speedup (∗), which occurs at P = P_1^* > M.

[Figure 5 is a 4 × 2 grid of plots: columns e = 1 and e = 8 epochs; rows (t_c^W, t_r^Z) = (1, 1), (1, 100), (100, 1), (1000, 100); each plot shows the speedup S(P) vs the number of machines P.]

Figure 5: Theoretical speedup S(P) as a function of the number of machines P for various settings of the
parameters of ParMAC with a binary autoencoder. The parameters are: dataset size N = 50 000 training points; number of submodels M ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256, 512}; number of epochs in the W step e ∈ {1, 8}; W step computation time (per submodel and data point) t^W_r = 1 (this sets the units of time); W step communication time (per submodel) t^W_c ∈ {1, 100, 1 000}; Z step computation time (per submodel and data point) t^Z_r ∈ {1, 100}. Within each plot, each curve corresponds to one value of M, indicated on the right end of the plot.

5.4 Practical considerations

In practice, given a specific problem (with a known number of submodels M, epochs e and dataset size N), our theoretical speedup curves can be used to determine optimal values for the number of machines P to use. As seen in section 8.3, the theoretical curves agree quite well with the experimentally measured speedups. The theoretical curves do need estimates of the computation and communication times of the W and Z steps. These are hard to obtain a priori; the computational complexity of the algorithm in O-notation ignores constant factors that significantly affect the actual times. Their estimates should be measured from test runs. As seen from eq. (21), we can leave the speedup unchanged by trading off dataset size (N) and computation/communication times (t^W_r, t^Z_r, t^W_c) in various ways, as long as one of the three following holds: the products N·t^W_r and N·t^Z_r remain constant; or the quotient N/t^W_c remains constant; or the quotients t^W_r/t^W_c and t^Z_r/t^W_c remain constant. Theoretically, the most efficient operating points for P are values such that M is divisible by P, because this means no machine is ever idle.
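As an illustration of this divisibility rule, the idle-machine effect is captured by the factor (M/P)/⌈M/P⌉ discussed above; the following is a minimal Python sketch (ours, not the paper's C/C++ code):

```python
import math

def w_step_efficiency(M, P):
    """Fraction of useful machine-time in the W step: (M/P)/ceil(M/P).
    Equals 1 exactly when P divides M, i.e. no machine is ever idle."""
    return (M / P) / math.ceil(M / P)

# P = 128 divides M = 512: perfect efficiency; P = 100 does not.
assert w_step_efficiency(512, 128) == 1.0
assert w_step_efficiency(512, 100) < 1.0
# For fixed P, different M can give the same value (M = 32 and M = 64 at
# P = 60), which is why some curves in fig. 5 partly overlap.
assert w_step_efficiency(32, 60) == w_step_efficiency(64, 60)
```

This also makes the discontinuities of S(P) at P = M/k easy to see numerically: the efficiency jumps back to 1 at each such value of P.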
In practice, with an asynchronous implementation and with t^Z_r, t^W_r and t^W_c exhibiting some variation over submodels and data points, this is not true anymore. Still, if in a given application one is constrained to using P ≤ M machines, choosing P close to a divisor of M would probably be preferable.

One assumption in our speedup model is that the P machines are identical in processing power. The model does extend to the case where the machines are different, as noted in our discussion of load balancing (section 4.3). This is because the work performed by each machine is proportional to the number of data points it contains: in the W step, because every submodel runs (implicitly) SGD, and every submodel must visit each machine; in the Z step, because each data point is a separate problem, and involves all submodels (which reside in each machine). Hence, we can equalise the work over machines by loading each machine with an amount of data proportional to its processing speed, independently of the number of submodels.

Another assumption in our model (in the W step) is that all M submodels are identical in "size" (computation and communication time). This is sometimes not true. For example, in the BA, we have submodels of two types: the L encoders (each a binary linear SVM operating on a D-dimensional input) and the D decoders (each a linear regressor operating on an L-dimensional input). Since D > L, the encoders are bigger than the decoders and take longer to train and communicate. We can still apply our speedup model if we "group" smaller submodels into a single submodel of size comparable to the larger submodels, so as to equalise the submodel sizes as much as possible (the actual implementation does not need to group submodels, of course).
For the BA, under the reasonable assumption that the ratio of computation times t^W_r (and communication times t^W_c) of decoder vs encoder is L/D < 1, we can group the D decoders into L groups of D/L decoders each. Each group of decoders now has a computation and communication time equal to that of one encoder. This gives an effective number of independent submodels M = 2L, and this is what we use when applying the model to the experimental speedups in section 8.3.

Finally, we emphasise that the goal of this section was to characterise the parallel speedup of ParMAC quantitatively and demonstrate the runtime gains that are achievable by using P machines. In practice, other considerations are also important, such as the economic cost of using P machines (which may limit the maximum P available); the type of machines (obviously, we want all the computation and communication times to be as small as possible); the choice of optimisation algorithm in the W and Z steps; the fact that, because of its size, the dataset may need to be, or already is, distributed across P machines; etc. It is also possible to combine ParMAC with other, orthogonal techniques. For example, if each of the submodels in the W step is a convex optimisation problem (as is the case with the linear SVMs in the binary autoencoder), we could apply the techniques described in section 2 for distributed convex optimisation to each submodel. This would effectively allow for larger speedups when P > M.

6 Convergence of ParMAC

The only approximation that ParMAC makes to the original MAC algorithm is using SGD in the W step. Since we can guarantee convergence of SGD under certain conditions, we can recover the original convergence guarantees for MAC. Let us see this in more detail.
Convergence of MAC to a stationary point is given by theorem B.3 in Carreira-Perpiñán and Wang (2012), which we quote here:

Theorem 6.1. Consider the constrained problem of eq. (5) and its quadratic-penalty function E_Q(W, Z; µ) of eq. (6). Given a positive increasing sequence (µ^k) → ∞, a nonnegative sequence (τ^k) → 0, and a starting point (W^0, Z^0), suppose the quadratic-penalty method finds an approximate minimiser (W^k, Z^k) of E_Q(W^k, Z^k; µ^k) that satisfies ‖∇_{W,Z} E_Q(W^k, Z^k; µ^k)‖ ≤ τ^k for k = 1, 2, ... Then, lim_{k→∞} (W^k, Z^k) = (W*, Z*), which is a KKT point for the problem (5), and its Lagrange multiplier vector has elements λ*_n = lim_{k→∞} −µ^k (Z^k_n − F(Z^k_n, W^k; x_n)), n = 1, ..., N.

This theorem applies to the general case of K differentiable layers, where the standard Karush-Kuhn-Tucker (KKT) conditions hold (Nocedal and Wright, 2006). It relies on a standard condition for penalty methods for nonconvex problems (Nocedal and Wright, 2006), namely that we must be able to reduce the gradient of the penalised function E_Q below an arbitrary tolerance τ^k ≥ 0 for each value µ^k of the penalty parameter (in MAC iterations k = 1, 2, ...). This can be achieved by running a suitable (unconstrained) optimisation method for sufficiently many iterations.

How does this change in the case of ParMAC? The Z step remains unchanged with respect to MAC (the fact that the optimisation is distributed is irrelevant since the N subproblems over z_1, ..., z_N are independent). The W step does change, because we are now obliged to use distributed, stochastic training. What we need to ensure is that we can reduce the gradient of the penalised function with respect to each submodel (since they are independent subproblems in the W step) below an arbitrary tolerance.
This can also be guaranteed under standard conditions. In general, we can use convergence conditions from stochastic optimisation (Benveniste et al., 1990; Kushner and Yin, 2003; Pflug, 1996; Spall, 2003; Bertsekas and Tsitsiklis, 2000). Essentially, these are Robbins-Monro schedules, which require the learning rate η_t of SGD to decrease such that lim_{t→∞} η_t = 0, Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t² < ∞, where t is the epoch number (SGD iteration, or pass over the entire dataset^5). We can give much tighter conditions on the convergence and the convergence rate when the subproblems in the W step are convex (which is often the case, as with logistic or linear regression, linear SVMs, etc.). This is a topic that has received much attention recently (see section 2), and many such conditions exist, often based on techniques such as Nesterov accelerated algorithms and stochastic average gradient (Cevher et al., 2014). They typically bound the distance to the minimum in objective function value as O(1/t^α) or O(β^t), where the coefficients α > 0, 0 < β < 1 and the constant factors in the O-notation depend on the (strong) convexity properties of the problem, Lipschitz constant, etc.

In summary, convergence of ParMAC to a stationary point is guaranteed by the same theorem as MAC, with an added SGD-type condition for the W step. This convergence guarantee is independent of the number of layers and submodels (since they are independent in the W step) and of the number of machines P (since effectively we are doing SGD on shuffled datasets of size N, even if they are partitioned into portions of size N/P). We can also guarantee ParMAC's convergence with only the original MAC theorem, without SGD-type conditions, while still in the distributed setting and achieving significant parallelism.
This can be done by computing the gradient in the W step exactly (as MAC assumes). First, each machine p = 1, ..., P computes the exact sum of per-point gradients for each submodel (by summing over its data portion), in parallel. Then we aggregate these P partial gradients into one exact gradient, for each submodel. This could be done via a parameter server, or by having each machine act as the parameter server for one submodel, and could easily be implemented with MPI functions. However, as is well known, this is far slower than using SGD.

With nondifferentiable layers, the convergence properties of MAC (and ParMAC) are not well understood. In particular, for the binary autoencoder the encoding layer is discrete and the problem is NP-complete. But, again, the only modification of ParMAC over MAC is the fact that the encoder and decoder are trained with SGD in the W step, whose convergence tolerance can be achieved with SGD-type conditions. Indeed, our experiments show ParMAC gives almost identical results to MAC.

While convergence guarantees are important theoretically, in practical applications with large datasets in a distributed setting one typically runs SGD for just a few epochs, even one or less than one (i.e., we stop

^5 Note that it is not necessary to assume that the points (or minibatches) are sampled at random during updates. Various results exist that guarantee convergence with deterministic errors (e.g. Bertsekas and Tsitsiklis, 2000 and references therein), rather than stochastic errors. These results assume a bound on the deterministic errors (rather than a bound on the variance of the stochastic errors), and apply to general, nonconvex objective functions with standard conditions (Lipschitz continuity, Robbins-Monro schedules for the step size, etc.).
They apply as a particular case to the "incremental gradient" method, where we cycle through the data points in a fixed sequence.

SGD before passing through all the data). This typically reduces the objective function to a good enough value as fast as possible, since each pass over the data is very costly. In our experiments, one to two epochs in the W step make ParMAC very similar to MAC using an exact step.

7 Implementation of ParMAC for binary autoencoders

We have implemented ParMAC for binary autoencoders in C/C++ using the GNU Scientific Library (GSL) (http://www.gnu.org/s/gsl) and the Basic Linear Algebra Subroutines (BLAS) library (http://www.netlib.org) for mathematical operations and linear algebra, and the Message Passing Interface (MPI) (Gropp et al., 1999a,b; Message Passing Interface Forum, 2012) for interprocess communication. GSL and BLAS provide a wide range of mathematical routines such as basic matrix operations, various matrix decompositions and least-squares fitting. We used the versions of GSL and BLAS that come with our Linux distribution (Ubuntu 14.04). Considerably better performance could be achieved by using LAPACK and an optimised version of BLAS (such as ATLAS, or as provided by a computer vendor for their specific architecture). MPI is one of the most widely used frameworks for high-performance parallel computing today, and is the best option for ParMAC because of its support for distributed-memory machines and the SPMD (single program, multiple data) model, its language independence, and its availability on multiple machines, from small shared-memory multiprocessor machines to hybrid clusters. In MPI, different processes cannot directly access each other's memory space, but data can be transferred by sending messages from one process to another, or collectively among multiple processes.
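As a toy illustration of such collective aggregation (and of the exact W-step gradient variant discussed in section 6), the following Python sketch, ours rather than the paper's implementation, sums per-point gradients over each machine's data portion and then aggregates the P partial gradients, which is what an MPI reduction would compute:

```python
# Exact W-step gradient for a scalar least-squares submodel, computed as
# P per-machine partial sums that are then aggregated (cf. MPI_Allreduce).
def exact_gradient(w, data, P):
    # gradient of 0.5 * sum_n (w*x_n - y_n)^2 with respect to scalar w
    chunks = [data[p::P] for p in range(P)]        # partition over P "machines"
    partials = [sum((w * x - y) * x for x, y in chunk) for chunk in chunks]
    return sum(partials)                           # aggregate partial gradients

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]
g_serial = sum((1.0 * x - y) * x for x, y in data)
assert exact_gradient(1.0, data, 2) == g_serial    # same result as one machine
```

The aggregated gradient is identical to the single-machine one regardless of the partition, which is why this variant inherits MAC's convergence theorem directly; its drawback, as noted above, is that it is far slower than SGD.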
The SPMD model, very useful in distributed machine learning, means that all processes share the same code (and executable file), and each of them can operate on different data, with flow control using its individual process id. MPI is an industry standard for which there are many implementations, such as MPICH or OpenMPI, mostly compatible with each other. We used MPICH on our UC Merced shared-memory cluster and OpenMPI on the UCSD TSCC distributed cluster (see section 8.1). Our ParMAC C++ code compiles and runs with both implementations. We used the highest compiler optimisation level, specifically we ran mpicc -O3 -lgsl -lgslcblas -lm. This calls the GNU C compiler with option -O3, which turns on all the available code optimisation flags. It results in a longer compilation time but more efficient code.

The code snippet in figure 6 shows the main steps of the ParMAC algorithm for the BA. All the functions starting with MPI_ are API calls from the MPI library. As with all MPI programs, we start the code by initialising the MPI environment and end by finalising it. To receive data we use the synchronous^6, blocking MPI receive function MPI_Recv. The process calling this blocks until the data arrives. To send data we use the buffered blocking version of the MPI send functions, MPI_Bsend. This requires that we allocate enough memory and attach it to the system in advance. The process calling MPI_Bsend blocks until the buffer is copied to the MPI internal memory; after that, the MPI library takes care of sending the data appropriately. The benefit of using this version of send is that the programmer can send messages without worrying about where they are buffered, so the code is simpler. Appendix B briefly describes important MPI functions and their arguments.
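Before looking at the MPI code, the circular topology of the W step can be mimicked with a small non-MPI Python simulation (ours; the machine counts and data sizes below are made up), which makes the bookkeeping easy to check: after e epochs every submodel has visited every machine e times and has therefore processed all N points e times, however the data is partitioned:

```python
def simulate_w_step(P, M, data_per_machine, epochs):
    """Circulate M submodels around a ring of P machines; each hop 'trains'
    the submodel on the data portion held by its current machine."""
    location = [m % P for m in range(M)]   # spread submodels over the ring
    points_seen = [0] * M
    for _ in range(epochs):
        for _ in range(P):                 # one full trip around the ring
            for m in range(M):
                points_seen[m] += data_per_machine[location[m]]  # SGD pass
                location[m] = (location[m] + 1) % P              # to successor
    return points_seen

# 4 machines with unequal data portions (N = 100), 3 submodels, 2 epochs:
# every submodel sees all N points exactly twice.
assert simulate_w_step(4, 3, [10, 20, 30, 40], 2) == [200, 200, 200]
```

Only the submodel "travels"; the data portions never move, which is the property that keeps ParMAC's communication cost independent of N.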
^6 Note that the word "synchronous" here does not refer to how we process the different submodels, which as we stated earlier are not synchronised to start or end at specific clock ticks, hence are processed asynchronously with respect to each other. The word "synchronous" here refers to MPI's handling of an individual receive function (see appendix B). This can be done either by calling MPI_Recv, which will block until the data is received (synchronous blocking function), as in the pseudocode in fig. 6; or by calling MPI_Irecv (a synchronous nonblocking function) followed by MPI_Wait, which will block until the data is received, like this:

MPI_Irecv(receivebuffer, commbuffsize, MPI_CHAR, MPI_ANY_SOURCE, MODEL_MSG_TAG, MPI_COMM_WORLD, &recvRequest);
MPI_Wait(&recvRequest, &recvStatus);

Both options are equivalent for our purpose, which is to ensure we receive the submodel before starting to train it. The MPI_Irecv/MPI_Wait option is slightly more flexible in that it would allow us to do some additional processing between MPI_Irecv and MPI_Wait and possibly achieve some performance gain.

MPI_Init(&argc, &argv);                    // initialise the MPI execution environment
MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);   // get the rank of the calling MPI process
MPI_Comm_size(MPI_COMM_WORLD, &mpisize);   // get the total number of MPI processes
loadsettings();                            // load parameters (µ, epochs, dataset path, etc.)
loaddatasets();                            // load input and output datasets and initial auxiliary coordinates
initialize_layers();                       // allocate memory and initialise f, h and Z steps
// we use MPI_Bsend to avoid managing send buffers, so we need to allocate the required
// amount of buffer space into which data can be copied until it is delivered
MPI_Pack_size(commbuffsize, MPI_CHAR, MPI_COMM_WORLD, &mpi_attach_buff_size);
// allocate enough memory so it can store the whole model
mpi_attach_buff = malloc(totalsubmodelcount*(mpi_attach_buff_size+MPI_BSEND_OVERHEAD));
MPI_Buffer_attach(mpi_attach_buff, mpi_attach_buff_size);   // attach the allocated buffer
for (iter=1 to length(µ)) {                // iterate over all the values of µ
  // begin W step
  visitedsubmodels = 0;                    // each process visits all the submodels, epochs + 1 times
  while (visitedsubmodels <= totalsubmodelcount*epochs) {
    // stepcounter is a number that each submodel carries and increases by one in each step.
    // Once it reaches a certain value we stop sending the submodel around and it stops.
    // We reset stepcounter for all the submodels in the beginning of each W step.
    if (stepcounter > 0) {
      // if this is not the first submodel to train in the iteration, we wait to receive
      // MPI_Recv blocks until the requested data is available in the application buffer in the receiving task
      MPI_Recv(receivebuffer, commbuffsize, MPI_CHAR, MPI_ANY_SOURCE, MODEL_MSG_TAG, MPI_COMM_WORLD, &recvStatus);
      savesubmodel(receivebuffer);         // save the received buffer into a suitable struct
    }
    if (stepcounter < epochs*mpisize) {    // we don't train the submodels in the last update round
      switch (submodeltype) {              // train each submodel according to its type
        case 'SVM':      HtrainSGD();
        case 'linlayer': FtrainSGD();
      }
    }
    if (stepcounter < (ringepochs+1)*mpisize) {   // we still need to send this submodel around
      // the lookup table is created randomly and stores the path of each submodel over epochs and iterations
      successor = next_in_lookup_table();  // pick the successor process from the lookup table
      loadsubmodel(sendbuffer);            // load the submodel from its struct into the send buffer
      // MPI_Bsend returns after the data has been copied from application buffer space to the allocated send buffer
      MPI_Bsend(sendbuffer, taskbufsize*sizeof(double), MPI_CHAR, successor, MODEL_MSG_TAG, MPI_COMM_WORLD);
    }
    visitedsubmodels++;
  }
  // end W step
  // begin Z step
  updateZ_relaxed();                       // initialise auxiliary coordinates based on a truncated, relaxed solution
  updateZ_alternate();                     // update auxiliary coordinates by alternating optimisation over bits
  // end Z step
}
MPI_Buffer_detach(&mpi_attach_buff, &mpi_attach_buff_size);   // detach the allocated buffer
free(mpi_attach_buff);                     // free the allocated memory
MPI_Finalize();                            // terminate the MPI execution environment

Figure 6: Binary autoencoder ParMAC algorithm (fragment), showing important MPI calls.

Table 1: Detailed hardware specification of the two machines used in our experiments.
                       Distributed-memory (TSCC at UCSD)   Shared-memory (cluster at UC Merced)
CPU                    Intel(R) Xeon(R) CPU E5-2670 0      Intel(R) Xeon(R) CPU E5-2699 v3
CPU cache              20 MB                               45 MB
CPU max frequency      3.3 GHz                             3.6 GHz
Cores/threads          8/16                                8/16
Memory types           DDR3 800/1066/1333/1600             DDR4 1600/1866/2133
RAM bandwidth          51.2 GB/s                           68 GB/s
Processor connection   10GbE                               shared memory

8 Experiments

8.1 Setup

Computing systems We used two different computing systems, to which we will refer as distributed and shared-memory:

Distributed-memory This used General Computing Nodes from the UCSD Triton Shared Computing Cluster (TSCC), available to the public for a fee. Each node contains two 8-core Intel Xeon E5-2670 processors (16 cores in total), 64GB DRAM (4GB/core) and a 500GB hard drive. The nodes are connected through a 10 GbE network. We used up to P = 128 processors. Detailed specs are in table 1 (obtained by running dmidecode on the actual processor) and at http://idi.ucsd.edu/computing.

Shared-memory This is a 72-processor machine (36 physical cores with hyperthreading) with 256GB RAM located at UC Merced. The processors communicate through shared memory. We used this only for the large-scale experiment, and we used 64 of the 72 processors. Detailed specs are in table 1 (obtained by running dmidecode on the actual processor).

In both systems, the interprocess communication is handled by MPI (OpenMPI on the TSCC cluster and MPICH on our shared-memory cluster). The shared-memory system has both faster processors and faster communication than the distributed one, and this is seen in our experiments (3-4 times faster). This does not imply that shared-memory systems are necessarily superior in practice; it simply reflects characteristics of the equipment we had access to. The ParMAC speedups as a function of the number of processors are comparable in both systems.
Datasets We have used 4 datasets commonly used as image retrieval benchmarks. (1) CIFAR (Krizhevsky, 2009) contains 60 000 32×32 colour images in 10 object classes. We ignore the labels in this paper and use N = 50 000 images as training set and 10 000 as test set. We extract D = 320 GIST features (Oliva and Torralba, 2001) from each image. (2) SIFT-10K (Jégou et al., 2011a) contains N = 10 000 training high-resolution colour images and 100 test images, each represented by D = 128 SIFT features. (3) SIFT-1M (Jégou et al., 2011a) contains N = 10^6 training and 10^4 test images. (4) SIFT-1B (Jégou et al., 2011b; http://corpus-texmex.irisa.fr) has three subsets: 10^9 base vectors where the search is performed, N = 10^8 learning vectors used to train the model, and 10^4 query vectors.

Performance measures Regarding the quality of the BA and hash functions learnt, we report the following. (1) The binary autoencoder error E_BA(h, f), which we want to minimise, eq. (1). (2) The quadratic-penalty function E_Q(h, f, Z; µ), which we actually minimise for each value of µ, eq. (3). (3) The retrieval precision (%) in the test set, using as true neighbours the K nearest images in Euclidean distance in the original space, and as retrieved neighbours the k nearest images in Hamming distance in the binary space. We set (K, k) = (1 000, 100) for CIFAR, (100, 100) for SIFT-10K and (10 000, 10 000) for SIFT-1M. For SIFT-1B, as suggested by the dataset creators, we report the recall@R: the average number of queries for which the nearest neighbour is ranked within the top R positions (for varying values of R); in case of tied distances, we place the query as top rank. All these measures are computed offline once the BA is trained.
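For concreteness, the precision measure can be sketched as follows (an illustrative Python fragment of ours, not the evaluation code used in the paper): for each query we count what fraction of the k retrieved Hamming neighbours fall among the K true Euclidean neighbours, and average over queries.

```python
# Retrieval precision (%): average over queries of the fraction of the
# k retrieved neighbours (Hamming space) that are true neighbours
# (K nearest in Euclidean distance in the original space).
def retrieval_precision(true_knn, retrieved_knn):
    """true_knn[q]: set of indices of the K Euclidean nearest neighbours.
    retrieved_knn[q]: list of indices of the k Hamming nearest neighbours."""
    total = 0.0
    for q in range(len(true_knn)):
        hits = sum(1 for i in retrieved_knn[q] if i in true_knn[q])
        total += hits / len(retrieved_knn[q])
    return 100.0 * total / len(true_knn)

# One query, K = 4 true neighbours, k = 4 retrieved, 2 of them correct: 50%.
assert retrieval_precision([{1, 2, 3, 4}], [[1, 2, 9, 10]]) == 50.0
```

Recall@R for SIFT-1B is analogous but per query checks only whether the single true nearest neighbour appears within the top R retrieved positions.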
Models and their parameters We use BAs with linear encoders (linear SVMs) except with SIFT-1B, where we also use kernel SVMs. The decoder is always linear. We set L = 16 bits (hash functions) for CIFAR, SIFT-10K and SIFT-1M, and L = 64 bits for SIFT-1B. The Z step uses enumeration for SIFT-10K and SIFT-1M, and alternating optimisation (initialised by a truncated relaxed solution) otherwise. We initialise the binary codes from truncated PCA run on a subset of the training set (small enough that it fits in one machine). To train the encoder (L SVMs) and decoder (D linear mappings) with stochastic optimisation, we used the SGD code from Bottou and Bousquet (2008) (http://leon.bottou.org/projects/sgd), using its default parameter settings. The SGD step size is tuned automatically in each iteration by examining the first 1 000 data points. We use a multiplicative µ schedule µ_i = µ_0·a^i, where the initial value µ_0 and the factor a > 1 are tuned offline in a trial run using a small subset of the data. For CIFAR we use µ_0 = 0.005 and a = 1.2 over 26 iterations (i = 0, ..., 25). For SIFT-10K and SIFT-1M we use µ_0 = 10^-6 and a = 2 over 20 iterations. For SIFT-1B we use µ_0 = 10^-4 and a = 2 over 10 iterations.

8.2 Effect of stochastic steps in the W step

Figures 7 to 9 show the effect on SIFT-10K and CIFAR of varying the number of epochs and of shuffling the data, as a function of the number of machines P, on the learning curves (errors E_Q and E_BA, precision). As the number of epochs increases, the W step is solved more exactly (8 epochs is practically exact in these datasets). Fewer epochs, even just one, cause only a small degradation. The reason is that, although these are relatively small datasets, they contain sufficient redundancy that few epochs suffice to decrease the error considerably.
This is also helped by the accumulated effect of epochs over MAC iterations. Running more epochs increases the runtime and lowers the parallel speedup in this particular model, because we use few bits (L = 16) and therefore few submodels (M = 2L = 32) compared to the number of machines (up to P = 128), so the W step has less parallelism.

Fig. 9 shows the positive effect of data shuffling in the W step. To shuffle the minibatches, the successor of a machine is given by a random lookup table. Shuffling generally reduces the error (this is particularly clear in E_Q, which is what we actually minimise) and increases the precision, with no increase in runtime. Note that, even if we keep the circular topology fixed throughout the W step (i.e., we do not randomise the topology at each epoch), there is still a small amount of shuffling. This occurs because, although all submodels process the data minibatches in the same "direction", submodels in different machines start at different minibatches. Were it not for this (and if no shuffling was done within-machine), ParMAC would give an identical result no matter the number of machines. However, as seen from figures 7 to 9, it seems that this small intrinsic shuffling simply randomises the learning curves, but does not make them better than the learning curve for one machine.

8.3 Speedup

The fundamental advantage of ParMAC, and of distributed optimisation in general, is the ability to train on datasets that do not fit in a single machine, and the reduction in runtime because of parallel processing. Fig. 10 (top row) shows the strong scaling^7 speedups achieved experimentally, as a function of the number of machines P for fixed problem size (dataset and model), in CIFAR and SIFT-1M (N = 50K and 1M training points, respectively).
Even though these datasets, and especially the number of independent submodels (M = 2L = 32 effective submodels of the same size, as discussed in section 5.4), are relatively small, the speedups we achieve are nearly perfect for P ≤ M and hold very well for P > M, up to the maximum number of machines we used (P = 128 in the distributed system). The speedups flatten as the number of epochs (and consequently the amount of communication) increases, because for this experiment the bottleneck is the W step, whose parallelisation ability (i.e., the number of concurrent processes) is limited by M = 2L (the Z step has N independent processes and is never a bottleneck, since N is very large). However, as noted earlier, using 1 to 2 epochs gives a good enough result, very close to doing an exact W step. The

^7 In "strong scaling", the total problem size is fixed and the problem size on each machine is inversely proportional to the number of machines P. In "weak scaling", the problem size on each machine is fixed, so the total problem size is proportional to P. High speedups are easier to obtain in weak scaling than in strong scaling (Goedecker and Hoisie, 2001).
Figure 7: SIFT-10K dataset. Left two columns: single machine (P = 1) and different number of epochs e in the W step. Right two columns: fixed number of epochs (either 1 or 8) but different number of machines P. (Panels show E_Q, E_BA and precision vs. iteration and vs. time.)

runtime for SIFT-1M on P = 128 machines with 8 epochs was 12 minutes and its speedup 100×. This is particularly remarkable given that the original, nested model did not have model parallelism. These speedups are vastly larger than those achieved by earlier large-scale nonconvex optimisation approaches such as Google's DistBelief system (Le et al., 2012; Dean et al., 2012), although admittedly the deep nets trained there were far larger than our BAs. Fig. 10 (bottom) shows the speedups predicted by our theoretical model of section 5.
We set the parameters e and N to their known values, and M = 2L = 32 for CIFAR and SIFT-1M and M = 2L = 128 for SIFT-1B (effective number of independent equal-size submodels). For the time parameters, we set t^W_r = 1 to fix the time units, and we set t^W_c and t^Z_r by trial and error to achieve a reasonably good fit to the experimental speedups. Specifically, we set t^W_c = 10^4 for both datasets, and t^Z_r = 200 for CIFAR and 40 for SIFT-1M. Although these are fudge factors, they are in rough agreement with the fact that communicating a weight vector over the network is orders of magnitude slower than updating it with a gradient step, and that the Z step is quite a bit slower than the W step because of the binary optimisation it involves. Fig. 10 (bottom, right plot) also shows the theoretical prediction for the SIFT-1B dataset (N = 10^8, M = 128), using the same parameters as in SIFT-1M (again assuming the distributed-memory system): t^W_c = 10^4 and t^Z_r = 40. Since here M is quite a bit larger and N is much larger, the speedup is nearly perfect over a very wide range (note the plot goes up to P = 1 024 machines, even though our experiments are limited to P = 128).

8.4 Large-scale experiment: SIFT-1B dataset

SIFT-1B is one of the largest datasets, if not the largest one, that are publicly available for comparing nearest-neighbour search algorithms with known ground-truth (i.e., precomputed exact Euclidean distances
for each query to its k nearest vectors in the base set).

Figure 8: CIFAR dataset. Left two columns: single machine (P = 1) and different number of epochs e in the W step. Right two columns: fixed number of epochs (either 2 or 8) but different number of machines P. Panels show the error-iteration and error-time views of E_Q, E_BA and the precision.

The training set contains N = 100M vectors, each consisting of 128 SIFT features. Handling the SIFT-1B dataset required special care because of its size and the limited amount of memory (total 512GB for 128 processors in the distributed system and 256GB for 64 processors in the shared-memory one).
Each vector has 128 SIFT features, and each feature in the original dataset is stored in a single byte rather than as a double-precision float (8 bytes), as in our other experiments, totalling 12.8GB for the training set if using a linear hash function and 200GB if using an RBF one (see below). Rather than converting it to floats, which would exceed 1TB, we modified our code to convert each feature only as needed. In the Z step each data point is processed independently and the conversion to double is done one point at a time. In the W step it is done one minibatch at a time. It is of course possible to use hard disk as additional storage, but this would slow down training. The auxiliary coordinates, which must be stored in MAC algorithms, take only 6.25% of the memory of the data (64 bits per data point compared to 128 bytes). We used L = 64 bits (hash functions). As hash function, we trained a linear SVM as before, and a kernel SVM using m Gaussian radial basis functions (RBF) with fixed bandwidth σ and centres. This means the only trainable parameters are the weights, so the MAC algorithm does not change except that it operates on an m-dimensional input vector of kernel values (stored as one byte each), instead of the 128 SIFT features. The Gaussian kernel values are in (0, 1] but, as before, to save memory we store them as an unsigned one-byte integer value in [0, 255]. We used m = 2000 centres, the maximum we could fit in memory, picked at random from the training set. In trials with a subset of the training set, we set σ = 160. This worked well and was wide enough to ensure that, with our limited one-byte precision, no data point would produce m zeros as kernel values. In trials on a subset of the training set, we also set the number of epochs to e = 2 with shuffling (we observed no improvement from using more epochs, which is understandable given the size of the dataset).
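The memory figures above are easy to verify. The following sketch (variable names are ours) reproduces the arithmetic with the sizes stated in the text:

```python
# Memory arithmetic for the SIFT-1B experiment: 1 byte per feature or
# kernel value, 64-bit binary codes as auxiliary coordinates.
N = 10**8   # training vectors
D = 128     # SIFT features per vector
m = 2000    # RBF centres (kernel values per vector)
L = 64      # bits per binary code
GB = 10**9

train_linear = N * D / GB           # 12.8 GB of uint8 features
train_rbf = N * m / GB              # 200 GB of uint8 kernel values
train_rbf_double = N * m * 8 / GB   # 1600 GB if converted to doubles (> 1TB)
aux_coords = N * L / 8 / GB         # 0.8 GB of binary codes
ratio = aux_coords / train_linear   # 6.25% of the data's memory
```

This confirms why the features are kept as single bytes and converted on the fly, and why the auxiliary coordinates add only a small fraction of the data's footprint.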
Figure 9: Like fig. 8 but with and without minibatch shuffling in the W step (solid and dashed lines, resp.).

We initialised the binary codes from truncated PCA trained on a subset of size 1M (recall@R=100: 55.2%), which gave results comparable to the baseline in Jégou et al. (2011b) if using 8 bytes per indexed vector (without postprocessing by reranking as done in that paper). We ran ParMAC on the whole training set in the distributed system with 128 processors for 6 iterations and in the shared-memory one with 64 processors for 10 iterations.
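For reference, recall@R (the retrieval measure used here and in Jégou et al., 2011b) is the fraction of queries whose ground-truth nearest neighbour appears among the R database points whose binary codes are closest in Hamming distance to the query's code. A minimal sketch (function and variable names are ours, not the paper's code):

```python
def recall_at_R(query_codes, base_codes, true_nn, R):
    """Fraction of queries whose ground-truth nearest neighbour
    (true_nn[i], an index into base_codes) is among the R base codes
    with smallest Hamming distance to query i's code.
    Codes are plain ints interpreted as bit patterns."""
    hits = 0
    for i, q in enumerate(query_codes):
        # Hamming distance = popcount of the XOR of the two codes.
        dists = [bin(q ^ b).count("1") for b in base_codes]
        # Rank base points by distance (index breaks ties deterministically).
        ranked = sorted(range(len(base_codes)), key=lambda j: (dists[j], j))
        if true_nn[i] in ranked[:R]:
            hits += 1
    return hits / len(query_codes)
```

A production implementation would use precomputed popcount tables or hardware popcount over the 64-bit codes, but the measure itself is just this counting.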
The results are given in the following table and figures 11-12:

  Hash function (encoder)   Recall@R=100   Time (hours), distrib.   Time (hours), shared
  linear SVM                61.5%          29.30                    11.04
  kernel SVM                66.1%          83.44                    32.19

The learning curves (fig. 11) are essentially identical over both systems. The nonlinear RBF hash function outperforms the linear one in recall, as one would expect. The improvement occurs across the whole range of recall values R (fig. 12). Note that the error in the nested model, E_BA, does not decrease monotonically. This is because MAC optimises instead the penalised function E_Q, in an effort to minimise E_BA as μ increases. Based on our previous results, the small number of epochs and the larger number of submodels in the W step, we expect nearly perfect speedups. We cannot compute the actual speedup because the single-machine runtime is enormous. Using a scaled-down model and training set, we estimated that training in one machine (with enough RAM to hold the data and parameters) would take months. Although the speedups are comparable on both the distributed and the shared-memory system, the former is 3-4 times slower. The reason is that the distributed system has both slower processors and slower interprocessor communication (across a network); see also fig. 13.
Figure 10: Speedup S(P) as a function of the number of machines P for CIFAR, SIFT-1M and SIFT-1B. Top: experimental result in the distributed memory system. Bottom: theoretical result predicted by the model of section 5. The dataset size and number of submodels (N, M) is (50 000, 32) for CIFAR, (10^6, 32) for SIFT-1M and (10^8, 128) for SIFT-1B.

8.5 Shared-memory vs distributed systems

The TSCC distributed cluster consists of nodes containing 16 processors and 64GB RAM. These processors communicate through shared memory within a node, and across a network otherwise (which is slower). In all our experiments up to now, we always allocated processors within the same node if possible. But, depending on whether a user requests processors within or across nodes, there is a tradeoff between both communication modes. A full study of this issue is beyond our scope, which is to understand the ParMAC algorithm in general rather than for specific computer architectures. However, we ran a small experiment to evaluate the computation and communication time spent as a function of the number of processors per node.
We set the total number of processors to P = 16 but allocated them in the following configurations: from a single node with all 16 processors (1x16, pure shared memory) to 16 nodes each with 1 processor (16x1, pure distributed), and intermediate configurations such as 2 nodes each with 8 processors (2x8). We used the RBF hash function from the SIFT-1B experiment with all settings as before and trained it on a subset of 20K points for a single iteration. Figure 13 shows the resulting times. While the computation time remains constant, the communication time increases as we move from shared-memory to distributed settings, as expected. Hence, the effect on the ParMAC algorithm would be to increase the W step runtime correspondingly and lower the parallel speedup.

9 Discussion

Developing parallel, distributed optimisation algorithms for nonconvex problems in machine learning is challenging, as shown by recent efforts by large teams of researchers (Le et al., 2012; Dean et al., 2012). One important advantage of ParMAC is its simplicity. Data and model parallelism arise naturally thanks to the introduction of auxiliary coordinates. The corresponding optimisation subproblems can often be solved reusing existing code as a black box (as with the SGD training of SVMs and linear mappings in the BA). A circular topology is sufficient to achieve low communication between machines. There is no close coupling

Figure 11: SIFT-1B dataset, using the shared-memory and distributed system (solid and dashed lines, resp.). Left panel: E_BA over iterations; right panel: recall@R=100 over iterations, for the RBF and linear hash functions.
Figure 12: Recall@R on the SIFT-1B dataset for truncated PCA (initialisation), linear and kernel hash functions (left plot: final result; right two plots: over iterations, as labelled).

Figure 13: Time spent on communication and computation as a function of the number of processors per node in the TSCC cluster. The time for our shared-memory UC Merced cluster (also using 16 processors, i.e., corresponding to 1x16) is 2.57 and 8.76 seconds for communication and computation, respectively.

between the model structure and the distributed system architecture. The development and implementation of ParMAC for binary autoencoders on large datasets in a distributed cluster was achieved in a few months by the PI and one junior PhD student (both without prior experience in MPI). Rather than an algorithm, MAC is a meta-algorithm that can produce a specific optimisation algorithm for a given nested problem, depending on how the auxiliary coordinates are introduced and on how the resulting subproblems are solved (in this sense, MAC is similar to expectation-maximisation (EM) algorithms; Dempster et al., 1977). For example, in the low-dimensional SVMs of Wang and Carreira-Perpiñán (2014), the Z step is a small quadratic program for each data point.
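As a toy illustration of the MAC alternation (a sketch of the meta-algorithm on a two-layer scalar model, not the paper's binary-autoencoder code): for a nested model y ~ w2*(w1*x), introducing an auxiliary coordinate z_i per point with a quadratic penalty of weight mu makes both the W step and the Z step exact one-dimensional least-squares problems.

```python
def mac_toy(xs, ys, mu=1.0, iters=50):
    """MAC on a toy nested model y ~ w2*(w1*x). The auxiliary coordinate
    z_i decouples the two layers; the penalised objective is
    sum_i (y_i - w2*z_i)^2 + mu*(z_i - w1*x_i)^2."""
    w1, w2 = 1.0, 1.0
    z = [w1 * x for x in xs]  # initialise z_i = output of the inner layer
    for _ in range(iters):
        # W step: the two "submodels" are trained independently given z.
        w2 = sum(y * zi for y, zi in zip(ys, z)) / sum(zi * zi for zi in z)
        w1 = sum(zi * x for zi, x in zip(z, xs)) / sum(x * x for x in xs)
        # Z step: each z_i is updated independently (closed form here).
        z = [(w2 * y + mu * w1 * x) / (w2 * w2 + mu) for x, y in zip(xs, ys)]
    return w1, w2
```

Both steps decompose exactly as in the text: the W step over independent submodels, the Z step over independent data points.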
However, regardless of these specifics, the resulting MAC algorithm typically exhibits a W step with M independent submodels and a Z step with N independent coordinates for the data points. Likewise, the specifics of a ParMAC algorithm (how the W and Z steps are optimised) will depend on its corresponding MAC algorithm. However, it will always split the data and auxiliary coordinates over machines and consist of a Z step with no communication between machines, and a W step where submodels visit machines in a circular topology, effectively training themselves by stochastic optimisation. Further improvements can be made in specific problems. For example, it is possible to have further parallelisation or fewer dependencies (e.g. the weights of hidden units in layer k of a neural net depend only on auxiliary coordinates in layers k and k+1). This may reduce the communication in the W step, by sending to a given machine only the model portion that it needs, or by allocating cores within a multicore machine accordingly. Also, the W and Z step optimisations can make use of further parallelisation by GPUs or by distributed convex optimisation algorithms. And, if the submodels are small in size, it may be better for each machine to operate on a set of submodels and then send them all together in a larger message (rather than sending each submodel as it is finished), since this will reduce latency overheads (i.e., the setup cost of a message). Many more refinements can (and should) be made in an industrial implementation. For example, one can store and communicate reduced-precision values for data and parameters with little effect on the accuracy, as has been done in neural nets (e.g. Gupta et al., 2015; Han et al., 2015, 2016).
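The circular-topology W step just described can be sketched with a toy, single-process simulation (pure Python, no MPI; the scalar model, data and names are ours): each of the M submodels hops around the ring of P machines, taking SGD steps on each machine's local data shard, so only submodel parameters ever move.

```python
def w_step_ring(shards, n_models, epochs, lr=0.1):
    """Simulate the ParMAC W step. shards[p] holds machine p's local
    (x, y) pairs; each submodel circulates through all P machines,
    doing SGD on each local shard (here a scalar least-squares fit)."""
    P = len(shards)
    weights = [0.0] * n_models           # one scalar weight per submodel
    visits = [[0] * P for _ in range(n_models)]
    for _ in range(epochs):
        # Stagger submodels around the ring so machines work in parallel.
        at = [m % P for m in range(n_models)]
        for _ in range(P):               # P hops: every submodel sees every shard
            for m in range(n_models):
                p = at[m]
                visits[m][p] += 1
                for x, y in shards[p]:   # SGD pass over the local shard
                    weights[m] -= lr * (weights[m] * x - y) * x
                at[m] = (p + 1) % P      # "send" the submodel to the next machine
    return weights, visits
```

Per epoch each submodel is communicated P times (one hop per machine), while the data and coordinates never move, matching the communication pattern described in the text.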
Various system-dependent optimisations may be possible (beyond those that a good compiler may be able to do, such as loop unrolling or code inlining), such as improving the spatial or temporal locality of the code given the type and size of the cache in each machine. In this paper, we have tried to keep our implementation as simple as possible, because our goal was to understand the parallelisation speedups of ParMAC in a setting as general as possible, rather than trying to achieve the very best performance for a particular dataset, model or distributed system. ParMAC is very efficient in communication: no data or coordinates are ever sent, only the entire model, e+1 times per iteration. Using one epoch (which is sufficient in large datasets), or using e epochs but performing them within each machine, the entire model is sent twice per iteration. This is near optimal if we note the following. If the data cannot be communicated, then at every iteration each submodel must visit each machine (for it to be trained on the entire data). Hence, any correct algorithm will have to send the entire model at least once; ParMAC does so twice. Also, the circular topology is the minimal topology (considered as a directed graph on the P machines) that is necessary to be able to optimise a global model on the entire dataset with P machines, because each machine must be able to communicate with some other machine. It has P edges and is truly distributed, with each machine having the same importance. A popular model for distributed optimisation (e.g. with parallel SGD) uses a worker-server connectivity: W >> 1 workers that do the actual parameter optimisation and S >= 1 "parameter servers" that collect and broadcast parameters to the worker machines.
This is a bipartite graph with bidirectional edges between servers and workers, having SW edges, which is considerably larger than the number of machines P = S + W. The entire model must be sent twice per iteration, to and from the parameter server, but this creates a bottleneck when multiple workers send data to the same server, and S << W in practice. No such bottleneck occurs in ParMAC. Parallel and distributed computing systems have been around for decades. One important class are supercomputers, which are carefully designed in terms of the processors, memory system and connection network. They have been traditionally used to solve a wide variety of large-scale scientific computation problems, such as weather prediction, nuclear reactor modelling, or astrophysical or molecular simulations. Another important class are clusters of inexpensive, heterogeneous workstations connected through an Ethernet network, with workstations differing in speed, memory/disk capacity, number of cores/GPUs, etc. This is used in data centres at Google, Amazon and other companies, and also in distributed computation models such as SETI@home that capitalise on the computation and Internet connectivity available to individuals, and their willingness to donate them to projects they find worthy. In these systems, the machine learning task may be one of several tasks running concurrently, such as web searches or email in a data centre (which may operate on the same data as the machine learning task), or personal applications in an individual's workstation. Supercomputers and clusters differ considerably across important factors: suitability for a particular problem, computation and communication speed, size of memory and disk, connection network, fault tolerance, load, cost, energy consumption, etc.
At present it is unclear what the best choices will be for machine learning models (which exhibit a wide variety themselves), and we expect to see many different possibilities being researched in the immediate future. We suggest that ParMAC, by itself or in combination with other techniques, may play an important role with nested models because of the embarrassing parallelism it introduces and its loose demands on the underlying distributed system.

10 Conclusion

We have proposed ParMAC, a distributed model for the method of auxiliary coordinates for training nested, nonconvex models in general, analysed its parallel speedup and convergence, and demonstrated it with an MPI-based implementation for a particular case, to train binary autoencoders. MAC creates parallelism by introducing auxiliary coordinates for each data point to decouple nested terms in the objective function. ParMAC is able to translate the parallelism inherent in MAC into a distributed system by 1) using data parallelism, so that each machine keeps a portion of the original data and its corresponding auxiliary coordinates; and 2) using model parallelism, so that independent submodels (weight vectors of a hash function or hidden unit) visit every machine in a circular topology, effectively executing epochs of a stochastic optimisation, without the need for a parameter server and therefore with no communication bottlenecks. This keeps the communication between machines to a minimum within each iteration. In this sense, ParMAC can be seen as a strategy to be able to use existing, well-developed (convex) distributed optimisation techniques, applicable to simple functions, in a setting where simple functions are coupled by nesting into a nonconvex function whose training data is distributed over machines.
The convergence properties of MAC (to a stationary point of the objective function) remain essentially unaltered in ParMAC. The parallel speedup can be theoretically predicted to be nearly perfect when the number of submodels is comparable to or larger than the number of machines, and to eventually saturate as one continues to increase the number of machines, and indeed this was confirmed in our experiments. ParMAC also makes it easy to account for data shuffling, load balancing, streaming and fault tolerance. Hence, we expect that ParMAC could be a basic building block, in combination with other techniques, for the distributed optimisation of nested models in big data settings.

Acknowledgements

Work supported by a Google Faculty Research Award and by NSF award IIS-1423515. We thank Dong Li (UC Merced) for useful discussions about MPI and performance evaluation on parallel systems, and Quoc Le (Google) for useful discussion about Google's DistBelief system.

A Theoretical analysis of the speedup: proofs

In section 5 we proposed the following theoretical estimate for the speedup S(P):

  S(P) = T(1)/T(P) = ρ (M/⌈M/P⌉) P / ( (1/N) P² + ρ₂ P + ρ₁ M/⌈M/P⌉ ).   (12')

Consider S(P) as a real function of a real variable P ≥ 1 (keeping in mind that only integer values of P can occur in practice). The function ⌈M/P⌉ is piecewise constant and takes the values M, M−1, ..., 1 as P increases from P = 1, with discontinuities where M/P = k for k = M, M−1, ..., 1. Hence, S(P) is piecewise continuous on M intervals of the form [1, M/(M−1)), [M/(M−1), M/(M−2)), ..., [M/2, M), [M, ∞). (Many of these intervals occur between integer values of P so they are actually unobserved in practice; for example, for M = 16 there are 8 intervals between P = 1 and P = 2.)
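As a quick numerical sanity check of eq. (12'), the following sketch (with illustrative values of ρ₁, ρ₂, N and M; the constants are ours, not fitted to any experiment) evaluates S(P) and exhibits the behaviour characterised below: near-perfect speedup at P = M, and a single global maximum at P*₁ = sqrt(ρ₁MN) in the last interval.

```python
import math

def speedup(P, M, N, rho1, rho2):
    """Eq. (12'): S(P) = rho*(M/ceil(M/P))*P / (P^2/N + rho2*P + rho1*M/ceil(M/P)),
    with rho = rho1 + rho2."""
    rho = rho1 + rho2
    k = math.ceil(M / P)
    return (rho * M * P / k) / (P * P / N + rho2 * P + rho1 * M / k)

M, N = 32, 10**6
rho1, rho2 = 0.98, 0.02              # computation dominates communication
P_star1 = math.sqrt(rho1 * M * N)    # = 5600: location of the global maximum
S_star1 = (rho1 + rho2) * M / (rho2 + 2 * math.sqrt(rho1 * M / N))  # eq. (17'), k = 1
```

With these numbers, speedup(M, ...) is close to M (about 32), and sweeping P shows S(P) peaking at P*₁ = 5600 with value S*₁ of roughly 1026 before decaying, as the theory predicts.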
Within each interval P ∈ [M/k, M/(k−1)) we have ⌈M/P⌉ = k, hence we can equivalently write the speedup of eq. (12') as the following rational function of P:

  S(P) = (ρMP/k) / ( (1/N) P² + ρ₂ P + ρ₁ M/k )   for P ∈ [M/k, M/(k−1)), k = M, M−1, ..., 1.   (23)

A.1 Characterisation of the speedup function S(P)

Our main theorem is theorem A.1 below. It characterises how the speedup grows as a function of P. Let us define

  P*_k = sqrt(ρ₁ M N / k),   S*_k = S(P*_k) = (ρM/k) / ( ρ₂ + 2 sqrt(ρ₁ M / (N k)) ),   k = 1, 2, ..., M.   (17')

Theorem A.1. Consider the function S(P) of eq. (23) and P*_k and S*_k as in eq. (17'). Then:

1. S*_k < S*_{k−1} for k = 2, ..., M.
2. Within the interval [M/k, M/(k−1)) for k = 1, 2, ..., M, S(P) either is monotonically increasing, or is monotonically decreasing, or achieves a single maximum S*_k = S(P*_k) at P*_k.
3. S(M/k) > S(P) for 1 ≤ P < M/k, for k = 2, ..., M.

Proof. Part 1 is obvious by writing

  S*_k = (ρM/k) / ( ρ₂ + 2 sqrt(ρ₁ M / (N k)) ) = ρM / ( ρ₂ k + 2 sqrt(ρ₁ k M / N) ).

To prove part 2, we apply lemma A.4 to S(P) within the interval [M/k, M/(k−1)) for k = 1, 2, ..., M. We obtain that S(P) either is monotonically increasing, or is monotonically decreasing, or achieves a single maximum S*_k = S(P*_k) at P*_k. To prove part 3, we apply theorem A.3 repeatedly for k = 2, ..., M.

Remark A.2. As a particular case of theorem A.1 part 2 for k = 1, we obtain that for P ∈ [M, ∞)

  P*₁ = sqrt(ρ₁ M N),   S*₁ = S(P*₁) = ρM / ( ρ₂ + 2 sqrt(ρ₁ M / N) ),   (19')

and S(P) is either monotonically decreasing with P if M ≥ P*₁, or it increases from P = M up to a single maximum at P = P*₁ and then decreases monotonically. In both cases, if t_W^c > 0 we have that S(P) → 0 as P → ∞, and S(P) ≈ ρNM/P for large P.

Theorem A.3. Consider the function of eq.
(23), written more simply using ρ′₁ = ρ₁N > 0, ρ′₂ = ρ₂N > 0 and ρ′ = ρN = ρ′₁ + ρ′₂ > 0:

  S(P) = (ρ′MP/k) / ( P² + ρ′₂ P + ρ′₁ M/k )   for P ∈ [M/k, M/(k−1)), k = 2, ..., M.   (24)

Then, for k = 2, ..., M: S(M/(k−1)) > S(P) for all P ∈ [M/k, M/(k−1)).

Proof. From theorem A.1 part 2, we know that exactly one of the following three cases holds for P ∈ [M/k, M/(k−1)):

1. S(P) is monotonically increasing. Then, it suffices to prove that lim_{P→M/(k−1)} S(P) < S(M/(k−1)). Indeed,

  lim_{P→M/(k−1)} S(P) = ρ′M / ( ρ′(k−1) + ρ′₂ + Mk/(k−1) ) < ρ′M / ( ρ′(k−1) + M ) = S(M/(k−1)).

2. S(P) is monotonically decreasing. Then, it suffices to prove that S(M/k) < S(M/(k−1)). Indeed,

  S(M/k) = ρ′M / ( ρ′k + M ) < ρ′M / ( ρ′(k−1) + M ) = S(M/(k−1)).

3. S(P) achieves a single maximum S*_k = S(P*_k) at P*_k in the interior of the interval. Then, it suffices to prove that S*_k < S(M/(k−1)). This last case is more complicated. In the sequel, we provide a proof that has no technical difficulties, although it is somewhat cumbersome. Let us then prove that S*_k < S(M/(k−1)). After a bit of algebra, from eqs. (24) and (17') we obtain the following:

  S*_k < S(M/(k−1))  ⇔  M + ρ′₁k − ρ′ < 2kP*_k = 2k sqrt(ρ′₁M/k).

If M + ρ′₁k − ρ′ ≤ 0, then the condition holds and the proof is done. Otherwise, assume M + ρ′₁k − ρ′ > 0 and take squares in the previous equation:

  (M + ρ′₁k − ρ′)² < 4ρ′₁kM  ⇔  M² + (ρ′₁k − ρ′)² − 2(ρ′₁k + ρ′)M < 0  ⇔  ρ′₁k + ρ′ − 2 sqrt(ρ′₁ρ′k) < M < ρ′₁k + ρ′ + 2 sqrt(ρ′₁ρ′k).

Hence, we need to prove that the following inequalities hold:

  ρ′₁k + ρ′ − 2 sqrt(ρ′₁ρ′k) < M < ρ′₁k + ρ′ + 2 sqrt(ρ′₁ρ′k)   (25)

under the following assumptions:

• P*_k = sqrt(ρ′₁M/k) ∈ (M/k, M/(k−1)), since case 3 above means that S(P) achieves a maximum S*_k at P*_k in the interior of the interval.
Equivalently, ρ′₁k > M > ρ′₁(k−1)²/k = ρ′₁(k − 2 + 1/k).

• M ≥ k ≥ 2, since k takes the values 2, 3, ..., M.

• M > ρ′ − ρ′₁k, from above.

From M < ρ′₁k it follows that M < ρ′₁k + ρ′ + 2 sqrt(ρ′₁ρ′k), and the RHS inequality in (25) is proven. Now let us prove the LHS inequality in (25), which states M > ρ′₁k + ρ′ − 2 sqrt(ρ′₁ρ′k). This can be derived from the assumption above that M > ρ′₁(k − 2 + 1/k). Specifically, we will prove that ρ′₁(2 − 1/k) < −ρ′ + 2 sqrt(ρ′₁ρ′k), and this will complete the proof of the theorem. From the assumptions ρ′₁k > M > ρ′ − ρ′₁k we get that ρ′ < 2ρ′₁k, and since ρ′ ≥ ρ′₁, we have that ρ′ ∈ [ρ′₁, 2ρ′₁k). Now, write ρ′ = a²ρ′₁ with a ∈ [1, sqrt(2k)). Then

  ρ′₁(2 − 1/k) < −ρ′ + 2 sqrt(ρ′₁ρ′k)  ⇔  a² + 2 − 1/k < 2a sqrt(k).

Since k ≥ 2 ⇔ 3/2 ≥ 2 − 1/k, to prove a² + 2 − 1/k < 2a sqrt(k) it suffices to prove that a² + 3/2 < 2a sqrt(k), or equivalently a ∈ ( sqrt(k) − sqrt(k − 3/2), sqrt(k) + sqrt(k − 3/2) ). This interval indeed contains [1, sqrt(2k)): a little algebra shows that

  sqrt(k) − sqrt(k − 3/2) < 1  ⇔  k > (5/4)²   and   sqrt(k) + sqrt(k − 3/2) > sqrt(2k)  ⇔  k > 3/(4(sqrt(2) − 1)),

both of which hold because k ≥ 2.

Lemma A.4. Given constants α, β, γ, δ > 0, define the following real function for P ≥ 0:

  ψ(P) = δP / ( αP² + βP + γ )   (26)

and let P* = sqrt(γ/α) and ψ* = ψ(P*) = δ / ( β + 2 sqrt(αγ) ). Then, in the interval [a, b) with 1 ≤ a < b ≤ ∞, ψ is monotonically decreasing if P* ≤ a, achieves a single maximum at P* if P* ∈ (a, b), and is monotonically increasing if P* ≥ b.

Proof. The derivatives of ψ with respect to P are:

  ψ′(P) = δ(−αP² + γ) / (αP² + βP + γ)²,   ψ′′(P) = −2δ(−α²P³ + 3αγP + βγ) / (αP² + βP + γ)³.

Hence ψ′(P*) = 0 at P* = sqrt(γ/α) and ψ′′(P*) < 0, and the lemma follows.

A.2 Globally maximum speedup S* = max_{P ≥ 1} S(P)

The maximum speedup can be determined as follows.
Both S(P) in (14) and S*_k in (17) are monotonically increasing with P (note S*_k is decreasing with k, and k is decreasing with P). Hence, the global maximum of S(P) occurs in the last interval [M, ∞), either at its beginning (P = M, if P*₁ ≤ M) or in its interior (P = P*₁, if P*₁ > M). This also follows from theorem A.1. Specifically, the global maximum S* of S(P) is:

• If M ≥ ρ₁N: S* = M / (1 + M/(ρN)) ≤ M, achieved at P = M.
• If M < ρ₁N: S* = S*₁ = ρM / ( ρ₂ + 2 sqrt(ρ₁M/N) ) > M, achieved at P = P*₁ = sqrt(ρ₁MN) > M.

Let us prove that S*₁ > M. Assume S*₁ ≤ M; then

  ρM / ( ρ₂ + 2 sqrt(ρ₁M/N) ) ≤ M  ⇔  ρ ≤ ρ₂ + 2 sqrt(ρ₁M/N)  ⇔  ρ₁ ≤ 2 sqrt(ρ₁M/N)  ⇔  ρ₁ ≤ 4M/N  ⇔  M ≥ ρ₁N/4,

which contradicts the condition that M < ρ₁N, hence S*₁ > M. In practice, with large values of N, the more likely case is S* = S*₁ for P = P*₁ > M.

A.3 The "large dataset" case

If we take P ≪ ρ₂N (the "large dataset" case), the P² term in the speedup expression (12') becomes negligible, and the speedup becomes

  S(P) = ρ (M/⌈M/P⌉) P / ( (1/N)P² + ρ₂P + ρ₁M/⌈M/P⌉ ) ≈ ρ / ( ρ₁/P + kρ₂/M )   where k = ⌈M/P⌉ ∈ {1, 2, ..., M}.

Now, by taking k = 1 (M < P) and k = M/P (M divisible by P) we obtain the following important cases:

  if M divisible by P:  S(P) ≈ P;   if M < P:  S(P) ≈ ρ / ( ρ₁/P + ρ₂/M ),   (20')

so that the speedup is almost perfect up to P = M, and then it is approximately the weighted harmonic mean of M and P (hence, S(P) is monotonically increasing and between M and P). For P ≫ ρ₁M/ρ₂, we have S(P) ≈ ρM/ρ₂ > M.

B Important MPI functions

For reference, we briefly describe important MPI functions and their parameters (Gropp et al., 1999a,b; Message Passing Interface Forum, 2012).

B.1 Environment Management Routines

• MPI_Init(&argc,&argv): initialises the MPI execution environment.
It must be called exactly once in every MPI program, before calling any other MPI functions. For C programs, it may be used to pass the command-line arguments to all processes. Input: argc, pointer to the number of arguments; argv, pointer to the argument vector.

• MPI_Comm_size(comm,&size): returns the total number of MPI processes in the specified communicator. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application. Input: comm, communicator (handle). Output: size, number of processes in the group of comm (integer).

• MPI_Comm_rank(comm,&rank): returns the rank of the calling MPI process within the specified communicator. Initially, each process is assigned a unique integer rank between 0 and the number of tasks − 1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well. Input: comm, communicator (handle). Output: rank, rank of the calling process in the group of comm (integer).

• MPI_Finalize(): terminates the MPI execution environment. It should be the last MPI function called in any MPI program.

B.2 Point to Point Communication Routines

MPI point-to-point operations involve message passing between exactly two MPI tasks. One task performs a send operation and the other task performs a matching receive operation. There are different types of send and receive functions, used for different purposes, such as synchronous send; blocking send, blocking receive; non-blocking send, non-blocking receive; buffered send; combined send-receive. Their argument lists generally take one of the following formats:

• Blocking send: MPI_Send(buffer,count,type,dest,tag,comm).
• Non-blocking send: MPI_Isend(buffer,count,type,dest,tag,comm,request).
• Blocking receive: MPI_Recv(buffer,count,type,source,tag,comm,status).
• Non-blocking receive: MPI_Irecv(buffer,count,type,source,tag,comm,request).

Here is a brief description of the parameters:

• buffer: program (application) address space that references the data that is to be sent or received. In most cases, this is simply the variable name that is to be sent or received. For C programs, buffer is passed by reference and must be prepended with an ampersand: &buffer.
• count: indicates the number of data elements of type type to be sent.
• type: the data type that is sent or received. For reasons of portability, MPI predefines its elementary data types.
• dest: for send routines, it indicates the process to which a message should be delivered. Specified as the rank of the receiving process.
• source: for receive routines, it indicates the originating process of the message. Specified as the rank of the sending process. It may be set to the wildcard MPI_ANY_SOURCE to receive a message from any task.
• tag: arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wildcard MPI_ANY_TAG can be used to receive any message regardless of its tag.
• comm: indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
• status: for receive routines, it indicates the source of the message and the tag of the message. In C, status is a pointer to a predefined structure MPI_Status. Additionally, the actual number of bytes received is obtainable from status via the MPI_Get_count routine.
• reques t : us e d by non-blo cking send/receive. Since no n- blocking op erations may return b efore the requested system buffer space is obtained, the system issues a unique “re q uest n umber ”. The pr o- grammer uses this system-as signed “handle” later (in a Wait -t yp e ro utine) to determine completion of the non-blo c king oper ation. In C, r equest is a p ointer to a predefined structure MPI Reques t . These are the blo cking message p assing r outines : • MPI Send(& buf,coun t,datatype,dest,tag,comm) : basic blo cking s end op eration. It returns only after the a pplication buffer in the s ending tas k is free for r euse. • MPI Recv (&buf,cou nt,datatype,source,tag,comm,&status) : receive a message and blo c k un til the requested data is av aila ble in the a pplica tion buffer in the receiving ta s k. • MPI Ssend( &buf,cou nt,datatype,dest,tag,comm) : synchronous blo c king s e nd. It s ends a messa ge and blo cks until the applicatio n buffer in the sending task is free for reuse and the destination pro cess has started to receive the message . • MPI Bsend( &buf,cou nt,datatype,dest,tag,comm) : buffer e d blo c king s e nd. It allows the progra m- mer to allo cate the required amo un t of buffer spa ce into whic h data can be copied until it is delivered. It alleviates problems a ssocia ted with insufficient system buffer space. It r e turns after the data has bee n copied from the application buffer space to the a llocated send buffer. It must b e used with the MPI Buffer atta ch routine. • MPI Buff er attac h(&buffer ,size), MPI B uffer det ach(&buf fer,size) : us e d by the pr ogrammer to allo cate or deallo cate message buffer space to be used b y MPI Bsend . The size argument is specified in actual data b ytes (not a count of data elemen ts). Only one buffer can b e attached to a pro cess at a time. • MPI Wait(& request, &status) : blo c ks unt il a sp ecified non-blo c king send or receive op eration has completed. 
These are the non-blocking message passing routines:
• MPI_Isend(&buf,count,datatype,dest,tag,comm,&request): identifies an area in memory to serve as a send buffer. Processing continues immediately without waiting for the message to be copied out from the application buffer. A communication request handle is returned for handling the pending message status. The program should not modify the application buffer until subsequent calls to MPI_Wait or MPI_Test indicate that the non-blocking send has completed.
• MPI_Irecv(&buf,count,datatype,source,tag,comm,&request): identifies an area in memory to serve as a receive buffer. Processing continues immediately without actually waiting for the message to be received and copied into the application buffer. A communication request handle is returned for handling the pending message status. The program must use calls to MPI_Wait or MPI_Test to determine when the non-blocking receive operation completes and the requested message is available in the application buffer.
• MPI_Test(&request,&flag,&status): checks the status of a specified non-blocking send or receive operation. It returns in flag logical true (1) if the operation completed and logical false (0) otherwise.

B.3 Collective Communication Routines

• MPI_Bcast(&buffer,count,datatype,root,comm): data movement operation. It broadcasts (sends) a message from the process with rank root to all other processes in the group.
• MPI_Gather(&sendbuf,sendcnt,sendtype,&recvbuf,recvcount,recvtype,root,comm): data movement operation. It gathers distinct messages from each task in the group to a single destination task. Its reverse operation is MPI_Scatter.
• MPI_Allgather(&sendbuf,sendcount,sendtype,&recvbuf,recvcount,recvtype,comm): data movement operation. It concatenates data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
• MPI_Reduce(&sendbuf,&recvbuf,count,datatype,op,root,comm): collective computation operation. It applies a reduction operation on all tasks in the group and places the result in one task.
• MPI_Allreduce(&sendbuf,&recvbuf,count,datatype,op,comm): collective computation and data movement operation. It applies a reduction operation and places the result in all tasks in the group. It is equivalent to MPI_Reduce followed by MPI_Bcast.