DimmWitted: A Study of Main-Memory Statistical Analytics


Authors: Ce Zhang, Christopher Ré

Ce Zhang†‡  Christopher Ré†
†Stanford University  ‡University of Wisconsin-Madison
{czhang, chrismre}@cs.stanford.edu

ABSTRACT

We perform the first study of the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical analytics systems differ from conventional SQL-analytics in the amount and types of memory incoherence that they can tolerate. Our goal is to understand tradeoffs in accessing the data in row- or column-order and at what granularity one should share the model and data for a statistical task. We study this new tradeoff space and discover that there are tradeoffs between hardware and statistical efficiency. We argue that our tradeoff study may provide valuable information for designers of analytics engines: for each system we consider, our prototype engine can run at least one popular task at least 100× faster. We conduct our study across five architectures using popular models, including SVMs, logistic regression, Gibbs sampling, and neural networks.

1. INTRODUCTION

Statistical data analytics is one of the hottest topics in data-management research and practice. Today, even small organizations have access to machines with large main memories (via Amazon's EC2) or for purchase at $5/GB. As a result, there has been a flurry of activity to support main-memory analytics in both industry (Google Brain, Impala, and Pivotal) and research (GraphLab and MLlib). Each of these systems picks one design point in a larger tradeoff space. The goal of this paper is to define and explore this space. We find that today's research and industrial systems under-utilize modern commodity hardware for analytics, sometimes by two orders of magnitude.
We hope that our study identifies some useful design points for the next generation of such main-memory analytics systems.

Throughout, we use the term statistical analytics to refer to those tasks that can be solved by first-order methods, a class of iterative algorithms that use gradient information; these methods are the core algorithm in systems such as MLlib, GraphLab, and Google Brain. Our study examines analytics on commodity multi-socket, multi-core, non-uniform memory access (NUMA) machines, which are the de facto standard machine configuration and thus a natural target for an in-depth study. Moreover, our experience with several enterprise companies suggests that, after appropriate preprocessing, a large class of enterprise analytics problems fits into the main memory of a single, modern machine. While this architecture has been recently studied for traditional SQL-analytics systems [16], it has not been studied for statistical analytics systems.

Statistical analytics systems are different from traditional SQL-analytics systems. In comparison to traditional SQL-analytics, the underlying methods are intrinsically robust to error. On the other hand, traditional statistical theory does not consider which operations can be efficiently executed. This leads to a fundamental tradeoff between statistical efficiency (how many steps are needed until convergence to a given tolerance) and hardware efficiency (how efficiently those steps can be carried out).

To describe such tradeoffs more precisely, we describe the setup of the analytics tasks that we consider in this paper. The input data is a matrix in R^{N×d}, and the goal is to find a vector x ∈ R^d that minimizes some (convex) loss function, say the logistic loss or the hinge loss (SVM). Typically, one makes several complete passes over the data while updating the model; we call each such pass an epoch.
There may be some communication at the end of each epoch, e.g., in bulk-synchronous parallel systems such as Spark. We identify three tradeoffs that have not been explored in the literature: (1) access methods for the data, (2) model replication, and (3) data replication. Current systems have picked one point in this space; we describe each tradeoff and discover points that have not been previously considered. Using these new points, we can perform 100× faster than previously explored points in the tradeoff space for several popular tasks.

Access Methods. Analytics systems access (and store) data in either row-major or column-major order. For example, systems that use stochastic gradient descent methods (SGD) access the data row-wise; examples include MADlib [23] in Impala and Pivotal, Google Brain [29], and MLlib in Spark [47]. Systems that use stochastic coordinate descent methods (SCD) access the data column-wise; examples include GraphLab [34], Shogun [46], and Thetis [48]. These methods have essentially identical statistical efficiency, but their wall-clock performance can be radically different due to hardware efficiency.

[Figure 1: Illustration of (a) DimmWitted's Memory Model, (b) Pseudocode for SGD, (c) Different Statistical Methods in DimmWitted and Their Access Patterns, and (d) NUMA Architecture.]
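The SGD loop from Figure 1(b) can be sketched in Python. This is a minimal dense-vector version; the least-squares gradient, step size, and tolerance below are illustrative choices, not fixed by the paper:

```python
import numpy as np

def sgd(A, y, grad, alpha=0.05, tol=1e-6, max_epochs=500):
    """One epoch = one pass over the rows of A (row-wise access)."""
    N, d = A.shape
    m = np.zeros(d)                              # initial model
    for _ in range(max_epochs):
        for i in range(N):                       # read a single example z = A[i]
            m -= alpha * grad(A[i], y[i], m)     # update the model vector
        # convergence test: estimate the gradient norm with a full scan
        g = sum(grad(A[i], y[i], m) for i in range(N)) / N
        if np.linalg.norm(g) < tol:
            break
    return m

def ls_grad(z, yi, m):
    """Gradient of the squared loss for one example (illustrative loss)."""
    return 2.0 * (z @ m - yi) * z
```

For a noise-free least-squares instance, `sgd(A, y, ls_grad)` recovers the generating model.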
However, this tradeoff has not been systematically studied. To study it, we introduce a storage abstraction that captures the access patterns of popular statistical analytics tasks and a prototype called DimmWitted. In particular, we identify three access methods that are used in popular analytics tasks, including standard supervised machine learning models such as SVMs, logistic regression, and least squares, as well as more advanced methods such as neural networks and Gibbs sampling on factor graphs. For different access methods applied to the same problem, we find that the time to converge to a given loss can differ by up to 100×. We also find that no access method dominates all others, so an engine designer may want to include both. To show that it may be possible to support both methods in a single engine, we develop a simple cost model to choose among the access methods; it selects a nearly optimal point across our data sets, models, and machine configurations.

Data and Model Replication. We study two sets of tradeoffs: the level of granularity, and the mechanism by which mutable state and immutable data are shared in analytics tasks. We describe the tradeoffs we explore in both (1) mutable state sharing, which we informally call model replication, and (2) data replication.

(1) Model Replication. During execution, there is some state that the task mutates (typically an update to the model). We call this state, which may be shared among one or more processors, a model replica. We consider three different granularities at which to share model replicas:

• The PerCore approach treats a NUMA machine as a distributed system in which every core is treated as an individual machine, e.g., in bulk-synchronous models such as MLlib on Spark or event-driven systems such as GraphLab.
These approaches are the classical shared-nothing and event-driven architectures, respectively. In PerCore, the part of the model that is updated by each core is only visible to that core until the end of an epoch. This method is efficient and scalable from a hardware perspective, but it is less statistically efficient, as there is only coarse-grained communication between cores.

• The PerMachine approach acts as if each processor has uniform access to memory. This approach is taken in Hogwild! and Google Downpour [19]. In this method, the hardware takes care of the coherence of the shared state. The PerMachine method is statistically efficient due to high communication rates, but it may cause contention in the hardware, which may lead to suboptimal running times.

• A natural hybrid is PerNode; this method uses the fact that PerCore communication through the last-level cache (LLC) is dramatically faster than communication through remote main memory. This method is novel; for some models, PerNode can be an order of magnitude faster.

Because model replicas are mutable, a key question is how often should we synchronize model replicas? We find that it is beneficial to synchronize the models as much as possible, so long as we do not impede throughput to data in main memory. A natural idea, then, is to use PerMachine sharing, in which the hardware is responsible for synchronizing the replicas. However, this decision can be suboptimal, as the cache-coherence protocol may stall a processor to preserve coherence, and this coherence may not be worth the cost of a stall from a statistical efficiency perspective. We find that the PerNode method, coupled with a simple technique to batch writes across sockets, can dramatically reduce communication and processor stalls. The PerNode method can result in an over 10× runtime improvement.
This technique depends on the fact that we do not need to maintain the model consistently: we are effectively delaying some updates to reduce the total number of updates across sockets (which lead to processor stalls).

(2) Data Replication. The data for analytics is immutable, so there are no synchronization issues for data replication. The classical approach is to partition the data to take advantage of higher aggregate memory bandwidth. However, each partition may contain skewed data, which may slow convergence. Thus, an alternate approach is to replicate the data fully (say, per NUMA node). In this approach, each node accesses its data in a different order, which means that the replicas provide non-redundant statistical information; in turn, this reduces the variance of the estimates based on the data in each replica. We find that for some tasks, fully replicating the data four ways can converge to the same loss almost 4× faster than the sharding strategy.

Summary of Contributions. We are the first to study the three tradeoffs listed above for main-memory statistical analytics systems. These tradeoffs are not intended to be an exhaustive set of optimizations, but they demonstrate our main conceptual point: treating NUMA machines as distributed systems or SMPs is suboptimal for statistical analytics. We design a storage manager, DimmWitted, that shows it is possible to exploit these ideas on real data sets. Finally, we evaluate our techniques on multiple real datasets, models, and architectures.

2. BACKGROUND

In this section, we describe the memory model for DimmWitted, which provides a unified memory model to implement popular analytics methods. Then, we recall some basic properties of modern NUMA architectures.

Data for Analytics. The data for an analytics task is a pair (A, x), which we call the data and the model, respectively.
For concreteness, we consider a matrix A ∈ R^{N×d}. In machine learning parlance, each row is called an example. Thus, N is often the number of examples, and d is often called the dimension of the model. There is also a model, typically a vector x ∈ R^d. The distinction is that the data A is read-only, while the model vector, x, will be updated during execution. From the perspective of this paper, the important distinction is that the data is an immutable matrix, while the model (or portions of it) is mutable data.

First-Order Methods for Analytic Algorithms. DimmWitted considers a class of popular algorithms called first-order methods. Such algorithms make several passes over the data; we refer to each such pass as an epoch. A popular example algorithm is stochastic gradient descent (SGD), which is widely used by web companies, e.g., Google Brain [29] and Vowpal Wabbit [1], and in enterprise systems such as Pivotal, Oracle, and Impala. Pseudocode for this method is shown in Figure 1(b). During each epoch, SGD reads a single example z; it uses the current value of the model and z to estimate the derivative; and it then updates the model vector with this estimate. It reads each example in this loop. After each epoch, these methods test convergence (usually by computing or estimating the norm of the gradient); this computation requires a scan over the complete dataset.

2.1 Memory Models for Analytics

We design DimmWitted's memory model to capture the trend in recent high-performance sampling and statistical methods. There are two aspects to this memory model: the coherence level and the storage layout.

Coherence Level. Classically, memory systems are coherent: reads and writes are executed atomically. For analytics systems, we say that a memory model is coherent if reads and writes of the entire model vector are atomic.
That is, access to the model is enforced by a critical section. However, many modern analytics algorithms are designed for an incoherent memory model. The Hogwild! method showed that one can run such a method in parallel without locking and still provably converge. The Hogwild! memory model relies on the fact that writes of individual components are atomic, but it does not require that the entire vector be updated atomically; atomicity at the level of the cacheline is provided by essentially all modern processors. Empirically, these results allow one to forgo costly locking (and coherence) protocols. Similar algorithms have been proposed for other popular methods, including Gibbs sampling [25, 45], stochastic coordinate descent (SCD) [42, 46], and linear systems solvers [48]. This technique was applied by Dean et al. [19] to solve convex optimization problems with billions of elements in a model. This memory model is distinct from the classical, fully coherent database execution.

[Figure 2: Algorithms and Their Access Methods.
  Stochastic Gradient Descent: Row-wise (MADlib, Spark, Hogwild!)
  Stochastic Coordinate Descent: Column-wise and Column-to-row (GraphLab, Shogun, Thetis)]

The DimmWitted prototype allows us to specify that a region of memory is coherent or not. This region of memory may be shared by one or more processors. If the memory is only shared per thread, then we can simulate a shared-nothing execution. If the memory is shared per machine, we can simulate Hogwild!.

Access Methods. We identify three distinct access paths used by modern analytics systems, which we call row-wise, column-wise, and column-to-row. They are graphically illustrated in Figure 1(c). Our prototype supports all three access methods. All of our methods perform several epochs, that is, passes over the data.
However, the algorithm may iterate over the data row-wise or column-wise.

• In row-wise access, the system scans each row of the table, applies a function to that row, and then updates the model. This method may write to all components of the model. Popular methods that use this access method include stochastic gradient descent, gradient descent, and higher-order methods (such as L-BFGS).

• In column-wise access, the system scans each column j of the table. This method reads just the j-th component of the model. The write set of the method is typically a single component of the model. This method is used by stochastic coordinate descent.

• In column-to-row access, the system iterates conceptually over the columns. This method is typically applied to sparse matrices. When iterating on column j, it reads all rows in which column j is non-zero. This method also updates a single component of the model. This method is used by non-linear support vector machines in GraphLab and is the de facto approach for Gibbs sampling.

DimmWitted is free to iterate over rows or columns in essentially any order (although typically some randomness in the ordering is desired). Figure 2 classifies popular implementations by their access method.

2.2 Architecture of NUMA Machines

We briefly describe the architecture of a modern NUMA machine. As illustrated in Figure 1(d), a NUMA machine contains multiple NUMA nodes. Each node has multiple cores and processor caches, including the L3 cache.

[Machine configurations (Figure 3); on local2, a worker reaches local RAM at 6 GB/s and remote RAM over QPI at 11 GB/s:
  Name (abbrv.) | #Node | #Cores/Node | RAM/Node (GB) | CPU Clock (GHz) | LLC (MB)
  local2 (l2)   | 2     | 6           | 32            | 2.6             | 12
  local4 (l4)   | 4     | 10          | 64            | 2.0             | 24
  local8 (l8)   | 8     | 8           | 128           | 2.6             | 24
  ec2.1 (e1)    | 2     | 8           | 122           | 2.6             | 20
  ec2.2 (e2)    | 2     | 8           | 30            | 2.6             | 20]
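The column-wise access pattern in the bullets above can be sketched for least squares. This uses an exact coordinate update and a maintained residual, which is one standard way to keep the reads column-local; the concrete update rule is an illustrative choice, not DimmWitted's kernel:

```python
import numpy as np

def scd_least_squares(A, y, epochs=100):
    """Column-wise access: each step reads one column and writes one coordinate."""
    N, d = A.shape
    x = np.zeros(d)
    r = y - A @ x                 # maintained residual vector
    for _ in range(epochs):
        for j in range(d):        # iterate over columns
            col = A[:, j]         # read set: column j (plus the residual)
            delta = (col @ r) / (col @ col)   # exact minimizer along coordinate j
            x[j] += delta         # write set: a single model component
            r -= delta * col
    return x
```

Note the contrast with the row-wise SGD sketch earlier: each inner step here touches one column of A and one component of x.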
Figure 3: Summary of Machines and Memory Bandwidth on local2, Tested with STREAM [9].

[Figure 4: Illustration of DimmWitted's Engine. Workers read data replicas and update model replicas; an optimizer chooses the execution plan.]

Each node is directly connected to a region of DRAM. NUMA nodes are connected to each other by buses on the main board; in our case, this connection is the Intel QuickPath Interconnect (QPI), which has a bandwidth as high as 25.6 GB/s.^1 To access DRAM regions of other NUMA nodes, data is transferred across NUMA nodes using the QPI. These NUMA architectures are cache coherent, and the coherency actions use the QPI. Figure 3 describes the configuration of each machine that we use in this paper. Machines controlled by us have names with the prefix "local"; the other machines are Amazon EC2 configurations.

3. THE DIMMWITTED ENGINE

We describe the tradeoff space that DimmWitted's optimizer considers, namely (1) access method selection, (2) model replication, and (3) data replication. To help understand the statistical-versus-hardware tradeoff space, we present some experimental results in a Tradeoffs paragraph within each subsection. We describe implementation details for DimmWitted in the full version of this paper.

3.1 System Overview

We describe analytics tasks in DimmWitted and the execution model of DimmWitted given an analytics task.

System Input. For each analytics task that we study, we assume that the user provides data A ∈ R^{N×d} and an initial model that is a vector of length d. In addition, for each access method listed above, there is a function of an appropriate type that solves the same underlying model. For example, we provide both a row- and a column-wise way of solving a support vector machine. Each method takes two arguments; the first is a pointer to a model.
^1 www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html

[Figure 5: A Summary of DimmWitted's Tradeoffs and Existing Systems (GraphLab (GL), Hogwild! (HW), Spark (SP)).
  Access Methods: Row-wise (SP, HW); Column-wise (GL); Column-to-row (none)
  Model Replication: PerCore (GL, SP); PerNode (none); PerMachine (HW)
  Data Replication: Sharding (GL, SP, HW); FullReplication (none)]

• f_row captures the row-wise access method, and its second argument is the index of a single row.

• f_col captures the column-wise access method, and its second argument is the index of a single column.

• f_ctr captures the column-to-row access method, and its second argument is a pair of one column index and a set of row indexes. These rows correspond to the non-zero entries in the data matrix for a single column.^2

Each of the functions modifies the model to which it receives a pointer in place. However, in our study, f_row can modify the whole model, while f_col and f_ctr modify only a single variable of the model. We call the above tuple of functions a model specification. Note that a model specification contains either f_col or f_ctr but typically not both.

Execution. Given a model specification, our goal is to generate an execution plan. An execution plan, schematically illustrated in Figure 4, specifies three things for each CPU core in the machine: (1) a subset of the data matrix to operate on, (2) a replica of the model to update, and (3) the access method used to update the model. We call the sets of replicas of data and models locality groups, as the replicas are described physically; i.e., they correspond to regions of memory that are local to particular NUMA nodes, and one or more workers may be mapped to each locality group. The data assigned to distinct locality groups may overlap.
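A model specification of this shape might look like the following sketch for a hinge-loss SVM. The names, signatures, and step size are illustrative assumptions, not DimmWitted's actual API; also, for simplicity, this f_col recomputes all margins, whereas a true column-wise kernel would maintain auxiliary state so that it reads only component j of the model:

```python
import numpy as np

ALPHA = 0.01  # step size (assumed)

def f_row(m, i, A, y):
    """Row-wise: may write every component of the model."""
    z = A[i]
    if y[i] * (z @ m) < 1:               # margin violated: take a subgradient step
        m += ALPHA * y[i] * z

def f_col(m, j, A, y):
    """Column-wise: writes only component j of the model."""
    viol = y * (A @ m) < 1               # which examples violate the margin
    g = -np.sum(y[viol] * A[viol, j])    # subgradient w.r.t. m[j]
    m[j] -= ALPHA * g

model_spec = (f_row, f_col)              # contains f_row and one of f_col / f_ctr
```

A row-wise epoch then simply calls f_row once per row index, in any (typically randomized) order.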
We use DimmWitted's engine to explore three tradeoffs:

(1) Access methods, in which we can select either the row or the column method to access the data.

(2) Model replication, in which we choose how to create and assign replicas of the model to each worker. When a worker needs to read or write the model, it reads or writes the model replica that it is assigned.

(3) Data replication, in which we choose a subset of data tuples for each worker. The replicas may be overlapping, disjoint, or some combination.

Figure 5 summarizes the tradeoff space. In each section, we illustrate the tradeoff along two axes, namely (1) statistical efficiency, i.e., the number of epochs it takes to converge, and (2) hardware efficiency, the time each method takes to finish a single epoch.

^2 Define S(j) = {i : a_ij ≠ 0}. For a column j, the input to f_ctr is the pair (j, S(j)).

[Figure 6: Per-Epoch Execution Cost of Row- and Column-wise Access. The Write column is for a single model replica. Given a dataset A ∈ R^{N×d}, let n_i be the number of non-zero elements of row i.
  Row-wise:    Read Σ_i n_i,  Write (Dense) dN,  Write (Sparse) Σ_i n_i
  Column-wise: Read Σ_i n_i², Write d]

[Figure 7: Illustration of the Method Selection Tradeoff. (a) Number of epochs to converge for SVM1, SVM2, LP1, and LP2; these four datasets are RCV1, Reuters, Amazon, and Google, respectively. (b) Time per epoch versus the "cost ratio," defined as the ratio of costs estimated for the row-wise and column-wise methods: (1 + α) Σ_i n_i / (Σ_i n_i² + αd), where n_i is the number of non-zero elements of the i-th row of A and α is the cost ratio between writes and reads. We set α = 10 to plot this graph.]
3.2 Access Method Selection

In this section, we examine each access method: row-wise, column-wise, and column-to-row. We find that the execution time of an access method depends more on hardware efficiency than on statistical efficiency.

Tradeoffs. We consider the two quantities that we use for a simple cost model (Figure 6). Let n_i be the number of non-zeros in row i; when we store the data as sparse vectors/matrices in CSR format, the number of reads in a row-wise access method is Σ_{i=1}^{N} n_i. Since each example is likely to be written back in a dense write, we perform dN writes per epoch. Our cost model combines these two costs linearly with a factor α that accounts for writes being more expensive, on average, because of contention. The factor α is estimated at installation time by measuring on a small set of datasets. The parameter α ranges from 4 to 12 and grows with the number of sockets; e.g., for local2, α ≈ 4, and for local8, α ≈ 12. Thus, α may be larger on future machines.

Statistical Efficiency. We observe that all access methods have comparable statistical efficiency. To illustrate this, we run all methods on all of our datasets and report the number of epochs each method needs to converge to a given error relative to the optimal loss; Figure 7(a) shows the result on four datasets with 10% error. We see that the gap in the number of epochs across different methods is small (always within 50% of each other).

Hardware Efficiency. Different access methods can change the time per epoch by up to a factor of 10×, and there is a cross-over point. To see this, we run both methods on a series of synthetic datasets in which we control the number of non-zero elements per row by subsampling each row of the Music dataset (see Section 4 for more details). For each subsampled dataset, we plot the cost ratio on the x-axis and the actual running time per epoch in Figure 7(b).
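The selection rule implied by Figure 6 and the cost ratio above can be sketched as follows. This is a simplified selector; the byte-level accounting and the installation-time measurement of α are abstracted into per-element counts, and α = 10 is assumed as in Figure 7:

```python
import numpy as np

def choose_access_method(row_nnz, d, alpha=10.0, sparse_update=True):
    """Pick row- vs. column-wise access from per-epoch read/write counts.

    row_nnz: non-zero count n_i of each row; alpha: measured write/read
    cost ratio (between 4 and 12 on the machines studied; 10 assumed here).
    """
    n = np.asarray(row_nnz)
    N = len(n)
    reads_row = n.sum()                               # row-wise reads
    writes_row = n.sum() if sparse_update else d * N  # sparse vs. dense update
    cost_row = reads_row + alpha * writes_row
    cost_col = (n ** 2).sum() + alpha * d             # one write per column
    return "row-wise" if cost_row <= cost_col else "column-wise"
```

Many short sparse rows push the choice toward column-wise; dense rows with dense writes push it toward row-wise.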
We see a cross-over point in the time per epoch: when the cost ratio is small, row-wise outperforms column-wise by 6×, as the column-wise method reads more data; on the other hand, when the ratio is large, the column-wise method outperforms the row-wise method by 3×, as the column-wise method has lower write contention. We observe similar cross-over points on our other datasets.

Cost-based Optimizer. DimmWitted estimates the execution time of different access methods using the number of bytes that each method reads and writes in one epoch, as shown in Figure 6. For writes, it is slightly more complex: for models such as SVM, each gradient step in row-wise access updates only the coordinates where the input vector contains non-zero elements. We call this scenario a sparse update; otherwise, it is a dense update. DimmWitted needs to estimate the ratio of the cost of reads to writes. To do this, it runs a simple benchmark dataset. We find that, for all eight datasets, five statistical models, and five machines that we use in the experiments, the cost model is robust to this parameter: as long as writes are 4× to 100× more expensive than reads, the cost model makes the correct decision between row-wise and column-wise access.

3.3 Model Replication

In DimmWitted, we consider three model replication strategies. The first two strategies, namely PerCore and PerMachine, are similar to traditional shared-nothing and shared-memory architectures, respectively. We also consider a hybrid strategy, PerNode, designed for NUMA machines.

3.3.1 Granularity of Model Replication

The difference between the three model replication strategies is the granularity at which the model is replicated. We first describe PerCore and PerMachine and their relationship to existing systems (Figure 5).
We then describe PerNode, a simple, novel hybrid strategy that we designed to leverage the structure of NUMA machines.

PerCore. In the PerCore strategy, each core maintains a mutable state, and these states are combined to form a new version of the model (typically at the end of each epoch). This is essentially a shared-nothing architecture; it is implemented in Impala, Pivotal, and Hadoop-based frameworks. PerCore is also implemented by state-of-the-art statistical analytics frameworks such as Bismarck, Spark, and GraphLab. There are subtle variations to this approach: in Bismarck's implementation, each worker processes a partition of the data, and the models are averaged at the end of each epoch; Spark implements a minibatch-based approach in which parallel workers calculate the gradient based on examples, and then the gradients are aggregated by a single thread to update the final model; GraphLab implements an event-based approach in which each task is dynamically scheduled to satisfy the given consistency requirement. In DimmWitted, we implement PerCore in a way that is similar to Bismarck, where each worker has its own model replica and is responsible for updating it.^3

[Figure 8: Illustration of Model Replication. (a) Number of epochs to converge versus error to the optimal loss for PerCore, PerNode, and PerMachine. (b) Time per epoch for each model replication strategy.]

As we will show in the experiment section, DimmWitted's implementation is 3-100× faster than either GraphLab or Spark. Both systems have additional sources of overhead that DimmWitted does not, e.g., fault tolerance in Spark and a distributed environment in both.
We are not making an argument about the relative merits of these features in applications, only that they would obscure the tradeoffs that we study in this paper.

PerMachine. In the PerMachine strategy, there is a single model replica that all workers update during execution. PerMachine is implemented in Hogwild! and Google's Downpour. Hogwild! implements a lock-free protocol, which forces the hardware to deal with coherence. Although different writers may overwrite each other and readers may have dirty reads, Niu et al. [38] prove that Hogwild! converges.

PerNode. The PerNode strategy is a hybrid of PerCore and PerMachine. In PerNode, each NUMA node has a single model replica that is shared among all cores on that node.

Model Synchronization. Deciding how often the replicas synchronize is key to the design. In Hadoop-based and Bismarck-based models, the replicas synchronize at the end of each epoch. This is a shared-nothing approach that works well for user-defined aggregations. However, we consider finer granularities of sharing. In DimmWitted, we chose to have one thread that periodically reads the models on all other cores, averages their results, and updates each replica.

One key question for model synchronization is how frequently should the model be synchronized? Intuitively, we might expect that more frequent synchronization lowers the throughput; on the other hand, the more frequently we synchronize, the fewer iterations we might need to converge. However, in DimmWitted, we find that the optimal choice is to communicate as frequently as possible. The intuition is that the QPI has staggering bandwidth (25 GB/s) compared to the small amount of data we are shipping (megabytes).
As a result, in DimmWitted, we implement an asynchronous version of the model averaging protocol: a separate thread averages the models, with the effect of batching many writes across the cores into one write, reducing the number of stalls.

^3 We implemented MLlib's minibatch in DimmWitted. We find that the Hogwild!-like implementation always dominates the minibatch implementation. DimmWitted's column-wise implementation for PerMachine is similar to GraphLab, with the only difference being that DimmWitted does not schedule tasks in an event-driven way.

[Figure 9: Illustration of Data Replication. (a) Number of epochs to converge versus error to the optimal loss for Sharding and FullReplication. (b) Time per epoch for Sharding and FullReplication on local2, local4, and local8.]

Tradeoffs. We observe that PerNode is more hardware efficient, as it takes less time to execute an epoch than PerMachine; PerMachine, however, might need fewer epochs to converge than PerNode.

Statistical Efficiency. We observe that PerMachine usually needs fewer epochs than PerNode to converge to the same loss, and PerNode needs fewer epochs than PerCore. To illustrate this observation, Figure 8(a) shows the number of epochs that each strategy requires to converge to a given loss for SVM (RCV1). We see that PerMachine always uses the fewest epochs to converge to a given loss: intuitively, the single model replica has more information at each step, which means there is less redundant work. We observe similar phenomena when comparing PerCore and PerNode.

Hardware Efficiency. We observe that PerNode uses much less time to execute an epoch than PerMachine.
To illustrate the difference in the time that each model replication strategy takes to finish one epoch, we show in Figure 8(b) the execution time of the three strategies on SVM (RCV1). We see that PerNode is 23× faster than PerMachine and that PerCore is 1.5× faster than PerNode. PerNode takes advantage of the locality provided by the NUMA architecture. Using PMUs, we find that PerMachine incurs 11× more cross-node DRAM requests than PerNode.

Rule of Thumb. For SGD-based models, PerNode usually gives optimal results, while for SCD-based models, PerMachine does. Intuitively, this is because SGD has a denser update pattern than SCD, so PerMachine suffers in hardware efficiency.

3.4 Data Replication

In DimmWitted, each worker processes a subset of the data and then updates its model replica. To assign a subset of data to each worker, we consider two strategies.

Sharding. Sharding is a popular strategy implemented in systems such as Hogwild!, Spark, and Bismarck, in which the dataset is partitioned and each worker only works on its partition of the data. When there is a single model replica, Sharding avoids wasted computation, as each tuple is processed once per epoch. However, when there are multiple model replicas, Sharding might increase the variance of the estimate we form on each node, lowering statistical efficiency. In DimmWitted, we implement Sharding by randomly partitioning the rows (resp. columns) of the data matrix for the row-wise (resp. column-wise) access method. In column-to-row access, we also replicate the other rows that are needed.

FullReplication. A simple alternative to Sharding is FullReplication, in which we replicate the whole dataset many times (PerCore or PerNode). In PerNode, each NUMA node has a full copy of the data. Each node accesses its data in a different order, which means that the replicas provide non-redundant statistical information. Statistically, there are two benefits of FullReplication: (1) averaging different estimates from each node gives a lower variance, and (2) the estimate at each node has lower variance than in the Sharding case, as each node's estimate is based on the whole dataset. From a hardware efficiency perspective, reads are more frequently served from local NUMA memory in PerNode than in PerMachine. The PerNode approach dominates the PerCore approach, as reads from the same node go to the same NUMA memory. Thus, we do not consider PerCore replication from this point on.

Model      Dataset  #Row  #Col.  NNZ   Size (Sparse)  Size (Dense)  Sparse
SVM/LR/LS  RCV1     781K  47K    60M   914MB          275GB         ✔
           Reuters  8K    18K    93K   1.4MB          1.2GB         ✔
           Music    515K  91     46M   701MB          0.4GB
           Forest   581K  54     30M   490MB          0.2GB
LP         Amazon   926K  335K   2M    28MB           >1TB          ✔
           Google   2M    2M     3M    25MB           >1TB          ✔
QP         Amazon   1M    1M     7M    104MB          >1TB          ✔
           Google   2M    2M     10M   152MB          >1TB          ✔
Gibbs      Paleo    69M   30M    108M  2GB            >1TB          ✔
NN         MNIST    120M  800K   120M  2GB            >1TB          ✔

Figure 10: Dataset Statistics. NNZ refers to the number of non-zero elements. The number of columns is equal to the number of variables in the model.

Tradeoffs. Not surprisingly, we observe that FullReplication takes more time per epoch than Sharding. However, we also observe that FullReplication uses fewer epochs than Sharding, especially to achieve low error. We illustrate these two observations by showing the result of running SVM on Reuters using PerNode in Figure 9.

Statistical Efficiency. FullReplication uses fewer epochs, especially at low error tolerance. Figure 9(a) shows the number of epochs that each strategy takes to converge to a given loss. We see that, to get within 1% of the loss, FullReplication uses 10× fewer epochs on a two-node machine.
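The two data-assignment strategies just described can be sketched as follows (an illustrative Python sketch under our own names, not DimmWitted's implementation):

```python
import random

def shard_rows(n_rows, n_workers, seed=0):
    """Sharding: randomly partition row indices; each worker gets one part."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[w::n_workers] for w in range(n_workers)]

def replicate_rows(n_rows, n_workers, seed=0):
    """FullReplication: every worker sees all rows, each in its own order."""
    parts = []
    for w in range(n_workers):
        idx = list(range(n_rows))
        random.Random(seed + w).shuffle(idx)  # different order per worker
        parts.append(idx)
    return parts

# Sharding: each row is processed exactly once per epoch across all workers.
shards = shard_rows(10, 2)
assert sorted(shards[0] + shards[1]) == list(range(10))

# FullReplication: each worker's estimate is based on the whole dataset.
replicas = replicate_rows(10, 2)
assert all(sorted(p) == list(range(10)) for p in replicas)
```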
This is because each model replica sees more data than in Sharding, and therefore has a better estimate. Because of this difference in the number of epochs, FullReplication is 5× faster in wall-clock time than Sharding to converge to 1% loss. However, we also observe that, in high-error regions, FullReplication uses more epochs than Sharding, resulting in comparable execution time to a given loss.

Hardware Efficiency. Figure 9(b) shows the time per epoch across machines with different numbers of nodes. Because we are using the PerNode strategy, which is the optimal choice for this dataset, the more nodes a machine has, the slower FullReplication is per epoch. The slowdown is roughly proportional to the number of nodes on each machine. This is not surprising, because each epoch of FullReplication processes more data than Sharding.

4. EXPERIMENTS

We validate that exploiting the tradeoff space we have described enables DimmWitted's orders-of-magnitude speedup over state-of-the-art competitor systems. We also validate that each tradeoff discussed in this paper affects the performance of DimmWitted.

4.1 Experiment Setup

We describe the details of our experimental setting.

Datasets and Statistical Models. We validate the performance and quality of DimmWitted on a diverse set of statistical models and datasets. For statistical models, we choose five models that are among the most popular in statistical analytics: (1) Support Vector Machine (SVM), (2) Logistic Regression (LR), (3) Least Squares Regression (LS), (4) Linear Programming (LP), and (5) Quadratic Programming (QP). For each model, we choose datasets with different characteristics, including size, sparsity, and under- or over-determination. For SVM, LR, and LS, we choose four datasets: Reuters,⁴ RCV1,⁵ Music,⁶ and Forest.⁷
Reuters and RCV1 are text-classification datasets that are sparse and underdetermined. Music and Forest are standard benchmark datasets that are dense and overdetermined. For LP and QP, we consider a social-network application, i.e., network analysis, and use two datasets from Amazon's customer data and Google's Google+ social network.⁸ Figure 10 shows the dataset statistics.

Metrics. We measure the quality and performance of DimmWitted and its competitors. To measure quality, we follow prior art and use the loss function for all functions. For end-to-end performance, we measure the wall-clock time it takes each system to converge to a loss that is within 100%, 50%, 10%, and 1% of the optimal loss.⁹ When measuring wall-clock time, we do not count the time used for data loading and result output for any system. We also use other measurements to understand the details of the tradeoff space, including (1) local LLC requests, (2) remote LLC requests, and (3) local DRAM requests. We use Intel Performance Monitoring Units (PMUs) and follow the manual¹⁰ to conduct these experiments.

Experiment Setting. We compare DimmWitted with four competitor systems: GraphLab [34], GraphChi [28], MLlib [47] over Spark [55], and Hogwild! [38]. GraphLab is a distributed graph processing system that supports a large range of statistical models. GraphChi is similar to GraphLab but focuses on multi-core machines with secondary storage. MLlib is a package of machine learning algorithms implemented over Spark, an in-memory implementation of the MapReduce framework. Hogwild! is an in-memory lock-free framework for statistical analytics. We find that all four systems pick some points in the tradeoff space that we consider in DimmWitted.
In GraphLab and GraphChi, all models are implemented using stochastic coordinate descent

⁴ archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
⁵ about.reuters.com/researchandstandards/corpus/
⁶ archive.ics.uci.edu/ml/datasets/YearPredictionMSD
⁷ archive.ics.uci.edu/ml/datasets/Covertype
⁸ snap.stanford.edu/data/
⁹ We obtain the optimal loss by running all systems for one hour and choosing the lowest.
¹⁰ software.intel.com/en-us/articles/performance-monitoring-unit-guidelines

                  Within 1% of the Optimal Loss              Within 50% of the Optimal Loss
     Dataset  GraphLab GraphChi MLlib  Hogwild! DW      GraphLab GraphChi MLlib  Hogwild! DW
SVM  Reuters  58.9     56.7     15.5   0.1      0.1     13.6     11.2     0.6    0.01     0.01
     RCV1     >300.0   >300.0   >300   61.4     26.8    >300.0   >300.0   58.0   0.71     0.17
     Music    >300.0   >300.0   156    33.32    23.7    31.2     27.1     7.7    0.17     0.14
     Forest   16.2     15.8     2.70   0.23     0.01    1.9      1.4      0.15   0.03     0.01
LR   Reuters  36.3     34.2     19.2   0.1      0.1     13.2     12.5     1.2    0.03     0.03
     RCV1     >300.0   >300.0   >300.0 38.7     19.8    >300.0   >300.0   68.0   0.82     0.20
     Music    >300.0   >300.0   >300.0 35.7     28.6    30.2     28.9     8.9    0.56     0.34
     Forest   29.2     28.7     3.74   0.29     0.03    2.3      2.5      0.17   0.02     0.01
LS   Reuters  132.9    121.2    92.5   4.1      3.2     16.3     16.7     1.9    0.17     0.09
     RCV1     >300.0   >300.0   >300   27.5     10.5    >300.0   >300.0   32.0   1.30     0.40
     Music    >300.0   >300.0   221    40.1     25.8    >300.0   >300.0   11.2   0.78     0.52
     Forest   25.5     26.5     1.01   0.33     0.02    2.7      2.9      0.15   0.04     0.01
LP   Amazon   2.7      2.4      >120.0 >120.0   0.94    2.7      2.1      120.0  1.86     0.94
     Google   13.4     11.9     >120.0 >120.0   12.56   2.3      2.0      120.0  3.04     2.02
QP   Amazon   6.8      5.7      >120.0 >120.0   1.8     6.8      5.7      >120.0 >120.00  1.50
     Google   12.4     10.1     >120.0 >120.0   4.3     9.9      8.3      >120.0 >120.00  3.70

Figure 11: End-to-End Comparison (time in seconds). The column DW refers to DimmWitted.
We take 5 runs on local2 and report the average (the standard deviation is < 5% of the mean for all numbers). Entries with > indicate a timeout.

Figure 12: Tradeoffs in DimmWitted: (a) access method selection (row-wise vs. column-wise) and (b) model replication (PerCore, PerNode, PerMachine), for SVM (RCV1), SVM (Music), LP (Amazon), and LP (Google); time (seconds) vs. error to optimal loss. Missing points time out at 120 seconds.

(column-wise access); in MLlib and Hogwild!, SVM and LR are implemented using stochastic gradient descent (row-wise access). We use the implementations provided by the original developers whenever possible. For models without code provided by the developers, we change only the corresponding gradient function.¹¹ For GraphChi, if the corresponding model is implemented in GraphLab but not in GraphChi, we follow GraphLab's implementation.

We run experiments on a variety of architectures. These machines differ in a range of configurations, including the number of NUMA nodes, the size of the last-level cache (LLC), and memory bandwidth. See Figure 3 for a summary of these machines. DimmWitted, Hogwild!, GraphLab, and GraphChi are implemented in C++, and MLlib/Spark is implemented in Scala. We tune both GraphLab and MLlib according to their best-practice guidelines.¹²

¹¹ For sparse models, we change the dense vector data structure in MLlib to a sparse vector, which only improves its performance.
¹² MLlib: spark.incubator.apache.org/docs/0.6.0/tuning.html; GraphLab: graphlab.org/tutorials-2/fine-tuning-graphlab-performance/.
For GraphChi, we tune the memory buffer size to ensure that all data fit in memory and that there are no disk I/Os. We describe more detailed tuning for MLlib in the full version of this paper. For GraphLab, GraphChi, and MLlib, we try different ways of increasing locality on NUMA machines, including using numactl and implementing our own RDD for MLlib; there is more detail in the full version of this paper. Systems are compiled with g++ 4.7.2 (-O3), Java 1.7, or Scala 2.9.

4.2 End-to-End Comparison

We validate that DimmWitted outperforms competitor systems in terms of end-to-end performance and quality. Note that both MLlib and GraphLab have extra overhead for fault tolerance, distributing work, and task scheduling. Our comparison between DimmWitted and these competitors is intended only to demonstrate that existing work for statistical analytics has not obviated the tradeoffs that we study here.

Protocol. For each system, we grid search its statistical parameters, including the step size ({100.0, 10.0, ..., 0.0001}) and the mini-batch size for MLlib ({1%, 10%, 50%, 100%}); we always report the best configuration, which is essentially the same for each system. We measure the time it takes each system to find a solution that is within 1%, 10%, and 50% of the optimal loss. Figure 11 shows the results for 1% and 50%; the results for 10% are similar. We report end-to-end numbers from local2, which has two nodes and 24 logical cores, as GraphLab does not run on machines with more than 64 logical cores. Figure 14 shows DimmWitted's choice of point in the tradeoff space on local2.

As shown in Figure 11, DimmWitted always converges to the given loss in less time than the other competitors. On SVM and LR, DimmWitted can be up to 10× faster than Hogwild!, and more than two orders of magnitude faster than GraphLab and Spark. The difference between DimmWitted and Hogwild!
is greater for LP and QP, where DimmWitted outperforms Hogwild! by more than two orders of magnitude. On LP and QP, DimmWitted is also up to 3× faster than GraphLab and GraphChi, and two orders of magnitude faster than MLlib.

Tradeoff Choices. We dive more deeply into these numbers to substantiate our claim that there are points in the tradeoff space that are not used by GraphLab, GraphChi, Hogwild!, and MLlib. The tradeoff selected by each system is shown in Figure 14. For example, GraphLab and GraphChi use column-wise access for all models, while MLlib and Hogwild! use row-wise access for all models and allow only PerMachine model replication. These special points work well for some but not all models. For example, for LP and QP, GraphLab and GraphChi are only 3× slower than DimmWitted, which chooses column-wise access and PerMachine. This factor of 3 is to be expected, as GraphLab also allows distributed access and so has additional overhead. However, there are other points: for SVM and LR, DimmWitted outperforms GraphLab and GraphChi because the column-wise algorithm implemented by GraphLab and GraphChi is not as efficient as row-wise access on the same dataset. DimmWitted outperforms Hogwild! because DimmWitted takes advantage of model replication: Hogwild! incurs 11× more cross-node DRAM requests than DimmWitted, while DimmWitted incurs 11× more local DRAM requests than Hogwild! does.

For SVM, LR, and LS, we find that DimmWitted outperforms MLlib, primarily due to a different point in the tradeoff space. In particular, MLlib uses batch gradient descent with a PerCore implementation, while DimmWitted uses stochastic gradient descent with PerNode. We find that, for the Forest dataset, DimmWitted takes 60× fewer epochs to converge to 1% loss than MLlib. For each epoch, DimmWitted is 4× faster.
These two factors contribute to the 240× speedup of DimmWitted over MLlib on the Forest dataset (1% loss). MLlib has overhead for scheduling, so we break down the time that MLlib uses for scheduling and computation. We find that, for Forest, out of the total 2.7 seconds of execution, MLlib uses 1.8 seconds for computation and 0.9 seconds for scheduling. We also implemented a batch-gradient-descent, PerCore implementation inside DimmWitted to remove these and C++-versus-Scala differences. The 60× difference in the number of epochs until convergence still holds, and our implementation is only 3× faster than MLlib. This implies that the main difference between DimmWitted and MLlib is the point in the tradeoff space, not low-level implementation differences.

For LP and QP, DimmWitted outperforms MLlib and Hogwild! because the row-wise access method implemented by these systems is not as efficient as column-wise access on the same dataset. GraphLab does have column-wise access, so DimmWitted outperforms GraphLab and GraphChi because DimmWitted finishes each epoch up to 3× faster, primarily due to low-level issues.

            SVM (RCV1)  LR (RCV1)  LS (RCV1)  LP (Google)  QP (Google)  Parallel Sum
GraphLab    0.2         0.2        0.2        0.2          0.1          0.9
GraphChi    0.3         0.3        0.2        0.2          0.2          1.0
MLlib       0.2         0.2        0.2        0.1          0.02         0.3
Hogwild!    1.3         1.4        1.3        0.3          0.2          13
DimmWitted  5.1         5.2        5.2        0.7          1.3          21

Figure 13: Comparison of Throughput (GB/second) of Different Systems on local2.

Models     Datasets               Access Method  Model Replication  Data Replication
SVM/LR/LS  Reuters, RCV1, Music   Row-wise       PerNode            FullReplication
LP/QP      Amazon, Google         Column-wise    PerMachine         FullReplication

Figure 14: Plans that DimmWitted Chooses in the Tradeoff Space for Each Dataset on Machine local2.
This supports our claims that the tradeoff space is interesting for analytics engines and that no one system has implemented all of the points in it.

Throughput. We compare the throughput of the different systems on an extremely simple task: parallel sum. Our implementation of parallel sum follows our implementation of the other statistical models (with a trivial update function) and uses all cores on a single machine. Figure 13 shows the throughput of all systems on different models and one dataset. We see from Figure 13 that DimmWitted achieves the highest throughput of all the systems. For parallel sum, DimmWitted is 1.6× faster than Hogwild!, and we find that DimmWitted incurs 8× fewer LLC cache misses than Hogwild!. Compared with Hogwild!, in which all threads write to a single copy of the sum, DimmWitted maintains one copy of the sum per NUMA node, so the workers on one NUMA node do not invalidate the cache on another NUMA node. When running on only a single thread, DimmWitted has the same implementation as Hogwild!. Compared with GraphLab and GraphChi, DimmWitted is 20× faster, likely due to the overhead of GraphLab and GraphChi dynamically scheduling tasks and/or maintaining the graph structure. To compare DimmWitted with MLlib, which is written in Scala, we implemented a Scala version of DimmWitted, which is 3× slower than the C++ version; this suggests that the overhead is not just due to the language. If we do not count the time that MLlib uses for scheduling and count only the computation time, we find that DimmWitted is 15× faster than MLlib.

4.3 Tradeoffs of DimmWitted

We validate that all the tradeoffs described in this paper have an impact on the efficiency of DimmWitted. We report on a more modern architecture, local4 with four NUMA sockets, in this section, and describe how the results change on different architectures.
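The per-node accumulator idea behind the parallel-sum comparison above can be sketched as follows. This is a sequential Python caricature of the PerNode layout (our own illustration, not DimmWitted's code, which runs one C++ worker per core):

```python
# Sketch of PerNode applied to parallel sum: each NUMA node keeps its
# own partial sum, so workers on one node never write to, and thus
# never invalidate, a cache line owned by another node. A cheap final
# reduction combines the partials. Here the "nodes" run sequentially.

def pernode_sum(data, n_nodes):
    partials = [0.0] * n_nodes            # one accumulator per node
    for node in range(n_nodes):
        for x in data[node::n_nodes]:     # this node's share of the data
            partials[node] += x           # node-local write only
    return sum(partials)                  # final cross-node reduction

print(pernode_sum(list(range(100)), 4))  # -> 4950.0
```

With a single shared accumulator (the Hogwild!-style layout), every addition would instead bounce the accumulator's cache line between nodes.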
4.3.1 Access Method Selection

We validate that different access methods have different performance and that no single access method dominates the others. We run DimmWitted on all statistical models and compare two strategies: row-wise and column-wise.

Figure 15: Ratio of Execution Time per Epoch (row-wise/column-wise) for SVM (RCV1) and LP (Amazon) on Different Architectures (#cores/socket × #sockets). A number larger than 1 means that row-wise is slower. l2 means local2, e1 means ec2.1, etc.

In each experiment, we force DimmWitted to use the corresponding access method but report the best point for the other tradeoffs. Figure 12(a) shows the results as we measure the time it takes to achieve each loss. The more stringent loss requirements (1%) are on the left-hand side. The horizontal line segments in the graph indicate that a model may reach, say, 50% loss as quickly (in epochs) as it reaches 100%.

We see from Figure 12(a) that the difference between row-wise and column-to-row access can be more than 100× for different models. For SVM on RCV1, row-wise access converges at least 4× faster to 10% loss and at least 10× faster to 100% loss. We observe similar phenomena for Music; compared with RCV1, column-to-row access converges to 50% loss and 100% loss at a 10× slower rate. With such datasets, column-to-row access simply requires more reads and writes. This supports the folk wisdom that gradient methods are preferable to coordinate descent methods. On the other hand, for LP, column-wise access dominates: row-wise access does not converge to 1% loss within the timeout period for either Amazon or Google.
Column-wise access converges at least 10-100× faster than row-wise access to 1% loss. We observe that LR is similar to SVM and that QP is similar to LP. Thus, no access method dominates all the others.

The costs of writing and reading are different, and this difference is captured by a parameter that we call α in Section 3.2. We describe the impact of this factor on the relative performance of the row- and column-wise strategies. Figure 15 shows the ratio of the time that each strategy uses (row-wise/column-wise) for SVM (RCV1) and LP (Amazon). We see that, as the number of sockets on a machine increases, the ratio of execution times increases, which means that row-wise becomes slower relative to column-wise, i.e., α increases. As the write cost captures the cost of a hardware-resolved conflict, this constant is likely to grow. Thus, if next-generation architectures increase the number of sockets, the cost parameter α, and consequently the importance of this tradeoff, are likely to grow.

Cost-based Optimizer. We observe that, for all datasets, our cost-based optimizer selects row-wise access for SVM, LR, and LS, and column-wise access for LP and QP. These choices are consistent with what we observed in Figure 12.

4.3.2 Model Replication

We validate that no single model replication strategy dominates the others. We force DimmWitted to run each of PerMachine, PerNode, and PerCore, and choose the other tradeoffs by picking the plan that achieves the best result. Figure 12(b) shows the results.

We see from Figure 12(b) that the gap between PerMachine and PerNode can be up to 100×. We first observe that PerNode dominates PerCore on all datasets. For SVM on RCV1, PerNode converges 10× faster than PerCore to 50% loss, and for the other models and datasets we observe a similar phenomenon.
This is due to the low statistical efficiency of PerCore, as discussed in Section 3.3. Although PerCore eliminates write contention within a NUMA node, this write contention is less critical. For large models on machines with small caches, we also observe that PerCore can spill the cache.

These graphs show that neither PerMachine nor PerNode dominates the other across all datasets and statistical models. For SVM on RCV1, PerNode converges 12× faster than PerMachine to 50% loss. However, for LP on Amazon, PerMachine is at least 14× faster than PerNode to converge to 1% loss. For SVM, PerNode converges faster because it has 5× higher throughput than PerMachine; for LP, PerNode is slower because PerMachine takes at least 10× fewer epochs to converge to a small loss. One interesting observation is that, for LP on Amazon, PerMachine and PerNode have comparable performance in converging to 10% loss. Compared with the 1% loss case, this implies that PerNode's statistical efficiency decreases as the algorithm tries to achieve a smaller loss. This is not surprising, as one must reconcile the PerNode estimates.

We observe that the relative performance of PerMachine and PerNode depends on (1) the number of sockets used on each machine and (2) the sparsity of the update. To validate (1), we measure the time that PerNode and PerMachine take on SVM (RCV1) to converge to 50% loss on various architectures, and we report the ratio (PerMachine/PerNode) in Figure 16. We see that PerNode's relative performance improves with the number of sockets. We attribute this to the increased cost of write contention in PerMachine. To validate (2), we generate a series of synthetic datasets, each of which subsamples the elements in each row of the Music dataset; Figure 16(b) shows the results.
When the sparsity is 1%, PerMachine outperforms PerNode, as each update touches only one element of the model; thus, write contention in PerMachine is not a bottleneck. As the sparsity increases (i.e., as the update becomes denser), we observe that PerNode outperforms PerMachine.

4.3.3 Data Replication

We validate the impact of different data replication strategies. We run DimmWitted by fixing the data replication strategy to FullReplication or Sharding and choosing the best plan for each other tradeoff. We measure the execution time for each strategy to converge to a given loss for SVM on the same dataset, RCV1. We report the ratio of the two strategies as FullReplication/Sharding in Figure 17(a). We see that, in the low-error region (e.g., 0.1%), FullReplication is 1.8-2.5× faster than Sharding. This is because FullReplication decreases the skew of the data assignment to each worker, and hence each individual model replica can form a more accurate estimate. In the high-error region (e.g., 100%), we

Figure 16: The Impact of (a) Different Architectures (#cores/socket × #sockets) and (b) Sparsity (of synthetic datasets derived from Music) on Model Replication. A ratio (PerMachine/PerNode) larger than 1 means that PerNode converges faster than PerMachine to 50% loss.

Figure 17: (a) Tradeoffs of Data Replication (ratio of execution time, FullReplication/Sharding). A ratio smaller than 1 means that FullReplication is faster.
(b) Performance of Gibbs Sampling and Neural Networks Implemented in DimmWitted (million variables per second), compared with the classic choice.

observe that FullReplication appears to be 2-5× slower than Sharding. We find that, at 100% loss, both FullReplication and Sharding converge in a single epoch, and Sharding may therefore be preferred, as it examines less data to complete that single epoch. In all of our experiments, FullReplication is never substantially worse and can be dramatically better. Thus, if memory is available, FullReplication seems preferable.

5. EXTENSIONS

We briefly describe how to run Gibbs sampling (which uses a column-to-row access method) and deep neural networks (which use a row access method). Using the same tradeoffs, we achieve a significant speedup over the classical implementation choices for these algorithms. A more detailed description is in the full version of this paper.

5.1 Gibbs Sampling

Gibbs sampling is one of the most popular algorithms for statistical inference and learning over probabilistic graphical models [43]. We briefly describe Gibbs sampling over factor graphs and observe that its main step is a column-to-row access. A factor graph can be thought of as a bipartite graph over a set of variables and a set of factors. The main operation in Gibbs sampling is to select a single variable and calculate its conditional probability, which requires fetching all factors that contain this variable and the assignments of all variables connected to those factors. This operation corresponds to the column-to-row access method. Similar to first-order methods, a Hogwild! algorithm for Gibbs sampling was recently established [25]. As shown in Figure 17(b), applying DimmWitted's techniques to Gibbs sampling achieves 4× the sampling throughput of the PerMachine strategy.
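The column-to-row access at the core of Gibbs sampling can be sketched on a toy factor graph. The graph, the factor function, and all names below are our own illustration, not the system's code:

```python
import random

# Toy factor graph: resampling variable v requires fetching the factors
# that contain v (the "column") and then the current assignments of the
# variables in those factors (the "rows"): a column-to-row access.
var_to_factors = {0: [0], 1: [0, 1], 2: [1]}   # column: variable -> factors
factor_to_vars = {0: [0, 1], 1: [1, 2]}        # row: factor -> variables
assignment = {0: 1, 1: 0, 2: 1}                # current binary assignment

def factor_weight(f, assign):
    # Toy potential: a factor rewards agreement among its variables.
    values = {assign[v] for v in factor_to_vars[f]}
    return 1.0 if len(values) == 1 else 0.5

def gibbs_step(v, rng):
    """Resample variable v from its conditional distribution."""
    weights = []
    for value in (0, 1):
        trial = dict(assignment)
        trial[v] = value
        w = 1.0
        for f in var_to_factors[v]:            # the column-to-row access
            w *= factor_weight(f, trial)
        weights.append(w)
    assignment[v] = 0 if rng.random() * sum(weights) < weights[0] else 1

rng = random.Random(0)
for _ in range(20):
    gibbs_step(rng.randrange(3), rng)
```

The two dictionary lookups inside `gibbs_step` are exactly the column access (variable to factors) followed by the row access (factor to variables) described above.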
5.2 Deep Neural Networks

Neural networks are one of the most classic machine learning models [35]; recently, these models have been intensively revisited by adding more layers [19, 29]. A deep neural network contains multiple layers, each of which contains a set of neurons (variables). Neurons connect with each other only via links across consecutive layers. The value of one neuron is a function of all neurons in the previous layer and a set of weights. Variables in the last layer have human labels as training data; the goal of deep neural network learning is to find the set of weights that maximizes the likelihood of the human labels. Back-propagation with stochastic gradient descent is the de facto method for optimizing a deep neural network.

Following LeCun et al. [30], we implement SGD over a seven-layer neural network with 0.12 billion neurons and 0.8 million parameters, using a standard handwriting-recognition benchmark dataset called MNIST.¹³ Figure 17(b) shows the number of variables processed by DimmWitted per second. For this application, DimmWitted uses PerNode and FullReplication, while the classical choice made by LeCun is PerMachine and Sharding. As shown in Figure 17(b), DimmWitted achieves more than an order of magnitude higher throughput than this classical baseline (while achieving the same quality as reported in the classical paper).

6. RELATED WORK

We review work in four main areas: statistical analytics, data mining algorithms, shared-memory multiprocessor optimization, and main-memory databases. We include more extensive related work in the full version of this paper.

Statistical Analytics. There is a trend to integrate statistical analytics into data processing systems. Database vendors have recently put out new products in this space, including Oracle, Pivotal's MADlib [23], IBM's SystemML [21], and SAP's HANA.
These systems support statistical analytics in existing data management systems. A key challenge for statistical analytics is performance.

A handful of data processing frameworks have been developed in the last few years to support statistical analytics, including Mahout for Hadoop, MLI for Spark [47], GraphLab [34], and MADlib for PostgreSQL or Greenplum [23]. Although these systems significantly increase the performance of the corresponding statistical analytics tasks, we observe that each of them implements one point in DimmWitted's tradeoff space. DimmWitted is not a system; our goal is to study this tradeoff space.

Data Mining Algorithms. There is a large body of data mining literature on optimizing various algorithms to be more architecturally aware [39, 56, 57]. Zaki et al. [39, 57] study the performance of a range of algorithms, including association rule mining and decision trees, on shared-memory machines by improving memory locality and data placement at the granularity of cachelines and by decreasing the cost of coherence maintenance between multiple CPU caches. Ghoting et al. [20] optimize the cache behavior of frequent pattern mining using novel cache-conscious techniques, including spatial and temporal locality, prefetching, and tiling. Jin et al. [24] discuss tradeoffs in replication and locking schemes for K-means, association rule mining, and neural nets. This work considers the hardware efficiency of the algorithms, but not statistical efficiency, which is the focus of DimmWitted. In addition, Jin et al. do not consider lock-free execution, a key aspect of this paper.

¹³ yann.lecun.com/exdb/mnist/

Shared-memory Multiprocessor Optimization. Performance optimization on shared-memory multiprocessor machines is a classical topic.
Anderson and Lam's [4] and Carr et al.'s [14] seminal work used compiler techniques to improve locality on shared-memory multiprocessor machines. DimmWitted's locality group is inspired by Anderson and Lam's discussion of computation decomposition and data decomposition. These locality groups are the centerpiece of the Legion project [6]. In recent years, there have been a variety of domain-specific languages (DSLs) to help the user extract parallelism; two examples of these DSLs are Galois [36, 37] and OptiML [49] for Delite [15]. Our goals are orthogonal: these DSLs require knowledge about the trade-offs of the hardware, such as those provided by our study.

Main-memory Databases. The database community has recognized that multi-socket, large-memory machines have changed the data processing landscape, and there has been a flurry of recent work about how to build in-memory analytics systems [3, 5, 16, 27, 31, 40, 41, 52]. Classical tradeoffs have been revisited on modern architectures to gain significant improvements: Balkesen et al. [5], Albutiu et al. [3], Kim et al. [27], and Li et al. [31] study the tradeoffs for joins and shuffling, respectively. This work takes advantage of modern architectures, e.g., NUMA and SIMD, to increase memory bandwidth. We study a new tradeoff space for statistical analytics in which the performance of the system is affected by both hardware efficiency and statistical efficiency.

7. CONCLUSION

For statistical analytics on main-memory, NUMA-aware machines, we studied tradeoffs in access methods, model replication, and data replication. We found that using novel points in this tradeoff space can have a substantial benefit: our DimmWitted prototype engine can run at least one popular task at least 100× faster than other competitor systems.
This comparison demonstrates that this tradeoff space may be interesting for current and next-generation statistical analytics systems.

Acknowledgments

We would like to thank Arun Kumar, Victor Bittorf, the Delite team, the Advanced Analytics team at Oracle, Greenplum/Pivotal, and Impala's Cloudera team for sharing their experiences in building analytics systems. We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) XDATA Program under No. FA8750-12-2-0335 and the DEFT Program under No. FA8750-13-2-0039, the National Science Foundation (NSF) CAREER Award under No. IIS-1353606, the Office of Naval Research (ONR) under awards No. N000141210041 and No. N000141310129, the Sloan Research Fellowship, American Family Insurance, Google, and Toshiba. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, NSF, ONR, or the US government.

8. REFERENCES

[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. ArXiv e-prints, 2011.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In VLDB, 2001.
[3] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, pages 1064–1075, 2012.
[4] J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In PLDI, pages 112–125, 1993.
[5] C. Balkesen et al. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, pages 85–96, 2013.
[6] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: expressing locality and independence with logical regions. In SC, page 66, 2012.
[7] N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. Technical report, NVIDIA Corporation, 2008.
[8] N. Bell and M.
Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC, pages 18:1–18:11, 2009.
[9] L. Bergstrom. Measuring NUMA effects with the STREAM benchmark. ArXiv e-prints, 2011.
[10] C. Boutsidis et al. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 2013.
[11] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, pages 321–328, 2011.
[12] G. Buehrer et al. Toward terabyte pattern mining: An architecture-conscious solution. In PPoPP, pages 2–12, 2007.
[13] G. Buehrer, S. Parthasarathy, and Y.-K. Chen. Adaptive parallel graph mining for CMP architectures. In ICDM, pages 97–106, 2006.
[14] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In ASPLOS, 1994.
[15] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. In PPoPP, pages 35–46, 2011.
[16] C. Chasseur and J. M. Patel. Design and evaluation of storage organizations for read-optimized main memory databases. PVLDB, pages 1474–1485, 2013.
[17] C. T. Chu et al. Map-reduce for machine learning on multicore. In NIPS, pages 281–288, 2006.
[18] E. F. D'Azevedo, M. R. Fahey, and R. T. Mills. Vectorized sparse matrix multiply for compressed row storage format. In ICCS, pages 99–106, 2005.
[19] J. Dean et al. Large scale distributed deep networks. In NIPS, pages 1232–1240, 2012.
[20] A. Ghoting et al. Cache-conscious frequent pattern mining on modern and emerging processors. VLDBJ, 2007.
[21] A. Ghoting et al. SystemML: Declarative machine learning on MapReduce. In ICDE, pages 231–242, 2011.
[22] Y. He et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In ICDE, pages 1199–1208, 2011.
[23] J. M. Hellerstein et al.
The MADlib analytics library: Or MAD skills, the SQL. PVLDB, pages 1700–1711, 2012.
[24] R. Jin, G. Yang, and G. Agrawal. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. TKDE, 2005.
[25] M. J. Johnson, J. Saunderson, and A. S. Willsky. Analyzing Hogwild parallel Gaussian Gibbs sampling. In NIPS, 2013.
[26] M.-Y. Kan and H. O. N. Thi. Fast webpage classification using URL features. In CIKM, pages 325–326, 2005.
[27] C. Kim et al. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. PVLDB, 2009.
[28] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, pages 31–46, 2012.
[29] Q. V. Le et al. Building high-level features using large scale unsupervised learning. In ICML, pages 8595–8598, 2012.
[30] Y. LeCun et al. Gradient-based learning applied to document recognition. IEEE, pages 2278–2324, 1998.
[31] Y. Li et al. NUMA-aware algorithms: the case of data shuffling. In CIDR, 2013.
[32] J. Liu et al. An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014.
[33] Y. Low et al. GraphLab: A new framework for parallel machine learning. In UAI, pages 340–349, 2010.
[34] Y. Low et al. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, pages 716–727, 2012.
[35] T. M. Mitchell. Machine Learning. McGraw-Hill, USA, 1997.
[36] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013.
[37] D. Nguyen, A. Lenharth, and K. Pingali. Deterministic Galois: On-demand, portable and parameterless. In ASPLOS, 2014.
[38] F. Niu et al. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
[39] S. Parthasarathy, M. J. Zaki, M. Ogihara, and W. Li. Parallel data mining for association rules on shared memory systems. Knowl. Inf. Syst.
, pages 1–29, 2001.
[40] L. Qiao et al. Main-memory scan sharing for multi-core CPUs. PVLDB, pages 610–621, 2008.
[41] V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, pages 1080–1091, 2013.
[42] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. ArXiv e-prints, 2012.
[43] C. P. Robert and G. Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer, USA, 2005.
[44] A. Silberschatz, J. L. Peterson, and P. B. Galvin. Operating System Concepts (3rd Ed.). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1991.
[45] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. PVLDB, pages 703–710, 2010.
[46] S. Sonnenburg et al. The SHOGUN machine learning toolbox. J. Mach. Learn. Res., pages 1799–1802, 2010.
[47] E. Sparks et al. MLI: An API for distributed machine learning. In ICDM, pages 1187–1192, 2013.
[48] S. Sridhar et al. An approximate, efficient LP solver for LP rounding. In NIPS, pages 2895–2903, 2013.
[49] A. K. Sujeeth et al. OptiML: An implicitly parallel domain-specific language for machine learning. In ICML, pages 609–616, 2011.
[50] S. Tatikonda and S. Parthasarathy. Mining tree-structured data on multicore systems. PVLDB, pages 694–705, 2009.
[51] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, pages 803–812, 1986.
[52] S. Tu et al. Speedy transactions in multicore in-memory databases. In SOSP, pages 18–32, 2013.
[53] S. Williams et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC, pages 38:1–38:12, 2007.
[54] X. Yang, S. Parthasarathy, and P. Sadayappan. Fast sparse matrix-vector multiplication on GPUs: Implications for graph mining. PVLDB, pages 231–242, 2011.
[55] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
[56] M. Zaki, C.-T. Ho, and R. Agrawal. Parallel classification for data mining on shared-memory multiprocessors. In ICDE, pages 198–205, 1999.
[57] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In KDD, pages 283–286, 1997.
[58] M. Zinkevich et al. Parallelized stochastic gradient descent. In NIPS, pages 2595–2603, 2010.

APPENDIX

A. IMPLEMENTATION DETAILS

In DimmWitted, we implement optimizations that are part of scientific computation and analytics systems. While these optimizations are not new, they are not universally implemented in analytics systems. We briefly describe each optimization and its impact.

Data and Worker Collocation. We observe that different strategies for locating data and workers affect the performance of DimmWitted. One standard technique is to collocate the worker and the data on the same NUMA node. In this way, the worker on each node pulls data from its own DRAM region and does not need to occupy the node-DRAM bandwidth of other nodes. In DimmWitted, we tried two different placement strategies for data and workers. The first protocol, called OS, relies on the operating system to allocate data and threads for workers. The operating system will usually locate data on one single NUMA node, and worker threads on different NUMA nodes, using heuristics that are not exposed to the user. The second protocol, called NUMA, evenly distributes worker threads across NUMA nodes and, for each worker, replicates the data on the same NUMA node. We find that for SVM on RCV1, the strategy NUMA can be up to 2× faster than OS. Here are two reasons for this improvement.
First, by locating data on the same NUMA node as its workers, we achieve a 1.24× improvement in the throughput of reading data. Second, by not asking the operating system to allocate workers, we actually obtain a more balanced allocation of workers across NUMA nodes.

Dense and Sparse. For statistical analytics workloads, it is not uncommon for the data matrix A to be sparse, especially for applications such as information extraction and text mining. In DimmWitted, we implement two protocols, Dense and Sparse, which store the data matrix A as a dense or sparse matrix, respectively. A Dense storage format has two advantages: (1) when storing a fully dense vector, it requires half the space of a sparse representation, and (2) Dense is able to leverage hardware SIMD instructions, which allow multiple floating-point operations to be performed in parallel. A Sparse storage format can use a BLAS-style scatter-gather to incorporate SIMD, which can improve cache performance and memory throughput; this approach has the additional overhead of the gather operation. We find that on a synthetic dataset in which we vary the sparsity from 0.01 to 1.0, Dense can be up to 2× faster than Sparse (for sparsity = 1.0), while Sparse can be up to 4× faster than Dense (for sparsity = 0.01).

The dense vs. sparse tradeoff might change on newer CPUs with the VGATHERDPD instruction, designed specifically to speed up the gather operation. However, our current machines do not support this instruction, and how to optimize sparse and dense computation kernels is orthogonal to the main goals of this paper.

Row-major and Column-major Storage. There are two well-studied strategies to store a data matrix A: Row-major and Column-major storage. Not surprisingly, we observed that choosing an incorrect data storage strategy can cause a large slowdown.
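As a minimal sketch of the Dense/Sparse distinction above (illustrative Python, not DimmWitted's C++ kernels), the same inner product a·x can be computed over a row stored densely or as (index, value) pairs:

```python
# Sketch only: a dense row scans every entry (contiguous, SIMD-friendly),
# while a sparse row touches only non-zeros at the cost of a gather per entry.

def dot_dense(row, x):
    # Dense format: one multiply-add per column, zeros included.
    return sum(a * xi for a, xi in zip(row, x))

def dot_sparse(indices, values, x):
    # Sparse format: iterate non-zeros, gathering x[i] for each stored index.
    return sum(v * x[i] for i, v in zip(indices, values))

dense_row = [0.0, 2.0, 0.0, 0.5]
x = [1.0, 1.0, 1.0, 2.0]
indices, values = [1, 3], [2.0, 0.5]   # the same row in sparse form

# Both representations yield the same result: 2.0*1.0 + 0.5*2.0 = 3.0.
assert dot_dense(dense_row, x) == dot_sparse(indices, values, x) == 3.0
```

The work of `dot_dense` is proportional to the number of columns, while `dot_sparse` is proportional to the number of non-zeros, which mirrors the sparsity-dependent crossover reported above.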
We conduct a simple experiment in which we multiply a matrix and a vector using the row-access method, where the matrix is stored in column- and row-major order. We find that Column-major can incur 9× more L1 data load misses than Row-major, for two reasons: (1) our architectures fetch four doubles in a cacheline, only one of which is useful for the current operation; the prefetcher in Intel machines does not prefetch across page boundaries, and so it is unable to pick up significant portions of the strided access; (2) on the first access, the Data Cache Unit (DCU) prefetcher also fetches the next cacheline, compounding the problem, and so it runs 8× slower (see the Intel 64 and IA-32 Architectures Optimization Reference Manual: www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf). Therefore, DimmWitted always stores the dataset in a way that is consistent with the access method, no matter how the input data is stored.

B. EXTENDED RELATED WORK

We extend the discussion of related work. We summarize in Figure 18 a range of related data mining work. A key difference is that DimmWitted considers both hardware efficiency and statistical efficiency for statistical analytics solved by first-order methods.

Data Mining Algorithms. Probably the most related work is by Jin et al. [24], who consider how to take advantage of replication and different locking-based schemes, with different caching behavior and locking granularity, to increase the performance (hardware efficiency) of a range of data mining tasks including K-means, frequent pattern mining, and neural networks. Ghoting et al. [20] optimize the cache behavior of frequent pattern mining using novel cache-conscious techniques, including spatial and temporal locality, prefetching, and tiling. Tatikonda et al. [50] consider improving the performance of mining tree-structured data on multicore systems by improving spatial and temporal locality, through a careful study of different granularities and types of task and data chunking.
Chu et al. [17] apply MapReduce to a large range of statistical analytics tasks that fit into the statistical query model, implement it on a multicore system, and show almost linear speed-up in the number of cores. Zaki et al. [56] study how to speed up classification tasks using decision trees on SMP machines; their technique takes advantage of data parallelism and task parallelism with locking. Buehrer and Parthasarathy et al. [13] study how to build a distributed system for frequent pattern mining with terabytes of data. Their focus is to minimize the I/O cost and communication cost by optimizing the data placement and the number of passes over the dataset. Buehrer et al. [12] study implementing efficient graph mining algorithms over CMP and SMP machines, with a focus on load balance, memory usage (i.e., size), spatial locality, and the tradeoff between pre-computing and re-computing. Zaki et al. [39, 57] study how to implement parallel association rule mining algorithms on shared-memory systems by optimizing memory reference locality and data placement at the granularity of cachelines. This work also considers how to minimize the cost of coherence maintenance between multiple CPU caches. All of these techniques are related and relevant to our work, but none consider optimizing first-order methods and the effect of these optimizations on their efficiency.

High Performance Computation. The techniques that we considered in DimmWitted for efficient implementation (Section A) are not new; they are borrowed from a wide range of literature on high performance computation, databases, and systems. Locality is a classical technique: the worker and data collocation technique has been advocated since at least the 90s [4, 14] and is a common systems design principle [44].
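The worker-and-data collocation idea can be sketched abstractly as follows (hypothetical names; a real implementation would use OS/NUMA facilities such as numactl or libnuma rather than Python dictionaries):

```python
# Sketch of the "NUMA" placement protocol from Appendix A: spread workers
# evenly across nodes and give each node a local replica of the data, so
# every worker reads from its own node's DRAM region.

def numa_placement(n_workers, n_nodes, data):
    workers = {w: w % n_nodes for w in range(n_workers)}      # worker -> node
    replicas = {node: list(data) for node in range(n_nodes)}  # node -> local copy
    return workers, replicas

workers, replicas = numa_placement(n_workers=8, n_nodes=2, data=[1, 2, 3])
assert sorted(set(workers.values())) == [0, 1]   # workers on both nodes
assert replicas[0] == replicas[1] == [1, 2, 3]   # one replica per node
```

The "OS" protocol described in Appendix A would instead leave both mappings to operating-system heuristics, which tend to place all data on a single node.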
The role of dense and sparse computation is well studied by the HPC community; for example, there are efficient computation kernels for matrix-vector and matrix-matrix multiplication [7, 8, 18, 53]. In this work, we only require dense-dense and dense-sparse matrix-vector multiplies. There is recent work on mapping sparse-sparse multiplies to GPUs and SIMD [54], which is useful for data mining models beyond what we consider here. Row- vs. column-storage has been intensively studied by the database community, both for traditional relational databases [2] and for Hadoop [22]. DimmWitted implements these techniques to make sure our study of hardware efficiency and statistical efficiency reflects the status of modern hardware, and we hope that future developments on these topics can be applied to DimmWitted.

Domain Specific Languages. Domain-specific languages (DSLs) are intended to make it easy for a user to write parallel programs by exposing domain-specific patterns. Examples of such DSLs include Galois [36, 37] and OptiML [49] for Delite [15]. To be effective, DSLs require knowledge about the trade-offs of the target domain to apply their compilation optimizations, and we hope the insights from DimmWitted can be applied to these DSLs.

Mathematical Optimization. Many statistical analytics tasks are mathematical optimization problems. Recently, the mathematical optimization community has been looking at how to parallelize optimization problems [32, 38, 58]; for example, Niu et al. [38] for SGD and Shotgun [11] for SCD. A lock-free asynchronous variant was recently established by Liu et al. [32].

C. ADDITIONAL EXPERIMENTS

C.1 More Detailed Tuning Information for Spark

We report details of how we tuned our Spark installation for a fair comparison. Figure 19 shows the list of parameters that we used to tune Spark.
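The grid search over such tuning parameters can be sketched as follows (hypothetical harness; `time_to_converge` stands in for launching and timing a real Spark run):

```python
from itertools import product

# Sketch of a step-size/batch-size sweep: try every combination and keep
# the fastest-converging one. The values mirror the ranges in Figure 19.

STEP_SIZES  = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001]
BATCH_SIZES = [1.0, 0.5, 0.1, 0.01]             # fraction of the data

def time_to_converge(step, batch):
    # Placeholder for launching a training run and timing convergence;
    # this toy stand-in is minimized at step=0.1, batch=0.01.
    return abs(step - 0.1) + batch

# Try all 7 x 4 = 28 combinations and keep the best.
best = min(product(STEP_SIZES, BATCH_SIZES),
           key=lambda p: time_to_converge(*p))

assert len(list(product(STEP_SIZES, BATCH_SIZES))) == 28
assert best == (0.1, 0.01)
```

As the text below reports, the choice along this grid can change convergence time by more than 100×, which is why the full sweep is worth its cost in a fair comparison.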
For each combination of the parameters, we run one experiment measuring throughput using parallel sum, and we use the best combination for all other experiments to maximize performance. For each task, we try all combinations of step size and batch size.

Statistical Efficiency: Step Size and Batch Size. We observe that the step size and batch size of the gradient together have a significant impact on the time that Spark needs to converge. As shown in Figure 19, for each experiment we try 28 different combinations of these settings (7 step sizes and 4 batch sizes). We see that these parameters can contribute more than a 100× difference in the time to converge to the same loss on the same dataset! Therefore, as shown in Figure 19, we tried a large range of these two parameters and picked the best one to report.

[Figure 18: A Taxonomy of Related Work. Table flattened during extraction: for each system (Jin et al. [24], Ghoting et al. [20], Tatikonda et al. [50], Chu et al. [17], Zaki et al. [56], Buehrer et al. [13], Buehrer et al. [12], Zaki et al. [39, 57], Tsitsiklis et al. [51], Niu et al. [38], Bradley et al. [11], GraphChi [28], GraphLab [33, 34], MLlib [47], and DimmWitted), the table marks its target architecture (multicore, NUMA/SMP, distributed), target application (data mining, graph mining, gradient-based), and target efficiency (hardware, statistical).]

Figure 19: The Set of Parameters We Tried for Tuning Spark

  Statistical Efficiency:
    Step size: 100, 10, 1, 0.1, 0.01, 0.001, 0.0001
    Batch size: 100%, 50%, 10%, 1%
  Hardware Efficiency:
    Data Replication: 1, 2, 3
    Serialization: True, False
    Storage Level: MEMORY_ONLY
    Compression: True, False
    locality.wait: 1, 100, 1000, 3000, 10000
    SPARK_MEM: 48g, 24g, 1g
    numactl: localalloc, interleave, NA

Sources of Overhead in Spark. Spark has overhead in scheduling tasks and providing fault tolerance, both of which are features that DimmWitted does not support.
To make our comparison as fair as possible, we conduct the following experiments to understand how scheduling and fault tolerance impact our claims.

We implement our own version of the batch-gradient descent algorithm in DimmWitted by strictly following MLlib's algorithm in C++. On Forest, we first observe that our own batch-gradient implementation uses a similar number of epochs (within 5%) to converge to 1% loss as MLlib, given the same step size and batch size. Second, for each epoch, our batch-gradient implementation is 3-7× faster across different architectures; this implies that MLlib does have overhead compared with DimmWitted's framework. However, our own batch-gradient implementation is still 20-39× slower than DimmWitted across different architectures.

We break down the execution time into the number of epochs that each system needs to converge and the time that MLlib uses for scheduling and computation. In particular, we use the Forest dataset as an example. On this dataset, DimmWitted uses 1 epoch to converge to 1% loss, while MLlib and our own C++ implementation use 63 and 64 epochs, respectively. MLlib uses 2.7 seconds for these epochs; 0.9 seconds of these are used for scheduling, and the other 1.8 seconds are used to enumerate each example and calculate the gradient. (We observe a similar breakdown on the other datasets except the smallest, Reuters, on which the time used for scheduling is up to 25× the computation time.) The difference in the number of epochs to converge implies that the difference between MLlib and DimmWitted is not caused by low-level implementations; instead, MLlib implements only a subset of the points in DimmWitted's tradeoff space.

[Figure 20: Comparison with Delite using LR (Music) on local2; speedup vs. number of threads (1-11) for PerCore, PerNode, PerMachine, Delite, and linear speedup.]

Hardware Efficiency.
We summarize the impact of the parameters on the throughput of MLlib. For each of the 540 combinations of all seven parameters related to hardware efficiency, we run the parallel sum to measure the throughput. We find, not surprisingly, that the parameter SPARK_MEM has a significant impact on throughput: on Music, when this parameter is set to 48GB, Spark achieves a 7× speed-up over 1GB. This is not surprising because this parameter sets the amount of RAM that Spark can use. We also find that, with SPARK_MEM set to 48GB, all other parameters make less than a 50% difference with each other. Therefore, in our experiments we always set SPARK_MEM to 48GB and set the other parameters to the setting that achieves the highest throughput in our experiment on the corresponding dataset.

C.2 Comparison with Delite

Recently, there has been a trend of using domain-specific languages to help users write parallel programs more easily. We conduct a simple experiment with one popular DSL, namely Delite [15], to illustrate that the tradeoff we studied in this paper has the potential to help these DSLs achieve higher performance and quality.

[Figure 21: Scalability of DimmWitted using ClueWeb 2009 on local2; x-axis: scale (1× = 0.5B rows, 4B NNZs, 49GB), y-axis: time per epoch (seconds).]

We use the official implementation of logistic regression in Delite [15] and run both DimmWitted and Delite on the Music dataset using local2. We make a best-effort attempt to tune the locality of Delite by trying different settings for numactl. We vary the number of threads that each program can use and plot the speed-up curve shown in Figure 20. First, we see from Figure 20 that the different model replication strategies in DimmWitted have different speed-up behavior. Not surprisingly, PerCore speeds up more linearly than PerNode and PerMachine.
These observations are consistent with the hardware efficiency that we discussed in this paper. More interestingly, we see that Delite does not speed up beyond a single socket (i.e., 6 cores). Therefore, by applying the PerNode strategy of DimmWitted to Delite, we hope that we can improve the speed-up behavior of Delite, as illustrated in Figure 20.

C.3 Scalability Experiments

We validate the scalability of DimmWitted by testing it on a larger dataset.

Dataset. We follow Kan et al. [26] to create a dataset that contains 500 million examples, 100K features for each example, and 4 billion non-zero elements, using a Web-scale dataset called ClueWeb (lemurproject.org/clueweb09/). ClueWeb contains 500 million Web pages, and the approach of Kan et al. tries to predict the PageRank score of each Web page from features of its URL using a least squares model.

Result. To validate the scalability of DimmWitted, we randomly subsampled 1%, 10%, and 50% of the examples to create smaller datasets. We run DimmWitted using the rules of thumb in Figure 14 and measure the time that DimmWitted uses for each epoch. Figure 21 shows the result. We see that on this dataset, the time that DimmWitted needs to finish a single epoch grows almost linearly with the number of examples. We believe this is because, for all sub-sampled datasets as well as the whole dataset, the model (100K weights) fits in the LLC cache.

C.4 Importance Sampling as a Data Replication Strategy

The Sharding and FullReplication sampling schemes that we discussed in Section 3 assume that data tuples are equally important. However, in statistical analytics, it is not uncommon for some data tuples to be more important than others. One example is the linear leverage score.
Figure 22: Importance Sampling on Music (local2).

Example C.1 (Linear Leverage Score [10]). For A ∈ R^(N×d) and b ∈ R^N, define s(i) = a_i^T (A^T A)^(-1) a_i, where a_i is the i-th row of A. Let Ã and b̃ be the result of sampling m rows, where row i is selected with probability proportional to s(i). Then, for all x ∈ R^d, we have

  Pr[ | ‖Ax − b‖₂² − (N/m)‖Ãx − b̃‖₂² | < ε‖Ax − b‖₂² ] > 1/2,

so long as m > 2ε⁻²d log d.

For general loss functions (e.g., logistic loss), the linear leverage score calculated as above does not necessarily satisfy the property of approximating the loss. However, we can still use this score as a heuristic to decide the relative importance of data examples. In DimmWitted, we consider the following protocol, which we call Importance. Given a dataset A, we calculate the leverage score s(i) of the i-th row as a_i^T (A^T A)^(-1) a_i. The user specifies the error tolerance ε that is acceptable to her, and for each epoch, DimmWitted samples for each worker 2ε⁻²d log d examples with probability proportional to the leverage score. This procedure is implemented in DimmWitted as one data replication strategy.

Experimental Results. We run the above importance sampling on the same dataset as Section 4 and validate that, on some datasets, the importance sampling scheme can improve the time that DimmWitted needs to converge to a given loss. Figure 22 shows the results of comparing different data replication strategies on Music running on local2, where Importance0.1 and Importance0.01 use 0.1 and 0.01 as the error tolerance ε, respectively.

We see that, on Music, Importance0.1 is 3× faster than FullReplication for 10% loss. This is caused by the fact that Importance0.1 processes only 10% of the data compared with FullReplication. However, Importance0.01 is slower than FullReplication.
This is because when the error tolerance is lower, the number of samples one needs to draw for each epoch increases. For Music, Importance0.01 processes the same number of tuples as FullReplication.

D. DETAILED DESCRIPTION OF EXTENSIONS

We describe in more detail each extension that we mentioned in Section 5.

D.1 Gibbs Sampling

Figure 23(a) illustrates a factor graph, which is a bipartite graph that contains a set of variables, a set of factors, and a set of links between variables and factors. To run Gibbs sampling over a factor graph, one processes one variable at a time to calculate the conditional probability for the different assignments of this variable. This involves fetching all connected factors and the current assignments of all variables connected to those factors. Gibbs sampling then updates the current variable's assignment by randomly sampling a value according to the conditional probability, and proceeds to the next random variable. Similar to first-order methods, recent theory proves that a lock-free protocol can sample multiple variables at the same time [25]. We also know from classical statistical theory [43] that one can maintain multiple copies of the same factor graph and aggregate the samples produced on each factor graph at the end of execution.

[Figure 23: Illustration of Factor Graphs and Deep Neural Networks in DimmWitted. (a) and (b) show a factor graph and how DimmWitted represents it as column-to-row access. (c) shows a deep neural network; the de facto approach to solve it is to run SGD for each layer in a round-robin fashion.]

Figure 23(b) illustrates how DimmWitted models Gibbs sampling as column-to-row access.
We see that each row corresponds to one factor, each column corresponds to one variable, and the non-zero elements in the matrix correspond to the links in the factor graph. To process one variable, DimmWitted fetches one column of the matrix to get the set of factors, and other columns to get the set of variables that connect to the same factors.

In DimmWitted, we implement the PerNode strategy for Gibbs sampling by running one independent chain on each NUMA node. At the end of sampling, we can use all samples generated on each NUMA node for estimation. Therefore, we use throughput, i.e., the number of samples generated per second, as the measure of performance in Section 5. (There has been a long historical discussion in statistics about the tradeoff between a single deep chain and multiple independent chains; this tradeoff is out of the scope of this paper.)

In DimmWitted, we implement Gibbs sampling for general factor graphs and compare it with a hand-coded implementation of topic modeling in GraphLab. We run all systems on local2 with 100K documents and 20 topics. We find that, on local2, DimmWitted's implementation is 3.7× faster than GraphLab's implementation, without any application-specific optimization.

D.2 Deep Neural Networks

Figure 23(c) illustrates a deep neural network as described in Section 5. Stochastic gradient descent is the de facto algorithm to solve a neural network [30], with one twist that we discuss as follows. As shown in Figure 23(c), a deep neural network usually contains multiple layers, and the SGD algorithm needs to be run within each layer, processing all layers in a round-robin fashion. Therefore, in DimmWitted, we use the same SGD code path inside each layer, one layer at a time, and invoke this code path multiple times to process the different layers.
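The layer-wise, round-robin reuse of a single SGD code path can be sketched as follows (illustrative Python with hypothetical names, not DimmWitted's C++ implementation):

```python
# Sketch: one shared SGD routine is invoked once per layer, in order,
# mirroring the round-robin layer processing described above.

def sgd_epoch(weights, grads, step=0.1):
    # The single shared code path: one gradient step on one layer's weights.
    return [w - step * g for w, g in zip(weights, grads)]

layers = [[1.0, 2.0], [3.0], [4.0, 5.0]]        # per-layer weights (toy sizes)
grads  = [[0.5, 0.5], [1.0], [0.0, 1.0]]        # per-layer gradients

# Round-robin: process layer 0, then 1, then 2, reusing the same routine.
for i in range(len(layers)):
    layers[i] = sgd_epoch(layers[i], grads[i])

assert layers[0] == [0.95, 1.95]
assert layers[1] == [2.9]
```

In a real implementation, the per-layer gradients would come from back-propagation through the layer above, and the loop would repeat until convergence; only the round-robin control structure is shown here.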
