Purine: A bi-graph based deep learning framework


In this paper, we introduce a novel deep learning framework, termed Purine. In Purine, a deep network is expressed as a bipartite graph (bi-graph) composed of interconnected operators and data tensors. With the bi-graph abstraction, networks are easily solved by an event-driven task dispatcher. We then demonstrate that different parallelism schemes over GPUs and/or CPUs, on single or multiple machines, can be universally implemented by graph composition. This frees researchers from writing code for each parallelization scheme, and the same dispatcher can be used to solve any such graph. Scheduled by the task dispatcher, memory transfers are fully overlapped with other computations, which greatly reduces communication overhead and helps achieve approximately linear acceleration.


💡 Research Summary

Purine is a deep‑learning framework that represents a neural network as a bipartite graph (bi‑graph) whose vertices are either tensors (data) or operators (functions). Edges connect only tensors to operators, and the graph is constructed as a directed acyclic graph (DAG). This abstraction enables the entire forward‑and‑backward computation of a model to be expressed as a single graph, which can be executed by an event‑driven task dispatcher.

The dispatcher monitors the readiness of tensors: an operator fires only when all of its input tensors are ready, and a tensor becomes ready when all of the operators that write to it have finished. Consequently, the dispatcher automatically respects data dependencies while launching every operator that can run in parallel, without requiring explicit synchronization code.
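The firing rule above can be captured in a minimal Python sketch. The names (`Tensor`, `Op`, `run`) are illustrative, not Purine's actual C++ API: each tensor counts its unfinished producers, and an operator is scheduled the moment every input's count reaches zero.

```python
class Tensor:
    """A data vertex in the bi-graph."""
    def __init__(self, name):
        self.name = name
        self.producers = []   # operators that write this tensor
        self.consumers = []   # operators that read this tensor
        self.pending = 0      # producers that have not yet finished

class Op:
    """An operator vertex; edges run only between Ops and Tensors."""
    def __init__(self, name, inputs, outputs):
        self.name, self.inputs, self.outputs = name, inputs, outputs
        for t in inputs:
            t.consumers.append(self)
        for t in outputs:
            t.producers.append(self)

def run(ops, tensors, execute):
    """Event-driven dispatch: fire an op once all its inputs are ready."""
    for t in tensors:
        t.pending = len(t.producers)
    scheduled, ready, order = set(), [], []

    def try_schedule(op):
        if id(op) not in scheduled and all(t.pending == 0 for t in op.inputs):
            scheduled.add(id(op))
            ready.append(op)

    for op in ops:                      # ops with no unready inputs start first
        try_schedule(op)
    while ready:
        op = ready.pop()
        execute(op)
        order.append(op.name)
        for out in op.outputs:          # finishing an op may make tensors ready
            out.pending -= 1
            if out.pending == 0:
                for c in out.consumers:
                    try_schedule(c)
    return order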

To support iterative training (e.g., stochastic gradient descent), Purine treats one training step as a graph and repeats the graph sequentially. Within a single graph, parallelizable operations are executed concurrently; the boundary between successive graphs provides an implicit synchronization point. This solves the classic problem that DAGs cannot directly encode loops.
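The iteration idiom is simple enough to state as a sketch. Here `run_graph` is a hypothetical stand-in for the dispatcher executing one whole graph; the key point is that control returns only after every operator in the graph has fired, which is what makes the graph boundary an implicit synchronization point.

```python
def train(step_graph, run_graph, num_iters):
    """Repeat a single-step DAG sequentially; a DAG itself cannot encode loops."""
    for _ in range(num_iters):
        run_graph(step_graph)   # ops inside one run may execute concurrently
        # control returns only when every op has finished: implicit sync point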

Parallelism is achieved by attaching a “location” attribute to each tensor and operator. The attribute consists of a hostname and a device identifier (CPU, or a GPU together with its ordinal). A second attribute, “thread,” allows the user to assign operators to specific threads on the same device. By configuring these attributes, users can implement:

  • Model parallelism – different layers or sub‑networks are placed on different devices. The paper illustrates this with a two‑layer fully‑connected network split into three sub‑graphs (A, B, C) that are executed in sequence, while the whole network is replicated three times to keep all devices busy.
  • Data parallelism – each device holds a full copy of the model and processes a distinct mini‑batch. Gradients are aggregated either by a parameter server or via an All‑Reduce scheme. The framework also supports hybrid schemes (e.g., data parallelism for convolutional layers and model parallelism for fully‑connected layers) as proposed by Krizhevsky.
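The central idea is that parallelism is expressed purely by tagging graph nodes, not by writing scheduling code. A minimal sketch, with hypothetical names (`Node`, the hostname `"worker1"`, and the device strings are all illustrative, not Purine's API):

```python
class Node:
    """A bi-graph vertex tagged with a placement, as described above."""
    def __init__(self, name, host, device, thread=0):
        self.name = name
        self.location = (host, device)  # e.g. ("worker1", "gpu0")
        self.thread = thread            # thread assignment on that device

# Data parallelism: replicate the model sub-graph, one replica per GPU,
# and aggregate gradients on a CPU-resident parameter-server node.
replicas = [Node(f"net_copy_{i}", "worker1", f"gpu{i}") for i in range(3)]
param_server = Node("aggregate_gradients", "worker1", "cpu")
```

Model parallelism falls out of the same mechanism: instead of replicating the whole graph per device, different sub-graphs are tagged with different device locations.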

Because tensors residing on one device are not directly accessible from another, Purine introduces a special Copy operator. Copy operators are ordinary graph nodes; they are scheduled by the same dispatcher and can be assigned to separate threads. This design makes it trivial to overlap communication with computation: while one thread copies data between devices, other threads continue executing independent operators. In multi‑machine settings each machine runs its own dispatcher; a copy from machine A to B appears as a sink on A and a source on B, eliminating the need for a global scheduler.
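The overlap mechanism can be sketched with an ordinary worker thread. This is an assumption-laden illustration, not Purine's implementation: the `copy_worker` loop stands in for a Copy operator pinned to its own thread, and the log append stands in for a device-to-device memcpy, while the main thread is free to keep executing independent operators.

```python
import queue
import threading

def copy_worker(jobs, log):
    """Drain Copy jobs on a dedicated thread; None is a shutdown sentinel."""
    while True:
        job = jobs.get()
        if job is None:
            break
        src, dst = job
        log.append(f"copied {src} -> {dst}")   # stands in for a device memcpy
        jobs.task_done()

jobs, log = queue.Queue(), []
t = threading.Thread(target=copy_worker, args=(jobs, log))
t.start()
jobs.put(("gpu0:grad_w", "cpu:grad_w"))  # dispatcher enqueues a Copy op
# ... the main thread would keep executing independent operators here ...
jobs.join()                              # wait only when the copy is needed
jobs.put(None)
t.join()
```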

The authors evaluated Purine on GoogLeNet using data parallelism across up to 12 GPUs. Each GPU processed a mini‑batch of 128 images (total batch size = 128 × #GPUs). The machines were linked by 10 GbE, so GPU‑to‑GPU traffic had to pass through host memory. Despite this bottleneck, the framework’s ability to hide transfers behind computation yielded near‑linear scaling:

| GPUs | Images/s |
|-----:|---------:|
|    1 |    112.2 |
|    2 |    222.6 |
|    3 |    336.8 |
|    6 |    673.7 |
|    9 |   1010.5 |
|   12 |   1383.7 |

Even when the per‑GPU batch size was reduced to 32 (total batch 384), the system achieved a 9.5× speed‑up with 12 GPUs, demonstrating that the dispatcher can keep the pipeline busy with relatively small batches. Profiling showed that only the first convolution layer’s memory copy created a noticeable gap between iterations; all other copies were fully overlapped with backward‑pass computation.
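The scaling claim can be checked directly from the reported throughput figures: dividing each row by the single-GPU baseline gives the speed-up factor.

```python
# Throughput (images/s) as reported in the table above, keyed by GPU count.
throughput = {1: 112.2, 2: 222.6, 3: 336.8, 6: 673.7, 9: 1010.5, 12: 1383.7}

# Speed-up relative to a single GPU; near-linear means speedup[g] ≈ g.
speedup = {g: round(t / throughput[1], 2) for g, t in throughput.items()}
# On these numbers: speedup[6] -> 6.0 and speedup[9] -> 9.01.
```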

Key contributions of Purine are:

  1. Unified graph‑based abstraction that captures both forward and backward passes, model structure, and inter‑device data movement.
  2. Event‑driven dispatcher that automatically schedules operators, respects dependencies, and overlaps communication without user‑written synchronization code.
  3. Flexible parallelism through location and thread attributes, enabling model, data, and hybrid parallelism on heterogeneous clusters.
  4. Copy operator that integrates data transfer into the computational graph, allowing seamless overlap of communication and computation.
  5. Empirical evidence of linear scaling on a realistic large‑scale network (GoogLeNet) across multiple machines and GPUs, even with modest network bandwidth.

Limitations include the lack of direct GPU‑to‑GPU peer‑to‑peer transfers (all copies go through host memory), which can become a bottleneck on larger clusters or slower interconnects. Future work could incorporate GPU‑direct RDMA, automatic graph partitioning, and more sophisticated scheduling heuristics to further improve scalability.

In summary, Purine demonstrates that representing deep‑learning workloads as bipartite graphs and driving them with an event‑based dispatcher provides a clean, extensible, and high‑performance foundation for a wide range of parallel training strategies.

