Couillard: Parallel Programming via Coarse-Grained Data-Flow Compilation
Data-flow is a natural approach to parallelism. However, describing dependencies and control between fine-grained data-flow tasks can be complex and can introduce unwanted overheads. TALM (TALM is an Architecture and Language for Multi-threading) introduces a user-defined coarse-grained parallel data-flow model, in which programmers identify code blocks, called super-instructions, to be run in parallel and connect them in a data-flow graph. TALM has been implemented as a hybrid Von Neumann/data-flow execution system: the \emph{Trebuchet}. We have observed that TALM’s usefulness largely depends on how programmers specify and connect super-instructions. Thus, we present \emph{Couillard}, a full compiler that creates, from an annotated C program, a data-flow graph and the C code corresponding to each super-instruction. We show that our toolchain allows one to benefit from data-flow execution and explore sophisticated parallel programming techniques with little effort. To evaluate our system we have executed a set of real applications on a large multi-core machine. Comparison with popular parallel programming methods shows competitive speedups, while providing an easier parallel programming approach.
💡 Research Summary
The paper introduces Couillard, a source‑to‑source compiler that bridges conventional C programming with a coarse‑grained data‑flow execution model called TALM (TALM is an Architecture and Language for Multi‑threading). TALM allows programmers to define “super‑instructions”, which are user‑specified blocks of code that execute as atomic units in a data‑flow graph. Super‑instructions can be declared as either single (a unique node in the graph) or parallel (multiple instances may run concurrently, depending on available cores).
Couillard extends ANSI‑C with a small set of annotations. A block of code surrounded by #BEGIN SUPER … #ENDSUPER together with a header of the form treb_super <single|parallel> input(... ) output(... ) declares a super‑instruction, its input ports, and its output ports. Variables that serve as inputs or outputs must be declared beforehand, and for parallel super‑instructions the special storage class treb_parout is used. Instances of such variables are addressed with a suffix :: followed by an instance identifier: ::0 for the first instance, ::* for all instances, ::mytid for the current instance, or arithmetic expressions such as ::(mytid‑1). This syntax enables explicit inter‑instance data dependencies, which are essential for pipelines, speculative execution, and other advanced parallel patterns.
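Based on the description above, an annotated parallel super-instruction might look roughly like the following sketch. This is not standard C and the variable names, loop body, and the `NUM_INSTANCES` count are illustrative; only the annotation keywords (`treb_super`, `treb_parout`, `#BEGIN SUPER`/`#ENDSUPER`) and the `::` instance-addressing suffix come from the paper's description:

```c
/* Illustrative TALM-annotated C sketch; identifiers other than the
 * annotation keywords are hypothetical. */
treb_parout double partial;   /* one instance of this variable per parallel instance */
double prices[N];

treb_super parallel input(prices) output(partial)
#BEGIN SUPER
    /* each instance processes a strided slice of the input;
     * mytid names the current instance, as in the ::mytid syntax */
    partial = 0.0;
    for (int i = mytid; i < N; i += NUM_INSTANCES)
        partial += price_one(prices[i]);   /* hypothetical per-element kernel */
#ENDSUPER
```

A downstream single super-instruction could then consume `partial::*` to reduce all instances' results, while a pipeline stage would instead read `partial::(mytid-1)` to depend only on its predecessor instance.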
The compiler front‑end, built with PLY (Python Lex‑Yacc), parses the annotated source and builds an abstract syntax tree. From this tree it generates three artefacts: (1) a C source file for each super‑instruction, compiled with a standard C compiler into a shared library (.so); (2) a TALM assembly file that describes the data‑flow graph, including both super‑instructions and the fine‑grained control instructions (loops, conditionals, etc.) that are not part of any super‑instruction; and (3) a Graphviz representation of the graph for visualization. The TALM assembly encodes the data dependencies using the instance‑addressing syntax, thereby defining the exact firing order of nodes at runtime.
Execution is performed by Trebuchet, a hybrid Von Neumann / data‑flow virtual machine built on POSIX threads. Each processing element (PE) maps to a host thread. When a super‑instruction fires, Trebuchet loads the corresponding shared‑library function and invokes it directly, bypassing interpretation. Simple instructions are interpreted according to the data‑flow rules. Trebuchet also implements a work‑stealing scheduler based on a FIFO double‑ended queue (a modification of the ABP algorithm) that gives priority to older instructions, which is advantageous for the target workloads.
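The FIFO-priority deque described above can be sketched as follows. This is a minimal, single-threaded illustration of the scheduling policy only: the owner dequeues the oldest ready instruction while a thief takes from the opposite end. The names are hypothetical, and the synchronization that makes the real ABP-style deque safe under concurrency is deliberately omitted:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a per-PE scheduling queue with FIFO priority:
 * the owner thread takes the OLDEST ready instruction (unlike the
 * classic ABP deque, where the owner works LIFO). Atomics and fences
 * are omitted for clarity. */

#define DEQ_CAP 256

typedef struct {
    int    items[DEQ_CAP]; /* instruction ids, stand-ins for task descriptors */
    size_t head;           /* index of the oldest element */
    size_t tail;           /* one past the newest element */
} deque_t;

static void deq_init(deque_t *d) { d->head = d->tail = 0; }

static int deq_empty(const deque_t *d) { return d->head == d->tail; }

/* Owner enqueues a newly-ready instruction at the tail. */
static int deq_push(deque_t *d, int instr) {
    if (d->tail - d->head == DEQ_CAP) return 0; /* full */
    d->items[d->tail % DEQ_CAP] = instr;
    d->tail++;
    return 1;
}

/* FIFO priority: the owner fires the oldest instruction first. */
static int deq_take_oldest(deque_t *d, int *out) {
    if (deq_empty(d)) return 0;
    *out = d->items[d->head % DEQ_CAP];
    d->head++;
    return 1;
}

/* A thief removes from the opposite (newest) end. */
static int deq_steal_newest(deque_t *d, int *out) {
    if (deq_empty(d)) return 0;
    d->tail--;
    *out = d->items[d->tail % DEQ_CAP];
    return 1;
}
```

The design choice this sketch illustrates is the one the summary attributes to Trebuchet: giving priority to older instructions tends to complete in-flight iterations of the data-flow graph before starting new ones, which suits pipelined workloads.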
The authors evaluate Couillard and Trebuchet on two PARSEC benchmarks, Blackscholes and Swaptions, using a 24‑core machine. Compared with hand‑written OpenMP versions, the data‑flow approach achieves speedups ranging from 1.2× to 1.5×, with the largest gains observed in applications that hide I/O latency or employ speculative transaction execution. The results demonstrate that the coarse‑grained data‑flow model can match or exceed the performance of traditional thread‑pool models while offering a higher level of abstraction for complex parallel designs.
In summary, Couillard automates the most cumbersome aspects of TALM programming: it extracts super‑instruction code, generates the data‑flow graph, and handles instance‑level variable mapping. By doing so, it lets programmers focus on algorithmic parallelism rather than low‑level synchronization, while still obtaining competitive or superior performance to established models such as OpenMP, Pthreads, and Intel TBB. The work therefore represents a significant step toward practical, high‑performance data‑flow programming on modern multicore architectures.