Estimating the overlap between dependent computations for automatic parallelization


Researchers working on the automatic parallelization of programs have long known that too much parallelism can be even worse for performance than too little, because spawning a task to be run on another CPU incurs overheads. Autoparallelizing compilers have therefore long tried to use granularity analysis to ensure that they only spawn off computations whose cost will probably exceed the spawn-off cost by a comfortable margin. However, this is not enough to yield good results, because data dependencies may also limit the usefulness of running computations in parallel. If one computation blocks almost immediately and can resume only after another has completed its work, then the cost of parallelization again exceeds the benefit. We present a set of algorithms for recognizing places in a program where it is worthwhile to execute two or more computations in parallel that pay attention to the second of these issues as well as the first. Our system uses profiling information to compute the times at which a procedure call consumes the values of its input arguments and the times at which it produces the values of its output arguments. Given two calls that may be executed in parallel, our system uses the times of production and consumption of the variables they share to determine how much their executions would overlap if they were run in parallel, and therefore whether executing them in parallel is a good idea or not. We have implemented this technique for Mercury in the form of a tool that uses profiling data to generate recommendations about what to parallelize, for the Mercury compiler to apply on the next compilation of the program. We present preliminary results that show that this technique can yield useful parallelization speedups, while requiring nothing more from the programmer than representative input data for the profiling run.


💡 Research Summary

The paper tackles a well‑known shortcoming of automatic parallelization systems: they usually decide whether to spawn a parallel task based solely on the estimated cost of the task relative to the fixed overhead of spawning. This approach ignores the impact of data dependencies between tasks. When one task produces a value that another task needs early, the consumer will block almost immediately, and the potential parallelism is wasted, often leading to slower execution than the sequential version.

To address this, the authors present a profiling‑driven analysis for the Mercury logic/functional language that quantifies the overlap between dependent computations. Mercury’s strong mode system (which classifies each argument as input or output) and its deterministic execution model simplify the analysis. The system works in four steps:

  1. Profiling collection – The program is compiled with profiling enabled and run on representative inputs. For each atomic subgoal (calls, unifications, etc.) the profiler records execution time, call count, and the set of variables bound by that subgoal.
  2. Production/consumption timing extraction – Using the profiling data, the tool determines for every variable the “production time” (the earliest moment the producer subgoal makes the value available) and the “consumption time” (the earliest moment a consumer subgoal needs the value). Mercury’s future mechanism, which implements producer‑consumer synchronization, provides a natural place to insert these timestamps.
  3. Overlap computation – For each candidate parallel conjunction (a sequence of goals that could be turned into a parallel conjunction), the algorithm examines all shared variables. If a variable is produced late and consumed early, the overlap is small; if it is produced early and consumed late, the overlap is large. The algorithm aggregates these per‑variable overlaps to estimate the total parallel execution time, adding the cost of spawning, signaling, and waiting on futures.
  4. Partition selection – When a conjunction has more than two conjuncts, there are many ways to group them into parallel tasks: with three conjuncts c1, c2, c3, the options are (c1, c2) & c3, c1 & (c2, c3), and c1 & c2 & c3, where `,` is Mercury's sequential conjunction and `&` its parallel conjunction. The tool evaluates each grouping using the overlap model and selects the one with the smallest predicted runtime.
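Step 3's per-variable timing model can be illustrated with a small sketch. The function names, the overhead constants, and the exact cost formula below are our own simplified assumptions, not the paper's; the point is only how production and consumption times turn into a predicted parallel runtime.

```python
# Simplified sketch of the per-variable overlap model. SPAWN_COST and
# SYNC_COST stand in for the measured spawn and future-signalling
# overheads; all names here are illustrative, not taken from the paper.

SPAWN_COST = 5   # assumed fixed cost of spawning a task
SYNC_COST = 1    # assumed cost of signalling/waiting on a future

def parallel_time(cost_a, cost_b, prod_times, cons_times):
    """Estimate the runtime of running A and B in parallel, where B
    consumes variables produced by A.

    prod_times[v] -- time into A's run at which A produces v
    cons_times[v] -- time into B's run at which B first needs v
    """
    finish_b = 0.0   # B's wall-clock position, including stalls
    done_b = 0.0     # how much of B's own work is complete
    for v in sorted(cons_times, key=cons_times.get):
        # run B up to the point where it consumes v ...
        finish_b += cons_times[v] - done_b
        done_b = cons_times[v]
        # ... then wait, if necessary, until A has produced v
        finish_b = max(finish_b, prod_times[v]) + SYNC_COST
    finish_b += cost_b - done_b   # B's remaining independent work
    return SPAWN_COST + max(cost_a, finish_b)
```

Under this model, two 100-unit tasks sharing a variable produced at time 10 and consumed at time 90 run in about 106 units instead of 200 sequentially, while the reverse timing (produced at 90, consumed at 10) yields about 186, barely worth the overheads — mirroring the early-production/late-consumption intuition in step 3.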
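The search space in step 4 can also be made concrete. Assuming, as the step's examples suggest, that groups consist of contiguous conjuncts, the groupings of an n-conjunct conjunction can be enumerated as follows (hypothetical names; `predicted_time` stands in for the overlap-based runtime estimate):

```python
def groupings(goals):
    """Enumerate every way to split an ordered conjunction into
    contiguous groups: goals inside a group stay sequential, while the
    groups run in parallel with each other."""
    if not goals:
        return [[]]
    result = []
    for i in range(1, len(goals) + 1):
        for rest in groupings(goals[i:]):
            result.append([goals[:i]] + rest)
    return result

def best_grouping(goals, predicted_time):
    """Pick the grouping with the smallest predicted parallel runtime;
    predicted_time is a stand-in for the overlap model's estimate."""
    return min(groupings(goals), key=predicted_time)
```

A three-conjunct conjunction has four such groupings (2^(n-1) in general), which is why the tool scores each one rather than guessing.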

The search for candidates is depth‑first over the call tree. A node is examined only if its total cost exceeds a configurable threshold and if the amount of parallel work discovered so far does not exceed the target machine’s core count. This prevents the system from considering tasks that would be dominated by spawn overhead or that would oversubscribe the hardware.
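In outline, that search might look like the sketch below. The node shape, the meaning of the threshold, and the per-conjunct budget accounting are assumptions on our part, not details from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical profiled call-tree node."""
    cost: float
    conjunctions: list = field(default_factory=list)  # candidate goal sequences
    children: list = field(default_factory=list)

def find_candidates(node, threshold, budget, worth, advice):
    """Depth-first search over the call tree. A node is examined only if
    its cost meets the threshold and spare cores remain; `worth` is a
    predicate standing in for the overlap-based profitability test.
    Returns the remaining core budget."""
    if node.cost < threshold or budget <= 0:
        return budget
    for conj in node.conjunctions:
        if budget <= 0:
            break
        if worth(conj):
            advice.append(conj)
            budget -= len(conj) - 1   # each extra conjunct occupies a core
    for child in node.children:
        budget = find_candidates(child, threshold, budget, worth, advice)
    return budget
```

Pruning on cost first keeps spawn-dominated subtrees out of consideration, and the shrinking budget models the "do not oversubscribe the hardware" condition.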

The resulting “advice file” lists the conjunctions that should be parallelized and the exact grouping to use. During the next compilation, Mercury’s compiler reads this file and automatically rewrites the selected sequential conjunctions into parallel ones, inserting futures and barrier synchronizations as needed. Because the analysis is based on profiling rather than source annotations, developers need not modify their code; they only need to provide representative input data for the profiling run. If the program changes, a new profiling run quickly yields updated advice.

Experimental evaluation on several Mercury benchmarks shows that the technique can achieve speedups ranging from modest (≈1.2×) to substantial (≈1.8×). The gains are most pronounced when shared variables are produced early and consumed late, confirming the intuition that overlap matters. Conversely, when production is late and consumption early, the system correctly refrains from parallelizing, avoiding slowdowns.

In summary, the paper contributes:

  • A concrete method to estimate runtime overlap of dependent tasks using production/consumption timestamps.
  • Algorithms to evaluate all possible parallel partitions of a conjunction and select the optimal one.
  • An end‑to‑end toolchain that integrates profiling, analysis, and compiler feedback without requiring programmer annotations.
  • Empirical evidence that considering data‑dependency overlap yields better parallelization decisions than granularity‑only heuristics.

The work demonstrates that effective automatic parallelization must account for both task granularity and the temporal relationship of shared data, and it provides a practical solution for a language (Mercury) where deterministic execution and rich mode information make such analysis feasible.

