Profiling parallel Mercury programs with ThreadScope
The behavior of parallel programs is even harder to understand than the behavior of sequential programs. Parallel programs may suffer from any of the performance problems affecting sequential programs, as well as from several problems unique to parallel systems. Many of these problems are quite hard (or even practically impossible) to diagnose without help from specialized tools. We present a proposal for a tool for profiling the parallel execution of Mercury programs, a proposal whose implementation we have already started. This tool is an adaptation and extension of the ThreadScope profiler that was first built to help programmers visualize the execution of parallel Haskell programs.
💡 Research Summary
The paper addresses the difficulty of understanding and optimizing parallel programs, focusing on the pure logic programming language Mercury. While sequential profiling tools are well‑established, parallel programs introduce additional challenges such as task granularity, load imbalance, synchronization overhead, and data dependencies that can cause blocking or excessive spark creation. To give Mercury developers a practical way to observe these phenomena, the authors adapt the ThreadScope visual profiler, originally built for parallel Haskell programs compiled with the Glasgow Haskell Compiler (GHC).
Mercury’s parallel runtime consists of “engines” (virtual CPUs mapped to OS threads) and “contexts” (units of work that may be running or suspended). Parallel conjunctions are the only parallel construct; the compiler rewrites a conjunction into a barrier and a series of “sparks” (potentially parallel sub‑tasks). Data dependencies between conjuncts are expressed via “futures”: a producer writes a value into a future and signals its availability; a consumer waits on the future if the value is not yet ready. This model can generate many events: contexts may block waiting for futures, sparks may be stolen from one engine’s queue by another, and engines may be idle or busy with garbage collection.
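The producer/consumer protocol around a future can be illustrated with a small model. The sketch below is not Mercury's runtime code: the `Future` class, its `signal` and `wait` methods, and `run_parallel_conjunction` are hypothetical names, and Python threads stand in for Mercury contexts. It only shows the behaviour described above, where a consumer blocks if the value is not yet ready and the conjunction ends with a barrier.

```python
import threading


class Future:
    """Toy single-assignment future (hypothetical, not Mercury's implementation)."""

    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def signal(self, value):
        # Producer side: store the value, then announce that it is available.
        self._value = value
        self._ready.set()

    def wait(self):
        # Consumer side: returns at once if the value is ready, blocks otherwise.
        self._ready.wait()
        return self._value


def run_parallel_conjunction(x):
    """Model a two-conjunct parallel conjunction that shares one variable."""
    fut = Future()
    results = {}

    def producer():
        fut.signal(x * 2)

    def consumer():
        results["out"] = fut.wait() + 1

    # Start the consumer first so that it may have to block on the future.
    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # plays the role of the barrier that ends the conjunction
    return results["out"]
```

In a profile, the time the consumer spends inside `wait` corresponds to the kind of blocking that the tool is meant to make visible.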
ThreadScope records a timestamped stream of events (STARTUP, SHUTDOWN, CREATE_THREAD, RUN_THREAD, STOP_THREAD, RUN_SPARK, STEAL_SPARK, etc.) and visualizes per-engine activity over time. The authors extend this framework to capture Mercury-specific information. Two fields are added to the existing RUN_SPARK and STEAL_SPARK events: a unique spark identifier and the identifier of the context that created the spark. With these fields, the profiler can trace the exact path a spark takes from creation through execution, including any cross-engine stealing, and can help work out whether the spark ran in a newly created context or a reused one.
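To make the role of the two added fields concrete, here is a minimal sketch of how an analysis pass might follow one spark through a log. The `Event` record, its field names (including `victim_engine`), and the sample data are assumptions made for illustration; they do not reproduce the real ThreadScope or Mercury log layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    time: int                              # timestamp in arbitrary units
    engine: int                            # engine that logged the event
    kind: str                              # e.g. "RUN_SPARK" or "STEAL_SPARK"
    spark_id: Optional[int] = None         # the added spark identifier
    creator_context: Optional[int] = None  # the added creating-context identifier
    victim_engine: Optional[int] = None    # hypothetical: engine the spark was taken from


def trace_spark(events: List[Event], spark_id: int) -> List[Event]:
    """Return every event that mentions the given spark, in time order."""
    matching = [e for e in events if e.spark_id == spark_id]
    return sorted(matching, key=lambda e: e.time)


def was_stolen(events: List[Event], spark_id: int) -> bool:
    """True if any STEAL_SPARK event refers to this spark."""
    return any(e.kind == "STEAL_SPARK" for e in trace_spark(events, spark_id))


# Invented sample log: spark 1 runs locally, spark 2 is stolen by engine 1.
SAMPLE = [
    Event(10, 0, "RUN_SPARK", spark_id=1, creator_context=7),
    Event(12, 1, "STEAL_SPARK", spark_id=2, creator_context=7, victim_engine=0),
    Event(15, 0, "STOP_THREAD"),
]
```

Without the spark identifier, the two spark events in `SAMPLE` could not be tied to specific sparks, which is why the extra field matters for this kind of query.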
The extended log format remains compact: most events inherit the current engine and context from a block header, reducing redundancy. The design respects ThreadScope’s principle of recording only essential data at runtime and inferring the rest during analysis. The authors also introduce a CREATE_SPARK_THREAD event to note when a new context is allocated for a spark, though the log does not differentiate new versus reused contexts directly; this can be inferred by scanning earlier events.
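The inference step mentioned above can be sketched as a single pass over the log. The rule used here is an assumption, not necessarily the authors' method: a context counts as reused if its identifier appeared in any earlier event. The `(kind, context_id)` tuple layout and the function name are also hypothetical.

```python
def classify_spark_contexts(events):
    """Label the context of each CREATE_SPARK_THREAD event as new or reused.

    `events` is an iterable of (kind, context_id) pairs in log order.
    Assumption: a context is "reused" if its id was seen in an earlier event.
    """
    seen = set()
    labels = []
    for kind, context_id in events:
        if kind == "CREATE_SPARK_THREAD":
            status = "reused" if context_id in seen else "new"
            labels.append((context_id, status))
        if context_id is not None:
            seen.add(context_id)
    return labels


# Invented sample: context 1 ran earlier, context 2 is fresh.
log = [
    ("CREATE_THREAD", 1),
    ("RUN_THREAD", 1),
    ("STOP_THREAD", 1),
    ("CREATE_SPARK_THREAD", 2),
    ("CREATE_SPARK_THREAD", 1),
]
```

This follows the design principle stated above: the runtime writes only the event, and the analysis reconstructs the rest from history.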
With the enriched log, three stakeholder groups can extract actionable insights:
- Application programmers can identify excessive barrier wait times, over‑generation of sparks, and load imbalance. By visualizing when and where tasks block on futures, they can adjust granularity (e.g., combine loop iterations into larger tasks) or restructure data dependencies to reduce waiting.
- Runtime system implementers gain quantitative data on scheduler behavior: how often sleeping workers are awakened, the latency between a future’s signal and a consumer’s wake‑up, cache warm‑up effects when a task resumes on the same engine, and the impact of garbage collection pauses on parallel progress. This information can guide improvements to spark queue management, work‑stealing policies, and GC coordination.
- Researchers in automatic parallelisation can calibrate cost‑benefit models for parallelisation opportunities. By measuring real costs of spark creation, future signaling, and barrier synchronization, they can refine heuristics that decide whether a particular conjunction should be parallelised.
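As one example of the analyses listed above, load imbalance can be estimated from the intervals between RUN_THREAD and STOP_THREAD events on each engine. The sketch below assumes a simplified `(time, engine, kind)` tuple per event. The imbalance ratio (maximum busy time divided by mean busy time) is an illustrative metric chosen here, not one taken from the paper.

```python
from collections import defaultdict


def busy_time_per_engine(events):
    """Sum the RUN_THREAD..STOP_THREAD intervals for each engine.

    `events` is an iterable of (time, engine, kind) tuples in time order.
    """
    started = {}
    busy = defaultdict(int)
    for time, engine, kind in events:
        if kind == "RUN_THREAD":
            started[engine] = time
        elif kind == "STOP_THREAD" and engine in started:
            busy[engine] += time - started.pop(engine)
    return dict(busy)


def imbalance(busy):
    """Maximum busy time over mean busy time; 1.0 means perfectly balanced."""
    mean = sum(busy.values()) / len(busy)
    return max(busy.values()) / mean


# Invented sample: engine 0 works three times as long as engine 1.
sample_events = [
    (0, 0, "RUN_THREAD"),
    (0, 1, "RUN_THREAD"),
    (30, 1, "STOP_THREAD"),
    (90, 0, "STOP_THREAD"),
]
```

A ratio well above 1.0 suggests that some engines sat idle while others worked, which is the situation a coarser or better balanced task decomposition aims to avoid.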
The paper also discusses related work, noting that while ThreadScope was designed for Haskell’s capability/thread model, the similarity between GHC’s and Mercury’s runtimes (both use per‑engine threads and work‑stealing) makes adaptation feasible. The extensible log format of ThreadScope allowed the authors to add Mercury‑specific events without breaking compatibility with existing visualisation tools.
In conclusion, the authors present a practical, low‑overhead profiling solution for Mercury parallel programs. By reusing and extending an existing Haskell tool, they avoid building a profiler from scratch, while providing detailed, visual insight into the dynamic behaviour of Mercury’s parallel runtime. Future directions include more sophisticated statistical analyses, real‑time profiling, and applying the approach to other parallel logic languages.