Adapting the DMTCP Plugin Model for Checkpointing of Hardware Emulation
Checkpoint-restart is now a mature technology. It allows a user to save and later restore the state of a running process. The new plugin model for the upcoming version 3.0 of DMTCP (Distributed MultiThreaded Checkpointing) is described here. This plugin model allows a target application to disconnect from the hardware emulator at checkpoint time and then re-connect to a possibly different hardware emulator at the time of restart. The DMTCP plugin model is important in allowing three distinct parties to seamlessly inter-operate. The three parties are: the EDA designer, who is concerned with formal verification of a circuit design; the DMTCP developers, who are concerned with providing transparent checkpointing during the circuit emulation; and the hardware emulator vendor, who provides a plugin library that responds to checkpoint, restart, and other events. The new plugin model is an example of process-level virtualization: virtualization of external abstractions from within a process. This capability is motivated by scenarios for testing circuit models with the help of a hardware emulator. The plugin model enables a three-way collaboration: allowing a circuit designer and emulator vendor to each contribute separate proprietary plugins while sharing an open source software framework from the DMTCP developers. This provides a more flexible platform, where different fault injection models based on plugins can be designed within the DMTCP checkpointing framework. After initialization, one restarts from a checkpointed state under the control of the desired plugin. This restart saves the time spent in simulating the initialization phase, while enabling fault injection exactly at the region of interest. Upon restart, one can inject faults or otherwise modify the remainder of the simulation. The work concludes with a brief survey of checkpointing and process-level virtualization.
Research Summary
This paper presents a novel extension of the Distributed MultiThreaded Checkpointing (DMTCP) framework, specifically its upcoming version 3.0 plugin model, to support checkpoint‑restart workflows in hardware emulation environments used for electronic design automation (EDA). The authors identify three distinct stakeholders—circuit designers, DMTCP developers, and hardware‑emulator vendors—and argue that a plugin‑based approach enables seamless collaboration among them while preserving proprietary IP.
The core contribution is a process‑level virtualization layer that intercepts calls to system and library functions and translates external identifiers such as process IDs (PIDs), file paths, and environment variables into “virtual” values visible to the application. Wrapper functions are injected via a dynamically loaded plugin library, which precedes the standard libraries in the ELF symbol resolution order. At checkpoint time the plugin marks connections to the hardware emulator (and to other external services such as an X server or license servers) as “external” and excludes them from the checkpoint image. Upon restart, the plugin re‑establishes these connections, possibly to a different emulator instance, thereby allowing migration across machines or emulator generations.
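The identifier translation described above can be pictured as a two-way table that a wrapper consults on every call. The following is a minimal sketch in Python, not DMTCP's actual implementation (which is written in C/C++); the class name `VirtualIdTable`, its methods, and the starting virtual value are all hypothetical and chosen only for illustration.

```python
class VirtualIdTable:
    """Maps stable virtual IDs (seen by the application) to real IDs (seen by the OS)."""

    def __init__(self, first_virtual=40000):
        # first_virtual is an arbitrary illustrative value, not a DMTCP constant.
        self._virt_to_real = {}
        self._real_to_virt = {}
        self._next_virtual = first_virtual

    def register(self, real_id):
        """Return the virtual ID for real_id, allocating one on first sight."""
        if real_id in self._real_to_virt:
            return self._real_to_virt[real_id]
        virt = self._next_virtual
        self._next_virtual += 1
        self._virt_to_real[virt] = real_id
        self._real_to_virt[real_id] = virt
        return virt

    def to_real(self, virt):
        """Translate on the way into the OS (e.g. an argument of a wrapped call)."""
        return self._virt_to_real[virt]

    def to_virtual(self, real_id):
        """Translate on the way back to the application (e.g. a return value)."""
        return self._real_to_virt[real_id]

    def rebind(self, virt, new_real_id):
        """After restart the OS hands out a new real ID; the virtual ID stays fixed."""
        old_real = self._virt_to_real[virt]
        del self._real_to_virt[old_real]
        self._virt_to_real[virt] = new_real_id
        self._real_to_virt[new_real_id] = virt


def demo():
    table = VirtualIdTable()
    app_pid = table.register(1234)   # the application only ever sees app_pid
    table.rebind(app_pid, 5678)      # restart: same virtual PID, new real PID
    return app_pid, table.to_real(app_pid)
```

The design point is that the application stores only virtual values, so nothing saved in its memory becomes stale when the real identifier changes at restart.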
A programmable barrier mechanism is introduced to coordinate checkpointing across distributed processes. A central DMTCP coordinator orchestrates barrier entry, enabling all processes to reach a quiescent state before the checkpoint is taken. This barrier also serves as a hook for vendor‑specific actions, such as saving the emulator state or performing license re‑authentication.
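As a rough illustration of barrier-coordinated phases with per-process hooks, the sketch below uses threads in a single Python process to stand in for distributed processes, and `threading.Barrier` to stand in for the coordinator. The phase names (`quiesce`, `save_external_state`, `resume`) and the function `run_checkpoint_round` are assumptions made for this example, not names from the DMTCP API.

```python
import threading

# Hypothetical phase names; a real plugin would use whatever events the framework defines.
PHASES = ["quiesce", "save_external_state", "resume"]


def run_checkpoint_round(worker_hooks):
    """Run one checkpoint round.

    worker_hooks is a list with one dict per worker, mapping a phase name to a
    callable taking the worker index. Returns the (phase, worker) completion log.
    """
    barrier = threading.Barrier(len(worker_hooks))
    log = []
    log_lock = threading.Lock()

    def worker(index, hooks):
        for phase in PHASES:
            hook = hooks.get(phase)
            if hook is not None:
                hook(index)              # vendor- or user-supplied action for this phase
            with log_lock:
                log.append((phase, index))
            barrier.wait()               # nobody starts the next phase until all finish this one

    threads = [
        threading.Thread(target=worker, args=(i, hooks))
        for i, hooks in enumerate(worker_hooks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because every worker logs a phase before waiting and no worker passes the barrier until all have arrived, the log is always grouped by phase, which is the property a checkpoint needs: all participants are quiescent before any state is saved.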
Three real‑world case studies illustrate the practicality of the approach.
- GUI‑based simulation: The authors extend DMTCP to checkpoint applications that use X‑server connections by routing them through VNC or XPRA, which can be restarted under DMTCP control. Licensing services are handled by vendor plugins that re‑validate seats after restart.
- Environment and path virtualization: By virtualizing environment variables and file system paths, a checkpoint image can be moved to a different cluster or cloud environment, and the plugin rewrites the paths at restart, enabling efficient reuse of expensive emulator resources.
- Hardware interface and lock management: High‑speed emulator‑host interfaces are quiesced before checkpointing to avoid data loss. The plugin also wraps lock acquisition/release primitives to track lock state across restarts, solving the problem of stale lock identifiers when thread IDs change after a restart.
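The path and environment rewriting from the second case study can be sketched as a prefix substitution applied at restart. This is a simplified illustration under stated assumptions: the helper names `rewrite_path` and `rewrite_env`, and the example directories in the usage note, are hypothetical, and colon-separated values are treated as path lists.

```python
import posixpath


def rewrite_path(path, prefix_map):
    """Replace the longest matching old prefix of path with its new location."""
    norm = posixpath.normpath(path)
    normalized = {posixpath.normpath(old): new for old, new in prefix_map.items()}
    for old in sorted(normalized, key=len, reverse=True):
        # Match whole path components only, so /tools/emu/v1 does not match /tools/emu/v10.
        if norm == old or norm.startswith(old.rstrip("/") + "/"):
            suffix = norm[len(old):].lstrip("/")
            new_root = normalized[old]
            return posixpath.join(new_root, suffix) if suffix else new_root
    return norm


def rewrite_env(env, prefix_map):
    """Rewrite absolute paths inside environment values, including colon-separated lists."""
    rewritten = {}
    for name, value in env.items():
        parts = value.split(":")
        rewritten[name] = ":".join(
            rewrite_path(p, prefix_map) if p.startswith("/") else p for p in parts
        )
    return rewritten
```

For example, `rewrite_env({"LD_LIBRARY_PATH": "/tools/emu/v1/lib:/usr/lib"}, {"/tools/emu/v1": "/opt/emu/v2"})` maps the first entry to the new installation root and leaves `/usr/lib` alone.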
The framework supports fault injection by allowing user‑defined code to be executed immediately after restart or by interposing on selected library calls. This enables systematic evaluation of silicon logic fault tolerance: the simulation can be run to a region of interest, checkpointed, and then repeatedly restarted with injected transient faults to assess error propagation.
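The restart-and-inject workflow can be illustrated with a toy model. In this sketch `copy.deepcopy` stands in for a checkpoint image, and `ToyCircuit` is an invented 8-bit register that stands in for a simulated design; neither reflects the paper's actual emulator or plugin code.

```python
import copy


class ToyCircuit:
    """An 8-bit register with a fixed update rule; a stand-in for a simulated design."""

    def __init__(self):
        self.reg = 1
        self.cycle = 0

    def step(self):
        # Rotate left by one bit, then XOR with a constant (an invertible update).
        rotated = ((self.reg << 1) | (self.reg >> 7)) & 0xFF
        self.reg = rotated ^ 0x1D
        self.cycle += 1


def run(circuit, cycles):
    for _ in range(cycles):
        circuit.step()
    return circuit.reg


def fault_campaign(warmup_cycles, tail_cycles, bits):
    """Warm up once, checkpoint, then restart once per injected single-bit fault."""
    circuit = ToyCircuit()
    run(circuit, warmup_cycles)               # costly initialization, paid only once
    checkpoint = copy.deepcopy(circuit)       # stand-in for writing a checkpoint image
    golden = run(copy.deepcopy(checkpoint), tail_cycles)   # fault-free reference run
    propagated = {}
    for bit in bits:
        restarted = copy.deepcopy(checkpoint)  # stand-in for restarting from the image
        restarted.reg ^= 1 << bit              # inject a transient single-bit flip
        propagated[bit] = run(restarted, tail_cycles) != golden
    return checkpoint.cycle, propagated
```

In this toy every fault reaches the final state because the update rule is invertible; a realistic design would mask some faults, and the per-bit dictionary is where such differences would show up.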
Performance optimizations such as fast‑restart (using mmap‑based on‑demand page loading) and forked checkpointing (leveraging copy‑on‑write) keep overhead low even for large MPI jobs. The authors compare their approach to traditional system‑level checkpointing (which cannot control external resources) and application‑level checkpointing (which is error‑prone and hard to maintain). The DMTCP plugin model offers a middle ground: transparent checkpointing with fine‑grained control over external abstractions, making it uniquely suited for complex EDA workflows.
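The idea behind forked checkpointing can be shown in a few lines on a POSIX system: the child created by `fork` sees a copy-on-write snapshot of memory and writes it out while the parent keeps running. This is only a sketch of the semantics, assuming a POSIX platform; the function name `forked_checkpoint` and the use of `pickle` as the image format are illustrative choices, and CPython's reference counting touches pages in ways that reduce the real copy-on-write savings.

```python
import os
import pickle


def forked_checkpoint(state, path):
    """Fork a child that serializes its snapshot of state to path.

    Returns the child's PID immediately, so the caller can continue computing
    and later collect the child with os.waitpid.
    """
    pid = os.fork()
    if pid == 0:
        # Child: its view of state is frozen at the moment of the fork.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            os._exit(status)   # leave without running the parent's cleanup handlers
    return pid
```

A caller would typically mutate `state` right after the call returns and invoke `os.waitpid(pid, 0)` only when it needs to know the image is complete; the file then holds the values as they were at fork time.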
In conclusion, the paper demonstrates that extending DMTCP with a flexible plugin architecture enables reliable checkpoint‑restart for hardware‑emulated circuit verification, supports migration across heterogeneous resources, and provides a platform for programmable fault injection. Future work includes standardizing the plugin API for broader adoption, integrating automated resource scheduling in cloud‑based emulation farms, and expanding support for mixed‑architecture (32‑/64‑bit) toolchains.