Prophet: A Speculative Multi-threading Execution Model with Architectural Support Based on CMP

Speculative multi-threading (SpMT) has been proposed as a promising method to exploit the hardware potential of Chip Multiprocessors (CMPs). It is a thread-level speculation (TLS) model that relies on software/hardware co-design. This paper studies the speculative thread-level parallelism of general-purpose programs and presents a speculative multi-threading execution model called Prophet, together with CMP-based architectural support. In Prophet, inter-thread data dependences are predicted by pre-computation slices (p-slices) to reduce RAW violations. A multi-versioning cache system, together with a thread-state control mechanism, buffers speculative data, and a snooping-bus-based cache coherence protocol detects data dependence violations. Simulation-based evaluation shows that the Prophet system achieves significant speedups for general-purpose programs.


💡 Research Summary

The paper introduces Prophet, a speculative multi‑threading (SpMT) execution model designed to exploit the parallel potential of Chip Multiprocessors (CMP) for general‑purpose applications. Traditional Thread Level Speculation (TLS) approaches suffer from delayed dependence violation detection and costly rollback, especially when memory accesses are frequent. Prophet tackles these issues through two complementary mechanisms: pre‑computation slices (p‑slices) and a multi‑version cache with hardware‑assisted thread state control.

P‑slices are generated by static analysis at compile time. For each speculative thread, the compiler extracts a slice of instructions that compute values needed for future memory reads. By executing these slices before the thread’s main speculative region, Prophet can predict read‑after‑write (RAW) dependencies and reduce the probability that a speculative thread will read stale data. This early prediction dramatically cuts the number of runtime dependence violations.

The architectural support consists of three main components. First, a Multi‑Versioning Cache stores separate versions of cache lines for each speculative thread. When a thread writes to a location, the original line is retained with a version tag, and a new version is allocated for the write. This isolation prevents direct data corruption between threads and enables fast rollback to a prior version if a violation is later detected. Second, a thread‑state controller classifies threads into four states—Active, Speculative, Committed, and Aborted—and manages state transitions via dedicated hardware registers and micro‑code. This fine‑grained control ensures that a thread is only committed after all its dependencies have been verified, and that aborts are handled deterministically. Third, a snooping‑bus based cache coherence protocol monitors all memory accesses on a shared bus. When a thread attempts to write a line, the bus broadcasts the address; any other core that is currently reading that line in a speculative state receives a violation signal and immediately aborts. Because the detection occurs at the bus level, the latency between the actual conflict and its detection is limited to a few clock cycles.

The authors evaluated Prophet using cycle‑accurate simulations of 4‑core and 8‑core CMP configurations, running benchmarks from SPEC CPU2000, PARSEC, and additional real‑world workloads. Compared with a baseline TLS system lacking p‑slice prediction, Prophet achieved average speedups of 25–35% across the suite, with the most memory‑intensive programs (e.g., mcf, bzip2, fluidanimate) approaching 40% improvement. The multi‑version cache and thread‑state mechanisms introduced less than 5% overhead, and the snooping bus detected violations within 1–2 cycles, keeping rollback costs minimal.

The paper also discusses limitations. Generating accurate p‑slices can be computationally expensive and may lose precision for programs with complex control flow. The snooping‑bus approach, while simple, may become a scalability bottleneck as core counts rise, leading to bus contention. The authors propose future work such as dynamic p‑slice refinement, replacing the bus with a directory‑based coherence scheme, and adding dedicated hardware accelerators for version management to further reduce overhead.

In summary, Prophet presents a coherent co‑design of software prediction and hardware support that effectively mitigates the primary drawbacks of speculative multi‑threading. By predicting inter‑thread dependencies early, isolating speculative data through multi‑version caching, and employing fast hardware violation detection, Prophet demonstrates substantial performance gains on CMP platforms, making it a compelling model for future high‑performance, general‑purpose processors.