SPPAM: Signature Pattern Prediction and Access-Map Prefetcher

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The widening gap between processor speed and memory-system performance continues to limit many workloads. Cache prefetching is an effective and well-studied technique for addressing this issue, and many prefetcher designs have been proposed, with varying approaches and effectiveness. For example, SPP is a popular prefetcher that uses confidence-throttled recursion to speculate on the future path of a program's references; however, it is highly susceptible to reference reordering by higher-level caches and the out-of-order core. Orthogonally, AMPM is another popular approach that uses reordering-resistant access maps to identify patterns within a region, but it cannot speculate beyond that region. In this paper, we propose SPPAM, a new approach to prefetching inspired by prior work such as SPP and AMPM while addressing their limitations. SPPAM uses online learning to build a set of access-map patterns, which drive a speculative lookahead throttled by a confidence metric. Targeting the second-level cache, SPPAM alongside the state-of-the-art prefetchers Berti and Bingo improves system performance by 31.4% over no prefetching and by 6.2% over a Berti and Pythia baseline.


💡 Research Summary

The paper addresses the persistent “Memory Wall” problem by introducing SPPAM, a novel second‑level cache (L2) prefetcher that synergistically combines the strengths of the Signature Path Prefetcher (SPP) and the Access Map Prefetcher (AMPM) while mitigating their respective weaknesses.

Motivation and Background
SPP excels at deep recursive look‑ahead using delta‑signatures, but its predictions are order‑dependent; out‑of‑order cores and higher‑level cache reordering corrupt the signatures, leading to missed opportunities. AMPM, on the other hand, uses order‑agnostic access‑maps within a region, making it robust to reordering, yet it relies on a static catalog of patterns and cannot speculate beyond the region boundaries. Modern workloads interleave memory accesses from multiple data structures, causing short‑lived, fluctuating patterns that neither approach can reliably capture.

Core Design
SPPAM builds on AMPM’s region tracking: the L2 cache is divided into 4 KiB regions, each represented by a 64‑bit access bitmap and a parallel prefetch bitmap. When a region experiences enough activity (controlled by an access‑counter and an inactivity timer), a “scrape” operation extracts forward (positive) and backward (negative) patterns from the bitmap using a sliding window whose size matches the pattern table width.
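The scrape step can be sketched as a sliding window over the region's access bitmap. The following is a minimal illustrative model, not the paper's implementation: the window width, the skipping of empty windows, and the encoding of patterns as bit tuples are all assumptions.

```python
# Hypothetical sketch of the "scrape" operation: slide a W-bit window over a
# region's 64-bit access bitmap and collect forward (ascending-address) and
# backward (descending-address) patterns paired with the bit that follows
# each window. WINDOW is an assumed pattern-table width.

WINDOW = 8

def scrape(bitmap: int, window: int = WINDOW):
    """Extract (pattern, next_bit) pairs in both directions from a 64-bit map."""
    bits = [(bitmap >> i) & 1 for i in range(64)]
    forward, backward = [], []
    for i in range(64 - window):
        pat = tuple(bits[i:i + window])
        if any(pat):                      # skip windows with no accesses
            forward.append((pat, bits[i + window]))
    rbits = bits[::-1]                    # backward = same walk on the reversed map
    for i in range(64 - window):
        pat = tuple(rbits[i:i + window])
        if any(pat):
            backward.append((pat, rbits[i + window]))
    return forward, backward
```

Scanning the reversed bitmap yields the "negative" patterns with the same loop, which keeps the hardware analogy simple: one scanner, two directions.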

These patterns index into two direct‑mapped pattern tables (positive and negative). Each table entry stores an associative set of possible predictions; the most frequent prediction is selected as the next prefetch candidate. Because the prediction itself can be fed back as a new index, SPPAM inherits SPP’s recursive look‑ahead capability, but the look‑ahead is denser: a single table lookup can generate multiple consecutive prefetches.
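The recursive look-ahead over a pattern table might look like the sketch below. The table size, the hash-based direct mapping, and the feedback of a prediction into the next index are illustrative assumptions; the paper's actual indexing and entry format may differ.

```python
# Minimal sketch of recursive look-ahead over a direct-mapped pattern table.
# Each entry keeps candidate predictions with frequency counts; the most
# frequent prediction is issued and fed back as part of the next index,
# mirroring SPP-style recursion. Sizes and depth are assumed values.
from collections import Counter

class PatternTable:
    def __init__(self, sets: int = 256):
        self.entries = [Counter() for _ in range(sets)]

    def _index(self, pattern):
        # direct-mapped: the pattern hashes straight to one entry
        return hash(pattern) % len(self.entries)

    def train(self, pattern, prediction):
        self.entries[self._index(pattern)][prediction] += 1

    def lookahead(self, pattern, depth: int = 4):
        """Follow the most-frequent-prediction chain up to `depth` steps."""
        prefetches = []
        for _ in range(depth):
            entry = self.entries[self._index(pattern)]
            if not entry:
                break
            pred, _count = entry.most_common(1)[0]
            prefetches.append(pred)
            # feed the prediction back in as the newest element of the index
            pattern = pattern[1:] + (pred,)
        return prefetches
```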

Confidence and Throttling
Every pattern entry maintains an 8‑bit "useful" counter, an 8‑bit "useless" counter, and a 4‑bit "usefulness" field. When a prefetched line is later used, the useful counter increments; when it is evicted without use, the useless counter increments. Once their sum reaches a configurable sampling threshold, the usefulness value is recomputed. If any entry's usefulness saturates at its maximum, all entries are halved (aging) to prevent stale patterns from dominating.
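A toy model of this feedback loop is shown below. The sampling threshold and the recompute formula (usefulness as a scaled hit fraction) are assumptions chosen to match the stated bit widths, not the paper's tuned values.

```python
# Illustrative model of a pattern entry's confidence state: 8-bit useful /
# useless counters feed a 4-bit usefulness field, recomputed once enough
# samples accumulate; all entries are halved ("aged") when any usefulness
# saturates, so stale patterns fade.
SAMPLE_THRESHOLD = 32     # assumed sampling threshold
USEFULNESS_MAX = 15       # 4-bit field

class ConfidenceEntry:
    def __init__(self):
        self.useful = 0
        self.useless = 0
        self.usefulness = USEFULNESS_MAX // 2   # start at mid confidence

    def record(self, was_useful: bool):
        if was_useful:
            self.useful = min(self.useful + 1, 255)      # 8-bit saturating
        else:
            self.useless = min(self.useless + 1, 255)
        if self.useful + self.useless >= SAMPLE_THRESHOLD:
            # recompute the 4-bit usefulness as a scaled hit fraction
            total = self.useful + self.useless
            self.usefulness = (self.useful * USEFULNESS_MAX) // total
            self.useful = self.useless = 0

def age_all(entries):
    """Halve every usefulness once any entry saturates."""
    if any(e.usefulness >= USEFULNESS_MAX for e in entries):
        for e in entries:
            e.usefulness //= 2
```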

A global usefulness tracker samples overall prefetch outcomes and stores a 4‑bit global confidence that is used when region‑local feedback is unavailable (e.g., the region has already been evicted). A 4‑bit “region lifespan” metric indicates how many useless prefetches still belong to active regions; low lifespan triggers reliance on the global metric.

Three throttling mechanisms adapt the prefetch rate:

  1. Usefulness‑based degree and drop‑chance tables – as usefulness declines, the number of prefetches issued per cache miss (prefetch degree) drops, and the probability of dropping a prefetch increases.
  2. DRAM bandwidth feedback – a 4‑bit bandwidth utilization signal from the memory controller proportionally reduces the prefetch degree, preventing memory‑channel saturation.
  3. Dynamic drop counters – a 7‑bit cycle counter decides whether a particular prefetch should be suppressed based on the current drop‑chance value.
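The three throttles above compose naturally: usefulness indexes a degree table and a drop-chance table, and the bandwidth signal scales the result. The table contents below are invented for illustration; only the bit widths (4-bit usefulness, 4-bit bandwidth signal, 7-bit cycle counter) come from the description.

```python
# Hedged sketch of the three throttling mechanisms working together.
# Table values are illustrative, not the paper's tuned constants.
DEGREE_TABLE = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8]   # indexed by 4-bit usefulness
DROP_CHANCE  = [96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 12, 8, 4, 2, 0]  # out of 128

def prefetch_degree(usefulness: int, bw_util: int) -> int:
    """Degree grows with usefulness and shrinks proportionally as DRAM fills.

    bw_util is the 4-bit utilization signal from the memory controller
    (0 = idle, 15 = saturated)."""
    base = DEGREE_TABLE[usefulness & 0xF]
    return max(1, (base * (16 - bw_util)) // 16)

def should_drop(usefulness: int, cycle_counter: int) -> bool:
    """A 7-bit cycle counter is compared against the drop-chance value."""
    return (cycle_counter & 0x7F) < DROP_CHANCE[usefulness & 0xF]
```

Using a free-running cycle counter as the randomness source is a cheap hardware idiom: no LFSR is needed, and over many prefetches the drop rate converges to drop_chance / 128.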

Filtering and Multi‑Level Coordination
SPPAM filters duplicate prefetches using its L2 region table; a line already prefetched will not be re‑issued until the region is reclaimed. When the L2 Miss‑Status‑Holding‑Register (MSHR) is full, SPPAM can forward prefetches directly to the LLC. To avoid redundant prefetches in the LLC, a second region table tracks recently accessed or prefetched LLC blocks; any prefetch targeting a block already present in this table is dropped.
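The duplicate filter reduces to a per-region prefetch bitmap check. The sketch below assumes 4 KiB regions of 64-byte lines (64 bits per region) as stated earlier; the dictionary is an illustrative stand-in for the hardware region table.

```python
# Minimal sketch of duplicate-prefetch filtering: a line already marked in a
# region's prefetch bitmap is not re-issued until the region entry is
# reclaimed. Region geometry (4 KiB / 64 B lines) follows the description.
LINE = 64
REGION_LINES = 64   # 4 KiB region / 64 B lines = 64 bits

class RegionFilter:
    def __init__(self):
        self.prefetched = {}            # region base -> 64-bit prefetch bitmap

    def try_issue(self, addr: int) -> bool:
        base = addr // (LINE * REGION_LINES)
        bit = (addr // LINE) % REGION_LINES
        bmp = self.prefetched.get(base, 0)
        if bmp & (1 << bit):
            return False                # duplicate within this region: drop
        self.prefetched[base] = bmp | (1 << bit)
        return True

    def reclaim(self, addr: int):
        """Evict the region entry, allowing its lines to be prefetched again."""
        self.prefetched.pop(addr // (LINE * REGION_LINES), None)
```

The LLC-side filter described above works the same way, just keyed on recently accessed or prefetched LLC blocks instead of L2 prefetches.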

The design also integrates with existing prefetchers: Berti (L1D) supplies virtual‑address region association information, enabling cross‑page pattern learning, while Bingo (LLC) operates in a core‑partitioned mode to avoid interference with SPPAM’s learning. Minor enhancements to Berti and Bingo are described (e.g., Berti’s filter inspired by AMPM’s region tables).

Hardware Overhead
Table 1 in the paper lists the per‑core storage requirements, totaling 109.16 KB. The largest structures are the L2 region table (≈92.6 KB) and the pattern history table (≈56.3 KB). Compared to the original SPP (≈2.5 KB) and AMPM (≈3 KB) the overhead is substantially larger, but it remains practical for a modern core that already dedicates tens of kilobytes to prefetcher state.

Evaluation
The authors evaluate SPPAM on a 4 GHz, 8‑wide issue/retire out‑of‑order core with a Perceptron branch predictor, 576‑entry ROB, and 352‑entry LSQ. Benchmarks include SPEC‑CPU2006, PARSEC, and several memory‑intensive real‑world applications. Baselines are: (a) no prefetching, (b) Berti + Bingo (state‑of‑the‑art L1D/LLC combo), and (c) Berti + Pythia (another strong L1D/LLC pair).

Results show an average IPC improvement of 31.4% over the no‑prefetch baseline and 6.2% over the Berti+Pythia baseline. The gains are most pronounced on streaming and index‑heavy workloads where deep look‑ahead and region‑wide pattern reuse are beneficial. SPPAM maintains higher accuracy than SPP under heavy reordering, and it outperforms AMPM on workloads that require speculation beyond a single region.

Conclusions and Future Work
SPPAM demonstrates that (1) order‑agnostic region tracking can provide a stable foundation for pattern extraction, (2) a multi‑level confidence system enables aggressive yet safe speculative prefetching, (3) dynamic throttling based on bandwidth and global usefulness prevents resource over‑commitment, and (4) seamless integration with existing L1D/LLC prefetchers yields additive performance benefits. The authors suggest future research directions such as adaptive pattern‑table sizing, cross‑core coordination for shared regions, and machine‑learning‑enhanced confidence estimation.
