Analytical Modeling the Multi-Core Shared Cache Behavior with Considerations of Data-Sharing and Coherence

To mitigate the ever-worsening “Power wall” and “Memory wall” problems, multi-core architectures with multilevel cache hierarchies have been widely adopted in modern processors. However, the complexity of these architectures makes modeling of shared caches extremely difficult. In this paper, we propose a data-sharing-aware analytical model for estimating the miss rates of the downstream shared cache under multi-core scenarios. Moreover, the proposed model can be integrated with upstream cache analytical models that account for coherence effects among multi-core private caches. This integration avoids the time-consuming full simulations of the cache architecture required by conventional approaches. We validate our analytical model against gem5 simulation results for 13 applications from the PARSEC 2.1 benchmark suite. Compared to gem5 simulations of 8 hardware configurations, including dual-core and quad-core architectures, the average absolute error of the predicted shared L2 cache miss rates is less than 2% for all configurations. After integration with the refined upstream model that includes coherence misses, the overall average absolute error across 4 hardware configurations degrades to 8.03% due to error accumulation. The proposed coherence model achieves accuracy similar to that of the state-of-the-art approach with only one tenth of the time overhead. As an application case of the integrated model, we also evaluate the miss rates of 57 different multi-core and multi-level cache configurations.


💡 Research Summary

The paper addresses a critical challenge in modern processor design: accurately predicting the behavior of shared caches in multi‑core systems while accounting for inter‑core data sharing and cache‑coherence effects. As power and memory walls continue to limit performance gains, designers rely increasingly on deep multi‑level cache hierarchies. However, the complexity of these hierarchies makes conventional simulation‑based evaluation prohibitively expensive, especially when exploring large design spaces.

To overcome this, the authors propose a comprehensive analytical model that estimates the miss rate of a downstream shared L2 cache in the presence of multi‑core workloads. The model is built in four logical stages. First, it uses established probabilistic techniques (reuse‑distance and stack‑distance analysis) to characterize the hit‑miss behavior of each core’s private L1 cache based on the observed memory‑access trace. Second, it quantifies inter‑core data sharing by identifying shared objects, measuring their access frequencies, and determining the sharing degree across cores. This information feeds a coherence sub‑model that captures the state‑transition probabilities of a typical MESI‑type protocol, explicitly modeling invalidations, write‑backs, and read‑for‑ownership traffic that generate coherence‑induced misses.
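The stack-distance analysis underlying the first stage can be illustrated with a short sketch. This is not the paper's implementation, just a minimal, fully associative LRU version of the standard technique: an access hits if the number of distinct addresses touched since its previous use (its stack distance) fits within the cache capacity.

```python
from collections import OrderedDict

def stack_distances(trace):
    """LRU stack distance per access; first-touch accesses get
    distance = infinity (compulsory misses)."""
    stack = OrderedDict()          # most recently used entries at the end
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # distinct addresses touched since the last use of addr
            dists.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            dists.append(float("inf"))
            stack[addr] = True
    return dists

def lru_hit_rate(trace, capacity):
    """Fully associative LRU hit rate: an access hits iff its stack
    distance is smaller than the cache capacity (in lines)."""
    dists = stack_distances(trace)
    return sum(1 for d in dists if d < capacity) / len(dists)
```

The paper's model additionally handles set associativity and probabilistic reuse, but the hit/miss criterion is the same: compare each reuse distance against the effective capacity.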

Third, the model merges the L1 miss stream with the coherence miss stream and feeds the combined request flow into an analytical representation of the shared L2 cache. The L2 model incorporates line size, associativity, replacement policy (e.g., LRU or PLRU), and, crucially, the contention among concurrent core requests. By treating the L2 as a shared resource with finite bandwidth, the model predicts the probability that any incoming miss will be satisfied from the L2 versus being forwarded to main memory.
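A toy version of this third stage can be sketched as follows. The round-robin interleaving of per-core miss streams is a simplifying assumption standing in for the paper's contention model, and the L2 is idealized as a fully associative LRU cache; the function names are illustrative, not from the paper.

```python
from collections import OrderedDict
import itertools

def merge_streams(per_core_misses):
    """Round-robin interleave per-core L1 miss streams into one
    L2 request stream (a simplifying assumption)."""
    merged = []
    for group in itertools.zip_longest(*per_core_misses):
        merged.extend(line for line in group if line is not None)
    return merged

def shared_l2_miss_rate(per_core_misses, capacity):
    """Miss rate of an idealized fully associative LRU L2 fed by
    the merged request stream."""
    cache = OrderedDict()
    misses = 0
    stream = merge_streams(per_core_misses)
    for line in stream:
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(stream)
```

Even in this toy form, the key effect the paper models is visible: cores sharing lines shorten each other's reuse distances in the merged stream, while cores with disjoint working sets lengthen them and raise the L2 miss rate.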

Finally, the authors integrate the L2 miss prediction with a higher‑level performance estimator to obtain system‑wide metrics such as average memory‑access latency and instructions‑per‑cycle (IPC). Validation is performed against detailed gem5 simulations using 13 applications from the PARSEC 2.1 benchmark suite across eight hardware configurations (dual‑core and quad‑core variants). The results show that the analytical model predicts shared L2 miss rates with an average absolute error below 2 %, and when the upstream L1 and coherence effects are also included, the overall miss‑rate error rises to 8.03 % owing to error accumulation. Notably, the analytical approach requires roughly one‑tenth of the time needed by the state‑of‑the‑art approach, delivering comparable accuracy at a dramatically reduced computational cost.
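Once the per-level miss rates are predicted, the average memory-access latency can be composed with the standard AMAT recurrence; this one-liner uses the textbook formula rather than the paper's specific estimator, and the example latencies are illustrative placeholders.

```python
def amat(l1_hit_time, l1_miss_rate, l2_hit_time, l2_miss_rate, mem_latency):
    """Average memory-access time (cycles) for a two-level hierarchy:
    AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * t_mem)."""
    return l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * mem_latency)

# Illustrative numbers: 1-cycle L1, 10-cycle L2, 100-cycle memory,
# with predicted miss rates of 10% (L1) and 20% (L2).
print(amat(1, 0.10, 10, 0.20, 100))   # -> 4.0
```

This is the point where the model's predicted miss rates plug into a performance estimate without any further simulation.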

Beyond validation, the authors demonstrate the model’s practical utility by evaluating miss rates for 57 distinct multi‑core and multi‑level cache configurations, illustrating how designers can rapidly explore trade‑offs among cache size, associativity, line size, and coherence protocol parameters without resorting to exhaustive simulation.

Key contributions of the work include:

  1. A unified analytical framework that simultaneously captures private‑cache behavior, inter‑core data sharing, and coherence‑induced misses.
  2. Demonstrated accuracy (≤2 % error for L2 miss rates) and substantial speed‑up (≈10× faster than full simulation) across a diverse set of workloads and hardware setups.
  3. Scalability to design‑space exploration, enabling quick assessment of a large number of cache configurations and providing actionable insights for power‑performance optimization in contemporary multi‑core processors.

In summary, this paper delivers a robust, low‑overhead analytical tool that bridges the gap between fast, coarse‑grained estimations and expensive, cycle‑accurate simulations, thereby facilitating more efficient and informed cache hierarchy design in the era of ever‑growing core counts.

