Gemini: Reducing DRAM Cache Hit Latency by Hybrid Mappings

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Die-stacked DRAM caches are increasingly advocated to bridge the performance gap between on-chip caches and main memory. It is essential to improve the DRAM cache hit rate and lower the cache hit latency simultaneously. Prior DRAM cache designs fall into two categories according to their data mapping policies, set-associative and direct-mapped, each achieving only one of these goals. In this paper, we propose Gemini, a partial direct-mapped die-stacked DRAM cache that achieves both objectives simultaneously. Gemini is motivated by the following observation: applying a unified mapping policy to all blocks cannot deliver both a high cache hit rate and a low hit latency. The Gemini cache classifies data into leading blocks and following blocks, and places them with static mapping and dynamic mapping respectively in a unified set-associative structure. Gemini also introduces a replacement policy that balances the different miss penalties of the two block types against recency, and provides strategies to mitigate cache thrashing caused by block type transitions. Experimental results demonstrate that the Gemini cache significantly narrows the hit latency gap with a direct-mapped cache, from 1.75X to 1.22X on average, while achieving a hit rate comparable to a set-associative cache. Compared with the state-of-the-art baseline, the enhanced Loh-Hill cache, Gemini improves IPC by up to 20%.


💡 Research Summary

The paper addresses a fundamental dilemma in die‑stacked DRAM caches used as last‑level caches: direct‑mapped designs offer the lowest hit latency because a tag and its data are fetched together, but suffer from poor hit rates due to conflict misses; set‑associative designs achieve high hit rates by allowing multiple blocks per set, yet incur substantial hit‑latency overhead because the tag must be read before the data (tag‑then‑data serialization). Existing works have treated these two approaches as mutually exclusive, optimizing either latency or hit rate but not both.
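The latency trade-off above can be captured in a toy cost model. This is an illustrative sketch, not the paper's measurement methodology, and all cycle counts (`dram_cycles`, `sram_cycles`) are assumed values chosen only to make the contrast visible:

```python
def hit_latency(tag_in_sram: bool, serialized: bool,
                dram_cycles: int = 120, sram_cycles: int = 2) -> int:
    """Cycles to serve a DRAM-cache hit under a simple two-step model."""
    tag = sram_cycles if tag_in_sram else dram_cycles
    data = dram_cycles
    # Direct-mapped designs fetch the tag together with its data, so the
    # tag check fully overlaps the data fetch; set-associative designs
    # must resolve the tag first (tag-then-data serialization).
    return tag + data if serialized else max(tag, data)

# Direct-mapped: tag travels with the data, one DRAM access total.
print(hit_latency(tag_in_sram=False, serialized=False))  # 120
# Set-associative, tags in DRAM: two serialized DRAM accesses.
print(hit_latency(tag_in_sram=False, serialized=True))   # 240
# Set-associative with a tag-cache hit: only a small SRAM lookup is added.
print(hit_latency(tag_in_sram=True, serialized=True))    # 122
```

The third case is why a tag cache helps following blocks so much: once the tag is in SRAM, the serialized tag read nearly disappears from the critical path.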

Gemini introduces a hybrid mapping that exploits the observation that, when a small SRAM tag cache is employed, the first access to a DRAM‑cache set (the “leading block”) triggers a tag‑batch fetch from DRAM into the tag cache, while subsequent accesses to the same set (“following blocks”) benefit from the tags already resident in SRAM. Empirical measurements on 18 workloads show that about 89% of tag fetches are caused by leading blocks, and that 97% of the hit latency for following blocks is due solely to the data fetch. Moreover, the miss penalty for a leading block (≈273 cycles) is roughly 1.3× higher than that for a following block.

Based on these facts, Gemini classifies every cache line as either a leading block or a following block. It applies static direct mapping to leading blocks, eliminating the tag‑then‑data serialization for the majority of tag fetches and thereby reducing hit latency dramatically. For following blocks it retains dynamic set‑associative mapping, preserving the high hit rate of a conventional set‑associative cache.
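The placement split can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation; the way count and the convention that way 0 is the direct-mapped slot are assumptions:

```python
WAYS = 4
LEADING_WAY = 0  # assumed convention: way 0 is the statically mapped slot

def place(cache_set: list, tag: int, is_leading: bool) -> int:
    """Install `tag` in the set and return the way index used."""
    if is_leading:
        # Static direct mapping: a leading block always lands in the same
        # way, so a hit needs no prior tag lookup to locate the data.
        cache_set[LEADING_WAY] = tag
        return LEADING_WAY
    # Dynamic set-associative mapping over the remaining ways.
    for way in range(1, WAYS):
        if cache_set[way] is None:
            cache_set[way] = tag
            return way
    victim = 1  # placeholder; the RV-CLOCK policy would choose the victim
    cache_set[victim] = tag
    return victim

s = [None] * WAYS
assert place(s, tag=0xA, is_leading=True) == LEADING_WAY
assert place(s, tag=0xB, is_leading=False) == 1
```

The key property is that a leading-block hit can fetch data from the fixed way immediately, while following blocks still enjoy full associativity within the set.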

Because leading and following blocks have different miss costs, Gemini proposes a replacement policy called Range‑Variable CLOCK (RV‑CLOCK). RV‑CLOCK extends the classic CLOCK algorithm with per‑type weightings, giving higher protection to leading blocks (which are more expensive to miss) while still respecting recency information.
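The idea behind a range-variable CLOCK can be sketched as below. The weights are illustrative assumptions, not the paper's parameters; the point is only that a leading block's counter starts (and refills) higher, so it survives more sweeps of the clock hand:

```python
LEADING_WEIGHT, FOLLOWING_WEIGHT = 3, 1  # assumed per-type counter ranges

class Block:
    def __init__(self, tag: int, leading: bool):
        self.tag = tag
        self.leading = leading
        self.count = self._range()

    def _range(self) -> int:
        return LEADING_WEIGHT if self.leading else FOLLOWING_WEIGHT

def touch(block: Block) -> None:
    # On a hit, recharge the counter to the block type's full range,
    # preserving recency information as in classic CLOCK.
    block.count = block._range()

def evict(blocks: list, hand: int = 0) -> int:
    """Sweep like CLOCK, decrementing counters until one reaches zero."""
    while True:
        b = blocks[hand]
        if b.count == 0:
            return hand            # victim found
        b.count -= 1               # second chance, weighted by block type
        hand = (hand + 1) % len(blocks)

blocks = [Block(1, leading=True), Block(2, leading=False)]
print(evict(blocks))  # 1: the cheaper-to-miss following block is evicted
```

With these weights, a leading block needs three unreferenced sweeps before becoming a victim, while a following block needs only one, biasing eviction toward the blocks with the lower miss penalty.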

A second challenge is the occasional transition of a block’s type (e.g., a following block becoming a leading block after the tag cache evicts its tag). Frequent type switches could cause “cache thrashing” and negate the benefits of the hybrid scheme. Gemini mitigates this with two mechanisms:

  1. Priority reservation – recently identified leading blocks are temporarily reserved in the direct‑mapped region, preventing premature eviction.
  2. High‑frequency variation filter – monitors the rate of type changes and forces a block back to dynamic mapping if its type is unstable.
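The two mechanisms might be combined roughly as in the sketch below. The reservation length and flip threshold are hypothetical knobs introduced for illustration; the paper's actual parameters and bookkeeping are not reproduced here:

```python
RESERVE_ACCESSES = 64  # assumed: how long a new leading block stays reserved
FLIP_LIMIT = 4         # assumed: max tolerated type flips per window

class BlockState:
    def __init__(self):
        self.flips = 0           # type changes seen (window reset omitted)
        self.reserved_until = 0  # access timestamp until which it is pinned

def on_type_change(state: BlockState, now: int, becomes_leading: bool) -> str:
    state.flips += 1
    if state.flips > FLIP_LIMIT:
        # High-frequency variation filter: the block's type is unstable,
        # so force it back to dynamic (set-associative) mapping.
        return "dynamic"
    if becomes_leading:
        # Priority reservation: protect the fresh leading block from
        # eviction out of the direct-mapped slot for a while.
        state.reserved_until = now + RESERVE_ACCESSES
        return "leading"
    return "following"

def can_evict(state: BlockState, now: int) -> bool:
    return now >= state.reserved_until
```

Both mechanisms attack the same failure mode: without them, a block oscillating between types would repeatedly relocate between the static and dynamic regions, thrashing the set.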

Experimental evaluation uses a 512 MB stacked DRAM cache with 64‑byte lines. Relative to a pure direct‑mapped cache, Gemini narrows the average hit‑latency gap from 1.75× to 1.22×, while achieving hit rates comparable to a pure set‑associative cache. When benchmarked against the state‑of‑the‑art enhanced Loh‑Hill cache (a set‑associative design with a tag cache), Gemini improves IPC by up to 20% across a suite of 18 workloads.

In summary, Gemini demonstrates that block‑level heterogeneity—distinguishing leading from following blocks—allows a DRAM cache to simultaneously enjoy the low latency of direct mapping and the high hit rate of set associativity. The combination of hybrid mapping, a miss‑aware replacement policy, and mechanisms to stabilize block types provides a practical pathway for future high‑capacity 3D‑stacked DRAM caches in modern processors.

