Making Belady-Inspired Replacement Policies More Effective Using Expected Hit Count

Methodology

Simulation Infrastructure

We evaluate our proposal using the simulation framework released for the Second Cache Replacement Championship (CRC-2). Table 1 summarizes the key elements of our methodology. We target both single-core and four-core processors with a 2 MB per-core shared LLC. The processors employ a non-inclusive cache hierarchy with LRU as the default replacement policy. We report both cache statistics and the end-to-end performance of the competing policies.

Table 1: Evaluation parameters.

Parameter          Value
-----------------  --------------------------------------------
Processing Nodes   6-stage pipeline, 256-entry ROB
L1-D/I Caches      32 KB, 8-way, 4-cycle load-to-use
Private L2 Cache   256 KB, 8-way, 8-cycle access latency
Shared LLC         2 MB per core, 16-way, 20-cycle hit latency
Data Prefetchers   L1: next-line; L2: PC-based stride

We use the SPEC CPU2006 benchmarks to evaluate the competing replacement policies. For the multi-program workloads, we randomly choose one hundred combinations of single-core programs and use them to evaluate the competing policies on a four-core processor. For single-core evaluations, we execute 4 billion instructions, using the first half for warm-up and the rest for measurements. For multi-core evaluations, we execute 2 billion instructions per core, again using the first half for warm-up and the rest for measurements.

Evaluated Methods

We evaluate the following replacement policies:
Baseline LRU. The well-known Least Recently Used (LRU) replacement policy serves as the baseline in our evaluations. It keeps four bits per block in each set to maintain the LRU stack.
Dynamic RRIP (DRRIP). Each block has a 3-bit Re-Reference Prediction Value (RRPV). Upon a hit, the RRPV of the block is set to zero; upon a miss, the block with the maximum RRPV is evicted. If no block has the maximum RRPV, the RRPVs of all blocks are incremented, and this procedure is repeated until at least one block reaches the maximum. DRRIP uses set-dueling to choose an insertion policy between Static RRIP (SRRIP) and Bimodal RRIP (BRRIP). In SRRIP, all blocks are inserted with an RRPV of maximum minus one. In BRRIP, an inserted block receives an RRPV of maximum minus one with probability $`\frac{1}{32}`$, and of maximum with probability $`\frac{31}{32}`$. Thirty-two random sets emulate SRRIP and another thirty-two emulate BRRIP; the remaining sets follow the winner of the duel. The total area overhead of DRRIP is 12 KB/48 KB in a single-core/four-core substrate.
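The RRPV aging and insertion rules described above can be sketched as follows. This is a minimal illustration of the SRRIP/BRRIP mechanics only (set-dueling omitted); the class and method names are ours, not from the paper:

```python
import random

RRPV_MAX = 7  # 3-bit re-reference prediction value

class RRIPSet:
    """One cache set under an RRIP-style policy (illustrative sketch)."""

    def __init__(self, ways=16):
        # All ways start at the maximum RRPV, i.e., eligible for eviction.
        self.rrpv = [RRPV_MAX] * ways

    def on_hit(self, way):
        # A hit predicts near-immediate re-reference.
        self.rrpv[way] = 0

    def find_victim(self):
        # Evict a block with RRPV == max; if none exists, age every block
        # and retry until at least one block reaches the maximum.
        while True:
            for way, v in enumerate(self.rrpv):
                if v == RRPV_MAX:
                    return way
            self.rrpv = [v + 1 for v in self.rrpv]

    def insert(self, way, brrip=False):
        if brrip:
            # BRRIP: insert with max-1 with probability 1/32, else max.
            self.rrpv[way] = RRPV_MAX - 1 if random.random() < 1 / 32 else RRPV_MAX
        else:
            # SRRIP: always insert with max-1.
            self.rrpv[way] = RRPV_MAX - 1
```

Under set-dueling, a few sampled sets would fix `brrip=False` or `brrip=True` and a saturating counter would pick the insertion mode for the remaining sets.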
SHiP. The Signature-based Hit Predictor replacement policy builds on RRIP but attempts to distinguish dead blocks from live ones, i.e., blocks that will be re-referenced in the cache. It uses the PC of the corresponding instruction to classify blocks as dead or live. It differs from RRIP in that it sets the RRPV of a block at insertion based on the dead-block prediction. SHiP uses an LRU replacement simulator to determine the cache-friendliness/-averseness of blocks and stores the outcome in a dedicated structure. In addition, SHiP uses two bits per block to store the RRPV. The total area overhead of SHiP is 39 KB/156 KB in a single-core/four-core substrate.

Figure 1: MPKI reduction of the competing replacement policies over the baseline LRU.

Multiperspective. This replacement policy leverages machine-learning concepts to determine whether an incoming block should bypass the cache. It also uses a feedback mechanism to assist in choosing a victim. The learning mechanism exploits several features to produce its output and has a sampler unit to calculate the weight of each feature. The sampler and other metadata structures occupy approximately 25.25 KB/95.5 KB of storage in a single-core/four-core processor. The baseline replacement policy of this algorithm differs between single-core and four-core systems: in single-core systems, it uses PseudoLRU as the baseline policy of the main cache, with 3.75 KB of hardware overhead; in four-core systems, SRRIP is the baseline policy, with 32 KB of additional storage. The total storage overhead of this method is 29 KB/127.5 KB in a single-core/four-core substrate.
Hawkeye. Hawkeye uses an optimal replacement simulator, named OPTGen, to simulate Belady's MIN and classify load instructions (PCs) as cache-friendly or cache-averse. OPTGen's hardware overhead is 15.2 KB. Moreover, Hawkeye uses three bits for the RRPV, which are set upon each access based on the friendliness or averseness of the PC of the incoming access. Upon a miss, Hawkeye evicts a cache-averse block. If there is no cache-averse block in the set, it chooses the oldest block in the set as the victim. The total overhead of this replacement policy is 30 KB/90 KB in a single-core/four-core substrate.
EHC. Our proposal, implemented on top of Hawkeye. EHC changes the victim-selection mechanism of Hawkeye to select a victim more effectively when all cache blocks in a given set are predicted to be cache-friendly. The total storage overhead of EHC on top of Hawkeye is 12 KB per core.
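The fallback idea behind EHC can be illustrated with a small sketch. All names and the averaging scheme below are our own simplifications for exposition, not the paper's exact hardware design: when every block in a set is predicted cache-friendly, the victim is the block with the fewest expected remaining hits, estimated from the hit-count history of its 128 KB memory region.

```python
REGION_BITS = 17  # 128 KB regions (64 B blocks x 2048 blocks)

def region_of(block_addr):
    """Map a block address to its memory-region identifier."""
    return block_addr >> REGION_BITS

class EHCFallback:
    """Illustrative region-level expected-hit-count tracker (our sketch)."""

    def __init__(self):
        self.region_ehc = {}  # region -> running average of per-residency hit counts

    def record_eviction(self, block_addr, hits):
        # On eviction, fold the block's hit count into its region's average.
        r = region_of(block_addr)
        old = self.region_ehc.get(r, hits)
        self.region_ehc[r] = (old + hits) / 2  # simple halving average

    def choose_victim(self, blocks):
        # blocks: list of (block_addr, hits_so_far). Evict the block with the
        # fewest expected remaining hits (region average minus hits already seen).
        def remaining(entry):
            addr, hits = entry
            return self.region_ehc.get(region_of(addr), 0) - hits
        return min(range(len(blocks)), key=lambda i: remaining(blocks[i]))
```

A real implementation would use small saturating counters and sampled history rather than a Python dictionary; the sketch only conveys the decision rule.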

Introduction

The ever-increasing expansion of datasets in memory-intensive applications has resulted in massive working sets beyond what the on-chip caches of modern processors can capture. As a result, processors executing such applications encounter frequent data misses and lose significant performance potential. Among data misses, which occur at various levels of a modern deep cache hierarchy, Last-Level Cache (LLC) misses matter most: every LLC miss requires accessing off-chip DRAM to fetch the data. Off-chip accesses significantly hurt system performance due to the limited bandwidth and long latency of DRAM.

While LLC misses are inevitable due to the large datasets of applications, not all off-chip misses are capacity misses. One way to reduce the number of non-capacity LLC misses is a well-behaved replacement policy, which decides, out of all possible candidates, which block should be evicted from the cache upon the arrival of a new block of data.

Figure 2: The fraction of replacement decisions made by Hawkeye when no block in the set has most recently been touched by a cache-averse load instruction.

The optimal replacement policy is Belady's MIN. Belady's MIN evicts the block of data that is going to be referenced furthest into the future. As the optimal replacement policy requires knowledge of the future, it is impractical. As a result, most replacement policies use various heuristics (e.g., recency or frequency) to evict a cache block. Unfortunately, there is a significant gap between the effectiveness of Belady's MIN and that of practical replacement policies.
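Belady's MIN can be stated compactly as an offline simulation over a known access trace. The following minimal sketch (our own illustration, assuming a fully associative cache) makes clear why it is impractical: choosing the victim requires scanning the future of the trace.

```python
def belady_min_misses(trace, capacity):
    """Count misses of Belady's MIN on a trace of block identifiers."""
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue  # hit
        misses += 1
        if len(cache) < capacity:
            cache.add(block)
            continue
        # Evict the resident block whose next use lies furthest in the future.
        def next_use(b):
            try:
                return trace.index(b, i + 1)
            except ValueError:
                return float("inf")  # never referenced again: the ideal victim
        cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses
```

The `trace.index(b, i + 1)` lookahead is exactly the future knowledge no hardware policy has, which is why practical policies can only approximate MIN.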

While implementing Belady's MIN as a whole is impractical, a few replacement policies emulate (approximate) Belady's algorithm to choose a victim for replacement. Rajan and Ramaswamy used extra storage, called a Shepherd cache, to avoid evicting blocks until future references determine which block, based on Belady's MIN, should be chosen as the victim. Unfortunately, this technique requires large storage to be truly effective. Jain and Lin built on the observation that, under Belady's MIN, some load instructions are cache-friendly and others are cache-averse. Based on this observation, they proposed Hawkeye, which uses minimal storage to emulate Belady's MIN for the purpose of determining whether a load instruction is cache-friendly or cache-averse. Hawkeye uses the cache-friendliness/-averseness of load instructions to choose a victim, i.e., evicting cache-averse blocks while retaining cache-friendly ones. The storage requirement of Hawkeye is minimal because: (1) Hawkeye stores only block addresses and not block data (as in Shepherd) to determine the cache-friendliness of load instructions, and (2) Hawkeye emulates Belady's MIN for only a small number of cache sets, leveraging the fact that load instructions behave similarly across all sets.

While classifying load instructions as cache-friendly or cache-averse is quite effective1, this technique is useful only if some blocks in the set were most recently accessed by a cache-averse load instruction. Such blocks are the prime candidates for replacement. However, if no block was most recently accessed by a cache-averse load instruction, the replacement policy must pick a victim using standard mechanisms, e.g., recency or frequency, which are in many cases ineffective.

To show how frequently a Belady-inspired replacement policy based on cache-friendly/-averse load instructions finds no cache-averse block in a set, Figure 2 shows the percentage of replacement decisions for which Hawkeye finds no cache-averse block in the set. For 0.5% to 42.9% of all replacements (15.1% on average across all benchmarks), Hawkeye must rely on traditional replacement policies to pick a victim, which limits its effectiveness. Consequently, a Belady-inspired replacement policy based on cache-friendly/-averse load instructions cannot benefit from Belady's MIN when choosing a victim in a considerable number of cases.

To address this limitation, this work makes the fundamental observation that under Belady's MIN replacement policy, there is a strong correlation among the hit counts of blocks in the same memory region2; i.e., the hit counts of two blocks from the same memory region, measured before their eviction by Belady's MIN, are correlated. Using this observation, we estimate the hit counts of blocks in various memory regions by emulating Belady's MIN, and we use the expected hit count for replacement when no cache-averse block is available in a set. Just like Hawkeye, to estimate the expected hit count of a region, we only need to store block addresses and not block data. Moreover, as we only need to estimate the hit counts of memory regions rather than of individual blocks, the storage overhead of our proposal is insignificant.

In this paper, we make the following contributions:

  • We show that a Belady-inspired replacement policy that classifies load instructions as cache-friendly or cache-averse is, in a considerable fraction of cases, unable to make replacement decisions based on the history of loads.

  • We show that with Belady’s MIN replacement policy, there is a strong correlation between the hit count of a cache block in two consecutive residencies in the cache.

  • Furthermore, we show that with Belady’s MIN replacement policy, not only is there a strong correlation between the hit counts of a cache block in two consecutive residencies, but there is also a strong correlation among the hit counts of blocks in the same memory region.

  • Using these observations, we augment a Belady-inspired replacement policy based on cache-friendly/-averse load instructions with a small structure that tracks the hit counts of various memory regions, and we use the region hit count in replacement decisions when no cache-averse block exists in a set, improving victim-selection quality.

  • We use a simulation infrastructure to evaluate our proposal in the context of both single- and multi-core processors. Our results show that our proposal offers 17.5% lower Misses Per Kilo Instructions (MPKI) and 5.2% higher performance than the baseline LRU, and it outperforms all prior state-of-the-art replacement policies.

Memory-intensive workloads operate on massive amounts of data that cannot be captured by the last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses and hence lose significant performance potential. One way to reduce the number of off-chip misses is a well-behaved replacement policy in the LLC. Existing processors employ a variation of the Least Recently Used (LRU) policy to select a victim for replacement. Unfortunately, there is a large gap between what LRU offers and what Belady’s MIN, the optimal replacement policy, offers. Belady’s MIN requires selecting the victim with the longest reuse distance and is hence infeasible, as it requires knowledge of the future. Consequently, Belady-inspired replacement policies use Belady’s MIN to derive an indicator that helps them choose a victim for replacement.

In this work, we show that the indicator that is used in the state-of-the-art Belady-inspired replacement policy is not decisive in picking a victim in a considerable number of cases, and hence, the policy has to rely on a standard metric (e.g., recency or frequency) to pick a victim, which is inefficient. We observe that there exist strong correlations among the hit counts of cache blocks in the same region of memory when Belady’s MIN is the replacement policy. Taking advantage of this observation, we propose an expected-hit-count indicator for the memory regions and use it to improve the victim selection mechanism of Belady-inspired replacement policies when the main indicator is not decisive. Our proposal offers a 5.2% performance improvement over the baseline LRU and outperforms Hawkeye, which is the state-of-the-art replacement policy.

Figure 3: Performance improvement of the competing replacement policies over the baseline LRU.

Evaluation

Miss Reduction

Figure 1 shows the MPKI reduction of the various policies over the baseline LRU. As the figure shows, our proposal outperforms all previously proposed replacement policies, reducing MPKI by 17.5% on average. The second-best policy is Hawkeye, which offers a 15.4% MPKI reduction.

Performance

Figure 3 compares the performance improvement of the evaluated policies over the baseline LRU. We use Instructions Per Cycle (IPC) as the performance metric. EHC offers the highest average performance improvement: 5.19% across all workloads. The second-best policy is Hawkeye, with an average improvement of 4.76%.

Figure 4 compares the performance improvement of the evaluated policies over the baseline LRU in a four-core system. Again, EHC achieves the highest performance, outperforming all other replacement policies on average.

Figure 4: Performance improvement of the competing replacement policies over the baseline LRU in a four-core system.

Why is EHC Effective?

Figure 5: The quality of chosen victims in Hawkeye and EHC. Blocks in a set are sorted by reuse distance: Block 0 has the longest and Block 16 the shortest reuse distance. Selecting a victim with a lower index is better.

To show that the expected hit count under a non-ideal replacement policy is correlated with the reciprocal of the reuse distance, Figure 5 compares the quality of the replacement victims chosen by Hawkeye and EHC. Every time a victim must be selected, we sort the blocks in the set, together with the incoming block, by reuse distance: Block 0 has the longest and Block 16 the shortest reuse distance. Ideally, a replacement policy always picks Block 0. The comparison reveals that the victims chosen by EHC are of higher quality than those chosen by Hawkeye.
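The ranking methodology behind this comparison can be sketched as follows. The function name is ours, and the sketch assumes the (oracle) reuse distance of every candidate is known at eviction time, as it would be in a trace-driven analysis:

```python
def victim_quality(reuse_distances, chosen):
    """Rank the chosen victim among all candidates (set blocks + incoming block).

    reuse_distances: next-use distance of each candidate; larger is a better victim.
    chosen: index of the candidate the policy actually evicted.
    Returns the victim's rank: 0 means the optimal choice (longest reuse distance).
    """
    # Sort candidate indices by reuse distance, longest first.
    ranking = sorted(range(len(reuse_distances)),
                     key=lambda i: reuse_distances[i], reverse=True)
    return ranking.index(chosen)
```

Aggregating this rank over all replacements and plotting its distribution yields a comparison like Figure 5: a policy whose ranks cluster near zero chooses victims close to what Belady's MIN would pick.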


  1. Hawkeye, which is based on classifying load instructions as cache-friendly/-averse, is the champion of the Second Cache Replacement Championship (CRC-2). ↩︎

  2. A memory region refers to a chunk of contiguous cache blocks in memory, holding several kilobytes of data. In this paper, we consider 128 KB memory regions. ↩︎