Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance
Conclusion
In our DSN 2014 paper, we present a new methodology to quantify the tolerance of applications to memory errors. Using this methodology, we perform a case study of three new data-intensive workloads that shows, among other new insights, that there exists a diverse spectrum of memory error tolerance both within and across these applications. We propose new hardware/software heterogeneous-reliability memory system designs, and evaluate them to show that (1) the one-size-fits-all approach to reliability in modern servers is inefficient in terms of cost, and (2) heterogeneous-reliability systems can achieve the benefits of both low cost and high single-server availability/reliability. We hope that our techniques can enable the use of lower-cost memory devices to reduce the server hardware cost of datacenters, and that our analyses will spur future research on heterogeneous-reliability memory systems. As DRAM technology scales to smaller feature sizes and becomes less reliable, and as memory cost becomes more important in datacenters, we hope that our findings and ideas will inspire more research to improve the cost–reliability trade-off in memory systems.
| Configuration | Private (36 GB) | Heap (9 GB) | Stack (60 MB) | Memory cost savings (%) | Server HW cost savings (%) | Crashes/server/month | Single server availability | # incorrect/million queries |
|---|---|---|---|---|---|---|---|---|
| Typical Server | ECC | ECC | ECC | % | ||||
| Consumer PC | NoECC | NoECC | NoECC | % | ||||
| Detect&Recover | Par+R | NoECC | NoECC | % | ||||
| Less-Tested (L) | NoECC | NoECC | NoECC | (16.4-37.8) | (4.9-11.3) | % | ||
| Detect&Recover/L | ECC | Par+R | NoECC | (3.1-27.9) | (0.9-8.4) | % | ||
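
The relationship between the two savings columns in the table can be illustrated with a short sketch. It assumes that memory accounts for roughly 30% of server hardware cost. That figure is not stated in this section; it is inferred here from the ratio between the memory and server savings ranges in the table. The names `Config`, `server_hw_savings`, and `savings_range` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption (not stated in this section): memory is ~30% of server hardware
# cost, inferred from the ratio between the table's two savings columns.
MEMORY_SHARE_OF_SERVER_COST = 0.30


@dataclass(frozen=True)
class Config:
    """One heterogeneous-reliability configuration from the table."""
    name: str
    private: str   # protection used for the private region
    heap: str      # protection used for the heap region
    stack: str     # protection used for the stack region
    memory_savings_pct: Tuple[float, float]  # (low, high) memory cost savings


def server_hw_savings(memory_savings_pct: float,
                      memory_share: float = MEMORY_SHARE_OF_SERVER_COST) -> float:
    """Scale a memory cost saving (%) to a server hardware cost saving (%)."""
    return round(memory_savings_pct * memory_share, 1)


def savings_range(cfg: Config) -> Tuple[float, float]:
    """Server hardware savings range implied by a configuration's memory savings."""
    low, high = cfg.memory_savings_pct
    return (server_hw_savings(low), server_hw_savings(high))


# Only the two rows whose savings ranges survive in the table are listed.
CONFIGS = [
    Config("Less-Tested (L)", "NoECC", "NoECC", "NoECC", (16.4, 37.8)),
    Config("Detect&Recover/L", "ECC", "Par+R", "NoECC", (3.1, 27.9)),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        low, high = savings_range(cfg)
        print(f"{cfg.name}: server HW cost savings {low}-{high}%")
```

Under this assumed 30% share, the computed ranges match the server hardware cost savings listed in the table for both configurations.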