AgilePkgC: An Agile System Idle State Architecture for Energy Proportional Datacenter Servers
This paper presents the design of AgilePkgC (APC): a new C-state architecture that improves the energy proportionality of servers that operate at low utilization while running microservices of user-facing applications. APC targets the reduction of power when all cores are idle in a shallow C-state, ready to transition back to service. In particular, APC targets the power of the resources shared by the cores (e.g., LLC, network-on-chip, IOs, DRAM) which remain active while no core is active to use them. APC realizes its objective by using low-overhead hardware to facilitate sub-microsecond entry/exit latency to a new package C-state and judiciously selecting intermediate power modes for the different shared resources that offer fast transition and, yet, substantial power savings. Our experimental evaluation supports that APC holds the potential to reduce server power by up to 41% with a worst-case performance degradation of less than 0.1% for several representative workloads. Our results clearly support the research and development and eventual adoption of new deep and fast package C-states, like APC, for future server CPUs targeting datacenters running microservices.
💡 Research Summary
AgilePkgC (APC) introduces a novel package-level C-state, called PC1A (Package C1 Agile), designed to address the energy-proportionality gap in modern datacenter servers that run latency-critical microservice workloads. Current server CPUs only allow deep package C-states (e.g., PC6) when all cores are in the deepest core C-state (CC6). Because CC6 entry/exit latencies are on the order of tens of microseconds, datacenter operators typically disable deep core C-states to avoid tail-latency violations. Consequently, when the average utilization is low (5-20%), the uncore components (LLC, mesh interconnect, high-speed I/O, and DRAM) remain fully powered even though no core is doing useful work, leading to poor energy proportionality.
APC’s key contribution is to enable a deep package C-state that can be entered as soon as every core reaches a shallow idle state (CC1). The design hinges on four tightly integrated hardware techniques:
- Agile Power Management Unit (APMU) – a dedicated hardware block that monitors core C-state signals and, within ~100 ns of the last core entering CC1, triggers the package-level power-saving flow. This eliminates the need for software polling and provides deterministic, sub-microsecond response.
- I/O Standby Mode (IOSM) – leverages existing link-layer power states (L0s, L0p) for PCIe, DMI, UPI, and DDR PHYs. IOSM drives these interfaces into nanosecond-scale standby modes without OS intervention, cutting I/O power by roughly 10-15% of total package consumption.
- CLM Retention (CLMR) – uses the processor’s Fully Integrated Voltage Regulator (FIVR) to drop the voltage of the Cache-and-Home-Agent, Last-Level-Cache, and Mesh NoC domains to a retention level. The voltage transition occurs in tens of nanoseconds, reducing CLM power by more than 70% while preserving data integrity.
- PLL Retention – keeps all system phase-locked loops (core, I/O, CLM, and global power-management PLLs) active during PC1A. Modern all-digital PLLs consume negligible static power, and by avoiding PLL re-locking the exit latency is limited to ~10 ns, a dramatic improvement over the several-microsecond lock time of conventional PLLs.
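The interaction between the four techniques can be pictured as a small state machine. The following Python sketch is a toy software model, not the authors' hardware design: the class and method names, the step labels, and the ordering of the entry and exit steps are all assumptions made for illustration. It only captures the rule stated above, that the package enters PC1A once the last core reaches CC1 and leaves it as soon as any core wakes.

```python
from enum import Enum


class CoreState(Enum):
    CC0 = "active"
    CC1 = "shallow idle"


class PackageState(Enum):
    PC0 = "active"
    PC1A = "agile package idle"


class AgilePMU:
    """Toy model of the APMU decision logic (all names are hypothetical)."""

    def __init__(self, num_cores):
        self.cores = [CoreState.CC0] * num_cores
        self.package = PackageState.PC0
        self.log = []  # records the power-flow steps that were triggered

    def _all_cores_idle(self):
        return all(state is CoreState.CC1 for state in self.cores)

    def set_core(self, core_id, state):
        """Update one core's C-state and re-evaluate the package state."""
        self.cores[core_id] = state
        if self.package is PackageState.PC0 and self._all_cores_idle():
            # Entry flow (assumed order): I/O standby, then CLM retention.
            self.log += ["io_standby_on", "clm_retention_on"]
            self.package = PackageState.PC1A
        elif self.package is PackageState.PC1A and not self._all_cores_idle():
            # Exit flow (assumed reverse order). PLLs stayed locked,
            # so the model has no re-lock step.
            self.log += ["clm_retention_off", "io_standby_off"]
            self.package = PackageState.PC0
        return self.package


pmu = AgilePMU(num_cores=2)
pmu.set_core(0, CoreState.CC1)  # one core still active: package stays in PC0
pmu.set_core(1, CoreState.CC1)  # last core idles: package enters PC1A
pmu.set_core(0, CoreState.CC0)  # any wake-up: package returns to PC0
```

In real hardware the same decision is made by dedicated logic watching core C-state signals, which is what removes software polling from the path.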
Combined, these mechanisms allow PC1A to reach a total power of approximately 29 W (SoC + DRAM) with an entry-plus-exit latency under 200 ns, more than 250× faster than the traditional PC6 state (>50 µs). The fast transition makes the state usable for the short, unpredictable idle periods typical of microservice-driven workloads.
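The latency comparison can be checked with a few lines of arithmetic. The sketch below takes the two latency bounds quoted above at face value and adds one assumed quantity, a 20 µs idle gap, to show why transition latency decides whether a state is usable at all. The helper `usable_fraction` is a simplification introduced here, not a metric from the paper.

```python
PC6_LATENCY_NS = 50_000  # ">50 µs" from the text, taken at its lower bound
PC1A_LATENCY_NS = 200    # "under 200 ns" from the text, taken at its upper bound

speedup = PC6_LATENCY_NS / PC1A_LATENCY_NS


def usable_fraction(idle_ns, transition_ns):
    """Share of an idle period left in the low-power state
    after paying the entry-plus-exit latency."""
    return max(0.0, (idle_ns - transition_ns) / idle_ns)


IDLE_NS = 20_000  # hypothetical 20 µs idle gap, chosen only for illustration

print(speedup)                                    # 250.0
print(usable_fraction(IDLE_NS, PC1A_LATENCY_NS))  # 0.99
print(usable_fraction(IDLE_NS, PC6_LATENCY_NS))   # 0.0
```

Under this assumed gap, PC1A spends almost the whole idle period in the low-power state, while PC6 cannot complete a transition before the next request arrives.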
The authors evaluate APC on an Intel Skylake-Xeon (SKX) 10-core platform using representative microservice benchmarks: Memcached, Redis, Nginx, and MongoDB. Results show an average energy reduction of 25% across workloads, with a peak reduction of 41% for Memcached when the server is idle. The measured performance degradation is below 0.1%, indicating that the sub-microsecond transition latency does not threaten tail-latency targets (30-250 µs). Core-level idle statistics reveal that at 5% load all cores are simultaneously in CC1 for ~57% of the time, and at 10% load for ~39% of the time; thus a large fraction of wall-clock time can benefit from PC1A.
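The residency figures translate into average power savings through a simple weighting. The sketch below is a first-order estimate under loud assumptions: the 49 W baseline for "all cores in CC1 without APC" is a made-up placeholder (the summary does not state this number), transition overheads are ignored, and every all-idle interval is assumed to be spent fully in PC1A. Only the 29 W PC1A power and the 57% and 39% residencies come from the text.

```python
def watts_saved(all_idle_residency, baseline_idle_w, pc1a_w):
    """Average power saved when every all-cores-idle interval is spent in
    PC1A instead of the baseline idle state (transition costs ignored)."""
    return all_idle_residency * (baseline_idle_w - pc1a_w)


PC1A_POWER_W = 29.0        # SoC + DRAM, from the text
BASELINE_IDLE_W = 49.0     # hypothetical placeholder, same SoC + DRAM scope

for load, residency in [("5% load", 0.57), ("10% load", 0.39)]:
    saved = watts_saved(residency, BASELINE_IDLE_W, PC1A_POWER_W)
    print(f"{load}: about {saved:.1f} W saved on average")
```

The estimate scales linearly with residency, which is why the savings shrink as load rises and the all-idle windows become rarer.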
Compared with prior approaches—software‑driven DVFS, aggressive core‑C‑state scheduling, or deep‑sleep DRAM techniques—APC delivers superior energy savings while maintaining deterministic latency. Its reliance on existing hardware blocks (FIVR, ADPLL, link‑layer power states) means the concept can be ported to other architectures (AMD, ARM) with modest design changes.
In conclusion, AgilePkgC provides a practical, hardware‑centric solution to the “killer microseconds” problem that has prevented deep idle states from being used in latency‑sensitive datacenters. By enabling a fast, deep package C‑state that activates as soon as cores are shallowly idle, APC dramatically improves energy proportionality for low‑utilization servers without compromising the strict latency guarantees of modern microservice applications.