Taking the Leap: Efficient and Reliable Fine-Grained NUMA Migration in User-space


Modern multi-socket architectures offer a single virtual address space but physically divide main memory across multiple regions, each attached to a CPU and its cores. While this simplifies usage, developers must be aware of non-uniform memory access (NUMA): an access by a thread to its core-local NUMA region is significantly cheaper than an access to a core-remote region. Consequently, if query answering is parallelized across the cores of multiple regions, the portion of the database on which the query operates should be distributed across the same regions to ensure local accesses. As the current data placement might not match this, pages can be migrated from one NUMA region to another to improve the situation. Different options exist for doing so: one is to rely on the automatic NUMA balancing integrated in Linux, which is steered by the observed access patterns and frequencies. Another is to actively trigger migration via the system call move_pages(). Unfortunately, both variants have significant downsides in terms of their feature set and performance. As an alternative, we propose a new user-space migration method called page_leap() that performs page migration asynchronously at high performance by exploiting features of the virtual memory subsystem. The method (a) is actively triggered by the user, (b) ensures that all pages are eventually migrated, (c) handles concurrent writes correctly, (d) supports pooled memory, (e) adaptively adjusts its migration granularity based on the workload, and (f) supports both small pages and huge pages.


💡 Research Summary

The paper addresses the performance penalty caused by non‑uniform memory access (NUMA) on modern multi‑socket servers, where remote memory accesses can be significantly slower than local ones. Existing solutions are either the kernel’s automatic NUMA balancing, which lacks user control and cannot migrate pages into pre‑allocated pooled memory, or the explicit move_pages() system call, which forces migration at the page granularity, does not guarantee that all pages are moved, and also cannot use pooled memory. Both approaches suffer from high overheads, especially under transactional workloads with frequent concurrent writes.

To overcome these limitations, the authors propose a novel user‑space migration primitive called page_leap(). The method exploits “memory rewiring”: it obtains a handle to physical memory via a main‑memory file and uses mmap to remap virtual pages to arbitrary physical pages at runtime. Migration proceeds in two phases. First, the content of a source page (or a larger contiguous area) is copied with memcpy into a destination page allocated from the target NUMA node’s memory pool. Second, the virtual mapping is atomically changed to point to the new physical page using mmap, making the migration transparent to the application.

Concurrent writes are handled without requiring the application to use a special API. The target area is temporarily set to read‑only with mprotect; any write triggers a segmentation fault, which a custom signal handler catches, marks the area as dirty, restores write permission, and lets the write proceed. Before the final remapping, the handler checks whether the area became dirty; if so, the migration for that area is aborted and the area is queued for a retry. Retries are performed with a smaller granularity, automatically adapting to the workload’s modification pressure.

The adaptive granularity mechanism requires the user to specify only an initial area size. During migration, if an area becomes dirty, page_leap() splits it according to a reduction factor, reducing the chance of further conflicts while keeping system-call overhead low. This contrasts with move_pages(), which always migrates at the fixed page size and cannot adjust dynamically.

Experimental evaluation on a two‑socket Intel Xeon Gold 6326 system (256 GB total RAM) compares automatic NUMA balancing, move_pages(), raw memcpy, and page_leap() for both 4 KB small pages and 2 MB huge pages. Baseline measurements confirm that remote accesses are roughly twice as costly as local ones across sequential and random patterns, justifying migration. move_pages() incurs an 18 % overhead over memcpy when copying into fresh memory, but the overhead jumps to 82 % when copying into pooled memory. page_leap() achieves performance close to memcpy, especially when the initial migration area is set between 64 KB and 16 MB, where it outperforms move_pages() by a factor of two or more.

In a no‑concurrent‑write scenario, page_leap() consistently beats move_pages() across all tested granularities except the smallest 4 KB size, where system‑call overhead dominates. The adaptive splitting introduces plateaus at 512 KB and 16 MB, indicating these as good starting points for the initial area size. When concurrent writes are present, page_leap()’s dirty‑tracking and retry mechanism ensures that every page is eventually migrated, a guarantee that move_pages() lacks.

Overall, page_leap() delivers five key advantages: (1) user‑controlled, asynchronous migration; (2) guaranteed migration of all specified pages; (3) correct handling of concurrent writes; (4) support for pooled memory and both small and huge pages; and (5) workload‑adaptive migration granularity. These properties make it especially suitable for database management systems and other high‑performance applications that manage large in‑memory data sets across multiple NUMA nodes. The authors conclude that page_leap() can substantially reduce remote‑memory traffic, improve throughput, and lower latency in NUMA‑aware workloads.

