Monte Cimone v2: Down the Road of RISC-V High-Performance Computers


Many RISC-V (RV) platforms and SoCs targeting the HPC sector have been announced in recent years, but only a few are commercially available and engineered to meet HPC requirements. The Monte Cimone project set out to assess their capabilities and maturity, aiming to make RISC-V a competitive choice for building a datacenter. Systems-on-chip (SoCs) featuring RV cores with a vector extension, and with a form factor and memory capacity suitable for HPC applications, are now available on the market, but it is unclear how well compilers and open-source libraries can exploit their performance. In this paper, we describe a performance assessment of the Monte Cimone cluster upgrade (MCv2), based on the Sophgo SG2042 processor, on HPC workloads, together with an exploration of BLAS library optimization. The upgrade increases the attained node performance by 127x on HPL DP FLOP/s and 69x on STREAM memory bandwidth.


💡 Research Summary

The paper presents a comprehensive evaluation of the Monte Cimone v2 (MCv2) cluster, which upgrades the original Monte Cimone v1 (MCv1) HPC platform from SiFive U740‑based 4‑core nodes to Sophgo Sophon SG2042‑based nodes featuring 64 RV64 cores with RVV 0.7.1 vector extensions. Four new SG2042 nodes are added, three as single‑socket “Milk‑V Pioneer” boxes (64 cores, 128 GB DDR4 each) and one as a dual‑socket system (128 cores, 256 GB DDR4). The hardware provides 64 KB L1 per core, a 1 MB L2 shared among four‑core clusters, a massive 64 MB L3, four channels of 3200 MHz ECC DDR4, and 32 PCIe Gen4 lanes, delivering a theoretical peak far beyond the 4 GFLOP/s of MCv1.

The software stack is built on Spack, SLURM, and ExaMon, and introduces two compiler toolchains: the Xuantie GNU toolchain (GCC 10) targeting RVV 0.7.1, and the upstream GCC 14 with XTheadVector support. This dual‑toolchain approach enables compilation of both generic RV64 code and code that explicitly uses the vector unit.
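As a sketch, the two toolchains might be invoked roughly as follows. The compiler names and flags are illustrative assumptions, not taken from the paper; `rv64gcv0p7` is the Xuantie-style march string for the 0.7.1 vector extension, and `xtheadvector` is the upstream GCC 14 spelling for the T-Head vector extension:

```
# Xuantie GNU toolchain (GCC 10), targeting RVV 0.7.1 directly
riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gcv0p7 -mabi=lp64d app.c -o app_xuantie

# Upstream GCC 14, emitting vector code via the XTheadVector extension
gcc-14 -O3 -march=rv64gc_xtheadvector -mabi=lp64d app.c -o app_upstream
```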

Performance is measured with the STREAM memory bandwidth benchmark and the High‑Performance Linpack (HPL) benchmark. STREAM shows 41.9 GB/s on a single‑socket SG2042 node using 64 OpenMP threads and 82.9 GB/s on the dual‑socket node, a >70× improvement over MCv1’s 1.1 GB/s. HPL scaling demonstrates 13 GFLOP/s on 64 cores (single socket) and 22 GFLOP/s on 128 cores (dual socket). When the full MCv2 cluster (8 nodes, 256 cores) is considered, the achieved performance is 127× higher than the original MCv1 cluster.

Two BLAS libraries are examined. OpenBLAS is compiled in two variants: a generic RV64 build that does not exploit the vector unit (≈68 % efficiency) and an SG2042‑optimized build that uses RVV assembly kernels (≈89 % efficiency). Both suffer a drop in efficiency when all cores are active, indicating memory and cache bottlenecks.

The authors also port and optimize BLIS. The original BLIS micro‑kernels target RVV 1.0; they are manually retargeted to RVV 0.7.1 by adjusting load/store instructions, vsetvl syntax, and adding the “th.” prefix so GCC 14 recognizes them. Performance analysis identifies the micro‑kernel’s register usage as the primary bottleneck. By increasing the LMUL parameter from 1 to 4, the optimized kernel reduces the number of loads and vfmac instructions required to update a matrix column, effectively cutting instruction count by a factor of four. This micro‑kernel optimization yields a 5–10 % performance gain over the optimized OpenBLAS baseline.

Cache‑miss profiling (L1 and L3) shows that the BLIS version reduces L1 miss rates compared to OpenBLAS, while L3 miss rates remain similar, confirming that the micro‑kernel changes improve register and L1 utilization but do not eliminate the underlying memory subsystem limits. Moreover, the 1 Gbps Ethernet interconnect becomes a scalability bottleneck as node count grows; multi‑node scaling efficiency plateaus at 1.33× relative to a single node.

In conclusion, the study demonstrates that a modern RISC‑V SoC with vector extensions can deliver HPC‑class performance comparable to traditional x86/ARM systems when both hardware and software are co‑optimized. However, the ecosystem is still immature: compiler support for RVV, library vectorization, and cache‑aware scheduling are critical for extracting full performance. Future work should address high‑speed interconnects (e.g., 100 GbE or InfiniBand), NUMA‑aware thread placement, support for newer RVV specifications (1.0/1.1), and a thorough analysis of power‑efficiency and cost‑performance trade‑offs.

