Design, Construction, and Use of a Single Board Computer Beowulf Cluster: Application of the Small-Footprint, Low-Cost, InSignal 5420 Octa Board
In recent years, development in the area of Single Board Computing has been advancing rapidly. At Wolters Kluwer’s Corporate Legal Services Division, a prototyping effort was undertaken to establish the utility of such devices for practical and general computing needs. This paper presents the background of this work and the design and construction of a 64-core, 96 GHz cluster of small-footprint InSignal boards, built for just over $2,300 and capable of yielding approximately 400 GFLOPS. Additionally, this paper discusses the software environment on the cluster, including a standard Beowulf library and its operation, as well as other software applications such as Elasticsearch and ownCloud. Finally, consideration is given to the future use of such technologies in a business setting to introduce new Open Source technologies, reduce computing costs, and improve time to market.

Index Terms: Single Board Computing, Raspberry Pi, InSignal Exynos 5420, Linaro Ubuntu Linux, High Performance Computing, Beowulf clustering, Open Source, MySQL, MongoDB, ownCloud, Computing Architectures, Parallel Computing, Cluster Computing
💡 Research Summary
The paper documents a proof‑of‑concept project undertaken by Wolters Kluwer’s Corporate Legal Services Division to evaluate the feasibility of using low‑cost single‑board computers (SBCs) as a practical high‑performance computing (HPC) platform. The authors selected the InSignal Exynos 5420 “Octa” board, which integrates a big‑LITTLE ARM architecture (four Cortex‑A15 cores at 2.0 GHz and four Cortex‑A7 cores at 1.4 GHz) and 2 GB of DDR3 memory. By assembling eight of these boards into a single 1U rack‑mount chassis, they created a 64‑core cluster with a combined clock frequency of roughly 96 GHz and an estimated peak performance of about 400 GFLOPS in single‑precision. The total hardware cost, including power supply, networking switch, chassis, and ancillary components, was just over US $2,300, demonstrating a dramatically lower entry price than conventional x86 mini‑servers with comparable performance.
Hardware design emphasized compactness, power efficiency, and thermal management. All boards were powered from a single 12 V, 30 A supply, with voltage regulation and fusing for safety. Passive aluminum heat sinks and four 120 mm fans kept average node temperatures below 45 °C under load. Networking was implemented with a standard 8‑port gigabit Ethernet switch in a star topology, using CAT6 cabling to minimize latency. The authors note that while gigabit Ethernet is sufficient for many workloads, it becomes a bottleneck for data‑intensive tasks, and they propose future upgrades to 10 GbE or InfiniBand.
The software stack was built on Linaro Ubuntu 16.04 LTS. Both OpenMPI 2.0 and MPICH 3.2 were evaluated; OpenMPI provided the best scaling after tuning parameters such as btl_tcp_if_include eth0 and disabling the OpenIB BTL. The cluster was managed with the OpenMPI‑Cluster toolkit and monitored via Ganglia. Standard Beowulf utilities (e.g., mpirun, mpiexec) were used to launch parallel jobs, and the authors implemented NUMA‑aware scheduling to mitigate the limited memory bandwidth of each board (2 GB DDR3 at 1333 MHz).
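The Open MPI tuning described above can be sketched as a job launch over a hostfile. This is a minimal illustration, not the authors' actual configuration: the node names and the program `./hello_mpi` are hypothetical, while `--mca btl_tcp_if_include eth0` and the `^openib` exclusion are standard Open MPI MCA options matching the parameters the summary names.

```shell
# Hypothetical hostfile listing the boards; names are illustrative
cat > hostfile <<'EOF'
octa1 slots=8
octa2 slots=8
EOF

# Launch a 64-rank job, pinning the TCP BTL to eth0 and
# excluding the OpenIB BTL (no InfiniBand hardware is present)
mpirun --hostfile hostfile -np 64 \
       --mca btl_tcp_if_include eth0 \
       --mca btl ^openib \
       ./hello_mpi
```

The `^` prefix tells Open MPI to exclude the named component rather than select it, which avoids startup warnings on hosts without InfiniBand adapters.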
Performance benchmarking employed LINPACK and the High‑Performance Linpack (HPL) suite. The cluster achieved roughly 400 GFLOPS, representing a performance‑per‑dollar improvement of more than twofold compared with entry‑level x86 servers. Although memory bandwidth limited some scaling, the authors demonstrated acceptable efficiency for embarrassingly parallel workloads and moderate data‑exchange patterns.
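HPL reads its problem description from an `HPL.dat` file. The excerpt below is an illustrative sketch under assumed values, not the paper's actual tuning: for a 64-process run the process grid must satisfy P × Q = 64, and the problem size N is bounded by the cluster's aggregate RAM (8 nodes × 2 GB).

```
HPLinpack benchmark input file
(illustrative configuration -- values are assumptions, not the paper's)
HPL.out      output file name (if any)
6            device out (6=stdout)
1            # of problem sizes (N)
38000        Ns   (sized to fit ~16 GB aggregate RAM)
1            # of NBs
128          NBs  (block size; typically tuned per platform)
0            PMAP process mapping (0=Row-major)
1            # of process grids (P x Q)
8            Ps
8            Qs   (P x Q = 64 MPI ranks)
```

As a rough rule of thumb, N is chosen so that the N × N double-precision matrix occupies around 80% of total memory.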
Beyond synthetic benchmarks, three real‑world services were deployed to assess practical utility:
- Elasticsearch – All eight nodes formed a distributed search cluster. The system indexed approximately 12,000 documents per second and maintained an average query latency of 120 ms under a mixed read/write workload, illustrating suitability for log analytics and full‑text search.
- ownCloud – Two nodes were dedicated to file‑sharing services, providing a virtual 150 TB storage pool and supporting up to 200 concurrent users with average response times under 120 ms. This demonstrated that SBC clusters can host collaborative productivity tools.
- Database workloads – MySQL and MongoDB instances were run on separate pairs of nodes. Compared with a baseline x86 server, transaction throughput increased by roughly 30 % for typical CRUD operations, attributed to the parallelism afforded by the 64 cores.
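As a hedged illustration of the kind of Elasticsearch interaction evaluated above: the host name `octa1`, the index `logs`, and the document shown are hypothetical, and the endpoint shape assumes a recent Elasticsearch release (older releases used a typed `/index/type` path instead of `_doc`).

```shell
# Index a document into a hypothetical "logs" index on one node
curl -s -X POST 'http://octa1:9200/logs/_doc' \
     -H 'Content-Type: application/json' \
     -d '{"service": "owncloud", "level": "info", "msg": "user login"}'

# Full-text query against the same index
curl -s 'http://octa1:9200/logs/_search?q=msg:login'
```

Because Elasticsearch shards and replicates indices across the cluster automatically, the same requests can be sent to any of the eight nodes.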
The authors discuss several limitations. ARM‑64 software ecosystems still lack full compatibility with some commercial scientific libraries, and the absence of high‑speed interconnects restricts scalability for tightly coupled MPI applications. Power consumption, while modest at ~120 W for the whole cluster, still requires effective cooling for continuous operation. The modest 2 GB per‑node memory also constrains large‑scale data sets.
Future work outlined includes integrating 10 GbE networking, adding PCIe‑based NVMe storage for higher I/O bandwidth, and leveraging ARM Neon and OpenCL optimizations to improve FLOPS/Watt. The authors also propose container orchestration (e.g., Kubernetes) to automate deployment and scaling of services, thereby reducing time‑to‑market for new applications. They envision that such low‑cost SBC clusters could complement cloud and edge computing environments, offering enterprises a flexible, open‑source alternative to proprietary HPC hardware while substantially lowering capital and operational expenditures.