In-Storage Embedded Accelerator for Sparse Pattern Processing



📄 Content

In-Storage Embedded Accelerator for Sparse Pattern Processing
Sang-Woo Jun*, Huy T. Nguyen#, Vijay Gadepally#, and Arvind
#MIT Lincoln Laboratory, *MIT Computer Science & Artificial Intelligence Laboratory
Email: wjun@csail.mit.edu, hnguyen@ll.mit.edu, vijayg@ll.mit.edu, arvind@csail.mit.edu

Abstract

We present a novel architecture for sparse pattern processing, using flash storage with embedded accelerators. Sparse pattern processing on large data sets is the essence of applications such as document search, natural language processing, bioinformatics, subgraph matching, machine learning, and graph processing. One slice of our prototype accelerator is capable of handling up to 1 TB of data, and experiments show that it can outperform C/C++ software solutions on a 16-core system at a fraction of the power and cost; an optimized version of the accelerator can match the performance of a 48-core server.
I. INTRODUCTION

Many data analysis algorithms of interest on large data sets composed of documents, images, audio, and video can be formulated as operations on very large but very sparse vectors and matrices. There are two challenges: the size of the data is generally too large to fit in the system memory (DRAM) of a single server [1], and current server architectures are far from ideal for processing sparse datasets, causing the CPU itself to become a bottleneck.
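To make the sparse-vector formulation concrete, here is a toy sketch (our own illustration, not code from the paper) of document search as a sparse dot product, with documents and queries stored as term-to-weight maps so that only nonzero entries are kept and touched:

```python
# Toy sketch: document search as a sparse dot product.
# Each document and the query are sparse term-weight vectors,
# stored as {term_id: weight} dicts; only nonzero entries exist.

def sparse_dot(a, b):
    # Iterate over the smaller vector to minimize lookups.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)

def rank(query, docs):
    # Score every document against the query, best match first.
    scores = {doc_id: sparse_dot(query, vec) for doc_id, vec in docs.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {
    "d1": {1: 0.5, 7: 1.0},         # contains terms 1 and 7
    "d2": {2: 0.3, 7: 0.2, 9: 0.9},  # contains terms 2, 7, and 9
}
query = {7: 1.0, 9: 1.0}
print(rank(query, docs))  # "d2" ranks first (score ~1.1 vs 1.0)
```

The data-access pattern here, many small random lookups with little arithmetic per byte, is exactly what makes sparse workloads cache- and CPU-unfriendly at scale.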
The traditional way to overcome the size challenge is to use a cluster of machines so that the data can be accommodated in the collective main memory (DRAM), and to distribute the computation across the machines in the cluster [2]. A 1.5 TB, 192-core distributed system with a dozen nodes of 128 GB DRAM each would cost about $60k. Such a system would use a software layer for distributed data processing such as Hadoop. With circa-2016 server technology, it is possible to configure up to 1.5 TB of memory on a quad-socket server, which would cost $27k to $49k for 48 cores and 72 cores, respectively. This single-box system would not need to access data over the network, so its software could be streamlined for performance.

The industry's trend of adding more DRAM and cores per server helps consolidate machines and improves performance. However, it is not a panacea. Large memory buffers impose heavy electrical loading and require high-power circuitry for fast access. Data and results also need to be stored on non-volatile storage devices for disruption recovery.

Flash-based secondary storage such as Solid-State Drives (SSDs) is a high-performance and power-efficient technology. A recent interface standard, Non-Volatile Memory Express (NVMe), makes this type of solution even more attractive. We propose to take this technology further: to use flash-based non-volatile memory (NVM) as the main data store instead of DRAM. Comparatively, flash is at least 10x cheaper, takes 10x less space, and consumes 10x less power than DRAM. In such a system, all data would reside on SSDs for processing rather than be read in from hard drives. Of course, using flash memory in place of DRAM incurs longer access latency, and consequently the system has to be optimized for such accesses.
Instead of approaching this as a storage problem, we address it in conjunction with computing, and optimize across boundaries where possible to achieve a better overall solution. Computing with application-specific hardware accelerators can deliver one to three orders of magnitude better performance at lower power than CPU cores performing similar tasks [3]. Many accelerator devices are packaged as independent system components that plug into a high-speed bus such as PCIe to interface with CPUs and system memory. In this paper, we demonstrate a system that integrates the storage and computing solutions together, as shown in Fig. 1, to reduce DRAM size and CPU workload [4]. In this architecture, the Field-Programmable Gate Array (FPGA) directly accesses flash storage and processes the data before presenting results to the CPU, hence "in-storage computing". A dedicated high-bandwidth, low-latency FPGA-based network connects the FPGAs together for scaling to problem sizes of tens of terabytes.
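The division of labor just described can be sketched in a minimal software model (our own hypothetical illustration; the real accelerator is an FPGA datapath, not Python): the near-storage filter scans raw flash pages and forwards only matching records, so the host CPU and DRAM never see the full data set.

```python
# Minimal model of in-storage filtering: the "accelerator" scans
# storage pages and emits only hits; the host receives a small
# result stream instead of the whole data set.

PAGE_SIZE = 4  # records per flash page (toy value)

def storage_pages(records):
    # Model flash as a sequence of fixed-size pages.
    for i in range(0, len(records), PAGE_SIZE):
        yield records[i:i + PAGE_SIZE]

def in_storage_filter(records, pattern):
    # Runs "near the flash": stream pages, forward only matches.
    for page in storage_pages(records):
        for rec in page:
            if pattern in rec:
                yield rec

data = ["cat", "dog", "catalog", "bird", "concat", "fish"]
hits = list(in_storage_filter(data, "cat"))
print(hits)  # ['cat', 'catalog', 'concat']
```

The key property is that the filtering work scales with storage bandwidth rather than with host memory capacity, which is why the architecture can grow to multi-terabyte problem sizes per slice.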

Sparse pattern processing is very inefficient on general-purpose computers, and is a good candidate for acceleration on our in-storage computing architecture. We will show that our baseline accelerator outperforms a 16-core server system while using only two-thirds of the power, and that an optimized version could match a 48-core system at one quarter of the power and one quarter of the cost. (These comparisons are based on C/C++ server code; comparisons with Java-based implementations would correspond to 3x more cores, i.e., 48 cores and 144 cores.) Each accelerator slice can handle a problem size of 1 TB.

This work is partially supported by the Assistant Secretary of Defense for Rese
