In-Storage Embedded Accelerator for Sparse Pattern Processing
Sang-Woo Jun*, Huy T. Nguyen#, Vijay Gadepally#, and Arvind
#MIT Lincoln Laboratory, *MIT Computer Science & Artificial Intelligence Laboratory
Email: wjun@csail.mit.edu, hnguyen@ll.mit.edu, vijayg@ll.mit.edu, arvind@csail.mit.edu
Abstract—We present a novel architecture for sparse pattern processing, using flash storage with embedded accelerators. Sparse pattern processing on large data sets is the essence of applications such as document search, natural language processing, bioinformatics, subgraph matching, machine learning, and graph processing. One slice of our prototype accelerator is capable of handling up to 1 TB of data, and experiments show that it can outperform C/C++ software solutions on a 16-core system at a fraction of the power and cost; an optimized version of the accelerator can match the performance of a 48-core server.
I. INTRODUCTION
Many data analysis algorithms of interest on large data
sets composed of documents, images, audio and video can
be formulated as operations on very large but very sparse
vectors and matrices. There are two challenges: the size of
data is generally too large to fit in the system memory
(DRAM) of a single server [1], and current server
architectures are far from ideal for processing sparse
datasets, causing the CPU itself to become a bottleneck.
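To make the sparse formulation concrete, here is a minimal sketch of sparse matrix-vector multiplication over a CSR (compressed sparse row) encoding, the kind of kernel such workloads reduce to. The function name and example data are illustrative, not taken from the paper:

```python
# Sparse matrix-vector multiply (SpMV) over a CSR-encoded matrix.
# CSR stores only the nonzero values, their column indices, and
# per-row offsets -- essential when the matrix is mostly zeros.

def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A given in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row+1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 matrix with 4 nonzeros:
# [[10,  0,  0],
#  [ 0, 20, 30],
#  [ 0,  0, 40]]
row_ptr = [0, 1, 3, 4]
col_idx = [0, 1, 2, 2]
values = [10.0, 20.0, 30.0, 40.0]
x = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, values, x))  # [10.0, 130.0, 120.0]
```

Note the indirect access `x[col_idx[k]]`: each nonzero triggers a scattered memory read, which is why these kernels are bound by memory latency rather than arithmetic on conventional CPUs.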
The traditional way to overcome the size challenge is to
use a cluster of machines so that data can be accommodated
in the collective main memory (DRAM), and distribute the
computation across the machines in the cluster [2]. A 1.5 TB
192-core distributed system with a dozen nodes of 128 GB
DRAM memory each would cost about $60k. This system
would use a software layer for distributed data processing
such as Hadoop. With circa 2016 new server technology, it
is possible to configure memory capacity up to 1.5 TB on a
quad-socket server, which would cost $27k to $49k, for 48
cores and 72 cores, respectively. This single-box system
would not need to access data over the network, thus, its
software could be streamlined for performance. The
industry’s trend of adding DRAM and cores per server helps
consolidate machines and improve performance. However, it
is not a panacea. Large memory buffers present heavy
electrical loading and require high-power circuitry for fast
access. Data and results also need to be stored on non-
volatile storage devices for disruption recovery.
Flash-based secondary storage such as Solid-State
Drives (SSDs) is a high-performance and power-efficient
technology solution. A recent interface standard, the Non-
Volatile Memory Express (NVMe), makes this type of
solution even more attractive. We propose to take this
technology further: to use flash-based non-volatile memory
(NVM) as the main data store instead of DRAM.
Comparatively, flash is at least 10x cheaper, takes 10x less
space, and consumes 10x less power than DRAM. In
such a system, all data will be on SSDs for processing rather
than read in from hard drives. Of course, using flash
memory in place of DRAM incurs longer latency in
accessing information, and consequently the system has to
be optimized for such accesses. Instead of approaching this
as a storage problem, we will address it in conjunction with
computing, and optimize across boundaries where possible
to achieve a better overall solution.
Computing with application-specific hardware
accelerators can lead to one to three orders of magnitude
better performance with less power consumption compared
to CPU cores performing similar tasks [3]. Many accelerator
devices are packaged as an independent system component,
which can be plugged into a high-speed bus such as PCIe to
interface with CPUs and system memory.
In this paper, we demonstrate a system that integrates
the storage and computing solutions together, as shown in
Fig. 1, to reduce DRAM memory size and CPU workload
[4]. In this architecture, the Field-Programmable Gate Array
(FPGA) directly accesses flash storage, and processes data
prior to presenting the results to the CPU, hence, “in-storage
computing”. A dedicated high-bandwidth, low-latency
FPGA-based network connects the FPGAs together for
scaling to problem sizes of tens of TBs.
Sparse pattern processing is very inefficient on general-purpose computers, and is a good candidate for acceleration on our in-storage computing architecture. We will show that our baseline accelerator outperforms a 16-core server system while using only 2/3 the power, and that an optimized version could match a 48-core system at ¼ the power and ¼ the cost. (These comparisons are based on C/C++ server code; comparisons with Java-based implementations would correspond to 3x more cores, i.e., 48 cores and 144 cores.) Each accelerator slice can handle a problem size of 1 TB.

This work is partially supported by the Assistant Secretary of Defense for Rese
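To illustrate why sparse pattern processing strains a general-purpose CPU, here is a toy document-search sketch: scoring sparse document vectors against a sparse query by dot product. The representation (feature-id-to-weight dicts) and all names are illustrative assumptions, not the paper's implementation; the point is the scattered, data-dependent accesses the inner loop performs:

```python
# Toy sparse pattern match: score documents against a sparse query.
# Vectors are {feature_id: weight} maps; the dot product touches only
# features present in both, via data-dependent lookups -- the scattered
# access pattern that makes these workloads memory-bound on CPUs.

def sparse_dot(a, b):
    """Dot product of two sparse vectors given as dicts."""
    if len(b) < len(a):  # iterate over the smaller vector
        a, b = b, a
    return sum(w * b[f] for f, w in a.items() if f in b)

query = {3: 1.0, 17: 2.0}
docs = {
    "doc0": {3: 0.5, 99: 4.0},
    "doc1": {17: 1.5},
    "doc2": {5: 2.0},
}
scores = {name: sparse_dot(query, vec) for name, vec in docs.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # doc1 3.0
```

At scale, every document vector must be streamed from storage and probed at unpredictable offsets, which is the workload the in-storage accelerator is designed to absorb before results reach the CPU.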