📝 Original Info
- Title: FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
- ArXiv ID: 1709.01190
- Date: 2018-07-04
- Authors: Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, Junghee Ryu (Rice University)
📝 Abstract
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force ($n^2 D$) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
📄 Full Content
FLASH: Randomized Algorithms Accelerated over CPU-GPU for
Ultra-High Dimensional Similarity Search
Yiqiu Wang
Rice University
Houston, Texas
yiqiu.wang@rice.edu
Anshumali Shrivastava
Rice University
Houston, Texas
anshumali@rice.edu
Jonathan Wang
Rice University
Houston, Texas
jw96@rice.edu
Junghee Ryu
Rice University
Houston, Texas
jr51@rice.edu
ABSTRACT
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force ($n^2 D$) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results¹.
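The reservoir sampling named above lets an index keep a fixed-size, uniform sample of the items landing in an LSH bucket without ever storing the whole bucket. As a point of reference, here is a minimal sketch of classic one-pass reservoir sampling (Algorithm R); it illustrates the primitive, not FLASH's exact implementation:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # item i+1 survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item     # evict a uniformly chosen resident
    return reservoir

sample = reservoir_sample(range(1_000_000), 5)
print(sample)  # five ids drawn uniformly from the million-element stream
```

Each item is seen exactly once and the memory footprint stays at k, which is why the technique parallelizes well when many hash buckets are filled concurrently.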
KEYWORDS
Similarity search; locality sensitive hashing; reservoir sampling;
GPGPU
ACM Reference Format:
Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, and Junghee Ryu. 2018. FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3183713.3196925
¹ https://github.com/RUSH-LAB/Flash
© 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. ACM ISBN 978-1-4503-4703-7/18/06.
1 INTRODUCTION
Similarity search, or k-nearest-neighbor search (k-NNS), is one of the most frequent operations in large-scale data processing systems. Given a query object q, with feature representation q ∈ R^D, the goal of similarity search is to find, from a collection C of N data instances, an object x (or set of objects) most similar to the given query. The notions of similarity are based on popular Euclidean-type measures such as cosine similarity [11] or Jaccard similarity [8].
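To make the problem statement concrete, exact k-NNS under cosine similarity scores the query against every one of the N points, each of dimension D. A minimal NumPy sketch of this baseline (the dataset and query below are illustrative, not from the paper):

```python
import numpy as np

def knn_cosine(X, q, k):
    """Exact k-NNS under cosine similarity: O(N * D) work per query."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize every row once
    qn = q / np.linalg.norm(q)
    scores = Xn @ qn                                   # N dot products of length D
    return np.argsort(-scores)[:k]                     # indices of the k most similar

rng = np.random.default_rng(0)
X = rng.random((1000, 64))          # toy sizes: N = 1000 points, D = 64
q = X[42] + 0.01 * rng.random(64)   # a query sitting very close to point 42
print(knn_cosine(X, q, 3))          # point 42 should rank first
```

This per-query O(N · D) scan is exactly the cost that LSH-style indexing is designed to avoid at scale.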
k-NNS over Ultra-high Dimensional and Sparse Datasets:
Recommendation systems naturally deal with ultra-high dimensional and sparse features, as they usually consist of categorical combinations. Even the general user-item matrix representation leads to an ultra-sparse, ultra-high dimensional representation. Neighborhood models [15] are popular in recommendation systems, where the first prerequisite is to find a set of near-neighbors for every item. Social networks are another natural ground for ultra-high dimensional and extremely sparse representations. In social networks, we represent the friendship relations between users as graphs. Given d users, we describe each user as a d-dimensional, very sparse vector whose non-zero entries correspond to edges. By representing each user as a column, we construct a matrix A of dimension d × d. Finding similar entries in such a user representation is one of the first operations required for a variety of tasks, including link prediction [26], personalization [14], and other social network mining tasks [47]. Other popular applications where k-NNS over ultra-high dimensional and sparse datasets is common include click-through prediction [32] and plagiarism detection [7].
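The social-network case above is easy to picture in code: each user's column of the adjacency matrix is just the set of their neighbors' ids, and Jaccard similarity between two users compares those sets directly. A small illustrative sketch (the graph here is made up):

```python
def jaccard(a, b):
    """Jaccard similarity between two sparse binary vectors stored as sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Adjacency lists: user -> set of friend ids, i.e. the nonzero
# entries of that user's column in the d x d matrix A.
friends = {
    "alice": {1, 2, 3, 7},
    "bob":   {2, 3, 7, 9},
    "carol": {4, 5},
}
print(jaccard(friends["alice"], friends["bob"]))    # 3 shared / 5 total = 0.6
print(jaccard(friends["alice"], friends["carol"]))  # no overlap -> 0.0
```

Storing only the nonzero ids is what makes the d-dimensional representation tractable even when d is in the millions; the dense vector is never materialized.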
The naive way to perform k-NNS is to compute the exact distance (or similarity) between the query and all data points, followed by ranking. The naive approach suffers from a time complexity of O(N · D) per query
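Plugging the abstract's webspam numbers into this O(N · D) cost gives a rough sense of scale. The arithmetic below assumes N = 350,000 instances (the dataset's commonly published size, not stated in this excerpt); the 1.3 billion nonzeros come from the abstract:

```python
N = 350_000          # webspam instances (assumed from the dataset's published size)
total_nnz = 1.3e9    # total nonzeros across the dataset, per the abstract
ops = N * total_nnz  # one query touches every stored nonzero; N queries for a full k-NN graph
flops = ops / 10     # rate needed to finish within 10 seconds
print(f"{flops / 1e12:.1f} TFLOPS")  # ~45.5 TFLOPS, consistent with "at least 20 teraflops"
```

Even this back-of-the-envelope count, which charges only one operation per nonzero, lands well above the abstract's conservative 20-teraflop lower bound.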
…(Full text truncated)…
This content is AI-processed based on ArXiv data.