FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

Reading time: 6 minutes
...

📝 Original Info

  • Title: FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
  • ArXiv ID: 1709.01190
  • Date: 2018-07-04
  • Authors: Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, Junghee Ryu (Rice University)

📝 Abstract

We present FLASH (**F**ast **L**SH **A**lgorithm for **S**imilarity search accelerated with **H**PC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force (n²D) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
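As a rough illustration of two of the listed ingredients, the sketch below (hypothetical names and parameters, not FLASH's actual code) bounds each LSH bucket with reservoir sampling, so per-bucket memory stays fixed no matter how skewed the hash distribution is:

```python
import random
from collections import defaultdict

class ReservoirLSHTable:
    """Toy LSH table whose buckets hold fixed-size uniform samples.

    `hash_fn` is a stand-in for one LSH function (e.g. a minwise hash);
    it and `capacity` are illustrative choices, not FLASH's. Buckets are
    capped with Vitter's Algorithm R, keeping each bucket a uniform
    random sample of everything ever hashed to it.
    """

    def __init__(self, hash_fn, capacity=32, seed=0):
        self.hash_fn = hash_fn
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.buckets = defaultdict(list)  # hash key -> sampled item ids
        self.counts = defaultdict(int)    # hash key -> items offered so far

    def insert(self, item_id, features):
        key = self.hash_fn(features)
        self.counts[key] += 1
        bucket = self.buckets[key]
        if len(bucket) < self.capacity:
            bucket.append(item_id)
        else:
            # replace a random slot with probability capacity / counts[key]
            j = self.rng.randrange(self.counts[key])
            if j < self.capacity:
                bucket[j] = item_id

    def query_candidates(self, features):
        # candidates for re-ranking or count-based frequency estimation
        return list(self.buckets.get(self.hash_fn(features), []))
```

FLASH maintains many such tables and aggregates candidate counts across them; this fragment shows only the bucket-bounding idea.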

📄 Full Content

FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

Yiqiu Wang (Rice University, Houston, Texas, yiqiu.wang@rice.edu), Anshumali Shrivastava (Rice University, Houston, Texas, anshumali@rice.edu), Jonathan Wang (Rice University, Houston, Texas, jw96@rice.edu), Junghee Ryu (Rice University, Houston, Texas, jr51@rice.edu)

ABSTRACT

We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force (n²D) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results (https://github.com/RUSH-LAB/Flash).

KEYWORDS: Similarity search; locality sensitive hashing; reservoir sampling; GPGPU

ACM Reference Format: Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, and Junghee Ryu. 2018. FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3183713.3196925

© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4703-7/18/06.

1 INTRODUCTION

Similarity search, or k-nearest-neighbor search (k-NNS), is one of the most frequent operations in large-scale data processing systems. Given a query object q, with feature representation q ∈ R^D, the goal of similarity search is to find, from a collection C of N data instances, an object x (or set of objects) most similar to the given query. The notions of similarity are based on popular Euclidean-type measures such as cosine similarity [11] or Jaccard similarity [8].

k-NNS over Ultra-high Dimensional and Sparse Datasets: Recommendation systems naturally deal with ultra-high dimensional and sparse features, as they usually consist of categorical combinations. Even the general user-item matrix representation leads to an ultra-sparse and ultra-high dimensional representation.
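Jaccard similarity over such sparse binary vectors is exactly what minwise hashing estimates. Below is a toy one-permutation sketch in that spirit; the hash choices are illustrative only, and the "densification" step that the literature uses to fill empty bins is deliberately omitted:

```python
import hashlib

def oph_sketch(nonzeros, k_bins, seed=0):
    """One-permutation minwise hashing, heavily simplified.

    Each nonzero index is hashed exactly once; the hash range is split
    into k_bins bins and we keep the minimum hash value per bin. Empty
    bins stay None (real implementations densify them).
    """
    bins = [None] * k_bins
    for idx in nonzeros:
        h = int.from_bytes(
            hashlib.blake2b(f"{seed}:{idx}".encode(), digest_size=8).digest(),
            "big",
        )
        b = h % k_bins      # which bin this index lands in
        v = h // k_bins     # value ranked inside the bin
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    return bins

def jaccard_estimate(s1, s2):
    """Fraction of agreeing bins among bins non-empty in either sketch."""
    valid = [(a, b) for a, b in zip(s1, s2) if a is not None or b is not None]
    if not valid:
        return 0.0
    return sum(a == b for a, b in valid) / len(valid)
```

Because every nonzero is hashed only once, the sketching cost is a single pass over the data, which is what makes this family of methods attractive at a billion nonzeros.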
Neighborhood models [15] are popular in recommendation systems, where the first prerequisite is to find a set of near-neighbors for every item. Social networks are another natural ground for ultra-high dimensional and extremely sparse representations. In social networks, we represent the friendship relations between users as graphs. Given d users, we describe each user as a d-dimensional, very sparse vector whose non-zero entries correspond to edges. By representing each user as a column, we construct a matrix A of dimension d × d. Finding similar entries in such a user representation is one of the first operations required for a variety of tasks, including link prediction [26], personalization [14], and other social network mining tasks [47]. Other popular applications where k-NNS over ultra-high dimensional and sparse datasets is common include click-through prediction [32] and plagiarism detection [7].

The naive way to perform k-NNS is to compute the exact distance (or similarity) between the query and all data points, followed by ranking. This approach suffers from a time complexity of O(N · D) per query.
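That baseline can be made concrete in a few lines. This is a deliberately naive sketch of the O(N · D) approach (nothing from the paper's codebase), using dicts as sparse vectors:

```python
import heapq
import math

def brute_force_knn(query, data, k):
    """Exact k-NN under cosine similarity over sparse vectors (dict: index -> value).

    Every query touches every stored vector, so the cost per query is
    O(N * D): fine at toy scale, hopeless at a billion nonzeros.
    """
    def cosine(u, v):
        if len(u) > len(v):        # iterate the sparser vector's nonzeros
            u, v = v, u
        dot = sum(val * v.get(i, 0.0) for i, val in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    scored = ((cosine(query, x), i) for i, x in enumerate(data))
    return heapq.nlargest(k, scored)   # [(similarity, index), ...], best first
```

Repeating this for all N points to build a k-NN graph costs O(N²D), which is the brute-force figure the abstract's 20-teraflop estimate refers to.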

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
