Distributed Data Processing Frameworks for Big Graph Data


📝 Abstract

We now create so much data (2.5 quintillion bytes every day) that 90% of the data in the world today has been created in the last two years alone [1]. This data comes from sensors used to gather traffic or climate information, posts to social media sites, photos, videos, emails, purchase transaction records, call logs of cellular networks, and so on. This is big data. In this report, we first briefly discuss the programming models used for big data processing, then focus on graph data and survey the programming models and frameworks used to solve graph problems at very large scale. In section 2, we introduce programming models that are not specifically designed to handle graph data; we include them in this survey because they are important frameworks and/or there have been studies to customize them for more efficient graph processing. In section 3, we discuss techniques that yield up to a 1340-times speedup for certain graph problems when applied to Hadoop. In section 4, we discuss the vertex-based programming model, which is designed specifically to process large graphs, and the frameworks that adopt it. In section 5, we implement two fundamental graph algorithms (PageRank and Weighted Bipartite Matching) and run them on a single node as a baseline to see how fast they are on large datasets and whether partitioning them is worthwhile.

📄 Content

Afsin Akdogan, Hien To
University of Southern California, Los Angeles, CA 90089, USA
[aakdogan,hto]@usc.edu

I. INTRODUCTION

We now create so much data (2.5 quintillion bytes every day) that 90% of the data in the world today has been created in the last two years alone [1]. This data comes from sensors used to gather traffic or climate information, posts to social media sites, photos, videos, emails, purchase transaction records, call logs of cellular networks, and so on. This is big data. In this report, we first briefly discuss the programming models used for big data processing, then focus on graph data and survey the programming models and frameworks used to solve graph problems at very large scale. In section 2, we introduce programming models that are not specifically designed to handle graph data; we include them in this survey because they are important frameworks and/or there have been studies to customize them for more efficient graph processing. In section 3, we discuss techniques that yield up to a 1340-times speedup for certain graph problems when applied to Hadoop. In section 4, we discuss the vertex-based programming model, which is designed specifically to process large graphs, and the frameworks that adopt it. In section 5, we implement two fundamental graph algorithms (PageRank and Weighted Bipartite Matching) and run them on a single node as a baseline to see how fast they are on large datasets and whether partitioning them is worthwhile.
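A single-node PageRank baseline of the kind mentioned above can be sketched with power iteration. The graph, damping factor, and iteration count below are illustrative assumptions, not the paper's actual experimental setup:

```python
# Minimal single-node PageRank sketch (power iteration).
# Illustrative only: graph, damping factor, and iteration count are assumptions.

def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each vertex to a list of its out-neighbors."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Every vertex starts each round with the teleport share.
        new_ranks = {v: (1.0 - damping) / n for v in graph}
        for v, out_links in graph.items():
            if out_links:
                share = damping * ranks[v] / len(out_links)
                for u in out_links:
                    new_ranks[u] += share
            else:
                # Dangling vertex: spread its rank over all vertices.
                for u in graph:
                    new_ranks[u] += damping * ranks[v] / n
        ranks = new_ranks
    return ranks

# Tiny example graph: A -> B, A -> C, B -> C, C -> A
g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank(g)
print(sorted(r, key=r.get, reverse=True))
```

On a single node this is O(iterations × edges), which is exactly why the paper asks whether partitioning the graph across machines pays off for large datasets.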

II. BIG DATA PROCESSING FRAMEWORKS

Distributed data processing models have been one of the active areas of recent database research, and several frameworks have been proposed in the literature. Figure 1 shows the release dates of some of the successful frameworks; the arrows show the dependencies among the models. For example, Hive converts scripts written in its own language into MapReduce tasks, so an arrow connects them.
In 2004 Google proposed MapReduce, a functional programming model that lets regular programmers produce parallel distributed programs easily. Although it is extremely scalable, this simplified framework either cannot model many problems at all or models them very inefficiently, precisely because it is general-purpose. For example, it has been shown that using Hadoop on relational data is at least a factor of 50 less efficient than it needs to be [10, 11]. To address this inefficiency, problem- and data-specific programming models have been developed, such as Facebook's Hive [3] for SQL-like workloads, Yahoo's Pig Latin [4] for iterative data processing, Google's Pregel [5] for graph processing, Microsoft's Dryad [8], and SpatialHadoop [12] for geospatial data processing. In this section we discuss these programming models.

A. MapReduce

With the ever-increasing popularity of mobile devices, social media, and web-based services such as email, we create so much data that the relational database tools at hand cannot handle and process it at this scale. Therefore, in 2004 Google proposed the MapReduce functional programming model, which lets regular programmers produce parallel distributed programs easily by requiring them to write only simple map and reduce functions. Figure 2 shows how Hadoop [9] (an open-source implementation of the MapReduce model) processes data in four splits in parallel, using three map machines and two reduce machines.

Figure 1: Timeline for distributed programming models for big data processing
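To make the map/reduce division of labor concrete, here is a minimal single-process sketch of the classic word-count job. The function names and the in-memory shuffle are illustrative, not Hadoop's actual API:

```python
# Single-process word-count sketch of the MapReduce model.
# map_fn, reduce_fn, and run_mapreduce are illustrative names, not Hadoop's API.
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all partial counts for one word.
    return (word, sum(counts))

def run_mapreduce(lines):
    # Shuffle phase: group every map output by its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["big graph data", "big data"])
print(counts)  # {'big': 2, 'graph': 1, 'data': 2}
```

In a real Hadoop job, the splits shown in Figure 2 are fed to map tasks on different machines, and the framework performs the shuffle over the network before the reduce tasks run.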

The problem with Hadoop is that its strength is also its weakness. Hadoop gives users the power to scale diverse data management problems, but this flexibility lets them write inefficient operations without caring, because they can add more computing nodes and use Hadoop's scalability to hide the inefficiency in their code; and since Hadoop is designed for batch processing, they can let a job run in the background without caring how long it takes to return. For example, it has been shown that using Hadoop on relational data is at least a factor of 50 less efficient than it needs to be [10, 11]. Instead of a single general-purpose programming model, it is more efficient to group certain types of problems together and design a framework that works only for that domain. In the following sections we briefly introduce such frameworks.


Figure 2: The data flow in the Hadoop architecture

B. Hive

Hive is an open-source framework built on top of Hadoop. It supports queries expressed in a SQL-like declarative language, HiveQL, which are first translated into MapReduce tasks.
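As a rough sketch of that translation, a simple HiveQL aggregation such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` becomes one MapReduce job: the map phase emits the grouping key, and the reduce phase computes the aggregate. The table data and function names below are illustrative assumptions, not Hive's internal machinery:

```python
# Conceptual sketch: how a HiveQL GROUP BY + COUNT(*) maps onto one
# MapReduce job. Table contents and helper names are illustrative only.
from collections import defaultdict

rows = [{"dept": "sales"}, {"dept": "eng"}, {"dept": "sales"}]

def map_fn(row):
    # Map: emit the GROUP BY key with a partial count of 1.
    yield (row["dept"], 1)

def reduce_fn(dept, counts):
    # Reduce: aggregate the partial counts per key, i.e. COUNT(*).
    return (dept, sum(counts))

# Shuffle: group map outputs by key, then reduce each group.
groups = defaultdict(list)
for row in rows:
    for key, value in map_fn(row):
        groups[key].append(value)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)  # {'sales': 2, 'eng': 1}
```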
