Storing and Analyzing Historical Graph Data at Scale
The work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. However, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. Being able to find temporal interaction patterns, visualize the evolution of graph properties, or even simply compare them across time, adds significant value in reasoning over graphs. However, because of a lack of underlying data management support, an analyst today has to manually navigate the added temporal complexity of dealing with large evolving graphs. In this paper, we present a system, called Historical Graph Store, that enables users to store large volumes of historical graph data and to express and run complex temporal graph analytical tasks against that data. It consists of two key components: a Temporal Graph Index (TGI), which compactly stores large volumes of historical graph evolution data in a partitioned and distributed fashion and provides support for retrieving snapshots of the graph as of any timepoint in the past or evolution histories of individual nodes or neighborhoods; and a Spark-based Temporal Graph Analysis Framework (TAF), for expressing complex temporal analytical tasks and for executing them in an efficient and scalable manner. Our experiments demonstrate our system’s efficient storage, retrieval and analytics across a wide variety of queries on large volumes of historical graph data.
💡 Research Summary
The paper addresses a critical gap in large‑scale graph analytics: the lack of native support for temporal dimensions. While many graph databases and processing frameworks (Neo4j, Titan, Pregel, GraphX, etc.) excel at handling a single, static snapshot, real‑world networks continuously evolve, and analysts need to query past states, track node or neighborhood histories, and run complex time‑aware algorithms. To meet these needs, the authors introduce the Historical Graph Store (HGS), a two‑component system comprising a Temporal Graph Index (TGI) and a Temporal Graph Analysis Framework (TAF).
Temporal Graph Index (TGI)
TGI stores the entire evolution of a graph as a sequence of atomic events (additions, deletions, attribute updates) called deltas. Rather than materializing a full copy of the graph at each change (the “Copy” approach) or keeping only a log of changes (the “Log” approach), TGI adopts a hybrid strategy. It compresses deltas, groups them into time‑based chunks, and writes each chunk to a distributed key‑value store (Apache Cassandra). The index is tunable: parameters control chunk size, replication factor, and temporal granularity, allowing a trade‑off between storage overhead and query latency. Crucially, TGI supports both time‑centric queries (e.g., “give me the whole graph as of timestamp t”) and entity‑centric queries (e.g., “retrieve the full history of node v”). By partitioning the graph and storing deltas per partition, TGI scales horizontally and can adapt to changing topology without costly repartitioning. Retrieval algorithms selectively read only the chunks required for a given query, then reconstruct the desired snapshot or node history by applying the relevant deltas.
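The hybrid strategy can be illustrated with a small sketch: keep a full materialized copy of the graph every `k` events (the "Copy" side) and an ordered log of deltas in between (the "Log" side), so that reconstructing a snapshot only replays the deltas after the nearest copy. This is a simplified toy, not the actual TGI data structures or Cassandra schema; the `Delta` and `HybridStore` names, and the cadence parameter `k`, are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Delta:
    """One atomic graph event: add/remove a node or edge at a timestamp."""
    time: int
    op: str       # "add_node", "del_node", "add_edge", "del_edge"
    payload: object  # a node id, or a (src, dst) edge pair

class HybridStore:
    """Toy hybrid of the Copy and Log approaches: a full snapshot every
    `k` events, plus a delta log in between (not the real TGI)."""
    def __init__(self, k=100):
        self.k = k
        self.deltas = []       # ordered event log
        self.snapshots = {}    # time -> (nodes, edges) materialized copies
        self._nodes, self._edges = set(), set()

    def _play(self, d, nodes, edges):
        if d.op == "add_node": nodes.add(d.payload)
        elif d.op == "del_node": nodes.discard(d.payload)
        elif d.op == "add_edge": edges.add(d.payload)
        elif d.op == "del_edge": edges.discard(d.payload)

    def apply(self, d):
        self.deltas.append(d)
        self._play(d, self._nodes, self._edges)
        if len(self.deltas) % self.k == 0:   # periodic materialized copy
            self.snapshots[d.time] = (set(self._nodes), set(self._edges))

    def snapshot(self, t):
        """Reconstruct the graph as of time t: start from the latest copy
        at or before t, then replay the deltas between the copy and t."""
        base_t = max((s for s in self.snapshots if s <= t), default=None)
        if base_t is None:
            nodes, edges = set(), set()
        else:
            nodes, edges = (set(self.snapshots[base_t][0]),
                            set(self.snapshots[base_t][1]))
        for d in self.deltas:
            if (base_t is None or d.time > base_t) and d.time <= t:
                self._play(d, nodes, edges)
        return nodes, edges
```

A larger `k` trades storage for reconstruction time, which mirrors the tunable trade-off between storage overhead and query latency described above.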
Temporal Graph Analysis Framework (TAF)
TAF sits on top of Apache Spark and provides a high‑level, node‑centric programming model for temporal analytics. The core abstraction is a Set of Nodes (SoN), which couples a collection of vertices with a time interval. A small library of temporal operators—Select, Timeslice, Filter, Map, MapDelta—allows users to compose complex pipelines declaratively in either Java or Python. For example, an analyst can express “compute the average degree of each node over the last three years” as a sequence of Timeslice → Map (degree) → Reduce (average). Under the hood, TAF translates these operators into Spark RDD transformations, automatically parallelizing work across both nodes and time slices. When an operator requires raw graph data, TAF issues targeted fetches to TGI, leveraging its selective access capabilities. The framework also caches frequently accessed deltas to reduce network traffic.
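The "average degree over an interval" example can be sketched as a small node-centric pipeline. The `SoN`, `timeslice`, and `map` names below mirror the abstractions described above but are a simplified, single-machine stand-in, not TAF's actual Java/Python API or its Spark execution.

```python
class SoN:
    """Toy 'set of nodes': per-node version histories over an interval,
    as {node: [(timestamp, set_of_neighbors), ...]} sorted by timestamp."""
    def __init__(self, histories, interval):
        self.histories = histories
        self.interval = interval          # (start, end), inclusive

    def timeslice(self, start, end):
        """Keep only the node versions falling inside [start, end]."""
        sliced = {n: [(t, nbrs) for t, nbrs in vs if start <= t <= end]
                  for n, vs in self.histories.items()}
        return SoN({n: vs for n, vs in sliced.items() if vs}, (start, end))

    def map(self, fn):
        """Apply fn to every (timestamp, neighbors) version of every node."""
        return {n: [fn(t, nbrs) for t, nbrs in vs]
                for n, vs in self.histories.items()}

def avg_degree(son, start, end):
    """Timeslice -> Map(degree) -> Reduce(average) over each node's versions."""
    degrees = son.timeslice(start, end).map(lambda t, nbrs: len(nbrs))
    return {n: sum(ds) / len(ds) for n, ds in degrees.items()}
```

In the real framework, each stage of such a pipeline becomes a Spark RDD transformation, and the initial histories are fetched selectively from TGI rather than held in a local dictionary.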
Evaluation
The authors evaluate HGS on real‑world social‑network datasets containing up to 100 M vertices and 500 M edges. They compare storage consumption against pure Copy and pure Log baselines, showing a 2–3× reduction in disk usage while maintaining near‑Copy query latency. Snapshot reconstruction times are reduced to under 30 % of the Log‑only approach. Node‑history queries exhibit sub‑second latency for most cases. For analytics, a suite of temporal workloads (evolution of centrality measures, community detection over time, dynamic PageRank) is executed on a 64‑node Spark cluster. Results demonstrate near‑linear scaling and overall execution times that are an order of magnitude faster than a naïve implementation that repeatedly materializes full snapshots.
Key Insights
- A delta‑based hybrid index can simultaneously satisfy fast point‑in‑time access and compact storage, overcoming the classic Copy vs. Log trade‑off.
- Providing both time‑centric and entity‑centric indexing within the same structure is essential for the diverse query patterns of temporal graph analysis.
- Abstracting temporal analytics as a small set of composable operators on a node‑time set enables analysts to write expressive code without dealing with low‑level data movement.
- Leveraging a mature distributed key‑value store (Cassandra) for delta storage gives the system out‑of‑the‑box scalability and fault tolerance.
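The second insight, serving both time-centric and entity-centric access from one store, comes down to key design: delta chunks keyed by partition and time chunk make a snapshot query a contiguous key-range read, while keying versions by node id makes a history query a point lookup. The sketch below illustrates this with made-up key formats; it is not Cassandra's actual schema for TGI, and `CHUNK_SPAN` is a hypothetical granularity parameter.

```python
CHUNK_SPAN = 100   # hypothetical temporal granularity of one delta chunk

def time_chunk_key(partition_id, timestamp):
    """Time-centric layout: deltas grouped per partition per time chunk,
    so snapshot reconstruction reads a contiguous run of chunk keys."""
    return f"p{partition_id}:t{timestamp // CHUNK_SPAN}"

def entity_key(node_id):
    """Entity-centric layout: all versions of one node share a key,
    so a node-history query is a single point lookup."""
    return f"n:{node_id}"

def chunks_for_snapshot(partition_ids, timestamp):
    """A snapshot as of `timestamp` needs, for each partition, every
    chunk up to and including the one containing the timestamp."""
    last = timestamp // CHUNK_SPAN
    return [time_chunk_key(p, c * CHUNK_SPAN)
            for p in partition_ids for c in range(last + 1)]
```

With both layouts populated, neither query pattern has to scan data organized for the other, at the cost of writing each delta under more than one key.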
Conclusions and Future Work
The Historical Graph Store demonstrates that large‑scale, history‑preserving graph management is feasible and performant. By integrating a compact, tunable index with a Spark‑based analytical layer, the system bridges the gap between storage and analytics for evolving networks. Future directions include tighter integration with streaming graph ingestion pipelines, extending the operator set to support temporal pattern mining, and exploring adaptive partitioning strategies that react to workload characteristics in real time.