Big Data Technology Literature Review


A short overview of various algorithms and technologies that are helpful for big data storage and manipulation. Includes pointers to papers for further reading, and, where applicable, pointers to open source projects implementing a described storage type.


💡 Research Summary

The paper presents a comprehensive literature review of the technologies that underpin modern big‑data storage and manipulation. It begins by outlining the foundational layer of persistent storage, focusing on distributed file systems such as the Hadoop Distributed File System (HDFS) and object‑based storage systems such as Ceph. HDFS provides block‑level replication and data locality to accelerate batch‑oriented MapReduce jobs, while Ceph’s RADOS layer unifies block, file, and object storage, offering higher flexibility and seamless scaling across heterogeneous workloads. Both systems embed fault‑tolerance mechanisms, yet their design trade‑offs differ: HDFS is optimized for high‑throughput sequential access, whereas Ceph targets a broader spectrum of service‑level agreements.
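The data-locality idea behind HDFS replica selection can be sketched in a few lines: the reader is served from the replica with the fewest network hops in a tree-shaped topology (same node, then same rack, then same data centre). This is a minimal illustration of the concept, not HDFS's actual implementation; the topology path strings are hypothetical.

```python
# Sketch of HDFS-style replica selection: prefer the replica "closest"
# to the reader in the network topology (node > rack > data centre).
# Topology paths like "/dc1/rack2/node7" are illustrative only.

def distance(a: str, b: str) -> int:
    """Number of topology hops between two nodes (0 = same node)."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def closest_replica(reader: str, replicas: list) -> str:
    return min(replicas, key=lambda r: distance(reader, r))

replicas = ["/dc1/rack1/node3", "/dc1/rack2/node7", "/dc2/rack1/node1"]
print(closest_replica("/dc1/rack1/node5", replicas))  # same-rack replica wins
```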

The review then categorises NoSQL databases into key‑value stores (Redis, DynamoDB), column‑family stores (Apache Cassandra, HBase), and document‑oriented stores (MongoDB, Couchbase). Redis achieves microsecond‑level latency through its in‑memory architecture, while DynamoDB delivers consistent single‑digit‑millisecond latency with automatic sharding; both are well suited to caching and real‑time user‑profile services. Cassandra’s ring‑based partitioning and tunable consistency model enable worldwide write scalability with “always‑on” availability, while HBase, built on top of HDFS, excels at large‑scale scans and random reads but suffers from higher write latency and operational complexity. Document stores provide schema flexibility and rich aggregation pipelines, facilitating rapid application development and iterative data modelling.
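The ring-based partitioning used by Cassandra (and, conceptually, DynamoDB) can be sketched with a consistent-hash ring: each node owns tokens on a hash ring, and a key is routed to the first node whose token is at or past the key's hash, wrapping around. Node names, the hash function, and the virtual-node count below are illustrative assumptions, not details from the paper.

```python
import bisect
import hashlib

# Minimal consistent-hash ring, the idea behind Cassandra's ring-based
# partitioning. Virtual nodes (vnodes) smooth out load when nodes join
# or leave; only keys near the moved tokens are reassigned.

def token(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes: int = 8):
        self.ring = sorted(
            (token(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First token >= hash(key), wrapping around the ring.
        idx = bisect.bisect(self.tokens, token(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # deterministic placement on the ring
```

Because placement depends only on the key's hash, any coordinator can route a request without a central directory.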

The review then examines the columnar storage formats Parquet and ORC, together with the Avro serialization framework, for their compression, encoding, and query‑optimization capabilities. By storing data column‑wise, Parquet and ORC achieve superior compression ratios and enable vectorized execution, which improves CPU cache utilisation. Parquet’s page‑level compression and schema‑evolution support make it the de facto standard for Spark, Hive, and Presto, while ORC’s built‑in indexes and column statistics let cost‑based optimizers prune data aggressively. Avro, a row‑oriented serialization framework rather than a columnar format, offers robust schema evolution through writer/reader schema resolution, commonly backed by a schema registry, making it suitable for data interchange in streaming pipelines.
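Why column-wise storage compresses so well can be shown with run-length encoding, one of the per-column encodings that Parquet and ORC apply: values within a single column are homogeneous and often repetitive, so long runs collapse into (value, count) pairs. The sample column below is made up for illustration.

```python
from itertools import groupby

# Run-length encoding over a single column: the kind of repetition a
# columnar layout exposes that a row-wise layout interleaves away.

def rle_encode(column):
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(runs):
    return [value for value, count in runs for _ in range(count)]

country = ["DE"] * 5 + ["FR"] * 3 + ["DE"] * 2
encoded = rle_encode(country)
print(encoded)  # [('DE', 5), ('FR', 3), ('DE', 2)]
assert rle_decode(encoded) == country  # lossless round trip
```

Ten cells become three pairs here; on real columns with millions of rows the savings compound with dictionary encoding and general-purpose compression.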

The evolution of processing engines is traced from the original MapReduce model to in‑memory, DAG‑based systems such as Apache Spark and Apache Flink. Spark’s Resilient Distributed Datasets (RDDs) and DataFrame API, combined with the Catalyst optimizer, provide automatic logical‑to‑physical plan transformation, enabling iterative machine‑learning workloads and interactive analytics. Structured Streaming unifies batch and streaming under a single API, simplifying pipeline development. Flink adopts a true streaming‑first architecture, guaranteeing exactly‑once semantics and offering sophisticated state‑management APIs that are essential for low‑latency event‑driven applications. Both engines integrate with a wide array of connectors and file formats, allowing seamless data movement between data lakes, warehouses, and operational stores.
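The lazy, DAG-based evaluation model behind Spark's RDDs can be sketched in miniature: transformations such as map and filter only record lineage, and nothing executes until an action like collect triggers the whole chain. The class and method names below loosely mirror the RDD API but are a toy illustration, not Spark's implementation.

```python
# Toy lazy-evaluation sketch: transformations build a lineage (a linear
# DAG); the collect() action executes it end to end in one pass.

class LazyDataset:
    def __init__(self, source, lineage=()):
        self.source = source
        self.lineage = lineage          # recorded transformations

    def map(self, fn):
        return LazyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self.source, self.lineage + (("filter", pred),))

    def collect(self):                  # the action: run the pipeline
        data = iter(self.source)
        for kind, fn in self.lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

ds = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(ds.collect())  # [0, 4, 16, 36, 64]
```

Recording lineage rather than results is also what makes RDDs "resilient": a lost partition can be recomputed by replaying its lineage from the source.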

In the streaming domain, the paper surveys Apache Kafka, Kafka Streams, ksqlDB, Apache Storm, and Samza. Kafka’s partitioned log model delivers high throughput (hundreds of thousands of messages per second) and durability, serving as the backbone for event sourcing and change‑data‑capture architectures. Kafka Streams and ksqlDB provide declarative DSLs and SQL‑like query capabilities, enabling developers to express complex transformations without managing consumer groups manually. Storm offers ultra‑low‑latency topologies but lacks built‑in state handling, while Samza, tightly coupled with Kafka, supplies checkpointing and durable local state for stateful stream jobs.
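Kafka's partitioned-log model can be sketched as follows: records with the same key are hashed to the same partition, so per-key ordering is preserved while partitions scale out across brokers, and each record's position in its partition is its offset. Partition count and keys below are illustrative.

```python
import hashlib
from collections import defaultdict

# Sketch of a partitioned, append-only log in the style of Kafka:
# hash(key) picks the partition, the offset is the record's index.

class PartitionedLog:
    def __init__(self, num_partitions: int = 3):
        self.partitions = defaultdict(list)   # partition -> append-only log
        self.num_partitions = num_partitions

    def append(self, key: str, value: str):
        p = int(hashlib.sha256(key.encode()).hexdigest(), 16) % self.num_partitions
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

log = PartitionedLog()
p1, o1 = log.append("user-1", "login")
p2, o2 = log.append("user-1", "click")
assert p1 == p2 and o2 == o1 + 1  # same key, same partition, in order
```

Consumers track their own offsets per partition, which is what lets Kafka replay history for event sourcing and change-data-capture.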

Graph processing technologies are covered next, highlighting Pregel‑style batch systems (Apache Giraph) and real‑time graph databases (Neo4j, JanusGraph). Giraph excels at massive static graph analytics such as PageRank or community detection, leveraging bulk‑synchronous parallelism. Neo4j delivers ACID transactions and the expressive Cypher query language for traversals, making it suitable for recommendation engines and fraud detection. JanusGraph extends this capability to distributed back‑ends (Cassandra, HBase, ScyllaDB), enabling petabyte‑scale graph storage with configurable consistency.
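The bulk-synchronous model that Giraph inherits from Pregel can be sketched with PageRank: in each superstep every vertex sends its rank along outgoing edges, then all vertices update synchronously from the messages received. The tiny graph, damping factor, and superstep count below are illustrative defaults, not figures from the paper.

```python
# Pregel-style PageRank sketch: one "send" phase and one synchronous
# update per superstep (bulk-synchronous parallelism, sequentially).

def pagerank(edges: dict, damping: float = 0.85, supersteps: int = 20):
    rank = {v: 1.0 / len(edges) for v in edges}
    for _ in range(supersteps):
        messages = {v: 0.0 for v in edges}
        for v, outs in edges.items():           # send phase
            for w in outs:
                messages[w] += rank[v] / len(outs)
        rank = {                                # synchronous update
            v: (1 - damping) / len(edges) + damping * messages[v]
            for v in edges
        }
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c": it receives rank from both a and b
```

In Giraph the send phase is distributed across workers and the synchronization is a global barrier; the per-vertex logic is the same.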

The review also addresses the rise of cloud‑native storage and data‑lake architectures. Services like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage provide virtually unlimited object storage, forming the foundation for modern data lakes. Table formats such as Apache Iceberg and Delta Lake introduce snapshot isolation, ACID semantics, and schema‑evolution handling on top of these object stores, bridging the gap between raw lake storage and reliable analytical warehouses. Complementary query engines—Presto, Trino, and Apache Drill—offer federated SQL access across heterogeneous sources, employing cost‑based optimizers and extensible connector ecosystems to support interactive BI and ad‑hoc analytics.
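The core mechanism that Iceberg and Delta Lake layer over object stores can be sketched simply: data files are immutable, each commit writes new metadata listing the files that make up the table, and a reader pins one snapshot, which is what yields snapshot isolation and time travel. The class below is a conceptual toy under those assumptions, not either project's metadata layout; file names are made up.

```python
# Sketch of snapshot-based table metadata over immutable data files,
# the idea behind Iceberg/Delta table formats on object storage.

class Table:
    def __init__(self):
        self.snapshots = [[]]            # snapshot 0: empty table

    def commit(self, added_files) -> int:
        new = self.snapshots[-1] + added_files   # copy-on-write file list
        self.snapshots.append(new)
        return len(self.snapshots) - 1           # new snapshot id

    def scan(self, snapshot_id: int = -1):
        return list(self.snapshots[snapshot_id])

t = Table()
v1 = t.commit(["part-000.parquet"])
reader_view = t.scan(v1)                 # reader pins snapshot v1
t.commit(["part-001.parquet"])           # a later, concurrent write
assert t.scan(v1) == reader_view         # v1 never changes: time travel
print(t.scan())                          # latest snapshot sees both files
```

Real formats add manifests, column statistics for pruning, and atomic pointer swaps for concurrent writers, but the immutable-snapshot principle is the same.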

Finally, the paper synthesises selection criteria for practitioners. It recommends evaluating data volume, access patterns (read‑heavy vs. write‑heavy, random vs. sequential), consistency requirements, cost models, and team expertise when assembling a technology stack. By providing a curated list of seminal research papers and active open‑source projects for each category, the review equips readers with concrete entry points for deeper exploration and implementation. The overarching insight is that no single technology solves all big‑data challenges; instead, a carefully orchestrated combination of storage formats, databases, and processing engines—aligned with specific workload characteristics—delivers the most robust, scalable, and cost‑effective solutions.

