Big Data Technology Literature Review
A short overview of various algorithms and technologies that are helpful for big data storage and manipulation. Includes pointers to papers for further reading, and, where applicable, pointers to open source projects implementing a described storage type.
Research Summary
The paper presents a comprehensive literature review of the technologies that underpin modern big-data storage and manipulation. It begins by outlining the foundational layer of persistent storage, focusing on distributed file systems such as the Hadoop Distributed File System (HDFS) and object-based solutions like Ceph. HDFS provides block-level replication and data locality to accelerate batch-oriented MapReduce jobs, while Ceph's RADOS layer unifies block, file, and object storage, offering higher flexibility and seamless scaling across heterogeneous workloads. Both systems embed fault-tolerance mechanisms, yet their design trade-offs differ: HDFS is optimized for high-throughput sequential access, whereas Ceph targets a broader spectrum of service-level agreements.
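HDFS's replication strategy can be sketched in a few lines. The toy function below mimics the default rack-aware placement policy (first replica on the writer's rack, the remaining replicas on one remote rack, so a whole-rack failure never loses every copy); deriving the "local" rack from the block id is an illustrative stand-in for the writer location a real NameNode would know.

```python
def place_replicas(block_id, racks, replication=3):
    """Toy rack-aware placement in the spirit of HDFS's default policy.
    `racks` maps rack name -> list of node names (illustrative layout)."""
    rack_names = sorted(racks)
    local = rack_names[block_id % len(rack_names)]        # stand-in for writer's rack
    remote = rack_names[(block_id + 1) % len(rack_names)]  # one remote rack
    targets = [racks[local][0]]            # replica 1: a node in the local rack
    for node in racks[remote]:             # replicas 2..n: distinct remote nodes
        if len(targets) == replication:
            break
        targets.append(node)
    return targets
```

With `racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"]}`, block 0 is placed on one local node and two nodes in the other rack, which is exactly the fault-domain spread the policy is after.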
The review then categorises NoSQL databases into key-value stores (Redis, DynamoDB), column-family stores (Apache Cassandra, HBase), and document-oriented stores (MongoDB, Couchbase). Redis achieves sub-millisecond latency through its in-memory architecture, while DynamoDB pairs single-digit-millisecond latency with automatic partitioning, making both popular for caching and real-time user-profile services. Cassandra's ring-based partitioning and tunable consistency model enable worldwide write scalability with "always-on" availability, while HBase, built on top of HDFS, excels at large-scale scans and random reads but suffers from higher write latency and operational complexity. Document stores provide schema flexibility and rich aggregation pipelines, facilitating rapid application development and iterative data modelling.
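Cassandra's ring-based partitioning rests on consistent hashing, which can be sketched compactly. In this sketch (node names, vnode count, and the MD5-based token function are illustrative; Cassandra itself defaults to Murmur3), each node owns several tokens, a key is served by the first node whose token follows the key's hash, and adding or removing one node only remaps the keys on that node's token ranges.

```python
import bisect
import hashlib

class HashRing:
    """Sketch of a consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node owns `vnodes` tokens spread around the ring.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(key):
        # Stable 64-bit token from a cryptographic hash of the key.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        # The owner is the next token clockwise from the key's hash.
        index = bisect.bisect_right(self._tokens, self._token(key)) % len(self._tokens)
        return self._ring[index][1]
```

Because the token function is deterministic, every replica computes the same owner for a key without coordination, which is what makes the scheme suitable for masterless, "always-on" designs.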
Columnar storage formats such as Parquet and ORC are examined for their compression, encoding, and query-optimization capabilities, alongside Avro, a row-oriented serialization format often used with them. By storing data column-wise, the columnar formats achieve superior compression ratios and enable vectorized execution, which improves CPU cache utilisation. Parquet's page-level compression and schema-evolution support make it the de facto standard for Spark, Hive, and Presto, while ORC's built-in indexes and statistics empower cost-based optimizers to prune data aggressively. Avro offers robust schema evolution, typically managed via a central schema registry, making it suitable for data interchange in streaming pipelines.
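The compression advantage of column-wise layout is easy to demonstrate with a minimal sketch (the record shapes and field names here are made up): pivoting rows into columns gathers repeated values into long runs, which per-column encodings such as run-length encoding then collapse.

```python
from itertools import groupby

def to_columns(rows, names):
    """Pivot row-oriented records into a column-per-field layout,
    the core idea behind Parquet and ORC."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def rle(values):
    """Run-length encode one column: each run of equal values becomes
    a single (value, count) pair, so low-cardinality columns shrink a lot."""
    return [(value, len(list(group))) for value, group in groupby(values)]
```

For four rows of web-log-like data, the `country` column collapses from four values to two (value, count) pairs, while the same values interleaved row-wise would compress far less.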
The evolution of processing engines is traced from the original MapReduce model to in-memory, DAG-based systems such as Apache Spark and Apache Flink. Spark's Resilient Distributed Datasets (RDDs) and DataFrame API, combined with the Catalyst optimizer, provide automatic logical-to-physical plan transformation, enabling iterative machine-learning workloads and interactive analytics. Structured Streaming unifies batch and streaming under a single API, simplifying pipeline development. Flink adopts a true streaming-first architecture, guaranteeing exactly-once semantics and offering sophisticated state-management APIs that are essential for low-latency event-driven applications. Both engines integrate with a wide array of connectors and file formats, allowing seamless data movement between data lakes, warehouses, and operational stores.
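The MapReduce model these engines evolved from can be shown in miniature with the classic word-count example, written here as three plain Python functions (single-process, so purely illustrative of the phases a cluster framework distributes):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit one (word, 1) pair per token.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key; this is the hidden
    # framework step that Spark generalizes into arbitrary DAG stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}
```

Chaining the three phases over `["big data", "big wins"]` yields the per-word counts; Spark's contribution was keeping such intermediate results in memory across many chained stages instead of writing each one to disk.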
In the streaming domain, the paper surveys Apache Kafka, Kafka Streams, ksqlDB, Apache Storm, and Samza. Kafka's partitioned log model delivers high throughput (hundreds of thousands of messages per second) and durability, serving as the backbone for event sourcing and change-data-capture architectures. Kafka Streams and ksqlDB provide declarative DSLs and SQL-like query capabilities, enabling developers to express complex transformations without managing consumer groups manually. Storm offers ultra-low-latency topologies but lacks built-in state handling in its core API, while Samza, tightly coupled with Kafka, supplies checkpointing and at-least-once processing guarantees for stateful stream jobs.
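The partitioned-log model itself is small enough to sketch. In this toy version (partition count and hashing scheme are illustrative; Kafka's default partitioner uses Murmur2), hashing the record key sends all events for one key to the same partition, where they stay in append order and are addressed by a monotonically growing offset.

```python
import hashlib

class PartitionedLog:
    """Minimal in-process model of a Kafka-style partitioned log."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Key-based partitioning: equal keys always map to one partition,
        # preserving per-key ordering.
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def append(self, key, value):
        p = self._partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)
```

Consumers track their position as a (partition, offset) pair, which is why replaying history for event sourcing or change-data-capture is just re-reading the log from an earlier offset.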
Graph processing technologies are covered next, highlighting Pregel-style batch systems (Apache Giraph) and real-time graph databases (Neo4j, JanusGraph). Giraph excels at massive static graph analytics such as PageRank or community detection, leveraging bulk-synchronous parallelism. Neo4j delivers ACID transactions and the expressive Cypher query language for traversals, making it suitable for recommendation engines and fraud detection. JanusGraph extends this capability to distributed back-ends (Cassandra, HBase, ScyllaDB), enabling petabyte-scale graph storage with configurable consistency.
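The bulk-synchronous pattern Giraph implements can be illustrated with PageRank written as supersteps: in each superstep every vertex sends its rank share along outgoing edges, then all vertices update together before the next round begins. This single-process sketch ignores dangling vertices (those with no out-edges) to stay short.

```python
def pagerank(adjacency, supersteps=20, damping=0.85):
    """PageRank as bulk-synchronous supersteps over an adjacency dict
    mapping each vertex to its list of outgoing neighbours."""
    n = len(adjacency)
    rank = {v: 1.0 / n for v in adjacency}
    for _ in range(supersteps):
        # "Message passing": each vertex distributes its rank evenly
        # across its out-edges.
        incoming = {v: 0.0 for v in adjacency}
        for vertex, out_edges in adjacency.items():
            if out_edges:
                share = rank[vertex] / len(out_edges)
                for target in out_edges:
                    incoming[target] += share
        # Barrier: all vertices update simultaneously from the messages.
        rank = {v: (1 - damping) / n + damping * incoming[v] for v in adjacency}
    return rank
```

In a real Pregel system the same superstep/barrier structure runs across thousands of workers, with the framework handling message routing and fault recovery between barriers.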
The review also addresses the rise of cloud-native storage and data-lake architectures. Services like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage provide virtually unlimited object storage, forming the foundation for modern data lakes. Table formats such as Apache Iceberg and Delta Lake introduce snapshot isolation, ACID semantics, and schema-evolution handling on top of these object stores, bridging the gap between raw lake storage and reliable analytical warehouses. Complementary query engines such as Presto, Trino, and Apache Drill offer federated SQL access across heterogeneous sources, employing cost-based optimizers and extensible connector ecosystems to support interactive BI and ad-hoc analytics.
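The snapshot mechanism behind these table formats can be modelled in a few lines (a deliberate simplification: real formats like Iceberg and Delta Lake persist this metadata as manifest/log files in the object store rather than in memory). Every commit records an immutable list of data files, so readers get snapshot isolation and can "time travel" to any earlier version.

```python
class SnapshotTable:
    """Toy model of snapshot-based table metadata."""

    def __init__(self):
        self._snapshots = [[]]  # version 0: empty table

    def commit(self, added, removed=()):
        # A commit never mutates old snapshots; it derives a new file
        # list from the latest one and appends it as a new version.
        current = [f for f in self._snapshots[-1] if f not in set(removed)]
        self._snapshots.append(current + list(added))
        return len(self._snapshots) - 1  # new version number

    def files_at(self, version):
        # Readers pin a version and see a stable file list regardless
        # of concurrent commits -- the essence of snapshot isolation.
        return list(self._snapshots[version])
```

Because old versions remain readable, a query that started against version 1 is unaffected by a later commit that rewrites the same data, and auditing or rollback amounts to reading an older version number.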
Finally, the paper synthesises selection criteria for practitioners. It recommends evaluating data volume, access patterns (read-heavy vs. write-heavy, random vs. sequential), consistency requirements, cost models, and team expertise when assembling a technology stack. By providing a curated list of seminal research papers and active open-source projects for each category, the review equips readers with concrete entry points for deeper exploration and implementation. The overarching insight is that no single technology solves all big-data challenges; instead, a carefully orchestrated combination of storage formats, databases, and processing engines, aligned with specific workload characteristics, delivers the most robust, scalable, and cost-effective solutions.
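Two of the review's criteria can be turned into a toy shortlist function. The pairings below are illustrative examples only, not recommendations from the paper, and a real selection would also weigh data volume, cost model, and team expertise.

```python
# Hypothetical mapping from (access pattern, consistency need) to a
# candidate technology category; purely illustrative pairings.
CANDIDATES = {
    ("write-heavy", "eventual"): "column-family store (e.g. Cassandra)",
    ("write-heavy", "strong"): "HBase on HDFS",
    ("read-heavy", "eventual"): "in-memory key-value cache (e.g. Redis)",
    ("read-heavy", "strong"): "table format + SQL engine (e.g. Iceberg + Trino)",
}

def shortlist(access_pattern, consistency):
    """Return a starting-point category, or flag that the workload
    needs a fuller analysis against all of the review's criteria."""
    return CANDIDATES.get((access_pattern, consistency), "needs fuller analysis")
```

Even this crude lookup captures the paper's closing point: the right answer is a function of workload characteristics, not a single universally best system.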