Column-Oriented Storage Techniques for MapReduce
Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
💡 Research Summary
The paper “Column‑Oriented Storage Techniques for MapReduce” investigates how to bring the performance benefits of column‑oriented storage, long used in parallel database systems, to Hadoop‑based MapReduce workloads. The authors begin by quantifying the cost of the naive practice of storing Hadoop data as plain text files, showing that a simple switch to a binary storage format yields a three‑fold improvement in scan speed because it eliminates costly per‑field text parsing and UTF‑8 decoding.
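The parsing cost being eliminated can be illustrated with a toy comparison (a hedged Python sketch, not the paper's code or Hadoop's actual file formats): the same records are scanned from a tab‑separated text layout, which must decode and parse every field, and from a fixed‑width binary layout, which reads values back directly.

```python
import struct

records = [(i, i * 0.5) for i in range(1000)]

# Text layout: every scan re-decodes UTF-8 and re-parses each field.
text_blob = "\n".join(f"{k}\t{v}" for k, v in records).encode("utf-8")

def scan_text(blob):
    out = []
    for line in blob.decode("utf-8").splitlines():
        k, v = line.split("\t")
        out.append((int(k), float(v)))      # costly per-field parsing
    return out

# Binary layout: fixed-width fields are copied out with no parsing at all.
bin_blob = b"".join(struct.pack("<qd", k, v) for k, v in records)

def scan_binary(blob):
    rec = struct.Struct("<qd")              # 8-byte int + 8-byte double
    return [rec.unpack_from(blob, off) for off in range(0, len(blob), rec.size)]

assert scan_text(text_blob) == scan_binary(bin_blob)
```

Both scans produce identical rows; the binary path simply skips the character‑level work, which is the source of the 3× gain the authors report.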
Building on this, they design a true column‑oriented storage layout that respects Hadoop’s replication and scheduling constraints. Storing each column in a separate file would normally break data locality, because HDFS’s three‑way block replication places the blocks of different files independently across the cluster. To solve this, the authors introduce the Column‑Oriented Fileset (COF) abstraction: each logical partition (a “split”) contains one file per column, and a modified block placement policy keeps the replicas of all column files belonging to a split together on the same set of nodes. With the replicas co‑located, any map task that processes a split can read all of its columns locally, avoiding remote I/O.
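The fileset idea can be sketched as follows (an illustrative Python mock‑up: the names `fileset` and `read_columns` are inventions for this sketch, and a real fileset lives as column files in HDFS rather than an in‑memory dict). The point is that a job projecting two columns never opens the third:

```python
# Hypothetical layout: one fileset per split, one "file" per column.
fileset = {
    "split_0/url.col":  ["a.com", "b.com"],
    "split_0/size.col": [1024, 2048],
    "split_0/body.col": ["<html>...</html>", "<html>...</html>"],
}

def read_columns(split, wanted):
    """Open only the column files the job projects; the rest are never read.
    With all of a split's column files replicated on the same nodes, every
    one of these reads can be satisfied locally."""
    cols = [fileset[f"{split}/{name}.col"] for name in wanted]
    return list(zip(*cols))        # reassemble rows from the chosen columns

rows = read_columns("split_0", ["url", "size"])
# rows == [("a.com", 1024), ("b.com", 2048)]; body.col was never touched
```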
The paper then turns to the particular challenges posed by complex data types (arrays, maps, nested records), which are common in MapReduce jobs. In Java‑based Hadoop, deserializing these types incurs high CPU overhead because each field must be materialized as an object and often cast to the appropriate type. To mitigate this, the authors propose a skip‑list column format together with lazy record construction. The skip list stores pointers into the serialized column data so that a reader can jump directly past values it does not need, and lazy construction defers object creation until a field is actually accessed, mirroring the “late materialization” techniques of columnar DBMSs. Together these avoid deserializing unwanted records, and experiments demonstrate an additional 1.5× speedup over eager deserialization.
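A minimal Python sketch of the skipping mechanism (hedged: a flat offset table stands in for the paper's actual skip list, and `LazyColumn` is a hypothetical name). Accessing record *i* jumps straight to its bytes; the records before it are never deserialized:

```python
import json

class LazyColumn:
    """Serialized column values plus per-value offsets (a flat stand-in
    for a skip list). get(i) seeks directly to record i's bytes and
    deserializes only that record."""
    def __init__(self, values):
        blobs = [json.dumps(v).encode("utf-8") for v in values]
        self.offsets, pos = [], 0
        for b in blobs:
            self.offsets.append((pos, len(b)))
            pos += len(b)
        self.data = b"".join(blobs)
        self.decoded = 0                     # count actual deserializations

    def get(self, i):
        off, n = self.offsets[i]
        self.decoded += 1
        return json.loads(self.data[off:off + n])

col = LazyColumn([{"tags": ["a", "b"]}, {"tags": []}, {"tags": ["c"]}])
assert col.get(2) == {"tags": ["c"]}   # jump straight to record 2
assert col.decoded == 1                # records 0 and 1 never deserialized
```

The CPU saving comes from the counter staying at 1: with complex nested types, each avoided deserialization is an avoided tree of object allocations and casts.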
Compression is another key factor. While columnar storage naturally yields better compression ratios, standard Hadoop compressors such as LZO or ZLIB either provide insufficient compression for complex types or impose high CPU costs during decompression. The authors evaluate a lightweight dictionary‑based compressor tailored to columnar data, which achieves slightly lower compression ratios than LZO but dramatically reduces decompression CPU time, leading to overall faster job execution.
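Dictionary encoding in general works as in this short Python sketch (illustrative only; the paper's compressor and its on‑disk format are not reproduced here). Columns are good targets because values within one column repeat heavily, and decompression is a cheap array lookup rather than byte‑stream inflation:

```python
def dict_compress(column):
    """Replace each value with a small integer code into a per-column
    dictionary of distinct values."""
    dictionary, codes, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decompress(dictionary, codes):
    # Decompression is just indexing -- far cheaper on CPU than inflating
    # an LZO/ZLIB byte stream.
    return [dictionary[c] for c in codes]

col = ["text/html", "image/png", "text/html", "text/html"]
d, c = dict_compress(col)
assert d == ["text/html", "image/png"] and c == [0, 1, 0, 0]
assert dict_decompress(d, c) == col
```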
All of these innovations are implemented solely through Hadoop’s extensibility points (custom InputFormat and OutputFormat); no changes to the Hadoop core are required. Consequently, existing hand‑coded MapReduce jobs can adopt the new storage format without code modifications, and the approach works seamlessly with popular serialization frameworks such as Avro, Thrift, and Protocol Buffers.
The authors validate their design on a real‑world intranet crawl dataset comprising hundreds of gigabytes. Using the column‑oriented format, the map phase of a typical job is accelerated by up to two orders of magnitude (≈100×), and the end‑to‑end job runtime improves by more than an order of magnitude. The performance gains stem from a combination of reduced I/O (binary and columnar layout), eliminated unnecessary deserialization (skip‑list + lazy construction), and efficient lightweight compression.
In summary, the paper demonstrates that by carefully adapting column‑oriented storage concepts to Hadoop’s architecture—addressing replication, locality, complex type handling, and compression—significant speedups can be achieved without sacrificing Hadoop’s flexibility or requiring a shift to a traditional relational database system. This work provides a practical pathway for enterprises to retain their Hadoop investments while attaining near‑DBMS performance for read‑heavy analytics workloads.