ByteHouse: ByteDance's Cloud-Native Data Warehouse for Real-Time Multimodal Data Analytics

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

With the rapid rise of intelligent data services, modern enterprises increasingly require efficient, multimodal, and cost-effective data analytics infrastructures. However, in ByteDance’s production environments, existing systems fall short due to limitations such as I/O-inefficient multimodal storage, inflexible query optimization (e.g., failing to optimize multimodal access patterns), and performance degradation caused by resource disaggregation (e.g., loss of data locality in remote storage). To address these challenges, we introduce ByteHouse (https://bytehouse.cloud), a cloud-native data warehouse designed for real-time multimodal data analytics. The storage layer integrates a unified table engine that provides a two-tier logical abstraction and a physically consistent layout, an SSD-backed cluster-scale cache (CrossCache) that supports shared caching across compute nodes, and a virtual file system (NexusFS) that enables efficient local access on compute nodes. The compute layer supports analytical, batch, and incremental execution modes, with tailored optimizations for hybrid queries (e.g., runtime filtering over tiered vector indexes). The control layer coordinates global metadata and transactions, and features an effective optimizer enhanced by historical execution traces and AI-assisted plan selection. Evaluations on internal and standard workloads show that ByteHouse achieves significant efficiency improvements over existing systems.

💡 Research Summary

ByteHouse is a cloud‑native data warehouse designed by ByteDance to meet the growing demand for real‑time multimodal analytics across its diverse services, including e‑commerce, gaming, finance, and AI‑powered applications. The authors identify three fundamental shortcomings of existing solutions: (1) I/O‑inefficient multimodal storage, (2) inflexible query optimization that cannot adapt to hybrid text‑vector workloads, and (3) performance loss caused by disaggregated storage where compute nodes must repeatedly fetch data and metadata from remote object stores. To address these issues, ByteHouse introduces a tightly integrated three‑layer architecture: storage, compute, and control.

The storage layer centers on a unified table engine that provides a two‑tier logical abstraction (documents and chunks) and a physically consistent layout composed of stable and delta segments. This design enables MVCC‑based snapshot isolation while supporting both large‑scale columnar scans and fine‑grained point lookups needed for vector retrieval. A self‑describing file format, Sniper, co‑locates raw data, indexes, and metadata, eliminating external metadata lookups. An adaptive compaction controller adjusts the compaction intensity (α) based on the number of active delta segments, thereby balancing write amplification against scan locality. To bridge the latency gap between compute and remote object storage, ByteHouse deploys CrossCache, an SSD‑backed distributed cache that shards data into fine‑grained chunks, applies consistent hashing, and performs prefetching and asynchronous flushing. NexusFS, a virtual file‑system abstraction, unifies access to heterogeneous back‑ends (TOS, HDFS, local SSD) under a single namespace, providing alignment‑aware region management and buffer orchestration for zero‑copy Arrow‑based data transfer.
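To make the chunk placement idea concrete, the following is a minimal sketch of consistent hashing over fixed-size chunks. It is not ByteHouse code: the class `HashRing`, the helper `chunk_key`, the 4 MiB chunk size, the node names, and the use of virtual nodes are all assumptions chosen for illustration, since the summary only states that CrossCache shards data into fine-grained chunks and applies consistent hashing.

```python
import bisect
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB chunk size, for illustration only


def stable_hash(key: str) -> int:
    """Hash a string to a 64-bit integer that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def chunk_key(path: str, offset: int, chunk_size: int = CHUNK_SIZE) -> str:
    """Identify the fixed-size chunk that contains a byte offset of a file."""
    return f"{path}#{offset // chunk_size}"


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical, not CrossCache's API)."""

    def __init__(self, nodes, vnodes: int = 64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node owns several points on the ring to smooth the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}/{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._ring:
            raise LookupError("hash ring has no nodes")
        idx = bisect.bisect_left(self._ring, (stable_hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.lookup(chunk_key("table/part-0.bin", offset=9 * 1024 * 1024))
```

The property that matters for a shared cache is that removing a cache node only remaps the chunks that node owned; chunks cached on the remaining nodes keep their placement, so a node failure does not invalidate the whole cache.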

The compute layer supports three execution modes: Analytic Pipeline Mode (APM) for distributed multi‑stage SQL processing, Staged Batch Mode (SBM) for long‑running ETL jobs with checkpointing, and Incremental Processing Mode (IPM) for delta‑aware real‑time updates. Hybrid retrieval operators such as RANK_FUSION combine semantic vector scores with lexical relevance, while runtime filtering pushes scalar predicates into vector scans to prune irrelevant vectors early. A tiered vector index architecture is tuned for different service latency‑cost profiles, offering online, near‑real‑time, and cost‑sensitive configurations.
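The summary does not say how RANK_FUSION combines the two score lists. As one plausible illustration, the sketch below uses reciprocal rank fusion (RRF), a common technique for merging ranked lists, which is substituted here and should not be read as ByteHouse's actual operator. The function name, the document IDs, and the constant `k=60` (a conventional default for RRF) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a lexical (keyword) search.
vector_hits = ["d3", "d1", "d2"]
lexical_hits = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that vector distances and lexical relevance scores live on different scales.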

The control layer orchestrates global metadata via a catalog manager backed by ByteKV, issues globally ordered timestamps for serializable transactions, and runs background maintenance (compaction, merge). A history‑based optimizer (HBO) reuses runtime statistics from prior executions to refine selectivity, cardinality, and operator cost estimates. Building on HBO, an AI‑assisted planner employs regression and deep‑learning models to learn correlations among query structures, data distributions, and execution behavior, enabling automatic predicate push‑down, join side selection, and plan generalization for unseen queries.
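The idea behind a history-based optimizer can be sketched as a small statistics store keyed by a normalized predicate. Everything below is hypothetical: the `HistoryStats` class, the literal-stripping `signature` function, the exponential smoothing factor, and the fallback selectivity of 0.1 are assumptions for illustration, not details from the paper.

```python
import re

DEFAULT_SELECTIVITY = 0.1  # assumed fallback when no history exists


def signature(predicate: str) -> str:
    """Normalize a predicate by replacing string and numeric literals with '?'."""
    text = re.sub(r"'[^']*'", "?", predicate)
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)
    return " ".join(text.lower().split())


class HistoryStats:
    """Keeps a smoothed selectivity per predicate shape, learned from past runs."""

    def __init__(self, alpha: float = 0.5):
        self._alpha = alpha  # weight given to the newest observation
        self._selectivity = {}

    def record(self, predicate: str, rows_in: int, rows_out: int) -> None:
        """Store the selectivity observed when a predicate actually executed."""
        if rows_in <= 0:
            return
        observed = rows_out / rows_in
        sig = signature(predicate)
        previous = self._selectivity.get(sig)
        if previous is None:
            self._selectivity[sig] = observed
        else:
            self._selectivity[sig] = (
                self._alpha * observed + (1 - self._alpha) * previous
            )

    def estimate_rows(self, predicate: str, rows_in: int) -> float:
        """Estimate output cardinality, falling back to a default guess."""
        sel = self._selectivity.get(signature(predicate), DEFAULT_SELECTIVITY)
        return sel * rows_in


stats = HistoryStats()
stats.record("price > 100 AND region = 'EU'", rows_in=1000, rows_out=20)
estimate = stats.estimate_rows("price > 250 AND region = 'US'", rows_in=5000)
```

Grouping by predicate shape rather than exact text lets a later query with different literals reuse earlier observations, at the cost of ignoring how selectivity varies with the literal values; a learned model, as the summary describes for the AI-assisted planner, is one way to recover that dependence.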

Experimental evaluation includes both internal ByteDance workloads and public benchmarks. On ClickBench, ByteHouse reduces average query latency by more than 25% compared to leading OLAP systems, primarily due to CrossCache and NexusFS eliminating remote metadata lookups. In multimodal workloads such as Cohere and C4, throughput improves by over 50% because the unified table engine and hybrid operators allow simultaneous text, image, and vector processing within a single pipeline. The system powers over 400 services, delivering sub‑100 ms response times for real‑time dashboards, log analysis, video deduplication, and LLM‑driven knowledge‑base retrieval while maintaining cost efficiency.

In summary, ByteHouse presents a holistic redesign of data warehousing for cloud‑native environments, integrating storage‑level self‑describing formats, adaptive compaction, shared SSD caching, a virtual file system, multi‑mode execution, and AI‑enhanced optimization. The combination of these techniques yields significant latency reductions, higher throughput, and better resource utilization than existing analytical engines or vector databases, establishing a new baseline for real‑time multimodal analytics at scale.
