Index and Materialized View Selection in Data Warehouses
The aim of this article is to present an overview of the major families of state-of-the-art index and materialized view selection methods, and to discuss the issues and future trends in data warehouse performance optimization. We particularly focus on data mining-based heuristics we developed to reduce the selection problem complexity and target the most pertinent candidate indexes and materialized views.
Research Summary
The paper provides a comprehensive survey of state-of-the-art techniques for selecting indexes and materialized views (MVs) in data warehouse environments, and it proposes a novel data-mining-driven heuristic that dramatically reduces the combinatorial explosion inherent in the selection problem. The authors begin by categorizing existing approaches into four major families: (1) cost-based methods that rely on detailed I/O, CPU, and memory cost models derived from system statistics and query workloads; (2) rule-based techniques that encode expert knowledge such as "create B-tree indexes on frequently joined columns" or "materialize aggregates at high-level dimensions"; (3) evolutionary and meta-heuristic algorithms (genetic algorithms, simulated annealing, particle swarm optimization) that explore the search space globally but require careful parameter tuning; and (4) data-mining-based strategies that mine query logs for frequent patterns and use those patterns to generate candidate objects. The authors argue that while each family has merits, none simultaneously addresses (a) the need to prune the candidate set to a tractable size, (b) the multi-objective nature of the problem (response time, storage budget, maintenance cost), and (c) the inter-dependencies among indexes and MVs (e.g., redundancy, overlapping coverage).
The core contribution of the paper is a hybrid heuristic that combines association-rule mining (Apriori) with clustering (K-means) to produce a compact, high-impact set of candidate indexes and MVs. The process consists of four steps: (i) parsing a representative query workload to extract all columns appearing in SELECT, WHERE, GROUP BY, and JOIN clauses, as well as the aggregate functions used; (ii) applying Apriori with a user-defined minimum support to discover frequent itemsets of columns that co-occur across queries; (iii) clustering the frequent itemsets to group together patterns with similar access frequencies and cost characteristics, so that a single representative candidate can stand for each cluster; and (iv) evaluating each representative with a multi-objective function that linearly combines (1) estimated reduction in average query response time, (2) storage consumption, and (3) maintenance overhead (e.g., index rebuild frequency). The weights of this function are configurable, enabling administrators to prioritize storage savings or performance gains as needed. After scoring, a final pruning phase removes redundant candidates (e.g., two indexes covering the same column set) and resolves conflicts between indexes and MVs to avoid unnecessary duplication.
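Steps (ii) and (iv) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the column names, the candidate estimates, the weights, and the sign convention of the score (response-time gain rewarded, storage and maintenance penalized) are all assumptions chosen for illustration, and the K-means clustering of step (iii) is omitted for brevity.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori. Returns {frozenset(columns): support}, where
    support is the fraction of queries that reference every column in the set."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single column seen in the workload.
    current = {frozenset([c]) for t in transactions for c in t}
    result = {}
    k = 1
    while current:
        # Count how many queries contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return result

def score(candidate, weights):
    """Linear multi-objective score (assumed sign convention):
    reward response-time gain, penalize storage and maintenance."""
    w_time, w_storage, w_maint = weights
    return (w_time * candidate["time_gain"]
            - w_storage * candidate["storage"]
            - w_maint * candidate["maintenance"])

# Hypothetical workload: one set of referenced columns per query.
workload = [
    {"sales.date_id", "sales.store_id", "sales.amount"},
    {"sales.date_id", "sales.store_id"},
    {"sales.date_id", "sales.item_id"},
    {"sales.store_id", "sales.amount"},
]
itemsets = frequent_itemsets(workload, min_support=0.5)

# Hypothetical, normalized estimates for two candidates built from frequent itemsets.
candidates = [
    {"name": "idx_date_store", "time_gain": 0.30, "storage": 0.10, "maintenance": 0.05},
    {"name": "mv_store_amount", "time_gain": 0.40, "storage": 0.50, "maintenance": 0.30},
]
ranked = sorted(candidates, key=lambda c: score(c, (1.0, 0.5, 0.5)), reverse=True)
```

Raising `min_support` shrinks the candidate set at the risk of missing column combinations used by rare but expensive queries, while the weight tuple passed to `score` is where an administrator would express the trade-off between performance and storage or maintenance cost.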
Experimental validation is performed on a 1 TB TPC-DS-derived data warehouse with a workload of 100 complex OLAP queries. The authors compare four configurations: (a) a traditional cost-based optimizer, (b) a genetic-algorithm-based selector, (c) a pure association-rule selector, and (d) the proposed hybrid heuristic. The results show that the hybrid approach reduces candidate generation time by roughly 30% relative to the cost-based method, cuts the final number of selected objects by about 45%, and achieves an average query response-time improvement of 15% while keeping storage usage below 20% of the total warehouse size. Moreover, the approach exhibits stable performance across different workload mixes, indicating robustness to variations in query patterns.
The discussion section acknowledges several limitations and outlines future research directions. First, the current heuristic assumes a static workload; extending it to handle dynamic, streaming query logs would require incremental mining techniques and online re-optimization. Second, the cost model does not fully capture the economics of cloud-based warehouses where compute and storage are billed separately and can be elastically scaled; integrating a cloud-cost model is an open challenge. Third, the authors suggest exploring reinforcement-learning agents that could learn selection policies directly from performance feedback, potentially surpassing handcrafted objective functions. Finally, they advocate for a unified physical design framework that simultaneously considers indexing, materialization, partitioning, and column-store transformations, as these decisions are tightly coupled in modern analytical platforms.
In conclusion, the paper demonstrates that a data-mining-driven heuristic can effectively shrink the search space for index and MV selection while delivering measurable performance gains. By coupling frequent-pattern mining with clustering and a configurable multi-objective evaluation, the method balances response-time improvement against storage and maintenance constraints. The authors' experimental evidence supports the claim that their approach outperforms traditional cost-based and evolutionary methods in both efficiency and effectiveness, and they provide a clear roadmap for extending the technique to adaptive, cloud-aware, and fully integrated physical design environments.