Hippasus: Effective and Efficient Automatic Feature Augmentation for Machine Learning Tasks on Relational Data
Machine learning models depend critically on feature quality, yet useful features are often scattered across multiple relational tables. Feature augmentation enriches a base table by discovering and integrating features from related tables through join operations. However, scaling this process to complex schemas with many tables and multi-hop paths remains challenging. Feature augmentation must address three core tasks: identify promising join paths that connect the base table to candidate tables, execute these joins to materialize augmented data, and select the most informative features from the results. Existing approaches face a fundamental tradeoff between effectiveness and efficiency: achieving high accuracy requires exploring many candidate paths, but exhaustive exploration is computationally prohibitive. Some methods compromise by considering only immediate neighbors, limiting their effectiveness, while others employ neural models that require expensive training data and introduce scalability limitations. We present Hippasus, a modular framework that achieves both goals through three key contributions. First, we combine lightweight statistical signals with semantic reasoning from Large Language Models to prune unpromising join paths before execution, focusing computational resources on high-quality candidates. Second, we employ optimized multi-way join algorithms and consolidate features from multiple paths, substantially reducing execution time. Third, we integrate LLM-based semantic understanding with statistical measures to select features that are both semantically meaningful and empirically predictive. Our experimental evaluation on publicly available datasets shows that Hippasus improves feature augmentation accuracy by up to 26.8% over state-of-the-art baselines while running substantially faster.
💡 Research Summary
Hippasus is a modular framework designed to automate feature augmentation for machine learning tasks on relational databases. The authors identify three core challenges: (1) discovering promising join paths that connect a base table to candidate tables, (2) efficiently executing those joins to materialize augmented data, and (3) selecting the most informative features from the resulting dataset. Existing solutions either restrict exploration to immediate neighbors—sacrificing predictive power—or employ costly neural models that require extensive training and still suffer from scalability issues.
To address these gaps, Hippasus decomposes the augmentation pipeline into four independent components. First, a Feature Description Generator uses large language models (LLMs) to enrich poorly named columns with natural‑language descriptions, providing semantic context for downstream stages. Second, a Path Explorer combines lightweight statistical signals (e.g., null ratios, correlation scores) with LLM‑derived semantic relevance to prune unpromising join paths before any materialization. This early pruning dramatically reduces the exponential blow‑up of the join graph.
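To make the Path Explorer's idea concrete, the following is a minimal sketch of early join-path pruning. It is not the paper's exact scoring function: the blending weights, the threshold, and the `llm_relevance` stub are all illustrative assumptions standing in for the real statistical signals and LLM prompt described above.

```python
# Hypothetical sketch of early join-path pruning: combine a cheap statistical
# signal (null ratio, key overlap) with an LLM-derived semantic score, and
# discard low-scoring paths before any join is materialized.
from dataclasses import dataclass


@dataclass
class JoinPath:
    tables: list        # table names along the path, base table first
    join_keys: list     # (left_col, right_col) pairs, one per hop
    null_ratio: float   # fraction of base rows that find no join partner
    key_overlap: float  # fraction of join-key values shared across tables


def llm_relevance(path, task_description):
    """Placeholder for an LLM call scoring semantic relevance in [0, 1].

    A real system would prompt an LLM with the generated column
    descriptions; here we return a neutral constant for illustration.
    """
    return 0.5


def score_path(path, task_description, w_stat=0.5, w_sem=0.5):
    # Statistical signal: paths with low null ratios and high key overlap
    # are more likely to contribute dense, joinable features.
    stat = (1.0 - path.null_ratio) * path.key_overlap
    sem = llm_relevance(path, task_description)
    return w_stat * stat + w_sem * sem


def prune_paths(paths, task_description, keep_top=10, threshold=0.3):
    # Rank all candidate paths, keep the best few above a quality floor.
    scored = [(score_path(p, task_description), p) for p in paths]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for s, p in scored[:keep_top] if s >= threshold]
```

Because this scoring runs on schema statistics alone, it avoids materializing any join, which is exactly what lets the pruning happen before the exponential blow-up of the join graph.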
Third, a Join Executor implements a multi‑way join algorithm with left‑join semantics, adapted from Yannakakis (2020), allowing several tables to be joined in a single pass. It also introduces a Consolidation step that resolves competing versions of the same external attribute (originating from different paths) by selecting the variant with the highest information gain while preserving the integrity of the base table.
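The Consolidation step described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the column-naming convention (one variant column per join path), the `__missing__` placeholder, and the discrete information-gain computation are assumptions made for the example.

```python
# Illustrative Consolidation step: when the same external attribute arrives
# via several join paths, keep only the variant with the highest information
# gain with respect to the prediction target.
import numpy as np
import pandas as pd


def entropy(series):
    # Shannon entropy of a discrete pandas Series, in bits.
    probs = series.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())


def information_gain(feature, target):
    # H(target) - H(target | feature), treating both columns as discrete.
    total = entropy(target)
    cond = 0.0
    for _, group in target.groupby(feature):
        cond += (len(group) / len(target)) * entropy(group)
    return total - cond


def consolidate(df, target_col, variant_groups):
    """variant_groups maps an attribute name to the list of column variants
    (one per join path) competing for that name; keep the best variant."""
    for attr, variants in variant_groups.items():
        gains = {c: information_gain(df[c].fillna("__missing__"),
                                     df[target_col])
                 for c in variants}
        best = max(gains, key=gains.get)
        df = df.drop(columns=[c for c in variants if c != best])
        df = df.rename(columns={best: attr})
    return df
```

Because every variant is compared against the same target before any column is dropped, the base table's rows are never filtered, which mirrors the left-join semantics of the executor.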
Finally, a Feature Selector blends LLM‑based semantic reasoning with traditional statistical measures (mutual information, tree‑based feature importance) to rank and filter the augmented features. The hybrid approach ensures that selected features are both semantically meaningful for the prediction task and empirically predictive on the data.
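A minimal sketch of such a hybrid selector is shown below. The blending weights and the optional `semantic_scores` input (standing in for LLM-derived relevance scores) are illustrative assumptions; the paper's exact ranking scheme may differ.

```python
# Hedged sketch of hybrid feature selection: normalize mutual information
# and tree-based importances, blend them with an (optional) semantic score,
# and return the top-k feature names.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif


def select_features(X, y, feature_names, semantic_scores=None,
                    w_mi=0.4, w_tree=0.4, w_sem=0.2, top_k=5):
    # Statistical signal 1: mutual information between each feature and y.
    mi = mutual_info_classif(X, y, random_state=0)

    # Statistical signal 2: impurity-based importances from a random forest.
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    forest.fit(X, y)
    imp = forest.feature_importances_

    def norm(v):
        v = np.asarray(v, dtype=float)
        return v / v.max() if v.max() > 0 else v

    # Semantic signal: e.g. LLM relevance scores; zeros if unavailable.
    sem = (np.asarray(semantic_scores) if semantic_scores is not None
           else np.zeros(len(feature_names)))

    score = w_mi * norm(mi) + w_tree * norm(imp) + w_sem * norm(sem)
    ranked = sorted(zip(feature_names, score),
                    key=lambda t: t[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

Normalizing each signal before blending keeps one measure from dominating simply because it lives on a larger scale, which is the usual pitfall when mixing mutual information with impurity-based importances.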
The authors evaluate Hippasus on seven publicly available datasets covering classification and regression tasks, comparing against state‑of‑the‑art baselines such as ARD‑A, AutoFeat, FeatPilot, and Metam. Hippasus achieves up to a 26.8% absolute improvement in downstream model accuracy (average gain of 12.4%) while reducing total runtime by a factor of 2–3. The early path pruning and multi‑way join execution account for most of the speedup, and batching and caching of LLM calls keep the additional language‑model overhead below 30% of total cost.
Key insights include: (i) decoupling path exploration from join execution enables aggressive pruning without sacrificing feature quality; (ii) LLMs provide valuable semantic signals that complement statistical heuristics, uncovering useful multi‑hop features that purely data‑driven methods miss; (iii) a consolidation phase is essential to avoid feature duplication and maintain a clean, dense augmented table.
In summary, Hippasus demonstrates that integrating LLM‑driven semantic understanding with efficient join processing yields a practical, scalable solution for automatic feature augmentation on relational data, bridging the long‑standing trade‑off between effectiveness and efficiency. Future work will explore prompt optimization, cost‑effective fine‑tuning of LLMs, and extensions to streaming or federated data environments.