Robustness of AutoML on Dirty Categorical Data
The goal of automated machine learning (AutoML) is to reduce trial and error when doing machine learning (ML). Although AutoML methods for classification are able to deal with data imperfections such as outliers, multiple scales, and missing data, their behavior on dirty categorical datasets is less well understood. These datasets often contain several categorical features with high cardinality, arising from issues such as lack of curation and automated collection. Recent research has shown that ML models can benefit from morphological encoders for dirty categorical data, leading to significantly superior predictive performance. However, the effects of using such encoders within AutoML methods are currently unknown. In this paper, we propose a pipeline that transforms categorical data into numerical data, so that an AutoML system can handle categorical data transformed by more advanced encoding schemes. We benchmark the current robustness of AutoML methods on a set of dirty datasets and compare it with the proposed pipeline, which gives us insight into differences in predictive performance. We also inspect the ML pipelines built by AutoML systems, in order to gain insight beyond the single best model typically returned by these methods.
💡 Research Summary
The paper investigates how current Automated Machine Learning (AutoML) systems handle “dirty” categorical data—datasets characterized by high cardinality, typographical errors, inconsistent abbreviations, and heterogeneous missing‑value representations. While AutoML tools such as auto‑sklearn, TPOT, GAMA, and H2O typically rely on simple encodings (one‑hot, ordinal, or basic target encoding), these approaches often lead to excessive dimensionality, sparsity, and loss of semantic information when applied to dirty categorical features. Recent advances in morphological encoders (often called “dirty‑cat” encoders) address these issues by exploiting the internal string structure of categories: similarity encoding computes pairwise Jaccard similarity on n‑grams, Min‑Hash approximates similarity via hash functions, and Gamma‑Poisson (GAP) learns latent topics from sub‑string co‑occurrences. These methods are unsupervised, scalable, and produce dense numeric representations that better capture relationships among high‑cardinality levels.
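The n-gram similarity idea behind these encoders can be illustrated with a minimal, dependency-free sketch. This is not the dirty_cat implementation itself — the function names and padding choice here are illustrative — but it shows how a misspelled category still lands numerically close to its clean counterpart:

```python
def ngrams(s, n=3):
    """Character n-grams of a string, with whitespace padding so that
    word boundaries contribute grams too."""
    s = f"  {s}  "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity_encode(values, vocabulary, n=3):
    """Encode each value as a dense vector of its similarities to every
    category in the vocabulary (one column per vocabulary entry)."""
    vocab_grams = [ngrams(v, n) for v in vocabulary]
    return [[jaccard(ngrams(v, n), g) for g in vocab_grams] for v in values]

# Misspelled entries still score high against their clean counterparts,
# unlike one-hot encoding, which would treat them as unrelated categories.
vocab = ["accountant", "engineer", "teacher"]
rows = ["acountant", "enginer", "teacher"]
encoded = similarity_encode(rows, vocab)
```

An exact match encodes to 1.0 in its own column, while a typo such as "acountant" keeps most of its trigram overlap with "accountant" and almost none with the other categories.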
The authors propose a preprocessing pipeline that (1) automatically infers column types to identify categorical features, (2) applies a chosen morphological encoder to each categorical column, and (3) feeds the transformed dataset into an existing AutoML system. The pipeline preserves the number of rows while potentially increasing the number of columns, depending on the encoder. By converting dirty categorical data into richer numeric features before AutoML’s own search, the pipeline aims to improve robustness and predictive performance without altering the AutoML’s internal mechanisms.
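The three stages might be wired together roughly as follows. This is a sketch under stated assumptions, not the authors' code: `preprocess`, `is_numeric_column`, and the missing-value markers are hypothetical, the table is a plain dict of columns, and the encoder is passed in as a function (a placeholder stands in for a real morphological encoder):

```python
def is_numeric_column(values, missing=(None, "", "NA")):
    """Stage 1 (crude type inference): a column is numeric if every
    non-missing value parses as a float."""
    def parses(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False
    non_missing = [v for v in values if v not in missing]
    return bool(non_missing) and all(parses(v) for v in non_missing)

def preprocess(table, encode, missing=(None, "", "NA")):
    """Stages 2-3: encode each categorical column into numeric columns,
    pass numeric columns through, and hand the result to an AutoML system.
    `table` maps column names to lists of values; `encode` maps a list of
    strings to equal-length numeric vectors (e.g. a similarity encoder).
    Row count is preserved; column count may grow, as noted above."""
    out = {}
    for name, col in table.items():
        if is_numeric_column(col, missing):
            out[name] = [float(v) if v not in missing else None for v in col]
        else:
            vectors = encode([str(v) for v in col])
            width = len(vectors[0]) if vectors else 0
            for j in range(width):
                out[f"{name}_{j}"] = [vec[j] for vec in vectors]
    return out

# Placeholder encoder for demonstration only (string length as a 1-D feature);
# a real pipeline would plug in a morphological encoder here.
table = {"amount": ["10", "25", "NA"], "provider": ["St. Mary", "st mary hosp", "Mercy"]}
dummy = lambda vals: [[float(len(v))] for v in vals]
wide = preprocess(table, dummy)
```

The AutoML system then receives `wide` as an ordinary numeric dataset, so its internal search logic needs no modification.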
Experimental evaluation focuses on the GAMA AutoML framework, using a hold‑out split (75 % training, 25 % validation) and a one‑hour time budget per dataset. The authors benchmark several real‑world dirty datasets from domains such as medical payments, traffic violations, and unfinished drug listings—each exhibiting high‑cardinality categorical fields and various forms of noise. Results show that the baseline GAMA (using its default encodings) achieves an average accuracy of roughly 68 %, whereas the same AutoML run on data preprocessed by the proposed pipeline reaches about 74 % accuracy, a gain of approximately 6 percentage points. Moreover, analysis of the pipelines generated by GAMA reveals a shift toward tree‑based learners (Random Forest, XGBoost) after morphological encoding, as well as more frequent use of dimensionality‑reduction and missing‑value‑imputation steps. This suggests that richer numeric representations enable AutoML to explore more effective model families and preprocessing strategies.
Key insights include: (i) specialized morphological encoders substantially expand the search space of AutoML, leading to better model selection and hyper‑parameter tuning; (ii) because the encoders are unsupervised, they can be applied even when target labels are unavailable, making the approach broadly applicable; (iii) automatic type inference is crucial for handling mixed‑type columns that traditional libraries (e.g., pandas) may misclassify.
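Point (iii) can be made concrete with a small sketch (illustrative only, not the paper's inference routine): an all-or-nothing parse would flag any column containing a stray string as categorical, whereas a tolerant heuristic that first strips common missing-value markers and then requires only a majority of values to parse keeps mostly-numeric columns numeric:

```python
def infer_type(values, missing_markers=("", "NA", "N/A", "?", "none"),
               threshold=0.9):
    """Classify a column as 'numeric' or 'categorical'. The marker set and
    threshold are illustrative assumptions, not values from the paper."""
    markers = {m.lower() for m in missing_markers}
    cleaned = [v for v in values if str(v).strip().lower() not in markers]
    if not cleaned:
        return "categorical"  # nothing but missing markers: no evidence either way
    numeric = 0
    for v in cleaned:
        try:
            float(v)
            numeric += 1
        except (TypeError, ValueError):
            pass
    # A few junk strings should not flip a numeric column to categorical.
    return "numeric" if numeric / len(cleaned) >= threshold else "categorical"
```

A column like `["1.5", "2", "N/A", "3"]` is correctly treated as numeric with a missing value, whereas a naive dtype check (as in a default pandas CSV read) would load it as a string column.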
The paper concludes by outlining future work: extending the evaluation to other AutoML platforms (auto‑sklearn, H2O, AutoGluon) to assess generality; jointly optimizing encoder choice and dimensionality‑reduction techniques via meta‑learning; and investigating scalability in streaming or online learning contexts where dirty categorical data arrive continuously. Overall, the study demonstrates that integrating morphological categorical encoders into the preprocessing stage markedly improves the robustness and performance of AutoML systems on real‑world, imperfect tabular data.