LLMDR: Large language model driven framework for missing data recovery in mixed data under low resource regime

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Missing data is one of the central obstacles to achieving data quality. While imputation-based methods aim to restore data completeness, their efficacy diminishes as the missingness percentage increases. Moreover, existing approaches often struggle with mixed-type datasets, typically supporting either numerical or categorical data but not both. In this work, we propose LLMDR, an automatic data-recovery framework that operates in two stages: in Stage I, the DBSCAN clustering algorithm selects the most representative samples; in Stage II, multiple LLMs perform data recovery using the local and global representative samples. The framework then invokes a consensus algorithm to recommend a more accurate value based on the candidates the LLMs produce over the local and global effective samples. Experimental results demonstrate that the proposed framework performs effectively on various mixed datasets in terms of Accuracy, KS-Statistic, SMAPE, and MSE. We further show the advantage of the consensus mechanism for the final recommendation on mixed-type data.


💡 Research Summary

The paper addresses the pervasive problem of missing values in mixed‑type tabular data (numerical, categorical, and textual fields) and proposes a novel two‑stage framework called LLMDR (Large Language Model‑Driven Recovery) that is designed to operate under low‑resource constraints. In the first stage, the authors aim to reduce the search space by extracting a compact set of representative samples. The raw dataset is first normalized and one‑hot encoded, after which DBSCAN clustering is applied using the Gower distance metric, which treats quantitative and qualitative variables with equal weight. The centroids of the resulting clusters are taken as “Local Effective Samples” (LES). To capture local variability, the t nearest neighbours of each centroid are added, forming the “Global Effective Samples” (GES). This clustering‑based reduction is intended to limit the amount of data that must be processed by downstream language models, thereby saving computation and memory.
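The Stage‑I reduction described above can be sketched with NumPy and scikit‑learn. The equal‑weight Gower distance, the DBSCAN hyper‑parameters (eps, min_samples), the use of cluster medoids as "centroids", and the value of t are illustrative assumptions, since the paper does not fix them:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def gower_matrix(num, cat):
    """Pairwise Gower distances for a mixed dataset.
    num: (n, p) numeric array; cat: (n, q) categorical array."""
    rng = num.max(axis=0) - num.min(axis=0)
    rng[rng == 0] = 1.0
    # Range-normalised absolute differences for numeric columns.
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / rng          # (n, n, p)
    # Simple 0/1 mismatch for categorical columns.
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)       # (n, n, q)
    # Equal weight across all features, as Gower prescribes.
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

# Toy mixed data: two tight groups plus one outlier.
num = np.array([[1.0], [1.1], [9.0], [9.2], [5.0]])
cat = np.array([["a"], ["a"], ["b"], ["b"], ["c"]])

D = gower_matrix(num, cat)
labels = DBSCAN(eps=0.15, min_samples=2, metric="precomputed").fit_predict(D)

# "Local Effective Samples" (LES): one medoid per cluster (noise = -1 is skipped).
les = []
for k in set(labels) - {-1}:
    idx = np.where(labels == k)[0]
    les.append(idx[D[np.ix_(idx, idx)].sum(axis=1).argmin()])

# "Global Effective Samples" (GES): each medoid plus its t nearest neighbours.
t = 1
ges = sorted({j for m in les for j in np.argsort(D[m])[: t + 1]})
```

On this toy data the two tight groups become clusters, the middle point is flagged as noise, and the GES set contains the medoids together with one neighbour each.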

In the second stage, the LES and GES are each indexed in a Retrieval‑Augmented Generation (RAG) system. Two language models are deployed in parallel: a “local LLM” that queries only the LES index, and a “global LLM” that queries the GES index. When a record with a missing field is presented, both models retrieve the most relevant samples from their respective indices, construct prompts that embed these retrieved contexts, and generate candidate imputed values. The two candidate values are then fed into a consensus algorithm (the paper references a prior work by Sairam et al.) which resolves discrepancies and outputs the final imputed value. By avoiding a full‑table scan and limiting token consumption, the authors claim that the approach is suitable for environments with limited computational resources.
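Because the paper describes the consensus step only at a high level, the rule below is a purely hypothetical sketch of how the local and global candidates might be reconciled: agreement wins outright, numeric disagreements are averaged, and otherwise the global LLM's candidate is preferred on the grounds that its GES context covers more of the dataset:

```python
def consensus(local_val, global_val, numeric=False):
    """Illustrative consensus rule; the paper's exact logic is not disclosed."""
    if local_val == global_val:
        return local_val            # both LLMs agree: accept the shared answer
    if numeric:
        # Split the difference for numeric fields.
        return (float(local_val) + float(global_val)) / 2.0
    # For categorical/text fields, fall back to the global LLM's candidate.
    return global_val

# Two LLM candidates for missing fields of a product record:
maker = consensus("Samsung", "Samsung")          # agreement -> "Samsung"
price = consensus(199.0, 205.0, numeric=True)    # averaged -> 202.0
style = consensus("Italian", "Mexican")          # global wins -> "Mexican"
```

A real implementation would likely also weight candidates by retrieval similarity or model confidence, which this sketch omits.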

The experimental evaluation uses three publicly available datasets that reflect real‑world mixed‑type scenarios: (1) a "Buy" e‑commerce product catalog containing name, description, and manufacturer; (2) a "Phone" dataset with categorical labels and numeric attributes such as price, rating, and review count; and (3) a "Restaurant" directory with fields like address, city, phone, and cuisine type. For each dataset, missing values are introduced under a Missing‑At‑Random (MAR) mechanism at rates of 10 %, 20 %, and 30 %. Performance is measured with four metrics: (i) Accuracy (exact match of the imputed value), (ii) KS‑Statistic complement (distribution similarity between original and imputed columns), (iii) SMAPE (symmetric mean absolute percentage error), and (iv) MSE (mean squared error).
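The four metrics can be computed with NumPy from their standard definitions; the halved‑denominator SMAPE convention and the empirical two‑sample KS statistic below are assumptions about the paper's exact formulas:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(100.0 * np.mean(np.abs(y_pred - y_true) / denom))

def evaluate(original, imputed, categorical=False):
    """Score one imputed column against its ground-truth values."""
    if categorical:
        # Accuracy: exact match of the imputed value.
        return {"accuracy": float(np.mean(np.array(original) == np.array(imputed)))}
    original = np.asarray(original, float)
    imputed = np.asarray(imputed, float)
    return {
        "ks_complement": 1.0 - ks_statistic(original, imputed),  # 1 = identical distributions
        "smape": smape(original, imputed),
        "mse": float(np.mean((original - imputed) ** 2)),
    }
```

A perfectly recovered numeric column scores MSE = 0, SMAPE = 0, and KS complement = 1, which is the ceiling the tables measure against.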

Results on the Buy dataset (Table 1) show that the manufacturer field retains relatively high accuracy (≈55 % at 10 % missing, decreasing to ≈42 % at 30 % missing) and low error, while the name field achieves 0 % accuracy across all missing rates, indicating that the model cannot recover unique identifiers. The description field exhibits increasing SMAPE and MSE as missingness grows, suggesting sensitivity to the amount of missing data. The authors note that the global LLM generally provides more stable predictions than the local LLM, and that the consensus step yields modest improvements for some attributes.

Despite the innovative combination of clustering, RAG, and multi‑LLM consensus, the paper has several notable limitations. First, there is no quantitative comparison against established imputation baselines such as mean/mode substitution, K‑Nearest Neighbours, MICE, auto‑encoders, or recent graph‑neural‑network approaches, making it difficult to assess the true advantage of LLMDR. Second, the choice of DBSCAN hyper‑parameters (ε, minPts) and the number of neighbours t is not systematically explored; these settings can dramatically affect the composition of LES/GES and thus the downstream imputation quality. Third, the computational savings claimed for low‑resource settings are not substantiated with concrete timing, memory, or token‑usage measurements. Fourth, the consensus algorithm is described only at a high level, without details on weighting, voting rules, or confidence estimation, which hampers reproducibility. Finally, the paper does not specify which large language model(s) are used (e.g., GPT‑3.5, LLaMA, Claude), whether any fine‑tuning or prompt engineering was performed, or how prompts were constructed for mixed‑type fields.

In summary, LLMDR presents a creative pipeline that leverages density‑based clustering to condense a mixed‑type dataset, then applies dual LLMs with retrieval‑augmented generation and a consensus mechanism to impute missing values. The approach is promising for scenarios where computational resources are limited and where data contain a blend of textual and structured information. However, to establish its practical relevance, future work should (a) benchmark against strong traditional and deep‑learning imputation methods, (b) provide automated or adaptive strategies for clustering hyper‑parameters, (c) report detailed resource consumption profiles, (d) disclose the exact consensus logic and LLM configurations, and (e) explore scalability to larger, higher‑dimensional datasets. Addressing these gaps would strengthen the claim that LLMDR can become a reliable, low‑cost solution for mixed‑type missing‑data recovery in real‑world applications.

