Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) have significantly automated data science workflows, but a fundamental question persists: can these agentic AI systems truly match the performance of human data scientists, who routinely leverage domain-specific knowledge? We explore this question by designing a prediction task in which a crucial latent variable is hidden in relevant image data rather than in the tabular features. As a result, agentic AI that generates generic code for modeling tabular data cannot perform well, while human experts can identify the important hidden variable using domain knowledge. We demonstrate this idea with a synthetic dataset for property insurance. Our experiments show that agentic AI relying on a generic analytics workflow falls short of methods that use domain-specific insights. This highlights a key limitation of current agentic AI for data science and underscores the need for future research to develop agentic AI systems that can better recognize and incorporate domain knowledge.
Data science is a central interdisciplinary field that blends statistics, computer science, and domain expertise to extract actionable insights from complex, heterogeneous data [1], [2]. By transforming raw information into knowledge and value, data science drives innovation and shapes decision-making in science, industry, healthcare, finance, and beyond [3], [4].
Recent advancements in large language models (LLMs) have significantly accelerated the automation of data science workflows. LLMs such as GPT-4 [5] and Claude [6] have demonstrated impressive capabilities in automating code generation and executing routine machine learning tasks [7]–[10]. These developments offer promising potential for streamlining common analytical processes and reducing the manual workload of human data scientists.
Despite these advancements, there is still a lack of understanding of whether agentic AI truly performs as well as human data scientists. In practice, human data scientists consistently rely on specialized knowledge about the data or task and incorporate crucial nuances that enhance model performance [11]–[15]. Such domain-driven decisions are often subtle yet essential, as they address complexities not captured by typical analytics workflows. However, current research on LLM-driven data science has largely focused on generating generic code and pipeline executions [7], [10]. These approaches often neglect the domain-specific knowledge needed for complex, real-world problems. Meanwhile, existing evaluation benchmarks such as MLE-bench [16] and DSBench [17] aim to assess predictive performance, but do not test whether agentic AI can effectively leverage domain insights outside tabular data. The above observations motivate a fundamental question: can agentic AI, which typically relies on generic code generation, truly match the performance of human data scientists who can apply domain knowledge?

(Footnote: This paper is based upon work supported by the Cisco Research gift fund and the National Science Foundation under CAREER Grant No. 2338506.)

(Figure 1, lower panel: the agentic AI applies standard tabular modeling while ignoring the image modality and domain-specific cues, resulting in much worse performance, normalized Gini = 0.3823. This illustrates the empirical gap between human data scientists and agentic AI when domain knowledge is necessary.)
To address this question, our paper presents an experimental study on a carefully curated dataset that mimics the complexity of real-world data science problems. The use of synthetic data allows us to control key latent variables that introduce complexity behind observed feature variables. This controlled setup reveals important differences: human data scientists are often able to explicitly identify and leverage domain-specific cues, whereas agentic AI may rely on generic algorithms that do not fully capture the influence of latent factors. Figure 1 illustrates this idea in our property insurance setup.
Our main contributions are summarized below.
• We design a synthetic dataset that clearly illustrates a fundamental gap between agentic AI (which generates generic, tabular-focused code) and human data scientists (who leverage domain knowledge embedded in images).
• We empirically quantify this performance gap and demonstrate the importance of domain knowledge in achieving excellent prediction performance.
Through this work, we aim to highlight the need for future research to improve agentic AI's ability to identify and use domain-specific knowledge from multimodal data sources.
To evaluate the limits of both human data scientists and agentic AI, we generate a controlled synthetic dataset with a hidden latent factor that affects the prediction target but is not present in the tabular features. Instead, this latent variable is embedded in a secondary modality (here, overhead images) to ensure it can be accessed only through domain knowledge and not by generic code. While the approach could generalize to other modalities such as text or audio, we focus on images to illustrate the mechanism. These images are crafted so that a knowledgeable data scientist can interpret the latent variable in the context of property insurance, making the challenge meaningful and realistic.
To accomplish this, we use a text-to-image model with engineered prompts to ensure that the generated images faithfully reflect the intended values of the latent variable. This design allows us to examine the gap between generic AI pipelines and human data scientists who actively look for ways to incorporate domain knowledge.
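As an illustration of such prompt engineering, the templates below are hypothetical (the paper's actual prompts are not shown); they only convey the mechanism of mapping each latent RoofHealth level to a distinct, visually verifiable description, parameterized by the roof style and shingle color used to randomize each image.

```python
# Hypothetical prompt templates for the text-to-image model. Each latent
# RoofHealth level maps to a distinct, visually checkable roof description.
ROOF_PROMPTS = {
    "Good": "overhead aerial photo of a house with a new, intact {style} roof, "
            "uniform {color} shingles, no visible damage",
    "Fair": "overhead aerial photo of a house with a moderately worn {style} roof, "
            "faded {color} shingles, scattered patches and minor stains",
    "Bad":  "overhead aerial photo of a house with a badly damaged {style} roof, "
            "missing {color} shingles, visible holes and heavy discoloration",
}

def build_prompt(roof_health: str, style: str, color: str) -> str:
    """Fill in a template so the image faithfully encodes the latent variable."""
    return ROOF_PROMPTS[roof_health].format(style=style, color=color)
```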
The data science task specifies a policy table as the tabular dataset, and each policy is associated with an image. The goal is to predict each home's total insured loss in the next policy year, $Y_p$. The key latent variable is RoofHealth, a three-level variable (Good, Fair, or Bad). This variable is never shown in the policy table but can be inferred from the image. Below we present how we create this synthetic dataset.
For each policy $p = 1, \ldots, n$ we draw:
1) PolicyID: "POL-000001", ...
2) HouseValue: $X_{\mathrm{val},p} \sim \mathrm{LogNormal}(12.9, 0.45)$ (median $\approx$ \$403k).
3) HouseAge: $X_{\mathrm{age},p} \sim 120\,\mathrm{Beta}(4, 3)$ ($\approx$ 40 yr median).
4) WallType: $X_{\mathrm{wall},p} \in \{\text{Wood}, \text{Brick}\}$ drawn with probabilities $\{0.9, 0.1\}$.
5) AreaRisk: $X_{\mathrm{risk},p} \sim \mathrm{Beta}(2, 5)$ (0–1 storm exposure).
6) CreditScore: $X_{\mathrm{cred},p}$ drawn from the US FICO distribution (300–850).
7) RoofHealth (latent): compute
$$S_p = 0.02\, X_{\mathrm{age},p} + 3.0\, X_{\mathrm{risk},p} - 2.0\, X_{\mathrm{cred},p}/850 + \varepsilon_p, \qquad \varepsilon_p \sim N(0, 1),$$
then assign Good, Fair, or Bad by partitioning $S_p$ at the 55th and 80th percentiles of all scores.
Only columns 1–6 are released as features in the policy table; a sketch of this generation process is given below.
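For concreteness, a minimal NumPy sketch of this feature and latent-variable generation follows. The FICO sampler is a clipped-normal stand-in (an assumption), since the exact FICO distribution used is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000  # 1,000 training + 1,000 held-out policies (Section III)

# Tabular features, following the distributions specified above.
house_value = rng.lognormal(mean=12.9, sigma=0.45, size=n)   # median ~ $403k
house_age = 120 * rng.beta(4, 3, size=n)                     # ~40 yr median
wall_type = rng.choice(["Wood", "Brick"], size=n, p=[0.9, 0.1])
area_risk = rng.beta(2, 5, size=n)                           # 0-1 storm exposure
# The paper samples CreditScore from the US FICO distribution; a clipped
# normal is used here only as a stand-in (assumption).
credit_score = np.clip(rng.normal(715, 80, size=n), 300, 850)

# Latent RoofHealth: linear score plus noise, cut at the 55th/80th percentiles.
s = (0.02 * house_age + 3.0 * area_risk
     - 2.0 * credit_score / 850 + rng.normal(0, 1, size=n))
cuts = np.percentile(s, [55, 80])
roof_health = np.select([s <= cuts[0], s <= cuts[1]], ["Good", "Fair"],
                        default="Bad")
```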
Step c) Total loss: the next-year loss is the sum of the individual claim losses,
$$Y_p = \sum_{k=1}^{N_p} L_{p,k},$$
where $N_p$ is the simulated claim count and $L_{p,1}, \ldots, L_{p,N_p}$ are the simulated claim losses for policy $p$.
The training file exposes NextYearLoss $= Y_p$; the test file omits it for evaluation. See Figure 3 for an illustration of how the outcome variable, next-year loss, is generated. The construction of our synthetic property insurance dataset is grounded in established actuarial practice and empirical research. Roof condition is an important factor in property risk and claims, but it is often not directly available in tabular data [18], [19]. Our use of roof images is intended to reflect this real-world limitation. The target outcome, next-year loss, is generated using a compound frequency-severity model, a standard actuarial method for property insurance loss modeling [20], [21]. Together, these design choices ensure that our dataset realistically captures the complexities of property insurance prediction.
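Since the exact frequency and severity links (Equations 1 and 2) are not reproduced in this text, the coefficients and distributional choices below are illustrative assumptions. The sketch shows only the compound frequency-severity structure: a Poisson claim count is drawn and individual claim losses are summed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frequency/severity loadings per roof state. The paper's Equations 1-2 define
# lambda_p and mu_p exactly; these coefficients are illustrative assumptions.
ROOF_FREQ = {"Good": 0.0, "Fair": 0.4, "Bad": 0.9}
ROOF_SEV = {"Good": 0.0, "Fair": 0.3, "Bad": 0.7}

def simulate_next_year_loss(area_risk, house_value, roof_health):
    """Compound draw of Y_p = sum_{k=1}^{N_p} L_{p,k}."""
    # Expected claim count lambda_p (illustrative stand-in for Eq. 1).
    lam = np.exp(-1.5 + 1.2 * area_risk + ROOF_FREQ[roof_health])
    n_claims = rng.poisson(lam)
    if n_claims == 0:
        return 0.0
    # Expected claim severity mu_p (illustrative stand-in for Eq. 2).
    mu = 0.02 * house_value * np.exp(ROOF_SEV[roof_health])
    # Gamma severities with mean mu; shape 2.0 is an arbitrary choice.
    return float(rng.gamma(shape=2.0, scale=mu / 2.0, size=n_claims).sum())
```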
Our goal is to evaluate how the use of domain knowledge impacts predictive performance when crucial information is embedded in the roof images. To do this, we compare three groups of modeling strategies. The first group simulates a generic agentic AI approach that uses only tabular data. The second group represents methods a human data scientist might use, combining tabular data with different ways of extracting information from roof images. The final group is an oracle model that has access to the true latent variables and the underlying data generation process, serving as the best achievable benchmark. By comparing these approaches, we can quantify the value of domain knowledge and highlight the limitations of generic AI workflows.
Below, we outline the specific modeling strategies in each group and how they reflect the use of domain knowledge.

(Figure caption residue, data-generation overview: (2) A latent variable, RoofHealth (Good, Fair, Bad), is determined by a function of selected features but is not included in the released tabular data. Instead, it is visually encoded in an accompanying roof image, which is generated for each policy using a random combination of roof style and shingle color. (3) Claim count and claim loss are simulated using both policy features and the latent RoofHealth. The total insured loss for the next year ($Y_p$) is calculated as the sum of all claim losses. To achieve optimal prediction, the hidden RoofHealth must be inferred from the roof image.)
1) Agentic AI (Generic pipeline). Only the tabular features are used; no image data is included. This matches how agentic AI would typically apply generic code to a standard tabular prediction problem.
2) Data Scientists (Image use). Both tabular features and image data are utilized. We consider several practical ways a human data scientist might incorporate image information. One approach is to extract features from the images using a pretrained CLIP model [22] and either use these features directly or cluster them into categories for use in the predictive model (see the sketch after this list). Another approach is to apply a vision-language model (gpt-4o-mini) to extract the RoofHealth label from the images. Finally, we include an ideal scenario where the data scientist perfectly labels the true RoofHealth for each image; this represents the best possible use of domain expertise.
3) Oracle (Best achievable). This method uses the exact data-generation formulas and the true RoofHealth.
It calculates the predicted loss as the exact product of expected claim counts and severities, i.e.,
$$\hat{Y}_p = \lambda_p\, \mu_p,$$
where $\lambda_p$ and $\mu_p$ are computed from Equations 1 and 2, respectively. This gives the Bayes-optimal expected loss, and any remaining error reflects only inherent randomness in claims.
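As an illustration of the second group's strategies, here is a minimal sketch of the CLIP-based pipelines, assuming the Hugging Face openai/clip-vit-base-patch32 checkpoint and a gradient-boosting learner (the paper specifies neither); train_image_paths, X_tabular, and y_train are hypothetical placeholders for the image files, tabular feature matrix, and NextYearLoss targets.

```python
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint: the paper only says "a pretrained CLIP model" [22].
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(image_paths):
    """Return one CLIP image embedding per roof image."""
    feats = []
    for path in image_paths:
        inputs = processor(images=Image.open(path), return_tensors="pt")
        with torch.no_grad():
            feats.append(clip.get_image_features(**inputs).squeeze(0).numpy())
    return np.vstack(feats)

emb = clip_embed(train_image_paths)  # hypothetical list of image files

# Strategy A: cluster the embeddings into a 3-level categorical proxy.
cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
model_a = GradientBoostingRegressor().fit(
    np.column_stack([X_tabular, cluster]), y_train)

# Strategy B: feed the full embeddings alongside the tabular features.
model_b = GradientBoostingRegressor().fit(
    np.column_stack([X_tabular, emb]), y_train)
```

Note the design contrast: clustering compresses each embedding into one of three labels, which is consistent with its weaker alignment with the true roof health (correlation 0.40 in Table II), whereas feeding the full embeddings preserves more of the visual signal.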
All experiments use the synthetic data with 2,000 policies generated as described in Section II. Among them, 1,000 policies are used for training and the other 1,000 policies are held out for evaluation. Each policy carries six tabular features and a 1024 × 1024 overhead roof image.
We measure predictive performance using the normalized Gini coefficient, a standard and widely used metric for evaluating predictive models in the insurance domain [23]–[26]. It is a rank-based metric that captures how well predicted scores prioritize higher-value observations, and it is appropriate for loss outcomes, which are often heavy-tailed.
Let $\{(y_i, \hat{y}_i)\}_{i=1}^{n}$ be the true responses and model predictions. Sort the pairs by descending predicted value, yielding $(y_{(1)}, \hat{y}_{(1)}), \ldots, (y_{(n)}, \hat{y}_{(n)})$. Define the cumulative true sum
$$C_k = \sum_{i=1}^{k} y_{(i)}, \qquad k = 1, \ldots, n.$$
The raw Gini coefficient is then
$$G = \frac{1}{n} \sum_{k=1}^{n} \frac{C_k}{C_n} - \frac{n+1}{2n}.$$
To make this metric lie in $[-1, 1]$ and comparable across datasets, it is normalized by the "perfect" Gini achieved when $\hat{y}_i = y_i$:
$$G_{\mathrm{norm}} = \frac{G}{G_{\mathrm{perfect}}}.$$
$G_{\mathrm{norm}} = 1$ indicates a perfect ranking, $G_{\mathrm{norm}} = 0$ corresponds to a random ordering, and $G_{\mathrm{norm}} < 0$ means predictions are worse than random. A higher normalized Gini signals better model performance.
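For concreteness, a minimal NumPy implementation of the metric as defined above:

```python
import numpy as np

def gini(y_true, y_pred):
    """Raw Gini: sort by descending prediction, compare cumulative true sums."""
    order = np.argsort(y_pred)[::-1]          # descending predicted value
    y = np.asarray(y_true, dtype=float)[order]
    c = np.cumsum(y)                          # C_k, with c[-1] = C_n
    n = len(y)
    return c.sum() / (c[-1] * n) - (n + 1) / (2 * n)

def normalized_gini(y_true, y_pred):
    """G_norm = G(y, y_hat) / G(y, y): 1 = perfect ranking, 0 = random."""
    return gini(y_true, y_pred) / gini(y_true, y_true)
```

Ties among predictions are broken arbitrarily by the sort, a common simplification in practical implementations.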
In Table II, we compare predictive performance across the modeling approaches described above. This highlights the gap between generic agentic AI pipelines that use only tabular data and methods used by human data scientists that incorporate domain knowledge from images. For methods using image features, the “Corr.” column shows how well the extracted variable aligns with the true underlying roof health.
The first row, Agentic AI (Generic pipeline), reflects typical agentic AI workflows that use only tabular data and ignore image and domain-specific information. This approach represents the performance of a standard pipeline LLMs would generate without domain insight, achieving a normalized Gini of 0.3823. When important information is hidden in images, such standard pipelines struggle to achieve good results.
The next group, Data Scientists (Image use), includes several practical strategies for incorporating image information, just as a human data scientist might try. Using naive clustering of CLIP embeddings as categorical features provides some improvement (Gini 0.5042), but does not fully capture the signal (correlation with true roof health is 0.40). Feeding the full CLIP features into the model yields much better results (Gini 0.7719). Extracting the RoofHealth label from images with a vision-language model (gpt-4o-mini) also boosts performance (Gini 0.7271), with a much higher correlation to the true latent variable (0.81). When the model is given the true RoofHealth label as if a human labels the images perfectly, the performance almost matches the best possible (Gini 0.8310). The clear trend is that methods using image-based domain knowledge achieve much higher predictive performance.
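For reference, a sketch of how such a vision-language extraction call might look, using the standard OpenAI chat-completions vision interface; the prompt text here is hypothetical, as the paper does not reproduce its exact prompt.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical prompt; the paper's actual wording is not shown.
PROMPT = ("You are assessing property insurance risk. Classify the roof in "
          "this overhead image as Good, Fair, or Bad. Reply with one word.")

def label_roof_health(image_path: str) -> str:
    """Ask gpt-4o-mini for a RoofHealth label from a roof image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```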
The last row, Oracle (Best achievable) represents the optimal achievable performance (Gini 0.8379), where predictions utilize the exact underlying generative mechanism and the true RoofHealth labels. This tier’s result reflects only inherent randomness in claims data and sets a practical upper bound for predictive performance.
The improvements observed across these levels clearly demonstrate the importance of domain-specific knowledge in data science and highlight the limitations of the generic, tabular-only approaches typically employed by current agentic AI.
In this work, we illustrate that agentic AI cannot match the performance of human data scientists in our controlled setting, using a carefully designed synthetic dataset. The dataset is constructed so that an important latent variable is hidden within the image data. As a result, generic algorithms that rely solely on tabular data become insufficient, which is precisely the approach typically employed by agentic AI. In contrast, a human data scientist equipped with domain knowledge can correctly identify and utilize this latent information from the images, resulting in substantially improved performance. This underscores a limitation of current agentic AI for data science: they typically generate generic algorithms without adequately incorporating domain-specific insights. We hope this work will inspire further research into building agentic AI that can critically incorporate and utilize domain-specific knowledge, thereby bridging the gap between automated workflows and expert human performance.