AI-generated data contamination erodes pathological variability and diagnostic reliability


Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop where future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of the model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence; models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.


💡 Research Summary

The paper investigates a previously under‑explored risk: the contamination of medical data repositories by synthetic content generated by large language models (LLMs), vision‑language models, and image synthesis systems. The authors argue that as generative AI tools become embedded in electronic health records (EHRs) – drafting discharge summaries, radiology reports, and even generating imaging data – these AI‑produced artifacts are subsequently reused as training material for newer models, creating a self‑referential feedback loop.

To quantify this phenomenon, the researchers built a closed‑loop "self‑referential training" framework spanning five generations (Gen 0–Gen 4). Each generation starts from the original pretrained checkpoint (e.g., GPT‑2 with 124M parameters, Qwen‑3‑8B with 8B parameters) and is fine‑tuned exclusively on the synthetic outputs of the preceding generation, never seeing real human‑authored data again. This design isolates data‑quality degradation from ordinary model drift. The experiments cover three clinically central modalities: (1) clinical text generation (radiology reports, ICU discharge instructions, ophthalmology notes), (2) vision‑language radiology reporting, and (3) medical image synthesis (CT, X‑ray). In total, more than 800,000 synthetic data points were generated and analyzed.
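
The loop itself is simple enough to sketch. The Python below is a minimal, framework‑agnostic rendering of the protocol as described, not the authors' actual pipeline; the three injected callables (`load_checkpoint`, `fine_tune`, `generate`) are hypothetical placeholders for whatever training stack is used.

```python
def self_referential_training(load_checkpoint, fine_tune, generate,
                              real_corpus, n_generations=5, n_samples=100_000):
    """Closed-loop protocol from the paper: Gen 0 is fine-tuned on real data;
    every later generation is fine-tuned ONLY on its predecessor's synthetic output.

    Injected callables (hypothetical, for illustration):
      load_checkpoint() -> model        # fresh pretrained weights, e.g. GPT-2
      fine_tune(model, corpus) -> model
      generate(model, n) -> corpus      # list of synthetic samples
    """
    corpora, models = [real_corpus], []
    for gen in range(n_generations):
        model = load_checkpoint()        # restart from the ORIGINAL pretrained
                                         # weights each generation, so any decay
                                         # comes from the data, not weight drift
        model = fine_tune(model, corpora[-1])
        synthetic = generate(model, n_samples)
        corpora.append(synthetic)        # the next generation sees only this
        models.append(model)
    return models, corpora
```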

Multiple quantitative metrics were employed: lexical diversity (type‑token ratio), unique medical‑term count, semantic coherence scores, perplexity on real clinical text, and a novel "pathological variability" index that captures the presence of rare findings (e.g., pneumothorax, effusions, fractures) and demographic balance. The results are striking. Vocabulary in radiology reports collapses from 12,078 unique tokens to roughly 200 by Gen 4 (a 98.9% reduction). Unique medical terms drop by two‑thirds across datasets. Report length and section structure converge to a single template; the "Impression" section's word count variance shrinks dramatically. Condition co‑occurrence matrices reveal that rare but clinically critical findings disappear entirely, while common diagnoses dominate. Demographic representation skews heavily toward middle‑aged males.
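
Two of these metrics are simple enough to make concrete. The sketch below shows the standard type‑token ratio and a unique‑medical‑term count; the toy lexicon and example strings are purely illustrative, and the paper's actual term list is not reproduced here.

```python
def type_token_ratio(tokens):
    """Lexical diversity: unique token types divided by total tokens
    (1.0 = every token distinct; values near 0 = heavy repetition)."""
    return len(set(tokens)) / max(len(tokens), 1)

def unique_medical_terms(tokens, medical_lexicon):
    """Count distinct medical terms in a corpus against a reference lexicon
    (here a tiny illustrative set, not the paper's actual lexicon)."""
    return len({t for t in tokens if t.lower() in medical_lexicon})

# A collapsed late-generation corpus shows both numbers falling together.
gen0 = "small left pneumothorax with trace pleural effusion no rib fracture".split()
gen4 = ("no acute abnormality " * 3).split()
lexicon = {"pneumothorax", "pleural", "effusion", "fracture"}
print(type_token_ratio(gen0), unique_medical_terms(gen0, lexicon))  # 1.0, 4
print(type_token_ratio(gen4), unique_medical_terms(gen4, lexicon))  # ~0.33, 0
```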

Crucially, model confidence inflates while diagnostic accuracy plummets. The "false reassurance" rate (instances where the model confidently reports a normal study despite an underlying pathology) triples from ~13% to ~40%. Perplexity on authentic clinical text rises from 17.5 to over 786, indicating a loss of true language understanding despite higher self‑reported certainty.
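
As a rough illustration, both quantities can be computed as below. The record schema and the 0.9 confidence cutoff are assumptions made for this sketch, not values taken from the paper; perplexity follows the standard definition, exp of the negative mean per‑token log probability on held‑out real text.

```python
import math

def false_reassurance_rate(reports, confidence_cutoff=0.9):
    """Fraction of studies the model confidently calls normal despite a
    ground-truth pathology. Each report is assumed to be a dict with keys
    'predicted_normal' (bool), 'confidence' (0-1), 'has_pathology' (bool);
    keys and cutoff are illustrative assumptions."""
    falsely_reassuring = [
        r for r in reports
        if r["predicted_normal"]
        and r["confidence"] >= confidence_cutoff
        and r["has_pathology"]
    ]
    return len(falsely_reassuring) / max(len(reports), 1)

def perplexity(token_log_probs):
    """Standard perplexity on held-out real clinical text: exp(-mean(log p)).
    In the paper this rises from 17.5 to over 786 across generations even as
    self-reported confidence climbs."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```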

A blinded physician evaluation confirms that after just two generations, AI‑generated documentation becomes clinically unusable, requiring extensive manual revision.

The authors test three mitigation strategies: (1) scaling up the volume of synthetic data, (2) mixing real clinical data with synthetic data at varying proportions, and (3) applying a quality‑aware filter that discards low‑quality synthetic samples before retraining. Scaling synthetic volume alone does not prevent collapse. In contrast, mixing real data (≥30% of the training set) or employing quality filtering preserves lexical diversity, maintains rare pathology mentions, and keeps false‑reassurance rates near baseline.
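
A plausible implementation of the two effective mitigations, combined in one corpus‑assembly step, might look like the following. The `quality_score` callable stands in for whatever filter the authors used (the paper describes it only as "quality‑aware"); the ≥30% real‑data proportion is the one figure taken from the paper.

```python
import random

def build_training_set(real_data, synthetic_data, real_fraction=0.3,
                       quality_score=None, quality_threshold=0.5, seed=0):
    """Assemble a retraining corpus under the two mitigations that worked:
    keep authentic data at >= `real_fraction` of the mix (paper: >=30%) and
    optionally drop low-quality synthetic samples first. `quality_score` is
    a hypothetical callable mapping a sample to [0, 1]."""
    assert 0 < real_fraction <= 1
    rng = random.Random(seed)
    if quality_score is not None:
        synthetic_data = [s for s in synthetic_data
                          if quality_score(s) >= quality_threshold]
    # Cap the synthetic share so real samples make up at least `real_fraction`.
    n_synth = min(len(synthetic_data),
                  int(len(real_data) * (1 - real_fraction) / real_fraction))
    mixed = list(real_data) + rng.sample(synthetic_data, n_synth)
    rng.shuffle(mixed)
    return mixed
```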

Based on these findings, the paper proposes concrete policy and workflow recommendations: mandatory provenance metadata for all AI‑generated records, routine human verification of AI‑drafted outputs, enforced retention of a minimum proportion of authentic data in any retraining pipeline, and automated quality checks before synthetic data are ingested. Without such safeguards, the deployment of generative AI threatens to erode the very data ecosystem that underpins future diagnostic tools, potentially amplifying health inequities and compromising patient safety at population scale.
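
As one illustration of what the recommended provenance metadata could look like in practice, the sketch below defines a minimal record tag and an eligibility check; the field names are hypothetical rather than a published standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ProvenanceTag:
    """Hypothetical provenance metadata for an AI-drafted clinical record,
    following the paper's recommendation that synthetic content be labelled
    and human-verified before reuse. Field names are illustrative."""
    record_id: str
    generator_model: str                 # model name/version that drafted the text
    generated_at: str                    # ISO-8601 timestamp
    human_verified: bool = False         # mandatory clinician sign-off
    verifier_id: str | None = None

    def eligible_for_retraining(self) -> bool:
        # Only human-verified AI content should enter a retraining pipeline,
        # enforcing the "routine human verification" recommendation above.
        return self.human_verified

tag = ProvenanceTag(
    record_id="rpt-0001",
    generator_model="example-report-model-v1",   # hypothetical identifier
    generated_at=datetime.now(timezone.utc).isoformat(),
)
print(tag.eligible_for_retraining())  # False until a clinician signs off
```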

