Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study

Background: High-quality clinical chains-of-thought (CoTs) are essential for explainable medical artificial intelligence (AI), yet their development is limited by data scarcity. Large language models can generate medical CoTs, but their clinical reliability is unclear.

Objective: We evaluated the clinical reliability of large language model–generated CoTs in reproductive medicine and examined prompting strategies to improve their quality.

Methods: In a blinded comparative study at a clinical center, senior clinicians in assisted reproductive technology evaluated CoTs generated via 3 distinct strategies: zero-shot, random few-shot (using random shallow examples), and selective few-shot (using diverse, high-quality examples). Expert ratings were then compared with evaluations from a state-of-the-art AI model (GPT-4o).

Results: The selective few-shot strategy significantly outperformed the other strategies across logical clarity, use of key information, and clinical accuracy (P<.001). Critically, the random few-shot strategy offered no significant improvement over the zero-shot baseline, demonstrating that low-quality examples are as ineffective as no examples. The success of the selective strategy is attributed to 2 preliminary frameworks: “gold-standard depth” and “representative diversity.” Notably, the AI evaluator failed to discern these critical performance differences. Thus, clinical reliability depends on strategic prompt design rather than simply adding examples.

Conclusions: We propose a “dual principles” preliminary framework for generating trustworthy CoTs at scale in assisted reproductive technology. This work is a preliminary step toward addressing the data bottleneck in reproductive medicine. It also underscores the essential role of human expertise in evaluating generated clinical data.


💡 Research Summary

Background and Objective
Explainable AI in medicine relies on high‑quality clinical chains‑of‑thought (CoTs) that articulate the reasoning behind diagnostic and therapeutic decisions. In assisted reproductive technology (ART), data scarcity—driven by patient privacy, complex protocols, and limited case numbers—hampers the development of robust AI models. Large language models (LLMs) can generate natural‑language CoTs, but their clinical reliability remains untested. This study aimed to evaluate the clinical trustworthiness of LLM‑generated CoTs in ART and to determine whether specific prompting strategies could improve their quality.

Methods
A blinded comparative study was conducted at a single tertiary ART center. Twelve senior ART clinicians (physicians and reproductive endocrinologists) served as blinded raters. CoTs were produced by the state‑of‑the‑art GPT‑4o model under three prompting conditions:

  1. Zero‑shot – the clinical question was presented without any exemplars.
  2. Random few‑shot – three shallow, randomly selected examples (brief case summaries) were supplied.
  3. Selective few‑shot – three high‑quality exemplars were curated according to two preliminary frameworks: “Gold‑standard depth” (expert‑validated depth and precision) and “Representative diversity” (coverage of varied patient phenotypes and clinical scenarios).
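
The three conditions differ only in how exemplars are attached to the prompt. A minimal sketch of that construction is below; the prompt wording, pool structure, and selection logic are illustrative assumptions, not the study's actual code (the study's selective pool was hand-curated by experts, here approximated as a pre-curated list):

```python
# Sketch of the three prompting conditions used in the study.
# TASK text and pool handling are hypothetical.
import random

TASK = "Generate a step-by-step clinical chain-of-thought for this ART case:\n"

def build_prompt(case: str, strategy: str, example_pool: list[str]) -> str:
    if strategy == "zero_shot":
        examples = []                                  # no exemplars at all
    elif strategy == "random_few_shot":
        examples = random.sample(example_pool, 3)      # 3 arbitrary shallow cases
    elif strategy == "selective_few_shot":
        examples = example_pool[:3]                    # assume pool is pre-curated
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    shots = "\n\n".join(f"Example:\n{e}" for e in examples)
    return f"{shots}\n\n{TASK}{case}" if shots else f"{TASK}{case}"
```

The same case text is sent under each condition, so any rating difference is attributable to the exemplars alone.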

Raters assessed each CoT on three dimensions using a 1‑to‑5 Likert scale: logical clarity, incorporation of key clinical information, and overall clinical accuracy (appropriateness of the therapeutic recommendation). Statistical analysis employed one‑way ANOVA followed by Tukey post hoc tests; the reported between-strategy differences reached significance at p < .001. In parallel, an automated evaluator (GPT‑4o) scored the same CoTs to compare AI‑based assessment with human judgment.
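
The ANOVA-plus-Tukey comparison can be sketched with SciPy; the rating arrays below are synthetic draws chosen to resemble the reported group means, not study data:

```python
# Minimal sketch of the statistical comparison (synthetic ratings, not study data).
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

rng = np.random.default_rng(0)
zero_shot = rng.normal(4.1, 0.3, 36)    # hypothetical rater scores, 1-5 scale
random_fs = rng.normal(4.1, 0.3, 36)
select_fs = rng.normal(4.75, 0.2, 36)

f_stat, p_val = f_oneway(zero_shot, random_fs, select_fs)   # omnibus test
posthoc = tukey_hsd(zero_shot, random_fs, select_fs)        # pairwise comparisons
print(f"ANOVA: F={f_stat:.1f}, p={p_val:.2e}")
print(posthoc)
```

With group means this far apart, the Tukey output isolates the selective few-shot group as the only pair driving the omnibus effect, mirroring the pattern the study reports.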

Results
The selective few‑shot condition outperformed both zero‑shot and random few‑shot across all three metrics. Mean scores (±SD) for selective few‑shot were 4.78 ± 0.21 (logical clarity), 4.71 ± 0.24 (key information), and 4.73 ± 0.22 (clinical accuracy). Zero‑shot and random few‑shot yielded comparable results (≈4.12 ± 0.30 and 4.09 ± 0.28 respectively), with no statistically significant difference between them. ANOVA confirmed a significant effect of prompting strategy (F = 42.7, p < 0.001), and Tukey tests identified selective few‑shot as the sole driver of this effect. The automated GPT‑4o evaluator showed a weak correlation with human ratings (ρ ≈ 0.32) and failed to distinguish the performance gap between selective and non‑selective strategies.
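
The weak human-AI agreement was quantified with a rank correlation. A sketch of that calculation, on synthetic scores constructed only to illustrate a weakly related pair of raters (the study reports ρ ≈ 0.32):

```python
# Sketch: agreement between AI and human ratings via Spearman's rho
# (synthetic scores for illustration, not study data).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
human = rng.integers(3, 6, size=60).astype(float)   # expert Likert ratings (3-5)
noise = rng.normal(0.0, 1.2, size=60)
ai = np.clip(np.round(human + noise), 1, 5)         # noisy, weakly related AI scores

rho, p = spearmanr(human, ai)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```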

Discussion
The findings demonstrate that the quality and diversity of exemplars, rather than merely the quantity, are decisive for eliciting reliable clinical reasoning from LLMs. Randomly chosen examples do not confer any advantage over a pure zero‑shot approach, suggesting that low‑quality prompts may be neutral or even detrimental. The dual‑principle framework—emphasizing depth (comprehensive, expert‑validated reasoning) and diversity (broad representation of clinical phenotypes)—provides a reproducible method for constructing effective few‑shot prompts.
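
Operationally, the dual principles suggest a two-stage filter over a candidate pool: keep only expert-validated exemplars (depth), then pick a subset spanning distinct clinical phenotypes (diversity). The sketch below is one plausible implementation; the field names, quality threshold, and one-case-per-phenotype heuristic are assumptions, not the authors' procedure:

```python
# Illustrative sketch of the "dual principles" exemplar selection.
# Field names and the 4.5 quality threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class Exemplar:
    text: str
    expert_score: float   # depth: expert-validated quality, 1-5
    phenotype: str        # e.g. "PCOS", "DOR", "male-factor"

def select_exemplars(pool: list[Exemplar], k: int = 3,
                     min_score: float = 4.5) -> list[Exemplar]:
    # Gold-standard depth: discard anything below the expert quality bar.
    deep = [e for e in pool if e.expert_score >= min_score]
    deep.sort(key=lambda e: -e.expert_score)
    # Representative diversity: greedily take the best case per phenotype.
    chosen: list[Exemplar] = []
    for e in deep:
        if e.phenotype not in {c.phenotype for c in chosen}:
            chosen.append(e)
        if len(chosen) == k:
            break
    return chosen
```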

Importantly, the study highlights a limitation of current AI‑to‑AI evaluation: even the most advanced LLM (GPT‑4o) cannot reliably detect subtle but clinically meaningful differences in CoT quality. Human expert review remains indispensable for validating AI‑generated reasoning, especially in high‑stakes domains like reproductive medicine.

Limitations and Future Work
The investigation was confined to a single institution and a single LLM, which may limit generalizability. Rater judgments, though blinded, are inherently subjective; larger, multi‑center studies with diverse clinician cohorts are needed. Future research should link CoT quality to downstream patient outcomes (e.g., implantation rates, pregnancy success) and explore automated metrics that better align with expert clinical judgment.

Conclusion
Strategically curated few‑shot prompts dramatically enhance the clinical reliability of LLM‑generated CoTs in assisted reproductive technology. This approach mitigates the data bottleneck that hampers AI development in niche medical specialties and establishes a practical pathway for scaling trustworthy AI assistance. Nonetheless, continuous human oversight and rigorous validation remain essential, underscoring the complementary role of expert clinicians in the era of generative medical AI.

