TRACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE demonstrates effective spatial localization with over 90% grounding accuracy, establishing a foundation for this challenging new task. Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned, as neither alone enables meaningful change detection. This finding suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.


💡 Research Summary

Temporal comparison of chest X‑rays is a cornerstone of radiology, yet existing AI systems either generate reports from a single image, detect longitudinal changes without localization, or produce grounded reports without explicit temporal reasoning. The authors introduce TRACE (Temporal Radiology with Anatomical Change Explanation), the first model that simultaneously performs temporal comparison, three‑way change classification (worsened, improved, stable), and spatial grounding of each reported finding.

To train and evaluate TRACE, the authors construct a large‑scale dataset by pairing consecutive studies from the same patient in MIMIC‑CXR‑JPG and extracting anatomical bounding boxes and temporal change labels from Chest ImaGenome scene graphs. The final corpus contains 79,202 training pairs and 22,553 test pairs, with a patient‑disjoint split to avoid leakage. Each annotated change is rendered as a grounded sentence in the target report, e.g., “Interval worsening of pneumothorax 0.19,0.11,0.52,0.63 in right lung.”
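To make the target format concrete, here is a minimal sketch of how a change record might be rendered into such a grounded sentence. The record field names (`change`, `finding`, `bbox`, `region`) and the phrasing templates are illustrative assumptions, not the authors' actual pipeline; only the overall sentence shape follows the paper's example.

```python
def render_grounded_sentence(record):
    """Render a change record as a grounded sentence (hypothetical format)."""
    change_phrase = {
        "worsened": "Interval worsening of",
        "improved": "Interval improvement of",
        "stable": "Stable",
    }[record["change"]]
    # Bounding boxes are assumed to be normalized (x1, y1, x2, y2) coordinates.
    bbox = ",".join(f"{c:.2f}" for c in record["bbox"])
    return f"{change_phrase} {record['finding']} {bbox} in {record['region']}."

example = {
    "change": "worsened",
    "finding": "pneumothorax",
    "bbox": (0.19, 0.11, 0.52, 0.63),
    "region": "right lung",
}
print(render_grounded_sentence(example))
# → Interval worsening of pneumothorax 0.19,0.11,0.52,0.63 in right lung.
```

Encoding the box directly in the text stream lets a standard autoregressive LLM emit localization without any detection head.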

The architecture builds on the LLaVA paradigm. A frozen BioViL‑T vision transformer encodes the prior and current X‑rays separately, yielding two 196‑token visual sequences (14 × 14 spatial grid, 512‑dimensional). The sequences are concatenated (392 tokens) and projected via a two‑layer MLP into the 4096‑dimensional embedding space of a large language model (LLM). The LLM, Vicuna‑7B (or Mistral‑7B in experiments), is fine‑tuned with Low‑Rank Adaptation (LoRA) applied to all query and value projection matrices (rank = 128, α = 256), adding only ~30 M trainable parameters. The total number of trainable parameters is ~34 M; the vision encoder remains frozen. Implicit cross‑attention in the LLM’s self‑attention layers enables the model to compare corresponding spatial regions across the two studies without explicit image subtraction.
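The data flow above (encode each study to 196 tokens, concatenate to 392, project through a two‑layer MLP into the LLM embedding space) can be sketched at toy scale. This is a shape‑level illustration only: the real dimensions are 512 → 4096 with a pretrained BioViL‑T encoder, whereas here we use tiny random weights and ReLU in place of the actual activation, which the summary does not specify.

```python
import random

# Toy dims for illustration; the paper uses VIS_DIM=512, LLM_DIM=4096.
VIS_TOKENS, VIS_DIM, HIDDEN, LLM_DIM = 196, 8, 16, 16

def linear(x, w, b):
    """y = Wx + b with W stored as d_out rows of length d_in."""
    return [sum(xi * wij for xi, wij in zip(x, row)) + bj
            for row, bj in zip(w, b)]

def make_layer(d_in, d_out, rng):
    w = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return w, [0.0] * d_out

rng = random.Random(0)
w1, b1 = make_layer(VIS_DIM, HIDDEN, rng)
w2, b2 = make_layer(HIDDEN, LLM_DIM, rng)

def project(token):
    """Two-layer MLP projector (ReLU here; activation is an assumption)."""
    h = [max(0.0, v) for v in linear(token, w1, b1)]
    return linear(h, w2, b2)

# Two encoded studies, 196 tokens each, concatenated then projected.
prior = [[rng.random() for _ in range(VIS_DIM)] for _ in range(VIS_TOKENS)]
current = [[rng.random() for _ in range(VIS_DIM)] for _ in range(VIS_TOKENS)]
visual_sequence = [project(t) for t in prior + current]
print(len(visual_sequence), len(visual_sequence[0]))  # 392 16
```

The 392 projected tokens are simply prepended to the text prompt, so the LLM's ordinary self‑attention can relate prior‑study tokens to current‑study tokens.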

Training proceeds in two stages. First, only the MLP projector is trained for one epoch to align visual features with the LLM’s embedding space. Second, the projector and LoRA‑adapted LLM are jointly optimized using standard autoregressive language modeling loss, conditioned on an instruction prompt that asks for grounded temporal reports.
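The stage‑wise freezing policy can be written down as a small helper; the parameter‑group names here are illustrative, not the authors' actual module names.

```python
def trainable_groups(stage):
    """Which parameter groups receive gradients in each training stage."""
    groups = {"vision_encoder": False}  # BioViL-T stays frozen throughout
    if stage == 1:
        # Stage 1: align the projector with the LLM embedding space (1 epoch).
        groups.update(mlp_projector=True, lora_adapters=False)
    elif stage == 2:
        # Stage 2: jointly optimize projector and LoRA-adapted LLM.
        groups.update(mlp_projector=True, lora_adapters=True)
    else:
        raise ValueError("stage must be 1 or 2")
    return groups

print(trainable_groups(1))
print(trainable_groups(2))
```

Keeping the vision encoder frozen in both stages is what holds the trainable parameter count to roughly 34 M.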

Evaluation covers four dimensions: (1) change detection accuracy (three‑way classification), (2) grounding accuracy measured by IoU > 0.5, (3) natural language generation quality (BLEU‑4, METEOR, ROUGE‑L), and (4) clinical adequacy (RadGraph F1, CheXbert F1). TRACE achieves 48.0 % change detection accuracy, 90.2 % grounding accuracy at IoU > 0.5, and competitive NLG scores (BLEU‑4 0.260). Clinical metrics confirm that the generated reports are medically plausible.
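The grounding metric counts a predicted box as correct when its IoU with the reference box exceeds 0.5. A standard IoU computation for normalized (x1, y1, x2, y2) boxes looks like this; the example box values are made up, not taken from the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred = (0.20, 0.10, 0.50, 0.60)   # hypothetical predicted box
gold = (0.19, 0.11, 0.52, 0.63)   # hypothetical reference box
hit = iou(pred, gold) > 0.5       # counts toward grounding accuracy
print(round(iou(pred, gold), 3), hit)  # 0.842 True
```

Grounding accuracy is then the fraction of reported findings whose box clears this threshold, which is what the 90.2 % figure measures.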

A key finding emerges from ablation studies: when either temporal input or grounding supervision is removed, the model collapses to predicting “stable” for every case, yielding zero change‑detection performance. This demonstrates an emergent capability: meaningful temporal reasoning only appears when spatial grounding provides a focused attention mechanism. Consequently, grounding is not merely an interpretability add‑on but a functional component that enables the model to attend to specific anatomical regions while comparing studies.

The paper’s contributions are: (1) definition of a new task—grounded temporal change detection—and release of a large, patient‑disjoint dataset; (2) a lightweight architecture that combines a frozen temporal‑aware vision encoder with LoRA‑fine‑tuned LLM, requiring only ~34 M trainable parameters; (3) discovery of the emergent dependence of change detection on joint temporal and spatial learning; (4) strong empirical results showing >90 % localization accuracy and clinically relevant report generation.

Limitations include modest change‑detection accuracy (still far from perfect), reliance on 2‑D bounding boxes that may not capture complex 3‑D pathology evolution, and potential bias inherited from the source datasets. Future work could explore richer temporal representations (e.g., difference maps or recurrent modules), integration of CT or multi‑modal data, and deployment‑ready user interfaces for radiologists. TRACE sets a new benchmark for interpretable, temporally aware AI assistance in radiology, paving the way for safer and more informative automated report generation.

