A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry
Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset comprises 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
💡 Research Summary
This paper addresses a critical bottleneck in fetal growth assessment: the lack of large, diverse, and publicly available ultrasound (US) datasets annotated with the anatomical landmarks required for standard biometric measurements. Manual identification of landmarks on standard planes (head, abdomen, femur) is time‑consuming, operator‑dependent, and highly susceptible to variability across ultrasound devices and clinical sites, which hampers the reproducibility and generalisation of automated deep‑learning (DL) methods.
To overcome this limitation, the authors introduce an open, multi‑centre, multi‑device benchmark dataset comprising 4,513 de‑identified fetal US images from 1,904 subjects. The images were collected at three clinical sites (UCL in the UK and two sites in Tel‑Aviv, Israel) using seven different US machines from manufacturers such as GE, Philips, and Hitachi. For each image, expert sonographers annotated the two anatomical points needed to compute five clinically relevant biometric parameters: biparietal diameter (BPD), occipito‑frontal diameter (OFD), transverse abdominal diameter (T‑AD), antero‑posterior abdominal diameter (AP‑AD), and femur length (FL).
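Because each measurement is defined by exactly two annotated points, deriving the biometric value itself is a simple Euclidean distance scaled by the pixel spacing. The sketch below illustrates this idea; the function name, argument layout, and spacing value are hypothetical and not taken from the released dataset or code.

```python
import numpy as np

def biometry_from_landmarks(p1, p2, pixel_spacing_mm=1.0):
    """Length of the segment joining two landmark points, in millimetres.

    p1, p2: (row, col) pixel coordinates of the two annotated landmarks.
    pixel_spacing_mm: physical size of one pixel (illustrative value).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return float(np.linalg.norm(p2 - p1) * pixel_spacing_mm)

# e.g. two BPD landmarks 300 px apart with 0.2 mm/px spacing -> 60 mm
print(biometry_from_landmarks((100, 100), (100, 400), pixel_spacing_mm=0.2))
```

The same two-point formulation covers all five parameters (BPD, OFD, T‑AD, AP‑AD, FL), which is what makes a single landmark-regression model applicable across planes.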
The dataset is released with subject‑disjoint training, validation, and test splits, together with Python evaluation code and baseline results. The authors adopt BiometryNet, a state‑of‑the‑art landmark‑regression framework that incorporates Dynamic Orientation Determination (DOD) to enforce consistent measurement orientation and uses a modified HRNet backbone for high‑resolution feature extraction. Performance is quantified using Normalised Mean Error (NME), which measures landmark localisation error relative to the inter‑landmark distance.
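As described above, NME normalises the landmark localisation error by the inter‑landmark distance, making the metric comparable across image scales and magnifications. A minimal sketch of such a metric for a two‑landmark measurement is shown below; the benchmark's actual evaluation code may differ in detail, so treat this as an assumption about the formula rather than the reference implementation.

```python
import numpy as np

def nme(pred, gt):
    """Normalised Mean Error for one two-landmark measurement.

    pred, gt: arrays of shape (2, 2) holding the predicted and ground-truth
    (row, col) coordinates of the two landmarks. The mean per-landmark
    error is divided by the ground-truth inter-landmark distance, so the
    result is scale-invariant.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    norm = np.linalg.norm(gt[0] - gt[1])          # inter-landmark distance
    errors = np.linalg.norm(pred - gt, axis=1)    # per-landmark error
    return float(errors.mean() / norm)

gt = [(0.0, 0.0), (100.0, 0.0)]
pred = [(3.0, 4.0), (100.0, 0.0)]   # one landmark displaced by 5 px
print(nme(pred, gt))                # (5 + 0) / 2 / 100 = 0.025
```

Under this definition, the within-domain values quoted below (≈0.03–0.08) correspond to average landmark errors of a few percent of the measured diameter.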
Within‑domain experiments (training and testing on the same source) achieve low NME values (≈0.03–0.08 for head and abdomen, ≈0.05 for femur), confirming that the baseline model can accurately predict landmarks when data distribution is homogeneous. Cross‑domain experiments, however, reveal substantial domain shift: NME roughly doubles for head and abdomen measurements and can reach as high as 0.90 for femur length when a model trained on the “FP” subset is evaluated on the “UCL” subset. This degradation is attributed to differences in device manufacturer, probe handling, magnification, and fetal pose, which affect landmark position, size, and orientation.
Crucially, a model trained on the combined multi‑centre (M‑C) data—i.e., integrating all three sites and all seven devices—demonstrates the best overall generalisation. When tested on the UCL set, the M‑C model attains NME = 0.02 ± 0.02 for BPD and 0.03 ± 0.11 for OFD, outperforming even the UCL‑specific model on its own test data. Similar improvements are observed for abdominal measurements and femur length, underscoring that exposure to diverse acquisition conditions during training markedly mitigates domain shift.
Beyond the benchmark itself, the authors provide a thorough analysis of anatomical variability across sites using kernel density estimates of landmark centre‑point distribution, normalised size, and polar histograms of orientation. These visualisations confirm that the dataset captures realistic clinical heterogeneity, making it an ideal testbed for domain‑adaptation strategies, meta‑learning, and federated learning approaches that aim to build robust, device‑agnostic fetal biometry tools.
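The orientation analysis mentioned above can be sketched as follows: for each image, the angle of the segment joining the two landmarks is computed and the angles are binned into a polar histogram. The function below is an illustrative reconstruction of that idea with synthetic landmark pairs, not the authors' analysis code.

```python
import numpy as np

def orientation_histogram(landmark_pairs, n_bins=16):
    """Bin the orientations of landmark segments into a polar histogram.

    landmark_pairs: array of shape (N, 2, 2) -- N images, two (x, y)
    landmarks each. Returns (counts, bin_edges) over (-pi, pi].
    """
    pairs = np.asarray(landmark_pairs, dtype=float)
    d = pairs[:, 1] - pairs[:, 0]                  # segment vectors
    angles = np.arctan2(d[:, 1], d[:, 0])          # orientation in radians
    counts, edges = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return counts, edges

# Three synthetic segments pointing right, up, and left
pairs = [[(0, 0), (1, 0)], [(0, 0), (0, 1)], [(0, 0), (-1, 0)]]
counts, _ = orientation_histogram(pairs, n_bins=4)
print(counts.sum())  # 3 -- every segment falls in exactly one bin
```

Plotting `counts` on a polar axis (e.g. with `matplotlib`) yields the kind of polar histogram used in the paper to compare fetal pose across sites.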
In summary, this work delivers the first publicly available, multi‑centre, multi‑device, landmark‑annotated fetal US dataset covering all primary biometric planes. By supplying standardised splits, evaluation pipelines, and baseline results, it enables fair, reproducible comparison of AI methods and encourages the development of models that generalise across scanners, operators, and clinical environments—an essential step toward reliable, automated fetal growth monitoring in routine obstetric practice.