DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening
Large-scale tuberculosis (TB) screening is limited by the high cost and operational complexity of traditional diagnostics, creating a need for artificial-intelligence solutions. We propose DeepGB-TB, a non-invasive system that instantly assigns TB risk scores using only cough audio and basic demographic data. The model couples a lightweight one-dimensional convolutional neural network for audio processing with a gradient-boosted decision tree for tabular features. Its principal innovation is a Cross-Modal Bidirectional Cross-Attention module (CM-BCA) that iteratively exchanges salient cues between modalities, emulating the way clinicians integrate symptoms and risk factors. To meet the clinical priority of minimizing missed cases, we design a Tuberculosis Risk-Balanced Loss (TRBL) that places stronger penalties on false-negative predictions, thereby reducing high-risk misclassifications. DeepGB-TB is evaluated on a diverse dataset of 1,105 patients collected across seven countries, achieving an AUROC of 0.903 and an F1-score of 0.851, representing a new state of the art. Its computational efficiency enables real-time, offline inference directly on common mobile devices, making it ideal for low-resource settings. Importantly, the system produces clinically validated explanations that promote trust and adoption by frontline health workers. By coupling AI innovation with public-health requirements for speed, affordability, and reliability, DeepGB-TB offers a tool for advancing global TB control.
💡 Research Summary
DeepGB‑TB is a multimodal AI system designed to screen for pulmonary tuberculosis (TB) using only cough audio recordings and basic demographic information. The authors address two critical gaps in existing AI‑based TB screening: (1) the inability of most models to jointly exploit acoustic biomarkers and patient‑level risk factors, and (2) the lack of loss functions that reflect the clinical priority of minimizing false‑negative diagnoses.
The architecture consists of three main components. First, a Cross‑Validated Probability Embedding Module (CVPEM) trains a LightGBM model on the tabular demographic data with a 5‑fold cross‑validation scheme, producing out‑of‑sample probability estimates for each subject. These probabilities are concatenated with the original tabular features, yielding an enriched vector that captures robust, data‑driven risk estimates while preserving interpretability.
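The out-of-fold embedding idea can be sketched in a few lines. This is a minimal illustration, not the authors' code: it swaps LightGBM for scikit-learn's `GradientBoostingClassifier` to stay dependency-free, and the toy feature names are invented for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def cvpem_embed(X_tab, y, n_splits=5, seed=0):
    """Out-of-fold probability embedding: each subject's TB probability
    comes from a model that never saw that subject during training."""
    model = GradientBoostingClassifier(random_state=seed)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Column 1 of predict_proba is P(TB positive), predicted per held-out fold.
    oof_prob = cross_val_predict(model, X_tab, y, cv=cv,
                                 method="predict_proba")[:, 1]
    # Concatenate the out-of-sample probability onto the raw tabular features.
    return np.hstack([X_tab, oof_prob[:, None]])

# Toy demographic matrix: columns stand in for e.g. age, smoking, weight loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_enriched = cvpem_embed(X, y)
print(X_enriched.shape)  # (40, 4): 3 raw features + 1 probability column
```

Using cross-validated rather than in-sample probabilities prevents the appended column from leaking training labels into the downstream fusion model.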
Second, the cough audio is pre‑processed (noise reduction, volume normalization) and transformed into a rich set of acoustic descriptors, including MFCCs, spectral centroid, chroma, zero‑crossing rate, and fundamental frequency. A lightweight 1‑dimensional convolutional neural network (1D‑CNN) with three convolution‑batch‑norm‑ReLU‑max‑pool blocks extracts temporal patterns from the raw waveform, producing an audio embedding.
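One conv-batch-norm-ReLU-max-pool block can be sketched in plain NumPy. This is a shape-level illustration of the described architecture, not the trained network: it uses random kernels, inference-style per-channel normalization, and ignores the handcrafted descriptors (MFCCs, chroma, etc.) that feed a separate path.

```python
import numpy as np

def conv_bn_relu_pool(x, kernels, pool=2, eps=1e-5):
    """One 1-D conv -> norm -> ReLU -> max-pool block (NumPy sketch).
    x: (channels_in, time); kernels: (channels_out, channels_in, width)."""
    c_out, c_in, w = kernels.shape
    t_out = x.shape[1] - w + 1
    y = np.zeros((c_out, t_out))
    for o in range(c_out):          # valid 1-D convolution, one output channel at a time
        for t in range(t_out):
            y[o, t] = np.sum(kernels[o] * x[:, t:t + w])
    # Per-channel normalization (unit gamma, zero beta), then ReLU.
    y = (y - y.mean(axis=1, keepdims=True)) / (y.std(axis=1, keepdims=True) + eps)
    y = np.maximum(y, 0.0)
    # Non-overlapping max-pool along time.
    t_pool = t_out // pool
    return y[:, :t_pool * pool].reshape(c_out, t_pool, pool).max(axis=2)

rng = np.random.default_rng(1)
wave = rng.normal(size=(1, 64))                       # mono waveform, 64 samples
h = conv_bn_relu_pool(wave, rng.normal(size=(4, 1, 5)))  # block 1: 1 -> 4 channels
h = conv_bn_relu_pool(h, rng.normal(size=(8, 4, 3)))     # block 2: 4 -> 8 channels
print(h.shape)  # (8, 14): time axis shrinks at each block
```

Stacking three such blocks, as the paper does, progressively trades temporal resolution for channel depth before the final embedding.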
The core of the system is the Cross‑Modal Bidirectional Cross‑Attention (CM‑BCA) module. CM‑BCA treats the audio and tabular embeddings as sequences of length one and applies multi‑head cross‑attention in both directions: the tabular embedding queries the audio embedding and vice versa. Each attention pass is followed by a residual connection, layer normalization, and a position‑wise feed‑forward network. The process iterates until the embeddings converge (ℓ2 change below a small ε). This bidirectional refinement enables the model to focus on complementary cues—for example, high‑frequency cough components that are especially informative for older patients with a history of smoking.
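The alternating-refinement loop can be sketched as follows. This is a simplified, single-head NumPy illustration under several stated assumptions: both directions share one set of projection matrices, the feed-forward sub-layer and multi-head split are omitted, and the audio side is given several frames (rather than length one) so the attention weights are non-trivial.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attend(q_in, kv_in, Wq, Wk, Wv):
    """Single-head cross-attention: rows of q_in attend over rows of kv_in."""
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    w = softmax(q @ k.T / np.sqrt(Wq.shape[1]))   # scaled dot-product weights
    return w @ v

def cm_bca(tab, audio, Wq, Wk, Wv, eps=1e-4, max_iter=20):
    """Bidirectional refinement: tabular queries audio and audio queries
    tabular, each pass followed by a residual connection and layer norm,
    iterated until the combined l2 change falls below eps."""
    for _ in range(max_iter):
        tab_new = layer_norm(tab + attend(tab, audio, Wq, Wk, Wv))
        audio_new = layer_norm(audio + attend(audio, tab, Wq, Wk, Wv))
        delta = np.linalg.norm(tab_new - tab) + np.linalg.norm(audio_new - audio)
        tab, audio = tab_new, audio_new
        if delta < eps:
            break
    return tab, audio

rng = np.random.default_rng(2)
d = 8
tab = rng.normal(size=(1, d))      # tabular embedding (sequence length one)
audio = rng.normal(size=(6, d))    # six audio frame embeddings
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
tab_r, audio_r = cm_bca(tab, audio, *W)
print(tab_r.shape, audio_r.shape)  # (1, 8) (6, 8): shapes are preserved
```

The residual-plus-normalization pattern keeps each embedding's scale stable across iterations, which is what makes a convergence criterion on the ℓ2 change meaningful.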
To align training with the public‑health objective of catching as many true TB cases as possible, the authors introduce a Tuberculosis Risk‑Balanced Loss (TRBL). TRBL is a weighted cross‑entropy where the weight for positive samples (TB cases) is set to α > 1, thereby penalizing false negatives more heavily than false positives. This simple yet effective modification drives the model toward higher sensitivity without sacrificing overall discriminative ability.
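As described, TRBL is a class-weighted binary cross-entropy; a minimal sketch (the specific α value here is illustrative, not taken from the paper):

```python
import numpy as np

def trbl(p, y, alpha=3.0, eps=1e-7):
    """Tuberculosis Risk-Balanced Loss: cross-entropy with the positive
    (TB) term up-weighted by alpha > 1, so a missed case costs more
    than a false alarm of equal confidence."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(alpha * y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0])
p = np.array([0.2, 0.8])  # one confident miss of each kind
# With alpha > 1, the missed TB case (y=1, p=0.2) dominates the loss,
# pushing the decision boundary toward higher sensitivity.
print(trbl(p, y, alpha=3.0) > trbl(p, y, alpha=1.0))  # True
```

At α = 1 the loss reduces to standard cross-entropy, so α directly parameterizes the sensitivity/specificity trade-off during training.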
The system was evaluated on a multinational dataset of 1,105 adult participants collected across India, the Philippines, South Africa, Uganda, Vietnam, Tanzania, and Madagascar. Each subject presented with a new or worsening cough of at least two weeks, and TB status was confirmed by culture, Xpert MTB/RIF, or Xpert Ultra. The data were split into 70% training, 15% validation, and 15% test sets, preserving country‑wise distribution.
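A country-stratified 70/15/15 split can be produced with two stratified splits: first carve out the 70% training set, then halve the remaining 30%. This sketch uses scikit-learn's `train_test_split` on synthetic country labels; the exact splitting procedure in the paper may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Synthetic country labels for illustration (seven cohorts, as in the study).
countries = rng.choice(["IN", "PH", "ZA", "UG", "VN", "TZ", "MG"], size=700)
idx = np.arange(len(countries))

# 70% train, stratified by country; then split the remaining 30% in half
# to obtain 15% validation and 15% test, again stratified by country.
train, rest = train_test_split(idx, test_size=0.30,
                               stratify=countries, random_state=0)
val, test = train_test_split(rest, test_size=0.50,
                             stratify=countries[rest], random_state=0)
print(len(train), len(val), len(test))  # 490 105 105
```

Stratifying both stages keeps each country's share of subjects roughly equal across the three partitions, so per-country performance on the test set remains meaningful.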
DeepGB‑TB achieved an area under the ROC curve (AUROC) of 0.903, sensitivity of 0.89, specificity of 0.84, and an F1‑score of 0.851 on the held‑out test set. These results surpass several strong baselines: a vanilla 1D‑CNN (AUROC 0.86), a ResNet‑based audio model (AUROC 0.88), and a late‑fusion CNN‑LightGBM ensemble (AUROC 0.87). Ablation studies showed that removing CM‑BCA (replacing it with simple concatenation) drops AUROC to 0.874, while training with standard cross‑entropy instead of TRBL reduces sensitivity to 0.81, confirming the contribution of each design element.
From a deployment perspective, the model contains only 1.2 million parameters and runs in under 30 ms on a typical ARM Cortex‑A53 mobile processor, using less than 12 MB of RAM. This makes offline inference feasible on low‑cost smartphones, eliminating the need for cloud connectivity—a crucial advantage in resource‑constrained settings.
Interpretability is addressed through two complementary visualizations. SHAP values from the LightGBM component highlight the demographic factors most influencing risk (e.g., age, smoking history, weight loss). Meanwhile, attention heatmaps from CM‑BCA reveal which acoustic frames the model attends to when conditioned on specific patient profiles, providing clinicians with an intuitive explanation of the decision process.
The authors acknowledge several limitations. The cohort consists solely of adults, so performance on pediatric populations remains unknown. Cultural and health‑system differences may affect label quality, potentially introducing bias. Moreover, the iterative attention mechanism can increase memory consumption on ultra‑low‑end devices, suggesting a need for further optimization (e.g., linear‑complexity attention variants).
Future work will explore expanding the modality set to include chest‑ultrasound or radiographic features, applying federated learning to preserve privacy across distributed health networks, and investigating lightweight attention architectures (such as Performer or Linformer) to reduce computational overhead.
In summary, DeepGB‑TB demonstrates that a carefully engineered combination of gradient‑boosted tabular embeddings, lightweight convolutional audio processing, and risk‑aware cross‑modal attention can deliver a fast, accurate, and explainable TB screening tool suitable for deployment on inexpensive mobile hardware, thereby advancing the goal of equitable, large‑scale TB detection in low‑resource environments.