Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation


Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, they reach an F1 score of 0.91, surpassing the TimeSformer score of 0.70.


💡 Research Summary

This paper investigates whether locally deployed vision‑language models (VLMs) can outperform a supervised Vision Transformer (TimeSformer) for fine‑grained activity recognition in newborn resuscitation videos. Accurate documentation of resuscitation actions is critical for quality improvement, yet manual recording is rare, and existing automated methods (the 3D‑CNN‑based ORAA‑net and the semi‑supervised SVFormer) struggle with the subtle, multi‑label nature of the task and raise privacy concerns when cloud services filter medical content.

To address these issues, the authors use open‑source, edge‑friendly models: LLaVA‑NeXT Video (7B parameters) as the VLM and Mistral 7B as the large language model (LLM). They evaluate the approaches on a simulated dataset of 13.26 hours of newborn resuscitation, comprising 3‑second clips (25 fps, 768 × 1024) annotated with four multi‑label classes (ventilation, stimulation, suction, baby‑on‑table). The dataset is heavily imbalanced, so a weighted binary cross‑entropy loss is employed.
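The weighted binary cross‑entropy loss mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the toy logits, and the weight values are all illustrative; the idea is that a per‑class `pos_weight` raises the cost of missing positives of rare classes.

```python
import numpy as np

def weighted_bce(logits, targets, pos_weight):
    """Weighted binary cross-entropy over multi-label outputs.

    logits: (N, C) raw scores; targets: (N, C) in {0, 1};
    pos_weight: (C,) up-weights the positive term of rare classes.
    """
    p = 1.0 / (1.0 + np.exp(-logits))  # per-label sigmoid
    eps = 1e-12                        # numerical guard for log(0)
    loss = -(pos_weight * targets * np.log(p + eps)
             + (1 - targets) * np.log(1 - p + eps))
    return loss.mean()

# A missed positive (logit -2.0 with target 1) is penalized more
# heavily once its class weight is boosted.
logits = np.array([[2.0, -1.0], [-2.0, 1.0]])
targets = np.array([[1.0, 0.0], [1.0, 1.0]])
w_uniform = weighted_bce(logits, targets, np.array([1.0, 1.0]))
w_boosted = weighted_bce(logits, targets, np.array([5.0, 1.0]))
assert w_boosted > w_uniform
```

PyTorch users get the same behavior from `BCEWithLogitsLoss(pos_weight=...)`; the point of the sketch is only the weighting scheme.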

Five strategies are compared: (1) a supervised TimeSformer baseline (pre‑trained on Something‑Something V2, fine‑tuned on the dataset); (2) Zero‑Shot Classification with Constrained Output (ZSC‑CO), which prompts the VLM to emit a comma‑separated label vector; (3) Zero‑Shot Binary Testing (ZS‑B), where the VLM answers a yes/no question for each class; (4) Zero‑Shot Captioning with LLM Judge (ZSC‑J), where the VLM generates a caption and a separate LLM maps it to binary labels; and (5) supervised fine‑tuning of the VLM with a classification head, explored in two variants: FT‑LC (only the head is trained) and FT‑C‑LoRA (Low‑Rank Adaptation modules are inserted into the cross‑modal attention layers, allowing limited backbone updates).
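For the constrained‑output strategy (ZSC‑CO), the VLM's free‑text answer still has to be mapped onto the fixed label set. A plausible parsing step, under the assumption that the model was prompted to return a comma‑separated list of the four class names (the paper does not publish this code), might look like:

```python
# The four multi-label classes from the paper's dataset.
CLASSES = ["ventilation", "stimulation", "suction", "baby-on-table"]

def parse_constrained_output(text: str) -> list[int]:
    """Map a VLM's comma-separated label string to a binary vector.

    Tokens that are not valid class names (e.g. hallucinated labels)
    are silently dropped instead of crashing the pipeline.
    """
    mentioned = {tok.strip().lower() for tok in text.split(",")}
    return [1 if c in mentioned else 0 for c in CLASSES]

parse_constrained_output("ventilation, suction")       # [1, 0, 1, 0]
parse_constrained_output("Ventilation, warming lamp")  # [1, 0, 0, 0]
```

Tolerating unknown tokens matters here precisely because, as the results below show, small local VLMs hallucinate labels in zero‑shot mode.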

Experiments use macro‑F1 as the primary metric due to class imbalance and label interdependence. Hyper‑parameter optimization via Optuna (TPE algorithm) is applied to TimeSformer and FT‑LC; for VLM‑based methods, features are pre‑extracted to reduce compute. Results show that zero‑shot approaches suffer from hallucinations and achieve modest macro‑F1 scores (0.45–0.58). ZSC‑J improves caption quality but still lags (0.66). FT‑LC reaches 0.84 macro‑F1, demonstrating that the frozen VLM already provides rich visual embeddings. The LoRA‑enhanced FT‑C‑LoRA achieves the highest performance, macro‑F1 = 0.91, surpassing the TimeSformer baseline (0.70) while adding less than 1% additional parameters.
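Macro‑F1 is the right metric here because it averages per‑class F1 scores, so the rare classes weigh as much as the common ones. A self‑contained reference implementation (equivalent to scikit‑learn's `f1_score(..., average="macro")`):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over C binary labels.

    y_true, y_pred: lists of (C,) binary vectors. Each class's F1 is
    computed independently, then averaged, so rare classes are not
    drowned out by frequent ones.
    """
    n_classes = len(y_true[0])
    f1s = []
    for c in range(n_classes):
        tp = sum(t[c] and p[c] for t, p in zip(y_true, y_pred))
        fp = sum((not t[c]) and p[c] for t, p in zip(y_true, y_pred))
        fn = sum(t[c] and (not p[c]) for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes

# Class 0 is predicted perfectly (F1 = 1.0); class 1 misses one
# positive (F1 = 2/3), so macro-F1 = (1.0 + 2/3) / 2 = 5/6.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 0], [0, 1]]
assert abs(macro_f1(y_true, y_pred) - 5 / 6) < 1e-9
```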

The study highlights that (i) careful prompt engineering is essential to mitigate VLM hallucinations; (ii) LoRA enables efficient adaptation of large multimodal models without full fine‑tuning; (iii) the simulated dataset limits conclusions about real‑world clinical deployment; and (iv) rare classes such as chest compressions remain under‑represented. Nonetheless, the findings demonstrate that locally run VLM‑LLM pipelines, when fine‑tuned with low‑rank updates, can substantially improve fine‑grained activity recognition in sensitive medical video streams while preserving privacy and avoiding cloud‑based content filters. The work points toward practical, on‑premise multimodal AI solutions for neonatal care documentation and quality improvement.
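The parameter‑efficiency claim behind LoRA (point ii, and the "<1% additional parameters" figure above) follows directly from the low‑rank factorization: a frozen weight `W` of shape `(d_out, d_in)` is adapted by a trainable rank‑`r` product `B @ A`. The sketch below uses NumPy with illustrative dimensions, not the paper's actual layer sizes:

```python
import numpy as np

def lora_factors(W, r, rng):
    """Return LoRA factors for a frozen weight W: the effective weight
    becomes W + B @ A, where only A (r x d_in) and B (d_out x r) train.
    B starts at zero so the adapted layer initially matches the
    frozen model exactly."""
    d_out, d_in = W.shape
    A = rng.normal(scale=0.01, size=(r, d_in))  # trainable
    B = np.zeros((d_out, r))                    # trainable, zero init
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096))  # stand-in for a frozen attention weight
A, B = lora_factors(W, r=8, rng=rng)

extra = A.size + B.size            # 2 * r * 4096 = 65,536 parameters
assert extra / W.size < 0.01       # well under 1% of the frozen weight
assert np.allclose(W + B @ A, W)   # identity at initialization
```

With rank 8 on a 4096 × 4096 matrix, the update adds roughly 0.4% of the original parameter count, which is consistent in spirit with the sub‑1% overhead reported above.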

