A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video


The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.


💡 Research Summary

The paper addresses the growing need for automated extraction of on-screen personal names from broadcast and social-media-native news videos. Because graphic layouts, typography, and platform-specific design patterns vary widely, manual indexing is infeasible, and a recent user survey found that 59% of viewers struggle to read names in fast-paced broadcasts. To tackle this problem, the authors introduce two main contributions: (1) the News Graphics Dataset (NGD), a curated collection of 1,500 annotated frames drawn from 300 videos covering local, international, and social-media news sources, and (2) the Accurate Name Extraction Pipeline (ANEP), a deterministic, modular system that combines object detection, OCR, named-entity recognition, and name clustering.

NGD provides 4,749 bounding-box annotations across six graphic categories (breaking news graphics, digital on-screen graphics, lower thirds, headlines, tickers, and other graphics). Frames are standardized to 640 × 640 pixels and augmented for brightness, exposure, and noise. A stratified 93%/4%/3% train/validation/test split is used, with perceptual hashing applied to remove near-duplicate frames and reduce data leakage.
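The paper does not state which perceptual-hash variant is used for deduplication, so the sketch below uses a simple average hash (aHash) over an 8 × 8 grayscale grid as one common, illustrative choice: frames whose hashes differ by only a few bits are treated as near-duplicates and dropped.

```python
# Illustrative average-hash (aHash) deduplication; the paper's exact
# perceptual-hash variant and distance threshold are not specified.

def average_hash(pixels):
    """Compute a 64-bit hash from an 8x8 grayscale grid (list of 8 rows of 8 ints)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:  # one bit per pixel: is it brighter than the frame mean?
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedupe(frames, threshold=5):
    """Keep a frame only if its hash differs from every kept hash by > threshold bits."""
    kept, hashes = [], []
    for name, grid in frames:
        h = average_hash(grid)
        if all(hamming(h, k) > threshold for k in hashes):
            kept.append(name)
            hashes.append(h)
    return kept
```

In practice the 8 × 8 grid would come from downscaling each sampled frame; here the grid is passed in directly to keep the sketch dependency-free.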

ANEP consists of five sequential stages. First, video frames are sampled at 1 FPS and deduplicated via hashing. Second, a YOLOv12 model trained from scratch on NGD detects graphic regions; this model achieves 95.8% mAP@0.5, 93.9% precision, and 93.5% recall, outperforming externally pretrained variants. Third, detected regions undergo contrast enhancement and adaptive thresholding before OCR with Tesseract; only outputs with confidence ≥ 0.6 are kept. Fourth, a BERT-large-based NER model fine-tuned on domain data extracts person names, achieving 92% F1, and is further filtered by heuristic rules (e.g., capitalization patterns). Fifth, extracted names are canonicalized using a combination of fuzzy string matching, Jaccard similarity, and contextual embeddings (sentence-BERT), producing a timeline of name occurrences. Each stage logs intermediate results, ensuring full traceability and auditability.
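The fifth stage can be sketched with stdlib tools only. The paper combines fuzzy matching, Jaccard similarity, and sentence-BERT embeddings, but does not give the weighting or thresholds; the sketch below assumes equal weights over a `difflib` fuzzy ratio and a character-bigram Jaccard score, and omits the embedding term entirely.

```python
from difflib import SequenceMatcher

def bigrams(s):
    """Set of lowercase character bigrams of a string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard overlap of the two strings' bigram sets."""
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb) if ba | bb else 1.0

def similar(a, b, threshold=0.75):
    # Equal weighting of fuzzy ratio and Jaccard is an assumption; ANEP also
    # uses sentence-BERT embeddings, which are left out of this sketch.
    score = 0.5 * SequenceMatcher(None, a.lower(), b.lower()).ratio() + 0.5 * jaccard(a, b)
    return score >= threshold

def canonicalize(names):
    """Greedy clustering: each name joins the first cluster it resembles,
    and the longest variant in each cluster is taken as the canonical form."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similar(name, cluster[0]):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return [max(c, key=len) for c in clusters]
```

This greedy pass merges OCR spelling variants ("Jon Smith" vs. "John Smith") into one entry, which is exactly the kind of string variation the results section identifies as ANEP's main error source.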

For comparison, two generative multimodal pipelines are implemented: Gemini 1.5 Pro and LLaMA 4 Maverick. Both receive base64-encoded frames and a structured prompt requesting only real-world personal names. Gemini attains the highest raw performance (precision 93.33%, recall 76.67%, F1 84.18%) and runs in 94.68 seconds per video, but its internal reasoning is a black box, preventing data lineage verification. LLaMA shows lower recall (50%) and higher variance, with a runtime of 140.18 seconds.
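The general request shape for these baselines is straightforward to sketch. The paper's exact prompt wording and each API's payload schema are not reproduced here; the function below only illustrates the pattern of base64-encoding a frame and pairing it with a names-only instruction.

```python
import base64

def build_vlm_request(frame_bytes):
    """Assemble a generic vision-language request dict (illustrative only;
    the actual Gemini/LLaMA payload schemas and prompt text differ)."""
    return {
        "image_b64": base64.b64encode(frame_bytes).decode("ascii"),
        "prompt": (
            "List only the real-world personal names visible in the on-screen "
            "graphics of this news frame. Return a JSON array of strings."
        ),
    }
```

Because the model's decoding is stochastic, two calls with this identical payload can return different name lists, which is the lineage problem the paper contrasts with ANEP's logged intermediate artifacts.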

ANEP’s overall name-extraction results are precision 79.92%, recall 74.44%, and F1 77.08%, with an average runtime of 542.15 seconds. The lower F1 compared with Gemini stems mainly from OCR-induced string variations and temporal misalignment, but ANEP provides deterministic outputs, complete intermediate artifacts, and reproducible processing—qualities essential for journalistic, legal, and regulatory contexts.
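The reported F1 scores follow directly from the precision and recall figures, since F1 is their harmonic mean; a quick check:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# ANEP:   precision 79.92, recall 74.44  ->  F1 ≈ 77.08
# Gemini: precision 93.33, recall 76.67  ->  F1 ≈ 84.18
```

Both values match the figures quoted in the paper to two decimal places.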

The authors discuss trade‑offs between raw accuracy and transparency. While generative models can marginally improve extraction scores, they sacrifice auditability and may hallucinate non‑existent names. The deterministic pipeline, by contrast, offers full data provenance, stable performance across unseen broadcast material, and the ability to diagnose failures at any stage.

In conclusion, the study delivers a rigorously evaluated, openly released dataset and a transparent extraction framework that sets a baseline for hybrid multimodal information extraction in modern news media. Future work is suggested to accelerate ANEP (e.g., lightweight YOLO variants, GPU‑accelerated OCR) and to explore optional multimodal reasoning modules that could boost accuracy without compromising traceability.

