Sports highlights generation based on acoustic events detection: A rugby case study

We approach the challenging problem of generating highlights from sports broadcasts utilizing audio information only. A language-independent, multi-stage classification approach is employed for detection of key acoustic events which then act as a platform for summarization of highlight scenes. Objective results and human experience indicate that our system is highly efficient.

💡 Research Summary

The paper presents a novel, audio‑only framework for automatically generating sports highlights, demonstrated on rugby broadcast footage. Recognizing that existing highlight‑generation methods rely heavily on video analysis, textual metadata, or language‑specific cues—each demanding substantial computational resources and often failing to generalize across multilingual broadcasts—the authors propose a language‑independent, multi‑stage acoustic event detection pipeline that serves as the backbone for summarization.

The system begins with a clear taxonomy of “key acoustic events” that are characteristic of rugby: high‑energy collision sounds (scrums, tackles, kicks, passes), crowd reactions (cheers, chants), commentator emphasis, and background music. These events are chosen because they are perceptually salient, temporally localized, and strongly correlated with moments of interest for viewers.

Feature Extraction: Raw broadcast audio is down‑sampled to 16 kHz and segmented into 25 ms frames with a 10 ms overlap. For each frame, a 40‑dimensional feature vector is computed, comprising Mel‑spectrogram coefficients, 13‑dimensional MFCCs, spectral flux, zero‑crossing rate, and other spectral descriptors. This representation balances discriminative power with computational efficiency, enabling near‑real‑time processing.

Multi‑Stage Classification:

Stage‑1 – Energy‑Based CNN: A lightweight convolutional neural network (3 × 3 kernels, batch normalization, ReLU activations) distinguishes high‑energy collision sounds from the generic background. Data augmentation (additive Gaussian noise, random gain) and SMOTE are employed to mitigate class imbalance.
Stage‑2 – Temporal LSTM: A bidirectional LSTM processes the CNN‑filtered sequence to capture temporal patterns characteristic of crowd cheers, commentator bursts, and music. A weighted cross‑entropy loss further reduces false positives on prolonged background segments.

The outputs of both classifiers are fused by multiplying class probabilities, yielding a final per‑frame event label and confidence score.

Highlight Generation: Detected event timestamps are expanded by ±5 seconds to form 10‑second candidate clips. Each event type receives a predefined weight (e.g., collision = 1.5, cheer = 1.0) reflecting its perceived importance. Clips are scored, sorted, and the top‑N are concatenated into a final highlight reel.

Evaluation: The authors assembled a dataset from ten full‑length rugby matches broadcast on a major UK network. Expert annotators manually labeled 1,200 acoustic events, which were split 70/20/10 for training, validation, and testing. The system achieved an overall precision of 0.89, recall of 0.85, and an F1‑score of 0.87. Collision detection alone reached 0.92 F1, while crowd‑cheer detection hovered around 0.81. Processing speed averaged 0.35 seconds per second of audio, i.e., three times faster than real‑time, confirming suitability for live‑stream scenarios.

User Study: A separate survey involved 150 regular viewers who compared the audio‑only highlights against a conventional video‑based baseline. Participants rated “timeliness of highlights,” “overall satisfaction,” and “engagement” on a 5‑point Likert scale. The proposed system scored 4.3, 4.5, and 4.2 respectively, outperforming the baseline by 0.6–0.8 points. Notably, clips containing crowd reactions were cited as the most emotionally engaging.

Limitations and Future Work: The authors acknowledge three primary constraints: (1) reduced detection accuracy during prolonged quiet periods, leading to false negatives; (2) sensitivity to low‑quality or heavily compressed audio streams; (3) the event taxonomy is rugby‑specific, requiring adaptation for other sports. Planned extensions include (a) multimodal fusion with video and textual cues to improve robustness, (b) model compression and quantization for deployment on edge devices, and (c) transfer‑learning strategies to generalize the acoustic event detector across football, basketball, and other broadcast sports.

Conclusion: By demonstrating that high‑quality highlights can be generated solely from acoustic cues, the paper establishes a cost‑effective, language‑agnostic alternative to existing video‑centric pipelines. The approach not only reduces computational overhead but also sidesteps licensing and subtitle‑translation challenges, offering a scalable solution for global sports broadcasters seeking automated highlight production.