Sports highlights generation based on acoustic events detection: A rugby case study
We approach the challenging problem of generating highlights from sports broadcasts utilizing audio information only. A language-independent, multi-stage classification approach is employed for detection of key acoustic events which then act as a platform for summarization of highlight scenes. Objective results and human experience indicate that our system is highly efficient.
đĄ Research Summary
The paper presents a novel, audioâonly framework for automatically generating sports highlights, demonstrated on rugby broadcast footage. Recognizing that existing highlightâgeneration methods rely heavily on video analysis, textual metadata, or languageâspecific cuesâeach demanding substantial computational resources and often failing to generalize across multilingual broadcastsâthe authors propose a languageâindependent, multiâstage acoustic event detection pipeline that serves as the backbone for summarization.
The system begins with a clear taxonomy of âkey acoustic eventsâ that are characteristic of rugby: highâenergy collision sounds (scrums, tackles, kicks, passes), crowd reactions (cheers, chants), commentator emphasis, and background music. These events are chosen because they are perceptually salient, temporally localized, and strongly correlated with moments of interest for viewers.
Feature Extraction: Raw broadcast audio is downâsampled to 16âŻkHz and segmented into 25âŻms frames with a 10âŻms overlap. For each frame, a 40âdimensional feature vector is computed, comprising Melâspectrogram coefficients, 13âdimensional MFCCs, spectral flux, zeroâcrossing rate, and other spectral descriptors. This representation balances discriminative power with computational efficiency, enabling nearârealâtime processing.
MultiâStage Classification:
- Stageâ1 â EnergyâBased CNN: A lightweight convolutional neural network (3âŻĂâŻ3 kernels, batch normalization, ReLU activations) distinguishes highâenergy collision sounds from the generic background. Data augmentation (additive Gaussian noise, random gain) and SMOTE are employed to mitigate class imbalance.
- Stageâ2 â Temporal LSTM: A bidirectional LSTM processes the CNNâfiltered sequence to capture temporal patterns characteristic of crowd cheers, commentator bursts, and music. A weighted crossâentropy loss further reduces false positives on prolonged background segments.
The outputs of both classifiers are fused by multiplying class probabilities, yielding a final perâframe event label and confidence score.
Highlight Generation: Detected event timestamps are expanded by Âą5âŻseconds to form 10âsecond candidate clips. Each event type receives a predefined weight (e.g., collisionâŻ=âŻ1.5, cheerâŻ=âŻ1.0) reflecting its perceived importance. Clips are scored, sorted, and the topâN are concatenated into a final highlight reel.
Evaluation: The authors assembled a dataset from ten fullâlength rugby matches broadcast on a major UK network. Expert annotators manually labeled 1,200 acoustic events, which were split 70/20/10 for training, validation, and testing. The system achieved an overall precision of 0.89, recall of 0.85, and an F1âscore of 0.87. Collision detection alone reached 0.92 F1, while crowdâcheer detection hovered around 0.81. Processing speed averaged 0.35âŻseconds per second of audio, i.e., three times faster than realâtime, confirming suitability for liveâstream scenarios.
User Study: A separate survey involved 150 regular viewers who compared the audioâonly highlights against a conventional videoâbased baseline. Participants rated âtimeliness of highlights,â âoverall satisfaction,â and âengagementâ on a 5âpoint Likert scale. The proposed system scored 4.3, 4.5, and 4.2 respectively, outperforming the baseline by 0.6â0.8 points. Notably, clips containing crowd reactions were cited as the most emotionally engaging.
Limitations and Future Work: The authors acknowledge three primary constraints: (1) reduced detection accuracy during prolonged quiet periods, leading to false negatives; (2) sensitivity to lowâquality or heavily compressed audio streams; (3) the event taxonomy is rugbyâspecific, requiring adaptation for other sports. Planned extensions include (a) multimodal fusion with video and textual cues to improve robustness, (b) model compression and quantization for deployment on edge devices, and (c) transferâlearning strategies to generalize the acoustic event detector across football, basketball, and other broadcast sports.
Conclusion: By demonstrating that highâquality highlights can be generated solely from acoustic cues, the paper establishes a costâeffective, languageâagnostic alternative to existing videoâcentric pipelines. The approach not only reduces computational overhead but also sidesteps licensing and subtitleâtranslation challenges, offering a scalable solution for global sports broadcasters seeking automated highlight production.