A Preprocessing Framework for Video Machine Vision under Compression
There has been a growing trend toward compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization methods focus on minimizing distortion according to human perceptual metrics, overlooking the distinct demands posed by machine vision systems. In this paper, we propose a video preprocessing framework tailored for machine vision tasks to address this challenge. The proposed method incorporates a neural preprocessor that retains information crucial for subsequent tasks, boosting rate-accuracy performance. We further introduce a differentiable virtual codec to impose rate and distortion constraints during training, while widely used standard codecs are applied directly at test time, so our solution can be easily deployed in real-world scenarios. We conducted extensive experiments evaluating our compression method on two typical downstream tasks with various backbone networks. The experimental results indicate that our approach saves over 15% of bitrate compared to the standard codec anchor.
💡 Research Summary
The paper addresses a practical yet under‑explored problem: how to preserve the performance of downstream machine‑vision tasks when video streams are heavily compressed for transmission from edge devices. Conventional video codecs such as H.264/AVC and H.265/HEVC are designed to minimize distortion measured by human‑centric metrics (e.g., PSNR, VMAF). Consequently, at low bitrates the visual information that neural networks rely on for tasks like action recognition or object tracking degrades dramatically, leading to a steep drop in accuracy.
To tackle this, the authors propose a two‑component framework. The first component is a learnable video pre‑processor, a convolutional neural network with two parallel branches. The temporal branch performs convolutions across the time dimension to capture inter‑frame motion cues, while the spatial branch extracts intra‑frame texture and color features. The two feature streams are fused via a conditional attention module and added back to the original frames through a residual connection, producing a “pre‑processed” video that is more robust to subsequent compression.
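As a concrete sketch of this two-branch design, a PyTorch version might look like the following. The layer widths, the gating form, and the class name `Preprocessor` are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Preprocessor(nn.Module):
    """Two-branch residual video pre-processor (illustrative sketch;
    channel counts and the attention design are assumptions)."""
    def __init__(self, ch=16):
        super().__init__()
        # Temporal branch: 3D conv across (time, H, W) for inter-frame motion cues.
        self.temporal = nn.Conv3d(3, ch, kernel_size=3, padding=1)
        # Spatial branch: 2D conv applied per frame for texture/color features.
        self.spatial = nn.Conv2d(3, ch, kernel_size=3, padding=1)
        # Attention gate over the fused features, then projection back to RGB.
        self.attn = nn.Sequential(nn.Conv3d(2 * ch, 1, kernel_size=1), nn.Sigmoid())
        self.out = nn.Conv3d(2 * ch, 3, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (B, 3, T, H, W)
        t = torch.relu(self.temporal(x))       # (B, ch, T, H, W)
        b, _, n, h, w = x.shape
        # Fold time into the batch so the 2D conv sees individual frames.
        s = torch.relu(self.spatial(x.transpose(1, 2).reshape(b * n, 3, h, w)))
        s = s.reshape(b, n, -1, h, w).transpose(1, 2)   # back to (B, ch, T, H, W)
        f = torch.cat([t, s], dim=1)           # fuse the two streams
        f = f * self.attn(f)                   # attention gating (broadcast over channels)
        return x + self.out(f)                 # residual connection to the input frames
```

The residual connection keeps the pre-processed video close to the input, so the output remains a valid video that any standard codec can consume.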
The second component is a differentiable virtual codec that mimics the essential operations of a real video codec (prediction, transform, quantization, inverse transform) using tensor operations. A quantization factor f_q, sampled randomly between 30 and 50, stands in for the standard QP. The virtual codec outputs a reconstructed residual, from which a distortion loss (MSE) and a rate loss (estimated bits‑per‑pixel using the factorized prior of Balle et al.) are computed.
Training optimizes a combined loss:
L = α (L_D + λ L_R) + L_Acc
where L_D is distortion, L_R is rate, L_Acc is the task‑specific accuracy loss (computed by fixed downstream models), α balances compression‑related loss against task loss, and λ controls the rate‑distortion trade‑off. Only the pre‑processor parameters are updated; the downstream networks and the virtual codec remain frozen.
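Put together, the objective can be sketched as follows; the alpha and lambda defaults are placeholders, not the paper's settings:

```python
import numpy as np

def rd_task_loss(recon, target, bpp, task_loss, alpha=1.0, lam=0.01):
    """Combined objective L = alpha * (L_D + lam * L_R) + L_Acc.

    L_D is the MSE between the virtual codec's reconstruction and the
    input, L_R is the estimated bits-per-pixel, and task_loss (L_Acc)
    comes from the frozen downstream model.  Default weights are
    placeholders for illustration.
    """
    l_d = float(np.mean((recon - target) ** 2))   # distortion term
    l_r = float(bpp)                              # rate term
    return alpha * (l_d + lam * l_r) + task_loss
```

Because only the pre-processor receives gradient updates, this loss is backpropagated through the frozen downstream network and virtual codec into the pre-processor alone.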
Experiments cover two representative machine‑vision tasks. For action recognition, the Kinetics‑400 dataset is used with six popular backbones (SlowOnly, SlowFast, C2D, I3D, Swin, TPN). For object tracking, the GOT‑10K dataset is evaluated with four trackers (KYS, DiMP, ATOM, PrDiMP). During testing, the pre‑processed videos are compressed with real open‑source encoders x264 and x265 under the “medium” preset and QP values 30, 35, 40, 45, 50.
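The summary does not give the exact encoder invocation; one plausible way to reproduce these test conditions is via ffmpeg's libx264/libx265 wrappers, sketched here as a command builder (the flag set is an assumption about the authors' setup, not taken from the paper):

```python
def encode_cmd(src, dst, codec="libx264", preset="medium", qp=30):
    """Build an ffmpeg command matching the stated test conditions
    (x264/x265, 'medium' preset, fixed QP).  Pass codec="libx265" for
    H.265; depending on the ffmpeg build, x265 may instead need
    '-x265-params qp=N' to pin the QP."""
    return ["ffmpeg", "-y", "-i", src,
            "-c:v", codec, "-preset", preset, "-qp", str(qp),
            dst]
```

Sweeping `qp` over 30, 35, 40, 45, 50 produces the rate points needed for the BD-Rate curves reported below.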
Results show consistent bitrate savings while maintaining or even improving accuracy. With H.264, the best backbone (SlowOnly) achieves a BD‑Rate of –17.6 % with top‑1 accuracy as the quality metric; with H.265 the best figure is –16.2 %. For tracking, the BD‑Rate computed on the AUC metric reaches –17.5 %. Moreover, perceptual quality measured by VMAF also improves modestly, indicating that the pre‑processor does not sacrifice human‑viewable quality. An ablation where downstream models are fine‑tuned directly on compressed data yields some gains but still falls short of the proposed pre‑processing approach, confirming that the pre‑processor effectively mitigates the mismatch between compression artifacts and task‑specific feature extraction.
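The BD-Rate figures above follow the standard Bjontegaard method, here with a task metric (top-1 accuracy or AUC) in place of PSNR. A sketch of that computation (the standard formulation, not the authors' exact script):

```python
import numpy as np

def bd_rate(rate_anchor, metric_anchor, rate_test, metric_test):
    """Bjontegaard-delta bitrate (%) between two rate-metric curves.

    Fits cubic polynomials of log-rate as a function of the quality
    metric and integrates their difference over the overlapping metric
    range; a negative value means the test curve needs fewer bits for
    the same metric level.
    """
    p_a = np.polyfit(metric_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(metric_test, np.log(rate_test), 3)
    lo = max(min(metric_anchor), min(metric_test))
    hi = min(max(metric_anchor), max(metric_test))
    # Average log-rate difference over the shared metric interval.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

For example, a test curve that hits every metric level at exactly half the anchor's bitrate yields a BD-Rate of –50 %.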
The paper acknowledges limitations: the pre‑processor is trained separately for each backbone and codec, which may hinder universal deployment, and the virtual codec, while differentiable, cannot capture all intricacies of real entropy coding and motion compensation, leading to minor train‑test gaps. Future work could explore a single pre‑processor that jointly adapts to multiple backbones and codecs, or integrate more sophisticated learned entropy models into the virtual codec for tighter rate‑distortion estimation.
In summary, the authors deliver a practical end‑to‑end pipeline that couples a neural video pre‑processor with a differentiable virtual codec, enabling joint optimization of compression efficiency and downstream machine‑vision accuracy. Their extensive experiments demonstrate over 15 % bitrate reduction without degrading task performance, offering a compelling solution for real‑world edge video analytics where bandwidth is scarce but AI inference quality is critical.