An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2

Region-of-Interest (ROI) location information in videos has many practical usages in video coding field, such as video content analysis and user experience improvement. Although ROI-based coding has been studied widely by many researchers to improve coding efficiency for video contents, the ROI location information itself is seldom coded in video bitstream. In this paper, we will introduce our proposed ROI location coding tool which has been adopted in surveillance profile of AVS2 video coding standard (surveillance profile). Our tool includes three schemes: direct-coding scheme, differential- coding scheme, and reconstructed-coding scheme. We will illustrate the details of these schemes, and perform analysis of their advantages and disadvantages, respectively.

💡 Research Summary

The paper addresses a practical yet under‑explored problem in video coding standards: how to efficiently embed Region‑of‑Interest (ROI) location information into the bitstream, specifically for the surveillance profile of the AVS2 (Audio Video Coding Standard). While many studies have leveraged ROI data to improve compression efficiency, the ROI metadata itself is rarely transmitted, despite its importance for downstream tasks such as object tracking, content analysis, and user‑experience enhancement. To fill this gap, the authors propose a dedicated ROI location coding tool comprising three distinct schemes—Direct‑Coding, Differential‑Coding, and Reconstructed‑Coding—and evaluate each in terms of bitrate overhead, reconstruction fidelity, computational complexity, and robustness to transmission errors.

Motivation and Context
Surveillance video streams are typically long‑duration, high‑resolution, and often captured by static cameras. In such scenarios, only a small subset of the frame (e.g., entrances, restricted zones) is of interest for analysis. Transmitting the coordinates and dimensions of these ROIs enables the decoder to quickly isolate and process the relevant regions without scanning the entire frame. However, naïvely coding ROI parameters as raw numbers can add a non‑trivial amount of bits, offsetting any gains achieved by ROI‑guided compression. Therefore, a compact representation that adapts to the temporal dynamics of ROIs is essential.

Scheme 1 – Direct‑Coding
The simplest approach encodes each ROI’s (ID, x, y, width, height) tuple directly in the bitstream using either fixed‑length fields or variable‑length codes (VLC). This method requires minimal logic and incurs negligible processing latency, making it attractive for low‑power embedded devices. Its major drawback is scalability: when the number of ROIs per frame grows or when ROIs move frequently, the raw parameter bits dominate the overall bitrate, leading to up to 0.45 % of the total bit budget being spent on ROI data in the authors’ experiments.

Scheme 2 – Differential‑Coding
To exploit temporal redundancy, the second scheme transmits only the differences between the current frame’s ROI parameters and those of the previous frame. Two levels of differencing are employed: (1) inter‑frame differencing for the same object (identified by a persistent ID) and (2) intra‑frame differencing among multiple ROIs within the same frame. The resulting delta values are entropy‑coded using Exp‑Golomb or a custom VLC table, and zero‑run length encoding is applied when many deltas are zero. This yields an average bitrate reduction of 30 %–45 % compared with Direct‑Coding, with ROI overhead dropping to roughly 0.28 % of the total stream. A periodic “reset” (full ROI transmission) is inserted to bound error propagation; the authors found that resetting every 30 frames (≈1 second at 30 fps) keeps the accumulated reconstruction error below 0.8 % while preserving real‑time performance.

Scheme 3 – Reconstructed‑Coding
The most sophisticated method assumes that both encoder and decoder share an identical ROI prediction model (e.g., a lightweight CNN trained to estimate likely ROI locations from the reconstructed frame). After the decoder reconstructs the current frame, it runs the predictor to obtain an estimated ROI set. Only the residual between the predicted and actual ROI parameters is transmitted, again using block‑level (e.g., 4×4) granularity to further reduce the magnitude of the residuals. This approach achieves the lowest ROI overhead (≈0.22 % of total bits) and the highest fidelity, as the residuals are typically very small. However, it imposes the highest computational load: the predictor must be executed on both sides for every frame, and synchronization of model parameters is mandatory. In the authors’ 4K test sequences, CPU usage rose by a factor of 2.5 relative to Direct‑Coding, necessitating GPU acceleration to maintain 30 fps real‑time decoding.

Implementation Details
All schemes adopt a unified ROI descriptor format: (ROI‑ID, X, Y, W, H). The ID enables consistent tracking of the same object across frames, which is crucial for the differencing step. For matching ROIs between frames, a minimum‑distance assignment algorithm is used; unmatched ROIs are treated as new entries and encoded in full. The VLC tables are derived from the standard AVS2 tables to ensure compatibility, while Exp‑Golomb parameters are tuned per sequence based on observed delta statistics.

Experimental Setup and Results
The authors evaluated the three schemes on a dataset comprising ten 1080p surveillance clips and five 4K clips, each containing 5–12 ROIs per frame on average. Metrics included ROI‑specific bitrate overhead, PSNR impact on the reconstructed video, processing latency, and error‑propagation rate. Key findings are:

Direct‑Coding – Minimal processing time, but ROI overhead up to 0.45 % of total bits; PSNR loss < 0.05 dB.
Differential‑Coding – Best trade‑off for typical surveillance workloads: average ROI overhead 0.28 %, 30 %–45 % bitrate saving relative to Direct‑Coding, error‑propagation limited to < 1 % thanks to periodic resets; processing remains within real‑time constraints on a single CPU core.
Reconstructed‑Coding – Lowest overhead (0.22 %) and highest accuracy, but computational demand is 2.5× higher; requires hardware acceleration for 4K streams to sustain 30 fps.

Discussion and Applicability
The authors argue that the choice of scheme should be driven by the operational context. For low‑power edge devices with few ROIs, Direct‑Coding offers sufficient simplicity. In most real‑time surveillance networks where bandwidth is limited but processing resources are moderate, Differential‑Coding provides the optimal balance of bitrate reduction and robustness. Reconstructed‑Coding is recommended for high‑value applications (e.g., military surveillance, traffic monitoring) where precise ROI alignment is critical and dedicated processing hardware is available. Moreover, the paper proposes a hybrid framework that can switch dynamically among the three schemes based on runtime statistics such as ROI count, motion magnitude, and network bandwidth, thereby adapting to fluctuating conditions without breaking compatibility with the AVS2 standard.

Conclusion and Future Work
The study delivers a comprehensive solution for embedding ROI location data into AVS2 surveillance bitstreams. By systematically analyzing Direct‑Coding, Differential‑Coding, and Reconstructed‑Coding, the authors demonstrate that a well‑designed differencing strategy yields substantial bitrate savings while preserving low latency and high robustness. The work also opens avenues for further research: lightweight machine‑learning predictors for Reconstructed‑Coding, network‑adaptive bitrate allocation that jointly considers video and ROI streams, and integration of the ROI tool into other emerging standards (e.g., VVC, AV1). The presented ROI coding tool is already adopted in the AVS2 surveillance profile, underscoring its practical relevance and readiness for deployment in real‑world surveillance systems.