Feature Compression for Machines with Range-Based Channel Truncation and Frame Packing

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

This paper proposes a method that enhances the compression performance of the current model under development for the upcoming MPEG standard on Feature Coding for Machines (FCM). This standard aims to provide interoperable compressed bitstreams of features in the context of split computing, i.e., when the inference of a large computer vision neural-network (NN)-based model is split between two devices. Intermediate features can consist of multiple 3D tensors that can be reduced and entropy coded to limit the bandwidth required for such transmission. In the envisioned design for the MPEG-FCM standard, intermediate feature tensors may be reduced using neural layers before being converted into 2D video frames that can be coded using existing video compression standards. This paper introduces an additional channel truncation and packing method that enables the system to preserve the relevant channels, depending on the statistics of the features at inference time, while preserving the computer vision task performance at the receiver. Implemented within the MPEG-FCM test model, the proposed method yields an average bitrate reduction of 10.59% at equivalent accuracy on multiple computer vision tasks and datasets.


💡 Research Summary

The paper addresses the emerging MPEG‑FCM (Feature Coding for Machines) standard, which aims to compress intermediate feature tensors generated by split‑neural‑network inference so that they can be transmitted efficiently between a resource‑constrained front‑end device and a more powerful back‑end processor. Existing FCM pipelines first apply a learned feature‑reduction network to shrink the tensor depth, then reshape the reduced tensor into a 2‑D video frame (typically 10‑bit monochrome) and encode it with a conventional video codec such as H.266/VVC. While this approach leverages mature video compression tools, it still transmits many channels that contain only low‑activation noise and therefore contribute little to the downstream computer‑vision task.
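As an illustrative sketch of the packing stage described above, the reduced channels can be tiled into a single 2-D monochrome frame before being handed to a conventional video encoder. The layout below (row-major tiling with a fixed number of columns) is an assumption for illustration, not the normative MPEG-FCM frame layout:

```python
import numpy as np

def pack_channels_to_frame(tensor: np.ndarray, cols: int) -> np.ndarray:
    """Tile N channels of shape (H, W) into one 2-D frame.

    Illustrative layout only: channels are placed row-major into a
    grid with `cols` tiles per row; unused tiles stay zero-filled.
    """
    n, h, w = tensor.shape
    rows = -(-n // cols)  # ceiling division: number of tile rows needed
    frame = np.zeros((rows * h, cols * w), dtype=tensor.dtype)
    for i in range(n):
        r, c = divmod(i, cols)
        frame[r * h:(r + 1) * h, c * w:(c + 1) * w] = tensor[i]
    return frame
```

In the actual pipeline, the resulting frame would additionally be quantized (e.g., to 10-bit samples) before H.266/VVC encoding; that step is omitted here for brevity.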

The authors propose a lightweight, inference‑time method that dynamically identifies and removes such low‑activation channels before the video encoding stage, and then packs the remaining active channels into a more compact frame layout. The key steps are:

  1. Channel Range Extraction – For each channel c of the post‑reduction tensor, compute the range rᶜ = maxᶜ − minᶜ. The average range across all N channels is multiplied by a scaling factor α (0 < α < 1) to obtain a threshold: T = α · (1/N) Σᶜ rᶜ. Channels whose range falls below T are considered low‑activation and are truncated before packing.
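The range-based thresholding step above can be sketched as follows. The function name and the default α value are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def truncate_channels(tensor: np.ndarray, alpha: float = 0.5):
    """Sketch of range-based channel truncation (assumed interface).

    tensor: post-reduction feature tensor of shape (N, H, W).
    alpha:  scaling factor in (0, 1) applied to the mean channel range.
    Returns the retained channels and their indices; the indices would
    need to be signaled so the receiver can restore channel positions.
    """
    # Per-channel dynamic range over the spatial dimensions: max - min.
    ranges = tensor.max(axis=(1, 2)) - tensor.min(axis=(1, 2))
    # Threshold T = alpha * mean range across all N channels.
    threshold = alpha * ranges.mean()
    # Keep only channels whose range reaches the threshold.
    keep = np.flatnonzero(ranges >= threshold)
    return tensor[keep], keep
```

Channels dominated by low-activation noise have a small max − min range, fall below T, and are dropped, which is what allows the subsequent frame packing to produce a smaller frame and hence a lower bitrate.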
