A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Model compression is essential for deploying large Computer Vision models on embedded devices. However, static optimization techniques (e.g. pruning, quantization, etc.) neglect the fact that different inputs have different complexities and thus require different amounts of computation. Dynamic Neural Networks allow the amount of computation to be conditioned on the specific input. The current literature on the topic is extensive but fragmented. We present a comprehensive survey that synthesizes and unifies existing Dynamic Neural Networks research in the context of Computer Vision. Additionally, we provide a logical taxonomy based on which component of the network is adaptive: the output, the computation graph or the input. Furthermore, we argue that Dynamic Neural Networks are particularly beneficial in the context of Sensor Fusion for better adaptivity, noise reduction and information prioritization. We present preliminary works in this direction. We complement this survey with a curated repository listing all the surveyed papers, each with a brief summary of the solution and its code base when available: https://github.com/DTU-PAS/awesome-dynn-for-cv .


💡 Research Summary

This survey provides a comprehensive overview of Dynamic Neural Networks (DNNs) as a solution to the inefficiencies of static model compression techniques in computer vision, and it extends the discussion to multi‑modal sensor fusion. The authors begin by highlighting the limitation of conventional compression methods—pruning, quantization, knowledge distillation, etc.—which treat every input identically despite the wide variance in input complexity. Inspired by the human visual system, DNNs introduce elasticity into the network, allowing the amount of computation to be conditioned on the difficulty of each sample.

To bring order to a rapidly expanding literature, the paper proposes a taxonomy that classifies DNNs according to the component that adapts: (i) Early Exits – auxiliary classifiers are attached to intermediate layers; a confidence, entropy, or margin‑based policy decides whether to terminate inference early. The survey catalogs key architectures (BranchyNet, MSDNet, Shallow‑Deep Net, etc.), loss‑weighting schemes, self‑distillation, out‑of‑distribution detection, and NAS‑based placement of exits. It also points out the lack of a unified benchmark, which hampers fair quantitative comparison.
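The confidence-based exit policy described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any specific surveyed architecture: the `exit_heads` are hypothetical stand-ins for auxiliary classifiers attached at increasing depths, and the threshold value is hand-picked for the toy example.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_inference(x, exit_heads, threshold=0.8):
    """Run a cascade of exit heads; stop at the first sufficiently confident one."""
    for depth, head in enumerate(exit_heads):
        probs = softmax(head(x))
        if probs.max() >= threshold:  # confidence-based exit policy
            return int(probs.argmax()), depth
    return int(probs.argmax()), len(exit_heads) - 1  # fall through to final exit

# Toy stand-ins for intermediate classifiers: deeper heads sharpen the logits.
heads = [lambda f: f, lambda f: 3.0 * f, lambda f: 10.0 * f]

easy_pred, easy_depth = early_exit_inference(np.array([0.0, 0.0, 3.0]), heads)
hard_pred, hard_depth = early_exit_inference(np.array([0.2, 0.1, 0.4]), heads)
```

An "easy" sample (clearly separated logits) terminates at the first exit, while an ambiguous one falls through to the deepest head, which is exactly the input-conditioned compute saving the survey describes.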

(ii) Dynamic Routing – the network’s computational graph is built on‑the‑fly or routes inputs to different expert modules. The authors review Mixture‑of‑Experts, Switch‑Transformers, dynamic convolutions, and various gating mechanisms (hard binary routing vs. soft weighted routing). Emphasis is placed on how routing policies can be learned jointly with the main task while minimizing the overhead of the routing decision itself.
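Hard top-1 routing, the simplest of the gating mechanisms mentioned above, can be sketched as follows. The experts and gate weights here are hand-set illustrative values, not a learned Mixture-of-Experts; in practice the gate is trained jointly with the task loss.

```python
import numpy as np

def top1_route(x, gate_w, experts):
    """Hard top-1 routing: a linear gate picks exactly one expert per input."""
    scores = x @ gate_w          # one gating logit per expert
    k = int(np.argmax(scores))   # hard (binary) routing decision
    return experts[k](x), k

# Two hypothetical experts and a hand-set gate matrix (illustrative values).
experts = [lambda f: f + 1.0, lambda f: f * 2.0]
gate_w = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])

y0, k0 = top1_route(np.array([2.0, 0.0]), gate_w, experts)  # gate favors expert 0
y1, k1 = top1_route(np.array([0.0, 2.0]), gate_w, experts)  # gate favors expert 1
```

Only the selected expert is evaluated, so per-input FLOPs stay constant as the expert pool grows; soft weighted routing would instead mix all expert outputs by the softmax of the gate scores.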

(iii) Token Skimming – primarily relevant to Vision Transformers, this line of work reduces the number of tokens processed at each layer. Techniques include token dropping, hierarchical token selection, token merging, and per‑block importance scoring. By discarding or aggregating less informative tokens, these methods achieve substantial memory and FLOP savings for high‑resolution image and video tasks.
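The token-dropping idea can be illustrated with a small sketch, assuming per-token importance scores are already available (in real Vision Transformers they typically come from attention statistics or a learned predictor; here they are hand-set).

```python
import numpy as np

def skim_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the top-scoring fraction of tokens, in their original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # most important, order-preserved
    return tokens[keep]

# 8 dummy tokens (one row each) with hand-set importance scores (illustrative).
tokens = np.arange(8, dtype=float).reshape(8, 1)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6])
kept = skim_tokens(tokens, scores, keep_ratio=0.5)  # 4 of 8 tokens survive
```

Because self-attention cost is quadratic in the number of tokens, halving the token count at a layer roughly quarters that layer's attention FLOPs, which is where the memory and compute savings cited above come from.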

After establishing the three pillars, the survey turns to sensor fusion, arguing that dynamic mechanisms are especially beneficial when multiple modalities (RGB, depth, LiDAR, radar, etc.) are combined. Dynamic routing can prioritize reliable modalities under adverse conditions, early exits can provide rapid responses for easy scenes, and token skimming can filter noisy sensor streams. The authors organize 163 papers (published between 2016 and September 2025) into the taxonomy, covering applications such as object detection, semantic segmentation, and autonomous navigation. For each work, a brief description, dataset, and code availability are listed in an accompanying GitHub repository (https://github.com/DTU-PAS/awesome-dynn-for-cv), facilitating reproducibility.
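The modality-prioritization idea can be sketched as soft gating over per-modality features. This is a generic illustration, assuming the gate logits come from a small reliability-estimation network; here they are hand-set to simulate a degraded LiDAR stream.

```python
import numpy as np

def gated_fusion(feats, gate_logits):
    """Soft modality gating: convex combination of per-modality features."""
    z = gate_logits - gate_logits.max()
    w = np.exp(z) / np.exp(z).sum()        # softmax over modalities
    fused = sum(wi * f for wi, f in zip(w, feats))
    return fused, w

rgb = np.array([1.0, 0.0])
lidar = np.array([0.0, 1.0])  # pretend this stream is currently noisy
# In practice the logits come from a gating network; hand-set here (assumption).
fused, w = gated_fusion([rgb, lidar], np.array([2.0, -2.0]))
```

Under adverse conditions the gate drives the weight of the unreliable modality toward zero, so the fused feature is dominated by the trustworthy sensor, which is the adaptivity argument made above.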

The survey also reviews related prior surveys, noting that earlier works either focus on theoretical unification across domains or on natural language processing, whereas this paper concentrates exclusively on computer‑vision‑centric literature and introduces a dedicated sensor‑fusion chapter—an area previously under‑explored.

In the discussion, the authors identify open challenges: (1) the need for standardized benchmarks that capture both accuracy and dynamic latency across diverse hardware; (2) hardware support for on‑device gating and routing, especially on low‑power edge accelerators; (3) stability and predictability of dynamic policies in safety‑critical systems; and (4) the scarcity of long‑term, real‑world sensor‑fusion experiments that validate robustness under varying environmental conditions.

Overall, the paper serves as a valuable reference for researchers interested in making vision models more adaptive and efficient. By systematically categorizing early‑exit, routing, and token‑skimming approaches and linking them to multi‑modal fusion tasks, the survey clarifies the state of the art, highlights gaps, and outlines promising directions for future work.

