DANCE: Dynamic 3D CNN Pruning: Joint Frame, Channel, and Feature Adaptation for Energy Efficiency on the Edge


Authors: Mohamed Mejri, Ashiqur Rasul, Abhijit Chatterjee

School of Electrical and Computer Engineering, Georgia Tech, Atlanta, GA, USA
mmejri3@gatech.edu, arasul6@gatech.edu, abhijit.chatterjee@ece.gatech.edu

Abstract

Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but they fail to adapt their computation to the complexity of individual input samples and therefore cannot minimize energy consumption dynamically. In this research, we propose DANCE, a fine-grained, input-aware, dynamic pruning framework for 3D CNNs that maximizes power efficiency with negligible to zero impact on performance. The proposed approach has two steps. In the first step, called activation variability amplification (AVA), the 3D CNN is retrained to increase the variance of the magnitudes of neuron activations across the network, facilitating pruning decisions across diverse CNN input scenarios. In the second step, called adaptive activation pruning (AAP), a lightweight activation controller network is trained to dynamically prune frames, channels, and features of the 3D convolutional layers of the network (differently for each layer), based on statistics of the outputs of the first layer of the network. Our method achieves substantial savings in multiply-accumulate (MAC) operations and memory accesses by introducing sparsity within convolutional layers. Hardware validation on the NVIDIA Jetson Nano GPU and the Qualcomm Snapdragon 8 Gen 1 platform demonstrates respective speedups of 1.37× and 2.22×, achieving up to 1.47× higher energy efficiency compared to the state of the art.¹
1. Introduction

Convolutional neural networks (CNNs) are widely used in computer vision for both 2D and 3D data, i.e., images and videos respectively, in tasks such as gesture classification, action recognition, video caption generation, object detection and tracking, and semantic segmentation. However, CNNs are computationally expensive, with costs that grow with network depth and with the move from 2D (image [27, 31]) to 3D (video or point cloud [33, 38]) processing. To address this, structured pruning has emerged as a promising solution [5, 9, 19, 20], reducing both memory footprint and computational requirements. Pruning can also be applied dynamically, removing parameters adaptively depending on the input [4, 7, 12, 22, 26, 35, 40]. However, dynamic pruning has so far been applied mostly to 2D CNNs, where the focus is largely on coarse strategies such as channel pruning or layer removal; these techniques, while effective, remain limited in granularity and in the overall scope of their power efficiency.

¹Code and experimental data of this research work will be published after acceptance of the article.

In this work, we propose DANCE, a two-step, fine-grained, input-aware pruning framework for 3D convolutional neural networks that dynamically removes unnecessary frames, channels, and features inside the intermediate 4D (height × width × channel × frame) activation tensors of the hidden layers of the network. The first step, referred to as activation variability amplification (AVA), consists of training the network while increasing the variance of activation magnitudes across the frames, channels, and features inside the network. This variance amplification facilitates later pruning by reinforcing critical neurons and suppressing redundant ones.
The second step, adaptive activation pruning (AAP), introduces a lightweight activation controller network that dynamically prunes the frames, channels, and features corresponding to different hidden layers of the network for each input sample, based on thresholds applied to the aggregated outputs (representing statistics) of the first layer of the network. This enables the 3D CNN to skip unnecessary neurons dynamically, allowing each input sample to trigger a tailored pruning pattern across the frames, channels, and features of each hidden layer. Computational savings arise from skipping the operations tied to pruned activations, greatly reducing MAC operations and memory accesses. Additionally, removing noisy or irrelevant activations can enhance accuracy by filtering out misleading features. The sparsity of channel and feature activations varies both from frame to frame and across different convolutional layers in a 3D CNN.

The key contributions of this paper are:
(1) We propose a novel input-aware, adaptive, fine-grained pruning framework for 3D convolutional neural networks, capable of dynamically pruning frames, channels, and features in the 4D activation tensors of the hidden layers of the network, on a per-layer basis, based on statistics of the outputs of the first layer of the network.
(2) A two-step adaptation process is used: (a) activation variability amplification (AVA), which increases activation variance across the 4D activation tensor dimensions without degrading network performance, and (b) adaptive activation pruning (AAP), which employs a lightweight network to predict input-aware pruning thresholds for frame, channel, and feature pruning, allowing each input sample to trigger a tailored pruning pattern.
(3) The proposed method introduces structured sparsity to reduce compute and memory overhead while maintaining (and sometimes improving) accuracy compared to prior work.
We demonstrate the versatility of our approach on an NVIDIA Jetson Nano GPU and a Snapdragon 8 Gen 1 (Samsung S22) mobile CPU. By leveraging custom Neon SIMD kernels, we achieve computational speedups of 1.37× and 2.22×, respectively, with up to 1.47× higher energy efficiency over optimized baselines.

2. Related Works

To develop input-adaptive neural networks with dynamic computational graphs, researchers have approached the problem from three angles: sample-wise, spatially, and temporally adaptive. Sample-wise adaptive models adapt their computational graphs according to the complexity of individual input instances, whereas spatially and temporally adaptive neural networks modify and optimize themselves along the spatial and temporal dimensions. Layer and channel skipping [26, 35, 40] is a popular technique to avoid unnecessary computation in neural networks; it involves adding gating or halting mechanisms driven by an auxiliary policy network. BlockDrop [39] proposed a policy-gradient reinforcement learning (REINFORCE) [11, 36, 44] algorithm for block selection within residual neural network architectures. The authors of Conditional Deep Learning [26] add linear classification heads at the end of each convolution block, creating a cascaded architecture for early-exit neural networks. Class-aware channel pruning methods [1, 14] dynamically reduce the computational burden by selecting the important kernels based on class-aware saliency scores. Furthermore, a unified static and dynamic pruning framework has been proposed in [6], with a differentiable gating and channel selection mechanism based on the Gumbel sigmoid trick [13, 21]. These optimization techniques can also be extended to 3D convolutional neural networks to make their relatively demanding computations more efficient.
Few research works have addressed static compression and pruning of 3D neural networks [25, 41], and most of these investigations lack any notion of adaptation in their performance optimization schemes. Li et al. [18] formulated an adaptive policy to optimize the use of 3D convolutional kernels, in which a reinforcement-learning-based selection network governs the usage of frames and convolutional filters. Additionally, Chen et al. [3] proposed a pruning technique based on converting convolutional kernels along the temporal axis to frequency-domain representations, reducing computational complexity by removing insignificant frequency components.

3. Overview

The framework of the proposed dynamic 3D CNN pruning approach is given in Figure 1. Figure 1 (top) shows the input frames (video) to our system, which are processed by the first convolutional layer (3D Conv 1) of the network. The outputs of the first layer are passed to an Adaptive Activation Pruning (AAP) Controller module. The AAP controller generates control signals for the active pruning of frames, channels, and features of all subsequent 3D convolutional layers in the network, as described below.

An expanded view of the "3D Conv 4" layer is shown in Figure 1, serving as the reference structure for all subsequent layers throughout the network. The inputs to this layer consist of C_in channels, with each H × W feature map corresponding to the respective channel filter, replicated over T temporal frames of video input, forming a 4D tensor as illustrated in box (1). Box (2) of Figure 1 shows the unfolded "flattened" representation (matrix) of the 4D tensor above. The rows and columns of this matrix are modulated by the AAP controller, as shown in box (3) of Figure 1. Frame pruning eliminates one or more complete frames of the matrix (H_i × W_i rows, 1 ≤ i ≤ T; box (3), top).
Channel pruning eliminates complete columns of the matrix, while feature pruning eliminates the corresponding weights of feature maps across all channels of the network, differently for different time frames. The rearranged tensor is shown in box (4) of Figure 1. The pruned tensor is subsequently convolved with the kernels of the current layer, 3D Conv 4, to produce the output, shown as the 4D tensor in box (5), which serves as the input to the next layer in the network. Computational savings are achieved during convolution by skipping the operations associated with pruned frames, channels, and features.

Figure 1. Illustration of 4D tensor pruning in intermediate convolution layers. Sparsity in the activation tensor is introduced at three levels by dynamic pruning: frames (dotted in red), channels (shaded in green), and features (lined in black and white).

Based on the framework described in Figure 1, the dynamic pruning approach proceeds in two steps.

Step (1): To facilitate dynamic pruning, it is necessary to amplify the effects of diverse network inputs (video frames across time, images) on the outputs of all layers of the 3D convolution network of Figure 1. A novel 3D CNN training method called activation variability amplification (AVA) is used to increase the variance of the distributions of the magnitudes of neuron activations across the frames, channels, and features of the layers of the network of Figure 1. After model training, the post-AVA magnitude distribution exhibits higher variation across all corresponding 4D tensor dimensions. This induced variation enables clearer separation between relevant and irrelevant frames, channels, and features for more effective dynamic pruning.

Step (2): The outputs of the first layer of the network are passed to an adaptive activation pruning (AAP) controller module.
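The flattening of box (2) and the three pruning granularities can be sketched with a toy NumPy example. All shapes and pruning choices below are illustrative placeholders, not the paper's actual configuration:

```python
import numpy as np

# Toy 4D activation tensor: T frames, C_in channels, H x W features.
T, C_in, H, W = 4, 8, 6, 6
rng = np.random.default_rng(0)
X = rng.standard_normal((T, C_in, H, W))

# Box (2): unfold each frame into an (H*W) x C_in sub-matrix and stack
# the sub-matrices row-wise, giving a (T*H*W) x C_in "flattened" matrix.
X_flat = X.transpose(0, 2, 3, 1).reshape(T * H * W, C_in)

# Frame pruning removes the H*W rows belonging to a pruned frame t.
keep_frames = np.array([True, False, True, True])   # e.g. drop frame 1
row_mask = np.repeat(keep_frames, H * W)
X_frame_pruned = X_flat[row_mask]                   # (3*H*W) x C_in

# Channel pruning removes whole columns; feature pruning would remove
# individual rows (feature positions) within the kept frames.
keep_channels = np.ones(C_in, dtype=bool)
keep_channels[[2, 5]] = False                       # e.g. drop 2 channels
X_pruned = X_frame_pruned[:, keep_channels]         # (3*H*W) x (C_in - 2)
print(X_pruned.shape)
```

Only the surviving rows and columns participate in the subsequent convolution, which is where the MAC and memory-access savings come from.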
The AAP controller eliminates frames, channels, and features of individual layers of the network based on processing of the outputs of the first layer and a generated threshold for each pruned element (inspired by the component retention criterion in Principal Component Analysis (PCA) [23], adapted here to operate on the activation magnitudes described in Step 1). It is worth noting that only the parameters of the adaptive activation pruning controller are learned in the second step (refer to Figure 1), while all other parameters are kept frozen. In the following, Steps (1) and (2) are discussed in Sections 4 and 5, respectively.

4. Activation Variability Amplification (AVA)

Figure 2 describes the proposed activation variability amplification (AVA) training procedure of Step 1 discussed in Section 3. Inspired by the Convolutional Block Attention Module (CBAM) [37], which enhances the representational power of 2D convolutional neural networks through channel and feature attention, the AVA module extends this concept by applying three hierarchical attention modules: frame-, channel-, and feature-based attention. Unlike CBAM, our approach serves a different purpose: to quantify and adaptively re-weight neuron activation variability across the frame, channel, and feature dimensions. These attention modules re-weight the frames, channels, and features within the activation tensor, which increases their magnitude variability.

Frame Attention Module: Consider the "flattened" 4D tensor X shown in box (2) of Figure 1. This is interchangeably represented as a matrix X[] with sub-matrices x_1[], x_2[], ..., x_T[]. Each sub-matrix x_t[j,k] has dimension H_t * W_t by C_in (rows by columns), corresponding to frame t, 1 ≤ t ≤ T, as shown in the figure. Consider a trainable vector W_FR of dimension T, W_FR = [w_FR(1), w_FR(2), ..., w_FR(T)].
If we assume that H_t equals a fixed height H and W_t equals a fixed width W for all 1 ≤ t ≤ T, we compute:

    L_{FR}(t) = \frac{|w_{FR}(t)|}{C \cdot H \cdot W} \sum_{j=1}^{H \cdot W} \sum_{k=1}^{C} |x_t[j,k]|    (1)

The variance of L_FR(t) over all 1 ≤ t ≤ T is denoted σ²_FR. Maximizing σ²_FR during training helps in determining the relevance of different frames relative to one another on a per-input basis. The sub-matrices x_t[], 1 ≤ t ≤ T, are weighted by the scalars w_FR(t) (represented in tensor form as X_FR = ReLU(X × W_FR) in Figure 2) and passed on to the channel attention module.

Channel Attention Module: The weighted matrices x'_t[j,k] = ReLU(w_FR(t) * x_t[j,k]), 1 ≤ t ≤ T, are forwarded to a channel attention module, where channel-wise modulation is performed using the trainable vector W_CH = [w_CH(1), w_CH(2), ..., w_CH(C_in)]. A frame-dependent, channel-wise quantity L_CH(t,c), 1 ≤ t ≤ T, 1 ≤ c ≤ C_in, is computed as:

    L_{CH}(t, c) = \frac{|w_{CH}(c)|}{H \cdot W} \sum_{j=1}^{H \cdot W} |x'_t[j, c]|    (2)

The variance of L_CH(t,c) over all 1 ≤ t ≤ T, 1 ≤ c ≤ C_in, is computed as σ²_CH. Maximizing σ²_CH during training helps in determining the relevance of different channels relative to one another on a per-input basis.

Figure 2. Overview of the Activation Variability Amplification (AVA) mechanism. Variances along the frame (L_FR), channel (L_CH), and feature (L_FE) dimensions are aggregated to determine the total variance σ²_AVA, which is used as a term in the loss function to train the model to boost variance within activations.
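The per-frame and per-channel importance quantities of Eqs. (1) and (2) can be sketched numerically as follows. The tensor shapes and random weights are illustrative placeholders; a real implementation would use the training framework's autograd tensors rather than NumPy:

```python
import numpy as np

T, C, H, W = 4, 8, 6, 6
rng = np.random.default_rng(1)
x = rng.standard_normal((T, H * W, C))      # x_t[j, k], one sub-matrix per frame
w_FR = rng.standard_normal(T)               # trainable frame weights
w_CH = rng.standard_normal(C)               # trainable channel weights

# Eq. (1): per-frame importance, i.e. the mean absolute activation of
# frame t scaled by |w_FR(t)|.
L_FR = np.abs(w_FR) / (C * H * W) * np.abs(x).sum(axis=(1, 2))
var_FR = L_FR.var()                         # sigma^2_FR over t

# Frame-weighted activations forwarded to the channel attention module:
# x'_t = ReLU(w_FR(t) * x_t).
x_prime = np.maximum(w_FR[:, None, None] * x, 0.0)

# Eq. (2): frame-dependent channel importance L_CH(t, c).
L_CH = np.abs(w_CH)[None, :] / (H * W) * np.abs(x_prime).sum(axis=1)
var_CH = L_CH.var()                         # sigma^2_CH over (t, c)
print(L_FR.shape, L_CH.shape)
```

The feature quantity L_FE of Eq. (3) follows the same pattern, averaging over channels instead of feature positions.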
The tensor X_CH, depicted by the columns of the sub-matrices x'_t[] (representing channels) weighted by the scalars w_CH(c), 1 ≤ t ≤ T, 1 ≤ c ≤ C_in, and given as X_CH = ReLU(X_FR × W_CH) in Figure 2, is passed on to the feature attention module.

Feature Attention Module: The columns of the sub-matrices x'_t[j,k] above are weighted by the quantities w_CH(k), 1 ≤ k ≤ C_in, to yield the sub-matrices x*_t[j,k]. These are forwarded to a feature attention module, where feature-wise modulation is performed using the trainable parameter W_FE = [w_FE(1), w_FE(2), ..., w_FE(H·W)]. A frame-dependent, feature-based quantity L_FE is derived as follows:

    L_{FE}(t, f) = \frac{|w_{FE}(f)|}{C} \sum_{c=1}^{C} |x^*_t[f, c]|    (3)

where f is a row index for the sub-matrix x*_t[], 1 ≤ f ≤ H·W. The variance of L_FE(t,f) over all 1 ≤ t ≤ T, 1 ≤ f ≤ H·W, is computed as σ²_FE. Maximizing σ²_FE during training helps in determining the relevance of different features relative to one another on a per-input basis. The tensor X_FE, depicted by the rows of the sub-matrices x*_t[] (representing features) weighted by the scalars w_FE(f), 1 ≤ t ≤ T, 1 ≤ f ≤ H·W, and given as X_FE = ReLU(X_CH × W_FE) in Figure 2, is passed on to the subsequent convolutional layers of the network and used in the computation of the objective function for network training.

The frame-, channel-, and feature-based variances, referred to as σ²_FR, σ²_CH, and σ²_FE respectively, are aggregated to generate the total variance, denoted σ²_AVA. The total variances from all the AVA modules inside the 3D CNN are summed together to generate the final variance, referred to as σ²_f. The 3D CNN is then trained to jointly minimize the cross-entropy loss L_CE between predicted and ground-truth labels and maximize the final aggregated variance σ²_f.
The overall objective function is defined as:

    L_f = L_{CE} - \beta \, \sigma^2_f

where β is a small positive weighting factor that regulates the contribution of the variance term, ensuring that the optimization primarily focuses on reducing classification error while increasing feature variability.

The choice of the standard deviation also serves the purpose of sparsity maximization. In our setting, the quantities L_CH(t,c), L_FE(t,f), and L_FR(t) are first normalized with respect to the ℓ1 norm, i.e., their entries are nonnegative and rescaled to sum to one. Under this normalization, Lemma 1 shows that maximizing the standard deviation is equivalent to maximizing sparsity in the sense of the Hoyer measure [42], defined for a vector x ∈ R^D as

    H(x) = \frac{\sqrt{D} - \|x\|_1 / \|x\|_2}{\sqrt{D} - 1}

Lemma 1. On the probability simplex Δ^{D-1}, the standard deviation std(x) and the Hoyer measure H(x) are both strictly increasing functions of ||x||_2. Consequently,

    \arg\max_{x \in \Delta^{D-1}} \mathrm{std}(x) = \arg\max_{x \in \Delta^{D-1}} H(x)

Proof. Since ||x||_1 = 1, the mean is x̄ = 1/D and std(x)² = (1/D)||x||_2² − 1/D², which is strictly increasing in ||x||_2. Similarly, for x ∈ Δ^{D-1}, the Hoyer measure simplifies to H(x) = (√D − 1/||x||_2)/(√D − 1). Because the term −1/||x||_2 is strictly increasing in ||x||_2, H(x) is also a strictly increasing function of ||x||_2. Since both objectives are monotonic transformations of the same norm, they share identical maximizers. ∎
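Lemma 1 can be checked numerically: on random points of the probability simplex, sorting by the ℓ2 norm sorts both the standard deviation and the Hoyer measure. A small sketch, with an assumed dimension D = 16:

```python
import numpy as np

def hoyer(x):
    """Hoyer sparsity measure H(x) for a nonnegative vector x."""
    D = x.size
    return (np.sqrt(D) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(D) - 1)

rng = np.random.default_rng(2)
D = 16
# Random points on the probability simplex (nonnegative, l1-normalized).
pts = rng.random((200, D))
pts /= pts.sum(axis=1, keepdims=True)

l2 = np.linalg.norm(pts, axis=1)
stds = pts.std(axis=1)
hys = np.array([hoyer(p) for p in pts])

# Ordering the points by l2 norm must leave both std and the Hoyer
# measure in nondecreasing order (Lemma 1), up to float round-off.
order = np.argsort(l2)
assert np.all(np.diff(stds[order]) >= -1e-12)
assert np.all(np.diff(hys[order]) >= -1e-12)
```

As sanity checks, a one-hot vector attains H(x) = 1 (maximally sparse) and the uniform vector 1/D attains H(x) = 0, matching the definition above.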
Figure 3. Adaptive Activation Pruning (AAP): the AAP function is applied sequentially to frames, channels, and features (depicted from top to bottom) to produce a structured sparsity pattern in the activation tensor fed into the subsequent convolutional layer.

5. Adaptive Activation Pruning (AAP)

Adaptive activation pruning (AAP) introduces sparsity into the activation tensors of the intermediate convolutional layers of the network based on precomputed thresholds. Following the variability amplification phase, the high variance within the activation tensor dimensions produces large magnitude differences among frames, channels, and features, which in turn facilitates the elimination of unnecessary (low-magnitude) frames, channels, and features from the computation graph. Following this step, the weights of the 3D convolutional network are frozen and only the parameters of the AAP module are trained. Figure 3 illustrates the processing involved in the AAP module, which proceeds in three consecutive steps.

The output of the first convolutional layer of the 3D CNN serves as input to the AAP controller, which applies 3D average pooling followed by three FFNs (FFN_FR, FFN_CH, FFN_FE) with sigmoid activations to produce global pruning thresholds θ_FR, θ_CH, and θ_FE. These thresholds are shared across all layers and lie within [0, 1] due to the normalization of the importance metrics (L). While a global threshold may be less efficient than layer-wise adaptation, it avoids the linear overhead of per-layer modules, preserving the net computational benefit. Pruning is executed as follows: frame t is suppressed if L_FR(t) ≤ θ_FR (Eq. 1), channel (t,c) if L_CH(t,c) ≤ θ_CH (Eq. 2), and feature (t,f) if L_FE(t,f) ≤ θ_FE (Eq. 3).
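A minimal sketch of the controller's forward computation follows. The single-linear-layer FFNs and all tensor sizes are assumptions for illustration; the paper does not specify the exact FFN architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)

# Output of the first conv layer: (C1, T, H, W), illustrative sizes.
C1, T, H, W = 16, 8, 28, 28
first_layer_out = rng.standard_normal((C1, T, H, W))

# 3D average pooling collapses (T, H, W) into a C1-dim descriptor.
desc = first_layer_out.mean(axis=(1, 2, 3))          # shape (C1,)

# Three tiny FFNs (reduced to one linear map each for brevity), one per
# pruning granularity, each followed by a sigmoid so that the resulting
# threshold lies in [0, 1], matching the normalized importance metrics L.
W_fr, W_ch, W_fe = (rng.standard_normal(C1) * 0.1 for _ in range(3))
theta_FR = sigmoid(desc @ W_fr)
theta_CH = sigmoid(desc @ W_ch)
theta_FE = sigmoid(desc @ W_fe)
print(theta_FR, theta_CH, theta_FE)
```

Because the thresholds are functions of the first layer's statistics, each input video induces its own pruning pattern in every subsequent layer.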
As illustrated in Figure 3, frame, channel, and feature pruning are applied after each corresponding attention module to the intermediate 4D tensor (X) using the retention thresholds described above.

Since the model has already been trained with the AVA module, retraining the entire network while pruning is unnecessary and could even reduce the variability of the intermediate activations. Therefore, the 3D CNN weights are kept frozen, and only the AAP controller network is trained to produce the highest retention thresholds compatible with maintaining classification performance.

However, training the AAP controller network jointly with adaptive activation pruning poses a challenge, as pruning involves non-differentiable binary mask generation. To address the non-differentiability issue, we adopt the Straight-Through Estimator (STE) technique [2], which applies a hard (non-differentiable) operation in the forward pass and a soft counterpart in the backward pass, allowing gradients to flow through the 3D CNN. For the binary mask generation, we employ a soft gating operation inspired by the Gumbel-Softmax technique [13, 21, 28], as shown in Equation 4:

    y = \begin{cases} \mathbb{1}\!\left( \frac{x - \theta + n}{\tau} \ge 0 \right) & \text{(forward pass)} \\ \sigma\!\left( \frac{x - \theta + n}{\tau} \right) & \text{(backward pass)} \end{cases}    (4)

In Equation 4, σ denotes the sigmoid function; x denotes the normalized magnitude of frames, channels, or pixels; θ is the retention threshold; and n ~ N(0, ε) is a Gaussian noise term with zero mean and small variance ε, introduced to smooth the threshold boundary. This stochasticity enables more stable gradient flow near decision boundaries.

Figure 4. Overhead of the 3D CNN, AVA modules, and AAP controller. (A) Parameter count: 3D CNN 99.87%, AAP controller 0.01%, AVA 0.12%. (B) FLOPs: 3D CNN 99.49%, AAP controller 0.5%, AVA 0.01%.
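The gate of Equation (4) can be sketched as follows. Since the hard/soft split only matters under autograd, this NumPy version returns the hard forward mask together with the surrogate sigmoid gradient that an STE backward pass would use; the τ and ε values are assumed for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ste_gate(x, theta, tau=0.1, eps=0.01, rng=None):
    """Eq. (4) sketch: hard binary mask in the forward pass, sigmoid
    surrogate gradient for the backward pass (straight-through)."""
    n = rng.normal(0.0, eps, size=np.shape(x)) if rng is not None else 0.0
    z = (x - theta + n) / tau
    y_hard = (z >= 0).astype(float)            # forward: keep (1) / prune (0)
    y_soft = sigmoid(z)                        # backward: smooth surrogate
    grad_wrt_x = y_soft * (1.0 - y_soft) / tau # d(y_soft)/dx used by STE
    return y_hard, grad_wrt_x

# Normalized magnitudes in [0, 1]; entries below theta are pruned.
x = np.array([0.05, 0.4, 0.8])
mask, grad = ste_gate(x, theta=0.3)            # noise disabled (rng=None)
print(mask)
```

In a framework with autograd this would be wrapped in a custom function that emits `y_hard` in the forward pass and backpropagates through `y_soft`.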
The output y populates the binary mask with 1 if the normalized magnitude is above the threshold and 0 otherwise. The AAP controller network is trained to adaptively select the highest possible retention thresholds, resulting in higher pruning rates while maintaining accuracy. The final objective loss is thus defined as:

    L_f = L_{CE} - \lambda \sum_{\text{AAP modules}} \left( \theta_{FR} + \theta_{CH} + \theta_{FE} \right)

where λ is a small hyperparameter controlling the regularization strength.

Figure 5. Visualization of the frame (a, d), channel (b, e), and feature (c, f) normalized activation magnitude distributions before (top row) and after (bottom row) AVA.

6. Experimental Results

We first present simulation results for computational efficiency and final model accuracy after pruning on baseline datasets, and compare them against state-of-the-art methods. Next, we conduct an ablation study to assess (1) the importance of the AVA step and (2) the effect of retraining the full network together with the AAP controller on accuracy and pruning ratio. We then report the overhead introduced by the AAP controller and AVA modules relative to the 3D CNN model. Subsequently, we describe the pruning ratios for the individual components (frames, channels, and features) and provide visual profiles of their magnitude distributions before and after the AVA procedure. We also include per-layer FLOP profiles before and after pruning.
Finally, we show the speedup and energy efficiency achieved on validation hardware.

Tables 1 and 2 compare our method against state-of-the-art (SOTA) pruning schemes on UCF101 [30] and HMDB51 [16], respectively, using the R(2+1)D [34] and C3D [32] models pretrained on Kinetics [15], reporting pruning rates, baseline accuracies, and post-pruning top-1 accuracies. Across all settings, our dynamic, fine-grained pruning consistently outperforms existing static approaches.

Table 1. Comparison of pruning methods on UCF101.

| Model | Sparsity Scheme | Pruning Rate | Baseline Acc. (%) | Top-1 Acc. (%) | Δ Acc. (%) |
|---|---|---|---|---|---|
| R(2+1)D [34] | Filter | 2.6× | 94.5 | 90.5 | −4.00 |
| | Vanilla | 2.6× | | 91.7 | −2.80 |
| | KGS [25] | 3.2× | | 92.0 | −2.50 |
| | KGRC | 3.02× | | 92.65 | −1.85 |
| | Our method | 4.0× | 91.5 | 93.05 | +1.55 |
| C3D [32] | DCP [43] | 1.99× | 82.77 | 71.72 | −11.05 |
| | FP [17] | 2.02× | | 75.44 | −7.33 |
| | TP [24] | 2.02× | | 70.00 | −12.77 |
| | MDP [8] | 2.00× | | 76.82 | −5.95 |
| | MDP+KD | 2.00× | | 80.10 | −2.67 |
| | KGR | 1.76× | 82.82 | 81.84 | −0.98 |
| | KGC | 1.93× | | 81.39 | −1.43 |
| | KGRC | 3.04× | | 80.21 | −2.61 |
| | Our method | 3.49× | 87.9 | 87.93 | +0.03 |

Table 2. Performance comparison of model compression approaches on HMDB51 (split 1), reporting pruning ratio and accuracy change relative to the C3D baseline.

| Sparsity Scheme | Pruning Rate | Accuracy (%) | Δ Acc. (%) |
|---|---|---|---|
| DCP [43] | 2.02× | 40.59 | −7.6 |
| FP [17] | 2.02× | 40.98 | −7.2 |
| TP [24] | 2.11× | 34.84 | −13.3 |
| MDP [8] | 2.14× | 43.20 | −4.9 |
| MDP+KD | 2.14× | 45.62 | −2.6 |
| Our method | 2.77× | 46.17 | −2.02 |

On UCF101 with R(2+1)D [34], we achieve a 4.0× pruning rate versus 3.02× for the best SOTA technique, while improving accuracy by 1.55%, whereas all static baselines incur accuracy degradation. For UCF101 with C3D [32], our method provides a 3.49× pruning rate, higher than the best SOTA at 3.04×, with no accuracy drop. On HMDB51 with C3D, we again surpass all prior work with a 2.77× pruning rate versus 2.14×, with only a minimal 2.02% accuracy reduction.
Overall, the tables show that dynamic pruning coupled with fine-grained sparsification yields better compression and superior accuracy preservation than the coarse-grained static strategies used in current state-of-the-art methods. The higher baseline accuracies (91.5% for R(2+1)D and 87.9% for C3D) result from the AVA module acting as a feature enhancer that strengthens representations by suppressing redundant activations. This dual-purpose design facilitates fine-grained pruning while simultaneously boosting model robustness, enabling superior compression rates without the accuracy degradation typical of SOTA methods.

Table 3. Ablation study on pruning rate and accuracy for UCF101 and HMDB51, with and without the AVA step, and with co-training of the CNN and AAP controller.

| Metric | UCF101 w/ AVA | UCF101 w/o AVA | UCF101 AAP co-train | HMDB51 w/ AVA | HMDB51 w/o AVA | HMDB51 AAP co-train |
|---|---|---|---|---|---|---|
| Pruning rate | 3.49 | 1.1 | 2.86 | 2.77 | 1.2 | 2.32 |
| Accuracy (%) | 87.93 | 85.04 | 86.7 | 46.17 | 44.5 | 45.57 |

Table 4. Pruning-ratio breakdown (frame, channel, feature, combined) for C3D on UCF101 and HMDB51.

| Dataset | Frame only | Channel only | Feature only | Combined |
|---|---|---|---|---|
| UCF101 | 1.03× | 2.47× | 1.50× | 3.49× |
| HMDB51 | 1.35× | 2.62× | 1.39× | 2.77× |

Table 3 shows that AVA is essential: without it, uniform activation magnitudes prevent the AAP controller from establishing effective pruning thresholds. Furthermore, co-training the CNN and AAP controller disrupts the structured variability required for pruning, decreasing accuracy without providing additional overhead reduction.

Figure 4 further shows that both AVA and the AAP controller add negligible memory and computation relative to the main CNN, confirming that these components introduce minimal overhead. Table 4 shows how frame, channel, and feature pruning individually contribute to pruning the full C3D model on the UCF101 and HMDB51 datasets. In both cases, channels are the sparsest component and provide over twice the overhead reduction of frames and features.
Features exceed frames only in activation pruning, since removing a full frame would significantly degrade CNN accuracy.

As depicted in Figure 5, variance is boosted along the frame, channel, and feature dimensions of the C3D model by suppressing redundant neurons; the sub-figures contrast the normalized activation magnitudes at the fourth convolutional layer of the C3D model for a sample input before and after AVA. From our experiments, we observe that the activation variability amplification framework is more effective along the frame and channel dimensions than along the feature dimension. Training the model to increase variance within the activation tensor therefore facilitates the elimination of redundant neurons, which in turn promotes sparsity when the activations pass through the Adaptive Activation Pruning module.

Figure 6. FLOPs reduction per convolutional layer of the C3D model. Per-layer GFLOPs before → after pruning (reduction): Conv1 1.04 → 1.04 (0.0%), Conv2 11.10 → 1.92 (82.7%), Conv3 5.55 → 5.41 (2.4%), Conv4 11.10 → 0.98 (91.2%), Conv5 2.77 → 1.11 (59.9%), Conv6 5.55 → 0.81 (85.3%), Conv7 0.69 → 0.17 (76.1%), Conv8 0.69 → 0.10 (85.8%).

Figure 6 shows the average FLOP reduction across the C3D convolutional layers. Red bars indicate dense-layer FLOPs, while blue bars show FLOPs after introducing activation sparsity. The first layer shows no reduction because its output is needed by the AAP controller. From the second layer onward, sparsity has a clear impact. The red vertical lines denote each layer's FLOP range, with Conv2 showing the greatest variation. Conv2 and Conv4 also achieve the largest pruning gains due to their high original computational cost. Figure 7 illustrates the distribution of inference computational cost (FLOPs) after adaptive pruning for the two baseline 3D CNN architectures: C3D (Figure 7a) and R(2+1)D (Figure 7b). The computational cost is analyzed with respect to the input information level of each video.
The spatial and temporal information measures, denoted R_s and R_t, quantify the structural complexity within frames and the variability across frames, respectively. Inspired by the information-theoretic formulation in [29] and later used in [10], both metrics are derived from spatio-temporal DCT energy ratios between high- and low-frequency components and are log-normalized as

    R = \log\left(1 + \frac{E_{high}}{E_{low}}\right)

to stabilize variance and reduce skewness. We model the relationship between the intrinsic information content of a video and the computational effort required during inference. A strong correlation is observed between [R_s, R_t] and FLOPs. For C3D, the multiple correlation coefficient and coefficient of determination are R = 0.863 and R² = 0.745, respectively, indicating that 74.5% of the variance is explained by input complexity.

Figure 7. Inference GFLOPs distribution on UCF101 for the adaptively pruned models with respect to spatial and temporal information: (a) C3D (R² = 0.745), (b) R(2+1)D (R² = 0.271).

Figure 8. Two extreme examples from UCF101 processed with C3D. The left video frames (Rafting activity) correspond to the maximum computational cost (14.5 GFLOPs), while the right video frames (Skiing activity) correspond to the minimum computational cost (7.8 GFLOPs).
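The log-normalized energy-ratio measure can be sketched for the spatial case with an explicit DCT-II. The 4×4 low-frequency cutoff and the 16×16 frame size are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    M = np.cos(np.pi * (2 * n + 1) * k / (2 * N)) * np.sqrt(2.0 / N)
    M[0] /= np.sqrt(2.0)
    return M

def spatial_information(frame, cutoff=4):
    """R = log(1 + E_high / E_low): log-normalized ratio of high- to
    low-frequency DCT energy of one square frame. Illustrative form of
    the measure described in the text; `cutoff` is an assumed split."""
    N = frame.shape[0]
    D = dct_matrix(N)
    energy = (D @ frame @ D.T) ** 2            # 2D DCT coefficient energy
    low = energy[:cutoff, :cutoff].sum()       # low-frequency block
    high = energy.sum() - low                  # everything else
    return np.log(1.0 + high / (low + 1e-12))

rng = np.random.default_rng(4)
flat = np.ones((16, 16))                       # uniform frame: little detail
textured = rng.standard_normal((16, 16))       # noisy frame: much detail
R_flat = spatial_information(flat)
R_tex = spatial_information(textured)
print(R_flat, R_tex)
```

A uniform frame concentrates its DCT energy at DC, so R stays near zero, while a highly textured frame spreads energy into high frequencies and yields a large R, matching the intended behavior of the complexity measure.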
In contrast, R2+1D exhibits a weaker correlation ($R = 0.521$, $R^2 = 0.271$). These results confirm that the proposed adaptive framework is input-dependent, allocating more computation to videos that are more complex from an information-theoretic perspective. The disparity between C3D and R2+1D stems from architectural differences: R2+1D is a residual network in which skip connections limit the propagation of sparsity across layers, whereas C3D follows a VGG-style architecture without skip connections, allowing sparsity to accumulate and making its cost more sensitive to input complexity.

Figure 8 presents representative frames from the most and least computationally intensive video samples processed by the C3D model. Specifically, Figures 8a and 8b display four downsampled frames from the Rafting and Skiing activities, respectively. The Rafting sequence requires 14.5 GFLOPs for inference, whereas the Skiing sequence requires approximately one third of that computation. Consistent with Figure 7, the Rafting video's high frame-to-frame variability and spatial detail drive higher $R_s$ and $R_t$ levels. Conversely, the Skiing video's uniform background and localized motion reduce spatio-temporal complexity, lowering computational demand.

[Figure 9: R(2+1)D latency and energy on the Jetson Nano. (a) Inference latency: reference 809 ms vs. adaptive (ours) 590 ms; (b) inference energy: reference 3.55 J vs. adaptive (ours) 2.40 J.]

We evaluate computational efficiency on two edge architectures: an NVIDIA Jetson Nano (4 GB GPU) and a Samsung S22 (Qualcomm Snapdragon 8 Gen 1 ARM CPU). To accommodate platform-specific constraints, namely branching penalties on GPUs and vectorization capabilities on CPUs, we employ temporal/channel pruning for the GPU and temporal/spatial pruning for the CPU.

GPU implementation: We implemented R(2+1)D on the Jetson Nano using UCF101.
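The reported $R$ and $R^2$ values follow from a standard multiple linear regression of per-video FLOPs on $[R_s, R_t]$. Below is a hedged sketch of that computation (our own helper, not the paper's code; the function name and intercept handling are assumptions):

```python
import numpy as np

def multiple_correlation(rs, rt, flops):
    """Fit FLOPs ~ a*R_s + b*R_t + c by least squares and return the
    multiple correlation coefficient R and the determination R^2."""
    X = np.column_stack([rs, rt, np.ones(len(rs))])
    coef, *_ = np.linalg.lstsq(X, flops, rcond=None)
    resid = flops - X @ coef
    ss_res = float((resid ** 2).sum())
    ss_tot = float(((flops - np.mean(flops)) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return float(np.sqrt(max(r2, 0.0))), r2
```

With this definition, $R^2$ is exactly the fraction of FLOP variance explained by the two information measures, which is how the 74.5% (C3D) and 27.1% (R2+1D) figures should be read.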
The decomposed convolution structure of R(2+1)D is less memory-intensive, which suits the Jetson Nano's restricted RAM. Latency was measured via GPU clock time, and power was recorded with an external PSU (reported as the delta over idle). As shown in Figures 9a and 9b, our adaptive method achieves a 1.37× latency reduction and 1.47× better energy efficiency over the PyTorch baseline.

CPU implementation: On the mobile CPU, we developed a custom C3D kernel optimized for ARMv9, leveraging Neon SIMD instructions and FP16 (half-precision) arithmetic to mitigate memory bottlenecks. Figure 10 compares three implementations: an ONNX baseline (with standard Conv-BN-ReLU fusions), our custom dense kernel, and our adaptive pruning kernel.

[Figure 10: C3D inference latency on CPU (UCF101) vs. baseline. Latency for 1/2/3 threads: ONNX 1113/767/583 ms; our dense kernel 1033/770/576 ms; our adaptive kernel 476/346/272 ms.]

Across 1, 2, and 3 threads, our dense kernel matches the ONNX baseline, while the adaptive pruning method delivers a substantial 2.22× average speedup. The sub-linear scaling observed with increased thread counts suggests the system is memory-bandwidth bound rather than compute-bound: the shared memory bus saturates before the execution units do, which presents an opportunity for even more aggressive pruning to further alleviate memory pressure.

7. Conclusion

We propose DANCE, a fine-grained activation pruning framework for 3D CNNs that enables energy-efficient inference on edge devices. Using activation variance to guide pruning, the method adapts dynamically to input complexity. It surpasses state-of-the-art pruning approaches in both efficiency and accuracy, even after accounting for the controller overhead. The framework generalizes across datasets and can be extended to other high-cost models such as vision transformers.

References

[1] Shivam Aggarwal, Kuluhan Binici, and Tulika Mitra. CRISP: Hybrid structured sparsity for class-aware model pruning, 2024.
[2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[3] Hanting Chen, Yunhe Wang, Han Shu, Yehui Tang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. Frequency domain compact 3D convolutional neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1638–1647, 2020.
[4] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. You look twice: GaterNet for dynamic filter selection in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9172–9180, 2019.
[5] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[6] Shangqian Gao, Yanfu Zhang, Feihu Huang, and Heng Huang. BilevelPruning: Unified dynamic and static channel pruning for convolutional neural networks. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16090–16100, 2024.
[7] Xitong Gao, Yiren Zhao, Robert Mullins, Chengzhong Xu, et al. Dynamic channel pruning: Feature boosting and suppression. arXiv preprint arXiv:1810.05331, 2018.
[8] Jinyang Guo, Dong Xu, and Wanli Ouyang. Multidimensional pruning and its extension: A unified framework for model compression. IEEE Transactions on Neural Networks and Learning Systems, 35(9):13056–13070, 2024.
[9] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
[10] Charles Herrmann, Richard Strong Bowen, Neal Wadhwa, Rahul Garg, Qiurui He, Jonathan T. Barron, and Ramin Zabih. Learning to autofocus. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2230–2239, 2020.
[11] Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: An efficient RLHF algorithm with robustness to both prompt and reward models, 2025.
[12] Weizhe Hua, Yuan Zhou, Christopher M. De Sa, Zhiru Zhang, and G. Edward Suh. Channel gating neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[13] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax, 2017.
[14] Mengnan Jiang, Jingcun Wang, Amro Eldebiky, Xunzhao Yin, Cheng Zhuo, Ing-Chao Lin, and Grace Li Zhang. Class-aware pruning for efficient neural networks, 2024.
[15] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset, 2017.
[16] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563, 2011.
[17] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets, 2017.
[18] Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, and Larry S. Davis. 2D or not 2D? Adaptive 3D convolution selection for efficient video recognition. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6151–6160, 2021.
[19] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
[20] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
[21] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables, 2017.
[22] Mrinal Mathur, Barak A. Pearlmutter, and Sergey M. Plis. MIND over body: Adaptive thinking using dynamic computation. In The Thirteenth International Conference on Learning Representations, 2025.
[23] Andrzej Maćkiewicz and Waldemar Ratajczak. Principal components analysis (PCA). Computers & Geosciences, 19(3):303–342, 1993.
[24] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference, 2017.
[25] Wei Niu, Mengshu Sun, Zhengang Li, Jou-An Chen, Jiexiong Guan, Xipeng Shen, Yanzhi Wang, Sijia Liu, Xue Lin, and Bin Ren. RT3D: Achieving real-time execution of 3D convolutional neural networks on mobile devices, 2021.
[26] Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. Conditional deep learning for energy-efficient and enhanced pattern recognition. In Proceedings of the 2016 Conference on Design, Automation & Test in Europe, pages 475–480, San Jose, CA, USA, 2016. EDA Consortium.
[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[28] Mahmoud Salem, Mohamed Osama Ahmed, Frederick Tung, and Gabriel Oliveira. Gumbel-Softmax selective networks, 2022.
[29] Chun-Hung Shen and Homer H. Chen. Robust focus measure for low-contrast images. In 2006 Digest of Technical Papers International Conference on Consumer Electronics, pages 69–70. IEEE, 2006.
[30] Khurram Soomro, Amir Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[31] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[32] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks, 2015.
[33] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[34] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition, 2018.
[35] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks, 2018.
[36] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[37] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VII, pages 3–19, Berlin, Heidelberg, 2018. Springer-Verlag.
[38] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9613–9622, 2019.
[39] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks, 2019.
[40] Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj Kumar Jha. Fully dynamic inference with deep neural networks. IEEE Transactions on Emerging Topics in Computing, 10:962–972, 2020.
[41] Zhiwei Xu, Thalaiyasingam Ajanthan, Vibhav Vineet, and Richard Hartley. RANP: Resource aware neuron pruning at initialization for 3D CNNs, 2021.
[42] Huanrui Yang, Wei Wen, and Hai Li. DeepHoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures. arXiv preprint arXiv:1908.09979, 2019.
[43] Chenxin Zhang, Keqin Xu, Jie Liu, Zhiyang Zhou, Liangyi Kang, and Dan Ye. Identity-linked group channel pruning for deep neural networks. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–9, 2021.
[44] Junzi Zhang, Jongho Kim, Brendan O'Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with REINFORCE, 2020.
