Dynamic Semantic Compression for CNN Inference in Multi-access Edge Computing: A Graph Reinforcement Learning-based Autoencoder
Authors: Nan Li, Alexandros Iosifidis, Qi Zhang
Nan Li, Student Member, IEEE, Alexandros Iosifidis, Senior Member, IEEE, and Qi Zhang, Senior Member, IEEE

Abstract—This paper studies the computational offloading of CNN inference in dynamic multi-access edge computing (MEC) networks. To address the uncertainties in communication time and computation resource availability, we propose a novel semantic compression method, the autoencoder-based CNN architecture (AECNN), for effective semantic extraction and compression in partial offloading. In the semantic encoder, we introduce a feature compression module based on the channel attention mechanism in CNNs to compress the intermediate data by selecting the most informative features. In the semantic decoder, we design a lightweight decoder that reconstructs the intermediate data by learning from the received compressed data, thereby improving accuracy. To effectively trade off communication, computation, and inference accuracy, we design a reward function and formulate the offloading problem of CNN inference as a maximization problem, with the goal of maximizing the average inference accuracy and throughput over the long term. To address this problem, we propose a graph reinforcement learning-based AECNN (GRL-AECNN) method, which outperforms the existing works DROO-AECNN, GRL-BottleNet++ and GRL-DeepJSCC under different dynamic scenarios. This highlights the advantages of GRL-AECNN in offloading decision-making in dynamic MEC.

Index Terms—CNN inference, semantic communication, feature compression, GRL, service reliability, edge computing

I. INTRODUCTION

The widespread adoption of Internet of Things (IoT) devices has paved the way for developing real-time and context-aware applications, such as autonomous driving and augmented reality.
These devices generate enormous volumes of data, necessitating efficient processing and inference capabilities. However, the limited computational resources and constrained bandwidth of IoT devices pose significant challenges for local computing, especially for computationally intensive convolutional neural networks (CNNs) that require massive multiply-accumulate operations [2]. To perform a computation- and memory-demanding CNN inference task within a stringent deadline, a common approach is to compress and prune the CNN topology, thereby reducing the number of computational operations. However, over-pruning CNNs may cause severe accuracy degradation.

This work is supported by the Agile-IoT project (Grant No. 9131-00119B) granted by the Danish Council for Independent Research. Part of this paper was accepted by IEEE ICC 2023 [1]. N. Li, A. Iosifidis, and Q. Zhang are with the Department of Electrical and Computer Engineering, DIGIT, Aarhus University, Finlandsgade 22, 8200, Denmark (email: lnzyy170320@gmail.com; ai@ece.au.dk; qz@ece.au.dk).

To mitigate this issue, edge computing has emerged as an efficient approach, enabling IoT devices to fully offload computational tasks (i.e., full offloading) to edge servers (ESs) through wireless channels [3]. However, fluctuations caused by stochastic wireless channel states introduce inherent uncertainty in the communication time, resulting in varying and unpredictable communication delays [4]. In addition, the varying size of the inference tasks generated by IoT devices adds further variability, contributing to the overall uncertainty in communication delays. Consequently, the uncertainty of communication time directly affects the available time budget for performing the computation, which may lead to task failure when the computation cannot be completed within the deadline.
Furthermore, the computational resources of each ES are usually shared by multiple IoT devices, resulting in dynamic changes in resource availability [5]. This unpredictability of the computation resources exacerbates the uncertainty in computation time, further increasing the likelihood of tasks failing to meet the deadline.

To strike a balance between communication and computation, dynamic offloading methods have been proposed to optimize the offloading decision-making process [2], [6]. However, when communication takes too much time or the available computation resources at the ESs are insufficient, meeting stringent deadlines by running the entire pre-trained CNN model on an ES becomes challenging. As such, dynamic neural networks, e.g., skipping layers [7], kernel filters [8] and early exits [9], modify the CNN architecture, thereby allowing dynamic inference time at the expense of inference accuracy. However, dynamic neural networks still face challenges in meeting strict time constraints due to uncertain communication and computation time, potentially resulting in significant degradation of inference accuracy. These limitations have driven the development of other alternatives, among which split computing (i.e., partial offloading) has shown promise in striking a balance between communication and computation [10]. However, most existing works on split computing primarily focus on model splitting, and less attention has been paid to the compression of intermediate features [11].

In general, a well-trained CNN model often contains redundant features that are not essential for performing an inference task [12], and not all of these features play the same role (as shown in Fig. 1), i.e., different features have varying degrees of importance in making predictions [13]. Therefore, under poor wireless channel conditions, it is desirable to prune less important features to reduce communication

Fig.
1: The 1st CL's output feature maps in ResNet-50. The blue one is almost useless for inference, while the red one has enough information to be used to generate the rest.

overhead, thereby meeting the deadline. This idea aligns with the emerging paradigm of semantic communication, which aims to extract the "meaning" of the information to be transmitted at the transmitter and to successfully interpret the received semantic information at the receiver [14]. Semantic compression can be used to extract and utilize semantic information to compress the intermediate tensor in the early layers in partial offloading and to optimize the communication process. For example, in image classification tasks, not all the features but only the local features (e.g., pixels) of the image directly relevant to the classification are transmitted, thereby reducing the communication overhead [15]. Motivated by the fault-tolerant property of CNNs, Shao et al. [11] proposed BottleNet++, which uses a CNN-based encoder to resize the feature dimension. Similarly, Jankowski et al. [16] proposed DeepJSCC to compress the intermediate feature using a CNN-based encoder. However, directly resizing feature dimensions may compromise the effective representation of the semantic information in the features and result in accuracy degradation.

In wireless edge computing systems, time-varying wireless channel states and available computing resources significantly impact the optimal decision-making process for offloading tasks, especially in multi-access edge computing (MEC) networks. In MEC, one of the major challenges is the joint optimization of the computing paradigm (i.e., local computing, full offloading or split computing), wireless resource allocation (e.g., transmission power, transmission size of the intermediate semantic information) and inference accuracy.
This optimization problem involves ternary offloading variables and is typically formulated as a mixed integer programming (MIP) problem [6], which can be solved using dynamic programming and heuristic local search methods. However, these approaches either suffer from prohibitively high computational complexity or require a considerable number of iterations to converge to an optimal solution, making them impractical for real-time offloading decisions over time-varying wireless channels [6].

Reinforcement learning (RL) is a holistic learning paradigm that interacts with the dynamic MEC to maximize long-term rewards. Li et al. [17] proposed deep RL (DRL)-based optimization methods to address the dynamic computational offloading problem. However, applying DRL directly to the problem is inefficient in a practical deployment because it typically requires many iterations to search for an effective strategy in unseen scenarios. Huang et al. [6] proposed DROO, which significantly improves the convergence speed through efficient scaling strategies and direct learning of offloading decisions. However, the DNN used in DROO can only handle Euclidean data, which makes it ill-suited for the graph-like structure of MEC data. In addition, none of the above methods provides dynamic inference, which limits the flexibility to make good use of any available computation resources under stringent latency constraints.

In this paper, we propose an adaptive semantic compression technique to address the challenges associated with executing computationally intensive CNN inference tasks in MEC. Our approach leverages advances in semantic communication to achieve efficient CNN inference offloading while maintaining inference accuracy. The main contributions are summarized as follows:

• Semantic Encoder: We design a feature compression module based on the channel attention (CA) method in CNNs to quantify the importance of the channels in the intermediate tensor.
By utilizing the statistics of channel importance, we can calculate the importance of each channel, enabling intermediate tensor compression by pruning the channels with lower importance. Furthermore, we employ entropy encoding to remove the statistical redundancy in the compressed intermediate tensor, further reducing the communication overhead.

• Semantic Decoder: We design a lightweight feature recovery (FR) module that employs a CNN to learn and recover the intermediate tensor from the received compressed tensor. This process enhances inference accuracy by effectively reconstructing the compressed tensor.

• Reward Function and Optimization: We define a reward function that strikes a balance between communication, computation, and inference accuracy. The CNN inference offloading problem is formulated as a maximization problem to optimize the average inference accuracy and throughput over the long term under latency and transmission power constraints.

• Graph Reinforcement Learning (GRL)-based Autoencoder: To address the challenges posed by stochastic available computing resources at the ESs and uncertainties in communication time, we propose GRL-AECNN, which ensures that the inference task is completed within the given time constraints by leveraging the capacity of reinforcement learning and graph convolutional networks (GCNs).

• Performance Evaluation: We employ a step-by-step approach to accelerate the training process [18]. Experimental results demonstrate that GRL-AECNN achieves better performance than the existing works DROO-AECNN, GRL-BottleNet++ and GRL-DeepJSCC under different dynamic scenarios, which demonstrates the effectiveness of GRL-AECNN in offloading decision-making.

The remainder of this article is organized as follows. The system model is presented in Section II. Section III describes the proposed AECNN architecture for CNN inference offloading.
In Section IV, the CNN inference offloading problem is modeled as a maximization problem. In Section V, the GRL-AECNN method is proposed to solve the optimization problem. The simulation results are presented and discussed in Section VI, and the conclusions are drawn in Section VII. The notations used in this paper are listed in Table I.

II. SYSTEM MODEL

We consider a dynamic MEC network composed of U IoT devices and S ESs, as illustrated in Fig. 2. The sets of IoT devices and ESs are denoted as \mathcal{U} = \{1, 2, \cdots, U\} and \mathcal{S} = \{1, 2, \cdots, S\}, respectively. At each timeslot k \in \mathcal{K} = \{1, 2, \cdots, K\}, each IoT device generates a computational task that needs to be processed within a given time constraint. The duration of each timeslot is assumed to be constant and is denoted as \tau. We mainly focus on the image classification task and assume that the computational task utilizes a CNN model \Omega with L convolutional layers (CLs) and several fully-connected (FC) layers. We denote the set of CLs as \mathcal{L} = \{0, 1, \cdots, L\}, where the special layer 0 represents the initial stage of the CNN computation. To execute the computational task, each IoT device adheres to a ternary computational policy, i.e., local computing, full offloading or split computing.

A. Task Model

The parameters associated with the computational tasks at timeslot k are defined as \mathcal{I}_k \triangleq \{d^k_u, \sigma^k_u \mid \forall u \in \mathcal{U}\}. Here, d^k_u represents the size of the task generated by IoT device u at timeslot k, typically referring to the size of the original image unless otherwise specified in this paper. The parameter \sigma^k_u indicates the maximum tolerable latency, ensuring that the latency experienced by each inference task does not exceed \sigma^k_u.
For the sake of clarity, we consider local computing and full offloading as two distinct cases within the framework of split computing, and introduce a binary variable \alpha^k_{u,l} \in \{0, 1\} to indicate whether the inference task generated by IoT device u at timeslot k is split at CL l. Specifically, \alpha^k_{u,0} = 1 means that the entire inference task is fully offloaded to an ES; \alpha^k_{u,L} = 1 denotes that the task is performed locally; otherwise, the computation of the inference task is split between IoT device u and an ES, wherein the IoT device offloads the intermediate feature map to the ES after computing the first part locally (i.e., from CL 0 to CL l), and then the ES performs the remaining computation (i.e., from CL l+1 to CL L) and sends the result back to the IoT device. Since each IoT device can only utilize one computation mode to perform the inference task at each timeslot, a feasible offloading policy must satisfy the following constraint:

\sum_{l \in \mathcal{L}} \alpha^k_{u,l} = 1, \quad \forall u \in \mathcal{U}. \qquad (1)

Additionally, we define a binary variable \beta^k_{u,s} \in \{0, 1\} to indicate whether the computation of IoT device u at timeslot

Fig. 2: An example of task offloading in an MEC network.
TABLE I: SUMMARY OF NOTATIONS

\mathcal{L}, \mathcal{U}, \mathcal{S}, \mathcal{K}, \mathcal{M}: the sets of CLs, IoT devices, ESs, timeslots, and compression ratios
\alpha^k_{u,l}, \beta^k_{u,s}, \gamma^k_{u,m}: the CNN inference offloading decisions at timeslot k
\Omega: CNN model with L CLs and several FC layers
X_l: the output feature map of CL l
F^A_l, F^M_l: the aggregated features after avg-pooling and max-pooling
W^A_l: the attention weight map of CL l
R^k_{u,s}, B^k_{u,s}: the data rate and bandwidth between IoT device u and ES s
p^k_{u,s}, g^k_{u,s}: the transmission power and channel gain of the link between u and s
\mathcal{I}_k: the parameters of the task with size d^k_u and deadline \sigma^k_u
D^k_{u,s}: the transmitted data size of IoT device u
n_0: the background noise power spectral density
P_u: the maximal transmission power of IoT device u
t^{com}_{u,s,k}: the communication time of u offloading data to s
t^{cmp}_{u,i,j}, t^{cmp}_{s,i,j}: the computation time from CL i to CL j on u and on s
t^{cmp}_{u,k}: the computing time of u at timeslot k
t^{cmp}_{u,s,k}: the computing time of s performing u's task
T^{arr}_{u,s,k}: the arrival time instant of u's task at s
t^{que}_{u,s,k}: the queuing delay of u's task at s
t_{u,s,k}: the total completion time of u's task at s
E^{com}_{u,s,k}: the energy consumption of u transmitting data to s
E^{cmp}_{u,l,k}: the energy consumption of u computing CL 0 to CL l
E_{u,s,k}: the energy consumption of u when offloading its task to s

k is offloaded to ES s, where \beta^k_{u,s} = 1 means that the computation of IoT device u is offloaded to ES s, and vice versa. Therefore, we have

\sum_{s \in \mathcal{S}} \beta^k_{u,s} = \begin{cases} 0, & \alpha^k_{u,L} = 1, \\ 1, & \text{otherwise}. \end{cases} \qquad (2)

When the computation is split at CL l \in \mathcal{L} \setminus \{0, L\}, the intermediate data tends to be a high-dimensional tensor, leading to a potential increase in communication overhead. To address this challenge, we propose a novel semantic compression approach named AECNN, as detailed in Section III.
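As an illustration of the decision constraints (1)-(3), the following minimal sketch (our own, not the authors' implementation; names are illustrative) checks whether one device's binary decision vectors describe a feasible mode:

```python
# Feasibility check for one device u at one timeslot k, per constraints
# (1)-(3): exactly one split point; exactly one ES unless computing
# locally; a compression ratio only when split at an inner layer.

def is_feasible(alpha, beta, gamma, L):
    """alpha: L+1 binaries over split layers 0..L, beta: binaries over
    ESs, gamma: binaries over compression ratios."""
    if sum(alpha) != 1:                      # constraint (1)
        return False
    local = alpha[L] == 1                    # alpha_{u,L}=1 -> local computing
    if sum(beta) != (0 if local else 1):     # constraint (2)
        return False
    split = (not local) and alpha[0] == 0    # split at some inner layer
    if sum(gamma) != (1 if split else 0):    # constraint (3)
        return False
    return True

L = 4
assert is_feasible([1, 0, 0, 0, 0], [0, 1, 0], [0, 0], L)      # full offloading
assert is_feasible([0, 0, 0, 0, 1], [0, 0, 0], [0, 0], L)      # local computing
assert is_feasible([0, 0, 1, 0, 0], [1, 0, 0], [0, 1], L)      # split computing
assert not is_feasible([0, 0, 1, 0, 0], [1, 0, 0], [0, 0], L)  # missing ratio
```

Note that a split decision implies both an ES assignment and a compression ratio, which is what makes the merged one-step action space used later in Section V possible.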
AECNN can compress the intermediate feature map at a predefined compression ratio m \in \mathcal{M} = \{1, 2, \cdots, M\} while maintaining an acceptable level of inference accuracy. To facilitate this compression process, we utilize a binary variable \gamma^k_{u,m} \in \{0, 1\} to indicate whether the intermediate tensor of IoT device u at timeslot k is compressed using compression ratio m. Therefore, we have

\sum_{m \in \mathcal{M}} \gamma^k_{u,m} = \begin{cases} 1, & \alpha^k_{u,l} = 1,\ l \in \mathcal{L} \setminus \{0, L\}, \\ 0, & \text{otherwise}. \end{cases} \qquad (3)

Note that compressing the intermediate data may result in a degradation of inference accuracy. Therefore, we use \eta^k_{u,m} to denote the achieved inference accuracy of the task generated by u at timeslot k when employing compression ratio m. In general, to perform the computational task within a given deadline, the following three decisions need to be made: at which CL the CNN model should be split, i.e., \alpha^k_{u,l}; to which ES an IoT device should offload its task, i.e., \beta^k_{u,s}; and which compression ratio an IoT device should select, i.e., \gamma^k_{u,m}.

B. Communication Model

In case an IoT device offloads its entire task or its intermediate tensor to an ES, the incurred transmission delay involves delivering the entire inference task or the intermediate feature map, and returning the inference result, between the IoT device and the ES. Since the output of the CNN is typically a small-sized value representing the classification or detection result, we do not consider the transmission delay of the feedback in this paper. Consequently, the amount of data transmitted from IoT device u to ES s can be described as follows:

D^k_{u,s} = \begin{cases} \beta^k_{u,s} d^k_u, & \alpha^k_{u,0} = 1, \\ 4 \beta^k_{u,s} C_l H_l W_l \sum_{m \in \mathcal{M}} \frac{\gamma^k_{u,m}}{m}, & \alpha^k_{u,l} = 1,\ l \in \mathcal{L} \setminus \{0, L\}, \\ 0, & \alpha^k_{u,L} = 1, \end{cases} \qquad (4)

where C_l, H_l and W_l are the channel, height and width dimensions of CL l's output feature map X_l \in \mathbb{R}^{C_l \times H_l \times W_l}, respectively. Note that the output tensor X_l is usually of float32 data type (hence the factor 4), and the size above is measured in bytes.
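The three cases of the data-size model (4) can be sketched as follows (a minimal illustration of ours, not the authors' code; the tensor shape in the example is an assumption chosen to resemble an early ResNet-50 stage):

```python
# Transmitted data size per Eq. (4), assuming float32 (4 bytes) feature
# elements and a single selected compression ratio m.

def transmitted_bytes(mode, d_u=None, C_l=None, H_l=None, W_l=None, m=None):
    """mode: 'full' (alpha_{u,0}=1), 'split' (inner split layer),
    or 'local' (alpha_{u,L}=1)."""
    if mode == "full":
        return d_u                       # raw input of d_u bytes
    if mode == "split":
        return 4 * C_l * H_l * W_l // m  # pruned float32 tensor, ratio m
    return 0                             # local computing transmits nothing

# Hypothetical 64x56x56 intermediate tensor compressed by ratio m=4:
assert transmitted_bytes("split", C_l=64, H_l=56, W_l=56, m=4) == 200704
assert transmitted_bytes("local") == 0
```

Since the channel pruning of Section III keeps C_l/m channels, the split-case size equals 4 (C_l/m) H_l W_l bytes whenever C_l is divisible by m.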
We assume the uplink channel gain between IoT device u and ES s at timeslot k is denoted as g^k_{u,s} \in \mathcal{G}_k = \{g^k_{u,s} \mid \forall u \in \mathcal{U}, \forall s \in \mathcal{S}\}, capturing the effects of path loss and shadow fading. Consequently, the uplink transmission data rate from IoT device u to ES s can be expressed as:

R^k_{u,s} = B^k_{u,s} \log_2 \left( 1 + \frac{p^k_{u,s} g^k_{u,s}}{n_0 B^k_{u,s}} \right), \qquad (5)

where B^k_{u,s} is the channel bandwidth allocated to the link between IoT device u and ES s, and n_0 denotes the noise power spectral density. The transmission power of IoT device u when offloading the entire task or the intermediate feature map to ES s, p^k_{u,s}, should not be greater than its maximal transmission power P_u, i.e., p^k_{u,s} \leq P_u.

During data transmission, we do not consider the data overhead introduced by the network protocol stack and forward error correction. Therefore, the transmission delay of IoT device u when offloading its entire task or intermediate feature map to ES s can be expressed as

t^{com}_{u,s,k} = D^k_{u,s} / R^k_{u,s}. \qquad (6)

In terms of the energy consumption during data transmission, we do not consider the efficiency of the power amplifier in the antenna or the power consumption of the baseband circuit. Therefore, the energy consumption incurred by IoT device u when offloading data to ES s can be represented as

E^{com}_{u,s,k} = t^{com}_{u,s,k} \, p^k_{u,s}. \qquad (7)

C. Computation Model

In CNNs, the computation time is specific to the hardware architecture and can vary based on various factors, including the device, power management techniques, memory access patterns, etc. [19]. Therefore, we employ statistical methods to measure the computation time of each layer, and denote the measured computation time from CL i to CL j on IoT device u and on ES s as t^{cmp}_{u,i,j} and t^{cmp}_{s,i,j}, respectively. In AECNN, the feature compression module is needed only during the training phase for the pruned CL l, but not in the inference process.
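Stepping back to the communication model, Eqs. (5)-(7) can be sketched as below. This is our illustration under assumed numbers (1 MHz bandwidth, 0.1 W transmit power, invented gain and noise values), not values from the paper; note that D^k_{u,s} in (4) is in bytes, so it is converted to bits before dividing by the rate:

```python
import math

def uplink_rate(B, p, g, n0):
    """Eq. (5): R = B * log2(1 + p*g / (n0*B)), in bits per second."""
    return B * math.log2(1.0 + p * g / (n0 * B))

def tx_delay_and_energy(D_bytes, B, p, g, n0):
    """Eqs. (6)-(7): t = D/R (seconds), E = t*p (joules)."""
    t = (8 * D_bytes) / uplink_rate(B, p, g, n0)  # bytes -> bits
    return t, t * p

# Illustrative link: the chosen numbers make p*g/(n0*B) = 1, i.e. R = B.
t, E = tx_delay_and_energy(D_bytes=200704, B=1e6, p=0.1, g=1e-6, n0=1e-13)
assert 1.0 < t < 2.0              # ~1.6 s for ~1.6 Mbit at ~1 Mbit/s
assert abs(E - 0.1 * t) < 1e-12   # Eq. (7)
```

The same two functions cover both the full-offloading case (D = d_u) and the split case (D from Eq. (4)).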
Additionally, the computation time required by the lightweight FR module at the ES is so small that it can be considered negligible in practice. Therefore, the computation time includes the CNN computation time t^{cmp}_{u,0,l} and the feature encoding time t^{enc}_{u,l,m} on the IoT device, as well as the feature decoding time t^{dec}_{s,l,m} and the CNN computation time t^{cmp}_{s,l+1,L} on the ES. Correspondingly, we present the computation time of an inference task on IoT device u and on ES s as

t^{cmp}_{u,k} = \begin{cases} 0, & \alpha^k_{u,0} = 1, \\ t^{cmp}_{u,0,l} + t^{enc}_{u,l,m}, & \alpha^k_{u,l} = 1,\ l \in \mathcal{L} \setminus \{0, L\}, \\ t^{cmp}_{u,0,L}, & \alpha^k_{u,L} = 1, \end{cases} \qquad (8)

and

t^{cmp}_{u,s,k} = \begin{cases} \beta^k_{u,s} t^{cmp}_{s,0,L}, & \alpha^k_{u,0} = 1, \\ \beta^k_{u,s} \left( t^{dec}_{s,l,m} + t^{cmp}_{s,l+1,L} \right), & \alpha^k_{u,l} = 1,\ l \in \mathcal{L} \setminus \{0, L\}, \\ 0, & \alpha^k_{u,L} = 1. \end{cases} \qquad (9)

Given the energy constraints of IoT devices, we primarily focus on investigating the energy consumption of CNN inference tasks on these energy-constrained IoT devices. According to [20], we denote the number of floating-point operations (FLOPs) that can be performed per Watt per second as \rho and calculate the energy consumption on IoT device u as

E^{cmp}_{u,l,k} = \sum_{i=0}^{l} \xi_i / \rho, \quad \alpha^k_{u,l} = 1, \qquad (10)

where \xi_i represents the FLOPs count of CL i. The detailed calculation of the FLOPs count can be found in Appendix A.

III. ARCHITECTURE OF AECNN

In this section, we first present an overview of our proposed AECNN architecture. Next, we describe the structural components of the feature compression module in the encoder and how the intermediate tensor is compressed. Finally, we introduce the FR module in the decoder.
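Before detailing the architecture, the device-side energy model in Eq. (10) amounts to summing the FLOPs executed up to the split layer and dividing by the efficiency \rho. A minimal sketch (the FLOPs counts and \rho below are hypothetical placeholders, not measurements from the paper):

```python
# Eq. (10): E^cmp_{u,l,k} = sum_{i=0}^{l} xi_i / rho when split at CL l.

def device_energy(flops_per_layer, l, rho):
    """Energy (J) for the device to compute CL 0..l at efficiency rho
    (FLOPs per Watt-second)."""
    return sum(flops_per_layer[: l + 1]) / rho

flops = [1.2e8, 2.3e8, 2.3e8, 4.6e8]   # hypothetical xi_i for CL 0..3
rho = 1e10                              # hypothetical FLOPs per Watt-second

assert abs(device_energy(flops, 1, rho) - (1.2e8 + 2.3e8) / rho) < 1e-12
# A later split point can only cost the device more energy:
assert device_energy(flops, 0, rho) < device_energy(flops, 3, rho)
```

This monotonicity is what constraint (23c) later exploits: offloading is only allowed when it consumes less device energy than computing all L layers locally.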
Fig. 3: The proposed AECNN architecture in a device-edge co-inference system. (a) depicts the overall framework of AECNN; (b) shows the design of the feature compression module; and (c) displays the designed FR module, which uses a CNN with group convolutional layers.

A. Overview of AECNN Architecture

Fig. 3 depicts the overall framework of our proposed AECNN, which consists of an encoder and a decoder. In the encoder, a CA module is designed to assess the statistical importance of channels during inference. This enables the pruning of the channels with low importance based on a predefined compression ratio m \in \mathcal{M}. Subsequently, an entropy coding module can be applied to remove the statistical redundancy in the remaining intermediate features. Finally, the decoder first uses the entropy decoding module to decode the received data, and then uses the designed FR module to recover the pruned features from the decoded features.

B. Feature Compression Module

The attention mechanism can effectively improve the classification performance of CNNs by enhancing the representation of features carrying more important information and suppressing unnecessary interference [13]. Channel attention in CNNs usually evaluates the importance of a tensor's channels by paying attention to the different channels of the tensor.
For example, for CL l, each element of the channel attention map W_l \in \mathbb{R}^{C_l \times 1 \times 1} corresponds to the weight of one channel of the output tensor X_l. As such, the channels with lower importance can be identified and removed, thereby reducing the size of the intermediate tensor and reducing the communication and computation time on the IoT device. Note that since the IoT device knows which channels will be pruned for any predefined compression ratio, it only needs to compute the retained channels of the output tensor.

In previous designs of channel attention [13], two FC layers are used to produce the attention weights of the channels. However, this may introduce two drawbacks. First, the reduction of channel dimensionality for saving computation overhead may have side effects on the prediction of channel attention. Second, the channel attention learned by FC layers is intrinsically implicit, resulting in unknowable behavior of the neuronal output. To address these issues, normalization can yield competition or cooperation relationships among channels, using fewer computation resources while providing more robust training performance [21]. Motivated by the above, we design a CA module, i.e., a global max-pooling layer and a global average-pooling layer with normalization, insert it into the original CNN model after the splitting point l, and then train the resulting network to generate the importance value of each channel, as shown in Fig. 3(b). Since the calculations of the global avg-pooling layer and the global max-pooling layer are similar, we take the avg-pooling layer as an example. The aggregated features after the avg-pooling layer can be represented as

F^A_l = \mathrm{AvgPool}(X_l), \qquad (11)

where F^A_l \in \mathbb{R}^{C_l \times 1 \times 1}. The aggregated features F^A_l are then normalized as

\hat{F}^A_l = \frac{F^A_l - \mu}{\sqrt{\delta^2 + \epsilon}}, \qquad (12)

where \epsilon > 0 is a small positive constant, and \mu and \delta are the mean and the standard deviation of F^A_l, respectively.
Then, the normalized features are subjected to an element-wise summation and a sigmoid activation [22] to generate the final channel attention map W^A_l as

W^A_l = \mathrm{sigmoid}\left( \hat{F}^A_l + \hat{F}^M_l \right), \qquad (13)

where \hat{F}^M_l denotes the normalized features of the max-pooling layer.

T^{arr}_{u,s,k} = \begin{cases} t^{cmp}_{u,k} + t^{com}_{u,s,k}, & k = 1, \\ \max\left\{ T^{arr}_{u,s',k-1}, (k-1)\tau \right\} + t^{cmp}_{u,k} + t^{com}_{u,s,k}, & k > 1. \end{cases} \qquad (16)

t^{que}_{u,s,k} = \max_{k' \in \mathcal{K}, u' \in \mathcal{U}} \mathbb{1}\left( T^{arr}_{u,s,k} - T^{arr}_{u',s,k'} \right) \cdot \underbrace{\left( T^{arr}_{u',s,k'} + t^{que}_{u',s,k'} + t^{cmp}_{u',s,k'} \right)}_{\text{completion time instant of } u'\text{'s task}} - T^{arr}_{u,s,k}. \qquad (17)

Note that the generated channel attention weights vary depending on the input data (i.e., images), as shown in Fig. 5 in the experimental results. To measure the importance of the channels, we use the statistics obtained by element-wise averaging of the weights in the channel attention map over all training data. The importance of channel c of the intermediate tensor of CL l, \bar{\omega}^c_l, can be calculated as

\bar{\omega}^c_l = \frac{1}{|\mathcal{Z}|} \sum_{z \in \mathcal{Z}} \omega^c_l(z), \quad 1 \leq c \leq C_l, \qquad (14)

where \mathcal{Z} is the training dataset with size |\mathcal{Z}| and \omega^c_l(z) \in W^A_l is the attention weight of channel c for training sample z. Finally, according to the compression ratio m \in \mathcal{M}, the original output tensor of CL l can be compressed by pruning the less important channels, yielding the compressed intermediate tensor \tilde{X}_l \in \mathbb{R}^{\tilde{C}_l \times H_l \times W_l}. Note that the number of channels of the compressed tensor is \tilde{C}_l = C_l / m.

C. Feature Recovery Module

Since the computational operation of a CNN is essentially a series of linear and nonlinear transformations, some redundant features can be obtained from other features by performing inexpensive nonlinear transformation operations [12]. Motivated by this, we design a lightweight CNN-based FR module to recover the intermediate tensor of CL l from the received compressed information. As entropy coding is lossless, the entropy decoding yields the original compressed intermediate tensor \tilde{X}_l.
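Looking back at Section III-B, the channel-attention statistics in Eqs. (11)-(14) and the subsequent pruning step can be sketched in pure Python as follows. This is our illustrative re-implementation, not the authors' code; tensors are nested C x H x W lists and all numbers are made up:

```python
import math

def _normalize(v, eps=1e-5):
    """Eq. (12): normalize a per-channel vector across channels."""
    mu = sum(v) / len(v)
    var = sum((a - mu) ** 2 for a in v) / len(v)
    return [(a - mu) / math.sqrt(var + eps) for a in v]

def channel_attention(x):
    """Eqs. (11)-(13): global avg/max pooling, normalization, sigmoid."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(map(max, ch)) for ch in x]
    f_a, f_m = _normalize(avg), _normalize(mx)
    return [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(f_a, f_m)]

def channel_importance(maps):
    """Eq. (14): element-wise average of attention maps over the dataset."""
    return [sum(w[c] for w in maps) / len(maps) for c in range(len(maps[0]))]

def kept_channels(importance, m):
    """Indices of the C_l/m most important channels at ratio m."""
    k = len(importance) // m
    return sorted(sorted(range(len(importance)),
                         key=lambda c: -importance[c])[:k])

# The high-activation channel receives the larger attention weight,
# and pruning with m=2 keeps the top half of the channels.
x = [[[0.1, 0.2], [0.1, 0.2]], [[0.9, 1.0], [0.8, 1.1]]]
w = channel_attention(x)
assert 0.0 < w[0] < w[1] < 1.0
imp = channel_importance([[0.9, 0.1, 0.5, 0.3], [0.7, 0.3, 0.7, 0.1]])
assert kept_channels(imp, 2) == [0, 2]
```

The indices returned by `kept_channels` are exactly what both sides must agree on: the device transmits only those channels, and the FR module at the ES regenerates the rest.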
Therefore, we only need to generate the channels pruned by the CA module from \tilde{X}_l, thereby rebuilding the intermediate tensor X_l, as shown in Fig. 3(c). Unlike previous work [12], we use all the channels of the received tensor to generate each pruned channel, which allows better learning and recovery of the representation of the pruned channels. To illustrate the feature recovery module, we use a function f_R(\cdot) to represent the computation operation of learning the c-th channel pruned by the CA module. Thus, the recovered c-th channel can be denoted as

\hat{X}^c_l = f_R\left( \tilde{X}_l \right), \quad c \in \{\tilde{C}_l + 1, \cdots, C_l\}, \qquad (15)

where the recovered C_l - \tilde{C}_l channels are concatenated to the received tensor \tilde{X}_l to form the input of CL l+1.

IV. PROBLEM FORMULATION

In this section, we first present the completion time of a CNN inference task. Then, we formulate an optimization problem to maximize the average inference accuracy and throughput of inference tasks from a long-term perspective, while considering the energy consumption and transmission power constraints of an IoT device.

A. Task Completion Time and Energy Consumption

We assume that CNN inference tasks are processed on a first-come-first-served basis. In other words, an IoT device or ES can start processing a newly arrived task only after it has finished processing all previous arrivals. Let us consider the scenario where IoT device u generates a task at timeslot k, which is offloaded to ES s. The IoT device can only initiate the transmission of this task after completing the transmission of its previous tasks. Neglecting the propagation time, we assume the task generated by IoT device u at timeslot k arrives at ES s at time instant T^{arr}_{u,s,k}, expressed as (16). Additionally, we use an indicator function \mathbb{1}(\cdot) to indicate that the task of another IoT device u' arrives at ES s before IoT device u's task, and calculate the queuing delay of IoT device u, t^{que}_{u,s,k}, as (17).
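The arrival-time recursion (16) and the first-come-first-served queuing delay (17) can be sketched as below, under simplifying assumptions of ours (a single ES, no propagation delay, per-timeslot local-compute and transmission times given as plain lists):

```python
def arrival_times(t_cmp, t_com, tau):
    """Eq. (16): T_arr[k] is when device u's timeslot-k task reaches the
    ES; transmission of task k starts no earlier than k*tau and no
    earlier than the previous task's arrival."""
    T = []
    for k, (c, x) in enumerate(zip(t_cmp, t_com)):
        start = 0.0 if k == 0 else max(T[k - 1], k * tau)
        T.append(start + c + x)
    return T

def queue_delay(t_arr, earlier_tasks):
    """Eq. (17): wait until the last earlier-arriving task completes.
    earlier_tasks: (T_arr, t_que, t_cmp) tuples already at the ES."""
    finish = [a + q + c for (a, q, c) in earlier_tasks if a <= t_arr]
    return max(0.0, max(finish, default=0.0) - t_arr)

T = arrival_times(t_cmp=[0.1, 0.1], t_com=[0.3, 0.2], tau=1.0)
assert all(abs(a - b) < 1e-9 for a, b in zip(T, [0.4, 1.3]))
# Earlier task arrived at 0.4 and needs 1.5 s: ours waits 0.4+1.5-1.3 s.
assert abs(queue_delay(1.3, [(0.4, 0.0, 1.5)]) - 0.6) < 1e-9
```

The `max(..., 0.0)` clamp makes the delay zero when the server is already idle, which matches the intended FCFS semantics of (17).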
There are three steps to complete computation offloading. First, an IoT device sends an inference task to an ES over the wireless uplink, and then the ES performs the inference. Finally, the ES sends the inference result back to the IoT device via the downlink. Therefore, the completion time of an inference task includes the communication time, the queuing delay and the computation time, denoted as

t_{u,s,k} = t^{cmp}_{u,k} + t^{com}_{u,s,k} + t^{que}_{u,s,k} + t^{cmp}_{u,s,k}. \qquad (18)

The energy consumption of an IoT device encompasses both computation energy and communication energy. As such, we can calculate the total energy consumption of IoT device u as

E_{u,s,k} = E^{cmp}_{u,l,k} + E^{com}_{u,s,k}. \qquad (19)

B. Objective

The goal of CNN inference offloading is to maximize the average inference accuracy and throughput of inference tasks from a long-term perspective by designing a reasonable computation offloading policy and resource scheduling policy. To achieve this goal, we define a reward function \Upsilon(\mathcal{G}_k, \mathcal{I}_k, \mathcal{A}_k) to denote the achieved reward at timeslot k, as below:

\Upsilon(\mathcal{G}_k, \mathcal{I}_k, \mathcal{A}_k) = \sum_{u \in \mathcal{U}} \sum_{s \in \mathcal{S}} \sum_{m \in \mathcal{M}} \eta^k_{u,m} \, \psi(t_{u,s,k}), \qquad (20)

where \mathcal{A}_k \triangleq \{\alpha^k_{u,l}, \beta^k_{u,s}, \gamma^k_{u,m} \mid u \in \mathcal{U}, s \in \mathcal{S}, l \in \mathcal{L}, m \in \mathcal{M}\} is the offloading decision determining the computing mode, the matching between IoT devices and ESs, and the compression ratio of the transmitted intermediate tensor; the function \psi(x) introduces a penalty mechanism for tasks that exceed their deadlines. This penalty function mediates the trade-offs among communication efficiency, computation capability and

Fig.
4: Framework of graph reinforcement learning-based AECNN inference accuracy . The definition of ψ ( x ) is introduced in Theorem 1 and the proof is detailed in Appendix B. Theor em 1: Let ψ ( x ) ≜ 2 1 − sigmoid 5 x σ k u . For any completion time t u,s,k and latency requirement σ k u , we have: 1) As t u,s,k approaches σ k u , ψ ( t u,s,k ) → 0 . 2) As t u,s,k approaches 0, ψ ( t u,s,k ) → 1 . Accordingly , we express the av erage achiev ed rew ard func- tion over a period as below: Q ( K , G , I , A ) = 1 K X k ∈K Υ ( G k , I k , A k ) . (21) The optimization problem of maximizing the av erage ac- curacy and throughput of inference tasks over a period is based on the abo ve re ward function. It is a mixed integer pro- gramming non-con vex problem that is dif ficult to solve with con ventional algorithms. T o address this issue, we decouple it into two subproblems, P 1 is the offloading strategy: P 1 : max K , A Q ( K , G , I , A ) (22) s.t. α k u,l ∈ { 0 , 1 } , ∀ u ∈ U and ∀ l ∈ L , (22a) β k u,s ∈ { 0 , 1 } , ∀ u ∈ U and ∀ s ∈ S , (22b) γ k u,m ∈ { 0 , 1 } , ∀ u ∈ U and ∀ m ∈ M (22c) Once the optimal computation offloading decision A ∗ = {A ∗ 1 , A ∗ 2 , · · · , A ∗ K } is determined, the optimization problem is simplified to a con ve x optimization problem P 2 to optimize the resource allocation: P 2 : max K Q ( K , G , I , A ∗ ) (23) s.t. t u,s,k ≤ σ k u , ∀ u ∈ U and s ∈ S , (23a) p k u,s ≤ P u , ∀ u ∈ U and ∀ s ∈ S , (23b) E u,s,k ≤ E cmp u,k,L , ∀ u ∈ U and ∀ s ∈ S . (23c) Note that the constraint (23c) ensures that offloading is more energy-ef ficient than performing the computation locally . V . G R A P H R E I N F O R C E M E N T L E A R N I N G - BA S E D A E C N N In this section, we present the framew ork of our proposed GRL-AECNN to address the optimization problem described in P 1 . Subsequently , we provide a comprehensiv e ov erview of GRL-AECNN and outline the training strategy . A. 
GRL-AECNN framework

In dynamic MEC networks, the data exhibits a graph-like structure rather than a regular Euclidean format. To effectively handle such graph data, we propose GRL-AECNN, applying GCN [23] to analyze the characteristics of graph data through message passing and aggregation between nodes, as shown in Fig. 4. By learning the aggregation method based on the relationships between nodes, GCNs can effectively process and understand the graph-like characteristics of the data. Additionally, GRL-AECNN can automatically filter out messages from disconnected nodes through graph data updates, which obviates the need for retraining the aggregation function when facing a new MEC network topology. Consequently, GRL-AECNN exhibits robust adaptability to changes in the dynamic MEC network structure, without necessitating extensive reconfiguration.

In the proposed GRL-AECNN framework, an actor-critic network is used to generate offloading decisions and update offloading policies. The actor network is responsible for predicting actions; the critic network quantifies the predictions and generates offloading decisions; and the experience replay buffer stores historical experiences and samples mini-batches of training data to train the GCN. Since each task can only be split at one layer l, compressed by one compression ratio m, and then offloaded to one ES s, the three-step task offloading decision for device u's task can be merged into one step, i.e., an IoT device has (L + 1)MS options to perform its task. In GRL-AECNN, we model the structure information of the MEC scenario at timeslot k as graph data Γ_k = (V_k, E_k), where the U IoT devices and the (L + 1)MS options are represented by the graph vertices V_k, and each IoT device and option is connected by a directed edge e ∈ E_k.

B. Actor network

In the actor network, we represent the features of the i-th GCN layer as h^(i) = {h^(i)_v | v ∈ V_k}.
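The two rounds of message passing and the edge classification that the actor network performs can be sketched as follows. This is our own minimal NumPy illustration: the weights are random placeholders, the mean aggregator and layer sizes are simplifications, and real parameters would be learned during training.

```python
import numpy as np

# Minimal sketch of GCN-style neighborhood aggregation followed by an
# edge MLP + sigmoid. Weights are random placeholders, not trained.
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gcn_layer(H, adj, W):
    """H: (N, d) node features; adj: neighbor lists; W: (2d, d') weights."""
    agg = np.stack([H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
                    for nbrs in adj])                    # mean aggregation
    return relu(np.concatenate([H, agg], axis=1) @ W)    # concat, then ReLU

N, d = 5, 8
H0 = rng.standard_normal((N, d))                 # initial node features
adj = [[1, 2], [0], [0, 3], [2, 4], [3]]         # toy MEC graph
H1 = gcn_layer(H0, adj, rng.standard_normal((2 * d, d)))
H2 = gcn_layer(H1, adj, rng.standard_normal((2 * d, d)))   # two GCN rounds

# Edge feature = concat of endpoint embeddings, scored by a 2-layer MLP
def edge_score(u, v, W1, W2):
    h_e = np.concatenate([H2[u], H2[v]])
    return sigmoid(W2 @ relu(W1 @ h_e))

a = edge_score(0, 1, rng.standard_normal((16, 2 * d)),
               rng.standard_normal(16))
assert 0.0 < a < 1.0   # relaxed action lies strictly in (0, 1)
```

Two aggregation rounds suffice here because, as discussed below, second-order neighborhoods already expose each device to the status of other devices and ESs.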
Specifically, we parse the MEC state Γ_k as the initial input data h^(0) for the GCN. The GCN uses multiple graph convolutional layers to aggregate neighborhood information. For each node v ∈ V_k, we define the neighborhood information aggregation process as follows:

h^(i+1)_v = Relu(ϖ^(i+1) C(h^(i)_v, A^(i)(h^(i)_{v′}, v′ ∈ ϱ_v))),   (24)

where ϖ^(i+1) are the weight parameters, A^(i)(·) is the aggregation function, ϱ_v is the set of node v's neighbors, C(·) is the concatenation operation, and Relu(·) is a non-linear activation function [22].

The system can acquire the information of tasks and the status of ESs by aggregating the information in the second-order neighborhood of nodes. For example, IoT device u can grasp the information of its second-order neighborhood (other IoT devices connected to ES s) through its first-order neighborhood, ES s; ES s can acquire the status of its second-order neighborhood (other ESs) through its first-order neighborhood (IoT devices connected to ES s). Therefore, we use two GCN layers in GRL-AECNN, i.e., i ∈ {0, 1} in (24).

Once the information aggregation of nodes is finished, the next step is to obtain the feature representation h_e of each edge e ∈ E_k by concatenating the features of its source node v′ ∈ V_k and destination node v″ ∈ V_k. This process is outlined as

h_e = C(h^(2)_{v′}, h^(2)_{v″}).   (25)

Then, we classify the edges to get the relaxed offloading action ζ_k = {a_{k,e} | a_{k,e} = F(h_e), e ∈ E_k} by the function

F(h_e) = sigmoid(MLP_2(Relu(MLP_1(h_e)))),   (26)

where MLP_1 and MLP_2 are multi-layer perceptrons that extract the features of edge e. We use the sigmoid(·) function so that the relaxed offloading action satisfies 0 < a_{k,e} < 1 [22].

C.
Critic network

In the critic network, we first use the order-preserving method from DROO [6] to quantize the relaxed offloading action ζ_k and generate N = ULMS candidate binary offloading decisions A_k = {ζ^(1)_k, ζ^(2)_k, · · · , ζ^(N)_k}, where ζ^(n)_k = {a^(n)_{k,e} | a^(n)_{k,e} ∈ {0, 1}, e ∈ E_k}. Recall that each candidate offloading action ζ^(n)_k can achieve a reward by solving (20). Therefore, the optimal offloading action at the k-th timeslot can be generated as

A*_k = arg max_{ζ^(n)_k ∈ A_k} Q(K, G, I, A).   (27)

D. Complexity Analysis and Training Strategy

1) Complexity Analysis: At timeslot k, the computational complexity associated with the offloading decision A_k is ULMS. However, numerous IoT devices and multiple candidate splitting points of a CNN model may cause substantial complexity. In fact, not all the candidate splitting points are meaningful for decision-making.

Algorithm 1: Training strategy of AECNN
Input: CNN model Ω, the set of CLs L, the set of compression ratios M, the set of training data Z.
Output: The set of AE-enhanced CNN models Ω = {Ω^2_1, · · · , Ω^M_1, · · · , Ω^M_{L−1}}.
1: for l = 1 to L − 1 do
2:   Insert the CA module after the splitting point l.
3:   Train the resulting network on training data Z.
4:   Calculate the importance of each channel using (14).
5:   Sort the importance of all the channels.
6:   Remove the inserted CA module.
7:   for m = 2 to M do
8:     Compress CL_l by pruning the C_l(1 − 1/m) less important channels.
9:     Fine-tune the pruned CNN model.
10:    Insert the FR module into the pruned CNN model before CL_{l+1}, then fine-tune the resulting neural network to get Ω^m_l.
11:   end for
12: end for
13: return Ω

As described in Section VI-B, with the same communication overhead, splitting the CNN model at deeper split points may result in lower inference accuracy while increasing the computation overhead on the IoT device.
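The order-preserving quantization step can be illustrated with a compact sketch. This is our own simplified variant of the DROO idea [6], not the exact procedure from that paper: candidate binary actions are generated by sweeping the threshold over the relaxed action's entries ordered by distance to 0.5, which preserves the ordering of entries across candidates.

```python
# Simplified sketch of DROO-style order-preserving quantization [6]:
# candidates come from thresholding the relaxed action at values ranked
# by distance to 0.5 (our own compact variant, not the exact procedure).
def order_preserving_quantize(zeta, N):
    thresholds = [0.5] + sorted(zeta, key=lambda z: abs(z - 0.5))[:N - 1]
    cands = []
    for th in thresholds:
        c = tuple(1 if z > th else 0 for z in zeta)
        if c not in cands:          # drop duplicate candidates
            cands.append(c)
    return cands

zeta = [0.9, 0.4, 0.6, 0.1]         # relaxed actions, each in (0, 1)
cands = order_preserving_quantize(zeta, 4)
assert cands[0] == (1, 0, 1, 0)     # first candidate is plain rounding
assert all(set(c) <= {0, 1} for c in cands)
```

Each candidate would then be scored via the reward of (20), and the best one selected as in (27).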
As such, we first train the proposed AECNN and select the meaningful splitting points, which are then used to train the GRL-AECNN that performs the offloading decision-making.

2) Training of AECNN: The proposed AECNN architecture can be trained in an end-to-end manner. However, this may result in very slow convergence. Therefore, we use a step-by-step training approach. We first insert the designed CA module into the original CNN model and then train the resulting neural network to determine the importance of the channels. Based on the statistics of the channels' importance, for a given compression ratio m, the C_l(1 − 1/m) channels with the lowest importance are identified as prunable. Then, we remove the inserted CA module and prune the original CNN model by removing the identified channels and the corresponding filters. Next, we fine-tune the pruned CNN model to recover the accuracy loss caused by the pruning. Finally, we insert the designed FR module into the pruned CNN model and fine-tune the resulting CNN model to improve the inference accuracy. Throughout the training process, we do not consider the entropy encoding and decoding modules, because this lossless compression does not cause any accuracy loss. The detailed training process is described in Algorithm 1.

3) Offloading policy update: We use the experience replay buffer technique to train the GCN using the stored data samples (k, Γ_k, A*_k), as shown in Fig. 4. At timeslot k, we randomly select a mini-batch of training data Δ_k = (Δ^T_k, Δ^Γ_k, Δ^{A*}_k) from the memory to update the parameters of the GCN.

Algorithm 2: GRL-AECNN for offloading decision-making
Input: MEC state Γ_k, ∀k ∈ K; training interval ω.
Output: Offloading decision A*_k.
1: for k = 1 to K do
2:   Generate the relaxed offloading action ζ_k using (26).
3:   Quantize ζ_k into N binary actions A_k.
4:   Select the optimal offloading action A*_k using (27).
5:   Update the experience replay buffer by adding (Γ_k, A*_k).
6:   if k mod ω = 0 then
7:     Randomly sample a mini-batch of training data Δ_k from the buffer.
8:     Train the GCN and update the parameters using (28).
9:   end if
10: end for
11: return A*_k

The GCN parameters are updated by minimizing the averaged cross-entropy loss [6], as

ξ(Δ_k) = −(1/|Δ_k|) Σ_{k′∈Δ^T_k} [(1 − A*_{k′}) log(1 − f_I(E_{k′})) + A*_{k′} log f_I(E_{k′})],   (28)

where |Δ_k| is the size of the mini-batch of training data, Δ^T_k is the set of timeslots, Δ^Γ_k is the set of graphs, and Δ^{A*}_k is the set of actions. The detailed process of GRL-AECNN is described in Algorithm 2.

VI. PERFORMANCE EVALUATION

A. Experimental Setup

We consider an MEC network comprising S = 2 ESs (RTX 2080Ti GPUs) located at [(30 m, 30 m), (90 m, 30 m)], and U = 14 IoT devices (Raspberry Pi 4B) randomly distributed in the [0, 120] × [0, 60] m² region. The bandwidth of the 2.4 GHz WiFi connection between an IoT device and the respective ES is set to B^k_{u,s} = 20 MHz, the noise power spectral density is n_0 = −174 dBm/Hz, and the maximum transmission power of each IoT device is limited to P_u = 20 dBm. Similar to [6], we consider the free-space propagation model and express the average channel gain as ḡ^k_{u,s} = g_a (3 × 10^8 / (4π f_c ϑ^k_{u,s}))^{d_e}, where f_c is the WiFi frequency, g_a = 2 is the antenna gain, d_e = 2.8 is the path-loss exponent, and ϑ^k_{u,s} represents the distance between IoT device u and ES s, measured in meters. The wireless channel gain g^k_{u,s} can be expressed as g^k_{u,s} = ḡ^k_{u,s} g_r, where the Rayleigh small-scale fading coefficient follows g_r ∼ CN(0, I). Without loss of generality, we assume that channel gains remain constant within a single timeslot and vary independently from one timeslot to another.

We consider the classification task on the Caltech-101 dataset [24], consisting of approximately 9,000 images categorized into 101 classes.
Each category comprises roughly 40 to 800 images with resolutions ranging from 200 × 200 to 300 × 300 pixels. The diversity in resolutions aligns well with the variability typically encountered in IoT applications. We assume task sizes d^k_{m,n} ranging between 20 KBytes and 100 KBytes, and use the popular ResNet-50 [18] for image classification. To perform the co-inference, we split ResNet-50 at different splitting points and compress the intermediate tensor with different compression ratios M = {1, 2, 4, 8, 16, 32, 64}. Since ResNet-50 introduces a branching structure with residual blocks instead of a sequential structure, the first CL and each residual block are considered as the candidate splitting points.

[Fig. 5: Attention weights of splitting point l = 1.]

We use PyTorch to implement GRL-AECNN with the following training parameters: the hidden neurons of the two GCN layers are 128 and 64, the learning rate is initialized to 0.001, the experience replay buffer size is 128, the mini-batch size is |Δ_k| = 64, the training interval is ω = 10, and the optimizer for the loss function ξ(Δ_k) is Adam [25].

To validate the effectiveness of GRL-AECNN, we conduct a comprehensive evaluation comparing semantic compression performance and CNN inference offloading efficiency. We first compare the performance of AECNN with existing state-of-the-art semantic compression methods, BottleNet++ [11] and DeepJSCC [16]. In both BottleNet++ and DeepJSCC, the intermediate tensor is encoded using a CNN-based encoder with dimension adjustment at the final FC layer. Subsequently, we compare the performance of GRL-AECNN with the state-of-the-art CNN inference offloading method, DROO [6].
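The free-space channel model from the setup above can be sketched numerically. This is a minimal illustration of the stated average-gain formula ḡ = g_a (c / (4π f_c d))^{d_e} with the paper's parameter values; function and variable names are ours.

```python
import math

# Sketch of the average channel-gain model from Section VI-A:
# free-space term raised to the path-loss exponent d_e = 2.8,
# with antenna gain g_a = 2 and 2.4 GHz WiFi carrier.
def avg_channel_gain(dist_m, f_c=2.4e9, g_a=2.0, d_e=2.8):
    free_space_term = 3e8 / (4 * math.pi * f_c * dist_m)
    return g_a * free_space_term ** d_e

g_30m = avg_channel_gain(30.0)
g_60m = avg_channel_gain(60.0)
assert g_60m < g_30m                             # gain decays with distance
assert abs(g_30m / g_60m - 2 ** 2.8) < 1e-6      # decay rate set by d_e
```

The small-scale Rayleigh fading factor g_r would multiply this average gain per timeslot, redrawn independently each slot as stated above.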
Our comparative analysis involves the following three methods:
• DROO-AECNN: DROO [6] enhanced AECNN;
• GRL-BottleNet++: GRL enhanced BottleNet++ [11];
• GRL-DeepJSCC: GRL enhanced DeepJSCC [16].

B. Performance of GRL-AECNN

Measurements of attention weights in AECNN: To verify the robustness of the statistical method for calculating the importance of channels in AECNN, we use the same amount of input data from different batches to calculate the importance of each channel. We hereby take the first candidate point as an example and calculate the importance of each channel of the intermediate tensor by randomly sampling three batches of input data. As shown in Fig. 5, the overall trend of the channels' importance is essentially consistent across these three batches of data. This means that while the importance of individual channels might vary depending on the specific input data, the general ranking and trend of the channels' importance remain relatively stable, which demonstrates the feasibility of the statistical method we used for calculating the importance of channels.

TABLE II: Inference accuracy under different compression ratios of the intermediate tensor, and entropy encoding (measured on Raspberry Pi 4B) and decoding (measured on RTX 2080Ti) time.

| Splitting point | C_l × H_l × W_l | m | BottleNet++ (%) | JSCC (%) | CA Pruned (%) | AECNN (%) | C′_l | Entropy (bit) | t_enc (ms) | t_dec (ms) |
| l = 1 | 64 × 56 × 56 | 2× | 92.19(±0.22) | 93.10(±0.15) | 95.58(±0.24) | 95.62(±0.20) | 32 | 11.13(±0.18) | 4.53 | 3.03 |
| | | 4× | 91.66(±0.27) | 92.06(±0.21) | 95.05(±0.26) | 95.30(±0.21) | 16 | 10.48(±0.16) | 3.25 | 2.41 |
| | | 8× | 91.49(±0.36) | 91.83(±0.23) | 94.57(±0.11) | 94.68(±0.17) | 8 | 9.86(±0.21) | 1.94 | 1.52 |
| | | 16× | 90.92(±0.23) | 91.29(±0.19) | 93.89(±0.54) | 93.98(±0.17) | 4 | 9.13(±0.27) | 1.76 | 1.37 |
| | | 32× | 89.76(±0.18) | 90.76(±0.18) | 92.69(±0.65) | 92.70(±0.31) | 2 | 8.62(±0.12) | 1.03 | 0.74 |
| | | 64× | 88.54(±0.44) | 89.52(±0.29) | 91.10(±0.53) | 91.47(±0.25) | 1 | 7.73(±0.32) | 0.62 | 0.39 |
| l = 2 | 256 × 56 × 56 | 2× | 93.28(±0.30) | 94.39(±0.27) | 95.43(±0.19) | 95.48(±0.33) | 128 | 12.16(±0.27) | 7.89 | 4.76 |
| | | 4× | 92.25(±0.17) | 93.85(±0.24) | 95.23(±0.25) | 95.24(±0.28) | 64 | 11.51(±0.29) | 5.07 | 3.21 |
| | | 8× | 91.70(±0.39) | 92.91(±0.31) | 94.96(±0.14) | 95.05(±0.25) | 32 | 10.99(±0.24) | 3.51 | 2.60 |
| | | 16× | 91.64(±0.31) | 92.21(±0.19) | 94.80(±0.30) | 94.85(±0.16) | 16 | 10.42(±0.21) | 3.00 | 2.35 |
| | | 32× | 90.86(±0.28) | 91.65(±0.21) | 94.61(±0.21) | 94.64(±0.19) | 8 | 9.82(±0.18) | 1.87 | 1.46 |
| | | 64× | 90.63(±0.22) | 91.04(±0.31) | 93.63(±0.28) | 93.78(±0.19) | 4 | 9.19(±0.22) | 1.90 | 1.44 |

Inference accuracy of AECNN under various compression ratios: In partial offloading, the latency is mainly caused by the computation and communication time on the IoT device due to its limited resources and bandwidth. Therefore, we should split the CNN model as early as possible (near the input layer) to reduce the computation on the IoT device, and compress the intermediate tensor as much as possible without compromising too much accuracy. In our experiments, we found that the computation time on the IoT device alone exceeds 100 ms if ResNet-50 is split at the third or later candidate points, which is not suitable for real-time inference. Therefore, we mainly consider the first and second candidate points. In our experiments, the inference accuracy of the original ResNet-50 on the test dataset is 95.84(±0.35)%. Table II compares the inference accuracy of AECNN with that of BottleNet++ and DeepJSCC at different compression ratios of the intermediate tensor, where 'CA Pruned' signifies the pruned ResNet-50 without the FR module.
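The pruning arithmetic behind Table II is easy to verify. A short sketch (function names are ours): for compression ratio m, C_l(1 − 1/m) channels are pruned, leaving C′_l = C_l/m channels to transmit, and entropy coding then shrinks each value from 32 bits to the measured entropy.

```python
# Sketch of the pruning arithmetic behind Table II.
def kept_channels(C_l, m):
    pruned = int(C_l * (1 - 1 / m))   # C_l(1 - 1/m) channels are pruned
    return C_l - pruned               # C'_l = C_l / m channels remain

# l = 1 (C_l = 64): kept channels match the C'_l column of Table II
assert [kept_channels(64, m) for m in (2, 4, 8, 16, 32, 64)] == [32, 16, 8, 4, 2, 1]
# l = 2 (C_l = 256), m = 64: matches C'_l = 4 in Table II
assert kept_channels(256, 64) == 4

# Total compression at l = 1, m = 64, with 7.73-bit entropy per value:
total = 64 * (32 / 7.73)
assert total > 256    # consistent with the ">256x" claim in the text
```

This confirms that the reported compression factors combine channel pruning (the m column) with the lossless entropy-coding gain (the Entropy column).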
We can see that AECNN improves the accuracy of the CA Pruned ResNet-50, which demonstrates the effectiveness of the proposed FR module. In general, higher compression results in more accuracy loss due to the less comprehensive representation of the features. The experimental results show that AECNN consistently outperforms BottleNet++ and DeepJSCC at different compression ratios. For example, when l = 1 and m = 4, AECNN improves the accuracy from 91.66% and 92.06% to 95.30% in comparison with BottleNet++ and DeepJSCC, respectively. This improvement is attributed to AECNN's ability to extract informative features instead of directly compressing the intermediate tensor as BottleNet++ and DeepJSCC do, thus avoiding the loss of semantic information. Moreover, AECNN achieves higher accuracy by splitting the model at the first splitting point than at the second with the same communication overhead. For example, splitting the model at the first point, AECNN achieves an accuracy of 93.98% when m = 16, while that of the second point is 93.78% when m = 64. Note that in this case, the compressed data size at the first point is 4 × 64 × 56 × 56 / 16, which is equivalent to 4 × 256 × 56 × 56 / 64 at the second point. Therefore, choosing the first split point is a better option; in addition, it uses less computation time and can reduce the overall task completion time. At the first splitting point, AECNN can compress the intermediate tensor by more than 256× (i.e., 64 × 32 / 7.73) using channel pruning and entropy coding, with an accuracy loss of only about 4% (i.e., 95.84% − 91.47%). In subsequent experiments, we mainly focus on the first splitting point in partial offloading to alleviate the computation complexity of GRL-AECNN.

Convergence of GRL-AECNN: We first define the normalized reward for timeslot k as

Ῡ(G_k, I_k, A_k) = Υ(G_k, I_k, A_k) / Υ(G_k, I_k, A′_k),   (29)

where the action A′_k is obtained by exhaustive search.
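The quantities compared in the convergence plots can be illustrated with a short sketch of the deadline penalty from Theorem 1 and the normalized reward of (29); variable names and the sample values are our own.

```python
import math

# Sketch of the Theorem 1 penalty psi(t) = 2*(1 - sigmoid(5 t / sigma)):
# ~1 for near-instant completion, ~0 as t reaches the deadline sigma.
def psi(t, sigma):
    return 2.0 * (1.0 - 1.0 / (1.0 + math.exp(-5.0 * t / sigma)))

sigma = 0.030                       # 30 ms latency requirement
assert psi(0.0, sigma) == 1.0       # t -> 0 gives psi -> 1
assert psi(sigma, sigma) < 0.014    # t -> sigma gives psi -> 0

# Normalized reward of Eq. (29): achieved reward divided by the reward
# of the exhaustive-search action, so 1.0 means the decision is optimal.
def normalized_reward(achieved, exhaustive_best):
    return achieved / exhaustive_best

assert normalized_reward(0.96, 1.0) > 0.95   # the regime reported for GRL-AECNN
```

The steepness factor 5 makes the penalty nearly saturate at the endpoints, so rewards are dominated by whether tasks finish comfortably within their deadlines.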
In Fig. 6, we characterize the convergence performance by plotting the moving average of Ῡ over the most recent 50 timeslots, alongside the training loss. As the timeslot index increases, the moving average of Ῡ and the training loss gradually converge towards the optimal solution. The occasional fluctuations are primarily attributed to the randomness of training data sampling. Specifically, as the timeslot index increases, the moving average of the normalized reward Ῡ for GRL-AECNN consistently exceeds 0.95, while the training loss remains consistently below 0.03. This performance superiority is particularly evident when compared to the other three methods. This distinction can be attributed to GRL-AECNN's ability to make full use of the MEC states for making offloading decisions, thus providing an advantage over DROO-AECNN, which only considers the wireless channel state for decision-making. Compared to GRL-BottleNet++ and GRL-DeepJSCC, GRL-AECNN can effectively extract the semantic information to make better offloading decisions, resulting in superior performance. The robust convergence, coupled with the compelling performance metrics, reinforces the efficiency of GRL-AECNN in optimizing offloading decision-making.

[Fig. 6: Performance of convergence. (a) Normalized reward; (b) Training loss.]

C. Performance under Various No. of IoT Devices

We measured the performance of different offloading methods over 10,000 timeslots in different scenarios.

[Fig. 7: Performance under various No. of IoT devices. (a) Average accuracy; (b) Service success probability; (c) Average throughput.]

We define a successful task as one that is completed within its deadline and use several metrics to evaluate the reliability, accuracy and efficiency of GRL-AECNN:
• service success probability (SSP): the number of successful tasks divided by the total number of tasks;
• average inference accuracy: the sum of each successful task's accuracy divided by the total number of tasks;
• average throughput: the number of successful tasks divided by the cumulative number of timeslots.

As shown in Fig. 7, the average accuracy and SSP decrease as the number of IoT devices U increases for the given ES computation resources. This is because when U is large, more tasks fail to meet their deadlines due to the limited resources of the ESs. As such, the average throughput gradually reaches a plateau.
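The three metrics defined above can be encoded in a few lines; the toy data here is our own and purely illustrative.

```python
# Minimal encoding of the three evaluation metrics: each task is a
# pair (inference_accuracy, met_deadline). Toy data, not paper results.
def metrics(tasks, n_timeslots):
    successes = [acc for acc, met_deadline in tasks if met_deadline]
    ssp = len(successes) / len(tasks)          # service success probability
    avg_acc = sum(successes) / len(tasks)      # note: divided by ALL tasks
    throughput = len(successes) / n_timeslots  # successful tasks per slot
    return ssp, avg_acc, throughput

tasks = [(0.95, True), (0.94, True), (0.93, False), (0.92, True)]
ssp, acc, thr = metrics(tasks, n_timeslots=2)
assert ssp == 0.75 and thr == 1.5
assert abs(acc - 0.7025) < 1e-9
```

Note that average inference accuracy divides by the total number of tasks, so a failed task drags the average down even though its accuracy is not counted; this couples the accuracy metric to the SSP.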
Additionally, we can see that the system achieves a higher throughput at τ = 10 ms than at τ = 30 ms; however, the increase in throughput comes with an associated trade-off, i.e., a decrease in both the average accuracy and the SSP. This is because the higher task generation rate at τ = 10 ms allows the system to make better use of the ESs' idle time to process more tasks during the same time duration; however, this results in more failed tasks because of the higher occupancy of the ESs and wireless channels. Furthermore, GRL-AECNN demonstrates the capability to enhance average accuracy, SSP, and throughput, particularly in scenarios where U is large. For example, when U = 10 and τ = 10 ms, GRL-AECNN achieves average inference accuracy improvements of 0.20, 0.15, and 0.11, respectively, in comparison with DROO-AECNN, GRL-BottleNet++, and GRL-DeepJSCC. This is because AECNN can effectively identify the semantic information while reducing the computation on IoT devices via channel pruning, and GRL can use the full information of the MEC to make optimal offloading decisions; however, the semantic encoders in BottleNet++ and DeepJSCC lack effective extraction and compression of semantic information while introducing extra computation on IoT devices.

D. Performance under Uncertain Computation Time

In real-world scenarios, ESs are often not consistently idle and their computational resources are dynamic. To comprehensively reflect the impact of such variations on the effectiveness of offloading strategies, we consider a scenario with 14 IoT devices and 2 ESs where each ES has a stochastic computational resource availability ranging between λ% and 100% of its overall computational capacity at each timeslot. In Fig. 8, we compare the performance of the aforementioned methods for λ ∈ {25, 50, 75, 100}. As the variation range increases, both the average inference accuracy and the average throughput decrease.
This is because insufficient computational resources at the ESs make it more likely for tasks to miss their deadlines, leading to more task failures, lower average accuracy, and reduced throughput. Notably, GRL-AECNN achieves higher gains over the other methods in terms of average accuracy and average throughput under larger variations in the ESs' available computational resources. For example, at λ = 25 and τ = 10 ms, GRL-AECNN improves the average inference accuracy by up to (0.189 − 0.019)/0.019 = 8.9× over DROO-AECNN and (0.189 − 0.074)/0.074 = 1.6× over GRL-DeepJSCC, respectively, which is higher than the corresponding (0.302 − 0.157)/0.157 = 0.9× and (0.302 − 0.208)/0.208 = 0.5× when λ = 75 and τ = 10 ms. Furthermore, as the variation range increases (i.e., smaller λ), the three GRL-based offloading methods show lower degradation in average inference accuracy and average throughput than DROO-AECNN. For example, GRL-AECNN has a (0.398 − 0.189)/0.398 = 0.5× degradation in average inference accuracy from λ = 100 to λ = 25 when τ = 10 ms, whereas that of DROO-AECNN is up to (0.277 − 0.019)/0.277 = 0.9×. This highlights the effectiveness of GRL-AECNN in offloading decision-making for scenarios with limited available computation resources.

[Fig. 8: Performance under various available capacities of ESs. (a) Average accuracy; (b) Average throughput.]

The computation time of ESs can be affected by various factors, including storage availability, thermal conditions, and environmental factors. In addition to the previously mentioned variation in ES computation resources, we considered a realistic case where the computation time of each ES fluctuates by ±25% of its measured value. In Fig. 9, the three GRL-based offloading methods remain relatively more stable in average inference accuracy than DROO-AECNN under computation time fluctuations. For example, GRL-AECNN has a (0.384 − 0.349)/0.349 = 0.1× degradation in average inference accuracy when λ = 75 and τ = 10 ms, whereas that of DROO-AECNN is (0.258 − 0.157)/0.157 = 0.64×. This further demonstrates that GRL-AECNN is better at learning the state information of ESs for effective decision-making.

[Fig. 9: Performance under uncertain computation time.]

E. Performance under Imperfect Channel State Information

Since channel estimation is often not perfect in practical systems, the channel state information (CSI) imperfections can be deterministically modeled using the ellipsoidal approximation [26], as

ĝ^k_{u,s} = g^k_{u,s} · 10^{ϑ^k_{u,s}/10},  ϑ^k_{u,s} ∈ [−ε, ε],   (30)

where the non-negative constant ε denotes the uncertainty bound of the CSI imperfections. In this study, we consider the scenario with 14 IoT devices and 2 ESs, and include the CSI imperfections under different uncertainty bounds ε.

[Fig. 10: Performance under imperfect CSI]

As shown in Fig. 10, the average inference accuracy of the system decreases as the uncertainty bound of the CSI imperfections increases. This occurs because task offloading decisions made under larger biases in the imperfect channel estimation might lead to tasks not being completed within their specified deadlines. Consequently, this situation causes more task failures, contributing to an overall decrease in the average inference accuracy. Nonetheless, our proposed GRL-AECNN shows a relatively small degradation in average inference accuracy. For example, when τ = 10 ms, GRL-AECNN has a (0.398 − 0.380)/0.380 = 0.05× degradation in average inference accuracy from ε = 0 to ε = 5, whereas that of DROO-AECNN is up to (0.277 − 0.187)/0.187 = 0.48×. This highlights the effectiveness of GRL-AECNN in accommodating the CSI imperfections, thereby making robust offloading decisions that ensure more tasks are processed within their deadlines in dynamic MEC scenarios.

VII.
CONCLUSION

In this paper, we studied the computation offloading of CNN inference tasks in dynamic MEC networks. We proposed a novel semantic compression method, AECNN, to address the uncertainties in communication time and available computation resources at the ESs. In AECNN, we designed a CA module to determine the importance of channels, and then compressed the intermediate tensor by pruning the less important channels. We used entropy encoding to further reduce communication time by removing redundant information, and designed a lightweight CNN-based FR module to recover the intermediate tensor by learning from the received compressed tensor to improve accuracy. We designed a reward function to trade off inference accuracy against task completion time, and formulated the CNN inference offloading problem as a maximization problem with the goal of maximizing the average inference accuracy and throughput over the long term. To address the optimization problem, we proposed GRL-AECNN to make optimal offloading decisions, and used a step-by-step approach to accelerate the training process. The experimental results show that GRL-AECNN achieves better performance in terms of average inference accuracy, service success reliability, and average throughput than the state-of-the-art methods, which highlights the effectiveness of GRL-AECNN in aggregating all the information of the dynamic MEC, thereby making robust offloading decisions.

REFERENCES

[1] N. Li, A. Iosifidis, and Q. Zhang, "Attention-based feature compression for CNN inference offloading in edge computing," in ICC 2023 - IEEE International Conference on Communications, 2023, pp. 967–972.
[2] S. Guo, B. Xiao, Y. Yang, and Y. Yang, "Energy-efficient dynamic offloading and resource scheduling in mobile cloud computing," in IEEE INFOCOM 2016, pp. 1–9.
[3] J. Liu and Q.
Zhang, “To improve service reliability for AI-powered time-critical services using imperfect transmission in MEC: An experimental study,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9357–9371, 2020.
[4] T. X. Tran and D. Pompili, “Joint task offloading and resource allocation for multi-server mobile-edge computing networks,” IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 856–868, 2019.
[5] Z. Chang, L. Liu, X. Guo, and Q. Sheng, “Dynamic resource allocation and computation offloading for IoT fog computing system,” IEEE Transactions on Industrial Informatics, vol. 17, no. 5, pp. 3348–3357, 2021.
[6] L. Huang, S. Bi, and Y.-J. A. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Transactions on Mobile Computing, vol. 19, no. 11, pp. 2581–2593, 2020.
[7] X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez, “SkipNet: Learning dynamic routing in convolutional networks,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[8] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C.-Z. Xu, “Dynamic channel pruning: Feature boosting and suppression,” in International Conference on Learning Representations, 2019.
[9] S. Teerapittayanon, B. McDanel, and H. Kung, “BranchyNet: Fast inference via early exiting from deep neural networks,” in 2016 23rd International Conference on Pattern Recognition (ICPR), 2016, pp. 2464–2469.
[10] H.-J. Jeong, I. Jeong, H. J. Lee, and S. M. Moon, “Computation offloading for machine learning web apps in the edge server environment,” in IEEE International Conference on Distributed Computing Systems (ICDCS), 2018.
[11] J. Shao and J. Zhang, “BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems,” in IEEE ICC Workshops, 2020, pp. 1–6.
[12] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C.
Xu, “GhostNet: More features from cheap operations,” in IEEE/CVF CVPR, June 2020.
[13] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in ECCV, September 2018.
[14] X. Luo, H.-H. Chen, and Q. Guo, “Semantic communications: Overview, open issues, and future research directions,” IEEE Wireless Communications, vol. 29, no. 1, pp. 210–219, 2022.
[15] B. Juba and S. S. Vempala, “Semantic communication for simple goals is equivalent to on-line learning,” in Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT), vol. 6925, 2011, pp. 277–291.
[16] M. Jankowski, D. Gündüz, and K. Mikolajczyk, “Wireless image retrieval at the edge,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 89–100, 2021.
[17] J. Li, H. Gao, T. Lv, and Y. Lu, “Deep reinforcement learning based computation offloading and resource allocation for MEC,” in IEEE WCNC 2018, pp. 1–6.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF CVPR, June 2016.
[19] R. Desislavov, F. Martínez-Plumed, and J. Hernández-Orallo, “Compute and energy consumption trends in deep learning inference,” CoRR, vol. abs/2109.05472, 2021. [Online]. Available: https://arxiv.org/abs/2109.05472
[20] V. Weaver, “The GFLOPS/W of the various machines in the VMW research group,” [EB/OL], https://web.eece.maine.edu/~vweaver/group/green machines.html, accessed June 17, 2023.
[21] Z. Yang, L. Zhu, Y. Wu, and Y. Yang, “Gated channel transformation for visual recognition,” in IEEE/CVF CVPR, June 2020.
[22] G. Alcantara, “Empirical analysis of non-linear activation functions for deep neural networks in classification tasks,” 2017.
[23] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in ICML 2016, pp. 2014–2023.
[24] F.-F. Li, M. Andreeto, M. Ranzato, and P.
Perona, “Caltech 101,” Apr 2022. [Online]. Available: https://data.caltech.edu/records/20086
[25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR 2015, May 7–9.
[26] K. Papadaki and V. Friderikos, “Robust scheduling in spatial reuse TDMA wireless networks,” IEEE Transactions on Wireless Communications, vol. 7, no. 12, pp. 4767–4771, 2008.
[27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient ConvNets,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017.
[28] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017.

APPENDIX A
FLOPS COUNT IN CNN

In CNNs, the input feature map of a CL is derived from the output feature map of its preceding CL. For instance, consider CL $l$, which takes the input feature map $X_{l-1} \in \mathbb{R}^{C_{l-1} \times H_{l-1} \times W_{l-1}}$ and produces the output feature map $X_{l} \in \mathbb{R}^{C_{l} \times H_{l} \times W_{l}}$. The FLOPs of CL $l$ capture the computational complexity associated with processing data through CL $l$. According to [27], the FLOPs calculation of CL $l$ is primarily influenced by the dimensions of the input and output channels, the kernel size employed by the convolution operation, and the dimensions of the resulting output height and width. As such, the FLOPs for CL $l$ are calculated using the following formula from [27]:

$\xi_{l} = C_{l-1} C_{l} f_{l}^{2} H_{l} W_{l}, \quad \forall l \in \mathcal{L},$ (A1)

where $f_{l}$ is the kernel size of CL $l$. Regarding the FC layer, the computational operations primarily consist of multiplying the input data by the weight parameters and then applying activation functions.
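Eq. (A1) can be checked with a few lines of Python; the layer dimensions in the example below are made-up values for illustration, not taken from the networks evaluated in the paper.

```python
def conv_flops(c_in, c_out, kernel, h_out, w_out):
    """FLOPs of a convolutional layer per Eq. (A1):
    xi_l = C_{l-1} * C_l * f_l^2 * H_l * W_l."""
    return c_in * c_out * kernel ** 2 * h_out * w_out

# Example: a hypothetical 3x3 convolution mapping 64 -> 128 channels
# onto a 56x56 output feature map.
flops = conv_flops(64, 128, 3, 56, 56)
print(flops)  # 231211008
```

The count grows linearly in each of the five factors, which is why channel pruning (reducing $C_{l-1}$ or $C_l$) directly reduces the FLOPs of both the pruned layer and its successor.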
We consider an FC layer $l'$ with the input feature map $X^{\mathrm{in}}_{l'} \in \mathbb{R}^{1 \times D^{\mathrm{in}}_{l'}}$ and output feature map $X^{\mathrm{out}}_{l'} \in \mathbb{R}^{1 \times D^{\mathrm{out}}_{l'}}$. According to [28], the FLOPs of an FC layer $l'$ can be calculated as

$\xi_{l'} = D^{\mathrm{in}}_{l'} D^{\mathrm{out}}_{l'},$ (A2)

where $D^{\mathrm{in}}_{l'}$ and $D^{\mathrm{out}}_{l'}$ are the input and output dimensionality of FC layer $l'$, respectively.

APPENDIX B
PROOF OF THEOREM 1

To prove Theorem 1, we first introduce the logistic sigmoid function $\mathrm{sigmoid}(x) = \frac{e^{x}}{e^{x}+1}$. The function $\mathrm{sigmoid}(x)$ is a nonlinear function with values between 0 and 1 [22]. As shown in Fig. B1, $\mathrm{sigmoid}(x)$ approaches 1 as $x \to 5$, while $\mathrm{sigmoid}(x)$ approaches $\frac{1}{2}$ as $x \to 0$.

Fig. B1: The $\mathrm{sigmoid}(x)$ function.

Based on the above, we define a new function $\psi'(x) = 2\,(1 - \mathrm{sigmoid}(x))$ and derive that $\psi'(x)$ approaches 0 as $x$ approaches 5, and approaches 1 as $x$ approaches 0. Extending this to our original function $\psi(x) = 2\left(1 - \mathrm{sigmoid}\left(\frac{5x}{\sigma^{k}_{u}}\right)\right)$, we can deduce that $\psi(x)$ behaves similarly: as $x$ approaches $\sigma^{k}_{u}$, $\psi(x)$ approaches 0; and as $x$ approaches 0, $\psi(x)$ approaches 1.
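The limiting behaviour of $\psi$ used in the proof can be verified numerically; the sketch below uses an arbitrary value of $\sigma^{k}_{u}$ for illustration and is not code from the paper.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: e^x / (e^x + 1) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def psi(x, sigma):
    """psi(x) = 2 * (1 - sigmoid(5x / sigma)) from the proof of Theorem 1."""
    return 2.0 * (1.0 - sigmoid(5.0 * x / sigma))

sigma = 2.0                # arbitrary stand-in for sigma_u^k
print(psi(0.0, sigma))     # 1.0, since sigmoid(0) = 0.5
print(psi(sigma, sigma))   # ~0.013, since sigmoid(5) ~ 0.993
```

At $x = 0$ the function evaluates exactly to 1, and at $x = \sigma^{k}_{u}$ it is already within about 0.013 of 0, which matches the limiting behaviour claimed in the proof.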