Metasurfaces-Integrated Wireless Neural Networks for Lightweight Over-The-Air Edge Inference
Kyriakos Stylianopoulos, Graduate Student Member, IEEE, Mario Edoardo Pandolfo, Paolo Di Lorenzo, Senior Member, IEEE, and George C. Alexandropoulos, Senior Member, IEEE

Abstract—The upcoming sixth Generation (6G) of wireless networks envisions ultra-low latency and energy-efficient Edge Inference (EI) for diverse Internet of Things (IoT) applications. However, traditional digital hardware for machine learning is power intensive, motivating the need for alternative computation paradigms. Over-The-Air (OTA) computation is regarded as an emerging transformative approach that assigns computational tasks to the wireless channel itself. This article introduces the concept of Metasurfaces-Integrated Neural Networks (MINNs), a physical-layer-enabled deep learning framework that leverages programmable multi-layer metasurface structures and Multiple-Input Multiple-Output (MIMO) channels to realize computational layers in the wave propagation domain. The MINN system is conceptualized as three modules: Encoder, Channel (uncontrollable propagation features and metasurfaces), and Decoder. The first and last modules, realized respectively at the multi-antenna transmitter and receiver, consist of conventional digital or purposely designed analog Deep Neural Network (DNN) layers, while the metasurface responses of the Channel module are optimized alongside all modules as trainable weights. This architecture enables computation offloading into the end-to-end physical layer, flexibly among its constituent modules, achieving performance comparable to fully digital DNNs while significantly reducing power consumption.
The training of the MINN framework, two representative variations, and performance results for indicative applications are presented, highlighting the potential of MINNs as a lightweight and sustainable solution for future EI-enabled wireless systems. The article is concluded with a list of open challenges and promising research directions.

Index Terms—Stacked intelligent metasurfaces, edge inference, over-the-air computing, MIMO, deep diffractive neural networks.

I. INTRODUCTION

Massive machine-type communications constitute one of the core use cases of fifth Generation (5G) networks, promoting the Internet of Things (IoT) paradigm. In the upcoming sixth Generation (6G), and beyond, ultra-low latency and energy-efficient Device-to-Device (D2D) wireless links are envisioned, necessitating innovations toward power- and cost-efficient PHYsical (PHY) layer components, combined with unprecedented advancements in information processing algorithms and respective applications.

This work has been supported by the SNS JU projects 6G-DISAC and 6G-GOALS under the EU's Horizon Europe research and innovation program under Grant Agreements numbers 101139130 and 101139232, respectively. K. Stylianopoulos and G. C. Alexandropoulos are with the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 16122 Athens, Greece (e-mails: kstylianop, alexandg@di.uoa.gr). M. E. Pandolfo and P. Di Lorenzo are with the National Inter-University Consortium for Telecommunications (CNIT), Parma, Italy. M. E. Pandolfo is also with the DIAG Department, Sapienza University of Rome, via Ariosto 25, Rome, Italy. P. Di Lorenzo is also with the DIET Department, Sapienza University of Rome, Via Eudossiana 18, Rome, Italy (e-mails: {marioedoardo.pandolfo, paolo.dilorenzo}@uniroma1.it).
To deal with the anticipated amount of IoT-generated data, including Radio-Frequency (RF) signals intended for positioning and sensing, processing at the network edge is essential. To this end, 6G is expected to adopt a cross-layer design, transcending the hard distinctions between the user plane and the PHY layer. Additionally, the trend towards data-driven applications brings forth the role of Edge Inference (EI) and goal-oriented communications. In the latter paradigm, the Transmitter (TX) encodes and sends data to the Receiver (RX), which does not aim to perfectly reconstruct them, but rather to extract information relevant to some computational task the network is designed to perform. EI constitutes a special case of goal-oriented communications, focusing on enabling the RX to infer a target feature of the transmitted signal that is not explicitly present in the original input data. To this end, a dataset of collected input-target pairs is leveraged to infer target values from past examples. EI is beneficial first in terms of computation, since the RX need not accurately reconstruct the input data, but also in terms of communication resources, since the TX may choose encodings that preserve only the information relevant to the target feature.

Machine Learning (ML) applications for IoT implemented at the network edge are gaining momentum in the landscape of 6G as an alternative paradigm of information processing. ML tools facilitate the identification of patterns from data, capturing precise characteristics of the intended deployment environment, which may not be accurately described by model-based approaches due to their unrealistic assumptions. Additionally, ML algorithms incur most of their computational complexity during training, which may take place offline at an earlier stage, offering low-latency computations during deployment.
On the contrary, the main cost associated with ML, especially in the EI domain, is that of hardware complexity. Parallel processing units are predominantly utilized to efficiently execute the computations involved in Deep artificial Neural Networks (DNNs), resulting in substantial increases in power consumption.

A transformative idea lately gaining traction posits that communications computational tasks need not be confined solely to the transceivers [1]. Capitalizing on the notion of goal-oriented smart wireless environments, including infrastructure such as programmable MetaSurfaces (MSs) (a.k.a. Reconfigurable Intelligent Surfaces (RISs) [2]) capable of sensing and intentionally reconfiguring the signal propagation environment through intelligent beam shaping, the wireless channel itself can become an active part of the computational chain. In this context, the channel evolves from a passive propagation medium into a goal-driven computational entity, effectively handling part of the processing load traditionally carried out by digital hardware. By shaping wave transformations Over-the-Air (OTA) using passive, active, or hybrid analog operations, the goal-oriented smart wireless environments paradigm envisions executing portions of the feature extraction, compression, filtering, or interference management computations with sufficiently low energy consumption. This redistributes, or even removes, computational load away from conventional energy-hungry TX/RX components and into the wave propagation level, enabling more sustainable and efficient ML-enabled wireless systems as well as ML applications such as EI.
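The basic mechanism behind such OTA computation is signal superposition: concurrently transmitted signals add up at the receiver, so suitably pre-scaled transmissions let the propagation medium itself produce an aggregate. The toy sketch below illustrates OTA summation of sensor readings; the channel-inversion pre-scaling, perfect channel knowledge, and unconstrained transmit power are idealizing assumptions for illustration, not part of the system model discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# K sensors each hold one scalar reading; the receiver wants their sum.
K = 5
readings = rng.uniform(0.0, 1.0, K)
h = (rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2)  # per-sensor channel gains

# Each sensor pre-scales its reading by the inverse of its own channel
# (idealized assumption), so the superimposed received signal equals the
# desired sum plus thermal noise: the channel performs the addition.
tx = readings / h
rx = np.sum(h * tx) + 1e-3 * (rng.normal() + 1j * rng.normal())

print(abs(rx.real - readings.sum()) < 0.01)  # True
```

The receiver obtains the sum without ever decoding the individual readings, which is exactly the resource saving the OTA computing paradigm targets.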
In this article, we investigate the integration of programmable multi-layer MSs into wireless propagation environments, as well as their joint optimization with transceiver RF hardware components, to realize OTA computations analogous to those performed by DNNs, thereby enabling an effective computational framework for wireless ML applications. In this way, ML computations that traditionally take place in digital processors can be realized with OTA operations directly in the domain of ElectroMagnetic (EM) wave propagation. The role of the MS-enabled smart wireless medium is therefore highlighted as a computational entity within which appropriately designed Multiple-Input Multiple-Output (MIMO) systems can be treated End-to-End (E2E) as single DNNs leveraging digital-, analog-, and wave-domain-based layers. Developing such PHY-layer-enabled wireless ML systems has the potential to greatly reduce the complexity and requirements for lightweight devices, like IoT ones, to perform EI.

The remainder of this article is organized as follows. Section II illustrates the basic principles behind MS-based ML and discusses their integration into wireless systems. Section III presents the proposed Metasurfaces-Integrated Neural Network (MINN) framework, detailing training and deployment considerations, while Section III-D examines their multiple variations and applications. Section IV discusses open challenges with MINNs and outlines important research directions. Finally, Section V concludes the article.

II. DEEP DIFFRACTIVE NEURAL NETWORKS

This section discusses the fundamental principles of building MS-based DNNs and their integration into wireless systems.

A. Basic Principle

The main technology enabling wave-domain implementation of DNNs is that of Stacked Intelligent Metasurfaces (SIM) [1].
Such multi-layer MS structures are composed of densely placed thin layers of diffractive MSs, each comprising multiple metamaterials with tunable EM responses, all managed by a controller. The overall SIM is typically assumed to be enclosed in absorbing material [3], and the propagation of the signal between elements of consecutive layers is governed by geometrical optics. By purposely controlling the responses, one can perform particular operations on the signals that are forwarded from the first SIM layer. Since such operations are linear with respect to the impinging signal at every layer, and all elements of one layer contribute to the arriving signal at every element of the successor layer, this structure loosely resembles a fully-connected linear layer, which is the fundamental component behind DNNs. By exploiting this principle, Deep Diffractive Neural Networks (D²NNs) may be materialized by treating the SIM responses as trainable weights [3].

In the D²NN context, the input data to the network must first be available in the RF domain. This is readily the case when the network is tasked with processing RF signals in sensing applications, which gives D²NNs strong advantages over digital DNNs in terms of energy efficiency and latency, since only computationally lightweight devices are needed: no digital-to-analog conversions need to take place for the data to be processed. On the contrary, when D²NNs are used for performing inference on digital input data, the transfer to the RF domain is of particular importance. In [4], this conversion is achieved through a programmable input layer at a SIM. Each element of the input data vector is mapped to the response of a corresponding element of that layer using rudimentary techniques (e.g., amplitude modulation). Then, a beacon signal, typically from a single antenna, illuminates the back of the first layer to bootstrap the forward network pass.
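The fully-connected-layer analogy can be made concrete with a small numerical sketch: each diffractive layer applies a tunable per-element phase shift (a diagonal complex matrix), and propagation between consecutive layers mixes all elements through a fixed dense complex matrix. The layer sizes and the random propagation matrices below are illustrative placeholders, not a calibrated EM model.

```python
import numpy as np

rng = np.random.default_rng(1)

def propagation_matrix(n_out, n_in):
    # Stand-in for the fixed inter-layer propagation coefficients, which in
    # practice follow from the element geometry via diffraction theory.
    return (rng.normal(size=(n_out, n_in))
            + 1j * rng.normal(size=(n_out, n_in))) / np.sqrt(2 * n_in)

def sim_forward(x, phase_cfgs, prop_mats):
    """Cascade of diffractive layers: free-space propagation (dense complex
    matrix) followed by the tunable per-element phase shift (diagonal)."""
    s = x
    for phases, W in zip(phase_cfgs, prop_mats):
        s = np.exp(1j * phases) * (W @ s)
    return s

n_layers, n_elem = 3, 64
phase_cfgs = [rng.uniform(0, 2 * np.pi, n_elem) for _ in range(n_layers)]  # trainable
prop_mats = [propagation_matrix(n_elem, n_elem) for _ in range(n_layers)]
x = rng.normal(size=n_elem) + 1j * rng.normal(size=n_elem)  # impinging wavefront

y = sim_forward(x, phase_cfgs, prop_mats)
# The cascade is linear in the impinging signal, like a stack of bias-free
# fully-connected linear layers:
print(np.allclose(sim_forward(2 * x, phase_cfgs, prop_mats), 2 * y))  # True
```

The linearity check mirrors the observation in the text: however many layers are stacked, the SIM implements a (trainable) linear map on the impinging signal.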
To obtain the D²NN's output for inference applications, task-specific designs are necessary. In classification problems, where the output corresponds to the index of one or more predefined classes to which the input data are assumed to belong, signal receptors equal in number to the classes are placed after the final D²NN layer. For single-class classification, the predicted class corresponds to the index of the receptor with the highest observed signal strength. In regression problems, obtaining the D²NN's output is relatively less straightforward. It is possible to interpret the signal at the receptors as being amplitude- or phase-modulated; however, extremely fine-grained beams need to be implemented by the SIM to achieve the desired accuracy, which is impractical with the available D²NN hardware designs [4]. Considering the training phase in particular, once the forward pass is performed and the output is converted through analog-to-digital converters, the result is compared to the expected target value for each of the training data instances, and the loss function is then digitally computed. The backpropagation algorithm is then applied to determine the changes in the responses of each of the SIM elements, and the process is repeated until convergence.

Once the data is converted to RF signals, D²NNs perform computations at the speed of light, giving them a competitive advantage over digital DNNs. Arguably, more important are the benefits in terms of power consumption. MSs usually comprise near-passive circuitry, such as varactors, requiring as little as nanoWatts to operate [2]. This is a tremendous energy efficiency improvement compared to standard DNN processors (particularly, Graphics Processing Units (GPUs)), which consume hundreds of Watts for inference. Even more so, once the desired responses of the constituent metamaterials are determined (presumably in simulation software), the MSs may be fabricated to be completely passive.
In such cases, the only power-consuming components of D²NNs are the feeding antenna and the output receptors, all of which may operate under very low power settings in controlled wireless environments where signal attenuation and multipath effects are negligible.

Fig. 1. Conceptual architecture of the Metasurfaces-Integrated Neural Network (MINN) framework comprising three core modules. The Encoder and Decoder modules, which may incorporate neural network structures, are collocated respectively at the multi-antenna TX and RX nodes, while the Channel module performs OTA computations leveraging the fading coefficients of the wireless channel, the programmable EM responses of the multi-layer MS structures constituting it, as well as the properties of the RX thermal noise. The E2E MIMO system is described as the composition of these three modules; therefore, the chain rule may be applied to compute the respective gradients and, consequently, optimize the heterogeneous DNN weights and constituent MS responses through gradient descent. It is noted that: i) the physical MS device(s) enabling the reconfigurability of the signal propagation environment may be collocated at either the TX or the RX, instead of being placed in between them (in this case, it is probable that the MS(s) affect only the portion of the signal components impinging on them), or at both; and ii) the Encoder and Decoder DNN components can be implemented either through conventional digital processors or equivalent analog computation units (e.g., liquid state machines, memristors, and multi-layer MS structures, such as Stacked Intelligent Metasurfaces (SIM)).
It is noted that, for applications where the input data are already in the RF domain, such as those that fall under communications, localization, and/or sensing, the elimination of the need for digital-to-analog conversion, with its associated power and latency overheads, makes D²NNs an ideal candidate among "exotic" analog DNN hardware, enabling lightweight ML applications, such as EI.

B. Integration in Wireless Systems

Despite their analog nature, D²NNs have been predominantly developed as DNN hardware accelerators for general inference problems [4]. To materialize the promised advantages under EI applications, the integration of SIM-based computing within wireless communication infrastructure requires further investigation. In fact, the current design of D²NNs is largely incompatible with existing and future wireless system stacks. One of the main considerations is the digital-to-analog conversion of the input data. Encoding each element of this data (e.g., an image pixel) directly as a pre-mapped response of a programmable first SIM layer might be severely constraining for practical applications, since: i) the power consumption attributed to the SIM controller might be significant; ii) it is difficult to obtain sufficient precision in programmable MS responses to accurately encode input data; and iii) the size of the first SIM layer grows with the dimensionality of the data. Besides, the current D²NN mode of operation does not utilize the capabilities of contemporary wireless systems. In particular, MIMO systems leverage transmission over multiple antennas to achieve spatial multiplexing or beamforming, thus providing additional degrees of freedom for feeding input data to the SIM device. Under this prism, standard PHY operations, such as source encoding, modulation, and precoding/combining, could be exploited to integrate D²NNs in MIMO systems. Another important consideration of SIM-based computing for EI is the wireless channel.
D²NNs have originally been developed for almost free-space signal propagation conditions with high Signal-to-Noise Ratio (SNR) levels [4]. However, in practical wireless systems, small- and large-scale fading effects resulting in fluctuating SNR levels are present. For these systems, a SIM can play a dual role: its learned responses can perform successful inference, akin to DNN layers, while also adapting to channel conditions. It is important to highlight that the wireless channel need not be treated as a source of undesirable behavior; it can be employed as an additional means of computation, following the OTA computing paradigm, where the superimposition of wireless signals is exploited to carry out computations [5].

III. METASURFACES-INTEGRATED NEURAL NETWORKS

In this section, we elaborate on the MINN framework [6], a generic MIMO wireless system setup operating E2E as a single DNN, integrating digital and/or analog neural network layers, beamforming operations, and D²NN layers. This PHY-layer-enabled DNN framework consists of three core modules, as illustrated in Fig. 1. Each module represents a physical entity implemented in a distinct system device (i.e., TX, RX, and multi-layer MS structures and/or SIM realizing the smart wireless environment), as detailed in the sequel.

A. Overall Architecture

As depicted in Fig. 1, the TX operates the Encoder module, which feeds the data this device possesses into its neural network layers to directly output the signal that is to be transmitted. In that regard, the TX is responsible for initial feature extraction, compression, and modulation, while accounting for error correction and other channel-related operations.
When performing EI, the principles of joint source-channel coding are arguably more convenient to follow, suggesting that all of the above operations are performed implicitly through the hidden layers of the module, rather than devising separate functional blocks for each one (the latter strategy is common under the separated source-channel coding schemes of conventional communications). Regardless of the architecture of the module, the transmitted signal should be in a form that makes use of all degrees of freedom offered by modern MIMO systems (in the spatial, temporal, and frequency domains), and adheres to average or maximum power constraints. Three-dimensional complex-valued tensors can be used to represent the TX signal, with each element representing the baseband signal at a certain antenna, frequency bin, and time slot. In practice, the Encoder can be implemented through a conventional DNN; however, in EI applications involving lightweight devices with limited resources, this module's complexity is a restrictive factor.

The Channel module within the MINN framework contains two components, an uncontrollable and a programmable one. The former consists of the typical wireless channel itself together with the thermal noise at the RX side. Wave propagation is well known to be subject to fading, thus imposing linear effects on the transmitted signal, whereas thermal noise is commonly additive white Gaussian. The second component concerns the programmable MSs influencing the wireless TX-RX link. These devices constitute key features of the overall programmable MIMO channel, offering dynamically reconfigurable channel matrix coefficients. Under the MINN framework, the responses of multi-layer MS structures and/or SIM are treated similarly to trainable weights of typical DNN layers: they are optimized through a training process with the goal of configuring the overall channel to perform OTA computations aiding inference.
In this way, ML computations that would otherwise be performed via digital DNN layers can be realized by the programmable MIMO channel, thereby offloading either the TX, the RX, or both. This represents a paradigm shift over traditional communication systems, where the wireless channel is treated as a source of adverse effects motivating countermeasures at the transceivers. Note that the precise form of the computations attainable by the Channel module varies depending on the wireless environment as well as on the capabilities of the MSs/SIM. Most state-of-the-art designs for the latter [3], [4] offer computations maintaining the linear nature imposed by the channel response matrix, and only recently have nonlinear implementations started being considered [7], [8], as will be discussed in the sequel.

As shown in Fig. 1, the final module of the MINN framework is the Decoder, which is realized at the RX. The signal passing through the Channel module is collected therein, with the goal of extracting the embedded information and outputting the inference result. To achieve this, operations similar to channel equalization, demodulation, and source decoding are implicitly implemented. Crucially, the Decoder does not aim to reconstruct the exact form of the input signal, but rather to perform feature extraction on it. Similar to the Encoder, realizing the Decoder through a conventional DNN entails increased complexity, which may be prohibitive for certain inference applications on lightweight wireless devices.

B. Training for Static and Dynamic MS Responses

The sequence of operations described above concerns the forward pass of the considered PHY-layer-enabled DNN framework. The overall model can be expressed mathematically as a functional composition of the Decoder, Channel, and Encoder modules, in that order [6].
This conceptual structure is therefore compatible with the backpropagation algorithm for DNN training, where the last layer/module has its trainable weights updated first through Stochastic Gradient Descent (SGD), and each preceding layer has its weights updated with respect to the gradient of its successor layer or module. For the case of MINN training under varying fading with stationary distributions, this procedure can be detailed as follows. During training, each data instance is paired with a different random channel instance, and the forward pass under these channel conditions is performed to derive the network's output value. The error between that prediction and the known target value corresponding to the input data is measured with the help of an objective function (e.g., mean squared error or cross entropy). The gradient with respect to the last layer of the Decoder is then computed, and the error signals are backpropagated through this module, and in turn forwarded to the Channel and Encoder modules. Note that this training procedure implies that the data is stochastically independent from the channel conditions, which might not always be guaranteed (e.g., when data arises from wireless sensors in target sensing scenarios, the presence of a target affects the channel).

Assuming training takes place in an accurate simulator, all DNN weight updates can be digitally computed and only the trained modules need to be deployed. This implies that the MSs included in any of the MINN modules can be fabricated to implement the optimized responses in a fixed, passive manner. Note that this procedure also treats the case of dynamic channel fading variation, and, as a result, the trained MINN can adapt to instantaneous channel changes as long as their statistics remain the same. This happens, of course, at the expense of prolonged training times, since the effective dataset is the Cartesian product of all data and channel instances.
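The per-instance training procedure described above can be sketched end-to-end. The toy model below is real-valued for clarity: linear Encoder and Decoder layers, a channel whose element-wise trainable gains d stand in for the MS responses, a fresh random (Rician-like, stationary-statistics) channel draw per data instance, and hand-coded chain-rule gradients. All sizes, the target feature, and the channel statistics are illustrative assumptions, not the article's simulation setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy E2E MINN: Encoder -> Channel (with trainable element gains d) -> Decoder.
n_in, n_tx, n_ms, n_rx, n_out = 8, 4, 16, 4, 2
W_enc = 0.3 * rng.normal(size=(n_tx, n_in))      # trainable Encoder weights
W_dec = 0.3 * rng.normal(size=(n_out, n_rx))     # trainable Decoder weights
d = np.ones(n_ms)                                # trainable "MS response" gains
H1_bar = rng.normal(size=(n_ms, n_tx)) / np.sqrt(n_tx)   # mean channel components
H2_bar = rng.normal(size=(n_rx, n_ms)) / np.sqrt(n_ms)
lr, losses = 0.01, []

for step in range(3000):
    x = rng.normal(size=n_in)
    t = x[:n_out]                                # toy target feature of the input
    # Fresh random channel instance per data instance (stationary statistics).
    H1 = H1_bar + 0.05 * rng.normal(size=H1_bar.shape)
    H2 = H2_bar + 0.05 * rng.normal(size=H2_bar.shape)
    s = W_enc @ x                                # Encoder module
    u = H1 @ s
    v = d * u                                    # programmable Channel component
    y = H2 @ v + 0.01 * rng.normal(size=n_rx)    # uncontrollable fading + noise
    out = W_dec @ y                              # Decoder module
    err = out - t
    losses.append(float(err @ err))
    # Chain rule through the module composition (Decoder, Channel, Encoder).
    g_y = W_dec.T @ err
    g_v = H2.T @ g_y
    g_u = d * g_v
    W_dec -= lr * np.outer(err, y)
    d -= lr * g_v * u
    W_enc -= lr * np.outer(H1.T @ g_u, x)

print(np.mean(losses[:100]) > np.mean(losses[-100:]))  # True: loss decreased
```

Because the channel draw varies every step, the learned weights fit the channel distribution rather than one realization, mirroring the Cartesian-product effect on training time noted above.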
Another theoretical consideration is the effect of random thermal noise in the Channel module. Since this noise is additive, it does not impede the calculation of gradients. Moreover, noise whiteness guarantees that each observation of the received signal is an unbiased estimator of its noise-free version, a fact that ensures SGD convergence. Evidently, cases with low received SNRs require more extensive training, and, empirically, it is beneficial to pre-train MINNs first with data at high SNR values and then adapt to low SNR conditions through fine-tuning and transfer learning techniques [6]. An MNIST classification performance evaluation with the presented MINN architecture is demonstrated in Fig. 2 for static fading conditions. As observed, in the high SNR regime and with a sufficient number of MS elements, the performance of the presented PHY-layer-enabled DNN framework with optimized static SIM approaches that of fully digital DNNs.

Fig. 2. Mean accuracy of different MINN versions for MNIST classification, considering fixed SNR during training and inference: (a) MS layers each with 8 × 8 metamaterials and SNR of 5 dB; (b) MS layers each with 12 × 12 metamaterials and SNR of 10 dB. All simulated 4 × 4 MIMO system setups included a SIM positioned close to the TX at a distance corresponding to approximately 7.5% of the total TX-RX distance, with its broadside perpendicular to the TX-RX line of sight, differing in the number of MS layers and the number of metamaterials per layer. A geometric channel with 10 scatterers, yielding static fading conditions, has been considered. Two convolutional layers followed by a linear layer were used at the TX digital DNN, while the RX digital DNN comprised three feedforward layers. The "No SIM" baseline refers to performing inference with only the digital DNNs at the transceivers, i.e., without manipulating the wireless channel through any MS (a simplified MINN variation including only the uncontrollable component of the Channel module). The "Digital DNN" benchmark refers to performing MNIST classification entirely on the TX with the same number of layers, without accounting for channel transmission, and is therefore an upper bound. All networks were trained for 50 epochs irrespective of their size. As observed, by increasing the number of SIM elements and the received SNR level, larger MINNs approach the Digital DNN bound, whereas the removal of the SIM is detrimental to the training process. It is additionally demonstrated that, under lower SNRs, deeper MINN versions are not always more efficient, since they suffer from higher signal attenuation through the SIM layers.

The MINN framework also includes the option where the MS-based DNN layers, at any of the MINN modules, adapt their constituent responses at every channel coherence block. This option, however, necessitates a dedicated DNN at each module hosting MS(s), serving as the structure's dynamic response controller [6]. These controllers need to be trained in an E2E coordinated manner to map instantaneous channel observations to MS responses, a fact that increases the number of trainable DNN parameters and the inter-module training control overhead. Nevertheless, in highly dynamic wireless conditions with correlated fading, this MINN version with dynamic MS responses promises significant performance improvements.
C. Two MINN Architecture Variations

The MINN framework offers a large variety of potential implementations, leveraging the computing potential of multi-layer MS structures, of MIMO systems with eXtremely Large (XL) antenna arrays, or of both. The first, all-MSs variation enables replacing digital DNN layers with wave-domain-based DNNs [9], resulting in computationally lightweight TX and/or RX devices, whereas the second exploits the XL number of transformations offered by the wireless channel. An example MINN architecture for MNIST handwritten digit classification based primarily on MSs at all three modules is depicted in Fig. 3(a). The first (Encoder module) and last (Decoder module) MS layers of the wave-domain-based E2E DNN are respectively implemented at the TX and RX, while the hidden MS layers constituting the Channel module are handled by the programmable MIMO channel [6]. The latter implies that the smart wireless environment may act as a shared OTA computing resource that can be dynamically allocated to lightweight end devices for ML applications [9]. Under this viewpoint, future IoT networks may be designed to offer "OTA computation as a service," in order to partially offload devices from computationally demanding D2D inference tasks.

Figure 3(b) illustrates an XL MIMO system operating as an Extreme Learning Machine (ELM) [7]. This single-hidden-layer E2E neural network paradigm has recently been shown to offer OTA digit classification leveraging the uncontrollable XL MIMO channel. In particular, the TX hosting the Encoder module performs analog modulation, encoding digit images at its XL antenna array. This first neural network layer is followed by the hidden layer formulated by the channel gain coefficients of the uncontrollable Channel module. The last layer, constituting the Decoder module and realized at the XL RX, collects the received signals at each antenna element, feeding each of them to a nonlinear RF component.
Then, it multiplies each of them with a controllable weight and, finally, all weighted analog outputs are combined to provide the output inferred digit. For the latter purpose, the RX may deploy an adequately designed conventional antenna array followed by RF circuitry [7], or diode-based diffracting MS(s) [8] to perform nonlinear activation. It was recently shown in [7] that, for static fading conditions, the combining weights of this MINN-ELM variation can be optimized in closed form, and may be quickly fine-tuned as fading changes gradually over time, without relying on costly backpropagation. As demonstrated in Fig. 4, MINN-ELMs with an XL MS at the RX can perform equally well to fully digital ELMs that ignore fading.

Fig. 3. Two variations of the MINN architecture for the example of MNIST handwritten digit classification. (a) An all-MSs MIMO system where all three modules are primarily implemented by MSs; all except the first and last MS-based DNN layers, placed respectively at the TX and RX, can be flexibly installed within the signal propagation environment and/or near the end devices. At the TX, the forward network pass initiates through a single antenna illuminating the first layer of the E2E wave-domain-based DNN, which has as many elements as the number of features within the input data (i.e., the number of MNIST image pixels). This data is encoded via the EM responses of this layer's diffractive metamaterials [4]. At the RX, the final MS layer, consisting of ten fully absorbing metamaterials each followed by an energy detector (with each pair representing one of the possible MNIST digits), provides the output inferred digit. Although conceptually regarded as parts of the Channel module handling all OTA computations, the hidden diffractive MS layers may be flexibly distributed across all system physical devices. For example, some can be collocated with the transceiver devices and the remaining installed, all together or in groups, within the signal propagation environment. Alternatively, those diffractive MSs may be placed at one of the end devices and inside the MIMO channel, or solely within that channel, enabling lightweight transmissions, receptions, or both. (b) An XL MIMO system with analog combining of the nonlinearly processed received signals, leveraged as a single-hidden-layer neural network; the Encoder and Decoder modules are respectively carried out by a multi-antenna TX and RX, and the Channel module is devoid of programmable devices [7]. At the TX, each pixel of the input data is encoded and fed to a distinct antenna, while, at the RX, the signal received at each antenna is first fed to a nonlinear RF component, then multiplied by a controllable weight, and, finally, all weighted analog outputs are combined to provide the output inferred digit. This XL MIMO system capitalizes on the random transformations imposed by the uncontrollable Channel module on the feature signals before being superimposed at the RX antennas, operating as an OTA ELM.
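The closed-form optimization of the combining weights described above follows the standard ELM recipe: treat the channel as a fixed random hidden layer, apply a fixed nonlinearity, and fit only the output weights by ridge-regularized least squares. A minimal NumPy sketch under illustrative assumptions (random stand-in data, complex Gaussian gains for the Rayleigh-like channel, and envelope detection standing in for the nonlinear RF component; all dimensions and names are hypothetical, not taken from [7]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d input features (TX antennas), m RX
# antennas / MS elements, n training samples, c classes.
d, m, n, c = 30, 256, 500, 2

# Random stand-ins for encoded feature vectors and one-hot labels.
X = rng.standard_normal((n, d))
Y = np.eye(c)[rng.integers(0, c, size=n)]

# Uncontrollable channel as the fixed random hidden layer:
# Rayleigh-like complex Gaussian gains.
H = (rng.standard_normal((d, m)) + 1j * rng.standard_normal((d, m))) / np.sqrt(2)

# Fixed nonlinear RF response per RX element, modeled here as
# envelope (magnitude) detection of the superimposed signals.
Phi = np.abs(X @ H)  # (n, m) hidden-layer outputs

# Only the analog combining weights are trained, in closed form,
# via ridge-regularized least squares (no backpropagation).
lam = 1e-2
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ Y)  # (m, c)

# Inference: weighted combining followed by argmax over classes.
pred = np.argmax(Phi @ W, axis=1)
train_acc = float(np.mean(pred == np.argmax(Y, axis=1)))
```

Because only W is fitted, re-training after a gradual change in fading reduces to re-computing the hidden outputs and re-solving one linear system.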
In fact, it has been proven in [7] that MINN-ELMs are universal approximators (i.e., they may approximate any computable function based on available data) under the following conditions: i) XL numbers of RX antennas; ii) rich scattering conditions (Rayleigh-like); and, crucially, iii) nonlinear responses at the MS elements. Overall, MINN-ELMs exploit random fading as linear projections of the input data onto an arbitrary, yet representationally rich, space, and use the MS responses at the RX as weights that promote nonlinear combinations of representations useful to the inference problem at hand.

D. Example MINN Applications

To promote energy efficient MINN implementations, their data-driven design objective may be augmented with a penalty term including the TX power at the Encoder module. This practice is similar to DNN regularization and can also be interpreted as MINN optimization under soft constraints. More specifically, a scalar parameter may be used to control the magnitude of the penalty term as the MINN progressively reduces the transmission power. As depicted in Fig. 5, MINNs with integrated power control achieve classification performance comparable to MINNs devoid of this control, at orders of magnitude lower final power budget.

Fig. 4. Mean accuracy of the MINN-ELM variation of Fig. 3(b) over different binary classification datasets at a received SNR level of 25 dB: (a) the Parkinson's and MNIST datasets; (b) the WBCD and SECOM datasets. XL MIMO system setups with different numbers of TX antenna elements and MS sizes at the RX were simulated under a Rayleigh fading channel, which was treated as the random single hidden layer of the E2E wave-domain-based neural network. Each constituent metamaterial of the last MS-based layer at the RX was designed to realize a cascade of a fixed nonlinear response (thus acting as an activation function) followed by a tunable linear response (thus acting as a trainable weight). The MINN-ELM training took place in closed form within each channel coherence block. The number of TX antennas corresponded to the number of input features of the dataset used: 22 for Parkinson's; 60 (sub-sampled from 784) for binary (even/odd) MNIST; 30 for Wisconsin Breast Cancer Diagnosis (WBCD); and 20 (sub-sampled from 590) for Semiconductor Manufacturing (SECOM). The MINN-ELM classification performance was compared with that of a fully digital ELM implementation, considering 200 random initializations over the four datasets. As observed, the MINN-ELM performs equally well to its digital counterpart in all scenarios. The performance increases as its approximation power is enhanced by increasing the number of MS elements at the RX, with the exception of the SECOM dataset, which suffers from overfitting. In sufficiently XL MIMO conditions, the MINN-ELM achieves near-optimal classification for all datasets, approaching its asymptotic theoretical guarantees of universal approximation.

Fig. 5. Mean accuracy of MINN with power control for MNIST classification versus the TX power level, compared with MINNs trained under constant power budgets. A 16 × 8 MIMO system setup with two different sizes (10 × 10 and 16 × 16 elements per layer) of 4-layer SIM, positioned close to the TX similar to Fig. 2, operating under a geometric channel with 15 scatterers, was simulated. At each channel realization, the RX position was randomly sampled, resulting overall in dynamic fading conditions. The penalty term γ is a hyper-parameter balancing energy efficiency and classification performance. As depicted, classification with power control yields the desired accuracy levels with orders of magnitude lower TX power during inference. This behavior exemplifies this MINN application for low and varying SNR levels.

A recent research direction deals with SIM response optimization to approximate arbitrary matrices [10]. Inspired by this MS-based operation approximation potential, a fully digital DNN may first be trained to perform EI and, subsequently, one of its hidden layers may be replaced with an appropriately designed MS or multi-layer MS structure [11]. Under the MINN framework, this approach indicates that the original digital weights may be approximated by the programmable component of the Channel module, possibly in conjunction with the TX beamformer (Encoder module) and RX combiner (Decoder module). An application of this idea for the objective of OTA semantic (i.e., encoding) alignment [12] is presented in Fig. 6. As shown, MINNs employing deeper and larger SIM within the programmable Channel module achieve performance comparable to fully digital alignment techniques.

IV. OPEN CHALLENGES AND FUTURE DIRECTIONS

Despite the recent realizations of the MINN framework, capitalizing on state-of-the-art D2NN prototypes and SIM architectures, there exist several open challenges, ranging from hardware design and algorithmic development to prototyping.
Hardware Components and Orchestration: Practical MS designs implement quantized responses, necessitating non-negligible power to maintain and change their configurations as well as to operate their controllers, which, for the case of training for dynamic responses, need to also support a DNN. For this case, MSs with integrated sensing and computing capabilities [2] may provide attractive solutions for locally acquiring channel knowledge. However, the impact of quantization on training has not yet been explored, and efficient orchestration protocols are needed for the distribution of the trained parameters to all three MINN modules. It is crucial for the proposed Encoder and Decoder modules to operate under reasonable power consumption, supporting both ML computations and conventional communication functions. For the former objective, multi-hybrid XL MIMO solutions with few and low-power active components are needed (small numbers of RF chains with low-resolution signal converters). Finally, the placement of the MS-based DNN layers within the signal propagation environment and/or near the end devices is of paramount importance to ensure proper illumination of the MSs during both the forward and backward network passes.

Fig. 6. Mean accuracy of different MINN versions for the objective of OTA alignment for MNIST classification, for per-layer MS sizes of 16 × 16, 32 × 32, and 64 × 64 elements. Two pairs of Encoder/Decoder modules, corresponding to two distinct TX/RX MIMO links for EI, were first pretrained independently from each other to encode and classify CIFAR-10 images (i.e., they learned different encoding spaces that were semantically misaligned). Then, the Channel module was optimized to align those encoding spaces, enabling efficient OTA classification between the Encoder module of the TX of the first pair and the Decoder module of the RX of the second pair. This process involved learning a linear transformation that maps the encoding space of the first MIMO pair to that of the second one. The optimal mapping was first computed digitally, either as a direct linear transformation or by applying Proto Parseval Frame Equalizers (PPFE) onto the encoded data as a pre-processing step. Then, the Channel module, controlling a SIM placed at the TX with different numbers of layers and metamaterials, was optimized to approximate the ideal transformation matrices. The all-MSs MIMO variation of the MINN architecture in Fig. 3(a) was considered, with the TX's SIM having a first layer of 768 metamaterials whose responses were designed to correspond to the actual image values, and the RX equipped with a 384-element fully absorbing MS comprising the first DNN layer of the Decoder. As observed, as the number of SIM layers and their elements increases, the proposed OTA semantic alignment's performance approaches that of the simulated fully digital alignment techniques.

Nonlinear Analog-/Wave-Domain Layers: D2NNs and recent hybrid digital-/wave-domain DNN designs [6], [12] rely on linear-response MSs, implying that MINN's Channel module behaves overall as a single linear layer, irrespective of the number of SIM layers incorporated. This feature, however, provides only limited approximation capabilities. To enable universal OTA approximation, nonlinear activation functions must be integrated in each layer-to-layer OTA/analog connection, which necessitates novel designs of metamaterials and/or passive/near-passive RF circuits. Note that nonlinear responses need not be controllable, since DNN activation functions are typically fixed.
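The claim that a cascade of purely linear MS layers collapses to a single linear layer, however many SIM layers are stacked, can be checked numerically. The sketch below uses random real matrices as hypothetical stand-ins for per-layer MS responses, and a ReLU-like elementwise response as the fixed nonlinearity that breaks the equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-layer responses of a 4-layer SIM, modeled as
# random real matrices for illustration.
layers = [rng.standard_normal((16, 16)) for _ in range(4)]
x = rng.standard_normal(16)

# Cascading the layers without any nonlinearity in between ...
y_linear = x.copy()
for W in layers:
    y_linear = W @ y_linear

# ... is exactly equivalent to applying a single matrix once:
# the whole stack collapses to one linear layer.
W_single = np.linalg.multi_dot(layers[::-1])
assert np.allclose(W_single @ x, y_linear)

# A fixed elementwise nonlinearity between layers (here a
# ReLU-like response) breaks the single-layer equivalence.
y_nonlinear = x.copy()
for W in layers:
    y_nonlinear = np.maximum(W @ y_nonlinear, 0.0)
assert not np.allclose(W_single @ x, y_nonlinear)
```

This is why stacking more linear SIM layers adds no representational power on its own: the nonlinear response at each connection is what enables depth to matter.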
For this goal, RF components operating in their saturation region [7] and diode-based RF circuits approximating the Rectified Linear Unit (ReLU) activation [8] constitute promising research directions.

Advanced DNN Architectures: State-of-the-art ML models employ more advanced layer architectures, going beyond what is currently offered by the fully-connected feedforward propagation of D2NNs. Encouraging attempts have recently been made to implement convolutional layers by exploiting wideband characteristics [13]; however, implementing deep convolutional neural networks, or more advanced recurrent and attention-based architectures, poses a formidable challenge.

Theoretical Inference Guarantees: Ensuring the universal approximation properties of MINN's Channel module is a tedious task. In fact, further theoretical advancements are required to extend the guarantees of MINN-ELMs [7] to deeper structures and dynamic fading scenarios. To this end, accurate MS-parametrized channel modeling is crucial, impacting also simulations-based offline training. Complementarily, analytical insights may guide regularization, initialization, and hyper-parameter selection, offering more stable training behavior.

MINNs with Wideband Signaling: D2NNs mainly rely on the spatial degrees of freedom offered by multiple antenna/MS elements, with the temporal and frequency dimensions having been less explored [4], [14]. Wideband MS designs [2] and signaling can enable extensive parallel data transmissions in MINNs, facilitating feature extraction with minimal digital processing at the endpoints [15]. Besides, empowering MINNs with temporal memory, going beyond linear time-invariant systems, may serve as a means to implement recurrent layers.

OTA MINN Training: The training of all MINN modules can be performed with synthetic data on realistic simulators.
This implies that gradient updates can be computed digitally and, then, the optimized responses applied to the respective MS layers of the E2E, possibly heterogeneous, DNN. It is, however, desirable to design multi-layer MS structures and D2NNs enabling OTA calculation of objective functions and their corresponding gradients. In this way, reconfiguring MS responses based on impinging error signals will allow for wave-domain-based backward passes.

Distributed MINNs: The MINN framework may be realized across multiple devices with heterogeneous characteristics, shaping scalable PHY-layer-based DNNs on demand. This distributed OTA computing necessitates the development of new orchestration schemes and protocols for training and inference. Especially when extending the MINN framework to multi-D2NN/-user and federated learning scenarios, low-overhead synchronization schemes are of particular importance.

V. CONCLUSION

This article presented the concept of MINNs, a PHY-layer-enabled heterogeneous DNN framework encompassing layers realized in the digital, analog, and wave-propagation domains, offering OTA ML applications in a computation-placement-flexible manner for future edge networking with lightweight devices. The operation of the three constituent modules of the proposed E2E MIMO system, all primarily comprising multi-layer MS structures and/or SIM, was discussed, followed by variations of the overall architecture and applications for the example of image classification. The framework's training mechanism for both static and dynamic MS responses was also elaborated. Finally, a list of MINN open challenges and respective research directions was presented.

REFERENCES

[1] J. An, C. Xu, D. W. K. Ng, G. C. Alexandropoulos, C. Huang, C. Yuen, and L. Hanzo, "Stacked intelligent metasurfaces for efficient holographic MIMO communications in 6G," IEEE J. Sel. Areas Commun., vol. 41, no. 8, pp. 2380–2396, 2023.
[2] G. C. Alexandropoulos, A. Zappone, N. Shlezinger, M. Di Renzo, and Y. C. Eldar, Reconfigurable Intelligent Surfaces for Wireless Communications: Modeling, Architectures, and Applications. Singapore: Springer Nature, 2026.
[3] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, "All-optical machine learning using diffractive deep neural networks," Science, vol. 361, no. 6406, pp. 1004–1008, 2018.
[4] C. Liu, Q. Ma, Z. J. Luo, Q. R. Hong, Q. Xiao, H. C. Zhang, L. Miao, W. M. Yu, Q. Cheng, L. Li, and T. J. Cui, "A programmable diffractive deep neural network based on a digital-coding metasurface array," Nat. Electron., vol. 5, no. 2, pp. 113–122, 2022.
[5] Z. Wang, Y. Zhao, Y. Zhou, Y. Shi, C. Jiang, and K. B. Letaief, "Over-the-air computation for 6G: Foundations, technologies, and applications," IEEE Internet Things J., vol. 11, no. 14, pp. 24 634–24 658, 2024.
[6] K. Stylianopoulos, P. Di Lorenzo, and G. C. Alexandropoulos, "Over-the-air edge inference via metasurfaces-integrated artificial neural networks," arXiv preprint, 2025.
[7] K. Stylianopoulos and G. C. Alexandropoulos, "Universal approximation with XL MIMO systems: OTA classification via trainable analog combining," arXiv preprint, 2025.
[8] K. Stylianopoulos, M. Fabiani, G. Torcolacci, D. Dardari, and G. C. Alexandropoulos, "Over-the-air extreme learning machines with XL reception via nonlinear cascaded metasurfaces," in Proc. Int. Zurich Seminar Inf. Commun., Zurich, Switzerland, 2026 (arXiv preprint).
[9] G. Huang, J. An, Z. Yang, L. Gan, M. Bennis, and M. Debbah, "Stacked intelligent metasurfaces for task-oriented semantic communications," IEEE Wireless Commun. Lett., vol. 14, no. 2, pp. 310–314, 2025.
[10] J. An, C. Yuen, Y. L. Guan, M. Di Renzo, M. Debbah, H. V. Poor, and L. Hanzo, "Two-dimensional direction-of-arrival estimation using stacked intelligent metasurfaces," IEEE J. Sel. Areas Commun., vol. 42, no. 10, pp. 2786–2802, 2024.
[11] M. Hua, C. Bian, H. Wu, and D. Gündüz, "Implementing neural networks over-the-air via reconfigurable intelligent surfaces," arXiv preprint arXiv:2508.01840, 2025.
[12] M. E. Pandolfo, K. Stylianopoulos, G. C. Alexandropoulos, and P. Di Lorenzo, "Over-the-air semantic alignment with stacked intelligent metasurfaces," arXiv preprint, 2026.
[13] S. Garcia Sanchez, G. Reus-Muns, C. Bocanegra, Y. Li, U. Muncuk, Y. Naderi, Y. Wang, S. Ioannidis, and K. R. Chowdhury, "AirNN: Over-the-air computation for neural networks via reconfigurable intelligent surfaces," IEEE/ACM Trans. Netw., vol. 31, no. 6, pp. 2470–2482, 2023.
[14] Y. Luo, D. Mengu, N. T. Yardimci, Y. Rivenson, M. Veli, M. Jarrahi, and A. Ozcan, "Design of task-specific optical systems using broadband diffractive neural networks," Light: Sci. Appl., vol. 8, no. 1, p. 112, 2019.
[15] Z. Li, J. An, and C. Yuen, "Stacked intelligent metasurface-enhanced MIMO OFDM wideband communication systems," IEEE Trans. Wireless Commun., vol. 25, pp. 9608–9622, 2026.