OptINC: Optical In-Network-Computing for Scalable Distributed Learning

Authors: Sijie Fei, Grace Li Zhang, Bing Li, Ulf Schlichtmann

1st Sijie Fei, Technical University of Munich, Munich, Germany, sijie.fei@tum.de
2nd Grace Li Zhang, Technical University of Darmstadt, Darmstadt, Germany, grace.zhang@tu-darmstadt.de
3rd Bing Li, Technical University of Ilmenau, Ilmenau, Germany, bing.li@tu-ilmenau.de
4th Ulf Schlichtmann, Technical University of Munich, Munich, Germany, ulf.schlichtmann@tum.de

Abstract—Distributed learning is widely used for training large models on large datasets by distributing parts of the model or dataset across multiple devices and aggregating the computed results for subsequent computations or parameter updates. Existing communication algorithms for distributed learning, such as ring all-reduce, result in heavy communication overhead between servers. Since communication in large-scale systems uses optical fibers, we propose an Optical In-Network-Computing (OptINC) architecture to offload the computation in servers onto the optical interconnects. To execute gradient averaging and quantization in the optical domain, we incorporate optical devices such as Mach-Zehnder Interferometers (MZIs) into the interconnects. Such a de facto optical neural network (ONN) can effectively reduce the communication overhead in existing distributed training solutions. To reduce the dataset complexity for training this neural network, a preprocessing algorithm implemented in the optical domain is also proposed. Hardware cost is lowered by approximating the weight matrices of the optical neural network with unitary and diagonal matrices, while the accuracy is maintained by a proposed hardware-aware training algorithm. The proposed solution was evaluated on real distributed learning tasks, including ResNet50 on CIFAR-100 and a LLaMA-based network on Wikipedia-1B. In both cases, the proposed framework achieves training accuracy comparable to the ring all-reduce baseline while eliminating the communication overhead.

I. Introduction

In recent years, deep neural networks (DNNs) have achieved remarkable progress in various scenarios such as Computer Vision (CV) and Natural Language Processing (NLP). To address increasingly complex tasks, DNNs, especially Large Language Models (LLMs) in the NLP domain, are growing rapidly in both model size and training data volume. For example, state-of-the-art LLMs can contain hundreds of billions of parameters and are trained on trillions of tokens [1]–[3]. The increasing model parameters and training data pose significant challenges for training on a single GPU due to memory and computational limitations.

To overcome these limitations, distributed learning is applied to train a single model across multiple GPUs or servers. Generally, the model is trained via model parallelism [4], [5], where model parameters are split across devices, or data parallelism, where each device trains on a local data subset. For both strategies, the resulting partial outputs or gradients are aggregated and processed after every batch. These two strategies are orthogonal and can be combined [4], [6]. However, both strategies require frequent inter-device communication during training. To reduce the communication, the ring all-reduce algorithm [7] is widely used. Fig. 1 shows an example of data parallelism with four servers forming a logical ring, where the gradients in the four servers should be averaged and synchronized.
To balance the communication workload, the gradients in each server are partitioned into four chunks. The process consists of two stages. First, in the Reduce-Scatter stage, in every communication round each server simultaneously sends one chunk to one neighbor and receives one chunk from another, and the averaging operation is performed in the servers. After three such rounds, each server holds one distinct chunk of the fully averaged gradients. Second, in the All-Gather stage, the servers redistribute the averaged chunks to the others, requiring another three rounds. In general, with N servers, while transmitting all chunks theoretically requires only N rounds, the ring all-reduce algorithm requires 2(N − 1) rounds, resulting in a relative communication overhead of (N − 2)/N, which approaches 100% as N grows. On modern GPUs, compute capabilities often outpace communication bandwidth [8], making communication overhead a bottleneck in large-scale distributed training.

To reduce this communication overhead, several strategies have been proposed. For instance, alternative logical topologies [9], [10] have been introduced, but they are often too complex to deploy at scale. Another direction is to quantize tensors and gradients to lower bit widths [11]–[13], which can reduce bandwidth but leads to model accuracy degradation. Despite these efforts, however, all these approaches still rely on multiple rounds of communication, which remain a key communication bottleneck.

To eliminate redundant communication rounds, In-Network Computing (INC) has been introduced, where the computation on tensors or gradients is offloaded from each device to the network itself. Prior work [8], [14], [15] has embedded computation and control units into electrical switches. When data packets pass through the switches, they can be processed and aggregated directly in the switches. This can reduce latency significantly and accelerate training by 1.4x to 5.5x [8], [14].

Fig. 1: The ring all-reduce algorithm in distributed training with four servers connected to a switch, forming a logical ring topology.

However, implementing INC on electrical switches introduces several drawbacks. First, performing INC on electrical switches can increase energy consumption due to optical-electrical-optical (O-E-O) conversions. Moreover, the partial results need to be buffered until computation in the electrical switches is complete, leading to packet evictions [8]. Alternatively, an Optical Circuit Switch (OCS) [16] can avoid O-E-O conversions, but the potential computation capability in the OCS has not been explored.

In this paper, we propose a novel Optical INC (OptINC) architecture for data parallelism, which eliminates both communication overhead and O-E-O conversions. Our contributions are summarized as follows:

• An optical INC architecture with an Optical Neural Network (ONN) based on Mach-Zehnder Interferometers (MZIs) is proposed to offload the computation in servers onto the optical interconnects, eliminating communication overheads.

• To reduce the hardware cost, weight matrices in selected layers of the ONN are approximated by submatrices. To maintain accuracy, a novel hardware-aware training scheme is adopted, recovering the ONN accuracy perfectly.
• The proposed OptINC architecture is shown to be scalable in a cascading manner, incurring small hardware overhead. Trained with a modified dataset, OptINC can efficiently eliminate quantization errors caused by multi-level quantization.

The rest of the paper is organized as follows: Section II provides background on OCS and MZIs. Section III introduces the OptINC architecture, the hardware-aware training, and the adaptations made for scalable architectures. Experimental results are shown in Section IV. Section V draws the conclusion.

II. Preliminaries

In this section, we first discuss why the OCS is suitable for HPC. Then we introduce how MZIs can implement the matrices in the OCS and ONNs.

A. OCS for HPC

With the increasing demand for bandwidth, datacenters and HPC rely on optical fibers for communication. Since electrical switches require O-E-O conversions, which incur energy overhead and latency, optical switches provide more efficient communication by avoiding these conversions. An OCS can be realized with Micro-Electro-Mechanical Systems (MEMS) [16], [17] using mechanically controlled mirrors, or with optical devices such as MZIs [18]. In an MZI-based OCS, reconfigurable connection matrices can be implemented to direct signals to output ports by tuning the phase shifters. Due to the slow tuning speed of mechanical mirrors and optical devices, a common challenge for the OCS is the reconfiguration latency, typically on the order of µs [19] or even ms [20]. However, in distributed learning, the communication patterns are predetermined and require few changes during training, making the reprogramming costs negligible. Therefore, with high bandwidth and no O-E-O conversions, the OCS is suitable for distributed learning tasks in HPC.

Fig. 2: The interleaving MZI array for a 4 × 4 unitary matrix (adopted from [21]), where an MZI consists of two directional couplers (DCs) and two thermal-optic phase shifters (PSs).

B. Matrix-Vector Multiplications in OCS and ONNs

To implement matrix-vector multiplications in neural networks, in-memory computing with emerging devices such as RRAM [22]–[25] and optical devices has been explored. Fig. 2 shows how MZIs can be used to build a connection matrix in the OCS. An MZI consists of two directional couplers and two thermal-optic phase shifters. The input information is encoded on the amplitudes of two optical signals, I_1 and I_2, which enter from the left ports of the MZI. Along the optical paths, the optical signals are transformed and output as O_1 and O_2 from the right ports. The transformation matrix is inherently unitary. A unitary M × M matrix can be implemented by cascading M(M − 1)/2 MZIs in an interleaving manner. For example, Fig. 2 shows how a 4 × 4 matrix can be realized by six MZIs, M_1 to M_6. To implement an arbitrary M × N matrix W with MZIs, W is decomposed using Singular Value Decomposition (SVD) as follows:

$W \stackrel{\mathrm{SVD}}{=} U \Sigma V^{\top}$   (1)

where U and V denote an M × M unitary matrix and an N × N unitary matrix, respectively, and can be implemented with cascaded MZIs. Σ is an M × N diagonal matrix and can be implemented with a column of MZIs. In total, the implementation of W requires (M(M + 1) + N(N − 1))/2 MZIs. To realize a specific matrix, the PSs in the MZIs are programmed to appropriate values by tuning the heaters [19]. MZIs are widely used to implement linear weight matrices in ONNs [21], [26]–[29].
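To make this SVD-based mapping concrete, the following minimal numpy sketch decomposes a weight matrix as in (1) and counts the MZIs implied by the formula above. The function names are illustrative and not from the paper.

```python
import numpy as np

def mzi_count_unitary(m: int) -> int:
    # An m x m unitary matrix needs m(m-1)/2 MZIs in the interleaving array.
    return m * (m - 1) // 2

def decompose_for_mzi(W: np.ndarray):
    """Decompose W = U @ diag(s) @ Vt as in Eq. (1) and report the MZI budget."""
    m, n = W.shape
    U, s, Vt = np.linalg.svd(W)      # U: m x m, Vt: n x n, s: singular values
    # U and V^T map onto two interleaving MZI arrays, Sigma onto a column of MZIs.
    total_mzis = mzi_count_unitary(m) + mzi_count_unitary(n) + min(m, n)
    return U, s, Vt, total_mzis

if __name__ == "__main__":
    W = np.random.randn(4, 8)
    U, s, Vt, count = decompose_for_mzi(W)
    # Sanity check: the decomposition reproduces W.
    Sigma = np.zeros_like(W)
    np.fill_diagonal(Sigma, s)
    assert np.allclose(U @ Sigma @ Vt, W)
    print("MZIs required:", count)   # (4*5 + 8*7)/2 = 38 for a 4 x 8 matrix
```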
The nonlinear activations can be realized either in digital circuits [26] with O-E-O conversions, or directly in the optical domain using electro-optic devices [30] or nonlinear materials [31].

III. Proposed Work

To reduce the (N − 2)/N communication overhead of ring all-reduce in distributed learning, we propose an OptINC architecture for data parallelism that performs gradient averaging and quantization directly within the network, consisting of linear operations and nonlinear logic. Conventional optical logic gates usually require dedicated and specialized control mechanisms [32], making them difficult to deploy for HPC. Instead, an ONN is employed to perform the computation.

A. OptINC Architecture

Fig. 3 illustrates the proposed OptINC architecture supporting N servers, S_1 to S_N. Unlike the logical ring topology in Fig. 1, all servers are directly connected to the OptINC without forming additional logical topologies. Each server is equipped with M full-duplex optical transceivers. As shown in Fig. 3, a server sends data I_i to the network via the i-th transceiver while simultaneously receiving data O_i from the architecture through the same transceiver.

Fig. 3: The proposed OptINC architecture connecting N servers, S_1 to S_N, each with M full-duplex optical transceivers. The system consists of three components: a preprocessing unit P, an ONN f_θ, and a splitting unit T.

Before transmission, each server encodes its local gradient into optical signals. While gradients are typically stored in 16- or 32-bit formats [33], for transmission efficiency and reliability HPC typically adopts 4-level Pulse Amplitude Modulation (PAM4) [34]. Therefore, a B-bit local gradient in server n, G_n, is encoded into M = ⌈B/2⌉ 2-bit segments, each of which is mapped to a PAM4 signal. The i-th PAM4 signal for server n, I_n^(i), is extracted using the following operation:

$I_n^{(i)} = \left\lfloor \frac{G_n}{2^{2(M-i)}} \right\rfloor \bmod 4, \quad i = 1, 2, \ldots, M.$   (2)

After transmission through the OptINC, each server decodes the optical signals and reconstructs the B-bit global quantized averaged gradient G by inverting (2). Since the transceivers have limited resolution, the received optical signals are quantized back to the nearest PAM4 level by the transceivers.

The goal of the OptINC architecture is to execute gradient averaging and quantization directly within the network. By offloading the computation from the servers to the optical interconnects, the communication overhead introduced by the extra synchronization rounds of the ring all-reduce algorithm is eliminated. Although gradient averaging is a linear operation and can be performed by a linear matrix, quantization and the mapping of the averaged gradient to discrete PAM4 signals are nonlinear. Therefore, we employ an ONN with linear weight matrices implemented by the interleaving MZIs in Fig. 2 and nonlinear activation functions, denoted by f_θ in Fig. 3. The ONN f_θ is used to approximate the quantized average of all local gradients. Ideally, the gradient G reconstructed from the received output signals should match the expected quantized average gradient G*:

$G \stackrel{!}{=} G^{*} = Q\!\left(\frac{1}{N}\sum_{n=1}^{N} G_n\right)$   (3)

where Q(·) denotes the quantization and G_n is the local gradient encoded in (2).
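As a minimal illustration of (2) and (3), the Python sketch below encodes a B-bit gradient into PAM4 segments, reconstructs it, and computes the quantized average that the ONN is trained to reproduce. Rounding to the nearest integer in Q(·) is an assumption, since the paper does not fix the rounding mode, and the function names are illustrative.

```python
def encode_pam4(g: int, B: int):
    """Split a B-bit integer gradient into M = ceil(B/2) PAM4 symbols (Eq. (2))."""
    M = (B + 1) // 2
    return [(g >> (2 * (M - i))) % 4 for i in range(1, M + 1)]

def decode_pam4(symbols):
    """Invert Eq. (2): reassemble the 2-bit segments into an integer gradient."""
    g = 0
    for s in symbols:
        g = (g << 2) | int(s)
    return g

def quantized_average(gradients):
    """Target of the ONN (Eq. (3)); nearest-integer rounding is assumed here."""
    return int(round(sum(gradients) / len(gradients)))

if __name__ == "__main__":
    B, local_grads = 8, [200, 37, 114, 90]          # four servers, 8-bit gradients
    symbols = [encode_pam4(g, B) for g in local_grads]
    assert all(decode_pam4(s) == g for s, g in zip(symbols, local_grads))
    print("PAM4 symbols of server 1:", symbols[0])  # [3, 0, 2, 0] for 200
    print("expected G* =", quantized_average(local_grads))
```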
The gradient averaging, quantization, and the mapping function in the ONN f_θ need to process the M encoded PAM4 optical signals transmitted by each of the N servers. Accordingly, the ONN needs to be trained on 2^{MN} input combinations. As the number of servers N grows, the dataset size for training the ONN grows exponentially. To lower the data complexity, a preprocessing unit is introduced before the ONN, shown as P in Fig. 3, which reduces the ONN input size to K ≤ M by averaging every ⌈M/K⌉ signals from the N servers. The averaged input signals to the ONN are denoted as A_k for k = 1, 2, ..., K, as shown in Fig. 3. Since the sum of ⌈M/K⌉ PAM4 signals over N servers ranges from 0 to N(4^{⌈M/K⌉} − 1), the averaged value across N servers, namely A_k, ranges from 0 to 4^{⌈M/K⌉} − 1 with resolution 1/N. With K such inputs, the dataset size required for training is reduced to (N(4^{⌈M/K⌉} − 1) + 1)^K. As a result, the data complexity is reduced from O(2^{MN}) to O(2^K). K is determined by balancing the data complexity and training efficiency.

Fig. 4: Weight matrix W can be partitioned into square submatrices W_s in two ways, horizontally or vertically.

Since all servers receive the same signals representing the averaged gradients, a simple signal splitting unit T is employed to broadcast the ONN outputs to each server, as shown in Fig. 3. This function can be implemented with a simple MZI array.

B. Hardware-Aware Design and Training of the ONN f_θ

As stated in Section II-B, implementing an M × N weight matrix W in an ONN requires (M(M + 1) + N(N − 1))/2 MZIs. If one dimension is much larger than the other, the hardware area of the weight matrix is dominated by the larger dimension. To reduce the area cost of a weight matrix, a matrix approximation approach similar to that in [28], [29] is adopted. Specifically, W is first partitioned into small square submatrices W_s as shown in Fig. 4, where the partitioning can be performed either horizontally or vertically. After partitioning, instead of using the SVD in (1), each square submatrix W_s is approximated by W_a, composed of only one diagonal matrix Σ_a and one unitary matrix U_a:

$W_s \approx W_a = \Sigma_a U_a$   (4)

$U_a = U_s V_s^{\top}$   (5)

$\Sigma_a = \mathrm{diag}(d_1, \ldots, d_i), \quad d_i = \arg\min_{d_i} \left\| W_s^{i} - d_i \cdot U_a^{i} \right\|_2^2.$   (6)

U_s and V_s are the unitary matrices in the SVD of W_s as in (1), and W_s^i represents the i-th row of W_s. By solving a least-squares optimization problem for each W_s^i, the i-th element d_i of the diagonal matrix Σ_a can be determined. With one unitary matrix eliminated, the area cost for each square submatrix is reduced by nearly 50%.
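The approximation in (4)–(6) amounts to a few lines of numpy; the sketch below is a minimal illustration under the assumption that each row scale d_i is obtained from the closed-form least-squares fit, and the function name is illustrative.

```python
import numpy as np

def approximate_submatrix(W_s: np.ndarray):
    """Approximate a square submatrix W_s by Sigma_a @ U_a (Eqs. (4)-(6))."""
    U_s, _, Vt_s = np.linalg.svd(W_s)     # W_s = U_s diag(s) V_s^T, as in Eq. (1)
    U_a = U_s @ Vt_s                       # Eq. (5): single unitary factor
    # Eq. (6): per-row least squares, d_i = <W_s[i], U_a[i]> / <U_a[i], U_a[i]>
    d = np.einsum("ij,ij->i", W_s, U_a) / np.einsum("ij,ij->i", U_a, U_a)
    Sigma_a = np.diag(d)
    return Sigma_a, U_a

if __name__ == "__main__":
    W_s = np.random.randn(8, 8)
    Sigma_a, U_a = approximate_submatrix(W_s)
    err = np.linalg.norm(W_s - Sigma_a @ U_a) / np.linalg.norm(W_s)
    print(f"relative approximation error: {err:.3f}")
    # U_a stays unitary, so it still maps onto a single interleaving MZI array.
    assert np.allclose(U_a @ U_a.T, np.eye(8), atol=1e-8)
```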
To reduce the area cost of the ONN while maintaining the accuracy, the weight matrices in selected ONN layers are partitioned as described in Fig. 4 and approximated using (4), which is referred to as matrix approximation. However, the matrix approximation can introduce errors. Therefore, we apply a hardware-aware training algorithm to maintain the accuracy. When training the network, the loss is defined as the averaged weighted Mean Square Error (MSE) on the raw outputs for the first E_1 epochs. After E_1 epochs, the training is fine-tuned by directly applying the averaged MSE of the reconstructed gradients. The loss is therefore a two-stage function:

$L = \begin{cases} \frac{1}{|D|} \sum_{d=1}^{|D|} \sum_{i=1}^{M} W_T^{(i)} \left( O_{d,i} - O_{d,i}^{*} \right)^2, & \text{if } E < E_1 \\ \frac{1}{|D|} \sum_{d=1}^{|D|} \left( G_d - G_d^{*} \right)^2, & \text{otherwise} \end{cases}$   (7)

where d denotes the d-th data sample in the dataset D and E denotes the current epoch. O_{d,i} and O*_{d,i} represent the received outputs and the expected outputs. G_d and G*_d are the gradient reconstructed from the received signals and the expected quantized averaged gradient, respectively. W_T represents the importance of the output bits chosen for training. During training, to guide the weight matrices in the selected layers toward the structure defined in (4), the matrix approximation algorithm is applied periodically. After training, if the approximation was not applied at the last epoch, it is enforced on the selected layers to ensure that the trained network matches the ONN structure.

C. Scalability of the OptINC Architecture

As discussed in Section III-A, the input size of the ONN in the OptINC architecture depends on the number of supported servers N. As N increases, both the ONN and the required datasets scale accordingly, which makes training challenging. In this section, we demonstrate that a fixed OptINC architecture can be adapted to efficiently support a larger number of servers.

Fig. 5: The cascading topology with OptINCs in two levels, supporting up to N^2 servers.

Fig. 5 illustrates a cascading OptINC topology that supports up to N^2 servers, with N OptINCs in level 1 and one OptINC in level 2. For fewer servers, unused inputs to the OptINCs can be connected to a zero input, and completely unused OptINCs can be removed. However, the cascading topology based on basic OptINCs can introduce errors due to two-level quantization. The expected averaged gradient G* and the result G_basic obtained under two-level quantization are given by:

$G^{*} = Q\!\left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{n=1}^{N} G_{i,n} \right)$   (8)

$G_{\mathrm{basic}} = Q\!\left( \frac{1}{N} \sum_{i=1}^{N} Q\!\left( \frac{1}{N} \sum_{n=1}^{N} G_{i,n} \right) \right)$   (9)

where G_{i,n} denotes the gradient of server n connected to the i-th OptINC. In (9), the decimal parts of the averaged gradients are discarded during quantization in level 1, which can accumulate across levels and result in accuracy loss.

To avoid this error, when creating the training datasets, we keep the decimal parts d discarded in level 1:

$G_{\mathrm{new}} = Q\!\left( \frac{1}{N} \sum_{i=1}^{N} \left( Q\!\left( \frac{1}{N} \sum_{n=1}^{N} G_{i,n} \right) + d \right) \right).$   (10)

By using these adapted datasets to train the ONNs of the OptINCs in the cascading topology, the actual averaged gradient G_new becomes equivalent to G*. The decimal part d in (10) is merged into the last PAM4 output signal of the corresponding OptINC in level 1 and propagated to the OptINC in level 2, increasing the signal resolution at both levels. Therefore, a larger ONN is adopted to maintain the computation accuracy. Notably, the OptINCs at both levels share the same expanded ONN structure, with each level trained on its own modified dataset.
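To see why keeping the discarded decimal part matters, the short sketch below compares (8), (9), and (10) numerically for a 4 × 4 cascading setup. Flooring is assumed for Q(·), matching the statement that decimal parts are discarded, and all names are illustrative.

```python
import numpy as np

def Q(x):
    # Level quantization; floor is assumed, matching "decimal parts are discarded".
    return np.floor(x)

def two_level_average(G, keep_residual: bool):
    """G has shape (N, N): G[i, n] is the gradient of server n behind level-1 OptINC i.
    Returns the level-2 output with (Eq. (10)) or without (Eq. (9)) the residual d."""
    level1 = Q(G.mean(axis=1))                          # per-OptINC quantized averages
    d = G.mean(axis=1) - level1 if keep_residual else 0.0
    return Q(np.mean(level1 + d))                       # level-2 quantized average

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.integers(0, 256, size=(4, 4)).astype(float)     # 4 OptINCs x 4 servers, 8-bit
    G_star  = Q(G.mean())                                    # Eq. (8): single-level reference
    G_basic = two_level_average(G, keep_residual=False)      # Eq. (9): residuals discarded
    G_new   = two_level_average(G, keep_residual=True)       # Eq. (10): residuals kept
    print(G_star, G_basic, G_new)    # G_new matches G_star; G_basic can fall short
```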
IV. Experimental Results

To evaluate the proposed architecture, four distributed learning scenarios with various bit-width and server-number combinations were considered. For each scenario, a dedicated neural network was trained and mapped to the ONN in the OptINC architecture. We assumed that the weight matrices are mapped onto interleaving MZI arrays as in [26], and that the nonlinear functions are implemented as in [31]. For simplicity, a Multilayer Perceptron (MLP) with ReLU activations was employed in all scenarios. The MLPs were trained with PyTorch using NVIDIA A100 Tensor Core GPUs.

TABLE I: Experimental results under different scenarios with various bit widths and server counts.

Bit Width | #Servers | ONN Structure | Layers With Matrix Approximation | Area Ratio | ONN Accuracy
8  | 4  | 4-64-128-256-128-64-4 | None | 100% | 100%
8  | 4  | 4-64-128-256-128-64-4 | All layers | 39.3% | 100%
8  | 8  | 4-64-128-256-512-256-128-64-4 | None | 100% | 100%
8  | 8  | 4-64-128-256-512-256-128-64-4 | Layers 2–7 | 40.9% | 100%
8  | 16 | 4-64-128-256-512-1024-512-256-128-64-4 | None | 100% | 100%
8  | 16 | 4-64-128-256-512-1024-512-256-128-64-4 | Layers 2–9 | 40.4% | 100%
16 | 4  | 4-64-128-256-512-256-128-64-8 | None | 100% | 100%
16 | 4  | 4-64-128-256-512-256-128-64-8 | Layers 4–6 | 49.3% | 100%

Table I shows the experimental results of the four scenarios. The first and second columns list the bit widths and the number of supported servers, respectively. The third column specifies the ONN structures as a list of the numbers of neurons in the layers. The fourth column lists the layers selected for matrix approximation. The area ratio compared to the OptINC architecture without the matrix approximation in (4) is shown in the fifth column, where the area cost is defined as the number of MZIs required to map the matrices in the OptINC architecture. The last column lists the trained accuracy of the corresponding ONN. When matrix approximation is applied, the OptINC area cost can be reduced to 39.2%–49.3% of that of the OptINC without approximation. With the hardware-aware training described in Section III-B, the accuracy can still be preserved at 100%. The output size of the ONNs was determined as the number of output PAM4 signals needed to represent the averaged gradients, and the input size of the ONNs was set to four to balance the data complexity and training efficiency. The network depths and layer dimensions were selected through a greedy search, which in future work can be optimized by using Neural Architecture Search (NAS) algorithms.

Fig. 6: The communication data normalized by the amount of data to be computed, for the ring all-reduce algorithm and OptINC, when 4, 8, or 16 servers participate in the distributed learning.

Since OptINC offloads all computations to the network and incurs no extra rounds to exchange data, the excessive communication data of the ring all-reduce algorithm can be eliminated. Fig. 6 illustrates the communication data for 4, 8, and 16 servers when the ring all-reduce algorithm and OptINC are used. The communication data is normalized by the gradients to be averaged. The ring all-reduce algorithm requires excessive communication data transfer, as described in Section I, where the (N − 2)/N communication overhead ranges from 50% to 87.5%. OptINC offloads the gradient computation to the network, where gradients are averaged as the signals traverse the network, eliminating the communication overhead. The corresponding improved latency performance, which also depends on the trained models and hardware specifications, is discussed later.
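The numbers behind Fig. 6 follow from simple arithmetic. In the sketch below, the per-server data volume of OptINC is taken as exactly one gradient traversal of the network; this is our reading of the text and an assumption rather than a reported figure.

```python
def normalized_comm(N: int):
    """Per-server data volume, normalized by the gradient size that must be averaged."""
    ring = 2 * (N - 1) / N        # 2(N-1) rounds, one chunk of size 1/N per round
    optinc = 1.0                  # each gradient traverses the OptINC once (assumed)
    overhead = (N - 2) / N        # relative overhead of ring all-reduce (Section I)
    return ring, optinc, overhead

for n in (4, 8, 16):
    ring, optinc, ovh = normalized_comm(n)
    print(f"N={n:2d}: ring={ring:.3f}, OptINC={optinc:.1f}, ring overhead={ovh:.1%}")
# N= 4: ring=1.500, OptINC=1.0, ring overhead=50.0%
# N= 8: ring=1.750, OptINC=1.0, ring overhead=75.0%
# N=16: ring=1.875, OptINC=1.0, ring overhead=87.5%
```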
We explored the performance of the ONN using the fourth scenario in Table I with different configurations of matrix approximation. The results are listed in Table II, where the first column specifies the layers selected for matrix approximation. The accuracy of the trained ONNs is shown in the second column, while the introduced errors and their relative ratios are listed in the third column. For example, ±1 (90%) indicates that, if the ONN fails to achieve 100% accuracy and thus introduces an error into the averaged gradient, the error equals ±1 in 90% of the cases. The corresponding reduced area costs are listed in the last column, normalized to the original OptINC architecture without approximation.

TABLE II: Selected layers for matrix approximation and the training accuracy in the fourth scenario.

Layers | ONN Acc. (%) | Error Values (Rel. Ratios %) | Norm. Area
4, 5, 6 | 100 | None | 49.3%
4, 5, 6, 7 | 99.99986 | ±1 (90), −64 (10) | 47.9%
4, 5, 6, 7, 8 | 99.99999 | 1024 (100) | 47.4%
3, 4, 5, 6 | 99.98891 | ±1 (99), ±1024 (0.9), −4 (0.1) | 43.7%
3, 4, 5, 6, 7 | 99.99936 | ±4 (79.5), −16 (17), 12 (3.5) | 42.2%

As more large layers adopt matrix approximation, the area cost can be further reduced from 49.3% to 42.2%, at the expense of introducing errors with small probabilities.

We evaluated the impact of the proposed OptINC architecture on real distributed training tasks by simulating the training of two models with the OptINC configurations in Table II. During training, the errors in the third column of Table II were injected into the averaged gradients with their corresponding probabilities. Specifically, ResNet50 [35] was trained from scratch on the CIFAR-100 dataset for 300 epochs, and a LLaMA-based network [1] with 8 layers, each with a hidden dimension of 384 and 8 attention heads, was trained on the Wikipedia-1B dataset for 50,000 steps. Floating-point gradients were quantized to fixed-point values using a global block quantization scheme similar to [14], incurring a negligible synchronization cost of less than 0.4% for both models.

Fig. 7: (a) The trained accuracy and loss, and (b) the overall latency breakdown for one epoch or step of ResNet50 on CIFAR-100 and the LLaMA-based network on Wikipedia-1B.

The training results are shown in Fig. 7(a). For comparison, baselines were obtained by simulating accurate gradient averaging in the servers for the ring all-reduce algorithm. Without error injection, both ResNet50 and the LLaMA-based network achieved results comparable to the baseline, with only a slight accuracy drop of 0.03% on CIFAR-100 and a loss increase of 0.018 on Wikipedia-1B due to the block quantization. With error injection, the accuracy of ResNet50 decreased slightly by 0.55%, while the loss of the LLaMA-based network increased by 0.02. For both models, the accuracy and loss still remained within acceptable ranges.
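The error-injection protocol used for these simulations can be summarized as below; this is a minimal sketch that assumes the error values and relative ratios of one Table II configuration, takes the overall failure probability as one minus the reported ONN accuracy, and splits the ± entries evenly between positive and negative values. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_errors(avg_grad: np.ndarray, fail_prob: float, error_values, error_ratios):
    """Perturb the quantized averaged gradients the way an imperfect ONN would:
    each element fails with probability fail_prob and then receives one of the
    error values listed in Table II, drawn according to their relative ratios."""
    fails = rng.random(avg_grad.shape) < fail_prob
    errors = rng.choice(error_values, size=avg_grad.shape, p=error_ratios)
    return avg_grad + fails * errors

# Example: Table II configuration "layers 4, 5, 6, 7":
# ONN accuracy 99.99986% -> failure probability 1.4e-6; errors +/-1 (90%), -64 (10%).
noisy = inject_errors(np.zeros(10_000), 1.4e-6, [1, -1, -64], [0.45, 0.45, 0.10])
print("elements perturbed:", int(np.count_nonzero(noisy)))  # usually 0, the ONN is nearly perfect
```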
Fig. 7(b) illustrates the modeled latency breakdown when training the models for one epoch or step. The latency is normalized by the overall latency with the ring all-reduce algorithm. The setting included NVIDIA H100 GPUs with a compute capability of 60 TFLOPs [36], a utilization efficiency of 0.6, and eight full-duplex transceivers, each with a bandwidth of 800 Gb/s [34]. For ResNet50, a convolutional neural network with a less intensive computation workload than the transformer architecture, the communication latency dominated and OptINC reduced the overall latency by over 25%. For the LLaMA-based network, where the computation and communication latencies were comparable, OptINC reduced the overall latency by around 17%. Since this scenario only supported four servers, the latency improvement would show an increasing trend when supporting more servers, as indicated in Fig. 6.

Finally, the scalable architecture was validated under the first scenario, where one OptINC supports four servers. By cascading five such OptINCs in two levels as in Fig. 5, up to sixteen servers can be supported. To accommodate the increased resolution of the averaged gradients, the ONN structure listed in Table I was modified by inserting two 64 × 64 weight matrices with matrix approximation after the first layer and before the last layer, respectively, while the other layers remained unchanged. For both levels, the ONNs in the OptINC architecture can be trained to reach 100% accuracy on the modified dataset. Compared with the ONN structures in Table I, this modification incurred about 10.5% hardware overhead.

V. Conclusion

In conclusion, a scalable OptINC architecture was proposed to eliminate the communication overhead of the ring all-reduce algorithm by offloading the computation onto the network. Specifically, an ONN was employed to map the encoded gradients from multiple servers to the quantized averaged gradients. To reduce the dataset complexity, a preprocessing unit and a splitting unit were introduced. To decrease the hardware cost, selected weight matrices were partitioned and approximated with unitary and diagonal matrices, leading to an area reduction of nearly 50%. The accuracy of the proposed architecture is maintained by a hardware-aware training algorithm. Future work will address physical-layer non-idealities and explore different network topologies and protocols.

References

[1] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, et al., "Language models are few-shot learners," in NeurIPS, 2020.
[3] J. Wang, Y.-G. Chen, I.-C. Lin, B. Li, and G. L. Zhang, "Basis sharing: Cross-layer parameter sharing for large language model compression," 2024.
[4] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," 2019.
[5] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, et al., "GPipe: Efficient training of giant neural networks using pipeline parallelism," in NeurIPS, 2019.
[6] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, et al., "Efficient large-scale language model training on GPU clusters using Megatron-LM," in International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
[7] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv:1706.02677, 2017.
[8] B. Klenk, N. Jiang, G. Thorson, and L. Dennison, "An in-network architecture for accelerating shared-memory multiprocessor collectives," in ISCA, 2020.
[9] P. Sanders, J. Speck, and J. L. Träff, "Two-tree algorithms for full bandwidth broadcast, reduction and scan," Parallel Computing, vol. 35, no. 12, pp. 581–594, 2009.
[10] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in ACM SIGCOMM Conference, 2008.
[11] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," 2017.
[12] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in NeurIPS, 2017.
[13] W. Sun, G. L. Zhang, H. Gu, B. Li, and U. Schlichtmann, "Class-based quantization for neural networks," in DATE, 2023.
[14] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, et al., "Scaling distributed machine learning with in-network aggregation," in NSDI, 2021.
[15] J. Fei, C.-Y. Ho, A. N. Sahu, M. Canini, and A. Sapio, "Efficient sparse collective communication and its application to accelerate distributed deep learning," in ACM SIGCOMM Conference, 2021.
[16] L. Poutievski, O. Mashayekhi, J. Ong, A. Singh, M. Tariq, R. Wang, et al., "Jupiter evolving: Transforming Google's datacenter network via optical circuit switches and software-defined networking," in ACM SIGCOMM Conference, 2022.
[17] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, et al., "TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings," in ISCA, 2023.
[18] Z. Lu, D. Celo, H. Mehrvar, E. Bernier, and L. Chrostowski, "High-performance silicon photonic tri-state switch based on balanced nested Mach-Zehnder interferometer," Scientific Reports, vol. 7, no. 1, pp. 1–7, 2017.
[19] N. C. Harris, Y. Ma, J. Mower, T. Baehr-Jones, D. Englund, M. Hochberg, et al., "Efficient, compact and low loss thermo-optic phase shifter in silicon," Optics Express, vol. 22, no. 9, pp. 10487–10493, 2014.
[20] R. Urata, H. Liu, K. Yasumura, E. Mao, J. Berger, X. Zhou, et al., "Mission Apollo: Landing optical circuit switching at datacenter scale," arXiv:2208.10041, 2022.
[21] Y. Zhu, G. L. Zhang, B. Li, X. Yin, C. Zhuo, H. Gu, et al., "Countering variations and thermal effects for accurate optical neural networks," in ICCAD, 2020.
[22] S. Zhang, G. L. Zhang, B. Li, H. H. Li, and U. Schlichtmann, "Aging-aware lifetime enhancement for memristor-based neuromorphic computing," in DATE, 2019.
[23] Y. Zhu, G. L. Zhang, T. Wang, B. Li, Y. Shi, T.-Y. Ho, et al., "Statistical training for neuromorphic computing using memristor-based crossbars considering process variations and noise," in DATE, 2020.
[24] S. Zhang, G. L. Zhang, B. Li, H. H. Li, and U. Schlichtmann, "Lifetime enhancement for RRAM-based computing-in-memory engine considering aging and thermal effects," in AICAS, 2020.
[25] A. Eldebiky, G. L. Zhang, G. Böcherer, B. Li, and U. Schlichtmann, "CorrectNet: Robustness enhancement of analog in-memory computing for neural networks by error suppression and compensation," in DATE, 2023.
[26] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, et al., "Deep learning with coherent nanophotonic circuits," Nature Photonics, vol. 11, no. 7, pp. 441–446, 2017.
[27] J. Gu, Z. Zhao, C. Feng, M. Liu, R. T. Chen, and D. Z. Pan, "Towards area-efficient optical neural networks: An FFT-based architecture," in ASP-DAC, 2020.
[28] A. Eldebiky, B. Li, and G. L. Zhang, "NearUni: Near-unitary training for efficient optical neural networks," in ICCAD, 2023.
[29] S. Fei, A. Eldebiky, G. L. Zhang, B. Li, and U. Schlichtmann, "An efficient general-purpose optical accelerator for neural networks," in ASP-DAC, 2025.
[30] I. A. Williamson, T. W. Hughes, M. Minkov, B. Bartlett, S. Pai, and S. Fan, "Reprogrammable electro-optic nonlinear activation functions for optical neural networks," Journal of Selected Topics in Quantum Electronics, vol. 26, no. 1, pp. 1–12, 2019.
[31] Q. Bao, H. Zhang, Z. Ni, Y. Wang, L. Polavarapu, Z. Shen, et al., "Monolayer graphene as a saturable absorber in a mode-locked laser," Nano Research, vol. 4, no. 3, pp. 297–307, 2011.
[32] P. Singh, D. K. Tripathi, S. Jaiswal, and H. Dixit, "All-optical logic gates: Designs, classification, and comparison," Advances in Optical Technologies, vol. 2014, no. 1, p. 275083, 2014.
[33] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, et al., "Mixed precision training," 2017.
[34] NVIDIA, "Networking interconnect – LinkX cables and transceivers guide," https://docs.nvidia.com/networking/interconnect/index.html#products, 2025.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[36] NVIDIA, "NVIDIA H100 GPU," https://www.nvidia.com/en-us/data-center/h100/, 2025.
