A Distributed Processing Architecture for Modular and Scalable Massive MIMO Base Stations
Authors: Erik Bertilsson, Oscar Gustafsson, Erik G. Larsson
Abstract—In this work, a scalable and modular architecture for massive MIMO base stations with distributed processing is proposed. New antennas can readily be added by adding a new node, as each node handles all the additional processing involved. The architecture supports conjugate beamforming, zero-forcing, and MMSE, where for the two latter cases a central matrix inversion is required. The impact of the time required for this matrix inversion is carefully analyzed along with a generic frame format. As part of the contribution, careful computational, memory, and communication analyses are presented. It is shown that all computations can be mapped to a single computational structure and that a processing node consisting of a single such processing element can handle a broad range of bandwidths and numbers of terminals.

Index Terms—Massive MIMO, Distributed processing, Architecture, Scalable, Conjugate beamforming, Zero-forcing, MMSE

I. INTRODUCTION

The ever increasing demands for higher data rates in wireless communication open up many opportunities and challenges in the fifth generation (5G) wireless infrastructure [1], [2]. One such is the use of many antennas on the base station side, commonly referred to as Massive MIMO or Very Large Scale MIMO [3]–[7]. By using many antennas, compared to the number of terminals, the transmit and receive power of each antenna and the processing associated per antenna can be reduced. Furthermore, the total energy per bit per user can potentially be reduced at the system level compared to traditional few-antenna solutions. However, while massive MIMO is a promising technology, there are still obstacles to overcome before systems of this type can be deployed.
There exist a number of demonstrators [8]–[11] that have been used to demonstrate the feasibility of the techniques. The number of antennas for the demonstrators is typically between 64 and 128, with support for up to 12 terminals. In addition, some work has been done to implement parts or all of the involved processing [12]–[17], either using a centralized node with all processing [12], [13], [15] or distributing the processing to several nodes [14], [16], [17]. For the centralized node architectures, typically the case of 128 antennas and 8 terminals is considered.

Of more interest in our current context are the examples of distributed processing. In [14], a base station system design that is constructed from identical modules is proposed. The baseband processing is distributed among the modules, which are connected in an array. The modules contain RF front-ends, digital baseband processing, and digital interconnection links to all four neighbors. The system design, along with cost and power consumption issues, is analyzed. However, there are no details on how the baseband processing should be performed or the impact of timing constraints.

In [16], [17], the authors propose distributed processing based on COTS processors. However, the timing constraints of the considered LTE frame structure are not taken into consideration. Additionally, the systems are not dimensioned to meet the maximum obtainable throughputs of the considered specifications.

Compared to a centralized architecture, in a distributed architecture the number of antennas at the base station can more easily be scaled. For the centralized architecture, increasing the number of antennas more or less requires a complete redesign of the system.

(The authors are with the Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden. Emails: {erik.bertilsson, oscar.gustafsson, erik.g.larsson}@liu.se. Manuscript received January 25, 2018.)
In a distributed architecture, the number of antennas can be increased by adding another node that contains the antenna and circuitry for the associated processing. Furthermore, a distributed architecture enables performing the computations close to the antenna, possibly integrated on the same chip as the radio. In the case of component failure, the modularity allows a single node to be replaced instead of replacing a large centralized unit. Additionally, for centralized implementations the required data rate to read all uplink data from the ADCs and to feed the downlink data to the DACs grows with the number of antennas, making it very high for systems with many antennas. Finally, a higher manufacturing yield can be expected, since each chip is smaller in a distributed architecture.

In the current work, a node and system architecture is proposed that is distributed, modular, and scalable. It supports conjugate beamforming (CB), zero-forcing (ZF) [18], and minimum mean-square error (MMSE) [19], where in the two latter cases a matrix inversion is performed in a central unit. The computational difference between ZF and MMSE is that in MMSE, a regularization term is added before performing the matrix inverse. This is also done at the central unit. Therefore, we only discuss CB and ZF explicitly, as the node processing for MMSE is the same as for ZF. The main contributions in this work are:
• Analysis of distributed and modular processing in a massive MIMO-OFDM system
• Node and system architecture for a distributed, modular, and scalable MIMO-OFDM system

Fig. 1. Proposed system architecture consisting of a central control unit (CCU) and the scalable antenna nodes.
• Computation, memory, and communication analysis for the nodes and system
• Analysis of timing constraints and their effects on resource requirements
• Deterministic scheduling/control of the nodes and system
• Design space exploration showing that the proposed node architecture can be used in a rather large set of scenarios

A preliminary version of the current manuscript was presented in [20]. Compared to the distributed architecture in [14], we suggest using a tree interconnection of the nodes, although the proposed approach can also be used in other interconnection topologies. Especially, we perform an analysis of the computational and timing requirements and propose a detailed node architecture, along with scheduling of the computations and inter-node communication. Compared to the distributed architectures in [16], [17], we propose an optimized node architecture instead of using generic processors. Additionally, the timing constraints of the selected frame format are carefully analyzed.

II. PROPOSED SYSTEM ARCHITECTURE

In this work, the proposed system architecture consists of one central control unit (CCU) and a scalable part, as illustrated in Fig. 1. The CCU is responsible for performing operations such as error correction coding/decoding and operations associated with the other network layers, such as medium access control (MAC). The scalable part is responsible for the channel estimation, linear precoding, and linear decoding of symbols transmitted to and from the base station. Every node in the scalable part contains computational blocks for the associated antenna(s) and inter-node communication links. One or more nodes can be combined into a chip for different granularity.
The main difference is the latency of the inter-node communication, which within a chip will be one or a few clock cycles, while inter-chip it may be in the range of one hundred clock cycles, assuming a serial link and a clock frequency of hundreds of MHz. Here it is assumed that the nodes are clocked synchronously.

In downlink operation, the CCU feeds modulated symbols for each terminal to the nodes. In uplink operation, the nodes compute estimated symbols transmitted from the terminals and send them to the CCU. In this work, we propose to connect the nodes in a K-ary tree; however, the nodes can be connected in other topologies as well. It is worth pointing out that, independently of the interconnection topology, during the accumulation of data from all nodes, each node will only transmit data to one other node on the way to the CCU to avoid duplicate transmissions and accumulations. This means that accumulating data will always be performed in a tree structure, independently of the interconnection topology. By modifying the interconnection topology, the number of hops when accumulating data is changed.

Trees have some inherent advantages and disadvantages compared to array topologies. One of the most profound advantages of the tree structure is that the number of routing hops, N_hops, grows logarithmically with the number of nodes in the tree, as opposed to proportionally to the square root for arrays. Figure 2 shows two different arrays and a tree topology. For systems with a large number of antennas this is a major benefit when using ZF or MMSE processing, as the latency of propagating data through the tree affects the system design. For CB processing, low latency is not as important.

Fig. 2. Three different node topologies. The external controller is connected to the grey nodes. The black node is (one of) the furthest. The number of routing hops is: (a) N_hops ∝ (3/2)√M, (b) N_hops ∝ √M, (c) N_hops ∝ log2(M).
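The scaling quoted in the caption of Fig. 2 can be illustrated with a short script. This is a sketch only: the exact constants depend on controller placement and tree shape, which are assumed here (controller at a corner of a square array, complete binary tree).

```python
import math

def hops_tree(M):
    # Complete binary tree with M nodes: the deepest node is
    # floor(log2(M)) hops from the root (assumed tree shape).
    return math.floor(math.log2(M))

def hops_array(M):
    # Square array with the controller at a corner: worst case is
    # roughly 2*sqrt(M) hops (assumed controller placement).
    side = math.isqrt(M)
    return 2 * (side - 1)

for M in (64, 256, 1024):
    print(M, hops_tree(M), hops_array(M))
```

For 1024 antennas the tree needs an order of magnitude fewer hops than the array, which is why the round-trip term 2 N_hops T_link stays manageable for ZF/MMSE.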
Another design trade-off that needs to be considered is fault tolerance. In an ordinary tree structure, if a node fails during operation, that entire branch will not be able to communicate with the rest of the tree. In an array topology, this could be mitigated by routing data past the failing node. This, however, increases the node complexity, since a routing mechanism must be implemented.

Additionally, there is the aspect of physical antenna placement and cable routing. In systems where antennas are placed in an array, the array-based node topologies have the advantage of simpler cable routing. In systems where the antennas are scattered in some irregular pattern, this advantage is lost. For the remainder of the article, for ease of exposition, complete binary trees are considered where each chip contains one node.

A. System Specification

The considered setup is a TDD-based system that utilizes OFDM. A generalized frame structure can be seen in Fig. 3.

Fig. 3. Generalized frame structure.

TABLE I
SYSTEM PARAMETERS

Name       Description
K          Number of terminals
N_FFT      OFDM DFT/IDFT length
N_SC       OFDM subcarriers utilized
N_UL,1     Uplink OFDM symbols before pilot
N_UL,2     Uplink OFDM symbols after pilot
N_DL       Downlink OFDM symbols
N_hops     Number of hops to the furthest node
f_sample   Sample rate
T_OFDM     Duration of one OFDM symbol
T_frame    Duration of one frame
T_inv      Time to compute matrix inverse
T_link     Latency of sending one value over the link
W_comp     Word length of partial results
W_symbol   Word length of QAM modulated symbols
W_ADC      Word length of ADC
W_DAC      Word length of DAC

The frame starts with N_UL,1 uplink OFDM symbols, where the terminals transmit data to the base station. Then comes the uplink pilot symbol, where all terminals transmit a unique pilot sequence that is used to estimate the uplink radio channel. Another N_UL,2 uplink OFDM symbols are sent after the pilot.
Then comes a guard interval to switch from uplink to downlink operation. The base station then transmits N_DL OFDM symbols to the terminals. The frame duration is

T_frame = (N_UL,1 + N_UL,2 + N_DL + 3) T_OFDM,   (1)

where T_OFDM is the duration of one OFDM symbol. Generally, it is favorable to place the pilot close to the middle of the frame to reduce the time between sending the pilot and the data. Here, the synchronization between transmitters and receivers is not considered. The system parameters are shown in Table I.

III. COMPUTATIONAL TASKS

To utilize the proposed architecture efficiently, the algorithms used must be expressed in a distributed manner. The processing can be divided into three phases: channel estimation, uplink data decoding, and downlink data precoding.

A. Channel Estimation

Here, a channel estimation based on least squares is considered. Let

x^k_pilot = [ 0_{1×(k−1)}  p  0_{1×(K−k)} ]   (2)

be the pilot vector transmitted by terminal k. The scalar p is computed statically at design time. Each node receives the signal vector

y_{i,pilot} = h_i X_p + N_p ∈ C^{1×K},   (3)

where X_p ∈ C^{K×K} is the pilot matrix and N_p ∈ C^{1×K} is a noise vector. The pilot matrix X_p is given by

X_p = [ x^1_pilot | x^2_pilot | ··· | x^K_pilot ].   (4)

When the pilot signals have been received at node i, it has all data necessary to estimate the channels to the K users, without any inter-node communication. This is done by multiplying the received pilot signals by the scalar 1/p. The local channel estimate vector is

ĥ_i = y_{i,pilot} / p ∈ C^{1×K}.   (5)

Assuming that the channels are frequency flat, the entire channel estimate matrix H ∈ C^{M×K} can be written as

H = [ h_{1,1}  h_{1,2}  ···  h_{1,K}
      h_{2,1}  h_{2,2}  ···  h_{2,K}
        ⋮        ⋮      ⋱      ⋮
      h_{M,1}  h_{M,2}  ···  h_{M,K} ],   (6)

where h_{i,j} ∈ C is the channel coefficient between antenna i and terminal j.
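The least-squares estimate (5) with the orthogonal pilot matrix of (2) and (4) can be checked in a few lines of NumPy. A minimal noise-free sketch, where the sizes and the pilot amplitude are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, p = 4, 3, 2.0            # antennas, terminals, pilot amplitude (assumed)

H = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
X_p = p * np.eye(K)            # terminal k sends p in pilot slot k, eqs. (2), (4)

# Node i observes y_i = h_i X_p; noise omitted so exact recovery can be checked.
Y_pilot = H @ X_p
H_hat = Y_pilot / p            # local scaling by 1/p in every node, eq. (5)

assert np.allclose(H_hat, H)   # noise-free LS estimation recovers the channel
```

With noise added in (3), the same scaling yields the least-squares estimate rather than the exact channel.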
After the locally performed channel estimation, node i has computed and stored row i of the channel matrix.

B. Linear Decoding and Precoding Matrices

In the uplink data transmission, the base station separates the received signal vector y ∈ C^{M×1} into K streams of symbols ỹ ∈ C^{K×1}. This is done by multiplication with a linear detection matrix A ∈ C^{K×M}. For the considered algorithms, the decoding matrix is

A = { H^H                for CB,
    { (H^H H)^{-1} H^H   for ZF.
   (7)

In the downlink data transmission, the symbol vector q ∈ C^{K×1} is precoded and sent from the M antennas as x ∈ C^{M×1}. This is done by multiplication with a linear precoding matrix W ∈ C^{M×K}. For the considered algorithms, the precoding matrix is

W = { H^*                   for CB,
    { H^* (H^T H^*)^{-1}    for ZF.
   (8)

For CB, the linear detection matrix A and the linear precoding matrix W are obtained directly from the channel estimation. Each node then has access to one column of the decoding matrix. For the ZF algorithm, calculating A and W involves performing a pseudo-inverse of the channel matrix H. The ZF precoding matrix is

W = H^* (H^T H^*)^{-1} = H^* ((H^H H)^{-1})^*.   (9)

Let

D = (H^H H)^{-1}.   (10)

The matrices can for the ZF case then be rewritten as

W = H^* D^* = (HD)^*   (11)

and

A = D H^H.   (12)

Given the fact that H^H H is Hermitian, we know that its inverse is also Hermitian. With the Hermitian property (D = D^H), the decoding matrix can be written as

A = D^H H^H = (HD)^H = W^T.   (13)

Since the decoding and precoding matrices are each other's transpose, the local decoding column vector and the precoding row vector will be identical. To calculate A and W, D must be known. The Gram matrix of the channel estimates, H^H H, can be calculated in a distributed manner across all nodes. The inversion is then performed in the CCU. Let B = H^H H. The Gram matrix can be written as a sum of outer products of the local channel estimate vectors,

H^H H = Σ_{i=1}^{M} h_i^H h_i.   (14)
The matrix h_i^H h_i is the Gram matrix of the local channel estimate vector in node i, and can be computed locally without any inter-node communication, since the required data is obtained from the channel estimation. It is a Hermitian matrix; thus only K(K+1)/2 entries must be computed. The computation performed in node i is

B_i = h_i^H h_i + B_left child + B_right child.   (15)

The local contributions are added together as the matrices are propagated upwards in the tree to form the Gram matrix of the channel estimates. This reduces the computational complexity of the CCU and reduces the amount of data to be sent in the tree. Instead of propagating a matrix with M × K values, due to the Hermitian property only K(K+1)/2 values need to be propagated. However, the computational load in each node is increased, since B_i must be computed at the node.

When the D matrix has been computed in the CCU, it is propagated downwards in the tree structure to all nodes. The nodes can then calculate their local detection and precoding vectors by multiplying the inverted matrix with their local channel estimate vector. The computation performed in node i is

A_i = D h_i ∈ C^{K×1},   (16)

where A_i is the local decoding vector and W_i = A_i^T is the local precoding vector.

The process of determining the local precoding/decoding vector, A_i, is illustrated in Fig. 4. The leaf nodes, 1 and 2, compute their local contributions to the Gram matrix, B_1 and B_2 respectively, and send them to the parent node, 3. Node 3 computes its own local contribution, B_3, and sums it together with the contributions from the child nodes before sending it upwards to the CCU.

Fig. 4. Data transfers and partitioning of computations when determining the precoding/decoding vector for M = 3.
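The three-node flow of Fig. 4 and the identities (10)–(16) can be simulated numerically as a sanity check. A minimal sketch with assumed sizes; note that each node applies D to the conjugate transpose of its stored channel row, so that the stacked local vectors form the columns of A:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 3, 2                       # three nodes as in Fig. 4 (sizes assumed)

H = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
h = [H[i:i + 1, :] for i in range(M)]       # node i stores row i of H

# Leaves 1 and 2 send their local Gram matrices to node 3, eq. (15).
B1 = h[0].conj().T @ h[0]
B2 = h[1].conj().T @ h[1]
B3 = h[2].conj().T @ h[2] + B1 + B2         # node 3 accumulates and forwards

assert np.allclose(B3, H.conj().T @ H)      # the CCU receives B = H^H H, eq. (14)

# The CCU inverts B and redistributes D down the tree, eq. (10).
D = np.linalg.inv(B3)

# Each node forms its local decoding vector from D and its channel row, eq. (16).
A = np.hstack([D @ h[i].conj().T for i in range(M)])

W = (H @ D).conj()                          # centralized ZF precoder, eq. (11)
assert np.allclose(A, W.T)                  # A = W^T, eq. (13)
assert np.allclose(A @ H, np.eye(K))        # ZF removes inter-user interference
```

The final assertion confirms that stacking the independently computed per-node vectors reproduces the centralized ZF solution exactly.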
The CCU performs the matrix inversion and redistributes the result downwards in the tree. When each node receives the inverted matrix, it computes its local precoding/decoding vector.

C. Uplink Linear Decoding

The decoding process is performed by multiplying the received signal vector y ∈ C^{M×1} with the decoding matrix, A. During the decoding, each node has access to one column of the decoding matrix and one sample of the received signal vector. The symbol vector estimate ỹ is

ỹ = A y = Σ_{i=1}^{M} [a_{1,i}, a_{2,i}, …, a_{K,i}]^T y_i,   (17)

i.e., a sum over the columns of A, each scaled by the corresponding received sample. By multiplying the local sample with the local decoding column, a local contribution to the received symbol vector is computed. When the local contributions have been calculated in each node, they are sent upwards in the tree structure. The contributions are added together as they propagate to the CCU. Since the local contribution can be calculated using only the local sample and one column of the decoding matrix, the entire decoding matrix does not need to be available in all nodes. The computation performed for each subcarrier in node i is

ỹ_i = A_i y_i + ỹ_left child + ỹ_right child,   (18)

where A_i is the local decoding vector. This is computed similarly to computing B in Fig. 4.

D. Downlink Linear Precoding

The precoding process is done by multiplying the symbol vector, q ∈ C^{K×1}, with the precoding matrix, W ∈ C^{M×K}. During the precoding, each node has access to the symbol vector and one row of the precoding matrix:

x = W q = [ Σ_{j=1}^{K} w_{1,j} q_j,  Σ_{j=1}^{K} w_{2,j} q_j,  …,  Σ_{j=1}^{K} w_{M,j} q_j ]^T.
(19)

The value transmitted at node i is the inner product between row i of the precoding matrix and the symbol vector q. Similarly to the decoding case, each node only requires one row of the precoding matrix to perform the precoding. Thus, the entire matrix does not need to be distributed to all nodes. The computation performed for each subcarrier in node i is

x_i = Σ_{j=1}^{K} w_{i,j} q_j,   (20)

where W_i is the local precoding vector. The symbol vector, q, is distributed to the nodes similarly to D, and the computation of x_i is performed similarly to that of A_i in Fig. 4.

E. OFDM Modulation and Demodulation

In a massive MIMO OFDM system, the OFDM modulation and demodulation are performed for each antenna. Therefore, one FFT/IFFT must be performed in the node for each OFDM symbol (pilot, uplink, and downlink). The length of the FFT/IFFT is N_FFT, while the number of subcarriers utilized is N_SC.

F. Processing Element

As is shown in Section VIII, having one processing element that performs all computations in the node is enough to support a large range of different combinations of the number of terminals and channel bandwidth. Therefore, it is beneficial to find a common structure for the involved computations discussed earlier. The channel estimation only requires multiplications with 1/p, as shown in Fig. 5(a). For uplink decoding, each node performs a multiplication and adds data from the other nodes further down the tree, for a binary tree as shown in Fig. 5(b). For the downlink precoding, a sum of products is locally computed, which consists of multiplication and accumulation, as shown in Fig. 5(c). Finally, the FFT and IFFT consist of butterfly operations and twiddle factor multiplications. Considering the operations in Figs. 5(a)–(c), it makes sense to use a radix-2 decimation-in-time (DIT) algorithm. This algorithm has the property that each butterfly operation has a twiddle factor multiplication in front of one of the inputs [21], as shown in Fig.
5(d). Although there exist many other radix-2 algorithms, the radix-2 DIT algorithm is the only one with this property for each and every butterfly. As a note, it is often believed that DIT corresponds to bit-reversed input order and normal output order. However, this is not the case, as the butterfly computation order, and, hence, the data dependency, is independent of the algorithm selection. A conflict-free memory access scheme with low hardware overhead can be found in, e.g., [22].

These operations can be efficiently mapped to a processing element as shown in Fig. 6. The number of operations for each task and the type of operations are summarized in Table II. In cases where multiple processing elements are used, the processing element selection may be reconsidered. In this case, it might be beneficial to map different computational tasks to different processing elements, enabling specialized structures for the given task. Similarly, if the computational requirements per antenna are low, it may be beneficial to interleave the computations for more than one antenna on a single processing element.

TABLE II
COMPUTATIONAL TASKS AND THE NUMBER OF OPERATIONS

Name      Description                                              #PE Operations          Operation
CE        Channel estimation, (5)                                  K                       Fig. 5(a)
B_i       Local contribution for Gram matrix, (15)                 K(K+1)/2                Fig. 5(b)
ỹ_i       Local ỹ contribution plus contributions from
          child nodes, for all subcarriers, (18)                   N_SC K                  Fig. 5(b)
x_i       Precoded symbol x_i for all subcarriers, (20)            N_SC K                  Fig. 5(c)
W_i/A_i   Local precoding and decoding vector, (16)                K^2                     Fig. 5(c)
FFT/IFFT  FFT/IFFT for one OFDM symbol                             (N_FFT/2) log2(N_FFT)   Fig. 5(d)

Fig. 5. Arithmetic operations performed in the nodes: (a) multiply, (b) multiply and add, (c) multiply and accumulate, (d) multiplication and butterfly operation.

Fig. 6. Proposed processing element capable of performing all the operations in Fig. 5.

G.
Computational Partitioning

So far, all computations that can be performed in a distributed manner are assumed to be done so. However, this does not need to be the case. Consider ZF processing, where the uplink decoding is performed as

ỹ = A y = (H^H H)^{-1} H^H y.   (21)

So far, the decoding matrix, A = (H^H H)^{-1} H^H, is computed once every frame. This requires that the inverted matrix is redistributed to the nodes before the decoding process can start. Another possibility is to compute H^H y in each node, just like in the conjugate beamforming case, and multiply with the inverted matrix once the results reach the CCU. The distributed parts of the decoding could then be started independently of the matrix inversion. Similarly, for the downlink precoding, the ZF processing is performed according to

x = W q = H^* ((H^H H)^{-1})^* q,   (22)

where the precoding matrix W = H^* ((H^H H)^{-1})^* is computed once every frame. This requires that the inverted matrix is available in the node before the precoding can start. By multiplying the complex conjugate of the inverted matrix with the symbol vectors, ((H^H H)^{-1})^* q, in the CCU before they are sent to the nodes, the inverted matrix itself is not needed at each node for the precoding step.

By partitioning the computations in this way, the inverted matrix does not need to be redistributed to the nodes. However, the computational load in the CCU is significantly increased. The computational load of each node is only slightly reduced, since precoding and decoding are still performed distributedly. The only difference is that the A/W computation does not need to be performed.

IV. COMPLEXITY ANALYSIS

In this section, the computational, memory, and communication complexity are analyzed.

A. Computational Complexity

The computational complexity of each task is shown in Table II. With the selected frame format, there are two major limitations on the computational resources.
First, since the frame is repeated cyclically, all computations for one frame must be performed in the duration of one frame, T_frame. This yields the average number of operations per sample received, N_OPS,avg. The number of operations that need to be performed to obtain the precoding/decoding vector is

N_op,weights = (N_FFT/2) log2(N_FFT) + K + K(K+1)/2 + K^2,   (23)

where the first term corresponds to demodulating the OFDM symbol (FFT), the second term to estimating the K channels (CE), and the third term to computing the local contribution to the channel Gram matrix (B_i). The fourth term is from multiplying the inverted matrix with the local channel estimates to create the local decoding and precoding vector. The number of operations performed for each uplink OFDM symbol is

N_op,UL = (N_FFT/2) log2(N_FFT) + K N_SC,   (24)

where the first term corresponds to demodulating the OFDM symbol, and the second term to computing the local contribution to the received symbol vector. The number of operations required for each downlink OFDM symbol is

N_op,DL = K N_SC + (N_FFT/2) log2(N_FFT),   (25)

where the first term corresponds to performing the precoding for each subcarrier utilized, and the second term to performing the OFDM modulation. The number of operations performed for the uplink and downlink OFDM symbols is the same:

N_op,OFDM = N_op,UL = N_op,DL.   (26)

The total number of operations per sample, on average over an entire frame, is then

N_OPS,avg = (N_op,weights + (N_UL + N_DL) N_op,OFDM) / (T_frame f_sample),   (27)

where N_UL = N_UL,1 + N_UL,2 is the total number of uplink OFDM symbols, and N_DL is the number of downlink OFDM symbols. Without considering data dependencies or critical paths, this is the theoretical lower bound on the number of operations per sample that the node must be able to perform.

The other limitation is that the downlink symbols must be processed before their respective deadlines.
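Equations (23)–(27) are straightforward to evaluate. A sketch with hypothetical LTE-like parameter values, all of which are assumptions chosen purely for illustration:

```python
import math

def n_ops_avg(K, n_fft, n_sc, n_ul, n_dl, t_ofdm, f_sample):
    """Average PE operations per received sample over one frame, eqs. (23)-(27)."""
    fft = (n_fft / 2) * math.log2(n_fft)
    n_op_weights = fft + K + K * (K + 1) / 2 + K**2   # eq. (23)
    n_op_ofdm = fft + K * n_sc                        # eqs. (24)-(26)
    t_frame = (n_ul + n_dl + 3) * t_ofdm              # eq. (1): pilot + two guards
    return (n_op_weights + (n_ul + n_dl) * n_op_ofdm) / (t_frame * f_sample)

# Hypothetical 20 MHz LTE-like numbers (assumed, cyclic prefix ignored).
print(n_ops_avg(K=8, n_fft=2048, n_sc=1200, n_ul=6, n_dl=6,
                t_ofdm=2048 / 30.72e6, f_sample=30.72e6))
```

For these assumed numbers the average lands below ten operations per sample, which hints at why a single processing element per node can suffice.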
In practice, there will be N_DL critical paths in the schedule for one frame. Figure 7 shows the computational tasks performed in each node, the critical paths in the frame, the important times, and the possibility to buffer or process the uplink OFDM symbols. The critical path in the computations runs from receiving the pilot symbol, through estimating the channels, computing the local contribution to the Gram matrix, performing the centralized matrix inversion, and computing the local precoding/decoding vector, to finally performing the precoding for each downlink OFDM symbol. The number of operations on the critical path for downlink symbol i is

N_op,CP,i = N_op,weights + i N_op,OFDM,   i ∈ {1, 2, …, N_DL}.   (28)

The time available to perform the operations on the critical path for downlink symbol i is

T_CP,i = T_OFDM (N_UL,2 + i).   (29)

Between receiving the pilot symbol and transmitting downlink symbol i, there are (N_UL,2 + i) OFDM symbols, including the guard interval. However, during the time the local Gram matrices are propagated to the CCU, the inverse computed, and the result redistributed to the nodes, which in total takes T_inv + 2 N_hops T_link, no computations on the critical paths can be performed. Hence, the worst-case average number of computations per sample on the critical paths is

N_OPS,critical = max_i  N_op,CP,i / ((T_CP,i − T_inv − 2 N_hops T_link) f_sample).   (30)

Fig. 7. Critical paths and their respective deadlines, for the asymptotic case N_OPS = N_OPS,asymptotic. Here, CE includes both channel estimation and computing B_i. Dotted segments of the critical path times indicate that no computations on the critical path can be performed during the period.
Gray boxes illustrate uplink OFDM symbols that must be stored before processing.

This leads to the computational requirements being determined by

N_OPS = max(N_OPS,avg, N_OPS,critical).   (31)

This means that the time to perform the matrix inversion and the inter-node communication latency may affect the computational requirements.

If the system specifications are kept, but the number of uplink and downlink OFDM symbols is increased, the average number of operations per sample over an entire frame increases as well. This is due to the two guard intervals becoming less significant with an increasing number of OFDM symbols. When the number of uplink and downlink OFDM symbols is large, the number of operations per sample is

N_OPS,asymptotic = lim_{N_UL, N_DL → ∞} max(N_OPS,avg, N_OPS,critical) = N_op,OFDM / (T_OFDM f_sample),   (32)

meaning that one OFDM symbol must be processed in the duration of one OFDM symbol.

As seen from (30) and (31), the matrix inversion time, T_inv, and the total inter-node communication latency, 2 N_hops T_link, may affect the computational requirements. For fixed inter-node communication latency¹, this behavior is displayed in Fig. 8. There are two inversion times marked in Fig. 8. The first time, T_inv,A, is the time when the critical path requires equally many operations per sample as the frame average (N_OPS,critical = N_OPS,avg). The second time, T_inv,B, is when the number of operations on the critical path grows larger than the number of operations per sample in the asymptotic case.

¹ Note that the same behavior occurs when varying the inter-node communication latency, with fixed T_inv, or the sum of both.

Fig. 8. Number of required PE operations per sample depending on the time to perform the matrix inversion.
Figure 9 shows how the number of operations per sample for varying T_inv changes when the number of OFDM symbols in a frame is changed. In Fig. 9(a) the number of OFDM symbols is small. In this case there is a significant gap between N_OPS,avg and N_OPS,asymptotic and between T_inv,A and T_inv,B. When the number of OFDM symbols increases, these gaps decrease, as shown in Fig. 9(b). Additionally, it can be seen in Fig. 9(b) that when the number of OFDM symbols is large, the time T_inv,B acts as a deadline for the matrix inversion. If the inverse is received later than T_inv,B, the required number of operations per sample grows rapidly. It can be seen in Fig. 9 that the critical path for the last downlink symbol is the first to cross the frame average line.

Fig. 9. Number of operations per sample versus T_inv for (a) small and (b) large number of OFDM symbols.

Combining (1), (27), (29), and (30) leads to

T_inv,A = T_CP,N_DL − (N_op,CP,N_DL / N_op,total) T_frame − 2 N_hops T_link
        = T_OFDM ( N_UL,2 + N_DL − (N_UL + N_DL + 3) (N_op,weights + N_DL N_op,OFDM) / (N_op,weights + (N_UL + N_DL) N_op,OFDM) ) − 2 N_hops T_link,   (33)

where N_op,total = N_op,weights + (N_UL + N_DL) N_op,OFDM. The point T_inv,B is given by the equation

N_OPS,asymptotic = N_op,CP,i / ((T_CP,i − T_inv,B − 2 N_hops T_link) f_sample),   (34)

for downlink symbol i. Using (28), (29), and (32), T_inv,B can be expressed as

T_inv,B = T_OFDM (N_UL,2 − N_op,weights / N_op,OFDM) − 2 N_hops T_link.   (35)

It can be seen that T_inv,B is identical for all downlink symbols. The number of operations per sample for each of the critical paths is the same at the point T_inv,B. If T_inv < T_inv,B, the critical path to the last downlink symbol will always require the highest number of operations per sample of all critical paths.
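The inversion times (33) and (35) are easy to evaluate numerically. A sketch reusing the operation counts of (23)–(26), where every parameter value (frame layout, link latency, clock rates) is an assumption chosen only for illustration:

```python
import math

# Assumed, illustrative parameters.
K, N_FFT, N_SC = 8, 2048, 1200
N_UL1, N_UL2, N_DL = 3, 3, 6
N_hops, T_link = 7, 100 / 300e6          # ~100 cycles per hop at 300 MHz (assumed)
T_OFDM = 2048 / 30.72e6                  # cyclic prefix ignored for simplicity

fft = (N_FFT / 2) * math.log2(N_FFT)
N_op_weights = fft + K + K * (K + 1) / 2 + K**2      # eq. (23)
N_op_OFDM = fft + K * N_SC                           # eqs. (24)-(26)
N_UL = N_UL1 + N_UL2

# Inversion time where the last critical path meets the frame average, eq. (33).
T_inv_A = (T_OFDM * (N_UL2 + N_DL
                     - (N_UL + N_DL + 3) * (N_op_weights + N_DL * N_op_OFDM)
                     / (N_op_weights + (N_UL + N_DL) * N_op_OFDM))
           - 2 * N_hops * T_link)

# Inversion deadline in the asymptotic case, identical for all DL symbols, eq. (35).
T_inv_B = T_OFDM * (N_UL2 - N_op_weights / N_op_OFDM) - 2 * N_hops * T_link

print(f"T_inv,A = {T_inv_A * 1e6:.1f} us, T_inv,B = {T_inv_B * 1e6:.1f} us")
```

Consistent with Fig. 8, the computed T_inv,A falls below T_inv,B, and both budgets shrink as the link latency term 2 N_hops T_link grows.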
Similarly, if T_inv > T_inv,B, the critical path to the first downlink symbol requires the highest number of operations per sample.

To keep up with the computational requirements, the number of operations per sample, N_OPS, the number of processing elements, N_PE, and the clock frequency, f_clk, must satisfy

N_PE f_clk / f_sample >= N_OPS = max(N_OPS,avg, N_OPS,critical). (36)

Selecting f_clk as an integer multiple of f_sample, the number of operations per sample that can be performed with N_PE processing elements is

N̂_OPS = N_PE f_clk / f_sample. (37)

In most cases N_OPS will not be an integer. However, N̂_OPS will be, and, hence, there is a slack time that can be used to increase the number of terminals, K, the number of antennas, M, and/or the matrix inversion time, T_inv. If T_inv < T_inv,A, the slack time can be used to process some uplink symbols, say N_UL,PB, before the downlink symbols, as discussed below. Alternatively, the pilot symbol can be moved closer to the downlink symbols, i.e., N_UL,2 can be decreased, as discussed in Section II-A.

While this section focuses on ZF, the same analysis can be made for CB processing. This yields similar results, but with one significant difference. The precoding and decoding vector is obtained directly from the channel estimation, which means that the computational tasks B_i, the central matrix inversion, and W_i/A_i are not performed. This results in

N_op,CP,i = (N_FFT/2) log2(N_FFT) + K + i N_op,OFDM, (38)

N_OPS,critical = max_i N_op,CP,i / (T_CP,i f_sample), (39)

and

N_OPS,avg = ( (N_FFT/2) log2(N_FFT) + K + (N_UL + N_DL) N_op,OFDM ) / (T_frame f_sample) (40)

for CB. Hence, the number of operations to perform locally does not decrease significantly, but the latency issues of performing centralized computations vanish.

B.
Memory Complexity

Dimensioning the memories in the node will in part depend on the frame structure that is chosen, and in part on the scheduling of computations and inter-node communication. In Fig. 7, the gray boxes illustrate uplink OFDM symbols that must be stored locally in the node before they are processed. The number of symbols that must be stored is

N_UL,buffered = N_UL - N_UL,PB (41)

and the number of bits required to store these symbols is

N_bits,buffered = N_UL,buffered N_FFT W_ADC. (42)

For an uplink OFDM symbol, the number of variables during its lifetime in the node is seen in Fig. 10. Between times T_0 and T_1, the OFDM symbol is sampled from the antenna and stored in memory in the node. During this period, the number of variables grows to N_FFT. The duration between T_0 and T_1 is slightly shorter than one OFDM symbol, since the cyclic prefix is not stored. At time T_2 the OFDM demodulation starts, and it is finished at time T_3. The FFT computation can be made in-place, meaning that no additional memory is strictly required. However, towards the end of the FFT computation, some variables can be discarded, since only N_SC subcarriers are utilized.

Fig. 10. Number of existing variables over the lifetime of an uplink OFDM symbol in the node.

When the decoding starts at time T_4, there is a data expansion by a factor K, since each subcarrier is multiplied with the decoding vector. When the decoding is finished, there are K N_SC variables. This is the number of variables that exist during the lifetime of one uplink symbol, but not all of them must be stored.

C.
Communication Complexity

One of the advantages of distributing the computations among multiple nodes is that the number of values that needs to be sent to the centralized structure in the system grows proportionally to the number of terminals, K, rather than to the number of antennas, M. In massive MIMO systems, where M >> K, this is clearly advantageous. The number of bits that needs to be sent upwards in the tree structure during one frame is

N_bits,up = ( K(K + 1)/2 + (N_UL,1 + N_UL,2) N_SC ) W_comp, (43)

which corresponds to the local contributions to the Gram matrix, B_i, and the symbol vector estimates, ỹ_i. These values are all used for computations and, thus, the longer word length W_comp is required. The required upwards link data rate is

R_up = N_bits,up / T_frame. (44)

The downwards propagation differs in that the word length of the modulated symbols is much shorter. Downwards, only the raw symbols are propagated to all nodes, using the shorter word length W_symbol. However, the inverted matrix still needs to be represented with W_comp. The number of bits sent downwards is

N_bits,down = (K(K + 1)/2) W_comp + N_DL N_SC W_symbol. (45)

The required downwards link data rate is then

R_down = N_bits,down / T_frame. (46)

However, this is only the minimum required data rate. If the data is not sent between the nodes at the same rate as it is consumed, buffers (which may incur a significant increase in die area) are needed, as discussed in Section IV-D.

Fig. 11. Number of stored memory variables during the lifetime of one uplink OFDM symbol.

The reduced number of values sent from the antennas to the central unit is often used as an argument for performing distributed processing. While this is indeed the case, it must also be noted that the word lengths of the data are different.
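A small numerical sketch of (43)-(46), using the LTE-like parameters from Table IV; the value M = 128 in the final comparison is an illustrative assumption, not a parameter from the paper:

```python
# Minimum link rates per (43)-(46), with the LTE-like parameters of Table IV.

def bits_up(k, n_ul1, n_ul2, n_sc, w_comp):
    """Eq. (43): Hermitian Gram-matrix contribution (K(K+1)/2 entries)
    plus symbol-vector estimates, all at the computation word length."""
    return (k * (k + 1) // 2 + (n_ul1 + n_ul2) * n_sc) * w_comp

def bits_down(k, n_dl, n_sc, w_comp, w_symbol):
    """Eq. (45): inverted matrix at W_comp, raw downlink symbols at W_symbol."""
    return k * (k + 1) // 2 * w_comp + n_dl * n_sc * w_symbol

T_frame = 0.5e-3
up = bits_up(20, 0, 2, 1200, 24)
down = bits_down(20, 2, 1200, 24, 4)
print(up / T_frame / 1e6, down / T_frame / 1e6)   # minimum Mb/s, eqs. (44), (46)

# Per-subcarrier payload toward the central unit (M = 128 assumed here):
# centralized ~ M * W_ADC bits vs. distributed ~ K * W_comp bits.
print(128 * 12, 20 * 24)   # 1536 480
```

Even though each distributed value uses the longer word length W_comp, the K-proportional payload is far smaller than the M-proportional centralized one.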
For a centralized architecture, the word length depends on the ADC, so the number of bits is proportional to M W_ADC. For a distributed architecture, ỹ and B are transmitted, so the number of bits is proportional to K W_comp. Since these are sums of products, with one product term being the sample value, one may expect that in general W_comp > W_ADC. However, as M >> K, the total number of bits transmitted to the central unit should still be significantly smaller. Furthermore, it is important that the intermediate values are properly scaled as more and more terms are added along the path to the central unit.

D. Balancing Computations, Communication, and Memories

To obtain an optimized architecture, the different types of resources must be balanced. Here, the processing, communication, and memory capabilities are included. Considering the inter-node communication for one uplink OFDM symbol, the number of stored variables in each node can be seen in Fig. 11. From sampling the radio until the FFT is finished, the number of stored variables is the same as the number of existing variables in Fig. 10. The output data from the decoding process has no further data dependencies in the current node. These variables need to be sent to the parent node, so that it can perform its own decoding process. In Fig. 11, the time T_4 to T_5 is again the time taken to perform the decoding. The time T_4 to T_6 is the time taken to send the local contributions to the decoded signal vectors to the parent node. It can be noted that T_6 >= T_5. When the decoding starts, the number of variables that needs to be stored locally increases due to the data expansion of the decoding, but at the same time decreases due to variables being sent to the parent node, and thus not needing to be stored. There are two extreme cases of this behavior. The first is if T_6 tends to infinity. In this case, all variables must be stored locally, since none of them are sent over the link.
Clearly, this solution is not feasible. The other extreme is if T_6 = T_5. This means that all output variables are sent directly to the parent node and do not need to be stored locally. As described earlier, the decoding on the parent node cannot be performed until the decoding output has been sent over the link. The implication of this is that the system should be designed such that the rate of processing and the rate of sending variables between nodes are the same.

The requirements on the downwards communication, however, are not as strict. The data that is propagated from the CCU to the nodes in the tree is not processed on the way downwards, but rather just forwarded to the next node. This has the implication that the data does not have to be sent at the same rate as it is processed. Doing so does, however, avoid the need for large buffers in each node, which makes it desirable. Feeding the nodes with data is a rather straightforward trade-off between link data rate and buffer size.

Fig. 12. Schedule of the computational tasks in the asymptotic case. The white blocks correspond to one frame.

V. SCHEDULING

The computational tasks and data dependencies when using ZF processing can be seen in Fig. 7. This schedule is not drawn to scale, but rather made to illustrate the data dependencies and the need for a better realization. It is, e.g., clear that there is time at the end of the frame where no operations are performed. Therefore, it makes sense to move parts of the computations there to obtain a better utilization of the processing elements.
This will come at a cost of memory, as the data must be stored rather than processed directly.

Here, a node with only one processing element is considered. The processing element is assumed to support the required number of operations per sample for the asymptotic case, i.e., N̂_OPS >= N_OPS,asymptotic. The schedule is created as in Fig. 12. Initially, the node waits for the pilot OFDM symbol. The computations for determining the precoding/decoding vector are then started. This includes an FFT, performing the channel estimation, and computing the B_i matrix. When the inverted matrix is received, the precoding/decoding vector is computed. After this stage, the uplink and downlink symbols can be processed. In order to reduce the number of uplink symbols that need to be buffered before processing, N_UL,PB uplink symbols are processed first. All downlink symbols are then processed in order to meet their deadlines. When the downlink symbols are finished, N_UL,1 + 3 uplink symbols are processed. Two uplink symbols can be processed while the last downlink symbol is transmitted and during the guard interval. Another uplink symbol can be computed while the pilot of the next frame is sampled. The remaining N_UL - N_UL,PB - N_UL,1 - 3 uplink symbols are processed while the node waits for the inverted matrix of the next frame.

As can be seen, the processing is fully deterministic for the asymptotic case, and, hence, a simple control unit can be implemented, where the different system parameters can be configured. For the non-asymptotic case, the same general structure is implemented. However, as the processing is possibly distributed differently within the frame, a slightly more flexible control unit is required. Alternatively, the control signals can be stored in a RAM acting as an instruction memory.

A.
System Level Scheduling

For the computational tasks ỹ_i and B_i in Table II, there are inter-node data dependencies, as described in Section III. Before the local PE operation can be performed, the corresponding contributions from the child nodes must be sent over the inter-node link. The latency of sending a value over the link is T_link. For each level in the tree, the ỹ_i and B_i computations must be skewed by this amount in order for the parent node to receive the data before processing.

VI. ARCHITECTURE

In this section, an architecture for the node is proposed. The main components of the architecture are the off-chip I/O, the processing core, the memory system, and the RF chain. As seen later in Section VIII, many system scenarios can be covered with a single processing element in each node. Hence, we focus on that case here.

Further inspection of the arithmetic operations in Fig. 5 reveals that each input port of the processing element is connected only to a few specific data. This means that not all types of data need to be fed into every port of the PE. For instance, only the twiddle factors, channel estimates, or the precoding/decoding vector are connected to one input of the multiplier. Taking this into account leads to the proposed node architecture shown in Fig. 13. The node architecture uses a processing element as shown in Fig. 6. The twiddle factor memory can be implemented as a ROM, since the twiddle factors are static. The channel estimate and precoding/decoding vector memories are single-port memories that can either be written or read in one cycle. Although only the precoding/decoding vector is required during precoding and decoding, the channel estimates must be stored until all precoding/decoding values are computed, and, hence, both must be stored.
For simplicity, we select a separate memory allocation for the channel estimates, instead of using, e.g., the sample memory.

Fig. 13. Proposed node architecture (not all connections shown).

The sample memory is more complex and is divided into three separate memories, as shown in Fig. 14. The first memory is the radio input buffer, which stores raw data from the AD converter. The size of this memory is

Mem_input = N_UL,buffered N_FFT W_ADC bits. (47)

The FFT processing buffer is used when performing the FFT/IFFT computations, and its size is

Mem_processing = N_FFT W_comp bits. (48)

The last memory is the radio output buffer, which holds the finished downlink OFDM symbols that are to be sent to the DA converter. The cyclic prefix of the OFDM symbol is also fetched from this memory. Its size is

Mem_output = N_FFT W_DAC bits. (49)

Fig. 14. Structure of the sample memory in Fig. 13.

In Fig. 14, the memories are shown as two-port memories that can read and write simultaneously. In many cases, it may be beneficial to use two single-port memories of half the size instead. For the input and output buffers, it is straightforward to alternate reading and writing between such memories. For the FFT processing buffer, this is also possible using, e.g., the approach in [22]. All memory sizes and word lengths for the architecture in Fig. 13 are summarized in Table III. The exact required word lengths should be based on system-level simulations, which is left for future work.

TABLE III
MEMORY SIZES FOR A NODE

Memory                     | #Words              | Word length
Input buffer               | N_UL,buffered N_FFT | W_ADC
FFT processing buffer      | N_FFT               | W_comp
Output buffer              | N_FFT               | W_DAC
Channel estimates          | K                   | W_comp
Precoding/decoding vector  | K                   | W_comp
Twiddle factors (ROM)      | N_FFT/2             | W_TF

VII. EXAMPLE: LTE-LIKE SYSTEM SPECIFICATIONS

Here, the requirements for an LTE-like system using ZF processing are considered. This is the typical specification considered in most earlier work. The system specifications can be seen in Table IV.

TABLE IV
SPECIFICATIONS OF THE LTE-LIKE SYSTEM IN THE EXAMPLE

Name    | Value   | Name      | Value
K       | 20      | T_frame   | 0.5 ms
N_FFT   | 2048    | T_inv     | 40 µs
N_SC    | 1200    | W_comp    | 12 + 12 bits
N_UL,1  | 0       | W_symbol  | 2 + 2 bits
N_UL,2  | 2       | W_ADC     | 6 + 6 bits
N_DL    | 2       | W_DAC     | 6 + 6 bits
T_link  | 0.5 µs  | f_sample  | 30.72 MHz

For this specification, the throughput is

N_SC N_DL/UL K W_symbol / T_frame = 384 Mb/s (50)

in each direction. The centralized matrix inversion can be performed using either an exact [23] or an approximate [12], [24], [25] algorithm. As shown in [26], the complexity is similar for the best exact algorithm and a Neumann-series approximation with three terms. In both cases, a 20 × 20 matrix inversion can be performed in less than 40 µs using one processing element running at 200 MHz. Based on these specifications and (27) and (30), we get

N_OPS,avg = 9.95 (51)

and

N_OPS,critical = max(9.23, 11.28) = 11.28. (52)

In this case, T_inv,A = 8.27 µs and T_inv,B = 110 µs. Therefore, based on (31) and (36), selecting one PE running at 12 f_sample = 368.64 MHz is sufficient, leading to N̂_OPS = 12. In this case, the second downlink OFDM symbol imposes the limit on the number of computational resources. Hence, N_UL,PB = 0 and N_UL,buffered = 2. A schedule for the computations in the LTE-like system is derived and can be seen in Fig. 15.
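The throughput figure in (50) and the clock-frequency choice can be checked with a few lines, using only the values from Table IV:

```python
import math

# Check of (50) and the single-PE sizing for the LTE-like example (Table IV).
N_SC, N_DL, K, W_symbol = 1200, 2, 20, 4     # W_symbol = 2 + 2 bits
T_frame, f_sample = 0.5e-3, 30.72e6

throughput = N_SC * N_DL * K * W_symbol / T_frame
print(throughput / 1e6)                      # 384.0 Mb/s per direction, eq. (50)

# Pick the smallest integer N_OPS_hat >= N_OPS = 11.28, per (36)-(37).
N_OPS_hat = math.ceil(11.28)                 # = 12
print(N_OPS_hat, N_OPS_hat * f_sample / 1e6) # 12 operations/sample, f_clk = 368.64 MHz
```

Choosing f_clk as an integer multiple of f_sample keeps N̂_OPS integral, and the fractional headroom (12 versus 11.28) is exactly the slack discussed below.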
Deriving the schedule is rather straightforward, since all tasks are performed sequentially. The slack time between determining the local precoding/decoding vector, W_i/A_i, and the start of the precoding, x_i, can be utilized to modify the specification, as discussed below.

Fig. 15. Schedule for the LTE-like system with ZF processing.

The size of the radio input buffer is

Mem_input = N_UL,buffered N_FFT W_ADC = 48 kb, (53)

the size of the FFT processing buffer is

Mem_processing = N_FFT W_comp = 48 kb, (54)

and the size of the radio output buffer is

Mem_output = N_FFT W_DAC = 24 kb. (55)

Hence, a total of 120 kb of memory is required in each node. In addition, 960 more bits are needed for the channel estimates and the precoding/decoding vector.

As was shown in Section IV, the computations and communication need to be performed at the same rate. With the selected number of operations performed per sample, N̂_OPS, the data rates are

R_up = N̂_OPS f_sample W_comp ≈ 8.847 Gbps (56)

and

R_down = N̂_OPS f_sample W_symbol ≈ 1.475 Gbps. (57)

The available slack time can be used to modify the specifications of the system. By tweaking the parameters and redoing the calculations, we can investigate which configurations are supported with N̂_OPS = 12. For example, the matrix inversion time can be increased up to 54.2 µs with exactly the same node architecture. Alternatively, the number of users can be increased to K = 21, assuming that the matrix inversion time increases cubically. For K = 30 and the same assumption, N̂_OPS can be selected as 28. In this case, either one PE running at 860.16 MHz or two PEs running at 430.08 MHz can be used.(2) Alternatively, for the example, N_hops can be increased up to 22, leading to a maximum of M = 2^22 - 1 antennas, assuming a binary tree.
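The memory totals and link rates in (53)-(57) follow directly from the Table IV values; as a quick numerical check:

```python
# Memory and link-rate figures of (53)-(57) for the LTE-like example.
N_FFT, W_ADC, W_comp, W_DAC, W_symbol = 2048, 12, 24, 12, 4
N_UL_buffered, N_OPS_hat, f_sample = 2, 12, 30.72e6

mem_input = N_UL_buffered * N_FFT * W_ADC        # (53): 48 kb
mem_processing = N_FFT * W_comp                  # (54): 48 kb
mem_output = N_FFT * W_DAC                       # (55): 24 kb
print((mem_input + mem_processing + mem_output) // 1024)   # 120 (kb in total)

r_up = N_OPS_hat * f_sample * W_comp             # (56)
r_down = N_OPS_hat * f_sample * W_symbol         # (57)
print(round(r_up / 1e9, 3), round(r_down / 1e9, 3))        # 8.847 1.475 (Gbps)
```

Note that (56)-(57) use the processing rate N̂_OPS f_sample, not the frame-average minimum of (44) and (46), since data must be fed at the rate it is consumed to avoid large buffers.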
Even though the number of antennas can be further increased by increasing N̂_OPS, this should not pose a limitation in most cases. If we want to process one uplink symbol before the first downlink symbol, i.e., N_UL,PB = 1, we must select N̂_OPS = 17. To move the pilot symbol one symbol closer to the downlink symbols, i.e., N_UL,1 = 1 and N_UL,2 = 1, again N̂_OPS = 17, although this equality does not hold in general. Halving the matrix inversion time leads to N̂_OPS = 15 in both cases. This illustrates that when the critical paths are limiting, increasing the computational capabilities in the CCU, i.e., decreasing the matrix inversion time, leads to reduced computational requirements in the nodes. Naturally, any valid combination of these modifications can be realized.

(2) Naturally, any combination of N_PE and f_clk is valid as long as (36) holds. However, if multiple PEs are used, the memory architecture may need to be modified.

Fig. 16. Bounds on bandwidth and number of terminals using a single processing element at a given f_clk for the LTE-like case. Configurations with (a) large and (b) small number of OFDM symbols.

VIII. DESIGN SPACE EXPLORATION

The clock frequency required in the LTE-like example is not a problem to achieve in a modern process technology through, e.g., pipelining, which is straightforward since the execution is deterministic. Hence, it is possible to change the bandwidth and/or the number of terminals. Here, we consider three different clock frequencies up to 1 GHz for a system otherwise as in the LTE-like case. Figure 16 shows the bounds on bandwidth and number of terminals for a given clock frequency. In Fig. 16(a) the asymptotic case is shown, N_OPS = N_OPS,asymptotic, meaning that the number of OFDM symbols in each frame is large. In Fig.
16(b), the frame format of the LTE-like case is used. In both cases, the average number of computations over an entire frame is used. Thus, it is assumed that the matrix inversion is performed fast enough not to influence the required number of operations per second, i.e., T_inv <= T_inv,A.

It is noted from Fig. 16 that increasing the channel bandwidth by a factor of two roughly requires that the number of simultaneous terminals is reduced by the same factor. Here, the length of the FFT, N_FFT, and the number of utilized subcarriers, N_SC, are scaled linearly with the bandwidth of the channel. This is usually not the case in practice, since the FFT length is favorably selected as a power of two. The plots still give a good estimate of the available design space.

IX. CONCLUSIONS

In this work, a scalable system architecture using distributed processing was proposed for the base station in a massive MIMO system. It was shown that the computations associated with each antenna can be distributed, and that in most of the earlier studied use cases only a simple single processing element running at a few hundred MHz and a modest amount of memory are required. It was further shown that it is feasible to have a simple synchronous control of the nodes, and that the inter-node communication can be handled by one or a few high-speed serial links. All computations required by adding an antenna are handled by the introduced additional node.

The case of connecting the nodes as a binary tree was primarily studied, although the architecture is readily extended to a K-ary tree. It is worth noting that an array architecture with static scheduling will behave as a binary or ternary tree, and, hence, the same concept can be used for an array interconnect with additional simple routing logic surrounding the processing node.
As the processing core is small, it is also of interest to possibly have more than one node per chip, reducing the number of inter-chip communication channels. The exact granularity is left for future work.

The architecture supports conjugate beamforming, zero-forcing, and MMSE processing. In the latter two cases, a matrix inversion is performed in a central control unit, but all other computations are distributed. The impact of the matrix inversion latency and the pilot position on the computational requirements in the node was studied and related.

REFERENCES

[1] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, "Five disruptive technology directions for 5G," IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, Feb. 2014.
[2] P. K. Agyapong, M. Iwamura, D. Staehle, W. Kiess, and A. Benjebbour, "Design considerations for a 5G network architecture," IEEE Commun. Mag., vol. 52, no. 11, pp. 65–75, Nov. 2014.
[3] J. Hoydis, S. ten Brink, and M. Debbah, "Massive MIMO in the UL/DL of cellular networks: How many antennas do we need?" IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 160–171, Feb. 2013.
[4] L. Lu, G. Y. Li, A. L. Swindlehurst, A. Ashikhmin, and R. Zhang, "An overview of massive MIMO: Benefits and challenges," IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 742–758, Oct. 2014.
[5] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, "Massive MIMO for next generation wireless systems," IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, Feb. 2014.
[6] E. Björnson, E. G. Larsson, and T. L. Marzetta, "Massive MIMO: Ten myths and one critical question," IEEE Commun. Mag., vol. 54, no. 2, pp. 114–123, Feb. 2016.
[7] T. L. Marzetta, E. G. Larsson, H. Yang, and H. Q. Ngo, Fundamentals of Massive MIMO. Cambridge University Press, 2016.
[8] C. Shepard, H. Yu, N. Anand, E. Li, T. Marzetta, R. Yang, and L. Zhong, "Argos: Practical many-antenna base stations," in Proc. Int. Conf.
Mobile Comput. Networking, ACM, 2012, pp. 53–64.
[9] J. Vieira, S. Malkowsky, K. Nieman, Z. Miers, N. Kundargi, L. Liu, I. Wong, V. Öwall, O. Edfors, and F. Tufvesson, "A flexible 100-antenna testbed for massive MIMO," in Proc. IEEE Globecom Workshops, Dec. 2014, pp. 287–293.
[10] X. Yang, W.-J. Lu, N. Wang, K. Nieman, S. Jin, H. Zhu, X. Mu, I. Wong, Y. Huang, and X. You, "Design and implementation of a TDD-based 128-antenna massive MIMO prototyping system," 2016.
[11] F.-L. Luo and C. Zhang, Signal Processing for 5G: Algorithms and Implementations. Wiley-IEEE Press, 2016, ch. Massive MIMO for 5G: Theory, Implementation and Prototyping, pp. 616–.
[12] M. Wu, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, "Large-scale MIMO detection for 3GPP LTE: Algorithms and FPGA implementations," IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 916–929, Oct. 2014.
[13] H. Prabhu, J. Rodrigues, L. Liu, and O. Edfors, "Algorithm and hardware aspects of pre-coding in massive MIMO systems," in Proc. Asilomar Conf. Systems Computers, Nov. 2015, pp. 1144–1148.
[14] A. Puglielli, A. Townley, G. LaCaille, V. Milovanović, P. Lu, K. Trotskovsky, A. Whitcombe, N. Narevsky, G. Wright, T. Courtade, E. Alon, B. Nikolić, and A. M. Niknejad, "Design of energy- and cost-efficient massive MIMO arrays," Proc. IEEE, vol. 104, no. 3, pp. 586–606, Mar. 2016.
[15] H. Prabhu, J. N. Rodriguez, L. Liu, and O. Edfors, "A 60pJ/b 300Mb/s 128×8 massive MIMO precoder-decoder in 28nm FD-SOI," in Proc. IEEE Solid-State Circuit Conf., 2017.
[16] K. Li, R. Sharan, Y. Chen, J. Cavallaro, and C. Studer, "Decentralized data detection for massive MU-MIMO on a Xeon Phi cluster," in Proc. Asilomar Conf. Systems Computers, 2016.
[17] K. Li, R. Skaran, Y. Chen, J. R. Cavallaro, T. Goldstein, and C. Studer, "Decentralized beamforming for massive MU-MIMO on a GPU cluster," in Proc. IEEE Global Conf. Signal Inform. Process.
IEEE, 2016, pp. 590–594.
[18] H. Yang and T. L. Marzetta, "Performance of conjugate and zero-forcing beamforming in large-scale antenna systems," IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 172–179, Feb. 2013.
[19] C. B. Peel, B. M. Hochwald, and A. L. Swindlehurst, "A vector-perturbation technique for near-capacity multiantenna multiuser communication, part I: Channel inversion and regularization," IEEE Trans. Commun., vol. 53, no. 1, pp. 195–202, Jan. 2005.
[20] E. Bertilsson, O. Gustafsson, and E. G. Larsson, "A scalable architecture for massive MIMO base stations using distributed processing," in Proc. Asilomar Conf. Systems Computers, 2016.
[21] L. Wanhammar, DSP Integrated Circuits. Academic Press, 1999.
[22] Y. Ma and L. Wanhammar, "A hardware efficient control of memory addressing for high-performance FFT processors," IEEE Trans. Signal Process., vol. 48, no. 3, pp. 917–921, Mar. 2000.
[23] C. Ingemarsson and O. Gustafsson, "Hardware architecture for positive definite matrix inversion based on LDL decomposition and back-substitution," in Proc. Asilomar Conf. Systems Computers, 2016.
[24] X. Liang, C. Zhang, S. Xu, and X. You, "Coefficient adjustment matrix inversion approach and architecture for massive MIMO systems," in Proc. IEEE Int. Conf. ASIC, Nov. 2015.
[25] S. M. Abbas and C.-Y. Tsui, "Low-latency approximate matrix inversion for high-throughput linear pre-coders in massive MIMO," in Proc. IFIP/IEEE Int. Conf. Very Large Scale Integration, Sep. 2016.
[26] O. Gustafsson, E. Bertilsson, J. Klasson, and C. Ingemarsson, "Approximate Neumann series or exact matrix inversion for massive MIMO?" in Proc. IEEE Symp. Comput. Arithmetic, 2017, invited paper.