Efficient Channel Estimator with Angle-Division Multiple Access
Massive multiple-input multiple-output (M-MIMO) is an enabling technology of 5G wireless communication. The performance of an M-MIMO system is highly dependent on the speed and accuracy of obtaining the channel state information (CSI). The computatio…
Authors: Xiaozhen Liu (1), Jin Sha (1), Hongxiang Xie (2)
Xiaozhen Liu, Jin Sha, Hongxiang Xie, Feifei Gao, Shi Jin, Zaichen Zhang, Senior Member, IEEE, Xiaohu You, Fellow, IEEE, and Chuan Zhang, Member, IEEE

Abstract: Massive multiple-input multiple-output (M-MIMO) is an enabling technology of 5G wireless communication. The performance of an M-MIMO system is highly dependent on the speed and accuracy of obtaining the channel state information (CSI). The computational complexity of channel estimation for an M-MIMO system can be reduced by making use of the sparsity of the M-MIMO channel. In this paper, we propose, for the first time, a hardware-efficient channel estimator based on angle-division multiple access (ADMA). Preamble, uplink (UL) and downlink (DL) training are also implemented. For further hardware efficiency, optimizations regarding quantization and approximation strategies are discussed. Implementation techniques such as pipelining and systolic processing are also employed for hardware regularity. Numerical results and FPGA implementation demonstrate the advantages of the proposed channel estimator.

Index Terms: M-MIMO, channel estimation, angle-division multiple access (ADMA), VLSI, pipelining.

I. INTRODUCTION

WITH the explosive growth of mobile applications and cloud synchronization services, and the rapid development of high-quality multimedia services such as high-resolution image and 4K-resolution high dynamic range (HDR) video streaming, the existing 4G mobile communication technology can no longer meet the needs of enterprises and consumers for wireless communication networks. As a result, 5G mobile communication technology [2–4] has been developed, offering higher transmission speed, stronger bearing capacity, and a wider range of applications.
Massive multiple-input multiple-output (M-MIMO) is one of the key technologies of 5G [5, 6], owing to its many advantages, such as high spectral efficiency, high power efficiency, and high robustness. The performance of M-MIMO systems relies heavily on the acquisition of the uplink (UL) and downlink (DL) channel state information (CSI). However, the large-scale antenna array of M-MIMO brings new challenges to channel estimation [7]:
• The overhead of training sequences grows with the increasing number of users, while the reuse of training sequences gives rise to pilot contamination [8–11].
• The growing dimension of channel matrices (CMs) or channel covariance matrices (CCMs) greatly increases the complexity and resource consumption of traditional UL and DL channel estimation algorithms, which keeps the M-MIMO system from realizing its advantages.
• The amount of CSI that users feed back to the base station (BS) during DL channel estimation grows with the number of antennas at the BS, which is a great burden on the feedback channel.
• Channel reciprocity makes it easy to acquire DL CSI from UL CSI for time division duplexing (TDD) systems, whereas the non-reciprocity of frequency division duplexing (FDD) systems means that their DL channel estimation cannot be simplified in the same way, which is a great burden on user devices.

Xiaozhen Liu and Jin Sha are with the School of Electronic Science and Engineering, Nanjing University, China. Hongxiang Xie and Feifei Gao are with the Department of Automation, Tsinghua University, Beijing, China. Shi Jin, Zaichen Zhang, Xiaohu You, and Chuan Zhang are with the National Mobile Communications Research Laboratory, Southeast University, Nanjing, China. Email: shajin@nju.edu.cn, chzhang@seu.edu.cn. This paper was presented in part at International Conference on ASIC (ASICON), Guiyang, China, 2017 [1]. (Corresponding author: Jin Sha and Chuan Zhang.)
In order to reduce the computational complexity, we need to take advantage of the low-rank properties of the channel, which can significantly reduce the dimension of CMs and CCMs while retaining the valid information. Many works point out that the directions of arrival (DOAs) as well as the directions of departure (DODs) of propagation signals are limited to a narrow region (i.e., the angle spread (AS) is small) because a BS equipped with a large-scale antenna array has to be located on top of high buildings, which is known as the finite scattering model [12]. A similar scenario is mmWave communication, where the channel is naturally sparse and the AS equals 0 [13]. Meanwhile, due to the large-scale antenna array of M-MIMO, the spatial resolution of the BS is significantly improved, which means the BS can distinguish users from different directions more easily, so that the channel representation can be strongly sparse, with relatively few non-zero elements in CMs and CCMs.

As a result, many new or optimized methods to acquire CSI have been proposed [14–18], especially for DL channel estimation of FDD systems because of their non-reciprocity. [14] proposed an approach under the joint spatial division and multiplexing (JSDM) scheme for DL channel estimation of FDD systems, where the sparsity of CCMs is exploited and the eigenvalue decomposition (EVD) algorithm is required, which is a challenge for implementation. [15] proposed a low-rank matrix approximation based on compressed sensing (CS) and solved via quadratic semidefinite programming (SDP), which is novel but far too complex to implement. [16] deployed a CS-based method with the joint channel sparsity model (JCSM), which utilizes a virtual angular domain representation of the CM and limited local scattering in order to reduce the training and feedback overhead.
To this end, we first proposed a transmission strategy based on the spatial basis expansion model (SBEM) in [19], which comes from array signal processing theory. This scheme is also known as the angle-division multiple access (ADMA) scheme. The ADMA scheme has some particular advantages:
• Due to the increased angle resolution of antenna arrays at the BS, the angular information can be easily obtained by the discrete Fourier transform (DFT) of CMs under the ADMA scheme.
• By array signal processing theory, the angular information corresponds to the real directions of users, while other methods only have a virtual angular representation.
• As a result of the reciprocity brought by the DOA and DOD, the complexity and overhead of DL channel estimation and feedback can be reduced.
• The estimation algorithm mainly consists of DFT calculation, matrix multiplication, sorting and grouping, which is convenient for implementation.

As shown in [19], the performance of ADMA is better than that of [14] and [16], especially at low signal-to-noise ratio (SNR). In addition, there are also blind and semi-blind estimation methods to be explored. Those methods have higher transmission efficiency because they need fewer (or no) training sequences. However, their results may not be accurate at the start of transmission because the BS needs some time to accumulate channel statistics. Moreover, the efficient implementation of ADMA is very challenging due to the non-linear computation involved at the algorithm level, which hinders its application to channel estimation.

In order to bridge the aforementioned gap between algorithm and implementation, this paper proposes, for the first time, a hardware architecture for channel estimation under the ADMA scheme. A hardware-aware partition of the algorithm is conducted.
Our main technical contributions can be listed as follows:
• We propose a hardware-efficient channel estimator under ADMA, which takes advantage of the sparsity of M-MIMO systems in order to reduce the complexity, reduce the number of training sequences, and speed up the channel estimation for a large number of users.
• We discuss the approximation of the algorithm and transmission strategy as well as the quantization optimization in order to make our channel estimator suitable for hardware implementation.
• We propose the first channel estimator architecture with the ADMA scheme, which achieves higher hardware efficiency and higher processing speed for channel estimation of M-MIMO systems.
• We develop an optimized architecture to simplify our original channel estimator, with little performance loss but a large reduction in resources and higher hardware efficiency.
• We present the FPGA implementation of our ADMA channel estimator on Xilinx Virtex-7 xcvu440-flga2892-2-e, to demonstrate its suitability for 5G wireless. The advantages have been verified by FPGA implementations.

The remainder of this paper is organized as follows. Section II proposes the implementation-aware partition of the ADMA algorithm. The hardware-friendly approximation and the simulation results are presented in Section III. The detailed pipelined hardware architecture is presented in Section IV. FPGA implementation is also given in the same section to demonstrate the advantages. Finally, Section V concludes the entire paper.

Notations. The notations employed in this paper are listed in Table I for clearer representation.
TABLE I
NOTATIONS IN THIS PAPER
Symbol | Definition
$M$ | number of antennas at the BS
$K$ | number of users that the BS serves
$L$ | length of training sequences
$\tau$ | number of parameters the BS can handle
$\mathbf{h}$ / $\mathbf{H}$ | vector $\mathbf{h}$ / matrix $\mathbf{H}$
$[\mathbf{h}]_i$ | the $i$-th element of vector $\mathbf{h}$
$[\mathbf{H}]_{ij}$ | the $(i,j)$-th element of matrix $\mathbf{H}$
$\mathbf{h}^T$ / $\mathbf{H}^T$ | the transpose of vector $\mathbf{h}$ / matrix $\mathbf{H}$
$\mathbf{h}^H$ / $\mathbf{H}^H$ | the Hermitian of vector $\mathbf{h}$ / matrix $\mathbf{H}$
$\mathcal{B}$ | set $\mathcal{B}$ of $\tau$ continuous integers
$\lvert\mathcal{B}\rvert$ | the cardinality of the set $\mathcal{B}$
$[\mathbf{h}]_{\mathcal{B}}$ | sub-vector of $\mathbf{h}$ formed by keeping the elements indexed by $\mathcal{B}$
$[\mathbf{H}]_{:,\mathcal{B}}$ | sub-matrix of $\mathbf{H}$ formed by collecting the columns indexed by $\mathcal{B}$
$\mathrm{diag}\{\mathbf{h}\}$ | a diagonal matrix with the diagonal elements constructed from vector $\mathbf{h}$
$E\{\cdot\}$ | the statistical expectation

II. IMPLEMENTATION-AWARE PARTITION OF ADMA CHANNEL ESTIMATION

Implementation-aware partition of ADMA channel estimation is first conducted in this section.

A. Setting-Up of ADMA

For ease of illustration, we consider a multiuser M-MIMO system, where the BS is equipped with $M$ ($M \gg 1$) antennas in the form of a uniform linear array (ULA) and serves $K$ users. We assume that the number of parameters which the BS can handle is $\tau$. In addition, as we presume that each user is equipped with only one antenna, the CM of user-$k$ can be described as an $M \times 1$ vector $\mathbf{h}_k$. From array signal processing theory, the UL channel vector $\mathbf{h}_k$ of user-$k$ has the form

$$\mathbf{h}_k = \frac{1}{\sqrt{P}} \sum_{p=1}^{P} \alpha_{kp}\, \mathbf{a}(\theta_{kp}), \tag{1}$$

where $P$ is the number of beamforming rays, $\alpha_{kp}$ is the complex gain of the $p$-th ray and $\mathbf{a}(\theta_{kp})$ is the array manifold vector, which can be expressed as

$$\mathbf{a}(\theta_{kp}) = \left[ 1,\ e^{j\frac{2\pi d}{\lambda}\sin\theta_{kp}},\ \ldots,\ e^{j\frac{2\pi d}{\lambda}(M-1)\sin\theta_{kp}} \right]^T. \tag{2}$$

Remark 1. In this paper, we do not discuss the situation in which users are equipped with multiple antennas or the propagation signal contains multiple subcarriers in orthogonal frequency division multiplexing (OFDM) systems. In fact, the same sparsity holds for the vectors obtained by collecting the rows or columns of the channel matrices, and likewise for the channel matrices of different subcarriers. Hence, once the sparsity is obtained under the ADMA scheme, it can be extended to many such scenarios.

B. Channel Sparsity Revealed by ADMA

To guarantee the performance of the proposed channel estimator, the sparsity revealed by ADMA must be kept during the implementation process. ADMA presents a sparse representation of the channel of an M-MIMO system via the discrete Fourier transform (DFT) of the channel vector, i.e., $\tilde{\mathbf{h}}_k$, which can be calculated by

$$\tilde{\mathbf{h}}_k = \mathbf{F}\mathbf{h}_k, \tag{3}$$

where $\mathbf{F}$ is the $M \times M$ DFT matrix whose elements are $[\mathbf{F}]_{pq} = e^{-j\frac{2\pi}{M}pq}/\sqrt{M}$. For ease of description, we restate two lemmas proved in [19]:

Lemma 1. If $P = 1$ (i.e., the AS is zero) and $M \to \infty$, there will be only one non-zero element in $\tilde{\mathbf{h}}_k$, and the index of this non-zero element is related to its DOA or DOD.

Proof: For $P = 1$, $\mathbf{h}_k$ can be simplified to $\mathbf{h}_k = \alpha_k \mathbf{a}(\theta_k)$. The $b$-th element of $\tilde{\mathbf{h}}_k$ can then be calculated as

$$[\tilde{\mathbf{h}}_k]_b = \frac{\alpha_k}{\sqrt{M}} \sum_{m=0}^{M-1} e^{-j\left(\frac{2\pi}{M} m b - \frac{2\pi}{\lambda} m d \sin\theta_k\right)} = \frac{\alpha_k}{\sqrt{M}}\, e^{-j\frac{M-1}{2}\left(\frac{2\pi}{M} b - \frac{2\pi}{\lambda} d \sin\theta_k\right)} \cdot \frac{\sin\left[\left(\frac{2\pi}{M} b - \frac{2\pi}{\lambda} d \sin\theta_k\right)\cdot\frac{M}{2}\right]}{\sin\left[\left(\frac{2\pi}{M} b - \frac{2\pi}{\lambda} d \sin\theta_k\right)\cdot\frac{1}{2}\right]}. \tag{4}$$

If $M \to \infty$, we get

$$\lim_{M\to\infty} \left| [\tilde{\mathbf{h}}_k]_b \right| = |\alpha_k| \cdot \sqrt{M} \cdot \delta\!\left(\frac{b}{M} - \frac{d\sin\theta_k}{\lambda}\right). \tag{5}$$

Eq. (5) gives the relationship between the index of the non-zero element (i.e., $b_0$) in $\tilde{\mathbf{h}}_k$ and the DOA when $M \to \infty$, which can be described as

$$b_0 = \frac{M d \sin\theta_k}{\lambda} \quad\Leftrightarrow\quad \theta_k = \arcsin\!\left(\frac{b_0 \lambda}{M d}\right). \tag{6}$$

Having discussed the situation with $P = 1$ and $M \to \infty$, we can move on to the more complex and realistic cases:
• When $P > 1$ and $M \to \infty$, each propagation ray corresponds to a non-zero element in $\tilde{\mathbf{h}}_k$. The index of the middle element corresponds to the DOA of user-$k$, while the number of non-zero elements corresponds to the AS of user-$k$.
• When $P = 1$ and $M$ is large but finite, power leakage emerges because the resolution of the BS is limited, so that $b_0 = \frac{M d \sin\theta_k}{\lambda}$ is not always an integer. However, there are only a few non-zero elements concentrated around $b_0 = \left\lfloor \frac{M d \sin\theta_k}{\lambda} \right\rceil$ since $M$ is large. In fact, $M$ determines the sampling precision of the discrete-time Fourier transform (DTFT) of $\mathbf{h}_k$ in the frequency domain. Since the indexes of the non-zero elements in $\tilde{\mathbf{h}}_k$ correspond to the DOA and AS of user-$k$, $M$ also determines the spatial resolution of the BS.
• When $P > 1$ and $M$ is large but finite, the situation is similar to that with $P = 1$ and large but finite $M$, but the number of non-zero elements in $\tilde{\mathbf{h}}_k$ is larger and is related to the AS of user-$k$.

From the above, we can obtain a sparse channel representation simply by applying the DFT to the channel vector and picking the non-zero elements together with their indexes. In practice, since the BS can handle at most $\tau$ ($\tau \ll M$) parameters, we can use $\tau$ non-zero points of the DFT channel vector $\tilde{\mathbf{h}}_k$ instead of all $M$ points to represent the CSI, which considerably reduces the computation and feedback overhead.

C. Sparsity Enhancer for ADMA

To enhance the channel sparsity under the ADMA scheme, we define:

Definition 1. Define $\mathbf{\Phi}(\phi_k)$ as the rotation matrix for user-$k$, which can be expressed as

$$\mathbf{\Phi}(\phi_k) = \mathrm{diag}\left\{\left[ 1,\ e^{j\phi_k},\ \ldots,\ e^{j(M-1)\phi_k} \right]\right\}, \tag{7}$$

where $\phi_k \in \left[-\frac{\pi}{M}, \frac{\pi}{M}\right]$. We can then add this rotation operation to the DFT calculation. Define $\tilde{\mathbf{h}}^{\mathrm{ro}}_k$ as the new channel representation with rotation, given by

$$\tilde{\mathbf{h}}^{\mathrm{ro}}_k = \mathbf{F}\mathbf{\Phi}(\phi_k)\mathbf{h}_k. \tag{8}$$

In this way, we can use fewer non-zero elements to represent the channel vector.
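The effect of Eqs. (3), (7) and (8) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the function names, the half-wavelength spacing $d/\lambda = 0.5$, the array size $M = 64$ and the chosen angle are all assumptions made for illustration. It builds a single-ray channel whose ideal index $M d \sin\theta_k / \lambda$ falls halfway between two DFT bins and compares the share of power in the strongest bin with and without rotation.

```python
import cmath
import math


def steering_vector(num_antennas, theta, d_over_lambda=0.5):
    """Array manifold vector a(theta) of Eq. (2); d/lambda = 0.5 is an assumed spacing."""
    return [
        cmath.exp(2j * math.pi * d_over_lambda * m * math.sin(theta))
        for m in range(num_antennas)
    ]


def rotated_dft(h, phi=0.0):
    """Return F * Phi(phi) * h as in Eq. (8); phi = 0 reduces to the plain DFT of Eq. (3)."""
    M = len(h)
    # Multiplying by the diagonal of Phi(phi) is an element-wise phase ramp.
    rotated = [h[m] * cmath.exp(1j * m * phi) for m in range(M)]
    # Direct O(M^2) DFT with the 1/sqrt(M) normalisation of the matrix F.
    return [
        sum(rotated[m] * cmath.exp(-2j * math.pi * m * b / M) for m in range(M))
        / math.sqrt(M)
        for b in range(M)
    ]


def strongest_bin_share(h_tilde):
    """Fraction of the total power carried by the single strongest DFT bin."""
    power = [abs(x) ** 2 for x in h_tilde]
    return max(power) / sum(power)


if __name__ == "__main__":
    M = 64
    # Single ray (P = 1) with M * d * sin(theta) / lambda = 10.5, i.e. between bins 10 and 11.
    h = steering_vector(M, math.asin(10.5 / (0.5 * M)))
    for phi in (0.0, math.pi / M):
        share = strongest_bin_share(rotated_dft(h, phi))
        print(f"phi = {phi:.4f}: strongest-bin power share = {share:.3f}")
```

In this constructed case the unrotated DFT splits the power between two neighbouring bins (about 40% in each), whereas the rotation $\phi_k = \pi/M$ shifts the peak onto an integer index and concentrates essentially all of the power in one bin.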
In practice, for a fixed budget of $\tau$ elements, the elements we pick after rotation contain more of the channel energy, which is a great benefit for the training overhead.

Remark 2. Notice that the rotation is actually a translation of $\tilde{\mathbf{h}}_k$ in the frequency domain. Since the spatial resolution of the BS is limited, the rotation operation lets us align the sample points with the middle of the peak of the DTFT of $\mathbf{h}_k$ as closely as possible. Since the sampling interval in the frequency domain is $\frac{2\pi}{M}$, it is only necessary to search over $\phi_k \in \left[-\frac{\pi}{M}, \frac{\pi}{M}\right]$.

In this case, we can define the index set that describes the signature of channel vectors with rotation as follows:

Definition 2. Define $\mathcal{B}^{\mathrm{ro}}_k$ as the spatial signature of user-$k$, which can be determined according to

$$\max_{\phi_k,\ \mathcal{B}^{\mathrm{ro}}_k} \frac{\left\| [\tilde{\mathbf{h}}^{\mathrm{ro}}_k]_{\mathcal{B}^{\mathrm{ro}}_k} \right\|^2}{\left\| \tilde{\mathbf{h}}^{\mathrm{ro}}_k \right\|^2}, \quad \text{subject to } |\mathcal{B}^{\mathrm{ro}}_k| = \tau. \tag{9}$$

We now have two parameters to be determined for each user under the ADMA scheme: $\phi$ and $\mathcal{B}^{\mathrm{ro}}$. The main benefit of this sparse channel representation is that only a few training sequences are needed, because users from different directions whose DOAs do not overlap can share the same training sequence. In practice, we usually use $\tau$ orthogonal training sequences, which makes full use of the BS.

The transmission strategy under the ADMA scheme can be divided into three stages: the preamble stage, the UL training stage and the DL training stage. The aim of the preamble stage is to collect the two parameters of all users and divide the users into different groups according to their spatial signatures. Then, in the UL and DL training stages, we can perform faster estimation than conventional channel estimation methods thanks to the grouping done in the preamble stage. The preamble stage is not necessary after each UL and DL training stage; the number of UL and DL training stages that can follow one preamble stage depends on the mobility of users.

D. Preamble Module

In the preamble period, we need to find $\phi$ and $\mathcal{B}^{\mathrm{ro}}$ for each user so that we can allocate all users into different groups in which the index sets $\mathcal{B}^{\mathrm{ro}}$ of users do not overlap. First, we allocate all $K$ users into $G$ groups, each containing $\tau$ users, as the BS can handle up to $\tau$ training sequences. Then we apply conventional UL training for each group, and the received signal matrix of each group at the BS is given by

$$\mathbf{Y} = \mathbf{H}\mathbf{D}^{1/2}\mathbf{S}^H + \mathbf{N} = \sum_{i=1}^{\tau} \sqrt{d_i}\, \mathbf{h}_i \mathbf{s}_i^H + \mathbf{N}, \tag{10}$$

where $\mathbf{Y} \in \mathbb{C}^{M \times L}$, $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_\tau] \in \mathbb{C}^{M \times \tau}$, $\mathbf{S} = [\mathbf{s}_1, \ldots, \mathbf{s}_\tau] \in \mathbb{C}^{L \times \tau}$, $\mathbf{D} = \mathrm{diag}\{[d_1, \ldots, d_\tau]\} \in \mathbb{C}^{\tau \times \tau}$, and $d_k = P^{ut}_k / (L\sigma_p^2)$ is used to satisfy the energy constraint ($P^{ut}_k$ is the UL training energy constraint of user-$k$, and $\sigma_p^2$ is the pilot signal training power). $\mathbf{N} \in \mathbb{C}^{M \times L}$ is the additive white Gaussian noise matrix. Then $\mathbf{h}_k$ can be estimated through the least squares (LS) method as

$$\hat{\mathbf{h}}_k = \frac{1}{\sqrt{d_k}\, L \sigma_p^2} \mathbf{Y}\mathbf{s}_k. \tag{11}$$

We can then find $\phi_k$ and $\mathcal{B}^{\mathrm{ro}}_k$ for each user by adopting Eq. (9). The specific method is discussed in Section III. After that, we need to allocate all users into $G^{ul}$ groups in which the index sets of users do not overlap, so that the users in the same group can share the same training sequence. This can be described as

$$\begin{cases} \mathcal{B}^{\mathrm{ro}}_k \cap \mathcal{B}^{\mathrm{ro}}_l = \emptyset \\ \min |b_1 - b_2| \geq \Omega, \quad \forall b_1 \in \mathcal{B}^{\mathrm{ro}}_k,\ \forall b_2 \in \mathcal{B}^{\mathrm{ro}}_l, \end{cases} \tag{12}$$

where $\Omega$ is a guard interval which depends on the tolerance of users for the interference due to pilot reuse. A grouping strategy that is easy to implement in VLSI is presented in Section IV.

E. UL Training Module

In the UL training, all $K$ users send their training sequences to the BS. The received signal matrix at the BS is given by

$$\mathbf{Y} = \sum_{i=1}^{G^{ul}} \sum_{k \in \mathcal{U}^{ul}_i} \sqrt{d_k}\, \mathbf{h}_k \mathbf{s}_i^H + \mathbf{N}. \tag{13}$$

First, we extract the channel vector for group-$g$ through a conventional LS method:

$$\mathbf{y}_g = \frac{1}{L\sigma_p^2} \mathbf{Y}\mathbf{s}_g. \tag{14}$$

Since the two parameters differ from user to user, we extract $\tilde{\mathbf{h}}_k$ for each user-$k$ through

$$\widehat{[\tilde{\mathbf{h}}^{\mathrm{ro}}_k]}_{\mathcal{B}^{\mathrm{ro}}_k} = [\tilde{\mathbf{y}}^{\mathrm{ro}}_{g,k}]_{\mathcal{B}^{\mathrm{ro}}_k} = \frac{1}{\sqrt{d_k}} \left[\mathbf{F}\mathbf{\Phi}(\phi_k)\mathbf{y}_g\right]_{\mathcal{B}^{\mathrm{ro}}_k}. \tag{15}$$

Finally, we can recover $\hat{\mathbf{h}}_k$ for user-$k$ by

$$\hat{\mathbf{h}}_k = \mathbf{\Phi}(\phi_k)^H \mathbf{F}^H \hat{\tilde{\mathbf{h}}}^{\mathrm{ro}}_k = \mathbf{\Phi}(\phi_k)^H \left[\mathbf{F}^H\right]_{:,\mathcal{B}^{\mathrm{ro}}_k} \widehat{[\tilde{\mathbf{h}}^{\mathrm{ro}}_k]}_{\mathcal{B}^{\mathrm{ro}}_k}. \tag{16}$$

F. DL Training Module and Its Reciprocity

Based on the reciprocity of ADMA, the DL CSI can be easily obtained from the UL training, as shown in [19]. The reciprocity of ADMA comes from the fact that the propagation path of an electromagnetic wave is reciprocal. As a result, the DOA (DOD) of the DL signal is the same as the DOD (DOA) of the UL signal. Assume that the DL spatial signature of user-$k$ is $\bar{\mathcal{B}}^{\mathrm{ro}}_k$, which can be determined from the UL spatial signature $\mathcal{B}^{\mathrm{ro}}_k$ by applying Eq. (6):

$$\sin\theta_{kp} = \frac{q\lambda_{ul}}{M d} = \frac{\bar{q}\lambda_{dl}}{M d}, \tag{17}$$

where $q$ and $\bar{q}$ are elements of $\mathcal{B}^{\mathrm{ro}}_k$ and $\bar{\mathcal{B}}^{\mathrm{ro}}_k$, respectively, while $\lambda_{ul}$ and $\lambda_{dl}$ denote the UL and DL carrier wavelengths. Since $\sin\theta_{kp}$ is a monotonic function for $\theta_{kp} \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$, the minimum and maximum elements of $\mathcal{B}^{\mathrm{ro}}_k$ and $\bar{\mathcal{B}}^{\mathrm{ro}}_k$ have a one-to-one correspondence, i.e.,

$$\bar{q}_{\min} = \frac{\lambda_{ul}}{\lambda_{dl}} q_{\min}, \qquad \bar{q}_{\max} = \frac{\lambda_{ul}}{\lambda_{dl}} q_{\max}, \tag{18}$$

where $q_{\min} \leq q \leq q_{\max}$, $\forall q \in \mathcal{B}^{\mathrm{ro}}_k$. Similarly, the DL rotation $\bar{\phi}_k$ can be calculated by $\bar{\phi}_k = (\lambda_{ul}/\lambda_{dl})\phi_k$.

The DL training is mostly the same as the UL training except for the grouping strategy. In the DL training module, since users with identical spatial signatures can be served with the same beamforming vectors simultaneously, they can share the same training sequence. Meanwhile, users whose spatial signatures do not overlap can share the same training sequence, which is the same as in the UL training grouping strategy. As a result, our DL training strategy consists of two steps. First, we allocate users with identical spatial signatures into the same cluster. Then we allocate these clusters into different groups through Eq. (12).
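The grouping condition of Eq. (12) and the two-step DL strategy above can be sketched in software as follows. This is a hedged illustration only: the greedy first-fit assignment, the function names, the non-cyclic index distance and the example signatures are assumptions of this sketch, and it is not the systolic grouping architecture of Section IV.

```python
def signatures_compatible(sig_a, sig_b, guard):
    """Eq. (12): the index sets are disjoint and their closest indexes are at least `guard` apart."""
    if set(sig_a) & set(sig_b):
        return False
    return min(abs(a - b) for a in sig_a for b in sig_b) >= guard


def greedy_grouping(signatures, guard):
    """Assign each user to the first group whose members all satisfy Eq. (12).

    `signatures` maps a user label to its list of DFT indexes. Users are visited
    in ascending order of their smallest index (an assumed ordering).
    """
    groups = []
    for user in sorted(signatures, key=lambda u: min(signatures[u])):
        for group in groups:
            if all(
                signatures_compatible(signatures[user], signatures[other], guard)
                for other in group
            ):
                group.append(user)
                break
        else:
            # No existing group can host this user, so open a new one.
            groups.append([user])
    return groups


def dl_grouping(signatures, guard):
    """Two-step DL strategy: cluster identical signatures, then group the clusters."""
    clusters = {}
    for user, sig in signatures.items():
        clusters.setdefault(tuple(sorted(sig)), []).append(user)
    cluster_signatures = {key: list(key) for key in clusters}
    return [
        [clusters[key] for key in group]
        for group in greedy_grouping(cluster_signatures, guard)
    ]


if __name__ == "__main__":
    # Hypothetical signatures with tau = 4 and guard interval Omega = 2.
    example = {
        "u1": [10, 11, 12, 13],
        "u2": [12, 13, 14, 15],
        "u3": [40, 41, 42, 43],
        "u4": [10, 11, 12, 13],
    }
    print("UL groups:", greedy_grouping(example, guard=2))
    print("DL groups:", dl_grouping(example, guard=2))
```

In the UL case, users u1 and u4 have identical signatures and therefore end up in different groups, whereas in the DL case they form one cluster that shares a training sequence.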
The rest of the transmission and estimation is the same as in the UL training module.

With the algorithm partition completed, we are now able to carry out the detailed implementation-aware algorithm optimization and module-wise architecture design as follows.

III. APPROXIMATION AND QUANTIZATION

For simulations, the mean square error (MSE) is calculated as follows:

$$\mathrm{MSE} = \frac{E\{\|\mathbf{h}_k - \hat{\mathbf{h}}_k\|^2\}}{E\{\|\mathbf{h}_k\|^2\}}. \tag{19}$$

For comparison, the system parameters are set as $M = 128$, $K = 32$, $L = 64$, $\tau = 16$, $\theta_k \in \{-48.59^\circ, -14.48^\circ, 48.59^\circ, 14.48^\circ\}$ and $\Delta\theta_k = 2^\circ$, which are consistent with those in [19].

A. Approximation for Sliding Window Method

The authors of [19] proposed a basic way to find $\mathcal{B}^{\mathrm{ro}}_k$ for user-$k$: a one-dimensional search over $\phi_k \in \left[-\frac{\pi}{M}, \frac{\pi}{M}\right]$ and, for each possible $\phi_k$, sliding a window of size $\tau$ over the $M$ elements of $\tilde{\mathbf{h}}_k$ to determine the $\mathcal{B}^{\mathrm{ro}}_k$ that maximizes the channel power ratio. However, there are two main problems with this method. The first is that searching over $\phi_k \in \left[-\frac{\pi}{M}, \frac{\pi}{M}\right]$ cannot be carried out directly since it is a continuous interval. Fig. 1 shows the corresponding MSE of the simulations when $N$ discrete values are chosen in $\left[-\frac{\pi}{M}, \frac{\pi}{M}\right]$. As we can see, the MSE for $N = 3$ is nearly the same as for $N > 3$, so $N = 3$ turns out to be suitable for VLSI implementation.

[Fig. 1. MSE results of different N. Horizontal axis: SNR/dB; vertical axis: MSE; curves: N = 257, 9, 7, 5, 3, 1.]

The second problem is that this method introduces considerable latency and increases the computational complexity, as an accumulator and a divider are needed.
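The baseline search discussed above, restricted to $N$ candidate rotations, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the candidates are taken as a uniform grid on $\left[-\frac{\pi}{M}, \frac{\pi}{M}\right]$ that includes both end points (with $\phi_k = 0$ for $N = 1$), the sliding window wraps around cyclically, and all function names are hypothetical.

```python
import cmath
import math


def rotated_dft(h, phi):
    """F * Phi(phi) * h of Eq. (8), computed as a direct O(M^2) DFT."""
    M = len(h)
    rotated = [h[m] * cmath.exp(1j * m * phi) for m in range(M)]
    return [
        sum(rotated[m] * cmath.exp(-2j * math.pi * m * b / M) for m in range(M))
        / math.sqrt(M)
        for b in range(M)
    ]


def best_window(h_tilde, tau):
    """Start index and power ratio of the best window of tau contiguous indexes, Eq. (9).

    The window is cyclic, which is an assumption of this sketch.
    """
    M = len(h_tilde)
    power = [abs(x) ** 2 for x in h_tilde]
    total = sum(power)
    # Accumulate the power inside each candidate window (the accumulator step).
    sums = [sum(power[(s + i) % M] for i in range(tau)) for s in range(M)]
    start = max(range(M), key=sums.__getitem__)
    # Normalise by the total power (the divider step).
    return start, sums[start] / total


def search_signature(h, tau, num_phi=3):
    """Try num_phi rotations and return (phi, window start, power ratio) of the best one."""
    M = len(h)
    if num_phi == 1:
        candidates = [0.0]
    else:
        step = 2 * math.pi / M / (num_phi - 1)
        candidates = [-math.pi / M + i * step for i in range(num_phi)]
    best = None
    for phi in candidates:
        start, ratio = best_window(rotated_dft(h, phi), tau)
        if best is None or ratio > best[2]:
            best = (phi, start, ratio)
    return best


if __name__ == "__main__":
    M, tau = 32, 4
    # Hypothetical single-ray channel whose ideal index 10.5 lies between two bins.
    h = [cmath.exp(2j * math.pi * 10.5 * m / M) for m in range(M)]
    for n in (1, 3):
        phi, start, ratio = search_signature(h, tau, num_phi=n)
        print(f"N = {n}: phi = {phi:+.4f}, window start = {start}, ratio = {ratio:.3f}")
```

The accumulation inside `best_window` and the final division are the operations that the approximations below aim to simplify or remove.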
Here we introduce some approximations to lower the complexity:
• The first is to find the maximum element in $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$ for each possible $\phi_k$ and determine the best $b^{\mathrm{ro}}_k$ and $\phi_k$ for user-$k$ from the largest elements over all possible $\phi_k$.
• The second is to find the maximum element and the second largest element in $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$ and calculate the quadratic sum of these two elements for each possible $\phi_k$. The best $b^{\mathrm{ro}}_k$ and $\phi_k$ are determined from the largest quadratic sum over all possible $\phi_k$ (the index $b^{\mathrm{ro}}_k$ is the mean of the indexes of the largest two elements in $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$ with $\phi_k$).
• The third is to find the maximum element in $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$ and calculate the quadratic sum of the $\tau$ continuous elements centered on the maximum element for each possible $\phi_k$. The best $b^{\mathrm{ro}}_k$ and $\phi_k$ are determined from the largest quadratic sum over all possible $\phi_k$ (the index $b^{\mathrm{ro}}_k$ comes from the largest element in $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$ with $\phi_k$).

Fig. 2 shows the MSE increase for the approximations above. The performance of the third approximation is nearly the same as that of the basic method, while a divider is saved. Meanwhile, the performance loss of the first or the second method is higher but still acceptable, with only one or two comparators needed.

[Fig. 2. MSE of different methods to determine the spatial signature of user. Horizontal axis: SNR/dB; vertical axis: MSE; curves: Sliding Window with ratio, Sliding Window with acc, Max, Max 2 with acc.]

B. Quantization Scheme

For quantization, the variables are quantized with 1 sign bit, $p$ integer bits, and $q$ fractional bits, which is expressed as fixed[1, $p$, $q$]. The integer width $p$ is usually determined by the probability density function (PDF) of the data. In our algorithm, however, the largest data must be less than $2^p$, since the channel state information contains the largest element in $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$.
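The fixed[1, $p$, $q$] format can be emulated in software as in the minimal sketch below. The rounding mode (round half up) and the saturation to a two's-complement range are assumptions of this sketch; the text does not specify them, and the function names are hypothetical.

```python
import math


def quantize_fixed(x, p=8, q=6):
    """Quantize a real value to fixed[1, p, q]: 1 sign bit, p integer bits, q fractional bits.

    Assumptions of this sketch: round-half-up rounding and saturation to the
    two's-complement range [-2**p, 2**p - 2**-q].
    """
    scale = 1 << q
    code = math.floor(x * scale + 0.5)      # nearest integer code, ties rounded up
    max_code = (1 << (p + q)) - 1           # largest representable code
    min_code = -(1 << (p + q))              # most negative representable code
    code = max(min_code, min(max_code, code))
    return code / scale


def quantize_complex(z, p=8, q=6):
    """Quantize the real and imaginary parts of a complex sample independently."""
    return complex(quantize_fixed(z.real, p, q), quantize_fixed(z.imag, p, q))


if __name__ == "__main__":
    for value in (0.3, 1.2345, 300.0, -300.0):
        print(value, "->", quantize_fixed(value, p=8, q=6))
    print(quantize_complex(0.3 - 0.3j))
```

Within the representable range, the quantization error of each real component is bounded by $2^{-(q+1)}$, which is why the fractional width $q$ governs the MSE degradation, while the integer width $p$ only has to cover the largest magnitude that occurs.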
Fig. 3 shows the statistics of a large number of samples of the largest data in $\mathbf{h}_k$, $\tilde{\mathbf{h}}^{\mathrm{ro}}_k$ and $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$. According to these statistics, the largest data is less than $2^8$, so the width of the integer part of the variables is set to 8. In order to determine the width of the fractional part of the variables, the corresponding MSE of the double-precision floating-point and fixed-point simulations is illustrated in Fig. 4. From Fig. 4, the MSE of the fixed[1, 8, 6] and fixed[1, 8, 7] simulations is almost the same, with a slight degradation compared with the double-precision floating-point simulation. However, the MSE of the fixed[1, 8, 5] simulation is relatively far from that of the double-precision floating-point simulation. As a result, the quantization scheme fixed[1, 8, 6] is preferred for hardware implementation.

[Fig. 3. The statistics of a large number of samples of the largest data in $\mathbf{h}_k$, $\tilde{\mathbf{h}}^{\mathrm{ro}}_k$ and $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$. Panels: (a) largest data in $\mathbf{h}_k$; (b) largest data in $\tilde{\mathbf{h}}^{\mathrm{ro}}_k$; (c) largest data in $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$. The maximum data appears in the elements of $|\tilde{\mathbf{h}}^{\mathrm{ro}}_k|$.]

[Fig. 4. MSE results of double-precision floating-point and fixed-point simulations. Horizontal axis: SNR/dB; vertical axis: MSE; curves: Double and five fixed-point settings.]

IV. PIPELINED ARCHITECTURE

For channel estimation under the ADMA scheme, the operations are conducted on $M$-dimensional vectors and matrices, where $M$ is large. For low complexity and high processing speed, the pipelined architecture shown in Fig. 5 is adopted. In addition, the quantization scheme fixed[1, 8, 6] is employed, together with $\phi_k \in \left\{-\frac{\pi}{M}, 0, \frac{\pi}{M}\right\}$. Our design has two stages controlled by a 1-to-2 switch. Stage 1 consists of the pre-treatment module, the preamble processing module and the UL-grouping module, corresponding to Eqs. (11) and (8). Stage 2 comprises the pre-treatment module and the UL-estimation module, corresponding to Eqs. (15) and (16).

A. Module Design

1) Pre-treatment Module: The pre-treatment module can be reused since the preamble module and the UL-estimation module are processed in different time slots. The pre-treatment module consists of a data buffer and an LS-based estimation module. The LS-based estimation module corresponding to Eq. (11) can be implemented by a systolic structure [20], whose data flow graph is shown in Fig. 7 and which is an efficient processing method for matrix-vector multiplication. Each processing element (PE) performs one complex multiplication and one complex addition. Each PE corresponds to one element of the training sequence $\mathbf{s}_k$ and one column of the received data matrix $\mathbf{Y}$, so the data buffer is needed to align the data properly, because $\mathbf{Y}$ is received by column (i.e., we receive the $M$ elements of one column of $\mathbf{Y}$ in one clock period).

2) Fast Fourier Transform (FFT) Module: Eq. (8) can be divided into two steps. One is a diagonal matrix and vector multiplication, which can be implemented by a complex multiplier and a $\mathbf{\Phi}$-generator that outputs the diagonal elements of $\mathbf{\Phi}(\phi_k)$ in pipeline. The other is the DFT, which can be implemented by fast Fourier transform (FFT) processors, reducing the computational complexity to $O(M\log_2 M)$. There are many FFT structures, which emphasize either higher processing speed or lower resource overhead [21–23]. For higher hardware efficiency, the single-path feedback pipelined hardware architecture shown in Fig. 9 is employed, in which the number of registers is the smallest as a result of the use of multiplexers.

3) Up-link Grouping Module: In the up-link grouping module, there are two main submodules: sorting and grouping. The sorting module is implemented by a pipelined merging network [24], which is shown in Fig. 8.
This sorting network is built recursively, merging from 2-element comparisons up to $N$-element comparisons (assuming $\log_2 N$ is a positive integer). The merger-$N$ module combines a symmetric comparing network with two bitonic sorters of $N/2$ elements each. Each bitonic sorter-$N/2$ is in turn implemented by a half cleaner-$N/2$ module and two bitonic sorters-$N/4$.

The grouping module is implemented by the systolic structure shown in Fig. 10, where each comparing PE corresponds to one group and decides whether the latest input $b_k^{\mathrm{ro}}$ is suitable for the group by comparing it with the latest $b_l^{\mathrm{ro}}$ already in this group. Since the outputs of the sorting module are parallel and the input of the grouping module is serial, a parallel-to-serial module is necessary.

Remark 3. Notice that the grouping messages are sent to users through an independent feedback channel, which is not included in our hardware design.

4) Up-link estimation module: In the Up-link estimation module, Eq. (15) is implemented by a complex multiplier, an FFT module and an extraction module, while Eq. (16) is implemented by an inverse fast Fourier transform (IFFT) module and a complex multiplier. Due to the sparsity of $\tilde{\mathbf{h}}_k$, the IFFT can be treated as the multiplication of an $M \times \tau$ matrix by a $\tau \times 1$ vector, which is implemented by a systolic structure consisting of $\tau$ PEs for higher efficiency.

B. Optimized Architecture without Rotation

As shown in Fig. 5, the FFT modules of the preamble processing module and the Up-link estimation module could be reused, since they are not active at the same time. However, the spatial signatures of users in the same group are different, which leads to wasted FFT modules. Here we find that we can simply omit the rotation operations, as in the architecture shown in Fig. 6, which reuses the FFT module and reduces the number of FFT modules from $\tau + K$ to $\tau$, saving considerable resources.

Fig. 5. The overall hardware architecture of channel estimation under ADMA scheme. The number of preamble processing modules is equal to the number of training sequences $\tau$ and the number of UL estimation modules is equal to the number of users in order to achieve the highest processing speed.

Fig. 6. The overall hardware architecture without rotation operations.

Fig. 7. Systolic structure of LS-based estimation module.

C. Processing Schedule and Overhead Analysis

For channel estimation under the ADMA scheme, the timing of the entire design is shown in Fig. 11, where $T_s$ is the clock cycle. We can see that every module is pipelined except the UL-grouping module. The timing of the optimized architecture without rotation is the same as that shown in Fig. 11. The resource cost of each module is listed in Table II, and the latency and processing time of each module are listed in Table III. Here, “Latency” is associated with one data package, and “Processing time” is associated with $M$ data packages. Notice that $P$ is an integer between $0$ and $M - 1$ which is determined by the spatial signature of each user.

D. FPGA Implementation Results

In order to demonstrate the advantage of channel estimation under the ADMA scheme, our architectures are implemented on a Xilinx Virtex-7 Ultrascale vu440-flga2892-2-e FPGA. For ease of implementation, the parameters are set as $M = 128$, $K = 16$, $L = 4$, $\tau = 4$, $\theta_k \in \{-48.59^\circ, -14.48^\circ, 48.59^\circ, 14.48^\circ\}$ and $\Delta\theta_k = 2^\circ$. The resource overhead and maximum frequency are shown in Table IV. Omitting the rotation operations brings a 54% reduction in LUTs, a 57% reduction in registers, a 55% reduction in block RAMs and a 60% reduction in DSPs. As for the timing constraints, since the critical path lies in the FFT module, both architectures reach the same maximum frequency of 217.39 MHz.

V. CONCLUSIONS

In this paper, a hardware-efficient channel estimator based on the ADMA scheme is proposed for the first time. The corresponding optimizations on quantization and approximation are presented as well. To achieve high efficiency and low complexity, the pipelining technique and systolic structure have been employed to tailor the architecture for regularity. Finally, FPGA implementations are given.
Suggestions on the choice of rotation are listed. Future work will be directed towards its application in our 5G Cloud Testbed.

Fig. 8. Merging network structure of $N$-element sorting.

Fig. 9. Feed-back pipelined hardware architecture of FFT module.

Fig. 10. Systolic structure of Grouping module. (a) Systolic structure of Grouping module. (b) The structure of Compare PE.

Fig. 11. The processing schedule for the system.

TABLE II
RESOURCE COST OF THE PROPOSED ESTIMATOR

| Modules | Complex Multipliers | Complex Adders | Real Comparators | Registers |
|---|---|---|---|---|
| LS-based Estimation | $L$ | $L$ | $0$ | $L - 1$ |
| FFT | $\log_2 M - 1$ | $2\log_2 M$ | $0$ | $M - 1$ |
| ABS | $1$ | $0$ | $0$ | $0$ |
| Max-selection | $0$ | $0$ | $1$ | $1$ |
| Sorting | $0$ | $0$ | $K \log_2 K$ | $K \log_2 K (\log_2 K + 1)/2$ |
| Grouping | $0$ | $1$ | $\tau$ | $2\tau$ |
| Extraction | $0$ | $0$ | $1$ | $0$ |
| IFFT | $\tau$ | $\tau$ | $0$ | $\tau - 1$ |

TABLE III
LATENCY AND PROCESSING TIME

| Modules | Latency ($T_s$) | Processing time ($T_s$) |
|---|---|---|
| LS-based Estimation | $L - 1$ | $L + M$ |
| FFT | $M - 1$ | $2M - 1$ |
| Max-Selection | $M$ | $M$ |
| Sorting | - | $\log_2 K (\log_2 K + 1)/2$ |
| Grouping | - | $K + \tau$ |
| Extraction | $P$ | $P$ |
| IFFT | $\tau$ | $M + \tau$ |

TABLE IV
FPGA IMPLEMENTATION RESULTS

| Structures | With Rotation | Without Rotation |
|---|---|---|
| LUTs | 52,416 | 24,130 |
| Registers | 90,191 | 38,464 |
| Block RAMs | 220 | 100 |
| DSPs | 1,092 | 432 |
| Frequency (MHz) | 217.39 | 217.39 |

REFERENCES

[1] X. Liu, H. Xie, J. Sha, F. Gao, S. Jin, X. You, and C. Zhang, “The VLSI architecture for channel estimation based on ADMA,” in Proc. IEEE International Conference on ASIC (ASICON), 2017, pp. 1073–1076.
[2] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, 2014.
[3] M. Shafi, A. F. Molisch, P. J. Smith, T. Haustein, P. Zhu, P. De Silva, F. Tufvesson, A. Benjebbour, and G. Wunder, “5G: A tutorial overview of standards, trials, challenges, deployment, and practice,” IEEE J. Sel. Areas Commun., vol. 35, no. 6, pp. 1201–1221, 2017.
[4] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges with very large arrays,” IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 40–60, 2013.
[5] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, “Five disruptive technology directions for 5G,” IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, 2014.
[6] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, “Massive MIMO for next generation wireless systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, 2014.
[7] H. Xie, F. Gao, and S. Jin, “An overview of low-rank channel estimation for massive MIMO systems,” IEEE Access, vol. 4, pp. 7313–7321, 2016.
[8] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, 2010.
[9] J. Jose, A. Ashikhmin, T. L. Marzetta, and S. Vishwanath, “Pilot contamination problem in multi-cell TDD systems,” in Information Theory, 2009. ISIT 2009. IEEE International Symposium on. IEEE, 2009, pp. 2184–2188.
[10] F. Fernandes, A. Ashikhmin, and T. L. Marzetta, “Inter-cell interference in noncooperative TDD large scale antenna systems,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 192–201, 2013.
[11] L. You, X. Gao, X.-G. Xia, N. Ma, and Y. Peng, “Pilot reuse for massive MIMO transmission over spatially correlated Rayleigh fading channels,” IEEE Trans. Wireless Commun., vol. 14, no. 6, pp. 3352–3366, 2015.
[12] A. G. Burr, “Capacity bounds and estimates for the finite scatterers MIMO wireless channel,” IEEE J. Sel. Areas Commun., vol. 21, no. 5, pp. 812–818, 2003.
[13] X. Gao, L. Dai, S. Han, I. Chih-Lin, and R. W. Heath, “Energy-efficient hybrid analog and digital precoding for mmWave MIMO systems with large antenna arrays,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 998–1009, 2016.
[14] A. Adhikary, J. Nam, J.-Y. Ahn, and G. Caire, “Joint spatial division and multiplexing - The large-scale array regime,” IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6441–6463, 2013.
[15] S. L. H. Nguyen and A. Ghrayeb, “Compressive sensing-based channel estimation for massive multiuser MIMO systems,” in Proc. IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2013, pp. 2890–2895.
[16] X. Rao and V. K. Lau, “Distributed compressive CSIT estimation and feedback for FDD multi-user massive MIMO systems,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3261–3271, 2014.
[17] C. Sun, X. Gao, S. Jin, M. Matthaiou, Z. Ding, and C. Xiao, “Beam division multiple access transmission for massive MIMO communications,” IEEE Trans. Commun., vol. 63, no. 6, pp. 2170–2184, 2015.
[18] J. Fang, X. Li, H. Li, and F. Gao, “Low-rank covariance-assisted downlink training and channel estimation for FDD massive MIMO systems,” IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1935–1947, 2017.
[19] H. Xie, F. Gao, S. Zhang, and S. Jin, “A unified transmission strategy for TDD/FDD massive MIMO systems with spatial basis expansion model,” IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3170–3184, 2017.
[20] R. Urquhart and D. Wood, “Systolic matrix and vector multiplication methods for signal processing,” in Proc. Inst. Elec. Eng., vol. 131, no. 6, 1984, pp. 623–631.
[21] M. Ayinala, M. Brown, and K. K. Parhi, “Pipelined parallel FFT architectures via folding transformation,” IEEE Trans. VLSI Syst., vol. 20, no. 6, pp. 1068–1081, 2012.
[22] C. Cheng and K. K. Parhi, “High-throughput VLSI architecture for FFT computation,” IEEE Trans. Circuits Syst. II, vol. 54, no. 10, pp. 863–867, 2007.
[23] Y.-N. Chang, “An efficient VLSI architecture for normal I/O order pipeline FFT design,” IEEE Trans. Circuits Syst. II, vol. 55, no. 12, pp. 1234–1238, 2008.
[24] K. E. Batcher, “Sorting networks and their applications,” in Proc. AFIPS Spring Joint Comput. Conf., vol. 32, 1968, pp. 307–314.
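
As a supplementary illustration of the recursive sorting network described for the Sorting module (Fig. 8), the following is a minimal software sketch, not the hardware design itself. It assumes ascending order and a power-of-two input length; the function names (`compare_exchange`, `bitonic_sorter`, `merger`, `sorting_network`) are hypothetical and chosen only for this example. Each merger applies a symmetric comparing stage followed by two half-size bitonic sorters, and each bitonic sorter applies a half cleaner followed by two half-size bitonic sorters. The sketch also counts comparator stages, which can be checked against the Sorting entry $\log_2 K (\log_2 K + 1)/2$ of Table III.

```python
def compare_exchange(values, i, j):
    """Comparator: leave the smaller value at index i and the larger at j."""
    if values[i] > values[j]:
        values[i], values[j] = values[j], values[i]


def bitonic_sorter(values, lo, n):
    """Sort the bitonic run values[lo:lo+n]; return the number of stages used."""
    if n < 2:
        return 0
    half = n // 2
    # Half cleaner-n: compare element i with element i + n/2.
    for i in range(half):
        compare_exchange(values, lo + i, lo + i + half)
    depth = bitonic_sorter(values, lo, half)
    bitonic_sorter(values, lo + half, half)  # runs in parallel in hardware
    return 1 + depth


def merger(values, lo, n):
    """Merge two sorted halves of values[lo:lo+n]; return the number of stages."""
    half = n // 2
    # Symmetric comparing stage: compare element i with element n - 1 - i.
    for i in range(half):
        compare_exchange(values, lo + i, lo + n - 1 - i)
    depth = bitonic_sorter(values, lo, half)
    bitonic_sorter(values, lo + half, half)
    return 1 + depth


def sorting_network(values):
    """Return (sorted copy, comparator-stage count) for a power-of-two length."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    out = list(values)
    total_depth = 0
    size = 2
    while size <= n:
        stage_depth = 0
        # Mergers of the same size work on disjoint blocks, so they share stages.
        for lo in range(0, n, size):
            stage_depth = merger(out, lo, size)
        total_depth += stage_depth
        size *= 2
    return out, total_depth


if __name__ == "__main__":
    # K = 16 matches the parameter setting used in the FPGA experiments.
    indices = [7, 3, 15, 0, 9, 12, 1, 6, 14, 2, 11, 5, 8, 13, 4, 10]
    ordered, stages = sorting_network(indices)
    print(ordered, stages)
```

For $K = 16$ the sketch uses $4 \cdot 5 / 2 = 10$ comparator stages, in line with the Sorting processing time in Table III. Any mapping of these stages onto pipeline registers is left to the hardware design and is not modeled here.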