A Scalable VLSI Architecture for Soft-Input Soft-Output Depth-First Sphere Decoding

Multiple-input multiple-output (MIMO) wireless transmission imposes huge challenges on the design of efficient hardware architectures for iterative receivers. One key challenge is soft-input soft-output (SISO) MIMO demapping, often approached by sphere decoding (SD). In this paper, we introduce what is, to the best of our knowledge, the first VLSI architecture for SISO SD applying a single tree-search approach. Compared with a soft-output-only base architecture similar to the one proposed by Studer et al. in IEEE J-SAC 2008, the architectural modifications for soft input still allow a one-node-per-cycle execution. For a 4x4 16-QAM system, the area increases by 57% and the operating frequency degrades by only 34%.

💡 Research Summary

The paper addresses a critical bottleneck in iterative multiple‑input multiple‑output (MIMO) receivers: the need for a soft‑input soft‑output (SISO) demapper that can operate at very high data rates while consuming modest silicon area. Sphere decoding (SD) is a well‑known algorithm that can achieve near‑maximum‑likelihood (ML) performance, but most hardware implementations focus on soft‑output‑only operation. Incorporating soft input—i.e., feeding a priori log‑likelihood ratios (LLRs) from an outer channel decoder—dramatically increases algorithmic complexity because the search metric must be modified at every tree level and the decoder must keep track of additional LLR information for each candidate path.
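To make the per-level metric change concrete, one common way to write a soft-input sphere-decoding metric is sketched below. The notation is an assumption made for illustration (QR decomposition H = QR, rotated receive vector ỹ = Qᴴy, M_T transmit antennas, Q bits per symbol, bit labels in {+1, -1}); the paper's own notation and any simplifications it applies may differ.

```latex
% Assumed notation: x_{i,b} \in \{+1,-1\} is bit b of symbol s_i, and
% L^A_{i,b} = \log\frac{P[x_{i,b}=+1]}{P[x_{i,b}=-1]} is its a priori LLR.
\begin{align}
d_i &= d_{i+1} + |e_i|^2, \qquad d_{M_T+1} = 0, \qquad i = M_T, \dots, 1, \\
|e_i|^2 &= \frac{1}{N_0}\Bigl|\tilde{y}_i - \sum_{j=i}^{M_T} R_{ij}\, s_j\Bigr|^2
          + \sum_{b=1}^{Q} \log\!\bigl(1 + e^{-x_{i,b} L^A_{i,b}}\bigr).
\end{align}
```

The second term equals the negative log prior of the bits mapped to s_i, so it is non-negative. The partial distance d_i therefore never decreases along a path, which is what keeps tree pruning valid once a priori information is included.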

The authors propose the first VLSI architecture that implements a SISO depth‑first sphere decoder using a single‑tree‑search (STS) approach. In the STS method, the entire search space is represented by one common tree; candidate vectors are generated by traversing this tree depth‑first, pruning branches whose accumulated metric exceeds a dynamically updated radius. The key novelty is the integration of a priori LLRs directly into the branch metric: each partial Euclidean distance is biased by the corresponding a priori LLR, and the bias is updated on‑the‑fly as the search proceeds. This yields a metric that simultaneously reflects channel observations and prior information, enabling true SISO operation without the need for a separate post‑processing step.
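The single-tree-search idea can be illustrated with a small software model. The following is a minimal sketch under strong simplifying assumptions, not the paper's architecture: a real-valued system that is already triangularized (upper-triangular `R`), one BPSK bit per layer, no LLR clipping, and a recursive search instead of a one-node-per-cycle datapath. All function names and example values are hypothetical. Each metric increment adds the negative log prior of the chosen bit to the scaled squared residual, and the search keeps the best leaf plus one counter-hypothesis metric per bit, pruning a subtree when none of its leaves could improve any of them.

```python
import itertools
import math


def prior_cost(bit, llr):
    """Return -log P[bit] for bit in {+1, -1}, with llr = log(P[+1] / P[-1])."""
    z = -bit * llr
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))  # stable log(1 + e^z)


def increment(R, y, n0, llr_a, i, s):
    """Non-negative metric increment at level i; s[i:] must be decided."""
    residual = y[i] - sum(R[i][j] * s[j] for j in range(i, len(s)))
    return residual * residual / n0 + prior_cost(s[i], llr_a[i])


def sts_siso_bpsk(R, y, n0, llr_a):
    """Single tree search over BPSK layers; returns (x_map, extrinsic LLRs)."""
    n = len(y)
    lam = math.inf          # metric of the best leaf found so far
    x_map = [1] * n         # bits of the best leaf found so far
    cnt = [math.inf] * n    # best counter-hypothesis metric per bit
    s = [0] * n             # current (partial) symbol vector

    def leaf(d):
        nonlocal lam, x_map
        if d < lam:         # new best leaf: old best becomes a counter-hypothesis
            for b in range(n):
                if s[b] != x_map[b]:
                    cnt[b] = lam
            lam, x_map = d, list(s)
        else:               # may still improve some counter-hypotheses
            for b in range(n):
                if s[b] != x_map[b]:
                    cnt[b] = min(cnt[b], d)

    def threshold(i):
        """Largest metric that a leaf below a level-i node could still improve."""
        undecided = cnt[:i]  # levels below i: any bit may still differ
        flipped = [cnt[j] for j in range(i, n) if s[j] != x_map[j]]
        return max(undecided + flipped)

    def search(i, d_above):
        children = []
        for bit in (1, -1):
            s[i] = bit
            children.append((d_above + increment(R, y, n0, llr_a, i, s), bit))
        for d, bit in sorted(children):      # visit the cheaper child first
            s[i] = bit
            if i == 0:
                leaf(d)
            elif d < threshold(i):           # otherwise prune the subtree
                search(i - 1, d)

    search(n - 1, 0.0)
    llr_d = [x_map[b] * (cnt[b] - lam) for b in range(n)]
    return x_map, [ld - la for ld, la in zip(llr_d, llr_a)]


def brute_force(R, y, n0, llr_a):
    """Exhaustive max-log reference used only to cross-check the search."""
    n = len(y)

    def total(s):
        d = 0.0
        for i in reversed(range(n)):
            d += increment(R, y, n0, llr_a, i, s)
        return d

    cands = [list(c) for c in itertools.product((1, -1), repeat=n)]
    best = min(cands, key=total)
    llr_e = []
    for b in range(n):
        plus = min(total(c) for c in cands if c[b] == 1)
        minus = min(total(c) for c in cands if c[b] == -1)
        llr_e.append(minus - plus - llr_a[b])
    return best, llr_e


# Hypothetical 4-layer example; the values are arbitrary illustrations.
R = [[1.0, 0.3, -0.2, 0.1],
     [0.0, 0.9, 0.4, -0.3],
     [0.0, 0.0, 0.8, 0.2],
     [0.0, 0.0, 0.0, 0.7]]
y = [0.9, -0.4, 0.7, -0.6]
N0 = 0.5
LLR_A = [0.5, -1.0, 0.0, 2.0]
x_map, llr_e = sts_siso_bpsk(R, y, N0, LLR_A)
print(x_map, llr_e)
```

Because the pruning rule only discards subtrees that cannot change any stored metric, the tree search should return the same max-log extrinsic LLRs as the exhaustive reference while visiting fewer nodes. A hardware implementation would additionally bound the LLR magnitudes and use fixed-point arithmetic, neither of which is modeled here.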

From a hardware perspective, the design is built around a “node‑processing unit” (NPU) that performs three operations in a single clock cycle: (1) compute the Euclidean distance increment using a pre‑computed lookup table to avoid costly multiplications, (2) update the accumulated metric with the a priori bias, and (3) generate the extrinsic LLR contribution for the current leaf. The NPU is pipelined so that a new tree node enters on every clock cycle while the previous node’s results propagate through the pipeline stages. To store the a priori LLRs and intermediate extrinsic values, a dual‑ported SRAM block is placed per tree level, and a lightweight finite‑state machine (FSM) controls stack push/pop operations, radius updates, and early‑termination conditions.
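The summary does not say what the lookup table contains. One plausible reading, shown here purely as an assumption, is a table of products between channel-matrix entries and constellation levels that is filled once per channel realization, so that the per-node interference sum needs only lookups and additions. The function names and the real-valued 4-PAM model (the per-dimension levels of 16-QAM) are hypothetical.

```python
PAM4 = (-3, -1, 1, 3)  # per-dimension levels of 16-QAM, unnormalized


def build_product_table(R, levels=PAM4):
    """table[i][j][k] = R[i][j] * levels[k], computed once per channel."""
    return [[[r * c for c in levels] for r in row] for row in R]


def residual_with_table(table, y, i, idx):
    """Return y[i] minus the interference, using lookups and additions only.

    idx[j] is the index into `levels` of the symbol decided at level j >= i.
    """
    acc = y[i]
    for j in range(i, len(idx)):
        acc -= table[i][j][idx[j]]
    return acc


# Hypothetical 3-layer example; the values are arbitrary illustrations.
R_demo = [[1.0, 0.25, -0.5],
          [0.0, 0.75, 0.5],
          [0.0, 0.0, 1.25]]
y_demo = [2.0, -1.0, 3.5]
idx_demo = [2, 0, 3]  # symbols 1, -3, 3
table = build_product_table(R_demo)
print(residual_with_table(table, y_demo, 0, idx_demo))
```

The trade-off is memory for arithmetic: the table holds one entry per matrix element and constellation level, and it must be refreshed whenever the channel estimate changes.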

A major design goal is to retain the “one‑node‑per‑cycle” throughput that characterizes the soft‑output‑only baseline (Studer et al., IEEE J‑SAC 2008). By carefully balancing fixed‑point word lengths, using pre‑computed tables for squared terms, and merging the a priori bias addition into the distance computation, the authors avoid any extra pipeline stages that would otherwise increase latency. Consequently, the architecture still delivers a full traversal of the search tree at the rate of one node per clock, which translates into a raw throughput of roughly 1.2 Gb/s for a 4 × 4 MIMO system with 16‑QAM.

Silicon implementation results are presented for a 65 nm CMOS process. Compared with the soft‑output‑only reference, the SISO version incurs a 57 % increase in occupied area (mainly due to the additional LLR SRAM and control logic) and a 34 % reduction in maximum clock frequency (from 250 MHz down to 165 MHz). Despite these penalties, the overall energy‑per‑bit remains competitive because the architecture still processes one node per cycle and the additional memory accesses are highly regular. The authors also discuss scalability: the NPU and memory modules are replicated to support larger antenna configurations (e.g., 8 × 8) and higher‑order constellations (e.g., 64‑QAM). The modular nature of the design means that throughput scales linearly with the number of parallel NPUs, while the control FSM remains unchanged.
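The two overhead figures can be combined into a rough per-node cost. This is back-of-the-envelope arithmetic using only the numbers quoted above; the area-delay product is an illustrative metric chosen here, not one reported in the summary, and the symbols A (area), f (clock frequency), and T = 1/f (clock period) are assumed notation.

```latex
\begin{align}
\frac{f_{\mathrm{SISO}}}{f_{\mathrm{SO}}}
  &= \frac{165\ \mathrm{MHz}}{250\ \mathrm{MHz}} = 0.66
  \quad\Rightarrow\quad \text{a } 34\,\% \text{ frequency reduction}, \\
\frac{A_{\mathrm{SISO}}\, T_{\mathrm{SISO}}}{A_{\mathrm{SO}}\, T_{\mathrm{SO}}}
  &= \frac{1.57}{0.66} \approx 2.38 .
\end{align}
```

At one node per cycle in both designs, each visited tree node therefore costs roughly 2.4 times as much area-time in the SISO version as in the soft-output-only reference.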

The paper concludes that the proposed VLSI SISO sphere decoder bridges the gap between algorithmic optimality and hardware practicality for next‑generation wireless standards (5G, beyond‑5G, and early 6G). It demonstrates that soft‑input handling can be incorporated into a depth‑first SD engine without sacrificing the one‑node‑per‑cycle execution model, and it argues that the resulting area and frequency overheads are acceptable for high‑performance base‑station or high‑end mobile chipsets. The authors suggest future work on power gating, voltage scaling, and extending the architecture to multi‑user MIMO scenarios in which multiple independent sphere decoders must operate concurrently.