Optimization of automatically generated multi-core code for the LTE RACH-PD algorithm
Embedded real-time applications in communication systems require high processing power. Manual scheduling developed for single-processor applications is not suited to multi-core architectures. The Algorithm Architecture Matching (AAM) methodology optimizes static application implementation on multi-core architectures. The Random Access Channel Preamble Detection (RACH-PD) is an algorithm for non-synchronized access of Long Term Evolution (LTE) wireless networks. LTE aims to improve the spectral efficiency of the next generation cellular system. This paper describes a complete methodology for implementing the RACH-PD. AAM prototyping is applied to the RACH-PD, which is modelled as a Synchronous DataFlow graph (SDF). An efficient implementation of the algorithm onto a multi-core DSP, the TI C6487, is then explained. Benchmarks for the solution are given.
💡 Research Summary
The paper addresses the challenge of implementing the LTE Random Access Channel Preamble Detection (RACH‑PD) algorithm on a multi‑core digital signal processor (DSP) while meeting strict real‑time constraints. RACH‑PD is responsible for detecting uplink preambles sent by user equipment during initial access, and in the worst‑case scenario (a 115 km cell) the base station must process a new preamble every millisecond, imposing a hard deadline of less than 4 ms for the whole detection chain.
The authors first decompose the algorithm into four logical stages: preprocessing (band‑pass filtering, down‑sampling, and DFT), circular correlation (correlating the received signal with 64 stored root sequences, performing IFFT, and accumulating power), noise‑floor estimation, and peak search. With four receive antennas, 64 root sequences, and two repetitions of the preamble, the algorithm expands to 1 357 atomic operations when expressed at the function level.
To manage this complexity, the algorithm is modeled as a Synchronous Data Flow (SDF) graph using the PREESM framework. In the SDF representation each vertex corresponds to a coarse‑grained operation (e.g., FIR filter, FFT) and each edge represents a data dependency. This formalism enables static scheduling, deterministic memory allocation, and easy manipulation of parallelism.
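A key property of the SDF formalism is that a graph can be checked for consistency before any scheduling is attempted: each edge carries fixed production and consumption rates, and a valid repetition count per vertex must balance every edge. The struct and names below are a hypothetical sketch of that check, not PREESM's internal representation.

```c
/* One SDF edge: the source vertex produces `prod` tokens per firing and
 * the destination consumes `cons` tokens per firing. An edge is balanced
 * when reps[src] * prod == reps[dst] * cons, which guarantees bounded
 * memory under a static schedule. */
typedef struct {
    int src, dst;   /* vertex indices (e.g., 0 = FIR filter, 1 = FFT) */
    int prod, cons; /* tokens produced / consumed per firing */
} SdfEdge;

/* Returns 1 if the given repetition vector balances every edge. */
static int sdf_is_consistent(const SdfEdge *edges, int n_edges,
                             const int *reps)
{
    for (int i = 0; i < n_edges; i++) {
        if (reps[edges[i].src] * edges[i].prod !=
            reps[edges[i].dst] * edges[i].cons)
            return 0;
    }
    return 1;
}
```

It is this balance condition that makes static scheduling and deterministic memory allocation possible: once the repetition vector is known, every buffer size is known at compile time.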
The core of the methodology is Algorithm‑Architecture Matching (AAM), which maps the SDF graph onto a target hardware graph. The hardware graph consists of DSP cores (operators) connected by Enhanced Direct Memory Access (EDMA) links (communication media). PREESM automatically explores different mappings, evaluates execution times based on measured cycle counts for each atomic operation, and accounts for EDMA transfer latency (modeled as a constant 3.08 GB/s throughput, i.e., 375 cycles per 4 800‑byte buffer).
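The constant-throughput communication model described above can be captured in a few lines: transfer time is the buffer size divided by a fixed throughput, optionally plus a setup overhead, converted to core cycles. The parameter values in the sketch are illustrative, not the exact calibration figures from the paper.

```c
#include <stdint.h>

/* Estimate the cost, in core cycles, of one EDMA transfer under a
 * constant-throughput model. `bytes_per_second` is the modeled EDMA
 * throughput, `core_hz` the DSP clock, and `setup_cycles` a fixed
 * per-transfer overhead (all illustrative parameters). */
static uint64_t edma_transfer_cycles(uint64_t bytes,
                                     double bytes_per_second,
                                     double core_hz,
                                     uint64_t setup_cycles)
{
    double seconds = (double)bytes / bytes_per_second;
    return setup_cycles + (uint64_t)(seconds * core_hz + 0.5); /* round */
}
```

During mapping exploration, this cost is added on every edge of the SDF graph whose producer and consumer land on different cores, which is how PREESM trades off parallelism gains against inter-core communication overhead.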
Four candidate architectures are examined: a single‑core TMS320TCI6482, a dual‑core version, a tri‑core TMS320TCI6487 (the actual target), and a hypothetical quad‑core for reference. Simulations are performed under two memory‑access scenarios: ideal (L1 cache enabled) and worst‑case (L1 cache disabled). Results show that the dual‑core configuration meets the deadline only when the cache is fully effective; with realistic cache misses it exceeds the 4 ms bound. The tri‑core configuration comfortably satisfies the deadline in both scenarios, using about 68 % of the available CPU cycles when realistic cache behavior is considered (88 % in the ideal case). The quad‑core offers negligible additional benefit, confirming that three cores are the optimal trade‑off between performance and silicon cost.
Having selected the tri‑core TMS320TCI6487, the authors proceed to generate the final implementation. The chip provides asymmetric L2 memory (1.5 MB, 1 MB, and 0.5 MB per core) and allows each core to access the others’ L2 via EDMA. PREESM produces C code that includes automatically inserted EDMA transfer descriptors and inter‑core synchronization primitives. The circular‑correlation stage, which dominates the workload, is partitioned so that each core processes a subset of the 64 root sequences in parallel; EDMA is used to stream the necessary data between cores while computation proceeds, effectively overlapping communication and processing.
Benchmarks on the actual hardware confirm that the complete RACH‑PD pipeline executes in 3.9 ms with L1 cache enabled, meeting the real‑time requirement with a per‑core utilization of roughly 70 %. The generated code requires no manual tuning, and any change in system parameters (e.g., number of antennas, root sequences, or repetitions) can be accommodated by updating the SDF model and re‑running PREESM, without rewriting low‑level code.
In conclusion, the study demonstrates that SDF‑based modeling combined with PREESM’s AAM methodology provides a systematic, automated path from algorithm specification to high‑performance multi‑core DSP implementation. The approach yields deterministic scheduling, efficient use of EDMA for inter‑core data movement, and scalability to more demanding future algorithms such as those found in 5G NR. The work thus offers a valuable blueprint for designers seeking to meet stringent latency constraints in modern wireless base‑station signal processing.