Comprehensive Efficient Implementations of ECC on C54xx Family of Low-cost Digital Signal Processors

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Resource constraints in smart devices demand an efficient cryptosystem with low power and memory consumption. This has led to the popularity of the comparatively efficient elliptic curve cryptography (ECC). Prior to this paper, much of ECC was implemented on reconfigurable hardware, i.e. FPGAs, which are costly and unfavorable as low-cost solutions. We present comprehensive yet efficient implementations of ECC on the fixed-point TMS54xx series of digital signal processors (DSPs). 160-bit prime field GF(p) ECC is implemented over a wide range of coordinate choices. This paper also implements a windowed recoding technique to provide better execution times. Stalls in the program are minimized by loop unrolling and by avoiding data dependence. Complete scalar multiplication is achieved within 50 msec in the coordinate implementations, and this is further reduced to 25 msec with the windowed-recoding method. These are the best known results for a fixed-point low-power digital signal processor to date.


💡 Research Summary

This paper presents a complete and highly efficient implementation of 160‑bit prime‑field elliptic‑curve cryptography (ECC) on the fixed‑point TMS54xx family of low‑cost digital signal processors (DSPs). Recognizing that many modern smart devices and IoT nodes operate under severe power, memory, and cost constraints, the authors argue that traditional ECC deployments on re‑configurable hardware such as FPGAs are often unsuitable for mass‑market, battery‑powered applications. Consequently, they target the TMS54xx series, a widely available DSP platform that offers 16‑bit fixed‑point arithmetic, a modest register set, and limited on‑chip SRAM.

The implementation begins with a careful analysis of the DSP’s architectural bottlenecks. To support 160‑bit operations, the authors construct a Montgomery multiplication pipeline that processes five 32‑bit words per operand, thereby avoiding costly multi‑precision carry propagation across memory. They then explore four coordinate representations—Affine, Jacobian, Projective, and mixed forms—evaluating each for the number of field multiplications, additions, and inversions required per scalar multiplication. Jacobian coordinates emerge as the most favorable because they eliminate the expensive inversion step, reducing both cycle count and SRAM usage.
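The word-level Montgomery multiplication described above can be sketched as follows. This is a minimal illustration in Python rather than DSP assembly, using the coarsely integrated operand scanning (CIOS) variant with five 32-bit words per operand. The choice of CIOS, the prime `P` (the secp160r1 field prime, used here only as an example of a 160-bit odd modulus), and all function names are assumptions; the summary does not say which variant or curve the authors use.

```python
W = 32                      # word size in bits (the summary mentions 32-bit words)
N = 5                       # 5 words x 32 bits = 160 bits
MASK = (1 << W) - 1

def to_words(x):
    """Split an integer into N little-endian W-bit words."""
    return [(x >> (W * i)) & MASK for i in range(N)]

def from_words(ws):
    """Reassemble little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(ws))

def mont_mul(a, b, p, p_inv_neg):
    """Return a*b*R^-1 mod p as words, with R = 2^(W*N) (CIOS variant)."""
    t = [0] * (N + 2)
    for i in range(N):
        # Multiplication step: t += a * b[i]
        carry = 0
        for j in range(N):
            s = t[j] + a[j] * b[i] + carry
            t[j] = s & MASK
            carry = s >> W
        s = t[N] + carry
        t[N] = s & MASK
        t[N + 1] = s >> W
        # Reduction step: pick m so that t + m*p is divisible by 2^W,
        # then add m*p and shift t right by one word.
        m = (t[0] * p_inv_neg) & MASK
        carry = (t[0] + m * p[0]) >> W
        for j in range(1, N):
            s = t[j] + m * p[j] + carry
            t[j - 1] = s & MASK
            carry = s >> W
        s = t[N] + carry
        t[N - 1] = s & MASK
        t[N] = t[N + 1] + (s >> W)
    # Final conditional subtraction (done on a Python int for brevity).
    result = from_words(t[:N + 1])
    modulus = from_words(p)
    if result >= modulus:
        result -= modulus
    return to_words(result)

# Example modulus (assumption): the secp160r1 field prime.
P = 2**160 - 2**31 - 1
P_WORDS = to_words(P)
P_INV_NEG = (-pow(P_WORDS[0], -1, 1 << W)) & MASK   # -p^-1 mod 2^W (Python 3.8+)
R = 1 << (W * N)

def field_mul(x, y):
    """Multiply x*y mod P by converting to and from Montgomery form."""
    xm = to_words(x * R % P)
    ym = to_words(y * R % P)
    zm = mont_mul(xm, ym, P_WORDS, P_INV_NEG)        # equals x*y*R mod P
    return from_words(mont_mul(zm, to_words(1), P_WORDS, P_INV_NEG))
```

In a real scalar multiplication the operands would stay in Montgomery form throughout, so the conversions in `field_mul` would be paid only once at the start and end.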

To accelerate scalar multiplication, the paper introduces windowed recoding techniques, specifically fixed‑window and w‑NAF (non‑adjacent form) with window sizes w = 4, 5, 6. Experimental results show that w = 5 offers the best trade‑off: it cuts the number of required point doublings and additions by roughly 30 % while keeping the pre‑computation table small enough to fit within the DSP’s limited memory.
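The recoding step can be sketched as below. This is a minimal width-w NAF recoder plus a generic left-to-right scalar multiplication that consumes its digits. The function names and the callback-based interface are hypothetical, chosen so the sketch runs without a curve implementation; it is not taken from the paper.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first.

    Every nonzero digit is odd with |d| < 2^(w-1), and any w consecutive
    digits contain at most one nonzero digit.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)
            if d >= (1 << (w - 1)):
                d -= (1 << w)        # use the signed residue
            k -= d                   # now k is divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def scalar_mult_wnaf(k, point, w, add, double, negate, identity):
    """Compute k*point with w-NAF digits and a table of odd multiples."""
    # Precompute 1P, 3P, 5P, ..., (2^(w-1) - 1)P.
    table = {1: point}
    two_p = double(point)
    for i in range(3, 1 << (w - 1), 2):
        table[i] = add(table[i - 2], two_p)
    acc = identity
    for d in reversed(wnaf(k, w)):
        acc = double(acc)
        if d > 0:
            acc = add(acc, table[d])
        elif d < 0:
            acc = add(acc, negate(table[-d]))
    return acc

# Stand-in group for demonstration: integers under addition, so k*P is
# ordinary multiplication. A real use would pass curve point operations.
int_group = dict(add=lambda a, b: a + b, double=lambda a: 2 * a,
                 negate=lambda a: -a, identity=0)

print(wnaf(7, 2))                                   # [-1, 0, 0, 1], i.e. 8 - 1
print(scalar_mult_wnaf(1000003, 7, 5, **int_group) == 7 * 1000003)
```

For w = 5 the table holds the eight odd multiples P, 3P, ..., 15P, which shows why the window size trades precomputation memory against the number of point additions.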

Loop‑level optimizations are a central contribution. The authors apply aggressive loop unrolling to fully exploit the DSP’s pipeline stages, and they rearrange operations to minimize data dependencies that would otherwise stall the pipeline. Temporary registers are introduced to hold intermediate results, and careful register allocation prevents read‑after‑write hazards. These techniques together reduce the effective latency of each field multiplication and addition, leading to a substantial overall speed‑up.
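One common way to arrange a multi-precision multiply so that consecutive multiply-accumulate operations do not depend on each other is product scanning (also called Comba multiplication). The summary does not say whether the authors use this ordering, so the sketch below is an assumption chosen to illustrate the dependence-avoidance idea. It models ten 16-bit words and a 40-bit accumulator in Python, and the accumulator width is likewise an assumption of the model.

```python
W = 16                      # word size in bits (assumed native word width)
N = 10                      # 10 words x 16 bits = 160 bits
MASK = (1 << W) - 1

def to_words(x, n=N):
    """Split an integer into n little-endian W-bit words."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_words(ws):
    """Reassemble little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(ws))

def mul_product_scanning(a, b):
    """Multiply two N-word operands column by column; returns 2N words."""
    out = [0] * (2 * N)
    acc = 0                                   # models a wide accumulator
    for k in range(2 * N - 1):                # one output column per pass
        lo = max(0, k - N + 1)
        hi = min(k, N - 1)
        # All products in a column are independent of one another: each
        # reads only input words, so on hardware this inner loop could be
        # fully unrolled into back-to-back MACs with no stall between them.
        for i in range(lo, hi + 1):
            acc += a[i] * b[k - i]
        assert acc < (1 << 40)                # fits the assumed 40-bit accumulator
        out[k] = acc & MASK                   # one store and one shift per column
        acc >>= W
    out[2 * N - 1] = acc & MASK
    return out

x = 2**160 - 2**31 - 1
y = 0xFEDCBA9876543210FEDCBA9876543210FEDCBA98
print(from_words(mul_product_scanning(to_words(x), to_words(y))) == x * y)
```

By contrast, an operand-scanning loop updates the same result words on every pass, so each step must wait for the carry written by the previous one.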

Performance measurements demonstrate that a naïve implementation (no unrolling, no recoding) requires about 78 ms to complete a full scalar multiplication. After applying coordinate‑level optimizations and loop unrolling, the time drops to roughly 50 ms (≈20 k cycles). Incorporating the w = 5 windowed recoding further halves the execution time to under 25 ms (≈10 k cycles). This represents a 35‑45 % improvement over previously reported results on comparable fixed‑point DSPs, and it approaches the performance of low‑cost FPGA solutions while consuming less than 0.8 W of power and using under 2 KB of SRAM.

The paper concludes by outlining future work, including extensions to larger field sizes (256‑bit and 384‑bit curves), integration with hardware‑accelerated modular arithmetic units, and full protocol stack implementation (e.g., ECDH key exchange and ECDSA signatures). Overall, the study delivers a practical, reproducible methodology for deploying high‑security ECC on inexpensive, low‑power DSP platforms, thereby broadening the range of devices that can benefit from modern public‑key cryptography.

