CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications


Authors: Sonu Kumar, Mohd Faisal Khan, Mukul Lokhande, Member, IEEE, and Santosh Kumar Vishvakarma, Senior Member, IEEE

Abstract—This brief presents a runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The proposed design enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads. Its resource-efficient approach further enables up to 4× throughput improvement within the same hardware resources by leveraging vectorised, time-multiplexed execution and flexible precision scaling. With a time-multiplexed multi-AF block and a lightweight pooling and normalisation unit, the proposed vector engine supports flexible precision (4/8/16-bit) and high MAC density. The ASIC implementation results show that each MAC stage can save up to 33% of time and 21% of power, with a 256-PE configuration that achieves higher compute density (4.83 TOPS/mm²) and energy efficiency (11.67 TOPS/W) than previous state-of-the-art work. A detailed hardware-software co-design methodology for object detection and classification tasks on Pynq-Z2 is discussed to assess the proposed architecture, demonstrating a scalable, energy-efficient solution for edge AI applications.

Index Terms—CORDIC, multiply-accumulate (MAC), non-linear activation functions, deep learning accelerators, Internet of Things, reconfigurable computing.

I. INTRODUCTION

Deep learning has become a foundational component of modern artificial intelligence (AI) systems at the Internet of Things (IoT) edge, enabling breakthroughs across computer vision, speech recognition, natural language processing, and autonomous systems.
Contemporary workloads are dominated by Deep Neural Networks (DNNs), Vision Transformers (ViTs), and emerging large-scale models, whose inference and training pipelines are primarily composed of convolutional layers, fully connected (FC) or multi-layer perceptron (MLP) blocks, and attention mechanisms [1]–[6]. Despite architectural diversity, workload characterisation studies consistently show that multiply-accumulate (MAC) operations account for approximately 90% of total computation, while non-linear activation functions (NAFs) contribute an additional 2-5% [3], [5]. Efficient execution of these operations is therefore critical for deploying AI models on resource-constrained edge platforms [7], [8].

To address the stringent energy, area, and latency constraints of edge AI, prior work has explored several hardware optimisation techniques, including fixed-point quantization [3], [9], CORDIC-based arithmetic [5], logarithmic approximation [4], [10], and truncation-based MAC units [2], [11]. While these approaches achieve notable reductions in computational complexity and power consumption, they typically operate at a fixed approximation point. As a result, either irreversible accuracy degradation is incurred, or additional error-compensation mechanisms are required, which partially negate the energy benefits [2], [3]. Furthermore, these designs often lack the flexibility to dynamically adjust approximation depth based on layer sensitivity or application requirements [12], [13].

Sonu Kumar, Mohd Faisal Khan and Santosh Kumar Vishvakarma acknowledge the DST INSPIRE fellowship and MeitY/SMDP-C2S for ASIC tool support. Sonu Kumar is with the Centre for Advanced Electronics, IIT Indore. Mohd Faisal Khan, Mukul Lokhande, and Santosh Kumar Vishvakarma are with the NSDCS Research Group, Dept. of Electrical Engineering at IIT Indore. Corresponding author: Santosh K. Vishvakarma (skvishvakarma@iiti.ac.in).
Recent edge-oriented deep learning accelerators focus on energy efficiency, reduced memory traffic, and architectural flexibility by leveraging efficient dataflows, reconfigurable systolic arrays, and pipeline-aware designs. Complementary approaches exploit computing-in-memory, quantised and binary neural networks, and tightly integrated SoC platforms to alleviate bandwidth and storage bottlenecks [14]–[16]. In parallel, recent studies have highlighted a structural inefficiency in many deep learning accelerators: the disproportionate allocation of hardware resources to activation-function units. Although AFs constitute a small fraction of total operations, they are frequently implemented using dedicated hardware blocks that remain idle for a significant portion of execution. Prior work reports up to 84% idle cycles in NAF hardware for layer-reused architectures [11], while large-scale commercial accelerators such as the Google TPUv4 allocate nearly 20-25% of chip area to activation-related logic [3]. This imbalance results in substantial dark silicon, limiting overall energy efficiency and scalability.

Table I provides a comparative overview of state-of-the-art (SoTA) AI accelerator designs, highlighting key architectural choices, supported precision, scalability, and associated trade-offs. Existing CORDIC- and approximation-based designs predominantly employ pipelined or fixed-stage implementations, which constrain runtime flexibility and enforce static accuracy-latency operating points [17]–[21]. In contrast, recent layer-reused or time-multiplexed architectures improve utilization but still lack fine-grained control over numerical accuracy at the MAC level. Consequently, prior designs offer little flexibility with respect to differing layer characteristics and operating conditions, and often require predefined datapaths and static precision settings [22], [23].
This rigidity leads to suboptimal trade-offs among accuracy, latency, and resource utilisation when deployed across heterogeneous deep learning workloads [24], particularly for models that demand mixed-precision computation and flexible activation support.

This paper addresses these limitations by proposing a runtime-adaptive, CORDIC-accelerated vector engine that explicitly exposes MAC precision, approximation depth, and execution latency as configurable architectural parameters. Unlike prior fixed-approximation designs, the proposed approach enables seamless switching between approximate and accurate execution modes without structural modification or auxiliary correction logic. In addition, a time-multiplexed multi-activation-function (multi-NAF) block is integrated to maximise hardware utilisation and significantly reduce dark silicon.

TABLE I
SOTA DESIGN APPROACHES AND COMPARISON OF RESPECTIVE DESIGN FEATURES IN AI WORKLOADS

Design | Compute | Arch. Type | Scalability | Precision | Accuracy loss | Design Overhead | NAF-Supported | Applications
Baseline | Pipe-CORDIC | Fully Parallel | No | FxP-8 | High | Area, St. Power | ReLU | ANN
ICIIS'25 [11] | Pipe-CORDIC | Layer-Reused | Yes | FxP-8 | High | Area | ReLU | ANN
ICIIS'25 [11] | PWL | NAF-Reused | No | FxP-8 | High | Area, St. Power | Sigmoid/Tanh | ANN
IEEE Access'24 [2] | Pipe-CORDIC | NAF-Reused | No | FxP-8 | High | Area, Power | NA | DNN
TVLSI'25 [3] | Pipe-CORDIC | Systolic Array | - | FxP-4/8/16/32 | Medium | Energy | Sigmoid, Softmax, Tanh, ReLU | DNN, Transformers
ISCAS'25 [4] | Logarithmic Approx. | Time-multiplexed Reconfigurable Array | Yes | Posit-8/16/32 | Low | Area, Complexity | Sigmoid, Tanh, Softmax | DNN
ISVLSI'25 [5] | Iterative CORDIC | Layer-Reused | No | FxP-8 | Medium | Latency | Sigmoid/Tanh | DNN
Proposed | Iterative CORDIC | Vector Engine | Yes | FxP-4/8/16 | Variable (Low) | Application-optimized, High-Throughput | SoftMax, GELU, Sigmoid, Tanh, Swish, ReLU, and SELU | DNN, Transformers (MLP)
The key contributions of this work are summarised as follows:

1) A low-resource, iterative CORDIC-based MAC unit with runtime-configurable accuracy-latency trade-offs, supporting both approximate and accurate execution modes.
2) A scalable vector-engine architecture that amortises iterative MAC latency across parallel lanes, enabling up to 4× higher throughput without excessive area overhead.
3) A high-utilisation, time-multiplexed multi-NAF block supporting a wide range of nonlinear functions with minimal additional hardware cost.
4) A comprehensive evaluation spanning software emulation, FPGA prototyping, and 28 nm ASIC synthesis, demonstrating system-level improvements on CNN and transformer-style workloads.

The remainder of this paper is organised as follows. Section II presents the proposed vector-engine architecture. Section III details the circuit-level implementation of the iterative MAC and multi-NAF blocks. Section IV describes the experimental methodology and evaluation framework. Section V discusses FPGA, ASIC, and system-level results. Section VI concludes the paper and outlines future research directions.

II. ARCHITECTURE OVERVIEW

Fig. 1 illustrates the top-level architecture of the proposed resource-efficient deep learning accelerator. The system is organised around a runtime-adaptive vector engine that serves as the primary compute core, supported by a lightweight control engine, a data prefetcher, a time-multiplexed multi-activation-function (multi-AF) block, pooling and normalisation units, and off-chip memory interfaces. The architecture is designed to maximise compute utilisation while enabling flexible trade-offs between accuracy and latency across diverse deep learning workloads.
Fig. 1. Block-level architecture of the proposed CORDIC-based vector engine integrated within a resource-efficient deep learning accelerator.

A. Vector Engine Organization

The vector engine is composed of N homogeneous processing elements (PEs), where N is scalable from 64 to 256 depending on performance and area constraints. Each PE integrates a precision-adjustable, accuracy-configurable iterative CORDIC-based MAC unit, local register storage, and interface logic for data and control synchronisation. Unlike fully parallel systolic arrays, the proposed vector engine adopts a lane-based execution model, amortising the latency of iterative computations across multiple PEs to enable high throughput without requiring deeply pipelined or resource-intensive datapaths.

Two dedicated kernel memory banks, each organised as (n-bit × 32) entries, are employed to store input activations and weights, respectively. This dual-bank organisation enables continuous data feeding to the PEs while overlapping memory access with computation. The memory interface is designed to support flexible precision modes (4/8/16-bit) and runtime reconfiguration without stalling the compute pipeline.

B. Runtime Accuracy and Precision Adaptation

A key architectural feature of the proposed vector engine is its ability to dynamically adapt computation accuracy and latency at runtime.
This is achieved by controlling the number of active CORDIC iterations within each MAC unit on a per-layer basis. Layers that are less sensitive to numerical error can be executed in an approximate mode with fewer iterations, thereby reducing latency and energy consumption. Conversely, accuracy-critical layers operate in a more accurate mode with additional iterations, incurring modest latency overhead while preserving numerical fidelity.

Fig. 2. Control engine for the efficient reuse of data and control signals in a layer-multiplexed architecture, reusing the same DNN hardware across layers.

This runtime configurability is managed by the control engine through a set of configuration registers that specify the precision mode, iteration count, and execution sequencing for each layer. Unlike prior approximation-based accelerators that rely on fixed hardware stages or static design-time tuning [3], [4], the proposed architecture enables fine-grained, layer-wise adaptation without requiring structural modification or auxiliary correction logic.
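A minimal software sketch of this per-layer mode selection is given below. The LayerConfig container and the register layout are assumptions for illustration only; the cycle counts are the operating points reported for the proposed MAC in Section III.

```python
from dataclasses import dataclass

# Cycle counts per (mode, precision) follow the operating points reported
# for the proposed MAC in Section III; names here are hypothetical.
CYCLES = {
    ("approx", 8): 4, ("approx", 16): 7,
    ("accurate", 4): 4, ("accurate", 8): 5, ("accurate", 16): 9,
}

@dataclass
class LayerConfig:
    """Emulates one layer's configuration registers (precision, mode)."""
    precision: int  # 4, 8, or 16 (fixed-point width)
    mode: str       # "approx" or "accurate"

    @property
    def mac_cycles(self) -> int:
        return CYCLES[(self.mode, self.precision)]

# Error-sensitive layers run accurate; tolerant middle layers run approximate.
schedule = [LayerConfig(8, "accurate"),
            LayerConfig(8, "approx"),
            LayerConfig(16, "accurate")]
total = sum(cfg.mac_cycles for cfg in schedule)
print(total)  # 5 + 4 + 9 = 18 MAC cycles across the three layers
```

Swapping a layer between modes changes only its register contents, not the datapath, which is the point of the runtime-configurable design.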
C. Control Engine and Data Flow

The control engine orchestrates vector-engine execution by managing instruction sequencing, memory addressing, and synchronisation across PEs. It comprises configuration registers for precision and iteration control, status registers for pipeline coordination, and a finite-state machine with datapath (FSMD) that governs layer execution. This lightweight control logic enables efficient coordination between compute, memory access, and activation processing while minimising control overhead.

A layer-multiplexed DNN requires a dedicated control unit composed of five functional sub-blocks, as illustrated in Fig. 2. Each sub-block either monitors system status or produces control signals derived from that status. The control unit processes all status signals: LayerDone, DNNDone, CurrentLayer, ComputeInit, Index, and ComputeDone. The neuron processing elements produce two of these status signals, Index and ComputeDone, while the control module internally generates the remainder. The Index signal indicates which input to send next to the MAC unit by counting the number of MAC operations completed in the active layer. The ComputeDone signal indicates that neuron computation for the current layer has completed and that valid output data is available. When aggregated across all neuron units, this signal is referred to as ComputeDoneArray. Together, these status signals manage the datapath control necessary for the layer-multiplexed architecture to function properly.

The control module dynamically configures neuron activation and signal routing for layer-reused DNN computation. LayerDone and CurrentLayer track progress, ComputeInit selectively activates neurons per layer, and index-controlled input and output routes multiplex intermediate data. This reduces dynamic power by enabling idle-unit deactivation and ensuring correct sequencing.
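The sequencing above can be sketched as a behavioural model. The signal names mirror the text (LayerDone, CurrentLayer, Index, ComputeDone, DNNDone); the neuron datapath itself is abstracted away, so this is an illustration of the control flow, not the RTL.

```python
def run_layer_multiplexed(num_layers, macs_per_layer):
    """Behavioural model of the control module sequencing a layer-reused DNN."""
    current_layer = 0          # CurrentLayer register
    dnn_done = False
    trace = []
    while not dnn_done:
        compute_done = False   # ComputeInit would activate this layer's neurons
        index = 0
        while not compute_done:
            index += 1         # Index counts completed MACs in the active layer
            compute_done = index == macs_per_layer[current_layer]
        trace.append(("LayerDone", current_layer))
        current_layer += 1
        dnn_done = current_layer == num_layers   # DNNDone asserted at the end
    return trace

print(run_layer_multiplexed(3, [4, 2, 1]))
# [('LayerDone', 0), ('LayerDone', 1), ('LayerDone', 2)]
```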
The weight memory is partitioned into 64 segments, each associated with a specific neuron processing unit, as depicted in Fig. 3(a). A key aspect of the parameter-loading mechanism is that the memory write sequence is the inverse of the read sequence. This organisation enables efficient access to weight and bias values with reduced interconnect delay. Consequently, parameters must be loaded using a Last-In-First-Out (LIFO) ordering for both weights and biases, as well as for input data.

Data transfer follows a synchronous interface using a valid signal, denoted as load_param_weight, to indicate when weight values are written. The accelerator asserts a data-ready signal, DNNDone, once valid outputs are produced. Input values are accepted on each clock cycle when the valid signal is active. Upon completion, outputs from all ten neurons are generated simultaneously when DNNDone is asserted and are subsequently captured by the host software. The overall software-controlled execution sequence of the accelerator is illustrated in Fig. 3(b).

Data flow within the accelerator follows a streaming execution model. Input feature maps are fetched from off-chip memory through the data prefetcher and buffered locally before being broadcast to the vector engine. Partial sums generated by the PEs are forwarded either to subsequent MAC stages or to the activation-function pipeline, depending on the layer configuration. This streaming approach minimises intermediate storage requirements and reduces memory bandwidth pressure, both of which are critical for energy-efficient edge deployment.

D. Memory Mapping

The hardware architecture of fully connected neural networks must support scalable, adaptable configurations, as the number of layers and neurons per layer varies across applications.
As a result, the memory organisation for weights and biases must be adaptable and avoid allocating unused address locations, a limitation commonly encountered in fixed-addressing schemes. This is accomplished by using BRAM for on-chip parameter memory and FIFO buffers for temporary storage.

Fig. 3. DNN accelerator data flow: (a) data read/write order and (b) operation flowchart for initialising and loading data.

An efficient addressing strategy for neuron-wise weight and bias access is illustrated in Fig. 4 and is defined using the total number of layers L, the number of neurons in the l-th layer N(l), and the number of inputs to that layer J(l). Since the number of neurons in one layer determines the number of inputs to the subsequent layer, these parameters satisfy

J(l + 1) = N(l)    (1)

Each parameter address consists of a layer identifier, a select bit indicating whether the accessed parameter is a weight or a bias, and a memory address field, as shown in Fig. 4(a). The select bit distinguishes between weight and bias access, while the most significant bits encode the layer index. The remaining bits represent either the neuron index (for a bias) or the combined neuron and input index (for a weight), as depicted in Fig. 4(b).
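The field packing just described can be emulated in a few lines. The helper names below are hypothetical, and the example network is illustrative; the widths follow the ceiling-log2 sizing of eqs. (2)-(5).

```python
from math import ceil, log2

def clog2(x):
    """Ceiling log2 with a 1-bit floor, as used for index field widths."""
    return max(1, ceil(log2(x)))

def addr_widths(N, J):
    """N[l] neurons and J[l] inputs per layer -> (layer bits, R_addr bits)."""
    r_addr = max(clog2(n) + clog2(j) for n, j in zip(N, J))  # eq. (4)
    return clog2(len(N)), r_addr

def pack(layer, is_bias, neuron, inp, N, J):
    """Pack {layer | select | weight/bias RAM address} per the Fig. 4 layout."""
    _, r_addr = addr_widths(N, J)
    j_bits = clog2(J[layer])
    ram = neuron if is_bias else (neuron << j_bits) | inp  # bias: neuron only
    return (layer << (1 + r_addr)) | (int(is_bias) << r_addr) | ram

# A 3-layer MLP with 8 inputs -> 4 -> 2 -> 2 neurons; J(l+1) = N(l) by eq. (1)
N, J = [4, 2, 2], [8, 4, 2]
print(addr_widths(N, J))           # (2, 5): 2 layer bits, 5 RAM-address bits
print(pack(1, False, 1, 3, N, J))  # weight address: layer 1, neuron 1, input 3
```

Because every layer shares the widest R_addr, addresses stay uniform across layers, which is the conflict-free property claimed for the scheme.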
The required address length for weights and biases in layer l is therefore given by

R_addr(l) = ⌈log2 N(l)⌉ + ⌈log2 J(l)⌉    (2)

and the overall address width of that layer becomes

Addr(l) = ⌈log2 L⌉ + 1 + R_addr(l)    (3)

A fixed address width is chosen based on the maximum required across all layers, specified as

R_addr = max_{l=1,2,...,L} {⌈log2 N(l)⌉ + ⌈log2 J(l)⌉}    (4)

and the length of the final uniform address is

Addr = ⌈log2 L⌉ + 1 + R_addr    (5)

In addition to providing effective, conflict-free access to weight and bias memories for scaled DNN implementations, this technique enables consistent addressing.

Fig. 4. Memory mapping scheme for the address bits required to address weights and biases of individual neurons: (a) overall format with layer bits, select bit, and weight/bias RAM address; (b) weight and bias RAM address sub-fields.

E. Time-Multiplexed Multi-Activation-Function Integration

To address the underutilization of activation-function hardware observed in prior accelerators [3], [11], the proposed architecture integrates a time-multiplexed multi-AF block shared across all PEs. The multi-AF block supports a broad set of nonlinear functions, including Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU, using common CORDIC resources and mode-specific datapaths.

By multiplexing activation computation in time rather than dedicating separate hardware blocks, the architecture achieves high utilisation factors while incurring minimal area and power overhead.
Activation execution is overlapped with vector-engine computation wherever possible, ensuring that the multi-AF block does not become a performance bottleneck despite being shared.

F. Scalability and System Integration

The proposed vector engine is designed for seamless scalability across edge and embedded platforms. By adjusting the number of PEs, memory bank sizes, and iteration depth, the architecture can be tailored to a wide range of performance and energy targets. Furthermore, the modular organisation of the vector engine, control logic, and peripheral units facilitates automated generation through a synthesizable hardware framework, enabling rapid design-space exploration and deployment.

Overall, the architecture combines runtime adaptability, high hardware utilisation, and scalable performance, forming a unified compute substrate that bridges the gap between fixed-approximation accelerators and fully accurate but resource-intensive designs.

III. CIRCUIT IMPLEMENTATION

This section details the circuit-level design of the proposed iterative CORDIC-based MAC unit and the time-multiplexed multi-activation-function (multi-AF) block. The design objective is to achieve a balance between hardware efficiency, numerical accuracy, and runtime configurability while maintaining compatibility with standard deep learning workloads.

Fig. 5. Iterative low-latency CORDIC-based MAC architecture with runtime-configurable iteration depth.

A. Runtime-Adaptive Iterative CORDIC-Based MAC

The proposed MAC unit is based on the unified CORDIC formulation originally introduced by Walther, which supports circular, linear, and hyperbolic computations using only shift, add/subtract, and multiplexing operations.
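For context, linear-mode CORDIC realises multiplication with shift-add iterations alone. The floating-point emulation below is an illustrative sketch (not the fixed-point RTL datapath) showing how the iteration count sets the approximation error:

```python
def cordic_linear_mul(x, z, iterations):
    """Linear-mode CORDIC: approximate x*z (|z| < 1) with shift-add steps."""
    y = 0.0
    for i in range(1, iterations + 1):
        d = 1 if z >= 0 else -1     # sign decision replaces a multiplier
        y += d * x * 2.0 ** -i      # realised as a shift and add in hardware
        z -= d * 2.0 ** -i          # drive the residual z toward zero
    return y

exact = 0.75 * 0.6
err4 = abs(cordic_linear_mul(0.75, 0.6, 4) - exact)   # approximate mode
err8 = abs(cordic_linear_mul(0.75, 0.6, 8) - exact)   # more accurate mode
assert err8 < err4   # more iterations -> smaller error, at higher latency
```

Truncating the loop early is exactly the approximate-mode operating point; extending it is the accurate mode, with no change to the datapath.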
Recent works such as ReCON [5] and Flex-PE [3] have demonstrated the applicability of CORDIC arithmetic to deep learning operations, including MAC, Sigmoid, Tanh, and SoftMax [25]. However, these designs primarily employ pipelined or fixed-stage CORDIC architectures, which impose a static trade-off between accuracy and latency. In contrast, the proposed MAC adopts an iterative CORDIC structure, as illustrated in Fig. 5, where the number of active iterations directly determines the approximation error and execution latency. This enables runtime switching between approximate and accurate execution modes without altering the hardware structure or introducing auxiliary correction logic.

The MAC unit supports both 8-bit and 16-bit fixed-point precision modes. In approximate mode, the MAC completes 8-bit and 16-bit operations in 4 and 7 clock cycles, respectively, incurring approximately 2% accuracy degradation at the application level. In accurate mode, additional iterations are enabled, completing 8-bit and 16-bit operations in 5 and 9 cycles with less than 0.5% accuracy loss. In addition, a 4-bit mode is supported, completing accurate operation in 4 cycles. These operating points are selected based on an accuracy-sensitivity heuristic [3], enabling layer-wise configuration based on numerical criticality.

From a circuit perspective, the iterative MAC minimises area and static power by reusing a single CORDIC datapath across iterations, rather than replicating pipeline stages. This design choice reduces the number of adders, shifters, and registers compared to pipelined alternatives, while still enabling high throughput at the vector-engine level through parallelism across multiple PEs.

Fig. 6. Hardware AAD module for two inputs.

Fig. 7. Hardware AAD module architecture based on a sliding window.
B. Latency Hiding Through Vector-Level Parallelism

Although the iterative MAC incurs a multi-cycle latency per operation, this overhead is effectively hidden at the vector-engine level. Since multiple PEs operate concurrently on independent data elements, the increased per-MAC latency does not limit overall throughput for sufficiently large vector widths. This execution model distinguishes the proposed architecture from fully parallel or systolic-array designs, which require deeply pipelined datapaths to sustain throughput and therefore incur higher area and power overheads.

The ability to trade per-MAC latency for reduced hardware complexity is particularly advantageous for edge AI accelerators, where area and energy efficiency are often more critical than single-operation latency.

C. Absolute Average Deviation (AAD) Pooling Block

In addition to the MAC and activation units, the vector engine integrates peripheral components such as an Absolute Average Deviation (AAD) pooling unit [26] and a normalisation block. The AAD pooling unit is selected for its favourable accuracy characteristics under CORDIC-based computation, demonstrating a 0.5-1% accuracy improvement over conventional pooling methods with lower computational complexity [3], [26].

The hardware implementation of the two-input AAD unit, shown in Fig. 6, consists of three primary steps: subtraction, absolute-value computation, and division. Initially, the two input values are fed into a subtractor to determine their difference. The subtraction result is then processed through two parallel paths. A comparator receives the result from one path and compares it to zero to identify the sign of the difference, returning either +1 or -1.

Fig. 8. Hardware AAD module architecture based on parallel computation.

Fig. 9. Multiple feature computations in parallel in hardware.

To match its timing with the comparator output, the other
channel passes the subtraction result through a buffer. These two outputs are multiplied, ensuring the final result is always non-negative regardless of the input order, effectively yielding the absolute deviation. This absolute deviation is then divided by two to obtain the final AAD output for the two-input case.

For multi-input scenarios, multiple subtraction-absolute (SA) modules operate in parallel, each computing the absolute deviation between pairs of input values, as shown in Fig. 8. The outputs of these SA modules are summed using an adder network, and the accumulated result is divided by the normalisation factor M = N(N-1) to produce the overall AAD value, computed in parallel as shown in Fig. 9.

To reduce hardware complexity, a sliding-window approach is adopted, in which a window moves across the input data according to the defined stride and pooling size. Within each window, deviations between data points are computed, accumulated in registers, and normalised to produce the final AAD result efficiently, as illustrated in Fig. 7.

D. Time-Multiplexed Multi-Activation-Function Block

Activation functions represent a small fraction of total operations but often consume disproportionate hardware resources. To address this inefficiency, the proposed design integrates a time-multiplexed multi-AF block that reuses CORDIC hardware across multiple nonlinear functions, as shown in Fig. 10. The multi-AF block supports Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU, enabling compatibility with both CNN and transformer-style workloads.

The multi-AF block operates in two primary modes: a hyperbolic rotation (HR) mode for functions that require sinh and cosh computations, and a linear-division (LV) mode for functions that involve normalisation or exponential scaling.
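The HR mode can be illustrated with a small behavioural model: hyperbolic CORDIC rotations produce scaled sinh and cosh using shifts and adds, and Tanh follows from their ratio (the final divide standing in for the LV mode). This is a sketch under the standard convergence convention (iterations 4 and 13 repeated), not the RTL datapath.

```python
import math

def cordic_tanh(z, iterations=16):
    """Hyperbolic-mode CORDIC sketch: Tanh(z) for |z| < ~1.1.

    x, y converge to K*cosh(z), K*sinh(z); the gain K cancels in the ratio,
    which the LV (linear-division) mode would compute in hardware.
    """
    x, y = 1.0, 0.0
    i = 1
    while i <= iterations:
        # Iterations 4 and 13 are repeated, the standard convergence fix
        # for the hyperbolic CORDIC angle sequence.
        for _ in range(2 if i in (4, 13) else 1):
            d = 1 if z >= 0 else -1
            x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * math.atanh(2.0 ** -i)
        i += 1
    return y / x

assert abs(cordic_tanh(0.5) - math.tanh(0.5)) < 1e-4
```

Because the same shift-add stages serve sinh, cosh, and the derived Sigmoid/Tanh, one datapath can be time-shared across all PEs, which is the utilisation argument made above.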
By selectively enabling only the required datapaths for a given function, the design achieves utilisation factors of up to 86% in HR mode and approximately 72% in LV mode. Additional auxiliary logic includes a lightweight switching multiplexer for Sigmoid and Tanh selection, a ReLU bypass buffer, a FIFO for intermediate SoftMax storage, and two small multipliers [27] to support GELU computation. Collectively, these components incur less than 4% additional area and power overhead while significantly improving overall hardware utilisation.

E. Peripheral Support and Integration

The proposed vector engine is integrated as a complete edge-AI processing subsystem comprising a lightweight control engine, on-chip memory banks, input pre-processing logic, and a host communication interface. The control engine uses configuration registers, status flags, and a finite-state machine to control memory addressing, instruction sequencing, and synchronisation, coordinating execution across the MAC array, activation block, and pooling units. Runtime control signals such as ComputeInit, LayerDone, and ComputeDone facilitate layer-adaptive execution, enabling the reuse of hardware resources across multiple network layers while guaranteeing proper data ordering. A data prefetcher retrieves input feature maps from external memory, buffers them locally, and then broadcasts them to the processing elements. Index-controlled multiplexing transports intermediate outputs to later layers. To facilitate continuous data feeding and overlap memory access with computation, parameter storage is managed via partitioned kernel memory banks that independently store activations and weights. The memory interface supports synchronous valid-data loading with a data-ready completion signal, allowing the host processor [28] to capture final outputs without stalling the compute pipeline.
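A minimal sketch of this valid-gated loading interface is given below. The signal names load_param_weight and the ready flag mirror the text; the single-deque model, its class name, and the completion condition are purely illustrative stand-ins for the real datapath.

```python
from collections import deque

class LoaderModel:
    """Behavioural stand-in for the synchronous parameter-loading interface."""
    def __init__(self, n_params):
        self.mem = deque()
        self.n_params = n_params

    def clock(self, load_param_weight=False, data=None):
        """One synchronous edge: data is accepted only while valid is high."""
        if load_param_weight and data is not None:
            # Write order is the inverse of read order (LIFO), so the
            # earliest-loaded parameter is read out last.
            self.mem.appendleft(data)

    @property
    def ready(self):
        # Stand-in for the data-ready completion flag polled by the host.
        return len(self.mem) == self.n_params

m = LoaderModel(3)
for w in (10, 11, 12):
    m.clock(load_param_weight=True, data=w)
assert m.ready and list(m.mem) == [12, 11, 10]
```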
All processing elements share a time-multiplexed multi-activation-function unit that uses common CORDIC resources to perform nonlinear operations. To prevent performance bottlenecks, its operation overlaps with MAC computation. In addition, integrated pooling and normalisation blocks process partial sums before output generation, reducing intermediate storage and external memory traffic. The modular organisation of control, memory, and peripheral compute stages enables scalable deployment across FPGA and ASIC platforms. It supports efficient system-level integration with embedded processors through a lightweight interface, thereby transforming the vector engine from a standalone compute core into a deployable edge-AI accelerator.

IV. EXPERIMENTAL METHODOLOGY

To ensure a rigorous, fair, and reproducible evaluation, the proposed vector engine is validated using a structured hardware-software co-design methodology spanning algorithmic emulation, RTL-level verification, FPGA prototyping, and ASIC synthesis. The evaluation framework is designed to isolate the impact of iterative CORDIC approximation while maintaining consistent experimental conditions across all comparisons.

Fig. 10. Time-multiplexed activation function with integrated data flow and control signals.

A. Software-Level Functional Emulation

At the algorithmic level, an iso-functional software model of the proposed vector engine is developed in Python 3.0. The model emulates the vector engine's custom iterative CORDIC arithmetic, precision-switching behaviour, and execution scheduling. Fixed-point arithmetic is implemented using the FxP-Math library, while neural network layers and quantised inference flows are modelled using QKeras 2.3. The software framework supports configurable precision modes (8-bit and 16-bit), variable CORDIC iteration depth, and layer-wise execution control.
All deep learning evaluations are performed against an FP32 reference baseline under identical network topology, dataset, and inference conditions. This approach ensures that observed accuracy differences are attributable solely to arithmetic approximation, not to changes in training or model structure. Accuracy is evaluated at both the layer and end-to-end model levels for representative CNN and transformer-style MLP workloads. The number of CORDIC iterations per layer is selected using an accuracy-sensitivity heuristic [3], which identifies numerically critical layers and assigns them to accurate execution modes, while non-critical layers operate in approximate mode.

B. RTL Modelling and Functional Verification

The proposed iterative CORDIC-based MAC unit and vector-engine datapath are modelled in synthesizable Verilog HDL. The architecture is parameterised to support different vector widths, precision modes, and iteration depths. A cycle-accurate RTL testbench is developed to validate functional correctness across all supported operating modes.

Functional verification is performed using Synopsys VCS, where RTL outputs are compared against the software emulation model for a wide range of randomised and application-driven test vectors. This cross-validation ensures bit-level consistency between the software model and the hardware implementation, accounting for fixed-point rounding, truncation, and iteration control.

C. FPGA Prototyping and Measurement

FPGA-based evaluation is conducted using the AMD Virtex-7 (VC707) platform. Synthesis, placement, and routing are performed using the AMD Vivado Design Suite with a target operating frequency of 100 MHz. All reported FPGA metrics, including lookup tables (LUTs), flip-flops (FFs), timing, and power consumption, are obtained from post-place-and-route reports to avoid optimistic estimation.
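The per-layer iteration assignment of Section IV-A can be pictured with a small sketch (the sensitivity scores, iteration counts, and top-k policy here are hypothetical placeholders, not the heuristic of [3]): layers ranked as most numerically sensitive receive accurate mode, everything else runs approximately.

```python
def assign_iterations(sensitivity, accurate_iters=16, approx_iters=8, top_k=2):
    """Assign accurate mode (more CORDIC iterations) to the top_k most
    error-sensitive layers; all other layers run in approximate mode."""
    ranked = sorted(range(len(sensitivity)), key=lambda i: -sensitivity[i])
    plan = [approx_iters] * len(sensitivity)
    for layer in ranked[:top_k]:
        plan[layer] = accurate_iters
    return plan

# e.g. the first and last layers of a CNN are often the most sensitive
plan = assign_iterations([0.9, 0.2, 0.1, 0.3, 0.8], top_k=2)
```

Because the plan is just a per-layer iteration count, it maps directly onto the runtime-configurable iteration depth of the MAC array without retraining the model.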
To enable fair comparison with state-of-the-art designs, either the reported post-implementation results from prior work are used directly, or the designs are re-synthesised under comparable constraints where feasible. Power measurements are extracted using vendor-supported power analysis tools with realistic switching activity derived from application traces.

D. ASIC Synthesis and Technology Assumptions

ASIC evaluation is carried out using Synopsys Design Compiler targeting a commercial 28 nm HPC+ CMOS technology at 0.9 V. Standard-cell libraries for worst-case timing corners are used to ensure conservative delay estimates. Area, timing, and power metrics are extracted from post-synthesis reports.

System-level performance metrics, including energy efficiency (TOPS/W) and compute density (TOPS/mm²), are derived using consistent workload assumptions across all designs. The same precision mode, vector width, and clock-frequency normalisation are applied when comparing against prior accelerators to ensure fairness.

E. System-Level Deployment and End-to-End Validation

To validate practical applicability, the proposed vector engine is deployed on a Pynq-Z2 platform with an ARM Cortex-A9 host processor. The accelerator is integrated through an AXI-based interface and evaluated on object detection and classification workloads.
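The derived metrics above follow standard definitions; a short sketch shows how they compose (the operating points in the usage lines are placeholders, except the PDP check, which uses the proposed MAC's reported 1.9 mW / 9.1 ns FPGA figures from Table II):

```python
def pdp_pj(power_mw, delay_ns):
    """Power-delay product: mW x ns = pJ."""
    return power_mw * delay_ns

def tops(ops_per_cycle, freq_ghz):
    """Peak throughput: ops/cycle x GHz = GOPS, divided by 1000 -> TOPS."""
    return ops_per_cycle * freq_ghz / 1000.0

def tops_per_watt(t, power_mw):
    """Energy efficiency: TOPS per watt."""
    return t / (power_mw / 1000.0)

def tops_per_mm2(t, area_mm2):
    """Compute density: TOPS per mm^2."""
    return t / area_mm2

# PDP of the proposed MAC stage on FPGA (Table II): 1.9 mW x 9.1 ns
mac_pdp = pdp_pj(1.9, 9.1)          # 17.29 pJ
# hypothetical operating point: 512 ops/cycle at 1 GHz, 256 mW, 1 mm^2
t = tops(512, 1.0)                  # 0.512 TOPS
eff = tops_per_watt(t, 256)         # 2.0 TOPS/W
```

Normalising every compared design through the same four functions is what the text means by "consistent workload assumptions": only the operating point changes, never the metric definition.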
End-to-end latency and power consumption are measured at the application level, capturing the combined effects of computation, data movement, and control overhead. This multi-level evaluation methodology ensures that the reported improvements are not limited to isolated circuit optimisations but translate into tangible system-level benefits for real-world edge AI deployments.

TABLE II
COMPARATIVE PERFORMANCE METRICS FOR DIFFERENT SOTA CORDIC-BASED MAC UNITS
(Columns correspond to MAC variants reported in TCAS-II'24 [29], ISCAS'25 [4], ICIIS'25 [11], TVLSI'25 [30], TCAD'22 [31], TVLSI'25 [3], and this work. NR = not reported.)

Variant             | FPGA (VC707, 100 MHz): LUTs / FFs / Delay (ns) / Power (mW) / PDP (pJ) | ASIC (28 nm, 0.9 V): Area (µm²) / Delay (ns) / Power (mW) / PDP (pJ)
FP32                | 8065 / 1072 / 5.56 / 378 / 2102   | 10000 / 679 / 15.86 / 10768.94
FP32                | 8054 / 1718 / 4.6 / 296 / 1361.6  | 13000 / 700 / 29.3 / 20510
BF16                | 3670 / 324 / 0.512 / 136 / 69.6   | 4340 / 295 / 6.89 / 4682
Posit-8             | 467 / 175 / 2.68 / 68 / 182       | 754 / 40.6 / 1.8 / 1189
Vedic               | 160 / 241 / 4.5 / 6.1 / 27.45     | 407 / 6.38 / 35 / 223.3
Wallace             | 106 / 113 / 2.6 / 3.3 / 8.58      | 296 / 5.62 / 37 / 207.94
Booth               | 84 / 59 / 3.1 / 3.1 / 9.6         | 271 / 5.3 / 12.8 / 67.84
Quant-MAC           | 72 / 56 / 5.4 / 4.2 / 22.68       | 175 / 3.58 / 89 / 318.62
CORDIC              | 56 / 72 / 1.52 / 8.3 / 12.6       | 264 / 2.36 / 24.5 / 57.82
MSDF-MAC            | 62 / 45 / 3.2 / 5.8 / 18.56       | 286 / 1.42 / 6.7 / 9.514
Acc-App-MAC         | 57 / NR / 3.51 / 6.9 / 24.2       | 259 / 2.6 / 12.4 / 32.24
CORDIC              | 45 / 37 / 4.5 / 2 / 9             | 8570 / 0.7 / 1.5 / 1.05
Iter-MAC (Proposed) | 24 / 22 / 9.1 / 1.9 / 17.29       | 108 / 2.98 / 6.3 / 18.774

TABLE III
COMPARATIVE PERFORMANCE METRICS FOR DIFFERENT SOTA CORDIC-BASED AF UNITS
(Columns correspond to AF variants reported in ISQED'24 [32], TCAS-II'20 [33], TVLSI'23 [34], TC'23 [35], TVLSI'25 [3], and this work. NR = not reported.)

Variant               | FPGA (VC707, 100 MHz): LUTs / FFs / Delay (ns) / Power (mW) / PDP (pJ) | ASIC (28 nm, 0.9 V): Area (µm²) / Delay (ns) / Power (mW) / PDP (pJ)
Softmax-FP32          | 3217 / NR / 92 / 115 / 10580    | 41536 / 6 / 75 / 450
Softmax-FP16          | 1137 / NR / 43 / 115 / 4945     | 17289 / 4 / 40 / 160
Softmax-BF16          | 1263 / NR / 45 / 77 / 3465      | 11301 / 3.3 / 25 / 82.5
Softmax-FxP8/16       | 2564 / 2794 / 2.3 / NR / -      | 18392 / 0.3 / 51.6 / 15.5
Softmax-16b           | 1215 / 1012 / 3.32 / 165 / 548  | 3819 / 1.6 / 1.6 / 2.56
Tanh-FP32             | 4298 / NR / 56 / 130 / 7280     | 5060 / 4 / 8.75 / 35
Tanh-FP16             | 1530 / NR / 34 / 124 / 4216     | 1180 / 3.3 / 3 / 9.9
Tanh-BF16             | 1513 / NR / 38 / 82 / 3116      | 843 / 3.4 / 2 / 6.8
Tanh/Sigmoid-16b      | 2395 / 1503 / 0.18 / 681 / 123  | 870523 / NR / 150 / -
Sigmoid-FP32          | 5101 / NR / 109 / 121 / 13189   | 2234 / 7.6 / 10 / 76
Sigmoid-FP16          | 1853 / NR / 60 / 118 / 7080     | 1855 / 4.4 / 4.8 / 21.12
Sigmoid-BF16          | 1856 / NR / 45 / 83 / 3735      | 1180 / 3.26 / 2.5 / 8.15
SSTp                  | 897 / 1231 / 11.8 / 59 / 696.2  | 49152 / 2.3 / 5.2 / 11.96
FxP-4/8/16 (Proposed) | 537 / 468 / 2.6 / 30 / 78       | 2138 / 2.6 / 60 / 156

V. RESULTS AND DISCUSSION

This section presents a comprehensive evaluation of the proposed CORDIC-based vector engine at the circuit, architectural, and system levels. Results are reported for FPGA prototyping, ASIC synthesis, and end-to-end deployment, and are compared against representative state-of-the-art (SoTA) AI accelerators to highlight performance, energy-efficiency, and scalability trade-offs.

A. MAC-Level Hardware Efficiency

Table II compares the proposed iterative CORDIC-based MAC unit with prior CORDIC, logarithmic, and approximation-based MAC designs across both FPGA and ASIC platforms. On the Virtex-7 (VC707) FPGA, the proposed MAC achieves significant reductions in lookup tables (LUTs) and flip-flops (FFs) compared to pipelined CORDIC and fixed-point MAC designs, while avoiding the use of DSP blocks. This reduction directly translates into lower static power consumption and improved placement flexibility.

At the ASIC level (28 nm, 0.9 V), the proposed MAC demonstrates up to 33% reduction in critical-path delay and approximately 21% lower power per MAC stage compared to comparable CORDIC-based designs. Although the iterative MAC incurs a multi-cycle execution latency, this overhead is amortised at the vector-engine level through parallel execution across multiple processing elements (PEs), as discussed in Section II.
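The amortisation argument can be made concrete with a toy throughput model (purely illustrative; the PE counts, frequency, and cycle counts below are assumptions, not measured values): per-MAC latency divides throughput, but PE-level parallelism multiplies it back, and halving the iteration depth in approximate mode directly doubles effective throughput.

```python
def effective_gops(n_pe, freq_mhz, cycles_per_mac, ops_per_mac=2):
    """Toy throughput model: each iterative MAC costs `cycles_per_mac`
    cycles, but n_pe elements run in parallel, so
    GOPS = n_pe * f / cycles_per_mac * ops_per_mac."""
    return n_pe * freq_mhz * 1e6 / cycles_per_mac * ops_per_mac / 1e9

# halving the iteration depth (approximate mode) doubles throughput
acc_mode = effective_gops(64, 100, cycles_per_mac=16)   # accurate mode
app_mode = effective_gops(64, 100, cycles_per_mac=8)    # approximate mode
```

Under this model, widening the vector (larger n_pe) compensates for multi-cycle MAC latency without touching the datapath, which is the scaling behaviour the 64-PE and 256-PE configurations exploit.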
Consequently, the proposed design achieves a favourable power-delay product (PDP) while maintaining runtime configurability between approximate and accurate execution modes.

B. Activation-Function Hardware Utilization

Table III summarises the FPGA and ASIC resource utilisation of the proposed time-multiplexed multi-activation-function (multi-AF) block relative to prior dedicated AF implementations. Existing designs often allocate separate hardware blocks for individual activation functions [36], leading to significant underutilization and dark silicon. In contrast, the proposed multi-AF block reuses CORDIC resources across multiple nonlinear functions, including Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU.

The results show that the proposed design achieves utilisation factors of 72-86% depending on the activation mode, while incurring less than 4% additional area and power overhead. On an FPGA, the multi-AF block reduces LUT and FF usage compared to SoTA designs supporting a similar function set. On ASIC, it demonstrates lower power consumption and competitive delay, confirming that time multiplexing effectively mitigates activation-function underutilization without compromising performance.

C. Accuracy Evaluation Under Iterative Approximation

Fig. 11 reports the accuracy of representative CNN and DNN models under different CORDIC iteration settings. The results confirm that numerical error is tightly coupled to

Fig. 11. [Plot: classification accuracy (%) versus signed fixed-point precision (4/8/16/32-bit), conventional vs. proposed, for Custom/MNIST, LeNet-5/MNIST, ResNet-18/MNIST, CaffeNet/MNIST, VGG-16/CIFAR-10, VGG-16/CIFAR-100, LeNet-5/CIFAR-10, ResNet-18/CIFAR-10, ResNet-18/CIFAR-100, and CaffeNet/ImageNet.]
Evaluation of DNN accuracy for different DNN models with the CORDIC methodology.

the number of active CORDIC iterations, validating the effectiveness of the proposed runtime accuracy-latency trade-off mechanism. When operating in approximate mode, the accelerator incurs approximately 2% accuracy degradation, while accurate mode limits accuracy loss to below 0.5%.

Importantly, by applying an accuracy-sensitivity heuristic to select the iteration depth per layer, most of the performance benefits of approximate execution are retained while preserving end-to-end model accuracy. This demonstrates that the proposed architecture enables fine-grained control over numerical fidelity without requiring retraining or auxiliary correction hardware.

D. FPGA System-Level Comparison

Table IV compares the proposed vector engine against SoTA FPGA-based AI accelerators using object detection workloads such as TinyYOLO-v3. The proposed design achieves competitive throughput while significantly reducing power consumption. Operating at 85.4 MHz on the VC707 platform, the vector engine delivers 6.43 GOPS/W at only 0.53 W, outperforming several prior designs in energy efficiency despite using no DSP blocks.

Compared to designs such as Flex-PE and LPRE, which rely on higher operating frequencies or specialised arithmetic units, the proposed architecture emphasises energy efficiency and scalability, making it particularly well-suited for edge deployments where power budgets are tightly constrained.

E. ASIC Scalability and Compute Density

ASIC-level scalability is evaluated using two configurations of the proposed vector engine: a 64-PE configuration and a 256-PE configuration, as reported in Table V. The 64-PE configuration serves as a computationally equivalent baseline, demonstrating comparable performance to prior designs at significantly lower area and power.
The 256-PE configuration represents a resource-equivalent comparison, achieving a peak compute density of 4.83 TOPS/mm² and an energy efficiency of 11.67 TOPS/W.

Fig. 12. Prototype visualisation, showing a Pynq-Z2 for edge AI inference on an Unmanned Aerial Vehicle (UAV).

These results highlight the benefits of the proposed iterative execution model, where increased vector width compensates for per-MAC latency while preserving energy efficiency. The architecture's scalability enables efficient deployment across a wide range of performance targets without redesigning the core datapath.

F. End-to-End Embedded Deployment

Fig. 13 presents a layer-wise execution-time and power breakdown for the VGG-16 model, illustrating the impact of runtime precision switching on system performance. End-to-end deployment on a Pynq-Z2 platform with an ARM Cortex-A9 host reports a total latency of 84.6 ms at 0.43 W for object detection and classification workloads, as shown in Fig. 12. This outperforms prior works: [3] (186.4 ms / 2.24 W on VC707), [40] (772 ms / 1.524 W on VC707), [4] (184 ms / 0.93 W on Pynq-Z2), and [6] (163.7 ms / 13.32 W on VCU102), as well as commercial embedded baselines such as the NVIDIA Jetson Nano (226 ms / 1.34 W) and Raspberry Pi (555 ms / 2.7 W). These improvements in both latency and power stem from a combination of iterative MAC efficiency, reduced memory bandwidth requirements, and dynamic precision adaptation.

Overall, the results demonstrate that the proposed architecture delivers consistent improvements across circuit-level efficiency, architectural scalability, and system-level performance, validating its suitability for energy-efficient edge AI acceleration.

VI.
CONCLUSION AND FUTURE WORK

This paper presented a runtime-adaptive, CORDIC-accelerated vector engine designed to address the efficiency and flexibility challenges of deep learning inference on resource-constrained edge platforms. By introducing a low-resource, iterative CORDIC-based MAC unit with runtime-configurable iteration depth, the proposed architecture enables fine-grained trade-offs between accuracy and latency without requiring auxiliary error-correction hardware or structural modifications. In contrast to prior fixed-approximation designs, this approach allows dynamic adaptation to layer-level numerical sensitivity while maintaining compatibility with standard deep learning workloads.

TABLE IV
ANALYSIS OF FPGA HARDWARE IMPLEMENTATION FOR OBJECT DETECTION (TINYYOLO-V3) WITH SOTA AI ACCELERATOR DESIGNS

Design          | Platform | Precision | k-LUTs | k-Regs/FFs | DSPs | Op. Freq (MHz) | Energy Eff. (GOPS/W) | Power (W)
Proposed        | VC707    | 4/8/16    | 26.7   | 15.9       | -    | 85.4           | 6.43                 | 0.53
TVLSI'25 [3]    | VC707    | 4/8/16/32 | 38.7   | 17.4       | 73   | 466            | 8.42                 | 2.24
TCAS-I'24 [37]  | ZU3EG    | 8         | 40.8   | 45.5       | 258  | 100            | 0.39                 | 2.2
TCAS-II'23 [38] | XCVU9P   | 8         | 132    | 39.5       | 96   | 150            | 6.36                 | 5.52
TVLSI'23 [39]   | ZCU102   | 8         | 117    | 74         | 132  | 300            | 4.2                  | 6.58
Access'24 [2]   | VC707    | 4/8       | 19.8   | 12.1       | 39   | 136            | 0.68                 | 1.81
ISCAS'25 [4]    | VCU129   | 8/16/32   | 17.5   | 14.8       | -    | 54.5           | 2.64                 | 1.6

TABLE V
ASIC PERFORMANCE COMPARISON WITH SOTA 8-BIT ACCELERATOR DESIGNS, CMOS 28 NM, 0.9 V, SF TECHNOLOGY

Design          | Network/Arch                              | Datatype   | Freq. (GHz) | Area (mm²) | Power (mW) | TOPS/W | TOPS/mm²
TCAS-II'24 [29] | Vector Engine (64 MACs)                   | FP8        | 1.47        | 0.896      | 1622       | 7.24   | 2.39
                |                                           |            | 1.29        | 1.18       | 1375       | 3.57   | 1.21
TCAS-I'22 [1]   | Vector Engine (64 MACs), 196-64-32-32-10  | INT-8      | 0.4         | 2.43       | 224.6      | 7.75   | 1.67
ISCAS'25 [4]    | TREA (64 MACs), 196-64-32-32-10           | Posit-8    | 1.25        | 6.73       | 230.4      | 7.55   | 0.16
TVLSI'25 [3]    | Systolic Array (8x8)                      | FxP8       | 0.44        | 1.85       | 523        | 4.3    | 2.76
ICIIS'25 [11]   | Layer-Reused (64 MACs), 196-64-32-32-10   | FxP8       | 0.25        | 3.78       | 1540       | 4.28   | 2.07
Proposed        | Vector Engine (64 PEs)                    | FxP-4/8/16 | 1.24        | 0.43       | 329        | 3.84   | 1.52
                | Vector Engine (256 PEs)                   |            | 0.96        | 1.42       | 1186       | 11.67  | 4.83
Access'24 [2]   | Shared Bank (256 MACs), 784-196-120-84-10 | FxP8       | 0.28        | 1.58       | 499.7      | 6.87   | 1.18

Fig. 13. VGG-16 layer-wise execution time and power consumption.

The proposed vector engine further integrates a time-multiplexed multi-activation-function (multi-AF) block that significantly improves hardware utilisation and mitigates dark silicon. By sharing CORDIC resources across a wide range of nonlinear functions, including Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU, the architecture achieves high utilisation factors with minimal additional area and power overhead. This balanced treatment of MAC and activation units addresses a persistent inefficiency in existing deep learning accelerators.

Comprehensive evaluation across software emulation, RTL verification, FPGA prototyping, and 28 nm ASIC synthesis demonstrates the effectiveness of the proposed design.
The iterative MAC unit achieves up to 33% reduction in critical-path delay and 21% power savings per stage, while scalable vector-engine configurations deliver a peak compute density of 4.83 TOPS/mm² and an energy efficiency of 11.67 TOPS/W. End-to-end deployment on embedded platforms further confirms that the architectural benefits translate into tangible improvements in latency and power consumption at the system level.

Future work will focus on extending the proposed framework toward a compiler-assisted design flow that automates layer-wise precision and iteration selection based on model sensitivity analysis. In addition, integrating full physical design and place-and-route (PnR) optimisation will enable more accurate post-layout evaluation and facilitate tape-out readiness. Further exploration of adaptive execution strategies for emerging transformer and multi-modal workloads, as well as tighter integration with RISC-V-based system-on-chip platforms, represents a promising direction for expanding the applicability of the proposed vector engine [28].

Overall, the proposed runtime-adaptive CORDIC-based vector engine provides a scalable and energy-efficient compute substrate that bridges the gap between approximate and accurate deep learning acceleration, making it well-suited for next-generation edge AI systems.

REFERENCES

[1] R. Pilipović, P. Bulić, and U. Lotrič, “A Two-Stage Operand Trimming Approximate Logarithmic Multiplier,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, pp. 2535–2545, June 2022.
[2] N. Ashar, G. Raut, V. Trivedi, S. K. Vishvakarma, and A. Kumar, “QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit,” IEEE Access, vol. 12, pp. 43600–43614, 2024.
[3] M. Lokhande, G. Raut, and S. K. Vishvakarma, “Flex-PE: Flexible and SIMD Multiprecision PE for AI Workloads,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.
33, pp. 1610–1623, June 2025.
[4] O. Kokane, M. Lokhande, G. Raut, A. Teman, and S. K. Vishvakarma, “LPRE: Logarithmic Posit-enabled Reconfigurable edge-AI Engine,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2025.
[5] O. Kokane, G. Raut, S. Ullah, M. Lokhande, A. Teman, A. Kumar, and S. K. Vishvakarma, “Retrospective: A CORDIC-Based Configurable Activation Function for NN Applications,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 1–6, 2025.
[6] R. Pilipović and P. Bulić, “On the design of logarithmic multiplier using radix-4 booth encoding,” IEEE Access, vol. 8, pp. 64578–64590, 2020.
[7] Y. Mao and Q. Liu, “ALN: Approximate layer normalization for transformer training on edge device,” IEEE Transactions on Computers, pp. 1–14, 2026.
[8] L. Nkenyereye, C. Rajkumar, B. G. Lee, and W.-Y. Chung, “Dynamic transfer learning switching approach using resource benchmark in edge intelligence,” IEEE Internet of Things Journal, vol. 12, no. 13, pp. 25148–25170, 2025.
[9] M. Lokhande, A. Jain, and S. K. Vishvakarma, “Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration,” in IEEE International Symposium on VLSI Design and Test, Aug. 2025.
[10] A. Jha, T. Dewangan, M. Lokhande, and S. K. Vishvakarma, “QForce-RL: Quantized FPGA-Optimized RL Compute Engine,” in IEEE International Symposium on VLSI Design and Test (VDAT), Aug. 2025.
[11] S. Kumar, K. Gupta, I. S. Dasanayake, M. Lokhande, and S. K. Vishvakarma, “HYDRA: Hybrid data multiplexing and run-time layer configurable DNN accelerator,” in Proceedings of the 19th International Conference on Industrial and Information Systems (ICIIS), (Sri Lanka), Dec. 2025.
[12] J.-S. Park, C. Park, S. Kwon, T. Jeon, Y. Kang, H. Lee, D. Lee, J. Kim, H.-S. Kim, Y. Lee, S. Park, M. Kim, S. Ha, J. Bang, J. Park, S. Lim, and I.
Kang, “A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit With a Unified Multi-Precision Datapath in 4-nm Flagship Mobile SoC,” IEEE Journal of Solid-State Circuits, vol. 58, no. 1, pp. 189–202, 2023.
[13] M. Verhelst, L. Benini, and N. Verma, “How to Keep Pushing ML Accelerator Performance? Know Your Rooflines!,” IEEE Journal of Solid-State Circuits, pp. 1–18, June 2025.
[14] A. Krishna, S. Rohit Nudurupati, D. G. Chandana, P. Dwivedi, A. van Schaik, M. Mehendale, and C. S. Thakur, “RAMAN: A Reconfigurable and Sparse tinyML Accelerator for Inference on Edge,” IEEE Internet of Things Journal, vol. 11, pp. 24831–24845, July 2024.
[15] T. Chaudhari, A. J, T. Dewangan, M. Lokhande, and S. K. Vishvakarma, “XR-NPE: High-throughput mixed-precision SIMD neural processing engine for extended reality perception workloads,” in 39th International Conference on VLSI Design and 25th International Conference on Embedded Systems (VLSID/ES), (Pune, India), Jan. 2026.
[16] M. Jaiswal, V. Sharma, A. Sharma, S. Saini, and R. Tomar, “Quantized CNN-based efficient hardware architecture for real-time hand gesture recognition,” Microelectronics Journal, vol. 151, p. 106345, July 2024.
[17] Z. Yuan, Q. Li, X. Lin, T.-M. Grønli, and S. Cherkaoui, “Edge AI for internet of robotic things,” IEEE Internet of Things Magazine, vol. 9, no. 1, pp. 4–6, 2026.
[18] P. Chen, T. Ouyang, K. Luo, W. Hong, and X. Chen, “CoDrone: Autonomous drone navigation assisted by edge and cloud foundation models,” IEEE Internet of Things Journal, vol. 13, no. 4, pp. 5593–5609, 2026.
[19] Y. Chen, H. Wang, Z. Li, E. Mou, T. Song, S. Xia, and Y. Pang, “A lightweight UAV object detector based on optimized YOLOv8 fused with an auxiliary learning branch for AIoT,” IEEE Internet of Things Journal, vol. 13, no. 4, pp. 5793–5808, 2026.
[20] M. Ali and K.
Nathwani, “Exploiting wavelet scattering transform and 1D-CNN for unmanned aerial vehicle detection,” IEEE Signal Processing Letters, vol. 31, pp. 1790–1794, 2024.
[21] K. Zhang, X. Liu, K. Wang, Q. Cai, X. Xie, J. Zhang, J. Chen, C. Zhang, X. Tong, Z. Gong, and K. Li, “EDCL: An efficient dynamic continual learning framework for IoT systems,” IEEE Transactions on Computers, pp. 1–16, 2026.
[22] Y.-C. Lin, M.-S. Huang, J.-B. Wang, W.-C. Chen, N.-S. Chang, C.-P. Lin, C.-S. Chen, T.-D. Chiueh, and C.-H. Yang, “A 16nm Fully Integrated SoC for Hardware-Aware Neural Architecture Search,” in 2025 IEEE European Solid-State Electronics Research Conference (ESSERC), pp. 397–400, 2025.
[23] J. Hu, Z. Zhang, Z. Li, Q. Meng, X. Shi, Q. Huang, H. Wang, and S. Chang, “Single-step hardware-aware neural network quantization with mixed precision,” IEEE Transactions on Computers, pp. 1–12, 2026.
[24] A. Sharma, L. H. Krishna, and B. Srinivasu, “High-Performance Gemmini-Based Matrix Multiplication Accelerator for Deep Learning Workloads,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 12, pp. 3276–3289, 2025.
[25] S. Mehra, G. Raut, R. D. Purkayastha, S. K. Vishvakarma, and A. Biasizzo, “An empirical evaluation of enhanced performance softmax function in deep learning,” IEEE Access, vol. 11, pp. 34912–34924, 2023.
[26] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Designing Novel AAD Pooling in Hardware for a Convolutional Neural Network Accelerator,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 30, pp. 303–314, Mar. 2022.
[27] W. Zhang, X. Geng, X. Hu, H. Wang, J. Jiang, Q. Wang, S. Liu, J. Han, and H. Jiang, “LUT-ALMs: Trading off accuracy and power for approximate logarithmic multipliers via LUT optimization,” IEEE Transactions on Computers, pp. 1–14, 2026.
[28] A. Kamaleldin, H. Aouinti, and D.
Göhringer, “ProCon-V: A programmable tightly coupled convolution accelerator based on RISC-V custom instructions for edge devices,” IEEE Transactions on Computers, pp. 1–15, 2026.
[29] B. Li, K. Li, J. Zhou, Y. Ren, W. Mao, H. Yu, and N. Wong, “A Reconfigurable Processing Element for Multiple-Precision Floating/Fixed-Point HPC,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 71, pp. 1401–1405, Mar. 2024.
[30] S. M. Cherati, M. Barzegar, and L. Sousa, “MSDF-Based MAC for Energy-Efficient Neural Networks,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., pp. 1–12, 2025.
[31] S. Ullah, S. Rehman, M. Shafique, and A. Kumar, “High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, pp. 211–224, Feb. 2022.
[32] M. Basavaraju, V. Rayapati, and M. Rao, “Exploring Hardware Activation Function Design: CORDIC Architecture in Diverse Floating Formats,” in 25th International Symposium on Quality Electronic Design (ISQED), pp. 1–8, 2024.
[33] D. Zhu, S. Lu, M. Wang, J. Lin, and Z. Wang, “Efficient Precision-Adjustable Architecture for Softmax Function in DL,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, pp. 3382–3386, Dec. 2020.
[34] K. Chen, Y. Gao, H. Waris, W. Liu, and F. Lombardi, “Approximate Softmax Functions for Energy-Efficient DNNs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, pp. 4–16, Jan. 2023.
[35] N. A. Mohamed and J. R. Cavallaro, “A Unified Parallel CORDIC-Based Hardware Architecture for LSTM Network Acceleration,” IEEE Transactions on Computers, vol. 72, pp. 2752–2766, Oct. 2023.
[36] J. Kim, K. Choi, and I.-C. Park, “Hardware-efficient unified approximation for implementing diverse smooth activation functions,” IEEE Transactions on Computers, pp. 1–8, 2026.
[37] B. Wu, T. Yu, K. Chen, and W.
Liu, “Edge-Side Fine-Grained Sparse CNN Accelerator With Efficient Dynamic Pruning Scheme,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 71, pp. 1285–1298, Mar. 2024.
[38] S. Ki, J. Park, and H. Kim, “Dedicated FPGA Implementation of the Gaussian TinyYOLOv3 Accelerator,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 70, pp. 3882–3886, Oct. 2023.
[39] W. Lee, K. Kim, W. Ahn, J. Kim, and D. Jeon, “A Real-Time Object Detection Processor With XNOR-based Variable-Precision Computing Unit,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, pp. 749–761, June 2023.
[40] G. Raut, S. Karkun, and S. K. Vishvakarma, “An Empirical Approach to Enhance Performance for Scalable CORDIC-Based DNNs,” ACM Trans. Reconfigurable Technol. Syst., vol. 16, June 2023.
