CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications


Authors: Sonu Kumar, Mohd Faisal Khan, Mukul Lokhande, Member, IEEE, and Santosh Kumar Vishvakarma, Senior Member, IEEE

Abstract—This brief presents a runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The proposed design enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads. Its resource-efficient approach further enables up to 4× throughput improvement within the same hardware resources by leveraging vectorised, time-multiplexed execution and flexible precision scaling. With a time-multiplexed multi-AF block and a lightweight pooling and normalisation unit, the proposed vector engine supports flexible precision (4/8/16-bit) and high MAC density. The ASIC implementation results show that each MAC stage can save up to 33% of time and 21% of power, with a 256-PE configuration that achieves higher compute density (4.83 TOPS/mm²) and energy efficiency (11.67 TOPS/W) than previous state-of-the-art work. A detailed hardware-software co-design methodology for object detection and classification tasks on Pynq-Z2 is discussed to assess the proposed architecture, demonstrating a scalable, energy-efficient solution for edge AI applications.

Index Terms—CORDIC, multiply-accumulate (MAC), non-linear activation functions, deep learning accelerators, Internet of Things, reconfigurable computing.

I. INTRODUCTION

Deep learning has become a foundational component of modern artificial intelligence (AI) systems at the Internet of Things (IoT) edge, enabling breakthroughs across computer vision, speech recognition, natural language processing, and autonomous systems.
Contemporary workloads are dominated by Deep Neural Networks (DNNs), Vision Transformers (ViTs), and emerging large-scale models, whose inference and training pipelines are primarily composed of convolutional layers, fully connected (FC) or multi-layer perceptron (MLP) blocks, and attention mechanisms [1]–[6]. Despite architectural diversity, workload characterisation studies consistently show that multiply-accumulate (MAC) operations account for approximately 90% of total computation, while non-linear activation functions (NAFs) contribute an additional 2-5% [3], [5]. Efficient execution of these operations is therefore critical for deploying AI models on resource-constrained edge platforms [7], [8].

To address the stringent energy, area, and latency constraints of edge AI, prior work has explored several hardware optimisation techniques, including fixed-point quantization [3], [9], CORDIC-based arithmetic [5], logarithmic approximation [4], [10], and truncation-based MAC units [2], [11]. While these approaches achieve notable reductions in computational complexity and power consumption, they typically operate at a fixed approximation point. As a result, either irreversible accuracy degradation is incurred, or additional error-compensation mechanisms are required, which partially negate the energy benefits [2], [3]. Furthermore, these designs often lack the flexibility to dynamically adjust approximation depth based on layer sensitivity or application requirements [12], [13].

Sonu Kumar, Mohd Faisal Khan and Santosh Kumar Vishvakarma acknowledge the DST INSPIRE fellowship and MeitY/SMDP-C2S for ASIC tool support. Sonu Kumar is with the Centre for Advanced Electronics, IIT Indore. Mohd Faisal Khan, Mukul Lokhande, and Santosh Kumar Vishvakarma are with the NSDCS Research Group, Dept. of Electrical Engineering at IIT Indore. Corresponding author: Santosh K. Vishvakarma (skvishvakarma@iiti.ac.in).
Recent edge-oriented deep learning accelerators focus on energy efficiency, reduced memory traffic, and architectural flexibility by leveraging efficient dataflows, reconfigurable systolic arrays, and pipeline-aware designs. Complementary approaches exploit computing-in-memory, quantised and binary neural networks, and tightly integrated SoC platforms to alleviate bandwidth and storage bottlenecks [14]–[16]. In parallel, recent studies have highlighted a structural inefficiency in many deep learning accelerators: the disproportionate allocation of hardware resources to activation-function units. Although AFs constitute a small fraction of total operations, they are frequently implemented using dedicated hardware blocks that remain idle for a significant portion of execution. Prior work reports up to 84% idle cycles in NAF hardware for layer-reused architectures [11], while large-scale commercial accelerators such as the Google TPUv4 allocate nearly 20-25% of chip area to activation-related logic [3]. This imbalance results in substantial dark silicon, limiting overall energy efficiency and scalability.

Table I provides a comparative overview of state-of-the-art (SoTA) AI accelerator designs, highlighting key architectural choices, supported precision, scalability, and associated trade-offs. Existing CORDIC- and approximation-based designs predominantly employ pipelined or fixed-stage implementations, which constrain runtime flexibility and enforce static accuracy-latency operating points [17]–[21]. In contrast, recent layer-reused or time-multiplexed architectures improve utilization but still lack fine-grained control over numerical accuracy at the MAC level. Consequently, prior designs offer little flexibility with respect to differing layer characteristics and operating conditions, and often require predefined datapaths and static precision settings [22], [23].
This rigidity leads to suboptimal trade-offs among accuracy, latency, and resource utilisation when deployed across heterogeneous deep learning workloads [24], particularly for models that demand mixed-precision computation and flexible activation support.

This paper addresses these limitations by proposing a runtime-adaptive, CORDIC-accelerated vector engine that explicitly exposes MAC precision, approximation depth, and execution latency as configurable architectural parameters. Unlike prior fixed-approximation designs, the proposed approach enables seamless switching between approximate and accurate execution modes without structural modification or auxiliary correction logic. In addition, a time-multiplexed multi-activation-function (multi-NAF) block is integrated to maximise hardware utilisation and significantly reduce dark silicon.

TABLE I
SOTA DESIGN APPROACHES AND COMPARISON OF RESPECTIVE DESIGN FEATURES IN AI WORKLOADS

Design | Compute | Arch. Type | Scalability | Precision | Accuracy loss | Design Overhead | NAF-Supported | Applications
Baseline | Pipe-CORDIC | Fully Parallel | No | FxP-8 | High | Area, St. Power | ReLU | ANN
ICIIS'25 [11] | Pipe-CORDIC | Layer-Reused | Yes | FxP-8 | High | Area | ReLU | ANN
ICIIS'25 [11] | PWL | NAF-Reused | No | FxP-8 | High | Area, St. Power | Sigmoid/Tanh | ANN
IEEE Access'24 [2] | Pipe-CORDIC | NAF-Reused | No | FxP-8 | High | Area, Power | NA | DNN
TVLSI'25 [3] | Pipe-CORDIC | Systolic Array | - | FxP-4/8/16/32 | Medium | Energy | Sigmoid, Softmax, Tanh, ReLU | DNN, Transformers
ISCAS'25 [4] | Logarithmic Approx. | Time-multiplexed Reconfigurable Array | Yes | Posit-8/16/32 | Low | Area, Complexity | Sigmoid, Tanh, Softmax | DNN
ISVLSI'25 [5] | Iterative CORDIC | Layer-Reused | No | FxP-8 | Medium | Latency | Sigmoid/Tanh | DNN
Proposed | Iterative CORDIC | Vector Engine | Yes | FxP-4/8/16 | Variable (Low) | Application-optimized, High-Throughput | SoftMax, GELU, Sigmoid, Tanh, Swish, ReLU, and SELU | DNN, Transformers (MLP)
The key contributions of this work are summarised as follows:

1) A low-resource, iterative CORDIC-based MAC unit with runtime-configurable accuracy-latency trade-offs, supporting both approximate and accurate execution modes.
2) A scalable vector-engine architecture that amortises iterative MAC latency across parallel lanes, enabling up to 4× higher throughput without excessive area overhead.
3) A high-utilisation, time-multiplexed multi-NAF block supporting a wide range of nonlinear functions with minimal additional hardware cost.
4) A comprehensive evaluation spanning software emulation, FPGA prototyping, and 28 nm ASIC synthesis, demonstrating system-level improvements on CNN and transformer-style workloads.

The remainder of this paper is organised as follows. Section II presents the proposed vector-engine architecture. Section III details the circuit-level implementation of the iterative MAC and multi-NAF blocks. Section IV describes the experimental methodology and evaluation framework. Section V discusses FPGA, ASIC, and system-level results. Section VI concludes the paper and outlines future research directions.

II. ARCHITECTURE OVERVIEW

Fig. 1 illustrates the top-level architecture of the proposed resource-efficient deep learning accelerator. The system is organised around a runtime-adaptive vector engine that serves as the primary compute core, supported by a lightweight control engine, a data prefetcher, a time-multiplexed multi-activation-function (multi-AF) block, pooling and normalisation units, and off-chip memory interfaces. The architecture is designed to maximise compute utilisation while enabling flexible trade-offs between accuracy and latency across diverse deep learning workloads.
Fig. 1. Block-level architecture of the proposed CORDIC-based vector engine integrated within a resource-efficient deep learning accelerator.

A. Vector Engine Organization

The vector engine is composed of N homogeneous processing elements (PEs), where N is scalable from 64 to 256 depending on performance and area constraints. Each PE integrates a precision-adjustable, accuracy-configurable iterative CORDIC-based MAC unit, local register storage, and interface logic for data and control synchronisation. Unlike fully parallel systolic arrays, the proposed vector engine adopts a lane-based execution model, amortising the latency of iterative computations across multiple PEs to enable high throughput without requiring deeply pipelined or resource-intensive datapaths.

Two dedicated kernel memory banks, each organised as (n-bit × 32) entries, are employed to store input activations and weights, respectively. This dual-bank organisation enables continuous data feeding to the PEs while overlapping memory access with computation. The memory interface is designed to support flexible precision modes (4/8/16-bit) and runtime reconfiguration without stalling the compute pipeline.

B. Runtime Accuracy and Precision Adaptation

A key architectural feature of the proposed vector engine is its ability to dynamically adapt computation accuracy and latency at runtime.
This is achieved by controlling the number of active CORDIC iterations within each MAC unit on a per-layer basis. Layers that are less sensitive to numerical error can be executed in an approximate mode with fewer iterations, thereby reducing latency and energy consumption. Conversely, accuracy-critical layers operate in a more accurate mode with additional iterations, incurring modest latency overhead while preserving numerical fidelity.

Fig. 2. Control engine for the efficient reuse of data and control signals in a layer-multiplexed architecture, reusing the same DNN hardware across layers.

This runtime configurability is managed by the control engine through a set of configuration registers that specify the precision mode, iteration count, and execution sequencing for each layer. Unlike prior approximation-based accelerators that rely on fixed hardware stages or static design-time tuning [3], [4], the proposed architecture enables fine-grained, layer-wise adaptation without requiring structural modification or auxiliary correction logic.
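A minimal software sketch of this per-layer mode selection is given below. The LayerConfig container and the register layout are assumptions for illustration only; the cycle counts are the operating points reported for the proposed MAC in Section III.

```python
from dataclasses import dataclass

# Cycle counts per (mode, precision) follow the operating points reported
# for the proposed MAC in Section III; names here are hypothetical.
CYCLES = {
    ("approx", 8): 4, ("approx", 16): 7,
    ("accurate", 4): 4, ("accurate", 8): 5, ("accurate", 16): 9,
}

@dataclass
class LayerConfig:
    """Emulates one layer's configuration registers (precision, mode)."""
    precision: int  # 4, 8, or 16 (fixed-point width)
    mode: str       # "approx" or "accurate"

    @property
    def mac_cycles(self) -> int:
        return CYCLES[(self.mode, self.precision)]

# Error-sensitive layers run accurate; tolerant middle layers run approximate.
schedule = [LayerConfig(8, "accurate"),
            LayerConfig(8, "approx"),
            LayerConfig(16, "accurate")]
total = sum(cfg.mac_cycles for cfg in schedule)
print(total)  # 5 + 4 + 9 = 18 MAC cycles across the three layers
```

Swapping a layer between modes changes only its register contents, not the datapath, which is the point of the runtime-configurable design.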
C. Control Engine and Data Flow

The control engine orchestrates vector-engine execution by managing instruction sequencing, memory addressing, and synchronisation across PEs. It comprises configuration registers for precision and iteration control, status registers for pipeline coordination, and a finite-state machine with datapath (FSMD) that governs layer execution. This lightweight control logic enables efficient coordination between compute, memory access, and activation processing while minimising control overhead.

A layer-multiplexed DNN requires a dedicated control unit composed of five functional sub-blocks, as illustrated in Fig. 2. Each sub-block either monitors system status or produces control signals derived from that status. The control unit processes all status signals: LayerDone, DNNDone, CurrentLayer, ComputeInit, Index, and ComputeDone. The neuron processing elements produce two of these status signals, Index and ComputeDone, while the control module internally generates the remainder. The Index signal indicates which input to send next to the MAC unit by counting the number of MAC operations completed in the active layer. The ComputeDone signal indicates that neuron computation for the current layer has completed and that valid output data is available. When aggregated across all neuron units, this signal is referred to as ComputeDoneArray. Together, these status signals manage the datapath control necessary for the layer-multiplexed architecture to function properly.

The control module dynamically configures neuron activation and signal routing for layer-reused DNN computation. LayerDone and CurrentLayer track progress, ComputeInit selectively activates neurons per layer, and index-controlled input and output routes multiplex intermediate data. This reduces dynamic power by enabling idle-unit deactivation and ensuring correct sequencing.
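The sequencing above can be sketched as a behavioural model. The signal names mirror the text (LayerDone, CurrentLayer, Index, ComputeDone, DNNDone); the neuron datapath itself is abstracted away, so this is an illustration of the control flow, not the RTL.

```python
def run_layer_multiplexed(num_layers, macs_per_layer):
    """Behavioural model of the control module sequencing a layer-reused DNN."""
    current_layer = 0          # CurrentLayer register
    dnn_done = False
    trace = []
    while not dnn_done:
        compute_done = False   # ComputeInit would activate this layer's neurons
        index = 0
        while not compute_done:
            index += 1         # Index counts completed MACs in the active layer
            compute_done = index == macs_per_layer[current_layer]
        trace.append(("LayerDone", current_layer))
        current_layer += 1
        dnn_done = current_layer == num_layers   # DNNDone asserted at the end
    return trace

print(run_layer_multiplexed(3, [4, 2, 1]))
# [('LayerDone', 0), ('LayerDone', 1), ('LayerDone', 2)]
```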
The weight memory is partitioned into 64 segments, each associated with a specific neuron processing unit, as depicted in Fig. 3(a). A key aspect of the parameter-loading mechanism is that the memory write sequence is the inverse of the read sequence. This organisation enables efficient access to weight and bias values with reduced interconnect delay. Consequently, parameters must be loaded using a Last-In-First-Out (LIFO) ordering for both weights and biases, as well as for input data.

Data transfer follows a synchronous interface using a valid signal, denoted as load_param_weight, to indicate when weight values are written. The accelerator asserts a data-ready signal, DNNDone, once valid outputs are produced. Input values are accepted on each clock cycle when the valid signal is active. Upon completion, outputs from all ten neurons are generated simultaneously when DNNDone is asserted and are subsequently captured by the host software. The overall software-controlled execution sequence of the accelerator is illustrated in Fig. 3(b).

Data flow within the accelerator follows a streaming execution model. Input feature maps are fetched from off-chip memory through the data prefetcher and buffered locally before being broadcast to the vector engine. Partial sums generated by the PEs are forwarded either to subsequent MAC stages or to the activation-function pipeline, depending on the layer configuration. This streaming approach minimises intermediate storage requirements and reduces memory bandwidth pressure, both of which are critical for energy-efficient edge deployment.

D. Memory Mapping

The hardware architecture of fully connected neural networks must support scalable, adaptable configurations, as the number of layers and neurons per layer varies across applications.
As a result, the memory organisation for weights and biases must be adaptable and avoid allocating unused address locations, a limitation commonly encountered in fixed-addressing schemes. This is accomplished by using BRAM for on-chip parameter memory and FIFO buffers for temporary storage.

Fig. 3. DNN accelerator data flow: (a) data read/write order and (b) operation flowchart for initialising and loading data.

An efficient addressing strategy for neuron-wise weight and bias access is illustrated in Fig. 4 and is defined using the total number of layers L, the number of neurons in the l-th layer N(l), and the number of inputs to that layer J(l). Since the number of neurons in one layer determines the number of inputs to the subsequent layer, these parameters satisfy

J(l + 1) = N(l)    (1)

Each parameter address consists of a layer identifier, a select bit indicating whether the accessed parameter is a weight or a bias, and a memory address field, as shown in Fig. 4(a). The select bit distinguishes between weight and bias access, while the most significant bits encode the layer index. The remaining bits represent either the neuron index (for a bias) or the combined neuron and input index (for a weight), as depicted in Fig. 4(b).
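The field packing just described can be emulated in a few lines. The helper names below are hypothetical, and the example network is illustrative; the widths follow the ceiling-log2 sizing of eqs. (2)-(5).

```python
from math import ceil, log2

def clog2(x):
    """Ceiling log2 with a 1-bit floor, as used for index field widths."""
    return max(1, ceil(log2(x)))

def addr_widths(N, J):
    """N[l] neurons and J[l] inputs per layer -> (layer bits, R_addr bits)."""
    r_addr = max(clog2(n) + clog2(j) for n, j in zip(N, J))  # eq. (4)
    return clog2(len(N)), r_addr

def pack(layer, is_bias, neuron, inp, N, J):
    """Pack {layer | select | weight/bias RAM address} per the Fig. 4 layout."""
    _, r_addr = addr_widths(N, J)
    j_bits = clog2(J[layer])
    ram = neuron if is_bias else (neuron << j_bits) | inp  # bias: neuron only
    return (layer << (1 + r_addr)) | (int(is_bias) << r_addr) | ram

# A 3-layer MLP with 8 inputs -> 4 -> 2 -> 2 neurons; J(l+1) = N(l) by eq. (1)
N, J = [4, 2, 2], [8, 4, 2]
print(addr_widths(N, J))           # (2, 5): 2 layer bits, 5 RAM-address bits
print(pack(1, False, 1, 3, N, J))  # weight address: layer 1, neuron 1, input 3
```

Because every layer shares the widest R_addr, addresses stay uniform across layers, which is the conflict-free property claimed for the scheme.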
The required address length for weights and biases in layer l is therefore given by

R_addr(l) = ⌈log2 N(l)⌉ + ⌈log2 J(l)⌉    (2)

and the overall address width of that layer becomes

Addr(l) = ⌈log2 L⌉ + 1 + R_addr(l)    (3)

A fixed address width is chosen based on the maximum required across all layers, specified as

R_addr = max_{l=1,2,...,L} {⌈log2 N(l)⌉ + ⌈log2 J(l)⌉}    (4)

and the length of the final uniform address is

Addr = ⌈log2 L⌉ + 1 + R_addr    (5)

In addition to providing effective, conflict-free access to weight and bias memories for scaled DNN implementations, this technique enables consistent addressing.

Fig. 4. Memory mapping scheme for the address bits required to address weights and biases of individual neurons: (a) overall format with layer bits, select bit, and weight/bias RAM address; (b) weight and bias RAM address sub-fields.

E. Time-Multiplexed Multi-Activation-Function Integration

To address the underutilization of activation-function hardware observed in prior accelerators [3], [11], the proposed architecture integrates a time-multiplexed multi-AF block shared across all PEs. The multi-AF block supports a broad set of nonlinear functions, including Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU, using common CORDIC resources and mode-specific datapaths.

By multiplexing activation computation in time rather than dedicating separate hardware blocks, the architecture achieves high utilisation factors while incurring minimal area and power overhead.
Activation execution is overlapped with vector-engine computation wherever possible, ensuring that the multi-AF block does not become a performance bottleneck despite being shared.

F. Scalability and System Integration

The proposed vector engine is designed for seamless scalability across edge and embedded platforms. By adjusting the number of PEs, memory bank sizes, and iteration depth, the architecture can be tailored to a wide range of performance and energy targets. Furthermore, the modular organisation of the vector engine, control logic, and peripheral units facilitates automated generation through a synthesizable hardware framework, enabling rapid design-space exploration and deployment.

Overall, the architecture combines runtime adaptability, high hardware utilisation, and scalable performance, forming a unified compute substrate that bridges the gap between fixed-approximation accelerators and fully accurate but resource-intensive designs.

III. CIRCUIT IMPLEMENTATION

This section details the circuit-level design of the proposed iterative CORDIC-based MAC unit and the time-multiplexed multi-activation-function (multi-AF) block. The design objective is to achieve a balance between hardware efficiency, numerical accuracy, and runtime configurability while maintaining compatibility with standard deep learning workloads.

Fig. 5. Iterative low-latency CORDIC-based MAC architecture with runtime-configurable iteration depth.

A. Runtime-Adaptive Iterative CORDIC-Based MAC

The proposed MAC unit is based on the unified CORDIC formulation originally introduced by Walther, which supports circular, linear, and hyperbolic computations using only shift, add/subtract, and multiplexing operations.
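For context, linear-mode CORDIC realises multiplication with shift-add iterations alone. The floating-point emulation below is an illustrative sketch (not the fixed-point RTL datapath) showing how the iteration count sets the approximation error:

```python
def cordic_linear_mul(x, z, iterations):
    """Linear-mode CORDIC: approximate x*z (|z| < 1) with shift-add steps."""
    y = 0.0
    for i in range(1, iterations + 1):
        d = 1 if z >= 0 else -1     # sign decision replaces a multiplier
        y += d * x * 2.0 ** -i      # realised as a shift and add in hardware
        z -= d * 2.0 ** -i          # drive the residual z toward zero
    return y

exact = 0.75 * 0.6
err4 = abs(cordic_linear_mul(0.75, 0.6, 4) - exact)   # approximate mode
err8 = abs(cordic_linear_mul(0.75, 0.6, 8) - exact)   # more accurate mode
assert err8 < err4   # more iterations -> smaller error, at higher latency
```

Truncating the loop early is exactly the approximate-mode operating point; extending it is the accurate mode, with no change to the datapath.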
Recent works such as ReCON [5] and Flex-PE [3] have demonstrated the applicability of CORDIC arithmetic to deep learning operations, including MAC, Sigmoid, Tanh, and SoftMax [25]. However, these designs primarily employ pipelined or fixed-stage CORDIC architectures, which impose a static trade-off between accuracy and latency. In contrast, the proposed MAC adopts an iterative CORDIC structure, as illustrated in Fig. 5, where the number of active iterations directly determines the approximation error and execution latency. This enables runtime switching between approximate and accurate execution modes without altering the hardware structure or introducing auxiliary correction logic.

The MAC unit supports both 8-bit and 16-bit fixed-point precision modes. In approximate mode, the MAC completes 8-bit and 16-bit operations in 4 and 7 clock cycles, respectively, incurring approximately 2% accuracy degradation at the application level. In accurate mode, additional iterations are enabled, completing 8-bit and 16-bit operations in 5 and 9 cycles with less than 0.5% accuracy loss. In addition, a 4-bit mode is supported, completing accurate operation in 4 cycles. These operating points are selected based on an accuracy-sensitivity heuristic [3], enabling layer-wise configuration based on numerical criticality.

From a circuit perspective, the iterative MAC minimises area and static power by reusing a single CORDIC datapath across iterations, rather than replicating pipeline stages. This design choice reduces the number of adders, shifters, and registers compared to pipelined alternatives, while still enabling high throughput at the vector-engine level through parallelism across multiple PEs.

Fig. 6. Hardware AAD module for two inputs.

Fig. 7. Hardware AAD module architecture based on a sliding window.
B. Latency Hiding Through Vector-Level Parallelism

Although the iterative MAC incurs a multi-cycle latency per operation, this overhead is effectively hidden at the vector-engine level. Since multiple PEs operate concurrently on independent data elements, the increased per-MAC latency does not limit overall throughput for sufficiently large vector widths. This execution model distinguishes the proposed architecture from fully parallel or systolic-array designs, which require deeply pipelined datapaths to sustain throughput and therefore incur higher area and power overheads.

The ability to trade per-MAC latency for reduced hardware complexity is particularly advantageous for edge AI accelerators, where area and energy efficiency are often more critical than single-operation latency.

C. Absolute Average Deviation (AAD) Pooling Block

In addition to the MAC and activation units, the vector engine integrates peripheral components such as an Absolute Average Deviation (AAD) pooling unit [26] and a normalisation block. The AAD pooling unit is selected for its favourable accuracy characteristics under CORDIC-based computation, demonstrating a 0.5-1% accuracy improvement over conventional pooling methods with lower computational complexity [3], [26].

The hardware implementation of the two-input AAD unit, shown in Fig. 6, consists of three primary steps: subtraction, absolute-value computation, and division. Initially, the two input values are fed into a subtractor to determine their difference. The subtraction result is then processed through two parallel paths. A comparator receives the result from one path and compares it to zero to identify the sign of the difference, returning either +1 or -1.

Fig. 8. Hardware AAD module architecture based on parallel computation.

Fig. 9. Multiple feature computations in parallel in hardware.

To match its timing with the comparator output, the other
channel passes the subtraction result through a buffer. These two outputs are multiplied, ensuring the final result is always non-negative regardless of the input order, effectively yielding the absolute deviation. This absolute deviation is then divided by two to obtain the final AAD output for the two-input case.

For multi-input scenarios, multiple subtraction-absolute (SA) modules operate in parallel, each computing the absolute deviation between pairs of input values, as shown in Fig. 8. The outputs of these SA modules are summed using an adder network, and the accumulated result is divided by the normalisation factor M = N(N-1) to produce the overall AAD value, computed in parallel as shown in Fig. 9.

To reduce hardware complexity, a sliding-window approach is adopted, in which a window moves across the input data according to the defined stride and pooling size. Within each window, deviations between data points are computed, accumulated in registers, and normalised to produce the final AAD result efficiently, as illustrated in Fig. 7.

D. Time-Multiplexed Multi-Activation-Function Block

Activation functions represent a small fraction of total operations but often consume disproportionate hardware resources. To address this inefficiency, the proposed design integrates a time-multiplexed multi-AF block that reuses CORDIC hardware across multiple nonlinear functions, as shown in Fig. 10. The multi-AF block supports Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU, enabling compatibility with both CNN and transformer-style workloads.

The multi-AF block operates in two primary modes: a hyperbolic rotation (HR) mode for functions that require sinh and cosh computations, and a linear-division (LV) mode for functions that involve normalisation or exponential scaling.
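The HR mode can be illustrated with a small behavioural model: hyperbolic CORDIC rotations produce scaled sinh and cosh using shifts and adds, and Tanh follows from their ratio (the final divide standing in for the LV mode). This is a sketch under the standard convergence convention (iterations 4 and 13 repeated), not the RTL datapath.

```python
import math

def cordic_tanh(z, iterations=16):
    """Hyperbolic-mode CORDIC sketch: Tanh(z) for |z| < ~1.1.

    x, y converge to K*cosh(z), K*sinh(z); the gain K cancels in the ratio,
    which the LV (linear-division) mode would compute in hardware.
    """
    x, y = 1.0, 0.0
    i = 1
    while i <= iterations:
        # Iterations 4 and 13 are repeated, the standard convergence fix
        # for the hyperbolic CORDIC angle sequence.
        for _ in range(2 if i in (4, 13) else 1):
            d = 1 if z >= 0 else -1
            x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * math.atanh(2.0 ** -i)
        i += 1
    return y / x

assert abs(cordic_tanh(0.5) - math.tanh(0.5)) < 1e-4
```

Because the same shift-add stages serve sinh, cosh, and the derived Sigmoid/Tanh, one datapath can be time-shared across all PEs, which is the utilisation argument made above.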
By selectively enabling only the required datapaths for a given function, the design achieves utilisation factors of up to 86% in HR mode and approximately 72% in LV mode. Additional auxiliary logic includes a lightweight switching multiplexer for Sigmoid and Tanh selection, a ReLU bypass buffer, a FIFO for intermediate SoftMax storage, and two small multipliers [27] to support GELU computation. Collectively, these components incur less than 4% additional area and power overhead while significantly improving overall hardware utilisation.

E. Peripheral Support and Integration

The proposed vector engine is integrated as a complete edge-AI processing subsystem comprising a lightweight control engine, on-chip memory banks, input pre-processing logic, and a host communication interface. The control engine uses configuration registers, status flags, and a finite-state machine to control memory addressing, instruction sequencing, and synchronisation, coordinating execution across the MAC array, activation block, and pooling units. Runtime control signals such as ComputeInit, LayerDone, and ComputeDone facilitate layer-adaptive execution, enabling the reuse of hardware resources across multiple network layers while guaranteeing proper data ordering. A data prefetcher retrieves input feature maps from external memory, buffers them locally, and then broadcasts them to the processing elements. Index-controlled multiplexing transports intermediate outputs to later layers. To facilitate continuous data feeding and overlap memory access with computation, parameter storage is managed via partitioned kernel memory banks that independently store activations and weights. The memory interface supports synchronous valid-data loading with a data-ready completion signal, allowing the host processor [28] to capture final outputs without stalling the compute pipeline.
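A minimal sketch of this valid-gated loading interface is given below. The signal names load_param_weight and the ready flag mirror the text; the single-deque model, its class name, and the completion condition are purely illustrative stand-ins for the real datapath.

```python
from collections import deque

class LoaderModel:
    """Behavioural stand-in for the synchronous parameter-loading interface."""
    def __init__(self, n_params):
        self.mem = deque()
        self.n_params = n_params

    def clock(self, load_param_weight=False, data=None):
        """One synchronous edge: data is accepted only while valid is high."""
        if load_param_weight and data is not None:
            # Write order is the inverse of read order (LIFO), so the
            # earliest-loaded parameter is read out last.
            self.mem.appendleft(data)

    @property
    def ready(self):
        # Stand-in for the data-ready completion flag polled by the host.
        return len(self.mem) == self.n_params

m = LoaderModel(3)
for w in (10, 11, 12):
    m.clock(load_param_weight=True, data=w)
assert m.ready and list(m.mem) == [12, 11, 10]
```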
All processing elements share a time-multiplexed multi-activation-function unit that uses common CORDIC resources to perform nonlinear operations. To prevent performance bottlenecks, its operation overlaps with MAC computation. In addition, integrated pooling and normalisation blocks process partial sums before output generation, reducing intermediate storage and external memory traffic. The modular organisation of control, memory, and peripheral compute stages enables scalable deployment across FPGA and ASIC platforms. It supports efficient system-level integration with embedded processors through a lightweight interface, thereby transforming the vector engine from a standalone compute core into a deployable edge-AI accelerator.

IV. EXPERIMENTAL METHODOLOGY

To ensure a rigorous, fair, and reproducible evaluation, the proposed vector engine is validated using a structured hardware-software co-design methodology spanning algorithmic emulation, RTL-level verification, FPGA prototyping, and ASIC synthesis. The evaluation framework is designed to isolate the impact of iterative CORDIC approximation while maintaining consistent experimental conditions across all comparisons.

Fig. 10. Time-multiplexed activation function with integrated data flow and control signals.

A. Software-Level Functional Emulation

At the algorithmic level, an iso-functional software model of the proposed vector engine is developed in Python 3.0. The model emulates the vector engine's custom iterative CORDIC arithmetic, precision-switching behaviour, and execution scheduling. Fixed-point arithmetic is implemented using the FxP-Math library, while neural network layers and quantised inference flows are modelled using QKeras 2.3. The software framework supports configurable precision modes (8-bit and 16-bit), variable CORDIC iteration depth, and layer-wise execution control.
All deep learning evaluations are performed against an FP32 reference baseline under identical network topology, dataset, and inference conditions. This approach ensures that observed accuracy differences are attributable solely to arithmetic approximation, not to changes in training or model structure. Accuracy is evaluated at both the layer and end-to-end model levels for representative CNN and transformer-style MLP workloads. The number of CORDIC iterations per layer is selected using an accuracy-sensitivity heuristic [3], which identifies numerically critical layers and assigns them to accurate execution modes, while non-critical layers operate in approximate mode.

B. RTL Modelling and Functional Verification

The proposed iterative CORDIC-based MAC unit and vector-engine datapath are modelled in synthesizable Verilog HDL. The architecture is parameterised to support different vector widths, precision modes, and iteration depths. A cycle-accurate RTL testbench is developed to validate functional correctness across all supported operating modes.

Functional verification is performed using Synopsys VCS, where RTL outputs are compared against the software emulation model for a wide range of randomised and application-driven test vectors. This cross-validation ensures bit-level consistency between the software model and the hardware implementation, accounting for fixed-point rounding, truncation, and iteration control.

C. FPGA Prototyping and Measurement

FPGA-based evaluation is conducted using the AMD Virtex-7 (VC707) platform. Synthesis, placement, and routing are performed using the AMD Vivado Design Suite with a target operating frequency of 100 MHz. All reported FPGA metrics, including lookup tables (LUTs), flip-flops (FFs), timing, and power consumption, are obtained from post-place-and-route reports to avoid optimistic estimation.
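The per-layer iteration assignment of Section IV-A can be pictured with a small sketch (the sensitivity scores, iteration counts, and top-k policy here are hypothetical placeholders, not the heuristic of [3]): layers ranked as most numerically sensitive receive accurate mode, everything else runs approximately.

```python
def assign_iterations(sensitivity, accurate_iters=16, approx_iters=8, top_k=2):
    """Assign accurate mode (more CORDIC iterations) to the top_k most
    error-sensitive layers; all other layers run in approximate mode."""
    ranked = sorted(range(len(sensitivity)), key=lambda i: -sensitivity[i])
    plan = [approx_iters] * len(sensitivity)
    for layer in ranked[:top_k]:
        plan[layer] = accurate_iters
    return plan

# e.g. the first and last layers of a CNN are often the most sensitive
plan = assign_iterations([0.9, 0.2, 0.1, 0.3, 0.8], top_k=2)
```

Because the plan is just a per-layer iteration count, it maps directly onto the runtime-configurable iteration depth of the MAC array without retraining the model.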
To enable fair comparison with state-of-the-art designs, either the reported post-implementation results from prior work are used directly, or the designs are re-synthesised under comparable constraints where feasible. Power measurements are extracted using vendor-supported power analysis tools with realistic switching activity derived from application traces.

D. ASIC Synthesis and Technology Assumptions

ASIC evaluation is carried out using Synopsys Design Compiler targeting a commercial 28 nm HPC+ CMOS technology at 0.9 V. Standard-cell libraries for worst-case timing corners are used to ensure conservative delay estimates. Area, timing, and power metrics are extracted from post-synthesis reports.

System-level performance metrics, including energy efficiency (TOPS/W) and compute density (TOPS/mm²), are derived using consistent workload assumptions across all designs. The same precision mode, vector width, and clock-frequency normalisation are applied when comparing against prior accelerators to ensure fairness.

E. System-Level Deployment and End-to-End Validation

To validate practical applicability, the proposed vector engine is deployed on a Pynq-Z2 platform with an ARM Cortex-A9 host processor. The accelerator is integrated through an AXI-based interface and evaluated on object detection and classification workloads.
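The derived metrics above follow standard definitions; a short sketch shows how they compose (the operating points in the usage lines are placeholders, except the PDP check, which uses the proposed MAC's reported 1.9 mW / 9.1 ns FPGA figures from Table II):

```python
def pdp_pj(power_mw, delay_ns):
    """Power-delay product: mW x ns = pJ."""
    return power_mw * delay_ns

def tops(ops_per_cycle, freq_ghz):
    """Peak throughput: ops/cycle x GHz = GOPS, divided by 1000 -> TOPS."""
    return ops_per_cycle * freq_ghz / 1000.0

def tops_per_watt(t, power_mw):
    """Energy efficiency: TOPS per watt."""
    return t / (power_mw / 1000.0)

def tops_per_mm2(t, area_mm2):
    """Compute density: TOPS per mm^2."""
    return t / area_mm2

# PDP of the proposed MAC stage on FPGA (Table II): 1.9 mW x 9.1 ns
mac_pdp = pdp_pj(1.9, 9.1)          # 17.29 pJ
# hypothetical operating point: 512 ops/cycle at 1 GHz, 256 mW, 1 mm^2
t = tops(512, 1.0)                  # 0.512 TOPS
eff = tops_per_watt(t, 256)         # 2.0 TOPS/W
```

Normalising every compared design through the same four functions is what the text means by "consistent workload assumptions": only the operating point changes, never the metric definition.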
End-to-end latency and power consumption are measured at the application level, capturing the combined effects of computation, data movement, and control overhead. This multi-level evaluation methodology ensures that the reported improvements are not limited to isolated circuit optimisations but translate into tangible system-level benefits for real-world edge AI deployments.

TABLE II
COMPARATIVE PERFORMANCE METRICS FOR DIFFERENT SOTA CORDIC-BASED MAC UNITS
(Columns correspond to MAC variants reported in TCAS-II'24 [29], ISCAS'25 [4], ICIIS'25 [11], TVLSI'25 [30], TCAD'22 [31], TVLSI'25 [3], and this work. NR = not reported.)

Variant             | FPGA (VC707, 100 MHz): LUTs / FFs / Delay (ns) / Power (mW) / PDP (pJ) | ASIC (28 nm, 0.9 V): Area (µm²) / Delay (ns) / Power (mW) / PDP (pJ)
FP32                | 8065 / 1072 / 5.56 / 378 / 2102   | 10000 / 679 / 15.86 / 10768.94
FP32                | 8054 / 1718 / 4.6 / 296 / 1361.6  | 13000 / 700 / 29.3 / 20510
BF16                | 3670 / 324 / 0.512 / 136 / 69.6   | 4340 / 295 / 6.89 / 4682
Posit-8             | 467 / 175 / 2.68 / 68 / 182       | 754 / 40.6 / 1.8 / 1189
Vedic               | 160 / 241 / 4.5 / 6.1 / 27.45     | 407 / 6.38 / 35 / 223.3
Wallace             | 106 / 113 / 2.6 / 3.3 / 8.58      | 296 / 5.62 / 37 / 207.94
Booth               | 84 / 59 / 3.1 / 3.1 / 9.6         | 271 / 5.3 / 12.8 / 67.84
Quant-MAC           | 72 / 56 / 5.4 / 4.2 / 22.68       | 175 / 3.58 / 89 / 318.62
CORDIC              | 56 / 72 / 1.52 / 8.3 / 12.6       | 264 / 2.36 / 24.5 / 57.82
MSDF-MAC            | 62 / 45 / 3.2 / 5.8 / 18.56       | 286 / 1.42 / 6.7 / 9.514
Acc-App-MAC         | 57 / NR / 3.51 / 6.9 / 24.2       | 259 / 2.6 / 12.4 / 32.24
CORDIC              | 45 / 37 / 4.5 / 2 / 9             | 8570 / 0.7 / 1.5 / 1.05
Iter-MAC (Proposed) | 24 / 22 / 9.1 / 1.9 / 17.29       | 108 / 2.98 / 6.3 / 18.774

TABLE III
COMPARATIVE PERFORMANCE METRICS FOR DIFFERENT SOTA CORDIC-BASED AF UNITS
(Columns correspond to AF variants reported in ISQED'24 [32], TCAS-II'20 [33], TVLSI'23 [34], TC'23 [35], TVLSI'25 [3], and this work. NR = not reported.)

Variant               | FPGA (VC707, 100 MHz): LUTs / FFs / Delay (ns) / Power (mW) / PDP (pJ) | ASIC (28 nm, 0.9 V): Area (µm²) / Delay (ns) / Power (mW) / PDP (pJ)
Softmax-FP32          | 3217 / NR / 92 / 115 / 10580    | 41536 / 6 / 75 / 450
Softmax-FP16          | 1137 / NR / 43 / 115 / 4945     | 17289 / 4 / 40 / 160
Softmax-BF16          | 1263 / NR / 45 / 77 / 3465      | 11301 / 3.3 / 25 / 82.5
Softmax-FxP8/16       | 2564 / 2794 / 2.3 / NR / -      | 18392 / 0.3 / 51.6 / 15.5
Softmax-16b           | 1215 / 1012 / 3.32 / 165 / 548  | 3819 / 1.6 / 1.6 / 2.56
Tanh-FP32             | 4298 / NR / 56 / 130 / 7280     | 5060 / 4 / 8.75 / 35
Tanh-FP16             | 1530 / NR / 34 / 124 / 4216     | 1180 / 3.3 / 3 / 9.9
Tanh-BF16             | 1513 / NR / 38 / 82 / 3116      | 843 / 3.4 / 2 / 6.8
Tanh/Sigmoid-16b      | 2395 / 1503 / 0.18 / 681 / 123  | 870523 / NR / 150 / -
Sigmoid-FP32          | 5101 / NR / 109 / 121 / 13189   | 2234 / 7.6 / 10 / 76
Sigmoid-FP16          | 1853 / NR / 60 / 118 / 7080     | 1855 / 4.4 / 4.8 / 21.12
Sigmoid-BF16          | 1856 / NR / 45 / 83 / 3735      | 1180 / 3.26 / 2.5 / 8.15
SSTp                  | 897 / 1231 / 11.8 / 59 / 696.2  | 49152 / 2.3 / 5.2 / 11.96
FxP-4/8/16 (Proposed) | 537 / 468 / 2.6 / 30 / 78       | 2138 / 2.6 / 60 / 156

V. RESULTS AND DISCUSSION

This section presents a comprehensive evaluation of the proposed CORDIC-based vector engine at the circuit, architectural, and system levels. Results are reported for FPGA prototyping, ASIC synthesis, and end-to-end deployment, and are compared against representative state-of-the-art (SoTA) AI accelerators to highlight performance, energy-efficiency, and scalability trade-offs.

A. MAC-Level Hardware Efficiency

Table II compares the proposed iterative CORDIC-based MAC unit with prior CORDIC, logarithmic, and approximation-based MAC designs across both FPGA and ASIC platforms. On the Virtex-7 (VC707) FPGA, the proposed MAC achieves significant reductions in lookup tables (LUTs) and flip-flops (FFs) compared to pipelined CORDIC and fixed-point MAC designs, while avoiding the use of DSP blocks. This reduction directly translates into lower static power consumption and improved placement flexibility.

At the ASIC level (28 nm, 0.9 V), the proposed MAC demonstrates up to 33% reduction in critical-path delay and approximately 21% lower power per MAC stage compared to comparable CORDIC-based designs. Although the iterative MAC incurs a multi-cycle execution latency, this overhead is amortised at the vector-engine level through parallel execution across multiple processing elements (PEs), as discussed in Section II.
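The amortisation argument can be made concrete with a toy throughput model (purely illustrative; the PE counts, frequency, and cycle counts below are assumptions, not measured values): per-MAC latency divides throughput, but PE-level parallelism multiplies it back, and halving the iteration depth in approximate mode directly doubles effective throughput.

```python
def effective_gops(n_pe, freq_mhz, cycles_per_mac, ops_per_mac=2):
    """Toy throughput model: each iterative MAC costs `cycles_per_mac`
    cycles, but n_pe elements run in parallel, so
    GOPS = n_pe * f / cycles_per_mac * ops_per_mac."""
    return n_pe * freq_mhz * 1e6 / cycles_per_mac * ops_per_mac / 1e9

# halving the iteration depth (approximate mode) doubles throughput
acc_mode = effective_gops(64, 100, cycles_per_mac=16)   # accurate mode
app_mode = effective_gops(64, 100, cycles_per_mac=8)    # approximate mode
```

Under this model, widening the vector (larger n_pe) compensates for multi-cycle MAC latency without touching the datapath, which is the scaling behaviour the 64-PE and 256-PE configurations exploit.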
Consequently, the proposed design achieves a favourable power-delay product (PDP) while maintaining runtime configurability between approximate and accurate execution modes.

B. Activation-Function Hardware Utilization

Table III summarises the FPGA and ASIC resource utilisation of the proposed time-multiplexed multi-activation-function (multi-AF) block relative to prior dedicated AF implementations. Existing designs often allocate separate hardware blocks for individual activation functions [36], leading to significant underutilization and dark silicon. In contrast, the proposed multi-AF block reuses CORDIC resources across multiple nonlinear functions, including Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU.

The results show that the proposed design achieves utilisation factors of 72-86% depending on the activation mode, while incurring less than 4% additional area and power overhead. On an FPGA, the multi-AF block reduces LUT and FF usage compared to SoTA designs supporting a similar function set. On ASIC, it demonstrates lower power consumption and competitive delay, confirming that time multiplexing effectively mitigates activation-function underutilization without compromising performance.

C. Accuracy Evaluation Under Iterative Approximation

Fig. 11 reports the accuracy of representative CNN and DNN models under different CORDIC iteration settings. The results confirm that numerical error is tightly coupled to

Fig. 11. [Plot: classification accuracy (%) versus signed fixed-point precision (4/8/16/32-bit), conventional vs. proposed, for Custom/MNIST, LeNet-5/MNIST, ResNet-18/MNIST, CaffeNet/MNIST, VGG-16/CIFAR-10, VGG-16/CIFAR-100, LeNet-5/CIFAR-10, ResNet-18/CIFAR-10, ResNet-18/CIFAR-100, and CaffeNet/ImageNet.]
Evaluation of DNN accuracy for different DNN models with the CORDIC methodology.

the number of active CORDIC iterations, validating the effectiveness of the proposed runtime accuracy-latency trade-off mechanism. When operating in approximate mode, the accelerator incurs approximately 2% accuracy degradation, while accurate mode limits accuracy loss to below 0.5%.

Importantly, by applying an accuracy-sensitivity heuristic to select the iteration depth per layer, most of the performance benefits of approximate execution are retained while preserving end-to-end model accuracy. This demonstrates that the proposed architecture enables fine-grained control over numerical fidelity without requiring retraining or auxiliary correction hardware.

D. FPGA System-Level Comparison

Table IV compares the proposed vector engine against SoTA FPGA-based AI accelerators using object detection workloads such as TinyYOLO-v3. The proposed design achieves competitive throughput while significantly reducing power consumption. Operating at 85.4 MHz on the VC707 platform, the vector engine delivers 6.43 GOPS/W at only 0.53 W, outperforming several prior designs in energy efficiency despite using no DSP blocks.

Compared to designs such as Flex-PE and LPRE, which rely on higher operating frequencies or specialised arithmetic units, the proposed architecture emphasises energy efficiency and scalability, making it particularly well-suited for edge deployments where power budgets are tightly constrained.

E. ASIC Scalability and Compute Density

ASIC-level scalability is evaluated using two configurations of the proposed vector engine: a 64-PE configuration and a 256-PE configuration, as reported in Table V. The 64-PE configuration serves as a computationally equivalent baseline, demonstrating comparable performance to prior designs at significantly lower area and power.
The 256-PE configuration represents a resource-equivalent comparison, achieving a peak compute density of 4.83 TOPS/mm² and an energy efficiency of 11.67 TOPS/W.

Fig. 12. Prototype visualisation, showing a Pynq-Z2 for edge AI inference on an Unmanned Aerial Vehicle (UAV).

These results highlight the benefits of the proposed iterative execution model, where increased vector width compensates for per-MAC latency while preserving energy efficiency. The architecture's scalability enables efficient deployment across a wide range of performance targets without redesigning the core datapath.

F. End-to-End Embedded Deployment

Fig. 13 presents a layer-wise execution-time and power breakdown for the VGG-16 model, illustrating the impact of runtime precision switching on system performance. End-to-end deployment on a Pynq-Z2 platform with an ARM Cortex-A9 host reports a total latency of 84.6 ms at 0.43 W for object detection and classification workloads, as shown in Fig. 12. This outperforms prior works: [3] (186.4 ms / 2.24 W on VC707), [40] (772 ms / 1.524 W on VC707), [4] (184 ms / 0.93 W on Pynq-Z2), and [6] (163.7 ms / 13.32 W on VCU102), as well as commercial embedded baselines such as the NVIDIA Jetson Nano (226 ms / 1.34 W) and Raspberry Pi (555 ms / 2.7 W). These improvements in both latency and power stem from a combination of iterative MAC efficiency, reduced memory bandwidth requirements, and dynamic precision adaptation.

Overall, the results demonstrate that the proposed architecture delivers consistent improvements across circuit-level efficiency, architectural scalability, and system-level performance, validating its suitability for energy-efficient edge AI acceleration.

VI.
CONCLUSION AND FUTURE WORK

This paper presented a runtime-adaptive, CORDIC-accelerated vector engine designed to address the efficiency and flexibility challenges of deep learning inference on resource-constrained edge platforms. By introducing a low-resource, iterative CORDIC-based MAC unit with runtime-configurable iteration depth, the proposed architecture enables fine-grained trade-offs between accuracy and latency without requiring auxiliary error-correction hardware or structural modifications. In contrast to prior fixed-approximation designs, this approach allows dynamic adaptation to layer-level numerical sensitivity while maintaining compatibility with standard deep learning workloads.

TABLE IV
ANALYSIS OF FPGA HARDWARE IMPLEMENTATION FOR OBJECT DETECTION (TINYYOLO-V3) WITH SOTA AI ACCELERATOR DESIGNS

Design          | Platform | Precision | k-LUTs | k-Regs/FFs | DSPs | Op. Freq (MHz) | Energy Eff. (GOPS/W) | Power (W)
Proposed        | VC707    | 4/8/16    | 26.7   | 15.9       | -    | 85.4           | 6.43                 | 0.53
TVLSI'25 [3]    | VC707    | 4/8/16/32 | 38.7   | 17.4       | 73   | 466            | 8.42                 | 2.24
TCAS-I'24 [37]  | ZU3EG    | 8         | 40.8   | 45.5       | 258  | 100            | 0.39                 | 2.2
TCAS-II'23 [38] | XCVU9P   | 8         | 132    | 39.5       | 96   | 150            | 6.36                 | 5.52
TVLSI'23 [39]   | ZCU102   | 8         | 117    | 74         | 132  | 300            | 4.2                  | 6.58
Access'24 [2]   | VC707    | 4/8       | 19.8   | 12.1       | 39   | 136            | 0.68                 | 1.81
ISCAS'25 [4]    | VCU129   | 8/16/32   | 17.5   | 14.8       | -    | 54.5           | 2.64                 | 1.6

TABLE V
ASIC PERFORMANCE COMPARISON WITH SOTA 8-BIT ACCELERATOR DESIGNS, CMOS 28 NM, 0.9 V, SF TECHNOLOGY

Design          | Network/Arch                              | Datatype   | Freq. (GHz) | Area (mm²) | Power (mW) | TOPS/W | TOPS/mm²
TCAS-II'24 [29] | Vector Engine (64 MACs)                   | FP8        | 1.47        | 0.896      | 1622       | 7.24   | 2.39
                |                                           |            | 1.29        | 1.18       | 1375       | 3.57   | 1.21
TCAS-I'22 [1]   | Vector Engine (64 MACs), 196-64-32-32-10  | INT-8      | 0.4         | 2.43       | 224.6      | 7.75   | 1.67
ISCAS'25 [4]    | TREA (64 MACs), 196-64-32-32-10           | Posit-8    | 1.25        | 6.73       | 230.4      | 7.55   | 0.16
TVLSI'25 [3]    | Systolic Array (8x8)                      | FxP8       | 0.44        | 1.85       | 523        | 4.3    | 2.76
ICIIS'25 [11]   | Layer-Reused (64 MACs), 196-64-32-32-10   | FxP8       | 0.25        | 3.78       | 1540       | 4.28   | 2.07
Proposed        | Vector Engine (64 PEs)                    | FxP-4/8/16 | 1.24        | 0.43       | 329        | 3.84   | 1.52
                | Vector Engine (256 PEs)                   |            | 0.96        | 1.42       | 1186       | 11.67  | 4.83
Access'24 [2]   | Shared Bank (256 MACs), 784-196-120-84-10 | FxP8       | 0.28        | 1.58       | 499.7      | 6.87   | 1.18

Fig. 13. VGG-16 layer-wise execution time and power consumption.

The proposed vector engine further integrates a time-multiplexed multi-activation-function (multi-AF) block that significantly improves hardware utilisation and mitigates dark silicon. By sharing CORDIC resources across a wide range of nonlinear functions, including Sigmoid, Tanh, SoftMax, GELU, Swish, ReLU, and SELU, the architecture achieves high utilisation factors with minimal additional area and power overhead. This balanced treatment of MAC and activation units addresses a persistent inefficiency in existing deep learning accelerators.

Comprehensive evaluation across software emulation, RTL verification, FPGA prototyping, and 28 nm ASIC synthesis demonstrates the effectiveness of the proposed design.
The iterative MAC unit achieves up to 33% reduction in critical-path delay and 21% power savings per stage, while scalable vector-engine configurations deliver a peak compute density of 4.83 TOPS/mm² and an energy efficiency of 11.67 TOPS/W. End-to-end deployment on embedded platforms further confirms that the architectural benefits translate into tangible improvements in latency and power consumption at the system level.

Future work will focus on extending the proposed framework toward a compiler-assisted design flow that automates layer-wise precision and iteration selection based on model sensitivity analysis. In addition, integrating full physical design and place-and-route (PnR) optimisation will enable more accurate post-layout evaluation and facilitate tape-out readiness. Further exploration of adaptive execution strategies for emerging transformer and multi-modal workloads, as well as tighter integration with RISC-V-based system-on-chip platforms, represents a promising direction for expanding the applicability of the proposed vector engine [28].

Overall, the proposed runtime-adaptive CORDIC-based vector engine provides a scalable and energy-efficient compute substrate that bridges the gap between approximate and accurate deep learning acceleration, making it well-suited for next-generation edge AI systems.

REFERENCES

[1] R. Pilipović, P. Bulić, and U. Lotrič, “A Two-Stage Operand Trimming Approximate Logarithmic Multiplier,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, pp. 2535–2545, June 2022.
[2] N. Ashar, G. Raut, V. Trivedi, S. K. Vishvakarma, and A. Kumar, “QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit,” IEEE Access, vol. 12, pp. 43600–43614, 2024.
[3] M. Lokhande, G. Raut, and S. K. Vishvakarma, “Flex-PE: Flexible and SIMD Multiprecision PE for AI Workloads,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.
33, pp. 1610–1623, June 2025.
[4] O. Kokane, M. Lokhande, G. Raut, A. Teman, and S. K. Vishvakarma, “LPRE: Logarithmic Posit-enabled Reconfigurable edge-AI Engine,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2025.
[5] O. Kokane, G. Raut, S. Ullah, M. Lokhande, A. Teman, A. Kumar, and S. K. Vishvakarma, “Retrospective: A CORDIC-Based Configurable Activation Function for NN Applications,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 1–6, 2025.
[6] R. Pilipović and P. Bulić, “On the design of logarithmic multiplier using radix-4 booth encoding,” IEEE Access, vol. 8, pp. 64578–64590, 2020.
[7] Y. Mao and Q. Liu, “ALN: Approximate layer normalization for transformer training on edge device,” IEEE Transactions on Computers, pp. 1–14, 2026.
[8] L. Nkenyereye, C. Rajkumar, B. G. Lee, and W.-Y. Chung, “Dynamic transfer learning switching approach using resource benchmark in edge intelligence,” IEEE Internet of Things Journal, vol. 12, no. 13, pp. 25148–25170, 2025.
[9] M. Lokhande, A. Jain, and S. K. Vishvakarma, “Precision-aware On-device Learning and Adaptive Runtime-cONfigurable AI acceleration,” in IEEE International Symposium on VLSI Design and Test, Aug. 2025.
[10] A. Jha, T. Dewangan, M. Lokhande, and S. K. Vishvakarma, “QForce-RL: Quantized FPGA-Optimized RL Compute Engine,” in IEEE International Symposium on VLSI Design and Test (VDAT), Aug. 2025.
[11] S. Kumar, K. Gupta, I. S. Dasanayake, M. Lokhande, and S. K. Vishvakarma, “HYDRA: Hybrid data multiplexing and run-time layer configurable DNN accelerator,” in Proceedings of the 19th International Conference on Industrial and Information Systems (ICIIS), (Sri Lanka), Dec. 2025.
[12] J.-S. Park, C. Park, S. Kwon, T. Jeon, Y. Kang, H. Lee, D. Lee, J. Kim, H.-S. Kim, Y. Lee, S. Park, M. Kim, S. Ha, J. Bang, J. Park, S. Lim, and I.
Kang, “A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit With a Unified Multi-Precision Datapath in 4-nm Flagship Mobile SoC,” IEEE Journal of Solid-State Circuits, vol. 58, no. 1, pp. 189–202, 2023.
[13] M. Verhelst, L. Benini, and N. Verma, “How to Keep Pushing ML Accelerator Performance? Know Your Rooflines!,” IEEE Journal of Solid-State Circuits, pp. 1–18, June 2025.
[14] A. Krishna, S. Rohit Nudurupati, D. G. Chandana, P. Dwivedi, A. van Schaik, M. Mehendale, and C. S. Thakur, “RAMAN: A Reconfigurable and Sparse tinyML Accelerator for Inference on Edge,” IEEE Internet of Things Journal, vol. 11, pp. 24831–24845, July 2024.
[15] T. Chaudhari, A. J, T. Dewangan, M. Lokhande, and S. K. Vishvakarma, “XR-NPE: High-throughput mixed-precision SIMD neural processing engine for extended reality perception workloads,” in 39th International Conference on VLSI Design and 25th International Conference on Embedded Systems (VLSID/ES), (Pune, India), Jan. 2026.
[16] M. Jaiswal, V. Sharma, A. Sharma, S. Saini, and R. Tomar, “Quantized CNN-based efficient hardware architecture for real-time hand gesture recognition,” Microelectronics Journal, vol. 151, p. 106345, July 2024.
[17] Z. Yuan, Q. Li, X. Lin, T.-M. Grønli, and S. Cherkaoui, “Edge AI for internet of robotic things,” IEEE Internet of Things Magazine, vol. 9, no. 1, pp. 4–6, 2026.
[18] P. Chen, T. Ouyang, K. Luo, W. Hong, and X. Chen, “CoDrone: Autonomous drone navigation assisted by edge and cloud foundation models,” IEEE Internet of Things Journal, vol. 13, no. 4, pp. 5593–5609, 2026.
[19] Y. Chen, H. Wang, Z. Li, E. Mou, T. Song, S. Xia, and Y. Pang, “A lightweight UAV object detector based on optimized YOLOv8 fused with an auxiliary learning branch for AIoT,” IEEE Internet of Things Journal, vol. 13, no. 4, pp. 5793–5808, 2026.
[20] M. Ali and K.
Nathwani, “Exploiting wavelet scattering transform and 1D-CNN for unmanned aerial vehicle detection,” IEEE Signal Processing Letters, vol. 31, pp. 1790–1794, 2024.
[21] K. Zhang, X. Liu, K. Wang, Q. Cai, X. Xie, J. Zhang, J. Chen, C. Zhang, X. Tong, Z. Gong, and K. Li, “EDCL: An efficient dynamic continual learning framework for IoT systems,” IEEE Transactions on Computers, pp. 1–16, 2026.
[22] Y.-C. Lin, M.-S. Huang, J.-B. Wang, W.-C. Chen, N.-S. Chang, C.-P. Lin, C.-S. Chen, T.-D. Chiueh, and C.-H. Yang, “A 16nm Fully Integrated SoC for Hardware-Aware Neural Architecture Search,” in 2025 IEEE European Solid-State Electronics Research Conference (ESSERC), pp. 397–400, 2025.
[23] J. Hu, Z. Zhang, Z. Li, Q. Meng, X. Shi, Q. Huang, H. Wang, and S. Chang, “Single-step hardware-aware neural network quantization with mixed precision,” IEEE Transactions on Computers, pp. 1–12, 2026.
[24] A. Sharma, L. H. Krishna, and B. Srinivasu, “High-Performance Gemmini-Based Matrix Multiplication Accelerator for Deep Learning Workloads,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 12, pp. 3276–3289, 2025.
[25] S. Mehra, G. Raut, R. D. Purkayastha, S. K. Vishvakarma, and A. Biasizzo, “An empirical evaluation of enhanced performance softmax function in deep learning,” IEEE Access, vol. 11, pp. 34912–34924, 2023.
[26] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Designing Novel AAD Pooling in Hardware for a Convolutional Neural Network Accelerator,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 30, pp. 303–314, Mar. 2022.
[27] W. Zhang, X. Geng, X. Hu, H. Wang, J. Jiang, Q. Wang, S. Liu, J. Han, and H. Jiang, “LUT-ALMs: Trading off accuracy and power for approximate logarithmic multipliers via LUT optimization,” IEEE Transactions on Computers, pp. 1–14, 2026.
[28] A. Kamaleldin, H. Aouinti, and D.
Göhringer, “ProCon-V: A programmable tightly coupled convolution accelerator based on RISC-V custom instructions for edge devices,” IEEE Transactions on Computers, pp. 1–15, 2026.
[29] B. Li, K. Li, J. Zhou, Y. Ren, W. Mao, H. Yu, and N. Wong, “A Reconfigurable Processing Element for Multiple-Precision Floating/Fixed-Point HPC,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 71, pp. 1401–1405, Mar. 2024.
[30] S. M. Cherati, M. Barzegar, and L. Sousa, “MSDF-Based MAC for Energy-Efficient Neural Networks,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., pp. 1–12, 2025.
[31] S. Ullah, S. Rehman, M. Shafique, and A. Kumar, “High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, pp. 211–224, Feb. 2022.
[32] M. Basavaraju, V. Rayapati, and M. Rao, “Exploring Hardware Activation Function Design: CORDIC Architecture in Diverse Floating Formats,” in 25th International Symposium on Quality Electronic Design (ISQED), pp. 1–8, 2024.
[33] D. Zhu, S. Lu, M. Wang, J. Lin, and Z. Wang, “Efficient Precision-Adjustable Architecture for Softmax Function in DL,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, pp. 3382–3386, Dec. 2020.
[34] K. Chen, Y. Gao, H. Waris, W. Liu, and F. Lombardi, “Approximate Softmax Functions for Energy-Efficient DNNs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, pp. 4–16, Jan. 2023.
[35] N. A. Mohamed and J. R. Cavallaro, “A Unified Parallel CORDIC-Based Hardware Architecture for LSTM Network Acceleration,” IEEE Transactions on Computers, vol. 72, pp. 2752–2766, Oct. 2023.
[36] J. Kim, K. Choi, and I.-C. Park, “Hardware-efficient unified approximation for implementing diverse smooth activation functions,” IEEE Transactions on Computers, pp. 1–8, 2026.
[37] B. Wu, T. Yu, K. Chen, and W.
Liu, “Edge-Side Fine-Grained Sparse CNN Accelerator With Efficient Dynamic Pruning Scheme,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 71, pp. 1285–1298, Mar. 2024.
[38] S. Ki, J. Park, and H. Kim, “Dedicated FPGA Implementation of the Gaussian TinyYOLOv3 Accelerator,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 70, pp. 3882–3886, Oct. 2023.
[39] W. Lee, K. Kim, W. Ahn, J. Kim, and D. Jeon, “A Real-Time Object Detection Processor With XNOR-based Variable-Precision Computing Unit,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, pp. 749–761, June 2023.
[40] G. Raut, S. Karkun, and S. K. Vishvakarma, “An Empirical Approach to Enhance Performance for Scalable CORDIC-Based DNNs,” ACM Trans. Reconfigurable Technol. Syst., vol. 16, June 2023.
