Title: Hardware-accelerated graph neural networks: an alternative approach for neuromorphic event-based audio classification and keyword spotting on SoC FPGA
ArXiv ID: 2602.16442
Date: 2026-02-18
Authors: (not specified in the available source)
📝 Abstract
As the volume of data recorded by embedded edge sensors increases, particularly from neuromorphic devices producing discrete event streams, there is a growing need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing. We present an FPGA implementation of event-graph neural networks for audio processing. We utilise an artificial cochlea that converts time-series signals into sparse event data, reducing memory and computation costs. Our architecture was implemented on a SoC FPGA and evaluated on two open-source datasets. For the classification task, our baseline floating-point model achieves 92.7% accuracy on the SHD dataset, only 2.4% below the state of the art, while requiring over 10x and 67x fewer parameters. On SSC, our models achieve 66.9-71.0% accuracy. Compared to FPGA-based spiking neural networks, our quantised model reaches 92.3% accuracy, outperforming them by up to 19.3% while reducing resource usage and latency. For SSC, we report the first hardware-accelerated evaluation. We further demonstrate the first end-to-end FPGA implementation of event-audio keyword spotting, combining graph convolutional layers with recurrent sequence modelling. The system achieves up to 95% word-end detection accuracy, with only 10.53 microsecond latency and 1.18 W power consumption, establishing a strong benchmark for energy-efficient event-driven KWS.
📄 Full Content
Fig. 1. In this work, we propose an event-based keyword spotting system in which speech signals are converted into asynchronous events by an artificial cochlea and represented as spectro-temporal event-graphs. These are processed by a GCN-RNN model deployed on a SoC FPGA, enabling low-power, low-latency, and efficient keyword spotting.
As the Internet of Things expands, distributed sensors are collecting ever-increasing quantities of data. This has driven the need for accurate and efficient computing systems capable of processing this data locally, for example, to make predictions [53]. The energy consumption and latency of these systems are of particular importance, as data is increasingly processed directly on edge devices, which operate under strict energy constraints (e.g. the battery of a smartwatch).
In most cases, the raw data produced by the sensors is time-series: continuous signals following the evolution of environmental variables [37]. For example, sensors that monitor the vibration of mechanical parts have been used to predict failures in gearboxes [57], and implantable cardioverter-defibrillators monitor the state of a patient's heart in order to apply an electric shock in the event of dangerous fibrillation [21].
It is becoming increasingly common to use artificial intelligence (AI) methods to process this time-series data.
However, conventional hardware such as GPUs (graphics processing units) consumes too much power, and microprocessors may struggle to meet the latency requirements of many applications. FPGAs (Field Programmable Gate Arrays) and custom integrated circuits offer a means of implementing architectures optimised for specific AI methods that are capable of meeting latency requirements while minimising power consumption [2,3,25]. Furthermore, with conventional sampled processing, the model must be applied at periodic intervals, and important temporal information present in the signal at timescales finer than the sampling period cannot be leveraged.
A particularly promising method is event-based AI, which operates on the sparse data generated by neuromorphic, event-based sensors and reduces both power consumption and prediction latency [24]. Event-based sensors, instead of regularly sampling an environmental variable, generate "events" when the signal changes by a sufficient amount.
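The change-triggered encoding described above can be sketched as a simple delta modulator. The function below is an illustrative reconstruction under that assumption, not the circuit of any particular sensor; the threshold value is arbitrary.

```python
import numpy as np

def delta_events(signal, threshold):
    """Convert a sampled signal into sparse events: emit (index, +1/-1)
    whenever the signal moves by `threshold` from the last event level."""
    events = []          # list of (sample_index, polarity)
    ref = signal[0]      # reference level at the last emitted event
    for i, x in enumerate(signal):
        while x - ref >= threshold:   # signal rose past the threshold
            ref += threshold
            events.append((i, +1))
        while ref - x >= threshold:   # signal fell past the threshold
            ref -= threshold
            events.append((i, -1))
    return events

# A slow ramp produces only a handful of events; a constant signal produces none.
ramp = np.linspace(0.0, 1.0, 100)
evs = delta_events(ramp, threshold=0.25)
print(evs)
```

Note that the event rate tracks signal activity rather than a fixed clock, which is the source of the sparsity exploited later in the paper.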
In this work, which is an extended version of our conference paper [45] presented at the 21st International Symposium on Applied Reconfigurable Computing, we focus on audio signal processing. In this context, event-based time-series data is generated by a class of sensors known as artificial cochleas (AC) (also referred to as dynamic audio sensors or silicon cochleas) [38,43]. Their operating principle is to apply a bank of band-pass filters to separate the signal into multiple frequency channels. A digital pulse (i.e. an event) is generated per channel in an asynchronous manner when the signal intensity changes by a pre-defined threshold. This results in a sparse spectrogram, which an event-based AI method exploits to perform efficient computation. Spiking neural networks (SNNs) are the most common approach to processing such data; however, it is unclear how event-data sparsity can be truly exploited, due to the nondeterministic pattern of synaptic weight-memory access [15,19] inherent to SNNs.
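A toy software model of the artificial-cochlea principle can help fix ideas. The sketch below substitutes FFT masking for the analogue band-pass filter bank and a rectified envelope for the intensity detector; the band edges, threshold, and test tone are all arbitrary assumptions, not parameters of any real device.

```python
import numpy as np

def cochlea_events(signal, fs, bands, threshold):
    """Toy artificial cochlea: split `signal` into frequency `bands`
    (list of (low, high) in Hz) via FFT masking, then emit an event
    (channel, sample_index) whenever that channel's envelope changes
    by `threshold` since the channel's last event."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    events = []
    for ch, (lo, hi) in enumerate(bands):
        mask = (freqs >= lo) & (freqs < hi)
        channel = np.fft.irfft(spectrum * mask, n=len(signal))
        env = np.abs(channel)            # crude envelope estimate
        ref = env[0]
        for i, e in enumerate(env):
            if abs(e - ref) >= threshold:
                events.append((ch, i))   # asynchronous per-channel event
                ref = e
    return events

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)       # 1 s, 440 Hz test tone
bands = [(0, 300), (300, 600), (600, 1200)]
evs = cochlea_events(tone, fs, bands, threshold=0.5)
print(sorted({ch for ch, _ in evs}))     # only the 300-600 Hz channel fires
```

The output is exactly the sparse spectrogram structure described above: silent channels emit nothing, so downstream computation scales with acoustic activity.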
Recently, event-graph neural networks have been proposed as an alternative way of processing event-data [16,30,36,42,52]. The event-graph approach consists of a dynamically updated graph generated by an event-sensor, and it involves applying graph convolutions on the resulting data structure. Unlike SNNs, the weight access pattern for many event-graph models is deterministic. This may provide an opportunity to develop new event-based AI hardware that is truly capable of exploiting the inherent sparsity of data to reduce power consumption and latency. Although digital architectures for accelerating event-graphs have been proposed in the context of computer vision [31,64], a dedicated architecture for time-series audio applications has not yet been considered.
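A minimal sketch of the event-graph idea follows, assuming a simple spectro-temporal neighbourhood rule for edges and a mean-aggregation graph convolution; the connectivity rule, feature dimensions, and weights are illustrative and do not reproduce any of the cited models. The point is that once the edge list is built, the weight-access pattern of the convolution is deterministic.

```python
import numpy as np

def build_event_graph(events, dt_max, dch_max):
    """Connect two events with an edge if they are close in both time
    and frequency channel (a simple spectro-temporal neighbourhood).
    `events` is a list of (time, channel) pairs."""
    edges = []
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            ti, ci = events[i]
            tj, cj = events[j]
            if abs(ti - tj) <= dt_max and abs(ci - cj) <= dch_max:
                edges.append((i, j))
                edges.append((j, i))   # undirected graph, store both directions
    return edges

def graph_conv(x, edges, w_self, w_neigh):
    """One graph-convolution layer: each node mixes its own features with
    the mean of its neighbours' features, followed by a ReLU."""
    agg = np.zeros_like(x)
    deg = np.zeros(len(x))
    for i, j in edges:                 # accumulate neighbour features
        agg[i] += x[j]
        deg[i] += 1
    deg[deg == 0] = 1                  # isolated nodes keep zero aggregate
    agg /= deg[:, None]
    return np.maximum(x @ w_self + agg @ w_neigh, 0.0)

events = [(0, 2), (1, 2), (1, 3), (10, 7)]   # (time, channel); last is isolated
edges = build_event_graph(events, dt_max=2, dch_max=1)
x = np.eye(4, dtype=np.float32)              # toy one-hot node features
rng = np.random.default_rng(0)
h = graph_conv(x, edges, rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(h.shape)  # (4, 8)
```

Because the edge list is computed once per event arrival, the multiply-accumulate schedule of `graph_conv` is known in advance, which is the property that makes this style of model attractive for a dedicated hardware datapath.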
In this paper, we propose a hardware accelerator implemented on a SoC FPGA for event-graph audio classification and keyword spotting tasks (Figure 2). The proposed method enables real-time, end-to-end continuous processing while preserving the inherent sparsity of the input data. Specifically, we consider a recently proposed spectro-temporal model [52] developed for the classification of time-series data and evaluated on the Spiking Heidelberg Digits (SHD) dataset [13], which is representative of event-based time-series data.
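The GCN-plus-recurrent pipeline used for keyword spotting can be sketched as a recurrent head running over pooled per-frame graph embeddings. The plain tanh cell and sigmoid word-end score below are illustrative stand-ins under that assumption; the real model's cell type, dimensions, and training are not specified here.

```python
import numpy as np

def rnn_step(h, x, w_x, w_h, b):
    """One tanh recurrent step over a pooled graph embedding `x`."""
    return np.tanh(x @ w_x + h @ w_h + b)

def keyword_scores(frame_embeddings, w_x, w_h, b, w_out):
    """Run the recurrent head over per-frame graph embeddings and return a
    per-frame score, e.g. the probability that a keyword just ended."""
    h = np.zeros(w_h.shape[0])
    scores = []
    for x in frame_embeddings:       # frames arrive as events accumulate
        h = rnn_step(h, x, w_x, w_h, b)
        scores.append(1.0 / (1.0 + np.exp(-(h @ w_out))))  # sigmoid read-out
    return np.array(scores)

rng = np.random.default_rng(1)
frames = rng.normal(size=(20, 8))    # 20 frames of 8-dim pooled GCN embeddings
scores = keyword_scores(frames,
                        rng.normal(size=(8, 16)), rng.normal(size=(16, 16)),
                        np.zeros(16), rng.normal(size=16))
print(scores.shape)  # (20,)
```

Streaming the score frame by frame, rather than classifying a fixed-length clip, is what enables the continuous word-end detection evaluated later.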
We summarise our main contributions as follows:
• We use the hardware-aware design method to propose optimisations required to implement spectro-temporal event-graphs in reconfigurable hardware with low power, low latency and