Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case


Authors: Koldo Basterretxea, Jon Gutiérrez-Zaballa, Javier Echanobe

Koldo Basterretxea (University of the Basque Country, Dep. of Electronics Technology); Jon Gutiérrez-Zaballa and Javier Echanobe (University of the Basque Country, Dep. of Electricity and Electronics)

ABSTRACT

The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, non-controlled and variable lighting conditions, the wide depth-of-field ranges, and dynamic scenes with fast-moving objects. On the other hand, the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.

Index Terms — Hyperspectral imaging, autonomous driving, image segmentation, spectral attention

1. INTRODUCTION

Recent advances in hyperspectral imaging (HSI) sensing technologies have enabled the commercialization of snapshot cameras capable of capturing spectral information across dozens or even hundreds of bands at video rates [1]. These developments have sparked significant interest in exploring the potential of HSI in emerging application domains that demand affordable, compact, and portable spectral imaging systems with moderate to low power consumption, such as precision agriculture, remote sensing in the new space sector, medical imaging for surgery, food processing, and autonomous navigation [2].
In particular, some researchers have focused their attention on the potential of HSI to overcome some of the limitations of current intelligent vision systems for Autonomous Driving (AD), as well as to enhance their capability for comprehensive scene understanding [3], [4]. However, the use of HSI snapshot cameras in AD requires a careful analysis of the involved technologies and the customization of both algorithms and processing architectures to meet the specific requirements of this field of application. The challenges associated with applying HSI to AD stem from the combination of factors that affect the quality and accuracy of the collected data (constraints imposed by snapshot camera technologies, uncontrolled illumination, the presence of fast-moving elements, the need for variable exposure times, etc.) together with the requirements of low-latency operation and the limitations in the computational power of available embedded processing platforms. In this article, we share several lessons learned during the development of HSI-based image segmentation systems for AD using the HSI-Drive dataset. We also describe recent improvements introduced in both the latest public release of this dataset and the deep neural network (DNN) image segmentation models developed through experiments with these data.

(This work was partially supported by the University of the Basque Country (UPV/EHU) under grant GIU21/007.)

2. HSI SNAPSHOT CAMERA TECHNOLOGIES

The selection of an HSI camera to experiment with the development of advanced machine vision systems for AD must be grounded on a set of well-established criteria.
Ideally:

• it must be capable of operating at video rates (snapshot cameras) under different illumination conditions,
• it must be small, compact, and mechanically robust,
• it must provide enough spatial resolution for the depth of field required in the application,
• it must provide enough spectral resolution to achieve good spectral separability of the classes/materials to be identified,
• and it must be based on a scalable sensor technology that makes it competitive for future mass deployment.

Regarding the ability of HSI systems to operate at video rates, an additional factor that is often overlooked, but of considerable importance, is the processing required to reconstruct spatial and spectral information from the raw data. A thorough analysis of commercially available HSI snapshot camera technologies is beyond the scope of this work. However, it is worth noting that the main players in the small-form-factor, high-throughput (>10-20 fps) HSI snapshot camera segment employ either on-chip deposition technology of mosaics of narrow-band interferometric filters (e.g., Imec sensors integrated in Ximea and Photonfocus cameras), or light-field technology based on microlens arrays for multiple projection of spectrally filtered sub-images onto the sensor (as in the case of Cubert). In both cases, the incident light is projected onto a standard CMOS sensor. Alternative technologies, such as coded aperture snapshot spectral imaging (CASSI) combined with scattered spectral sampling, require complex and time-consuming data processing. Consequently, these solutions are not yet capable of meeting the real-time performance requirements of autonomous driving applications.

(© 2025 IEEE. Final published version of the article can be found at .)

The HSI-Drive dataset was generated from recordings with a Photonfocus MV2 camera featuring an Imec 25-band Red-NIR sensor with on-chip mosaic filter technology.
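As an illustration of what the on-chip filter mosaic implies for cube reconstruction, the following sketch (NumPy, with hypothetical array shapes and band ordering) rearranges a raw frame from a 5×5 spectral filter mosaic into a 25-band cube; each 5×5 macro-pixel yields one spatial position, which is the source of the 1/5 spatial resolution loss discussed in Section 3:

```python
import numpy as np

def demosaic_5x5(raw):
    """Rearrange a raw frame from a 5x5 spectral filter mosaic into a
    25-band cube. Each 5x5 macro-pixel collapses into one spatial
    position, hence a 1/5 spatial resolution loss in each dimension."""
    h, w = raw.shape
    assert h % 5 == 0 and w % 5 == 0, "frame must tile into 5x5 macro-pixels"
    # (h//5, 5, w//5, 5) -> (h//5, w//5, 5, 5) -> (h//5, w//5, 25)
    cube = raw.reshape(h // 5, 5, w // 5, 5).transpose(0, 2, 1, 3)
    return cube.reshape(h // 5, w // 5, 25)

# toy example: a 10x10 frame becomes a 2x2 image with 25 bands
frame = np.arange(100, dtype=np.float32).reshape(10, 10)
cube = demosaic_5x5(frame)  # cube.shape == (2, 2, 25)
```

Note that this only rearranges samples; the filter-to-band mapping, spectral-leakage correction, and pixel-level realignment of a real mosaic sensor are omitted.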
The selection of this system was motivated by several factors: its relatively low cost, its high throughput combined with acceptable spatial and spectral resolution, and the accessibility of the cube-generation processing code, which could be adapted and optimised for efficient execution on an embedded processing platform. Moreover, spectral filter-on-chip technology offers far greater scalability than competing approaches, enabling mass production at costs comparable to those of traditional CMOS sensors. Nonetheless, there are also some drawbacks to be considered when compared to other technological alternatives. These include the presence of spectral leakage and second-order response peaks, the constraints imposed by mosaic patterns on the number of bands, and the need for pixel-level spatial realignment of the mosaic filters. All in all, the design of the HSI-Drive dataset was guided primarily by principles of simplicity and feasibility, rather than by the pursuit of maximum spectral quality. In other words, our guiding question was: given currently available and potentially scalable technologies for vehicle integration at reasonable cost, what level of useful information can be extracted from an HSI snapshot camera for the development of machine vision systems for autonomous driving?

3. SPECTRAL INFORMATION: DATA QUALITY, CONSISTENCY, AND PROCESSING CONSTRAINTS

Considering that the approach for the development of the machine vision system under study, i.e. an image segmentation processing pipeline that could be executed on an embedded processing platform at video rates, was to be based on a machine learning (ML) model, to what extent is the correspondence of the acquired spectral data with the real physical spectral reflectance signatures of the materials in a scene of real importance? In principle, as long as there is data consistency, accuracy does not seem to be of any concern.
However, this is not entirely true, as it must be understood that the spectral separability between materials may depend on subtle differences in the spectral reflectance signatures. Still, here again the approach in the HSI-Drive dataset leans towards computational simplicity as long as model quality is not compromised.

The latest version (v2.1) of the HSI-Drive dataset presents two major differences with respect to the previous v2.0 version. First, the annotation, while still favouring the preservation of the spectral information, has been carefully reviewed, and the total number of labelled pixels has been increased by more than 1,100,000 new pixels (+2.5%). Secondly, a new function to enhance data consistency has been added to the cube processing pipeline, which performs a pseudo-reflectance correction algorithm using data contained in the recorded frame itself.

Fig. 1: Example of manually labelled ground-truth images in HSI-Drive versions v2.0 and v2.1. (a) False RGB; (b) HSI-Drive 2.0 labelling; (c) HSI-Drive 2.1 labelling.

3.1. New data labelling

The v2.1 version of the HSI-Drive dataset does not provide more annotated images, but a new, more careful annotation of the images already in version v2.0. The aim of this new labelling effort has been twofold: firstly, to increase the amount of labelled pixels for training, especially in the most underrepresented categories; secondly, to provide higher-quality test images for the evaluation of segmentation models. However, the primary approach to the image labelling of the dataset has not changed, i.e. pixels that a human labeller cannot clearly assign to a category are kept unlabelled. This usually includes many background pixels and all the edges that delimit different items or surfaces in a scene.

3.2. Reflectance correction and data normalization

With the aim of preserving the feasibility of real-time deployments, the cube processing pipeline applied to generate the hyperspectral cubes in HSI-Drive 2.1 was kept as simple as in previous versions. The target application (AD) requires real-time processing of the acquired images and, considering the recording circumstances (outdoor recording, different camera setups, no additional lighting and spectral measurements, etc.), trying more accurate yet more complex cube-generation pipelines would become the principal processing bottleneck with no noticeable improvement in the final results (accuracy).

The applied spectral cube generation process is a reflectance processing pipeline that comprises the following steps:

1. Image cropping and framing.
2. Bias removal and reflectance correction.
3. Partial demosaicing (with 1/5 spatial resolution loss).
4. Spatial filtering (optional).
5. Translation to centre (band alignment by bilinear interpolation).

Since data normalization techniques (per-band normalization, pixel normalization, etc.) have different objectives depending on the algorithms to be used to subsequently process the hyperspectral images, and since it is the final step of the processing pipeline, unlike in the v1.x versions of the dataset, we do not provide normalized cubes in the v2.x versions. This process, if necessary, is left to the dataset users.

Fig. 2: Spectra of the maximum values of the averaged reference white tile images for the four different camera configurations used in the dataset. Images were taken by exposing the calibrated tile to direct sunlight on a clear day with the sun at its zenith.

Reflectance correction is aimed at cancelling the irradiance spectrum of the illuminant.
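Step 2 of the pipeline above amounts, conceptually, to a flat-field division. A minimal sketch (NumPy; variable names are illustrative, not the dataset's actual code), assuming a dark frame for the bias and a white-reference frame captured with the same camera configuration:

```python
import numpy as np

def reflectance_correct(raw, dark, white):
    """Flat-field reflectance correction (sketch): remove the sensor
    bias (dark frame) and divide by a white-reference frame captured
    with the same camera configuration. This cancels the illuminant's
    irradiance spectrum as well as sensor non-uniformity and vignetting."""
    num = raw.astype(np.float64) - dark
    den = np.clip(white.astype(np.float64) - dark, 1e-6, None)
    return np.clip(num / den, 0.0, None)
```

With a bias of 10 counts and a white level of 210 counts, a raw value of 110 maps to a reflectance of (110-10)/(210-10) = 0.5.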
In a laboratory, this can be performed by using a calibrated reference white tile: the data in the image to be processed are divided by the data in a reference white image acquired under the same illumination conditions and with the same camera configuration. In dynamic outdoor conditions, as is the case for the recording of the HSI-Drive dataset scenes, an accurate reflectance correction is not possible at all. However, when generating data for the previous versions of the HSI-Drive dataset, we chose to keep a "pseudo-reflectance correction" stage by using a common white reference image obtained for each of the four different f-number/AG configurations of the camera. These white reference images were generated by averaging various shots over a spectrally calibrated Spectralon white reflectance tile under expected natural maximum illumination conditions, i.e. at midday on a sunny day (Fig. 2). Since no point spectrometer or other light-measuring devices were present in the recording setup to compensate for illumination variations, the applied correction obviously does not fully normalize the spectral signatures of the different images in the dataset, and thus the data are not entirely consistent. However, this processing still provides some benefits: firstly, it reduces sensor non-uniformity issues and image vignetting, and secondly, it cancels to some extent the irradiance spectrum of natural light.

Applying a posteriori data normalization techniques can reduce this issue to some extent. The data used to train the CNN models for the HSI-Drive 2.0 experiments (see results published in [3]) were obtained after applying a per-pixel normalization of the spectral signatures. This technique removes the irradiance offset produced by variations in incident light intensity. The beneficial consequences are that shadow effects are mitigated, since reflectances on the same material surfaces are equalized, and that spectral information is favoured over the general reflectance of surfaces. The negative effect is that the differences in the overall reflectance levels of different materials are removed, which results in the loss of valuable information for training AI models.

Table 1: Frequency of each class in the HSI-Drive v2.0 dataset.

       | Total      | Road       | R.Marks a | Veg. b    | Pain.Met. c | Sky       | Concrete  | Ped. d  | Water  | Unpain.Met. e | Glass
Pixels | 43,947,503 | 26,690,619 | 1,325,343 | 9,339,224 | 948,852     | 2,511,496 | 2,315,153 | 209,531 | 12,330 | 348,341       | 246,614
%      | 100        | 60.73      | 3.02      | 21.25     | 2.16        | 5.71      | 5.27      | 0.48    | 0.03   | 0.79          | 0.56

a Road Marks. b Vegetation. c Painted Metal. d Pedestrian. e Unpainted Metal.

Table 2: Frequency of each class in the HSI-Drive v2.1 dataset.

       | Total      | Road       | R.Marks a | Veg. b    | Pain.Met. c | Sky       | Concrete  | Ped. d  | Water  | Unpain.Met. e | Glass
Pixels | 45,055,512 | 26,753,811 | 1,364,908 | 9,799,475 | 1,113,573   | 2,549,527 | 2,485,658 | 231,019 | 10,592 | 467,688       | 279,261
%      | 100        | 59.38      | 3.03      | 21.75     | 2.47        | 5.66      | 5.52      | 0.51    | 0.02   | 1.04          | 0.62

a Road Marks. b Vegetation. c Painted Metal. d Pedestrian. e Unpainted Metal.

As an improvement to data quality, in the v2.1 version of the dataset we incorporate an additional processing function that estimates the relative level of illumination of the recorded scene by searching for the pixels with the highest albedo in each image. These pixels usually correspond to high-reflectance white surfaces such as road marks, white vehicle bodies, etc., although in some cases the algorithm selects pixels corresponding to the sky. By comparing the irradiance of these pixels with the reference white images, a scaling factor is calculated to correct the reference white images stored in memory. Ideally, if the procedure were perfect, all images in the dataset would be scaled in the [0,1] range.
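The two data-conditioning operations discussed in this section, the per-pixel normalization used for the v2.0 experiments and the v2.1-style illumination scaling, can be sketched as follows (NumPy; function and variable names are illustrative, not the dataset's actual code):

```python
import numpy as np

def pixel_normalize(cube):
    """Per-pixel normalization: divide each spectral signature by its
    own mean. This removes the irradiance offset caused by variations
    in incident light (mitigating shadows), but also discards the
    overall reflectance level of each material."""
    mean = cube.mean(axis=-1, keepdims=True)
    return cube / np.clip(mean, 1e-8, None)

def rescale_to_white(cube, valid_mask):
    """v2.1-style illumination scaling (sketch): the brightest valid
    pixel (artificial light sources masked out beforehand) is taken as
    the scene's effective white level, the pseudo-reflectance cube is
    rescaled accordingly, and residual outliers such as vehicle lights
    are clipped to 1."""
    albedo = cube.mean(axis=-1)        # rough per-pixel brightness
    scale = albedo[valid_mask].max()   # scene's effective white level
    return np.clip(cube / scale, 0.0, 1.0)
```

After `pixel_normalize`, every spectral signature has unit mean; after `rescale_to_white`, the brightest valid pixel reaches the top of the [0, 1] range.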
The search for reference pixels for scaling is not straightforward, since it involves rejecting pixels from artificial light sources such as illuminated signs, traffic lights, and the front and rear lights of vehicles. The programmed algorithm automatically segregates these "suspicious" pixels based on their spectral signatures, so no human intervention is required, and thus this function can be embedded in the image processing pipeline of the image segmentation processor (see Fig. 3). Artificial-light pixels are thus treated as outliers and clipped to 1 at the end of the cube preprocessing sequence.

4. ENHANCED SEGMENTATION MODELS AND EXPERIMENTAL RESULTS

4.1. U-Net with spectral attention modules

The U-Net trained with previous versions of HSI-Drive has been enhanced by incorporating attention modules. These mechanisms are inspired by the human visual system, where the brain selectively prioritizes certain regions of the visual field while suppressing less relevant information. In CNNs, a similar principle is implemented by weighting feature maps through an attention function. This enables the network to emphasize discriminative spatial features, spectral features, or a combination of both, ultimately improving its representational capacity.

To leverage the spectral richness provided by hyperspectral images, several attention mechanisms have been investigated, namely the Convolutional Block Attention Module (CBAM) [5], Squeeze-and-Excitation (SE) [6], Efficient Channel Attention (ECA) [7], and Coordinate Attention (CA) [8]. Among these, the best segmentation accuracy in our experiments was achieved with ECA which, additionally, introduces the least computational complexity and memory overhead to the model.

Efficient Channel Attention (ECA) operates by adaptively re-weighting channel-wise features without relying on dimensionality reduction.
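The channel re-weighting just described can be sketched as follows (a NumPy illustration only: the adaptive kernel-size rule follows the gamma=2, beta=1 choice proposed in ECA-Net [7], and the learned 1D kernel is replaced here by a fixed averaging kernel):

```python
import numpy as np

def eca_block(features, gamma=2, beta=1):
    """ECA-style channel attention sketch (NumPy, batch-free).
    features: array of shape (C, H, W). The 1D kernel size k is
    adapted to the channel dimension C, as in ECA-Net; a real module
    would learn the kernel weights instead of averaging."""
    C = features.shape[0]
    # adaptive odd kernel size derived from the channel dimension
    t = int(abs((np.log2(C) + beta) / gamma))
    k = t if t % 2 else t + 1
    # squeeze: global average pooling per channel (no dimensionality reduction)
    y = features.mean(axis=(1, 2))
    # local cross-channel interaction: 1D convolution over channels
    pad = k // 2
    yp = np.pad(y, pad, mode="edge")
    kernel = np.ones(k) / k  # stand-in for the learned 1D kernel
    att = np.array([np.dot(yp[i:i + k], kernel) for i in range(C)])
    att = 1.0 / (1.0 + np.exp(-att))  # sigmoid gating
    # excite: re-weight each channel's feature map
    return features * att[:, None, None]
```

For a 25-band cube this rule yields a kernel of size 3, i.e. each channel's weight is computed from itself and its two neighbours, which is what keeps the block so cheap compared to SE's fully connected layers.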
Unlike SE, which compresses and then expands channel dimensions, ECA applies a local cross-channel interaction through a fast 1D convolution with a kernel size adaptively determined by the channel dimension. This design ensures efficient information exchange between channels while avoiding additional fully connected layers, thus maintaining both accuracy and efficiency.

To effectively integrate ECA into the previously developed U-Net architecture, two attention blocks were incorporated at each encoder and decoder stage: one before the first convolutional block and another before the second convolutional block (a detailed diagram of the original U-Net model can be consulted in [9]). This placement allows the network to refine its feature representations at multiple depths, ensuring that both low-level and high-level spectral-spatial information are adaptively emphasized during the segmentation process.

4.2. Experimental results

Testing experiments performed on the HSI-Drive 2.0 and 2.1 datasets demonstrate the superiority of both the new modified U-Net model with spectral attention modules and the new scaled reflectance correction processing over the use of non-scaled cubes, both with and without pixel normalization. Table 3 summarizes the results obtained on the HSI-Drive 2.0 dataset for a 5-class experiment with the previous U-Net using the pixel normalization technique. All figures correspond to mean IoU values over a 5-fold cross-validation experimental setup. Attention modules show over 2% accuracy improvement in the weighted IoU index. The figures in Table 4 show a comparative study of the new reflectance correction scaling algorithm with respect to the pixel normalization technique on the v2.1 dataset, using the attention U-Net as predictor. Here again, the improvement exceeds 2% in accuracy. Tables 5 and 6 show the results obtained for six-class experiments combining the attention modules with the new scaling reflectance correction technique. The additional sixth classes, painted metal (vehicle bodywork, road signals, etc.) and pedestrians/cyclists respectively, are especially challenging due to high intraclass spectral variability and low interclass spectral separability, as well as to the small amount of training data (see Table 2). The results obtained on these two classes are especially noteworthy, with accuracy improvements of 10.22% and 5.09% respectively. Examples of segmented videos using these models can be found at https://ipaccess.ehu.eus/HSI-Drive/.

Fig. 3: Example of the identification of a maximum-albedo pixel for the white-balance scaling: (a) sensed irradiance values (int12); (b) false RGB image of the 25-band HSI cube and coordinates of the pixel with maximum albedo. This image corresponds to a cloudy autumn morning recording with low lighting. Although the maximum irradiance values are generated by the rear and front lights of the cars (a), the algorithm successfully rejects those pixels and selects a pixel corresponding to the road mark as the highest-reflectance pixel in the image (b).

Table 3: Segmentation results (%) for the 5-class experiment on the HSI-Drive 2.0 dataset.

Model     | Version       | road  | road m. | veg.  | sky   | "others" | global | weighted
U-Net     | No scaling+PN | 97.53 | 85.94   | 95.04 | 93.02 | 78.59    | 94.64  | 87.52
Att.U-Net | No scaling+PN | 98.05 | 87.74   | 95.54 | 95.25 | 82.94    | 95.64  | 89.71

Table 4: Segmentation results (%) for the 5-class experiment on the HSI-Drive 2.1 dataset.

Model     | Version       | road  | road m. | veg.  | sky   | "others" | global | weighted
Att.U-Net | No scaling+PN | 97.64 | 85.33   | 94.55 | 92.89 | 81.79    | 94.71  | 87.75
Att.U-Net | Scaling+PN    | 97.83 | 87.27   | 94.60 | 94.14 | 82.50    | 95.04  | 89.16
Att.U-Net | Scaling       | 98.04 | 89.97   | 94.46 | 92.05 | 83.26    | 95.17  | 90.03

Table 5: Segmentation results (%) for the 6-class experiment (painted metal).

Model          | Version       | road  | road m. | veg.  | p.metal | sky   | "others" | global | weighted
U-Net v2.0     | No scaling+PN | 97.34 | 85.20   | 93.84 | 58.61   | 92.30 | 68.65    | 93.07  | 74.45
Att.U-Net v2.1 | Scaling       | 98.08 | 90.34   | 93.63 | 68.83   | 91.44 | 74.61    | 93.97  | 81.09

Table 6: Segmentation results (%) for the 6-class experiment (pedestrians).

Model          | Version       | road  | road m. | veg.  | ped.  | sky   | "others" | global | weighted
U-Net v2.0     | No scaling+PN | 97.04 | 81.85   | 93.56 | 61.94 | 89.23 | 74.20    | 93.26  | 67.13
Att.U-Net v2.1 | Scaling       | 97.60 | 87.93   | 93.64 | 67.03 | 89.34 | 80.28    | 94.15  | 72.23

5. CONCLUDING REMARKS

The successful adoption of HSI technology in autonomous driving (AD) will depend on several key factors. Firstly, it will depend on advances in HSI sensor technologies that enable the production of affordable yet technically precise snapshot cameras, capable of combining high image throughput with sufficient spectral and spatial resolution to support the development of high-performance machine vision systems for autonomous driving. Secondly, it will depend on research into more capable and robust, yet computationally efficient, algorithms that can make the most of the information provided by HSI data. Finally, the combination of improvements achieved in both areas should lead to machine vision systems that demonstrate either their superiority or, at least, their complementarity to increasingly capable and precise systems based on more mature technologies.

In this paper, we share some research results obtained using the latest published version of the HSI-Drive dataset. HSI-Drive is a dataset developed by recording real driving scenes with a single snapshot hyperspectral camera featuring a 25-band Red-NIR on-chip filter mosaic sensor.
In this version, we refined the labelling of ground-truth images, improved data consistency by introducing a customized illuminant intensity estimation algorithm for reflectance correction, and developed enhanced image segmentation models by incorporating spectral attention modules. These additional lightweight attention blocks have been placed at key points in the encoder and decoder branches of the previous backbone U-Net architecture to ensure efficient spectral information exchange between channels during inference. The result is a consistent improvement in segmentation accuracy and robustness, while preserving processing simplicity for deployment on embedded devices.

6. REFERENCES

[1] Michael West, John Grossman, and Chris Galvan, "Commercial snapshot spectral imaging: the art of the possible," 2018.

[2] Motoki Yako, "Hyperspectral imaging: history and prospects," Optical Review, Sep 2025.

[3] Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe, M. Victoria Martínez, and Unai Martinez-Corral, "HSI-Drive v2.0: More data for new challenges in scene understanding for autonomous driving," in 2023 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2023, pp. 207-214.

[4] Imad Ali Shah, Jiarong Li, Martin Glavin, Edward Jones, Enda Ward, and Brian Deegan, "Hyperspectral imaging-based perception in autonomous driving scenarios: Benchmarking baseline semantic segmentation models," in 2024 14th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2024, pp. 1-5.

[5] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, "CBAM: Convolutional Block Attention Module," in Computer Vision - ECCV. 2018, pp. 3-19, Springer.

[6] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-Excitation Networks," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141.
[7] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu, "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11531-11539.

[8] Qibin Hou, Daquan Zhou, and Jiashi Feng, "Coordinate Attention for Efficient Mobile Network Design," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 13708-13717.

[9] Jon Gutiérrez-Zaballa, Koldo Basterretxea, and Javier Echanobe, "Evaluating single event upsets in deep neural networks for semantic segmentation: An embedded system perspective," Journal of Systems Architecture, vol. 154, pp. 103242, 2024.
