Reading time: 15 minutes
...

📝 Original Info

  • Title:
  • ArXiv ID: 2512.17864
  • Date:
  • Authors: Unknown

📝 Abstract

Plant diseases pose a significant threat to global food security, necessitating accurate and interpretable disease detection methods. This study introduces an interpretable attention-guided Convolutional Neural Network (CNN), CBAM-VGG16, for plant leaf disease detection. By integrating a Convolutional Block Attention Module (CBAM) at each convolutional stage, the model enhances feature extraction and disease localization. Trained on five diverse plant disease datasets, our approach outperforms recent techniques, achieving high accuracy (up to 98.87%) and demonstrating robust generalization. Here, we show the effectiveness of our method through comprehensive evaluation and interpretability analysis using CBAM attention maps, Gradient-weighted Class Activation Mapping (Grad-CAM), Grad-CAM++, and Layer-wise Relevance Propagation (LRP). This study advances the application of explainable AI in agricultural diagnostics, offering a transparent and reliable system for smart farming.

📄 Full Content

The agriculture industry is essential to maintaining food security worldwide, but crops suffer from a variety of diseases under varying weather conditions, posing a major threat to crop yield and quality. These diseases are often caused by factors such as extreme temperatures, microbial infections, and changes in humidity or soil conditions. Farmers have traditionally relied on human examination to identify diseases, which is labour-intensive, prone to mistakes, and ineffective on large farms. As a result, automating plant disease identification and classification with Artificial Intelligence (AI) approaches has become one of the most crucial areas of research in smart agriculture [1,2]. In recent years, Convolutional Neural Network (CNN)-based models have performed well on agricultural tasks such as plant disease classification. However, despite their effectiveness, CNNs often operate as black boxes, offering limited transparency into how predictions are made, which hampers trust and broader adoption.

To address this, Explainable AI (XAI) methods have been developed that enable visual interpretation of model decisions through attention maps, Grad-CAM [3], and LRP [4]. While Grad-CAM and Grad-CAM++ [5] generate class-discriminative localization maps using gradients from the final convolutional layers, they may lack resolution and be susceptible to noisy activations. LRP, in contrast, provides pixel-level attributions by backpropagating the model output through a set of layer-specific relevance propagation rules. This offers a more granular explanation of predictions, which is crucial in medical and agricultural diagnostics.

In this work, an explainable deep learning approach is proposed for the detection of plant leaf disease. Our architecture is based on the VGG16 [6] backbone enhanced with CBAM [7], which introduces attention layers for inherent interpretability, emphasizing the most relevant features at both the channel and spatial levels. A CBAM module is added after each of the five convolutional stages to improve classification accuracy and the localization of relevant features by capturing both spatial and channel-wise attention. Five distinct datasets are used to train the model, namely Apple, PlantVillage, Embrapa, Maize and Rice, to ensure the generalizability and applicability of the proposed method across a diverse set of crops. Apart from the inherent interpretability of the proposed method's decision-making process through the CBAM layers, we also demonstrate interpretability using advanced explainability methods such as LRP, Grad-CAM, and Grad-CAM++. We have also employed high-dimensional feature visualization techniques, t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), to project the extracted features into a lower-dimensional space for visualization. Overall, this study advances the application of XAI in agriculture by presenting an interpretable and performance-robust framework for plant disease classification. The following are the main contributions of the proposed work.

The LeafGAN architecture [11] was proposed for augmenting diseased leaf images via image transformation, improving plant disease diagnosis systems trained on large-scale datasets. To address the imbalance between healthy and unhealthy images, Zhao et al. [12] used a DoubleGAN architecture that combines a Wasserstein Generative Adversarial Network (WGAN) and a Super-Resolution Generative Adversarial Network (SRGAN) to balance the dataset. Some recent methods have also used a fuzzy rank-based ensemble of pretrained CNNs [13] and fuzzy feature extraction [14] for plant leaf disease detection.

Although methods like Grad-CAM [15] and LRP are commonly utilised to interpret CNN outputs, their application to detecting plant leaf diseases is still relatively new. In the context of plant disease identification, LRP has been utilised to highlight significant regions of a leaf, aiding the detection of disease symptoms and affected zones [16]. To improve interpretability, a number of studies have explored combining Grad-CAM with additional explanation methods, with the goal of addressing Grad-CAM's shortcomings, including its tendency to produce noisy or vague visual outputs. Despite this progress, issues such as unclear Grad-CAM explanations and the complexity of LRP visualisations remain unresolved. A detailed survey of recent advances in plant disease detection is provided by Qadri et al. [17], who highlight the issues and challenges in plant leaf disease detection using machine learning and deep learning based solutions.

An overview of the interpretable architecture used for identifying plant leaf disease is provided in Fig. 1. The input plant images undergo data preprocessing to improve quality and guarantee alignment with the input structure of the proposed CBAM-VGG16 model. This involves Contrast Limited Adaptive Histogram Equalization (CLAHE) to improve contrast, normalization of pixel values to the [0, 1] range, and resizing images to 224 × 224 pixels. Following preprocessing, the enhanced VGG16 [6] model integrated with CBAM is employed for disease classification. In this architecture, CBAM modules are added after every convolutional stage of the VGG16 model. The CBAM mechanism adaptively refines the feature maps by applying both channel and spatial attention, thereby enhancing the model's ability to focus on disease-affected regions. Once the input image's class label is predicted by the model, multiple explainability modules are used to interpret its decision-making process. CBAM attention maps, generated by the integrated CBAM modules in the VGG16 architecture, provide insights into the spatial and channel-wise focus of the network across layers. For class-discriminative localization, Grad-CAM [15] and Grad-CAM++ [5] are used to generate heatmaps that draw attention to the input image's most significant areas. Additionally, LRP is employed using multiple propagation rules, including the ε rule, the ε-γ-box rule, the α2β1 rule, and excitation backpropagation, to assign pixel-level relevance scores, further enriching the interpretability of the model's predictions through diverse attribution perspectives. This entire pipeline, from data preprocessing to classification and visual explanation, forms an end-to-end plant leaf disease detection system designed for improved interpretability and trust in AI-based agricultural diagnostics. The internal modules of the architecture are explained in the following subsections.
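As a concrete illustration of the described preprocessing stage, the sketch below applies CLAHE, resizing to 224 × 224, and [0, 1] normalization using OpenCV. This is a minimal reconstruction rather than the authors' code; the CLAHE clip limit, tile size, and function names are assumptions.

```python
import cv2
import numpy as np

def preprocess_leaf_image(path, size=(224, 224)):
    """Hypothetical preprocessing: CLAHE contrast enhancement, resize, normalize."""
    bgr = cv2.imread(path)                                       # image loaded in BGR order
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)                   # apply CLAHE on the lightness channel
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # assumed CLAHE settings
    enhanced = cv2.merge((clahe.apply(l), a, b))
    rgb = cv2.cvtColor(enhanced, cv2.COLOR_LAB2RGB)
    rgb = cv2.resize(rgb, size)                                  # match the 224x224x3 input layer
    return rgb.astype(np.float32) / 255.0                        # pixel values scaled to [0, 1]
```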

As illustrated in Fig. 1, the proposed CBAM-VGG16 architecture begins with an input layer of size 224 × 224 × 3. In the architecture, CBAM layers are added after each MaxPooling2D layer. Even though CBAM is a lightweight module, overusing it could lead to overfitting, which is why a CBAM layer is added only after each MaxPooling layer. The channel module and the spatial module are the two primary parts of CBAM as developed by [7]. Given the input $I_f \in \mathbb{R}^{c \times W \times H}$, CBAM generates a 1D channel attention map $C_m \in \mathbb{R}^{c \times 1 \times 1}$ as well as a 2D spatial attention map $S_m \in \mathbb{R}^{1 \times W \times H}$, where $c$ in our scenario stands for the three-channel input, and $W$ and $H$ stand for the input feature map's width and height, respectively. The attention process of the proposed CBAM-VGG16 is given in Eqs. 1 and 2.
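The equations themselves did not survive extraction. Assuming the standard CBAM formulation of [7] with the symbols defined here, Eqs. 1 and 2 most likely take the form:

$$I_f^{c\prime} = C_m(I_f) \otimes I_f \qquad (1)$$
$$I_f^{s\prime\prime} = S_m(I_f^{c\prime}) \otimes I_f^{c\prime} \qquad (2)$$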

where $\otimes$ denotes element-wise multiplication, $I_f^{c\prime}$ is the channel-attended output, and $I_f^{s\prime\prime}$ denotes the final refined output. The internal workings of CBAM's channel attention and spatial attention mechanisms are detailed in the following subsections.

A channel attention map $C_m \in \mathbb{R}^{c \times 1 \times 1}$ is applied to the input features by the channel attention module. To generate the average-pooled feature $I_c^{avg}$ and the max-pooled feature $I_c^{max}$, average and max pooling are applied independently. Thereafter, $I_c^{avg}$ and $I_c^{max}$ are fed to a shared Multi Layer Perceptron (MLP) with one hidden layer. In our CBAM-VGG16 design, the activation size of the hidden layer is set to $\mathbb{R}^{c/r \times 1 \times 1}$, where $r$ denotes the reduction ratio, which is set to eight. This value provides an equitable trade-off between model complexity and representational capacity, as empirically validated in [7], ensuring sufficient channel interdependencies are captured without incurring significant computational overhead. The channel attention mechanism used in our proposed architecture is given in Eqs. 3 and 4.
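Eqs. 3 and 4 are missing from the extracted text; under the standard CBAM channel-attention definition [7], with the shared MLP weights $W_0$ and $W_1$ described below, they most plausibly read:

$$C_m(I_f) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(I_f)) + \mathrm{MLP}(\mathrm{MaxPool}(I_f))\big) \qquad (3)$$
$$C_m(I_f) = \sigma\big(W_1(W_0(I_c^{avg})) + W_1(W_0(I_c^{max}))\big) \qquad (4)$$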

In the above equations, $\sigma$ denotes the sigmoid activation function. The weights $W_0 \in \mathbb{R}^{c \times c/r}$ and $W_1 \in \mathbb{R}^{c/r \times c}$ are shared parameters of the MLP. To generate the channel attention map $C_m(I_f)$, the input feature map $I_f$ undergoes both average pooling and max pooling across its spatial dimensions, resulting in $I_c^{avg}$ and $I_c^{max}$, respectively. These are then passed through the shared MLP, and their outputs are summed and activated using the sigmoid function.

This module generates a 2D spatial attention map $S_m \in \mathbb{R}^{1 \times W \times H}$, which is applied to the channel-refined feature map $I_f^{c\prime}$. To compute spatial attention, the module first applies average pooling and max pooling along the channel axis, resulting in two 2D feature maps, $I_s^{avg} \in \mathbb{R}^{1 \times W \times H}$ and $I_s^{max} \in \mathbb{R}^{1 \times W \times H}$. These two maps are then concatenated along the channel dimension and passed through a convolutional layer to produce the spatial attention map. The resulting map refines the features through element-wise multiplication with $I_f^{c\prime}$, yielding the final spatially-refined output $I_f^{s\prime\prime}$. The spatial attention computation in CBAM is formally described in Eqs. 5 and 6.
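Eqs. 5 and 6 are likewise absent from the extraction; assuming the standard CBAM spatial-attention definition, with the $7 \times 7$ convolution $\nu^{7 \times 7}$ defined below, they plausibly correspond to:

$$S_m(I_f^{c\prime}) = \sigma\big(\nu^{7 \times 7}([\mathrm{AvgPool}(I_f^{c\prime}); \mathrm{MaxPool}(I_f^{c\prime})])\big) \qquad (5)$$
$$S_m(I_f^{c\prime}) = \sigma\big(\nu^{7 \times 7}([I_s^{avg}; I_s^{max}])\big) \qquad (6)$$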

In the above equations, $\sigma$ represents the sigmoid activation function, and $\nu^{7 \times 7}$ denotes a convolution operation with a kernel size of $7 \times 7$.
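To make the two attention stages concrete, the following Keras-style sketch implements a CBAM block matching the description above (shared MLP with reduction ratio r = 8, 7 × 7 spatial convolution). It is an illustrative reconstruction under these assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(x, reduction=8, spatial_kernel=7):
    """Illustrative CBAM block: channel attention followed by spatial attention."""
    channels = x.shape[-1]

    # Channel attention: shared MLP over average- and max-pooled descriptors.
    shared_dense_1 = layers.Dense(channels // reduction, activation="relu")
    shared_dense_2 = layers.Dense(channels)
    avg_pool = layers.GlobalAveragePooling2D()(x)
    max_pool = layers.GlobalMaxPooling2D()(x)
    channel_att = layers.Activation("sigmoid")(
        layers.Add()([shared_dense_2(shared_dense_1(avg_pool)),
                      shared_dense_2(shared_dense_1(max_pool))]))
    channel_att = layers.Reshape((1, 1, channels))(channel_att)
    x = layers.Multiply()([x, channel_att])            # channel-refined features

    # Spatial attention: 7x7 conv over channel-wise average and max maps.
    avg_map = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_map = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    spatial_att = layers.Conv2D(1, spatial_kernel, padding="same",
                                activation="sigmoid")(
        layers.Concatenate(axis=-1)([avg_map, max_map]))
    return layers.Multiply()([x, spatial_att])         # spatially-refined features
```

In the proposed architecture such a block would be applied to the output of each of the five VGG16 pooling stages before the classifier head.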

The intermediate layers of the model utilize the Rectified Linear Unit (ReLU) activation function, defined as given in Eq. 7.
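Eq. 7 is not reproduced in the extracted text; the standard ReLU definition consistent with the description that follows is:

$$f(x) = \max(0, x) \qquad (7)$$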

Here, x is the input to the activation function. If x > 0, the output remains x; otherwise, the output is zero. ReLU is preferred over sigmoid and tanh due to its ability to mitigate the vanishing gradient problem. It also aids in learning complex, non-linear patterns and accelerates convergence during training by maintaining sparse activation.

The final output layer employs the Softmax activation function, a normalized exponential function commonly used for multi-class classification. Softmax transforms the input vector into a probability distribution where the sum of all output probabilities is one, as defined in Eq. 8.
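Eq. 8 is missing from the extraction; the standard softmax over $K$ classes, consistent with the description above, is:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \quad i = 1, \dots, K \qquad (8)$$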

In our work, as indicated by Eq. 9, we have utilised the Cross Entropy loss function for model optimization.
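Eq. 9 did not survive extraction; assuming the usual categorical cross-entropy over $N$ samples and $K$ classes, with $y_{n,k}$ the one-hot label and $\hat{y}_{n,k}$ the softmax output, it likely reads:

$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \log \hat{y}_{n,k} \qquad (9)$$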

To reduce the model's complexity, L2 regularization is employed, in which extreme changes in weights during the training phase are penalized, reducing the probability of over-fitting. During training, the L2 regularizer additionally simplifies the input features and stabilises performance. The definition of the L2 regularizer utilised in our CBAM-VGG16 is given in Eq. 10.
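Eq. 10 is absent from the extracted text; a common form consistent with the symbols defined below ($\omega_i$, $N$, $\lambda$) is the weight-decay penalty added to the loss, where the $\frac{1}{2N}$ scaling is an assumption:

$$\mathcal{L}_{L2} = \frac{\lambda}{2N} \sum_{i} \lVert \omega_i \rVert_2^2 \qquad (10)$$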

where $\omega_i$ denotes the two-dimensional weight matrix of the $i$-th layer, $N$ is the number of input samples, and $\lambda$ is the regularization hyperparameter.

Methods such as Grad-CAM [15] and Grad-CAM++ [5] utilize the gradients of class scores with respect to intermediate feature maps to produce localization maps. Grad-CAM is a widely used technique to show the areas of an input image that are distinctive to a class and have the biggest influence on a model's decision. Grad-CAM++ is a generalized version of Grad-CAM that enables more accurate localization, especially in scenarios with several instances of the same object.
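As a sketch of how such gradient-based maps can be computed for a Keras model like CBAM-VGG16, the function below follows the basic Grad-CAM recipe: gradients of the class score are global-average-pooled into channel weights, which form a ReLU-ed weighted sum of the feature maps. The layer name and other details are illustrative assumptions, not the authors' code.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Illustrative Grad-CAM heatmap for a single image (H, W, 3)."""
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])   # add batch dimension
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))       # explain the predicted class
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)               # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()   # heatmap normalized to [0, 1]
```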

LRP [4], [18] belongs to the class of additive explanation techniques. These methods operate under the assumption that a function $f_j$ with $N$ input features $x = \{x_i\}_{i=1}^{N}$ can be expressed as a sum of contributions from each input variable, represented as $R_{i \leftarrow j}$ and referred to as relevance scores.

Here, $R_{i \leftarrow j}$ quantifies how much the $i$-th input contributes to the $j$-th output. The total function value can be approximated (or exactly recovered) as shown in Eq. 11.
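Eq. 11 is missing from the extraction; the standard additive-decomposition identity it describes is:

$$f_j(x) \approx \sum_{i=1}^{N} R_{i \leftarrow j} \qquad (11)$$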

When an input $x_i$ influences multiple outputs $j$, as is common in multidimensional functions, the total relevance attributed to $x_i$ is the sum over its contributions from each output, as defined in Eq. 12.
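Eq. 12, also absent, is the corresponding relevance-aggregation rule:

$$R_i = \sum_{j} R_{i \leftarrow j} \qquad (12)$$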

Unlike other interpretability approaches, LRP explicitly treats the neural network as a hierarchical, layer-wise acyclic graph in which each unit $j$ in layer $l$ is associated with a local function $f_j^l$. Relevance values from the output layer $L$, where $R_j^L \propto f_j^L$, are passed backward, layer by layer, through the network until they reach the input. This reverse mapping follows the same activation path used during forward inference, moving from the output node $f^L$ down to the initial input $f^1$.

We have also used some of the advanced variants of LRP, such as the Epsilon Plus rule (ε+) [4], the Epsilon Plus Gamma Box rule (ε+γ⊡) [19], the Epsilon Plus Flat rule (ε+-flat) [18], and the Epsilon Alpha2 Beta1 Flat rule (ε-α2β1-flat) [19], for assessing the interpretability of our proposed method.

We have selected five different datasets to evaluate our proposed method. These five datasets, namely PlantVillage, Embrapa, Maize, Apple, and Rice, ensure the generalizability of our approach in plant leaf disease detection for other crops. All the datasets are split into training and testing sets in an approximately 80:20 ratio. The overall composition of the datasets is provided in Table 1.

The F1 score (F1), Area Under the Curve (AUC), Recall (REC), Accuracy (ACC), Precision (PREC), and Cohen's Kappa score (KAPPA) are common classification metrics used to assess and compare the performance of the proposed CBAM-VGG16 model with state-of-the-art models. The model's feature representation capability is further assessed through the high-dimensional visualization techniques t-SNE [20] and UMAP [21].
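These metrics are all available off the shelf; a minimal evaluation helper using scikit-learn could look like the sketch below, where the macro averaging and one-vs-rest AUC are assumptions about the exact evaluation protocol.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, cohen_kappa_score)

def evaluate(y_true, y_pred, y_prob):
    """y_true: integer labels, y_pred: predicted labels, y_prob: (n_samples, n_classes)."""
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "PREC": precision_score(y_true, y_pred, average="macro"),
        "REC": recall_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
        "AUC": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
        "KAPPA": cohen_kappa_score(y_true, y_pred),
    }
```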

The performance results, along with a detailed comparative study, are reported in Table 2. It can be observed from the results that across all five datasets, our CBAM-VGG16 consistently outperforms existing models [22], [9], [23], [27], [24], [25], [26], [28], [29], [31], [32], [30]. On the Apple dataset, our method performs best with the highest accuracy of 95.42%. On the Embrapa dataset, CBAM-VGG16 again achieves the best performance, with an accuracy of 94.20%. Similar superiority is observed in the other evaluation metrics as well, indicating the robust generalization capability of our proposed method across different datasets. For the Maize dataset, our model obtains a notable accuracy of 95.00%, outperforming all other methods. On the PlantVillage dataset, the proposed model maintains a competitive edge with nearly perfect scores across all metrics; notably, it again leads in performance, reaffirming its superior feature learning and attention capabilities compared with other methods. Lastly, on the Rice dataset, our approach attains the maximum accuracy of 98.87%, alongside the best precision, recall, and AUC (99.94%), demonstrating its effectiveness in capturing fine-grained disease patterns even in challenging samples. The results obtained on all the datasets show that the proposed approach generalizes well and can be adapted to identify leaf diseases in other crops.

We have selected the grape leaf subset of the PlantVillage dataset for a detailed explainability analysis of leaf samples across different disease classes. Grad-CAM computes gradients flowing into the last convolutional layers to determine which input locations most affected the model's decision. However, it typically produces relatively coarse, blob-like heatmaps. In contrast, Grad-CAM++ refines this by incorporating second-order gradients, resulting in more spatially accurate and class-discriminative attention maps.

Fig. 4 illustrates the class-wise attribution heatmaps generated for the CBAM-VGG16 model using a comprehensive suite of LRP techniques. The LRP family consistently produces sparse yet discriminative saliency maps, aligning well with regions exhibiting disease-related symptoms. Among them, Epsilon Plus Flat and Epsilon Alpha2 Beta1 Flat stand out for their capacity to sharply localize high-relevance areas, typically corresponding to lesions, necrotic margins, or discolored patches symptomatic of disease. These variants effectively suppress irrelevant background and vein structures, enhancing focus on pathologically significant textures.

Fig. 5 shows the explainable visualizations of the proposed method on diseases from the other datasets used in this work. These visualisations exhibit the same properties as observed in the detailed explainability analysis of the grape leaf dataset.

To evaluate the discriminative capability of the CBAM-enhanced VGG16 model, we employed t-SNE and UMAP to project high-dimensional feature representations into a 2-D space. Fig. 6 and Fig. 7 present the t-SNE and UMAP plots of the four grape leaf classes: Esca, Healthy, Leaf Blight, and Black Rot. The t-SNE plot shows that our proposed model provides clearly distinguishable cluster separation with minimal class overlap, indicating better feature discrimination. In the UMAP plot, class boundaries are clearly defined, with each category forming dense and distinct clusters. These findings demonstrate how well CBAM improves the representational quality of the VGG16 backbone, resulting in improved class-wise separability and model interpretability.
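Such projections can be produced directly from the network's penultimate-layer features; a brief sketch using scikit-learn and umap-learn is shown below, where the feature file names and hyperparameters (perplexity 30, default UMAP settings) are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import umap  # from the umap-learn package

# Hypothetical inputs: penultimate-layer activations and their class indices.
features = np.load("cbam_vgg16_features.npy")   # shape (n_samples, n_features)
labels = np.load("grape_labels.npy")            # shape (n_samples,)

# Two-dimensional embeddings for visualization.
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
umap_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(features)

plt.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of CBAM-VGG16 features")
plt.show()
```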

This work presents an interpretable approach for the detection of plant leaf diseases. By incorporating attention modules at each convolutional stage, the model not only enhances disease-specific feature extraction but also offers improved performance across five public datasets. The performance on five different datasets demonstrates the generalizability of our proposed solution to other crops. To evaluate the model's interpretability, we carried out an in-depth evaluation using different attribution techniques. The qualitative study highlighted that LRP variants produced the most visually clear, localized, and class-discriminative heatmaps with minimal noise. Future directions include refining the attention mechanism to enhance class-awareness and reduce interpretability gaps, exploring global or transformer-inspired attention integration for better contextual representation, and conducting human-in-the-loop evaluations to assess the trust and reliability of the visualisations.

References

This content is AI-processed based on open access ArXiv data.
