Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the need to investigate and address this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
Global food security is under threat as plant diseases cause up to 40% of crop losses annually [1]. Crops such as potatoes and tomatoes are vital for both nutrition and the economy, yet they remain vulnerable to devastating diseases like early blight and late blight [2]. These diseases not only reduce yields but also cause substantial economic losses.

2 Materials and Methods
To validate the aforementioned hypothesis, we adopt a comprehensive methodology that leverages three distinct model categories: CNN-based architectures, Transformer-based networks, and CLIP-based zero-shot models, as illustrated in Fig. 1.
We include three state-of-the-art convolutional architectures: EfficientNet-B0, ResNet50, and InceptionV3. These models follow the classical deep learning pipeline, where input images are processed through stacked convolutional layers to extract hierarchical spatial features. These layers exploit local receptive fields, weight sharing, and translation invariance, the hallmarks of CNNs in visual tasks. CNN-based approaches remain dominant in plant disease classification research due to their strong performance on clean, labeled datasets such as PlantVillage [3,13]. However, their reliance on local features may limit their generalization in field conditions where disease patterns are more variable and often occluded.
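As a minimal sketch of how such backbones are typically adapted, the snippet below loads an ImageNet-pretrained model and replaces its final classification layer to match the plant-disease label set. The three-class potato label set, the PyTorch/torchvision implementation, and the pretrained weight versions are illustrative assumptions rather than the exact configuration used in this work.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # assumed label set: healthy, early blight, late blight

def build_cnn(name: str = "resnet50") -> nn.Module:
    # Load an ImageNet-pretrained backbone and swap the classification head.
    if name == "resnet50":
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    elif name == "efficientnet_b0":
        model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, NUM_CLASSES)
    elif name == "inception_v3":
        model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
        # Auxiliary head (used only during training) must also match the label set.
        model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unknown backbone: {name}")
    return model

# Stacked convolutional layers turn a leaf image into hierarchical spatial
# features; the new head maps them to disease logits.
model = build_cnn("resnet50").eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 3])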
Transformer architectures have emerged as a powerful alternative to CNNs by modeling images as sequences of non-overlapping patches. Each patch is embedded and enriched with positional encodings before being processed by multi-head self-attention mechanisms. In our work, we adopt ViT-B/16, ViT-B/32, Swin-T, and Swin-S architectures.
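To make this setup concrete, the sketch below instantiates the four Transformer backbones through the timm library with a small classification head. The timm model identifiers and the three-class head size are assumptions for illustration; the paper's actual training configuration is not reproduced here.

import timm
import torch

NUM_CLASSES = 3  # assumed potato-leaf label set

backbones = [
    "vit_base_patch16_224",           # ViT-B/16: 16x16 non-overlapping patches
    "vit_base_patch32_224",           # ViT-B/32: coarser 32x32 patches
    "swin_tiny_patch4_window7_224",   # Swin-T: hierarchical windowed attention
    "swin_small_patch4_window7_224",  # Swin-S
]

nets = {name: timm.create_model(name, pretrained=True, num_classes=NUM_CLASSES)
        for name in backbones}

x = torch.randn(2, 3, 224, 224)  # a batch of leaf images resized to 224x224
for name, net in nets.items():
    # Internally: patchify -> linear patch embedding + positional encoding ->
    # multi-head self-attention blocks -> classification head.
    with torch.no_grad():
        logits = net.eval()(x)
    print(name, logits.shape)  # (2, NUM_CLASSES)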
We hypothesize that the attention mechanism enables these models to selectively attend to symptomatic regions while ignoring irrelevant visual noise, which is particularly beneficial in complex field scenarios. This global context modeling may offer superior robustness over CNNs for real-world leaf imagery. This experiment investigates whether Transformer-based architectures can effectively capture long-range dependencies and highlight disease-specific symptoms across diverse plant classes.
Recent studies have demonstrated the potential of Transformers for fine-grained classification in agricultural contexts and other vision tasks [6,22].
To explore generalization beyond traditional supervised learning, we incorporate CLIP-based zero-shot models (CLIP-ViT-B/16 and CLIP-ViT-B/32) into our study. These models align textual disease descriptions with image embeddings in a shared semantic space via contrastive pretraining. At inference, handcrafted prompts (e.g., “A healthy potato leaf with no spots”, “A potato leaf with concentric brown lesions”, “A potato leaf with irregular dark patches”) are encoded by the text encoder, while plant images are processed by the vision encoder. The resulting embeddings are compared through cosine similarity, and the highest similarity score determines the predicted class.
As illustrated in Figure 2 and Algorithm 1, the CLIP zero-shot pipeline operates by comparing input images against multiple textual descriptions without requiring any task-specific fine-tuning. In this workflow, disease-related prompts are first encoded by the text encoder and stored as fixed embeddings, while each new image is encoded online by the vision encoder. The embeddings are then projected into a shared semantic space, where cosine similarity determines the most likely class. This organization highlights an important optimization: the text encoder is executed only once, and only the image embedding needs to be recomputed for each new input. The class prompts used in this illustration are:
- Healthy: “A healthy potato leaf with no spots”.
- Early Blight: “A potato leaf with concentric brown lesions”.
- Late Blight: “A potato leaf with irregular dark patches”.

To further enhance robustness, multiple textual variants can be defined for each class, and their similarity scores aggregated, as shown in Algorithm 1. This strategy mitigates sensitivity to prompt wording and improves classification stability. For example, in the illustrative case of Figure 2, a potato leaf affected by late blight achieves similarity scores of (Healthy: 0.12, Early Blight: 0.35, Late Blight: 0.89), leading to the correct prediction of Late Blight. These values are provided solely to demonstrate how CLIP performs zero-shot classification in practice.
Overall, this paradigm eliminates the need for large annotated datasets, making it especially valuable in agricultural contexts where expert labeling is limited or costly. While CLIP has shown broad generalization in vision-language domains [9,23], applying it to plant disease classification remains largely unexplored, representing a novel contribution of this work.

Algorithm 1: CLIP zero-shot classification with prompt ensembling
(1) Text Encoding (offline, once):
    for each class c = 1, ..., C do
        for each description d_j^c, j = 1, ..., N_c do
            t_j^c <- TextEncoder(d_j^c)
        end for
    end for
(2) Image Encoding (online, per input):
    v <- ImageEncoder(x)
    s_c <- aggregate_j cos(v, t_j^c) for each class c
    prediction <- argmax_c s_c
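A minimal Python sketch of this pipeline, assuming the Hugging Face transformers implementation of CLIP-ViT-B/16, is given below. The extra prompt variants, the model identifier, and the mean aggregation over variants are illustrative assumptions rather than the exact setup used in this work.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Class prompts; the second variant per class is an illustrative assumption.
prompts = {
    "Healthy":      ["A healthy potato leaf with no spots",
                     "A photo of a green, undamaged potato leaf"],
    "Early Blight": ["A potato leaf with concentric brown lesions",
                     "A photo of a potato leaf with target-like brown spots"],
    "Late Blight":  ["A potato leaf with irregular dark patches",
                     "A photo of a potato leaf with dark, water-soaked blotches"],
}

# (1) Text encoding: executed once, offline, and cached as fixed embeddings.
text_feats = {}
with torch.no_grad():
    for cls, variants in prompts.items():
        tok = processor(text=variants, return_tensors="pt", padding=True)
        feats = model.get_text_features(**tok)
        text_feats[cls] = feats / feats.norm(dim=-1, keepdim=True)

def classify(image):
    # (2) Image encoding: executed online, once per input image.
    with torch.no_grad():
        pix = processor(images=image, return_tensors="pt")
        img = model.get_image_features(**pix)
        img = img / img.norm(dim=-1, keepdim=True)
        # Cosine similarity against each class's prompt embeddings, averaged
        # over variants to reduce sensitivity to prompt wording.
        scores = {cls: float((img @ feats.T).mean())
                  for cls, feats in text_feats.items()}
    return max(scores, key=scores.get), scores

# Example usage (hypothetical file name):
# label, scores = classify(Image.open("potato_leaf.jpg").convert("RGB"))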
The CNN and Transformer models are trained in a supervised manner using a dataset taken from PlantVillage, which includes labeled potato leaf images across three classes: healthy, early blight, and late blight.
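A minimal sketch of such a supervised training loop is shown below, assuming an ImageFolder-style copy of the PlantVillage potato subset; the directory path, augmentations, optimizer, and hyperparameters are illustrative assumptions, not the settings reported in this study.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Hypothetical directory layout: one subfolder per class.
train_ds = datasets.ImageFolder("plantvillage/potato/train", transform=train_tf)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(10):  # illustrative number of epochs
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
    print(f"epoch {epoch}: loss {running_loss / len(train_ds):.4f}")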