Title: Automated Plant Disease and Pest Detection System Using Hybrid Lightweight CNN-MobileViT Models for Diagnosis of Indigenous Crops
ArXiv ID: 2512.11871
Date: 2025-12-06
Authors: Tekleab G. Gebremedhin, Hailom S. Asegede, Bruh W. Tesheme, Tadesse B. Gebremichael, Kalayu G. Redae
📝 Abstract
Agriculture supports over 80% of the population in the Tigray region of Ethiopia, where infrastructural disruptions limit access to expert crop disease diagnosis. We present an offline-first detection system centered on a newly curated indigenous cactus-fig (Opuntia ficus-indica) dataset consisting of 3,587 field images across three core symptom classes. Given deployment constraints in post-conflict edge environments, we benchmark three mobile-efficient architectures: a custom lightweight CNN, EfficientNet-Lite1, and the CNN-Transformer hybrid MobileViT-XS. While the broader system contains independent modules for potato, apple, and corn, this study isolates cactus-fig model performance to evaluate attention sensitivity and inductive-bias transfer on indigenous morphology alone. Results establish a clear Pareto trade-off: EfficientNet-Lite1 achieves 90.7% test accuracy; the lightweight CNN reaches 89.5% with the most favorable deployment profile (42 ms inference latency, 4.8 MB model size); and MobileViT-XS delivers 97.3% mean cross-validation accuracy, demonstrating that MHSA-based global reasoning disambiguates pest clusters from two-dimensional fungal lesions more reliably than local-texture CNN kernels. The ARM-compatible models are deployed in a Flutter application localized in Tigrigna and Amharic, supporting fully offline inference on Cortex-A53-class devices and strengthening inclusivity for food-security-critical diagnostics.
📄 Full Content
Agriculture is critical to livelihoods in Ethiopia's Tigray region, a post-conflict environment where damaged infrastructure disrupts expert crop disease diagnosis and timely advisory interventions [1]. The drought-resilient cactus-fig (Opuntia ficus-indica), locally known as "Beles," plays a dual socioeconomic role: as a seasonal buffer crop supporting food availability during pre-harvest scarcity and as an industrial raw material used in local manufacturing sectors [6].
Ecologically unique to the highland agro-system, cactus-fig acts as a strategic buffer against seasonal food scarcity, providing nutrition when alternative yields are depleted. Major production threats are the invasive cochineal insect (Dactylopius coccus) and fungal rots, which produce visually ambiguous symptoms (white wax clusters vs. mildew-like textures) capable of causing large-scale yield devastation if not detected early [6], [13].
Corresponding author. The source code and dataset are publicly available at: https://github.com/Tekleab15/Automated-plant-disease-and-pest-detection-system. † A shorter version of this work was presented at the International Conference on Postwar Technology for Recovery and Sustainable Development in Mekelle, Tigray, in February 2025.
Historically, disease diagnosis relied on manual inspection by agricultural extension workers. However, the recent conflict in the region has severely damaged infrastructure and disrupted the human advisory supply chain, leaving millions of farmers isolated from expert support. In this context, automated, offline-first diagnostic tools are not merely a convenience but a humanitarian necessity.
Recent advances in Computer Vision have enabled automated disease recognition, with Convolutional Neural Networks (CNNs) achieving high accuracy on global datasets like PlantVillage [3].
However, deploying these models for indigenous highland crops presents two specific scientific challenges: the data gap, since global benchmark datasets such as PlantVillage contain virtually no images of indigenous crops like cactus-fig, and the morphological gap, since cactus cladodes and their three-dimensional pest symptoms violate the planar-leaf assumptions under which most classifiers are trained.
To bridge these gaps, this paper presents an Automated Plant Disease and Pest Detection System built upon a novel, field-verified dataset of indigenous crops. We propose a hybrid approach, benchmarking a custom lightweight CNN (optimized for extreme efficiency) against the MobileViT-XS architecture, a CNN-ViT hybrid that combines the inductive bias of convolutions with the self-attention of Transformers.
Our specific contributions are:
• Indigenous Dataset Construction: We introduce a dataset of 3,587 annotated images of Opuntia ficus-indica, capturing real-world field variability (dust, shadows, mixed infections) often absent in laboratory datasets.
The remainder of this paper is organized as follows: Section II provides biological and technical background. Section IV details the dataset and hybrid model architectures. Section V presents the comparative results and explainability analysis. Section VI concludes with implications for food security.
Plant diseases manifest through complex visual cues such as chlorosis, necrotic spots, pustules, and wilting. From a computer vision perspective, these symptoms present a significant challenge due to their high intra-class variance and inter-class similarity. Symptoms are often non-uniform, multi-stage, and visually confounded by environmental factors such as dust, nutrient deficiencies, or physical damage.
The target crop, Opuntia ficus-indica (Cactus-fig), introduces unique morphological constraints. Unlike the planar leaf structures of broad-leaf crops (e.g., maize, apple), cactus cladodes (pads) are voluminous, waxy, and covered in glochids (spines). The primary pest threat, the cochineal insect (Dactylopius coccus), creates white, cottony wax clusters that can be visually indistinguishable from certain fungal mildews to standard texture-based classifiers. Discriminating between these 3D pest clusters and 2D fungal lesions requires a model capable of understanding both local texture and global geometric context.
CNNs have established themselves as the standard for agricultural image analysis due to their ability to learn hierarchical feature representations. A typical CNN is composed of stacked convolutional layers that extract local features (edges, textures) and pooling layers that introduce spatial invariance.
Formally, a convolutional layer computes a feature map F from an input X using a learnable kernel W and bias b:

$$F(i,j) = \sigma\Big(\sum_{m}\sum_{n} X(i+m,\, j+n)\, W(m,n) + b\Big)$$

where σ is a non-linear activation function (e.g., ReLU). While CNNs excel at extracting local patterns, their limited receptive field can struggle to capture long-range dependencies, such as the spatial distribution of a pest infestation across a large cladode.
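To make the stacked conv-ReLU-pool structure concrete, the sketch below shows a minimal PyTorch classifier over the three symptom classes. The class count follows the abstract, but the layer widths, input resolution, and depth are illustrative assumptions, not the authors' published lightweight CNN.

```python
import torch
import torch.nn as nn

class LightweightCNN(nn.Module):
    """Minimal stacked conv/pool classifier; sizes are illustrative
    assumptions, not the paper's exact architecture."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            # Each block: convolution F = sigma(X * W + b), then pooling
            # for spatial invariance.
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> one 64-d vector
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# Example: one 224x224 RGB field image -> logits over three symptom classes.
logits = LightweightCNN()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 3])
```

Note how each 3×3 kernel only ever sees a fixed local neighborhood; this is the limited receptive field the text refers to.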
To address the limitations of local receptive fields, Vision Transformers (ViTs) adapt the self-attention mechanism from Natural Language Processing to image data. Unlike CNNs, which process pixels in fixed neighborhoods, Transformers treat an image as a sequence of patches and compute the relationship between every patch pair simultaneously.
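A minimal sketch of this patch-based self-attention, assuming 16×16 patches, a 64-dimensional embedding, and 4 heads (illustrative values, not MobileViT-XS's actual configuration):

```python
import torch
import torch.nn as nn

# Patch embedding: a strided convolution splits the image into 16x16
# patches and projects each to a 64-d token (dimensions are assumptions).
patch_embed = nn.Conv2d(3, 64, kernel_size=16, stride=16)
attention = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(1, 3, 224, 224)                       # one RGB image
patches = patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, 64) token sequence
out, weights = attention(patches, patches, patches)   # every patch attends to every patch
print(out.shape, weights.shape)  # torch.Size([1, 196, 64]) torch.Size([1, 196, 196])
```

The 196×196 attention-weight matrix relates every patch pair in a single layer, which is what lets a model connect a wax cluster on one side of a cladode to symptoms elsewhere in the image, something a fixed-neighborhood convolution cannot do without many stacked layers.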