TCLeaf-Net: a transformer-convolution framework with global-local attention for robust in-field lesion-level plant leaf disease detection
Timely and accurate detection of foliar diseases is vital for safeguarding crop growth and reducing yield losses. Yet, in real-field conditions, cluttered backgrounds, domain shifts, and limited lesion-level datasets hinder robust modeling. To address these challenges, we release Daylily-Leaf, a paired lesion-level dataset comprising 1,746 RGB images and 7,839 lesions captured under both ideal and in-field conditions, and propose TCLeaf-Net, a transformer-convolution hybrid detector optimized for real-field use. TCLeaf-Net is designed to tackle three major challenges. To mitigate interference from complex backgrounds, the transformer-convolution module (TCM) couples global context with locality-preserving convolution to suppress non-leaf regions. To reduce information loss during downsampling, the raw-scale feature recalling and sampling (RSFRS) block combines bilinear resampling and convolution to preserve fine spatial detail. To handle variations in lesion scale and feature shifts, the deformable alignment block with FPN (DFPN) employs offset-based alignment and multi-receptive-field perception to strengthen multi-scale fusion. Experimental results show that on the in-field split of the Daylily-Leaf dataset, TCLeaf-Net improves mAP@50 by 5.4 percentage points over the baseline model, reaching 78.2%, while reducing computation by 7.5 GFLOPs and GPU memory usage by 8.7%. Moreover, the model outperforms recent YOLO and RT-DETR series in both precision and recall, and demonstrates strong performance on the PlantDoc, Tomato-Leaf, and Rice-Leaf datasets, validating its robustness and generalizability to other plant disease detection scenarios.
💡 Research Summary
The paper addresses the pressing need for accurate lesion‑level detection of foliar diseases under real‑field conditions, where cluttered backgrounds, illumination changes, and large variations in lesion size severely limit the performance of existing CNN‑based detectors. To this end, the authors first construct and publicly release a high‑quality dataset named Daylily‑Leaf, comprising 1,746 RGB images and 7,839 annotated lesions captured both in controlled laboratory settings (ideal) and in authentic agricultural environments (in‑field). This dataset fills a notable gap in the community, as most public plant‑disease datasets contain only whole‑leaf labels or are collected under ideal lighting and simple backgrounds.
Building on this dataset, the authors propose TCLeaf‑Net, a transformer‑convolution hybrid detector specifically engineered for field deployment. The architecture consists of three major components:
- Transformer‑Convolution Module (TCM) – The backbone replaces a pure CNN or a single‑stream CNN‑to‑Transformer pipeline with a parallel tri‑branch design called the Transformer‑Convolution Layer (TCL). Each TCL contains a Global‑Attention Module (GAM) that employs Efficient Attention (EA) for low‑cost long‑range dependency modeling, a Local‑Attention Module (LAM) built from a 3×3 Conv‑BN‑ReLU block to preserve fine‑grained edge and texture cues, and a residual branch to stabilize training. Four stacked TCLs form the TCM, enabling simultaneous capture of global context (leaf‑level semantics) and local details (lesion edges) while mitigating the “diffuse‑attention” problem typical of vanilla transformers in cluttered scenes.
- Raw‑Scale Feature Recalling and Sampling (RSFRS) – Standard non‑overlapping patch embedding causes spatial discontinuities and loss of boundary precision. The authors instead introduce Small‑Step Overlapping Patch Embedding (SSOPE) with a 3×3 stride‑2 kernel to retain spatial continuity. RSFRS then fuses a learnable stride‑2 convolution with bilinear interpolation via a 1×1 convolution, effectively recalling high‑resolution cues lost during down‑sampling. This module is crucial for preserving the subtle visual patterns of small lesions that would otherwise be erased.
- Deformable Alignment Block with Feature Pyramid Network (DFPN) – Multi‑scale feature fusion in conventional FPNs suffers from misalignment, especially when lesions appear at vastly different scales. DFPN incorporates deformable convolutions to predict per‑pixel offsets, aligning features across pyramid levels. Additionally, a Multi‑Receptive‑Field Perception (MRFP) sub‑module aggregates information from varied receptive fields, strengthening the detector’s ability to handle both tiny and large lesions.
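The GAM builds on Efficient Attention, which softmax-normalizes keys over positions and queries over channels so that a small global context matrix can be formed before the queries are involved, reducing cost from quadratic to linear in token count. A minimal NumPy sketch of that factorization (function names are illustrative, not the authors' code):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v):
    """Linear-complexity attention: O(n * d^2) instead of O(n^2 * d).

    q, k: (n, d_k) token features; v: (n, d_v) values.
    Keys are normalized over the n positions and queries over the d_k
    channels, so a global (d_k, d_v) context matrix can be formed once
    and then distributed to every query.
    """
    k_norm = softmax(k, axis=0)   # normalize over positions
    q_norm = softmax(q, axis=1)   # normalize over channels
    context = k_norm.T @ v        # (d_k, d_v) global summary
    return q_norm @ context       # (n, d_v)

rng = np.random.default_rng(0)
out = efficient_attention(rng.standard_normal((64, 8)),
                          rng.standard_normal((64, 8)),
                          rng.standard_normal((64, 16)))
```

Because the `(d_k, d_v)` context is shared by all queries, the cost never involves an `n × n` attention map, which is what makes this practical inside a detection backbone.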
After the backbone, a decoupled detection head predicts class scores and bounding‑box coordinates independently, reducing interference between classification and localization. The entire pipeline processes 640×640 inputs, passes them through SSOPE → RSFRS → TCM → RSFRS → SPPF (spatial pyramid pooling – fast) → DFPN, and finally outputs detections after confidence thresholding and non‑maximum suppression.
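The RSFRS down-sampling step described above, which fuses a learnable strided convolution with parameter-free bilinear resampling, can be illustrated on a single-channel toy feature map. This is a hedged sketch, not the paper's implementation: the scalar `alpha` stands in for the 1×1 fusion convolution, and 2×2 averaging approximates bilinear halving for even-sized inputs:

```python
import numpy as np

def strided_conv3x3(x, w, stride=2):
    """'Same'-padded 3x3 convolution with stride 2 on a 2-D map."""
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros(((H + 1) // 2, (W + 1) // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (xp[2 * i:2 * i + 3, 2 * j:2 * j + 3] * w).sum()
    return out

def bilinear_half(x):
    """2x2 average pooling ~ bilinear downsampling by 2 (even sizes)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def rsfrs_downsample(x, w, alpha=0.5):
    """Blend learnable (conv) and fixed (bilinear) halvings; alpha is a
    stand-in for the paper's 1x1 fusion convolution."""
    return alpha * strided_conv3x3(x, w) + (1 - alpha) * bilinear_half(x)

x = np.arange(64, dtype=float).reshape(8, 8)
y = rsfrs_downsample(x, np.ones((3, 3)) / 9.0, alpha=0.0)
```

Both branches produce the same halved spatial size, so the fusion is a simple per-pixel blend; the bilinear path keeps a smoothed copy of the raw-scale signal that a strided convolution alone could discard.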
Experimental Evaluation
On the in‑field split of Daylily‑Leaf, TCLeaf‑Net achieves a mean Average Precision at IoU = 0.5 (mAP@50) of 78.2 %, a 5.4 percentage‑point gain over the strongest baseline. Moreover, the model reduces computational cost by 7.5 GFLOPs and GPU memory consumption by 8.7 % compared with comparable YOLO and RT‑DETR variants, demonstrating suitability for edge devices such as drones or mobile robots. Cross‑dataset tests on PlantDoc, Tomato‑Leaf, and Rice‑Leaf confirm the model’s generalizability: it consistently outperforms recent detectors in both precision and recall, particularly under challenging background conditions.
Ablation Studies reveal that removing TCM leads to a dramatic increase in false positives from background objects, while omitting RSFRS drops small‑lesion detection rates by over 12 percentage points. Excluding DFPN causes multi‑scale fusion to degrade, especially for lesions smaller than 32 × 32 pixels. Grad‑CAM visualizations illustrate that TCM focuses attention on leaf regions and suppresses irrelevant textures, validating the design rationale.
Contributions
- Release of two meticulously annotated lesion‑level datasets (ideal and in‑field) for daylily, filling a critical gap for object‑detection‑oriented plant disease research.
- Introduction of a novel transformer‑convolution hybrid backbone (TCM) that balances global semantic reasoning with local texture fidelity, effectively mitigating background interference.
- Development of RSFRS for high‑resolution feature preservation during down‑sampling, and DFPN for deformable, multi‑scale feature alignment.
- Comprehensive empirical validation showing superior accuracy, efficiency, and robustness across multiple crops and disease datasets.
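The offset-based alignment underlying DFPN's deformable convolutions amounts to resampling a feature map at fractional, per-pixel-shifted locations via bilinear interpolation. A minimal single-channel NumPy sketch of that resampling step (names are illustrative; real deformable convolutions also learn the offsets and apply a kernel afterwards):

```python
import numpy as np

def sample_bilinear(feat, y, x):
    """Bilinearly sample feat (H, W) at a fractional location (y, x)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    wy, wx = y - y0, x - x0
    y0 = int(np.clip(y0, 0, H - 1))
    x0 = int(np.clip(x0, 0, W - 1))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    return ((1 - wy) * (1 - wx) * feat[y0, x0] +
            (1 - wy) * wx * feat[y0, x1] +
            wy * (1 - wx) * feat[y1, x0] +
            wy * wx * feat[y1, x1])

def deform_align(feat, offsets):
    """Resample each position (i, j) at (i + dy, j + dx): the
    offset-based alignment idea, with offsets given rather than learned."""
    H, W = feat.shape
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            dy, dx = offsets[i, j]
            out[i, j] = sample_bilinear(feat, i + dy, j + dx)
    return out

feat = np.arange(16, dtype=float).reshape(4, 4)
# A constant (dy, dx) = (0, 1) offset shifts every pixel one column left.
shifted = deform_align(feat, np.full((4, 4, 2), [0.0, 1.0]))
```

In DFPN the offsets are predicted by a small convolution from the features themselves, letting each pyramid level warp its neighbor into spatial agreement before fusion.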
Suggested future directions include extending the framework with domain‑adaptation techniques to further close the gap between laboratory and field domains, exploring lightweight variants for ultra‑low‑power edge hardware, and expanding the dataset to cover additional crops and disease types. Overall, TCLeaf‑Net represents a significant step toward practical, high‑precision, lesion‑level plant disease monitoring in real agricultural settings.