LAViT: Class-Aware Vision Transformer with Learnable Attention for Pest Recognition in Agricultural Fields
Abstract
Insect pests pose a significant threat to agricultural productivity, making early and accurate identification essential for effective pest management. This study proposes a novel deep learning-based classification framework for multi-class pest recognition from field images. The proposed approach enhances discriminative region representation by integrating a patch embedding module, a Vision Transformer backbone, and a learnable spatial attention mask. This hybrid design enables the model to focus on critical visual cues without requiring segmentation-based preprocessing. The attention mask, learned via convolutional layers, is pooled and directly applied to Transformer-encoded patch tokens to refine spatial feature emphasis. Positional embeddings are further employed to preserve spatial context within the tokenized image representation. Experimental evaluations conducted on publicly available pest datasets with 5, 9, and 12 classes demonstrate the effectiveness and robustness of the proposed framework. The model achieves accuracies of 99.67%, 99.52%, and 97.00% on the Pest5, Pest9, and Pest12 datasets, respectively, indicating strong generalization across varying classification complexities. To enhance model transparency and reliability, visual interpretability is provided through Grad-CAM and attention heatmap visualizations that reveal the model’s focus regions. Additionally, t-SNE-based feature visualization illustrates clear separability in the learned embedding space. The proposed framework shows strong potential for practical deployment in smart agriculture and precision pest monitoring systems.
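The mask step described in the abstract (a spatial mask that is pooled and applied to the Transformer-encoded patch tokens) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pool_mask_to_patches` and `apply_mask_to_tokens` are hypothetical, average pooling and element-wise multiplication are assumed because the abstract does not specify the pooling type or how the mask is combined with the tokens, and a hand-written array stands in for the mask that the paper learns with convolutional layers.

```python
import numpy as np


def pool_mask_to_patches(mask: np.ndarray, patch: int) -> np.ndarray:
    """Average-pool an (H, W) mask to one weight per patch, flattened row-major.

    Assumption: average pooling; the abstract only says the mask "is pooled".
    """
    h, w = mask.shape
    if h % patch or w % patch:
        raise ValueError("mask height and width must be divisible by the patch size")
    # Split each axis into (number of patches, pixels per patch), then average
    # over the two pixel axes to get one value per patch.
    grid = mask.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return grid.reshape(-1)


def apply_mask_to_tokens(tokens: np.ndarray, mask: np.ndarray, patch: int) -> np.ndarray:
    """Scale each patch token by its pooled mask weight.

    Assumption: plain multiplication; the paper may use a different rule.
    tokens has shape (num_patches, dim), in the same row-major patch order.
    """
    weights = pool_mask_to_patches(mask, patch)
    if weights.shape[0] != tokens.shape[0]:
        raise ValueError("number of tokens must match number of patches")
    return tokens * weights[:, None]


# Toy example: a 4x4 image split into 2x2 patches gives 4 tokens.
mask = np.zeros((4, 4))
mask[0:2, 2:4] = 1.0          # highlight only the top-right patch
tokens = np.ones((4, 3))      # stand-in for Transformer-encoded patch tokens

weights = pool_mask_to_patches(mask, patch=2)
refined = apply_mask_to_tokens(tokens, mask, patch=2)
```

In this toy case only the token for the top-right patch keeps its values, while the other three are scaled to zero, which is the sense in which a pooled mask can emphasize some regions without a separate segmentation step.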
Keywords
Details
Primary Language
English
Subjects
Evolutionary Computation
Journal Section
Research Article
Publication Date
March 24, 2026
Submission Date
July 10, 2025
Acceptance Date
December 31, 2025
Published in Issue
Year 2026 Volume: 32 Number: 2