The accurate assessment of tympanic membrane (TM) conditions, such as effusion, normal membrane appearance, or the presence of ventilation tubes, is essential for timely diagnosis and treatment of otitis media (OM), one of the leading causes of preventable hearing loss in children. However, visual otoscopic examination remains highly subjective, and diagnostic accuracy varies widely across clinicians because of inconsistent image quality, limited training, and subtle variations in membrane morphology. Recent deep learning–based approaches have shown strong performance, yet most rely on large-scale pretraining or heavyweight convolutional networks that are impractical for point-of-care deployment. Vision Transformers (ViTs) have emerged as powerful feature extractors, but their reliance on fixed patch tokenization and a global class token results in large parameter counts and suboptimal performance on small medical datasets.
In this study, we propose a lightweight Compact Convolutional Transformer (CCT) for TM image classification, trained from scratch on a clinical otoscopy dataset. Unlike standard ViT architectures, CCT uses convolutional tokenization to extract local visual patterns before self-attention and replaces the class token with sequence pooling, reducing model complexity while preserving global reasoning. We conduct a structured ablation study that varies both the convolutional kernel size (3×3, 5×5, 7×7) and the transformer encoder depth (3, 5, 7 layers), yielding nine model configurations. The best configuration (7×7 kernel, 3 transformer layers) achieves 91.21% accuracy and 90.65% macro F1 on the test set with only 3.26M parameters, outperforming the deeper models and offering the best efficiency–performance trade-off. The results show that larger tokenizer kernels capture broader visual patterns of the TM, while excessive transformer depth may lead to overfitting on small datasets.
These findings indicate that compact transformer architectures can deliver high diagnostic performance without transfer learning or large data requirements, supporting their potential for real-time clinical decision support and integration into low-resource or mobile otoscopy systems.
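As a rough illustration of two ideas named in the abstract, the sketch below shows sequence pooling (a learned, softmax-weighted average over tokens that stands in for the class token) and the 3 × 3 ablation grid of kernel sizes and encoder depths. It is a minimal pure-Python sketch, not the study's implementation: the function names `sequence_pool` and `ablation_grid`, the plain-list tensors, and the single linear scoring layer are assumptions made for illustration only.

```python
import math
from itertools import product


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_pool(tokens, weights, bias=0.0):
    """Collapse an (n_tokens x dim) sequence into a single dim-sized vector.

    Hypothetical helper: each token receives a scalar score from a linear
    map (weights, bias), the softmax of the scores gives per-token
    importance, and the output is the importance-weighted sum of tokens.
    """
    scores = [sum(w * x for w, x in zip(weights, tok)) + bias for tok in tokens]
    importance = softmax(scores)
    dim = len(tokens[0])
    return [sum(a * tok[j] for a, tok in zip(importance, tokens)) for j in range(dim)]


# Values taken from the abstract: three kernel sizes and three encoder depths.
KERNEL_SIZES = (3, 5, 7)
ENCODER_DEPTHS = (3, 5, 7)


def ablation_grid():
    """Enumerate every (kernel size, depth) pair of the ablation study."""
    return [
        {"kernel_size": k, "depth": d}
        for k, d in product(KERNEL_SIZES, ENCODER_DEPTHS)
    ]


if __name__ == "__main__":
    # Toy input: two tokens of dimension two, with an all-zero scoring layer.
    pooled = sequence_pool([[1.0, 2.0], [3.0, 4.0]], weights=[0.0, 0.0])
    print(pooled)
    print(len(ablation_grid()), "configurations")
```

With an all-zero scoring layer every token gets equal importance, so the pooled vector reduces to the plain mean of the tokens; a trained scoring layer would instead weight informative tokens more heavily.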
compact convolutional transformer, otoscopy classification, deep learning, vision transformer, medical image diagnosis
This study did not involve human participants or the collection of new data. All analyses were performed on publicly available, de-identified datasets, used in accordance with their licenses and terms of use. As this constitutes secondary analysis of open data with no intervention or interaction with individuals, institutional ethics approval and informed consent were not required. The work adheres to applicable ethical standards for research using publicly accessible datasets.
The author would like to thank the creators of the publicly available tympanic membrane dataset on Zenodo (https://zenodo.org/records/3595567) for making their data accessible to the research community.
| Primary Language | English |
|---|---|
| Subjects | Biomedical Diagnosis |
| Journal Section | Research Article |
| Authors | |
| Submission Date | November 11, 2025 |
| Acceptance Date | December 22, 2025 |
| Publication Date | December 30, 2025 |
| Published in Issue | Year 2025 Volume: 11 Issue: 2 |

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.