Modern Deep Learning Architectures for Urban Sound Classification on UrbanSound8K
Abstract
Environmental sound classification (ESC) is critical for monitoring noise pollution and supporting urban safety in smart-city applications. Although deep learning–based approaches achieve high performance in this domain, many published studies either rely on randomly partitioned datasets, which causes data leakage, or require pretraining on massive corpora (e.g., AudioSet) at high computational cost. In this study, we propose a methodologically robust and computationally efficient approach to ESC on the UrbanSound8K dataset. To ensure reliable results, we adopt the official 10-fold cross-validation protocol, which is considered the most challenging evaluation scheme in the literature. We compare the Vision Transformer (ViT) architecture with modern CNN architectures and analyze the impact of data augmentation techniques such as MixUp and SpecAugment on each. Under the official 10-fold protocol, ConvNeXt-Tiny achieves the best mean accuracy, reaching 83.94% with MixUp and 82.81% with the combined SpecAugment+MixUp setting, while ViT attains 81.94% under SpecAugment+MixUp. In contrast, random splitting artificially inflates accuracy to 98.06% because of leakage, underscoring the need for the official, leakage-free protocol.
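The difference between the official fold-based protocol and a clip-level split can be illustrated with a minimal sketch. It is not the authors' code: the helper names `official_fold_splits` and `shared_sources` are hypothetical, and the metadata is a synthetic toy table. The only dataset-specific assumption is that each row carries a `fold` label and a source-recording ID (`fsID`), as in the UrbanSound8K metadata CSV, and that all slices of one recording share a fold.

```python
def official_fold_splits(rows, n_folds=10):
    """Yield (test_fold, train_rows, test_rows), holding out one predefined fold at a time.

    `rows` is an iterable of dicts with a 'fold' key in 1..n_folds.
    """
    rows = list(rows)
    for test_fold in range(1, n_folds + 1):
        train = [r for r in rows if int(r["fold"]) != test_fold]
        test = [r for r in rows if int(r["fold"]) == test_fold]
        yield test_fold, train, test


def shared_sources(train, test, key="fsID"):
    """Return source-recording IDs present on both sides of a split (leakage indicator)."""
    return {r[key] for r in train} & {r[key] for r in test}


# Synthetic toy metadata (an assumption for illustration only):
# 20 source recordings, 4 slices each, every recording assigned to a single fold.
rows = [
    {"fsID": rec, "fold": rec % 10 + 1, "slice": s}
    for rec in range(20)
    for s in range(4)
]

# Fold-based protocol: no source recording appears in both train and test.
for fold, train, test in official_fold_splits(rows):
    assert not shared_sources(train, test)

# Clip-level split that ignores folds (a deterministic stand-in for random
# splitting): every 10th slice goes to the test set, the rest to training.
naive_test = rows[::10]
naive_train = [r for i, r in enumerate(rows) if i % 10 != 0]

leaked = shared_sources(naive_train, naive_test)
print(f"recordings shared between train and test: {len(leaked)}")
```

In the clip-level split, slices cut from the same recording land on both sides, so a model can score well by recognizing the recording rather than the sound class. Reporting the mean accuracy over the ten held-out folds avoids this.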
Keywords
Details
Primary Language: English
Subjects: Artificial Intelligence (Other)
Journal Section: Research Article
Authors: Ulaş Yurtsever* (ORCID 0000-0003-3438-6872), Türkiye
Early Pub Date: June 25, 2026
Publication Date: June 30, 2026
Submission Date: November 26, 2025
Acceptance Date: March 30, 2026
Published in Issue: Year 2026, Volume 9, Number 3
