Use of Deep Learning Models in Acoustic Scene Classification
Abstract
Ambient sound analysis has become more prominent with the rise of portable and wearable devices. It provides valuable insights into a person's environment by analyzing surrounding sounds. Recently, deep learning methods, frequently used in image and text processing, have been applied to this field and are proving more effective than traditional machine learning techniques.
In this study, we evaluated the performance of several deep learning models on mel-spectrograms of three classes of acoustic scenes from the TAU Acoustic Scene 2023 dataset. Our results indicate that a simple Convolutional Neural Network (CNN) outperforms more complex models on this classification task. Despite having the fewest parameters, the CNN achieved the highest accuracy at 59%. This suggests that simpler models can be highly effective for acoustic scene classification, highlighting the value of efficient and computationally feasible approaches in this domain.
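The abstract's pipeline starts from mel-spectrogram features. The following is a minimal NumPy sketch of that feature extraction step, not the authors' actual implementation; all parameter values (sample rate, FFT size, hop length, number of mel bands) are illustrative assumptions.

```python
# Sketch of mel-spectrogram extraction (illustrative, not the paper's code).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # then project onto the mel filterbank.
    frames = []
    for start in range(0, len(y) - n_fft + 1, hop):
        frame = y[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T                         # (n_fft//2+1, n_frames)
    return mel_filterbank(sr, n_fft, n_mels) @ power   # (n_mels, n_frames)

# One second of random noise stands in for an audio clip.
spec = mel_spectrogram(np.random.default_rng(0).standard_normal(16000))
print(spec.shape)  # (40, 61)
```

The resulting (n_mels, n_frames) array is the 2-D "image" a CNN classifier would consume; in practice a library such as librosa is typically used for this step.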
Details
- Primary Language: English
- Subjects: Signal Processing
- Journal Section: Research Article
- Early Pub Date: July 2, 2025
- Publication Date: September 30, 2025
- Submission Date: November 15, 2024
- Acceptance Date: June 4, 2025
- Published in Issue: Year 2025, Volume 13, Number 3
