Use of Deep Learning Models in Acoustic Scene Classification
Abstract
Ambient sound analysis has become more prominent with the rise of portable and wearable devices. It provides valuable insights into a person's environment by analyzing surrounding sounds. Recently, deep learning methods, frequently used in image and text processing, have been applied to this field and are proving more effective than traditional machine learning techniques.
In this study, we evaluated the performance of several deep learning models on mel-spectrograms of three classes of acoustic scenes from the TAU Acoustic Scene 2023 dataset. Our results indicate that a simple Convolutional Neural Network (CNN) outperforms more complex models on this classification task. Despite having the fewest parameters, the CNN achieved the highest accuracy at 59%. This suggests that simpler models can be highly effective for acoustic scene classification, highlighting the value of efficient and computationally feasible approaches in this domain.
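The abstract's pipeline starts from mel-spectrogram features. The following is a minimal NumPy sketch of that feature extraction step, not the authors' actual implementation; all parameter values (sample rate, FFT size, hop length, number of mel bands) are illustrative assumptions.

```python
# Sketch of mel-spectrogram extraction (illustrative, not the paper's code).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # then project onto the mel filterbank.
    frames = []
    for start in range(0, len(y) - n_fft + 1, hop):
        frame = y[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T                         # (n_fft//2+1, n_frames)
    return mel_filterbank(sr, n_fft, n_mels) @ power   # (n_mels, n_frames)

# One second of random noise stands in for an audio clip.
spec = mel_spectrogram(np.random.default_rng(0).standard_normal(16000))
print(spec.shape)  # (40, 61)
```

The resulting (n_mels, n_frames) array is the 2-D "image" a CNN classifier would consume; in practice a library such as librosa is typically used for this step.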
Details
- Primary Language: English
- Subjects: Signal Processing
- Journal Section: Research Article
- Early Pub Date: July 2, 2025
- Publication Date: September 30, 2025
- Submission Date: November 15, 2024
- Acceptance Date: June 4, 2025
- Published in Issue: Year 2025, Volume 13, Number 3
