Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals

Yunus Korkmaz

doi:10.36222/ejt.1761640

Ses Sinyallerinden Duygu Tanıma için YAMNet ve VGGish Ağlarının Performans Analizi

Abstract

Understanding human emotions through vocal cues is key to developing emotionally intelligent systems, particularly in fields such as human-computer interaction, healthcare, and virtual assistants. However, accurately recognizing emotions from speech remains a challenging task because of variability in speaker characteristics, acoustic conditions, and the subtle, often overlapping nature of emotional states. This study presents a comparative analysis of transfer learning methods for speech emotion recognition (SER) using pretrained audio-based neural networks. Specifically, the YAMNet and VGGish models were used both as static feature extractors and with fine-tuning. The resulting embeddings were classified with traditional machine learning algorithms such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forests (RF), and Logistic Regression (LR). Experiments were carried out on two widely used emotional speech datasets: RAVDESS and EmoDB. Performance in the feature extraction stage was evaluated using accuracy, F1 score, the confusion matrix, and the area under the ROC curve (AUC). In the fine-tuning stage, class-wise precision, recall, F1 score, class-specific AUC, and ROC curves were used in addition to accuracy. The results show that VGGish consistently outperforms YAMNet in both the feature extraction and fine-tuning scenarios. On the EmoDB dataset, the highest classification accuracy (73.83%) was obtained using VGGish features with LR. In addition, fine-tuning VGGish on EmoDB yielded a competitive accuracy of 72.90%, demonstrating its effectiveness in learning emotion representations.
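The class-wise metrics named in the abstract (precision, recall, F1 score) alongside overall accuracy can be illustrated with a minimal sketch. The function name `per_class_metrics` and the toy labels below are hypothetical and chosen only for illustration. They are not the study's code or data.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall and F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1          # correct prediction for this class
        else:
            fp[pred] += 1           # predicted class gains a false positive
            fn[truth] += 1          # true class gains a false negative

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return accuracy, report


# Toy labels (assumption: invented for illustration, not from RAVDESS or EmoDB).
y_true = ["sadness", "sadness", "anger", "anger", "neutral", "neutral"]
y_pred = ["sadness", "sadness", "anger", "neutral", "neutral", "sadness"]

accuracy, report = per_class_metrics(y_true, y_pred)
for label, scores in report.items():
    print(label, scores)
print("accuracy:", accuracy)
```

In this toy example every true "sadness" sample is predicted as "sadness", so its recall is 1 even though its precision is lower. This is why class-wise recall is usually read together with precision and F1.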

Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals

Abstract

Understanding human emotions through vocal cues is key to developing emotionally intelligent systems, particularly in fields such as human-computer interaction, healthcare, and virtual assistants. However, accurately recognizing emotions from speech remains a challenging task due to variability in speaker traits, acoustic conditions, and the subtle, often overlapping nature of emotional states. This study presents a comparative analysis of transfer learning methods for speech emotion recognition (SER) using pretrained audio-based neural networks. Specifically, the YAMNet and VGGish models were employed both as static feature extractors and in a fine-tuning setup. The extracted embeddings were classified using traditional machine learning algorithms, including Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forests (RF), and Logistic Regression (LR). Experiments were conducted on two widely used emotional speech datasets: RAVDESS and EmoDB. The results demonstrate that VGGish consistently outperforms YAMNet in both feature extraction and fine-tuning scenarios. The highest classification accuracy (73.83%) was achieved using VGGish features with LR on EmoDB. Additionally, fine-tuning VGGish on EmoDB yielded a competitive accuracy of 72.90%. Class-specific analysis also showed that the highest AUC score, 0.9635, was obtained with LR in the VGGish + EmoDB setting, while fine-tuning YAMNet and VGGish on the EmoDB dataset each reached a recall of 1 for the ‘Sadness’ emotion.
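The feature-extraction setup described above, in which fixed embeddings are passed to a traditional classifier, can be sketched with a small K-Nearest Neighbors example. This is a minimal illustration under stated assumptions: the three-dimensional vectors stand in for real YAMNet or VGGish embeddings, the function name `knn_predict`, the Euclidean distance, and `k=3` are choices made here, and none of it reflects the study's actual hyperparameters or data.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    if len(train_vectors) != len(train_labels):
        raise ValueError("train_vectors and train_labels must have the same length")
    if not 1 <= k <= len(train_vectors):
        raise ValueError("k must be between 1 and the number of training vectors")

    # Euclidean distance from the query to every training embedding, nearest first.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy "embeddings" (assumption: invented values, far smaller than real embeddings).
train_vectors = [
    (0.9, 0.1, 0.0),
    (0.8, 0.2, 0.1),
    (0.1, 0.9, 0.0),
    (0.2, 0.8, 0.1),
    (0.0, 0.1, 0.9),
]
train_labels = ["anger", "anger", "sadness", "sadness", "neutral"]

print(knn_predict(train_vectors, train_labels, (0.85, 0.15, 0.05)))
print(knn_predict(train_vectors, train_labels, (0.15, 0.85, 0.05)))
```

In the static feature-extraction setting the pretrained network is not updated, so only the lightweight classifier on top of the embeddings is fitted. Fine-tuning, by contrast, also adjusts the network weights.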

Details

Primary Language

English

Subjects

Computer Software

Journal Section

Research Article

Authors

Yunus Korkmaz
ORCID: 0000-0002-6315-5750
Türkiye

Publication Date

December 31, 2025

Submission Date

August 9, 2025

Acceptance Date

October 2, 2025

Published in Issue

Year 2025 Volume: 15 Number: 2

DOI

https://doi.org/10.36222/ejt.1761640

IZ

https://izlik.org/JA29AJ92XP

Cite

APA

Korkmaz, Y. (2025). Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. European Journal of Technique (EJT), 15(2), 251-260. https://doi.org/10.36222/ejt.1761640

AMA

1. Korkmaz Y. Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. EJT. 2025;15(2):251-260. doi:10.36222/ejt.1761640

Chicago

Korkmaz, Yunus. 2025. “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”. European Journal of Technique (EJT) 15 (2): 251-60. https://doi.org/10.36222/ejt.1761640.

EndNote

Korkmaz Y (December 1, 2025) Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. European Journal of Technique (EJT) 15 2 251–260.

IEEE

[1] Y. Korkmaz, “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”, EJT, vol. 15, no. 2, pp. 251–260, Dec. 2025, doi: 10.36222/ejt.1761640.

ISNAD

Korkmaz, Yunus. “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”. European Journal of Technique (EJT) 15/2 (December 1, 2025): 251-260. https://doi.org/10.36222/ejt.1761640.

JAMA

1. Korkmaz Y. Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. EJT. 2025;15:251–260.

MLA

Korkmaz, Yunus. “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”. European Journal of Technique (EJT), vol. 15, no. 2, Dec. 2025, pp. 251-60, doi:10.36222/ejt.1761640.

Vancouver

1. Yunus Korkmaz. Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. EJT. 2025 Dec. 1;15(2):251-60. doi:10.36222/ejt.1761640