Research Article

Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals

Volume: 15 Number: 2 December 31, 2025

Abstract

Understanding human emotions through vocal cues is key to developing emotionally intelligent systems, particularly in fields such as human-computer interaction, healthcare, and virtual assistants. However, accurately recognizing emotions from speech remains challenging because of the variability in speaker traits and acoustic conditions and the subtle, often overlapping nature of emotional states. This study presents a comparative analysis of transfer learning methods for speech emotion recognition (SER) using pretrained audio neural networks. Specifically, the YAMNet and VGGish models were employed both as static feature extractors and in a fine-tuning setup. The extracted embeddings were classified using traditional machine learning algorithms: Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forests (RF), and Logistic Regression (LR). Experiments were conducted on two widely used emotional speech datasets, RAVDESS and EmoDB. The results show that VGGish consistently outperforms YAMNet in both the feature extraction and fine-tuning scenarios. The highest classification accuracy (73.83%) was achieved using VGGish features with LR on EmoDB, and fine-tuning VGGish on EmoDB yielded a competitive accuracy of 72.90%. Class-specific analysis showed that the highest AUC score, 0.9635, was obtained with LR in the VGGish + EmoDB setting, while fine-tuning both YAMNet and VGGish on the EmoDB dataset reached a recall of up to 1 for the ‘Sadness’ emotion.
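
The static feature-extraction setup described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation. It assumes that frame-level embeddings are mean-pooled into one vector per clip (the abstract does not state the pooling method), it uses tiny hypothetical 3-dimensional vectors in place of real YAMNet or VGGish embeddings, and it shows only KNN out of the four classifiers. All function and variable names are illustrative.

```python
from collections import Counter
from math import dist


def mean_pool(frames):
    """Average frame-level embeddings into one clip-level vector (assumed pooling)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of clips truly belonging to `label` that were predicted as `label`."""
    hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
    return sum(hits) / len(hits)


# Hypothetical toy data: each clip is (list of frame embeddings, emotion label).
train_clips = [
    ([[0.9, 0.1, 0.0], [1.1, -0.1, 0.0]], "sadness"),
    ([[0.8, 0.2, 0.1], [1.0, 0.0, 0.1]], "sadness"),
    ([[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]], "anger"),
    ([[0.1, 1.1, 0.1], [0.1, 0.9, 0.1]], "anger"),
]
test_clips = [
    ([[1.0, 0.0, 0.1], [0.8, 0.2, 0.1]], "sadness"),
    ([[0.0, 0.9, 0.0], [0.2, 1.1, 0.0]], "anger"),
    ([[0.6, 0.5, 0.0], [0.6, 0.3, 0.0]], "anger"),  # ambiguous clip
]

# Pool frames to clip vectors, then classify each test clip with KNN.
train_x = [mean_pool(frames) for frames, _ in train_clips]
train_y = [label for _, label in train_clips]
y_true = [label for _, label in test_clips]
y_pred = [knn_predict(train_x, train_y, mean_pool(frames)) for frames, _ in test_clips]

print("accuracy:", accuracy(y_true, y_pred))
for emotion in ("sadness", "anger"):
    print(f"recall[{emotion}]:", recall(y_true, y_pred, emotion))
```

The per-class recall computed here is the same kind of quantity as the 'Sadness' recall reported in the abstract: a value of 1 means every clip of that class was recognized, independent of how the other classes fared.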

Details

Primary Language

English

Subjects

Computer Software

Journal Section

Research Article

Publication Date

December 31, 2025

Submission Date

August 9, 2025

Acceptance Date

October 2, 2025

Published in Issue

Year 2025 Volume: 15 Number: 2

APA
Korkmaz, Y. (2025). Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. European Journal of Technique (EJT), 15(2), 251-260. https://doi.org/10.36222/ejt.1761640
AMA
1. Korkmaz Y. Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. EJT. 2025;15(2):251-260. doi:10.36222/ejt.1761640
Chicago
Korkmaz, Yunus. 2025. “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”. European Journal of Technique (EJT) 15 (2): 251-60. https://doi.org/10.36222/ejt.1761640.
EndNote
Korkmaz Y (December 1, 2025) Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. European Journal of Technique (EJT) 15 2 251–260.
IEEE
[1] Y. Korkmaz, “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”, EJT, vol. 15, no. 2, pp. 251–260, Dec. 2025, doi: 10.36222/ejt.1761640.
ISNAD
Korkmaz, Yunus. “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”. European Journal of Technique (EJT) 15/2 (December 1, 2025): 251-260. https://doi.org/10.36222/ejt.1761640.
JAMA
1. Korkmaz Y. Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. EJT. 2025;15:251–260.
MLA
Korkmaz, Yunus. “Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals”. European Journal of Technique (EJT), vol. 15, no. 2, Dec. 2025, pp. 251-60, doi:10.36222/ejt.1761640.
Vancouver
1. Yunus Korkmaz. Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals. EJT. 2025 Dec. 1;15(2):251-60. doi:10.36222/ejt.1761640

All articles published by EJT are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit, and adapt the work, provided the original work and source are appropriately cited.