VOICE AND IMAGE BASED EMOTION RECOGNITION WITH DEEP LEARNING

Year 2026, Volume: 14, Issue: 1, 97–112, 01.03.2026
https://doi.org/10.36306/konjes.1574874
https://izlik.org/JA53SY69KL

Abstract

Emotion is a phenomenon that is reflected in every moment of an individual's life. The way an emotional state is expressed can be complex and differs from one individual to another. Facial expressions and changes in the voice are two ways of expressing emotions. In this study, a voice- and image-based system was implemented for emotion recognition. Since no Turkish dataset was available for speech-based emotion recognition, an original dataset named TR-EmotionSpeech was prepared for this study. Likewise, a facial expression dataset named TRFace-40 was developed to recognize visual emotional cues. TR-EmotionSpeech consists of recordings from 40 different Turkish-speaking people and contains 2000 audio files covering 6 different emotions. TRFace-40 consists of facial images of 40 different people captured from different angles. Because the system is intended to perform recognition in real time, distortions of the kind a camera can introduce were added to the image samples, and with these modifications a new dataset of 40000 images was created. These modifications contributed significantly to improving the overall recognition accuracy. First, pre-processing and feature extraction were applied to the audio files; the resulting features were then classified with Long Short-Term Memory (LSTM) networks. The speech emotion recognition accuracy of the system was 75.18%. The YOLOv5, YOLOv6, YOLOv7, and YOLOv8 architectures were used for image-based recognition, and the highest accuracy, 97.82%, was achieved with YOLOv8.
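
The abstract describes the audio pipeline only at a high level (pre-processing, feature extraction, LSTM classification) and does not name the feature set or network configuration. The sketch below is a minimal illustration of such a pipeline, assuming MFCC features extracted with librosa and a small Keras LSTM classifier over six emotion classes; the feature type, sequence length, layer sizes, file lists, and training settings are illustrative assumptions, not the authors' reported setup.

```python
# Minimal sketch of an audio emotion-recognition branch: MFCC feature extraction
# followed by an LSTM classifier over six emotion classes. All paths, dimensions,
# and hyperparameters below are illustrative placeholders, not the paper's settings.
import numpy as np
import librosa
import tensorflow as tf

N_MFCC = 40        # MFCC coefficients per frame (assumed)
MAX_FRAMES = 300   # sequences padded/truncated to this length (assumed)
N_CLASSES = 6      # six emotions, as stated in the abstract

def extract_mfcc(path: str) -> np.ndarray:
    """Load an audio file and return a (MAX_FRAMES, N_MFCC) MFCC sequence."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC).T  # (frames, N_MFCC)
    # Pad with zero rows or truncate so every sample has the same number of frames.
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

def build_model() -> tf.keras.Model:
    """A small stacked-LSTM classifier; layer sizes are assumptions."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(MAX_FRAMES, N_MFCC)),
        tf.keras.layers.Masking(mask_value=0.0),           # ignore zero-padded frames
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage (wav_paths and labels are placeholders, not the
# TR-EmotionSpeech layout):
# X = np.stack([extract_mfcc(p) for p in wav_paths])  # (n_samples, MAX_FRAMES, N_MFCC)
# y = np.array(labels)                                # integer labels in [0, 5]
# model = build_model()
# model.fit(X, y, validation_split=0.2, epochs=30, batch_size=32)
```

Padding each recording to a fixed number of frames and masking the padded rows is a common way to feed utterances of different durations to an LSTM in a single batch; the study itself may use a different strategy.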

References

  • V. V. Narasimha, R. Saravanakumar, N. Yusuf, R. Pradhan, H. Hamdi, K. A. Saravanan, V. S. Rao, and M. A. Askar, "Enhancing emotion prediction using deep learning and distributed federated systems with SMOTE oversampling technique", Alexandria Engineering Journal, 108, 498–508, 2024. https://doi.org/10.1016/j.aej.2024.07.081
  • F. G. Eriş and E. Akbal, "Enhancing speech emotion recognition through deep learning and handcrafted feature fusion", Applied Acoustics, 222, 110070, 2024. https://doi.org/10.1016/j.apacoust.2024.110070
  • D. Weber, and B. Kostek, "Bimodal deep learning model for subjectively enhanced emotion classification in films", Information Sciences, 678, 121049, 2024. https://doi.org/10.1016/j.ins.2024.121049
  • R. K. Gupta, and R. Sinha, "Deep multi-task learning based detection of correlated mental disorders using audio modality", Computer Speech & Language, 89, 101710, 2025. https://doi.org/10.1016/j.csl.2024.101710
  • A. I. Middya, B. Nag, and S. Roy, "Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities", Knowledge-Based Systems, 244, 108580, 2022. https://doi.org/10.1016/j.knosys.2022.108580
  • L. Zheng, Q. Li, H. Ban, and S. Liu, "Speech emotion recognition based on convolution neural network combined with random forest", 2018 Chinese Control and Decision Conference (CCDC), Shenyang, pp. 4143–4147, 2018. https://doi.org/10.1109/CCDC.2018.8407844
  • D. Bitouk, R. Verma, and A. Nenkova, "Class-level spectral features for emotion recognition", Speech Communication, 52, 613–625, 2010. https://doi.org/10.1016/j.specom.2010.02.010
  • Z. T. Liu, M. Wu, W. H. Cao, J. W. Mao, J. P. Xu, and G. Z. Tan, "Speech emotion recognition based on feature selection and extreme learning machine decision tree", Neurocomputing, 273, 271-280, 2018. https://doi.org/10.1016/j.neucom.2017.07.050
  • S. Actis, A. Denner, L. Hofer, J. N. Lang, A. Scharf, and S. Uccirati, "RECOLA-Recursive Computation of One-Loop Amplitudes", Computer Physics Communications, 214, 140–173, 2017. https://doi.org/10.1016/j.cpc.2017.01.004
  • G. Trigeorgis, "End-to-end speech emotion recognition using a deep convolutional recurrent network", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, pp. 5200-5204, 2016. https://doi.org/10.1109/ICASSP.2016.7472669.
  • K. Wang, N. An, B. N. Li, Y. Zhang, and L. Li, "Speech Emotion Recognition Using Fourier Parameters," IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 69-75, 1 Jan.-March 2015, https://doi.org/10.1109/TAFFC.2015.2392101
  • D. Issa, M. F. Demirci, and A. Yazıcı, "Speech emotion recognition with deep convolutional neural networks", Biomedical Signal Processing and Control, 59, 101894, 2020. https://doi.org/10.1016/j.bspc.2020.101894
  • J. Cai, "Feature-Level and Model-Level Audiovisual Fusion for Emotion Recognition in the Wild", 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, pp. 443-448, 2019. https://doi.org/10.1109/MIPR.2019.00089.
  • S. Langari, H. Marvi, and M. Zahedi, "Efficient speech emotion recognition using modified feature extraction", Informatics in Medicine Unlocked, 20, 100424, 2020. https://doi.org/10.1016/j.imu.2020.100424
  • J. Zhao, X. Mao, and L. Chen, "Speech emotion recognition using deep 1D & 2D CNN LSTM networks", Biomedical Signal Processing and Control, 47, 312-323, 2019. https://doi.org/10.1016/j.bspc.2018.08.035
  • P. P. Dahake, K. Shaw and P. Malathi, "Speaker dependent speech emotion recognition using MFCC and Support Vector Machine," 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), Pune, pp. 1080-1084, 2016. https://doi.org/10.1109/ICACDOT.2016.7877753.
  • L. Kerkeni, Y. Serrestou, K. Raoof, M. Mbarki, M. A. Mahjoub, and C. Cleder, "Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO", Speech Communication, 114, 22–35, 2019. https://doi.org/10.1016/j.specom.2019.09.00
  • V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond", Speech Communication, 9(4), 351–356, 1990. https://doi.org/10.1016/0167-6393(90)90010-7
  • C. Liu, T. L. Tang, and M. Wang, "Multi-feature based emotion recognition for video clips", Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 630–634, 2018. https://doi.org/10.1145/3242969.3264989
  • J. Wei, X. Yang, and Y. Dong, "User-generated video emotion recognition based on key frames", Multimedia Tools and Applications, 80(9), 14343–14361, 2021. https://doi.org/10.1007/s11042-020-10203-1
  • T. L. B. Khanh, S. Kim, G. Lee, H. J. Yang, and E. T. Baek, "Korean video dataset for emotion recognition in the wild", Multimedia Tools and Applications, 80(6), 9479–9492, 2021. https://doi.org/10.1007/s11042-020-10106-1
  • X. Guo, L. F. Polanía, and K. E. Barner, "Toward end-to-end deception detection in videos", 2018 IEEE International Conference on Big Data, pp. 1278–1283, 2018. https://doi.org/10.1109/BigData.2018.8621909
  • R. Guetari, A. Chetouani, H. Tabia, and N. Khlifa, "Real time emotion recognition in video stream, using B-CNN and F-CNN", 2020 5th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp. 1–6, 2020. https://doi.org/10.1109/ATSIP49331.2020.9231902
  • H. Zhou, D. Meng, Y. Zhang, X. Peng, J. Du, and K. Wang, "Exploring emotion features and fusion strategies for audio-video emotion recognition", 2019 International Conference on Multimodal Interaction, pp. 562–566, 2019. https://doi.org/10.1145/3340555.3355713.
  • S. E. Kahou, "EmoNets: Multimodal deep learning approaches for emotion recognition in video", Journal on Multimodal User Interfaces, 10(2), 99–111, 2016. https://doi.org/10.1007/s12193-015-0195-2
  • T. S. Gunawan, A. Ashraf, B. S. Riza, E. V. Haryanto, R. Rosnelly, M. Kartiwi, and Z. Janin, "Development of video-based emotion recognition using deep learning with Google Colab", TELKOMNIKA Telecommunication Computing Electronics and Control, 18(5), 2463–2471, 2020. https://doi.org/10.12928/telkomnika.v18i5.16717
  • H. V. Manalu, and A. P. Rifai, "Detection of human emotions through facial expressions using hybrid convolutional neural network-recurrent neural network algorithm", Intelligent Systems with Applications, 21, 200339, 2024. https://doi.org/10.1016/j.iswa.2024.200339
  • R. Memisevic, S. E. Kahou, V. Michalski, K. Konda, and C. Pal, "Recurrent neural networks for emotion recognition in video", Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 467–474, 2015. https://doi.org/10.1145/2818346.2830596.
  • L. H. Sun, J. Chen, and T. Gu, "Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition", International Journal of Speech Technology, 21(4), 931–940, 2018. https://doi.org/10.1007/s10772-018-9551-4
  • Y. M. Huang, "Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition", Journal of Ambient Intelligence and Humanized Computing, 10(5), 1787–1798, 2019. https://doi.org/10.1016/j.engappai.2024.108293
  • M. Xu, F. Zhang, and S. U. Khan, "Improve accuracy of speech emotion recognition with attention head fusion", 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pp. 12-5-18, 2020. https://doi.org/10.1109/CCWC47524.2020.9031207
  • W. Jiang, Z. Wang, J. S. Jin, X. Han, and C. Li, "Speech emotion recognition with heterogeneous feature unification of deep neural network", Sensors (Basel), 19(12), 2730, 2019. https://doi.org/10.3390/s19122730
  • Z. W. Tu, B. Lui, W. Zhao, R. Yan, and Y. Zou, "A feature fusion model with data augmentation for speech emotion recognition", Applied Sciences (Basel), 13(7), 4124, 2023. https://doi.org/10.3390/app13074124
  • I. Shahin, O. S. Alamori, A. B. Nassif, I. Afyouni, I. A. Hashem, and A. Elnagar, "An efficient feature selection method for Arabic and English speech emotion recognition using Grey Wolf Optimizer", Applied Acoustics, 205, 109279, 2023. https://doi.org/10.1016/j.apacoust.2023.109279
  • Y. Liu, "Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion", Multimedia Tools and Applications, pp. 1–21, 2024. https://doi.org/10.1007/s11042-023-17829-x
  • Z. Liu, X. Kang, and F. Ren, "Improving speech emotion recognition by fusing pre-trained and acoustic features using transformer and BiLSTM", International Conference on Intelligent Information Processing, pp. 68–79, 2022. https://doi.org/10.1007/978-3-031-03948-5_28
There are 36 citations in total.

Details

Primary Language English
Subjects Electrical Engineering (Other)
Journal Section Research Article
Authors

Abdil Karakan 0000-0003-1651-7568

Submission Date October 28, 2024
Acceptance Date September 16, 2025
Publication Date March 1, 2026
DOI https://doi.org/10.36306/konjes.1574874
IZ https://izlik.org/JA53SY69KL
Published in Issue Year 2026 Volume: 14 Issue: 1

Cite

IEEE [1] A. Karakan, “VOICE AND IMAGE BASED EMOTION RECOGNITION WITH DEEP LEARNING”, KONJES, vol. 14, no. 1, pp. 97–112, Mar. 2026, doi: 10.36306/konjes.1574874.