Emotion is a phenomenon reflected in every moment of an individual's life. The way an emotional state is expressed can be complex and differs from person to person; facial expressions and changes in the voice are two ways emotions are conveyed. In this study, a voice- and image-based system for emotion recognition was implemented. Since no Turkish dataset existed for voice-based emotion recognition, an original dataset named TR-EmotionSpeech was prepared for this study. It consists of samples collected from 40 different Turkish-speaking people and includes 6 different emotions and 2,000 audio files. Likewise, a facial expression dataset named TRFace-40 was developed to recognize visual emotional cues; it consists of samples taken from 40 different people at different angles. Because the system is intended to perform detection in real time, camera-induced distortions were added to the samples in the dataset, yielding a new dataset of 40,000 images. These modifications contributed significantly to improving overall recognition accuracy. First, pre-processing and feature extraction were applied to the audio files; the resulting features were then classified with a Long Short-Term Memory (LSTM) network, and the emotion recognition accuracy of the system was determined as 75.18%. For image-based recognition, the YOLOv5, YOLOv6, YOLOv7 and YOLOv8 architectures were used, with the YOLOv8 architecture achieving the highest accuracy at 97.82%.
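The voice pipeline summarized above (feature extraction over audio frames followed by LSTM classification into six emotions) can be sketched as follows. This is a minimal NumPy illustration of a single LSTM cell with a softmax readout, not the authors' implementation: the feature dimension, hidden size, frame count, and all weights below are hypothetical placeholders.

```python
import numpy as np

# Illustrative sketch only: one LSTM cell unrolled over a sequence of audio
# feature vectors (e.g. MFCC frames), then a softmax over 6 emotion classes.
# Sizes and random weights are assumptions, not values from the paper.

rng = np.random.default_rng(0)

n_features = 13   # e.g. 13 MFCC coefficients per frame (assumption)
n_hidden = 32     # hidden state size (assumption)
n_classes = 6     # six emotions, as stated in the abstract
n_frames = 50     # frames in one utterance (assumption)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stacked gate weights for the input, forget, candidate, and output gates.
W = rng.standard_normal((4 * n_hidden, n_features + n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)
W_out = rng.standard_normal((n_classes, n_hidden)) * 0.1

def lstm_classify(frames):
    """Run the LSTM over feature frames; return class probabilities."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for x in frames:
        z = W @ np.concatenate([x, h]) + b
        i = sigmoid(z[0 * n_hidden:1 * n_hidden])   # input gate
        f = sigmoid(z[1 * n_hidden:2 * n_hidden])   # forget gate
        g = np.tanh(z[2 * n_hidden:3 * n_hidden])   # candidate cell state
        o = sigmoid(z[3 * n_hidden:4 * n_hidden])   # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    logits = W_out @ h
    e = np.exp(logits - logits.max())               # stable softmax
    return e / e.sum()

probs = lstm_classify(rng.standard_normal((n_frames, n_features)))
print(probs.shape)
```

In a trained system the weights would of course be learned from the TR-EmotionSpeech samples rather than drawn at random; the sketch only shows the shape of the recurrence.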
Keywords: Deep learning, Face recognition, Long Short-Term Memory Network, Voice recognition, YOLO architectures
| Primary Language | English |
|---|---|
| Subjects | Electrical Engineering (Other) |
| Journal Section | Research Article |
| Submission Date | October 28, 2024 |
| Acceptance Date | September 16, 2025 |
| Publication Date | March 1, 2026 |
| DOI | https://doi.org/10.36306/konjes.1574874 |
| IZ | https://izlik.org/JA53SY69KL |
| Published in Issue | Year 2026 Volume: 14 Issue: 1 |