TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish

Mustafa Eriş; Fatma Güneş Eriş; Erhan Akbal

doi:10.46460/ijiea.1887612

Abstract

Speech emotion recognition (SER) aims to improve human-computer interaction by recognizing emotions from speech signals. Recent advances in deep learning have produced highly successful SER studies. In particular, audio embeddings obtained from self-supervised models have significantly improved emotion recognition performance by capturing rich representations of speech signals. Although self-supervised learning models have brought notable progress for languages such as English and German, high-quality datasets are still missing for other languages, including Turkish. This study introduces TURK-SER, a new Turkish SER dataset comprising 2150 recordings of phonetically varied sentences produced by 90 speakers across five emotional categories. We also examine how to adapt the Wav2Vec2 model to Turkish SER using two fine-tuning methods: half fine-tuning, which updates only the transformer encoder, and full fine-tuning, which trains both the convolutional and transformer encoders. Experimental results indicate that full fine-tuning improves classification performance, reaching an accuracy of 85.44%. These findings underscore the promise of Wav2Vec2 for SER in low-resource languages, offer insights into optimizing self-supervised models for emotion detection, and pave the way for future studies on other low-resource languages.

Keywords

Speech Emotion Recognition, Self-Supervised Learning, TURK-SER Dataset, Speech Embeddings, Wav2Vec2 Fine-tuning
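The difference between the two fine-tuning regimes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prefix `feature_extractor.`, the example parameter names, and the helper `trainable_parameters` are hypothetical, and the sketch assumes that a classification head is trained in both regimes, which the abstract does not state.

```python
# Hypothetical sketch: select which parameters would receive gradient updates
# under "half" and "full" fine-tuning. All names below are illustrative
# assumptions, not taken from the paper or read from a real checkpoint.

CONV_PREFIX = "feature_extractor."  # assumed name of the convolutional encoder


def trainable_parameters(param_names, mode):
    """Return the parameter names that would be updated in the given mode.

    mode="half": the convolutional encoder is frozen; the rest is trained.
    mode="full": every parameter is trained.
    """
    if mode not in ("half", "full"):
        raise ValueError(f"unknown fine-tuning mode: {mode!r}")
    if mode == "full":
        return list(param_names)
    return [name for name in param_names if not name.startswith(CONV_PREFIX)]


# Illustrative parameter names for a Wav2Vec2-style model with a classifier.
EXAMPLE_PARAMS = [
    "feature_extractor.conv_layers.0.conv.weight",
    "feature_extractor.conv_layers.1.conv.weight",
    "encoder.layers.0.attention.q_proj.weight",
    "encoder.layers.0.feed_forward.output_dense.weight",
    "classifier.weight",  # assumed to be trained in both regimes
]

half = trainable_parameters(EXAMPLE_PARAMS, "half")
full = trainable_parameters(EXAMPLE_PARAMS, "full")
print(len(half), "of", len(EXAMPLE_PARAMS), "parameter tensors train in half mode")
print(len(full), "of", len(EXAMPLE_PARAMS), "parameter tensors train in full mode")
```

In a PyTorch training script, the same selection would typically be applied by setting `requires_grad = False` on the frozen parameters before building the optimizer.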

TURK-SER: Türkçe için Konuşma Duygu Tanıma Veri Seti ve Karşılaştırma Ölçeği

Details

Primary Language

English

Subjects

Automated Software Engineering, Software Engineering (Other)

Section

Research Article

Authors

Mustafa Eriş *
0000-0002-1757-8496
Türkiye

Fatma Güneş Eriş
0000-0002-6048-6060
Türkiye

Erhan Akbal
0000-0002-5257-7560
Türkiye

Publication Date

June 30, 2026

Submission Date

February 13, 2026

Acceptance Date

June 23, 2026

Published in Issue

Year: 2026, Volume: 10, Issue: 1

DOI

https://doi.org/10.46460/ijiea.1887612

IZ

https://izlik.org/JA42LP22PL

APA

Eriş, M., Güneş Eriş, F., & Akbal, E. (2026). TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish. International Journal of Innovative Engineering Applications, 10(1), 90-101. https://doi.org/10.46460/ijiea.1887612

AMA

1. Eriş M, Güneş Eriş F, Akbal E. TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish. IJIEA. 2026;10(1):90-101. doi:10.46460/ijiea.1887612

Chicago

Eriş, Mustafa, Fatma Güneş Eriş, and Erhan Akbal. 2026. “TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish”. International Journal of Innovative Engineering Applications 10 (1): 90-101. https://doi.org/10.46460/ijiea.1887612.

EndNote

Eriş M, Güneş Eriş F, Akbal E (June 1, 2026) TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish. International Journal of Innovative Engineering Applications 10 1 90–101.

IEEE

[1] M. Eriş, F. Güneş Eriş, and E. Akbal, “TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish”, IJIEA, vol. 10, no. 1, pp. 90–101, Jun. 2026, doi: 10.46460/ijiea.1887612.

ISNAD

Eriş, Mustafa - Güneş Eriş, Fatma - Akbal, Erhan. “TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish”. International Journal of Innovative Engineering Applications 10/1 (June 1, 2026): 90-101. https://doi.org/10.46460/ijiea.1887612.

JAMA

1. Eriş M, Güneş Eriş F, Akbal E. TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish. IJIEA. 2026;10:90–101.

MLA

Eriş, Mustafa, et al. “TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish”. International Journal of Innovative Engineering Applications, vol. 10, no. 1, June 2026, pp. 90-101, doi:10.46460/ijiea.1887612.

Vancouver

1. Mustafa Eriş, Fatma Güneş Eriş, Erhan Akbal. TURK-SER: A Speech Emotion Recognition Dataset and Benchmark for Turkish. IJIEA. June 1, 2026;10(1):90-101. doi:10.46460/ijiea.1887612
