Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım

Kenan Donuk; Davut Hanbay

doi:10.53070/bbd.1113379

Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım

Abstract

Konuşma duygu tanıma, konuşma sinyallerinden insan duygularını gerçek zamanlı olarak tanıyabilen aktif bir insan-bilgisayar etkileşimi alanıdır. Bu alanda yapılan tanıma görevi, duyguların karmaşıklığı nedeniyle zorlu bir sınıflandırma örneğidir. Etkili bir sınıflandırma işleminin yapılabilmesi yüksek seviyeli derin özelliklere ve uygun bir derin öğrenme modeline bağlıdır. Konuşma duygu tanıma alanında yapılmış birçok sınıflandırma çalışması mevcuttur. Bu çalışmalarda konuşma verilerinden duyguların doğru bir şekilde çıkarılması için birçok farklı model ve özellik birleşimi önerilmiştir. Bu makalede konuşma duygu tanıma görevi için bir sistem önerilmektedir. Bu sistemde konuşma duygu tanıma için uzun-kısa süreli bellek tabanlı bir derin öğrenme modeli önerilmiştir. Önerilen sistem ön-işlem, özellik çıkarma, özellik birleşimi, uzun-kısa süreli bellek ve sınıflandırma olmak üzere dört aşamadan oluşmaktadır. Önerilen sistemde konuşma verilerine ilk olarak kırpma ve ön-vurgu ön-işlemleri uygulanır. Bu işlemlerden sonra elde edilen konuşma verilerinden Mel Frekans Kepstrum Katsayıları, Sıfır Geçiş Oranı ve Kök Ortalama Kare Enerji akustik özellikleri çıkarılarak birleştirilir. Birleştirilen bu özelliklerin uzamsal bilgilerinin yanında zaman içindeki akustik değişimleri sistemde önerilen uzun-kısa süreli bellek ve buna bağlı bir derin sinir ağı modeliyle öğrenilir. Son olarak softmax aktivasyon fonksiyonu ile öğrenilen bilgiler 8 farklı duyguya sınıflandırılır. Önerilen sistem RAVDESS ve TESS veri setlerinin birlikte kullanıldığı bir veri kümesinde test edilmiştir. Eğitim, doğrulama ve test sonuçlarında sırasıyla %99.87 , %85.14 , %88.92 oranlarında doğruluklar ölçülmüştür. Sonuçlar, son teknoloji çalışmalardaki doğruluklarla kıyaslanmış önerilen sistemin başarısı ortaya konmuştur.

An LSTM-Based Approach with Acoustic Features for Speech Emotion Recognition

Abstract

Speech emotion recognition is an active area of human-computer interaction that aims to recognize human emotions from speech signals in real time. Because emotions are complex, the recognition task is a challenging classification problem. Effective classification depends on high-level deep features and an appropriate deep learning model. Many classification studies exist in this field, and they propose a wide range of models and feature combinations for accurately extracting emotions from speech data. This article proposes a speech emotion recognition system built around a long short-term memory (LSTM) based deep learning model. The proposed system consists of four stages: preprocessing, feature extraction, feature combination, and LSTM-based classification. First, trimming and pre-emphasis are applied to the speech data as preprocessing. Next, Mel-Frequency Cepstral Coefficients (MFCC), Zero-Crossing Rate (ZCR), and Root Mean Square (RMS) Energy acoustic features are extracted from the preprocessed speech and combined. Both the spatial information in these combined features and their acoustic changes over time are learned by the proposed LSTM and a deep neural network connected to it. Finally, a softmax activation function classifies the learned information into 8 emotions. The proposed system was tested on a dataset that combines the RAVDESS and TESS datasets. Accuracies of 99.87%, 85.14%, and 88.92% were measured for training, validation, and testing, respectively. These results were compared with the accuracies reported in recent state-of-the-art studies to demonstrate the performance of the proposed system.
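The preprocessing and feature stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pre-emphasis coefficient (0.97), the frame length (400 samples), the hop length (160 samples), and all function names are assumptions chosen for illustration. MFCC extraction is left out to keep the example short and dependency-free, so only the zero-crossing rate and RMS energy are computed per frame and concatenated.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[0] = x[0], y[n] = x[n] - coeff * x[n-1].

    The coefficient 0.97 is a common default, assumed here for illustration.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def split_frames(signal, frame_length, hop_length):
    """Split a signal into overlapping frames, dropping any partial last frame."""
    last_start = len(signal) - frame_length
    return [
        signal[start:start + frame_length]
        for start in range(0, last_start + 1, hop_length)
    ]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root mean square of the samples in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def frame_features(signal, frame_length=400, hop_length=160):
    """Return one [zcr, rms] feature vector per frame of the pre-emphasized signal.

    Frame and hop sizes are assumed values, not taken from the article.
    """
    emphasized = pre_emphasis(signal)
    return [
        [zero_crossing_rate(frame), rms_energy(frame)]
        for frame in split_frames(emphasized, frame_length, hop_length)
    ]
```

The result is a sequence of per-frame feature vectors, which is the shape of input a recurrent model such as an LSTM consumes one time step at a time. In a fuller pipeline, the MFCC values for each frame would be appended to the same vector before the sequence is passed to the network.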

Details

Primary Language

Turkish

Subjects

Artificial Intelligence

Journal Section

Research Article

Authors

Kenan Donuk *
0000-0002-7421-5587
Türkiye

Davut Hanbay
0000-0003-2271-7865
Türkiye

Publication Date

December 7, 2022

Submission Date

May 6, 2022

Acceptance Date

June 21, 2022

Published in Issue

Year 2022, Volume 7, Issue 2

DOI

https://doi.org/10.53070/bbd.1113379

IZ

https://izlik.org/JA47DP23RM

Cite

APA

Donuk, K., & Hanbay, D. (2022). Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım. Computer Science, 7(2), 54-67. https://doi.org/10.53070/bbd.1113379

AMA

1. Donuk K, Hanbay D. Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım. JCS. 2022;7(2):54-67. doi:10.53070/bbd.1113379

Chicago

Donuk, Kenan, and Davut Hanbay. 2022. “Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım”. Computer Science 7 (2): 54-67. https://doi.org/10.53070/bbd.1113379.

EndNote

Donuk K, Hanbay D (December 1, 2022) Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım. Computer Science 7 2 54–67.

IEEE

[1] K. Donuk and D. Hanbay, “Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım”, JCS, vol. 7, no. 2, pp. 54–67, Dec. 2022, doi: 10.53070/bbd.1113379.

ISNAD

Donuk, Kenan - Hanbay, Davut. “Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım”. Computer Science 7/2 (December 1, 2022): 54-67. https://doi.org/10.53070/bbd.1113379.

JAMA

1. Donuk K, Hanbay D. Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım. JCS. 2022;7:54–67.

MLA

Donuk, Kenan, and Davut Hanbay. “Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım”. Computer Science, vol. 7, no. 2, Dec. 2022, pp. 54-67, doi:10.53070/bbd.1113379.

Vancouver

1. Kenan Donuk, Davut Hanbay. Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım. JCS. 2022 Dec. 1;7(2):54-67. doi:10.53070/bbd.1113379

Cited By

CREMA-D: Improving Accuracy with BPSO-Based Feature Selection for Emotion Recognition Using Speech

A Modified MFCC-Based Deep Learning Method for Emotion Classification from Speech

Konuşma Duygu Tanıma Uygulamalarında Hiper Parametre Optimizasyonu ile Derin Öğrenme Metotlarının Geliştirilmesi

A CNN–NCP Based Hybrid Deep Learning Model for Speech-Driven Gender Classification

The effect of emotion recognition and empathy training on children’s empathic tendencies and social acceptance of peers with special needs: a pretest–posttest controlled experimental study