Research Article

A Comparative Analysis of Traditional and Deep Learning Approaches for Addressing Challenges in Speaker Diarization

Year 2025, Volume: 25, Issue: 5, 1095-1105, 01.10.2025
https://doi.org/10.35414/akufemubid.1565447

Abstract

Speaker diarization is the task of distinguishing and segmenting speech from multiple speakers in an audio recording, a task crucial for applications such as meeting transcription, voice-activated systems, and audio indexing. Traditional clustering-based methods have been widely used, but they struggle with the challenges of real-world scenarios, including noisy environments, overlapping speech, speaker variability, and variable recording conditions. This study addresses these limitations by focusing on deep learning-based approaches, which have demonstrated significant advances in the accuracy of multi-speaker diarization. The aim of this study is to compare traditional clustering methods with newer deep learning techniques, including Time Delay Neural Networks (TDNN), End-to-End Neural Diarization (EEND), and the Fully Supervised UIS-RNN, in addressing the challenges of multi-speaker diarization. The results show that on the CallHome dataset, TDNN systems yielded slight improvements on non-overlapping speech, with a Diarization Error Rate (DER) of 12-14%, compared with 13-15% for traditional clustering methods. On overlapping speech, however, EEND outperformed traditional methods, achieving a DER of 12.6%, significantly lower than the 23.7% observed with traditional clustering. The Fully Supervised UIS-RNN model delivered the best overall performance, with a DER of 7.6%. Future research should focus on combining the strengths of traditional and deep learning techniques while reducing the computational and data requirements, making real-time speaker diarization systems more accessible. The findings indicate that deep learning will make a substantial contribution to the field of speaker diarization.
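The DER figures quoted above follow the standard definition of the metric: the total duration of missed speech, false-alarm speech, and speaker-confusion errors, divided by the total duration of reference speech. A minimal sketch of that computation (the function name and the example durations are illustrative, not taken from the paper):

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Standard DER: error time as a fraction of total reference speech time.

    All arguments are durations (e.g. in seconds):
      missed       - reference speech that the system labeled as silence
      false_alarm  - non-speech that the system labeled as speech
      confusion    - speech attributed to the wrong speaker
      total_speech - total duration of reference speech
    """
    return (missed + false_alarm + confusion) / total_speech

# Illustrative numbers: 1.2 s missed, 0.8 s false alarm, 1.0 s confusion
# over 40 s of reference speech gives a DER of 7.5%.
print(f"DER = {diarization_error_rate(1.2, 0.8, 1.0, 40.0):.1%}")
```

Note that because overlapped regions count toward both the reference speech time and the error terms, systems that ignore overlapping speech (as most clustering pipelines do) incur missed-speech errors there, which is why overlap handling dominates the comparison above.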

References

  • Al-Hadithy, T.M., Frikha, M., & Maseer, Z.K., 2022. Speaker Diarization based on Deep Learning Techniques: A Review. 2022 International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 856-871. https://doi.org/10.1109/ISMSIT56059.2022.9932710
  • Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O., 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 356-370. https://doi.org/10.1109/TASL.2011.2125954
  • Bredin, H., 2017. TristouNet: Triplet loss for speaker turn embedding. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5430-5434. https://doi.org/10.1109/ICASSP.2017.7953194
  • Bullock, L., Bredin, H., & Garcia-Perera, L. P., 2020. Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 7114-7118. https://doi.org/10.1109/ICASSP40776.2020.9053096
  • Canavan, A., Graff, D., & Zipperlen, G., CALLHOME American English Speech (LDC97S42), https://catalog.ldc.upenn.edu/LDC97S42, (11.04.2025)
  • Chung, J. S., Nagrani, A., & Zisserman, A., 2018. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622. https://doi.org/10.48550/arXiv.1806.05622
  • Çelik, H., & Ekşi, H., 2013. SÖYLEM ANALİZİ. Marmara Üniversitesi Atatürk Eğitim Fakültesi Eğitim Bilimleri Dergisi, 27(27), 99-117.
  • Fiscus, J.G., Ajot, J., Michel, M., & Garofolo, J.S., 2006. The Rich Transcription 2006 Spring Meeting Recognition Evaluation. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_28
  • Fujita, Y., Kanda, N., Horiguchi, S., Nagamatsu, K., & Watanabe, S., 2019b. End-to-end neural speaker diarization with permutation-free objectives. arXiv preprint arXiv:1909.05952. https://doi.org/10.48550/arXiv.1909.05952
  • Fujita, Y., Kanda, N., Horiguchi, S., Xue, Y., Nagamatsu, K., & Watanabe, S., 2019. End-to-end neural speaker diarization with self-attention. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, Singapore, 296-303. https://doi.org/10.1109/ASRU46091.2019.9003959
  • Hamza, H., Gafoor, F., Sithara, F., Anil, G., & Anoop, V. S., 2023. EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks, arXiv preprint arXiv:2310.12851. https://doi.org/10.48550/arXiv.2310.12851
  • Kshirod, K.S., 2020. Speaker Diarization with Deep Learning Techniques. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 11, 3, 2570–2582. https://doi.org/10.61841/turcomat.v11i3.14309
  • Li, Y., Gao, F., Ou, Z., & Sun, J., 2018. Angular softmax loss for end-to-end speaker verification. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taiwan, IEEE, 190-194. https://doi.org/10.1109/ISCSLP.2018.8706570.
  • Mao, H. H., Li, S., McAuley, J., & Cottrell, G., 2020. Speech recognition and multi-speaker diarization of long conversations. arXiv preprint arXiv:2005.08072. https://doi.org/10.48550/arXiv.2005.08072
  • Park, T. J., Kanda, N., Dimitriadis, D., Han, K. J., Watanabe, S., & Narayanan, S., 2022. A review of speaker diarization: Recent advances with deep learning. Computer Speech & Language, 72, 101317. https://doi.org/10.1016/j.csl.2021.101317
  • Raj, D., Huang, Z., & Khudanpur, S., 2021. Multi-class spectral clustering with overlaps for speaker diarization. In 2021 IEEE Spoken Language Technology Workshop (SLT), China, IEEE, 582-589. https://doi.org/10.1109/SLT48900.2021.9383602
  • Reynolds, D. A., Quatieri, T. F., & Dunn, R. B., 2000. Speaker verification using adapted Gaussian mixture models. Digital signal processing, 10(1-3), 19-41. https://doi.org/10.1006/dspr.1999.0361
  • Serafini, L., Cornell, S., Morrone, G., Zovato, E., Brutti, A., & Squartini, S., 2023. An experimental review of speaker diarization methods with application to two-speaker conversational telephone speech recordings. Computer Speech & Language, 82, 101534. https://doi.org/10.1016/j.csl.2023.101534
  • Shum, S. H., Dehak, N., Dehak, R., & Glass, J. R., 2013. Unsupervised methods for speaker diarization: An integrated and iterative approach. IEEE Transactions on Audio, Speech, and Language Processing, 21(10), 2015-2028. https://doi.org/10.1109/TASL.2013.2264673
  • Tian, J., Ye, S., Chen, S., Xiang, Y., Yin, Z., Hu, X., & Xu, X., 2024. The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge. arXiv preprint arXiv:2405.05498. https://doi.org/10.48550/arXiv.2405.05498
  • Variani, E., Lei, X., McDermott, E., Moreno, I., & Gonzalez-Dominguez, J., 2014. Deep neural networks for small footprint text-dependent speaker verification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, 4052-4056. https://doi.org/10.1109/ICASSP.2014.6854363
  • Wang, Q., Downey, C., Wan, L., Mansfield, P. A., & Moreno, I. L., 2018. Speaker diarization with LSTM. In 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP), Canada, 5239-5243. IEEE. https://doi.org/10.1109/ICASSP.2018.8462628
  • Zhang, A., Wang, Q., Zhu, Z., Paisley, J., & Wang, C., 2019. Fully supervised speaker diarization. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), UK, IEEE, 6301-6305. https://doi.org/10.1109/ICASSP.2019.8683892

Konuşmacı Diarizasyonundaki Zorlukların Çözümünde Geleneksel Yöntemler ile Derin Öğrenme Yöntemlerinin Karşılaştırılması



Details

Primary Language: English
Subjects: Computer Software
Section: Articles
Authors

Emsal Altinay 0000-0002-0315-0894

Ecir Uğur Küçüksille 0000-0002-3293-9878

Early View Date: 18 September 2025
Publication Date: 1 October 2025
Submission Date: 11 October 2024
Acceptance Date: 19 April 2025
Published Issue: Year 2025, Volume: 25, Issue: 5

How to Cite

APA Altinay, E., & Küçüksille, E. U. (2025). A Comparative Analysis of Traditional and Deep Learning Approaches for Addressing Challenges in Speaker Diarization. Afyon Kocatepe Üniversitesi Fen ve Mühendislik Bilimleri Dergisi, 25(5), 1095-1105. https://doi.org/10.35414/akufemubid.1565447


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.