Research Article

A Comparative Analysis of Traditional and Deep Learning Approaches for Addressing Challenges in Speaker Diarization

Year 2025, Volume: 25, Issue: 5, 1095-1105, 01.10.2025
https://doi.org/10.35414/akufemubid.1565447

Abstract

Speaker diarization is the task of distinguishing and segmenting speech from multiple speakers in an audio recording, a task crucial for applications such as meeting transcription, voice-activated systems, and audio indexing. Traditional clustering-based methods have been widely used, but they struggle with the challenges of real-world scenarios, including noisy environments, overlapping speech, speaker variability, and variable recording conditions. This study addresses these limitations by focusing on deep learning-based approaches, which have demonstrated significant advances in the accuracy of multi-speaker diarization. The aim of this study is to compare traditional clustering methods with newer deep learning techniques, including Time Delay Neural Networks (TDNN), End-to-End Neural Diarization (EEND), and the Fully Supervised UIS-RNN, in addressing the challenges of multi-speaker diarization. The results show that on the CallHome dataset, TDNN systems yielded slight improvements on non-overlapping speech, with a Diarization Error Rate (DER) of 12-14%, compared with 13-15% for traditional clustering methods. On overlapping speech, however, EEND outperformed traditional methods, achieving a DER of 12.6%, significantly lower than the 23.7% observed with traditional clustering. The Fully Supervised UIS-RNN model delivered the best overall performance, with a DER of 7.6%. Future research should focus on combining the strengths of traditional and deep learning techniques while reducing the computational and data requirements, making real-time speaker diarization systems more accessible. The findings indicate that deep learning will make a substantial contribution to the field of speaker diarization.
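The DER figures quoted above follow the standard definition of the metric: the total duration of missed speech, false-alarm speech, and speaker-confusion errors, divided by the total duration of reference speech. A minimal sketch of that computation (the function name and the example durations are illustrative, not taken from the paper):

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Standard DER: error time as a fraction of total reference speech time.

    All arguments are durations (e.g. in seconds):
      missed       - reference speech that the system labeled as silence
      false_alarm  - non-speech that the system labeled as speech
      confusion    - speech attributed to the wrong speaker
      total_speech - total duration of reference speech
    """
    return (missed + false_alarm + confusion) / total_speech

# Illustrative numbers: 1.2 s missed, 0.8 s false alarm, 1.0 s confusion
# over 40 s of reference speech gives a DER of 7.5%.
print(f"DER = {diarization_error_rate(1.2, 0.8, 1.0, 40.0):.1%}")
```

Note that because overlapped regions count toward both the reference speech time and the error terms, systems that ignore overlapping speech (as most clustering pipelines do) incur missed-speech errors there, which is why overlap handling dominates the comparison above.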

References

  • Al-Hadithy, T.M., Frikha, M., & Maseer, Z.K., 2022. Speaker Diarization based on Deep Learning Techniques: A Review. 2022 International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 856-871. https://doi.org/10.1109/ISMSIT56059.2022.9932710
  • Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O., 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 356-370. https://doi.org/10.1109/TASL.2011.2125954
  • Bredin, H., 2017. TristouNet: Triplet loss for speaker turn embedding. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5430-5434. https://doi.org/10.1109/ICASSP.2017.7953194
  • Bullock, L., Bredin, H., & Garcia-Perera, L. P., 2020. Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 7114-7118. https://doi.org/10.1109/ICASSP40776.2020.9053096
  • Canavan, A., Graff, D., & Zipperlen, G., CALLHOME American English Speech (LDC97S42), https://catalog.ldc.upenn.edu/LDC97S42, (11.04.2025)
  • Chung, J. S., Nagrani, A., & Zisserman, A., 2018. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622. https://doi.org/10.48550/arXiv.1806.05622
  • Çelik, H., & Ekşi, H., 2013. SÖYLEM ANALİZİ. Marmara Üniversitesi Atatürk Eğitim Fakültesi Eğitim Bilimleri Dergisi, 27(27), 99-117.
  • Fiscus, J.G., Ajot, J., Michel, M., & Garofolo, J.S., 2006. The Rich Transcription 2006 Spring Meeting Recognition Evaluation. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_28
  • Fujita, Y., Kanda, N., Horiguchi, S., Nagamatsu, K., & Watanabe, S., 2019b. End-to-end neural speaker diarization with permutation-free objectives. arXiv preprint arXiv:1909.05952. https://doi.org/10.48550/arXiv.1909.05952
  • Fujita, Y., Kanda, N., Horiguchi, S., Xue, Y., Nagamatsu, K., & Watanabe, S., 2019. End-to-end neural speaker diarization with self-attention. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, Singapore, 296-303. https://doi.org/10.1109/ASRU46091.2019.9003959
  • Hamza, H., Gafoor, F., Sithara, F., Anil, G., & Anoop, V. S., 2023. EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks, arXiv preprint arXiv:2310.12851. https://doi.org/10.48550/arXiv.2310.12851
  • Kshirod, K.S., 2020. Speaker Diarization with Deep Learning Techniques. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 11, 3, 2570–2582. https://doi.org/10.61841/turcomat.v11i3.14309
  • Li, Y., Gao, F., Ou, Z., & Sun, J., 2018. Angular softmax loss for end-to-end speaker verification. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taiwan, IEEE, 190-194. https://doi.org/10.1109/ISCSLP.2018.8706570.
  • Mao, H. H., Li, S., McAuley, J., & Cottrell, G., 2020. Speech recognition and multi-speaker diarization of long conversations. arXiv preprint arXiv:2005.08072. https://doi.org/10.48550/arXiv.2005.08072
  • Park, T. J., Kanda, N., Dimitriadis, D., Han, K. J., Watanabe, S., & Narayanan, S., 2022. A review of speaker diarization: Recent advances with deep learning. Computer Speech & Language, 72, 101317. https://doi.org/10.1016/j.csl.2021.101317
  • Raj, D., Huang, Z., & Khudanpur, S., 2021. Multi-class spectral clustering with overlaps for speaker diarization. In 2021 IEEE Spoken Language Technology Workshop (SLT), China, IEEE, 582-589. https://doi.org/10.1109/SLT48900.2021.9383602
  • Reynolds, D. A., Quatieri, T. F., & Dunn, R. B., 2000. Speaker verification using adapted Gaussian mixture models. Digital signal processing, 10(1-3), 19-41. https://doi.org/10.1006/dspr.1999.0361
  • Serafini, L., Cornell, S., Morrone, G., Zovato, E., Brutti, A., & Squartini, S., 2023. An experimental review of speaker diarization methods with application to two-speaker conversational telephone speech recordings. Computer Speech & Language, 82, 101534. https://doi.org/10.1016/j.csl.2023.101534
  • Shum, S. H., Dehak, N., Dehak, R., & Glass, J. R., 2013. Unsupervised methods for speaker diarization: An integrated and iterative approach. IEEE Transactions on Audio, Speech, and Language Processing, 21(10), 2015-2028. https://doi.org/10.1109/TASL.2013.2264673
  • Tian, J., Ye, S., Chen, S., Xiang, Y., Yin, Z., Hu, X., & Xu, X., 2024. The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge. arXiv preprint arXiv:2405.05498. https://doi.org/10.48550/arXiv.2405.05498
  • Variani, E., Lei, X., McDermott, E., Moreno, I., & Gonzalez-Dominguez, J., 2014. Deep neural networks for small footprint text-dependent speaker verification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, 4052-4056. https://doi.org/10.1109/ICASSP.2014.6854363
  • Wang, Q., Downey, C., Wan, L., Mansfield, P. A., & Moreno, I. L., 2018. Speaker diarization with LSTM. In 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP), Canada, 5239-5243. IEEE. https://doi.org/10.1109/ICASSP.2018.8462628
  • Zhang, A., Wang, Q., Zhu, Z., Paisley, J., & Wang, C., 2019. Fully supervised speaker diarization. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), UK, IEEE, 6301-6305. https://doi.org/10.1109/ICASSP.2019.8683892

Konuşmacı Diarizasyonundaki Zorlukların Çözümünde Geleneksel Yöntemler ile Derin Öğrenme Yöntemlerinin Karşılaştırılması



Details

Primary Language: English
Subjects: Computer Software
Section: Articles
Authors

Emsal Altinay 0000-0002-0315-0894

Ecir Uğur Küçüksille 0000-0002-3293-9878

Early View Date: 18 September 2025
Publication Date: 1 October 2025
Submission Date: 11 October 2024
Acceptance Date: 19 April 2025
Published Issue: Year 2025, Volume: 25, Issue: 5

How to Cite

APA Altinay, E., & Küçüksille, E. U. (2025). A Comparative Analysis of Traditional and Deep Learning Approaches for Addressing Challenges in Speaker Diarization. Afyon Kocatepe Üniversitesi Fen ve Mühendislik Bilimleri Dergisi, 25(5), 1095-1105. https://doi.org/10.35414/akufemubid.1565447


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.