Sentetik ve Dönüştürülmüş Konuşmaların Tespitinde Genlik ve Faz Tabanlı Spektral Özniteliklerin Kullanılması

Burak Kasapoğlu; Turgay Koç

doi:10.31590/ejosat.780650

Using Magnitude and Phase Based Spectral Features for Detecting Synthetic and Converted Speech

Abstract

As technology advances, biometric signals that differ from person to person, such as fingerprint, retina, face, and voice, are increasingly used to control personal access in security-sensitive applications. Among these signals, voice (that is, the speech signal) is easy to obtain from a person and offers high mobility, which has made automatic speaker verification (ASV) systems popular. As ASV systems have spread through security applications, various spoofing attack methods have been developed to mislead them, and these attacks have been observed to pose a serious threat to ASV systems. This study proposes a new system for detecting spoofing attacks based on speech synthesis and voice conversion, two of the biggest threats to ASV systems. The proposed system is a Gaussian Mixture Model (GMM) based classifier that fuses two features: magnitude-spectrum-based constant Q cepstral coefficients (CQCC), the best-performing countermeasure feature in the ASVspoof challenge for detecting speech produced by speech synthesis and voice conversion, and the glottal flow modified group delay (GFMGD) feature, which carries phase-spectrum information of the glottal flow obtained by inverse filtering the speech signal. For spoofed speech produced directly from genuine speech segments, both the proposed system and the CQCC-based baseline have a classification error below 1%, so no major difference in performance is observed between them. For spoofed speech produced by waveform filtering, both systems perform similarly poorly compared with the other attack methods. Against speech synthesized or converted by modern artificial neural networks and audio vocoders, however, the proposed system provides up to a 55% performance improvement over the baseline that uses only CQCC.
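The GMM-based decision described in the abstract can be illustrated with a small sketch. It assumes a two-class setup (one GMM for genuine speech, one for spoofed speech) scored by an average log-likelihood ratio, and it assumes feature-level fusion by concatenating CQCC and GFMGD frames; the abstract does not state whether fusion happens at the feature or the score level. The function names, the toy model parameters, and the zero threshold are all hypothetical and chosen only for illustration.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) triples."""
    terms = [math.log(w) + diag_gauss_logpdf(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def fuse(cqcc_frame, gfmgd_frame):
    """Feature-level fusion by concatenation (an assumed fusion scheme)."""
    return list(cqcc_frame) + list(gfmgd_frame)

def llr_score(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; positive favours genuine."""
    total = sum(
        gmm_loglik(f, genuine_gmm) - gmm_loglik(f, spoof_gmm) for f in frames
    )
    return total / len(frames)

# Toy, hand-set models over 3-dimensional fused frames
# (2 stand-in "CQCC" values + 1 stand-in "GFMGD" value per frame).
GENUINE_GMM = [
    (0.5, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
    (0.5, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]),
]
SPOOF_GMM = [(1.0, [4.0, 4.0, 4.0], [1.0, 1.0, 1.0])]

utterance = [fuse([0.2, -0.1], [0.3]), fuse([0.9, 1.1], [0.8])]
score = llr_score(utterance, GENUINE_GMM, SPOOF_GMM)
decision = "genuine" if score > 0.0 else "spoof"
```

In practice the two GMMs would be trained on genuine and spoofed training data, and the decision threshold would be tuned on a development set instead of being fixed at zero.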

Keywords

Abstract (Turkish version, translated)

With advances in technology, biometric signals that vary from person to person, such as fingerprint, retina, face, and voice, are used more widely every day to provide personal access in applications that require security. Of these biometric signals, voice (that is, the speech signal) can be obtained easily from a person and offers high mobility, which makes automatic speaker verification (ASV) systems popular. As ASV systems have become widespread in security settings, different attack methods aimed at misleading them have been developed, and these attacks have been observed to pose a serious threat to ASV systems. In this study, a new system is proposed for detecting attacks on ASV systems carried out with speech synthesis and voice conversion, two of the methods that pose the greatest threat to ASV systems. The proposed system is a Gaussian Mixture Model based classification system that combines two features: the magnitude-spectrum-based constant Q cepstral coefficients (CQCC) feature, which gave the best performance in the ASVspoof challenge organized in 2015 for detecting spoofed speech produced by voice conversion and speech synthesis, and the glottal flow modified group delay (GFMGD) feature, which contains phase information of the glottal flow obtained by inverse filtering the speech signal. In classifying spoofed speech produced directly from genuine speech segments, no marked difference is seen between the CQCC-based baseline system and the proposed system, and both show a classification error below 1%. In classifying spoofed speech produced by waveform filtering, however, both systems similarly perform worse than they do against the other attack methods. Compared with the baseline system that uses only CQCC, the proposed system provides a performance improvement of up to 55%, particularly against speech signals synthesized or converted by the modern artificial neural networks and vocoders developed in recent years.
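As background for the phase-based feature, the sketch below computes the plain group delay function of a frame directly from two DFTs, without phase unwrapping. It is not the GFMGD feature itself: the abstract does not give the glottal inverse filtering step or the modification parameters, so those are left out. Modified group delay variants in the literature typically replace the squared magnitude in the denominator with a smoothed spectrum and compress the result with exponents. The function names and the 16-sample test frame are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def group_delay(x, eps=1e-12):
    """Group delay per DFT bin, in samples.

    Uses tau[k] = (X_R*Y_R + X_I*Y_I) / |X|^2,
    where X = DFT(x[n]) and Y = DFT(n * x[n]).
    The eps floor guards against division by zero at spectral nulls.
    """
    X = dft(x)
    Y = dft([t * v for t, v in enumerate(x)])
    return [
        (a.real * b.real + a.imag * b.imag) / max(abs(a) ** 2, eps)
        for a, b in zip(X, Y)
    ]

# A unit impulse delayed by 3 samples has a group delay of 3 at every bin.
frame = [0.0] * 16
frame[3] = 1.0
tau = group_delay(frame)
```

The delayed impulse is a convenient sanity check because its phase is exactly linear in frequency, so every bin should report the same delay.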

Acknowledgments

The numerical calculations reported in this research were fully or partially performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). We thank TÜBİTAK ULAKBİM for sharing the TRUBA resources during our work.

Details

Primary Language

Turkish

Subjects

Engineering

Section

Research Article

Authors

Burak Kasapoğlu
0000-0003-3580-0465
Türkiye

Turgay Koç
0000-0002-4846-7772
Türkiye

Publication Date

15 August 2020

Submission Date

28 June 2020

Acceptance Date

10 August 2020

Published in Issue

Year 2020

DOI

https://doi.org/10.31590/ejosat.780650

IZ

https://izlik.org/JA97NJ25XA

Cite

APA

Kasapoğlu, B., & Koç, T. (2020). Sentetik ve Dönüştürülmüş Konuşmaların Tespitinde Genlik ve Faz Tabanlı Spektral Özniteliklerin Kullanılması. Avrupa Bilim ve Teknoloji Dergisi, 398-406. https://doi.org/10.31590/ejosat.780650

Cited By

Ses Telleri Görüntülerinde Otomatik Piksel Tabanlı Sınıflandırma için Performans Ölçütlerinin İncelenmesi

Detecting audio copy-move forgery with an artificial neural network

Recurrent neural network and long short-term memory models for audio copy-move forgery detection: a comprehensive study