A Deep Learning Approach based on Ensemble Classification Pipeline and Interpretable Logical Rules for Bilingual Fake Speech Recognition

Emre Beray Boztepe; Bahadir Karasulu

doi:10.35378/gujs.1357317

Research Article


Year 2025, Early View, 1 - 1


Abstract

This study quantifies and classifies the differences between real and fake speech signals, drawing on the salient feature learning ability of deep learning. Within an ensemble classification pipeline, interpretable logical rules were combined with class activation maps for generalized reasoning, in order to discriminate the speech classes correctly. Fake audio samples were generated using a Deep Convolutional Generative Adversarial Network. Experiments were conducted on three datasets: Turkish, English, and bilingual. The classification pipeline, compiled into a majority voting-based ensemble classifier, achieved approximately 90% accuracy in training and 80.33% in testing for each individual language, and the majority voting results reached 73% when considered together with the appropriate test cases. To extract semantically rich rules, an interpretable logical rules infrastructure was used to infer the correct fake speech from the class activations of the deep learning generative model. A discussion and conclusions based on the scientific findings are included.
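
The majority voting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the three classifier names, and the label lists are hypothetical and chosen only to show how per-sample votes from several classifiers are combined into one ensemble decision.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list of majority labels.

    predictions: list of equal-length label lists, one list per classifier.
    Ties are resolved in favor of the label that was voted first, which is
    an arbitrary choice made for this sketch.
    """
    if not predictions:
        raise ValueError("at least one classifier's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")

    voted = []
    # zip(*predictions) yields, for each sample, the tuple of votes it received.
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical outputs of three base classifiers on four speech clips.
knn_pred = ["fake", "real", "fake", "real"]
svm_pred = ["fake", "fake", "fake", "real"]
tree_pred = ["real", "real", "fake", "fake"]

ensemble_pred = majority_vote([knn_pred, svm_pred, tree_pred])
print(ensemble_pred)  # ['fake', 'real', 'fake', 'real']
```

Each clip receives the label chosen by at least two of the three classifiers, so a single classifier's error on a clip does not change the ensemble decision.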

Keywords

Ensemble classifier, Machine learning, Deep learning, Speech recognition, Speech analysis
