Çoklu Ortam Sistemleri İçin Siber Güvenlik Kapsamında Derin Öğrenme Kullanarak Ses Sahne ve Olaylarının Tespiti

Bahadir Karasulu

Sound Scene and Events Detection using Deep Learning in the Scope of Cyber Security for Multimedia Systems

Abstract

Modern multimedia systems use synthetic sounds alongside the many natural sound sources found in nature. The environments in which these sounds occur (i.e., sound scenes) are important for biometric authorization, security requirements, and robust, secure voice and video communication. Beyond audio processing tasks with special constraints, such as speech or speaker recognition and verification, the separation of polyphonic sounds, noise reduction, the detection of sound scenes and events, and audio tagging are becoming increasingly important for building information systems that are safer from a cyber security standpoint. In recent years, deep learning has also come to be preferred in the field of cyber security because its layered structure makes it well suited to extracting features and semantic relationships from raw data. This study examines the use of deep learning architectures for the analysis, classification/prediction, and detection of audio (or speech) as multimedia data within the scope of cyber security. Deep neural networks, convolutional neural networks, recurrent neural networks, restricted Boltzmann machines, and deep belief networks are systematically reviewed as the prominent models in publications between 2015 and 2019. The trends in the literature on audio/speech processing in cyber security, on preventing voice spoofing, and on achieving consistent, high-performance results are thereby set out through discussion and commentary based on the scientific findings of more than forty studies.

Keywords

Speaker Recognition, Cyber Security, Sound Event Detection, Deep Learning

Details

Primary Language

Turkish

Subjects

Computer Software

Journal Section

Review

Authors

Bahadir Karasulu *
0000-0001-8524-874X
Türkiye

Publication Date

December 30, 2019

Submission Date

July 11, 2019

Acceptance Date

December 2, 2019

Published in Issue

Year 2019 Volume: 3 Number: 2

IZ

https://izlik.org/JA56NU24UD

Cite

APA

Karasulu, B. (2019). Çoklu Ortam Sistemleri İçin Siber Güvenlik Kapsamında Derin Öğrenme Kullanarak Ses Sahne ve Olaylarının Tespiti. Acta Infologica, 3(2), 60-82. https://izlik.org/JA56NU24UD
