Effect of Data Augmentation on Performance in Classifying Speakers as Female, Male, and Child

Ergün Yücesoy

doi:10.21597/jist.1505349

Research Article

Year 2024, Volume: 14 Issue: 3, 974 - 987, 01.09.2024

Abstract

Advances in deep learning have enabled the creation of more accurate classifiers. However, building deep learning models that generalize well requires large amounts of labeled data. Data augmentation is a widely used way to meet this need. This study investigates how different data augmentation methods affect performance when classifying speakers by age and gender. Adult speakers are classified as male or female, while children are treated as a single class without distinguishing gender, giving three classes in total (female, male, and child). For this purpose, seven different models are created using different combinations of three data augmentation methods: noise addition, time stretching, and pitch shifting. The performance of each model is then measured. Among these models, which were developed with 5760 speech samples randomly selected from the aGender dataset, the largest improvement is achieved by the model in which all three data augmentation methods are used together. This model raises classification accuracy from 84.583% to 87.523%, an absolute gain of nearly 3 percentage points, while the other models that use data augmentation improve performance by between 1% and 2.3%.
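The count of seven models is consistent with taking every non-empty subset of the three augmentation methods (2^3 - 1 = 7). The short Python sketch below illustrates that reading and the reported accuracy gain. It is not the author's code: the subset interpretation, the method labels, the `add_noise` helper, and the 20 dB signal-to-noise ratio in the usage note are all assumptions made for illustration, and time stretching and pitch shifting are left out because they would normally be delegated to an audio library.

```python
import itertools
import math
import random

# Labels for the three augmentation methods named in the abstract (hypothetical names).
METHODS = ("noise_addition", "time_stretching", "pitch_shifting")


def augmentation_combinations(methods=METHODS):
    """Return every non-empty subset of the methods, smallest subsets first."""
    combos = []
    for size in range(1, len(methods) + 1):
        combos.extend(itertools.combinations(methods, size))
    return combos


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise to a list of samples at a target SNR in dB.

    The Gaussian noise model and the SNR parameter are assumptions;
    the abstract does not say how noise was generated.
    """
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


combos = augmentation_combinations()
print(len(combos))  # 7 non-empty subsets of three methods

# Accuracy figures reported in the abstract, in percent.
baseline, best = 84.583, 87.523
gain = round(best - baseline, 3)
print(gain)  # 2.94 percentage points
```

As a usage example, `add_noise(tone, 20, random.Random(0))` returns a noisy copy of a list of samples `tone` whose noise power is roughly one hundredth of the signal power.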

Keywords

Age and gender recognition, Convolutional neural networks, Data augmentation, Pitch shifting, Time stretching, Noise addition

Details

Primary Language	Turkish
Subjects	Computer Software
Section	Bilgisayar Mühendisliği / Computer Engineering
Authors	Ergün Yücesoy 0000-0003-1707-384X
Early Pub Date	27 August 2024
Publication Date	1 September 2024
Submission Date	26 June 2024
Acceptance Date	21 July 2024
Published in Issue	Year 2024 Volume: 14 Issue: 3

Cite

APA	Yücesoy, E. (2024). Konuşmacıları Kadın, Erkek ve Çocuk Olarak Sınıflandırmada Veri Artırmanın Performansa Etkisi. Journal of the Institute of Science and Technology, 14(3), 974-987. https://doi.org/10.21597/jist.1505349