Research Article

Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance

Year 2025, Volume: 8, Issue: 1, 222-247, 17.01.2025
https://doi.org/10.47495/okufbed.1457532

Abstract

This study investigates the effect of audio processing and filtering strategies on the performance of speech recognition systems in noisy environments. The focus is on Short-Time Fourier Transform (STFT) operations applied to noisy audio files and on the subsequent noise reduction procedures. The STFT operations form the basis for detecting noise and analyzing the speech signal in the frequency domain, while the noise reduction steps involve threshold-based masking and convolution operations. The results demonstrate that these audio processing and filtering strategies yield a significant improvement in speech recognition accuracy in noisy environments. A detailed analysis of the resulting graphs provides guidance for evaluating the effectiveness of the noise reduction procedures and serves as a roadmap for future research. This study emphasizes the critical importance of audio processing and filtering strategies in improving the performance of speech recognition systems in noisy environments, laying a foundation for future studies.
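
The pipeline the abstract describes corresponds to a standard spectral-gating scheme: compute the STFT of the noisy recording, derive a per-frequency threshold from an estimate of the noise, mask time-frequency bins that fall below that threshold, smooth the mask by convolution to limit musical-noise artifacts, and resynthesize the signal with the inverse STFT. The sketch below illustrates this general idea in Python with NumPy and SciPy; the function name, parameter values, and the 6 dB margin are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of STFT-based, threshold-masked noise reduction.
# Assumption: a noise-only segment (noise_profile) is available to estimate the threshold.
import numpy as np
from scipy.signal import stft, istft, convolve2d

def reduce_noise(noisy, noise_profile, fs=16000, nperseg=512, thresh_db=6.0):
    """Attenuate time-frequency bins whose energy falls below a per-band noise threshold."""
    _, _, S = stft(noisy, fs=fs, nperseg=nperseg)           # complex spectrogram of the noisy speech
    _, _, N = stft(noise_profile, fs=fs, nperseg=nperseg)   # spectrogram of the noise-only estimate

    # Per-frequency threshold: mean noise magnitude (in dB) plus a safety margin.
    noise_db = 20 * np.log10(np.abs(N) + 1e-10)
    threshold = noise_db.mean(axis=1, keepdims=True) + thresh_db

    # Binary mask: keep bins whose energy exceeds the noise threshold.
    signal_db = 20 * np.log10(np.abs(S) + 1e-10)
    mask = (signal_db > threshold).astype(float)

    # Smooth the mask with a small 2-D kernel (convolution step) to soften abrupt transitions.
    kernel = np.outer(np.hanning(5), np.hanning(5))
    kernel /= kernel.sum()
    mask = convolve2d(mask, kernel, mode="same")

    # Apply the mask and resynthesize the time-domain signal with the inverse STFT.
    _, cleaned = istft(S * mask, fs=fs, nperseg=nperseg)
    return cleaned
```

Libraries such as noisereduce package a more elaborate variant of this spectral gating, but the core steps remain the same: STFT analysis, thresholding against a noise estimate, mask smoothing, and inverse-STFT resynthesis.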



Details

Primary Language English
Subjects Deep Learning, Machine Learning (Other)
Journal Section RESEARCH ARTICLES
Authors

Cem Özkurt (ORCID: 0000-0002-1251-7715)

Early Pub Date January 15, 2025
Publication Date January 17, 2025
Submission Date March 22, 2024
Acceptance Date August 29, 2024
Published in Issue Year 2025 Volume: 8 Issue: 1

Cite

APA Özkurt, C. (2025). Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 8(1), 222-247. https://doi.org/10.47495/okufbed.1457532
AMA Özkurt C. Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance. Osmaniye Korkut Ata University Journal of The Institute of Science and Technology. January 2025;8(1):222-247. doi:10.47495/okufbed.1457532
Chicago Özkurt, Cem. “Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance”. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi 8, no. 1 (January 2025): 222-47. https://doi.org/10.47495/okufbed.1457532.
EndNote Özkurt C (January 1, 2025) Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi 8 1 222–247.
IEEE C. Özkurt, “Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance”, Osmaniye Korkut Ata University Journal of The Institute of Science and Technology, vol. 8, no. 1, pp. 222–247, 2025, doi: 10.47495/okufbed.1457532.
ISNAD Özkurt, Cem. “Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance”. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi 8/1 (January 2025), 222-247. https://doi.org/10.47495/okufbed.1457532.
JAMA Özkurt C. Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance. Osmaniye Korkut Ata University Journal of The Institute of Science and Technology. 2025;8:222–247.
MLA Özkurt, Cem. “Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance”. Osmaniye Korkut Ata Üniversitesi Fen Bilimleri Enstitüsü Dergisi, vol. 8, no. 1, 2025, pp. 222-47, doi:10.47495/okufbed.1457532.
Vancouver Özkurt C. Investigation of the Effectiveness of Audio Processing and Filtering Strategies in Noisy Environments on Speech Recognition Performance. Osmaniye Korkut Ata University Journal of The Institute of Science and Technology. 2025;8(1):222-47.



This work is licensed under a Creative Commons Attribution 4.0 International License.