Investigation of the Effect of LSTM Hyperparameters on Speech Recognition Performance

Yeşim Dokuz; Zekeriya Tüfekci

doi:10.31590/ejosat.araconf21

Research Article

Turkish title: LSTM Hiperparametrelerinin Ses Tanıma Performansına olan Etkilerinin Araştırılması

Year 2020, Ejosat Special Issue 2020 (ARACONF), 161-168, 01.04.2020

Abstract

Recent advances in hardware technologies and computational methods have made computers more capable of handling difficult tasks such as speech recognition and image processing. Speech recognition is the task of extracting a text representation of a speech signal using computational or analytical methods. It is a challenging problem because of variation across accents and languages, demanding hardware requirements, the need for large datasets to build accurate models, and environmental factors that affect signal quality. Recently, with the increasing processing power of hardware such as Graphics Processing Units (GPUs), deep learning methods have become prevalent, state-of-the-art approaches for speech recognition, especially Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, a variant of RNNs. In the literature, RNNs and LSTMs are applied to speech recognition and its applications with various parameter settings, such as the number of layers, number of hidden units, and batch size. However, how these parameter values were selected, and whether they can be reused in future studies, has not been investigated. In this study, we investigate the effect of LSTM hyperparameters on speech recognition performance in terms of error rates and deep architecture cost. Each parameter is examined separately while the other parameters are held constant, and its effect is observed on a speech corpus. Experimental results show that, for the selected number of training instances, each parameter has specific values that yield lower error rates and better speech recognition performance. The results indicate that, before values are chosen for the LSTM parameters, several experiments should be run on the speech corpus to find the most suitable value for each parameter.
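The one-parameter-at-a-time procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the baseline values, the candidate values, the 40-dimensional input, and the function names `one_at_a_time` and `lstm_param_count` are all assumptions made for the example, and the trainable-weight count is only one possible proxy for the "deep architecture cost" the abstract mentions.

```python
from typing import Dict, List

# Hypothetical baseline and candidate values; the abstract does not list the real ones.
BASELINE: Dict[str, int] = {"num_layers": 2, "hidden_units": 128, "batch_size": 32}
CANDIDATES: Dict[str, List[int]] = {
    "num_layers": [1, 2, 3],
    "hidden_units": [64, 128, 256],
    "batch_size": [16, 32, 64],
}


def one_at_a_time(baseline: Dict[str, int],
                  candidates: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Return configurations that differ from the baseline in at most one parameter."""
    configs: List[Dict[str, int]] = []
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)   # copy, so the baseline stays untouched
            config[name] = value      # vary only this parameter
            if config not in configs: # keep the baseline itself only once
                configs.append(config)
    return configs


def lstm_param_count(input_dim: int, hidden_units: int, num_layers: int) -> int:
    """Trainable weights of a stacked unidirectional LSTM.

    Uses the textbook formulation with one bias vector per gate; some
    libraries store two bias vectors per gate and report a larger number.
    """
    total = 0
    layer_input = input_dim
    for _ in range(num_layers):
        # 4 gates, each with input weights, recurrent weights, and a bias
        total += 4 * (hidden_units * (layer_input + hidden_units) + hidden_units)
        layer_input = hidden_units    # deeper layers read the previous layer's output
    return total


if __name__ == "__main__":
    INPUT_DIM = 40  # assumed acoustic feature dimension, for illustration only
    for cfg in one_at_a_time(BASELINE, CANDIDATES):
        cost = lstm_param_count(INPUT_DIM, cfg["hidden_units"], cfg["num_layers"])
        print(cfg, "trainable weights:", cost)
```

In an actual study, each configuration would be trained and scored on the speech corpus, and the measured error rate would be compared against a cost figure such as the one above. Batch size does not change the weight count, so its cost would have to be measured another way, for example as training time.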

Keywords

Speech recognition, Deep learning, RNN, LSTM, LSTM hyperparameters

References

Gao, C., Braun, S., Kiselev, I., Anumula, J., Delbruck, T., & Liu, S. C. (2019, May). Real-time speech recognition for IoT purpose using a delta recurrent neural network accelerator. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1-5). IEEE.
Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645-6649). IEEE.
Graves, A., Jaitly, N., & Mohamed, A. R. (2013b, December). Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE workshop on automatic speech recognition and understanding (pp. 273-278). IEEE.
He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., Liang, Q., Bhatia, D., Shangguan, Y., Li, B., Pundak, G., Sim, K. C., Bagby, T., Chang, S., Rao, K., & Gruenstein, A. (2019, May). Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6381-6385). IEEE.
Hori, T., Watanabe, S., Zhang, Y., & Chan, W. (2017). Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. arXiv preprint arXiv:1706.02737.
Lee, K., Park, C., Kim, N., & Lee, J. (2018, April). Accelerating recurrent neural network language model based online speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5904-5908). IEEE.
Liu, X., Liu, S., Sha, J., Yu, J., Xu, Z., Chen, X., & Meng, H. (2018, April). Limited-memory BFGS optimization of recurrent neural network language models for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6114-6118). IEEE.
Miao, Y., Gowayyed, M., & Metze, F. (2015, December). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 167-174). IEEE.
Sainath, T. N., Pang, R., Rybach, D., He, Y., Prabhavalkar, R., Li, W., Visontai, M., Liang, Q., Strohman, T., Wu, Y., McGraw, I., & Chiu, C.-C. (2019). Two-pass end-to-end speech recognition. In INTERSPEECH 2019, Graz, Austria.
Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P., Weinstein, E., & Rao, K. (2018, April). Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4904-4908). IEEE.
Veaux, C., Yamagishi, J., & MacDonald, K. (2017, 04/02/2020). CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. Available: https://datashare.is.ed.ac.uk/handle/10283/2651.
Wang, S., Zhou, P., Chen, W., Jia, J., & Xie, L. (2019, November). Exploring RNN-Transducer for Chinese speech recognition. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1364-1369). IEEE.
Yu, D., & Deng, L. (2016). Automatic Speech Recognition: A Deep Learning Approach. Springer.

There are 14 references in total.

Details

Primary Language	English
Subjects	Engineering
Section	Research Article
Authors	Yeşim Dokuz (ORCID 0000-0001-7202-2899), Zekeriya Tüfekci (ORCID 0000-0001-7835-2741)
Publication Date	April 1, 2020
Published in Issue	Year 2020, Ejosat Special Issue 2020 (ARACONF)

Cite

APA	Dokuz, Y., & Tüfekci, Z. (2020). Investigation of the Effect of LSTM Hyperparameters on Speech Recognition Performance. Avrupa Bilim ve Teknoloji Dergisi, 161-168. https://doi.org/10.31590/ejosat.araconf21
