A Review on Deep Learning Architectures for Speech Recognition

Yeşim Dokuz; Zekeriya Tüfekci

doi:10.31590/ejosat.araconf22

Abstract

Deep learning is a branch of machine learning that models datasets using deep architectures with many processing layers. Following their popularity and successful application in other fields, deep learning architectures are now also being used in speech recognition. Researchers have applied these architectures to speech recognition and its applications, such as speech emotion recognition, voice activity detection, and speaker recognition and verification, to better model the relationship between speech inputs and outputs and to reduce the error rates of speech recognition systems. Many studies in the literature use deep learning architectures for speech recognition systems. These studies show that deep learning architectures benefit many areas of speech recognition, reducing error rates and improving performance. In this study, we first explain the speech recognition problem and the steps of speech recognition. We then analyze studies on deep learning based speech recognition. In particular, we evaluate Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and hybrid approaches that combine these architectures, and we survey the literature on their use in speech recognition and its application areas. We observe that RNNs are the most widely used and most powerful of these architectures in terms of error rates and speech recognition performance. CNNs are another successful architecture, with results close to those of RNNs on the same measures. We also observe that new deep architectures, whether hybrids of DNNs, CNNs, and RNNs or other deep learning architectures, are attracting attention, show increasing performance, and can reduce error rates in speech recognition.

Keywords

Speech recognition, Deep learning, DNNs, CNNs, RNNs, Hybrid architectures

Details

Primary Language

English

Subjects

Engineering

Section

Review

Authors

Yeşim Dokuz
0000-0001-7202-2899
Türkiye

Zekeriya Tüfekci
0000-0001-7835-2741
Türkiye

Publication Date

April 1, 2020

Submission Date

March 15, 2020

Acceptance Date

March 28, 2020

Published in Issue

Year 2020

DOI

https://doi.org/10.31590/ejosat.araconf22

IZ

https://izlik.org/JA83AE27ZX

Cite

APA

Dokuz, Y., & Tüfekci, Z. (2020). A Review on Deep Learning Architectures for Speech Recognition. Avrupa Bilim ve Teknoloji Dergisi, 169-176. https://doi.org/10.31590/ejosat.araconf22
