Research Article

Deep Convolutional Autoencoder and Residual Vector Quantization-Based Compression of Speech Signals

Year 2024, Volume: 6, Issue: 1, 113-124, 30.04.2024
https://doi.org/10.46387/bjesr.1452937

Abstract

This study proposes a compression method for speech signals based on a deep convolutional autoencoder and residual vector quantization. In the proposed method, the encoder part of an autoencoder first maps the input speech signal to a lower-dimensional (code) space, and the resulting code is then further compressed via residual vector quantization. Thanks to two decoder structures operating in parallel and two codebooks, the method offers different compression ratios. Its performance was evaluated on the TIMIT dataset using the Perceptual Evaluation of Speech Quality (PESQ) metric. The proposed speech compression method achieved PESQ scores of 1.665 and 1.985 at transmission rates of 1.25 and 2.5 kbps, respectively.
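
To make the two-stage idea in the abstract concrete, below is a minimal NumPy sketch of residual vector quantization applied to an encoder output. It is not the authors' implementation: the frame length, code dimension, codebook sizes, and the random linear "encoder" are illustrative assumptions, used only to show how each codebook quantizes the residual left by the previous stage. Sending the index from the first codebook alone versus both indices is one plausible way two codebooks could yield two bitrates, in line with the abstract's two parallel decoders.

import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(code, codebooks):
    # Residual vector quantization: each codebook quantizes what the previous stage left over.
    residual = code.copy()
    indices = []
    for cb in codebooks:                                   # one pass per quantization stage
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)                                # transmit this codeword index
        residual = residual - cb[idx]                      # pass the remaining error to the next stage
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruct the code as the sum of the selected codewords.
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Illustrative sizes (assumptions, not the paper's): a 160-sample frame,
# a 16-dimensional code, and two codebooks of 256 entries each (8 bits per index).
frame = rng.standard_normal(160)
encoder = rng.standard_normal((16, 160)) / np.sqrt(160)   # stand-in for the convolutional encoder
codebooks = [rng.standard_normal((256, 16)) for _ in range(2)]

code = encoder @ frame
indices = rvq_encode(code, codebooks)                      # lower rate: send indices[0] only; higher rate: send both
reconstructed = rvq_decode(indices, codebooks)
print(indices, float(np.linalg.norm(code - reconstructed)))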

Supporting Institution

Bursa Teknik Üniversitesi

Project Number

230D005

Acknowledgments

This work was supported by the Scientific Research Projects unit of Bursa Teknik Üniversitesi (project no: 230D005).

References

  • P.K. Mongia, and R.K. Sharma, “Estimation and statistical analysis of human voice parameters to investigate the influence of psychological stress and to determine the vocal tract transfer function of an individual,” Journal of Computer Networks and Communications, vol. 2014, no. 17, pp. 1-17, 2014.
  • T.F. Quatieri, “Discrete-time speech signal processing: principles and practice,” Pearson Education India, 2002.
  • P. Warkade, and A. Mishra, “Lossless Speech Compression Techniques: A Literature Review,” International Journal of Innovative Research in Computer Science & Technology, vol. 3, pp. 25-32, 2015.
  • T. Ogunfunmi, and M. Narasimha, “Principles of speech coding,” CRC Press, 2010.
  • L. Rabiner, and R. Schafer, “Theory and applications of digital speech processing,” Prentice Hall Press, USA, 2010.
  • D. O'Shaughnessy, “Linear predictive coding,” IEEE Potentials, vol. 7, pp. 29-32, 1988.
  • M. Schroeder, and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 10, pp. 937-940, 1985.
  • T. Unno, T.P. Barnwell, and K. Truong, “An improved mixed excitation linear prediction (MELP) coder,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 245-248, 1999.
  • Ü. Güz, H. Gürkan, and B.S. Yarman, “A new method to represent speech signals via predefined signature and envelope sequences,” EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1–17, 2006.
  • B.S. Yarman, Ü. Güz, and H. Gürkan, “On the comparative results of ‘sympes: A new method of speech modeling’,” AEU-International Journal of Electronics and Communications, vol. 60, no. 6, pp. 421–427, 2006.
  • A. van den Oord et al., “WaveNet: A Generative Model for Raw Audio,” arXiv, Sep. 19, 2016.
  • S. Kankanahalli, “End-to-end optimized speech coding with deep neural networks,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2521-2525, 2018.
  • H.Y. Keles, J. Rozhon, H.G. Ilk, and M. Voznak, “DeepVoCoder: A CNN model for compression and coding of narrow band speech,” IEEE Access, vol. 7, pp. 75081-75089, 2019.
  • Y.T. Lo, S.S. Wang, Y. Tsao, and S.Y.A. Peng, “Pruned-CELP Speech Codec Using Denoising Autoencoder with Spectral Compensation for Quality and Intelligibility Enhancement,” IEEE International Conference on Artificial Intelligence Circuits and Systems, pp. 150-151, 2019.
  • K. Zhen, J. Sung, M.S. Lee, S. Beack, and M. Kim, “Cascaded cross-module residual learning towards lightweight end-to-end speech coding,” arXiv:1906.07769, 2019.
  • D.N. Rim, I. Jang, and H. Choi, “Deep neural networks and end-to-end learning for audio compression,” arXiv:2105.11681, 2021.
  • J. Byun, S. Shin, Y. Park, J. Sung, and S. Beack, “Optimization of deep neural network (DNN) speech coder using a multi time scale perceptual loss function,” in Proceedings of the Annual Conference of the International Speech Communication Association, pp. 4411–4415, 2022.
  • H. Yang, K. Zhen, S. Beack, and M. Kim, “Source-aware neural speech coding for noisy speech compression,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 706-710, 2021.
  • J. Zhang, C. Zhao, and W. Gao, “Optimization-inspired compact deep compressive sensing,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 4, pp. 765-774, 2020.
  • M. Zhang, S. Liu, and Y. Wu, “Compression and Enhancement of Speech with Low SNR based on Deep Learning,” IEEE International Conference on Machine Learning, Big Data and Business Intelligence, pp. 242-248, 2022.
  • K. Zhen, J. Sung, M. S. Lee, S. Beack, and M. Kim, “Scalable and efficient neural speech coding: A hybrid design,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 12-25, 2022.
  • R. Lotfidereshgi, and P. Gournay, “Practical cognitive speech compression,” IEEE Data Science and Learning Workshop, pp. 1-6, 2022.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, and A. Rabinovich, “Going deeper with convolutions,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
  • N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495-507, 2021.
  • D.P. Kingma, and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
  • J.S. Garofolo, “TIMIT acoustic-phonetic continuous speech corpus LDC93S1,” Linguistic Data Consortium, 1993. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93S1
  • R.F. Kubichek, “Standards and technology issues in objective voice quality assessment,” Digital Signal Processing, vol. 1, no. 2, pp. 38–44, 1991.
  • A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749-752, 2001.

Details

Primary Language: Turkish
Subjects: Coding, Information Theory and Compression; Deep Learning
Section: Research Articles
Authors

Tahir Bekiryazıcı 0000-0002-0664-649X

Gürkan Aydemir 0000-0001-9213-576X

Hakan Gürkan 0000-0002-7008-4778

Project Number: 230D005
Early View Date: April 27, 2024
Publication Date: April 30, 2024
Submission Date: March 14, 2024
Acceptance Date: April 15, 2024
Published in Issue: Year 2024, Volume: 6, Issue: 1

Cite

APA Bekiryazıcı, T., Aydemir, G., & Gürkan, H. (2024). Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. Mühendislik Bilimleri Ve Araştırmaları Dergisi, 6(1), 113-124. https://doi.org/10.46387/bjesr.1452937
AMA Bekiryazıcı T, Aydemir G, Gürkan H. Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. Müh.Bil.ve Araş.Dergisi. Nisan 2024;6(1):113-124. doi:10.46387/bjesr.1452937
Chicago Bekiryazıcı, Tahir, Gürkan Aydemir, ve Hakan Gürkan. “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı Ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”. Mühendislik Bilimleri Ve Araştırmaları Dergisi 6, sy. 1 (Nisan 2024): 113-24. https://doi.org/10.46387/bjesr.1452937.
EndNote Bekiryazıcı T, Aydemir G, Gürkan H (01 Nisan 2024) Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. Mühendislik Bilimleri ve Araştırmaları Dergisi 6 1 113–124.
IEEE T. Bekiryazıcı, G. Aydemir, ve H. Gürkan, “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”, Müh.Bil.ve Araş.Dergisi, c. 6, sy. 1, ss. 113–124, 2024, doi: 10.46387/bjesr.1452937.
ISNAD Bekiryazıcı, Tahir vd. “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı Ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”. Mühendislik Bilimleri ve Araştırmaları Dergisi 6/1 (Nisan 2024), 113-124. https://doi.org/10.46387/bjesr.1452937.
JAMA Bekiryazıcı T, Aydemir G, Gürkan H. Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. Müh.Bil.ve Araş.Dergisi. 2024;6:113–124.
MLA Bekiryazıcı, Tahir vd. “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı Ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”. Mühendislik Bilimleri Ve Araştırmaları Dergisi, c. 6, sy. 1, 2024, ss. 113-24, doi:10.46387/bjesr.1452937.
Vancouver Bekiryazıcı T, Aydemir G, Gürkan H. Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. Müh.Bil.ve Araş.Dergisi. 2024;6(1):113-24.