Deep Convolutional Autoencoder and Residual Vector Quantization-Based Compression of Speech Signals
Year 2024, 113-124, 30.04.2024
Tahir Bekiryazıcı, Gürkan Aydemir, Hakan Gürkan
Abstract
This study proposes a compression method for speech signals based on a deep convolutional autoencoder and residual vector quantization. In the proposed method, the encoder part of an autoencoder first maps the input speech signal to a lower-dimensional (code) space, and the resulting code is then further compressed by residual vector quantization. The method offers different compression ratios thanks to two decoder structures operating in parallel and two codebooks. Its performance was evaluated on the TIMIT dataset using the Perceptual Evaluation of Speech Quality (PESQ) metric. The proposed speech compression method achieved PESQ scores of 1.665 and 1.985 at transmission rates of 1.25 and 2.5 kbps, respectively.
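To make the quantization stage concrete, the sketch below illustrates residual vector quantization with two codebooks in plain NumPy. It is a minimal illustration of the general technique rather than the authors' implementation: the code dimension, the codebook sizes, and the random (unlearned) codebooks are assumptions made only for the example. Each stage quantizes the residual left by the previous stage, so decoding with only the first codebook yields a coarser, lower-rate reconstruction, while decoding with both yields the higher-rate one.

```python
# Minimal sketch of residual vector quantization (RVQ) with two codebooks.
# Illustration of the general technique, not the paper's exact implementation;
# the code dimension and codebook sizes below are assumed values.
import numpy as np

rng = np.random.default_rng(0)
code_dim = 32          # assumed dimensionality of the autoencoder's latent code
codebook_size = 256    # assumed number of entries per codebook (8 bits each)

# Two codebooks; in practice these would be learned (e.g. by k-means or jointly
# with the autoencoder). Random entries are used here purely for illustration.
codebooks = [rng.standard_normal((codebook_size, code_dim)) for _ in range(2)]

def rvq_encode(code, codebooks):
    """Quantize `code` stage by stage: each codebook quantizes the residual
    left by the previous stage, and one index per stage is transmitted."""
    indices, residual = [], code.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))  # nearest entry
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the code by summing the selected entry of each codebook.
    Using only the first index gives a coarser, lower-bitrate reconstruction."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

latent = rng.standard_normal(code_dim)       # stand-in for an encoder output
idx = rvq_encode(latent, codebooks)
coarse = rvq_decode(idx[:1], codebooks[:1])  # one codebook  -> lower rate
fine = rvq_decode(idx, codebooks)            # both codebooks -> higher rate
print(np.linalg.norm(latent - coarse), np.linalg.norm(latent - fine))
```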
Supporting Institution
Bursa Teknik Üniversitesi
Acknowledgments
This study was supported by the Scientific Research Projects unit of Bursa Technical University. Project no: 230D005.
References
- P.K. Mongia, and R.K. Sharma, “Estimation and statistical analysis of human voice parameters to investigate the influence of psychological stress and to determine the vocal tract transfer function of an individual,” Journal of Computer Networks and Communications, vol. 2014, no. 17, pp. 1-17, 2014.
- T.F. Quatieri, “Discrete-time speech signal processing: principles and practice,” Pearson Education India, 2002.
- P. Warkade, and A. Mishra, “Lossless Speech Compression Techniques: A Literature Review,” International Journal of Innovative Research in Computer Science & Technology, vol. 3, pp. 25-32, 2015.
- T. Ogunfunmi, and M. Narasimha, “Principles of speech coding.” CRC Press, 2010.
- L. Rabiner, and R. Schafer, “Theory and applications of digital speech processing.” Prentice Hall Press, USA, 2010.
- D. O'Shaughnessy, “Linear predictive coding,” IEEE Potentials, vol. 7, pp. 29-32, 1988.
- M. Schroeder, and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 10, pp. 937-940, 1985.
- T. Unno, T.P. Barnwell, and K. Truong, “An improved mixed excitation linear prediction (MELP) coder,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 245-248, 1999.
- Ü. Güz, H. Gürkan, and B.S. Yarman, “A new method to represent speech signals via predefined signature and envelope sequences,” EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1–17, 2006.
- B.S. Yarman, Ü. Güz, and H. Gürkan, “On the comparative results of ‘SYMPES: A new method of speech modeling’,” AEU-International Journal of Electronics and Communications, vol. 60, no. 6, pp. 421–427, 2006.
- A. van den Oord et al., “WaveNet: A Generative Model for Raw Audio,” arXiv, Sep. 19, 2016.
- S. Kankanahalli, “End-to-end optimized speech coding with deep neural networks,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2521-2525, 2018.
- H.Y. Keles, J. Rozhon, H.G. Ilk, and M. Voznak, “DeepVoCoder: A CNN model for compression and coding of narrow band speech,” IEEE Access, vol. 7, pp. 75081-75089, 2019.
- Y.T. Lo, S.S. Wang, Y. Tsao, and S.Y.A. Peng, “Pruned-CELP Speech Codec Using Denoising Autoencoder with Spectral Compensation for Quality and Intelligibility Enhancement,” IEEE International Conference on Artificial Intelligence Circuits and Systems, pp. 150-151, 2019.
- K. Zhen, J. Sung, M.S. Lee, S. Beack, and M. Kim, “Cascaded cross-module residual learning towards lightweight end-to-end speech coding,” arXiv:1906.07769, 2019.
- D.N. Rim, I. Jang, and H. Choi, "Deep neural networks and end-to-end learning for audio compression," arXiv:2105.11681, 2021.
- J. Byun, S. Shin, Y. Park, J. Sung, and S. Beack, “Optimization of deep neural network (DNN) speech coder using a multi time scale perceptual loss function,” in Proceedings of the Annual Conference of the International Speech Communication Association, pp. 4411–4415, 2022.
- H. Yang, K. Zhen, S. Beack, and M. Kim, “Source-aware neural speech coding for noisy speech compression,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 706-710, 2021.
- J. Zhang, C. Zhao, and W. Gao, “Optimization-inspired compact deep compressive sensing,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 4, pp. 765-774, 2020.
- M. Zhang, S. Liu, and Y. Wu, “Compression and Enhancement of Speech with Low SNR based on Deep Learning,” IEEE International Conference on Machine Learning, Big Data and Business Intelligence, pp. 242-248, 2022.
- K. Zhen, J. Sung, M. S. Lee, S. Beack, and M. Kim, “Scalable and efficient neural speech coding: A hybrid design,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 12-25, 2022.
- R. Lotfidereshgi, and P. Gournay, “Practical cognitive speech compression,” IEEE Data Science and Learning Workshop, pp. 1-6, 2022.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, and A. Rabinovich, “Going deeper with convolutions,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
- N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495-507, 2021.
- D.P. Kingma, and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
- J.S. Garofolo, “TIMIT acoustic-phonetic continuous speech corpus LDC93S1,” Linguistic Data Consortium, 1993. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93S1
- R.F. Kubichek, “Standards and technology issues in objective voice quality assessment,” Digital Signal Processing, vol. 1, no. 2, pp. 38–44, 1991.
- A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749-752, 2001.