Research Article

TinyML based audio visual keyword detection

Year 2024, Volume: 13 Issue: 4, 1207 - 1215, 15.10.2024
https://doi.org/10.28948/ngumuh.1482481

Abstract

Keyword detection (KWD) is one of the areas in which machine learning is applied. Its purpose is the automatic detection of specific words or objects in audio or image data. As portable artificial intelligence applications become more prevalent, the number of applications in this field is also growing. In particular, hybrid systems (using audio and video together) are being studied to increase the effectiveness of KWD applications; such a system combines audio and visual commands detected through two separate channels. Extensive work on audio-visual keyword detection has been carried out in desktop computer environments, with good results. At the same time, efforts are being made within the scope of TinyML (low-power machine learning) to run deep learning applications on low-capacity processors. In these applications, reducing the parameters of the deep learning model (through quantization and pruning) makes it possible to run the model on an ordinary microcontroller. In this study, a keyword detection application in the field of TinyML is proposed using both audio and visual data. To train the proposed hybrid model, the audio and visual models were first trained separately in the Edge Impulse software environment. The developed MobileNetV2- and CNN-based models were loaded onto ESP32-CAM and Arduino Nano BLE development kits and tested. The models were then combined using a linear weighted fusion method and tested again, and the performance of the system was evaluated against standard metrics. In the experimental results, measured by accuracy, the audio-only KWD achieved 85% and the image-only KWD achieved 85%, while the audio-visual hybrid application reached a classification accuracy of around 90%.
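The linear weighted fusion step described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the weight `alpha`, the label list, and the function names are assumptions, and the per-class score vectors stand in for the confidence outputs of the two on-device models.

```python
# Hedged sketch of linear weighted score fusion for audio-visual keyword
# detection. `alpha`, the labels, and the function names are illustrative
# assumptions, not values taken from the paper.

def fuse_scores(audio_scores, visual_scores, alpha=0.5):
    """Combine per-class confidence scores from the audio and visual
    models with a linear weight: fused = alpha*audio + (1-alpha)*visual."""
    if len(audio_scores) != len(visual_scores):
        raise ValueError("score vectors must cover the same classes")
    return [alpha * a + (1 - alpha) * v
            for a, v in zip(audio_scores, visual_scores)]

def classify(audio_scores, visual_scores, labels, alpha=0.5):
    """Return the label with the highest fused score and that score."""
    fused = fuse_scores(audio_scores, visual_scores, alpha)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best], fused[best]
```

With `alpha = 0.5` the two channels contribute equally; shifting `alpha` toward the more reliable channel is the usual way such a fusion would be tuned.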

References

  • J. Tian, The human resources development applications of machine learning in the view of artificial intelligence. IEEE 3rd International Conference, 39-43, 2020. https://doi.org/10.1109/CCET50901.2020.9213113.
  • M. Rusci and T. Tuytelaars, On-device customization of tiny deep learning models for keyword spotting with few examples. IEEE Micro, 43(6), 50-57, 2023. https://doi.org/10.1109/MM.2023.3311826.
  • Y. Abadade, A. Temouden, H. Bamoumen, N. Benamar, Y. Chtouki and A. S. Hafid, A comprehensive survey on TinyML. IEEE Access, 11, 96892-96922, 2023. https://doi.org/10.1109/ACCESS.2023.3294111.
  • P. Warden and D. Situnayake, TinyML machine learning with TensorFlow lite on arduino and ultra-low-power microcontrollers. O’Reilly Media, 2019.
  • M. Altayeb, M. Zennaro and E. Pietrosemoli, TinyML gamma radiation classifier. Nuclear Engineering and Technology, 55(2), 443-451, 2023. https://doi.org/10.1016/j.net.2022.09.032.
  • M. Lord, TinyML, anomaly detection. Master's Thesis, California State University, Computer Science, Northridge, USA, 2021.
  • M. Monfort Grau, TinyML from basic to advanced applications. Bachelor Thesis, Universitat Politècnica de Catalunya, Facultat d'Informàtica de Barcelona, Spain, 2021.
  • S. Sadhu and P. K. Ghosh, Low resource point process models for keyword spotting using unsupervised online learning. 25th European Signal Processing Conference, 538-542, 2017. https://doi.org/10.23919/eusipco.2017.8081265.
  • Z. Tang, L. Chen, B. Wu, D. Yu and D. Manocha, Improving reverberant speech training using diffuse acoustic simulation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6969-6973, 2020. https://doi.org/10.48550/arXiv.1907.03988.
  • J. M. Phillips and J. M. Conrad, Robotic system control using embedded machine learning and speech recognition. 19th International Conference on Smart Communities, Improving Quality of Life Using ICT, IoT and AI (HONET), 214-218, 2022. https://doi.org/10.1109/HONET56683.2022.10019106.
  • H. Han and J. Siebert, TinyML: A systematic review and synthesis of existing research. International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 269-274, 2022. https://doi.org/10.1109/ICAIIC54071.2022.9722636.
  • N. S. Huynh, S. De La Cruz and A. Perez-Pons, Denial-of-Service (DoS) Attack Detection Using Edge Machine Learning. International Conference on Machine Learning and Applications (ICMLA), 1741-1745, 2023. https://doi.org/10.1109/ICMLA58977.2023.00264.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Computer Vision and Pattern Recognition, 2017. https://doi.org/10.48550/arXiv.1704.04861.
  • Kaggle, Hand Gesture Recognition Dataset. https://www.kaggle.com/datasets/aryarishabh/hand-gesture-recognition-dataset, Accessed 14 January 2024.
  • P. Warden, Speech commands: A dataset for limited-vocabulary speech recognition. Computation and Language, 2018. https://doi.org/10.48550/arXiv.1804.03209.
  • Papers with code, Speech Commands, https://paperswithcode.com/dataset/speech-commands, Accessed 2 February 2024.
  • R. Vygon and N. Mikhaylovskiy, Learning efficient representations for keyword spotting with triplet loss. 23rd International Conference SPECOM, 2021. https://doi.org/10.1007/978-3-030-87802-3_69.
  • B. Kim, S. Chang, J. Lee and D. Sung, Broadcasted Residual Learning for Efficient Keyword Spotting. Proceedings of INTERSPEECH, 2021. https://doi.org/10.48550/arXiv.2106.04140.
  • D. Seo, H.-S. Oh and Y. Jung, Wav2KWS: Transfer Learning From Speech Representations for Keyword Spotting. IEEE Access, 9, 80682-80691, 2021. https://doi.org/10.1109/ACCESS.2021.3078715.
  • R. Tang, J. Lee, A. Razi, J. Cambre, I. Bicking, J. Kaye and J. Lin, Howl: A Deployed, Open-Source Wake Word Detection System. Computation and Language, 2020, https://doi.org/10.48550/arXiv.2008.09606.
  • A. Berg, M. O’Connor and M. Tairum Cruz, Keyword Transformer: A Self-Attention Model for Keyword Spotting. Interspeech, 4249-4253, 2021, https://doi.org/10.21437/Interspeech.2021-1286.
  • C. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan and J. Gehrke, A scalable noisy speech dataset and online subjective test framework. InterSpeech, 2019. https://doi.org/10.48550/arXiv.1909.08050.
  • A. Mahmood and U. Köse, Speech recognition based on convolutional neural networks and MFCC algorithm. Advances in Artificial Intelligence Research, 1(1), 6–12, 2021.
  • Y. Xu, J. Sun, Y. Han, S. Zhao, C. Mei, T. Guo, S. Zhou, C. Xie, W. Zou and X. Li, Audio-Visual Wake Word Spotting System For MISP Challenge 2021. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 9246-9250, 2021, https://doi.org/10.48550/arXiv.2204.08686.


Details

Primary Language Turkish
Subjects Deep Learning, Embedded Systems
Journal Section Research Articles
Authors

Mehmet Tosun 0009-0007-0769-1990

Hamit Erdem 0000-0003-1704-1581

Early Pub Date September 11, 2024
Publication Date October 15, 2024
Submission Date May 11, 2024
Acceptance Date July 30, 2024
Published in Issue Year 2024 Volume: 13 Issue: 4

Cite

APA Tosun, M., & Erdem, H. (2024). TinyML tabanlı görsel işitsel anahtar kelime tespiti. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, 13(4), 1207-1215. https://doi.org/10.28948/ngumuh.1482481
AMA Tosun M, Erdem H. TinyML tabanlı görsel işitsel anahtar kelime tespiti. NOHU J. Eng. Sci. October 2024;13(4):1207-1215. doi:10.28948/ngumuh.1482481
Chicago Tosun, Mehmet, and Hamit Erdem. “TinyML Tabanlı görsel işitsel Anahtar Kelime Tespiti”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13, no. 4 (October 2024): 1207-15. https://doi.org/10.28948/ngumuh.1482481.
EndNote Tosun M, Erdem H (October 1, 2024) TinyML tabanlı görsel işitsel anahtar kelime tespiti. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13 4 1207–1215.
IEEE M. Tosun and H. Erdem, “TinyML tabanlı görsel işitsel anahtar kelime tespiti”, NOHU J. Eng. Sci., vol. 13, no. 4, pp. 1207–1215, 2024, doi: 10.28948/ngumuh.1482481.
ISNAD Tosun, Mehmet - Erdem, Hamit. “TinyML Tabanlı görsel işitsel Anahtar Kelime Tespiti”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13/4 (October 2024), 1207-1215. https://doi.org/10.28948/ngumuh.1482481.
JAMA Tosun M, Erdem H. TinyML tabanlı görsel işitsel anahtar kelime tespiti. NOHU J. Eng. Sci. 2024;13:1207–1215.
MLA Tosun, Mehmet and Hamit Erdem. “TinyML Tabanlı görsel işitsel Anahtar Kelime Tespiti”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, vol. 13, no. 4, 2024, pp. 1207-15, doi:10.28948/ngumuh.1482481.
Vancouver Tosun M, Erdem H. TinyML tabanlı görsel işitsel anahtar kelime tespiti. NOHU J. Eng. Sci. 2024;13(4):1207-15.
