Research Article

Lip Reading Using Convolutional Neural Networks with and without Pre-Trained Models

Year 2019, Volume: 7 Issue: 2, 195 - 201, 30.04.2019
https://doi.org/10.17694/bajece.479891

Abstract

Lip reading has recently become a popular research topic, and there is a substantial body of literature on lip reading within human action recognition, in which deep learning methods are frequently used. In this paper, lip reading from video data is performed using self-designed convolutional neural networks (CNNs). For this purpose, both the standard and an augmented version of the AvLetters dataset are used in the training and test stages. To optimize network performance, the mini-batch size parameter is also tuned and its effect is investigated. Additionally, experiments are carried out with the AlexNet and GoogLeNet pre-trained CNNs. Detailed experimental results are presented.
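The transfer-learning setup described in the abstract can be illustrated with a minimal sketch. The code below is an assumption for illustration only (it uses PyTorch/torchvision, which is not necessarily the authors' tooling): it loads a pre-trained AlexNet, replaces the final classifier layer for letter classification, and exposes the mini-batch size as the hyperparameter that the paper tunes.

```python
# Illustrative sketch only (PyTorch/torchvision assumed), not the authors' implementation:
# fine-tuning a pre-trained AlexNet for lip-reading letter classification.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 26      # AvLetters covers the isolated letters A-Z
MINI_BATCH_SIZE = 32  # mini-batch size is the tuned hyperparameter; this value is arbitrary

# Load AlexNet with ImageNet weights and replace its final classification layer.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_one_epoch(model, loader):
    """One pass over mini-batches of lip-region frames and letter labels."""
    model.train()
    for frames, labels in loader:  # loader is assumed to yield batches of size MINI_BATCH_SIZE
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
```

The same skeleton applies to GoogLeNet by swapping the backbone and its output layer, and to a self-designed CNN by defining the architecture directly instead of loading pre-trained weights.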

References

  • S. Agrawal, V. R. Omprakash, and Ranvijay, “Lip reading techniques: A survey,” in 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pp. 753–757, July 2016.
  • A. Garg, J. Noyola, and S. Bagadia, “Lip reading using CNN and LSTM,” in Technical Report, 2016.
  • Y. Li, Y. Takashima, T. Takiguchi, and Y. Ariki, “Lip reading using a dynamic feature of lip images and convolutional neural networks,” in 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), pp. 1–6, June 2016.
  • S. Petridis, Z. Li, and M. Pantic, “End-to-end visual speech recognition with LSTMs,” CoRR, vol. abs/1701.05847, 2017.
  • Y. Takashima, Y. Kakihara, R. Aihara, T. Takiguchi, Y. Ariki, N. Mitani, K. Omori, and K. Nakazono, “Audio-visual speech recognition using convolutive bottleneck networks for a person with severe hearing loss,” IPSJ Transactions on Computer Vision and Applications, vol. 7, pp. 64–68, 2015.
  • A. Yargic and M. Dogan, “A lip reading application on MS Kinect camera,” in 2013 IEEE INISTA, pp. 1–5, June 2013.
  • A. Rekik, A. Ben-Hamadou, and W. Mahdi, “A new visual speech recognition approach for RGB-D cameras,” in Image Analysis and Recognition (A. Campilho and M. Kamel, eds.), (Cham), pp. 21–28, Springer International Publishing, 2014.
  • A. Rekik, A. Ben-Hamadou, and W. Mahdi, “Human machine interaction via visual speech spotting,” in Advanced Concepts for Intelligent Vision Systems (S. Battiato, J. Blanc-Talon, G. Gallo, W. Philips, D. Popescu, and P. Scheunders, eds.), (Cham), pp. 566–574, Springer International Publishing, 2015.
  • A. Rekik, A. Ben-Hamadou, and W. Mahdi, “Unified system for visual speech recognition and speaker identification,” in Advanced Concepts for Intelligent Vision Systems (S. Battiato, J. Blanc-Talon, G. Gallo, W. Philips, D. Popescu, and P. Scheunders, eds.), (Cham), pp. 381–390, Springer International Publishing, 2015.
  • I. Matthews, T. Cootes, J. A. Bangham, S. Cox, and R. Harvey, “Extraction of visual features for lipreading,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, p. 2002, 2002.
  • A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” NIPS, vol. 25, pp. 1106–1114, 2012.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014.
  • I. Anina, Z. Zhou, G. Zhao, and M. Pietikäinen, “OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1, pp. 1–5, May 2015.
  • E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, “Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus,” EURASIP J. Appl. Signal Process., vol. 2002, pp. 1189–1201, Jan. 2002.
  • W. Dong, R. He, and S. Zhang, “Digital recognition from lip texture analysis,” in 2016 IEEE International Conference on Digital Signal Processing (DSP), pp. 477–481, Oct 2016.
  • T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with LSTMs for lipreading,” CoRR, vol. abs/1703.04105, 2017.
  • J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision, pp. 87–103, Springer, 2016.
  • Y. Takashima, R. Aihara, T. Takiguchi, Y. Ariki, N. Mitani, K. Omori, and K. Nakazono, “Audio-visual speech recognition using bimodal-trained bottleneck features for a person with severe hearing loss,” in INTERSPEECH, 2016.
  • E. Kilic, Classification of mitotic figures with convolutional neural networks. M.Sc. thesis, Erciyes University, Graduate School of Natural and Applied Sciences, 2016.
  • H. S. Nogay and T. C. Akinci, “A convolutional neural network application for predicting the locating of squamous cell carcinoma in the lung,” Balkan Journal of Electrical and Computer Engineering, vol. 6, pp. 207 – 210, 2018.
  • H. S. Nogay, “Classification of different cancer types by deep convolutional neural networks,” Balkan Journal of Electrical and Computer Engineering, vol. 6, pp. 56 – 59, 2018.
  • J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang, “Recent advances in convolutional neural networks,” CoRR, vol. abs/1512.07108, 2015.
  • K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
  • S. Das, “CNNs architectures: LeNet, AlexNet, VGG, GoogLeNet, ResNet and more . . . .” https://medium.com/@siddharthdas-32104, 2017.
There are 25 citations in total.

Details

Primary Language English
Subjects Electrical Engineering
Journal Section Research Articles
Authors

Tayyip Ozcan

Alper Basturk

Publication Date April 30, 2019
Published in Issue Year 2019 Volume: 7 Issue: 2

Cite

APA Ozcan, T., & Basturk, A. (2019). Lip Reading Using Convolutional Neural Networks with and without Pre-Trained Models. Balkan Journal of Electrical and Computer Engineering, 7(2), 195-201. https://doi.org/10.17694/bajece.479891

All articles published by BAJECE are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit, and adapt the work, provided the original work and source are appropriately cited.