Analysis of Lip Reading of Assamese Digits using Deep Learning

Authors

  • Rabinder Kumar Prasad, Dibrugarh University, India
  • Dhiraj Kalita, Dibrugarh University, India
  • Zakariya Momin Mondal, Dibrugarh University, India
  • M. Tiken Singh, Dibrugarh University, India
  • S Md S Askari, Rajiv Gandhi University, India
  • Chandan Kalita, Gauhati University, India

DOI:

https://doi.org/10.22232/stj.2025.13.01.18

Keywords:

Long Short-Term Memory (LSTM), Lip Region Recognition, Color Imaging, Deep Learning, Lip reading, Assamese language, Custom dataset, Digit recognition

Abstract

Effective oral communication in noisy environments, such as aviation, construction, and manufacturing, is often hindered by high ambient noise. To address this issue, we propose an automatic lip-reading system designed to recognize Assamese digits in high-noise settings. This study introduces a deep learning-based approach that extracts geometric features of lip movements from video data to predict spoken digits accurately. Traditional lip-reading models struggle with language-specific nuances because they rely on generic datasets. To overcome this limitation, we construct a custom dataset of video recordings featuring speakers of diverse ages, genders, and accents, ensuring a more robust and adaptable model. We employ a CNN+LSTM architecture in which Convolutional Neural Networks (CNNs) capture spatial features and Long Short-Term Memory (LSTM) networks learn temporal dependencies. Experimental results demonstrate that our CNN+LSTM model outperforms alternative architectures such as RNN+LSTM and RNN+CNN, achieving an accuracy of 83%. The findings highlight the effectiveness of deep learning in enhancing accessibility for the deaf and hard of hearing and in enabling voice-free human-computer interaction.
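The abstract's pipeline, a per-frame convolutional feature extractor followed by an LSTM that integrates lip-movement features across video frames and ends in a 10-way digit classifier, can be sketched as below. All layer sizes, the toy 16x16 lip crop, and the single random convolution kernel are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical CNN+LSTM sketch: conv features per frame, LSTM over time.
# Sizes and weights are illustrative, not the paper's trained model.
import numpy as np

rng = np.random.default_rng(0)

def conv_features(frame, kernel):
    """Valid 2-D cross-correlation + ReLU: a stand-in for the CNN stage."""
    kh, kw = kernel.shape
    h, w = frame.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0).ravel()  # flatten to a feature vector

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell update; W, U, b pack the input/forget/output/cell gates."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Toy video: 20 frames of a 16x16 grayscale lip crop.
frames = rng.standard_normal((20, 16, 16))
kernel = rng.standard_normal((3, 3))

feat_dim = (16 - 2) * (16 - 2)   # 196 features per frame after 3x3 valid conv
hidden = 32
W = rng.standard_normal((4 * hidden, feat_dim)) * 0.01
U = rng.standard_normal((4 * hidden, hidden)) * 0.01
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for frame in frames:              # CNN per frame, LSTM across frames
    h, c = lstm_step(conv_features(frame, kernel), h, c, W, U, b)

W_out = rng.standard_normal((10, hidden)) * 0.1  # one logit per Assamese digit
logits = W_out @ h
print("predicted digit class:", int(np.argmax(logits)))
```

The final hidden state summarizes the whole utterance, so a single linear head over `h` suffices for isolated-digit classification; sentence-level lip reading (e.g. LipNet, cited below) instead emits a prediction per time step.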

Author Biographies

Rabinder Kumar Prasad, Dibrugarh University, India

Department of Computer Science and Engineering

Dhiraj Kalita, Dibrugarh University, India

Department of Computer Science and Engineering

Zakariya Momin Mondal, Dibrugarh University, India

Department of Computer Science and Engineering

M. Tiken Singh, Dibrugarh University, India

Department of Computer Science and Engineering

S Md S Askari, Rajiv Gandhi University, India

Department of Computer Science and Engineering

Chandan Kalita, Gauhati University, India

Department of Information Technology

References

Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2), 212-215.

Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11(4), 796-804.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1), 71-99.

Treiman, R., Kessler, B., & Bick, S. (2002). Context sensitivity in the spelling of English vowels. Journal of Memory and Language, 47(3), 448-468.

Matthews, I., Cootes, T. F., Bangham, J. A., Cox, S., & Harvey, R. (2002). Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 198-213.

Koller, O., Ney, H., & Bowden, R. (2015). Deep learning of mouth shapes for sign language. In Proceedings of the IEEE International Conference on Computer Vision Workshops (pp. 85-91).

Pawar, D., Borde, P., & Yannawar, P. (2024). Generating dynamic lip-syncing using target audio in a multimedia environment. Natural Language Processing Journal, 100084.

Adeel, A., Gogate, M., Hussain, A., & Whitmer, W. M. (2019). Lip-reading driven deep learning approach for speech enhancement. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(3), 481-490.

Chickerur, S., Patil, M. S., Meti, A., Nabapure, P. M., Mahindrakar, S., Naik, S., & Kanyal, S. (2019). LSTM based lip reading approach for Devanagari script. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 8(3), 13.

Assael, Y. M., Shillingford, B., Whiteson, S., & De Freitas, N. (2016). LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599.

Stafylakis, T., & Tzimiropoulos, G. (2017). Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105.

Fenghour, S., Chen, D., Guo, K., & Xiao, P. (2020). Lip reading sentences using deep learning with only visual cues. IEEE Access, 8, 215516-215530.

Kalbande, D., & Patil, S. (2011, September). Lip reading using neural networks. In International Conference on Graphic and Image Processing (ICGIP 2011) (Vol. 8285, pp. 310-316). SPIE.

Garg, A., Noyola, J., & Bagadia, S. (2016). Lip reading using CNN and LSTM. Technical report, Stanford University, CS231n project report.

Vakhshiteh, F., Almasganj, F., & Nickabadi, A. (2018). Lip-reading via deep neural networks using hybrid visual features. Image Analysis and Stereology, 37(2), 159-171.

Faisal, M., & Manzoor, S. (2018). Deep learning for lip reading using audio-visual information for urdu language. arXiv preprint arXiv:1802.05521.

Zhu, M.-L., Wang, Q.-Q., & Luo, J.-L. (2019). Lip-reading based on deep learning model. Transactions on Edutainment XV (pp. 32-43).

Abrar, M. A., Islam, A. N., Hassan, M. M., Islam, M. T., Shahnaz, C., & Fattah, S. A. (2019, November). Deep lip reading: A deep learning based lip-reading software for the hearing impaired. In 2019 IEEE R10 Humanitarian Technology Conference (R10-HTC)(47129) (pp. 40-44). IEEE.

Shirakata, T., & Saitoh, T. (2020). Lip reading using facial expression features. International Journal of Computer Vision and Signal Processing, 1(1), 9-15.

Miled, M., Messaoud, M. A. B., & Bouzid, A. (2023). Lip reading of words with lip segmentation and deep learning. Multimedia Tools and Applications, 82(1), 551-571.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Staudemeyer, R. C., & Morris, E. R. (2019). Understanding LSTM -- a tutorial into long short-term memory recurrent neural networks. arXiv preprint arXiv:1909.09586.

Yu, W., Zhang, J., & Li, Y. (2019). A review of LSTM networks and their applications in speech recognition. IEEE Transactions on Speech and Audio Processing, 27(6), 987-1000.

Wagle, A., Sharma, B., & Singh, C. (2021). An overview of convolutional neural networks (CNNs) and their applications. International Journal of Artificial Intelligence Research, 45(3), 123-134.

Townsend, J. T. (1971). Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics, 9, 40-50.

Published

2025-09-29

How to Cite

Rabinder Kumar Prasad, Dhiraj Kalita, Zakariya Momin Mondal, M. Tiken Singh, S Md S Askari, & Chandan Kalita. (2025). Analysis of Lip Reading of Assamese Digits using Deep Learning. Science & Technology Journal, 13(1). https://doi.org/10.22232/stj.2025.13.01.18

Section

Research Articles
