Analysis of Lip Reading of Assamese Digits using Deep Learning
DOI: https://doi.org/10.22232/stj.2025.13.01.18
Keywords: Long Short-Term Memory (LSTM), Lip Region Recognition, Color Imaging, Deep Learning, Lip reading, Assamese language, Custom dataset, Digit recognition
Abstract
Effective communication in noisy environments, such as aviation, construction, and manufacturing, is often hindered because high ambient noise makes oral communication difficult. To address this issue, we propose an automatic lip-reading system designed to recognize Assamese digits in high-noise settings. This study introduces a deep learning-based approach that extracts geometric features of lip movements from video data to predict spoken digits accurately. Traditional lip-reading models struggle with language-specific nuances because they rely on generic datasets. To overcome this limitation, we construct a custom dataset of video recordings featuring speakers of diverse ages, genders, and accents, ensuring a more robust and adaptable model. We employ a CNN+LSTM architecture, in which Convolutional Neural Networks (CNNs) capture spatial features and Long Short-Term Memory (LSTM) networks learn temporal dependencies. Experimental results demonstrate that our CNN+LSTM model outperforms conventional architectures such as RNN+LSTM and RNN+CNN, achieving an accuracy of 83%. The findings highlight the effectiveness of deep learning in enhancing accessibility for the deaf and hard-of-hearing and in enabling voice-free human-computer interaction.
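To make the CNN+LSTM pairing concrete, the following is a minimal illustrative sketch in Keras of a per-frame CNN feeding an LSTM for digit classification. The frame count, mouth-crop size, layer widths, and training settings here are assumptions for illustration, not the configuration reported in the paper.

# Minimal CNN+LSTM sketch for digit-level lip reading (illustrative only;
# clip length, crop size, and layer sizes are assumed, not the paper's setup).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C = 30, 64, 64, 1   # assumed clip length and grayscale mouth-crop size
NUM_DIGITS = 10                        # Assamese digits 0-9

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, C)),
    # CNN applied independently to each frame to capture spatial lip-shape features
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),
    layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),
    layers.TimeDistributed(layers.Flatten()),
    # LSTM models the temporal evolution of lip movements across the frame sequence
    layers.LSTM(128),
    layers.Dense(NUM_DIGITS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

In this arrangement the TimeDistributed wrapper shares one set of convolutional weights across all frames, so the LSTM receives a sequence of per-frame spatial feature vectors rather than raw pixels.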
Copyright (c) 2025 Rabinder Kumar Prasad, Dhiraj Kalita, Zakariya Momin Mondal, M. Tiken Singh, S Md S Askari, Chandan Kalita

© The Author(s) 2025. Published by the Science & Technology Journal (STJ), Mizoram University.
This article is open access and distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author(s) and source are properly credited. Authors retain copyright and grant the journal the right of first publication.