Advancements in Nepali Speech Recognition: A Comparative Study of BiLSTM, Transformer, and Hybrid Models

Authors

  • Ankit Kafle Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Jenith Rajlawat Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Nawaraj Shah Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Neetish Paudel Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Bishal Thapa Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal

DOI:

https://doi.org/10.3126/injet.v2i1.72525

Keywords:

Automatic Speech Recognition, Convolutional Neural Networks, Connectionist Temporal Classification, Mel-frequency cepstral coefficients, Residual Networks, Bidirectional Long Short-Term Memory

Abstract

In today's world, leveraging Automatic Speech Recognition (ASR) technology to process and understand spoken language is highly desirable. Our proposed Nepali Speech Recognition system employs advanced technology to recognize and interpret spoken Nepali. It processes Nepali speech, allowing it to respond to user queries effectively. To achieve this, we employ a combination of advanced neural network models. We extract Mel-frequency cepstral coefficients (MFCCs) from the preprocessed audio data; these MFCCs capture crucial spectral characteristics of Nepali speech and serve as essential input features for our neural network model. To design an optimal model for text-based query processing, we use convolutional neural network (CNN), residual network (ResNet), and bidirectional long short-term memory (BiLSTM) layers. The CNN layers excel at extracting local patterns and spatial features from the MFCC input; the ResNet layers capture deeper representations to enhance performance. The BiLSTM layers model temporal dependencies in the sequential data. We use the Connectionist Temporal Classification (CTC) loss function to enable sequence-to-sequence mapping, aligning the input speech with the corresponding text outputs. This approach allows our system to process queries effectively and provide accurate responses, enhancing its usefulness to the user.
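The CTC alignment mentioned above can be illustrated with a minimal sketch of greedy (best-path) decoding: take the most likely label per audio frame, collapse consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode`, the blank index 0, and the toy vocabulary are illustrative assumptions and are not taken from the paper, which does not state its decoding procedure in the abstract.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int],
                      vocab: Sequence[str],
                      blank: int = 0) -> str:
    """Collapse repeated per-frame labels, then remove blanks."""
    out: List[str] = []
    prev: Optional[int] = None
    for idx in frame_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank label.
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


# Hypothetical toy vocabulary; index 0 is the CTC blank.
VOCAB = ["<blank>", "न", "म", "स", "्", "त", "े"]

# Hypothetical per-frame argmax labels from an acoustic model.
frames = [1, 1, 0, 2, 2, 0, 3, 4, 5, 5, 6]
decoded = ctc_greedy_decode(frames, VOCAB)
print(decoded)
```

The blank label is what lets the output contain a genuinely doubled character: the frame sequence `[1, 0, 1]` decodes to two symbols, whereas `[1, 1]` collapses to one.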
The model, which has 1.55 million parameters, was trained on about 157,000 (1 lakh 57 thousand) audio samples for 47 epochs and achieved a character error rate (CER) of 17.98% (an 82.02% character accuracy rate).
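The reported 17.98% and the 82.02% character accuracy rate are complements (100 − 17.98 = 82.02), which is consistent with a character-level error rate. The following is a minimal sketch of how such a rate is commonly computed, assuming Levenshtein edit distance over Unicode code points divided by the reference length. The function names are illustrative, and the paper's actual evaluation procedure is not given in the abstract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one wrong character out of four.
cer = character_error_rate("abcd", "abxd")
accuracy = 1.0 - cer
print(f"CER = {cer:.2%}, character accuracy = {accuracy:.2%}")
```

For Devanagari text, counting code points treats a consonant and its vowel sign as separate characters; an evaluation based on grapheme clusters would give somewhat different numbers.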

Published

2024-12-16

How to Cite

Kafle, A., Rajlawat, J., Shah, N., Paudel, N., & Thapa, B. (2024). Advancements in Nepali Speech Recognition: A Comparative Study of BiLSTM, Transformer, and Hybrid Models. International Journal on Engineering Technology, 2(1), 96–105. https://doi.org/10.3126/injet.v2i1.72525

Issue

Section

Articles