Advancing Voice Cloning for Nepali Language: Leveraging Transfer Learning in Low-Resource Language

Authors

  • Manjil Karki, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
  • Pratik Shakya, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
  • Ravi Pandit, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
  • Sandesh Acharya, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
  • Dinesh Gothe, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University

DOI:

https://doi.org/10.3126/jsce.v12i1.82362

Keywords:

Voice cloning, Low-resource language, Nepali speech synthesis, Transfer learning, Speaker encoder, Tacotron2, WaveNet

Abstract

Voice cloning refers to synthesizing speech that mimics the vocal characteristics of a specific individual from a limited number of audio samples. The technology is widely applied in personalized voice interfaces, assistive technologies, and digital content creation. However, most existing voice cloning systems are developed for high-resource languages that benefit from extensive annotated datasets. In contrast, this study introduces a novel voice cloning framework designed specifically for Nepali, a low-resource language with limited linguistic and acoustic resources. The proposed system combines a speaker encoder, a Tacotron2-based synthesizer, and a WaveNet vocoder, trained with a transfer learning approach that leverages multilingual pre-trained models to mitigate the challenges of data scarcity. To support this effort, we constructed a Nepali speech corpus comprising 168 hours of audio from 546 speakers and adapted the entire synthesis pipeline to the Devanagari script and the phonological characteristics of Nepali. Evaluation with both subjective and objective metrics demonstrates the system's effectiveness, with mean opinion scores (MOS) of 3.93 for naturalness and 3.29 for speaker similarity, and a low equal error rate (EER) of 0.005. These results affirm the feasibility of high-quality voice cloning in low-resource language settings and establish a robust foundation for further work on Nepali speech synthesis and voice cloning.
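
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a MOS and an EER are commonly computed. It is not the authors' evaluation code: the function names, the threshold sweep, and all rating and score values are illustrative assumptions, and the abstract does not state how the speaker-verification scores were obtained.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings (commonly on a 1-5 scale)."""
    return mean(ratings)


def equal_error_rate(genuine, impostor):
    """Estimate the EER from similarity scores by sweeping thresholds.

    genuine:  scores for same-speaker trials (higher = more similar)
    impostor: scores for different-speaker trials
    Returns (eer, threshold) at the point where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical values, for illustration only (not data from the paper).
example_ratings = [4, 4, 3, 5]
example_genuine = [0.9, 0.8, 0.7, 0.6]
example_impostor = [0.1, 0.2, 0.3, 0.65]

mos = mean_opinion_score(example_ratings)
eer, threshold = equal_error_rate(example_genuine, example_impostor)
print(f"MOS = {mos:.2f}, EER = {eer:.3f} at threshold {threshold}")
```

An EER near zero, such as the 0.005 reported above, means the score distributions for same-speaker and different-speaker trials barely overlap.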

Published

2025-08-12

How to Cite

Karki, M., Shakya, P., Pandit, R., Acharya, S., & Gothe, D. (2025). Advancing Voice Cloning for Nepali Language: Leveraging Transfer Learning in Low-Resource Language. Journal of Science and Engineering, 12(1), 49–56. https://doi.org/10.3126/jsce.v12i1.82362

Section

Research Article