Advancing Voice Cloning for Nepali Language: Leveraging Transfer Learning in Low-Resource Language
DOI:
https://doi.org/10.3126/jsce.v12i1.82362
Keywords:
Voice cloning, Low-resource language, Nepali speech synthesis, Transfer learning, Speaker encoder, Tacotron2, WaveNet
Abstract
Voice cloning is the synthesis of speech that mimics the vocal characteristics of a specific individual from a limited number of audio samples. The technology is widely applied in personalized voice interfaces, assistive technologies, and digital content creation. However, most existing voice cloning systems are developed for high-resource languages that benefit from extensive annotated datasets. In contrast, this study introduces a novel voice cloning framework designed specifically for Nepali, a low-resource language with limited linguistic and acoustic resources. The proposed system combines a speaker encoder, a Tacotron2-based synthesizer, and a WaveNet vocoder, trained through transfer learning from multilingual pre-trained models to mitigate the challenges caused by data scarcity. To support this effort, we constructed a Nepali speech corpus comprising 168 hours of audio from 546 speakers and adapted the entire synthesis pipeline to accommodate the Devanagari script and the phonological nuances of Nepali. Evaluation with both subjective and objective metrics demonstrates the system’s effectiveness, with mean opinion scores (MOS) of 3.93 for naturalness and 3.29 for speaker similarity, and a low equal error rate (EER) of 0.005. These results affirm the feasibility of high-quality voice cloning in low-resource language settings and establish a robust foundation for further exploration and development in Nepali speech synthesis and voice cloning.
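
The abstract reports an equal error rate (EER) but does not say how it was computed. As a minimal sketch of the general idea only, the Python below finds the decision threshold at which the false acceptance rate and false rejection rate are closest, and reports their average as the EER. The function name `equal_error_rate` and the similarity scores are hypothetical illustrations, not the authors' evaluation code or data.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    genuine:  similarity scores for same-speaker trials (higher = more similar)
    impostor: similarity scores for different-speaker trials
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))

    best_gap, eer, best_t = np.inf, 1.0, thresholds[0]
    for t in thresholds:
        far = np.mean(impostor >= t)  # fraction of impostor trials accepted
        frr = np.mean(genuine < t)    # fraction of genuine trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, best_t = gap, (far + frr) / 2, t
    return eer, best_t

# Hypothetical scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.65]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.3f} at threshold {threshold:.2f}")
```

A lower EER means the genuine and impostor score distributions overlap less. With a finite trial list the two error rates may never cross exactly, which is why this sketch averages them at the closest point.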