Advancing Voice Cloning for Nepali Language: Leveraging Transfer Learning in Low-Resource Language

Manjil Karki; Pratik Shakya; Ravi Pandit; Sandesh Acharya; Dinesh Gothe

doi:10.3126/jsce.v12i1.82362

Authors

Manjil Karki, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
Pratik Shakya, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
Ravi Pandit, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
Sandesh Acharya, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University
Dinesh Gothe, Department of Computer Engineering, Khwopa College of Engineering, Tribhuvan University


Keywords:

Voice cloning, Low-resource language, Nepali speech synthesis, Transfer learning, Speaker encoder, Tacotron2, WaveNet

Abstract

Voice cloning is the synthesis of speech that mimics the vocal characteristics of a specific individual from a limited number of audio samples. The technology is widely applied in personalized voice interfaces, assistive technologies, and digital content creation. However, most existing voice cloning systems are developed for high-resource languages, which benefit from extensive annotated datasets. In contrast, this study introduces a novel voice cloning framework designed specifically for Nepali, a low-resource language with limited linguistic and acoustic resources. The proposed system combines a speaker encoder, a Tacotron2-based synthesizer, and a WaveNet vocoder, trained through transfer learning from multilingual pre-trained models to mitigate the challenges caused by data scarcity. To support this effort, we constructed a Nepali speech corpus comprising 168 hours of audio from 546 speakers and adapted the entire synthesis pipeline to the Devanagari script and the phonological nuances of Nepali. Evaluation with both subjective and objective metrics demonstrates the system’s effectiveness, with mean opinion scores (MOS) of 3.93 for naturalness and 3.29 for speaker similarity, and a low equal error rate (EER) of 0.005. These results affirm the feasibility of high-quality voice cloning in low-resource language contexts and establish a robust foundation for further work on Nepali speech synthesis and voice cloning.
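
The abstract reports an equal error rate (EER) but does not describe how it was computed. As a rough illustration only, EER is commonly taken as the operating point where the false acceptance rate and the false rejection rate of a speaker verification system are equal. The sketch below assumes similarity scores where higher means "more likely the same speaker"; the function name `equal_error_rate` and the example scores are hypothetical and are not taken from the study.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by using every observed score as a threshold.

    genuine:  similarity scores for same-speaker trials (assumed input)
    impostor: similarity scores for different-speaker trials (assumed input)
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Made-up scores for illustration; not data from the paper.
    genuine_scores = [0.9, 0.8, 0.3, 0.7]
    impostor_scores = [0.1, 0.2, 0.5, 0.4]
    print(equal_error_rate(genuine_scores, impostor_scores))
```

With only a handful of trials the estimate is coarse; a value as low as the reported 0.005 would need a large number of verification trials to be resolved at all.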
