Nepali Text-to-Speech Synthesis Using Tacotron2 and WaveGlow
DOI:
https://doi.org/10.3126/kjse.v8i1.69276
Keywords:
Fine-tuning, Text-to-Speech, Synthesis, Tacotron2, WaveGlow
Abstract
This paper presents a Nepali Text-to-Speech (TTS) system developed under low-resource conditions by adapting pre-trained English Tacotron2 and WaveGlow models to Nepali. Tacotron2 is used for spectrogram generation and WaveGlow for vocoding; these two components play a pivotal role in determining the quality of a TTS system. The pre-trained Tacotron2 model is fine-tuned on a Nepali text corpus aligned with its corresponding audio dataset so that it generates spectrograms from Nepali text. WaveGlow then converts these spectrograms into audible waveforms. The goal is to produce natural-sounding Nepali speech from limited data. The model has difficulty synthesizing audio for a limited subset of Nepali texts, a limitation attributed to inadequate text cleaning and normalization.
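To illustrate the kind of text cleaning that precedes spectrogram generation, the following is a minimal Python sketch of a Devanagari text front end. It is not the authors' implementation: the symbol inventory, the names `SYMBOLS`, `clean_text`, and `text_to_sequence`, and the choice to drop unsupported characters are all assumptions made for illustration. It shows one plausible way to replace an English character set with a Nepali one when adapting a character-based model such as Tacotron2.

```python
import re
import unicodedata

# Hypothetical symbol inventory (an assumption, not taken from the paper):
# a padding token, a few punctuation marks, and every code point in the
# Devanagari Unicode block (U+0900 to U+097F).
PAD = "_"
PUNCTUATION = " ,.?!-"
DEVANAGARI = [chr(cp) for cp in range(0x0900, 0x0980)]
SYMBOLS = [PAD] + list(PUNCTUATION) + DEVANAGARI
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}

# Characters allowed in cleaned text; the padding token is reserved.
ALLOWED = set(SYMBOLS) - {PAD}


def clean_text(text: str) -> str:
    """Normalize Unicode, blank out unsupported characters, collapse spaces."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch if ch in ALLOWED else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def text_to_sequence(text: str) -> list[int]:
    """Map cleaned text to the integer IDs a character-based model consumes."""
    return [SYMBOL_TO_ID[ch] for ch in clean_text(text)]


if __name__ == "__main__":
    sample = "  नमस्ते   नेपाल  "
    print(clean_text(sample))
    print(text_to_sequence(sample))
```

This sketch deliberately leaves out the harder normalization cases, such as expanding numerals, dates, and abbreviations into spoken words. Devanagari digits, for example, are kept here as raw symbols rather than converted to number words. Gaps of this kind are consistent with the limitation the abstract attributes to text cleaning and normalization.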