Nepali Text-to-Speech Synthesis Using Tacotron2 and WaveGlow

Authors

  • Ashma Rai
  • Shikshya Shiwakoti
  • Swostika Basukala
  • Suramya Sharma Dahal

DOI:

https://doi.org/10.3126/kjse.v8i1.69276

Keywords:

Fine-tuning, Text-to-Speech, Synthesis, Tacotron2, WaveGlow

Abstract

This paper presents a Nepali Text-to-Speech (TTS) system built under low-resource conditions by adapting pre-trained English Tacotron2 and WaveGlow models. Tacotron2 is used for spectrogram generation and WaveGlow for vocoding, since these two components play a pivotal role in determining the efficacy of a TTS system. The pre-trained Tacotron2 model is fine-tuned on a Nepali text corpus aligned with its corresponding audio, using limited data, so that it generates spectrograms for Nepali input. WaveGlow then converts these spectrograms into audible waveforms, yielding a Nepali TTS system capable of producing natural-sounding output. The model has difficulty synthesizing audio for a restricted subset of Nepali texts, a limitation we attribute to inadequate text cleaning and normalization.
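
The limitation noted in the abstract concerns text cleaning and normalization. As a rough illustration only, and not the authors' actual preprocessing, the following Python sketch shows what a minimal cleaning step for Devanagari input might look like before text is passed to a spectrogram model. The function names `clean_nepali_text` and `unsupported_chars`, and the choice of allowed symbols (the Unicode Devanagari block plus a space and a few punctuation marks), are assumptions made for this example.

```python
import re
import unicodedata

# Assumed symbol inventory (hypothetical, not taken from the paper):
# the Unicode Devanagari block U+0900-U+097F, which includes the danda
# and Devanagari digits, plus a space and basic punctuation.
_DISALLOWED = re.compile(r"[^\u0900-\u097F ,.?!]")
_WHITESPACE = re.compile(r"\s+")


def clean_nepali_text(text: str) -> str:
    """Normalize Unicode, collapse whitespace, and drop unsupported characters."""
    text = unicodedata.normalize("NFC", text)
    text = _WHITESPACE.sub(" ", text)   # tabs and newlines become single spaces
    text = _DISALLOWED.sub("", text)    # remove anything outside the inventory
    return _WHITESPACE.sub(" ", text).strip()


def unsupported_chars(text: str) -> list[str]:
    """List the distinct characters that clean_nepali_text would discard."""
    text = unicodedata.normalize("NFC", text)
    text = _WHITESPACE.sub(" ", text)
    return sorted(set(_DISALLOWED.findall(text)))


if __name__ == "__main__":
    sample = "म\tघर\nजान्छु। OK"
    print(clean_nepali_text(sample))
    print(unsupported_chars(sample))
```

A check such as `unsupported_chars` can flag inputs that contain Latin letters, ASCII digits, or other symbols outside the assumed inventory before synthesis is attempted. Silently dropping such characters, as this sketch does, loses information; a fuller normalizer would expand numbers and abbreviations into words instead.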

Author Biographies

Ashma Rai

Dept of Electronics and Computer Engineering, Thapathali Campus, IOE, TU

Shikshya Shiwakoti

Dept of Electronics and Computer Engineering, Thapathali Campus, IOE, TU

Swostika Basukala

Dept of Electronics and Computer Engineering, Thapathali Campus, IOE, TU

Suramya Sharma Dahal

Associate Professor, Dept of Electronics, Communication & Information Engineering, Kathmandu Engineering College

Published

2024-09-02

How to Cite

Rai, A., Shiwakoti, S., Basukala, S., & Sharma Dahal, S. (2024). Nepali Text-to-Speech Synthesis Using Tacotron2 and WaveGlow. KEC Journal of Science and Engineering, 8(1), 103–109. https://doi.org/10.3126/kjse.v8i1.69276

Section

Articles