Preprocessing of Nepali News Corpus for Downstream Tasks

Sushil Awale; Suraj Prasai; Birodh Rijal; Santa B. Basnet

doi:10.3126/nl.v35i01.46553

Authors

Sushil Awale, Integrated ICT Private LTD, Kupondole, Lalitpur, Nepal
Suraj Prasai, Integrated ICT Private LTD, Kupondole, Lalitpur, Nepal
Birodh Rijal
Santa B. Basnet

Keywords:

Text processing, conjuncts, language models, glyphs, Nepali corpus

Abstract

Text collected from online resources introduces many errors, which result in incorrect learning outcomes in automatic language learning tasks. In this paper, we discuss a Nepali text preprocessing pipeline for generating a clean corpus. The pipeline is tested using a language model to observe the impact of each step on the learning task. The relevance of this work lies in systematizing the procedure for developing a standard Nepali corpus.
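
The abstract does not list the individual steps of the pipeline, so the following Python sketch is only an illustration of what a cleaning step for scraped Nepali text might look like, not the authors' method. The function name clean_nepali_text and the choice of steps (Unicode normalization, markup stripping, removal of non-Devanagari characters, whitespace collapsing) are assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical cleaning step; the steps chosen here are assumptions,
# not the pipeline described in the paper.
TAG_RE = re.compile(r"<[^>]+>")  # leftover HTML tags from scraping
# Keep the Devanagari block (U+0900-U+097F), the zero-width non-joiner and
# joiner (U+200C, U+200D), which affect how conjuncts render, and whitespace.
NON_DEVANAGARI_RE = re.compile(r"[^\u0900-\u097F\u200C\u200D\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_nepali_text(raw: str) -> str:
    """Return a normalized, Devanagari-only version of scraped text."""
    text = unicodedata.normalize("NFC", raw)   # one canonical form per glyph sequence
    text = TAG_RE.sub(" ", text)               # drop markup remnants
    text = NON_DEVANAGARI_RE.sub(" ", text)    # drop Latin letters, ASCII digits, symbols
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_nepali_text("<p>नेपाल  राम्रो छ।</p> abc 123"))
```

Keeping the zero-width joiner characters is a deliberate choice in this sketch, since removing them can change how conjuncts are displayed; a real pipeline would need to decide this per corpus.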

Published

2022-07-11

How to Cite

Awale, S., Prasai, S., Rijal, B., & Basnet, S. B. (2022). Preprocessing of Nepali News Corpus for Downstream Tasks. Nepalese Linguistics, 35(01), 1-6. https://doi.org/10.3126/nl.v35i01.46553

Issue

Vol. 35 (2022)

Section

Articles
