Automatic Nepali Image Captioning Using CNN-Transformer Model
DOI:
https://doi.org/10.3126/juem.v3i1.84867
Keywords:
Deep Learning, Pre-trained Dataset, Nepali Image Captions, Convolutional Neural Network (CNN), Transformer Model, EfficientNetB0, Feature Extraction, Sequence Generation
Abstract
Image captioning has gained significant attention, with most research directed toward the English language. While some work exists for regional languages such as Hindi and Bengali, Nepali remains largely underrepresented in this domain. Furthermore, publicly accessible Nepali-language datasets for image captioning are extremely limited. This study leverages an existing pre-trained dataset of Nepali image captions and employs deep learning methods to automatically generate image descriptions in Nepali. The architecture integrates a Convolutional Neural Network (CNN) for image understanding with a Transformer model for sequence generation. In our approach, EfficientNetB0, a pre-trained CNN, extracts high-level features from images; these features are then fed into the Transformer, which generates the corresponding Nepali captions. The experimental results demonstrate encouraging performance, suggesting the approach is effective and holds potential for further refinement in future research.
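The CNN-to-Transformer pipeline described above can be sketched in miniature. This is an illustrative, NumPy-only toy, not the paper's implementation: the EfficientNetB0 extractor is replaced by a random feature vector of the same dimensionality (1280), the decoder is reduced to a single cross-attention step with random weights, and the Nepali vocabulary is hypothetical. It only shows the data flow: image features → projection → token-by-token greedy decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

D_FEAT = 1280   # EfficientNetB0's global-pooled feature dimension
D_MODEL = 64    # decoder model dimension (illustrative)
VOCAB = ["<start>", "मान्छे", "हिँड्दै", "छ", "<end>"]  # toy Nepali vocabulary

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 1) "CNN" stage: stand-in for EfficientNetB0 image features.
image_features = rng.standard_normal(D_FEAT)

# 2) Project image features into the decoder's model dimension
#    (the encoder "memory" the decoder attends to).
W_proj = rng.standard_normal((D_FEAT, D_MODEL)) * 0.01
memory = image_features @ W_proj

# Random stand-ins for learned embedding and output matrices.
W_emb = rng.standard_normal((len(VOCAB), D_MODEL)) * 0.1
W_out = rng.standard_normal((D_MODEL, len(VOCAB))) * 0.1

def generate_caption(max_len=5):
    """Greedy decoding: attend to the image memory at each step and
    emit the most likely next token until <end> or the length cap."""
    tokens = [VOCAB.index("<start>")]
    for _ in range(max_len):
        q = W_emb[tokens[-1]]                  # query from the last token
        score = softmax(np.array([q @ memory]))  # single cross-attention score
        context = score[0] * memory            # attended image context
        logits = (q + context) @ W_out
        nxt = int(np.argmax(logits))
        tokens.append(nxt)
        if VOCAB[nxt] == "<end>":
            break
    return [VOCAB[t] for t in tokens[1:]]      # drop <start>

print(generate_caption())
```

In the actual system, `W_proj`, `W_emb`, and `W_out` would be trained end-to-end on the Nepali caption dataset, and the single attention step would be a full multi-head, multi-layer Transformer decoder.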
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for non-commercial purposes only, and only so long as attribution is given to the creator.