Automatic Image Captioning for Nepalese Socio-Cultural and Traditional Images Using CNN and RNN
DOI:
https://doi.org/10.3126/injet.v3i1.87026

Keywords:
Image Captioning, Nepalese Culture, CNN, LSTM, BLEU Score

Abstract
Image captioning is the process of generating a textual description of an image. Automatically describing the content of images in natural language is a challenging task. Despite significant progress with deep learning architectures, most existing datasets and models are biased toward Western cultural contexts, which limits their general applicability. This paper presents an approach to automatic image captioning focused on Nepalese socio-cultural and traditional contexts, using a Convolutional Neural Network (CNN) as the encoder and a Long Short-Term Memory (LSTM) network as the decoder. A custom dataset of 412 images with 1,236 corresponding captions was developed, capturing local customs, festivals, and daily life. The model was evaluated on both the custom dataset and the standard Flickr8k dataset using Bilingual Evaluation Understudy (BLEU) scores, BLEU-1 through BLEU-4. On Flickr8k, the model obtained an accuracy of 90.421%, a loss of 3.4614%, a BLEU-1 score of 0.580268, and a BLEU-4 score of 0.300523. The same model fitted to the custom dataset achieved an accuracy of 90.3519% with a loss of 3.5082%, a BLEU-1 score of 0.569302, and a BLEU-4 score of 0.300328. These competitive results highlight the value of culturally specific datasets.
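The BLEU-1 to BLEU-4 scores reported above measure how closely a generated caption's n-grams match one or more reference captions. As a minimal sketch of the metric (not the paper's evaluation code, which is not published here), the following self-contained Python function computes sentence-level BLEU with clipped n-gram precision and a brevity penalty; the example tokens are hypothetical:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and a brevity penalty.

    `candidate` and each entry of `references` are lists of tokens.
    BLEU-1 corresponds to max_n=1, BLEU-4 to max_n=4.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0  # candidate shorter than n tokens
        # Clip each candidate n-gram count by its maximum count
        # across all reference captions.
        max_ref = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # no overlap at this order
        precisions.append(clipped / sum(cand_counts.values()))
    # Brevity penalty against the closest reference length.
    c_len = len(candidate)
    r_len = min((len(r) for r in references),
                key=lambda r: (abs(r - c_len), r))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    # Geometric mean of the n-gram precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Hypothetical example: one reference, one candidate caption.
reference = ["a", "man", "in", "traditional", "dress"]
candidate = ["a", "woman", "in", "traditional", "dress"]
print(bleu(candidate, [reference], max_n=1))  # 4 of 5 unigrams match: 0.8
```

Higher-order scores (BLEU-4) drop faster than BLEU-1 because every n-gram order must overlap, which matches the gap between the BLEU-1 (~0.57–0.58) and BLEU-4 (~0.30) results reported in the abstract.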
License
Copyright (c) 2025 International Journal on Engineering Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.
This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.