Comparing LaBSE with Contrastively and Soft-Label Fine-Tuned mBERT Models for Semantic Search over a Nepali Knowledge Base
DOI:
https://doi.org/10.3126/injet.v3i1.87019

Keywords:
Semantic Search, Sentence Embeddings, Contrastive Learning, Knowledge Distillation, mBERT, LaBSE

Abstract
This paper compares multilingual sentence embedding models for semantic search in the Nepali language across three approaches: LaBSE in a zero-shot setting, mBERT fine-tuned with contrastive learning, and mBERT fine-tuned on soft similarity scores distilled from LaBSE. A custom dataset of approximately 800 labeled sentence pairs was built from the e-commerce and appointment-booking domains; the sentences are questions written in Devanagari Nepali, with some code-mixed English, and each pair is labeled as semantically similar or dissimilar. The contrastive model was trained on the hard binary labels with a margin-based contrastive loss, while the distilled model was trained with a regression loss to match similarity scores computed from LaBSE embeddings. All models were evaluated on a semantic retrieval task in which 89 user queries were embedded and compared against a corpus of 130 candidate sentences using cosine similarity, with retrieval quality measured by Top-1, Top-5, and Top-10 accuracy and Mean Reciprocal Rank (MRR). LaBSE, without any task-specific fine-tuning, performed best, with a Top-1 accuracy of 41.57% and an MRR of 0.5246. The contrastively fine-tuned mBERT model achieved a Top-1 accuracy of 21.35% and an MRR of 0.3204, while the soft-label distilled mBERT model placed in between, with a Top-1 accuracy of 34.83% and an MRR of 0.4488, showing that knowledge distillation can effectively transfer semantic similarity knowledge from LaBSE to mBERT. These findings demonstrate that while zero-shot LaBSE is a strong baseline, multilingual models such as mBERT can be repurposed for Nepali semantic search through targeted fine-tuning. This work establishes a baseline for semantic search in Nepali and suggests practical approaches for improving sentence embeddings in low-resource language settings.
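The evaluation protocol described in the abstract — embedding queries, ranking candidate sentences by cosine similarity, and scoring Top-k accuracy and MRR — can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes query and corpus embeddings have already been computed (by LaBSE, mBERT, or any other encoder) and that each query has exactly one gold corpus sentence, whose index is supplied by the caller.

```python
import numpy as np

def cosine_sim_matrix(queries: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Row-normalise both embedding matrices, then take the dot product,
    giving a (num_queries x num_corpus) cosine-similarity matrix."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return q @ c.T

def retrieval_metrics(sims: np.ndarray, gold: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Top-k accuracy and MRR, given one gold corpus index per query."""
    # Rank corpus sentences for each query, most similar first.
    order = np.argsort(-sims, axis=1)
    # ranks[i] = 1-based position of the gold sentence for query i.
    ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                      for i in range(len(gold))])
    metrics = {f"top{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / ranks))
    return metrics
```

In the paper's setting this would be run with 89 query embeddings against 130 corpus embeddings; the `gold` array of correct corpus indices is a stand-in for whatever relevance labeling the authors used.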
Published
License
Copyright (c) 2025 International Journal on Engineering Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.
This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.