Comparative Study among Term Frequency-Inverse Document Frequency and Count Vectorizer towards K Nearest Neighbor and Decision Tree Classifiers for Text Dataset

Authors

  • Tula Kanta Deo, Kalinga University, Raipur(CG), India
  • Rajesh Keshavrao Deshmukh, Kalinga University, Raipur(CG), India
  • Gajendra Sharma, Kathmandu University, Nepal

DOI:

https://doi.org/10.3126/njmr.v7i2.68189

Keywords:

Count Vectorizer, Decision Tree, K Nearest Neighbor, Term Frequency and Inverse Document Frequency

Abstract

Background: Text classification techniques are increasingly important given the exponential growth of textual data on the internet. Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer (CV) are commonly used feature extraction methods. TF-IDF assigns weights to terms based on how often they occur in a document and how rare they are across the corpus, whereas CV simply counts the occurrences of terms. This study evaluates and compares the performance of CV and TF-IDF with K Nearest Neighbor (KNN) and Decision Tree (DT) classifiers on text datasets.
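The difference between the two feature extraction methods can be illustrated with a minimal sketch. The toy corpus, the function names, and the smoothed IDF formula with L2 normalisation are assumptions chosen for illustration; the abstract does not state which TF-IDF variant the study used.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from the study's datasets).
docs = [
    "great dress great fit",
    "poor fit",
    "great fabric",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})


def count_vector(tokens):
    """Raw term counts in vocabulary order (the Count Vectorizer idea)."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]


def idf(term):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.

    This particular smoothing is an assumption, not the paper's stated formula.
    """
    n_docs = len(tokenized)
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_vector(tokens):
    """Term count times IDF, then scaled to unit L2 length."""
    raw = [c * idf(t) for c, t in zip(count_vector(tokens), vocab)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


cv = [count_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

for doc, c, w in zip(docs, cv, tfidf):
    print(doc, c, [round(x, 3) for x in w])
```

In the first toy document, "dress" and "fit" each occur once, so their counts are equal, but "dress" receives the larger TF-IDF weight because it appears in fewer documents.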

Methodology: The investigation begins with preprocessing. Feature vectors are then created using both TF-IDF and CV and passed to the KNN and DT classifiers at the training stage. Experiments are carried out on two public Kaggle datasets: the Ukraine 10K tweets sentiment_analysis dataset and the Womens ecommerce clothing reviews dataset.
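The pipeline outlined above (preprocess, vectorize, classify) might be sketched as follows for the KNN case. The abstract does not give the preprocessing steps, the value of k, or the distance measure, so the lowercase-and-split preprocessing, k = 3, and cosine similarity are all assumptions, and the labelled texts are invented toy data rather than the Kaggle datasets.

```python
import math
from collections import Counter

# Toy labelled texts, invented for illustration only.
train = [
    ("love this dress", "positive"),
    ("great fit and fabric", "positive"),
    ("love the fabric", "positive"),
    ("poor quality", "negative"),
    ("bad fit poor fabric", "negative"),
]


def preprocess(text):
    """Minimal preprocessing (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


vocab = sorted({t for text, _ in train for t in preprocess(text)})


def vectorize(text):
    """Count-based feature vector over the training vocabulary."""
    counts = Counter(preprocess(text))
    return [counts[t] for t in vocab]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# "Training" for KNN amounts to storing the feature vectors with their labels.
train_vectors = [(vectorize(text), label) for text, label in train]


def knn_predict(text, k=3):
    """Majority vote among the k most similar training vectors."""
    query = vectorize(text)
    ranked = sorted(train_vectors, key=lambda p: cosine(query, p[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


print(knn_predict("poor quality fabric"))
print(knn_predict("love this fabric"))
```

Replacing `vectorize` with a TF-IDF weighting leaves the rest of the sketch unchanged, which is what makes a like-for-like comparison of the two feature extraction methods possible.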

Findings: The average precision, recall, F1 score and accuracy of KNN with TF-IDF were 84.5%, 87%, 83% and 87%, respectively, and those of KNN with CV were 83.5%, 87%, 83.5% and 87%, respectively. Similarly, the average precision, recall, F1 score and accuracy of DT with TF-IDF were 89%, 89%, 89% and 89%, respectively, and those of DT with CV were 89%, 89.5%, 89.5% and 89.5%, respectively. The results obtained in this research are consistent with those of previous similar research.

Conclusions: In this study, TF-IDF and CV perform almost identically for a given dataset and a given classifier.

Novelty: The novelty and contribution of this research lie in the experiments performed with these classifiers and feature extraction methods on these datasets.

Author Biographies

Tula Kanta Deo, Kalinga University, Raipur(CG), India

Department of Computer Science and Engineering

Rajesh Keshavrao Deshmukh, Kalinga University, Raipur(CG), India

Department of Computer Science and Engineering

Gajendra Sharma, Kathmandu University, Nepal

Department of Computer Science and Engineering

Published

2024-07-30

How to Cite

Deo, T. K., Deshmukh, R. K., & Sharma, G. (2024). Comparative Study among Term Frequency-Inverse Document Frequency and Count Vectorizer towards K Nearest Neighbor and Decision Tree Classifiers for Text Dataset. Nepal Journal of Multidisciplinary Research, 7(2), 1–11. https://doi.org/10.3126/njmr.v7i2.68189

Section

Articles