Comparative Study among Term Frequency-Inverse Document Frequency and Count Vectorizer towards K Nearest Neighbor and Decision Tree Classifiers for Text Dataset
DOI: https://doi.org/10.3126/njmr.v7i2.68189
Keywords: Count Vectorizer, Decision Tree, K Nearest Neighbor, Term Frequency and Inverse Document Frequency
Abstract
Background: Text classification techniques are increasingly important given the exponential growth of textual data on the internet. Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer (CV) are commonly used methods for feature extraction. TF-IDF assigns weights to terms based on their frequency in a document, scaled down for terms that appear in many documents, whereas CV simply counts the occurrences of terms. This study evaluates and compares the performance of CV and TF-IDF with K Nearest Neighbor (KNN) and Decision Tree (DT) classifiers across text datasets.
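To make the difference between the two feature extraction methods concrete, the following is a minimal sketch in plain Python. It assumes a tiny made-up corpus and the plain textbook IDF, log(N / df); the corpus, the function names, and the unsmoothed IDF are illustrative assumptions, not the configuration used in the study.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; not taken from the study's datasets).
docs = [
    "good dress good fit",
    "bad fit",
    "good price",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})


def count_vector(tokens):
    """Count-style features: raw number of occurrences of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]


def idf(term):
    """Textbook IDF, log(N / df). Libraries often apply smoothing instead."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)


def tfidf_vector(tokens):
    """TF-IDF features: term count scaled by IDF, so widespread terms shrink."""
    counts = Counter(tokens)
    return [counts[t] * idf(t) for t in vocab]


count_vectors = [count_vector(doc) for doc in tokenized]
tfidf_vectors = [tfidf_vector(doc) for doc in tokenized]
```

In the first document, "dress" and "fit" each occur once and so get the same count, but "dress" receives the larger TF-IDF weight because it appears in fewer documents of the corpus.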
Methodology: The investigation begins with preprocessing. Feature vectors are then created using both TF-IDF and CV and passed to the KNN and DT classifiers at the training stage. Experiments were carried out on two public Kaggle datasets: the Ukraine 10K tweets sentiment_analysis dataset and the Womens ecommerce clothing reviews dataset.
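The step of passing feature vectors to a classifier might look roughly like the sketch below, which applies a k-nearest-neighbor vote to count vectors. The training texts, the choice of Euclidean distance, k = 3, and all function names are assumptions made for illustration; the abstract does not state the study's distance metric or hyperparameters.

```python
import math
from collections import Counter

# Hypothetical labeled texts; the Kaggle datasets are not reproduced here.
train_texts = [
    "love this dress",
    "great fit love it",
    "poor quality",
    "bad fit poor fabric",
]
train_labels = ["positive", "positive", "negative", "negative"]

vocab = sorted({t for text in train_texts for t in text.split()})


def vectorize(text):
    """Count vector over the training vocabulary; unseen words are ignored."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


train_vectors = [vectorize(t) for t in train_texts]


def knn_predict(text, k=3):
    """Return the majority label among the k closest training vectors."""
    query = vectorize(text)
    nearest = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(query, pair[0]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Swapping `vectorize` for a TF-IDF weighting would leave the classifier code unchanged, which is what allows the two feature extraction methods to be compared under the same classifier.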
Findings: The average precision, recall, F1 score, and accuracy of KNN with TF-IDF were 84.5%, 87%, 83%, and 87%, respectively, and those of KNN with CV were 83.5%, 87%, 83.5%, and 87%, respectively. Similarly, the average precision, recall, F1 score, and accuracy of DT with TF-IDF were 89%, 89%, 89%, and 89%, respectively, and those of DT with CV were 89%, 89.5%, 89.5%, and 89.5%, respectively. These results are consistent with those of previous similar research.
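For readers unfamiliar with the four reported metrics, the sketch below computes them for a single positive class from true and predicted labels using the standard definitions. The label lists and the function name are made up for illustration, and the abstract does not say how the study averaged its scores, so this should not be read as the authors' exact procedure.

```python
def binary_metrics(y_true, y_pred, positive):
    """Precision, recall, F1 for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = correct / len(y_true)
    return precision, recall, f1, accuracy


# Hypothetical labels, not results from the study.
y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "neg"]
precision, recall, f1, accuracy = binary_metrics(y_true, y_pred, positive="pos")
```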
Conclusions: In this study, the performance of TF-IDF is nearly the same as that of CV for a given dataset and a given classifier.
Novelty: The novelty and contribution of this research lie in the experiments performed with these classifiers and feature extraction methods on these datasets.
License
Copyright (c) 2024 The Author(s)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.