AI Content Detection
DOI: https://doi.org/10.3126/kjse.v9i1.78343
Abstract
AI (Artificial Intelligence) content detection is the task of predicting whether a given piece of content was written by a human or by an AI. This project presents a detection tool that combines multiple detection methods to address problems created by AI-generated text, such as fake academic reports and papers, articles, news, misinformation, and propaganda. Three models, LSTM (Long Short-Term Memory), BERT (Bidirectional Encoder Representations from Transformers), and distilBERT (distilled Bidirectional Encoder Representations from Transformers), were fine-tuned on a small labelled dataset of 2,492 rows. After their performance was compared, distilBERT was selected for further refinement. A pre-trained distilBERT model was then fine-tuned on 24,034 rows of collected data to obtain results specific to the intended application. The language models behind AI text generators (e.g. GPT-2) often plagiarize from their training datasets. To increase accuracy, a BERT classifier-based plagiarism detector was therefore integrated into the system to determine the originality of the input text and to predict the likelihood of plagiarism or AI generation. The final model achieved an overall accuracy of 95% on unseen data, detecting 100% of all AI content in the unseen dataset and correctly classifying 90% of AI text in the unseen dataset.
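As a minimal sketch of how figures such as overall accuracy and per-class detection rate (recall) are derived from a classifier's predictions, the following Python example computes both from paired lists of true and predicted labels. The function name `evaluate`, the label strings `"ai"` and `"human"`, and the 20 sample labels are illustrative assumptions; they are not the paper's evaluation code or data.

```python
from collections import Counter


def evaluate(y_true, y_pred, labels=("ai", "human")):
    """Return (overall accuracy, per-class recall) for two label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = Counter()    # number of samples per true class
    correct = Counter()  # number of correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    accuracy = sum(correct.values()) / len(y_true)
    # Recall for a class = correctly predicted samples / all samples of that class.
    recall = {lab: correct[lab] / total[lab] for lab in labels if total[lab]}
    return accuracy, recall


# Hypothetical balanced test set: 10 AI-written and 10 human-written texts.
y_true = ["ai"] * 10 + ["human"] * 10
# Hypothetical predictions: every AI text is flagged, one human text is misclassified.
y_pred = ["ai"] * 10 + ["human"] * 9 + ["ai"]

accuracy, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.2f}", recall)
```

Reporting recall per class alongside accuracy is useful for a detector like this one, because a single accuracy figure does not show whether the errors fall on AI-written or human-written text.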