Comparison of machine learning algorithms in statistically imputed water potability dataset


  • Diwash Poudel IOE, Thapathali Campus
  • Dhadkan Shrestha IOE, Thapathali Campus
  • Sulove Bhattarai IOE, Thapathali Campus
  • Abhishek Ghimire IOE, Thapathali Campus



ANN, K Nearest Neighbor, LR, missing values, RF


Lack of safe drinking water is a growing concern. Because missing data is common in most available datasets, this study aims to find the algorithm that performs best on a statistically imputed dataset and gives the best prediction of whether water is potable. Water potability is predicted from a dataset of nine features using four algorithms. Some values of three features, specifically pH, chloramine, and trihalomethane, are missing in the dataset. Each missing value is filled in with the median of its feature. The performance of four machine learning algorithms, LR, K-NN, RF, and ANN, is compared under these conditions. In our experiments, RF with 700 decision trees at a maximum depth of 30 is the best-performing algorithm for the statistically imputed water potability dataset. The study identifies the best-performing algorithm among those compared, but further work is needed to optimize it in order to provide the best prediction.
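
The median imputation step described above can be illustrated with a minimal sketch. This is not the authors' code: the helper name `impute_median`, the column names, and the toy values are all assumptions chosen for illustration, and missing entries are assumed to be represented as `None`.

```python
from statistics import median


def impute_median(rows, columns):
    """Return a copy of rows where None in the given columns is replaced
    by the median of the observed values in that column."""
    medians = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        medians[col] = median(observed)
    return [
        {
            key: (medians[key] if key in medians and value is None else value)
            for key, value in row.items()
        }
        for row in rows
    ]


# Hypothetical toy sample; the values are illustrative and are not taken
# from the water potability dataset used in the study.
samples = [
    {"ph": 7.0, "chloramines": 6.5, "trihalomethanes": 80.0},
    {"ph": None, "chloramines": 7.5, "trihalomethanes": None},
    {"ph": 8.0, "chloramines": None, "trihalomethanes": 60.0},
    {"ph": 6.0, "chloramines": 8.5, "trihalomethanes": 70.0},
]

imputed = impute_median(samples, ["ph", "chloramines", "trihalomethanes"])
for row in imputed:
    print(row)
```

The median is less sensitive to extreme measurements than the mean, which is one common reason to prefer it when filling gaps in skewed features.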





2023-02-04 — Updated on 2023-03-10

How to Cite

Poudel, D., Shrestha, D., Bhattarai, S., & Ghimire, A. (2023). Comparison of machine learning algorithms in statistically imputed water potability dataset. Journal of Innovations in Engineering Education, 5(1), 38–46.