Comparison of machine learning methods for zero-inflated healthcare utilization data
Journal of the Korean Data & Information Science Society 2024;35:803-14
Published online November 30, 2024; https://doi.org/10.7465/jkdi.2024.35.6.803
© 2024 Korean Data and Information Science Society.

Jung-hyo Kim1 · Sejung Kim2 · Ae Jeong Jo3 · Eun Jin Jang4

1,2,3,4 Department of Data Science, Andong National University
This work was supported by a Research Grant of Andong National University.
1 Graduate student, Department of Data Science, Andong National University, Andong 36729, Korea.
2 Graduate student, Department of Data Science, Andong National University, Andong 36729, Korea.
3 Assistant professor, Department of Data Science, Andong National University, Andong 36729, Korea.
4 Corresponding author: Professor, Department of Data Science, Andong National University, Andong 36729, Korea. E-mail: ejjang@anu.ac.kr
Received October 19, 2024; Revised November 1, 2024; Accepted November 4, 2024.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
In healthcare, utilization data such as patients' numbers of outpatient visits and hospitalizations are often counts with excess zeros, a long right tail, and overdispersion. Zero-inflated count data of this kind can be modeled with statistical models such as zero-inflated Poisson (ZIP) regression and zero-inflated negative binomial (ZINB) regression. In this study, we analyzed healthcare utilization data that were overdispersed due to outliers and zero inflation, and compared the predictive performance of five models: ZIP, ZINB, random forest (RF), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost). When the proportion of zeros was high and the dispersion was large, as in the analysis of non-physician visit counts, the machine learning methods (RF, GBM, and XGBoost) outperformed the zero-inflated count regression models in predictive performance. Additionally, we applied Shapley additive explanations (SHAP), an explainable artificial intelligence (XAI) technique, to identify the covariates that most influenced the prediction of physician visit counts.
Keywords: count regression, explainable artificial intelligence, healthcare utilization, machine learning, zero-inflated data
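
As a rough illustration of the zero inflation and overdispersion described in the abstract, the sketch below evaluates the standard zero-inflated Poisson (ZIP) probability mass function without covariates. It is not the authors' implementation: the study fits regression models, whereas this is an intercept-only example, and the function names `zip_pmf` and `zip_mean_var` and the values `lam = 3.0` and `pi = 0.4` are hypothetical choices made for illustration.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson.

    lam: rate of the Poisson component (assumed > 0).
    pi:  probability of a structural zero (assumed in [0, 1)).
    """
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    # Structural zeros add extra mass at k = 0 only.
    extra_zero = pi if k == 0 else 0.0
    return extra_zero + (1.0 - pi) * poisson


def zip_mean_var(lam, pi):
    """Closed-form mean and variance of the ZIP distribution."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var


# Hypothetical parameter values, chosen only for illustration.
lam, pi = 3.0, 0.4

p_zero_poisson = math.exp(-lam)    # zero probability under a plain Poisson
p_zero_zip = zip_pmf(0, lam, pi)   # zero probability under ZIP
mean, var = zip_mean_var(lam, pi)
total_mass = sum(zip_pmf(k, lam, pi) for k in range(60))

print(f"P(Y=0): Poisson={p_zero_poisson:.4f}, ZIP={p_zero_zip:.4f}")
print(f"ZIP mean={mean:.2f}, variance={var:.2f}")
```

With any `pi > 0`, the ZIP variance exceeds its mean, which is the overdispersion a plain Poisson model cannot represent. ZINB additionally replaces the Poisson component with a negative binomial to absorb further dispersion.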