Development of a depression risk prediction model in Korea using machine learning-based feature selection
Journal of the Korean Data & Information Science Society 2025;36:163-77
Published online January 31, 2025; https://doi.org/10.7465/jkdi.2025.36.1.163
© 2025 Korean Data and Information Science Society.

Jun-Tae Han¹ · Il-Su Park²

¹Korea Student Aid Foundation
²Department of Healthcare Management, Dong-eui University
Correspondence to: 1 Team Manager, Merit-Based Scholarship Team, Korea Student Aid Foundation, Daegu 41200, Korea.
2 Corresponding author: Professor, Department of Healthcare Management, Dong-eui University, Busan 47340, Korea. E-mail: ispark@deu.ac.kr
Received December 18, 2024; Revised January 6, 2025; Accepted January 14, 2025.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
This study develops a prediction model for assessing depression risk. Depression-related risk factors were identified from prior studies, and the most influential factors were selected using a feature selection method based on four machine learning techniques: random forest, XGBoost, AdaBoost, and gradient boosting. Among the four algorithms, random forest achieved the highest area under the receiver operating characteristic (ROC) curve (0.8407) in classifying depression. The variables were derived from the 2022 Korea Community Health Survey (KCHS) conducted by the Korea Disease Control and Prevention Agency (KDCA) and used as input variables, with depression status as the target variable. A weighted logistic regression model was employed for prediction. Based on the feature importance rankings from the four machine learning techniques, the combined key risk factors were economic activity, monthly household income, marital status, walking habits, subjective stress perception, subjective health status, number of cultural infrastructures, unemployment rate, suicide rate, and number of doctors. In the prediction model, subjective stress perception (odds ratio [OR]: 9.65) was the most significant risk factor, followed by subjective health status (OR: 3.38). Using machine learning techniques for variable selection, followed by a logistic regression model, addresses the challenge of interpretability in prediction models. This approach shows potential for future healthcare-related disease risk prediction models.
Keywords: Depression risk factors, feature importance, machine learning, weighted logistic regression model
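
To illustrate how an odds ratio such as those quoted in the abstract can arise from a weighted logistic regression, the following is a minimal sketch, not the authors' implementation. It assumes that the weights enter the log-likelihood as per-observation multipliers and that the reported OR is the exponentiated coefficient. The function name `fit_weighted_logit`, the single binary predictor, and the toy rows are all hypothetical and are not KCHS data.

```python
import math


def fit_weighted_logit(rows, iterations=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson on a weighted log-likelihood.

    rows: list of (x, y, w) tuples with binary x, binary y, and a positive weight w.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the weighted log-likelihood
        h00 = h01 = h11 = 0.0  # negative Hessian (weighted information matrix)
        for x, y, w in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += w * (y - p)
            g1 += w * x * (y - p)
            v = w * p * (1.0 - p)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        # Solve the 2x2 Newton system H * delta = g in closed form.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical toy data: x = 1 could stand for "high perceived stress",
# y = 1 for "depression", and w for a survey weight. Values are made up.
ROWS = [
    (1, 1, 2.0), (1, 1, 4.0),  # exposed cases,      total weight 6
    (1, 0, 2.0),               # exposed non-cases,  total weight 2
    (0, 1, 2.0),               # unexposed cases,    total weight 2
    (0, 0, 1.5), (0, 0, 4.5),  # unexposed non-cases, total weight 6
]

b0, b1 = fit_weighted_logit(ROWS)
odds_ratio = math.exp(b1)  # OR for x, under the assumption OR = exp(coefficient)
print(round(odds_ratio, 4))
```

With one binary predictor the fitted OR matches the weighted cross-product ratio (6 × 6) / (2 × 2) = 9, which gives a quick check on the fit. Under the same reading, the reported OR of 9.65 for subjective stress perception would correspond to a coefficient of about ln(9.65) ≈ 2.27.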