A study on the data fusion using the national health screening data
Journal of the Korean Data & Information Science Society 2021;32:695-703
Published online May 31, 2021;  https://doi.org/10.7465/jkdi.2021.32.3.695
© 2021 Korean Data and Information Science Society.

Sejin Bae1 · Dal Ho Kim2

1,2 Department of Statistics, Kyungpook National University
Correspondence to: 1 Ph.D. candidate, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.
2 Corresponding author: Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: dalkim@knu.ac.kr
Received April 30, 2021; Revised May 21, 2021; Accepted May 21, 2021.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Exactly combining data collected from different units is difficult. Data fusion is a statistical process that combines such data through their common variables to obtain a single integrated dataset. We consider three statistical techniques for data fusion: conditional mean matching using linear regression models, gamma regression using nonlinear regression models on two independent datasets, and a nonparametric distance hot deck approach based on distances computed from each variable. National Health Screening Data from the National Health Insurance Corporation are used to compare the performance of the three methods.
Keywords: Data fusion, distance hot deck, gamma regression, linear regression model, statistical matching.
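
To make two of the techniques named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a donor file that observes both the common variables and a target variable, and a recipient file that observes only the common variables. The function names, the standardized Euclidean distance, and the toy values (age, body mass index, systolic blood pressure) are all illustrative assumptions; gamma regression is not shown.

```python
import numpy as np

def distance_hot_deck(recipient_x, donor_x, donor_y):
    """Impute y for each recipient from its nearest donor on the common variables."""
    recipient_x = np.asarray(recipient_x, dtype=float)
    donor_x = np.asarray(donor_x, dtype=float)
    donor_y = np.asarray(donor_y)
    # Standardize with donor statistics so no single variable dominates the distance
    # (an assumption of this sketch; other distance functions are possible).
    mean = donor_x.mean(axis=0)
    std = donor_x.std(axis=0)
    std[std == 0] = 1.0
    r = (recipient_x - mean) / std
    d = (donor_x - mean) / std
    # Pairwise squared Euclidean distances, shape (n_recipients, n_donors).
    dist = ((r[:, None, :] - d[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)
    return donor_y[nearest], nearest

def conditional_mean_matching(recipient_x, donor_x, donor_y):
    """Impute y with the fitted conditional mean of a linear regression on the donor file."""
    donor_x = np.asarray(donor_x, dtype=float)
    recipient_x = np.asarray(recipient_x, dtype=float)
    donor_design = np.column_stack([np.ones(len(donor_x)), donor_x])
    recipient_design = np.column_stack([np.ones(len(recipient_x)), recipient_x])
    beta, *_ = np.linalg.lstsq(donor_design, np.asarray(donor_y, dtype=float), rcond=None)
    return recipient_design @ beta

# Made-up toy values for illustration only: columns are (age, BMI); y is systolic BP.
donor_x = [[30, 22.0], [45, 27.5], [60, 24.0]]
donor_y = [110, 130, 125]
recipient_x = [[32, 21.5], [58, 24.5]]

hot_deck_y, donor_index = distance_hot_deck(recipient_x, donor_x, donor_y)
regression_y = conditional_mean_matching(recipient_x, donor_x, donor_y)
```

The hot deck imputes an actually observed donor value, so imputed values always lie in the observed range, whereas conditional mean matching imputes a model prediction that may fall between or outside observed values.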