Comparison study of synthetic data generation methods for credit card transaction data
Journal of the Korean Data & Information Science Society 2023;34:49-72
Published online January 31, 2023;  https://doi.org/10.7465/jkdi.2023.34.1.49
© 2023 Korean Data and Information Science Society.

Hyunwoo Jung1 · Younsang Cho2 · Geonwoo Ko3 · Jae-ik Song4 · Donghyeon Yu5

1,2,3,5 Department of Statistics, Inha University
4 NexGen Innovation, NICE ZiniData Co., Ltd.
This work was supported by the National Information Society Agency.
1 Master course student, Department of Statistics, Inha University, Incheon 22212, Korea.
2 Integrated Ph.D. program student, Department of Statistics, Inha University, Incheon 22212, Korea.
3 Master course student, Department of Statistics, Inha University, Incheon 22212, Korea.
4 Manager, NexGen Innovation, NICE ZiniData Co., Ltd., Seoul 07242, Korea.
5 Associate professor, Department of Statistics, Inha University, Incheon 22212, Korea. E-mail: dyu@inha.ac.kr
Received November 30, 2022; Revised December 10, 2022; Accepted December 10, 2022.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Synthetic data generation is one of the main topics in data privacy and statistical disclosure control. In this paper, we apply popular synthetic data generation methods, such as synthpop, the variational autoencoder (VAE), and generative adversarial network (GAN) models, to credit card transaction data. We measure disclosure risk with the targeted correct attribution probability (TCAP), and we measure data utility with the propensity score mean squared error (pMSE) and the ratio of estimates (ROE). The synthetic data generated by synthpop show both high disclosure risk and high data utility, whereas the data generated by the VAE show the lowest disclosure risk and the lowest data utility. Among the GAN-based models, the conditional tabular GAN (CTGAN) has a relatively lower disclosure risk than synthpop while retaining similar data utility.
Keywords: Credit card transaction, generative adversarial network, synthetic data, synthpop, variational autoencoder.