Clustering high-cardinality categorical data using category embedding methods
Journal of the Korean Data & Information Science Society 2020;31:209-20
Published online January 31, 2020; https://doi.org/10.7465/jkdi.2020.31.1.209
© 2020 Korean Data and Information Science Society.

Hyun Cho¹ · Yeojin Chung²

¹ ²Department of Data Science, Kookmin University
Correspondence to: Associate professor, Department of Data Science, Kookmin University, Seoul 02707, Korea. E-mail: ychung@kookmin.ac.kr
This research was supported by Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning (NRF-2016R1C1B1010940) and R&D Program for Forest Science Technology (2019150B10-1923-0301) funded by Korea Forest Service.
Received December 18, 2019; Revised January 10, 2020; Accepted January 15, 2020.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Compared with clustering algorithms for numerical data, clustering algorithms for categorical data have not been extensively studied, particularly for data with high-cardinality attributes. When categorical attributes have a large number of levels, clustering algorithms tend to suffer from the curse of dimensionality. In this study, we verified that good clustering performance can be achieved in the presence of categorical attributes by combining clustering algorithms typically applied to numerical data with word embedding methods. Using word embedding methods that were originally developed for natural language processing, the levels of categorical attributes can be represented in a vector space, where the resulting embedding vectors reflect the relationships between frequently appearing categories. We used Word2vec, GloVe, and fastText for category embedding, and applied K-means and the Gaussian mixture model to cluster the embedded data. The clustering performance of the proposed methods was compared with that of typical clustering algorithms for categorical data, namely K-modes and robust clustering using links (ROCK). In a simulation study and in experiments on real-life examples, the Gaussian mixture model with GloVe performed best, especially as the number of observations and the complexity of the data increased.
Keywords: categorical variable, clustering, word embedding.
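
The pipeline described in the abstract (embed category levels, then cluster the embedded observations with a numerical algorithm) can be illustrated with a minimal sketch. This is not the authors' implementation. It swaps in simplified stand-ins: normalized co-occurrence count vectors instead of trained Word2vec, GloVe, or fastText embeddings, and a plain K-means with a deterministic initialization instead of a Gaussian mixture model. It also performs no dimensionality reduction. The toy data, the function names, and the choice to represent an observation as the average of its level vectors are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cooccurrence_embeddings(rows):
    """Embed each (attribute index, level) token as its L2-normalized
    co-occurrence profile. A count-based stand-in for trained embeddings."""
    tokens = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    index = {t: i for i, t in enumerate(tokens)}
    counts = {t: [0.0] * len(tokens) for t in tokens}
    for row in rows:
        # Each observation plays the role of a "sentence" of category tokens.
        toks = list(enumerate(row))
        for a, b in combinations(toks, 2):
            counts[a][index[b]] += 1.0
            counts[b][index[a]] += 1.0
    emb = {}
    for t, vec in counts.items():
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        emb[t] = [x / norm for x in vec]
    return emb


def embed_rows(rows, emb):
    """Represent each observation as the average of its level vectors
    (an assumption of this sketch)."""
    out = []
    for row in rows:
        vecs = [emb[(j, v)] for j, v in enumerate(row)]
        out.append([sum(col) / len(vecs) for col in zip(*vecs)])
    return out


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations with farthest-point initialization,
    starting from the first point so the result is deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: the first attribute has a distinct level in every row.
rows = [
    ("seoul", "kr", "won"),
    ("busan", "kr", "won"),
    ("incheon", "kr", "won"),
    ("paris", "fr", "euro"),
    ("lyon", "fr", "euro"),
    ("nice", "fr", "euro"),
]

emb = cooccurrence_embeddings(rows)
labels = kmeans(embed_rows(rows, emb), k=2)
print(labels)
```

In this toy example every city appears only once, yet cities that share the same country and currency receive identical vectors, so the observations separate into two groups. A one-hot encoding would instead give each city its own orthogonal dimension, which is the situation the abstract associates with the curse of dimensionality.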