Variable selection based on semi-parametric estimator of conditional mutual information assuming normal mixture in high-dimensional data
Journal of the Korean Data & Information Science Society 2018;29:1339-51
Published online November 30, 2018
© 2018 Korean Data and Information Science Society.

Chikyung Ahn¹ · Donguk Kim²

¹,²Department of Statistics, Sungkyunkwan University
Correspondence to: Professor, Department of Statistics, Sungkyunkwan University, 25-2 Sungkyunkwan-ro, Jongno-gu, Seoul 03063, Korea. E-mail: dkim@skku.edu
Received October 12, 2018; Revised November 22, 2018; Accepted November 22, 2018.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
We propose a semi-parametric method for estimating conditional mutual information that assumes a mixed normal distribution for the explanatory variables. To preserve the main advantage of mutual information, namely that it captures nonparametric relationships between an explanatory variable and the response variable, the mutual information between each explanatory variable and the response variable is estimated nonparametrically. To improve the efficiency of estimation, the mutual information between explanatory variables is estimated parametrically. Because the estimated density function is used as a weight in the conditional mutual information estimate, outliers with relatively small density estimates have little effect on the semi-parametric estimator. Experimental results show that the semi-parametric estimator of conditional mutual information under the mixed normal assumption achieves excellent performance in selecting significant variables.
Keywords: Advanced selection methods, classification analysis, conditional mutual information, Edgeworth approximation, entropy, high-dimensional data, mixed normal distribution, variable selection, support vector machines.
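
The split described in the abstract can be illustrated with the standard identity I(Xj; Y | Xs) = I(Xj; Y) − I(Xj; Xs) + I(Xj; Xs | Y), where Xj is a candidate variable, Xs an already selected variable, and Y the class label. The Python sketch below is only a simplified illustration and is not the authors' estimator: it assumes a histogram plug-in estimate for the nonparametric term, and it replaces the normal mixture with a single bivariate normal (closed form −½ log(1 − ρ²)) for the parametric terms, without the density weighting mentioned in the abstract. All function names and the simulated data are hypothetical.

```python
import numpy as np


def mi_histogram(x, y, bins=10):
    """Nonparametric plug-in estimate of I(X; Y) for continuous x and class labels y."""
    edges = np.histogram_bin_edges(x, bins=bins)
    x_bin = np.digitize(x, edges[1:-1])            # bin index in 0..bins-1
    classes, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((bins, classes.size))
    np.add.at(joint, (x_bin, y_idx), 1.0)          # contingency table of counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))


def mi_gaussian(a, b):
    """Parametric I(A; B) under a bivariate normal assumption (a simplification)."""
    rho = np.corrcoef(a, b)[0, 1]
    return float(-0.5 * np.log1p(-rho ** 2))


def cmi_semiparametric(xj, xs, y, bins=10):
    """I(xj; y | xs) = I(xj; y) - I(xj; xs) + I(xj; xs | y)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = counts / counts.sum()
    # Class-conditional term: average of within-class Gaussian MI, weighted by class frequency.
    within = sum(w * mi_gaussian(xj[y == c], xs[y == c])
                 for c, w in zip(classes, weights))
    return mi_histogram(xj, y, bins) - mi_gaussian(xj, xs) + within


# Simulated example (assumed data, for illustration only).
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
x_selected = 2.0 * y + rng.normal(size=n)      # variable already in the model
x_informative = 2.0 * y + rng.normal(size=n)   # candidate related to y
x_noise = rng.normal(size=n)                   # candidate unrelated to y

cmi_inf = cmi_semiparametric(x_informative, x_selected, y)
cmi_noise = cmi_semiparametric(x_noise, x_selected, y)
print(f"informative candidate: {cmi_inf:.3f}, noise candidate: {cmi_noise:.3f}")
```

In a forward selection procedure, a score of this kind would be computed for every remaining candidate and the variable with the largest value added to the model; how the paper aggregates over several selected variables is not stated in the abstract.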