FDR-based categorical variable selection in naïve Bayes classification
Journal of the Korean Data & Information Science Society 2021;32:1329-41
Published online November 30, 2021;  https://doi.org/10.7465/jkdi.2021.32.6.1329
© 2021 Korean Data and Information Science Society.

Jieun Shin1 · Changyi Park2

1,2 Department of Statistics, University of Seoul
Correspondence to: 1 Graduate student, Department of Statistics, University of Seoul, Seoul 02504, Korea.
2 Corresponding author: Professor, Department of Statistics, University of Seoul, Seoul 02504, Korea. E-mail: park463@uos.ac.kr
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1F1A01048268).
Received August 12, 2021; Revised October 25, 2021; Accepted October 28, 2021.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Naïve Bayes classification is based on the naïve Bayes assumption that the explanatory variables are conditionally independent given the response variable. Although this assumption is rather strong, the naïve Bayes classifier shows reasonable performance and has computational advantages on high-dimensional data. Since high-dimensional data sets usually contain many noisy variables, variable selection can improve both the prediction accuracy and the interpretability of the classifier. In this paper, we propose a categorical variable selection method based on FDR control in naïve Bayes classification. Through simulations and real data analysis, the proposed method is compared with another variable selection method based on change point analysis and is shown to be more effective, particularly for sparse or high-dimensional data.
Keywords: Chi-square statistic, high-dimensional data, naïve Bayes assumption.
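
The FDR-based selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which FDR-controlling procedure is used, so the Benjamini-Hochberg step-up procedure is assumed here purely for illustration. The p-values are hypothetical; in practice they might come from a chi-square test of independence between each categorical variable and the class label. The function name `benjamini_hochberg` and the level `q` are likewise illustrative and not taken from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices of variables selected by the
    Benjamini-Hochberg step-up procedure at FDR level q.

    Assumption: p_values[i] is the p-value of a test of association
    between variable i and the class label (e.g. a chi-square test).
    """
    m = len(p_values)
    # Indices of the variables, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    # Select every variable whose p-value rank is at most k.
    return sorted(order[:k])


# Hypothetical p-values for six categorical variables.
p_values = [0.001, 0.20, 0.012, 0.74, 0.03, 0.5]
selected = benjamini_hochberg(p_values, q=0.05)
print(selected)  # indices of the variables kept for the classifier
```

Only the selected variables would then be passed to the naïve Bayes classifier. Because the procedure is step-up, a variable can be selected even if its own p-value exceeds the threshold at a lower rank, as long as some larger-ranked p-value meets its threshold.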