Resampling-based methods for the error control of variable selection in high-dimensional genomic data
Journal of the Korean Data & Information Science Society 2024;35:691-702
Published online September 30, 2024;  https://doi.org/10.7465/jkdi.2024.35.5.691
© 2024 Korean Data and Information Science Society.

Dan Huang1 · Rakwon Kim2 · Hokeun Sun3

1,2,3 Department of Statistics, Pusan National University
This work was supported by a 2-Year Research Grant of Pusan National University.
1 Ph.D. Program, Department of Statistics, Pusan National University, Busan 46241, Korea
2 Master's Program, Department of Statistics, Pusan National University, Busan 46241, Korea
3 Professor, Department of Statistics, Pusan National University, Busan 46241, Korea. E-mail: hsun@pusan.ac.kr
Received July 4, 2024; Revised August 10, 2024; Accepted August 13, 2024.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
In the analysis of high-dimensional data, penalized regression models are commonly employed to select relevant variables. The most widely used is the least absolute shrinkage and selection operator (lasso). Recent studies have proposed resampling-based methods to control the false discovery rate (FDR) of variables selected by lasso, including a data splitting method and a Gaussian mirror method. The former randomly splits the samples into two sets to estimate two independent coefficients for the same variable, while the latter randomly generates Gaussian errors to construct a pair of variables and estimate two different coefficients. Mirror statistics based on the coefficients estimated by each method are then used for error control. In this article, we propose a new approach to control the FDR that combines a selection probability with a mirror statistic motivated by the two resampling-based methods. In our simulation study, we demonstrate that the proposed approach controls the FDR at a designated level better than other resampling-based methods while maintaining selection power. We also identify potentially cancer-related genes in an analysis of microarray gene expression data from a breast cancer study.
Keywords: Data split, Gaussian mirrors, high-dimensional data, lasso, selection probability
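
The abstract describes mirror statistics built from two coefficient estimates of the same variable. As a rough illustration only, the sketch below assumes the form commonly used in the data splitting literature, M_j = sign(b1_j * b2_j) * (|b1_j| + |b2_j|), together with a data-driven cutoff that bounds the estimated false discovery proportion by a target level q. The function names, the choice of |b1| + |b2|, and the non-strict inequalities at the candidate cutoffs are assumptions made for this sketch. It is not the authors' proposed procedure, which additionally uses a selection probability.

```python
import math


def mirror_statistics(beta1, beta2):
    """Assumed form: M_j = sign(b1_j * b2_j) * (|b1_j| + |b2_j|).

    beta1 and beta2 are two coefficient estimates for the same variables,
    e.g. lasso fits on two halves of a randomly split sample.
    """
    stats = []
    for b1, b2 in zip(beta1, beta2):
        prod = b1 * b2
        sign = (prod > 0) - (prod < 0)  # -1, 0, or +1
        stats.append(sign * (abs(b1) + abs(b2)))
    return stats


def fdr_threshold(stats, q):
    """Smallest candidate t > 0 with #{M_j <= -t} / max(#{M_j >= t}, 1) <= q.

    The count of large negative statistics serves as an estimate of the
    number of false positives among the large positive ones.
    Returns math.inf when no candidate cutoff satisfies the bound.
    """
    candidates = sorted({abs(m) for m in stats if m != 0})
    for t in candidates:
        n_negative = sum(1 for m in stats if m <= -t)
        n_selected = sum(1 for m in stats if m >= t)
        if n_negative / max(n_selected, 1) <= q:
            return t
    return math.inf


def select_variables(beta1, beta2, q=0.1):
    """Indices of variables whose mirror statistic reaches the cutoff."""
    stats = mirror_statistics(beta1, beta2)
    t = fdr_threshold(stats, q)
    return [j for j, m in enumerate(stats) if m >= t]
```

A variable whose two estimates agree in sign and are both large gets a large positive statistic, whereas a null variable is expected to produce a statistic roughly symmetric around zero, which is why the negative tail can stand in for the false positives.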