A two-stage training approach for voice spoofing detection
Journal of the Korean Data & Information Science Society 2023;34:203-14
Published online March 31, 2023.
© 2023 Korean Data and Information Science Society.

Taein Kang1 · Il-Youp Kwak2

1,2 Department of Applied Statistics, Chung-Ang University
This research was supported by the Chung-Ang University Research Scholarship Grants in 2022.
1 Graduate student, Department of Applied Statistics, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea.
2 Assistant professor, Department of Applied Statistics, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea. E-mail:
Received January 12, 2023; Revised February 13, 2023; Accepted March 7, 2023.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This work presents a novel two-stage training method for voice spoofing detection, together with performance experiments. Voice spoofing detection aims to distinguish a genuine voice from a spoofed one, that is, a voice reproduced outside the setting of the original speech. Demand for spoofed-speech detection is rising in areas where speaker identification is critical to security, such as voice assistants. The proposed two-stage training model takes the embedding vectors of several single speech models trained on the logical access (LA) data set of the Automatic Speaker Verification Spoofing (ASVspoof) 2019 challenge, concatenates them into a single embedding feature, and then builds a deep learning network on that concatenated feature to form an ensemble model. For comparison with existing ensemble methods, we examined results for different combinations of single-model embedding vectors and for modifications to the deep learning network. By combining several models, the two-stage training model achieved an equal error rate (EER) of 0.26%. This is a 0.34 percentage point improvement over the 0.60% of the voting ensemble method and a 0.57 percentage point improvement over the 0.83% of the best single model.
Keywords: Deep learning, two-stage training, voice spoofing detection, embedding.
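
The two-stage idea summarized in the abstract (concatenate embeddings from several pretrained single models, then train a second model on the fused feature and report EER) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings are random synthetic data rather than ASVspoof 2019 features, the names `concat_embeddings`, `train_logistic`, and `compute_eer` are hypothetical, and a plain logistic regression stands in for the deep learning network used in the paper. For brevity the EER is computed on the same synthetic data used for fitting.

```python
import numpy as np

def concat_embeddings(embeddings):
    """Stage 1 output -> fused feature: join per-model embeddings along the feature axis."""
    return np.concatenate(embeddings, axis=1)

def train_logistic(X, y, lr=0.1, epochs=200):
    """Stage 2 stand-in: logistic regression fitted by gradient descent.
    (The paper builds a deep network here; this is only a placeholder.)"""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(bona fide)
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def compute_eer(scores, labels):
    """Approximate equal error rate: the operating point where the false
    acceptance rate (spoof accepted) is closest to the false rejection rate
    (bona fide rejected). Higher score = more likely bona fide; label 1 = bona fide."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])
        frr = np.mean(~accept[labels == 1])
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Synthetic stand-ins for embeddings exported by two pretrained single models.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                   # 1 = bona fide, 0 = spoof
emb_a = rng.normal(size=(200, 8)) + y[:, None]     # hypothetical model A, 8-dim
emb_b = rng.normal(size=(200, 4)) + y[:, None]     # hypothetical model B, 4-dim

fused = concat_embeddings([emb_a, emb_b])          # shape (200, 12)
w, b = train_logistic(fused, y)
eer = compute_eer(fused @ w + b, y)
print(f"EER on synthetic data: {100 * eer:.2f}%")
```

In this sketch the second stage sees all single-model embeddings jointly, whereas a voting ensemble would combine only the final per-model scores; that difference is what the comparison in the abstract refers to.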