Pre-trained models and ensemble technique for speech emotion recognition
Journal of the Korean Data & Information Science Society 2024;35:445-59
Published online July 31, 2024; https://doi.org/10.7465/jkdi.2024.35.4.445
© 2024 Korean Data and Information Science Society.

Jaejin Seo¹ · Taein Kang² · Il-Youp Kwak³

¹,²,³ Department of Statistics and Data Science, Chung-Ang University
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (RS-2023-00208284). This research was supported by the Chung-Ang University Graduate Research Scholarship in 2022.
1 Graduate student, Department of Statistics and Data Science, Chung-Ang University, Seoul 06974, Korea.
2 Graduate student, Department of Statistics and Data Science, Chung-Ang University, Seoul 06974, Korea.
3 Associate professor, Department of Statistics and Data Science, Chung-Ang University, Seoul 06974, Korea. E-mail: ikwak2@cau.ac.kr
Received May 29, 2024; Revised June 14, 2024; Accepted June 17, 2024.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Research on speech emotion recognition plays a crucial role in enhancing human-machine interaction and in improving efficiency and user experience in fields such as healthcare, education, and customer service. In this study, we aimed to develop an AI model that classifies six emotions by participating in DACON’s ‘Monthly DACON Speech Emotion Recognition AI Competition’. We compared the performance of traditional speech processing techniques with that of pre-trained models, and investigated the potential for additional learning on the embedding vectors that pre-trained models produce. A model combining WavLM with a 1D CNN demonstrated superior performance at 79.80%, and ensembling all pre-trained models using hard voting further improved performance to 80.79%, a score equivalent to 5th place in the competition. We expect this research to broaden the applicability of speech emotion recognition technology and to enable the use of emotion recognition models in a variety of real-world applications.
Keywords: Deep learning, ensemble learning, pre-trained models, speech emotion recognition
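
The hard-voting ensemble mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `hard_vote`, the emotion labels, and the tie-breaking rule (the earliest model in the list wins) are all assumptions made for illustration. Each model contributes one predicted label per utterance, and the most frequent label becomes the ensemble prediction.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each sample across several models.

    predictions_per_model: list of equal-length label lists, one per model.
    Tie-breaking (assumption): the label predicted by the earliest model
    in the list among the tied labels is chosen.
    """
    if not predictions_per_model:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")

    ensembled = []
    # zip(*...) groups the votes of all models for one sample at a time.
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top_count = max(counts.values())
        # Scan votes in model order so ties resolve to the earliest model.
        winner = next(v for v in sample_votes if counts[v] == top_count)
        ensembled.append(winner)
    return ensembled


# Hypothetical predictions from three models on three utterances.
model_a = ["angry", "happy", "sad"]
model_b = ["angry", "sad", "sad"]
model_c = ["fear", "happy", "neutral"]

ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)
```

Hard voting uses only the predicted class labels, so it can combine models whose output scores are not directly comparable.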