Comparison of audio input representations on piano transcription using neural networks
Journal of the Korean Data & Information Science Society 2021;32:439-53
Published online March 31, 2021.
© 2021 Korean Data and Information Science Society.

Hyemin Han1 · Yoonsuh Jung2

1,2Department of Statistics, Korea University
Jung's work has been partially supported by National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT), 2019R1F1A1040515 and 2019R1A4A1028134. This paper is based on Hyemin Han's Master's thesis.
1Graduate student, Department of Statistics, Korea University, Seoul 02841, Korea
2Corresponding author: Associate professor, Department of Statistics, Korea University, Seoul 02841, Korea. E-mail:
Received January 21, 2021; Revised February 24, 2021; Accepted February 26, 2021.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
We compare the effect of multiple input representations on polyphonic piano music transcription based on neural networks, using Onsets and Frames, a state-of-the-art piano transcription model. We first provide detailed background on piano transcription and input representations for readers who are unfamiliar with this area. We then compare four spectrograms: Mel-spectrogram, Linear-spectrogram, Log-spectrogram, and constant-Q transform, each with various hyperparameters. The effects of the number of frequency bins, the short-time Fourier transform (STFT) window size, and the hop length on the four spectrograms are also examined. Our results show that a Mel-spectrogram with an STFT window size of 2,048, 512 frequency bins, and a hop length of 256 yields the highest accuracy. In general, the Mel-spectrogram is one of the most satisfactory input representations: it outperforms the other spectrograms and keeps a relatively high transcription accuracy even at the low resolutions in our experiments.
Keywords: Audio input representation, automatic music transcription, neural network, spectrogram.
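
To make the hyperparameters named in the abstract concrete, the following minimal sketch works out what an STFT window size of 2,048, a hop length of 256, and 512 Mel bins imply for time and frequency resolution. It is an illustration only, not the authors' code. The 16 kHz sample rate, the 10-second clip length, the unpadded framing, the HTK-style Mel formula, and all function names are assumptions that do not come from the abstract.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using the HTK-style formula (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def stft_shape(n_samples, window_size, hop_length):
    """Number of frames and linear-frequency bins of an STFT without padding."""
    n_frames = 1 + (n_samples - window_size) // hop_length
    n_linear_bins = window_size // 2 + 1
    return n_frames, n_linear_bins


def mel_center_frequencies(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


SAMPLE_RATE = 16000   # assumption: not stated in the abstract
CLIP_SECONDS = 10     # assumption: arbitrary clip length for illustration
WINDOW_SIZE = 2048    # STFT window size reported as best
HOP_LENGTH = 256      # hop length reported as best
N_MELS = 512          # number of frequency bins reported as best

n_frames, n_linear_bins = stft_shape(SAMPLE_RATE * CLIP_SECONDS, WINDOW_SIZE, HOP_LENGTH)
hop_ms = 1000.0 * HOP_LENGTH / SAMPLE_RATE   # time step between frames
bin_hz = SAMPLE_RATE / WINDOW_SIZE           # spacing of linear STFT bins
centers = mel_center_frequencies(N_MELS, 0.0, SAMPLE_RATE / 2)

print(f"frames: {n_frames}, linear bins: {n_linear_bins}")
print(f"hop: {hop_ms} ms, linear bin spacing: {bin_hz} Hz")
print(f"first/last mel centre: {centers[0]:.2f} Hz / {centers[-1]:.2f} Hz")
```

Under these assumptions, each frame advances by 16 ms and the linear STFT bins are 7.8125 Hz apart, while the Mel bands are spaced more densely at low frequencies than at high ones. A different sample rate or Mel convention would change these numbers.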