Sound Processing

Environmental Sound Classification

Title


CNN-based Learnable Gammatone Filterbank and Equal-loudness Normalization for Environmental Sound Classification

Abstract

For environmental sound classification (ESC), this paper presents a learnable auditory filterbank based on a one-dimensional (1D) convolutional neural network with a strong psychophysiological inductive bias in the form of a gammatone filterbank and an equal-loudness prompting normalization. A number of earlier ESC methods learn auditory features by applying plain 1D convolutions to raw input waveforms, aiming to outperform traditional handcrafted features such as a mel-frequency filterbank. However, the large number of parameters in these convolutions suggests that such methods will not generalize better than a model defined by fewer parameters, which is the kind of model considered in this paper. Specifically, a learnable gammatone filterbank layer, whose 1D kernels are represented by a parametric form of bandpass gammatone filters, is proposed for acquiring a time-frequency representation of the raw waveform. A normalization is also proposed whose learnable parameters control the trade-off between energy equalization and structure preservation in the spectrotemporal domain. To verify the effectiveness of the proposed network and normalization, ESC experiments were conducted on the ESC-50 and UrbanSound8K datasets. The proposed network outperformed the other state-of-the-art networks it was compared with on both datasets. In addition, an ensemble architecture achieved further performance improvement.
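To make the idea of a parametric gammatone kernel concrete, the following is a minimal sketch and not the paper's implementation. It samples the textbook gammatone impulse response g(t) = t^(n-1) exp(-2*pi*b*t) cos(2*pi*f_c*t + phi) and applies it as a 1D convolution kernel. The function names, the sampling rate, the kernel length, the filter order, and the ERB-based bandwidth rule are all assumptions chosen for illustration; the abstract does not specify which parameters are learned or how they are initialized.

```python
import math


def erb(center_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_kernel(center_hz, bandwidth_hz, sample_rate=16000,
                     length=400, order=4, phase=0.0):
    """Sample a gammatone impulse response and scale it to unit peak magnitude.

    The whole kernel is determined by a handful of scalars (center frequency,
    bandwidth, order, phase) instead of `length` free weights. All default
    values here are illustrative assumptions.
    """
    kernel = []
    for n in range(length):
        t = n / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth_hz * t)
        kernel.append(envelope * math.cos(2.0 * math.pi * center_hz * t + phase))
    peak = max(abs(v) for v in kernel)
    return [v / peak for v in kernel]


def gammatone_filterbank(center_freqs_hz, **kwargs):
    """Build one kernel per center frequency, with bandwidth 1.019 * ERB
    (a common choice for 4th-order gammatone filters; assumed here)."""
    return [gammatone_kernel(fc, 1.019 * erb(fc), **kwargs)
            for fc in center_freqs_hz]


def filter_signal(signal, kernel):
    """'Valid' 1D convolution of a signal with a kernel (kernel is flipped)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[k - 1 - j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def tone(freq_hz, sample_rate=16000, num_samples=1600):
    """Pure sine tone used as a test input."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]


def energy(samples):
    """Sum of squared sample values."""
    return sum(v * v for v in samples)


if __name__ == "__main__":
    bank = gammatone_filterbank([500.0, 1000.0, 2000.0])
    in_band = energy(filter_signal(tone(1000.0), bank[1]))
    out_of_band = energy(filter_signal(tone(4000.0), bank[1]))
    print("in-band energy exceeds out-of-band energy:", in_band > out_of_band)
```

In this sketch each filter is described by four scalars, whereas a plain 1D convolution kernel of the same length would have 400 free weights per filter. That difference in parameter count is the contrast the abstract draws; in a trainable layer the scalars would be optimized by gradient descent rather than fixed as they are here.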

Proposed Method