For environmental sound classification (ESC), this
paper presents a learnable auditory filterbank based on a
one-dimensional (1D) convolutional neural network with strong
psychophysiological inductive bias in the form of a gammatone
filterbank and an equal-loudness prompting normalization. Several
previous ESC methods learn auditory features by applying plain
1D convolutions to raw input waveforms, with the aim of
outperforming traditional handcrafted features such as a
mel-frequency filterbank. However,
the large number of parameters involved in these convolutions
suggests that such methods will not generalize better than a
model defined by fewer parameters, such as the one considered
in this paper. Here, a learnable gammatone filterbank
layer consisting of 1D kernels represented by a parametric form
of the bandpass gammatone filters is proposed for acquiring a
time-frequency representation of the raw waveform. In addition,
a normalization is proposed whose learnable parameters control
the trade-off between energy equalization and structure
preservation in the spectrotemporal domain. To verify the
effectiveness of the
considered network and the normalization, ESC experiments on
the ESC-50 and UrbanSound8K datasets were conducted. The
considered network outperformed other state-of-the-art networks
on both datasets, and an ensemble architecture improved
performance further.
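
As a rough illustration of the parametric kernels described above, the following sketch builds bandpass gammatone kernels from a center frequency and a bandwidth, using the standard gammatone impulse response g(t) = t^(n-1) exp(-2 pi b t) cos(2 pi f t + phi), and convolves them with a raw waveform. It is not the paper's implementation. The filter order of 4, the ERB-based bandwidths, the peak normalization, the kernel length, and all function names are assumptions made for illustration. In a trainable layer the center frequencies and bandwidths would be parameters updated by backpropagation, whereas here they are fixed NumPy arrays.

```python
import numpy as np


def gammatone_kernel(center_hz, bandwidth_hz, sample_rate=16000,
                     length=512, order=4, phase=0.0):
    """Return one 1D gammatone kernel defined by a few scalar parameters.

    Hypothetical helper: order, length, and peak normalization are
    illustrative assumptions, not values taken from the paper.
    """
    t = np.arange(length) / sample_rate
    # Gamma-shaped envelope: t^(n-1) * exp(-2*pi*b*t)
    envelope = t ** (order - 1) * np.exp(-2.0 * np.pi * bandwidth_hz * t)
    # Cosine carrier at the center frequency makes the filter bandpass
    kernel = envelope * np.cos(2.0 * np.pi * center_hz * t + phase)
    peak = np.max(np.abs(kernel))
    return kernel / peak if peak > 0 else kernel


def gammatone_filterbank(waveform, center_hz, bandwidth_hz,
                         sample_rate=16000, length=512):
    """Convolve a waveform with every kernel; returns (channels, samples)."""
    kernels = [gammatone_kernel(f, b, sample_rate, length)
               for f, b in zip(center_hz, bandwidth_hz)]
    return np.stack([np.convolve(waveform, k, mode="same") for k in kernels])


# Example: a 1 kHz tone passed through a three-channel bank.
sample_rate = 16000
time_axis = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * time_axis)

centers = np.array([250.0, 1000.0, 4000.0])
# Assumed initialization: bandwidth proportional to the ERB scale
# (Glasberg and Moore), b = 1.019 * ERB(f).
bandwidths = 1.019 * 24.7 * (4.37 * centers / 1000.0 + 1.0)

features = gammatone_filterbank(tone, centers, bandwidths, sample_rate)
channel_energy = np.mean(features ** 2, axis=1)
```

Each kernel here is determined by two scalars (center frequency and bandwidth) rather than by one free weight per tap, which is the parameter reduction the abstract argues for.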