A New Framework for CNN Based Speech Enhancement in the Time Domain

Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019


This work proposes a new learning mechanism for a fully convolutional neural network (CNN) to address speech enhancement in the time domain. The CNN takes as input the time frames of a noisy utterance and outputs the time frames of the enhanced utterance. At training time, we add an extra operation that converts the time domain to the frequency domain. This conversion corresponds to a simple matrix multiplication and is hence differentiable, implying that a frequency-domain loss can be used to train a time-domain model. We use a mean absolute error (MAE) loss between the enhanced short-time Fourier transform (STFT) magnitude and the clean STFT magnitude to train the CNN. In this way, the model can exploit the domain knowledge of converting a signal to the frequency domain for analysis. Moreover, this approach avoids the well-known invalid STFT problem, since the proposed CNN operates in the time domain. Experimental results demonstrate that the proposed method substantially outperforms other speech enhancement methods. The proposed method is easy to implement and applicable to related speech processing tasks that require time-frequency (T-F) masking or spectral mapping. — Download here
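To make the key idea concrete, below is a minimal NumPy sketch of the frequency-domain loss described above: the DFT is expressed as a matrix multiplication (so it stays differentiable in any autograd framework), and the MAE is taken between the enhanced and clean STFT magnitudes. The window choice, frame length, and function names here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def stft_matrices(frame_len=512):
    """DFT basis as two real matrices (real and imaginary parts).

    With these, the time-to-frequency conversion is a plain matrix
    multiplication and therefore differentiable end to end.
    """
    n = np.arange(frame_len)
    k = n.reshape(-1, 1)
    angle = -2.0 * np.pi * k * n / frame_len
    window = np.hanning(frame_len)              # illustrative window choice
    real = np.cos(angle) * window               # shape: (frame_len, frame_len)
    imag = np.sin(angle) * window
    return real, imag

def stft_magnitude(frames, real, imag):
    """frames: (num_frames, frame_len) time-domain frames -> STFT magnitude."""
    re = frames @ real.T
    im = frames @ imag.T
    # Small epsilon keeps the square root differentiable at zero magnitude.
    return np.sqrt(re ** 2 + im ** 2 + 1e-12)

def mae_spectral_loss(enhanced_frames, clean_frames, frame_len=512):
    """MAE between enhanced and clean STFT magnitudes (the training loss)."""
    real, imag = stft_matrices(frame_len)
    return np.mean(np.abs(stft_magnitude(enhanced_frames, real, imag)
                          - stft_magnitude(clean_frames, real, imag)))
```

In a real training loop, the same matrices would be registered as fixed (non-trainable) weights of the network, so gradients of this loss flow back through the magnitude computation into the time-domain CNN output.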