On Adversarial Training and Loss Functions for Speech Enhancement

Published in ICASSP, 2018


Generative adversarial networks (GANs) are becoming increasingly popular for image processing tasks. Researchers have started using GANs for speech enhancement, but the advantage of using the GAN framework has not been established for speech enhancement. For example, a recent study reports encouraging enhancement results, but we find that the architecture of the generator used in the GAN gives better performance when it is trained alone using the L1 loss. This work presents a new GAN for speech enhancement and obtains a performance improvement with the help of adversarial training. A deep neural network (DNN) is used for time-frequency mask estimation, and it is trained in two ways: regular training with the L1 loss, and training using the GAN framework with the help of an adversary discriminator. Experimental results suggest that the GAN framework improves speech enhancement performance. Further exploration of loss functions for speech enhancement suggests that the L1 loss is consistently better than the L2 loss for improving the perceptual quality of noisy speech.
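To make the L1 versus L2 comparison concrete, here is a minimal NumPy sketch of the two losses applied to a masked magnitude spectrogram. It is an illustration only, not the paper's training code: the abstract does not say which signal representation the loss is computed on, so the array shapes, the additive-magnitude mixing, and the helper names (`apply_mask`, `l1_loss`, `l2_loss`) are all assumptions made for this example.

```python
import numpy as np

def apply_mask(noisy_mag, mask):
    """Apply a time-frequency mask element-wise to a noisy magnitude spectrogram."""
    return mask * noisy_mag

def l1_loss(estimate, target):
    """Mean absolute error between the estimate and the target."""
    return float(np.mean(np.abs(estimate - target)))

def l2_loss(estimate, target):
    """Mean squared error between the estimate and the target."""
    return float(np.mean((estimate - target) ** 2))

# Synthetic data standing in for spectrograms (hypothetical shape: frames x bins).
rng = np.random.default_rng(0)
clean_mag = rng.random((100, 161))
noise_mag = rng.random((100, 161))
noisy_mag = clean_mag + noise_mag  # crude assumption: magnitudes add

# A mask that would recover the clean magnitudes exactly, plus a perturbed
# version standing in for the output of a mask-estimation network.
ideal_mask = clean_mag / (noisy_mag + 1e-8)
estimated_mask = np.clip(
    ideal_mask + 0.05 * rng.standard_normal(ideal_mask.shape), 0.0, 1.0
)

enhanced_mag = apply_mask(noisy_mag, estimated_mask)
print("L1 loss:", l1_loss(enhanced_mag, clean_mag))
print("L2 loss:", l2_loss(enhanced_mag, clean_mag))
```

The L2 loss squares each error, so it weights large deviations more heavily than the L1 loss does. In a training loop, either function would be replaced by its differentiable counterpart in the chosen deep learning framework.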

Significance of Glottal Activity Detection for Speaker Verification in Degraded and Limited Data Condition

Published in TENCON 2015 - 2015 IEEE Region 10 Conference, 2015


The objective of this work is to establish the importance of the speaker information present in the glottal regions of the speech signal. In addition, its robustness to degraded data and its significance for limited data are examined for the task of speaker verification. An adaptive threshold method is proposed for use on the zero frequency filtered signal to obtain the glottal activity regions. Feature vectors are extracted from regions having significant glottal activity. An i-vector based speaker verification system is developed using the NIST SRE 2003 database, and the performance of the proposed method is evaluated in degraded and limited data conditions. The robustness of the proposed method is tested for white and babble noise. Further, short utterances of test data are considered to evaluate the performance in the limited data condition. The proposed method, based on the selection of glottal regions, is found to perform better than the baseline energy based voice activity detection method in degraded and limited data conditions.
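For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a generic energy based voice activity detector. It does not implement the paper's glottal activity detection, and the abstract gives no details of the baseline's configuration, so the frame length, hop size, threshold ratio, and the function name `energy_vad` are assumptions chosen for illustration.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, ratio=0.06):
    """Flag frames whose short-time energy exceeds a fraction of the mean energy.

    frame_len, hop, and ratio are hypothetical defaults (25 ms frames and a
    10 ms hop at 16 kHz); they are not taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = ratio * energies.mean()
    return energies > threshold

# Toy signal: silence, a 200 Hz tone, then silence again (16 kHz sampling assumed).
tone = np.sin(2 * np.pi * 200 * np.arange(1600) / 16000)
toy_signal = np.concatenate([np.zeros(1600), tone, np.zeros(1600)])

speech_frames = energy_vad(toy_signal)
print(speech_frames.astype(int))
```

Because the threshold depends only on frame energy, additive noise raises the energy of non-speech frames as well, which is one reason an energy based detector can be unreliable on degraded data.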