Posts by Collection



Significance of Glottal Activity Detection for Speaker Verification in Degraded and Limited Data Condition

Published in TENCON 2015 - 2015 IEEE Region 10 Conference, 2015


The objective of this work is to establish the importance of speaker information present in the glottal regions of the speech signal, and to assess its robustness under degraded data and its significance under limited data for the task of speaker verification. An adaptive threshold method is proposed for use on the zero-frequency filtered signal to obtain the glottal activity regions. Feature vectors are extracted from regions having significant glottal activity. An i-vector based speaker verification system is developed using the NIST SRE 2003 database, and the performance of the proposed method is evaluated under degraded and limited data conditions. The robustness of the proposed method is tested for white and babble noise. Further, short utterances of test data are considered to evaluate performance in the limited data condition. The proposed method, based on the selection of glottal regions, is found to perform better than the baseline energy-based voice activity detection method in degraded and limited data conditions.
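The idea can be sketched as follows, assuming a simple cumulative-sum implementation of zero-frequency filtering and a mean-energy adaptive threshold; the exact filter cascade and threshold rule in the paper may differ:

```python
import numpy as np

def zero_frequency_filter(x, fs, win_ms=10):
    """Integrate the signal near zero frequency, then remove the slow trend."""
    x = np.diff(x, prepend=x[0])           # difference to remove any DC offset
    y = np.cumsum(np.cumsum(x))            # pass through zero-frequency resonators
    n = int(win_ms * 1e-3 * fs)
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    for _ in range(2):                     # subtract a local mean twice
        y = y - np.convolve(y, kernel, mode="same")
    return y

def glottal_activity_mask(zff, fs, frame_ms=20, alpha=0.1):
    """Mark frames whose ZFF energy exceeds an adaptive threshold.

    alpha (the fraction of the mean frame energy used as the threshold)
    is an assumed value, not taken from the paper.
    """
    hop = int(frame_ms * 1e-3 * fs)
    energy = np.array([np.mean(zff[i:i + hop] ** 2)
                       for i in range(0, len(zff) - hop + 1, hop)])
    threshold = alpha * energy.mean()
    return energy > threshold
```

Feature vectors would then be extracted only from frames where the mask is true, in place of an energy-based voice activity detector.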

On Adversarial Training and Loss Functions for Speech Enhancement

Published in ICASSP, 2018


Generative adversarial networks (GANs) are becoming increasingly popular for image processing tasks, and researchers have started using them for speech enhancement; however, the advantage of the GAN framework for speech enhancement has not been established. For example, a recent study reports encouraging enhancement results, but we find that the generator architecture used in that GAN gives better performance when it is trained alone using the L1 loss. This work presents a new GAN for speech enhancement and obtains a performance improvement through adversarial training. A deep neural network (DNN) is used for time-frequency mask estimation and is trained in two ways: regular training with the L1 loss, and training within the GAN framework with the help of an adversarial discriminator. Experimental results suggest that the GAN framework improves speech enhancement performance. A further exploration of loss functions for speech enhancement suggests that the L1 loss is consistently better than the L2 loss for improving the perceptual quality of noisy speech.
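The two losses being compared can be sketched on a time-frequency mask target; the `ideal_ratio_mask` below is one common mask definition used purely for illustration, and the paper's exact mask and network are not reproduced here:

```python
import numpy as np

def l1_loss(est, target):
    """Mean absolute error; found in this line of work to better
    preserve perceptual quality than the L2 loss."""
    return np.mean(np.abs(est - target))

def l2_loss(est, target):
    """Mean squared error, the conventional regression loss."""
    return np.mean((est - target) ** 2)

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """A common T-F mask training target: clean energy over total energy,
    so values lie in [0, 1]."""
    return clean_mag ** 2 / (clean_mag ** 2 + noise_mag ** 2 + eps)
```

Note that for mask errors smaller than 1 (the typical case, since the mask lies in [0, 1]), squaring shrinks the penalty, so L2 under-weights small but perceptually relevant errors relative to L1.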

A New Framework for Supervised Speech Enhancement in the Time Domain

Published in Interspeech, 2018


This work proposes a new learning framework that uses a loss function in the frequency domain to train a convolutional neural network (CNN) in the time domain. At training time, an extra operation is added after the speech enhancement network to convert the estimated time-domain signal to the frequency domain. This operation is differentiable and is used to train the system with a loss in the frequency domain. The proposed approach replaces learning in the frequency domain, i.e., short-time Fourier transform (STFT) magnitude estimation, with learning in the original time domain. It is a spectral mapping approach in which the CNN first generates a time-domain signal and then computes its STFT, which is used for spectral mapping. This way, the CNN can exploit additional domain knowledge about computing the STFT magnitude from the time-domain signal. Experimental results demonstrate that the proposed method substantially outperforms other speech enhancement methods. The approach is easy to implement and applicable to related speech processing tasks that require spectral mapping or time-frequency (T-F) masking.
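The training-time operation described above, taking the STFT magnitude of the network's time-domain output and computing a frequency-domain loss against the clean target, can be sketched in NumPy (forward pass only; in an actual system the STFT would be expressed with a deep learning framework's differentiable primitives so that gradients flow back to the CNN). Window size and hop length are assumed values:

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Hann-windowed STFT magnitude of a 1-D time-domain signal."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spectral_l1_loss(est_time, clean_time):
    """L1 loss between STFT magnitudes of the time-domain estimate
    and the clean reference."""
    return np.mean(np.abs(stft_mag(est_time) - stft_mag(clean_time)))
```

The key design point is that the network never predicts spectra directly; it outputs a waveform, and the fixed STFT operation supplies the domain knowledge needed to score it in the frequency domain.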


