Search CORE

arXiv.org e-Print Archive

Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments

Author: Alameda-Pineda Xavier
Ban Yutong
Girin Laurent
Horaud Radu
Li Xiaofei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/02/2019
Field of study

We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Towards this goal, we adopt a maximum-likelihood formulation and we propose to use an exponentiated gradient (EG) to efficiently update source-direction estimates starting from their currently available values. The problem of multiple speaker tracking is computationally intractable because the number of possible associations between observed source directions and physical speakers grows exponentially with time. We adopt a Bayesian framework and we propose a variational approximation of the posterior filtering distribution associated with multiple speaker tracking, as well as an efficient variational expectation-maximization (VEM) solver. The proposed online localization and tracking method is thoroughly evaluated using two datasets that contain recordings performed in real environments.Comment: IEEE Journal of Selected Topics in Signal Processing, 201

arXiv.org e-Print Archive

Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function

Author: Gannot Sharon
Girin Laurent
Horaud Radu
Li Xiaofei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/02/2018
Field of study

This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, \emph{assuming known mixing filters}. We propose to perform the speech separation and enhancement task in the short-time Fourier transform domain, using the convolutive transfer function (CTF) approximation. Compared to time-domain filters, CTF has much less taps, consequently it has less near-common zeros among channels and less computational complexity. The work proposes three speech-source recovery methods, namely: i) the multichannel inverse filtering method, i.e. the multiple input/output inverse theorem (MINT), is exploited in the CTF domain, and for the multi-source case, ii) a beamforming-like multichannel inverse filtering method applying single source MINT and using power minimization, which is suitable whenever the source CTFs are not all known, and iii) a constrained Lasso method, where the sources are recovered by minimizing the

\ell_1

-norm to impose their spectral sparsity, with the constraint that the

\ell_2

-norm fitting cost, between the microphone signals and the mixing model involving the unknown source signals, is less than a tolerance. The noise can be reduced by setting a tolerance onto the noise power. Experiments under various acoustic conditions are carried out to evaluate the three proposed methods. The comparison between them as well as with the baseline methods is presented.Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processin

Informed Source Separation from compressed mixtures using spatial wiener filter and quantization noise estimation

Author: Girin Laurent
Liutkus Antoine
Zhang Shuhua
Publication venue: CCSD
Publication date: 01/05/2013
Field of study

International audienceIn a previous work, we proposed an Informed Source Separation sys- tem based on Wiener filtering for active listening of music from un- compressed (16-bit PCM) multichannel mix signals. In the present work, the system is improved to work with (MPEG-2 AAC) com- pressed mix signals: quantization noise is estimated from the AAC bitstream at the decoder and explicitly taken into account in the source separation process. Also a direct MDCT-to-STFT transform is used to optimize the computational efficiency of the process in the STFT domain from AAC-decoded MDCT coefficients

Reverberant Sound Localization with a Robot Head Based on Direct-Path Relative Transfer Function

Author: Badeig Fabien
Girin Laurent
Horaud Radu
Li Xiaofei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 09/10/2016
Field of study

International audienceThis paper addresses the problem of sound-source localization (SSL) with a robot head, which remains a challenge in real-world environments. In particular we are interested in locating speech sources, as they are of high interest for human-robot interaction. The microphone-pair response corresponding to the direct-path sound propagation is a function of the source direction. In practice, this response is contaminated by noise and reverberations. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer function (ATF) of the two microphones, and it is an important feature for SSL. We propose a method to estimate the DP-RTF from noisy and reverberant signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function (CTF) approximation is adopted to accurately represent the impulse response of the microphone array, and the first coefficient of the CTF is mainly composed of the direct-path ATF. At each frequency, the frame-wise speech auto-and cross-power spectral density (PSD) are obtained by spectral subtraction. Then a set of linear equations is constructed by the speech auto-and cross-PSD of multiple frames, in which the DP-RTF is an unknown variable, and is estimated by solving the equations. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for SSL. Experiments with a robot, placed in various reverberant environments, show that the proposed method outperforms two state-of-the-art methods

Non-Stationary Noise Power Spectral Density Estimation Based on Regional Statistics

Author: Gannot Sharon
Girin Laurent
Horaud Radu
Li Xiaofei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2016
Field of study

International audienceEstimating the noise power spectral density (PSD) is essential for single channel speech enhancement algorithms. In this paper, we propose a noise PSD estimation approach based on regional statistics. The proposed regional statistics consist of four features representing the statistics of the past and present periodograms in a short-time period. We show that these features are efficient in characterizing the statistical difference between noise PSD and noisy speech PSD. We therefore propose to use these features for estimating the speech presence probability (SPP). The noise PSD is recursively estimated by averaging past spectral power values with a time-varying smoothing parameter controlled by the SPP. The proposed method exhibits good tracking capability for non-stationary noise, even for abruptly increasing noise level

Mapping Sounds on Images Using Binaural Spectrograms

Author: Deleforge Antoine
Drouard Vincent
Girin Laurent
Horaud Radu
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/09/2014
Field of study

International audienceWe propose a novel method for mapping sound spectrograms onto images and thus enabling alignment between auditory and visual features for subsequent multimodal processing. We suggest a supervised learning approach to this audio-visual fusion problem, on the following grounds. Firstly, we use a Gaussian mixture of locally-linear regressions to learn a mapping from image locations to binaural spectrograms. Secondly, we derive a closed-form expression for the conditional posterior probability of an image location, given both an observed spectrogram, emitted from an unknown source direction, and the mapping parameters that were previously learnt. Prominently, the proposed method is able to deal with completely different spectrograms for training and for alignment. While fixed-length wide-spectrum sounds are used for learning, thus fully and robustly estimating the regression, variable-length sparse-spectrum sounds, e.g., speech, are used for alignment. The proposed method successfully extracts the image location of speech utterances in realistic reverberant-room scenarios

An Inverse-Gamma Source Variance Prior with Factorized Parameterization for Audio Source Separation

Author: Alameda-Pineda Xavier
Gannot Sharon
Girin Laurent
Horaud Radu
Kounades-Bastian Dionyssos
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2016
Field of study

International audienceIn this paper we present a new statistical model for the power spectral density (PSD) of an audio signal and its application to multichannel audio source separation (MASS). The source signal is modeled with the local Gaussian model (LGM) and we propose to model its variance with an inverse-Gamma distribution, whose scale parameter is factorized as a rank-1 model. We discuss the interest of this approach and evaluate it in a MASS task with underdetermined convolutive mixtures. For this aim, we derive a variational EM algorithm for parameter estimation and source inference. The proposed model shows a benefit in source separation performance compared to a state-of-the-art LGM NMF-based technique