Multi-channel speech enhancement using labelled random finite sets and a neural beamformer in cocktail party scenario

Datta, Jayanta; Dehghan Firoozabadi, Ali; Zabala-Blanco, David; Castillo-Soria, Francisco R.

dc.contributor.author	Datta, Jayanta
dc.contributor.author	Dehghan Firoozabadi, Ali
dc.contributor.author	Zabala-Blanco, David
dc.contributor.author	Castillo-Soria, Francisco R.
dc.date.accessioned	2025-06-12T14:15:43Z
dc.date.available	2025-06-12T14:15:43Z
dc.date.issued	2025
dc.identifier.uri	http://repositorio.ucm.cl/handle/ucm/6125
dc.description.abstract	In this research, a multi-channel target speech enhancement scheme is proposed that is based on a deep learning (DL) architecture and assisted by multi-source tracking using a labeled random finite set (RFS) framework. A neural network based on the minimum variance distortionless response (MVDR) beamformer is used as the beamformer, where a residual dense convolutional graph-U-Net is applied in a generative adversarial network (GAN) setting to model the beamformer for target speech enhancement under reverberant conditions involving multiple moving speech sources. The input dataset for this neural architecture is constructed by applying multi-source tracking using multi-sensor generalized labeled multi-Bernoulli (MS-GLMB) filtering, which belongs to the labeled RFS framework, to obtain estimates of the sources' positions and the associated labels (one per source) at each time frame with high accuracy despite undesirable factors such as reverberation and background noise. The tracked source positions and associated labels help to correctly discriminate the target source from the interferers across all time frames and to generate time-frequency (T-F) masks corresponding to the target source from the output of a time-varying MVDR beamformer. These T-F masks constitute the target label set used to train the proposed deep neural architecture to perform target speech enhancement. The use of MS-GLMB filtering and a time-varying MVDR beamformer provides the spatial information of the sources, in addition to the spectral information, within the neural speech enhancement framework during the training phase. Moreover, the GAN framework takes advantage of adversarial optimization as an alternative to maximum likelihood (ML)-based frameworks, which further boosts the performance of target speech enhancement under reverberant conditions. 
The computer simulations demonstrate that the proposed approach leads to better target speech enhancement performance than existing state-of-the-art DL-based methodologies that do not incorporate the labeled RFS-based approach. This is evident from the ESTOI of 75% and PESQ of 2.70 achieved by the proposed approach, compared with the ESTOI of 46.74% and PESQ of 1.84 achieved by Mask-MVDR with a self-attention mechanism, at a reverberation time (RT60) of 550 ms.	es_CL
dc.language.iso	en	es_CL
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Chile	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/cl/	*
dc.source	Applied Sciences, 15(6), 2944	es_CL
dc.subject	SRP-PHAT	es_CL
dc.subject	Deep learning	es_CL
dc.subject	Microphone array	es_CL
dc.subject	MS-GLMB filtering	es_CL
dc.subject	Beamforming	es_CL
dc.title	Multi-channel speech enhancement using labelled random finite sets and a neural beamformer in cocktail party scenario	es_CL
dc.type	Article	es_CL
dc.ucm.facultad	Facultad de Ciencias de la Ingeniería	es_CL
dc.ucm.indexacion	Scopus	es_CL
dc.ucm.indexacion	Isi	es_CL
dc.ucm.uri	mdpi.com/2076-3417/15/6/2944	es_CL
dc.ucm.doi	doi.org/10.3390/app15062944	es_CL
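
The abstract refers to an MVDR beamformer whose output is used to derive the T-F masks. As a rough illustration only, the sketch below computes the classical (non-neural) MVDR weights for a single frequency bin, w = R^{-1} d / (d^H R^{-1} d). The function name `mvdr_weights`, the 4-microphone setup, the random noise covariance, and the steering vector are all assumptions made for this example; none of them come from the record or the paper.

```python
import numpy as np


def mvdr_weights(noise_cov, steering):
    """Classical MVDR weights for one frequency bin (hypothetical helper).

    noise_cov: (M, M) Hermitian positive-definite noise covariance matrix.
    steering:  (M,) steering vector towards the target source.
    """
    # Solve R x = d instead of forming the explicit inverse of R.
    r_inv_d = np.linalg.solve(noise_cov, steering)
    # Normalise so that the response towards the target is exactly 1.
    return r_inv_d / (steering.conj() @ r_inv_d)


# Assumed toy data: 4 microphones, 64 noise snapshots.
rng = np.random.default_rng(0)
n_mics, n_snapshots = 4, 64
noise = rng.standard_normal((n_mics, n_snapshots)) \
    + 1j * rng.standard_normal((n_mics, n_snapshots))
# Sample covariance with a small diagonal load for numerical stability.
noise_cov = noise @ noise.conj().T / n_snapshots + 1e-3 * np.eye(n_mics)
# Assumed far-field steering vector for a uniform linear array.
steering = np.exp(-1j * 2 * np.pi * 0.1 * np.arange(n_mics))

w = mvdr_weights(noise_cov, steering)
# Distortionless constraint: w^H d should equal 1 up to rounding error.
response = w.conj() @ steering
```

In a time-varying setting such as the one the abstract describes, the steering vector would be recomputed per frame from the tracked source position; that step is not shown here.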

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following collection(s)

Scientific Articles

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Chile
