Audiovisual Information Processing

Our research group has extensive experience in the field of speaker recognition and identification, having participated since 2006 in the Speaker Recognition Evaluation (SRE) organized by the National Institute of Standards and Technology (NIST) with competitive results. This is one of the most prestigious evaluations in the field, with teams from around the world participating with state-of-the-art technology. Our current research efforts in speaker recognition rely on deep learning solutions, and we collaborate with industry partners such as Nuance (now part of Microsoft) in the development of text-dependent biometric applications.

Spoken language recognition involves correctly identifying the language spoken in an audio utterance. Because the research fields are closely related, recent progress in automatic speech recognition and speaker recognition techniques based on deep neural networks has also improved the technology applied to language recognition. Our current research on this topic aims to accurately separate closely related languages (e.g., dialects) by introducing recently developed training objectives, such as a triplet loss neural network or area under the ROC curve (AUC) optimization techniques.
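As an illustration of the triplet loss objective mentioned above, the following is a minimal sketch, not our group's implementation. It assumes Euclidean distance between embeddings and a hinge with a fixed margin; the function names, the margin value and the toy two-dimensional embeddings are all hypothetical. The loss is zero when an utterance (the anchor) is already closer to an utterance of the same language than to one of a different language by at least the margin, and positive otherwise.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, for illustration only).

    anchor and positive are embeddings of the same class (e.g. the same
    dialect); negative belongs to a different class.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (hypothetical values).
anchor = [0.0, 0.0]
same_language = [0.0, 1.0]    # distance 1 from the anchor
other_language = [3.0, 4.0]   # distance 5 from the anchor

# Well-separated triplet: the margin is already satisfied.
loss_easy = triplet_loss(anchor, same_language, other_language)

# Swapping the roles simulates a badly placed pair that must be corrected.
loss_hard = triplet_loss(anchor, other_language, same_language)

print(loss_easy, loss_hard)
```

In a neural network, the same quantity would be averaged over many triplets and minimized with respect to the parameters that produce the embeddings.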

Language recognition and identification has been an active research line in our group over the years, with our researchers making significant contributions to the scientific community, especially in the introduction of the i-vector framework to the language recognition task. As with the NIST SRE, we have participated in the NIST Language Recognition Evaluation (LRE) since 2010, with competitive results in all our submissions.

Speaker diarization aims to answer the question "Who spoke when?". Current limitations in this technology are mainly due to the high variability in short fragments and the variability introduced by unwanted noise and channel conditions. Our research efforts in this line are therefore focused on obtaining representations that are robust to this variability and to conditions unseen during training. We are also investigating different applications of neural network solutions to subtasks that are relevant to the diarization problem, such as the clustering task and the identity assignment task.
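To make the clustering subtask more concrete, the sketch below groups per-segment embeddings into speakers. It is only a hedged illustration, not the neural approach under investigation in our group: it assumes a classical average-linkage agglomerative clustering over cosine similarity, and the function names, the 0.8 stopping threshold and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.8):
    """Average-linkage agglomerative clustering (illustrative assumption).

    Starts with one cluster per segment and repeatedly merges the two most
    similar clusters until no pair reaches the threshold. Returns one
    cluster label per segment, in input order.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def avg_sim(c1, c2):
        total = sum(cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                sim = avg_sim(clusters[p], clusters[q])
                if best is None or sim > best[0]:
                    best = (sim, p, q)
        if best[0] < threshold:
            break  # remaining clusters are treated as different speakers
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Five consecutive speech segments with toy 2-D embeddings (hypothetical).
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = cluster_segments(segments)
print(labels)
```

Combining each label with the start and end time of its segment gives a "who spoke when" output; mapping the anonymous labels to known identities corresponds to the identity assignment subtask.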

Our research group has actively participated in different diarization evaluations with competitive results (MGB, DIHARD, Albayzín ’18 and ’20), also bringing together audio and image modalities in the face diarization challenges proposed recently in the Albayzín ’20 evaluation.

In addition to the information contained in the speech itself, audio signals generally contain a much richer variety of content, such as different noises, music or a combination of both. Acoustic event detection and classification systems aim to obtain information beyond speech, accurately capturing the different sources of information present.

One of our current lines of research on this topic applies deep learning solutions to multiclass audio segmentation, separating an audio signal into homogeneous regions that contain speech, music or noise. Similar solutions were applied to the speech activity detection task in the international Fearless Steps challenge, which introduces the challenging domain of audio from the Apollo space missions.

The data-driven solutions behind deep learning applications have proven effective for modeling the relationship between noisy and clean signals. That is why most state-of-the-art applications rely on deep neural network architectures, which have reported great success in tasks related to speech enhancement. Within this topic, our research group is working mainly along two lines. On the one hand, we aim to advance the interpretability of deep neural network based speech enhancement. On the other hand, we aim to obtain solutions that are efficient enough to be included in everyday applications for human communication, such as videoconferences or telephone conversations.

This research line is complemented by our joint work with the company BTS/SONOC, one of the world's leading companies in telecommunication speech traffic. We are currently developing a speech enhancement system able to provide a clear speech signal under adverse acoustic conditions and highly variable noisy environments.

Research lines

Speaker verification and identification

Language identification

Speaker/Face Diarization

Acoustic event detection & classification

Speech enhancement and audio quality assessment