Classification and segmentation of audiovisual documents

This research line aims to develop technologies that facilitate access to huge multimedia repositories through the automatic labelling and extraction of the different audiovisual documents they contain. Our primary working environment is audiovisual content from broadcast emissions, which is of interest both commercially and scientifically, as it provides a variety of acoustic, semantic and emotional scenarios.

This activity draws heavily on basic research results from the audiovisual information processing research line. In particular, we incorporate our latest advances in deep learning to segment audiovisual content into homogeneous classes such as speech, noise, music or combinations of these.
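As an illustration of this kind of audio segmentation, the sketch below slides a fixed-length window over a log-mel spectrogram and labels each window with a small convolutional classifier. The architecture, window and hop sizes, and the exact class inventory are assumptions chosen for the example; they do not describe the group's actual system.

```python
import torch
import torch.nn as nn
import torchaudio

# Class labels taken from the text above; "speech+music" is an assumed combined class.
CLASSES = ["speech", "music", "noise", "speech+music"]

class SegmentClassifier(nn.Module):
    """Toy CNN that labels one fixed-length spectrogram window.
    Purely illustrative: the real model architecture is not specified here."""
    def __init__(self, n_classes=len(CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):  # x: (batch, 1, n_mels, frames)
        return self.head(self.features(x).flatten(1))

def segment(waveform, sample_rate, model, win_s=1.0, hop_s=0.5):
    """Slide a window over a mono waveform and emit (start, end, label) spans."""
    mel = torchaudio.transforms.MelSpectrogram(sample_rate, n_mels=64)(waveform)
    frames_per_s = mel.size(-1) / (waveform.size(-1) / sample_rate)
    win, hop = int(win_s * frames_per_s), int(hop_s * frames_per_s)
    spans = []
    model.eval()
    with torch.no_grad():
        for start in range(0, mel.size(-1) - win + 1, hop):
            chunk = mel[..., start:start + win].log1p().unsqueeze(0)
            label = CLASSES[model(chunk).argmax(-1).item()]
            spans.append((start / frames_per_s, (start + win) / frames_per_s, label))
    return spans
```

In practice, a post-processing step usually merges consecutive windows that share a label and smooths spurious single-window changes, so that the output is a compact list of homogeneous segments rather than raw window decisions.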

Analysis and retrieval of audiovisual content

Due to the huge increase in the generation of multimedia content, systems that can analyze and index this content quickly and accurately are becoming increasingly relevant. Our research group has maintained a stable agreement with Radio Televisión Española (RTVE) through the “Cátedra RTVE en la Universidad de Zaragoza” since 2017. This agreement seeks to boost work on audiovisual content analysis, with a special emphasis on the digital transformation of huge multimedia archives.

We have also had a close relationship with Corporación Aragonesa de Radio y Televisión (CARTV) since 2008, helping them develop new technologies to enhance the accessibility of their multimedia content. Furthermore, since 2016 we have collaborated actively with ETIQMEDIA through a long-term technology transfer agreement to develop tools for audiovisual document management.

Multimodal person and event recognition

The confluence of machine learning techniques applied to audio and image processing allows these algorithms to be reused from a multimodal perspective. In this topic, our research group holds an open research line in multimodal person recognition, bringing together our experience in speaker recognition and recent advances in image and video processing. We recently participated in the Albayzín 2020 multimodal diarization challenge with competitive results, presenting a system that assigns speaker identities using both audio information and facial recognition.
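A common way to combine the two modalities is late score fusion: a voice embedding (e.g., from a speaker-verification model) and a face embedding are each compared against enrolled identities, and the similarity scores are merged with a weight. The sketch below shows that idea in its simplest form; the fusion weight alpha and the enrolled-identity dictionary are hypothetical, and nothing here reproduces the actual Albayzín 2020 submission.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_identity(voice_emb, face_emb, enrolled, alpha=0.6):
    """Late score fusion over enrolled identities.

    enrolled maps a name to a (reference_voice_emb, reference_face_emb) pair;
    alpha is a hypothetical weight balancing the audio and visual scores.
    Returns the best-scoring identity and its fused score.
    """
    best, best_score = None, -np.inf
    for name, (ref_voice, ref_face) in enrolled.items():
        score = (alpha * cosine(voice_emb, ref_voice)
                 + (1 - alpha) * cosine(face_emb, ref_face))
        if score > best_score:
            best, best_score = name, score
    return best, best_score
```

Keeping fusion at the score level lets each modality fail gracefully: when a face is occluded or speech overlaps, the corresponding weight can be reduced without retraining either single-modality model.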

Multimedia content summarization

This research line combines speech and language technologies with video and image processing techniques, with the goal of extracting the most relevant fragments from an audiovisual document. The main idea is to automatically generate an abstract by detecting the most significant objects in the scenes, which are later described in natural language. Our recent work in this line has produced successful proofs of concept for automatic summarization of both textual and audiovisual documents.
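Once each fragment has a relevance score (for example, derived from detected objects or keywords, which we assume is computed upstream) and a natural-language caption, the extractive step reduces to selecting fragments under a duration budget. The greedy selection below is a minimal sketch of that step, not the group's published method.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    start: float      # seconds
    end: float        # seconds
    relevance: float  # score from upstream detectors (assumed given)
    caption: str      # natural-language description of the fragment

def summarize(fragments, budget_s=60.0):
    """Greedy extractive summary: take the most relevant fragments until
    the time budget is spent, then restore chronological order."""
    chosen, used = [], 0.0
    for f in sorted(fragments, key=lambda f: f.relevance, reverse=True):
        length = f.end - f.start
        if used + length <= budget_s:
            chosen.append(f)
            used += length
    return sorted(chosen, key=lambda f: f.start)
```

Concatenating the captions of the selected fragments then yields a textual abstract, while the fragments themselves form the audiovisual summary.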