Internship - Contrastive Multimodal Pretraining for Noise-aware Diffusion-based Audio-visual Speech Enhancement
Type of contract: Internship agreement
Level of qualifications required: Bac + 4 (first year of Master's) or equivalent
Position: Research intern
Context and working environment
This Master's internship is part of the REAVISE project, “Robust and Efficient Deep Learning based Audiovisual Speech Enhancement” (2023-2026), funded by the French National Research Agency (ANR). The general objective of REAVISE is to develop a unified audio-visual speech enhancement (AVSE) framework that leverages recent methodological breakthroughs in statistical signal processing, machine learning, and deep neural networks to achieve robust and efficient enhancement.
Assignment
Diffusion models represent a cutting-edge class of generative models that are highly effective at modeling natural data such as images and audio [1]. They combine a forward (noising) process, which incrementally transforms training data into Gaussian noise, with a reverse (denoising) process, which reconstructs data from noise. Recently, diffusion models have demonstrated promising performance for unsupervised speech enhancement [2]. Used as data-driven priors for clean speech, they allow noisy recordings to be enhanced by estimating the clean signal through posterior sampling, effectively separating it from background noise. Moreover, integrating video as conditioning information into the speech model further improves enhancement by exploiting visual cues from the target speaker [3]. This underscores the potential of combining audio and visual data to improve speech quality, especially in highly noisy environments.
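For concreteness, here is a brief sketch of the score-based formulation of [1] that underlies this setup; the notation is illustrative and not tied to any specific implementation in the project:

```latex
% Forward (noising) SDE: gradually perturbs clean speech x toward Gaussian noise
dx = f(x, t)\,dt + g(t)\,dw
% Reverse (denoising) SDE: runs backward in time, driven by the score of p_t
dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t)\,d\bar{w}
% Enhancement via posterior sampling: given a noisy observation y, the prior
% score is replaced by the posterior score (Bayes' rule)
\nabla_x \log p_t(x \mid y) = \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)
```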
Contrastive learning further extends multimodal integration, as evidenced by models such as CLIP (Contrastive Language–Image Pre-training) [4] and CLAP (Contrastive Language–Audio Pre-training) [5], which bridge disparate modalities such as text with images and audio. These models learn a shared multimodal embedding space that supports a wide range of applications, from text-to-image generation to sophisticated audio processing tasks. Although they have been applied to audio tasks such as source separation [6], generation [7], classification [8], and localization [9], their application to audio-visual speech enhancement remains largely unexplored.
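As an illustration, here is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) objective used by CLIP- and CLAP-style models to align two modalities in a shared embedding space; the tensor names and temperature value are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pairing audio clips with text descriptions.

    audio_emb, text_emb: (batch, dim) outputs of modality-specific encoders.
    Matching pairs share the same batch index; all other pairs act as negatives.
    """
    # Project both modalities onto the unit sphere so that the dot
    # product equals cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares audio i to text j.
    logits = audio_emb @ text_emb.t() / temperature

    # The positive pair for each row/column lies on the diagonal.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits, targets)      # audio -> text
    loss_t2a = F.cross_entropy(logits.t(), targets)  # text -> audio
    return (loss_a2t + loss_t2a) / 2
```

The same objective applies unchanged if the text encoder is swapped for a visual encoder of the acoustic scene, which is what makes the shared embedding space useful across modalities.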
Main activities
The primary objective of this project is to refine and expand the capabilities of audio-visual speech enhancement by incorporating additional modal information into the noise model. By exploiting either textual descriptions or visual representations of the noise environment, such as videos or images depicting the acoustic scene, we aim to improve the model's ability to identify and separate noise sources. This involves developing robust contrastive learning techniques that leverage the shared multimodal embedding space to handle discrepancies between training and testing conditions, for example training with textual noise descriptions but testing with visual data.
To address these challenges, we propose to:
- Develop a contrastive learning framework that can dynamically adapt to different modalities of noise information, ensuring that the system remains effective regardless of which data types are available at training and test time.
- Utilize the shared embedding space learned through contrastive methods as conditioning information for the noise model (see the sketch after this list), improving the performance, adaptability, and robustness of speech enhancement systems in diverse and noisy environments.
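As a hypothetical illustration of the second point, the shared noise embedding could be injected into a diffusion noise model through feature-wise modulation (FiLM); the module name, layer sizes, and tensor shapes below are assumptions for the sketch, not project specifications:

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Hypothetical sketch: condition a diffusion noise model on a shared
    multimodal noise embedding (e.g., from a CLAP-like encoder) via
    feature-wise affine modulation (FiLM). Dimensions are placeholders."""

    def __init__(self, emb_dim=512, feat_channels=256):
        super().__init__()
        # Map the noise-environment embedding to per-channel scale and shift.
        self.to_scale_shift = nn.Linear(emb_dim, 2 * feat_channels)

    def forward(self, features, noise_emb):
        # features:  (batch, channels, time) intermediate network activations
        # noise_emb: (batch, emb_dim) embedding of the noise description,
        #            textual or visual, since both modalities are mapped to
        #            the same contrastive embedding space
        scale, shift = self.to_scale_shift(noise_emb).chunk(2, dim=-1)
        return features * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```

Because textual and visual descriptions of the acoustic scene land in the same embedding space, this conditioning pathway could in principle remain unchanged whichever modality is available at test time.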
References
[1] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," in International Conference on Learning Representations (ICLR), 2021.
[2] B. Nortier, M. Sadeghi, and R. Serizel, "Unsupervised speech enhancement with diffusion-based generative models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[3] J.-E. Ayilo, M. Sadeghi, R. Serizel, and X. Alameda-Pineda, "Diffusion-based unsupervised audio-visual speech enhancement," HAL preprint hal-04718254, 2024.
[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning (ICML), pp. 8748-8763, PMLR, 2021.
[5] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, "CLAP: Learning audio concepts from natural language supervision," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, 2023.
[6] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu, and W. Wang, "Separate anything you describe," arXiv preprint arXiv:2308.05037, 2023.
[7] Y. Yuan, H. Liu, X. Liu, X. Kang, P. Wu, M. D. Plumbley, and W. Wang, "Text-driven foley sound generation with latent diffusion model," arXiv preprint arXiv:2306.10359, 2023.
[8] A. Guzhov, F. Raue, J. Hees, and A. Dengel, "AudioCLIP: Extending CLIP to image, text and audio," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[9] T. Mahmud and D. Marculescu, "AVE-CLIP: AudioCLIP-based multi-window temporal transformer for audio visual event localization," in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023.
Skills
Preferred qualifications include a strong foundation in statistical (speech) signal processing and computer vision, expertise in machine learning, and proficiency with deep learning frameworks, particularly PyTorch.
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Remuneration
€4.35 per hour
General information
- Theme/Domain: Language, speech and audio; Scientific computing (BAP E)
- City: Villers-lès-Nancy
- Inria centre: Centre Inria de l'Université de Lorraine
- Desired start date: 2025-04-01
- Contract duration: 6 months
- Application deadline: 2024-12-15
Please note: applications must be submitted online via the Inria website. The processing of applications sent through other channels is not guaranteed.
Instructions for applying
Defence and security:
This position may be assigned to a restricted-access area (zone à régime restrictif, ZRR), as defined in Decree No. 2011-1425 on the protection of the nation's scientific and technical potential (PPST). Authorisation to access such an area is granted by the head of the institution, following a favourable ministerial opinion, as defined in the order of 3 July 2012 relating to the PPST. An unfavourable ministerial opinion for a position located in a ZRR would result in the cancellation of the recruitment.
Recruitment policy:
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Équipe Inria : MULTISPEECH
- Recruiter: Mostafa Sadeghi / mostafa.sadeghi@inria.fr
The keys to success
Prospective applicants are invited to submit their academic transcripts, a detailed curriculum vitae (CV), and, optionally, a cover letter highlighting their motivation and interest in this specific project.
About Inria
Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on a wide range of talent across more than forty professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with numerous companies and has supported the creation of more than 200 start-ups. In this way, the institute strives to meet the challenges of the digital transformation of science, society, and the economy.