Internship - Comparative analysis of diffusion models and variational autoencoders as data-driven priors for speech enhancement
Contract type: Internship agreement
Required degree level: Bac + 4 (four years of higher education) or equivalent
Position: Research intern
Context and advantages of the position
This master's internship is part of the REAVISE project, “Robust and Efficient Deep Learning based Audiovisual Speech Enhancement” (2023-2026), funded by the French National Research Agency (ANR). The general objective of REAVISE is to develop a unified, robust, and efficient audio-visual speech enhancement (AVSE) framework that leverages recent methodological breakthroughs in statistical signal processing, machine learning, and deep neural networks.
Assignment
Generative models have become a fundamental tool for solving inverse problems in an unsupervised way [1, 2]. This approach relies on the ability of generative models to learn the inherent characteristics of the target clean data. For speech enhancement specifically, a generative model trained on clean speech acts as a data-driven speech prior, enabling the estimation of high-quality speech from noisy recordings without requiring paired clean and noisy data [2, 3, 4]. This is particularly advantageous because it removes the dependency on large paired datasets, which supervised methods require [5] and which are often difficult and costly to obtain. Moreover, because training uses only clean speech, these models can potentially generalize better to noisy environments they have never encountered, which broadens their applicability to real-world scenarios where noise conditions are unpredictable.
Main activities
Variational autoencoders (VAEs) [3, 4] and diffusion models [2] are at the forefront of research on generative models for unsupervised speech enhancement. However, the field lacks a systematic comparison that evaluates these models side by side under standardized conditions. This project aims to bridge this gap through carefully designed experiments that compare the effectiveness of VAEs and diffusion models in speech enhancement tasks. Each model will be implemented using similar network architectures, so that any difference in performance can be attributed to the modeling approach itself and not to disparities in model complexity or configuration. The objective is not only to quantify their enhancement performance but also to understand their operational differences, their resilience to various noise types, and their computational efficiency. The insights gained from this analysis will guide future developments in speech processing, helping to select and configure models for specific enhancement needs.
More precisely, the objectives of this project are outlined below:
- Implement both variational autoencoders and diffusion models using similar architectures to ensure comparability. Conduct detailed performance evaluations focusing on speech quality, intelligibility, noise reduction, and model efficiency under various noise conditions.
- Analyze the strengths and limitations of each model in handling diverse environmental noises and document their operational differences to determine their suitability for different speech enhancement scenarios.
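As an illustration of the kind of evaluation described in the objectives above, here is a minimal sketch in Python (NumPy only) that mixes a clean signal with noise at a chosen signal-to-noise ratio and scores the result with the scale-invariant signal-to-distortion ratio (SI-SDR). The function names `mix_at_snr` and `si_sdr`, the synthetic sine-wave "speech", and the chosen SNR levels are assumptions made for this example; the actual project may rely on other metrics, datasets, and tools.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR (dB)."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (dB) of estimate w.r.t. reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))


# Synthetic stand-ins (assumption): a 220 Hz tone for speech, white noise for the
# environment. A real study would use recorded speech and noise corpora.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 220 * t)
noise = rng.standard_normal(sample_rate)

# Score the unprocessed noisy input at several SNRs; an enhancement model's
# output would be scored the same way and compared against this baseline.
results = {snr: si_sdr(mix_at_snr(clean, noise, snr), clean) for snr in (-5, 0, 5, 10)}
for snr, score in results.items():
    print(f"input SNR {snr:>3} dB -> SI-SDR {score:.2f} dB")
```

Running both a VAE-based and a diffusion-based method on the same mixtures and reporting the improvement over this unprocessed baseline is one simple way to keep the comparison under identical conditions.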
References
[1] G. Daras, H. Chung, C.-H. Lai, Y. Mitsufuji, J. C. Ye, P. Milanfar, A. G. Dimakis, and M. Delbracio, "A survey on diffusion models for inverse problems," arXiv preprint arXiv:2410.00083, 2024.
[2] B. Nortier, M. Sadeghi, and R. Serizel, "Unsupervised speech enhancement with diffusion-based generative models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[3] X. Bie, S. Leglaive, X. Alameda-Pineda, and L. Girin, "Unsupervised speech enhancement using dynamical variational autoencoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2993-3007, 2022.
[4] M. Sadeghi and R. Serizel, "Posterior sampling algorithms for unsupervised speech enhancement with recurrent variational autoencoder," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[5] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351-2364, 2023.
Skills
Preferred qualifications include a strong foundation in statistical (speech) signal processing and computer vision, as well as expertise in machine learning and proficiency with deep learning frameworks, particularly PyTorch.
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Remuneration
€ 4.35/hour
General information
- Theme/Domain: Language, speech and audio; Scientific computing (BAP E)
- City: Villers lès Nancy
- Inria centre: Centre Inria de l'Université de Lorraine
- Desired start date: 2025-04-01
- Contract duration: 6 months
- Application deadline: 2024-12-15
Warning: Applications must be submitted online on the Inria website. Processing of applications sent through other channels is not guaranteed.
Application instructions
Defence security:
This position is likely to be located in a restricted-access zone (zone à régime restrictif, ZRR), as defined in Decree No. 2011-1425 on the protection of the nation's scientific and technical potential (PPST). Authorisation to access such a zone is granted by the head of the institution, following a favourable ministerial opinion, as defined in the order of 3 July 2012 relating to the PPST. An unfavourable ministerial opinion for a position located in a ZRR would result in the cancellation of the recruitment.
Recruitment policy:
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Inria team: MULTISPEECH
- Recruiter: Sadeghi Mostafa / mostafa.sadeghi@inria.fr
Keys to success
Prospective applicants are invited to submit their academic transcripts, a detailed curriculum vitae (CV), and, if they choose, a cover letter. The cover letter should highlight the reasons for their enthusiasm and interest in this specific project.
About Inria
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on many talents in more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society and the economy.