PhD Position (F/M): Control, Motion Fidelity, and Computational Efficiency in Long-Form Audio-Visual Video Generation
Contract type: Fixed-term contract (CDD)
Required degree level: Master's degree or equivalent (Bac + 5)
Position: PhD Student
About the research centre or functional department
The Centre Inria de l'Université Grenoble Alpes brings together almost 600 people in 22 research teams and 7 research support departments.
Staff are based on three campuses in Grenoble, in close collaboration with other research and higher education institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, …), as well as with key economic players in the area.
The Centre Inria de l'Université Grenoble Alpes is active in the fields of high-performance computing, verification and embedded systems, modeling of the environment at multiple levels, and data science and artificial intelligence. The centre is a top-level scientific institute with an extensive network of international collaborations in Europe and the rest of the world.
Context and assets of the position
Title: Control, Motion Fidelity, and Computational Efficiency in Long-Form Audio-Visual Video Generation
Supervision: Dr Stéphane Lathuilière (INRIA-UGA)
Funding: BPI contract
Background and Motivation
Recent advances in generative AI have dramatically expanded the ability to synthesize and manipulate video content. Large-scale diffusion transformers and autoregressive video models — such as Sora — now exhibit impressive capabilities in generating high-resolution, multi-second clips from textual prompts. These systems increasingly support multimodal conditioning (text, images, audio), showing early signs of temporally consistent storytelling and complex scene dynamics.
Despite this progress, several fundamental challenges remain unsolved. First, audio-visual controllability remains limited: while models can loosely synchronize audio and video, they struggle with precise alignment of speech, actions, and environmental events. Second, current systems lack fine-grained motion control, making it difficult to specify nuanced trajectories, subtle character actions, or physically plausible object interactions. Third, the generation of long-duration videos (over tens of seconds or minutes) introduces severe problems of temporal drift, memory accumulation, semantic inconsistency, and scene fragmentation. Finally, the computational demands of high-resolution, long-context generative models pose serious barriers to both training and deployment. Scaling video models in space-time while maintaining quality is currently prohibitively expensive and technically challenging.
This PhD will investigate the foundations of controllability, motion fidelity, temporal consistency, and computational efficiency in audio-visual video generation. It will develop new frameworks and methodologies that allow generative models to produce globally coherent, fine-controlled, long-range audio-visual sequences, while significantly reducing computational overhead. These contributions aim to advance the scientific understanding of generative video modeling and address core barriers impeding real-world applications in film production, simulation, robotics, and AR/VR systems.
Assigned mission
Research objectives
Main activities
Methodology
The overarching aim is to develop a principled framework for controllable, motion-faithful, and computationally efficient long-form video generation. This project will proceed in several stages:
(A) Audio–Visual Controllability
The first component investigates how existing diffusion-based and autoregressive video models integrate audio and visual signals. The study will characterize failure modes in audio–video alignment, such as lip-sync drift, desynchronized actions, or mismatched environmental cues. The following aspects will be evaluated:
- temporal alignment measures,
- cross-modal coherence scores,
- perceptual consistency metrics sensitive to audiovisual synchrony.
The research will explore improved conditioning mechanisms, including hierarchical audio encodings, cross-modal attention stabilization, and temporally aware guidance, to achieve fine-grained and persistent audio-video alignment.
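To make the alignment criterion concrete, here is a minimal sketch (not part of the project description; the function name and hand-crafted features are illustrative) of estimating audio-video lag by cross-correlating a per-frame audio energy signal with a per-frame visual motion signal:

```python
import numpy as np

def estimate_av_lag(audio_feat: np.ndarray, visual_feat: np.ndarray,
                    max_lag: int = 10) -> int:
    """Estimate the temporal offset (in frames) between two equal-length
    1-D feature streams, e.g. per-frame audio energy vs. visual motion
    magnitude. A positive result means the visual stream lags the audio."""
    a = (audio_feat - audio_feat.mean()) / (audio_feat.std() + 1e-8)
    v = (visual_feat - visual_feat.mean()) / (visual_feat.std() + 1e-8)

    def corr(lag: int) -> float:
        # Overlap the two streams at the given shift and average the product.
        if lag >= 0:
            x, y = a[:len(a) - lag], v[lag:]
        else:
            x, y = a[-lag:], v[:len(v) + lag]
        return float(np.dot(x, y)) / len(x)

    # Pick the shift with the highest normalized cross-correlation.
    return max(range(-max_lag, max_lag + 1), key=corr)
```

A measure of this kind could serve as a baseline temporal alignment score; in practice, learned embeddings from an audio-visual synchronization network would replace the hand-crafted features.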
(B) Fine-Grained Motion Control
This stage focuses on motion fidelity and motion controllability. Current models tend to produce coarse or oversmoothed dynamics, with limited adherence to specified trajectories or subtle gestures. The work will analyze:
- how motion representations are learned internally,
- how attention drift influences the degradation of fine-scale dynamics,
- where motion prediction failures propagate over time.
To address these issues, the thesis will explore several strategies:
- motion-conditioned latent representations,
- differential motion-field guidance,
- keyframe-to-in-between propagation mechanisms,
- physics-informed or kinematic consistency losses.
These techniques aim to allow creators to specify detailed motion constraints while preserving global visual realism.
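As an illustration of what a kinematic consistency loss might look like (a toy sketch, not the project's actual formulation), the following penalizes mismatches in position, velocity, and acceleration between predicted and target keypoint trajectories:

```python
import numpy as np

def kinematic_consistency_loss(pred: np.ndarray, target: np.ndarray,
                               w_vel: float = 1.0, w_acc: float = 0.5) -> float:
    """Toy kinematic loss for trajectories of shape (T, K, 2):
    T frames, K keypoints, xy coordinates. Beyond per-frame position
    error, it matches first differences (velocities) and second
    differences (accelerations), discouraging oversmoothed dynamics."""
    pos = np.mean((pred - target) ** 2)
    vel = np.mean((np.diff(pred, n=1, axis=0) - np.diff(target, n=1, axis=0)) ** 2)
    acc = np.mean((np.diff(pred, n=2, axis=0) - np.diff(target, n=2, axis=0)) ** 2)
    return float(pos + w_vel * vel + w_acc * acc)
```

In a real model, the same terms would be expressed as differentiable tensor operations (e.g. in PyTorch) over decoded motion fields rather than NumPy arrays of keypoints.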
(C) Long-Video Generation and Temporal Consistency
A central challenge is generating videos far beyond the typical 2–10 second window. Long sequences expose weaknesses in memory, context propagation, and semantic stability.
This research will develop new methodologies for:
- maintaining global scene coherence across hundreds or thousands of frames,
- preventing temporal drift and identity switching,
- supporting long-range narrative structure with sustained physical and stylistic consistency.
Investigated approaches may include:
- memory-augmented diffusion processes,
- hierarchical temporal decomposition,
- recurrent generative modules,
- compressed global-context representations.
Evaluation metrics for long-range temporal fidelity will also be proposed, focusing on continuity, identity preservation, and stability across extended horizons.
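A simple instance of such a continuity metric (illustrative only; a real evaluation would use embeddings from a pretrained identity or content encoder) is the mean cosine similarity between consecutive frame embeddings:

```python
import numpy as np

def temporal_continuity(frame_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between consecutive rows of a (T, D)
    embedding matrix. Values near 1 indicate stable content/identity;
    drops flag cuts, drift, or identity switches."""
    norms = np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    e = frame_embeddings / (norms + 1e-8)
    return float(np.mean(np.sum(e[:-1] * e[1:], axis=1)))
```

Tracking the minimum rather than the mean of the per-transition similarities would additionally localize the worst discontinuity in a long sequence.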
(D) Computational and Scaling Challenges
Finally, the thesis will address computational aspects limiting current generative video models. Long-context video generation scales quadratically or worse with space-time resolution, creating extreme GPU memory and inference-time demands.
This work will explore:
- sparse and low-rank attention mechanisms for spatiotemporal data,
- mixed-resolution diffusion schedules,
- temporal chunking with cross-segment consistency constraints,
- model-parallel and pipeline-efficient variants of video diffusion transformers.
A particular emphasis will be placed on understanding the trade-off between computational savings and degradation in motion fidelity or temporal coherence, and on designing architectures that optimize both.
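The temporal-chunking idea above can be sketched as follows (a minimal illustration; the function and its parameters are hypothetical): split the frame range into overlapping windows so that each generation step attends only to a bounded context, with the overlap frames carried over as conditioning.

```python
def temporal_chunks(num_frames: int, chunk: int, overlap: int) -> list:
    """Split [0, num_frames) into windows of length <= chunk, where each
    window after the first reuses `overlap` frames of its predecessor as
    conditioning context. Per-step attention cost is then O(chunk**2)
    rather than O(num_frames**2)."""
    assert 0 <= overlap < chunk
    step = chunk - overlap
    starts = range(0, max(num_frames - overlap, 1), step)
    return [(s, min(s + chunk, num_frames)) for s in starts]

# For example, 10 frames with chunks of 4 and an overlap of 2:
# temporal_chunks(10, 4, 2) -> [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Cross-segment consistency constraints, such as penalizing disagreement on the shared frames, would then be applied on the overlap regions.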
Together, these contributions will advance the state of controllable, high-fidelity generative video models and support safer, more accessible, and more reliable deployment in real-world settings.
Skills
Technical skills and required level: We are seeking a motivated PhD candidate with a strong background in one or more of the following areas:
- speech processing, computer vision, machine learning,
- solid programming skills,
- interest in connecting AI with human cognition.
Prior experience with LLMs, SpeechLMs, RL algorithms, or robotic platforms is a plus, but not mandatory.
Languages: English
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
General information
- Theme/Domain:
Vision, perception and multimedia interpretation
Statistics (Big data) (BAP E)
- Town: Montbonnot
- Inria centre: Centre Inria de l'Université Grenoble Alpes
- Desired start date: 2026-01-01
- Contract duration: 3 years
- Application deadline: 2025-12-27
Warning: applications must be submitted online via the Inria website. Processing of applications sent through any other channel is not guaranteed.
Instructions for applying
Defence and security:
This position may be assigned to a restricted-access area (zone à régime restrictif, ZRR), as defined in Decree No. 2011-1425 on the protection of the nation's scientific and technical potential (PPST). Authorization to access such an area is granted by the head of the institution, following a favourable ministerial opinion, as defined in the order of 3 July 2012 on the PPST. An unfavourable ministerial opinion for a position assigned to a ZRR would result in the cancellation of the recruitment.
Recruitment policy:
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Inria team: ROBOTLEARN
- PhD supervisor:
Stéphane Lathuilière / stephane.lathuiliere@inria.fr
About Inria
Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of the digital world, often at the interface with other disciplines. The institute draws on a wide range of talent across more than forty professions. 900 research and innovation support staff help scientific and entrepreneurial projects emerge and grow, projects that have an impact on the world. Inria works with many companies and has supported the creation of more than 200 start-ups. In this way, the institute strives to meet the challenges of the digital transformation of science, society and the economy.