PhD Position (F/M): Control, Motion Fidelity, and Computational Efficiency in Long-Form Audio-Visual Video Generation

Contract type: Fixed-term contract (CDD)

Required degree: Master's degree (Bac+5) or equivalent

Role: PhD student

About the centre or functional department

The Centre Inria de l’Université de Grenoble groups together almost 600 people in 22 research teams and 7 research support departments.

Staff are based on three campuses in Grenoble, working in close collaboration with other research and higher education institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, …), as well as with key economic players in the area.

The Centre Inria de l’Université Grenoble Alpes is active in the fields of high-performance computing, verification and embedded systems, modeling of the environment at multiple levels, and data science and artificial intelligence. The center is a top-level scientific institute with an extensive network of international collaborations in Europe and the rest of the world.

Context and assets of the position

Title: Control, Motion Fidelity, and Computational Efficiency in Long-Form Audio-Visual Video Generation

Supervision: Dr Stéphane Lathuilière (Inria-UGA)

Funding: BPI contract

Background and Motivation

Recent advances in generative AI have dramatically expanded the ability to synthesize and manipulate video content. Large-scale diffusion transformers and autoregressive video models — such as Sora — now exhibit impressive capabilities in generating high-resolution, multi-second clips from textual prompts. These systems increasingly support multimodal conditioning (text, images, audio), showing early signs of temporally consistent storytelling and complex scene dynamics.

Despite this progress, several fundamental challenges remain unsolved. First, audio-visual controllability remains limited: while models can loosely synchronize audio and video, they struggle with precise alignment of speech, actions, and environmental events. Second, current systems lack fine-grained motion control, making it difficult to specify nuanced trajectories, subtle character actions, or physically plausible object interactions. Third, the generation of long-duration videos (over tens of seconds or minutes) introduces severe problems of temporal drift, memory accumulation, semantic inconsistency, and scene fragmentation. Finally, the computational demands of high-resolution, long-context generative models pose serious barriers to both training and deployment. Scaling video models in space-time while maintaining quality is currently prohibitively expensive and technically challenging.

This PhD will investigate the foundations of controllability, motion fidelity, temporal consistency, and computational efficiency in audio-visual video generation. It will develop new frameworks and methodologies that allow generative models to produce globally coherent, fine-controlled, long-range audio-visual sequences, while significantly reducing computational overhead. These contributions aim to advance the scientific understanding of generative video modeling and address core barriers impeding real-world applications in film production, simulation, robotics, and AR/VR systems.



Assignment

Research Objectives:

The objective of this PhD is to develop generative video models that achieve precise audio–visual synchronization, fine-grained motion control, and robust long-term temporal coherence. The research aims to extend current systems to produce long-duration videos while maintaining consistent identities, scenes, and dynamics. It will create new metrics for evaluating audio–visual alignment, motion fidelity, and long-horizon stability, and systematically analyze where existing diffusion and autoregressive models fail. The project will explore improved conditioning methods and novel architectures that support more reliable cross-modal control. A key goal is to address the significant computational challenges posed by high-resolution, long-context video generation, designing methods that reduce cost without degrading quality. Ultimately, the work seeks to deliver tools, benchmarks, and techniques enabling controllable, efficient, and coherent audio–visual generative video systems.

 
 

Main activities

Methodology

The overarching aim is to develop a principled framework for controllable, motion-faithful, and computationally efficient long-form video generation. This project will proceed in several stages:

(A) Audio–Visual Controllability

The first component investigates how existing diffusion-based and autoregressive video models integrate audio and visual signals. The study will characterize failure modes in audio–video alignment, such as lip-sync drift, desynchronized actions, or mismatched environmental cues. The following aspects will be evaluated:

  • temporal alignment measures,
  • cross-modal coherence scores,
  • perceptual consistency metrics sensitive to audiovisual synchrony.
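None of these measures are standardized yet. As a concrete illustration, the temporal-alignment idea can be sketched as a normalized cross-correlation between a per-frame audio energy envelope and a per-frame visual motion-energy envelope, with the peak lag indicating the audio-video offset. This is a minimal sketch: the envelope extraction is assumed to happen upstream, and `av_lag` is an illustrative name, not an existing API.

```python
import numpy as np

def av_lag(audio_energy, motion_energy, max_lag=10):
    """Estimate the audio-video offset (in frames) as the lag that maximizes
    the normalized cross-correlation of two 1-D per-frame energy envelopes.
    A negative lag means the motion envelope trails the audio envelope."""
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-8)
    m = (motion_energy - motion_energy.mean()) / (motion_energy.std() + 1e-8)
    scores = []
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            s = float(np.mean(a[lag:] * m[:len(m) - lag]))
        else:
            s = float(np.mean(a[:lag] * m[-lag:]))
        scores.append(s)
    best = int(np.argmax(scores)) - max_lag
    return best, max(scores)
```

A perfectly synchronized pair peaks at lag 0 with a score near 1; lip-sync drift shows up as a peak at a nonzero lag or a flattened correlation profile.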

The research will explore improved conditioning mechanisms — including hierarchical audio encodings, cross-modal attention stabilization, and temporally-aware guidance — to achieve fine-grained and persistent audio-video alignment.
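As a minimal, framework-agnostic illustration of the cross-modal attention ingredient, the following NumPy sketch shows a single head with no learned projections; a real model would add learned query/key/value maps, multiple heads, and temporal positional encodings:

```python
import numpy as np

def cross_modal_attention(video_tokens, audio_tokens):
    """Single-head cross-attention: video tokens act as queries over audio
    keys/values, so each frame token pools temporally relevant audio features.
    Shapes: video_tokens (Tv, d), audio_tokens (Ta, d) -> output (Tv, d)."""
    d = video_tokens.shape[-1]
    logits = video_tokens @ audio_tokens.T / np.sqrt(d)   # (Tv, Ta) scores
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ audio_tokens                         # convex mix of audio tokens
```

Stabilizing exactly these attention maps over long horizons (e.g. keeping speech tokens attended to the right frames) is one of the open problems this component targets.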

(B) Fine-Grained Motion Control

This stage focuses on motion fidelity and motion controllability. Current models tend to produce coarse or oversmoothed dynamics, with limited adherence to specified trajectories or subtle gestures. The work will analyze:

  • how motion representations are learned internally,
  • how attention drift influences the degradation of fine-scale dynamics,
  • where motion prediction failures propagate over time.

To address these issues, the thesis will explore several strategies:

  • motion-conditioned latent representations,
  • differential motion-field guidance,
  • keyframe-to-in-between propagation mechanisms,
  • physics-informed or kinematic consistency losses.
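To make the last item concrete, here is a toy kinematic-consistency loss on a (T, K, 2) keypoint trajectory. The formulation is an assumption for illustration: it penalizes deviation from target finite-difference velocities plus an acceleration-smoothness term.

```python
import numpy as np

def kinematic_consistency_loss(traj, target_vel=None, w_acc=0.1):
    """traj: (T, K, 2) keypoint trajectory. Penalizes (i) squared error to
    target per-step velocities, if provided, and (ii) large accelerations,
    encouraging smooth, physically plausible motion."""
    vel = np.diff(traj, axis=0)   # (T-1, K, 2) finite-difference velocity
    acc = np.diff(vel, axis=0)    # (T-2, K, 2) finite-difference acceleration
    loss = w_acc * float(np.mean(acc ** 2))
    if target_vel is not None:
        loss += float(np.mean((vel - target_vel) ** 2))
    return loss
```

In a real system, such a term would be applied to decoded keypoints or motion fields during training or guidance, letting a creator constrain trajectories without dictating pixels.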

These techniques aim to allow creators to specify detailed motion constraints while preserving global visual realism.

(C) Long-Video Generation and Temporal Consistency

A central challenge is generating videos far beyond the typical 2–10 second window. Long sequences expose weaknesses in memory, context propagation, and semantic stability.

This research will develop new methodologies for:

  • maintaining global scene coherence across hundreds or thousands of frames,
  • preventing temporal drift and identity switching,
  • supporting long-range narrative structure with sustained physical and stylistic consistency.

Investigated approaches may include:

  • memory-augmented diffusion processes,
  • hierarchical temporal decomposition,
  • recurrent generative modules,
  • compressed global-context representations.
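A toy sketch of how two of these ideas might fit together — temporal chunking with overlap conditioning, plus a compressed global-context vector kept as an exponential moving average. Every component here is a stand-in (the "denoiser" is a fixed blend, the latent is 8-dimensional), not a real model:

```python
import numpy as np

def generate_long(num_frames, dim=8, chunk=16, overlap=4, ema=0.9, seed=0):
    """Generate num_frames latent frames chunk by chunk. Each chunk is
    conditioned on (i) the mean of the previous chunk's last `overlap`
    frames and (ii) a compressed global context maintained as an
    exponential moving average of chunk means, a cheap proxy for memory."""
    rng = np.random.default_rng(seed)
    frames = []
    context = np.zeros(dim)
    prev_tail = np.zeros((overlap, dim))
    while len(frames) < num_frames:
        noise = rng.normal(size=(chunk, dim))
        # stand-in "denoiser": blend fresh noise with local and global cues
        new = 0.5 * noise + 0.3 * prev_tail.mean(axis=0) + 0.2 * context
        context = ema * context + (1.0 - ema) * new.mean(axis=0)
        prev_tail = new[-overlap:]
        frames.extend(new)
    return np.stack(frames[:num_frames])
```

The research questions live precisely where this sketch is naive: what to store in `context`, how to propagate it without drift, and how to keep chunk boundaries invisible.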

Evaluation metrics for long-range temporal fidelity will also be proposed, focusing on continuity, identity preservation, and stability across extended horizons.
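For instance, identity preservation can be quantified, under the assumption that a pretrained identity encoder supplies per-frame embeddings, as the mean cosine similarity of each frame's embedding to the first frame:

```python
import numpy as np

def identity_stability(embs):
    """embs: (T, d) per-frame identity embeddings. Returns the mean cosine
    similarity to frame 0; values near 1.0 indicate a stable identity,
    lower values indicate drift or identity switching."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return float(np.mean(normed @ normed[0]))
```

Plotting this quantity over time localizes where in a long sequence an identity switch occurs, which is more informative than a single aggregate score.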

(D) Computational and Scaling Challenges

Finally, the thesis will address computational aspects limiting current generative video models. Long-context video generation scales quadratically or worse with space-time resolution, creating extreme GPU memory and inference-time demands.

This work will explore:

  • sparse and low-rank attention mechanisms for spatiotemporal data,
  • mixed-resolution diffusion schedules,
  • temporal chunking with cross-segment consistency constraints,
  • model-parallel and pipeline-efficient variants of video diffusion transformers.
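The quadratic-scaling point above can be made concrete with a back-of-the-envelope cost model comparing dense spatiotemporal self-attention against a temporally windowed variant. All sizes below are illustrative, and the constant factors are rough (2 multiply-adds per score/value pair):

```python
def dense_attention_cost(frames, tokens_per_frame, dim):
    """Dense self-attention over all space-time tokens: O(N^2 * d) multiply-adds."""
    n = frames * tokens_per_frame
    return 2 * n * n * dim

def windowed_attention_cost(frames, tokens_per_frame, dim, window):
    """Each frame's tokens attend only within a `window`-frame temporal span,
    cutting the quadratic term by roughly frames / window."""
    n_local = window * tokens_per_frame
    return frames * 2 * tokens_per_frame * n_local * dim
```

For 64 frames of 256 tokens each with d = 64, a 4-frame window reduces the attention cost by a factor of 16 (= 64 / 4); the open question this axis studies is how much motion fidelity and long-range coherence survive that restriction.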

A particular emphasis will be placed on understanding the trade-off between computational savings and degradation in motion fidelity or temporal coherence — and on designing architectures that optimize both.

Together, these contributions will advance the state of controllable, high-fidelity generative video models and support safer, more accessible, and more reliable deployment in real-world settings.

 

Skills

Technical skills and required level: We are seeking a motivated PhD candidate with a strong background in one or more of the following areas:

  • speech processing, computer vision, or machine learning,
  • solid programming skills,
  • an interest in connecting AI with human cognition.

Prior experience with LLMs, SpeechLMs, RL algorithms, or robotic platforms is a plus, but not mandatory.

Languages: English

 

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage