2019-01579 - PhD Position F/M Multimodal interaction data generation using deep adversarial learning
The job description below is in English.

Contract type: Fixed-term public service contract (CDD)

Required level of education: Master's degree or equivalent (Bac + 5)

Position: PhD student

About the research center or functional department

The Grenoble Rhône-Alpes Research Center brings together just under 800 people in 35 research teams and 9 research support departments.

Staff are located on five campuses in Grenoble and Lyon, working in close collaboration with laboratories and research and higher-education institutions in Grenoble and Lyon, as well as with the economic players in these areas.

Present in the fields of software, high-performance computing, Internet of Things, image and data, as well as simulation in oceanography and biology, the center contributes at the highest level to international scientific achievements and collaborations, both in Europe and in the rest of the world.

Context and assets of the position

The PhD student will be co-supervised by Dr. Dominique Vaufreydaz (Pervasive team, supervisor, https://research.vaufreydaz.org/) and by Dr. Xavier Alameda-Pineda (Perception team, co-supervisor, https://xavirema.eu/). This thesis will benefit from the background of the two teams in multimodal perception and interaction.

The funding of the thesis is already secured through the IRS MIDGen Project (University Grenoble Alpes). The PhD student will be hired by the University and will work at Inria.

Assigned mission

The funded project MIDGen (Multimodal Interaction Data Generation) addresses research within the so-called "Ambient Assisted Living" (AAL) field, more precisely assistance to elderly or frail people. This topic is of interest as a societal challenge in the near future. Current research on assistive systems for the elderly ranges from smartphone helper applications to companion robots at home. The main challenge in building such companion robots is to provide the social competence to perceive, reason about and express the social and emotional aspects of interactions with humans, also known as "social presence". To fulfill the prerequisites of this social presence, multimodal perception algorithms must accurately perceive the social signals emitted by humans. Many techniques have emerged in the state of the art in recent years, taking advantage of the progress of Deep Learning. Some are remarkably efficient on tasks such as detecting people in images or detecting and recognizing speech. The MIDGen project focuses on the fact that current Deep Learning perception algorithms need a huge amount of training data, which is not available in sufficient quantity for our research on multimodal interactions with elderly people.

In the context of the project, the collection and annotation of large-scale data sets poses ethical and privacy concerns and requires enormous resources. We plan to overcome this issue by learning data generators (e.g. generative adversarial networks and variational auto-encoders) from already available and less sensitive corpora (e.g. political debates). These generators will be able to synthesize large-scale data sets that are automatically annotated by the generation process. The limited progress reported in the literature shows that it is possible to generate facial cues or full-body poses of a single individual, but little is known about how to generate audio-visual data describing the interaction between two or more people (speech turns, conversational gestures, people's gaze, etc.).
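To illustrate the adversarial-learning idea behind such data generators, the following is a deliberately minimal one-dimensional sketch using only NumPy: a linear generator learns to match a Gaussian "real data" distribution against a logistic discriminator. The Gaussian target, the linear models and all hyperparameters are illustrative stand-ins chosen for this sketch, not part of the MIDGen project; real generators for interaction data would be deep networks over audio-visual features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# "Real" data stands in for an available corpus: samples from N(4, 1).
# Generator G(z) = a*z + b, discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0        # generator parameters
w, c = 0.0, 0.0        # discriminator parameters
lr, batch = 0.02, 64

for step in range(3000):
    # Discriminator ascent on log D(real) + log(1 - D(fake)),
    # with gradients written out by hand for the logistic model.
    real = rng.normal(4.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b
    p_real = sigmoid(w * real + c)
    p_fake = sigmoid(w * fake + c)
    w += lr * np.mean((1 - p_real) * real - p_fake * fake)
    c += lr * np.mean((1 - p_real) - p_fake)

    # Generator ascent on log D(fake) (the non-saturating GAN loss).
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b
    p_fake = sigmoid(w * fake + c)
    grad_x = (1 - p_fake) * w          # d log D(x) / dx at the fakes
    a += lr * np.mean(grad_x * z)
    b += lr * np.mean(grad_x)

samples = a * rng.normal(0.0, 1.0, 1000) + b
print(float(np.mean(samples)))   # generated mean, to compare with the real mean of 4
```

The same adversarial game scales up when G and D become deep networks and the samples become annotated multimodal sequences; the annotations come for free because the generator's inputs fully determine what it produces.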

Main activities

The MIDGen proposal is structured into workpackages:

  • controllable and domain-adaptive mono-modal data generators
  • their multi-modal counterpart
  • evaluation

The main challenge of the first two workpackages is to develop generators that are controllable and domain-adaptive. We decided to first address the controllability and adaptation problems of several mono-modal data generators (workpackage 1) before investigating multimodality (workpackage 2). The evaluation (workpackage 3) will be carried out throughout the project, based on our expertise in multimodal scene analysis and human-robot interaction. An intrinsic assessment of the quality of the generated multimodal data will be conducted to determine whether the generated data is realistic enough for low-level features (people's positions, speaking status, …) to be detected. Using the Amiqual4Home (https://amiqual4home.inria.fr/) and/or Domus experimental platforms (http://multicom.imag.fr/multicom.imag.fr/spipc5a9.html?article114), the final evaluation steps will validate our generative approach on real, live interactions with robots in ecological situations.

The PhD candidate will start with the generation of mono-modal interaction data (Workpackage 1) during the whole first year. This will be followed by Workpackage 2, on multimodal data, until the 30th month of the thesis. In parallel, evaluations will be conducted to refine and validate the results (Workpackage 3). The work will be disseminated through publications in top conferences (ICMI, ACM MM, ECCV, CVPR, IROS, ICRA).



Skills

The candidate must hold a Master's degree in Computer Science or Applied Mathematics, with a background in signal processing and machine learning. Knowledge of Deep Learning is appreciated. Good programming skills (C++, Python) are also required.


Benefits

  • Partial reimbursement of public transport costs
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage


Remuneration

1768.55 € per month before taxes