2021-03328 - PhD Position F/M Supporting Online Learning and Inference in Parallel across the Digital Continuum

Contract type : Fixed-term contract

Level of qualifications required : Graduate degree or equivalent

Function : PhD Position

Level of experience : Recently graduated

About the research centre or Inria department

The Inria Rennes - Bretagne Atlantique Centre is one of Inria's eight centres and has more than thirty research teams. The Inria Centre is a major and recognized player in the field of digital sciences. It is at the heart of a rich R&D and innovation ecosystem: highly innovative SMEs, large industrial groups, competitiveness clusters, research and higher education players, laboratories of excellence, technological research institute, etc.


Financing Project

This PhD will be carried out in the context of the ENGAGE DFKI-Inria project (2021-2024), which focuses on building next-generation computing environments for artificial intelligence.

More on Inria: https://inria.fr/en/inria-ecosystem 

More on DFKI: https://www.dfki.de/en/web/ 




As Artificial Intelligence has recently gained unprecedented momentum in a rapidly increasing number of application areas, Deep Neural Networks (DNNs) are becoming a pervasive tool across a large range of domains, including autonomous vehicles, industrial automation, and pharmaceutical research, to name just a few.

As these neural network architectures and their training data grow more complex, so do the infrastructures needed to execute them sufficiently fast. Hyperparameter setting and tuning, training, inference, and dataset handling all put growing pressure on the underlying compute infrastructure and call for novel approaches at all levels of the workflow, including the algorithmic level, the middleware and deployment level, and the resource optimization level.

This thesis is proposed as part of a collaborative project established between Inria and DFKI (the German Research Center for Artificial Intelligence). The goal of this project is to leverage efficient collaboration of experts in the AI and HPC areas to address the following specific research questions:

  1. How can we deal with situations where training or validation data is not available in sufficient quantity or quality? This is the case (1) if the generation of data is expensive because of sample creation or measurement costs, (2) if manual data annotation would require an infeasible effort, (3) if the naturally occurring data distribution is unfavorable, i.e. highly relevant situations occur only rarely, or (4) if a phenomenon has been predicted in theory but not yet observed. Our answer here will include the concepts of parametric models and simulations (also known as Digital Reality).
  2. How can the various possible deployment options of complex AI workflows on the available underlying infrastructure impact performance metrics? How can this infrastructure be best leveraged in practice, potentially through seamless integration of supercomputers, clouds, and fog/edge systems?

This thesis focuses on the second question.

Thesis proposal

The thesis will focus on the middleware and deployment level. Our objective is to investigate various deployment strategies for complex AI workflows (e.g., potentially combining online training, simulations and inference, all in parallel and in real time) on hybrid execution infrastructures (e.g., combining supercomputers and cloud/fog/edge systems). This requires scalable and reliable experimentation tools. To this end, an important objective is to propose methodologies and supporting tools enabling researchers to:

  • describe in a representative way the application behavior;
  • reproduce it in a reliable, controlled environment for extensive experiments, and
  • understand how the end-to-end performance of applications is correlated to various algorithm-dependent or infrastructure-dependent factors.  

Main activities

The main expected outcomes are: (1) publications describing an experimental, reproducibility-oriented methodology and its validation in practice through the novel insights it can enable, potentially leading to novel algorithms for parallel/continual learning and inference across the computing continuum; (2) associated underlying algorithms; and (3) an adequate software framework for experiment deployment, monitoring, and execution at scale on various relevant scalable infrastructures (e.g., on experimental platforms such as Grid’5000 in a first stage and on hybrid infrastructures including pre-exascale HPC platforms in a second stage).

To address these challenges, the thesis will leverage the E2Clab approach [E2Clab2020, Ros2020], initiated in the KerData team at Inria, to meet the experimentation needs of workloads involving online parallel learning and inference. In addition to exploring potential parallelization strategies for learning and inference tasks, our goal is to enable reproducible experimentation with complex AI workloads across hybrid infrastructures and to help optimize deployment strategies depending on multiple factors, including the application characteristics, the target performance metrics and the features of the available execution hardware. The goal is to answer questions like: How can the various possible deployment options of complex AI workflows on the available underlying infrastructure impact performance metrics? How can this infrastructure be best leveraged in practice, potentially through seamless integration of supercomputers, clouds, and fog/edge systems?

International visibility and mobility

The thesis will be conducted in strong collaboration with several partners including DFKI (contact: René Schubotz), where a paired PhD position will be provided, and Argonne National Lab, USA (contact: Bogdan Nicolae). The thesis may include long research stays (1-3 months) with the partners’ teams, for joint collaborative work.

How to apply?

In parallel with the online submission on the Inria web site, please send an email with a cover letter, CV, contact details of at least two references (internship, teacher in a related field, …) and copies of degree certificates to Dr. Gabriel Antoniu (gabriel.antoniu@inria.fr) and Dr. Alexandru Costan (alexandru.costan@inria.fr). Incomplete applications will be neither considered nor answered.


[Ros2020] Daniel Rosendo, Pedro Silva, et al. (2020) E2Clab: Exploring the Computing Continuum through Repeatable, Replicable and Reproducible Edge-to-Cloud Experiments. Cluster 2020 - IEEE International Conference on Cluster Computing, Sep 2020, Kobe, Japan.

[E2Clab2020] The E2Clab project: https://team.inria.fr/kerdata/e2clab/.

[G5K] The Grid’5000 experimental testbed: https://www.grid5000.fr/w/Grid5000:Home.

[Hoi2018] Steven C.H. Hoi, Doyen Sahoo, Jing Lu, Peilin Zhao. Online Learning: A Comprehensive Survey. 2018. https://arxiv.org/abs/1802.02871

[Sahoo2017] Doyen Sahoo, Quang Pham, Jing Lu, Steven C.H. Hoi. Online Deep Learning: Learning Deep Neural Networks on the Fly. 2017. https://arxiv.org/abs/1711.03705


  • Advanced knowledge of computer networks, machine learning and distributed systems
  • Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
  • Strong programming skills (e.g. C/C++, Java, Python)
  • Working experience in the areas of machine learning, HPC, distributed systems is an advantage

Benefits package

  • Subsidised catering service
  • Partially-reimbursed public transport


Monthly gross salary amounting to 1982 euros for the first and second years and 2085 euros for the third year.