2018-00755 - Ph-D / Data replication and load-balancing for fault-tolerance in a distributed task-based runtime system
The description of the offer below is in English

Contract type: Public service fixed-term contract

Required degree level: Master's degree (Bac + 5) or equivalent

Position: PhD student

Desired level of experience: Recent graduate

Context and advantages of the position

This PhD position is offered in the context of the European project EXA2PRO.

The STORM Research Team at Inria and LaBRI Laboratory in Bordeaux, France, works on the topic of High Performance Parallel Computing. As emphasized by initiatives such as the European Exascale Software Initiative, the European Technology Platform for High Performance Computing, or the International Exascale Software Initiative, the HPC community needs new programming APIs and languages for expressing heterogeneous massive parallelism in a way that provides an abstraction of the system architecture and promotes high performance and efficiency. In this context, Team STORM designs code optimization techniques for the whole programming tool chain, at the compiler level, at the runtime system level, and at the execution analyser level, with a focus on heterogeneous platforms.

 

Keywords:

High performance computing, parallelism, exascale, fault tolerance, task-based programming runtime system, load balancing

Assigned mission

The EXA2PRO European project brings together research institutes and industrial partners with the aim of enhancing the programmability of future exascale computing systems.

At these scales, hardware failures are very common, leading to the sporadic loss of whole computation nodes and hence to the need for fault tolerance techniques. However, classical fault tolerance techniques such as checkpoint-restart themselves show their limitations at these scales. Task-based programming models, on the other hand, provide rich information about the flow of computations and their relationship to data, which can be used to revamp these techniques with fine-grained, selective replications and restarts driven by the task graph. Moreover, this should allow execution to continue while the lost tasks and data are being recovered in an optimized way from data checkpoints.
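To illustrate how task graph information could drive a selective restart, here is a minimal Python sketch. It is not StarPU code and not the method this PhD is expected to produce. The `Task` record and the `tasks_to_rerun` function are hypothetical names, and the sketch assumes that each piece of data is written by exactly one task, that a task's output lives only on the node that ran it, and that initial data (data with no producer) is always available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    inputs: tuple  # names of the data pieces the task reads
    output: str    # name of the single data piece the task writes (assumption)
    node: int      # node that ran the task and holds its output (assumption)


def tasks_to_rerun(tasks, done, failed_node, checkpointed):
    """Return the names of completed tasks that must be re-executed.

    A piece of data is considered lost if it was produced by a completed
    task on the failed node and has no checkpoint elsewhere.
    """
    producer = {t.output: t for t in tasks}

    def lost(data):
        p = producer.get(data)  # None means initial data, assumed available
        return (p is not None and p.name in done
                and p.node == failed_node and data not in checkpointed)

    # Only data still needed by tasks that have not completed matters.
    pending_inputs = [d for t in tasks if t.name not in done for d in t.inputs]
    rerun = set()
    stack = [d for d in pending_inputs if lost(d)]
    while stack:
        p = producer[stack.pop()]
        if p.name not in rerun:
            rerun.add(p.name)
            # Walk up the graph: the producer's own lost inputs must be rebuilt too.
            stack.extend(d for d in p.inputs if lost(d))
    return rerun


# Hypothetical four-task graph: A -> B -> C -> D, with D also reading A's output.
graph = [
    Task("A", (), "a", 0),
    Task("B", ("a",), "b", 1),
    Task("C", ("b",), "c", 1),
    Task("D", ("a", "c"), "d", 0),
]
print(sorted(tasks_to_rerun(graph, {"A", "B", "C"}, 1, {"b"})))   # ['C']
print(sorted(tasks_to_rerun(graph, {"A", "B", "C"}, 1, set())))   # ['B', 'C']
```

In this toy example, checkpointing the intermediate data `b` halves the number of tasks to re-execute after node 1 is lost, while task A, which ran on a surviving node, is never restarted.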

The unexpected loss of a computation node disrupts the balance of the computation load, both because the node itself is lost and because computation time is spent recovering from the data loss. This calls for a careful dynamic redistribution of the failed node's workload and of the recovery costs over the system, avoiding imbalance while limiting data redistribution.
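As a rough illustration of the redistribution problem, the following Python sketch reassigns the work of a failed node to the surviving nodes with a classical greedy heuristic (largest task first, onto the currently least-loaded node). This is a textbook heuristic chosen for illustration, not StarPU's load balancer. The function name `redistribute`, the cost estimates, and the node identifiers are all hypothetical, and the sketch deliberately ignores data transfer costs, which a real strategy would have to limit.

```python
import heapq


def redistribute(loads, failed, orphaned):
    """Greedily reassign orphaned work to the surviving nodes.

    loads:    dict mapping node id -> current load estimate (time units)
    failed:   id of the failed node
    orphaned: dict mapping task name -> cost estimate, covering the failed
              node's tasks and any recovery work (assumed known)
    Returns (assignment, new_loads).
    """
    heap = [(load, node) for node, load in loads.items() if node != failed]
    if not heap:
        raise ValueError("no surviving node")
    heapq.heapify(heap)

    assignment = {}
    # Largest tasks first; ties broken by name to keep the result deterministic.
    for task, cost in sorted(orphaned.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)  # least-loaded surviving node
        assignment[task] = node
        heapq.heappush(heap, (load + cost, node))
    return assignment, {node: load for load, node in heap}


assignment, new_loads = redistribute(
    loads={0: 10, 1: 4, 2: 6}, failed=1, orphaned={"t1": 5, "t2": 3, "t3": 2}
)
print(assignment)  # {'t1': 2, 't2': 0, 't3': 2}
print(new_loads)   # both surviving nodes end at load 13
```

A locality-aware variant could add to each candidate node's load an estimated transfer cost for the data the task needs, which is one way to trade load balance against the volume of data redistribution.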

The Inria STORM team has extensive experience with runtime systems in general, and with task-based runtime systems in particular, notably through research related to the StarPU runtime system. StarPU currently supports scalable distributed execution through the MPI communication standard, and work is under way to add adaptive dynamic load balancing in response to application load fluctuations.

The goals of this PhD are thus the following:

- Designing replication strategies for the StarPU runtime system so that the loss of a computation node does not entail loss of data and the computation can be efficiently resumed without impacting all other computation nodes, while minimizing the amount of replication by distinguishing the pieces of data that can be recomputed from those that must be preserved

- Extending the task-based programming model and its execution model to allow for failure detection and for seamless restart strategies, so that the runtime uses the task graph information to recover the data required by the remaining tasks from the data that was replicated, and restarts the tasks whose instantiation was lost

- Improving dynamic distributed load-balancing strategies to cope with the loss of computation nodes, while limiting the volume of data redistribution

- Going further, exploring possible relationships and cooperation between scheduling algorithms and the replication/restart mechanisms to increase the effectiveness of replication while limiting the impact of restart steps

Main activities

Main activities:

  • Study the state of the art in data replication and load balancing
  • Analyze the needs of the target applications of EXA2PRO
  • Propose extensions to existing replication and load balancing strategies to leverage task graphs
  • Experiment in simulation and real environments

Complementary activities:

  • Write documentation for the proposed application interfaces
  • Write research papers
  • Present scientific results in conferences

 

Skills

Technical skills:

  • Proficiency in software development under UNIX-like operating systems
  • Good command of C/C++ programming, system programming and parallel programming

Language:

  • Proficiency in technical and scientific English
  • Good writing skills

Additional skills:

  • Knowledge of MPI, task-based programming

Benefits

  • Subsidised catering service
  • Partially-reimbursed public transport

Remuneration

1982€ / month (before taxes) during the first 2 years, 2085€ / month (before taxes) during the third year.