2022-05057 - PhD Student F/M: Using Data Science to derive stochastic models for parallel application scheduling

Contract type: Fixed-term contract (CDD)

Required level of education: Master's degree or equivalent (Bac + 5)

Position: PhD student

Context and assets of the position

The project is part of a national collaboration between centers in Grenoble, Lyon, and Bordeaux. It focuses on a new vision of job modeling for the design of supercomputers.

The first year will be based in Bordeaux within the Inria project-team Tadaam; the second and third years will be spent in Rennes (Inria project-team Kerdata), because the principal supervisor will move there.

It will be co-supervised by Guillaume Pallez (Tadaam, https://people.bordeaux.inria.fr/gaupy/) and Fanny Dufossé (Datamove, https://team.inria.fr/datamove/team-members/fanny-dufosse/).


Assigned missions

## Context and Motivation

High-Performance Computers (also called supercomputers) are massive infrastructures used to run extremely large parallel applications. These applications come from a wide range of domains, such as materials physics, climate modeling and prediction, and astronomy. Used as a cornerstone of some industrial applications (self-driving cars, drug discovery, etc.), supercomputing is also one of the pillars of scientific discovery (recent examples include the Higgs boson and supermassive black holes). With the advent of Big Data and machine learning, together with the race to Exascale (a supercomputer able to compute at a peak of 10^18 Flops), an explosion of application domains has turned to supercomputer resources.

## Scheduling
One of the central problems of computing is the allocation (or scheduling) of jobs with different requirements on shared computational resources (the computing platform). Research on Resource and Job Management Systems (RJMS) is extremely active. Fundamentally, one key element lies at the center of all existing and future algorithmic solutions: user-provided resource requests. So far, most scheduling algorithms rely on these requests being accurate. This is, however, a well-known and documented limitation: user estimates are notoriously inaccurate (typically overestimated), and this inaccuracy has been shown to hurt the performance of the system.
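
As a rough illustration of this cost (a minimal sketch with hypothetical numbers: the walltime distribution and the padding factors are assumptions, not measurements), one can simulate users who pad their requests to avoid having jobs killed at the deadline, and measure how much of the reserved node-time is actually used:

```python
# Sketch: nodes are reserved for the *requested* walltime, so the gap
# between requested and actual time is capacity the scheduler cannot
# confidently reuse. All numbers below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_jobs = 10_000
actual = rng.lognormal(mean=np.log(3600), sigma=0.8, size=n_jobs)  # seconds
# Users pad their estimates by an assumed factor of 2x to 5x:
requested = actual * rng.uniform(2.0, 5.0, size=n_jobs)

reserved = requested.sum() / 3600  # node-hours reserved
used = actual.sum() / 3600         # node-hours actually used
print(f"reserved: {reserved:,.0f} node-hours, used: {used:,.0f} node-hours")
print(f"fraction of reserved time actually used: {used / reserved:.1%}")
```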

## Research vision
We hypothesize that job resource requirements and their temporal variations are in essence stochastic: the variability of job needs is inherent and can be large. Based on this hypothesis, we believe that HPC scheduling algorithms and software should embrace the uncertainty of job resource requirements.

**The idea of this project is to show that we can design job schedulers that do not assume that user needs are deterministic/computable, but instead treat them as soft constraints that are known to be unreliable.**

Note that this project builds on previous work that showed extremely promising results [1,2,3].

## Internship/thesis topic

This thesis is the first step of the project: its goal is to design stochastic/statistical models for HPC jobs, based on the analysis of real data. In addition, we will work on constructing models from partial information, using tools such as Bayesian inference.
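
As a minimal sketch of this direction (every distribution and parameter below is an illustrative assumption, not a project result), one could model log-walltimes as Gaussian with a conjugate prior on the mean, refine the posterior as runs complete, and derive a predictive quantile usable as a reservation request:

```python
# Conjugate Bayesian update for a job-walltime model (assumed setting:
# log-walltimes ~ Normal(mu, sigma^2) with sigma known, and a prior
# mu ~ Normal(mu0, tau0^2)). Predictions emerge after few observations.
import numpy as np

def update_posterior(mu0, tau0, sigma, log_times):
    """Return posterior mean and std of mu after observing log_times."""
    precision = 1.0 / tau0**2 + len(log_times) / sigma**2
    mean = (mu0 / tau0**2 + np.sum(log_times) / sigma**2) / precision
    return mean, np.sqrt(1.0 / precision)

rng = np.random.default_rng(1)
sigma = 0.5                     # assumed known spread of log-walltimes
true_mu = np.log(7200)          # "true" mean log-walltime (~2 h), demo only
observed = rng.normal(true_mu, sigma, size=5)  # five completed runs

mu0, tau0 = np.log(3600), 1.0   # weak prior centered on 1 h
post_mu, post_sd = update_posterior(mu0, tau0, sigma, observed)

# 90th percentile of the posterior-predictive walltime, usable as a request:
q90 = np.exp(post_mu + 1.2816 * np.sqrt(post_sd**2 + sigma**2))
print(f"posterior mean walltime: {np.exp(post_mu) / 3600:.2f} h")
print(f"90% predictive walltime: {q90 / 3600:.2f} h")
```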

The challenge of the thesis consists in understanding and describing the variability of resource usage, including its origin and its behavior. We want to be able to propose novel formulations that describe the behavior of applications while accounting for this uncertainty. Important questions have to be answered, such as:
1. How can we account for the variability in application needs?
2. What are the sources of variability and their impact on application performance?
3. How can we abstract this variable behavior?

Variability in application resource needs appears to come from different factors (input data, machine parameters, code performance, etc.). Finding the right model is an important ingredient in the design of algorithmic solutions (which will also be an important part of the thesis).

![(a) Application walltime variation for various inputs.](https://wtf.roflcopter.fr/pics/miL3TVku/BoL3Iybd.jpg =340x180) ![(b) Correlation between size of input and walltime.](https://wtf.roflcopter.fr/pics/11oqmG2e/GUqTQRux.jpg =340x180)
**Data from a neuroscience application running with inputs from two different datasets [1]: (a) application walltime variation for various inputs; (b) correlation between input size and walltime.**

In preliminary work, we demonstrated that, for specific neuroscience applications run in complete isolation, the behavior was strongly input-dependent and could be modeled through high-variance statistical distributions [1]; see the figures above. Here our goal is threefold:
1. See whether these observations generalize at a much larger scale (trace-based analytics).
2. As a second step, we hypothesize that an application's observed behavior results from a compositional rule combining intrinsic application behavior and machine interference; by jointly studying one of the two applications and the complete trace, we want to estimate the machine's impact (and hence the function used in this compositional rule).
3. Once this is done, verify the compositionality of the rule on the second application and predict its behavior.
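
A minimal sketch of the first step, on synthetic data (the lognormal shape and the linear size-to-walltime trend below are assumptions used only to illustrate the workflow, mirroring the two panels above):

```python
# Fit a high-variance distribution to observed walltimes and quantify the
# input-size/walltime correlation. The "trace" is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

size = rng.uniform(1, 100, size=500)                  # input size (assumed GB)
walltime = 60 * size * rng.lognormal(0.0, 0.6, 500)   # seconds, noisy trend

# (a) distribution of walltimes: fit a lognormal, inspect its spread
shape, loc, scale = stats.lognorm.fit(walltime, floc=0)
print(f"lognormal shape (sigma): {shape:.2f}")   # large value => high variance

# (b) correlation between input size and walltime
rho, pval = stats.spearmanr(size, walltime)
print(f"Spearman correlation: {rho:.2f} (p = {pval:.1e})")
```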

We will then work on proposing models for complex workflows of specific applications (including their varying levels of parallelism). In parallel, these models will be confronted with the design of scheduling algorithms, to verify that algorithms can actually exploit them.
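
As a minimal illustration of such a test (the cost model and the lognormal walltime model below are assumptions for the sketch, not the strategies of [2]), one can compare reservation strategies driven by a stochastic model: a job killed at the end of its reservation is resubmitted with a larger one, and the model's quantiles decide the sequence of requests:

```python
# Expected reserved time of a reservation sequence for a stochastic job:
# attempt i (cost t_i) takes place iff the walltime X exceeded t_{i-1}.
import numpy as np
from scipy import stats

X = stats.lognorm(s=0.8, scale=3600)   # assumed fitted walltime model

def expected_cost(reservations):
    """E[total reserved time] for increasing reservations t_1 < t_2 < ..."""
    cost, prev = 0.0, 0.0
    for t in reservations:
        cost += t * X.sf(prev)         # P(X > prev): this attempt happens
        prev = t
    return cost

single = [X.ppf(0.999)]                           # one large, safe request
sequence = [X.ppf(q) for q in (0.5, 0.9, 0.999)]  # increasing quantiles
print(f"single large request: {expected_cost(single) / 3600:.2f} h expected")
print(f"quantile sequence:    {expected_cost(sequence) / 3600:.2f} h expected")
```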

This new strategy for creating application performance models will, in the future, open completely new directions for providing information to the resource-arbitration mechanism, something that seems necessary to address the upcoming application challenges of high-performance computing.


[1] https://hal.inria.fr/hal-02921487/
[2] https://people.bordeaux.inria.fr/gaupy/ressources/pub/confs/ipdps20_reservation.pdf
[3] https://people.bordeaux.inria.fr/gaupy/ressources/pub/confs/icpp19_speculative.pdf

Main activities

The main activities are those typical of a PhD. They include: literature review, scientific development, programming and simulation, data processing, reporting and presentations, writing papers and the thesis manuscript, collaboration with the team, the supervisors and other scientific partners, participation in conferences and workshops, and course-taking and teaching activities in accordance with doctoral-school rules.


Skills

 - Advanced knowledge of statistics (e.g. Bayesian optimization, active learning)
 - Good knowledge of probability theory
 - Knowledge of a scripting language for data science (R, Python)
 - Some knowledge of, and interest in, algorithm design
 - Good level of written English

Benefits

  • Subsidized meals
  • Partially reimbursed public transport
  • Leave: 7 weeks of annual leave + 10 extra days off (RTT, full-time basis) + possibility of exceptional leave (e.g. sick children, moving house)
  • Possibility of partial teleworking and flexible working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports benefits (Association de gestion des œuvres sociales d'Inria)
  • Access to vocational training

Remuneration

3-year fixed-term contract (doctoral contract)

Gross monthly salary:

  • 1,982 euros during the 1st and 2nd years
  • 2,085 euros during the 3rd year