2022-05077 - Post-Doctoral Research Visit F/M Distributed load balancing for finite element simulations using task-based programming model
The job description below is in English

Contract type: Fixed-term contract (CDD)

Required degree level: PhD or equivalent

Role: Postdoctoral researcher

About the research center or functional department

Team STORM combines strengths in high-level DSLs, heterogeneous runtimes and performance analysis tools to help programmers get the highest efficiency from modern computer architectures in a portable manner.

 

Context and advantages of the position

Each year, Inria launches a recruitment campaign for postdoctoral (postdoc) positions, and a limited number of slots are reserved for the International Relations Department in order to support Inria's international collaborations.

This year, projects to strengthen partnership with Simula in Norway are eligible.

The postdoc contract will have a duration of 12 to 24 months. The default start date is November 1st, 2022, and the contract must start no later than January 1st, 2023. The postdoc will be recruited by one of the Inria centers in France, but it is recommended that the time be shared between France and Norway.

 

The MAELSTROM Inria — Simula Associate Team

Scientific simulations are now a prominent means of academic and industrial research and development. Such simulations are extremely compute-intensive because of the processing involved in expressing modelled phenomena in a computable form. Exploiting supercomputer resources is essential to compute high-quality simulations in an affordable time. However, the complexity of supercomputer architectures makes it difficult to exploit them efficiently.

Simula's HPC Department is the major contributor to the FEniCS computing platform. FEniCS is a popular open-source (LGPLv3) computing platform for solving partial differential equations. It enables users to quickly translate scientific models into efficient finite element code, using a formalism close to their mathematical expression. Inria Team STORM develops methodologies and tools to statically and dynamically optimize computations on HPC architectures, ranging from task-based parallel runtime systems to vector processing techniques, and from performance-oriented scheduling to energy consumption reduction.

The purpose of the MAELSTROM associate team, started in 2022, is to build on the potential for synergy between STORM and Simula to extend the effectiveness of FEniCS on heterogeneous, accelerated supercomputers while preserving its friendliness for scientific programmers, and to make the broad range of applications built on top of FEniCS readily benefit from MAELSTROM's results.

This postdoctoral research, taking place in the context of MAELSTROM, will address algorithms and programming techniques to adapt the execution of parallel applications to massive, post-Moore's-law computers, where performance will be affected by data locality and the optimized use of heterogeneous multicore processors. Handling the complexity of these computers in a transparent and optimized manner is key for work and research in modeling and simulation.

The development of parallel applications for High-Performance Computing (HPC) platforms relies on tools that manage and distribute the computation over the available resources. Runtime systems are one such tool. They abstract and handle details related to computation and communication with the goal of providing performance portability, i.e., making the best use of the resources available in a computing platform while reducing the effort that users must invest to adapt their applications to that platform.

StarPU is a runtime system developed by the STORM team that supports computing platforms based on heterogeneous architectures (e.g., CPUs and GPUs). It implements the sequential task flow (STF) programming model, whereby applications are decomposed into tasks that are submitted in sequential order and are then scheduled and executed in parallel by the runtime system. When the computing platform is composed of multiple nodes (that is, without a shared memory), StarPU extends the STF model by submitting a single task graph to all nodes. However, StarPU currently offers only two options. Either it provides a fully distributed execution model but lets the application handle the distributed load balancing (while StarPU still handles the scheduling of tasks within each node), or it takes responsibility for the distributed scheduling work but then manages the execution in a centralized, master-worker model. The centralized execution model is similarly employed by other state-of-the-art runtime systems, and it may pose a challenge to their scalability on future computing platforms.
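To make the sequential task flow idea concrete, here is a minimal Python sketch of how a runtime could infer task dependencies purely from the submission order and the data each task reads or writes. It is an illustration only and does not use or reproduce the StarPU API; the function name `infer_dependencies`, the tuple layout, and the example task names are all hypothetical.

```python
from collections import defaultdict

def infer_dependencies(tasks):
    """Infer dependencies from a sequential submission order.

    `tasks` is a list of (name, reads, writes) tuples in submission order.
    Returns a dict mapping each task name to the set of earlier tasks it
    must wait for.
    """
    last_writer = {}                         # data -> last task that wrote it
    readers_since_write = defaultdict(set)   # data -> readers since that write
    deps = {}
    for name, reads, writes in tasks:
        wait_for = set()
        for d in reads:
            if d in last_writer:
                wait_for.add(last_writer[d])      # read-after-write
        for d in writes:
            if d in last_writer:
                wait_for.add(last_writer[d])      # write-after-write
            wait_for |= readers_since_write[d]    # write-after-read
        deps[name] = wait_for
        for d in reads:
            readers_since_write[d].add(name)
        for d in writes:
            last_writer[d] = name
            readers_since_write[d] = set()
    return deps

# Hypothetical task flow, written as if the program were sequential.
flow = [
    ("assemble_A", [], ["A"]),
    ("assemble_b", [], ["b"]),
    ("solve", ["A", "b"], ["x"]),
    ("norm", ["x"], ["r"]),
    ("rescale_b", [], ["b"]),
]
deps = infer_dependencies(flow)
```

In this toy flow, `assemble_A` and `assemble_b` have no dependencies and could run in parallel, while `rescale_b` must wait for `solve` because `solve` still reads the old value of `b`.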

In the fully distributed execution model, the application supplies a data distribution over the participating nodes, and StarPU then uses this data distribution to decide how tasks are mapped onto these nodes.
The application may therefore control the load balancing by altering this data distribution over the course of the execution. The fully distributed execution model is scalable by design, but the added value offered by StarPU to applications in this model is currently limited. The purpose of this position is therefore to explore how StarPU could work in synergy with FEniCS to exploit high-level problem knowledge and propose an automated distributed load balancing based on data movements.
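The link between data distribution and task mapping can be sketched in a few lines of Python. The sketch assumes a simple rule in which each task runs on the node that owns the first piece of data it writes; this rule, the names `map_tasks` and `tasks_per_node`, and the block layout are assumptions made for illustration, not a description of StarPU's internals.

```python
from collections import Counter

def map_tasks(tasks, owner):
    """Map each task to the node owning the first piece of data it writes.

    `tasks` is a list of (name, reads, writes); `owner` maps data -> node id.
    """
    return {name: owner[writes[0]] for name, _reads, writes in tasks}

def tasks_per_node(mapping, n_nodes):
    """Count how many tasks were mapped onto each node."""
    counts = Counter(mapping.values())
    return [counts.get(node, 0) for node in range(n_nodes)]

# Four data blocks; each task updates one block and reads its neighbours.
tasks = [
    (f"update_{i}",
     [f"blk{j}" for j in (i - 1, i + 1) if 0 <= j < 4],
     [f"blk{i}"])
    for i in range(4)
]

# Initial distribution supplied by the application: node 0 owns three blocks.
owner_before = {"blk0": 0, "blk1": 0, "blk2": 0, "blk3": 1}
load_before = tasks_per_node(map_tasks(tasks, owner_before), 2)

# Moving a single block to node 1 changes the task mapping as a side effect.
owner_after = dict(owner_before, blk2=1)
load_after = tasks_per_node(map_tasks(tasks, owner_after), 2)
```

Here the load goes from three tasks on node 0 and one on node 1 to two tasks on each node, although the application only changed where `blk2` lives.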

Assignment

Candidates for postdoctoral positions are recruited after the end of their PhD or after a first post-doctoral period: for the candidates who obtained their PhD in the Northern hemisphere, the date of the defense shall be later than 1 September 2020; in the Southern hemisphere, later than 1 April 2020.

In order to encourage mobility, the post-doctorate must take place in a scientific environment that is truly different from that of the PhD (and, if applicable, from the job held since the PhD); particular attention is thus paid to French or international candidates who obtained their doctorate abroad.

 

Work Description

The objective of this Post-doc position is to extend the fully distributed execution model of StarPU with the appropriate logic to take charge of the distributed load balancing work on behalf of applications in general and the FEniCS programming environment in particular, that is, to trigger data redistribution actions to ensure a balanced workload.
Since a piece of data may be referred to from multiple tasks and a task may refer to multiple pieces of data, the main challenge is to capture the potentially complex relationship between the data distribution and the resulting work distribution in terms of tasks.
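The coupling described above can be illustrated with a small cost model. The following Python sketch assumes tasks with known costs, an owner-of-written-data mapping rule, and a count of remote reads as a proxy for communication; all of these, as well as the names `evaluate` and `imbalance`, are hypothetical simplifications rather than the model this position is expected to produce.

```python
def evaluate(tasks, owner, n_nodes):
    """Return (work per node, number of remote reads) for a data distribution.

    `tasks` is a list of (cost, reads, writes); `owner` maps data -> node id.
    """
    work = [0.0] * n_nodes
    remote_reads = 0
    for cost, reads, writes in tasks:
        node = owner[writes[0]]      # assumed rule: run where the output lives
        work[node] += cost
        remote_reads += sum(1 for d in reads if owner[d] != node)
    return work, remote_reads

def imbalance(work):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(work) / len(work)
    return max(work) / mean if mean > 0 else 1.0

# Each piece of data is read by several tasks, and each task reads several pieces.
tasks = [
    (4, ["d0", "d1"], ["d0"]),
    (1, ["d0", "d1", "d2"], ["d1"]),
    (1, ["d1", "d2", "d3"], ["d2"]),
    (2, ["d2", "d3"], ["d3"]),
]

owner_a = {"d0": 0, "d1": 0, "d2": 1, "d3": 1}
work_a, remote_a = evaluate(tasks, owner_a, 2)   # work [5.0, 3.0], 2 remote reads

owner_b = dict(owner_a, d1=1)                    # move d1 to node 1
work_b, remote_b = evaluate(tasks, owner_b, 2)   # work [4.0, 4.0], 2 remote reads
```

Moving `d1` relocates the task that writes it and also changes which reads of `d0` and `d1` by other tasks become remote, which is why the effect of a redistribution cannot be judged from the moved data alone.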

References

  • Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, 2011.
  • Matthias Lieber, Kerstin Gößner, and Wolfgang E. Nagel. The potential of diffusive load balancing at large scale. In 23rd European MPI Users' Group Meeting, 2016.
  • V. Freitas, L. L. Pilla, A. Santana, M. Castro, and J. Cohen. PackStealLB: A Scalable Distributed Load Balancer based on Work Stealing and Workload Discretization. Journal of Parallel and Distributed Computing, April 2021.
  • Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, and Samuel Thibault. Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model. IEEE Transactions on Parallel and Distributed Systems, December 2017.

 

Main activities

Main Objectives

  • Identify situations of load imbalance and their causes;
  • Model the relationship between the data distribution and the subsequent workload distribution;
  • Establish new algorithms for the distribution of data at the start of applications;
  • Propose distributed load balancing (data redistribution) algorithms adapted to the sequential task flow programming model;
  • Record and validate all steps of the research, leveraging both real and simulated executions of applications, to allow the reproduction of results.

Besides the software developed in StarPU, it is expected that the scripts related to experiments and data analysis, and the data collected from experiments will be made available to the community.

Skills

Required Skills

  • Knowledge of high-performance computing and scheduling algorithms
  • Mastery of software development under UNIX-like operating systems
  • Good command of C/C++ programming, system programming and parallel programming
  • Mastery of technical and scientific English
  • Good writing skills
  • Good presentation skills

 

Instructions to apply

Applications for the post-doc positions reserved for the Inria International Relations Department must be submitted through the jobs.inria.fr platform before July 15, 2022, with the following documents:

- Completed summary sheet to be uploaded here: https://mybox.inria.fr/d/0012f24ad5484cfd8307/

- Research project including subject title, research program, work plan and planned visits, duration (between 12 and 24 months) and the desired starting date (the default start date is November 1st, 2022, and no later than January 1st, 2023).

- Detailed CV with a description of the PhD and a complete list of publications with the two most significant ones highlighted.

- Motivation letter from the candidate.

- 2 letters of recommendation.

- Letters of support from the host Inria research team and from the host international partner.

- Copy of passport.

For more information

For further questions, feel free to contact postdoc-dri@inria.fr

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Possibility of teleworking and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

2653€ / month (before taxes)