Level of qualifications required : Graduate degree or equivalent
Other valued qualifications : Master's degree in Computer Science (or Mathematics)
Position : PhD Position
Level of experience : Recently graduated
About the research centre or Inria department
A major driving force in HPC has been the pursuit of frontier milestones: technical goals that symbolize the next stage of progress in the field. After the teraflop machines of the 1990s, the HPC community now works with general-purpose petaflop supercomputers and envisions exaflop machines in the 2020s. For application codes to sustain petaflop performance and beyond, a few million cores or more will be needed, regardless of processor technology. Currently, few HPC simulation codes scale easily to this regime, and major code development efforts are critical to realize the potential of these new systems. Scaling at this rate will require improved physical models, mathematical modelling and super-scalable algorithms, with particular attention paid to the acquisition, management and visualization of huge amounts of scientific data.
In this context, the purpose of the HiePACS project is to efficiently perform frontier simulations arising from challenging research and industrial multiscale applications. Solving these challenging problems requires a multidisciplinary approach involving applied mathematics, computational science and computer science. In applied mathematics, it essentially involves advanced numerical schemes. In computational science, it involves massively parallel computing and the design of highly scalable algorithms and codes to be executed on future petaflop (and beyond) platforms. Through this approach, HiePACS intends to contribute to every step, from the design of new high-performance, more scalable, robust and accurate numerical schemes to the optimized implementation of the associated algorithms and codes on very high performance supercomputers.
The thesis takes place within the framework of one of the following partnerships:
- Projet Région Nouvelle-Aquitaine 2018-1R50119 “HPC scalable ecosystem”.
- SysNum cluster https://sysnum.labex.u-bordeaux.fr/en/
- ANR DASH https://project.inria.fr/dash/
- The thesis will take place in the HiePACS Inria project team http://hiepacs.bordeaux.inria.fr/ in close collaboration with the Tadaam project team https://team.inria.fr/tadaam/
In recent years, task-based runtime schedulers (such as StarPU or PARSEC) have proven efficient at scheduling complex HPC applications on complex platforms, where each node consists of multicore CPUs and accelerators such as GPUs or FPGAs. These runtime schedulers take allocation decisions at runtime, based on the actual state of the platform and on the set of ready tasks. The flexibility of such runtimes comes from the explicit expression of complex scientific computations as graphs of dependent tasks, which makes it possible to obtain better and more portable performance than monolithic applications based directly on OpenMP and MPI. Indeed, it is now possible both to implement sophisticated allocation strategies that would be too complex for monolithic implementations, and to specify dynamic strategies that can react to unexpected runtime events or to incorrect predictions of processing and communication times. In the context of a single node, it has also been shown that even better performance can be achieved by using intermediate mixed strategies [2,3,4,5], where some static knowledge about the computational graph is injected into the runtime scheduler.
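The task-graph model these runtimes rely on can be illustrated with a minimal sketch (plain Python, not the actual StarPU or PARSEC API): a task becomes ready once all of its dependencies have completed, and the scheduler picks among ready tasks at execution time. The task names below mimic a Cholesky-like kernel pattern purely for illustration.

```python
# Minimal illustration of task-based execution: a DAG of tasks, where a
# task becomes "ready" once all its dependencies have completed. This
# mimics the scheduling model of runtimes like StarPU/PARSEC, not their APIs.

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    done, order = set(), []
    while len(done) < len(tasks):
        # Ready tasks: all dependencies satisfied. A real runtime would
        # dispatch these to CPU cores / GPUs based on the platform state.
        ready = [t for t in tasks if t not in done and remaining[t] <= done]
        for t in ready:
            tasks[t]()          # execute the task
            done.add(t)
            order.append(t)
    return order

# Toy dependency pattern: one factorization, two solves, one update.
log = []
names = ["potrf", "trsm1", "trsm2", "gemm"]
dag = {n: (lambda n=n: log.append(n)) for n in names}
deps = {"trsm1": {"potrf"}, "trsm2": {"potrf"}, "gemm": {"trsm1", "trsm2"}}
order = run_dag(dag, deps)
# order respects the dependencies: potrf first, gemm last
```

The point of the abstraction is that the execution order is a runtime decision: the same DAG can be replayed on any mix of resources, which is what makes the performance portable.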
The goal of this thesis is to extend these mixed strategies to the context of large-scale platforms with distributed memory. In this context, we can assume that the graph of the application is known in advance, and the initial allocation of data is clearly crucial in order to avoid slow inter-node communications. This initial allocation, computed before the execution, should:
(i) minimize the transfer or generation cost needed to obtain the initial data distribution;
(ii) balance the load between computing resources while minimizing the communication overhead;
(iii) be robust to small changes in the performance of the resources;
(iv) be able to make use of distributed storage devices such as Burst Buffers.
In all cases, the possibility of replicating data and computations will be explored.
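To build intuition for objective (ii), here is a toy greedy placement heuristic (a sketch under simplifying assumptions, not the strategy the thesis will develop): each task is placed on the node that minimizes its resulting load plus a penalty for communication with already-placed predecessors. The cost and edge values are made up for the example.

```python
# Toy greedy placement: assign each task to the node that minimizes
# (load after assignment) + alpha * (communication with other nodes).
# Illustrative only; the thesis targets strategies with provable guarantees.

def greedy_placement(costs, edges, n_nodes, alpha=1.0):
    """costs: {task: compute cost}; edges: {(u, v): comm volume}, u before v."""
    place, load = {}, [0.0] * n_nodes
    for t in costs:  # tasks visited in (topological) insertion order
        def score(node):
            # Communication incurred with predecessors placed on other nodes.
            comm = sum(w for (u, v), w in edges.items()
                       if v == t and place.get(u) not in (None, node))
            return load[node] + costs[t] + alpha * comm
        best = min(range(n_nodes), key=score)
        place[t] = best
        load[best] += costs[t]
    return place, load

# Two heavy tasks feeding two light ones, with large data transfers:
# the heuristic keeps each chain on one node and balances the load.
costs = {"a": 4, "b": 4, "c": 1, "d": 1}
edges = {("a", "c"): 10, ("b", "d"): 10}
place, load = greedy_placement(costs, edges, n_nodes=2)
```

Even this crude heuristic shows the tension between objectives (i)–(iii): the weight `alpha` trades load balance against communication, and a placement tuned for one performance profile may violate robustness when resource speeds drift.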
Then, at runtime, the system must be able to decide (i) when to move a given task from one resource to another, or to regenerate its data on another resource, and (ii) when to perform a more complex and expensive data redistribution.
These two problems will be formulated as combinatorial optimization problems, and we will design algorithms to solve them with provable performance guarantees, possibly in restricted settings (for specific task graphs that may come from linear algebra, tensor computations or deep learning, or for specific computing platforms).
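A classical example of the kind of guarantee sought is Graham's list scheduling for independent tasks on m identical machines: always give the next task to the least-loaded machine, which provably yields a makespan at most (2 − 1/m) times the optimum. The sketch below (with a made-up workload) illustrates the technique; the scheduling problems of the thesis are far richer (dependencies, heterogeneity, communications).

```python
import heapq

# Graham's list scheduling for independent tasks on m identical machines:
# repeatedly assign the next task to the currently least-loaded machine.
# Guarantee: makespan <= (2 - 1/m) * OPT.

def list_schedule(durations, m):
    heap = [(0.0, i) for i in range(m)]  # (current load, machine id)
    heapq.heapify(heap)
    assignment = []
    for d in durations:
        load, i = heapq.heappop(heap)    # least-loaded machine
        assignment.append(i)
        heapq.heappush(heap, (load + d, i))
    return assignment, max(load for load, _ in heap)

# Example: on 2 machines, OPT is 6 ({3,3} vs {2,2,2}); list scheduling
# gets 7, well within the (2 - 1/2) * 6 = 9 guarantee.
assignment, makespan = list_schedule([3, 3, 2, 2, 2], m=2)
```

The appeal of such results is that the bound holds for every input, which is exactly the style of worst-case analysis the thesis aims to carry over to data placement and redistribution.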
Main activities :
- Algorithm design
- Optimality proofs
- Modelling of applications and computing platforms
Technical skills and level required : good level in algorithm design and modelling
Languages : Python, C, C++
- Subsidized meals
- Partial reimbursement of public transport costs
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
- 1982€ / month (before taxes) during the first 2 years
- 2085€ / month (before taxes) during the third year
- Theme/Domain :
Distributed and High Performance Computing
Scientific computing (BAP E)
- Town/city : Talence
- Inria Center : CRI Bordeaux - Sud-Ouest
- Starting date : 2019-10-01
- Duration of contract : 3 years
- Deadline to apply : 2019-08-31
The keys to success
This thesis requires some taste for modelling, algorithm design and combinatorial problems. The goal will first be to work on toy task graphs and platform models to understand the problems, their complexity and possible approaches. Then, we will move to actual applications (coming from linear algebra, tensor computations and machine learning) and implement our solutions on top of runtime systems such as StarPU.
Inria, the French national research institute for the digital sciences, promotes scientific excellence and technology transfer to maximise its impact. It employs 2,400 people. Its 200 agile project teams, generally with academic partners, involve more than 3,000 scientists in meeting the challenges of computer science and mathematics, often at the interface of other disciplines. Inria works with many companies and has assisted in the creation of over 160 startups. It strives to meet the challenges of the digital transformation of science, society and the economy.
Instruction to apply
Please send a CV, a cover letter and a letter of recommendation.
Defence Security :
This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter such an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.
Recruitment Policy :
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.