PhD Position F/M Modeling of HPC Jobs and Resources to Minimize Energy Waste

Contract type: Fixed-term contract

Required degree level: Graduate degree or equivalent

Role: PhD Position

About the research centre or functional department

The Centre Inria de l’Université Grenoble Alpes brings together almost 600 people in 24 research teams and 9 research support departments.

Staff is present on three campuses in Grenoble, in close collaboration with other research and higher education institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, …), but also with key economic players in the area.

The Centre Inria de l’Université Grenoble Alpes is active in the fields of high-performance computing, verification and embedded systems, modeling of the environment at multiple levels, and data science and artificial intelligence. The center is a top-level scientific institute with an extensive network of international collaborations in Europe and the rest of the world.

Context and advantages of the position

The PhD will be co-advised by Raphaël Bleuse and Eric Rutten (Ctrl-A), LIG/INRIA, and Franck Corset (LJK, ASAR), within the framework of the Taranis project in the PEPR Cloud.

Assignment

Energy sobriety of data centers and High-Performance Computing (HPC) systems, in terms of electrical power, is becoming an important design issue, as the global energy consumption of information technologies is rising to considerable levels. Large-scale computing infrastructures are processing ever larger amounts of data or solving problems that require ever more computing power. The behavior of large-scale infrastructures has become more variable and difficult to model, especially with respect to power consumption and application performance. Therefore, dealing with time variations and unpredictable disturbances requires automating the management (i.e., configuration) of the infrastructures. This automatic management can be done by periodically monitoring the state of the system and updating the configuration to activate relevant mechanisms.

This work takes root in the field of autonomic computing [6], and aims at designing efficient feedback loops to automatically manage the resources of an HPC infrastructure. The use of feedback loops is widespread in various fields of engineering, but is still recent in computer science.
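As an illustration of the monitor-then-reconfigure cycle described above, here is a minimal sketch of a discrete-time feedback loop. Everything in it is an assumption made for the example: the integral controller, its gain, the toy `cluster_utilization` model and the 0.9 utilization target are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class IntegralController:
    """Discrete-time integral controller (hypothetical, for illustration)."""
    gain: float               # integral gain (assumed value, not tuned on real data)
    setpoint: float           # target value of the monitored metric
    accumulated: float = 0.0  # running sum of past errors

    def update(self, measurement: float) -> float:
        """Accumulate the error and return the new actuator value."""
        self.accumulated += self.setpoint - measurement
        return max(0.0, self.gain * self.accumulated)


def cluster_utilization(base_load: float, best_effort_rate: float) -> float:
    """Toy plant model (assumption): each unit of rate adds 1% utilization."""
    return min(1.0, base_load + 0.01 * best_effort_rate)


def run_loop(steps: int = 50) -> list[float]:
    """Run the closed loop and return the utilization measured at each period."""
    controller = IntegralController(gain=50.0, setpoint=0.9)
    rate = 0.0
    history = []
    for _ in range(steps):
        # Monitor: read the current state of the (simulated) system.
        measured = cluster_utilization(base_load=0.6, best_effort_rate=rate)
        history.append(measured)
        # Act: update the configuration for the next period.
        rate = controller.update(measured)
    return history


if __name__ == "__main__":
    trace = run_loop()
    print(f"first: {trace[0]:.3f}, last: {trace[-1]:.3f}")
```

In this toy setting the measured utilization starts at the base load and settles near the setpoint. On a real platform the plant is neither linear nor stationary, which is precisely why the modeling work is needed.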

Main activities

Resource harvesting

The Resource and Job Management System (RJMS) is a key component to operate an HPC system [5, 7]. Users submit jobs: a description of their computation, data, and resource requirements. Based on the resource status reported by the resource manager, the scheduler decides which resources to allocate to a job and assigns a time slot for the job’s execution.

The RJMS is, however, unable to fully exploit the resources of an HPC cluster: the presence of unused resources results from the limits of job scheduling. The allocation of resources to jobs, while respecting all constraints, leaves some resources idle. Such inefficiency is sometimes referred to as fragmentation. The computing power lost to fragmentation represents an exploitable pool of resources.

CiGri [1] is a lightweight, scalable and fault-tolerant grid system that plugs into the RJMS. CiGri aims at minimizing the waste of computing resources by leveraging the pool of unused resources. Yet, some computations (best-effort computations) can still lead to wasted resources if they are stopped before completion. In particular, by integrating information from the RJMS, one could avoid starting best-effort computations when there is not enough time to execute them.
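The last point can be sketched as a simple admission test. The following is only a sketch under stated assumptions: the execution time is assumed to follow a log-Normal distribution with known parameters `mu` and `sigma`, the length of the idle window is assumed to be obtainable from the RJMS, and the function names and the 0.9 threshold are hypothetical. It does not describe how CiGri currently works.

```python
import math


def lognormal_cdf(t: float, mu: float, sigma: float) -> float:
    """P(T <= t) for a log-Normal execution time T with parameters mu, sigma."""
    if t <= 0.0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def should_start(idle_window_s: float, mu: float, sigma: float,
                 min_completion_prob: float = 0.9) -> bool:
    """Start a best-effort job only if it is likely to finish within the window.

    idle_window_s: time (seconds) before the resources are reclaimed,
                   assumed to be provided by the RJMS (hypothetical input).
    """
    return lognormal_cdf(idle_window_s, mu, sigma) >= min_completion_prob


if __name__ == "__main__":
    # Hypothetical job class whose median execution time is 600 s.
    mu, sigma = math.log(600.0), 0.5
    for window in (600.0, 3600.0):
        print(window, should_start(window, mu, sigma))
```

With these hypothetical parameters, a 600-second window is rejected because the job has only an even chance of completing, whereas a one-hour window is accepted.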

Modeling jobs and resources

The authors of [4] show that job execution times are not necessarily deterministic and can be modeled by an Exponential, Weibull, log-Normal or Normal distribution. Moreover, the high variance of execution times and the diversity of jobs, whether in terms of the nature of the data, the application domain or the size of the problem, have to be taken into account, for instance by considering a mixture of distributions (see [3]). Thus, we propose to apply Machine Learning techniques, e.g., a clustering of all jobs submitted during the last decade, in order to take this heterogeneity into account. In a second step, we propose to use this information as a prior distribution in a Bayesian setting in order to improve the accuracy of the execution time estimates (see [8]). Furthermore, prior work mostly focuses on models that are job-centric and based on post-execution data [2]: they neglect to model the resources used for the computation.
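To make the Bayesian step concrete, here is a minimal sketch of a conjugate update. It assumes Exponential execution times (one of the families mentioned above) with a Gamma prior on the rate, chosen here only because the posterior has a closed form. The class name and the numerical values are hypothetical; in the intended workflow the prior parameters would come from the cluster a job belongs to.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GammaPrior:
    """Gamma(shape, rate) belief on the rate of Exponential execution times."""
    shape: float
    rate: float

    def update(self, observed_times: Iterable[float]) -> "GammaPrior":
        """Conjugate update: shape += n, rate += sum of observed times."""
        times = list(observed_times)
        return GammaPrior(self.shape + len(times), self.rate + sum(times))

    def expected_execution_time(self) -> float:
        """Predictive mean of the execution time, defined for shape > 1."""
        if self.shape <= 1.0:
            raise ValueError("predictive mean requires shape > 1")
        return self.rate / (self.shape - 1.0)


if __name__ == "__main__":
    # Hypothetical prior for one job cluster: expected execution time of 300 s.
    prior = GammaPrior(shape=3.0, rate=600.0)
    # Four new runs of a job from this cluster, each lasting 100 s.
    posterior = prior.update([100.0, 100.0, 100.0, 100.0])
    print(prior.expected_execution_time(), posterior.expected_execution_time())
```

The estimate moves from the cluster-level prior toward the observed runs as data accumulates. Other families, such as Weibull or log-Normal, generally require numerical methods rather than a closed-form update.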

In this work, we want to model the availability of the platform in order to insert best-effort jobs in a frugal way. The design of the models will have to consider that allocation decisions are taken online, with only partial information that is revealed progressively over the platform life-cycle.

References

[1] Bruno Bzeznik and Ghislain Charrier. CiGri. lic: GPL-3.0-or-later. url: http://cigri.imag.fr/, vcs: https://github.com/oar-team/cigri.

[2] Dror G. Feitelson, Dan Tsafrir, and David Krakov. “Experience with using the Parallel Workloads Archive.” In: J. Parallel Distributed Comput. 74.10 (Oct. 2014), pp. 2967–2982. doi: 10.1016/J.JPDC.2014.06.013.

[3] Sylvia Frühwirth-Schnatter, Gilles Celeux, and Christian P. Robert. Handbook of Mixture Analysis. CRC Press, 2019.

[4] Ana Gainaru, Hongyang Sun, Guillaume Aupy, Yuankai Huo, Bennett A. Landman, and Padma Raghavan. “On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows.” In: Int. J. High Perform. Comput. Appl. 33.6 (2019). doi: 10.1177/1094342019841681.

[5] Yiannis Georgiou. “Contributions for Resource and Job Management in High Performance Computing.” PhD thesis. LIG, Univ. Grenoble Alpes, France, Nov. 2010. url: https://tel.archives-ouvertes.fr/tel-01499598 (visited on 2023-10-11).

[6] Jeffrey O. Kephart and David M. Chess. “The Vision of Autonomic Computing.” In: IEEE Computer 36.1 (Jan. 2003), pp. 41–50. doi: 10.1109/MC.2003.1160055.

[7] Albert Reuther et al. “Scalable system scheduling for HPC and big data.” In: Journal of Parallel and Distributed Computing 111 (Jan. 2018), pp. 76–92. doi: 10.1016/j.jpdc.2017.06.009.

[8] Christian P. Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation. Vol. 2. Springer.

Skills

The PhD candidate must have:

an MSc degree in Computer Science or Statistics
skills in programming languages and software engineering
knowledge of the HPC domain

Benefits

Subsidized meals
Partial reimbursement of public transport costs
Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
Professional equipment available (videoconferencing, loan of computer equipment, etc.)
Social, cultural and sports events and activities
Access to vocational training
Social security coverage

Remuneration

Base of 2200 euros gross / month

General information

Theme/Domain: Distributed Systems and middleware
System & Networks (BAP E)
City: Grenoble
Inria Centre: Centre Inria de l'Université Grenoble Alpes
Desired starting date: 2025-10-01
Contract duration: 3 years
Application deadline: 2025-07-16

Instructions to apply

Applications must be submitted online via the Inria website. Processing of applications submitted via other channels is not guaranteed.

Defense security:
This position is likely to be located in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of the nation's scientific and technical potential (PPST). Authorization to access such an area is granted by the head of the institution, following a favorable ministerial opinion, as defined in the order of 3 July 2012 relating to the PPST. An unfavorable ministerial opinion for a position located in a ZRR would result in the cancellation of the recruitment.

Recruitment policy:
As part of its diversity policy, all Inria positions are accessible to people with disabilities.

Contacts

Inria team: CTRL-A
PhD supervisor:
Rutten Eric / eric.rutten@inria.fr

About Inria

Inria is the French national research institute dedicated to digital science and technology. It employs 2600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3900 scientists in tackling the challenges of digital technology, often at the interface with other disciplines. The institute draws on many talents in more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with a worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society and the economy.