2019-01392 - PhD Position F/M Replication Algorithms for Real-time Tasks with Precedence Constraints [PHD Campaign 2019 - Campagne Doctorants Grenoble Rhône-Alpes]

Contract type : Public service fixed-term contract

Level of qualifications required : Graduate degree or equivalent

Fonction : PhD Position

About the research centre or Inria department

Grenoble Rhône-Alpes Research Center groups together a few less than 800 people in 35 research teams and 9 research support departments.

Staff is localized on 5 campuses in Grenoble and Lyon, in close collaboration with labs, research and higher education institutions in Grenoble and Lyon, but also with the economic players in these areas.

Present in the fields of software, high-performance computing, Internet of things, image and data, but also simulation in oceanography and biology, it participates at the best level of international scientific achievements and collaborations in both Europe and the rest of the world.

Context

ROMA team

LIP laboratory

ENS Lyon, Lyon, France

Supervisor: Yves Robert, Professor, yves.robert@inria.fr

Co-supervisor: Loris Marchal, CRNS Researcher, loris.marchal@ens-lyon.fr

Assignment

 

This PhD thesis focuses on real-time workflow applications that execute on high-performance computing platforms. Consecutive instances of the problem are periodically input to the platform, and execute concurrently, under deadline constraints. The optimization problem is to maximize the throughput of the platform while enforcing the response time constraints, and it has received considerable attention in the literature. The thesis revisits this problem on large-scale platforms where tasks are subject to transient errors. These errors must be detected and corrected via re-execution, which may increase response-time and decrease throughput. The thesis will investigate replication techniques to protect the tasks and guarantee some reliability threshold, under a multi-objective optimization approach, where energy consumption (a critical resource in real-time systems) is also accounted for.

Research topics : Algorithms, scheduling, parallel applications, resilience, transient errors, performance evaluation, throughput, response time, energy consumption.

Thesis description : A real-time system is responsible for performing logically correct computations while meeting predefined deadlines. Failing to meet these deadlines often results in catastrophic system failures. Transient faults triggered by cosmic rays or electromagnetic interference have to be tolerated when executing the tasks, which calls for sophisticated replication strategies. However, replication has a high cost in terms of consumed energy, which is a critical resource in embedded real-time systems. Altogether, enforcing all task deadlines while tolerating transient faults and without exceeding a given energy budget is a challenging endeavor.

The thesis will target real-time workflows, i.e., pipelined instances of the same set of tasks whose dependence graph is a given DAG (Directed Acyclic Graph). Instance number i is released at time i.P and must complete by time (i+1)P, where P is the global period of the problem. The workflow instances are mapped onto a parallel platform with N identical processors.

Each task is subject to transient failures, with some probability that is typically evaluated as a function of the task worst-case execution time. A reliability threshold is enforced, either task by task, or globally, which calls for replicating each task with a different number of replicas. However, when the execution of one replica succeeds, the other replicas are cancelled.

The energy consumed by each replica of each task depends upon the frequency chosen for its execution. The lower the frequency, the less energy consumed, but the longer the execution time; a low frequency may prevent the workflow deadline to be matched. Also, if the execution of the replicas of a given task do not overlap, energy consumption will be smaller whenever the first execution succeeds.

Altogether, this calls to designing schedules that achieve several trade-offs. The problem is to find:

(i) a number of replicas for each task;

(ii) a frequency for each task replica; and

(iii) a mapping and scheduling of all these replicas onto the processors.

The constraints are to match the reliability threshold and the deadlines, and the objective is to minimize the expectation of the consumed energy.

First research direction : The first research direction is to address the simpler case of independent tasks with the same period. A preliminary study has been conducted in [1], where some heuristics have been introduced. The thesis will assess whether the problem of scheduling the task replicas once the mapping and frequencies are given is NP-complete or not, and will design optimal algorithms (if the problem is polynomial) or efficient heuristics (if is is NP-complete). Then the thesis will investigate approximation algorithms to solve the general problem for independent tasks.

Second research direction : The next step will be to consider independent tasks with different periods. The scheduling of the tasks becomes less flexible, and the EDF policy (Earliest Deadline First) is usually enforced to guarantee that all deadlines are matched. The thesis will design algorithms and evaluate their efficiency for this problem.

Third research direction : The thesis will eventually tackle dependence graphs, first of a particular form (chains, fork-joins) and then will address the general problem.

General methodology : The work will be conducted along three main lines:

- Synthesis of relevant literature

- Design and analysis of new and multi-criteria algorithms

- Evaluation through simulations and using benchmark workflows

Pre-requisite :

- Basic background in computer science, probability theory,

- interest in algorithm design and complexity.

Bibliography :

(a) General references :

- T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms. The MIT Press, 2001

- M. Mitzenmacher, E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005

- J. Leung editor, Handbook of Scheduling: Algorithms, Models and Performance Analysis, CRC Press 2004​​

- H. Casanova, A. Legrand and Y. Robert, Parallel Algorithms, CRC Press 2008

(b) Reference on real-time workflows:

- M. A. Haque, H. Aydin, D. Zhu, On Reliability Management of Energy-Aware Real-Time Systems Through Task Replication, IEEE Trans. Parallel and Distributed Systems 28, 3 (2017), pp 813-825.

 

 

 

Main activities

Research towards a PhD thesis in computer science

Skills

Technical slls and level required : see above

Languages : fluent English (French not needed)

 

Benefits package

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

1st and 2nd year : 1 982 euros brut /month

3rd year : 2 085 euros brut / month