Contract type: Fixed-term contract (CDD)
Required degree level: Master's degree (Bac + 5) or equivalent
Role: PhD student
Context and advantages of the position
Financial and working environment.
This PhD will be hosted by Inria (Myriads team, Rennes Bretagne Atlantique) and funded by Inria. This sub-project is part of the Inria-OVH collaborative framework, so the work will be carried out in close collaboration with OVH. In particular, we plan to validate the results of the project using several OVH data services, including backup and media services.
The PhD student will be supervised by:
- Shadi Ibrahim, member of the Myriads team in Rennes
- Guillaume Pierre, head of the Myriads team in Rennes
- Jean-François Smigielski, Software Engineer specialized in Block Storage, OVHcloud
- Romain De Joux, Technical Lead Object Storage, OVHcloud
Visits and meetings between the successful candidate and the supervisors will be organized, as well as meetings with the other members of the Inria-OVH collaborative framework.
Assignment
Context
The amount of data produced worldwide is growing exponentially, reaching 64.2 zettabytes in 2020. To meet the continuously growing demand for computing resources to store and process Big Data, large cloud providers have equipped their infrastructures with millions of energy-hungry servers distributed across multiple physically separate data centers. This results in a tremendous increase in the energy consumed to operate these data centers. As data volumes and the scale of data centers keep rising, energy consumption will remain a major concern in the cloud. Thus, it is important to make data management in the cloud energy-efficient.
Data are usually replicated to ensure high availability and performance (by directing users to the closest replica). However, replication comes at a high cost in terms of storage space, network usage, and write performance. This can also translate into high energy consumption [1], in particular to store and transfer data.
Recently, we have witnessed advances in the performance of data reduction and protection schemes such as erasure coding (EC), deduplication, and compression. Thus, recent efforts have investigated the potential of replacing replication with erasure coding to reduce the cost of data storage while sustaining good performance. For example, EC is now employed in data analytics systems [2, 3] and in in-memory storage systems for cached (hot) data [4]. Though benefits exist, EC poses new challenges, including cost of access, energy consumption (encoding, decoding, etc.), data availability, and data loss. In addition, when adopting EC, we need to take into consideration the access frequency and performance requirements of data, which vary according to the age and type of data, the time of access, the applications, and the users.
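To make the trade-off between replication and EC concrete, the following minimal Python sketch compares the two on storage overhead, tolerated failures, and repair read amplification. The class and function names are hypothetical, and the parameters (3-way replication, a Reed-Solomon code with 6 data and 3 parity chunks) are illustrative assumptions rather than configurations taken from the project. The model assumes a classic MDS code, where rebuilding one lost chunk requires reading k surviving chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyScheme:
    """Simplified model of a redundancy scheme (hypothetical helper)."""
    name: str
    data_units: int    # k: units holding original data
    parity_units: int  # m: extra units added for protection

    @property
    def storage_overhead(self) -> float:
        # Bytes stored per byte of user data.
        return (self.data_units + self.parity_units) / self.data_units

    @property
    def tolerated_failures(self) -> int:
        # Number of units that can be lost without losing data.
        return self.parity_units

    @property
    def repair_read_amplification(self) -> int:
        # Bytes read per byte rebuilt: one copy for replication,
        # k chunks for a classic MDS erasure code.
        return self.data_units


def replication(copies: int) -> RedundancyScheme:
    # n-way replication seen as 1 data unit plus (n - 1) redundant copies.
    return RedundancyScheme(f"{copies}-way replication", 1, copies - 1)


def reed_solomon(k: int, m: int) -> RedundancyScheme:
    return RedundancyScheme(f"RS({k},{m})", k, m)


if __name__ == "__main__":
    for scheme in (replication(3), reed_solomon(6, 3)):
        print(
            f"{scheme.name}: overhead x{scheme.storage_overhead:.2f}, "
            f"tolerates {scheme.tolerated_failures} failures, "
            f"repair reads x{scheme.repair_read_amplification}"
        )
```

Under these assumptions, EC halves the storage footprint while tolerating one more failure, but each repair reads six times more data than it rebuilds, which is one source of the access and energy costs mentioned above.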
References
[1] Yacine Taleb, Shadi Ibrahim, Gabriel Antoniu, and Toni Cortes: Characterizing performance and energy-efficiency of the RAMCloud storage system. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 1488–1498, 2017.
[2] Jad Darrous and Shadi Ibrahim: Understanding the performance of erasure codes in Hadoop Distributed File System. In Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS '22), pages 24–32, 2022.
[3] Jad Darrous, Shadi Ibrahim, and Christian Perez: Is it time to revisit erasure coding in data-intensive clusters? In 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 165–178, 2019.
[4] K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran: EC-Cache: load-balanced, low-latency cluster caching with online erasure coding. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16).
Main activities
This PhD thesis will address the problem of improving the energy efficiency of Big Data services by exploring data reduction and protection schemes (i.e., erasure codes). This research is expected to bring innovative contributions in the following respects:
- As a first step, we need to profile and classify applications according to their objectives (energy, performance, durability, etc.), access patterns, and deployment modes, and to study and model the performance, energy consumption, and data loss of these applications under EC and replication;
- Data come in different sizes and have different temperatures (frequency of access). Accordingly, a hybrid scheme (using replication and EC) is more practical for heterogeneous data (for example, EC may not be the best choice for small files). It is therefore essential to evaluate the cost of transforming data between replication and EC when hybrid schemes are used;
- Based on the performance models and the cost model, we will propose innovative data placement and retrieval strategies that optimize the performance and energy consumption of EC, taking into consideration the location of users, the desired performance, the availability of high-speed hardware, and the availability of green energy sources.
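As an illustration of the hybrid idea in the list above, here is a minimal Python sketch of a size- and temperature-based policy, together with a rough estimate of the I/O needed to convert a replicated object to EC. Everything in it is an assumption made for illustration: the function names, the 1 MiB and 100 reads/day thresholds, and the conversion model (the object is read once and all k + m chunks are written from scratch). It does not describe the strategies the thesis will actually propose.

```python
MIB = 1024 * 1024


def choose_scheme(
    size_bytes: int,
    reads_per_day: float,
    small_file_bytes: int = 1 * MIB,   # hypothetical threshold
    hot_reads_per_day: float = 100.0,  # hypothetical threshold
) -> str:
    """Pick a redundancy scheme for one object (illustrative policy only)."""
    # Small or frequently accessed (hot) objects stay replicated;
    # large, rarely accessed (cold) objects are erasure coded.
    if size_bytes < small_file_bytes or reads_per_day >= hot_reads_per_day:
        return "replication"
    return "erasure_coding"


def conversion_io_bytes(size_bytes: int, k: int, m: int) -> tuple[int, int]:
    """Estimate (bytes_read, bytes_written) to re-encode a replicated
    object as RS(k, m), assuming a full read and a full rewrite."""
    chunk_bytes = (size_bytes + k - 1) // k  # ceiling division
    bytes_read = size_bytes
    bytes_written = chunk_bytes * (k + m)
    return bytes_read, bytes_written


if __name__ == "__main__":
    objects = {
        "thumbnail": (4 * 1024, 1.0),
        "cold_backup": (512 * MIB, 2.0),
        "hot_video": (512 * MIB, 1000.0),
    }
    for name, (size, reads) in objects.items():
        print(name, choose_scheme(size, reads))
    print(conversion_io_bytes(512 * MIB, k=6, m=3))
```

A real cost model would also need to account for network transfers, encoding CPU time, and the energy drawn by each step, which is precisely what the profiling and modeling work is meant to quantify.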
Skills
- An excellent Master's degree in computer science or equivalent
- Strong knowledge of distributed systems
- Knowledge of storage and distributed file systems
- Strong programming skills (C/C++, Python)
- Work experience in Big Data management, Cloud Computing, or Data Analytics is an advantage
- Very good communication skills in oral and written English
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
General information
- Theme/Domain: Distributed systems and middleware; Systems & networks (BAP E)
- City: Rennes
- Inria center: Centre Inria de l'Université de Rennes
- Desired start date: 2023-10-01
- Contract duration: 3 years
- Application deadline: 2023-08-20
Contacts
- Inria team: MYRIADS
- PhD supervisor: Ibrahim Shadi / Shadi.Ibrahim@inria.fr
About Inria
Inria is France's national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, involve more than 3,500 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on many talents in more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 180 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society, and the economy.
Instructions for applying
Defense security:
This position is likely to be assigned to a restricted-access area (zone à régime restrictif, ZRR), as defined in Decree No. 2011-1425 on the protection of the nation's scientific and technical potential (PPST). Authorization to access such an area is granted by the head of the institution, following a favorable ministerial opinion, as defined in the order of 3 July 2012 relating to the PPST. An unfavorable ministerial opinion for a position assigned to a ZRR would result in the cancellation of the recruitment.
Recruitment policy:
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Warning: Applications must be submitted online on the Inria website. The processing of applications sent through other channels is not guaranteed.