2018-00574 - HPC-Big Data convergence at processing level by bridging in situ/in transit processing with Big Data analytics
Le descriptif de l’offre ci-dessous est en Anglais

Type de contrat : CDD de la fonction publique

Niveau de diplôme exigé : Bac + 5 ou équivalent

Fonction : Doctorant

Niveau d'expérience souhaité : Jeune diplômé

Contexte et atouts du poste

Financing Project

This PhD will be done in the context of the Inria Project Lab (IPL) HPC-BigData: High Performance Computing and Big Data. The goal of this IPL is to gather teams from HPC, Big Data and Machine Learning (ML) areas to work at the intersection between these domains.  External partners include: ATOS/Bull,  Argonne National Lab (ANL), Laboratoire de Biochimie Théoerique (LBT), CNRS, ESI-Group, Grid’5000.

Mobility

The PhD position is based in Rennes, at the Inria / IRISA center. The candidate is also expected to be hosted for 3 month internships at other partners of the consortium : the Inria’s Zenith team in Montpellier (France), Argonne National Laboratory (USA).

Mission confiée

In more and more application areas, data volumes increase exponentially and are produced at increasing speed. Keeping pace with the overwhelming data volumes and velocity is critical, in order to allow meaningful extraction of insights from data and enable precise predictions and relevant decision making, despite this increasingly challenging context. To achieve this goal, the ability to perform precise analytics at extreme scales gains major importance, in a context where the scale at which data is produced and consumed is also increasing.

In the extreme computing area, the need to get fast and relevant insights from massive amounts of data generated by extreme-scale computations led to the emergence of in situ and in transit processing approaches. They allow data to be visualized and processed in real-time, in an interactive way, as they are produced, as opposed to traditional approach consisting of transferring data off-site after the end of the computation, for offline analysis. In the Big Data area, the search for real-time, fast analysis was materialized through a different approach: stream-based processing. As the tools and cultures of HPC and BDA have evolved in divergent directions motivated by different optimization criteria, it becomes clear today that leveraging together the progresses achieved in the two areas can be an efficient means addresses the Big Data challenges, which are now relevant for both HPC and BDA.

 

 

References

  • Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf. Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O. In Proc. CLUSTER – IEEE International Conference on Cluster Computing, Sep 2012, Beijing, China. URL: https://hal.inria.fr/hal-00715252
  • Matthieu Dorier, Robert Sisneros, Tom Peterka, Gabriel Antoniu, Dave Semeraro. Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework. Proc. LDAV – IEEE Symposium on Large-Scale Data Analysis and Visualization, Oct 2013, Atlanta, USA. URL: https://hal.inria.fr/hal-00859603
  • Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María Pérez-Hernández, Radu Tudoran, Stefano Bortoli, Bogdan Nicolae. Towards a Unified Storage and Ingestion Architecture for Stream Processing. Second Workshop on Real-time & Stream Analytics in Big Data Colocates with the 2017 IEEE International Conference on Big Data, Dec 2017, Boston, USA. To Appear. URL: https://hal.inria.fr/hal-01649207
  • The Pl@ntNet project: https://plantnet.org

Principales activités

Thesis goal. This PhD thesis aims to propose an approach enabling HPC-Big Data convergence at the data processing level, by exploring alternative solutions to build a unified framework for extreme-scale data processing. The architecture of such a framework will leverage the extreme scalability demonstrated by in situ/in transit data processing approaches originated in the HPC area, in conjunction with Big Data processing approaches emerged in the BDA area (batch-based, streaming-based and hybrid). The high-level goal of this framework is to enable the usage of a large spectrum of Big Data analytics techniques at extreme scales, to support precise predictions in real-time and fast decision making.

Target use cases. The thesis will start by analyzing the needs of a concrete use-case scenario available to the project: the Pl@ntNet project from the Zenith team. It exhibits challenging data- analysis requirements in terms data volumes and data processing velocity. The goal is to enable the online computation and visualization of species distribution models from Pl@ntNet data stream. The platform actually generates millions of observations each month, but today, the analysis of that data is only done punctually as an offline process. For instance, all plant observations that occurred in 2016 are crawled and analyzed by ecologists who apply various niche modeling approaches on top of that static data. The ultimate objective, however, is to allow a more dynamic and more timely monitoring of species. For instance, one would want to study the flourishing dynamics of a species in real-time, based on the analysis and visualization of the observations of the last few weeks or days.

In a second phase, we will analyze and address the requirements of a second use case on machine learning coherent diffraction data made available by the group of Tom Peterka at Argonne National Lab, with which the KerData team is collaborating.

Enabling techniques. In the process of designing the unified data processing framework, we will leverage in particular techniques for data processing already investigated by the participating teams as proof-of-concept software, validated in real-life environments:

  • The Damaris framework for scalable, asynchronous I/O and in situ and in transit visualization and processing (developed at Inria, https://project.inria.fr/damaris/). Damaris already demonstrated its scalability up to 16,000 cores on some of the top supercomputers of Top500, including Titan, Jaguar and Kraken). Developments are currently in progress in a contractual framework between Total and Inria to use Damaris for in situ visualization for extreme-scales simulations at Total.
  • The KerA approach for low-latency storage for stream processing (currently under development at Inria, in collaboration with UPM, in the framework of a contractual partnership between Inria and Huawei Munich). By eliminating storage redundancies between data ingestion and storage, preliminary experiments with KerA successfully demonstrated its capability to increase throughput for stream processing. Kera is now subject of interest for exploitation plans by Huawei.

The resulting framework will be integrated in a state-of-the-art data processing ecosystem (Spark or Flink) and allow to apply in situ/in transit advanced tools for Big Data analytics (e.g. ML-based) using stream-based techniques, to combine the result with historical data and thereby derive insights from data in real time. These insights can further be used to steer the simulation.

The thesis will be mainly hosted by the KerData team at Inria Rennes Bretagne Atlantique and will be co-advised by the Zenith team, in Montpellier, where the student is expected to be hosted for long visits. It will include collaborations with two other IPL partners: the Datamove team in Grenoble and Argonne National Lab (which provides one of the target applications).

Compétences

  • Strong knowledge of computer networks and distributed systems
  • Knowledge on storage and (distributed) file systems
  • Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
  • Strong programming skills (e.g. C/C++, Java, Python).
  • Working experience in the areas of Big Data management, Cloud computing, HPC, is an advantage

Avantages sociaux

  • Subsidised catering service
  • Partially-reimbursed public transport
  • Social security
  • Paid leave
  • Sports facilities

Rémunération

1982 euros (gross salary)