2019-01347 - Doctorant(e) F/H Enabling HPC-Big Data Convergence for Intelligent Extreme-Scale Data Analytics

Type de contrat : CDD de la fonction publique

Niveau de diplôme exigé : Bac + 5 ou équivalent

Autre diplôme apprécié : Masters of Science or Engineering

Fonction : Doctorant

A propos du centre ou de la direction fonctionnelle

Inria, l’institut national de recherche dédié aux sciences du numérique, promeut l'excellence scientifique et le transfert pour avoir le plus grand impact.
Il emploie 2400 personnes. Ses 200 équipes-projets agiles, en général communes avec des partenaires académiques, impliquent plus de 3000 scientifiques pour relever les défis des sciences informatiques et mathématiques, souvent à l’interface d’autres disciplines.
Inria travaille avec de nombreuses entreprises et a accompagné la création de plus de 160 start-up.
L'institut s’efforce ainsi de répondre aux enjeux de la transformation numérique de la science, de la société et de l’économie

Contexte et atouts du poste

  • Advisors: Gabriel Antoniu (KerData team), Alexandru Costan (KerData team), Patrick Valduriez (Zenith team)
  • Main contacts: gabriel.antoniu (at) inria.fr, alexandru.costan (at) inria.fr, patrick.valduriez (at) inria.fr, alexis.joly (at) inria.fr
  • Funding: secured (Inria), in the context of the HPC-Big Data Inria Project Lab
  • Application deadline: as early as possible, no later than May 30, 2019

Description

High-Performance Computing (HPC) refers to the use of parallel processing techniques on high-end machines (supercomputers) to solve complex problems from science and industry which require extreme amounts of computation. The major focus is on performance, therefore HPC typically relies on extreme-scale aggregation of the fastest available hardware and software technologies for processing, communication and storage. Examples of representative HPC application areas include physics, biochemistry, materials science, environment and industrial design (e.g., for car manufacturing). Such applications typically rely on modelling and simulation of the evolution of a complex system, thanks to mathematical models expressing physics-based laws underlying that system.

 

Big Data Analytics (BDA) refers to the process of examining and extracting relevant knowledge from sets of data which are so huge, exhibit such a high format variety and are generated at such a high speed that traditional systems for data storage and processing cannot be used any longer in an efficient way to extract knowledge in an acceptable time. Potentially coming from a very large variety of sources (e.g., sensors from the Internet of Things, social networks, business applications), such data are curated, (sometimes partially) stored, processed and fed into analysis engines that build representations through data-driven models that further enable descriptive, predictive and prescriptive analytics to get valuable insights for decision making.

 

Why HPC and BDA need each other. As HPC supercomputers evolve towards what is called the exascale, they enable simulations at an increasingly high precision, which results in overwhelming amounts of data generated faster and faster. The analysis of HPC-generated data is thus becoming a Big Data Analytics problem, which makes it perfectly relevant to envision the application of Big Data Analytics techniques to get relevant insights from that data. At the same time, Big Data Analytics exhibits an increasing use of high-performance computational capacities in order to allow extremely-fast knowledge extraction to be performed (e.g., real-time high-frequency analysis), to enable timely and precise decisions. Thus HPC and BDA exhibit clearer and clearer dependencies, which has recently motivated intense efforts to identify how the two areas could leverage each other in a converged way.

 

Why Artificial Intelligence is a catalyst for HPC-Big Data convergence. Big Data Analytics increasingly relies on Machine Learning (ML), a subfield of Artificial Intelligence typically used for data classification and feature extraction. While traditional ML deals with tractable feature extraction, Deep Learning recently attracted a very high interest as a particularly efficient approach when classical machine learning is intractable. DL relies on neural network representations with a high number of layers, able to learn very complex representations and subsequently use them for predictions (inference). It uses dense linear algebra kernels and allows for lower-precision representation and arithmetic, for which general-purpose GPU (GPGPU) accelerators (increasingly available on HPC systems) are a relevant infrastructure. Consequently, DL-based Big Data Analytics generates workloads that naturally fit HPC systems, thereby acting as a catalyst for HPC-Big Data convergence.

 

The overall challenge: overcome diverged cultures, methodologies and tools. HPC’s initial focus was on computational performance for tightly-coupled workloads requiring fast computation and frequent communication. The associated software stacks and programming models (e.g., MPI, OpenMP) were therefore developed and optimized accordingly. In contrast, traditional Big Data workloads are loosely coupled and can typically be divided into a huge number of independent jobs. BDA frameworks followed a scale-out model where a very high number of standard-performance processing units can be aggregated and used together without substantial communication. This favored the usage of cloud systems as reference infrastructures for BDA. Map-Reduce (consisting of two stages - map and reduce - each of which can be executed efficiently by a large set of highly parallel tasks) emerged as the dominant programming model. It was further generalized to other operators - not only map and reduce - and to multi-stage processing - not only two stages - in frameworks like Spark and Flink.

 

HPC and BDA thus underwent divergent evolutions motivated by different optimization goals. The major challenge posed by the convergence of HPC-Big Data comes precisely from the difficulty to ”put together” the inherited methodologies and tools as such, as they developed following diverging targets. For instance, preliminary experiments have shown that Big Data Analytics frameworks perform inefficiently on HPC systems and are totally ignorant of the huge optimization potential allowed by the high-performance underlying hardware.

 

 

 

Mission confiée

 

Focus of the thesis: the data processing level. In the high-performance computing area (HPC), the need to get fast and relevant insights from massive amounts of data generated by extreme-scale computations led to the emergence of in situ and in transit processing approaches. They allow data to be visualized and processed in real-time, in an interactive way, as they are produced, as opposed to traditional approach consisting of transferring data off-site after the end of the computation, for offline analysis. In the Big Data area, the search for real-time, fast analysis was materialized through a different approach: stream-based processing, in support to intelligent, ML-based data analytics.

 

Thesis goal. This PhD thesis aims to propose an approach enabling HPC-Big Data convergence at the data processing level, by exploring alternative solutions to build a unified framework for extreme-scale data processing. The architecture of such a framework will leverage the extreme scalability demonstrated by in situ/in transit data processing approaches originated in the HPC area, in conjunction with Big Data processing approaches emerged in the BDA area (batch-based, streaming-based and hybrid). The high-level goal of this framework is to enable the usage of a large spectrum of Big Data analytics techniques at extreme scales, to support precise predictions in real-time and fast decision making.

 

Principales activités

Target use cases. The thesis will start by analyzing the needs of a concrete application scenario available to the project: the Pl@ntNet project from the Zenith team. It exhibits challenging data- analysis requirements in terms data volumes and data processing velocity. The goal is to enable the online computation and visualization of species distribution models from Pl@ntNet data stream. The platform actually generates millions of observations each month, but today, the analysis of that data is only done punctually as an offline process. For instance, all plant observations that occurred in 2016 are crawled and analyzed by ecologists who apply various niche modeling approaches on top of that static data. The ultimate objective, however, is to allow a more dynamic and more timely monitoring of species. For instance, one would want to study the flourishing dynamics of a species in real-time, based on the analysis and visualization of the observations of the last few weeks or days.

In a second phase, we will analyze and address the requirements of a second use case on machine learning coherent diffraction data made available by the group of Tom Peterka at Argonne National Lab, with which the KerData team is collaborating.

Enabling techniques. In the process of designing the unified data processing framework, we will leverage in particular techniques for data processing already investigated by the participating teams as proof-of-concept software, validated in real-life environments:

  • The Damaris framework for scalable, asynchronous I/O and in situ and in transit visualization and processing (developed at Inria, https://project.inria.fr/damaris/). Damaris already demonstrated its scalability up to 16,000 cores on some of the top supercomputers of Top500, including Titan, Jaguar and Kraken). Developments are currently in progress in a contractual framework between Total and Inria to use Damaris for in situ visualization for extreme-scales simulations at Total. For the purpose of this work, Damaris will have to be extended to support Big Data analytics plugins for data processing (e.g., based on the Flink and Spark engines and on their higher-level machine-learning libraries).
  • The KerA approach for low-latency storage for stream processing (currently under development at Inria, in collaboration with UPM, in the framework of a contractual partnership between Inria and Huawei Munich). By eliminating storage redundancies between data ingestion and storage, preliminary experiments with KerA successfully demonstrated its capability to increase throughput for stream processing. Kera is now subject of interest for exploitation plans by Huawei.

The resulting framework will be integrated in a state-of-the-art data processing ecosystem (Spark or Flink) and allow to apply in situ/in transit advanced tools for Big Data analytics (e.g. ML-based) using stream-based techniques, to combine the result with historical data and thereby derive insights from data in real time. These insights can further be used to steer the simulation.

Location and Mobility

The thesis will be mainly hosted by the KerData team at Inria Rennes Bretagne Atlantique and will be co-advised by the Zenith team, in Montpellier (south of France), where the student is expected to be hosted for long visits. It will include collaborations with two other IPL partners: the DataMove team in Grenoble and Argonne National Lab (which provides one of the target applications, where the student is expected to be hosted for a 3-month internship). Rennes is the capital city of Britanny, in the western part of France. It is easy to reach thanks to the high-speed train line to Paris. Rennes is a dynamic, lively city and a major center for higher education and research: 25% of its population are students.

The KerData team in a nutshell for candidates

  • As a PhD student mainly hosted in the KerData team, you will join a dynamic and enthusiastic group, committed to top-level research in the areas of High-Perfomance Computing and Big Data Analytics. Check the team’s web site: https://team.inria.fr/kerdata/.
  • The team is leading multiple projects in top-level national and international collaborative environments, e.g., the JLESC international Laboratory on Extreme-Scale Computing: https://jlesc.github.io. It has active collaborations with top-level academic institutions all around the world (including the USA,  Mexico, Spain, Germany, Japan, Romania, etc.). The team has close connections with the industry (e.g., Microsoft, Huawei, Total).
  • The KerData team’s publication policy targets the best-level international journals and conferences of its scientific area.The team also strongly favors experimental research, validated by implementation and experimentation of software prototypes with real-world applications on real-world platforms, e.g., clouds such as Microsoft Azure and some of the most powerful supercomputers in the world.

Why joining the KerData team is an opportunity for you

  • The team's top-level collaborations strongly favor successful PhD theses dedicated to solving challenging problems at the edge of knowledge, in close interaction with top-level experts from both academia and industry. To follow the career of our former PhD students, have a look here:  https://team.inria.fr/kerdata/team-members/.
  • The KerData team is committed to personalized advising and coaching, to help PhD candidates train and grow in all directions that are critical in the process of becoming successful, top-level researchers.
  • You will have the opportunity to present your work in top level venues where you will meet the best experts in the field.
  • What you will learn. Beyond learning how to perform meaningful and impactful research,  you will acquire useful skills for communication both in written form (how to write a good paper, how to design a convincing poster) and in oral form (how to present their work in a clear, well-structured and convincing way). This is how some of our PhD students received awards in recognition to the quality of their research. Have a look here: https://team.inria.fr/kerdata/awards/.
  • Additional complementary training will be available, with the goal of preparing the PhD candidates for their postdoctoral career, should it be envisioned in academia, industry or in an entrepreneurial context, to create a startup company.  

References

  • Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf. Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O. In Proc. CLUSTER – IEEE International Conference on Cluster Computing, Sep 2012, Beijing, China. URL: https://hal.inria.fr/hal-00715252
  • Matthieu Dorier, Robert Sisneros, Tom Peterka, Gabriel Antoniu, Dave Semeraro. Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework. Proc. LDAV – IEEE Symposium on Large-Scale Data Analysis and Visualization, Oct 2013, Atlanta, USA. URL: https://hal.inria.fr/hal-00859603
  • Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María Pérez-Hernández, Radu Tudoran, Stefano Bortoli, Bogdan Nicolae. Towards a Unified Storage and Ingestion Architecture for Stream Processing. Second Workshop on Real-time & Stream Analytics in Big Data Colocates with the 2017 IEEE International Conference on Big Data, Dec 2017, Boston, USA. To Appear. URL: https://hal.inria.fr/hal-01649207
  • The Pl@ntNet project: https://plantnet.org

Financing Project

This PhD will be done in the context of the Inria Project Lab (IPL) HPC-BigData: High Performance Computing and Big Data. The goal of this IPL is to gather teams from HPC, Big Data and Machine Learning (ML) areas to work at the intersection between these domains. External partners include: ATOS/Bull, Argonne National Lab (ANL), Laboratoire de Biochimie Théorique (LBT), CNRS, ESI-Group, Grid’5000.

Compétences

  • An excellent Master degree in computer science or equivalent
  • Strong knowledge of computer networks and distributed systems
  • Knowledge on storage and (distributed) file systems
  • Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
  • Strong programming skills (e.g. C/C++, Java, Python).
  • Very good communication skills in oral and written English.
  • Open-mindedness, strong integration skills and team spirit.
  • Working experience in the areas of Big Data management, Cloud computing, HPC, is an advantage.

Avantages

  • Restauration subventionnée
  • Transports publics remboursés partiellement

Rémunération

Rémunération mensuelle brute de 1982 euros les deux premières années et 2085 euros la troisième année