2019-01429 - PhD Position F/M Towards Machine Learning Based Elastic In Situ Data Analysis for High Performance Computing Applications

Contract type: Public service fixed-term contract

Level of qualifications required: Graduate degree or equivalent

Other valued qualifications: Master of Science or Engineering

Function: PhD Position

About the research centre or Inria department

Inria, the French national research institute for the digital sciences, promotes scientific excellence and technology transfer to maximise its impact.
It employs 2,400 people. Its 200 agile project teams, generally with academic partners, involve more than 3,000 scientists in meeting the challenges of computer science and mathematics, often at the interface of other disciplines.
Inria works with many companies and has assisted in the creation of over 160 startups.
It strives to meet the challenges of the digital transformation of science, society and the economy.

Context

  • Advisors: Gabriel Antoniu (KerData team), Matthieu Dorier (Argonne National Laboratory)
  • Main contacts: gabriel.antoniu (at) inria.fr, mdorier (at) anl.gov
  • Collaboration context: JLESC International Laboratory on Extreme-Scale Computing
  • Expected start date: October 1st, 2019
  • Application deadline: 27 March 2019

Location and Mobility

The thesis will be mainly hosted by the KerData team at Inria Rennes Bretagne Atlantique and will be co-advised by Matthieu Dorier (Argonne National Laboratory, USA). Rennes is the capital city of Brittany, in the western part of France. It is easy to reach thanks to the high-speed train line to Paris. Rennes is a dynamic, lively city and a major center for higher education and research: 25% of its population are students. The candidate is also expected to be hosted for 3-month internships at other partners: Argonne National Laboratory (USA) and JAXA (Japan, to be confirmed).

The KerData team in a nutshell for candidates

  • As a PhD student mainly hosted in the KerData team, you will join a dynamic and enthusiastic group, committed to top-level research in the areas of High-Performance Computing and Big Data Analytics. Check the team’s web site: https://team.inria.fr/kerdata/.

  • The team is leading multiple projects in top-level national and international collaborative environments, e.g., the JLESC International Laboratory on Extreme-Scale Computing: https://jlesc.github.io. It has active collaborations with top-level academic institutions around the world (including in the USA, Mexico, Spain, Germany, Japan and Romania). The team also has close connections with industry (e.g., Microsoft, Huawei, Total).

  • The KerData team’s publication policy targets the top international journals and conferences in its scientific area. The team also strongly favors experimental research, validated by the implementation and experimentation of software prototypes with real-world applications on real-world platforms, e.g., clouds such as Microsoft Azure and some of the most powerful supercomputers in the world.

Why joining the KerData team is an opportunity for you

  • The team's top-level collaborations strongly favor successful PhD theses dedicated to solving challenging problems at the edge of knowledge, in close interaction with top-level experts from both academia and industry.

  • The KerData team is committed to personalized advising and coaching, to help PhD candidates train and grow in all directions that are critical in the process of becoming successful, top-level researchers.

  • You will have the opportunity to present your work in top-level venues where you will meet the best experts in the field.

  • What you will learn. Beyond learning how to perform meaningful and impactful research, you will acquire useful communication skills, both in written form (how to write a good paper, how to design a convincing poster) and in oral form (how to present your work in a clear, well-structured and convincing way).

  • Additional complementary training will be available, with the goal of preparing PhD candidates for their postdoctoral career, whether it is envisioned in academia, in industry, or in an entrepreneurial context (e.g., creating a startup company).

Assignment

Description

High-Performance Computing (HPC) refers to the use of parallel processing techniques on high-end machines (supercomputers) to solve complex problems from science and industry that require extreme amounts of computation. The major focus is on performance; therefore, HPC typically relies on extreme-scale aggregation of the fastest available hardware and software technologies for processing, communication and storage. Supercomputers are expected to reach exascale (10^18 operations per second) by 2021. With millions of cores grouped in massively multi-core nodes, such machines are capable of running scientific simulations at scales and speeds never achieved before, benefiting domains such as biology, astrophysics, or computational fluid dynamics. These simulations, however, pose important data management challenges: they typically produce very large amounts of data (on the order of petabytes) that have to be processed to extract scientific insight.

Why in situ and in transit analytics? The traditional approach for analyzing the data generated by simulations consists of transferring it from the supercomputer to persistent storage, from where it can be visualized and analyzed after the end of the computation (offline analysis). However, as exascale approaches, with challenging volumes of data generated at increasing rates, storing the data for later, “offline” analysis becomes infeasible. Hence, many computational scientists have moved to “in situ” analysis strategies, in which analysis tasks are executed on the same nodes where the simulation is running. This brings several advantages: it is no longer necessary to store the whole dataset before it is analyzed, and data transfers can even be fully avoided if the whole analytics can be performed “in situ.” In addition, it becomes possible to get early insights from the simulation while it runs, which makes it possible to act on the simulation (e.g., change its parameters or increase its resolution). When the analytics are heavier, data generated by the simulation can be transferred from the simulation resources to dedicated resources for “in transit analytics” (still performed while the simulation is running, but more loosely coupled with it), before being finally transferred to persistent storage. These techniques (in situ/in transit analytics) usually rely on libraries or middleware that interface the simulation with analysis codes.
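
To make the in situ coupling concrete, here is a minimal sketch in C. All function names are purely illustrative (they belong to no specific framework); it only shows how an analysis kernel can run on in-memory data inside the simulation loop, so that only a small derived result needs to reach storage.

    #include <stdio.h>

    /* Hypothetical stand-ins for a real solver and a real analysis kernel. */
    static void compute_step(double *field, int n) {
        for (int i = 0; i < n; i++) field[i] += 0.5 * i;  /* ... advance the simulation ... */
    }

    static double analyze_in_situ(const double *field, int n) {
        double max = field[0];  /* e.g., a reduction computed while data is still in memory */
        for (int i = 1; i < n; i++) if (field[i] > max) max = field[i];
        return max;
    }

    int main(void) {
        enum { N = 1 << 20, STEPS = 10 };
        static double field[N];
        for (int step = 0; step < STEPS; step++) {
            compute_step(field, N);
            /* Offline approach: write all of field[] to the parallel file system
               here, every step, and analyze it after the run.
               In situ approach: analyze now; only one double may need storing. */
            printf("step %d: max = %f\n", step, analyze_in_situ(field, N));
        }
        return 0;
    }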

Challenge: support elastic in situ/in transit analytics. To enable in situ/in transit analysis, various software technologies such as Damaris [1,2], Decaf [3], ADIOS [4] and FlowVR [5] have been developed. In all of these technologies, a subset of the available resources (e.g., cores or nodes) is allocated to data analysis tasks. In current solutions, however, this allocation is completely static: it is not possible for the simulation to attach or detach analysis processes while it is running, nor to dynamically provision the analysis resources. This is inappropriate, as it leads to wasted resources. Such elasticity is crucial in cases where the in situ/in transit analysis (e.g., visualization) is only needed at short, specific times (e.g., during working hours) and should be turned off otherwise (e.g., overnight). To date, no existing in situ/in transit analysis middleware provides such elasticity.

A step towards Machine Learning based resource management. Artificial intelligence techniques, including machine learning (ML) and deep learning (DL), have recently started to be considered as tools for optimizing resource usage in the HPC area. They are a means to enable autonomic data services that can adapt to how data is used and to the state of the system, thereby making resources available for more effective usage, lowering the risk of data loss, and providing more predictable performance. While some initial efforts have already been made on autonomic data services, this type of adaptation is almost absent today in HPC environments. Our approach consists of designing the middleware-level mechanisms that will enable the use of ML-based solutions for optimized resource management. In particular, we focus on enabling elastic in situ/in transit analytics at the middleware level as a step towards this goal.

Thesis goal. This PhD thesis aims to enable elastic in situ/in transit analysis. From a practical perspective, research will be conducted to explore how to support such features. After designing a technology-independent solution, these features will be implemented and evaluated, for experimental purposes, on top of the Damaris middleware. Damaris enables scalable I/O (input/output) and in situ/in transit analysis and visualization of HPC simulations; it is developed at Inria in the KerData team: https://project.inria.fr/damaris/. While simulations run on multicore nodes, Damaris makes it possible to dedicate some of the cores (or some supercomputer nodes) to data processing tasks. It also enables visualization frameworks to connect to and interact with running simulations. Damaris is used by several academic and industry partners, including Total, which uses it for in situ visualization of its geophysics simulations.

Main activities

Roadmap

The workplan for enabling elastic in situ / in transit analysis will involve the following steps:

  1. Research on (1) how to efficiently migrate and reorganize data when the amount of resources dedicated to analysis tasks changes; (2) how to do so in a way that is transparent to both the running simulation and the analysis programs; (3) how to make efficient use of RDMA (Remote Direct Memory Access) to provide such elasticity.
  2. As a refinement, design a mechanism for incrementally migrating running stream tasks from the in situ processing backend to the in transit one without stopping the ongoing analysis.
  3. Once the support for elastic in situ analytics is operational, we will seek to validate it in a context where resource allocation decisions are guided by ML-based tools that exploit historical data on resource usage (a simple placeholder policy is sketched after this list). This work will be conducted in close collaboration with experts in ML/DL from ANL and Inria.
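
As an illustration of step 3, the sketch below shows, under strong simplifying assumptions, the kind of decision loop an elasticity policy implements. The predict_backlog() placeholder marks where a model trained on historical resource-usage data would be queried; the thresholds and all names are hypothetical, not a committed design.

    #include <stdio.h>

    /* Placeholder for the ML part: a model trained on historical resource-usage
       data would be queried here. This trivial version just echoes its input. */
    static double predict_backlog(double current_backlog) {
        return current_backlog;
    }

    /* Threshold controller (hypothetical values): grow or shrink the pool of
       analysis nodes based on the predicted backlog of pending analysis tasks. */
    static int decide_analysis_nodes(int nodes, double backlog, int min, int max) {
        double predicted = predict_backlog(backlog);
        if (predicted > 0.8 && nodes < max) return nodes + 1;  /* scale out */
        if (predicted < 0.2 && nodes > min) return nodes - 1;  /* scale in  */
        return nodes;                                          /* keep as is */
    }

    int main(void) {
        double backlog[] = { 0.1, 0.5, 0.9, 0.95, 0.6, 0.1, 0.05 };
        int nodes = 2;
        for (int t = 0; t < 7; t++) {
            nodes = decide_analysis_nodes(nodes, backlog[t], 1, 8);
            printf("t=%d backlog=%.2f -> %d analysis nodes\n", t, backlog[t], nodes);
        }
        return 0;
    }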

While the primary goal is to think outside the box and freely explore the full design space in a technology-independent way, for evaluation purposes we will favor experimentation with software frameworks and technologies under development in our teams, including Damaris, Argobots and Mercury (described below). In particular, this research will support the design of Damaris 2.0 (currently in progress).

Target Use Cases

The proposed solution will be evaluated with real-life simulations. In particular, we envision the establishment of a collaboration with the Japanese Aerospace Exploration Agency (JAXA). In this context, the candidate would have an internship at JAXA to augment an existing CFD simulation with elastic in situ/in transit visualization on a heterogeneous machine (to be confirmed). Other application environments can be explored in collaboration with Argonne National Lab (USA), with which the KerData team has a long-running collaboration and where the PhD student could also make research visits.

Enabling Technologies

In the process of enabling elasticity in Damaris, we will leverage the following tools:

Damaris

Damaris is a middleware for scalable, asynchronous I/O and in situ/in transit visualization and processing, developed at Inria. Damaris has already demonstrated its scalability up to 16,000 cores on some of the top supercomputers of the Top500 list, including Titan, Jaguar and Kraken. Developments are currently in progress, in a contractual framework between Total and Inria, to use Damaris for in situ visualization of extreme-scale simulations at Total.
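
For concreteness, here is a minimal sketch of how a simulation typically interacts with Damaris through its public C API. The configuration file name, the variable name and the loop structure are assumptions made for illustration; in practice, variables, dedicated cores and plugins are declared in the XML configuration.

    #include <mpi.h>
    #include <Damaris.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* The XML file (hypothetical name) declares the dedicated cores/nodes,
           the variables and the analysis plugins. */
        damaris_initialize("simulation.xml", MPI_COMM_WORLD);

        int is_client = 0;
        damaris_start(&is_client);  /* dedicated cores enter Damaris' event loop here */
        if (is_client) {
            MPI_Comm comm;
            damaris_client_comm_get(&comm);  /* communicator of the simulation cores only */

            double pressure[1000] = { 0.0 };  /* "pressure" must be declared in the XML */
            for (int step = 0; step < 100; step++) {
                /* ... advance the simulation using comm ... */
                damaris_write("pressure", pressure);  /* hand the data to dedicated cores */
                damaris_end_iteration();              /* may trigger in situ plugins      */
            }
            damaris_stop();
        }
        damaris_finalize();
        MPI_Finalize();
        return 0;
    }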

Mercury

Mercury is a Remote Procedure Call (RPC) and Remote Direct Memory Access (RDMA) library developed by Argonne National Laboratory and The HDF Group. It is at the core of multiple DOE projects at Argonne. Mercury enables high-speed, low-latency RPCs and data transfers over a wide range of network fabrics (TCP, InfiniBand, Cray GNI, etc.). Because current MPI implementations lack flexibility in heterogeneous environments, Mercury will be used in this thesis to replace the communication layer of Damaris.
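
As a rough illustration, the sketch below sets up a minimal Mercury service exposing one RPC, following Mercury's usual init/register/progress/trigger pattern. The RPC name is hypothetical, the fabric string depends on how Mercury was built, and input/output encoding is omitted (NULL proc callbacks); this is only a skeleton, not how Damaris will actually use Mercury.

    #include <assert.h>
    #include <mercury.h>

    /* Handler for an argument-less RPC; a real handler would decode its input
       with HG_Get_input() and could start an RDMA transfer with HG_Bulk_transfer(). */
    static hg_return_t detach_handler(hg_handle_t handle) {
        /* ... e.g., mark an analysis process as detachable ... */
        return HG_Destroy(handle);
    }

    int main(void) {
        /* Listen on TCP via libfabric; the string depends on the Mercury build. */
        hg_class_t *cls = HG_Init("ofi+tcp", HG_TRUE);
        assert(cls != NULL);
        hg_context_t *ctx = HG_Context_create(cls);

        /* Register an RPC with no input/output arguments (hypothetical name). */
        HG_Register_name(cls, "detach_analysis", NULL, NULL, detach_handler);

        /* Usual Mercury service loop: make network progress, then run the
           callbacks that completed (such as detach_handler). */
        for (;;) {
            unsigned int count = 0;
            HG_Progress(ctx, 100 /* ms */);
            HG_Trigger(ctx, 0, 1, &count);
        }
        /* Unreachable here; a real service would exit the loop, then call
           HG_Context_destroy(ctx) and HG_Finalize(cls). */
    }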

Argobots

Argobots is a threading/tasking framework developed at Argonne. It enables efficient use of massively multicore architectures, targeting exascale supercomputers. It will be used to enable coroutine-style analysis plugins in Damaris.
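
To illustrate the coroutine style mentioned above, here is a small Argobots sketch in which each (hypothetical) analysis plugin runs as a user-level thread (ULT) that can cooperatively yield its core, e.g., while waiting for data:

    #include <stdio.h>
    #include <abt.h>

    /* A (hypothetical) analysis plugin written as a user-level thread (ULT):
       it can yield at blocking points, which is the coroutine-style behavior. */
    static void analysis_plugin(void *arg) {
        int iteration = *(int *)arg;
        printf("analyzing iteration %d\n", iteration);
        ABT_thread_yield();  /* give the core back, e.g. while waiting for data */
        printf("resuming analysis of iteration %d\n", iteration);
    }

    int main(int argc, char **argv) {
        ABT_init(argc, argv);

        /* Run the ULTs on the main pool of the primary execution stream. */
        ABT_xstream xstream;
        ABT_pool pool;
        ABT_xstream_self(&xstream);
        ABT_xstream_get_main_pools(xstream, 1, &pool);

        int iterations[4] = { 0, 1, 2, 3 };
        ABT_thread threads[4];
        for (int i = 0; i < 4; i++)
            ABT_thread_create(pool, analysis_plugin, &iterations[i],
                              ABT_THREAD_ATTR_NULL, &threads[i]);
        for (int i = 0; i < 4; i++)
            ABT_thread_free(&threads[i]);  /* joins, then releases, each ULT */

        ABT_finalize();
        return 0;
    }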

The thesis will be mainly hosted by the KerData team at Inria Rennes Bretagne Atlantique. It will include collaborations with Argonne National Lab (which provides some of the tools for RPC, RDMA and threading that we intend to use) and JAXA (which will provide the main use case).

References

[1] M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf. “Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O”, In Proc. CLUSTER – IEEE International Conference on Cluster Computing, Sep 2012, Beijing, China. URL: https://hal.inria.fr/hal-00715252

[2] M. Dorier, R. Sisneros, T. Peterka, G. Antoniu, D. Semeraro, “Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework”, Proc. LDAV – IEEE Symposium on Large-Scale Data Analysis and Visualization, Oct 2013, Atlanta, USA. URL: https://hal.inria.fr/hal-00859603

[3] M. Dreher, T. Peterka, “Decaf: Decoupled Dataflows for In Situ High-Performance Workflows”, Technical Report, United States, doi:10.2172/1372113, https://www.osti.gov/servlets/purl/1372113/, 2017.

[4] Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, M. Parashar, N. Samatova, K. Schwan, A. Shoshani, M. Wolf, K. Wu, W. Yu, “Hello ADIOS: The Challenges and Lessons of Developing Leadership Class I/O Frameworks”, Concurrency and Computation: Practice and Experience, vol. 26, no. 7, pp. 1453-1473, 2013.

[5] M. Dreher, Bruno Raffin, “A Flexible Framework for Asynchronous In Situ and In Transit Analytics for Scientific Simulations”, ACM/IEEE International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Chicago, IL, 2014.

Financing Project

This PhD will be carried out in the context of the JLESC International Laboratory on Extreme-Scale Computing. The selected candidate will be supported by the team in applying for various funding schemes, depending on eligibility (details are available from the PhD advisors).


Skills

  • An excellent Master's degree in computer science or equivalent
  • A strong background in HPC is highly appreciated
  • Strong knowledge of parallel and distributed systems
  • Knowledge of storage and (parallel/distributed) file systems
  • Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
  • Strong programming skills, in particular in C/C++ (including, if possible, C++14), and in at least one scripting language (e.g., Python or Ruby)
  • Strong software design skills (knowledge of design patterns in C/C++)
  • Working experience in the areas of Big Data management, cloud computing or HPC is an advantage
  • Very good communication skills in oral and written English
  • Open-mindedness, strong integration skills and team spirit

Benefits package

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

Monthly gross salary amounting to 1,982 euros for the first and second years and 2,085 euros for the third year