PhD Position F/M [Doctoral Campaign] Graph Representation for Multivariate Time Series Analytics
Contract type : Fixed-term contract
Level of qualifications required : Graduate degree or equivalent
Function : PhD Position
Context
Massive collections of time-varying data (i.e., time series or data series in general) are becoming a reality in virtually every scientific and social domain. Examples of fields that involve data series include finance, environmental sciences, astrophysics, neuroscience, engineering, and multimedia. What makes these data challenging is that they are often highly multivariate, and the different dimensions that compose them may originate from different sources.
This high number of dimensions from different sources causes severe limitations. Existing solutions typically employ one model per dimension or data type. This implies (i) a drop in accuracy because correlations among important dimensions are missed, (ii) a significant increase in execution time because many independent models must be run, and (iii) a drop in interpretability because of the multitude of embeddings produced by these independent models. To reach efficient and scalable analysis without sacrificing accuracy, we need a unified data embedding that can enable multiple analytic tasks (such as anomaly detection, classification, and clustering) on multivariate and heterogeneous data series.
The objective is to move towards such a unified data embedding. In that direction, we previously proposed Series2graph, a method that summarizes univariate time series into a graph [1,2]. Although this method was proposed mainly for anomaly detection, similar graph embeddings for time series have demonstrated state-of-the-art and scalable results for tasks such as clustering [3], classification, and representation learning [4]. The benefit of such a graph representation of time series is three-fold. (i) It is easy for any user to interpret. (ii) It can be combined with other graph-represented data (such as ontologies, knowledge graphs, and textual data represented as graphs [5]). (iii) A single unified embedding can significantly reduce the analysis execution time (as shown for anomaly detection [1]).
However, no existing method proposes a unified graph embedding for multivariate time series. The straightforward solution would be to build one graph embedding per dimension and then analyze them all together. The total graph size would then grow linearly with the number of dimensions, making this approach impractical for highly multivariate data. In the case of heterogeneous multivariate time series, no holistic graph representation exists, and we need novel approaches to address this problem.
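To make the scaling issue of the straightforward solution concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions and is not the Series2graph algorithm: each dimension is quantized into equal-width bins (nodes), and transitions between consecutive bins are counted (weighted edges). The function names `series_to_graph`, `naive_multivariate_graphs`, and `total_size`, as well as the `n_bins` parameter, are hypothetical and introduced only for this example.

```python
from collections import Counter


def series_to_graph(series, n_bins=4):
    """Toy graph summary of one univariate series (NOT Series2graph).

    Nodes are equal-width value bins; edges count transitions between
    the bins of consecutive points.
    """
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant series: every point falls into bin 0
    # Clamp the maximum value into the last bin.
    path = [min(int((x - lo) / width), n_bins - 1) for x in series]
    edges = Counter(zip(path, path[1:]))
    return set(path), edges


def naive_multivariate_graphs(dimensions, n_bins=4):
    """Straightforward solution: one independent graph per dimension."""
    return [series_to_graph(dim, n_bins) for dim in dimensions]


def total_size(graphs):
    """Total number of nodes and edges over all per-dimension graphs."""
    return sum(len(nodes) + len(edges) for nodes, edges in graphs)


if __name__ == "__main__":
    base = [0, 1, 2, 3, 0, 1, 2, 3]
    for k in (1, 2, 4):
        graphs = naive_multivariate_graphs([base] * k)
        print(k, total_size(graphs))  # prints 8, then 16, then 32
```

In this toy setting, the total size is exactly proportional to the number of dimensions, since no structure is shared between the per-dimension graphs. A unified embedding would instead aim to represent all dimensions in one graph.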
References:
- [1] Boniol, P., and Palpanas, T. Series2graph: Graph-based subsequence anomaly detection for time series. Proc. VLDB Endow. 13, 12 (July 2020), 1821–1834.
- [2] Schneider, J., Wenig, P., and Papenbrock, T. Distributed detection of sequential anomalies in univariate time series. The VLDB Journal 30, 4 (2021), 579–602.
- [3] Tiano, D., Bonifati, A., and Ng, R. Featts: Feature-based time series clustering. In Proceedings of the 2021 International Conference on Management of Data (New York, NY, USA, 2021), SIGMOD ’21, Association for Computing Machinery, pp. 2784–2788.
- [4] Cheng, Z., Yang, Y., Jiang, S., Hu, W., Ying, Z., Chai, Z., and Wang, C. Time2graph+: Bridging time series and graph representation learning via multiple attentions. IEEE Transactions on Knowledge and Data Engineering (2021), 1–1.
- [5] Boniol, P., Panagopoulos, G., Xypolopoulos, C., Hamdani, R. E., Amariles, D. R., and Vazirgiannis, M. Performance in the courtroom: Automated processing and visualization of appeal court decisions in France. NLLP workshop of the KDD Conference (2020).
Supervision:
The thesis will be co-supervised by Paul Boniol (Valda team, DI ENS & Inria Paris) and Michaël Thomazo (Valda team, DI ENS & Inria Paris). The PhD student will be part of the Valda team within the Computer Science Department of the École normale supérieure. Registration for the thesis will be carried out at Université PSL, via École doctorale 386 (Sciences mathématiques de Paris Centre). The PhD student will benefit from the environment and resources of the Valda team, the DI ENS, the Inria Paris Center, and the PRAIRIE Institute, including local computing clusters. In addition, the PhD student will have access to the IDRIS Jean Zay supercomputer for GPU-intensive tasks.
Assignment
Research objective
The objective of this PhD is to propose new, meaningful graph representations and transformations for multivariate time series that can support basic analytics (classification, clustering, and anomaly detection). Overall, the research questions tackled are the following:
- Can a single graph embedding method be more accurate on multiple analytical tasks than methods designed specifically for each task?
- Can a single graph embedding be constructed for large heterogeneous multivariate time series that preserves accuracy while remaining scalable?
- How can such an embedding be used to interpret and explain analytical tasks (e.g., classification, clustering, anomaly detection)?
Application and use cases:
Time series analysis is a key task in applications related to electricity production. The ability to analyze large quantities of data efficiently and to express complex queries (such as anomaly discovery) can be crucial for industrial actors like EDF. For instance, one crucial goal for EDF is to improve the safety and availability of its electrical power plants by detecting anomalies that may occur. As massive gains are expected from reducing maintenance volumes, there is a serious need for accurate and efficient algorithms that detect anomalies and help understand their origins.
Moreover, EDF has collected sensor data in every nuclear power plant for decades (at least 20 years). With a total of 58 nuclear power plants and more than 2000 sensors per unit, this represents a database of approximately 500 terabytes. Since the power plants were built 20 to 30 years ago, we can expect recent maintenance and new power plants to significantly increase the number of sensors and the acquisition rate, resulting in an exponential increase in new data. In addition, half of the 2000 sensors of an EDF power plant are boolean sensors, and the remaining half measure water flow, pressure, temperature, or water level in very different parts of the plant, making each sensor almost unique. On top of these already highly heterogeneous data series, the EDF database contains multiple logs (i.e., textual data) and structural knowledge (i.e., knowledge graphs representing the structure of the plant and the links between sensors).
The context described above is closely related to the problems that will be tackled in this PhD. Building on previous collaborations between Paul Boniol and the research department of EDF, the PhD candidate may have the opportunity to apply the research to such use cases.
Main activities
Main tasks:
- Acquire an exhaustive understanding of the literature on graph representation for time series and graph-based methods for time series analytics.
- Propose and implement a new graph representation for multivariate time series.
- Evaluate the proposed solution on publicly available benchmarks (UCR-Archive for classification and clustering and equivalents of TSB-UAD for multivariate time series).
- Study the impact of heterogeneous multivariate time series (i.e., various acquisition rates, types of time series, stationarity, etc.) on the unified graph embedding.
- Propose interpretable solutions based on the graph representation of time series for specific analytical tasks.
Additional tasks:
- Write scientific research papers with the objective of publishing them in top data analytics and data management conferences and journals.
Benefits package
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking
- Flexible organization of working hours (after 12 months of employment)
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
Remuneration
Monthly gross salary : 2100 € during the first and second years, 2190 € during the last year.
General Information
- Theme/Domain : Data and Knowledge Representation and Processing
- Town/city : Paris
- Inria Center : Centre Inria de Paris
- Starting date : 2024-10-01
- Duration of contract : 3 years
- Deadline to apply : 2024-05-19
Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
Instruction to apply
To apply, please include in your application:
- Covering letter highlighting the relevance of the candidate's training to the topic.
- CV
- Master's grades
Defence Security :
This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter such an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.
Recruitment Policy :
As part of Inria's diversity policy, all positions are accessible to people with disabilities.
Contacts
- Inria Team : VALDA
- PhD Supervisor : Boniol Paul / paul.boniol@inria.fr
The keys to success
The doctoral student must have obtained a Master's degree or equivalent in computer science or mathematics. He or she should have had courses and initial research experience in one of the following fields: artificial intelligence, data management, statistical learning, information retrieval. He or she should be comfortable with large-scale data processing and the use of modern artificial intelligence techniques, particularly deep learning. He or she should be able to read and write research articles in English.
About Inria
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.