PhD Position F/M [Doctoral Campaign] Graph Representation for Multivariate Time Series Analytics

The job description below is in English

Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac + 5) or equivalent

Role: PhD student

Context and advantages of the position

Massive collections of time-varying data (i.e., time series or data series in general) are becoming a reality in virtually every scientific and social domain. Examples of fields that involve data series include finance, environmental sciences, astrophysics, neuroscience, engineering, and multimedia. What makes these data challenging is that they are often highly multivariate, and the dimensions that compose them may originate from different sources.

However, this high number of dimensions from different sources causes severe limitations. Existing solutions employ one model per dimension or data type. This implies (i) a drop in accuracy because correlations among important dimensions are missed, (ii) a significant increase in execution time because of all the independent models that are used, and (iii) a drop in interpretability because of the multitude of embeddings produced by these independent models. To reach efficient and scalable analysis without sacrificing accuracy, we need a unified data embedding that can enable multiple analytic tasks (such as anomaly detection, classification, and clustering) on multivariate and heterogeneous data series.

The objective is therefore to move towards such a unified data embedding. In that direction, we previously proposed Series2graph, a method that summarizes univariate time series into a graph [1,2]. Even though this method was proposed mainly for anomaly detection, similar graph embeddings for time series have demonstrated state-of-the-art and scalable results for tasks such as clustering [3], classification, and representation learning [4]. The benefit of such a time series graph representation is three-fold. (i) First, such a graph representation is easy for any user to interpret. (ii) Second, it can benefit from other graph-represented data (such as ontologies, knowledge graphs, and textual data represented as graphs [5]). (iii) Last, one unified embedding can significantly reduce the analysis execution time (as shown for anomaly detection [1]).

However, no existing method proposes a unified graph embedding for multivariate time series. The straightforward solution would be to build one graph embedding per dimension and then analyze them all together. The total graph size would then grow linearly with the number of dimensions, making this approach impractical for highly multivariate data. In the case of heterogeneous multivariate time series, no holistic graph representation exists, and we need novel approaches to address this problem.
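
As a rough illustration of why the straightforward per-dimension approach scales poorly, the sketch below builds one small graph per dimension and sums their sizes. It is not Series2graph: it uses a deliberately simplified stand-in (equal-width value bins as nodes, transitions between consecutive points as weighted edges), and all function names and the n_bins parameter are hypothetical, chosen only for this example.

```python
from collections import Counter


def series_to_graph(values, n_bins=4):
    """Turn one univariate series into a weighted transition graph.

    Simplified stand-in (assumption, not the Series2graph algorithm):
    nodes are equal-width value bins, and an edge (a, b) counts how often
    the series moves from bin a to bin b between consecutive points.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a flat series
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    edges = Counter(zip(bins, bins[1:]))
    return set(bins), edges


def per_dimension_graphs(multivariate, n_bins=4):
    """Naive baseline: one independent graph per dimension."""
    return [series_to_graph(dim, n_bins) for dim in multivariate]


def total_size(graphs):
    """Number of nodes and distinct edges summed over all graphs."""
    nodes = sum(len(n) for n, _ in graphs)
    edges = sum(len(e) for _, e in graphs)
    return nodes, edges


if __name__ == "__main__":
    series = [0, 1, 2, 3, 0, 1, 2, 3]
    for d in (1, 10, 100):
        # Toy multivariate series made of d identical dimensions.
        print(d, total_size(per_dimension_graphs([series] * d)))
```

With this toy input, each dimension contributes 4 nodes and 4 distinct edges, so the combined size is 4d nodes and 4d edges for d dimensions. A unified embedding would aim to keep this size from growing in direct proportion to d.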

References:

  1. Boniol, P., and Palpanas, T. Series2graph: Graph-based subsequence anomaly detection for time series. Proc. VLDB Endow. 13, 12 (July 2020), 1821–1834.
  2. Schneider, J., Wenig, P., and Papenbrock, T. Distributed detection of sequential anomalies in univariate time series. The VLDB Journal 30, 4 (2021), 579–602.
  3. Tiano, D., Bonifati, A., and Ng, R. Featts: Feature-based time series clustering. In Proceedings of the 2021 International Conference on Management of Data (New York, NY, USA, 2021), SIGMOD ’21, Association for Computing Machinery, p. 2784–2788.
  4. Cheng, Z., Yang, Y., Jiang, S., Hu, W., Ying, Z., Chai, Z., and Wang, C. Time2graph+: Bridging time series and graph representation learning via multiple attentions. IEEE Transactions on Knowledge and Data Engineering (2021), 1–1.
  5. Boniol, P., Panagopoulos, G., Xypolopoulos, C., Hamdani, R. E., Amariles, D. R., and Vazirgiannis, M. Performance in the courtroom: Automated processing and visualization of appeal court decisions in France. NLLP workshop of the KDD Conference (2020).

 

Supervision:

The thesis will be co-supervised by Paul Boniol (Valda team, DI ENS & Inria Paris) and Michaël Thomazo (Valda team, DI ENS & Inria Paris). The PhD student will be part of the Valda team within the Computer Science Department of the École normale supérieure. Registration for the thesis will be carried out at Université PSL, via École doctorale 386 (Sciences mathématiques de Paris Centre). The PhD student will benefit from the environment and resources of the Valda team, the DI ENS, the Inria Paris Center, and the PRAIRIE Institute, including local computing clusters. In addition, the PhD student will have access to the IDRIS Jean Zay supercomputer for GPU-intensive tasks.

Assignment

Research objective

The objective of this PhD is to propose new, meaningful graph representations and transformations for multivariate time series that can support basic analytics (classification, clustering, and anomaly detection). Overall, the research questions tackled are the following:

  • Can a single graph embedding method be more accurate on multiple analytical tasks than specific methods for each task?
  • Can a single graph embedding be constructed for large heterogeneous multivariate time series that preserves accuracy while remaining scalable?
  • How can such an embedding be used to interpret and explain analytical tasks (e.g., classification, clustering, anomaly detection)?

 

Application and use cases:

Time series analysis is a very important task for applications related to electricity production. Indeed, the ability to analyze large quantities of data efficiently and to express complex queries (e.g., anomaly discovery) can be crucial for industrial actors like EDF. For instance, one crucial goal for EDF is to improve the safety and availability of its electrical power plants by detecting anomalies that could occur. As massive gains are expected from reducing maintenance volumes, there is a serious need for accurate and efficient algorithms to detect anomalies and understand their origins. Moreover, EDF has collected sensor data in every nuclear power plant for decades (at least 20 years). With a total of 58 nuclear power plants and more than 2000 sensors per unit, this represents a database of approximately 500 terabytes. Considering that the electrical power plants were built 20 to 30 years ago, we can expect that recent maintenance and new power plants will significantly increase the number of sensors and the acquisition rate, resulting in an exponential increase of new data. In addition, half of the 2000 EDF electrical power plant sensors are boolean sensors, and the remaining half measure water flow, pressure, temperature, or water level in very different parts of the plant, making each sensor almost unique. Beyond these already highly heterogeneous data series, the EDF database contains multiple logs (i.e., textual data) and structural knowledge (i.e., knowledge graphs representing the structure of the plant and the links between sensors). The context described above is closely related to the problems that will be tackled in this PhD. Thus, benefiting from previous collaborations of Paul Boniol with the research department of EDF, the PhD candidate might have the opportunity to apply the research conducted to such use cases.

Main activities

Main tasks:

  • Acquire an exhaustive understanding of the literature on graph representation for time series and graph-based methods for time series analytics.
  • Propose and implement a new graph representation for multivariate time series.
  • Evaluate the proposed solution on publicly available benchmarks (UCR-Archive for classification and clustering and equivalents of TSB-UAD for multivariate time series). 
  • Study the impact of heterogeneous multivariate time series (e.g., various acquisition rates, types of time series, or stationarity) on the unified graph embedding.
  • Propose interpretable solutions based on the graph representation of time series for specific analytical tasks.

Additional tasks:

  • Write scientific research papers with the objective of publishing them at top data analytics and data management conferences and in top journals.

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking
  • Flexible organization of working hours (after 12 months of employment)
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training

Remuneration

Monthly gross salary: 2100 € during the first and second years, 2190 € during the last year.