PhD Position F/M Exploitation and Structuring of Heterogeneous Geological Data and Knowledge

The job description below is in English

Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac + 5) or equivalent

Role: PhD student

Context and advantages of the position

This PhD position is part of a national collaboration between Inria and BRGM (the French geological survey) on augmenting the scientific process of geologists. More specifically, the position is about exploiting the data and knowledge available to geologists before field campaigns about the location to be surveyed, in the form of previous reports, maps, scientific publications, databases, etc.

The PhD will be co-supervised by:

  • Pierre Senellart (Valda team at Inria Paris, where the PhD student will be primarily located)
  • Ioana Manolescu (Cedar team at Inria Saclay)
  • Cécile Gracianne (BRGM)

Assigned mission

Motivation: Careful preparation is essential for organizing a field campaign. It covers both logistical tasks, such as acquiring and readying equipment and arranging travel, and scientific considerations. Before each field campaign, a preparatory phase uses existing knowledge to scale the data acquisition effort appropriately. This involves aligning the scientific requirements and objectives of the campaign with the project's constraints, including budget, time, and data management. Geologists draw on unstructured, semi-structured, and structured data sources, such as scientific reports, publications, and databases, to improve their understanding of the study area before selecting and developing the most promising scientific hypotheses to test or confirm during the field campaign. Throughout the campaign, geologists generate data of varying structure from their observations and measurements. Being able to compare the acquired data with the initial hypotheses in real time during the campaign, rather than upon return, allows the action plan to be adjusted in response to unforeseen constraints, such as inaccessible measurement sites or changes in the relevance of certain points. The PhD focuses on developing tools and methodologies to select, extract, and link the data needed to set up a field campaign, while promoting the on-site use of the data acquired during the campaign.

Challenges: The PhD addresses the question of making BRGM's wealth of information more accessible and reusable, by endowing it with metadata or restructuring it for more effective use. This involves tackling several scientific challenges, notably the intricate task of extracting information from BRGM's diverse document corpora, with the aim of efficiently incorporating geographical/spatial and other annotations into the data and documents.

Assignment: The PhD student will be tasked with developing a methodology to automatically build a data warehouse from the information available to geologists. Such a warehouse is multimodal, as it mixes text, images, and different forms of structured content. Information extraction techniques will be used to extract data from raw documents (e.g., tables of data values from PDFs; coordinates of specific geological features from maps; identifications of geological layers from diagrams) and to enrich them with accompanying metadata. Deep learning techniques can be used to construct representations of the different modalities, which will then be combined in a global model used for information extraction. Integration of data from different sources and semantization of their content will be performed using Open Information Extraction techniques, in connection with knowledge bases such as Wikidata, which provide basic knowledge about minerals or locations.

Main activities

Main tasks:

  • Acquire an exhaustive understanding of the literature on information extraction, data warehousing, and data semantization.
  • Propose and implement approaches to extract, structure, and exploit different data types, taking into account their heterogeneity.
  • Evaluate the proposed solution on publicly available and BRGM-specific benchmarks.
  • Keep track of the uncertainty and provenance of data items.
  • Write scientific research papers with the aim of publishing them in top data analytics and data management conferences and journals.

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off under RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage