PhD Position F/M: Semantically enriched queries and analysis of metagenomic datasets

Contract type : Fixed-term contract

Level of qualifications required : Graduate degree or equivalent

Function : PhD Position

Level of experience : Recently graduated

About the research centre or Inria department

The Inria Centre at Rennes University is one of Inria's eight centres and has more than thirty research teams. The Inria Centre is a major and recognized player in the field of digital sciences. It is at the heart of a rich R&D and innovation ecosystem: highly innovative SMEs, large industrial groups, competitiveness clusters, research and higher education players, laboratories of excellence, technological research institutes, etc.

Context

Genomic data enable critical advances in medicine, ecology, ocean monitoring, and agronomy. A major limitation is that these data (petabytes of sequences) cannot currently be queried in their entirety.

The OmicFinder project (https://project.inria.fr/omicfinder/) will provide a search engine able to overcome this bottleneck. The central algorithmic idea of a genomic search engine is to index and query small exact words (hundreds of billions over millions of datasets), as well as the associated metadata. The project brings together Inria teams working on string algorithms, ontologies, computing architectures, and data distribution. They will bring algorithmic advances including frugal computation, clever index distribution, and smart ontology-based filtering of questions and answers.

The core idea of OmicFinder is to build an index of small exact words present in millions of datasets, so that a query on this index returns the list of datasets that have at least one sequence containing the queried word. This corresponds to the syntactic aspect of query resolution.
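The indexing idea described above can be illustrated with a minimal sketch. It is a toy in-memory inverted index, not the OmicFinder implementation: the word length `K`, the function names, and the two datasets are assumptions made up for the example, and a real index over petabytes of sequences would need far more compact and distributed data structures.

```python
from collections import defaultdict

K = 5  # toy word length, chosen only for illustration (assumption)


def words(sequence, k=K):
    """Yield every exact word of length k found in a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(datasets, k=K):
    """Map each word to the set of dataset identifiers that contain it."""
    index = defaultdict(set)
    for dataset_id, sequences in datasets.items():
        for sequence in sequences:
            for word in words(sequence, k):
                index[word].add(dataset_id)
    return index


def query(index, word):
    """Return the sorted identifiers of datasets with a sequence containing the word."""
    return sorted(index.get(word, set()))


# Hypothetical toy datasets, for illustration only.
datasets = {
    "gut_sample": ["ACGTACGTTA", "GGGTACCA"],
    "ocean_sample": ["TTACGTAGGA"],
}
index = build_index(datasets)
print(query(index, "ACGTA"))
```

The query only answers the syntactic question "which datasets contain this word"; it says nothing about what those datasets are.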

Assignment

The expected benefits of enriching queries with semantic annotations of the datasets are twofold.

Smart queries. First, this will allow users to specify *a priori* relevance criteria that reduce noise and improve performance. For example, a user can specify an interest in the human gut microbiome, so that datasets containing sequences that match the word but were obtained during a Tara oceanic expedition can be ignored. Even better, OmicFinder will not even route this query to the Tara repository, avoiding unnecessary computation. Note that we want to support multiple levels of granularity, in order to focus on, for example, the mammal gut microbiome or the omnivorous mammal gut microbiome.
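As a rough illustration of such *a priori* filtering, the sketch below selects datasets whose annotation falls under a requested term and drops repositories with no matching dataset before any sequence query is sent. The term hierarchy, repository names, and dataset identifiers are all hypothetical; an actual system would presumably rely on curated ontologies rather than a hand-written dictionary.

```python
# Hypothetical toy hierarchy: each term points to its more general parent.
PARENT = {
    "human gut microbiome": "omnivorous mammal gut microbiome",
    "omnivorous mammal gut microbiome": "mammal gut microbiome",
    "mammal gut microbiome": "microbiome",
    "ocean microbiome": "microbiome",
}

# Hypothetical repositories, with one annotation per dataset.
REPOSITORIES = {
    "gut_repo": {"d1": "human gut microbiome", "d2": "mammal gut microbiome"},
    "tara_repo": {"d3": "ocean microbiome"},
}


def is_a(term, ancestor):
    """True if term equals ancestor or is one of its descendants."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def plan_query(criterion):
    """List matching datasets per repository; repositories with none are skipped."""
    plan = {}
    for repo, annotations in REPOSITORIES.items():
        matching = sorted(d for d, term in annotations.items() if is_a(term, criterion))
        if matching:
            plan[repo] = matching
    return plan


print(plan_query("mammal gut microbiome"))
print(plan_query("human gut microbiome"))
```

Walking up the hierarchy is what gives the multiple levels of granularity: a broad criterion matches every dataset annotated with a more specific term, while a narrow one matches fewer.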

Smart answers. Second, it will allow the OmicFinder query engine to provide an *a posteriori* characterization of the datasets, similar to classical enrichment analyses. Typically, one could compare the frequencies of annotations in the datasets returned by the query with the frequencies of the same annotations among the whole set of datasets, or among the datasets that match the semantic criteria. For example, one could find that the datasets returned by a query for a particular word, restricted to datasets related to the human gut microbiome, are enriched in liver-related diseases compared to the datasets related to the human gut microbiome in general.
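The frequency comparison can be sketched as follows. The posting does not say which statistic would be used; this example assumes a one-sided hypergeometric test, a common choice in enrichment analyses, and all counts are invented for illustration.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n datasets are drawn from N, of which K carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def enrichment(hits_with, hits_total, background_with, background_total):
    """Return the fold enrichment and the one-sided hypergeometric p-value."""
    fold = (hits_with / hits_total) / (background_with / background_total)
    p_value = hypergeom_tail(hits_with, hits_total, background_with, background_total)
    return fold, p_value


# Invented counts: 6 of the 20 datasets returned by the query carry a
# liver-disease annotation, against 50 of the 1000 background datasets.
fold, p_value = enrichment(6, 20, 50, 1000)
print(f"fold enrichment: {fold:.1f}, p-value: {p_value:.2e}")
```

The background can be either the whole set of datasets or only those matching the semantic criteria; the choice changes `background_with` and `background_total`, and therefore the conclusion.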

Main activities

The expected contributions of this PhD thesis are:

  1. the creation of a semantic index of the datasets based on FAIR principles. This will require retrieving the metadata from the main dataset repositories and representing them in a unified schema, based on Semantic Web technologies such as RDF, RDFS+OWL and Bioschemas.
  2. the comparison of the trade-offs between centralized and decentralized storage of the semantic annotations in terms of implementation simplicity, performance impact, and scalability.
  3. the capability for users to express semantically rich queries. This will rely on SPARQL for representing the queries, but will also require a suitable user interface.
  4. the capability to describe and characterize the query results.

 

Skills

Technical skills and level required : Programming (Python or Java)

Languages : French or English

Other valued skills : Semantic Web

Benefits package

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Possibility of teleworking (90 days per year) and flexible organization of working hours
  • Partial payment of insurance costs

Remuneration

Monthly gross salary: 2100€ during the first 2 years and 2200€ during the 3rd year.