2019-01628 - PhD Position F/M Distributed link prediction in large complex graphs: application to biomolecule interactions

Contract type: Fixed-term public service contract

Required degree level: Master's degree (Bac+5) or equivalent

Position: PhD student

Desired level of experience: Recent graduate

Context and advantages of the position

Today, vast and diverse sources of data exist for almost every scientific domain, which makes their integration and intelligent exploitation challenging. Complex data require expressive representation models such as graphs. The Linked Open Data (LOD) movement and the FAIR (Findability, Accessibility, Interoperability, Reusability) data principles are intended to facilitate the integration and analysis of heterogeneous data. In the LOD context, graphs are called knowledge graphs because they incorporate domain ontologies for typing objects and describing their relationships. Semantic web languages (RDFS, OWL, SPARQL) have reached a level of maturity on which ambitious machine learning techniques can rely. In addition, big data and NoSQL solutions make web-scale data analyses possible. So far, such analyses on dedicated big-data architectures have often been limited to MapReduce scenarios on rather simple data models (key-value oriented, or homogeneous graphs with only one type of node and one type of edge). Graph databases, as one NoSQL approach, allow a rich representation of multi-typed, attributed nodes and edges. This greater expressivity comes at a cost, since distributing both the graph and the computation is not an easy task.

The objective of this PhD project is to advance the state of the art of link prediction in knowledge graphs in a distributed setting [1][2][3], in the context of predicting drug-target associations. Predicting new targets for known drug molecules (often called drug repositioning [5]) offers the possibility of using existing drug molecules in new ways, which is far cheaper and much less time-consuming than developing new drug molecules from scratch. The computational challenge, however, is to make link prediction algorithms capable of taking into account the attributes and labels on both nodes and edges in knowledge graphs. The proposed approaches will be evaluated on web-scale knowledge graphs by inferring missing links (data completion). YAGO, DBpedia, and synthetic benchmarks can be used for such evaluation and validation purposes [4].
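As an illustration of what evaluating link prediction by data completion can look like, the following minimal Python sketch hides some triples of a toy knowledge graph and checks whether a scorer ranks the hidden tail among its top k candidates (Hits@k). Everything in it is an assumption made for illustration: the function names (`split_triples`, `hits_at_k`, `common_neighbors_scorer`), the toy identifiers (`d1`, `t1`, ...), the relation names, and the choice of a common-neighbors scorer. It is not the evaluation protocol of the project, which remains to be defined.

```python
import random
from collections import defaultdict


def split_triples(triples, test_fraction=0.2, seed=0):
    """Hide a random fraction of the triples; return (train, test)."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def build_adjacency(triples):
    """Undirected adjacency that ignores relation labels."""
    adj = defaultdict(set)
    for head, _, tail in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj


def common_neighbors_scorer(train):
    """Return a scorer that counts neighbors shared by head and tail."""
    adj = build_adjacency(train)

    def score(head, relation, tail):
        return len(adj[head] & adj[tail])

    return score


def hits_at_k(train, test, score, candidates, k=3):
    """Fraction of hidden triples whose true tail is ranked in the top k
    among the candidate tails not already linked to the head in train."""
    known = set(train)
    hits = 0
    for head, relation, tail in test:
        pool = [c for c in candidates if (head, relation, c) not in known]
        ranked = sorted(pool, key=lambda c: score(head, relation, c), reverse=True)
        if tail in ranked[:k]:
            hits += 1
    return hits / len(test)


# Hypothetical toy data: drugs d1..d3, targets t1..t3.
train = [
    ("d1", "similar_to", "d2"),
    ("d2", "targets", "t1"),
    ("d3", "targets", "t2"),
    ("d1", "targets", "t3"),
]
test = [("d1", "targets", "t1")]  # the hidden link to recover
candidates = ["t1", "t2", "t3"]

print(hits_at_k(train, test, common_neighbors_scorer(train), candidates, k=1))
```

In this toy graph a drug and a target can only share a neighbor through a `similar_to` edge between drugs, which is one small example of why edge types matter in a heterogeneous graph.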

Assignment

Mission:

This PhD thesis project aims to develop scalable link prediction methods in large and complex graphs. More specifically, the aims of this thesis project are:

- to propose link prediction approaches in knowledge graphs based on both graph topology and neighborhood constraints to be defined;
- to design scalable implementations of the proposed approaches for distributed architectures. In this context, the use of big graph processing frameworks such as Pregel, Trinity, GraphLab and BLADYG needs to be studied [6];
- to define evaluation and validation protocols for the proposed algorithms in the context of web-scale knowledge graphs;
- to apply the approach to the prediction of drug-target associations.
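To make the first aim more concrete, here is a minimal Python sketch of a topology-based predictor on a graph with typed nodes and labelled edges. It uses the classical Adamic-Adar index, restricted to candidate pairs of given node types and optionally to given edge labels. The class `TypedGraph`, the functions `adamic_adar` and `predict_links`, the node types and the toy data are all hypothetical, and the type and label filters are only one possible reading of "neighborhood constraints", which the project leaves to be defined.

```python
import math
from collections import defaultdict


class TypedGraph:
    """Undirected graph with a type on each node and a label on each edge."""

    def __init__(self):
        self.node_type = {}
        self.adj = defaultdict(set)  # node -> set of (edge_label, neighbor)

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, label, v):
        self.adj[u].add((label, v))
        self.adj[v].add((label, u))

    def neighbors(self, node, labels=None):
        """Neighbors of node, optionally restricted to some edge labels."""
        return {n for (lab, n) in self.adj[node] if labels is None or lab in labels}


def adamic_adar(graph, u, v, labels=None):
    """Sum of 1 / log(degree) over the neighbors shared by u and v.
    A shared neighbor is adjacent to both u and v, so its degree is >= 2."""
    shared = graph.neighbors(u, labels) & graph.neighbors(v, labels)
    return sum(1.0 / math.log(len(graph.neighbors(z))) for z in shared)


def predict_links(graph, source_type, target_type, labels=None):
    """Score every unlinked (source_type, target_type) pair, best first."""
    sources = [n for n, t in graph.node_type.items() if t == source_type]
    targets = [n for n, t in graph.node_type.items() if t == target_type]
    scored = []
    for s in sources:
        linked = graph.neighbors(s)
        for t in targets:
            if t not in linked:
                value = adamic_adar(graph, s, t, labels)
                if value > 0:
                    scored.append((value, s, t))
    return sorted(scored, reverse=True)


# Hypothetical toy graph.
g = TypedGraph()
for drug in ("d1", "d2", "d3"):
    g.add_node(drug, "drug")
for protein in ("t1", "t2"):
    g.add_node(protein, "protein")
g.add_edge("d1", "similar_to", "d2")
g.add_edge("d2", "targets", "t1")
g.add_edge("d1", "targets", "t2")
g.add_edge("d3", "targets", "t2")

for value, drug, protein in predict_links(g, "drug", "protein"):
    print(f"{drug} -> {protein}: {value:.3f}")
```

Passing `labels={"targets"}` restricts shared neighbors to those reached through `targets` edges, which yields no candidate in this toy graph: the two predictions above rely entirely on the `similar_to` edge.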

This project will be carried out mainly within the Capsid team at Inria Nancy, which combines expertise in knowledge graphs, distributed graph computing [6] and drug-target interactions (https://capsid.loria.fr). Achieving the objectives of the thesis will involve acquiring knowledge and understanding of the current state of the art in link prediction in large and complex graphs. An important aspect of this project will be to explore the use of big graph processing frameworks in order to design scalable implementations of the proposed link prediction methods in knowledge graphs. The proposed techniques will be implemented on a local cluster and evaluated using publicly available data.
This project will develop novel and practical link prediction algorithms that will be applied to predict drug targets. This will help to satisfy an important and current research need in drug repositioning. The developed software will be made publicly available.
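To give an idea of how a link prediction score can be expressed in the vertex-centric style popularised by frameworks such as Pregel, the sketch below simulates two supersteps in a single Python process. In the first superstep every vertex sends its neighbor set to each neighbor; in the second, every vertex counts the neighbors it shares with each vertex two hops away. No real framework API is used: the function names `superstep_send` and `superstep_count` and the toy graph are assumptions made for illustration, and an actual distributed implementation would also have to handle partitioning and message volume.

```python
from collections import defaultdict


def superstep_send(adjacency):
    """Superstep 1: each vertex sends its neighbor set to every neighbor."""
    inbox = defaultdict(list)
    for vertex, nbrs in adjacency.items():
        for u in nbrs:
            inbox[u].append((vertex, nbrs))
    return inbox


def superstep_count(adjacency, inbox):
    """Superstep 2: each vertex u counts, from its own inbox only, how many
    neighbors it shares with every non-adjacent vertex w two hops away."""
    counts = {}
    for u, messages in inbox.items():
        own_nbrs = adjacency.get(u, set())
        local = defaultdict(int)
        for _sender, sender_nbrs in messages:
            for w in sender_nbrs:
                if w != u and w not in own_nbrs:
                    local[w] += 1  # the sender is a neighbor shared by u and w
        for w, c in local.items():
            counts[(u, w)] = c
    return counts


# Hypothetical toy graph: a 4-cycle a-b-d-c-a (symmetric adjacency).
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c"},
}

counts = superstep_count(adjacency, superstep_send(adjacency))
for pair in sorted(counts):
    print(pair, counts[pair])
```

Each vertex only reads its own adjacency list and its own inbox, which is the property that lets this kind of computation be spread over partitions of a large graph.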

Required qualification:
Candidates must have a master's degree in computer science, mathematics, or one of the physical sciences. Good programming skills in an object-oriented programming language such as Java or C++ are essential. Experience with NoSQL solutions (Neo4j, Titan, MongoDB), parallel/distributed programming (Spark, Hadoop, Flink) and graph processing frameworks (Pregel, GraphLab, GraphX) is desirable but not essential. A strong interest in structural biology would also be highly desirable.

Advantages:
- Duration: 3 years
- Starting date: between Oct. 1st 2019 and Jan. 1st 2020
- Salary: 1 982 euros gross monthly (about 1 593 euros net) during the first and second years, then 2 085 euros gross monthly (about 1 676 euros net) in the last year. Medical insurance is included.

Help and benefits:
- Possibility of free French courses
- Help with finding accommodation
- Help with the residence permit procedure and with the spouse visa
- Lunch at the Inria canteen costs about 3 €

References:
[1] S. M. Kazemi and D. Poole. SimplE Embedding for Link Prediction in Knowledge Graphs. Advances in Neural Information Processing Systems 31 (NIPS 2018), pp. 4284-4295, 2018.
[2] R. K. Behera, A. S. Shukla, S. Mahapatra, S. K. Rath, B. Sahoo and S. Bhattacharya. Map-Reduce based Link Prediction for Large Scale Social Network. The 29th International Conference on Software Engineering and Knowledge Engineering (SEKE), 2017.
[3] X. Xu, B. Liu, J. Wu and L. Jiao. Link prediction in complex networks via matrix perturbation and decomposition. Scientific Reports (Nature), vol. 7, article number 14724, 2017.
[4] A. Melo and H. Paulheim. Synthesizing Knowledge Graphs for Link and Type Prediction Benchmarking. In: E. Blomqvist, D. Maynard, A. Gangemi, R. Hoekstra, P. Hitzler, O. Hartig (eds), The Semantic Web. ESWC 2017. Lecture Notes in Computer Science, vol. 10249. Springer, Cham, 2017.
[5] S. Pushpakom et al. Drug repurposing: progress, challenges and recommendations. Nature Reviews Drug Discovery, vol. 18, pp. 41-58, 2019.
[6] S. Aridhi and E. Mephu Nguifo. Big Graph Mining: Frameworks and Techniques. Big Data Research (BDR), Elsevier, 9(C), pp. 9-17, 2017.


Supervision and contact:
Sabeur Aridhi, sabeur.aridhi@loria.fr, https://members.loria.fr/SAridhi
Malika Smail-Tabbone, malika.smail@loria.fr, https://members.loria.fr/MSmail

Main activities

Thesis tasks:
- Study existing link prediction algorithms in homogeneous graphs
- Extend the approach to knowledge graphs (by considering node and edge properties)
- Evaluate on well-known public RDF datasets
- Apply to biological knowledge graphs describing drug-target interactions

Remuneration

Salary: 1982 € gross/month for the 1st and 2nd years, 2085 € gross/month for the 3rd year.

Monthly salary after taxes: around 1596.05 € for the 1st and 2nd years, 1678.99 € for the 3rd year (medical insurance included).