PhD Position F/M PhD in applied mathematics: Stochastic modeling and statistics for quantifying and predicting the evolution of tumor heterogeneity in chronic lymphocytic leukemia

Contract type : Fixed-term contract

Level of qualifications required : Graduate degree or equivalent

Function : PhD Position


Thesis context

The thesis will take place in the Probability and Statistics team of the Institut Élie Cartan de Lorraine (IECL) in Nancy and in the SIMBA team (Statistical Inference and Modeling for Biological Applications) of Inria Nancy. The PhD student will take part in discussions with staff at the Strasbourg University Hospital on medical and data aspects throughout the PhD project. During the thesis, the PhD student will have the opportunity to discover the world of mathematical research through the life of a dynamic mathematics laboratory, and to attend seminars and working groups in probability and statistics.


The thesis will be supervised by Nicolas Champagnat, Coralie Fritsch and Ulysse Herbach (IECL and INRIA Nancy - Grand Est) for the mathematical part and by Laurent Vallat (CHRU Strasbourg and University of Strasbourg) for the medical part.


Full PhD subject:


Biological context

The development of targeted therapies has allowed considerable progress in the treatment of many cancers, but their efficacy depends on intra-tumor heterogeneity. In lymphomas and leukemias, the identification of gene alterations by high-throughput sequencing allows the characterization of this heterogeneity. In healthy B cells, the maturation process provides a unique sequence of DNA, called VDJ genes, encoding the immune repertoire of the antigen receptor (BCR) by combining three immunoglobulin chains V, D and J. In contrast, in hemopathies, every B cell in the initial leukemic clone (i.e. population of tumor cells with the same genome) has the same antigen receptor, encoded by a specific VDJ gene sequence. The occurrence of additional mutations in VDJ genes may be responsible for the emergence of subclones with increased antigen receptor reactivity, further complicating the clonal heterogeneity of these hemopathies. Leukemic B cells therefore have two levels of heterogeneity: the heterogeneity of cancer genes (a feature shared by any cancer) and the heterogeneity of VDJ genes (a feature specific to leukemia). However, these two levels of clonal heterogeneity and their co-evolution remain poorly characterized and are not considered in the management of these cancers today.

Project description

We propose to develop a mathematical model for the evolution of the two levels of clonal heterogeneity in leukemia, making it possible to characterize their evolution from temporal bulk sequencing data of VDJ and cancer gene mutations using a Bayesian approach. We will then test how well the inferred model predicts clonal evolution.

Main activities


In this PhD project, we propose to tackle the problem of clonal reconstruction, first from data collected at a single time (already available), and second from longitudinal data. Data will be collected throughout the duration of the PhD thesis.

The main problem consists in reconstructing the phylogenetic tree of mutations and the dynamics of the frequencies of each clone. The originality comes from the fact that the data are heterogeneous: we will have the full profile of VDJ mutations of the clones with their frequencies, and the variants of each cancer gene with their allele frequencies. From the mathematical modeling perspective, VDJ data share common features with single-cell data, since full sequences can be reconstructed using tools like MiXCR. Existing packages for clonal heterogeneity analysis are B-SCITE (Malikic et al., 2017) and ddClone (Salehi et al., 2017). They are able to deal with both types of data (bulk and single-cell) and could in principle be used here. However, some specificities of CLL do not fit into these methods.

The PhD student will first construct a probabilistic model accounting for all the data. This model will contain the phylogenetic tree as a latent variable, where each node in the tree corresponds to a VDJ mutation, a cancer gene mutation, or a chromosomal alteration, and where each mutation occurs only once in the tree. The observations will then be obtained, following the classical rules of the infinitely many sites model, as linear combinations of the frequencies of the clones in the sample (which are further latent variables), possibly with some noise.
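As an illustration only, the observation rule described above can be sketched in a few lines of Python. Everything in this sketch is an assumption made for the example and not part of the project specification: the encoding of the tree as a `parent` list, the function names, the Gaussian noise, and the fact that ploidy and copy number are ignored (the output is the fraction of cells carrying each mutation, not an allele frequency).

```python
import numpy as np

def descendant_matrix(parent):
    """Return A with A[i, j] = 1 if clone j carries mutation i.

    parent[j] is the parent of node j in the mutation tree (-1 for the root).
    Since each mutation arises only once in the tree, clone j carries
    mutation i exactly when i is j itself or an ancestor of j.
    """
    n = len(parent)
    A = np.zeros((n, n), dtype=int)
    for j in range(n):
        node = j
        while node != -1:
            A[node, j] = 1
            node = parent[node]
    return A

def expected_mutation_frequencies(parent, clone_freqs):
    """Fraction of cells carrying each mutation: a linear combination
    of the (latent) clone frequencies."""
    return descendant_matrix(parent) @ np.asarray(clone_freqs, dtype=float)

def simulate_observations(parent, clone_freqs, noise_sd=0.01, rng=None):
    """Add illustrative Gaussian noise and clip to [0, 1] (hypothetical noise model)."""
    rng = np.random.default_rng(rng)
    mean = expected_mutation_frequencies(parent, clone_freqs)
    return np.clip(mean + rng.normal(0.0, noise_sd, size=mean.shape), 0.0, 1.0)

# Toy tree: node 0 is the root mutation, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
parent = [-1, 0, 0, 1]
clone_freqs = [0.4, 0.1, 0.3, 0.2]
print(expected_mutation_frequencies(parent, clone_freqs))
print(simulate_observations(parent, clone_freqs, rng=0))
```

In this toy tree, the mutation at node 1 is carried by clones 1 and 3, so its expected frequency is the sum of those two clone frequencies.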

Treating latent variables as parameters, we could use the maximum likelihood method, but the maximization is difficult in practice because of the very large number of possible trees. We will test stochastic search algorithms (genetic algorithms, Metropolis-Hastings and other MCMC methods), but we expect better results from a Bayesian approach combined with a variational method to maximize the a posteriori likelihood.
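To give an idea of what a Metropolis-Hastings search over trees can look like, here is a minimal Python sketch. It is a toy under stated assumptions, not the method of the project: the score `log_score` is a hypothetical unnormalised log-posterior that only penalises trees in which the children of a mutation are together more frequent than the mutation itself, node 0 is assumed to be the root, and the proposal (prune a node and reattach it outside its own subtree) is chosen because it is symmetric, so the acceptance ratio reduces to a ratio of scores.

```python
import math
import random

def subtree(parent, v):
    """Set of nodes in the subtree rooted at v (v included)."""
    nodes = {v}
    changed = True
    while changed:
        changed = False
        for j, p in enumerate(parent):
            if p in nodes and j not in nodes:
                nodes.add(j)
                changed = True
    return nodes

def log_score(parent, freqs, sigma=0.05):
    """Toy unnormalised log-posterior (hypothetical): quadratic penalty when the
    children of a mutation sum to more than the mutation's own frequency."""
    children_sum = [0.0] * len(parent)
    for j, p in enumerate(parent):
        if p != -1:
            children_sum[p] += freqs[j]
    excess = (max(0.0, c - f) for c, f in zip(children_sum, freqs))
    return -sum(e * e for e in excess) / (2.0 * sigma ** 2)

def propose(parent, rng):
    """Prune a random non-root node and reattach it outside its own subtree."""
    v = rng.randrange(1, len(parent))
    blocked = subtree(parent, v)
    candidates = [u for u in range(len(parent)) if u not in blocked]
    new_parent = list(parent)
    new_parent[v] = rng.choice(candidates)
    return new_parent

def metropolis_hastings(freqs, n_iter=5000, seed=0):
    """Run the chain and return the best-scoring tree visited and its score."""
    rng = random.Random(seed)
    current = [-1] + [0] * (len(freqs) - 1)  # start from a star tree
    current_score = log_score(current, freqs)
    best, best_score = current, current_score
    for _ in range(n_iter):
        candidate = propose(current, rng)
        candidate_score = log_score(candidate, freqs)
        # Symmetric proposal: accept with probability min(1, target ratio).
        if rng.random() < math.exp(min(0.0, candidate_score - current_score)):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

# Toy mutation frequencies; node 0 is the root mutation carried by all cells.
best_tree, best_score = metropolis_hastings([1.0, 0.6, 0.5, 0.3])
print(best_tree, best_score)
```

Even on this four-node example several trees are equally compatible with the frequencies, which illustrates why a single point estimate of the tree may be less informative than an approximation of the full posterior.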

Second, the PhD student will validate the method, first on data simulated from our model, then using the benchmark simulation tool proposed by Foglierini et al. (2020), adapted to our double heterogeneity context, and finally by comparison with single-cell sequencing data from 3D in vitro cultures of proliferating cells that will be collected throughout the project. Predictive performance will also be tested.

Finally, we will try to detect whether groups of patients share similar mutational patterns (such as phylogenetic tree topology), which could correspond to a similar tumorigenesis, a similar stage of progression, or a similar response to treatment. This clustering problem can be addressed with model-free artificial intelligence tools (such as latent Dirichlet allocation: Pritchard et al., 2000; Falush et al., 2003) or with models like those developed by Beerenwinkel et al. (2004, 2005). This will allow us to build a predictive model of treatment efficacy given the clonal heterogeneity of a patient, which clinicians could use in a context of personalized medicine.



The candidate should have skills in statistics and/or stochastic modeling. R, Python or Matlab programming skills are also required. An affinity or experience with medical applications will be highly appreciated.

Keywords: Applied probability, stochastic modeling, statistical modeling for medicine, variational Bayesian methods, clonal heterogeneity, chronic lymphocytic leukemia

Benefits package

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage


2100€ gross/month in the 1st year