General information
Job title: PhD Position in Statistics (M/F)
Reference: UMR5149-NATCOL-024
Number of positions: 1
Workplace: MONTPELLIER
Publication date: Thursday, June 19, 2025
Contract type: Fixed-term PhD contract (CDD Doctorant)
Contract duration: 36 months
Thesis start date: October 1, 2025
Working time: Full-time
Remuneration: 2200 gross per month
CN section(s): 51 - Mathematical, computational and physical modeling for the life sciences
Description of the thesis topic
During cancer progression, mutations accumulate in cancer cells, generating multiple cellular lineages that coexist within a given tumor. The objective of this project is to study the evolutionary history of a tumor from so-called "bulk" high-throughput sequencing data, obtained by sequencing a mixture of cells from the tumor.
These data are complex for both biological and technical reasons. Biologically, cancer evolution involves numerous processes that induce mutations, structural alterations of certain genomic regions in some cells, and changes in tumor size. Technically, high-throughput sequencing does not provide complete genome sequences but rather a very large number of small fragments, called "reads," which are mapped onto a reference sequence for analysis. In bulk sequencing, where many cells are sequenced together, it is not possible to directly assign each read to its originating cell.
The main goal of the thesis is to reconstruct the history of the cellular composition of a patient's tumor from longitudinal biopsy samples, taken at multiple time points and sequenced. The proposed approach relies on developing a stochastic model of bulk sequencing data from a tumor. This model naturally decomposes into two main parts.
First, a birth-and-death process (modeling cell division and death), coupled with a Poisson process (modeling mutations), will be used to describe the evolution of the number of cells in each lineage and the emergence of new lineages. Second, conditional on the sizes of these lineages, the model describes the sampling of tumor cells and their high-throughput sequencing, which generates the observed set of reads.
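As a rough illustration of this two-part structure, the following minimal Python sketch simulates lineage sizes with a Gillespie-style birth-and-death process in which each cell mutates at a constant rate, then draws variant read counts conditional on those sizes. It is not the model of the project: the function names, rate values, fixed sequencing depth, and the assumption of heterozygous mutations in a pure diploid sample (variant allele frequency equal to half the fraction of carrier cells) are all hypothetical choices made for the example.

```python
import random


def simulate_lineages(birth=1.0, death=0.5, mutation=0.05,
                      t_max=5.0, max_cells=10_000, seed=0):
    """Part 1 (illustrative): linear birth-and-death process with mutations.

    Each cell divides at rate `birth`, dies at rate `death`, and mutates at
    rate `mutation`; a mutating cell leaves its lineage and founds a new one.
    All rates are hypothetical. Returns (sizes, parents).
    """
    rng = random.Random(seed)
    sizes = [1]        # lineage 0 starts from a single founder cell
    parents = [None]   # parents[i] = lineage in which lineage i arose
    t = 0.0
    per_cell = birth + death + mutation
    while True:
        total = sum(sizes)
        if total == 0 or total >= max_cells:
            break
        t += rng.expovariate(total * per_cell)  # waiting time to next event
        if t > t_max:
            break
        # The event hits a uniformly chosen cell, i.e. a lineage chosen
        # proportionally to its size.
        i = rng.choices(range(len(sizes)), weights=sizes)[0]
        u = rng.random() * per_cell
        if u < birth:
            sizes[i] += 1
        elif u < birth + death:
            sizes[i] -= 1
        else:
            sizes[i] -= 1
            sizes.append(1)
            parents.append(i)
    return sizes, parents


def subtree_sizes(sizes, parents):
    """Number of cells carrying the founding mutation of each lineage."""
    totals = list(sizes)
    # A child lineage always has a larger index than its parent,
    # so one reverse sweep accumulates descendants into ancestors.
    for i in range(len(sizes) - 1, 0, -1):
        totals[parents[i]] += totals[i]
    return totals


def simulate_reads(sizes, parents, depth=100, seed=1):
    """Part 2 (illustrative): variant read counts given lineage sizes.

    Assumes heterozygous mutations in a pure diploid sample, so the variant
    allele frequency is half the fraction of carrier cells, and a fixed
    number of reads (`depth`) covering each mutated site.
    """
    rng = random.Random(seed)
    total = sum(sizes)
    if total == 0:
        return []
    carriers = subtree_sizes(sizes, parents)
    reads = []
    for i in range(1, len(sizes)):  # lineage 0 carries no founding mutation
        vaf = 0.5 * carriers[i] / total
        variant = sum(rng.random() < vaf for _ in range(depth))
        reads.append((variant, depth))
    return reads


if __name__ == "__main__":
    sizes, parents = simulate_lineages()
    print(sizes, parents)
    print(simulate_reads(sizes, parents))
```

Separating the two functions mirrors the decomposition described above: the read-generation step only needs the lineage sizes and their genealogy, so either part can be replaced by a more realistic version independently.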
Initially, this model can be used to simulate sequencing data under various biological hypotheses to test the robustness and accuracy of existing reconstruction methods such as Pairtree [3] or CALDER [2].
Subsequently, the objective will be to compute the likelihood of bulk sequencing data under this model, in order to propose a new statistical inference method, for instance by adapting the approach of [1] for the first part of the model.
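To give an idea of what such a likelihood looks like for the second part of the model only, here is a minimal Python sketch of a binomial log-likelihood of variant read counts given the fraction of cells carrying each mutation. It is an illustration under simplifying assumptions (independent sites, heterozygous mutations in a pure diploid sample, no sequencing error), not the method of [1] nor the likelihood targeted by the project, which would also have to account for the unobserved lineage sizes. The function names are hypothetical.

```python
from math import lgamma, log


def log_binom_pmf(k, n, p):
    """Log-probability of k successes among n trials with success probability p."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_coeff + k * log(p) + (n - k) * log(1.0 - p)


def read_log_likelihood(counts, cell_fractions):
    """Log-likelihood of read counts given carrier-cell fractions.

    counts         : list of (variant_reads, depth) pairs, one per mutation
    cell_fractions : fraction of sampled cells carrying each mutation
    Assumption: the variant allele frequency is half the carrier fraction.
    """
    return sum(
        log_binom_pmf(k, n, 0.5 * f)
        for (k, n), f in zip(counts, cell_fractions)
    )


if __name__ == "__main__":
    observed = [(30, 100), (12, 100)]
    # Compare two candidate tumor compositions for the same observed reads.
    print(read_log_likelihood(observed, [0.6, 0.25]))
    print(read_log_likelihood(observed, [0.2, 0.80]))
```

Comparing the two printed values shows how observed read counts can discriminate between candidate compositions; a full inference method would combine such a term with the probability of the lineage sizes under the birth-and-death part of the model.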
[1] Didier, Laurin. 2020. Systematic Biology. 69:1068–1087.
[2] Myers, Satas, Raphael. 2019. Cell Systems. 8:514–522.
[3] Wintersinger, Dobson, Kulman, et al. 2022. Blood Cancer Discovery. 3:208–219.
Work context
The PhD will take place at the Institut Montpelliérain Alexander Grothendieck (IMAG) in Montpellier, in collaboration with MAP5 in Paris. It will be supervised by Gilles Didier (IMAG) and Paul Bastide (MAP5), with collaboration from Alice Cleynen (IMAG) and Sophie Lèbre (IMAG).
The project is part of the ANR IdenTHiC (Identification of Tumor History at the Clone level) program, which focuses on the study of clinical data from cancer patients to support diagnosis.
Developing the simulation tool requires strong programming skills, particularly in R, C/C++, or Python. Studying the model requires expertise in probability and statistics, and an interest in biological applications is a plus.