PhD (M/F) : Mining of large sequence databases for the identification of predictive RNA signatures of phenotype

Application Deadline : Tuesday, August 23, 2022

General information

Reference : UMR9198-DANGAU-002
Workplace : GIF SUR YVETTE
Date of publication : Tuesday, August 2, 2022
Scientific Responsible name : Daniel Gautheret
Type of Contract : PhD Student contract / Thesis offer
Contract Period : 36 months
Start date of the thesis : 1 November 2022
Proportion of work : Full time
Remuneration : 2 135,00 € gross monthly

Description of the thesis topic

High-throughput RNA sequencing (RNA-seq) is a unique source for the discovery of medical biomarkers and drug targets. However, although over one million human RNA-seq libraries are publicly available, this treasure trove of medical information cannot realize its full potential, because the resource cannot be queried directly to measure the expression of an RNA of interest. Several bioinformatics projects have addressed this issue, but they rely on normal reference RNAs, which do not capture the full diversity of transcripts found in disease. New reference-free data structures based on k-mers could make these large sequence databases queryable. However, several improvements are needed to turn them into true data mining tools for discovering RNAs associated with human diseases.

As part of a new ANR project, we will develop indexing structures capable of handling reference-free quantitative queries across tens of thousands of RNA-seq libraries while optimizing disk and memory consumption. To this end, we will build on our Reindeer indexing system [1], introducing important innovations to reduce the tool's disk and memory footprint. In addition, we will implement statistical tools in the new version of Reindeer to screen the indexes for RNAs significantly associated with qualitative or quantitative traits related to the phenotype of the samples. This will allow us to discover RNAs associated with clinical or cellular characteristics, and ultimately to produce new diagnostic/prognostic models. We will create indexes of about 10,000 samples from public databases. Using these indexes, we will propose a series of applications aimed at better understanding the determinants of aging and cellular senescence, two related processes involved in a large number of pathologies. We will generate the first predictive models of aging and senescence using unreferenced RNAs. The distributed architecture of our system, combined with web servers allowing public queries, will allow a large community to evaluate our tools, opening the way to a wide range of applications. Our consortium is composed of bioinformaticians from four institutions, with strong backgrounds in informatics, data structures, high-throughput RNA-seq analysis, and health transcriptomics.
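
To give candidates a feel for what a reference-free quantitative query means, here is a minimal Python sketch. It is a toy illustration only, not how Reindeer is implemented: the function names, the tiny k value, the in-memory dictionary, and the choice of summarizing a query by the median count of its k-mers are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from statistics import median

K = 5  # toy value chosen for readability (assumption); real tools use larger k


def kmers(seq, k=K):
    """Yield every substring of length k found in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k=K):
    """Build {kmer: Counter({sample_name: count})} from raw reads.

    `samples` maps a sample name to a list of read sequences.
    No reference transcriptome is involved at any point.
    """
    index = defaultdict(Counter)
    for name, reads in samples.items():
        for read in reads:
            for km in kmers(read, k):
                index[km][name] += 1
    return index


def query(index, sequence, sample_names, k=K):
    """Summarize the abundance of `sequence` in each sample.

    The summary used here is the median count of the query's k-mers
    (an illustrative choice); k-mers absent from the index count as 0.
    """
    result = {}
    for name in sample_names:
        counts = [index.get(km, {}).get(name, 0) for km in kmers(sequence, k)]
        result[name] = median(counts) if counts else 0
    return result


# Hypothetical toy data: two "libraries" with a handful of reads each.
toy_samples = {
    "s1": ["ACGTACGTAC", "ACGTACGTAC"],
    "s2": ["TTTTTTTTTT"],
}
toy_index = build_index(toy_samples)
print(query(toy_index, "ACGTACG", ["s1", "s2"]))
```

An index at the scale described above cannot hold a plain dictionary like this in memory, which is precisely why compact data structures are a central part of the project.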

The student will participate in the following activities:

- A minor contribution to the development of the indexing tool and its application to building large-scale transcriptomic indexes, linked across different entities and centrally searchable. This is mainly a computer science activity and will be led by our informatics collaborators (notably INRIA/CNRS/Univ Lille). The student will participate in the selection and retrieval of samples, as well as in the creation of indexes.
- The implementation of biostatistical tools to extract sequences associated with biological characteristics (age/senescence, pathology, cell type) from the index, to produce predictive models from these variables and to test these models. This will involve the development of model normalization and aggregation procedures adapted to the size and heterogeneity of the tables analyzed. The activity will be co-supervised by a biostatistician and conducted in collaboration with our bioinformatician colleagues from the ANR project.

The student will thus acquire solid experience in artificial intelligence applied to health, while having the opportunity to advance knowledge on aging and cancer.

Work Context

The host team specializes in bioinformatics and is composed of 5 permanent researchers and teaching staff. The student will join an ANR consortium ("full-RNA": 2022-2026) composed of 4 computer science and bioinformatics laboratories. He/she will participate in the consortium meetings and will benefit from our collaborations within this group.

1. Marchet, C., Iqbal, Z., Gautheret, D., Salson, M. & Chikhi, R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics 36, i177-i185 (2020).

Constraints and risks

