General information
Offer title: PhD student: Error correction for data storage in DNA molecules (M/F)
Reference: UMR6074-DOMLAV-022
Number of positions: 1
Workplace: RENNES
Date of publication: Tuesday, 22 July 2025
Type of contract: Fixed-term doctoral contract
Contract period: 36 months
Start date of the thesis: 1 November 2025
Proportion of work: Full time
Remuneration: 2,200 EUR gross per month
CN section(s): 06 - Information sciences: foundations of computer science, computation, algorithms, representations, applications
Description of the thesis topic
1. DATA STORAGE IN SYNTHETIC DNA MOLECULES
Today, data centers account for around 20% of digital energy consumption in France. One alternative, storing information in synthetic DNA molecules, has been actively explored for several years. In addition to offering a storage density far superior to that of current technologies (up to several exabits per mm³), DNA is a robust medium that withstands sharp variations in temperature and remains stable over time. It should therefore make it possible to preserve information over the long term and to significantly reduce the energy consumed by storage.
A DNA molecule is made up of a sequence of bases, or nucleotides, of four types: A, C, G, and T. DNA synthesis consists of constructing the molecule that corresponds to a given sequence of quaternary symbols. At present, synthesis is the main bottleneck of this technology: it is slow and costly, although highly reliable, having originally been developed for the medical field. The information is then read back by sequencing, a technique that introduces a high proportion of errors (around 5%) into the sequenced data.
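To make the idea of a "sequence of quaternary symbols" concrete, here is a minimal sketch that maps binary data to nucleotides at two bits per base. The mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions chosen for illustration; they are not taken from this offer, and practical DNA storage schemes generally use more elaborate encodings.

```python
# Illustrative assumption: a fixed 2-bit-per-base mapping, 00->A, 01->C, 10->G, 11->T.
BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; the sequence length must be a multiple of four."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            # BASES.index raises ValueError on any symbol outside A, C, G, T.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
```

For example, the byte 0x1B (binary 00011011) becomes the four-base sequence ACGT, and decoding that sequence returns the original byte.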
2. ERROR CORRECTION
Channel coding consists of introducing structured redundancy into the data, which is used during decoding to correct errors introduced during transmission or storage. Modern channel coding solutions, such as Turbo codes, LDPC codes or Polar codes, are now an essential part of most telecommunications standards (Wi-Fi, mobile radio, etc.) and information storage standards (RAM, hard disks, etc.), as they make transmission and storage more reliable. However, storing data in DNA introduces insertion and deletion errors, which conventional channel codes are unable to correct because these errors break their redundancy structure.
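The difference between substitutions and deletions can be illustrated with a small sketch. It compares a sent sequence and a received sequence position by position, which is how a decoder that assumes aligned symbols sees the data. The example sequence and the helper names are hypothetical and serve only as an illustration.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Return seq with the symbol at index pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq: str, pos: int) -> str:
    """Return seq with the symbol at index pos removed."""
    return seq[:pos] + seq[pos + 1:]


def positional_mismatches(sent: str, received: str) -> int:
    """Count differing symbols over the common length, assuming no realignment."""
    return sum(a != b for a, b in zip(sent, received))


sent = "ACGTACGTACGT"  # hypothetical example sequence
one_substitution = substitute(sent, 2, "T")
one_deletion = delete(sent, 2)

print(positional_mismatches(sent, one_substitution))  # 1
print(positional_mismatches(sent, one_deletion))      # 9
```

A single substitution affects one position, whereas a single deletion shifts every later symbol, so in this example it appears as nine positional mismatches to a decoder that expects a fixed alignment.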
That said, sequencing offers an interesting opportunity for error correction: it naturally produces a large number of reads of the same molecule, with different errors in each read. One solution from the field of bioinformatics is to use consensus algorithms to reconstruct the input sequence from the multiple reads. The aim of this thesis is to develop hybrid approaches combining these two complementary solutions (consensus algorithms and channel coding), in order to reconstruct the input data more efficiently by exploiting both the multiple reads and the code redundancy.
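As a rough illustration of the consensus idea, the sketch below takes a per-position majority vote over several reads. It rests on a strong simplifying assumption that does not hold for real sequencing data: all reads have the same length and contain substitution errors only, so they are already aligned. With insertions and deletions, the reads would first have to be aligned. The function name and the example reads are hypothetical, and this is not the method proposed in the thesis.

```python
from collections import Counter


def majority_consensus(reads: list[str]) -> str:
    """Per-position majority vote over equal-length, pre-aligned reads.

    Assumption: substitution errors only (no insertions or deletions).
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) yields one tuple per position, holding that position's symbols.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


# Hypothetical reads of the sequence ACGTAC, each with at most one substitution.
reads = ["ACGTAC", "ACCTAC", "ACGTAA", "TCGTAC"]
print(majority_consensus(reads))  # ACGTAC
```

Because the errors fall at different positions in different reads, the vote recovers the original sequence here even though three of the four reads contain an error.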
Work context
The thesis will be carried out within the framework of PEPR MolécularXiv (see https://pepr-molecularxiv.fr/le-pepr/). The PhD student will be based in the GenScale team at IRISA in Rennes, and will also work with the MEE department at IMT Atlantique in Brest. The position is aimed at students holding a Master's degree, an engineering degree, or an equivalent qualification, with a background in computer science, telecommunications or signal processing. Prior knowledge of channel coding would be a plus. No prior knowledge of biology is required to work on this subject.
The position is located in a sector covered by the protection of scientific and technical potential (PPST), and therefore requires, in accordance with the regulations, that your arrival be authorized by the competent authority of the MESR.