Reference : UMR8623-SARCOH-012
Workplace : ORSAY
Date of publication : Monday, June 29, 2020
Scientific Responsible name : Sarah Cohen-Boulakia
Type of Contract : PhD Student contract / Thesis offer
Contract Period : 36 months
Start date of the thesis : 1 October 2020
Proportion of work : Full time
Remuneration : 2 135,00 € gross monthly
Description of the thesis topic
Title : Analysis of multi-modal data for complex pathologies by designing and implementing Reproducible and Reusable Protocols
Sarah Cohen-Boulakia, Université Paris-Saclay (director);
Alain Denise, Université Paris-Saclay
Alban Gaignard, Université de Nantes (main co-supervisor)
The study of pathologies such as intracranial aneurysms requires the use of a wide variety of data and the design of complex analysis protocols. The diversity of their implementations makes their maintenance and sharing difficult and limits the confidence of biologists in the data produced. Reproducing and reusing protocols is crucial to systematically compare biological results, adapt protocols to new problems and meet the requirements of data management plans. The objective of this thesis is to provide (i) a large library of organized protocols, (ii) a module for the design and execution of reproducible, reusable and citable protocols (design of algorithms for indexing and efficient search of patterns in the graphs formed by the workflows implementing the protocols), (iii) an evaluation of the approach and (iv) a set of FAIR criteria for the protocols.
This thesis is funded by the CNRS (R2P2 project, 80|Prime call), in which the doctoral student will collaborate with researchers from the Laboratoire de Recherche en Informatique (LRI, Saclay) and the Institut du Thorax (ITX, Nantes).
Keywords: integration of biological data; reuse and exchange of protocols; scientific workflows and FAIR protocols; analysis of multi-scale data.
Background and state-of-the-art
Intracranial aneurysm is a cerebral vascular anomaly affecting 3.2% of the French population. While its rupture can lead to death or severe disability, no diagnostic tool exists. The study of this pathology requires i) the use of a wide variety of data sets acquired at different scales (genome, vascular tissues, cerebrovascular organ, population) in the context of multidisciplinary, multi-site collaborations, and ii) the design of complex and varied analysis protocols. It is crucial to be able to reproduce these analyses on new datasets with a high level of confidence. However, sharing health data is often hampered by the need to protect personal data and faces technical constraints (security, volume). These constraints can be mitigated when protocols are reusable enough to reproduce analyses in situ. Moreover, when designed to be reusable, protocol implementations (or workflows) record the provenance of the analyzed data and increase scientists' confidence in the results produced.
The reproducibility and reuse of protocols face many challenges. Only when a protocol is reproducible can it be exchanged and reused, in whole or in part, or adapted to answer new biological questions. The reproducibility crisis that erupted some fifteen years ago [SPZ13, AQM+11] highlighted the inability to reproduce results obtained by bioinformatics methods, for very diverse reasons (lack of documentation of the tools used, unavailability of libraries, ...). A series of good practices has emerged, combined with the development of systems that capture the provenance of tools, data sets and information about the execution environment [DCE+07, Boe15, GNT10, BCC+13].
However, protocols are currently designed and implemented without a suitable framework. Workflow systems offer development interfaces, but none keeps track of the workflows reused during the construction of a new workflow. The result is a growing number of workflows derived from pre-existing ones. It is therefore difficult to identify the origin of a protocol and its implementation, and to maintain the numerous implementations of these protocols in a coherent and efficient manner.
While many studies have tackled the production of FAIR (Findable, Accessible, Interoperable, Reusable) data [WDA+16, MNV+17, HKP+18], the central concept of FAIR protocols has only been considered very recently [GSS+20, Fai20]. The FAIR principles [WDA+16] must be extended to take into account, in particular, the modular nature of protocols and their implementations.
Scientific obstacles, Objectives and Methodology.
The objective of this thesis is twofold: (i) to design a framework for the design and implementation of reproducible and reusable data analysis protocols for the study of intracranial aneurysms, and (ii) to demonstrate the interest of this approach by reusing and adapting the protocols obtained in (i) on data generated in new projects. On the computer science side, this thesis will contribute to the definition of FAIR protocols through i) algorithms for indexing and efficiently searching patterns in the graphs formed by workflows, and ii) the design and implementation of tools supporting the reuse (and citation) of workflows. On the application side, this thesis will provide concrete solutions to automatically document the data produced by annotated protocols, as expected in a Data Management Plan. It will lead to a framework for exchanging protocols that are comprehensible by peers, demonstrating the ability to easily reuse and adapt complex protocols developed in one project on the data of a new project.
Task 1: Base of reusable protocols for the study of intracranial aneurysms. The objective is to carry out an inventory of the protocols and implementations used to analyze data for the study of intracranial aneurysms. T1.1 lists the protocols already in place at ITX and among collaborators, with a focus on protocols exploiting biological data. Other sources of information will be exploited: (i) catalogs of tools and workflows, (ii) catalogs of open-source software and workflow code (e.g., GitHub), (iii) articles in the literature (methodological sections). T1.2 aims to semi-automate the exploitation of the T1.1 sources. T1.3 represents and organizes the protocols with a view to their reuse, based on formalisms compatible with the FAIR principles and usable in graph algorithms.
Task 2: Module for reusing reproducible protocols. The objective is to develop a reuse module for reproducible protocols, based on an algorithm for tracing the (sub-)protocols (re)used when designing a new protocol. The challenges are numerous: protocols and their implementations form complex graphs annotated with many terms from ontologies. T2.1 will define a mechanism (i) for indexing the basic bricks present in the protocols (viewed as patterns in graphs) so as to identify these bricks in a compact and informative way, and (ii) for reconstructing the history of a protocol, a problem directly linked to that of comparing protocols (knowing that subgraph isomorphism is algorithmically hard). T2.2 will implement the T2.1 method on systems widely supported by the community (Galaxy, Snakemake or Nextflow) and based on workflow specification standards (e.g., CWL).
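To illustrate the kind of indexing T2.1 could rely on, the "basic bricks" of a workflow can be given Merkle-style structural signatures: each step's signature hashes its tool label together with the signatures of its parent steps, so identical sub-workflows collapse to the same index key without running a full subgraph-isomorphism test. The sketch below is a minimal illustration; the step names, tool labels and hashing scheme are hypothetical, not the method the project will define.

```python
import hashlib

def brick_signatures(workflow):
    """Structural signature for every step of a workflow DAG.
    `workflow` maps step name -> (tool_label, [parent step names])."""
    signatures = {}

    def sig(step):
        if step not in signatures:
            tool, parents = workflow[step]
            parent_sigs = ",".join(sorted(sig(p) for p in parents))
            # Hash the tool label plus the (order-independent) parent signatures.
            signatures[step] = hashlib.sha256(
                f"{tool}|{parent_sigs}".encode()).hexdigest()[:12]
        return signatures[step]

    for step in workflow:
        sig(step)
    return signatures

def shared_bricks(wf_a, wf_b):
    """Signatures present in both workflows: candidate reused (sub-)protocols."""
    return set(brick_signatures(wf_a).values()) & set(brick_signatures(wf_b).values())

# Hypothetical example: a variant-calling workflow (wf_b) extending an
# alignment workflow (wf_a); the three alignment steps are detected as
# shared bricks.
wf_a = {"qc": ("FastQC", []),
        "align": ("BWA", ["qc"]),
        "sort": ("samtools-sort", ["align"])}
wf_b = dict(wf_a, call=("GATK", ["sort"]))
print(len(shared_bricks(wf_a, wf_b)))  # -> 3
```

Because a signature depends only on a step's upstream structure, the same scheme gives compact keys for retrieving a brick across a large workflow library, as targeted in T2.1(i).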
Task 3: Reuse of workflows for the analysis of new data. T3 evaluates the results obtained on the T1 protocols by the algorithms designed and implemented in T2, in close collaboration with the clinical researchers and biologists who provided the analysis protocols. T3.1 considers protocols designed in their past projects and re-executed on data from new ITX projects. Adapting the protocols to these new data will also be evaluated in cases where the characteristics of the new data sets require rethinking certain stages. The new results obtained will be interpreted in close collaboration with ITX biologists. T3.2 proposes metrics to assess the ability of a protocol or workflow to be reused or reproduced, thus contributing to the community's effort to define FAIR protocols.
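One simple way to operationalize the T3.2 metrics is a weighted checklist over FAIR-inspired criteria. The sketch below is purely illustrative: the criterion names and weights are assumptions, not the metrics the thesis will define.

```python
# Illustrative reusability checklist, loosely inspired by the FAIR
# principles; criterion names and weights are assumptions.
CRITERIA = {
    "has_persistent_id": 2,    # workflow is citable (DOI, registry entry, ...)
    "tools_versioned": 2,      # every tool pinned to an explicit version
    "env_captured": 2,         # execution environment recorded (container, ...)
    "inputs_documented": 1,    # example inputs and formats described
    "license_declared": 1,     # reuse terms are explicit
    "provenance_recorded": 2,  # e.g. a W3C PROV trace of past runs
}

def reusability_score(facts):
    """Weighted fraction of satisfied criteria, in [0, 1].
    `facts` maps criterion name -> bool."""
    total = sum(CRITERIA.values())
    met = sum(w for c, w in CRITERIA.items() if facts.get(c, False))
    return met / total

print(reusability_score({"has_persistent_id": True, "tools_versioned": True}))  # -> 0.4
```

Such a score makes workflows comparable at a glance, while the individual unmet criteria tell the workflow author exactly what to fix before the protocol can be reused in situ.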
Expected results: (i) a library of annotated executable workflows for data analysis in the context of intracranial aneurysms, (ii) a module for indexing and citing workflows, (iii) an evaluation of the robustness of the biological results obtained by reused protocols, (iv) FAIR criteria dedicated to protocols.
[AQM+11] A. Alsheikh-Ali, W. Qureshi, M. Al-Mallah et al. Public availability of published research data in high-impact journals. PLoS ONE, 6(9):e24357, 2011.
[BCC+13] K. Belhajjame, J. Cheney, D. Corsar et al., PROV-O: The PROV Ontology, W3C recommendation (2013)
[BCB+17] R. Bourcier, S. Chatel,..., A. Gaignard et al. on behalf of the ICAN Investigators, Understanding the Pathophysiology of Intracranial Aneurysm: The ICAN Project, Neurosurgery, 80(4):621–626, 2017
2018, Pages 133-141, ISSN 0002-9297, https://doi.org/10.1016/j.ajhg.2017.12.006
[Boe15] C. Boettiger, An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 49(1):71-79, 2015.
[PAC+17] C Pradal, S Artzet, J Chopard… S. Cohen-Boulakia, InfraPhenoGrid: a scientific workflow infrastructure for plant phenomics on the grid, Future Generation Computer Systems 67, 341-353, 2017
[CBG+17] S. Cohen-Boulakia,..., A. Gaignard et al. Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Computer Systems, 75, 284-298, 2017.
[DCE+07] S. Davidson, S. Cohen-Boulakia, A. Eyal et al. Provenance in Scientific Workflow Systems. IEEE Data Eng. Bull.,30(4):44--50, 2007
[DCF+17] P. Di Tommaso, M Chatzou, EW Floden et al. Nextflow enables reproducible computational workflows, Nature biotechnology 35(4):316, 2017
[Fai20] Fair workflows project (starting) https://fair-workflows.github.io/project.html
[GNT10] J. Goecks, A. Nekrutenko, J. Taylor, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biology, 11(8),R86,2010
[GBSm+17] A. Gaignard, K. Belhajjame, H. Skaf-Molli. SHARP: Harmonizing and Bridging Cross-Workflow Provenance. ESWC 2017 Satellite Events, Revised Selected Papers, 2017.
[GSmB+19] A. Gaignard, H. Skaf-Molli, K. Belhajjame. Findable and Reusable Workflow Data Products: A Genomic Workflow Case Study. Semantic Web Journal, Special Issue on Semantic e-Science, 2019 (accepted).
[GSS+20] C. Goble, S. Cohen-Boulakia, et al. FAIR computational workflows. Data Intelligence, 108-121, 2020
[HKP+18] P. Holub, F. Kohlmayer, F. Prasser et al. Enhancing reuse of data and biological material in medical research: From FAIR to FAIR-health. Biopreservation and biobanking, 16(2):97-105, 2018.
[KR12] J. Köster and S. Rahmann. Snakemake - a scalable bioinformatics workflow engine. Bioinformatics, 2012.
[LCL+19] F. Lemoine, D. Correia, V. Lefort… S. Cohen-Boulakia, O. Gascuel, NGPhylogeny.fr: new generation phylogenetic services for non-specialists, Nucleic Acids Research, 47(W1):W260–W265
[MNV+17] B. Mons, C Neylon, J Velterop et al. Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud, Information Services & Use 37 (1):49-56, 2017.
[NAB+19] A. Nouri, F. Autrusseau, R. Bourcier, …, A. Gaignard, et al. 3D bifurcations characterization for intra-cranial aneurysms prediction, Proc. SPIE 10949, Medical Imaging 2019: Image Processing
[San16] G. Santori, Journals should drive data reproducibility, Nature, 535(7612):355--355, 2016
[SPZ13] V. Stodden, P. Guo and Z. Ma. Toward reproducible computational research: an empirical analysis of data and code policy adoption by journals. PLoS ONE, 8(6):e67111, 2013.
[Yaf15] B. Yaffe. Reproducibility in science. Science Signaling, 8(371):eg5, 2015.
[WDA+16] M. Wilkinson, M. Dumontier, I. Aalbersberg et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018, 2016
The PhD will take place in the Laboratoire de Recherche en Informatique (LRI, Orsay) in close collaboration with The Institut du Thorax (Nantes). Registration will be done at the Doctoral School STIC of the University Paris-Saclay.
Master's degree (M2) in Computer Science or Bioinformatics.
Good knowledge of databases (ideally including data integration), knowledge representation (RDF), and graph algorithms. Python programming. Very good communication skills, particularly in an interdisciplinary environment. Ability to communicate in English is a plus. Knowledge of scientific workflow systems (Nextflow, Snakemake, Galaxy, ...) is a plus.