General information
Offer title: PhD on Building an Analytical Pipeline Management and Reuse System Driven by Data/System/Human Dimensions (M/F)
Reference: UMR5217-SIHAME-014
Number of positions: 1
Workplace: Saint-Martin-d'Hères
Publication date: Wednesday, 15 October 2025
Contract type: Fixed-term doctoral contract (CDD Doctorant)
Contract duration: 36 months
Thesis start date: 1 April 2026
Working time: Full-time
Remuneration: €2,200 gross per month
CN Section(s): 06 - Information sciences: foundations of computer science, computation, algorithms, representations, uses
Thesis topic description
Data exploration is the process of progressively querying a dataset. The most common approach is to generate pipelines of operators whose goal is to transform the data to accomplish a task. These operators include exploration and synthesis actions, as well as calls to pre-trained models such as language models. Many reinforcement learning methods aim to generate an exploration policy that produces such a pipeline [1, 2, 3]. In education, for example, a pipeline would serve a specific learning objective, i.e., a task such as reducing learning gaps [4]. Training models is costly in both time and money. Moreover, models, and more generally the pipelines that use them, have a significant environmental impact.
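To make the notion of an analytical pipeline concrete, below is a minimal illustrative sketch in Python; the operator names (select_recent, summarize) and the data are hypothetical stand-ins for the exploration and synthesis actions described above.

    from dataclasses import dataclass
    from typing import Any, Callable, List

    # An operator transforms an intermediate dataset; an analytical pipeline
    # is an ordered composition of such operators, reduced here to callables.
    Operator = Callable[[Any], Any]

    @dataclass
    class Pipeline:
        name: str
        operators: List[Operator]

        def run(self, data: Any) -> Any:
            # Apply each operator in sequence to the evolving dataset.
            for op in self.operators:
                data = op(data)
            return data

    # Hypothetical operators: an exploration action (a filter), then a
    # synthesis action standing in for a call to a pre-trained model.
    def select_recent(rows):
        return [r for r in rows if r["year"] >= 2020]

    def summarize(rows):
        return {"count": len(rows)}

    pipeline = Pipeline("explore-then-summarize", [select_recent, summarize])
    print(pipeline.run([{"year": 2019}, {"year": 2021}, {"year": 2024}]))  # {'count': 2}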
The goal of this thesis is to develop a system and algorithms for managing analytical pipelines that promote the reuse of pipelines for future data mining tasks. This work is distinguished by its focus on characterizing these pipelines with metadata reflecting the results of their Data/System/Human evaluation. To this end, the scientific contributions will be: (1) the design of expressive operators to represent analytical pipelines [ADD REF DocETL]; (2) the design of a pipeline storage and retrieval backend; (3) the formalization and resolution of multi-objective optimization problems (involving the Data/System/Human dimensions) for the search and reuse of pipelines to execute a task; (4) the empirical validation of this work on a variety of data mining tasks.
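To fix ideas on contribution (3), here is a minimal sketch of one possible formalization; the notation (pipeline repository \mathcal{P}, task t, objective functions f_data, f_sys, f_hum) is illustrative and not taken from the project:

    \max_{p \in \mathcal{P}} \; \bigl( f_{\mathrm{data}}(p, t),\ f_{\mathrm{sys}}(p, t),\ f_{\mathrm{hum}}(p, t) \bigr)

where f_data measures result quality for the task, f_sys aggregates execution cost (time, money, energy), and f_hum captures user-facing criteria. Since these objectives generally conflict, the solution is a Pareto front of non-dominated pipelines, or a single pipeline under a scalarization such as \sum_i w_i f_i(p, t).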
State of the art: The research in this thesis relates to two active research areas: (1) the reusability of ML models, and (2) declarative systems for defining and executing pipelines. Regarding (1), the relevant work concerns the reusability of ML models during training, as in AutoML/VirnyFlow systems [ADD REF], or during inference [ADD REF]. Regarding (2), the relevant work concerns systems such as PALIMPZEST and LOTUS, which offer a declarative language to facilitate pipeline specification and an approach to optimizing their execution [ADD REF]. This thesis falls within the framework of reusability at inference time, from which it distinguishes itself by adding an optimization layer for choosing among the pipelines to reuse.
Tasks:
State of the art on (1) the reusability of policies at inference time in reinforcement learning, (2) AutoML/VirnyFlow systems dedicated to training ML models, and (3) declarative pipeline definition and execution systems such as PALIMPZEST and LOTUS
Design of data mining tasks, such as recommendation, for different datasets
Generation of analytical pipelines to execute these tasks
Design of the schema and database (vector database) to represent, store, and retrieve pipelines along with their metadata, including the results of their Data/System/Human evaluation (see the sketch after this list)
Formalization of multi-objective optimization problems combining the Data, System, and Human dimensions for the search for analytical pipelines to execute a given task (also illustrated in the sketch after this list)
Empirical comparison of the proposed solution with declarative pipeline definition and execution systems such as PALIMPZEST and LOTUS on various datasets
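As indicated in the tasks above, the following is a rough, self-contained sketch of how the storage/retrieval and multi-objective search tasks could fit together; the in-memory index, the embedding vectors, and the scoring weights are illustrative assumptions, not project choices (a real backend would use a vector database).

    import numpy as np

    # In-memory stand-in for a vector database: each stored pipeline carries
    # an embedding of its description plus Data/System/Human metadata.
    PIPELINES = [
        {"name": "p1", "emb": np.array([0.9, 0.1]), "data": 0.8, "system": 0.4, "human": 0.7},
        {"name": "p2", "emb": np.array([0.2, 0.95]), "data": 0.6, "system": 0.9, "human": 0.5},
    ]

    def cosine(a, b):
        # Similarity between a task embedding and a pipeline embedding.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(task_emb, weights=(0.5, 0.25, 0.25), k=1):
        # Retrieve by embedding similarity, then re-rank with a weighted
        # scalarization of the Data/System/Human dimensions.
        wd, ws, wh = weights
        scored = []
        for p in PIPELINES:
            relevance = cosine(task_emb, p["emb"])
            utility = wd * p["data"] + ws * p["system"] + wh * p["human"]
            scored.append((relevance * utility, p["name"]))
        return sorted(scored, reverse=True)[:k]

    # A task whose embedding is close to p1's description retrieves p1 first.
    print(search(np.array([1.0, 0.0])))

A scalarization keeps the example short; a Pareto-based variant would instead return all non-dominated pipelines and leave the final choice to the optimizer or the user.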
REFERENCES
[0] Bekker C., Rothmann S., Kloppers M. (2023). The Happy Learner: Effects of Academic Boredom, Burnout, and Engagement. Frontiers in Psychology, 13.
[1] Abdin M. et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219.
[2] Amer-Yahia S. (2024). Intelligent Agents for Data Exploration. Proc. VLDB Endow. 17(12).
[3] Black A. E., Deci E. L. (2000). The Effects of Instructors' Autonomy Support and Learners' Autonomous Motivation on Learning Organic Chemistry. Science Education, 84(6), 740-756.
[4] Basawapatna A., Repenning A., Koh H., Nickerson H. (2013). The Zones of Proximal Flow: Guiding students through a Space of Computational Thinking Skills. ICER 2013 67-74.
[5] Besta M. et al. (2024). Graph of Thoughts: Solving Elaborate Problems with LLMs. AAAI Conference on AI.
[6] Bittencourt I. et al. (2023). Positive AI in Education (P-AIED): A Roadmap. Journal of AIED.
[7] Bonino G., Sanmartino G., Gatti-Pinheiro, Papotti P., Troncy R., Michiardi P. (2024). Fine-Tuning a Large Language Model for Socratic Interactions. Workshop on AI for Education (AI4EDU).
[8] Bouarour N., Benouaret I., Amer-Yahia S. (2024). Multi-Objective Test Recommendation for Adaptive Learning. Trans. Large Scale Data Knowl. Centered Syst., 1-36.
[9] Chiang D., Lee H. (2023). Can LLMs be an Alternative to Human Evaluations? Assoc. for Comp. Linguistics.
[10] Durante Z., Huang Q., Wake N., Gong R., Park J. S., Sarkar B., Taori R., Noda Y., Terzopoulos D., Choi Y. (2024). Agent AI: Surveying the Horizons of Multimodal Interaction. arXiv:2401.03568.
[11] Deeva G., Bogdanova D., Serral E., Snoeck M., De Weerdt J. (2021). A Review of Automated Feedback Systems for Learners. Computers & Education, 162.
[12] Gao J., Galley M., Li L. (2018). Neural Approaches to Conversational AI. Association for Comp. Linguistics.
[13] Guinet G., Omidvar-Tehrani B., Deoras A., Callot L. (2024). Automated Evaluation of Retrieval-Augmented LLMs with Task-Specific Exam Generation. arXiv:2405.13622
[14] Hao S., Gu Y., Ma H., Hong J., Wang D., Hu Z. (2023). Reasoning with Language Model is Planning with World Model. EMNLP.
[15] Heutte J., Fenouillet F., Martin-Krumm C., Gute G., Raes A., Gute D., Bachelet R., Csikszentmihalyi M. (2021). Optimal Experience in Adult Learning: Validation of Flow in Education. Frontiers in Psychology, 12, 1-12.
[16] Hong J., Lee N., Thorne J. (2024). ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691.
[17] Huang X., Liu W., Chen X., Wang X., Wang H., Lian D., Wang Y., Tang R., Chen E. (2024). Understanding the Planning of LLM Agents: A Survey. arXiv:2402.02716.
[18] Liévin V., Hother C. E., Motzfeldt A. G., Winther O. (2022). Can LLMs Reason about Medical Questions? arXiv:2207.08143
[19] Liu Y., Iter D., Xu Y., Wang S., Xu R., Zhu C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP.
[20] Matsubara M., Borromeo R., Amer-Yahia S., Morishima A. (2021). Task assignment strategies for Crowd Worker Ability Improvement. ACM Hum. Comput. Interact. 5 (CSCW): 1-375.
[21] McKee K. R., Tacchetti A., Bakker M. A. et al. (2023). Scaffolding Cooperation in Human Groups with Deep Reinforcement Learning. Nat. Hum. Behav., 7, 1787-1796.
[22] Melnyk I., Mroueh Y., Belgodere B., Rigotti M., Nitsure A., Yurochkin M., Greenewald K., Navratil J., Ross J. (2024). Distributional Preference Alignment of LLMs via Optimal Transport. arXiv:2406.05882.
[23] Pilourdault J., Amer-Yahia S., Basu Roy S., Lee D. (2023). Task Relevance and Diversity as Worker Motivation in Crowdsourcing. IEEE ICDE.
[24] Rafailov R., Sharma A., Mitchell E., Manning C. D., Ermon S., Finn C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
[25] Ryan, R. M., Deci, E. L. (2000). Self-Determination Theory and the Facilitation of Intrinsic Motivation, Social Development, and Well-Being. American Psychologist, 55, 68-78.
[26] Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
[27] Shankar S., Zamfirescu-Pereira J. D., Hartmann B., Parameswaran A., Arawjo I. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. CoRR, abs/2404.12272.
[28] Wang C., Deng Y., Lyu Z., Zeng L., He J., Yan S., An B. (2024). Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning. arXiv:2406.14283.
[29] Wei J., Wang X., Schuurmans D., Bosma M., Ichter B., Xia F., Chi E. H., Le Q. V., Zhou D. (2022). Chain-of-Thought Prompting Elicits Reasoning in LLMs. NeurIPS.
[30] Yao S., Yu D., Zhao J., Shafran I., Griffiths T. L., Cao Y., Narasimhan K. R. (2023). Tree of Thoughts: Deliberate Problem Solving with LLMs. NeurIPS.
[31] Chen Y., Wang R., Jiang H., Shi S., Xu R. (2023). Exploring the Use of LLMs for Reference-Free Quality Evaluation. CoRR, abs/2304.00723.
[32] Nori H., Lee Y. T., Horvitz E. (2023). Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. CoRR, abs/2311.16452.
[33] Lamsiyah S., El Mahdaouy A. (2024). A Reinforcement Learning-based Method for Educational Question Generation. Artificial Intelligence in Education (AIED 2024).
[34] Ethayarajh K., Xu W., Muennighoff N., Jurafsky D., Kiela D. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306.
Work context
European context: The work in this thesis takes place within the DataGEMS project, which proposes a data discovery platform with generalized exploration, management, and search capabilities. DataGEMS is based on the principles of data FAIRness, openness, and reuse. It aims to seamlessly integrate data sharing, discovery, and analysis into a system that covers the entire data lifecycle (sharing, storage, management, discovery, analysis, and reuse of data and/or metadata), bridging the gap between data providers and data consumers.
DataGEMS is a HORIZON-INFRA-2024-EOSC-01-05 Research and Innovation Action (HORIZON-RIA) whose goal is to build a fully operational and sustainable ecosystem of free and open-source tools and services covering all phases of the data lifecycle, including storage and management, discovery, analysis, description, publication, and reuse, in support of data FAIRness. The project involves 12 partners across eight European countries who will collaborate to develop new tools and services that enable faster access to FAIR-by-design datasets. These tools facilitate the collection and analysis of heterogeneous and/or large-scale datasets, ensure the automatic production of FAIR data at the research-instrument level (e.g., weather stations), and support infrastructures with metadata automation tools and techniques.
Research laboratory context: The work will be carried out at the Grenoble Computer Science Laboratory (LIG). The LIG brings together nearly 450 researchers, lecturer-researchers, doctoral students, and research support staff, affiliated with various organizations and spread across three sites: the university campus, Minatec, and Montbonnot. The goal is to leverage the complementarity and recognized quality of the LIG's 24 research teams to contribute to the development of the fundamental aspects of computer science (models, languages, methods, algorithms) and to foster synergy between the conceptual, technological, and societal challenges associated with the discipline.
The LIG aims to be a laboratory focused on the foundations and development of computer science, while ensuring ambitious outreach to society to address new challenges. The host team, DAISY, is a joint CNRS, Grenoble INP, and UGA research team within the Intelligent Systems for Data, Knowledge, and Humans research axis. Much of DAISY's research is evaluated using methods borrowed from information retrieval and machine learning.
Constraints and risks
The position is located in an area subject to French legislation on the protection of scientific and technical potential (PPST), and therefore requires, in accordance with regulations, that your arrival be authorized by the competent authority of the Ministry of Higher Education and Research (MESR).