General information
Job title: PhD student (M/W): continual learning of large language models
Reference: UMR7503-CHRCER-001
Number of positions: 1
Workplace: VANDOEUVRE LES NANCY
Publication date: Friday 8 September 2023
Type of contract: Fixed-term doctoral contract
Contract duration: 36 months
Thesis start date: 1 November 2023
Working time: Full time
Remuneration: €2,135.00 gross monthly
CN Section(s): Information sciences: processing, integrated hardware-software systems, robots, commands, images, content, interactions, signals and languages
Thesis subject description
Large Language Models (LLMs) have exhibited exceptional abilities across a wide range of Natural Language Processing (NLP) tasks, boosting the performance of automatic systems sometimes up to or beyond human level on standard NLP benchmarks. Although such generic LLMs perform exceptionally well on many downstream NLP tasks, their zero-shot performance still lags behind that of specialised models on specific tasks and domains, hence the need to efficiently adapt such LLMs to specific data. Furthermore, the information these models encode in their parameters becomes outdated over time, and keeping it up to date requires advanced training and finetuning methods for LLMs, in particular approaches that enable continual updating of the model's parameters without catastrophic forgetting.
The objective of the thesis is to explore and propose novel methods to update the LLM parameters with new information.
Continual learning of LLMs differs from classical continual learning in Machine Learning, which focuses on training on a sequence of distinct, successive tasks without access to past tasks. With LLMs, the objective is rather to inject up-to-date information into the model while ensuring it neither loses its generic abilities nor forgets previous knowledge. We will explore several options to reach this goal, in particular combining the growing and sparsity paradigms. Growing neural networks are a viable solution to catastrophic forgetting (Evci et al. 2022; Moeed et al. 2020) in small neural networks, and Synalp has experience in growing dense networks within large transformers (Caillon 2023). Yet, scaling to billion-parameter LLMs requires adapting the growing approach to parameter-efficient training (Wang et al. 2023), which we propose to investigate in this project, for instance by growing soft prompts (or other parameter-efficient modules such as adapters or LoRA) to accumulate new knowledge while freezing all other LLM parameters to avoid forgetting. As a result, we will build an open-source LLM that takes continuous news streams as input and autonomously captures and stores this new information. For evaluation, standard NLP benchmarks fail to assess such continuously updated models because they are themselves frozen in time; we will instead rely on a new evaluation protocol, RealTime QA, which builds up-to-date question-answering tests from CNN news streams (Kasai et al. 2022).
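As a rough illustration of the parameter-efficient direction outlined above, the sketch below shows soft-prompt tuning in plain PyTorch: a small matrix of trainable "virtual token" embeddings is prepended to the input embeddings of a frozen causal LM, so new information accumulates in the prompt while the backbone weights, and hence previously acquired knowledge, remain untouched. This is a minimal, hypothetical example; the "gpt2" backbone, the hyperparameters and the training_step helper are illustrative placeholders, not the project's actual setup.

# Minimal soft-prompt tuning sketch (illustrative only, not the project's code).
# A small matrix of "virtual token" embeddings is trained while every
# parameter of the pretrained LM stays frozen.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # freeze the backbone to avoid forgetting

n_virtual_tokens = 20
embed_dim = model.get_input_embeddings().embedding_dim
soft_prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.02)

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def training_step(text: str) -> float:
    """One update of the soft prompt on a new piece of text (e.g. a news item)."""
    batch = tokenizer(text, return_tensors="pt")
    input_ids = batch["input_ids"]                        # (1, seq_len)
    tok_embeds = model.get_input_embeddings()(input_ids)  # (1, seq_len, dim)
    # Prepend the trainable virtual tokens to the real token embeddings.
    prompt = soft_prompt.unsqueeze(0)                     # (1, n_virtual, dim)
    inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)
    # Ignore the loss on the virtual-token positions (-100 is the ignore index).
    labels = torch.cat(
        [torch.full((1, n_virtual_tokens), -100), input_ids], dim=1
    )
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    loss = model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

Only the soft_prompt tensor receives gradient updates here, which is what prevents the frozen backbone from drifting; the same pattern extends to adapters or LoRA modules by swapping which small set of parameters is left trainable.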
Beyond purely textual multilingual models, a preferred application use case for demonstrating these contributions to LLM finetuning will be multimodal models, such as improving state-of-the-art speech recognition with an updated LLM, specifically on spontaneous conversational speech involving specialised or recent vocabulary for specific domains. In particular, we will adapt and evaluate ASR models on the medical regulation and meeting dialogue domains. We will investigate multimodal self-supervised learning on audio, text, or aligned audio-text corpora to improve the accuracy of ASR transformer models such as Whisper and MMS in a context of constantly evolving vocabulary. The main motivation is to enable seamless updates of these state-of-the-art models, allowing them to recognise novel words that were never included in their initial training. The challenge lies in the fact that encoder-decoder transformers combine a joint acoustic and language model: modifying the language-modelling part without affecting the acoustic-modelling part is difficult, as is preventing catastrophic forgetting. The student will explore various methods for incorporating new vocabulary into existing models, even when no or few audio samples are available for the new words. The approaches we plan to investigate include leveraging speech synthesis, embedding text into the intermediate representations of the models, and combining the model's predictions with an LLM. The models will be evaluated by word and character error rates, which will also be computed for sub-groups of the vocabulary specific to the medical domain (e.g. drug, symptom, and disease names).
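To make the evaluation protocol concrete, here is a minimal sketch of how word error rate can be computed from reference and hypothesis transcripts, together with a coarse helper for tracking recognition of domain-specific terms such as drug names. This is plain Python under stated assumptions (in practice a library such as jiwer would typically be used); the domain_recall function and the example terms are hypothetical.

# Minimal word-error-rate sketch (illustrative only).  Edit distance over
# words gives WER; the same routine applied to characters gives CER.
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def domain_recall(reference: str, hypothesis: str, domain_terms: set) -> float:
    """Fraction of in-domain reference words (e.g. drug names) that appear in
    the hypothesis -- a coarse proxy for sub-group error tracking."""
    ref_terms = [w for w in reference.lower().split() if w in domain_terms]
    hyp_words = set(hypothesis.lower().split())
    if not ref_terms:
        return float("nan")
    return sum(w in hyp_words for w in ref_terms) / len(ref_terms)

# Hypothetical usage:
# print(wer("the patient received paracetamol",
#           "the patient received para set a mall"))
# print(domain_recall("the patient received paracetamol",
#                     "the patient received para set a mall",
#                     {"paracetamol", "ibuprofen"}))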
References
Caillon, Paul. 2023. “Weakly Supervised Deep Learning for Natural Language Processing.” PhD thesis, Nancy, France: Université de Lorraine.
Evci, Utku, Max Vladymyrov, Thomas Unterthiner, Bart van Merrienboer, and Fabian Pedregosa. 2022. “GradMax: Growing Neural Networks Using Gradient Information.” ArXiv abs/2201.05125. https://api.semanticscholar.org/CorpusID:245906452.
Kasai, Jungo, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Dragomir R. Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. 2022. “RealTime QA: What's the Answer Right Now?” ArXiv abs/2207.13332. https://api.semanticscholar.org/CorpusID:251105205.
Moeed, Abdul, Gerhard Hagerer, Sumit Dugar, Sarthak Gupta, Mainak Ghosh, Hannah Danner, Oliver Mitevski, Andreas Nawroth, and Georg Groh. 2020. “An Evaluation of Progressive Neural Networks for Transfer Learning in Natural Language Processing.” In Proceedings of the Twelfth Language Resources and Evaluation Conference, 1376–81. Marseille, France: European Language Resources Association. https://aclanthology.org/2020.lrec-1.172.
Wang, Peihao, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogério Schmidt Feris, David Cox, Zhangyang Wang, and Yoon Kim. 2023. “Learning to Grow Pretrained Models for Efficient Transformer Training.” ArXiv abs/2303.00980. https://api.semanticscholar.org/CorpusID:257280093.
Work context
This thesis is funded by the ANR project LLM4All.
The gross monthly salary is €2,135, i.e., €1,716 net.
The PhD student will start with an intensive literature review period (75% of working time) lasting a few weeks, then progressively reduce this effort to 20% of working time; the literature review will nonetheless remain active throughout the thesis, especially given the fast pace of progress in AI and NLP.
Both advanced theoretical contributions in Machine/Deep Learning and/or NLP and extensive, rigorous experimental validation are expected. The objective is to publish one or two papers per year in top-ranking conferences and journals of the field (ICLR, *-ACL, NeurIPS, ICML, AAAI...).
The PhD student will spend 100% of their working time at the LORIA laboratory in Nancy. They will benefit from an academic email account, access to online scientific libraries, journals and proceedings, access to large computing facilities, scientific events (colloquia, invited talks, workshops, team meetings...), and will follow the IAEM Doctoral School training program.