Fully-funded PhD contract 36 months M/W - Natural Language Processing: Multiword term embedding for clinical information extraction and text classification

Application Deadline : Tuesday, December 7, 2021

General information

Reference : UMR9015-PIEZWE-005
Workplace : ST AUBIN
Date of publication : Tuesday, November 16, 2021
Scientific Responsible name : Pierre Zweigenbaum
Type of Contract : PhD Student contract / Thesis offer
Contract Period : 36 months
Start date of the thesis : 1 February 2022
Proportion of work : Full time
Remuneration : 2 135,00 € gross monthly

Description of the thesis topic

The overall goal of the thesis is to design a neural architecture that optimizes a text classification task that relies on entity detection. Since multiword terms play an important role in specialized domains, the thesis hypothesizes that representing them more accurately will improve entity detection and text classification. The application domain of the thesis is medicine. More specifically, the classification task addressed is the prediction of readmission and mortality of heart failure patients from the text and structured data of their electronic health records.

The thesis has the following three sub-goals:

- Detection and representation of multiword terms and entities.
Medicine, like most technical and specialized domains, uses a large terminology with many multiword terms. Mainstream NLP techniques are based on representations of single words or even word pieces, and creating suitable multiword term representations remains challenging. A sub-goal of the thesis is to design embedding methods that better represent key pieces of information such as multiword terms and entities.

- Training specialized word embeddings with limited in-domain data.
Limited training data is an issue for any machine-learning method. In that context, pre-training is a very common practice in current neural NLP methods. The thesis will explore methods that take advantage of out-of-domain and near-domain text corpora, as well as ontologies and knowledge graphs, to obtain better specialized word embeddings for clinical text.

- Text classification for risk prediction.
The above methods will be tested through their contribution to a real-world task: the prediction of readmission and mortality of heart failure patients based upon the text and structured data of their electronic health records. An end-to-end architecture will be designed, and the respective contributions of structured data and text will be studied.

Work Context

LISN is a multidisciplinary research laboratory of CNRS and Université Paris-Saclay that gathers researchers and professors in Information Sciences and Engineering Sciences, as well as Life Sciences, and Humanities and Social Sciences. LISN is structured into five departments. One of them, Language Sciences and Technologies (STL), conducts research in natural language processing covering spoken, written, and sign language, from acoustic signal processing to semantic modeling (ILES and TLP teams). The ILES team of LISN has strong expertise in natural language processing applied to the biomedical domain.

The ILES team leads PREDHIC (Predicting heart failure readmission and mortality using natural language processing), a 3.5-year ANR-funded project between two computer science laboratories (LISN and LS2N) and two hospitals (Paris Saint-Joseph Hospital Group and Lille University Hospital). This project funds the PhD thesis.

The PhD student will contribute to the PREDHIC project. The student will belong to LISN's ILES team and will work with the team members who take part in the project. The student will be co-supervised by Pierre Zweigenbaum (LISN, Orsay), the principal investigator of the PREDHIC project, and Emmanuel Morin (LS2N, Nantes), the head of LS2N's NLP team. The work will be carried out at the LISN laboratory in Orsay; a close collaboration is planned with the Saint-Joseph hospital for data access. LISN has access to the Lab-ia GPU cluster, and access to the very large Jean-Zay national AI GPU cluster can also be obtained.

Travel is expected for meetings at project partners' sites in France, and for workshops and conferences in France and abroad. The thesis is tied to the time frame of the PREDHIC project, with its work packages and deliverables. Remote work may be set up if the situation requires it.

Constraints and risks

Risks related to on-screen work.
