PhD in Natural Language Processing: Statistical analysis of lexical distributions with an application to anomaly detection in natural texts (M/F)

Application Deadline : Thursday, July 14, 2022

Make sure your candidate profile is correctly filled in before applying. The information in your profile complements the information attached to each application. To increase your visibility on our Job Portal and allow recruiters to view your candidate profile, you can upload your CV to our CV library in one click.

General information

Reference : UMR9015-FRAYVO-009
Workplace : ST AUBIN
Date of publication : Thursday, June 23, 2022
Scientific Responsible name : François Yvon / Pablo Piantanida
Type of Contract : PhD Student contract / Thesis offer
Contract Period : 36 months
Start date of the thesis : 1 October 2022
Proportion of work : Full time
Remuneration : 2 135,00 € gross monthly

Description of the thesis topic

Forged texts and misinformation are persistent problems, and they surround us in biased software that amplifies only our own opinions for a “better”, more seamless user experience. On social media platforms, they are used by rogue states, businesses and individuals to spread misinformation, amplify doubts about factual data or tarnish competitors and adversaries, thereby enhancing their own strategic or economic positions. This spread may result from different factors and incentives; however, each poses the same fundamental problem for humanity: confusion about what is true and what is false.

Deep learning models for large-scale text generation, such as GPT-3, have seen widespread use in recent years owing to their superior performance over traditional generation methods: they produce texts of great quality, coherence and relevance that are sometimes hard to distinguish from human productions. These models generate text via an autoregressive procedure that samples from a distribution learnt to mimic the "true" distribution of human-written texts. Malicious uses of these technologies thus constitute a major threat to truthful information.
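As a rough illustration of the autoregressive procedure mentioned above, the sketch below samples text word by word from a toy bigram model. It is not how GPT-3 works internally: the tiny corpus, the bigram conditioning and the function names `train_bigram` and `generate` are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Estimate next-word counts given the previous word from adjacent pairs."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def generate(table, start, max_len=10, seed=0):
    """Autoregressive sampling: each new word is drawn conditioned on the
    text generated so far (here, a toy model that only sees the last word)."""
    rng = random.Random(seed)
    out = [start]
    # Stop at max_len or when the last word has no observed successor.
    while len(out) < max_len and out[-1] in table:
        next_counts = table[out[-1]]
        words = list(next_counts)
        weights = [next_counts[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out

# Hypothetical toy corpus standing in for "human-written texts".
corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = train_bigram(corpus)
sample = generate(model, "the")
print(" ".join(sample))
```

A large language model replaces the bigram table with a neural network conditioned on a much longer (but still fixed-length) context; the sampling loop has the same shape.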

Artificial text detection can be viewed as a special case of anomaly detection, broadly defined as the task of identifying examples that deviate from regular ones to a degree that arouses suspicion. Current research in anomaly detection largely either focuses on deep classifiers (e.g., out-of-distribution detection, adversarial attacks) or relies on the output of large language models (LMs) when labels are unavailable. Although these lines of research are appealing, they do not scale without large amounts of compute. Additionally, these methods make the fundamental assumptions that (1) the statistical information needed to identify anomalies is available in the trained model, and (2) the model uncertainty can be trusted, which is typically not the case, as illustrated by their behaviour in the presence of a small shift in the input distribution. Moreover, LM-based approaches do not perform well on large text fragments, as may be needed in practical applications (e.g., novel, story or news generation), because of the fixed-length context used when training the language model.

This PhD thesis focuses on developing hybrid anomaly detection methods that combine deep neural network techniques with linguistically inspired word frequency distributions. Most research on language models to date focuses on sentence-level processing and fails to capture long-range dependencies at the discourse level. Instead, we will leverage word frequency distributions and information measures to characterize long documents. These distributions involve a very large number of rare words, which often leads to strange statistical phenomena such as mean frequencies that systematically keep changing as the number of observations increases. Advanced concepts from statistics and information measures are necessary to analyze word frequency distributions and to capture document-level information. Extensive experiments on real-world datasets will be carried out to demonstrate the viability of our approach.
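The effect of rare words on mean frequencies can be seen in a small simulation. The sketch below draws word ids from a Zipf-like distribution and reports the mean frequency per distinct word, N / V(N), at growing sample sizes N. The Zipf exponent, the vocabulary size and the function names are assumptions chosen for illustration, not part of the thesis proposal.

```python
import random
from collections import Counter

def zipf_sample(n_tokens, vocab_size=50_000, exponent=1.0, seed=0):
    """Draw n_tokens word ids where rank r has weight 1 / r**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (r ** exponent) for r in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=n_tokens)

def mean_frequency_curve(tokens, checkpoints):
    """Return (N, V(N), N / V(N)) at each checkpoint N, where V(N) is the
    number of distinct word types among the first N tokens."""
    wanted = set(checkpoints)
    counts = Counter()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if i in wanted:
            curve.append((i, len(counts), i / len(counts)))
    return curve

tokens = zipf_sample(100_000)
curve = mean_frequency_curve(tokens, [1_000, 10_000, 100_000])
for n, n_types, mean_freq in curve:
    print(f"N={n:>7}  types={n_types:>6}  mean frequency={mean_freq:.2f}")
```

Because new rare words keep appearing as more text is read, the mean frequency does not settle to a stable value in this simulation but keeps drifting with N, so a statistic estimated on a short text is not directly comparable to the same statistic on a long one.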

Work Context

This thesis is a partnership between the Laboratoire Interdisciplinaire des Sciences du Numérique (LISN, Université Paris-Saclay) and the International Research Laboratory on Learning Systems (ILLS, Montréal). A joint supervision with McGill University or the École de Technologie Supérieure (ETS) of Montreal under the co-direction of Pablo Piantanida (Director of the ILLS) is planned. The PhD student will share the academic year between LISN at Paris-Saclay University and ILLS in Montreal, which will facilitate collaborations with other researchers from Canadian institutions involved in ILLS (MILA, ETS, McGill University).

The doctoral school at the Université Paris-Saclay will be the ICST doctoral school, in Pôle B (Data, knowledge, learning and interactions).
