General information
Offer title: Emotional Assessment and Adaptation of Voice-Based LLMs in Assistive Robotics for Elderly Care and Ethical Challenges (M/F)
Reference: UMR9015-LAUDEV-005
Number of positions: 1
Workplace: GIF SUR YVETTE
Publication date: Thursday, 24 July 2025
Contract type: Fixed-term doctoral contract (CDD Doctorant)
Contract duration: 36 months
Thesis start date: 1 November 2025
Working time: Full-time
Salary: €2,200 gross per month
CN Section(s): 07 - Information sciences: processing, integrated hardware-software systems, robots, control, images, content, interactions, signals and languages
Thesis topic description
Context: As part of the ANR chair HUMAAINE (HUman-MAchine Affective INteraction & Ethics) at LISN-CNRS, we have been developing research on spoken interaction with social robots. Five theses have already been defended on emotion detection systems (Feng, 2025; Deschamps-Berger, 2024a; Ali Mehenni, 2023) and on affective nudges (Kalashnikova, 2024; Kobylyanskaya, 2024). Within this Chair, we have the opportunity to use robots to enact emotional interactions.
Initial analyses have shown the value of Large Language Models (LLMs) (Vaswani et al., 2017) and of combining acoustic (wav2vec) and textual (FlauBERT) modalities for emotion modeling (Deschamps-Berger et al., 2024b). The emergence of LLMs has led to spoken dialogue systems such as GPT-4o, which recognizes and imitates human speech.
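To make the acoustic-textual multimodality concrete, a common baseline is late fusion: the acoustic embedding of an utterance (e.g. from a wav2vec-style encoder) is concatenated with its text embedding (e.g. from FlauBERT) and passed to a linear classification head. The sketch below is purely illustrative and not the thesis method: random low-dimensional vectors stand in for real encoder outputs, and the emotion labels and weights are invented for the example.

```python
import random

# Illustrative stand-ins for encoder outputs (real wav2vec 2.0 / FlauBERT
# embeddings are typically 768-dimensional; we shrink them for the sketch).
ACOUSTIC_DIM, TEXT_DIM = 8, 8
EMOTIONS = ["neutral", "anger", "sadness", "joy"]

def late_fusion(acoustic_vec, text_vec, weights):
    """Concatenate the two modality vectors, then score each emotion
    with a linear head and return the best-scoring label."""
    fused = acoustic_vec + text_vec  # list concatenation = vector concat
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return EMOTIONS[max(range(len(scores)), key=scores.__getitem__)]

random.seed(0)
# Untrained random head, one weight row per emotion class.
weights = [[random.uniform(-1, 1) for _ in range(ACOUSTIC_DIM + TEXT_DIM)]
           for _ in EMOTIONS]
acoustic = [random.random() for _ in range(ACOUSTIC_DIM)]
text = [random.random() for _ in range(TEXT_DIM)]
print(late_fusion(acoustic, text, weights) in EMOTIONS)  # prints True
```

In practice the head would be trained on labeled emotional speech, and fusion can also happen earlier (e.g. cross-modal attention, as in Deschamps-Berger et al., 2024b).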
Many of us attribute to these AI devices capabilities they do not have: knowledge, affect, and even moral values, which leaves us vulnerable to these hallucination-prone AIs. ChatGPT, although not explicitly built to be emotionally intelligent, nevertheless reproduces emotional patterns present in its training data (Fang et al., 2025). LLMs are now widely used in user-facing interaction systems. Among them, some approaches are termed "speech-to-speech", such as Moshi (Défossez et al., 2024). Moshi is a speech-interaction system that has been directly fine-tuned on emotional interactions (Défossez et al., 2022, 2024) rather than being limited to an implicit representation of emotions formed from traditional training corpora.
Description: The aim of this thesis is to obtain better control over the emotional dimension of responses produced by LLM-based systems and over their possible deviations (hallucination, toxicity, etc.). In particular, for a robot-assistant application for the elderly, it is essential to adapt and control this type of system so as to keep the agent's responses appropriate. New measures of the ethical and explainability dimensions, as well as benchmarks, will be developed.
The thesis will focus on hybridizing speech-to-speech LLMs with classical methods. An experiment will combine LLMs with emotion detection models. In addition, we plan to experiment with fine-tuning and retrieval-augmented generation (RAG) methods (Lewis et al., 2021; Huang et al., 2024, 2025) to improve the control and emotion representation of these LLMs.
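One way such a RAG component could steer emotional responses is to tag each retrievable memory with an emotion label and bias retrieval toward entries matching the emotion detected in the user's speech, in the spirit of Emotional RAG (Huang et al., 2024). The sketch below is a minimal, hypothetical illustration: the two-dimensional vectors, the memory entries, and the `emotion_bonus` weighting are invented for the example, not taken from the cited work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Tiny emotion-tagged memory; real systems would store encoder embeddings.
MEMORY = [
    {"text": "Would you like me to call your family?",
     "emotion": "sadness", "vec": [0.9, 0.1]},
    {"text": "Wonderful news! Shall we celebrate?",
     "emotion": "joy", "vec": [0.1, 0.9]},
]

def retrieve(query_vec, user_emotion, k=1, emotion_bonus=0.5):
    """Score = semantic similarity + bonus when the entry's emotion tag
    matches the emotion detected in the user's speech."""
    def score(entry):
        match = emotion_bonus if entry["emotion"] == user_emotion else 0.0
        return cosine(query_vec, entry["vec"]) + match
    return sorted(MEMORY, key=score, reverse=True)[:k]

# A "sad" query retrieves the empathetic memory entry.
print(retrieve([0.8, 0.2], "sadness")[0]["text"])
```

The retrieved entries would then be injected into the prompt of the speech-to-speech LLM, giving an explicit, inspectable handle on the emotional register of the generated response.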
The adaptation of speech-to-speech LLMs to emotions will be evaluated with elderly people. The expected results are methodological contributions and best practices for designing voice interaction devices that take into account the emotions and ethical dimensions of exchanges with elderly people.
Selected bibliographical references:
(Ali Mehenni, 2023) Hugues Ali Mehenni. « 'Nudges' dans l'interaction homme-machine : analyse et modélisation d'un agent capable de nudges personnalisés ». PhD thesis, 2023UPASG043, 2023. URL: http://www.theses.fr/2023UPASG043/document.
(Défossez et al., 2022) Alexandre Défossez et al. High Fidelity Neural Audio Compression. 2022. arXiv: 2210.13438 [eess.AS]. URL: https://arxiv.org/abs/2210.13438.
(Défossez et al., 2024) Alexandre Défossez et al. Moshi: a speech-text foundation model for real-time dialogue. 2024. arXiv: 2410.00037 [eess.AS]. URL: https://arxiv.org/abs/2410.00037.
(Défossez, 2025) Alexandre Défossez. "Moshi: a speech-text foundation model for real-time dialogue". YouTube, March 2025. URL: https://www.youtube.com/watch?v=0_c3bw_x6uU.
(Deschamps-Berger, 2024a) Théo Deschamps-Berger. "Social Emotion Recognition with multimodal deep learning architecture in emergency call centers". PhD thesis, 2024UPASG036, 2024. URL: http://www.theses.fr/2024UPASG036/document.
(Deschamps-Berger et al., 2024b) Théo Deschamps-Berger, Lori Lamel and Laurence Devillers. "Exploring Attention Mechanisms for Multimodal Emotion Recognition in an Emergency Call Center Corpus". In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1-5. DOI: 10.1109/ICASSP49357.2023.10096112.
(Fang et al., 2025) Cathy Mengying Fang et al. How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study. 2025. arXiv: 2503.17473 [cs.HC]. URL: https://arxiv.org/abs/2503.17473.
(Feng, 2025) Yajing Feng. "Continuous emotion recognition in real-life call center conversations". PhD thesis, 2025.
(Huang et al., 2024) Le Huang et al. Emotional RAG: Enhancing Role-Playing Agents through Emotional Retrieval. 2024. arXiv: 2410.23041 [cs.AI]. URL: https://arxiv.org/abs/2410.23041.
(Huang et al., 2025) Ailin Huang et al. Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction. 2025. arXiv: 2502.11946 [cs.CL]. URL: https://arxiv.org/abs/2502.11946.
(Kalashnikova, 2024) Natalia Kalashnikova. "Towards detection of nudges in Human-Human and Human-Machine interactions". PhD thesis, 2024UPASG031, 2024. URL: http://www.theses.fr/2024UPASG031/document.
(Kobylyanskaya, 2024) Sofiya Kobylyanskaya. "Towards multimodal assessment of L2 level: speech and eye tracking features in a cross-cultural setting". PhD thesis, 2024UPASG111, 2024. URL: http://www.theses.fr/2024UPASG111/document.
(Lewis et al., 2021) Patrick Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021. arXiv: 2005.11401 [cs.CL]. URL: https://arxiv.org/abs/2005.11401.
(Vaswani et al., 2017) Ashish Vaswani et al. Attention Is All You Need. 2017. arXiv: 1706.03762 [cs.CL]. URL: https://arxiv.org/abs/1706.03762.
Work context
LISN, GPU cluster, LabIA, Jean-Zay
The position falls within a sector covered by the protection of scientific and technical potential (PPST) scheme; in accordance with regulations, your appointment therefore requires authorization by the competent authority of the MESR.
Constraints and risks
None to report