General Information
Job title: PhD Position: Multimodal Automatic Detection of Stuttering-Like Disfluencies (M/F)
Reference: UMR5267-IVADID-001
Number of positions: 1
Workplace: VANDOEUVRE LES NANCY CEDEX
Publication date: Thursday, 22 May 2025
Contract type: Fixed-term doctoral contract (CDD Doctorant)
Contract duration: 36 months
Thesis start date: 1 October 2025
Working time: Full-time
Remuneration: €2,200 gross per month
CN section(s): 06 - Information sciences: foundations of computer science, computation, algorithms, representations, exploitation
Description of the Thesis Topic
1. Introduction
Stuttering, a fluency disorder affecting millions of individuals, is characterized by stuttering-like disfluencies (blocks, prolongations, repetitions) linked to dysfunctions in speech motor control. While its automatic detection has already been explored using audio-based models, current systems remain limited by low robustness, difficulty in identifying certain disfluencies such as silent blocks, and reliance on scarce data. This PhD project proposes a multimodal approach (audio, video, text) to enhance the accuracy and robustness of disfluency detection, leveraging an audiovisual corpus of French-speaking individuals who stutter. The analysis will rely on modality-specific encoding techniques, followed by a strategic fusion of their representations for final classification.
2. Aims
The aim of this PhD is to design, develop, and evaluate a multimodal deep learning approach for the automatic detection of stuttering-like disfluencies in French, by combining audio, video, and textual modalities. The work will be based on an annotated audiovisual corpus of French-speaking people who stutter, with particular focus on disfluencies that are difficult to detect through audio alone, such as silent blocks, and on robustness to individual variability.
The doctoral candidate's work will include the following tasks (illustrative code sketches for these tasks are given after the list):
Audio encoding: Implement and adapt StutterNet (Sheikh, S. A., Sahidullah, M., Hirsch, F., & Ouni, S., 2021, "StutterNet: Stuttering Detection Using Time Delay Neural Network", EUSIPCO) to extract acoustic features relevant to disfluency detection by capturing temporal dependencies.
Video encoding: Develop and train vision models (e.g., C3D or Transformers) to analyze video sequences for visual cues of stuttering (facial tension, blinking, atypical movements). The extraction of facial landmarks (with OpenFace or MediaPipe) will also be explored as a complementary or alternative source of features.
Text encoding: Generate automatic transcriptions (via Whisper) and encode them using pre-trained language models (BERT, RoBERTa) to extract linguistic context and identify textual patterns characteristic of disfluencies.
Multimodal fusion: Implement and compare several strategies to fuse the representations from the three modalities, such as concatenation, adaptive attention mechanisms, or other approaches leveraging data complementarity.
Classification and evaluation: Develop a classifier operating on the fused representation to predict the presence or absence of stuttering within a given time window. Evaluation will rely on standard metrics (precision, recall, F1-score, AUC), and results will be compared to expert manual annotations. Qualitative analyses will also be conducted to interpret model errors and refine the approach.
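For the audio task, the following is a minimal PyTorch sketch of a TDNN-style acoustic encoder, loosely in the spirit of StutterNet; the MFCC input, layer sizes, and dilation settings are illustrative assumptions, not the published architecture.

```python
# Minimal sketch of a TDNN-style acoustic encoder (assumption: simplified, not StutterNet itself).
import torch
import torch.nn as nn

class TDNNAudioEncoder(nn.Module):
    def __init__(self, n_mfcc=20, hidden=64, embed_dim=128):
        super().__init__()
        # Dilated 1-D convolutions emulate TDNN context windows over time.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(n_mfcc, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
        )
        # Statistics pooling collapses the variable-length time axis.
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, mfcc):                  # mfcc: (batch, n_mfcc, frames)
        h = self.frame_layers(mfcc)           # (batch, hidden, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.proj(stats)               # (batch, embed_dim) audio embedding
```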
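For the video task, one possible feature source is per-frame facial landmarks; the sketch below uses MediaPipe FaceMesh, with the frame budget and the choice of FaceMesh (rather than OpenFace or a C3D/Transformer model) as illustrative assumptions.

```python
# Minimal sketch: per-frame facial landmark extraction with MediaPipe FaceMesh.
import cv2
import mediapipe as mp
import numpy as np

def extract_landmarks(video_path, max_frames=300):
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False) as face_mesh:
        while len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_face_landmarks:
                lm = result.multi_face_landmarks[0].landmark
                frames.append([(p.x, p.y, p.z) for p in lm])  # 468 normalized points
    cap.release()
    return np.array(frames)   # (frames, 468, 3), to be fed to a sequence model
```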
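For the textual task, the sketch below chains Whisper transcription with a pretrained Hugging Face encoder; the "base" Whisper model and CamemBERT (a French RoBERTa variant) are assumptions, and any BERT/RoBERTa checkpoint suited to French could be substituted.

```python
# Minimal sketch of the text branch: Whisper transcription + pretrained encoder (assumed checkpoints).
import torch
import whisper
from transformers import AutoModel, AutoTokenizer

asr = whisper.load_model("base")
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
encoder = AutoModel.from_pretrained("camembert-base")

def encode_text(audio_path):
    transcript = asr.transcribe(audio_path, language="fr")["text"]
    tokens = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**tokens).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1)                          # (1, 768) sentence embedding
```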
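Finally, for the fusion and classification tasks, the sketch below combines the three embeddings with a simple gated weighting and evaluates a binary "stuttering vs. fluent" decision per time window; the embedding dimensions, the gating scheme, and the 0.5 decision threshold are illustrative assumptions, since comparing fusion strategies is precisely part of the thesis work.

```python
# Minimal sketch of late fusion, classification, and evaluation (assumed dimensions and gating).
import torch
import torch.nn as nn
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

class FusionClassifier(nn.Module):
    def __init__(self, dims=(128, 128, 768), hidden=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.gate = nn.Linear(3 * hidden, 3)          # one attention weight per modality
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, audio, video, text):
        z = [torch.relu(p(x)) for p, x in zip(self.proj, (audio, video, text))]
        weights = torch.softmax(self.gate(torch.cat(z, dim=1)), dim=1)   # (batch, 3)
        fused = sum(w.unsqueeze(1) * zi for w, zi in zip(weights.unbind(dim=1), z))
        return self.head(fused).squeeze(1)            # one logit per time window

def evaluate(logits, labels):
    # Standard metrics against expert manual annotations.
    probs = torch.sigmoid(logits).detach().cpu().numpy()
    preds = (probs >= 0.5).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"precision": p, "recall": r, "f1": f1, "auc": roc_auc_score(labels, probs)}
```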
Beyond detection, this PhD aims to contribute methodologically to the field of multimodal fusion applied to pathological speech, with potential impact in clinical contexts.
3. Required Skills
The candidate should hold a Master's degree in computer science, have strong skills in machine learning and deep learning, and be proficient in Python and frameworks such as PyTorch or TensorFlow. An interest in signal processing (audio/video) and ideally in NLP is expected. Autonomy, rigor, critical thinking, and analytical abilities are essential, along with strong communication skills to work in a multidisciplinary environment. An interest in phonetics, linguistics, and speech disorders—particularly stuttering—would be a plus.
Work Context
The PhD candidate will take part in a multidisciplinary research project involving two complementary laboratories: LORIA, a computer science lab with expertise in speech processing and deep learning, and PRAXILING, a language sciences lab known for its work in phonetics and stuttering. The research will rely on an existing annotated audiovisual corpus of French-speaking individuals with fluency disorders. The thesis will be jointly supervised by researchers in computer science and language sciences, ensuring interdisciplinary co-supervision. The doctoral work will be primarily conducted at LORIA in Nancy, with regular stays at PRAXILING in Montpellier to foster scientific collaboration and enrich the research approach through dual expertise.
Constraints and Risks
Regular travel between the two host laboratories is expected and will be financially supported.