Reference : UMR5216-ALLBEL-020
Workplace : ST MARTIN D HERES
Date of publication : Monday, April 27, 2020
Scientific Responsible name : Gérard BAILLY
Type of Contract : PhD Student contract / Thesis offer
Contract Period : 36 months
Start date of the thesis : 1 October 2020
Proportion of work : Full time
Remuneration : 2 135,00 € gross monthly
Description of the thesis topic
This PhD is part of the THERADIA project, which aims to design a conversational agent for assisting digital therapies. The studentship involves collaboration with industrial (SBT, ATOS, Pertimm) and academic partners (EMC, LIG). The agent, embodied as a virtual avatar, will handle interaction between patients, therapists and caregivers. GIPSA is responsible for the development of the avatar and its verbal and co-verbal behavior, which must be interactive, expressive and adaptive.
Interactive, expressive and adaptive behaviors will soon be collected using Wizard-of-Oz experiments, in which a human pilot will interact with patients, therapists and caregivers via the virtual avatar, in order to provide in-context multimodal data.
Many deep learning architectures for end-to-end text-to-speech synthesis have already been proposed. Such systems are typically trained using dozens of hours of raw text aligned with the corresponding speech signals, from audiobooks read aloud by voice donors. The output speech quality is rather impressive, but such systems lack several properties required by THERADIA: multimodality, adaptivity and controlled expressivity.
The project will tackle the problem of building multimodal, adaptive and expressive end-to-end generative models using two approaches:
• A: Transfer learning and multi-tasking - The main focus of the work will be models that accept heterogeneous output data (audio, visual, phonetic) at various temporal scales. The challenge here is to combine data from different sources (audiobooks, Wizard of Oz, transcriptions) and factor out the different sources of variation (speakers, styles, etc.).
• B: Speaking styles - Complementary to angle A, we will investigate ways to augment the textual input with additional cues capturing the different sources of variation, either explicitly (by paralinguistic labels or the ID of the addressee) or implicitly, by latent style embeddings (e.g. Google's style tokens) or perceptual cues (verbal and co-verbal behavior of the addressee). The challenge here is to foster controllability and interpretability of these additional input cues.
This PhD studentship is supported by THERADIA and offers full funding at the overseas rate for 3 years plus a generous package of funding for travel, compute infrastructure, and experimental costs. Several PhD students are working in the field of speech synthesis in the GIPSA/CRISSP team. This work also falls within the "Collaborative Intelligent Systems" of the Grenoble AI institute. The PhD work will be co-supervised by a researcher who has a strong background in speech processing and expressive synthesis. The successful candidate will thus work in a rich ecosystem.
Constraints and risks
Because we process French speech data and work in a multilingual environment, good knowledge of French and English is required.
This work is part of a work package of the THERADIA project, and intermediate reports will have to be written.
A first seq-to-seq text-to-speech system for French is already available and constitutes a solid basis for the transfer learning challenge. Risks are limited and fallback solutions are easy to implement.