PhD Position F/M: Construction of a Generic Multilingual, Multi-Speaker Articulatory Model Using Real-Time MRI Data

Contract type: Fixed-term contract

Required degree level: Master's degree (Bac + 5) or equivalent

Position: PhD student

Desired experience level: Recent graduate

Context and advantages of the position

Background

Speech production requires control of the movements of the articulators (jaw, tongue, lips, etc.), which modify the shape of the vocal tract and consequently its acoustic properties, including its resonance frequencies.

When learning to speak or acquiring a second language, speakers learn to move and control their articulators to produce the sounds of their language. Articulatory synthesis mimics this process by using the temporal evolution of the vocal tract shape and source parameters as input. Articulatory synthesis has several advantages: it can explain the articulatory origin of phonetic contrasts, manipulate the movement of articulators (or even block one to simulate a speech impairment), adapt to a new speaker by modifying the size and shape of the articulators, and, finally, reconstruct the vocal tract shape from the speech signal.

Compared to other synthesis approaches that offer high quality, the main advantage of articulatory synthesis is therefore its ability to control the entire speech production process.

Generating the geometric shape of the vocal tract at every moment of synthesis is the central focus of articulatory synthesis. It most often relies on the use of an articulatory model [1, 2] that determines the shape of the vocal tract using a small number of parameters. This model is almost exclusively constructed either from geometric primitives or from a small number of MRI images of a single speaker and, consequently, for a single language. Recently, we developed a model that uses a large number of dynamic MRI images of a speaker to generate the shape of the vocal tract based on a sequence of phonemes to be articulated [3].
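To make "a small number of parameters" concrete, here is a minimal sketch of one common way such a model can be built: a linear (PCA-style) decomposition of contour coordinates. It is only an illustration under stated assumptions and is not the method of [1, 2] or [3]. The contours are synthetic, and the function names (`fit_linear_model`, `encode`, `decode`) are hypothetical.

```python
import numpy as np

def fit_linear_model(contours, n_params):
    """Fit a linear shape model to contours of shape (n_frames, n_points, 2)."""
    X = contours.reshape(contours.shape[0], -1)   # one flattened contour per row
    mean = X.mean(axis=0)
    # SVD of the centered data yields the principal deformation modes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_params]                    # basis: (n_params, 2 * n_points)

def encode(contour, mean, basis):
    """Project one contour onto the model: contour -> parameter vector."""
    return basis @ (contour.reshape(-1) - mean)

def decode(params, mean, basis):
    """Rebuild a contour from its parameters: parameter vector -> (n_points, 2)."""
    return (mean + params @ basis).reshape(-1, 2)

# Synthetic data (assumption): contours driven by 3 hidden factors, no noise.
rng = np.random.default_rng(0)
n_frames, n_points, n_latent = 200, 40, 3
rest_shape = rng.normal(size=2 * n_points)
modes = rng.normal(size=(n_latent, 2 * n_points))
latent = rng.normal(size=(n_frames, n_latent))
contours = (rest_shape + latent @ modes).reshape(n_frames, n_points, 2)

mean, basis = fit_linear_model(contours, n_params=3)
params = encode(contours[0], mean, basis)
recon = decode(params, mean, basis)
error = float(np.abs(recon - contours[0]).max())
```

In this toy setting three parameters reconstruct each contour almost exactly, because the synthetic data were generated from three factors. Real vocal tract contours would need more care, for example per-articulator models and contact handling.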

The objective of this thesis is to construct a generic model that is independent of both the speaker and the language.

Assignment

Work

To this end, we are currently collecting data covering approximately thirty speakers and languages using a real-time acquisition system (at 50 frames per second) for two-dimensional sagittal MRI scans of the vocal tract, as part of a collaboration with the IADI laboratory (INSERM U1254) at the Nancy University Hospital.

These images of the mid-sagittal plane of the vocal tract are of very high quality, making it possible to track the contours of the articulators. Despite the excellent results we have obtained [4], we aim to improve tracking of the tongue tip, which has a significant acoustic impact, for example by using the nnU-Net approach [5].

The thesis work will be divided into two main parts:

  1. The construction of a generic articulatory model,
  2. The adaptation of this model to a new language and a new speaker.

For the first part, the work will rely on an anatomical normalization of the speakers in order to construct an articulator control model that accounts for anatomical differences.
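The normalization method itself is not specified here. As a deliberately simple stand-in, the sketch below aligns one speaker's anatomical landmarks to a reference speaker with a similarity transform (translation, uniform scale, rotation), i.e. a Procrustes alignment. The landmark coordinates and the function name `similarity_align` are hypothetical; an actual anatomical normalization would probably need non-rigid components as well.

```python
import numpy as np

def similarity_align(source, target):
    """Align source landmarks (n, 2) to target landmarks (n, 2).

    Returns the source points after the least-squares optimal
    translation, uniform scaling and rotation (no reflection).
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    S, T = source - mu_s, target - mu_t
    # SVD of the cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    U, D, Vt = np.linalg.svd(S.T @ T)
    d = np.sign(np.linalg.det(U @ Vt))      # -1 if the raw solution is a reflection
    R = U @ np.diag([1.0, d]) @ Vt
    scale = (D * np.array([1.0, d])).sum() / (S ** 2).sum()
    return scale * S @ R + mu_t

# Hypothetical landmarks for a reference speaker and a second speaker whose
# anatomy is a scaled, rotated and shifted copy of the reference.
rng = np.random.default_rng(1)
speaker = rng.normal(size=(6, 2))
angle = 0.3
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
reference = 1.2 * speaker @ rotation + np.array([5.0, -2.0])

aligned = similarity_align(speaker, reference)
max_error = float(np.abs(aligned - reference).max())
```

Once contours from all speakers are expressed in a common reference frame like this, a single articulatory model can in principle be fitted to the pooled data.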

For the second part, the work will consist of representing the places of articulation and, when necessary, the articulatory movements (for affricates and diphthongs, for example) associated with the phonemes of a new language within the framework of the generic articulatory model.

References

[1] B. J. Kröger, V. Graf-Borttscheller, A. Lowit. (2008). Two- and Three-Dimensional Visual Articulatory Models for Pronunciation Training and for Treatment of Speech Disorders. Proc. of Interspeech 2008, Brisbane, Australia.

[2] Y. Laprie, J. Busset. (2011). Construction and evaluation of an articulatory model of the vocal tract. In: 19th European Signal Processing Conference (EUSIPCO-2011), Barcelona, Spain.

[3] Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie. Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated. Speech Communication, 2022, 141, pp.1-13. ⟨10.1016/j.specom.2022.04.004⟩. ⟨hal-03650212⟩

[4] Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Jacques Felblinger, Pierre-André Vuissoz, et al. Automatic segmentation of vocal tract articulators in real-time magnetic resonance imaging. Computer Methods and Programs in Biomedicine, 243 (2), ⟨10.1016/j.cmpb.2023.107907⟩. ⟨hal-04376938⟩

[5] Isensee, F., Jaeger, P.F., Kohl, S.A.A. et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18, 203–211 (2021). https://doi.org/10.1038/s41592-020-01008-z

[6] Karyna Isaieva, Yves Laprie, Justine Leclère, Ioannis K Douros, Jacques Felblinger, et al. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Scientific Data, 2021, 8 (1), pp.258. ⟨10.1038/s41597-021-01041-3⟩. ⟨hal-03507532⟩

[7] Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie. Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data. Interspeech 2025, Aug 2025, Rotterdam, Netherlands. pp.978-982, ⟨10.21437/Interspeech.2025-963⟩. ⟨hal-05293831⟩

Main activities

Research Environment

The doctoral student will have access to real-time MRI databases already acquired as part of the ANR ArtSpeech project (approximately 10 minutes of speech from 10 speakers [6]) and the Full3DTalkingHead project (approximately 6 hours of speech from 2 speakers), as well as several other projects. The IADI and Loria laboratories are among the most advanced in the field of real-time MRI data acquisition and analysis, with recent work on acoustic-to-articulatory inversion of speech signals [7]. The PhD student will also be able to participate in the ongoing data collection for several languages using the MRI system available at the IADI laboratory.

The scientific environments of the two teams are highly complementary, with strong expertise in all areas of MRI and anatomy within the IADI laboratory and in deep learning within the Loria MultiSpeech team. The two teams are geographically close (1.5 km apart). The PhD student will have access to both laboratories and to the technical resources (a computer and access to computing clusters) needed to work under excellent conditions. A progress meeting will be held weekly, and each team organizes a weekly scientific seminar. The PhD student will also have the opportunity to participate in one or two summer schools and in conferences on MRI and automatic speech processing, and will receive assistance in writing conference papers and journal articles.

Funding for this doctoral project has already been secured through the ANR ArtAny project.

Supervisor: Yves Laprie (Loria) 

Skills

The candidate must have a strong background in deep learning, applied mathematics, and computer science. Knowledge of speech processing and magnetic resonance imaging (MRI) is also desirable.

Keywords

articulatory synthesis, real-time MRI, articulatory modeling, deep learning, computer science, speech processing, applied mathematics

Languages

French and/or English

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

€2,300 gross per month