PhD Position F/M Language and speaker independent generic articulatory model of the vocal tract

Contract type: Fixed-term contract

Level of qualifications required: Graduate degree or equivalent

Function: PhD Position

Context

Current methods for multilingual acoustic speech synthesis [1] rely on static phoneme representations drawn from phonological databases. Although these representations allow the phonemes of all languages to be embedded in a single space, so that acoustic databases can be merged to synthesize speech for low-resource languages, they do not capture the temporal dynamics of the vocal tract underlying the anticipation and coarticulation phenomena of natural speech. Anticipation and coarticulation [2] are essential to the realization of phonetic contrasts. Moreover, articulatory gestures depend on individual anatomy (the shape of the hard palate, for instance) and require millimetric precision to guarantee the expected acoustic properties.

This PhD offer is provided by the ENACT AI Cluster and its partners. Find all ENACT PhD offers and actions at https://cluster-ia-enact.ai/.

Objective

This project aims to synthesize the temporal evolution of the vocal tract for any language and any speaker. It falls within the field of articulatory synthesis, which seeks to model and simulate the physical process of human speech production.

The work will make use of real-time MRI databases [3], which provide images of the evolving geometric shape of the vocal tract in the midsagittal plane at a frame rate of 50 Hz. This rate is sufficient to capture articulatory gestures during speech production. We have data for around twenty speakers in several languages covering different places of articulation.

The task will be to use these data to build a dynamic model of the vocal tract that can be adapted to a specific language and speaker.


Assignment

Work

The work will involve three stages:

(i) anatomical registration of the real-time MRI data, with the aim of representing all gestures in a common anatomical reference frame.

(ii) construction of a generic articulatory model merging the dynamics of the languages and speakers in the database used.

(iii) adaptation of the generic model to a language not included in the original database.

The aim of the first stage is to merge the dynamic data across speakers, which requires anatomical registration. This step relies on finding anatomical points that are visible and robustly identifiable in the MRI images. Among the numerous registration techniques available, we prefer those that explicitly identify anatomical points, so that an anatomical transformation can be linked to the articulators concerned.
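
As an illustration of this landmark-based approach, the following sketch estimates a closed-form least-squares similarity transform (Umeyama-style) mapping one speaker's anatomical points onto a reference speaker's. The landmark coordinates are hypothetical, and a pure similarity transform is only an assumption for the example; richer (e.g., non-rigid) transforms may be needed in practice.

    import numpy as np

    def similarity_transform(src, dst):
        """Least-squares similarity transform (scale, rotation, translation)
        mapping 2-D landmarks src onto dst (Umeyama-style closed form)."""
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        src_c, dst_c = src - mu_s, dst - mu_d
        U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
        D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
        R = U @ D @ Vt                                       # rotation
        s = np.trace(np.diag(S) @ D) / (src_c ** 2).sum()    # isotropic scale
        t = mu_d - s * (R @ mu_s)                            # translation
        return s, R, t

    # Hypothetical (x, y) landmarks, in pixels, on two speakers' midsagittal
    # images (e.g., points along the hard palate), assumed already in correspondence.
    ref = np.array([[120.0, 80.0], [150.0, 70.0], [180.0, 75.0], [200.0, 95.0]])
    new = np.array([[118.0, 90.0], [149.0, 78.0], [181.0, 82.0], [203.0, 104.0]])
    s, R, t = similarity_transform(new, ref)
    registered = s * (R @ new.T).T + t  # new speaker's landmarks in the reference frame

Fitting the transform to explicitly identified landmarks keeps it interpretable: each parameter can be traced back to the anatomical structures concerned, in line with the preference stated above.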

The second step is to develop a generic dynamic model capable of taking all places of articulation into account. The model we built previously [4] used discrete phonetic labels, which limits it to languages whose places of articulation correspond exactly to the phonemes of the database language. To obtain a generic model, we need to move to a continuous coding covering the entire vocal tract. One of the difficulties to be solved is providing a description fine enough to cover not only the places of articulation but also the degrees of constriction, and probably tongue shape and lip rounding as well.
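
To make the notion of continuous coding concrete, here is a minimal, purely illustrative sketch (not the encoding the project will necessarily adopt): each frame is described not by a phoneme label but by a constriction profile over the normalized vocal-tract midline, plus scalar features for constriction degree and lip rounding. All names, dimensions and values are assumptions.

    import numpy as np

    def constriction_profile(place, degree, width=0.05, n_points=64):
        """Gaussian bump centred at `place` in [0, 1] (0 = lips, 1 = glottis);
        `degree` in [0, 1] encodes constriction narrowness (1 = full closure)."""
        x = np.linspace(0.0, 1.0, n_points)
        return degree * np.exp(-0.5 * ((x - place) / width) ** 2)

    def encode_target(place, degree, rounding):
        """Continuous feature vector a sequence model could consume at each frame."""
        return np.concatenate([constriction_profile(place, degree), [rounding]])

    # Hypothetical targets: an alveolar closure vs. a velar one, same framework.
    alveolar = encode_target(place=0.25, degree=1.0, rounding=0.0)
    velar = encode_target(place=0.60, degree=1.0, rounding=0.0)

Because place, degree and rounding vary continuously, such a code can describe articulations that fall between the phoneme inventories of the database languages.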

The third step will be to adapt the generic model to a specific language, described by its places of articulation, and to a specific speaker, described by anatomical points. The resulting model can be used in conjunction with multilingual acoustic synthesis, or as input for acoustic simulations.
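
One plausible way to realize this adaptation, sketched below in PyTorch, is to condition a shared dynamic model on a language code and a speaker code and, for a new language or speaker, fit only those codes while the generic weights stay frozen. The architecture and all dimensions are assumptions for illustration, not the method prescribed by the project.

    import torch
    import torch.nn as nn

    class GenericArticulatoryModel(nn.Module):
        def __init__(self, feat_dim=65, lang_dim=16, spk_dim=16, hidden=256, out_dim=200):
            super().__init__()
            self.rnn = nn.GRU(feat_dim + lang_dim + spk_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, out_dim)  # e.g., vocal-tract contour coordinates

        def forward(self, feats, lang_code, spk_code):
            # feats: (B, T, feat_dim); the two codes are broadcast over the T frames.
            B, T, _ = feats.shape
            cond = torch.cat([lang_code, spk_code], dim=-1).unsqueeze(1).expand(B, T, -1)
            h, _ = self.rnn(torch.cat([feats, cond], dim=-1))
            return self.head(h)

    model = GenericArticulatoryModel()
    for p in model.parameters():
        p.requires_grad_(False)                  # freeze the generic model
    new_lang = nn.Parameter(torch.zeros(1, 16))  # only the new codes are trained
    new_spk = nn.Parameter(torch.zeros(1, 16))
    optimizer = torch.optim.Adam([new_lang, new_spk], lr=1e-3)

Freezing the shared weights keeps the dynamics learned from the multi-speaker, multi-language database intact; only a low-dimensional code has to be estimated from the (possibly scarce) data available for the new language or speaker.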

References

[1]    Do, P., Coler, M., Dijkstra, J. and Klabbers, E. 2023. Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection. 12th ISCA Speech Synthesis Workshop (SSW2023) (2023), 21–26.

[2]    Farnetani, E. and Recasens, D. 2010. Coarticulation and Connected Speech Processes. The Handbook of Phonetic Sciences: Second Edition. 316–352.

[3]    Isaieva, K., Laprie, Y., Leclère, J., Douros, I., Felblinger, J. and Vuissoz, P.-A. 2021. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Scientific Data. 8, (2021).

[4]    Ribeiro, V., Isaieva, K., Leclere, J., Vuissoz, P.-A. and Laprie, Y. 2022. Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated. Speech Communication. 141, (Apr. 2022), 1–13.

Skills

Technical skills and level required:

The applicant should have a solid background in deep learning, applied mathematics and computer science. Knowledge of speech and MRI processing will also be appreciated.

Languages: English


Benefits package

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

€2200 gross/month