PhD Position F/M: Language- and speaker-independent generic articulatory model of the vocal tract
Contract type : Fixed-term contract
Level of qualifications required : Graduate degree or equivalent
Function : PhD Position
Context
Current methods for multilingual acoustic speech synthesis [1] rely on static phoneme representations derived from phonological databases. Although these representations embed the phonemes of all languages in a single space, which makes it possible to merge acoustic databases and synthesize speech for under-resourced languages, they do not capture the temporal dynamics of the vocal tract underlying the anticipation and coarticulation phenomena of natural speech. These phenomena [2] are essential for the realization of phonetic contrasts. Moreover, articulatory gestures depend on individual anatomy (the shape of the hard palate, for instance) and require millimetric precision to guarantee the expected acoustic properties.
This PhD offer is provided by the ENACT AI Cluster and its partners. Find all ENACT PhD offers and actions on https://cluster-ia-enact.ai/.
Objective
This project aims to synthesize the temporal evolution of the vocal tract for any language and any speaker. It falls within the field of articulatory synthesis, which seeks to model and simulate the physical process of human speech production.
The work will make use of real-time MRI databases [3], which provide images of the evolving geometric shape of the vocal tract in the midsagittal plane at 50 Hz, a frame rate sufficient to capture articulator gestures during speech production. We have data for around twenty speakers in several languages covering different places of articulation.
The task will be to build a dynamic model of the vocal tract that can be adapted to a specific language and speaker from these data.
Assignment
Work
The work will involve three stages:
(i) anatomical registration of the real-time MRI data, with the aim of representing all gestures in a single anatomical reference frame.
(ii) construction of a generic articulatory model merging the dynamics of the languages and speakers in the database used.
(iii) adaptation of the generic model to a language not included in the original database.
The aim of the first stage is to merge the dynamic data, which requires anatomical registration. This first step is based on the search for visible and robustly identifiable anatomical points on the MRI images. Of the numerous registration techniques available, we prefer those that explicitly identify anatomical points, so that we can link an anatomical transformation to the articulators concerned.
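As a purely illustrative sketch of what landmark-based registration can look like (in Python/NumPy, assuming corresponding anatomical points, e.g. on the hard palate and the pharyngeal wall, have already been annotated for each speaker; the function names and the choice of a similarity transform are placeholders, not the method to be developed in the thesis):

import numpy as np

def estimate_similarity_transform(src_landmarks, dst_landmarks):
    # Least-squares similarity transform (scale, rotation, translation)
    # mapping one speaker's anatomical points onto the reference anatomy
    # (Umeyama / Procrustes solution).
    src_mean, dst_mean = src_landmarks.mean(axis=0), dst_landmarks.mean(axis=0)
    src_c, dst_c = src_landmarks - src_mean, dst_landmarks - dst_mean
    U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
    d = np.sign(np.linalg.det(U @ Vt))            # guard against reflections
    D = np.diag([1.0] * (len(S) - 1) + [d])
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (src_c ** 2).sum()
    t = dst_mean - scale * R @ src_mean
    return scale, R, t

def to_common_space(contour_points, scale, R, t):
    # Map vocal-tract contour points into the common anatomical space.
    return scale * contour_points @ R.T + t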
The second step is to develop a generic dynamic model capable of taking all places of articulation into account. The model we built previously [4] used discrete phonetic labels, which limits it to languages whose places of articulation correspond exactly to the phonemes of the database language. To obtain a generic model, we need to move to a continuous coding covering the entire vocal tract. One of the difficulties to be solved is providing a description fine enough to cover not only the places of articulation, but also the degrees of constriction and probably the tongue shape and lip rounding.
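One possible, deliberately simplified form such a model could take (a PyTorch sketch, assuming continuous frame-level articulatory features and contour targets extracted from the MRI; the feature set, dimensions and architecture are placeholders, not the model described in [4]):

import torch
import torch.nn as nn

class ArticulatoryDecoder(nn.Module):
    # Maps a frame-by-frame continuous articulatory description (place of
    # articulation, constriction degree, lip rounding, ...) to vocal-tract
    # contour coordinates at the 50 Hz frame rate of the MRI data.
    def __init__(self, n_features=8, n_contour_points=200, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2 * n_contour_points)  # (x, y) per point

    def forward(self, features):           # features: (batch, frames, n_features)
        h, _ = self.rnn(features)
        return self.head(h)                # (batch, frames, 2 * n_contour_points)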
The third step will be to adapt the generic model to a specific language, described by its places of articulation, and to a specific speaker, described by anatomical points. The resulting model can be used in conjunction with multilingual acoustic synthesis, or as input for acoustic simulations.
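One conceivable adaptation strategy, shown only as an illustration and reusing the ArticulatoryDecoder sketch above: freeze the generic model and estimate a low-dimensional speaker/language conditioning vector from a small amount of target data (adaptation_loader is a hypothetical DataLoader, and additive conditioning is an arbitrary choice):

generic_model = ArticulatoryDecoder()
for p in generic_model.parameters():
    p.requires_grad = False                     # keep the generic model fixed

adaptation_code = torch.zeros(1, 1, 8, requires_grad=True)   # placeholder speaker/language vector
optimizer = torch.optim.Adam([adaptation_code], lr=1e-3)

for features, contours in adaptation_loader:    # hypothetical DataLoader over target-language data
    pred = generic_model(features + adaptation_code)         # simplest possible conditioning
    loss = nn.functional.mse_loss(pred, contours)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()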
References
[1] Do, P., Coler, M., Dijkstra, J. and Klabbers, E. 2023. Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection. 12th ISCA Speech Synthesis Workshop (SSW2023) (2023), 21–26.
[2] Farnetani, E. and Recasens, D. 2010. Coarticulation and Connected Speech Processes. The Handbook of Phonetic Sciences: Second Edition. 316–352.
[3] Isaieva, K., Laprie, Y., Leclère, J., Douros, I., Felblinger, J. and Vuissoz, P.-A. 2021. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Scientific Data. 8, (2021).
[4] Ribeiro, V., Isaieva, K., Leclere, J., Vuissoz, P.-A. and Laprie, Y. 2022. Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated. Speech Communication. 141, (Apr. 2022), 1–13.
Skills
Technical skills and level required :
The applicant should have a solid background in deep learning, applied mathematics and computer science. Knowledge of speech processing and MRI processing will also be appreciated.
Languages : English
Benefits package
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Remuneration
€2200 gross/month
General Information
- Theme/Domain : Language, Speech and Audio / Scientific computing (BAP E)
- Town/city : Villers-lès-Nancy
- Inria Center : Centre Inria de l'Université de Lorraine
- Starting date : 2025-10-01
- Duration of contract : 3 years
- Deadline to apply : 2025-03-26
Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
Instructions to apply
Defence Security :
This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.
Recruitment Policy :
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Inria Team : MULTISPEECH
- PhD Supervisor : Laprie Yves / yves.laprie@loria.fr
About Inria
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.