Contract type : Public service fixed-term contract
Level of qualifications required : PhD or equivalent
Function : Post-Doctoral Research Visit
Level of experience : Recently graduated
MultiSpeech, INRIA Nancy Grand-Est,
Yves Laprie (Yves.Laprie@loria.fr)
Our long-term objective is to achieve articulatory synthesis of speech, i.e. the generation of the acoustic signal by simulating the production of the speech signal by a human being.
In order to keep this problem tractable, we do not consider the bio-mechanical phenomena involved in the movement of speech articulators (jaw, tongue, lips, soft palate, larynx and epiglottis). Indeed, the number of muscles involved, their complex organization, the lack of maturity of numerical models applied to muscles and the lack of data keep numerical simulations too far from real speech.
We thus only consider the temporal geometry of the vocal tract, the aero-acoustic phenomena, and the vocal fold activity. The advantage is that there exist minimally invasive measuring devices that allow access to the shape of the vocal tract (Magnetic Resonance Imaging) and the activity of the vocal folds (ElectroPhotoGlottoGraphy).
The vocal tract shape, and especially its temporal evolution, has to be modeled so as to provide the numerical acoustic simulations with the relevant geometry at each time point of the synthesis. The shape changes according to the positions of the speech articulators over time. The articulators move continuously, and the speaker must anticipate the positions to be reached in order to produce the desired sounds.
A speech sound is thus not produced independently of the surrounding sounds. Coarticulation covers the influence of the surrounding sounds on the current sound to be articulated. It should be noted that an articulator that is not critical for the production of a sound, i.e. has no acoustic impact, can anticipate its position for the coming sounds. For instance, during the production of /ipu/ the tongue is not recruited by the production of /p/ and thus can anticipate the position required by /u/ well before the acoustic onset of the vowel.
The quantitative prediction of coarticulation effects is a challenging task. One of the first numerical models was proposed by Öhman [1] and consists of superimposing the effect of the consonants onto the trajectories followed by the articulators between two consecutive vowels. Despite its simplicity, this model is still used for its ease of implementation and relatively good results.
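To make the superposition idea concrete, here is a minimal sketch of the form in which Öhman's model is commonly written: each articulatory parameter follows a vowel-to-vowel trajectory v(x, t), and the consonant pulls it toward its own target c(x) with a time-varying emphasis k(t) scaled by a per-parameter coarticulation weight w(x). The cosine easing, the Gaussian emphasis, the function names and the numerical values are illustrative assumptions, not taken from this project.

```python
import math

def vowel_transition(v1, v2, t):
    """Vowel-to-vowel shape at normalized time t in [0, 1].
    Cosine easing is an assumption made for this sketch."""
    a = 0.5 - 0.5 * math.cos(math.pi * t)
    return [(1.0 - a) * p + a * q for p, q in zip(v1, v2)]

def consonant_emphasis(t, center=0.5, width=0.15):
    """k(t): near 0 far from the consonant, 1 at its centre.
    The Gaussian shape is an assumption made for this sketch."""
    return math.exp(-((t - center) / width) ** 2)

def ohman_shape(v1, v2, c, w, t):
    """s(x, t) = v(x, t) + k(t) * w(x) * (c(x) - v(x, t)) for each parameter x."""
    v = vowel_transition(v1, v2, t)
    k = consonant_emphasis(t)
    return [vi + k * wi * (ci - vi) for vi, ci, wi in zip(v, c, w)]

# Hypothetical two-parameter shapes: [tongue-tip height, lip protrusion]
v_i = [0.3, 0.0]   # first vowel
v_u = [0.2, 1.0]   # second vowel
c_t = [1.0, 0.5]   # consonant target
w_t = [1.0, 0.0]   # the consonant fully constrains the tongue tip, not the lips

mid = ohman_shape(v_i, v_u, c_t, w_t, 0.5)
```

At the centre of the consonant, the fully constrained parameter reaches the consonant target, while the unconstrained one simply continues along the vowel-to-vowel trajectory, which is the anticipation behaviour described above.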
The overlapping of coordinated gestures corresponding to critical articulatory variables (for example the glottal aperture, labial protrusion and aperture, the place and degree of constriction of the tongue tip or body…) is a key element of articulatory phonology. Attempts to calculate gestures from speech and articulatory data [2] are always based on simplifying assumptions so strong that they severely limit the scope of the results.
The approach proposed by Cohen and Massaro [3] relies on the idea of finding the influence domain and the coarticulatory effects of each phoneme. These two sets of parameters are trained from a corpus for each phoneme and articulatory parameter. The main weakness of learning coarticulatory effects independently for each articulator is that nothing guarantees the overall consistency required to achieve the correct acoustic target.
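As an illustration of the influence-domain idea, the sketch below uses the dominance-function form usually associated with the Cohen and Massaro model: each phoneme has, for a given articulatory parameter, a target and a dominance that decays exponentially with temporal distance from the phoneme centre, and the parameter value is the dominance-weighted average of the targets. The dictionary layout, the function names and all numerical values are assumptions chosen for the example, not trained parameters.

```python
import math

def dominance(t, center, alpha, theta_before, theta_after, c=1.0):
    """Dominance of one phoneme at time t: alpha * exp(-theta * |t - center|**c).
    Separate decay rates before and after the centre shape the influence domain."""
    tau = t - center
    theta = theta_before if tau < 0 else theta_after
    return alpha * math.exp(-theta * abs(tau) ** c)

def blended_parameter(t, segments):
    """Dominance-weighted average of the segment targets for one parameter.
    Assumes at least one segment has non-zero dominance at time t."""
    weights = [
        dominance(t, s["center"], s["alpha"], s["theta_before"], s["theta_after"])
        for s in segments
    ]
    total = sum(weights)
    return sum(w * s["target"] for w, s in zip(weights, segments)) / total

# Hypothetical lip-protrusion targets for two successive phonemes (times in seconds)
segments = [
    {"target": 0.0, "center": 0.0, "alpha": 1.0, "theta_before": 5.0, "theta_after": 5.0},
    {"target": 1.0, "center": 1.0, "alpha": 1.0, "theta_before": 5.0, "theta_after": 5.0},
]

halfway = blended_parameter(0.5, segments)
```

Because each parameter is blended on its own, nothing in this scheme couples the articulators to one another, which is the weakness pointed out above.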
Acoustic-to-articulatory inversion for recovering the geometrical position of a small set of flesh points from the acoustic signal [4] also incorporates some non-explicit coarticulation modeling. Deep learning methods are now widely used to tackle this problem; for training, they require large corpora of ElectroMagnetoArticulography data, which record the positions of sensors glued onto the articulators. However, only “easily accessible” articulators (because sensors have to be glued) are considered. The vocal tract is therefore not taken into account in its entirety, and these approaches cannot incorporate a true aero-acoustic dimension.
The objective of this work is to train a coarticulation model that covers all the articulators and guarantees that the target sounds can be generated.
Since this year, the IADI laboratory, with which we have been collaborating for many years, has been equipped with a real-time MRI data acquisition system (at 50 Hz) that allows us to monitor the evolution of the midsagittal shape of the vocal tract during speech production.
This represents a considerable asset for studying and modeling coarticulation for several speakers.
The work proposed consists of exploiting these data and is organized in two stages.
The first will consist of tracking articulators in MRI data. Unlike several approaches which process the complete vocal tract as a single object, we want to track each articulator independently because, even if their movements are coordinated, they are not necessarily synchronized. Connecting all the articulator contours into one general contour from the glottis to the lips prevents coarticulation from being studied at the level of individual articulators. We have already drawn articulatory contours in about a thousand images, and the preliminary tests we carried out show that this enables fairly good results for the tongue. The objective is to implement a deep-learning auto-encoding approach which, in a first step, learns the image and the associated contour for the images with outlined contours, and, in a second step, retrains the first hidden layers without the contours so as to enable the reconstruction of the contours, and thus tracking, without their prior knowledge [4,5,6].
The second stage will be devoted to the modeling of coarticulation via deep learning techniques by identifying the role of each articulator in order to integrate the phenomena of acoustic compensation between articulators.
[1] S.E.G. Öhman. Numerical model of coarticulation. J. Acoust. Soc. Am., 41:310–320, 1967.
[2] H. Nam, V. Mitra, M. Hasegawa-Johnson, C. Espy-Wilson, E. Saltzman, and L. Goldstein. A procedure for estimating gestural scores from speech acoustics. J. Acoust. Soc. Am., 132(6):3980–3989, 2012.
[3] M.M. Cohen and D.W. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, Springer, 1993.
[4] B. Uria, S. Renals, and K. Richmond. A deep neural network for acoustic-articulatory speech inversion. Proceedings NIPS, 2011.
[5] A. Jaumard-Hakoun, K. Xu, P. Roussel, G. Dreyfus, M. Stone, and B. Denby. Tongue contour extraction from ultrasound images based on deep neural network. Proc. of International Congress of Phonetic Sciences, Glasgow, 2015.
[6] I. Fasel and J. Berry. Deep belief networks for real-time extraction of tongue contours from ultrasound during speech. Proc. of 20th ICPR, Istanbul, 2010.
[7] G. Litjens, T. Kooi et al. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
PhD in computer science or acoustics. Knowledge about speech processing and speech production is a decisive plus.
French or English.
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Salary: 2653€ gross/month
- Town/city : Villers-lès-Nancy
- Inria Center : CRI Nancy - Grand Est
- Starting date : 2019-09-01
- Duration of contract : 1 year, 4 months
- Deadline to apply : 2019-05-05
- Inria Team : MULTISPEECH
Laprie Yves / Yves.Laprie@loria.fr
The keys to success
June 6th, 2018 (Midnight Paris time)
How to apply
Upload your file on jobs.inria.fr in a single pdf or zip file, and send it as well by email to Yves.Laprie@loria.fr. Your file should contain the following documents:
- CV including a description of your research activities (2 pages max) and a short description of what you consider to be your best contributions and why (1 page max and 3 contributions max); the contributions could be theoretical or practical. Web links to the contributions should be provided. Include also a brief description of your scientific and career projects, and your scientific positioning regarding the proposed subject.
- The report(s) from your PhD external reviewer(s), if applicable.
- If you haven't defended yet, the list of expected members of your PhD committee (if known) and the expected date of defense (the defense, not the manuscript submission).
In addition, at least one recommendation letter from your PhD advisor should be sent directly by its author to Yves.Laprie@loria.fr.
Applications are to be sent as soon as possible.
Inria, the French national research institute for the digital sciences, promotes scientific excellence and technology transfer to maximise its impact. It employs 2,400 people. Its 200 agile project teams, generally with academic partners, involve more than 3,000 scientists in meeting the challenges of computer science and mathematics, often at the interface of other disciplines. Inria works with many companies and has assisted in the creation of over 160 startups. It strives to meet the challenges of the digital transformation of science, society and the economy.
Instructions to apply
Defence Security :
This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.
Recruitment Policy :
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.