PhD Position F/M Multilingual speech synthesis, with application to regional languages
Contract type : Fixed-term contract
Level of qualifications required : Graduate degree or equivalent
Function : PhD Position
Context
This PhD is part of the Inria COLaF Challenge "Corpora and Tools for the Languages of France", which aims to create open, inclusive corpora, models and software for the languages of France. These include regional languages (Alsatian, Breton, Corsican, Occitan, Picard, etc.), overseas languages (Creoles, Polynesian, Kanak, Mahorese languages, etc.), and non-territorial immigrant languages (dialectal Arabic, Western Armenian, Berber, Judeo-Spanish, Romani, Yiddish).
The PhD student will be co-supervised by Vincent Colotte, Pascale Erhart, and Emmanuel Vincent. They will benefit from the expertise of the Multispeech team in speech processing and that of LiLPa in dialectology, corpus phonetics and NLP. They will collaborate with the engineers responsible for the creation and distribution of corpora and software building blocks and with other project partners.
Assignment
Speech synthesis is a key technology for promoting regional and immigrant languages. However, these languages remain largely ignored by language technology providers [1], who traditionally train text-to-speech systems on high-quality monolingual datasets recorded in a studio by a small number of professional voice actors. This approach incurs high costs for each language and limits the number of voices and their expressiveness.
The objective of this PhD is to design a multilingual, multi-speaker speech synthesis system applicable to regional languages. Among existing multilingual speech synthesis systems, IMS Toucan [2] is the only one that covers more than 7,000 languages. It combines the transphone multilingual phonetizer [3], the PanPhon articulatory encoder [4], the FastSpeech 2 synthesizer [5] conditioned on speaker and language embeddings, and the HiFi-GAN vocoder [6], and is trained on a corpus of 18,000 hours of speech in 462 languages. Two challenges remain: reducing voice choppiness and reducing phonetization errors, both of which are more acute for low-resourced languages unseen at training time.
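To make the pipeline concrete, the sketch below shows how a sentence could be turned into the articulatory feature vectors that condition the synthesizer, using the transphone and PanPhon packages cited below. It is only an illustration under stated assumptions: the Alemannic language code "gsw" for Alsatian, its availability in transphone, and the sample sentence are assumptions, and the exact calls should be checked against the packages' documentation.

    # Illustrative sketch (not the project's code): from graphemes to the
    # articulatory feature vectors that condition a synthesizer such as
    # FastSpeech 2. Assumes the documented entry points of the `transphone`
    # and `panphon` packages; the code "gsw" for Alsatian is an assumption
    # and may not be supported.
    from transphone import read_tokenizer   # multilingual phonetizer [3]
    import panphon                           # IPA -> articulatory features [4]

    g2p = read_tokenizer("gsw")              # grapheme-to-phoneme model (assumed language code)
    ft = panphon.FeatureTable()

    sample_text = "e scheeni Sproch"         # placeholder sentence; orthography to be validated
    ipa_segments = g2p.tokenize(sample_text) # list of IPA phones for the input text
    # One vector of articulatory features (+1/0/-1 per feature) per segment;
    # sharing this representation across languages is what enables transfer
    # to languages unseen at training time.
    features = [ft.word_to_vector_list(seg, numeric=True) for seg in ipa_segments]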
[1] DGLFLF, Rapport au Parlement sur la langue française 2023, https://www.culture.gouv.fr/Media/Presse/Rapport-au-Parlement-sur-la-langue-francaise-2023
[2] F. Lux, S. Meyer, L. Behringer, F. Zalkow, P. Do, M. Coler, E.A.P. Habets, N.T. Vu, “Meta learning text-to-speech synthesis in over 7000 languages”, in Interspeech, 2024, pp. 4958-4962.
[3] X. Li, F. Metze, D. Mortensen, S. Watanabe, A. Black, “Zero-shot learning for grapheme to phoneme conversion with language ensemble”, in Findings of ACL, 2022, pp. 2106-2115.
[4] D.R. Mortensen, P. Littell, A. Bharadwaj, K. Goyal, C. Dyer, L.S. Levin, “PanPhon: A resource for mapping IPA segments to articulatory feature vectors”, in 26th International Conference on Computational Linguistics (COLING), 2016, pp. 3475-3484.
[5] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech”, in 9th International Conference on Learning Representations (ICLR), 2021.
[6] J. Kong, J. Kim, J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis”, in NeurIPS, 2020, pp. 17022-17033.
Main activities
To reduce choppiness, we will exploit available voice recordings for the regional and immigrant languages under consideration, as well as for other phonetically and/or morphologically related languages. These recordings come from open or private archives (radio, television, web, etc.) and have not always been produced and transcribed with a quality suitable for speech synthesis. We will rely on audio quality estimation [7] and transcription quality estimation [8] systems to automatically select high-quality data.
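As an illustration of this selection step, the sketch below keeps only the utterances whose estimated audio quality and transcription confidence pass given thresholds. The scoring functions and threshold values are hypothetical placeholders, not the systems of [7, 8]; any non-intrusive quality predictor and ASR-based transcription quality estimator could be plugged in.

    # Illustrative sketch of the data selection idea: filter a raw corpus so
    # that only utterances predicted to be clean enough for TTS training are
    # kept. `quality_fn` and `confidence_fn` stand for a non-intrusive audio
    # quality predictor [7] and an ASR-based transcription quality estimator
    # [8]; the thresholds below are arbitrary placeholders.
    def select_for_tts(utterances, quality_fn, confidence_fn,
                       min_quality=3.5, min_confidence=0.9):
        """Return the (wav_path, transcript) pairs that pass both thresholds.

        utterances    -- iterable of (wav_path, transcript) pairs
        quality_fn    -- wav_path -> predicted quality score (e.g. MOS-like, 1-5)
        confidence_fn -- (wav_path, transcript) -> estimated transcript reliability in [0, 1]
        """
        return [(wav, text) for wav, text in utterances
                if quality_fn(wav) >= min_quality
                and confidence_fn(wav, text) >= min_confidence]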
To improve phonetization, we will use available phonological and phonetic knowledge in addition to this speech data, paying particular attention to code-switching and pronunciation variability. An active learning method enabling the iterative correction of pronunciations will be considered.
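The sketch below outlines one possible form of such a loop: at each round, the words whose automatic pronunciations the grapheme-to-phoneme model is least confident about are sent to an expert, and the validated entries are added to the pronunciation lexicon before the model is updated. All interfaces are hypothetical and given only to fix ideas.

    # Illustrative active-learning loop for pronunciation correction.
    # `g2p_with_confidence`, `ask_expert` and `retrain_g2p` are hypothetical
    # callables, not an existing API; `budget` is the number of words an
    # expert can review per round.
    def active_learning_round(words, lexicon, g2p_with_confidence,
                              ask_expert, retrain_g2p, budget=50):
        # Score every out-of-lexicon word: (word, predicted pronunciation, confidence).
        candidates = [(w, *g2p_with_confidence(w)) for w in words if w not in lexicon]
        # Least confident first: these corrections are the most informative.
        candidates.sort(key=lambda item: item[2])
        for word, pron, _conf in candidates[:budget]:
            lexicon[word] = ask_expert(word, pron)   # expert validates or corrects
        return retrain_g2p(lexicon)                  # updated G2P model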
The approach developed will be validated on Alsatian, the second most spoken regional language in France yet still a low-resourced language [9], and extended to other languages of France depending on the skills and interests of the candidate. The research work will rely on the datasets collected by the engineers of the COLaF Challenge.
[7] S. Ogun, V. Colotte, E. Vincent, “Can we use Common Voice to train a Multi-Speaker TTS system?”, in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 900-905.
[8] K. Fan, J. Wang, B. Li, S. Zhang, B. Chen, N. Ge, Z. Yan, “Neural zero-inflated quality estimation model for automatic speech recognition system”, in Interspeech, 2020, pp. 606-610.
[9] D. Bernhard, A.-L. Ligozat, M. Bras, F. Martin, M. Vergez-Couret, P. Erhart, J. Sibille, A. Todirascu, P. Boula de Mareüil, D. Huck, “Collecting and annotating corpora for three under-resourced languages of France: Methodological issues”, Language Documentation & Conservation, 2021, 15, pp. 316-357.
Skills
MSc degree in speech processing, NLP, machine learning, computational linguistics, or in a related field.
Strong programming skills in Python/PyTorch.
Prior experience with speech processing or NLP is an asset.
Knowledge of a French regional, overseas or non-territorial language is a plus.
Benefits package
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Remuneration
2,100 € gross/month during the 1st year
General Information
- Theme/Domain : Language, Speech and Audio
- Town/city : Villers-lès-Nancy
- Inria Center : Centre Inria de l'Université de Lorraine
- Starting date : 2025-01-01
- Duration of contract : 3 years
- Deadline to apply : 2024-12-06
Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
Instructions to apply
Defence Security :
This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.
Recruitment Policy :
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Inria Team : MULTISPEECH
- PhD Supervisor : Emmanuel Vincent / emmanuel.vincent@inria.fr
About Inria
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.