PhD Position F/M Multilingual speech synthesis, with application to regional languages

Contract type: Fixed-term contract

Level of qualifications required: Graduate degree or equivalent

Role: PhD Position

Context

This PhD is part of the Inria COLaF Challenge "Corpora and Tools for the Languages of France", which aims to create open, inclusive corpora, models and software for the languages of France. These include regional languages (Alsatian, Breton, Corsican, Occitan, Picard, etc.), overseas languages (Creoles, Polynesian, Kanak, Mahorese languages, etc.), and non-territorial immigrant languages (dialectal Arabic, Western Armenian, Berber, Judeo-Spanish, Romani, Yiddish).

The PhD student will be co-supervised by Vincent Colotte, Pascale Erhart, and Emmanuel Vincent. They will benefit from the expertise of the Multispeech team in speech processing and that of LiLPa in dialectology, corpus phonetics and NLP. They will collaborate with the engineers responsible for the creation and distribution of corpora and software building blocks, and with other project partners.

Assignment

Speech synthesis is a key technology for promoting regional and immigrant languages. However, these languages remain largely ignored by language technology vendors [1], who traditionally train text-to-speech systems on high-quality monolingual datasets recorded in a studio by a small number of professional voice actors. This approach incurs high costs for each language and limits the number of voices and their expressiveness.

The objective of this PhD is to design a multilingual, multi-speaker speech synthesis system applicable to regional languages. Among existing multilingual speech synthesis systems, IMS Toucan [2] is the only one that covers more than 7,000 languages. It combines the Transphone multilingual phonetizer [3], the PanPhon articulatory encoder [4], the FastSpeech 2 synthesizer [5] conditioned on speaker and language embeddings, and the HiFi-GAN vocoder [6], and is trained on a corpus of 18,000 hours of speech in 462 languages. Two challenges remain: reducing voice choppiness and phonetization errors, both of which are more acute for low-resourced languages unseen at training time.
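To make the conditioning scheme concrete, here is a minimal toy sketch of how a FastSpeech 2-style synthesizer can be conditioned on speaker and language identity: a fixed-size speaker embedding and a language embedding are added to each frame of the phoneme encoder output before the decoder. All names and vectors below are illustrative stand-ins for the real learned parameters, not code from IMS Toucan.

```python
# Toy sketch of speaker/language conditioning in a FastSpeech 2-style
# synthesizer: one learned embedding per speaker and per language is
# added to every frame of the phoneme encoder's hidden sequence.
# Integer toy values stand in for learned floating-point parameters.

def condition(encoder_out, speaker_emb, language_emb):
    """Add the speaker and language embeddings to every encoder frame."""
    return [[h + s + l for h, s, l in zip(frame, speaker_emb, language_emb)]
            for frame in encoder_out]

encoder_out = [[1, 2], [3, 4]]   # 2 frames, hidden size 2 (toy values)
speaker_emb = [10, 0]            # one vector per speaker
language_emb = [0, 100]          # one vector per language

print(condition(encoder_out, speaker_emb, language_emb))
# → [[11, 102], [13, 104]]
```

Because the embeddings are inputs rather than architectural changes, a new speaker or language only requires estimating a new embedding vector, which is what enables transfer to languages unseen at training time.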

[1] DGLFLF, Rapport au Parlement sur la langue française 2023, https://www.culture.gouv.fr/Media/Presse/Rapport-au-Parlement-sur-la-langue-francaise-2023
[2] F. Lux, S. Meyer, L. Behringer, F. Zalkow, P. Do, M. Coler, E.A.P. Habets, N.T. Vu, “Meta learning text-to-speech synthesis in over 7000 languages”, in Interspeech, 2024, pp. 4958-4962.
[3] X. Li, F. Metze, D. Mortensen, S. Watanabe, A. Black, “Zero-shot learning for grapheme to phoneme conversion with language ensemble”, in Findings of ACL, 2022, pp. 2106-2115.
[4] D.R. Mortensen, P. Littell, A. Bharadwaj, K. Goyal, C. Dyer, L.S. Levin, “PanPhon: A resource for mapping IPA segments to articulatory feature vectors”, in 26th International Conference on Computational Linguistics (COLING), 2016, pp. 3475-3484.
[5] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech”, in 9th International Conference on Learning Representations (ICLR), 2021.
[6] J. Kong, J. Kim, J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis”, in NeurIPS, 2020, pp. 17022-17033.

Main activities

To reduce choppiness, we will exploit available voice recordings for the regional and immigrant languages under consideration, as well as for other phonetically and/or morphologically related languages. These recordings come from open or private archives (radio, television, web, etc.) and have not always been produced and transcribed with a quality suitable for speech synthesis. We will rely on sound quality estimation [7] and transcription [8] systems to automatically select high-quality data.
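This selection step can be sketched as a simple filter, assuming each archive clip has been scored offline by a quality-estimation model (e.g. a predicted MOS, in the spirit of [7]) and by comparing its transcript against an ASR hypothesis (a character error rate, in the spirit of [8]). The field names and thresholds below are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal sketch of automatic data selection: keep only clips whose
# estimated audio quality (predicted MOS) and transcript reliability
# (character error rate against an ASR hypothesis) pass the thresholds.
# All scores are assumed to have been computed offline.

def select_clips(clips, min_mos=3.5, max_cer=0.10):
    """Return the clips suitable for speech synthesis training."""
    return [c for c in clips
            if c["predicted_mos"] >= min_mos and c["cer"] <= max_cer]

archive = [
    {"path": "radio_001.wav", "predicted_mos": 4.1, "cer": 0.03},
    {"path": "radio_002.wav", "predicted_mos": 2.8, "cer": 0.02},  # too noisy
    {"path": "tv_017.wav",    "predicted_mos": 4.3, "cer": 0.25},  # unreliable transcript
]

print([c["path"] for c in select_clips(archive)])  # → ['radio_001.wav']
```

In practice the thresholds would be tuned per corpus, since archive material (radio, television, web) varies widely in recording conditions.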

To improve phonetization, we will use available phonological and phonetic knowledge in addition to this speech data, with particular attention to code switching and pronunciation variability. An active learning method enabling iterative correction of pronunciations will be considered.
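The active-learning idea can be illustrated by a toy loop: at each round, the lexicon entries about which the phonetizer is least confident are sent to a human expert, and the corrected pronunciations are fed back before the next round. The confidence scores and the expert oracle below are simulated; the word list and pronunciations are purely illustrative.

```python
# Toy sketch of an active-learning round for pronunciation correction:
# select the least confident lexicon entries, have an expert correct
# them, and mark the corrected entries as trusted.

def active_learning_round(lexicon, confidences, expert, budget=2):
    """Replace the pronunciations of the `budget` least confident
    entries with expert corrections; return the corrected words."""
    uncertain = sorted(confidences, key=confidences.get)[:budget]
    for word in uncertain:
        lexicon[word] = expert(word)   # human-validated pronunciation
        confidences[word] = 1.0        # entry is now trusted
    return uncertain

lexicon = {"bredele": "b R e d @ l", "winstub": "v i n s t u b"}
confidences = {"bredele": 0.35, "winstub": 0.92}
corrected = active_learning_round(lexicon, confidences,
                                  expert=lambda w: "(expert IPA)", budget=1)
print(corrected)  # → ['bredele']
```

Restricting each round to a small budget keeps the annotation cost per iteration bounded, which matters when expert time for a low-resourced language is scarce.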

The approach developed will be validated on Alsatian, the regional language with the second largest number of speakers in France, yet still a low-resourced language [9], and extended to other languages of France depending on the skills and wishes of the candidate. The research work will be based on the datasets collected by the engineers of the COLaF Challenge.

[7] S. Ogun, V. Colotte, E. Vincent, “Can we use Common Voice to train a Multi-Speaker TTS system?”, in IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 900-905.
[8] K. Fan, J. Wang, B. Li, S. Zhang, B. Chen, N. Ge, Z. Yan, “Neural zero-inflated quality estimation model for automatic speech recognition system”, in Interspeech, 2020, pp. 606-610.
[9] D. Bernhard, A.-L. Ligozat, M. Bras, F. Martin, M. Vergez-Couret, P. Erhart, J. Sibille, A. Todirascu, P. Boula de Mareüil, D. Huck, “Collecting and annotating corpora for three under-resourced languages of France: Methodological issues”, Language Documentation & Conservation, 2021, 15, pp.316-357.

Skills

  • MSc degree in speech processing, NLP, machine learning, computational linguistics, or a related field
  • Strong programming skills in Python/PyTorch
  • Prior experience with speech processing or NLP is an asset
  • Knowledge of a French regional, overseas or non-territorial language is a plus

Benefits package

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

2100 € gross/month during the 1st year