PhD Position F/M Task-Specific and Linguistically Motivated Evaluation for Multilingual NLP
Contract type: Fixed-term contract
Level of qualifications required: Graduate degree or equivalent
Function: PhD position
Context and advantages of the position
This PhD subject is published as part of the PR[AI]RIE-PSAI 2026 PhD call for funding. An applicant will be pre-selected by the supervisors by the 22nd May at the latest; a second selection phase by a committee within the PR[AI]RIE-PSAI institute will then carry out the final selection based on the pre-selected applicant’s profile and the PhD subject. As stated in the funding call, the selection results will be published in two phases between the 30th May and mid-June.
Non-discrimination, openness, and transparency: All PR[AI]RIE-PSAI partners are committed to supporting and promoting equality, diversity, and inclusion within their communities. We encourage applications from diverse backgrounds, which we will carefully select through an open and transparent recruitment process.
Supervisors: Rachel Bawden and Benoît Sagot (Inria, ALMAnaCH project-team), <firstname.lastname@inria.fr>
PhD start date (if funding allocated): 1st September 2026–30th November 2026
Final deadline for applications: 17 May 2026 at 1pm CEST
Assignment
Context
The field of Natural Language Processing (NLP) is undergoing rapid change. Many of those changes result from decades of progress in the modelling of language, from the widespread use of neural methods (Bengio et al., 2003) to the increasing use of context when developing textual representations (Peters et al., 2018; Devlin et al., 2019), the introduction of adapted architectures for text generation (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) and the training of language models on huge amounts of heterogeneous data, including in multiple languages (Raffel et al., 2019; Brown et al., 2020; Touvron et al., 2023). The result is that NLP tools, in particular multi-task conversational large language models (LLMs) such as GPT (Brown et al., 2020) and open-source equivalents such as the Qwen3 family (Qwen Team, 2025), have been readily adopted by the general public for a wide range of tasks, including those that have traditionally been the main focus of NLP research, such as translation, summarisation, style transfer and question answering. NLP research itself has changed considerably, with a main focus now being on the use of LLMs, which excel in terms of fluency, rather than smaller, dedicated models and methods.
Human evaluation remains the gold standard (as long as it is carried out correctly) for monitoring progress in the field, comparing model performance, and assessing what challenges remain, but automatic evaluation is highly important for practical purposes, as it can be reproducible and is less costly in terms of money and time. However, evaluation has always been a major challenge across NLP tasks and, with the influence of adjacent fields such as machine learning, the evaluation crisis continues: benchmarks are often treated as black boxes, with little analysis of inputs and outputs (including the types of errors being made), a lack of consideration of the challenges that are specific to particular tasks, and the problem of data contamination, particularly for LLMs (Sainz et al., 2023).
This has multiple consequences. Firstly, there is a real risk that results, and therefore the conclusions drawn from them, are misleading. For example, Peter et al. (2025) showed that wrong conclusions have previously been drawn about performance gaps between high- and low-resource languages, whereas the actual cause lay in quality issues with the benchmark data itself. Secondly, it means that real progress is likely to be limited because areas of improvement are overlooked. Finally, it can lead to wider problems such as an increase in hype and a lack of understanding of the weaknesses of such models, which is particularly important given the widespread use of LLMs by the general public, who may blindly trust outputs.
There is therefore an overwhelming need for research into evaluation of models, focusing on specific tasks and the challenges they pose: evaluation methods, benchmarks and analysis.
PhD subject: research aims and directions
The aim of this PhD will be to improve methods and benchmarks for the evaluation of NLP models, covering a range of languages, both high- and low-resource, and challenging scenarios such as those involving high levels of variation and non-standardness, with a focus on particular tasks such as machine translation (MT) and diachronic transfer (e.g. normalisation and translating texts between different periods of a language). The topic builds on expertise and several past and ongoing research projects within the team (processing of non-standard, historical and dialectal language data).
There will be three main axes to the initial topic, with opportunities to expand beyond these directions: (i) diagnosing linguistic gaps in benchmarks, (ii) creating controlled benchmarks to stress-test models, particularly with respect to their robustness and (iii) developing new task-specific evaluation methods, both in terms of defining evaluation dimensions and training automatic metrics.
The first axis will involve the development of methods using linguistic knowledge about particular languages to uncover gaps in current benchmarks with respect to particular linguistic structures (e.g. sentence types, complex structures, morphological inflections). The aim is two-fold: (i) to develop diagnostic tools to assess how well data covers the complexity of particular languages, to be used either with existing benchmarks or when designing new datasets, and (ii) to fill gaps with additional data, including synthetically created data, to enable better coverage as well as evaluation of particular types of phenomena, for instance by using approaches such as that of Zebaze et al. (2025).
The second axis will involve the creation of controlled benchmarks that target particularly challenging examples in order to compare models on identified weak points (Bawden et al., 2018; Futeral et al., 2023). Stress-testing will involve the identification of challenging phenomena, which inevitably evolve as models progress, a step for which automation is both desirable and itself challenging. Fitting in with the current research interests of the team, several directions can be explored here, among which is the robustness of models to variation: non-standard user-generated data (Bawden and Sagot, 2023, 2025), dialectal data (Sagot et al., 2025) and degrees of formality.
Finally, the third axis focuses on the development of evaluation methods that take into account the particularities and challenges of specific tasks. Just as text simplification is characterised by several dimensions, including grammaticality, meaning preservation and degree of simplicity, which itself is specific to a particular target audience (Martin et al., 2018), many NLP tasks comprise several dimensions that can be evaluated separately. We will focus on the tasks of MT, particularly for low-resource languages, and style and diachronic transfer, for instance translating between French from different periods in history, fitting in with ongoing and future projects within the team. Different evaluation methods will be tested, including well-designed heuristics (where appropriate, particularly in scenarios to which state-of-the-art methods are poorly adapted), fine-tuning of pre-existing metrics and LLM-as-a-judge approaches.
References
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA.
- Rachel Bawden and Benoît Sagot. 2023. RoCS-MT: Robustness challenge set for machine translation. In Proceedings of the Eighth Conference on Machine Translation, pages 198–216, Singapore. Association for Computational Linguistics.
- Rachel Bawden and Benoît Sagot. 2025. RoCS-MT v2 at WMT 2025: Robust challenge set for machine translation. In Proceedings of the Tenth Conference on Machine Translation, pages 834–849, Suzhou, China. Association for Computational Linguistics.
- Rachel Bawden, Rico Sennrich, Alexandra Birch, et al. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.
- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, et al. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
- Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, pages 1877–1901. Curran Associates, Inc.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Matthieu Futeral, Cordelia Schmid, Ivan Laptev, et al. 2023. Tackling ambiguity with images: Improved multimodal machine translation and contrastive evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5394–5413, Toronto, Canada. Association for Computational Linguistics.
- Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, et al. 2018. Reference-less quality estimation of text simplification systems. In Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA), pages 29–38, Tilburg, the Netherlands. Association for Computational Linguistics.
- Jan-Thorsten Peter, David Vilar, Tobias Domhan, et al. 2025. Mind the gap… or not? how translation errors and evaluation details skew multilingual results. CoRR, abs/2511.05162.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, et al. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Qwen Team. 2025. Qwen3 technical report. CoRR, abs/2505.09388.
- Colin Raffel, Noam Shazeer, Adam Roberts, et al. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Benoît Sagot, Slim Ouni, Sam Bigeard, et al. 2025. COLaF : Corpus et outils pour les langues de France et variétés de français. In Actes de la session industrielle de CORIA-TALN 2025, pages 33–47, Marseille, France. ATALA & ARIA.
- Oscar Sainz, Jon Campos, Iker García-Ferrero, et al. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
- Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Armel Randy Zebaze, Benoît Sagot, and Rachel Bawden. 2025. TopXGen: Topic-diverse parallel data generation for low-resource machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22358–22381, Suzhou, China. Association for Computational Linguistics.
Main activities
The successful candidate will be required to carry out research on the above topic.
This will involve becoming familiar with the related work on the topic, understanding the research challenges and proposing novel solutions. These solutions will be validated by experimental results. The candidate will be required to communicate these results through peer-reviewed publications and oral presentations (both within Inria and internationally), as well as in the final thesis.
Skills
Applicant profiles: We are looking for applicants with:
- a master’s degree (or equivalent) in computer science, machine learning, natural language processing or computational linguistics
- a strong interest in language and linguistics (please do specify languages you speak)
- expertise in deep learning (familiarity with existing codebases is a plus)
Applicants should be rigorous, show initiative and creativity, and have a good eye for the analysis of data and results. A good level of English is required (written and spoken).
Required documents:
- an up-to-date CV
- a 1-page letter of motivation describing the relevance of your application with respect to the PhD subject
- a copy of the last degrees obtained and grades.
Applications (in English or in French) should be sent via this recruitment platform only.
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
Remuneration
Monthly gross salary: €1982 during the first and second years, €2085 during the last year.
General information
- Theme/Domain: Language, speech and audio
- City: Paris
- Inria centre: Centre Inria de Paris
- Desired starting date: 2026-09-01
- Contract duration: 3 years
- Application deadline: 2026-05-17
Warning: Applications must be submitted online via the Inria website. Processing of applications sent through other channels is not guaranteed.
Instructions for applying
Defence security:
This position is likely to be located in a restricted-access zone (ZRR), as defined in Decree No. 2011-1425 on the protection of the nation's scientific and technical potential (PPST). Authorisation to access such a zone is granted by the head of the institution, following a favourable ministerial opinion, as defined in the order of 3 July 2012 relating to the PPST. An unfavourable ministerial opinion for a position located in a ZRR would result in the cancellation of the recruitment.
Recruitment policy:
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Inria team: ALMANACH
- PhD supervisor: Bawden Rachel / rachel.bawden@inria.fr
About Inria
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on a wide range of talent in more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. The institute thus strives to respond to the challenges of the digital transformation of science, society and the economy.