PhD Position F/M Task-Specific and Linguistically Motivated Evaluation for Multilingual NLP

The job description below is in English

Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac + 5) or equivalent

Role: PhD student

Context and advantages of the position

This PhD subject is published as part of the PR[AI]RIE-PSAI 2026 PhD call for funding. An applicant will be pre-selected by the supervisors by the 22nd May at the latest; a second selection phase by a committee within the PR[AI]RIE-PSAI institute will then carry out the final selection based on the pre-selected applicant’s profile and the PhD subject. As stated in the funding call, the selection results will be published in two phases between the 30th May and mid-June.

Non-discrimination, openness, and transparency: All PR[AI]RIE-PSAI partners are committed to supporting and promoting equality, diversity, and inclusion within their communities. We encourage applications from diverse backgrounds, and applicants will be carefully selected through an open and transparent recruitment process.

Supervisors: Rachel Bawden and Benoît Sagot (Inria, ALMAnaCH project-team), <firstname.lastname@inria.fr>

PhD start date (if funding allocated): between 1st September 2026 and 30th November 2026

Final deadline for applications: 17 May 2026 at 1pm CEST

Assignment

Context

The field of Natural Language Processing (NLP) is undergoing rapid change. Many of those changes result from decades of progress in the modelling of language, from the widespread use of neural methods (Bengio et al., 2003) to the increasing use of context when developing textual representations (Peters et al., 2018; Devlin et al., 2019), the introduction of adapted architectures for text generation (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) and the training of language models on huge amounts of heterogeneous data, including in multiple languages (Raffel et al., 2019; Brown et al., 2020; Touvron et al., 2023). The result is that NLP tools, in particular multi-task conversational large language models (LLMs) such as GPT (Brown et al., 2020) and open-source equivalents such as the Qwen3 family (Qwen Team, 2025), have been readily adopted by the general public for a wide range of tasks, including those that have traditionally been the main focus of NLP research, such as translation, summarisation, style transfer and question answering. NLP research itself has changed considerably, with a main focus now being on the use of LLMs, which excel in terms of fluency, rather than smaller, dedicated models and methods.

Human evaluation remains the gold standard (as long as it is carried out correctly) for monitoring progress in the field, comparing model performance, and assessing what challenges remain, but automatic evaluation is highly important for practical purposes, as it can be reproducible and less costly in terms of money and time. However, evaluation has always been a major challenge across NLP tasks and, with the influence of adjacent fields such as machine learning, the evaluation crisis continues: benchmarks are often treated as black boxes, with little analysis of inputs and outputs (including the types of errors being made), a lack of consideration of the challenges that are specific to particular tasks, and the problem of data contamination, particularly for LLMs (Sainz et al., 2023).

This has multiple consequences. Firstly, there is a real risk that results, and therefore the conclusions drawn from them, are misleading. For example, Peter et al. (2025) showed that wrong conclusions have previously been drawn about performance gaps between high- and low-resource languages, whereas the actual cause was quality issues in the benchmark data itself. Secondly, it means that real progress is likely to be limited because areas of improvement are overlooked. Finally, it can lead to wider problems such as an increase in hype and a lack of understanding of the weaknesses of such models, which is particularly important given the widespread use of LLMs by the general public, who may blindly trust outputs.

There is therefore an overwhelming need for research into evaluation of models, focusing on specific tasks and the challenges they pose: evaluation methods, benchmarks and analysis.

 

PhD subject: research aims and directions

The aim of this PhD will be to improve methods and benchmarks for the evaluation of NLP models, covering a range of languages, both high- and low-resource, and challenging scenarios such as those involving high levels of variation and non-standardness, with a focus on particular tasks such as machine translation (MT) and diachronic transfer (e.g. normalisation and translating texts between different periods of a language). The topic builds on expertise and several past and ongoing research projects within the team (processing of non-standard, historical and dialectal language data).

There will be three main axes to the initial topic, with opportunities to expand beyond these directions: (i) diagnosing linguistic gaps in benchmarks, (ii) creating controlled benchmarks to stress-test models, particularly with respect to their robustness, and (iii) developing new task-specific evaluation methods, both in terms of defining evaluation dimensions and training automatic metrics.

The first axis will involve the development of methods using linguistic knowledge about particular languages to uncover gaps in current benchmarks with respect to particular linguistic structures (e.g. sentence types, complex structures, morphological inflections). The aim is two-fold: (i) to develop diagnostic tools to assess how well data covers the complexity of particular languages, to be used either with existing benchmarks or when designing new datasets, and (ii) to fill gaps with additional data, including synthetically created data, to enable better coverage but also evaluation of particular types of phenomena, for instance by using approaches such as that of Zebaze et al. (2025).

The second axis will involve the creation of controlled benchmarks that target particularly challenging examples in order to compare models on identified weak points (Bawden et al., 2018; Futeral et al., 2023). Stress-testing will involve the identification of challenging phenomena, which inevitably evolve as models progress, a step for which automation is both desirable and itself challenging. Fitting in with the current research interests of the team, several directions can be explored here, including the robustness of models to variation: non-standard user-generated data (Bawden and Sagot, 2023, 2025), dialectal data (Sagot et al., 2025) and degrees of formality.

Finally, the third axis focuses on the development of evaluation methods that take into account the particularities and challenges of specific tasks. Just as text simplification is characterised by several dimensions, including grammaticality, meaning preservation and degree of simplicity, which itself is specific to a particular target audience (Martin et al., 2018), many NLP tasks comprise several dimensions that can be evaluated separately. We will focus on the tasks of MT, particularly for low-resource languages, and style and diachronic transfer, for instance translating between French from different periods in history, fitting in with ongoing and future projects within the team. Different evaluation methods will be tested, including well-designed heuristics (where appropriate, particularly in scenarios to which state-of-the-art methods are poorly adapted), fine-tuning of pre-existing metrics and LLM-as-a-judge approaches.

 

References

  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA.
  • Rachel Bawden and Benoît Sagot. 2023. RoCS-MT: Robustness challenge set for machine translation. In Proceedings of the Eighth Conference on Machine Translation, pages 198–216, Singapore. Association for Computational Linguistics.
  • Rachel Bawden and Benoît Sagot. 2025. RoCS-MT v2 at WMT 2025: Robust challenge set for machine translation. In Proceedings of the Tenth Conference on Machine Translation, pages 834–849, Suzhou, China. Association for Computational Linguistics.
  • Rachel Bawden, Rico Sennrich, Alexandra Birch, et al. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.
  • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, et al. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, pages 1877–1901. Curran Associates, Inc.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Matthieu Futeral, Cordelia Schmid, Ivan Laptev, et al. 2023. Tackling ambiguity with images: Improved multimodal machine translation and contrastive evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5394–5413, Toronto, Canada. Association for Computational Linguistics.
  • Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, et al. 2018. Reference-less quality estimation of text simplification systems. In Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA), pages 29–38, Tilburg, the Netherlands. Association for Computational Linguistics.
  • Jan-Thorsten Peter, David Vilar, Tobias Domhan, et al. 2025. Mind the gap… or not? How translation errors and evaluation details skew multilingual results. CoRR, abs/2511.05162.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, et al. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Qwen Team. 2025. Qwen3 technical report. CoRR, abs/2505.09388.
  • Colin Raffel, Noam Shazeer, Adam Roberts, et al. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Benoît Sagot, Slim Ouni, Sam Bigeard, et al. 2025. COLaF : Corpus et outils pour les langues de France et variétés de français. In Actes de la session industrielle de CORIA-TALN 2025, pages 33–47, Marseille, France. ATALA & ARIA.
  • Oscar Sainz, Jon Campos, Iker García-Ferrero, et al. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
  • Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Armel Randy Zebaze, Benoît Sagot, and Rachel Bawden. 2025. TopXGen: Topic-diverse parallel data generation for low-resource machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22358–22381, Suzhou, China. Association for Computational Linguistics.

Main activities

The successful candidate will be required to carry out research on the above topic. 

This will involve becoming familiar with the related work on the topic, understanding the research challenges and proposing novel solutions. These solutions will be validated by experimental results. The candidate will be required to communicate these results through peer-reviewed publications and oral presentations (both within Inria and internationally), as well as in the final thesis. 

Skills

 Applicant profiles: We are looking for applicants with:

  • a master’s degree (or equivalent) in computer science, machine learning, natural language processing or computational linguistics
  • a strong interest in language and linguistics (please specify the languages you speak)
  • expertise in deep learning (familiarity with existing codebases is a plus)

Applicants should be rigorous, able to show initiative and creativity, and have a good eye for the analysis of data and results. A good level of English is required (written and spoken).

Required documents:

  • an up-to-date CV
  • a 1-page letter of motivation describing the relevance of your application with respect to the PhD subject
  • a copy of your most recent degrees and grades.

Applications (in English or in French) should be sent via this recruitment platform only.

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training

Remuneration

Monthly gross salary: 1982 € during the first and second years, and 2085 € during the final year.