Master Internship - Semantic Parser for Natural Language

Contract type: Internship agreement

Level of qualifications required: Graduate degree or equivalent

Function: Research internship

Context

---
This subject is part of the PhD track programme run by the Inria centre at the University of Lorraine and its Strasbourg site. The programme aims to attract and support promising, motivated students currently enrolled in a Master 2 course towards a PhD by offering four years of combined funding covering a Master 2 internship plus a thesis. The Master 2 internship, lasting 5 to 6 months, is paid at €4.35/hour (approximately €670/month). In May 2025, candidates admitted to the programme will present the progress of their work to a jury, which will validate their entry into the PhD programme (discontinuation of the PhD track should be exceptional).
The programme, the application procedure and the timetable are described in the PhD track section of the website https://www.inria.fr/fr/centre-inria-universite-lorraine.
---
 
As part of the PhD track call from the Inria centre at the University of Lorraine, the Sémagramme team would like to recruit a student at the end of their Master's programme for a final-year internship leading to a funded thesis.
 
The overall objective of the Sémagramme project is to design and develop new unifying logic-based models, methods, and tools for the semantic analysis of natural language utterances and discourses. This includes the logical modelling of pragmatic phenomena related to discourse dynamics. Typically, these models and methods will be based on standard logical concepts (stemming from formal language theory, mathematical logic, and type theory), which should make them easy to integrate.
 
Subject
 
NLP has made great strides in recent years, producing models capable of solving complex problems. While natural language processing seems to have reached a good level of understanding, the question of semantics remains open. For example, developing task-oriented solutions requires extracting the underlying structure of an utterance in order to identify its intentions. However, producing the predicative structure of an utterance is beyond the capacity of LLMs.
 
Two main schools of thought have emerged in language semantics, one based on logical properties (Kamp and Reyle, 1993; Montague, 1970) and the other on graph representations (Banarescu et al., 2013; White et al., 2016; Abend and Rappoport, 2017; Van Gysel et al., 2021). While the former are more suitable for constructing parsers, the latter make it possible to work with a more reasonable amount of certified data, especially when machine learning strategies are used (Cheng et al., 2017; Dong and Lapata, 2018). Recently, several initiatives have been launched to design semantic representation formalisms in which the graph structures retain good logical properties (Michalon et al., 2016), with the aim of producing semantically annotated corpora, one function of which is to train parsers. At the current stage of development, the quality obtained is still below the level required for practical use.
 
In this topic, we focus on a specific type of semantic representation, the YARN framework (laYered meAning RepresentatioN; Pavlova et al., 2024), which combines a predicate-argument structure (PA structure) based on Abstract Meaning Representation (AMR) (Banarescu et al., 2013) with a layered approach to encoding other semantic phenomena. A major advantage of this representation is that it rests on a simplified representation of the argument structure, which remains more tractable for tool development, while allowing us to handle complex classical semantic phenomena, such as negation, modality, temporality and quantification, and the ways they interact with each other. Accounting for the interactions between different phenomena is a challenge that existing formalisms do not explicitly address.
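As a rough illustration of the layered idea, the sketch below encodes a core AMR-style predicate-argument graph for "The boy does not want to go" together with a separate negation layer. The triple encoding, role names and layer format are our own illustrative assumptions, not the YARN specification:

```python
# Hypothetical sketch of a layered meaning representation:
# a core predicate-argument graph (AMR-style triples) plus a
# separate layer marking phenomena such as negation.
# Concept and role names are illustrative, not the YARN inventory.

# Core PA structure for "The boy does not want to go":
# nodes are variables bound to concepts, edges are labelled roles.
concepts = {"w": "want-01", "b": "boy", "g": "go-02"}
pa_edges = [
    ("w", "ARG0", "b"),   # the boy is the wanter
    ("w", "ARG1", "g"),   # going is what is wanted
    ("g", "ARG0", "b"),   # the boy is also the goer (reentrancy)
]

# A separate layer records negation without altering the core graph.
layers = {"negation": [{"scope": "w"}]}  # negation scopes over the wanting

def arguments(node):
    """Return the labelled arguments of a predicate node."""
    return {role: tgt for src, role, tgt in pa_edges if src == node}

print(arguments("w"))  # {'ARG0': 'b', 'ARG1': 'g'}
```

Keeping phenomena such as negation in separate layers means the core PA graph stays simple, while interactions (e.g. negation scoping over a modal) can be expressed by letting layer entries point at nodes of the core graph or at each other.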
 
In the semantic parsing task, the system must predict the formal structure that represents the semantic links. Machine learning strategies have difficulty generalising to structures that are absent from the large datasets used for training. This well-recognised phenomenon has led us to consider combining several approaches that integrate representations of semantic structures into the architectures used, in particular in encoders and decoders (Petit and Corro, 2023; Petit, 2024).
 
The aim of the internship is to build a semantic parser for the YARN formalism and to evaluate it on real datasets such as GeoQuery, GraphQuestions, Spades, etc.
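Evaluation of such a parser typically compares predicted graphs to gold graphs. As a minimal sketch, the function below scores the overlap of triples between the two, a deliberate simplification of the standard Smatch metric that ignores the search for an optimal variable alignment:

```python
# Simplified sketch of graph-based evaluation for semantic parsing:
# score a predicted graph against a gold graph by triple overlap
# (a simplification of Smatch, ignoring variable alignment).

def triple_f1(predicted, gold):
    """F1 over (source, role, target) triples shared by both graphs."""
    p, g = set(predicted), set(gold)
    matched = len(p & g)
    precision = matched / len(p) if p else 0.0
    recall = matched / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")}
pred = {("w", "ARG0", "b"), ("w", "ARG1", "g")}   # parser missed one edge
print(round(triple_f1(pred, gold), 2))  # 0.8
```

The real Smatch metric must additionally search over mappings between predicted and gold variables, which makes the comparison NP-hard in general and is usually approximated by hill climbing.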
 
The ideal candidate should have a strong interest in natural language processing (especially semantics) and in machine learning (especially deep learning), as well as a solid background in formal grammars and graph algorithms. Implementation will be done in Python and OCaml.
 
 
--- 
The subject of the PhD track thesis will be a follow-up to the subject of the M2 internship and will be defined in consultation with the candidate during April 2025.
---
 
 
Bibliography
 
Omri Abend and Ari Rappoport. 2017. The state of the art in semantic representation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 77–89, Vancouver, Canada. Association for Computational Linguistics.
 
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.
 
Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and Mirella Lapata. 2017. Learning structured natural language representations for semantic parsing. arXiv preprint arXiv:1704.08387.
 
Hans Kamp and Uwe Reyle. 1993. From Discourse to Logic: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer, Dordrecht.
 
Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. arXiv preprint arXiv:1805.04793.
 
Olivier Michalon, Corentin Ribeyre, Marie Candito, and Alexis Nasr. 2016. Deeper syntax for better semantic parsing. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics, Osaka, Japan. ⟨hal-01391678⟩
 
Richard Montague. 1970. English as a formal language. Logic and philosophy for linguists.
 
Siyana Pavlova, Maxime Amblard, and Bruno Guillaume. 2024. YARN is All You Knit: Encoding Multiple Semantic Phenomena with Layers. In Proceedings of the Fifth International Workshop on Designing Meaning Representations, Turin, Italy. ⟨hal-04551796⟩
 
Alban Petit and Caio Corro. 2023. On Graph-based Reentrancy-free Semantic Parsing. Transactions of the Association for Computational Linguistics, 11:703–722.
 
Alban Petit. 2024. Structured Prediction Methods for Semantic Parsing. PhD thesis in computer science, Université Paris-Saclay. Supervised by François Yvon and Caio Corro.
 
Jens EL Van Gysel, Meagan Vigus, Jayeol Chun, Kenneth Lai, Sarah Moeller, Jiarui Yao, Tim O’Gorman, Andrew Cowell, William Croft, ChuRen Huang, et al. 2021. Designing a uniform meaning representation for natural language processing. KI-Künstliche Intelligenz, 35(3-4):343–360.
 
Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723.

Assignment

Missions:
The thesis will be supervised by Maxime Amblard, Professor at the University of Lorraine.

Collaboration:
The work will be carried out at the Inria centre at the University of Lorraine, in the Sémagramme team.

Details on the thesis supervision:
Weekly supervision of the work by the supervisors
Compulsory additional training courses organised by the Doctoral School

Main activities

Conducting state-of-the-art studies
Writing scientific articles
Implementing experiments

Skills

Technical skills and level required: Good programming skills

Languages: English and French

Interpersonal skills:
Ability to integrate into a research team and interact in a scientific environment

Master's degree in NLP, Computer Science or equivalent.

Skills:
- Lambda-calculus, type theory and logic
- Machine learning and deep learning
- Computational linguistics
- Corpora

Benefits package

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

Internship stipend: €4.35/hour (approximately €670/month)

Remuneration for the thesis: €2,100 gross/month in the first year