PhD Position F/M Towards Explainable Foundation Models via Learned Neural Specifications

Le descriptif de l’offre ci-dessous est en Anglais

Type de contrat : CDD

Niveau de diplôme exigé : Bac + 5 ou équivalent

Fonction : Doctorant

Contexte et atouts du poste

Foundation models (large language models (LLMs) and other transformer-based architectures, in language, vision, and multimodal settings) have achieved remarkable performance across a broad range of tasks, from natural language understanding and code generation to scientific reasoning. Their widespread deployment in high-stakes settings raises urgent questions about reliability, accountability, and transparency. Despite their success, these models remain fundamentally opaque: their decisions cannot be meaningfully inspected, audited, or contested.

Current approaches to explainability fall into two categories, both with critical limitations. Post-hoc attribution methods, such as attention visualization, gradient saliency, or SHAP scores, identify influential input features but provide no formal guarantees and are known to be unstable or misleading [1, 2]. Probing methods reveal information encoded in intermediate representations, but do so in an ad-hoc, non-compositional way that does not generalize across inputs or tasks. Neither approach yields explanations that are reusable, formally verifiable, or suitable for reasoning about model behavior. Crucially, both operate at the input-output interface, rather than at the level of the model's internal representations, i.e., the structured activation patterns across layers and attention heads that mediate computation and encode decision-relevant regularities.

This thesis is based on the hypothesis that these internal representations can be abstracted into formal, logic-based specifications that are interpretable to ML researchers and formal methods experts, and amenable to machine verification. The goal is not to fully recover the model's internal computation, but to extract sound and reusable regularities that support explanation and analysis.

This thesis will be conducted within PR[AI]RIE-PSAI, in the ANTIQUE team at Inria Paris & ENS | PSL. It includes planned research visits to the group of Xujie Si (University of Toronto, Canada), a leading expert on specification learning and program analysis, fostering strong international collaboration throughout the PhD.

---
[1] J. Adebayo et al. - Sanity Checks for Saliency Maps (NeurIPS 2018).
[2] S. Jain, B. C. Wallace - Attention is not Explanation (NAACL 2019).

Mission confiée

The objective of this thesis is to develop a methodology for automatically learning formal specifications over the internal representations of foundation models. These specifications, called neural specifications, form a new class of artifacts that bridge learned representations and formal reasoning, and will serve two complementary purposes:

  • Explainability: exposing structured, formally grounded regularities that hold across sets of inputs, interpretable by ML researchers and formal verification practitioners;
  • Formal reasoning: enabling consistency checking, robustness analysis, and detection of violations on new inputs.

In this setting, a behavior q denotes a formally defined property of the model's output or execution. To bound the initial research scope, the thesis will focus first on (i) token prediction properties (e.g., subject-verb agreement) and (ii) classification outputs in encoder-based models (e.g., BERT-style tasks), before extending to (iii) semantic or compositional output properties and task-level success defined via an external oracle.

Unlike existing approaches that either analyze neurons in isolation or provide instance-level explanations, this thesis targets the learning of global, compositional specifications over structured internal representations. Unlike prior neural specification learning [4], which focuses on flat predicate sets over individual neurons in feed-forward networks, this thesis introduces structured, compositional specifications over the internal representations of transformer models.

--
[4] C. Geng et al. - Towards Reliable Neural Specifications (ICML 2023).

Principales activités

Building on [3, 4], the methodology will be developed along three research directions.

1. Attention-Head Abstractions as Units of Specification

Existing approaches operate at the level of individual neurons, which is both intractable and difficult to interpret at scale. This thesis proposes to shift to a higher-level abstraction: predicates over attention heads.

Attention heads are natural functional units of transformer architectures: fewer in number than neurons, exhibiting interpretable patterns (e.g., syntactic dependencies, long-range interactions), and offering a structured view of internal processing [5]. While not a perfect proxy, they provide an empirically grounded basis that can be validated via controlled interventions (ablations, activation patching).

The thesis will define a rich vocabulary of predicates, including unary predicates (activation or pattern of a single head), relational predicates (comparisons across heads or tokens), and compositional predicates capturing higher-level structures. To keep the predicate space tractable, locality constraints restrict relational predicates to heads within the same or adjacent layers, reducing the combinatorial space from quadratic to linear in the number of heads, while sparsity-based pruning eliminates statistically rare predicates.

Example: A learned predicate might capture that a given attention head consistently attends from a verb to its subject. A specification could then express that:

(subject-tracking head active) AND (plural noun detected) => (plural verb prediction).

2. Horn-Clause Specifications as Formal Explanations

Specifications will be expressed as Horn clauses over attention-head predicates. A clause p_1 AND ... AND p_k => q captures a regularity linking internal representation patterns to observable behavior. This representation is interpretable (expressing general rules rather than instance-specific explanations), compositional (enabling reuse and combination), and formally grounded (supporting inference and consistency checking), enabling the expression of conditional and hierarchical relationships that flat predicate sets cannot capture.

3. Scalable and Robust Specification Learning

The learning pipeline will proceed in three stages. First, predicate statistics will be collected over large datasets, capturing frequencies and co-occurrences with behaviors. Initial specifications will then be constructed by thresholding and conjunctive composition, providing a candidate baseline. These candidates will subsequently be refined using attribution-based predicate scoring, search strategies inspired by inductive logic programming, and redundancy elimination to obtain compact rule sets.

To distinguish stable, explanatory invariants from spurious statistical correlations, the thesis will incorporate held-out validation (generalization across inputs), consistency checking (logical coherence of rule sets), perturbation-based analysis (stability under input changes), and counterfactual evaluation (verifying that specifications are falsified under targeted interventions). Success will be quantified by the precision and recall of learned rules in predicting held-out behavior, fidelity, i.e., compression tradeoff curves measuring how compactly a rule set captures observed behavior, and comparison against baselines including linear probing classifiers and decision trees trained on the same representations.

--
[3] D. Gopinath et al. - Property Inference for Deep Neural Networks (ASE 2019).
[4] C. Geng et al. - Towards Reliable Neural Specifications (ICML 2023).
[5] K. Clark et al. - What Does BERT Look at? An Analysis of BERT’s Attention (BlackboxNLP 2019).

Compétences

 

Non-discrimination, openness, and transparency. All partners of PR[AI]RIE-PSAI are committed to supporting and promoting equality, diversity and inclusion within their communities. We encourage applications from candidates with diverse backgrounds, which will be selected through an open, transparent, and merit-based recruitment process (OTM).

[fr] Non-discrimination, ouverture et transparence. L'ensemble des partenaires de PR[AI]RIE-PSAI s'engagent à soutenir et promouvoir l'égalité, la diversité et l'inclusion au sein de ses communautés. Nous encourageons les candidatures issues de profils variés, que nous veillerons à sélectionner via un processus de recrutement ouvert et transparent (OTM).

Avantages

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage