PhD Position F/M LLM-based code generation for controlling artificial agents in simulated environments

The description of the offer below is in English.

Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac + 5) or equivalent

Position: PhD student

Context and assets of the position

Program synthesis (Chaudhuri et al., 2021) has traditionally been considered for various programming tasks, but has rarely been studied for synthesizing controllers, i.e. programs controlling artificial agents in simulated environments. Such controllers are usually studied under the Reinforcement Learning (RL) paradigm in AI (Sutton & Barto, 2018), where an agent learns an action policy from experience in order to maximize cumulative reward in a simulated environment. We believe that the recent rise of Large Language Models (LLMs) opens important perspectives and novel directions for controller synthesis. The successes of Copilot and ChatGPT show that LLMs can provide help and assistance in many programming tasks, both saving time and reducing errors, for instance through bug finding and code suggestions (Chen et al., 2021). Moreover, LLMs are increasingly used in RL, in particular in LLM-augmented RL, where LLMs can interpret users' intents and translate them into concrete rewards to be integrated into an RL algorithm (Fan et al., 2022). A few recent papers have proposed using LLMs to generate AI controllers, reward functions and tasks (Faldor et al., 2024; Lehman et al., 2022; Wang et al., 2023), opening many exciting perspectives where the code generation abilities of LLMs are leveraged to synthesize AI agents and their environments.

In this project, we propose to explore novel research directions for using LLMs to synthesize controllers in unknown and complex environments. How can a controller program be generated from a natural language description of the environment dynamics and task properties? Are LLMs able to generalize to the generation of controllers that combine skills from previously discovered controllers? How can adaptive controllers be generated, e.g. in the form of code specifying neural architectures using standard deep learning libraries such as PyTorch? Can LLMs be used to generate the morphology of embodied agents? Can LLMs help to disentangle semantic vs. procedural knowledge in the context of controller synthesis? In particular, we believe that the question of combining skills is the key to scaling these techniques to large environments and complex tasks.
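
To fix ideas, an adaptive "controller as code" produced by an LLM could be as simple as the following PyTorch sketch; the observation and action sizes are illustrative assumptions, not part of the project specification.

    import torch
    import torch.nn as nn

    class GeneratedPolicy(nn.Module):
        """Example of the kind of artifact an LLM could synthesize: a small
        learnable policy mapping observations to actions."""
        def __init__(self, obs_dim=4, act_dim=2, hidden=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
            )

        def forward(self, obs):
            return torch.tanh(self.net(obs))  # actions bounded in [-1, 1]

    policy = GeneratedPolicy()
    print(policy(torch.randn(1, 4)))  # one batch of random observations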

Assigned mission

Several methodologies have been identified to address the above research questions. Among existing approaches for adapting/evolving LLMs, fine-tuning with Reinforcement Learning from Human Feedback (RLHF, Ouyang et al., 2022) was made famous by ChatGPT, but promising alternatives such as mutation operators (Lehman et al., 2022; Wang et al., 2023) and prompt evolution (Fernando et al., 2023; Guo et al., 2024; Wan et al., 2024) have also been proposed. This will enable us to explore complementary strategies: (i) fixing the prompt and adapting the LLM, or (ii) evolving the prompt with a fixed LLM; as well as interactive vs. autonomous adaptation. We will also consider combining such approaches with non-LLM methods such as automatic program repair. Moreover, sorting knowledge into different components, e.g. semantic vs. procedural knowledge, could help evolutionary methods become more efficient and compositional, and decouple the evolution of such representations. Indeed, our brains represent "semantic" and "procedural" knowledge in different ways: knowing that mugs, lamps or books are inanimate objects (generally described as "facts") is different in nature from procedural knowledge (e.g. knowing how to bike or swim, or how to sequence action or instruction primitives). Enabling LLMs to store these two kinds of knowledge separately will help to obtain more disentangled and compositional representations, and offer more interpretable representations.
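
As a concrete illustration of strategy (ii), the sketch below outlines a minimal prompt-evolution loop. The functions query_llm, mutate_prompt and evaluate_controller are placeholder stubs standing in for an actual LLM call, a mutation operator and a simulator rollout; they are assumptions for illustration, not project components.

    import random

    def query_llm(prompt):
        # Placeholder: a real implementation would call a fixed LLM here.
        return "def controller(obs):\n    return [0.0, 0.0]"

    def mutate_prompt(prompt, rng):
        # Placeholder mutation operator: append one of a few rephrasing hints.
        hints = ["Be concise.", "Favor simple control laws.", "Reuse previously found skills."]
        return prompt + " " + rng.choice(hints)

    def evaluate_controller(code):
        # Placeholder fitness: a real implementation would execute the code in a simulator.
        return -len(code)

    def evolve_prompts(seed_prompt, generations=10, population=8, seed=0):
        rng = random.Random(seed)
        pool = [seed_prompt] * population
        for _ in range(generations):
            # Score every prompt by the fitness of the controller it elicits, keep the best half,
            # and refill the population with mutated copies of the survivors.
            scored = sorted(((evaluate_controller(query_llm(p)), p) for p in pool), reverse=True)
            parents = [p for _, p in scored[: population // 2]]
            children = [mutate_prompt(rng.choice(parents), rng) for _ in range(population - len(parents))]
            pool = parents + children
        return pool[0]  # best prompt found so far

    best_prompt = evolve_prompts("Write a Python controller(obs) for a two-wheeled robot that avoids obstacles.")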

To initialize models, we will bootstrap them with supervised learning, for instance by providing a first database of problems and solutions as prompts, or by fine-tuning an existing LLM on it. The evaluation of found solutions will be performed through a Python interpreter or through various feedback signals (e.g. success or failure of code execution), with a special emphasis on the ability to generalize the acquired knowledge to novel environments and tasks that are compositions of the already solved ones. Datasets and environments such as the ones proposed in stable-baselines and evogym will be relevant baselines. We will also consider data augmentation, for instance by mutating existing problems and solutions and evaluating them.
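
As a sketch of this execution-feedback loop, assuming that candidate solutions are plain Python source strings exposing a controller function (an illustrative convention), generated code can be executed and its errors turned into textual feedback for the next attempt:

    import traceback

    def run_candidate(code_str, entry_point="controller", test_input=(0.0, 0.0)):
        """Execute a generated code string and return (success, feedback).
        The entry-point name and the test input are illustrative assumptions."""
        namespace = {}
        try:
            exec(code_str, namespace)                 # may raise SyntaxError or runtime errors
            result = namespace[entry_point](test_input)
            return True, "ok, returned {!r}".format(result)
        except Exception:
            return False, traceback.format_exc()      # feedback that can be passed back to the LLM

    ok, feedback = run_candidate("def controller(obs):\n    return [0.0, 0.0]")
    print(ok, feedback)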

Main activities

The project will be structured along several milestones of increasing complexity, although this proposed plan can be adapted according to the student's interests. In each milestone, we plan to first bootstrap the model with a database of problem descriptions and known solutions, using it either to prompt the LLM (if the database is sufficiently small) or to fine-tune it with supervised learning or RLHF (e.g. if the database is too large to fit in a prompt). Then we will evaluate how the LLM can generalize to novel problems, in particular environments and tasks that are compositions of the already solved ones.
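
When the database is small enough to fit in context, the bootstrapping step can be as simple as assembling a few-shot prompt from the stored (problem, solution) pairs, as in the sketch below; the instruction wording and database format are illustrative assumptions.

    def build_prompt(database, new_problem):
        """Assemble a few-shot prompt from a small database of (problem, solution) pairs."""
        parts = ["You write controller programs for simulated agents."]
        for problem, solution in database:
            parts.append("Problem:\n" + problem + "\nSolution:\n" + solution)
        parts.append("Problem:\n" + new_problem + "\nSolution:")
        return "\n\n".join(parts)

    database = [
        ("Drive towards the light source.", "def controller(obs):\n    return obs[1], obs[0]"),
    ]
    print(build_prompt(database, "Avoid the light source."))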

In the first stage, we will study the synthesis of reactive controllers on simple navigation tasks, for example in the form of Braitenberg vehicles (Braitenberg, 1984). The database will consist of a few examples of simple problems and solutions from existing Braitenberg vehicle implementations, which will be sufficiently short to input as a prompt. 
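
For reference, a reactive controller of this kind is only a few lines long; the sketch below shows a Braitenberg vehicle 2b ("aggression"), where crossed excitatory sensor-motor connections make the vehicle turn towards the stimulus (the gain value is an illustrative assumption).

    def braitenberg_aggressor(left_sensor, right_sensor, gain=1.0):
        """Braitenberg vehicle 2b: crossed excitatory connections, so the vehicle
        steers towards the stimulus. Sensor readings are assumed to lie in [0, 1]."""
        left_motor = gain * right_sensor
        right_motor = gain * left_sensor
        return left_motor, right_motor

    # A stimulus stronger on the right drives the left wheel harder,
    # turning the vehicle towards the stimulus.
    print(braitenberg_aggressor(left_sensor=0.2, right_sensor=0.9))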

In the second stage, we will study the synthesis of adaptive controllers, i.e. controllers that can learn from experience (e.g. in the form of learnable decision trees or neural networks). To this end, we will bootstrap the model with a database of descriptions of RL environments and known RL algorithms to solve them, e.g. from the library and documentation of stable-baselines. Since this will be too large to input as a prompt, we will instead use supervised learning or RLHF to fine-tune the LLM on these examples.
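
To make such a (description, solution) pair concrete, assuming the stable-baselines3 flavour of the library and the Gymnasium CartPole-v1 environment, a stored solution could look like the following snippet, paired with a short textual description of the environment:

    from stable_baselines3 import PPO

    # Paired description (stored alongside the code in the database), e.g.:
    # "CartPole-v1: balance a pole on a cart by applying left/right forces."
    model = PPO("MlpPolicy", "CartPole-v1", verbose=0)  # the environment is built from its id
    model.learn(total_timesteps=10_000)
    model.save("ppo_cartpole")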

Finally, in the third stage, we will study the synthesis of meta-learners, i.e. adaptive controllers that can perform well on a wide distribution of environments, as usually studied in the field of meta reinforcement learning (Duan et al., 2016). A possible direction here will be to explore how to train a Transformer model that takes as input the sequence of the agent's observations, actions and rewards in a given environment, infers properties of the task at hand, and outputs them in natural language form. This natural language description will then be used as input to the LLM in order to generate the controller or to adapt an existing one.
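
A minimal PyTorch sketch of such a trajectory encoder is given below; the dimensions, the pooling choice and the absence of the language head (which would decode the pooled embedding into text) are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TrajectoryEncoder(nn.Module):
        """Encode a sequence of (observation, action, reward) steps into a task embedding
        that a language head could later decode into a natural-language task description."""
        def __init__(self, obs_dim, act_dim, d_model=64, nhead=4, num_layers=2):
            super().__init__()
            self.step_proj = nn.Linear(obs_dim + act_dim + 1, d_model)  # +1 for the scalar reward
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, obs, act, rew):
            # obs: (B, T, obs_dim), act: (B, T, act_dim), rew: (B, T)
            steps = torch.cat([obs, act, rew.unsqueeze(-1)], dim=-1)
            hidden = self.encoder(self.step_proj(steps))
            return hidden.mean(dim=1)  # (B, d_model) pooled task embedding

    encoder = TrajectoryEncoder(obs_dim=4, act_dim=2)
    embedding = encoder(torch.randn(8, 50, 4), torch.randn(8, 50, 2), torch.randn(8, 50))
    print(embedding.shape)  # torch.Size([8, 64])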


In addition, in each of the three proposed stages, we will in parallel study the ability to synthesize agents' morphologies as JSON text files that can be interpreted in a 3D simulator such as MuJoCo. This will allow the study of body-controller co-evolution, which is an important problem in both AI and Artificial Life (Bhatia et al., 2021). We believe that LLM-assisted program synthesis could lead to important breakthroughs in this domain.
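
As an illustration, a generated morphology could be expressed with a schema along the following lines (a hypothetical format: MuJoCo's native description language is MJCF/XML, so such JSON would be translated into simulator-specific markup downstream).

    import json

    # Hypothetical JSON morphology schema, for illustration only.
    morphology = {
        "bodies": [
            {"name": "torso", "shape": "capsule", "size": [0.05, 0.20]},
            {"name": "left_leg", "shape": "capsule", "size": [0.03, 0.15], "parent": "torso"},
        ],
        "joints": [
            {"name": "hip_left", "type": "hinge", "parent": "torso", "child": "left_leg", "range": [-45, 45]},
        ],
    }
    print(json.dumps(morphology, indent=2))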

Take this work plan mostly as a suggestion at this point. If you have your own ideas about alternative methodologies or objectives, we will be glad to discuss them.

References

Bhatia, J., Jackson, H., Tian, Y., Xu, J., & Matusik, W. (2021). Evolution Gym: A Large-Scale Benchmark for Evolving Soft Robots. Advances in Neural Information Processing Systems, 34, 2201–2214. https://papers.nips.cc/paper/2021/hash/118921efba23fc329e6560b27861f0c2-Abstract.html

Braitenberg, V. (1984). Vehicles: Experiments in Synthetic Psychology. MIT Press.

Chaudhuri, S., Ellis, K., Polozov, O., Singh, R., Solar-Lezama, A., & Yue, Y. (2021). Neurosymbolic Programming. Foundations and Trends® in Programming Languages, 7(3), 158–243. https://doi.org/10.1561/2500000049

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., … Zaremba, W. (2021). Evaluating Large Language Models Trained on Code (arXiv:2107.03374). arXiv. https://doi.org/10.48550/arXiv.2107.03374

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016). RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (arXiv:1611.02779). arXiv. http://arxiv.org/abs/1611.02779

Faldor, M., Zhang, J., Cully, A., & Clune, J. (2024). OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code (arXiv:2405.15568). arXiv. https://doi.org/10.48550/arXiv.2405.15568

Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.-A., Zhu, Y., & Anandkumar, A. (2022). MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (arXiv:2206.08853). arXiv. https://doi.org/10.48550/arXiv.2206.08853

Fernando, C., Banarse, D., Michalewski, H., Osindero, S., & Rocktäschel, T. (2023). Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (arXiv:2309.16797). arXiv. https://doi.org/10.48550/arXiv.2309.16797

Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., & Yang, Y. (2024). Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (arXiv:2309.08532). arXiv. https://doi.org/10.48550/arXiv.2309.08532

Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., & Stanley, K. O. (2022). Evolution through Large Models (arXiv:2206.08896; Version 1). arXiv. https://doi.org/10.48550/arXiv.2206.08896

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (second edition). Bradford Books.

Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., Chowdhury, M., & Zhang, M. (2024). Efficient Large Language Models: A Survey (arXiv:2312.03863). arXiv. https://doi.org/10.48550/arXiv.2312.03863

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models (arXiv:2305.16291). arXiv. https://doi.org/10.48550/arXiv.2305.16291



Skills

  • Excellent programming skills, preferably in Python. Experience with Pytorch or JAX.
  • Prior experience with foundation models, deep reinforcement learning and data analysis.
  • Strong interest in implementing artificial agents able to acquire an open-ended repertoire of skills.
  • Prior experience with running large-scale experiments on CPU/GPU clusters is a plus.
  • Fluent English

Benefits

  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Possibility of teleworking and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

2100€ / month (before taxes) during the first 2 years,
2190€ / month (before taxes) during the third year.