PhD Position F/M Explainability of multimodal neural models for the analysis of socio-emotional skills: application to the training of medical students
Contract type : Fixed-term contract
Level of qualifications required : Graduate degree or equivalent
Function : PhD Position
Context
Oral communication skills are essential in many situations and have been identified as core skills for the 21st century. They are now assessed within the curricula themselves by examinations such as the grand oral du baccalauréat in high school or objective structured clinical examinations (OSCEs) in medical studies.
This PhD is part of a partnership between Inria and the Medical School of the Université de Lorraine aiming to develop a conversational agent helping medical students to train for OSCEs. During an OSCE, a student faces a “standardized” patient and must ask a series of questions to reach the correct diagnosis. The relevance of the questions asked and the non-verbal communication are graded by an evaluator. Our long-term goal is for the conversational agent to replace both the standardized patient and the evaluator during training sessions, in order to maximize the number of situations seen by students before the actual exam.
The PhD will be co-supervised by Emmanuel Vincent and Chloé Clavel. The PhD student will be based at Inria Nancy, in order to interact with the medical students and other staff involved in the project and to benefit from the expertise of the Multispeech team on non-verbal speech attributes and gestures, and will spend 12 to 18 months at Inria Paris to leverage the complementary expertise of the Almanach team on verbal attributes, socio-emotional behaviors and explainability.
Assignment
The goals of this PhD are to characterize the communication skills of medical students in the context of OSCE training, to design new explainability approaches providing them with useful feedback, to evaluate the relevance of this feedback, and to contribute to integrating the results into an OSCE training app currently being developed.
This work will be based on a corpus of multimodal data (voice and video of the student and the standardized patient, plus questionnaires completed by the student, the standardized patient and the human evaluator after each OSCE) whose collection has begun and will continue throughout the thesis. Each week, 100 to 150 students each take two 7-minute OSCEs, amounting to around 1,000 hours per year, significantly more than existing corpora for the study of socio-emotional behaviors.
Main activities
The expression of socio-emotional skills will be analyzed by extracting the textual transcription and a set of low-level audio and visual features using standard libraries, and by deriving high-level features using a neural network trained in a self-supervised manner, by predicting the features of the next speech turn [1,2], and/or in a supervised manner, by predicting the score and comments assigned by the evaluator. This approach could be supplemented by supervised learning on other corpora annotated with socio-emotional skills, such as POM or the MT180 corpus annotated for persuasiveness [3]. The questionnaires may be gradually revised in order to collect more precise comments on the students' performance.
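To make the self-supervised pretext task concrete, the sketch below fits a predictor of the next speech turn's features from the current turn's features. Everything here is a hypothetical stand-in: the feature dimension, the synthetic "dialogue" data and the linear per-dimension predictor (the actual project would use a neural network on features extracted with standard libraries such as openSMILE or OpenFace); no human labels are involved, which is the point of the pretext task.

```python
import random

random.seed(0)

# Toy setting: each speech turn is summarized by a small feature vector
# (e.g. pitch mean, energy, speech rate); DIM and the data are hypothetical.
DIM = 3

def synthetic_dialogue(n_turns):
    """Generate turn-level features where the next turn depends on the
    current one (next ~ 0.8 * current + small noise)."""
    turns = [[random.gauss(0, 1) for _ in range(DIM)]]
    for _ in range(n_turns - 1):
        turns.append([0.8 * x + random.gauss(0, 0.1) for x in turns[-1]])
    return turns

def train_next_turn_predictor(turns, epochs=200, lr=0.05):
    """Self-supervised pretext task: predict the next turn's features from
    the current turn's features. The targets come from the data itself."""
    w = [0.0] * DIM  # one scalar weight per feature dimension
    for _ in range(epochs):
        for cur, nxt in zip(turns, turns[1:]):
            for d in range(DIM):
                err = w[d] * cur[d] - nxt[d]
                w[d] -= lr * err * cur[d]  # SGD step on the squared error
    return w

turns = synthetic_dialogue(200)
w = train_next_turn_predictor(turns)
print([round(x, 2) for x in w])  # weights should approach the true coefficient 0.8
```

In the project, the hidden representations of such a predictor (rather than its weights) would serve as high-level features for the downstream analysis of socio-emotional skills.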
Explainability methods [4,5] will then be devised to identify the high-level multimodal features linked to each socio-emotional skill. We will evaluate the post-hoc explainability method explored in our previous work [6] to identify important time instants and social signals in job interviews. We will also consider counterfactual reasoning [7], a state-of-the-art explainability method which has not yet been explored in this context. The explanations thus found will be verbalized in textual form using a large language model (LLM). The use of Chain-of-Thought LLM architectures capable of generating textual explanations will also be considered [8]. Experimental protocols will be developed to evaluate the explanations provided and in particular assess their ability to provide relevant feedback to the students.
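As a minimal illustration of counterfactual reasoning in this setting, the sketch below searches for the smallest change to a student's features that flips a "fail" into a "pass". The feature names, weights, threshold and linear scoring model are all hypothetical placeholders; in the project the model would be the trained multimodal network, and the search would be correspondingly more involved.

```python
# Hypothetical interpretable features of a student's performance and a toy
# linear "evaluator score"; a stand-in for the trained multimodal model.
FEATURES = ["eye_contact", "speech_rate", "filler_words"]
WEIGHTS = {"eye_contact": 2.0, "speech_rate": 1.0, "filler_words": -1.5}
THRESHOLD = 1.0  # score >= THRESHOLD means a passing grade

def score(x):
    return sum(WEIGHTS[f] * x[f] for f in FEATURES)

def counterfactual(x, step=0.1, max_iters=1000):
    """Greedily nudge the most influential feature until the decision flips.
    The feedback to the student is the set of feature changes (deltas)."""
    cf = dict(x)
    for _ in range(max_iters):
        if score(cf) >= THRESHOLD:
            break
        # for this linear toy model, influence per unit change is just |weight|
        best = max(FEATURES, key=lambda f: abs(WEIGHTS[f]))
        cf[best] += step if WEIGHTS[best] > 0 else -step
    return cf, {f: round(cf[f] - x[f], 2) for f in FEATURES}

student = {"eye_contact": 0.2, "speech_rate": 0.5, "filler_words": 0.4}
cf, deltas = counterfactual(student)
print(deltas)  # the non-zero delta identifies the suggested change
```

An LLM could then verbalize the non-zero deltas as feedback of the form "maintaining more eye contact would likely have improved your grade", which is the kind of explanation the protocols above would evaluate.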
[1] M. McNeill & R. Levitan, “An autoregressive conversational dynamics model for dialogue systems”, in Interspeech, 2023.
[2] E. Chapuis, P. Colombo, M. Manica, M. Labeau & C. Clavel, “Hierarchical pre-training for sequence labelling in spoken dialog”, in Findings of EMNLP, 2020.
[3] A. Barkar, M. Chollet, B. Biancardi & C. Clavel, “Insights into the importance of linguistic textual features on the persuasiveness of public speaking”, in ICMI, 2023.
[4] L.H. Gilpin, D. Bau, B.Z. Yuan, A. Bajwa, M. Specter & L. Kagal, “Explaining explanations: An overview of interpretability of machine learning”, in DSAA, 2018.
[5] C. Rudin, C. Chen, Z. Chen, H. Huang, L. Semenova & C. Zhong, “Interpretable machine learning: Fundamental principles and 10 grand challenges”, Statistics Surveys, 2022.
[6] L. Hemamou, A. Guillon, J.C. Martin & C. Clavel, “Multimodal hierarchical attention neural network: Looking for candidates behaviour which impact recruiter's decision”, IEEE Transactions on Affective Computing, 2021.
[7] S. Verma, J. Dickerson & K. Hines, “Counterfactual explanations for machine learning: Challenges revisited”, in CHI HCXAI workshop, 2021.
[8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le & D. Zhou, “Chain-of-Thought prompting elicits reasoning in large language models”, arXiv:2201.11903, 2022.
Skills
MSc degree in speech processing, NLP, computer vision, machine learning, or in a related field.
Strong programming skills in Python/PyTorch.
Prior experience with speech, text and/or video processing is an asset.
French speaking skills are a plus.
Benefits package
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Remuneration
€2,100 gross/month in the 1st year
General Information
- Theme/Domain : Language, Speech and Audio
- Town/city : Villers-lès-Nancy
- Inria Center : Centre Inria de l'Université de Lorraine
- Starting date : 2024-10-01
- Duration of contract : 3 years
- Deadline to apply : 2024-05-12
Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
Instructions to apply
Defence Security :
This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.
Recruitment Policy :
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Inria Team : MULTISPEECH
- PhD Supervisor : Emmanuel Vincent / emmanuel.vincent@inria.fr
About Inria
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.