R&D Engineer - Feature Extraction for Activity Recognition

Contract type : Fixed-term contract

Level of qualifications required : Graduate degree or equivalent

Function : Temporary scientific engineer

About the research centre or Inria department

The Inria center at Université Côte d'Azur includes 42 research teams and 9 support services. The center’s staff (about 500 people) is made up of scientists of different nationalities, engineers, technicians and administrative staff. The teams are mainly located on the university campuses of Sophia Antipolis and Nice as well as Montpellier, in close collaboration with research and higher education laboratories and establishments (Université Côte d'Azur, CNRS, INRAE, INSERM ...), but also with the regional economic players.

With a presence in the fields of computational neuroscience and biology, data science and modeling, software engineering and certification, as well as collaborative robotics, the Inria Centre at Université Côte d'Azur is a major player in terms of scientific excellence through its results and collaborations at both European and international levels.

Context

Inria, the French National Institute for Computer Science and Applied Mathematics, promotes “scientific excellence for technology transfer and society”. Graduates from the world’s top universities, Inria's 2,700 employees rise to the challenges of digital sciences. With its open, agile model, Inria can explore original approaches with its partners in industry and academia and provide an efficient response to the multidisciplinary and application challenges of digital transformation. Inria is the source of many innovations that add value and create jobs.

Team

The STARS research team combines advanced theory with cutting-edge practice focusing on cognitive vision systems.

Team web site : https://team.inria.fr/stars/

Scientific context

Feature extraction is a challenging computer vision problem that aims to extract relevant information from raw data in order to reduce dimensionality and capture meaningful patterns. When this needs to be done in a dataset- and task-invariant way, it is referred to as general feature extraction. This is a crucial step in machine learning pipelines, and popular methods like VideoSwin and VideoMAE work well for action recognition and video understanding. However, these methods, and the datasets they are tested on, like Something-Something and Kinetics, fail to capture information about interactions in daily life.

Towards this research direction, several methods have been proposed to model these complex fine-grained interactions using datasets like UDIVA, MPII Group Interactions and EPIC-Kitchens. These datasets, which encompass real-world challenges, share the following characteristics. Firstly, rich multimodal information is available, and each modality provides important information relevant to the labels. Secondly, there is a lot of irrelevant information that has to be ignored, because deep learning models easily identify patterns that are coincidental (local minima). For example, the colour of a T-shirt could be used to assign a certain personality score to someone if, by coincidence, the majority of the extroverted people are wearing warm colours. Lastly, the videos in these datasets are generally very long.

So, the main question is:
How to extract general features from multimodal data with a lot of noise in the form of irrelevant information?

Typical situations that we would like to monitor are daily interactions, responses and reactions, with the aim of analysing cause and effect in behaviour (in either human-human or human-object interaction).

The system we want to develop will be beneficial for all tasks requiring a focus on interactions. One specific application is healthcare for psychological disorders, where general feature extraction will allow deep learning models to assist in various subtasks involved in the diagnosis process.

Assignment

In this work, we would like to go beyond existing computer vision deep learning models and introduce ways to extend them to utilise information from new modalities. We also want to identify ways to focus on the parts of the input that are relevant for interactions. The system should also take into account the long temporal duration of videos in the datasets in this domain. All of this has to be done in a flexible way, so that changes to the original model are minimal and its trained weights remain useful.

Existing methods have mostly focused on modelling the variation of visual cues pertinent to the classes provided for video classification tasks. Though they perform these tasks well, changes in the recording setting or the addition of noise in the form of irrelevant background information make it hard for these models to perform well. So, to obtain a general feature extractor, the models have to be modified to address these shortcomings.

Main activities

The Inria STARS team is seeking an engineer with a strong background in computer vision, deep learning, and machine learning.

In this work, we focus on two things: First, in action scenarios, utilizing all available information to obtain relevant features for multiple downstream tasks while ignoring irrelevant background information. Second, efficient transfer learning for a new recording paradigm. This can include new modalities, changes in recording settings, and different downstream tasks. The first objective can be tackled by forcing attention in transformers to attend to relevant parts of the input and having more specific architectures for modeling interactions. The second objective caters to a more general problem of parameter-efficient transfer learning which has benefited from works like adapters, prefix tuning, and prompt tuning [refs for all three]. These have worked well for the field of NLP and have been adapted to computer vision, but work only for specific cases. The theory behind these techniques can be utilized to develop new methods that serve the second objective of this work.
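As a rough illustration of the first objective, one common way to make attention focus on relevant parts of the input is an additive mask applied before the softmax. The sketch below is only an assumption about how this could be done, not the method the team will use: the function name `masked_attention_weights`, the example scores, and the relevance flags are all hypothetical, and it is written in plain Python for readability rather than with a deep learning framework.

```python
import math


def masked_attention_weights(scores, relevant):
    """Softmax over attention scores, restricted to positions flagged as relevant.

    scores   -- raw query-key scores for a single query (hypothetical values)
    relevant -- booleans, True where a token is assumed to matter for the interaction
    """
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have the same length")
    if not any(relevant):
        raise ValueError("at least one position must be marked relevant")
    # Additive mask: irrelevant positions are set to -inf, so exp() maps them to 0.
    masked = [s if keep else -math.inf for s, keep in zip(scores, relevant)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: tokens 1 and 3 are background and should be ignored.
weights = masked_attention_weights(
    scores=[2.0, 0.5, 1.0, 3.0],
    relevant=[True, False, True, False],
)
```

In this example the two masked positions receive zero weight even though one of them has the highest raw score, and the remaining weights still sum to one. How the relevance flags would be obtained in practice (for instance from a person or object detector) is left open here.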

Large pretrained vision models and their architectures can be used as the backbone for this work.

Skills

Candidates must hold a Master's or Engineering degree or equivalent in Computer Science or a closely related discipline by the start date.

The candidate must be grounded in computer vision basics and have solid mathematical and programming skills.

The candidate should have theoretical knowledge in Computer Vision, OpenCV, Mathematics and Deep Learning (PyTorch, TensorFlow), and a technical background in C++ and Python programming and Linux.

The candidate must be committed to scientific research and substantial publications.

In order to protect its scientific and technological assets, Inria is a restricted-access establishment. Consequently, it follows special regulations for welcoming any person who wishes to work with the institute. The final acceptance of each candidate thus depends on applying this security and defense procedure.

Benefits package

Subsidized meals
Partial reimbursement of public transport costs
Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
Possibility of teleworking and flexible organization of working hours
Professional equipment available (videoconferencing, loan of computer equipment, etc.)
Social, cultural and sports events and activities
Access to vocational training
Contribution to mutual insurance (subject to conditions)

Remuneration

From 2692 € gross monthly (according to degree and experience)

General Information

Theme/Domain : Vision, perception and multimedia interpretation
Town/city : Sophia Antipolis
Inria Center : Centre Inria d'Université Côte d'Azur
Starting date : 2025-04-01
Duration of contract : 3 months
Deadline to apply : 2025-04-13

Warning : you must enter your e-mail address in order to save your application to Inria.

Instruction to apply

Applications must be submitted online on the Inria website. Processing of applications sent through other channels is not guaranteed.

Defence Security :
This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.

Recruitment Policy :
As part of its diversity policy, all Inria positions are accessible to people with disabilities.

Contacts

Inria Team : STARS
Recruiter :
Brémond François / Francois.Bremond@inria.fr

The keys to success

Essential qualities for this assignment are feeling at ease in a dynamic scientific environment and being eager to learn and listen.
The candidate should be passionate about innovation and willing to pursue a PhD thesis in the field of Computer Vision and Machine Learning.

Languages: English

Relational skills: team work
Other valued skills: leadership

About Inria

Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.