2018-01198 - PhD Position F/M Deep learning based approaches to speaker identification in real conditions

Level of qualifications required: Graduate degree or equivalent

Function: PhD Position

Context

This PhD fits within the scope of the ANR project "ROBOVOX" involving the Multispeech team at Inria Nancy - Grand Est (https://team.inria.fr/multispeech/), the speech processing team at Laboratoire d'informatique d'Avignon (http://lia.univ-avignon.fr/), and A.I. Mergence (http://www.ai-mergence.com/fr/).

Assignment

Speaker identification has recently been deployed in several real-world applications, including secure access to bank services via telephone or internet. However, identification based solely on voice remains of limited reliability in real conditions, where the signal is affected by acoustic perturbations such as noise and reverberation. Recent work indicates that multichannel speech enhancement of the test signal improves the performance of speaker identification systems in noisy environments [1], especially as it makes it possible to control the distortion introduced into the speech signal [2]. Additionally, the use of deep learning [3] for multichannel speech enhancement has recently led to large performance improvements [4, 5].

[1] D. Ribas, E. Vincent, J. R. Calvo, “Full multicondition training for robust i-vector based speaker recognition”, In Proc. Interspeech, 2015.

[2] R. Serizel, M. Moonen, B. Van Dijk and J. Wouters, “Low-rank Approximation Based Multichannel Wiener Filter Algorithms for Noise Reduction with Application in Cochlear Implants”. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2014, vol. 22, pp. 785–799.

[3] L. Deng and D. Yu, Deep Learning: Methods and Applications, NOW Publishers, 2014.

[4] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming”. In Proc. ICASSP, 2016.

[5] A. A. Nugraha, A. Liutkus and E. Vincent, “Multichannel audio source separation with deep neural networks”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, vol. 24, n. 9, pp. 1652–1664.

Main activities

The goal of this PhD thesis is to explore the use of deep learning based speech enhancement techniques to improve the performance of speaker identification systems in real conditions. As a first step, we propose to develop algorithms that handle noise and reverberation jointly, inspired by recent work on dereverberation [6]. The final goal is to propose end-to-end approaches that perform speaker identification directly from the perturbed multichannel signal. We propose to explore methods that compare several recordings of the same speaker captured under different acoustic conditions in order to learn intermediate representations that are robust to these perturbations [7, 8, 9].

[6] O. Schwartz, S. Gannot and E. A. Habets, “Multi-microphone speech dereverberation and noise reduction using relative early transfer functions”. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015, vol. 23, n. 2, pp. 240–251.

[7] H. Bredin, “TristouNet: triplet loss for speaker turn embedding”. In Proc. ICASSP, 2017.

[8] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. "Deep canonical correlation analysis". In Proc. ICML, 2013.

[9] S. Sun, “A survey of multi-view machine learning”. Neural Computing and Applications, 2013, vol. 23, n. 7-8, pp. 2031–2038.
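
To give candidates a concrete idea of the kind of objective meant by comparing recordings of the same speaker under different acoustic conditions, here is a minimal sketch of a generic triplet loss in the spirit of [7]. It is an illustration only, not the method that will be developed in the thesis: the function names, the margin value and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet loss (hypothetical helper, margin chosen arbitrarily).

    anchor   : embedding of a recording of speaker A
    positive : embedding of another recording of speaker A,
               e.g. captured under different noise or reverberation
    negative : embedding of a recording of a different speaker

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# Toy embeddings (made-up values, for illustration only).
clean_a = [1.0, 0.0]   # speaker A, clean recording
noisy_a = [0.9, 0.1]   # speaker A, noisy recording
clean_b = [0.0, 1.0]   # speaker B

well_separated = triplet_loss(clean_a, noisy_a, clean_b)
badly_separated = triplet_loss(clean_a, clean_b, noisy_a)
```

In this toy case the first triplet is already well separated, so its loss is zero, while the second triplet (with the roles of positive and negative swapped) yields a positive loss that training would push down.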

Skills

MSc in computer science, machine learning or signal processing
Experience programming in Python
Experience with deep learning toolkits is a plus

Benefits package

  • Subsidised catering service
  • Partially-reimbursed public transport
  • Social security
  • Paid leave
  • Flexible working hours
  • Sports facilities

Remuneration

Salary: 1982€ gross/month for the 1st and 2nd years, 2085€ gross/month for the 3rd year.

Monthly salary after taxes: around 1594.00€ for the 1st and 2nd years, 1677.00€ for the 3rd year (medical insurance included).