Reference : UMR7271-VIVROS-013
Workplace : SOPHIA ANTIPOLIS
Date of publication : Monday, June 22, 2020
Scientific Responsible name : Jean MARTINET
Type of Contract : PhD Student contract / Thesis offer
Contract Period : 36 months
Start date of the thesis : 1 October 2020
Proportion of work : Full time
Remuneration : 2 135,00 € gross monthly
Description of the thesis topic
The objective of the work is to design and implement Spiking-Neural-Network-based machine learning methods that extract vision features and infer useful information about the visual scene from stereo event-based cameras. The work will focus on two use cases: scene segmentation and depth estimation. A demonstrator is expected; before its hardware integration, it will be benchmarked at the software level against standard frame-based approaches in terms of precision and energy consumption.
In this PhD work, bio-inspired SNN-based machine learning methods for stereo event-based sensors will be designed and implemented to target both use cases. Simulations will use frameworks such as Nengo, NEST or Brian2, and the implementation will target dedicated neuromorphic hardware such as the HBP SpiNNaker and Intel Loihi platforms.
The candidate should hold a Master's degree in Computer Science or Signal and Image Processing.
The candidate should have solid training in machine learning and computer vision; knowledge of computational neuroscience is a plus. Programming skills in Python/C++, experience with software versioning using git, and an interest in research, machine learning, bio-inspiration and neuroscience are required.
This PhD proposal is part of the European CHIST-ERA APROVIS3D project (April 2020 – March 2023) on bio-inspired machine learning for stereo event-based vision using mixed analog-digital hardware. The work will be carried out in collaboration with a leading neuroscience institute in Marseille, the Institute of Neuroscience of la Timone, which will be part of the supervision team.
In less than a decade, deep Artificial Neural Networks (ANN) such as Inception and VGG-16 have pushed state-of-the-art image classification performance to new levels on challenging computer vision benchmarks like ImageNet. The availability of both tremendous amounts of annotated data and huge computational resources has enabled this remarkable progress. However, this success comes with a substantial human cost for manually labeling data and an energy cost for training.
Spiking Neural Networks (SNN) are a special class of artificial neural networks in which neurons communicate by sequences of spikes [Ponulak, 2011] [Paugam-Moisy, 2012]. Contrary to deep convolutional networks, spiking neurons do not fire at each propagation cycle; they fire only when their activation level (or membrane potential, an intrinsic property of the neuron related to its membrane electrical charge) reaches a specific threshold value. When a neuron fires, it emits a spike that travels to other neurons, which in turn increases their potentials. The activation level either increases with incoming spikes or decays over time. Neuromorphic hardware implementing SNN can be built with CMOS technology and typically operates at low power (below the threshold voltage), reducing energy dissipation by several orders of magnitude compared to standard digital architectures [Merolla, 2014] [Desbief, 2015].
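The integrate-fire-reset cycle described above can be sketched as a minimal leaky integrate-and-fire (LIF) neuron; all parameter values here are illustrative, not taken from the project:

```python
# Minimal leaky integrate-and-fire (LIF) neuron, simulated step by step.
# Parameter values are illustrative only.

def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Return the list of time steps at which the neuron fires.

    At every step the membrane potential decays (multiplicatively)
    and integrates the incoming current; crossing the threshold
    emits a spike and resets the potential.
    """
    potential = 0.0
    spikes = []
    for t, current in enumerate(input_current):
        potential = leak * potential + current  # decay + integration
        if potential >= threshold:              # threshold crossing
            spikes.append(t)                    # emit a spike ...
            potential = reset                   # ... and reset
    return spikes

# A constant weak input makes the neuron fire periodically.
print(simulate_lif([0.3] * 10))  # → [3, 7]
```

Note the contrast with a conventional artificial neuron: no output is produced at most steps, which is precisely what neuromorphic hardware exploits to save energy.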
Regarding learning, SNN do not rely on stochastic gradient descent and backpropagation. Instead, neurons are connected through synapses that implement a learning mechanism inspired by biology: Spike-Timing-Dependent Plasticity (STDP), a rule that updates synaptic weights (connection strengths) according to the causal links observed between presynaptic and postsynaptic spikes. This rule reinforces incoming connections that cause the neuron to fire. The learning process is therefore intrinsically unsupervised, and can successfully detect patterns in data without labels [Bichler, 2012] [Beyeler, 2013].
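A common pair-based form of the STDP rule can be sketched as follows; the amplitudes and time constant are illustrative assumptions, not values prescribed by the project:

```python
import math

# Pair-based STDP weight update: a presynaptic spike that precedes the
# postsynaptic one (a plausible causal link) potentiates the synapse;
# the reverse order depresses it. The effect decays with the time gap.

def stdp_delta_w(t_pre, t_post, a_plus=0.1, a_minus=0.12, tau=20.0):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:   # pre before post: potentiation
        return a_plus * math.exp(-dt / tau)
    else:        # post before (or with) pre: depression
        return -a_minus * math.exp(dt / tau)

print(stdp_delta_w(10.0, 15.0) > 0)  # causal pair: True (strengthened)
print(stdp_delta_w(15.0, 10.0) < 0)  # reversed pair: True (weakened)
```

Because the update depends only on locally observed spike times, no global error signal (and hence no labeled data) is needed.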
SNN are still little used, yet interest in this type of ANN for computer vision is growing. They offer many interesting features for a profound paradigm change in machine learning and computer vision [Verzi, 2018] [Pei, 2019] [Roy, 2019] [Taherkhani, 2020], such as unsupervised training with Spike-Timing-Dependent Plasticity rules [Vigneron, 2020], notably combined with reinforcement learning as in reward-modulated STDP (R-STDP) [Mozafari, 2018], and implementation on ultra-low-power neuromorphic hardware. Yet a number of challenges lie ahead before they become a realistic alternative to deep CNN. Open issues include the design of an efficient SNN topology (convolutional? layered? recurrent?), the understanding and control of the learning process through parameter tuning, the right way to inject supervision (during or after training?), input coding (from pixel values to spike trains) and output decoding (interpreting the output spikes). Recent work shows that SNN are competitive with the state of the art on “easy” datasets such as MNIST (handwritten digits) [Diehl, 2015] [Kheradpisheh, 2018] [Falez, 2019].
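To make the input-coding issue concrete, one widely used scheme is latency (first-spike) coding, where brighter pixels fire earlier; this is one possible choice among several (rate coding, population coding), sketched here as an assumption rather than the project's chosen method:

```python
def latency_encode(pixels, t_max=100.0):
    """Map pixel intensities in [0, 1] to first-spike times (ms):
    brighter pixels fire earlier; zero-intensity pixels stay silent (None)."""
    return [None if p <= 0 else (1.0 - p) * t_max for p in pixels]

# A white pixel fires immediately, a mid-gray one at half the window,
# and a black pixel never fires.
print(latency_encode([1.0, 0.5, 0.0]))  # → [0.0, 50.0, None]
```

Output decoding is the symmetric problem: for classification, for instance, the class of the output neuron that fires first (or most) can be taken as the prediction.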
Beyond static images, SNN are expected to handle temporal data such as video well, thanks to their asynchronous operation principle. Event-based cameras (or silicon retinas) bring a new vision paradigm by mimicking the biological retina. Instead of measuring the intensity of every pixel at a fixed time interval, they report events of significant pixel intensity change. Each event is represented by its position, the sign of the change, and a timestamp accurate to the microsecond. This asynchronous operation makes them a natural match for SNN [Steffen, 2019] [Paredes, 2019] [Oudjail, 2019]. State-of-the-art machine learning approaches provide excellent results for vision tasks with standard cameras; asynchronous event sequences, however, require special handling, and spiking networks can take advantage of this asynchrony.
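The event representation described above (position, polarity, microsecond timestamp) maps directly onto SNN input spike trains; a minimal sketch, with field names chosen here for illustration:

```python
from typing import NamedTuple

# One event-camera output record: pixel position, polarity (sign of the
# intensity change), and a microsecond timestamp.

class Event(NamedTuple):
    x: int
    y: int
    polarity: int   # +1 brighter, -1 darker
    t_us: int       # timestamp in microseconds

def events_for_pixel(events, x, y):
    """Spike train (timestamps) feeding the SNN input neuron at (x, y)."""
    return [e.t_us for e in events if e.x == x and e.y == y]

stream = [Event(3, 4, +1, 1000), Event(5, 1, -1, 1200), Event(3, 4, -1, 2500)]
print(events_for_pixel(stream, 3, 4))  # → [1000, 2500]
```

No frame reconstruction is needed: each event can be injected into the network as a spike the moment it arrives, preserving the sensor's temporal precision.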
Moreover, conventional video sensors record the entire image at a given rate and resolution. The original rationale for sensing a scene this way is that the transmission or recording is intended to be viewed by a human observer who may look closely at any part of the moving image. Frame-based video contains a huge amount of redundant data and requires enormous computational power to process. As stated in [Hopkins, 2018], biological vision sensors are quite different from frame-based cameras: they sample images neither at a uniform rate nor at a uniform resolution. The human eye has a small high-resolution region (the fovea) in the center of the field of vision and a much larger peripheral region of much lower resolution, combined with an increased sensitivity to movement. Limited resources are thus deployed to extract the most salient information from the scene without wasting energy capturing the entire scene at the highest resolution. Furthermore, the human eye is primarily sensitive to changes in the luminance falling on its individual sensors. These changes are processed by layers of neurons in the retina through to the retinal ganglion cells, which generate action potentials, or spikes, whenever a significant change is detected. These spikes then propagate through the optic nerve to the brain. This approach focuses resources on the areas of the image that convey the most useful information, such as edges and other details. Given the core objective of computer vision systems, it seems natural to sense the world with bio-inspired sensors. Moreover, primates and other mammals can compute depth information from views acquired simultaneously from different points in space through stereopsis, a fundamental feature of 3D environment sensing. Bio-inspired models of binocular vision have also been used to solve the event-based stereo correspondence problem, as in [Oswald, 2017] [Tulyakov, 2019].
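Once the stereo correspondence problem is solved (matching left and right events), depth follows from standard pinhole-camera triangulation, depth = focal length × baseline / disparity; a small worked sketch with an illustrative rig:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole-camera triangulation for a rectified stereo pair:
    depth (m) = focal length (px) * baseline (m) / disparity (px)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Illustrative rig: 500 px focal length, 10 cm baseline, 25 px disparity.
print(depth_from_disparity(25.0, 500.0, 0.10))  # ≈ 2.0 metres
```

This is the geometric step only; the hard, bio-inspired part addressed by [Oswald, 2017] [Tulyakov, 2019] is establishing the event correspondences that yield the disparity in the first place.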
During this PhD work, the candidate will develop and implement bio-inspired models for stereo event-based cameras.
[Beyeler, 2013] Michael Beyeler, Nikil D. Dutt and Jeffrey L. Krichmar: Categorization and decision-making in a neurobiologically plausible spiking network using a STDP-like learning rule. Neural Networks 48 (2013) 109–124.
[Bichler, 2012] O. Bichler, D. Querlioz, S. J. Thorpe, J.-P. Bourgoin, C. Gamrat: Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity. Neural Networks 32 (2012) 339–348.
[Desbief, 2015] S. Desbief, A. Kyndiah, D. Guérin, D. Gentili, M. Murgia, S. Lenfant, F. Alibart, T. Cramer, F. Biscarini, D. Vuillaume: Low voltage and time constant organic synapse-transistor. Organic Electronics 21 (2015) 47–53.
[Diehl, 2015] Peter Diehl and Matthew Cook: Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Front. Comput. Neurosci. 9:99 (2015). doi: 10.3389/fncom.2015.00099.
[Falez, 2019] Unsupervised Visual Feature Learning with STDP: How Far are we from Traditional Feature Learning Approaches? Pattern Recognition, 2019.
[Hopkins, 2018] M. Hopkins, G. Pineda-García, P. A. Bogdan, S. B. Furber: Spiking neural networks for computer vision. Interface Focus 8: 20180007 (2018).
[Kheradpisheh, 2018] STDP-based spiking deep convolutional neural networks for object recognition. Neural Networks 99 (2018) 56–67.
[Merolla, 2014] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha: A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345 (2014) 668–673.
[Mozafari, 2018] M. Mozafari, S. R. Kheradpisheh, T. Masquelier, A. Nowzari-Dalini, M. Ganjtabesh: First-spike-based visual categorization using reward-modulated STDP. IEEE Transactions on Neural Networks and Learning Systems 29(12) (2018) 6178–6190.
Constraints and risks
No constraints or risks are foreseen.