TRANSCRIPT
P08821 ADVANCED MACHINE LEARNING
WEEK 9 CONVOLUTIONAL NEURAL NETWORKS Professor Fabio Cuzzolin
School of Engineering, Computing and Mathematics
Oxford Brookes University
Academic year 2018-19, Semester 2
OUTLINE OF WEEK 9
Deep learning
Impact and motivation of DL
Convolutional Neural Nets
Convolutional Layer
Non-linear Layer
Pooling Layer
CNN training
Architectures
AlexNet
GoogLeNet, inception
Case study: Action detection
Resources
DEEP LEARNING
DEEP LEARNING
Deep learning: neural networks with very many layers, and piecewise linear activation functions
Made possible improvements in recognition rates of 20-30% compared to previous support vector machine (SVM) classifiers
Seem able to encode high-level abstractions from the data
The theory is not yet well understood; progress relies on empirical investigations
Check out: deeplearning.net/
IMPACT OF DEEP LEARNING
Advancement in speech recognition
A few long-standing performance records were broken with deep learning methods (e.g. object detection on ImageNet)
Microsoft and Google have both deployed DL-based speech recognition systems in their products
Advancement in Computer Vision
The record holders on ImageNet and Semantic Segmentation are convolutional nets
Advancement in Natural Language Processing
Fine-grained sentiment analysis, syntactic parsing
Language models, machine translation, question answering
CASE STUDY: ALPHAGO
In “Nature”, 27 January 2016:
“DeepMind’s program AlphaGo beat Fan Hui, the European Go champion, five times out of five in tournament conditions...”
“AlphaGo was not preprogrammed to play Go: rather, it learned using a general-purpose algorithm that allowed it to interpret the game’s patterns.”
“…AlphaGo program applied deep learning in neural networks (convolutional NN) — brain-inspired programs in which connections between layers of simulated neurons are strengthened through examples and experience.”
MOTIVATIONS FOR DEEP ARCHITECTURES
Insufficient depth can hurt
With a shallow architecture (SVM, NB, KNN, etc.), the required number of nodes in the graph (i.e. computations, and also the number of parameters when we try to learn the function) may grow very large
Many functions that can be represented efficiently with a deep architecture cannot be represented efficiently with a shallow one
The brain has a deep architecture
The visual cortex shows a sequence of areas each of which contains a representation of the input, and signals flow from one to the next
Cognitive processes seem deep
Humans first learn simpler concepts and then compose them to represent more abstract ones hierarchically
Engineers break up solutions into multiple levels of abstraction and processing
FEATURE LEARNING
The deeper the layer, the higher the feature abstraction level
First layer: local patterns (e.g. edges); second layer: face features; third layer: faces
Lots of empirical studies on this quality of deep networks
CONVOLUTIONAL NEURAL NETWORKS
A BIOLOGICALLY INSPIRED FRAMEWORK
Convolutional Neural Networks are inspired by mammalian visual cortex
The visual cortex contains a complex arrangement of cells, each of which is sensitive to a small sub-region of the visual field called its receptive field
These cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlation present in natural images
THE MAMMALIAN VISUAL CORTEX INSPIRES CNN
(DEEP) LEARNING TO ENCODE COMPLEXITY
Deep networks seem able to encode high-level abstractions from the data
The theory is not yet well understood; the evidence is mainly empirical
CONVOLUTIONAL NEURAL NETWORKS (CNNS)
Use convolutional layers, alternated with max pooling layers and rectified linear unit activation functions, for feature learning
Typically, fully connected layers at the end produce softmax scores for classification
deeplearning.net/tutorial/lenet.html
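To make the pipeline concrete, here is a minimal LeNet-style sketch in PyTorch (an assumption: the linked tutorial itself uses Theano); the layer sizes are illustrative, chosen for 28 × 28 greyscale inputs:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # convolutional layer: 6 learned 5x5 filters
        nn.ReLU(),                        # non-linear layer
        nn.MaxPool2d(2),                  # pooling layer: 2x2 max pooling
        nn.Conv2d(6, 16, kernel_size=5),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * 4 * 4, 10),        # fully connected layer -> 10 class scores
    )

    scores = model(torch.randn(1, 1, 28, 28))
    print(scores.softmax(dim=1).shape)    # softmax scores: torch.Size([1, 10])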
CONVOLUTIONAL LAYER
The core layer of CNNs
The convolutional layer consists of a set of filters
Each filter covers a spatially small portion of the input data
Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map
As we convolve the filter, we are computing the dot product between the parameters of the filter and the input
Intuition: the network will learn filters that activate when they see some specific type of feature at some spatial position in the input
The key architectural characteristics of the convolutional layer are local connectivity and shared weights
CONVOLUTION
Linear operation, which consists of multiplications and sums
CONVOLUTION EXAMPLE
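The worked example on this slide was a figure that does not survive in the transcript. As a stand-in, here is a small NumPy sketch of the sliding dot product just described (deep learning libraries implement this flip-free variant, technically cross-correlation, and still call it convolution):

    import numpy as np

    image = np.array([[1, 2, 3],
                      [4, 5, 6],
                      [7, 8, 9]], dtype=float)
    kernel = np.array([[1, 0],
                       [0, -1]], dtype=float)   # a toy 2x2 filter

    # Slide the filter across the image; each output value is the dot product
    # between the filter and the 2x2 image patch it currently covers
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)
    print(out)   # [[-4. -4.]
                 #  [-4. -4.]]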
FEATURE MAP
Just the activation values of the next layer, really
SHARED WEIGHTS
Consider 3 hidden neurons belonging to the same feature map (the layer right above the input layer)
Weights of the same colour are ‘shared’ - constrained to be identical
Gradient descent can be used to learn shared parameters
Replicating neurons in this way allows for features to be detected regardless of their position in the input (location invariance)
Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters being learnt
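A quick way to see the saving is to compare a small convolutional layer with a fully connected layer computing an output of the same shape; this PyTorch sketch uses illustrative sizes (a 32 × 32 RGB input), not numbers from the slides:

    import torch.nn as nn

    conv = nn.Conv2d(3, 16, kernel_size=5)     # shared weights: one bank of 16 filters
    fc = nn.Linear(3 * 32 * 32, 16 * 28 * 28)  # same input/output sizes, no sharing

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(conv))  # 1216 parameters (16*5*5*3 weights + 16 biases)
    print(count(fc))    # 38547712 parameters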
NON-LINEAR LAYER: RECTIFIED LINEAR UNIT (RELU)
Rectified Linear Unit (ReLU) versus the classical sigmoid activation functions
Makes learning faster, improves the generalisation power of the network
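A tiny NumPy sanity check of why this helps: the sigmoid's gradient vanishes for inputs of large magnitude, while ReLU's gradient is exactly 1 for any positive input (the sample values are arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    # Sigmoid gradient s(x)(1 - s(x)) shrinks towards 0 away from the origin
    print(sigmoid(x) * (1 - sigmoid(x)))  # approx. [0.0066 0.1966 0.25 0.1966 0.0066]
    # ReLU gradient: 0 for x <= 0, 1 for x > 0; no saturation on the positive side
    print(np.where(x > 0, 1.0, 0.0))      # [0. 0. 0. 1. 1.]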
POOLING LAYER
Intuition: progressively reduce the spatial size of the representation, in order to reduce the amount of parameters and computation, and to control overfitting
Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value of the features in that region
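A minimal NumPy sketch of this operation: 2 × 2 max pooling with stride 2 over a 4 × 4 input (the values are illustrative):

    import numpy as np

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)

    # Partition into non-overlapping 2x2 blocks and keep each block's maximum
    pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)  # [[6. 8.]
                   #  [3. 4.]]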
CNN BUILDING BLOCKS
Images are segmented into sub-regions.
Each sub-region yields a feature map, representing its features.
Feature maps are computed by trained neurons with shared weights.
Feature maps of a larger region are combined.
EXAMPLE OF THE WHOLE PROCESS
Courtesy Robotics and Computer Tech Lab, http://www.rtc.us.es
SOME JARGON
TRAINING A CNN: BACKPROPAGATION
How do the filters in the first conv layer know to look for edges and curves?
How does the fully connected layer know what activation maps to look at?
This happens through backpropagation (cf. Introduction to Machine Learning)
Initially the weights or filter values are randomized
In practice, for the network to work it needs to be pre-trained on a related problem (transfer learning)
E.g.: action detection networks are pre-trained on object detection weights
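As a sketch of what this looks like in code, using torchvision (an assumption; the slides do not name a framework), one can load ImageNet-pretrained weights and replace only the final, task-specific layer:

    import torch.nn as nn
    from torchvision import models

    # Start from weights pre-trained on ImageNet classification (torchvision >= 0.13 API)
    net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Replace the last fully connected layer for a new 10-class task;
    # only this layer starts from random weights
    net.fc = nn.Linear(net.fc.in_features, 10)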
TRAINING A CNN: BACKPROPAGATION
Backpropagation can be separated into 4 distinct sections, the forward pass, the loss function, the backward pass, and the weight update
Forward pass: you take the input and ‘pass’ it through the network, generating outputs
Backward pass: determines which weights contributed most to the loss and finds ways to adjust them so that the loss decreases
The loss L is usually the mean squared error between the network's outputs and the targets
Weight update: w_i ← w_i − η · dL/dw_i, where η is the learning rate
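A worked one-weight example of this update rule (the toy function, data and learning rate are made up for illustration):

    # Toy problem: fit y = w*x to the single pair (x, y) = (2, 6); the true w is 3
    x, y = 2.0, 6.0
    w, eta = 0.0, 0.05   # initial weight and learning rate

    for step in range(20):
        y_hat = w * x                  # forward pass
        dL_dw = 2 * (y_hat - y) * x    # backward pass: gradient of L = (y_hat - y)^2
        w = w - eta * dL_dw            # weight update: w <- w - eta * dL/dw
    print(round(w, 3))  # 3.0 (approximately)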
EXAMPLE ARCHITECTURES
ALEXNET: A MILESTONE IN CNN
Designed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton
Won the ImageNet Large Scale Visual Recognition Challenge in 2012
Input image: 227 × 227 × 3
First convolutional layer: 96 filters with K = 11 and stride = 4
Width and height of output: (227 − 11)/4 + 1 = 55
ALEXNET: A MILESTONE IN CNN
Number of parameters in the first layer? 11 × 11 × 3 × 96 = 34848 weights (excluding the 96 biases)
Popularized the use of ReLUs
Used heavy data augmentation (flipped images, random crops of size 227 × 227)
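Both computations above can be checked in a few lines of Python, using the general output-size formula (W − K + 2P)/S + 1:

    def conv_output_size(W, K, stride, pad=0):
        # Output width/height of a convolution with filter size K and the given stride
        return (W - K + 2 * pad) // stride + 1

    print(conv_output_size(227, 11, 4))  # 55
    print(11 * 11 * 3 * 96)              # 34848 weights in AlexNet's first layer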
OTHER NETWORKS
ImageNet 2013 was won by a network similar to AlexNet (Matthew Zeiler and Rob Fergus)
Changed the first convolutional layer from 11 × 11 with a stride of 4 to 7 × 7 with a stride of 2
AlexNet used 384, 384 and 256 filters in the next three convolutional layers; ZF used 512, 1024 and 512
ImageNet 2013 top-5 error: 14.8%, reduced from 15.4%
Other popular network: VGG (Simonyan and Zisserman, 2014)
Total number of parameters: 138 Million!
GOOGLENET
https://ai.google/research/pubs/pub43022
Has 5 million parameters, i.e. 12× fewer than AlexNet
Gets rid of fully connected layers
Based on inception module
INCEPTION MODULE
Parallel paths with different receptive field sizes
Captures sparse patterns of correlation in the stack of feature maps
Also includes auxiliary classifiers for ease of training
Note also the 1 × 1 convolutions, used to reduce the number of channels
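A simplified inception block in PyTorch, to make the parallel-path idea concrete (the channel sizes are illustrative, not the published GoogLeNet configuration):

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)   # 1x1 path
            self.branch3 = nn.Sequential(                        # 1x1 reduction, then 3x3
                nn.Conv2d(in_ch, 32, kernel_size=1),
                nn.Conv2d(32, 64, kernel_size=3, padding=1))
            self.branch5 = nn.Sequential(                        # 1x1 reduction, then 5x5
                nn.Conv2d(in_ch, 16, kernel_size=1),
                nn.Conv2d(16, 32, kernel_size=5, padding=2))
            self.branch_pool = nn.Sequential(                    # pooling, then 1x1
                nn.MaxPool2d(3, stride=1, padding=1),
                nn.Conv2d(in_ch, 32, kernel_size=1))

        def forward(self, x):
            # Concatenate all parallel paths along the channel dimension
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)

    out = InceptionModule(192)(torch.randn(1, 192, 28, 28))
    print(out.shape)  # torch.Size([1, 192, 28, 28]); 64 + 64 + 32 + 32 channels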
INCEPTION MODULE
From: https://stackoverflow.com/questions/45420926/poor-results-for-tensorflow-googlenet-inception
CASE STUDY: ACTION DETECTION
ACTION DETECTION
given a video containing one or more actions of interest ..
.. locate where these actions occur (in both the image plane and time) ..
.. and classify each action instance as one of a known set of categories
action instances are represented as action tubes
tubes are built by linking up frame-level detections in time
DEEP LEARNING PIPELINE
Given a number of examples, region proposal networks (RPNs) learn to regress the location of bounding boxes containing actions of interest [Singh et al, 2017; Saha et al, 2017]
Typically borrowed from object detection (e.g. the Single-Shot Detector, SSD)
https://www.youtube.com/watch?v=P8e-G-Mhx4k
Dominant paradigm right now: linking up these detections in time
A LITTLE VIDEO
Demo of our most recent ICCV 2017 work
FASTER R-CNN
Makes use of a Region Proposal Network
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, 2015
SINGLE-SHOT DETECTOR (SSD)
Unlike Faster R-CNN, SSD is a fully convolutional neural network
It replaces the fully connected layers (i.e. FC6, FC7, and the classification and regression layers) of Faster R-CNN with convolutional layers
It eliminates the need for a Region Proposal Network: it does not compute the RPN's “actionness” classification (i.e. action or background) and bounding box regression losses at each training iteration
It only computes the action classification and box regression losses (and thus requires less computing time than Faster R-CNN)
Unlike Faster R-CNN, which introduces invariance by max-pooling ROI features at the RPN step, SSD uses a rich data augmentation scheme in place of max-pooling to achieve invariance
SINGLE-SHOT DETECTOR (SSD)
At training time, SSD takes as input images and ground truth (GT) boxes (Fig. (a))
It evaluates a small set of default boxes (e.g. 4 or 6) of different aspect ratios at each feature map grid location
Convolutional feature maps are extracted at different scales (e.g. 8 × 8 and 4 × 4, as in Fig. (b) and (c))
For each default box, it predicts location offsets and confidences for all the C classes
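A sketch of an SSD-style prediction head in PyTorch (all sizes are illustrative): at every location of one feature map, a 3 × 3 convolution predicts 4 location offsets and C class confidences for each of B default boxes:

    import torch
    import torch.nn as nn

    B, C = 4, 21                        # default boxes per location, number of classes
    feat = torch.randn(1, 512, 8, 8)    # one of the multi-scale feature maps

    loc_head = nn.Conv2d(512, B * 4, kernel_size=3, padding=1)   # box offsets
    conf_head = nn.Conv2d(512, B * C, kernel_size=3, padding=1)  # class scores

    print(loc_head(feat).shape)   # torch.Size([1, 16, 8, 8]): 4 offsets per default box
    print(conf_head(feat).shape)  # torch.Size([1, 84, 8, 8]): 21 scores per default box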
SUMMARY
SUMMARY OF WEEK 9
Principles of deep learning
Convolution
Convolutional neural networks
Convolutional layer
Non-linear layer
Pooling layer
Training: backpropagation
Specific architectures
AlexNet
GoogLeNet and inception
Case study: action detection
ADDITIONAL RESOURCES
CS231n course notes: http://cs231n.github.io/convolutional-networks/
A beginner’s guide: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
https://www.quora.com/What-is-meant-by-feature-maps-in-convolutional-neural-networks