TRANSCRIPT
P08821 ADVANCED MACHINE LEARNING
WEEK 9 CONVOLUTIONAL NEURAL NETWORKS Professor Fabio Cuzzolin
School of Engineering, Computing and Mathematics
Oxford Brookes University
Academic year 2018-19, Semester 2
OUTLINE OF WEEK 9
Deep learning
Impact and motivation of DL
Convolutional Neural Nets
Convolutional Layer
Non-linear Layer
Pooling Layer
CNN training
Architectures
AlexNet
GoogLeNet, inception
Case study: Action detection
Resources
DEEP LEARNING
DEEP LEARNING
Deep learning: neural networks with very many layers, and piecewise linear activation functions
Made possible improvements in recognition rates of 20-30% compared to previous support vector machine (SVM) classifiers
Seem able to encode high-level abstractions from the data
The theory is not yet well understood; progress relies on empirical investigations
Check out: deeplearning.net/
IMPACT OF DEEP LEARNING
Advancement in speech recognition
A few long-standing performance records were broken with deep learning methods (e.g. object detection on ImageNet)
Microsoft and Google have both deployed DL-based speech recognition systems in their products
Advancement in Computer Vision
The record holders on ImageNet and Semantic Segmentation are convolutional nets
Advancement in Natural Language Processing
Fine-grained sentiment analysis, syntactic parsing
Language models, machine translation, question answering
CASE STUDY: ALPHAGO
In “Nature”, 27 January 2016:
“DeepMind’s program AlphaGo beat Fan Hui, the European Go champion, five times out of five in tournament conditions...”
“AlphaGo was not preprogrammed to play Go: rather, it learned using a general-purpose algorithm that allowed it to interpret the game’s patterns.”
“…AlphaGo program applied deep learning in neural networks (convolutional NN) — brain-inspired programs in which connections between layers of simulated neurons are strengthened through examples and experience.”
MOTIVATIONS FOR DEEP ARCHITECTURES
Insufficient depth can hurt
With a shallow architecture (SVM, NB, KNN, etc.), the required number of nodes in the graph (i.e. computations, and also the number of parameters when we try to learn the function) may grow very large
Many functions that can be represented efficiently with a deep architecture cannot be represented efficiently with a shallow one
The brain has a deep architecture
The visual cortex shows a sequence of areas each of which contains a representation of the input, and signals flow from one to the next
Cognitive processes seem deep
Humans first learn simpler concepts and then compose them to represent more abstract ones hierarchically
Engineers break up solutions into multiple levels of abstraction and processing
FEATURE LEARNING
The deeper the layer, the higher the feature abstraction level
First layer: local patterns (e.g. edges); second layer: face features; third layer: faces
Lots of empirical studies on this quality of deep networks
CONVOLUTIONAL NEURAL NETWORKS
A BIOLOGICALLY INSPIRED FRAMEWORK
Convolutional Neural Networks are inspired by mammalian visual cortex
The visual cortex contains a complex arrangement of cells, each of which is sensitive to a small sub-region of the visual field called its receptive field
These cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlation present in natural images
THE MAMMALIAN VISUAL CORTEX INSPIRES CNN
(DEEP) LEARNING TO ENCODE COMPLEXITY
Deep networks seem able to encode high-level abstractions from the data
The theory is not yet well understood; the evidence is mainly empirical
CONVOLUTIONAL NEURAL NETWORKS (CNNS)
Use convolutional layers, alternated with max pooling layers and rectified linear unit activation functions, for feature learning
Typically, fully connected layers at the end produce softmax scores for classification
deeplearning.net/tutorial/lenet.html
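To make the pipeline concrete, here is a minimal LeNet-style sketch in PyTorch (an assumption: the linked tutorial itself uses Theano); the layer sizes are illustrative, chosen for 28 × 28 greyscale inputs:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # convolutional layer: 6 learned 5x5 filters
        nn.ReLU(),                        # non-linear layer
        nn.MaxPool2d(2),                  # pooling layer: 2x2 max pooling
        nn.Conv2d(6, 16, kernel_size=5),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * 4 * 4, 10),        # fully connected layer -> 10 class scores
    )

    scores = model(torch.randn(1, 1, 28, 28))
    print(scores.softmax(dim=1).shape)    # softmax scores: torch.Size([1, 10])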
CONVOLUTIONAL LAYER
The core layer of CNNs
The convolutional layer consists of a set of filters
Each filter covers a spatially small portion of the input data
Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map
As we convolve the filter, we are computing the dot product between the parameters of the filter and the input
Intuition: the network will learn filters that activate when they see some specific type of feature at some spatial position in the input
The key architectural characteristics of the convolutional layer are local connectivity and shared weights
CONVOLUTION
Linear operation, which consists of multiplications and sums
CONVOLUTION EXAMPLE
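The worked example on this slide was a figure that does not survive in the transcript. As a stand-in, here is a small NumPy sketch of the sliding dot product just described (deep learning libraries implement this flip-free variant, technically cross-correlation, and still call it convolution):

    import numpy as np

    image = np.array([[1, 2, 3],
                      [4, 5, 6],
                      [7, 8, 9]], dtype=float)
    kernel = np.array([[1, 0],
                       [0, -1]], dtype=float)   # a toy 2x2 filter

    # Slide the filter across the image; each output value is the dot product
    # between the filter and the 2x2 image patch it currently covers
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)
    print(out)   # [[-4. -4.]
                 #  [-4. -4.]]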
FEATURE MAP
Just the activation values of the next layer, really
SHARED WEIGHTS
Consider 3 hidden neurons belonging to the same feature map (the layer right above the input layer)
Weights of the same colour are ‘shared’ - constrained to be identical
Gradient descent can be used to learn shared parameters
Replicating neurons in this way allows for features to be detected regardless of their position in the input (location invariance)
Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters being learnt
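A quick way to see the saving is to compare a small convolutional layer with a fully connected layer computing an output of the same shape; this PyTorch sketch uses illustrative sizes (a 32 × 32 RGB input), not numbers from the slides:

    import torch.nn as nn

    conv = nn.Conv2d(3, 16, kernel_size=5)     # shared weights: one bank of 16 filters
    fc = nn.Linear(3 * 32 * 32, 16 * 28 * 28)  # same input/output sizes, no sharing

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(conv))  # 1216 parameters (16*5*5*3 weights + 16 biases)
    print(count(fc))    # 38547712 parameters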
NON-LINEAR LAYER: RECTIFIED LINEAR UNIT (RELU)
Rectified Linear Unit (ReLU) versus the classical sigmoid activation functions
Makes learning faster, improves the generalisation power of the network
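A tiny NumPy sanity check of why this helps: the sigmoid's gradient vanishes for inputs of large magnitude, while ReLU's gradient is exactly 1 for any positive input (the sample values are arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    # Sigmoid gradient s(x)(1 - s(x)) shrinks towards 0 away from the origin
    print(sigmoid(x) * (1 - sigmoid(x)))  # approx. [0.0066 0.1966 0.25 0.1966 0.0066]
    # ReLU gradient: 0 for x <= 0, 1 for x > 0; no saturation on the positive side
    print(np.where(x > 0, 1.0, 0.0))      # [0. 0. 0. 1. 1.]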
POOLING LAYER
Intuition: progressively reduce the spatial size of the representation, in order to reduce the amount of parameters and computation, and to control overfitting
Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value of the features in that region
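A minimal NumPy sketch of this operation: 2 × 2 max pooling with stride 2 over a 4 × 4 input (the values are illustrative):

    import numpy as np

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)

    # Partition into non-overlapping 2x2 blocks and keep each block's maximum
    pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)  # [[6. 8.]
                   #  [3. 4.]]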
CNN BUILDING BLOCKS
Images are segmented into sub-regions.
Each sub-region yields a feature map, representing its features.
Feature maps are computed by trained neurons with shared weights.
Feature maps of a larger region are combined.
EXAMPLE OF THE WHOLE PROCESS
Courtesy Robotics and Computer Tech Lab, http://www.rtc.us.es
SOME JARGON
TRAINING A CNN: BACKPROPAGATION
How do the filters in the first conv layer know to look for edges and curves?
How does the fully connected layer know what activation maps to look at?
This happens through backpropagation (cf. Introduction to Machine Learning)
Initially the weights or filter values are randomized
In practice, for the network to work it needs to be pre-trained on a related problem (transfer learning)
E.g.: action detection networks are pre-trained on object detection weights
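As a sketch of what this looks like in code, using torchvision (an assumption; the slides do not name a framework), one can load ImageNet-pretrained weights and replace only the final, task-specific layer:

    import torch.nn as nn
    from torchvision import models

    # Start from weights pre-trained on ImageNet classification (torchvision >= 0.13 API)
    net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Replace the last fully connected layer for a new 10-class task;
    # only this layer starts from random weights
    net.fc = nn.Linear(net.fc.in_features, 10)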
TRAINING A CNN: BACKPROPAGATION
Backpropagation can be separated into 4 distinct sections, the forward pass, the loss function, the backward pass, and the weight update
Forward pass: you take the input and ‘pass’ it through the network, generating outputs
Backward pass: determines which weights contributed most to the loss and finds ways to adjust them so that the loss decreases
The loss L is usually the mean squared error between the network's outputs and the targets
Weight update: w_i ← w_i − η · dL/dw_i, where η is the learning rate
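A worked one-weight example of this update rule (the toy function, data and learning rate are made up for illustration):

    # Toy problem: fit y = w*x to the single pair (x, y) = (2, 6); the true w is 3
    x, y = 2.0, 6.0
    w, eta = 0.0, 0.05   # initial weight and learning rate

    for step in range(20):
        y_hat = w * x                  # forward pass
        dL_dw = 2 * (y_hat - y) * x    # backward pass: gradient of L = (y_hat - y)^2
        w = w - eta * dL_dw            # weight update: w <- w - eta * dL/dw
    print(round(w, 3))  # 3.0 (approximately)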
EXAMPLE ARCHITECTURES
ALEXNET: A MILESTONE IN CNN
Designed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton
Won the ImageNet Large Scale Visual Recognition Challenge in 2012
Input image: 227 × 227 × 3
First convolutional layer: 96 filters with K = 11 and stride = 4
Width and height of output: (227 − 11)/4 + 1 = 55
ALEXNET: A MILESTONE IN CNN
Number of parameters in the first layer? 11 × 11 × 3 × 96 = 34848 weights (excluding the 96 biases)
Popularized the use of ReLUs
Used heavy data augmentation (flipped images, random crops of size 227 × 227)
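Both computations above can be checked in a few lines of Python, using the general output-size formula (W − K + 2P)/S + 1:

    def conv_output_size(W, K, stride, pad=0):
        # Output width/height of a convolution with filter size K and the given stride
        return (W - K + 2 * pad) // stride + 1

    print(conv_output_size(227, 11, 4))  # 55
    print(11 * 11 * 3 * 96)              # 34848 weights in AlexNet's first layer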
OTHER NETWORKS
ImageNet 2013 was won by a network similar to AlexNet (Matthew Zeiler and Rob Fergus)
Changed the first convolutional layer from 11 × 11 with a stride of 4 to 7 × 7 with a stride of 2
AlexNet used 384, 384 and 256 filters in the next three convolutional layers; ZF used 512, 1024 and 512
ImageNet 2013 top-5 error: 14.8%, reduced from 15.4%
Other popular network: VGG (Simonyan and Zisserman, 2014)
Total number of parameters: 138 Million!
GOOGLENET
https://ai.google/research/pubs/pub43022
Has 5 million parameters, i.e. 12× fewer than AlexNet
Gets rid of fully connected layers
Based on inception module
INCEPTION MODULE
Parallel paths with different receptive field sizes
Captures sparse patterns of correlation in the stack of feature maps
Also includes auxiliary classifiers for ease of training
Note also the 1 × 1 convolutions, used to reduce the number of channels
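A simplified inception block in PyTorch, to make the parallel-path idea concrete (the channel sizes are illustrative, not the published GoogLeNet configuration):

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)   # 1x1 path
            self.branch3 = nn.Sequential(                        # 1x1 reduction, then 3x3
                nn.Conv2d(in_ch, 32, kernel_size=1),
                nn.Conv2d(32, 64, kernel_size=3, padding=1))
            self.branch5 = nn.Sequential(                        # 1x1 reduction, then 5x5
                nn.Conv2d(in_ch, 16, kernel_size=1),
                nn.Conv2d(16, 32, kernel_size=5, padding=2))
            self.branch_pool = nn.Sequential(                    # pooling, then 1x1
                nn.MaxPool2d(3, stride=1, padding=1),
                nn.Conv2d(in_ch, 32, kernel_size=1))

        def forward(self, x):
            # Concatenate all parallel paths along the channel dimension
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)

    out = InceptionModule(192)(torch.randn(1, 192, 28, 28))
    print(out.shape)  # torch.Size([1, 192, 28, 28]); 64 + 64 + 32 + 32 channels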
INCEPTION MODULE
From: https://stackoverflow.com/questions/45420926/poor-results-for-tensorflow-googlenet-inception
CASE STUDY: ACTION DETECTION
ACTION DETECTION
given a video containing one or more actions of interest ..
.. locate where these actions occur (in both the image plane and time) ..
.. and classify each action instance as one of a known set of categories
action instances are represented as action tubes
tubes are built by linking up frame-level detections in time
DEEP LEARNING PIPELINE
Given a number of examples, region proposal networks (RPNs) learn to regress the location of bounding boxes containing actions of interest [Singh et al, 2017; Saha et al, 2017]
Typically borrowed from object detection (e.g. the Single-Shot Detector, SSD)
https://www.youtube.com/watch?v=P8e-G-Mhx4k
Dominant paradigm right now: linking up these detections in time
A LITTLE VIDEO
Demo of our most recent ICCV 2017 work
FASTER R-CNN
Makes use of a Region Proposal Network
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, 2015
SINGLE-SHOT DETECTOR (SSD)
Unlike Faster R-CNN, SSD is a fully convolutional neural network
It replaces the fully connected layers (i.e. FC6, FC7, and the classification and regression layers) of Faster R-CNN with convolutional layers
It eliminates the need for a Region Proposal Network: it does not compute the RPN's “actionness” classification (i.e. action or background) and bounding box regression losses at each training iteration
It only computes the action classification and box regression losses (and thus requires less computing time than Faster R-CNN)
Unlike Faster R-CNN, which introduces invariance by max-pooling ROI features at the RPN step, SSD uses a rich data augmentation scheme in place of max-pooling to achieve invariance
SINGLE-SHOT DETECTOR (SSD)
At training time, SSD takes as input images and ground truth (GT) boxes (Fig. (a))
It evaluates a small set of default boxes (e.g. 4 or 6) of different aspect ratios at each feature map grid location
Convolutional feature maps are extracted at different scales (e.g. 8 × 8 and 4 × 4, as in Fig. (b) and (c))
For each default box, it predicts location offsets and confidences for all the C classes
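A sketch of an SSD-style prediction head in PyTorch (all sizes are illustrative): at every location of one feature map, a 3 × 3 convolution predicts 4 location offsets and C class confidences for each of B default boxes:

    import torch
    import torch.nn as nn

    B, C = 4, 21                        # default boxes per location, number of classes
    feat = torch.randn(1, 512, 8, 8)    # one of the multi-scale feature maps

    loc_head = nn.Conv2d(512, B * 4, kernel_size=3, padding=1)   # box offsets
    conf_head = nn.Conv2d(512, B * C, kernel_size=3, padding=1)  # class scores

    print(loc_head(feat).shape)   # torch.Size([1, 16, 8, 8]): 4 offsets per default box
    print(conf_head(feat).shape)  # torch.Size([1, 84, 8, 8]): 21 scores per default box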
SUMMARY
SUMMARY OF WEEK 9
Principles of deep learning
Convolution
Convolutional neural networks
Convolutional layer
Non-linear layer
Pooling layer
Training: backpropagation
Specific architectures
AlexNet
GoogLeNet and inception
Case study: action detection
ADDITIONAL RESOURCES
CS231n course notes: http://cs231n.github.io/convolutional-networks/
A beginner’s guide: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
https://www.quora.com/What-is-meant-by-feature-maps-in-convolutional-neural-networks