
Page 1:


Neural Networks: Introduction


Machine Learning Spring 2020

The slides are partly from Vivek Srikumar

Page 2:

Where are we?

Learning algorithms
• Decision Trees
• AdaBoost
• Bagging
• Least Mean Square
• Perceptron
• Support Vector Machines
• Kernel SVM
• Kernel Perceptron

Page 3:

Where are we?

General learning principles
• Overfitting
• Mistake-bound learning
• PAC learning, sample complexity
• Hypothesis choice & VC dimensions
• Training and generalization errors
• Regularized Empirical Risk Minimization

Learning algorithms
• Decision Trees
• AdaBoost
• Bagging
• Least Mean Square
• Perceptron
• Support Vector Machines
• Kernel SVM
• Kernel Perceptron

Page 4:

Where are we?

General learning principles
• Overfitting
• Mistake-bound learning
• PAC learning, sample complexity
• Hypothesis choice & VC dimensions
• Training and generalization errors
• Regularized Empirical Risk Minimization

Learning algorithms
• Decision Trees
• AdaBoost
• Bagging
• Least Mean Square
• Perceptron
• Support Vector Machines
• Kernel SVM
• Kernel Perceptron

Produce linear classifiers

Page 5:

Where are we?

General learning principles
• Overfitting
• Mistake-bound learning
• PAC learning, sample complexity
• Hypothesis choice & VC dimensions
• Training and generalization errors
• Regularized Empirical Risk Minimization

Learning algorithms
• Decision Trees
• AdaBoost
• Bagging
• Least Mean Square
• Perceptron
• Support Vector Machines
• Kernel SVM
• Kernel Perceptron

So far, we’ve seen how to use kernel tricks to extend linear classifiers to nonlinear ones. What if we want to directly train a nonlinear classifier?

Where do the features come from?

Produce linear classifiers

Page 6:

Neural Networks

• What is a neural network?

• Predicting with a neural network

• Training neural networks

• Practical concerns


Page 7:

This lecture

• What is a neural network?
  – The hypothesis class
  – Structure, expressiveness

• Predicting with a neural network

• Training neural networks

• Practical concerns


Page 8:

We have seen linear threshold units

[Figure: a linear threshold unit: input features, a dot product, then a threshold]

Page 9:

We have seen linear threshold units

[Figure: a linear threshold unit: input features, a dot product, then a threshold]

Prediction: $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x} + b) = \mathrm{sgn}\big(\sum_i w_i x_i + b\big)$
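As a concrete illustration, here is a minimal NumPy sketch of this prediction rule (the function name, weights, bias, and input are made-up values for illustration):

```python
import numpy as np

def ltu_predict(w, x, b):
    """Linear threshold unit: sgn(w . x + b), with sgn(0) taken as +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example with made-up weights and a bias
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 0.5])
print(ltu_predict(w, x, b=-0.25))  # prints 1, since 0.5 + 1.0 - 0.25 > 0
```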

Page 10:

We have seen linear threshold units

[Figure: a linear threshold unit: input features, a dot product, then a threshold]

Prediction: $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x} + b) = \mathrm{sgn}\big(\sum_i w_i x_i + b\big)$

Learning: various algorithms (Perceptron, SVM, …); in general, minimize a loss.
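As one concrete instance of such a learning algorithm, a sketch of the classic Perceptron mistake-driven update (the toy dataset here is invented for illustration):

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Perceptron: on each mistake, nudge (w, b) toward the correct label."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # mistake: wrong side of the boundary
                w += yi * xi
                b += yi
    return w, b

# Tiny linearly separable toy set (labels in {-1, +1})
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
```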

Page 11:

We have seen linear threshold units

[Figure: a linear threshold unit: input features, a dot product, then a threshold]

Prediction: $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x} + b) = \mathrm{sgn}\big(\sum_i w_i x_i + b\big)$

Learning: various algorithms (Perceptron, SVM, …); in general, minimize a loss.

But where do these input features come from?

What if the features were outputs of another classifier?

Page 12:

Features from classifiers



Page 14:

Features from classifiers

Each of these connections has its own weight as well.


Page 16:

Features from classifiers

This is a two-layer feed-forward neural network.

Page 17:

Features from classifiers

[Figure: the network's input layer, hidden layer, and output layer labeled]

This is a two-layer feed-forward neural network.

Think of the hidden layer as learning a good representation of the inputs
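A sketch of the forward pass of such a two-layer feed-forward network with threshold (sign) activations; the layer sizes and random weights here are illustrative, not from the slides:

```python
import numpy as np

def sgn(z):
    """Threshold activation; maps z >= 0 to +1, else -1."""
    return np.where(z >= 0, 1.0, -1.0)

def two_layer_forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer feed-forward network.
    The hidden layer h is the learned representation of the input x."""
    h = sgn(W1 @ x + b1)     # hidden layer (here, four neurons)
    return sgn(W2 @ h + b2)  # output layer (one neuron)

# Made-up weights: 3 inputs -> 4 hidden neurons -> 1 output neuron
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
print(two_layer_forward(np.array([1.0, -0.5, 2.0]), W1, b1, W2, b2))
```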

Page 18:

Features from classifiers

The dot product followed by the threshold constitutes a neuron.

This is a two-layer feed-forward neural network.

Page 19:

Features from classifiers

The dot product followed by the threshold constitutes a neuron.

Five neurons in this picture (four in the hidden layer and one output).

This is a two-layer feed-forward neural network.

Page 20:

But where do the inputs come from?

What if the inputs were the outputs of a classifier?

[Figure: the input layer highlighted]

We can make a three-layer network… and so on.

Page 21:

Let us try to formalize this


Page 22:

Neural networks

• A robust approach for approximating real-valued, discrete-valued, or vector-valued functions

• Among the most effective general-purpose supervised learning methods currently known
  – Especially for complex and hard-to-interpret data such as real-world sensory data

• The Backpropagation algorithm for neural networks has proven successful in many practical problems
  – Handwritten character recognition, speech recognition, object recognition, some NLP problems

Page 23:

Biological neurons

The first drawing of brain cells, by Santiago Ramón y Cajal in 1899.

Neurons: core components of the brain and the nervous system, consisting of:

1. Dendrites that collect information from other neurons

2. An axon that generates outgoing spikes

Page 24:

Biological neurons

The first drawing of brain cells, by Santiago Ramón y Cajal in 1899.

Neurons: core components of the brain and the nervous system, consisting of:

1. Dendrites that collect information from other neurons

2. An axon that generates outgoing spikes

Modern artificial neurons are “inspired” by biological neurons

But there are many, many fundamental differences

Don’t take the similarity too seriously (nor the claims in the news about the “emergence” of intelligent behavior).

Page 25:

Artificial neurons

Functions that very loosely mimic a biological neuron

A neuron accepts a collection of inputs (a vector x) and produces an output by:

1. Applying a dot product with weights w and adding a bias b
2. Applying a (possibly non-linear) transformation called an activation

$\text{output} = \text{activation}(\mathbf{w}^\top \mathbf{x} + b)$
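A minimal sketch of a single artificial neuron under this definition, using a sigmoid as one possible activation (all values here are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """One artificial neuron: activation(w . x + b)."""
    return activation(np.dot(w, x) + b)

# Made-up weights, bias, and input
print(neuron(x=np.array([1.0, 2.0]), w=np.array([0.5, -0.25]), b=0.1))
```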

Page 26:

Artificial neurons

Functions that very loosely mimic a biological neuron

A neuron accepts a collection of inputs (a vector x) and produces an output by:

1. Applying a dot product with weights w and adding a bias b
2. Applying a (possibly non-linear) transformation called an activation

[Figure: the neuron's dot product followed by a threshold activation]

$\text{output} = \text{activation}(\mathbf{w}^\top \mathbf{x} + b)$

Page 27:

Artificial neurons

Functions that very loosely mimic a biological neuron

A neuron accepts a collection of inputs (a vector x) and produces an output by:

1. Applying a dot product with weights w and adding a bias b
2. Applying a (possibly non-linear) transformation called an activation

[Figure: the neuron's dot product followed by a threshold activation]

Other activations are possible.

$\text{output} = \text{activation}(\mathbf{w}^\top \mathbf{x} + b)$

Page 28:

Activation functions

Activation function $\text{activation}(z)$, by name of the neuron:
• Linear unit: $z$
• Threshold/sign unit: $\mathrm{sgn}(z)$
• Sigmoid unit: $\dfrac{1}{1 + \exp(-z)}$
• Rectified linear unit (ReLU): $\max(0, z)$
• Tanh unit: $\tanh(z)$

$\text{output} = \text{activation}(\mathbf{w}^\top \mathbf{x} + b)$

Many more activation functions exist (sinusoid, sinc, Gaussian, polynomial, …).

Also called transfer functions
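A quick NumPy sketch of the activations in the table above:

```python
import numpy as np

activations = {
    "linear":  lambda z: z,
    "sign":    lambda z: np.where(z >= 0, 1.0, -1.0),
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "relu":    lambda z: np.maximum(0.0, z),
    "tanh":    np.tanh,
}

z = np.linspace(-2.0, 2.0, 5)
for name, f in activations.items():
    print(name, f(z))
```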

Page 29:

A neural network

A function that converts inputs to outputs, defined by a directed acyclic graph:

– Nodes, organized in layers, correspond to neurons

– Edges carry the output of one neuron to another and are associated with weights

• To define a neural network, we need to specify:
  – The structure of the graph
    • How many nodes, the connectivity
  – The activation function on each node
  – The edge weights

Page 30:

A neural network

A function that converts inputs to outputs, defined by a directed acyclic graph:

– Nodes, organized in layers, correspond to neurons

– Edges carry the output of one neuron to another and are associated with weights

• To define a neural network, we need to specify:
  – The structure of the graph
    • How many nodes, the connectivity
  – The activation function on each node
  – The edge weights

[Figure: a network with input, hidden, and output layers; edge weights $w_{ij}^{1}$ (input to hidden) and $w_{ij}^{2}$ (hidden to output)]


Page 32:

A neural network

A function that converts inputs to outputs, defined by a directed acyclic graph:

– Nodes, organized in layers, correspond to neurons

– Edges carry the output of one neuron to another and are associated with weights

• To define a neural network, we need to specify:
  – The structure of the graph
    • How many nodes, the connectivity
  – The activation function on each node
  – The edge weights

The graph structure and the activation functions are called the architecture of the network: typically predefined, part of the design of the classifier.

[Figure: a network with input, hidden, and output layers; edge weights $w_{ij}^{1}$ (input to hidden) and $w_{ij}^{2}$ (hidden to output)]

Page 33:

A neural network

A function that converts inputs to outputs, defined by a directed acyclic graph:

– Nodes, organized in layers, correspond to neurons

– Edges carry the output of one neuron to another and are associated with weights

• To define a neural network, we need to specify:
  – The structure of the graph
    • How many nodes, the connectivity
  – The activation function on each node
  – The edge weights

The graph structure and the activation functions are called the architecture of the network: typically predefined, part of the design of the classifier.

The edge weights are learned from data.

[Figure: a network with input, hidden, and output layers; edge weights $w_{ij}^{1}$ (input to hidden) and $w_{ij}^{2}$ (hidden to output)]
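A sketch of how this split might look in code: the architecture (layer sizes and activation) is fixed up front, while the weights are the parameters that training would adjust. The class name and sizes are illustrative, not from the slides:

```python
import numpy as np

class FeedForwardNet:
    """Architecture (sizes, activation) is fixed; weights are learnable."""

    def __init__(self, sizes, activation=np.tanh, seed=0):
        rng = np.random.default_rng(seed)
        self.activation = activation
        # One (W, b) pair per layer; these are what training would adjust.
        self.params = [
            (rng.normal(size=(m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, x):
        for W, b in self.params:
            x = self.activation(W @ x + b)
        return x

net = FeedForwardNet(sizes=[3, 4, 1])  # 3 inputs, 4 hidden, 1 output
print(net.forward(np.array([0.5, -1.0, 2.0])))
```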

Page 34:

A brief history of neural networks

• 1943: McCulloch and Pitts showed how linear threshold units can compute logical functions

• 1949: Hebb suggested a learning rule that has some physiological plausibility

• 1950s: Rosenblatt, the Perceptron algorithm for a single threshold neuron

• 1969: Minsky and Papert studied the neuron from a geometrical perspective

• 1980s: Convolutional neural networks (Fukushima, LeCun), the backpropagation algorithm (various)

• 2003-today: More compute, more data, deeper networks

See also: http://people.idsia.ch/~juergen/deep-learning-overview.html


Page 35:

What functions do neural networks express?


Page 36:

A single neuron with threshold activation

Prediction $= \mathrm{sgn}(b + w_1 x_1 + w_2 x_2)$

[Figure: positive and negative points in the plane, separated by the decision boundary $b + w_1 x_1 + w_2 x_2 = 0$]

Page 37:

Two layers, with threshold activations

In general, convex polygons.

Figure from Shai Shalev-Shwartz and Shai Ben-David, 2014
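To see why: each hidden threshold unit defines a half-plane, and the output unit can AND them together, so the positive region is an intersection of half-planes. A sketch with three made-up half-planes forming a triangle:

```python
import numpy as np

def sgn(z):
    return np.where(z >= 0, 1.0, -1.0)

def in_triangle(x):
    """Two-layer threshold network: the hidden units test three half-planes,
    and the output unit fires only when all three agree (an AND),
    so the positive region is their intersection, a convex polygon."""
    W1 = np.array([[1.0, 0.0],     # x1 >= 0
                   [0.0, 1.0],     # x2 >= 0
                   [-1.0, -1.0]])  # x1 + x2 <= 1
    b1 = np.array([0.0, 0.0, 1.0])
    h = sgn(W1 @ x + b1)
    # AND of three +/-1 values: fires iff their sum exceeds 2.5, i.e. all are +1
    return sgn(np.sum(h) - 2.5)

print(in_triangle(np.array([0.2, 0.2])))  # +1: inside the triangle
print(in_triangle(np.array([1.0, 1.0])))  # -1: outside
```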

Page 38:

Three layers with threshold activations

In general, unions of convex polygons.

Figure from Shai Shalev-Shwartz and Shai Ben-David, 2014
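A third layer can then OR several such polygon indicators together, yielding a union. A sketch of the OR unit over ±1 indicator outputs (the inputs here stand in for the outputs of polygon networks like `in_triangle` above):

```python
import numpy as np

def sgn(z):
    return np.where(z >= 0, 1.0, -1.0)

def union(polygon_outputs):
    """Third-layer OR over +/-1 polygon indicators: fires if any is +1.
    For k inputs, the sum exceeds -(k - 0.5) iff at least one input is +1."""
    h = np.asarray(polygon_outputs)
    return sgn(np.sum(h) + (len(h) - 0.5))

print(union([1.0, -1.0]))   # +1: inside the first polygon
print(union([-1.0, -1.0]))  # -1: inside neither
```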

Page 39:

Neural networks are universal function approximators

• Any continuous function can be approximated to arbitrary accuracy using one hidden layer of sigmoid units [Cybenko 1989]

• Approximation error is insensitive to the choice of activation functions [DasGupta et al 1993]

• Two-layer threshold networks can express any Boolean function (see the XOR sketch after this list)

• VC dimension of a threshold network with edge set $E$: $\mathrm{VC} = O(|E| \log |E|)$

• VC dimension of sigmoid networks with node set $V$ and edge set $E$:
  – Upper bound: $O(|V|^2 |E|^2)$
  – Lower bound: $\Omega(|E|^2)$
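For example, XOR is not expressible by any single threshold unit, but a two-layer threshold network computes it; a sketch over inputs in {−1, +1}:

```python
import numpy as np

def sgn(z):
    return np.where(z >= 0, 1.0, -1.0)

def xor(x):
    """Two-layer threshold network computing XOR on inputs in {-1, +1}.
    Hidden units: (x1 AND NOT x2) and (x2 AND NOT x1); output: OR of the two."""
    W1 = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])
    h = sgn(W1 @ x - 1.0)        # each hidden unit fires only for its own corner
    return sgn(np.sum(h) + 1.5)  # OR: fires if either hidden unit fires

for x1 in (-1.0, 1.0):
    for x2 in (-1.0, 1.0):
        print(x1, x2, xor(np.array([x1, x2])))
```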


Exercise: Show that if we have only linear units, then multiple layers do not change the expressiveness.
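One way to see this (a sketch of the argument): composing two linear layers yields another linear map,

\[
W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\,\mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2) = W'\mathbf{x} + \mathbf{b}',
\]

so any stack of purely linear layers collapses to a single linear layer.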
