
A Gentle Introduction to Machine Learning
First Lecture

Olov Andersson, AIICS
Linköpings Universitet

What is Machine Learning about?

• To imbue the capacity to learn into machines
• Our only frame of reference for learning is from biology

…but brains are hideously complex, the result of ages of evolution

• Like much of AI, Machine Learning mainly takes an engineering approach (1)
• Humanity didn’t first master flight by just imitating birds!

(1) Although there is some occasional biological inspiration.


Why Machine Learning

• It may be impossible to manually program for every situation in advance

• The world may change; if the agent cannot adapt, it will fail

• Many argue that learning is required for AI to scale up

• We are still far from a general learning agent!

…but the algorithms we have so far have shown themselves to be useful in a wide range of applications!


Some Application Aspects

• Machine learning may not be as versatile as human learning, but domain-specific problems can often be processed much faster than by a human

• Computers work 24/7 and you can often scale performance by piling on more of them

Data Mining
• Companies can collect ever more data and processing power is cheap
• Put it to use automatically analyzing the performance of products!
• Machine Learning is almost ubiquitous on the web: mail filters, search engines, product recommendations, customized content, ad serving…
• “Big Data” – a much-hyped technology trend

Robotics, Computer Vision
• Many capabilities that humans take for granted, like locomotion, grasping and recognizing objects, have turned out to be ridiculously difficult to manually construct rules for.


Demo – Stanford Helicopter Acrobatics

…in narrow applications machine learning can even rival human performance


To Define Machine Learning

A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E

-Tom Mitchell

From the agent perspective:

[Diagram: the Agent receives Input (Sensors) from the Task (Environment), produces Output (Actuators), and is evaluated by a Performance Metric]


The Three Main Types of Machine Learning


Machine learning is a young science that is still changing, but its algorithms are traditionally divided into three types depending on their purpose.

• Supervised Learning

• Reinforcement Learning

• Unsupervised Learning

Supervised Learning at a glance


In Supervised Learning:

• The correct output is given to the algorithm during a training phase

• Experience E is thus tuples of training data consisting of (Input,Output) pairs

• Performance metric P is some function of how well the predicted output matches the given correct output

Mathematically, this can be seen as trying to approximate an unknown function f(x) = y given examples of (x, y)


Supervised Learning at a glance II

Representation from agent perspective:

[Diagram: a Reactive Agent maps Input (Sensors) from the Task (Environment) to Output (Actuators) and is judged by a Performance Metric; it computes f(input) = output, or f(state) = action]

Supervised Learning is surprisingly powerful and ubiquitous

Some real-world examples:

• Spam Filter: f(mail) = spam?

• Microsoft Kinect: f(pixels, distance) = body part

…but it can also be used as a component in other architectures


Body Part Classification on the Microsoft Kinect


[Figure: depth images with classified body parts labeled, e.g. right elbow, right hand, left shoulder, neck (Shotton et al., MSR, CVPR 2011)]


Reinforcement Learning at a glance


In Reinforcement Learning:

• A reward is given at each step instead of the correct input/output pairs

• Experience E consists of the history of inputs, the chosen outputs and the rewards

• Performance metric P is some measure of how much reward the agent can accumulate

Inspired by early work in psychology and how pets are trained

The agent can learn on its own as long as the reward signal can be concisely defined.


Reinforcement Learning at a glance II

RL fits neatly into a utility (reward) maximizing agent framework
• Rewards of actions in different states are learned
• Agent plans ahead to maximize reward over time


[Diagram: the RL Agent receives the state via Input (Sensors) from the Task (Environment) along with a reward as the Performance Metric, and chooses an action via Output (Actuators); f(state, action) = reward, and the agent tries to maximize future reward]

Real world examples – Robot Control, Game Playing (Checkers…)
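The learning algorithms themselves are not covered here, but as a hedged illustration of learning from rewards alone, below is a minimal tabular Q-learning sketch in Python on an invented two-state toy problem (the states, actions, rewards and the Q-learning update rule are not from this lecture):

import random

# Toy problem (hypothetical): 2 states, 2 actions.
# Action 1 in state 0 leads to state 1; action 1 in state 1 gives reward 1.
def step(state, action):
    if state == 0:
        return (1, 0.0) if action == 1 else (0, 0.0)
    else:
        return (0, 1.0) if action == 1 else (1, 0.0)

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
Q = [[0.0, 0.0], [0.0, 0.0]]            # Q[state][action]: learned value of acting

state = 0
for _ in range(5000):
    # epsilon-greedy: mostly pick the best-known action, sometimes explore
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Move Q towards the reward plus the discounted best future value
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print(Q)   # the learned values favor action 1 in both states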


Unsupervised Learning at a glance


In Unsupervised Learning:

• The Task is to find a more concise representation of the data

• Neither the correct answer, nor a reward is given

• Experience E is just the given data

• P depends on the task

Examples:

Clustering – When the data distribution is confined to lie in a small number of “clusters” we can find these and use them instead of the original representation (a small clustering sketch follows these examples)

Dimensionality Reduction – Finding a suitable lower dimensional representation while preserving as much information as possible
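As an illustrative sketch (not from the lecture), clustering two-dimensional continuous data with k-means via scikit-learn could look like this; the data and cluster centers below are invented:

import numpy as np
from sklearn.cluster import KMeans

# Generate 2-D points around three made-up cluster centers
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Fit k-means with k=3 and use the cluster index as a concise representation
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # recovered cluster centers
print(kmeans.labels_[:10])       # a single cluster label replaces each 2-D point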


Clustering – Continuous Data

[Figure: clustering of two-dimensional continuous input (Bishop, 2006)]


Outline of Machine Learning Lectures

First we will talk about Supervised Learning
• Definition
• Main Concepts
• Some Approaches & Applications
• Pitfalls & Limitations
• In-depth: Decision Trees (a supervised learning approach)

Then finish with a short introduction to Reinforcement Learning

The idea is that you will be informed enough to find and try a learning algorithm if the need arises.


Supervised Learning in more detail…

Remember, in Supervised Learning:
• Tuples of training data consisting of (x, y) pairs are given to the algorithm
• The objective is to learn to predict the output yi for an input xi

Can be seen as searching for an approximation to the unknown function y = f(x) given N examples of x and y: (x1, y1), …, (xN, yN)
• A candidate approximation is sometimes called a hypothesis

Two major classes of supervised learning
• Classification – Output is a discrete category label
  Example: Detecting cancer, y = “healthy” or “ill”
• Regression – Output is a numeric value
  Example: Predicting temperature, y = 15.3 degrees

In either case input data xi can be vector valued and discrete, continuous or mixed. Example: x1 = (12.5, “cloud free”, 1.35).


Supervised Learning in Practice

Can be seen as searching for an approximation to the unknown function y = f(x) given N examples of x and y: (x1, y1), …, (xN, yN)

The goal is to have the algorithm learn from training examples to successfully generalize to new examples

• First construct an input vector xi for each example by encoding relevant problem data. This is often called the feature vector.
• Examples of such (xi, yi) pairs form the training set.
• A model is selected and trained on the examples by searching for parameters (the hypothesis space) that yield a good approximation to the unknown true function.
• Evaluate performance, (carefully) tweak algorithm or features. (A minimal end-to-end sketch of these steps follows below.)
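A minimal sketch of this workflow in Python, assuming scikit-learn is available; the dataset and the labeling rule are synthetic inventions for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Construct feature vectors xi and labels yi (here: random synthetic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 examples, 3 features each
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # made-up labeling rule

# 2. Split the examples into a training set and a held-out evaluation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3. Select a model and train it (search for good parameters)
model = LogisticRegression().fit(X_train, y_train)

# 4. Evaluate performance, then (carefully) tweak algorithm or features
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))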



Feature Vector Construction

Want to learn y = f(x) given N examples of x and y: (x1, y1), …, (xN, yN)

• Standard algorithms tend to work on variables defined as numbers
• If the inputs x or outputs y contain categorical values like “book” or “car” we need to encode them, typically as integers (a small encoding sketch follows this list)
  With only two classes we get y in {0, 1}, called binary classification
  Classification into multiple classes can be reduced to a sequence of binary one-vs-all classifiers

• The variables may also be structured like in text, graphs, audio, image or video data

• Finding a suitable feature representation can be non-trivial, but there are standard approaches for the common domains
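A small sketch of such encodings in plain Python; the category names and feature values below are invented:

# Map categorical values to integers (the categories here are invented)
categories = ["book", "car", "phone"]
cat_to_int = {c: i for i, c in enumerate(categories)}

x_raw = ("car", 12.5, "cloud free")          # mixed discrete/continuous input
vocab = {"cloud free": 0, "overcast": 1}     # another hypothetical categorical feature
x = [cat_to_int[x_raw[0]], x_raw[1], vocab[x_raw[2]]]
print(x)   # [1, 12.5, 0] - a purely numeric feature vector

# One-vs-all: a 3-class problem becomes 3 binary problems, one per class
y = ["book", "car", "car", "phone"]
for c in categories:
    y_binary = [1 if label == c else 0 for label in y]
    print(c, y_binary)   # train one binary classifier on each of these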


Example of Feature Vector Construction

One of the early successes was learning spam filters

Spam classification example:

Each mail is an input; some mails are flagged as spam or not spam to create training examples.

Feature vector:

Encode the existence of a fixed set of relevant key words in each mail as the feature vector.

xi = wordsi:

  Feature      Exists?
  “Customer”   1 (Yes)
  “Dollar”     0 (No)
  “Nigeria”    0
  “Accept”     1
  “Bank”       0
  …            …


yi = 1 (spam) or 0 (not spam)

Simply learn f(x) = y using a suitable classifier!
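A hedged sketch of this spam example in Python; the keyword list and mails are invented, and the Bernoulli naive Bayes classifier merely stands in for “a suitable classifier”:

from sklearn.naive_bayes import BernoulliNB

keywords = ["customer", "dollar", "nigeria", "accept", "bank"]

def mail_to_features(mail_text):
    """Encode a mail as 0/1 indicators for each keyword."""
    words = mail_text.lower().split()
    return [1 if kw in words else 0 for kw in keywords]

# Tiny invented training set: (mail text, 1 = spam / 0 = not spam)
mails = [("dear customer please accept this bank transfer", 1),
         ("meeting moved to tuesday", 0),
         ("claim your dollar prize from nigeria", 1),
         ("lecture notes attached", 0)]

X = [mail_to_features(text) for text, _ in mails]
y = [label for _, label in mails]

clf = BernoulliNB().fit(X, y)   # any suitable classifier works here
print(clf.predict([mail_to_features("bank customer accept offer")]))   # likely [1] (spam)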

Simple Linear Classification Example

I. Construct a feature vector xi to be used with examples of yi

II. Select algorithm and train on training data by searching for a good approximation to the unknown function

Fictional example: A learning smartphone app that determines if silent mode should be on or off based on background noise, light level and observed user behavior.
Feature vector xi = (Light level, Noise), yi = {“silent on”, “silent off”}

• Select the family of linear discriminant functions
• Train the algorithm by searching for a line that separates the classes well
• New cases will be classified according to which side of the line they fall on (see the sketch below)
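A minimal sketch of this fictional example in Python; the data points are invented and scikit-learn's logistic regression stands in for a generic linear discriminant:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: columns are (light level, noise level)
X = np.array([[0.9, 0.2], [0.8, 0.3], [0.7, 0.1],    # bright & quiet -> silent off
              [0.1, 0.8], [0.2, 0.9], [0.1, 0.7]])   # dark & noisy  -> silent on
y = np.array([0, 0, 0, 1, 1, 1])                     # 1 = "silent on", 0 = "silent off"

# Train a linear discriminant: it searches for a separating line in feature space
clf = LogisticRegression().fit(X, y)

# New cases are classified by which side of the line they fall on
print(clf.predict([[0.85, 0.15], [0.15, 0.85]]))     # expected: [0 1]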


Simple Linear Regression Example

I. Construct a feature vector xi to be used with examples of yi

II. Select algorithm and train on training set by searching for a good approximation to the unknown function

Fictional example: Same smartphone app but now we want to predict the ring volume based on noise level (only)

Feature vector xi = (Noise), yi = (Volume)

• Select the family of linear functions
• Train the algorithm by searching for a line that fits the data well (see the sketch below)
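A small sketch in Python with invented noise/volume examples, using an ordinary least-squares fit (numpy.polyfit stands in for the training step explained on the next slides):

import numpy as np

# Invented examples: xi = noise level, yi = ring volume
noise  = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
volume = np.array([1.0, 2.2, 2.9, 4.1, 4.8])

# Fit the family of linear functions h(x) = w1*x + w0 by least squares
w1, w0 = np.polyfit(noise, volume, deg=1)
print(w1, w0)          # slope and intercept of the fitted line
print(w1 * 0.6 + w0)   # predicted volume for a new noise level of 0.6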

…but how does “training” really work?


Training a learning algorithm...

• Want to find an approximation h(x) to the unknown function f(x)
• As an example we select the hypothesis space to be the family of polynomials of degree one, that is linear functions: h(x) = w1·x + w0
• The hypothesis space has two parameters, w0 and w1
• How do we find parameters that result in a good hypothesis h?


[Figure: three (poor) linear hypotheses plotted against the data; feature vector xi = (Noise), outputs yi = (Volume)]


Training a Learning Algorithm – Loss Functions

How do we find parameters w that result in a good hypothesis?

• First we need to define “good”!
• Define it mathematically as maximizing some function of how well the hypothesis fits the data. Often one instead minimizes how poorly it fits (a loss function)
• Searching in continuous domains like w is known as optimization (see Ch. 4.2 in the course book, AIMA)
• For regression one common choice is a squared loss function: L(w) = Σi (yi − h(xi))²


Training a Learning Algorithm – Loss Functions II

• What about classification? Squared error does not make sense when the target output is in {0, 1}
• Custom loss functions for classification:
  Number of misclassifications (unsmooth w.r.t. parameter changes)
  (Inverse) information gain (used in decision trees, next lecture)
• These require specialized parameter search methods
• Alternative: Squash a numeric output to [0, 1], then use squared error!


Sigmoid functions allow us to use any regression method for classification. A common choice is the logistic function: g(z) = 1 / (1 + e^(−z))
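A tiny sketch of this squashing trick in Python; the weights of the underlying linear model are invented:

import math

def logistic(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Invented linear model h(x) = w1*x + w0 for the silent-mode example
w0, w1 = -2.0, 4.0
noise_level = 0.8

p_silent_on = logistic(w1 * noise_level + w0)   # probability-like output in (0, 1)
prediction = 1 if p_silent_on > 0.5 else 0      # threshold to get a {0, 1} class
print(p_silent_on, prediction)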


Training a Learning Algorithm – Optimization

How do we find parameters w that minimize the loss?

• Optimization approaches typically move in the direction that locally improves the objective (i.e. decreasing the loss function)
• One simple and popular example is gradient descent:

  Initialize w to some random point in the parameter space
  loop until improvement in loss is small
    for each wj in w do
      wj ← wj − α · ∂L(w)/∂wj

  where α is a step-size parameter (the learning rate)
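A runnable sketch of the pseudocode above for the linear regression case with a squared loss; the data and the step size α are invented for illustration:

import numpy as np

# Invented data: xi = noise level, yi = ring volume
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.array([1.0, 2.2, 2.9, 4.1, 4.8])

def loss(w0, w1):
    return np.sum((y - (w1 * x + w0)) ** 2)   # squared loss L(w)

# Initialize w to some random point in the parameter space
rng = np.random.default_rng(0)
w0, w1 = rng.normal(size=2)
alpha = 0.05                                   # step size (learning rate)

prev = loss(w0, w1)
for _ in range(10000):                         # loop until improvement in loss is small
    residual = y - (w1 * x + w0)
    w0 -= alpha * (-2.0 * np.sum(residual))        # w0 := w0 - alpha * dL/dw0
    w1 -= alpha * (-2.0 * np.sum(residual * x))    # w1 := w1 - alpha * dL/dw1
    current = loss(w0, w1)
    if prev - current < 1e-9:                  # improvement is small -> stop
        break
    prev = current

print(w0, w1, loss(w0, w1))   # parameters of the fitted line and the final loss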

Training a Learning Algorithm – Limitations

Limitations
• Locally greedy optimization approaches only work if the loss function is convex w.r.t. w, i.e. there is only one minimum
• Linear regression models are always convex, however more advanced models that we will look at later are vulnerable to getting stuck in local minima
• Care should be taken when training such models, for example by utilizing random restarts


[Figure: a non-convex loss function; start positions in the red area will get stuck in a local minimum!]


Linear Models in Summary

Advantages
• Linear algorithms are simple and computationally efficient
• Training them is a convex problem, so one is guaranteed to find the best hypothesis in the hypothesis space

Disadvantages
• The hypothesis space is very restricted; it cannot handle non-linear relations well

They are widely used in applications
• Recommender Systems – the initial Netflix Cinematch was a linear regression, before their $1 million competition to improve it
• At the core of the recent Google Gmail Priority feature is a linear classifier
• And many more…


Beyond Linear Models – Artificial Neural Networks

• One non-linear parametric model that has captivated people for decades is Artificial Neural Networks (ANNs)

• These draw upon inspiration from the physical structure of the brain as an interconnected network of neurons, represented by non-linear “activation functions” of the inputs


[Figure: the neuron (left) and the network (right)]


Artificial Neural Networks – The Neuron

• In univariate (one input) linear regression we used the model h(x) = w1·x + w0

• Each neuron in an ANN is a linear model of all the inputs passed through a non-linear activation function g

• The activation function is typically the sigmoid


Artificial Neural Networks – The Neuron II

• However, there is not just one neuron, but a network of neurons!
• Each neuron may operate on inputs from other neurons and in turn pass its output on to other neurons
• We rewrite our neuron definition using ai for the inputs, aj for the output and wi,j for the weight parameters: aj = g(Σi wi,j · ai) (a tiny code sketch follows)
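A tiny sketch of this neuron definition in Python; the inputs and weights are invented:

import math

def neuron(inputs, weights):
    """aj = g(sum_i w_ij * a_i) with a logistic activation g."""
    z = sum(w * a for w, a in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

a_in = [0.5, -1.0, 2.0]   # invented inputs a_i (e.g. outputs of earlier neurons)
w    = [0.8, 0.2, -0.4]   # invented weights w_ij into neuron j
print(neuron(a_in, w))    # output a_j of the neuron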


Artificial Neural Networks – The Network

• Networks are composed of layers
• All neurons in a layer are typically connected to all neurons in the next layer, but not to each other
• A two-layer network (one hidden layer) with a sufficient number of neurons is good enough for most problems
• Expanding the output of a second-layer neuron (5) we get a nested expression: a5 = g(Σj wj,5 · aj), where each aj is itself the output g(Σi wi,j · ai) of a first-layer neuron


Artificial Neural Networks – Training

• How do we train an ANN to find the best parameters wi,j?
• By optimization, minimizing a loss function like before!
• We can compute a loss function (errors) for outputs of the last layer against the training examples
• A classical approach is backpropagation, which propagates the errors backwards to earlier layers and applies gradient descent (a small sketch follows below)

[Diagram: training loop – predict output on the training set, compute errors w.r.t. a loss function, then propagate errors backwards and adjust weights in all layers]
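A compact, hedged sketch of training a two-layer network with backpropagation and gradient descent on a squared loss, using numpy; the XOR toy data, layer sizes and learning rate are invented, and since training is non-convex a given random start may occasionally end up in a local minimum:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented toy problem: learn XOR of two binary inputs (non-linear, so a
# linear model cannot represent it, but a two-layer network can)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 sigmoid neurons and one sigmoid output neuron
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights
alpha = 0.5                                     # learning rate

for _ in range(20000):
    # Forward pass: predict output on the training set
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Errors w.r.t. a squared loss L = sum((output - y)^2)
    error = output - y

    # Backward pass: propagate errors to earlier layers, then gradient descent
    delta_out = error * output * (1 - output)
    delta_hid = (delta_out @ W2.T) * hidden * (1 - hidden)
    W2 -= alpha * hidden.T @ delta_out
    b2 -= alpha * delta_out.sum(axis=0)
    W1 -= alpha * X.T @ delta_hid
    b1 -= alpha * delta_hid.sum(axis=0)

print(output.round(2))   # ideally close to [[0], [1], [1], [0]], but not guaranteed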


Deep Learning

• ANNs (among others) can also be layered to capture abstractions

• Some tasks are uniquely suited to this, like vision, text and speech recognition
• Already used by Google, MSFT etc.
• Training these requires a lot of care, often in combination with unsupervised learning methods


[Figure: learned feature hierarchy with increasing abstraction – edges, facial parts, faces (Honglak Lee, 2009)]

Artificial Neural Networks – Summary

Advantages
• Very large hypothesis space; under some conditions it is a universal approximator to any function f(x)
• Some biological justification
• Has been moderately popular in applications ever since the 90s
• Can be layered to capture some abstraction (deep learning)

Disadvantages
• Training is a non-convex problem with many local minima
• Has many tuning parameters to twiddle with (number of neurons, layers, starting weights…)
• Very difficult to interpret or debug weights in the network
• It is quite simplified compared to biological neural networks


Overfitting

• Models can overfit if they have too many parameters in relation to the training set size

• Example: 9th degree polynomial regression model on 15 data points:

• This is not a local minimum during training, it is the best fit possible on the given training examples!

• The trained model captured noise, small independent variations that will change each time we get a new example


[Figure: Green – true function (unknown); Blue – training examples (noisy!); Red – trained model (Bishop, 2006)]
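A hedged sketch reproducing this kind of overfit in Python; the true function, noise level and number of examples are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Invented ground truth: a sine curve observed with noise at 15 points
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Fit a 1st-degree and a 9th-degree polynomial to the same noisy examples
w_simple  = np.polyfit(x, y, deg=1)
w_complex = np.polyfit(x, y, deg=9)

# Training error shrinks as model complexity grows...
print(np.mean((np.polyval(w_simple,  x) - y) ** 2))
print(np.mean((np.polyval(w_complex, x) - y) ** 2))

# ...but on fresh examples from the same true function the flexible model
# has captured noise and often generalizes worse
x_new = np.linspace(0.03, 0.97, 15)
y_new = np.sin(2 * np.pi * x_new) + rng.normal(scale=0.2, size=x_new.shape)
print(np.mean((np.polyval(w_simple,  x_new) - y_new) ** 2))
print(np.mean((np.polyval(w_complex, x_new) - y_new) ** 2))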

Overfitting – Where Does the Noise Come From?

• Noise is small variations in the data due to unknown variables that cannot be represented by the given inputs
• Example: Predict the temperature based on season, time-of-day and cloud cover. What about atmospheric changes like a cold front? As they are not included in the model, nor completely tied to the other variables, this variation will show up as random noise in the model!

• With few data points to go on, the model may mistake the effect of such variables as coming from variables that are included.
• Since this relationship is merely chance, the model will not generalize well to future situations


Model Selection – Choosing Between Models

• In conclusion, we want to avoid unnecessarily complex models
• This is a fairly general concept throughout science and is often referred to as Ockham’s Razor:

“Pluralitas non est ponenda sine necessitate” (“Plurality should not be posited without necessity”)
– William of Ockham

“Everything should be kept as simple as possible, but no simpler.”
– Albert Einstein (paraphrased)

• There are several mathematically principled ways to penalize model complexity during training

• However, the most straightforward way is to keep a separate validation set of known examples that is only used for evaluating the performance of different models and not for training them.


Model Selection – Hold-out Validation

• This is called a hold-out validation set as we keep the data away from the training phase

• Measuring performance on such a validation set is a better metric of actual generalization error to unseen examples

• With the validation set we can compare models of different complexity to select the one which generalizes best.

• Examples could be polynomial models of different order, the number of neurons or layers in an ANN etc.

Given example data: [split into a Training Set and a Validation Set]
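A small sketch of hold-out validation in Python for choosing the polynomial degree; the data generation and candidate degrees are invented:

import numpy as np

rng = np.random.default_rng(1)

# Invented noisy data from an unknown true function
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Hold out part of the known examples as a validation set (never used for training)
idx = rng.permutation(len(x))
train, valid = idx[:30], idx[30:]

# Compare models of different complexity on the validation set
for degree in [1, 3, 9]:
    w = np.polyfit(x[train], y[train], deg=degree)
    val_error = np.mean((np.polyval(w, x[valid]) - y[valid]) ** 2)
    print(degree, val_error)   # pick the degree with the lowest validation error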


Model Selection – Selection Strategy

• As the number of parameters increases the size of the hypothesis space increases accordingly allowing a better fit to training data

• However, at some point it is sufficiently flexible to capture the underlying patterns, and any more will just capture noise, leading to worse generalization to new examples!

[Figure: training and validation error vs. model complexity – Blue: Training Set Error, Red: Validation Set Error; the best choice is where validation error is lowest, beyond which the model overfits (Hastie et al., 2009)]

Thank you for listening!
