a gentle introduction to machine learning
TRANSCRIPT
![Page 1: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/1.jpg)
10/7/2014
1
A Gentle Introduction to
Machine LearningFirst Lecture
Olov Andersson, AIICS
Linköpings Universitet
What is Machine Learning about?
• To imbue the capacity to learn into machines• Our only frame of reference for learning is from biology
…but brains are hideously complex, the result of ages of evolution
• Like much of AI, Machine Learning mainly takes an engineering approach1
• Humanity didn’t first master flight by just imitating birds!
Although there is some occasional biological inspiration
1.
2014-10-07 2
![Page 2: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/2.jpg)
10/7/2014
2
Why Machine Learning
• It may be impossible to manually program for every situation in
advance
• The world may change, if the agent cannot adapt it will fail
• Many argue that learning is required for AI to scale up
• We are still far from a general learning agent!
but the algorithms we have so far have shown themselves to be useful
in a wide range of applications!
2014-10-07 3
Some Application Aspects
• May not be as versatile as human learning, but domain specific problems can often be processed much faster than by a human
• Computers work 24/7 and you can often scale performance by piling on more of them
Data Mining Companies can collect ever more data and processing power is cheap Put it to use automatically analyzing the performance of products! Machine Learning is almost ubiquitous on the web: Mail filters, search
engines, product recommendations , customized content, ad serving…
“Big Data” – much hyped technology trend.
Robotics, Computer Vision Many capabilities that humans take for granted like locomotion,
grasping and recognizing objects have turned out to be ridiculously difficult to manually construct rules for.
2014-10-07 4
![Page 3: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/3.jpg)
10/7/2014
3
Demo – Stanford Helicopter Acrobatics
…in narrow applications machine learning can even rival human performance
2014-10-07 5
To Define Machine Learning
A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E
-Tom Mitchell
From the agent perspective:
Task
(En
viro
nm
en
t)
Input (Sensors)
Output (Actuators)
Performance Metric
Agent
2014-10-07 6
![Page 4: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/4.jpg)
10/7/2014
4
The Three Main Types of Machine Learning
2014-10-07 7
Machine learning is a young science that is still changing, but traditionally algorithms are usually divided into three types depending on their purpose.
• Supervised Learning
• Reinforcement Learning
• Unsupervised Learning
Supervised Learning at a glance
A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E
-Tom Mitchell
In Supervised Learning:
• The correct output is given to the algorithm during a trainingphase
• Experience E is thus tuples of training data consisting of (Input,Output) pairs
• Performance metric P is some function of how well the predicted output matches the given correct output
Mathematically, can be seen as trying to approximate an unknown function f(x) = y given examples of (x, y)
2014-10-07 8
![Page 5: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/5.jpg)
10/7/2014
5
Supervised Learning at a glance II
Representation from agent perspective:
Task
(En
viro
nm
en
t)
Input (Sensors)
Output (Actuators)
Performance Metric
Reactive
Agent
Supervised Learning is surprisingly powerful and ubiquitous
Some real world examples
• Spam Filter: f(mail) = spam?
• Microsoft Kinect: f(pixels, distance) = body part
state
action
f(input)=output
or
f(state) = action
…but it can also be used as a component in other architectures
2014-10-07 9
Body Part Classification on the Microsoft Kinect
2014-10-07 10
rightelbow
right hand leftshoulderneck
Shotton et al @ MSR, CVPR 2011
![Page 6: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/6.jpg)
10/7/2014
6
Reinforcement Learning at a glance
A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E
-Tom Mitchell
In Reinforcement Learning:
• A reward is given at each step instead of the correct input/output
• Experience E consists of the history of inputs, the chosen outputs and the rewards
• Performance metric P is some sum of how much reward the agent can accumulate
Inspired by early work in psychology and how pets are trained
The agent can learn on its own as long as the reward signal can be concisely defined.
2014-10-07 11
Reinforcement Learning at a glance II
RL fits neatly into a utility (reward) maximizing agent framework• Rewards of actions in different states are learned• Agent plans ahead to maximize reward over time
2014-10-07 12
Task
(En
viro
nm
en
t)
Input (Sensors)
Output (Actuators)
Performance Metric
(reward)
RL Agent
Real world examples – Robot Control, Game Playing (Checkers…)
state
action
f(state, action) = reward
Maximize future reward
![Page 7: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/7.jpg)
10/7/2014
7
Unsupervised Learning at a glance
A machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E
-Tom Mitchell
In Unsupervised Learning:
• The Task is to find a more concise representation of the data
• Neither the correct answer, nor a reward is given
• Experience E is just the given data
• P depends on the task
Examples:
Clustering – When the data distribution is confined to lie in a small number of “clusters “ we can find these and use them instead of the original representation
Dimensionality Reduction – Finding a suitable lower dimensional representation while preserving as much information as possible
2014-10-07 13
Clustering – Continuous Data
(Bishop, 2006)Two-dimensional continuous input
2014-10-07 15
![Page 8: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/8.jpg)
10/7/2014
8
Outline of Machine Learning Lectures
First we will talk about Supervised Learning• Definition• Main Concepts• Some Approaches & Applications• Pitfalls & Limitations• In-depth: Decision Trees (a supervised learning approach)
Then finish with a short introduction to Reinforcement Learning
The idea is that you will be informed enough to find and try alearning algorithm if the need arises.
2014-10-07 17
Supervised Learning in more detail…
Remember, in Supervised Learning:• Tuples of training data consisting of (x,y) pairs are given to the
algorithm• The objective is to learn to predict the output yi for an input xi
Can be seen as searching for an approximation to the unknownfunction y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn)• A candidate approximation is sometimes called a hypothesis
Two major classes of supervised learning• Classification – Output is a discrete category label
Example: Detecting cancer, y = “healthy” or “ill”
• Regression – Output is a numeric value Example: Predicting temperature, y = 15.3 degrees
In either case input data xi can be vector valued and discrete, continuous or mixed. Example: x1 = (12.5, “cloud free”, 1.35).
2014-10-07 18
![Page 9: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/9.jpg)
10/7/2014
9
Supervised Learning in Practice
Can be seen as searching for an approximation to the unknown function y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn)
The goal is to have the algorithm learn from training examples to successfully generalize to new examples
• First construct an input vector xi of examples by encoding relevant problem data. This is often called the feature vector. • Examples of such xi, yi is the training set.
• A model is selected and trained on the examples by searchingfor parameters (the hypothesis space) that yield a good approximation to the unknown true function.
• Evaluate performance, (carefully) tweak algorithm or features.
,
,
,
2014-10-07 19
Feature Vector Construction
Want to learn y = f(x) given N examples of x and y: (x1,y1), … ,(xn,yn)
• Standard algorithms tend to work on variables defined as numbers • If the inputs x or outputs y contain categorical values like “book” or
“car” we need to encode them, typically as integers. With only two classes we get y in {0,1}, called binary classification Classification into multiple classes can be reduced to a sequence of
binary one-vs-all classifiers
• The variables may also be structured like in text, graphs, audio, image or video data
• Finding a suitable feature representation can be non-trivial, but there are standard approaches for the common domains
2014-10-07 20
![Page 10: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/10.jpg)
10/7/2014
10
Example of Feature Vector Construction
One of the early successes was learning spam filters
Spam classification example:
Each mail is an input, some mails are flagged as
spam or not spam to create training examples.
Feature vector:
Encode the existence of a fixed set of relevant
key words in each mail as the feature vector.
Feature Exists?
“Customer” 1 (Yes)
“Dollar” 0 (No)
“Nigeria” 0
“Accept” 1
“Bank” 0
…. …
xi = wordsi =
2014-10-07 21
yi = 1 (spam) or 0 (not spam)
Simply learn f(x)=y using suitable classifier!
Simple Linear Classification Example
I. Construct a feature vector xi to be used with examples of yi
II. Select algorithm and train on training data by searching for a good approximation to the unknown function
Fictional example: A learning smartphone app that determines if silent mode should be on or off based on background noise, light level and observed user behavior.Feature vector xi = (Light level, Noise), yi = {“silent on”, “silent off”}
• Select the familiy of linear
discriminant functions
• Train the algorithm by searching for
a line that separates the classes
well
• New cases will be classified
according to which side they fall
2014-10-07 22
![Page 11: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/11.jpg)
10/7/2014
11
2014-10-07 23
Simple Linear Regression Example
I. Construct a feature vector xi to be used with examples of yi
II. Select algorithm and train on training set by searching for a good approximation to the unknown function
Fictional example: Same smartphone app but now we want to predict the ring volume based on noise level (only)
Feature vector xi = (Noise), yi = (Volume)
• Select the familiy of linear functions
• Train the algorithm by searching for
a line that fits the data well
…but how does ”training” really work?
2014-10-07 24
![Page 12: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/12.jpg)
10/7/2014
12
2014-10-07 25
Training a learning algorithm...
• Want to find approximation h(x) to the unknown function f(x)• As an example we select the hypothesis space to be the family of
polynomials of degree one, that is linear functions:
• The hypothesis space has two parameters• How do we find parameters that result in a good hypothesis h?
2014-10-07 26
Three (poor) linear hypotheses
Feature vector xi = (Noise), outputs yi = (Volume)
![Page 13: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/13.jpg)
10/7/2014
13
Training a Learning Algorithm – Loss Functions
How do we find parameters w that result in a good hypothesis ?
• First we need to define ”good”!
• Define it mathematically as maximizing some function of how well the hypothesis fits the data Often one instead minimize how well it doesn’t fit
• Search in continuous domains like w is known as optimization (see Ch4.2 in course book AIMA)
• For regression one common choice is a square loss function:
2014-10-07 27
Training a Learning Algorithm – Loss Functions II
• What about classification? Squared error does not make sense when target output in {0,1}
• Custom loss functions for classification Number of missclassifications (unsmooth w.r.t. parameter changes) (inverse) information gain (used in decision trees, next lecture)
• These require specialized parameter search methods• Alternative: Squash a numeric output to [0,1], then use square
error!
2014-10-07 28
Sigmoid functions allow us to use any
regression method for classification.
Common choice is the logistic function:
![Page 14: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/14.jpg)
10/7/2014
14
Initialize w to some random point in the parameter space
loop until improvement in loss is small
for each in w do
Where:
• Optimization approaches typically move in the direction thatlocally improves the objective (ie. decreasing the loss function)
• One simple and popular example is gradient descent:
Training a Learning Algorithm – Optimization
2014-10-07 29
How do we find parameters w that minimize the loss?
Training a Learning Algorithm – Limitations
Limitations• Locally greedy optimization approaches only work if the loss
function is convex w.r.t. w, i.e. there is only one minima• Linear regression models are always convex, however more
advanced models that we will look at later are vulnerable togetting stuck in local minina
• Care should be taken when training such models by utilizing for example random restarts
2014-10-07 30
Start positions in red area will
get stuck in a local minima!
![Page 15: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/15.jpg)
10/7/2014
15
Linear Models in Summary
Advantages• Linear algorithms are simple and computationally efficient• Training them is a convex problem, so one is guaranteed to find
the best hypothesis in the hypothesis spaceDisadvantages• The hypothesis space is very restricted, it cannot handle non-
linear relations well
They are widely used in applications• Recommender Systems – Initial Netflix Cinematch was a linear
regression, before their $1 million competition to improve it• At the core of the recent Google Gmail Priority feature is a linear
classifier• And many more…
2014-10-07 31
Beyond Linear Models – Artificial Neural Networks
• One non-linear parametric model that has captivated people for decades is Artificial Neural Networks (ANNs)
• These draw upon inspiration from the physical structure of the brain as an interconnected network of neurons, represented by non-linear ”activation functions” of the inputs
2014-10-07 32
The Neuron The Network
![Page 16: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/16.jpg)
10/7/2014
16
Artificial Neural Networks – The Neuron
• In univariate (one input) linear regression we used the model:
• Each neuron in an ANN is a linear model of all the inputs passed through a non-linear activation function g
• The activation function is typically the sigmoid
2014-10-07 33
Artificial Neural Networks – The Neuron II
• However, there is not just one neuron, but a network of neurons!• Each neuron may operate on inputs from other neurons and in
turn pass their outputs on to other neurons• We rewrite our neuron definition using ai for the input, aj for the
output and wi,j for the weight parameters:
2014-10-07 34
![Page 17: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/17.jpg)
10/7/2014
17
Artificial Neural Networks – The Network
• Networks are composed of layers• All neurons in a layer are typically connected to all neurons in the
next layer, but not to each other• A two layer network (one hidden layer) with a sufficient number
of neurons is good enough for most problems• Expanding the output of a second layer neuron (5) we get
2014-10-07 35
Artificial Neural Networks – Training
• How do we train an ANN to find the best parameters wi,j?• By optimization, minimizing a loss function like before!• We can compute a loss function (errors) for outputs of the last
layer against the training examples• A classical approach is backpropagation which propagates the
errors backwards to earlier layers and applies gradient descent
Predict output on training set
Propagate errors backwards
and adjust weights in all layers
Compute errors
w.r.t. a loss function
2014-10-07 36
![Page 18: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/18.jpg)
10/7/2014
18
Deep Learning
• ANNs (among others) can also be layered to capture abstractions
• Some tasks are uniquely suited to this like vision, text and speech recognition
• Already used by Google, MSFT etc.
• Training these require a lot of care, often in combination with unsupervised learning methods
2014-10-07 37
Faces
Facial parts
Edges
Abstraction
(Honglak Lee, 2009)
Artificial Neural Networks – Summary
Advantages• Very large hypothesis space, under some conditions it is a
universal approximator to any function f(x)• Some biological justification• Has been moderately popular in applications ever since the 90ies• Can be layered to capture some abstraction (deep learning)
Disadvantages• Training is a non-convex problem with many local minima• Has many tuning parameters to twiddle with (number of neurons,
layers, starting weights…)• Very difficult to interpret or debug weights in the network• It is quite simplified compared to biological neurual networks
2014-10-07 38
![Page 19: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/19.jpg)
10/7/2014
19
Overfitting
• Models can overfit if they have have too many parameters in relation to the training set size
• Example: 9th degree polynomial regression model on 15 data points:
• This is not a local minima during training, it is the best fit possible on the given training examples!
• The trained model captured noise, small independent variations that will change each time we get a new example
2014-10-07 39
Green: True function (unknown)
Blue: Training examples (noisy!)
Red: Trained model
(Bishop, 2006)
Overfitting – Where Does the Noise Come From?
• Noise is small variations in the data due to unknown variables that cannot be represented by the given inputs• Example: Predict the temperature based on season, time-of-day and
cloud cover. What about atmospheric changes like a cold front? As they are not included in the model, or completely tied to the other variables, this variation will show up as random noise in the model!
• With few data points to go on the model may mistake the effect of such variables as coming from variables that are included.
• Since this relationship is merely chance the model will not generalize well to future situations
2014-10-07 40
![Page 20: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/20.jpg)
10/7/2014
20
Model Selection – Choosing Between Models
• In conclusion, we want to avoid unnecessarily complex models• This is a fairly general concept throughout science and is often
referred to as Ockham’s Razor:“Pluralitas non est ponenda sine necessitate”
-Willian of Ockham“Everything should be kept as simple as possible, but no simpler.”
-Albert Einstein (paraphrased)
• There are several mathematically principled ways to penalize model complexity during training
• However, the most straightforward way is to keep a separate validation set of known examples that is only used for evaluating the performance of different models and not for training them.
2014-10-07 41
Model Selection – Hold-out Validation
• This is called a hold-out validation set as we keep the data away from the training phase
• Measuring performance on such a validation set is a better metric of actual generalization error to unseen examples
• With the validation set we can compare models of different complexity to select the one which generalizes best.
• Examples could be polynomial models of different order, the number of neurons or layers in an ANN etc.
Validation SetTraining Set
Given example data:
2014-10-07 42
![Page 21: A Gentle Introduction to Machine Learning](https://reader033.vdocument.in/reader033/viewer/2022052805/628f43e8f7840147f5797d0a/html5/thumbnails/21.jpg)
10/7/2014
21
Model Selection – Selection Strategy
• As the number of parameters increases the size of the hypothesis space increases accordingly allowing a better fit to training data
• However, at some point it is sufficiently flexible to capture the underlying patterns, and any more will just capture noise, leading to worse generalization to new examples!
(Hastie et al., 2009)
Red: Validation Set Error
Blue: Training Set ErrorBest choice
2014-10-07 43
Overfit
Thank you for listening!
2014-10-07 54