Machine Learning:
The Optimization Perspective
Konstantin Tretyakov http://kt.era.ee
AACIMP 2012
August 2012
So far…

- Machine learning is important and interesting
- The general concept: searching for the best model fitting the data
- The tools: Optimization, Probability Theory
Today

1. Optimization is important
2. Optimization is possible*

\* Basic techniques: constrained / unconstrained, analytic / iterative, continuous / discrete
Special cases of optimization

- Machine learning
- Algorithms and data structures
- General problem-solving
- Management and decision-making
- Evolution
- The Meaning of Life?
Optimization task

Given a function $f: X \to \mathbb{R}$, find the argument $x$ resulting in the optimal value.
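In symbols, writing $x^*$ for the optimizer (for minimization; maximization is analogous):

$$x^* = \operatorname{arg\,min}_{x \in X} f(x)$$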
Constrained optimization task

Given a function $f: X \to \mathbb{R}$, find the argument $x$ resulting in the optimal value, subject to constraints on $x$ (e.g. $g(x) \le 0$).
Optimization methods

In principle, $x$ can be anything:

- Discrete
  - Value (e.g. a name)
  - Structure (e.g. a graph, plaintext)
  - Finite / infinite
- Continuous*
  - Real number, vector, matrix, …
  - Complex number, function, …
Optimization methods

In principle, $f$ can be anything:

- Random oracle
- Structured
- Continuous
- Differentiable
- Convex
- …
Optimization methods

| Type of x ↓ / Knowledge about f → | Not much | A lot |
| --- | --- | --- |
| Discrete | Combinatorial search: brute-force, stepwise, MCMC, population-based, … | Algorithmic |
| Continuous | Numeric methods: gradient-based, Newton-like, MCMC, population-based, … | Analytic |

Machine learning lives mostly in the continuous row: finding a weight vector $\boldsymbol{w}$ that minimizes the model error. Numeric methods handle this in a fairly general (even very general) case; analytic solutions cover many practical cases. This lecture covers these continuous methods.
Minima and maxima
Differentiability

Locally, a differentiable function looks linear:

$f(x) \approx b + c\,\Delta x$

$f(x_1, x_2) \approx b + c_1\,\Delta x_1 + c_2\,\Delta x_2$
The Most Important Observation
This small observation gives us everything we need for now:

- A nice interpretation of the gradient
- An extremality criterion
- An iterative algorithm for function minimization
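The observation in question is the standard first-order expansion of a differentiable function, stated here in the notation of the preceding slides:

$$f(\boldsymbol{x} + \Delta\boldsymbol{x}) \approx f(\boldsymbol{x}) + \langle \nabla f(\boldsymbol{x}), \Delta\boldsymbol{x} \rangle$$

For a small step, the change in $f$ is governed by the inner product of the step with the gradient, so stepping against the gradient decreases $f$ fastest.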
Interpretation of the gradient
Extremality criterion
At an extremum of a differentiable function, $\nabla f(\boldsymbol{x}) = \boldsymbol{0}$.
Gradient descent
1. Pick a random point $\boldsymbol{x}_0$.
2. If $\nabla f(\boldsymbol{x}_0) = \boldsymbol{0}$, then we've found an extremum.
3. Otherwise, make a small step downhill: $\boldsymbol{x}_1 \leftarrow \boldsymbol{x}_0 - \mu_0 \nabla f(\boldsymbol{x}_0)$
4. … and then another step: $\boldsymbol{x}_2 \leftarrow \boldsymbol{x}_1 - \mu_1 \nabla f(\boldsymbol{x}_1)$
5. … and so on until $\nabla f(\boldsymbol{x}_n) \approx \boldsymbol{0}$ or we're tired.

With a smart choice of $\mu_i$ we'll converge to a minimum. A minimal sketch in Python follows.
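This sketch makes assumptions the slides leave open: a fixed step size `mu`, a gradient tolerance `tol`, and an iteration cap `max_steps` are all illustrative choices.

```python
import numpy as np

def gradient_descent(grad_f, x0, mu=0.1, tol=1e-8, max_steps=10_000):
    """Gradient descent: x_{i+1} = x_i - mu * grad_f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_steps):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # grad f(x) ~ 0: (approximately) an extremum
            break
        x = x - mu * g                # small step downhill
    return x

# Example: f(x) = ||x||^2 has gradient 2x and its minimum at the origin.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0])
```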
The whole algorithm in one line:

$\boldsymbol{x}_{i+1} \leftarrow \boldsymbol{x}_i - \mu_i \nabla f(\boldsymbol{x}_i)$

Equivalently, the step is $\Delta\boldsymbol{x}_i = -\mu_i \nabla f(\boldsymbol{x}_i)$; in the local linear picture, $\Delta\boldsymbol{x}_i = -\mu\,\boldsymbol{c}$.

Gradient descent (fixed step)

$\Delta\boldsymbol{x}_i = -\mu\,\nabla f(\boldsymbol{x}_i)$
Example

Given points $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_{50}$, minimize the total squared distance to them:

$f(\boldsymbol{w}) = \sum_i \|\boldsymbol{x}_i - \boldsymbol{w}\|^2$
$\nabla f(\boldsymbol{w}) = -\sum_i 2(\boldsymbol{x}_i - \boldsymbol{w})$
$\nabla f(\boldsymbol{w}) = 2n(\boldsymbol{w} - \bar{\boldsymbol{x}})$

Setting $\nabla f(\boldsymbol{w}) = \boldsymbol{0}$:
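Setting the gradient to zero yields the sample mean as the analytic minimizer:

$$2n(\boldsymbol{w} - \bar{\boldsymbol{x}}) = \boldsymbol{0} \;\Rightarrow\; \boldsymbol{w} = \bar{\boldsymbol{x}} = \frac{1}{n}\sum_i \boldsymbol{x}_i$$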
Alternatively, reach the same minimum iteratively by following $\nabla f(\boldsymbol{w})$ downhill:

$\Delta\boldsymbol{w} = -\mu\,\nabla f(\boldsymbol{w})$
Stochastic gradient descent

Whenever the objective is a sum over data points,

$f(\boldsymbol{w}) = \sum_k g(\boldsymbol{w}, \boldsymbol{x}_k),$

e.g.

$\mathrm{Error}(\boldsymbol{w}) = \sum_k \bigl(\mathrm{model}_{\boldsymbol{w}}(\boldsymbol{x}_k) - y_k\bigr)^2,$

the gradient is also a sum:

$\nabla f(\boldsymbol{w}) = \sum_k \nabla g(\boldsymbol{w}, \boldsymbol{x}_k)$

The GD step is then also a sum.

Batch update:

$\Delta\boldsymbol{w}_i = -\mu \sum_k \nabla g(\boldsymbol{w}_i, \boldsymbol{x}_k)$

On-line update:

$\Delta\boldsymbol{w}_i = -\mu\,\nabla g(\boldsymbol{w}_i, \boldsymbol{x}_{\mathrm{random}})$
Example

The on-line update for the mean objective above:

$\Delta\boldsymbol{w} = \mu(\boldsymbol{x}_i - \boldsymbol{w})$
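A minimal sketch of this on-line update in Python; the data, step size, and pass count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=5.0, scale=2.0, size=(50, 2))   # the 50 sample points

w = np.zeros(2)     # initial guess
mu = 0.05           # fixed step size
for _ in range(100):                    # several passes over the data
    for i in rng.permutation(len(xs)):  # visit points in random order
        w += mu * (xs[i] - w)           # on-line update: Δw = μ(x_i − w)

# w is now close to the sample mean, xs.mean(axis=0)
```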
Summary

- An interpretation of the gradient
- An extremality criterion
- An iterative algorithm for function minimization
- A stochastic iterative algorithm for function minimization
Quiz

The symbol $\Delta$ is called ______ and denotes ______.

The symbol $\nabla$ is called ______ and denotes ______.
Quiz

Gradient descent:

$\Delta\boldsymbol{w}_i = $ _________

(Answer: $\Delta\boldsymbol{w}_i = -\mu\,\nabla_{\boldsymbol{w}} f(\boldsymbol{w})$, i.e. $-\mu\,\boldsymbol{c}$ in the local linear picture.)

Suppose the batch gradient descent step is

$\Delta\boldsymbol{w} = \sum_i \bigl(\boldsymbol{w}^T\boldsymbol{x}_i + \boldsymbol{x}_i^2\bigr)\,\boldsymbol{e}_i;$

the corresponding stochastic gradient descent step is:

$\Delta\boldsymbol{w} = $ _______________
Quiz
Can bacteria learn?
Questions?
Supervised Learning

Let $X$ and $Y$ be some sets, and let there be a training dataset

$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \qquad x_i \in X,\; y_i \in Y.$

Supervised learning: find a function $f: X \to Y$ generalizing the dependency present in the data.
Classification

$X = \mathbb{R}^2, \quad Y = \{\text{blue}, \text{red}\}$

$D = \{((1.3, 0.8), \text{red}),\ ((2.5, 2.3), \text{blue}),\ \ldots\}$

$f(x_1, x_2) = \text{if } x_1 + x_2 > 3 \text{ then blue else red}$
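The same rule as a tiny Python sketch; the decision rule and the two labelled points come from the example, while the function name is an illustrative choice.

```python
def classify(x1: float, x2: float) -> str:
    """The linear decision rule from the example."""
    return "blue" if x1 + x2 > 3 else "red"

assert classify(1.3, 0.8) == "red"
assert classify(2.5, 2.3) == "blue"
```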
Regression

$X = \mathbb{R}, \quad Y = \mathbb{R}$

$D = \{(0.5, 0.26),\ (0.43, 0.08),\ \ldots\}$

$f(x) = x^2$
Linear Regression

$X = \mathbb{R}, \quad Y = \mathbb{R}$

$D = \{(0.5, 0.26),\ (0.43, 0.08),\ \ldots\}$

$f(x) = -0.14 + 1.01\,x$
Linear Regression

$X = \mathbb{R}^m, \quad Y = \mathbb{R}$

$f(x_1, \ldots, x_m) = w_0 + w_1 x_1 + \cdots + w_m x_m$

With the inner product $\langle\cdot,\cdot\rangle$:

$f(x_1, \ldots, x_m) = w_0 + \langle\boldsymbol{w}, \boldsymbol{x}\rangle$

$f(x_1, \ldots, x_m) = w_0 + (w_1, \ldots, w_m)\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$

$f(x_1, \ldots, x_m) = w_0 + \boldsymbol{w}^T\boldsymbol{x}$
Linear Regression

$f(\boldsymbol{x}) = w_0 + \boldsymbol{w}^T\boldsymbol{x}$, where $w_0$ is the bias term.

Folding the bias into the weight vector by prepending a constant feature $1$:

$f(x_1, \ldots, x_m) = (w_0, w_1, \ldots, w_m)\begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_m \end{pmatrix}$

$f(\boldsymbol{x}) = \tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}}$
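In code, the fold is just a column of ones prepended to the data matrix; a small NumPy sketch (the data values are illustrative):

```python
import numpy as np

X = np.array([[0.5], [0.43], [1.2]])                 # n x m data matrix (m = 1 here)
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant-1 column
# f(x) = w0 + w^T x becomes f(x) = w~^T x~ with w~ = (w0, w1, ..., wm)
```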
Linear Regression

With the bias folded in (dropping the tildes):

$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x}$

In one dimension: $f(x) = w x$
Objective Function

The squared error on the $i$-th point:

$e_i = \bigl(f(x_i) - y_i\bigr)^2 = (\hat{y}_i - y_i)^2 = (w x_i - y_i)^2$

Summed over the dataset, this gives the ordinary least squares ("OLS") objective:

$E_D(w) = \sum_i (w x_i - y_i)^2$

and in the multivariate case:

$E_D(\boldsymbol{w}) = \sum_i (\boldsymbol{w}^T\boldsymbol{x}_i - y_i)^2$
Objective Function

$E_D(\boldsymbol{w}) = \sum_i (\boldsymbol{x}_i^T\boldsymbol{w} - y_i)^2$

Stacking the data row-wise,

$\boldsymbol{X} = \begin{pmatrix} \cdots\ \boldsymbol{x}_1^T\ \cdots \\ \cdots\ \boldsymbol{x}_2^T\ \cdots \\ \vdots \\ \cdots\ \boldsymbol{x}_n^T\ \cdots \end{pmatrix}, \qquad \boldsymbol{X}\boldsymbol{w} = \begin{pmatrix} \boldsymbol{x}_1^T\boldsymbol{w} \\ \boldsymbol{x}_2^T\boldsymbol{w} \\ \vdots \\ \boldsymbol{x}_n^T\boldsymbol{w} \end{pmatrix}, \qquad \boldsymbol{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},$

the objective becomes

$E_D(\boldsymbol{w}) = \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2,$

i.e. we are solving $\boldsymbol{X}\boldsymbol{w} \approx \boldsymbol{y}$ in the least-squares sense.

$\boldsymbol{w} \approx \boldsymbol{X}^{-1}\boldsymbol{y}$? No: $\boldsymbol{X}$ is generally not square, hence not invertible.

$\boldsymbol{w} = \boldsymbol{X}^{+}\boldsymbol{y}$!
Optimization

$\operatorname{arg\,min}_{\boldsymbol{w}} \tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2$

(The factor $\tfrac{1}{2}$ is for convenience only and does not change the minimizer.)

Expand the objective:

$\tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 = \tfrac{1}{2}(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^T(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$, since $\|\boldsymbol{a}\|^2 = \boldsymbol{a}^T\boldsymbol{a}$

$= \tfrac{1}{2}(\boldsymbol{w}^T\boldsymbol{X}^T - \boldsymbol{y}^T)(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$, since $(\boldsymbol{a} + \boldsymbol{b})^T = \boldsymbol{a}^T + \boldsymbol{b}^T$

$= \tfrac{1}{2}(\boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}^T\boldsymbol{X}\boldsymbol{w} - \boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{y} + \boldsymbol{y}^T\boldsymbol{y})$, by distributing the product

$= \tfrac{1}{2}(\boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} - 2\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{w} + \boldsymbol{y}^T\boldsymbol{y})$, since $\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{w} = \boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{y}$ is a scalar

So

$E(\boldsymbol{w}) = \tfrac{1}{2}\bigl(\boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} - 2\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{w} + \boldsymbol{y}^T\boldsymbol{y}\bigr)$

Differentiate, using $\nabla(f + g) = \nabla f + \nabla g$, $\nabla(\boldsymbol{w}^T\boldsymbol{A}\boldsymbol{w}) = 2\boldsymbol{A}\boldsymbol{w}$ (for symmetric $\boldsymbol{A}$), and $\nabla(\boldsymbol{a}^T\boldsymbol{w}) = \boldsymbol{a}$:

$\nabla E(\boldsymbol{w}) = \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} - \boldsymbol{X}^T\boldsymbol{y}$

Set the gradient to zero and solve:

$\boldsymbol{0} = \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} - \boldsymbol{X}^T\boldsymbol{y}$

$\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^T\boldsymbol{y}$

$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$
Linear Regression Solution

$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \qquad\Longleftrightarrow\qquad \boldsymbol{w} = \boldsymbol{X}^{+}\boldsymbol{y}$

where $\boldsymbol{X}^{+} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T$ is the Moore-Penrose pseudoinverse (for $\boldsymbol{X}$ of full column rank).

    from numpy import matrix
    from numpy.linalg import pinv

    X = matrix(X)
    y = matrix(y)
    w = (X.T * X).I * X.T * y

    # equivalently, via the pseudoinverse:
    w = pinv(X) * y
Linear Regression & SKLearn

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X, y)
    w = (model.intercept_, model.coef_)
    model.predict(X_new)
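A quick end-to-end sketch with synthetic data; the data and noise level are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = -0.14 + 1.01 * X[:, 0] + rng.normal(scale=0.05, size=100)  # a noisy line

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # should be close to -0.14 and 1.01

# The same fit via the pseudoinverse, with an explicit bias column:
X_tilde = np.hstack([np.ones((100, 1)), X])
w = np.linalg.pinv(X_tilde) @ y
```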
Stochastic Gradient Regression

$\Delta\boldsymbol{w} = -\mu(\boldsymbol{w}^T\boldsymbol{x}_i - y_i)\,\boldsymbol{x}_i = -\mu\,e_i\,\boldsymbol{x}_i$

where $e_i = \boldsymbol{w}^T\boldsymbol{x}_i - y_i$ is the residual on the $i$-th point.

    from sklearn.linear_model import SGDRegressor
    model = SGDRegressor(alpha=0, n_iter=30)
    model.fit(X, y)
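The same update written out by hand; a minimal sketch, assuming a fixed learning rate and epoch count. (Note that `n_iter` above is the 2012-era parameter name; current scikit-learn calls it `max_iter`.)

```python
import numpy as np

def sgd_regression(X, y, mu=0.01, epochs=30, seed=0):
    """Fit f(x) = w^T x with the update Δw = -μ (w^T x_i - y_i) x_i."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(epochs):
        for i in rng.permutation(n):   # one pass over the data, random order
            e_i = X[i] @ w - y[i]      # residual on point i
            w -= mu * e_i * X[i]       # stochastic gradient step
    return w
```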
Polynomial Regression

Say we'd like to fit the model

$f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2^2 + w_3 x_1 x_2$

Simply transform the features and proceed as normal:

$(x_1, x_2) \to (x_1, x_2^2, x_1 x_2)$
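As a sketch, the transform is just stacking the new feature columns in NumPy; the synthetic data and true weights here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-1, 1, size=(2, 100))             # raw features
y = 1 + 2*x1 + 3*x2**2 - x1*x2 + rng.normal(scale=0.01, size=100)

X = np.column_stack([np.ones_like(x1), x1, x2**2, x1*x2])  # (1, x1, x2^2, x1*x2)
w = np.linalg.pinv(X) @ y    # OLS as before; w ≈ (1, 2, 3, -1)
```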
Single-variable Polynomial OLS

    from numpy import matrix, hstack
    from numpy.linalg import pinv

    # n x 1 matrix of raw inputs
    x = matrix(…)
    # Add bias, linear and square features: columns 1, x, x^2
    X = hstack([x**0, x**1, x**2])
    # Solve for w
    w = pinv(X) * y
Overfitting
Regularization

$E(\boldsymbol{w}) = \tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 + \lambda\|\boldsymbol{w}\|^2 \qquad$ (ℓ2-loss, ℓ2-penalty)

$E(\boldsymbol{w}) = \tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 + \lambda\|\boldsymbol{w}\|_1 \qquad$ (ℓ2-loss, ℓ1-penalty)

$E(\boldsymbol{w}) = \tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 + \lambda\|\boldsymbol{w}\|_0 \qquad$ (ℓ2-loss, ℓ0-penalty)

Both the loss and the penalty are tunable in scikit-learn:

    >>> SGDRegressor?
    Parameters
    ----------
    loss : str, 'squared_loss' or 'huber' ...
    ...
    penalty : str, 'l2' or 'l1' or 'elasticnet'
    ...
Ridge regression

$\operatorname{arg\,min}_{\boldsymbol{w}} \tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 + \lambda\|\boldsymbol{w}\|^2$

$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}^T\boldsymbol{y}, \qquad \boldsymbol{I} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$
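A direct NumPy sketch of this closed form; the regularization strength `lam` is an illustrative default, and `np.linalg.solve` stands in for the explicit inverse for numerical stability.

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Closed-form ridge regression: w = (X^T X + λ I)^(-1) X^T y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
```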
Ridge regression

$\operatorname{arg\,min}_{\boldsymbol{w}} \tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 + \lambda\|\boldsymbol{w}_*\|^2$

$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I}_*)^{-1}\boldsymbol{X}^T\boldsymbol{y}, \qquad \boldsymbol{I}_* = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$

Here $\boldsymbol{w}_* = (w_1, \ldots, w_m)$: the bias term $w_0$ is usually not penalized.
Exercise

Derive an SGD algorithm for Ridge Regression.
Effects of Regularization
Quiz

OLS linear regression searches for a _______ model that has the best _________.

Analytic solution for OLS regression: $\boldsymbol{w} = $ _________

Stochastic gradient solution for OLS regression: $\Delta\boldsymbol{w} = $ _________
Quiz

A large number of model parameters and/or a small dataset may lead to ___________.

We address overfitting by __________.

"Ridge regression" means ___-loss and ___-penalty.

Analytic solution for Ridge regression: $\boldsymbol{w} = $ _________
Quiz

As we increase the regularization strength (i.e. increase $\lambda$), the training error _________ … and the test error ________.
Questions?