TRANSCRIPT
Inductive Reasoning and (one of) the Foundations of
Machine Learning
“beware of mathematicians, and all those who make empty prophecies”
— St. Augustine
Deductive reasoning
All men are mortal. Socrates is a man.
Therefore Socrates is mortal.
Idea: Thinking is deductive reasoning!
[Image: page 1 of the original Dartmouth proposal, AI Magazine, Winter 2006. Photo courtesy Dartmouth College.]
Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge,
Ray Solomonoff
50 years later
“To understand the real world, we must have a different set of primitives from the relatively simple line trackers suitable and sufficient for the blocks world”
— Patrick Winston (1975), Director of MIT’s AI Lab from 1972 to 1997
A bump in the road
The AI winter: http://en.wikipedia.org/wiki/AI_winter
Reductio ad absurdum
“Intelligence is 10 million rules” — Doug Lenat
The story so far…
• Boy meets girl
• Boy spends 100s of millions of dollars wooing girl with deductive reasoning
• Girl says: “drop dead”; boy becomes very sad

Next: Boy ponders the errors of his ways
“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”
— Karl Popper, Conjectures and Refutations
We’re going to look at 4 learning algorithms.
Sequential prediction
Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.
Forecaster has access to N experts. One of them is always correct.
Goal: Predict as accurately as possible.
Algorithm #1 (Halving)
Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: How long to find the correct expert? BAD question!
Question: How many errors? Answer: ≤ log N.
When the algorithm makes a mistake, it removes ≥ half of the experts.
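The loop above is short enough to sketch directly. A minimal Python sketch; the four-expert data set at the bottom is illustrative, not from the talk:

```python
import math

def halving(expert_preds, truths):
    """Algorithm #1 (Halving): predict by majority vote over the surviving
    experts, then remove every expert that was wrong.  Assumes at least one
    expert is always correct, so the survivor set never becomes empty."""
    alive = set(range(len(expert_preds[0])))
    mistakes = 0
    for preds, truth in zip(expert_preds, truths):
        votes_for_1 = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes_for_1 >= len(alive) else 0
        if guess != truth:
            mistakes += 1
        alive = {i for i in alive if preds[i] == truth}  # drop wrong experts
    return mistakes

# Illustrative run: 4 experts, expert 0 always right, the rest always wrong.
truths = [1, 0, 1, 1]
expert_preds = [[y, 1 - y, 1 - y, 1 - y] for y in truths]
m = halving(expert_preds, truths)
```

Every mistake removes at least half of the surviving experts, which is exactly where the ≤ log N bound comes from.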
Deep thought #1
Track errors, not runtime
What’s going on? Didn’t we just use deductive reasoning!?
Yes… but no!
Algorithm: makes educated guesses about Nature (inductive)
Analysis: proves a theorem about the number of errors (deductive)
The algorithm learns — but it does not deduce!
Adversarial prediction
Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.
Forecaster has access to N experts. One of them is always correct. Nature is adversarial.
Goal: Predict as accurately as possible.
Seriously?!?
Regret
Let m* be the best expert in hindsight.
regret := errors(Forecaster) − errors(m*)
Goal: Minimize regret.
Algorithm #2 (Multiplicative Weights)
Pick β in (0,1). Assign weight 1 to each expert. Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t+1

Question: What is the regret? [choose β carefully]
regret ≤ √(T · (log N) / 2)
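A minimal Python sketch of the weighted-majority update; the expert data at the bottom is illustrative:

```python
def weighted_majority(expert_preds, truths, beta=0.5):
    """Algorithm #2: predict by weighted majority vote, then multiply the
    weight of every incorrect expert by beta in (0, 1)."""
    weights = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, truth in zip(expert_preds, truths):
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(weights) - vote_1
        guess = 1 if vote_1 >= vote_0 else 0
        if guess != truth:
            mistakes += 1
        # multiplicative update: wrong experts shrink, right ones keep weight
        weights = [w * (beta if p != truth else 1.0)
                   for w, p in zip(weights, preds)]
    return mistakes

# Illustrative run: expert 0 is perfect, the other three are always wrong.
truths = [1, 0, 1, 0, 1]
expert_preds = [[y, 1 - y, 1 - y, 1 - y] for y in truths]
errors = weighted_majority(expert_preds, truths)
```

Unlike Halving, wrong experts are only down-weighted, never removed, so the algorithm survives even when no expert is always correct.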
Deep thought #2
Model yourself, not Nature
Online Convex Opt.
Scenario: Convex set K; convex loss L(a, b) [convex in both arguments, separately]
At time t, Forecaster picks a_t in K. Nature responds with b_t in K [Nature is adversarial]. Forecaster’s loss is L(a_t, b_t).
Goal: Minimize regret.
Follow the Leader
Idea: Predict with the a_t that would have worked best on { b_1, …, b_{t−1} }
Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. a_t := argmin_{a ∈ K} Σ_{i=1}^{t−1} L(a, b_i)
  Step 2. t ← t+1

BAD! Problem: Nature pulls Forecaster back and forth. No memory!
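The back-and-forth failure shows up already on a toy instance. A sketch with K = [−1, 1] and the linear loss L(a, b) = a·b; the alternating sequence below is an illustrative choice, not from the talk:

```python
def follow_the_leader(bs):
    """Follow the Leader on K = [-1, 1] with linear loss L(a, b) = a * b.
    The leader, argmin over [-1, 1] of a * (b_1 + ... + b_{t-1}),
    is always an endpoint of the interval."""
    a, cum_b, total_loss = 0.0, 0.0, 0.0  # a_1 chosen arbitrarily
    for b in bs:
        total_loss += a * b             # suffer loss for this round
        cum_b += b
        a = -1.0 if cum_b > 0 else 1.0  # leader for the next round
    return total_loss

# Nature makes a small first move, then alternates: FTL is dragged to the
# wrong endpoint on every round, so its loss grows linearly in T.
bs = [0.5] + [-1.0, 1.0] * 50
ftl_loss = follow_the_leader(bs)
best_fixed_loss = -abs(sum(bs))  # best single action in hindsight
```

Over these 101 rounds FTL loses 100 while the best fixed action loses only −0.5: regret grows like T, because nothing damps the oscillation.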
Algorithm #3 (regularize)
Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. a_t := argmin_{a ∈ K} [ Σ_{i=1}^{t−1} L(a, b_i) + (β/2)·‖a‖₂² ]
  Step 2. t ← t+1

Algorithm #3 (gradient descent)
Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. a_t ← a_{t−1} − β · ∂L(a_{t−1}, b_{t−1}) / ∂a
  Step 2. t ← t+1

Intuition: β controls memory.
Question: What is the regret? [choose β carefully]
regret ≤ diam(K) · Lipschitz(L) · √T
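A minimal sketch of the gradient-descent version, with K = [−1, 1] and squared loss L(a, b) = (a − b)²; the step size and Nature’s moves below are illustrative:

```python
def online_gradient_descent(grad, project, a0, eta, bs):
    """Algorithm #3: a_t = project(a_{t-1} - eta * dL(a_{t-1}, b_{t-1})/da),
    where project maps back onto the convex set K."""
    a, played = a0, []
    for b in bs:
        played.append(a)                   # commit to a_t before seeing b_t
        a = project(a - eta * grad(a, b))  # gradient step on the revealed loss
    return played

# L(a, b) = (a - b)^2 on K = [-1, 1]
grad = lambda a, b: 2.0 * (a - b)
project = lambda a: max(-1.0, min(1.0, a))
plays = online_gradient_descent(grad, project, a0=-1.0, eta=0.1, bs=[0.5] * 100)
```

Against this (non-adversarial) Nature the iterates converge to the best fixed action 0.5; the √T regret bound holds even when each b_t is chosen adversarially.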
Deep thought #3
“Those who cannot remember [their] past are condemned to repeat it”
— George Santayana
Minimax theorem
inf_{a ∈ K} sup_{b ∈ K} L(a, b) = sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Left side: Forecaster picks a, Nature responds with b.
Right side: Nature picks b, Forecaster responds with a.
Going first hurts Forecaster, so
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≥ sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Minimax theorem
Proof idea: No-regret algorithm →
• Forecaster can asymptotically match hindsight
• Order of players doesn’t matter asymptotically
• Convert the series of moves into an average via online-to-batch: ā = (1/T) Σ_{t=1}^{T} a_t
Let m* be the best move in hindsight. regret := loss(Forecaster) − loss(m*)
This gives the hard direction:
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≤ sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Boosting
Scenario: Algorithm W is better than guessing on any data distribution: loss ≤ 0.5 − ε
Goal: Combine to perform well
The Boosting Game
Value of game: V(w, d) = # mistakes w makes on d
Algorithm W is better than guessing on any data distribution:
sup_d inf_w V(w, d) ≤ 1/2 − ε
MINIMAX!
inf_w sup_d V(w, d) ≤ 1/2 − ε
∃ distribution w* on learners that averages correctly on any data!
Meta-Algorithm #4
Play Algorithm #2 against Algorithm W [#2 maximizes W’s mistakes]
• Freund and Schapire 1995
• Best learning algorithm in the late 1990s and early 2000s
• Authors won the Gödel Prize
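A compact sketch of the boosting loop in Python. The threshold “stumps”, the 4-point data set, and the greedy weak learner are illustrative stand-ins for Algorithm W; the reweighting step is Algorithm #2’s multiplicative update applied to the distribution over data, in AdaBoost’s usual exponential form:

```python
import math

def boost(weak_hyps, X, y, T=5):
    """Meta-Algorithm #4 (AdaBoost-style sketch).  Maintains a distribution d
    over examples, repeatedly asks the weak learner for the best hypothesis
    under d, and up-weights the examples that hypothesis got wrong."""
    d = [1.0 / len(X)] * len(X)        # distribution over examples
    ensemble = []                      # list of (alpha, hypothesis)
    for _ in range(T):
        # Weak learner: hypothesis with the lowest weighted error under d.
        h = min(weak_hyps,
                key=lambda g: sum(di for di, xi, yi in zip(d, X, y) if g(xi) != yi))
        err = sum(di for di, xi, yi in zip(d, X, y) if h(xi) != yi)
        if err >= 0.5:                 # no longer better than guessing
            break
        alpha = 0.5 * math.log((1.0 - err) / max(err, 1e-12))
        ensemble.append((alpha, h))
        # Multiplicative reweighting: wrong examples up, right ones down.
        d = [di * math.exp(-alpha * yi * h(xi))
             for di, xi, yi in zip(d, X, y)]
        z = sum(d)
        d = [di / z for di in d]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Decision stumps on a toy 1-d data set that no single stump classifies
# perfectly (labels alternate: +1, -1, +1, -1).
stumps = [(lambda x, t=t, s=s: s if x < t else -s)
          for t in (0.5, 1.5, 2.5) for s in (1, -1)]
X, y = [0, 1, 2, 3], [1, -1, 1, -1]
strong = boost(stumps, X, y, T=5)
```

After a few rounds the weighted vote of imperfect stumps classifies all four points, which no single stump can do.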
Deep thought #4
Your teachers are not your friends
The story so far…
• Boy met girl
• Boy spent 100s of millions of dollars wooing girl with deductive reasoning
• Girl showed no interest; boy became very sad
• Boy learnt to learn from mistakes

Next: Boy invites girl for coffee. Girl accepts!
Online Convex Opt. (deep learning)
Apply Algorithm #3 to nonconvex optimization.
• Theorems don’t work (not convex) → tons of engineering on top of #3
• Amazing performance
• New mathematics needs to be invented!

In the last 2 years deep learning has:
• Beaten human performance at object recognition (ImageNet)
• Outperformed humans at recognising street signs (Google Street View)
• Reached superhuman performance on Atari games (DeepMind)
• Delivered real-time translation: English voice to Chinese text and voice
Thank you!
#1. Halving
#2. Multiplicative Weights, Exponential Weights Algorithm (EWA)
#3. Online Gradient Descent (OGD), Stochastic Gradient Descent (SGD), Mirror Descent, Backpropagation
#4. AdaBoost

Details? Lecture notes on my webpage: https://dl.dropboxusercontent.com/u/5874168/math482.pdf
Vladimir Vapnik
Alexey Chervonenkis, 1938 — 2014
“[A] theory of induction is superfluous. It has no function in a logic of science. The best we can say of a hypothesis is that up to now it has been able to show its worth, and that it has been more successful than other hypotheses although, in principle, it can never be justified, verified, or even shown to be probable. This appraisal of the hypothesis relies solely upon deductive consequences (predictions) which may be drawn from the hypothesis: There is no need to even mention induction.”
— Karl Popper
“the learning process may be regarded as a search for a form of behaviour which will satisfy the teacher (or some other criterion)”
— Alan Turing (1950)