TRANSCRIPT
Inductive Reasoning and (one of) the Foundations of
Machine Learning
“beware of mathematicians, and all those who make empty prophecies”
— St. Augustine
Deductive reasoning
All men are mortal. Socrates is a man.
Therefore Socrates is mortal.
Idea: Thinking is deductive reasoning!
[Image: page 1 of the original Dartmouth proposal, AI Magazine, Winter 2006. Photo courtesy Dartmouth College.]
Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge,
Ray Solomonoff
50 years later
“To understand the real world, we must have a different set of primitives from the relatively simple line trackers suitable and sufficient for the blocks world”
— Patrick Winston (1975), Director of MIT’s AI Lab from 1972 to 1997
A bump in the road
The AI winter: http://en.wikipedia.org/wiki/AI_winter
Reductio ad absurdum
“Intelligence is 10 million rules” — Doug Lenat
The story so far…
• Boy meets girl
• Boy spends 100s of millions of dollars wooing girl with deductive reasoning
• Girl says: “drop dead”; boy becomes very sad

Next: Boy ponders the errors of his ways
“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”
— Karl Popper, Conjectures and Refutations
We’re going to look at 4 learning algorithms.
Sequential prediction
Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.
Forecaster has access to N experts. One of them is always correct.
Goal: Predict as accurately as possible.
Algorithm #1 (Halving)
Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: How long to find the correct expert? BAD question!
Question: How many errors? Answer: ≤ log N.
When the algorithm makes a mistake, it removes ≥ half of the experts.
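The loop above is short enough to sketch directly. A minimal Python sketch; the four-expert data set at the bottom is illustrative, not from the talk:

```python
import math

def halving(expert_preds, truths):
    """Algorithm #1 (Halving): predict by majority vote over the surviving
    experts, then remove every expert that was wrong.  Assumes at least one
    expert is always correct, so the survivor set never becomes empty."""
    alive = set(range(len(expert_preds[0])))
    mistakes = 0
    for preds, truth in zip(expert_preds, truths):
        votes_for_1 = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes_for_1 >= len(alive) else 0
        if guess != truth:
            mistakes += 1
        alive = {i for i in alive if preds[i] == truth}  # drop wrong experts
    return mistakes

# Illustrative run: 4 experts, expert 0 always right, the rest always wrong.
truths = [1, 0, 1, 1]
expert_preds = [[y, 1 - y, 1 - y, 1 - y] for y in truths]
m = halving(expert_preds, truths)
```

Every mistake removes at least half of the surviving experts, which is exactly where the ≤ log N bound comes from.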
Deep thought #1
Track errors, not runtime
What’s going on? Didn’t we just use deductive reasoning!?
Yes… but no!
Algorithm: makes educated guesses about Nature (inductive)
Analysis: proves a theorem about the number of errors (deductive)
The algorithm learns — but it does not deduce!
Adversarial prediction
Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.
Forecaster has access to N experts. One of them is always correct. Nature is adversarial.
Goal: Predict as accurately as possible.
Seriously?!?
Regret
Let m* be the best expert in hindsight.
regret := errors(Forecaster) − errors(m*)
Goal: Minimize regret.
Algorithm #2 (Multiplicative Weights)
Pick β in (0,1). Assign weight 1 to each expert. Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t+1

Question: What is the regret? [choose β carefully]
regret ≤ √(T · (log N) / 2)
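A minimal Python sketch of the weighted-majority update; the expert data at the bottom is illustrative:

```python
def weighted_majority(expert_preds, truths, beta=0.5):
    """Algorithm #2: predict by weighted majority vote, then multiply the
    weight of every incorrect expert by beta in (0, 1)."""
    weights = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, truth in zip(expert_preds, truths):
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(weights) - vote_1
        guess = 1 if vote_1 >= vote_0 else 0
        if guess != truth:
            mistakes += 1
        # multiplicative update: wrong experts shrink, right ones keep weight
        weights = [w * (beta if p != truth else 1.0)
                   for w, p in zip(weights, preds)]
    return mistakes

# Illustrative run: expert 0 is perfect, the other three are always wrong.
truths = [1, 0, 1, 0, 1]
expert_preds = [[y, 1 - y, 1 - y, 1 - y] for y in truths]
errors = weighted_majority(expert_preds, truths)
```

Unlike Halving, wrong experts are only down-weighted, never removed, so the algorithm survives even when no expert is always correct.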
Deep thought #2
Model yourself, not Nature
Online Convex Opt.
Scenario: Convex set K; convex loss L(a, b) [convex in both arguments, separately]
At time t, Forecaster picks a_t in K. Nature responds with b_t in K [Nature is adversarial]. Forecaster’s loss is L(a_t, b_t).
Goal: Minimize regret.
Follow the Leader
Idea: Predict with the a_t that would have worked best on { b_1, …, b_{t−1} }
Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. a_t := argmin_{a ∈ K} Σ_{i=1}^{t−1} L(a, b_i)
  Step 2. t ← t+1

BAD! Problem: Nature pulls Forecaster back and forth. No memory!
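The back-and-forth failure shows up already on a toy instance. A sketch with K = [−1, 1] and the linear loss L(a, b) = a·b; the alternating sequence below is an illustrative choice, not from the talk:

```python
def follow_the_leader(bs):
    """Follow the Leader on K = [-1, 1] with linear loss L(a, b) = a * b.
    The leader, argmin over [-1, 1] of a * (b_1 + ... + b_{t-1}),
    is always an endpoint of the interval."""
    a, cum_b, total_loss = 0.0, 0.0, 0.0  # a_1 chosen arbitrarily
    for b in bs:
        total_loss += a * b             # suffer loss for this round
        cum_b += b
        a = -1.0 if cum_b > 0 else 1.0  # leader for the next round
    return total_loss

# Nature makes a small first move, then alternates: FTL is dragged to the
# wrong endpoint on every round, so its loss grows linearly in T.
bs = [0.5] + [-1.0, 1.0] * 50
ftl_loss = follow_the_leader(bs)
best_fixed_loss = -abs(sum(bs))  # best single action in hindsight
```

Over these 101 rounds FTL loses 100 while the best fixed action loses only −0.5: regret grows like T, because nothing damps the oscillation.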
Algorithm #3 (regularize)
Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. a_t := argmin_{a ∈ K} [ Σ_{i=1}^{t−1} L(a, b_i) + (β/2)·‖a‖₂² ]
  Step 2. t ← t+1

Algorithm #3 (gradient descent)
Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. a_t ← a_{t−1} − β · ∂L(a_{t−1}, b_{t−1}) / ∂a
  Step 2. t ← t+1

Intuition: β controls memory.
Question: What is the regret? [choose β carefully]
regret ≤ diam(K) · Lipschitz(L) · √T
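A minimal sketch of the gradient-descent version, with K = [−1, 1] and squared loss L(a, b) = (a − b)²; the step size and Nature’s moves below are illustrative:

```python
def online_gradient_descent(grad, project, a0, eta, bs):
    """Algorithm #3: a_t = project(a_{t-1} - eta * dL(a_{t-1}, b_{t-1})/da),
    where project maps back onto the convex set K."""
    a, played = a0, []
    for b in bs:
        played.append(a)                   # commit to a_t before seeing b_t
        a = project(a - eta * grad(a, b))  # gradient step on the revealed loss
    return played

# L(a, b) = (a - b)^2 on K = [-1, 1]
grad = lambda a, b: 2.0 * (a - b)
project = lambda a: max(-1.0, min(1.0, a))
plays = online_gradient_descent(grad, project, a0=-1.0, eta=0.1, bs=[0.5] * 100)
```

Against this (non-adversarial) Nature the iterates converge to the best fixed action 0.5; the √T regret bound holds even when each b_t is chosen adversarially.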
Deep thought #3
“Those who cannot remember [their] past are condemned to repeat it”
— George Santayana
Minimax theorem
inf_{a ∈ K} sup_{b ∈ K} L(a, b) = sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Left side: Forecaster picks a, Nature responds with b.
Right side: Nature picks b, Forecaster responds with a.
Going first hurts Forecaster, so
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≥ sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Minimax theorem
Proof idea: No-regret algorithm →
• Forecaster can asymptotically match hindsight
• Order of players doesn’t matter asymptotically
• Convert the series of moves into an average via online-to-batch: ā = (1/T) Σ_{t=1}^{T} a_t
Let m* be the best move in hindsight. regret := loss(Forecaster) − loss(m*)
This gives the hard direction:
inf_{a ∈ K} sup_{b ∈ K} L(a, b) ≤ sup_{b ∈ K} inf_{a ∈ K} L(a, b)
Boosting
Scenario: Algorithm W is better than guessing on any data distribution: loss ≤ 0.5 − ε
Goal: Combine to perform well
The Boosting Game
Value of game: V(w, d) = # mistakes w makes on d
Algorithm W is better than guessing on any data distribution:
sup_d inf_w V(w, d) ≤ 1/2 − ε
MINIMAX!
inf_w sup_d V(w, d) ≤ 1/2 − ε
∃ distribution w* on learners that averages correctly on any data!
Meta-Algorithm #4
Play Algorithm #2 against Algorithm W [#2 maximizes W’s mistakes]
• Freund and Schapire 1995
• Best learning algorithm in the late 1990s and early 2000s
• Authors won the Gödel Prize
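A compact sketch of the boosting loop in Python. The threshold “stumps”, the 4-point data set, and the greedy weak learner are illustrative stand-ins for Algorithm W; the reweighting step is Algorithm #2’s multiplicative update applied to the distribution over data, in AdaBoost’s usual exponential form:

```python
import math

def boost(weak_hyps, X, y, T=5):
    """Meta-Algorithm #4 (AdaBoost-style sketch).  Maintains a distribution d
    over examples, repeatedly asks the weak learner for the best hypothesis
    under d, and up-weights the examples that hypothesis got wrong."""
    d = [1.0 / len(X)] * len(X)        # distribution over examples
    ensemble = []                      # list of (alpha, hypothesis)
    for _ in range(T):
        # Weak learner: hypothesis with the lowest weighted error under d.
        h = min(weak_hyps,
                key=lambda g: sum(di for di, xi, yi in zip(d, X, y) if g(xi) != yi))
        err = sum(di for di, xi, yi in zip(d, X, y) if h(xi) != yi)
        if err >= 0.5:                 # no longer better than guessing
            break
        alpha = 0.5 * math.log((1.0 - err) / max(err, 1e-12))
        ensemble.append((alpha, h))
        # Multiplicative reweighting: wrong examples up, right ones down.
        d = [di * math.exp(-alpha * yi * h(xi))
             for di, xi, yi in zip(d, X, y)]
        z = sum(d)
        d = [di / z for di in d]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Decision stumps on a toy 1-d data set that no single stump classifies
# perfectly (labels alternate: +1, -1, +1, -1).
stumps = [(lambda x, t=t, s=s: s if x < t else -s)
          for t in (0.5, 1.5, 2.5) for s in (1, -1)]
X, y = [0, 1, 2, 3], [1, -1, 1, -1]
strong = boost(stumps, X, y, T=5)
```

After a few rounds the weighted vote of imperfect stumps classifies all four points, which no single stump can do.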
Deep thought #4
Your teachers are not your friends
The story so far…
• Boy met girl
• Boy spent 100s of millions of dollars wooing girl with deductive reasoning
• Girl showed no interest; boy became very sad
• Boy learnt to learn from mistakes

Next: Boy invites girl for coffee. Girl accepts!
Online Convex Opt. (deep learning)
Apply Algorithm #3 to nonconvex optimization.
• Theorems don’t work (not convex) → tons of engineering on top of #3
• Amazing performance
• New mathematics needs to be invented!

In the last 2 years deep learning has:
• Beaten human performance at object recognition (ImageNet)
• Outperformed humans at recognising street signs (Google Street View)
• Reached superhuman performance on Atari games (DeepMind)
• Delivered real-time translation: English voice to Chinese text and voice
Thank you!
#1. Halving
#2. Multiplicative Weights, Exponential Weights Algorithm (EWA)
#3. Online Gradient Descent (OGD), Stochastic Gradient Descent (SGD), Mirror Descent, Backpropagation
#4. AdaBoost

Details? Lecture notes on my webpage: https://dl.dropboxusercontent.com/u/5874168/math482.pdf
Vladimir Vapnik
Alexey Chervonenkis, 1938 — 2014
“[A] theory of induction is superfluous. It has no function in a logic of science. The best we can say of a hypothesis is that up to now it has been able to show its worth, and that it has been more successful than other hypotheses although, in principle, it can never be justified, verified, or even shown to be probable. This appraisal of the hypothesis relies solely upon deductive consequences (predictions) which may be drawn from the hypothesis: There is no need to even mention induction.”
— Karl Popper
“the learning process may be regarded as a search for a form of behaviour which will satisfy the teacher (or some other criterion)”
— Alan Turing (1950)