
Administrivia, Introduction to Online Learning

CS159: Advanced Topics in Machine Learning

3/29/2016

Class Details

•  Instructor: Yisong Yue
•  TAs: Hoang Le, Stephan Zheng
•  Course Website: http://www.yisongyue.com/courses/cs159/

Style of Course

•  Graduate-level course

•  Give students an overview of topics

•  Dig deep into one topic for final project

•  Assume students are mathematically mature
   – Goal is to understand basic concepts
   – Understand specific mathematical details depending on your interest

Grading Breakdown

•  Participation (20%)

•  Mini-quizzes (10%)

•  Final Project (70%)

Paper Reading & Discussion

•  Paper Reading Course
   – Reading assignments for each lecture
   – Lectures more like discussion

•  Student Presentations
   – Presentation schedule signup soon
   – Present in groups
   – Can choose which paper(s) to present

Mini-quizzes

•  Evening after every lecture
   – Very short
   – Easy if you read material & attended lecture

•  Released via Piazza
   – Also use Piazza for Q&A

Final Project

•  Can be on any topic related to the course

•  Work in groups

•  Will release timeline of progress reports soon

•  Peer review (?)

Topics

•  Online Learning
•  Multi-armed Bandits
•  Active Learning
•  Crowdsourcing
•  Reinforcement Learning
•  Models of Human Decision Making

Focus of Course

•  Rigorous algorithm design
   – Math intensive, but nothing too hard
   – Will walk through relevant math in class

•  Apply to interesting applications
   – What are the right ways to model a problem?

What Does Rigorous Mean?

•  Formal model
   – Explicitly state your assumptions

•  Rigorously reason about how your algorithm solves the model
   – Sometimes with provable guarantees

•  Argue that your model is a reasonable one

What Makes a Good Final Project?

•  Pure Theory
   – Study proof techniques, try to extend a proof, or apply it to a new setting

•  Algorithms
   – Extend algorithms, design new ones, for new settings

•  Modeling
   – Model a new setting: what are the right assumptions?

Outline

•  First 3-5 lectures
   – Review basic algorithms
   – Somewhat dry, but necessary

•  Topics/readings chosen by students
   – With curating from Instructor & TAs
   – List of papers already on website (but is negotiable)

Rest of Today

•  Introduction to Online Learning
   – Follow the Leader
   – Perceptron

•  Brief Overview of Other Topics in Course

Introduction to Online Learning

(Most Basic) Online Learning

•  For t = 1, …, T (sometimes T is unknown):
   – Algorithm chooses $p_t$
   – World reveals loss function $L_t$
   – Algorithm suffers loss $L_t(p_t)$

•  Goal: minimize total loss

$$\sum_{t=1}^{T} L_t(p_t)$$

What are the semantics of $p_t$?

What is the loss?

How is the loss chosen?

Recall: Supervised Learning

$$\operatorname*{argmin}_{w} \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w)\big), \qquad S = \{(x_i, y_i)\}_{i=1}^{N}$$

•  Optimize via Stochastic Gradient Descent
   – Maintain a $w_t$
   – Each iteration receive: $L_t(w_t) = L\big(y_i, f(x_i \mid w_t)\big)$
   – Assume $(x_i, y_i)$ sampled randomly from $S$
   – Choose $w_{t+1}$ based on $w_t$ and $L_t$

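(Most Basic) Online Learning

•  For t = 1, …, T (sometimes T is unknown):
   – Algorithm chooses $p_t$
   – World reveals loss function $L_t$
   – Algorithm suffers loss $L_t(p_t)$

•  Goal: minimize total loss

$$\sum_{t=1}^{T} L_t(p_t)$$

For SGD in supervised learning:

$p_t = w_t$

$L_t(w_t) = L\big(y_t, f(x_t \mid w_t)\big)$

$L_t$ chosen randomly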
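The correspondence above can be sketched in a few lines. This is a minimal illustration, not the course's reference code: it assumes a one-dimensional linear model f(x | w) = w·x, squared loss, and a fixed step size `lr`, none of which are specified on the slide. The names `online_sgd`, `lr`, and `w0` are hypothetical.

```python
import random


def online_sgd(stream, lr=0.1, w0=0.0):
    """Run the online protocol with p_t = w_t on a stream of (x_t, y_t) pairs.

    Assumptions (not from the slides): f(x | w) = w * x and squared loss.
    """
    w = w0
    total_loss = 0.0
    for x, y in stream:
        pred = w * x                   # algorithm commits to w_t first
        loss = (y - pred) ** 2         # world reveals L_t; algorithm suffers L_t(w_t)
        total_loss += loss
        grad = -2.0 * (y - pred) * x   # derivative of L_t with respect to w at w_t
        w = w - lr * grad              # choose w_{t+1} from w_t and L_t
    return w, total_loss


# Hypothetical stream: y = 2x with x drawn uniformly from [-1, 1].
rng = random.Random(0)
data = []
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x))

w_final, total = online_sgd(data)
```

The loop never looks at the whole dataset or at T; it only needs the current weight and the current loss, which is what makes the same code usable on an unbounded stream.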

What if…

•  We receive a constant stream of data?
   – Don't know T a priori

•  We receive data in some arbitrary way?
   – Not sampled independently from some distribution

•  Can we still (provably) achieve good performance?

Quantifying Performance

•  In supervised learning we care about (a single $w$):

$$\sum_{i=1}^{N} L\big(y_i, f(x_i \mid w)\big) = \sum_{i=1}^{N} L_i(w)$$

•  In online learning, we care about (a sequence of $w_t$):

$$\sum_{t=1}^{T} L\big(y_t, f(x_t \mid w_t)\big) = \sum_{t=1}^{T} L_t(w_t)$$

Quantifying Performance

•  Compete against single best $w$ in hindsight:

$$\sum_{t=1}^{T} L_t(w^*) = \min_{w} \sum_{t=1}^{T} L_t(w)$$

Interpretation: best possible loss w.r.t. supervised learning

$$R(T) = \sum_{t=1}^{T} L_t(w_t) - \sum_{t=1}^{T} L_t(w^*) \qquad \text{(“Regret”)}$$

Interpreting Regret

•  Expected Training Error is:

$$\frac{1}{T} \sum_{t=1}^{T} L_t(w_t)$$

•  Want expected training error to (quickly) converge to optimal
   – Equivalent to average regret (quickly) converging to 0:

$$\frac{1}{T} R(T) = \frac{1}{T} \left( \sum_{t=1}^{T} L_t(w_t) - \sum_{t=1}^{T} L_t(w^*) \right) \to 0$$

•  Satisfied when regret grows sublinearly w.r.t. T!

Summary of Regret

•  Generic way to quantify performance
   – Characterizes speed of convergence for SGD

•  Applies to many online learning settings

•  We'll see other ways to quantify performance later in course

Follow the Leader

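Basic Online Convex Optimization

•  For t = 1, …, T (T unknown):
   – Algorithm chooses $p_t$ in $\mathbb{R}^D$
   – World reveals loss function $L_t(p_t) = \|y_t - p_t\|^2$ (squared distance to $y_t$; in general, a convex loss)
   – Algorithm suffers loss $L_t(p_t)$

•  Goal: minimize total loss

$$\sum_{t=1}^{T} L_t(p_t)$$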

Follow the Leader Algorithm

•  The “leader” is the best point given what we know so far:

$$p_t = \operatorname*{argmin}_{p} \sum_{t'=1}^{t-1} L_{t'}(p) = \operatorname*{argmin}_{p} \sum_{t'=1}^{t-1} \|y_{t'} - p\|^2 = \frac{1}{t-1} \sum_{t'=1}^{t-1} y_{t'}$$

This is the entire algorithm!
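For squared distance, the leader is just the running mean, so the algorithm can be sketched as below. The formula leaves the first play undefined (there is no history at t = 1), so the sketch takes it as an explicit argument; that argument, and the names `follow_the_leader` and `p_first`, are assumptions made for illustration.

```python
def follow_the_leader(ys, p_first):
    """Return the FTL plays p_1..p_T for squared loss L_t(p) = ||y_t - p||^2."""
    dim = len(p_first)
    running_sum = [0.0] * dim
    plays = []
    for t, y in enumerate(ys, start=1):
        if t == 1:
            p = list(p_first)  # no history yet; the first play is an assumption
        else:
            p = [s / (t - 1) for s in running_sum]  # mean of y_1..y_{t-1}
        plays.append(p)
        # y_t is revealed only after p_t has been committed
        running_sum = [s + yi for s, yi in zip(running_sum, y)]
    return plays


plays = follow_the_leader(
    [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]], p_first=[0.0, 0.0]
)
```

Keeping a running sum means each round costs O(D) time, which is the "trivial" case for FTL; for other losses the argmin generally has no such closed form.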

Benefits and Drawbacks

•  Benefits:
   – Efficient regret bounds (will see next slide)
   – Conceptually very simple
   – Can be applied to many settings

•  Drawbacks:
   – Can be computationally very expensive for arbitrary loss functions (can't use the average all the time)

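Definitions

•  Best hindsight choice of first t time steps:

$$p_t^* = \operatorname*{argmin}_{p} \sum_{t'=1}^{t} L_{t'}(p) = \operatorname*{argmin}_{p} \sum_{t'=1}^{t} \|y_{t'} - p\|^2 = \frac{1}{t} \sum_{t'=1}^{t} y_{t'}$$

•  Follow the Leader plays:

$$p_t = p_{t-1}^* = \operatorname*{argmin}_{p} \sum_{t'=1}^{t-1} L_{t'}(p) = \operatorname*{argmin}_{p} \sum_{t'=1}^{t-1} \|y_{t'} - p\|^2 = \frac{1}{t-1} \sum_{t'=1}^{t-1} y_{t'}$$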

Goal

•  Minimize Regret:

$$R(T) = \sum_{t=1}^{T} L_t(p_t) - \sum_{t=1}^{T} L_t(p_T^*)$$

$$p_T^* = \operatorname*{argmin}_{p} \sum_{t=1}^{T} L_t(p) = \operatorname*{argmin}_{p} \sum_{t=1}^{T} \|y_t - p\|^2 = \frac{1}{T} \sum_{t=1}^{T} y_t$$

Lemma 1

$$\sum_{t=1}^{T} L_t(p_t^*) \le \sum_{t=1}^{T} L_t(p_T^*)$$

•  Interpretation:
   – The moving best hindsight is at least as good as the final best hindsight

•  Proof by Induction
   – Base case (T = 1):

$$L_1(p_1^*) = L_1(p_1^*)$$

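Proof Continued

•  Inductive Case (T > 1):
   – Remove the last term because it is the same on both sides ($L_T(p_T^*)$):

$$\sum_{t=1}^{T} L_t(p_t^*) \le \sum_{t=1}^{T} L_t(p_T^*) \iff \sum_{t=1}^{T-1} L_t(p_t^*) \le \sum_{t=1}^{T-1} L_t(p_T^*)$$

   – Observe:

$$\sum_{t=1}^{T-1} L_t(p_t^*) \le \sum_{t=1}^{T-1} L_t(p_{T-1}^*) \le \sum_{t=1}^{T-1} L_t(p_T^*)$$

The first inequality is the Inductive Hypothesis; the second follows from the Definition of $p^*$ ($p_{T-1}^*$ minimizes the sum over the first T−1 steps).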

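Regret Bound

$$\begin{aligned}
R(T) &= \sum_{t=1}^{T} L_t(p_t) - \sum_{t=1}^{T} L_t(p_T^*) \\
&= \sum_{t=1}^{T} L_t(p_{t-1}^*) - \sum_{t=1}^{T} L_t(p_T^*) && \text{(Definition of Follow the Leader)} \\
&\le \sum_{t=1}^{T} L_t(p_{t-1}^*) - \sum_{t=1}^{T} L_t(p_t^*) && \text{(Lemma 1)}
\end{aligned}$$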

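Regret Bound (continued)

$$\begin{aligned}
\sum_{t=1}^{T} L_t(p_{t-1}^*) - \sum_{t=1}^{T} L_t(p_t^*)
&= \sum_{t=1}^{T} \|p_{t-1}^* - y_t\|^2 - \sum_{t=1}^{T} \|p_t^* - y_t\|^2 \\
&= \sum_{t=1}^{T} \big\langle p_{t-1}^* - p_t^*,\; p_{t-1}^* + p_t^* - 2y_t \big\rangle \\
&\le \sum_{t=1}^{T} \|p_{t-1}^* - p_t^*\| \cdot \|p_{t-1}^* + p_t^* - 2y_t\| && \text{(Cauchy-Schwarz)} \\
&\le \sum_{t=1}^{T} \|p_{t-1}^* - p_t^*\| \cdot \big( \|p_{t-1}^*\| + \|p_t^*\| + 2\|y_t\| \big) && \text{(Triangle Inequality)}
\end{aligned}$$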
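The step from the difference of squared norms to an inner product is a difference-of-squares identity. A short derivation, where $u$ and $v$ are shorthand introduced here and not on the slide:

```latex
\begin{aligned}
&\text{Let } u = p_{t-1}^* - y_t, \quad v = p_t^* - y_t. \\
&\langle u - v,\; u + v \rangle
  = \|u\|^2 + \langle u, v \rangle - \langle v, u \rangle - \|v\|^2
  = \|u\|^2 - \|v\|^2, \\
&u - v = p_{t-1}^* - p_t^*, \qquad u + v = p_{t-1}^* + p_t^* - 2y_t, \\
&\Rightarrow\; \|p_{t-1}^* - y_t\|^2 - \|p_t^* - y_t\|^2
  = \big\langle p_{t-1}^* - p_t^*,\; p_{t-1}^* + p_t^* - 2y_t \big\rangle .
\end{aligned}
```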


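Assume each $y_t$ has norm bounded by $B$:

$$\sum_{t=1}^{T} \|p_{t-1}^* - p_t^*\| \cdot \big( \|p_{t-1}^*\| + \|p_t^*\| + 2\|y_t\| \big) \le 4B \sum_{t=1}^{T} \|p_{t-1}^* - p_t^*\|$$

Note that each $p^*$ also has norm bounded by $B$ (it is an average of the $y_t$).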


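Use the fact that:

$$p_t^* = \frac{(t-1)\,p_{t-1}^* + y_t}{t}$$

$$\begin{aligned}
\|p_{t-1}^* - p_t^*\| &= \left\| p_{t-1}^* - \frac{(t-1)\,p_{t-1}^* + y_t}{t} \right\| \\
&= \frac{1}{t} \|p_{t-1}^* - y_t\| \\
&\le \frac{1}{t} \big( \|p_{t-1}^*\| + \|y_t\| \big) && \text{(Triangle Inequality)} \\
&\le \frac{2B}{t} && \text{(each has norm at most } B\text{)}
\end{aligned}$$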

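Regret Bound (complete)

$$\begin{aligned}
R(T) &= \sum_{t=1}^{T} L_t(p_t) - \sum_{t=1}^{T} L_t(p_T^*) \\
&\le \sum_{t=1}^{T} L_t(p_{t-1}^*) - \sum_{t=1}^{T} L_t(p_t^*) \\
&\le 4B \sum_{t=1}^{T} \|p_{t-1}^* - p_t^*\| \\
&\le 8B^2 \sum_{t=1}^{T} \frac{1}{t} = O\big(B^2 \ln T\big)
\end{aligned}$$

Logarithmic Regret!

Independent of how each $y_t$ is chosen!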
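A quick numerical sanity check of the bound, as a sketch under stated assumptions: the first play is taken to be the origin (the slides leave $p_1$ unspecified; the origin has norm at most $B$, so the same argument goes through), and the stream alternates between two opposite points of norm $B$. The function names and the choice of stream are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ftl_regret(ys):
    """Regret of FTL (with p_1 = 0, an assumption) against the best fixed point."""
    dim = len(ys[0])
    running_sum = [0.0] * dim
    ftl_loss = 0.0
    for t, y in enumerate(ys, start=1):
        p = [0.0] * dim if t == 1 else [s / (t - 1) for s in running_sum]
        ftl_loss += sq_dist(y, p)
        running_sum = [s + yi for s, yi in zip(running_sum, y)]
    best = [s / len(ys) for s in running_sum]   # p_T^*: the mean of all y_t
    best_loss = sum(sq_dist(y, best) for y in ys)
    return ftl_loss - best_loss


B, T = 1.0, 2000
# Non-IID stream: alternate between two opposite points of norm B.
ys = [[B, 0.0] if t % 2 == 0 else [-B, 0.0] for t in range(T)]

regret = ftl_regret(ys)
bound = 8 * B ** 2 * sum(1.0 / t for t in range(1, T + 1))
```

On this stream the regret stays well below `bound` and far below T, consistent with logarithmic growth; other bounded streams can be substituted for `ys` to probe the claim that the bound does not depend on how each $y_t$ is chosen.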

Recall: Interpreting Regret

•  Expected Training Error is:

$$\frac{1}{T} \sum_{t=1}^{T} L_t(w_t)$$

•  Want expected training error to (quickly) converge to optimal
   – Equivalent to average regret (quickly) converging to 0:

$$\frac{1}{T} R(T) = \frac{1}{T} \left( \sum_{t=1}^{T} L_t(w_t) - \sum_{t=1}^{T} L_t(w^*) \right) \to 0$$

•  Satisfied when regret grows sublinearly w.r.t. T!
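As a worked instance, the Follow the Leader bound $R(T) \le 8B^2 \sum_{t=1}^{T} 1/t$ can be plugged into the average regret. The only extra ingredient is the standard harmonic-sum bound $\sum_{t=1}^{T} 1/t \le 1 + \ln T$, which is not stated on the slides:

```latex
\frac{1}{T} R(T)
  \;\le\; \frac{8B^2}{T} \sum_{t=1}^{T} \frac{1}{t}
  \;\le\; \frac{8B^2 \,(1 + \ln T)}{T}
  \;\longrightarrow\; 0
  \quad \text{as } T \to \infty .
```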

When Should You Use FTL in Practice?

•  When solving each optimization problem is not the bottleneck
   – For simple squared distance, it is trivial
   – For more complex loss functions, might require expensive optimization

•  We will see an analysis of SGD-style algorithms next Tuesday
   – Make small updates to $p_t$ using only $L_t$

Perceptron


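Binary Classification Online Learning

•  For t = 1, …, T:
   – Algorithm chooses $w_t$ in $\mathbb{R}^D$
   – World reveals loss function: $L_t(w_t) = \mathbb{1}\big[\, y_t \ne \operatorname{sign}(\langle w_t, x_t \rangle) \,\big]$ (0/1 loss)
   – Algorithm suffers loss $L_t(w_t)$

•  Goal: minimize total loss

$$\sum_{t=1}^{T} L_t(w_t)$$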

Perceptron Learning Algorithm

$y \in \{-1, +1\}$, $x \in \mathbb{R}^D$

If $L_t(w_t) = 1$: $w_{t+1} = w_t + y_t x_t$

Else: $w_{t+1} = w_t$
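The update rule can be sketched directly. Two details are assumptions rather than part of the slide: the weights start at zero, and `sign(0)` is taken to be +1. The toy data set is hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sign(z):
    # Convention (an assumption): sign(0) = +1.
    return 1 if z >= 0 else -1


def perceptron(stream, dim):
    """Online perceptron. Returns the final weights and the number of mistakes."""
    w = [0.0] * dim                  # w_1 = 0 (an assumption)
    mistakes = 0
    for x, y in stream:              # y is in {-1, +1}
        if sign(dot(w, x)) != y:     # L_t(w_t) = 1
            w = [wi + y * xi for wi, xi in zip(w, x)]  # w_{t+1} = w_t + y_t x_t
            mistakes += 1
        # otherwise w_{t+1} = w_t
    return w, mistakes


# Hypothetical linearly separable examples, cycled five times.
examples = [
    ([1.0, 2.0], 1),
    ([2.0, 1.0], 1),
    ([-1.0, -2.0], -1),
    ([-2.0, -0.5], -1),
]
w, mistakes = perceptron(examples * 5, dim=2)
```

On this toy stream the only mistake happens on the third example of the first pass, after which the weights stop changing.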

Perceptron Learning (Assume Linearly Separable)

[Figure sequence: the perceptron processes training points one at a time. Frames alternate between “Misclassified!”, “Update!”, and “Correct!” until the final frame, “All Training Examples Correctly Classified!”]


Regret Bound = Mistake Bound (for Separable Case)

$$R(T) = \sum_{t=1}^{T} L_t(w_t) - \sum_{t=1}^{T} L_t(w^*)$$

•  For separable case:

$$\sum_{t=1}^{T} L_t(w^*) = 0$$

•  Regret = # Mistakes Perceptron makes

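Lemma 2

Let $I$ be the set of mistake iterations. Then:

$$\Big\| \sum_{t \in I} y_t x_t \Big\| \le \sqrt{\sum_{t \in I} \|x_t\|^2}$$

Proof (assuming $w_1 = 0$, so that the telescoping sums start from zero):

$$\begin{aligned}
\Big\| \sum_{t \in I} y_t x_t \Big\| &= \Big\| \sum_{t \in I} (w_{t+1} - w_t) \Big\| = \|w_{T+1}\| \\
&= \sqrt{\sum_{t \in I} \big( \|w_{t+1}\|^2 - \|w_t\|^2 \big)} && \text{(Telescoping Sum)} \\
&= \sqrt{\sum_{t \in I} \big( \|w_t + y_t x_t\|^2 - \|w_t\|^2 \big)} && \text{(Update Definition)} \\
&= \sqrt{\sum_{t \in I} \big( 2 y_t \langle w_t, x_t \rangle + \|x_t\|^2 \big)} \\
&\le \sqrt{\sum_{t \in I} \|x_t\|^2} && \text{(} y_t \langle w_t, x_t \rangle \le 0 \text{ on every mistake)}
\end{aligned}$$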

Perceptron Mistake Bound

# Mistakes Bounded By (if linearly separable):

$$\frac{B^2}{\gamma^2}$$

•  $\gamma$: Margin
•  $B = \max_{x} \|x\|$: “Radius” of Feature Space

Holds for any ordering of training examples!

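Proof

•  Margin:

$$\gamma = \max_{w} \min_{(x_t, y_t)} \frac{y_t \langle w, x_t \rangle}{\|w\|}$$

Must be positive due to linear separability.

With $w$ the maximizer above and $I$ the set of mistake iterations:

$$|I|\,\gamma \le \frac{\big\langle w, \sum_{t \in I} y_t x_t \big\rangle}{\|w\|} \le \Big\| \sum_{t \in I} y_t x_t \Big\| \le \sqrt{\sum_{t \in I} \|x_t\|^2} \le \sqrt{|I|\, B^2}$$

$$|I|\,\gamma \le \sqrt{|I|\, B^2} \;\Rightarrow\; |I| \le \frac{B^2}{\gamma^2}$$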
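The bound can be checked empirically. The sketch below is illustrative only: it builds hypothetical 2-D data with a known separator `u` and a forced margin of at least 0.1, then compares the mistake count with $B^2 / \gamma_u^2$, where $\gamma_u$ is the margin achieved by `u`. Since $\gamma_u$ is at most the maximum margin $\gamma$, this is a looser bound than $B^2/\gamma^2$, but the same proof applies to any separating direction. The weights start at zero and `sign(0)` is treated as +1, both assumptions.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def count_perceptron_mistakes(stream, dim):
    """Run the perceptron from w_1 = 0 and count the mistake iterations."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        predicted = 1 if dot(w, x) >= 0 else -1
        if predicted != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return mistakes


rng = random.Random(1)
u = [1.0, 1.0]                      # hypothetical known separator
u_norm = math.sqrt(dot(u, u))

data = []
while len(data) < 200:
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    score = dot(u, x) / u_norm
    if abs(score) >= 0.1:           # keep only points with margin at least 0.1
        data.append((x, 1 if score > 0 else -1))

B = max(math.sqrt(dot(x, x)) for x, _ in data)               # radius
gamma_u = min(y * dot(u, x) / u_norm for x, y in data)       # margin of u
bound = B ** 2 / gamma_u ** 2

rng.shuffle(data)                   # the bound should hold for any ordering
mistakes = count_perceptron_mistakes(data * 10, dim=2)
```

Reshuffling `data` or changing the number of passes changes which examples trigger updates, but the total mistake count should never exceed `bound`.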

Interpretation

•  If the data is linearly separable

•  Then ANY ordering of (x, y) will cause the perceptron to converge with finitely many mistakes

•  No dependence on IID sampling from a true distribution

Brief Overview of Other Topics

Contextual Online Learning (aka Online Learning with Experts)

•  Given: Set of experts $\{f_k\}$
•  For t = 1, …, T (sometimes T is unknown):
   – Each expert predicts $f_{k,t}$
   – Algorithm chooses $p_t$
   – World reveals loss function $L_t$
   – Algorithm suffers loss $L_t(p_t)$

•  Goal: minimize total loss

$$\sum_{t=1}^{T} L_t(p_t)$$

Generalizes Boosting

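Partial Information Online Learning

•  For t = 1, …, T:
   – Algorithm chooses $p_t$
   – World reveals loss $L_t(p_t)$
   – Algorithm suffers loss $L_t(p_t)$

•  Goal: minimize total loss

$$\sum_{t=1}^{T} L_t(p_t)$$

We don't know the loss of other choices

Need to “explore” to measure the loss of alternatives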

Basic Active Learning (for supervised learning)

•  For t = 1, …:
   – Algorithm chooses x
   – World reveals associated label y
   – Add (x, y) to training set

•  Terminate when sufficiently confident of best model

Simple Example

•  1 feature
•  Learn threshold function

[Figure: Passive Learning (sample from distribution); shows the True Model and the Learned Model.]

Simple Example

•  1 feature
•  Learn threshold function

[Figure: Active Learning (binary search); shows the True Model.]

Comparison with Passive Learning

•  # samples to be within ε of true model

•  Passive Learning:

$$O\!\left(\frac{1}{\varepsilon}\right)$$

•  Active Learning:

$$O\!\left(\log \frac{1}{\varepsilon}\right)$$
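The logarithmic label count comes from binary search. A minimal sketch, assuming a single feature on [0, 1], noise-free labels, and a true model that labels x as 1 exactly when x is at or above the threshold; the oracle, the interval, and all names are hypothetical.

```python
def label(x, threshold):
    """Hypothetical oracle: the true model labels x as 1 iff x >= threshold."""
    return 1 if x >= threshold else 0


def active_learn_threshold(oracle, eps):
    """Binary search on [0, 1]; returns (estimate, number of label queries)."""
    lo, hi = 0.0, 1.0       # invariant: the threshold lies in [lo, hi]
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if oracle(mid) == 1:
            hi = mid        # mid is at or above the threshold
        else:
            lo = mid        # mid is below the threshold
    return (lo + hi) / 2, queries


true_threshold = 0.3721
eps = 1e-3
estimate, queries = active_learn_threshold(
    lambda x: label(x, true_threshold), eps
)
```

Each query halves the interval, so reaching width ε = 10⁻³ takes 10 labels here, whereas the passive rate of O(1/ε) corresponds to on the order of a thousand random samples for the same accuracy.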


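Crowdsourcing

[Figure: Labeled and Unlabeled data. Unlabeled data is cheap and abundant; labeled data is expensive and scarce, requiring a human expert, special equipment, or an experiment. Example labels: “Mushroom” (Object Recognition [Krizhevsky, Sutskever, Hinton 2012]; credit line on figure: Y LeCun, MA Ranzato); “Crystal”, “Needle”, “Empty”; “0”, “1”, “2”, …; “Sports”, “News”, “Science”, …. The labeled set is initially empty, and items move from unlabeled to labeled as the process repeats.]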

How Reliable are Annotators?

•  If we knew what the labels were
   – Can judge workers on label quality

•  If we knew who the good workers were
   – Can create labels from their annotations

•  Chicken and egg problem!

Reinforcement Learning

•  In previous settings:
   – Actions do not impact state
   – “Stateless”

•  Reinforcement Learning
   – Actions affect the state you're in
   – Reward function depends on state
   – Example: Playing Go

Off-Policy Evaluation

•  Example: We have hospital logs of pneumonia deaths under various conditions.

   – Want to train a model to predict who is most at risk

   – Model predicts that asthma patients have LOWER risk of pneumonia death…

   – Because doctors pay closer attention to asthma patients!

Modeling Human Decision Making

•  How do humans react in sequential decision making processes?

   – Do they behave like follow the leader?

   – Do they behave like a perceptron?
