
10-405 ML from Large Datasets

William Cohen

ADMINISTRIVIA

Who/What/Where/When
• Wiki: google://”cohen CMU” → teaching →
– http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018
– this should point to everything else
• Office Hours:
– Still TBA for all of us, check the wiki
• Course assistant: Dorothy Holland-Minkley (dfh@cs)
• Instructor:
– William W. Cohen
• TAs:
– Intros coming up….

Who/What/Where/When
• 10-405 is a brand new course!
• But it’s going to be very close to 10-605, which has been happening since 2012
• It’s aimed at undergrads
– will cover about the same topics
– not a project course
– one chance to do an open-ended “extension of an assignment”
• Anyone can enroll, but not all grad programs will give you full credit for 4xx courses, so read the fine print

Who/What/Where/When
• Who is here?

Who/What/Where/When
• Most days there will be an on-line quiz you should do after lecture
• The quizzes will usually close Friday noon
• We have one today: see the wiki!
• They don’t count for a lot but there are no makeups
– Quizzes reinforce the lecture
– And make sure you keep up

Who/What/Where/When
• William W. Cohen
– 1990-mid 1990’s: AT&T Bell Labs (working on ILP and scalable propositional rule-learning algorithms)
– Mid 1990’s-2000: AT&T Research (text classification, data integration, knowledge-as-text, information extraction)
– 2000-2002: Whizbang! Labs (info extraction from Web)
– 2002-2008, 2010-now: CMU (IE, biotext, social networks, learning in graphs, info extraction from Web, scalable first-order learning)
• 2008-2009, Jan-Jul 2017: Visiting Scientist at Google

I am a Master's student in the Information Networking Institute, and I will be joining the Microsoft AI & Research team after this semester. My research interests are in scalable machine learning and deep learning.

I took the course last semester (Fall 2017) and learnt a lot.

Hope you all have a great semester. See you at the office hours!

Vidhan Agarwal (MSIN)

Sarthak Garg - MS in CS
• I am a first year Masters student in the Computer Science Department
• I am interested in Deep Generative Models and Distributed Systems for Machine Learning
• I took 10-605 last fall and found it very interesting, hope you enjoy the course!

Nitish Kulkarni - MS in MCDS

I am a first year master's student in the Master of Computational Data Science program, LTI Department.

I’m interested in scalable machine learning algorithms and information retrieval.

I took 10-805 in fall ‘17. The course was quite fun and incredibly useful. I’m glad to be a TA for the course this semester, and hope you have a similar experience.

Vivek Shankar - BS in SCS
• I am a fourth year undergraduate in the School of Computer Science.
• This summer, I interned at Google, Montreal working on building a URL-based Machine Learning model for detecting malicious Chrome extensions. I’ll be joining Google full time in Pittsburgh.
• I am interested in designing parallelizable machine learning algorithms that scale well to large datasets, a big theme in 10605. I took 10605 in Fall '17, and it was one of my favorite classes thus far at CMU!

What/How
I kind of like language tasks, especially for this task:
– The data (usually) makes sense
– The models (usually) make sense
– The models (usually) are complex, so
• More data actually helps
• Learning simple models vs complex ones is sometimes computationally different

What/How
• Programming Languages and Systems:
– Python
– Hadoop and Spark
• Resources:
– unix.andrew machines
– Stoat hadoop cluster:
• 104 worker nodes, with 8 cores, 16GB RAM, 41TB.
• 30 worker nodes, with 8 cores, 16GB RAM, 250Gb+
– Amazon Elastic Cloud
• Amazon EC2 [http://aws.amazon.com/ec2/]
• Allocation: $50 worth of time per student

What/How: 601 co-req
• You should have as a prereq or co-req one of the MLD’s intro ML courses: 10-401, 10-601, 10-701, 10-715
• Lectures are designed to complement that material
– computational aspects vs informational aspects

What/How: cheating vs working together
• I have a long and explicit policy
– stolen from Roni Rosenfeld - read the web page
– tl;dr: transmit information like they did in the stone age, brain-to-brain, and document it
– do not copy anything digitally
– exceptions (eg projects) will be explicitly stated
– everybody involved will fail by default
– every infraction always gets reported up to the Office of Academic Integrity, the head of your program, the dean of your school, ….
– a second offense is very bad

BIG DATA HISTORY: FROM THE DAWN OF TIME TO THE PRESENT

Big ML c. 1993 (Cohen, “Efficient…Rule Learning”, IJCAI 1993)

$ ripper ../tdata/talks
Final hypothesis is:
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).
talk_announcement :- WORDS ~ '2d416' (26/3).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).
talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).
talk_announcement :- WORDS ~ presentations (2/1).
default non_talk_announcement (390/1).

More on this paper

Algorithm
• Phase 1: build rules
– Discrete greedy search:
– Starting with empty rule set, add conditions greedily
• Phase 2: prune rules
– starting with phase 1 output, remove conditions

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma.
talk_announcement :- WORDS ~ '2d416', WORDS ~ be.
talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).
talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).
talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).
talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).
default non_talk_announcement .



• Phase 2: prune rules
– starting with phase 1 output, remove conditions, greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma (54/0).
talk_announcement :- WORDS ~ '2d416', WORDS ~ be (19/0).
talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).
talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).
talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).
talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).
default non_talk_announcement .



• Phase 2: prune rules
– starting with phase 1 output, remove conditions, greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).
talk_announcement :- WORDS ~ '2d416' (26/3).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).
talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).
talk_announcement :- WORDS ~ presentations (2/1).
default non_talk_announcement (390/1).


• Fit the POS, NEG examples
• While POS isn’t empty:
– Let R be “if True → pos”
– While NEG isn’t empty:
• Pick the “best” [i] condition c of the form “xi=True” or “xi=false”
• Add c to the LHS of R
• Remove examples that don’t satisfy c from NEG
– Add R to the rule set [ii]
– Remove examples that satisfy R from POS
• Prune the rule set:
– …

[i] “Best” is wrt some statistics on c’s coverage of POS, NEG

[ii] R is now of the form “if xi1=_ and xi2=_ and … → pos”

Analysis

• The total number of iterations of L1 is the number of conditions in the rule set: call it m
• Picking the “best” condition requires looking at all examples: say there are n of these
• Time is at least m*n
• The problem:
– When there are noisy positive examples the algorithm builds rules that cover just 1-2 of them
– So with huge noisy datasets you build huge rule sets

[Slide annotations: “L1” (label on the inner loop), “quadratic”, “cubic!”]
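The greedy growing phase described on this slide can be sketched in Python as follows. This is a minimal illustration, not the RIPPER implementation: the scoring function (positives kept minus negatives kept) is an assumption standing in for the unspecified "best" statistic, the names `grow_rules`, `best_condition`, and `satisfies` are hypothetical, and the sketch also tracks which positives the partial rule still covers so that every finished rule covers at least one positive.

```python
from typing import Dict, List, Optional, Tuple

Example = Dict[str, bool]      # feature name -> boolean value
Condition = Tuple[str, bool]   # (feature, required value), i.e. "x_i = True/False"
Rule = List[Condition]         # a conjunction of conditions that implies "pos"


def satisfies(example: Example, rule: Rule) -> bool:
    """True if the example meets every condition of the rule."""
    return all(example.get(feat, False) == val for feat, val in rule)


def best_condition(pos: List[Example], neg: List[Example],
                   features: List[str], rule: Rule) -> Optional[Condition]:
    """Pick a condition by scanning all examples: Omega(n) work per call.

    Assumed score: positives kept minus negatives kept.
    """
    best, best_score = None, None
    for feat in features:
        for val in (True, False):
            if (feat, val) in rule:
                continue
            p = sum(1 for e in pos if e.get(feat, False) == val)
            n = sum(1 for e in neg if e.get(feat, False) == val)
            if p == 0:
                continue  # a useful condition must keep at least one positive
            if best_score is None or p - n > best_score:
                best, best_score = (feat, val), p - n
    return best


def grow_rules(pos: List[Example], neg: List[Example],
               features: List[str]) -> List[Rule]:
    ruleset: List[Rule] = []
    pos = list(pos)
    while pos:                                  # outer loop: one rule per pass
        rule: Rule = []                         # "if True -> pos"
        cov_pos, cov_neg = list(pos), list(neg)
        while cov_neg:                          # inner loop: one condition per pass
            cond = best_condition(cov_pos, cov_neg, features, rule)
            if cond is None:
                break                           # nothing left to add (e.g. noisy data)
            rule.append(cond)
            feat, val = cond
            cov_pos = [e for e in cov_pos if e.get(feat, False) == val]
            cov_neg = [e for e in cov_neg if e.get(feat, False) == val]
        ruleset.append(rule)
        pos = [e for e in pos if not satisfies(e, rule)]
    return ruleset


# Toy data, invented for illustration only.
toy_pos = [{"talk": True, "spam": False}, {"talk": True, "spam": True}]
toy_neg = [{"talk": False, "spam": True}, {"talk": False, "spam": False}]
toy_rules = grow_rules(toy_pos, toy_neg, ["talk", "spam"])
```

Each pass of the inner loop adds one condition and scans the examples, which is where the m*n lower bound comes from.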


Related paper from 1995…


So in mid 1990’s…..
• Experimental datasets were small
• Many commonly used algorithms were asymptotically “slow”
• Not many people really cared

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context


Why More Data Helps: A Demo
• Data:
– All 5-grams that appear >= 40 times in a corpus of 1M English books
• approx 80B words
• 5-grams: 30Gb compressed, 250-300Gb uncompressed
• Each 5-gram contains frequency distribution over years

http://xkcd.com/ngram-charts/


Why More Data Helps: A Demo
• Data:
– All 5-grams that appear >= 40 times in a corpus of 1M English books
• approx 80B words
• 5-grams: 30Gb compressed, 250-300Gb uncompressed
• Each 5-gram contains frequency distribution over years
– Wrote code to compute
• Pr(A,B,C,D,E | C=affect or C=effect)
• Pr(any subset of A,…,E | any other fixed values of A,…,E with C=affect ∨ effect)
– Demo:
• /Users/wcohen/Documents/code/pyhack/bigml
• eg: python ngram-query.py data/aeffect-train.txt __Beffect__
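To make the conditional probabilities above concrete, here is a minimal sketch of how such a query could be computed from 5-gram counts. It is not the `ngram-query.py` script from the demo (whose interface is not shown here); the function name `confusable_distribution`, the in-memory dictionary format, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Tuple

FiveGram = Tuple[str, str, str, str, str]   # positions A, B, C, D, E


def confusable_distribution(ngram_counts: Dict[FiveGram, int],
                            context: Dict[int, str],
                            targets=("affect", "effect")) -> Dict[str, float]:
    """Estimate Pr(C = w | fixed context words, C in targets) from counts.

    `context` maps a position (0=A, 1=B, 3=D, 4=E) to the word required there.
    """
    totals = Counter()
    for gram, count in ngram_counts.items():   # one sequential pass over the data
        if gram[2] not in targets:
            continue
        if all(gram[pos] == word for pos, word in context.items()):
            totals[gram[2]] += count
    z = sum(totals.values())
    if z == 0:
        return {}                              # context never observed
    return {w: totals[w] / z for w in targets}


# Toy counts, invented for illustration only (not taken from the real corpus).
toy_counts = {
    ("does", "not", "affect", "the", "result"): 30,
    ("will", "not", "effect", "a", "change"): 10,
    ("has", "no", "effect", "on", "the"): 50,
}
dist_after_not = confusable_distribution(toy_counts, {1: "not"})
dist_no_context = confusable_distribution(toy_counts, {})
```

With an empty `context` the function returns the overall affect/effect split; adding fixed words narrows the estimate to 5-grams that match them.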


Why More Data Helps
• Data:
– All 5-grams that appear >= 40 times in a corpus of 1M English books
• approx 80B words
• 5-grams: 30Gb compressed, 250-300Gb uncompressed
• Each 5-gram contains frequency distribution over years
– Wrote code to compute
• Pr(A,B,C,D,E | C=affect or C=effect)
• Pr(any subset of A,…,E | any other fixed values of A,…,E with C=affect ∨ effect)
• Observations [from playing with data]:
– Mostly “effect”, not “affect”
– Most common word before “affect” is “not”
– After “not effect”, the most common word is “a”
– …

So in 2001…..
• We’re learning:
– “there’s no data like more data”
– For many tasks, there’s no real substitute for using lots of data

…and in 2009

Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.

Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

Norvig, Pereira, Halevy, “The Unreasonable Effectiveness of Data”, 2009

…and in 2012

Dec 2011

…and in 2013

…and in 2014

Bengio, Foundations & Trends, 2009

[Figure annotations: naïve vs clever optimization; 1M vs 10M examples; 2.5M examples for “pretraining”]

Today….
• Commonly used deep learning datasets:
– Images/videos:
• ImageNet: 20k+ categories, 14M+ images
• MS COCO: 91 categories, 2.5M labels, 328k images
• YouTube-8M: 8M urls, 4800 classes, 0.5M hours
– Reading comprehension:
• Children’s book test: 600k+ context/query pairs
• CNN/Daily Mail: ~300k docs, 1.2M cloze questions
– Other:
• Ubuntu dialog: 7M+ utterances, 1M+ dialogs
• …

REVIEW: ASYMPTOTIC COMPLEXITY

How do we use very large amounts of data?
• Working with big data is not about
– code optimization
– learning details of today’s hardware/software:
• GraphLab, Hadoop, Spark, parallel hardware, ….
• Working with big data is about
– Understanding the cost of what you want to do
– Understanding what the tools that are available offer
– Understanding how much can be accomplished with linear or nearly-linear operations (e.g., sorting, …)
– Understanding how to organize your computations so that they effectively use whatever’s fast
– Understanding how to test/debug/verify with large data

* according to William

Asymptotic Analysis: Basic Principles

$f(n) \in O(g(n))$ iff $\exists k, n_0 : \forall n > n_0,\ f(n) \le k \cdot g(n)$

$f(n) \in \Omega(g(n))$ iff $\exists k, n_0 : \forall n > n_0,\ f(n) \ge k \cdot g(n)$

Usually we only care about positive f(n), g(n), n here…

Asymptotic Analysis: Basic Principles

Less pedantically:

$f(n) = O(g(n))$ iff $\exists k, n_0 : \forall n > n_0,\ f(n) \le k \cdot g(n)$

$f(n) = \Omega(g(n))$ iff $\exists k, n_0 : \forall n > n_0,\ f(n) \ge k \cdot g(n)$

Some useful rules:

$O(n^4 + n^3) = O(n^4)$ (Only highest-order terms matter)

$O(3n^4 + 127n^3) = O(n^4)$ (Leading constants don’t matter)

$O(\log n^4) = O(4 \cdot \log n) = O(\log n)$ (Degree of something in a log doesn’t matter)
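As a worked instance of the definition of O(·), one way to check a bound is to exhibit explicit constants. The polynomial $3n^4 + 127n^3$ and the witnesses $k = 130$, $n_0 = 1$ below are just one convenient choice, used here for illustration.

```latex
\begin{align*}
3n^4 + 127n^3 &\le 3n^4 + 127n^4 && \text{because } n^3 \le n^4 \text{ for } n \ge 1 \\
              &= 130\, n^4
\end{align*}
% Hence f(n) = 3n^4 + 127n^3 satisfies f(n) \le k \cdot g(n) for all n > n_0,
% with g(n) = n^4, k = 130, n_0 = 1, so f(n) = O(n^4).
% Similarly \log n^4 = 4 \log n, and the constant factor 4 is absorbed into k.
```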


Back to rule pruning….

Algorithm
• Fit the POS, NEG examples
• While POS isn’t empty:
– Let R be “if True → pos”
– While NEG isn’t empty:
• Pick the “best” [1] condition c of the form “xi=True” or “xi=false”
• Add c to the LHS of R
• Remove examples that don’t satisfy c from NEG
– Add R to the rule set [2]
– Remove examples that satisfy R from POS
• Prune the rule set:
– For each condition c in the rule set:
• Evaluate the accuracy of the rule set w/o c on heldout data
– If removing any c improves accuracy
• Remove c and repeat the pruning step

Analysis
• Assume n examples
• Assume m conditions in rule set
• Growing rules takes time at least Ω(m*n) if evaluating c is Ω(n)
• When data is clean m is small, fitting takes linear time
• When k% of data is noisy, m is Ω(n*0.01*k) so growing rules takes Ω(n²)
• Pruning a rule set with m = 0.01*k*n extra conditions is very slow: Ω(n³) if implemented naively

[1] “Best” is wrt some statistics on c’s coverage of POS, NEG

[2] R is now of the form “if xi1=_ and xi2=_ and … → pos”
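The naive pruning step can be sketched as below, which also makes the cubic cost visible. This is an illustrative sketch under assumptions: rules are lists of (feature, value) conditions, accuracy is plain classification accuracy on a held-out list of (example, label) pairs, and the first improving removal is accepted; the names `prune`, `accuracy`, and `satisfies` are hypothetical.

```python
from typing import Dict, List, Tuple

Example = Dict[str, bool]
Rule = List[Tuple[str, bool]]          # conjunction of (feature, required value)
Labeled = Tuple[Example, bool]         # (example, True if positive)


def satisfies(example: Example, rule: Rule) -> bool:
    return all(example.get(feat, False) == val for feat, val in rule)


def accuracy(ruleset: List[Rule], heldout: List[Labeled]) -> float:
    """Fraction of held-out examples classified correctly: at least Omega(n) work."""
    correct = sum(1 for ex, label in heldout
                  if any(satisfies(ex, r) for r in ruleset) == label)
    return correct / len(heldout)


def prune(ruleset: List[Rule], heldout: List[Labeled]) -> List[Rule]:
    ruleset = [list(r) for r in ruleset]
    best_acc = accuracy(ruleset, heldout)
    improved = True
    while improved:                            # up to m passes
        improved = False
        for ri, rule in enumerate(ruleset):    # m candidate removals per pass
            for ci in range(len(rule)):
                candidate = [list(r) for r in ruleset]
                del candidate[ri][ci]          # rule set without condition c
                acc = accuracy(candidate, heldout)   # Omega(n) per candidate
                if acc > best_acc:
                    ruleset, best_acc, improved = candidate, acc, True
                    break
            if improved:
                break                          # restart the pruning step
    return ruleset


# Toy rule and held-out data, invented for illustration only.
toy_ruleset = [[("talk", True), ("subject_talk", True), ("p_comma", True)]]
toy_heldout = [
    ({"talk": True, "subject_talk": True, "p_comma": True}, True),
    ({"talk": True, "subject_talk": True, "p_comma": False}, True),
    ({"talk": True, "subject_talk": False, "p_comma": False}, False),
    ({"talk": False, "subject_talk": False, "p_comma": True}, False),
]
toy_pruned = prune(toy_ruleset, toy_heldout)
```

With up to m passes, m candidates per pass, and at least n work per evaluation, the cost is on the order of m²·n, which becomes cubic in n when m grows linearly with n.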


Empirical analysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression)
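The slope measurement can be sketched with an ordinary least-squares fit on the logged values. The function name `loglog_slope` and the synthetic timings are assumptions for illustration; in practice the times would come from running the program on inputs of increasing size.

```python
import math
from typing import Sequence


def loglog_slope(sizes: Sequence[float], times: Sequence[float]) -> float:
    """Least-squares slope of log(time) against log(size).

    If time is roughly c * n**p, then log(time) = log(c) + p * log(n),
    so the fitted slope estimates the exponent p.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic run-times for a hypothetical quadratic algorithm (not measured data).
toy_sizes = [1000, 2000, 4000, 8000]
toy_times = [1e-6 * n ** 2 for n in toy_sizes]
toy_slope = loglog_slope(toy_sizes, toy_times)
```

A slope near 1 suggests linear behavior, near 2 quadratic, and so on; the intercept absorbs the constant factor.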


Where do asymptotics break down?
• When the constants are too big
– or n is too small
• When we can’t predict what the program will do
– Eg, how many iterations before convergence? Does it depend on data size or not?
– This is when you need experiments
• When there are different types of operations with different costs
– We need to understand what we should count

What do we count?

• Compilers don’t warn Jeff Dean. Jeff Dean warns compilers.
• Jeff Dean builds his code before committing it, but only to check for compiler and linker bugs.
• Jeff Dean writes directly in binary. He then writes the source code as a documentation for other developers.
• Jeff Dean once shifted a bit so hard, it ended up on another computer.
• When Jeff Dean has an ergonomic evaluation, it is for the protection of his keyboard.
• gcc -O4 emails your code to Jeff Dean for a rewrite.
• When he heard that Jeff Dean's autobiography would be exclusive to the platform, Richard Stallman bought a Kindle.
• Jeff Dean puts his pants on one leg at a time, but if he had more legs, you’d realize the algorithm is actually only O(log n)

Numbers (Jeff Dean says) Everyone Should Know

Update: Colin Scott, UCB

file:///Users/wcohen/Documents/code/interactive_latencies/interactive_latency.html - *may need to open this from shell


What’s Happening with Hardware?
• Clock speed: stuck at 3Ghz for ~10 years
• Net bandwidth doubles ~2 years
• Disk bandwidth doubles ~2 years
• SSD bandwidth doubles ~3 years
• Disk seek speed doubles ~10 years
• SSD latency nearly saturated
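To see how far these trends diverge, the doubling periods can be turned into growth factors over a fixed horizon. The sketch below assumes smooth exponential growth at the quoted rates, which is a simplification; the function name `growth_factor` and the ten-year horizon are choices made for illustration.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Improvement factor after `years` if a quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Doubling periods (in years) as quoted on the slide.
DOUBLING_YEARS = {
    "net bandwidth": 2,
    "disk bandwidth": 2,
    "SSD bandwidth": 3,
    "disk seek speed": 10,
}

decade_growth = {name: growth_factor(10, period)
                 for name, period in DOUBLING_YEARS.items()}

for name, factor in decade_growth.items():
    print(f"{name}: about {factor:.0f}x in 10 years")
```

Under this assumption, bandwidth improves about 32x in a decade while disk seek speed improves only 2x, which is why seeks become relatively more expensive over time.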


A typical CPU (not to scale)
K8 core in the AMD Athlon 64 CPU

[Figure annotations: 16x bigger; 256x bigger; Hard disk (1Tb): 128x bigger]


A typical disk

[Figure annotations: ~= 10x; ~= 15x; ~= 100,000x; 40x]

What do we count?

• Compilers don’t warn Jeff Dean. Jeff Dean warns compilers.
• ….
• Memory access/instructions are qualitatively different from disk access
• Seeks are qualitatively different from sequential reads on disk
• Cache, disk fetches, etc work best when you stream through data sequentially
• Best case for data processing: stream through the data once in sequential order, as it’s found on disk.
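A small sketch of the "stream through the data once" pattern: computing the count, mean, and variance of a column of numbers in a single sequential pass with constant memory. It uses Welford's online update, a standard technique chosen here for illustration rather than something the slides prescribe; the function name and the in-memory `io.StringIO` stand-in for a file on disk are assumptions.

```python
import io
from typing import Iterable, Tuple


def streaming_mean_var(lines: Iterable[str]) -> Tuple[int, float, float]:
    """One sequential pass over the lines, keeping only three numbers in memory.

    Returns (count, mean, population variance), using Welford's online update.
    """
    n = 0
    mean = 0.0
    m2 = 0.0                      # running sum of squared deviations from the mean
    for line in lines:            # a file object yields its lines in on-disk order
        line = line.strip()
        if not line:
            continue              # skip blank lines
        x = float(line)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return n, mean, variance


# io.StringIO stands in for `open(path)`; the values are made up.
toy_stream = io.StringIO("1\n2\n3\n4\n")
toy_stats = streaming_mean_var(toy_stream)
```

Because the loop never seeks backward or holds more than one line, the same code works unchanged on a file far larger than memory.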


Other lessons - ?

* but not important enough for this class’s assignments….

What/How
– Next lecture: probability review and Naïve Bayes.
– Homework:
• Watch the review lecture I linked to on the wiki
– I’m not going to repeat it