10-405 ml from large datasetswcohen/10-405/overview.pdfwho/what/where/when •10-405 is a brand new...
TRANSCRIPT
10-405MLfromLargeDatasets
WilliamCohen
1
ADMINISTRIVIA
Who/What/Where/When• Wiki:google://”cohen CMU”àteachingà
– http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018
– thisshouldpointtoeverythingelse
• OfficeHours:– StillTBAforallofus,checkthewiki
• Courseassistant:DorothyHolland-Minkley(dfh@cs)
• Instructor:–WilliamW.Cohen
• TAs:– Introscomingup….
Who/What/Where/When• 10-405isabrandnewcourse!• Butit’sgoingtobeverycloseto10-605whichhasbeenhappeningsince2012
• It’saimedatundergrads–willcoveraboutthesametopics– notaprojectcourse– onechancetodoanopen-ended“extensionofanassignment”
• Anyonecanenrollbutnotallgradprogramswillgiveyoufullcreditfor4xxcoursessoreadthefineprint
Who/What/Where/When• Whoishere?
Who/What/Where/When• Mostdaystherewillbeanon-linequizyoushoulddoafterlecture
• ThequizzeswillusuallycloseFridaynoon• Wehaveonetoday– seethewiki!• Theydon’tcountforalotbuttherearenomakeups–Quizzesreinforcethelecture–Andmakesureyoukeepup
Who/What/Where/When• WilliamW.Cohen– 1990-mid1990’s:AT&TBellLabs(workingonILPandscalablepropositionalrule-learningalgorithms)
– Mid1990’s—2000:AT&TResearch(textclassification,dataintegration,knowledge-as-text,informationextraction)
– 2000-2002:Whizbang!Labs(infoextractionfromWeb)
– 2002-2008,2010-now:CMU(IE,biotext,socialnetworks,learningingraphs,infoextractionfromWeb,scalablefirst-orderlearning)• 2008-2009,Jan-Jul2017:VisitingScientistatGoogle
TAs
I am a Masters Student in Information Networking Institute going to join Microsoft AI & Research Teampost this semester. My research interests are in the domains of Scalable Machine Learning and Deep Learning.
I took the course last semester (Fall 2017) and learnt a lot.
Hope you all have a great semester. See you at the office hours!
Vidhan Agarwal (MSIN)
SarthakGarg- MSinCS• IamafirstyearMastersstudentintheComputerScienceDepartment
• IaminterestedinDeepGenerativeModelsandDistributedSystemsforMachineLearning
• Itook10-605lastfallandfounditveryinteresting,hopeyouenjoythecourse!
Nitish Kulkarni MSinMCDS
11
I am a first year master's student in the Master of Computational Data Science program, LTI Department.
I’m interested in scalable machine learning algorithms and information retrieval.
I took 10-805 in fall ‘17. The course was quite fun and incredibly useful. I’m glad to be a TA for the course this semester, and hope you have a similar experience.
VivekShankar– BSinSCS
• VivekShankar– BSinSCS
• IamafourthyearundergraduateintheSchoolofComputerScience.
• Thissummer,IinternedatGoogle,MontrealworkingonbuildingaURL-basedMachineLearningmodelfordetectingmaliciousChromeextensions.I’llbejoiningGooglefulltimeinPittsburgh.
• Iaminterestedindesigningparallelizablemachinelearningalgorithmsthatscalewelltolargedatasets– abigthemein10605.Itook10605inFall'17,anditwasoneofmyfavoriteclassesthusfaratCMU!
What/HowIkindoflikelanguagetasks,especiallyforthistask:– Thedata(usually)makessense– Themodels(usually)makesense– Themodels(usually)arecomplex,so• Moredataactuallyhelps• Learningsimplemodelsvs complexonesissometimescomputationallydifferent
What/How• ProgrammingLanguagesandSystems:– Python– HadoopandSpark
• Resources:– unix.andrew machines– Stoathadoop cluster:• 104workernodes,with8cores,16GBRAM,41TB.• 30workernodes,with8cores,16GBRAM, 250Gb+
– AmazonElasticCloud• AmazonEC2[http://aws.amazon.com/ec2/]• Allocation:$50worthoftimeperstudent
What/How:601co-req• Youshouldhaveasaprereq orco-req oneoftheMLD’sintroMLcourses:10-401,10-601,10-701,10-715
• Lecturesaredesignedtocomplement thatmaterial– computationalaspectsvs informationalaspects
What/How:cheatingvs workingtogether• Ihavealongandexplicitpolicy– stolenfromRoni Rosenfeld- readthewebpage– tl;dr:transmitinformationliketheydidinthestoneage,brain-to-brain,anddocumentit
– donotcopyanythingdigitally– exceptions(eg projects)willbeexplicitlystated
– everybodyinvolvedwillfail bydefault– every infractionalwaysgetsreporteduptotheOfficeofAcademicIntegrity,theheadofyourprogram,thedeanofyourschool,….
– asecondoffenseisverybad
BIGDATAHISTORY:FROMTHEDAWNOFTIMETOTHEPRESENT
17
Big ML c. 1993 (Cohen, “Efficient…Rule Learning”, IJCAI 1993)
$ ripper ../tdata/talksFinal hypothesis is:talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).talk_announcement :- WORDS ~ '2d416' (26/3).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).talk_announcement :- WORDS ~ presentations (2/1).default non_talk_announcement (390/1).
18
19
MoreonthispaperAlgorithm
• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily
• Phase2:prunerules– startingwithphase1output,removeconditions
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma.talk_announcement :- WORDS ~ '2d416', WORDS ~ be.talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).default non_talk_announcement .
20
MoreonthispaperAlgorithm
• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily
• Phase2:prunerules– startingwithphase1output,removeconditions,greedily
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma (54/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ be (19/0).talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).default non_talk_announcement .
21
MoreonthispaperAlgorithm
• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily
• Phase2:prunerules– startingwithphase1output,removeconditions,greedily
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).talk_announcement :- WORDS ~ '2d416' (26/3).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).talk_announcement :- WORDS ~ presentations (2/1).default non_talk_announcement (390/1).
22
MoreonthispaperAlgorithm
• FitthePOS,NEGexample• WhilePOSisn’tempty:
– LetR be“ifTrueè pos”– WhileNEGisn’tempty:
• Pickthe“best”[i]conditionc oftheform“xi=True”or“xi=false”
• Addc totheLHSofR• Removeexamplesthatdon’tsatisfyc fromNEG
• AddRtotheruleset[ii]• RemoveexamplesthatsatisfyRfromPOS
• Prunetheruleset:– …
Analysis
[i] “Best” is wrt some statistics on c’s coverage of POS,NEG
[ii] R is now of the form “if xi1=_ and xi2=_ and … è pos”
• ThetotalnumberofiterationsofL1isthenumberofconditionsintheruleset– callitm
• Pickingthe“best”conditionrequireslookingatallexamples– saytherearen ofthese
• Timeisatleastm*n• Theproblem:
– Whentherearenoisypositiveexamplesthealgorithmbuildsrulesthatcoverjust1-2ofthem
– Sowithhugenoisydatasetsyoubuildhugerulesets
L1
quadratic
cubic!
23
Related paper from 1995…
24
Soinmid1990’s…..• Experimentaldatasetsweresmall• Manycommonlyusedalgorithmswereasymptotically “slow”
• Notmanypeoplereallycared
25
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
26
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
27
WhyMoreDataHelps:ADemo• Data:–All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears
28
29
30
http://xkcd.com/ngram-charts/
31
WhyMoreDataHelps:ADemo• Data:– All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears
– Wrotecodetocompute• Pr(A,B,C,D,E|C=affectorC=effect)• Pr(anysubsetofA,…,E|any otherfixedvaluesofA,…,EwithC=affectVeffect)
– Demo:• /Users/wcohen/Documents/code/pyhack/bigml• eg:pythonngram-query.py data/aeffect-train.txt __Beffect__
32
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
33
WhyMoreDataHelps• Data:
– All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears
– Wrotecodetocompute• Pr(A,B,C,D,E|C=affectorC=effect)• Pr(anysubsetofA,…,E|any otherfixedvaluesofA,…,EwithC=affectVeffect)
• Observations [fromplayingwithdata]:– Mostlyeffect notaffect– Mostcommonwordbeforeaffect isnot– After noteffectmostcommonwordisa– …
34
Soin2001…..• We’relearning:– “there’snodatalikemoredata”–Formanytasks,there’snorealsubstituteforusinglotsofdata
35
…andin2009Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.
Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
Norvig, Pereira, Halevy, “The Unreasonable Effectiveness of Data”, 200936
…andin2012
Dec 2011
37
…andin2013
38
…andin2014
39
Bengio,Foundations&Trends,2009
40
naïve vsclever optimization
1Mvs10M
examples
2.5Mexamplesfor“pretraining”
41
Today….• Commonlyuseddeeplearningdatasets:– Images/videos:• ImageNet:20k+categories,14M+images• MSCOCO:91categories,2.5Mlabels,328kimages• YouTube-M:8Murls,4800classes,0.5Mhours
– Readingcomprehension:• Children’sbooktest:600k+context/querypairs• CNN/Dailymail:~300kdocs,1.2Mclozequestions
–Other:• Ubuntudialog:7M+utterances,1M+dialogs• …
42
REVIEW:ASYMPTOTICCOMPLEXITY
43
Howdoweuseverylargeamountsofdata?• Workingwithbigdataisnot about– codeoptimization– learningdetailsoftodays hardware/software:• GraphLab,Hadoop,Spark,parallelhardware,….
• Workingwithbigdataisabout– Understandingthecostofwhatyouwanttodo– Understandingwhatthetoolsthatareavailableoffer– Understandinghowmuchcanbeaccomplishedwithlinearornearly-linearoperations(e.g.,sorting,…)
– Understandinghowtoorganizeyourcomputationssothattheyeffectivelyusewhatever’sfast
– Understandinghowtotest/debug/verifywithlargedata
*
* according to William44
AsymptoticAnalysis:BasicPrinciples
)()(,:, iff ))(()( 00 ngkxfnnnkngOnf ×£>"$Î
)()(,:, iff ))(()( 00 ngkxfnnnkngnf ׳>"$WÎ
Usually we only care about positive f(n), g(n), n here…
45
AsymptoticAnalysis:BasicPrinciples
)()(,:, iff ))(()( 00 ngkxfnnnkngOnf ×£>"$=
)()(,:, iff ))(()( 00 ngkxfnnnkngnf ׳>"$W=
Less pedantically:
Some useful rules:
)O( )( 434 nnnO =+
)O( )1273( 434 nnnO =+
)loglog4log 4 nO(n) O() nO( =×=
Only highest-order terms matter
Leading constants don’t matter
Degree of something in a log doesn’t matter
46
Backtorulepruning….
Algorithm• FitthePOS,NEGexampleWhile POSisn’tempty:
– LetR be“ifTrueè pos”– WhileNEGisn’tempty:
• Pickthe“best”[1]conditionc oftheform“xi=True”or“xi=false”
• Addc totheLHSofR• Removeexamplesthatdon’tsatisfycfromNEG
• AddRtotheruleset[2]• RemoveexamplesthatsatisfyRfromPOS
• Prunetheruleset:– Foreachconditioncintheruleset:
• Evaluatetheaccuracyoftheruleset w/oconheldout data
– Ifremovinganycimprovesaccuracy• Removecandrepeatthepruningstep
Analysis• Assumenexamples• Assume mconditionsinruleset• Growingrulestakestimeatleast
Ω(m*n)ifevaluatingcisΩ(n)• Whendataiscleanmissmall,
fittingtakeslineartime• Whenk%ofdataisnoisy,m is
Ω(n*0.01*k)sogrowingrulestakesΩ(n2)
• Pruningarulesetwithm =0.01*kn extraconditionsisveryslow:Ω(n3)ifimplementednaively
[1] “Best” is wrt some statistics on c’s coverage of POS,NEG
[2] R is now of the form “if xi1=_ and xi2=_ and … è pos”
47
Empiricalanalysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression)
48
Wheredoasymptotics breakdown?• Whentheconstantsaretoobig– ornistoosmall
• Whenwecan’tpredictwhattheprogramwilldo– Eg,howmanyiterationsbeforeconvergence?Doesitdependondatasizeornot?– Thisiswhenyouneedexperiments
• Whentherearedifferenttypesofoperationswithdifferentcosts–Weneedtounderstandwhatweshouldcount
49
Whatdowecount?
• Compilersdon’twarnJeffDean.JeffDeanwarnscompilers.
• JeffDeanbuildshiscodebeforecommittingit,butonlytocheckforcompilerandlinkerbugs.
• JeffDeanwritesdirectlyinbinary.Hethenwritesthesourcecodeasadocumentationforotherdevelopers.
• JeffDeanonceshiftedabitsohard,itendeduponanothercomputer.
• WhenJeffDeanhasanergonomicevaluation,itisfortheprotectionofhiskeyboard.
• gcc -O4emailsyourcodetoJeffDeanforarewrite.• WhenheheardthatJeffDean'sautobiographywouldbe
exclusivetotheplatform,RichardStallmanboughtaKindle.
• JeffDeanputshispantsononelegatatime,butifhehadmorelegs,you’drealizethealgorithmisactuallyonlyO(logn)
50
Numbers(JeffDeansays)EveryoneShouldKnow
51
Update:ColinScott,UCB
file:///Users/wcohen/Documents/code/interactive_latencies/interactive_latency.html - *may need to open this from shell
52
What’sHappeningwithHardware?• Clockspeed:stuckat3Ghzfor~10years• Netbandwidthdoubles~2years• Diskbandwidthdoubles~2years• SSDbandwidthdoubles~3years• Diskseekspeeddoubles~10years• SSDlatencynearlysaturated
53
54
Numbers(JeffDeansays)EveryoneShouldKnow
55
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
56
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
57
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
58
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
59
Atypicaldisk
60
Numbers(JeffDeansays)EveryoneShouldKnow
~= 10x
~= 15x
~= 100,000x
40x
61
Whatdowecount?
• Compilersdon’twarnJeffDean.JeffDeanwarnscompilers.• ….
• Memoryaccess/instructionsarequalitativelydifferentfromdiskaccess
• Seeksarequalitativelydifferentfromsequentialreadsondisk
• Cache,diskfetches,etcworkbestwhenyoustreamthroughdatasequentially
• Bestcasefordataprocessing:streamthroughthedataonce insequentialorder,asit’sfoundondisk.
62
Otherlessons-?
* but not important enough for this class’s assignments….
*
63
What/How–Nextlecture:probabilityreviewandNaïveBayes.–Homework:• WatchthereviewlectureIlinkedtoonthewiki
– I’mnot goingtorepeatit
64