10-405 ml from large datasetswcohen/10-405/overview.pdfwho/what/where/when •10-405 is a brand new...
TRANSCRIPT
![Page 1: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/1.jpg)
10-405MLfromLargeDatasets
WilliamCohen
1
![Page 2: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/2.jpg)
ADMINISTRIVIA
![Page 3: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/3.jpg)
Who/What/Where/When• Wiki:google://”cohen CMU”àteachingà
– http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018
– thisshouldpointtoeverythingelse
• OfficeHours:– StillTBAforallofus,checkthewiki
• Courseassistant:DorothyHolland-Minkley(dfh@cs)
• Instructor:–WilliamW.Cohen
• TAs:– Introscomingup….
![Page 4: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/4.jpg)
Who/What/Where/When• 10-405isabrandnewcourse!• Butit’sgoingtobeverycloseto10-605whichhasbeenhappeningsince2012
• It’saimedatundergrads–willcoveraboutthesametopics– notaprojectcourse– onechancetodoanopen-ended“extensionofanassignment”
• Anyonecanenrollbutnotallgradprogramswillgiveyoufullcreditfor4xxcoursessoreadthefineprint
![Page 5: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/5.jpg)
Who/What/Where/When• Whoishere?
![Page 6: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/6.jpg)
Who/What/Where/When• Mostdaystherewillbeanon-linequizyoushoulddoafterlecture
• ThequizzeswillusuallycloseFridaynoon• Wehaveonetoday– seethewiki!• Theydon’tcountforalotbuttherearenomakeups–Quizzesreinforcethelecture–Andmakesureyoukeepup
![Page 7: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/7.jpg)
Who/What/Where/When• WilliamW.Cohen– 1990-mid1990’s:AT&TBellLabs(workingonILPandscalablepropositionalrule-learningalgorithms)
– Mid1990’s—2000:AT&TResearch(textclassification,dataintegration,knowledge-as-text,informationextraction)
– 2000-2002:Whizbang!Labs(infoextractionfromWeb)
– 2002-2008,2010-now:CMU(IE,biotext,socialnetworks,learningingraphs,infoextractionfromWeb,scalablefirst-orderlearning)• 2008-2009,Jan-Jul2017:VisitingScientistatGoogle
![Page 8: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/8.jpg)
TAs
![Page 9: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/9.jpg)
I am a Masters Student in Information Networking Institute going to join Microsoft AI & Research Teampost this semester. My research interests are in the domains of Scalable Machine Learning and Deep Learning.
I took the course last semester (Fall 2017) and learnt a lot.
Hope you all have a great semester. See you at the office hours!
Vidhan Agarwal (MSIN)
![Page 10: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/10.jpg)
SarthakGarg- MSinCS• IamafirstyearMastersstudentintheComputerScienceDepartment
• IaminterestedinDeepGenerativeModelsandDistributedSystemsforMachineLearning
• Itook10-605lastfallandfounditveryinteresting,hopeyouenjoythecourse!
![Page 11: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/11.jpg)
Nitish Kulkarni MSinMCDS
11
I am a first year master's student in the Master of Computational Data Science program, LTI Department.
I’m interested in scalable machine learning algorithms and information retrieval.
I took 10-805 in fall ‘17. The course was quite fun and incredibly useful. I’m glad to be a TA for the course this semester, and hope you have a similar experience.
![Page 12: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/12.jpg)
VivekShankar– BSinSCS
• VivekShankar– BSinSCS
• IamafourthyearundergraduateintheSchoolofComputerScience.
• Thissummer,IinternedatGoogle,MontrealworkingonbuildingaURL-basedMachineLearningmodelfordetectingmaliciousChromeextensions.I’llbejoiningGooglefulltimeinPittsburgh.
• Iaminterestedindesigningparallelizablemachinelearningalgorithmsthatscalewelltolargedatasets– abigthemein10605.Itook10605inFall'17,anditwasoneofmyfavoriteclassesthusfaratCMU!
![Page 13: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/13.jpg)
What/HowIkindoflikelanguagetasks,especiallyforthistask:– Thedata(usually)makessense– Themodels(usually)makesense– Themodels(usually)arecomplex,so• Moredataactuallyhelps• Learningsimplemodelsvs complexonesissometimescomputationallydifferent
![Page 14: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/14.jpg)
What/How• ProgrammingLanguagesandSystems:– Python– HadoopandSpark
• Resources:– unix.andrew machines– Stoathadoop cluster:• 104workernodes,with8cores,16GBRAM,41TB.• 30workernodes,with8cores,16GBRAM, 250Gb+
– AmazonElasticCloud• AmazonEC2[http://aws.amazon.com/ec2/]• Allocation:$50worthoftimeperstudent
![Page 15: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/15.jpg)
What/How:601co-req• Youshouldhaveasaprereq orco-req oneoftheMLD’sintroMLcourses:10-401,10-601,10-701,10-715
• Lecturesaredesignedtocomplement thatmaterial– computationalaspectsvs informationalaspects
![Page 16: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/16.jpg)
What/How:cheatingvs workingtogether• Ihavealongandexplicitpolicy– stolenfromRoni Rosenfeld- readthewebpage– tl;dr:transmitinformationliketheydidinthestoneage,brain-to-brain,anddocumentit
– donotcopyanythingdigitally– exceptions(eg projects)willbeexplicitlystated
– everybodyinvolvedwillfail bydefault– every infractionalwaysgetsreporteduptotheOfficeofAcademicIntegrity,theheadofyourprogram,thedeanofyourschool,….
– asecondoffenseisverybad
![Page 17: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/17.jpg)
BIGDATAHISTORY:FROMTHEDAWNOFTIMETOTHEPRESENT
17
![Page 18: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/18.jpg)
Big ML c. 1993 (Cohen, “Efficient…Rule Learning”, IJCAI 1993)
$ ripper ../tdata/talksFinal hypothesis is:talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).talk_announcement :- WORDS ~ '2d416' (26/3).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).talk_announcement :- WORDS ~ presentations (2/1).default non_talk_announcement (390/1).
18
![Page 19: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/19.jpg)
19
![Page 20: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/20.jpg)
MoreonthispaperAlgorithm
• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily
• Phase2:prunerules– startingwithphase1output,removeconditions
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma.talk_announcement :- WORDS ~ '2d416', WORDS ~ be.talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).default non_talk_announcement .
20
![Page 21: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/21.jpg)
MoreonthispaperAlgorithm
• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily
• Phase2:prunerules– startingwithphase1output,removeconditions,greedily
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma (54/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ be (19/0).talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).default non_talk_announcement .
21
![Page 22: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/22.jpg)
MoreonthispaperAlgorithm
• Phase1:buildrules– Discretegreedysearch:– Startingwithemptyruleset,addconditionsgreedily
• Phase2:prunerules– startingwithphase1output,removeconditions,greedily
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).talk_announcement :- WORDS ~ '2d416' (26/3).talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).talk_announcement :- WORDS ~ presentations (2/1).default non_talk_announcement (390/1).
22
![Page 23: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/23.jpg)
MoreonthispaperAlgorithm
• FitthePOS,NEGexample• WhilePOSisn’tempty:
– LetR be“ifTrueè pos”– WhileNEGisn’tempty:
• Pickthe“best”[i]conditionc oftheform“xi=True”or“xi=false”
• Addc totheLHSofR• Removeexamplesthatdon’tsatisfyc fromNEG
• AddRtotheruleset[ii]• RemoveexamplesthatsatisfyRfromPOS
• Prunetheruleset:– …
Analysis
[i] “Best” is wrt some statistics on c’s coverage of POS,NEG
[ii] R is now of the form “if xi1=_ and xi2=_ and … è pos”
• ThetotalnumberofiterationsofL1isthenumberofconditionsintheruleset– callitm
• Pickingthe“best”conditionrequireslookingatallexamples– saytherearen ofthese
• Timeisatleastm*n• Theproblem:
– Whentherearenoisypositiveexamplesthealgorithmbuildsrulesthatcoverjust1-2ofthem
– Sowithhugenoisydatasetsyoubuildhugerulesets
L1
quadratic
cubic!
23
![Page 24: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/24.jpg)
Related paper from 1995…
24
![Page 25: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/25.jpg)
Soinmid1990’s…..• Experimentaldatasetsweresmall• Manycommonlyusedalgorithmswereasymptotically “slow”
• Notmanypeoplereallycared
25
![Page 26: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/26.jpg)
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
26
![Page 27: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/27.jpg)
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
27
![Page 28: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/28.jpg)
WhyMoreDataHelps:ADemo• Data:–All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears
28
![Page 29: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/29.jpg)
29
![Page 30: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/30.jpg)
30
![Page 31: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/31.jpg)
http://xkcd.com/ngram-charts/
31
![Page 32: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/32.jpg)
WhyMoreDataHelps:ADemo• Data:– All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears
– Wrotecodetocompute• Pr(A,B,C,D,E|C=affectorC=effect)• Pr(anysubsetofA,…,E|any otherfixedvaluesofA,…,EwithC=affectVeffect)
– Demo:• /Users/wcohen/Documents/code/pyhack/bigml• eg:pythonngram-query.py data/aeffect-train.txt __Beffect__
32
![Page 33: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/33.jpg)
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
33
![Page 34: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/34.jpg)
WhyMoreDataHelps• Data:
– All5-gramsthatappear>=40timesinacorpusof1MEnglishbooks• approx 80Bwords• 5-grams:30Gbcompressed,250-300Gbuncompressed• Each5-gramcontainsfrequencydistributionoveryears
– Wrotecodetocompute• Pr(A,B,C,D,E|C=affectorC=effect)• Pr(anysubsetofA,…,E|any otherfixedvaluesofA,…,EwithC=affectVeffect)
• Observations [fromplayingwithdata]:– Mostlyeffect notaffect– Mostcommonwordbeforeaffect isnot– After noteffectmostcommonwordisa– …
34
![Page 35: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/35.jpg)
Soin2001…..• We’relearning:– “there’snodatalikemoredata”–Formanytasks,there’snorealsubstituteforusinglotsofdata
35
![Page 36: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/36.jpg)
…andin2009Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.
Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
Norvig, Pereira, Halevy, “The Unreasonable Effectiveness of Data”, 200936
![Page 37: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/37.jpg)
…andin2012
Dec 2011
37
![Page 38: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/38.jpg)
…andin2013
38
![Page 39: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/39.jpg)
…andin2014
39
![Page 40: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/40.jpg)
Bengio,Foundations&Trends,2009
40
![Page 41: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/41.jpg)
naïve vsclever optimization
1Mvs10M
examples
2.5Mexamplesfor“pretraining”
41
![Page 42: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/42.jpg)
Today….• Commonlyuseddeeplearningdatasets:– Images/videos:• ImageNet:20k+categories,14M+images• MSCOCO:91categories,2.5Mlabels,328kimages• YouTube-M:8Murls,4800classes,0.5Mhours
– Readingcomprehension:• Children’sbooktest:600k+context/querypairs• CNN/Dailymail:~300kdocs,1.2Mclozequestions
–Other:• Ubuntudialog:7M+utterances,1M+dialogs• …
42
![Page 43: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/43.jpg)
REVIEW:ASYMPTOTICCOMPLEXITY
43
![Page 44: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/44.jpg)
Howdoweuseverylargeamountsofdata?• Workingwithbigdataisnot about– codeoptimization– learningdetailsoftodays hardware/software:• GraphLab,Hadoop,Spark,parallelhardware,….
• Workingwithbigdataisabout– Understandingthecostofwhatyouwanttodo– Understandingwhatthetoolsthatareavailableoffer– Understandinghowmuchcanbeaccomplishedwithlinearornearly-linearoperations(e.g.,sorting,…)
– Understandinghowtoorganizeyourcomputationssothattheyeffectivelyusewhatever’sfast
– Understandinghowtotest/debug/verifywithlargedata
*
* according to William44
![Page 45: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/45.jpg)
AsymptoticAnalysis:BasicPrinciples
)()(,:, iff ))(()( 00 ngkxfnnnkngOnf ×£>"$Î
)()(,:, iff ))(()( 00 ngkxfnnnkngnf ׳>"$WÎ
Usually we only care about positive f(n), g(n), n here…
45
![Page 46: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/46.jpg)
AsymptoticAnalysis:BasicPrinciples
)()(,:, iff ))(()( 00 ngkxfnnnkngOnf ×£>"$=
)()(,:, iff ))(()( 00 ngkxfnnnkngnf ׳>"$W=
Less pedantically:
Some useful rules:
)O( )( 434 nnnO =+
)O( )1273( 434 nnnO =+
)loglog4log 4 nO(n) O() nO( =×=
Only highest-order terms matter
Leading constants don’t matter
Degree of something in a log doesn’t matter
46
![Page 47: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/47.jpg)
Backtorulepruning….
Algorithm• FitthePOS,NEGexampleWhile POSisn’tempty:
– LetR be“ifTrueè pos”– WhileNEGisn’tempty:
• Pickthe“best”[1]conditionc oftheform“xi=True”or“xi=false”
• Addc totheLHSofR• Removeexamplesthatdon’tsatisfycfromNEG
• AddRtotheruleset[2]• RemoveexamplesthatsatisfyRfromPOS
• Prunetheruleset:– Foreachconditioncintheruleset:
• Evaluatetheaccuracyoftheruleset w/oconheldout data
– Ifremovinganycimprovesaccuracy• Removecandrepeatthepruningstep
Analysis• Assumenexamples• Assume mconditionsinruleset• Growingrulestakestimeatleast
Ω(m*n)ifevaluatingcisΩ(n)• Whendataiscleanmissmall,
fittingtakeslineartime• Whenk%ofdataisnoisy,m is
Ω(n*0.01*k)sogrowingrulestakesΩ(n2)
• Pruningarulesetwithm =0.01*kn extraconditionsisveryslow:Ω(n3)ifimplementednaively
[1] “Best” is wrt some statistics on c’s coverage of POS,NEG
[2] R is now of the form “if xi1=_ and xi2=_ and … è pos”
47
![Page 48: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/48.jpg)
Empiricalanalysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression)
48
![Page 49: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/49.jpg)
Wheredoasymptotics breakdown?• Whentheconstantsaretoobig– ornistoosmall
• Whenwecan’tpredictwhattheprogramwilldo– Eg,howmanyiterationsbeforeconvergence?Doesitdependondatasizeornot?– Thisiswhenyouneedexperiments
• Whentherearedifferenttypesofoperationswithdifferentcosts–Weneedtounderstandwhatweshouldcount
49
![Page 50: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/50.jpg)
Whatdowecount?
• Compilersdon’twarnJeffDean.JeffDeanwarnscompilers.
• JeffDeanbuildshiscodebeforecommittingit,butonlytocheckforcompilerandlinkerbugs.
• JeffDeanwritesdirectlyinbinary.Hethenwritesthesourcecodeasadocumentationforotherdevelopers.
• JeffDeanonceshiftedabitsohard,itendeduponanothercomputer.
• WhenJeffDeanhasanergonomicevaluation,itisfortheprotectionofhiskeyboard.
• gcc -O4emailsyourcodetoJeffDeanforarewrite.• WhenheheardthatJeffDean'sautobiographywouldbe
exclusivetotheplatform,RichardStallmanboughtaKindle.
• JeffDeanputshispantsononelegatatime,butifhehadmorelegs,you’drealizethealgorithmisactuallyonlyO(logn)
50
![Page 51: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/51.jpg)
Numbers(JeffDeansays)EveryoneShouldKnow
51
![Page 52: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/52.jpg)
Update:ColinScott,UCB
file:///Users/wcohen/Documents/code/interactive_latencies/interactive_latency.html - *may need to open this from shell
52
![Page 53: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/53.jpg)
What’sHappeningwithHardware?• Clockspeed:stuckat3Ghzfor~10years• Netbandwidthdoubles~2years• Diskbandwidthdoubles~2years• SSDbandwidthdoubles~3years• Diskseekspeeddoubles~10years• SSDlatencynearlysaturated
53
![Page 54: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/54.jpg)
54
![Page 55: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/55.jpg)
Numbers(JeffDeansays)EveryoneShouldKnow
55
![Page 56: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/56.jpg)
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
56
![Page 57: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/57.jpg)
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
57
![Page 58: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/58.jpg)
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
58
![Page 59: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/59.jpg)
AtypicalCPU(nottoscale)K8 core in the AMD Athlon 64 CPU
16x bigger
256x bigger
Hard disk (1Tb) 128x bigger
59
![Page 60: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/60.jpg)
Atypicaldisk
60
![Page 61: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/61.jpg)
Numbers(JeffDeansays)EveryoneShouldKnow
~= 10x
~= 15x
~= 100,000x
40x
61
![Page 62: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/62.jpg)
Whatdowecount?
• Compilersdon’twarnJeffDean.JeffDeanwarnscompilers.• ….
• Memoryaccess/instructionsarequalitativelydifferentfromdiskaccess
• Seeksarequalitativelydifferentfromsequentialreadsondisk
• Cache,diskfetches,etcworkbestwhenyoustreamthroughdatasequentially
• Bestcasefordataprocessing:streamthroughthedataonce insequentialorder,asit’sfoundondisk.
62
![Page 63: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/63.jpg)
Otherlessons-?
* but not important enough for this class’s assignments….
*
63
![Page 64: 10-405 ML from Large Datasetswcohen/10-405/overview.pdfWho/What/Where/When •10-405 is a brand new course! •But it’s going to be very close to 10-605 which has been happening](https://reader033.vdocument.in/reader033/viewer/2022060414/5f12a9cd2952703d2a12cf93/html5/thumbnails/64.jpg)
What/How–Nextlecture:probabilityreviewandNaïveBayes.–Homework:• WatchthereviewlectureIlinkedtoonthewiki
– I’mnot goingtorepeatit
64