Preliminaries Background Challenges Statistics and CS Outline
Statistical Machine Learning
Venue: Tuesday/Thursday 11:40 - 12:55, WN 145
Lecturer: Giles Hooker
Office Hours: Wednesday 2 - 4
Comstock 1186
Ph: 5-1638
e-mail: gjh27
Texts and Resources
Hastie, Tibshirani, Friedman, 2001, "The Elements of Statistical Learning", Springer.
Other References: see www.bscb.cornell.edu/~hooker/ML2007
Software: R plus toolboxes, see www.r-project.org
Other Resources
See class website.
Vapnik, 2000, The Nature of Statistical Learning Theory
CS 478 – Machine Learning
CS 578 – Empirical Machine Learning
MATH 774 – Topics in Statistical Learning Theory
ORIE 474 – Statistical Data Mining; a master's-level course.
ORIE 674 – superseded by this class.
http://www.cs.cornell.edu/Projects/learning/
Evaluation and Expectations
Prerequisites: ORIE 351 and ORIE 670
Really: basic probability, theoretical statistics, linear algebra, multivariate calculus, programming.
Assessment: 4 assignments + 1 (small) project
- First 3 assignments: 25% each
- Project: 5%
- Collaboration (with acknowledgement) is encouraged for the first three assignments and the project.
- The final assignment (20%) will be in place of an exam; you are expected to do your own work.
Context and Data
Last 25+ years → automated data collection
- automated transactions
- centralized electronic databases
- new measurement devices
Massive increase in amount, quality and complexity of available data
Canonical example: Walmart transactions
- Items purchased, sale information, store layout, checkout information, mode of payment
- Customer info: age, gender, address, previous purchase patterns, credit history, health status, marital status, occupation, education, driving records...
- In 2000, 1 petabyte; roughly the amount of visual information you will process in your lifetime.
More Examples
More recent: Amazon.com
- Same info as above, but also click streams, browsing patterns, visiting patterns, listed opinions, links followed, mouse gestures...
Also
- Image data: photos (recent Microsoft research), digitized handwriting, face recognition
- Audio data: same thing
- Natural writing: Epinions, aviation safety inspection reports, Google searches, e-mail...
- eBay buying patterns, auction patterns
Outside of Sociology/Economics/Commerce
- Medical imaging
- Automatic diagnosis/ER triage
- Drug discovery
- Cancer taxonomy
- Astronomy: Sloan Digital Sky Survey
- Communications: predicting network failures
What are we going to do with all this?
Data Size: Challenges for Statistics
Storage, access, processing
- ⇒ heavy CS involvement
- Concern over computational efficiency
- Sub-sampling ⇒ very small populations might be of high economic value
- Amazon Goldbox example
Modeling and Inference
- Enormous power → all (reasonable) models are wrong
- Manual model-building requires large amounts of time
- Parametric assumptions are not reasonable
- Don't know what we're looking for (e.g. association rules)
Complexity: Challenges
High dimensionality
- many "nuisance" variables, lots of correlation (frequently nonlinear)
- model/variable selection ⇒ massive variance
- difficulty of model checking
- curse of dimensionality
"Feature" complexity
- Different lengths of covariates
- Spatial and geometric structure
- Unstructured covariates: natural language, click-streams
Uncertain measures of fit: Google search
Feature Complexity
Data do not look like a traditional design matrix:
- images
- sentences
- video
- molecules
- eBay auctions
Need to have a metric for comparison ⇒ feature extraction.
But also complex relationships:
- links between webpages
- composite molecules
Strange invariances (face recognition: rotation, translation, occlusion).
Much modern effort = dealing with specific complex problems.
What is Machine Learning? And what is Data Mining?
Definitions (and distinctions) differ; not well demarcated, largely a matter of advertising.
Concerned with "automated", model-free statistics.
My definitions:
- Machine Learning = focus on predictive modeling
- Data Mining = discovery of patterns from massive amounts of data.
General Philosophy
"Don't show me a model: let the data do the work".
This Class
Focus on prediction:
Generic model $y = F(x) + \varepsilon$
Given data $\{y_i, x_i\}_{i=1}^{N} \sim p(y, x)$
- $y$ = "output", or "response"
- $x$ = "inputs", "features" (ML: not "predictors")
Want to estimate $F$ to get good prediction:
$$F^* = \arg\min_{F \in C} E_{y,x} L(y, F(x))$$
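The expectation in the definition of F* is taken over the unknown p(y, x), so in practice it is commonly replaced by an average over the sample. The following is a minimal Python sketch of that idea, assuming squared-error loss and a deliberately tiny class C of lines through the origin, F(x) = b·x, with b searched over a grid. The function names, the grid, and the toy data are illustrative assumptions, not part of the course material.

```python
def squared_loss(y, f):
    """L(y, F(x)) = (y - F(x))^2."""
    return (y - f) ** 2


def empirical_risk(F, xs, ys, loss=squared_loss):
    """Sample average of the loss; stands in for E_{y,x} L(y, F(x))."""
    return sum(loss(y, F(x)) for x, y in zip(xs, ys)) / len(xs)


def fit_slope(xs, ys, candidates):
    """Pick the slope b in `candidates` whose F(x) = b * x has the lowest empirical risk."""
    return min(candidates, key=lambda b: empirical_risk(lambda x: b * x, xs, ys))


# Toy data (hypothetical): roughly y = 2x plus small noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]

# The class C here: slopes 0.0, 0.1, ..., 4.0.
candidates = [k / 10 for k in range(41)]
best_b = fit_slope(xs, ys, candidates)
print(best_b, empirical_risk(lambda x: best_b * x, xs, ys))
```

Enlarging the class C (more slopes, an intercept, nonlinear terms) lowers the empirical risk on the sample but not necessarily the expected loss on new data.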
Why?
$L$ = "loss" of guessing $F$ when truth is $y$
$F^*$ is best predictor of $y$ under $L$
Examples:
Regression: $y \in \mathbb{R}$,
$$L(y, F(x)) = |y - F(x)| \quad \text{or} \quad (y - F(x))^2$$
Classification: $y \in \{c_1, \ldots, c_K\}$,
$$L = L_{y,F} \quad (K \times K \text{ matrix})$$
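The K × K loss matrix for classification can be made concrete with a small example. The sketch below assumes K = 2 classes, a hypothetical loss matrix in which one kind of mistake costs more than the other, and made-up class probabilities at some x. It picks the prediction with the smallest expected loss; all names and numbers are illustrative assumptions.

```python
# Rows index the true class y, columns the predicted class F(x).
# Hypothetical costs: missing a "sick" case is 5 times worse than a false alarm.
CLASSES = ["healthy", "sick"]
LOSS = [
    [0.0, 1.0],  # truth = healthy: predicting healthy costs 0, predicting sick costs 1
    [5.0, 0.0],  # truth = sick:    predicting healthy costs 5, predicting sick costs 0
]


def expected_loss(pred, probs, loss=LOSS):
    """Expected loss of predicting class index `pred` when P(y = k | x) = probs[k]."""
    return sum(probs[k] * loss[k][pred] for k in range(len(probs)))


def best_prediction(probs, loss=LOSS):
    """Class index minimizing expected loss under the given class probabilities."""
    return min(range(len(probs)), key=lambda pred: expected_loss(pred, probs, loss))


# Made-up probabilities at some x: "healthy" is more likely,
# yet "sick" has the lower expected loss under this matrix.
probs = [0.7, 0.3]
print(CLASSES[best_prediction(probs)])
```

With the symmetric 0-1 matrix `[[0, 1], [1, 0]]` the same rule reduces to predicting the most probable class.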
Nearest Neighbors
Canonical example
Observations: $x_1, \ldots, x_N$
Responses: $y_1, \ldots, y_N$.
Desired: estimate the response $y$ for a new observation $x$.
Prediction rule: let
$$I = \arg\min_{i \in \{1, \ldots, N\}} \|x_i - x\|$$
and predict $y_I$.
Model-free, successful (sort of), not something a statistician would think of.
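The nearest-neighbor prediction rule can be sketched in a few lines of Python. The slide does not fix a norm, so the Euclidean distance used here is an assumption, as are the function name and the toy data; ties are broken in favor of the lowest index.

```python
import math


def nearest_neighbor_predict(xs, ys, x_new):
    """Return y_I, where I = argmin_i ||x_i - x_new|| (Euclidean norm assumed)."""
    I = min(range(len(xs)), key=lambda i: math.dist(xs[i], x_new))
    return ys[I]


# Toy training set (hypothetical): two features per observation.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
ys = ["a", "a", "b", "c"]

print(nearest_neighbor_predict(xs, ys, (4.0, 4.5)))
print(nearest_neighbor_predict(xs, ys, (0.2, 0.9)))
```

No model is fitted: the training data themselves are the predictor, so every prediction scans all N observations.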
Is Data Mining Statistics?
- Historically (and currently) dominated by computer science.
- Very little attention from statisticians.
- Still perceived as Computer Science/Math
- Led to "obvious" statistical mistakes in research, but also successful ideas that are very different from statistical thinking.
- Growing recognition/convergence from both sides.
- Friedman, 1997, "Data Mining and Statistics: What's the Connection?"
- Breiman, 2001, "Statistical Modeling: The Two Cultures"
This Class
Statistical Machine Learning
How do we understand ML success from a statistical perspective?
- bias, variance, regularization
How do we assess the performance of these methods?
- cross validation, bootstrap...
Does this fit into what we already know?
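As a rough illustration of assessing performance by cross validation, the Python sketch below estimates held-out squared error with K folds for two predictors: a constant (training mean) baseline and 1-nearest-neighbor on scalar inputs. The fold assignment (every k-th point, no shuffling), the function names, and the toy data are assumptions made for the example; in practice folds are usually assigned at random.

```python
def k_fold_cv_error(xs, ys, fit, loss, k=5):
    """Average held-out loss over k folds; `fit(train_xs, train_ys)` returns a predictor F."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out in this fold
        train_xs = [xs[i] for i in range(n) if i not in test_idx]
        train_ys = [ys[i] for i in range(n) if i not in test_idx]
        F = fit(train_xs, train_ys)
        total += sum(loss(ys[i], F(xs[i])) for i in test_idx)
    return total / n


def fit_mean(train_xs, train_ys):
    """Baseline: ignore x and always predict the training mean."""
    mean = sum(train_ys) / len(train_ys)
    return lambda x: mean


def fit_nearest_neighbor(train_xs, train_ys):
    """1-nearest-neighbor on scalar inputs."""
    def F(x):
        i = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
        return train_ys[i]
    return F


def squared_loss(y, f):
    return (y - f) ** 2


# Toy data (hypothetical): y = x^2 on an evenly spaced grid, no noise.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

print(k_fold_cv_error(xs, ys, fit_mean, squared_loss))
print(k_fold_cv_error(xs, ys, fit_nearest_neighbor, squared_loss))
```

Each observation is held out exactly once, so the returned value averages N out-of-sample losses.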
Outline
- Linear methods: linear regression/linear discriminant analysis
- Extensions: nonlinear regression, kernel methods, neural networks
- Regularization and Model Assessment
- Nonlinear methods: nearest neighbors
- Tree-based methods and extensions
- Aggregation methods: boosting, bagging and randomization
- Requested topics, PAC learning theory, semi-supervised learning, clustering