Preliminaries Background Challenges Statistics and CS Outline
Statistical Machine Learning
Venue: Tuesday/Thursday 11:40 - 12:55, WN 145
Lecturer: Giles Hooker
Office Hours: Wednesday 2 - 4
Comstock 1186
Ph: 5-1638
e-mail: gjh27
Texts and Resources
Hastie, Tibshirani, Friedman, 2001, "The Elements of Statistical Learning", Springer.
Other References: see www.bscb.cornell.edu/~hooker/ML2007
Software: R plus toolboxes, see www.r-project.org
Other Resources
See class website.
Vapnik, 2000, The Nature of Statistical Learning Theory
CS 478 – Machine Learning
CS 578 – Empirical Machine Learning
MATH 774 – Topics in Statistical Learning Theory
ORIE 474 – Statistical Data Mining; a master's level course.
ORIE 674 – superseded by this class.
http://www.cs.cornell.edu/Projects/learning/
Evaluation and Expectations
Prerequisites: ORIE 351 and ORIE 670
Really: Basic probability, theoretical statistics, linear algebra, multivariate calculus, programming.
Assessment: 4 Assignments + 1 (small) project
First 3 assignments: 25% each
Project: 5%
Collaboration (with acknowledgement) is encouraged for the first three assignments and the project.
Final assignment (20%) will be in place of an exam; you are expected to do your own work.
Context and Data
Last 25+ years → automated data collection
automated transactions
centralized electronic databases
new measurement devices
Massive increase in amount, quality and complexity of available data
Canonical example: Walmart transactions
Items purchased, sale information, store layout, checkout information, mode of payment
Customer info: age, gender, address, previous purchase patterns, credit history, health status, marital status, occupation, education, driving records....
In 2000, 1 petabyte; roughly the amount of visual information you will process in your lifetime.
More Examples
More recent: Amazon.com
Same info as above, but also click streams, browsing patterns, visiting patterns, listed opinions, links followed, mouse gestures....
Also
Image data: photos (recent Microsoft research), digitized handwriting, face recognition
Audio data: same thing
Natural writing: epinions, aviation safety inspection reports, Google searches, e-mail...
eBay buying patterns, auction patterns
Outside of Sociology/Economics/Commerce
Medical imaging
Automatic diagnosis/ER triage
Drug discovery
Cancer taxonomy
Astronomy: Sloan Digital Sky Survey
Communications: predicting network failures
What are we going to do with all this?
Data Size: Challenges for Statistics
Storage, access, processing
⇒ heavy CS involvement
Concern over computational efficiency
Sub-sampling ⇒ very small populations might be of high economic value
Amazon Goldbox example
Modeling and Inference
Enormous power → all (reasonable) models are wrong
Manual model-building requires large amounts of time
Parametric assumptions are not reasonable
Don’t know what we’re looking for (e.g. association rules)
Complexity: Challenges
High dimensionality
many "nuisance" variables, lots of correlation (frequently nonlinear)
model/variable selection ⇒ massive variance
difficulty of model checking
curse of dimensionality
"Feature" complexity
Different lengths of covariates
Spatial and geometric structure
Unstructured covariates: natural language, click-streams
Uncertain measures of fit: Google search
Feature Complexity
Data do not look like a traditional design matrix:
images
sentences
video
molecules
eBay auctions
Need to have a metric for comparison ⇒ feature extraction.
But also complex relationships:
links between webpages
composite molecules
Strange invariances (face recognition: rotation, translation, occlusion).
Much modern effort = dealing with specific complex problems.
What is Machine Learning?
And what is Data Mining?
Definitions (and distinctions) differ; not well demarcated/largely a matter of advertising.
Concerned with "automated", model-free statistics.
My definitions:
Machine Learning = focus on predictive modeling
Data Mining = discovery of patterns from massive amounts of data.
General Philosophy
"Don’t show me a model: let the data do the work".
This Class
Focus on prediction:
Generic model: $y = F(x) + \epsilon$
Given data $\{y_i, x_i\}_{i=1}^N \sim p(y, x)$
y = "output", or "response"
x = "inputs", "features" (ML terminology; not "predictors")
Want to estimate F to get good prediction:
$F^* = \arg\min_{F \in C} E_{yx} L(y, F(x))$
Why?
L = "loss" of guessing F(x) when the truth is y
$F^*$ is the best predictor of y under L
Examples:
Regression: $y \in \mathbb{R}$,
$L(y, F(x)) = |y - F(x)|$ or $(y - F(x))^2$
Classification: $y \in \{c_1, \ldots, c_K\}$
$L = L_{y,F}$ (a $K \times K$ matrix)
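The loss definitions above can be made concrete with a small sketch. The expectation over p(y, x) is not available in practice, so the code below averages the loss over a sample instead (the empirical risk); that substitution, along with all function names and the toy data, is an assumption made for illustration and does not come from the slides.

```python
from typing import Callable, Sequence


def absolute_loss(y: float, pred: float) -> float:
    """Regression loss |y - F(x)|."""
    return abs(y - pred)


def squared_loss(y: float, pred: float) -> float:
    """Regression loss (y - F(x))^2."""
    return (y - pred) ** 2


def matrix_loss(loss_matrix: Sequence[Sequence[float]],
                classes: Sequence[str]) -> Callable[[str, str], float]:
    """Classification loss from a K x K matrix.

    Rows are indexed by the true class, columns by the predicted class.
    """
    index = {c: k for k, c in enumerate(classes)}

    def loss(y: str, pred: str) -> float:
        return loss_matrix[index[y]][index[pred]]

    return loss


def empirical_risk(loss, F, xs, ys) -> float:
    """Sample average of L(y_i, F(x_i)), standing in for E_yx L(y, F(x))."""
    return sum(loss(y, F(x)) for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy regression data and a candidate predictor F(x) = 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 9.0]


def double(x: float) -> float:
    return 2.0 * x


risk_abs = empirical_risk(absolute_loss, double, xs, ys)  # residuals 0, 0, 0, 3
risk_sq = empirical_risk(squared_loss, double, xs, ys)

# Hypothetical 2-class problem where missing "ham" costs more than missing "spam".
spam_loss = matrix_loss([[0, 1], [5, 0]], classes=["spam", "ham"])
labels = ["spam", "ham", "ham", "spam"]
risk_cls = empirical_risk(spam_loss, lambda x: "spam", range(4), labels)
```

The same predictor can rank differently under different losses: the single large residual contributes 3 to the absolute-loss total but 9 to the squared-loss total.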
Nearest Neighbors
Canonical example
Observations: $x_1, \ldots, x_N$
Responses: $y_1, \ldots, y_N$.
Desired: Estimate the response y for a new observation x.
Prediction Rule: let
$I = \arg\min_{i \in \{1, \ldots, N\}} \|x_i - x\|$
predict $y_I$.
Model-free, successful (sort of), not something a statistician would think of.
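The prediction rule above fits in a few lines. This is a minimal sketch assuming Euclidean distance for the norm and breaking ties by taking the first closest index; the function name and the toy data are hypothetical.

```python
import math
from typing import Sequence


def nearest_neighbor_predict(train_x: Sequence[Sequence[float]],
                             train_y: Sequence,
                             x: Sequence[float]):
    """Return y_I, where I indexes the training point closest to x."""
    # I = argmin over i of ||x_i - x||; min() keeps the first index on ties.
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]


# Hypothetical training set: three points in the plane with class labels.
train_x = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
train_y = ["a", "b", "c"]

label_near_b = nearest_neighbor_predict(train_x, train_y, (0.9, 1.2))
label_near_c = nearest_neighbor_predict(train_x, train_y, (4.0, 4.0))
```

No model is fitted: the training data themselves are the predictor, which is why each prediction costs a full pass over the N observations in this naive form.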
Is Data Mining Statistics?
Historically (and currently) dominated by computer science.
Very little attention from statisticians.
Still perceived as Computer Science/Math
Led to "obvious" statistical mistakes in research, but also successful ideas that are very different from statistical thinking.
Growing recognition/convergence from both sides.
Friedman, 1997, "Data Mining and Statistics: What’s the Connection?"
Breiman, 2001, "Statistical Modeling: The Two Cultures"
This Class
Statistical Machine Learning
How do we understand ML success from a statistical perspective?
bias, variance, regularization
How do we assess the performance of these methods?
cross validation, bootstrap....
Does this fit into what we already know?
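As one illustration of performance assessment, here is a rough sketch of k-fold cross validation. It assumes squared-error loss, deterministic folds assigned by index modulo k (no shuffling), and two hypothetical one-dimensional predictors; none of these choices come from the slides.

```python
from typing import Callable, Sequence


def k_fold_cv_error(xs: Sequence[float], ys: Sequence[float],
                    fit_predict: Callable, k: int) -> float:
    """Average squared error over held-out points across k folds.

    fit_predict(train_x, train_y, x) must return a prediction for x
    using only the training portion of the data.
    """
    n = len(xs)
    total = 0.0
    for fold in range(k):
        # Point i is held out in fold (i mod k); the rest is training data.
        train_idx = [i for i in range(n) if i % k != fold]
        test_idx = [i for i in range(n) if i % k == fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            pred = fit_predict(train_x, train_y, xs[i])
            total += (ys[i] - pred) ** 2
    return total / n


def nn_1d(train_x, train_y, x):
    """1-nearest-neighbor prediction for scalar inputs."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def mean_only(train_x, train_y, x):
    """Baseline that ignores x and predicts the training mean."""
    return sum(train_y) / len(train_y)


# Hypothetical noiseless data on the line y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x for x in xs]

cv_nn = k_fold_cv_error(xs, ys, nn_1d, k=3)
cv_mean = k_fold_cv_error(xs, ys, mean_only, k=3)
```

Each point is predicted only by a rule that never saw it, so the two numbers estimate out-of-sample error; on this toy data nearest neighbors comes out well ahead of the mean baseline.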
Outline
Linear methods: linear regression/linear discriminant analysis
Extensions: nonlinear regression, kernel methods, neural networks
Regularization and Model Assessment
Nonlinear methods: nearest neighbors
Tree-based methods and extensions
Aggregation methods: boosting, bagging and randomization
Requested topics, PAC learning theory, semi-supervised learning, clustering