Preliminaries Background Challenges Statistics and CS Outline

Statistical Machine Learning

Venue: Tuesday/Thursday 11:40 - 12:55, WN 145

Lecturer: Giles Hooker
Office Hours: Wednesday 2 - 4
Comstock 1186
Ph: 5-1638
e-mail: gjh27

Texts and Resources

Hastie, Tibshirani, Friedman, 2001, "The Elements of Statistical Learning", Springer.

Other References: see www.bscb.cornell.edu/~hooker/ML2007
Software: R plus toolboxes, see www.r-project.org

Other Resources

See class website.

Vapnik, 2000, The Nature of Statistical Learning Theory

CS 478 – Machine Learning
CS 578 – Empirical Machine Learning
MATH 774 – Topics in Statistical Learning Theory
ORIE 474 – Statistical Data Mining; a masters-level course.
ORIE 674 – superseded by this class.

http://www.cs.cornell.edu/Projects/learning/

Evaluation and Expectations

Prerequisites: ORIE 351 and ORIE 670
Really: basic probability, theoretical statistics, linear algebra, multivariate calculus, programming.

Assessment: 4 assignments + 1 (small) project
    First 3 assignments: 25% each
    Project: 5%
    Collaboration (with acknowledgement) is encouraged for the first three assignments and the project.
    Final assignment (20%) will be in place of an exam; you are expected to do your own work.

Context and Data

Last 25+ years → automated data collection
    automated transactions
    centralized electronic databases
    new measurement devices

Massive increase in amount, quality and complexity of available data

Canonical example: Walmart transactions
    Items purchased, sale information, store layout, checkout information, mode of payment
    Customer info: age, gender, address, previous purchase patterns, credit history, health status, marital status, occupation, education, driving records....
    In 2000, 1 petabyte; roughly the amount of visual information you will process in your lifetime.

More Examples

More recent: Amazon.com
    Same info as above, but also click streams, browsing patterns, visiting patterns, listed opinions, links followed, mouse gestures....

Also
    Image data: photos (recent Microsoft research), digitized handwriting, face recognition
    Audio data: same thing
    Natural writing: epinions, aviation safety inspection reports, Google searches, e-mail...
    eBay buying patterns, auction patterns

Outside of Sociology/Economics/Commerce

Medical imaging
Automatic diagnosis/ER triage
Drug discovery
Cancer taxonomy
Astronomy: Sloan Digital Sky Survey
Communications: predicting network failures

What are we going to do with all this?

Data Size: Challenges for Statistics

Storage, access, processing
    ⇒ heavy CS involvement
    Concern over computational efficiency
    Sub-sampling ⇒ very small populations might be of high economic value
    Amazon Goldbox example

Modeling and Inference
    Enormous power → all (reasonable) models are wrong
    Manual model-building requires large amounts of time
    Parametric assumptions are not reasonable
    Don't know what we're looking for (e.g. association rules)

Complexity: Challenges

High dimensionality
    many "nuisance" variables, lots of correlation (frequently nonlinear)
    model/variable selection ⇒ massive variance
    difficulty of model checking
    curse of dimensionality

"Feature" complexity
    Different lengths of covariates
    Spatial and geometric structure
    Unstructured covariates: natural language, click-streams

Uncertain measures of fit: Google search
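
One standard way to make the curse of dimensionality concrete (the calculation appears in the Hastie, Tibshirani and Friedman text) is to ask how large a sub-cube must be to capture a given fraction of the data. The sketch below assumes inputs spread uniformly over the unit cube in p dimensions; the function name `edge_length` is made up for this illustration.

```python
def edge_length(fraction, p):
    """Edge of a sub-cube of the unit cube in p dimensions that holds
    `fraction` of uniformly distributed data (assumption: uniform inputs)."""
    return fraction ** (1.0 / p)


# A "local" neighborhood holding 1% of the data stops being local as p grows.
for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 3))
```

In ten dimensions, a neighborhood holding just 1% of the data already spans about 63% of the range of each input, so "nearby" points are not very near.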

Feature Complexity

Data do not look like a traditional design matrix:
    images
    sentences
    video
    molecules
    eBay auctions

Need to have a metric for comparison ⇒ feature extraction.

But also complex relationships:
    links between webpages
    composite molecules

Strange invariances (face recognition: rotation, translation, occlusion).

Much modern effort = dealing with specific complex problems.

What is Machine Learning? And what is Data Mining?

Definitions (and distinctions) differ; not well demarcated/largely a matter of advertising.
Concerned with "automated", model-free statistics.

My definitions:
    Machine Learning = focus on predictive modeling
    Data Mining = discovery of patterns from massive amounts of data.

General Philosophy
    "Don't show me a model: let the data do the work".

This Class

Focus on prediction:

Generic model: $y = F(x) + \epsilon$

Given data $\{(y_i, x_i)\}_{i=1}^N \sim p(y, x)$
    y = "output", or "response"
    x = "inputs", "features" (ML usage; not "predictors")

Want to estimate F to get good prediction:

$F^* = \arg\min_{F \in C} E_{y,x} L(y, F(x))$

Why?

L = "loss" of guessing F when truth is y

$F^*$ is the best predictor of y under L

Examples:

Regression: $y \in \mathbb{R}$,

$L(y, F(x)) = |y - F(x)|$ or $(y - F(x))^2$

Classification: $y \in \{c_1, \ldots, c_K\}$

$L = L_{y,F}$ (a $K \times K$ matrix)
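
To see how the choice of loss changes the "best" predictor, the sketch below replaces the expectation with an average over a small sample and restricts F to constants. The data values, the grid of candidate constants, and all function names are assumptions made for illustration; the 0-1 loss matrix is one common choice of K × K matrix, not one prescribed by the slides.

```python
from statistics import mean, median


def absolute_loss(y, f):
    return abs(y - f)


def squared_loss(y, f):
    return (y - f) ** 2


def empirical_risk(loss, ys, preds):
    """Average loss over paired responses and predictions."""
    return sum(loss(y, f) for y, f in zip(ys, preds)) / len(ys)


# Hypothetical responses; with no inputs, F reduces to a constant c.
ys = [1.0, 2.0, 2.0, 3.0, 10.0]


def risk_of_constant(loss, c):
    return empirical_risk(loss, ys, [c] * len(ys))


# Search a grid of candidate constants 0.0, 0.1, ..., 10.0.
grid = [i / 10 for i in range(101)]
best_sq = min(grid, key=lambda c: risk_of_constant(squared_loss, c))
best_abs = min(grid, key=lambda c: risk_of_constant(absolute_loss, c))
print(best_sq, mean(ys))     # squared loss is minimized at the sample mean
print(best_abs, median(ys))  # absolute loss is minimized at the sample median

# Classification: loss as a K x K table indexed by (true class, predicted class).
classes = ["c1", "c2", "c3"]
zero_one = {(t, p): 0 if t == p else 1 for t in classes for p in classes}


def matrix_loss(y, f):
    return zero_one[(y, f)]
```

With the 0-1 matrix, `empirical_risk(matrix_loss, truth, guesses)` is simply the misclassification rate; unequal off-diagonal entries would encode mistakes of differing cost.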

Nearest Neighbors

Canonical example

Observations: $x_1, \ldots, x_N$

Responses: $y_1, \ldots, y_N$

Desired: estimate the response y for a new observation x.

Prediction rule: let

$I = \arg\min_{i \in \{1, \ldots, N\}} \|x_i - x\|$

and predict $y_I$.

Model-free, successful (sort of), not something a statistician would think of.
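
The prediction rule is short enough to write out directly. The following is a minimal sketch, assuming the norm is Euclidean (the slide does not say which norm is meant) and breaking ties by taking the first closest point; the function name and the toy data are hypothetical.

```python
import math


def nearest_neighbor_predict(xs, ys, x_new):
    """Return y_I, where I = argmin_i ||x_i - x_new|| (Euclidean norm assumed)."""
    best_i = min(range(len(xs)), key=lambda i: math.dist(xs[i], x_new))
    return ys[best_i]


# Toy training set: four points in the plane with class labels.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
ys = ["a", "b", "a", "c"]

print(nearest_neighbor_predict(xs, ys, (0.9, 0.2)))  # closest point is (1, 0)
```

Nothing is fitted in advance: every prediction scans the whole training set, which is why the method is called model-free.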

Is Data Mining Statistics?

Historically (and currently) dominated by computer science.
Very little attention from statisticians.
Still perceived as Computer Science/Math
Led to "obvious" statistical mistakes in research, but also successful ideas that are very different from statistical thinking.
Growing recognition/convergence from both sides.
Friedman, 1997, "Data Mining and Statistics: What's the Connection?"
Breiman, 2001, "Statistical Modeling: The Two Cultures"

This Class

Statistical Machine Learning

How do we understand ML success from a statistical perspective?
    bias, variance, regularization
How do we assess the performance of these methods?
    cross validation, bootstrap....
Does this fit into what we already know?
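
As a rough illustration of assessing performance by cross validation, the sketch below estimates the squared-error prediction risk of k-nearest-neighbor regression for several values of k. The one-dimensional data, the deterministic stand-in for noise, the interleaved fold assignment, and the function names are all assumptions made for this example.

```python
def knn_predict(train, x_new, k):
    """Average the responses of the k training points closest to x_new."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


def cross_validate(data, k, n_folds):
    """Estimate squared-error prediction risk of k-NN by n_folds-fold CV."""
    total = 0.0
    for fold in range(n_folds):
        # Hold out every n_folds-th point, starting at index `fold`.
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        total += sum((y - knn_predict(train, x, k)) ** 2 for x, y in test)
    return total / len(data)


# Hypothetical data: a quadratic trend plus a deterministic wiggle standing in for noise.
data = [(i / 20, (i / 20) ** 2 + 0.05 * ((i * 7) % 5 - 2)) for i in range(40)]

for k in (1, 3, 9):
    print(k, cross_validate(data, k, n_folds=5))
```

Small k tracks the wiggle (low bias, high variance) while large k smooths over the trend (higher bias, lower variance); comparing held-out errors across k is one way to choose between them.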

Outline

Linear methods: linear regression/linear discriminant analysis
Extensions: nonlinear regression, kernel methods, neural networks
Regularization and Model Assessment
Nonlinear methods: nearest neighbors
Tree-based methods and extensions
Aggregation methods: boosting, bagging and randomization
Requested topics, PAC learning theory, semi-supervised learning, clustering
