Statistical Learning and Data Mining
Stat557

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

General Information

- Course homepage: http://www.stat.psu.edu/~jiali/stat557
- Prerequisite:
  - Elementary probability theory
  - Conditional distribution, expectation
  - C, Matlab, or S-plus programming

- Textbooks:
  - Required: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, and J. Friedman (ElemStatLearn).
  - Optional:
    1. Classification and Regression Trees, by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone
    2. Pattern Recognition and Neural Networks, by B. Ripley
    3. Principles of Data Mining, by H. Mannila, P. Smyth, and D. J. Hand
    4. Data Mining: Concepts and Techniques, by J. Han and M. Kamber

What Is Data Mining?

Data mining: tools, methodologies, and theories for revealing patterns in data, a critical step in knowledge discovery.

Driving forces:

- Big data:
  - Enormous volume
  - High complexity: dimension, structure
  - Dynamic
- Explosive growth of data in a great variety of fields:
  - Cheaper storage devices with higher capacity
  - Faster communication
  - Better database management systems
- Rapidly increasing computing power: distributed and parallel platforms
- Make data work for us

Research fields

- Statistics
- Machine learning
- Pattern recognition
- Signal processing
- Database

Applications

- Business
  - Wal-Mart data warehouse
  - Credit card companies
- Genomics
  - Human genome project: DNA sequences
  - Microarray data
- Information retrieval
  - Terabytes of data on the internet
  - Multimedia information
- Communication systems
  - Speech recognition
  - Image analysis
- Many other scientific fields

Problem in Focus: Prediction

Terminology

Notation

- Input X: X is often multidimensional. Each dimension of X is denoted by X_j and is referred to as a feature, predictor, or independent variable (or simply variable).
- Output Y: response, dependent variable.

Categorization

- Supervised learning vs. unsupervised learning
  - Is Y available in the training data?
- Regression vs. classification
  - Is Y quantitative or qualitative?
  - A qualitative Y is also denoted by G ∈ 𝒢 = {1, 2, ..., K}.

Examples

Email spam: (ElemStatLearn)

- Goal: predict whether an email is junk email, i.e., “spam”.
- Raw data: text email messages.
- Input X: relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message.
- Training data set: 4601 email messages with email type known (supervised learning).
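
A minimal sketch of how a relative-frequency input vector might be computed from a raw message. The vocabulary below is made up for illustration (these notes do not list the 57 actual features), and the regular-expression tokenizer is an assumption, not the procedure used to build the ElemStatLearn data set.

```python
import re

# Hypothetical vocabulary; the real data set uses 57 words and punctuation marks.
VOCAB = ["free", "money", "meeting", "!"]

def relative_frequencies(message, vocab=VOCAB):
    """Return, for each vocabulary entry, its count divided by the total token count."""
    # Tokens are lowercase words plus single punctuation marks (digits are ignored).
    tokens = re.findall(r"[a-z']+|[^\sa-z0-9]", message.lower())
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocab)
    return [tokens.count(term) / total for term in vocab]

# One input vector X for one message; each entry is a feature X_j.
x = relative_frequencies("Free money! Claim your free prize now!")
```

Each message becomes one such vector, and the known label (spam or not) plays the role of the output G in the training set.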

Examples

Handwritten digit recognition: (ElemStatLearn)

- Goal: identify single digits 0 to 9 based on images.
- Raw data: images that are scaled segments from five-digit ZIP codes.
  - 16 × 16 eight-bit grayscale maps
  - Pixel intensities range from 0 (black) to 255 (white).
- Input data: a 256-dimensional vector, or feature vectors with lower dimensions.
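
A small sketch of the two kinds of input mentioned above: the full 256-dimensional vector obtained by flattening the image, and one possible lower-dimensional feature vector. The row-mean feature and the synthetic image are assumptions chosen for illustration; they are not taken from the ZIP code data.

```python
def image_to_vector(image):
    """Flatten a 16 x 16 grid of 8-bit intensities into a 256-dimensional vector, row by row."""
    if len(image) != 16 or any(len(row) != 16 for row in image):
        raise ValueError("expected a 16 x 16 image")
    vector = [pixel for row in image for pixel in row]
    if any(not 0 <= p <= 255 for p in vector):
        raise ValueError("pixel intensities must lie in 0..255")
    return vector

def row_means(image):
    """A hypothetical lower-dimensional feature: the mean intensity of each row (16 values)."""
    return [sum(row) / len(row) for row in image]

# Synthetic image (not real ZIP code data): white background with a black vertical stroke.
image = [[0 if col in (7, 8) else 255 for col in range(16)] for _ in range(16)]
x = image_to_vector(image)   # 256 features
z = row_means(image)         # 16 features
```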

Examples

Image segmentation:

- Goal: segment images into regions of different types, e.g., man-made vs. natural in aerial images, graph and picture vs. text in document images.
- Raw data: grayscale images represented by matrices of size m × n, or color images represented by 3 such matrices.

Figure: Aerial images. Left: original image of size 512 × 512 with pixel intensity ranging from 0 to 255. Right: hand-labeled classified image. White: man-made. Gray: natural.

- Input data:
  - Divide images into blocks of pixels or form a neighborhood around each pixel.
  - Compute statistics using pixel intensities in each block.
  - An image is converted to an array of input vectors.
- Methodologies:
  - Assume the feature vectors are independent.
  - Employ spatial models to capture dependence among the vectors.
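
The block-based conversion described above can be sketched as follows. The choice of statistics (mean and variance), the block size, and the synthetic image are assumptions for illustration; the notes do not say which statistics are used.

```python
from statistics import mean, pvariance

def block_features(image, block=4):
    """Split an m x n image into non-overlapping block x block tiles and
    return one (mean, variance) feature vector per tile, in row-major order."""
    m, n = len(image), len(image[0])
    features = []
    for top in range(0, m - block + 1, block):
        for left in range(0, n - block + 1, block):
            pixels = [image[r][c]
                      for r in range(top, top + block)
                      for c in range(left, left + block)]
            features.append((mean(pixels), pvariance(pixels)))
    return features

# Synthetic 8 x 8 image: left half dark and flat, right half a checkerboard texture.
image = [[10 if c < 4 else (0 if (r + c) % 2 else 200) for c in range(8)]
         for r in range(8)]
vectors = block_features(image, block=4)  # an array of input vectors, one per block
```

Treating each block's vector on its own corresponds to the independence assumption; a spatial model would additionally use the positions of neighboring blocks.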

Examples

Speech recognition:

- Goal: identify the words spoken from speech signals.
  - Automatic voice recognition systems used by airline companies
  - Automatic stock price reporting
- Raw data: voice amplitude sampled at discrete time points (a time sequence).

- Input data: speech feature vectors computed at the sampling times.
- Methodology:
  - Estimate a Hidden Markov Model (HMM) for each word, e.g., State College, San Francisco, Pittsburgh.
  - For a new word, find the HMM that yields the maximum likelihood.
  - Identify the word as the one associated with that HMM.
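
A toy sketch of the recognition step: score an observation sequence under each word's HMM and pick the most likely one. For simplicity it assumes discrete observation symbols (real speech features are continuous vectors) and hand-set parameters rather than estimated ones; the model names and numbers are hypothetical.

```python
def forward_likelihood(obs, init, trans, emit):
    """Likelihood P(obs | HMM) of a discrete-observation HMM via the forward algorithm."""
    n_states = len(init)
    # alpha[i] = P(o_1, ..., o_t, state at time t = i)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][o]
                 for j in range(n_states)]
    return sum(alpha)

# Hypothetical two-state, left-to-right models over symbols 0 and 1 (not estimated from data).
models = {
    "word_a": {"init": [1.0, 0.0],
               "trans": [[0.7, 0.3], [0.0, 1.0]],
               "emit": [[0.9, 0.1], [0.2, 0.8]]},
    "word_b": {"init": [1.0, 0.0],
               "trans": [[0.7, 0.3], [0.0, 1.0]],
               "emit": [[0.1, 0.9], [0.8, 0.2]]},
}

def recognize(obs, models):
    """Return the word whose HMM gives the observation sequence the highest likelihood."""
    scores = {word: forward_likelihood(obs, **params) for word, params in models.items()}
    return max(scores, key=scores.get), scores

word, scores = recognize([0, 0, 1, 1], models)
```

For long sequences the raw likelihood underflows, so practical implementations work with scaled or log-domain quantities; that is omitted here to keep the sketch short.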

Examples

DNA Expression Microarray:

- Goal: identify disease or tissue types.
- Raw data: for each sample taken from a tissue of a particular disease type, the expression levels of a large collection of genes are measured.
- Input data: cleaned-up gene expression data
  - Normalization
  - Denoising
  - Ample literature on the topic of cleaning microarray data
- Example data set: 4026 genes, 96 samples taken from 9 classes of tissues.
- Challenges:
  - Very high-dimensional data
  - Very limited number of samples

Examples

DNA sequence classification:

- Goal: distinguish “junk” segments from coding segments.
- Raw data: sequences of letters, e.g., A, C, G, T for DNA sequences.
- Input data: likelihood ratio statistics computed from stochastic models.
- Supervised learning: estimate stochastic models, select models.
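
One way to picture a likelihood ratio statistic: fit a simple stochastic model to each class and compare how well the two models explain a new segment. The sketch below assumes first-order Markov chains with add-one smoothing and uses invented toy sequences; the notes do not specify which stochastic models are used.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def fit_markov_chain(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(b | a) with additive smoothing."""
    counts = {a: defaultdict(float) for a in ALPHABET}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a][b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        probs[a] = {b: (counts[a][b] + pseudocount) / total for b in ALPHABET}
    return probs

def log_likelihood(seq, probs):
    """Log-probability of the transitions in seq (the first letter is treated as given)."""
    return sum(math.log(probs[a][b]) for a, b in zip(seq, seq[1:]))

def log_likelihood_ratio(seq, coding, junk):
    """Positive values favor the coding model, negative values the junk model."""
    return log_likelihood(seq, coding) - log_likelihood(seq, junk)

# Toy labeled training sequences, invented for illustration only.
coding_model = fit_markov_chain(["ATGGCGGCGGCG", "ATGGCCGCGGGC"])
junk_model = fit_markov_chain(["ATATATATTTAA", "TTATAAATATAT"])

llr = log_likelihood_ratio("GCGGCGGC", coding_model, junk_model)
```

The resulting statistic can be thresholded directly or passed as an input feature to a classifier.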

Supervised Learning

Two types of learning:

- Regression: the response Y is quantitative.
- Classification: the response Y is qualitative, or categorical.

Two aspects in learning:

- Fit the data well.
- Be robust.

Equivalent concepts:

- Training error vs. testing error
- Bias vs. variance
- Fitting vs. overfitting
- Empirical risk vs. model complexity (capacity)
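
The tension between fitting the data and being robust can be illustrated with a small experiment. This sketch uses k-nearest-neighbor regression, which is a choice made here for illustration and not a method named on this slide, on synthetic data with a deterministic stand-in for noise. With k = 1 the training error is zero, yet the error on new points is worse than with k = 3.

```python
def knn_predict(x, train, k):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, train, k):
    """Mean squared error of the k-NN predictor, fitted on train, evaluated on data."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)

# Synthetic data: the signal is f(x) = x; training responses carry alternating +1/-1 "noise".
train = [(float(i), i + (1.0 if i % 2 == 0 else -1.0)) for i in range(20)]
# Test points fall between training inputs and are scored against the
# noise-free signal (a simplification of a real held-out test set).
test = [(i + 0.25, i + 0.25) for i in range(19)]

for k in (1, 3):
    print(k, round(mse(train, train, k), 3), round(mse(test, train, k), 3))
```

The k = 1 fit reproduces the training responses exactly (low bias, high variance), while k = 3 smooths over the noise at the cost of a nonzero training error.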

Learning Spectrum

Regression

Overview:

- Linear models:
  - The mean response is a linear function of the independent variables.
- Generalized linear models
- Expand basis:
  - Splines (polynomials)
  - Reproducing Kernel Hilbert Spaces
  - Wavelet smoothing
- Kernel methods
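
A small sketch contrasting a linear model in the original variable with a linear model after a basis expansion. The data are synthetic and noise-free, and the single basis function h(x) = x^2 is an assumption chosen to keep the example short; splines and wavelets apply the same idea with richer bases.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for the model E[Y | X = x] = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x ** 2 for x in xs]  # synthetic, noise-free responses

# Linear in x: the fitted slope is zero, so the curvature is missed entirely.
b0, b1 = fit_line(xs, ys)

# Basis expansion: replace x by h(x) = x^2; the model stays linear in its coefficients.
c0, c1 = fit_line([x ** 2 for x in xs], ys)
```

The second fit recovers the generating coefficients because the mean response is linear in h(x), even though it is not linear in x.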

Classification: A Graphical View

Outline

- Linear regression
- Linear methods for classification
- Prototype methods
- Classification and regression trees (CART)
- Mixture discriminant analysis
- Hidden Markov models and their applications
