574m: introduction to statistical machine...


Page 1: 574M: Introduction to Statistical Machine Learningmath.arizona.edu/~hzhang/math574m/2017Lect1.pdf · 2017-08-08 · Introduction Examples 574M: Introduction to Statistical Machine

IntroductionExamples

574M: Introduction to Statistical Machine Learning

Hao Helen Zhang

Fall, 2017

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 1 / 33


Course Information

Course homepage: http://www.math.arizona.edu/~hzhang/math574m.html

Prerequisite: MATH 464, MATH 466, or equivalent

Programming language: R (recommended)

Textbook: The Elements of Statistical Learning (electronic version available at course website)

Reference Books:

1. Principles and Theory for Data Mining and Machine Learning by Clarke, Fokoué, Zhang (2009)
2. Pattern Recognition and Neural Networks by B. Ripley (1996)
3. Learning with Kernels by Schölkopf and Smola (2000)
4. Nature of Statistical Learning Theory by Vapnik (1998)


What is Data Mining

the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions

a set of methods used in the knowledge discovery process

the process of discovering advantageous patterns in data

a decision support process where we look in large databases for unknown and unexpected patterns of information

......

DM is a process of discovering patterns and relationships in data, with an emphasis on large observational databases.


Emerging Discipline, Why Now?

Driving Forces

explosive growth of data in a great variety of fields

revolution in biotechniques (microarray, GWAS, next-generation sequencing)

internet, network, search engines, digital images, multi-media information

Rapidly increasing computer power

cheaper storage devices with higher capacity

faster communications; better database management systems


What is Big Data

Wikipedia says,

“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.”

NSF Big Data Initiative (2012),

“from a scientific perspective, that scientists manage, analyze, visualize, and extract useful information from large, diverse, distributed and heterogeneous data sets so as to accelerate the progress of scientific discovery and innovation”

Wall Street Journal (04-20-2012),

“from a business perspective, an enterprise mine all the data it collects right across its operations to unlock golden nuggets of business intelligence”


Big Data References

The Wall Street Journal, Article 11-29-2012, “Big Data is on the rise, bringing big questions”

The Wall Street Journal, Article 04-29-2012, “Big Data’s big problem: little talent”

McKinsey & Company, Report 05-2011, “Big Data: The next frontier for innovation, competition, and productivity.”

Google search “Big Data” gives 0.8 billion results.


Sizes of Big Data

The sizes of modern datasets are increasing faster than ever:

Byte = 8 bits, Kilobyte = $10^{3}$ bytes, Megabyte = $10^{6}$ bytes, Gigabyte = $10^{9}$ bytes, Terabyte = $10^{12}$ bytes, Petabyte = $10^{15}$ bytes, Exabyte = $10^{18}$ bytes, Zettabyte = $10^{21}$ bytes, Yottabyte = $10^{24}$ bytes, ...

2 Megabytes: A high resolution photograph

50 Terabytes: The contents of a large MASS Storage System

2 Petabytes: All US academic research libraries

5 Exabytes: All words ever spoken by human beings
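The unit ladder on this slide can be checked with a few lines of arithmetic. The sketch below uses the decimal (power-of-ten) definitions listed above; the helper name `to_bytes` is hypothetical and not part of the course material.

```python
# Decimal byte units as listed on the slide: each step up is a factor of 10^3.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]
BYTES_PER_UNIT = {name: 10 ** (3 * (i + 1)) for i, name in enumerate(UNITS)}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes (hypothetical helper)."""
    return amount * BYTES_PER_UNIT[unit.lower()]


photo = to_bytes(2, "megabyte")      # a high resolution photograph
libraries = to_bytes(2, "petabyte")  # all US academic research libraries

# Number of 2-megabyte photographs that would occupy 2 petabytes.
print(libraries // photo)
```

Integer arithmetic is used throughout, so the ratios stay exact even at the yottabyte scale.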


Big Data Features and Challenges

Big data are characterized by

massive volume, ultra-high dimension

complex structure, heterogeneous

multiple-source, multiple-type data

Big Data Challenges:

data storage, transfer [enormous memory, parallel processing]

data search, retrieval [at a high speed]

data cleaning, organization, visualization [present/display data in a meaningful way]

data analysis [gain deep insight about data]


Examples

Wal-Mart made > 20 million transactions daily, and constructed an 11 terabyte database of customer transactions

AT&T had 100 million customers and carried on the order of 300 million calls a day on its long distance network

molecular data: DNA copy-number alteration, mRNA expression, protein expression

images, network data, tweet data, ...


Role of Data Scientists

Data Science is an interdisciplinary field:

statistics, machine learning, computer science, mathematics, pattern recognition, signal processing

Goal: to extract useful information from a flood of data, to find hidden patterns in data

feasible and fast computation. Example: the Hadoop system is software for distributed storage and processing of very large data sets on computer clusters.

develop new tools and algorithms for data analysis

provide theoretical foundations for learning algorithms

help researchers gain deeper understanding of cancer


Applications (I)

Business

Walmart data warehouse, credit card companies, bank data, stock data

Marketing

Given data on age, income, etc., predict the spending capacity of each customer

Discover relationships among customers’ spending behaviors

Recommend products (example: diaper-beer)

Genomics

Human genome projects: DNA sequences; microarray data, SNP, next-generation sequencing


Applications (II)

Information Retrieval

search engine, text mining, document search. For example, given some key words in a document, determine its topic and content (many words in a document, and many, many documents available on the web).

speech recognition, image analysis, multimedia information

Healthcare

personalized medicine

cancer classification and treatment, given gene expressions and clinical measurements.


What is the Role of Statistics

Statistical machine learning plays a central role in data mining.

provide theoretical foundations for learning algorithms

give useful tools to analyze an algorithm’s statistical properties and performance guarantees

help researchers gain deeper understanding of the approaches, design better algorithms, and select appropriate methods for a given problem


Examples of Learning Problems

Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on demographic, diet and clinical measurements (Regression and Classification)

Predict the price of a stock in 6 months, based on company performance measures and economic data (Regression)

Identify a handwritten ZIP code, from a digitized image (Classification)

Identify risk factors for prostate cancer, based on clinical, genetic and demographic variables (Feature Selection)

Identify groups of genes with similar functions from DNA microarray data (High Dimensional Clustering)

In fraud detection, what covariates are useful in building a model to predict the probability of being a fraud order? how to handle covariate correlations? (Classification Problem)


Goals of This Course

understand different types of data mining problems

supervised, unsupervised, semi-supervised learning

regression, classification, clustering, graphical models

text mining, multimedia mining, image recognition, social networks, recommendation system, network models

learn basic statistical concepts and principles (which are favored over a black-box learning algorithm)

statistical inference on uncertainty, distribution, loss, risk

model building, evaluation, selection, prediction, and optimality; bias-variance tradeoff

parameter tuning, training error, test error, cross-validation

learn statistical and machine learning methods for big data

SVM, PCA, lasso, boosting, tree, random forest

learn R software packages to analyze data

take into serious consideration scalable, parallel algorithms


What is “Special” about this course

To combine the “art” of designing good learning algorithms and the “science” of analyzing statistical properties and performance of the approaches

emphasize “statistical” principles behind the approaches and algorithms

learn how to formulate a learning problem in a “statistical” framework

understand existing techniques from a “statistical” perspective; what are the limitations and strengths? Can we do better by relaxing assumptions?


Terminologies

| | Statistics | Data Mining |
|---|---|---|
| $X$ | predictor, covariate | input, feature |
| $Y$ | response, outcome | output |
| $\{(X_i, Y_i)\}_{i=1}^{n}$ | data points, samples | examples, training data |
| $\hat{f}(X)$ | model fitting | machine learning |
| Data analysis goals | consistency, inference, confidence interval, convergence rate | prediction, speed, risk bound, learning theory |
| Practical vs theoretical concerns | Reluctant to use methods without theoretical justification (even if the justification is actually meaningless) | Willing to use ad hoc methods if they seem to work well (though appearances may be misleading) |
| Computing | more and more important | heavy |


Different Learning Problems

Typically, we collect data $(X_i, Y_i)$, $i = 1, \ldots, n$.

$Y_i$ is the outcome or response variable.

$X_i$ is the vector of input (predictor) variables.

Various Machine Learning Problems

Supervised learning (Y observed)

Unsupervised learning (Y unobserved)

Semi-supervised learning (Y partially available)

Various Statistical problems

Regression (Y quantitative, a numerical quantity)

Classification (Y qualitative; a class label)

Density estimation (no Y )
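The two taxonomies on this slide depend only on how much of Y is observed and on what kind of value Y takes. The sketch below makes that explicit. It assumes, purely for illustration, that an unobserved response is stored as `None`, that quantitative responses are stored as floats, and that class labels are stored as strings; the function names are hypothetical.

```python
def learning_setting(y):
    """Machine learning view: how much of the response Y is observed?"""
    if y is None or all(v is None for v in y):
        return "unsupervised"
    if any(v is None for v in y):
        return "semi-supervised"
    return "supervised"


def statistical_problem(y):
    """Statistical view: what kind of quantity is the observed Y?"""
    observed = [] if y is None else [v for v in y if v is not None]
    if not observed:
        return "density estimation"   # no Y at all
    if all(isinstance(v, float) for v in observed):
        return "regression"           # Y quantitative
    return "classification"           # Y qualitative (a class label)


for y in ([2.5, 0.7, 1.9], ["spam", None, "email"], None):
    print(learning_setting(y), "/", statistical_problem(y))
```

In practice class labels are often coded numerically (for example −1 and +1), so the float-versus-string convention here is only a device for the example, not a general rule.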


Supervised Learning vs Unsupervised Learning

| | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Response Y | observed | unobserved |
| Major goal | predict Y, given the observed input X | find interesting patterns in data |
| Examples | linear regression, nonparametric regression, classification | clustering, density estimation, dimension reduction |


Example 1: Email or Spam (Textbook Page 2)

Training data: 4,601 email messages with known email type.

Input X: the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message.

Outcome Y: −1 = email, +1 = spam

Goal: Design an automatic spam detector to predict whether a message is email or spam. In the future, the detector can be used to filter out junk emails before they clog users’ mailboxes.

Supervised learning, binary classification, n > p (data matrix n × p, “long and narrow”).


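Table: Average percentage of words (those showing the largest difference between spam and email)

| | george | you | your | hp | free | hpl | ! | our | re | edu | remove |
|---|---|---|---|---|---|---|---|---|---|---|---|
| spam | 0.00 | 2.26 | 1.38 | 0.02 | 0.52 | 0.01 | 0.51 | 0.51 | 0.13 | 0.01 | 0.28 |
| email | 1.27 | 1.27 | 0.44 | 0.90 | 0.07 | 0.43 | 0.11 | 0.18 | 0.42 | 0.29 | 0.01 |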

Possible classification rules:

if (george < 0.6) & (you > 1.5), then spam; otherwise email.

if (0.2 · you − 0.3 · george) > 0, then spam; otherwise email.

Two types of decision errors:

(1) false positive: classify email as spam (filter out email)

(2) false negative: classify spam as email (email box jammed)
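The two candidate rules and the two error types translate directly into code. The sketch below is only an illustration: the function names are hypothetical, and the two "messages" are simply the class averages for `george` and `you` from the table on this slide, standing in for real emails.

```python
def rule_threshold(george, you):
    """Rule 1: spam if %george < 0.6 and %you > 1.5, otherwise email."""
    return "spam" if (george < 0.6 and you > 1.5) else "email"


def rule_linear(george, you):
    """Rule 2: spam if 0.2 * %you - 0.3 * %george > 0, otherwise email."""
    return "spam" if (0.2 * you - 0.3 * george) > 0 else "email"


def error_counts(rule, messages):
    """Return (false positives, false negatives) for a rule.

    False positive: a true email classified as spam.
    False negative: a true spam classified as email.
    """
    fp = sum(1 for g, y, label in messages
             if label == "email" and rule(g, y) == "spam")
    fn = sum(1 for g, y, label in messages
             if label == "spam" and rule(g, y) == "email")
    return fp, fn


# Stand-in messages: (george %, you %, true label), taken from the class averages.
messages = [
    (0.00, 2.26, "spam"),
    (1.27, 1.27, "email"),
]

for rule in (rule_threshold, rule_linear):
    print(rule.__name__, error_counts(rule, messages))
```

Both rules classify the two class averages correctly, which says little about individual messages; on real data the two error counts would generally differ between rules, and a spam filter usually tolerates false negatives more readily than false positives.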


Example 2: Prostate Cancer

Read Textbook page 3

Training data: 97 male patients with different stages of prostate cancer.

Input X: eight clinical measures: log cancer volume (lcavol), log prostate weight (lweight), age, and the other five.

Outcome Y: the log of the level of prostate specific antigen (lpsa)

Questions of interest:

What is the relationship between lpsa and clinical measures?

Is the linear model sufficient? Nonlinear effect? Interactions?

Which clinical measures are more relevant to the prediction?
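A first look at the question "is the linear model sufficient?" is to fit a least-squares line with a single predictor and inspect how much variation it explains. The sketch below does this in plain Python. The numbers are made-up toy values chosen for illustration, NOT the prostate cancer data, and the variable and function names are hypothetical.

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def r_squared(x, y, b0, b1):
    """Fraction of the variance in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy values only (not the real measurements): a predictor and a response.
lcavol_toy = [0.0, 1.0, 2.0, 3.0]
lpsa_toy = [1.1, 2.9, 5.2, 6.8]

b0, b1 = fit_simple_linear(lcavol_toy, lpsa_toy)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("R^2:", round(r_squared(lcavol_toy, lpsa_toy, b0, b1), 3))
```

A high R² on its own does not rule out nonlinear effects or interactions; plotting residuals against each predictor is the usual next step.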


[Figure 1.1, from The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 1. Panel variables: lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45.]

Figure 1.1: Scatterplot matrix of the prostate cancer data. The first row shows the response against each of the predictors in turn. Two of the predictors, svi and gleason, are categorical.


Figure 3.9 (The Elements of Statistical Learning, Hastie, Tibshirani & Friedman, 2001, Chapter 3): Profiles of lasso coefficients, as the tuning parameter t is varied. Coefficients (vertical axis) are plotted versus the shrinkage factor $s = t / \sum_{j=1}^{p} |\hat{\beta}_j|$ (horizontal axis, from 0 to 1) for the predictors lcavol, lweight, age, lbph, svi, lcp, gleason, and pgg45. A vertical line is drawn at s = 0.5, the value chosen by cross-validation. Compare Figure 3.7 on page 7; the lasso profiles hit zero, while those for ridge do not.
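The contrast in the caption (lasso profiles reach exactly zero, ridge profiles do not) can be seen in closed form in one special case. The sketch below assumes orthonormal predictors, which is not the setting of the prostate data in the figure; in that case the lasso soft-thresholds each least-squares coefficient and ridge shrinks it proportionally (up to the scaling of the penalty). The coefficient values and the names `lasso_orthonormal` and `ridge_orthonormal` are made up for illustration.

```python
import numpy as np

def lasso_orthonormal(beta_ols, lam):
    """Soft-threshold each least-squares coefficient by lam (orthonormal design)."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def ridge_orthonormal(beta_ols, lam):
    """Shrink each least-squares coefficient proportionally (orthonormal design)."""
    return beta_ols / (1.0 + lam)

# Hypothetical least-squares coefficients, not the prostate estimates
beta_ols = np.array([0.7, 0.3, -0.1, 0.05])

for lam in (0.0, 0.08, 0.2, 0.5):
    b_lasso = lasso_orthonormal(beta_ols, lam)
    b_ridge = ridge_orthonormal(beta_ols, lam)
    # Shrinkage factor s: L1 norm of the lasso fit relative to least squares
    s = np.abs(b_lasso).sum() / np.abs(beta_ols).sum()
    print(f"lam={lam:.2f}  s={s:.2f}  lasso={b_lasso}  ridge={b_ridge}")
```

As the penalty grows, s moves from 1 toward 0 and the smaller lasso coefficients become exactly zero one by one, while every ridge coefficient stays nonzero.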


Handwritten Digit Recognition

Handwritten Digit Recognition (Textbook page 4)

Goal: identify single digits 0 to 9 based on the images.

Raw Data: images that are scaled segments from five-digit ZIP codes.

Each digit image is a 16 × 16 eight-bit grayscale map.

Pixel intensities range from 0 (black) to 255 (white).

Images are normalized to have approximately the same size and orientation.

Input X is a 16 × 16 matrix, or a 256-dimensional vector.

Output G ∈ 𝒢 = {0, 1, ..., 9}.

The error rate should be very low to avoid misdirection of mail. Some objects are assigned to a “do not know” category and sorted by hand.
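A minimal sketch of the two ideas on this slide: turning a 16 × 16 image into a 256-dimensional input, and assigning uncertain cases to a “do not know” category. The function names, the 0.9 confidence threshold, and the example probabilities are assumptions for illustration; the slides do not say how the reject rule is implemented.

```python
import numpy as np

def flatten_image(image):
    """Turn a 16 x 16 grayscale image into a 256-dimensional feature vector."""
    image = np.asarray(image, dtype=float)
    if image.shape != (16, 16):
        raise ValueError("expected a 16 x 16 image")
    return image.reshape(-1) / 255.0  # rescale intensities from [0, 255] to [0, 1]

def classify_with_reject(class_probs, threshold=0.9):
    """Return the most probable digit, or None ("do not know") if unsure."""
    class_probs = np.asarray(class_probs, dtype=float)
    best = int(np.argmax(class_probs))
    return best if class_probs[best] >= threshold else None

x = flatten_image(np.zeros((16, 16)))          # an all-black image
confident = np.array([0.01] * 9 + [0.91])      # almost all mass on digit 9
ambiguous = np.array([0.0, 0.45, 0.0, 0.0, 0.0, 0.0, 0.0, 0.55, 0.0, 0.0])  # 1 vs 7

print(x.shape)
print(classify_with_reject(confident))   # accepted as a digit
print(classify_with_reject(ambiguous))   # rejected, to be sorted by hand
```

Raising the threshold lowers the error rate on accepted digits at the cost of sending more items to manual sorting.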


Example 4: DNA Expression Microarray

DNA = deoxyribonucleic acid, a basic material making up human chromosomes.

DNA Microarray Technique: for each sample from a tissue, the expression levels (the amounts of mRNA) of thousands of genes are measured.

Training data: p = 6,830 genes (rows), n = 64 samples (columns) (cancer tumors) taken from two classes.

Input X: the level of expression for each gene

Goal: discover the relationship between gene and cancer type, or find the gene signature of each cancer subtype

Challenge: p >> n (the n × p data matrix is “short and fat”)
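To see why p >> n is a challenge, the sketch below simulates a matrix with the dimensions quoted above (random numbers, not real expression data). Its rank cannot exceed n, so least squares has infinitely many solutions, and even the minimum-norm one fits pure noise exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 64, 6830                      # sample and gene counts from the slide
X = rng.standard_normal((n, p))      # simulated stand-in for the expression matrix
y = rng.standard_normal(n)           # a response that is pure noise

rank = np.linalg.matrix_rank(X)      # bounded by min(n, p) = n

# With p > n the system is underdetermined; lstsq returns the minimum-norm solution
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
max_residual = np.max(np.abs(X @ beta - y))

print("rank of X:", rank)
print("largest training residual:", max_residual)
```

A perfect fit to noise says nothing about prediction, which is one motivation for the regularization and feature selection methods listed later in the lecture.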


How DNA Microarray Technique Works

A breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells.

1 The nucleotide sequences for a few thousand genes are printed on a glass slide;

2 A target sample and a reference sample are labeled with red and green dyes, and each is hybridized with the DNA on the slide;

3 Through fluoroscopy, the log (red/green) intensities of RNA hybridizing at each site are measured;

4 The results are a few thousand numbers, typically ranging from -6 to 6, measuring the expression level of each gene in the target relative to the reference sample (positive values indicate higher expression in the target versus the reference, and vice versa for negative values).
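Steps 3 and 4 can be sketched with made-up intensities. The slide does not state the base of the logarithm; base 2 is assumed here, and the `red` and `green` values are hypothetical.

```python
import numpy as np

# Hypothetical dye intensities for four genes
red = np.array([1600.0, 100.0, 400.0, 25.0])      # target sample
green = np.array([100.0, 100.0, 1600.0, 1600.0])  # reference sample

log_ratio = np.log2(red / green)   # base 2 is an assumption

for value in log_ratio:
    if value > 0:
        label = "higher in target"
    elif value < 0:
        label = "lower in target"
    else:
        label = "same in both"
    print(f"{value:+.1f}  {label}")
```

Under this assumption, a 16-fold excess of red over green gives a value near +4, equal intensities give 0, and a 64-fold deficit gives a value near -6, the lower end of the typical range quoted above.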


Figure 1.3 (The Elements of Statistical Learning, Hastie, Tibshirani & Friedman, 2001, Chapter 1): DNA microarray data: expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data. Only a random sample of 100 rows are shown. The display is a heat map, ranging from bright green (negative, under expressed) to bright red (positive, over expressed). Missing values are gray. The rows and columns are displayed in a randomly chosen order. Row labels are gene identifiers; column labels are tumor types (BREAST, RENAL, MELANOMA, COLON, NSCLC, LEUKEMIA, CNS, OVARIAN, PROSTATE, UNKNOWN, and a few replicate cell lines).


Microarray Data Analysis

Typical questions of interest:

Which samples are most similar to each other, in terms of their expression profiles across genes?

Which genes are most similar to each other, in terms of their expression profiles across samples?

Do certain genes show very high (or low) expression for certain cancer samples?

.......

This task could be viewed as

a supervised or an unsupervised problem

a regression or a classification problem
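One simple way to approach the first two questions is through correlations between columns (samples) or rows (genes) of the expression matrix. The sketch below uses simulated data with two hidden groups of samples; correlation is only one possible similarity measure, and the slides do not prescribe it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 200, 6

# Simulated expression matrix: genes in rows, samples in columns.
# Samples 0-2 share one underlying profile, samples 3-5 share another.
profile_a = rng.standard_normal(n_genes)
profile_b = rng.standard_normal(n_genes)
noise = 0.3 * rng.standard_normal((n_genes, n_samples))
X = np.column_stack([profile_a] * 3 + [profile_b] * 3) + noise

sample_corr = np.corrcoef(X, rowvar=False)   # similarity between samples
gene_corr = np.corrcoef(X)                   # similarity between genes

# Most similar other sample for each sample (ignore self-correlation)
np.fill_diagonal(sample_corr, -np.inf)
nearest = sample_corr.argmax(axis=1)
print(nearest)
```

No class labels are used here, so this is the unsupervised view of the task; predicting a known cancer type from the expression profile would be the supervised (classification) view.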


Statistical Problems in Data Mining

Regression (linear, nonlinear, logistic regression, GLM, graphical model)

Classification (Discriminant Analysis; LDA, QDA, SVM, trees, random forest)

Feature Selection (Variable Selection; Sparse Analysis; LASSO, screening)

Dimension Reduction (PCA; ICA; SDR)

Regularization (control model complexity, aim for parsimonious and interpretable solutions)

Clustering Analysis (k-means, association rules)

Other Variations

Semi-supervised Learning

Mixture of the above


New Challenges to Statisticians

Data Complexity: the data involve many variables, which are often related in complex (nonlinear) ways.

Feature Selection: many features are available but some are redundant, leading to the feature selection or dimension reduction problem.

Optimization: many methods involve finding the “best” parameter values by solving complex and large (containing many parameters) optimization problems. Therefore, efficient optimization techniques are required.

Visualization: much harder in the high dimensional space.

This is the so-called curse of dimensionality.
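One standard way to quantify the curse of dimensionality, beyond visualization: if data are uniform in the unit hypercube in p dimensions (an assumption made here for illustration), a sub-cube capturing a fraction r of the data must have edge length r^(1/p). The function name `edge_length` is hypothetical.

```python
def edge_length(fraction, p):
    """Edge of a sub-cube holding `fraction` of uniform data in the unit p-cube."""
    return fraction ** (1.0 / p)

# Edge length needed to capture 1% of the data as the dimension grows
for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 3))
```

In 10 dimensions, capturing 1% of the data already requires covering about 63% of the range of every variable, so “local” neighborhoods are no longer local.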
