PART I: INTRODUCTION TO STATISTICAL LEARNING
Donglin Zeng, Department of Biostatistics, University of North Carolina
Statistical Decision Theory
Definition of statistical learning
I My definition: statistical learning is a framework of statistical methods and computational algorithms that use data generated from a probability distribution, with the goal of prediction or data extraction in future applications.
I Statistical learning consists of developing
– statistical methods;
– computational algorithms.
I Statistical learning concerns
– empirical data and their randomness.
I Statistical learning aims for
– future prediction;
– understanding future data patterns.
I Hence, many scientific disciplines play roles in statistical learning: probability and statistics, computer science, data science, informatics, and subject-area applications.
Other names for statistical learning
I Machine learning, data mining
I Pattern recognition
I Supervised learning and unsupervised learning
I Data analytics or predictive analytics
Comparisons between statistical learning and statistical inference
I Traditional statistical inference focuses on understanding the distributional behavior of data.
I In statistical inference, estimation and hypothesis testing (inference) of distribution parameters are of most interest; bias, consistency, and efficiency are the main concerns.
I In statistical learning, distribution estimation is less important compared to the learning goals such as prediction and feature extraction.
I Thus, prediction accuracy is most important in statistical learning.
I Prediction rule consistency and expected risk control are of more concern in statistical learning.
I However,
– both assume data randomly generated from some underlying distribution, so both account for random behavior in their procedures;
– both rely on data-dependent objective functions for estimation and inference;
– both, more or less, involve development of statistical models for data and computational algorithms for execution;
– more specifically, supervised learning is analogous to regression, and unsupervised learning to density estimation.
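The two viewpoints can be illustrated with a short simulation: inference asks whether an estimated coefficient is close to the true parameter, while learning asks how well the fitted rule predicts held-out data. This is a hypothetical sketch (not from the slides; the simulated model and names are illustrative, and Python is used here although the course examples are in R).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)          # true slope is 2, noise variance 1

# Split into training and test samples.
x_tr, y_tr = x[:150], y[:150]
x_te, y_te = x[150:], y[150:]

# Least-squares slope estimate from the training data.
beta_hat = np.sum(x_tr * y_tr) / np.sum(x_tr ** 2)

# Inference view: is beta_hat close to the true parameter 2?
# Learning view: squared prediction error on the held-out test sample.
test_mse = np.mean((y_te - beta_hat * x_te) ** 2)
```

Under this model the test error is dominated by the irreducible noise variance (about 1), regardless of how precisely the slope is estimated.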
Challenges in modern statistical learning
I Method challenges: what kinds of methods/models enable the achievement of prediction goals?
I Data challenges: how to deal with data complexity: dimensionality, heterogeneous structure, missing data, etc.
I Algorithm challenges: what kinds of computational algorithms are suitable for estimation with such data?
I Inference challenges: how well does a learned rule perform when applied to future data?
Example 1. Email Spam Data
Example 2. Prostate Cancer Data
Example 3. Handwritten Digit Data
Example 4. DNA Expression Data
Overview of lectures on statistical learning
– I will introduce a number of statistical or machine learning methods.
– I will discuss the probabilistic and statistical theory behind learning methods.
– Computational algorithms and examples will be used throughout the lectures.
What you should know
I Many data examples and figures are taken from Hastie, Tibshirani and Friedman’s book.
I A number of R algorithms and examples are taken from a variety of publicly available web sources.
I All errors in this course are mine.
Statistical Decision Theory (Supervised Learning)
I The goal of supervised learning is to learn a prediction rule to predict the outcome given a subject’s feature variables.
I The components in supervised learning
– X: feature variables
– Y: outcome variable (continuous, categorical, ordinal)
I We assume that (X, Y) follows some distribution.
I We aim to determine a prediction rule:
f : X → Y
using available data (X1, Y1), ..., (Xn, Yn), called the training data or training sample.
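As a concrete instance of such a rule, a one-nearest-neighbor predictor can be built directly from a training sample. This is a hedged illustration (one-nearest-neighbor is my choice of example, not a method prescribed by the slides; Python is used although the course examples are in R).

```python
import numpy as np

def fit_1nn(X_train, Y_train):
    """Learn a rule f: X -> Y that returns the outcome of the
    nearest training point (scalar features, for simplicity)."""
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train)

    def f(x):
        # Index of the training feature closest to x.
        i = np.argmin(np.abs(X_train - float(x)))
        return Y_train[i]

    return f

# Training sample (X_1, Y_1), ..., (X_n, Y_n).
f = fit_1nn([0.0, 1.0, 2.0], ["a", "b", "c"])
```

Calling `f` on a new feature value returns the outcome of its nearest training neighbor, so the rule is determined entirely by the training data.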
Loss function to assess a prediction rule
I A loss function quantifies, for a given rule f and a specific subject with (X, Y), the loss incurred due to imprecise prediction.
I General notation: L(y, x; f), but usually it is defined based on a certain metric between y and f(x). For the latter, we write L(y, f(x)).
I Examples of typical loss functions
– squared loss: L(y, f) = (y − f)²
– absolute deviation loss: L(y, f) = |y − f|
– Huber loss: L(y, f) = (y − f)² I(|y − f| < δ) + (2δ|y − f| − δ²) I(|y − f| ≥ δ)
– zero-one loss: L(y, f) = I(y ≠ f)
– preference loss: L(y1, y2, f1, f2) = 1 − I(y1 < y2, f1 < f2)
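The losses above translate directly into code. A minimal sketch (Python used for illustration, although the course examples are in R; `delta` is the Huber threshold δ from the slide):

```python
import numpy as np

def squared_loss(y, f):
    # (y - f)^2
    return (y - f) ** 2

def absolute_loss(y, f):
    # |y - f|
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    # Quadratic inside the threshold, linear outside; the two pieces
    # agree at |y - f| = delta, so the loss is continuous there.
    r = np.abs(y - f)
    return np.where(r < delta, (y - f) ** 2, 2 * delta * r - delta ** 2)

def zero_one_loss(y, f):
    # I(y != f), for categorical outcomes
    return np.asarray(y != f, dtype=float)
```

Note that the Huber loss behaves like squared loss for small residuals but grows only linearly for large ones, which makes it less sensitive to outliers.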
Plot of loss function
[Figure: the loss functions plotted against x over [−2, 2], vertical axis 0–4]
Statistical framework for supervised learning
I Feature variables: X; outcome: Y.
I Loss function: L(y, f(x)).
I The goal is to find the optimal prediction f∗ to minimize the expected prediction error:
EPE(f ) = E [L(Y, f (X))] .
I Training data: (X1,Y1), ..., (Xn,Yn).
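When the distribution of (X, Y) is known (or can be simulated from), EPE(f) can be approximated by Monte Carlo: draw many pairs and average the loss. This is a hedged sketch under an assumed illustrative distribution (Python used here; the course examples are in R).

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100_000

# Illustrative assumption: X ~ N(0, 1) and Y = X^2 + noise.
X = rng.normal(size=m)
Y = X ** 2 + rng.normal(scale=0.5, size=m)

f = lambda x: x ** 2                    # candidate prediction rule

# Monte Carlo estimate of EPE(f) = E[L(Y, f(X))] under squared loss.
epe_hat = np.mean((Y - f(X)) ** 2)      # ~ Var(noise) = 0.25 here
```

Because f here matches the true regression function, the estimated EPE approaches the irreducible noise variance; any other rule would give a larger value.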