PART I: INTRODUCTION TO STATISTICAL LEARNING
Donglin Zeng, Department of Biostatistics, University of North Carolina

Introduction
My definition of statistical learning
- Statistical learning is a framework that combines statistical reasoning and computing methods to develop analytic and predictive tools from available data, with the goals of extracting information from data and predicting future outcomes.
Statistical reasoning
- Statistical reasoning emphasizes that
  - data always contain random errors and are assumed to arise from some generating mechanism that follows a probability law;
  - available data (the training data or training sample) are only a random copy of future applications and may or may not contain bias, so a method that fits the training data perfectly is expected to fit future data badly.
- Statistical reasoning focuses on developing methods that are generalizable and statistically optimal under certain randomness assumptions.
- Statistical reasoning looks for the answer to the data MECHANISM.
Computing method
- The computing method emphasizes
  - a single well-defined objective function to optimize that aligns with the goal of future applications;
  - an effective and efficient optimization algorithm, while accounting for the constraints needed to prevent overfitting and to be resilient to potential bias.
- The computing method focuses on developing algorithms that solve an optimization problem efficiently and produce results useful in practice.
- The computing method looks for the answer to data UTILITY.
- Statistical learning includes the development of both statistical reasoning methods and computational algorithms, as well as analytic and predictive tools (software development).
- Usually, the two main goals of statistical learning are
  - future prediction (supervised learning);
  - data pattern analysis (unsupervised learning).
- Disciplines involved in modern statistical learning: probability and statistics, computer science, data science (?), informatics, and all subject-matter applications.
- Other names used: machine learning, data mining, pattern recognition, data analytics, predictive analytics.
How does it differ from classical statistical inference?
- Traditional statistical inference focuses on statistical reasoning, i.e., on understanding the distributional behavior of the data.
- In statistical inference, estimation and hypothesis testing (inference) of distribution parameters are of most interest; statistical properties such as unbiasedness, consistency, and efficiency are the main concerns.
- In statistical learning, the distribution is the backbone, but its estimation is of less concern compared with learning goals such as prediction and feature extraction.
- Most often, prediction accuracy is the most important quantity in statistical learning; thus, prediction rule consistency and risk bounds are widely studied in the statistical learning literature.
Statistical learning and statistical inference share the same reasoning philosophy
- Both assume data are randomly generated from some underlying distribution, so both account for random behavior in their procedures.
- Both rely on data-dependent objective functions for estimation and inference.
- Both, directly or indirectly, involve statistical models for the data and computational algorithms for execution.
- Specifically, supervised learning is analogous to regression, and unsupervised learning to density estimation.
Challenges in statistical learning
- Method: which methods achieve the prediction goals?
- Data: how do we handle data complexity (dimensionality, heterogeneous structure, missing data, etc.) relevant to those goals?
- Algorithm: which computational algorithms are feasible and efficient for computing predictions?
- Inference: accounting for data randomness, what assurance do we have about performance on future data or applications?
Example 1. Email Spam Data
Example 2. Prostate Cancer Data
Example 3. Handwritten Digit Data
Example 4. DNA Expression Data
Overview of my lectures
- I will introduce a number of statistical or machine learning methods.
- I will discuss the probabilistic and statistical theory behind learning methods.
- Computational algorithms and examples will be used throughout the lectures.
What you should know
- Many data examples and figures are taken from Hastie, Tibshirani and Friedman’s book.
- A number of R algorithms and examples are taken from a variety of publicly available web sources.
- All errors in this course are mine.
Statistical Decision Theory
Basic set-up for Supervised Learning
- The goal of supervised learning is to learn a rule that predicts an outcome from feature variables.
- The variable components in supervised learning are
  - X: feature variables (continuous, categorical, structured, or unstructured) with domain X;
  - Y: outcome variable (continuous, categorical, ordinal, or even functional) with domain Y.
- We assume that (X, Y) follows a probability distribution P(X, Y), abbreviated as P.
- A prediction rule is a function from X to Y: y = f(x).
- Supervised learning learns a prediction rule from training data (X_1, Y_1), ..., (X_n, Y_n); a small simulated sketch follows below.
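A minimal sketch of this set-up in R, assuming a simple simulated joint law P for (X, Y); the particular distribution, the rule f, and all names below are illustrative rather than from the lecture:

```r
## Training pairs (X_1, Y_1), ..., (X_n, Y_n) drawn from an assumed
## joint law P, and one candidate prediction rule y = f(x).
set.seed(1)
n <- 100
x <- runif(n, -1, 1)                    # features X with domain [-1, 1]
y <- sin(pi * x) + rnorm(n, sd = 0.2)   # outcomes Y, so (X, Y) ~ P

f <- function(x) sin(pi * x)            # a prediction rule mapping X to Y
head(cbind(x, y, prediction = f(x)))    # training pairs and f's predictions
```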
Loss function for inaccurate prediction
- A loss function quantifies the error incurred due to the imprecision of a prediction rule.
- It is a map from Y × X × {f : f is a prediction rule} to [0, ∞),
  (y, x, f) → L(y, x, f),
  such that L(y, x, f) = 0 if y = f(x).
- For most applications, L(y, x, f) = L(y, f(x)), i.e., some distance measuring how far the predicted value f(x) is from the outcome value y. We can also use an outcome-feature-dependent metric, for example w(y, x)L(y, f(x)), where w(y, x) is a non-negative weight function.
- Specifying a loss function depends on (a) the goal of the task and (b) the desired statistical and computational properties of the loss itself.
Examples of loss functions
- Continuous Y
  - squared loss: L(y, f(x)) = (y − f(x))^2 — most commonly used, strictly convex.
  - absolute deviation loss (L1-loss): L(y, f(x)) = |y − f(x)| — convex, more robust.
  - more generally, Huber loss: L(y, f(x)) = (y − f(x))^2 if |y − f(x)| < δ, and 2δ|y − f(x)| − δ^2 otherwise.
- Categorical Y
  - zero-one (0-1) loss: L(y, f(x)) = I(y ≠ f(x)).
  - when Y is binary and labelled 1 vs. −1 (commonly used in supervised learning), L(y, f(x)) = I(yf(x) ≤ 0).
  - weighted zero-one loss: L(y, f(x)) = w(y)I(y ≠ f(x)) (applications include cancer diagnosis).
- Ordinal Y: Y takes values y_1 < y_2 < ... < y_k
  - we can still use the zero-one loss by treating Y as categorical,
  - or we can use the squared loss L(y, f(x)) = (s_y − f(x))^2, where s_y is a score assigned to the value y.
  - an interesting loss function is the preference loss, defined for a pair of data points: L(y_1, f(x_1), y_2, f(x_2)) = 1 − I(y_1 < y_2, f(x_1) < f(x_2)).
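The continuous and categorical losses above are simple to write down directly. Here is a minimal R sketch (not from the lecture); the function names, the default δ = 1, and the assumption that w is a user-supplied weight function are all illustrative:

```r
## Continuous-outcome losses: y and fx = f(x) are numeric vectors.
squared_loss  <- function(y, fx) (y - fx)^2
absolute_loss <- function(y, fx) abs(y - fx)

## Huber loss: quadratic for small residuals, linear beyond delta.
huber_loss <- function(y, fx, delta = 1) {
  r <- abs(y - fx)
  ifelse(r < delta, r^2, 2 * delta * r - delta^2)
}

## Categorical-outcome losses; w is a weight function w(y).
zero_one_loss <- function(y, fx) as.numeric(y != fx)
weighted_zero_one_loss <- function(y, fx, w) w(y) * as.numeric(y != fx)

## Small check on a grid of predicted values with true outcome y = 0.
fx <- seq(-2, 2, by = 1)
cbind(fx,
      squared  = squared_loss(0, fx),
      absolute = absolute_loss(0, fx),
      huber    = huber_loss(0, fx))
```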
Plot of loss functions for continuous Y
[Figure not reproduced in the transcript.]
Risk function
- Since (Y, X) comes from a distribution P, we define the risk function of any prediction rule f as the expectation of the loss function:
  R(f) = E_P[L(Y, f(X))].
- R(f) equals the average prediction loss (error) incurred if we apply the prediction rule f to the population where (Y, X) follows the probability law P.
- The optimal prediction rule, denoted f*(x), is a rule minimizing the risk function:
  f* = argmin_f R(f).
- Certainly, there are many other ways to summarize the error of a prediction rule f (median risk, minimax), but the expectation is the most common in statistical learning.
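As a concrete illustration (not from the lecture), the expectation defining R(f) can be approximated by Monte Carlo whenever we can simulate from an assumed law P; the linear model and the two rules f1 and f2 below are hypothetical:

```r
## Approximate R(f) = E_P[L(Y, f(X))] by averaging the loss over a large
## sample from an assumed joint law: X ~ N(0, 1), Y = 2X + N(0, 1).
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- 2 * x + rnorm(n)

squared_loss <- function(y, fx) (y - fx)^2
f1 <- function(x) 2 * x   # close to the optimal rule E[Y | X = x]
f2 <- function(x) 0 * x   # naive rule that always predicts 0

mean(squared_loss(y, f1(x)))  # ~1: only the irreducible noise variance remains
mean(squared_loss(y, f2(x)))  # ~5: Var(2X) + noise variance = 4 + 1
```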
Specific goals for supervised learning
- Given a loss function L(y, f(x)), what is the optimal prediction rule f*?
- With training data (X_1, Y_1), ..., (X_n, Y_n), how can we estimate or learn the optimal prediction rule?
- What computing algorithms can be used?
- How do we evaluate the performance of the learned prediction rule?
- What are the statistical properties of the estimated rule?
Compared to classical maximum likelihood estimation
- The goal is to learn the underlying distribution P, assumed to come from a probability model P_θ.
- Thus, the prediction rule f is equivalent to θ.
- The loss function becomes the negative log-likelihood function.
- The maximum likelihood estimator based on the training data is used to estimate θ.
- Optimization algorithms can be used to compute the MLE; with missing data, we often use the EM algorithm.
- Under some assumptions, the MLE is consistent, asymptotically normal, and efficient.
- In some sense, classical statistical inference is a special type of statistical learning, but the goal is to learn distribution parameters instead of a prediction rule; a small sketch follows below.
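A minimal sketch of this correspondence in R (not from the lecture): the negative log-likelihood plays the role of the loss, and the MLE is computed by numerical optimization for a hypothetical normal model Y ~ N(μ, σ²); the true values μ = 3, σ = 2 are illustrative:

```r
## Training data from the assumed model.
set.seed(1)
y <- rnorm(200, mean = 3, sd = 2)

## Negative log-likelihood as the loss; theta = (mu, log sigma), with
## sigma on the log scale so the optimizer keeps it positive.
negloglik <- function(theta, y) {
  mu <- theta[1]
  sigma <- exp(theta[2])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

## The MLE minimizes the empirical loss over the training data.
fit <- optim(c(0, 0), negloglik, y = y)
c(mu_hat = fit$par[1], sigma_hat = exp(fit$par[2]))  # close to (3, 2)
```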