Live Lecture 1: Data Analysis (CS5100J)
TRANSCRIPT
IMPORTANT: The Moodle page of this module
https://moodle.royalholloway.ac.uk/
course/view.php?id=9088
contains all the information you need for this module.
If you still cannot access it,
email [email protected]
Module Activities
• Pre-recorded (asynchronous) lectures and revision exercises
– posted on Moodle every Saturday
– watch the video before coming to the live lecture
• Live (synchronous) lectures
– 4–5pm every Thursday (starting from 14 Jan)
• Q&A sessions
– 9–10am and 11am–12noon every Thursday (starting from 14 Jan)
– Q&A is optional; “you ask, I answer”
Module Activities
• 8 lab sessions
– 5–6pm every Thursday (starting from 21 Jan, until 11 March)
• 7 quizzes
– almost every Thursday (the first quiz opens on 21 Jan, the final quiz opens on 11 March;
no quiz opens on 4 Feb)
– only one attempt per quiz is allowed; time limit: 30 minutes
– you need to complete each quiz within a one-week timeframe
(e.g. the first quiz opens at 5pm on 21 Jan and closes at 6pm on 28 Jan)
• Labs and quizzes account for 16% of your final grade
Evaluations
• 3 Homework Assignments
– on 11 Feb, 25 Feb, 11 March
– you have two weeks to complete each assignment
– account for 24% of your total grade
• Examination (arrangements announced later)
– accounts for 60% of your total grade
Evaluations
• This module will use quite a bit of mathematics
(linear algebra, probability, set theory).
Go to the Moodle page to find
“Pre-sessional Mathematics”,
which contains handouts covering the
relevant mathematical background.
Module Activities
• Pre-recorded lectures cover most (if not all) of the essential
material for this module.
• Watch the pre-recorded lecture before coming to the live lecture.
• Live lectures are meant to be lively.
– Do ask questions! (And I will ask you questions.)
– Have discussions. (I know it is difficult now.)
– I will present more examples. (+ experiences of my friends working in industry)
• Ultimately, live lectures reinforce your learning.
Live Lectures
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
Google is like an elephant. Most employees see a tiny fraction of the
elephant, then use their “expert knowledge” to make that part work
and improve. Only a few architects have an overview of how the
whole thing works.
— Martin, Google engineer
Data Science
The “Big Names”
From: Boss
Content: I want to understand how
salaries affect the spending habits of
people.
Informally, data science/mining refers to the whole
process of gaining insight/information from data.
From: Employee
Content: Here you go.
Attachment: spending habit data.csv
(535MB)
The “Fractions”
Data science/mining refers to the whole process of gaining
insight/information from data. There are many sub-processes, including:
• Data collection (what data to collect? how to collect it? at what cost?)
• Data validation/cleaning (clean data is valuable and expensive)
• Data infrastructure (how to store thousands of TB of data? how to ensure efficient retrieval by
hundreds of users? which hardware and software are needed?)
• Data Analysis (which hypotheses to test? what model to use? which algorithm to use
for various types of data? is it efficient? how long does it take?)
• Information Representation (data visualization, reports, products)
(The algorithm is the key component of “Machine Learning”.)
The “Fractions”
A friend-of-a-friend of mine was employed by a car
manufacturing company.
His task: use machine learning and computer vision to
improve the car assembly process.
(Source: “RTV” YouTube channel, screen capture from video W8vd2ulZBGs)
Friend-of-a-friend: What data is available?
Manager: What data do you want to collect?
Data Collection
Without data, you can do no analysis.
Finding (Open-Source) data:
kaggle.com
datasetsearch.research.google.com
Football: www.whoscored.com, www.sofascore.com
Paper Reviews: openreview.net
Data Collection
“What data do you want to collect?”
• Collecting data takes time and money. Many open-source or
free datasets are either small or not good enough.
• Bureaucracy. Some data needs the cooperation of several
departments in the same company to be collected.
• Privacy concerns. Are the collection and storage of some
specific data illegal?
Data Collection
“What data do you want to collect?”
• Some data is too costly or even impossible to collect.
(E.g. drug dealers’ revenues; see Levitt’s book “Freakonomics”.)
• Some data is too sensitive, so respondents persistently lie during
collection. (E.g. history of committing crimes.)
Data Collection
“What data do you want to collect?”
• Poor design of the collection protocol leads to erratic data.
(E.g. asking the weight of a person: unit, range...)
• Ambiguity in the definition / specification.
(E.g. in football, for a “long-ball pass”, what counts as “long”? how to
distinguish between a “pass” and a “clearance”?)
Indeed, the main challenges in this stage
are mostly un-computer-scientific
(of course, good software design can help).
Data Collection
Indeed, the main challenges in data collection
are mostly un-computer-scientific.
The same holds for data validation / cleaning.
(Although nowadays engineers design algorithms to validate/clean data.)
Data Validation/Cleaning/Infrastructure
After this, we want to store (big) data in servers / data
centers in an organized way, to ensure reliable, efficient and
convenient retrieval by end-users (nowadays under the
umbrella of “data scientists”). (This is traditionally closer to “Database Systems”.)
Source:
https://xkcd.com/1838/
Rules of Thumb:
• No one is going to tell you which algorithm
does the best, because “the best” might
not even exist.
• In practice, you need to do a lot of
“experiments” to guide yourself towards a
good enough solution.
• When you handle huge amounts of data,
the speed and memory requirements of an algorithm become critical concerns.
Statistics & Data Analysis
• Statistics is a centuries-old subject; it started long before computers
were born. (To be clear, I am no expert in statistics.)
• For one-variable data, the mean, variance/SD and median are some natural
statistical measures you learnt in high school.
• If the data has two or more variables, we may want to find out their
relations. Linear regression (LR) is one of the very first algorithms
(Legendre and Gauss, in the 1800s).
Let’s go back to a pre-computer world
for the next few slides...
Statistics
• Least squares is the most popular version of linear regression. Let’s
recall how it is taught in high school or STAT001.
[Figure: a scatter plot of points (x, y) with a fitted straight line y = mx + c.]
Linear Regression
i       x_i     y_i
1       11      50
2       7       30
3       20      88
4       2       14
...     ...     ...
100000  4       26

slope m = ( n Σ_{i=1}^n x_i y_i − (Σ_{i=1}^n x_i)(Σ_{i=1}^n y_i) ) / ( n Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)² )

y-intercept c = ( (Σ_{i=1}^n y_i)(Σ_{i=1}^n x_i²) − (Σ_{i=1}^n x_i)(Σ_{i=1}^n x_i y_i) ) / ( n Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)² )
Linear Regression
Before explaining how this formula/algorithm arises, note that it is
simple to implement. Anyone who can do middle-school math can handle it.
When n is huge, even in the 1800s it allowed a simple division of labour
(in modern terminology, parallel computing)!
Linear Regression
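As an illustration of how simple the formula is to implement, here is a minimal sketch in Python (my own illustration, not part of the lecture; the data lists stand in for the table's columns):

def least_squares_line(xs, ys):
    # Returns (m, c) minimizing sum((y_i - m*x_i - c)^2),
    # using the slope/intercept formulas above.
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2          # shared denominator
    m = (n * sum_xy - sum_x * sum_y) / denom
    c = (sum_y * sum_xx - sum_x * sum_xy) / denom
    return m, c

m, c = least_squares_line([11, 7, 20, 2], [50, 30, 88, 14])

The four sums are exactly what the 19th-century “division of labour” computed: each worker sums a chunk of the table, and the chunk totals are added at the end.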
But... why are this slope and y-intercept chosen?
Linear Regression
Theorem. The slope m and y-intercept c given above are the minimizer of the function
f(m, c) = Σ_{i=1}^n (y_i − m x_i − c)².
(This is called the “Mean Squared Error”, which we will discuss more later in this module.)
Proof. Rewrite the function f as
f(m, c) = (Σ_{i=1}^n x_i²) m² + n c² + 2 (Σ_{i=1}^n x_i) mc − 2 (Σ_{i=1}^n x_i y_i) m − 2 (Σ_{i=1}^n y_i) c + (Σ_{i=1}^n y_i²).
As we are finding the m, c that minimize the function, we take partial derivatives:
∂f/∂m = 2 (Σ_{i=1}^n x_i²) m + 2 (Σ_{i=1}^n x_i) c − 2 (Σ_{i=1}^n x_i y_i)
∂f/∂c = 2 n c + 2 (Σ_{i=1}^n x_i) m − 2 (Σ_{i=1}^n y_i)
Note that ∂f/∂m = ∂f/∂c = 0 is a linear system in the variables m, c. Solving it gives the formulas.
Linear Regression
[Figure: the fitted line with green bars marking the vertical residuals.]
Questions:
Why not minimize Σ (length of green bar)⁴?
Why not minimize Σ |length of green bar|?
Why not minimize Σ (perpendicular distance)²?
An answer: Nothing prohibits you from doing so, except...
Which algorithms find these minimums?
(In particular, is there any algorithm for the pre-computer era?)
How long do these algorithms take
(when n = 10^6, say)?
Linear Regression
Questions:
Why not ignore the “outliers”?
Is “linear” really the right model?
An answer: It really depends on the properties of the data.
There is no “universal” answer to these questions.
How do we design algorithms that ignore “outliers”?
Linear Regression
• There are similarities between statistics and data analysis.
With powerful computers, modern data analysis can afford
huge and sophisticated data, models and algorithms.
• In both the pre-computer and modern eras, the speed and memory
requirements of an algorithm are a major concern.
• Least-squares linear regression appears to have earned its
fame due to its simple, quick and parallelizable algorithm.
It can also be implemented in an online fashion using a small amount of memory.
Statistics and Data Analysis –
Summary
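To make the “online fashion” claim above concrete, here is a minimal sketch (my own illustration, not the module's code): the fit needs only five running totals, so each data point can be consumed and then discarded, and memory use stays constant.

class OnlineLeastSquares:
    # Maintains the five running totals needed by the least-squares formulas.
    def __init__(self):
        self.n = self.sum_x = self.sum_y = self.sum_xy = self.sum_xx = 0.0

    def update(self, x, y):
        # Consume one data point; nothing else needs to be stored.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_xx += x * x

    def line(self):
        # Plug the running totals into the slope/intercept formulas.
        denom = self.n * self.sum_xx - self.sum_x ** 2
        m = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        c = (self.sum_y * self.sum_xx - self.sum_x * self.sum_xy) / denom
        return m, c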
• There is no “universally best” algorithm. You need to decide
which algorithm to use based on
– your judgement on the properties of data;
– the (fuzzy) results of experiments;
– what is the upper limit of data it can handle (under time
and memory constraints)? (scalability)
• In this module, you will see a number of algorithms.
You learn them not only because you want to strictly follow them,
but also because you want to gain inspiration from them, so as to
make yourselves adaptable in your future career.
Statistics and Data Analysis –
Summary
Not a fish, but to fish.
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
• Attributes (input variables) X1, X2, . . .
• Labels (output variable) Y
• Quantitative / Numerical variables
(age/height of a person, spending on a category)
• Qualitative variables (animals: cat/dog/fox)
Types of Variables
• Problems with a quantitative label:
regression problems
• Problems with a qualitative label:
classification problems
(later in this module, we may see how to turn a qualitative label into a quantitative one via a probability space)
Types of Variables
Label: Y Attribute: X = (X1, X2, . . . , Xp)
Regression Model
Y = f(X) + ε
f is the model: the function you want to learn.
ε is the error term; it may be due to:
randomness, measurement error, the effect of hidden variables...
(more data helps reduce random or measurement errors)
Problem: There are infinitely many choices of f.
For efficient learning, we must constrain f in a
reasonable way. This is a choice we need to make.
Regression Model
• We shall see: sometimes it is an “art” to choose the model f .
• How to choose f also depends on purpose.
• Prediction: The user does not care about f; they just care about the accuracy
of its predictions. In this case, f can be very complicated (e.g. a neural
network).
Example: Take a photo of size 1024×1024 as input, and tell which
people (among your friends and family) are in the photo.
Representation of f :
Prediction vs. Inference
• Inference: The user really wants to know f:
– when a social scientist wants to know the relation between salary and health
– when medical doctors want to know the effectiveness of different vaccines on different age groups
– when a physicist wants to know the relation between the temperature of a compound material and its conductivity
Thus, f cannot be very complicated, and it should be reasonable
and intuitive. (In contrast, a huge neural network might not have a
clear intuition behind it.)
Representation of f :
Prediction vs. Inference
On whoscored.com and sofascore.com, there is team performance data.
Let’s do an example:
X1: possession percent
X2: pass percent
X3: dribbles per game
X = (X1, X2, X3)
Y : points per game
We use the data from season 2019/2020.
Y = f (X) + ε
Example: Premier League Data
Important Warning: In this example, there will be
terminologies and methods
I won’t explain.
The purpose of this example is
to demonstrate the components
of a data analysis process.
Example: Premier League Data
Y = f(X) + ε
What f should we choose? Let’s try a first candidate:
f(X) = a X1 + b X2 + c X3 + d
Now f can be determined by the
four parameters a, b, c, d. We say
f is a parametrized function.
Example: Premier League Data
We use least-squares LR, i.e., we find the a, b, c, d that minimize the following error function:
Σ_i (Y_i − a X_{i1} − b X_{i2} − c X_{i3} − d)².
We get a = 0.1112, b = −0.0714, c = 0.0410, d = 1.0371.
Example: Premier League Data
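For concreteness, here is one way such a fit could be computed with NumPy (a hedged sketch; the array values below are hypothetical placeholders, not the real whoscored.com numbers):

import numpy as np

# Hypothetical placeholder data: one row per team,
# columns = [possession %, pass %, dribbles per game].
X = np.array([[57.1, 82.4, 10.2],
              [45.3, 71.0,  7.8],
              [52.0, 78.5,  9.1]])
Y = np.array([2.36, 1.05, 1.42])   # points per game

# Append a column of ones so that d is fitted as the constant term.
A = np.hstack([X, np.ones((X.shape[0], 1))])
(a, b, c, d), *_ = np.linalg.lstsq(A, Y, rcond=None)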
How good is this fit? When there are two or more attributes, it is difficult to plot a
graph to visualize the result. What can we do to evaluate it?
Example: Premier League Data
For a small amount of data, we can look
at the “residuals”: the
differences between Y_i and the predicted value
a X_{i1} + b X_{i2} + c X_{i3} + d.
The fit seems not very good (in the slide’s table, red boxes mark the large residuals).
Example: Premier League Data
When there is a lot of data, listing residuals is not feasible.
We can use a standard measure called R² for evaluation.
R² = 0 is a bad fit; R² = 1 is a perfect fit.
(I do not explain how R² is calculated. If you are really interested,
look up “Coefficient of Determination” on Wikipedia.)
Example: Premier League Data
In this case, R² = 0.626, which is okay but not very good.
(Whether a given R² is good or not depends on the “norm” in different areas;
in social science, data tends to be fuzzier, so R² > 0.5 is considered very good;
in some areas of physics, it is quite common that R² > 0.9.)
Example: Premier League Data
We have done evaluations via residuals and R².
But after deriving f from this data,
what we really want is to use this f
to make predictions on other data.
So let’s see how this f performs on the data
from the current 2020/2021 season.
Example: Premier League Data
A reflection on why the fit is not good:
• The truth might just be that “there is no strong relation between the
attributes and the labels”.
• X1, X2, X3 are all “attacking” attributes. There are teams which rely
more on defence to earn many points (e.g. Jose Mourinho’s Tottenham this season, Sheffield United last
season), while there are other teams which play an attacking style but defend poorly
and earn few points (e.g. Brighton this season, Norwich last season).
X1: possession percent
X2: pass percent
X3: dribbles per game
Y : points per game
Example: Premier League Data
A reflection on why the fit is not good:
• The trend changes from season to season. A reasonably nice fit for last
season can do very badly on this season’s data.
Example: Premier League Data
How to gain better insights from data:
• I cherry-picked the “attacking” attributes. Are there any “defending” attributes
available? (YES!)
• Some teams earn many points by focusing on defence, other teams by focusing on attack.
How can we modify f to capture this observation? One idea:
f = max { f1(attack attributes), f2(defend attributes) }
But this is not a linear function, so the least-squares linear regression algorithm does not
apply; we might need a more sophisticated (and less efficient) algorithm.
Example: Premier League Data
Training & Test Data
• In the Premier League example, we used some data to derive a
function f. Such data is called training data or the training set.
• In general, we do not care how well the method works on the training
data. We are interested in the accuracy of prediction on some other
data, called test data.
• In the Premier League example,
training data = data from the last (completed) season
test data = data from this (ongoing) season
• We want to measure the quality of fit. In the regression setting, a
common measure is the MSE:
(1/n) Σ_{i=1}^n (y_i − f(x_i))²,
where {(x_i, y_i)}_{1≤i≤n} can be either the training data (then it is called
the training MSE) or the test data (the test MSE).
• Note that the test MSE depends on the test data, so it is possible that
it is small for one test dataset but large for another.
Mean Squared Error (MSE)
• In the classification setting (qualitative label; e.g. a photo
as the attribute, label cat/dog/fox), a common measure for
the quality of fit is the error rate:
(1/n) Σ_{i=1}^n I(y_i ≠ f(x_i))
(here I(·) is 1 if its argument holds and 0 otherwise).
Again, {(x_i, y_i)}_{1≤i≤n} can be either training data (training
error rate) or test data (test error rate).
Error Rate
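As a small illustration (my own sketch, assuming the labels and predictions are plain Python lists), both measures can be computed directly from their definitions:

def mse(ys, preds):
    # Mean squared error: (1/n) * sum of (y_i - f(x_i))^2.
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def error_rate(ys, preds):
    # Fraction of points whose predicted class differs from the label.
    return sum(1 for y, p in zip(ys, preds) if y != p) / len(ys)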
We have learnt these terminologies:
• attributes, labels
• quantitative vs. qualitative variables
• regression problems, classification problems
• prediction vs. inference
• model f , parametrized function
• training data, test data
• mean squared error (for regression)
• error rate (for classification)
Summary
• A “toy” example of Premier League:
– Which training data to use? (Just “attacking” attributes, or both attacking &
defending attributes?)
– What parametrization of f? (linear function)
– What algorithm? (least-square linear regression)
Summary
• A “toy” example of Premier League:
– Evaluation on the quality of fit. (residuals, R2, against test data of current
season, MSE)
– Reflect on why the fitting is poor. (bias to “attack” attributes)
– Another choice of parametrization of f . (linear with cutoffs? polynomial
regression?)
Summary
[Workflow diagram:]
1. Collect and organize data.
2. Choose a parametrization of f (e.g. linear).
3. Choose an error function to minimize (e.g. MSE).
4. Run the algorithm to obtain a concrete f.
5. Evaluate (e.g. residuals, R², against test data).
6. Reflect; seek improvements: collect more data? use different data?
choose another reasonable parametrization of f? choose a different
error function?
Summary
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
• We will look into f’s and algorithms that involve a lot of mathematics
in this module.
• In this first lecture, however, let’s look at an “intuitive” family:
decision trees.
[Example decision tree: the root splits on age (age ≤ 22, 23 ≤ age ≤ 31,
32 ≤ age ≤ 40); two of the branches split further on an attribute ab
(60 ≤ ab ≤ 79 vs ab ≥ 80, and 65 ≤ ab ≤ 78 vs ab ≥ 79); the five leaves
predict salary = 5, 20, 15, 30, 25.]
Decision Tree
• Decision trees are easy to interpret (even for people who are very bad
at math).
• But they are typically not very good in predictive accuracy
(i.e. they have a high error rate / MSE).
• Predictive accuracy can be improved by using a more
complicated decision tree, but the interpretability gets lower.
Decision Tree
• Recall the attribute X = (X1, X2, . . . , Xp).
• In the simplest form of a decision tree, at each node we pick one
Xk, then follow a TRUE statement about Xk to move to the next node.
[Example tree: the root splits on X1 into X1 ≤ 64, 64 < X1 ≤ 535 and
X1 > 535; another node splits on X4 = female vs X4 = male.]
Decision Tree
• It is also possible to have a condition on more than one Xk.
[Example tree: a node splits on X1 + X3 into X1 + X3 ≤ 64,
64 < X1 + X3 ≤ 535 and X1 + X3 > 535.]
Decision Tree
Question:
How can we use an algorithm to
generate a decision tree from data?
An answer:
Divide the space containing all possible X = (X1, . . . , Xp)
into non-overlapping regions R1, . . . , Rj.
For all points in the same region, make the same prediction,
which is the mean of the labels of these points.
Decision Tree
A division (decision tree) is good if the residual sum of squares (RSS)
of each region Rj is small, where
RSS(Rj) := Σ_{i∈Rj} (y_i − ȳ_{Rj})²
and ȳ_{Rj} := (1/n_j) Σ_{i∈Rj} y_i is the mean of all labels in Rj.
Decision Tree (Regression)
ȳ_{Rj} is the predicted label of region Rj.
But there is one problem...
Decision Tree (Regression)
If we make the division such that
there is only one point in each region, then every region’s RSS is 0:
a perfect fit on the training data, but not a useful tree.
We want to design an efficient algorithm that outputs
a decision tree which strikes a balance between:
(i) the number of regions should be kept small;
(ii) the RSS in each region is reasonably small.
Decision Tree
[Figure: the (X1, X2) plane divided into rectangular regions R1, R2, R3, R4.]
As in the earlier example, to have an efficient
algorithm and an interpretable decision tree, we
cannot afford an arbitrary division into regions.
In the next algorithm, every time we split an
existing region into two via the constraints
Xk ≤ s and Xk > s. Thus, each region must
be a rectangular box.
Decision Tree
1. Initially, there is one region, which contains all data points.
2. Select one existing region, denote it by R#.
(a) Let R1(k, s) = {X | X ∈ R#, Xk ≤ s} and R2(k, s) = {X | X ∈ R#, Xk > s}.
(b) Find the k, s such that the RSS of R1(k, s) plus the RSS of R2(k, s) is the smallest.
(c) Replace region R# by the two regions R1(k, s) and R2(k, s).
3. Repeat Step 2 until a certain condition is met.
4. For each region Rj, set ȳ_{Rj} to be its predicted label.
Decision Tree (Regression):
Recursive Binary Splitting
RSS of a region Rj is Σ_{i∈Rj} (y_i − ȳ_{Rj})², where ȳ_{Rj} := (1/n_j) Σ_{i∈Rj} y_i is the mean of all labels in Rj.
The above is just a visualization.
For X = (X1, X2, . . . , Xp) with p ≥ 3,
visualization is difficult.
Warning: Visualization is for you to gain a first
understanding; don’t rely on it too much.
What you really do is write a program.
How to implement the algorithm efficiently?
Decision Tree
select a region, denoted by R#
cur_best_rss := +∞
for k = 1, 2, . . . , p
    for each s such that there exists X ∈ R# with Xk = s
        r1 := RSS of the points in {X | X ∈ R#, Xk ≤ s}
        r2 := RSS of the points in {X | X ∈ R#, Xk > s}
        if (r1 + r2 < cur_best_rss) then
            cur_best_rss := r1 + r2
            k* := k
            s* := s
replace region R# by the two regions {X | X ∈ R#, Xk* ≤ s*} and {X | X ∈ R#, Xk* > s*}
Decision Tree (Regression):
Recursive Binary Splitting
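The pseudocode above translates almost line-for-line into Python. Below is a naive sketch (my own rendering, not the module's official code), where X is a list of n rows of p attribute values and y is the list of n labels of the points in R#; as the next slide notes, this naive version takes O(pn²) time.

def rss(labels):
    # RSS of a region: sum of squared deviations from the region's mean.
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((v - mean) ** 2 for v in labels)

def best_split(X, y, p):
    # Returns (k*, s*) minimizing RSS(R1(k, s)) + RSS(R2(k, s)).
    cur_best_rss, k_star, s_star = float("inf"), None, None
    for k in range(p):
        for s in {row[k] for row in X}:          # candidate thresholds
            left = [y[i] for i, row in enumerate(X) if row[k] <= s]
            right = [y[i] for i, row in enumerate(X) if row[k] > s]
            r1, r2 = rss(left), rss(right)
            if r1 + r2 < cur_best_rss:
                cur_best_rss, k_star, s_star = r1 + r2, k, s
    return k_star, s_star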
Challenging Question (for those who have done an undergraduate algorithm design & analysis course):
Suppose there are n data points in R#. In the above algorithm, if you
naively compute the two RSS values, the algorithm takes time O(pn²), which
is a lot if n ≥ 10^6.
Can you improve the algorithm’s running time to O(pn log n)?
Hint: When there are ℓ real numbers y1, y2, . . . , yℓ with mean m, then
Σ_{i=1}^ℓ (y_i − m)² = Σ_{i=1}^ℓ y_i² − (1/ℓ)(Σ_{i=1}^ℓ y_i)².
Decision Tree (Regression):
Recursive Binary Splitting
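One possible route to O(pn log n), sketched under my own assumptions (this is not a model answer from the lecture): for each attribute k, sort the points by Xk once, then sweep the thresholds from left to right while maintaining running sums of y and y²; by the hint, each candidate split's RSS then costs O(1).

def best_split_for_attribute(xs, ys):
    # xs: values of attribute k; ys: labels. Returns (best r1+r2, s*).
    pairs = sorted(zip(xs, ys))                  # O(n log n)
    n = len(pairs)
    tot_y = sum(y for _, y in pairs)
    tot_yy = sum(y * y for _, y in pairs)
    left_y = left_yy = 0.0
    best, s_star = float("inf"), None
    for i, (x, y) in enumerate(pairs[:-1]):      # split after position i
        left_y += y
        left_yy += y * y
        if x == pairs[i + 1][0]:
            continue                             # not a valid threshold here
        # RSS via the hint: sum(y^2) - (sum(y))^2 / count
        r1 = left_yy - left_y ** 2 / (i + 1)
        r2 = (tot_yy - left_yy) - (tot_y - left_y) ** 2 / (n - i - 1)
        if r1 + r2 < best:
            best, s_star = r1 + r2, x
    return best, s_star

Running this for every k = 1, . . . , p gives O(pn log n) in total.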
select a region, denoted by R#
cur_best_error_rate := +∞
for k = 1, 2, . . . , p
    for each s such that there exists X ∈ R# with Xk = s
        r1 := Gini/Entropy/Class ER of the points in {X | X ∈ R#, Xk ≤ s}
        r2 := Gini/Entropy/Class ER of the points in {X | X ∈ R#, Xk > s}
        if (r1 + r2 < cur_best_error_rate) then
            cur_best_error_rate := r1 + r2
            k* := k
            s* := s
replace region R# by the two regions {X | X ∈ R#, Xk* ≤ s*} and {X | X ∈ R#, Xk* > s*}
Decision Tree (Classification):
Recursive Binary Splitting
r1 := Gini/Entropy/Class ER of the points in {X ∈ R# | Xk ≤ s}
Suppose there are ℓ classes (e.g. cat, dog, fox).
Given a set of n data points of the form (Xi, Yi), suppose class j appears nj times.
Then let p ∈ R^ℓ denote the vector
(p1, p2, . . . , pℓ) := (n1/n, n2/n, . . . , nℓ/n).
Decision Tree (Classification):
Recursive Binary Splitting
What is a good p? At best, one entry of p is 1 and all the others are zero.
We want to invent a function that takes p as input and outputs a number
indicating how good p is.
Decision Tree (Classification):
Recursive Binary Splitting
The “Classification Error Rate” function is
Class ER(p) := 1 − max_{1≤i≤ℓ} p_i
The “Gini Index” function is
Gini(p) := Σ_{1≤i≤ℓ} p_i (1 − p_i)
The “Entropy” function is
Entropy(p) := − Σ_{1≤i≤ℓ} p_i log2 p_i   (convention: p_i = 0 ⇒ p_i log2 p_i = 0)
Decision Tree (Classification):
Recursive Binary Splitting
p                       Class ER   Gini     Entropy
(0.5, 0.25, 0.25)       0.500      0.6250   1.5000
(0.45, 0.378, 0.172)    0.550      0.6250   1.4857
Decision Tree (Classification):
Recursive Binary Splitting
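The three functions are direct to implement. This sketch (mine, assuming p is given as a list of class proportions summing to 1) reproduces the first row of the table above:

from math import log2

def class_er(p):
    return 1 - max(p)

def gini(p):
    return sum(pi * (1 - pi) for pi in p)

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)   # 0*log2(0) := 0

print(class_er([0.5, 0.25, 0.25]),   # 0.5
      gini([0.5, 0.25, 0.25]),       # 0.625
      entropy([0.5, 0.25, 0.25]))    # 1.5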
ICLR is a top-tier machine learning conference which uses
OpenReview: the paper-review data is publicly available.
Each paper is reviewed by several reviewers, and each reviewer gives a
rating (between 0 and 10) to the paper.
Thanks to https://github.com/evanzd/ICLR2021-OpenReviewData ,
all the ratings are collected into a tsv file.
We want to analyze how the decisions (reject, accept as poster,
accept as spotlight, accept as oral) were made.
Example: ICLR 2021 Decision
There are 2966 paper submissions.
The attribute vector of each paper is X = (X0, X1, X2, X3, X4), where
• X0 is the average rating of the paper,
• X1 is the minimum rating,
• X2 is the maximum rating,
• X3 is the number of reviews with rating ≥ 6,
• X4 is the number of reviews with rating ≥ 7.
The label of each paper is Y :
• reject: Y = 0
• accept as poster/spotlight/oral: Y = 1
Example: ICLR 2021 Decision
We use recursive binary splitting with Gini. Level 1:
X0: average rating
X1: minimum rating
X2: maximum rating
X3: # reviews with rating ≥ 6
X4: # reviews with rating ≥ 7
[Tree: root [2106, 860], where [a, b] counts papers with Y = 0 and Y = 1;
split X0 ≤ 5.8 → [1885, 95]; X0 > 5.8 → [221, 765].]
Using a computer program, we found that attribute X0 with threshold 5.8
minimizes the sum of the Gini indices r1, r2:
r1 = (1885/1980)(1 − 1885/1980) + (95/1980)(1 − 95/1980) ≈ 0.09136
r2 = (221/986)(1 − 221/986) + (765/986)(1 − 765/986) ≈ 0.34780
r1 + r2 ≈ 0.43916
Example: ICLR 2021 Decision
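As a quick numeric check (my own illustration, reusing the gini helper sketched earlier), the Level-1 numbers above can be reproduced from the class proportions of the two regions:

def gini(p):
    return sum(pi * (1 - pi) for pi in p)

r1 = gini([1885 / 1980, 95 / 1980])
r2 = gini([221 / 986, 765 / 986])
print(round(r1, 5), round(r2, 5), round(r1 + r2, 5))   # 0.09136 0.3478 0.43916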
What if we replace Gini with Entropy? Level 1:
[Tree: root [2106, 860]; split X0 ≤ 4.75 → [977, 1]; X0 > 4.75 → [1129, 859].]
Using a computer program, we found that attribute X0 with threshold 4.75
minimizes the sum of the entropies r1, r2:
r1 = −( (977/978) log2(977/978) + (1/978) log2(1/978) ) ≈ 0.01163
r2 = −( (1129/1988) log2(1129/1988) + (859/1988) log2(859/1988) ) ≈ 0.98665
r1 + r2 ≈ 0.99828
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Level 2:
[Tree: root [2106, 860]; X0 ≤ 4.75 → [977, 1]; X0 > 4.75 → [1129, 859];
then [1129, 859] splits: X0 ≤ 6.0 → [1058, 189]; X0 > 6.0 → [71, 670].]
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Level 3:
[Tree: as above, and further:
[1058, 189] splits: X3 ≤ 1 → [367, 8]; X3 ≥ 2 → [691, 181];
[71, 670] splits: X0 ≤ 6.8 → [69, 459]; X0 > 6.8 → [2, 211].]
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Level 4:
[Tree: as above, and further:
[691, 181] splits: X0 ≤ 5.25 → [173, 10]; X0 > 5.25 → [518, 171];
[69, 459] splits: X4 ≤ 2 → [59, 351]; X4 ≥ 3 → [10, 108].]
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Levels 5 and 6:
[Tree: as above, and further:
[518, 171] splits: X0 ≤ 5.5 → [175, 18]; X0 > 5.5 → [343, 153];
[343, 153] splits: X3 ≤ 2 → [162, 44] (region 5.5 < X0 ≤ 6, X3 = 2);
X3 ≥ 3 → [181, 109] (region 5.5 < X0 ≤ 6, X3 ≥ 3);
[59, 351] splits: X0 ≤ 6.5 → [51, 270]; X0 > 6.5 → [8, 81];
[51, 270] splits: X0 ≤ 6.34 → [41, 166] (region 6 < X0 ≤ 6.34, X4 ≤ 2);
X0 > 6.34 → [10, 104] (region 6.34 < X0 ≤ 6.5, X4 ≤ 2).]
Example: ICLR 2021 Decision
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
Pruning Decision Tree
A division (decision tree) is good if the residual sum of squares (RSS) of each region is
small, where the RSS of region Rj is Σ_{i∈Rj} (y_i − ȳ_{Rj})²,
and ȳ_{Rj} := (1/n_j) Σ_{i∈Rj} y_i is the mean of all labels in Rj; it is the predicted
label of region Rj. But there is one problem:
if we make a division such that there is only one point in each region...
We want a decision tree which strikes a balance between:
(i) the number of regions (= the number of leaves in the decision tree)
should be kept small;
(ii) the RSS/error rate in each region is reasonably small.
This motivates the notion of the penalized RSS, which captures the
tradeoff between the two quantities:
α|T| + Σ_{m=1}^{|T|} RSS(R_m),
where |T| is the number of leaves in the decision tree T, and
α ≥ 0 is a parameter we will look into.
Pruning Decision Tree
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m),
where α ≥ 0 is a parameter we will look into.
• If α = 0, we simply create a division such that each leaf contains one data
point. The penalized RSS is 0 (or small).
• As α grows, we desire |T | to be smaller.
• If α is crazily large, then we desire just one leaf (region), so no grouping of
data points will be done.
So we want to choose a “moderate” value of α.
Pruning Decision Tree
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m)
We want to choose a “moderate” value of α.
Problem 1:
We cannot use this function
in recursive binary splitting (which is top-down),
since we do not know the exact value of |T| when splitting.
Problem 2:
We do not know which choice of α is good.
Pruning Decision Tree
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m)
One Solution: Bottom-up!
1. Use recursive binary splitting to generate a large tree,
i.e., one (or very few) data points per leaf.
2. Prune the tree with a chosen α (we will discuss how
to choose α later). See the next slide.
Pruning Decision Tree
Definition of Pruning:
[Figure: a tree whose eight leaves hold the labels [4], [8.9], [7], [7.4], [3], [3.2], [3.5], [2.9].]
Pruning starts by selecting an internal node Z of the tree.
It then removes all nodes below Z. Z now corresponds to a region
containing all data points originally below Z: here Z becomes the leaf [4, 8.9, 7, 7.4].
Pruning Decision Tree
Definition of Subtree:
[Figure: the tree with leaves [4, 8.9, 7, 7.4], [3], [3.2], [3.5], [2.9], drawn in black.]
A subtree simply means any new tree which can be obtained via
one or more prunings. The black tree above is a subtree of the original tree.
[Figure: another subtree, with leaves [4, 8.9, 7, 7.4], [3, 3.2], [3.5], [2.9],
obtained via two prunings.]
Pruning Decision Tree
Definition of |T′|, where T′ is a subtree:
|T′| simply means the number of leaves in the subtree.
For the subtree with leaves [4, 8.9, 7, 7.4], [3], [3.2], [3.5], [2.9], |T′| = 5.
For the subtree with leaves [4, 8.9, 7, 7.4], [3, 3.2], [3.5], [2.9], |T′| = 4.
Pruning Decision Tree
Observations: After each pruning,
• |T′| decreases;
• the summation Σ_{m=1}^{|T′|} RSS(R_m) either remains the same or
increases. (This is NOT trivial. Challenging problem: prove this.)
penalized RSS = α|T′| + Σ_{m=1}^{|T′|} RSS(R_m)
As the trade-off is there, at each internal node we need
to determine whether to prune it or not.
Pruning Decision Tree
α = 0.6
[Figure: the full tree with leaves [4], [8.9], [7], [7.4], [3], [3.2], [3.5], [2.9];
consider the internal node above [3] and [3.2].]
Current: α · 2 + 0 = 1.2
Merged: α · 1 + 0.02 = 0.62
RSS([3, 3.2]) = (3 − 3.1)² + (3.2 − 3.1)² = 0.02
Since 0.62 < 1.2, this node is pruned.
Pruning Decision Tree
α = 0.6
[Figure: leaves now [4], [8.9], [7], [7.4], [3, 3.2], [3.5], [2.9];
consider the node above [3.5] and [2.9].]
Current: α · 2 + 0 = 1.2
Merged: α · 1 + 0.18 = 0.78
RSS([3.5, 2.9]) = (3.5 − 3.2)² + (2.9 − 3.2)² = 0.18
Since 0.78 < 1.2, this node is also pruned.
Pruning Decision Tree
α = 0.6
[Figure: leaves now [4], [8.9], [7, 7.4], [3, 3.2], [3.5, 2.9].]
Similarly, the node above [7] and [7.4] has also been pruned.
Pruning Decision Tree
α = 0.6
[Figure: consider the node above [4] and [8.9].]
Current: α · 2 + 0 = 1.2
Merged: α · 1 + 12.005 = 12.605
RSS([4, 8.9]) = (4 − 6.45)² + (8.9 − 6.45)² = 12.005
Since 12.605 > 1.2, this node is NOT pruned.
Pruning Decision Tree
α = 0.6
[Figure: consider the node above the leaves [3, 3.2] and [3.5, 2.9].]
Current: α · 2 + 0.02 + 0.18 = 1.4
Merged: α · 1 + 0.21 = 0.81
RSS([3, 3.2, 3.5, 2.9]) = (3 − 3.15)² + (3.2 − 3.15)² + (3.5 − 3.15)² + (2.9 − 3.15)² = 0.21
Since 0.81 < 1.4, this node is pruned.
Pruning Decision Tree
α = 0.6
[Figure: consider the node above [4], [8.9] and [7, 7.4].]
Current: α · 3 + 0.08 = 1.88
Merged: α · 1 + 12.6475 = 13.2475
RSS([4, 8.9, 7, 7.4]) = (4 − 6.825)² + (8.9 − 6.825)² + (7 − 6.825)² + (7.4 − 6.825)² = 12.6475
Since 13.2475 > 1.88, this node is NOT pruned.
Pruning Decision Tree
α = 0.6
[Figure: the decision tree after pruning with α = 0.6; its leaves are
[4], [8.9], [7, 7.4], [3, 3.2, 3.5, 2.9].]
Pruning Decision Tree
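The merge-vs-keep test used in this worked example can be sketched as follows (my own illustration, not the slides' code); it reproduces the decisions above.

def rss(labels):
    # Same helper as in the earlier splitting sketch.
    mean = sum(labels) / len(labels)
    return sum((v - mean) ** 2 for v in labels)

def should_prune(leaf_label_groups, alpha):
    # leaf_label_groups: one list of labels per current leaf under the node.
    current = alpha * len(leaf_label_groups) + sum(rss(g) for g in leaf_label_groups)
    merged_labels = [v for g in leaf_label_groups for v in g]
    merged = alpha * 1 + rss(merged_labels)
    return merged < current

print(should_prune([[3], [3.2]], 0.6))   # True  (0.62 < 1.2)
print(should_prune([[4], [8.9]], 0.6))   # False (12.605 > 1.2)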
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m)
1. Use recursive binary splitting to generate a large tree,
i.e., one (or very few) data points per leaf.
2. Use the penalized RSS to determine whether to prune a node or not.
Pruning Decision Tree
Questions:
• How to implement this algorithm? (We discuss this next.)
• What data structure should we use for an efficient algorithm?
(Left for you to think about.)
• Which α should we use? (k-fold cross validation.)
Pruning Decision Tree
How to implement this algorithm?
• We will compute all possible subtrees for the choices α = 0
to α = +∞.
• Imagine that we increase α from 0. Initially, the subtree for
α = 0 is simply the original tree. (Why?)
• Then we increase α. The subtree remains the same up to a
certain critical α value. How do we compute this critical α?
Pruning Decision Tree
– For each internal node v, compute the pRSS of
the tree under v, in terms of α. Note that the pRSS is of the form
c · α + d, where c ≥ 2.
– If v is pruned, then the new pRSS of the tree under v is
1 · α + RSS(all data points under v).
Pruning Decision Tree
[Figure: two lines in the (α, pRSS) plane: y = c · α + d (if v is kept) and
y = α + RSS(all data points under v) (if v is pruned). Since c ≥ 2, the lines
cross at the critical α for node v, namely α = (RSS(all data points under v) − d)/(c − 1).]
Pruning Decision Tree
The overall critical α value is
ᾱ := min_v {critical α value of node v},
where v runs over ALL internal nodes.
Pruning Decision Tree
How to implement this algorithm? (continued)
• Now we have a new subtree. Repeat the above argument, but start
from α = ᾱ instead of α = 0. This is repeated until we obtain a
subtree with just a single node.
Q1: Why don’t we restart from α = 0?  Q2: Why must this repetition end?
Pruning Decision Tree
Which α should we use? k-fold cross validation.
Cross validation is a general technique: divide the
training data into several parts, use one part for
“testing” and the other parts for “training”.
• k-fold Cross Validation:
– Divide the training data D into k parts, denoted by D1, D2, . . . , Dk.
– for i = 1, 2, . . . , k:
    use D − Di to obtain decision trees for various α’s
    use Di as test data to compute the test-MSE for various α’s
– for each α, compute the total of the k test-MSEs computed above
– select the α that yields the least total test-MSE
Pruning Decision Tree
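A minimal sketch of this procedure (my own illustration; fit_pruned_tree and test_mse are hypothetical helpers that would wrap the splitting and pruning steps described above):

def k_fold_select_alpha(data, alphas, k, fit_pruned_tree, test_mse):
    folds = [data[i::k] for i in range(k)]           # divide D into D_1..D_k
    totals = {alpha: 0.0 for alpha in alphas}
    for i in range(k):
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        for alpha in alphas:
            tree = fit_pruned_tree(train, alpha)         # train on D - D_i
            totals[alpha] += test_mse(tree, folds[i])    # test on D_i
    return min(totals, key=totals.get)               # least total test-MSE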