Live Lecture 1: Data Analysis (CS5100J)
TRANSCRIPT
IMPORTANT: The Moodle page of this module
https://moodle.royalholloway.ac.uk/
course/view.php?id=9088
contains all the information you need for this module.
If you still cannot access it,
email [email protected]
Module Activities
• Pre-recorded (asynchronous) lectures and revision exercises
– posted on Moodle every Saturday
– watch the video before coming to the live lecture
• Live (synchronous) lectures
– 4–5pm every Thursday (starting from 14 Jan)
• Q&A sessions
– 9–10am and 11am–12noon every Thursday (starting from 14 Jan)
– Q&A is optional; “you ask, I answer”
Module Activities
• 8 lab sessions
– 5–6pm every Thursday (starting from 21 Jan, until 11 March)
• 7 quizzes
– almost every Thursday (the first quiz opens on 21 Jan, the final quiz opens on 11 March;
no quiz opens on 4 Feb)
– only one attempt per quiz is allowed; time limit: 30 minutes
– you need to complete each quiz within a one-week timeframe
(e.g. the first quiz opens at 5pm on 21 Jan and closes at 6pm on 28 Jan)
• Labs and quizzes account for 16% of your final grade
Evaluations
• 3 Homework Assignments
– on 11 Feb, 25 Feb, 11 March
– you have two weeks to complete each assignment
– account for 24% of your total grade
• Examination (arrangements announced later)
– accounts for 60% of your total grade
Evaluations
• This module will use quite a bit of mathematics
(linear algebra, probability, set theory).
Go to the Moodle page to find
“Pre-sessional Mathematics”,
which contains handouts covering the
relevant mathematical background.
Module Activities
• Pre-recorded lectures cover most (if not all) of the essential
material for this module.
• Watch the pre-recorded lecture before coming to the live lecture.
• Live lectures are meant to be lively.
– Do ask questions! (And I will ask you questions.)
– Have discussions. (I know it is difficult now.)
– I will present more examples. (+ experiences of my friends working in industry)
• Ultimately, live lectures reinforce your learning.
Live Lectures
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
Google is like an elephant. Most employees see a tiny fraction of the
elephant, then use their “expert knowledge” to make that part work
and improve. Only a few architects have an overview of how the
whole thing works.
— Martin, Google engineer
Data Science
The “Big Names”
From: Boss
Content: I want to understand how
salaries affect the spending habits of
people.
Informally, data science/mining refers to the whole
process of gaining insight/information from data.
From: Employee
Content: Here you go.
Attachment: spending habit data.csv
(535MB)
The “Fractions”
Data science/mining refers to the whole process of gaining
insight/information from data. There are many sub-processes, including:
• Data collection (what data to collect? how to collect it? at what cost?)
• Data validation/cleaning (clean data is valuable and expensive)
• Data infrastructure (how to store thousands of TB of data? how to ensure efficient retrieval by
hundreds of users? which hardware and software are needed?)
• Data Analysis (which hypotheses to test? what model to use? which algorithm to use
for various types of data? is it efficient? how long does it take?)
• Information Representation (data visualization, reports, products)
(The algorithm is the key component of “Machine Learning”.)
The “Fractions”
A friend-of-a-friend of mine was employed by a car
manufacturing company.
His task: use machine learning and computer vision to
improve the car assembly process.
(Source: “RTV” YouTube channel, screen capture from video W8vd2ulZBGs)
Friend-of-a-friend: What data is available?
Manager: What data do you want to collect?
Data Collection
Without data, you can do no analysis.
Finding (Open-Source) data:
kaggle.com
datasetsearch.research.google.com
Football: www.whoscored.com, www.sofascore.com
Paper Reviews: openreview.net
Data Collection
“What data do you want to collect?”
• Collecting data takes time and money. Many open-source or
free datasets are either small or not good enough.
• Bureaucracy. Some data needs the cooperation of several
departments in the same company to be collected.
• Privacy concerns. Are the collection and storage of some
specific data illegal?
Data Collection
“What data do you want to collect?”
• Some data is too costly or even impossible to collect.
(E.g. drug dealers’ revenues; see Levitt’s book “Freakonomics”.)
• Some data is too sensitive, so respondents persistently lie during
collection. (E.g. history of committing crimes.)
Data Collection
“What data do you want to collect?”
• Poor design of the collection protocol leads to erratic data.
(E.g. asking the weight of a person: unit, range...)
• Ambiguity in the definition / specification.
(E.g. in football, for a “long-ball pass”, what counts as “long”? how to
distinguish between a “pass” and a “clearance”?)
Indeed, the main challenges in this stage
are mostly un-computer-scientific
(of course, good software design can help).
Data Collection
Indeed, the main challenges in data collection
are mostly un-computer-scientific.
The same holds for data validation / cleaning.
(Although nowadays engineers design algorithms to validate/clean data.)
Data Validation/Cleaning/Infrastructure
After this, we want to store (big) data in servers / data
centers in an organized way, to ensure reliable, efficient and
convenient retrieval by end-users (nowadays under the
umbrella of “data scientists”). (This is traditionally closer to “Database Systems”.)
Source:
https://xkcd.com/1838/
Rules of Thumb:
• No one is going to tell you which algorithm
does the best, because “the best” might
not even exist.
• In practice, you need to do a lot of
“experiments” to guide yourself towards a
good enough solution.
• When you handle huge amounts of data,
the speed and memory requirements of an algorithm become critical concerns.
Statistics & Data Analysis
• Statistics is a centuries-old subject; it started long before computers
were born. (To be clear, I am no expert in statistics.)
• For one-variable data, the mean, variance/SD and median are some natural
statistical measures you learnt in high school.
• If the data has two or more variables, we may want to find out their
relations. Linear regression (LR) is one of the very first algorithms
(Legendre and Gauss, in the 1800s).
Let’s go back to a pre-computer world
for the next few slides...
Statistics
• Least squares is the most popular version of linear regression. Let’s
recall how it is taught in high school or STAT001.
[Figure: a scatter plot of points (x, y) with a fitted straight line y = mx + c.]
Linear Regression
i       x_i     y_i
1       11      50
2       7       30
3       20      88
4       2       14
...     ...     ...
100000  4       26

slope m = ( n Σ_{i=1}^n x_i y_i − (Σ_{i=1}^n x_i)(Σ_{i=1}^n y_i) ) / ( n Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)² )

y-intercept c = ( (Σ_{i=1}^n y_i)(Σ_{i=1}^n x_i²) − (Σ_{i=1}^n x_i)(Σ_{i=1}^n x_i y_i) ) / ( n Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)² )
Linear Regression
Before explaining how this formula/algorithm arises, note that it is
simple to implement. Anyone who can do middle-school math can handle it.
When n is huge, even in the 1800s it allowed a simple division of labour
(in modern terminology, parallel computing)!
Linear Regression
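As an illustration of how simple the formula is to implement, here is a minimal sketch in Python (my own illustration, not part of the lecture; the data lists stand in for the table's columns):

def least_squares_line(xs, ys):
    # Returns (m, c) minimizing sum((y_i - m*x_i - c)^2),
    # using the slope/intercept formulas above.
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2          # shared denominator
    m = (n * sum_xy - sum_x * sum_y) / denom
    c = (sum_y * sum_xx - sum_x * sum_xy) / denom
    return m, c

m, c = least_squares_line([11, 7, 20, 2], [50, 30, 88, 14])

The four sums are exactly what the 19th-century “division of labour” computed: each worker sums a chunk of the table, and the chunk totals are added at the end.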
But... why are this slope and y-intercept chosen?
Linear Regression
Theorem. The slope m and y-intercept c given above are the minimizer of the function
f(m, c) = Σ_{i=1}^n (y_i − m x_i − c)².
(This is called the “Mean Squared Error”, which we will discuss more later in this module.)
Proof. Rewrite the function f as
f(m, c) = (Σ_{i=1}^n x_i²) m² + n c² + 2 (Σ_{i=1}^n x_i) mc − 2 (Σ_{i=1}^n x_i y_i) m − 2 (Σ_{i=1}^n y_i) c + (Σ_{i=1}^n y_i²).
As we are finding the m, c that minimize the function, we take partial derivatives:
∂f/∂m = 2 (Σ_{i=1}^n x_i²) m + 2 (Σ_{i=1}^n x_i) c − 2 (Σ_{i=1}^n x_i y_i)
∂f/∂c = 2 n c + 2 (Σ_{i=1}^n x_i) m − 2 (Σ_{i=1}^n y_i)
Note that ∂f/∂m = ∂f/∂c = 0 is a linear system in the variables m, c. Solving it gives the formulas.
Linear Regression
[Figure: the fitted line with green bars marking the vertical residuals.]
Questions:
Why not minimize Σ (length of green bar)⁴?
Why not minimize Σ |length of green bar|?
Why not minimize Σ (perpendicular distance)²?
An answer: Nothing prohibits you from doing so, except...
Which algorithms find these minimums?
(In particular, is there any algorithm for the pre-computer era?)
How long do these algorithms take
(when n = 10^6, say)?
Linear Regression
Questions:
Why not ignore the “outliers”?
Is “linear” really the right model?
An answer: It really depends on the properties of the data.
There is no “universal” answer to these questions.
How do we design algorithms that ignore “outliers”?
Linear Regression
• There are similarities between statistics and data analysis.
With powerful computers, modern data analysis can afford
huge and sophisticated data, models and algorithms.
• In both the pre-computer and modern eras, the speed and memory
requirements of an algorithm are a major concern.
• Least-squares linear regression appears to have earned its
fame due to its simple, quick and parallelizable algorithm.
It can also be implemented in an online fashion using a small amount of memory.
Statistics and Data Analysis –
Summary
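To make the “online fashion” claim above concrete, here is a minimal sketch (my own illustration, not the module's code): the fit needs only five running totals, so each data point can be consumed and then discarded, and memory use stays constant.

class OnlineLeastSquares:
    # Maintains the five running totals needed by the least-squares formulas.
    def __init__(self):
        self.n = self.sum_x = self.sum_y = self.sum_xy = self.sum_xx = 0.0

    def update(self, x, y):
        # Consume one data point; nothing else needs to be stored.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_xx += x * x

    def line(self):
        # Plug the running totals into the slope/intercept formulas.
        denom = self.n * self.sum_xx - self.sum_x ** 2
        m = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        c = (self.sum_y * self.sum_xx - self.sum_x * self.sum_xy) / denom
        return m, c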
• There is no “universally best” algorithm. You need to decide
which algorithm to use based on
– your judgement on the properties of data;
– the (fuzzy) results of experiments;
– what is the upper limit of data it can handle (under time
and memory constraints)? (scalability)
• In this module, you will see a number of algorithms.
You learn them not only because you want to strictly follow them,
but also because you want to gain inspiration from them, so as to
make yourselves adaptable in your future career.
Statistics and Data Analysis –
Summary
Not a fish, but to fish.
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
• Attributes (input variables) X1, X2, . . .
• Labels (output variable) Y
• Quantitative / Numerical variables
(age/height of a person, spending on a category)
• Qualitative variables (animals: cat/dog/fox)
Types of Variables
• Problems with a quantitative label:
regression problems
• Problems with a qualitative label:
classification problems
(later in this module, we may see how to turn a qualitative label into a quantitative one via a probability space)
Types of Variables
Label: Y Attribute: X = (X1, X2, . . . , Xp)
Regression Model
Y = f(X) + ε
f is the model: the function you want to learn.
ε is the error term; it may be due to:
randomness, measurement error, the effect of hidden variables...
(more data helps reduce random or measurement errors)
Problem: There are infinitely many choices of f.
For efficient learning, we must constrain f in a
reasonable way. This is a choice we need to make.
Regression Model
• We shall see: sometimes it is an “art” to choose the model f .
• How to choose f also depends on purpose.
• Prediction: The user does not care about f; they just care about the accuracy
of its predictions. In this case, f can be very complicated (e.g. a neural
network).
Example: Take a photo of size 1024×1024 as input, and tell which
people (among your friends and family) are in the photo.
Representation of f :
Prediction vs. Inference
• Inference: The user really wants to know f:
– when a social scientist wants to know the relation between salary and health
– when medical doctors want to know the effectiveness of different vaccines on different age groups
– when a physicist wants to know the relation between the temperature of a compound material and its conductivity
Thus, f cannot be very complicated, and it should be reasonable
and intuitive. (In contrast, a huge neural network might not have a
clear intuition behind it.)
Representation of f :
Prediction vs. Inference
On whoscored.com and sofascore.com, there is team performance data.
Let’s do an example:
X1: possession percent
X2: pass percent
X3: dribbles per game
X = (X1, X2, X3)
Y : points per game
We use the data from season 2019/2020.
Y = f (X) + ε
Example: Premier League Data
Important Warning: In this example, there will be
terminologies and methods
I won’t explain.
The purpose of this example is
to demonstrate the components
of a data analysis process.
Example: Premier League Data
Y = f(X) + ε
What f should we choose? Let’s try a first candidate:
f(X) = a X1 + b X2 + c X3 + d
Now f can be determined by the
four parameters a, b, c, d. We say
f is a parametrized function.
Example: Premier League Data
We use least-squares LR, i.e., we find the a, b, c, d that minimize the following error function:
Σ_i (Y_i − a X_{i1} − b X_{i2} − c X_{i3} − d)².
We get a = 0.1112, b = −0.0714, c = 0.0410, d = 1.0371.
Example: Premier League Data
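For concreteness, here is one way such a fit could be computed with NumPy (a hedged sketch; the array values below are hypothetical placeholders, not the real whoscored.com numbers):

import numpy as np

# Hypothetical placeholder data: one row per team,
# columns = [possession %, pass %, dribbles per game].
X = np.array([[57.1, 82.4, 10.2],
              [45.3, 71.0,  7.8],
              [52.0, 78.5,  9.1]])
Y = np.array([2.36, 1.05, 1.42])   # points per game

# Append a column of ones so that d is fitted as the constant term.
A = np.hstack([X, np.ones((X.shape[0], 1))])
(a, b, c, d), *_ = np.linalg.lstsq(A, Y, rcond=None)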
How good is this fit? When there are two or more attributes, it is difficult to plot a
graph to visualize the result. What can we do to evaluate it?
Example: Premier League Data
For a small amount of data, we can look
at the “residuals”: the
differences between Y_i and the predicted value
a X_{i1} + b X_{i2} + c X_{i3} + d.
The fit seems not very good (in the slide’s table, red boxes mark the large residuals).
Example: Premier League Data
When there is a lot of data, listing residuals is not feasible.
We can use a standard measure called R² for evaluation.
R² = 0 is a bad fit; R² = 1 is a perfect fit.
(I do not explain how R² is calculated. If you are really interested,
look up “Coefficient of Determination” on Wikipedia.)
Example: Premier League Data
In this case, R² = 0.626, which is okay but not very good.
(Whether a given R² is good or not depends on the “norm” in different areas;
in social science, data tends to be fuzzier, so R² > 0.5 is considered very good;
in some areas of physics, it is quite common that R² > 0.9.)
Example: Premier League Data
We have done evaluations via residuals and R².
But after deriving f from this data,
what we really want is to use this f
to make predictions on other data.
So let’s see how this f performs on the data
from the current 2020/2021 season.
Example: Premier League Data
A reflection on why the fit is not good:
• The truth might just be that “there is no strong relation between the
attributes and the labels”.
• X1, X2, X3 are all “attacking” attributes. There are teams which rely
more on defence to earn many points (e.g. Jose Mourinho’s Tottenham this season, Sheffield United last
season), while there are other teams which play an attacking style but defend poorly
and earn few points (e.g. Brighton this season, Norwich last season).
X1: possession percent
X2: pass percent
X3: dribbles per game
Y : points per game
Example: Premier League Data
A reflection on why the fit is not good:
• The trend changes from season to season. A reasonably nice fit for last
season can do very badly on this season’s data.
Example: Premier League Data
How to gain better insights from data:
• I cherry-picked the “attacking” attributes. Are there any “defending” attributes
available? (YES!)
• Some teams earn many points by focusing on defence, other teams by focusing on attack.
How can we modify f to capture this observation? One idea:
f = max { f1(attack attributes), f2(defend attributes) }
But this is not a linear function, so the least-squares linear regression algorithm does not
apply; we might need a more sophisticated (and less efficient) algorithm.
Example: Premier League Data
Training & Test Data
• In the Premier League example, we used some data to derive a
function f. Such data is called training data or the training set.
• In general, we do not care how well the method works on the training
data. We are interested in the accuracy of prediction on some other
data, called test data.
• In the Premier League example,
training data = data from the last (completed) season
test data = data from this (ongoing) season
• We want to measure the quality of fit. In the regression setting, a
common measure is the MSE:
(1/n) Σ_{i=1}^n (y_i − f(x_i))²,
where {(x_i, y_i)}_{1≤i≤n} can be either the training data (then it is called
the training MSE) or the test data (the test MSE).
• Note that the test MSE depends on the test data, so it is possible that
it is small for one test dataset but large for another.
Mean Squared Error (MSE)
• In the classification setting (qualitative label; e.g. a photo
as the attribute, label cat/dog/fox), a common measure for
the quality of fit is the error rate:
(1/n) Σ_{i=1}^n I(y_i ≠ f(x_i))
(here I(·) is 1 if its argument holds and 0 otherwise).
Again, {(x_i, y_i)}_{1≤i≤n} can be either training data (training
error rate) or test data (test error rate).
Error Rate
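As a small illustration (my own sketch, assuming the labels and predictions are plain Python lists), both measures can be computed directly from their definitions:

def mse(ys, preds):
    # Mean squared error: (1/n) * sum of (y_i - f(x_i))^2.
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def error_rate(ys, preds):
    # Fraction of points whose predicted class differs from the label.
    return sum(1 for y, p in zip(ys, preds) if y != p) / len(ys)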
We have learnt these terminologies:
• attributes, labels
• quantitative vs. qualitative variables
• regression problems, classification problems
• prediction vs. inference
• model f , parametrized function
• training data, test data
• mean squared error (for regression)
• error rate (for classification)
Summary
• A “toy” example of Premier League:
– Which training data to use? (Just “attacking” attributes, or both attacking &
defending attributes?)
– What parametrization of f? (linear function)
– What algorithm? (least-square linear regression)
Summary
• A “toy” example of Premier League:
– Evaluation on the quality of fit. (residuals, R2, against test data of current
season, MSE)
– Reflect on why the fitting is poor. (bias to “attack” attributes)
– Another choice of parametrization of f . (linear with cutoffs? polynomial
regression?)
Summary
[Workflow diagram:]
1. Collect and organize data.
2. Choose a parametrization of f (e.g. linear).
3. Choose an error function to minimize (e.g. MSE).
4. Run the algorithm to obtain a concrete f.
5. Evaluate (e.g. residuals, R², against test data).
6. Reflect; seek improvements: collect more data? use different data?
choose another reasonable parametrization of f? choose a different
error function?
Summary
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
• We will look into f’s and algorithms that involve a lot of mathematics
in this module.
• In this first lecture, however, let’s look at an “intuitive” family:
decision trees.
[Example decision tree: the root splits on age (age ≤ 22, 23 ≤ age ≤ 31,
32 ≤ age ≤ 40); two of the branches split further on an attribute ab
(60 ≤ ab ≤ 79 vs ab ≥ 80, and 65 ≤ ab ≤ 78 vs ab ≥ 79); the five leaves
predict salary = 5, 20, 15, 30, 25.]
Decision Tree
• Decision trees are easy to interpret (even for people who are very bad
at math).
• But they are typically not very good in predictive accuracy
(i.e. they have a high error rate / MSE).
• Predictive accuracy can be improved by using a more
complicated decision tree, but the interpretability gets lower.
Decision Tree
• Recall the attribute X = (X1, X2, . . . , Xp).
• In the simplest form of a decision tree, at each node we pick one
Xk, then follow a TRUE statement about Xk to move to the next node.
[Example tree: the root splits on X1 into X1 ≤ 64, 64 < X1 ≤ 535 and
X1 > 535; another node splits on X4 = female vs X4 = male.]
Decision Tree
• It is also possible to have a condition on more than one Xk.
[Example tree: a node splits on X1 + X3 into X1 + X3 ≤ 64,
64 < X1 + X3 ≤ 535 and X1 + X3 > 535.]
Decision Tree
Question:
How can we use an algorithm to
generate a decision tree from data?
An answer:
Divide the space containing all possible X = (X1, . . . , Xp)
into non-overlapping regions R1, . . . , Rj.
For all points in the same region, make the same prediction,
which is the mean of the labels of these points.
Decision Tree
A division (decision tree) is good if the residual sum of squares (RSS)
of each region Rj is small, where
RSS(Rj) := Σ_{i∈Rj} (y_i − ȳ_{Rj})²
and ȳ_{Rj} := (1/n_j) Σ_{i∈Rj} y_i is the mean of all labels in Rj.
Decision Tree (Regression)
ȳ_{Rj} is the predicted label of region Rj.
But there is one problem...
Decision Tree (Regression)
If we make the division such that
there is only one point in each region, then every region’s RSS is 0:
a perfect fit on the training data, but not a useful tree.
We want to design an efficient algorithm that outputs
a decision tree which strikes a balance between:
(i) the number of regions should be kept small;
(ii) the RSS in each region is reasonably small.
Decision Tree
[Figure: the (X1, X2) plane divided into rectangular regions R1, R2, R3, R4.]
As in the earlier example, to have an efficient
algorithm and an interpretable decision tree, we
cannot afford an arbitrary division into regions.
In the next algorithm, every time we split an
existing region into two via the constraints
Xk ≤ s and Xk > s. Thus, each region must
be a rectangular box.
Decision Tree
1. Initially, there is one region, which contains all data points.
2. Select one existing region, denote it by R#.
(a) Let R1(k, s) = {X | X ∈ R#, Xk ≤ s} and R2(k, s) = {X | X ∈ R#, Xk > s}.
(b) Find the k, s such that the RSS of R1(k, s) plus the RSS of R2(k, s) is the smallest.
(c) Replace region R# by the two regions R1(k, s) and R2(k, s).
3. Repeat Step 2 until a certain condition is met.
4. For each region Rj, set ȳ_{Rj} to be its predicted label.
Decision Tree (Regression):
Recursive Binary Splitting
RSS of a region Rj is Σ_{i∈Rj} (y_i − ȳ_{Rj})², where ȳ_{Rj} := (1/n_j) Σ_{i∈Rj} y_i is the mean of all labels in Rj.
The above is just a visualization.
For X = (X1, X2, . . . , Xp) with p ≥ 3,
visualization is difficult.
Warning: Visualization is for you to gain a first
understanding; don’t rely on it too much.
What you really do is write a program.
How to implement the algorithm efficiently?
Decision Tree
select a region, denoted by R#
cur_best_rss := +∞
for k = 1, 2, . . . , p
    for each s such that there exists X ∈ R# with Xk = s
        r1 := RSS of the points in {X | X ∈ R#, Xk ≤ s}
        r2 := RSS of the points in {X | X ∈ R#, Xk > s}
        if (r1 + r2 < cur_best_rss) then
            cur_best_rss := r1 + r2
            k* := k
            s* := s
replace region R# by the two regions {X | X ∈ R#, Xk* ≤ s*} and {X | X ∈ R#, Xk* > s*}
Decision Tree (Regression):
Recursive Binary Splitting
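The pseudocode above translates almost line-for-line into Python. Below is a naive sketch (my own rendering, not the module's official code), where X is a list of n rows of p attribute values and y is the list of n labels of the points in R#; as the next slide notes, this naive version takes O(pn²) time.

def rss(labels):
    # RSS of a region: sum of squared deviations from the region's mean.
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((v - mean) ** 2 for v in labels)

def best_split(X, y, p):
    # Returns (k*, s*) minimizing RSS(R1(k, s)) + RSS(R2(k, s)).
    cur_best_rss, k_star, s_star = float("inf"), None, None
    for k in range(p):
        for s in {row[k] for row in X}:          # candidate thresholds
            left = [y[i] for i, row in enumerate(X) if row[k] <= s]
            right = [y[i] for i, row in enumerate(X) if row[k] > s]
            r1, r2 = rss(left), rss(right)
            if r1 + r2 < cur_best_rss:
                cur_best_rss, k_star, s_star = r1 + r2, k, s
    return k_star, s_star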
Challenging Question (for those who have done an undergraduate algorithm design & analysis course):
Suppose there are n data points in R#. In the above algorithm, if you
naively compute the two RSS values, the algorithm takes time O(pn²), which
is a lot if n ≥ 10^6.
Can you improve the algorithm’s running time to O(pn log n)?
Hint: When there are ℓ real numbers y1, y2, . . . , yℓ with mean m, then
Σ_{i=1}^ℓ (y_i − m)² = Σ_{i=1}^ℓ y_i² − (1/ℓ)(Σ_{i=1}^ℓ y_i)².
Decision Tree (Regression):
Recursive Binary Splitting
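One possible route to O(pn log n), sketched under my own assumptions (this is not a model answer from the lecture): for each attribute k, sort the points by Xk once, then sweep the thresholds from left to right while maintaining running sums of y and y²; by the hint, each candidate split's RSS then costs O(1).

def best_split_for_attribute(xs, ys):
    # xs: values of attribute k; ys: labels. Returns (best r1+r2, s*).
    pairs = sorted(zip(xs, ys))                  # O(n log n)
    n = len(pairs)
    tot_y = sum(y for _, y in pairs)
    tot_yy = sum(y * y for _, y in pairs)
    left_y = left_yy = 0.0
    best, s_star = float("inf"), None
    for i, (x, y) in enumerate(pairs[:-1]):      # split after position i
        left_y += y
        left_yy += y * y
        if x == pairs[i + 1][0]:
            continue                             # not a valid threshold here
        # RSS via the hint: sum(y^2) - (sum(y))^2 / count
        r1 = left_yy - left_y ** 2 / (i + 1)
        r2 = (tot_yy - left_yy) - (tot_y - left_y) ** 2 / (n - i - 1)
        if r1 + r2 < best:
            best, s_star = r1 + r2, x
    return best, s_star

Running this for every k = 1, . . . , p gives O(pn log n) in total.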
select a region, denoted by R#
cur_best_error_rate := +∞
for k = 1, 2, . . . , p
    for each s such that there exists X ∈ R# with Xk = s
        r1 := Gini/Entropy/Class ER of the points in {X | X ∈ R#, Xk ≤ s}
        r2 := Gini/Entropy/Class ER of the points in {X | X ∈ R#, Xk > s}
        if (r1 + r2 < cur_best_error_rate) then
            cur_best_error_rate := r1 + r2
            k* := k
            s* := s
replace region R# by the two regions {X | X ∈ R#, Xk* ≤ s*} and {X | X ∈ R#, Xk* > s*}
Decision Tree (Classification):
Recursive Binary Splitting
r1 := Gini/Entropy/Class ER of the points in {X ∈ R# | Xk ≤ s}
Suppose there are ℓ classes (e.g. cat, dog, fox).
Given a set of n data points of the form (Xi, Yi), suppose class j appears nj times.
Then let p ∈ R^ℓ denote the vector
(p1, p2, . . . , pℓ) := (n1/n, n2/n, . . . , nℓ/n).
Decision Tree (Classification):
Recursive Binary Splitting
What is a good p? At best, one entry of p is 1 and all the others are zero.
We want to invent a function that takes p as input and outputs a number
indicating how good p is.
Decision Tree (Classification):
Recursive Binary Splitting
The “Classification Error Rate” function is
Class ER(p) := 1 − max_{1≤i≤ℓ} p_i
The “Gini Index” function is
Gini(p) := Σ_{1≤i≤ℓ} p_i (1 − p_i)
The “Entropy” function is
Entropy(p) := − Σ_{1≤i≤ℓ} p_i log2 p_i   (convention: p_i = 0 ⇒ p_i log2 p_i = 0)
Decision Tree (Classification):
Recursive Binary Splitting
p                       Class ER   Gini     Entropy
(0.5, 0.25, 0.25)       0.500      0.6250   1.5000
(0.45, 0.378, 0.172)    0.550      0.6250   1.4857
Decision Tree (Classification):
Recursive Binary Splitting
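The three functions are direct to implement. This sketch (mine, assuming p is given as a list of class proportions summing to 1) reproduces the first row of the table above:

from math import log2

def class_er(p):
    return 1 - max(p)

def gini(p):
    return sum(pi * (1 - pi) for pi in p)

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)   # 0*log2(0) := 0

print(class_er([0.5, 0.25, 0.25]),   # 0.5
      gini([0.5, 0.25, 0.25]),       # 0.625
      entropy([0.5, 0.25, 0.25]))    # 1.5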
ICLR is a top-tier machine learning conference which uses
OpenReview: the paper-review data is publicly available.
Each paper is reviewed by several reviewers, and each reviewer gives a
rating (between 0 and 10) to the paper.
Thanks to https://github.com/evanzd/ICLR2021-OpenReviewData ,
all the ratings are collected into a tsv file.
We want to analyze how the decisions (reject, accept as poster,
accept as spotlight, accept as oral) were made.
Example: ICLR 2021 Decision
There are 2966 paper submissions.
The attribute vector of each paper is X = (X0, X1, X2, X3, X4), where
• X0 is the average rating of the paper,
• X1 is the minimum rating,
• X2 is the maximum rating,
• X3 is the number of reviews with rating ≥ 6,
• X4 is the number of reviews with rating ≥ 7.
The label of each paper is Y :
• reject: Y = 0
• accept as poster/spotlight/oral: Y = 1
Example: ICLR 2021 Decision
We use recursive binary splitting with Gini. Level 1:
X0: average rating
X1: minimum rating
X2: maximum rating
X3: # reviews with rating ≥ 6
X4: # reviews with rating ≥ 7
[Tree: root [2106, 860], where [a, b] counts papers with Y = 0 and Y = 1;
split X0 ≤ 5.8 → [1885, 95]; X0 > 5.8 → [221, 765].]
Using a computer program, we found that attribute X0 with threshold 5.8
minimizes the sum of the Gini indices r1, r2:
r1 = (1885/1980)(1 − 1885/1980) + (95/1980)(1 − 95/1980) ≈ 0.09136
r2 = (221/986)(1 − 221/986) + (765/986)(1 − 765/986) ≈ 0.34780
r1 + r2 ≈ 0.43916
Example: ICLR 2021 Decision
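As a quick numeric check (my own illustration, reusing the gini helper sketched earlier), the Level-1 numbers above can be reproduced from the class proportions of the two regions:

def gini(p):
    return sum(pi * (1 - pi) for pi in p)

r1 = gini([1885 / 1980, 95 / 1980])
r2 = gini([221 / 986, 765 / 986])
print(round(r1, 5), round(r2, 5), round(r1 + r2, 5))   # 0.09136 0.3478 0.43916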
What if we replace Gini with Entropy? Level 1:
[Tree: root [2106, 860]; split X0 ≤ 4.75 → [977, 1]; X0 > 4.75 → [1129, 859].]
Using a computer program, we found that attribute X0 with threshold 4.75
minimizes the sum of the entropies r1, r2:
r1 = −( (977/978) log2(977/978) + (1/978) log2(1/978) ) ≈ 0.01163
r2 = −( (1129/1988) log2(1129/1988) + (859/1988) log2(859/1988) ) ≈ 0.98665
r1 + r2 ≈ 0.99828
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Level 2:
[Tree: root [2106, 860]; X0 ≤ 4.75 → [977, 1]; X0 > 4.75 → [1129, 859];
then [1129, 859] splits: X0 ≤ 6.0 → [1058, 189]; X0 > 6.0 → [71, 670].]
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Level 3:
[Tree: as above, and further:
[1058, 189] splits: X3 ≤ 1 → [367, 8]; X3 ≥ 2 → [691, 181];
[71, 670] splits: X0 ≤ 6.8 → [69, 459]; X0 > 6.8 → [2, 211].]
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Level 4:
[Tree: as above, and further:
[691, 181] splits: X0 ≤ 5.25 → [173, 10]; X0 > 5.25 → [518, 171];
[69, 459] splits: X4 ≤ 2 → [59, 351]; X4 ≥ 3 → [10, 108].]
Example: ICLR 2021 Decision
We continue to use Entropy for the remaining levels.
Levels 5 and 6:
[Tree: as above, and further:
[518, 171] splits: X0 ≤ 5.5 → [175, 18]; X0 > 5.5 → [343, 153];
[343, 153] splits: X3 ≤ 2 → [162, 44] (region 5.5 < X0 ≤ 6, X3 = 2);
X3 ≥ 3 → [181, 109] (region 5.5 < X0 ≤ 6, X3 ≥ 3);
[59, 351] splits: X0 ≤ 6.5 → [51, 270]; X0 > 6.5 → [8, 81];
[51, 270] splits: X0 ≤ 6.34 → [41, 166] (region 6 < X0 ≤ 6.34, X4 ≤ 2);
X0 > 6.34 → [10, 104] (region 6.34 < X0 ≤ 6.5, X4 ≤ 2).]
Example: ICLR 2021 Decision
• An Overview of The “Big Names”, and
Some Advice about Working on Data Analysis
• Basic Concepts in Data Analysis
via Premier League Example
• Decision Trees, via ICLR Reviews Example
• Pruning Decision Trees
Today’s Agenda
Pruning Decision Tree
A division (decision tree) is good if the residual sum of squares (RSS) of each region is
small, where the RSS of region Rj is Σ_{i∈Rj} (y_i − ȳ_{Rj})²,
and ȳ_{Rj} := (1/n_j) Σ_{i∈Rj} y_i is the mean of all labels in Rj; it is the predicted
label of region Rj. But there is one problem:
if we make a division such that there is only one point in each region...
We want a decision tree which strikes a balance between:
(i) the number of regions (= the number of leaves in the decision tree)
should be kept small;
(ii) the RSS/error rate in each region is reasonably small.
This motivates the notion of the penalized RSS, which captures the
tradeoff between the two quantities:
α|T| + Σ_{m=1}^{|T|} RSS(R_m),
where |T| is the number of leaves in the decision tree T, and
α ≥ 0 is a parameter we will look into.
Pruning Decision Tree
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m),
where α ≥ 0 is a parameter we will look into.
• If α = 0, we simply create a division such that each leaf contains one data
point. The penalized RSS is 0 (or small).
• As α grows, we desire |T | to be smaller.
• If α is crazily large, then we desire just one leaf (region), so no grouping of
data points will be done.
So we want to choose a “moderate” value of α.
Pruning Decision Tree
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m)
We want to choose a “moderate” value of α.
Problem 1:
We cannot use this function
in recursive binary splitting (which is top-down),
since we do not know the exact value of |T| when splitting.
Problem 2:
We do not know which choice of α is good.
Pruning Decision Tree
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m)
One Solution: Bottom-up!
1. Use recursive binary splitting to generate a large tree,
i.e., one (or very few) data points per leaf.
2. Prune the tree with a chosen α (we will discuss how
to choose α later). See the next slide.
Pruning Decision Tree
Definition of Pruning:
[Figure: a tree whose eight leaves hold the labels [4], [8.9], [7], [7.4], [3], [3.2], [3.5], [2.9].]
Pruning starts by selecting an internal node Z of the tree.
It then removes all nodes below Z. Z now corresponds to a region
containing all data points originally below Z: here Z becomes the leaf [4, 8.9, 7, 7.4].
Pruning Decision Tree
Definition of Subtree:
[Figure: the tree with leaves [4, 8.9, 7, 7.4], [3], [3.2], [3.5], [2.9], drawn in black.]
A subtree simply means any new tree which can be obtained via
one or more prunings. The black tree above is a subtree of the original tree.
[Figure: another subtree, with leaves [4, 8.9, 7, 7.4], [3, 3.2], [3.5], [2.9],
obtained via two prunings.]
Pruning Decision Tree
Definition of |T′|, where T′ is a subtree:
|T′| simply means the number of leaves in the subtree.
For the subtree with leaves [4, 8.9, 7, 7.4], [3], [3.2], [3.5], [2.9], |T′| = 5.
For the subtree with leaves [4, 8.9, 7, 7.4], [3, 3.2], [3.5], [2.9], |T′| = 4.
Pruning Decision Tree
Observations: After each pruning,
• |T′| decreases;
• the summation Σ_{m=1}^{|T′|} RSS(R_m) either remains the same or
increases. (This is NOT trivial. Challenging problem: prove this.)
penalized RSS = α|T′| + Σ_{m=1}^{|T′|} RSS(R_m)
As the trade-off is there, at each internal node we need
to determine whether to prune it or not.
Pruning Decision Tree
α = 0.6
[Figure: the full tree with leaves [4], [8.9], [7], [7.4], [3], [3.2], [3.5], [2.9];
consider the internal node above [3] and [3.2].]
Current: α · 2 + 0 = 1.2
Merged: α · 1 + 0.02 = 0.62
RSS([3, 3.2]) = (3 − 3.1)² + (3.2 − 3.1)² = 0.02
Since 0.62 < 1.2, this node is pruned.
Pruning Decision Tree
α = 0.6
[Figure: leaves now [4], [8.9], [7], [7.4], [3, 3.2], [3.5], [2.9];
consider the node above [3.5] and [2.9].]
Current: α · 2 + 0 = 1.2
Merged: α · 1 + 0.18 = 0.78
RSS([3.5, 2.9]) = (3.5 − 3.2)² + (2.9 − 3.2)² = 0.18
Since 0.78 < 1.2, this node is also pruned.
Pruning Decision Tree
α = 0.6
[Figure: leaves now [4], [8.9], [7, 7.4], [3, 3.2], [3.5, 2.9].]
Similarly, the node above [7] and [7.4] has also been pruned.
Pruning Decision Tree
α = 0.6
[Figure: consider the node above [4] and [8.9].]
Current: α · 2 + 0 = 1.2
Merged: α · 1 + 12.005 = 12.605
RSS([4, 8.9]) = (4 − 6.45)² + (8.9 − 6.45)² = 12.005
Since 12.605 > 1.2, this node is NOT pruned.
Pruning Decision Tree
α = 0.6
[Figure: consider the node above the leaves [3, 3.2] and [3.5, 2.9].]
Current: α · 2 + 0.02 + 0.18 = 1.4
Merged: α · 1 + 0.21 = 0.81
RSS([3, 3.2, 3.5, 2.9]) = (3 − 3.15)² + (3.2 − 3.15)² + (3.5 − 3.15)² + (2.9 − 3.15)² = 0.21
Since 0.81 < 1.4, this node is pruned.
Pruning Decision Tree
α = 0.6
[Figure: consider the node above [4], [8.9] and [7, 7.4].]
Current: α · 3 + 0.08 = 1.88
Merged: α · 1 + 12.6475 = 13.2475
RSS([4, 8.9, 7, 7.4]) = (4 − 6.825)² + (8.9 − 6.825)² + (7 − 6.825)² + (7.4 − 6.825)² = 12.6475
Since 13.2475 > 1.88, this node is NOT pruned.
Pruning Decision Tree
α = 0.6
[Figure: the decision tree after pruning with α = 0.6; its leaves are
[4], [8.9], [7, 7.4], [3, 3.2, 3.5, 2.9].]
Pruning Decision Tree
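The merge-vs-keep test used in this worked example can be sketched as follows (my own illustration, not the slides' code); it reproduces the decisions above.

def rss(labels):
    # Same helper as in the earlier splitting sketch.
    mean = sum(labels) / len(labels)
    return sum((v - mean) ** 2 for v in labels)

def should_prune(leaf_label_groups, alpha):
    # leaf_label_groups: one list of labels per current leaf under the node.
    current = alpha * len(leaf_label_groups) + sum(rss(g) for g in leaf_label_groups)
    merged_labels = [v for g in leaf_label_groups for v in g]
    merged = alpha * 1 + rss(merged_labels)
    return merged < current

print(should_prune([[3], [3.2]], 0.6))   # True  (0.62 < 1.2)
print(should_prune([[4], [8.9]], 0.6))   # False (12.605 > 1.2)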
penalized RSS = α|T| + Σ_{m=1}^{|T|} RSS(R_m)
1. Use recursive binary splitting to generate a large tree,
i.e., one (or very few) data points per leaf.
2. Use the penalized RSS to determine whether to prune a node or not.
Pruning Decision Tree
Questions:
• How to implement this algorithm? (We discuss this next.)
• What data structure should we use for an efficient algorithm?
(Left for you to think about.)
• Which α should we use? (k-fold cross validation.)
Pruning Decision Tree
How to implement this algorithm?
• We will compute all possible subtrees for the choices α = 0
to α = +∞.
• Imagine that we increase α from 0. Initially, the subtree for
α = 0 is simply the original tree. (Why?)
• Then we increase α. The subtree remains the same up to a
certain critical α value. How do we compute this critical α?
Pruning Decision Tree
– For each internal node v, compute the pRSS of
the tree under v, in terms of α. Note that the pRSS is of the form
c · α + d, where c ≥ 2.
– If v is pruned, then the new pRSS of the tree under v is
1 · α + RSS(all data points under v).
Pruning Decision Tree
[Figure: two lines in the (α, pRSS) plane: y = c · α + d (if v is kept) and
y = α + RSS(all data points under v) (if v is pruned). Since c ≥ 2, the lines
cross at the critical α for node v, namely α = (RSS(all data points under v) − d)/(c − 1).]
Pruning Decision Tree
The overall critical α value is
ᾱ := min_v {critical α value of node v},
where v runs over ALL internal nodes.
Pruning Decision Tree
How to implement this algorithm? (continued)
• Now we have a new subtree. Repeat the above argument, but start
from α = ᾱ instead of α = 0. This is repeated until we obtain a
subtree with just a single node.
Q1: Why don’t we restart from α = 0?  Q2: Why must this repetition end?
Pruning Decision Tree
Which α should we use? k-fold cross validation.
Cross validation is a general technique: divide the
training data into several parts, use one part for
“testing” and the other parts for “training”.
• k-fold Cross Validation:
– Divide the training data D into k parts, denoted by D1, D2, . . . , Dk.
– for i = 1, 2, . . . , k:
    use D − Di to obtain decision trees for various α’s
    use Di as test data to compute the test-MSE for various α’s
– for each α, compute the total of the k test-MSEs computed above
– select the α that yields the least total test-MSE
Pruning Decision Tree
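A minimal sketch of this procedure (my own illustration; fit_pruned_tree and test_mse are hypothetical helpers that would wrap the splitting and pruning steps described above):

def k_fold_select_alpha(data, alphas, k, fit_pruned_tree, test_mse):
    folds = [data[i::k] for i in range(k)]           # divide D into D_1..D_k
    totals = {alpha: 0.0 for alpha in alphas}
    for i in range(k):
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        for alpha in alphas:
            tree = fit_pruned_tree(train, alpha)         # train on D - D_i
            totals[alpha] += test_mse(tree, folds[i])    # test on D_i
    return min(totals, key=totals.get)               # least total test-MSE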