CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu


Page 1:

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

Page 2:

Input features: N features X1, X2, …, XN.
Each Xj has a domain Dj:
Categorical, e.g., Dj = {red, blue}
Numerical, e.g., Dj = (0, 10)
Y is the output variable with domain DY:
Categorical: classification
Numerical: regression
Task: given an input vector xi, predict yi.

[Figure: example decision tree with internal split nodes such as X1 < v1 and X2 ∈ {v2, v3}, and a leaf predicting Y = 0.42]

Page 3:

Decision trees:
Split the data at each internal node
Each leaf node makes a prediction
Lecture today:
Binary splits: Xj < v
Numerical attributes
Regression

[Figure: the same example tree with binary numerical splits X1 < v1, X2 < v2, X3 < v4, X2 < v5, and a leaf predicting Y = 0.42]

Page 4:

Input: an example xi. Output: predicted yi′.
"Drop" xi down the tree until it hits a leaf node, then predict the value stored in the leaf that xi hits (see the sketch after the figure below).

[Figure: an example xi is dropped down the tree A–I along the splits X1 < v1, X2 < v2, X3 < v4, X2 < v5 until it reaches the leaf predicting Y = 0.42]
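A minimal Python sketch of this prediction procedure; the Node layout and field names are illustrative, not PLANET's actual representation:

```python
# Illustrative node layout; a real implementation may differ.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index j of the split attribute Xj
        self.threshold = threshold  # split value v in "Xj < v"
        self.left = left            # subtree taken when Xj < v
        self.right = right          # subtree taken when Xj >= v
        self.value = value          # prediction stored at a leaf

def predict(node, x):
    """Drop x down the tree until a leaf; return the leaf's stored value."""
    while node.value is None:                # still at an internal node
        if x[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value

# Example: a stump on X1 predicting 0.42 when X1 < 5, else 0.8.
tree = Node(feature=0, threshold=5.0,
            left=Node(value=0.42), right=Node(value=0.8))
print(predict(tree, [3.7]))  # -> 0.42
```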

Page 5:

Training dataset D*, with |D*| = 100 examples.

[Figure: the tree annotated with the number of examples traversing each edge: 90 and 10 out of the root; 45 and 45 below the left child; 25, 20, 30, and 15 reaching the leaves F, G, H, I]

Page 6:

Imagine we are currently at some node G, and let DG be the data that reaches G. There is a decision we have to make: do we continue building the tree? If so, which variable and which value do we use for the split? If not, how do we make a prediction? (We need to build a "predictor node".)


Page 7:

Alternative view:

[Figure: the tree's splits on X1 and X2 partition the plane into axis-parallel rectangles, each containing mostly + or mostly − examples]

Page 8:

Requires at least a single pass over the data!

Page 9:

How to split? Pick the attribute & value that optimizes some criterion.

Classification: information gain: IG(Y|X) = H(Y) − H(Y|X)
Entropy: H(Z) = −Σj pj log pj, summing over j = 1 … m
Conditional entropy: H(W|Z) = Σj P(Z = vj) · H(W | Z = vj), summing over j = 1 … m
Here Z takes m values (v1 … vm), and H(W|Z=v) is the entropy of W among the records in which Z has value v. (A small sketch follows.)
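A small Python sketch of these quantities, assuming discrete labels and attribute values (the function names are ours, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Z) = -sum_j p_j log p_j over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """IG(Y|X) = H(Y) - H(Y|X), with H(Y|X) = sum_v P(X=v) H(Y|X=v)."""
    n = len(labels)
    by_value = {}
    for y, v in zip(labels, attr_values):
        by_value.setdefault(v, []).append(y)   # records with X = v
    cond = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - cond

# Example: X perfectly predicts Y, so IG equals H(Y) = 1 bit.
print(information_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # -> 1.0
```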


Page 10:

How to split? Pick the attribute & value that optimizes some criterion.

Regression: find the split (Xi, v) that creates D, DL, DR (parent, left, and right child datasets) and maximizes the variance reduction:
|D| · Var(D) − (|DL| · Var(DL) + |DR| · Var(DR))
For ordered domains, sort Xi and consider a split between each pair of adjacent values. For categorical Xi, find the best split based on subsets (Breiman's algorithm). A single-machine sketch follows.
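A sketch of this search for one numerical attribute, using the population-variance convention so that |D| · Var(D) is simply the sum of squared deviations (all names are illustrative):

```python
# Single-machine illustration of the variance-reduction criterion.
def weighted_var(ys):
    """Sum of squared deviations, i.e. |D| * Var(D) up to normalization."""
    n = len(ys)
    if n < 2:
        return 0.0
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys)

def best_numeric_split(xs, ys):
    """Return (value v, variance reduction) for the best split Xi < v."""
    pairs = sorted(zip(xs, ys))
    best_v, best_gain = None, 0.0
    parent = weighted_var([y for _, y in pairs])
    for k in range(1, len(pairs)):           # between each adjacent pair
        if pairs[k - 1][0] == pairs[k][0]:
            continue                         # no split between equal values
        v = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = weighted_var([y for _, y in pairs[:k]])
        right = weighted_var([y for _, y in pairs[k:]])
        gain = parent - (left + right)
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

print(best_numeric_split([1, 2, 8, 9], [0.1, 0.2, 0.9, 1.0]))  # -> (5.0, ~0.64)
```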


Page 11:

When to stop?
1) When the leaf is "pure", e.g., Var(yi) < ε
2) When the number of examples in the leaf is too small, e.g., |D| ≤ 10
How to predict? Build a predictor node:
Regression: average yi of the examples in the leaf
Classification: most common yi in the leaf
(Sketched below.)
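A short sketch of these stopping and prediction rules; the thresholds mirror the slide's examples and the function names are ours:

```python
from collections import Counter

def should_stop(ys, eps=1e-3, min_size=10):
    """Stop when the leaf is (nearly) pure or has too few examples."""
    if len(ys) <= min_size:
        return True
    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)
    return var < eps

def leaf_prediction(ys, task='regression'):
    """Regression: average of yi; classification: most common yi."""
    if task == 'regression':
        return sum(ys) / len(ys)
    return Counter(ys).most_common(1)[0][0]
```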



Page 13:

Given a large dataset with hundreds of attributes, build a decision tree!

General considerations:
The tree is small (we can keep it in memory): shallow (~10 levels)
The dataset is too large to keep in memory
The dataset is too big to scan over on a single machine
MapReduce to the rescue!

[Figure: many FindBestSplit tasks running in parallel]


Page 15:

PLANET: Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB '09]: a sequence of MapReduce jobs that builds a decision tree.

Setting:
Hundreds of numerical (discrete & continuous) attributes
Target (class) is numerical: regression
Splits are binary: Xj < v
The decision tree is small enough for each mapper to keep in memory
The data is too large to keep in memory

Page 16:

[Figure: PLANET architecture: the Master coordinates FindBestSplit and InMemoryGrow MapReduce jobs, which read the input data, the model, and the attribute metadata and write intermediate results]

Page 17:

Each mapper loads the model and info about which attribute splits to consider. Each mapper sees a subset of the data D*. The mapper "drops" each datapoint down the tree to find the appropriate leaf node L. For each leaf node L it keeps statistics about 1) the data reaching L and 2) the data in the left/right subtree under each split S. The reducer aggregates the statistics (1) and (2) and determines the best split for each node.

Page 18:

Master: monitors everything (runs multiple MapReduce jobs).

MapReduce Initialization: for each attribute, identify the values to be considered for splits.
MapReduce FindBestSplit: a MapReduce job to find the best split when there is too much data to fit in memory. (This is the hardest part.)
MapReduce InMemoryBuild: similar to FindBestSplit, but for small data; grows an entire subtree once the data fits in memory.
Model file: a file describing the state of the model.

Page 19:

The Initialization task identifies all the attribute values that need to be considered for splits.

Splits for numerical attributes: we would like to consider every possible value v ∈ D. Instead, compute an approximate equi-depth histogram on D* (idea: select buckets such that the counts per bucket are equal) and use the boundary points of the histogram as potential splits Xj < v.
This generates the "attribute metadata" to be loaded in memory by other tasks.

[Figure: equi-depth histogram over domain values 1–20; bucket boundaries become candidate split points Xj < v]

Page 20:

Goal: an equal number of elements per bucket (B buckets total).
Exact construction: sort, then take B−1 equally spaced splits.
Faster construction: sample, and take equally spaced splits in the sample; this yields nearly equal buckets (a sketch follows).
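A sketch of the faster, sample-based construction; the sample size and function name are assumptions:

```python
import random

def approx_equidepth_boundaries(values, num_buckets, sample_size=10000):
    """Sample, sort the sample, and take B-1 equally spaced split points.

    Illustrative only; PLANET's actual initialization job differs in detail.
    The returned boundaries serve as the candidate split values."""
    sample = random.sample(values, min(sample_size, len(values)))
    sample.sort()
    step = len(sample) / num_buckets
    return [sample[int(i * step)] for i in range(1, num_buckets)]

# The slide's example data, bucketed into 4 nearly equal buckets.
values = [1, 2, 2, 3, 4, 7, 8, 9, 10, 10, 10, 10,
          11, 11, 12, 12, 14, 16, 16, 18, 19, 20, 20, 20]
print(approx_equidepth_boundaries(values, num_buckets=4, sample_size=24))
```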

[Figure: example with sorted values 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 16 18 19 20 20 20 over the domain 1–20, bucketed so each bucket holds a nearly equal count]

Page 21:

The Master controls the entire process. It determines the state of the tree and grows it:
Decides if nodes should be split
If little data enters a node, runs an InMemoryBuild MapReduce job to grow the entire subtree
For larger nodes, launches MapReduce FindBestSplit to find candidates for the best split
Collects results from the MapReduce jobs and chooses the best split for each node
Updates the model


Page 22:

The Master keeps two node queues:
MapReduceQueue (MRQ): nodes for which D is too large to fit in memory
InMemoryQueue (InMemQ): nodes for which the data D fits in memory
The tree is built in levels, epoch by epoch (see the loop sketched below).
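A sketch of the controller's epoch loop over the two queues; all four arguments are hypothetical stand-ins, not PLANET's actual API:

```python
from collections import deque

def master_loop(root, fits_in_memory, find_best_split, in_memory_build):
    """Epoch-by-epoch controller loop; the two job launchers are injected.

    find_best_split(nodes) returns {node: (left_child, right_child) or None};
    in_memory_build(nodes) grows entire subtrees for small-data nodes.
    Both stand in for launching the corresponding MapReduce jobs."""
    mrq = deque([root])   # MapReduceQueue: node data too large for memory
    in_mem_q = deque()    # InMemoryQueue: node data fits in memory
    while mrq or in_mem_q:
        if in_mem_q:                                # grow whole subtrees
            in_memory_build(list(in_mem_q))
            in_mem_q.clear()
        if mrq:                                     # one FindBestSplit pass
            batch = list(mrq)
            mrq.clear()
            for node, children in find_best_split(batch).items():
                if children is None:                # node became a leaf
                    continue
                for child in children:
                    (in_mem_q if fits_in_memory(child) else mrq).append(child)
```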


Page 23:

Two MapReduce jobs:
FindBestSplit: processes nodes from the MRQ; for a given set of nodes S, computes a candidate good split predicate for each node in S.
InMemoryBuild: processes nodes from the InMemQ; for a given set of nodes S, completes tree induction at the nodes in S using the in-memory algorithm.
Start by executing FindBestSplit on the full data D*.


Page 24:

FindBestSplit: a MapReduce job to find the best split when there is too much data to fit in memory.

Goal: for a particular node, find the attribute Xj and value v that maximize the variance reduction, where:
D … training data (xi, yi) reaching the node
DL … training data xi where xi,j < v
DR … training data xi where xi,j ≥ v
Var(D) = 1/(n−1) · [Σi yi² − (Σi yi)²/n]
Note: Var(D) can be computed from the sufficient statistics Σyi, Σyi², and n (sketch below).
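A sketch showing why {Σy, Σy², Σ1} suffice: the variance can be computed, and per-shard statistics merged, without revisiting the data (names ours):

```python
def var_from_stats(sum_y, sum_y2, n):
    """Var(D) = (Σy² - (Σy)²/n) / (n - 1), from sufficient statistics only."""
    if n < 2:
        return 0.0
    return (sum_y2 - sum_y * sum_y / n) / (n - 1)

def merge(a, b):
    """Sufficient statistics from two data shards simply add up."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

ys = [0.2, 0.4, 0.6]
stats = (sum(ys), sum(y * y for y in ys), len(ys))
print(var_from_stats(*stats))  # ~0.04, matching the direct computation
```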


Page 25:

Mapper: initialize by loading, from the Initialization task, the current model (to find which node each xi ends up in) and the attribute metadata (all split points for each attribute).
For each record, run the Map algorithm (next slide).
For each node, store statistics and at the end emit (to all reducers): <Node.Id, {Σy, Σy², Σ1}>
For each split, store statistics and at the end emit: <Split.Id, {Σy, Σy², Σ1}>, where Split.Id = (node, feature, split value).


Page 26:

FindBestSplit::Map
Requires: split node set S, model file M, training record (xi, yi)

  n = TraverseTree(M, xi)
  if n ∈ S:
    Update Tn ← yi                      // stores {Σy, Σy², Σ1} for each node
    for j = 1 … N:                      // N … number of features
      v = value of feature Xj of example xi
      for each split point s of feature Xj, s.t. s < v:
        Update Tn,j[s] ← yi             // stores {Σy, Σy², Σ1} for each (node, feature, split)

MapFinalize: Emit
  <Node.Id, {Σy, Σy², Σ1}>              // sufficient statistics (so we can later
  <Split.Id, {Σy, Σy², Σ1}>             // compute the variance reduction)
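A single-process Python rendering of this Map logic, with the loaded model and attribute metadata abstracted behind injected functions (all names are assumptions, not PLANET's API):

```python
from collections import defaultdict

class FindBestSplitMapper:
    """Illustrative single-process stand-in for the FindBestSplit mapper.

    predict_node(x) should return the tree node x falls into, and
    splits(j) the candidate split points for feature Xj; both stand in
    for the loaded model and attribute metadata."""

    def __init__(self, predict_node, splits, split_node_set):
        self.predict_node = predict_node
        self.splits = splits
        self.split_node_set = split_node_set
        self.node_stats = defaultdict(lambda: [0.0, 0.0, 0])
        self.split_stats = defaultdict(lambda: [0.0, 0.0, 0])

    def map(self, x, y):
        n = self.predict_node(x)               # drop x down the tree
        if n not in self.split_node_set:
            return
        self._update(self.node_stats[n], y)
        for j, v in enumerate(x):              # every feature of this record
            for s in self.splits(j):           # candidate splits for Xj
                if s < v:                      # same condition as the slide
                    self._update(self.split_stats[(n, j, s)], y)

    @staticmethod
    def _update(t, y):
        t[0] += y; t[1] += y * y; t[2] += 1    # Σy, Σy², Σ1

    def finalize(self):
        """Emit the <Node.Id, stats> and <Split.Id, stats> pairs."""
        yield from self.node_stats.items()
        yield from self.split_stats.items()
```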


Page 27:

Reducer:
1) Load all the <Node.Id, List{Σy, Σy², Σ1}> pairs and aggregate the per-node statistics.
2) For all the <Split.Id, List{Σy, Σy², Σ1}> pairs, aggregate and run the Reduce algorithm. For each Node.Id, output the best split found:

Reduce(Split_Id, values):
  split = NewSplit(Split_Id)
  best = BestSplitSoFar(split.node.id)
  for stats in values:
    split.stats.AddStats(stats)
  left = GetImpurity(split.stats)
  right = GetImpurity(split.node.stats - split.stats)
  split.impurity = left + right
  if split.impurity < best.impurity:
    UpdateBestSplit(Split.Node.Id, split)


Page 28:

The Master collects the outputs from the FindBestSplit reducers: <Split.Node.Id, feature, value, impurity>. For each node it decides on the best split. If the data in DL/DR is small enough, it puts the child nodes in the InMemoryQueue, to later run InMemoryBuild on them; otherwise it puts the nodes into the MapReduceQueue.


Page 29:

InMemoryGrow task: grow an entire subtree once the data fits in memory.
Mapper: initialize by loading the current model file; for each record, identify the node it falls under, and if that node is to be grown, output <Node.Id, Record>.
Reducer: initialize by loading the attribute file from the Initialization task; for each <Node.Id, List{Record}>, run the basic tree-growing algorithm on the records; output the best splits for each node in the subtree.


Page 30:

Example: we need to split nodes F, G, H, I, which receive datasets D1, D2, D3, D4. D1 and D4 are small, so run InMemoryGrow; D2 and D3 are too big, so run FindBestSplit({G, H}):
FindBestSplit::Map (each mapper): load the current model M; drop every example xi down the tree; if it hits G or H, update the in-memory hash tables:
For each node: Tn: (node) → {Σy, Σy², Σ1}
For each (node, split): Tn,j,s: (node, attribute, split_value) → {Σy, Σy², Σ1}
Map::Finalize: output the key-value pairs from the above hash tables.
FindBestSplit::Reduce (each reducer): collect:
T1: <node, List{Σy, Σy², Σ1}> → <node, {ΣΣy, ΣΣy², ΣΣ1}>
T2: <(node, attr, split), List{Σy, Σy², Σ1}> → <(node, attr, split), {ΣΣy, ΣΣy², ΣΣ1}>
Compute the impurity for each node using T1 and T2, and return the best split to the Master (which decides on the globally best split).

[Figure: tree with leaves F, G, H, I receiving datasets D1, D2, D3, D4]

Page 31:

We need one pass over the data to construct one level of the tree!

Set-up and tear-down: per-MapReduce overhead is significant, since starting and ending a MapReduce job costs time.
Reduce tear-down cost by polling for output instead of waiting for a task to return.
Reduce start-up cost through forward scheduling: maintain a set of live MapReduce jobs and assign them tasks instead of starting new jobs from scratch.


Page 32:

Very high-dimensional data: if the number of splits is too large, a mapper might run out of memory. Instead of defining split tasks as a set of nodes to grow, define them as a set of nodes to grow and a set of attributes to explore. This way each mapper explores a smaller number of splits (and needs less memory).


Page 33:

Ensembles: learn multiple trees and combine their predictions; this gives better performance in practice.
Bagging: learn multiple trees over independent samples of the training data; the predictions from each tree are averaged to compute the final model prediction.


Page 34:

Model construction for bagging in PLANET:
When tree induction begins at the root, the root nodes of all trees in the bagged model are pushed onto the MRQ, so the queues contain nodes belonging to many different trees instead of a single tree. The controller does tree induction over dataset samples.
How to create random samples of D*? Compute a hash of a training record's id and the tree id, and use the records that hash into a particular range to learn that tree. This way the same sample is used for all nodes in a tree (a sketch follows).
Note: this is sampling D* without replacement, whereas bagging samples of D* should be created with replacement.
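A sketch of the hash-based sampling described above; the hash function choice and the sampling rate are illustrative assumptions:

```python
import hashlib

def in_sample(record_id, tree_id, rate=0.63):
    """Deterministically decide if a record belongs to a tree's sample.

    Hashing (record_id, tree_id) gives every node of a tree the same
    sample, with no per-record state. As the slide notes, this samples
    without replacement, unlike a true bootstrap; the rate here is
    only an illustrative choice."""
    h = hashlib.md5(f"{record_id}:{tree_id}".encode()).hexdigest()
    return int(h, 16) % 1000 < rate * 1000

# Example: the subset of record ids 0..19 used to learn tree 3.
sample_for_tree_3 = [r for r in range(20) if in_sample(r, tree_id=3)]
print(sample_for_tree_3)
```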

Page 35:

SVM:
Classification
Real-valued features (no categorical ones)
Tens to hundreds of thousands of features
Very sparse features
Simple decision boundary
No issues with overfitting
Example applications: text classification, spam detection, computer vision

Decision trees:
Classification
Real-valued and categorical features
Few (hundreds of) features
Usually dense features
Complicated decision boundaries
Overfitting!
Example applications: user profile classification, landing-page bounce prediction

Page 36:

Google: the bounce rate of an ad is the fraction of users who bounced from the ad landing page, i.e., clicked on the ad and quickly moved on to other tasks. A high bounce rate means users were not satisfied.
Prediction goal: given a new ad and a query, predict the bounce rate using query/ad features.
Feature sources: the query, the ad keyword, the ad creative, and the ad landing page.


Page 37:

MapReduce cluster: 200 machines, 768MB RAM and 1GB disk per machine; 3 MapReduce jobs forward-scheduled.
Full dataset: 314 million records; 6 categorical features with cardinality varying from 2 to 500; 4 numeric features.
Comparison: PLANET on the whole data vs. R on sampled data. The R model trains on 10 million records (~2GB) on a single machine with 8GB RAM; 10 trees, each of depth 1–10; peak RAM utilization: 6GB.



Page 39:

The prediction accuracy (RMSE) of PLANET on the full data is better than that of R on sampled data.


Page 40:

B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. VLDB 2009.

J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. CIKM 2009.
