Incremental learning of decision trees from time-changing data streams


Blaž Sovdat, [email protected]
Artificial Intelligence Laboratory, Jožef Stefan Institute

October 15, 2013

Talk outline

1. Introduction: Motivation; Classical decision tree learning
2. Incremental decision tree learning: Incremental classification tree learning
3. Evaluation: Assessing learning performance; Learning algorithm comparison
4. Results: Data description; Results; Prequential fading error estimation

Motivation

In certain scenarios data arrive continuously and are unbounded (data streams): sensor networks, search queries, road traffic, network traffic.
We have no control over the order and speed of arrival.
Because of the limited working memory we may view each example only once.
The source distribution may change over time (concept drift).
Classical (batch) decision tree learning methods fail in this setting.

[Figure: the decision tree learned on the Titanic data, with a root test on sex followed by tests on status and then age; its step-by-step construction is shown in the Simple example below.]

Classical decision tree learning

The following ID3 learner is due to [Quinlan, 1986]. Let S be a set of training examples. Find the attribute A* that alone best classifies the examples from S:

Define a heuristic measure, say information gain,

  G(A, S) := H(S) - \sum_{i=1}^{d} \frac{|S_i|}{|S|} H(S_i),

and then pick the best attribute,

  A^\star = \arg\max_{A \in \mathcal{A}} G(A, S).

Partition S into S_i := \{x \in S : A^\star(x) = a_i\} for all values a_i of A*, and create a leaf node for each partition. Recursively apply the procedure to the examples S_i at the child nodes.
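A minimal Python sketch of this split-selection step (not from the talk; the helper names and toy records are illustrative): compute G(A, S) for every attribute and take the arg max.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, label="label"):
    """G(A, S) = H(S) - sum_i |S_i|/|S| * H(S_i) for a discrete attribute."""
    partitions = defaultdict(list)
    for ex in examples:
        partitions[ex[attr]].append(ex[label])
    remainder = sum(len(p) / len(examples) * entropy(p) for p in partitions.values())
    return entropy([ex[label] for ex in examples]) - remainder

# usage on a few made-up Titanic-style records:
S = [{"status": "first", "age": "adult", "sex": "female", "label": "yes"},
     {"status": "third", "age": "adult", "sex": "male",   "label": "no"},
     {"status": "third", "age": "child", "sex": "female", "label": "yes"},
     {"status": "crew",  "age": "adult", "sex": "male",   "label": "no"}]
best = max(["status", "age", "sex"], key=lambda a: information_gain(S, a))  # here: "sex"
```

ID3 would then partition S by the chosen attribute and recurse on each partition.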

Simple example

Example on the Titanic dataset: a list of all Titanic passengers. Each passenger is represented as a (status, age, sex) vector, labeled either yes (survived) or no (died).

Attribute description:
- status: first, second, third, or crew
- age: adult, child
- sex: male, female

Learn to predict whether an unlabeled example x survived or died.

[Figures: step-by-step construction of the decision tree on the Titanic data, starting from a single leaf predicting no, then splitting the root on sex, then splitting each branch on status, and finally on age, yielding the tree shown earlier.]

Incremental decision tree learning

In the data stream setting we only have a small subset of the examples available. Using the Hoeffding inequality we can find the truly best attribute from a sample with high probability. Suppose A_1 and A_2 are the attributes with the highest estimates G(A_1) and G(A_2). If G(A_1) - G(A_2) > \varepsilon, then A_1 is truly the best with probability at least 1 - \delta, for 1 - \delta \in (0, 1) and

  \varepsilon = \sqrt{\frac{R^2 \log(1/\delta)}{2n}}

This is the main idea behind the VFDT learner [Domingos and Hulten, 2000].
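As a tiny illustration of the bound (the function name and the concrete values of delta and n below are made-up choices, not values from the talk): for a C-class problem the information gain lies in [0, log2 C], so R = log2 C, as noted with the pseudocode that follows.

```python
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * log(1/delta) / (2n)) for n observations of a statistic with range R."""
    return math.sqrt(R ** 2 * math.log(1.0 / delta) / (2.0 * n))

# e.g. a 5-class problem (R = log2 5), delta = 1e-7, and n = 200 examples seen at a leaf:
eps = hoeffding_epsilon(math.log2(5), 1e-7, 200)
# split on A1 if G(A1) - G(A2) > eps
```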

VFDT algorithm

1: Let HT be the root node
2: for x ∈ S do
3:   Sort x down the tree to the leaf ℓ and update its sufficient statistic
4:   if n_ℓ mod n_m = 0 and the examples seen at ℓ have nonzero entropy then
5:     Let X_a and X_b be the attributes with the highest estimates G_ℓ(X_i)
6:     Compute ε := sqrt(R^2 log(1/δ) / (2 n_ℓ))   {Here, R = log2 C}
7:     if G_ℓ(X_a) − G_ℓ(X_b) > ε or G_ℓ(X_a) − G_ℓ(X_b) ≤ ε < τ then
8:       Turn leaf ℓ into a node that tests on X_a
9:       for each value of X_a do
10:        Add a leaf and initialize its sufficient statistic
11:      end for
12:    end if
13:  end if
14: end for
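A minimal Python sketch of the leaf-level logic in lines 3-8 of the pseudocode, for discrete attributes only. The class name, default parameter values, and data layout are illustrative assumptions, not the authors' implementation; growing the tree (lines 8-11) and routing examples are left out.

```python
import math
from collections import defaultdict

def entropy(class_counts):
    """Entropy (bits) of a {class: count} dictionary."""
    n = sum(class_counts.values())
    return -sum(c / n * math.log2(c / n) for c in class_counts.values() if c) if n else 0.0

class HoeffdingLeaf:
    """Sufficient statistics kept at one leaf, plus the split test of lines 4-8 above."""

    def __init__(self, attributes, n_classes, delta=1e-7, tau=0.05, n_min=200):
        self.attributes, self.delta, self.tau, self.n_min = attributes, delta, tau, n_min
        self.R = math.log2(n_classes)          # information gain lies in [0, log2 C]
        self.n = 0
        self.class_counts = defaultdict(int)                       # y -> count
        self.value_counts = defaultdict(lambda: defaultdict(int))  # (attr, value) -> {y: count}

    def update(self, x, y):
        """Line 3: update the leaf's sufficient statistic with example (x, y)."""
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attributes:
            self.value_counts[(a, x[a])][y] += 1

    def gain(self, attr):
        """Information gain of attr, estimated from the counts seen so far."""
        remainder = sum(sum(cs.values()) / self.n * entropy(cs)
                        for (a, _), cs in self.value_counts.items() if a == attr)
        return entropy(self.class_counts) - remainder

    def try_split(self):
        """Lines 4-8: return the attribute to split on, or None."""
        if self.n == 0 or self.n % self.n_min != 0 or entropy(self.class_counts) == 0:
            return None
        gains = sorted(((self.gain(a), a) for a in self.attributes), reverse=True)
        g_a, best = gains[0]
        g_b = gains[1][0] if len(gains) > 1 else 0.0
        eps = math.sqrt(self.R ** 2 * math.log(1 / self.delta) / (2 * self.n))
        if g_a - g_b > eps or eps < self.tau:   # confident winner, or near-tie broken by tau
            return best
        return None
```

A driver loop would route each example x to its leaf, call update, then try_split, and turn the leaf into an internal node whenever an attribute is returned.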

VFDT algorithm (remarks)

The algorithm does not adapt to changes.
Numeric attributes are handled with online discretization.
A τ parameter is introduced to resolve cases when two attributes are almost equally good.
The gains G(A_i) are recomputed only periodically (typically n_m = 200).
With high probability, a VFDT-induced tree uses the same sequence of tests as the (hypothetical) batch-induced tree to classify a randomly chosen example [Domingos and Hulten, 2000].

Big picture of the CVFDT algorithm

[Figure: schematic of a CVFDT tree: the root node, an internal node T with attribute values a_1 ... a_d, a node T' with its subtrees and the alternate trees grown at T', and a sliding window W over the stream spanning old to new examples.]

Assessing learning performance

Roughly, we distinguish two approaches [Gama et al., 2013]:

Holdout error estimation. Periodically (with a period of, say, 20 000 examples) sacrifice m := 2 000 examples and use them to estimate classification accuracy:

  H_m := \frac{1}{m} \sum_{i=k}^{k+m} L(y_i, \hat{y}_i)

Prequential error estimation (also known as "test-then-train"). Let \alpha \in (0, 1] be a fading factor and let A be a classifier. Define the estimated prequential error P^\alpha(i):

  S_A^\alpha(i) := L(y_i, \hat{y}_i) + \alpha L(y_{i-1}, \hat{y}_{i-1}) + \ldots + \alpha^{i-1} L(y_1, \hat{y}_1),
  N^\alpha(i) := 1 + \alpha + \ldots + \alpha^{i-1},
  P^\alpha(i) := S_A^\alpha(i) / N^\alpha(i).
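A minimal sketch of the fading prequential estimate, using the recursive form S_α(i) = L_i + α S_α(i-1) and N_α(i) = 1 + α N_α(i-1); the function name and example losses are illustrative.

```python
def prequential_fading_error(losses, alpha=0.995):
    """Prequential (test-then-train) error with fading factor alpha in (0, 1]."""
    s = n = 0.0
    history = []
    for loss in losses:        # loss = L(y_i, y_hat_i), e.g. 0/1 loss, computed before training on i
        s = loss + alpha * s   # S_alpha(i)
        n = 1.0 + alpha * n    # N_alpha(i)
        history.append(s / n)  # P_alpha(i)
    return history

# usage: prequential_fading_error([0, 1, 0, 0, 1], alpha=0.99)
```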

Comparing learning algorithms

Let A and B be learners and let S_A and S_B be aligned error sequences. Define

  Q_i^\alpha(A, B) := \log\left( S_A^\alpha(i) / S_B^\alpha(i) \right)

Interpretation of the Q-statistic:
- Q_i^\alpha(A, B) < 0 means that A is better than B,
- Q_i^\alpha(A, B) > 0 means that B is better than A,
- Q_i^\alpha(A, B) = 0 means that A and B perform equally well.

Here |Q_i^\alpha(A, B)| is the strength of the difference, i.e., how much better one learner is than the other. The Wilcoxon test tests the null hypothesis that the vector of Q-statistics comes from a zero-median distribution. For all tests we took the significance level α := 0.0001.
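A sketch of this comparison, assuming NumPy and SciPy are available; the error values below are made up.

```python
import numpy as np
from scipy.stats import wilcoxon

def q_statistics(s_a, s_b):
    """Q_i(A, B) = log(S_A(i) / S_B(i)) for aligned fading error sums."""
    return np.log(np.asarray(s_a, dtype=float) / np.asarray(s_b, dtype=float))

q = q_statistics([0.30, 0.28, 0.27, 0.26, 0.25, 0.25],
                 [0.35, 0.34, 0.33, 0.32, 0.31, 0.30])
stat, p = wilcoxon(q)   # H0: the Q-statistics come from a zero-median distribution
print(p < 0.0001)       # reject H0 at the significance level used in the talk?
```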

Data description

We evaluated the VFDT and CVFDT learners on electricity-demand data for New York state. We discretize the target attribute load to get a 5-class classification problem. Other attributes:
- the numeric attributes hourOfDay, dayOfWeek, and month, computed from the date;
- name of area (name), an 11-valued discrete attribute, and PTID, a numeric attribute.

We took data for the last 10 years and tried to predict the demand for the next measurement. Together this is around 13 878 974 records (about 1.3 GB of uncompressed data).

Load zones

[Figure: map of the New York Control Area load zones: A (WEST), B (GENESE), C (CENTRL), D (NORTH), E (MHK VL), F (CAPITL), G (HUD VL), H (MILLWD), I (DUNWOD), J (N.Y.C.), K (LONGIL). Taken from NYISO (http://www.nyiso.com/public/index.jsp).]

[Figures: electricity demand for a single area over one month, over one year, and over the whole period (x-axis: measurement index, y-axis: demand at that moment), and a histogram of the target variable distribution (x-axis: demand, y-axis: frequency).]

Results

Method             Learner A / Learner B    Median (µ1/2)   p-value
Holdout estimate   VFDT-MAJ / CVFDT-MAJ     −0.4285         p < 0.0001
Holdout estimate   VFDT-NB / CVFDT-NB        0               p = 0.6538
Holdout estimate   CVFDT-MAJ / CVFDT-NB      0.4410          p < 0.0001
Fading factors     VFDT-MAJ / CVFDT-MAJ     −0.377          p < 0.0001
Fading factors     VFDT-NB / CVFDT-NB        0.0297          p = 0.1424
Fading factors     CVFDT-MAJ / CVFDT-NB      0.3819          p < 0.0001

Table: Results of the Wilcoxon test of the hypothesis that the median of the Q-statistics is zero.

[Figures: for each pair (CVFDT-MAJ versus CVFDT-NB, VFDT-NB versus CVFDT-NB, and VFDT-MAJ versus CVFDT-MAJ), a plot of the fading error of both learners and a plot of the Q-statistic, against the index of the training example in the stream.]

The End

Thank you for your attention!

Appendix

Hoeffding's inequality

Theorem ([Hoeffding, 1963]). Let S := X_1 + X_2 + \ldots + X_n be a sum of independent bounded random variables, a_i \le X_i \le b_i, and let \varepsilon > 0 be a positive real number. Then

  P(S - E[S] \ge n\varepsilon) \le \exp\left( -2 n^2 \varepsilon^2 \Big/ \sum_{i=1}^{n} (b_i - a_i)^2 \right).   (1)

Corollary. Let S := X_1 + X_2 + \ldots + X_n be a sum of independent bounded random variables, a \le X_i \le b, and let \varepsilon > 0 be a positive real number. For R := b - a we have

  P(S - E[S] \ge n\varepsilon) \le \exp\left( -2 n \varepsilon^2 / R^2 \right).   (2)
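The step connecting (2) to the ε used in the VFDT split test, not spelled out on the slides: set the right-hand side of (2) equal to δ and solve for ε,

  \delta = \exp\left(-\frac{2 n \varepsilon^2}{R^2}\right)
  \quad\Longleftrightarrow\quad
  \varepsilon = \sqrt{\frac{R^2 \log(1/\delta)}{2n}} .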

Incremental decision tree learning

In the data stream setting we only have a small subset of the examples available. Using the Hoeffding inequality we can find the truly best attribute from a sample with high probability. Let a \le X \le b be a bounded random variable and let X_1, X_2, \ldots, X_n be its measurements. Let \hat{\mu} := (X_1 + X_2 + \ldots + X_n)/n be the sample mean and let \mu := E[X] be the true mean. Furthermore, let 1 - \delta \in (0, 1) be the desired confidence level. By Hoeffding's inequality, we have P(\hat{\mu} \ge \mu - \varepsilon) \ge 1 - \delta for

  \varepsilon = \sqrt{\frac{(b - a)^2 \log(1/\delta)}{2n}}

Incremental regression tree learning

What about regression?

Done at JSI by Elena Ikonomovska [Ikonomovska, 2012]. Regression trees predict a real number instead of a class. Define the standard deviation reduction:

  \operatorname{sdr}(A, S) := \sigma(S) - \sum_{i=1}^{d} \frac{|S_i|}{|S|} \sigma(S_i),

where S_i := \{x \in S : A(x) = a_i\} and \sigma(S) denotes the standard deviation. Pick the attribute that maximizes the SDR:

  A^\star := \arg\max_{A \in \mathcal{A}} \operatorname{sdr}(A, S)

Again, using Hoeffding's inequality, we can find the best attribute with high probability.
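A small sketch of the SDR computation for discrete attributes; the data layout and names are illustrative.

```python
import statistics
from collections import defaultdict

def sdr(examples, attr, target="y"):
    """Standard deviation reduction of splitting `examples` (a list of dicts) on `attr`."""
    values = [ex[target] for ex in examples]
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[attr]].append(ex[target])
    remainder = sum(len(g) / len(values) * statistics.pstdev(g) for g in groups.values())
    return statistics.pstdev(values) - remainder

# pick the attribute with the largest SDR on a few made-up records
data = [{"age": "adult", "sex": "m", "y": 3.0}, {"age": "child", "sex": "m", "y": 1.0},
        {"age": "adult", "sex": "f", "y": 2.5}, {"age": "child", "sex": "f", "y": 1.2}]
best = max(["age", "sex"], key=lambda a: sdr(data, a))
```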

Let A and B be the best and the second-best attributes, respectively. Then r := sdr(B)/sdr(A) is a random variable with r \in [0, 1]. Let r_1, r_2, \ldots, r_n be such ratios for the last n examples. Now pick 1 - \delta \in (0, 1) and let

  \varepsilon = \sqrt{\frac{\log(1/\delta)}{2n}}

By Hoeffding's inequality we have P(r \in [\bar{r} - \varepsilon, \bar{r} + \varepsilon]) \ge 1 - \delta for \bar{r} = (r_1 + r_2 + \ldots + r_n)/n.

Now we can derive a split criterion. Let S_A and S_B be the deviation reductions after testing on A and B, respectively. If S_B / S_A < 1 - \varepsilon, then A is truly the best attribute with probability at least 1 - \delta (see [Ikonomovska, 2012]). When predicting the target variable, sort the example down the tree and return the average of the examples at the given leaf.
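A minimal sketch of this split check; the function name and the concrete numbers are illustrative.

```python
import math

def regression_split_ok(sdr_best, sdr_second, n, delta=1e-7):
    """Is the best attribute truly best with probability at least 1 - delta?"""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))  # the ratio lies in [0, 1], so R = 1
    return sdr_second / sdr_best < 1.0 - eps

# e.g. deviation reductions observed over n = 500 examples at a leaf:
print(regression_split_ok(sdr_best=4.2, sdr_second=2.9, n=500))
```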

References

Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '00, pages 71-80, New York, NY, USA, 2000. ACM. ISBN 1-58113-233-6. doi: 10.1145/347090.347107. URL http://doi.acm.org/10.1145/347090.347107.

João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3):317-346, 2013.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Elena Ikonomovska. Algorithms for learning regression trees and ensembles from evolving data streams (in Slovenian). PhD thesis, Jožef Stefan International Postgraduate School, 2012.

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.