Incremental Learning of Decision Trees from Time-Changing Data Streams
TRANSCRIPT
Incremental decision tree learning from time-changing data streams
Blaz ([email protected])
Artificial Intelligence Laboratory, Jozef Stefan Institute
October 15, 2013
Talk outline
1 Introduction: Motivation; Classical decision tree learning
2 Incremental decision tree learning: Incremental classification tree learning
3 Evaluation: Assessing learning performance; Learning algorithm comparison
4 Results: Data description; Results; Prequential fading error estimation
Motivation
- In certain scenarios data arrive continuously and are unbounded (data streams)
  - Sensor networks, search queries, road traffic, network traffic
- No control over the order and speed of arrival
- Because of the limited working memory we may view each example only once
- The source distribution may change over time (concept drift)
- Classical (batch) decision tree learning methods fail in this setting
Classical decision tree learning
[Figure: example decision tree for the Titanic dataset; the root tests sex, the branches test status, and the remaining impure branches test age, with yes/no leaves predicting whether a passenger survived.]
Classical decision tree learning
The following ID3 learner is due to [Quinlan, 1986]
- Let S be a set of training examples
- Find the attribute A* that alone best classifies the examples from S:
  - Define a heuristic measure, say information gain:
    G(A, S) := H(S) - \sum_{i=1}^{d} (|S_i| / |S|) H(S_i)
  - Then pick the best attribute: A* := arg max_{A} G(A, S) (maximum over all attributes A)
- Partition S into S_i := {x ∈ S : A*(x) = a_i} for all values a_i of A*, and create a leaf node for each partition
- Recursively apply the procedure to the examples S_i at the children nodes
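To make the information-gain step concrete, here is a small Python sketch (not from the slides; the example encoding and function names are illustrative):

from collections import Counter
from math import log2

def entropy(labels):
    """H(S) for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """G(A, S) = H(S) - sum_i |S_i|/|S| * H(S_i) for a categorical attribute.
    `examples` is a list of (features_dict, label) pairs."""
    labels = [y for _, y in examples]
    partitions = {}
    for x, y in examples:
        partitions.setdefault(x[attribute], []).append(y)
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Tiny Titanic-style illustration (values made up):
data = [({"sex": "female", "age": "adult"}, "yes"),
        ({"sex": "male", "age": "adult"}, "no"),
        ({"sex": "female", "age": "child"}, "yes"),
        ({"sex": "male", "age": "child"}, "no")]
best = max(["sex", "age"], key=lambda a: information_gain(data, a))  # -> "sex"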
Simple example
Example on the Titanic dataset
- List of all Titanic passengers
- Each passenger is represented as a (status, age, sex) vector, labeled either yes (survived) or no (died)
- Attribute description:
  - status: first, second, third, or crew
  - age: adult, child
  - sex: male, female
- Learn to predict whether an unlabeled x survived or died
Simple example

[Figures: step-by-step construction of the decision tree on the Titanic data; the tree starts as a single leaf predicting no, the root is then split on sex, both branches are split on status, and the remaining impure branches are split on age, ending with the full tree shown earlier.]
Incremental classification tree learning
Incremental decision tree learning
In the data stream world we only have a small subset of the examples available
- Using the Hoeffding inequality we can find the truly best attribute from a sample with high probability
- Suppose A1 and A2 are the attributes with the highest estimates G(A1) and G(A2)
- If G(A1) - G(A2) > ε, then A1 is truly the best with probability at least 1 - δ, for 1 - δ ∈ (0, 1) and
  ε = \sqrt{R^2 \log(1/δ) / (2n)}
- This is the main idea behind the VFDT learner [Domingos and Hulten, 2000]
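A minimal Python sketch of this decision rule (illustrative, not from the slides); R is the range of the information gain, which for a C-class problem is log2(C), and the logarithm in the bound is taken as the natural log:

from math import log, log2, sqrt

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observations of a variable with range R."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))

def a1_is_truly_best(g1, g2, n, num_classes, delta=1e-7):
    """True if the observed gap G(A1) - G(A2) exceeds the Hoeffding bound,
    so A1 is the truly best attribute with probability at least 1 - delta."""
    eps = hoeffding_bound(log2(num_classes), delta, n)
    return (g1 - g2) > eps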
VFDT algorithm
1: Let HT be the root node
2: for x ∈ S do
3:   Sort x down the tree to the leaf ℓ and update its sufficient statistic
4:   if n_ℓ mod n_m = 0 and the examples seen at ℓ have nonzero entropy then
5:     Let X_a and X_b be the attributes with the highest estimates G_ℓ(X_i)
6:     Compute ε := \sqrt{R^2 \log(1/δ) / (2 n_ℓ)}   {Here, R = log2 C}
7:     if G(X_a) - G(X_b) > ε or G(X_a) - G(X_b) ≤ ε < τ then
8:       Turn leaf ℓ into a node that tests on X_a
9:       for each value of X_a do
10:        Add a leaf and initialize its sufficient statistic
11:      end for
12:    end if
13:  end if
14: end for
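A minimal Python sketch of the per-leaf bookkeeping and the split test in lines 4-13 (the class and function names are illustrative and not tied to any particular VFDT implementation; the gain estimates are assumed to be computed from the stored counts):

import math
from collections import defaultdict

class Leaf:
    """Sufficient statistics of a VFDT leaf: counts of (attribute, value, class)
    triples, which are enough to estimate every G_l(X_i)."""
    def __init__(self, attributes):
        self.attributes = attributes
        self.n = 0
        self.counts = defaultdict(int)          # (attribute, value, class) -> count

    def update(self, x, y):
        self.n += 1
        for a in self.attributes:
            self.counts[(a, x[a], y)] += 1

def try_split(leaf, gains, n_min=200, delta=1e-7, tau=0.05, value_range=1.0):
    """gains: attribute -> estimated information gain at this leaf.
    Returns the attribute to split on, or None if no split is warranted yet."""
    if leaf.n == 0 or leaf.n % n_min != 0:
        return None
    ranked = sorted(gains, key=gains.get, reverse=True)
    best, second = ranked[0], ranked[1]
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * leaf.n))
    gap = gains[best] - gains[second]
    if gap > eps or eps < tau:                  # clear winner, or near-tie broken by tau
        return best
    return None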
VFDT algorithm
- The algorithm does not adapt to changes
- Numeric attributes are handled with online discretization
- The τ parameter is introduced to resolve cases when two attributes are almost equally good
- G(A_i) is recomputed only periodically (typically n_m = 200)
- With high probability, the VFDT-induced tree uses the same sequence of tests as a (hypothetical) batch-induced tree to classify a randomly chosen example [Domingos and Hulten, 2000]
Big picture of the CVFDT algorithm
[Figure: big picture of CVFDT; a sliding window W over the stream separates old examples from new ones, the main tree grows from the root node, and at an internal node T' the algorithm keeps both the existing subtrees and alternate trees grown for that node.]
Assessing learning performance
Roughly, we distinguish two approaches [Gama et al., 2013]:
- Holdout error estimation
  - Periodically (with period, say, 20 000 examples) sacrifice m := 2 000 examples and use them to estimate the classification error:
    H_m := (1/m) \sum_{i=k}^{k+m} L(y_i, ŷ_i)
- Prequential error estimation (also known as "test-then-train")
  - Let α ∈ (0, 1] be a fading factor and let A be a classifier
  - Define the estimated prequential error P_α(i):
    S_A^α(i) := L(y_i, ŷ_i) + α L(y_{i-1}, ŷ_{i-1}) + ... + α^{i-1} L(y_1, ŷ_1),
    N^α(i) := 1 + α + ... + α^{i-1},
    P_α(i) := S_A^α(i) / N^α(i).
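Both estimators are easy to maintain online. Below is a small Python sketch (illustrative, not tied to any particular library) of the prequential fading error in its recursive form S_i = L_i + α S_{i-1}, N_i = 1 + α N_{i-1}; the `model` in the commented loop is a hypothetical incremental classifier:

class PrequentialError:
    """Fading prequential 0/1-error estimate P_alpha(i) = S_alpha(i) / N_alpha(i)."""
    def __init__(self, alpha=0.995):
        self.alpha = alpha
        self.s = 0.0    # faded sum of losses
        self.n = 0.0    # faded count of examples

    def update(self, y_true, y_pred):
        loss = 0.0 if y_true == y_pred else 1.0
        self.s = loss + self.alpha * self.s
        self.n = 1.0 + self.alpha * self.n
        return self.s / self.n

# Test-then-train loop:
# err = PrequentialError(alpha=0.995)
# for x, y in stream:
#     current_error = err.update(y, model.predict(x))   # test first
#     model.learn(x, y)                                  # then train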
Comparing learning algorithms
Let A and B be learners and let S_A and S_B be aligned error sequences
- Define Q_i^α(A, B) := log(S_A^α(i) / S_B^α(i))
- Interpretation of the Q-statistic:
  - Q_i^α(A, B) < 0 means that A is better than B,
  - Q_i^α(A, B) > 0 means that B is better than A,
  - Q_i^α(A, B) = 0 means A and B perform equally well.
- Here |Q_i^α(A, B)| is the strength of the difference, i.e. how much better one learner is than the other
- The Wilcoxon test tests the null hypothesis that the vector of Q-statistics comes from a zero-median distribution
- For all tests we took significance level α := 0.0001
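A sketch of how this comparison could be computed in Python (assuming NumPy and SciPy are available; S_A and S_B are the aligned faded loss sums produced by estimators like the one above):

import numpy as np
from scipy import stats

def q_statistics(s_a, s_b):
    """Q_i = log(S_A(i) / S_B(i)) for aligned faded loss sums of learners A and B."""
    return np.log(np.asarray(s_a, dtype=float) / np.asarray(s_b, dtype=float))

def compare_learners(s_a, s_b, significance=0.0001):
    """Wilcoxon signed-rank test of the null hypothesis that the Q-statistics
    come from a zero-median distribution."""
    q = q_statistics(s_a, s_b)
    statistic, p_value = stats.wilcoxon(q)
    return q, p_value, p_value < significance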
Data description
We evaluated the VFDT and CVFDT learners on electricity-demand data for New York state
- We discretized the target attribute load to get a 5-class classification problem
- Other attributes:
  - numeric attributes hourOfDay, dayOfWeek, and month, computed from date
  - name of area (name) is an 11-valued discrete attribute; PTID is a numeric attribute
- We took data for the last 10 years and tried to predict the demand for the next measurement
- Altogether around 13 878 974 records (about 1.3 GB of uncompressed data)
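As an illustration of how such a dataset could be prepared (a pandas sketch; the column names load, date, name, and PTID follow the slide, while the file name and the equal-frequency discretization into five classes are assumptions, since the slides do not say how the binning was done):

import pandas as pd

# Hypothetical input file with columns: date, name, PTID, load
df = pd.read_csv("nyiso_load.csv", parse_dates=["date"])

# Numeric time attributes derived from the timestamp
df["hourOfDay"] = df["date"].dt.hour
df["dayOfWeek"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month

# Turn the numeric target `load` into a 5-class problem (equal-frequency bins)
df["load_class"] = pd.qcut(df["load"], q=5,
                           labels=["very_low", "low", "mid", "high", "very_high"])

# Predict the demand class of the next measurement within each area
df["target"] = df.groupby("name")["load_class"].shift(-1)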
Load zones
[Figure: map of the New York Control Area load zones: A (WEST), B (GENESE), C (CENTRL), D (NORTH), E (MHK VL), F (CAPITL), G (HUD VL), H (MILLWD), I (DUNWOD), J (N.Y.C.), K (LONGIL). Taken from NYISO (http://www.nyiso.com/public/index.jsp).]
One month demand for a single area

[Plot: demand over one month for a single area; x-axis: measurement index, y-axis: demand at that moment.]

One year demand for a single area

[Plot: demand over one year for a single area; x-axis: measurement index, y-axis: demand at that moment.]

Global demand for a single area

[Plot: demand over the whole period for a single area; x-axis: measurement index, y-axis: demand at that moment.]

Target variable distribution

[Plot: histogram of the target variable; x-axis: demand, y-axis: frequency of the demand value.]
Results
Method             Learner A / Learner B     Median             p-value
Holdout estimate   VFDT-MAJ / CVFDT-MAJ      μ_1/2 = -0.4285    p < 0.0001
Holdout estimate   VFDT-NB / CVFDT-NB        μ_1/2 = 0          p = 0.6538
Holdout estimate   CVFDT-MAJ / CVFDT-NB      μ_1/2 = 0.4410     p < 0.0001
Fading factors     VFDT-MAJ / CVFDT-MAJ      μ_1/2 = -0.377     p < 0.0001
Fading factors     VFDT-NB / CVFDT-NB        μ_1/2 = 0.0297     p = 0.1424
Fading factors     CVFDT-MAJ / CVFDT-NB      μ_1/2 = 0.3819     p < 0.0001

Table: Results of the Wilcoxon test when testing the hypothesis that the median of the Q-statistics is zero.
CVFDT-MAJ versus CVFDT-NB

[Plot: fading error of CVFDT-MAJ and CVFDT-NB against the index of the training example in the stream.]

[Plot: value of the Q-statistic for CVFDT-MAJ versus CVFDT-NB against the index of the training example in the stream.]

VFDT-NB versus CVFDT-NB

[Plot: fading error of VFDT-NB and CVFDT-NB against the index of the training example in the stream.]

[Plot: value of the Q-statistic for VFDT-NB versus CVFDT-NB against the index of the training example in the stream.]

VFDT-MAJ versus CVFDT-MAJ

[Plot: fading error of CVFDT-MAJ and VFDT-MAJ against the index of the training example in the stream.]

[Plot: value of the Q-statistic for VFDT-MAJ versus CVFDT-MAJ against the index of the training example in the stream.]
The End
Thank you for your attention!
Appendix
Hoeffding’s inequality
Theorem ([Hoeffding, 1963])
Let S := X_1 + X_2 + ... + X_n be a sum of independent bounded random variables, a_i ≤ X_i ≤ b_i, and let ε > 0 be a positive real number. Then

P(S - E[S] ≥ nε) ≤ exp( -2 n^2 ε^2 / \sum_{i=1}^{n} (b_i - a_i)^2 ).   (1)
Corollary
Let S := X_1 + X_2 + ... + X_n be a sum of independent bounded random variables, a ≤ X_i ≤ b, and let ε > 0 be a positive real number. For R := b - a we have

P(S - E[S] ≥ nε) ≤ exp( -2 n ε^2 / R^2 ).   (2)
Incremental decision tree learning
In the data stream world we only have a small subset of the examples available
- Using the Hoeffding inequality we can find the truly best attribute from a sample with high probability
- Let a ≤ X ≤ b be a bounded random variable and let X_1, X_2, ..., X_n be its measurements
- Let μ̄ := (X_1 + X_2 + ... + X_n)/n be the sample mean and let μ := E[X] be the true mean
- Furthermore, let 1 - δ ∈ (0, 1) be the desired confidence level
- By Hoeffding's inequality, we have P(μ̄ ≥ μ - ε) ≥ 1 - δ for
  ε = \sqrt{ (b - a)^2 \log(1/δ) / (2n) }
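The ε used throughout follows from the corollary by a short calculation: set the right-hand side of (2) equal to δ and solve for ε (log denotes the natural logarithm here):

  exp( -2 n ε^2 / R^2 ) = δ   ⟹   2 n ε^2 / R^2 = \log(1/δ)   ⟹   ε = \sqrt{ R^2 \log(1/δ) / (2n) },   with R = b - a.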
Incremental regression tree learning
What about regression?
Done at JSI by Elena Ikonomovska [Ikonomovska, 2012]
- Regression trees predict a real number instead of a class
- Define the standard deviation reduction:
  sdr(A, S) := σ(S) - \sum_{i=1}^{d} (|S_i| / |S|) σ(S_i),
  where S_i := {x ∈ S : A(x) = a_i} and σ(S) denotes the standard deviation
- Pick the attribute that maximizes the SDR: A* := arg max_{A} sdr(A, S)
- Again, using Hoeffding's inequality, we can find the best attribute with high probability
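A small Python sketch of the SDR computation for a categorical attribute (illustrative; this is not Ikonomovska's implementation):

from collections import defaultdict
from statistics import pstdev

def sdr(examples, attribute):
    """sdr(A, S) = sigma(S) - sum_i |S_i|/|S| * sigma(S_i),
    with examples given as (features_dict, target_value) pairs."""
    targets = [y for _, y in examples]
    partitions = defaultdict(list)
    for x, y in examples:
        partitions[x[attribute]].append(y)
    weighted = sum(len(part) / len(examples) * pstdev(part)
                   for part in partitions.values())
    return pstdev(targets) - weighted

# best_attribute = max(attributes, key=lambda a: sdr(examples, a))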
What about regression?
Let A and B be the best and the second-best attributes, respectively
- Then r := sdr(B)/sdr(A) is a random variable and r ∈ [0, 1]
- Let r_1, r_2, ..., r_n be such ratios for the last n examples
- Now pick 1 - δ ∈ (0, 1) and let
  ε = \sqrt{ \log(1/δ) / (2n) }
- By Hoeffding's inequality we have P(r ∈ [r̄ - ε, r̄ + ε]) ≥ 1 - δ for r̄ = (r_1 + r_2 + ... + r_n)/n
What about regression?
Now we can derive a split criterion
- Let S_A and S_B be the deviation reductions after testing on A and B, respectively
- If S_B/S_A < 1 - ε, then A is truly the best attribute with probability at least 1 - δ (see [Ikonomovska, 2012])
- When predicting the target variable, sort the example down the tree and return the average target value of the examples at the given leaf
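A sketch of the resulting split decision in Python (the function and parameter names are illustrative):

import math

def should_split(sdr_best, sdr_second, n, delta=1e-7):
    """Split on the best attribute when sdr_second / sdr_best < 1 - epsilon,
    where epsilon = sqrt(log(1/delta) / (2n)) bounds the ratio r = sdr(B)/sdr(A)."""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return sdr_second / sdr_best < 1.0 - eps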
References
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00), pages 71–80, New York, NY, USA, 2000. ACM. doi: 10.1145/347090.347107.

Joao Gama, Raquel Sebastiao, and Pedro Pereira Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3):317–346, 2013.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Elena Ikonomovska. Algoritmi za ucenje regresijskih dreves in ansamblov iz spremenljivih podatkovnih tokov (Algorithms for learning regression trees and ensembles from evolving data streams). PhD thesis, Jozef Stefan International Postgraduate School, 2012.

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.