Incremental Learning of Decision Trees from Time-Changing Data Streams
TRANSCRIPT
Incremental decision tree learning from time-changing data streams
Blaz ([email protected])
Artificial Intelligence Laboratory, Jozef Stefan Institute
October 15, 2013
Talk outline
1 Introduction: Motivation; Classical decision tree learning
2 Incremental decision tree learning: Incremental classification tree learning
3 Evaluation: Assessing learning performance; Learning algorithm comparison
4 Results: Data description; Results; Prequential fading error estimation
Motivation
- In certain scenarios data arrive continuously and are unbounded (data streams)
  - Sensor networks, search queries, road traffic, network traffic
- No control over the order and speed of arrival
- Because of the limited working memory we may view each example only once
- The source distribution may change over time (concept drift)
- Classical (batch) decision tree learning methods fail in this setting
Classical decision tree learning
[Figure: example decision tree for the Titanic dataset; the root tests sex, the branches test status, and the remaining impure branches test age, with yes/no leaves predicting whether a passenger survived.]
Classical decision tree learning
The following ID3 learner is due to [Quinlan, 1986]
- Let S be a set of training examples
- Find the attribute A* that alone best classifies the examples from S:
  - Define a heuristic measure, say information gain:
    G(A, S) := H(S) - \sum_{i=1}^{d} (|S_i| / |S|) H(S_i)
  - Then pick the best attribute: A* := arg max_{A} G(A, S) (maximum over all attributes A)
- Partition S into S_i := {x ∈ S : A*(x) = a_i} for all values a_i of A*, and create a leaf node for each partition
- Recursively apply the procedure to the examples S_i at the children nodes
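To make the information-gain step concrete, here is a small Python sketch (not from the slides; the example encoding and function names are illustrative):

from collections import Counter
from math import log2

def entropy(labels):
    """H(S) for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """G(A, S) = H(S) - sum_i |S_i|/|S| * H(S_i) for a categorical attribute.
    `examples` is a list of (features_dict, label) pairs."""
    labels = [y for _, y in examples]
    partitions = {}
    for x, y in examples:
        partitions.setdefault(x[attribute], []).append(y)
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Tiny Titanic-style illustration (values made up):
data = [({"sex": "female", "age": "adult"}, "yes"),
        ({"sex": "male", "age": "adult"}, "no"),
        ({"sex": "female", "age": "child"}, "yes"),
        ({"sex": "male", "age": "child"}, "no")]
best = max(["sex", "age"], key=lambda a: information_gain(data, a))  # -> "sex"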
Simple example
Example on the Titanic dataset
- List of all Titanic passengers
- Each passenger is represented as a (status, age, sex) vector, labeled either yes (survived) or no (died)
- Attribute description:
  - status: first, second, third, or crew
  - age: adult, child
  - sex: male, female
- Learn to predict whether an unlabeled x survived or died
Simple example

[Figures: step-by-step construction of the decision tree on the Titanic data; the tree starts as a single leaf predicting no, the root is then split on sex, both branches are split on status, and the remaining impure branches are split on age, ending with the full tree shown earlier.]
Incremental classification tree learning
Incremental decision tree learning
In the data stream world we only have a small subset of the examples available
- Using the Hoeffding inequality we can find the truly best attribute from a sample with high probability
- Suppose A1 and A2 are the attributes with the highest estimates G(A1) and G(A2)
- If G(A1) - G(A2) > ε, then A1 is truly the best with probability at least 1 - δ, for 1 - δ ∈ (0, 1) and
  ε = \sqrt{R^2 \log(1/δ) / (2n)}
- This is the main idea behind the VFDT learner [Domingos and Hulten, 2000]
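A minimal Python sketch of this decision rule (illustrative, not from the slides); R is the range of the information gain, which for a C-class problem is log2(C), and the logarithm in the bound is taken as the natural log:

from math import log, log2, sqrt

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observations of a variable with range R."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))

def a1_is_truly_best(g1, g2, n, num_classes, delta=1e-7):
    """True if the observed gap G(A1) - G(A2) exceeds the Hoeffding bound,
    so A1 is the truly best attribute with probability at least 1 - delta."""
    eps = hoeffding_bound(log2(num_classes), delta, n)
    return (g1 - g2) > eps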
VFDT algorithm
1: Let HT be the root node
2: for x ∈ S do
3:   Sort x down the tree to the leaf ℓ and update its sufficient statistic
4:   if n_ℓ mod n_m = 0 and the examples seen at ℓ have nonzero entropy then
5:     Let X_a and X_b be the attributes with the highest estimates G_ℓ(X_i)
6:     Compute ε := \sqrt{R^2 \log(1/δ) / (2 n_ℓ)}   {Here, R = log2 C}
7:     if G(X_a) - G(X_b) > ε or G(X_a) - G(X_b) ≤ ε < τ then
8:       Turn leaf ℓ into a node that tests on X_a
9:       for each value of X_a do
10:        Add a leaf and initialize its sufficient statistic
11:      end for
12:    end if
13:  end if
14: end for
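A minimal Python sketch of the per-leaf bookkeeping and the split test in lines 4-13 (the class and function names are illustrative and not tied to any particular VFDT implementation; the gain estimates are assumed to be computed from the stored counts):

import math
from collections import defaultdict

class Leaf:
    """Sufficient statistics of a VFDT leaf: counts of (attribute, value, class)
    triples, which are enough to estimate every G_l(X_i)."""
    def __init__(self, attributes):
        self.attributes = attributes
        self.n = 0
        self.counts = defaultdict(int)          # (attribute, value, class) -> count

    def update(self, x, y):
        self.n += 1
        for a in self.attributes:
            self.counts[(a, x[a], y)] += 1

def try_split(leaf, gains, n_min=200, delta=1e-7, tau=0.05, value_range=1.0):
    """gains: attribute -> estimated information gain at this leaf.
    Returns the attribute to split on, or None if no split is warranted yet."""
    if leaf.n == 0 or leaf.n % n_min != 0:
        return None
    ranked = sorted(gains, key=gains.get, reverse=True)
    best, second = ranked[0], ranked[1]
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * leaf.n))
    gap = gains[best] - gains[second]
    if gap > eps or eps < tau:                  # clear winner, or near-tie broken by tau
        return best
    return None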
VFDT algorithm
- The algorithm does not adapt to changes
- Numeric attributes are handled with online discretization
- The τ parameter is introduced to resolve cases when two attributes are almost equally good
- G(A_i) is recomputed only periodically (typically n_m = 200)
- With high probability, the VFDT-induced tree uses the same sequence of tests as a (hypothetical) batch-induced tree to classify a randomly chosen example [Domingos and Hulten, 2000]
Big picture of the CVFDT algorithm
[Figure: big picture of CVFDT; a sliding window W over the stream separates old examples from new ones, the main tree grows from the root node, and at an internal node T' the algorithm keeps both the existing subtrees and alternate trees grown for that node.]
Assessing learning performance
Roughly, we distinguish two approaches [Gama et al., 2013]:
- Holdout error estimation
  - Periodically (with period, say, 20 000 examples) sacrifice m := 2 000 examples and use them to estimate the classification error:
    H_m := (1/m) \sum_{i=k}^{k+m} L(y_i, ŷ_i)
- Prequential error estimation (also known as "test-then-train")
  - Let α ∈ (0, 1] be a fading factor and let A be a classifier
  - Define the estimated prequential error P_α(i):
    S_A^α(i) := L(y_i, ŷ_i) + α L(y_{i-1}, ŷ_{i-1}) + ... + α^{i-1} L(y_1, ŷ_1),
    N^α(i) := 1 + α + ... + α^{i-1},
    P_α(i) := S_A^α(i) / N^α(i).
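Both estimators are easy to maintain online. Below is a small Python sketch (illustrative, not tied to any particular library) of the prequential fading error in its recursive form S_i = L_i + α S_{i-1}, N_i = 1 + α N_{i-1}; the `model` in the commented loop is a hypothetical incremental classifier:

class PrequentialError:
    """Fading prequential 0/1-error estimate P_alpha(i) = S_alpha(i) / N_alpha(i)."""
    def __init__(self, alpha=0.995):
        self.alpha = alpha
        self.s = 0.0    # faded sum of losses
        self.n = 0.0    # faded count of examples

    def update(self, y_true, y_pred):
        loss = 0.0 if y_true == y_pred else 1.0
        self.s = loss + self.alpha * self.s
        self.n = 1.0 + self.alpha * self.n
        return self.s / self.n

# Test-then-train loop:
# err = PrequentialError(alpha=0.995)
# for x, y in stream:
#     current_error = err.update(y, model.predict(x))   # test first
#     model.learn(x, y)                                  # then train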
Comparing learning algorithms
Let A and B be learners and let S_A and S_B be aligned error sequences
- Define Q_i^α(A, B) := log(S_A^α(i) / S_B^α(i))
- Interpretation of the Q-statistic:
  - Q_i^α(A, B) < 0 means that A is better than B,
  - Q_i^α(A, B) > 0 means that B is better than A,
  - Q_i^α(A, B) = 0 means A and B perform equally well.
- Here |Q_i^α(A, B)| is the strength of the difference, i.e. how much better one learner is than the other
- The Wilcoxon test tests the null hypothesis that the vector of Q-statistics comes from a zero-median distribution
- For all tests we took significance level α := 0.0001
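A sketch of how this comparison could be computed in Python (assuming NumPy and SciPy are available; S_A and S_B are the aligned faded loss sums produced by estimators like the one above):

import numpy as np
from scipy import stats

def q_statistics(s_a, s_b):
    """Q_i = log(S_A(i) / S_B(i)) for aligned faded loss sums of learners A and B."""
    return np.log(np.asarray(s_a, dtype=float) / np.asarray(s_b, dtype=float))

def compare_learners(s_a, s_b, significance=0.0001):
    """Wilcoxon signed-rank test of the null hypothesis that the Q-statistics
    come from a zero-median distribution."""
    q = q_statistics(s_a, s_b)
    statistic, p_value = stats.wilcoxon(q)
    return q, p_value, p_value < significance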
Data description
We evaluated the VFDT and CVFDT learners on electricity-demand data for New York state
- We discretized the target attribute load to get a 5-class classification problem
- Other attributes:
  - numeric attributes hourOfDay, dayOfWeek, and month, computed from date
  - name of area (name) is an 11-valued discrete attribute; PTID is a numeric attribute
- We took data for the last 10 years and tried to predict the demand for the next measurement
- Altogether around 13 878 974 records (about 1.3 GB of uncompressed data)
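As an illustration of how such a dataset could be prepared (a pandas sketch; the column names load, date, name, and PTID follow the slide, while the file name and the equal-frequency discretization into five classes are assumptions, since the slides do not say how the binning was done):

import pandas as pd

# Hypothetical input file with columns: date, name, PTID, load
df = pd.read_csv("nyiso_load.csv", parse_dates=["date"])

# Numeric time attributes derived from the timestamp
df["hourOfDay"] = df["date"].dt.hour
df["dayOfWeek"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month

# Turn the numeric target `load` into a 5-class problem (equal-frequency bins)
df["load_class"] = pd.qcut(df["load"], q=5,
                           labels=["very_low", "low", "mid", "high", "very_high"])

# Predict the demand class of the next measurement within each area
df["target"] = df.groupby("name")["load_class"].shift(-1)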
Load zones
[Figure: map of the New York Control Area load zones: A (WEST), B (GENESE), C (CENTRL), D (NORTH), E (MHK VL), F (CAPITL), G (HUD VL), H (MILLWD), I (DUNWOD), J (N.Y.C.), K (LONGIL). Taken from NYISO (http://www.nyiso.com/public/index.jsp).]
One month demand for a single area

[Plot: demand over one month for a single area; x-axis: measurement index, y-axis: demand at that moment.]

One year demand for a single area

[Plot: demand over one year for a single area; x-axis: measurement index, y-axis: demand at that moment.]

Global demand for a single area

[Plot: demand over the whole period for a single area; x-axis: measurement index, y-axis: demand at that moment.]

Target variable distribution

[Plot: histogram of the target variable; x-axis: demand, y-axis: frequency of the demand value.]
Results
Method             Learner A / Learner B     Median             p-value
Holdout estimate   VFDT-MAJ / CVFDT-MAJ      μ_1/2 = -0.4285    p < 0.0001
Holdout estimate   VFDT-NB / CVFDT-NB        μ_1/2 = 0          p = 0.6538
Holdout estimate   CVFDT-MAJ / CVFDT-NB      μ_1/2 = 0.4410     p < 0.0001
Fading factors     VFDT-MAJ / CVFDT-MAJ      μ_1/2 = -0.377     p < 0.0001
Fading factors     VFDT-NB / CVFDT-NB        μ_1/2 = 0.0297     p = 0.1424
Fading factors     CVFDT-MAJ / CVFDT-NB      μ_1/2 = 0.3819     p < 0.0001

Table: Results of the Wilcoxon test when testing the hypothesis that the median of the Q-statistics is zero.
CVFDT-MAJ versus CVFDT-NB

[Plot: fading error of CVFDT-MAJ and CVFDT-NB against the index of the training example in the stream.]

[Plot: value of the Q-statistic for CVFDT-MAJ versus CVFDT-NB against the index of the training example in the stream.]

VFDT-NB versus CVFDT-NB

[Plot: fading error of VFDT-NB and CVFDT-NB against the index of the training example in the stream.]

[Plot: value of the Q-statistic for VFDT-NB versus CVFDT-NB against the index of the training example in the stream.]

VFDT-MAJ versus CVFDT-MAJ

[Plot: fading error of CVFDT-MAJ and VFDT-MAJ against the index of the training example in the stream.]

[Plot: value of the Q-statistic for VFDT-MAJ versus CVFDT-MAJ against the index of the training example in the stream.]
The End
Thank you for your attention!
Appendix
Hoeffding’s inequality
Theorem ([Hoeffding, 1963])
Let S := X_1 + X_2 + ... + X_n be a sum of independent bounded random variables, a_i ≤ X_i ≤ b_i, and let ε > 0 be a positive real number. Then

P(S - E[S] ≥ nε) ≤ exp( -2 n^2 ε^2 / \sum_{i=1}^{n} (b_i - a_i)^2 ).   (1)
Corollary
Let S := X_1 + X_2 + ... + X_n be a sum of independent bounded random variables, a ≤ X_i ≤ b, and let ε > 0 be a positive real number. For R := b - a we have

P(S - E[S] ≥ nε) ≤ exp( -2 n ε^2 / R^2 ).   (2)
Incremental decision tree learning
In the data stream world we only have a small subset of the examples available
- Using the Hoeffding inequality we can find the truly best attribute from a sample with high probability
- Let a ≤ X ≤ b be a bounded random variable and let X_1, X_2, ..., X_n be its measurements
- Let μ̄ := (X_1 + X_2 + ... + X_n)/n be the sample mean and let μ := E[X] be the true mean
- Furthermore, let 1 - δ ∈ (0, 1) be the desired confidence level
- By Hoeffding's inequality, we have P(μ̄ ≥ μ - ε) ≥ 1 - δ for
  ε = \sqrt{ (b - a)^2 \log(1/δ) / (2n) }
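The ε used throughout follows from the corollary by a short calculation: set the right-hand side of (2) equal to δ and solve for ε (log denotes the natural logarithm here):

  exp( -2 n ε^2 / R^2 ) = δ   ⟹   2 n ε^2 / R^2 = \log(1/δ)   ⟹   ε = \sqrt{ R^2 \log(1/δ) / (2n) },   with R = b - a.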
Incremental regression tree learning
What about regression?
Done at JSI by Elena Ikonomovska [Ikonomovska, 2012]
- Regression trees predict a real number instead of a class
- Define the standard deviation reduction:
  sdr(A, S) := σ(S) - \sum_{i=1}^{d} (|S_i| / |S|) σ(S_i),
  where S_i := {x ∈ S : A(x) = a_i} and σ(S) denotes the standard deviation
- Pick the attribute that maximizes the SDR: A* := arg max_{A} sdr(A, S)
- Again, using Hoeffding's inequality, we can find the best attribute with high probability
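A small Python sketch of the SDR computation for a categorical attribute (illustrative; this is not Ikonomovska's implementation):

from collections import defaultdict
from statistics import pstdev

def sdr(examples, attribute):
    """sdr(A, S) = sigma(S) - sum_i |S_i|/|S| * sigma(S_i),
    with examples given as (features_dict, target_value) pairs."""
    targets = [y for _, y in examples]
    partitions = defaultdict(list)
    for x, y in examples:
        partitions[x[attribute]].append(y)
    weighted = sum(len(part) / len(examples) * pstdev(part)
                   for part in partitions.values())
    return pstdev(targets) - weighted

# best_attribute = max(attributes, key=lambda a: sdr(examples, a))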
What about regression?
Let A and B be the best and the second-best attributes, respectively
- Then r := sdr(B)/sdr(A) is a random variable and r ∈ [0, 1]
- Let r_1, r_2, ..., r_n be such ratios for the last n examples
- Now pick 1 - δ ∈ (0, 1) and let
  ε = \sqrt{ \log(1/δ) / (2n) }
- By Hoeffding's inequality we have P(r ∈ [r̄ - ε, r̄ + ε]) ≥ 1 - δ for r̄ = (r_1 + r_2 + ... + r_n)/n
What about regression?
Now we can derive a split criterion
- Let S_A and S_B be the deviation reductions after testing on A and B, respectively
- If S_B/S_A < 1 - ε, then A is truly the best attribute with probability at least 1 - δ (see [Ikonomovska, 2012])
- When predicting the target variable, sort the example down the tree and return the average target value of the examples at the given leaf
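A sketch of the resulting split decision in Python (the function and parameter names are illustrative):

import math

def should_split(sdr_best, sdr_second, n, delta=1e-7):
    """Split on the best attribute when sdr_second / sdr_best < 1 - epsilon,
    where epsilon = sqrt(log(1/delta) / (2n)) bounds the ratio r = sdr(B)/sdr(A)."""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return sdr_second / sdr_best < 1.0 - eps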
References
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00), pages 71–80, New York, NY, USA, 2000. ACM. doi: 10.1145/347090.347107.

Joao Gama, Raquel Sebastiao, and Pedro Pereira Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3):317–346, 2013.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Elena Ikonomovska. Algoritmi za ucenje regresijskih dreves in ansamblov iz spremenljivih podatkovnih tokov (Algorithms for learning regression trees and ensembles from evolving data streams). PhD thesis, Jozef Stefan International Postgraduate School, 2012.

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.