
Page 1: Mining Decision Trees from Data Streams

1

Mining Decision Trees from Data Streams

Thanks: Tong Suk Man Ivy, HKU

Page 2: Mining Decision Trees from Data Streams

2

Contents

- Introduction: problems in mining data streams
- Classification of stream data
- VFDT algorithm
- Window approach
- CVFDT algorithm
- Experimental results
- Conclusions
- Future work

Page 3: Mining Decision Trees from Data Streams

3

Data Streams

Characteristics:
- Large volume of ordered data points, possibly infinite
- Arrive continuously
- Fast changing

An appropriate model for many applications:
- Phone call records
- Network and security monitoring
- Financial applications (stock exchange)
- Sensor networks

Page 4: Mining Decision Trees from Data Streams

4

Problems in Mining Data Streams

Traditional data mining techniques usually require:
- The entire data set to be present
- Random access (or multiple passes) to the data
- Much time per data item

Challenges of stream mining:
- Impractical to store the whole data
- Random access is expensive
- Only simple calculation per data item, due to time and space constraints

Page 5: Mining Decision Trees from Data Streams

5

Processing Data Streams: Motivation

A growing number of applications generate streams of data:
- Performance measurements in network monitoring and traffic management
- Call detail records in telecommunications
- Transactions in retail chains, ATM operations in banks
- Log records generated by Web servers
- Sensor network data

Application characteristics:
- Massive volumes of data (several terabytes)
- Records arrive at a rapid rate

Goal: mine patterns, process queries and compute statistics on data streams in real time.

(from VLDB’02 Tutorial)

Page 6: Mining Decision Trees from Data Streams

6

Data Streams: Computation Model

A data stream is a (massive) sequence of elements $e_1, \ldots, e_n$.

Stream processing requirements:
- Single pass: each record is examined at most once
- Bounded storage: limited memory M for storing a synopsis
- Real-time: per-record processing time (to maintain the synopsis) must be low

[Figure: data streams $e_1, \ldots, e_n$ flow into a Stream Processing Engine, which maintains a synopsis in memory and returns an (approximate) answer.]
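As a toy illustration of this computation model (my own sketch, not from the slides): one pass over the records, O(1) memory and O(1) work per record, with an answer available at any time. The two counters are the entire in-memory synopsis.

```python
def stream_average(records):
    """records: a (possibly unbounded) iterable of byte counts."""
    count, total = 0, 0              # the whole synopsis: two numbers
    for bytes_sent in records:       # each record is examined exactly once
        count += 1
        total += bytes_sent
        yield total / count          # an anytime answer

# Usage (netflow_bytes is any iterable of numbers):
# for avg in stream_average(netflow_bytes):
#     ...
```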

Page 7: Mining Decision Trees from Data Streams

7

Network Management Application

Network management involves monitoring and configuring network hardware and software to ensure smooth operation:
- Monitor link bandwidth usage, estimate traffic demands
- Quickly detect faults and congestion, and isolate the root cause
- Load balancing, improve utilization of network resources

[Figure: a Network Operations Center exchanges measurements and alarms with the network.]

(from VLDB’02 Tutorial)

Page 8: Mining Decision Trees from Data Streams

8

IP Network Measurement Data

IP session data (collected using Cisco NetFlow). AT&T collects 100 GB of NetFlow data per day!

Source     Destination  Duration  Bytes  Protocol
10.1.0.2   16.2.3.7     12        20K    http
18.6.7.1   12.4.0.3     16        24K    http
13.9.4.3   11.6.8.2     15        20K    http
15.2.2.9   17.1.2.1     19        40K    http
12.4.3.8   14.8.7.4     26        58K    http
10.5.1.3   13.0.0.1     27        100K   ftp
11.1.0.6   10.3.4.5     32        300K   ftp
19.7.1.2   16.5.5.8     18        80K    ftp

(from VLDB’02 Tutorial)

Page 9: Mining Decision Trees from Data Streams

9

Network Data Processing

Traffic estimation:
- How many bytes were sent between a pair of IP addresses?
- What fraction of network IP addresses are active?
- List the top 100 IP addresses in terms of traffic

Traffic analysis:
- What is the average duration of an IP session?
- What is the median number of bytes in each IP session?

Fraud detection:
- List all sessions that transmitted more than 1000 bytes
- Identify all sessions whose duration was more than twice the normal

Security / denial of service:
- List all IP addresses that have witnessed a sudden spike in traffic
- Identify IP addresses involved in more than 1000 sessions

(from VLDB’02 Tutorial)

Page 10: Mining Decision Trees from Data Streams

10

Data Stream Processing Algorithms

Generally, algorithms compute approximate answers: it is difficult to compute answers accurately with limited memory.

Approximate answers, deterministic bounds:
- Algorithms compute only an approximate answer, but with deterministic bounds on the error

Approximate answers, probabilistic bounds:
- Algorithms compute an approximate answer with high probability: with probability at least $1 - \delta$, the computed answer is within a factor $\epsilon$ of the actual answer

Single-pass algorithms for processing streams are also applicable to (massive) terabyte databases!

(from VLDB’02 Tutorial)

Page 11: Mining Decision Trees from Data Streams

11

Classification of Stream Data

VFDT algorithm:
- "Mining High-Speed Data Streams", KDD 2000. Pedro Domingos, Geoff Hulten

CVFDT algorithm (window approach):
- "Mining Time-Changing Data Streams", KDD 2001. Geoff Hulten, Laurie Spencer, Pedro Domingos

Page 12: Mining Decision Trees from Data Streams

12

Hoeffding Trees

Page 13: Mining Decision Trees from Data Streams

13

Definitions

A classification problem is defined as:
- N is a set of training examples of the form (x, y)
- x is a vector of d attributes
- y is a discrete class label

Goal: produce from the examples a model y = f(x) that predicts the class y of future examples x with high accuracy.

Page 14: Mining Decision Trees from Data Streams

14

Decision Tree Learning

One of the most effective and widely used classification methods. It induces models in the form of decision trees:
- Each internal node contains a test on an attribute
- Each branch from a node corresponds to a possible outcome of the test
- Each leaf contains a class prediction

A decision tree is learned by recursively replacing leaves with test nodes, starting at the root.

[Figure: example tree. The root tests Age < 30; one branch leads to a test Car Type = Sports Car; leaves predict Yes or No.]

Page 15: Mining Decision Trees from Data Streams

15

Challenges

Classic decision tree learners assume all training data can be stored simultaneously in main memory.

Disk-based decision tree learners repeatedly read the training data from disk sequentially; this is prohibitively expensive when learning complex trees.

Goal: design decision tree learners that read each example at most once, and use a small constant time to process it.

Page 16: Mining Decision Trees from Data Streams

16

Key Observation

To find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node:
- Given a stream of examples, use the first ones to choose the root attribute.
- Once the root attribute is chosen, successive examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively.

Use the Hoeffding bound to decide how many examples are enough at each node.

Page 17: Mining Decision Trees from Data Streams

17

Hoeffding Bound

Consider a real-valued random variable $a$ whose range is $R$. Suppose we have $n$ independent observations of $a$, with observed mean $\bar{a}$. The Hoeffding bound states that, with probability $1 - \delta$, the true mean of $a$ is at least $\bar{a} - \epsilon$, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
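The bound is cheap to evaluate. The sketch below is a direct transcription of the formula above; the choice R = log2(c) for information gain over c classes follows the VFDT experiment settings later in the deck.

```python
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)), as on the slide."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# For information gain with 2 classes, R = log2(2) = 1:
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=1000))    # ~0.0898
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=100000))  # ~0.0090
```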

Page 18: Mining Decision Trees from Data Streams

18

How many examples are enough?

Let $G(X_i)$ be the heuristic measure used to choose test attributes (e.g. information gain, Gini index):
- $X_a$: the attribute with the highest evaluation value after seeing $n$ examples
- $X_b$: the attribute with the second-highest evaluation value after seeing $n$ examples

Let $\Delta G = G(X_a) - G(X_b) \geq 0$. Given a desired $\delta$, if $\Delta G > \epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ after seeing $n$ examples at a node, the Hoeffding bound guarantees that the true $\Delta G \geq \Delta G - \epsilon > 0$ with probability $1 - \delta$. The node can then be split using $X_a$, and the succeeding examples are passed to the new leaves.

Page 19: Mining Decision Trees from Data Streams

19

Algorithm

- Calculate the information gain for the attributes and determine the best two attributes.
- Pre-pruning: also consider a "null" attribute, corresponding to not splitting the node.
- At each node, check the condition $\Delta G = G(X_a) - G(X_b) > \epsilon$.
- If the condition is satisfied, create child nodes based on the test at the node (a sketch follows below).
- If not, stream in more examples and repeat the calculation until the condition is satisfied.
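A sketch of this per-leaf check. The helpers info_gain and null_gain, and the leaf interface (stats, attributes, n_seen, split_on), are hypothetical names of my own; hoeffding_epsilon is the function sketched earlier.

```python
def try_split(leaf, delta=1e-7, R=1.0):
    # Evaluate G for every attribute, plus the "null" attribute
    # (not splitting), which is used for pre-pruning.
    gains = {attr: info_gain(leaf.stats, attr) for attr in leaf.attributes}
    gains[None] = null_gain(leaf.stats)
    ranked = sorted(gains, key=gains.get, reverse=True)
    x_a, x_b = ranked[0], ranked[1]          # best two candidates
    eps = hoeffding_epsilon(R, delta, leaf.n_seen)
    # Hoeffding test: with probability 1 - delta, x_a really is best.
    if x_a is not None and gains[x_a] - gains[x_b] > eps:
        leaf.split_on(x_a)                   # one child leaf per outcome
        return True
    return False                             # wait for more examples
```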

Page 20: Mining Decision Trees from Data Streams

20

[Figure: a data stream feeds the growing tree. The root test Age < 30 is chosen first from the initial examples; later, at a leaf, once $\Delta G = G(\text{Car Type}) - G(\text{Gender}) > \epsilon$, the leaf is replaced by the test Car Type = Sports Car.]

Page 21: Mining Decision Trees from Data Streams

21

Performance Analysis

Let $p$ be the probability that an example passed through the tree to level $i$ falls into a leaf at that point. The expected disagreement between the tree produced by the Hoeffding tree algorithm and the tree produced using infinite examples at each node is no greater than $\delta / p$.

Required memory: O(leaves × attributes × values × classes). As an illustrative calculation (not from the slides): 10,000 leaves × 100 binary attributes × 2 values × 2 classes is 4 million counters, on the order of tens of megabytes.

Page 22: Mining Decision Trees from Data Streams

22

VFDT

Page 23: Mining Decision Trees from Data Streams

23

VFDT (Very Fast Decision Tree)

A decision-tree learning system based on the Hoeffding tree algorithm, with several refinements (sketched below):
- Ties: split on the current best attribute if the difference is less than a user-specified threshold $\tau$; it is wasteful to keep deciding between nearly identical attributes
- Compute G and check for a split only periodically, every $n_{min}$ examples
- Memory management: memory is dominated by the sufficient statistics; deactivate or drop less promising leaves when needed
- Bootstrap with a traditional learner
- Rescan old data when time is available
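A sketch of the first two refinements, building on try_split's quantities. The helper names (best_two_gains, split_on_best) are assumptions; the default values of delta, tau and n_min match the experiment settings two slides below.

```python
def vfdt_try_split(leaf, delta=1e-7, tau=0.05, n_min=200, R=1.0):
    # Recompute G and test for a split only every n_min examples:
    # a single extra example rarely changes the decision.
    if leaf.n_seen % n_min != 0:
        return
    g_a, g_b = leaf.best_two_gains()         # assumed helper
    eps = hoeffding_epsilon(R, delta, leaf.n_seen)
    # Tie-breaking: once eps < tau, the top attributes are effectively
    # identical, and waiting longer to separate them is wasteful.
    if g_a - g_b > eps or eps < tau:
        leaf.split_on_best()                 # assumed helper
```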

Page 24: Mining Decision Trees from Data Streams

24

VFDT (2)

Scales better than pure memory-based or pure disk-based learners:
- Accesses data sequentially
- Uses subsampling to potentially require much less than one scan

VFDT is incremental and anytime:
- New examples can be quickly incorporated as they arrive
- A usable model is available after the first few examples, and is then progressively refined

Page 25: Mining Decision Trees from Data Streams

25

Experiment Results (VFDT vs. C4.5)

- Compared VFDT and C4.5 (Quinlan, 1993)
- Same memory limit for both (40 MB); 100k examples for C4.5
- VFDT settings: δ = 10⁻⁷, τ = 5%, n_min = 200
- Domains: 2 classes, 100 binary attributes
- Fifteen synthetic trees, 2.2k – 500k leaves
- Noise from 0% to 30%

Page 26: Mining Decision Trees from Data Streams

26

Experiment Results

Accuracy as a function of the number of training examples

Page 27: Mining Decision Trees from Data Streams

27

Experiment Results

Tree size as a function of number of training examples

Page 28: Mining Decision Trees from Data Streams

28

Mining Time-Changing Data Streams

Most KDD systems, including VFDT, assume that the training data is a sample drawn from a stationary distribution. Most large databases and data streams violate this assumption.

Concept drift: the data is generated by a time-changing concept function, e.g.:
- Seasonal effects
- Economic cycles

Goal: mine continuously changing data streams, and scale well.

Page 29: Mining Decision Trees from Data Streams

29

Window Approach

A common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples (sketched below). This is sensitive to the window size:
- If w is small relative to the concept shift rate, the model reflects the current concept
- But too small a w may leave insufficient examples to learn the concept
- If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying the learner may be prohibitively high
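The window approach reads almost literally as code. A minimal sketch, assuming some batch learner object with a fit(examples) method (e.g. a conventional decision tree learner):

```python
from collections import deque

def window_learner(stream, learner, w=100_000):
    window = deque(maxlen=w)        # oldest examples fall off automatically
    for example in stream:
        window.append(example)
        learner.fit(list(window))   # relearn from scratch: O(w) or worse
                                    # per arriving example -- the cost the
                                    # slide warns about
    return learner
```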

Page 30: Mining Decision Trees from Data Streams

30

CVFDT

Page 31: Mining Decision Trees from Data Streams

31

CVFDT (Concept-adapting Very Fast Decision Tree learner)

- Extends VFDT
- Maintains VFDT's speed and accuracy
- Detects and responds to changes in the example-generating process

Page 32: Mining Decision Trees from Data Streams

32

Observations

With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.

An outdated subtree may still be better than the best single leaf, particularly if it is near the root:
- Grow an alternate subtree with the new best attribute at its root when the old attribute seems out of date
- Periodically use a batch of samples to evaluate the quality of the trees
- Replace the old subtree when the alternate one becomes more accurate

Page 33: Mining Decision Trees from Data Streams

33

CVFDT algorithm

Alternate trees for each node in HT start empty. Process examples from the stream indefinitely; for each example (x, y):
- Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes that (x, y) passes through
- Add (x, y) to the sliding window of examples
- If the sliding window overflows, remove the oldest example and forget its effect
- CVFDTGrow
- CheckSplitValidity, if f examples have been seen since the last check of the alternate trees

Return HT. (A sketch of this loop follows below.)
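The outer loop as a sketch. The four subroutine names mirror the slide; their bodies, and the sort_into_leaves helper (assumed here to return the maximum leaf ID the example reaches, which forgetExample needs later), are assumptions.

```python
from collections import deque

def cvfdt(stream, HT, w=100_000, f=10_000):
    window = deque(maxlen=w)
    since_check = 0
    for x, y in stream:                        # process indefinitely
        max_id = sort_into_leaves(HT, (x, y))  # through HT and all
                                               # alternate trees
        if len(window) == w:                   # window overflow: undo the
            old, old_max_id = window[0]        # oldest example's effect
            forget_example(HT, old, old_max_id)
        window.append(((x, y), max_id))
        cvfdt_grow(HT, (x, y))
        since_check += 1
        if since_check >= f:                   # every f examples, revisit
            check_split_validity(HT)           # old split decisions
            since_check = 0
    return HT
```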

Page 34: Mining Decision Trees from Data Streams

34

CVFDT algorithm: process each example

[Flowchart: read new example → pass example down to leaves → add example to sliding window → if the window overflows, forget the oldest example → CVFDTGrow → if f examples have been seen since the last check, CheckSplitValidity → read next example.]

Page 35: Mining Decision Trees from Data Streams

35

CVFDT algorithm: process each example

[Same flowchart as Page 34, with the CVFDTGrow step highlighted.]

Page 36: Mining Decision Trees from Data Streams

36

CVFDTGrow

For each node reached by the example in HT:
- Increment the corresponding statistics at the node
- For each alternate tree T_alt of the node, run CVFDTGrow on T_alt

If enough examples have been seen at the leaf in HT that the example reaches:
- Choose the attribute with the highest average value of the attribute evaluation measure (information gain or Gini index)
- If the best attribute is not the "null" attribute, create a child node for each possible value of this attribute
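A recursive sketch of CVFDTGrow. The node interface (update_stats, alternate_trees, is_leaf, n_seen_enough, best_attribute, split_on, child_for) is an assumption of mine, not the paper's API.

```python
def cvfdt_grow(node, example, delta=1e-7):
    # Unlike VFDT, statistics are kept at *every* node on the path,
    # so old split decisions can later be re-examined.
    node.update_stats(example)
    for alt in node.alternate_trees:          # grow alternates in parallel
        cvfdt_grow(alt, example, delta)
    if node.is_leaf():
        if node.n_seen_enough():
            best = node.best_attribute(delta) # info gain or Gini, with
                                              # the Hoeffding test applied
            if best is not None:              # None = the "null" attribute
                node.split_on(best)           # one child per value
    else:
        cvfdt_grow(node.child_for(example), example, delta)
```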

Page 37: Mining Decision Trees from Data Streams

37

CVFDT algorithm: process each example

[Same flowchart as Page 34, with the "forget oldest example" step highlighted.]

Page 38: Mining Decision Trees from Data Streams

38

Forget old example

Maintain sufficient statistics at every node in HT to monitor the validity of its previous decisions (VFDT maintains such statistics only at leaves).

HT might have grown or changed since the example was initially incorporated, so each node is assigned a unique, monotonically increasing ID as it is created.

forgetExample(HT, example, maxID):
- For each node reached by the old example whose node ID is no larger than the maximum leaf ID the example reaches:
  - Decrement the corresponding statistics at the node
  - For each alternate tree T_alt of the node, forgetExample(T_alt, example, maxID)
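forgetExample as a recursive sketch. The node-ID comparison is the point of the slide; the rest of the node interface is assumed, as in the earlier sketches.

```python
def forget_example(node, example, max_id):
    # Nodes created after the example arrived never counted it, and
    # IDs increase monotonically, so the recursion can stop here.
    if node.node_id > max_id:
        return
    node.decrement_stats(example)             # undo the old increment
    for alt in node.alternate_trees:
        forget_example(alt, example, max_id)
    if not node.is_leaf():
        forget_example(node.child_for(example), example, max_id)
```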

Page 39: Mining Decision Trees from Data Streams

39

CVFDT algorithm: process each example

[Same flowchart as Page 34, with the CheckSplitValidity step highlighted.]

Page 40: Mining Decision Trees from Data Streams

40

CheckSplitValidity

- Periodically scans the internal nodes of HT
- Starts a new alternate tree when a new winning attribute is found
- Uses a tighter criterion than the original split test, to avoid excessive alternate tree creation
- Limits the total number of alternate trees
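A sketch of the periodic scan. The slide does not give the exact tightened criterion, so the 2ε margin below, the max_alt cap, and the helpers (compare_attributes, gain, new_tree_rooted_at) are all assumptions, not CVFDT's actual rule.

```python
def check_split_validity(node, delta=1e-7, max_alt=5):
    if node.is_leaf():
        return
    # Compare the current split attribute against today's best one.
    x_new, x_cur, eps = node.compare_attributes(delta)   # assumed helper
    # A deliberately tighter criterion than the original split test,
    # so noise does not spawn an alternate tree at every node.
    if (x_new != x_cur
            and node.gain(x_new) - node.gain(x_cur) > 2 * eps
            and len(node.alternate_trees) < max_alt):
        node.alternate_trees.append(new_tree_rooted_at(x_new))  # assumed
    for child in node.children:
        check_split_validity(child, delta, max_alt)
    for alt in node.alternate_trees:
        check_split_validity(alt, delta, max_alt)
```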

Page 41: Mining Decision Trees from Data Streams

41

Smoothly adjust to concept drift

- Alternate trees are grown the same way HT is
- Periodically, each node with non-empty alternate trees enters a testing mode: M training examples are used to compare the accuracies
- Alternate trees whose accuracy does not increase over time are pruned
- If an alternate tree is more accurate, it replaces the old subtree

[Figure: the subtree rooted at Car Type = Sports Car has alternate subtrees rooted at Married? and Experience < 1 year?, each with Yes/No leaves; the root test Age < 30 is unchanged.]

Page 42: Mining Decision Trees from Data Streams

42

Adjust to concept drift (2)

Dynamically change the window size:
- Shrink the window when many nodes become questionable, or when the data rate changes rapidly
- Increase the window size when few nodes are questionable

Page 43: Mining Decision Trees from Data Streams

43

Performance

- Required memory: O(nodes × attributes × attribute values × classes); independent of the total number of examples
- Running time: O(L_c × attributes × attribute values × classes), where L_c is the length of the longest path an example takes, times the number of alternate trees
- The model learned by CVFDT vs. the one learned by VFDT-Window: similar in accuracy, at O(1) vs. O(window size) work per new example

Page 44: Mining Decision Trees from Data Streams

44

Experiment Results

- Compared CVFDT, VFDT, and VFDT-Window
- 5 million training examples; the concept changed every 50k examples
- Drift level: the average percentage of test points that change label at each concept change; about 8% of test points change label at each drift
- 100,000 examples in the window; 5% noise
- Tested the model every 10k examples throughout the run, and averaged these results

Page 45: Mining Decision Trees from Data Streams

45

Experiment Results (CVFDT vs. VFDT)

Error rate as a function of the number of attributes and of the drift level

Page 46: Mining Decision Trees from Data Streams

46

Experiment Results (CVFDT vs. VFDT)

Tree size as a function of the number of attributes

Page 47: Mining Decision Trees from Data Streams

47

Experiment Results (CVFDT vs. VFDT)

Error rates of the learners as a function of the number of examples seen, and the portion of the data set that is labelled negative

Page 48: Mining Decision Trees from Data Streams

48

Experiment Results (CVFDT vs. VFDT)

Error rates as a function of the amount of concept drift

Page 49: Mining Decision Trees from Data Streams

49

Experiment Results

CVFDT’s drift characteristics

Page 50: Mining Decision Trees from Data Streams

50

Experiment Results (CVFDT vs. VFDT vs. VFDT-window)

Error rates over time of CVFDT, VFDT, and VFDT-Window (VFDT-Window simulated by rerunning VFDT on the window W every 100k examples instead of for every example)

Error rate: VFDT 19.4%, CVFDT 16.3%, VFDT-Window 15.3%
Running time: VFDT 10 minutes, CVFDT 46 minutes, VFDT-Window an expected 548 days

Page 51: Mining Decision Trees from Data Streams

51

Experiment Results

CVFDT does not use too much RAM:
- With D = 50, CVFDT never uses more than 70 MB
- It uses as little as half the RAM of VFDT

VFDT often had twice as many leaves as the number of nodes in CVFDT's HT and alternate subtrees combined. Reason: VFDT considers many more outdated examples and is forced to grow larger trees to make up for its earlier wrong decisions under concept drift.

Page 52: Mining Decision Trees from Data Streams

52

Conclusions

- CVFDT: a decision-tree induction system capable of learning accurate models from high-speed, concept-drifting data streams
- Grows an alternate subtree whenever an old one becomes questionable, and replaces the old subtree when the new one is more accurate
- Similar in accuracy to applying VFDT to a moving window of examples

Page 53: Mining Decision Trees from Data Streams

53

Future Work

- Concepts that change periodically: removed subtrees may become useful again
- Comparisons with related systems
- Continuous attributes
- Weighting examples

Page 54: Mining Decision Trees from Data Streams

54

Reference List

- P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
- V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the Sixteenth International Conference on Data Engineering, 2000.
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree construction. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.

Page 55: Mining Decision Trees from Data Streams

55

The end

Q & A

Page 56: Mining Decision Trees from Data Streams

56

Thank You!