CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang RIDE-SDMA’05 Advisor Jia-Ling Koh Speaker Tsui-Feng Yen

Post on 17-Dec-2015
CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams

Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang

RIDE-SDMA'05

Advisor: Jia-Ling Koh

Speaker: Tsui-Feng Yen

Introduction

Partitioning
-k-means and k-medians algorithms do not emphasize finding arbitrary shapes in data streams

Density-based
-DBSCAN can find arbitrary shapes in data streams but needs to scan the database more than once

Cell-based (Grid-based)
-CLIQUE has three problems
-high complexity
-high memory usage
-poor accuracy with limited memory for changing data streams

Problem Definition

Domain: A = {A1, A2, …, Ak}. Let S = A1 × A2 × … × Ak be a k-dimensional numerical space, with A1, A2, …, Ak as the dimensions (attributes) of S.

A k-dimensional data stream X = {x1, x2, …, xn} is a set of ordered objects at time point t, where xi = <xi1, xi2, …, xik> and xij, the j-th component of xi, is drawn from domain Aj.

Definition

Sliding window model on data stream X
-B1 is the most recent bucket and Bu is the oldest
-The window slides by creating a new bucket and discarding the oldest one
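
The bucket-based sliding window can be sketched as follows. This is a minimal illustration only: the slides do not say whether buckets are bounded by point count or by time, so the count-based `bucket_size`, the class name `SlidingWindow`, and the `add` method are all assumptions made for the example.

```python
from collections import deque


class SlidingWindow:
    """Window of at most `num_buckets` buckets; buckets[0] plays the role of B1
    (most recent) and buckets[-1] the role of Bu (oldest).

    Assumption: a bucket is "full" after `bucket_size` points (count-based).
    """

    def __init__(self, num_buckets, bucket_size):
        self.bucket_size = bucket_size
        # With maxlen set, appendleft() on a full deque drops the rightmost
        # (oldest) bucket automatically.
        self.buckets = deque(maxlen=num_buckets)

    def add(self, point):
        """Insert a point. Returns the discarded bucket if the window slid,
        otherwise None."""
        expired = None
        if not self.buckets or len(self.buckets[0]) >= self.bucket_size:
            if len(self.buckets) == self.buckets.maxlen:
                expired = self.buckets[-1]  # about to be discarded
            self.buckets.appendleft([])     # create the new bucket B1
        self.buckets[0].append(point)
        return expired


window = SlidingWindow(num_buckets=2, bucket_size=2)
expired_log = [window.add(p) for p in [1, 2, 3, 4, 5]]
```

Returning the expired bucket lets the caller subtract its points from whatever per-cell statistics are being maintained.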

Definition (cont.)

Partition P of data stream X
-P is a set of non-overlapping rectangular cells, obtained by partitioning every dimension of X into intervals of equal length
-Each cell C is the intersection of one interval from each dimension; it is represented in the form c1, c2, …, ck
-A cell can also be denoted as (cNO1, cNO2, …, cNOk), called the coordinate of the cell, where cNOi is the interval number of the cell on the i-th dimension
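
As an illustration of the cell coordinate, the sketch below maps a point to its tuple of interval numbers under an equal-length partition. The function name `cell_coordinate`, the explicit per-dimension bounds `lows` and `highs`, the 0-based interval numbering, and the clamping of boundary points into the last interval are assumptions of this example, not details given in the slides.

```python
def cell_coordinate(point, lows, highs, intervals):
    """Return the cell coordinate (cNO1, ..., cNOk) of `point`.

    Each dimension i is split into `intervals` equal-length intervals over
    [lows[i], highs[i]]. Interval numbers are 0-based here (an assumption).
    """
    coord = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / intervals
        idx = int((x - lo) // width)
        # Clamp so that x == hi (or slightly out-of-range values) still maps
        # to a valid interval.
        idx = min(max(idx, 0), intervals - 1)
        coord.append(idx)
    return tuple(coord)


coord_a = cell_coordinate((3.0, 9.5), lows=(0.0, 0.0), highs=(10.0, 10.0), intervals=5)
coord_b = cell_coordinate((10.0, 0.0), lows=(0.0, 0.0), highs=(10.0, 10.0), intervals=5)
```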

Definition (cont.)

Selectivity pc of cell C
-The number of points that belong to C defines the selectivity pc of cell C

Clustering based on cells: data stream X in a sliding window
-If the selectivity of a cell is larger than a threshold τ, we call the cell dense
-A cluster is the largest set of cells that are adjacent and dense
-Two cells C1 and C2 are connective when they are neighboring, or when there exists a cell C3 such that C1 and C3 are neighboring and C2 and C3 are neighboring
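
The definitions above can be sketched as a two-step procedure: count points per cell to find dense cells, then group dense cells into maximal connected sets. This is not the paper's clustering algorithm, only a plain breadth-first search over a dictionary of counts. Two assumptions are made for the example: cells are "neighboring" when their coordinates differ by exactly 1 in one dimension, and connectivity is applied transitively through any chain of dense neighbors.

```python
from collections import Counter, deque


def dense_cells(cell_coords, tau):
    """Cells whose selectivity (point count) is larger than the threshold tau."""
    counts = Counter(cell_coords)
    return {cell for cell, n in counts.items() if n > tau}


def neighbors(cell):
    """Assumed neighborhood: cells differing by 1 in exactly one dimension."""
    for i in range(len(cell)):
        for delta in (-1, 1):
            yield cell[:i] + (cell[i] + delta,) + cell[i + 1:]


def clusters(dense):
    """Group dense cells into maximal connected sets using BFS."""
    seen = set()
    result = []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for nb in neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        result.append(component)
    return result


coords = [(0, 0)] * 3 + [(0, 1)] * 3 + [(5, 5)] * 3 + [(2, 2)]
found = clusters(dense_cells(coords, tau=2))
```

In the example, cell (2, 2) holds a single point and is dropped as sparse, while (0, 0) and (0, 1) merge into one cluster.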

CDS-Tree

[Figure: example CDS-Tree for the incoming data stream (2,3), (5,4), (6,5); labeled parts: root-node, mid, leaf, total-num-list]
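
The figure itself is not recoverable from the transcript, so the following is only one plausible reading of the labels, not the authors' data structure. It sketches a trie with one level per dimension that stores non-empty cells only; each leaf keeps a list of per-bucket counts, which is a guess at what the total-num-list label refers to. The class name `CellTrie` and its methods are hypothetical.

```python
class CellTrie:
    """Trie over cell coordinates: level i is keyed by the interval number on
    dimension i. Leaves hold per-bucket point counts (index 0 = newest)."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.root = {}

    def insert(self, coord):
        """Count one point falling in cell `coord`, in the newest bucket."""
        node = self.root
        for c in coord[:-1]:
            node = node.setdefault(c, {})
        counts = node.setdefault(coord[-1], [0] * self.num_buckets)
        counts[0] += 1

    def selectivity(self, coord):
        """Number of points in cell `coord` over the whole window."""
        node = self.root
        for c in coord[:-1]:
            node = node.get(c)
            if node is None:
                return 0
        counts = node.get(coord[-1])
        return sum(counts) if counts else 0

    def slide(self):
        """Drop the oldest bucket's counts, open a new empty newest bucket,
        and prune cells and branches that became empty."""
        def walk(node):
            for key in list(node):
                child = node[key]
                if isinstance(child, dict):
                    walk(child)
                    if not child:
                        del node[key]
                else:
                    child.pop()         # discard the oldest bucket's count
                    child.insert(0, 0)  # new, empty newest bucket
                    if sum(child) == 0:
                        del node[key]
        walk(self.root)


tree = CellTrie(num_buckets=2)
for cell in [(2, 3), (5, 4), (6, 5), (2, 3)]:
    tree.insert(cell)
before = tree.selectivity((2, 3))
tree.slide()
after_one = tree.selectivity((2, 3))
tree.slide()
after_two = tree.selectivity((2, 3))
```

Storing only non-empty cells keeps memory proportional to the occupied part of the grid rather than to the full number of cells.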

Related Algorithms of CDS-Tree: CDS-Tree building algorithm

Related Algorithms of CDS-Tree: Clustering algorithm based on CDS-Tree

Granularity Adjustment

-The finer the partition, the higher the accuracy, but the more cells are created
-If the current memory cost Mp is far less than Mmax, we can use a finer-granularity partition for higher accuracy
-If the current memory cost Mp is close to Mmax, we should use a coarser partition to avoid memory overflow

Granularity Adjustment (cont.)

Safety factors (in case of exhausting memory)
-λ is used to keep the required memory from exceeding the memory limit Mmax when the granularity turns finer; here we set it larger than 1
-η decides the time point at which to adjust the granularity, where η is less than 1. For example, if η is set to 0.1, then when the remaining memory is less than 10% of Mmax the algorithm turns the granularity coarser to save memory
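
A hedged sketch of how the two factors might combine into a decision rule is given below. The slides do not spell out the exact test for turning finer, so the condition `lam * m_finer_estimate <= m_max`, the parameter `m_finer_estimate` (a caller-supplied estimate of memory after refinement), the default `lam=1.2`, and the function name are all assumptions; only the η rule follows the 10% example directly.

```python
def adjust_granularity(m_p, m_max, m_finer_estimate, lam=1.2, eta=0.1):
    """Decide how to change the partition granularity.

    m_p              -- current memory cost (Mp)
    m_max            -- memory limit (Mmax)
    m_finer_estimate -- hypothetical estimate of memory needed after refining
    lam              -- safety factor, larger than 1
    eta              -- adjustment threshold, less than 1
    """
    remaining = m_max - m_p
    if remaining < eta * m_max:
        # Less than eta * Mmax memory left: coarsen to save memory.
        return "coarser"
    if lam * m_finer_estimate <= m_max:
        # Even with the safety margin, the finer partition fits in memory.
        return "finer"
    return "keep"


decision_low_memory = adjust_granularity(m_p=95, m_max=100, m_finer_estimate=200)
decision_plenty = adjust_granularity(m_p=20, m_max=100, m_finer_estimate=40)
decision_tight = adjust_granularity(m_p=50, m_max=100, m_finer_estimate=90)
```

Checking the coarsening condition first means a nearly full memory always wins over an optimistic refinement estimate.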

Granularity Adjustment Algorithm

Experimental Results

OS: Microsoft Windows 2000; CPU: 2.5 GHz; RAM: 512 MB

Two datasets
-KDD-CUP-99 Network Intrusion Detection stream dataset
-Image Fourier Coefficient dataset

Experimental Results

Experimental Results
