CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams
Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang
RIDE-SDMA '05
Advisor: Jia-Ling Koh
Speaker: Tsui-Feng Yen
Introduction
Partitioning
- k-means and k-medians algorithms do not emphasize finding arbitrary shapes in data streams
Density-based
- DBSCAN can find arbitrary shapes in data streams but needs to scan the database more than once
Cell-based (grid-based)
- CLIQUE has three problems
- high complexity
- high memory
- accuracy is not good with limited memory for changing data streams
Problem Definition
Domain A = {A1, A2, …, Ak}; S = A1 × A2 × … × Ak is a k-dimensional numerical space, with A1, A2, …, Ak as the dimensions (attributes) of S
A k-dimensional data stream X = {x1, x2, …, xn} is a set of ordered objects at time point t, where xi = <xi1, xi2, …, xik> and xij, the j-th component of xi, is drawn from domain Aj
Definition
Sliding window model on data stream X
- B1 is the most recent bucket and Bu is the oldest
- The window slides by creating a new bucket and discarding the oldest one
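A minimal sketch of this window, assuming buckets are closed after a fixed number of points (the slides do not say whether buckets are count-based or time-based). The class name `SlidingWindow` and the parameters `num_buckets` and `bucket_size` are hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Window of at most num_buckets buckets; buckets[0] plays the role of B1."""

    def __init__(self, num_buckets, bucket_size):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size  # assumption: count-based buckets
        self.buckets = deque()          # buckets[0] is B1, buckets[-1] is Bu

    def add(self, point):
        """Add a point; return the discarded oldest bucket, or None."""
        expired = None
        if not self.buckets or len(self.buckets[0]) == self.bucket_size:
            # The newest bucket is full: slide by creating a new bucket B1.
            self.buckets.appendleft([])
            if len(self.buckets) > self.num_buckets:
                expired = self.buckets.pop()  # discard the oldest bucket Bu
        self.buckets[0].append(point)
        return expired
```

Returning the expired bucket lets a caller subtract its points from any per-cell counts it maintains.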
Definition (cont.)
Partition P of data stream X
- P is a set of non-overlapping rectangular cells, obtained by partitioning every dimension of X into intervals of equal length
- Each cell C is the intersection of one interval from each dimension and is represented in the form c1, c2, …, ck
- A cell can also be denoted as (cNO1, cNO2, …, cNOk), named the coordinate of the cell, where cNOi is the interval number of the cell on the i-th dimension
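The mapping from a point to its cell coordinate can be sketched as follows. This assumes each dimension has known bounds and the same number of intervals, and numbers intervals from 0; the function name `cell_coordinate` and its parameters are hypothetical.

```python
def cell_coordinate(point, lows, highs, intervals):
    """Return (cNO1, ..., cNOk) for a k-dimensional point.

    lows/highs give the bounds of each dimension; intervals is the number
    of equal-length intervals per dimension (zero-based numbering assumed).
    """
    coord = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / intervals
        idx = int((x - lo) / width)
        # Clamp so that a value equal to the upper bound lands in the last interval.
        idx = min(max(idx, 0), intervals - 1)
        coord.append(idx)
    return tuple(coord)
```

For example, with bounds [0, 10] on both dimensions and 10 intervals, the point (2.5, 9.9) falls in cell (2, 9).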
Definition (cont.)
Selectivity pc of cell C
- The number of points that belong to C defines the selectivity pc of cell C
Clustering based on cells of data stream X in a sliding window
- If the selectivity of a cell is larger than a threshold τ, we call the cell dense
- A cluster is the largest set of cells that are adjacent and dense
- Two cells C1 and C2 are connected when they are neighboring, or when there exists a cell C3 such that C1 and C3 are neighboring and C2 and C3 are neighboring
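These definitions can be illustrated with a small sketch. Two things here are assumptions, not taken from the slides: cells count as neighboring when their coordinates differ by 1 in exactly one dimension, and connectivity is followed transitively through any chain of dense neighbors (a breadth-first search). The function names are hypothetical.

```python
from collections import Counter, deque


def dense_cells(cell_coords, tau):
    """Cells whose selectivity (point count) is larger than tau."""
    selectivity = Counter(cell_coords)
    return {c for c, n in selectivity.items() if n > tau}


def neighbors(cell):
    """Cells differing by 1 in exactly one dimension (assumed adjacency)."""
    for i in range(len(cell)):
        for d in (-1, 1):
            yield cell[:i] + (cell[i] + d,) + cell[i + 1:]


def clusters(dense):
    """Group dense cells into maximal connected sets with a BFS."""
    seen, result = set(), []
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            cell = queue.popleft()
            component.add(cell)
            for nb in neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(component)
    return result
```

Each input element is the cell coordinate of one point in the window, so a coordinate that appears more than τ times is dense.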
CDS-Tree
[Figure: CDS-Tree for the incoming data stream (2,3), (5,4), (6,5); labeled parts: root-node, mid, leaf, total-num-list]
Related Algorithms of CDS-Tree: CDS-Tree building algorithm
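The algorithm listing from this slide is not preserved in the transcript. As a rough illustration only, the sketch below assumes the index behaves like a prefix tree with one level per dimension, storing only non-empty cells and keeping a point count at each leaf. That layout, the class name `CellTree`, and its methods are assumptions; per-bucket bookkeeping and the total-num-list are left out because the transcript does not describe them.

```python
class CellTree:
    """Hypothetical prefix tree over cell coordinates (not the authors' listing)."""

    def __init__(self):
        self.root = {}  # nested dicts: one level per dimension

    def insert(self, coord):
        """Walk one level per dimension and increment the leaf count."""
        node = self.root
        for c in coord[:-1]:
            node = node.setdefault(c, {})
        node[coord[-1]] = node.get(coord[-1], 0) + 1

    def selectivity(self, coord):
        """Return the stored count for a cell, or 0 if the cell is empty."""
        node = self.root
        for c in coord[:-1]:
            node = node.get(c)
            if node is None:
                return 0
        return node.get(coord[-1], 0)
```

Because empty cells never get a node, memory grows with the number of occupied cells rather than with the full grid size.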
Related Algorithms of CDS-Tree: Clustering algorithm based on CDS-Tree
Granularity Adjustment
- The finer the partition, the higher the accuracy, but the more cells are created
- If the current memory cost Mp is far less than Mmax, we can use a finer-granularity partition for higher accuracy
- If the current memory cost Mp is close to Mmax, we should use a coarser partition to avoid memory overflow
Granularity Adjustment (cont.)
Safety factors (in case of exhausting memory)
- λ keeps the memory required from exceeding the memory limit Mmax when the granularity turns finer; here we set it larger than 1
- η decides the time point at which to adjust the granularity, where η is less than 1. For example, if η is set to 0.1, then when the remaining memory is less than 10% of Mmax the algorithm turns the granularity coarser to save memory
Granularity Adjustment Algorithm
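The listing for this slide is not preserved. A minimal sketch of a decision rule consistent with the two safety factors might look like the following. The exact conditions are assumptions: coarsen when the remaining memory drops below η·Mmax, and refine only when λ times an estimated memory cost at the finer granularity still fits in Mmax. The parameter `m_finer_estimate` and the default value of `lam` are hypothetical.

```python
def adjust_granularity(m_p, m_max, m_finer_estimate, lam=1.2, eta=0.1):
    """Return 'coarser', 'finer', or 'keep' (illustrative rule only).

    m_p: current memory cost Mp
    m_max: memory limit Mmax
    m_finer_estimate: assumed estimate of memory needed after refining
    lam: safety factor, larger than 1
    eta: adjustment threshold, less than 1
    """
    if m_max - m_p < eta * m_max:
        # Remaining memory is below the eta fraction of Mmax: save memory.
        return "coarser"
    if lam * m_finer_estimate <= m_max:
        # Even with the safety margin, the finer partition fits.
        return "finer"
    return "keep"
```

With Mmax = 100 and η = 0.1, a current cost of 95 leaves only 5 units free, which is under the 10-unit margin, so the rule coarsens.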
Experimental Results
OS: Microsoft Windows 2000; CPU: 2.5 GHz; RAM: 512 MB
Two datasets
- KDD-CUP-99 Network Intrusion Detection stream dataset
- Image Fourier Coefficient dataset