clustering: large databases in data mining

5/29/2008, AI Lab @ UEC in Japan. Chapter 12, Clustering: Large Databases. Written by Farial Shahnaz. Presented by Zhao Xinyou. Data Mining Technology.

Upload: zhao-sam

Post on 14-Jun-2015


DESCRIPTION

This chapter describes the application of clustering algorithms to large databases.

TRANSCRIPT

Page 1: Clustering: Large Databases in data mining

5/29/2008 AI Lab @ UEC in Japan

Chapter 12

Clustering: Large Databases

Written by Farial Shahnaz

Presented by Zhao Xinyou

Data Mining Technology

Page 2: Clustering: Large Databases in data mining

Contents

Introduction

Ideas behind the three major approaches to scalable clustering {Divide-and-Conquer, Incremental, Parallel}

Three algorithms for scalable clustering {BIRCH, DBSCAN, CURE}

Application

Page 3: Clustering: Large Databases in data mining

Introduction: Common method

Common method for clustering: visit all the data in the database and analyze it.

Time: the computational complexity is O(n*n).

Memory: all the data must be loaded into main memory.

PP133

[Figure: Time/Memory plotted against Data; the cost becomes huge when the number of records reaches the millions]

Page 4: Clustering: Large Databases in data mining

Motivation: Clustering for large databases

[Figure: two plots of Time/Memory against Data, one growing as f(x): O(n*n) and one as f(x): O(n). Which method achieves the second?]

PP134

Page 5: Clustering: Large Databases in data mining

Requirements: Clustering for large databases

[Figure: the same two plots of Time/Memory against Data, f(x): O(n*n) and f(x): O(n). Which method achieves the second?]

PP134

No more (preferably less) than one scan of the database.
Process each record only once.
Work with limited memory.
Can suspend, stop, and resume.
Can update the results when new data is inserted or removed.
Can use different techniques to scan the database.
During execution, the method should provide status and the 'best' answer so far.

Page 6: Clustering: Large Databases in data mining

Major approaches for scalable clustering

Divide-and-Conquer approach
Parallel clustering approach
Incremental clustering approach

PP135

Page 7: Clustering: Large Databases in data mining

Divide-and-Conquer approach

Definition. Divide-and-conquer is a problem-solving approach in which we:

divide the problem into sub-problems,
recursively conquer (solve) each sub-problem, and
then combine the sub-problem solutions to obtain a solution to the original problem.

PP135

Key assumptions:
1. Problem solutions can be constructed from sub-problem solutions.
2. Sub-problem solutions are independent of one another.

9*9 Sudoku

Page 8: Clustering: Large Databases in data mining

Parallel clustering approach

Idea: Divide the data into small sets and then process each small set on a different machine (derived from Divide-and-Conquer).

PP136-137
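The idea of giving each machine its own small set can be illustrated with a rough sketch. Everything specific here is an assumption for illustration and not taken from the slides: threads stand in for separate machines, the data is 1-D, and the work being shared is one k-means update step. The names `partial_sums` and `parallel_kmeans_step` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk, centers):
    """Work done by one 'machine': assign its values to the nearest center
    and return a (count, sum) pair per center."""
    stats = [[0, 0.0] for _ in centers]
    for v in chunk:
        j = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
        stats[j][0] += 1
        stats[j][1] += v
    return stats


def parallel_kmeans_step(values, centers, workers=3):
    """One k-means update in which every chunk is processed by its own worker."""
    size = -(-len(values) // workers)  # ceiling division: chunk size
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Threads are only a stand-in for the separate machines of the slide.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda ch: partial_sums(ch, centers), chunks))
    # Combine the per-chunk summaries into the new centers.
    new_centers = []
    for j, old in enumerate(centers):
        count = sum(r[j][0] for r in results)
        total = sum(r[j][1] for r in results)
        new_centers.append(total / count if count else old)
    return new_centers


updated = parallel_kmeans_step([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 9.0])
```

Only the small per-center summaries travel back to the coordinator, which is what makes this kind of split attractive when the raw data does not fit on one machine.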

Page 9: Clustering: Large Databases in data mining

Explanation about Divide-and-Conquer

Records: n. Aim: k clusters.

[Figure: Divide: the n records are split into p partitions (P0, P1, ..., Pi) of n/p records each. Conquer: each partition is clustered into k clusters, giving kp clusters in total. Merging: the kp clusters are merged into the final k clusters.]

The divide step can use any suitable algorithm, and so can the conquer step.
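The divide, conquer, and merging stages of this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from the book: the data is 1-D, the clustering algorithm used in both the conquer and the merging stage is a tiny deterministic k-means, and the names `kmeans_1d` and `divide_and_conquer_cluster` are hypothetical.

```python
def kmeans_1d(values, k, iters=10):
    """Tiny 1-D k-means with deterministic, evenly spaced starting centers."""
    ordered = sorted(values)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old center when a group ends up empty.
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def divide_and_conquer_cluster(values, k, p):
    """Cluster n values into k centers via p partitions."""
    # Divide: p partitions of about n/p records each.
    size = -(-len(values) // p)  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    # Conquer: k centers per partition, so up to k*p partial centers.
    partial = []
    for part in partitions:
        partial.extend(kmeans_1d(part, k))
    # Merging: cluster the partial centers down to the final k.
    return sorted(kmeans_1d(partial, k))


data = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8, 1.1, 0.9, 10.1, 9.9, 1.0, 10.0]
final_centers = divide_and_conquer_cluster(data, k=2, p=3)
```

Each partition is small enough to be clustered on its own, and the merging stage only ever sees kp values instead of all n records.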

Page 10: Clustering: Large Databases in data mining

Application

Sorting: quicksort and merge sort
Fast Fourier transforms
Tower of Hanoi puzzle
Matrix multiplication
…..
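Merge sort, one of the applications listed on this slide, shows the divide, conquer, and combine steps in a few lines. The following is a plain textbook sketch written for illustration; it is not code from the slides.

```python
def merge_sort(items):
    """Sort a list with divide-and-conquer."""
    if len(items) <= 1:               # a list of 0 or 1 items is already sorted
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])    # divide, then conquer the left half
    right = merge_sort(items[mid:])   # divide, then conquer the right half
    # Combine: merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The two halves are sorted independently of each other, which matches the assumption that sub-problem solutions do not depend on one another.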

PP135

Page 11: Clustering: Large Databases in data mining

CURE: Divide-and-Conquer

1. Get the size n of the data set D and partition D into p groups (each containing n/p elements).

2. Cluster each group pi into k clusters, using a heap and a k-d tree.

3. Delete unrelated nodes from the heap and the k-d tree.

4. Cluster the partial clusters to obtain the final clusters.

PP140-141

Page 12: Clustering: Large Databases in data mining

Heap

PP140-141

Page 13: Clustering: Large Databases in data mining

k-D Tree

Technically, the letter k refers to the number of dimensions

PP140-141

3-dimensional kd-tree
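As a rough illustration of the structure, the sketch below builds a k-d tree by splitting on the median and cycling through the k dimensions, then answers a nearest-neighbour query. It is a generic textbook construction under assumed names (`Node`, `build_kdtree`, `nearest`, `sq_dist`), not the exact structure used by CURE.

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Build a k-d tree; k is the number of dimensions of the points."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                       # cycle through the dimensions
    ordered = sorted(points, key=lambda pt: pt[axis])
    mid = len(ordered) // 2                # the median point becomes this node
    return Node(ordered[mid], axis,
                build_kdtree(ordered[:mid], depth + 1),
                build_kdtree(ordered[mid + 1:], depth + 1))


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the other side only if the splitting plane is closer than the best hit.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3, 1), (5, 4, 7), (9, 6, 2), (4, 7, 9), (8, 1, 5), (7, 2, 6)]
tree = build_kdtree(points)
closest = nearest(tree, (8, 1, 4))
```

Pruning the far side of a split is what lets a nearest-cluster lookup avoid comparing against every stored point.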

Page 14: Clustering: Large Databases in data mining

k-D Tree

PP140-141

Page 15: Clustering: Large Databases in data mining

CURE: Divide-and-Conquer

PP140-141

[Figure: the nearest clusters are found and merged, and the step is repeated]

Page 16: Clustering: Large Databases in data mining

Incremental clustering approach

Idea: scan the data in the database one record at a time and compare each record with the existing clusters. If a similar cluster is found, assign the record to that cluster; otherwise, create a new cluster. Continue until no data remains.

Steps:
S = {};  // set of clusters, initially empty
do {
    read one record d;
    r = find_similar_cluster(d, S);
    if (r exists)
        assign d to the cluster r;
    else
        add_cluster(d, S);
} until (no record is left in the database);

PP135-136
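The steps of the incremental approach can be turned into a small runnable sketch. The slides do not say how similarity is measured, so two assumptions are made here and should be read as such: a record is "similar" to a cluster when its Euclidean distance to the cluster centroid is at most a hypothetical `threshold`, and the centroid is kept as a running mean. The function name `incremental_cluster` is also hypothetical.

```python
import math


def incremental_cluster(records, threshold):
    """Single pass over the records; each record is processed exactly once."""
    clusters = []                                  # S = {}
    for d in records:                              # read one record d
        best, best_dist = None, math.inf
        for c in clusters:                         # find the most similar cluster
            dist = math.dist(d, c["centroid"])
            if dist < best_dist:
                best, best_dist = c, dist
        if best is not None and best_dist <= threshold:
            best["members"].append(d)              # assign d to cluster r
            n = len(best["members"])
            # Running-mean update, so old members need not be revisited.
            best["centroid"] = [m + (x - m) / n
                                for m, x in zip(best["centroid"], d)]
        else:
            clusters.append({"centroid": list(d), "members": [d]})  # new cluster
    return clusters


result = incremental_cluster(
    [(0, 0), (0.5, 0), (10, 10), (10, 10.5), (0, 0.5)], threshold=2.0)
```

Because the result depends on the order in which records arrive, a different scan order can produce different clusters; that is a known property of this family of methods.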

Page 17: Clustering: Large Databases in data mining

Application: Incremental clustering approach

BIRCH

Balanced Iterative Reducing and Clustering using Hierarchies

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Page 18: Clustering: Large Databases in data mining

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

Using a distance measure, BIRCH computes the similarity between each record and the existing clusters and forms the clusters accordingly.

[Figure labels: Inner Cluster, Among Clusters]

PP137-138

Page 19: Clustering: Large Databases in data mining

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

[Figure labels: Inner Cluster, Among Clusters]

PP137-138

Page 20: Clustering: Large Databases in data mining

Related Definitions

Cluster: {xi}, where i = 1, 2, …, N

CF (Clustering Feature): a triple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
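A short sketch may help show why the CF triple is useful: two CFs can be merged by simple addition, and the centroid and radius of a cluster can be computed from the triple alone, without keeping the points. The sketch assumes SS is the scalar sum of squared norms of the points; the helper names (`cf_of`, `cf_merge`, `cf_centroid`, `cf_radius`) are hypothetical.

```python
import math


def cf_of(points):
    """Clustering Feature (N, LS, SS) of a list of points."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[i] for p in points) for i in range(dim)]  # linear sum per dimension
    ss = sum(x * x for p in points for x in p)            # sum of squared norms
    return (n, ls, ss)


def cf_merge(a, b):
    """CFs are additive: the CF of a union is the component-wise sum."""
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def cf_centroid(cf):
    n, ls, _ = cf
    return [x / n for x in ls]


def cf_radius(cf):
    """Root mean squared distance of the points from the centroid,
    via R^2 = SS/N - |LS/N|^2."""
    n, ls, ss = cf
    squared = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(squared, 0.0))   # guard against tiny negative round-off


a = cf_of([(1, 2), (3, 4)])
b = cf_of([(5, 6)])
merged = cf_merge(a, b)
```

Here `merged` equals the CF computed directly from all three points, which is the property that lets BIRCH summarise a cluster in constant space.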

Page 21: Clustering: Large Databases in data mining

Related Definitions

CF tree = (B, T)

B = (CFi, childi) if the node is an internal node

B = (CFi, prev, next) if the node is an external (leaf) node

T: threshold for all leaf nodes; each leaf entry must satisfy mean distance D < T

Page 22: Clustering: Large Databases in data mining

Algorithm for BIRCH

Page 23: Clustering: Large Databases in data mining

DBSCAN

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Ex1: We want to classify the houses along a river in a spatial photo.

Ex2:

Page 24: Clustering: Large Databases in data mining

Definition for DBSCAN

Eps-neighborhood of a point

The Eps-neighborhood of a point p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}

Minimum number (MinPts)

MinPts is the minimum number of data points in any cluster.

Page 25: Clustering: Large Databases in data mining

Definition for DBSCAN

Directly density-reachable

A point p is directly density-reachable from a point q with respect to Eps and MinPts if

1): p ∈ NEps(q);

2): |NEps(q)| ≥ MinPts;
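The two conditions translate almost directly into code. The sketch below assumes Euclidean distance for dist and uses hypothetical function names; note that a point counts as a member of its own Eps-neighborhood, because dist(p, p) = 0 ≤ Eps.

```python
import math


def eps_neighborhood(data, p, eps):
    """NEps(p): every point q in the data set with dist(p, q) <= eps."""
    return [q for q in data if math.dist(p, q) <= eps]


def directly_density_reachable(data, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    neighborhood = eps_neighborhood(data, q, eps)
    return p in neighborhood and len(neighborhood) >= min_pts


data = [(0, 0), (0, 1), (1, 0), (2, 0)]
forward = directly_density_reachable(data, (2, 0), (1, 0), eps=1.5, min_pts=3)
backward = directly_density_reachable(data, (1, 0), (2, 0), eps=1.5, min_pts=3)
```

With this data `forward` is true and `backward` is false: (1, 0) has four points within Eps, while (2, 0) has only two, so the relation is not symmetric.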

Page 26: Clustering: Large Databases in data mining

Definition for DBSCAN

Density-reachable

A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p1, p2, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi;


Page 28: Clustering: Large Databases in data mining

Algorithm of DBSCAN

Input: D = {t1, t2, …, tn}, MinPts, Eps

Output: K = {K1, K2, …, Kk}

k = 0;
for i = 1 to n do
    if ti is not in a cluster then
        X = {tj | tj is density-reachable from ti};
        if X is a valid cluster then
            k = k + 1;
            Kk = X;
        end if
    end if
end for
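A compact runnable version of this loop might look like the sketch below. It is an illustration under stated assumptions rather than a reference implementation: dist is taken to be Euclidean, "X is a valid cluster" is interpreted as "ti has at least MinPts points in its Eps-neighborhood", neighborhoods are found by brute force, and the function name `dbscan` and its return format are hypothetical.

```python
import math


def dbscan(data, eps, min_pts):
    """Return (clusters, labels); clusters hold indices into data and
    labels[i] is a cluster number, or None for noise."""
    def neighbors(i):
        return [j for j in range(len(data))
                if math.dist(data[i], data[j]) <= eps]

    labels = [None] * len(data)
    clusters = []
    for i in range(len(data)):
        if labels[i] is not None:          # ti is already in a cluster
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # X would not be a valid cluster
            continue
        k = len(clusters)
        members = []
        queue = list(seeds)                # collect everything density-reachable from ti
        while queue:
            j = queue.pop()
            if labels[j] is not None:
                continue
            labels[j] = k
            members.append(j)
            reach = neighbors(j)
            if len(reach) >= min_pts:      # only dense points extend the chain
                queue.extend(reach)
        clusters.append(sorted(members))
    return clusters, labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (50, 50)]
clusters, labels = dbscan(points, eps=1.5, min_pts=3)
```

The brute-force neighbor search makes this O(n*n); a spatial index would be the usual way to reduce that cost on a large database.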