1 copyright jiawei han, modified by charles ling for cs411a course outline d introduction d data...

Copyright Jiawei Han, modified by Charles Ling for CS411a

Course Outline

Introduction Data warehousing and OLAP Data preprocessing for mining and warehousing Concept description: characterization and

discrimination Classification and prediction Association analysis Clustering analysis Mining complex data and advanced mining

techniques Trends and research issues

Data Mining and Warehousing: Session 7

Clustering Analysis

Clustering analysis

What is Clustering Analysis?

Clustering in Data Mining Applications

Handling Different Types of Variables

Major Clustering Techniques

Outlier Discovery

Problems and Challenges

What Is Clustering ?

Clustering is a process of partitioning a set of data (or objects)

into a set of meaningful sub-classes, called clusters.

May help users understand the natural grouping or

structure in a data set.

Cluster: a collection of data objects that are “similar” to one

another and thus can be treated collectively as one group.

Clustering: unsupervised classification: no predefined classes.

Used either as a stand-alone tool to get insight into data

distribution or as a preprocessing step for other algorithms.

What Is Good Clustering?

A good clustering method will produce high quality

clusters in which:

the intra-class (that is, intraintra-cluster) similarity is high.

the inter-class similarity is low.

The quality of a clustering result also depends on both the

similarity measure used by the method and its

implementation.

The quality of a clustering method is also measured by its

ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining

Scalability

Dealing with different types of attributes

Discovery of clusters with arbitrary shape

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Interpretability and usability.

Clustering analysis

Outlier Discovery

Applications of Clustering

Clustering has wide applications in

Pattern Recognition

Spatial Data Analysis:

– create thematic maps in GIS by clustering feature spaces

– detect spatial clusters and explain them in spatial data mining.

Image Processing

Economic Science (especially market research)

– Document classification

– Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

Land use: Identification of areas of similar land use in an earth observation database.

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.

City-planning: Identifying groups of houses according to their house type, value, and geographical location.

Clustering analysis

Outlier Discovery

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Some popular ones include: Minkowski distance:

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional

data objects, and q is a positive integer.

If q = 1, d is Manhattan distance.

If q = 2, d is Euclidean distance:)||...|||(|),( 22

11 pp jx

ixjid )||...|||(|),(

||...||||),(2211 pp jxixjxixjxixjid

Measure Similarity

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

Values should be scaled (normalized to 0-1) Weights should be associated with different variables based

on applications and data semantics. It is hard to define “similar enough” or “good enough”

the answer is typically highly subjective.

Binary, Nominal, Continuous variables

Binary variable: d = 0 of x=y; d=0 otherwise

Nominal variables: > 2 states, e.g., red, yellow, blue, green. Simple matching: u: # of matches, p: total # of variables.

Also, one can use a large number of binary variables.

Continuos variables: d = |x-y| Scaling and normalization

pupjid ),(

Clustering analysis

Outlier Discovery

Five Categories of Clustering Methods

Partitioning algorithms: Construct various partitions and

then evaluate them by some criterion.

Hierarchy algorithms: Create a hierarchical decomposition

of the set of data (or objects) using some criterion.

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the

clusters and the idea is to find the best fit of that model to

each other.

Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D

of n objects into a set of k clusters

Given a k, find a partition of k clusters that optimizes the

chosen partitioning criterion. Global optimal: exhaustively enumerate all partitions.

Heuristic methods: k-means and k-medoids algorithms.

k-means (MacQueen’67): Each cluster is represented by the center

of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman &

Rousseeuw’87): Each cluster is represented by one of the objects in

the cluster.

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps: Partition objects into k nonempty subsets Compute seed points as the centroids of the clusters of

the current partition. The centroid is the center (mean point) of the cluster.

Assign each object to the cluster with the nearest seed point.

Go back to Step 2, stop when no more new assignment.

0 1 2 3 4 5 6 7 8 9 10

0 1 2 3 4 5 6 7 8 9 100

0 1 2 3 4 5 6 7 8 9 10

Comments on the K-Means Method

Strength of the k-means: Relatively efficient: O(tkn), where n is # of objects, k is # of

clusters, and t is # of iterations. Normally, k, t << n. Often terminates at a local optimum.

Weakness of the k-means: Applicable only when mean is defined, then what about

categorical data? Need to specify k, the number of clusters, in advance. Unable to handle noisy data and outliers. Not suitable to discover clusters with non-convex shapes.

The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters To achieve this goal, only the definition of distance from

any two objects is needed. PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.

PAM works effectively for small data sets, but does not scale well for large data sets.

Two Types of Hierarchical Clustering Algorithms

Agglomerative (bottom-up): merge clusters iteratively. start by placing each object in its own cluster merge these atomic clusters into larger and larger clusters until all objects are in a single cluster. Most hierarchical methods belong to this category. They

differ only in their definition of between-cluster similarity. Divisive (top-down): split a cluster iteratively.

It does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces.

Divisive methods are not generally available, and rarely have been applied.

Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition.

Step 0 Step 1 Step 2 Step 3 Step 4

a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0

agglomerative(AGNES)

divisive(DIANA)

1 copyright jiawei han, modified by charles ling for cs411a course outline d introduction d data...

d insurance

d weights

d values

data mining d scalability

cs411a clustering analysis

objects d distances

d land use

cs411a measure similarity

Documents

video preprocessing

preprocessing and dimensionality...

03 preprocessing

spatial preprocessing

richmond v. jiawei technology

data preprocessing · data preprocessing mirek riedewald...

fourier transform jiawei chiu

real-time preprocessing for dense 3-d range imaging on the...

hong cheng jiawei han

data preprocessing d bm g data base and data mining group

image preprocessing

real-time preprocessing for dense 3-d range imaging on...

importance of data preprocessing for improving...

meg preprocessing

spatial and temporal data mining vasileios megalooikonomou...

data mining: concepts and techniques ©2006 jiawei han and...

jiawei yang master’s thesis

jiawei xu april 2009 - pitt

nonlinearity & preprocessing

d-impact: a data preprocessing algorithm to improve the...