
Page 1: Knowledge discovery & data mining  Clustering & customer segmentation

Knowledge discovery & data mining

Clustering & customer segmentation

Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/

A tutorial @ EDBT2000


Konstanz, 27-28.3.2000 EDBT2000 tutorial - Clust 2

Module outline

- The clustering task
- Application to a case-study in customer segmentation
- Quick overview of main clustering techniques


Problem 1: customer segmentation

Given: a large database of customer data containing their properties and past buying records.
Goal:
- find groups of customers with similar behavior
- find customers with unusual behavior


Problem 2: CAD databases

Given: a large database of CAD data containing abstract feature vectors.
Goal:
- find homogeneous groups of similar CAD parts
- determine standard parts for each group
- use standard parts instead of special parts, reducing the number of parts to be produced


Clustering: general problem description

Given: a data set with N d-dimensional data items.
Find: a natural partitioning of the data set into a number of clusters (k) and noise.
Clusters should be such that:
- items in the same cluster are similar (intra-cluster similarity is maximized)
- items from different clusters are different (inter-cluster similarity is minimized)


Clustering = unsupervised classification

No predefined classes!

Used either as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.
Helps to understand how objects in a data set tend to naturally group together.

Cluster: "a number of things of the same kind being close together in a group"

(Longman dictionary of contemporary English)


Research perspective

From the past… clustering is a well-known problem in statistics, with more recent research in machine learning, databases, and visualization.
…to the future: effective and efficient clustering algorithms for large, high-dimensional data sets with high noise require scalability with respect to:
- the number of data points (N)
- the number of dimensions (d)
- the noise level


Basic methods

- Partitioning methods: k-means, k-medoids
- Hierarchical methods: agglomerative/divisive, BIRCH, CURE
- Linkage-based methods
- Density-based methods: DBSCAN, DENCLUE
- Statistical methods: IBM-IM demographic clustering, COBWEB


A small example

Source: DM@CINECA, http://open.cineca.it/datamining/dmCineca/index.htm
Domain: technology watch, a.k.a. competitive intelligence.
- Which are the emerging technologies?
- Which of my competitors are investing in them?
- In which areas are my competitors active?
- Which areas will my competitors drop in the near future?
Data sources: public (on-line) databases.


The Derwent database

- Contains all patents filed worldwide in the last 10 years.
- Searching this database by keywords may yield thousands of documents.
- Derwent documents are semi-structured: many long text fields.
- Goal: analyze Derwent documents to build a model of competitors' strategy.


Structure of Derwent documents


Example dataset

Patents in the area of patch technology (medical plasters):
- 105 companies from 12 countries
- 94 classification codes
- 52 Derwent codes


Clustering output


Zoom on cluster 2


Zoom on cluster 2 - profiling competitors


Activity of competitors in the clusters


Temporal analysis of clusters


Module outline

- The clustering task
- Application to a case-study in customer segmentation
- Quick overview of main clustering techniques


A case-study on customer segmentation

Source: Gary Saarenvirta, "Mining customer data". DB2 Magazine online, 1998.
http://www.db2mag.com/98fsaar.html

Gratefully acknowledged!


Customer segmentation

Task: use customer-purchase transaction data to:
- track buying behavior
- create strategic business initiatives
- divide customers into segments based on shareholder value variables: customer profitability, measure of risk, measure of the lifetime value of a customer, retention probability


Attractive customer segments

High-profit, high-value, and low-risk customers:
- highly attractive: typically the 10% to 20% of customers who create 50% to 80% of a company's profits
- the strategic initiative for this segment is retention
Low-profit, high-value, and low-risk customers:
- attractive; the strategic initiative for this segment is to increase profitability through cross-selling (selling new products) and up-selling (selling more of what customers currently buy)


Behavioral vs. demographic segments

Within behavioral segments: demographic subsegments.

Customer demographic data are not used to create segments.

Demographic (sub)segmenting is used to select appropriate tactics (advertising, marketing channels, and campaigns) to implement the strategic initiatives.


AIR MILES

AIR MILES is a Canadian reward program for a coalition of more than 125 companies in all industry sectors: finance, credit card, retail, grocery, gas, telecom.
60% of Canadian households are enrolled.
It is a frequent-shopper program: consumers collect bonuses that can be redeemed for rewards (air travel, hotel accommodation, theatre/sport tickets, …).


Data capture

The coalition partners capture consumer transactions and transmit them to AIR MILES, which stores these transactions and uses the data for database marketing initiatives.
At the date of the project, the AIR MILES data warehouse contained more than 6.3 million household records and 1 billion transaction records.


Before data mining

Analytical techniques used:
- Recency, Frequency, Monetary value (RFM) analysis
- online analytic processing tools
- linear statistical methods
to analyze the success of the various marketing initiatives undertaken by the coalition.


Data mining project at AMRP

Goal: create a customer segmentation using a data mining tool and compare the results to an existing segmentation developed using RFM analysis.
Data mining platform:
- IBM DB2 Universal Database Enterprise, parallelized over a five-node RS/6000 SP system
- IBM Intelligent Miner for Data


Dataset

~ 50,000 customers and their associated transactions for a 12-month period.


Data preparation

"Shareholder value" variables:
- revenue
- customer tenure
- number of sponsor companies shopped at over the customer tenure
- number of sponsor companies shopped at over the last 12 months
- recency (in months) of the last transaction
These were calculated by aggregating the transaction data and then adding them to each customer record.
84 variables = 14 categories of sponsor companies × 3 variables per category × 2 quarters (first two quarters of 1997).


Data cleaning - missing values

Demographic data is usually categorical and has a high percentage of missing values. The missing values can be set to either "unknown" or "unanswered" (if the result of unanswered questions). If a large portion of an attribute is missing, it may be discarded. In the case study, missing numeric values were set to 0.


Data transformation

Ratio variables. E.g.: profitability = profit / tenure

Time-derivative variables. E.g.: profit 2nd quarter - profit 1st quarter

Discretization using quantiles. E.g., break points at the 10th, 25th, 50th, 75th, and 90th percentiles.

Discretization using predefined ranges. E.g., those used in census

Log transforms. E.g., for very skewed distributions
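These transformations are straightforward to script; the sketch below (NumPy; the column values are invented for illustration) shows each of them:

```python
import numpy as np

profit = np.array([120.0, 15.0, 330.0, 45.0, 8.0])   # per-customer profit (invented)
tenure = np.array([24.0, 6.0, 60.0, 12.0, 3.0])      # months as a customer (invented)

# Ratio variable: profitability = profit / tenure
profitability = profit / tenure

# Time-derivative variable: change between two quarters
q1 = np.array([50.0, 5.0, 130.0, 20.0, 4.0])
q2 = np.array([70.0, 10.0, 170.0, 25.0, 4.0])
profit_delta = q2 - q1

# Discretization using quantiles: break points at the 10/25/50/75/90th percentiles
breaks = np.percentile(profitability, [10, 25, 50, 75, 90])
bins = np.digitize(profitability, breaks)            # quantile bucket (0..5) per customer

# Log transform for a very skewed distribution; log1p keeps zero values finite
log_profit = np.log1p(profit)
```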


Distribution of original data


Distribution of discretized data


Before/after discretization


Clustering/segmentation methodology


IBM-IM demographic clustering

Designed for categorical variables.
Similarity index:
- increases with the number of common values on the same attribute
- decreases with the number of different values on the same attribute
The number of clusters is not fixed a priori; only an upper bound is set.


Demographic clustering: data structure


Demographic clustering: parameters


Demographic clustering: similarity index

The index is:
- proportional to 1-1 matches (# of common 1's)
- inversely proportional to 0-1 and 1-0 mismatches
- unaffected by 0-0 matches (# of common 0's)
Condorcet index = N11 / (N11 + 1/2 (N01 + N10))
Dice index = N11 / (N11 + 1/4 (N01 + N10))
Dice is looser than Condorcet, and appropriate with highly different objects. Both indexes generalize from objects to clusters.
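The two indices can be computed directly from the match/mismatch counts. A minimal sketch (function and variable names are mine, not Intelligent Miner's API):

```python
def pair_counts(x, y):
    """Count 1-1 matches and 0-1/1-0 mismatches between two binary vectors.
    0-0 matches are deliberately ignored, as in the indices below."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    return n11, n01, n10

def condorcet(x, y):
    n11, n01, n10 = pair_counts(x, y)
    return n11 / (n11 + 0.5 * (n01 + n10))

def dice(x, y):
    n11, n01, n10 = pair_counts(x, y)
    return n11 / (n11 + 0.25 * (n01 + n10))

x = [1, 1, 0, 1, 0, 0]
y = [1, 0, 0, 1, 1, 0]
```

Since mismatches are penalized half as much, Dice never falls below Condorcet on the same pair, which is why it is the "looser" of the two.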


Demographic clustering: similarity index

Similarity threshold: i and j are assumed similar if s(i,j) exceeds the threshold; low values (< 0.5) are appropriate with highly different objects.
Weights for attributes: the importance of attributes in the similarity index may be varied with different weights; the default weight is 1.


IM Demographic clustering

Basic parameters to set:
- Maximum number of clusters.
- Maximum number of passes through the data.
- Accuracy: a stopping criterion. If the change in the Condorcet criterion between data passes is smaller than the accuracy (as a percentage), the algorithm terminates.
In the case study: maximum # of clusters: 9; maximum # of passes: 5; accuracy: 0.1.


Input dataset

Dataset: all continuous variables discretized.
Input variables:
- # of products purchased over the customer's lifetime
- # of products purchased in the last 12 months
- customer's revenue contribution over lifetime
- customer tenure in months
- ratio of revenue to tenure
- ratio of number of products to tenure
- region
- recency
- tenure (# of months since the customer first enrolled in the program)


Input dataset

Other discrete and categorical variables, and some interesting continuous variables, were input as supplementary variables: variables used to profile the clusters but not to define them. This allows easier interpretation of clusters using data other than the input variables.


Output of demographic clustering


Visualization of clusters

Each horizontal strip is a cluster. Clusters are ordered from top to bottom by size. Variables are ordered from left to right by importance to the cluster, based on a chi-square test between the variable and the cluster ID. Other ordering metrics include entropy, the Condorcet criterion, and database order.


Visualization of clusters

Variables used to define the clusters appear without brackets; supplementary variables appear within brackets.
Numeric (integer), discrete numeric (small integer), binary, and continuous variables have their frequency distribution shown as a bar graph:
- red bars = distribution of the variable within the current cluster
- gray solid bars = distribution of the variable in the whole universe


Visualization of clusters

Categorical variables are shown as pie charts.

inner pie = distribution of the categories for the current cluster

outer ring = distribution of the variable for the entire universe.

The more different the cluster distribution is from the average, the more interesting or distinct the cluster.


Qualitative characterization of clusters

Gold98 is a binary variable that flags the best customers in the database, created previously by the business using RFM analysis.
Each cluster contains either almost all Gold customers or no Gold customers: the Gold segment is confirmed!


Qualitative characterization of clusters

The clustering results validate the existing concept of Gold customers, and suggest the introduction of a cluster within the Gold98 segment: a platinum segment.
Cluster 5: 9% of the population. Revenue and bonuses collected are all in the 75th percentile and above, skewed to almost all greater than the 90th percentile. It looks like a very profitable cluster.


Detailed view of cluster 5


Profiling clusters

Goal: assess the potential business value of each cluster quantitatively by profiling the aggregate values of the shareholder value variables by cluster.


Profiling clusters

Leverage = ratio of revenue to customers.
Product index = the average number of products purchased by the customers in the cluster divided by the average number of products purchased overall.
As profitability increases, so does the average number of products purchased; customer profitability also increases as tenure increases.
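The product index is just a per-cluster average over the overall average. A toy sketch (the cluster labels and product counts are invented for illustration):

```python
# (cluster_id, number of products purchased) per customer -- invented numbers
data = [(5, 10), (5, 12), (2, 3), (2, 4), (7, 9)]

overall_mean = sum(n for _, n in data) / len(data)

def product_index(cluster_id):
    """Average products in the cluster, relative to the overall average."""
    members = [n for c, n in data if c == cluster_id]
    return (sum(members) / len(members)) / overall_mean

# Clusters whose index is well above 1 buy more products than average.
```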


Business opportunities

Best customers are in clusters 2, 5, and 7: indication: retention.
Clusters 2, 6, and 0: indication: cross-selling, by contrasting them with clusters 5 and 7. Clusters 2, 6, and 0 have a product index close to that of 5 and 7, which have the highest number of products purchased. Try to convert customers from clusters 2, 6, and 0 into clusters 5 and 7: by comparing which products are bought, we can find products that are candidates for cross-selling.


Business opportunities

Clusters 3 and 4: indication: cross-selling to clusters 2, 6, and 0.
Cluster 1: indication: wait and see; it appears to be a group of new customers.
Cluster 8: indication: do not waste marketing dollars.


Follow-up

Reactions from the management:
- the visualization of results allowed for meaningful and actionable analysis
- the original segmentation methodology was validated, and the refinements obtained are potentially valuable
- decision to undertake further data mining projects, including: predictive models for direct-mail targeting; further work on segmentation using more detailed behavioral data; opportunity identification using association algorithms within the discovered segments


Module outline

- The clustering task
- Application to a case-study in customer segmentation
- Quick overview of main clustering techniques


A taxonomy of main clustering methods

- Partitioning methods: K-means (MacQueen 67), expectation maximization (Lauritzen 95), CLARANS (Ng and Han 94)
- Hierarchical methods: agglomerative/divisive, BIRCH (Zhang et al 96), CURE (Guha et al 98)
- Linkage-based methods
- Density-based methods: DBSCAN (Ester et al 96), DENCLUE (Hinneburg and Keim 98)
- Statistical methods: IBM-IM demographic clustering, COBWEB (Fisher 87), Autoclass (Cheeseman 96)


Which notion of distance?

Given two samples with n continuous attributes, X = <x1, …, xn> and Y = <y1, …, yn>, the Minkowski distance between X and Y is:
d(X,Y) = (|x1-y1|^q + … + |xn-yn|^q)^(1/q)
Euclidean distance: q = 2. Manhattan distance: q = 1.
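The formula translates directly into code; a minimal sketch:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two n-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
```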


K-Means

Determine k prototypes (p1, …, pk) of a given data set: assign each data point to its nearest prototype, minimizing distance (the stop criterion).
Iterative algorithm:
- shift each prototype towards the mean of its point set
- re-assign the data points to the nearest prototype
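The loop above can be sketched as follows (a plain, unoptimized version on 2-D points; initialisation by random sampling is my choice, not prescribed by the slide):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: alternate assignment and mean-update."""
    random.seed(seed)
    protos = random.sample(points, k)               # initial prototypes
    for _ in range(iters):
        # assign every point to its nearest prototype
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - protos[i][0]) ** 2 + (p[1] - protos[i][1]) ** 2)
            groups[j].append(p)
        # shift each prototype towards the mean of its point set
        for i, g in enumerate(groups):
            if g:
                protos[i] = (sum(p[0] for p in g) / len(g),
                             sum(p[1] for p in g) / len(g))
    return protos

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
```

On the two obvious blobs in `pts`, the prototypes converge to the blob means.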


K-Means: example



Expectation Maximization

A generalization of k-means with probabilistic assignment of points to clusters.
Idea:
- estimate the parameters of k Gaussians
- optimize the probability that the mixture of parameterized Gaussians fits the data
Iterative algorithm similar to k-means.
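A minimal sketch of EM for a mixture of two 1-D Gaussians (the initialisation and the fixed iteration count are my simplifications):

```python
import math

def em_1d(xs, iters=50):
    """EM for two 1-D Gaussians: soft-assign points, then re-estimate parameters."""
    mu = [min(xs), max(xs)]           # crude initialisation
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]                   # mixture weights
    for _ in range(iters):
        # E-step: responsibility of each Gaussian for each point
        resp = []
        for x in xs:
            d = [pi[k] / (sigma[k] * math.sqrt(2 * math.pi))
                 * math.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2)) for k in (0, 1)]
            s = d[0] + d[1]
            resp.append((d[0] / s, d[1] / s))
        # M-step: weighted mean, variance, and weight per Gaussian
        for k in (0, 1):
            w = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / w
            sigma[k] = max(1e-6, math.sqrt(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / w))
            pi[k] = w / len(xs)
    return mu, sigma, pi

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
```

On these two groups of points the means converge near 1.0 and 5.0, each Gaussian taking half the weight.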


Linkage-based methods (from statistics)

Single linkage: connected components for distance d.


Linkage-based methods

Method of Wishart (minimum number of neighboring points: 4): reduce the data set, then apply single linkage.
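Single linkage as connected components can be sketched with a union-find structure (a minimal version of my own, not a library API):

```python
def single_linkage(points, d):
    """Cluster 2-D points: connect every pair at distance < d, return components."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = ((points[i][0] - points[j][0]) ** 2
                    + (points[i][1] - points[j][1]) ** 2) ** 0.5
            if dist < d:
                parent[find(i)] = find(j)   # union the two components

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10)]
```

With a threshold of 1.5, the first three points form one cluster and the last two another.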


Kernel Density Estimation

Data set → influence function → density function.
Influence function: the influence of a data point in its neighborhood.
Density function: the sum of the influences of all data points.


Kernel Density Estimation

Influence function: the influence of a data point y at a point x in the data space is modeled by a function f_y(x).
Density function: the density at a point x in the data space is defined as the sum of the influences of all data points: f(x) = Σ_i f_{x_i}(x).
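With a concrete influence function (a Gaussian kernel is assumed here; the slides leave the choice open), the density function is literally that sum. A 1-D sketch:

```python
import math

def influence(x, y, sigma=1.0):
    """Gaussian influence of data point y at location x (1-D for simplicity)."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Density at x = sum of the influences of all data points."""
    return sum(influence(x, y, sigma) for y in data)

data = [1.0, 1.2, 0.8, 5.0]
# The density is higher near the group around 1.0 than near the isolated point at 5.0.
```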


KDE and visual clustering


KDE and visual clustering


Scalability problem of clustering methods

Effectiveness degenerates with dimensionality d with noise level

Efficiency degenerates (at least) linearly with no of data points N exponentially with dimensionality d


References - Clustering

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. on Math., Stat., and Prob., 1:281-297, 1967.
S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995.
C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Conference on Management of Data, Seattle, WA, June 1998.
P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, p. 153-180, AAAI/MIT Press, 1996.
M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. VLDB Conference, New York City, NY, August 1998.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD'96.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. KDD'98.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
V. Ganti, R. Ramakrishnan, and J. Gehrke. Clustering large datasets in arbitrary metric spaces. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
S. Guha, R. Rastogi, and K. Shim. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Conference on Management of Data, Seattle, WA, June 1998.
S. Guha, R. Rastogi, and K. Shim. ROCK: a robust clustering algorithm for categorical attributes. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. VLDB Conference, Santiago, Chile, September 1994.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.