Data Clustering 1 – An introduction
Data Clustering – An Introduction Slide 1
The Data Explosion
“If you feel like you are drowning in information, it’s because you are.”
Advance of IT and the Internet: massive increase in ability to:
• Record: electronic records and forms
• Store: data warehouses (as we have seen)
• Analyse: data mining and visualisation (more later)
Risk of Information Overload
The Aims of Data Mining
• Classification: categorising risk-return of stocks
• Association: identify products that tend to sell together
• Detection: identify profiles of customers
• Prediction: forecasting market performance
Database Technology Timeline
• 1960s: Data collection, database creation, IMS and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s–2000s: Data mining and data warehousing, multimedia databases, and Web databases
From Data to Knowledge
It is common to break down the process of learning from data into the following:
Data, Information and Knowledge
From Data to Knowledge
Data: Raw numbers
Information: Data with context or meaning
Knowledge: Data Structures / Patterns (Knowledge must be useful)
Data Mining / Intelligent Data Analysis
“Data mining is applying Machine Learning techniques to historical data to improve future decisions” (Tom Mitchell, 1997)
Knowledge Discovery
Knowledge Discovery in Databases (KDD)
The Process (from Advances in KDD and Data mining):
Data → Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge
Data Mining - Tools
Typical tools:
• Statistical Analysis
• Summarisation
• Outlier Detection
• Correlation
• Regression
• Clustering
• Association Rules
• Time Series Models
• Decision Trees (classification)
Data Mining - Applications
Some successful examples of its use:
• Pharmaceutical companies – Drug Discovery
• Credit card companies – Fraud Detection
• Transportation companies – Routing
• Large consumer package goods companies (to improve the sales process to retailers)
• Hospital Organisation – Decision Analysis
Examples of Data Mining Tools
We will now look at some core techniques commonly used for analysing and mining business warehouses
• Correlation
• Visualisation
• Clustering
• Regression
Clustering
An example in biology…
• Things that are brown and run away: animals
• Things that are green and don’t run away: plants
Clustering
An example in biology: Kingdom, Phylum, Class, Order, Family, Genus, Species
Hierarchical clustering (more later)
Clustering
The process:
1. Extract features (colour, movement, sensory organs, etc.): more later
2. Cluster into categories
3. Consolidation
Clustering
Clustering: to partition a data set into subsets (clusters), so that the data in each subset share some common trait, often similarity or proximity for some defined distance measure.
• The process of organizing objects into groups whose members are similar in some way.
• A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
• Unsupervised: no need for ‘teacher’ signals, i.e. the desired output.
[Figure: two groups of points, Cluster 1 and Cluster 2, plotted against axes x1 and x2]
Supervised and Unsupervised Learning
Unsupervised learning: learning without the desired output (‘teacher’ signals).
Supervised learning: learning with the desired output.
• Clustering is one of the most widely used unsupervised learning methods.
• Other unsupervised learning:
  • Dimensionality reduction (factor analysis, principal component analysis, independent component analysis, …)
  • Time series modelling
  • Source separation
• Supervised learning:
  • Classification
  • Regression
Patterns, Clusters and Features (1)
Patterns: physical objects
Clusters: categories of objects
Features: attributes of objects
[Figure: example clusters (animals, plants) and an example feature (colour: brown, green, …)]
Patterns, Clusters and Features (2)
[Figure: Creating vehicles’ clusters. Scatter plot in the features’ space of weight [kg] against top speed [ml/h]. Each point is a pattern located by its feature-1 and feature-2 values; the labelled clusters are sports cars, medium market cars and lorries.]
Social networks
• Marketing
• Terror networks
• Allocation of resources in a company / university
Gene networks
• Understanding gene interactions
• Identifying important genes linked to disease
How to do clustering?
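[Figure: Cluster 1 and Cluster 2 plotted against axes x1 and x2]
What we know: patterns represented by their feature vectors, e.g. a vector with two feature values, $\mathbf{x} = (x_1, x_2)^T$.
General case: $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$ is in the $d$-dimensional domain of the feature vectors.
What we need to find out: the clusters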
Pattern Similarity
• A key concept in clustering: similarity.
• Clusters are formed by similar patterns.
• In computer science, we need to define some metric to measure similarity. One of the commonly adopted similarity metrics is distance.
A general definition of distance (between patterns A and B):
$$d(A, B) = \left( \sum_{i=1}^{d} \left| x_i^A - x_i^B \right|^b \right)^{1/b}$$
• b=2: Euclidean distance
• b=1: Manhattan distance
The shorter the distance, the more similar the two patterns.
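The general distance above can be sketched in a few lines of Python. This is only an illustration, not the lab code: the function name `minkowski_distance`, the parameter name `order` (standing in for b) and the two example patterns are assumptions made up for this sketch.

```python
def minkowski_distance(a, b, order=2):
    """Distance between two feature vectors of equal length.

    order=2 gives the Euclidean distance, order=1 the Manhattan distance.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    if order < 1:
        raise ValueError("order must be at least 1")
    # Sum |x_i^A - x_i^B|^b over all d features, then take the b-th root.
    total = sum(abs(x - y) ** order for x, y in zip(a, b))
    return total ** (1.0 / order)


# Made-up feature vectors, for illustration only.
pattern_a = (0.0, 0.0)
pattern_b = (3.0, 4.0)

euclidean = minkowski_distance(pattern_a, pattern_b, order=2)
manhattan = minkowski_distance(pattern_a, pattern_b, order=1)
print(euclidean, manhattan)
```

For these two patterns the Manhattan distance is larger than the Euclidean one, because it adds the differences along each axis instead of measuring the straight line.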
Pattern Similarity & Distance Metrics
Many methods are designed to work on distance metrics, e.g. K-Means.
They assume that the triangle inequality holds: “the sum of the lengths of any two sides must be greater than the length of the remaining side”. For a distance metric this reads d(A, C) ≤ d(A, B) + d(B, C), with equality permitted.
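As a rough illustration of why the triangle inequality matters, the sketch below checks it by brute force on a handful of made-up points. The helper name `satisfies_triangle_inequality` and the sample points are assumptions for this example. It compares the Euclidean distance, which is a metric, with the squared Euclidean distance, which is not.

```python
import itertools
import math


def satisfies_triangle_inequality(points, distance):
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple of points."""
    tolerance = 1e-12  # guards against floating-point rounding
    for a, b, c in itertools.permutations(points, 3):
        if distance(a, c) > distance(a, b) + distance(b, c) + tolerance:
            return False
    return True


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up points, for illustration only; the first three lie on one line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]

euclidean_ok = satisfies_triangle_inequality(points, euclidean)
squared_ok = satisfies_triangle_inequality(points, squared_euclidean)
print(euclidean_ok, squared_ok)
```

The squared distance fails on the three collinear points: going from (0, 0) to (2, 0) costs 4, while the detour through (1, 0) costs only 1 + 1 = 2.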
Pattern Similarity & Distance Metrics
Distance metrics: Euclidean, Correlation, Minkowski, Manhattan, Mahalanobis
Relationship metrics: How long is a piece of string? Often application dependent
K-Means Clustering
Algorithm 1: K-Means Clustering
1. Place K points into the feature space. These points represent initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the assignments do not change.
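The four steps can be sketched in plain Python as follows. This is a minimal sketch, not the Java code used in the lab. It assumes squared Euclidean distance, initial centroids drawn at random from the patterns themselves, and an iteration cap; the names `k_means`, `seed` and `max_iterations` and the toy data are all made up for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(patterns, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Step 1: place K centroids; here they are K distinct patterns chosen at random.
    centroids = [tuple(p) for p in rng.sample(patterns, k)]
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each pattern to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in patterns
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of the patterns assigned to it.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Made-up toy data: two well-separated groups of three patterns each.
patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(patterns, k=2)
print(centroids)
print(assignments)
```

On data this well separated, the first three patterns end up in one cluster and the last three in the other, whichever two patterns are picked as starting centroids. Which cluster receives label 0 depends on the random choice.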
K-Means Clustering
Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Discussions (1)
1. How to determine k, the number of clusters?
Discussions (2)
2. Any alternative ways of choosing the initial cluster centroids?
Discussions (3)
3. Does the algorithm converge to the same results with different selections of initial cluster centroids? If not, what should we do in practice?
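One common practical response to this question, offered here as a starting point for discussion rather than as the answer from the slides, is to run K-Means several times from different initial centroids and keep the run with the smallest within-cluster sum of squared distances. The sketch below assumes that criterion; the function names, the ten seeds and the toy data are made up for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(patterns, k, rng, max_iterations=100):
    """Plain K-Means with centroids initialised from k randomly chosen patterns."""
    centroids = rng.sample(patterns, k)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in patterns
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


def within_cluster_sum_of_squares(patterns, centroids, assignments):
    return sum(squared_distance(p, centroids[a]) for p, a in zip(patterns, assignments))


# Made-up toy data, for illustration only.
patterns = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.3),
    (5.0, 5.2), (5.3, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.2, 1.3), (8.8, 0.9),
]

# Run K-Means from ten different random initialisations and record each score.
runs = []
for seed in range(10):
    centroids, assignments = k_means(patterns, 3, random.Random(seed))
    score = within_cluster_sum_of_squares(patterns, centroids, assignments)
    runs.append((score, seed, centroids, assignments))

# Keep the run whose clusters are tightest.
best_score, best_seed, best_centroids, best_assignments = min(runs, key=lambda r: r[0])
print(f"best seed: {best_seed}, within-cluster sum of squares: {best_score:.3f}")
```

Comparing the scores stored in `runs` shows directly whether different starting centroids led to different final clusterings on a given data set.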
Reading
Chapter 9, Section 9.3: David Hand “Principles of Data Mining”, MIT Press
Chapter 8: Pang-Ning Tan “Introduction to Data Mining”
Anil Jain: “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters
Lab
In the lab:
• Examine a piece of Java code for K-Means clustering
• Explore the use of K-Means on some toy datasets
• Visualise the clusterings using an Excel macro