
Data Clustering 1 – An introduction


The Data Explosion

“If you feel like you are drowning in information, it’s because you are.”

Advances in IT and the Internet: massive increase in the ability to

• Record: electronic records and forms
• Store: data warehouses (as we have seen)
• Analyse: data mining and visualisation (more later)

Risk of Information Overload


The Aims of Data Mining

• Classification: categorising the risk-return of stocks
• Association: identifying products that tend to sell together
• Detection: identifying profiles of customers
• Prediction: forecasting market performance


Database Technology Timeline

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s–2000s: Data mining and data warehousing, multimedia databases, and Web databases


From Data to Knowledge

It is common to break down the process of learning from data into the following:

Data, Information and Knowledge


From Data to Knowledge

Data: Raw numbers

Information: Data with context or meaning

Knowledge: Data Structures / Patterns (Knowledge must be useful)


Data Mining / Intelligent Data Analysis

“Data mining is applying Machine Learning techniques to historical data to improve future decisions” (Tom Mitchell, 1997)


Knowledge Discovery

Knowledge Discovery in Databases (KDD)

The Process (from Advances in KDD and Data mining):

Data → Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge


Data Mining - Tools

Typical tools:

• Statistical analysis
• Summarisation
• Outlier detection
• Correlation
• Regression
• Clustering
• Association rules
• Time series models
• Decision trees (classification)


Data Mining - Applications

Some successful examples of its use:

• Pharmaceutical companies – Drug Discovery
• Credit card companies – Fraud Detection
• Transportation companies – Routing
• Large consumer package goods companies (to improve the sales process to retailers)
• Hospital Organisation – Decision Analysis

Examples of Data Mining Tools

We will now look at some core techniques commonly used for analysing and mining business warehouses:

• Correlation
• Visualisation
• Clustering
• Regression


Clustering

An example in biology…

• Things that are brown and run away: animals
• Things that are green and don’t run away: plants


Clustering

An example in biology…

• Kingdom
• Phylum
• Class
• Order
• Family
• Genus
• Species

Hierarchical clustering (more later)


Clustering

The process:

• Extract features (colour, movement, sensory organs etc.): more later
• Cluster into categories
• Consolidation


Clustering

Clustering: to partition a data set into subsets (clusters), so that the data in each subset share some common trait, often similarity or proximity for some defined distance measure.

• The process of organizing objects into groups whose members are similar in some way.

• A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

• Unsupervised: No need for the ‘teacher’ signals, i.e. the desired output.

[Figure: scatter plot of two groups of points, labelled Cluster 1 and Cluster 2, with axes x1 and x2]

Supervised and Unsupervised Learning

Unsupervised learning: learning without the desired output (‘teacher’ signals).

Supervised learning: learning with the desired output.

• Clustering is one of the most widely used unsupervised learning methods.
• Other unsupervised learning:
  • Dimensionality reduction (factor analysis, principal component analysis, independent component analysis, …)
  • Time series modelling
  • Source separation
• Supervised learning:
  • Classification
  • Regression


Patterns, Clusters and Features (1)

Patterns: physical objects

Clusters: categories of objects

Features: attributes of objects

Example: the clusters are animals and plants; one feature is colour (brown, green, …).

Patterns, Clusters and Features (2)

[Figure: Creating vehicles’ clusters. Scatter plot of the features’ space with top speed on the x-axis (ticks 100 to 300) and weight in kg on the y-axis (ticks 500 to 3500). Each point is a pattern, located by its Feature-1 and Feature-2 values. Three clusters are marked: sports cars, medium market cars and lorries.]


Social networks

• Marketing
• Terror networks
• Allocation of resources in a company / university

http://www.dashe.com/blog/wp-content/uploads/2011/05/your-ties.jpg


Gene networks

• Understanding gene interactions
• Identifying important genes linked to disease

http://www.google.co.uk/url?sa=i&rct=j&q=&esrc=s&frm=1&source=images&cd=&cad=rja&docid=JemoKUEg5zMq0M&tbnid=rxI-jlolQgOIuM:&ved=0CAUQjRw&url=http://www.scielo.br/scielo.php?pid=S1415-47572012000300021&script=sci_arttext&ei=d39NUoa2E6rT0QWxsoDoCg&bvm=bv.53537100,d.d2k&psig=AFQjCNGHGzTu20lS4C-ramvdGuFrTQ1b1g&ust=1380896999840169

http://www.google.co.uk/url?sa=i&rct=j&q=&esrc=s&frm=1&source=images&cd=&cad=rja&docid=bv-BOrsUW94ODM&tbnid=4esi2sQE-E_-eM:&ved=0CAUQjRw&url=http://www.bioss.ac.uk/~dirk/essays/GeneExpression/bayes_net.html&ei=rn9NUtLiOenV0QXdl4EQ&bvm=bv.53537100,d.d2k&psig=AFQjCNFwnBsZrix2qWDEtzHYqXNGq5OkJQ&ust=1380897065718601

How to do clustering?

[Figure: scatter plot of two groups of points, labelled Cluster 1 and Cluster 2, with axes x1 and x2]

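What we know: patterns represented by their feature vectors, e.g. a two-component vector $\mathbf{x} = (x_1, x_2)^T$ whose components are numeric feature values.

General case: $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$ is in the $d$-dimensional domain of the feature vectors.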

What we need to find out: the clusters

Pattern Similarity

• A key concept in clustering: similarity.
• Clusters are formed by similar patterns.
• In computer science, we need to define some metric to measure similarity. One of the commonly adopted similarity metrics is distance.

A general definition of distance (between patterns A and B):

$$d(A, B) = \left( \sum_{i=1}^{d} \left| x_i^{A} - x_i^{B} \right|^{b} \right)^{1/b}$$

• b=2: Euclidean distance
• b=1: Manhattan distance

The shorter the distance, the more similar the two patterns.
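The general distance above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `minkowski_distance`, the parameter name `order` (standing in for the exponent b) and the two sample patterns are all assumptions made for the example.

```python
from typing import Sequence


def minkowski_distance(a: Sequence[float], b: Sequence[float], order: float = 2.0) -> float:
    """Distance between feature vectors a and b; `order` plays the role of the exponent b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension d")
    if order < 1:
        raise ValueError("order must be >= 1 for the result to be a distance metric")
    # Sum |x_i^A - x_i^B|^b over all d features, then take the b-th root.
    total = sum(abs(x - y) ** order for x, y in zip(a, b))
    return total ** (1.0 / order)


# Two hypothetical 2-D patterns (values chosen for illustration only).
pattern_a = [0.0, 0.0]
pattern_b = [3.0, 4.0]

euclidean = minkowski_distance(pattern_a, pattern_b, order=2)  # b = 2
manhattan = minkowski_distance(pattern_a, pattern_b, order=1)  # b = 1
print(euclidean, manhattan)
```

For these two patterns the Euclidean distance is the straight-line length between the points, while the Manhattan distance adds the per-feature differences, so it is never smaller.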


Pattern Similarity & Distance Metrics

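Many methods are designed to work on distance metrics, e.g. K-Means.

They assume that the triangle inequality holds: “the sum of the lengths of any two sides must be greater than the length of the remaining side”, i.e. $d(A, C) \le d(A, B) + d(B, C)$ for any three patterns A, B and C (a metric also allows equality).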
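A quick way to see what the triangle inequality rules in or out is to test it numerically on a handful of points. The sketch below is illustrative only: the helper names and the sample points are assumptions, and the squared Euclidean distance is included purely as an example of a dissimilarity that is not a metric.

```python
import itertools
import math

# Hypothetical 2-D patterns, chosen for illustration only.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 7.0)]


def satisfies_triangle_inequality(dist, pts, tol=1e-12):
    """Return True if dist(a, c) <= dist(a, b) + dist(b, c) for every ordered triple."""
    return all(
        dist(a, c) <= dist(a, b) + dist(b, c) + tol
        for a, b, c in itertools.permutations(pts, 3)
    )


def squared_euclidean(a, b):
    """Squared Euclidean distance: a dissimilarity measure that is not a metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


euclidean_ok = satisfies_triangle_inequality(math.dist, points)
squared_ok = satisfies_triangle_inequality(squared_euclidean, points)
print(euclidean_ok, squared_ok)
```

The Euclidean distance passes on every triple. The squared version fails on the three collinear points, where going directly from the first to the third costs more than going via the middle one.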


Pattern Similarity & Distance Metrics

Distance metrics:

• Euclidean
• Correlation
• Minkowski
• Manhattan
• Mahalanobis

Relationship metrics

How long is a piece of string? The choice is often application dependent.


K-Means Clustering

Algorithm 1: K-Means Clustering

1. Place K points into the feature space. These points represent initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the assignments do not change.
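The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Java code used in the lab: the function names, the use of Euclidean distance, the rule that an empty cluster keeps its previous centroid, and the sample patterns and starting centroids are all choices made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean(points: Sequence[Point]) -> Point:
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(patterns: Sequence[Point], initial_centroids: Sequence[Point],
            max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    # Step 1: the caller places K points into the feature space.
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignments: List[int] = []
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of the patterns assigned to it.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of 2-D patterns.
patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(patterns, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignments)
```

Passing the starting centroids explicitly keeps the run deterministic and makes it easy to experiment with different initial placements.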


K-Means Clustering

Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html





Discussions (1)

1. How to determine k, the number of clusters?


Discussions (2)

2. Any alternative ways of choosing the initial cluster centroids?


Discussions (3)

3. Does the algorithm converge to the same results with different selections of initial cluster centroids? If not, what should we do in practice?
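As a starting point for this discussion, the sketch below runs K-Means from two different initial placements on the same four patterns and compares the outcomes. It is an illustration under assumptions rather than the lecture's answer: the function names, the toy data, and the use of the within-cluster sum of squared distances to rank the runs are all choices made for this example.

```python
import math


def k_means(patterns, initial_centroids, max_iter=100):
    """Plain K-Means from explicit starting centroids; returns (centroids, assignments)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        new = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
               for p in patterns]
        if new == assignments:  # assignments unchanged: converged
            break
        assignments = new
        for j in range(len(centroids)):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


def within_cluster_ss(patterns, centroids, assignments):
    """Sum of squared distances from each pattern to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(patterns, assignments))


# Four hypothetical patterns at the corners of a wide rectangle.
patterns = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
starts = {
    "left/right": [(0.0, 0.5), (4.0, 0.5)],
    "bottom/top": [(2.0, 0.0), (2.0, 1.0)],
}
results = {}
for name, start in starts.items():
    centroids, assignments = k_means(patterns, start)
    results[name] = (assignments, within_cluster_ss(patterns, centroids, assignments))

# Keep the run whose clusters are tightest.
best = min(results, key=lambda name: results[name][1])
print(results, best)
```

Both runs stop because their assignments no longer change, yet they group the patterns differently. One common practice is therefore to run the algorithm from several initial placements and keep the result with the smallest within-cluster sum of squares.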


Reading

Chapter 9, Section 9.3: David Hand “Principles of Data Mining”, MIT Press

Chapter 8: Pang-Ning Tan “Introduction to Data Mining”

Anil Jain: “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters


Lab

In the lab:

Examine a piece of Java code for K-Means clustering

Explore the use of K-Means on some toy datasets

Visualise the clusterings using an Excel macro

