Data Clustering 1 – An Introduction

Upload: adele-wells

Post on 29-Dec-2015


Page 1: Data Clustering 1 – An introduction Data Clustering – An IntroductionSlide 1

Data Clustering 1 – An introduction

Data Clustering – An Introduction Slide 1

Page 2

The Data Explosion

“If you feel like you are drowning in information, it’s because you are.”

Advance of IT and the Internet

Massive increase in the ability to:

• Record: electronic records and forms
• Store: data warehouses (as we have seen)
• Analyse: data mining and visualisation (more later)

Risk of Information Overload

Page 3

The Aims of Data Mining

• Classification: categorising the risk-return of stocks
• Association: identifying products that tend to sell together
• Detection: identifying profiles of customers
• Prediction: forecasting market performance

Page 4

Database Technology Timeline

• 1960s: Data collection, database creation, IMS and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s to 2000s: Data mining and data warehousing, multimedia databases, and Web databases

Page 5

From Data to Knowledge

It is common to break the process of learning from data down into three levels: data, information and knowledge.

Page 6

From Data to Knowledge

Data: Raw numbers

Information: Data with context or meaning

Knowledge: Data Structures / Patterns (Knowledge must be useful)

Page 7

Data Mining / Intelligent Data Analysis

“Data mining is applying Machine Learning techniques to historical data to improve future decisions” (Tom Mitchell, 1997)

Page 8

Knowledge Discovery

Knowledge Discovery in Databases (KDD)

The Process (from Advances in KDD and Data mining):

Data → Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge

Page 9

Data Mining - Tools

Typical tools:

• Statistical analysis
• Summarisation
• Outlier detection
• Correlation
• Regression
• Clustering
• Association rules
• Time series models
• Decision trees (classification)

Page 10

Data Mining - Applications

Some successful examples of its use:

• Pharmaceutical companies – drug discovery
• Credit card companies – fraud detection
• Transportation companies – routing
• Large consumer packaged goods companies (to improve the sales process to retailers)
• Hospital organisation – decision analysis

Page 11

Examples of Data Mining Tools

We will now look at some core techniques commonly used for analysing and mining business data warehouses:

• Correlation
• Visualisation
• Clustering
• Regression

Page 12

Clustering

An example in biology…

• Things that are brown and run away: animals
• Things that are green and don’t run away: plants

Page 13

Clustering

An example in biology…

• Kingdom
• Phylum
• Class
• Order
• Family
• Genus
• Species

Hierarchical clustering (more later)

Page 14

Clustering

The process:

1. Extract features (colour, movement, sensory organs, etc.): more later
2. Cluster into categories
3. Consolidation

Page 15

Clustering

Clustering: to partition a data set into subsets (clusters), so that the data in each subset share some common trait, often similarity or proximity for some defined distance measure.

• The process of organizing objects into groups whose members are similar in some way.

• A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

• Unsupervised: No need for the ‘teacher’ signals, i.e. the desired output.

[Figure: two groups of patterns, Cluster 1 and Cluster 2, plotted on axes x1 and x2]

Page 16

Supervised and Unsupervised Learning

Unsupervised learning: learning without the desired output (‘teacher’ signals).

Supervised learning: learning with the desired output.

• Clustering is one of the most widely used unsupervised learning methods.
• Other unsupervised learning:
  • Dimensionality reduction (factor analysis, principal component analysis, independent component analysis, …)
  • Time series modelling
  • Source separation
• Supervised learning:
  • Classification
  • Regression

Page 17

Patterns, Clusters and Features (1)

• Patterns: physical objects
• Clusters: categories of objects (e.g. animals, plants)
• Features: attributes of objects (e.g. colour: brown, green, …)

Page 18

Patterns, Clusters and Features (2)

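[Figure: Creating vehicles’ clusters. Scatter plot of the features’ space, with Feature-1 values (top speed [ml/h], ticks from 100 to 300) on the horizontal axis and Feature-2 values (weight [kg], ticks from 500 to 3500) on the vertical axis. Each point is a pattern; the labelled clusters are sports cars, medium market cars and lorries.]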

Page 19

Social networks

• Marketing
• Terror networks
• Allocation of resources in a company / university

Page 21

How to do clustering?

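[Figure: two groups of patterns, Cluster 1 and Cluster 2, plotted on axes x1 and x2]

What we know: patterns represented by their feature vectors, e.g. a two-dimensional vector $\mathbf{x} = (x_1, x_2)^T$ of numeric feature values.

General case: $\mathbf{x} = (x_1, x_2, \dots, x_d)^T$ is in the d-dimensional domain of the feature vectors.

What we need to find out: the clusters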

Page 22

Pattern Similarity

• A key concept in clustering: similarity.
• Clusters are formed by similar patterns.
• In computer science, we need to define some metric to measure similarity. One of the commonly adopted similarity metrics is distance.

A general definition of distance (between patterns A and B):

$$d(A, B) = \left( \sum_{i=1}^{d} \left| x_i^A - x_i^B \right|^{b} \right)^{1/b}$$

• b=2: Euclidean distance
• b=1: Manhattan distance

The shorter the distance, the more similar the two patterns.
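
The general distance with exponent b can be sketched in a few lines of Python. This is only an illustration: the function name `minkowski_distance`, its parameter names and the two example patterns are assumptions introduced here, not part of the slides.

```python
def minkowski_distance(pattern_a, pattern_b, b=2):
    """Distance between two feature vectors for a given exponent b (b >= 1)."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same number of features")
    # Sum |x_i^A - x_i^B|^b over all features, then take the b-th root.
    total = sum(abs(xa - xb) ** b for xa, xb in zip(pattern_a, pattern_b))
    return total ** (1.0 / b)


# Two hypothetical 2-D patterns, chosen so the arithmetic is easy to follow.
A = [0.0, 0.0]
B = [3.0, 4.0]

euclidean = minkowski_distance(A, B, b=2)  # square root of (9 + 16)
manhattan = minkowski_distance(A, B, b=1)  # 3 + 4
```

With these values the Euclidean distance is 5 and the Manhattan distance is 7, so the same pair of patterns can look closer or further apart depending on the choice of b.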

Page 23

Pattern Similarity & Distance Metrics

• Many methods, e.g. K-Means, are designed to work on distance metrics.
• They assume that the triangle inequality holds: “the sum of the lengths of any two sides must be greater than the length of the remaining side”
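
For a distance function d, the triangle inequality is usually written d(A, C) ≤ d(A, B) + d(B, C), with equality when the three points lie on a line. The sketch below checks this property on a handful of points. The helper names and the example points are assumptions made for illustration, and `math.dist` requires Python 3.8 or later. Squared Euclidean distance is included only as an example of a dissimilarity measure for which the check fails.

```python
import math
from itertools import permutations


def satisfies_triangle_inequality(points, distance, tol=1e-12):
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple of points."""
    return all(
        distance(a, c) <= distance(a, b) + distance(b, c) + tol
        for a, b, c in permutations(points, 3)
    )


def squared_euclidean(p, q):
    """Squared Euclidean distance: a dissimilarity that is not a metric."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


# Three hypothetical patterns lying on a straight line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

euclidean_ok = satisfies_triangle_inequality(points, math.dist)
squared_ok = satisfies_triangle_inequality(points, squared_euclidean)
```

Here the Euclidean distance passes the check, while the squared version fails it, because going directly from the first point to the last costs 4 but going via the middle point costs only 1 + 1.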

Page 24

Pattern Similarity & Distance Metrics

Distance metrics:

• Euclidean
• Correlation
• Minkowski
• Manhattan
• Mahalanobis

Relationship metrics:

• How long is a piece of string?
• Often application dependent

Page 25

K-Means Clustering

Page 26

Algorithm 1: K-Means Clustering

1. Place K points into the feature space. These points represent initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the assignments do not change.
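
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Java code used in the lab: the function name `k_means`, the use of Euclidean distance via `math.dist` (Python 3.8 or later), the iteration cap, the handling of empty clusters and the toy data are all choices made here for the example.

```python
import math


def k_means(patterns, initial_centroids, max_iterations=100):
    """Cluster feature vectors; returns (centroids, assignments)."""
    # Step 1: the K initial centroids are supplied by the caller.
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each pattern to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned patterns.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical toy data: two well-separated groups in a 2-D feature space.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(data, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the first three patterns end up in cluster 0 and the last three in cluster 1, and each centroid settles at the mean of its group.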

Page 27

K-Means Clustering

Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 28

Discussions (1)

1. How to determine k, the number of clusters?

Page 29

Discussions (2)

2. Any alternative ways of choosing the initial cluster centroids?

Page 30

Discussions (3)

3. Does the algorithm converge to the same results with different selections of initial cluster centroids? If not, what should we do in practice?
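
As a starting point for this discussion, the sketch below shows K-Means reaching two different results on the same data from two different starts. One common practice, which the slides do not state, is to run the algorithm several times and keep the run with the lowest within-cluster sum of squares. All names, the scoring function and the toy data are assumptions made for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math


def k_means(patterns, initial_centroids, max_iterations=100):
    """Plain K-Means with Euclidean distance; returns (centroids, assignments)."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def within_cluster_sum_of_squares(patterns, centroids, assignments):
    """Total squared distance from each pattern to its assigned centroid."""
    return sum(
        math.dist(p, centroids[a]) ** 2 for p, a in zip(patterns, assignments)
    )


# Hypothetical toy data: the four corners of a wide, flat rectangle.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Two hypothetical choices of initial centroids.
candidate_starts = [
    [(5.0, 0.0), (5.0, 1.0)],   # converges to a bottom-row / top-row split
    [(0.0, 0.0), (10.0, 0.0)],  # converges to a left / right split
]

runs = []
for start in candidate_starts:
    centroids, labels = k_means(data, start)
    score = within_cluster_sum_of_squares(data, centroids, labels)
    runs.append((score, labels))

# Keep the run whose clusters are tightest.
best_score, best_labels = min(runs, key=lambda run: run[0])
```

The first start gets stuck with a score of 100, while the second reaches a score of 1, so the result clearly depends on the initial centroids in this example.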

Page 31

Reading

Chapter 9, Section 9.3: David Hand “Principles of Data Mining”, MIT Press

Chapter 8: Pang-Ning Tan “Introduction to Data Mining”

Anil Jain: “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters

Page 32

Lab

In the lab:

Examine a piece of Java code for K-Means clustering

Explore the use of K-Means on some toy datasets

Visualise the clusterings using an Excel macro
