data mining strategies. scales of measurement stevens, s.s. (1946). on the theory of scales of...

Data Mining Strategies

Upload: rosanna-sutton

Post on 23-Dec-2015

Scales of Measurement

Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Four Scales:
Categorical (nominal)
Ordinal (only order matters)
Interval (the difference between two values is meaningful)
Ratio (a value of 0.0 means none of the quantity is present; Kelvin is a ratio scale, but Celsius and Fahrenheit are not)

What to Know about the Scales

The measurement principle involved for each scale

Examples of the measurement scales

Permissible arithmetic operations for each scale

Categorical Scale Data

The values of the scale have no numeric meaning

Examples:
Gender
Ethnicity
Marital Status
Hair Color

Operations:
Counting (only)
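
As a rough illustration of why counting is the only operation that makes sense on categorical data, the sketch below tallies a small sample. The sample values are invented for the example and are not from the slides.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (hair color).
hair_colors = ["brown", "black", "brown", "blond", "red", "brown", "black"]

# Counting is the permissible operation: tally how often each label occurs.
counts = Counter(hair_colors)

# The most frequent label (the mode) is a valid summary of categorical data.
# A sum or a mean of the labels would have no meaning.
mode, mode_count = counts.most_common(1)[0]

print(counts)
print(mode, mode_count)
```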

Ordinal Scale Data

The categories can be ordered

But the intervals between adjacent scale values are indeterminate

Examples:
Movie ratings (0, 1 or 2 thumbs up)
U.S.D.A. beef (good, choice, prime)
The rank order of anything

Operations:
Counting
Greater than or less than operations
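
A minimal sketch of ordinal comparisons, using the beef grades from the example above. The ordering good < choice < prime and the helper name `is_better` are assumptions made for illustration. The numeric ranks exist only to support comparison; the gaps between them carry no meaning.

```python
# Assumed ordering of the grades, lowest to highest.
GRADES = ["good", "choice", "prime"]
RANK = {grade: position for position, grade in enumerate(GRADES)}


def is_better(a: str, b: str) -> bool:
    """Return True if grade a ranks above grade b."""
    return RANK[a] > RANK[b]


# Sorting relies only on the order of the grades, not on any distance between them.
sample = ["choice", "prime", "good", "choice"]
ranked = sorted(sample, key=RANK.__getitem__)

print(is_better("prime", "choice"))
print(ranked)
```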

Interval Scale Data

Intervals between adjacent scale values are equal

Examples:
Degrees Fahrenheit
Most personality measures
IQ intelligence score

Operations:
Counting
Greater than or less than operations
Addition and subtraction of scale values

Ratio Scale Data

There is a rational zero point for the scale

An absolute zero

Examples:
Degrees Kelvin
Annual income in dollars
Length, distance, size (cm, kB, inches, km)

Operations:
All of the above, plus multiplication and division of scale values
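
The difference between interval and ratio scales can be checked with a small calculation. The sketch below compares the same two temperatures in Celsius, Fahrenheit and Kelvin. The temperatures (10 and 20 degrees Celsius) are arbitrary values chosen for the example.

```python
def celsius_to_kelvin(c: float) -> float:
    return c + 273.15


def celsius_to_fahrenheit(c: float) -> float:
    return c * 9.0 / 5.0 + 32.0


a_c, b_c = 10.0, 20.0  # two arbitrary temperatures in degrees Celsius

# On the interval scales the "ratio" depends on where the unit puts its zero,
# so it does not describe the temperatures themselves.
ratio_c = b_c / a_c
ratio_f = celsius_to_fahrenheit(b_c) / celsius_to_fahrenheit(a_c)

# Kelvin has an absolute zero, so this ratio is physically meaningful.
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(ratio_c, ratio_f, ratio_k)
```

The Celsius and Fahrenheit ratios disagree with each other, which is why division is reserved for ratio scales.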

Variables

Independent: the input, x

Dependent: the output, f(x)

f(x) = 3 + 2x^2
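
As a small sketch, the relationship can be written as a function whose argument is the independent variable and whose return value is the dependent variable. This assumes the slide's formula reads f(x) = 3 + 2x^2; the exponent is inferred from the flattened text and may not match the original slide.

```python
def f(x: float) -> float:
    """Dependent variable (output) computed from the independent variable x.

    Assumes the formula is f(x) = 3 + 2 * x**2.
    """
    return 3 + 2 * x ** 2


# Each input value determines exactly one output value.
outputs = {x: f(x) for x in (0, 1, 2)}
print(outputs)
```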

Data Mining Strategies

Unsupervised (no dependent variables used):
Clustering
Market Basket Analysis
Information Visualization

Supervised (at least one dependent variable used for training):
Classification
Estimation
Prediction

Clustering

Cluster analysis divides data into groups (clusters) that are meaningful, useful or both

Clusters capture the natural structure of the data

Clustering allows us to think about the data at a new level of abstraction

Cluster analysis is often the first step in a data mining project

Cluster of Stars

Water Clusters

Cellular Clusters

Cluster Analysis

Uses information found in the data that describes objects and their relationships

Goal: That objects within a group be similar to one another and different from objects in other groups

The greater the similarity within groups and the greater the difference between groups, the better the clustering
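
One way to put numbers on this goal is sketched below. It measures similarity within groups as the sum of squared distances to each cluster's centroid (lower is tighter) and difference between groups as the smallest distance between centroids (higher is better separated). Both measures, the function names and the sample points are assumptions chosen for illustration; the slides do not name a specific quality measure.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_sse(clusters):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) ** 2 for p in points)
    return total


def min_centroid_separation(clusters):
    """Smallest distance between any two cluster centroids."""
    cents = [centroid(points) for points in clusters]
    return min(
        math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]
    )


# The same six points grouped two different ways.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
poor = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

print(within_cluster_sse(good), min_centroid_separation(good))
print(within_cluster_sse(poor), min_centroid_separation(poor))
```

Under these measures the first grouping is both tighter and better separated than the second.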

How Many Clusters?

Three Clusters Identified

Six Clusters Identified

Types of Clustering

Partitional clustering
Hierarchical clustering
Exclusive clustering
Overlapping clustering
Fuzzy clustering
Complete clustering
Partial clustering

Partitional Clustering

A division of a set of data into non-overlapping clusters

Each data point is in exactly one cluster

Example of Partitional Clustering

Hierarchical Clustering

Permits subclusters (nested clusters within clusters)

Example of Hierarchical Clustering
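
To make the idea of nested clusters concrete, here is a minimal sketch of one common way to build a hierarchy: agglomerative merging with single linkage, where the two closest clusters are merged repeatedly. The slides do not say which hierarchical method they have in mind, so this choice, the function name and the sample points are all assumptions.

```python
import math


def single_linkage(points):
    """Repeatedly merge the two closest clusters; return the merge history.

    Each history entry is (cluster_a, cluster_b, distance). Later entries
    contain earlier ones, which is what makes the result a hierarchy.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(a, b) for a in clusters[i] for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
history = single_linkage(points)
for left, right, distance in history:
    print(left, "+", right, "at distance", round(distance, 3))
```

Reading the history from top to bottom shows small clusters forming first and then being absorbed into larger ones.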

Exclusive clustering

Each object is assigned to a single cluster

Overlapping Clustering

Non-exclusive

A data point can belong to two or more clusters simultaneously

Fuzzy Clustering

Every data point belongs to every cluster with a membership weight.

Membership ranges from 0 (absolutely does not belong) to 1 (absolutely belongs)

The sum of the membership weights for each point is 1

[Figure: two clusters, C1 and C2, with example points labeled by membership weight: C1 40% / C2 60%; C1 1% / C2 99%; C1 75% / C2 25%]
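
The slides do not say how membership weights are computed. As one possible sketch, the code below weights each cluster by the inverse of the point's distance to that cluster's centroid and then normalizes so the weights sum to 1. The weighting rule, the function name and the centroid positions are assumptions chosen so the result matches the 40% / 60% example.

```python
import math


def memberships(point, centroids, eps=1e-12):
    """Membership weight of a point in each cluster.

    Assumed rule: weight is proportional to 1 / distance to the centroid.
    eps avoids division by zero when the point sits exactly on a centroid.
    """
    inverse = [1.0 / (math.dist(point, c) + eps) for c in centroids]
    total = sum(inverse)
    return [w / total for w in inverse]


c1, c2 = (0.0, 0.0), (10.0, 0.0)  # hypothetical centroids of C1 and C2

# A point 6 units from C1 and 4 units from C2 leans toward C2.
weights = memberships((6.0, 0.0), [c1, c2])
print([round(w, 2) for w in weights])
```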

Complete Clustering

Assigns every data point to a cluster

No data point is left out of a cluster

Partial Clustering

Does not assign every data point to a cluster

Some data points may not belong to any cluster:
Noise
Outliers
Uninteresting background

Example: classifying newspaper stories

Many stories fall into common topics:
Global warming
Terrorism

Some stories are unique:
Cable Tie just graduated from the CofC in CS
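
A partial clustering can be sketched by assigning a point to its nearest cluster only when it is close enough, and leaving it unassigned otherwise. The distance threshold, the function name and the sample data below are assumptions for illustration; the slides do not prescribe a rule for deciding which points are left out.

```python
import math


def assign_partial(points, centroids, max_distance):
    """Label each point with its nearest centroid index, or None if too far."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        # Points beyond the threshold are treated as noise or outliers.
        labels.append(nearest if distances[nearest] <= max_distance else None)
    return labels


centroids = [(0.0, 0.0), (10.0, 10.0)]         # two hypothetical clusters
points = [(0.5, 0.2), (9.6, 10.1), (5.0, 30.0)]  # the last point is an outlier

labels = assign_partial(points, centroids, max_distance=2.0)
print(labels)
```

A complete clustering corresponds to removing the threshold so that every point receives a label.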

K-Means

1. Select K points as initial centroids

2. Repeat
   1. Form K clusters by assigning each point to its closest centroid.
   2. Recompute the centroid of each cluster.

3. Until the centroids do not change

Chris Starr:

A centroid is the center of a cluster

The centroids are repositioned until stable in the K-means algorithm.
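
The K-means steps can be sketched in plain Python as follows. The random choice of initial centroids, the `seed` and `max_iter` parameters, the handling of empty clusters and the sample points are assumptions added to make the sketch runnable; they are not specified in the slides.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: select K points as initial centroids (here, chosen at random).
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2.1: form K clusters by assigning each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2.2: recompute the centroid of each cluster.
        # An empty cluster keeps its previous centroid (an assumption).
        new_centroids = [
            tuple(sum(coords) / len(members) for coords in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        # Step 3: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

On data with less obvious structure, the result can depend on the initial centroids, so running the procedure with several seeds is a common precaution.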

Observe Your Environment

Start looking for clusters around you

Think about how the clusters are formed:
Are they hierarchical?
Are they fuzzy clusters?
Are they complete clusters?