Data Mining Strategies

Scales of Measurement

Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Four Scales

Categorical (nominal)

Ordinal (only order matters)

Interval (the difference between two values is meaningful)

Ratio (a value of 0.0 means none of the quantity is present; Kelvin is a ratio scale, but Celsius and Fahrenheit are not)
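The Kelvin versus Celsius remark can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `celsius_to_kelvin` and the two sample temperatures are assumptions, not part of the slides. It shows that differences survive the change of scale, while a ratio computed on Celsius values does not.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return c + 273.15

# Two hypothetical readings in degrees Celsius.
a_c, b_c = 10.0, 20.0

# Interval property: the difference is the same on both scales.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratio property: only the Kelvin ratio is physically meaningful.
naive_ratio = b_c / a_c  # 2.0, but 20 C is not "twice as hot" as 10 C
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff_c, round(diff_k, 6), naive_ratio, round(kelvin_ratio, 3))
```

The Kelvin ratio comes out at roughly 1.035, so 20 °C is about 3.5% warmer than 10 °C in absolute terms, not 100% warmer.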

What to Know about the Scales

The measurement principle involved for each scale

Examples of the measurement scales

Permissible arithmetic operations for each scale

Categorical Scale Data

The values of the scale have no numeric meaning

Examples: gender, ethnicity, marital status, hair color

Operations: counting (only)

Ordinal Scale Data

The categories can be ordered

But the intervals between adjacent scale values are indeterminate

Examples: movie ratings (0, 1 or 2 thumbs up); U.S.D.A. beef grades (good, choice, prime); the rank order of anything

Operations: counting; greater-than and less-than comparisons

Interval Scale Data

Intervals between adjacent scale values are equal

Examples: degrees Fahrenheit; most personality measures; IQ score

Operations: counting; greater-than and less-than comparisons; addition and subtraction of scale values

Ratio Scale Data

There is a rational zero point for the scale

An absolute zero

Examples: Kelvin temperature; annual income in dollars; length, distance and size (cm, kB, inches, km)

Operations: all of the above, plus multiplication and division of scale values
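The four scales form a ladder: each one permits everything the previous scale permits, plus something new. A minimal sketch of that rule follows. The names `SCALES`, `NEW_OPERATIONS`, `permissible_operations` and `is_permissible` are hypothetical, chosen only for this illustration.

```python
# Scales in order of increasing structure, as presented in the slides.
SCALES = ["categorical", "ordinal", "interval", "ratio"]

# Operations that each scale adds on top of the scales before it.
NEW_OPERATIONS = {
    "categorical": {"count"},
    "ordinal": {"compare"},
    "interval": {"add", "subtract"},
    "ratio": {"multiply", "divide"},
}

def permissible_operations(scale):
    """Return every operation allowed on `scale`, including inherited ones."""
    if scale not in SCALES:
        raise ValueError(f"unknown scale: {scale!r}")
    allowed = set()
    for name in SCALES[: SCALES.index(scale) + 1]:
        allowed |= NEW_OPERATIONS[name]
    return allowed

def is_permissible(scale, operation):
    """True if `operation` is meaningful for data measured on `scale`."""
    return operation in permissible_operations(scale)

for scale in SCALES:
    print(scale, sorted(permissible_operations(scale)))
```

For example, `is_permissible("interval", "divide")` is false, which matches the point that dividing two Fahrenheit readings is not meaningful.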

Variables

Independent: input, x

Dependent: output, f(x)

f(x) = 3 + 2x^2

Data Mining Strategies

Unsupervised (no dependent variables used): clustering, market basket analysis, information visualization

Supervised (at least one dependent variable used for training): classification, estimation, prediction

Clustering

Cluster analysis divides data into groups (clusters) that are meaningful, useful or both

Clusters capture the natural structure of the data

Clustering allows us to think about the data at a new level of abstraction

Cluster analysis is often the first step in a data mining project

Cluster of Stars

Water Clusters

Cellular Clusters

Cluster Analysis

Uses information found in the data that describes objects and their relationships

Goal: That objects within a group be similar to one another and different from objects in other groups

The greater the similarity within groups and the greater the difference between groups, the better the clustering
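The quality criterion above can be made concrete by comparing average distances. The sketch below is one simple way to do that, assuming Euclidean distance as the (dis)similarity measure. The function names and the toy data are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations
from math import dist

def mean_within_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def mean_between_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(pairs) / len(pairs)

# Same six points, grouped sensibly and grouped badly.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
bad = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

for name, clustering in [("good", good), ("bad", bad)]:
    within = mean_within_distance(clustering)
    between = mean_between_distance(clustering)
    print(name, round(within, 2), round(between, 2))
```

The sensible grouping has a small within-cluster distance and a large between-cluster distance; the bad grouping moves both numbers in the wrong direction.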

How Many Clusters?

Three Clusters Identified

Six Clusters Identified

Types of Clustering

Partitional clustering

Hierarchical clustering

Exclusive clustering

Overlapping clustering

Fuzzy clustering

Complete clustering

Partial clustering

Partitional Clustering

A division of a set of data into non-overlapping clusters

Each data point is in exactly one cluster

Example of Partitional Clustering

Hierarchical Clustering

Permits subclusters (clusters nested within clusters)

Example of Hierarchical Clustering

Exclusive Clustering

Each object is assigned to a single cluster

Overlapping Clustering

Non-exclusive: a data point can belong to two or more clusters simultaneously

Fuzzy Clustering

Every data point belongs to every cluster with a membership weight.

Membership ranges from 0 (absolutely does not belong) to 1 (absolutely belongs)

The sum of the membership weights for each point is 1
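To see how such weights might be produced, the sketch below uses the membership formula commonly associated with fuzzy c-means, where weights fall off with distance to each cluster center. The slides do not say which formula is intended, so the function `membership_weights`, the fuzzifier `m=2.0` and the sample centroids are all assumptions made for illustration.

```python
from math import dist

def membership_weights(point, centroids, m=2.0):
    """Fuzzy c-means style memberships of `point` in each cluster."""
    distances = [dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if 0.0 in distances:
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    # Closer centroids get larger weights; normalising makes them sum to 1.
    exponent = 2.0 / (m - 1.0)
    inverse = [1.0 / d ** exponent for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

# Two hypothetical cluster centers, C1 and C2.
centroids = [(0.0, 0.0), (10.0, 0.0)]
for point in [(5.0, 0.0), (1.0, 0.0), (9.0, 0.0)]:
    weights = membership_weights(point, centroids)
    print(point, [round(w, 3) for w in weights])
```

A point halfway between the two centers gets equal weights, while a point close to one center gets a weight near 1 for that cluster and near 0 for the other.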

[Figure: example membership weights for points lying between clusters C1 and C2: C1 40% / C2 60%; C1 1% / C2 99%; C1 75% / C2 25%]

Complete Clustering

Assigns every data point to a cluster

No data point is left out of a cluster

Partial Clustering

Does not assign every data point to a cluster

Some data points may not belong to any cluster: noise, outliers, uninteresting background

Example: classifying newspaper stories

Many fall into common topics such as global warming or terrorism

Some stories are unique, e.g. "Cable Tie just graduated from the CofC in CS"
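Partial clustering can be sketched with a simple distance cutoff: a point joins its nearest cluster only if it is close enough, and is otherwise left unassigned. This is just one possible rule, not something the slides prescribe. The function `partial_assign`, the `max_distance` threshold and the sample data are hypothetical.

```python
from math import dist

NOISE = None  # label used here for points left out of every cluster

def partial_assign(points, centroids, max_distance):
    """Label each point with its nearest centroid index, or NOISE if too far."""
    labels = []
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        if dist(p, centroids[nearest]) <= max_distance:
            labels.append(nearest)
        else:
            labels.append(NOISE)
    return labels

# Two hypothetical cluster centers and four points; the last two are outliers.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.6, 10.3), (5.0, 5.0), (30.0, -4.0)]
labels = partial_assign(points, centroids, max_distance=2.0)
print(labels)  # [0, 1, None, None]
```

Dropping the cutoff (or making it very large) turns the same rule into a complete clustering, since every point is then assigned somewhere.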

K-Means

1. Select K points as initial centroids

2. Repeat

2.1. Form K clusters by assigning each point to its closest centroid

2.2. Recompute the centroid of each cluster

3. Until the centroids do not change

Chris Starr:

A centroid is the center of a cluster

The centroids are repositioned until stable in the K-means algorithm.
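The K-means steps can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, picks the initial centroids at random from the data, and keeps an empty cluster's old centroid, none of which the slides specify. The names `k_means`, `seed` and `max_iter` are hypothetical.

```python
import random
from math import dist

def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, labels) for `points` using basic K-means."""
    rng = random.Random(seed)
    # Step 1: select K points as initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2.1: assign each point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Step 2.2: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 3: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initial choice, so the label values themselves carry no meaning.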

Observe Your Environment

Start looking for clusters around you

Think about how the clusters are formed: Are they hierarchical? Are they fuzzy clusters? Are they complete clusters?
