Data Mining Strategies
Scales of Measurement
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Four Scales
Categorical (nominal)
Ordinal (only order matters)
Interval (the difference between two values is meaningful)
Ratio (a value of 0.0 means none of the quantity is present; Kelvin is a ratio scale, but Celsius and Fahrenheit are not)
What to Know about the Scales
The measurement principle involved for each scale
Examples of the measurement scales
Permissible arithmetic operations for each scale
Categorical Scale Data
The values of the scale have no numeric meaning
Examples: gender, ethnicity, marital status, hair color
Operations: counting (only)
Ordinal Scale Data
The categories can be ordered
But the intervals between adjacent scale values are indeterminate
Examples: movie ratings (0, 1 or 2 thumbs up); U.S.D.A. beef grades (good, choice, prime); the rank order of anything
Operations: counting; greater-than or less-than comparisons
Interval Scale Data
Intervals between adjacent scale values are equal
Examples: degrees Fahrenheit; most personality measures; IQ score
Operations: counting; greater-than or less-than comparisons; addition and subtraction of scale values
Ratio Scale Data
There is a rational zero point for the scale
An absolute zero
Examples: degrees Kelvin; annual income in dollars; length, distance, size (cm, kB, inches, km)
Operations: all of the above, plus multiplication and division of scale values
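The cumulative nature of the permissible operations can be sketched in a few lines of Python. This is only an illustration of the slides above: the names `SCALES`, `NEW_OPERATIONS`, `permitted_operations` and `is_permitted` are hypothetical, and the operation labels are shorthand chosen for this example.

```python
# Scales ordered from least to most informative (Stevens, 1946).
SCALES = ["categorical", "ordinal", "interval", "ratio"]

# Operations that each scale adds on top of the scales below it.
# The labels are illustrative shorthand, not a standard vocabulary.
NEW_OPERATIONS = {
    "categorical": {"count"},
    "ordinal": {"compare"},             # greater than / less than
    "interval": {"add", "subtract"},
    "ratio": {"multiply", "divide"},
}


def permitted_operations(scale):
    """Return every operation allowed at `scale`, including inherited ones."""
    if scale not in SCALES:
        raise ValueError(f"unknown scale: {scale!r}")
    allowed = set()
    for name in SCALES[: SCALES.index(scale) + 1]:
        allowed |= NEW_OPERATIONS[name]
    return allowed


def is_permitted(scale, operation):
    """True if `operation` is meaningful for data measured on `scale`."""
    return operation in permitted_operations(scale)


print(sorted(permitted_operations("interval")))
print(is_permitted("ordinal", "subtract"))   # subtraction needs equal intervals
print(is_permitted("ratio", "divide"))
```

A check like this could guard a data-preparation step, for example refusing to average a column that holds ordinal ratings.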
Variables
Independent: input, x
Dependent: output, f(x)
f(x) = 3 + 2x^2
Data Mining Strategies
Unsupervised (no dependent variables used)
Clustering
Market Basket Analysis
Information Visualization
Supervised (at least one dependent variable used for training)
Classification
Estimation
Prediction
Clustering
Cluster analysis divides data into groups (clusters) that are meaningful, useful or both
Clusters capture the natural structure of the data
Clustering allows us to think about the data at a new level of abstraction
Cluster analysis is often the first step in a data mining project
Cluster of Stars
Water Clusters
Cellular Clusters
Cluster Analysis
Uses information found in the data that describes objects and their relationships
Goal: That objects within a group be similar to one another and different from objects in other groups
The greater the similarity within groups and the greater the difference between groups, the better the clustering
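One way to make "similar within groups, different between groups" concrete is to compare two candidate groupings of the same points. The sketch below is an assumption-laden illustration, not a method named in the slides: it measures within-group similarity by the sum of squared distances to each cluster's centroid (lower is tighter) and between-group difference by the smallest distance between centroids (higher is better separated). All function names are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def within_cluster_sse(clusters):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) ** 2 for p in points)
    return total


def min_centroid_separation(clusters):
    """Smallest distance between any two cluster centroids."""
    cents = [centroid(points) for points in clusters]
    return min(
        math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]
    )


# Two groupings of the same four points.
tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]

for name, grouping in (("tight", tight), ("loose", loose)):
    print(name, within_cluster_sse(grouping), min_centroid_separation(grouping))
```

The "tight" grouping should show both a smaller within-cluster sum and a larger centroid separation, which is what the slide describes as the better clustering.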
How Many Clusters?
Three Clusters Identified
Six Clusters Identified
Types of Clustering
Partitional clustering
Hierarchical clustering
Exclusive clustering
Overlapping clustering
Fuzzy clustering
Complete clustering
Partial clustering
Partitional Clustering
A division of a set of data into non-overlapping clusters
Each data point is in exactly one cluster
Example of Partitional Clustering
Hierarchical Clustering
Permits subclusters (nested clusters within clusters)
Example of Hierarchical Clustering
Exclusive clustering
Each object is assigned to a single cluster
Overlapping Clustering
Non-exclusive
A data point can belong to two or more clusters simultaneously
Fuzzy Clustering
Every data point belongs to every cluster with a membership weight.
Membership ranges from 0 (absolutely does not belong) to 1 (absolutely belongs)
The sum of the membership weights for each point is 1
[Figure: two clusters, C1 and C2, with example membership weights for three points: C1 40% / C2 60%; C1 1% / C2 99%; C1 75% / C2 25%]
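The slides do not say how membership weights are computed. As one possible illustration, the sketch below borrows the membership formula from the fuzzy c-means algorithm with fuzzifier m = 2 (an assumption, not something stated above), where a point's weight for a cluster is proportional to the inverse squared distance to that cluster's centroid. The function name `membership_weights` and the sample coordinates are hypothetical.

```python
import math


def membership_weights(point, centroids, m=2.0):
    """Fuzzy c-means style weights of `point` for each centroid.

    Weights lie in [0, 1] and sum to 1. `m` > 1 is the fuzzifier;
    m = 2 is an assumed default for this sketch.
    """
    dists = [math.dist(point, c) for c in centroids]

    # A point sitting exactly on a centroid belongs fully to it.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]

    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]


centroids = [(0.0, 0.0), (4.0, 0.0)]          # centers of C1 and C2
weights = membership_weights((1.0, 0.0), centroids)
print(weights, sum(weights))
```

A point one unit from C1 and three units from C2 gets most of its weight from C1, and the two weights still add up to 1, matching the constraint on the slide.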
Complete Clustering
Assigns every data point to a cluster
No data point is left out of a cluster
Partial Clustering
Does not assign every data point to a cluster
Some data points may not belong to any cluster: noise, outliers, uninteresting background
Classify newspaper stories
Many fall into: global warming, terrorism
Some stories are unique: "Cable Tie just graduated from the CofC in CS"
K-Means
1. Select K points as initial centroids
2. Repeat
   a. Form K clusters by assigning each point to its closest centroid.
   b. Recompute the centroid of each cluster.
3. Until the centroids do not change
Chris Starr:
A centroid is the center of a cluster
The centroids are repositioned until stable in the K-means algorithm.
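The K-means steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than a reference implementation: initial centroids are drawn at random from the data with a fixed seed, distance is Euclidean, an empty cluster keeps its previous centroid, and a `max_iter` cap is added as a safety limit. The names `k_means` and `mean_point` are hypothetical.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # 1. pick K initial centroids

    for _ in range(max_iter):                    # 2. repeat
        # 2a. assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # 2b. recompute each centroid (keep the old one if a cluster is empty)
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]

        if new_centroids == centroids:           # 3. stop when centroids are stable
            break
        centroids = new_centroids

    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
print(sorted(sorted(c) for c in clusters))
```

On this small data set the two groups are well separated, so the centroids should settle near the middle of each group after a few iterations. On harder data the result can depend on the initial centroids, which is why the seed is exposed as a parameter.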
Observe Your Environment
Start looking for clusters around you
Think about how the clusters are formed
Are they hierarchical?
Are they fuzzy clusters?
Are they complete clusters?