Unsupervised learning Factor & Cluster Analysis D3M

Upload: veesingh

Post on 16-Aug-2015

TRANSCRIPT

Page 1: Unsupervised learning

Unsupervised Learning: Factor & Cluster Analysis

D3M

Page 3: Unsupervised learning

Factor & Cluster Analysis

Learning Objectives
• Unsupervised learning methods: principal component analysis (PCA), factor analysis, and clustering
• Objective is dimension reduction
o Reduce the number of collinear variables (PCA/Factor)
o Group your rows (e.g. customers, markets, counties): Cluster Analysis

Learning Resources
• MIT Open Courses, Lectures 11 & 14
• Data Mining class at U of Chicago (lecture notes 7 & 8)
• Class notes

Page 4: Unsupervised learning

Basic Idea

Data Exploration

Atheoretical but not mindless

Essentially looking for ‘similarities’
o Between variables (columns): Principal Component/Factor Analysis
o Between subjects (rows): Clustering Algorithm

Page 5: Unsupervised learning

Examples

Time series of Stock Prices

Items sold in supermarket

Attributes of Fortune 500 companies

Attributes of Brands (Perceptual or Real)

Customer Base of Amazon

Cluster webpages

Biological Attributes of Different Species

Attributes of State/County/Zip Codes

Google searches of keywords

Demographics/Shares of our Brand across stores

Page 6: Unsupervised learning

Example: Marketing Research

• PRIZM (“Potential Ratings Index for Zip Markets”) by Claritas Inc.
– “Birds of a feather flock together”
– 62 neighborhood (zip code) based groups that are similar on demographic and behavioral characteristics
– Used for store location decisions, direct marketing, media selection, etc.

• http://www.claritas.com/MyBestSegments/Default.jsp

Page 7: Unsupervised learning

Key Methods

• Two key research tools
o Cluster Analysis: tool for actually constructing segments
o Factor Analysis: tool for “data reduction”

Page 8: Unsupervised learning

Difference between cluster and factor analysis

[Figure: a data matrix with variables V1, V2, ..., V20 as columns and subjects as rows. Cluster analysis groups subjects (rows); factor analysis groups variables (columns).]

Page 9: Unsupervised learning

Factor Analysis

Page 11: Unsupervised learning

Factor Analysis

Main use of Factor Analysis

• Factor analysis can be used for data reduction (i.e., to reduce the number of variables needed for analysis).

• Factor analysis summarizes the information contained in a larger number of variables into a smaller number of ‘factors’ without significant loss of information.

Page 12: Unsupervised learning

Example: Basis of Moral Foundations

5 underlying factors behind these questions:
• Harm/care
• Authority/respect
• Fairness/reciprocity
• Ingroup/loyalty
• Purity/sanctity

Page 13: Unsupervised learning

Factor Analysis

• Data reduction is important when you need to measure “fuzzy” concepts such as ‘love’, ‘trust’ or ‘satisfaction’.

• Ask a series of questions that tap into the different components of the concept.

• Too many variables! Factor analysis can help to reduce this dimensionality problem.

Page 14: Unsupervised learning

Intuition

• Factor analysis assumes that the correlations among a large number of variables arise because they all depend on the same small number of “factors”. We analyze the patterns of correlations to tap into the underlying construct.

• Example: Car Ratings
o Attributes: perception of seats, perception of noise, perception of smoothness of ride, perception of AC system
o Factor: perception of “quality”
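The car-ratings intuition can be sketched as a small simulation. This is only an illustration under stated assumptions: the loading values, the sample size, and the names `quality`, `ratings`, and `pearson` are hypothetical and do not come from the slides. Each rating is generated as a loading times one hidden "quality" score plus independent noise, so the ratings end up correlated with each other even though none of them directly influences another.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)
n_cars = 2000

# Hypothetical loadings: how strongly each rating depends on "quality".
loadings = {"seats": 0.8, "noise": 0.7, "ride": 0.9, "ac": 0.6}

# The unobserved factor: one "quality" score per car.
quality = [random.gauss(0, 1) for _ in range(n_cars)]

# Each observed rating = loading * factor + independent noise,
# scaled so that every rating has variance close to 1.
ratings = {
    name: [l * q + math.sqrt(1 - l * l) * random.gauss(0, 1) for q in quality]
    for name, l in loadings.items()
}

names = list(ratings)
corr = {(a, b): pearson(ratings[a], ratings[b]) for a in names for b in names}

# Print the correlation matrix of the observed ratings.
for a in names:
    print(a.ljust(6), " ".join(f"{corr[(a, b)]:5.2f}" for b in names))
```

Under this one-factor setup, the expected correlation between two ratings is the product of their loadings (for example, about 0.8 × 0.9 = 0.72 for seats and ride), which is the pattern factor analysis works backwards from.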

Page 15: Unsupervised learning

Psychology: The “Big Five”

Trait: Characteristics (examples)
Openness: imaginative, insightful
Conscientiousness: organized, thorough
Extraversion: energetic, assertive
Agreeableness: sympathetic, kind, affectionate
Neuroticism: tense, moody, anxious

Page 16: Unsupervised learning

Cluster Analysis

Page 18: Unsupervised learning

Cluster Analysis

• Cluster analysis is a technique used to identify groups of ‘similar’ customers in a market (i.e., market segmentation).

Cluster analysis encompasses a number of different algorithms and methods for grouping objects of similar kind into categories.
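As one concrete instance of "grouping objects of similar kind", here is a minimal k-means sketch. The slides do not name a specific algorithm at this point, so k-means is an assumption chosen for illustration; the function `kmeans` and the customer data (annual spend, visits per month) are hypothetical.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means on a list of numeric tuples.

    Returns (labels, centers). Initial centers are k randomly chosen points.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical customers: (annual spend, store visits per month).
customers = [(20, 1), (25, 2), (22, 1), (200, 10), (210, 12), (190, 11)]
labels, centers = kmeans(customers, k=2)

for customer, label in zip(customers, labels):
    print(customer, "-> segment", label)
```

In practice the variables would usually be standardized first, since a variable measured on a large scale (spend) otherwise dominates the distance.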

Page 19: Unsupervised learning

Application

Example: Market Segmentation

o Process of dividing a total market into groups of consumers who have similar needs and who respond similarly to marketing mix variables.

Page 20: Unsupervised learning

• General question: how to organize observed data into meaningful structures

• Examples:
o In food stores, items of similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations.
o Biologists have to organize the different species of animals: man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals.
o In medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies.
o In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy.
o Collaborative filtering & recommendation systems

Page 22: Unsupervised learning

Example 1: Segmenting Stores in Soup Case Study

D3M

Page 23: Unsupervised learning

Demographics Are Highly Correlated

Page 24: Unsupervised learning

Cluster of Variables (ClustOfVar Package in R)
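The idea of clustering variables rather than rows can be sketched with a much simpler stand-in than the R package. The code below is not the ClustOfVar algorithm; it is a single-linkage grouping on squared correlations, written only to show what "groups of correlated columns" means. The variable names, the correlation values, and the 0.5 threshold are all hypothetical.

```python
def cluster_variables(names, corr, threshold=0.5):
    """Group variables whose squared correlation with any member of a group
    is at least `threshold` (single linkage). Returns a list of groups."""
    groups = [[n] for n in names]          # start with each variable alone
    idx = {n: i for i, n in enumerate(names)}
    merged = True
    while merged:
        merged = False
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                linked = any(
                    corr[idx[x]][idx[y]] ** 2 >= threshold
                    for x in groups[a] for y in groups[b]
                )
                if linked:
                    # Merge group b into group a, then rescan from the start.
                    groups[a] += groups.pop(b)
                    merged = True
                    break
            if merged:
                break
    return groups


# Hypothetical correlation matrix for four demographics.
names = ["income", "education", "median_age", "pct_retired"]
corr = [
    [1.00, 0.80, 0.10, 0.05],
    [0.80, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.90],
    [0.05, 0.10, 0.90, 1.00],
]

groups = cluster_variables(names, corr)
print(groups)
```

With these made-up correlations, income and education land in one group and the two age-related variables in another, which is the kind of structure a variable-clustering dendrogram makes visible.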

Page 25: Unsupervised learning

Interpret the Factors

These are called factor “loadings”. Each loading measures the correlation between a demographic and the underlying “factor”. Our job is to interpret the factors and put a label on each.
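To make "loading = correlation between a variable and the factor" concrete, here is a sketch that uses the first principal component as a stand-in for a factor. This is principal component analysis, not the factor-analysis fit shown on the slide, and the three demographics and their correlations are hypothetical. For standardized variables, a variable's loading on a component is the eigenvector entry scaled by the square root of the eigenvalue.

```python
import math


def first_component(corr, n_iter=200):
    """Leading eigenvalue and eigenvector of a symmetric matrix,
    found by power iteration."""
    n = len(corr)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    return eigenvalue, v


# Hypothetical correlations among three demographics.
names = ["income", "education", "home_value"]
corr = [
    [1.0, 0.7, 0.8],
    [0.7, 1.0, 0.6],
    [0.8, 0.6, 1.0],
]

eigenvalue, vector = first_component(corr)

# Loading of each variable on the first component.
loadings = [math.sqrt(eigenvalue) * x for x in vector]

for name, loading in zip(names, loadings):
    print(f"{name:12s} loading = {loading:.2f}")
print(f"sum of squared loadings = {sum(l * l for l in loadings):.3f}")
```

All three variables load highly on the single component here, so an analyst might label it something like "affluence". The sum of squared loadings equals the eigenvalue, which is the quantity reported as "SS loadings" in factor output.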

Page 26: Unsupervised learning

Information Captured

                Factor1  Factor2  Factor3
SS loadings       3.143    2.961    1.671
Proportion Var    0.314    0.296    0.167
Cumulative Var    0.314    0.610    0.777

Using 3 “factors” instead of 10 demographics, we capture approx. 78% of the information (variance) in the data.
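The arithmetic behind the 78% figure can be checked directly. The sketch below assumes the 10 demographics are standardized, so the total variance is 10 and each factor's proportion is its SS loadings divided by 10. The SS loadings values are copied from the slide; the variable names are illustrative.

```python
# SS loadings for the three factors, as reported on the slide.
ss_loadings = [3.143, 2.961, 1.671]

# Assumption: 10 standardized demographics, so total variance = 10.
n_variables = 10

# Proportion of variance explained by each factor.
proportion = [ss / n_variables for ss in ss_loadings]

# Running total across factors.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"Factor{i}: proportion = {p:.3f}, cumulative = {c:.3f}")
```

The last cumulative value lands within rounding of the 0.777 on the slide; small differences in the third decimal are expected because the slide's figures were computed from unrounded loadings.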

Page 27: Unsupervised learning
Page 28: Unsupervised learning
Page 29: Unsupervised learning

Example 2: Segmenting US Counties

D3M

Page 30: Unsupervised learning

Files Used: US_Counties.csv, Segment_US_County.R

• Suppose we are analyzing data based on US counties
o Demographic variables
o Health outcomes
o Crime rates
o Voting behavior
o Religion
o Market shares of brands
o Google searches

Page 31: Unsupervised learning

Hard to even see, let alone understand

Basically, a bunch of variables are highly correlated

Page 32: Unsupervised learning

Cluster of Variables (ClustOfVar Package in R)

Page 33: Unsupervised learning