Unsupervised Learning: Factor & Cluster Analysis
D3M
Learning Resources: Video Series from Stanford
Learning Objectives
• Unsupervised Learning Methods: Principal Component Analysis, Factor Analysis, & Clustering
• Objective is Dimension Reduction
o Reduce the number of collinear variables: PCA/Factor Analysis
o Group your rows (e.g. customers, markets, counties): Cluster Analysis
Learning Resources
• MIT Open Courses Lecture 11 & 14
• Data Mining Class at U of Chicago (Lecture notes 7 & 8)
• Class notes
Basic Idea
Data Exploration
A-theoretical but not mindless
Essentially looking for ‘similarities’
o Between variables (columns): Principal Component/Factor Analysis
o Between subjects (rows): Clustering Algorithm
Examples
Time series of Stock Prices
Items sold in supermarket
Attributes of Fortune 500 companies
Attributes of Brands (Perceptual or Real)
Customer Base of Amazon
Cluster webpages
Biological Attributes of Different Species
Attributes of State/County/Zip Codes
Google searches of keywords
Demographics/Shares of our Brand across stores
Example: Marketing Research
• PRIZM (“Potential Ratings Index for Zip Markets”) by Claritas Inc.
– “Birds of a feather flock together”
– 62 neighborhood (zip code) based groups that are similar on demographic and behavioral characteristics
– Used for store location decisions, direct marketing, media selection, etc.
• http://www.claritas.com/MyBestSegments/Default.jsp
Key Methods
• Two key research tools
o Cluster Analysis: Tool for actually constructing segments
o Factor Analysis: Tool for “data reduction”
Difference between cluster and factor analysis
[Figure: a data matrix with variables V1, V2, V3, V4, V5, ..., V20 as columns. Cluster analysis groups subjects (rows); factor analysis groups variables (columns).]
Factor Analysis
Main use of Factor Analysis
Factor analysis can be used for data reduction (i.e., to reduce the number of variables needed for analysis).
Factor analysis summarizes the information contained in a larger number of variables into a smaller number of ‘factors’ without significant loss of information.
Example: Basis of Moral Foundations
5 Underlying Factors behind these Questions
• Harm/care
• Authority/respect
• Fairness/reciprocity
• Ingroup/loyalty
• Purity/sanctity
• Data reduction is important when you need to measure “fuzzy” concepts such as ‘love’, ‘trust’ or ‘satisfaction’
• Ask a series of questions that tap into the different components of the concept.
• Too many variables! Factor analysis can help to reduce this dimensionality problem
Intuition
• Factor analysis assumes that the correlation between a large number of variables is due to them all being dependent on the same small number of “factors”. Analyze the patterns of correlations to tap into the underlying construct.
• Example: Car ratings
o Attributes: perception of seats, perception of noise, perception of smoothness of ride, perception of AC-system
o Factor: perception of “quality”
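The car-ratings intuition can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic, the model (each rating = hidden "quality" + independent noise) is assumed for illustration, and the names quality, ratings, corr and loadings are hypothetical. It shows that attributes driven by one hidden factor end up highly correlated with each other and with that factor.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic data: 500 respondents, each with an unobserved "quality" score.
quality = [random.gauss(0, 1) for _ in range(500)]

# Assumed model: each observed rating = latent quality + independent noise.
attributes = ["seats", "noise", "ride", "ac"]
ratings = {
    name: [q + random.gauss(0, 0.5) for q in quality]
    for name in attributes
}

# Correlation matrix of the observed attributes.
corr = {
    (a, b): pearson(ratings[a], ratings[b])
    for a in attributes
    for b in attributes
}

# Correlation of each attribute with the hidden factor
# (the simulated analogue of a factor loading).
loadings = {a: pearson(ratings[a], quality) for a in attributes}

for a in attributes:
    row = " ".join(f"{corr[(a, b)]:5.2f}" for b in attributes)
    print(a.ljust(6), row, f"| loading {loadings[a]:.2f}")
```

In real survey data the factor is not observed, so factor analysis has to work backwards from the correlation matrix alone; the simulation only shows why that pattern of correlations appears.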
MKTG450
Psychology: The “Big Five”
Trait: Characteristics
Openness: Imaginative, Insightful
Conscientiousness: Organized, Thorough
Extraversion: Energetic, Assertive
Agreeableness: Sympathetic, Kind, Affectionate
Neuroticism: Tense, Moody, Anxious
Cluster Analysis
• Cluster analysis is a technique used to identify groups of ‘similar’ customers in a market (i.e., market segmentation).
Cluster analysis encompasses a number of different algorithms and methods for grouping objects of similar kind into categories.
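One of the many algorithms under the cluster-analysis umbrella is k-means. The following is a minimal sketch, not the method used in the course files: the six customers and their two scores (price sensitivity and visits per month, both on a 1-10 scale) are made up, and kmeans and nearest are hypothetical helper names.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate between assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids

# Made-up customers: (price sensitivity, visits per month).
customers = [
    (1.0, 2.0), (1.5, 1.8), (1.2, 2.2),   # low on both scores
    (8.0, 9.0), (8.5, 8.7), (7.8, 9.2),   # high on both scores
]

labels, centroids = kmeans(customers, k=2)
for customer, label in zip(customers, labels):
    print(customer, "-> segment", label)
```

Because k-means relies on distances, variables measured on very different scales are usually standardized first; the two scores here were chosen to share a scale so that step could be skipped.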
Application Example: Market Segmentation
o Process of dividing a total market into groups of consumers who have similar needs and who respond similarly to marketing mix variables.
• General question: how to organize observed data into meaningful structures
• Examples:
o In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations.
o Biologists have to organize the different species of animals: man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals.
o In medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies.
o In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy.
o Collaborative filtering & Recommendation systems
Example 1: Segmenting Stores in Soup Case Study
Demographics Are Highly Correlated
Cluster of Variables (ClustOfVar package in R)
Interpret the Factors
These are called factor “loadings”. Each loading measures the correlation between a demographic and the underlying “factor”. It is our job to interpret the factors and put a label on each.
Information Captured
                Factor1  Factor2  Factor3
SS loadings       3.143    2.961    1.671
Proportion Var    0.314    0.296    0.167
Cumulative Var    0.314    0.610    0.777
Using 3 “factors” instead of 10 demographics, we capture approx. 78% of the variance (the “information”) in the data.
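The “Proportion Var” and “Cumulative Var” rows follow from the “SS loadings” row by simple arithmetic. The sketch below assumes the 10 demographics were standardized (each with variance 1, so total variance equals 10), which is the usual setup when factoring a correlation matrix; the variable names are illustrative.

```python
from itertools import accumulate

ss_loadings = [3.143, 2.961, 1.671]  # values reported in the table above
n_variables = 10                     # the 10 demographics (assumed standardized)

# Share of total variance explained by each factor.
proportion = [ss / n_variables for ss in ss_loadings]

# Running total across factors.
cumulative = list(accumulate(proportion))

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"Factor{i}: proportion {p:.3f}, cumulative {c:.3f}")
```

The last cumulative value may differ from the table in the third decimal place, because the table was computed from unrounded loadings.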
Example 2: Segmenting US Counties
Files Used: US_Counties.csv, Segment_US_County.R
• Suppose we are analyzing data based on US Counties
o Demographic variables
o Health outcomes
o Crime Rates
o Voting Behavior
o Religion
o Market Shares of brands
o Google Searches
Hard to Even See, let alone Understand
Basically, a Bunch of Variables are Highly Correlated
Cluster of Variables (ClustOfVar package in R)
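The idea of grouping highly correlated variables can be sketched without the R package. To be clear, this is not the ClustOfVar algorithm: it is a simpler stand-in that links any two variables whose absolute correlation exceeds a threshold and reports the connected groups (single-linkage grouping). The data are synthetic, and the variable names, the 0.6 threshold and the two hidden drivers f1 and f2 are assumptions made for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(1)
n = 400  # synthetic "counties"

# Two independent hidden drivers (assumed for the illustration).
f1 = [random.gauss(0, 1) for _ in range(n)]
f2 = [random.gauss(0, 1) for _ in range(n)]

def noisy(factor):
    """An observed variable = hidden driver + independent noise."""
    return [v + random.gauss(0, 0.5) for v in factor]

data = {
    "income": noisy(f1),
    "home_value": noisy(f1),
    "college_share": noisy(f1),
    "median_age": noisy(f2),
    "retiree_share": noisy(f2),
}
names = list(data)

# Union-find: merge variables whose |correlation| exceeds the threshold.
parent = {v: v for v in names}

def find(v):
    while parent[v] != v:
        v = parent[v]
    return v

threshold = 0.6
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if abs(pearson(data[a], data[b])) > threshold:
            parent[find(a)] = find(b)

# Collect the variables under each root into groups.
groups = {}
for v in names:
    groups.setdefault(find(v), []).append(v)

for members in groups.values():
    print(members)
```

Each resulting group is a candidate for being summarized by a single factor, which is the step the factor analysis in Example 1 carries out.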