Exploring Variable Clustering and Importance in JMP

Copyright © 2012, SAS Institute Inc. All rights reserved. EXPLORING VARIABLE CLUSTERING AND IMPORTANCE IN JMP CHRIS GOTWALT AND RYAN PARKER

Upload: jmp-division-of-sas

Post on 27-Jan-2015


DESCRIPTION

This presentation was given live at JMP Discovery Summit 2013 in San Antonio, Texas, USA. To sign up to attend this year's conference, visit http://jmp.com/summit

TRANSCRIPT

Page 1: Exploring Variable Clustering and Importance in JMP


EXPLORING VARIABLE CLUSTERING AND IMPORTANCE IN JMP

CHRIS GOTWALT AND RYAN PARKER

Page 2: Exploring Variable Clustering and Importance in JMP


VARIABLE CLUSTERING

INTRODUCTION

• Variable clustering is a dimension-reduction method that reduces the number of input variables used in a predictive model.

• It finds groups of similar variables so that a single variable can represent each group.

• It helps reduce the effects of collinearity among the input variables.

• Developed by SAS/STAT Development Director Warren Sarle.

Page 3: Exploring Variable Clustering and Importance in JMP


VARIABLE CLUSTERING

AN ITERATIVE ALGORITHM

• Iteratively splits and assigns variables to clusters.

• Sample iterations for variables in Wine Quality data set:

Iteration 1: {Alcohol, Citric Acid, pH, Sugar, Sulfur Dioxide}

Iteration 2: {Alcohol, Citric Acid, Sulfur Dioxide}, {pH, Sugar}

Iteration 3: {Alcohol, Sugar}, {pH, Sulfur Dioxide}, {Citric Acid}

Page 4: Exploring Variable Clustering and Importance in JMP


VARIABLE CLUSTERING

ALGORITHM DETAILS

• At each iteration, the cluster with the largest second eigenvalue is split.

• Variables within this cluster are assigned to two new clusters based on each variable’s correlation with the first two orthoblique rotated principal components.

• After the split, variables from other clusters are reassigned to one of the new clusters if they correlate more highly with it than with their current cluster.

• The algorithm ends when the second eigenvalue of every cluster is less than one.
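
The split-selection and stopping rules above can be sketched in a few lines of Python. This is a minimal illustration, not JMP's implementation: the function names are hypothetical, clusters are assumed to be lists of column indices into a correlation matrix, and the orthoblique rotation and reassignment steps are deliberately left out.

```python
import numpy as np

def second_eigenvalue(corr, members):
    """Second-largest eigenvalue of the correlation submatrix of one cluster."""
    if len(members) < 2:
        return 0.0  # a one-variable cluster has no second eigenvalue
    sub = corr[np.ix_(members, members)]
    eigvals = np.linalg.eigvalsh(sub)  # returned in ascending order
    return float(eigvals[-2])

def cluster_to_split(corr, clusters, threshold=1.0):
    """Index of the cluster with the largest second eigenvalue, or None
    when every second eigenvalue is below the threshold (stopping rule)."""
    seconds = [second_eigenvalue(corr, c) for c in clusters]
    best = int(np.argmax(seconds))
    return best if seconds[best] >= threshold else None

# Hypothetical correlation matrix: two tight, mutually uncorrelated pairs.
demo_corr = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
])
print(cluster_to_split(demo_corr, [[0, 1, 2, 3]]))    # one big cluster
print(cluster_to_split(demo_corr, [[0, 1], [2, 3]]))  # after a split
```

With all four variables in one cluster the second eigenvalue exceeds one, so that cluster is selected for splitting; once the two pairs are separated, no cluster qualifies and the loop would stop.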

Page 5: Exploring Variable Clustering and Importance in JMP


VARIABLE CLUSTERING

REDUCING EACH CLUSTER TO A SINGLE VARIABLE

[Figure: example cluster diagram with the variables pH, Sugar, and Citric Acid]

• Each cluster can be reduced to a single variable for modeling.

• There are two ways to do this:

1. We can use the most representative variable from each cluster.

2. Alternatively, the cluster component from each cluster can be used.

Page 6: Exploring Variable Clustering and Importance in JMP


VARIABLE CLUSTERING

MOST REPRESENTATIVE VARIABLES

• These are variables that best represent each cluster.

• Each one has the highest correlation with the other variables in its cluster.

• Most representative variables provide a clear interpretation when used.
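
As a rough sketch of how such a variable might be picked, the Python below scores each member by its mean squared correlation with the other members of its cluster. That scoring rule is an assumption made for illustration (the slides only say "highest correlation"), and `most_representative` is a hypothetical name, not a JMP function.

```python
import numpy as np

def most_representative(corr, members):
    """Return the member whose mean squared correlation with the
    other members of the cluster is highest (assumed criterion)."""
    if len(members) == 1:
        return members[0]
    sub = corr[np.ix_(members, members)] ** 2  # squared correlations (a copy)
    np.fill_diagonal(sub, 0.0)                 # ignore self-correlation
    scores = sub.sum(axis=1) / (len(members) - 1)
    return members[int(np.argmax(scores))]

# Hypothetical three-variable cluster.
demo_corr = np.array([
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.5],
    [0.8, 0.5, 1.0],
])
print(most_representative(demo_corr, [0, 1, 2]))
```

Here variable 0 is strongly correlated with both of the others, so it is the one selected.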

Page 7: Exploring Variable Clustering and Importance in JMP


VARIABLE CLUSTERING

CLUSTER COMPONENTS

• Cluster components are new variables created from the first principal component of each cluster.

• They provide a way to combine the variables in each cluster into a single variable.

• This is similar to traditional principal components analysis (PCA), except that each cluster component uses only variables from that cluster.

• Interpretation is less clear than with most representative variables.
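
A cluster component can be sketched as the first principal component of the standardized variables in one cluster. The Python below assumes a plain NumPy data matrix with observations in rows; `cluster_component` is a hypothetical helper, and JMP's scaling and sign conventions may differ.

```python
import numpy as np

def cluster_component(X, members):
    """First principal component score of the standardized columns in `members`."""
    Z = X[:, members]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)     # standardize each column
    corr = np.atleast_2d(np.corrcoef(Z, rowvar=False))   # correlation within cluster
    eigvals, eigvecs = np.linalg.eigh(corr)              # ascending eigenvalues
    w = eigvecs[:, -1]                                   # loadings for largest eigenvalue
    if w.sum() < 0:
        w = -w                                           # fix the arbitrary sign
    return Z @ w

# Hypothetical data: columns 0 and 1 share a common signal, column 2 is noise.
rng = np.random.default_rng(1)
signal = rng.normal(size=200)
X = np.column_stack([
    signal + 0.1 * rng.normal(size=200),
    signal + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])
component = cluster_component(X, [0, 1])
print(component.shape)
```

Only the columns listed in `members` enter the calculation, which is the difference from ordinary PCA on all inputs.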

Page 8: Exploring Variable Clustering and Importance in JMP


VARIABLE CLUSTERING

DEMO: IMPORTANT TERMS

• RSquare with Own Cluster

    • The RSquare a variable has with the variables in its own cluster.

• RSquare with Next Closest

    • The RSquare a variable has with the variables in the next most similar cluster.

• 1-RSquare Ratio

    • Relative similarity between a variable’s own cluster and the next closest cluster.

    • For a well-clustered variable, the value is less than 1.

    • Values greater than 1 indicate that the variable should be moved to the next closest cluster.
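
The ratio can be illustrated with a short Python function. It assumes the usual definition, (1 - RSquare with Own Cluster) / (1 - RSquare with Next Closest), which the slides do not spell out; the function name is hypothetical.

```python
def one_minus_rsquare_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    if not (0.0 <= r2_own <= 1.0 and 0.0 <= r2_next < 1.0):
        raise ValueError("expected 0 <= r2_own <= 1 and 0 <= r2_next < 1")
    return (1.0 - r2_own) / (1.0 - r2_next)

# A variable that fits its own cluster well gives a small ratio.
well_placed = one_minus_rsquare_ratio(r2_own=0.8, r2_next=0.2)
# A variable closer to the neighbouring cluster gives a ratio above 1.
misplaced = one_minus_rsquare_ratio(r2_own=0.3, r2_next=0.6)
print(well_placed < 1.0, misplaced > 1.0)
```

Small ratios mean the variable is well explained by its own cluster and poorly explained by the next closest one.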

Page 9: Exploring Variable Clustering and Importance in JMP


VARIABLE IMPORTANCE

INTRODUCTION

• Provides a general way to assess the importance of variables for predictive models in JMP.

• Gives insight into the practical significance of input variables.

• Based on the functional decomposition ideas of I. M. Sobol.

Page 10: Exploring Variable Clustering and Importance in JMP


VARIABLE IMPORTANCE

FUNCTIONAL DECOMPOSITION

• I. M. Sobol showed that we can decompose a function $f(X_1, \dots, X_p)$ into a sum of lower-dimensional functions of the inputs:

• $f(X_1, \dots, X_p) = f_0 + f_1(X_1) + \cdots + f_p(X_p) + f_{12}(X_1, X_2) + \cdots$

• The decomposition has a function for each $X_i$, each pair $(X_i, X_j)$, etc.

• The variability of these lower-dimensional functions assesses the importance of the input variables.

Page 11: Exploring Variable Clustering and Importance in JMP


VARIABLE IMPORTANCE

IMPORTANCE EFFECTS

• Variable importance is assessed in terms of effect indices.

• These indices are numbers between 0 and 1 indicating relative importance.

• Main effect indices measure the variability of predictions due to a single input.

    • They do not account for interaction effects.

• Total effect indices measure the total variability of predictions due to an input.

    • They combine the main effect and all higher-order interaction effects.
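
To make the two kinds of index concrete, the Python sketch below estimates them by Monte Carlo for a toy model with independent uniform inputs on [-1, 1]. It uses standard pick-freeze estimators (a Saltelli-style main effect and a Jansen-style total effect); this is an assumption for illustration, since the slides do not say which estimator JMP uses, and `sobol_indices` and `example_model` are hypothetical names.

```python
import numpy as np

def sobol_indices(f, p, n=100_000, seed=0):
    """Monte Carlo main and total effect indices for f with p inputs,
    each assumed independent and uniform on [-1, 1]."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(n, p))
    B = rng.uniform(-1.0, 1.0, size=(n, p))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))  # total prediction variance
    main = np.empty(p)
    total = np.empty(p)
    for i in range(p):
        AB = A.copy()
        AB[:, i] = B[:, i]                   # A with column i taken from B
        fAB = f(AB)
        main[i] = np.mean(fB * (fAB - fA)) / var
        total[i] = 0.5 * np.mean((fA - fAB) ** 2) / var
    return main, total

def example_model(X):
    # X0 acts alone; X1 and X2 act only through their interaction.
    return X[:, 0] + X[:, 1] * X[:, 2]

main, total = sobol_indices(example_model, p=3)
print(np.round(main, 2), np.round(total, 2))
```

For this toy model the exact values are main effects of 0.75, 0, 0 and total effects of 0.75, 0.25, 0.25: the second and third inputs matter only through their interaction, so they show up in the total indices but not in the main ones. The estimates should land close to these values, up to Monte Carlo error.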

Page 12: Exploring Variable Clustering and Importance in JMP


VARIABLE IMPORTANCE

DISTRIBUTION OF INPUT VARIABLES

• Variability in predictions is due to the distribution of the input variables.

• JMP 11 provides three input variable distribution options:

1. Independent Uniform

2. Independent Resampled

3. Dependent Resampled

• A Monte Carlo estimation procedure is used for the independent cases.

• K-nearest-neighbors estimation is used for the dependent case.
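
One way to picture the three options is as three ways of generating input rows from an observed data matrix. The Python below is a simplified reading, not JMP's procedure: the option names come from the slide, but the sampling rules are assumptions, and the dependent case here simply resamples whole rows rather than using K-nearest neighbors. `sample_inputs` is a hypothetical helper.

```python
import numpy as np

def sample_inputs(X, n, method, seed=0):
    """Draw n input rows from observed data X under one of three assumed schemes."""
    rng = np.random.default_rng(seed)
    rows, p = X.shape
    if method == "independent_uniform":
        # Each input uniform over its observed range, independent of the others.
        return rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, p))
    if method == "independent_resampled":
        # Each column resampled on its own: keeps marginals, breaks dependence.
        return np.column_stack([rng.choice(X[:, j], size=n) for j in range(p)])
    if method == "dependent_resampled":
        # Whole rows resampled together: keeps the dependence between inputs.
        return X[rng.integers(0, rows, size=n)]
    raise ValueError(f"unknown method: {method}")

# Hypothetical data in which the second input is exactly twice the first.
t = np.linspace(0.0, 1.0, 50)
X = np.column_stack([t, 2.0 * t])
for method in ("independent_uniform", "independent_resampled", "dependent_resampled"):
    S = sample_inputs(X, 5000, method)
    print(method, round(float(np.corrcoef(S, rowvar=False)[0, 1]), 2))
```

The independent schemes produce combinations of input values that never occur in the data, which is harmless when inputs are unrelated but can distort importance estimates when they are strongly dependent.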

Page 13: Exploring Variable Clustering and Importance in JMP


VARIABLE IMPORTANCE

USE RESAMPLED INPUTS?

[Figure: two example panels, labeled "Uniform Acceptable" and "Resampled Needed"]

Page 14: Exploring Variable Clustering and Importance in JMP


VARIABLE IMPORTANCE

MARGINAL INFERENCE

[Figure: Main Effects: 0.16, 0.03]

Page 15: Exploring Variable Clustering and Importance in JMP


VARIABLE IMPORTANCE

DEMO