Page 1: Svd filtered temporal usage clustering

SVD Filtered Temporal Usage Pattern Analysis & Clustering

Liang Xie

SCSUG Educational Forum 2009
San Antonio, TX

Page 2: Svd filtered temporal usage clustering

Business Objective

• Provide a robust algorithm to cluster customers based on their temporal transactional data
• Issues:
  • Data
    • High dimensionality: 360 features, multi-million records
    • Capture amplitude at different resolutions
    • High volatility due to noise
    • Possible outliers
  • Algorithm
    • Robustness
    • Efficiency
    • Easy to implement in SAS!
• We choose an SVD-based algorithm
  • Successful application to gene-expression analysis by Alter et al. (PNAS, 2000)

Page 3: Svd filtered temporal usage clustering

SVD as a Filter

• SVD definition:
  • Singular Value Decomposition is a mathematical tool to decompose a rectangular matrix: $X = U \Sigma V'$
  • The left eigenvector matrix U can be regarded as an input rotation matrix, $\Sigma$ is the scaling matrix, and the right eigenvector matrix V is the output matrix
  • SVD is similar to Fourier analysis
• Filter:
  • Each row of X is a linear combination of the right eigenvectors
  • Each column of X is a linear combination of the left eigenvectors
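The two "filter" statements can be checked numerically. The following is a minimal NumPy sketch, not the author's SAS code; the toy matrix and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # toy data: 6 observations, 4 features

# Thin SVD: X = U @ diag(s) @ Vt, where the rows of Vt are the right eigenvectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rebuilt = U @ np.diag(s) @ Vt

# Row i of X is a linear combination of the rows of Vt, with weights U[i] * s
row_weights = U * s                  # shape (6, 4)
first_row = row_weights[0] @ Vt

# Column j of X is a linear combination of the columns of U,
# with weights s * Vt[:, j]
col_weights = s[:, None] * Vt        # shape (4, 4)
first_col = U @ col_weights[:, 0]
```

Zeroing selected entries of `s` before rebuilding the matrix is what turns this decomposition into a filter.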

Page 4: Svd filtered temporal usage clustering

Relationship Between PCA and SVD

• SAS/STAT doesn’t explicitly support SVD
• We can tweak SAS/STAT to do SVD by linking one computation method of SVD to PCA
  • SVD and PCA are essentially the same: SVD on the covariance matrix of the original data X is equivalent to PCA of X
  • PCA on the non-centered covariance matrix of X is equivalent to SVD of X, with proper scaling

$\mathrm{SVD}(X'X) = V S V'$
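The link between the two decompositions follows in one line from $X = U \Sigma V'$ and the orthonormality of U ($U'U = I$). In the derivation below, writing $S = \Sigma'\Sigma$ and $\sigma_i$ for the singular values of X is notation assumed here, not taken from the slides.

```latex
X'X = (U \Sigma V')'(U \Sigma V')
    = V \Sigma' U'U \Sigma V'
    = V (\Sigma'\Sigma) V'
    = V S V',
\qquad S = \Sigma'\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2)
```

So the eigenvectors of $X'X$ are the right eigenvectors of X, and the eigenvalues of $X'X$ are the squared singular values of X.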

Page 5: Svd filtered temporal usage clustering

SVD in SAS/STAT

• We call PROC PRINCOMP to conduct SVD in SAS/STAT
• The uncorrected covariance matrix in PROC PRINCOMP is X’X/n, not X’X, so the result has to be rescaled: each eigenvalue $\lambda_i$ of X’X/n must be multiplied by n, which gives the singular values of X as $\sqrt{n \lambda_i}$
• PROC PRINCOMP NOINT COV SING=
  • ‘COV’ computes the principal components from the covariance matrix
  • ‘NOINT’ omits the intercept from the model
  • ‘SING=’ specifies the singularity criterion to ensure accuracy
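The rescaling can be illustrated outside SAS. This NumPy sketch imitates the computation described above: it eigen-decomposes the uncorrected covariance matrix X'X/n, then undoes the 1/n divisor. It is not PROC PRINCOMP itself, and the toy data and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))

# Uncorrected (non-centered) covariance matrix with divisor n
C = X.T @ X / n

# Eigen-decomposition of the symmetric matrix C; eigh returns ascending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], eigvecs[:, order]

# Undo the 1/n divisor, then take square roots to get the singular values of X
sing_from_pca = np.sqrt(n * eigvals)

# Left eigenvectors follow from X = U diag(s) V'  =>  U = X V diag(1/s)
U = X @ V / sing_from_pca

# Reference values from a direct SVD
sing_direct = np.linalg.svd(X, compute_uv=False)
```

The columns of `V` and `U` are determined only up to sign, so they can differ in sign from the output of a direct SVD while still reproducing X.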

Page 6: Svd filtered temporal usage clustering

Performance

• Accuracy
  • Test the code on a Hilbert matrix
  • With ‘SING=1e-16’ specified, our result is comparable to those obtained from R and MATLAB
• Efficiency
  • Test the code on an arbitrary rectangular matrix with 1.7 million rows and 400 columns
  • On a Core 2 Duo 1.86 GHz PC, it takes SAS 7 min 56 sec to finish all data processing and computations; user CPU time is 5 min 52 sec
  • Note that the 32-bit Windows version of R is not able to handle data this big:
      > X<-matrix(runif(1.7E6*400), ncol=400)
      Error in runif(1700000 * 400) : cannot allocate vector of length 680000000
• A multi-threaded/parallel SVD algorithm from SAS is highly desired!!
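A Hilbert matrix is a demanding accuracy test because it is badly conditioned, and going through X'X squares the condition number. The sketch below compares a direct SVD with the covariance route in NumPy. The size of 5 is an arbitrary assumption, since the slides do not state which size was used, and the helper name `hilbert` is introduced here.

```python
import numpy as np

def hilbert(n):
    """n x n Hilbert matrix with entries 1 / (i + j - 1) for 1-based i, j."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1)

H = hilbert(5)

# Reference: singular values from a direct SVD
sv_direct = np.linalg.svd(H, compute_uv=False)

# Covariance route: square roots of the eigenvalues of H'H, largest first
eig = np.linalg.eigvalsh(H.T @ H)[::-1]
sv_via_cov = np.sqrt(np.clip(eig, 0.0, None))   # clip tiny negative round-off

# Relative error of the covariance route, per singular value
rel_err = np.abs(sv_via_cov - sv_direct) / sv_direct
```

The leading singular values agree closely, while the relative error tends to grow for the smallest ones, which is where a tight singularity criterion matters.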

Page 7: Svd filtered temporal usage clustering

Temporal Usage Pattern Analysis

• Time series usage data from customers for one year at 60-minute intervals
• Hourly usage data is normalized to:
  • Year total
  • Monthly total
• We want to identify segments with distinct usage patterns over one year, so that the marketing department can design customized messages for them
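The two normalizations can be sketched as follows. The array layout, the simplified year of 12 months of 30 days, and the randomly generated usage values are all assumptions for illustration; the slides do not describe the actual data layout.

```python
import numpy as np

rng = np.random.default_rng(2)
months, days_per_month, hours_per_day = 12, 30, 24   # simplified 360-day year

# usage[c, m, d, h]: hypothetical hourly usage for 3 customers
usage = rng.gamma(shape=2.0, scale=1.0,
                  size=(3, months, days_per_month, hours_per_day))

# Normalize each hour to the customer's yearly total
year_total = usage.sum(axis=(1, 2, 3), keepdims=True)
share_of_year = usage / year_total

# Normalize each hour to the total of the month it falls in
month_total = usage.sum(axis=(2, 3), keepdims=True)
share_of_month = usage / month_total
```

After normalization, customers are compared on the shape of their usage over time rather than on its overall volume.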

Page 8: Svd filtered temporal usage clustering

Traditional Approach

• Direct k-means clustering using PROC FASTCLUS on all features
• Problems:
  • Not robust: susceptible to outliers
  • Ambiguity in choosing the optimal number of clusters a priori
  • High dimensionality will affect the distance measure between each pair:
    • In high-dimensional spaces, distances between points become relatively uniform
  • Combining the lack of robustness with high dimensionality, we could get segments that are occupied by only a few observations, which is usually not desired
  • The k-means clustering algorithm doesn’t take the time series nature into consideration. All features are considered independent
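The claim about distances becoming uniform can be illustrated with uniformly random points. This sketch measures how far the nearest and farthest neighbours of one point differ relative to the nearest distance. The function name `relative_contrast`, the point count, and the uniform distribution are assumptions chosen for the demonstration.

```python
import numpy as np

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from one point to all the others."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(size=(n_points, dim))
    d = np.linalg.norm(pts[1:] - pts[0], axis=1)
    return (d.max() - d.min()) / d.min()

# Contrast shrinks as the number of features grows
contrast = {dim: relative_contrast(dim) for dim in (2, 36, 360)}
```

With 360 features the farthest point is only slightly farther away than the nearest one, so Euclidean distance separates observations poorly.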

Page 9: Svd filtered temporal usage clustering

Our Approach

• Apply SVD to the original data to obtain the eigenvectors and singular values
• Remove components associated with the first singular value (Low Pass Filtering)
• Apply SVD again to the SVD-filtered matrix
• Calculate the Pearson correlation of each observation with the right eigenvectors obtained in the previous step
• Apply the k-means clustering algorithm to this matrix of correlation elements
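The five steps above can be sketched end to end in NumPy. This is an illustrative reading of the slide, not the author's SAS implementation. The function names, the toy data, the choice of three eigenvectors, and the plain Lloyd's k-means with farthest-point initialization (a stand-in for PROC FASTCLUS) are all assumptions.

```python
import numpy as np

def svd_filter(X, drop=1):
    """Zero out the first `drop` singular components of X and rebuild it."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_filtered = s.copy()
    s_filtered[:drop] = 0.0
    return (U * s_filtered) @ Vt

def correlation_features(X, n_vectors):
    """Pearson correlation of each row of X with the leading right eigenvectors.

    Assumes no row of X is constant, otherwise the correlation is undefined.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Xc = X - X.mean(axis=1, keepdims=True)
    Vc = Vt[:n_vectors] - Vt[:n_vectors].mean(axis=1, keepdims=True)
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)
    Vc /= np.linalg.norm(Vc, axis=1, keepdims=True)
    return Xc @ Vc.T

def kmeans(F, k, n_iter=20):
    """Plain Lloyd's algorithm with deterministic farthest-point starting centers."""
    centers = [F[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(F - c, axis=1) for c in centers], axis=0)
        centers.append(F[nearest.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(F[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = F[labels == j].mean(axis=0)
    return labels, centers

# Toy data: 60 customers, 360 days, a shared level plus one of two patterns
rng = np.random.default_rng(3)
t = np.arange(360)
X = 1.0 + 0.1 * rng.normal(size=(60, 360))
X[:30] += np.sin(2 * np.pi * t / 360)
X[30:] += np.cos(2 * np.pi * t / 360)

filtered = svd_filter(X, drop=1)                        # steps 1 and 2
features = correlation_features(filtered, n_vectors=3)  # steps 3 and 4
labels, _ = kmeans(features, k=2)                       # step 5
```

In this toy case the first singular component captures the level shared by all customers, so removing it leaves the pattern that distinguishes the two groups.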

Page 10: Svd filtered temporal usage clustering

Some Notes

• For a data matrix containing 360 days’ profiles, we only need to use a few of the correlation elements. We keep correlation elements until 85% of the variation in the data is accounted for
• To determine the optimal number of clusters, we applied the Bayesian Information Criterion (BIC). This measure is very robust and simple to calculate:
  • BIC = Distortion + (Num of Var) * log(Num of Obs) * K
  • Distortion = sum of total variance of each cluster = sum of Distance from PROC FASTCLUS output
• With hourly data, we separate the analysis into two steps:
  • Daily level
  • Hourly level for a ‘typical day’ in a month
  • Apply the SVD Filtered Clustering algorithm in each step
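Both selection rules are short enough to sketch in code. The BIC function below follows the formula as written on the slide. Measuring "variation" by squared singular values is an assumption, and the distortion values are made-up numbers used only to show how K would be chosen.

```python
import numpy as np

def n_components_for(singular_values, target=0.85):
    """Smallest number of leading components whose cumulative share of
    squared singular values reaches `target`."""
    share = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(share) / share.sum()
    return int(np.searchsorted(cumulative, target) + 1)

def bic(distortion, n_vars, n_obs, k):
    """BIC as on the slide: Distortion + (Num of Var) * log(Num of Obs) * K."""
    return distortion + n_vars * np.log(n_obs) * k

# Shares of squared singular values are 0.60, 0.27, 0.07, 0.07
n_keep = n_components_for([3.0, 2.0, 1.0, 1.0])

# Hypothetical distortions for K = 2..5, one clustering run per K
distortions = {2: 900.0, 3: 520.0, 4: 505.0, 5: 490.0}
scores = {k: bic(d, n_vars=3, n_obs=1000, k=k) for k, d in distortions.items()}
best_k = min(scores, key=scores.get)
```

The penalty term grows linearly in K, so a larger K is preferred only when it lowers the distortion by more than (Num of Var) * log(Num of Obs).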

Page 11: Svd filtered temporal usage clustering

Simulated Data

• We simulate data using the Heterogeneous Mixed Model of Verbeke
  • High usage in Months B-D and Month H
• Some outliers were deliberately generated by adding abnormal ad-hoc error terms

Page 12: Svd filtered temporal usage clustering

Clustering Result on Filtered Data

Page 13: Svd filtered temporal usage clustering

THANK YOU

• You can reach me at:
  • [email protected]
  • www.linkedin.com/liangxie
• My blog:
  • http://sas-programming.blogspot.com