SVD Filtered Temporal Usage Pattern Analysis & Clustering
Liang Xie
SCSUG Educational Forum 2009
San Antonio, TX
Business Objective
• Provide a robust algorithm to cluster customers based on their temporal transactional data
• Issues:
  • Data
    • High dimensionality: 360 features, multi-million records
    • Capture amplitude at different resolutions
    • High volatility due to noise
    • Possible outliers
  • Algorithm
    • Robustness
    • Efficiency
    • Easy to implement in SAS!
• We chose an SVD-based algorithm
  • Successful application to gene-expression analysis by Alter et al. (PNAS, 2000)
SVD as a Filter
• SVD definition:
  • Singular Value Decomposition is a mathematical tool to decompose a rectangular matrix: $X = U \Sigma V'$
  • The left eigenvector matrix $U$ can be regarded as an input rotation matrix, $\Sigma$ is the scaling matrix, and the right eigenvector matrix $V$ is the output matrix
  • SVD is similar to Fourier analysis
• Filter:
  • Each row of X is a linear combination of right eigenvectors
  • Each column of X is a linear combination of left eigenvectors
Relationship Between PCA and SVD
• SAS/STAT doesn’t explicitly support SVD
• We can make SAS/STAT perform SVD by linking one computation method of SVD to PCA
  • SVD and PCA are essentially the same: SVD on the covariance matrix of the original data X is equivalent to PCA of X
  • PCA on the non-centered covariance matrix of X is equivalent to SVD of X, with proper scaling: $\mathrm{SVD}(X'X) = V S V'$
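The link between the two decompositions can be spelled out in a few lines. This is standard linear algebra, written out here rather than taken from the slides; $\lambda_i$ and $\sigma_i$ are notation introduced for this sketch (the i-th eigenvalue of $X'X/n$ and the i-th singular value of $X$), and $n$ is the number of rows of $X$.

```latex
\begin{aligned}
X &= U \Sigma V', \qquad U'U = I \\
X'X &= V \Sigma' U' U \Sigma V' = V (\Sigma' \Sigma) V' = V S V', \qquad S = \Sigma' \Sigma \\
\tfrac{1}{n} X'X &= V \left( \tfrac{1}{n} S \right) V' \\
\lambda_i &= \tfrac{1}{n} \sigma_i^2 \quad \Longrightarrow \quad \sigma_i = \sqrt{n \, \lambda_i}
\end{aligned}
```

So the eigenvectors of the non-centered covariance matrix are the right eigenvectors $V$ of $X$, and the singular values follow from the eigenvalues after undoing the $1/n$ factor.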
SVD in SAS/STAT
• We call PROC PRINCOMP to conduct SVD in SAS/STAT
• The uncorrected covariance matrix in PROC PRINCOMP is X'X/n, not X'X, therefore the singular value matrix should be scaled by $\sqrt{n}$
• PROC PRINCOMP NOINT COV SING=
  • ‘COV’ computes the principal components from the covariance matrix
  • ‘NOINT’ omits the intercept from the model
  • ‘SING=’ specifies the singularity criterion to ensure accuracy
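The scaling between eigenvalues of X'X/n and singular values of X can be checked numerically on a tiny matrix. The sketch below is plain Python rather than SAS, is restricted to two-column matrices so the eigenvalues have a closed form, and the function name is hypothetical. It only illustrates the arithmetic and is not a reproduction of what PROC PRINCOMP does internally.

```python
import math


def singular_values_via_uncorrected_cov(X):
    """Singular values of an n x 2 matrix X from the eigenvalues of X'X/n."""
    n = len(X)
    # Entries of the symmetric 2 x 2 matrix C = X'X / n (no centering).
    a = sum(r[0] * r[0] for r in X) / n
    b = sum(r[0] * r[1] for r in X) / n
    d = sum(r[1] * r[1] for r in X) / n
    # Closed-form eigenvalues of [[a, b], [b, d]].
    mean = (a + d) / 2.0
    gap = math.hypot((a - d) / 2.0, b)
    eigenvalues = (mean + gap, max(mean - gap, 0.0))
    # Undo the 1/n factor: sigma_i = sqrt(n * lambda_i).
    return tuple(math.sqrt(n * lam) for lam in eigenvalues)


X = [[3.0, 0.0], [4.0, 5.0]]
s1, s2 = singular_values_via_uncorrected_cov(X)
print(s1, s2)  # close to sqrt(45) and sqrt(5), the singular values of X
```

Without the factor n inside the square root, the results would be too small by a factor of sqrt(n).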
Performance
• Accuracy
  • Test the code on a Hilbert matrix
  • Specifying ‘SING=1e-16’, our result is comparable to those obtained from R and MATLAB
• Efficiency
  • Test the code on an arbitrary rectangular matrix with 1.7 million rows and 400 columns
  • On a Core2Duo 1.86 GHz PC, it takes SAS 7 min 56 sec to finish all data processing and computations; user CPU time is 5 min 52 sec
  • Note that the 32-bit Windows version of R is not able to handle data this big:
    > X<-matrix(runif(1.7E6*400), ncol=400)
    Error in runif(1700000 * 400) : cannot allocate vector of length 680000000
  • A multi-threaded/parallel SVD algorithm from SAS is highly desired!
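The R allocation failure above follows from simple arithmetic. The sketch below assumes 8 bytes per value (R stores numeric vectors as double precision) and compares the requirement with the 4 GiB that a 32-bit address space can reach at most; the helper name is hypothetical.

```python
def matrix_bytes(rows, cols, bytes_per_value=8):
    """Bytes needed to hold a dense rows x cols matrix of doubles."""
    return rows * cols * bytes_per_value


elements = 1_700_000 * 400
needed = matrix_bytes(1_700_000, 400)
address_space_32bit = 2 ** 32  # upper bound for any 32-bit process

print(elements)                      # 680000000
print(round(needed / 2 ** 30, 2))    # 5.07 (GiB)
print(needed > address_space_32bit)  # True
```

The matrix alone needs more memory than a 32-bit process can address, before any working copies made during the decomposition.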
Temporal Usage Pattern Analysis
• Time series usage data from customers for one year at 60-minute intervals
• Hourly usage data is normalized to:
  • Year total
  • Monthly total
• We want to identify segments with distinct usage patterns over one year, so that the marketing department can design customized messages for them
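Normalizing to a total turns each customer's readings into shares, so that customers with the same usage shape but different volumes look alike. A minimal sketch in Python; the function names, the month labels, and the handling of zero totals are assumptions, not details from the slides.

```python
def normalize_to_total(usage):
    """Divide each reading by the profile total so the shares sum to 1."""
    total = sum(usage)
    if total == 0:
        # Assumption: a customer with no usage keeps an all-zero profile.
        return [0.0 for _ in usage]
    return [u / total for u in usage]


def normalize_by_month(usage, month_of_reading):
    """Divide each reading by the total of the month it belongs to."""
    totals = {}
    for u, m in zip(usage, month_of_reading):
        totals[m] = totals.get(m, 0.0) + u
    return [u / totals[m] if totals[m] else 0.0
            for u, m in zip(usage, month_of_reading)]


small = normalize_to_total([10.0, 30.0, 10.0])
large = normalize_to_total([100.0, 300.0, 100.0])
print(small == large)  # True: same shape, different volume

monthly = normalize_by_month([1.0, 3.0, 2.0, 2.0], ["Jan", "Jan", "Feb", "Feb"])
print(monthly)  # [0.25, 0.75, 0.5, 0.5]
```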
Traditional Approach
• Direct k-means clustering using PROC FASTCLUS on all features
• Problems:
  • Not robust: susceptible to outliers
  • Ambiguity in choosing the optimal number of clusters a priori
  • High dimensionality will affect the distance measure between each pair:
    • In high-dimensional spaces, distances between points become relatively uniform
  • Combining the lack of robustness with high dimensionality, we could get segments that are occupied by only a few observations, which is usually not desired
  • The k-means clustering algorithm doesn’t take the time series nature into consideration. All features are considered independent
Our Approach
• Apply SVD to the original data to obtain eigenvectors and singular values
• Remove the components associated with the first singular value (low-pass filtering)
• Apply SVD again to the SVD-filtered matrix
• Calculate the Pearson correlation of each observation with the right eigenvectors obtained in the previous step
• Apply the k-means clustering algorithm to this matrix of correlation elements
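The first four steps can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the SAS implementation from the talk: power iteration with deflation stands in for a full SVD, the correlations are computed on the filtered rows (the slides do not say whether the original or filtered observations are used), and all function names are hypothetical. The final k-means step is left out; the returned feature rows are what would be handed to a routine such as PROC FASTCLUS.

```python
import math


def top_right_singular_vector(X, iters=500):
    """Power iteration on X'X; returns a unit-length dominant right singular vector."""
    n, p = len(X), len(X[0])
    # Deterministic, non-uniform start. Power iteration fails if the start is
    # exactly orthogonal to the dominant vector; a random start is the usual remedy.
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iters):
        xv = [sum(x * c for x, c in zip(row, v)) for row in X]       # X v
        w = [sum(X[i][j] * xv[i] for i in range(n)) for j in range(p)]  # X'(X v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


def remove_component(X, v):
    """Subtract the rank-one part (X v) v' associated with right vector v."""
    xv = [sum(x * c for x, c in zip(row, v)) for row in X]
    return [[x - s * c for x, c in zip(row, v)] for row, s in zip(X, xv)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [x - ma for x in a], [y - mb for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / den if den else 0.0


def svd_filtered_correlation_features(X, n_vectors=2):
    v1 = top_right_singular_vector(X)            # step 1: dominant component
    filtered = remove_component(X, v1)           # step 2: filter it out
    basis, work = [], filtered
    for _ in range(n_vectors):                   # step 3: SVD of the filtered matrix
        v = top_right_singular_vector(work)
        basis.append(v)
        work = remove_component(work, v)
    # Step 4: one row of correlations per observation.
    return [[pearson(row, v) for v in basis] for row in filtered]


# Four made-up profiles: a shared baseline of 10 plus two distinct patterns.
X = [[12.0, 8.0, 12.0, 8.0], [8.0, 12.0, 8.0, 12.0],
     [11.0, 11.0, 9.0, 9.0], [9.0, 9.0, 11.0, 11.0]]
features = svd_filtered_correlation_features(X)
for row in features:
    print([round(c, 3) for c in row])
```

In this toy data the shared baseline dominates the first singular value, so removing it leaves only the two alternating patterns. Rows 1 and 2 then correlate with the first retained eigenvector with opposite signs, and rows 3 and 4 do the same with the second. The sign of each eigenvector is arbitrary, so only the relative signs are meaningful.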
Some Notes
• For a data matrix containing a 360-day profile, we only need to use a few of the correlation elements. We keep correlation elements until 85% of the variation in the data is accounted for
• To determine the optimal number of clusters, we applied the Bayesian Information Criterion (BIC). This measure is very robust and simple to calculate:
  • BIC = Distortion + (Num of Var) * log(Num of Obs) * K
  • Distortion = sum of total variance of each cluster = sum of Distance from PROC FASTCLUS output
• With hourly data, we separate the analysis into two steps:
  • Daily level
  • Hourly level for a ‘typical day’ in a month
  • Apply the SVD-filtered clustering algorithm in each step
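The two rules of thumb above, the 85% cutoff and the BIC formula, are easy to express in code. The sketch below makes several assumptions that the slides do not state: variation is measured by squared singular values sorted in descending order, the logarithm is natural, and the preferred K is the one with the smallest BIC. The distortion values in the example are made up for illustration.

```python
import math


def components_for_variation(singular_values, target=0.85):
    """Smallest number of leading components whose squared singular
    values account for at least `target` of the total."""
    squared = [s * s for s in singular_values]
    total = sum(squared)
    if total == 0:
        return 0
    running = 0.0
    for count, value in enumerate(squared, start=1):
        running += value
        if running / total >= target:
            return count
    return len(squared)


def bic(distortion, num_vars, num_obs, k):
    """BIC = Distortion + (Num of Var) * log(Num of Obs) * K."""
    return distortion + num_vars * math.log(num_obs) * k


def best_k(distortion_by_k, num_vars, num_obs):
    """Return the K with the smallest BIC, plus all scores."""
    scores = {k: bic(d, num_vars, num_obs, k) for k, d in distortion_by_k.items()}
    return min(scores, key=scores.get), scores


print(components_for_variation([4.0, 2.0, 1.0]))  # 2

# Hypothetical distortions from three clustering runs.
k, scores = best_k({2: 900.0, 3: 400.0, 4: 390.0}, num_vars=3, num_obs=1000)
print(k)  # 3
```

Going from three to four clusters lowers the distortion only slightly, so the penalty term outweighs the gain and K = 3 is selected.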
Simulated Data
• We simulate data using the heterogeneous mixed model of Verbeke
  • High usage in Months B-D and Month H
• Some outliers were deliberately generated by adding abnormal ad-hoc error terms
Clustering Result on Filtered Data
THANK YOU
• You can reach me at:
  • [email protected]
  • www.linkedin.com/liangxie
• My blog:
  • http://sas-programming.blogspot.com