
Sparse Principal Component Analysis with the R

package spca

Giovanni Maria Merola
RMIT International University Vietnam

[email protected]

Abstract

Sparse Principal Components Analysis (SPCA) aims to find principal components with few non-zero loadings. In the R package spca we implemented our Least Squares Sparse Principal Components Analysis method (LS SPCA, Merola 2014). The resulting sparse solutions are uncorrelated and minimise the Least Squares criterion subject to sparsity requirements. Both these features are enhancements over existing SPCA methods. Locally optimal solutions can be found by a branch-and-bound (BB) algorithm. Recognising that sparsity is not the only requirement for simplicity of interpretation, we implemented a backward elimination algorithm (BE) that computes sparse solutions with large loadings. This algorithm can be run without specifying the number of non-zero loadings in advance. It can also be required that a minimum amount of variance is explained by the components and that the components are combinations of only a subset of the variables. Hence, the BE algorithm can be used as an exploratory data analysis tool. The package also contains utilities for printing the solutions and their summary statistics, and for plotting and comparing the results. We give a thorough example of an application of SPCA to a small dataset.

Keywords: SPCA, Uncorrelated Components, branch-and-bound, Backward Elimination.

1. Introduction

Sparse Principal Component Analysis (SPCA) is a variant of Principal Component Analysis (PCA) in which the computed components are restricted to be combinations of only a few of the variables. In other words, the majority of the weights that define the combinations, called loadings, are zero. The number of non-zero loadings of a component is called its cardinality or L_0 norm. The advantage of sparse components is that they are more readily interpretable. For example, a component from a set of IQ test scores equal to (1/2 Inductive reasoning + 1/2 Deductive reasoning) can be easily interpreted as the Logic skills of a person.

Traditionally, loadings are simplified by thresholding, that is, by discarding those smaller than a threshold (hereafter, for simplicity, we speak of the size of the loadings meaning the absolute value of the non-zero ones). Since this practice is potentially misleading (Cadima and Jolliffe 1995), a number of algorithmic SPCA methods have been proposed.

SPCA methods determine the sparse components by maximising their variance subject to cardinality constraints (Moghaddam, Weiss, and Avidan 2006). This objective function is taken from a property of the PCs which is irrelevant for summarising the information contained in the data (ten Berge 1993). In some of the methods the objective function is derived from


a Lasso Regression type LS optimisation (e.g. Zou) but in others the derivations given are not convincing. Furthermore, most of the SPCA methods produce correlated components. Correlated components are more difficult to interpret, and the sum of the variances that they explain is not equal to the total variance explained.

In a recent paper (Merola 2014) we suggested improving over existing SPCA methods by deriving sparse components that are uncorrelated and minimise the LS criterion, like the PCs. We refer to this approach as LS SPCA. The solutions are obtained from constrained Reduced Rank Regression (Izenman 1975).

The problem common to all SPCA methods is finding the set of indices that optimise the objective function. This Mixed Integer Problem is non-convex and NP-hard, hence computationally intractable (Moghaddam et al. 2006). We do not explore this computational aspect and suggest a simple greedy branch-and-bound search, LS SPCA(BB), adapted from Farcomeni (2009), useful for finding solutions for medium size datasets.

Originally SPCA was designed to replace thresholding, that is, to compute components with only a few large loadings. However, this is not guaranteed by the cardinality constraints. For this reason we developed a backward elimination algorithm (BE) that iteratively eliminates small loadings from a solution until only loadings larger than a threshold are left. Considering that the size of the loadings is not the only feature relevant for interpretability, the elimination process can also be stopped when a minimal cardinality or a maximal variance loss is reached.

The BE algorithm represents a departure from the LS optimality path, but this is compensated in terms of users' needs satisfied. In fact, BE breaks away from the inflexibility of the LS criterion by finding sub-optimal solutions with defined characteristics, foremost the size of the loadings. In our studies we found that the BE solutions compare well with the BB ones in terms of variance explained. For our analyses we often use LS SPCA(BE) as an exploratory data analysis tool with which we compare solutions with different features.

In this paper we present the R (R Core Team 2013) package spca, which contains functions for computing LS SPCA solutions. In the package we implemented the BB and BE algorithms. BE is computationally much simpler and faster than BB. We solved problems with up to 700 variables in reasonable time, as we show in Merola (2014). This should be enough to satisfy a large community of researchers.

The spca package also contains utilities for plotting, printing, summarising and comparing different solutions. This paper is an illustration of its usage through examples. Computational details will be discussed sparingly.

The original paper proposing LS SPCA is still unpublished because reviewers seem to ignore that the original objective of PCA is the maximisation of the variance explained. Journal editors also seem to share the same lack of knowledge and to approve biased reviews, likely from authors of different SPCA methods. For example, Dr. Qiu, the chief editor of Technometrics, accepted a report which did not discuss the content at all but rejected the paper only because I would have ignored an article (by A. D'aspremont et al.) which, instead, was cited in the references and whose results were compared with mine in the paper. He also accepted another report from a referee (who could barely write in English) who called the LS criterion "a new measure used ad-hoc". Dr. Qiu rejected the paper on those grounds and added the reason that the algorithm is not scalable. Since computational efficiency is not within the journal's aims and scope, this also was unfair. The (kind) email of complaint I sent to Dr. Qiu went unanswered. This is just to testify how frustrating publishing can sometimes be, thanks to unfair resistance from other fellow academics.

The paper is organised as follows: in the next section we briefly outline the LS SPCA solutions; in Section 3 we outline the BB and BE algorithms; in Section 4 we illustrate the different options available in spca on a public dataset of baseball statistics, which is also included in the package; in Section 5 we give some concluding remarks.

2. Optimal solutions

Given a matrix of n observations on p mean-centred variables X (n × p), PCA intends to find the rank-d (d ≤ p) approximation of the data, X_[d], that minimises the LS criterion. It is easy to show that the approximation must be of the form X_[d] = XAP', where A (p × d) is the matrix of loadings and P (p × d) a matrix of coefficients. The matrix T = XA (n × d) defines the PCs.

The PCs can be constrained to be uncorrelated without loss of optimality because of the extra sum of squares principle. Hence, the loadings are the solutions to:

A = arg min_{rank(X_[d]) = d} ||X − X_[d]||^2 = arg max_{A ∈ R^(p×d)} ||X_[d]||^2 = arg max_{a_j ∈ R^p} Σ_{j=1}^d (a_j' S S a_j) / (a_j' S a_j)    (1)

subject to a_j' S a_k = 0, j ≠ k,

where S = n^{-1} X'X (p × p) denotes the sample covariance matrix and the summation in the last term derives from the uncorrelatedness of the components. The problem is completely identified by the loadings because P' = (T'T)^{-1} T'X and the rank-d approximation is equal to X_[d] = XA (A'X'XA)^{-1} A'X'X.

It follows that the total variance explained, ||X_[d]||^2, can be broken down into the sum of the variances explained by each component, Vexp(t_j) = (a_j' S S a_j) / (a_j' S a_j). Hence, the PCA problem can be equivalently solved by maximising the individual Vexp(t_j), that is as:

a_j = arg max_{a_j ∈ R^p} (a_j' S S a_j) / (a_j' S a_j),   j = 1, ..., d    (2)

subject to a_j' S a_k = 0, j ≠ k.

Note that this problem is invariant to the scale of the loadings.

2.1. PCA solutions

It is well known that the optimal PCA loadings are proportional to the eigenvectors of S, {v_j, j = 1, ..., d}, corresponding to the d largest eigenvalues taken in non-increasing order, {λ_1 ≥ λ_2 ≥ ... ≥ λ_d}. Therefore the variance explained simplifies to:

Vexp*(t_j) = max Vexp(t_j) = (v_j' S S v_j) / (v_j' S v_j) = (v_j' S v_j) / (v_j' v_j) = λ_j.

It follows that the variance explained by each PC is equal to the corresponding eigenvalue and that the loadings can be taken to be orthonormal, so that the variance explained is also


equal to the variance of the PCs. Consequently, PCA has been popularised as the solution to this simpler problem:

a_j = arg max_{a_j ∈ R^p} a_j' S a_j,   j = 1, ..., d    (3)

subject to a_j' a_k = δ_jk,

where δ_jk is the Kronecker delta. ten Berge (1993, page 87) warns about this formulation of the PCA problem by stating: "Nevertheless, it is undesirable to maximize the variance of the components rather than the variance explained by the components, because only the latter is relevant for the purpose of finding components that summarize the information contained in the variables."

PCA can be computed in R with different functions. We implemented the function pca, which returns an spca object using a simple eigen-decomposition:

pca(S, nd, only.values = FALSE, screeplot = FALSE, kaiser.print = FALSE)

This function returns the first nd eigenvectors of the correlation matrix S. Setting the flag only.values = TRUE returns only the eigenvalues (useful, and much faster, if only the variance explained is needed). Setting screeplot = TRUE produces a screeplot of the eigenvalues, and if kaiser.print = TRUE the number of eigenvalues larger than one is printed and returned. This is the Kaiser rule for deciding how many PCs to include in the model.
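As an illustration of what pca computes, here is a minimal base-R sketch of the same eigen-decomposition, including the Kaiser count. The function name pca_sketch, its return structure and the choice of example data are ours, not the package's:

```r
# Minimal eigen-decomposition PCA, mimicking the computation behind pca().
# pca_sketch is an illustrative name, not the package function.
pca_sketch <- function(S, nd = ncol(S)) {
  e <- eigen(S, symmetric = TRUE)     # eigenvalues in non-increasing order
  list(loadings = e$vectors[, 1:nd, drop = FALSE],
       vexp     = e$values[1:nd],     # variance explained by each PC
       kaiser   = sum(e$values > 1))  # Kaiser rule count
}

S   <- cor(attitude)                  # a 7-variable example correlation matrix
fit <- pca_sketch(S, nd = 2)

# Check that Vexp(t_1) = (v_1' S S v_1)/(v_1' S v_1) equals lambda_1:
v1    <- fit$loadings[, 1]
vexp1 <- drop(v1 %*% S %*% S %*% v1) / drop(v1 %*% S %*% v1)
all.equal(vexp1, fit$vexp[1])         # TRUE
```

The last check is the identity Vexp*(t_j) = λ_j from above, verified numerically.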

2.2. Least Squares Sparse PCA solutions

The LS SPCA problem is obtained by constraining the cardinality of the loadings in Problem (1), which gives:

A = arg min_{A ∈ R^(p×d)} ||X − XAP'||^2 = arg max_{A ∈ R^(p×d)} Σ_{j=1}^d (a_j' S S a_j) / (a_j' S a_j)    (4)

subject to L_0(a_j) ≤ c_j and a_j' S a_k = 0, j ≠ k,

where c_j < p are the maximal cardinalities allowed.

Under cardinality constraints Vexp(t_j) no longer simplifies to (a_j' S a_j)/(a_j' a_j) and the solutions must be obtained by maximising the individual Vexp in Equation (2) directly. Details of the derivation of these solutions can be found in Merola (2014).

The sparse components are combinations of the variables with indices in ind_j, {j = 1, ..., d}, which we denote with the matrices W_j (n × c_j). So, we can write the sparse components as t_j = W_j ã_j, where the ã_j are the vectors of dimension c_j containing only the non-zero loadings. Then Problem (4) becomes:

ã_j = arg min_{ã_j ∈ R^(c_j)} ||X − W_j ã_j p_j'||,   j = 1, ..., d    (5)

subject to T_(j)' W_j ã_j = 0, j > 1,

where T_(j) is the matrix containing the first (j − 1) components (T_(1) = 0). Let J_j (p × c_j) be the matrices formed by the columns of the p-dimensional identity matrix with indices in ind_j,


then we can write W_j = XJ_j and the full sparse loadings as a_j = J_j ã_j. With this notation the SPCA Problem (5) can be written as

arg min_{ã_j ∈ R^(c_j)} ||X − W_j ã_j p_j'|| = arg max_{ã_j ∈ R^(c_j)} (ã_j' J_j' S S J_j ã_j) / (ã_j' J_j' S J_j ã_j) = arg max_{a_j = J_j ã_j} (a_j' S S a_j) / (a_j' S a_j),   j = 1, ..., d    (6)

subject to R_j ã_j = 0, for j > 1,

where R_j = A_(j)' S J_j defines the uncorrelatedness constraints, with A_(j) being the first (j − 1) loadings. Hence the variance explained by the sparse PCs is the same Vexp defined in Equation (2).

Problem (6) can be seen as a series of constrained rank-one Reduced-Rank Regression problems (Izenman 1975) where the regressors are the columns of the W_j matrices. It is well known that the first solution is the eigenvector ã_1 satisfying:

(W_1' W_1)^{-1} W_1' X X' W_1 ã_1 = φ_max ã_1,    (7)

where φ_max is the largest eigenvalue. This solution is unique as long as the variables in W_1 are not multicollinear. Hereafter we exclude the possibility that a matrix W_j is not of full column rank, because that set of variables should be discarded and a full rank one sought. The sparse loadings can also be computed if only the covariance matrix S is known. In fact, Equation (7) can be written as:

D_1^{-1} J_1' S S J_1 ã_1 = φ_max ã_1,    (8)

where D_j = W_j' W_j = J_j' S J_j denotes the covariance matrices of the variables with indices in ind_j, which are invertible under the full rank assumption.
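Equation (8) only requires S and the chosen index set, so a first sparse component can be sketched in a few lines of base R. This is an illustrative sketch of the algebra, not the package's implementation; the function name spca_first and the example index set are ours:

```r
# First LS SPCA component from Eq. (8): D1^{-1} J1' S S J1 a1 = phi_max a1.
# Illustrative sketch, not the package's internal code.
spca_first <- function(S, ind) {
  J1 <- diag(ncol(S))[, ind, drop = FALSE]    # selection matrix J1
  D1 <- t(J1) %*% S %*% J1                    # covariance of the selected variables
  M  <- solve(D1) %*% t(J1) %*% S %*% S %*% J1
  a  <- Re(eigen(M)$vectors[, 1])             # leading eigenvector
  a  <- a / sqrt(sum(a^2))                    # unit L2 norm (the problem is scale invariant)
  af <- drop(J1 %*% a)                        # full p-vector of sparse loadings
  list(loadings = af,
       vexp = drop(af %*% S %*% S %*% af) / drop(af %*% S %*% af))
}

S  <- cor(attitude)
sp <- spca_first(S, ind = c(1, 2, 6))         # a cardinality-3 first component
sp$vexp                                       # variance explained by the sparse PC

# Sanity check: with all variables selected the solution is the first PC,
# so the variance explained equals lambda_1:
all.equal(spca_first(S, 1:ncol(S))$vexp, eigen(S)$values[1])   # TRUE
```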

The following solutions can be found by applying the uncorrelatedness constraints to the RRR Problems (6), as in constrained multiple regression (e.g., see Rao and Toutenburg 1999, or Magnus and Neudecker 1999, Th. 13.5, for a more rigorous proof). The solutions are the eigenvectors satisfying:

C_j D_j^{-1} J_j' S S J_j ã_j = φ_max ã_j,    (9)

where C_j = I_(c_j) − D_j^{-1} R_j' (R_j D_j^{-1} R_j')^+ R_j, with C_1 = I_(c_1), and the superscript "+" denotes a generalized inverse. The solutions exist because R_j spans the space of W_j. In this derivation we assume that R_j' R_j is singular, otherwise R_j ã_j = 0 could never be satisfied. This means that uncorrelatedness can only be achieved if the cardinalities satisfy c_j ≥ j. The LS SPCA solutions can be computed from the leftmost eigenvector, b_j, of the symmetric matrices (C_j D_j^{-1})^(1/2) J_j' S S J_j (C_j D_j^{-1})^(1/2), as ã_j = (C_j D_j^{-1})^(1/2) b_j.

The above derivation shows that the sparse components that explain the most variance are not eigenvectors of submatrices of the covariance matrix, and that their variance is no longer equal to the variance that they explain.
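To make the constrained solution concrete, the following base-R sketch computes a second component uncorrelated with a given first one and checks that a_1' S a_2 = 0. All function names (ginv_, first_cmp, spca_next) and the index sets are illustrative, not part of the package:

```r
ginv_ <- function(M, tol = 1e-10) {           # Moore-Penrose inverse via SVD
  s   <- svd(M)
  pos <- s$d > tol * max(s$d)
  s$v[, pos, drop = FALSE] %*% diag(1 / s$d[pos], sum(pos)) %*%
    t(s$u[, pos, drop = FALSE])
}

first_cmp <- function(S, ind) {               # first component, as in Eq. (8)
  J <- diag(ncol(S))[, ind, drop = FALSE]
  M <- solve(t(J) %*% S %*% J) %*% t(J) %*% S %*% S %*% J
  a <- Re(eigen(M)$vectors[, 1])
  drop(J %*% (a / sqrt(sum(a^2))))
}

spca_next <- function(S, ind, Aprev) {        # next component, as in Eq. (9)
  J    <- diag(ncol(S))[, ind, drop = FALSE]
  D    <- t(J) %*% S %*% J
  Rj   <- t(Aprev) %*% S %*% J                # uncorrelatedness constraints R_j
  Dinv <- solve(D)
  Cj   <- diag(length(ind)) -
    Dinv %*% t(Rj) %*% ginv_(Rj %*% Dinv %*% t(Rj)) %*% Rj
  ev   <- eigen(Cj %*% Dinv %*% t(J) %*% S %*% S %*% J)
  a    <- Re(ev$vectors[, which.max(Re(ev$values))])
  drop(J %*% (a / sqrt(sum(a^2))))
}

S  <- cor(attitude)
a1 <- first_cmp(S, c(1, 2, 3))
a2 <- spca_next(S, c(4, 5, 6), Aprev = cbind(a1))
drop(a1 %*% S %*% a2)                         # ~ 0: t1 and t2 are uncorrelated
```

Since R_j C_j = 0, any eigenvector of C_j D_j^{-1} J_j' S S J_j with a non-zero eigenvalue automatically satisfies the constraint R_j ã_j = 0, which is what the final check confirms numerically.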

As mentioned above, the uncorrelatedness constraints require that the cardinality of a component is not less than its order. In some situations it may be desirable to compute correlated components with lower cardinality. In this case the objective is the maximisation of the extra variance explained by the j-th component, that is, the variance explained by its complement orthogonal to the previous components, defined by t⊥_j = Q_j t_j = X_j b_j, where Q_j = I − T_(j) (T_(j)' T_(j))^{-1} T_(j)', X_j = Q_j X are the residuals of X orthogonal to the first (j − 1) components, and b_j are the loadings relative to the residual matrix. On substituting


t⊥_j into Problem (6), it is simple to show that the extra variance explained by this component is:

(b_j' S_j S_j b_j) / (b_j' S_j b_j),    (10)

where S_j = X_j' X_j is the covariance matrix of the residuals. However, in this formulation of the problem it is not possible to impose cardinality constraints on the loadings of the original variables, a_j. One possible way around this problem is to require that the correlated components explain as much as possible of the variance of the residuals X_j. That is, the loadings can be obtained by solving the problem:

ã_j = arg min_{ã_j ∈ R^(c_j)} ||X_j − W_j ã_j p_j'||,   j = 1, ..., d

While correctly used in iterative PCA algorithms such as the Power Method and NIPALS (Wold 1966), without uncorrelatedness constraints this approach does not maximise the extra variance explained in Equation (10), but the approximation

(a_j' S_j S_j a_j) / (a_j' S a_j).    (11)

The loadings of the correlated components are given by the eigenvectors satisfying:

D_j^{-1} J_j' S_j S_j J_j ã_j = φ_max ã_j.    (12)

When needed, we will refer to these solutions as Least Squares Correlated Sparse Principal Components Analysis (LS CSPCA).
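Since S_j is proportional to S − S A_(j) (A_(j)' S A_(j))^{-1} A_(j)' S, the residual covariance, and hence Equation (12), can also be computed from S and the previous loadings alone. A base-R sketch (the function name cspca_next and the closing check are ours, not the package's code):

```r
# LS CSPCA sketch, Eq. (12): D_j^{-1} J_j' S_j S_j J_j a_j = phi_max a_j.
# Illustrative only, not the package's implementation.
cspca_next <- function(S, ind, Aprev) {
  # covariance of the residuals of X orthogonal to the previous components:
  Sj <- S - S %*% Aprev %*% solve(t(Aprev) %*% S %*% Aprev) %*% t(Aprev) %*% S
  J  <- diag(ncol(S))[, ind, drop = FALSE]
  M  <- solve(t(J) %*% S %*% J) %*% t(J) %*% Sj %*% Sj %*% J
  a  <- Re(eigen(M)$vectors[, 1])
  drop(J %*% (a / sqrt(sum(a^2))))
}

S <- cor(attitude)
e <- eigen(S, symmetric = TRUE)

# Sanity check: with no sparsity and the first PC as previous component,
# Eq. (12) recovers the second PC.
a <- cspca_next(S, 1:ncol(S), Aprev = e$vectors[, 1, drop = FALSE])
abs(drop(a %*% e$vectors[, 2]))   # ~ 1
```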

The LS SPCA solutions can be computed with the function:

spca(S, ind, unc = TRUE)

This function takes as arguments a covariance or correlation matrix S, a list of indices ind of the sparse loadings, and the flag unc indicating whether each component should be uncorrelated with the preceding ones or not. The number of components to compute is determined from the length of ind. The function returns an spca object with the loadings, the variance explained and other items. See the documentation for details.

3. Finding a solution

Finding the optimal set of indices for the SPCA problem is a non-convex mixed integer program, which is NP-hard, hence computationally intractable (Moghaddam et al. 2006). We consider two greedy approaches for its solution: a branch-and-bound (BB) algorithm, analogous to that proposed in Farcomeni (2009) for the maximisation of the variance, and Backward Elimination (BE), which departs from the optimisation of the variance explained in order to achieve large loadings.

In the following we illustrate the basic functioning of these algorithms and the different options available. In our implementation we allow some arguments to take fewer values than necessary by assigning the last value passed to the missing ones. So, for example, passing


thresh = c(0.2, 0.1) will assign the value 0.2 to the first component and 0.1 to all others.Further details can be found in the package documentation.

3.1. Branch-and-bound

Branch-and-bound is a well-known method for speeding up the exhaustive search of all possible solutions to a problem by excluding subsets of variables that will certainly give worse results than the current one. In our case, sets of variables that explain less variance than the current optimum are eliminated altogether from the search. The convergence to the optimum is certain because eliminating a variable from a set of regressors cannot yield a larger regression sum of squares. Details on the implementation can be found in Farcomeni (2009), who uses a different objective function. The solutions are found sequentially for each component; therefore the solution may not be a global optimum for a given number of components.
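For intuition, the exhaustive search that BB prunes can be written in a few lines of base R for the first component. This is an illustrative brute-force baseline (the function name best_first is ours), feasible only for small p and c, which is precisely why pruning is needed:

```r
# Brute-force search for the cardinality-c first component maximising Vexp.
# This enumerates all index sets; branch-and-bound prunes this search tree.
best_first <- function(S, c) {
  combs <- combn(ncol(S), c)                  # all candidate index sets
  best  <- list(vexp = -Inf)
  for (k in seq_len(ncol(combs))) {
    ind <- combs[, k]
    J <- diag(ncol(S))[, ind, drop = FALSE]
    M <- solve(t(J) %*% S %*% J) %*% t(J) %*% S %*% S %*% J
    a <- drop(J %*% Re(eigen(M)$vectors[, 1]))
    vexp <- drop(a %*% S %*% S %*% a) / drop(a %*% S %*% a)
    if (vexp > best$vexp) best <- list(ind = ind, vexp = vexp)
  }
  best
}

S <- cor(attitude)
best_first(S, 2)   # the best pair of variables for the first sparse component
```

Adding a variable can never decrease the best attainable Vexp, which is the monotonicity that makes the bound valid.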

In our implementation of the BB algorithm we added the possibility of constraining the variables that can enter the solutions and of excluding variables that have been used for previous solutions. The BB algorithm can be called with the function spcabb:

spcabb(S, card, startind, unc = TRUE, excludeload = FALSE, nvexp = FALSE)

where S is a correlation matrix and card a vector with the cardinalities of the solutions. The number of components to compute is determined by the length of card. A list of starting indices for each component can be passed as a list argument [ startind = list() ]. The variables used for previous components can be excluded from the following ones [ excludeload = TRUE ]. The objective function can be set to be the actual variance explained, Equation (10) [ nvexp = TRUE ], or the approximated one, Equation (11). The usage of the actual vexp is not advised.

3.2. Backward Elimination algorithm

BE iteratively eliminates the smallest loading from a solution and recomputes the component without that variable until a stop rule is met. We call this procedure trimming. We implemented BE in the package spca, adding options that accommodate different aspects of interpretability. BE is called with the function spcabe:

spcabe(S, nd = FALSE, thresh = FALSE, ndbyvexp = FALSE, perc = TRUE,
       unc = TRUE, startind = NULL, mincard = NULL, excludeload = FALSE,
       threshvar = FALSE, threshvaronPC = FALSE, trim = 1, reducetrim = TRUE,
       diag = FALSE, eps = 1e-04)

The function takes a covariance or correlation matrix [ S ] as first argument. The number of components to compute [ nd ] can be either specified directly or decided by a stop rule defined by the percentage of cumulative variance explained (PCVE) reached [ ndbyvexp = real ∈ [0, 1] ]. If no stopping rule is specified, all components are computed.

There are three stop rules for trimming applicable to each component:

cardinality The minimal cardinality of the loadings [ mincard ];

Page 8: Spca_Vignettes R Package_Test Version

8 Sparse Principal Component Analysis with the R package spca

loss of variance explained The maximum acceptable loss of variance explained. This can be computed with respect to the loss of cumulated variance explained either from that explained by the same number of PCs [ threshvaronPC ] or from that explained by the component before trimming [ threshvar ] (both arguments must be real in [0, 1]);

threshold This is the minimal absolute value required for a loading [ thresh ]. The threshold can be specified either with respect to the loadings scaled to unit L2 norm or with respect to the percentage contributions (scaled to unit L1 norm, [ perc = TRUE ]).

The stop rules for trimming are given in order of precedence and can have a different value for each component. The stop rules are all optional; if none is given, the minimal cardinality is set equal to the rank of the component if the components have to be uncorrelated, and equal to 1 otherwise.

In problems with a large number of variables the computation can be sped up by trimming more than one loading at a time [ trim = integer > 1 ]. When the number of loadings left is less than the number of loadings to trim, trimming stops. However, more accurate solutions can be obtained by finishing off the elimination by trimming the remaining loadings one by one [ reducetrim = TRUE ].

The algorithm by default computes components uncorrelated with the previous ones. However, one or more components can be computed without this requirement, as given in Equation (12) [ unc = FALSE ].

The components can be constrained to be combinations of only a subset of the variables with two options:

starting indices A list containing the indices from which trimming must start for each component [ startind = list() ];

exclude indices used With this flag the next components are trimmed starting from only the indices of the variables that were not included in previous loadings [ excludeload = TRUE ].

The standard output is an spca object containing the loadings, the variance explained and a few other items. A richer output containing diagnostic information can be obtained by setting [ diag = TRUE ]. The value under which a loading is considered to be zero can be changed from the default 10^-4 with the argument [ eps = real > 0 ].

The BE algorithm is outlined in Algorithm 1. Not all options are shown.


Algorithm 1 LS SPCA(BE)

initialize
    Stopping rules for the number of components:
        nd {the number of components to compute}
        ndbyvexp {optional, minimum cumulated variance explained}
    Select which variables can enter the solutions:
        startind_j {optional, the starting indices for trimming}
    Stopping rules for elimination (can be different for each component):
        startind_j {starting set of indices}
        thresh_j {minimum absolute value of the sparse loadings}
        mincard_j ≥ 1 {minimum cardinality of the sparse loadings}
        threshvar_j {optional, maximum relative loss of variance explained}
end initialize
for j = 1 to nd do
    Compute a_j as the j-th LS SPCA solution for startind_j
    Vexpfull_j = Vexp(a_j)
    while min_{i ∈ startind_j} |a_ij| < thresh_j and length(startind_j) > mincard_j do
        indold_j = startind_j, aold_j = a_j
        find k such that |a_kj| ≤ |a_ij|, i ∈ startind_j
        startind_j = startind_j \ k
        Compute a_j as the j-th LS SPCA solution for startind_j
        if 1 − Vexp(a_j)/Vexpfull_j > threshvar_j then
            startind_j = indold_j, a_j = aold_j
            break
        end if
    end while
    if Σ_{i=1}^{j} Vexp(a_i) ≥ ndbyvexp then
        nd = j
        break
    end if
end for

There is no obvious rule for choosing the thresholds, thresh. However, if the loadings are computed to have unit L2 norm, specifying a threshold larger than 1/√c will ensure a cardinality lower than c. For percentage contributions standardised to unit L1 norm, specifying a threshold larger than 1/c will ensure a cardinality lower than c. For this reason, the choice of the minimum cardinality and of the threshold must be considered together, and later components require a lower threshold than the first ones. Note that trimming is designed for components computed from correlation matrices. If a covariance matrix is used, different thresholds for each variable should be used.
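The trimming loop itself is short. A base-R sketch for the first component, with loadings scaled to unit L2 norm (be_first is an illustrative name, not the package's spcabe, and the threshold 0.4 > 1/√7 is an arbitrary choice):

```r
# Backward elimination for the first component: repeatedly drop the smallest
# loading and recompute, until all loadings reach the threshold or mincard is hit.
be_first <- function(S, thresh, mincard = 1) {
  ind <- seq_len(ncol(S))
  repeat {
    J <- diag(ncol(S))[, ind, drop = FALSE]
    M <- solve(t(J) %*% S %*% J) %*% t(J) %*% S %*% S %*% J
    a <- Re(eigen(M)$vectors[, 1])
    a <- a / sqrt(sum(a^2))                   # unit L2 norm
    if (min(abs(a)) >= thresh || length(ind) <= mincard) break
    ind <- ind[-which.min(abs(a))]            # trim the smallest loading
  }
  list(ind = ind, loadings = a)
}

S  <- cor(attitude)
be <- be_first(S, thresh = 0.4)
be$ind   # surviving indices; a threshold of 0.4 forces cardinality below 1/0.4^2 = 6.25
```

This illustrates the 1/√c rule above: with 7 unit-norm loadings the smallest is at most 1/√7 ≈ 0.378 < 0.4, so at least one variable is always trimmed.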

4. Methods for spca objects

The spca package contains different S3 methods for printing, plotting and comparing sparse solutions. print and plot use generic functions, so they take an spca object as first argument and can be called without the ".spca" suffix.

Page 10: Spca_Vignettes R Package_Test Version

10 Sparse Principal Component Analysis with the R package spca

print

print.spca prints a formatted matrix of loadings and is called as:

print(smpc, cols, digits = 3, rows, noprint = 0.001, rtn = FALSE, perc = TRUE, namcomp = NULL)

cols is the number of loadings to print, the number of digits shown is set by the argument digits (this is set to 1 if perc = TRUE), noprint sets the threshold below which a loading is considered zero and is not printed, rtn controls whether the matrix is returned, perc whether the loadings should be printed as percentage contributions, and namcomp is an optional vector of names to show in the header.

showload

Especially when the number of variables is large, loadings can be more readily read using the showload function, which prints the nonzero loadings one by one.

showload(smpc, cols, digits = 3, onlynonzero = TRUE, noprint = 0.001, perc = TRUE, rtn = FALSE)

The function takes an spca object as first argument, the number of solutions to print [ cols ], the number of digits to print [ digits ], a flag indicating whether only the nonzero loadings should be printed [ onlynonzero ], the value below which a loading is considered zero [ noprint ], the flag [ perc ], indicating whether the percentage contributions should be printed, and the flag [ rtn ], indicating whether the output should be returned.

summary

The method summary.spca prints the summary statistics for an spca solution and is called as

summary(smpc, cols, rtn = FALSE, prn = TRUE, threshcard = 0.001, perc = TRUE)

The method takes an spca object as first argument, the number of solutions to print [ cols ], a flag indicating whether to return the matrix of formatted summaries [ rtn ], a flag indicating whether to print the summaries [ prn ], the value below which a loading is considered zero and is not counted in the cardinality [ threshcard ], and a flag indicating whether to return the percentage contributions [ perc ].

plot

The method plot.spca plots the percentage cumulative variance explained by the sparse components together with that explained by the corresponding number of PCs. It can also plot the loadings. It is called as:

Page 11: Spca_Vignettes R Package_Test Version

Giovanni Maria Merola 11

plot(smpc, cols, plotvexp = TRUE, plotload = FALSE, perc = TRUE, bnw = FALSE,
     nam, varnam = FALSE, onlynonzero = TRUE, plotloadvsPC = FALSE, pcs = NULL,
     addlabels = TRUE, mfrowload = 1, mfcolload = 1, loadhoriz = FALSE)

This method takes an spca object as first argument. cols denotes the number of dimensions to plot, plotvexp and plotload control whether the PCVE and the loadings are plotted, and perc whether the loadings should be plotted as percentage contributions. bnw controls whether the PCVE plot is in colour or black and white. nam assigns a name to the results and varnam assigns names to the variables (useful when the names are long). onlynonzero specifies whether only the non-zero loadings must be included in the plot. mfrowload and mfcolload set the layout of the plot of the loadings by arranging them in a grid. loadhoriz specifies whether the loadings should be plotted as vertical or horizontal bars. Setting plotloadvsPC = TRUE, the sparse loadings are plotted against the corresponding PCA loadings; a line showing equality with the PCA loadings is also shown. The argument addlabels determines whether the nonzero loadings are to be labelled (if not FALSE) and whether short labels (addlabels = TRUE) or the original ones (addlabels = "orig") should be used. This plot inherits the arguments set for the others.

spca.comp

The function spca.comp can be used to compare two or more spca solutions. It produces comparative plots of the PCVE and of the loadings. It also prints the summaries side by side and, optionally, the loadings. The function is not implemented as a method, to make it easier to use, and it is called as

spca.comp(smpc, nd, nam = FALSE, perc = TRUE, plotvar = TRUE, plotload = FALSE,
          prnload = TRUE, shortnamcomp = TRUE, rtn = FALSE, prn = TRUE, bnw = FALSE)

It takes as first argument a list of spca objects, then the number of dimensions to compare [ nd ], a vector of names for the different solutions [ nam ], and a flag indicating whether the loadings should be printed as percentage contributions [ perc ]. The flag plotvar indicates whether the PCVE should be plotted with the corresponding PCA one, and the flag plotload whether the loadings should be plotted together; prnload indicates whether the loadings of the different solutions should be printed side by side. The flag shortnamcomp indicates whether the components should be printed using short names (Cx.y), useful when there are several components to print; the flag rtn indicates whether a text matrix with the formatted summaries should be returned, prn indicates whether the summaries should be printed, and bnw whether the plots should be in black and white.

The spca.comp function produces different plots, which makes the customisation of each of them quite cumbersome. Since the primary objective of this package is the computation of the sparse solutions and the plots are simple to produce, we did not include the possibility of adding extra graphical parameters to this function.

Page 12: Spca_Vignettes R Package_Test Version

12 Sparse Principal Component Analysis with the R package spca

5. Sparse Principal Components Analysis Example

In this section we obtain different sparse solutions by changing the settings of the BB and BE algorithms. For this we use the dataset baseball (available at Statlib, http://lib.stat.cmu.edu/datasets/baseball.data), which contains observations on 16 performance statistics of 263 US Major League baseball hitters, taken on their career and the 1986 season. We chose this dataset because it is small but gives varied results. The demonstration is not meant to be a proper analysis of the data (in any case, we are not baseball experts) but a simple example in which we illustrate some of the spca package features.

The results shown are obtained with spca version 1.6.6, which includes this dataset. I first developed this package for my own needs, then thought that I might as well share it with others. Therefore, it is not intended to be applied to large problems, where computational needs are more important than optimality. In Merola (2014) we applied the BE algorithm to problems with more than 700 variables.

First we load the spca package and extract the baseball data. The names of the 16 variables in the set are also shown.

library("spca")

data(baseball, package = "spca")

colnames(baseball)

[1] "times at bat 1986" "hits 1986"

[3] "home runs 1986" "runs 1986"

[5] "runs batted 1986" "walks 1986"

[7] "years career" "times at bat career"

[9] "hits career" "home runs career"

[11] "runs career" "runs batted career"

[13] "walks career" "put outs 1986"

[15] "assists 1986" "errors 1986"

PCA

The full PCA solutions are computed with the function pca, which produces PCA output of class spca. We request the screeplot and the Kaiser rule (Kaiser 1960), useful for determining the number of components to keep in the model. We also show the first five percentage contributions by calling the S3 spca method print. This method is automatically applied to spca objects with the following options:

b.pca = pca(baseball, nd = 5, screeplot = TRUE, kaiser.print = TRUE)


Figure 1: Screeplot of the baseball data

[1] "number of eigenvalues larger than 1 is 3"

# PCA with screeplot and Kaiser rule

print(b.pca, cols = 1:5)

Percentage Contributions

Comp1 Comp2 Comp3 Comp4 Comp5

times at bat 1986 5.5 10.4 2.5 3.1 3.5

hits 1986 5.4 10.2 1.8 3.8 4.5

home runs 1986 5.6 6.3 -12.3 9.1 -16.7

runs 1986 5.4 10.1 -2.3 6.9 7.2

runs batted 1986 6.5 8.5 -6.1 6.2 -10.3

walks 1986 5.8 6.3 -2.4 -3.3 19.5

years career 7.9 -6.9 3.5 -0.7 -0.7

times at bat career 9.3 -5 4.6 -1.2 0.2

hits career 9.3 -4.7 4.5 -1.6 0.5

home runs career 8.9 -3.3 -4.3 2.8 -7.6

runs career 9.5 -4.4 3.2 -0.2 2.1

runs batted career 9.5 -4.3 0.6 -0.5 -4.1

walks career 8.9 -5 2 -2.5 6.1

put outs 1986 2.2 4.3 -5.5 -51.6 -3.1


assists 1986 4.7 23.7 0.6 -0.9

errors 1986 -0.2 5.6 20.7 -5.9 -13

----- ----- ----- ----- -----

PCVE 45.3 71 81.8 87.2 91.6

The screeplot and Kaiser rule indicate that three components are sufficient. These explain about 82% of the total variance.
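The Kaiser rule itself is easy to compute by hand: it retains as many components as there are eigenvalues of the correlation matrix larger than 1. A minimal base-R sketch of this computation (illustrated on the built-in USArrests data rather than the baseball set):

```r
# Kaiser rule: count the eigenvalues of the correlation matrix
# that are larger than 1
kaiser_rule <- function(X) {
  ev <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  sum(ev > 1)
}

# example on a built-in dataset
kaiser_rule(USArrests)
```

Since the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 means the component explains more variance than an "average" variable, which is the rationale behind the rule.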

Next we plot the first four [cols = 4, plotload = TRUE] PCA percentage contributions. The cumulative variance explained is not plotted [plotvexp = FALSE] (because it would plot the PCA variance twice). We require that zero contributions are also plotted, for ease of comparison [onlynonzero = FALSE]. The plots will be in a 2×2 grid [mfrowload = 2, mfcolload = 2] with the variables labelled as "V1, V2, ..." [varnam = FALSE] (because the names are quite long). These and other parameters can be set as explained in the help documentation.

plot(b.pca, cols = 4, plotvexp = FALSE, plotload = TRUE, addlabel = TRUE,

onlynonzero = FALSE, mfrow = 2, mfcol = 2)

Figure 2: Percentage contributions of the first four PCs

The PCs' loadings show that the first component is the sum of career and season offensive play, with higher weight on the career statistics. The second component is the difference of the previous season's results from the career ones. Season offensive play has larger weight in this component, hence it should characterise young players who had a good 1986 season. Defensive play has large weight in the third and fourth components.

5.1. Choice of cardinality

b.cc.be = choosecard(baseball, nd = 3, method = "BE")

Figure 3: Cardinality plots for component 1: minimum contribution vs cardinality (top-left), PVE vs cardinality (top-right), PVE vs minimum contribution (bottom-left) and entropy vs cardinality (bottom-right)

[1] "Comp 1 , reached min cardinality of 5 , smallest loading is 0.237"

[1] "Comp 2 , reached min cardinality of 2 , smallest loading is 0.493"

Figure 4: Cardinality plots for component 2, arranged as in Figure 3

[1] "Comp 1 , reached min cardinality of 5 , smallest loading is 0.237"

[1] "Comp 2 , reached min cardinality of 5 , smallest loading is 0.322"

[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.024"

Figure 5: Cardinality plots for component 3, arranged as in Figure 3

# choosecard(S, nd, method = c('BE', 'BB', 'PCA'), perc = TRUE,
#   unc = TRUE, trim = 1, reducetrim = TRUE, prntrace = FALSE,
#   rtntrace = TRUE, plotminload = TRUE, plotcvexp = TRUE,
#   plotlovsvexp = TRUE, plotentropy = TRUE, plotfarcomeni = FALSE,
#   mfrowplot = 2, mfcolplot = 2, plotall = TRUE, ce = 1)

At the prompts "enter the preferred cardinality for component x (add a decimal to stop):" we entered the values 5, 5 and 4.1.

5.2. SPCA(BB)

As a first attempt to get some insight into the solutions, we run LS SPCA(BB) requiring four components of cardinality 5. The output gives the summary statistics, then prints the percentage contributions and then plots them together with the PCVE.

b.bb1 = spcabb(baseball, card = rep(5, 4))

done comp 1

done comp 2

done comp 3

done comp 4


# print the summaries

summary(b.bb1, cols = 4)

Comp1 Comp2 Comp3 Comp4

PVE 45.1% 25.3% 10.7% 5.6%

PCVE 45.1% 70.4% 81.1% 86.7%

PRCVE 99.5% 99.1% 99.1% 99.3%

Card 5 5 5 5

Ccard 5 10 15 20

PVE/Card 9% 5.1% 2.1% 1.1%

PCVE/Ccard 9% 7% 5.4% 4.3%

MinCont 8.7% 9.4% 7.8% 4.7%

# print the contributions

b.bb1

Percentage Contributions

Comp1 Comp2 Comp3 Comp4

times at bat 1986 25.9

hits 1986 12

home runs 1986 23.8 15.9

runs 1986 12.2 17.3

runs batted 1986 13.7 13.2

walks 1986 8.7

years career

times at bat career 49.1 -34.2 -15.9

hits career

home runs career 16.4

runs career

runs batted career

walks career -4.7

put outs 1986 7.8 -59.6

assists 1986 -28

errors 1986 9.4 -24.5 -7.7

----- ----- ----- -----

PCVE 45.1 70.4 81.1 86.7

# plot PCVE and contributions

plot(b.bb1, plotload = TRUE, nam = "BB 1", varnam = TRUE, mfrowload = 2,

mfcolload = 2)

Figure 6: Percentage contributions of the first four LS SPCA(BB) solutions

Figure 7: Cumulative variance explained by LS SPCA(BB) compared with PCA

A different view of the BB contributions can be obtained by plotting them versus those of the PCs, using plotloadvsPC = TRUE in the plot method, as shown below in Figure 8.

plot(smpc = b.bb1, cols = 1:4, plotvexp = FALSE, addlabels = TRUE,

varnam = TRUE, plotloadvsPC = TRUE, pcs = b.pca, mfrow = 2,

mfcol = 2, onlynonzero = FALSE)

Figure 8: Sparse loadings plotted against the corresponding PCA ones

5.3. SPCA(BE)

BE solutions with a PRCVE of at least 99% [threshvaronPC = 0.01] are shown below.

b.be1 = spcabe(baseball, nd = 4, threshvaronPC = 0.01)

[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.181"

summary(b.be1)

Comp1 Comp2 Comp3 Comp4

PVE 45% 25.4% 10.6% 5.7%

PCVE 45% 70.4% 81% 86.6%

PRCVE 99.3% 99.1% 99% 99.3%

Card 5 7 4 4

Ccard 5 12 16 20

PVE/Card 9% 3.6% 2.7% 1.4%

PCVE/Ccard 9% 5.9% 5.1% 4.3%

Converged 2 2 2 1

MinCont 11.1% 8.4% 17.4% 11.3%

b.be1


Percentage Contributions

Comp1 Comp2 Comp3 Comp4

times at bat 1986 21.3 21.2 16.3

hits 1986

home runs 1986 11.1 29.1 14.8

runs 1986 19.4

runs batted 1986 16.2

walks 1986 -11.3

years career -9.3

times at bat career

hits career -17.4

home runs career

runs career 21.3 -15.6

runs batted career 29.3 -10

walks career 17

put outs 1986 -57.5

assists 1986 -29.3

errors 1986 8.4 -24.2

----- ----- ----- -----

PCVE 45 70.4 81 86.6

In some cases it can be more convenient to see just the nonzero loadings and to plot them. This is shown below.

# print the nonzero contributions separately

showload(b.be1)

[1] "Component 1"

times at bat 1986 home runs 1986 runs career

21.3% 11.1% 21.3%

runs batted career walks career

29.3% 17%

[1]

[1] "Component 2"

times at bat 1986 runs 1986 runs batted 1986

21.2% 19.4% 16.2%

years career runs career runs batted career

-9.3% -15.6% -10%

errors 1986

8.4%

[1]

[1] "Component 3"

home runs 1986 hits career assists 1986 errors 1986

29.1% -17.4% -29.3% -24.2%


[1]

[1] "Component 4"

times at bat 1986 home runs 1986 walks 1986

16.3% 14.8% -11.3%

put outs 1986

-57.5%

[1]

# plot the contributions

plot(b.be1, plotload = TRUE, mfrowload = 2, mfcol = 2, varnam = TRUE,

addlabel = TRUE, nam = "BEB1")

Figure 9: Percentage contributions of the first four BE solutions

Figure 10: Cumulative variance explained by BE compared with PCA

The BE output can be compared with the BB one using spca.comp. This function prints the loadings side by side and produces the comparative plots of the variance explained and of the contributions (setting plotload = TRUE for each dimension). In the plots of the loadings the variable names are replaced by their order.

spca.comp(list(b.bb1, b.be1), nd = 4, nam = c("BB 5", "BE 99"),

plotvar = TRUE, plotload = TRUE)

[1] "Loadings"

C1.1 C1.2 C2.1 C2.2 C3.1

times at bat 1986 0.213 0.259 0.212

hits 1986

home runs 1986 0.111 0.238

runs 1986 0.122 0.173 0.194

runs batted 1986 0.137 0.132 0.162

walks 1986 0.087

years career -0.093

times at bat career 0.491 -0.342 -0.159

hits career

home runs career 0.164

runs career 0.213 -0.156

runs batted career 0.293 -0.1

walks career 0.17

put outs 1986 0.078

assists 1986 -0.28

errors 1986 0.094 0.084 -0.245

----- ----- ----- ----- -----

PCVE 45.1 45 70.4 70.4 81.1


C3.2 C4.1 C4.2

times at bat 1986 0.163

hits 1986 0.12

home runs 1986 0.291 0.159 0.148

runs 1986

runs batted 1986

walks 1986 -0.113

years career

times at bat career

hits career -0.174

home runs career

runs career

runs batted career

walks career -0.047

put outs 1986 -0.596 -0.575

assists 1986 -0.293

errors 1986 -0.242 -0.077

----- ----- -----

PCVE 81 86.7 86.6

[1]

[1] Summary statistics

C1.1 C1.2 C2.1 C2.2 C3.1 C3.2 C4.1 C4.2

PVE 45.1 45.0 25.3 25.4 10.7 10.6 5.6 5.7

PCVE 45.1 45.0 70.4 70.4 81.1 81.0 86.7 86.6

PRCVE 99.5 99.3 99.1 99.1 99.1 99.0 99.3 99.3

Card 5 5 5 7 5 4 5 4

Ccard 5 5 10 12 15 16 20 20

PVE/Card 9.0 9.0 5.1 3.6 2.1 2.7 1.1 1.4

PCVE/Ccard 9.0 9.0 7.0 5.9 5.4 5.1 4.3 4.3

Min %Cont 8.7% 11.1% 9.4% 8.4% 7.8% 17.4% 4.7% 11.3%

Figure 11: Comparison of the BB and BE solutions: loadings of each component and cumulative variance explained, together with PCA


The output shows that the PCVE are almost the same as the PCA ones, but that the contributions are different for the two solutions. For the second component the smallest contribution produced by BE is smaller than the corresponding BB one.
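Assuming (as the summary values suggest) that PRCVE is simply the percentage ratio of the sparse cumulative variance explained to the PCA one, the figures in the summaries can be checked by hand; small discrepancies with the printed values are due to rounding of the displayed PCVE. A base-R sketch:

```r
# PRCVE as the percentage ratio of the sparse PCVE to the PCA PCVE;
# the PCVE values are taken from the printed summaries above
pcve_pca <- c(45.3, 71.0, 81.8, 87.2)  # PCA, first four components
pcve_bb  <- c(45.1, 70.4, 81.1, 86.7)  # LS SPCA(BB), cardinality 5
prcve <- 100 * pcve_bb / pcve_pca
round(prcve, 1)
```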

5.4. More BE solutions using different options

Since we ran BE without specifying a threshold for the loadings, some of the contributions are less than 10%. The next example shows the output of BE run requiring that the contributions are not less than 20% and that enough components are computed to explain at least 80% of the total variance [ndbyvexp = 0.80].

b.be2 = spcabe(baseball, thresh = 0.2, ndbyvexp = 0.8)

[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.062"

[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.128"

[1] "reached PCVE 82.2% with 4 components"

summary(b.be2)

Comp1 Comp2 Comp3 Comp4

PVE 44.5% 24.6% 3.6% 9.6%

PCVE 44.5% 69.1% 72.6% 82.2%

PRCVE 98.2% 97.3% 88.8% 94.3%

Card 3 3 3 4

Ccard 3 6 9 13

PVE/Card 14.8% 8.2% 1.2% 2.4%

PCVE/Ccard 14.8% 11.5% 8.1% 6.3%

Converged 0 0 1 1

MinCont 24.8% 27.5% 4.2% 8.1%

print(b.be2, namcomp = NULL)

Percentage Contributions

Comp1 Comp2 Comp3 Comp4

times at bat 1986 28 37.3

hits 1986

home runs 1986 4.2 19.1

runs 1986 27.5

runs batted 1986 8.1

walks 1986

years career

times at bat career -14.6

hits career

home runs career

runs career -35.3

runs batted career 47.2


walks career 24.8

put outs 1986

assists 1986 -53

errors 1986 42.8 -58.1

----- ----- ----- -----

PCVE 44.5 69.1 72.6 82.2

The algorithm stopped after four components, but the last two did not converge because the minimum cardinality (required for uncorrelatedness) was reached. Note the warnings printed and the value "1" corresponding to "Converged" in the summary table. In fact, the minimum contributions are smaller than 10%.

In some cases it may be necessary to require a cardinality smaller than the order of the component. This can be achieved by removing the uncorrelatedness restrictions on the last two components. The output for this setup is shown next.

b.be3 = spcabe(baseball, thresh = 0.2, ndbyvexp = 0.8, unc = c(TRUE,

TRUE, FALSE, FALSE))

[1] "less than 16 unc flags given, last 12 set to FALSE"

[1] "reached PCVE 85.5% with 4 components"

summary(b.be3)

Comp1 Comp2 Comp3 Comp4

PVE 44.5% 24.6% 10.8% 5.6%

PCVE 44.5% 69.1% 79.9% 85.5%

PRCVE 98.2% 97.3% 97.7% 98.1%

Card 3 3 3 2

Ccard 3 6 9 11

PVE/Card 14.8% 8.2% 3.6% 2.8%

PCVE/Ccard 14.8% 11.5% 8.9% 7.8%

Converged 0 0 0 0

MinCont 24.8% 27.5% 28.2% 23.6%

print(b.be3, namcomp = paste("Index", 1:4))

Percentage Contributions

Index 1 Index 2 Index 3 Index 4

times at bat 1986 28 37.3 23.6

hits 1986

home runs 1986 28.2

runs 1986 27.5

runs batted 1986

walks 1986


years career

times at bat career

hits career

home runs career

runs career -35.3

runs batted career 47.2

walks career 24.8

put outs 1986 -76.4

assists 1986 -41.2

errors 1986 -30.6

----- ----- ----- -----

PCVE 44.5 69.1 79.9 85.5

names(b.be3)

[1] "loadings" "contributions" "vexp"

[4] "vexpPC" "cardinality" "ind"

[7] "unc" "converged" "corComp"

[10] "Uncloadings"

round(b.be3$corComp, 2)

[,1] [,2] [,3] [,4]

[1,] 1.00 0.00 0.15 -0.02

[2,] 0.00 1.00 -0.12 -0.01

[3,] 0.15 -0.12 1.00 -0.13

[4,] -0.02 -0.01 -0.13 1.00

The last command shows the correlations between the components: it can be seen that the third one is correlated with all the others, while the fourth one is correlated only with the third.
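The corComp matrix can be reproduced from the loadings: if S is the variables' correlation matrix and A the matrix of loadings, the components' covariance matrix is A'SA, which cov2cor rescales to correlations. A minimal sketch on a hypothetical 3×3 correlation matrix (not the baseball data):

```r
# correlation matrix of the components defined by the loadings A
comp_cor <- function(S, A) cov2cor(t(A) %*% S %*% A)

# hypothetical correlation matrix and two sparse loading vectors
S <- matrix(c(1.0, 0.5, 0.2,
              0.5, 1.0, 0.3,
              0.2, 0.3, 1.0), nrow = 3)
A <- cbind(c(1, 1, 0) / sqrt(2),  # component on variables 1 and 2
           c(0, 0, 1))            # component on variable 3 only
round(comp_cor(S, A), 2)
```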

The components can be constrained to be combinations of only a subset of the variables. For example, it could be reasonable to separate career and season statistics, as also suggested by the previous solutions.

indoffs = 1:6

indoffc = 7:13

inddef = 14:16

inds = c(indoffs, inddef)

We could ask which single variables in each of these groups are most useful for explaining the variability. We can obtain an answer by requiring correlated "components" (these are really single variables) of cardinality one for each group. The output is shown below, with the components named after the selected groups of indices through the namcomp argument.


b.bbs2c = spcabb(baseball, card = rep(1, 3), unc = FALSE,

startind = list(indoffc, indoffs, inddef))

done comp 1

done comp 2

done comp 3

summary(b.bbs2c)

Comp1 Comp2 Comp3

PVE 41.5% 25.6% 9.1%

PCVE 41.5% 67% 76.1%

PRCVE 91.5% 94.4% 93%

Card 1 1 1

Ccard 1 2 3

PVE/Card 41.5% 25.6% 9.1%

PCVE/Ccard 41.5% 33.5% 25.4%

MinCont 100% 100% 100%

print(b.bbs2c, namcomp = c("off carr.", "off 1986", "defens. 1986"))

Percentage Contributions

off carr. off 1986 defens. 1986

times at bat 1986 100

hits 1986

home runs 1986

runs 1986

runs batted 1986

walks 1986

years career

times at bat career

hits career

home runs career

runs career

runs batted career 100

walks career

put outs 1986

assists 1986 100

errors 1986

----- ----- -----

PCVE 41.5 67 76.1

names(b.bbs2c)


[1] "loadings" "vexp" "vexpPC" "ind"

[5] "niter" "unc" "nvexp" "corComp"

[9] "Uncloadings"

round(b.bbs2c$corComp,2)

[,1] [,2] [,3]

[1,] 1.00 0.22 -0.10

[2,] 0.22 1.00 0.34

[3,] -0.10 0.34 1.00

When the number of variables is large, more than one loading can be trimmed at each iteration [trim = x, where x is the number of loadings to trim]. Since fewer loadings could be left than the number to trim, setting reducetrim = TRUE makes these last loadings trimmed one at a time. Below we compare the solutions obtained with trim = 3 and with trim = 1.

b.be3t = spcabe(baseball, thresh = 0.15, perc = TRUE, ndbyvexp = 0.8,

trim = 3, reducetrim = TRUE)

[1] "reduced trim to TRUE remaining 2 iterations. comp 2"

[1] "reduced trim to TRUE remaining 1 iterations. comp 3"

[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.123"

[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.203"

[1] "reached PCVE 81.7% with 4 components"

b.be3nt = spcabe(baseball, thresh = 0.15, perc = TRUE, ndbyvexp = 0.8,

trim = 1)

[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.062"

[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.128"

[1] "reached PCVE 82.2% with 4 components"

summary(b.be2, perc = TRUE)

Comp1 Comp2 Comp3 Comp4

PVE 44.5% 24.6% 3.6% 9.6%

PCVE 44.5% 69.1% 72.6% 82.2%

PRCVE 98.2% 97.3% 88.8% 94.3%

Card 3 3 3 4

Ccard 3 6 9 13

PVE/Card 14.8% 8.2% 1.2% 2.4%

PCVE/Ccard 14.8% 11.5% 8.1% 6.3%

Converged 0 0 1 1

MinCont 24.8% 27.5% 4.2% 8.1%

summary(b.be3t, perc = TRUE)


Comp1 Comp2 Comp3 Comp4

PVE 41.2% 26.9% 7.4% 6.2%

PCVE 41.2% 68.1% 75.5% 81.7%

PRCVE 91% 95.9% 92.3% 93.7%

Card 1 3 3 4

Ccard 1 4 7 11

PVE/Card 41.2% 9% 2.5% 1.5%

PCVE/Ccard 41.2% 17% 10.8% 7.4%

Converged 0 0 1 1

MinCont 100% 20.8% 8.1% 10.8%

spca.comp(list(b.be3nt, b.be3t), nam = c("trim 1", "trim 3"))


[1] "Loadings"

C1.1 C1.2 C2.1 C2.2 C3.1

times at bat 1986 0.28 0.373 0.497

hits 1986

home runs 1986 0.042

runs 1986 0.275

runs batted 1986 0.294

walks 1986

years career

times at bat career

hits career

home runs career

runs career 1 -0.353 -0.208

runs batted career 0.472

walks career 0.248

put outs 1986

assists 1986 -0.53

errors 1986 0.428

----- ----- ----- ----- -----

PCVE 44.5 41.2 69.1 68.1 72.6

C3.2 C4.1 C4.2

times at bat 1986

hits 1986

home runs 1986 0.191

runs 1986 0.108

runs batted 1986 0.081

walks 1986

years career

times at bat career -0.146

hits career 0.452 0.29

home runs career

runs career

runs batted career -0.467 -0.354

walks career

put outs 1986

assists 1986 0.081

errors 1986 -0.581 -0.248

----- ----- -----

PCVE 75.5 82.2 81.7

[1]

[1] Summary statistics

C1.1 C1.2 C2.1 C2.2 C3.1 C3.2 C4.1 C4.2

PVE 44.5 41.2 24.6 26.9 3.6 7.4 9.6 6.2

PCVE 44.5 41.2 69.1 68.1 72.6 75.5 82.2 81.7

PRCVE 98.2 91.0 97.3 95.9 88.8 92.3 94.3 93.7

Card 3 1 3 3 3 3 4 4


Ccard 3 1 6 4 9 7 13 11

PVE/Card 14.8 41.2 8.2 9.0 1.2 2.5 2.4 1.5

PCVE/Ccard 14.8 41.2 11.5 17.0 8.1 10.8 6.3 7.4

Converged 0 0 0 0 1 1 1 1

Min %Cont 24.8% 100% 27.5% 20.8% 4.2% 8.1% 8.1% 10.8%

In case diagnostic information about the solution and the elimination process is required, setting the argument diag = TRUE produces several additional pieces of information, as exemplified below. Details on the diagnostic output can be found in the help.

b.be1 = spcabe(baseball, nd = 4, threshvaronPC = 0.01, diag = TRUE)

[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.181"

# the output now contains:

names(b.be1)

[1] "loadings" "contributions" "vexp"

[4] "vexpPC" "cardinality" "ind"

[7] "unc" "converged" "vexpo"

[10] "totvcloss" "vlossbe" "niter"

[13] "eliminated" "thresh" "threshvar"

[16] "threshvaronPC" "ndbyvexp" "stopbyvar"

[19] "mincard" "tracing"

# for example, the sequences of variables eliminated are

# stored in eliminated

b.be1$elim

$`Comp 1`

assists 1986 errors 1986 put outs 1986

15 16 14

runs 1986 hits career runs batted 1986

4 9 5

home runs career times at bat career walks 1986

10 8 6

years career hits 1986 home runs 1986

7 2 3

$`Comp 2`

home runs career hits career times at bat career

10 9 8

put outs 1986 assists 1986 walks career

14 15 13

home runs 1986 walks 1986 hits 1986

3 6 2


errors 1986

16

$`Comp 3`

runs batted career hits 1986 runs career

12 2 11

runs 1986 times at bat 1986 walks career

4 1 13

walks 1986 years career home runs career

6 7 10

times at bat career put outs 1986 runs batted 1986

8 14 5

hits career

9

$`Comp 4`

walks career times at bat career runs career

13 8 11

hits 1986 years career errors 1986

2 7 16

assists 1986 runs batted career hits career

15 12 9

home runs career runs batted 1986 runs 1986

10 5 4

6. Concluding Remarks

The package spca was initially developed for my own needs. The literature on SPCA does not yet present many applications, and plots typical of PCA, such as loadings plots, are not very useful for sparse solutions. Therefore, I propose different ways of plotting and printing the solutions that I found useful. I hope that they will be useful for other researchers as well. For now the package only takes correlation matrices as input, so utilities for the analysis of the components' scores are not yet present.

The plotting utilities are designed to provide a first graphical description of the output and they are not the main feature of the package. Since it is extremely difficult to set the graphical parameters so as to produce perfect plots under all possible circumstances (also because these can be device dependent), I chose settings that should produce good plots in most situations. The plots are simple to reproduce and better ones can be easily obtained from the spca objects.

The functions spcabb and spcabe are written entirely in R and cannot deal with large problems. I am working on rewriting them in C.

References


Cadima J, Jolliffe IT (1995). “Loadings and correlations in the interpretation of principal components.” Journal of Applied Statistics, 22(2), 203–214.

Farcomeni A (2009). “An exact approach to sparse principal component analysis.” Computational Statistics, 24(4), 583–604.

Izenman AJ (1975). “Reduced-rank regression for the multivariate linear model.” Journal of Multivariate Analysis, 5(2), 248–264.

Kaiser H (1960). “The application of electronic computers to factor analysis.” Educational and Psychological Measurement, 20, 141–151.

Magnus J, Neudecker H (1999). Matrix Differential Calculus With Applications in Statistics and Econometrics. John Wiley.

Merola G (2014). “Least Squares Sparse Principal Component Analysis: a Backward Elimination approach to attain large loadings.” Submitted for publication. Preprint available at http://arxiv.org/abs/1406.1381.

Moghaddam B, Weiss Y, Avidan S (2006). “Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms.” In Advances in Neural Information Processing Systems, pp. 915–922. MIT Press.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.

Rao C, Toutenburg H (1999). Linear Models: LS and Alternatives. Springer Series in Statistics, second edition. Springer Verlag.

ten Berge J (1993). Least Squares Optimization in Multivariate Analysis. DSWO Press, Leiden University.

Wold H (1966). “Estimation of principal components and related models by iterative least squares.” In P Krishnaiah (ed.), Multivariate Analysis, volume 59, pp. 391–420. Academic Press, NY.

Affiliation:

Giovanni Maria Merola
Department of Economics, Finance and Marketing
RMIT International University Vietnam
702 Nguyen Van Linh Bvrd
PMH, D 7, HCMC, Viet Nam
E-mail: [email protected]