Sparse Principal Component Analysis for multiblock data and its extension to Sparse Multiple Correspondence Analysis

Outline: Introduction · From SPCA to GSPCA · Sparse MCA · Application on genetic data · Conclusion

Sparse Principal Component Analysis for multiblock data and its extension to Sparse Multiple Correspondence Analysis

Anne Bernard 1,5, Hervé Abdi 2, Arthur Tenenhaus 3, Christiane Guinot 4, Gilbert Saporta 5

1 CE.R.I.E.S, Neuilly-sur-Seine, France
2 Department of Brain and Behavioral Sciences, The University of Texas at Dallas, Richardson, TX, USA
3 SUPELEC, Gif-sur-Yvette, France
4 Université François Rabelais, Department of Computer Science, Tours, France
5 Laboratoire Cedric, CNAM, Paris

June 13, 2013

Bernard, Abdi, Tenenhaus, Guinot, Saporta — Group Sparse PCA and Sparse MCA



Background

Variable selection and dimensionality reduction are necessary in many application domains, and more specifically in genetics.

Quantitative data analysis: PCA
Categorical data analysis: MCA

With high-dimensional data (n ≪ p), the principal components (PCs) are difficult to interpret.

Ways of obtaining simple structures:

1. Sparse methods such as sparse PCA for quantitative data: PCs are combinations of a few original variables → extension to a multiblock data structure (GSPCA)

2. Extension to categorical variables: development of a new sparse method (Sparse MCA)


Data and methods

Uniblock data structure

Principal Component Analysis (PCA): each PC is a linear combination of all the original variables, so all loadings are nonzero → PCA results are difficult to interpret.

Multiple Correspondence Analysis (MCA): the counterpart of PCA for categorical data.

Sparse PCA (SPCA): modified PCs with sparse loadings, obtained via a regularized SVD.


Principal Component Analysis (PCA)

X: matrix of quantitative variables. PCA can be computed via the SVD of X.

Singular Value Decomposition (SVD) of X:

X = U D V^T   with   U^T U = V^T V = I   (1)

where Z = UD gives the PCs and V the corresponding (nonzero) loadings → PCA results are difficult to interpret.

SVD as a low-rank approximation of matrices:

\min_{X^{(1)}} \|X - X^{(1)}\|_2^2 = \min_{u,v} \|X - uv^T\|_2^2   (2)

X^{(1)} = uv^T is the best rank-one approximation of X, with u = δ_1 u_1 the first component and v = v_1 the first loading vector.
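Equations (1) and (2) can be checked numerically. A minimal NumPy sketch (illustrative only, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 6))

# Full SVD: X = U D V^T with orthonormal U and V
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-one approximation X(1) = delta_1 * u1 v1^T (Eckart-Young)
X1 = d[0] * np.outer(U[:, 0], Vt[0, :])
err_svd = np.linalg.norm(X - X1)

# The squared residual equals the sum of the remaining squared singular values
assert np.isclose(err_svd**2, np.sum(d[1:] ** 2))

# The PCs are Z = U D; the first PC is also X @ v1
assert np.allclose(X @ Vt[0, :], d[0] * U[:, 0])
```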


Sparse Principal Component Analysis (SPCA)

Challenge: facilitate the interpretation of PCA results.

How? PCs with many zero loadings (sparse loadings).

Several approaches:

SCoTLASS, Jolliffe et al. (2003)
SPCA, Zou et al. (2006)
SPCA-rSVD, Shen and Huang (2008)

Principle: use the connection between PCA and the SVD → a low-rank matrix approximation problem with a regularization penalty on the loadings.

Regularization in regression

Lasso penalty (Tibshirani, 1996):

a variable selection technique
produces sparse models
imposes the L1 norm on the linear regression coefficients

Regression with L1 penalization:

\hat{\beta}^{lasso} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \sum_{j=1}^{J} |\beta_j|,   with   P_\lambda(\beta) = \lambda \sum_{j=1}^{J} |\beta_j|

If λ = 0: usual regression (nonzero coefficients).
As λ increases: some coefficients are set to zero ⇒ variable selection, but the number of selected variables is bounded by the number of units.
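The L1 penalty acts through the soft-thresholding operator, which sets small coefficients exactly to zero. A minimal sketch (the helper name soft_threshold is mine):

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso proximal operator: sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 1.2, -2.0])
# Entries with |z_j| <= lam are set exactly to zero; the rest shrink toward zero
print(soft_threshold(z, 1.0))
```

This is the same operator that reappears below as h_λ in the SPCA-rSVD update.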

Regularization in regression

Other kinds of penalties:

Hard thresholding penalty (Antoniadis, 1997):

P_\lambda(|\beta|) = \lambda^2 - (|\beta| - \lambda)^2 \, I(|\beta| < \lambda)

SCAD penalty (Smoothly Clipped Absolute Deviation penalty, Fan and Li, 2001), defined through its derivative:

P'_\lambda(\beta) = \lambda \Big( I(|\beta| \le \lambda) + \frac{(a\lambda - \beta)_+}{(a-1)\lambda} \, I(|\beta| > \lambda) \Big)   with a > 2, β > 0

Group Lasso penalty (Yuan and Lin, 2007):

P_\lambda(\beta) = \lambda \sum_{k=1}^{K} \sqrt{J_{[k]}} \, \|\beta_k\|_2   (3)

with J_{[k]} the number of variables in group k.

Sparse PCA via regularized SVD (SPCA-rSVD)

Low-rank matrix approximation problem + regularization penalty applied to v:

\min_{u,v} \|X - uv^T\|_2^2 + P_\lambda(v)   (4)

with, using u^T u = 1,

\|X - uv^T\|_2^2 + P_\lambda(v) = \mathrm{tr}\big((X - uv^T)^T (X - uv^T)\big) + P_\lambda(v)   (5)
 = \|X\|_2^2 + \mathrm{tr}(v^T v) - 2\,\mathrm{tr}(X^T u v^T) + P_\lambda(v)
 = \|X\|_2^2 + \sum_{j=1}^{J} \big(v_j^2 - 2 (X^T u)_j v_j\big) + \sum_{j=1}^{J} P_\lambda(v_j)
 = \|X\|_2^2 + \sum_{j=1}^{J} \big(v_j^2 - 2 (X^T u)_j v_j + P_\lambda(v_j)\big)


Sparse PCA via regularized SVD (SPCA-rSVD)

Iterative procedure:

1. v fixed: the minimizer of (4) is u = Xv / \|Xv\|.

2. u fixed: the minimizer of (4) is v = h_\lambda(X^T u), where each component solves

\min_{v_j} \big(v_j^2 - 2 (X^T u)_j v_j + P_\lambda(v_j)\big)   (6)

For the Lasso penalty: h_\lambda(X^T u) = \mathrm{sign}(X^T u)\,(|X^T u| - \lambda)_+
For the hard thresholding penalty: h_\lambda(X^T u) = I(|X^T u| > \lambda)\, X^T u

The minimum is obtained by applying h_\lambda componentwise to the vector X^T u → this yields v_1.

Subsequent sparse loadings v_i (i > 1) are obtained via rank-one approximations of the residual matrices.
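The two alternating updates above can be sketched in NumPy for the Lasso case of h_λ. This is an illustrative sketch, not the authors' implementation; the function names are mine:

```python
import numpy as np

def soft_threshold(z, lam):
    # Lasso-penalty solution of (6): sign(z) * (|z| - lam)_+
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def spca_rsvd_rank1(X, lam, n_iter=200):
    """Alternate u = Xv/||Xv|| and v = h_lambda(X^T u) for one component."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    u = U[:, 0]                           # initialize u from the plain SVD
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):
        v = soft_threshold(X.T @ u, lam)  # step 2: u fixed
        Xv = X @ v
        n = np.linalg.norm(Xv)
        if n == 0:                        # lam too large: all loadings zeroed
            break
        u = Xv / n                        # step 1: v fixed
    return u, v

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 8))
u, v = spca_rsvd_rank1(X, lam=2.0)
print(np.count_nonzero(v), "nonzero loadings out of", v.size)
```

With lam = 0 the loop recovers the ordinary first singular vectors; larger lam zeroes more loadings.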


Multiblock data structure

Group Sparse PCA (GSPCA):

Challenge: create sparseness in a multiblock data structure.

Principle: modified PCs with sparse loadings, using the group lasso penalty.

Result: all variables of a block are selected or removed together.

Group Sparse PCA

Group Sparse PCA: extension of SPCA-rSVD to blocks of quantitative variables.

Challenge: select groups of continuous variables → assign zero coefficients to entire blocks of variables.

How? A compromise between SPCA-rSVD and the group Lasso.

Group Sparse PCA

X: matrix of quantitative variables divided into K sub-matrices.

SVD of X:

X = [X_{[1]} | ... | X_{[k]} | ... | X_{[K]}] = U D V^T = U D [V_{[1]}^T | ... | V_{[k]}^T | ... | V_{[K]}^T]

where each row of V^T is a vector divided into K blocks.

Group Sparse PCA

Remember SPCA-rSVD:

\min_{u,v} \|X - uv^T\|_2^2 + P_\lambda(v)   (7)

u fixed: v = h_\lambda(X^T u) minimizes, for each j, \big(v_j^2 - 2 (X^T u)_j v_j + P_\lambda(v_j)\big).

In GSPCA, for each block k:

\min_{v_{[k]}} \big(\mathrm{tr}(v_{[k]} v_{[k]}^T) - 2\,\mathrm{tr}(v_{[k]} u^T X_{[k]}) + P_\lambda(v_{[k]})\big)

Here P_\lambda is the group lasso regularizer:

P_\lambda(v) = \lambda \sum_{k=1}^{K} \sqrt{J_{[k]}} \, \|v_{[k]}\|_2   (8)

The solution is the blockwise thresholding

h_\lambda(X_{[k]}^T u) = \Big(1 - \frac{\lambda \sqrt{J_{[k]}}}{2\,\|X_{[k]}^T u\|_2}\Big)_+ X_{[k]}^T u   (9)


Group Sparse PCA

Group Sparse PCA Algorithm

1. Apply the SVD to X and obtain the best rank-one approximation of X as δuv^T. Set v_old = v = [v_{[1]}, ..., v_{[K]}] and u_old = δu.

2. Update:
   a) v_new = [h_\lambda(X_{[1]}^T u_old), ..., h_\lambda(X_{[K]}^T u_old)]
   b) u_new = X v_new / \|X v_new\|

3. Repeat Step 2, replacing u_old and v_old by u_new and v_new, until convergence.

Sparse loadings v_i (i > 1) are obtained via rank-one approximations of the residual matrices.

→ Choice of λ by cross-validation or an ad-hoc approach.
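The algorithm above can be sketched with the blockwise operator (9); each block of loadings is kept or zeroed as a whole. An illustrative sketch with hypothetical helper names, not the authors' code:

```python
import numpy as np

def group_threshold(s, lam, J_k):
    # Blockwise operator (9): shrink the whole block s = X_[k]^T u toward zero
    nrm = np.linalg.norm(s)
    if nrm == 0:
        return s
    return max(1.0 - lam * np.sqrt(J_k) / (2.0 * nrm), 0.0) * s

def gspca_rank1(blocks, lam, n_iter=100):
    """One GSPCA component on a list of blocks X_[1], ..., X_[K]."""
    X = np.hstack(blocks)
    splits = np.cumsum([B.shape[1] for B in blocks])[:-1]
    u = np.linalg.svd(X, full_matrices=False)[0][:, 0]  # step 1: SVD init
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):                             # step 2-3: alternate
        v = np.concatenate([group_threshold(B.T @ u, lam, B.shape[1]) for B in blocks])
        Xv = X @ v
        n = np.linalg.norm(Xv)
        if n == 0:
            break
        u = Xv / n
    return u, np.split(v, splits)

rng = np.random.default_rng(2)
blocks = [rng.standard_normal((15, 3)), rng.standard_normal((15, 4)), rng.standard_normal((15, 2))]
u, v_blocks = gspca_rank1(blocks, lam=3.0)
# Each block is selected or removed as a whole
print([np.count_nonzero(vk) for vk in v_blocks])
```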


Multiblock data structure

Selecting one column of the original table (a categorical variable X) = selecting the corresponding block of indicator variables in the complete disjunctive table.

Sparse MCA (SMCA): extension of GSPCA to blocks of indicator variables.

MCA

Notation: r = F1, c = F^T 1, D_r = diag(r), D_c = diag(c), and p_{[j]} = number of categories of variable j.

Generalized SVD (GSVD) of F:

F = P \Delta Q^T   with   P^T M P = Q^T W Q = I

with F = [F_{[1]} | ... | F_{[j]} | ... | F_{[J]}] and Q = [Q_{[1]} | ... | Q_{[j]} | ... | Q_{[J]}].

In the case of PCA: M = W = I.
In the case of MCA: M = D_r, W = D_c.
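For diagonal metrics M and W, the GSVD can be obtained from an ordinary SVD of M^{1/2} F W^{1/2}. A sketch under that diagonal-metric assumption (the helper name gsvd is mine):

```python
import numpy as np

def gsvd(F, M, W):
    """GSVD of F under diagonal metrics M (rows) and W (columns)."""
    Mh, Wh = np.sqrt(M), np.sqrt(W)                      # diagonal square roots
    U, d, Vt = np.linalg.svd(Mh @ F @ Wh, full_matrices=False)
    P = np.diag(1.0 / np.diag(Mh)) @ U                   # P = M^{-1/2} U
    Q = np.diag(1.0 / np.diag(Wh)) @ Vt.T                # Q = W^{-1/2} V
    return P, d, Q

rng = np.random.default_rng(3)
F = rng.standard_normal((6, 4))
M = np.diag(rng.uniform(0.5, 2.0, 6))
W = np.diag(rng.uniform(0.5, 2.0, 4))
P, d, Q = gsvd(F, M, W)
assert np.allclose(P @ np.diag(d) @ Q.T, F)    # F = P Delta Q^T
assert np.allclose(P.T @ M @ P, np.eye(4))     # P^T M P = I
assert np.allclose(Q.T @ W @ Q, np.eye(4))     # Q^T W Q = I
```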

From MCA to SMCA

GSVD as low-rank matrix approximation:

\min_{F^{(1)}} \|F - F^{(1)}\|_W^2 = \min_{p,q} \|F - pq^T\|_W^2   (10)

where \|F\|_W^2 = \mathrm{tr}(M^{1/2} F W F^T M^{1/2}) is the generalized (W-)norm.

Under the metrics M and W:

\|F - pq^T\|_W^2 = \mathrm{tr}\big(M^{1/2} (F - pq^T) W (F - pq^T)^T M^{1/2}\big)   (11)

Sparse MCA problem:

\min_{p,q} \|F - pq^T\|_W^2 + P_\lambda(q)   subject to p^T M p = q^T W q = 1   (12)

\min_{p,q} \Big(\mathrm{tr}\big(M^{1/2} (F - pq^T) W (F - pq^T)^T M^{1/2}\big) + P_\lambda(q)\Big)   (13)


From MCA to SMCA

Remember the GSPCA solution:

\min_{v_{[k]}} \big(\mathrm{tr}(v_{[k]} v_{[k]}^T) - 2\,\mathrm{tr}(v_{[k]} u^T X_{[k]}) + P_\lambda(v_{[k]})\big)

v = h_\lambda(X_{[k]}^T u) = \Big(1 - \frac{\lambda \sqrt{J_{[k]}}}{2\,\|X_{[k]}^T u\|_2}\Big)_+ X_{[k]}^T u

Sparse MCA solution:

\min_{q_{[k]}} \big(\mathrm{tr}(q_{[k]} W_{[k]} q_{[k]}^T) - 2\,\mathrm{tr}(M F_{[k]} W_{[k]} q_{[k]} p^T) + P_\lambda(q_{[k]})\big)

q = h_\lambda(F_{[k]}^T M p) = \Big(1 - \frac{\lambda \sqrt{J_{[k]}}}{2\,\|F_{[k]}^T M p\|_W}\Big)_+ F_{[k]}^T M p

Page 43: Sparse Principal Component Analysis for multiblocks …maths.cnam.fr/IMG/pdf/presentation_seminaire_12_juin_A-_Bernard... · Bernard, Abdi, Tenenhaus, Guinot, Saporta Group Sparse

Introduction From SPCA to GSPCA Sparse MCA Application on genetic data Conclusion

From MCA to SMCA

Remember GSPCA solution

minv[k]

(tr(v[k]v

T[k])− 2 tr(v[k]u

TX[k]) + Pλ(v[k]))

v = hλ(XT[k]u) =

(1− λ

2

√J[k]

‖XT[k]u‖2

)+

XT[k]u

Sparse MCA solution

minq[k]

(tr(q[k]W[k]q

T[k])− 2 tr(MF[k]W[k]q[k]p

T ) + Pλ(q[k]))

q = hλ(FT[k]Mp) =

(1− λ

2

√J[k]

‖FT[k]Mp‖W

)+

FT[k]Mp

Bernard, Abdi, Tenenhaus, Guinot, Saporta Group Sparse PCA and ACM Sparse 19 / 33
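The thresholding operator $h_\lambda$ above scales a whole block of loadings toward zero and kills the block entirely when its norm falls below the group-size-adjusted threshold $(\lambda/2)\sqrt{J_{[k]}}$. A minimal numpy sketch of the Euclidean-norm (GSPCA) case:

```python
import numpy as np

def h_lambda(z, lam, J_k):
    """Group soft-thresholding: shrink the block z, or zero it entirely
    when ||z|| <= (lam/2) * sqrt(J_k)."""
    nrm = np.linalg.norm(z)
    if nrm == 0.0:
        return np.zeros_like(z)
    return max(0.0, 1.0 - (lam / 2.0) * np.sqrt(J_k) / nrm) * z

strong = np.array([3.0, -4.0])   # ||z|| = 5, well above the threshold
weak = np.array([0.3, -0.4])     # ||z|| = 0.5, below (lam/2)*sqrt(2) ~ 0.707
lam = 1.0
# strong block is shrunk but kept; weak block is set entirely to zero
```

This block-wise all-or-nothing behaviour is what makes the penalty select or discard entire groups (variables/SNPs) rather than individual modalities.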

Sparse MCA Algorithm

1. Apply the GSVD to F and obtain the best rank-one approximation of F as $\delta p q^T$. Set $p_{old} = \delta p$ and $q_{old} = q$.
2. Update:
   a) $q_{new} = [h_\lambda(F_{[1]}^T M p_{old}), \ldots, h_\lambda(F_{[J]}^T M p_{old})]$
   b) $p_{new} = F q_{new} / \|F q_{new}\|$
3. Repeat Step 2, replacing $p_{old}$ and $q_{old}$ by $p_{new}$ and $q_{new}$, until convergence.

Sparse loadings $q_i$ ($i > 1$) are obtained via rank-one approximation of the residual matrices.
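The three steps above can be sketched in numpy. This is a simplified illustration, not the authors' implementation: it assumes identity metrics ($M = W = I$), in which case the Sparse MCA update coincides with the GSPCA update, and it assumes `groups` lists contiguous, ordered column-index blocks covering all columns of F:

```python
import numpy as np

def group_soft_threshold(z, lam, group_size):
    """h_lambda: shrink the block z, zeroing it entirely when its norm
    is below the group-adjusted threshold (lam/2) * sqrt(group_size)."""
    nrm = np.linalg.norm(z)
    if nrm == 0.0:
        return np.zeros_like(z)
    return max(0.0, 1.0 - (lam / 2.0) * np.sqrt(group_size) / nrm) * z

def sparse_rank_one(F, groups, lam, n_iter=200, tol=1e-8):
    """One sparse loading vector via the alternating updates of the
    algorithm above (identity-metric sketch)."""
    # Step 1: best rank-one SVD approximation of F
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    p = U[:, 0]
    q = s[0] * Vt[0, :]
    for _ in range(n_iter):
        # Step 2a: group-wise soft-thresholding of F^T p, block by block
        q_new = np.concatenate(
            [group_soft_threshold(F[:, g].T @ p, lam, len(g)) for g in groups])
        # Step 2b: update p and renormalize
        p_new = F @ q_new
        nrm = np.linalg.norm(p_new)
        if nrm == 0.0:              # the penalty removed every block
            return p, q_new
        p_new /= nrm
        # Step 3: stop at convergence
        if np.linalg.norm(p_new - p) < tol:
            return p_new, q_new
        p, q = p_new, q_new
    return p, q
```

Subsequent sparse loadings would be obtained by applying the same routine to the residual matrix $F - p q^T$, as the slide indicates.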

Properties of the SMCA

Property                  MCA                              Sparse MCA
Barycentric properties    TRUE                             TRUE
Uncorrelated components   TRUE                             FALSE
Orthogonal loadings       TRUE                             FALSE
Explained variance        Inertia: (delta_j / total) x 100   Adjusted variance
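Because sparse components are no longer uncorrelated, plain variances would double-count shared information; the table therefore reports adjusted variance in the sense of Zou, Hastie and Tibshirani (2006), computed from a QR decomposition of the component scores. A minimal sketch (the function name and interface are illustrative, not from the slides):

```python
import numpy as np

def adjusted_variance(X, V):
    """Adjusted explained variance of (possibly sparse, correlated)
    loadings V: orthogonalize the scores X V with a QR decomposition so
    each component is only credited with variance not already explained
    by earlier components."""
    Q, R = np.linalg.qr(X @ V)                 # scores, then orthogonalize
    return np.diag(R) ** 2 / np.sum(X ** 2)   # per-component fraction of total variance
```

For ordinary (orthonormal, uncorrelated) principal components this reduces to the usual proportion of explained variance.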

Sparse MCA: example on breeds of dogs

Data
- I = 27 breeds of dogs
- J = 6 variables
- Q = 16 (total number of modalities)
- X: 27 x 6 matrix of categorical variables
- D: 27 x 16 complete disjunctive table -> D = (D_[1], ..., D_[6])

1 block = 1 descriptive variable = 1 sub-matrix D_[j]

From Tenenhaus, M. (2007)
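The complete disjunctive table D codes each categorical variable as a block of 0/1 indicator columns, one per modality. A small self-contained sketch (the values below are a hypothetical stand-in for the dogs data, not the actual dataset):

```python
import numpy as np

def disjunctive_table(X_columns):
    """Build the complete disjunctive (indicator) table: one 0/1 column
    per modality, grouped block-wise by variable."""
    blocks, names = [], []
    for var, values in X_columns.items():
        modalities = sorted(set(values))
        block = np.array([[v == m for m in modalities] for v in values], dtype=int)
        blocks.append(block)
        names += [f"{var}_{m}" for m in modalities]
    return np.hstack(blocks), names

# Toy stand-in: 4 "breeds", 2 categorical variables
X = {"size":   ["large", "small", "medium", "large"],
     "weight": ["heavy", "lightweight", "heavy", "very heavy"]}
D, cols = disjunctive_table(X)
# Each row contains exactly one 1 per block, so row sums equal the
# number of variables (here 2).
```

Each variable's block of columns is exactly one sub-matrix D_[j] in the decomposition above.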

Sparse MCA: example on breeds of dogs
Result of the MCA
[figure]

Sparse MCA: example on breeds of dogs
Result of the Sparse MCA

From λ = 0.25, the CPEV decreases rapidly.

We choose λ = 0.25 as a compromise between the number of variables selected and the percentage of variance lost.

Sparse MCA: example on breeds of dogs
Result of the Sparse MCA
[figure]

Sparse MCA: example on breeds of dogs
Comparison of the loadings

                              MCA                              Sparse MCA
Variable            Comp1   Comp2   Comp3   Comp4     Comp1   Comp2   Comp3   Comp4
large              -0.361   0.071  -0.005   0.060    -0.389   0.000   0.000   0.000
medium              0.280   0.287   0.300  -0.055     0.226   0.000   0.000   0.000
small               0.291  -0.400  -0.293  -0.041     0.390   0.000   0.000   0.000
lightweight         0.316  -0.389  -0.193  -0.081     0.368  -0.256   0.000   0.000
heavy              -0.047   0.390  -0.133   0.088    -0.075   0.451   0.000   0.000
very heavy         -0.294  -0.215   0.458  -0.055    -0.305  -0.479   0.000   0.000
slow                0.059  -0.383   0.296   0.133     0.000  -0.561   0.000   0.000
fast                0.224   0.256   0.057  -0.299     0.000   0.282   0.000   0.000
very fast          -0.303   0.156  -0.391   0.168     0.000   0.328   0.000   0.000
unintelligent       0.173   0.157   0.356   0.236     0.000   0.000   0.693  -0.693
avg intelligent    -0.145  -0.309  -0.168   0.125     0.000   0.000  -0.327   0.327
very intelligent   -0.086   0.125  -0.330  -0.491     0.000   0.000  -0.642   0.642
unloving           -0.366  -0.084   0.030   0.087    -0.462   0.000   0.000   0.000
very affectionate   0.353   0.081  -0.029  -0.084     0.445   0.000   0.000   0.000
aggressive         -0.170  -0.096   0.162  -0.515     0.000   0.000   0.000   0.000
non aggressive      0.164   0.093  -0.156   0.497     0.000   0.000   0.000   0.000

Nb non-zero loadings   16      16      16      16        8       6       3       3
Adjusted variance (%)  28.19   22.80   13.45   9.55     23.03   17.40   10.20   9.50

Sparse MCA: application to SNP data

Single nucleotide polymorphism (SNP): a single base-pair mutation at a specific locus → the most frequent type of polymorphism.

Data
- I = 502 women
- J = 537 SNPs
- Q = 1554 (2 or 3 modalities per SNP)
- X: 502 x 537 matrix of categorical variables (SNPs)
- D: 502 x 1554 complete disjunctive matrix -> D = (D_[1], ..., D_[537])

1 block = 1 SNP = 1 sub-matrix D_[j]

Sparse MCA: application to SNP data

λ = 0.0045: CPEV = 0.36% and 300 columns selected on Comp 1.
λ = 0.005: CPEV = 0.32% and 174 columns selected on Comp 1.
→ We choose λ = 0.005.

Sparse MCA: application to SNP data
Comparison of the loadings

                            MCA                Sparse MCA
SNP               Comp1    Comp2      Comp1    Comp2
SNP1.AA          -0.078    0.040     -0.092    0.102
SNP1.AG          -0.014   -0.027     -0.022   -0.053
SNP1.GG           0.150   -0.002      0.132   -0.003
SNP2.AA          -0.082    0.041     -0.118    0.000
SNP2.AG          -0.021   -0.025     -0.020    0.000
SNP2.GG          -0.081    0.040     -0.001    0.000
SNP3.CC          -0.004    0.050      0.000    0.000
SNP3.CG           0.016    0.021      0.000    0.000
SNP3.GG          -0.037   -0.325      0.000    0.000
SNP4.AA           0.149   -0.003      0.050    0.000
SNP4.AG          -0.016   -0.025     -0.002    0.000
SNP4.GG          -0.081    0.040     -0.100    0.000
...

Nb non-zero loadings      1554    1554      172     108
Variance (%)              1.14    0.63      0.32    0.16
Cumulative variance (%)   1.14    1.77      0.32    0.48

Sparse MCA: application to SNP data
[figures]

Conclusions

- In an unsupervised multiblock data context: GSPCA for continuous variables, Sparse MCA for categorical variables.
- Both produce sparse loadings, with a limited loss of explained variance → easier interpretation and comprehension of the results.
- Very powerful for variable selection in high-dimensional problems → reduces noise as well as computation time.
- Research in progress: extension of Sparse MCA to select both groups and predictors within a group → sparsity at both the group and individual feature levels → a compromise between Sparse MCA and the sparse-group lasso of Simon et al. (2012).

References

[1] Jolliffe, I. T., et al. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12, 531–547.
[2] Shen, H., & Huang, J. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99, 1015–1034.
[3] Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2012). A Sparse-Group Lasso. Journal of Computational and Graphical Statistics. DOI: 10.1080/10618600.2012.681250.
[4] Tenenhaus, M. (2007). Statistique : méthodes pour décrire, expliquer et prévoir. Dunod, 252–254.
[5] Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68, 49–67.
[6] Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15, 265–286.