ePCA: High Dimensional Exponential Family Principal Component Analysis
Lydia T. Liu ’17
ORFE/PACM, Princeton University
April 20, 2017
Joint work with Edgar Dobriban and Amit Singer
1 / 35
Overview
Introduction
  Review of PCA
  The ePCA problem
The ePCA method
  Overview
  Diagonal Debiasing
  Homogenization
  Shrinkage
  Heterogenization and Scaling
  Denoising
Illustration with X-ray molecular imaging
  Experiments on simulated XFEL data
Software
2 / 35
Principal Component Analysis (PCA)
- PCA: a useful tool for dimensionality reduction and data analysis
- Data X: an n × p matrix (n samples from a p-dimensional population)
- PCs: linear combinations of the features explaining the most variance
  - eigenvectors of the sample covariance matrix Σ_n = (1/n) X^T X
  - the corresponding eigenvalue λ_i is the variance of PC_i
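A minimal MATLAB sketch of classical PCA (illustrative, not the released ePCA code; X is assumed to be an n × p data matrix):

    % Classical PCA via the sample covariance (minimal sketch)
    n = size(X, 1);
    Xc = X - repmat(mean(X, 1), n, 1);      % center each feature
    Sn = (Xc' * Xc) / n;                    % sample covariance Sigma_n, p x p
    [V, D] = eig(Sn);                       % eigenvectors and eigenvalues
    [lam, idx] = sort(diag(D), 'descend');  % order PCs by explained variance
    V = V(:, idx);                          % columns of V are the PC directions
    scores = Xc * V;                        % PC scores of each sample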
3 / 35
PCA in classical vs. high-dimensional regimes
Classical setting: p fixed, n → ∞
- Σ_n → Σ by the law of large numbers

High-dimensional setting: p/n = γ ∈ (0, ∞), p → ∞, n → ∞
- Classical asymptotics don't apply; plain PCA is inconsistent!
- Limiting spectrum: the Marchenko-Pastur (MP) distribution (Marchenko and Pastur 1967)
- Acute need to shrink empirical eigenvalues (Donoho et al. 2013)
4 / 35
Motivating example for ePCA: XFEL
(a) 3D structure of a lysozyme
(b) XFEL imaging process (Gaffney and Chapman 2007)
Figure: X-ray free electron laser (XFEL) molecular imaging
5 / 35
Demo of ePCA on XFEL images
(a) Clean intensity maps
(b) Noisy photon counts
(c) Denoised (PCA) (d) Denoised (ePCA)
Figure: XFEL diffraction images (n = 70,000, p = 65,536)
6 / 35
PCA for the exponential family
In many applications the X_ij's have an exponential family distribution:
- SNPs: Binomial
- RNA-seq: Negative Binomial
- XFEL/photon-limited imaging: low-intensity Poisson
Single-parameter exponential family distributions have a density of the form f_θ(y) = exp(θy − A(θ)).
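For example, Poisson(µ) fits this form with natural parameter θ = log µ and A(θ) = e^θ (with base measure 1/y!), so A′(θ) = A′′(θ) = e^θ = µ: the mean equals the variance.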
There is no commonly agreed-upon version of PCA for non-Gaussian data (Jolliffe 2002).
7 / 35
A new method: ePCA
A deterministic 4-step algorithm using moments and shrinkage.

Advantages compared to previous proposals:
- Likelihood/generalized linear latent variable models (Collins et al. 2001; Knott and Bartholomew 1999; Udell et al. 2016): lack global convergence guarantees
- Gaussianizing transforms: wavelet, Anscombe (Anscombe 1948; Starck et al. 2010): unsuitable for low-intensity data
8 / 35
Problem formulation
Sampling model for Poisson
- Each p-dimensional latent vector is drawn i.i.d. from a distribution D:

    X = (X^(1), ..., X^(p))^T ∼ D(µ, Σ)

  e.g., the noiseless image; µ and Σ are the mean and covariance of D.
- Observations Y_i ∼ Y ∈ R^p (e.g., the noisy image)
- Model for Y: draw the latent X ∈ R^p, then

    Y = (Y^(1), ..., Y^(p))^T, where Y^(j) ∼ Poisson(X^(j))

Goal: Recover information about the original distribution D, i.e. Σ. Recover X.
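A minimal MATLAB simulation sketch of this model (the sizes, rank k, and nonnegative low-rank construction of X are illustrative assumptions; poissrnd requires the Statistics Toolbox):

    % Simulate the Poisson sampling model: low-rank latent intensities X,
    % observed photon counts Y (illustrative sizes and construction)
    n = 1000; p = 200; k = 3;
    U = rand(p, k);                 % k nonnegative "eigen-image" factors
    C = rand(n, k);                 % random per-sample coefficients
    X = C * U' + 0.1;               % latent intensities, kept strictly positive
    Y = poissrnd(X);                % n x p matrix of noisy photon counts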
9 / 35
Problem formulation
Sampling model for Exponential family
- One-parameter exponential family with density

    f_θ(y) = exp(θy − A(θ))

  Natural parameter θ; E[y] = A′(θ), Var[y] = A′′(θ)
- Observations Y_i ∼ Y ∈ R^p
- Model for Y: draw a latent θ ∈ R^p, then

    Y^(j) | θ^(j) ∼ f_{θ^(j)}(y),   Y = (Y^(1), ..., Y^(p))^T

Goal: Recover Σ_x := Cov(A′(θ)) and X := E(Y | θ) = A′(θ)
10 / 35
Mean parameter is low-rank
The mean X = E(Y | θ) has unknown low-dimensional structure
- as opposed to the natural parameter θ
- can leverage Random Matrix Theory ⇒ a simple method
- reasonable for image data (Basri and Jacobs 2003)
11 / 35
Our sampling model is realistic
Two applications
                          1. XFEL images                 2. Single cell RNA-seq
Sample latent vector X    Random 2D intensity map due    Random expression rate due
                          to the random 3D orientation   to biological variation
                          of the molecule                of cells
Sample Y ∼ Poisson(X)     Noisy 2D image Y due to low    Noisy read counts Y due to
                          photon count                   technical variation
12 / 35
Introduction
  Review of PCA
  The ePCA problem
The ePCA method
  Overview
  Diagonal Debiasing
  Homogenization
  Shrinkage
  Heterogenization and Scaling
  Denoising
Illustration with X-ray molecular imaging
  Experiments on simulated XFEL data
Software
13 / 35
Summary of ePCA
ePCA can be seen as a sequence of improved covariance estimators
Table: Covariance estimators
Name                 Formula                                         Motivation
Sample covariance    S = (1/n) Σ_{i=1}^n (Y_i − Ȳ)(Y_i − Ȳ)^T       -
Diagonal debiasing   S_d = S − diag[V(Ȳ)]                            Hierarchy
Homogenization       S_h = D_n^{-1/2} S_d D_n^{-1/2}                 Heteroskedasticity
Shrinkage            S_h,η = η(S_h)                                  High dimensionality
Heterogenization     S_he = D_n^{1/2} S_h,η D_n^{1/2}                Heteroskedasticity
Scaling              S_s = Σ_i α_i v_i v_i^T (S_he = Σ_i v_i v_i^T)  Heteroskedasticity
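Read top to bottom, the table is a pipeline. A minimal MATLAB sketch of how the steps compose for Poisson data (illustrative only: eta stands in for a scalar eigenvalue shrinker, e.g. the one sketched on the shrinkage slide, and the scaling coefficients α_i are omitted since their formula is in Liu et al. 2016):

    % ePCA covariance pipeline for Poisson data Y (n x p) -- illustrative sketch
    [n, p] = size(Y);
    Ybar = mean(Y, 1)';                   % sample mean, p x 1 (assumed positive)
    S    = cov(Y, 1);                     % sample covariance (1/n convention)
    Sd   = S - diag(Ybar);                % diagonal debiasing: V(y) = y for Poisson
    d    = 1 ./ sqrt(Ybar);
    Sh   = (d * d') .* Sd;                % homogenization: Dn^{-1/2} Sd Dn^{-1/2}
    [V, D] = eig((Sh + Sh') / 2);         % symmetrize against round-off
    lam  = eta(diag(D), p / n);           % eigenvalue shrinkage (eta assumed given)
    She  = diag(sqrt(Ybar)) * (V * diag(lam) * V') * diag(sqrt(Ybar));  % heterogenize
    % Scaling: recombine the eigenvectors of She with corrected eigenvalues
    % alpha_i (formula in Liu et al. 2016; omitted here)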
14 / 35
Diagonally debiasing the sample covariance
Poisson case:
- Sample mean: Ȳ_n = (1/n) Σ_{i=1}^n Y_i
- Sample covariance: S = (1/n) Σ_{i=1}^n (Y_i − Ȳ_n)(Y_i − Ȳ_n)^T
- But S is inconsistent for Σ: E[S] = Σ + diag[µ]
- D_n := diag(Ȳ_n)
- Diagonally debiased covariance: S_d = S − D_n
15 / 35
Diagonally debiasing the sample covariance
Exponential family:
- Cov[Y] = Cov[A′(θ)] + E diag[A′′(θ)], by the law of total variance
- Mean-variance map: V(y) = A′′[(A′)^{-1}(y)] (examples sketched below)
- D_n := diag[V(Ȳ_n)] estimates E diag[A′′(θ)]
- S_d = S − D_n
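For illustration, mean-variance maps for a few common families (the binomial form assumes N trials known and the negative binomial assumes dispersion r known; these are textbook identities, not necessarily the exact parameterizations of Liu et al. 2016):

    % Mean-variance maps V(y) = A''[(A')^{-1}(y)] for a few families (sketch)
    V_poisson  = @(y) y;                  % Var = mean
    V_binomial = @(y, N) y .* (1 - y/N);  % mean N*p, Var = N*p*(1-p)
    V_negbin   = @(y, r) y .* (1 + y/r);  % Var = mean + mean^2/r
    Dn = diag(V_poisson(mean(Y, 1)));     % plug-in estimate, Poisson case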
Theorem (Convergence of the debiased covariance (Liu et al. 2016))

  E[‖S_d − Σ_x‖_F] ≲ (√p / n) · [√p · m_4 + ‖µ‖]
16 / 35
Homogenization
- Motivation: remove the effects of heteroskedasticity ⇒ closer to the standard spiked model (Johnstone 2001) + MP law
- S_h = D_n^{-1/2} S_d D_n^{-1/2} = D_n^{-1/2} S D_n^{-1/2} − I_p
- Different from standardization (dividing each measurement by its empirical standard deviation)!
Figure: (a) Spectrum before homogenization; (b) spectrum after homogenization, with the MP density overlaid in red.
17 / 35
Marchenko-Pastur Law for ePCA
Theorem (Marchenko-Pastur Law (Liu et al. 2016))
- The eigenvalue distribution of S converges a.s. to a general MP law F_{γ,H}.
- The eigenvalue distribution of S_h + I_p converges a.s. to the standard MP law with aspect ratio γ.
Importance of homogenization and the MP law
- enables optimal eigenvalue shrinkers for covariance estimation (Donoho et al. 2013; Lee et al. 2010)
- improves signal strength (Liu et al. 2016)
- matches the Hardy-Weinberg equilibrium normalization (Patterson et al. 2006)
18 / 35
Eigenvalue shrinkage
- Reduce noise by shrinkage
- η(·) is a scalar shrinker, applied elementwise to the eigenvalues:

    M = VΛV^T;   η(M) = V η(Λ) V^T;   S_h,η = η(S_h)

- Optimal shrinkage functions (Donoho et al. 2013); a simple spiked-model shrinker is sketched after the figure
Figure: Need to Shrink! Spiked model spectrum after homogenization
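One concrete choice of η (a sketch, not necessarily the loss-optimal shrinkers of Donoho et al. 2013): hard-threshold at the MP bulk edge and invert the spiked-model eigenvalue map. Since S_h + I_p follows the standard MP law, the sketch shrinks the eigenvalues of S_h + I_p and subtracts 1 at the end:

    function lam_s = spike_shrink(lam, gamma)
    % Shrink the eigenvalues lam of Sh + I_p under the standard spiked model
    % (noise level 1, aspect ratio gamma) -- illustrative sketch
    edge  = (1 + sqrt(gamma))^2;          % MP bulk edge
    lam_s = ones(size(lam));              % bulk eigenvalues are pure noise -> 1
    above = lam > edge;                   % detectable spikes only
    b = lam(above) + 1 - gamma;
    lam_s(above) = (b + sqrt(b.^2 - 4*lam(above))) / 2;  % invert the spike map
    lam_s = lam_s - 1;                    % back to the eigenvalues of eta(Sh)
    end

With the eigendecomposition S_h = VΛV^T, this gives S_h,η = V diag(spike_shrink(diag(Λ) + 1, p/n)) V^T.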
19 / 35
Heterogenization
- The homogenized covariance matrix S_h doesn't estimate the original covariance matrix Σ! ⇒ need to heterogenize by multiplying back the estimated standard errors
- Improves eigenvector estimates
- S_he = D_n^{1/2} · S_h,η · D_n^{1/2}
20 / 35
Scaling
Heterogenization induces a bias in the eigenvalues (Liu et al. 2016)
- Need a final scaling step to correct the bias
- S_s = Σ_i α_i v_i v_i^T, where S_he = Σ_i v_i v_i^T
- Scaling function based on assuming a Gaussian spiked model (Baik et al. 2005; Johnstone 2001)
- We conjecture that the phase-transition literature also applies to our model
21 / 35
Denoising by EBLP
- Use the standard Empirical Best Linear Predictor (EBLP) strategy from random effects models (Searle et al. 2009). First we estimate E[A′′(θ)] and Σ_x using ePCA. Then, estimate

    X̂_i = Σ_x (diag[E A′′(θ)] + Σ_x)^{-1} Y_i + diag[E A′′(θ)] (diag[E A′′(θ)] + Σ_x)^{-1} Ȳ
- For the Poisson distribution,

    X̂_i = S_s (diag[Ȳ] + S_s)^{-1} Y_i + diag[Ȳ] (diag[Ȳ] + S_s)^{-1} Ȳ
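In MATLAB the Poisson EBLP takes a couple of lines (a sketch assuming Y is the n × p count matrix and Ss the scaled ePCA covariance estimate; it uses the identity diag[Ȳ](diag[Ȳ] + S_s)^{-1} = I − S_s(diag[Ȳ] + S_s)^{-1}, so only one weight matrix is needed):

    % EBLP denoising for Poisson data -- illustrative sketch
    [n, p] = size(Y);
    Ybar = mean(Y, 1)';                   % p x 1 sample mean
    W = Ss / (diag(Ybar) + Ss);           % W = Ss * inv(diag(Ybar) + Ss)
    Xhat = Y * W' + repmat(((eye(p) - W) * Ybar)', n, 1);  % row i is Xhat_i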
22 / 35
Introduction
  Review of PCA
  The ePCA problem
The ePCA method
  Overview
  Diagonal Debiasing
  Homogenization
  Shrinkage
  Heterogenization and Scaling
  Denoising
Illustration with X-ray molecular imaging
  Experiments on simulated XFEL data
Software
23 / 35
Covariance estimation
(a) Spectral norm (b) Frobenius norm
Figure: Error of covariance matrix estimation: norm of the difference between each successive covariance estimate (Sample, +Debiased, +Heterogenized, +Scaled) and the true covariance Σ_x.
24 / 35
Eigenvalues
Figure: Error of eigenvalue estimation for the top 5 eigenvalues, measured as percentage error relative to the true eigenvalue.
25 / 35
Eigenvectors
Figure: XFEL Eigenimages for γ = 1/4, ordered by eigenvalue
26 / 35
Denoising comparison
Figure: Comparing various methods' sampled reconstructions of the XFEL dataset. ePCA took 13.9 seconds, while the exponential family PCA of Collins et al. (2001) took 10900 seconds, or about 3 hours.
27 / 35
Denoising results
Figure: MSE against log10 of the sample size; mean over 50 Monte Carlo trials. Purple: PCA. Grey: ePCA (projection only). Green: ePCA + EBLP.
28 / 35
Software
- ePCA is publicly available as an open-source MATLAB implementation at github.com/lydiatliu/epca/
- The main function is exp_fam_pca, e.g.:

    [cov,~,~,~,~,eigval,eigvec,~,~,~] = exp_fam_pca(data, 'poisson')
29 / 35
Summary
1. A new method, ePCA, for PCA of exponential family data, based on a new covariance estimator.
2. Homogenization, shrinkage, heterogenization, and scaling of the debiased covariance matrix improve performance for high-dimensional data; each step has theoretical justification.
3. Applied ePCA to simulated XFEL data ⇒ reduced MSE for covariance, eigenvalue, and eigenvector estimation.
4. Used ePCA to develop a new denoising method, a form of Empirical Best Linear Predictor (EBLP) from random effects models, demonstrated on simulated XFEL data.
30 / 35
References I
F. J. Anscombe. "The Transformation of Poisson, Binomial and Negative-Binomial Data". In: Biometrika 35.3-4 (1948), p. 246.
Jinho Baik, Gérard Ben Arous, and Sandrine Péché. "Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices". In: Annals of Probability 33.5 (2005), pp. 1643–1697.
Ronen Basri and David W. Jacobs. "Lambertian Reflectance and Linear Subspaces". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 25.2 (2003), pp. 218–233.
Michael Collins, Sanjoy Dasgupta, and Robert E. Schapire. "A generalization of principal component analysis to the exponential family". In: Advances in Neural Information Processing Systems (NIPS) (2001).
31 / 35
References II
D. L. Donoho, M. Gavish, and I. M. Johnstone. "Optimal shrinkage of eigenvalues in the spiked covariance model". In: arXiv preprint arXiv:1311.0851 (2013), pp. 1–35.
K. J. Gaffney and H. N. Chapman. "Imaging Atomic Structure and Dynamics with Ultrafast X-ray Scattering". In: Science 316.5830 (2007), pp. 1444–1448. doi: 10.1126/science.1135923. url: http://science.sciencemag.org/content/316/5830/1444.
Iain M. Johnstone. "On the distribution of the largest eigenvalue in principal components analysis". In: Annals of Statistics 29.2 (2001), pp. 295–327.
32 / 35
References III
Ian Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.
Martin Knott and David J. Bartholomew. Latent Variable Models and Factor Analysis. Edward Arnold, 1999.
Seunggeun Lee, Fei Zou, and Fred A. Wright. "Convergence and prediction of principal component scores in high-dimensional settings". In: Annals of Statistics 38.6 (2010), pp. 3605–3629.
Vladimir A. Marchenko and Leonid A. Pastur. "Distribution of eigenvalues for some sets of random matrices". In: Mat. Sb. 114.4 (1967), pp. 507–536.
N. Patterson, A. L. Price, and D. Reich. "Population structure and eigenanalysis". In: PLoS Genet 2.12 (2006), e190.
33 / 35
References IV
Shayle R. Searle, George Casella, and Charles E. McCulloch. Variance Components. John Wiley & Sons, 2009.
Andrey A. Shabalin and Andrew B. Nobel. "Reconstruction of a low-rank matrix in the presence of Gaussian noise". In: Journal of Multivariate Analysis 118 (2013), pp. 67–76.
Jean-Luc Starck, Fionn Murtagh, and Jalal M. Fadili. Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press, 2010.
Madeleine Udell et al. "Generalized Low Rank Models". In: Foundations and Trends in Machine Learning 9.1 (2016), pp. 1–118.
34 / 35
References V
L. T. Liu, E. Dobriban, and A. Singer. "ePCA: High Dimensional Exponential Family PCA". In: arXiv e-prints (Nov. 2016). arXiv: 1611.05550.
35 / 35