scalable probabilistic pca - ww2.amstat.org › meetings › sdss › 2018 › online... ·...
TRANSCRIPT
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Scalable and flexible probabilistic PCA forlarge-scale genetic variation data
Sriram Sankararaman
Department of Computer ScienceDepartment of Human Genetics
UCLA
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
The genomic revolution
Challenge: scalable methods to analyze and visualize this data
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Genetic data
Individual 1
Individual 2
Individual 3
Position along Genome
.
.
.
.A..C...
.
.
.
.A..T...
.
.
.
.A..C...
.
.
.
.G..T...
.
.
.
.A..C...
.
.
.
.G..C...
Individual 1
Individual 2
Individual 3
SNPs AC
AT
AC
GT
AC
GC
Genotype 1
Genotype 2
Genotype 3
SNPs 01
11
12
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
PCA on genetic data
Genotypes
SNPs
1020202
1110102
1211111
0102011
0112010
PCA0.7 0.3 -0.1 -0.4 -0.5
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
PCA on genetic data
Visualize genetic structure
Novembre et al. Nature 2008
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
PCA
Given N genotype vectors over M SNPsxn ∈ RM , n ∈ {1, . . . , N} and K ≤M .
xn ≈ w1zn,1 + . . .+wKzn,K
= [w1 . . .wK ]
zn,1...zn,K
= Wzn
• PCA Constraint: columns of W are orthonormal.
• The PCA solution W = UK where UK contains the topK eigenvectors of the sample covariance matrix.
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Challenges with PCA
Computational
• Compute all eigenvalue, eigenvectors.• Singular-Value Decomposition (SVD):
O(MN min(M,N)) ≈ O(MN2)
• Infeasible for genetic datasets (large number of SNPs M orindividuals N).
• Recent Randomized approximation algorithms
Statistical
• Missing genotypes.
• Correlation among SNPs.
Halko et al. 2009, Galinskey et al. 2016
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
Model
zniid∼ N (0, IK)
p(xn|zn,W , σ2) = N (Wzn, σ2IM )
Log likelihood
LL(W , σ2) ≡ logP (X|W , σ2) = log
N∏n=1
p(xn, zn|W , σ2)
The maximum likelihood estimator is equivalent to PCA.
Roweis 1999, Tipping and Bishop 1999, Engelhardt and Stephens 2010
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
EM algorithm
• E-step:
Z = (WTW )−1
WTX
• M-step:
W = XZT(ZZT)−1
• Assume: σ2 → 0
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
EM algorithm: computational complexity
• E-step:
Z = (WTW )︸ ︷︷ ︸K×K
−1WT︸︷︷︸K×M
X︸︷︷︸M×N
O(NMK)
• M-step:
W = X︸︷︷︸M×N
ZT︸︷︷︸N×K
(ZZT)︸ ︷︷ ︸K×K
−1
O(NMK)
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
EM algorithm: computational complexity
• Run for I iterations with each iteration costing O(NMK).
• For small K, leads to a linear-time algorithm.
• Ignores the special structure of the genotype matrix X.
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
EM algorithm: computational complexity
• Run for I iterations with each iteration costing O(NMK).
• For small K, leads to a linear-time algorithm.
• Ignores the special structure of the genotype matrix X.
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
EM algorithm: computational complexity
Each E and M step, perform the following operations K times:
c = Xb
• X is a fixed M ×N matrix of genotypes.
• b is a real-valued vector that could potentially changeeach iteration.
• Naive multiplication takes O(NM).
• For a genotype matrix, can we do some pre-processing sothat Xb can be computed more efficiently ?
• Yes! For a matrix with binary entries: O( MNlog2(N)).
Liberty and Zucker 2009
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
EM algorithm: computational complexity
Each E and M step, perform the following operations K times:
c = Xb
• X is a fixed M ×N matrix of genotypes.
• b is a real-valued vector that could potentially changeeach iteration.
• Naive multiplication takes O(NM).
• For a genotype matrix, can we do some pre-processing sothat Xb can be computed more efficiently ?
• Yes! For a matrix with binary entries: O( MNlog2(N)).
Liberty and Zucker 2009
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Probabilistic PCA
EM algorithm: computational complexity
Each E and M step, perform the following operations K times:
c = Xb
• X is a fixed M ×N matrix of genotypes.
• b is a real-valued vector that could potentially changeeach iteration.
• Naive multiplication takes O(NM).
• For a genotype matrix, can we do some pre-processing sothat Xb can be computed more efficiently ?
• Yes! For a matrix with binary entries: O( MNlog2(N)).
Liberty and Zucker 2009
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
EM with the Mailman algorithm
• Entries in genotype matrix take one of three values:{0, 1, 2}.
• Using the Mailman algorithm, per-iteration timecomplexity of EM for genotype matrix: O( MNK
log3(N)).
Sub-linear time algorithm for computing PCA
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Simulations
Accuracy
50, 000 SNPs, 10, 000 individuals
Fst MEV
K = 5 K = 10
0.001 0.987 1.0000.002 0.999 1.0000.003 0.999 1.0000.004 0.999 1.0000.005 1.000 1.0000.006 1.000 1.0000.007 1.000 1.0000.008 1.000 1.0000.009 1.000 1.0000.010 1.000 1.000
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Simulations
Efficiency
M = 100, 000 SNPs, K = 5, FST = 0.01
0 100 200 300 400 500
020
000
4000
060
000
8000
010
0000
1400
00
N (X 1000) individuals
Com
pute
tim
e (s
econ
ds)
●●●●●● ● ● ● ● ● ● ●
● EMFastPCAPLINK_SVDFlash−PCA
0 100 200 300 400 500
010
0020
0030
0040
00
N (X 1000) individuals
Com
pute
tim
e (s
econ
ds)
●●●●●● ● ● ●
●
●
●
●
● EMFastPCAFlash−PCA
0 100 200 300 400 500
010
020
030
040
050
060
070
0
N (X 1000) individuals
Ave
rage
iter
atio
n tim
e (s
econ
ds)
●●●●●● ● ● ●
●
●
●
●
● EMFastPCA
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Application to 1000 Genomes data
●●●
●
●
●●●●●
●●●●●●●●●●●
●
●●
●●●●●
●●●●●
●
●●
●●●●
●●●
●
●●
●●●●
●●●●
●●
●●
●●●
●
●
●●●●
●●●
●
●
●●●●
●
●●●●●●
●●●●
●●
●●●●●
●●●●●
●●●
●
●
●●
●●●●●●●●●●
●
●●●
●
●●●●●●
●
●●●
●
●
●●●
●
●●
●●●●●●●
●
●
●●●
●
●
●●●
●
●●●
●●●●●
●
●
●
●●●●●
●●●●
●
●●
●●
●●
●●●●●
●
●●
●●●
●
●
●
●●●●●
●●
●
●●●
●●●●●
●
●
●●●
●●●●
●
●
●
●●
●
●●●●●
●
●
●●●
●●●
●
●
●●
●●●●●●
●
●●●●
●
●
●
●●
●●●
●●●●●
●
●
●
●●
●●
●
●●●
●●●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●● ●
●●
●●
●
●●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●●
●
●
●● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●
●
●●
●
● ●●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●
●
●●
●●
●
●
●
●●●●●●
●
●●
●
●●
●
●
●●●●●
●●●●
●●
●●
●●
●●●●●●
●●
●●
●●●●
●
●
●●●
●
●
●●●●●
●●●
●●●●●
●
●●●●●●●●
●●●●●
●
●
●
●●
●●●●●●●●●●●
●●●
●
●●
●
●
●●●
● ●
●●●
●
●●
●
●
●●
●
●●●
●●●●
●●
●
●●●●
●●●●
●
●●●
●
●●●●
●●
●●●●●●
●
●
●●●●
●
●
●
●●●
●●●
●
●●
●●●●●●
●●●
●●●●
●
●●●●●●●
●●●
●
●●
●
●●●●
●●
●●●
●●
●●●●●●●
●●●
●●●●●●
●●●●
●
●
●●●●●
●
●●
●●●
●●●
●
●
●
●●●
●●●●●●
●●●●●●●
●●
●
●●●●●
●
● ●●●●●
●
●●
●
●●●
●
●●
●
●●
●
●
●●
●
●●●●●●●●●●
●●
●●
●●●●
●●●●●
●●●●●●●
●●●●●●●●
●●●●●●●●●●● ●
●●●●
●●●
●●● ●●
●●●●
●● ●● ●●●●●●●●●●
●●
●●
●●
●●●
●●
●
● ●
●
●
●
●●
●●●●●
● ●●●●●
● ●●●●
●●●●●●●
●
●● ●●●●●
●● ●●●●●●
●● ●●●●●
●●●●●●●
●
●
●●●●●●●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
● ●
●● ●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
● ●
● ●
●●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
● ●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●●●
●●●
●
●●
●●
●
●●
●
●●●●
●
●
●
●●●●
●●
●●
●●●●●
● ●
●
●
●●●●
●
●●● ●●●●●●●●
●●
●
●
●
●●●●
●
●●●●●●●●●
●●●
●●●●
●
●●●●
●
●
●●●
●●●
●
−0.02 0.00 0.02 0.04 0.06
−0.
020.
000.
020.
04
EM
PC1
PC
2
●
●
●
●
EURASNAMRAFR
●●●
●
●
●●●●●
●●●●●●●●●●●
●
●●
●●●●●
●●●●●
●
●●
●●●●
●●●
●
●●
●●●●
●●●●
●●
●●
●●●
●
●
●●●●
●●●
●
●
●●●●
●
●●●●●●
●●●●
●●
●●●●●
●●●●●
●●●
●
●
●●
●●●●●●●●●●
●
●●●
●
●●●●●●
●
●●●
●
●
●●●
●
●●
●●●●●●●
●
●
●●●
●
●
●●●
●
●●●
●●●●●
●
●
●
●●●●●
●●●●
●
●●
●●
●●
●●●●●
●
●●
●●●
●
●
●
●●●●●
●●
●
●●●
●●●●●
●
●
●●●
●●●●
●
●
●
●●
●
●●●●●
●
●
●●●
●●●
●
●
●●
●●●●●●
●
●●●●
●
●
●
●●
●●●
●●●●●
●
●
●
●●
●●
●
●●●
●●●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●● ●
●●
●●
●
●●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●●
●
●
●● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●
●
●●
●
● ●●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●
●
●●
●●
●
●
●
●●●●●●
●
●●
●
●●
●
●
●●●●●
●●●●
●●
●●
●●
●●●●●●
●●
●●
●●●●
●
●
●●●
●
●
●●●●●
●●●
●●●●●
●
●●●●●●●●
●●●●●
●
●
●
●●
●●●●●●●●●●●
●●●
●
●●
●
●
●●●
● ●
●●●
●
●●
●
●
●●
●
●●●
●●●●
●●
●
●●●●
●●●●
●
●●●
●
●●●●
●●
●●●●●●
●
●
●●●●
●
●
●
●●●
●●●
●
●●
●●●●●●
●●●
●●●●
●
●●●●●●●
●●●
●
●●
●
●●●●
●●
●●●
●●
●●●●●●●
●●●
●●●●●●
●●●●
●
●
●●●●●
●
●●
●●●
●●●
●
●
●
●●●
●●●●●●
●●●●●●●
●●
●
●●●●●
●
● ●●●●●
●
●●
●
●●●
●
●●
●
●●
●
●
●●
●
●●●●●●●●●●
●●
●●
●●●●
●●●●●
●●●●●●●
●●●●●●●●
●●●●●●●●●●● ●
●●●●
●●●
●●● ●●
●●●●
●● ●● ●●●●●●●●●●
●●
●●
●●
●●●
●●
●
● ●
●
●
●
●●
●●●●●
● ●●●●●
● ●●●●
●●●●●●●
●
●
● ●●●●●
●● ●●●●●●
●● ●●●●●
●●●●●●●
●
●
●●●●●●●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
● ●
●● ●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
● ●
● ●
●●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
● ●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●●●
●●●
●
●●
●●
●
●●
●
●●●●
●
●
●
●●●●
●●
●●
●●●●●
● ●
●
●
●●●●
●
●●● ●●●●●●●●
●●
●
●
●
●●●●
●
●●●●●●●●●
●●●
●●●●
●
●●●●
●
●
●●●
●●●
●
−0.02 0.00 0.02 0.04 0.06−
0.02
0.00
0.02
0.04
SVD
PC1
PC
2
●
●
●
●
EURASNAMRAFR
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Other advantages of Probabilistic PCA
Can naturally handle missing data
• E-step involves inferring hidden variables Z as well ashidden (missing observations).
• Can handle missing data efficiently.
Can use model selection to infer K.
• Choose K to maximize the marginal likelihood P (X|K).
• Use cross-validation and pick K that maximizes likelihoodon held out data.
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Open questions
• Modeling correlations .
• Beyond Gaussian outputs
Baran et al. 2013, Wen and Stephens 2012Collins et al. 2002
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Summary
• PCA can be interpreted as a latent variable model withcontinuous latent variable.
• Probabilistic interpretation useful to generalize PCA.• Leads to efficient inference.
• Fast matrix-vector multiplication for genotype data moregenerally applicable and can lead to sub-linear timealgorithms.
ScalableProbabilistic
PCA
PCA
ProbabilisticPCA
Results
Acknowledgments
• Aman Agrawal
• Minh Le
• Eran Halperin
Email [email protected]