ACCELERATING SPARSE CANONICAL CORRELATION ANALYSIS FOR LARGE BRAIN IMAGING GENETICS DATA

Jingwen Yan, Hui Zhang, Lei Du, Eric Wernert, Andrew J. Saykin, Li Shen

Post on 22-Feb-2016


Page 1: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

ACCELERATING SPARSE CANONICAL CORRELATION ANALYSIS FOR LARGE BRAIN IMAGING GENETICS DATA

Jingwen Yan, Hui Zhang, Lei Du, Eric Wernert, Andrew J. Saykin, Li Shen

Page 2: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

OUTLINE
• Imaging Genetics
• Sparse Canonical Correlation Analysis (SCCA)
• Computational Challenges and Methods
• Data Simulation
• Experimental Results

Page 3: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

IMAGING GENETICS

[Figure: diagram with the labels Genes; Cells; Systems; Behavior: disorders, complex interactions, phenomena, diseases. Credit: UCI, S. Potkin et al.]

Page 4: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

IMAGING GENETICS

Underlying Biological Pathway and Mechanism

Page 5: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

IMAGING GENETICS

[Figure: prior studies arranged on two axes. Imaging scale: Single ROI, Circuit, Whole Brain. Genetic scale: Candidate Gene/SNP, Biological Pathway, Genome-wide.]

• Risacher et al. 2010; Sloan et al. 2010; Potkin et al. 2009; Saykin et al. 2010
• Risacher et al. 2013: AV45 ROIs & APOE
• Swaminathan et al. 2012: PiB ROIs & amyloid pathway
• Potkin et al. 2009, Mol Psych: schizophrenia study
• Ho et al. 2010: FTO; Reiman et al. 2009, PNAS
• Chiang et al. 2012: SNP/gene networks & WM integrity
• Shen et al. 2010: ROIs; Stein et al. 2010: voxels

Page 6: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

OUTLINE
• Imaging Genetics
• Sparse Canonical Correlation Analysis (SCCA)
• Computational Challenges and Methods
• Data Simulation
• Experimental Results

Page 7: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

SCCA

[Figure: three schemes relating the variables X1, X2, X3, ..., Xn to Y1, Y2, Y3, ..., Yn. Panel labels: Massive Univariate Analysis; Multivariate Multiple Regression (W'X); Canonical Correlation Analysis (Xu, Yv, R).]

Page 8: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

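SCCA
• Sparse canonical correlation analysis (SCCA)
• R package: Penalized Multivariate Analysis (PMA) (Witten et al., 2009)
• Problem, in the PMA form: $\max_{u,v}\; u^\top X^\top Y v$ subject to $\|u\|_2^2 = 1$, $\|v\|_2^2 = 1$, $P_1(u) \le c_1$, $P_2(v) \le c_2$
• $X$, $Y$: imaging and genetics data, respectively
• $P_1$, $P_2$: sparse penalties, mostly the $\ell_1$ norm
• For simplicity, assuming $X^\top X = I$ and $Y^\top Y = I$
• Bi-convex and non-differentiable problem
• Iterative solution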

Page 9: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

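SCCA
• Sparse canonical correlation analysis (SCCA)
• Problem (with $\ell_1$ penalties): $\max_{u,v}\; u^\top X^\top Y v$ subject to $\|u\|_2^2 = 1$, $\|v\|_2^2 = 1$, $\|u\|_1 \le c_1$, $\|v\|_1 \le c_2$
• Iterative solution
  1. $u \leftarrow \arg\max_{u}\; u^\top X^\top Y v$ subject to $\|u\|_2^2 = 1$, $\|u\|_1 \le c_1$
  2. $v \leftarrow \arg\max_{v}\; u^\top X^\top Y v$ subject to $\|v\|_2^2 = 1$, $\|v\|_1 \le c_2$
• Each step has the closed form $u = S(X^\top Y v, \Delta_1) / \|S(X^\top Y v, \Delta_1)\|_2$ (symmetrically for $v$ with $\Delta_2$), where $S(a, \Delta) = \operatorname{sign}(a)\,(|a| - \Delta)_+$ is the soft thresholding operator and $\Delta_1 \ge 0$ is chosen so that $\|u\|_1 = c_1$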
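The alternating soft-thresholding updates named on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PMA implementation: the columns of X and Y are assumed to be centered, both penalties are taken to be the L1 norm, v starts from a constant unit vector as a simplification, and the threshold is found by bisection. The names `soft_threshold`, `sparse_unit_vector`, `scca_rank1`, `c1`, and `c2` are hypothetical, and the toy data is made up for the example.

```python
import math

def soft_threshold(a, delta):
    """Elementwise S(a, delta) = sign(a) * max(|a| - delta, 0)."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in a]

def l2_normalize(a):
    """Scale a to unit L2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a] if norm > 0 else list(a)

def l1_norm(a):
    return sum(abs(x) for x in a)

def sparse_unit_vector(a, c, n_bisect=60):
    """Return S(a, delta) / ||S(a, delta)||_2 with the threshold delta >= 0
    found by bisection so that the L1 norm is at most c (delta = 0 if the
    plain normalized vector already satisfies the bound).
    Assumes 1 <= c <= sqrt(len(a))."""
    u = l2_normalize(a)
    if l1_norm(u) <= c:
        return u
    lo, hi = 0.0, max(abs(x) for x in a)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if l1_norm(l2_normalize(soft_threshold(a, mid))) > c:
            lo = mid   # still too dense: threshold harder
        else:
            hi = mid   # bound met: try a smaller threshold
    return l2_normalize(soft_threshold(a, hi))

def scca_rank1(X, Y, c1, c2, n_iter=20):
    """First pair of sparse canonical weight vectors (u, v).
    X is n-by-p and Y is n-by-q, both given as lists of rows."""
    n, p, q = len(X), len(X[0]), len(Y[0])
    # K = X^T Y, the p-by-q cross-product matrix.
    K = [[sum(X[i][j] * Y[i][k] for i in range(n)) for k in range(q)]
         for j in range(p)]
    v = l2_normalize([1.0] * q)   # assumed initialization
    u = [0.0] * p
    for _ in range(n_iter):
        Kv = [sum(K[j][k] * v[k] for k in range(q)) for j in range(p)]
        u = sparse_unit_vector(Kv, c1)
        Ktu = [sum(K[j][k] * u[j] for j in range(p)) for k in range(q)]
        v = sparse_unit_vector(Ktu, c2)
    return u, v

# Toy data: column 0 of X and column 1 of Y share the signal s.
s = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
b = [1.0, 1.0, -2.0, -2.0, 1.0, 1.0]
X = [[s[i], a[i]] for i in range(6)]
Y = [[b[i], s[i]] for i in range(6)]
u, v = scca_rank1(X, Y, c1=1.1, c2=1.1)
print(u, v)
```

On this toy input the weights concentrate on the two columns that carry the shared signal. At the scales discussed later in the deck, the products with K are the part a tuned linear algebra library would be expected to speed up.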

Page 10: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

OUTLINE
• Imaging Genetics
• Sparse Canonical Correlation Analysis (SCCA)
• Computational Challenges and Methods
• Data Simulation
• Experimental Results

Page 11: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

COMPUTATIONAL CHALLENGES
• Example SCCA run at a small scale
  • Participants: 1,000
  • Genotype: 3,200 SNPs
  • Phenotype: 10,000 voxels
  • Permutation: 10,000 permutation tests
  • Running time: more than 12,000 hours
• Scale up
  • Genotype (array): 6M SNPs
  • Genotype (NGS): 40M variants
  • Phenotype: 200K voxels, imaging, cognitive and biomarker
  • Permutation: 10M permutations to reach $p = 10^{-7}$
• Parameter tuning via cross-validation
  • 10-fold cross-validation coupled with an 11-by-11 grid search
  • SCCA runs: 10 × 11 × 11 = 1,210
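The two counts on this slide can be checked with a few lines of Python. The sketch below enumerates the cross-validation grid and computes the smallest p-value a permutation test can report, assuming the common (b + 1) / (n + 1) estimator; the slide does not say which estimator was used. The grid values and function names are hypothetical.

```python
from itertools import product

def count_scca_runs(n_folds, grid_c1, grid_c2):
    """Number of SCCA fits for k-fold cross-validation over a 2-D grid."""
    return sum(1 for _ in product(range(n_folds), grid_c1, grid_c2))

def smallest_permutation_p(n_permutations):
    """Smallest reportable p-value under the (b + 1) / (n + 1) estimator,
    reached when no permuted statistic exceeds the observed one (b = 0)."""
    return 1.0 / (n_permutations + 1)

# Hypothetical 11-point grid for each of the two sparsity parameters.
grid = [k / 10 for k in range(11)]

runs = count_scca_runs(10, grid, grid)
p_floor = smallest_permutation_p(10_000_000)
print(runs)     # 1210
print(p_floor)  # just under 1e-7
```

With 10M permutations the floor sits just below 1e-7, which matches the target on the slide.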

Page 12: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

ACCELERATION WITH MKL
• Intel Math Kernel Library (MKL)
  • Accelerates application performance and reduces development time
  • Highly vectorized and threaded linear algebra, fast Fourier transforms (FFT), vector math, and statistics functions
• MKL has been optimized to utilize
  • multiple processing cores
  • wider vector units
  • more varied architectures available in a high-end system
• MKL can provide parallelism transparently and speed up programs that use supported math routines, without code changes
• Compiling R with MKL

Page 13: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

ACCELERATION WITH OFFLOAD MODEL
• Xeon Phi SE10P Coprocessor
  • 61 cores with 8GB GDDR5
  • Intel x86 instruction set
  • Usage of familiar programming models, software, and tools
• Pros
  • The host system can offload part of the computing workload to the Xeon Phi
  • The Xeon Phi can independently run a compatible program

Page 14: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

COMPUTATIONAL PLATFORM
• Texas Advanced Computing Center Stampede cluster
• MKL + offload
• Each computing node
  • Two Intel Xeon E5-2680 processors, each with eight cores @ 2.7GHz
  • 32GB DDR3 memory
  • The Xeon Phi SE10P Coprocessor has 61 cores with 8GB GDDR5
  • The NVIDIA K20 GPUs on each node have 5GB of on-board GDDR5
• Software
  • CentOS 6.3
  • Stock R 3.0.1 package compiled with the Intel compilers (v.13) and built with MKL v.11

Page 15: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

OUTLINE
• Imaging Genetics
• Sparse Canonical Correlation Analysis (SCCA)
• Computational Challenges and Methods
• Data Simulation
• Experimental Results

Page 16: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

SYNTHETIC DATA (GENETICS)
• FREGENE genome simulator
  • Simulates sequence-like data over large genomic regions in large diploid populations
• Simulated data
  • N=1,000 diploid individuals over 20,000 generations
  • 10 Mb genome with an average mutation rate of 2.5e-8 /site/generation
  • 3,274 SNPs with minor allele frequency (MAF) greater than 0.05 included
  • Four SNP data sets (i.e., g500, g1000, g2000, and g3274) created by taking the first 500, 1,000, 2,000, and 3,274 SNPs from the entire data, respectively
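The MAF filter and the nested "first k SNPs" subsets described on this slide can be sketched as follows. This assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele per diploid individual and stored one list per SNP; the function names and the toy genotypes are hypothetical and are not FREGENE output.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP from 0/1/2 alternate-allele counts per individual.
    Integer allele counts are used so the threshold comparison is exact."""
    n_alleles = 2 * len(genotypes)
    alt_count = sum(genotypes)
    return min(alt_count, n_alleles - alt_count) / n_alleles

def common_snp_indices(snps, threshold=0.05):
    """Indices of SNPs whose MAF is strictly greater than the threshold."""
    return [i for i, g in enumerate(snps)
            if minor_allele_frequency(g) > threshold]

def nested_subsets(indices, sizes):
    """Data sets named g<k>, each holding the first k retained SNPs."""
    return {f"g{k}": indices[:k] for k in sizes}

# Toy input: 6 SNPs typed in 10 individuals.
snps = [
    [0, 1, 2, 1, 0, 0, 1, 0, 0, 0],  # MAF 0.25, kept
    [0] * 10,                        # monomorphic, dropped
    [1] + [0] * 9,                   # MAF exactly 0.05, dropped
    [2] * 9 + [1],                   # MAF exactly 0.05, dropped
    [2, 2, 2, 2, 2, 2, 2, 2, 1, 1],  # MAF 0.10, kept
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # MAF 0.10, kept
]
common = common_snp_indices(snps)
subsets = nested_subsets(common, sizes=[2, 3])
print(common, subsets)
```

Because each subset is a prefix of the same filtered list, the smaller data sets are nested inside the larger ones, as with g500 through g3274 on the slide.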

Page 17: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

SYNTHETIC DATA (GENETICS)

Page 18: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

SYNTHETIC DATA (IMAGING)
• Assumption
  • Each image has multiple regions of interest (ROIs)
  • Voxels within each ROI are highly correlated
• Simulation
  • Random positive definite, non-overlapping, group-structured covariance matrix
  • Apply Cholesky decomposition to obtain the background imaging data
  • Individuals: N=1,000; image size: 100x100
  • We created three sets of phenotypic imaging data (i.e., p1000, p5000, and p10000), consisting of 1,000, 5,000, and 10,000 voxels, respectively
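The simulation recipe above (block-structured covariance, then Cholesky) can be sketched in standard-library Python. This is an illustration under assumptions, not the authors' generator: each ROI is modeled as a block with unit variance and a single within-block correlation drawn at random, voxels in different ROIs are uncorrelated, and the sizes are tiny. All function names and the correlation range are hypothetical.

```python
import math
import random

def block_covariance(group_sizes, rhos):
    """Block-diagonal covariance: unit variance, correlation rho within
    each group (ROI), zero correlation between groups."""
    p = sum(group_sizes)
    cov = [[0.0] * p for _ in range(p)]
    start = 0
    for size, rho in zip(group_sizes, rhos):
        for i in range(start, start + size):
            for j in range(start, start + size):
                cov[i][j] = 1.0 if i == j else rho
        start += size
    return cov

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def sample_images(cov, n_subjects, rng):
    """Draw n_subjects voxel vectors x = L z with z ~ N(0, I), so that
    the covariance of x equals cov."""
    L = cholesky(cov)
    p = len(cov)
    rows = []
    for _ in range(n_subjects):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        rows.append([sum(L[i][k] * z[k] for k in range(i + 1))
                     for i in range(p)])
    return rows

rng = random.Random(0)
group_sizes = [2, 3]                              # two small ROIs
rhos = [rng.uniform(0.6, 0.9) for _ in group_sizes]
cov = block_covariance(group_sizes, rhos)
images = sample_images(cov, n_subjects=4, rng=rng)
print(len(images), len(images[0]))  # 4 5
```

A block with correlation between 0 and 1 is positive definite, so the factorization always succeeds for this construction.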

Page 19: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

SYNTHETIC DATA (IMAGING)

Page 20: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

OUTLINE
• Imaging Genetics
• Sparse Canonical Correlation Analysis (SCCA)
• Computational Challenges and Methods
• Data Simulation
• Experimental Results

Page 21: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

RESULTS
• R snowfall package (sfLapply) with MKL and offload model

[Figure with panels labeled Baseline and Parallel (MKL + offload)]
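sfLapply applies a function to every element of a list across a pool of workers. A rough Python analogue of that pattern, applied to a permutation test, is sketched below. It is an assumption-laden illustration, not the authors' R code: a thread pool from `concurrent.futures` stands in for snowfall's cluster workers, a plain Pearson correlation stands in for an SCCA fit, and all names are hypothetical. Threads in CPython do not run pure-Python arithmetic in parallel, so a process pool would be the closer analogue for real workloads; the point here is only the map-over-permutations structure.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permuted_statistic(task):
    """One independent unit of work: shuffle y with its own seed, then
    recompute the statistic."""
    x, y, seed = task
    shuffled = list(y)
    random.Random(seed).shuffle(shuffled)
    return abs(pearson(x, shuffled))

def permutation_p_value(x, y, n_permutations, max_workers=4):
    """Map the permutations over a worker pool and return the
    (b + 1) / (n + 1) permutation p-value."""
    observed = abs(pearson(x, y))
    tasks = [(x, y, seed) for seed in range(n_permutations)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        null = list(pool.map(permuted_statistic, tasks))
    exceed = sum(1 for stat in null if stat >= observed)
    return (exceed + 1) / (n_permutations + 1)

x = [float(i) for i in range(20)]
y = [2.0 * value + 1.0 for value in x]
p_value = permutation_p_value(x, y, n_permutations=99)
print(p_value)
```

Each permutation is independent of the others, which is what makes this step suitable for a parallel map.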

Page 22: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

RESULTS

• Accelerated SCCA implementations yielded the same results

• These correlation coefficients are close to the ground truth value of 1

Correlation coefficient between the first pair of canonical components

Page 23: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

RESULTS

Page 24: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

CONCLUSION
• Initial steps to accelerate the SCCA implementation for brain imaging genetics applications
• Parallelism achieved at the system implementation level, accelerating linear algebra computation with the Math Kernel Library (MKL) and partially offloading the computing workload
• The 2-fold speedup, although encouraging, is still insufficient to handle extremely large-scale neuroimaging genetics data
  • millions of image voxels and millions of SNPs
• Future work
  • Big data analytic strategies at the parallel computing model level
  • Parallelization of multiplicative algorithms using MapReduce and CUDA
  • Application to accelerate enhanced SCCA models as well as other bi-multivariate statistical models for analyzing brain imaging genetics data

Page 25: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

ACKNOWLEDGEMENT

This research was supported by

• NIH R01 LM011360
• NIH U01 AG024904
• NIH RC2 AG036535
• NIH R01 AG19771
• NIH P30 AG10133
• NSF IIS-1117335

Page 26: Accelerating Sparse Canonical Correlation Analysis  for Large  Brain Imaging Genetics Data

Thank you