Haplotyping and genotype imputation using Graphics Processing Units

Fast methods and software for imputation of whole-genome sequencing data Gary K. Chen Sept 18, 2012

DESCRIPTION

This is a talk I gave for the 2012 GES series, which focuses on applications and methods for genetic epidemiology

TRANSCRIPT

Page 1: Haplotyping and genotype imputation using Graphics Processing Units

Fast methods and software for imputation of whole-genome sequencing data

Gary K. Chen

Sept 18, 2012

Page 2: Haplotyping and genotype imputation using Graphics Processing Units

An outline

Background and Motivation

Implementation/Software

Simulations based on KGP data

Tutorial

Ongoing work

Page 5: Haplotyping and genotype imputation using Graphics Processing Units

Introduction

◮ Imputation
  ◮ Probabilistic inference of unobserved genotypes (e.g. not available on chip, poor QC, low coverage)
  ◮ Exploits the fact that we are all distantly related, implying reduced haplotype diversity
◮ Can potentially improve power
  ◮ by refining an association signal
  ◮ by pooling resources for large-scale collaborations
◮ Existing software
  ◮ FastPHASE, MACH, IMPUTE2, BEAGLE, PLINK, MENDEL, and others
  ◮ All use some form of Expectation Maximization to "learn" parameters
    ◮ e.g. Hidden Markov Models

Page 6: Haplotyping and genotype imputation using Graphics Processing Units

The mechanics behind imputation

◮ Imputation is an exercise in haplotype counting
  ◮ We don't observe the true haplotypes
  ◮ We make a large number of "guesses", and weight each possible pairing (much larger number!)
  ◮ Weights are likelihoods: i.e. the probability of the observed multimarker genotypes, given the pair of haplotype guesses
◮ We add up the weights across all possible pairs, which will give us a posterior probability of each genotype at a site
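The weighting scheme on this slide can be illustrated with a minimal sketch. It assumes biallelic haplotypes coded 0/1, genotypes coded as alternate-allele counts (None for an untyped site), known haplotype frequencies, random mating, and a small per-site error rate. The function name, the `err` parameter, and the toy data are hypothetical and not taken from the talk or its software.

```python
from itertools import combinations_with_replacement

def genotype_posterior(haps, freqs, observed, site, err=0.01):
    """Posterior over genotypes 0/1/2 at `site`, summing weights over all haplotype pairs."""
    post = [0.0, 0.0, 0.0]
    for a, b in combinations_with_replacement(range(len(haps)), 2):
        # Prior weight of the pair under random mating (factor 2 when a != b).
        w = freqs[a] * freqs[b] * (1 if a == b else 2)
        # Likelihood of the observed multimarker genotypes given this pair.
        for m, g in enumerate(observed):
            if g is None:
                continue  # untyped site: contributes no evidence
            implied = haps[a][m] + haps[b][m]
            w *= (1 - err) if implied == g else err / 2
        # Add the pair's weight to the genotype it implies at the target site.
        post[haps[a][site] + haps[b][site]] += w
    total = sum(post)
    return [p / total for p in post]

# Toy example: three haplotype guesses over three sites, middle site untyped.
haps = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
freqs = [0.5, 0.3, 0.2]
observed = [1, None, 1]
posterior = genotype_posterior(haps, freqs, observed, site=1)
print(posterior)
```

With h guesses the loop visits h*(h+1)/2 pairs per individual, which is the quantity that dominates the cost estimates later in the talk.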

Page 8: Haplotyping and genotype imputation using Graphics Processing Units

Imputation in GWAS

◮ Imputation is the bottleneck in modern high-dimensional genetic studies
  ◮ Memory: Because humans are diploid, we must integrate over phasing uncertainty: (haps*(haps+1)/2)
  ◮ Speed: Current methods (either MCMC or EM) must iterate a number of times until convergence
◮ Real world challenges
  ◮ Requires massive computational resources (e.g. >3500 jobs for HapMap 2 reference panel)
  ◮ Scaling to genome wide sequence data (1-2 orders of magnitude more coverage)?
  ◮ Imputing very rare alleles from deep reference panels (thousands) will only exacerbate the computational burden quadratically

Page 10: Haplotyping and genotype imputation using Graphics Processing Units

Why is imputation so computationally demanding?

◮ Simplified concrete example: 1000 individuals, 20 SNP window
◮ Suppose we make 2^20 = 1,048,576 = h guesses, h*(h+1)/2 = 549,756,338,176 pairs
◮ We calculate weights for the first individual, repeating for everyone: 549,756,338,176,000
◮ We loop again to get genotype posteriors, so total computation is:
  ◮ 2 * 549,756,338,176,000 = 1.1 quadrillion calculations.
◮ In reality of course, we expect LD to greatly reduce h
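The arithmetic on this slide can be reproduced in a few lines. The variable names are illustrative only.

```python
n_snps = 20
n_individuals = 1000

h = 2 ** n_snps                      # number of haplotype guesses
pairs = h * (h + 1) // 2             # unordered pairs, including a haplotype paired with itself
per_pass = pairs * n_individuals     # weights computed for every individual
total = 2 * per_pass                 # second pass to obtain genotype posteriors

print(h, pairs, per_pass, total)
```

The total is roughly 1.1e15, i.e. the 1.1 quadrillion calculations quoted above.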

Page 12: Haplotyping and genotype imputation using Graphics Processing Units

Current best practices

◮ Chunking
  ◮ Divide regions into subregions of several megabases each
  ◮ Run each subregion independently on a cluster node as an embarrassingly parallel problem
◮ Pre phase
  ◮ Estimating haplotype frequencies is by far the most expensive procedure
  ◮ Phase study data across a much smaller subset of SNPs (e.g. 660k)
  ◮ Run fast haploid based imputation, using sequence based reference panel
◮ Nevertheless, imputation can take weeks if not months on large clusters!

Page 13: Haplotyping and genotype imputation using Graphics Processing Units

Breaking down computational barriers

◮ Traditional CPU clusters are not keeping up
  ◮ We are interested in rarer variation, so need deeper reference panels
  ◮ Computational demand scales linearly with respect to SNPs
  ◮ However, computation increases as the number of reference haplotypes squared
  ◮ Algorithms do not take advantage of
◮ Innovations in processor technology
  ◮ Just as sequencing has revolutionized data production, new microprocessor technologies and programming interfaces are revolutionizing software development
  ◮ Many-core processors: e.g. Graphics Processing Units

Page 14: Haplotyping and genotype imputation using Graphics Processing Units

China’s GPU farm at BGI

Page 15: Haplotyping and genotype imputation using Graphics Processing Units

Examples of GPU devices

Page 16: Haplotyping and genotype imputation using Graphics Processing Units

GPUs in scientific computing

◮ Far superior to today's CPUs in efficiency (with respect to energy and hardware cost)
  ◮ A single device can contain over 500 computing cores in near proximity
  ◮ Each enterprise grade device runs approx. $3000.
◮ Ideal for large scale optimization
  ◮ Common routines like matrix multiplication commonly yield over 300x speedup
  ◮ Other common methods that are good candidates include HMMs, matrix inversion, PCA
  ◮ Many algorithms can be rewritten to expose fine-grained calculations that can be done independently on a separate core

Page 17: Haplotyping and genotype imputation using Graphics Processing Units

Parallel HMM

◮ Over 25x speed up on a single machine
◮ ASHG talk
  ◮ Kai Wang, Gary K Chen: GPU accelerated genotype imputation for low-coverage high-throughput whole-genome sequencing data. In: International Congress of Human Genetics: 2011; Montreal, Canada; 2011.

Table: Heterozygote Accuracy

MAF     MaCH    Our program
<0.01   0.684   0.821
<0.03   0.798   0.865
<0.05   0.865   0.882

Page 18: Haplotyping and genotype imputation using Graphics Processing Units

An outline

Background and Motivation

Implementation/Software

Simulations based on KGP data

Tutorial

Ongoing work

Page 19: Haplotyping and genotype imputation using Graphics Processing Units

The algorithm behind MaCH

[Figure: flow diagram of the MaCH algorithm. Panel labels: Observed genotypes; State space over phasing uncertainty; Compute backward probabilities; Sample states conditional on backward prob and x over rates; Best guess haplotypes.]

Page 20: Haplotyping and genotype imputation using Graphics Processing Units

Our algorithm

[Figure: flow diagram of our algorithm. Panel labels: Estimate hap freqs on compact haplotype set; Impute on middle third set of genotypes; Assign hap freqs on full haplotype set; Advance by one third.]

◮ Key point: we can support *all* reference haplotypes, not just a random subset
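The "impute on middle third" and "advance by one third" labels suggest a sliding-window scheme, which might be sketched as follows. This is an interpretation of the figure labels only: the function name, the half-open index convention, and the handling of the first and last windows (which here also cover the leading and trailing edges) are assumptions, not details confirmed by the talk.

```python
def sliding_windows(n_sites, window):
    """Yield (win_start, win_end, imp_start, imp_end) as half-open site index ranges.

    Each window spans `window` sites and advances by one third of its length;
    only the middle third of each window is imputed, except at the two ends
    of the region, where the edge thirds are imputed as well (an assumption).
    """
    step = window // 3
    start = 0
    while start < n_sites:
        end = min(start + window, n_sites)
        imp_start = start if start == 0 else start + step
        imp_end = end if end == n_sites else start + 2 * step
        yield start, end, imp_start, imp_end
        if end == n_sites:
            break
        start += step

wins = list(sliding_windows(30, 9))
for w in wins:
    print(w)
```

Under these assumptions the imputed ranges tile the region exactly once, while every imputed site (away from the ends) has flanking context on both sides.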

Page 21: Haplotyping and genotype imputation using Graphics Processing Units

GPU implementation

◮ A massively parallel problem:
  ◮ Typical to deploy millions of computation "work-items" in a single function call
  ◮ Consideration of all possible pairs of haplotype guesses
  ◮ Each hap pair maps to a "work-item"
  ◮ Each subject maps to a "work-group"

[Figure: Workgroup 1, Workgroup 2, ..., Workgroup N, each containing 256 work-items]

◮ Memory latency is masked by computation
  ◮ One workgroup may fetch data, while another is finishing computations
  ◮ Powerful mechanism for especially large problems
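The mapping of haplotype pairs to work-items can be illustrated on the host side with a short Python sketch. It enumerates unordered pairs (i, j) with i <= j in row-major order and recovers the pair from a flat index, as a kernel would from its work-item id. The enumeration order, the function names, and the idea that each of the 256 work-items strides over several pairs are assumptions for illustration; the actual kernels are not shown in the talk.

```python
WORK_GROUP_SIZE = 256  # work-items per work-group, as drawn on the slide

def pair_from_index(k, h):
    """Map flat index k (0-based) to the k-th pair (i, j), i <= j < h, in row-major order."""
    i = 0
    row_len = h          # row i holds the pairs (i, i), (i, i+1), ..., (i, h-1)
    while k >= row_len:
        k -= row_len
        i += 1
        row_len -= 1
    return i, i + k

def passes_per_subject(h, group_size=WORK_GROUP_SIZE):
    """Number of strided passes a work-group needs to cover all pairs for one subject."""
    n_pairs = h * (h + 1) // 2
    return -(-n_pairs // group_size)  # ceiling division

pairs = [pair_from_index(k, 4) for k in range(10)]
print(pairs)
print(passes_per_subject(32))
```

Because each pair's weight is independent of the others, the work-items need no coordination until the weights are summed per subject.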

Page 22: Haplotyping and genotype imputation using Graphics Processing Units

Speedups are linear on a CPU cluster, but super linear on a GPU

Page 23: Haplotyping and genotype imputation using Graphics Processing Units

Manuscript and software

Page 24: Haplotyping and genotype imputation using Graphics Processing Units

An outline

Background and Motivation

Implementation/Software

Simulations based on KGP data

Tutorial

Ongoing work

Page 26: Haplotyping and genotype imputation using Graphics Processing Units

Evaluation on KGP derived simulated data

◮ 1KGP Cosmopolitan panel
  ◮ includes AFR, AMR, ASN, and EUR
  ◮ Allocated 50% of each ethnic group into two datasets
    ◮ Dataset 1 is the phased reference haplotypes
    ◮ Dataset 2 is a hypothetical study
  ◮ Study data consists of genotype likelihoods reflecting mean subject level coverage of 4x.
◮ Evaluation
  ◮ Applied recommended settings (e.g. IMPUTE2: phasing states=80, MCMC rounds=30)
  ◮ Accuracy: Applied post imputation filtering criteria to recover approx same number of SNPs
  ◮ Benchmarked RAM usage and run times
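The 50/50 allocation within each population group could be done along these lines. This is a sketch only: the sample identifiers, the fixed random seed, and the function name are hypothetical, and the talk does not say how samples were actually assigned.

```python
import random
from collections import defaultdict

def split_by_population(samples, seed=0):
    """Split (sample_id, population) records in half within each population.

    Returns (reference_ids, study_ids).
    """
    by_pop = defaultdict(list)
    for sample_id, pop in samples:
        by_pop[pop].append(sample_id)
    rng = random.Random(seed)
    reference, study = [], []
    for pop in sorted(by_pop):
        ids = by_pop[pop]
        rng.shuffle(ids)
        half = len(ids) // 2
        reference.extend(ids[:half])   # dataset 1: phased reference haplotypes
        study.extend(ids[half:])       # dataset 2: hypothetical study
    return reference, study

samples = [(f"{pop}{i}", pop) for pop in ("AFR", "AMR", "ASN", "EUR") for i in range(10)]
reference, study = split_by_population(samples)
print(len(reference), len(study))
```

Splitting within each group keeps the ancestry composition of the reference panel and the study sample matched.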

Page 27: Haplotyping and genotype imputation using Graphics Processing Units

Table: Accuracy (dosage correlation and heterozygote accuracy)

         IMPUTE2          GPU-IMPUTE
MAF
range    Dose    Het      Dose    Het
0.01     0.725   0.952    0.792   0.922
0.02     0.796   0.928    0.829   0.929
0.03     0.826   0.943    0.857   0.939
0.04     0.872   0.961    0.887   0.955
0.05     0.886   0.963    0.905   0.955
0.06     0.906   0.979    0.904   0.966
0.07     0.932   0.976    0.929   0.967
0.08     0.933   0.977    0.929   0.969
0.09     0.950   0.978    0.941   0.969
0.10     0.953   0.983    0.944   0.974
0.20     0.956   0.985    0.951   0.976
0.30     0.961   0.981    0.960   0.971
0.40     0.963   0.979    0.966   0.968
0.50     0.963   0.975    0.968   0.965
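For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. The definitions are assumptions, since the slides do not spell them out: dosage is taken as the expected alternate-allele count, dosage correlation as the Pearson correlation between dosage and the true genotype, and heterozygote accuracy as the fraction of true heterozygotes whose most probable imputed genotype is heterozygous. The toy posteriors are made up.

```python
from math import sqrt

def dosage(probs):
    """Expected alternate-allele count from (P(0), P(1), P(2))."""
    return probs[1] + 2 * probs[2]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def het_accuracy(posteriors, truth):
    """Fraction of true heterozygotes whose most probable genotype is heterozygous."""
    hets = [p for p, g in zip(posteriors, truth) if g == 1]
    correct = sum(1 for p in hets if max(range(3), key=lambda k: p[k]) == 1)
    return correct / len(hets)

posteriors = [(0.9, 0.1, 0.0), (0.2, 0.7, 0.1), (0.6, 0.4, 0.0), (0.0, 0.2, 0.8)]
truth = [0, 1, 1, 2]
r = pearson([dosage(p) for p in posteriors], truth)
acc = het_accuracy(posteriors, truth)
print(r, acc)
```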

Page 28: Haplotyping and genotype imputation using Graphics Processing Units

Computational requirements

Table: Memory and run time

Program      Runtime    RAM
IMPUTE2      38:52:23   3.7GB
GPU-IMPUTE   00:16:38   576MB

Fold speedup: 140.2x
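The fold speedup follows directly from the two run times in the table, as this short check shows (the helper name is illustrative):

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds

impute2 = to_seconds("38:52:23")
gpu_impute = to_seconds("00:16:38")
speedup = impute2 / gpu_impute
print(round(speedup, 1))
```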

Page 29: Haplotyping and genotype imputation using Graphics Processing Units

An outline

Background and Motivation

Implementation/Software

Simulations based on KGP data

Tutorial

Ongoing work

Page 30: Haplotyping and genotype imputation using Graphics Processing Units

Recommended hardware/platform

◮ 1 or more GPU devices that are CUDA or ATI stream compliant

◮ Linux OS: Pipeline scripts

◮ MySQL database server: to store and sort study data

Page 31: Haplotyping and genotype imputation using Graphics Processing Units

Configuration

Page 32: Haplotyping and genotype imputation using Graphics Processing Units

Distribution of effort

[Figure: Regions 1, 2, ..., R are each split into chunks of SNPs (1−1000, 1001−2000, 2001−3000, 3001−3850, shown as Chunks 1 to 4), and the chunks are distributed across GPU 1 and GPU 2.]
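The chunk boundaries in the figure (1−1000 through 3001−3850) are consistent with fixed-size chunks of 1000 SNPs. A minimal sketch of that chunking, plus a round-robin assignment of chunks to devices, is given below. The round-robin policy and both function names are assumptions; the pipeline's actual scheduling is not described on the slide.

```python
def chunk_ranges(n_snps, chunk_size):
    """Return 1-based inclusive (first, last) SNP index ranges covering n_snps."""
    return [(start, min(start + chunk_size - 1, n_snps))
            for start in range(1, n_snps + 1, chunk_size)]

def assign_round_robin(chunks, n_devices):
    """Assign chunks to devices 0..n_devices-1 in turn (an assumed policy)."""
    assignment = {device: [] for device in range(n_devices)}
    for i, chunk in enumerate(chunks):
        assignment[i % n_devices].append(chunk)
    return assignment

chunks = chunk_ranges(3850, 1000)
plan = assign_round_robin(chunks, 2)
print(chunks)
print(plan)
```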

Page 33: Haplotyping and genotype imputation using Graphics Processing Units

Step 1: Fetch study data

◮ Retrieves from database, sorts, and chunks data into regions

Page 34: Haplotyping and genotype imputation using Graphics Processing Units

Step 2: Pre-process KGP data

◮ Removes extremely rare sites, chunks data into regions

Page 35: Haplotyping and genotype imputation using Graphics Processing Units

Step 3: Pre-phase the study data

◮ Outputs posterior probabilities of the 4 ordered genotypes.
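If the four ordered genotypes are read as the phased configurations 0|0, 0|1, 1|0 and 1|1 (an assumption; the slide does not list them), their posteriors collapse to unordered genotype probabilities and a dosage as sketched here. The function name and example values are hypothetical.

```python
def collapse_ordered(p00, p01, p10, p11):
    """Collapse ordered-genotype posteriors to (P(hom ref), P(het), P(hom alt)) and dosage."""
    het = p01 + p10              # the two phase orientations of a heterozygote
    dosage = het + 2 * p11       # expected alternate-allele count
    return (p00, het, p11), dosage

unordered, dose = collapse_ordered(0.1, 0.6, 0.2, 0.1)
print(unordered, dose)
```

Keeping the two heterozygous orientations separate is what lets the next step treat each phased haplotype on its own in haploid imputation.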

Page 36: Haplotyping and genotype imputation using Graphics Processing Units

Step 4: Post-process the phased data

◮ Generates input files for haploid imputation.

Page 37: Haplotyping and genotype imputation using Graphics Processing Units

Step 5: Impute haploid data into KGP reference

◮ Imputation step. Very fast, but I/O intensive. Recommended on HPCC.

Page 38: Haplotyping and genotype imputation using Graphics Processing Units

Analysis of Multi-ethnic Cohort data

◮ Host
  ◮ epigraph.epigenome.usc.edu
  ◮ Two Tesla C2050 GPUs, each with 448 cores.
◮ GWAS studies imputed:
  ◮ AABC (1M): 5761
  ◮ AAPC-A (1M): 6806
  ◮ AAPC-B (1M): 2835
  ◮ JABC (660k): 2211
  ◮ LABC (660k): 1070
  ◮ LAPC/JAPC (660k): 4175
  ◮ T2D-Lat (2.5M): 4673
  ◮ Hecht-smoking (1M): 2319
◮ Total samples=29,850
◮ Total SNPs=13,123,026

Page 39: Haplotyping and genotype imputation using Graphics Processing Units

An outline

Background and Motivation

Implementation/Software

Simulations based on KGP data

Tutorial

Ongoing work

Page 40: Haplotyping and genotype imputation using Graphics Processing Units

On the horizon: matrix completion

◮ Matrix completion
  ◮ Is the basis of the winning entry of the $1M Netflix challenge.
  ◮ Customers rate about 1% of the movies; can we impute the 99% and predict what movie they will like?
◮ Model-free imputation
  ◮ Makes no assumptions about inter-site and inter-person correlations. (All other programs assume independence in the latter)
◮ Extremely fast
  ◮ 421 times faster than MaCH in pedigree data
  ◮ Parallelization of the SVD step may lead to another order of magnitude improvement.
◮ Eric Chi
  ◮ If interested please come to his noon talk Oct. 25 at SSB first floor classrooms
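To give a feel for matrix completion, here is a toy sketch that fills in missing entries of a matrix assumed to be exactly rank 1, using alternating least squares over the observed entries. This is deliberately not the SVD-based method mentioned on the slide, nor anyone's production code; the function name, the rank-1 assumption, and the example matrix are all illustrative.

```python
def complete_rank1(matrix, iters=100):
    """Fill None entries of `matrix` with u[i] * v[j] from a rank-1 least-squares fit."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        # Update each row factor using only the observed entries in that row.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = sum(matrix[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Update each column factor using only the observed entries in that column.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = sum(matrix[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return [[matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)] for i in range(n_rows)]

# Outer product of (1, 2, 3) and (1, 2, 4) with two entries hidden.
observed = [[1, 2, 4],
            [2, 4, None],
            [None, 6, 12]]
completed = complete_rank1(observed)
print(completed)
```

In the genotype setting the rows would be individuals and the columns sites, with untyped genotypes playing the role of the unrated movies.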

Page 41: Haplotyping and genotype imputation using Graphics Processing Units

Collaborators

◮ USC
  ◮ Kai Wang
  ◮ Alex Stram
  ◮ Chris Haiman
  ◮ Brian Henderson
  ◮ AABC consortium
  ◮ AAPC consortium
◮ UCLA
  ◮ Kenneth Lange
  ◮ Eric Sobel
  ◮ Eric Chi