haplotyping and genotype imputation using graphics processing units

Fast methods and software for imputation of whole-genome sequencing data

Gary K. Chen

Sept 18, 2012

An outline

Background and Motivation

Implementation/Software

Simulations based on KGP data

Tutorial

Ongoing work

Introduction

◮ Imputation
  ◮ Probabilistic inference of unobserved genotypes (e.g. not available on chip, poor QC, low coverage)
  ◮ Exploits the fact that we are all distantly related, implying reduced haplotype diversity

Introduction




◮ Can potentially improve power
  ◮ by refining an association signal
  ◮ by pooling resources for large scale collaborations

◮ Existing software
  ◮ FastPHASE, MACH, IMPUTE2, BEAGLE, PLINK, MENDEL, and others
  ◮ All use some form of Expectation Maximization to “learn” parameters
  ◮ e.g. Hidden Markov Models

The mechanics behind imputation

◮ Imputation is an exercise in haplotype counting
  ◮ We don’t observe the true haplotypes
  ◮ We make a large number of “guesses”, and weight each possible pairing (much larger number!)
  ◮ Weights are likelihoods: i.e. the probability of the observed multimarker genotypes, given the pair of haplotype guesses
  ◮ We add up the weights across all possible pairs, which will give us a posterior probability of each genotype at a site
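The haplotype-counting idea above can be sketched in a few lines of Python. This is a toy illustration, not the method used by any of the programs named in these slides: the per-site error model, the `error` parameter, the Hardy-Weinberg pair prior built from `freqs`, and the function names are all assumptions made for the example.

```python
from itertools import combinations_with_replacement

def pair_likelihood(hap_a, hap_b, genotypes, error=0.01):
    """P(observed genotypes | haplotype pair) under a toy per-site error model."""
    lik = 1.0
    for a, b, g in zip(hap_a, hap_b, genotypes):
        if g is None:  # unobserved site: contributes no information
            continue
        # The implied genotype a + b matches with prob 1 - error;
        # each of the two other genotypes gets error / 2 (assumption).
        lik *= (1.0 - error) if a + b == g else error / 2.0
    return lik

def genotype_posterior(haps, freqs, genotypes, site, error=0.01):
    """Posterior over genotype 0/1/2 at `site`, summing weights over all pairs."""
    post = [0.0, 0.0, 0.0]
    # All unordered pairs, including a haplotype paired with itself: h*(h+1)/2
    for i, j in combinations_with_replacement(range(len(haps)), 2):
        prior = freqs[i] * freqs[j] * (1.0 if i == j else 2.0)
        weight = prior * pair_likelihood(haps[i], haps[j], genotypes, error)
        post[haps[i][site] + haps[j][site]] += weight
    total = sum(post)
    return [p / total for p in post]

# Two haplotype guesses, one subject with the middle site unobserved.
haps = [(0, 0, 0), (1, 1, 1)]
freqs = [0.5, 0.5]
observed = [1, None, 1]
posterior = genotype_posterior(haps, freqs, observed, site=1)
print(posterior)
```

With heterozygous flanking genotypes, nearly all of the weight falls on the mixed pair, so the heterozygous genotype dominates the posterior at the unobserved site.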

Imputation in GWAS

◮ Imputation is the bottleneck in modern high-dimensional genetic studies
  ◮ Memory: Because humans are diploid, we must integrate over phasing uncertainty: (haps*(haps+1)/2)
  ◮ Speed: Current methods (either MCMC or EM) must iterate a number of times until convergence

◮ Real world challenges
  ◮ Requires massive computational resources (e.g. >3500 jobs for HapMap 2 reference panel)
  ◮ Scaling to genome wide sequence data (1-2 orders of magnitude more coverage)?
  ◮ Imputing very rare alleles from deep reference panels (thousands) will only exacerbate the computational burden quadratically

Why is imputation so computationally demanding?

◮ Simplified concrete example: 1000 individuals, 20 SNP window

◮ Suppose we make 2^20 = 1,048,576 = h guesses, h*(h+1)/2 = 549,756,338,176 pairs

◮ We calculate weights for the first individual, repeating for everyone: 549,756,338,176,000

◮ We loop again to get genotype posteriors, so total computation is:
  ◮ 2 * 549,756,338,176,000 = 1.1 quadrillion calculations.

◮ In reality of course, we expect LD to greatly reduce h
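The arithmetic in this example can be reproduced directly. The variable names below are illustrative only; the numbers are the ones given on the slide.

```python
n_individuals = 1000
n_snps = 20

h = 2 ** n_snps                    # every possible 20-SNP haplotype guess
pairs = h * (h + 1) // 2           # unordered haplotype pairs
per_pass = pairs * n_individuals   # one weight per pair per individual
total = 2 * per_pass               # second pass for genotype posteriors

print(f"{h:,} guesses, {pairs:,} pairs, {total:,} calculations")
```

The total comes to roughly 1.1e15, which is the "1.1 quadrillion" quoted above.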

Current best practices

◮ Chunking
  ◮ Divide regions into subregions of several megabases each
  ◮ Run each subregion independently on a cluster node as an embarrassingly parallel problem

◮ Pre phase
  ◮ Estimating haplotype frequencies is by far the most expensive procedure
  ◮ Phase study data across a much smaller subset of SNPs (e.g. 660k)
  ◮ Run fast haploid based imputation, using sequence based reference panel

◮ Nevertheless, imputation can take weeks if not months on large clusters!

Breaking down computational barriers

◮ Traditional CPU clusters are not keeping up
  ◮ We are interested in rarer variation, so need deeper reference panels
  ◮ Computational demand scales linearly with respect to SNPs
  ◮ However, computation increases as the number of reference haplotypes squared

◮ Algorithms do not take advantage of

◮ Innovations in processor technology
  ◮ Just as sequencing has revolutionized data production, new microprocessor technologies and programming interfaces are revolutionizing software development
  ◮ Many-core processors: e.g. Graphics Processing Units

China’s GPU farm at BGI

Examples of GPU devices

GPUs in scientific computing

◮ Far superior to today’s CPUs in efficiency (with respect to energy and hardware cost)

◮ A single device can contain over 500 computing cores in close proximity

◮ Each enterprise grade device runs approx. $3000.

◮ Ideal for large scale optimization
  ◮ Common routines like matrix multiplication commonly yield over 300x speedup
  ◮ Other common methods that are good candidates include HMMs, matrix inversion, PCA
  ◮ Many algorithms can be rewritten to expose fine grained calculations that can be done independently on a separate core

Parallel HMM

◮ Over 25x speed up on a single machine
◮ ASHG talk
  ◮ Kai Wang, Gary K Chen: GPU accelerated genotype imputation for low-coverage high-throughput whole-genome sequencing data. In: International Congress of Human Genetics: 2011; Montreal, Canada; 2011.

Table: Heterozygote Accuracy

MAF     MaCH    Our program
<0.01   0.684   0.821
<0.03   0.798   0.865
<0.05   0.865   0.882

The algorithm behind MaCH

[Figure: schematic of the MaCH algorithm. Panel labels: Observed genotypes; State space over phasing uncertainty; Compute backward probabilities; Sample states conditional on backward prob and x over rates; Best guess haplotypes]

Our algorithm

[Figure: schematic of our algorithm. Panel labels: Estimate hap freqs on compact haplotype set; Impute on middle third set of genotypes; Assign hap freqs on full haplotype set; Advance by one third]

◮ Key point: we can support *all* reference haplotypes, not just a random subset
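The figure labels suggest a sliding window in which only the middle third is imputed and the window then advances by one third of its length. A minimal sketch of that indexing follows. The function name, the half-open index convention, and the handling of the chromosome ends (left uncovered here) are assumptions, not details taken from the slides.

```python
def sliding_windows(n_sites, window):
    """Yield (win_start, win_end, mid_start, mid_end) as half-open index ranges."""
    if window % 3 != 0:
        raise ValueError("window length must be divisible by 3")
    step = window // 3
    start = 0
    while start + window <= n_sites:
        # Haplotype frequencies would be estimated on [start, start + window);
        # only the middle third would actually be imputed.
        yield start, start + window, start + step, start + 2 * step
        start += step  # advance by one third

windows = list(sliding_windows(n_sites=15, window=9))
for w in windows:
    print(w)
```

Because the step equals the length of the middle third, consecutive middle thirds tile the interior of the region without gaps or overlap. The first and last thirds of the region would need separate treatment.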

GPU implementation

◮ A massively parallel problem:
  ◮ Typical to deploy millions of computation “work-items” in a single function call
  ◮ Consideration of all possible pairs of haplotype guesses
  ◮ Each hap pair maps to a “work-item”
  ◮ Each subject maps to a “work-group”

[Figure: Workgroup 1, Workgroup 2, ..., Workgroup N, each containing 256 workitems]

◮ Memory latency is masked by computation
  ◮ One workgroup may fetch data, while another is finishing computations
  ◮ Powerful mechanism for especially large problems
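One way to picture the mapping of haplotype pairs to work-items is to number the h*(h+1)/2 unordered pairs linearly and let each of the 256 work-items in a subject's work-group take every 256th pair. The sketch below emulates that indexing on the CPU in Python. The triangular numbering, the strided assignment, and the function names are assumptions made for illustration and are not taken from the actual GPU kernel.

```python
from math import isqrt

WORKGROUP_SIZE = 256  # work-items per work-group, as drawn in the figure

def pair_from_index(k):
    """Map linear index k to an unordered haplotype pair (i, j) with i <= j."""
    j = (isqrt(8 * k + 1) - 1) // 2   # largest j with j*(j+1)/2 <= k
    i = k - j * (j + 1) // 2
    return i, j

def pairs_for_work_item(local_id, n_haps, group_size=WORKGROUP_SIZE):
    """Pairs handled by one work-item when pairs are strided across the group."""
    n_pairs = n_haps * (n_haps + 1) // 2
    return [pair_from_index(k) for k in range(local_id, n_pairs, group_size)]

# With 30 haplotype guesses there are 465 pairs, so work-item 0 handles two.
print(pairs_for_work_item(0, n_haps=30))
```

On a real device the work-group index would select the subject and the local index would play the role of `local_id`. Integer square root avoids the rounding problems a floating-point `sqrt` can cause for large indices.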

Speedups are linear on a CPU cluster, but super linear on a GPU

Manuscript and software

Evaluation on KGP derived simulated data

◮ 1KGP Cosmopolitan panel
  ◮ includes AFR, AMR, ASN, and EUR
  ◮ Allocated 50% of each ethnic group to each of two datasets
  ◮ Dataset 1 is the phased reference haplotypes
  ◮ Dataset 2 is a hypothetical study
  ◮ Study data consists of genotype likelihoods reflecting mean subject level coverage of 4x.

◮ Evaluation
  ◮ Applied recommended settings (e.g. IMPUTE2: phasing states=80, MCMC rounds=30)
  ◮ Accuracy: Applied post imputation filtering criteria to recover approx same number of SNPs
  ◮ Benchmarked RAM usage and run times

Table: Accuracy (dosage correlation and heterozygote accuracy)

MAF     IMPUTE2         GPU-IMPUTE
range   Dose    Het     Dose    Het
0.01    0.725   0.952   0.792   0.922
0.02    0.796   0.928   0.829   0.929
0.03    0.826   0.943   0.857   0.939
0.04    0.872   0.961   0.887   0.955
0.05    0.886   0.963   0.905   0.955
0.06    0.906   0.979   0.904   0.966
0.07    0.932   0.976   0.929   0.967
0.08    0.933   0.977   0.929   0.969
0.09    0.950   0.978   0.941   0.969
0.10    0.953   0.983   0.944   0.974
0.20    0.956   0.985   0.951   0.976
0.30    0.961   0.981   0.960   0.971
0.40    0.963   0.979   0.966   0.968
0.50    0.963   0.975   0.968   0.965

Computational requirements

Table: Memory and run time

Program      Runtime    RAM
IMPUTE2      38:52:23   3.7GB
GPU-IMPUTE   00:16:38   576MB

Fold speedup: 140.2x
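The fold speedup follows from the two run times in the table, read as hours:minutes:seconds. The helper name below is illustrative.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds

impute2_seconds = to_seconds("38:52:23")
gpu_impute_seconds = to_seconds("00:16:38")
speedup = impute2_seconds / gpu_impute_seconds

print(f"{speedup:.1f}x")
```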

Recommended hardware/platform

◮ 1 or more GPU devices that are CUDA or ATI Stream compliant

◮ Linux OS: Pipeline scripts

◮ MySQL database server: to store and sort study data

Configuration

Distribution of effort

[Figure: distribution of effort. Labels: GPU 1, GPU 2; Chunk 1, Chunk 2, Chunk 3, Chunk 4; ranges 1-1000, 1001-2000, 2001-3000, 3001-3850; Region 1, Region 2, ..., Region R]
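The chunk ranges in the figure (1-1000 through 3001-3850) are consistent with cutting a region of 3850 sites into chunks of at most 1000. A minimal sketch of that split follows, together with one possible way to spread chunks over two GPUs. The round-robin assignment and the function names are assumptions; the slides do not state how the pipeline actually schedules chunks.

```python
def make_chunks(n_sites, chunk_size=1000):
    """Split sites 1..n_sites into inclusive (first, last) ranges."""
    return [(first, min(first + chunk_size - 1, n_sites))
            for first in range(1, n_sites + 1, chunk_size)]

def assign_round_robin(chunks, n_gpus=2):
    """Deal chunks out to GPUs 1..n_gpus in turn (assumed scheduling policy)."""
    assignment = {gpu: [] for gpu in range(1, n_gpus + 1)}
    for index, chunk in enumerate(chunks):
        assignment[index % n_gpus + 1].append(chunk)
    return assignment

chunks = make_chunks(3850)
assignment = assign_round_robin(chunks)
print(chunks)
print(assignment)
```

Under this policy GPU 1 would receive the first and third chunks and GPU 2 the second and fourth.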

Step 1: Fetch study data

◮ Retrieves from database, sorts, and chunks data into regions

Step 2: Pre-process KGP data

◮ Removes extremely rare sites, chunks data into regions

Step 3: Pre-phase the study data

◮ Outputs posterior probabilities of the 4 ordered genotypes.

Step 4: Post-process the phased data

◮ Generates input files for haploid imputation.

Step 5: Impute haploid data into KGP reference

◮ Imputation step. Very fast, but I/O intensive. Recommended on HPCC.

Analysis of Multi-ethnic Cohort data

◮ Host
  ◮ epigraph.epigenome.usc.edu
  ◮ Two Tesla C2050 GPUs, each with 448 cores.

◮ GWAS studies imputed:
  ◮ AABC (1M): 5761
  ◮ AAPC-A (1M): 6806
  ◮ AAPC-B (1M): 2835
  ◮ JABC (660k): 2211
  ◮ LABC (660k): 1070
  ◮ LAPC/JAPC (660k): 4175
  ◮ T2D-Lat (2.5M): 4673
  ◮ Hecht-smoking (1M): 2319

◮ Total samples=29,850

◮ Total SNPs=13,123,026

On the horizon: matrix completion

◮ Matrix completion
  ◮ Is the basis of the winning entry of the $1M Netflix challenge.
  ◮ Customers rate about 1% of the movies; can we impute the 99% and predict what movie they will like?

◮ Model-free imputation
  ◮ Makes no assumptions about inter-site and inter-person correlations. (All other programs assume independence in the latter)

◮ Extremely fast
  ◮ 421 times faster than MaCH in pedigree data
  ◮ Parallelization of the SVD step may lead to another order of magnitude improvement.

◮ Eric Chi
  ◮ If interested please come to his noon talk Oct. 25 at SSB first floor classrooms

Collaborators

◮ USC
  ◮ Kai Wang
  ◮ Alex Stram
  ◮ Chris Haiman
  ◮ Brian Henderson
  ◮ AABC consortium
  ◮ AAPC consortium

◮ UCLA
  ◮ Kenneth Lange
  ◮ Eric Sobel
  ◮ Eric Chi
