how i learned to stop worrying about big data and love the data that actually counts - counsyl tech...

DESCRIPTION

July 18, 2013 Counsyl Tech Talk on "How I Learned to Stop Worrying about Big Data and Love the Data That Actually Counts." Speaker: Imran Haque, Director of Research, Counsyl. Video: https://vimeo.com/71282924 Check out our openings: jobs.counsyl.com

TRANSCRIPT

Counsyl

www.counsyl.com

How I Learned to Stop Worrying about Big Data

...and love the data that actually counts

Imran S. Haque, Counsyl

18 Jul 2013

Friday, July 26, 13

About the Speaker

• Imran S. Haque (ihaque@counsyl.com)

• Director of Research at Counsyl

• BS EECS, UC Berkeley; PhD CS, Stanford

About Counsyl

We have developed a single genomic test that replaces 100+ expensive assays.

It has reduced the cost of carrier testing by literally one hundredfold.

Disease                                  Assay price
Bloom Syndrome                           $167
Canavan Disease                          $473
Cystic Fibrosis                          $506
Familial Dysautonomia                    $334
Fanconi Anemia                           $167
Gaucher Disease                          $467
Glycogen Storage Disease Type Ia         $283
Maple Syrup Urine Disease Type 1B        $557
Mucolipidosis IV                         $279
Niemann-Pick Disease Type A              $337
Spinal Muscular Atrophy                  $700
Tay-Sachs Disease                        $473
Total                                    $4743
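As a quick check of the arithmetic, the twelve single-disease prices listed above can be summed in a few lines. The "implied" figure below is only what a hundredfold reduction of that total would come to; it is derived from the slide's claim and is not a quoted Counsyl price.

```python
# Single-disease assay prices (USD) as listed on the slide.
assay_prices = {
    "Bloom Syndrome": 167,
    "Canavan Disease": 473,
    "Cystic Fibrosis": 506,
    "Familial Dysautonomia": 334,
    "Fanconi Anemia": 167,
    "Gaucher Disease": 467,
    "Glycogen Storage Disease Type Ia": 283,
    "Maple Syrup Urine Disease Type 1B": 557,
    "Mucolipidosis IV": 279,
    "Niemann-Pick Disease Type A": 337,
    "Spinal Muscular Atrophy": 700,
    "Tay-Sachs Disease": 473,
}

total = sum(assay_prices.values())

# Assumption: "one hundredfold" is applied to the summed list price.
implied_cost = total / 100

print(f"{len(assay_prices)} assays, total ${total}, implied ~${implied_cost:.0f}")
```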

Engineering at Counsyl

How big is the data in genomics?

Wetlab Biology

Ordering

Reporting

Billing

Fulfillment

Automation Assay Calling

Assay Calling

Big Data Will Save the World

But what is it, anyway?

Background

Wikipedia, “Big Data”: A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

What Defines Big Data

• Computation: data so large that algorithms must be o(N^(1+ε)): “almost linear.”

• Handling: data so large that with tractable algorithms communication becomes more significant than computation.
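To make the "almost linear" requirement concrete, here is a rough sketch comparing an N log N algorithm with an N^2 algorithm at a large input size. The machine speed (`OPS_PER_SECOND`) is an assumed round number chosen for illustration, not a measurement.

```python
import math

# Assumption: one billion simple operations per second, for illustration only.
OPS_PER_SECOND = 1e9


def runtime_seconds(n: int) -> dict:
    """Estimated runtimes for two growth rates at input size n."""
    return {
        "N log2 N": n * math.log2(n) / OPS_PER_SECOND,
        "N^2": n * n / OPS_PER_SECOND,
    }


for n in (10**6, 10**9):
    estimates = runtime_seconds(n)
    print(f"N = {n:.0e}: " + ", ".join(f"{k}: {v:.3g} s" for k, v in estimates.items()))
```

Under that assumption, a billion-element input takes about half a minute with the N log N algorithm and on the order of decades with the quadratic one, which is why quadratic methods drop out of consideration at this scale.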

Why Do People Care?

Big Data is fundamental to fields in which each individual piece of data is relatively information-light, so it is necessary to aggregate a lot of it.

Note (Imran Haque): This particularly characterizes advertising, which funds the consumer Internet. People are interested in Big Data as a means to an end (improving conversion rates), not as an end in itself.

Genomics: Big Data

But not as we know it.

Short-Read Sequencing in Short

I don’t know what they want from me
It’s like the more money we come across
The more problems we see

It’s like the more
w what they wan
acro5s
The more probl
re problems we see
...

Current sequencers can produce ~100 Gb of short (100 bp) reads/day
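The fragmenting shown above can be mimicked with a small sketch: sample fixed-length substrings at random offsets and occasionally corrupt a character, much like the "acro5s" fragment. The function name, read length, and error rate are illustrative assumptions, not parameters of any real sequencer.

```python
import random


def simulate_reads(text, read_len, n_reads, error_rate=0.02, seed=0):
    """Sample fixed-length substrings at random offsets with random character errors.

    Returns a list of (start_offset, read) pairs. A corrupted position is
    replaced by a random character from the text's alphabet (which may
    occasionally be the same character).
    """
    rng = random.Random(seed)
    alphabet = sorted(set(text))
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(text) - read_len + 1)
        chars = list(text[start:start + read_len])
        for i in range(len(chars)):
            if rng.random() < error_rate:
                chars[i] = rng.choice(alphabet)
        reads.append((start, "".join(chars)))
    return reads


lyric = "It's like the more money we come across the more problems we see"
for start, read in simulate_reads(lyric, read_len=15, n_reads=5):
    print(f"{start:3d} {read!r}")

# Throughput arithmetic from the slide: ~100 Gb/day at ~100 bp per read.
reads_per_day = 100e9 / 100
print(f"{reads_per_day:.0e} reads per day")
```

At roughly 100 bases per read, 100 Gb per day works out to about a billion reads per sequencer per day.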

Short-Read Alignment

Reference:  It’s like the more money we come across

Read 1:                               e come acr
Read 2:     It’s like the more
Read 3:                     re data  we c
Read 4:          like the more d
Read 5:                         ata  we come across

Consensus:  It’s like the more data we come across

Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009
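The placement of reads against a reference, as in the lyric example above, can be sketched with a toy seed-and-vote scheme: index every k-character substring of the reference in a hash table, look up each read's substrings, and let the hits vote for an offset. This is a sketch in the spirit of hash-based alignment, not the method of any particular tool; `build_index`, `align`, and `k=4` are names and values chosen for illustration.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align(read, reference, index, k):
    """Return (offset, mismatches) for the best-supported placement, or None."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            offset = pos - j
            # Only keep placements where the whole read fits on the reference.
            if 0 <= offset <= len(reference) - len(read):
                votes[offset] += 1
    if not votes:
        return None
    offset = votes.most_common(1)[0][0]
    window = reference[offset:offset + len(read)]
    mismatches = sum(a != b for a, b in zip(read, window))
    return offset, mismatches


reference = "It's like the more money we come across"
index = build_index(reference, k=4)
for read in ["e come acr", "like the more d", "ata  we come across"]:
    print(repr(read), align(read, reference, index, k=4))
```

Reads that differ from the reference (here, "data" in place of "money") still land at the right offset as long as some of their seeds match exactly; the mismatch count then records where they disagree.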

Alignment Algorithms

• Smith-Waterman: O(MN), large constant factor

• Hash-based Alignment: much smaller constants than SW
  • MAQ, SSAHA

• Burrows-Wheeler Alignment: sublinear in size of genome
  • Bowtie, BWA

Ning, Cox, Mullikin, Genome Res 2001; Li, Ruan, Durbin, Genome Res 2008; Ferragina and Manzini, JACM 2005; Langmead et al, Genome Biol 2009; Li and Durbin, Bioinformatics 2009
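The "sublinear in size of genome" property of Burrows-Wheeler alignment comes from backward search over an FM-index: each pattern character narrows a range of the suffix array, so an exact lookup costs time proportional to the pattern length. Below is a minimal, unoptimized sketch of that idea for exact matching only. It builds the suffix array naively and stores a dense occurrence table, whereas production aligners such as Bowtie and BWA use compressed, sampled structures and handle mismatches; the function names here are illustrative.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # assumed terminator: absent from text, sorts before A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is "$" when i == 0

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c]: number of characters in text strictly smaller than c.
    first, running = {}, 0
    for ch in sorted(counts):
        first[ch] = running
        running += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (c == ch)
    return sa, first, occ


def find(pattern, sa, first, occ):
    """Backward search: sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


genome = "ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG"
sa, first, occ = build_fm_index(genome)
print(find("GGG", sa, first, occ))
```

The loop in `find` runs once per pattern character and never scans the genome, which is the contrast with Smith-Waterman's O(MN) table.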

Real-World Alignments

Reference: ATCCTTTGGGTGTATGGGTCGTAGCGAACTGAGAAGGGCCGAGG

[Figure: pileup of aligned reads under the reference. "." and "," mark bases that match the reference; "C" or "c" marks a mismatching base that appears in a subset of reads at one position.]

PAH:Y414C (heterozygote C/T)

phenylketonuria

Need to align 1.5M reads per sample, across thousands of samples!
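Reading a heterozygote off a pileup like the one on this slide amounts to counting, at one position, how many reads match the reference and how many carry the alternate base. A minimal sketch follows, assuming the column uses "." and "," for reference matches and a letter for a mismatch. The thresholds (`het_min`, `hom_min`), the example column, and the choice of T as the reference base are illustrative assumptions, not clinical cutoffs or data from the talk.

```python
from collections import Counter


def call_genotype(ref_base, column, het_min=0.2, hom_min=0.8):
    """Call a genotype from one pileup column.

    Returns (genotype, alt_fraction). '.' and ',' count as the reference
    base; any letter counts as that (upper-cased) base.
    """
    ref_base = ref_base.upper()
    bases = [ref_base if c in ".," else c.upper() for c in column]
    depth = len(bases)
    alt_counts = Counter(b for b in bases if b != ref_base)
    if depth == 0 or not alt_counts:
        return f"{ref_base}/{ref_base}", 0.0

    alt, n_alt = alt_counts.most_common(1)[0]
    alt_fraction = n_alt / depth
    if alt_fraction >= hom_min:
        return f"{alt}/{alt}", alt_fraction
    if alt_fraction >= het_min:
        return "/".join(sorted((ref_base, alt))), alt_fraction
    return f"{ref_base}/{ref_base}", alt_fraction


# Hypothetical column: 10 reads, 4 of which show C where the reference has T.
print(call_genotype("T", ".C.,c.C,.C"))
```

A roughly even split between reference and alternate reads is what a heterozygous site looks like; real callers also weigh base quality, mapping quality, and strand.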

Genomics: Big Data?

Genomics appears to have all the characteristics of Big Data.

• Large quantity: ~100 Gb/day/sequencer

• Advanced algorithms: BWT alignment in linear/sublinear time

But characteristics of the data itself matter too!

Clinical Genomics: Not That Big

Most of the human genome is currently non-actionable.

Whole Genome Sequencing (~3000 Mb)

Whole Exome Sequencing (~30 Mb)

Clinical Carrier Screening (~1 Mb)

But 100Gb Is Still 100Gb, Right?

Clinical genomics analysis is per-sample.

• Processing is embarrassingly parallel after demultiplexing.

• Handling a single sample is trivial on even a laptop.

Use ZFS and LSF/SGE, not Cassandra and Hadoop.
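"Embarrassingly parallel" here means each sample can be processed with no communication between samples, so a plain map over samples is enough. The sketch below uses a local thread pool as a stand-in for a batch scheduler such as LSF or SGE; `analyze_sample`, the sample names, and the toy GC-content computation are hypothetical placeholders for a real per-sample pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_sample(sample):
    """Stand-in for a per-sample pipeline; here it just computes GC fraction."""
    sample_id, reads = sample
    gc = sum(read.count("G") + read.count("C") for read in reads)
    total = sum(len(read) for read in reads)
    return sample_id, (gc / total if total else 0.0)


# Hypothetical demultiplexed input: one independent work item per sample.
samples = [
    ("sample-A", ["ATCCTTTGGG", "TGTATGGGTC"]),
    ("sample-B", ["GTAGCGAACT", "GAGAAGGGCC"]),
]

# No sample depends on any other, so the work is a plain parallel map.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(analyze_sample, samples))

print(results)
```

On a cluster, each call would instead be one scheduler job reading its sample from shared storage, which is why a batch queue and a solid filesystem cover the need without a distributed data store.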

Why is Genomics Still Interesting?

It’s OK to be Lil’.

Research Genomics

Counsyl runs this many samples every year; clinical = scale.

Target                  # Samples   # SNPs
Education Level         126,559     2.2M
Breast/Ovarian Cancer   11,705      31,812
Diabetes                10,128      2.2M
Telomere Length         37,684      2.4M

Rietveld et al, Science 2013; Couch et al, PLoS Genet 2013; Zeggini et al, Nat Genet 2008; Codd et al, Nat Genet 2013

Clinical Genomics: Big Where It Matters

Whole Genome (3000 Mb)

Clinical Genome (1 Mb)

• Focusing on a small region means you can examine thousands of people: study important regions in great depth.

• Embarrassingly parallel is a good thing: people pay the bills!

Let’s Science Up This Data

N=83,538 samples, 493 variants

Estimated carrier frequency per population as a binomial.

Bonferroni-corrected binomial equality test comparing each population against the pooled data finds variants that are significantly enriched/depleted in particular populations.
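The general shape of such a test can be sketched with the standard library: an exact two-sided binomial test of one population's carrier count against the pooled rate, followed by a Bonferroni multiplier. This is a minimal sketch and not the authors' exact procedure. The carrier count of 94 is an approximation back-computed from "1 in 46" at N = 4330, and `n_tests=493` counts variants only, so the output should not be expected to reproduce P-values from the talk.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_test_two_sided(k, n, p):
    """Exact two-sided test: total probability of outcomes no more likely than k."""
    log_observed = log_binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        log_pmf = log_binom_pmf(i, n, p)
        # Small tolerance so outcomes tied with the observed one are included.
        if log_pmf <= log_observed + 1e-7:
            total += math.exp(log_pmf)
    return min(1.0, total)


def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


# Assumed inputs: ~1 in 46 carriers among 4330 samples vs a pooled rate of 1 in 96.
carriers, n_samples = 94, 4330
pooled_rate = 1 / 96
raw_p = binom_test_two_sided(carriers, n_samples, pooled_rate)
print(raw_p, bonferroni(raw_p, n_tests=493))
```

Working in log space avoids overflow in the binomial coefficient at sample sizes in the thousands.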

Haque et al, in preparation

Smith-Lemli-Opitz Syndrome (DHCR7)

• We see a carrier rate double the predicted literature values (e.g., 1/57 vs 1/124 in Northwestern Europeans)

• We find previously undescribed population associations for DHCR7:IVS8-1G>C

Population   Frequency   Overall Frequency   P-value    N
⬆ AJ         1 in 46     1 in 96             1.18E-11   4330
⬇ EA         0           1 in 96             1.56E-07   2739

Haque et al, in preparation

Genetic Disease in South Asians

Cystic Fibrosis (CFTR)

• 1/57 observed vs 1/118 in literature.

GJB2-related DFNB1 nonsyndromic hearing loss and deafness

• Literature claims 1/133 with 35delG, but we find 1/2191.

• 36/2191 carriers, 35 for W24X.

Progressive cone dystrophy/achromatopsia (CNGB3)

• R403Q present in 1/18: 30% of carriers in 4% of tested pop.

Haque et al, in preparation

Size Doesn’t Matter, It’s How You Use It

• Genomics has a real ground truth.

• Genomics has a real impact.

Clinical genomics is interesting independently of “Big”ness.

Future of Genomics

Cratering prices drive technological shifts.

Technologies at the research frontier will become commercialized.

• Whole-genome association studies

• RNA-seq and transcriptomics

• Epigenomics

• Pathogen sequencing and metagenomics

Where Are We Now?

• Theory has been developed in academia and government.

• Scale-up is just beginning in industry: started with tool vendors, now reaching applications companies.

• New scales of data will feed back into basic R&D.

Recap

Big Data =

• “near linear” algorithms

• communication is harder than computation

Short-read sequencing produces large amounts of data.

Useful clinical insights are mostly derived from embarrassingly-parallel small data.

“Small data” genomics is highly impactful in its own right.

Genomics may enter a “big data” phase in the future with new methods.

</talk>

jobs.counsyl.com

ihaque@counsyl.com
