nyc* 2013 - "analyzing the human genome/dna with cassandra"

Analyzing the human genome/DNA with Cassandra BY SAMEER FAROOQUI [email protected] linkedin.com/in/blueplastic/ @blueplastic http://youtu.be/ziqx2hJY8Hg blueplastic.com/dna.pdf

Upload: planet-cassandra

Post on 11-May-2015


TRANSCRIPT

Page 1: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Analyzing the human genome/DNA with Cassandra

BY SAMEER FAROOQUI

linkedin.com/in/blueplastic/

@blueplastic

http://youtu.be/ziqx2hJY8Hg

blueplastic.com/dna.pdf

Page 2: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Base pairs shown: G-C, T-A, T-A, T-A

Page 3: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

T A A C C C N A A C
A T T G G G N T T G

Adenine, Thymine, Cytosine, Guanine

3.3 billion base pairs

. . .

1 Human: 3 GB (uncompressed), 900 MB (compressed gz)

US population: 900 PB (uncompressed)

Earth population: 2.8 exabytes (uncompressed)

1 Exabyte = 1 million TB

98.5% of genome identical (3 in 10,000 bases differ)
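The per-person and US figures above can be roughly reproduced with a back-of-the-envelope calculation. This is a minimal sketch that assumes one byte per base for uncompressed text and uses the 314 million US population figure quoted later in the talk; the constant and function names are illustrative only.

```python
# Back-of-the-envelope storage estimate.
# Assumption: uncompressed text stores one ASCII byte per base.
BASES_PER_GENOME = 3.3e9   # base pairs, from the slide
BYTES_PER_BASE = 1         # assumption, not stated on the slide
US_POPULATION = 314e6      # figure used later in the talk

GB = 1e9
PB = 1e15

def genome_bytes(bases=BASES_PER_GENOME, bytes_per_base=BYTES_PER_BASE):
    """Raw size of one genome in bytes."""
    return bases * bytes_per_base

def population_bytes(people, per_person_bytes):
    """Raw size of one genome per person, with no replication."""
    return people * per_person_bytes

one_human = genome_bytes()
# The slide rounds one genome down to 3 GB before scaling up.
us_total = population_bytes(US_POPULATION, 3 * GB)

print(f"one human: {one_human / GB:.1f} GB")
print(f"US population: {us_total / PB:.0f} PB")
```

The slide's 900 PB is this US result rounded down.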

Page 4: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Humans: 46 Chromosomes

Figure labels: Mom, Dad; X X and X Y; 3.3 billion bp; 4 billion bp

Sizes shown: 242 MB, 154 MB, 59 MB, 178 MB, 61 MB

X chromosome: 155 million base pairs (5% of total DNA)

Y chromosome: 58 million base pairs (2% of total DNA)

Humans: 46 Chromosomes

Page 5: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Why Chromosomes ??

- Adder's Tongue Fern: 1,200 chromosomes

- Garden Snail: 54 chromosomes, 2 billion bp

- Elephant: 56 chromosomes, 5.8 billion bp

- Gorilla: 48 chromosomes, 3.4 billion bp

- Fruit Fly: 8 chromosomes, 165 million bp

- Onion: 16 chromosomes, ~18 billion bp, highly repetitive

Page 6: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Human Genome Project vs 1000 Genomes Project

Human Genome Project:

- ~15 year project: 1989 – 2003

- Sequenced 99% of the genome (400 gaps)

- >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)

- Cost about $3 billion

1000 Genomes Project:

- Launched Jan 2008

- Oct 2012: 1092 human genomes complete from 14 populations

- Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans

- Work done by 111 global institutions

- Cost about $40 million ($16,000 per person)

Download @ http://www.1000genomes.org/

Page 7: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/

- In 2010: 179 human genomes

- Discussed DNA from 2 families of: Mother / Father / Child

- One of biology’s most cited papers in 2011

Page 8: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

- Feb 2009 assembly of one human genome (hg19)

- One gzipped FASTA file per chromosome

1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

2) gunzip <file>.fa.gz

Download at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes
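Once a chromosome file is unzipped, it is plain FASTA: a header line starting with ">" followed by lines of bases. The following is a minimal sketch of a reader using only the standard library. The `read_fasta` helper and the tiny `chrDemo` sample are made up for illustration; a real `chr*.fa` file would be opened with `open(...)` instead. Upper-casing assumes that lowercase letters in the file are still ordinary bases.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# Made-up sample standing in for a real chromosome file.
sample = io.StringIO(">chrDemo\nNNNNtaaccc\nTAACCCtaac\n")
records = list(read_fasta(sample))

for name, seq in records:
    counts = {base: seq.count(base) for base in "ACGTN"}
    print(name, len(seq), counts)
```

Because the reader is a generator, it emits one record at a time, although each chromosome's sequence is still joined into a single string in memory.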

Page 9: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

http://useast.ensembl.org/Homo_sapiens/Location/Genome

Exploring DNA from your browser…

Click here

Chromosome 2

Gene: MCM6

SNP: rs4988235

Position: 136,608,646 bp from pter

A:T - Can digest milk

G:C - Lactose intolerance

Page 10: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G
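The second strand shown here follows from the first by base pairing (A with T, C with G). A minimal sketch, assuming the strands are paired position by position as drawn on the slide (no reversal) and that an unknown base N pairs with N:

```python
# Base-pairing rules; N (unknown base) is assumed to pair with N.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}

def complement(strand):
    """Return the position-by-position complementary strand."""
    return "".join(COMPLEMENT[base] for base in strand)

top = "TAACCC" * 5
bottom = complement(top)
print(" ".join(top))
print(" ".join(bottom))
```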

Chromosome #1: 250 million base pairs (across both C-pairs) (8% of total DNA)

4,316 known genes

P (short arm), centromere, Q (long arm)

pter to qter; band scale: 36 ... 0 | 0 ... 43

Bands labeled: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43

Page 11: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

24 Column Families (22 pairs + X + Y): Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y

Chrom-1

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father

Compound Keys: humanID:cell_type:parent

Partition key : remaining keys
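One way to picture the 24 column families and the compound key is as CQL table definitions. The sketch below only builds the statement text and does not connect to a cluster. The table names (`chrom_1` ... `chrom_y`), the column names, and the extra `position` clustering column are all assumptions made for illustration, not the schema used in the talk.

```python
# Hypothetical names throughout: chrom_<n> tables and the
# human_id / cell_type / parent / position / codon columns.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["x", "y"]

def create_table_cql(chrom):
    """Build a CREATE TABLE statement for one chromosome.

    The first PRIMARY KEY component (human_id) is the partition key;
    the remaining components are clustering columns.
    """
    return (
        f"CREATE TABLE chrom_{chrom} (\n"
        "    human_id text,\n"
        "    cell_type text,\n"
        "    parent text,\n"
        "    position int,\n"
        "    codon text,\n"
        "    PRIMARY KEY (human_id, cell_type, parent, position)\n"
        ");"
    )

statements = [create_table_cql(chrom) for chrom in CHROMOSOMES]
print(len(statements))
print(statements[0])
```

Keeping `human_id` alone as the partition key means every row for one person lands in the same partition of each table.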

Page 12: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Chrom-1 (columns 1, 2, ...), Chrom-2, Chrom-X, Chrom-Y

X Y

X X

humanID      cell    parent
595-36-0000  normal  mother
595-36-0000  normal  father
595-36-1111  normal  mother
595-36-1111  normal  father

Chrom-1 Column Family on disk:

595-36-0000
    [normal, mother, 1]: TAG
    [normal, mother, 2]: GCC
    [normal, father, 1]: TAG
    [normal, father, 2]: GCC

595-36-1111
    [normal, mother, 1]: TAG
    [normal, mother, 2]: GCC
    [normal, father, 1]: TAG
    [normal, father, 2]: GCC
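The on-disk picture above can be imitated in memory: one wide row per humanID, with each cell stored under a composite name of (cell_type, parent, column). This is a rough model with plain dictionaries, not Cassandra's actual storage code. The sketch sorts composite names lexicographically, so "father" comes before "mother", unlike the order drawn on the slide.

```python
from collections import defaultdict

# Logical cells: (humanID, cell_type, parent, column, codon), values from the slide.
cells = [
    (human_id, "normal", parent, position, codon)
    for human_id in ("595-36-0000", "595-36-1111")
    for parent in ("mother", "father")
    for position, codon in ((1, "TAG"), (2, "GCC"))
]

def to_wide_rows(cells):
    """Group cells by partition key and sort composite column names."""
    partitions = defaultdict(dict)
    for human_id, cell_type, parent, position, codon in cells:
        partitions[human_id][(cell_type, parent, position)] = codon
    return {pk: dict(sorted(cols.items())) for pk, cols in partitions.items()}

wide = to_wide_rows(cells)
for human_id, columns in wide.items():
    print(human_id)
    for name, value in columns.items():
        print("   ", list(name), value)
```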

Page 13: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Partition based on humanID w/ Murmur3Partitioner

Chrom-1 ... Chrom-Y

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father

Ring of token ranges: A, B, C, D

Send to range A

Send to range D

Now it's possible to do row range scans down the same humanID… and get all the DNA for human #1000
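The placement rule can be sketched as: hash the partition key (humanID) to a token, then find the range that owns the token. The sketch below uses MD5 from the standard library purely as a stand-in hash, since Murmur3 is not in the standard library; it is not Murmur3Partitioner. The four equal ranges A to D and the `token` / `owner` helpers are illustrative assumptions.

```python
import hashlib

NODES = ["A", "B", "C", "D"]

def token(partition_key):
    """Stand-in token: first 8 bytes of MD5 as an unsigned int (NOT Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def owner(partition_key, nodes=NODES):
    """Map a token onto equal-sized contiguous ranges, one per node."""
    range_size = 2**64 // len(nodes)
    return nodes[min(token(partition_key) // range_size, len(nodes) - 1)]

rows = [
    ("595-36-0000", "normal", "mother"),
    ("595-36-0000", "normal", "father"),
    ("595-36-0000", "cancer", "mother"),
    ("595-36-0000", "cancer", "father"),
    ("595-36-1111", "normal", "mother"),
    ("595-36-1111", "normal", "father"),
]

# Only the humanID (first key component) decides placement.
placement = {row: owner(row[0]) for row in rows}
for row, node in placement.items():
    print(row, "-> range", node)
```

Because only the humanID is hashed, all rows for one person end up in the same range, which is what makes the scan down a single humanID possible.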

Page 14: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Chromosome #1: 4,316 genes

Bands labeled: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43

ABCA4 AGL AMPD1 BSND CDC73 CHRNB2 COL8A2 CPT2 DBT EDARADD

ACADM ALDH4A1 ASPM CACNA1S CFH CLCNKA COL9A2 CRB1 DIRAS3 EGLN1

ACTA1 ALPL ATP1A2 CASQ2 CFHR5 CLCNKB COL11A1 DARS2 DPYD EIF2B3

PSEN2 (25,000 bp)

- Location: 1q31-q42, 227,058,272 – 227,083,803

- Protein: Presenilin 2

- 11 m: Alzheimer's

Neuroblastoma Cancer (deletion of: 1p36.1 – 1p36.3)

DIRAS3 (4,800 bp)

- Location: 1q31, 68,511,644 - 68,516,459

- Protein: DIRAS3 protein

- Breast Cancer
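The gene sizes quoted on this slide follow from the start and end coordinates. A minimal sketch, assuming the positions are 1-based and inclusive (the slide does not say); `span_length` and `slice_gene` are illustrative helper names.

```python
def span_length(start, end):
    """Length in base pairs of a 1-based, inclusive coordinate range."""
    return end - start + 1

def slice_gene(chromosome_seq, start, end):
    """Extract a 1-based, inclusive range from a chromosome string."""
    return chromosome_seq[start - 1:end]

# Coordinates as shown on the slide.
genes = {
    "PSEN2": (227_058_272, 227_083_803),
    "DIRAS3": (68_511_644, 68_516_459),
}

for name, (start, end) in genes.items():
    print(name, span_length(start, end), "bp")

# Toy sequence: positions 3..5 of "TAACCC" are "ACC".
print(slice_gene("TAACCC", 3, 5))
```

The results, 25,532 bp and 4,816 bp, match the rounded 25,000 bp and 4,800 bp on the slide.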

Page 15: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Conditions related to genes on Chromosome 1

- Alzheimer disease

- Neuroblastoma

- breast cancer

- color vision deficiency

- early-onset glaucoma

- Emery-Dreifuss muscular dystrophy

- Parkinson disease

Page 16: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Conditions related to genes on Chromosome 1

- actin-accumulation myopathy
- adenosine monophosphate deaminase deficiency
- age-related macular degeneration
- Alagille syndrome
- Alzheimer disease
- amyotrophic lateral sclerosis
- anencephaly
- ankylosing spondylitis
- arrhythmogenic right ventricular cardiomyopathy
- atypical hemolytic-uremic syndrome
- auriculo-condylar syndrome
- autosomal dominant nocturnal frontal lobe epilepsy
- autosomal recessive primary microcephaly
- Bartter syndrome
- 3-beta-hydroxysteroid dehydrogenase deficiency
- breast cancer
- cap myopathy
- carnitine palmitoyltransferase II deficiency
- catecholaminergic polymorphic ventricular tachycardia
- Charcot-Marie-Tooth disease
- Chediak-Higashi syndrome
- chronic granulomatous disease
- color vision deficiency
- congenital fiber-type disproportion
- congenital hypothyroidism
- congenital insensitivity to pain with anhidrosis
- Cowden syndrome
- Crohn disease
- dense deposit disease
- Diamond-Blackfan anemia
- dihydropyrimidine dehydrogenase deficiency
- early-onset glaucoma
- Ehlers-Danlos syndrome
- Emery-Dreifuss muscular dystrophy
- essential thrombocythemia
- factor V Leiden thrombophilia
- familial adenomatous polyposis
- familial cold autoinflammatory syndrome
- familial erythrocytosis
- familial hemiplegic migraine
- familial hypertrophic cardiomyopathy
- familial hypobetalipoproteinemia
- familial isolated hyperparathyroidism
- familial restrictive cardiomyopathy
- Fuchs endothelial dystrophy
- fucosidosis
- fumarase deficiency
- galactosemia
- gastrointestinal stromal tumor
- Gaucher disease
- Gitelman syndrome
- GLUT1 deficiency syndrome
- glycogen storage disease type III
- Greenberg dysplasia
- hemochromatosis
- hereditary antithrombin deficiency
- hereditary leiomyomatosis and renal cell cancer
- hereditary paraganglioma-pheochromocytoma
- hereditary sensory and autonomic neuropathy type V
- homocystinuria
- Hutchinson-Gilford progeria syndrome
- 3-hydroxy-3-methylglutaryl-CoA lyase deficiency
- hypercholesterolemia
- hypermanganesemia with dystonia, polycythemia, and cirrhosis
- hyperparathyroidism-jaw tumor syndrome
- hyperprolinemia
- hypohidrotic ectodermal dysplasia
- hypokalemic periodic paralysis
- hypophosphatasia
- idiopathic inflammatory myopathy
- intranuclear rod myopathy
- junctional epidermolysis bullosa
- juvenile idiopathic arthritis
- Kufs disease
- Leber congenital amaurosis
- leukoencephalopathy with brainstem and spinal cord involvement and lactate elevation
- leukoencephalopathy with vanishing white matter
- limb-girdle muscular dystrophy
- malignant hyperthermia
- maple syrup urine disease
- medium-chain acyl-CoA dehydrogenase deficiency
- Muckle-Wells syndrome
- multiminicore disease
- multiple epiphyseal dysplasia
- nemaline myopathy
- neonatal onset multisystem inflammatory disease
- neuroblastoma
- nonsyndromic deafness
- nonsyndromic paraganglioma
- Noonan syndrome
- osteogenesis imperfecta
- Parkinson disease
- popliteal pterygium syndrome
- porphyria
- primary myelofibrosis
- psoriatic arthritis
- pyruvate kinase deficiency
- REN-related kidney disease
- retinitis pigmentosa
- rhizomelic chondrodysplasia punctata
- severe congenital neutropenia
- Shprintzen-Goldberg syndrome
- spina bifida
- Stargardt macular degeneration
- Stickler syndrome
- systemic scleroderma
- thiamine-responsive megaloblastic anemia syndrome
- thrombocytopenia-absent radius syndrome
- trimethylaminuria
- Usher syndrome
- van der Woude syndrome
- vitiligo
- Vohwinkel syndrome
- WNT4 Müllerian aplasia and ovarian dysfunction

Page 17: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

What read queries do we want to perform?

Write once, read many times type of database

1) Give me the PSEN2 gene for 2,000 people w/ Alzheimer's (25,000 sequential bp)

2) Give me all of the humans who have the lactose intolerance SNP on CR-2

Page 18: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Translation: DNA -> Proteins

DNA:

A T T G G G A T T G G G A T T T G A
T A A C C C T A A C C C T A A A C T

Each group of three bases is a codon.

Amino Acids (20 different types):

Isoleucine Glycine Isoleucine Glycine Isoleucine STOP

Protein:

I G I G I
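The codon-to-amino-acid step on this slide can be sketched as a table lookup over groups of three bases. The table below is deliberately partial: it holds only the three codons that appear on the slide, so any other codon raises a `KeyError`. The `translate` helper is an illustrative name, and the sketch reads the DNA strand directly, as the slide does.

```python
# Partial codon table: only the codons shown on the slide.
CODON_TABLE = {"ATT": "Isoleucine", "GGG": "Glycine", "TGA": "STOP"}

def translate(dna):
    """Translate a DNA string codon by codon until a STOP codon."""
    dna = dna.replace(" ", "").upper()
    protein = []
    # Ignore a trailing partial codon, if any.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

strand = "A T T G G G A T T G G G A T T T G A"
protein = translate(strand)
print(protein)
print("".join(name[0] for name in protein))
```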

Page 19: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Translation: ATT -> Isoleucine

Page 20: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Chrom-1 ... Chrom-21

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
595-36-1111  normal     3rd

Columns (P short arm, centromere, Q long arm; band scale 36 ... 0 | 0 ... 43):

1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7

TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG TAA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG CTG NNN GAT
TAG GCC CAG CAG TCC TCA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT

125 million bp / 3 = 41 m cols

Point Mutation (SNP): TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG TAA CTG

Deletion Mutation: TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG ___ CTG

Insertion Mutation: TAG GCC CAG CAG TCA CTG -> TAG GCC CAG CAG TCC TCA CTG

1q43.7492932
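The three mutation types and the column count can be illustrated with a small comparison over codon lists. This is a naive sketch that assumes codons are space-separated and that a whole codon is inserted or deleted, as in the slide's examples; it is not a real alignment algorithm. `classify` and `column_count` are hypothetical helper names.

```python
def codons(sequence):
    """Split a space-separated codon string into a list of codons."""
    return sequence.split()

def classify(reference, sample):
    """Naively label a sample as point, deletion, insertion, or identical."""
    ref, alt = codons(reference), codons(sample)
    if len(alt) < len(ref):
        return "deletion"
    if len(alt) > len(ref):
        return "insertion"
    differs = any(r != a for r, a in zip(ref, alt))
    return "point" if differs else "identical"

def column_count(base_pairs, bases_per_column=3):
    """Number of whole codon columns needed for a run of bases."""
    return base_pairs // bases_per_column

reference = "TAG GCC CAG CAG TCA CTG"
print(classify(reference, "TAG GCC CAG CAG TAA CTG"))      # point mutation (SNP)
print(classify(reference, "TAG GCC CAG CAG CTG"))          # deletion
print(classify(reference, "TAG GCC CAG CAG TCC TCA CTG"))  # insertion
print(column_count(125_000_000))
```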

Page 21: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Chrom-1

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
595-36-1111  normal     3rd

Chrom-21

1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7

TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG TAA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG CTG NNN GAT
TAG GCC CAG CAG TCC TCA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT

Excellent candidate for compression! 4x reduction in total data size + 35% faster reads

Create Secondary Index to detect SNPs

To get all of the people with the SNP:

cqlsh:dna_table> SELECT "humanID" FROM "Chrom-1" WHERE "1q0" = 'TAA';
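Conceptually, a secondary index on a column is a reverse lookup from a column value to the rows that hold it. The sketch below models that with a dictionary of sets; it illustrates the idea and does not describe how Cassandra implements its indexes. Which sequence belongs to which person is an assumption made for the example.

```python
from collections import defaultdict

# Illustrative rows: (humanID, cell_type, parent) -> {column name: codon}.
# The assignment of codons to people is made up for this sketch.
rows = {
    ("595-36-0000", "normal", "mother"): {"1q0": "TCA", "1q1": "CTG"},
    ("595-36-0000", "normal", "father"): {"1q0": "TAA", "1q1": "CTG"},
    ("595-36-1111", "normal", "mother"): {"1q0": "TCA", "1q1": "CTG"},
    ("595-36-1111", "normal", "father"): {"1q0": "TCA", "1q1": "CTG"},
}

def build_index(table, column):
    """Map each value of `column` to the set of humanIDs that carry it."""
    index = defaultdict(set)
    for (human_id, _cell_type, _parent), columns in table.items():
        if column in columns:
            index[columns[column]].add(human_id)
    return index

index_1q0 = build_index(rows, "1q0")
print(sorted(index_1q0["TAA"]))
```

With the index in place, finding everyone with the SNP is a single lookup instead of a scan over every row.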

Page 22: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Query: Give me the X gene for 2 people

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
595-36-1111  normal     3rd

Chrom-1 ... Chrom-21

1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7

TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG TAA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG CTG NNN GAT
TAG GCC CAG CAG TCC TCA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT

cqlsh:dna_table> SELECT "1q0", "1q1" FROM "Chrom-1" WHERE "humanID" IN ('595-36-0000', '595-36-1111');

Page 23: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Storing the total USA population Genome in Cassandra (314 million people)

Column families: Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y

P-key: SS:CT:M, SS:CT:F (repeated for each person)

Labels: 41 million columns; 9 million columns; 3 billion cols; 125 MB of data; 125 MB of data; 1.5 GB

630 million rows (2 for each person) x 1.5 GB = 900 PB

46,000 nodes (20 TB each), No Replication

1000 Genomes Project: Oct 2012: 1092 genomes sequenced, 3.2 TB data total
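The cluster-size estimate on this slide is straightforward arithmetic. A minimal sketch using integer math and decimal units (1 TB = 10^12 bytes, an assumption); the `replication_factor` parameter is added here only to show how the node count would scale and is not part of the slide.

```python
PEOPLE = 314_000_000
ROWS_PER_PERSON = 2                 # one row per parent, as on the slide
BYTES_PER_ROW = 1_500_000_000       # 1.5 GB per row
NODE_CAPACITY = 20_000_000_000_000  # 20 TB per node (decimal units assumed)

def total_bytes(people=PEOPLE):
    """Raw data size for the whole population, before replication."""
    return people * ROWS_PER_PERSON * BYTES_PER_ROW

def nodes_needed(data_bytes, node_capacity=NODE_CAPACITY, replication_factor=1):
    """Ceiling division: the number of full nodes required to hold the data."""
    replicated = data_bytes * replication_factor
    return -(-replicated // node_capacity)

data = total_bytes()
print(PEOPLE * ROWS_PER_PERSON, "rows")
print(data // 10**15, "PB")
print(nodes_needed(data), "nodes without replication")
print(nodes_needed(data, replication_factor=3), "nodes with 3 replicas")
```

The unrounded inputs give about 47,000 nodes, close to the slide's 46,000, which starts from figures rounded to 900 PB.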

Page 24: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Chart: Cost per Human Genome sequence, 2001 to 2012

Linear scale, $20 million increments ($0 to $120,000,000)

Huh ?

Page 25: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Chart: Cost per Human Genome sequence, 2001 to 2012 (series: Genome Sequencing, Moore's Law)

Logarithmic scale, 10x increments ($100 to $100,000,000)

Super Logarithmic Scale!

Jan 2008: Switched to next-gen sequencing

Page 26: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Coding vs Non-coding DNA

Coding Non-coding

98% non-coding DNA

Page 27: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G

8% of human DNA (98,000 fragments)

HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta

Page 28: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Free Python tools for biological computation: http://biopython.org

Pycassa Cassandra Python client library: https://github.com/pycassa/pycassa

DataStax Community Edition of Cassandra + OpsCenter: http://www.datastax.com/download/community

Get Ubuntu 12.10: http://www.ubuntu.com/download (note: CentOS/Red Hat has install issues with Biopython)

Page 29: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Sameer Farooqui

- Freelance Big Data consultant and trainer

- Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack

Ex: Hortonworks, Accenture R&D, Symantec

linkedin.com/in/blueplastic/

@blueplastic

http://youtu.be/ziqx2hJY8Hg

- Co-author on v2 of Cassandra book

- Coming late 2013

blueplastic.com/dna.pdf

Polychaos dubium: 620 billion bp (200x humans)

Page 30: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

James Watson: How we discovered DNA

http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html

Juan Enriquez: The life-code that will reshape the future

http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html

Page 31: NYC* 2013 - "Analyzing the Human Genome/DNA with Cassandra"

Resources to get started for beginners…