nyc* 2013 - "analyzing the human genome/dna with cassandra"

Analyzing the human genome/DNA with Cassandra

By Sameer [email protected]

linkedin.com/in/blueplastic/

@blueplastic

http://youtu.be/ziqx2hJY8Hg

blueplastic.com/dna.pdf

[Figure: DNA double helix with paired bases G-C and T-A]

T A A C C C N A A C
A T T G G G N T T G

Bases: Adenine (A), Thymine (T), Cytosine (C), Guanine (G)

3.3 billion base pairs

. . .

1 Human: 3 GB (uncompressed), 900 MB (compressed gz)

US population: 900 PB (uncompressed)

Earth population: 2.8 exabytes (uncompressed)

1 Exabyte = 1 million TB
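The per-person and US figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one byte per base (plain ASCII text) and binary units (1 PB = 1024 x 1024 GB); the constant and function names are illustrative and not from the talk.

```python
# Back-of-the-envelope genome storage estimate.
# Assumptions: 1 byte per base (ASCII text) and binary units.
BASES_PER_GENOME = 3_300_000_000   # 3.3 billion base pairs (from the slides)
US_POPULATION = 314_000_000        # 314 million people (from the slides)
GB = 1024 ** 3
PB = 1024 ** 5


def genome_bytes(bases=BASES_PER_GENOME, bits_per_base=8):
    """Bytes needed to store one genome at the given encoding width."""
    return bases * bits_per_base // 8


# One genome as plain text: a little over 3 GB.
per_person_gb = genome_bytes() / GB

# The whole US population at the rounded 3 GB per person: close to 900 PB.
us_total_pb = US_POPULATION * 3 * GB / PB

# Four bases fit in 2 bits each, so a packed genome is 4x smaller.
packed_bytes = genome_bytes(bits_per_base=2)

print(f"{per_person_gb:.2f} GB per person, {us_total_pb:.0f} PB for the US")
```

The 2-bit figure is only there to show how much slack the text encoding leaves; it ignores the N (unknown) symbol, which needs extra bookkeeping.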

98.5% of genome identical (3 in 10,000 bases differ)

[Figure: per-chromosome sizes: 242 MB, 154 MB, 59 MB, 178 MB, 61 MB]

Y chromosome: 58 million base pairs (2% of total DNA)

X chromosome: 155 million base pairs (5% of total DNA)

X YX X

3.3 billion bp

4 billion bp

Mom Dad

Humans: 46 Chromosomes

Why chromosomes?

- Adder's Tongue Fern: 1,200 chromosomes
- Garden Snail: 54 chromosomes, 2 billion bp
- Elephant: 56 chromosomes, 5.8 billion bp
- Gorilla: 48 chromosomes, 3.4 billion bp
- Fruit Fly: 8 chromosomes, 165 million bp
- Onion: 16 chromosomes, ~18 billion bp (highly repetitive)

Human Genome Project vs 1000 Genomes Project

Human Genome Project:

- ~15 year project: 1989 – 2003
- Sequenced 99% of the genome (400 gaps)
- >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)
- Cost about $3 billion

1000 Genomes Project:

- Launched Jan 2008
- Oct 2012: 1092 human genomes complete from 14 populations
- Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans
- Work done by 111 global institutions
- Cost about $40 million ($16,000 per person)

Download @ http://www.1000genomes.org/


Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/

- In 2010: 179 human genomes

- Discussed DNA from 2 families of: Mother / Father / Child

- One of biology’s most cited papers in 2011


- Feb 2009 assembly of one human genome (hg19)

- One gzip FASTA file per chromosome

1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

2) gunzip <file>.fa.gz
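Once a chromosome file is downloaded, it can be read with nothing but the Python standard library. The sketch below is a minimal FASTA reader, not the tooling used in the talk; the function names are illustrative, and it accepts either the gzip file or the unzipped one.

```python
import gzip
from collections import Counter


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def base_counts(sequence):
    """Count each symbol (A, C, G, T, N), ignoring letter case."""
    return Counter(sequence.upper())
```

Usage would look like `for name, seq in read_fasta("chr21.fa.gz"): print(name, base_counts(seq))`, where the file name is a placeholder for whichever chromosome file was fetched. Note that this loads a whole chromosome into memory, which is fine for one file but not for bulk loading.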

Download at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes

http://useast.ensembl.org/Homo_sapiens/Location/Genome

Exploring DNA from your browser…

Chromosome 2
Gene: MCM6
SNP: rs4988235
Position: 136,608,646 bp from pter

A:T - Can digest milk
G:C - Lactose intolerance

T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G

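Chromosome #1: 250 million base pairs (across both C-pairs), 8% of total DNA, 4,316 known genes

[Figure: chromosome 1 ideogram from pter to qter: P (short arm), centromere, Q (long arm). Bands are numbered from 36 at pter down to 0 at the centromere, then from 0 up to 43 at qter. Labeled bands: 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]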

Chrom-1

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father

Compound Keys: humanID:cell_type:parent

Partition key : remaining keys

24 Column Families (22 pairs + X + Y): Chrom-1, Chrom-2, Chrom-3, ... Chrom-Y
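The split between the partition key and the remaining keys can be illustrated in plain Python. This is a sketch of the idea only, not Cassandra code: `RowKey` and the helper names are made up for illustration, and the rows are the sample keys from the table above.

```python
from collections import defaultdict, namedtuple

# Compound key humanID:cell_type:parent, as on the slide.
RowKey = namedtuple("RowKey", ["human_id", "cell_type", "parent"])


def partition_key(key):
    """First component: decides which node owns the row."""
    return key.human_id


def remaining_keys(key):
    """Remaining components: order the data inside one partition."""
    return (key.cell_type, key.parent)


def group_by_partition(keys):
    """Group compound keys the way a partitioner would co-locate them."""
    partitions = defaultdict(list)
    for key in keys:
        partitions[partition_key(key)].append(remaining_keys(key))
    return dict(partitions)


keys = [
    RowKey("595-36-0000", "normal", "mother"),
    RowKey("595-36-0000", "normal", "father"),
    RowKey("595-36-0000", "cancer", "mother"),
    RowKey("595-36-0000", "cancer", "father"),
    RowKey("595-36-1111", "normal", "mother"),
    RowKey("595-36-1111", "normal", "father"),
]

for human_id, parts in group_by_partition(keys).items():
    print(human_id, parts)
```

Everything for one humanID ends up in the same group, which is what lets a single node answer for all of that person's cell types and both parental copies.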

Chrom-1

humanID      cell    parent
595-36-0000  normal  mother
595-36-0000  normal  father
595-36-1111  normal  mother
595-36-1111  normal  father

1 2Chrom-2 Chrom-X Chrom-Y

X Y

X X

Chrom-1 Column Family on disk

595-36-0000  [normal, mother, 1]: TAG
             [normal, mother, 2]: GCC
             [normal, father, 1]: TAG
             [normal, father, 2]: GCC

595-36-1111  [normal, mother, 1]: TAG
             [normal, mother, 2]: GCC
             [normal, father, 1]: TAG
             [normal, father, 2]: GCC

Chrom-1

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father

Chrom-Y

Partition based on humanID w/ Murmur3Partitioner

Send to range A

Send to range D

Now it's possible to do row range scans down the same humanID and get all the DNA for human #1000.

[Figure: ring of four token ranges: A, B, C, D]
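How a humanID ends up in range A or range D can be sketched with a hash and four equal token ranges. Murmur3 is not in the Python standard library, so this sketch swaps in MD5 purely as a stand-in hash; the range labels match the ring above, and the function names are illustrative.

```python
import hashlib

RANGES = ["A", "B", "C", "D"]   # four token ranges, as on the ring


def token(human_id):
    """Map a partition key to a 64-bit token.

    Assumption: MD5 stands in for Murmur3 here; the real partitioner
    produces different token values.
    """
    digest = hashlib.md5(human_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def token_range(human_id, ranges=RANGES):
    """Pick the range that owns this key's token."""
    width = 2 ** 64 // len(ranges)
    return ranges[min(token(human_id) // width, len(ranges) - 1)]


for human_id in ("595-36-0000", "595-36-1111"):
    print(human_id, "-> range", token_range(human_id))
```

Only the humanID is hashed, so every row for the same person (normal or cancer, mother or father) lands in the same range.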

Chromosome #1: 4,316 genes

[Figure: chromosome 1 ideogram with bands 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]

PSEN2 (25,000 bp)
1q31-q42, 227,058,272 – 227,083,803
Presenilin 2
11 m: Alzheimer's

Neuroblastoma Cancer (deletion of: 1p36.1 – 1p36.3)

ABCA4 AGL AMPD1 BSND CDC73 CHRNB2 COL8A2 CPT2 DBT EDARADD

ACADM ALDH4A1 ASPM CACNA1S CFH CLCNKA COL9A2 CRB1 DIRAS3 EGLN1

ACTA1 ALPL ATP1A2 CASQ2 CFHR5 CLCNKB COL11A1 DARS2 DPYD EIF2B3

DIRAS3 (4,800 bp)
1q31, 68,511,644 - 68,516,459
DIRAS3 protein
Breast Cancer

Conditions related to genes on Chromosome 1

- Alzheimer disease
- Neuroblastoma
- Breast cancer
- Color vision deficiency
- Early-onset glaucoma
- Emery-Dreifuss muscular dystrophy
- Parkinson disease

Conditions related to genes on Chromosome 1 (full list)

actin-accumulation myopathy; adenosine monophosphate deaminase deficiency; age-related macular degeneration; Alagille syndrome; Alzheimer disease; amyotrophic lateral sclerosis; anencephaly; ankylosing spondylitis; arrhythmogenic right ventricular cardiomyopathy; atypical hemolytic-uremic syndrome; auriculo-condylar syndrome; autosomal dominant nocturnal frontal lobe epilepsy; autosomal recessive primary microcephaly; Bartter syndrome; 3-beta-hydroxysteroid dehydrogenase deficiency; breast cancer; cap myopathy; carnitine palmitoyltransferase II deficiency; catecholaminergic polymorphic ventricular tachycardia; Charcot-Marie-Tooth disease; Chediak-Higashi syndrome; chronic granulomatous disease; color vision deficiency; congenital fiber-type disproportion; congenital hypothyroidism; congenital insensitivity to pain with anhidrosis; Cowden syndrome; Crohn disease; dense deposit disease; Diamond-Blackfan anemia; dihydropyrimidine dehydrogenase deficiency; early-onset glaucoma; Ehlers-Danlos syndrome; Emery-Dreifuss muscular dystrophy; essential thrombocythemia; factor V Leiden thrombophilia; familial adenomatous polyposis

familial cold autoinflammatory syndrome; familial erythrocytosis; familial hemiplegic migraine; familial hypertrophic cardiomyopathy; familial hypobetalipoproteinemia; familial isolated hyperparathyroidism; familial restrictive cardiomyopathy; Fuchs endothelial dystrophy; fucosidosis; fumarase deficiency; galactosemia; gastrointestinal stromal tumor; Gaucher disease; Gitelman syndrome; GLUT1 deficiency syndrome; glycogen storage disease type III; Greenberg dysplasia; hemochromatosis; hereditary antithrombin deficiency; hereditary leiomyomatosis and renal cell cancer; hereditary paraganglioma-pheochromocytoma; hereditary sensory and autonomic neuropathy type V; homocystinuria; Hutchinson-Gilford progeria syndrome; 3-hydroxy-3-methylglutaryl-CoA lyase deficiency; hypercholesterolemia; hypermanganesemia with dystonia, polycythemia, and cirrhosis; hyperparathyroidism-jaw tumor syndrome; hyperprolinemia; hypohidrotic ectodermal dysplasia; hypokalemic periodic paralysis; hypophosphatasia; idiopathic inflammatory myopathy; intranuclear rod myopathy; junctional epidermolysis bullosa; juvenile idiopathic arthritis; Kufs disease; Leber congenital amaurosis; leukoencephalopathy with brainstem and spinal cord involvement and lactate elevation

leukoencephalopathy with vanishing white matter; limb-girdle muscular dystrophy; malignant hyperthermia; maple syrup urine disease; medium-chain acyl-CoA dehydrogenase deficiency; Muckle-Wells syndrome; multiminicore disease; multiple epiphyseal dysplasia; nemaline myopathy; neonatal onset multisystem inflammatory disease; neuroblastoma; nonsyndromic deafness; nonsyndromic paraganglioma; Noonan syndrome; osteogenesis imperfecta; Parkinson disease; popliteal pterygium syndrome; porphyria; primary myelofibrosis; psoriatic arthritis; pyruvate kinase deficiency; REN-related kidney disease; retinitis pigmentosa; rhizomelic chondrodysplasia punctata; severe congenital neutropenia; Shprintzen-Goldberg syndrome; spina bifida; Stargardt macular degeneration; Stickler syndrome; systemic scleroderma; thiamine-responsive megaloblastic anemia syndrome; thrombocytopenia-absent radius syndrome; trimethylaminuria; Usher syndrome; van der Woude syndrome; vitiligo; Vohwinkel syndrome; WNT4 Müllerian aplasia and ovarian dysfunction

What read queries do we want to perform?

1) Give me the PSEN2 gene for 2,000 people w/ Alzheimer's (25,000 sequential bp)

Write once, read many times type of database

2) Give me all of the humans who have the lactose intolerance SNP on CR-2

Translation: DNA -> Proteins

A T T G G G A T T G G G A T T T G A

T A A C C C T A A C C C T A A A C T

Isoleucine Glycine Isoleucine Glycine Isoleucine STOP

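DNA -> Amino Acids -> Protein

[Figure: each codon (3 bases) maps to one amino acid (20 different types); the amino acids chain into a protein: I - G - I - G - I]

Translation: ATT -> Isoleucine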
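The codon lookup on this slide can be sketched in a few lines. The table below is deliberately partial: it holds only the codons used in the example plus the other two stop codons and ATG, whereas the full genetic code has 64 entries. Function names are illustrative.

```python
# Partial codon table (DNA codons on the coding strand).
# Assumption: only the entries needed for the slide's example are listed.
CODON_TABLE = {
    "ATT": "Isoleucine",
    "GGG": "Glycine",
    "ATG": "Methionine",
    "TGA": "STOP",
    "TAA": "STOP",
    "TAG": "STOP",
}


def codons(sequence):
    """Split a DNA string into 3-base codons, dropping any trailing remainder."""
    seq = sequence.replace(" ", "").upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def translate(sequence):
    """Translate codon by codon until the first STOP codon."""
    acids = []
    for codon in codons(sequence):
        acid = CODON_TABLE.get(codon, "unknown")
        acids.append(acid)
        if acid == "STOP":
            break
    return acids


print(translate("A T T G G G A T T G G G A T T T G A"))
```

Running it on the strand from the slide gives Isoleucine, Glycine, Isoleucine, Glycine, Isoleucine, STOP, matching the labels above.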

Chrom-1

humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
595-36-1111  normal     3rd

1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7

TAG GCC CAG CAG TCA CTG NNN GAT

TAG GCC CAG CAG TAA CTG NNN GAT


TAG GCC CAG CAG CTG NNN GAT

TAG GCC CAG CAG TCC TCA CTG NNN GAT


P (short arm), centromere, Q (long arm); bands run 36 ... 0 | 0 ... 43

125 million bp / 3 = 41 million columns
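The column count follows from storing one codon (3 bases) per column. A small sketch of that arithmetic and of the row layout, assuming columns are simply numbered from 1 as in the on-disk view earlier; the helper names are illustrative.

```python
import math


def column_count(base_pairs, bases_per_column=3):
    """Number of columns needed when each column holds one codon."""
    return math.ceil(base_pairs / bases_per_column)


def to_columns(sequence, bases_per_column=3):
    """Lay a DNA string out as {column_number: codon}, numbered from 1."""
    seq = sequence.replace(" ", "")
    return {
        i // bases_per_column + 1: seq[i:i + bases_per_column]
        for i in range(0, len(seq), bases_per_column)
    }


# One copy of chromosome 1: 125 million bp -> 41,666,667 columns (~41 million).
print(column_count(125_000_000))
print(to_columns("TAG GCC CAG"))
```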

Point Mutation
TAG GCC CAG CAG TCA CTG
TAG GCC CAG CAG TAA CTG

Deletion Mutation
TAG GCC CAG CAG TCA CTG
TAG GCC CAG CAG ___ CTG

Insertion Mutation
TAG GCC CAG CAG TCA CTG
TAG GCC CAG CAG TCC TCA CTG
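The three mutation types can be told apart with a deliberately crude comparison of a reference row against a sample row. This is a sketch under a strong assumption: it only looks at lengths and base-by-base differences, with no sequence alignment, so it is not a variant caller. The function name and labels are illustrative.

```python
def classify_mutation(reference, sample):
    """Label the difference between two codon rows.

    Assumption: no alignment is performed. Spaces and the "___" gap
    marker from the slide are ignored before comparing.
    """
    ref = reference.replace(" ", "").replace("_", "").upper()
    alt = sample.replace(" ", "").replace("_", "").upper()
    if len(alt) < len(ref):
        return "deletion"
    if len(alt) > len(ref):
        return "insertion"
    differing = sum(1 for r, a in zip(ref, alt) if r != a)
    if differing == 0:
        return "none"
    return "point" if differing == 1 else "multiple substitutions"


reference = "TAG GCC CAG CAG TCA CTG"
print(classify_mutation(reference, "TAG GCC CAG CAG TAA CTG"))
print(classify_mutation(reference, "TAG GCC CAG CAG ___ CTG"))
print(classify_mutation(reference, "TAG GCC CAG CAG TCC TCA CTG"))
```

With one codon per column, a deletion or insertion shifts every later column, which is worth keeping in mind when comparing rows by column name.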

Chrom-21

(SNP)

1q43.7492932

Chrom-1


Chrom-21

Excellent candidate for compression!
4x reduction in total data size + 35% faster reads
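Why DNA text compresses well can be seen with a quick experiment. The sketch below uses zlib from the Python standard library only as a stand-in; it is not the compressor Cassandra applies to a column family, and it does not reproduce the 4x or 35% figures from the talk. The input is the repeating TAACCC pattern shown earlier.

```python
import zlib


def compression_ratio(sequence):
    """Return original size divided by zlib-compressed size."""
    raw = sequence.encode("ascii")
    packed = zlib.compress(raw, 9)
    # Round-trip check: decompression must give back the exact input.
    assert zlib.decompress(packed) == raw
    return len(raw) / len(packed)


# Highly repetitive input, modelled on the TAACCC repeats from the slides.
repetitive = "TAACCC" * 10_000
ratio = compression_ratio(repetitive)
print(f"{len(repetitive)} bases compress {ratio:.0f}x")
```

Real chromosomes are far less regular than this toy input, so the ratio here is an upper-end illustration rather than an estimate.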

Create Secondary Index to detect SNPs

cqlsh:dna_table> SELECT humanID FROM "Chrom-1" WHERE "1q0" = 'TAA';
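Conceptually, a secondary index on a column is a reverse lookup from a column value to the row keys that hold it. The sketch below builds such a lookup in memory; it illustrates the idea only and says nothing about how Cassandra stores its indexes. The row contents are hypothetical sample data.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value seen in `column` to the set of humanIDs holding it."""
    index = defaultdict(set)
    for human_id, columns in rows:
        if column in columns:
            index[columns[column]].add(human_id)
    return index


# Hypothetical rows: (humanID, {column_name: codon}).
rows = [
    ("595-36-0000", {"1q0": "TCA", "1q1": "CTG"}),
    ("595-36-1111", {"1q0": "TAA", "1q1": "CTG"}),
]

index = build_index(rows, "1q0")
carriers = index["TAA"]   # everyone whose 1q0 column holds the SNP value
print(sorted(carriers))
```

The lookup answers the same question as the SELECT above without scanning every row at query time; the cost moves to keeping the index up to date on writes.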

1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7







To get all of the people with the SNP:


Chrom-21

cqlsh:dna_table> SELECT "1q0", "1q1" FROM "Chrom-1" WHERE humanID IN ('595-36-0000', '595-36-1111');

Query: Give me the X gene for 2 people

Chrom-1

1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7







X

Chrom-1

P-key

SS:CT:M

SS:CT:F

SS:CT:M

SS:CT:F

SS:CT:M

SS:CT:F

Chrom-2 Chrom-3 Chrom-Y

41 million columns

9 million columns

3 billion cols

Storing the total USA population genome in Cassandra (314 million people)

630 million rows (2 for each person) x 1.5 GB per row = 900 PB

(Figure labels: 125 MB of data, 125 MB of data)

1000 Genomes Project
Oct 2012: 1092 genomes sequenced, 3.2 TB data total

46,000 nodes (20 TB each)
No Replication
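The node count can be reproduced from the numbers on the slide: 900 PB of data, 20 TB per node, no replication. A minimal sketch, assuming binary units (1 PB = 1024 TB); the replication factor parameter is added for illustration and is not from the talk.

```python
import math

TB_PER_PB = 1024   # assumption: binary units


def nodes_needed(total_pb, tb_per_node, replication_factor=1):
    """Nodes required to hold the data at the given replication factor."""
    total_tb = total_pb * TB_PER_PB * replication_factor
    return math.ceil(total_tb / tb_per_node)


nodes = nodes_needed(900, 20)                            # 46080, i.e. ~46,000 nodes
nodes_rf3 = nodes_needed(900, 20, replication_factor=3)  # 138240
print(nodes, nodes_rf3)
```

The second figure shows how quickly the cluster grows once replication is turned on, which is why the slide calls out "No Replication".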

[Figure: Cost per Human Genome sequence, 2001 to 2012. Linear scale, y-axis from $0 to $120,000,000 in $20 million increments.]

Huh?

[Figure: Cost per Human Genome sequence, 2001 to 2012. Series: Genome Sequencing, Moore's Law. Logarithmic scale, y-axis from $100 to $100,000,000 in 10x increments. Annotation at Jan 2008: switched to next-gen sequencing ("Super Logarithmic Scale!").]

Coding vs Non-coding DNA

Coding Non-coding

98% non-coding DNA

T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G

8% of human DNA (98,000 fragments)

HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta


Free Python tools for biological computation: http://biopython.org

Pycassa Cassandra Python client library: https://github.com/pycassa/pycassa

DataStax Community Edition of Cassandra + OpsCenter: http://www.datastax.com/download/community

Get Ubuntu 12.10: http://www.ubuntu.com/download (note: CentOS/Red Hat has install issues with Biopython)

Sameer [email protected]

- Freelance Big Data consultant and trainer
- Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack

Ex: Hortonworks, Accenture R&D, Symantec


- Co-author on v2 of Cassandra book
- Coming late 2013


Polychaos dubium: 620 billion bp (200x humans)


Resources to get started for beginners…

James Watson: How we discovered DNA
http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html

Juan Enriquez: The life-code that will reshape the future
http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html
