nyc* 2013 - "analyzing the human genome/dna with cassandra"
Analyzing the human genome/DNA with Cassandra
By Sameer [email protected]
linkedin.com/in/blueplastic/
@blueplastic
http://youtu.be/ziqx2hJY8Hg
blueplastic.com/dna.pdf
[Figure: DNA double helix. Base pairs: Adenine (A) with Thymine (T), Cytosine (C) with Guanine (G)]
T A A C C C N A A C
A T T G G G N T T G
3.3 billion base pairs
. . .
1 Human: 3 GB (uncompressed), 900 MB (compressed gz)
US population: 900 PB (uncompressed)
Earth population: 2.8 exabytes (uncompressed)
1 Exabyte = 1 million TB
98.5% of genome identical (3 in 10,000 bases differ)
[Figure: chromosome file sizes: 242 MB, 154 MB, 59 MB, 178 MB, 61 MB]
Y chromosome: 58 million base pairs (2% of total DNA)
X chromosome: 155 million base pairs (5% of total DNA)
X Y    X X
3.3 billion bp
4 billion bp
Mom    Dad
Humans: 46 Chromosomes
Why Chromosomes ??
Adder's Tongue Fern: 1,200 Chromosomes
Garden Snail: 54 Ch, 2 billion bp
Elephant: 56 Ch, 5.8 billion bp
Gorilla: 48 Ch, 3.4 billion bp
Fruit Fly: 8 Ch, 165 million bp
Onion: 16 Ch, ~18 billion bp (highly repetitive)
Human Genome Project vs 1000 Genomes Project
1000 Genomes Project:
- Launched Jan 2008
- Oct 2012: 1092 human genomes complete from 14 populations
- Goal: 2,500 sequences from 26 specific populations like: Han Chinese, Japanese, British, Colombian, Maratha/India, Punjabi/Pakistan, Finnish, African Americans
- Work done by 111 global institutions
- Cost about $40 million ($16,000 per person)
Human Genome Project:
- ~15 year project: 1989 – 2003
- Sequenced 99% of the genome (400 gaps)
- >70% of the genome came from an anonymous male donor from Buffalo, New York (code name RP11)
- Cost about $3 billion
Download @ http://www.1000genomes.org/
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042601/
- In 2010: 179 human genomes
- Discussed DNA from 2 families of: Mother / Father / Child
- One of biology’s most cited papers in 2011
- Feb 2009 assembly of one human genome (hg19)
- One gzip FASTA file per chromosome
1) rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
2) gunzip <file>.fa.gz
Download at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes
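Once a chromosome file is downloaded, it can be read with a few lines of Python. The following is a minimal sketch using only the standard library (Biopython would be the more usual choice); the file name `chr21.fa.gz` in the comment and the in-memory sample record are illustrative assumptions, not real hg19 data.

```python
import gzip
import io
from collections import Counter


def open_fasta(path):
    """Open a FASTA file in text mode, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a text-mode FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def base_counts(sequence):
    """Count bases case-insensitively (UCSC files use lowercase for repeats)."""
    return Counter(sequence.upper())


# Tiny in-memory stand-in for a downloaded chromosome file (hypothetical content).
sample = ">chrDemo\nNNNNtaaccc\nTAACCCTAAC\n"
records = list(read_fasta(io.StringIO(sample)))
counts = base_counts(records[0][1])

# For a real file the same parser works on the gzip directly, e.g.:
#   with open_fasta("chr21.fa.gz") as handle:
#       for name, seq in read_fasta(handle):
#           print(name, len(seq))
```

Reading the gzip directly avoids the separate gunzip step, at the cost of slower repeated reads.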
Exploring DNA from your browser…
http://useast.ensembl.org/Homo_sapiens/Location/Genome
Chromosome 2
Gene: MCM6
SNP: rs4988235
Position: 136,608,646 bp from pter
A:T - Can digest milk
G:C - Lactose intolerance
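As a rough illustration of what "looking up a SNP" means in code, the sketch below reads the base at a given position and maps it to the two outcomes listed on the slide. It assumes positions are 1-based, and the toy sequence and toy position are made up; a real lookup would index into the chromosome 2 sequence at the slide's position.

```python
# Position of rs4988235 as given on the slide (bp from pter on chromosome 2).
SNP_POSITION = 136_608_646

# Base pair -> trait, copied from the slide. Either base of a pair may be
# observed, depending on which strand is read.
PHENOTYPE_BY_PAIR = {
    frozenset("AT"): "can digest milk",
    frozenset("GC"): "lactose intolerance",
}


def base_at(sequence, position):
    """Return the base at a 1-based position (assumption: positions are 1-based)."""
    return sequence[position - 1].upper()


def phenotype(base):
    """Map an observed base to the slide's trait label, or 'unknown' (e.g. for N)."""
    for pair, label in PHENOTYPE_BY_PAIR.items():
        if base in pair:
            return label
    return "unknown"


# Hypothetical toy data standing in for chromosome 2 and the SNP position.
toy_sequence = "TAACCCGTAACC"
toy_position = 7
observed = base_at(toy_sequence, toy_position)
result = phenotype(observed)
```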
T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G
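The second strand is fully determined by the first (A pairs with T, C pairs with G), which is why only one strand needs to be stored. A minimal sketch of that pairing rule, with `N` (unknown base) mapped to `N`:

```python
# Pairing rule: A<->T, C<->G; N (unknown) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


top = "TAACCC" * 5
bottom = complement(top)
```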
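Chromosome #1: 250 million base pairs (across both C-pairs) (8% of total DNA)
4,316 known genes
[Figure: chromosome 1 ideogram running from pter to qter: P (short arm), centromere, Q (long arm); band scale 36 to 0 on the short arm and 0 to 43 on the long arm; labeled bands 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]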
Chrom-1
humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
humanID:cell_type:parent
Compound Keys
Chrom-2 Chrom-3 Chrom-Y
24 Column Families (22 pairs + X + Y)
Partition key : remaining keys
Chrom-1
humanID      cell    parent
595-36-0000  normal  mother
595-36-0000  normal  father
595-36-1111  normal  mother
595-36-1111  normal  father
1 2
Chrom-2 Chrom-X Chrom-Y
X Y
X X
595-36-0000 [normal, mother, 1]: TAG
[normal, mother, 2]: GCC
[normal, father, 1]: TAG
[normal, father, 2]: GCC
595-36-1111 [normal, mother, 1]: TAG
[normal, mother, 2]: GCC
[normal, father, 1]: TAG
[normal, father, 2]: GCC
Chrom-1 Column Family on disk
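The on-disk layout above can be mimicked in memory to make the key roles concrete: humanID acts as the partition key, and the remaining parts (cell type, parent, position) order the cells inside that partition. This is only a sketch with plain Python dictionaries, not Cassandra's storage engine; the function names are made up for illustration.

```python
from collections import defaultdict

# partition key (humanID) -> {(cell_type, parent, position): codon}
chrom1 = defaultdict(dict)


def insert(human_id, cell_type, parent, position, codon):
    """Store one cell under its partition key and clustering key."""
    chrom1[human_id][(cell_type, parent, position)] = codon


def on_disk(human_id):
    """Return one partition's cells sorted by clustering key.

    Note: Python sorts 'father' before 'mother', whereas the slide
    lists mother first; the ordering here is only illustrative.
    """
    return sorted(chrom1[human_id].items())


for human in ("595-36-0000", "595-36-1111"):
    for parent in ("mother", "father"):
        insert(human, "normal", parent, 1, "TAG")
        insert(human, "normal", parent, 2, "GCC")

partition = on_disk("595-36-0000")
```

Because all cells for one humanID sit together and in sorted order, reading a contiguous slice of one person's chromosome is a sequential read.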
Chrom-1
humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
Chrom-Y
Partition based on humanID w/ Murmur3Partitioner
Send to range A
Send to range D
Now it’s possible to do row range scans down the same humanID… and get all the DNA for human #1000
[Figure: cluster ring divided into token ranges A, B, C, D]
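The idea of partitioning on humanID alone can be sketched as follows: only the humanID is hashed, so every row for one person lands in the same range. Python's standard library has no Murmur3, so this sketch uses MD5 purely as a stand-in hash; the four equal ranges A to D and the function names are assumptions for illustration.

```python
import hashlib

RANGES = ["A", "B", "C", "D"]


def token(human_id):
    """Stand-in token: first 8 bytes of MD5 as an unsigned int (NOT Murmur3)."""
    digest = hashlib.md5(human_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owning_range(human_id):
    """Map a partition key to one of four equal slices of the 64-bit token space."""
    span = 2**64 // len(RANGES)
    return RANGES[token(human_id) // span]


# Only the first key component (humanID) takes part in placement.
rows = [
    ("595-36-0000", "normal", "mother"),
    ("595-36-0000", "cancer", "father"),
    ("595-36-1111", "normal", "mother"),
]
placement = {row: owning_range(row[0]) for row in rows}
```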
[Figure: chromosome 1 ideogram with labeled bands 1p36.32, 1p31.1, 1q12, 1q42.2, 1q43]
Chromosome #1: 4,316 genes
PSEN2 (25,000 bp)
1q31-q42
227,058,272 – 227,083,803
Presenilin 2
11 m: Alzheimers
Neuroblastoma Cancer (deletion of: 1p36.1 – 1p36.3)
ABCA4 AGL AMPD1 BSND CDC73 CHRNB2 COL8A2 CPT2 DBT EDARADD
ACADM ALDH4A1 ASPM CACNA1S CFH CLCNKA COL9A2 CRB1 DIRAS3 EGLN1
ACTA1 ALPL ATP1A2 CASQ2 CFHR5 CLCNKB COL11A1 DARS2 DPYD EIF2B3
DIRAS3 (4,800 bp)
1q31
68,511,644 - 68,516,459
DIRAS3 protein
Breast Cancer
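The rounded gene sizes on these slides follow from the coordinates. A small sketch of that arithmetic, and of slicing a gene out of a chromosome string, assuming the coordinates are 1-based and inclusive (the slides do not say):

```python
# (start, end) coordinates on chromosome 1, copied from the slides.
GENES = {
    "PSEN2": (227_058_272, 227_083_803),
    "DIRAS3": (68_511_644, 68_516_459),
}


def span_length(start, end):
    """Number of base pairs in an inclusive coordinate range."""
    return end - start + 1


def extract(chromosome, start, end):
    """Slice a gene from a chromosome string using 1-based inclusive coordinates."""
    return chromosome[start - 1:end]


lengths = {name: span_length(*coords) for name, coords in GENES.items()}

# Toy check of the slicing convention on a made-up 12-base sequence.
toy_gene = extract("TAGGCCCAGCAG", 4, 6)
```

With these assumptions PSEN2 spans 25,532 bp and DIRAS3 spans 4,816 bp, in line with the rounded figures on the slides.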
Conditions related to genes on Chromosome 1
- Alzheimer disease
- Neuroblastoma
- breast cancer
- color vision deficiency
- early-onset glaucoma
- Emery-Dreifuss muscular dystrophy
- Parkinson disease
Conditions related to genes on Chromosome 1
- actin-accumulation myopathy
- adenosine monophosphate deaminase deficiency
- age-related macular degeneration
- Alagille syndrome
- Alzheimer disease
- amyotrophic lateral sclerosis
- anencephaly
- ankylosing spondylitis
- arrhythmogenic right ventricular cardiomyopathy
- atypical hemolytic-uremic syndrome
- auriculo-condylar syndrome
- autosomal dominant nocturnal frontal lobe epilepsy
- autosomal recessive primary microcephaly
- Bartter syndrome
- 3-beta-hydroxysteroid dehydrogenase deficiency
- breast cancer
- cap myopathy
- carnitine palmitoyltransferase II deficiency
- catecholaminergic polymorphic ventricular tachycardia
- Charcot-Marie-Tooth disease
- Chediak-Higashi syndrome
- chronic granulomatous disease
- color vision deficiency
- congenital fiber-type disproportion
- congenital hypothyroidism
- congenital insensitivity to pain with anhidrosis
- Cowden syndrome
- Crohn disease
- dense deposit disease
- Diamond-Blackfan anemia
- dihydropyrimidine dehydrogenase deficiency
- early-onset glaucoma
- Ehlers-Danlos syndrome
- Emery-Dreifuss muscular dystrophy
- essential thrombocythemia
- factor V Leiden thrombophilia
- familial adenomatous polyposis
- familial cold autoinflammatory syndrome
- familial erythrocytosis
- familial hemiplegic migraine
- familial hypertrophic cardiomyopathy
- familial hypobetalipoproteinemia
- familial isolated hyperparathyroidism
- familial restrictive cardiomyopathy
- Fuchs endothelial dystrophy
- fucosidosis
- fumarase deficiency
- galactosemia
- gastrointestinal stromal tumor
- Gaucher disease
- Gitelman syndrome
- GLUT1 deficiency syndrome
- glycogen storage disease type III
- Greenberg dysplasia
- hemochromatosis
- hereditary antithrombin deficiency
- hereditary leiomyomatosis and renal cell cancer
- hereditary paraganglioma-pheochromocytoma
- hereditary sensory and autonomic neuropathy type V
- homocystinuria
- Hutchinson-Gilford progeria syndrome
- 3-hydroxy-3-methylglutaryl-CoA lyase deficiency
- hypercholesterolemia
- hypermanganesemia with dystonia, polycythemia, and cirrhosis
- hyperparathyroidism-jaw tumor syndrome
- hyperprolinemia
- hypohidrotic ectodermal dysplasia
- hypokalemic periodic paralysis
- hypophosphatasia
- idiopathic inflammatory myopathy
- intranuclear rod myopathy
- junctional epidermolysis bullosa
- juvenile idiopathic arthritis
- Kufs disease
- Leber congenital amaurosis
- leukoencephalopathy with brainstem and spinal cord involvement and lactate elevation
- leukoencephalopathy with vanishing white matter
- limb-girdle muscular dystrophy
- malignant hyperthermia
- maple syrup urine disease
- medium-chain acyl-CoA dehydrogenase deficiency
- Muckle-Wells syndrome
- multiminicore disease
- multiple epiphyseal dysplasia
- nemaline myopathy
- neonatal onset multisystem inflammatory disease
- neuroblastoma
- nonsyndromic deafness
- nonsyndromic paraganglioma
- Noonan syndrome
- osteogenesis imperfecta
- Parkinson disease
- popliteal pterygium syndrome
- porphyria
- primary myelofibrosis
- psoriatic arthritis
- pyruvate kinase deficiency
- REN-related kidney disease
- retinitis pigmentosa
- rhizomelic chondrodysplasia punctata
- severe congenital neutropenia
- Shprintzen-Goldberg syndrome
- spina bifida
- Stargardt macular degeneration
- Stickler syndrome
- systemic scleroderma
- thiamine-responsive megaloblastic anemia syndrome
- thrombocytopenia-absent radius syndrome
- trimethylaminuria
- Usher syndrome
- van der Woude syndrome
- vitiligo
- Vohwinkel syndrome
- WNT4 Müllerian aplasia and ovarian dysfunction
What read queries do we want to perform?
1) Give me the PSEN2 gene for 2,000 people w/ Alzheimer's (25,000 sequential bp)
Write once, read many times type of database
2) Give me all of the humans who have the lactose intolerance SNP on CR-2
Translation: DNA -> Proteins
A T T G G G A T T G G G A T T T G A
T A A C C C T A A C C C T A A A C T
Isoleucine Glycine Isoleucine Glycine Isoleucine STOP
DNA
Amino Acids (20 different types)
Protein
I G I G I
codon
Translation: ATT -> Isoleucine
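The codon-to-amino-acid mapping can be sketched as a simple lookup in three-base steps. The table below is deliberately partial: it covers only isoleucine, glycine, and the three stop codons, enough for the slide's example. Like the slide, it reads the DNA strand directly and skips the intermediate mRNA step.

```python
# Partial codon table (DNA alphabet): just the codons needed for the example.
CODON_TABLE = {
    "ATT": "Isoleucine", "ATC": "Isoleucine", "ATA": "Isoleucine",
    "GGT": "Glycine", "GGC": "Glycine", "GGA": "Glycine", "GGG": "Glycine",
    "TAA": "STOP", "TAG": "STOP", "TGA": "STOP",
}


def codons(sequence):
    """Split a DNA string into complete 3-base codons, ignoring spaces."""
    seq = sequence.replace(" ", "").upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def translate(sequence):
    """Translate codon by codon, stopping after the first STOP codon."""
    result = []
    for codon in codons(sequence):
        amino_acid = CODON_TABLE.get(codon, "?")  # "?" = not in this partial table
        result.append(amino_acid)
        if amino_acid == "STOP":
            break
    return result


protein = translate("A T T G G G A T T G G G A T T T G A")
```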
Chrom-1
humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
595-36-1111  normal     3rd
1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG TAA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG CTG NNN GAT
TAG GCC CAG CAG TCC TCA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
P (short arm): 36 to 0 | centromere | Q (long arm): 0 to 43
125 million bp / 3 = 41 m cols
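The column count follows from storing three bases (one codon) per column. A minimal sketch of that arithmetic and of turning a sequence into numbered columns; the integer column indexes are an assumption for illustration, whereas the slide names columns by chromosome band.

```python
def column_count(base_pairs, bases_per_column=3):
    """Columns needed to hold base_pairs bases, rounding up."""
    return -(-base_pairs // bases_per_column)


def to_columns(sequence, bases_per_column=3):
    """Map a DNA string to {column index: codon}, ignoring spaces."""
    seq = sequence.replace(" ", "").upper()
    return {
        i // bases_per_column: seq[i:i + bases_per_column]
        for i in range(0, len(seq), bases_per_column)
    }


arm_columns = column_count(125_000_000)   # one 125 million bp stretch
row = to_columns("TAG GCC CAG CAG TCA CTG")
```

With three bases per column, 125 million bp comes to roughly 41.7 million columns, which the slide rounds to 41 m.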
Point Mutation
  TAG GCC CAG CAG TCA CTG
  TAG GCC CAG CAG TAA CTG
Deletion Mutation
  TAG GCC CAG CAG TCA CTG
  TAG GCC CAG CAG ___ CTG
Insertion Mutation
  TAG GCC CAG CAG TCA CTG
  TAG GCC CAG CAG TCC TCA CTG
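The three mutation types can be told apart by comparing a sample row against a reference row codon by codon. The sketch below handles only the single-codon cases shown on the slide; it is not a general sequence-alignment method, and the function names are made up.

```python
def tokens(row):
    """Split a row into codons, dropping '___' placeholders for deleted codons."""
    return [t for t in row.split() if t.strip("_")]


def base_differences(a, b):
    """Count positions where two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(reference, sample):
    """Label a sample as 'none', 'point', 'deletion', 'insertion', or 'other'."""
    ref, alt = tokens(reference), tokens(sample)
    if len(alt) == len(ref):
        changed = [(r, a) for r, a in zip(ref, alt) if r != a]
        if not changed:
            return "none"
        if len(changed) == 1 and base_differences(*changed[0]) == 1:
            return "point"
        return "other"
    if len(alt) == len(ref) - 1 and any(
        ref[:i] + ref[i + 1:] == alt for i in range(len(ref))
    ):
        return "deletion"
    if len(alt) == len(ref) + 1 and any(
        alt[:i] + alt[i + 1:] == ref for i in range(len(alt))
    ):
        return "insertion"
    return "other"


REFERENCE = "TAG GCC CAG CAG TCA CTG"
point = classify(REFERENCE, "TAG GCC CAG CAG TAA CTG")
deletion = classify(REFERENCE, "TAG GCC CAG CAG ___ CTG")
insertion = classify(REFERENCE, "TAG GCC CAG CAG TCC TCA CTG")
```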
Chrom-21
(SNP)
1q43.7492932
Chrom-1
humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
595-36-1111  normal     3rd
Chrom-21
Excellent candidate for compression!
4x reduction in total data size + 35% faster reads
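Why a four-letter, repetitive alphabet compresses well can be seen with a general-purpose compressor. The sketch below uses Python's `zlib` as a stand-in; it is not Cassandra's table compression, and the perfectly repetitive input is an idealized assumption, so the ratio it produces says nothing about the 4x figure above.

```python
import zlib

# Idealized, perfectly repetitive input; real genomes are far less regular.
sequence = ("TAACCC" * 10_000).encode("ascii")

compressed = zlib.compress(sequence)
ratio = len(sequence) / len(compressed)

# Compression is lossless: the original bases come back unchanged.
restored = zlib.decompress(compressed)
```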
Create Secondary Index
To detect SNPs
cqlsh:dna_table> SELECT humanID FROM "Chrom-1" WHERE "1q0" = 'TAA';
1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG TAA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG CTG NNN GAT
TAG GCC CAG CAG TCC TCA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
To get all of the people with the SNP:
humanID      cell_type  parent
595-36-0000  normal     mother
595-36-0000  normal     father
595-36-0000  cancer     mother
595-36-0000  cancer     father
595-36-1111  normal     mother
595-36-1111  normal     father
595-36-1111  normal     3rd
Chrom-21
cqlsh:dna_table> SELECT "1q0", "1q1" FROM "Chrom-1" WHERE humanID IN ('595-36-0000', '595-36-1111');
Query: Give me the X gene for 2 people
Chrom-1
1p36 1p35 ... 1p1 1p0 1q0 1q1 ... 1q43.7
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG TAA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
TAG GCC CAG CAG CTG NNN GAT
TAG GCC CAG CAG TCC TCA CTG NNN GAT
TAG GCC CAG CAG TCA CTG NNN GAT
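The two read queries can be emulated in plain Python to show what each one does. This is an in-memory sketch, not a Cassandra client call: the rows, column values, and function names are hypothetical, with column names borrowed from the slides.

```python
from collections import defaultdict

# (humanID, cell_type, parent) -> {column name: codon}; toy values only.
ROWS = {
    ("595-36-0000", "normal", "mother"): {"1q0": "TCA", "1q1": "CTG"},
    ("595-36-0000", "normal", "father"): {"1q0": "TAA", "1q1": "CTG"},
    ("595-36-1111", "normal", "mother"): {"1q0": "TCA", "1q1": "CTG"},
    ("595-36-1111", "normal", "father"): {"1q0": "TCA", "1q1": "CTG"},
}


def build_index(rows, column):
    """Secondary-index idea: column value -> set of humanIDs holding that value."""
    index = defaultdict(set)
    for (human_id, _cell_type, _parent), columns in rows.items():
        if column in columns:
            index[columns[column]].add(human_id)
    return index


def humans_with(index, value):
    """Equivalent of: SELECT humanID ... WHERE <column> = <value>."""
    return sorted(index.get(value, set()))


def select(rows, human_ids, columns):
    """Equivalent of: SELECT <columns> ... WHERE humanID IN (<human_ids>)."""
    wanted = set(human_ids)
    return {
        key: {c: cols.get(c) for c in columns}
        for key, cols in rows.items()
        if key[0] in wanted
    }


snp_index = build_index(ROWS, "1q0")
carriers = humans_with(snp_index, "TAA")
genes = select(ROWS, ["595-36-0000", "595-36-1111"], ["1q0", "1q1"])
```

The index turns the SNP question into a single lookup by value, while the gene question is a lookup by partition key followed by a column slice.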
X
Chrom-1
P-key
SS:CT:M
SS:CT:F
SS:CT:M
SS:CT:F
SS:CT:M
SS:CT:F
Chrom-2 Chrom-3 Chrom-Y
41 million columns
9 million columns
3 billion cols
Storing the total USA population genome in Cassandra (314 million people)
125 MB of data    125 MB of data
1.5 GB x 630 million rows (2 for each person) = 900 PB
Oct 2012: 1092 genomes sequenced, 3.2 TB data total
1000 Genomes Project
46,000 nodes (20 TB each)
No Replication
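The sizing figures can be reproduced with a few lines of arithmetic. The sketch assumes binary units (1 TB = 1024 GB, 1 PB = 1024 TB), which is a guess about how the slide rounded, and uses the slide's inputs: 314 million people, 2 rows per person, 1.5 GB per row, 20 TB per node, no replication.

```python
import math

PEOPLE = 314_000_000
ROWS_PER_PERSON = 2        # one row per parent copy
GB_PER_ROW = 1.5
NODE_CAPACITY_TB = 20
GB_PER_TB = 1024           # assumption: binary units
TB_PER_PB = 1024

rows = PEOPLE * ROWS_PER_PERSON
total_gb = rows * GB_PER_ROW
total_tb = total_gb / GB_PER_TB
total_pb = total_tb / TB_PER_PB
nodes = math.ceil(total_tb / NODE_CAPACITY_TB)   # replication factor of 1

# Same arithmetic for the 1000 Genomes data set at 3 GB per genome.
genomes_tb = 1092 * 3 / GB_PER_TB
```

Under these assumptions the result is about 628 million rows, roughly 898 PB, and just under 46,000 nodes, in line with the rounded slide figures; any replication factor above 1 multiplies the node count accordingly.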
[Chart: Cost per Human Genome sequence, 2001-2012. Linear scale, $20 million increments from $0 to $120,000,000]
Huh ?
[Chart: Cost per Human Genome sequence, 2001-2012. Logarithmic scale, 10x increments from $100 to $100,000,000. Series: Genome Sequencing, Moore's Law]
Super Logarithmic Scale!
Jan 2008: Switched to next-gen sequencing
Coding vs Non-coding DNA
Coding Non-coding
98% non-coding DNA
T A A C C C T A A C C C T A A C C C T A A C C C T A A C C C
A T T G G G A T T G G G A T T G G G A T T G G G A T T G G G
8% of human DNA (98,000 fragments)
HIV-1 virus genome: https://www.ncbi.nlm.nih.gov/nuccore/9629357?report=fasta
Free Python tools for biological computation: http://biopython.org
Pycassa Cassandra Python client library: https://github.com/pycassa/pycassa
DataStax Community Edition of Cassandra + OpsCenter: http://www.datastax.com/download/community
Get Ubuntu 12.10: http://www.ubuntu.com/download (note CentOS/Red Hat has install issues with Biopython)
Sameer [email protected]
- Freelance Big Data consultant and trainer
- Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack
  Ex: Hortonworks, Accenture R&D, Symantec
- Co-author on v2 of Cassandra book
- Coming late 2013
Polychaos dubium: 620 billion bp (200x humans)
James Watson: How we discovered DNA
http://www.ted.com/talks/james_watson_on_how_he_discovered_dna.html
Juan Enriquez: The life-code that will reshape the future
http://www.ted.com/talks/juan_enriquez_on_genomics_and_our_future.html
Resources to get started for beginners…