Comparative genomics for pathogen/vector annotation
Manolis Kellis
CSAIL MIT Computer Science and Artificial Intelligence Lab
Broad Institute of MIT and Harvard for Genomics in Medicine
The age of comparative genomics
[Figure labels: 32 mammals; 12 flies; 9 yeasts; 8 Candida; human, mouse, rat, chimp, dog; pathogens; vectors]
Resolving power in mammals, flies, fungi
Mammals (10 mammals)
• Neutral: 2.57 subs/site (opp: 0.62, 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6
Flies (12 flies)
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11
Fungi (17 yeasts: 9 Yeasts, 8 Candida)
• Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21
[Figure: phylogenetic trees with scale bars of 0.3, 0.1 and 0.8 sub/site; tree labels: Post-duplication, Diploid, Haploid, Pre-dup; branch marks "P"]
Extensive conservation of synteny
• Global mapping of orthologous segments
• Nucleotide-level alignments span complete genomes
• Study properties / patterns of nucleotide conservation
[Figure panels: Mammals, Flies, Candida]
Comparative Genomics 101: Conservation → Function
• Conserved elements are typically functional (and vice versa)
– For example: exons are deeply conserved to mouse, chicken, fish
• Some conserved elements are still uncharacterized
– How do we make sense of them?
– How do we distinguish each type of functional element?
• Answer: evolutionary signatures (Comp. Genomics 201)
– Tell me how you evolve, I’ll tell you who you are
– Patterns of change → selective pressures → specific function
Overview
Part 1. Genome interpretation: evolutionary signatures of genes; revisiting the human and fly genomes; unusual gene structures
Part 2. Gene regulation: regulatory motif discovery; microRNA regulation; enhancer identification
Part 3. Genome evolution: phylogenomics; genome duplication; emergence of new functions
Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **
• Protein-coding genes have specific evolutionary constraints
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
• Encode as ‘evolutionary signatures’
– Computational test for each of them
– Combine and score systematically
[Figure label: Splice]
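Two of the signatures listed above (frame-preserving gaps and 3-periodic mutations) are easy to turn into tests on an alignment. The sketch below is only an illustration: the function names are made up for this example, and it assumes the alignment columns are already in the reading frame of the reference row.

```python
def gap_lengths(row):
    """Return the length of each run of '-' in one aligned sequence."""
    lengths, run = [], 0
    for ch in row:
        if ch == "-":
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def frame_preserving_fraction(rows):
    """Fraction of gap runs whose length is a multiple of three."""
    gaps = [g for row in rows for g in gap_lengths(row)]
    if not gaps:
        return 1.0
    return sum(g % 3 == 0 for g in gaps) / len(gaps)


def mismatches_by_codon_position(ref, other, frame=0):
    """Count mismatches at codon positions 0, 1, 2 (gap columns are skipped).

    Assumption: column index minus `frame` gives the codon position,
    which only holds when the reference row has no frame-shifting gaps.
    """
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(ref, other)):
        if a == "-" or b == "-":
            continue
        if a != b:
            counts[(i - frame) % 3] += 1
    return counts


# Toy data, not taken from the slides.
print(frame_preserving_fraction(["ATG---GCT", "ATGAAAGCT", "ATG-AAGCT"]))
print(mismatches_by_codon_position("ATGGCTAAAGGT", "ATGGCCAAGGGA"))
```

In a coding region one would expect most gaps to be multiples of three and most mismatches to pile up at the third codon position; in non-coding sequence neither bias should appear.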
Putting it all together: probabilistic framework
• Hidden Markov Models (HMMs)
– Generative model, learn emission, transition probabilities
– Easy to train, hard to integrate long-range signals
• Conditional Random Fields (CRFs)
– Discriminative dual of HMMs, learn weights on features
– Easy to integrate diverse signals, gradient ascent for training
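To make the HMM side of this comparison concrete, here is a minimal Viterbi decoder for a two-state model (coding vs. non-coding). It is a sketch under stated assumptions: the observation symbols ("s" for a synonymous-like change, "n" for a non-synonymous-like change) and every probability below are invented for illustration and are not the parameters used in the talk.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for `obs` under an HMM (log-space Viterbi)."""
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters (assumptions, not values from the slides).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"s": 0.8, "n": 0.2},
    "noncoding": {"s": 0.3, "n": 0.7},
}

print(viterbi("ssssnnnn", STATES, START, TRANS, EMIT))
```

A CRF would replace the emission and transition tables with weighted feature functions over the whole alignment, which is what makes it easier to combine diverse signals.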
Systematically annotate all protein-coding genes
• Large-scale re-annotation of the fly genome
– New genes and exons, dubious genes and exons
– Adjust gene boundaries: start codon, frame, splice site, sequencing errors
– Reveal unusual gene structures: stop read-through, di-cistronics, editing
• Towards a revised genome annotation
– Curation: FlyBase integrates predictions with cDNA, protein, literature
– Experimentation: BDGP large-scale functional validation of novel exons
Revisiting fly genome annotation
[Figure labels: D. melanogaster, D. simulans, D. erecta, D. persimilis]
• 10,845 fully confirmed
• 579 fully rejected
• 2,499 not aligned
• 1,454 exons (~800 genes)
• +668 exons in 443 genes
(…)
Example 2: Novel multi-exon gene
• 1,454 novel exons outside known genes
– 60% cluster in 300 new multi-exon genes
– 40% are isolated high-confidence exons
Example 3: Dubious single-exon gene
• Classification approach: Yes / No answer
– Closely related species: both genes and intergenic aligned
– Show very different patterns of mutation
• Comparative analysis provides negative evidence
– Alignment is unambiguous, orthologous, spans entire gene
– Sequence shows mutations and indels in every species
• Weak or missing experimental evidence
– 100 of these independently rejected by FlyBase
– These are missing from systematic clone collections
– Only 34 (6%) have assigned names (vs. 36% of all fly genes)
CG6664/FBtr0100439
Example 4: Start codon adjustment
• Codon substitution patterns suggest new start in 200 genes
– Score each substitution using Codon Substitution Matrix (CSM)
[Figure labels: annotated start codon (ATG); conserved start codon (ATG); poor CSM score, atypical substitution; high CSM score, protein-like substitution]
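The CSM itself is not given in the slides, so the sketch below uses a much cruder stand-in to show the idea of scoring codon substitutions: each aligned codon pair is classified with the standard genetic code, and synonymous changes are rewarded while non-synonymous changes and stops are penalized. The weights and function names are hypothetical, and the input is assumed to be ungapped and in frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical weights for illustration only; a real CSM would hold
# a separate score for every codon pair.
WEIGHTS = {"identical": 0, "synonymous": 1, "nonsynonymous": -1, "stop": -3}


def classify(c1, c2):
    """Classify one aligned codon pair."""
    if c1 == c2:
        return "identical"
    a1, a2 = CODE[c1], CODE[c2]
    if "*" in (a1, a2):
        return "stop"
    return "synonymous" if a1 == a2 else "nonsynonymous"


def toy_score(ref, other):
    """Sum the toy weights over all codon pairs of two in-frame sequences."""
    if len(ref) != len(other) or len(ref) % 3:
        raise ValueError("sequences must be equal length and a multiple of 3")
    return sum(
        WEIGHTS[classify(ref[i:i + 3], other[i:i + 3])]
        for i in range(0, len(ref), 3)
    )


print(toy_score("ATGGCTAAAGGT", "ATGGCCAAGGGA"))  # protein-like changes
print(toy_score("GCTAAA", "GATTAA"))              # atypical changes
```

Scanning such a score from the annotated ATG toward a downstream conserved ATG would show where protein-like substitution patterns begin, which is the intuition behind the start codon adjustment.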
Unusual genes 1: Stop codon read-through
• Method #1 (single exons)
– 112 events, 95 extending known genes; manual curation: 82
– Enriched in neuronal function
• Method #2 (after splicing)
– 256 events, looser cutoff, large overlap, needs manual curation
– Enriched in transcription factors
[Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation]
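The read-through picture (conservation continuing past the first stop until a second in-frame stop) suggests a simple first step: locate the candidate read-through region. The sketch below does only that; the function names are illustrative, and testing the region for protein-coding conservation would be a separate step not shown here.

```python
STOPS = {"TAA", "TAG", "TGA"}


def next_inframe_stop(seq, start):
    """Index of the first in-frame stop codon at or after `start`, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i
    return None


def readthrough_region(seq, first_stop):
    """Return (start, end) of the codons between the annotated stop and the
    next in-frame stop, or None if no second stop is found."""
    if seq[first_stop:first_stop + 3] not in STOPS:
        raise ValueError("first_stop does not point at a stop codon")
    second = next_inframe_stop(seq, first_stop + 3)
    if second is None:
        return None
    return first_stop + 3, second


# Toy sequence: ATG GCT TAG GCT AAA TGA CCC
seq = "ATGGCTTAGGCTAAATGACCC"
first = next_inframe_stop(seq, 0)
region = readthrough_region(seq, first)
print(first, region, seq[region[0]:region[1]])
```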
BDGP experimental validation: initial results
• 189 novel exons tested (in & out of genes)
– inverse PCR reaction + sequencing
– Recover new genes + alternative splice forms
• Results: 178 validated (94%)
– Novel exons inside known genes: 41/43 (95%)
– Novel exons outside known genes: 137/146 (94%)
• Some cDNA overlap: 8/8 (100%)
• No cDNA, some EST: 23/26 (88%)
• No cDNA, no EST: 106/112 (95%)
[Figure labels: novel gene, novel gene, known gene]
Overview
Part 1. Genome interpretation: evolutionary signatures of genes; revisiting the fly genome; unusual gene structures
Part 2. Gene regulation: regulatory motif discovery; microRNA regulation
Part 3. Genome evolution: phylogenomics
The regulatory code
• Multiple levels of regulation
– Temporal and spatial regulation, disease, development
– Chromatin, pre- / post-transcriptional, splicing, translational
• Combinatorial coding of individual motifs
– The core: a relatively small number of regulatory motifs
– Regions: diverse motif combinations specify diverse functions
• Regulatory motifs
– Summarize information across thousands of sites
• Distinguish: regulatory motifs vs. motif instances
– Challenging to discover: small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)
[Figure labels: enhancer regions; promoter motifs; 5’-UTR; 3’-UTR; splicing signals; motifs at RNA level]
Regulatory motif discovery
Study known motifs
Derive conservation rules
Discover novel motifs
Known motifs are preferentially conserved
dmel AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dsim AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dyak AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAG
dere AATGGTTTGC----------------CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC-----------AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT--------TTTC-----------------------AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---
*** ** * * ********** ** *
engrailed
• In multi-species alignments: known motifs appear as conservation islands
– Conserved biology: conserved regulatory code, same words are functional
– Preferential conservation: stand out from surrounding nucleotides
– Good signal for identifying individual instances of known motifs
• Not sufficient for motif discovery:
– Conservation not limited to exact binding site → additional bases would be found
– Weakly constrained positions can diverge → real motifs will be missed
– How do we discover motifs de novo? → Use basic property of regulatory motifs
Evaluate genome-wide conservation over thousands of instances
Known motifs are frequently conserved
• Across the fly genome, the engrailed motif:
– appears 8599 times
– is conserved 1534 times
– Conservation rate: 17.8%
[Figure: engrailed (TAATTA) instances aligned across D. mel., D. yakub., D. erecta, D. pseud.]
• Statistical significance
– 5 flies: conservation rate of random control motifs: 2.8%
– Engrailed enrichment: 6.8-fold (binomial score: 35 stdev)
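The conservation rate and a binomial-style significance score can be computed in a few lines. This is a sketch, assuming the score is the number of standard deviations above the expectation under a Binomial(total, p0) null, where p0 is the control conservation rate. The slides do not spell out the exact null model behind the 6.8-fold and 35 stdev figures, so this sketch is not expected to reproduce them.

```python
import math


def conservation_rate(conserved, total):
    """Fraction of motif instances that are conserved."""
    return conserved / total


def binomial_z(conserved, total, p0):
    """Standard deviations above expectation under Binomial(total, p0)."""
    expected = total * p0
    sd = math.sqrt(total * p0 * (1 - p0))
    return (conserved - expected) / sd


# Counts quoted on the slide for the engrailed motif.
rate = conservation_rate(1534, 8599)
print(f"conservation rate: {100 * rate:.1f}%")

# Generic example with round numbers: 60 conserved out of 100 at p0 = 0.5.
print(binomial_z(60, 100, 0.5))
```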
Motif Conservation Score (MCS)
Systematically evaluate candidate patterns
All potential motifs → Evaluate MCS → Collapse motif variants
• Enumerate
– Length between 6 and 15 nt, allow central gap
– 11 letter alphabet (A C G T, 2-fold codes, N)
• Score
– Compute binomial score (conserved vs. total)
– Select MCS > 6.0 (specificity 97%)
• Collapsing
– Sequence similarity
– Overlapping occurrences
• Results
– 168 motifs in promoter regions
– 196 motifs in 3’-UTR regions
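Counting instances of a degenerate candidate is the step that feeds the conserved vs. total counts. The sketch below handles the 11-letter alphabet named above (A, C, G, T, the six 2-fold codes and N) by translating a consensus into a regular expression. The helper names are made up for this example, and the central-gap and conservation checks are left out.

```python
import re

# The 11-letter alphabet: four bases, six 2-fold IUPAC codes, and N.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]",
    "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "N": "[ACGT]",
}


def motif_regex(consensus):
    """Compile a consensus into a regex; the lookahead finds overlapping hits."""
    body = "".join(IUPAC[c] for c in consensus)
    return re.compile("(?=(" + body + "))")


def find_instances(consensus, seq):
    """Start positions (0-based) of every match of the consensus in seq."""
    return [m.start() for m in motif_regex(consensus).finditer(seq)]


# Toy sequences, not taken from the slides.
print(find_instances("TAATTA", "CTAATTAATTAG"))
print(find_instances("WATTRATTK", "AATTGATTG"))
```

Running the same search on each aligned species and intersecting the hits would give the conserved count for a candidate; that part depends on the alignment format and is not sketched here.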
Consensus MCS Matches to known Expression enrichment Promoters Enhancers
1 CTAATTAAA 65.6 engrailed (en) 25.4 2
2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2
3 WATTRATTK 54.9 araucan (ara) 11.7 2.6
4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5
5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3
7 TGATTAAT 45.7 apterous (ap) 7.1 1.7
8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2
9 AAACNNGTT 41.2 20.1 4.3
10 RATTKAATT 40 3.9 0.7
11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9
12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7
13 AATTRMATTA 38.2 19.5 1.2
14 TATGCWAAT 37.8 5.8 2
15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4
16 CATNAATCA 36.9 1.8 1.7
17 TTACATAA 36.9 5.4
18 RTAAATCAA 36.3 3.2 2.8
19 AATKNMATTT 36 3.6 0
20 ATGTCAAHT 35.6 2.4 4.6
21 ATAAAYAAA 35.5 57.2 -0.5
22 YYAATCAAA 33.9 5.3 0.6
23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6
24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7
25 TGTMAATA 33.2 8.9 1.6
26 TAAYGAG 33.1 4.7 2.7
27 AAAKTGA 32.9 7.6 0.3
28 AAANNAAA 32.9 449.7 0.8
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8
30 TTATTTAYR 32.9 Deformed (Dfd) 30.7
Results in the fly genome: Promoter motifs
Functional roles of 106 motifs in 3’-UTRs
(a) 60 likely involved in mRNA regulation
– AATAAA: Poly-A signal
– 6 AT-rich elements: mRNA stability / degradation
– 24 TGTA-rich elements: mRNA localization (PUF)
– 29 other, potential targets of RNA-binding proteins
(b) 46 likely micro-RNA targets
– Specifically match distal 8 bp of 22-mer miRNA
– Match 114 known microRNA genes
– Enable discovery of 144 novel microRNA genes
– 6 of 12 tested by RT-PCR were confirmed
– Estimate extent of miRNA control: 20% of human genes are miRNA targets
[Figure labels: motif length; microRNA gene; 22-mer miRNA; 8-mer motif; protein-coding gene 3’-UTR; cleaved]
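The link between an 8-mer motif in a 3’-UTR and a 22-mer miRNA is base-pair complementarity, so the expected target 8-mer is the reverse complement of 8 nt at one end of the miRNA. The sketch below shows that calculation. Which end the slide means by "distal" is not stated in the transcript, so the `end` parameter is an assumption, and the 22-mer used here is a made-up sequence, not a real miRNA.

```python
# RNA complement table: A<->U, C<->G.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq):
    """Reverse complement of an RNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def target_8mer(mirna, end="5p"):
    """DNA 8-mer expected to pair with 8 nt at the chosen end of the miRNA.

    `end` is "5p" (first 8 nt) or "3p" (last 8 nt); the choice is an
    assumption of this sketch, not something the slides specify.
    """
    if len(mirna) < 8:
        raise ValueError("miRNA must be at least 8 nt long")
    if end not in ("5p", "3p"):
        raise ValueError("end must be '5p' or '3p'")
    part = mirna[:8] if end == "5p" else mirna[-8:]
    return reverse_complement_rna(part).replace("U", "T")


# Made-up 22-mer used purely for illustration.
toy_mirna = "UCAGGAUCGGCCAAUUGGCAAC"
print(target_8mer(toy_mirna, "5p"))
print(target_8mer(toy_mirna, "3p"))
```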
Global views of post-transcriptional regulation
Results in the fly: 50 novel microRNA genes
Regulatory motif discovery in the human
Systematic discovery of regulatory motifs in the human
• Frequently occurring, strongly conserved short regulatory signals
• 174 promoter motifs
– 70 match known TF motifs
– 115 show expression enrichment
– 60 show positional bias
• 106 motifs in 3’-UTR
– Strand specific
– 8-mers are miRNA-associated
– mRNA localization and stability
• microRNA regulation
– Discovered 8-mers: 114 known + 144 new miRNA genes
– Target ~20% of human 3’-UTRs
[Figure labels: ATATGCAA; TSS, ATG, Stop, 3’-UTR]
Overview
Part 1. Genome interpretation: evolutionary signatures of genes; revisiting the human and fly genomes; unusual gene structures
Part 2. Gene regulation: regulatory motif discovery; microRNA regulation; enhancer identification
Part 3. Genome evolution: phylogenomics
Evolutionary history of all genes in 17 fungi
• Each branch
– Mean and stdev
– Num genes
– Gains, losses
• Features
– Few events!
– Gain vs. loss
– Acceleration
– Churning
• Applications
– Recover WGD
– Pathogenicity
– Mating evolution
– Codon capture
– Evol. parallels
[Figure labels: yeast, candida]
… tile 34 of 288 …
Gene duplication and loss in context of syntenic alignments
[Figure: syntenic alignment of C. albicans (SC5314), C. albicans (WO-1), C. dubliniensis, C. parapsilosis, D. hansenii, C. tropicalis, C. guilliermondii, C. lusitaniae; labels: lineage specific genes, inserted segment, species specific genes]
Synteny spans 100 million years!
Overview
Part 1. Genome interpretation: evolutionary signatures of genes; revisiting the human and fly genomes; unusual gene structures
Part 2. Gene regulation: regulatory motif discovery; microRNA regulation; enhancer identification
Part 3. Genome evolution: phylogenomics; genome duplication; emergence of new functions
Rules of thumb for comparative genome sequencing
• Total branch length: >4 subs/site
– Genome annotation: new genes, exons, unusual
– Regulatory motif discovery, miRNAs, enhancers
• Max pair-wise branch length: <1 subs/site
– Conservation of function, nucleotide alignment quality
• Conserved gene order: synteny
– Global alignment quality
• Sequencing depth
– One or two genomes: >8X
– Remaining genomes: >3X, if syntenic relative exists
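The first two rules of thumb can be checked mechanically for a candidate set of species once a tree with branch lengths is available. The sketch below is one possible reading, not the speaker's procedure: it represents the tree as nested dictionaries (a format invented for this example) and interprets "max pair-wise branch length" as the largest leaf-to-leaf distance, which is an assumption.

```python
def total_branch_length(node):
    """Sum of all branch lengths in the subtree rooted at `node`."""
    return node.get("length", 0.0) + sum(
        total_branch_length(c) for c in node.get("children", [])
    )


def _depth_and_diameter(node):
    """Return (longest path from node down to a leaf,
    longest leaf-to-leaf path inside the subtree)."""
    children = node.get("children", [])
    if not children:
        return 0.0, 0.0
    depths, best = [], 0.0
    for child in children:
        depth, diameter = _depth_and_diameter(child)
        depths.append(depth + child.get("length", 0.0))
        best = max(best, diameter)
    depths.sort(reverse=True)
    if len(depths) >= 2:
        best = max(best, depths[0] + depths[1])
    return depths[0], best


def max_pairwise_distance(root):
    """Largest leaf-to-leaf distance in the tree."""
    return _depth_and_diameter(root)[1]


def check_rules(root, min_total=4.0, max_pair=1.0):
    """Apply the >4 subs/site total and <1 subs/site pairwise thresholds."""
    return {
        "total_ok": total_branch_length(root) > min_total,
        "pairwise_ok": max_pairwise_distance(root) < max_pair,
    }


# Hypothetical three-species tree; branch lengths in subs/site.
toy_tree = {"children": [
    {"name": "A", "length": 0.2},
    {"length": 0.1, "children": [
        {"name": "B", "length": 0.3},
        {"name": "C", "length": 0.25},
    ]},
]}
print(total_branch_length(toy_tree), max_pairwise_distance(toy_tree))
print(check_rules(toy_tree))
```

The toy tree passes the pairwise rule but falls well short of the total branch length rule, which is the typical situation when only a few close relatives have been sequenced.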