Comparative genomics for pathogen/vector annotation Manolis Kellis CSAIL MIT Computer Science and Artificial Intelligence Lab Broad Institute of MIT and Harvard for Genomics in Medicine

Upload: randolph-greene

Post on 28-Dec-2015


TRANSCRIPT

Page 1: Comparative genomics for pathogen/vector annotation Manolis Kellis CSAIL MIT Computer Science and Artificial Intelligence Lab Broad Institute of MIT and

Comparative genomics for

pathogen/vector annotation

Manolis Kellis

CSAIL MIT Computer Science and Artificial Intelligence Lab

Broad Institute of MIT and Harvard for Genomics in Medicine

Page 2:

The age of comparative genomics

[Figure: genomes available for comparison: 32 mammals (labeled: human, mouse, rat, chimp, dog), 12 flies, 9 yeasts, 8 Candida; pathogens; vectors]

Page 3:

Resolving power in mammals, flies, fungi

10 mammals
• Neutral: 2.57 subs/site (opp: 0.62, 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6

12 flies
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11

17 yeasts (9 Yeasts, 8 Candida)
• Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21

[Figure: three phylogenetic trees with scale bars of 0.3, 0.1, and 0.8 sub/site; fungal tree labels: Post-duplication, Pre-dup, Diploid, Haploid, P]

Page 4:

Extensive conservation of synteny

• Global mapping of orthologous segments
• Nucleotide-level alignments span complete genomes
• Study properties / patterns of nucleotide conservation

[Figure panels: Mammals, Flies, Candida]

Page 5:

Comparative Genomics 101: Conservation → Function

• Conserved elements are typically functional (and vice versa)
– For example: exons are deeply conserved to mouse, chicken, fish
• Some conserved elements are still uncharacterized
– How do we make sense of them?
– How do we distinguish each type of functional element?
• Answer: evolutionary signatures (Comp. Genomics 201)
– Tell me how you evolve, I’ll tell you who you are
– Patterns of change → selective pressures → specific function

Page 6:

Overview

Part 1. Genome interpretation
– Evolutionary signatures of genes
– Revisiting the human and fly genomes
– Unusual gene structures

Part 2. Gene regulation
– Regulatory motif discovery
– microRNA regulation
– Enhancer identification

Part 3. Genome evolution
– Phylogenomics
– Genome Duplication
– Emergence of new functions

Page 7:

Distinguishing genes from non-coding regions

Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT

Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA

Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC

Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC

Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC

Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT

Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC

Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT

***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **

• Protein-coding genes have specific evolutionary constraints
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substs.)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
• Encode as ‘evolutionary signatures’
– Computational test for each of them
– Combine and score systematically

[Figure annotation: Splice]
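The first two constraints on this slide (gap lengths in multiples of three, and mutations concentrated at every third position) lend themselves to simple computational tests. The following is a minimal sketch of such tests on a pairwise alignment, not the scoring scheme used in the talk; the function names and the toy sequences are hypothetical.

```python
import re


def gap_lengths(aligned_seq):
    """Return the lengths of all gap runs ('-') in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]


def frame_preserving_fraction(aligned_seq):
    """Fraction of gap runs whose length is a multiple of three."""
    lengths = gap_lengths(aligned_seq)
    if not lengths:
        return 1.0  # no gaps, so nothing disrupts the reading frame
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)


def mismatch_phase_counts(ref, other):
    """Count mismatches by codon position (0, 1, 2) of the reference.

    Columns with a gap in either sequence are not counted as mismatches;
    the codon position advances only on ungapped reference bases.
    """
    counts = {0: 0, 1: 0, 2: 0}
    pos = 0
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[pos % 3] += 1
        pos += 1
    return counts


# Hypothetical toy sequences, chosen so that all differences fall on
# third codon positions.
ref_toy = "ATGGCCAAAGGATTT"
other_toy = "ATGGCTAAGGGCTTC"
print(frame_preserving_fraction("ATG---GCC------AAA"))
print(mismatch_phase_counts(ref_toy, other_toy))
```

A coding region would be expected to score high on the first test and to show mismatches skewed toward one codon position on the second; a real classifier would combine these with the other signatures rather than threshold either one alone.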

Page 8:

Putting it all together: probabilistic framework

• Hidden Markov Models (HMMs)
– Generative model, learn emission, transition probabilities
– Easy to train, hard to integrate long-range signals
• Conditional Random Fields (CRFs)
– Discriminative dual of HMMs, learn weights on features
– Easy to integrate diverse signals, gradient ascent for training

Systematically annotate all protein-coding genes
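To make the HMM bullet concrete, here is a minimal sketch of Viterbi decoding for a two-state model (coding vs. non-coding). It is a textbook illustration, not the gene finder described in the talk: the state names, the observation symbols ("S" for a protein-like substitution pattern, "N" for an atypical one), and every probability are made-up assumptions.

```python
import math

# Hypothetical two-state model; all numbers are illustrative only.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"S": 0.3, "N": 0.7},
    "coding": {"S": 0.8, "N": 0.2},
}


def viterbi(observations):
    """Return the most likely state path for the observations (log space)."""
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s.
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANS[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi(list("NNNNSSSSSSNNNN")))
```

The emission and transition tables are exactly the quantities the slide says an HMM learns. A CRF would replace them with learned weights on arbitrary features of the alignment.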

Page 9:

Revisiting fly genome annotation

• Large-scale re-annotation of the fly genome
– New genes and exons, dubious genes and exons
– Adjust gene boundaries: start codon, frame, splice site, seq errors
– Reveal unusual gene structures: stop read-through, di-cistronics, editing
• Towards a revised genome annotation
– Curation: FlyBase integrates prediction with cDNA, protein, literature
– Experimentation: BDGP large-scale functional validation of novel exons

[Figure: species shown: D. melanog., D. simulans, D. erecta, D. persimilis; counts shown: 10,845 fully confirmed; 579 fully rejected; 2,499 not aligned; 1,454 exons (~800 genes); +668 exons in 443 genes]

(…)

Page 10:

Example 2: Novel multi-exon gene

• 1,454 novel exons outside known genes
– 60% cluster in 300 new multi-exon genes
– 40% are isolated high-confidence exons

Page 11:

Example 3: Dubious single-exon gene

• Classification approach: Yes / No answer
– Closely related species: both genes and intergenic aligned
– Show very different patterns of mutation
• Comparative analysis provides negative evidence
– Alignment is unambiguous, orthologous, spans entire gene
– Sequence shows mutations and indels in every species
• Weak or missing experimental evidence
– 100 of these independently rejected by FlyBase
– These are missing from systematic clone collections
– Only 34 (6%) have assigned names (vs. 36% of all fly genes)

Page 12:

Example 4: Start codon adjustment

• Codon substitution patterns suggest new start in 200 genes
– Score each substitution using Codon Substitution Matrix (CSM)

[Figure: CG6664/FBtr0100439, with the annotated start codon (ATG) and the conserved start codon (ATG) marked; region labels: "poor CSM score, atypical substitution" and "high CSM score, protein-like substitution"]

Page 13:

Unusual genes 1: Stop codon read-through

• Method #1 (single exons)
– 112 events, 95 extending known genes
– Manual curation: 82
– Enriched in neuronal function
• Method #2 (after splicing)
– 256 events, looser cutoff, large overlap, needs manual curation
– Enriched in transcription factors

[Figure: protein-coding conservation up to the stop codon; continued protein-coding conservation through the stop codon read-through region; no more conservation after the 2nd stop codon]
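The region of interest in a read-through candidate is the stretch between the annotated stop codon and the next in-frame stop. The sketch below only locates that stretch in a single sequence. The evidence used in the talk is continued protein-coding conservation across species, which this code does not attempt to measure. The function names and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(seq, frame=0):
    """Return 0-based start offsets of in-frame stop codons, in order."""
    seq = seq.upper()
    return [
        i for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]


def readthrough_extension(seq, frame=0):
    """Return the sequence between the first and second in-frame stops.

    Returns None if fewer than two in-frame stop codons are present.
    """
    stops = in_frame_stops(seq, frame)
    if len(stops) < 2:
        return None
    first, second = stops[0], stops[1]
    return seq[first + 3:second]


# Hypothetical toy sequence: ATG GCT AAA TGA GGT CCT GAA TAA GGG
toy = "ATGGCTAAATGAGGTCCTGAATAAGGG"
print(in_frame_stops(toy))
print(readthrough_extension(toy))
```

In a multi-species setting, the extension returned here is the window in which one would apply the protein-coding signatures (frame-preserving gaps, 3-periodic substitutions) to ask whether coding constraint continues past the first stop.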

Page 14:

BDGP experimental validation: initial results

• 189 novel exons tested (in & out of genes)
– inverse PCR reaction + sequencing
– Recover new genes + alternative splice forms
• Results: 178 validated (94%)
– Novel exons inside known genes: 41/43 (95%)
– Novel exons outside known genes: 137/146 (94%)
• Some cDNA overlap: 8/8 (100%)
• no cDNA, some EST: 23/26 (88%)
• no cDNA, no EST: 106/112 (95%)

[Figure labels: novel gene, novel gene, known gene]

Page 15:

Overview

Part 1. Genome interpretation
– Evolutionary signatures of genes
– Revisiting the fly genome
– Unusual gene structures

Part 2. Gene regulation
– Regulatory motif discovery
– microRNA regulation

Part 3. Genome evolution
– Phylogenomics

Page 16:

The regulatory code

• Multiple levels of regulation
– Temporal and spatial regulation, disease, development
– Chromatin, pre- / post-transcriptional, splicing, translational
• Combinatorial coding of individual motifs
– The core: a relatively small number of regulatory motifs
– Regions: diverse motif combinations specify diverse functions
• Regulatory motifs
– Summarize information across thousands of sites
  • Distinguish: regulatory motifs vs. motif instances
– Challenging to discover
  • Small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)

[Figure labels: Enhancer regions, Promoter motifs, 5’-UTR, 3’-UTR, Splicing signals, Motifs at RNA level]

Page 17:

Regulatory motif discovery

Study known motifs

Derive conservation rules

Discover novel motifs

Page 18:

Known motifs are preferentially conserved

dmel AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dsim AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dyak AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAG
dere AATGGTTTGC----------------CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC-----------AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT--------TTTC-----------------------AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---
     *** ** * * ********** ** *

[Alignment annotation: engrailed]

• In multi-species alignments: known motifs appear as conservation islands
– Conserved biology: conserved regulatory code, same words are functional
– Preferential conservation: stand out from surrounding nucleotides
– Good signal for identifying individual instances of known motifs
• Not sufficient for motif discovery:
– Conservation not limited to exact binding site → additional bases would be found
– Weakly constrained positions can diverge → real motifs will be missed
– How do we discover motifs de novo? → Use basic property of regulatory motifs

Evaluate genome-wide conservation over thousands of instances

Page 19:

Known motifs are frequently conserved

• Across the fly genome, the engrailed motif (TAATTA):
– appears 8599 times
– is conserved 1534 times
– Conservation rate: 17.8%
• Statistical significance
– 5 flies: conservation rate of random control motifs: 2.8%
– Engrailed enrichment: 6.8-fold (Binomial P-value: 35 stdev)

Motif Conservation Score (MCS)

[Figure: engrailed (TAATTA) instances across D. mel., D. yakub., D. erecta, D. pseud.]
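The conservation rate quoted on this slide follows directly from the two counts (1534 conserved out of 8599 instances). The sketch below reproduces that rate and shows one common way to express a binomial comparison against a background rate as a number of standard deviations. The z-score formula (normal approximation to the binomial) and the function names are assumptions, and the slide does not give the exact inputs behind its enrichment and stdev figures, so this code is not expected to reproduce them.

```python
import math


def conservation_rate(conserved, total):
    """Fraction of motif instances that are conserved."""
    return conserved / total


def binomial_z_score(conserved, total, background_rate):
    """Standard deviations by which the conserved count exceeds the count
    expected if each instance were conserved at background_rate."""
    expected = total * background_rate
    stdev = math.sqrt(total * background_rate * (1.0 - background_rate))
    return (conserved - expected) / stdev


# Counts quoted on the slide for the engrailed motif.
rate = conservation_rate(1534, 8599)
print(f"conservation rate: {rate:.1%}")

# Hypothetical numbers, only to show how the score behaves.
print(binomial_z_score(conserved=30, total=100, background_rate=0.2))
```

Scoring every candidate pattern this way puts motifs of different lengths and abundances on a common scale, which is what a cutoff on a single score requires.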

Page 20:

Systematically evaluate candidate patterns

All potential motifs → Evaluate MCS → Collapse motif variants

• Enumerate
– Length between 6 and 15 nt, allow central gap
– 11 letter alphabet (A C G T, 2-fold codes, N)
• Score
– Compute binomial score (conserved vs. total)
– Select MCS > 6.0 → specificity 97%
• Collapsing
– Sequence similarity
– Overlapping occurrences

Results: 168 motifs in promoter regions, 196 motifs in 3’-UTR regions

[Figure: example motif pattern: GT C A GTR RY gap S W]
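The 11-letter alphabet in the enumeration step consists of the four bases, the six standard IUPAC two-fold degenerate codes, and N. As a minimal sketch of how a degenerate candidate pattern can be matched against a sequence, the code below encodes that alphabet and scans for instances. The function names and the toy sequence are hypothetical, and three-fold codes such as D or H are deliberately left out because they are not part of the enumeration alphabet.

```python
# The 11-letter alphabet: 4 bases, 6 two-fold IUPAC codes, and N.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}


def matches(consensus, site):
    """True if every base of site is allowed by the consensus letter."""
    return len(consensus) == len(site) and all(
        base in IUPAC[code] for code, base in zip(consensus, site)
    )


def find_instances(consensus, sequence):
    """Return 0-based start positions where the consensus matches."""
    k = len(consensus)
    return [
        i for i in range(len(sequence) - k + 1)
        if matches(consensus, sequence[i:i + k])
    ]


# Hypothetical toy sequence scanned with a gapped pattern (two central Ns).
print(find_instances("AAACNNGTT", "GGAAACTAGTTCC"))
```

Counting all instances of a pattern this way, and then counting the subset that is conserved across the aligned species, gives the "conserved vs. total" pair that the scoring step needs.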

Page 21:

Results in the fly genome: Promoter motifs

 #  Consensus    MCS   Matches to known            Expression enrichment / Promoters / Enhancers
 1  CTAATTAAA    65.6  engrailed (en)              25.4 2
 2  TTKCAATTAA   57.3  reversed-polarity (repo)    5.8 4.2
 3  WATTRATTK    54.9  araucan (ara)               11.7 2.6
 4  AAATTTATGCK  54.4  paired (prd)                4.5 16.5
 5  GCAATAAA     51    ventral veins lacking (vvl)  13.2 0.3
 6  DTAATTTRYNR  46.7  Ultrabithorax (Ubx)         16 3.3
 7  TGATTAAT     45.7  apterous (ap)               7.1 1.7
 8  YMATTAAAA    43.1  abdominal A (abd-A)         7 2.2
 9  AAACNNGTT    41.2                              20.1 4.3
10  RATTKAATT    40                                3.9 0.7
11  GCACGTGT     39.5  fushi tarazu (ftz)          17.9
12  AACASCTG     38.8  broad-Z3 (br-Z3)            10.7
13  AATTRMATTA   38.2                              19.5 1.2
14  TATGCWAAT    37.8                              5.8 2
15  TAATTATG     37.5  Antennapedia (Antp)         14.1 5.4
16  CATNAATCA    36.9                              1.8 1.7
17  TTACATAA     36.9                              5.4
18  RTAAATCAA    36.3                              3.2 2.8
19  AATKNMATTT   36                                3.6 0
20  ATGTCAAHT    35.6                              2.4 4.6
21  ATAAAYAAA    35.5                              57.2 -0.5
22  YYAATCAAA    33.9                              5.3 0.6
23  WTTTTATG     33.8  Abdominal B (Abd-B)         6.3 6
24  TTTYMATTA    33.6  extradenticle (exd)         6.7 1.7
25  TGTMAATA     33.2                              8.9 1.6
26  TAAYGAG      33.1                              4.7 2.7
27  AAAKTGA      32.9                              7.6 0.3
28  AAANNAAA     32.9                              449.7 0.8
29  RTAAWTTAT    32.9  gooseberry-neuro (gsb-n)    11 0.8
30  TTATTTAYR    32.9  Deformed (Dfd)              30.7

Page 22:

Global views of post-transcriptional regulation

Functional roles of 106 motifs in 3’-UTRs

(a) 60 likely involved in mRNA regulation
– AATAAA: Poly-A signal
– 6 AT-rich elements: mRNA stability / degradation
– 24 TGTA-rich elements: mRNA localization (PUF)
– 29 other, potential target of RNA-binding proteins

(b) 46 likely micro-RNA targets
– Specifically match distal 8 bp of 22-mer miRNA
– Match 114 known microRNA genes
– Enable discovery of 144 novel microRNA genes
– 6 of 12 tested using RT-PCR and confirmed
– Estimate extent of miRNA control → 20% of human genes are miRNA targets

[Figure: axis label: Motif length; diagram labels: microRNA gene, miRNA, 22-mer miRNA, 8-mer motif, Protein-coding gene, 3’-UTR, cleaved]

Page 23:

Results in the fly: 50 novel microRNA genes

Page 24:

Regulatory motif discovery in the human

Systematic discovery of regulatory motifs in the human
• Frequently occurring, strongly conserved short regulatory signals
• 174 promoter motifs
– 70 match known TF motifs
– 115 expression enrichment
– 60 show positional bias
• 106 motifs in 3’-UTR
– Strand specific
– 8-mers are miRNA-associated
– mRNA localization and stability

microRNA regulation
– Discovered 8-mers → 114 known + 144 new miRNA genes
– Target ~20% of human 3’-UTRs

[Figure: gene diagram with TSS, ATG, Stop, 3’-UTR; example motif: ATATGCAA]

Page 25:

Overview

Part 1. Genome interpretation
– Evolutionary signatures of genes
– Revisiting the human and fly genomes
– Unusual gene structures

Part 2. Gene regulation
– Regulatory motif discovery
– microRNA regulation
– Enhancer identification

Part 3. Genome evolution
– Phylogenomics

Page 26:

Evolutionary history of all genes in 17 fungi

• Each branch
– Mean and stdev
– Num genes
– Gains, losses
• Features
– Few events!
– Gain vs. loss
– Acceleration
– Churning
• Applications
– Recover WGD
– Pathogenicity
– Mating evolution
– Codon capture
– Evol. parallels

[Figure labels: yeast, candida]

Page 27:

Gene duplication and loss in context of syntenic alignments

… tile 34 of 288 …

[Figure: species shown: C. albicans (SC5314), C. albicans (WO-1), C. dubliniensis, C. parapsilosis, D. hansenii, C. tropicalis, C. guilliermondii, C. lusitaniae; annotations: lineage specific genes, inserted segment, species specific genes]

Synteny spans 100 million years!

Page 28:

Overview

Part 1. Genome interpretation
– Evolutionary signatures of genes
– Revisiting the human and fly genomes
– Unusual gene structures

Part 2. Gene regulation
– Regulatory motif discovery
– microRNA regulation
– Enhancer identification

Part 3. Genome evolution
– Phylogenomics
– Genome Duplication
– Emergence of new functions

Page 29:

Resolving power in mammals, flies, fungi

10 mammals
• Neutral: 2.57 subs/site (opp: 0.62, 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6

12 flies
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11

17 yeasts (9 Yeasts, 8 Candida)
• Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21

[Figure: three phylogenetic trees with scale bars of 0.3, 0.1, and 0.8 sub/site; fungal tree labels: Post-duplication, Pre-dup, Diploid, Haploid, P]

Page 30:

Rules of thumb for comparative genome sequencing

• Total branch length: >4 subs/site
– Genome annotation: new genes, exons, unusual gene structures
– Regulatory motif discovery, miRNAs, enhancers
• Max pair-wise branch length: <1 subs/site
– Conservation of function, nucleotide alignment quality
• Conserved gene order: synteny
– Global alignment quality
• Sequencing depth
– One or two genomes: >8X
– Remaining genomes: >3X, if syntenic relative exists
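The first two rules of thumb can be checked mechanically for any candidate set of species once a tree with branch lengths is available. The sketch below is one way to do so. It assumes that "max pair-wise branch length" means the largest leaf-to-leaf distance in the tree, which the slide does not spell out. The tree, its species names, and its branch lengths are hypothetical.

```python
from itertools import combinations

# Hypothetical rooted tree: node -> (parent, branch length in subs/site).
TREE = {
    "root": (None, 0.0),
    "anc1": ("root", 0.1),
    "sp_A": ("anc1", 0.2),
    "sp_B": ("anc1", 0.3),
    "sp_C": ("root", 0.5),
}


def total_branch_length(tree):
    """Sum of all branch lengths in the tree."""
    return sum(length for _, length in tree.values())


def path_to_root(tree, node):
    """Map each ancestor-or-self of node to its distance from node."""
    dist, out = 0.0, {}
    while node is not None:
        out[node] = dist
        parent, length = tree[node]
        dist += length
        node = parent
    return out


def pairwise_distance(tree, a, b):
    """Leaf-to-leaf distance through the closest common ancestor."""
    up_a, up_b = path_to_root(tree, a), path_to_root(tree, b)
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def leaves(tree):
    """Nodes that are not the parent of any other node."""
    parents = {parent for parent, _ in tree.values()}
    return [n for n in tree if n not in parents]


def check_rules(tree, min_total=4.0, max_pair=1.0):
    """Evaluate the total and pairwise branch-length rules of thumb."""
    total = total_branch_length(tree)
    worst = max(
        pairwise_distance(tree, a, b)
        for a, b in combinations(leaves(tree), 2)
    )
    return {
        "total": total,
        "max_pairwise": worst,
        "total_ok": total > min_total,
        "pairwise_ok": worst < max_pair,
    }


print(check_rules(TREE))
```

The two thresholds pull in opposite directions: adding distant species raises the total branch length quickly but risks violating the pairwise limit, so the rules together favor many moderately diverged genomes over a few very distant ones.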