comparative genomics for pathogen/vector annotation manolis kellis csail mit computer science and...

Post on 28-Dec-2015


Comparative genomics for pathogen/vector annotation

Manolis Kellis

CSAIL MIT Computer Science and Artificial Intelligence Lab

Broad Institute of MIT and Harvard for Genomics in Medicine

The age of comparative genomics

[Figure labels: 32 mammals; 9 yeasts; 12 flies; 8 Candida; human, mouse, rat, chimp, dog; pathogens; vectors]

Resolving power in mammals, flies, fungi

10 mammals
• Neutral: 2.57 subs/site (opp: 0.62; 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6

12 flies
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11

17 yeasts (9 Yeasts, 8 Candida)
• Neutral: 15.5 subs/site (Yeast: 6.5; Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21

[Figure: three species trees; clade labels: Post-duplication, Diploid, Haploid, Pre-dup; branches marked P; scale bars: 0.3 sub/site, 0.1 sub/site, 0.8 sub/site]
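One way to read the "Detect" figures on this slide (an interpretation, not something the slide states): if neutral substitutions follow a Poisson process with total tree length b subs/site, a single site escapes substitution across the whole tree with probability e^(-b), so a k-mer is perfectly conserved by chance with probability e^(-k·b). The sketch below applies that assumed model to the neutral branch lengths quoted on the slide; the function name is hypothetical.

```python
import math

def chance_conservation_log10(kmer_len: int, neutral_subs_per_site: float) -> float:
    """log10 of the probability that kmer_len neutral sites all escape substitution.

    Assumes (not stated in the slides) a Poisson substitution model in which
    each site stays unchanged across the whole tree with probability exp(-b).
    """
    return -kmer_len * neutral_subs_per_site / math.log(10)

# (k-mer size, neutral branch length) pairs as quoted on the slide.
clades = {
    "mammals": (6, 2.57),
    "flies": (6, 4.13),
    "fungi": (3, 15.5),
}

for name, (k, b) in clades.items():
    exponent = chance_conservation_log10(k, b)
    print(f"{name}: {k}-mer conserved by chance with p ~ 10^{exponent:.1f}")
```

Under this assumption the exponents come out near -6.7, -10.8 and -20.2, the same order of magnitude as the 10^-6, 10^-11 and 10^-21 quoted on the slide.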

Extensive conservation of synteny

• Global mapping of orthologous segments
• Nucleotide-level alignments span complete genomes
• Study properties / patterns of nucleotide conservation

[Figure panels: Mammals; Flies; Candida]

Comparative Genomics 101: Conservation → Function

• Conserved elements are typically functional (and vice versa)
  – For example: exons are deeply conserved to mouse, chicken, fish
• Some conserved elements are still uncharacterized
  – How do we make sense of them?
  – How do we distinguish each type of functional element?
• Answer: evolutionary signatures (Comp. Genomics 201)
  – Tell me how you evolve, I’ll tell you who you are
  – Patterns of change → selective pressures → specific function

Overview

Part 1. Genome interpretation
  – Evolutionary signatures of genes
  – Revisiting the human and fly genomes
  – Unusual gene structures

Part 2. Gene regulation
  – Regulatory motif discovery
  – microRNA regulation
  – Enhancer identification

Part 3. Genome evolution
  – Phylogenomics
  – Genome Duplication
  – Emergence of new functions

Distinguishing genes from non-coding regions

Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT

Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA

Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC

Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC

Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC

Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT

Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC

Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT

***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **

• Protein-coding genes have specific evolutionary constraints
  – Gaps are multiples of three (preserve amino acid translation)
  – Mutations are largely 3-periodic (silent codon substitutions)
  – Specific triplets exchanged more frequently (conservative substs.)
  – Conservation boundaries are sharp (pinpoint individual splicing signals)
• Encode as ‘evolutionary signatures’
  – Computational test for each of them
  – Combine and score systematically
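As an illustration of what a "computational test" for two of these signatures could look like, here is a minimal sketch: it measures how many alignment gaps are multiples of three, and where mismatches fall within codons. The function names, the toy sequences and the assumption that the reading frame of the reference row is known are all illustrative, not taken from the talk.

```python
import re

def gap_lengths(aligned_seq: str) -> list[int]:
    """Lengths of every run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def frame_preserving_fraction(alignment: list[str]) -> float:
    """Fraction of gap runs whose length is a multiple of three."""
    lengths = [n for seq in alignment for n in gap_lengths(seq)]
    if not lengths:
        return 1.0
    return sum(n % 3 == 0 for n in lengths) / len(lengths)

def mismatches_by_codon_position(ref: str, other: str, frame: int = 0) -> list[int]:
    """Count ref/other mismatches at codon positions 1, 2 and 3.

    `frame` is the ungapped index in `ref` where a codon starts (assumed known).
    Columns with a gap in either row are not counted as mismatches.
    """
    counts = [0, 0, 0]
    ref_pos = 0  # index of the current base within the ungapped reference
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[(ref_pos - frame) % 3] += 1
        ref_pos += 1
    return counts

# Toy coding-like alignment (made up): one 3-base gap, changes at third positions.
ref = "ATGGCTAAA---GGT"
other = "ATGGCCAAGCCCGGA"
print(frame_preserving_fraction([ref, other]))   # all gaps preserve the frame
print(mismatches_by_codon_position(ref, other))  # mismatches pile up at position 3
```

A real pipeline would combine several such tests into one score, as the slide says; this sketch stops at the individual measurements.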

Splice

Putting it all together: probabilistic framework

• Hidden Markov Models (HMMs)
  – Generative model, learn emission, transition probabilities
  – Easy to train, hard to integrate long-range signals
• Conditional Random Fields (CRFs)
  – Discriminative dual of HMMs, learn weights on features
  – Easy to integrate diverse signals, gradient ascent for training
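To make the HMM bullet concrete, the sketch below decodes a toy two-state model (coding vs. non-coding) with the Viterbi algorithm. The observation alphabet ("syn", "nonsyn", "shift") and every probability are invented for illustration; they are not the trained parameters of the model described in the talk, and the CRF variant is not shown.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence (log-space Viterbi)."""
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][o])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy parameters (assumptions): coding regions mostly show synonymous changes,
# non-coding regions show more non-synonymous-like changes and frame-shifting gaps.
states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
emit_p = {
    "coding": {"syn": 0.70, "nonsyn": 0.25, "shift": 0.05},
    "noncoding": {"syn": 0.25, "nonsyn": 0.45, "shift": 0.30},
}

obs = ["syn", "syn", "nonsyn", "syn", "shift", "nonsyn", "shift", "nonsyn"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
print(path)
```

With these toy numbers the decoded path switches from coding to non-coding at the first frame-shifting observation, which is the kind of sharp boundary the evolutionary signatures are meant to expose.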

Systematically annotate all protein-coding genes

• Large-scale re-annotation of the fly genome
  – New genes and exons, dubious genes and exons
  – Adjust gene boundaries: start codon, frame, splice site, seq errors
  – Reveal unusual gene structures: stop read-through, di-cistronics, editing
• Towards a revised genome annotation
  – Curation: FlyBase integrates prediction with cDNA, protein, literature
  – Experimentation: BDGP large-scale functional validation of novel exons

Revisiting fly genome annotation

[Figure: alignment across D. melanog., D. simulans, D. erecta, D. persimilis (…)]

• 10,845 fully confirmed
• 579 fully rejected
• 2,499 not aligned
• 1,454 exons (~800 genes)
• +668 exons in 443 genes

Example 2: Novel multi-exon gene

• 1,454 novel exons outside known genes
  – 60% cluster in 300 new multi-exon genes
  – 40% are isolated high-confidence exons

Example 3: Dubious single-exon gene

• Classification approach: Yes / No answer
  – Closely related species: both genes and intergenic aligned
  – Show very different patterns of mutation
• Comparative analysis provides negative evidence
  – Alignment is unambiguous, orthologous, spans entire gene
  – Sequence shows mutations and indels in every species
• Weak or missing experimental evidence
  – 100 of these independently rejected by FlyBase
  – These are missing from systematic clone collections
  – Only 34 (6%) have assigned names (vs. 36% of all fly genes)

CG6664/FBtr0100439

Example 4: Start codon adjustment

• Codon substitution patterns suggest new start in 200 genes
  – Score each substitution using Codon Substitution Matrix (CSM)

[Figure labels: annotated start codon (ATG); conserved start codon (ATG); poor CSM score, atypical substitution; high CSM score, protein-like substitution]

Unusual genes 1: Stop codon read-through

• Method #1 (single exons)
  – 112 events, 95 extending known genes
  – Manual curation: 82
  – Enriched in neuronal function
• Method #2 (after splicing)
  – 256 events, looser cutoff, large overlap, needs manual curation
  – Enriched in transcription factors

[Figure labels: Protein-coding conservation; Stop codon read through; Continued protein-coding conservation; 2nd stop codon; No more conservation]
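The read-through picture suggests a simple first step: given an annotated stop codon, locate the next in-frame stop, so that the stretch in between can be scored for continued protein-coding conservation. The sketch below does only that first step on a made-up sequence; the function name is hypothetical and the conservation scoring itself is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def next_in_frame_stop(seq: str, stop_index: int):
    """Index of the next in-frame stop codon after the stop at stop_index.

    Returns None if the sequence ends before another in-frame stop codon.
    """
    if seq[stop_index:stop_index + 3] not in STOP_CODONS:
        raise ValueError("stop_index does not point at a stop codon")
    for i in range(stop_index + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

# Toy sequence (made up): ATG GCT TGA | GGA CCT AAA | TAA
seq = "ATGGCTTGAGGACCTAAATAA"
annotated_stop = 6
second_stop = next_in_frame_stop(seq, annotated_stop)
candidate_extension = seq[annotated_stop + 3:second_stop]
print(second_stop, candidate_extension)
```

Here the candidate extension is the three codons between the annotated stop and the second stop; in the talk's terms, read-through is supported only if that region keeps a protein-coding conservation signature across species.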

BDGP experimental validation: initial results

• 189 novel exons tested (in & out of genes)
  – inverse PCR reaction + sequencing
  – Recover new genes + alternative splice forms
• Results: 178 validated (94%)
  – Novel exons inside known genes: 41/43 (95%)
  – Novel exons outside known genes: 137/146 (94%)
    • Some cDNA overlap: 8/8 (100%)
    • no cDNA, some EST: 23/26 (88%)
    • no cDNA, no EST: 106/112 (95%)

[Figure labels: novel gene; novel gene; known gene]
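The validation counts can be cross-checked with a few lines of arithmetic. The sketch below assumes that the three evidence classes (cDNA overlap, EST only, neither) partition the novel exons outside known genes; that grouping is inferred from the counts adding up, not stated on the slide.

```python
# (validated, tested) counts as quoted on the slide
outside_known = {
    "some cDNA overlap": (8, 8),
    "no cDNA, some EST": (23, 26),
    "no cDNA, no EST": (106, 112),
}
inside_known = (41, 43)

def rate(validated: int, tested: int) -> float:
    """Validation rate as a percentage."""
    return 100.0 * validated / tested

# Sum the validated and tested columns over the three evidence classes.
outside_total = tuple(sum(column) for column in zip(*outside_known.values()))
overall = (inside_known[0] + outside_total[0], inside_known[1] + outside_total[1])

for label, (ok, n) in outside_known.items():
    print(f"{label}: {ok}/{n} = {rate(ok, n):.0f}%")
print(f"outside known genes: {outside_total[0]}/{outside_total[1]} = {rate(*outside_total):.0f}%")
print(f"inside known genes: {inside_known[0]}/{inside_known[1]} = {rate(*inside_known):.0f}%")
print(f"overall: {overall[0]}/{overall[1]} = {rate(*overall):.0f}%")
```

The sums reproduce the 137/146 and 178/189 totals given on the slide.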

Overview

Part 1. Genome interpretation
  – Evolutionary signatures of genes
  – Revisiting the fly genome
  – Unusual gene structures

Part 2. Gene regulation
  – Regulatory motif discovery
  – microRNA regulation

Part 3. Genome evolution
  – Phylogenomics

The regulatory code

• Multiple levels of regulation
  – Temporal and spatial regulation, disease, development
  – Chromatin, pre- / post-transcriptional, splicing, translational
• Combinatorial coding of individual motifs
  – The core: a relatively small number of regulatory motifs
  – Regions: diverse motif combinations specify diverse functions
• Regulatory motifs
  – Summarize information across thousands of sites
    • Distinguish: regulatory motifs vs. motif instances
  – Challenging to discover
    • Small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)

[Figure labels: Enhancer regions; Promoter motifs; 5’-UTR; 3’-UTR; Splicing signals; Motifs at RNA level]

Regulatory motif discovery

Study known motifs

Derive conservation rules

Discover novel motifs

Known motifs are preferentially conserved

dmel AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dsim AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dyak AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAG
dere AATGGTTTGC----------------CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC-----------AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT--------TTTC-----------------------AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---
     *** ** * * ********** ** *

engrailed

• In multi-species alignments, known motifs appear as conservation islands
  – Conserved biology: conserved regulatory code, same words are functional
  – Preferential conservation: stand out from surrounding nucleotides
  – Good signal for identifying individual instances of known motifs
• Not sufficient for motif discovery:
  – Conservation is not limited to the exact binding site → additional bases would be found
  – Weakly constrained positions can diverge → real motifs will be missed
  – How do we discover motifs de novo? → Use a basic property of regulatory motifs

Evaluate genome-wide conservation over thousands of instances

Known motifs are frequently conserved

• Across the fly genome, the engrailed motif (TAATTA):
  – appears 8599 times
  – is conserved 1534 times
  – Conservation rate: 17.8%
• Statistical significance
  – 5 flies: conservation rate of random control motifs: 2.8%
  – Engrailed enrichment: 6.8-fold (Binomial P-value: 35 stdev)

[Figure: engrailed instances aligned across D. mel., D. yakub., D. erecta, D. pseud.]
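A conservation-based score of this kind can be sketched as a binomial z-score: how many standard deviations the number of conserved instances lies above what the control rate would predict. The talk does not spell out how control motifs are chosen or how the score is normalized, so the plain binomial model below is an assumption and its z-score should not be expected to reproduce the 35 stdev quoted on the slide. Function names are hypothetical.

```python
import math

def conservation_rate(conserved: int, total: int) -> float:
    """Fraction of motif instances that are conserved."""
    return conserved / total

def binomial_z_score(conserved: int, total: int, background_rate: float) -> float:
    """Standard deviations by which `conserved` exceeds its binomial expectation."""
    expected = total * background_rate
    stdev = math.sqrt(total * background_rate * (1.0 - background_rate))
    return (conserved - expected) / stdev

# Counts for the engrailed motif and the control rate, as quoted on the slide.
rate = conservation_rate(1534, 8599)
z = binomial_z_score(1534, 8599, background_rate=0.028)
print(f"conservation rate: {rate:.1%}")
print(f"z-score under a plain binomial model: {z:.1f}")
```

The first line reproduces the 17.8% conservation rate; the second shows only that, under this simplified model, the motif is conserved far more often than the control rate predicts.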

Motif Conservation Score (MCS)

Systematically evaluate candidate patterns

All potential motifs → Evaluate MCS → Collapse motif variants

[Figure: example degenerate motif pattern with a central gap]

Results: 196 motifs in 3’-UTR regions; 168 motifs in promoter regions

• Enumerate
  – Length between 6 and 15 nt, allow central gap
  – 11 letter alphabet (A C G T, 2-fold codes, N)
• Score
  – Compute binomial score (conserved vs. total)
  – Select MCS > 6.0 (specificity 97%)
• Collapsing
  – Sequence similarity
  – Overlapping occurrences
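The enumerate-and-score steps rely on matching degenerate patterns against sequence. The sketch below expands a motif written in the 11-letter alphabet (A, C, G, T, the six 2-fold IUPAC codes, N) into a regular expression, finds all instances including overlapping ones, and counts how many are also present at the same columns of other aligned rows. The helper names and toy sequences are hypothetical, 3-fold codes such as D or H are deliberately left out, and this is not the enumeration procedure used in the talk.

```python
import re

# The 11-letter alphabet: four bases, six 2-fold IUPAC codes, and N.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}

def motif_to_regex(motif: str) -> re.Pattern:
    """Compile a degenerate motif into a regex that also finds overlapping hits."""
    parts = [b if len(b) == 1 else f"[{b}]" for b in (IUPAC[ch] for ch in motif)]
    # Wrap in a lookahead so that overlapping occurrences are all reported.
    return re.compile("(?=(" + "".join(parts) + "))")

def find_instances(motif: str, seq: str) -> list[int]:
    """Start positions of every (possibly overlapping) motif instance in seq."""
    return [m.start() for m in motif_to_regex(motif).finditer(seq)]

def count_conserved(motif: str, reference: str, others: list[str]) -> tuple[int, int]:
    """(conserved, total): instances in `reference` whose aligned window also
    matches the motif in every other alignment row."""
    k = len(motif)
    pattern = motif_to_regex(motif)
    total = conserved = 0
    for start in find_instances(motif, reference):
        total += 1
        if all(pattern.match(row[start:start + k]) for row in others):
            conserved += 1
    return conserved, total

# Toy rows (made up), already aligned column by column.
print(find_instances("WATTRATTK", "GGAATTAATTGCC"))
print(count_conserved("TAATTA", "ACTAATTAGG", ["ACTAATTAGC", "TCTAATTACG"]))
```

The (conserved, total) pair returned by `count_conserved` is the kind of input a binomial conservation score needs.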

Results in the fly genome: Promoter motifs

 #  Consensus    MCS   Matches to known             Expression enrichment  Promoters  Enhancers
 1  CTAATTAAA    65.6  engrailed (en)                                      25.4       2
 2  TTKCAATTAA   57.3  reversed-polarity (repo)                            5.8        4.2
 3  WATTRATTK    54.9  araucan (ara)                                       11.7       2.6
 4  AAATTTATGCK  54.4  paired (prd)                                        4.5        16.5
 5  GCAATAAA     51    ventral veins lacking (vvl)                         13.2       0.3
 6  DTAATTTRYNR  46.7  Ultrabithorax (Ubx)                                 16         3.3
 7  TGATTAAT     45.7  apterous (ap)                                       7.1        1.7
 8  YMATTAAAA    43.1  abdominal A (abd-A)                                 7          2.2
 9  AAACNNGTT    41.2                                                      20.1       4.3
10  RATTKAATT    40                                                        3.9        0.7
11  GCACGTGT     39.5  fushi tarazu (ftz)                                  17.9
12  AACASCTG     38.8  broad-Z3 (br-Z3)                                    10.7
13  AATTRMATTA   38.2                                                      19.5       1.2
14  TATGCWAAT    37.8                                                      5.8        2
15  TAATTATG     37.5  Antennapedia (Antp)                                 14.1       5.4
16  CATNAATCA    36.9                                                      1.8        1.7
17  TTACATAA     36.9                                                      5.4
18  RTAAATCAA    36.3                                                      3.2        2.8
19  AATKNMATTT   36                                                        3.6        0
20  ATGTCAAHT    35.6                                                      2.4        4.6
21  ATAAAYAAA    35.5                                                      57.2       -0.5
22  YYAATCAAA    33.9                                                      5.3        0.6
23  WTTTTATG     33.8  Abdominal B (Abd-B)                                 6.3        6
24  TTTYMATTA    33.6  extradenticle (exd)                                 6.7        1.7
25  TGTMAATA     33.2                                                      8.9        1.6
26  TAAYGAG      33.1                                                      4.7        2.7
27  AAAKTGA      32.9                                                      7.6        0.3
28  AAANNAAA     32.9                                                      449.7      0.8
29  RTAAWTTAT    32.9  gooseberry-neuro (gsb-n)                            11         0.8
30  TTATTTAYR    32.9  Deformed (Dfd)                                      30.7

[Figure axis: Motif length]

Functional roles of 106 motifs in 3’-UTRs

(a) 60 likely involved in mRNA regulation
  – AATAAA: Poly-A signal
  – 6 AT-rich elements: mRNA stability / degradation
  – 24 TGTA-rich elements: mRNA localization (PUF)
  – 29 other, potential target of RNA-binding proteins

(b) 46 likely micro-RNA targets
  – Match 114 known microRNA genes
  – Enable discovery of 144 novel microRNA genes
  – Estimate extent of miRNA control: 20% of human genes are miRNA targets
  – Specifically match distal 8 bp of 22-mer miRNA
  – 6 of 12 tested using RT-PCR and confirmed

[Figure labels: microRNA gene; miRNA; 22-mer miRNA; 8-mer motif; Protein-coding gene; 3’-UTR; cleaved]

Global views of post-transcriptional regulation

Results in the fly: 50 novel microRNA genes

Systematic discovery of regulatory motifs in the human

• Frequently occurring, strongly conserved short regulatory signals

Regulatory motif discovery in the human
• 174 promoter motifs
  – 70 match known TF motifs
  – 115 expression enrichment
  – 60 show positional bias
• 106 motifs in 3’-UTR
  – Strand specific
  – 8-mers are miRNA-associated
  – mRNA localization and stability

microRNA regulation
• Discovered 8-mers: 114 known + 144 new miRNA genes
• Target ~20% of human 3’-UTRs

[Figure labels: ATATGCAA; TSS; ATG; Stop; 3’-UTR]

Overview

Part 1. Genome interpretation
  – Evolutionary signatures of genes
  – Revisiting the human and fly genomes
  – Unusual gene structures

Part 2. Gene regulation
  – Regulatory motif discovery
  – microRNA regulation
  – Enhancer identification

Part 3. Genome evolution
  – Phylogenomics

Evolutionary history of all genes in 17 fungi

• Each branch
  – Mean and stdev
  – Num genes
  – Gains, losses
• Features
  – Few events!
  – Gain vs. loss
  – Acceleration
  – Churning
• Applications
  – Recover WGD
  – Pathogenicity
  – Mating evolution
  – Codon capture
  – Evol. parallels

[Figure labels: yeast; candida]

Gene duplication and loss in context of syntenic alignments

… tile 34 of 288 …

[Figure: syntenic alignment across C. albicans (SC5314), C. albicans (WO-1), C. dubliniensis, C. parapsilosis, D. hansenii, C. tropicalis, C. guilliermondii, C. lusitaniae; labels: lineage specific genes, inserted segment, species specific genes]

Synteny spans 100 million years!

Overview

Part 1. Genome interpretation
  – Evolutionary signatures of genes
  – Revisiting the human and fly genomes
  – Unusual gene structures

Part 2. Gene regulation
  – Regulatory motif discovery
  – microRNA regulation
  – Enhancer identification

Part 3. Genome evolution
  – Phylogenomics
  – Genome Duplication
  – Emergence of new functions

Resolving power in mammals, flies, fungi

10 mammals
• Neutral: 2.57 subs/site (opp: 0.62; 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6

12 flies
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11

17 yeasts (9 Yeasts, 8 Candida)
• Neutral: 15.5 subs/site (Yeast: 6.5; Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21

[Figure: three species trees; clade labels: Post-duplication, Diploid, Haploid, Pre-dup; branches marked P; scale bars: 0.3 sub/site, 0.1 sub/site, 0.8 sub/site]

Rules of thumb for comparative genome sequencing

• Total branch length: >4 subs/site
  – Genome annotation: new genes, exons, unusual
  – Regulatory motif discovery, miRNAs, enhancers
• Max pair-wise branch length: <1 subs/site
  – Conservation of function, nucleotide alignment quality
• Conserved gene order: synteny
  – Global alignment quality
• Sequencing depth
  – One or two genomes: >8X
  – Remaining genomes: >3X, if syntenic relative exists
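These rules of thumb translate directly into a checklist. The sketch below encodes them as a small validation function; the class and field names are hypothetical, and apart from the two neutral branch lengths taken from the resolving-power slide, the example values (pair-wise distances, coverages) are made up.

```python
from dataclasses import dataclass

@dataclass
class SequencingPlan:
    total_branch_length: float    # neutral subs/site summed over the tree
    max_pairwise_distance: float  # subs/site between the two most distant species
    syntenic: bool                # is gene order conserved across the set?
    coverages: list[float]        # planned sequencing depth (X) per genome

def check_plan(plan: SequencingPlan) -> list[str]:
    """Return the rules of thumb that the plan does not meet (empty if none)."""
    problems = []
    if plan.total_branch_length <= 4:
        problems.append("total branch length should exceed 4 subs/site")
    if plan.max_pairwise_distance >= 1:
        problems.append("max pair-wise branch length should stay below 1 subs/site")
    if not plan.syntenic:
        problems.append("gene order (synteny) should be conserved")
    if not any(depth > 8 for depth in plan.coverages):
        problems.append("one or two genomes should be sequenced at >8X")
    if any(depth <= 3 for depth in plan.coverages):
        problems.append("every genome should be sequenced at >3X")
    return problems

# Branch lengths 4.13 and 2.57 are the fly and mammal neutral values from the
# slides; all other numbers are hypothetical.
fly_like = SequencingPlan(4.13, 0.9, True, [12, 9, 4, 4, 3.5])
mammal_like = SequencingPlan(2.57, 0.6, True, [10, 4, 4])
print(check_plan(fly_like))
print(check_plan(mammal_like))
```

The fly-like plan passes every rule, while the mammal-like plan is flagged only for its total branch length.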
