Interpreting the human genome Manolis Kellis CSAIL MIT Computer Science and Artificial Intelligence Lab Broad Institute of MIT and Harvard for Genomics in Medicine


Post on 13-Jan-2016

Page 1

Interpreting the human genome

Manolis Kellis

CSAIL MIT Computer Science and Artificial Intelligence Lab

Broad Institute of MIT and Harvard for Genomics in Medicine

Page 2

The age of comparative genomics

32 mammals, 17 yeasts, 12 flies

Mammals shown: human, mouse, rat, chimp, dog, opossum, armadillo, rabbit, cow, hyrax, elephant, bat, dolphin, lemur, bushbaby, pika, hedgehog, tenrec, pangolin, tree shrew, llama, etc.

Page 3

Resolving power in mammals, flies, fungi

10 mammals
• Neutral: 2.57 subs/site (opp: 0.62, 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6

12 flies
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11

17 yeasts (8 Candida, 9 Yeasts)
• Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21

[Figure: fungal phylogeny with clades labeled Post-duplication, Diploid, Haploid, Pre-dup]
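The detection thresholds on this slide can be roughly reproduced with a small calculation. This is only a sketch under an assumption that the slide does not state: substitutions are Poisson-distributed, so a k-mer with no function stays perfectly conserved across the tree with probability exp(-k × neutral branch length). The function name is hypothetical.

```python
import math

def perfect_conservation_probability(k, neutral_subs_per_site):
    """Chance that k neutral sites all show zero substitutions (Poisson model)."""
    return math.exp(-k * neutral_subs_per_site)

# (clade, k-mer length, neutral branch length in subs/site) as listed on the slide
for clade, k, rate in [("mammals", 6, 2.57), ("flies", 6, 4.13), ("fungi", 3, 15.5)]:
    p = perfect_conservation_probability(k, rate)
    print(f"{clade}: {k}-mer conserved by chance with p = {p:.1e}")
```

Under this assumption the three values come out near 10^-7, 10^-11 and 10^-21, close to the orders of magnitude quoted on the slide.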

Page 4

Comparative Genomics 101: Conservation → Function

• Conserved elements are typically functional (and vice versa)
– For example: exons are deeply conserved to mouse, chicken, fish
• Some conserved elements are still uncharacterized
– How do we make sense of them?
– How do we distinguish each type of functional element?
• Answer: evolutionary signatures (Comp. Genomics 201)
– Tell me how you evolve, I’ll tell you who you are
– Patterns of change → selective pressures → specific function

Page 5

Gene identification

Study known genes → Derive conservation rules → Discover new genes

• Evolutionary signatures
– “Tell me how you evolve, I’ll tell you who you are”
– Each type of functional element evolves in its own specific way

Page 6

Distinguishing genes from non-coding regions

Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC

Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT

Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA

Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC

Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC

Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC

Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT

Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC

Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT

***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **

• Protein-coding genes have specific evolutionary constraints
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substs.)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
• Encode as ‘evolutionary signatures’
– Computational test for each of them
– Combine and score systematically

[Alignment annotation: Splice]
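As an illustration of what a "computational test" for the first constraint might look like, here is a minimal sketch. It assumes gapped alignment rows given as strings with '-' for gaps; the function names and the toy rows are hypothetical and not taken from the talk.

```python
import re

def gap_lengths(aligned_seq):
    """Lengths of all runs of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def fraction_frame_preserving(aligned_seqs):
    """Fraction of gap runs whose length is a multiple of three."""
    lengths = [n for seq in aligned_seqs for n in gap_lengths(seq)]
    if not lengths:
        return 1.0  # no gaps at all: nothing disrupts the frame
    return sum(n % 3 == 0 for n in lengths) / len(lengths)

# Toy rows (made up): one 3-nt gap vs. gaps of 1 and 2 nt
coding_like = ["ATGGCT---AAATGA", "ATGGCTGCTAAATGA"]
noncoding_like = ["ATGG-CTAAATG--A", "ATGGCCTAAATGTTA"]
print(fraction_frame_preserving(coding_like), fraction_frame_preserving(noncoding_like))
```

A real test would also need to weigh gap counts and alignment quality; this only captures the "multiples of three" idea.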

Page 7

Signature 1: Reading frame conservation

             Mutations   Gaps      Frameshifts
Genes        30%         1.3%      0.14%
Intergenic   58%         14%       10.2%
Separation   2-fold      10-fold   75-fold

[Figure: reading frame conservation (RFC) percentages for two example regions; the overall RFC values shown are 100% and 60%]
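The slide does not spell out how RFC is computed, so the following is only a simplified pairwise sketch under stated assumptions: two equal-length aligned rows with '-' for gaps, and RFC taken as the share of aligned nucleotides that sit in the most common frame offset. The function name is hypothetical.

```python
from collections import Counter

def rfc_score(ref, other):
    """Simplified pairwise reading frame conservation for two aligned rows."""
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    ref_pos = other_pos = 0
    offsets = Counter()
    for r, o in zip(ref, other):
        if r != "-" and o != "-":
            # frame of this column in `ref` relative to its frame in `other`
            offsets[(ref_pos - other_pos) % 3] += 1
        if r != "-":
            ref_pos += 1
        if o != "-":
            other_pos += 1
    total = sum(offsets.values())
    return max(offsets.values()) / total if total else 0.0

print(rfc_score("ATGGCTAAATGA", "ATG---AAATGA"))  # 3-nt gap keeps the frame
print(rfc_score("ATGGCTAAATGA", "ATG-CTAAATGA"))  # 1-nt gap shifts it
```

A gap that is a multiple of three leaves every aligned nucleotide in the same offset, while a single-nucleotide gap moves everything downstream into a different one, which is why frameshifts separate genes from intergenic regions so strongly.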

Page 8

Signature 2: Distinct patterns of codon substitution

[Figure: two codon substitution matrices, one for Genes and one for Intergenic; axes: codon observed in species 1 (vertical) vs. codon observed in species 2 (horizontal)]

• Codon substitution patterns specific to genes
– Genetic code dictates substitution patterns
– Amino acid properties dictate substitution patterns

Page 9

Codon Substitution Matrix (CSM)

[Figure: human vs. mouse codon substitution matrix; amino acid group labels include aliphatic, aromatic, negative, positive, polar]
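One way to turn the two matrices into a score is a log-odds sum over aligned codon pairs. The sketch below assumes that formulation; the talk does not give the exact scoring rule, and the frequencies here are made-up toy numbers, not values from the slides.

```python
import math

def csm_log_odds(codon_pairs, coding_freq, noncoding_freq, floor=1e-6):
    """Sum of log(P_coding / P_noncoding) over aligned codon pairs.

    Pairs missing from a table fall back to `floor` to avoid log(0).
    """
    score = 0.0
    for pair in codon_pairs:
        p_cod = coding_freq.get(pair, floor)
        p_non = noncoding_freq.get(pair, floor)
        score += math.log(p_cod / p_non)
    return score

# Hypothetical toy frequencies: a silent GAA->GAG change is common in genes,
# while a change to the stop codon TAA is rare in genes.
coding = {("GAA", "GAG"): 0.02, ("GAA", "TAA"): 0.0001}
noncoding = {("GAA", "GAG"): 0.005, ("GAA", "TAA"): 0.005}

print(csm_log_odds([("GAA", "GAG")], coding, noncoding))  # positive: protein-like
print(csm_log_odds([("GAA", "TAA")], coding, noncoding))  # negative: atypical
```

Positive totals indicate substitutions that look protein-like, negative totals indicate substitutions that are atypical for coding sequence.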

Page 10

Signatures 3, 4, 5, 6, 7, etc…

• Mutation patterns of splicing signals
– Real splice acceptors/donors evolve in specific ways
• Evolution of other motifs associated with splicing
– Exonic/Intronic Splicing Enhancers/Silencers (ESE, ISE, ESS, ISS)
– Density of motif clouds surrounding real exons
• Sharp conservation boundaries
– Relative conservation of exon vs. surrounding regions
• Length of longest ‘open’ reading frame
– Frequency of stop codons in each frame / each species

[Figure: a real exon containing ESEs, bounded by the acceptor site and the donor site, with ISEs in the flanking introns]
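The last signature (stop codon frequency per frame and longest open stretch) is simple to compute. The sketch below assumes a plain DNA string on the forward strand and the standard stop codons; the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def stop_counts_per_frame(seq):
    """Number of stop codons in each of the three forward frames."""
    seq = seq.upper()
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(codon in STOPS for codon in codons))
    return counts

def longest_open_stretch(seq, frame):
    """Length, in codons, of the longest run without a stop codon in `frame`."""
    seq = seq.upper()
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best

example = "ATGGCTTAAGCTGCTGCTTGA"  # made-up sequence
print(stop_counts_per_frame(example), longest_open_stretch(example, 0))
```

In a comparative setting the same counts would be taken per species, as the slide suggests.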

Page 11

Putting it all together: probabilistic framework

• Hidden Markov Models (HMMs)
– Generative model, learn emission, transition probabilities
– Easy to train, hard to integrate long-range signals
• Conditional Random Fields (CRFs)
– Discriminative dual of HMMs, learn weights on features
– Easy to integrate diverse signals, gradient ascent for training

Page 12

From HMMs … to CRFs

[Figure: graphical models of a hidden sequence y(i-1), y(i), y(i+1) over an observed sequence X; in the CRF, feature functions F(i-1), F(i), F(i+1) connect the hidden labels to X]

Page 13

From HMMs … to CRFs

Both models are written in terms of a local score $F(y', y, i, X)$, where $y'$ is the label at position $i-1$ and $y$ the label at position $i$:

$P(X, Y) = \prod_{i=1}^{L} F(y_{i-1}, y_i, i, X)$

$V(i, y) = \max_{y'} \left( F(y', y, i, X) \, V(i-1, y') \right)$

$\alpha(i, y) = \sum_{y'} F(y', y, i, X) \, \alpha(i-1, y')$

Generative model (HMM): transition and emission probabilities

$F(y', y, i, X) = e_y(x_i) \, a_{y'y}$

Discriminative model (CRF): feature functions $f_j(y', y, i, X)$

For example, features can simply be $e_i$ and $a_{ij}$

Or pretty much anything (feature indices shown on the slide: 17, 50503, 50509):

$f(y', y, i, X)$ = distance to nearest BLAST hit
$f(y', y, i, X)$ = % CpG in $x_i \ldots$
$f(y', y, i, X)$ = % Heads in $x_i \ldots$
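The Viterbi recursion over a generic local score can be sketched as follows. This is an illustration in log space, not the implementation used in the talk. The HMM special case plugs in transition and emission probabilities, and all parameter values and label names below are made-up assumptions.

```python
import math

def viterbi(labels, length, log_score):
    """Best label path; log_score(prev, cur, i) plays the role of log F(y', y, i, X).

    `prev` is None at position 0.
    """
    V = {y: log_score(None, y, 0) for y in labels}
    back = []
    for i in range(1, length):
        new_V, ptr = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda p: V[p] + log_score(p, y, i))
            new_V[y] = V[best_prev] + log_score(best_prev, y, i)
            ptr[y] = best_prev
        V = new_V
        back.append(ptr)
    path = [max(labels, key=lambda y: V[y])]
    for ptr in reversed(back):  # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical two-state HMM: F = a[y'][y] * e[y][x_i]
LABELS = ["low_GC", "high_GC"]
START = {"low_GC": 0.5, "high_GC": 0.5}
TRANS = {"low_GC": {"low_GC": 0.9, "high_GC": 0.1},
         "high_GC": {"low_GC": 0.1, "high_GC": 0.9}}
EMIT = {"low_GC": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "high_GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

def make_hmm_log_score(X):
    def log_score(prev, cur, i):
        a = START[cur] if prev is None else TRANS[prev][cur]
        return math.log(a) + math.log(EMIT[cur][X[i]])
    return log_score

X = "ATATGCGCGC"
print(viterbi(LABELS, len(X), make_hmm_log_score(X)))
```

Because `viterbi` only sees `log_score`, a CRF-style score (a weighted sum of arbitrary feature functions of the previous label, current label, position and X) could be passed in without changing the decoder, which is the flexibility the slide points at.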

Page 14

Running on real genomes

• Obtain optimal weights (from training set)
– Experimentally-defined, genetics, curation, cDNA
• Apply CRF systematically to new genome
– Revisit existing genomes
– Annotate new genomes

Page 15

Revisiting fly genome annotation

[Figure: D. melanog. compared with D. simulans, D. erecta, D. persimilis (…); annotation outcomes: 10,845 fully confirmed; 579 fully rejected; 2,499 not aligned; +668 exons in 443 genes; 1,454 exons (~800 genes)]

• Power of evolutionary signatures
– New genes and exons, dubious genes and exons
– Adjust gene boundaries: ATG, frame, splice site, seq errors
• Signatures more powerful than primary signals
– Recognize unusual gene structures: read-through, uORFs, editing
• Towards a revised genome annotation
– Curation: FlyBase integrates prediction with cDNA, protein, literature
– Experimentation: BDGP large-scale functional validation of novel exons

Page 16

Systematic application leads to

• Exon-level changes
– Ex 1: New genes
– Ex 2: New exons
– Ex 3: Dubious genes
• More subtle changes
– Ex 4: Start/end adjustments
– Ex 5: Wrong reading frame
– Ex 6: Splice site adjustments
– Ex 7: Sequencing errors fixed
• Unusual gene structures
– W1: Stop-codon read-through
– W2: uORFs & dicistronic
– W3: Internal frame-shifts

[Figure: Genes vs. Intergenic, shown for Reading Frame Conservation and for the Codon Substitution Matrix (codon observed in species 1 vs. codon observed in species 2)]

Page 17

Example 1: Known genes stand out

Sharp conservation boundaries. Known exons stand out. High sensitivity and specificity.

[Figure legend: conserved, substitution, insertion, frameshift, gap]

Page 18

Example 2: Novel multi-exon gene

• 1,454 novel exons outside known genes
– Many cluster in new multi-exon genes
– Others are isolated high-confidence exons

Page 19

Example 2b: Novel exons inside known genes

(sorry, this example is from human, mouse, dog, rat)

• 668 cases in fly
– New candidate alternatively spliced gene forms
– New protein domains

Page 20

Novel genes and exons

• 1,454 novel exons outside existing genes
– 60% cluster in 300 multi-exon genes
– 40% isolated exons
• 668 novel exons inside existing genes
– Alternative splicing: Many with cDNA support
– Nested genes: Few known examples
• Human curation
– Collaboration with FlyBase
– Hundreds of changes in release 5.1, more in 5.2
• Systematic experimentation
– Sue Celniker and the Berkeley Drosophila Genome Project (BDGP)
– Thousands of new genes in the pipeline

Page 21

Example 3: Dubious single-exon gene

• Only evidence was an open reading frame
– Comparative information much stronger

Page 22

579 Dubious Genes

• Classification approach: Yes / No answer
– Closely related species: both genes and intergenic aligned
– Show very different patterns of mutation
• Comparative analysis provides negative evidence
– Alignment is unambiguous, orthologous, spans entire gene
– Sequence shows mutations and indels in every species
• Weak or missing experimental evidence
– 100 of these independently rejected by FlyBase
– These are missing from systematic clone collections
– Only 34 (6%) have assigned names (vs. 36% of all fly genes)

Page 23

Systematic application leads to

• Exon-level changes
– Ex 1: New genes
– Ex 2: New exons
– Ex 3: Dubious genes
• More subtle changes
– Ex 4: Start/end adjustments
– Ex 5: Wrong reading frame
– Ex 6: Splice site adjustments
– Ex 7: Sequencing errors fixed
• Unusual gene structures
– W1: Stop-codon read-through
– W2: uORFs & dicistronic
– W3: Internal frame-shifts

[Figure: Genes vs. Intergenic, shown for Reading Frame Conservation and for the Codon Substitution Matrix (codon observed in species 1 vs. codon observed in species 2)]

Page 24

Example 4: Start codon adjustment

• Codon substitution patterns suggest new start in 200 genes
– Score each substitution using Codon Substitution Matrix (CSM)

[Figure: CG6664/FBtr0100439, showing the annotated start codon (ATG) and the conserved start codon (ATG); legend: poor CSM score, atypical substitution; high CSM score, protein-like substitution]

Page 25

Example 5: Gene annotated on wrong reading frame

• cDNA evidence supports overlapping reading frames, both open
– Annotation traditionally selects longer one
– Conservation enables distinguishing the two

[Figure: Annotated ORF (345nt) vs. Real ORF (315nt). mRNA supports both ORFs; conservation only supports shorter ORF. Shorter ORF is the correct one; CG7738-RA is incorrect.]

Page 26

Example 6: Incorrect splice causes wrong frame

• Second exon annotated in the wrong frame
– Due to splice site boundary error
– Correction is supported by cDNA evidence

[Figure: first exon in correct frame, 2nd exon in incorrect frame; fix exon boundary]

Page 27

Example 7: Detect seq. errors / strain mutations

• Insertion/deletion causes frameshift
– Conservation signature shifts from ‘frame1’ to ‘frame2’
– All other species disagree with D. melanogaster indel
– Sequencing error or species-specific mutation

chr3R:6,953,865-6,953,927 (Ugt86Dd)

dm     CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA
droSec CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droSim CAGTACATATTTATGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droYak CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCCAAGGGTCACCAGGTGACCGTTA
droEre CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCTAGGGGTCACCAGGTGACTGTTA
droAna CAGTACATCTTTGTGGAGACCTATCTGAAGGCTTTGGCCGACAAAGGTCACCAGGTGACTGTTA
droWil CAATACATATTCATTGAGGCGTATCTAAAGGCATTGGCTGCCAAAGGACATCAGTTAACTGTGA
droMoj CAGTACATATTCGCCGAGGCGTATTTGAAGGCGCTAGCAGCCCGGGGCCATGAGGTCACCGTGA
droVir CAGTATATATTTGCCGAGTCGTATTTGAAGGCCTTGGCAGCGCGGGGTCATGAGGTGACAGTGA
frame  01201201201201201201201201201201 2012012012012012012012012012012
       ** ** ** ** *** ** * ** * * ** * ** ** ** * ** ** *

[Figure annotation: conservation in correct frame (left), frame-shift (sequencing error / recent mutation), conservation in 2nd frame (right)]
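The frame switch in this example can be located mechanically. The sketch below tracks the relative frame offset between two aligned rows and reports the columns where it changes. It uses the dm and droSec rows from this slide; the function names are hypothetical and the procedure is an illustration, not the method used in the talk.

```python
def frame_offsets(a, b):
    """Per-column offset (a_pos - b_pos) mod 3; None where either row has a gap."""
    a_pos = b_pos = 0
    offsets = []
    for x, y in zip(a, b):
        both = x != "-" and y != "-"
        offsets.append((a_pos - b_pos) % 3 if both else None)
        a_pos += x != "-"
        b_pos += y != "-"
    return offsets

def shift_columns(a, b):
    """0-based columns where the offset differs from the previous aligned column."""
    changes, prev = [], None
    for col, off in enumerate(frame_offsets(a, b)):
        if off is None:
            continue
        if prev is not None and off != prev:
            changes.append(col)
        prev = off
    return changes

dm     = "CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA"
droSec = "CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA"
print(shift_columns(dm, droSec))
```

Running the same comparison against each of the other species would show whether they all place the shift at the same column, which is the pattern the slide reads as a sequencing error or a recent D. melanogaster mutation.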

Page 28

Example 8: Dubious gene is a miRNA transcript

• Evolutionary signatures reveal specific function

Page 29

Systematic application leads to

• Exon-level changes
– Ex 1: New genes
– Ex 2: New exons
– Ex 3: Dubious genes
• More subtle changes
– Ex 4: Start/end adjustments
– Ex 5: Wrong reading frame
– Ex 6: Splice site adjustments
– Ex 7: Sequencing errors fixed
• Unusual gene structures
– W1: Stop-codon read-through
– W2: uORFs & dicistronic
– W3: Internal frame-shifts

[Figure: Genes vs. Intergenic, shown for Reading Frame Conservation and for the Codon Substitution Matrix (codon observed in species 1 vs. codon observed in species 2)]

Page 30

Unusual genes 1: Stop codon read-through

• Method #1 (single exons)
– 112 events, 95 extending known genes
– Manual curation: 82
– Enriched in neuronal function
• Method #2 (after splicing)
– 256 events, looser cutoff, large overlap, needs manual curation
– Enriched in transcription factors

[Figure: protein-coding conservation up to the stop codon; stop codon read through; continued protein-coding conservation up to the 2nd stop codon; no more conservation afterwards]
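The candidate read-through region is the stretch between the first and second in-frame stop codons. A minimal sketch of extracting it follows, assuming a forward-strand DNA string whose reading frame starts at `start`; the function names are hypothetical. Scoring that region for protein-coding conservation would be a separate step.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def in_frame_stops(seq, start=0):
    """0-based positions of in-frame stop codons, scanning codon by codon from `start`."""
    seq = seq.upper()
    return [i for i in range(start, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]

def readthrough_region(seq, start=0):
    """Sequence between the first and second in-frame stop codons, or None."""
    stops = in_frame_stops(seq, start)
    if len(stops) < 2:
        return None
    return seq[stops[0] + 3:stops[1]]

toy = "ATGGCTTAAGCTGCTTGA"  # made-up: ATG GCT TAA GCT GCT TGA
print(in_frame_stops(toy), readthrough_region(toy))
```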

Page 31

Unusual genes 2: Polycistronic messages / uORFs

• Method
– High-scoring ORFs with cDNA evidence
– Disjoint from the annotated ORF
• Results
– 217 cases

[Figure: protein-coding conservation in the 5’UTR]

Page 32

Unusual genes 3: Frame-shift in the middle of exons

• Method
– Exons changing high-scoring frame
– Far from splice junctions
• Results
– 68 cases in 44 genes

chrX:2,226,518-2,226,639 (CG14047)

dm     GACTATTTCAACAATCAGCAGCGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droSim GACTATTTCAACAACCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droSec GACTATTTCAACAACCAACAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droYak GACTACTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GGCGAGATTTGTACCGCCTCCACCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCG
droEre GACTATTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTTCTCACGCAGACCG
droAna GACTACTACAACAATCAGCAGCGGGAGCGGCACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGGCCAGCGGCGAAGTTCGTCCCTCCTCCGCCGCCTCCGCGACGTTTGCTTCTCACGCAGACAG
droPse GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGCAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCA
droPer GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGAAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCA
droWil GACTACTACAACAATCAGCAGAGGGAGCGACACTACGAGCAACGTCGCCAAAGCCAGCGGCAGGCC---AGCCAAATTTATACCACCGCCACCGCCTCCACGTCGACTGCTGCTAACGCAGACAA
droMoj GACTACTACAACAACCAGCAGCGGGAGCGGCACTACCAGCTGCGCCACCAGAGCCAACGTCAAGCC---ACCGAGATTTATACCACCACCGCCGCCGCCTCGTCGTCTGCTGCTCACGCAGACAA
droVir GACTACTACAACAACCAACAGCGGGAGCGGCACTACCAGCAGCGCCGCCAGAGCCAACGTCAAGCC---ACCGAGATTCATTCCACCGCCGCCGCCGCCTCGTCGTCTGCTGCTCACGCAGACAA
droGri GACTACTACAACAATCAGCAGCGGGAGCGGCACTATCAACAGCGTCGCCAGAGTCATCGTCAAGCC---ACCGAGATTTATACCACCACCACCGCCACCTCGTCGTCTATTGCTCACGCAGACAA
frame  012012012012012012012012012012012012012012012012012012012012012012   01201201201201201201201201201201201201201201201201201201
       ***** * ****** ** ** * ***** ***** * * ** ** ** ** ** * ** * * ** * ** ** ** ***** ** ** ** * * ** ********

[Figure annotation: frame 012, Frame 1 is high-scoring (left); frame 120, Frame 2 is high-scoring (right)]

Page 33

Initial results for the whole human genome

[Figure: Human compared with Dog, Mouse, Rat; annotation outcomes: 9,862 fully confirmed; 7,717 refined; 1,065 fully rejected; 1,919 not aligned; 454 novel (2591 exons)]

• Fully rejected genes: weak/no evidence
• New exons: existing & novel experimental evidence
• Need: large-scale functional annotation for novel genes

Page 34

Discriminative framework shows continued increase in power

• Reading frame conservation (RFC) score

[Figure: RFC score histograms for increasing species sets: Dmel,Dpse (2 species); Dmel,Dyak,Dpse (3 species); Dmel,Dyak,Dpse,Dwil,Dgri (5 species); 12 flies (12 species)]

• Codon substitution matrix (CSM) score

[Figure: CSM score distributions for 2 species vs. 12 species; percentages shown: 90%, 10%, 30%, 70%, 80%, 95%, 5%, 20%]

Page 35

Overview

Part 1. Genome interpretation
– Evolutionary signatures of genes
– Revisiting the human and fly genomes
– Unusual gene structures

Part 2. Gene regulation
– Regulatory motif discovery
– microRNA regulation
– Enhancer identification

Part 3. Genome evolution
– Phylogenomics
– The two forces of gene evolution
– Accurate gene trees in complete genomes

Page 36

Who’s actually doing the work

• Matt Rasmussen: Phylogenomics
• Erez Lieberman: Motif evolution
• Aviva Presser: Network evolution
• Mike Lin: Gene identification
• Alex Stark: Fly motifs and miRNAs
• Pouya Kheradpour: Human enhancers
• Josh Grochow: Network motif discovery
• Ameya Deoras: Spectral genomics