Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Brendan J. Frey, University of Toronto

DESCRIPTION

Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome. Brendan J. Frey, University of Toronto. Purpose of my talk: to identify aspects of bioinformatics in which attendees of ISIT may be able to make significant contributions.

TRANSCRIPT

Page 1: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Brendan Frey

Beyond Genomics:

Detecting Codes and Signals in the Cellular Transcriptome

Brendan J. Frey

University of Toronto

Page 2: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Purpose of my talk

To identify aspects of bioinformatics in which attendees of ISIT may be able to make significant contributions

Page 4: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The Genome

Page 5: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Starting point: Discrete biological sequences

• Symbols are Bases: G, C, A, T

• Examples of biological sequences
– Genes
– DNA
– Chromosomes
– Proteins
– Peptides
– RNA
– Viruses
– HIV

RED indicates a definition that you should remember

Page 6: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

DNA Sequence (GCATTCATGC…)

Sexual cell reproduction

Nucleus

Chromosomes: Inherited DNA sequence

Cell replication

Page 7: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The genome

• Genome: Chromosomal DNA sequence from an organism or species

• Examples

Genome    Length (bases)
Human     3,000 million (750 MB)
Mouse     2,600 million
Fly       100 million
Yeast     13 million
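The 750 MB figure quoted for the human genome is consistent with storing each base in 2 bits, since there are four symbols. A minimal sketch of that arithmetic, assuming 2 bits per base and 1 MB = 10^6 bytes (both are assumptions, as the slide does not state how the figure was derived; the function name is hypothetical):

```python
# Assumption: each of the four bases (G, C, A, T) is stored in 2 bits,
# and 1 MB is taken as 10**6 bytes. The slide does not state either.
BITS_PER_BASE = 2


def genome_size_mb(num_bases: int) -> float:
    """Storage needed for a genome at 2 bits per base, in megabytes."""
    return num_bases * BITS_PER_BASE / 8 / 1e6


# Genome lengths as listed on the slide.
genome_lengths = {
    "Human": 3_000_000_000,
    "Mouse": 2_600_000_000,
    "Fly": 100_000_000,
    "Yeast": 13_000_000,
}

for name, num_bases in genome_lengths.items():
    print(f"{name}: {genome_size_mb(num_bases):g} MB")
```

Under the same assumptions the mouse genome would need 650 MB, the fly genome 25 MB and the yeast genome 3.25 MB.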

Page 8: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Genes

• A gene is a subsequence of the genome that encodes a functioning bio-molecule

• The library of known genes
– Comprises only 1% of genome sequence
– Increases in diversity every year
– Is probably far from complete

Page 9: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The Transcriptome

Page 10: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Genome: The digital backbone of molecular biology

Transcripts: Perform functions encoded in the genome

Page 11: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Traditional genes

DNA → [Transcription] → Transcript (RNA) → [Translation] → Protein

Transcription
Input: DNA
Output: Transcript

Translation
Input: Transcript
Output: Protein
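The two stages can be sketched as a pair of functions, one per input/output pair. This is a toy illustration only: the function names and the example sequence are made up, the DNA is assumed to be the coding strand (so transcription amounts to replacing T with U), and the codon table is deliberately limited to the four codons used in the example.

```python
# Partial codon table: just enough for the example below.
# "*" marks a stop codon. A real table has 64 entries.
CODON_TABLE = {"AUG": "M", "GCA": "A", "UUC": "F", "UAA": "*"}


def transcribe(coding_dna: str) -> str:
    """Input: DNA (coding strand). Output: transcript (RNA)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Input: transcript. Output: protein, as one-letter amino acid codes."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCATTCTAA"  # made-up example sequence
transcript = transcribe(dna)
print(transcript)             # AUGGCAUUCUAA
print(translate(transcript))  # MAF
```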

Page 12: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Traditional genes

DNA → [Transcription] → Transcript (RNA) → [Translation] → Protein

Genome → Transcriptome → Proteome

Page 13: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Transcription

[Figure labels: Upstream region; DNA; Exon; Intron; Regulatory proteins; Transcription proteins; Transcript (RNA); Gene]

Page 14: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Transcription

[Figure labels: Upstream region; DNA; Exon; Intron]

Page 15: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Transcription

[Figure labels: Upstream region; DNA; Exon; Regulatory protein; sequence CGTGGATAGTGAT]

• Code: Set of regulatory codewords

• Signals: Concentrations of regulatory proteins and the output transcript

• Codewords in the upstream region bind to corresponding regulatory proteins
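Detecting codewords in an upstream region can be sketched as a substring search. The helper name and the upstream sequence below are made up for illustration; exact string matching is a simplification, since real binding sites tolerate variation, and this is not the method used in the talk.

```python
def find_codewords(upstream: str, codewords: list[str]) -> dict[str, list[int]]:
    """Return, for each codeword, the 0-based positions where it occurs."""
    hits = {}
    for word in codewords:
        positions = []
        start = upstream.find(word)
        while start != -1:
            positions.append(start)
            # Continue one base further so overlapping hits are found too.
            start = upstream.find(word, start + 1)
        hits[word] = positions
    return hits


# Made-up upstream region containing the sequence shown on the slide.
upstream_region = "AACGTGGATAGTGATCC"
print(find_codewords(upstream_region, ["CGTGGATAGTGAT", "TTAGAT"]))
# {'CGTGGATAGTGAT': [2], 'TTAGAT': []}
```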

Page 16: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Splicing of transcripts

[Figure labels: Transcript (RNA); Exon; Intron; Regulatory proteins]

Page 17: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Splicing of transcripts

[Figure labels: Transcript (RNA); Exon; Intron; Regulatory proteins; Splicing proteins]

Page 18: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Splicing of transcripts

[Figure labels: Transcript (RNA); Exon; Intron]

• The intron is spliced out

• However, splicing may occur quite differently…

Page 19: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Splicing of transcripts

[Figure labels: Transcript (RNA); Exon; Intron; Regulatory proteins; Splicing proteins]

Page 20: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Splicing of transcripts

[Figure labels: Regulatory proteins; Splicing proteins]

Page 21: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Splicing of transcripts

[Figure labels: Splicing proteins; Regulatory proteins]

The middle exon is ‘skipped’, leading to a different transcript
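Exon skipping can be illustrated by joining exons with and without the middle one. This is a toy sketch: the function name and the three exon sequences are made up, and the introns are assumed to have been removed already.

```python
def splice(exons: list[str], skip: frozenset[int] = frozenset()) -> str:
    """Join exons in order, leaving out any whose index is in `skip`."""
    return "".join(exon for i, exon in enumerate(exons) if i not in skip)


# Three made-up exons.
exons = ["AUGGCA", "UUCGGA", "CCAUAA"]

full_transcript = splice(exons)
skipped_transcript = splice(exons, skip=frozenset({1}))  # middle exon skipped

print(full_transcript)     # AUGGCAUUCGGACCAUAA
print(skipped_transcript)  # AUGGCACCAUAA
```

The same gene therefore yields two different spliced transcripts, depending on whether the middle exon is kept.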

Page 22: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Splicing of transcripts

[Figure labels: Regulatory proteins; sequences TTAGAT and TGGGGT]

• Code: Set of regulatory codewords

• Signals: Concentrations of regulatory proteins and different spliced transcripts

• Codewords in the introns and exons bind to corresponding regulatory proteins

Page 23: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The modern transcriptome

[Figure labels: Genome; Cell nucleus; TRANSCRIPTION (Liver); TRANSCRIPTION (Brain and Liver); TRANSCRIPTION; Transcript (RNA); SPLICING; SPLICING (Brain); SPLICING (Liver); Transcript (mRNA); mRNA; TRANSLATION; Protein; Protein A; Protein B; Non-traditional transcript; Non-functional transcripts]

Page 24: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The modern transcriptome

… it turns out to be surprising in many ways

[Figure labels: Genomic DNA; Cell nucleus; TRANSCRIPTION in Liver; TRANSCRIPTION in Brain and Liver; TRANSCR.; Transcript (RNA); SPLICING; SPLICING (Brain); SPLICING (Liver); Spliced transcript (mRNA); mRNA; TRANSLATION; Protein; ncRNA; Non-traditional transcript; Alternative transcripts; Brain; Liver; Non-functional transcripts]

# genes, ½ trans, 60% AS, 18k AS, 20% dis, 10k ncRNA

Page 25: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The Resources

Page 26: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Your collaborators can do lab work…

• Sequencing: Snag an actual transcript and figure out its sequence

• Microarrays: Find out if your predicted transcript fragment is expressed in a tissue sample

• Mass spectrometry: Find out if a protein is present in a sample

Page 27: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Databases

• Genomes

• Genome annotations

• Libraries of observed transcript fragments

• Microarray datasets containing measured concentrations of transcripts

• …

Page 28: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Measuring transcript concentrations using microarrays

1. Fabricate microarray with probes
2. Extract transcripts from cell
3. Add fluorescent tag
4. Hybridize tagged sequence to microarray
5. Excite fluorescent tag with laser and measure intensity

[Figure labels: Cell; probes AGCTA, GTGTA, TCAAG, CGGTG, TTGAA; sequences TCGGTCACAT and AGCCAGTGTA]
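The hybridization step relies on base pairing: A pairs with T, and G with C. The sketch below checks which of the probes shown on the slide would pair with the sequence TCGGTCACAT. It follows the slide's drawing, where the two strands are written position by position in the same direction; real strands pair antiparallel, so practical tools compare against the reverse complement. The function names are hypothetical, and measured intensity is not modelled.

```python
# Watson-Crick pairing: A with T, G with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq: str) -> str:
    """Base-by-base complement, read in the same direction as the input."""
    return "".join(PAIR[base] for base in seq)


def hybridizes(probe: str, transcript: str) -> bool:
    """True if the probe pairs base by base with some stretch of the transcript."""
    return complement(probe) in transcript


probes = ["AGCTA", "GTGTA", "TCAAG", "CGGTG", "TTGAA"]  # from the slide
transcript = "TCGGTCACAT"                               # from the slide

print(complement(transcript))                            # AGCCAGTGTA
print([p for p in probes if hybridizes(p, transcript)])  # ['GTGTA']
```

The complement of TCGGTCACAT is AGCCAGTGTA, the other sequence shown on the slide, and only the probe GTGTA finds a matching stretch.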

Page 29: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Inkjet printer technology
Hughes et al, Nature Biotech 2001

Print nucleic acid sequences using inkjet printer

Page 30: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Then and now…

• First microarrays (late 1990s)
– ‘Cancer chips’, ‘gene chips’, …
– 5,000-10,000 probes per slide
– Noisy

• Current microarrays
– ‘Sub-gene resolution’
– 200,000 probes per slide
– Low noise
– Multi-chip designs are cost effective

Page 31: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The Case Study: Discovering protein-making transcripts using factor graphs

BJ Frey, …, TR Hughes
Nature Genetics, September 2005

Page 32: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Controversy about the gene library

Despite Frey et al’s impressive computational reconstruction of gene structure, we argue that this does not prove the complexity of the transcriptome

– FANTOM/RIKEN Consortium Science, March 2006

How it all started…

Page 33: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Research on the transcriptome

[Timeline figure. Labels: Analysis of genome; Detection of transcripts; Our project. Date ranges shown: 1960’s-2000; 2001-2005; 2001-2006; 2003-2005]

Page 34: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Estimates of number of undiscovered genes

[Timeline figure, 2000 to 2005]
Genome: ~10,000 (IHGSC, Nature)
Genome: ~3000 (IHGSC, Nature)
Kapranov et al, Rinn et al, Shoemaker et al: ~300,000
Bertone et al: ~11,000 (Science)

Page 35: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Our microarrays

• Our genome analysis highlighted 1 million possible exons (~180,000 already known)

• We designed one 60-base probe for each possible exon

[Figure: number of probes per 8000 bases and number of known exons per 8000 bases, plotted against coordinates (in bases) in Chromosome 4]

Page 36: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Our samples (37 tissues)

Twelve pools of mouse mRNA

Pool  Composition (mRNA per array hybridization)
1     Heart (2 µg), Skeletal muscle (2 µg)
2     Liver (2 µg)
3     Whole brain (1.5 µg), Cerebellum (0.48 µg), Olfactory bulb (0.15 µg)
4     Colon (0.96 µg), Intestine (1.04 µg)
5     Testis (3 µg), Epididymis (0.4 µg)
6     Femur (0.9 µg), Knee (0.4 µg), Calvaria (0.06 µg), Teeth+mandible (1.3 µg), Teeth (0.4 µg)
7     15d Embryo (1.3 µg), 12.5d Embryo (12.5 µg), 9.5d Embryo (0.3 µg), 14.5d Embryo head (0.25 µg), ES cells (0.24 µg)
8     Digit (1.3 µg), Tongue (0.6 µg), Trachea (0.15 µg)
9     Pancreas (1 µg), Mammary gland (0.9 µg), Adrenal gland (0.25 µg), Prostate gland (0.25 µg)
10    Salivary gland (1.26 µg), Lymph node (0.74 µg)
11    12.5d Placenta (1.15 µg), 9.5d Placenta (0.5 µg), 15d Placenta (0.35 µg)
12    Lung (1 µg), Kidney (1 µg), Adipose (1 µg), Bladder (0.05 µg)

Page 37: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Signal: The data (small part of the data from Chromosome 4)

Code: A ‘vector repetition code with deletions’

[Figure labels: Example of a transcript; Each column is an expression profile]

Page 38: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The transcript model

Each transcript is modeled using
– A prototype expression profile
– # probes before prototype (eg, 1)
– # probes after prototype (eg, 4)
– Flag indicating whether each probe corresponds to an exon

e e e e e
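The four ingredients of the transcript model can be written down as a small data structure, together with a toy generator that mimics a ‘vector repetition code with deletions’: exon probes repeat the prototype profile with noise, and non-exon probes carry only noise. Everything beyond the four listed fields is an assumption made for illustration (the class and function names, Gaussian noise, a zero background level, the example values); it is not the noise model used in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class TranscriptModel:
    prototype: list[float]  # prototype expression profile (one value per pool)
    n_before: int           # number of probes before the prototype probe
    n_after: int            # number of probes after the prototype probe
    is_exon: list[bool]     # one flag per probe, in genomic order

    def __post_init__(self) -> None:
        n_probes = self.n_before + 1 + self.n_after
        if len(self.is_exon) != n_probes:
            raise ValueError(f"expected {n_probes} exon flags")


def sample_profiles(model: TranscriptModel, noise: float = 0.1,
                    seed: int = 0) -> list[list[float]]:
    """Toy generator: exon probes repeat the prototype plus Gaussian noise;
    non-exon probes ('deletions') show only noise around zero (assumption)."""
    rng = random.Random(seed)
    profiles = []
    for flag in model.is_exon:
        if flag:
            profiles.append([v + rng.gauss(0.0, noise) for v in model.prototype])
        else:
            profiles.append([rng.gauss(0.0, noise) for _ in model.prototype])
    return profiles


# 1 probe before and 4 after the prototype, as in the slide's example;
# the prototype values and exon flags are made up.
model = TranscriptModel(prototype=[2.0, 0.1, 1.5], n_before=1, n_after=4,
                        is_exon=[True, True, False, True, True, True])
for profile in sample_profiles(model):
    print([round(v, 2) for v in profile])
```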

Page 39: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The factor graph

[Figure: factor graph with one group of variables per probe, $i = 1, \ldots, n$, in genomic order]
$t_i$: Transcription start/stop indicator
$r_i$: Relative index of prototype
$e_i$: Exon versus non-exon indicator
$s_i$: Probe sensitivity & noise
$x_i$: Expression profile (genomic order)

The prototype for $x_i$ is $x_{i+r_i}$, $r_i \in \{-W, \ldots, W\}$. We use $W = 100$.

Only 1 free parameter: the probability of starting a transcript

Page 40: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

[Figure: factor graph over probes $i = 1, \ldots, n$ with variables $t_i$ (Transcription start/stop indicator), $r_i$ (Relative index of prototype), $e_i$ (Exon versus non-exon indicator), $s_i$ (Probe sensitivity & noise) and $x_i$ (Expression profile, genomic order)]

After expression data (x) is observed, the factor graph becomes a tree

Page 41: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

After expression data (x) is observed, the factor graph becomes a tree

[Figure: the remaining tree over probes $i = 1, \ldots, n$ with variables $t_i$ (Transcription start/stop indicator), $r_i$ (Relative index of prototype), $e_i$ (Exon versus non-exon indicator) and $s_i$ (Probe sensitivity & noise)]

Computation: The max-product algorithm performs exact inference and learning.
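On a tree-structured graph, max-product finds the single most probable joint configuration exactly. The sketch below shows the idea on the simplest tree, a chain with two hidden states per probe (0 = non-exon, 1 = exon). It is a toy stand-in, not the factor graph from the talk: the function name, the state space and all probabilities are made up, and the learning part is not covered.

```python
import math


def max_product_chain(prior, transition, likelihoods):
    """Most probable state sequence on a chain (max-product, log domain).

    prior[k]          : probability of state k at the first position
    transition[j][k]  : probability of moving from state j to state k
    likelihoods[i][k] : probability of the observation at position i given state k
    """
    n_states = len(prior)
    # Forward pass: best log-score of any path ending in each state.
    score = [math.log(prior[k]) + math.log(likelihoods[0][k])
             for k in range(n_states)]
    backpointers = []
    for lik in likelihoods[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: score[j] + math.log(transition[j][k]))
            pointers.append(best_j)
            new_score.append(score[best_j] + math.log(transition[best_j][k])
                             + math.log(lik[k]))
        score = new_score
        backpointers.append(pointers)
    # Backward pass: follow the stored pointers from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Made-up numbers: four probes, states 0 = non-exon, 1 = exon.
prior = [0.5, 0.5]
transition = [[0.8, 0.2], [0.2, 0.8]]
likelihoods = [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]
print(max_product_chain(prior, transition, likelihoods))  # [0, 1, 1, 1]
```

Working in the log domain turns the products into sums and avoids numerical underflow on long chains.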

Page 42: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Summary of results *

• 10× more sensitive than other transcript-based methods

• Detected 155,839 exons

• Predicted ~30,000 new exons

• Reconciled discrepancies in thousands of known transcripts

* Exon false positive rate: 2.7%

Page 43: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Revisiting estimates of number of undiscovered genes

[Timeline figure, 2000 to 2005]
Genome: ~10,000 (IHGSC, Nature)
Genome: ~3000 (IHGSC, Nature)
Kapranov et al, Rinn et al, Shoemaker et al: ~300,000
Bertone et al: ~11,000 (Science)
Frey et al: ~0 (Nature Genetics)

SURPRISE!

Page 44: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Contentious results

[Timeline figure, 2000 to 2005]
Bertone et al: ~11,000 (Science)
Frey et al: ~0 (Nature Genetics)
FANTOM3: 5,154 (FANTOM Consortium, Science)

SURPRISE!

Page 45: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

… [We discovered] new mouse protein-coding transcripts, including 5,154 encoding previously-unidentified proteins …

– FANTOM/RIKEN Consortium

Science, Sep 2005

We wondered: Are these really new genes?

Page 46: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

… we found that 2917 of the FANTOM proteins are in fact splice isoforms of known transcripts …

– Frey et al Science, March 2006

… the number of new protein-coding genes found by us has been revised from 5154 to 2222…

– FANTOM/RIKEN Consortium Science, March 2006

Page 47: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Last word…

… the number of completely new protein-coding genes discovered by the FANTOM consortium is at most in the hundreds…

– Frey et al Science, March 2006

Page 48: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

The Closing Remarks

Page 49: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Open problems

• Producing genome-wide libraries of functioning transcripts, including
– Alternatively-spliced transcripts
– Transcripts that don’t make proteins

• Understanding functions of transcripts

• Developing models of how transcription and alternative splicing are regulated

• Developing models of gene interactions
– ‘Genetic networks’

Page 50: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Should you work in computational biology?

Pluses

• A major scientific frontier

• Potential for high impact on society

Minuses

• Mostly a collection of facts

• Mechanisms are complex and beyond our control

• Lacking a mathematical framework

Page 51: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Remember, communication theory also once lacked a mathematical framework…

“Ok, Zorg, let’s try using a prefix code”

Page 52: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Should you work in computational biology?

Minuses

• Mostly a collection of facts

• Mechanisms are complex and beyond our control

Pluses

• A major scientific frontier

• Potential for high impact on society

• Lacking a mathematical framework

Page 53: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

How do you enter this field?

• Hire a tutor (ie, student or postdoc)

• Hire a programmer

• Get involved in several ‘winner’ projects

• Be prepared to drop ‘loser’ projects

• Build mutually-beneficial collaborations

• How long will it take?

Page 54: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

For more information…

• As of Friday July 14, 2006: http://www.psi.toronto.edu/isit2006.html
– These slides
– Pointers to helpful papers, databases, etc

Page 55: Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Acknowledgements

• Frey Group
– Quaid D Morris (postdoc)
– Leo Lee (postdoc)
– Yoseph Barash (postdoc)
– Ofer Shai (PhD)
– Inmar Givoni (PhD)
– Jim Huang (PhD)
– Marc Robinson (programmer)

Genomics Collaborators

• Hughes’ Lab
• Blencowe’s Lab
• Emili’s Lab
• Boone’s Lab

Medical Collaborators: E Sat, J Rossant, BG Bruneau, JE Aubin