1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Do not reproduce without permission 1 Gerstein.info/talks (c) 2004 Annotation of Intergenic Regions of the Human Genome Mark B Gerstein Yale (Comp. Bio. & Bioinformatics) Cistrome 2007, Boston, MA 2007.04.30, 15:20-15:55 Slides downloadable from Lectures.GersteinLab.org (Please read permissions statement.) (Genome Annotation Talk without much pgenes, including Tilescope, HMMs, DART, binding sites, and pgene-transcription, All completed comfortably within time.)

Upload: ashley-bond

Post on 25-Dec-2015


TRANSCRIPT

Page 1:

Annotation of Intergenic Regions of the Human Genome

Mark B Gerstein
Yale (Comp. Bio. & Bioinformatics)

Cistrome 2007, Boston, MA

2007.04.30, 15:20-15:55
Slides downloadable from Lectures.GersteinLab.org

(Please read permissions statement.)

(Genome Annotation Talk without much pgenes, including Tilescope, HMMs, DART, binding sites, and pgene-transcription, All completed comfortably within time.)

Page 2:

[IHGSC, Nature 409, 2001] [Venter et al., Science 291, 2001]

Most of the human genome is not coding sequence

Page 3:

• Mike Snyder & Sherman Weissman

• Tiling of whole chromosomes into small fragments

• Large-scale hybridization to find transcribed regions in unbiased fashion and TF binding sites (via ChIP-chip)

• Careful Computational Annotation

+ENCODE

Page 4:

Overall Aim of Yale Genomics is Comprehensive Intergenic Annotation

• Regulatory regions, repeats, non-coding RNAs, origins of replication, pseudogenes, segmental duplications, unknown elements….

• Specific results within ENCODE: 1% of human genome (~30 Mb in 44 regions)
  • Pseudogenes (Zheng et al., GR)
  • Classification of Novel Transcribed Regions (Rozowsky et al., GR)
  • Characterization of Novel Structured RNAs (Washietl et al., GR)
  • Grouping and Classification of Binding Sites (from ChIP-chip)
  • Med. Scale (~100 kb) deserts and islands (Zhang et al., GR)
  • Novel Promoters (Trinklein et al., GR)

• CNVs and SDs (from hi-res aCGH, Korbel et al.)

Page 5:

Outline: Tiling Array Analysis + Annotation Pipelines

• Tools for Scoring Arrays
• Tools for Segmentation and Validation of Arrays
• Results on Clusters of Novel Transcribed Regions
• Results on Clusters of Binding Sites
• Results on Active, Transcribed Pseudogenes

Page 6:

Tiling Arrays Probing Intergenic Activity: Tools

Page 7:

Tilescope 101

▪ It is available at tilescope.gersteinlab.org

▪ It was designed for high-density tiling microarray data analysis.

▪ It is useful
  ▫ Most existing data processing software was designed for traditional microarrays.
  ▫ It is flexible: several microarray data processing methods are available.
  ▫ It is easy to use
    • It has a graphical user interface.
    • The data analysis process is streamlined.
    • It is online software. No need to install.
  ▫ It is free!

Zhang et al. (2007) GenomeBiology

Page 8:

Tilescope: system implementation

▪ Written in Java

▪ Composed of 3 parts: applet, servlet, and pipeline program

[Diagram labels: Users, Applet, Internet, Servlet, Pipeline, Server]

Zhang et al. (2007) GenomeBiology

Page 9:

Tilescope: user interface

Zhang et al. (2007) GenomeBiology

Page 10:

Tilescope: data processing

▪ Array data can be normalized by mean, median, quantile, and loess.

▪ Tile scoring generates the signal map and the P-value map.

▪ Feature identification produces ‘hits’.
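
As a rough illustration of the quantile option in the normalization step, the sketch below forces every array onto one shared reference distribution. It is a minimal pure-Python version written for this transcript, not Tilescope's own Java code; the function name `quantile_normalize` and the list-of-lists input layout are assumptions.

```python
def quantile_normalize(arrays):
    """Quantile-normalize several arrays of probe intensities.

    arrays: list of equal-length lists, one list per microarray
    (hypothetical layout chosen for this sketch).
    Returns new lists in which every array has the same sorted values.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    sorted_cols = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col[k] for col in sorted_cols) / len(arrays)
                 for k in range(n)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 4.0, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
```

Ties are broken by probe order here; a production implementation would usually average tied ranks.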

Zhang et al. (2007) GenomeBiology

Page 11:

HMM Segmentation + Optimal Selection of Regions to Validate

Page 12:

Tiling array analyses: modeling

(unknown)

Goal: identify S = 1 probes based on D
Method: build a model M’ based on D, compute S
Possible performance metric: error rate in predicting the state of a probe, experimental validation, …

(may need pre-processing)

Page 13:

Tiling array analyses: Hidden Markov model

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.

A model for transcriptional tiling array data

TAR: transcriptionally active region

Source: http://en.wikipedia.org/wiki/Hidden_Markov_model

State transitions in a hidden Markov model (example)
x: hidden states
y: observable outputs
a: transition probabilities
b: output probabilities
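
A two-state HMM of this kind (TAR versus non-TAR) can be decoded with the standard Viterbi algorithm. The sketch below is a generic illustration rather than the model of Du et al.: the Gaussian emission parameters, the transition probabilities, and all function names are assumptions made up for the example.

```python
import math


def gaussian_logpdf(y, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (y - mu) ** 2 / (2 * sigma ** 2))


def viterbi(obs, log_start, log_trans, emit_params):
    """Most likely hidden-state path for a sequence of probe intensities.

    log_start[s]    : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    emit_params[s]  : (mu, sigma) of the Gaussian emitted by state s
    """
    n_states = len(log_start)
    score = [log_start[s] + gaussian_logpdf(obs[0], *emit_params[s])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for y in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + gaussian_logpdf(y, *emit_params[s]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# State 0 = background, state 1 = TAR (illustrative parameters only).
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
emit_params = [(0.0, 1.0), (3.0, 1.0)]
path = viterbi([0.1, 0.2, 3.0, 3.2, 2.9, 0.0],
               log_start, log_trans, emit_params)
```

Runs of consecutive 1s in `path` would correspond to candidate TARs under these toy parameters.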

Page 14:

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.

Page 15:

Tiling array analyses: active sampling & supervised learning

• Active sampling: select a small set of sub-regions for validation first

• Supervised learning: use the validation data to train the statistical model

Source: Gerstein et al. (2007) Gen. Res.

Page 16:

Tiling array analyses: Sample sub-region selection

Can we find a good selection scheme?

Source: Gerstein et al. (2007) Gen. Res.

Page 17:

Tiling array analyses: Sample sub-region selection

• Sampling solely based on data D
  Some candidates:
  • Random selection
  • Entropy based
  • KL-divergence based

• Testing the performance of these schemes: simulation
  • Why? So that we know S exactly
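
To make the entropy-based candidate concrete, here is one way such a scheme could look: windows whose probes have the most uncertain state are ranked first for validation. This is an illustrative sketch, not the scheme evaluated in the cited work; the per-probe posterior probabilities, the fixed window size, and the function names are assumptions.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a probe whose probability of S = 1 is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def rank_windows_by_entropy(posteriors, window):
    """Return window start indices, most uncertain window first.

    posteriors: per-probe probabilities that S = 1 (hypothetical input,
    e.g. from a preliminary model fitted to D).
    window: number of probes per non-overlapping sub-region.
    """
    scored = []
    for start in range(0, len(posteriors) - window + 1, window):
        chunk = posteriors[start:start + window]
        mean_entropy = sum(binary_entropy(p) for p in chunk) / window
        scored.append((mean_entropy, start))
    # Highest mean entropy first; ties keep genomic order.
    scored.sort(key=lambda item: -item[0])
    return [start for _, start in scored]


posteriors = [0.99, 0.98, 0.5, 0.6, 0.01, 0.02]
ranking = rank_windows_by_entropy(posteriors, window=2)
```

Random selection would replace the sort with a shuffle, which is the baseline the entropy and KL-divergence schemes are compared against.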

Page 18:

Tiling array analyses: Simulation results

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.

Page 19:

Tiling array analyses: Simulation results (cont.)

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.

Page 20:

Tiling array analyses: back to the real world

• Transcriptional tiling array
  • Mainly using gene annotation to train the model
  • Somewhat analogous to using MaxEntropy to select the sample regions
  • The training set is expected to be noisy, but still leads to satisfactory performance

• ChIP-chip tiling array
  • Try to guess the signal distribution according to annotation information

• Ideal scenario
  • Optimally select a medium-sized set of sample sub-regions
  • Do experimental validations to build the model

Page 21:

Tiling array analyses: results on transcriptional data
(ENCODE regions (~30 Mb), training set (~7.5 Mb), ¼ training set (~1.9 Mb, ~0.1M probes))

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.

Page 22:

Tiling Arrays Probing Intergenic Activity: Classifying Un-annotated Transcription

Page 23:

DART: Classification of Unannotated Transcription

• Large number of novel transcribed regions (TARs / transfrags) detected using tiling microarrays.

• Developed DART: Database of Active Regions and Tools
  • Developed a classification procedure for these novel TARs/transfrags
  • Database for storing & visualizing various sets of TARs/transfrags
  • Associated tools for analyzing these sets

Rozowsky et al. Genome Research (2007, in press)

Page 24:

Set of All TARs
  Exonic TARs
  Pseudo TARs
  Novel TARs
    Intronic: Proximal, Distal
    Intergenic: Proximal, Distal
    ESTs

Rozowsky et al. Gen. Res. (2007, in press)

Page 25:

Schematic of Classification Procedure

Set of Novel TARs

S1A Filter Novel TARs for Unusual Sequence Composition → Peculiar TARs

S1B Filter Novel TARs for Cross-Hybridization → Cross-Hyb TARs

S2 Assign Novel TARs to Known Genes using Expression Profiles → Gene Assoc. TARs

S3A Cluster into Novel Transcribed Loci using Expression Profiles (EP) → Novel EP Loci

S3B Cluster into Novel Transcribed Loci using Phylogenetic Profiles (PP) → Novel PP Loci

Singlet or Ambiguous TARs

(Each resulting set is labelled P D P D E in the figure.)

DART

Rozowsky et al. Genome Research (in press)
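
Step S2 assigns novel TARs to known genes by comparing expression profiles. One simple way to picture this is a correlation test between a TAR's profile and each nearby gene's profile, as sketched below. The Pearson measure, the 0.9 threshold, and the names `pearson` and `assign_tar_to_gene` are assumptions for illustration; the published procedure may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / math.sqrt(var_a * var_b)


def assign_tar_to_gene(tar_profile, gene_profiles, min_r=0.9):
    """Return the best-correlated gene name, or None if none reaches min_r.

    gene_profiles: dict mapping gene name -> expression profile measured
    across the same conditions as tar_profile (hypothetical input).
    """
    best_gene, best_r = None, min_r
    for gene, profile in gene_profiles.items():
        r = pearson(tar_profile, profile)
        if r >= best_r:
            best_gene, best_r = gene, r
    return best_gene


genes = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [4.0, 3.0, 2.0, 1.0]}
assigned = assign_tar_to_gene([2.0, 4.0, 6.0, 8.5], genes)
unassigned = assign_tar_to_gene([1.0, 5.0, 1.0, 5.0], genes)
```

TARs left unassigned at this step would move on to the clustering steps S3A and S3B.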

Page 26:

Rozowsky et al. Gen. Res. (2007, in press)

TAR clustering

Page 27:

DART: Database & Tools
- Interfaces with UCSC
- Tools use Ensembl API

Rozowsky et al. Genome Research (in press)

Page 28:

Table 1: Locations of all TARs

                              Exonic      Pseudogenes  Unannotated Regions
Size of ENCODE Regions (bp)   1,776,157   144,745      28,077,158
Percentage of all ENCODE      5.9%        0.5%         93.6%
Number of TARs                3,666       195          6,988
Percentage of all TARs        33.8%       1.8%         64.4%

Locations of Novel TARs

                                   ESTs not in Exons  Intronic Proximal  Intronic Distal  Intergenic Proximal  Intergenic Distal
Size of Unannotated Regions (bp)   2,477,910          8,522,559          5,536,879        2,434,101            9,250,454
Percentage of Unannotated Regions  8.8%               30.2%              19.6%            8.6%                 32.8%
Number of Novel TARs               1,194              3,006              864              772                  1,300
Percentage of all Novel TARs       16.7%              42.1%              12.1%            10.8%                18.2%

Table 2: Sets of Classified Novel TARs

                                                       Number  Percentage
Total                                                  6,988   100.0%
With peculiar sequence composition                     503     7.2%
Assigned to known genes                                955     13.7%
Caused by cross-hybridization                          -       -
In novel transcribed loci using expression profiles    681     9.7%
In novel transcribed loci using phylogenetic profiles  782     11.2%

Rozowsky et al. Genome Research (in press)

Page 29:

DART classification has been experimentally validated with small-scale RT-PCR and sequencing experiments

Results:

18/46 (39%) confirmed by RT-PCR

4/5 sequenced products map uniquely to the correct genomic region

[Gel image: RT-PCR products for TAR IDs A11 to A23 and B1, each with paired + and - lanes, plus a ladder lane (L); size markers at 0.5, 1, 1.5 and 2 Kb]

TAR 1

PCR Sequence  1          ttcttcggaaaagcacatgaactctttggagtctcctgttccacttggtaaatttcctat 60
                         |||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||
Chr21         34,270,569 ttcttcggaaaagcacatgaactcttcggagtctcctgttccacttggtaaatttcctat 34,270,628

PCR Sequence  61         agctccgcactgaaagtccctgctgccctccttcctctgagcttgtggggcccacagatc 120
                         |||  |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Chr21         34,270,629 agccacgcactgaaagtccctgctgccctccttcctctgagcttgtggggcccacagatc 34,270,688

PCR Sequence  121        ccctgctccacttcctgcttcatttcagctgat 153
                         |||||||||||||||||||||||||||||||||
Chr21         34,270,689 ccctgctccacttcctgcttcatttcagctgat 34,270,721

TAR 2

PCR Sequence  154        ggatgacactccctcgttctaataccatctgaatgcctgagcaattacatcttacaacct 213
                         ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Chr21         34,270,898 ggatgacactccctcgttctaataccatctgaatgcctgagcaattacatcttacaacct 34,270,957

PCR Sequence  214        catgaaaaacacagcagcttgtcacgatgaatg 246
                         |||||||||||||||||||||||||||||||||
Chr21         34,270,958 catgaaaaacacagcagcttgtcacgatgaatg 34,270,990

[Diagram: forward primer (FP) and reverse primer (RP) positions relative to the novel TARs and the PCR sequence]

Rozowsky et al. Genome Research (in press)

Page 30:

Tiling Arrays Probing Intergenic Activity: Categories Groups of Binding Sites

Page 31:

Transcriptional regulation

▪ Protein-coding genes are transcribed by RNA polymerase II (Pol2) ...

▪ ... Under elaborate regulation by
  ▫ The binding of a complex set of transcription factors to their regulatory elements

  ▫ Histone modifications such as acetylation and methylation

  ▫ Chromatin remodeling

▪ Transcription factor binding sites include
  ▫ Core promoters

▫ Promoter proximal elements

▫ Other elements such as enhancers, silencers, insulators, and response elements

▪ Transcriptional regulatory elements can be globally mapped by high-throughput experiments such as ChIP-chip or ChIP-PET.

Zhang et al. (2007) Gen. Res.

Page 32:

ENCODE TR study

▪ First concerted effort to systematically identify TREs in the human genome on a large scale
▫ 105 lists of transcriptional regulatory elements in the ENCODE regions

▫ 29 transcription factors, 9 cell lines, 2 time points

▫ 7 laboratories and 3 different microarray platforms

▪ TFs and their TREs can be studied on various genomic levels.

Zhang et al. (2007) Gen. Res.

Page 33:

Our TRE analysis approach

▪ On an intermediate genomic level, involving 10 ~ 100 kb of DNA with several genes on average.

▪ Try to present the problem and subsequently analyze the data in a consistent and coherent statistical framework.

Zhang et al. (2007) Gen. Res.

Page 34:

Landscape of ENCODE TREs

▪ Positive correlation of the TRE density with both non-exonic conservation and gene density in a genomic region

Zhang et al. (2007) Gen. Res.

Page 35:

Non-random distribution of TREs

▪ TREs are not evenly distributed throughout the ENCODE regions (P < 2.2×10^−16).

▪ The actual TRE distribution follows a power law.

▪ The null distribution is ‘Poissonesque.’

▪ Many genomic subregions have extreme numbers of TREs.
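
A quick way to see the departure from a Poisson-like null is the index of dispersion (variance over mean) of TRE counts per genomic bin, which is about 1 for Poisson counts and much larger for clustered elements. The sketch below is only an illustration with made-up positions; the bin size and function names are assumptions, and it is not the statistical test behind the P-value quoted on the slide.

```python
def counts_per_bin(positions, region_length, bin_size):
    """Count elements falling into consecutive bins of a region.

    positions: 0-based coordinates within the region (hypothetical data).
    """
    n_bins = (region_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in positions:
        counts[pos // bin_size] += 1
    return counts


def dispersion_index(counts):
    """Sample variance divided by mean of the per-bin counts."""
    n = len(counts)
    mean = sum(counts) / n
    if n < 2 or mean == 0:
        raise ValueError("need at least two bins and one element")
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


# Toy data: six elements packed into one bin versus one element per bin.
clustered = counts_per_bin([10, 20, 30, 40, 50, 60, 9500], 10000, 1000)
even = counts_per_bin([500 + 1000 * i for i in range(10)], 10000, 1000)
```

Here `dispersion_index(clustered)` is well above 1 while `dispersion_index(even)` is 0, bracketing the value of roughly 1 expected under a Poisson null.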

Zhang et al. (2007) Gen. Res.

Page 36:

Local TRE enrichment and depletion

▪ Hundreds of TRE ‘islands’ and ‘deserts’ are identified in ENCODE regions.

▪ The longest island is composed of 68 various TREs and covers a 35-kb region near the HOXA cluster on chromosome 7.

▪ The entirety of EHD1 on chromosome 11 is covered by TRE islands.

▪ Some of the islands are located in intergenic regions of the genome.

dart.gersteinlab.org/encode/tr/

Zhang et al. (2007) Gen. Res.

Page 37:

Tiling Arrays Probing Intergenic Activity: Connecting Intergenic Activity to Pseudogenes

Page 38:

5 Methods of Assignment

• 4 automatic pipelines
  • retroFinder, pseudoFinder, PseudoPipe, GIS
  • Comparing protein or transcript v genomic DNA, filtering, application of rules

• HAVANA manual

• What is a pseudogene?
  • Different criteria

• Conservative approach here
  • Can't overlap gene annotation
  • Need to have a protein alignment
  • 201 pseudogenes (in comparison to ~400 genes)

Zheng et al. (2007) Gen. Res.

Page 39:

Overlap of Pseudogenes by 5 Different Methods

Union of 252

Zheng et al. (2007) Gen. Res.

Page 40:

Ex. Pseudogene Intersecting Transcriptional Evidence

[Browser screenshot with special tracks: diTAG, CAGE, TARs, ChIP-chip]

Zheng et al. (2007) Gen. Res.

Page 41:

Intersection of Pseudogenes with Transcriptional Evidence

Zheng et al. (2007) Gen. Res.

Page 42:

Intersection of Pseudogenes with Transcriptional Evidence

Zheng et al. (2007) Gen. Res.

Page 43:

Intersection of Pseudogenes with Transcriptional Evidence

Excluding TARs (due to cross-hyb issues)

Targeted RACE expts on 160 pseudogenes give 14

Total evidence from sequencing is 38 of 201 (with 5 having cryptic promoters)

Zheng et al. (2007) Gen. Res.

Page 44:

Targeted Transcription Expts.

• RACE expts
  • Interrogated 160 pseudogenes (49 non-processed & 111 processed)
  • In 51 cases (26 non-processed and 25 processed pseudogenes), could design distinguishing primers (>4 mismatched bp v. parent)
  • The resulting data supported transcription from 14 (8 processed and 6 non-processed) of the 160 pseudogenes (9 with pseudogene-specific primers)
  • These numbers might represent a conservative estimate, since a RACEfrag was assigned to its parent gene by default if it could be mapped to both a parent locus and a pseudogene locus.

• RACE expts + sequencing (CAGE, PET, EST and mRNA)
  • Unambiguous evidence for pseudogene transcription
  • All together, these data indicate 38 of 201 pseudogenes being the source of novel RNA transcripts
  • 5 of these had cryptic promoters (from TR analysis)
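
The ">4 mismatched bp v. parent" rule for distinguishing primers amounts to a simple mismatch count between the pseudogene primer site and the corresponding parent-gene site. A minimal sketch is below; it assumes the two sites are already aligned without gaps and of equal length, and the sequences and function names are invented for illustration.

```python
def count_mismatches(site_a, site_b):
    """Number of positions at which two gap-free aligned sites differ."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must be aligned to the same length")
    return sum(1 for x, y in zip(site_a.lower(), site_b.lower()) if x != y)


def is_distinguishing(pseudogene_site, parent_site, min_mismatches=5):
    """True if the primer site differs from the parent at more than 4 bp."""
    return count_mismatches(pseudogene_site, parent_site) >= min_mismatches


# Made-up 20-mers: the first parent differs at 5 positions, the second at 4.
pseudogene_site = "ACGTACGTACGTACGTACGT"
parent_far = "TCGTTCGTTCGTTCGTTCGT"
parent_near = "TCGTTCGTTCGTTCGTACGT"
far_ok = is_distinguishing(pseudogene_site, parent_far)
near_ok = is_distinguishing(pseudogene_site, parent_near)
```

In this toy case only `parent_far` clears the threshold, so only that primer site would count as pseudogene-specific.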

Zheng et al. (2007) Gen. Res.

Page 45:

Extension to Whole Genome

• 233 Transcribed from ~8000 Processed Pseudogenes

• Evidence for Transcription
  • 8% RefSeq mRNAs
  • 32% UniGene consensus sequences
  • 72% dbEST expressed sequence tags
  • 32% Oligonucleotide microarray data (extra support)

• Highly decayed
  • Fraction with Ka/Ks ≥ 0.5 is 54%

Harrison et al. (2005) NAR

Page 46:

Genes & Pseudogenes

Zheng & Gerstein, TIG (2007)

Page 47:

Genes or Pseudogenes?

Zheng & Gerstein, TIG (2007)

Page 48:

Conclusions

• Tilescope Processing Pipeline
  • HMM Segmentation & optimal selection of regions to validate

• DART classification of TARs
  • 1300 clusters of transcriptionally active regions in ENCODE

• Deserts and Forests of Binding Activity on ~50kb scale

• Pseudogene Activity
  • Consensus annotation from automatic pipelines and manual curation gives 201 (~2/3 processed)
  • >20% appear to be transcribed (38/201)


Page 54:

Acknowledgements

P Bertone

J Rozowsky

G Euskirchen

T Royce

S Balasubramanian

J Korbel

A Karpikov

D Zheng

Z Zhang

D Yan

R Sasidharan

O Emanuelsson

J Du

J Rinn

MS

MG

PM

SW

V Stolc

R Martone

P Harrison

pseudogene.org, tiling.gersteinlab.org

N Luscombe

C Bruce

J Chang

N Carriero

N Echols

J Karro

Page 55:

ENCODE Acknowledgements

Adam Frankish, Robert Baertsch, Philipp Kapranov, Alexandre Reymond, Siew Woh Choo, Yontao Lu, France Denoeud, Stylianos Antonarakis, Yijun Ruan, Chia-Lin Wei, Thomas Gingeras, Roderic Guigo, Jennifer Harrow

Sanger, UCSC, GIS, AFFX, Geneva, IMIM

Page 56:

Permissions Statement

This Presentation is copyright Mark Gerstein, Yale University, 2007.

Feel free to use images in it with PROPER acknowledgement (via citation to relevant papers or link to gersteinlab.org).