
Evolutionary and genomic approaches to find gene regulatory sequences

Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison

University of California at Santa Cruz: David Haussler, Jim Kent

Children’s Hospital of Philadelphia: Mitch Weiss

NimbleGen: Roland Green

University of Nebraska, Lincoln, February 14, 2007


Major goals of comparative genomics

• Identify all DNA sequences in a genome that are functional
– Selection to preserve function
– Adaptive selection

• Determine the biological role of each functional sequence

• Elucidate the evolutionary history of each type of sequence

• Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research


Known types of gene regulatory regions

G.A. Maston, S.K. Evans, M.R. Green (2006) Ann. Rev. Genomics & Human Genetics 7:29-59.


Regulatory regions tend to be clusters of transcription factor binding sites

Sequence-specific

SV40 promoters and enhancer


Properties of known regulatory regions

• Binding sites for transcription factors, many with sequence specificity

• Clusters of binding sites

• Conventional promoters encompass major start sites for transcription

• Conserved over evolutionary time???


Structures involved in transcription are probably more complex

Peter R. Cook, Oxford University, http://users.path.ox.ac.uk/~pcook/images/Images.html

Middle image (HeLa cell): green, active transcription (Br-UTP label); red, all nucleic acids

Sides: EM spreads of transcripts


Domain opening is associated with movement to non-heterochromatic regions

Schubeler, Francastel, Cimbora, Reik, Martin, Groudine (2000) Genes & Dev. 14: 940-950


Other possible activities for sequences involved in gene regulation

• Opening or closing a chromosomal domain

• Move a gene to or away from a transcription factory

• Control how long a gene is in a transcription factory
– Long association
  • High level expression
  • Really long gene
– Short association
  • Lower level expression
  • Rapid regulation

• Are these conserved over evolutionary time?


3 modes of evolution

Sequence matches at longer phylogenetic distances could reflect purifying selection.

Sequence differences at closer phylogenetic distances could reflect adaptive evolution.


Conservation vs. Constraint

• Conserved sequences are those that align between two species thought to be descended from a common ancestor

• Constrained sequences show evidence in their alignments of negative (purifying) selection
– E.g. change at a rate significantly slower than “neutral” DNA
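The distinction between aligning and being constrained can be made concrete with a small sketch. It counts differences in an ungapped pairwise alignment and asks how likely so few differences would be if every site changed at a neutral rate. The function names and the `neutral_rate` parameter are illustrative assumptions (in practice the rate might be estimated from ancestral repeats), and a simple binomial model stands in for the phylogenetic models used by real constraint measures.

```python
from math import comb


def count_differences(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Return (mismatches, compared_sites) over ungapped alignment columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = 0
    sites = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # a gap column carries no substitution information here
        sites += 1
        if a != b:
            mismatches += 1
    return mismatches, sites


def constraint_p_value(mismatches: int, sites: int, neutral_rate: float) -> float:
    """One-sided binomial P(X <= mismatches) if each site differs with
    probability neutral_rate (an assumed, externally estimated value).
    Small values suggest change slower than the neutral expectation."""
    return sum(
        comb(sites, k) * neutral_rate**k * (1 - neutral_rate) ** (sites - k)
        for k in range(mismatches + 1)
    )
```

A segment can align (be conserved) and still return a large p-value here, which is the case the slide calls conserved but not demonstrably constrained.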


Ideal cases for interpretation

[Figure: two schematic plots along the chromosome, each against a background of neutral DNA. Axis labels: Similarity, P (not neutral), Position along chromosome.]

• Negative selection (purifying), human vs mouse: DNA segments with a function common to divergent species.

• Positive selection (adaptive), human vs rhesus: DNA segments in which change is beneficial to at least one of the two species.


Messages about evolutionary approaches to predicting regulatory regions

• Regulatory regions are conserved, but not all to the same phylogenetic distance.

• Incorporation of pattern and composition information along with conservation can lead to effective discrimination of functional classes (regulatory potential).

• Regulatory potential in combination with conservation of a GATA-1 binding motif is an effective predictor of enhancer activity.

• In vivo occupancy by GATA-1 suggests other activities in addition to enhancers.

• Comparison of polymorphism and divergence from closely related species can reveal regulatory regions that are under recent selection.


Finding all gene regulatory regions is a challenge for comparative genomics

• Known regulatory regions for the HBB complex

• 23 total

• 19 conserved (align) between human and mouse

• Many others show no significant difference in a measure of constraint (phastCons) from the bulk or neutral DNA


Two extremes of constraint in TRRs


ENCODE projects

• ENCODE (ENCyclopedia Of DNA Elements): consortium aiming to find function for all human DNA sequences
– Phase I focused on 1% of human DNA
– 30 Mb, 44 regions

• About 10 regions had known genes of interest (CFTR, HOX)

• Others were chosen to get a sampling of regions varying in gene density and alignability with mouse

• Major areas
– Genes and transcripts
– Transcriptional regulation
– Chromatin structure
– Multiple sequence alignment
– Variation in human populations


Biochemical assays for protein-binding sites in DNA

Purified protein & naked DNA

Chromatin immunoprecipitation: DNA sites occupied by a protein inside cells.


ChIP-on-chip to examine many sites


Putative transcriptional regulatory regions = pTRRs

• Antibodies vs 10 sequence-specific factors:
– Sp1, Sp3, E2F1, E2F4, cMyc, STAT1, cJun, CEBPe, PU1, RA Receptor A

– High resolution ChIP-chip platforms: Affymetrix and NimbleGen

– Data from several different labs in ENCODE consortium

• High likelihood hits for ChIP-chip
– 5% false discovery rate

• Supported by chromatin modification data
– Modified histones in chromatin: H4Ac, H3Ac, H3K4me, H3K4me2, H3K4me3, etc.

– DNase hypersensitive sites (DHSs) or nucleosome depleted sites

• Result: set of 1369 pTRRs


A small fraction of cis-regulatory modules are conserved from human to chicken

[Figure: time scale in millions of years, marked at 91, 173, 310 and 450]

• About 4% of pTRRs, 4% of DNase HSs, 4-7% of promoters active in multiple cell lines

• Tend to regulate genes whose products control transcription and development

David King


Most pTRRs are conserved in eutherian mammals

[Figure: time scale in millions of years, marked at 91, 173, 310 and 450]

Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.

Percentage of class that align no further than:

               pTRRs    DNase HSs    Promoters
Primates        3%       11%          1-13%
Eutherians     71%       70%          63%
Marsupials     21%       14%          16-28%
Tetrapods       4%        4%          4-7%
Vertebrates     1%        1%          2-4%


Measures of conservation and constraint capture only a subset of pTRRs

• Fraction overlapping an MCS: stringent constraint

• phastCons (background rate corrected): allows a range of constraint

• Composite alignability (background rate corrected): aligns, but no inference about purifying selection


Different measures perform better on specific functional regions

[Figure: sensitivity plotted against 1 - specificity]


Examples of clade-specific pTRRs


Regulatory potential (RP) to distinguish functional classes


Good performance of ESPERR for gene regulatory regions (RP)

James Taylor, Francesca Chiaromonte


Conservation of predicted binding sites for transcription factors

Binding site for GATA-1


Genes Co-expressed in Late Erythroid Maturation

G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. Can rescue by expressing an estrogen-responsive form of GATA-1.

Rylski et al., Mol Cell Biol. 2003


Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes

B: Yong Cheng, Ross, Yuepin Zhou, David King

F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang


preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids


preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome


Examples of validated preCRMs


Correlation of Enhancer Activity with RP Score


Validation status for 99 tested fragments


preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated


CACC box helps distinguish validated from nonvalidated preCRMs

Consensus for EKLF binding site: CCNCMCCCW

[Figure: outputs compared for all validated preCRMs and all nonvalidated preCRMs, run with the same parameters; the CCNCMCCCW consensus is marked on the outputs]

Ying Zhang


preCRMs with conserved consensus GATA-1 binding sites are usually occupied by that protein: ChIP assay


Design of ChIP-chip for occupancy by GATA-1

1. Non-overlapping tiling array with 50bp probe and 100bp resolution (NimbleGen)

2. Cover range Mouse chr7:57225996-123812258 (~70Mbp)

3. Antibody against the ER portion of GATA-1-ER protein in rescued G1E-ER4 cells

[Diagram: probe layout, labeled 50, 50 and 100]

Yong Cheng, with Mitch Weiss & Lou Dore (CHoP), Roland Green (NimbleGen)


Signals in known occupied sites in Hbb LCR

1) Cluster of high signals

2) “hill” shape of the signals

HS1 HS2 HS3


Peak Finding Programs

• TAMALPAIS
– Mark Bieda from Peggy Farnham’s lab
– Focuses more on the cluster of the signals
– 4 thresholds based on number of consecutive probes with signals in the 98th or 95th percentiles

• MPEAK
– Bing Ren’s lab
– Focuses more on the “hill” shape of the signal
– 4 thresholds, for a series of probes with at least one that is 3, 2.5, 2 or 1 standard deviations above the mean
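The two ideas behind these programs can be sketched in a few lines. This is a simplified illustration, not the TAMALPAIS or MPEAK algorithms themselves: `runs_above` finds clusters of consecutive probes at or above a percentile cutoff, and `has_summit` checks whether a run contains at least one probe a given number of standard deviations above the mean. All function names, the nearest-rank percentile, and the choice of population standard deviation are assumptions made for the example.

```python
from math import ceil
from statistics import mean, pstdev


def percentile_cutoff(signals, pct):
    """Nearest-rank percentile of the probe signals (pct between 0 and 100)."""
    ordered = sorted(signals)
    rank = max(1, ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def runs_above(signals, cutoff, min_probes):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_probes consecutive probes with signal >= cutoff."""
    runs = []
    start = None
    for i, value in enumerate(signals):
        if value >= cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_probes:
                runs.append((start, i))
            start = None
    if start is not None and len(signals) - start >= min_probes:
        runs.append((start, len(signals)))
    return runs


def has_summit(signals, run, n_sd):
    """True if some probe in the run is >= mean + n_sd * SD of all signals."""
    threshold = mean(signals) + n_sd * pstdev(signals)
    start, end = run
    return max(signals[start:end]) >= threshold
```

For example, with `signals = [0.1, 0.2, 0.1, 2.0, 3.0, 2.5, 0.2, 0.1, 0.3, 0.1]`, calling `runs_above(signals, percentile_cutoff(signals, 70), 3)` reports the single run covering probes 3 to 5.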


ChIP-chip hits for GATA-1 occupancy

[Figure: Venn diagrams of hits from Mpeak (275 hits in both) and TAMALPAIS (276 hits in both); region counts as extracted: 216, 6059]

321 total ChIP-chip hits

Technical replicates of ChIP-chip with antibody against GATA1-ER


ChIP-chip hits validate at a high rate

Validation determined by quantitative PCR.

19 of the 321 hits were tested. 13 (~70%) were validated.

9 regions were “hits” in only one of the two technical replicates. None were validated.

Validation rate is similar at different thresholds

ChIP DNA


Association of WGATAR and conservation with ChIP-chip hits

1. 249 out of the 321 (78%) have WGATAR motifs, binding site for GATA-1

2. Of the GATA-1 binding motifs in those 249 hits, 112 (45%) are conserved between mouse and at least one non-rodent species.
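A minimal sketch of how WGATAR occurrences might be located and checked for conservation follows. It uses the IUPAC meanings W = A or T and R = A or G, and scans the reverse strand through the reverse-complement pattern YTATCW. The function names are hypothetical, and the conservation rule (both aligned rows match the motif at the same ungapped columns, forward strand only) is an assumption for illustration; the slides do not spell out the rule actually used.

```python
import re

# Lookaheads let overlapping occurrences be reported.
WGATAR_FWD = re.compile(r"(?=[AT]GATA[AG])")
WGATAR_REV = re.compile(r"(?=[CT]TATC[AT])")  # reverse complement, YTATCW
MOTIF = re.compile(r"[AT]GATA[AG]")


def find_wgatar(seq: str) -> list[tuple[int, str]]:
    """Return sorted (0-based start, strand) pairs for WGATAR on both strands."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in WGATAR_FWD.finditer(seq)]
    hits += [(m.start(), "-") for m in WGATAR_REV.finditer(seq)]
    return sorted(hits)


def is_conserved(mouse_row: str, other_row: str, start: int) -> bool:
    """Assumed rule: both aligned rows match WGATAR at columns start..start+5."""
    return all(
        MOTIF.fullmatch(row[start:start + 6].upper()) is not None
        for row in (mouse_row, other_row)
    )
```

Note that the two species need not carry the identical hexamer: AGATAA in mouse aligned to TGATAG in another species still counts as a conserved match under this rule.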


Expected and unexpected ChIP-chip hits


Distribution of ChIP-chip hits on 70Mb of mouse chr7

Yong Cheng, Yuepin Zhou and Christine Dorman


Almost half the GATA-1 ChIP-chip hits increase expression of a transgene, K562 cells

[Figure: bar chart of fold change over parent (axis 0 to 4) for tested fragments (GHP, GHN and YC3 labels), grouped as GATA-1 occupied sites by ChIP-chip and no GATA-1]

24 validated out of 56 fragments with ChIP-chip hits tested (43%)


Conserved and nonconserved ChIP-chip hits can be active as enhancers

[Figure panels: conserved, active; conserved, not active; not conserved, active]


Polymorphism as a transient phase of evolution

Slide from Dr. Hiroshi Akashi


Test of neutrality using polymorphism and divergence data


Test for recent selection in human noncoding DNA

• McDonald-Kreitman test

• Use ancestral repeats as neutral model (MKAR test)

• Count polymorphisms in human using dbSNP126

• Count divergence of human from
– Chimpanzee (great ape, diverged from human lineage 6 Myr ago)
– Rhesus macaque (Old World monkey, diverged from human lineage 23 Myr ago)

• Tiled windows, most analysis on 10kb windows

• Compute p-value for neutrality by chi-square test

• Ratio of polymorphism to divergence ratios gives indication of direction of inferred selection
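The test outlined above can be sketched as a 2x2 contingency analysis: polymorphism and divergence counts in a window against the same counts in ancestral repeats. The code below is an illustration under assumptions, not the authors' pipeline: it uses a Pearson chi-square with one degree of freedom and no continuity correction, requires all row and column totals to be nonzero, and the function names are invented for the example.

```python
from math import erfc, sqrt


def mk_chi_square(poly_win, div_win, poly_neutral, div_neutral):
    """Pearson chi-square (1 d.f., no continuity correction) on the table
    [[poly_win, div_win], [poly_neutral, div_neutral]].
    Returns (chi2, p_value). Assumes every row and column total is nonzero."""
    table = [[poly_win, div_win], [poly_neutral, div_neutral]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


def ratio_of_ratios(poly_win, div_win, poly_neutral, div_neutral):
    """(polymorphism/divergence in window) / (same ratio in neutral DNA)."""
    return (poly_win / div_win) / (poly_neutral / div_neutral)
```

In the usual McDonald-Kreitman reading, a ratio well below 1 (excess divergence) points toward positive selection and a ratio well above 1 (excess polymorphism) toward purifying selection, but only when the p-value rejects neutrality.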

Heather Lawson, Anthropology, PSU


pTRR apparently under positive selection


A promoter distal to the beta-like globin genes has a signal for recent purifying selection


Selection on a primate-specific promoter


The distal promoter is close to the locus control region for beta-globin genes


Many thanks …

B: Yong Cheng, Ross, Yuepin Zhou, David King

F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang

PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko

Alignments, chains, nets, browsers, ideas, …: Webb Miller, Jim Kent, David Haussler

RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski

Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU


Computing Regulatory Potential (RP)

Alignment

seq1                G T A C C T A C T A C G C A
seq2                G T G T C G - - A G C C C A
seq3                A T G T C A - - A A T G T A
Collapsed alphabet  1 2 1 3 4 5 7 7 6 8 3 6 3 9

• A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet with characters s (for example, 1-9).

• Train two order t Markov models for the probability that t alignment columns are followed by a particular column in training sets:
– positive (alignments in known regulatory regions)
– negative (alignments in ancestral repeats, a model for neutral DNA)
– E.g. frequency that 3 4 is followed by 5: 0.001 in regulatory regions, 0.0001 in ancestral repeats

• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.

\[
\mathrm{RP} = \sum_{a \,\in\, \text{segment}} \log \left( \frac{p_{\mathrm{REG}}(s_a \mid s_{a-1} \ldots s_{a-t})}{p_{\mathrm{AR}}(s_a \mid s_{a-1} \ldots s_{a-t})} \right)
\]
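The RP calculation can be sketched with two small Markov models over the collapsed alphabet. This is an illustrative reading of the slide, not the ESPERR code: the function names, the add-one pseudocount smoothing, and the use of the natural logarithm are assumptions (the slide does not state a log base or a smoothing scheme), and positions without a full context of `order` preceding symbols are simply skipped.

```python
from collections import Counter, defaultdict
from math import log


def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Estimate p(symbol | previous `order` symbols) from encoded alignments.
    Returns a function prob(symbol, context) with pseudocount smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1

    def prob(symbol, context):
        ctx = counts.get(tuple(context), Counter())
        denom = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx[symbol] + pseudocount) / denom

    return prob


def regulatory_potential(seq, p_reg, p_ar, order):
    """Sum of log-likelihood ratios (regulatory vs ancestral-repeat model)
    over every position of the segment that has a full context."""
    return sum(
        log(p_reg(seq[a], seq[a - order:a]) / p_ar(seq[a], seq[a - order:a]))
        for a in range(order, len(seq))
    )
```

With the slide's example frequencies, a single occurrence of 5 after 3 4 would contribute log(0.001 / 0.0001) = log(10) to the score, so strings that are enriched in the regulatory training set push RP upward.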


Stage 1: Reduced representations

[Figure labels: G, T, gap]

ESPERR: Evolutionary Sequence and Pattern Extraction using Reduced Representations


Stage 2: Improve encoding


Train models for classification

Note that many different columns are reduced to a single “encoding” (a number in the figure). E.g. four different columns are each called “3”.

6 6 2 may occur frequently in the positive training set and rarely in the negative training set, and thus contribute to discrimination. If the positive training set is known regulatory regions, this would contribute to a positive RP.


Categories of Tested DNA Segments


Example that suggests turnover

GATA-1 BSs


Additional methods find CACC box as distinctive for validation

Background: Mouse chr 19 (42.8% C+G) - NCBI Build 30

CLOVER (Zlab), EKLF PWM (Dr. Perkins)

                                  Motif   P(mm_chr19.m)
Output for validated preCRMs      EKLF    0.0008
Output for nonvalidated preCRMs   none    none

ELPH (UMaryland)

         validated    non-validated
6-mer    TTATYT       GGCAGR
7-mer    CCWCAGM      RGRCAGR
8-mer    CASCCWGC     CAGGGAWR
9-mer    CCWGGCWGM    CWGRGAWRA

Hexamer Counting

          counts                      expected
          validated   nonvalidated    validated   nonvalidated
NCACCC    60          32              16.31       5.81
CACCCW    56          27              11.74       4.36
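A hexamer-counting comparison of this kind can be sketched as below. The slide does not say how the expected counts were derived, so the expectation here is an assumed simple model: independent bases with a given C+G fraction (for instance the 42.8% quoted for the background), forward strand only, overlapping windows allowed. The function names and the IUPAC table are part of the illustration, and the code is not meant to reproduce the numbers in the table.

```python
# IUPAC nucleotide codes needed for patterns such as NCACCC and CACCCW.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "M": "AC", "R": "AG", "Y": "CT", "S": "CG",
}


def count_matches(pattern, seqs):
    """Count forward-strand, possibly overlapping matches of an IUPAC pattern."""
    k = len(pattern)
    total = 0
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            if all(seq[i + j] in IUPAC[code] for j, code in enumerate(pattern)):
                total += 1
    return total


def expected_matches(pattern, seqs, gc_fraction):
    """Expected count if bases were independent with the given C+G fraction
    (an assumed null model, split evenly within G/C and within A/T)."""
    base_p = {
        "G": gc_fraction / 2, "C": gc_fraction / 2,
        "A": (1 - gc_fraction) / 2, "T": (1 - gc_fraction) / 2,
    }
    p_match = 1.0
    for code in pattern:
        p_match *= sum(base_p[base] for base in IUPAC[code])
    windows = sum(max(0, len(seq) - len(pattern) + 1) for seq in seqs)
    return p_match * windows
```

Comparing `count_matches` with `expected_matches` for the validated and nonvalidated sets separately gives the same kind of observed-versus-expected contrast that the table reports for the CACC box.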


Using Galaxy to find predicted CRMs