Computer-assisted Identification of Cell Cycle-related Genes: New Targets for E2F Transcription Factors

Alexander E. Kel 1,5*, Olga V. Kel-Margoulis 1,5, Peggy J. Farnham 2, Stephanie M. Bartley 2, Edgar Wingender 3 and Michael Q. Zhang 4

1 Institute of Cytology and Genetics, Novosibirsk, Russia
2 McArdle Laboratory for Cancer Research, University of Wisconsin Medical School, Madison, WI 53706, USA
3 Gesellschaft für Biotechnologische Forschung mbH, Mascheroder Weg 1, D-38124 Braunschweig, Germany
4 Watson School of Biological Sciences, Cold Spring Harbor Laboratory, 1 Bungtown Road, P.O. Box 100, Cold Spring Harbor, NY 11724, USA
5 BIOBASE GmbH, Mascheroder Weg 1b, D-38124 Braunschweig, Germany

*Corresponding author. Present address: Dr A. Kel, BIOBASE GmbH, Mascheroder Weg 1b, D-38124 Braunschweig, Germany.

J. Mol. Biol. (2001) 309, 99–120. doi:10.1006/jmbi.2001.4650, available online at http://www.idealibrary.com. 0022-2836/01/010099–22 $35.00/0. © 2001 Academic Press

The processes that take place during development and differentiation are directed through coordinated regulation of expression of a large number of genes. One such gene regulatory network provides cell cycle control in eukaryotic organisms. In this work, we have studied the structural features of the 5′ regulatory regions of cell cycle-related genes. We developed a new method for identifying composite substructures (modules) in regulatory regions of genes consisting of a binding site for a key transcription factor and additional contextual motifs: potential targets for other transcription factors that may synergistically regulate gene transcription. Applying this method to cell cycle-related promoters, we created a program for context-specific identification of binding sites for transcription factors of the E2F family which are key regulators of the cell cycle. We found that E2F composite modules are found at a high frequency and in close proximity to the start of transcription in cell cycle-related promoters in comparison with other promoters. Using this information, we then searched for E2F sites in genomic sequences with the goal of identifying new genes which play important roles in controlling cell proliferation, differentiation and apoptosis. Using a chromatin immunoprecipitation assay, we then experimentally verified the binding of E2F in vivo to the promoters predicted by the computer-assisted methods. Our identification of new E2F target genes provides new insight into gene regulatory networks and provides a framework for continued analysis of the role of contextual promoter features in transcriptional regulation. The tools described are available at http://compel.bionet.nsc.ru/FunSite/SiteScan.html.

Keywords: cell cycle; E2F transcription factors; computer-associated prediction; composite elements; weight matrix

Introduction

For the last one and a half decades bioinformatics has been used to study genomic regulatory sequences, ranging from short specific functional sites to extended control regions. The ultimate goal of this effort is to unravel the regulatory code, which is expected to provide a clue to how such a complex process as regulation of gene expression is encoded in the structure of promoter regions. Now that the complete nucleotide sequence of the human genome is about to be determined, bioinformatics approaches have become even more important. For example, genes whose products function in the same physiological or molecular-genetic process are often expressed coordinately. It is believed that this coordination is possible, at least in part, due to similarity in the structure of transcriptional regulatory regions of these genes, and primarily due to the presence of the binding sites for the same transcription factors. It is hoped that computer-assisted bioinformatics and experimental analysis of the human genome will provide new mechanisms for the identification of elements which mediate a common transcriptional profile for a group of genes.



The genes responsible for progression through the cell cycle have always been in the spotlight, especially those essential for normal cell cycle progression in any cell type, the so-called cell cycle machinery. Both cell proliferation and the successful progression of the differentiation program depend on the correct, coordinated function of the genes in this group. Any malfunction in the operation of cell cycle-related genes can lead to uncontrolled proliferation and, eventually, to tumors. The E2F family of transcription factors has been implicated in the cell cycle-regulated expression of many cellular genes [1-6]. E2F target genes include (1) genes that encode regulatory proteins comprising the so-called cell cycle machinery (cyclins, cyclin-dependent kinases, the E2F genes, the tumor suppressor RB1 genes); (2) some transcription factor genes of broad spectrum (Myc and Myb families); (3) enzyme genes and genes of other protein components of the replication machinery; (4) some DNA repair genes; and (5) genes encoding structural proteins of chromatin (histones).

The E2F family comprises two subfamilies, E2F-1 to E2F-6 and DP-1 to DP-3, one member of each subgroup partnering to form an E2F/DP heterodimer (reviewed in [3] and [4]). The E2F genes and their products can be divided into three subgroups: (1) E2F-1 to E2F-3, which display maximal expression in late G1 to early S phase [7,8], are highly related; (2) E2F-4 and E2F-5 are less responsive to changes in proliferation, and both proteins lack an N-terminal domain contained within E2Fs 1-3; and (3) E2F-6 is a recently cloned E2F family member that lacks both the N-terminal region of E2Fs 1-3 and the C-terminal region common to E2Fs 1-5 [9]. The domain structure of the E2F transcription factors is well studied. The DNA-binding and dimerization domain is located closer to the N terminus of the molecule, and it was thought to be of the bHLH-ZIP (basic helix-loop-helix zipper) type [10]. However, a recent analysis of the crystal structure of the E2F4-DP2-DNA complexes suggested a winged-helix DNA-binding motif [11]. Important in the overall function of E2F activity is the interaction of E2F family members with other regulatory proteins. An acidic activation domain is located at the C terminus of E2Fs 1-5 [12]. Situated within the activation domain are regions which mediate interaction of E2F family members with components of the general transcriptional machinery such as TBP, TFIIH, and CBP [12-14]. A pRB-binding domain is also located within the transactivation domain [12,13]. Because the E2F-1 activation domain cannot interact with positive regulatory factors such as TBP and negative regulatory factors such as pRB simultaneously, the retinoblastoma protein can inhibit the interaction between E2F-1 and the components of the basal complex, thereby repressing transcriptional activation by E2F-1 [13]. Rb family members can also bind to histone deacetylases, and thus it has been proposed that E2F/pocket complexes can mediate active repression of E2F target promoters by recruitment of histone deacetylases and subsequent alteration of chromatin structure, which leads to transcriptional repression.

As noted above, the functions carried out by E2F sites depend in part on the protein spectrum of the complex assembled at these sites; this spectrum, in turn, is dependent on cell cycle stage (reviewed in [15]). For example, E2F/pocket protein complexes are abundant in G0 and early G1 phase. It has been proposed that such complexes mediate transcriptional repression of E2F target genes (reviewed in [4]). Transcriptional activation can be triggered by releasing E2F when pRB is phosphorylated by cyclin-cdk complexes which are activated as the cell cycle progresses (see [16-19]). Disruption of E2F/pocket complexes frees the E2F transactivation domain to interact with components of the transcriptional machinery. Finally, it is proposed that S phase-specific cyclin/cdks phosphorylate E2F/DP heterodimers, reducing their DNA binding ability [20,21] and causing a reduction in transcription of E2F target genes. Thus, due to the action of both positive and negative regulatory proteins, the E2F/DP family is thought to cause variable expression at different stages of the cell cycle. Recent chromatin immunoprecipitation experiments have provided direct evidence in support of this model of E2F action for certain target genes. For example, B-myb [22] and cycA [23] promoters are bound by E2Fs and pocket proteins in G0 and G1 phase but the E2F site is not occupied in S phase. In contrast, other promoters, such as p107 [23], retain E2F, but not pocket protein, binding in S phase, and yet others, such as cdc2 and cycE [22], are bound by both E2Fs and pocket complexes throughout the cell cycle. Clearly, promoter context must determine exactly how, or if, a given promoter is regulated by E2F [24]. An understanding of the rules governing E2F-mediated regulation will be aided by the identification and characterization of additional target promoters.

In the past few years, novel ideas concerned with identification of target genes of transcription factors have been suggested by reports highlighting the combinatory nature of transcriptional regulation and the benefits of computer-aided analysis of the regulatory regions of functionally related gene groups (e.g. see a recent review [25]). These efforts aimed at the identification of muscle-specific promoters [26-29]; the search of liver-specific gene promoters for potential HNF-1 sites [30]; the search for potential composite elements within promoters of immune response genes [31]; and the characterization of structural features of the CTF/NF-1 sites [32].

For example, potential sites were searched by using the weight matrix of the binding sites for HNF-1, a homeodomain-containing transcription factor with preferential expression in the liver [30]. About 100 potential binding sites for this factor have been found in the liver-specific gene promoters. As was shown, 95% of oligonucleotides corresponding to these sites compete with a known site in the albumin gene promoter for binding to the proteins of a liver nuclear extract [30]. Although this is a strong indication of the usefulness of computerized site search methods, there is no guarantee that the revealed sites are indeed functional in the context of regulatory regions. The authors have studied the genomic distribution of potential HNF-1 sites and obtained several results: (1) these potential sites occur 2.5 times more frequently in liver-specific gene promoters than in random sequences; (2) most potential sites localize to the region between position −300 and the transcription start site; and (3) in liver-specific gene promoters, the potential binding sites, both those for HNF-1 and those for other factors, such as HNF-4, AP-1, Sp1, NF-Y, Oct, and TBP, are closely spaced [30].

The analysis of muscle-specific gene regulatory regions was performed by using weight matrices and a function of logistic regression [28]. Individual matrices were defined for transcription factor binding sites in many muscle-specific genes. These are the sites for muscle-specific transcription factors, such as MEF-2 and Myf, as well as for several ubiquitous factors such as Sp1, SRF, AP-1, and TEF [28]. The authors derived the matrices from data contained in the database Muscle-Specific Regulation of Transcription. The individual matrices were next combined into one recognition function. Thus a modular model of the muscle-specific type of promoters was developed, in which weight matrices were regarded as independent modules. The conclusion was drawn that "The specificity of the modular approach results in meaningful prediction" [28].

A model of the promoter of a distinct class of muscle-specific genes, namely, actin genes, was developed by Werner and co-authors [29]. Promoters of 11 vertebrate actin genes were used as a training set. It has been demonstrated that the USF sites, the CCAAT box, SRF sites, Sp1 sites, the TATA box, and the Inr element are present in those promoters, in a specific order [29]. The yield of this model was 33% false negatives versus one false positive per 1290 non-actin promoters in the EPD database [29].

We have previously developed a method for revealing new genes potentially regulated by composite NFAT/AP-1 elements [31]. This method comprises two weight matrices for the corresponding transcription factors (NFATp/c and AP-1). It considers the distance between two sites, their location relative to the transcription start site, and the presence of clusters of potential NFAT composite elements. The method permits us to discriminate T-cell-specific promoter sequences against other functional regions (coding and intron sequences) of the same genes, against promoters of muscle-specific genes or against random sequences. Using this approach, a set of new potential target genes for NFAT/AP-1 complexes was identified by scanning the EMBL data library [31].

Because of the high functional complexity of the cell cycle machinery and because the E2F transcription factors play a key role in regulation of the cell cycle-associated genes, the problem of identification of target genes for these factors is a major challenge. In this work, we have studied the structural features of the 5′ regulatory regions of cell cycle-related genes and made well-grounded predictions about a broad range of genes targeted by the E2F family members. Unlike many previous studies which have employed computer analysis to suggest potential binding sites, we have experimentally verified the binding of E2F in vivo to the sites predicted by the computer-assisted methods. Our identification of new E2F target genes provides new insight into gene regulatory networks and provides a framework for continued analysis of the role of promoter context in transcriptional regulation.

Results

Characterization of the E2F binding sites and the creation of the weight matrix

Based on the information collected in the CYCLE-TRRD database (http://wwwmgs.bionet.nsc.ru/mgs/papers/kel_ov/celcyc/), we constructed a set of experimentally proven E2F binding sites that contains 45 sites within 33 different genes. In Table 1, we describe the general characteristics of these sites and of the promoters of the corresponding genes. For example, we note that only a few (seven out of 33) E2F-regulated genes contain a TATA box. Also, as indicated in Table 1, in many cases E2F sites are located within the first 150 bp upstream of the transcription start site, with some E2F sites, e.g. in the promoters of the p107, cdc6, orc1 and dhfr genes, actually overlapping the start site. However, in some cases the E2F sites are situated in the non-coding region of the first exon or in the first intron. As is also evident from Table 1, E2F-dependent genes often have two binding sites for the E2F factors. Such genes include e2f-1 of human, mouse, and quail; the proto-oncogenes c-myc and N-myc; p107; cycE; EBNA1, the nuclear antigen of the Epstein-Barr virus; and adenoviral EIIaE. In all these cases the E2F sites are situated in close proximity to each other, immediately adjacent or separated by only 10-15 bp. However, in the promoters of some genes E2F sites are separated by a greater distance. For instance, the E2F sites are separated by 30 bp in the cdc6 promoter, by 50 bp in the adenoviral gene E1A promoter, and by 85 bp in the human cdc2 promoter.

When inspecting E2F binding sites in the collection, some structural features are immediately obvious. The conserved core sequence of the site is 12 bp in length, six base-pairs at the central positions consisting of the G and C nucleotides, and the flanking sequences are enriched with the T and A nucleotides. The conserved CG dinucleotide is in the center of all the sites (see Table 1). Most of these sites are symmetrical relative to CG and are either complementary palindromes, as in the genes N-myc, e2f-1, dhfr, and PCNA, or mirror-symmetric repeats, as in the gene p107. However, non-symmetrical sites also occur, as, for instance, in the cycD1, cdc6, and cad genes. In some sites, flanking sequences have very few AT nucleotides, such as for the proximal site of the cycE gene.

Table 1. E2F binding sites in transcriptional regulatory regions of genes presented in the CYCLE-TRRD database

| No. | CYCLE-TRRD (a) | Gene name | TATA box | Position of E2F site relative to transcription start | Sequence of E2F site | q |
|---|---|---|---|---|---|---|
| **Transcription factors of Myc and Myb families** | | | | | | |
| 1 | S1835 (ct) | c-myc, Hs | + | −69 to −58 | gCTTGGCGGGAAAaaga | (+)0.87 |
| 2 | S1836 (ct) | | | −42 to −31 | gGATCGCGCTGAGtata | (+)0.80 |
| 3 | S398 (t) | c-myc, Mm | + | −81 to −69 | gCTTGGCGGGAAAaaga | (+)0.87 |
| 4 | S1926 (t) | N-myc, Mm | + | −142 to −131 | tTTTGGCGCGAAAggct | (+)1.00 |
| 5 | S1927 (t) | | | −127 to −116 | cTTTGGCGCCTCCcctg | (+)0.96 |
| 6 (b) | S2353 (t) | B-myb, Mm | − | −212 to −201 | cCTTGGCGGGAGAtagg | (+)0.87 |
| **Transcription factors of E2F family** | | | | | | |
| 7 | S1791 (ct) | e2f-1, Hs | − | −31 to −20 | cTTTCGCGGCAAAaagg | (+)0.92 |
| 8 | S1792 (ct) | | | −14 to −3 | cTTTGGCGCGTAAaagg | (+)0.99 |
| 9 | S1786 (ct) | e2f-1, Mm | − | −40 to −29 | cTTTCGCGGCAAAaagg | (+)0.92 |
| 10 | S1787 (t) | | | −23 to −12 | aTTTGGCGCGTAAaagt | (+)0.99 |
| 11 | S3143 (t) | e2f-1, Cc | − | −16 to −5 | cTTTCGCGGCAAAaagg | (+)0.92 |
| 12 | S3144 (ct) | | | +2 to +13 | aTTTGGCGCGCAAaggc | (+)0.99 |
| **Pocket proteins** | | | | | | |
| 13 | S1590 (t) | RB1, Hs | − | +91 to +102 | tTTTCCCGCGGTTggac | (+)0.86 |
| 14 | S1343 (t) | p107, Hs | − | −16 to −5 | tTTTCGCGCGCTTtggc | (+)0.98 |
| 15 | S1345 (t) | | | −6 to +6 | cTTTGGCGCAGGTggtt | (+)0.95 |
| **Cyclins, cyclin-dependent kinases, phosphatases** | | | | | | |
| 16 | S1858 (t) | cycD1, Hs | − | −53 to −42 | gTTTGGCGCCCGCgccc | (+)0.97 |
| 17 | S1909 (ct) | cycE, Hs | − | −16 to −5 | gTTCCGCGCGCAGggat | (+)0.87 |
| 18 | S1910 (t) | | | +7 to +18 | aTGTCCCGCTCTGagcc | (+)0.75 |
| 19 | S3256 (t) | cycE, Mm | − | −24 to −7 | gGCGGGCGCGAGGgcgg | (−)0.81 |
| 20 | S1853 (t) | cycA, Hs | − | at −37 | tAGTCGCGGGATActtg | (+)0.77 |
| 21 | S1870 (t) | cdc2, Hs | − | −131 to −114 | cTTTCGCGCTCTAgcca | (+)0.97 |
| 22 | S1874 (t) | | | −27 to −13 | cTTTAGCGCGGTGagtt | (+)0.95 |
| 23 | S1879 (t) | cdc2, Rn | − | −130 to −119 | cTTTCGCGCTCTGcact | (+)0.97 |
| 24 | S3147 (t) | cdc6, Hs | − | −43 to −32 | cTTTGGCGGGAGGtggg | (+)0.93 |
| 25 | S3148 (t) | | | −9 to +6 | aTTTGGCGCGAGCgcgg | (+)0.99 |
| 26 | S3672 (ct) | Cdc25A, Hs | − | −63 to −47 | gTTTGGCGCCAACtagg | (+)0.98 |
| **Enzymes and factors for DNA replication and repair** | | | | | | |
| 27 | S3257 (t) | Orc1, Hs | − | −10 to +2 | gATTGGCGCGAAGtttt | (+)0.94 |
| 28 | S1923 (t) | DNA pol α, Hs | − | −139 to −128 | aCAGGGCGCCAAAcgcg | (−)0.98 |
| 29 | S2260 (t) | dhfr, Hs | − | −10 to +2 | aTTTCGCGCCAAActtg | (−)1.00 |
| 30 | S2365 (ct) | dhfr, Mm | − | −10 to +2 | aTTTCGCGCCAAActtg | (−)1.00 |
| 31 (b) | S88 (t) | dhfr, Cs | − | −62 to −51 | aTTTCGCGCCAAActtg | (−)1.00 |
| 32 | S3421 (ct) | tk, Hs | − | −101 to −85 | aTCTCCCGCCAGGtcag | (−)0.77 |
| 33 | S1914 (t) | tk, Mm | − | −80 to −69 | aGTTCGCGGGCAAatgc | (+)0.86 |
| 34 | S1338 (t) | cad, Ma | − | +74 to +85 | cTTCGGCGCGCGGtgtt | (+)0.87 |
| 35 | S2334 (t) | PCNA p120, Hs | − | In the 1st intron | tTTTCGCGCCAAAgtca | (−)1.00 |
| 36 | S3495 (t) | UDG, Hs | − | −112 to −101 | tTTTGCCGCGAAAagac | (−)0.92 |
| **Structural proteins of chromatin** | | | | | | |
| 37 | S2288 (t) | histone H2A.1, Hs | + | −50 to −39 | tTTTCGCGCCCAGcagc | (+)0.98 |
| 38 | S2378 (t) | histone H2A.X, Mm | + | −256 to −245 | aTTTCGCGCGCTCtaca | (+)0.99 |
| **Components of signal transduction pathways** | | | | | | |
| 39 | S2381 (t) | htf9a/RanBP-1, Mm | − | −115 to −104 | tTTTGGCGGGAAGcgcg | (+)0.93 |
| **Viral genes regulated by cellular E2F activity** | | | | | | |
| 40 | S3673 (ct) | Ad type 2 EIIaE1 | − | −68 to −52 | tTTTCGCGCTTAAattt | (+)0.96 |
| 41 | S3674 (ct) | | | −48 to −32 | aAAGGGCGCGAAActag | (−)0.97 |
| 42 | S3676 (ct) | Ad type 5 E1A | + | −288 to −272 | tTTTCGCGCGGTTttag | (+)0.97 |
| 43 | S3677 (ct) | | | −225 to −209 | tTTTCGCGGGAAAactg | (+)0.93 |
| 44 | S2185 (ct) | Epstein-Barr Virus, EBNA1 | + | +191 to +204 | aAGGCGCGGGATAgcgt | (−)0.77 |
| 45 | S2187 (ct) | | | +218 to +234 | aGATGGCGGGTAAtaca | (+)0.76 |

TATA-box presence in the promoters of E2F-dependent genes is shown. Positions correspond to the sequence shown by capital letters.
a Site accession number in the CYCLE-TRRD database (http://srs5.bionet.nsc.ru/srs5/).
b Positions are given relative to the start of translation.
(ct), sites taken for the positive control set; (t), sites which compose the training set. Cc, Coturnix coturnix (quail); Cs, Cricetulus sp. (Chinese hamster); Hs, Homo sapiens; Ma, Mesocricetus auratus (golden hamster); Mm, Mus musculus; Rn, Rattus norvegicus.
c The minimal score value, 0.75, corresponds to the binding site in the promoter of the cyclin E gene (site no. 18). The majority of sites are characterized by q ≥ 0.86, and approximately half of the sites are recognized at q ≥ 0.96.

Figure 1. Weight matrix and consensus for E2F binding sites. Values in the parentheses correspond to the weight matrix constructed using the 29 E2F sites in the training set (indicated by (t) in Table 1).

The generally accepted eight-letter consensus sequence of the E2F site, TTTSGCGC, was constructed on the basis of 12 sites in eight vertebrate genes [33]. The matrix so generated has accession number M00050 in the database TRANSFAC Release 3.5 [34]. Furthermore, TRANSFAC also contains an E2F matrix (M00180) generated on the basis of data of eight sites, the consensus being written as GCGCGAAA, which corresponds to TTTCGCGC in the complementary strand. With 45 functional sites in 33 vertebrate genes, it is now possible to refine the weight matrix and extend the consensus. We constructed a weight matrix by aligning the 45 sites relative to the central CG dinucleotide (Figure 1); all site sequences were taken on the coding strand of DNA. The matrix is most conserved where the GCGC tetranucleotide is, i.e. in the central positions of the site. A rather conserved trinucleotide, TTT, localizes to the 5′ region whereas the 3′ region is more degenerate (see Figure 1), although the prevalence of the nucleotide A is obvious. Based on the new weight matrix, the consensus sequence of the E2F site is TTTSGGCSMDR (Figure 1), according to the rules suggested in [35].
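The step from aligned sites to a degenerate consensus can be illustrated with a minimal Python sketch. It is not the procedure of the cited consensus rules: the rule used here (keep every base whose column frequency reaches `min_fraction`) and the names `consensus`, `min_fraction` and `CORES` are assumptions made for illustration, and only four of the 12-bp core sequences from Table 1 (sites no. 4, 7, 8 and 29) are used.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(sites, min_fraction=0.25):
    """Return an IUPAC consensus for equal-length, uppercase A/C/G/T sites.

    Illustrative rule (an assumption, not the paper's): a base contributes to
    the consensus letter of a column if its frequency is >= min_fraction.
    """
    n = len(sites)
    letters = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        kept = frozenset(b for b, c in counts.items() if c / n >= min_fraction)
        letters.append(IUPAC[kept])
    return "".join(letters)


# Core sequences (capital letters) of Table 1 sites no. 7, 8, 29 and 4.
CORES = ["TTTCGCGGCAAA", "TTTGGCGCGTAA", "TTTCGCGCCAAA", "TTTGGCGCGAAA"]

print(consensus(CORES))                    # permissive cut-off
print(consensus(CORES, min_fraction=0.5))  # stricter cut-off
```

With so few sites the result depends strongly on the cut-off, which is one reason a consensus derived from all 45 sites is more informative than one derived from 8 or 12.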

Accuracy of recognition of E2F binding sites by weight matrix

For recognition of the potential E2F binding sites we have applied a position weight matrix method as described in the Data and Methods section. For evaluation of the method we constructed a weight matrix on the basis of sequences from the training set only (see the values in parentheses in Figure 1). The training set consists of 29 sites arbitrarily selected from the full data set (Table 1).
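How a weight matrix is used to scan a sequence can be sketched as follows. The exact scoring formula belongs to the Data and Methods section, which is not reproduced here, so the normalization below (the raw score rescaled between the lowest and highest achievable column sums, giving a value between 0 and 1) is an assumption, as are the function names and the three-site toy training set. Only one strand is scanned in this sketch.

```python
from collections import Counter

BASES = "ACGT"


def count_matrix(sites):
    """Build one base-count column per position from equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(width)]


def normalized_score(matrix, window):
    """Rescale the raw matrix score of a window to the range 0..1 (assumed form)."""
    raw = sum(column[base] for column, base in zip(matrix, window))
    best = sum(max(column[b] for b in BASES) for column in matrix)
    worst = sum(min(column[b] for b in BASES) for column in matrix)
    if best == worst:  # degenerate matrix with no preference at any position
        return 0.0
    return (raw - worst) / (best - worst)


def scan(matrix, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = normalized_score(matrix, window)
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


# Toy training set: three core sequences taken from Table 1 (illustration only).
TRAINING = ["TTTCGCGCCAAA", "TTTGGCGCGAAA", "TTTCGCGGCAAA"]
MATRIX = count_matrix(TRAINING)
HITS = scan(MATRIX, "GGATTTCGCGCCAAACT", threshold=0.9)
print(HITS)
```

Lowering `threshold` in this sketch returns more windows, which mirrors the trade-off between missed sites and false positives discussed in this section.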

A positive control set was used for assessing the accuracy of the method. The sites in the positive control set are presented in Tables 1 and 2. Not included in the training set, these are ten blindly selected sites in vertebrate nuclear genes (Table 1, indicated by (ct)); six experimentally determined sites in the viral gene promoters (Table 1, sites no. 40-45); and three sites in vertebrate genes that were not yet precisely mapped (Table 2). For instance, it has been demonstrated that the E2F factor is involved in the transcriptional regulation of the cdc7 gene; however, the region responsible for this regulation is quite long: −158/+73 [36]. We put into the positive control set the best match of the E2F matrix found in this region (Table 2, no. 2). Likewise, number 3 corresponds to the best match of the E2F matrix in the promoter region of the e2f-2 gene. Here, too, a long region that provides transcriptional regulation by the E2F factor was determined experimentally: −151/+169 [37].

Table 2. E2F sites with partial experimental proving that were included in the positive control set

| No. | Gene | Positions | E2F sites | q |
|---|---|---|---|---|
| 1 | PCNA p120, Mm | +9 to +21 | GTTGCGCGCGCAGaggg | (+)0.88 |
| 2 | Cdc7, Mm | −41 to −25 | CTCTTGCGCAGGCgccg | (+)0.82 |
| 3 | e2f-2, Hs | +34 to +50 | TTTTAGCGCCAAAcgtc | (−)0.98 |

For recognition of potential binding sites within extended DNA sequences it is important to estimate how many false positive sites are captured by the method (level of overprediction errors). To estimate the level of overprediction errors we have used the set Ex2 as a negative sequence set consisting of sequences of second exons of vertebrate genes (see Data and Methods). The minimal threshold that can be considered corresponds to the minimal score q obtained, that is 0.75 (see Table 1). At this threshold value the method has found 1757 sites in the Ex2 set. Let us assign this number to 100%, and plot the number of sites found at other threshold values as the percentage of the number of sites found at the minimal threshold (Figure 2). When the threshold is increased, the overprediction rate decreases. At q ≥ 0.93 the overprediction rate is equal to 0, which means that the method does not find E2F sites with q ≥ 0.93 in the Ex2 set.

Table 3. Rates of false negatives and false positives estimated on the basis of the positive control set and the set of second exons (negative control set)

| q | False negatives | Frequency of false positives (sites per 1000 bp) (Ex2 set) |
|---|---|---|
| 0.77 | 1 site out of 19, 5.3% | 1.309 |
| 0.81 | 4 sites out of 19, 21.1% | 0.553 |
| 0.86 | 6 sites out of 19, 31.6% | 0.153 |
| 0.93 | 11 sites out of 19, 57.9% | 0.009 |

When testing the positive control set at a cut-off value of 0.81, the method fails to correctly identify four sites of this set, which makes up 21% (Table 3). In the c-myc promoter (Table 1, site no. 2), one of the two closely spaced E2F sites was included in the control set, but was not correctly identified. The other site was included in the training set, and was correctly identified with q = 0.87 (Table 1, site no. 1). It is likely that the two E2F sites in the c-myc gene promoter can function cooperatively, and the binding of the E2F factor to a low-affinity site is stabilized due to the presence of a nearby high-affinity site. At this cut-off value, the method also fails to recognize two closely spaced E2F sites in the EBNA1 gene of the Epstein-Barr virus and a site in the thymidine kinase gene promoter. At q ≥ 0.86, six sites in the positive control set are not found, which makes up 31.6%. Here, the rate of false positives is rather low, 0.153 per 1000 bp. At q ≥ 0.93, nearly 60% of the real sites (11 sites) have been missed. As one can see, the minimum of the sum of the levels of overprediction and underprediction errors is at 0.86, and this value can be considered as a conditional optimum (Figure 2).
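The choice of 0.86 as a conditional optimum can be retraced from the numbers in Table 3 with a short Python sketch. One assumption is needed: the text expresses overprediction as a percentage of the sites found at the minimal threshold 0.75, but the false-positive frequency at 0.75 is not tabulated, so the sketch normalizes to the lowest tabulated threshold (0.77) instead. The names `TABLE_3`, `N_CONTROL` and `error_profile` are illustrative.

```python
# Rows of Table 3: (threshold q, false negatives among the 19 control sites,
# false positives per 1000 bp in the Ex2 set).
TABLE_3 = [
    (0.77, 1, 1.309),
    (0.81, 4, 0.553),
    (0.86, 6, 0.153),
    (0.93, 11, 0.009),
]
N_CONTROL = 19


def error_profile(rows, n_control):
    """Return (q, underprediction %, overprediction %, sum) for each threshold.

    Assumption: overprediction is expressed relative to the false-positive
    frequency at the lowest tabulated threshold, not at q = 0.75 as in the text.
    """
    reference_fp = rows[0][2]
    profile = []
    for q, missed, fp_rate in rows:
        under = 100.0 * missed / n_control
        over = 100.0 * fp_rate / reference_fp
        profile.append((q, round(under, 1), round(over, 1), round(under + over, 1)))
    return profile


PROFILE = error_profile(TABLE_3, N_CONTROL)
BEST = min(PROFILE, key=lambda row: row[3])

for row in PROFILE:
    print(row)
print("lowest error sum at q =", BEST[0])
```

Under this normalization the underprediction column reproduces the percentages of Table 3 and the smallest error sum falls at 0.86; a different reference point would rescale the overprediction column, so the exact sums should not be read as the values plotted in Figure 2.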

The threshold value can be selected arbitrarily, depending on the purpose of the sequence analysis (see Figure 2). To reveal the majority of real sites, it is necessary to choose a low threshold value. For instance, according to our estimates, only about 15 % of real sites will be lost at the low threshold level of 0.79, but more than 55 % of the sites found will be false positives. On the other hand, if the task is to reveal only the most reliable sites, it is necessary to select a higher threshold. At a threshold value of 0.93 or above the percentage of false positives is expected to be quite low (see above), but as many as 60 % of the real sites will not be recognized by the method. It becomes clear that additional criteria are needed to decrease the false positive rate without losing too many real sites.

Analysis of additional sequence motifs in the context of E2F sites

To investigate whether there are preferred sequence motifs in the context of E2F binding sites we created two sets of sequences: set Y and set N (see Data and Methods). The set Y consists of extended E2F sites corresponding to those in the training set (indicated by (t) in Table 1). The length of each sequence was L = 72 bp (the E2F site comprising 12 bp in the center and flanking sequences of 30 bp on either side). The set N consists of 100 sequences of the same length L = 72 bp taken from the set Ex2 so that in the center of each sequence a false positive match of the E2F matrix with the score value q ≥ 0.80 is detected. We performed an exhaustive search for contextual motifs that distinguish the Y and N sets, as described in Data and Methods. More than 10^8 characteristics ((l,w)-combinations) were checked. Among them, about 10^3 characteristics have the utility value U(l,w) > 0.7. Each of these characteristics (specific motif l in window w) exhibits a statistically significant difference between sets Y and N. Eleven mutually uncorrelated characteristics were selected (Table 4).

On the basis of these 11 characteristics (m = 11) we constructed the linear classification function discriminating sets Y and N by means of linear discriminant analysis. For any sequence X of length L = 72 bp we compute the score value:

$$d = b + \sum_{i=1}^{m} a_i \cdot f(l_i, w_i, X) \qquad (1)$$

where a_i and b are the coefficients of the discriminating function (see Table 4). We refer to function (1) as the E2F score of context.
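To make the structure of function (1) concrete, the sketch below evaluates a linear score over the motifs, windows and coefficients transcribed from Table 4. The exact definition of f(l, w, X) is given in Data and Methods and is not reproduced in this section, so the count-based `motif_count` used here is a hypothetical stand-in; it is not expected to reproduce the d values reported in the paper. Window positions are read as 1-based start positions, following footnote a of Table 4.

```python
# IUPAC nucleotide codes used by the motifs in Table 4
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG", "W": "AT", "K": "GT", "M": "AC", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

# (motif l, window start, window end, coefficient a_i), transcribed from Table 4
FEATURES = [
    ("TTT", 39, 41, 0.9618), ("CGSK", 17, 37, 0.5353), ("HKCG", 13, 15, 0.5904),
    ("VDWW", 17, 45, 0.223), ("DWTT", 21, 25, 0.5036), ("GSDM", 3, 69, 0.595),
    ("VWS", 7, 70, -0.095), ("HSWY", 26, 69, -0.2297), ("VTV", 19, 33, -0.261),
    ("MGCG", 27, 33, -0.394), ("BAY", 7, 70, -0.566),
]
B = -5.6767  # constant term b from Table 4


def motif_count(motif, start, end, seq):
    """Hypothetical f: matches of `motif` starting at 1-based positions start..end."""
    hits = 0
    for pos in range(start - 1, end):
        word = seq[pos:pos + len(motif)]
        if len(word) == len(motif) and all(
            base in IUPAC[code] for base, code in zip(word, motif)
        ):
            hits += 1
    return hits


def context_score(seq, f=motif_count):
    """Evaluate d = b + sum_i a_i * f(l_i, w_i, X) for a 72 bp sequence."""
    if len(seq) != 72:
        raise ValueError("expected a 72 bp sequence")
    seq = seq.upper()
    return B + sum(a * f(motif, start, end, seq) for motif, start, end, a in FEATURES)
```

The feature function is passed in as a parameter so that the placeholder can be replaced by the definition from Data and Methods without touching the rest of the code.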

Figure 2. Estimation of the level of overprediction and underprediction errors (percentage of the maximal levels) of the method of recognition of E2F binding sites at various threshold values.


Table 4. Eleven uncorrelated characteristics found in the context of E2F sites in promoters of C-genes

No.   Motif (l)   Window (w)^a   f̂Y/f̂N^b                  Utility   a_i
 1    TTT         [39,41]        0.0112/0.0032 = 3.536     0.75       0.9618
 2    CGSK        [17,37]        0.0851/0.0341 = 2.499     0.90       0.5353
 3    HKCG        [13,15]        0.0675/0.0095 = 7.071     0.79       0.5904
 4    VDWW        [17,45]        0.1233/0.0536 = 2.299     0.72       0.223
 5    DWTT        [21,25]        0.0337/0.0000             0.80       0.5036
 6    GSDM        [3,69]         0.0980/0.0559 = 1.754     0.82       0.595
Negative characteristics
 7    VWS         [7,70]         0.1258/0.1932 = 0.651     0.91      −0.095
 8    HSWY        [26,69]        0.0413/0.0813 = 0.508     0.79      −0.2297
 9    VTV         [19,33]        0.0427/0.1354 = 0.315     0.71      −0.261
10    MGCG        [27,33]        0.0/0.0041 = 0.0          0.80      −0.394
11    BAY         [7,70]         0.0274/0.0614 = 0.447     0.78      −0.566

b = −5.6767

a The motif can start at any position in the window w. E2F sites start in position 31.
b f̂Y, the mean frequency in the set Y; f̂N, the mean frequency in the set N.

Figure 3. (a) Plot of false negative rate (FN, in percent) versus false positive rate (FP, sites per 1000 bp) on a logarithmic scale for recognition of E2F sites by the positional weight matrix (PWM) and by the score of context; (b) dependence of the false negative rate (FN, in percent) on the cut-off value.


Among the 11 selected motifs, the first six are characterized by an elevated frequency in the flanking sequences of E2F sites in comparison with the sequences from the set of second exons (positive characteristics), whereas the last five motifs are characterized by a reduced frequency (negative characteristics). The corresponding windows where the selected motifs were found to be overrepresented or underrepresented differ considerably in length and location. For instance, windows for motifs no. 3 and 5 are located in the 5′ flanking regions relative to the E2F sites; windows for motifs no. 4, 6, 7, 8 and 11 cover long regions including E2F sites; windows for motifs no. 2, 9 and 10 overlap with the 5′ end of the E2F sites. It is interesting to observe that motif no. 1 (TTT) is overrepresented in the small window located in the 3′ half of E2F sites. Such motifs are context-specific for some of the known non-symmetrical E2F sites that are hard to recognize with the weight matrix. It is clear that the motifs found bring additional information about the fine structure and context features of E2F sites in promoters of cell cycle-specific genes.

The advantage of the score of context d (equation (2)) as an additional filter to identify potential E2F sites is demonstrated in Figure 3. As a first step we identify the potential E2F sites by using the weight matrix with definite cut-off values q_cut-off. Then, we apply the d score to an extended region surrounding each potential E2F site. The requirement d > 0 (d_cut-off = 0.0) is used as an additional argument for recognition of E2F sites.

Running the method with gradually increasing q_cut-off on the sequences from the positive control set and the negative control set, and plotting the false positive (FP) versus false negative (FN) rates, it becomes evident that taking the context motifs into account provides a significant decrease of the false positive rate (Figure 3(a)). With high q_cut-off values of the weight matrix (0.80 and more) the sequential application of the d score (named in the Figure as PWM + score of context) decreases the false positive rate more than tenfold (Figure 3(a)), leaving the false negative rate practically unchanged (Figure 3(b)). When using a low weight matrix cut-off (q_cut-off less than 0.80) the PWM + score of context method is characterized by a comparatively high rate of false negatives (about 21 %) (see Figure 3(b)). Therefore, the score of context is best applied for revealing high-scoring E2F sites. To search for all potential E2F sites it is necessary to use the conventional weight matrix method with a low cut-off value, at the price of a high false positive rate.

Promoters of cell cycle genes are significantly enriched by potential E2F binding sites

We have compared the frequencies of potential E2F binding sites within the promoters of cell cycle genes with the frequencies in promoter regions of functionally different genes. In addition, three types of random sequences and a set of vertebrate second exons were used for comparative analysis. The sets analyzed were: promoters of cell cycle genes (C-prom), promoters of muscle-specific genes (M-prom), promoters of genes induced upon immune response (T-prom), and promoters collected in the Eukaryotic Promoter Database (EPD-prom). (Only a limited number of cell cycle-related genes are present in EPD. These are dhfr, tk and some histone genes. Many genes encoding components of the cell cycle machinery are not included in EPD release 56.) Detailed characteristics of these sets are given in Table 9 (see Data and Methods).

We scanned these sequences with the developed PWM for E2F sites. The score of context was not applied at this step. Frequencies of the potential E2F binding sites found at different threshold values are shown in Figure 4. The E2F frequency is significantly higher in the set C-prom than in all other sets at any threshold value. Random sequences with the same mononucleotide and dinucleotide distribution as in the C-prom set (Figure 4, curves rand-C and di-rand C) exhibit a slightly elevated frequency of potential E2F sites, but still significantly lower than C-prom. Apparently, the frequencies of mono- and dinucleotides are correlated with the E2F binding site occurrence but cannot completely define it.

Figure 4. Comparison of the frequency of potential E2F sites in several sets of sequences scanned by the PWM method in the range of cut-off values q_cut-off = [0.8-1.0]. The ordinate axis gives the frequency of potential E2F sites per 1 bp. E2F frequency is significantly higher in the set C-prom than in all other sets at any threshold value. The lowest E2F frequencies occur within promoters of muscle-specific genes, second exons and promoters of genes induced upon immune response (5 to 20 times lower than in the random sequences with the even nucleotide distribution rand-0.25). E2F frequency in the set EPD-prom is similar to that in the random sequences rand-0.25. Random sequences with the same mononucleotide and dinucleotide distribution as in the C-prom set (curves rand-C and di-rand C) exhibit slightly elevated frequency of potential E2F sites, but still significantly lower than C-prom.

The high concentration of E2F sites in the C-prom set cannot be explained simply by the known sites present in the sequences analyzed. We found in the C-prom set 302 potential E2F sites with score values higher than 0.75. Only 36 of them are known and experimentally verified (presented in Table 1). The others are new and are found at high concentration in all sequences of the set.

Thus, the frequencies of potential E2F binding sites vary significantly between different DNA sequences. The results obtained suggest that promoters of cell cycle-related genes are significantly enriched by both high-scoring and low-scoring potential E2F binding sites. Therefore, the frequency of the potential E2F binding sites can be effectively used for functional classification of promoters.

Location of E2F sites relative to the transcription start site

As can be seen from Table 1, the majority of the experimentally proven E2F sites are located immediately near the transcription start site. We wanted to investigate if this holds true for potential sites in the genes with variable expression throughout the cell cycle. The C- and EPD-gene promoters were divided into 50 bp segments. The frequency of potential E2F sites was counted for each such segment, as shown in the histogram (Figure 5). As can be seen, in the C-gene promoters, potential E2F sites localize close to the transcription start site. Their concentration has a distinct peak between positions −50 and +1 relative to the transcription start site (Figure 5).

Figure 5. Frequency of potential E2F sites in promoters of C-genes in comparison with EPD promoters. The search was performed by the PWM method with the cut-off value q_cut-off = 0.8. No score of context was applied.

In the EPD-gene promoters, potential E2F sites are evenly distributed over the entire length, with no peaks observed (Figure 5). Over a long region spanning from −400 to +100, the E2F sites in the C-gene promoters occur more frequently than in the EPD promoters. The respective concentrations of potential E2F sites upstream of position −400 are indistinguishable in these two groups of promoters, suggesting that functional E2F sites are unlikely to occur in this region (Figure 5).

Thus, the study of the distribution of the potential E2F sites suggests that they localize close to the transcription start site, between positions −400 and +150. The concentration peaks distinctly over the first 50 nucleotides upstream of the transcription start site.
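The positional histogram described in this section (site positions counted in 50 bp segments relative to the transcription start site) can be sketched as below. The function names and the toy positions are assumptions for illustration only; each bin is labelled by its left edge, so the bin labelled −50 covers positions −50 to −1.

```python
from collections import Counter


def bin_sites(positions, bin_size=50):
    """Count sites per bin; each bin is labelled by its left edge."""
    return Counter((pos // bin_size) * bin_size for pos in positions)


def sites_per_bin_per_promoter(positions, n_promoters, bin_size=50):
    """Normalize the bin counts by the number of promoters scanned."""
    counts = bin_sites(positions, bin_size)
    return {left: n / n_promoters for left, n in sorted(counts.items())}


# Hypothetical site positions pooled from two promoters.
toy_positions = [-30, -10, -45, -120, 40, -390]
hist = bin_sites(toy_positions)
peak_bin = max(hist, key=hist.get)
```

Normalizing by the number of promoters makes sets of different size, such as the C-gene and EPD promoter sets, directly comparable.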

Eligibility criteria for searching of new potential E2F target genes

Based on the results of testing the E2F site recognition method on the control sets and on comparisons of the frequency of potential E2F sites in C-gene promoters and in other sequences, we came up with the following criteria for identification of new genes targeted by E2F family members. For every gene with a known location of the transcription start site:

(1) Search for potential E2F sites is performed in three modes: (i) q_cut-off = 0.86 (relaxed mode); (ii) q_cut-off = 0.80 and d_cut-off = 0.0 (moderate mode); (iii) q_cut-off = 0.80 and d_cut-off = 1.9 (stringent mode).

(2) At least one potential E2F site should be found between positions −400 and +150 relative to the transcription start site.
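The two criteria above can be expressed as a small filter. This is a minimal sketch under stated assumptions: the `Site` record, the mode table and the function name are hypothetical, q is compared with "greater than or equal to" and d with "greater than" (the text writes d > 0 and d > 1.9), and a site without an available score of context is treated as failing the d test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    position: int              # relative to the transcription start site
    q: float                   # weight-matrix score
    d: Optional[float] = None  # score of context; None if not available


# mode -> (q cut-off, d cut-off or None when the score of context is not used)
MODES = {
    "relaxed": (0.86, None),
    "moderate": (0.80, 0.0),
    "stringent": (0.80, 1.9),
}


def is_potential_target(sites, mode, region=(-400, 150)):
    """True if at least one site in the region passes the cut-offs of `mode`."""
    q_cut, d_cut = MODES[mode]
    for site in sites:
        in_region = region[0] <= site.position <= region[1]
        q_ok = site.q >= q_cut
        d_ok = d_cut is None or (site.d is not None and site.d > d_cut)
        if in_region and q_ok and d_ok:
            return True
    return False


# Illustrative values only.
example_sites = [Site(-72, 0.83), Site(-53, 0.87, 1.18)]
```

With these illustrative sites, the promoter passes the relaxed and moderate modes but not the stringent one, which shows how the modes trade sensitivity for specificity.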

Using these criteria, for a selected searching mode, we can now identify genes as potential E2F targets in different phases of the cell cycle. If additional sites with q between 0.80 and 0.86 are found within the same promoter, these are also assumed to be potential E2F sites.

We have tested the three modes of eligibility criteria on several sets of sequences (see Table 5).

Table 5. Testing of the eligibility criteria for identifying potential E2F target genes

                                      The number of sequences identified as potential E2F targets
Set                  Number of seq.   q_cut-off = 0.86     q_cut-off = 0.80 and          q_cut-off = 0.80 and
                                      (relaxed)            d_cut-off = 0.0 (moderate)    d_cut-off = 1.9 (stringent)
C-genes              45               36 = 80 % (TP)^b     32 = 71 % (TP)                29 = 64.4 % (TP)
Ex2-cut^a            1733             148 = 8.5 % (FP)     48 = 2.8 % (FP)               24 = 1.4 % (FP)
M-genes + T-genes    121              7                    6                             3
EPD                  939              220                  182                           121

a Sequences from the Ex2 set (see Table 9) were concatenated and then chopped into pieces of 550 bp length.
b TP, estimation of the true positive rate on the whole set of known cell cycle-related genes; FP, estimation of the false positive rate on the set of exon 2 sequences (the number of 550 bp pieces of exon 2 sequences that were identified as potential E2F targets).

One can see that the stringent mode of the searching criteria is characterized by a rather low false positive rate (1.4 %) and by a reasonably high true positive rate (about 65 %). This mode is quite effective in filtering false positives, as shown in Table 5 for the muscle-specific and immune cell-specific promoters. We used the stringent mode for scanning the EMBL database (see below) to prevent the results from being diluted by too many false positives. The relaxed mode of the searching criteria provides a better true positive rate (80 %), which makes it useful for careful analysis of small sets of genes. The high level of false positives (8.5 %) prevents us from using this mode for database screening.

Scanning EMBL for E2F target genes

To reveal new E2F target genes we scanned EMBL release 6.0, divisions hum, rod, vrt, and mam. We automatically selected from EMBL all entries that contain clear information about promoter position. For that we applied a keyword search in the FT fields. We checked for the keywords: ("prim_transcript") or ("precursor_RNA") or ("mRNA" and "join" but not ":").
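The keyword rule for the FT fields can be sketched as a simple line filter. The function name is hypothetical, and applying the rule line by line to the feature table of one entry is an assumption; the text does not say whether the test was made per line or per entry. The example lines are made up to resemble EMBL feature table lines.

```python
def has_promoter_annotation(ft_lines):
    """Apply the keyword rule to the feature table (FT) lines of one entry."""
    for line in ft_lines:
        if not line.startswith("FT"):
            continue  # only feature table lines are inspected
        if "prim_transcript" in line or "precursor_RNA" in line:
            return True
        # "mRNA" and "join" but not ":" on the same line
        if "mRNA" in line and "join" in line and ":" not in line:
            return True
    return False


# Made-up example entries.
entry_a = ["ID   EXAMPLE1", "FT   mRNA            join(100..250,400..900)"]
entry_b = ["FT   mRNA            join(J00194:100..202,1..245)"]
entry_c = ["FT   prim_transcript 55..3210"]
entry_d = ["FT   CDS             120..980"]
```

Here `entry_b` is rejected because its location contains a colon, which is the pattern the rule excludes.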

A total of 4459 promoter sequences were selected from EMBL (Table 6). It is known that retrieving promoters from EMBL by such a keyword search is far from complete. Therefore, in addition to the automatic retrieval of promoters we manually collected all available promoters for cell cycle-related genes by detailed analysis of EMBL feature tables as well as by using information from the EPD database and from various literature sources. We retrieved an additional 152 promoters that were missed by the automatic EMBL feature table search. All 4611 promoters were analyzed with the stringent criteria: search for E2F sites in the region [−400, +150], with the cut-off value q_cut-off = 0.80 and with the score of context d > 1.9.

As a result, 342 promoters were identified as potential E2F targets. Among the corresponding genes, 29 are known E2F-dependent genes, and 313 are new potential E2F target genes. In Table 6 we show the distribution of the target genes found among different functional groups of genes. The Table with potential E2F target genes that are found in the current release of EMBL is available by request.

One can see that besides the known E2F-regulated functional groups of genes we propose new E2F target genes in such important groups as nucleolins, prions, ribosomal proteins and rRNA genes, several genes for receptors and transporters, as well as genes for several heat shock proteins.

Table 6. Search for E2F target genes in EMBL and in additional gene set

Genes/gene products                                            (1) Known E2F target   (2) Total found   (2)-(1) New potential
                                                               genes (C-genes)        in EMBL           E2F target genes
1. Transcription factors (AP-1, Myb, Myc and E2F family)       9                      50                41
2. Components of the signal transduction pathways              3                      25                22
3. Cyclins and cyclin-dependent kinases                        6                      10                4
4. Inhibitors of cyclin-dependent kinases                                             3                 3
5. Cell cycle-dependent phosphatases                                                  1                 1
6. Tumor associated and tumor suppressor genes                 3                      9                 6
7. Histone genes and HMG proteins                              1                      20                19
8. Genes involved in DNA replication and repair                1                      4                 3
9. RNA polymerases and elongation factors                      1                      6                 5
10. Enzymes of nucleotide metabolism                           5                      22                17
11. Other enzymes                                                                     41                41
12. Other cell cycle-related proteins                                                 12                12
13. Nucleolins, prions                                                                5                 5
14. Ribosomal proteins and rRNAs                                                      14                14
15. Receptors                                                                         11                11
16. Transporters                                                                      4                 4
17. Cytokines, growth factors, hormones                                               11                11
18. Heat shock proteins                                                               3                 3
19. Others                                                                            91                91
Total found                                                    29                     342               313
Number of promoters that were checked                          42                     4459 + 152 = 4611

Criteria: q_cut-off = 0.80 + the score of context d_cut-off = 1.9; [−400, +150].

Phylogenetic footprinting as an additional argument for the identification of E2F binding sites

Phylogenetic footprinting is a method based on the assumption that the parts of gene regulatory regions that undergo little or no variation in the course of evolution have important functions.38 The essence of the method is the alignment of the regulatory regions of orthologous genes and the finding of the most highly conserved regions. We took advantage of phylogenetic footprinting as an additional criterion for the assertion of potential regulatory elements, in our case the E2F transcription factor binding sites. The alignment was done with ClustalX.

We found that in many cases, potential E2F binding sites are conserved across species. One example is the site at −300 in the promoters of the human and mouse tgf-β1 genes (CTTCGCGCCCTGG, q = 0.91). Promoters of the human and mouse mcm4 genes contain three conserved sites, at −24 (TGTCGCGCAGGT, q = 0.85), at −329 (AGTGGCGCCTCC, q = 0.81), and at −443 (TTTCCCGCGAAA, q = 0.87). In the promoter of the c-fos proto-oncogene, the predicted E2F site around −420 (TTTCCCGCCTCC, q = 0.84) is absolutely conserved between rodents: mouse, rat, and hamster. The corresponding site in the human gene contains one mismatch that makes q < 0.80. In addition, we have revealed three potential binding sites in the human c-fos proximal promoter, q = 0.91, 0.84, 0.87, that are not conserved in rodents.

We have demonstrated the presence of potential E2F sites in the promoters of nucleolin genes of four mammalian species. Potential E2F sites most similar to the consensus (q = 0.96 and 0.97) are homologous in humans and rodents (Figure 6). In all the rodent species considered (mouse, rat, hamster), these sites are absolutely conserved, while in man three substitutions occur. Additionally, we revealed potential E2F sites with little similarity to the consensus (q = 0.76-0.81) and no homology between species (Figure 6). It is tempting to propose a role for E2F in the regulation of nucleolin genes. We decided to confirm this by determining whether E2F factors bind to the human, hamster, and mouse nucleolin promoters.
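A simple way to quantify conservation of a predicted site between orthologous promoters is to count mismatches between the aligned site sequences. The sketch below uses the 12 bp cores of the nucleolin sites with q = 0.97 as listed in Table 7 and compares them position by position without gaps; treating these three sites as the aligned homologous positions is an assumption, and the function names are hypothetical.

```python
def mismatches(site_a, site_b):
    """Number of differing positions between two aligned, equal-length sites."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must be aligned to the same length")
    return sum(1 for a, b in zip(site_a.upper(), site_b.upper()) if a != b)


# 12 bp cores (upper-case part) of the nucleolin sites with q = 0.97 in Table 7
NUCLEOLIN_SITES = {
    "human": "TTTGGCGCCGGC",
    "hamster": "TTTGGCGCGGCT",
    "mouse": "TTTGGCGCGGCT",
}


def conservation_table(sites, reference):
    """Mismatch counts of every site against the reference species."""
    ref = sites[reference]
    return {species: mismatches(ref, seq) for species, seq in sites.items()}


table = conservation_table(NUCLEOLIN_SITES, "mouse")
```

For these cores the rodent sites are identical and the human site differs at three positions, in line with the description of the nucleolin sites in the text.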

Confirming E2F targets by chromatin immunoprecipitation

One method for confirming that E2F family members regulate specific promoters has been to test the ability of E2Fs to bind in vitro to promoter-specific oligonucleotides. However, such in vitro studies may lead to false conclusions due to the artificial nature of the analysis. For example, promoter context cannot contribute to binding when an isolated site is studied, and the inclusion of a single site, in the absence of competition with the rest of the genome, may not provide an accurate view of the in vivo situation. Therefore, experimental verification of the binding of E2F to the identified sites should be performed under normal physiological conditions. The study of E2F binding in living cells has recently been demonstrated using an in vivo formaldehyde cross-linking technique.22,23,39 Briefly, this protocol involves treating cells with formaldehyde to cross-link the transcription complexes to promoter DNA and immunoprecipitation of the protein-DNA complexes with an antibody against an individual E2F, followed by analysis of the immunoprecipitated DNA by PCR using promoter-specific primers. This method has been successfully applied to study in vivo binding of different E2F family members to genes which were previously shown to be regulated by the E2F family using in vitro techniques. To date, in vivo E2F binding has been confirmed at the e2f-1, cdc6, cdc25A, p107, B-myb, cdc2, dhfr, thymidine kinase, cycE and cycA promoters.22,23

We have now used the method of chromatin immunoprecipitation to determine if E2F-4 binds to the promoter regions of the human, mouse, and hamster nucleolin genes. The results shown in Figure 7 clearly provide a strong positive confirmation for the functional significance of the predicted E2F sites within the orthologous nucleolin gene promoters.

Figure 6. Phylogenetic footprinting as an additional criterion for recognition of potential E2F binding sites. Conserved potential E2F sites in the promoters of the mammalian nucleolin gene.

Figure 7. Experimental verification of the nucleolin promoter as an E2F target. Cross-linked chromatin from human (HeLa cells), mouse (adult mouse liver), and hamster (Chinese hamster ovary cells) samples was incubated with an antibody to E2F4 or in the absence of antibody (No Ab). Immunoprecipitates were analyzed by PCR using primers specific for the nucleolin promoter from each species. The nucleolin promoter is clearly present in the sample prepared using an E2F4 antibody but is not present in the control immunoprecipitate.

We have also used the in vivo formaldehyde cross-linking technique to confirm the identity of other potential E2F target genes that have been suggested computationally. We used HeLa cells for these experiments since all six E2Fs are present in these cells. For experimental analysis, we chose several computationally predicted target genes whose possible transcriptional regulation by E2F family members is interesting in view of the known functions of those genes. Those selected include c-fos and junB, members of the AP-1 family of transcription factors; the gene encoding TGF-β, which acts as an antiproliferative agent on a majority of cell types; the ARF locus, encoding a protein that binds to and stabilizes p53 and thus functions in tumor suppression; mcm4 and mcm5, involved in the initiation of DNA replication; the von Hippel-Lindau (VHL) tumor suppressor gene; and e2f-1 (see Table 7). Each of the promoters tested in Figure 8 has been analyzed using the exact same immunoprecipitated samples, allowing us to directly compare the binding of E2F family members to the different promoters. As a positive control, we examined in vivo E2F binding to the dhfr promoter. Regulation of the dhfr promoter by the E2F family has been demonstrated using several different methods, and binding of E2Fs to the dhfr promoter has been shown both in vitro and in vivo.14,22,40-43 As a negative control, we used the 3′ end of the dhfr gene, which is known to have no E2F binding sites.22 A "no antibody" control was also included in each experiment, and the small signal in this lane was subtracted from each of the signals before the percentage relative to input was calculated. As expected, we found that several of the E2Fs bound to the dhfr promoter (Figure 8(a)). In contrast, very little E2F binding was detected at the 3′ end of the dhfr gene (Figure 8(b)). Thus, the control demonstrates the specificity of the experimental protocol.
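The normalization of the immunoprecipitation signals described above (background subtraction followed by expression as a percentage of input) amounts to the following arithmetic. The function name and the example numbers are hypothetical; how the band intensities were actually quantified is not stated in this section.

```python
def percent_of_input(signal, no_antibody, input_signal):
    """Background-subtracted signal as a percentage of the input sample signal."""
    if input_signal <= 0:
        raise ValueError("input signal must be positive")
    return 100.0 * (signal - no_antibody) / input_signal


# Hypothetical band intensities in arbitrary units.
strong = percent_of_input(12.0, 2.0, 10.0)
weak = percent_of_input(8.0, 2.0, 10.0)
```

Because every promoter is referred to its own input signal, values obtained with different primer pairs can be compared even when the PCR efficiencies differ.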

Using antibodies against various members of the E2F family, we have confirmed specific E2F binding to all promoters under study in asynchronously growing HeLa cells (Figure 8(c)). By reporting signal intensity relative to a fixed portion of input, we can directly compare binding of the E2Fs at the different promoters. In general, the signal intensities relative to input suggest that all of the computer-identified promoters bind E2Fs at approximately the same level as does the dhfr promoter. However, binding of E2Fs to the mcm4 and mcm5 promoters appears to be stronger than to the other promoters. For all investigated promoters, including dhfr, practically no E2F-5 binding was detected, which agrees with previous data obtained using human T98G cells and mouse 3T3 cells.23 For all promoters except e2f-1, we also observed low signals using an antibody against E2F-3. However, we have found that E2F-3 signal intensities are the most variable between experiments (data not shown). Signals using E2F-1, E2F-2, and E2F-4 antibodies are the most intense; in most cases the signals are about 60-150 % relative to the input. Previous studies have not examined binding of E2F-6 to cellular promoters. E2F-6 is a special member of the E2F family, as it is a repressor lacking both the activation domain and the pocket-interaction domain.9 Although its function in vivo is not clear, evidence to date suggests that E2F-6 may be a negative regulator of transcription. Interestingly, we find that although most of the promoters show considerable E2F-6 binding (relative to the other E2Fs), the dhfr, mcm4, and mcm5 promoters have very low E2F-6 signals. The binding of several different E2Fs to an E2F target gene could be due to (1) the existence of multiple E2F sites within the promoter region (see Table 7), (2) differential binding of specific E2Fs in different stages of the cell cycle,22,23 or (3) the stochastic nature of transcription complex formation (see ref. 22 for further details). Although E2F binding to eight different mammalian promoters containing computationally predicted target sites was confirmed by in vivo chromatin immunoprecipitation, definition of the characteristics of E2F binding sites that correlate with the preferential binding of distinct E2F family members requires additional theoretical and experimental investigations.

It is interesting to note that for all E2F target genes which have been confirmed by chromatin immunoprecipitation we have revealed at least one site that is conserved between mammalian species. Thus, phylogenetic footprinting may serve as an additional criterion for the identification of functional sites in newly sequenced regulatory regions. If the results of the computerized search for the E2F sites at a sufficiently high value of q_cut-off are consistent with phylogenetic footprinting data, it is more likely that the sites we have revealed are indeed functional. However, based solely on this method, we cannot deliberately "sort out" sites, because in different species real functional sites may not be confined to homologous regions.

Table 7. Computationally predicted E2F target genes that have been confirmed by experimental verification

Each block gives the gene, the EMBL entry and the positions of the PCR primers, followed by one row per potential site.
Columns: sequence of the potential site; position relative to the start of transcription^a; score, q; score of context, d^b.

c-fos, Homo sapiens; EMBL HSFOS; PCR primers −201 → +96
  (−) gcCTTGGCGCGTGTcc    −165, −176      0.92
  (−) ggGGTGGCGCGCGGgc    −92, −103       0.84    2.92
  (+) ccTCTGGCGCCACCgt    −90, −79        0.88
  (−) acGGTGGCGCCAGAgg    −78, −89        0.830

JunB, Homo sapiens; EMBL HS207341; PCR primers −27 → +313
  (+) gcTATCGCGCCAGAga    79, 90          0.89
  (−) tcTCTGGCGCGATAgc    91, 80          0.91
  (−) ggGCTGGCGCGGGCgg    169, 158        0.82    3.17

tgf-β1, Homo sapiens; EMBL HSTGFB1PR; PCR primers −122 → +210
  (+) ctGTTTGCGGGGCGga    −513, −502      0.80    2.03
  (+) ccCTTCGCGCCCTGgg    −298, −287      0.91
  (+) ctCTTGGCGCGACGct    28, 39          0.93
  (−) agCGTCGCGCCAAGag    40, 29          0.83
  (+) ccTTTGCCGCCGGGga    85, 96          0.85

p14ARF, Homo sapiens; EMBL AF082338; PCR primers −404 → −143
  (−) ctCTCCGCGCGCGGga    −1384, −1395    0.81    4.11
  (−) gtCTTGGCGACCGTtg    −1009, −1020    0.81
  (−) ggCCTGGCGCCGGAct    −739, −750      0.81
  (+) tgATTGGCGGATAGag    −589, −578      0.83
  (−) acTTTCCCGCCCTGtg    −265, −276      0.86

Mcm4 (Cdc21), Homo sapiens; EMBL HSU63630; PCR primers −667 → −330
  (−) gtTTTCGCGGGAAAac    −491, −502      0.93    3.53
  (−) ctTTCAGCGCCCGTgc    −409, −420      0.82
  (+) gcAGTGGCGCCTCCcg    −377, −366      0.80
  (+) ggCGTGGCGCGGAGcc    −175, −164      0.83    4.39
  (+) ctTGTCGCGCAGGTac    −93, −82        0.86

mcm5 (P1-cdc46), Homo sapiens; EMBL HS286B10; PCR primers −211 → +88
  (+) agTTTCGCGCCAAAtt    −187, −176      0.99    4.91
  (−) aaTTTGGCGCGAAAct    −175, −186      1.00
  (+) ttTTTCCCGCGAAAct    8, 19           0.89    3.01
  (−) agTTTCGCGGGAAAaa    20, 9           0.93    4.21

von Hippel-Lindau (VHL), Homo sapiens; EMBL AF010238; PCR primers −137 → +123
  (+) aaGCTCGCGCCACTgc    −270, −259      0.81
  (−) gcAGTGGCGCGAGCtt    −258, −269      0.84
  (−) gtCTTCGCGCGCGCtc    −28, 39         0.92    2.22

B-myb, Homo sapiens; EMBL HSBMYBDNA; PCR primers −296 → +14
  (−) gtCCTGGCGCGCGGgc    −72, −83        0.83
  (+) cgCTTGGCGGGAGAta    −53, −42        0.87    1.18

nucleolin, Homo sapiens; EMBL HSNUCLEO; PCR primers −407 → −41
  (−) ttTTTGGCGCCGGCtg    −297, −308      0.97
  (−) ccGTGGGCGCGCGGgt    −256, −267      0.81    2.91

nucleolin, Cricetulus griseus; EMBL CSNUCLEO; PCR primers −538 → −198
  (−) cgTTTGGCGCGGCTtg    −296, −307      0.97    6.67

nucleolin, Mus musculus; EMBL MMNUCLO; PCR primers −531 → −232
  (−) agTTTGGCGCGGCTtg    −306, −317      0.97    1.76

a In the cases when multiple starts of transcription are known (HSTGFB1PR, AF082338) an extended promoter region (up to −1500 bp) has been searched for E2F sites.
b Only positive values of the score of context are shown. At least one E2F site with a positive score of context is found in the promoter regions of these genes.

Figure 8. In vivo formaldehyde cross-linking experiments confirming potential E2F target genes that have been suggested computationally. Shown is an analysis of E2F binding to the different promoters in asynchronously growing HeLa cells. Cross-linked chromatin from these cells was incubated with antibodies to E2F1-6 or in the absence of antibody. Immunoprecipitates from each sample were analyzed by PCR using primers specific for the different promoters. As a control, a sample representing a fixed percentage of the input chromatin was included. For each set of primers, the signal in the "no antibody" reaction was subtracted from the signals in the E2F reactions. The E2F signals are then reported as a percentage of input. This allows the different promoters to be compared, regardless of PCR efficiencies.

Discussion

This work is a combined effort of theoretical and experimental approaches to study transcriptional regulation of genes by the E2F family of transcription factors. Because E2F factors regulate a number of genes involved in proliferation control and S phase entry, we reasoned that identification of other genes regulated by the E2F family would provide insight into cell proliferation. From the earlier published data on transcriptional regulation of cell cycle-related genes contained in CYCLE-TRRD (ref. 44) and TRANSFAC (ref. 34), we generated a large sample set of E2F binding sites and derived the corresponding weight matrix. Based on the matrix, we developed a computer method for searching regulatory regions for E2F binding sites. Using the criteria listed above, we have identified a large number of new genes that could be regulated by E2F transcription factors (see Table 6). One category of identified genes includes orthologs of genes that were previously known to be regulated by E2F family members. For example, the human cycD1 and RB1 promoters were previously reported to be regulated by E2F, and we have now identified E2F sites in the rat cycD1 and the mouse RB1 promoters. However, most of the genes which we have identified have not been suggested previously to be E2F target genes in any mammalian species. The identified E2F target genes can be classified into large groups, depending on the function of the proteins they encode. For example, we identified cell cycle-related genes such as E2F transcription factors, cyclins, cyclin-dependent kinases and phosphatases, inhibitors of cyclin-dependent kinases, RB1, p53, and ARF; factors and enzymes involved in DNA replication, DNA repair, and nucleotide metabolism; histone genes, transcription factors of the Myc, Myb, and AP-1 families, nucleolins, RNA polymerases, and genes encoding components of signal transduction pathways (see Table 6).

Identification and classification of genespotentially regulated by E2F

Although the expression of most E2F-dependent genes (such as dhfr, e2f-1, RB1, and p107) peaks at the G1/S boundary, E2F factors are also involved in regulating genes that control other phases of the cell cycle. For example, there are several E2F-regulated genes, such as cycA, cdc2, and certain histone genes, whose expression remains high throughout S-phase and into G2 phase. To date, regulation by E2F factors has been demonstrated experimentally only for two histone genes. Our results suggest that E2F may be involved in the control of many more types of histone genes than previously demonstrated. We have also identified potential E2F sites in G2-specific genes such as cycB1, cdc25C, and plk. It has been demonstrated elsewhere that G2-specific transcription of several genes is mediated by two closely spaced cis-elements, CDE (C/G GCGG) and CHR (TTGAA), to which as yet unidentified proteins bind that suppress transcription at the earlier stages of the cell cycle [45,46]. Our results suggest that the E2F family members contribute to G2-specific expression along with the factors interacting with CDE and CHR. This assumption is in good agreement with the recently established view that E2F-mediated repression is one of the most frequent ways of exerting transcriptional regulation on the genes [46,47].

We predicted theoretically and verified experimentally the binding of the factors of the E2F family in vivo to the promoters of genes encoding the proteins of the AP-1 family. These genes are highly inducible at the G0/G1 transition. AP-1 family members are known to activate transcription of cyclins and cdks [48] that in turn activate E2F factors through retinoblastoma phosphorylation. Our results suggest positive feedback from the E2F factors to the fos and jun genes. Our computer analysis also revealed potential E2F sites in genes encoding various other components of signal transduction pathways, such as ERK-1, Ras and RanBP1. This suggests other possible positive feedback loops from E2Fs to upstream components of signal transduction pathways. In normal cells, however, the expression of the jun, fos, erk and ras genes does not depend on cell cycle stage as critically as does the expression of other E2F target genes. We propose that E2F-dependent regulation of certain genes takes place in unique situations, for instance, during uncontrolled cell proliferation, when the levels and activity of the E2F factors exceed a certain threshold. Another interesting group of genes potentially regulated by E2F has maximal expression in quiescent cells (G0 stage). These are inhibitors of cyclin-dependent kinases and TGF-β. Therefore, although the identification of an E2F target gene predicts cell cycle regulation, the exact pattern of gene expression must be determined experimentally.

We suggest that the computer-assisted identification of E2F target genes can be used in two different ways to simplify future studies of transcriptional regulation. First, if a gene is known to be cell cycle-regulated, a sequence analysis using the criteria described here could be used to determine if the cell cycle regulation is mediated by E2F. For example, we have proposed, based on computer prediction, phylogenetic conservation, and in vivo binding studies, that E2F is involved in the control of nucleolin gene transcription. Others have demonstrated that expression of nucleolin is low in serum-starved cells and high in S phase [49].


Based on our studies, it is likely that this S phase-specific regulation is mediated through the conserved E2F binding sites in the nucleolin promoter. Thus, identification of nucleolin as an E2F target has greatly simplified future studies of the transcriptional regulation of this gene. A second way in which our computer-assisted identification can be useful is in the ability to identify additional genes that are cell cycle regulated. As shown in Figure 4, E2F sites are found at a higher frequency in the promoters of cell cycle-related genes than in other DNA sequences. At qcut-off = 0.86 (incidentally, this is the value at which the sum of false negatives and false positives is minimal), the potential E2F sites occur in the promoters of the cell cycle-related genes 3.5 times more frequently than in random sequences, and ten times more frequently than in promoters of muscle-specific and immune response genes (Figure 4). Thus, our results suggest that the frequency of potential E2F sites in promoter regions can be a reliable criterion for the identification of cell cycle-regulated genes.

A hypothesis of promoter-defining sites

Functionally related genes involved in the same molecular-genetic, biochemical, or physiological process are often regulated coordinately by members of families of closely related factors. We use the term "promoter-defining sites" to refer to binding sites which are responsible for the common structural features of promoters in functionally related genes and which are also responsible for a large component of the common pattern of gene expression. Based on the results reported herein and elsewhere, some features of promoter-defining sites in functionally related groups of genes can be identified:

(1) The promoter-defining sites are revealed in specific promoters with a frequency significantly higher than in random sequences. This holds true both for sites similar to the consensus (at high cut-off values) and for sites with little similarity to it (at low cut-off values). Other investigators demonstrated that the frequency of potential HNF-1 sites is 2.5 times higher in liver-specific gene promoters than in random sequences [30]. The frequency of the potential composite elements NFAT/AP-1 is ten times higher in the promoters of immune response genes than in random sequences [31]. In this work we have demonstrated it in detail for E2F binding sites (Figure 4).

(2) The frequency of promoter-defining sites differs significantly between the promoters of genes belonging to different functional groups, which can be used as a criterion for functional classification of the promoters. A comparison of the frequencies of E2F sites and of composite elements NFAT/AP-1 in four groups of promoters is presented in Table 8. For the purposes of this comparison, the cut-off values were chosen such that the number of false positives in the random sequences was the same for the E2F sites and the composite elements NFAT/AP-1. As can be seen, the E2F binding sites are 28 times more frequent in the cell cycle gene promoters than in muscle-specific gene promoters and seven times more frequent than in EPD promoters, while in the promoters of the immune response genes no E2F sites are recognized at this cut-off value (Table 8; see also Figure 4 in Results). The frequency of the composite elements NFAT/AP-1 shows a different pattern. In the promoters of immune response genes, they are 4.5 times more frequent than in those of cell cycle-related genes, and 2.5 times more frequent than in muscle-specific or EPD promoters (Table 8).

(3) Promoter-defining sites are located close to the transcription start site. The composite elements NFAT/AP-1 largely localize to the first 300 bp upstream of the transcription start site [31]. Likewise, in the promoters of liver-specific genes, most HNF-1 sites localize to the region between −300 bp and the transcription start site [30]. The study of the distribution of the potential E2F sites relative to the transcription start site demonstrated that they are largely located in the region between −400 and +100 (Figure 5). A distinct peak of concentration of the E2F sites is observed between −50 and +1 (Figure 5). This localization is consistent with data on the mechanisms of E2F function and on the structure of basal promoters of the genes regulated by E2F factors. Few E2F target promoters contain a TATA box (Table 1), suggesting that E2F may function in place of direct binding of TBP to promoter DNA. It is known that E2F-1 is capable, through its activation domain, of interacting with the basal complex components, namely TBP and the general factor TFIIH [13,50]. Recently, evidence of cooperative interactions between E2F and the basal transcription factors TFIIA and TFIID was also obtained [51]. Taken together, these facts and the proximity of the potential E2F sites to the transcription start suggest that E2F factors may be involved in transcriptional regulation as early as the stage of basal complex assembly, providing the recruitment of TBP to TATA-less promoters and additionally stabilizing TBP binding to TATA-containing promoters.

Table 8. Frequencies of E2F binding sites and composite elements NFAT/AP-1 in the promoters of four groups of genes and in random sequences (sites per 1000 bp)

                 C-prom   T-prom   M-prom   EPD-prom   Rand-0.25
E2F (a)          2.680    0.000    0.096    0.352      0.487
NFAT/AP-1 (b)    0.464    2.067    0.828    0.891      0.490

Sets of promoters, C-prom, T-prom, M-prom, and EPD-prom are described in Data and Methods.
(a) Frequency of E2F sites at qcut-off = 0.88.
(b) Frequency of NFAT/AP-1 at wNFAT = 1.2; wAP-1 = 0.9; qCE > 0.0 (see Kel et al., 1999a) [31].
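The fold differences quoted under feature (2) can be rechecked directly from the frequencies in Table 8. The following Python sketch does only this arithmetic. The dictionary layout and the helper name `fold` are illustrative choices, not part of the original analysis, and the assignment of values to columns assumes the order C-prom, T-prom, M-prom, EPD-prom, Rand-0.25.

```python
# Frequencies from Table 8 (sites per 1000 bp); column order is an assumption.
TABLE8 = {
    "E2F": {"C-prom": 2.680, "T-prom": 0.000, "M-prom": 0.096,
            "EPD-prom": 0.352, "Rand-0.25": 0.487},
    "NFAT/AP-1": {"C-prom": 0.464, "T-prom": 2.067, "M-prom": 0.828,
                  "EPD-prom": 0.891, "Rand-0.25": 0.490},
}


def fold(site, numerator_set, denominator_set):
    """Ratio of site frequencies between two promoter sets (hypothetical helper)."""
    return TABLE8[site][numerator_set] / TABLE8[site][denominator_set]


if __name__ == "__main__":
    print("E2F, C-prom vs M-prom:         %.1f" % fold("E2F", "C-prom", "M-prom"))
    print("E2F, C-prom vs EPD-prom:       %.1f" % fold("E2F", "C-prom", "EPD-prom"))
    print("NFAT/AP-1, T-prom vs C-prom:   %.1f" % fold("NFAT/AP-1", "T-prom", "C-prom"))
    print("NFAT/AP-1, T-prom vs M-prom:   %.1f" % fold("NFAT/AP-1", "T-prom", "M-prom"))
    print("NFAT/AP-1, T-prom vs EPD-prom: %.1f" % fold("NFAT/AP-1", "T-prom", "EPD-prom"))
```

The ratios come out at roughly 28, 7.6, 4.5, 2.5 and 2.3, in line with the rounded figures given in the text. No ratio is computed against T-prom for E2F, because no E2F sites were recognized in that set at this cut-off.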

In the cases considered here, we successfully classified E2F target promoters by looking at just one type of site. However, it is clear that promoter context can be very important in determining regulation by a transcription factor. An analysis of the flanking sequences of the E2F sites in the promoters of cell cycle-related genes allowed us to reveal specific motifs which occur in certain segments of these sequences at reliably higher frequency than in the Ex2 set. With these motifs, we developed an additional test for checking the context surrounding the potential E2F sites. This test allowed us to considerably reduce the number of sites incorrectly identified as being E2F binding sites. It may well be that these motifs serve as binding sites for various transcription co-factors. Apparently, this structure of the flanking sequences reflects a possible cooperation between the E2F factors and other factors during regulatory complex assembly. Future studies of such cooperation mechanisms in the regulation of cell cycle-related genes are clearly required.

In summary, the computer method presented here allows us to systematically reveal promoters likely to be recognized by the factors of the E2F family. A comparison of the binding of E2F factors in vivo with the sites predicted in the promoters of 11 genes confirms the recognition of these promoters by E2F factors and provides a convincing demonstration that performing theoretical and experimental studies in combination is very instrumental in analyzing the regulatory regions of genes. With the advent of the large-scale sequencing projects, it is becoming increasingly essential that computational biologists team up with bench scientists in order to develop a detailed understanding of the transcriptional networks that control mammalian cell growth and development [25,52,53].

Table 9. Sets of DNA sequences used in the analysis

Set          Number of seq.   Total length (bp)   Description
C-prom       42               19,404              c-myc (Hs, Mm); N-myc (Mm, Rn, Sb); B-myb (Hs, Mm); e2f-1 (Hs, Mm, Cc); e2f-2 (Hs); RB1 (Hs, Mm); p107 (Hs); cycD1 (Hs, Rn); cycE (Hs); cycA (Hs, Mm); cdc2 (Hs, Mm, Cc); cdc6 (Hs); cdc7 (Mm); Orc1 (Hs); DNA pol (Hs); dhfr (Cs, Hs, Mm); tk (Hs, Mm); cad (Ma); PCNA (Hs, Mm, Rn); htf9a (Hs, Mm); H2A.1 (Hs, Mm); H2A.X (Mm); udg (Hs); p14ARF (Hs)
EPD-prom     905              842,799             EPD rel. 56, vertebrates
rand-0.25    2000             902,410             Random sequences with even nucleotide frequencies p[A] = p[T] = p[C] = p[G] = 0.25
rand-C       2000             902,410             Random sequences with nucleotide frequencies corresponding to that within C-prom: p[A] = 0.178468, p[T] = 0.192641, p[C] = 0.317873, p[G] = 0.311018
di-rand-C    2000             902,410             Random sequences with dinucleotide frequencies corresponding to that within C-prom
Ex2          4051             953,528             Nonredundant second exons of vertebrate genes

Data and Methods

Sets of sequences used for analysis of E2F sites

The sets of sequences that were used in the analysis are given in Table 9. The set designated C-prom includes promoters of genes that are known to be controlled by E2F transcription factors. Orthologs of these genes from other vertebrate species (if the promoter sequence is available in EMBL rel. 58) are included in this set as well. The set EPD-prom includes promoters of vertebrates from EPD rel. 56 (Eukaryotic Promoter Database). Additionally we use promoter sets of genes expressed in muscle tissue (M-prom) and in immune cells (T-prom) [31].

Random sequences were generated to estimate how often potential binding sites can appear just by chance. The set rand-0.25 contains computer-generated random sequences with equal frequencies for the four nucleotides. We used it for estimation of the basic false positive rate of the recognition method. Random sequences with nucleotide and dinucleotide frequencies corresponding to that within C-prom were generated as well and used where indicated (see Results) (sets: rand-C and di-rand-C).

Set Ex2 contains 4051 sequences of the second exons of vertebrate genes (based on EMBL rel. 48). This set is available on the TRANSFAC server and was designed previously [54]. Up to now, only very few functionally relevant binding sites have been revealed experimentally in second exons; therefore the potential binding sites found in these sequences may be considered a priori as false positives.

Weight matrix method for recognition of E2F sites

The weight matrix method is used to search for potential E2F sites in any nucleotide sequence. A sliding window moves through the sequence and a score value is calculated for every position of the window. For calculating the score q we used an approach implying an information vector [55,56] with certain modifications [31]:

$$
q = \frac{\sum_{i=1}^{L} I(i)\, f_{i,b_i} - \sum_{i=1}^{L} I(i)\, f_i^{\min}}{\sum_{i=1}^{L} I(i)\, f_i^{\max}} \qquad (2)
$$

Here, $b_1, b_2, \ldots, b_L$ are the nucleotides in the window of length L, equal to the matrix width; $f_{i,b_i}$ is the frequency of the nucleotide $b_i$ in the ith position of the weight matrix; $f_i^{\min}$ stands for the frequency of the nucleotide which is rarest in position i (in many cases, $f_i^{\min} = 0$); and $f_i^{\max}$ is the frequency of the most frequent nucleotide in position i of the matrix. The information vector gives a relative value of position conservation. It is calculated as follows:

$$
I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \ln(4 f_{i,B}), \qquad i = 1, 2, \ldots, L \qquad (3)
$$

If the score exceeds a predefined cut-off value qcut-off, a potential binding site is allocated in this position. Additionally, we defined nucleotides for which $f_{i,b_i} = 0$ as "forbidden" nucleotides in the ith position. If in a current window at least one nucleotide falls in a forbidden location of the matrix we set q = 0. This approach helps to exclude sites that contain such forbidden nucleotides in any of the most conserved positions.
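The scoring scheme described above can be illustrated with a short Python sketch. It takes q as the information-weighted sum for the window, minus the information-weighted sum of the rarest frequencies, divided by the information-weighted sum of the highest frequencies, and it sets q = 0 whenever the window contains a forbidden nucleotide. This is a minimal illustration under stated assumptions rather than the program used in this work: the function names and the three-position `TOY_MATRIX` are hypothetical, and terms with zero frequency are assumed to contribute nothing to I(i).

```python
import math


def information_vector(matrix):
    """I(i) = sum over B of f[i][B] * ln(4 * f[i][B]); zero-frequency terms are skipped."""
    return [sum(f * math.log(4.0 * f) for f in column.values() if f > 0.0)
            for column in matrix]


def score(window, matrix):
    """Return q for a window whose length equals the matrix width."""
    # A nucleotide with zero frequency at its position is "forbidden": q = 0.
    if any(column[base] == 0.0 for column, base in zip(matrix, window)):
        return 0.0
    info = information_vector(matrix)
    current = sum(i * column[base] for i, column, base in zip(info, matrix, window))
    lowest = sum(i * min(column.values()) for i, column in zip(info, matrix))
    highest = sum(i * max(column.values()) for i, column in zip(info, matrix))
    return (current - lowest) / highest


def scan(sequence, matrix, q_cutoff):
    """Slide the window along the sequence; report (position, q) where q > q_cutoff."""
    width = len(matrix)
    hits = []
    for pos in range(len(sequence) - width + 1):
        q = score(sequence[pos:pos + width], matrix)
        if q > q_cutoff:
            hits.append((pos, q))
    return hits


# Hypothetical 3-position frequency matrix (NOT the E2F matrix derived in this work).
TOY_MATRIX = [
    {"A": 0.0, "T": 1.0, "G": 0.0, "C": 0.0},
    {"A": 0.0, "T": 0.0, "G": 0.5, "C": 0.5},
    {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
]

if __name__ == "__main__":
    print(scan("AATGATT", TOY_MATRIX, q_cutoff=0.86))
```

Positions are 0-based in this sketch. The third column of the toy matrix is uniform, so its information value is zero and it does not influence the score, which shows how I(i) down-weights non-conserved positions.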

Checking the sequence motifs in the context of E2F sites

Let us consider a training set Y consisting of pY nucleotide sequences of equal length L taken from regulatory regions of different genes. Each sequence S contains in the positions [r, r + l] an experimentally proven binding site of length l for transcription factors of a certain type. Let us consider a negative control set N consisting of pN sequences that do not contain binding sites of the type in question. We use the Ex2 set to compile the negative control set N, by selecting such subsequences that match in positions [r, r + l] with the considered weight matrix (score value q > qcut-off). We consider such matrix matches as false positives of the weight matrix method. By using the false positives as the negative control set we can analyze additional contextual features that are not picked up by the positional weight matrix method. The most interesting are the contextual characteristics of the long sequences flanking the considered binding sites in regulatory regions of genes where these sites are known to be functional.

As the contextual features being analyzed, we consider the frequency of occurrence of short motifs λ = (α1α2 . . . αk) of length k, where the αi are letters from the ambiguity code, αi ∈ {A,T,G,C,W,S,R,Y,M,K,B,V,H,D,N}, in which the additional letters correspond to several alternative nucleotides (W = A/T; R = A/G; M = A/C; K = T/G; Y = T/C; S = G/C; B = T/G/C; V = A/G/C; H = A/T/C; D = A/T/G; N = A/T/G/C) (see [57]). Each motif describes a set of similar oligonucleotides. For a motif λ we compute its mean frequency in every window w = [t1, t2] (0 < t1 < t2 < L − k + 1) of a sequence S:

$$
f(\lambda, w, S) = \frac{N(\lambda, w, S)}{t_2 - t_1} \qquad (4)
$$

where N(λ,w,S) is the number of occurrences of the motif λ that start at any position in window w of sequence S.
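As an illustration of equation (4), the following Python sketch counts the occurrences of an ambiguity-code motif that start inside a window and divides by the window length. It is a sketch under assumptions, not the SITEVIDEO implementation: the names `IUPAC`, `matches` and `motif_frequency` are hypothetical, positions are 0-based, and the window is treated as half-open, so motif starts t1, ..., t2 − 1 are counted.

```python
# Ambiguity code as listed in the text: each letter maps to the nucleotides it allows.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "W": "AT", "R": "AG", "M": "AC", "K": "TG", "Y": "TC", "S": "GC",
    "B": "TGC", "V": "AGC", "H": "ATC", "D": "ATG", "N": "ATGC",
}


def matches(motif, oligo):
    """True if the oligonucleotide is one of those described by the motif."""
    return len(motif) == len(oligo) and all(
        base in IUPAC[letter] for letter, base in zip(motif, oligo))


def motif_frequency(motif, window, sequence):
    """Mean frequency of the motif among start positions t1 <= t < t2 (equation 4)."""
    t1, t2 = window
    k = len(motif)
    count = sum(matches(motif, sequence[t:t + k]) for t in range(t1, t2))
    return count / (t2 - t1)


if __name__ == "__main__":
    seq = "ACGCGCGAAA"  # made-up sequence for illustration only
    print(motif_frequency("SCG", (0, 8), seq))
    print(motif_frequency("SCG", (2, 6), seq))
```

In the made-up sequence the motif SCG matches the oligonucleotide GCG at start positions 2 and 4, so the two calls give 2/8 and 2/4. Start positions too close to the sequence end yield a truncated oligonucleotide, which `matches` rejects.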

We developed a method that permits us to find such motifs λ that are characterized by a significant difference of their frequencies in specific regions of the sequences from the sets Y and N. The motifs found are then used for creating a context analyzer that is able to perform an additional check of the potential sites revealed by the weight matrix method. We have applied the SITEVIDEO system [65] for finding such motifs. It performs an exhaustive search through the space of all possible motifs λ and windows w. The corresponding f(λ,w,S) values are calculated for all sequences SY in the training set Y and for all sequences SN from the control set N. The difference between these two distributions of frequencies fY and fN is tested for significance using a number of statistical criteria [57] integrated into a generalized criterion, the utility value U [58,59] (−1 < U < 1). The utility value U(λ,w) is calculated for every combination (λ,w). The best (λ,w) combinations that are characterized by the highest U values (U(λ,w) > 0.7) are selected for the next step of analysis.

At the next step we analyze the correlation between the characteristics ((λ,w) combinations) revealed at the previous step. For every two characteristics c1 = (λ1,w1) and c2 = (λ2,w2) we compute the linear correlation coefficient r(c1,c2) between the mean frequencies f(λ1,w1,S) and f(λ2,w2,S) for all sequences from the set Y. Using this information we select the best set of m uncorrelated characteristics (the full description of the algorithm, which is based on the search for maximal weighted cliques in a correlation graph, is given in [60]). The uncorrelated characteristics found describe a set of motifs that are preferably localized in certain regions relative to the core of the site. Such a set characterizes the specific context of the regulatory regions that contain the functional sites of the considered type. We use the selected set of motifs for additional context analysis of the potential sites revealed by the weight matrix method.
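The idea of keeping only mutually uncorrelated characteristics can be sketched in Python as follows. Note that this is a deliberately simplified greedy procedure, not the maximal weighted clique search referred to above: characteristics are visited in order of decreasing utility and kept only if their correlation with every characteristic already kept stays below a threshold. The threshold `r_max`, the function names and the toy data are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Linear (Pearson) correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant characteristic carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def select_uncorrelated(frequencies, utility, r_max=0.5):
    """Greedy stand-in for the clique search: keep high-utility, weakly correlated items."""
    chosen = []
    for name in sorted(frequencies, key=lambda c: utility[c], reverse=True):
        if all(abs(pearson(frequencies[name], frequencies[kept])) < r_max
               for kept in chosen):
            chosen.append(name)
    return chosen


if __name__ == "__main__":
    # Hypothetical mean frequencies of three characteristics over four training sequences.
    freqs = {
        "c1": [0.1, 0.2, 0.3, 0.4],
        "c2": [0.2, 0.4, 0.6, 0.8],  # proportional to c1, hence fully correlated
        "c3": [0.3, 0.1, 0.1, 0.3],
    }
    util = {"c1": 0.9, "c2": 0.8, "c3": 0.75}
    print(select_uncorrelated(freqs, util))
```

With these toy values c2 is discarded because it is perfectly correlated with c1, while c3 is retained. A greedy pass does not guarantee the optimal set that a clique-based search would find; it only conveys the selection criterion.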

The context analyzer is developed by means of linear discriminant analysis (STATISTICA 5.0). The uncorrelated characteristics selected at the previous step, c1, c2, . . . , cm, are used to construct a linear classification function discriminating between the sets Y and N.

Formaldehyde crosslinking and immunoprecipitation of chromatin

Formaldehyde (Fisher Scientific) was added directly to cell culture media at a final concentration of 1 % to non-adherent log phase HeLa cells. Fixation proceeded at 22 °C for ten minutes and was stopped by the addition of glycine to a final concentration of 0.125 M. HeLa cells were collected by centrifugation and rinsed in cold PBS. Cell pellets were resuspended in swelling buffer (10 mM potassium acetate, 15 mM magnesium acetate, 0.1 M Tris (pH 7.6), 0.5 mM PMSF, and 100 ng/ml leupeptin and aprotinin), incubated on ice for 20 minutes, and then dounced. The nuclei were collected by microcentrifugation at 5000 rpm and then resuspended in sonication buffer (1 % SDS, 10 mM EDTA, 50 mM Tris-HCl (pH 8.1), 0.5 mM PMSF, and 100 ng/ml leupeptin and aprotinin) and incubated on ice for 20 minutes. Samples were sonicated on ice with an Ultrasonics sonicator at full power for four 30 second pulses to an average length of 500-1000 bp and then microfuged at 14,000 rpm. The chromatin solution was precleared with the addition of Staphylococcus aureus cells for 15 minutes at 4 °C. Prior to use, S. aureus cells were blocked with 1 mg/ml sheared herring sperm DNA and 1 mg/ml BSA for at least four hours at 4 °C. Precleared chromatin from 2.5 × 10^7 cells was incubated with 1 μg of affinity-purified rabbit polyclonal antibody or no antibody and rotated at 4 °C for approximately 12-16 hours. Antibodies used include: E2F1 no. 05-379 (UBI); E2F2 C-20 no. SC-633X (Santa Cruz); E2F3 C-18 no. SC-878X (Santa Cruz); E2F4 C-20 no. SC-866X (Santa Cruz); E2F5 E-19 no. SC-999X (Santa Cruz); E2F6 E-20 no. SC-8366 (Santa Cruz). Immunoprecipitation, washing and elution of immune complexes were carried out as previously described [39]. Prior to the first wash, 20 % of the supernatant from the no primary antibody reaction for each time point was saved as total input chromatin and was processed with the eluted immunoprecipitates beginning at the crosslink reversal step. Crosslinks were reversed by addition of NaCl to a final concentration of 200 mM and RNA was removed by addition of 10 μg RNase A per sample followed by incubation at 65 °C for five hours. Samples were then precipitated at −20 °C overnight by the addition of two volumes of EtOH and then pelleted by microcentrifugation. Samples were resuspended in 100 μl TE (pH 7.5), 25 μl of 5× proteinase K buffer (1.25 % SDS, 50 mM Tris (pH 7.5), 25 mM EDTA), and 1.5 μl of proteinase K (Boehringer Mannheim) and incubated at 45 °C for two hours. Samples were extracted with phenol:chloroform:isoamyl alcohol (25:24:1, by vol.) and then precipitated with 0.1 volume of 3 M NaOAc (pH 5.3), 5 μg tRNA, and two volumes of EtOH at −20 °C overnight. Pellets were collected by microcentrifugation, resuspended in 30 μl water, and analyzed using PCR.

Table 10. PCR primers used to analyze the E2F target genes

Name       Primer no. 1                 Primer no. 2                      Product size (bp)
hdhfr      ttctgctgtaacgagcgggctcgga    ctacaagttagagaaacagcgttactcgaa    463
hdhfr3′    ctgatgtccaggaggagaaagg       agcccgacaatgtcaaggactg            349
hE2F1      cgccgggtccggcgcgttaaag       catggcggcaggcctcggcga             161
hjunB      tataaaggcgtgtggctcag         gggctgttccattttagtgc              341
hfos       aggaactgcgaaatgctcac         gaagcagagctgggtaggag              302
hARF       tgtcaggtgacggatgtagc         caagatctcggaacggctct              262
hMCM4      ccaccacctcccgtccttaa         aatcacagcggcgctcgtac              350
hMCM5      tccttcccagccagaagttt         tcccactagcctcacctctg              300
hTGFB      gccccagagtctgagacgag         cgaggtctggggaaaagtct              333
hvhl       gcctccgttacaacagccta         tcttcagggccgtactcttc              289
mnuc       tcatagccaagaagccgttc         agtgtgcgggtcctttgaat              300
hnuc       gggtctctgaggaactccaa         ccagtaacgcctgtggaaag              367
chnuc      cagtggctctaggccctcaa         tagctctacgcaaggcgg                341

PCR reactions contained 2 μl of immunoprecipitate or 2 μl of a 1:100 dilution of the total sample, 50 ng of each primer, 0.88 mM MgCl2, 2 mM each dATP, dCTP, dGTP, dTTP, 1× Thermophilic Buffer (Promega), and 1.25 units Taq DNA polymerase (Promega) in a total volume of 20 μl. Following 32-35 cycles of amplification, PCR products were run on a 1.5 % (w/v) agarose gel and analyzed by EtBr staining. PCR primers used to analyze target genes are shown in Table 10. An annealing temperature of 60 °C was used for all primer pairs except for the hE2F1 primers, which required an annealing temperature of 68 °C, and the nucleolin primer sets, which all required an annealing temperature of 57 °C.

Acknowledgments

The authors are indebted to Aida G. Romashchenko (Institute of Cytology and Genetics, Novosibirsk) for fruitful discussion of the results and to Vladimir Filonenko for critical reading of the manuscript. Part of this work was supported by the Siberian Branch of the Russian Academy of Sciences, by a grant from the European Commission (BIO4-95-0226), and by a grant from the Volkswagen-Stiftung (I/75941). M.Q.Z. is supported partly by grants GM60513 and HG01696 from NIH. P.J.F. is supported partly by grant CA45240 from NIH.

References

1. Farnham, P. J., Slansky, J. E. & Kollmar, R. (1993). The role of E2F in the mammalian cell cycle. Biochim. Biophys. Acta, 1155, 125-131.

2. DeGregori, J., Kowalik, T. & Nevins, J. R. (1995). Cellular targets for activation by the E2F1 transcription factor include DNA synthesis- and G1/S-regulatory genes. Mol. Cell. Biol. 15, 4215-4224.

3. Slansky, J. E. & Farnham, P. J. (1996). Transcriptional regulation of the dihydrofolate reductase gene. Bioessays, 18, 55-62.

4. Dyson, N. (1998). The regulation of E2F by pRB-family proteins. Genes Dev. 12, 2245-2262.

5. Helin, K. (1998). Regulation of cell proliferation by the E2F transcription factors. Curr. Opin. Genet. Dev. 8, 28-35.

6. Lavia, P. & Jansen-Durr, P. (1999). E2F target genes and cell-cycle checkpoint control. Bioessays, 21, 221-230.

7. Johnson, D. G., Ohtani, K. & Nevins, J. R. (1994). Autoregulatory control of E2F1 expression in response to positive and negative regulators of cell cycle progression. Genes Dev. 8, 1514-1525.

8. Neuman, E., Flemington, E. K., Sellers, W. R. & Kaelin, W. G. (1994). Transcription of the E2F-1 gene is rendered cell cycle dependent by E2F DNA-binding sites within its promoter. Mol. Cell. Biol. 14, 6607-6615.

9. Cartwright, P., Muller, H., Wagener, C., Holm, K. & Helin, K. (1998). E2F-6: a novel member of the E2F family is an inhibitor of E2F-dependent transcription. Oncogene, 17, 611-623.

10. Ivey-Hoyle, M., Conroy, R., Huber, H. E., Goodhart, P. J., Oliff, A. & Heimbrook, D. C. (1993). Cloning and characterization of E2F-2, a novel protein with the biochemical properties of transcription factor E2F. Mol. Cell. Biol. 13, 7802-7812.

11. Zheng, N., Fraenkel, E., Pabo, C. O. & Pavletich, N. P. (1999). Structural basis of DNA recognition by the heterodimeric cell cycle transcription factor E2F-DP. Genes Dev. 13, 666-674.

12. Hagemeier, C., Cook, A. & Kouzarides, T. (1993). The retinoblastoma protein binds E2F residues required for activation in vivo and TBP binding in vitro. Nucl. Acids Res. 21, 4998-5004.

13. Pearson, A. & Greenblatt, J. (1997). Modular organization of the E2F1 activation domain and its interaction with general transcription factors TBP and TFIIH. Oncogene, 15, 2643-2658.


14. Fry, C. J., Pearson, A., Malinowski, E., Bartley, S. M.,Greenblatt, J. & Farnham, P. J. (1999a). Activation ofthe murine dihydrofolate reductase promoter byE2F1. A requirement for CBP recruitment. J. Biol.Chem. 274, 15883-15891.

15. Mayol, X. & Grana, X. (1997). pRB, p107 and p130as transcriptional regulators: role in cell growth anddifferentiation. Prog. Cell Cycle Res. 3, 157-169.

16. Sherr, C. J. (1993). Mammalian G1 cyclins. Cell, 73,1059-1065.

17. Lees, E. (1995). Cyclin dependent kinase regulation.Curr. Opin. Cell. Biol. 7, 773-780.

18. Morgan, D. O. (1997). Cyclin-dependent kinases:engines, clocks, and microprocessors. Annu. Rev.Cell. Dev. Biol. 13, 261-291.

19. Roberts, J. M. (1999). Evolving ideas about cyclins.Cell, 98, 129-132.

20. Dynlacht, B. D., Flores, O., Lees, J. A. & Harlow, E.(1994). Differential regulation of E2F transactivationby cyclin/cdk2 complexes. Genes Dev. 8, 1772-1786.

21. Kitagawa, M., Higashi, H., Suzuki-Takahashi, I.,Segawa, K., Hanks, S. K., Taya, Y., Nishimura, S. &Okuyama, A. (1995). Phosphorylation of E2F-1 bycyclin A-cdk2. Oncogene, 19, 229-236.

22. Wells, J., Boyd, K. E., Fry, C. J., Bartley, S. M. &Farnham, P. J. (2000). Target gene speci®city of E2Fand pocket protein family members in living cells.Mol. Cell. Biol. 20, 5797-5807.

23. Takahashi, Y., Rayman, J. B. & Dynlacht, B. D.(2000). Analysis of promoter binding by the E2F andpRB families in vivo: distinct E2F proteins mediateactivation and repression. Genes Dev. 14, 804-816.

24. Fry, C. J. & Farnham, P. J. (1999b). Context-depen-dent transcriptional regulation. J. Biol. Chem. 274,29583-29586.

25. Zhang, M. Q. (2000). Computational methods forpromoter recognition. In Curr. Top. Comput. Biol(Jiang, T., Smith, T., Xu, Y. & Zhang, M. Q., eds),chapter 5, MIT Press, Cambridge.

26. Fickett, J. W. (1996a). Coordinate positioning ofMEF2 and myogenin binding sites. Gene, 172, GC19-GC32.

27. Fickett, J. W. (1996b). Quantitative discrimination ofMEF2 sites. Mol. Cell. Biol. 16, 437-441.

28. Wasserman, W. W. & Fickett, J. W. (1998). Identi®-cation of regulatory regions which confer muscle-speci®c gene expression. J. Mol. Biol. 278, 167-181.

29. Frech, K., Quandt, K. & Werner, T. (1998). Muscleactin genes: a ®rst step towards computationalclassi®cation of tissue-speci®c promoters. In SilicoBiol. 01, 0005.

30. Tronche, F., Ringeisen, F., Blumenfeld, M., Yaniv, M.& Pontoglio, M. (1997). Analysis of the distributionof binding sites for a tissue-speci®c transcriptionfactor in the vertebrate genome. J. Mol. Biol. 266,231-245.

31. Kel, A., Kel-Margoulis, O., Babenko, V. &Wingender, E. (1999a). Recognition of NFATp/AP-1composite elements within genes induced upon theactivation of immune cells. J. Mol. Biol. 288, 353-376.

32. Roulet, E., Bucher, P., Schneider, R., Wingender, E.,Dusserre, Y., Werner, T. & Mermod, N. (2000).Experimental analysis and computer prediction ofCTF/NFI transcription factor DNA binding sites.J. Mol. Biol. 297, 833-848.

33. Nevins, J. R. (1992). E2F: a link between the Rbtumor supressor protein and viral oncoproteins.Science, 258, 424-429.

34. Wingender, E., Chen, X., Hehl, R., Karas, H.,Liebich, I., Matys, V., Meinhardt, T., Pruss, M.,Reuter, I. & Schacherer, F. (2000). TRANSFAC: anintegrated system for gene expression regulation.Nucl. Acids Res. 28, 316-319.

35. Cavener, D. R. (1987). Comparison of the consensussequence ¯anking translational start sites in Droso-phila and vertebrates. Nucl. Acids Res. 15, 1353-1361.

36. Kim, J. M., Sato, N., Yamada, M., Arai, K.-I. &Masai, H. (1998). Growth regulation of theexpression of mouse cDNA and gene encoding aserine/threonine kinase related to Saccharomycescerevisiae CDC7 essential for G1/S transition. J. Biol.Chem, 273, 23248-23257.

37. Sears, R., Ohtani, K. & Nevins, J. R. (1997). Identi®-cation of positively and negatively acting elementsregulating expression of the E2F2 gene in responseto cell growth signals. Mol. Cell. Biol. 17, 5227-5235.

38. Duret, L. & Bucher, P. (1997). Searching for regulat-ory elements in human noncoding sequences. Curr.Opin. Struct. Biol. 7, 399-406.

39. Boyd, K. E. & Farnham, P. J. (1997). Myc vs USF:discrimination at the cad gene is determined by corepromoter elements. Mol. Cell. Biol. 17, 2529-2537.

40. Blake, M. C. & Azizkhan, J. C. (1989). Transcrip-tional factor E2F is required for ef®cient of thehamster dihydrofolate reductase gene in vitro andin vivo. Mol. Cell. Biol. 6, 4994-5002.

41. Wade, M., Blake, M. C., Jambou, R. C., Helin, K.,Harlow, E. & Azizkhan, J. C. (1995). An invertedrepeat motif stabilizes binding of E2F and enhancestranscription of the dihydrofolate reductase gene.J. Biol. Chem. 270, 9783-9791.

42. Fry, C. J., Slansky, J. E. & Farnham, P. J. (1997).Position-dependent transcriptional regulation of themurine dihydrofolate reductase promoter by theE2F transactivation domain. Mol. Cell. Biol. 17, 1966-1976.

43. Jensen, D. E., Black, A. R., Swick, A. G. & Azizkhan,J. C. (1997). Distinct roles for Sp1 and E2F sites inthe growth/cell cycle regulation of the DHFR pro-moter. J. Cell. Biochem. 67, 24-31.

44. Kel, O. V. & Kel, A. E. (1997). Complex gene net-work in cell cycle regulation: central role of the E2Ffamily of transcription factors. Mol. Biol. (Moscow),31, 548-561.

45. Lucibello, F. C., Truss, M., Zwicker, J., Ehlert, F., Beato, M. & Muller, R. (1995). Periodic cdc25C transcription is mediated by a novel cell cycle-regulated repression element (CDE). EMBO J. 14, 132-142.

46. Zwicker, J. & Muller, R. (1997). Cell-cycle regulation of gene expression by transcriptional repression. Trends Genet. 13, 3-6.

47. Iavarone, A. & Massague, J. (1999). E2F and histone deacetylase mediate transforming growth factor beta repression of cdc25A during keratinocyte cell cycle arrest. Mol. Cell. Biol. 19, 916-922.

48. Albanese, C., Johnson, J., Watanabe, G., Eklund, N., Vu, D., Arnold, A. & Pestell, R. G. (1995). Transforming p21ras mutants and c-Ets-2 activate the cyclin D1 promoter through distinguishable regions. J. Biol. Chem. 270, 23589-23597.

49. Sirri, V., Roussel, P., Gendron, M. C. & Hernandez-Verdun, D. (1997). Amount of the two major Ag-NOR proteins, nucleolin, and protein B23 is cell-cycle dependent. Cytometry, 28, 147-156.

50. van Ginkel, P. R., Hsiao, K. M., Schjerven, H. & Farnham, P. J. (1997). E2F-mediated growth regulation requires transcription factor cooperation. J. Biol. Chem. 272, 18367-18374.

51. Ross, J. F., Liu, X. & Dynlacht, B. D. (1999). Mechanism of transcriptional repression of E2F by the retinoblastoma tumor suppressor protein. Mol. Cell, 3, 195-205.

52. Zhang, M. Q. (1999a). Promoter analysis of co-regulated genes in the yeast genome. Comput. Chem. 23, 233-250.

53. Zhang, M. Q. (1999b). Large scale gene expression data analysis: a new challenge to computational biologists. Genome Res. 9, 681-688.

54. Pickert, L., Reuter, I., Klawonn, F. & Wingender, E. (1998). Transcription regulatory region analysis using signal detection and fuzzy clustering. Bioinformatics, 14, 244-251.

55. Schneider, T. D., Stormo, G. D., Gold, L. & Ehrenfeucht, A. (1986). Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188, 415-431.

56. Quandt, K., Frech, K., Karas, H., Wingender, E. & Werner, T. (1995). MatInd and MatInspector - new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878-4884.

57. Kel, A. E., Ponomarenko, M. P., Likhachev, E. A., Orlov, Yu. L., Ischenko, I. V., Milanesi, L. & Kolchanov, N. A. (1993). SITEVIDEO: a computer system for functional site analysis and recognition. Investigation of the human splice sites. Comput. Appl. Biosci. 9, 617-627.

58. Ponomarenko, M. P., Orlov, Y. L. & Kolchanov, N. A. (1991). Computer Analysis of Structure, Function and Evolution of Genetic Macromolecules, pp. 197-220, ICG, Novosibirsk.

59. Fishburn, P. C. (1970). Utility Theory for Decision Making, John Wiley and Sons Inc., New York, London, Sydney, Toronto.

60. Kel, A., Kel-Margoulis, O. & Wingender, E. (1999b). E2F composite units in promoters of cell cycle genes. In Proceedings of the German Conference on Bioinformatics GCB'99, October 4-6, 1999, pp. 148-154, Hannover, Germany.

Edited by J. Karn

(Received 28 November 2000; received in revised form 27 March 2001; accepted 27 March 2001)

http://www.academicpress.com/jmb

Supplementary material comprising a Table of the target genes is available at IDEAL