

Pathway to Functional Studies: Pipeline Linking Phylogenetic Footprinting and Transcription-Factor Binding Analysis

Nagesh Chakka Jill E. Gready

Computational Proteomics Group, John Curtin School of Medical Research, Australian National University, Canberra, Australia. Email: [email protected]; [email protected]

Abstract

Identification of transcription-factor binding sites is a critical first step in studying transcriptional regulation of genes. The comparative genomics method of phylogenetic footprinting is based on identifying sequence elements that are conserved across multiple genomes and, thus, likely to be functional. We have developed a systematic high-throughput screening pipeline that first searches for conserved motifs using two different phylogenetic footprinting methods (motif-discovery and alignment-based), and then rapidly evaluates the motifs as potential transcription-factor binding sites. The results are displayed in an interactive graphical user interface, FactorScan, which integrates three separate complementary databases (conserved-sequence motifs, transcription-factor binding site motifs, TRANSFAC). We applied this pipeline for transcription-factor binding site analysis to the orthologous gene regions of prion-protein family genes from vertebrate lineages, taking account of the gene annotations.

Keywords: transcription factors; transcription factor binding sites; phylogenetic footprinting; TRANSFAC; MATCH; comparative genomics; sequence motifs.

1 Introduction

1.1 Motivation for the work

Availability of draft sequence for newly sequenced genomes of model organisms offers huge opportunities for characterizing functional elements using comparative genomic approaches. One key class of such functional elements is sites for binding proteins termed “transcription factors” (TFs), which play a central role in RNA polymerase II mediated transcriptional regulation of gene expression. TFs bind to specific short DNA sequence motifs known as TF binding sites (TFBSs) or cis-regulatory elements (CREs). Prediction of TFs which may bind to a particular gene can rapidly provide initial insights into potential functions of the target genes. This is based on known modes of action of the TFs in regulating other, better characterized genes. Such initial predictions can greatly assist in designing focused confirmatory experiments. As TFBSs are under greater selective pressure than other non-protein-coding DNA, the reliability of predicting them is greatly improved by comparative genomics to filter out noise from genetic drift. Identifying such conserved sequence elements in non-coding regions of homologous genes from phylogenetic comparison is called ‘phylogenetic footprinting’ (PF) (Tagle, Koop, Goodman, Slightom, Hess & Jones 1988). While there are several online resources which can perform PF, none provides the flexibility to combine the conserved sequence-motif data with TFBS analysis and, at the same time, to customize the searches based on gene annotation information. To address this deficiency, we developed a two-step procedure which combines PF with TFBS analysis. This automated pipeline enables us to carry out rapid screening and evaluation of the phylogenetically conserved motifs for potential TF-binding affinity. To perform the most comprehensive searches, the TRANSFAC professional database (version 9.2) was included in the pipeline. We used this strategy to identify potential TFBSs in prion protein and its paralogous gene, doppel, encoded by the PRNP and PRND genes, respectively. We gleaned some initial insights into the functions of these genes, which are not well understood, from the TFs predicted to be involved in regulating their expression.

Copyright © 2006, Australian Computer Society, Inc. This paper appeared at The 2006 Workshop on Intelligent Systems for Bioinformatics (WISB2006), Hobart, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 73. Mikael Bodén and Timothy L. Bailey, Ed. Reproduction for academic, not-for profit purposes permitted provided this text is included.

1.2 Advantages of our approach

As TFBSs are short DNA motifs of 5-15 bp, analyzing a single sequence would lead to a very high percentage of false positive hits. PF offers a solution to this problem by identifying such sequence elements that are conserved among genes that are either orthologous or co-expressed. Several programs implement PF but only a few combine it with TFBS analysis, for example, rVISTA (Loots, Ovcharenko, Pachter, Dubchak & Rubin 2002) and ConSite (Sandelin, Wasserman & Lenhard 2004). However, both programs allow only pairwise comparison; there are no programs which perform this analysis on multiple sequences. Another restriction with rVISTA and ConSite is that they use different databases of position weight matrices (PWMs), TRANSFAC public and JASPAR respectively, neither of which is as comprehensive as TRANSFAC professional. Finally, rVISTA and ConSite do not provide a facility to customize display of the results to make the maximum use of the output, for example, display of clusters of TFs. Our approach overcomes these restrictions by providing various options for customizing searches for both pairwise and multiple sequences, for incorporating flexibility in visualizing the output, and for using databases of PWMs of choice.


2 Pipeline for Phylogenetic Footprinting (PF) Analysis

2.1 Rationale for selecting algorithms

For alignment-based identification of conserved elements, we used AVID (Bray, Dubchak & Pachter 2003) and LAGAN (Brudno, Do, Cooper, Kim, Davydov, Program, Green, Sidow & Batzoglou 2003), both of which are sensitive and widely used for genome-wide alignment problems. rVISTA uses both alignment programs and LAGAN is being incorporated into the ConSite alignment step. For identification of conserved elements from multiple sequences, we used FootPrinter (Blanchette & Tompa 2003), which takes phylogeny into account and, hence, weighs the sequences based on their evolutionary relationship and implements most of the concepts of PF, in contrast to other motif-discovery methods such as MEME (Bailey & Elkan 1994). BioProspector (Liu, Brutlag & Liu 2001) identifies motifs that are overrepresented in the input sequences and, hence, is a different approach to handling this problem.

2.2 Annotated gene sequence database

A database of annotated gene sequences was created by mapping the (PRNP and PRND) cDNA sequence obtained from either experiments or public databases onto the genome sequence obtained from various genome sequencing projects. The EMBOSS application (Rice, Longden & Bleasby 2000) “est2genome” was used to annotate the exon-intron boundaries and transcription start site, while “getorf” was used for detecting the coding regions, which were then masked. Genomic sequence covering 2 kb upstream of the transcription start site, the whole of the exon-intron region, and 2 kb downstream from the transcription stop site was included in the PF analysis. To improve the signal-to-noise ratio, we selected representative species for which genomic data for PRNP and PRND were available. These comprise several eutherian mammalian species, and all those available for lower vertebrates: the marsupial mammals Monodelphis domestica (South American opossum) and Tammar wallaby, chicken, and the frog Xenopus tropicalis. Indicative sequence lengths are shown in the scale bar of Figure 5(b) for the complete genomic regions of mouse and human PRNP and PRND; there are significant differences in the lengths of the intronic and intergenic regions of these genes, both among eutherian mammals and among the vertebrate lineages, due to the high frequency of insertion of transposable elements (Premzl, Gready, Jermiin, Simonic & Graves 2004).

2.3 Conserved-sequence motif detection

Conserved sequence motifs were identified by several PF methods which we categorize into two groups, alignment-based and motif-discovery-based. Separate pipelines for each, alignment-based (Fig. 1(a)) and motif-discovery-based (Fig. 1(b)), were developed.

2.3.1 Alignment-based method.

To perform end-to-end comparisons, the global pairwise-alignment methods AVID (Bray et al. 2003) and LAGAN (Brudno et al. 2003) were used independently to generate pairwise alignments. The AVID alignment method is fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabase scale. AVID performs the pairwise alignment of two input sequences; the output comprises the alignment and additional information. The alignment files were used for downstream processing. LAGAN is a method for rapid global alignment of two homologous sequences. The algorithm is based on three main steps (Brudno et al. 2003): (1) generation of pairwise local alignments, (2) construction of a rough global map by linking a subset of local alignments, and (3) computation of the final global alignment. LAGAN alignments were generated using the translate anchor option, and binary output format was selected, which enables downstream processing. Both the AVID and LAGAN alignments for all possible pairwise combinations (Fig. 2) of sequences in the annotated gene sequence database were performed using the Perl script “doAlign.pl”.
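As an illustration of how the set of pairwise alignment jobs can be enumerated, the following minimal Python sketch lists every unordered species pair for both aligners. It is not the authors' “doAlign.pl”; the function name `pairwise_jobs` is hypothetical, the species abbreviations are those of the Figure 2 caption, and the sketch assumes that every unordered pair is compared.

```python
from itertools import combinations

# Species abbreviations as given in the Figure 2 caption.
SPECIES = ["H", "M", "R", "D", "C", "S", "Md", "Tw", "Ch", "X"]


def pairwise_jobs(species, aligners=("AVID", "LAGAN")):
    """Return (aligner, species_a, species_b) for every unordered species pair.

    Hypothetical helper: it only enumerates the jobs and does not run
    either alignment program.
    """
    return [
        (aligner, a, b)
        for a, b in combinations(species, 2)
        for aligner in aligners
    ]


jobs = pairwise_jobs(SPECIES)
print(len(jobs), "alignment jobs, e.g.", jobs[0])
```

With ten species this yields 45 unordered pairs per aligner, which is why scripting the alignment step is worthwhile.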

Figure 1: Phylogenetic footprinting pipeline using AVID/LAGAN alignment methods and annotation with VISTA (a). Pipeline summarizing steps in phylogenetic footprinting using FootPrinter (b). The end result of both analyses is a database of the conserved sequence motifs.

Annotation with VISTA. Global pairwise alignments generated by AVID and LAGAN were annotated using VISTA (Frazer, Pachter, Poliakov, Rubin & Dubchak 2004). VISTA can be configured by changing several parameters (e.g. percentage identity and length), which can be defined in the input Plotfile. To facilitate trialing of several combinations of percent identity (range: 75% to 100%) and length (range: 8 to 15 bp) values, a Perl script “runVista.pl” was developed to generate corresponding Plotfiles for percent identity and length values passed as command line arguments. VISTA generates three output files: VISTA plot, alignment, and region file (Fig. 1(a)).
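To make the roles of the percent-identity and length cutoffs concrete, the sketch below scans a gapped pairwise alignment with a fixed-length window and merges the windows that meet the identity cutoff. This is a simplified stand-in under stated assumptions, not VISTA's own procedure; the function name `conserved_regions` and the toy alignment are hypothetical.

```python
def conserved_regions(aln_a, aln_b, min_identity=0.75, window=8):
    """Return merged (start, end) alignment-column intervals (0-based, end
    exclusive) covered by windows whose identity is >= min_identity.

    Assumption: a gap ('-') in either sequence counts as a mismatch.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    match = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    for start in range(len(match) - window + 1):
        end = start + window
        if sum(match[start:end]) / window >= min_identity:
            if regions and start <= regions[-1][1]:
                # Overlapping or adjacent window: extend the previous region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: the first eight columns are identical, the last four differ.
strict = conserved_regions("ACGTACGTAAAA", "ACGTACGTTTTT", 1.0, 8)
relaxed = conserved_regions("ACGTACGTAAAA", "ACGTACGTTTTT", 0.75, 8)
print(strict, relaxed)
```

Lowering the identity cutoff lengthens the reported region because windows that overlap a few mismatched columns now pass, which illustrates why several cutoff combinations were trialed.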


Figure 2: Summary of pairwise sequence comparisons (all grey cells) performed with AVID and LAGAN between the species on the X and Y axes. H, human; M, mouse; R, rat; D, dog; C, cow; S, sheep; Md, Monodelphis; Tw, Tammar wallaby; Ch, Chicken; X, Xenopus.

The VISTA plot contains a graphical representation of the conserved regions. The region file contains details of those regions which satisfied the user-specified length and percentage cutoffs. This file was processed using a Perl script “extractseq.pl”. Based on the start and end numbers of the conserved regions, the subsequences were extracted using the EMBOSS application “extractseq” integrated into “extractseq.pl”. This process was repeated for all the region files obtained for the various combinations of alignments. Finally, the Perl script generates a single multi-FASTA file of all the conserved subsequences, which are stored in a conserved sequence database. The sequence identifier for each motif contains information about the pair involved in the alignment, its position in both the reference sequences and the region to which it belongs. This enables the exact position of the conserved sequence to be tracked for further analysis. For those motifs which are shorter than 15 bp, continuous stretches of five “N” were added to both the 5’ and 3’ ends of the motif to facilitate the TFBS analysis.
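The padding and multi-FASTA steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' “extractseq.pl”; the function names and the identifier layout in the example are hypothetical, since the exact identifier format is not given.

```python
def pad_motif(seq, min_len=15, flank=5):
    """Add `flank` N characters to both ends of motifs shorter than min_len."""
    if len(seq) < min_len:
        return "N" * flank + seq + "N" * flank
    return seq


def to_multifasta(records):
    """Format (identifier, sequence) pairs as one multi-FASTA string."""
    return "".join(
        f">{identifier}\n{pad_motif(seq)}\n" for identifier, seq in records
    )


# Hypothetical identifiers encoding the aligned pair, positions, and region.
records = [
    ("human_mouse|H:1203-1210|M:1178-1185|upstream", "GGGGCGGG"),
    ("human_mouse|H:2410-2424|M:2391-2405|intron1", "ACGTTGCAACGTTGC"),
]
fasta = to_multifasta(records)
print(fasta)
```

Only the 8 bp motif is padded; the 15 bp motif is written unchanged.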

2.3.2 Motif-discovery approach.

FootPrinter (Blanchette & Tompa 2003) implements a motif-discovery method to identify conserved motifs in a collection of homologous sequences. The algorithm identifies each set of motifs of user-defined size, one from each input sequence, that have a parsimony score specified by the user. This process uses phylogenetic tree information. The input for FootPrinter is the file containing sequences from the annotated gene sequence database and a tree file (Fig. 1(b)). The program generates several output files with different file formats. For programmatic processing, html output format was selected. The input sequences were divided into several datasets: intra-eutherian mammals, and others comprising sets with eutherian mammals and sequences from one or more of the other lineages (marsupial, avian, amphibian). The output motif file (motif.html) contains the information about the motif and its position. A comprehensive search was performed using different FootPrinter options (subregion: 1000 to 3000 bp; motif size: 6 to 10 bp; parsimony score: 0 to 2). Using a Perl script “motifextract.pl”, the “motif.html” output file was converted to a single multi-FASTA file. Each analysis was performed twice using the upstream and downstream (FootPrinter: sequence type) options. The multi-FASTA files from both analyses were combined using a Perl script, “compileTFBS.pl”, to produce a non-redundant single multi-FASTA file. These multi-FASTA files relating to different subregion sizes were stored in a conserved-sequence database. Each sequence-motif position was registered in the sequence identifier.
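Combining the upstream and downstream multi-FASTA files into one non-redundant file can be sketched as follows. This is a minimal illustration, not the authors' “compileTFBS.pl”; it assumes that two records are redundant when both identifier and sequence are identical, and the identifiers in the example are hypothetical.

```python
def parse_fasta(text):
    """Parse multi-FASTA text into a list of (identifier, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_nonredundant(*fasta_texts):
    """Merge FASTA texts, keeping the first copy of each identical record."""
    seen, merged = set(), []
    for text in fasta_texts:
        for record in parse_fasta(text):
            if record not in seen:
                seen.add(record)
                merged.append(record)
    return merged


upstream = ">m1|human|120\nACGTAC\n>m2|mouse|300\nTTGACA\n"
downstream = ">m2|mouse|300\nTTGACA\n>m3|rat|50\nGGGCGG\n"
merged = merge_nonredundant(upstream, downstream)
print(merged)
```

Keeping the position in the identifier, as the text describes, means that the same motif sequence found at two different positions is retained as two records.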

Figure 3: Pipeline showing the steps in the TFBS analysis.

3 Pipeline for transcription-factor binding-site (TFBS) analysis

To enable a comprehensive analysis, the commercial version, TRANSFAC professional (Matys, Fricke et al. 2003), was used for TFBS analysis. MATCH (Kel, Gössling et al. 2003) is a tool which uses the weight matrices in the TRANSFAC database to search for putative TFBSs; the advanced version, MATCH professional, distributed with TRANSFAC professional, was used. Published TFBS information was used to optimize the MATCH search parameters, i.e. to predict maximum true positives and minimum false positives against the TRANSFAC professional database. A systematic pipeline was developed to assess the specificity of TF binding to the conserved-sequence motifs identified by phylogenetic footprinting (Fig. 3). The steps of the analysis were:

• Starting inputs were the motifs identified by AVID/LAGAN/FootPrinter methods.

• These motifs were scored against the TRANSFAC database using MATCH, which uses the information defined in the profile (selection of matrices with defined cutoffs).

• The output file generated by MATCH was processed to eliminate entries for motif sequences which did not correlate with any known binding affinity; only sequences showing putative binding to the vertebrate TFs were retained.

• The Perl scripts “motifExtract.pl” and “extractSeq.pl” contain modules that process the MATCH output file.

• The final output (same format as MATCH output) generated by these Perl scripts was stored in the TFBS database.

• When conserved motifs were obtained by non-stringent criteria, e.g. parsimony score > 0 for FootPrinter or percent identity < 100% for alignment methods, the TFs predicted to bind to the same set of conserved motifs in different input sequences could differ. Such predicted TFs were eliminated. This criterion was implemented by two Perl scripts, “tfbsCons.pl” and “ultraTFBS.pl”, which need to be run consecutively.

• Altogether, the resultant predicted motifs were classified as either highly conserved or less highly conserved. Both sets were stored in the TFBS database.
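The consistency criterion in the list above (eliminating TFs whose prediction differs between input sequences for the same conserved motif) can be illustrated with a set intersection. This sketch is one possible reading of that step, not the logic of “tfbsCons.pl” or “ultraTFBS.pl”; the data layout, function name, and example predictions are assumptions made for illustration.

```python
def consistent_tfs(predictions):
    """Keep, for each conserved motif, only TFs predicted in every sequence.

    predictions: {motif_id: {sequence_name: set of TF names}} (assumed layout).
    Motifs with no TF shared by all sequences are dropped.
    """
    kept = {}
    for motif_id, per_sequence in predictions.items():
        tf_sets = list(per_sequence.values())
        shared = set.intersection(*tf_sets) if tf_sets else set()
        if shared:
            kept[motif_id] = shared
    return kept


# Made-up predictions for two conserved motifs in two input sequences.
predictions = {
    "motif1": {"human": {"SP1", "AP2"}, "mouse": {"SP1"}},
    "motif2": {"human": {"NF-Y"}, "mouse": {"USF"}},
}
filtered = consistent_tfs(predictions)
print(filtered)
```

In this toy input only SP1 at motif1 survives, because it is the only TF predicted for the same motif in both sequences.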


Figure 5: (a) Web form for the user to submit information required for viewing the results. (b) Results page for alignment method and (c) for FootPrinter method. Note the different options in the display pattern shown in (b) and (c); TF titles are seen in (c), while conserved-sequence motifs are seen as vertical bars in (b). (d) Report page showing the report for the search made for the TF SpZ1.


Figure 4: The procedural flow of information, starting from submitting the web form to the display of results, and the programs involved with each task.

4 Visual front-end for data analysis

TFBSs occur in combinations of order, distance and strand orientation which are specific for a particular gene. Analyzing this organization in relation to the gene structure is essential for understanding transcriptional regulation. An intuitive visual front-end is necessary to allow the researcher to view the TFBS organization and interpret and evaluate the results. To achieve this, we designed an interactive user-interface, FactorScan.

4.1 Interface development

FactorScan is a web-based application accessible through a web browser. It links the TFBS information, conserved-sequence motif information predicted by AVID/LAGAN/FootPrinter and the TRANSFAC database (Fig. 4). This interface enables access to the data (conserved-sequence motifs and TFBS) generated by the various pipelines (Figs 1 and 3); the data are precomputed and are not generated dynamically during visualisation. The web interface has three main components: the web form, the results page and the report page.

Input: The user input for the web form is categorized into mandatory and optional parameters (Fig. 5(a)). The mandatory parameters include the gene for which the results are to be displayed and the various options used for phylogenetic footprinting to generate the data (subregion size, sequence type, sequence dataset). The optional parameters are for customizing and controlling the display of the results. Some important features are (i) Transcription Factor Search, (ii) Core Similarity Score, (iii) Title, (iv) Tissue Source and (v) Line. The Transcription Factor Search is useful to display a subset of TFs of particular interest, either individually or in combinations. The latter is particularly useful for identifying and comparing TFBS ‘modules’ (clusters of TFBS in a defined order) (Wasserman & Sandelin 2004). The Core Similarity Score can be used to visualize TFs which satisfy criteria set by the user. This value is in the range of 0-1; by default it is set to 1 to display the statistically most significant hits. The “Title” option can be used to visualize the name of the TF matrices for the displayed TFs. Tissue-specific TFs can be searched according to tissue, such as brain and testis. The cell-positive and cell-negative information in the TRANSFAC database is used for this purpose. The conserved-sequence motif distribution can be viewed by selecting the “Line” option.
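How the Transcription Factor Search and Core Similarity Score options narrow the displayed hits can be sketched as a simple filter. This is not FactorScan's CGI code; the `Hit` record, its fields, the function name, and the example values are hypothetical and only mirror the options described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One predicted TFBS (hypothetical record layout)."""
    tf: str
    position: int
    strand: str
    core_score: float


def select_hits(hits, min_core=1.0, tf_names=None):
    """Return hits with core_score >= min_core, optionally limited to tf_names."""
    wanted = {name.upper() for name in tf_names} if tf_names else None
    return [
        hit
        for hit in hits
        if hit.core_score >= min_core
        and (wanted is None or hit.tf.upper() in wanted)
    ]


# Made-up hits for illustration only.
hits = [
    Hit("SP1", -120, "+", 1.0),
    Hit("AP2", -95, "-", 0.92),
    Hit("SpZ1", 340, "+", 1.0),
]
default_view = select_hits(hits)
spz1_only = select_hits(hits, min_core=0.9, tf_names=["spz1"])
print(default_view, spz1_only)
```

With the default cutoff of 1 only perfect core matches are shown; lowering the cutoff and naming TFs reproduces the kind of focused view used to inspect TFBS modules.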

Output: The submitted web form is processed by a CGI script “simpleImageReference.cgi” (Fig. 4) and the results are displayed in the same window. The results page displays a schematic of the relative organization of gene annotation, TF and conserved-sequence motif information. Genomic sequence is represented, conventionally, as a horizontal line with exons mapped on as rectangular boxes, and with coding and non-coding regions of exons shaded in different colors (Fig. 5(b), (c)). The TFs predicted to bind are represented as triangles (Fig. 5(b), (c)), inverted and upright for the forward and reverse strands, respectively. Each TF is assigned a unique color; its name is displayed if the “Title” option is selected. The conserved-sequence motifs, identified by any of the methods, are represented as vertical bars (Fig. 5(b)); use of color is particularly helpful to discriminate these regions when they are very close. Triangles representing TFs and vertical bars representing conserved-sequence motifs are clickable areas. Clicking on a triangle displays a summary of TF information, obtained from the TRANSFAC database. Clicking on a vertical bar displays information about the conserved-sequence motif, accessed from the conserved-sequence motif database. This is particularly useful as the conserved-sequence motif can be examined for other purposes. For FootPrinter analyses the schematic is drawn to scale within a species, but between species the scale is not normalized (Fig. 5(c)). For pairwise-alignment analyses, the scale (also shown; see Fig. 5(c)) is normalized between the pairs, and the results can be displayed either between specific pairs or for one against all others. The latter is useful to compare the conserved TFBS distribution among various lineages. Information about species, abbreviations used and the sequence length in base pairs is provided in table format at the bottom of the schematic. The results page also has a link to view the report of the TFs and their binding sites. Clicking this link pops up a window (Fig. 5(d)) displaying a detailed summary of the TFs, the strand to which each binds, core match, the conserved-sequence motif identifier, the position of the TF relative to the transcription start site and the sequence which was used for TFBS analysis.

5 Analysis of results

Our use of this combinatorial PF approach (i.e. both alignment-based and motif-discovery-based methods) predicted most of the known TFs for the PRNP and PRND genes. The SP1 TF has been shown experimentally to play a role in transcriptional regulation of PRNP (Saeki, Matsumoto, Matsumoto & Onodera 1996) (Baybutt & Manson 1997) (Inoue, Tanaka, Horiuchi, Ishiguro & Shinagawa 1997) (Mahal, Asante, Antoniou & Collinge 2001). Mahal and coworkers also found AP1 and AP2 binding sites in the human promoter region. We predicted both SP1 and AP1/AP2 TFBSs using the pairwise-alignment method in most pairs of sequences compared, but these TFs were not identified using FootPrinter analysis (motif absence in any sequence was not allowed). Premzl and coworkers (Premzl, Delbridge, Gready, Wilson, Johnson, Davis, Kuczek & Graves 2005) reported several regulatory regions in PRNP using PF (FootPrinter method) with the then-available sequences (eutherian mammals and one marsupial only): most of the TFs (MEF2, Oct-1, MyT1 and NFAT) were predicted in the intra-eutherian mammal comparison.

Nagyova and coworkers (Nagyová, Pastorek & Kopáček 2004) experimentally validated the role of USF and NF-Y in PRND promoter activity. We predicted the NF-Y region using both alignment-based and FootPrinter methods. We predicted the USF-binding site in comparisons of some sequence pairs using alignment-based methods but not using FootPrinter: this indicates either that the USF-binding site is degenerate or short or that it is not phylogenetically conserved among the species compared.

Of particular interest for our genes, we predicted several new TFBSs (PRNP: E4BP4, DBP, FAC1, MYB; PRND: SpZ1, CDXA, LEF1) which are phylogenetically conserved for both genes, and which correlate well with physiological behaviour consistent with operation of these TFs in regulating other genes (e.g. tissue specificity, specific physiological role). We are testing these predictions experimentally, and so far have indicative confirmation for FAC1 and SpZ1.

6 Application and Conclusions

We have developed a graphical web interface to facilitate researcher evaluation of results from the phylogenetic footprinting and TFBS analysis pipelines. An application of the pipeline and web interface is illustrated by an analysis of the PRNP and PRND genes. This revealed several new conserved TFBSs, in addition to detecting already published and experimentally validated TFs for regulating these genes. Detection of the latter serves as a confidence test for our pipeline analysis. Several of the newly predicted TFBSs are consistent with the known functions of these genes, providing strong starting points for follow-up experimental studies. A combinatorial approach of predicting conserved motifs using FootPrinter and AVID/LAGAN methods followed by TF binding analysis significantly improved the confidence in the predicted TFBSs. Our pipeline was also tested on the newly discovered prion-protein family gene, SPRN, coding for the protein Shadoo, providing us with valuable initial functional predictions for a gene whose function is not known. Our development of a pipeline which incorporates both alignment-based and motif-discovery-based methods with TFBS analysis is novel, and provides a powerful new tool for high-throughput, robust analysis. The concurrent development of the graphical-display module for this pipeline greatly enhances its usefulness by facilitating intuitive and interactive analysis of the results.

7 Software and Hardware

Standalone versions of AVID (version 2.1), LAGAN (version 1.21) and FootPrinter (version 2.1) were used for the phylogenetic footprinting analysis. The TRANSFAC database version 9.2 and MATCH version 6.1 were used for TFBS analysis. The web form was implemented using HTML running on an Apache web server on a Linux operating system at valera.anu.edu.au which hosts the web page and can be accessed locally with the web address http://valera.anu.edu.au:8080/factorScan html. The graphical package Perl GD and Common Gateway Interface package Perl CGI were used for the web interface development. Additional pipelining and analysis modules were written in Perl. All analysis was performed on a PC but some of the more memory-demanding FootPrinter analyses were performed on the Dell Linux cluster at the APAC (Australian Partnership for Advanced Computing) National Facility.

References

Bailey, T. L. & Elkan, C. (1994), ‘Fitting a mixture model by expectation maximization to discover motifs in biopolymers.’, Proc Int Conf Intell Syst Mol Biol 2, 28–36.

Baybutt, H. & Manson, J. (1997), ‘Characterisation of two promoters for prion protein (prp) gene expression in neuronal cells.’, Gene 184(1), 125–131.

Blanchette, M. & Tompa, M. (2003), ‘Footprinter: A program designed for phylogenetic footprinting.’, Nucleic Acids Res 31(13), 3840–3842.

Bray, N., Dubchak, I. & Pachter, L. (2003), ‘Avid: A global alignment program.’, Genome Res 13(1), 97–102.

Brudno, M., Do, C. B., Cooper, G. M., Kim, M. F., Davydov, E., Program, N. I. S. C. C. S., Green, E. D., Sidow, A. & Batzoglou, S. (2003), ‘Lagan and multi-lagan: efficient tools for large-scale multiple alignment of genomic dna.’, Genome Res 13(4), 721–731.

Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M. & Dubchak, I. (2004), ‘Vista: computational tools for comparative genomics.’, Nucleic Acids Res 32(Web Server issue), W273–W279.

Inoue, S., Tanaka, M., Horiuchi, M., Ishiguro, N. & Shinagawa, M. (1997), ‘Characterization of the bovine prion protein gene: the expression requires interaction between the promoter and intron.’, J Vet Med Sci 59(3), 175–183.

Kel, A. E., Gössling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O. V. & Wingender, E. (2003), ‘Match: A tool for searching transcription factor binding sites in dna sequences.’, Nucleic Acids Res 31(13), 3576–3579.

Liu, X., Brutlag, D. L. & Liu, J. S. (2001), ‘Bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes.’, Pac Symp Biocomput pp. 127–138.

Loots, G. G., Ovcharenko, I., Pachter, L., Dubchak, I. & Rubin, E. M. (2002), ‘rvista for comparative sequence-based discovery of functional transcription factor binding sites.’, Genome Res 12(5), 832–839.

Mahal, S. P., Asante, E. A., Antoniou, M. & Collinge, J. (2001), ‘Isolation and functional characterisation of the promoter region of the human prion protein gene.’, Gene 268(1-2), 105–114.

Matys, V., Fricke, E., Geffers, R., Gössling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A. E., Kel-Margoulis, O. V., Kloos, D.-U., Land, S., Lewicki-Potapov, B., Michael, H., Münch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S. & Wingender, E. (2003), ‘Transfac: transcriptional regulation, from patterns to profiles.’, Nucleic Acids Res 31(1), 374–378.

Nagyová, J., Pastorek, J. & Kopáček, J. (2004), ‘Identification of the critical cis-acting elements in the promoter of the mouse prnd gene coding for doppel protein.’, Biochim Biophys Acta 1679(3), 288–293.

Premzl, M., Delbridge, M., Gready, J. E., Wilson, P., Johnson, M., Davis, J., Kuczek, E. & Graves, J. A. M. (2005), ‘The prion protein gene: identifying regulatory signals using marsupial sequence.’, Gene 349, 121–134.

Premzl, M., Gready, J. E., Jermiin, L. S., Simonic, T. & Graves, J. A. M. (2004), ‘Evolution of vertebrate genes related to prion and shadoo proteins–clues from comparative genomic analysis.’, Mol Biol Evol 21(12), 2210–2231.

Rice, P., Longden, I. & Bleasby, A. (2000), ‘Emboss: the european molecular biology open software suite.’, Trends Genet 16(6), 276–277.

Saeki, K., Matsumoto, Y., Matsumoto, Y. & Onodera, T. (1996), ‘Identification of a promoter region in the rat prion protein gene.’, Biochem Biophys Res Commun 219(1), 47–52.

Sandelin, A., Wasserman, W. W. & Lenhard, B. (2004), ‘Consite: web-based prediction of regulatory elements using cross-species comparison.’, Nucleic Acids Res 32(Web Server issue), W249–W252.

Tagle, D. A., Koop, B. F., Goodman, M., Slightom, J. L., Hess, D. L. & Jones, R. T. (1988), ‘Embryonic epsilon and gamma globin genes of a prosimian primate (galago crassicaudatus). nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints.’, J Mol Biol 203(2), 439–455.

Wasserman, W. W. & Sandelin, A. (2004), ‘Applied bioinformatics for the identification of regulatory elements.’, Nat Rev Genet 5(4), 276–287.