bio scope

BioScope

Advanced Search Grammar Tool for Identification of Functional Noncoding Elements

Principal Investigator - Hariharane Ramasamy

Sanjeev Mishra

Tulasi Ravuri

Summary

The completion of several genomic sequences has motivated the development of a tool that can aid in locating and analyzing transcription factor binding sites (TFBS), which are responsible for regulating gene transcription. TFBS are short sequences, 4-20 bases in length, that are often located, and often abundant, near the genes they regulate. They occur in groups or modules, also called enhancers or cis-regulatory modules (CRM). A CRM contains one or more TFBS and interacts with a specific combination of transcription factors to regulate gene expression. The goal of developmental biologists is to understand how these CRM are organized in a genome and how they regulate genes. Laboratory methods for locating CRM are often laborious and time consuming, so computational methods have become invaluable. Their success depends on how well they can be used in a lab environment.

Several computational tools exist to locate motifs in a genomic sequence. They fall into two categories. Tools in the first category employ statistical and probabilistic methods using known motifs and the frequencies of codons in a genomic sequence. Although some motifs have been discovered with these tools, they often yield many false positives. Tools in the second category employ the fundamental principles of the combinatorial logic underlying the occurrence of enhancers, or cis-regulatory modules (CRM). It is believed that genes with similar temporal and spatial expression patterns are controlled by similar CRM.

Experimental biologists who are knowledgeable about CRM occurrences need an efficient tool to locate them by applying combinatorial knowledge such as counts of binding site occurrences within a specified width, logical combinations of one or more binding sites, orientation, and more. The tools should be efficient, scalable, and fast. The aim of this proposal is to build such tools.

1 Introduction

Several genomes, including the human and mouse genomes, have been sequenced close to completion. In this post-genomic era, it is imperative that researchers be equipped with novel methodologies that enable them to rapidly and accurately identify, annotate, and functionally characterize genes. Thus, mining genomics and proteomics data with computational approaches appears to be the most effective way to extract information from these resources in a short time frame. The transcriptional regulation of a gene depends on the concerted action of multiple transcription factors that bind to cis-regulatory modules located in the vicinity of the gene. Cis-regulatory modules are regulatory elements that occur close to each other and control the spatial and temporal expression of genes. The regulatory language that the genome uses to dictate transcriptional dynamics can be revealed by identifying these cis-regulatory elements. Often these elements are conserved across organisms through evolution, with few mutations and without losing their functional value. Knowledge of these motifs may help drive the discovery of similar genes in other closely related organisms. The availability of accurate models, along with useful search methods with enhanced sensitivity and specificity, will be the first step toward detecting putative regulatory elements in a genome-wide manner.

2 Background

The identification of regulatory sequences and their location in a genome is an important step in understanding gene expression. Genes that have similar expression are believed to have similar regulatory logic. Such genes are governed by unique combinatorial transcriptional codes known as cis-acting regulatory modules (CRMs) or enhancers. CRMs are oligonucleotide sequences that act together to activate or suppress a gene. Several past studies have examined the behavior of enhancers and their role in developmental biology. The experiments performed to study the expression of a gene at a developmental stage are often time consuming and laborious. Biologists therefore often turn to computational tools to scan the whole genome and select better candidates for these regulatory regions.

Several computational methods exist to predict regulatory motif sequences. The motifs are over-represented near the genes they regulate. Using earlier knowledge and position-based probabilities, several tools were built to predict new regulatory motifs. CisAnalyst, developed by Berman et al. [4], has been successfully applied to the fruit fly to find new clusters using a purely computational approach. BioProspector uses Gibbs sampling to predict regulatory sequences. The main problems with these tools are the presence of background noise and the inability to differentiate between a true regulatory motif and a false positive. In addition, variations in genomic sequence across species further increase the noise. Although computational methods have served well for finding genes and even individual exons in genomic data, regulatory element prediction has proven difficult.

Markstein et al. [3] developed a tool that lets biologists search using previous knowledge of enhancers. The tool allows biologists to input desired regular expressions over {A,T,G,C}, a gene name, a width, and proximity constraints. However, the tool is genome-specific and lacks some important constraints, such as distance to the next binding site, orientation and order of the motifs, low-affinity sequences, variable-length regular expressions, and user-defined overlap constraints.

A brief survey of the computational identification of regulatory DNA is given by Dmitri Papatsenko and Michael Levine [2]. The paper explains the need for computational tools and compares the available tools without going into the specific details of the algorithms. It also emphasizes the need for fast and efficient computational tools.

3 Project Proposal

The project aims to provide the following:

1. Restrictive search capabilities such as distance to the next motif, orientation of the motif, low-affinity motifs, and order of motif occurrence [5].

2. Limited integrated information such as nearby genes/exons, gene expression data, and annotation details around the target once it is located [5].

3. Interactive chain search, where a search for a target in one organism can be linked to an intra-species or cross-species search.

4. Scalability and efficiency.
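To make the orientation, order, and distance constraints in the first aim concrete, the following is a minimal sketch and not the proposal's actual implementation. It assumes plain A/C/G/T motifs, treats a hit of the reverse complement as a "-" strand occurrence, and measures distance as the gap between the end of one motif and the start of the next. The function names (`find_sites`, `ordered_pairs`) are hypothetical.

```python
import re

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(genome, motif):
    """Return sorted (start, strand) hits of a plain motif on both strands."""
    hits = set()
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        # A lookahead reports overlapping occurrences as well.
        for m in re.finditer("(?=" + re.escape(pattern) + ")", genome):
            hits.add((m.start(), strand))
    return sorted(hits)


def ordered_pairs(genome, first, second, max_gap):
    """Find occurrences where `second` starts at most `max_gap` bases
    after the end of `first` (order and distance constraint)."""
    pairs = []
    second_hits = find_sites(genome, second)
    for a_start, a_strand in find_sites(genome, first):
        a_end = a_start + len(first)
        for b_start, b_strand in second_hits:
            gap = b_start - a_end
            if 0 <= gap <= max_gap:
                pairs.append((a_start, a_strand, b_start, b_strand, gap))
    return pairs
```

For example, `find_sites("GGAACCTTCC", "GGAA")` reports a forward hit at position 0 and a reverse-strand hit at position 6, where the reverse complement TTCC occurs. A production tool would likely need a faster pairing strategy than the nested loop used here.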

More importantly, our proposed module will be highly flexible, allowing constant integration of newer genomes and at the same time being a powerful tool that will allow the researcher to search for complex gene clusters.

To that end, we have developed a software program that locates the regulatory region more precisely, and with far more ease for the researcher, than currently available programs. More importantly, control over the program's results is given to the developmental biologist. The tool is well suited to a lab environment.

3.1 Phase I Specific Aims

1. To develop a web-based module that allows the researcher to search for cis-regulatory elements. The tool will take motifs and search constraints as input, as shown in Figure 1, and will display results as shown in Figures 2 and 3. The search feature of the program will provide

◦ the ability to enter 10 regular expressions using A, T, G, C and the letters given in the table below
◦ an option to allow self overlap
◦ the capacity to input a name for the motif
◦ a box to specify a width constraint
◦ the flexibility to input a logical combination of the motifs typed in the first item, such as (2A and 2B) or (A or B or C)
◦ the ability to disallow overlap across the motifs typed in the first item
◦ the ability to type the name of a gene within a specified distance once a cluster is found using the above rules
◦ a name under which to save the results. The name can be used in SuperCluster

Letter   Bases
B        C,G,T
D        A,G,T
H        A,C,T
K        G,T
M        A,C
N        A,C,G,T
R        A,G
S        C,G
V        A,C,G
W        A,T
Y        C,T
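The letter codes and the search constraints listed above can be illustrated with a short sketch. This is not the tool's actual code: it assumes the letters follow the standard IUPAC nucleotide codes (under which Y stands for C or T), expands each motif into a character-class regular expression, and tests a combination such as (2A and 1B) by counting motif start positions inside a window of the given width. The names `iupac_to_regex`, `find_motif`, and `find_clusters` are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes, expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "K": "[GT]",
    "M": "[AC]", "N": "[ACGT]", "R": "[AG]", "S": "[CG]",
    "V": "[ACG]", "W": "[AT]", "Y": "[CT]",
}


def iupac_to_regex(motif):
    """Translate a motif written with the letter codes into a regex."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def find_motif(genome, motif, allow_self_overlap=True):
    """Return start positions of the motif in the genome string."""
    pattern = iupac_to_regex(motif)
    if allow_self_overlap:
        # Zero-width lookahead so overlapping hits are all reported.
        pattern = "(?=" + pattern + ")"
    return [m.start() for m in re.finditer(pattern, genome)]


def find_clusters(hits_by_name, min_counts, width):
    """Return (window_start, counts) for every window of `width` bases,
    anchored at a hit, in which each named motif starts at least the
    required number of times (an AND combination such as 2A and 1B)."""
    anchors = sorted({p for hits in hits_by_name.values() for p in hits})
    found = []
    for start in anchors:
        end = start + width
        counts = {
            name: sum(start <= p < end for p in hits_by_name.get(name, []))
            for name in min_counts
        }
        if all(counts[name] >= need for name, need in min_counts.items()):
            found.append((start, counts))
    return found
```

As a usage example, `iupac_to_regex("GGRA")` yields `GG[AG]A`, and `find_motif("AAAA", "AA", allow_self_overlap=False)` returns only the non-overlapping starts. The sketch counts start positions only, so a motif that begins inside the window but extends past its end is still counted; a stricter tool might require the whole site to fit.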

4 Summary: Significance of proposed work

The tool will also provide integration and maintenance, including:

1. Updating to new versions of genomic sequences when they become available from the public site.

2. Rerunning the program on old results and automatically informing users of new results via email.

3. Integrating with Gene Ontology information and other useful databases as advised by biologists.

4. Providing a workflow-like tool which takes a query run on one organism and applies it to another organism with a single key

5. Storing and maintaining results.

5 Commercialization Strategy

After the Phase I launch, every person who visits the site will be asked to fill in a profile, along with the purpose of the visit, before being given access to the program. Visitors will also be asked for feedback, which will be collected and used as leads to prepare the BioRegulatory Appliance in Phase II.

6 Key Personnel

Hariharane Ramasamy

Hariharane Ramasamy is pursuing his PhD in Computer Science at Illinois Institute of Technology, IL, and has more than 15 years of experience in developing applied computational tools for biomedical engineering. A few relevant tools include:

◦ Implemented a motif search system for genomic sequences that displays the results graphically on the screen along with the sequence annotation.
◦ Developed a surveillance system to detect novel sequences.
◦ Developed a program that calculates the digest of peptides for user-input proteins and also performs differential combination of post-translational modifications along with pI/Mw calculations.
◦ Pattern-induced multiple alignment using properties of amino acids.
◦ New extended genetic algorithm for 3D lattice simulation of protein folding using conflicting criteria.
◦ Simulation of human stand-sit movement using a 3-link stick figure model.

Sanjeev Mishra

Sanjeev Mishra is a seasoned professional with about 20 years of industry experience. He has spent half of his career in startups in the fields of business activity management, business intelligence, and mobile application and management platforms, and the other half in research and development. He has been awarded one US patent. Sanjeev is passionate about biking, hiking, running, meditation and gardening. He holds a master's degree in Physics from DBS College Dehradun, India.

Tulasi Ravuri

Tulasi Ravuri is an experienced software engineering manager with 23 years of experience at several Silicon Valley companies, including Unisys, Novell, McAfee, and DoCoMo Labs. Through his broad career he has helped bring several products to market. His most recent work is on a Life Sciences Regulatory Compliance and Administration software suite used by universities such as Stanford, Berkeley, and Harvard; pharma companies such as GSK; hospitals such as Palo Alto Medical Foundation; and government. He advises several software companies and is an advocate of open source software. He has an MSCS from University of Louisiana and a BS (Chemical Engg.) from Andhra University, India.

7 Consultants

In Phase I, the following help will be used to guide the program to Phase II:

1. Two student interns for refining the search and gathering data on the abilities of the program.

2. A consultant for designing the user interface and graphics display.

8 Prior Support

The proposal has no prior or current support.

References cited

[1] Marc S. Halfon, Yonatan Grad, George M. Church, Alan M. Michelson, Computation-Based Discovery of Related Transcriptional Regulatory Modules and Motifs Using an Experimentally Validated Combinatorial Model. Howard Hughes Medical Institute and Department of Medicine, Brigham and Women's Hospital; Linköping University, Sweden.

[2] Dmitri Papatsenko, Michael Levine, Computational identification of regulatory DNAs underlying animal development. Nature Methods, Vol. 2 No. 7:529-534, 2005.

[3] Markstein, M., Markstein, P., Markstein, V., Levine, M.S., Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, Vol. 99:763-768, 2002.

[4] Benjamin P. Berman, Barret D. Pfeiffer, Todd R. Laverty, Steven L. Salzberg, Gerald M. Rubin, Michael B. Eisen and Susan E. Celniker, Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biology, Vol. 5:R81, 2004.

[5] Alan M. Michelson, Deciphering genetic regulatory codes: A challenge for functional genomics. PNAS, Vol. 99 No. 2, 546-548, 2002.

[6] Matthias Harbers, Piero Carninci, Tag-based approaches for transcriptome research and genome annotation. Nature Methods, Vol. 2 No. 7, 499-502, 2005.

[7] Yueyi Liu, Liping Wei, Serafim Batzoglou, Douglas L. Brutlag, Jun S. Liu and X. Shirley Liu, A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Research, Vol. 32 Web Server Issue, 2004.

[8] Mike P. Liang, Olga G. Troyanskaya, Alain Laederach, Douglas L. Brutlag, and Russ B. Altman, Computational Functional Genomics. IEEE Signal Processing Magazine, 2004.

Budget

Description                          Expense Amount for 6 months
Salary for Principal Investigator    $36,000
Salary for Software engineer         $30,000
Salary for 2 student interns         $24,000
Salary for Biology consultant        $24,000
Hardware and Software cost (4)       $24,000
Internet & Cloud hosting services    $12,000
Miscellaneous expenses               $6,000
Office rent & expenses               $15,000
Travel                               $5,000
Total Cost                           $176,000

Figure 1: Input web form to search the genomic sequence using user defined constraints

Figure 2: Results summary

Figure 3: Detailed results display for

Figure 4: Flow chart describing the flow of the algorithm

Figure 5: Diagram describing the Phase I flow

Appendix

The ultimate goal is to build a self-contained BioRegulatory appliance that supports automatic updates of the genomic sequences, reruns old queries on the new sequences, and informs users of new results, thereby saving an enormous amount of time for developmental biologists who depend on computers to locate the target.

Phase II Plan

Specific Aims - To enhance the available module, Biocis, so that it is user friendly and easy for a researcher to navigate. Phase II will also aim to create a workflow module that allows easy storage and retrieval of data from disparate sources and integrates it with useful information. The Phase II features will include:

1. An advanced regular expression search tool for genomic sequences that uses prebuilt index positions for 4-base words (AAAA, AAAG, ..., GCGC, ..., TTTT) to locate the motifs.

2. An advanced multithreaded server tool to perform fast parallel search of the motif sequences.

3. Advanced caching in memory, on disk, and in a database to avoid repeated searches of previous sequences.

4. An automated daemon process to fetch new releases, rerun the saved searches, and inform scientists of new results via email.

5. A link to the Gene Ontology database, which provides gene function information.

6. Cross-species ortholog results from existing public annotated databases.

7. Simple statistical tools to look at motif occurrences across the whole genome, starting from the interesting results.

8. Creation of the BioRegulatory software package and a plan for designing a spec for the BioRegulatory Appliance.

9. A SuperCluster tool which will perform a search similar to that in Aim I. The inputs A-J are the names of the searches performed in Aim I. The tool will help support the theory that clusters of enhancers act together in regulating the gene. A sample input form is shown in Figure 6.
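The prebuilt 4-base index in the first Phase II feature can be sketched as follows. This is only an illustration under stated assumptions: the genome fits in memory as a Python string, the motif is a plain A/C/G/T string at least four bases long, and the first four bases are used as a seed whose indexed positions are then verified against the full motif. The function names `build_index` and `search` are hypothetical, and regular-expression motifs would need additional handling that is not shown.

```python
from collections import defaultdict

K = 4  # word length of the prebuilt index (AAAA ... TTTT)


def build_index(genome, k=K):
    """Map every k-base word to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def search(genome, index, motif, k=K):
    """Locate an exact motif by looking up its first k bases in the
    index and verifying the rest of the motif at each candidate."""
    if len(motif) < k:
        raise ValueError("motif is shorter than the index word length")
    # .get avoids creating empty entries in the defaultdict.
    candidates = index.get(motif[:k], [])
    return [pos for pos in candidates if genome.startswith(motif, pos)]
```

The index is built once per genome release and reused across queries, so each search only touches the positions that share the motif's first four bases instead of scanning the whole sequence.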

3.1.2 Phase III

Phase III will concentrate on adding more features to the query, creating continuity in search. For example, once a search is performed, the result will display genes along with their orthologs in other species. The search can then be performed immediately for the same enhancer in the species that has the closest orthologs. Phase III will also look at improving the performance of the BioRegulatory appliance.

Creating a sound computing infrastructure

The infrastructure requires writing a separate server to perform the search and caching. The search module will not be run via a web server, as it is in some of the existing tools. Every search request handled by a web server means that the whole genome sequence is read into memory. Genomic sequences vary from 1 megabyte to 200 megabytes in length. As the number of users grows, the system will run out of memory, which imposes a limit on the number of users. Using a web server to preload the data during startup is not advisable. Hence a separate server that performs the search for any generic genome sequence is needed. The caching in Phase I is achieved at two levels: memory and disk.
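The two-level caching (memory first, then disk) that the proposal describes for the separate search server might look like the sketch below. It is an assumption-laden illustration rather than the actual design: results are assumed to be JSON-serializable lists of positions, the cache key is a hypothetical string such as a genome name joined with the query, and the in-memory level evicts the least recently used entry once `max_items` is exceeded. The class name `TwoLevelCache` is hypothetical.

```python
import hashlib
import json
import os


class TwoLevelCache:
    """Look up search results in memory, then on disk, then compute."""

    def __init__(self, directory, max_items=128):
        self.directory = directory
        self.max_items = max_items
        self.memory = {}  # insertion-ordered dict used as a small LRU
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any query string maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def _remember(self, key, value):
        # Re-insert to mark the key as most recently used.
        self.memory.pop(key, None)
        self.memory[key] = value
        if len(self.memory) > self.max_items:
            oldest = next(iter(self.memory))
            del self.memory[oldest]

    def get(self, key, compute):
        """Return the cached value for `key`, calling `compute()` only
        when neither the memory level nor the disk level has it."""
        if key in self.memory:
            value = self.memory[key]
        else:
            path = self._path(key)
            if os.path.exists(path):
                with open(path) as fh:
                    value = json.load(fh)
            else:
                value = compute()
                with open(path, "w") as fh:
                    json.dump(value, fh)
        self._remember(key, value)
        return value
```

A search server could wrap each query as `cache.get(genome_name + "|" + query, run_search)`, so a repeated query is answered from memory, an evicted one from disk, and only a genuinely new one triggers a scan of the genome.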

Figure 6: SuperCluster - Web form for user input
