Computational tools

ENRICH - From Expression, Through Annotation, to Function

Goren Tali¹ and Manor Ohad¹
¹ Department of Computer Science, The Hebrew University of Jerusalem, Israel.
Visit the ENRICH site at http://www.huji.ac.il/~manor460/enrich


ABSTRACT

Which transcription factor controls stress response in yeast? Which genes regulate amino acid biosynthesis? Do mitochondrial genes have a stronger response to heat shock than peripheral genes? The amount of available biological data is rapidly increasing. Gene annotation, cellular localization, chromatin structure, protein structure, function and interactions, ChIP and gene expression data are overwhelming in their richness of information, but how does one extract meaningful biological conclusions from them?

ENRICH is a computational tool aimed at addressing these questions. ENRICH gives users an easy interactive interface for uploading their biological data, whether gene annotation tables, experimental gene expression measurements or other data. These data sets, which we refer to as matrices, can then be manipulated in a variety of ways implemented in ENRICH, giving the user more freedom with the uploaded data.

We also implemented a large variety of statistical tests, both paired and unpaired, that can be performed on the data to find enriched aspects of it, as well as multiple hypothesis correction methods such as Bonferroni and FDR (false discovery rate). The result is a tool that can be used to upload various data types, investigate them statistically, and save the biological results.

1 INTRODUCTION

Computational biology has the potential to revolutionize the understanding of the cell by offering an unprecedented view of the molecular underpinnings of biological phenomena. Computational analysis is essential to transform the masses of generated data into a mechanistic understanding of processes and pathways in the cell.

The project aims at extracting a comprehensive overview from the huge amount of existing information arising from genome-wide analyses. Our goal is to connect information from different systematic global studies and resources to infer biological function, and to check whether the connections found are statistically significant. In addition, we wished to use existing biological information and obtainable data to reconstruct novel and previously known biological networks.

For this purpose we developed a computational tool in the form of a computer program, which we call ‘ENRICH’. ENRICH makes it possible to combine heterogeneous data, including gene annotations, gene expression data, chromatin immunoprecipitation (ChIP), cellular localization analysis, protein-protein interactions and more.

ENRICH handles matrix-based data of two main types: binary and continuous. A binary matrix can be viewed as annotations of any type. A continuous matrix may contain various types of measurements over genes, such as expression, ChIP, etc.

ENRICH allows the user to perform a global statistical analysis of such crossed types of biological information, determine how significant the examined associations are, and find enrichment within different data sets.

1.1 Statistical Hypothesis Testing

It is easier to show that a universal hypothesis is false than to prove that it is true. Therefore, we use statistical hypothesis testing to assess whether the observed data are compatible with a given hypothesis. The process of statistical hypothesis testing is usually composed of the following steps:

1. Formulate the null hypothesis, which commonly states that "there is no phenomenon" and that the observations could have arisen through chance. The alternative hypothesis is commonly the contrary of the null hypothesis, and it is accepted if the observed data values are sufficiently improbable under the null hypothesis. An example of a null hypothesis may be: “There is no connection between the annotation ‘Ribosome Assembly’ and the transcription factor ‘RAP1’”. The alternative hypothesis says that the two are indeed related.

2. Identify a test statistic that can be used to assess the validity of the null hypothesis. We elaborate on the test statistic of each statistical test we implemented in the next section.

3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained, assuming that the null hypothesis is true. The smaller the P-value, the stronger the evidence against the null hypothesis.

4. Compare the P-value to an acceptable significance level (sometimes called an alpha value). If the P-value is below this level, the observed effect is statistically significant, and the null hypothesis is rejected in favor of the alternative hypothesis.
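The four steps can be illustrated with a minimal Python sketch. It is not taken from ENRICH: the one-sided binomial test, the function name `binomial_p_value` and the numbers (9 successes in 10 trials) are assumptions chosen only because the P-value can be computed exactly.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided P-value P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Step 1: null hypothesis "success probability is 0.5" (hypothetical example).
# Steps 2-3: the test statistic is the number of successes; compute its P-value.
p_value = binomial_p_value(successes=9, trials=10)

# Step 4: compare with the chosen significance level.
alpha = 0.05
reject_null = p_value < alpha
print(p_value, reject_null)
```

Here the P-value is 11/1024, so the null hypothesis is rejected at the 0.05 level.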

2 METHODS

2.1 Implementation

Our main aim was to develop an easy-to-use, flexible and consistent application that reduces otherwise complex tasks to only a few lines of code. We decided to implement ENRICH in the Perl programming language, for several reasons: Perl is a high-level programming language with strong support for arrays and hashes. It has a wide range of modules and packages related to mathematics and statistics, which were helpful to us. In addition, Perl is freely available to anyone, and many bioinformatics tools are implemented in Perl. A disadvantage of Perl, of which we were aware from the beginning of the work, is that it is relatively slow (compared to the C programming language) at performing numerical calculations.

ENRICH was developed so that it can be used in three different ways:

1. As a command line interactive interpreter, where the user types one command at a time as seen in figure 1.

2. Running the program with a script of ENRICH commands to be performed.

3. Using ENRICH as a module and incorporating ENRICH functions into any Perl program.

2.2 ENRICH input

ENRICH handles tab-delimited text files in the structure of a matrix: the first row contains the headers of the columns and the first column contains the headers of the rows. Matrices with two header rows are also handled. Every cell in the matrix contains the data relevant to its specific row and column headers. The cells of a matrix may contain continuous numeric values or binary values (0 or 1).
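As an illustration of this layout, the following Python sketch parses a tab-delimited matrix with one header row and one header column. It is not ENRICH code (ENRICH is written in Perl); the function name `read_matrix` and the tiny example table are hypothetical.

```python
import csv
import io

def read_matrix(handle):
    """Parse a tab-delimited matrix with one header row and one header column."""
    rows = list(csv.reader(handle, delimiter="\t"))
    col_headers = rows[0][1:]                  # skip the top-left corner cell
    row_headers = [r[0] for r in rows[1:]]
    values = [[float(v) for v in r[1:]] for r in rows[1:]]
    return row_headers, col_headers, values

# Made-up binary example: two genes (rows) against two transcription factors.
example = "gene\tRAP1\tMSN2\nYAL001C\t1\t0\nYAL002W\t0\t1\n"
row_headers, col_headers, values = read_matrix(io.StringIO(example))
print(row_headers, col_headers, values)
```

The same function works on a real file by passing an open file handle instead of the `io.StringIO` object.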

We now describe some representative types of data sets that are widely used in computational molecular biology and that ENRICH is designed to process and analyze.

1. Gene Ontology Annotation (GO) - The GO database provides a useful tool to annotate and analyze the functions of a large number of genes.

2. Gene Expression - Results of microarray experiments, which indicate the RNA level of the genome in different conditions.

3. ChIP on chip – Data from Chromatin Immunoprecipitation, a technique used for identification of the DNA sequences to which specific proteins bind in vivo.

4. Cellular localization – Proteins were marked using different methods, and their sub-cellular localization was checked.

5. Protein-protein interactions – Different experiments in which interactions between different proteins in the cell were measured.

2.3 Basic Operations

All input files to ENRICH are in a matrix-based format. We therefore found it useful to implement the following three types of functions:

1. Basic matrix manipulations, which include simple mathematical operations over all of a matrix's values. For example: Transpose, LogScale, Normal, Average, Negate, AddCon (add a constant numeric value to all matrix values), etc.

2. Convenient data queries. For example: select rows/columns by name/sum, get the number of rows/columns, get row headers, extract P-values, etc.

3. Functions for simple use of and orientation within ENRICH. For example: Help, Whos (print out all the currently used variables), Load, Save, etc.

Most ENRICH functions are explained in more detail in the Appendix section.
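To make the first group of operations concrete, here is a small Python sketch of three of them on a matrix stored as a list of rows. The function names mirror Transpose, AddCon and LogScale, but the bodies are assumptions: the paper does not state, for instance, which logarithm base LogScale uses, so base 2 is chosen here only for illustration.

```python
import math

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def add_con(m, c):
    """Add the constant c to every value of the matrix."""
    return [[v + c for v in row] for row in m]

def log_scale(m):
    """Replace every value by its base-2 logarithm (base is an assumption)."""
    return [[math.log2(v) for v in row] for row in m]

m = [[1.0, 2.0], [4.0, 8.0]]
print(transpose(m))   # rows become columns
print(add_con(m, 1))  # every value shifted by 1
print(log_scale(m))   # values on a log2 scale
```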

Figure 1. An example of using ENRICH to load two datasets, one containing ChIP data and the other GO annotation data. A HyperGeometric test is then performed on these matrices, and the result is refined using the False Discovery Rate correction and saved to a new file in the user's directory.


2.4 Statistical Testing

All statistical tests in ENRICH are performed using the function ‘Enrich()’. The input for performing a statistical test with ‘Enrich()’ is two matrices of the following format:

Matrix A with n rows and m columns.
Matrix B with n rows and k columns.
A and B both contain the same row values (e.g. all the yeast genes).

These specifications are not strict requirements: if A and B do not contain exactly the same row values, the output of ENRICH refers only to rows that appear in both A and B.

‘Enrich()’ performs the statistical test for every pair of columns (vectors) x ∈ A and y ∈ B. The output is a matrix C with m rows and k columns, in which every cell contains the P-value that results from the statistical test performed on vector x and vector y.
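The all-pairs structure of this computation can be sketched in a few lines of Python. This is an illustration only: `enrich_pairs` is a hypothetical name, and `overlap` is a stand-in for a real statistical test that simply counts rows where both columns are 1.

```python
def enrich_pairs(a, b, test):
    """a: n x m matrix, b: n x k matrix (lists of rows).
    Returns an m x k matrix whose cell [i][j] is test(column i of a, column j of b)."""
    cols_a = list(zip(*a))
    cols_b = list(zip(*b))
    return [[test(x, y) for y in cols_b] for x in cols_a]

def overlap(x, y):
    """Placeholder for a statistical test: number of rows where both vectors are 1."""
    return sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)

a = [[1, 0], [1, 1], [0, 1]]            # 3 rows (genes) x 2 columns
b = [[1, 0, 1], [1, 0, 0], [0, 1, 1]]   # 3 rows (genes) x 3 columns
c = enrich_pairs(a, b, overlap)
print(c)  # 2 x 3 result matrix
```

Passing a function that returns a P-value instead of `overlap` yields a P-value matrix of the shape described above.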

‘Enrich()’ performs a statistical test according to the method specified by the user. It checks the type of values that matrices A and B contain (either decimal or binary) and calls the matching function out of the following three: EnrichBin(), EnrichFlo() or EnrichAnno(). Each function may call a variety of statistical tests, which can operate on the relevant type of values in the matrices. We briefly explain the main purpose of each statistical test and how it is performed.

2.4.1 EnrichBin – Performs the test specified by the user on two matrices containing binary values. There are two possible tests for binary-valued matrices: the χ² test or the hypergeometric test.

a. χ² Test
A non-parametric test of statistical significance for binary values. The hypothesis tested with χ² is whether or not two different samples are different enough in some characteristic or aspect of their behavior. If there is a significant difference, we can generalize from our samples that the populations from which they are drawn also differ in their behavior or in the tested characteristic. The null hypothesis of this test states that there is no difference among the sample groups [3].
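For two binary vectors the χ² test reduces to a 2×2 contingency table. The Python sketch below shows one standard way to compute it (Pearson's statistic without continuity correction, with the P-value taken from the χ² distribution with one degree of freedom). Whether ENRICH uses exactly this variant is not stated in the paper, and the function name is hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(x, y):
    """Pearson chi-square test of independence for two binary vectors.
    Returns (statistic, P-value); no continuity correction is applied."""
    n = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)        # both 1
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)    # only x is 1
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)    # only y is 1
    d = n - a - b - c                                      # both 0
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0   # a constant vector carries no information
    chi2 = n * (a * d - b * c) ** 2 / denom
    p = erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 d.o.f.
    return chi2, p

stat, p = chi_square_2x2([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0])
print(stat, p)
```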

b. Hypergeometric Test
Describes the number of successes in a sequence of N draws from a finite population without replacement. Let there be n ways for a ‘good’ selection and m ways for a ‘bad’ selection out of a total of n+m possibilities. Take N samples and let x_i equal 1 if selection i is successful and 0 if it is not. Let X be the total number of successful selections:

$$X = \sum_{i=1}^{N} x_i$$

The probability of i successful selections is then:

$$P(X = i) = \frac{\binom{n}{i}\binom{m}{N-i}}{\binom{n+m}{N}}$$
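A minimal Python sketch of this distribution follows, using the same symbols (n good items, m bad items, N draws). The enrichment P-value is taken here as the upper tail P(X ≥ i), which is the usual choice for enrichment analysis but is an assumption about ENRICH; the function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(i, n, m, N):
    """P(X = i): exactly i good items among N draws from n good and m bad items."""
    return comb(n, i) * comb(m, N - i) / comb(n + m, N)

def hypergeom_p_value(i, n, m, N):
    """Upper-tail probability P(X >= i), used as an enrichment P-value."""
    return sum(hypergeom_pmf(j, n, m, N) for j in range(i, min(n, N) + 1))

# Example: 5 good and 5 bad items, 5 draws, at least 4 good ones observed.
print(hypergeom_p_value(4, n=5, m=5, N=5))
```

In this example the result is 26/252: 25 ways to draw exactly four good items plus one way to draw five, out of 252 possible draws.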

2.4.2 EnrichFlo – Performs the test specified by the user on two matrices containing decimal values. There are a few possible tests for decimal-valued matrices, which are briefly described below.

a. Paired Student's t Test
A parametric test to compare two paired groups. Given two paired sets X_i and Y_i of n measured values, the paired t-test determines whether they differ from each other in a significant way, under the assumption that the paired differences are independent and identically normally distributed [2].
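The test statistic of the paired t-test is the mean of the paired differences divided by its standard error. The Python sketch below computes only this statistic; converting it to a P-value requires the t distribution with n-1 degrees of freedom, which is not in the Python standard library and is left out. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(x, y):
    """t = mean(d) / (sd(d) / sqrt(n)) for the paired differences d = x - y.
    Compare against a t distribution with n - 1 degrees of freedom."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

t = paired_t_statistic([2, 4, 6, 8], [1, 2, 3, 4])
print(t)
```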

Correlations – Correlation is a measure of the relation between two or more variables [5]. Correlation coefficients range from -1 to +1. A value of -1 represents a perfect negative correlation, while a value of +1 represents a perfect positive correlation. A value of 0 represents a lack of correlation. We implemented two methods in ENRICH for calculating a correlation coefficient, Pearson and Spearman, both described below.

b. Pearson Correlation Test
The Pearson correlation coefficient measures the linear relationship between two variables x and y with n paired values each, and is calculated as follows:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

It determines the extent to which the values of the two variables are proportional to each other. That is, the correlation is high if the relationship can be ‘summarized’ by a straight line.

c. Spearman Correlation Rank Test
The Spearman Correlation Rank Test is a nonparametric correlation test. The Spearman rank correlation can be thought of as the regular Pearson correlation coefficient, except that it is computed from the ranks of the values rather than from the values themselves.
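The relation between the two coefficients can be shown directly in Python: Spearman's coefficient is Pearson's coefficient applied to ranks. This is a sketch with hypothetical function names; ties receive the average of their rank positions, which is a common convention but an assumption here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x, y = [1, 2, 3, 4], [1, 4, 9, 16]
print(pearson(x, y), spearman(x, y))
```

For this monotone but non-linear example, Spearman's coefficient is exactly 1 while Pearson's is slightly below 1.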

2.4.3 EnrichAnno – Performs the test specified by the user on two matrices: matrix A containing binary values, and matrix B containing decimal values. As in ‘EnrichFlo’ and ‘EnrichBin’, ‘EnrichAnno’ performs the statistical test for every pair of columns (vectors): a binary vector y from A and a decimal vector x from B. Every vector x is divided into two independent groups, according to the matching binary values in the binary vector y. There are three possible tests for two independent data sets, described below.

a. Unpaired Student's t Test
A parametric test to evaluate the difference in means between two groups. The P-value reported with a t-test represents the probability of error involved in accepting the research hypothesis about the existence of a difference. This is the probability of error associated with rejecting the hypothesis of no difference between the two categories of observations (corresponding to the groups) in the population when, in fact, that hypothesis is true [2].
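The following Python sketch shows how a decimal vector can be split by a binary vector and compared with the classical pooled-variance Student statistic. The function name is hypothetical, and the pooled-variance form is an assumption (the paper does not say whether ENRICH assumes equal variances). Only the statistic is computed; the P-value would come from a t distribution with n1 + n0 - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def unpaired_t_statistic(values, labels):
    """Split `values` by the binary `labels` and return the pooled-variance
    Student t statistic for the difference between the two group means."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    n1, n0 = len(g1), len(g0)
    pooled = ((n1 - 1) * variance(g1) + (n0 - 1) * variance(g0)) / (n1 + n0 - 2)
    return (mean(g1) - mean(g0)) / sqrt(pooled * (1 / n1 + 1 / n0))

t = unpaired_t_statistic([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(t)
```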

b. Wilcoxon Rank-sum Test (Mann-Whitney-Wilcoxon)
A nonparametric alternative to the unpaired t-test. This test uses only the ordering of the observations, not their magnitudes. The observations of the two groups are pooled and ranked from smallest to largest. The test then adds all the ranks associated with one of the groups, giving the test statistic [1].
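For two independent groups, the rank-sum statistic W of a group of size n1 and the Mann-Whitney U statistic are equivalent: W = U + n1(n1+1)/2. The Python sketch below computes U by direct pair counting, which avoids explicit ranking; the function name is hypothetical and no P-value is computed.

```python
def mann_whitney_u(group1, group0):
    """U statistic of group1: number of pairs (a, b), a from group1 and b from
    group0, with a > b; tied pairs count one half."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group1
        for b in group0
    )

print(mann_whitney_u([4, 5, 6], [1, 2, 3]))  # complete separation: n1 * n0
print(mann_whitney_u([1, 3], [2, 3]))        # one win and one tie
```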

c. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test (KS test) tries to determine whether two datasets differ significantly. This test has the advantage of making no assumption about the distribution of the data: it is non-parametric and distribution-free. The KS test is only appropriate for continuous distributions, such as the normal distribution. It is based on the empirical cumulative distribution function (ECDF) [1]. To compare two empirical cumulative distributions, S_N(x) containing N events and S_M(x) containing M events, the statistic D_MN is calculated:

$$D_{MN} = \max_{x} \left| S_N(x) - S_M(x) \right|$$
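The two-sample KS statistic is the largest vertical gap between the two empirical cumulative distribution functions. A minimal Python sketch, with a hypothetical function name and without the P-value computation, is:

```python
def ks_statistic(sample_n, sample_m):
    """Largest absolute difference between the two empirical CDFs."""
    def ecdf(sample, x):
        # Fraction of the sample that is <= x.
        return sum(1 for v in sample if v <= x) / len(sample)

    # The ECDFs only change at observed values, so checking those suffices.
    points = sorted(set(sample_n) | set(sample_m))
    return max(abs(ecdf(sample_n, x) - ecdf(sample_m, x)) for x in points)

print(ks_statistic([1, 2, 3], [4, 5, 6]))  # fully separated samples
print(ks_statistic([1, 2], [2, 3]))        # partially overlapping samples
```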

2.5 Multiple Hypothesis Correction

The statistical analysis that ENRICH performs on a data set typically involves not just a single hypothesis, but many. For any particular test, a pre-set probability α of a type-1 error (i.e., a false positive, rejecting the null hypothesis when in fact it is true) is assigned. The problem of multiple comparisons is that we would like to control the false positive rate not just for any single test but also for the entire collection of tests that makes up our experiment. Multiple hypothesis correction methods attempt to keep the overall chance of getting false positives at the same level (e.g. 0.05). This is done by the ENRICH function ‘Refine’, which refines a P-value matrix with a cutoff specified by the user. The refinement is performed according to one of two methods specified by the user: the Bonferroni correction or FDR.

2.5.1 The Bonferroni correction

The Bonferroni correction is a multiple-comparison correction used when several dependent or independent statistical tests are performed simultaneously. If a particular outcome of an experiment is unlikely to happen, repeating the experiment multiple times increases the probability that the outcome appears at least once [4]. While a given α value may be appropriate for each individual comparison, it is not appropriate for the set of all comparisons. In order to avoid a large number of false positives, the α value needs to be lowered to account for the number of comparisons being performed. The Bonferroni correction tests each null hypothesis, independently of the outcome of the others, at level α/m, where m is the number of tests performed.
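In code, the Bonferroni correction is a single comparison per test against α divided by the number of tests. The Python sketch below is illustrative; the function name is hypothetical and the P-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Mark each test as significant if its P-value is at most alpha / m,
    where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three tests: the per-test threshold becomes 0.05 / 3, about 0.0167.
print(bonferroni([0.001, 0.02, 0.04]))
```

Only the first test remains significant, although all three P-values are below the uncorrected level of 0.05.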

2.5.2 False Discovery Rate Correction

The FDR is the fraction of false positives among all tests declared significant. The motivation for using the FDR is that we may be running a very large number of tests, with those declared significant being subjected to further study. An example is searching for differentially expressed genes in a certain microarray experiment.

The set of all genes in this experiment is obviously huge, and we want to find the significantly differentially expressed genes. The idea is that the statistical procedure results in a significant enrichment of differentially expressed genes, while controlling the fraction of false positives within the enriched set by specifying a value for the FDR. Choosing an FDR of 5% means that (on average) 5% of the genes we picked as significant are actually false positives, and 95% of the genes declared significant do indeed have differential expression. Hence, screening genes with an FDR of 5% results in a significant enrichment of genes that are truly differentially expressed [6].

Suppose a total of N hypotheses are tested, K of which are judged significant (by the criteria being used for each test). If we had complete knowledge, we would know that for n of the hypotheses the null hypothesis is true and for m = N-n the alternative hypothesis is true, and we might find that F of the true nulls were called significant, while T of the true alternatives were called significant, as can be seen in table 1.

                    Called significant    Called not significant    Total
Null true           F                     n-F                       n
Alternative true    T                     m-T                       m
Total               K                     N-K                       N

Table 1. The FDR multiple hypothesis correction method

For this experiment, the false discovery rate is the fraction of tests called significant that are actually true nulls, FDR = F/K. Since K = F + T (table 1), this is:

$$\mathrm{FDR} = \frac{F}{F + T}$$
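In practice F is unknown, so the FDR is controlled by a procedure applied to the P-values. The Python sketch below implements the Benjamini-Hochberg step-up procedure from reference [6]. Whether the ENRICH ‘Refine’ function implements exactly this procedure is an assumption; the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: sort the m P-values, find the
    largest rank r with p_(r) <= r * q / m, and call the r smallest P-values
    significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
print(benjamini_hochberg([0.03, 0.035, 0.036, 0.9]))
```

The second call shows the step-up behavior: the two smallest P-values miss their own thresholds, but because the third one passes, all three are called significant.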

3 RESULTS

In order to test the use of ENRICH, we decided to try to create a partial regulation network for the yeast S. cerevisiae. We picked a few key biological processes in the cell's life, namely response, metabolism and ubiquitination, and used ENRICH as described in figure 1 in order to gain insight into the transcription factors that are involved in these processes.

The process shown in figure 2 uses two sets of data represented in two binary matrices. The first matrix is the result of a ChIP analysis done in Rick Young’s lab, which shows, for every known yeast transcription factor, all the genes it binds. The other matrix is a Gene Ontology (GO) matrix, representing for each gene the GO annotations that it holds.

Using these two matrices, ENRICH creates a new matrix of p-values, where each cell holds the p-value obtained from performing a statistical significance test (such as a HyperGeometric test) over two columns from the previous matrices. The p-values are then refined according to the multiple hypothesis correction, resulting in a decimal matrix, which is then transformed into a binary one using a significance threshold. The resulting matrix has a binary value for every pair of transcription factor and GO annotation: one if there is a strong enrichment between them, and zero if not.

Then, ENRICH is used to select all the GO annotations related to a certain process, e.g. response. The GO annotations are selected such that there is minimal overlap between them (avoiding, for example, both “response to stress” and “response to stress conditions”), and the selected GO annotations and the enriched transcription factors are then visualized by DOT/NEATO as a directed graph (or network).

The resulting networks can be seen in figures 3 and 4. A closer look reveals a set of transcription factors that ENRICH found to be associated with each process; each transcription factor is also associated with a sub-process within that process.

In order to evaluate the quality of ENRICH's predictions of sub-processes and transcription factors, we look at the predictions for the following processes.

3.1.1. Case study of results for Response process

MSN2 is a transcriptional activator related to Msn4p; it is activated in stress conditions and binds DNA at stress response elements of responsive genes, inducing gene expression [7,8]. ENRICH predicted it to be associated with two sub-processes of response, namely response to stimulus and response to stress, which is reassuring.

MCM1 is a transcription factor involved in cell-type-specific transcription and pheromone response [9-11], and ENRICH predicted it to be involved in two response sub-processes that involve the response to pheromone.

STE12, TEC1 and DIG1, which were predicted by ENRICH to be involved in response to pheromone, are known to induce mating and growth in response to pheromone induction [12,13].

MBP1 is a known transcription factor involved in regulation of cell cycle progression from G1 to S phase [14], which usually indicates a crucial point of regulation, and ENRICH predicted it to be involved in response to DNA damage, which seems very reasonable.

CAD1 is a transcription factor known to be involved in iron metabolism [15]. Considering that iron is an inorganic substance, it is consistent that ENRICH placed CAD1 as connected to response to inorganic compounds.

YAP7 is a putative transcription factor of unknown role in yeast, yet ENRICH strongly places it as involved in response to chemical and abiotic stimulus. Considering that all the other transcription factors we discussed were well categorized by ENRICH, we believe this case to be the same, although further investigation is necessary.

3.1.2. Case study of results for Metabolism process

A representative set of transcription factors from the results of the metabolism process gives us some encouragement about the rest of them. GCR1 and GCR2 are known transcription factors involved in glycolysis [16-18]. THI2 is known to be involved in the biosynthesis of thiamine [19,20], MET4, MET31 and MET32 in the biogenesis of sulfur amino acids [21-23], and GAT1 and GAT3 in nitrogen compound metabolism [24-26].

3.1.3. Case study of results for Ubiquitin related processes

In the case of the four proteins ENRICH found to be involved in the ubiquitin cycle, things are less decisive. Apart from RPN4, which is a transcription factor that stimulates expression of proteasome genes [27,28] and is therefore highly connected to the ubiquitin cycle, the rest of the proteins seem less connected according to the literature. REB1 is an RNA polymerase enhancer protein [29], RCS1 is involved in iron utilization [30,31], and ADR1 is involved in the regulation of alcohol-related genes and peroxisomal proteins [32]. Yet these were the genes ENRICH found to be highly enriched with the ubiquitin annotation; we therefore suggest that each of them may play some role in the ubiquitin cycle.

3.2 Response versus localization

During our work we asked ourselves the following biological question: do mitochondrial genes react differently to heat shock conditions than peripheral genes do? In order to answer that question we used ENRICH in the way described in figure 5, where we use localization information and experimental annotations to obtain a result matrix showing, for various heat shock conditions, the reaction intensity of mitochondrial genes versus peripheral genes. The results can be seen in figure 6, where a dark cell represents a high p-value, meaning that there was no significant enrichment for those genes, and a red cell represents a high enrichment for that gene group under the relevant condition. The results show that genes located in the mitochondria react much more strongly to heat shock than genes located in the cell periphery. This is an interesting biological insight, gained very easily by using ENRICH with just a few commands. When looking at the literature on this subject, we did not find anything decisive, but the fact that there are mitochondrial HSPs (heat shock proteins) and the fact that the mitochondria are involved in apoptotic pathways perhaps strengthen the results. Further investigation of the issue remains to be done.


Figure 2. On the left are the two data sets used in the process: ChIP data and GO annotation data. A HyperGeometric test is done in order to check enrichment for every pair of columns. The result is a p-value matrix in which each cell, such as the yellow one, is the result of the test over the two columns marked in purple. Then, a multiple hypothesis correction is done, and using a significance threshold a binary matrix is created in which each cell indicates whether there is an enrichment between the two columns.

Figure 3. The regulatory network created by ENRICH (using Neato) for the Metabolism process. The transcription factors are filled with light green and the sub-processes are dark green.

Figure 4. The regulatory network created by ENRICH (using Neato) for the Response and Ubiquitin processes. The transcription factors are empty and the sub-processes are dark green.

Figure 5. On the left are the two data sets used in the process: experimental expression data and experimental localization data. An unpaired t-test is performed in order to check enrichment for every pair of columns, resulting in a p-value matrix. That matrix is then transformed to binary using a threshold, resulting in the top matrix, where each cell represents a specific experiment vs. a cell compartment. The yellow columns are the columns of the mitochondria and the cell periphery, which interest us. The bottom matrix is an experiment vs. conditions matrix, in which every row is a different experiment and every column is a different condition; for this query, we choose only the various heat shock conditions in this matrix. Then a HyperGeometric test is performed in order to find how enriched these cell compartments are with respect to heat shock conditions.

Figure 6. The resulting table of the process shown in figure 5. The Experiment column indicates which kind of heat shock condition was used in the related experiments; the Cell Periphery and Mitochondrion columns show the enrichment for genes located in the cell periphery and mitochondria, respectively. A red cell indicates high enrichment, whereas a black cell indicates low enrichment. From these results it is clear that the mitochondrial genes have a much more powerful response to heat shock conditions than do peripheral genes.

DISCUSSION

In this project we presented ENRICH, a program we created to aid researchers in performing various statistical significance tests. We spent a large portion of the work on writing and cleaning the code itself, trying to make it as convenient as possible. We then moved on to test our program on real data sets; we used various data types such as ChIP data from Rick Young’s lab, experimental expression data from David Botstein’s lab and others, localization data and more. Our first aim was to try to create a yeast regulation network, not of transcription factors and their targets, but rather of processes and the transcription factors involved in each process and its sub-processes.

The results and the way they were obtained show the power of ENRICH as an easy-to-use tool, which we feel can be very useful for gaining biological insights. It still lacks many features, among them visualization and built-in clustering methods. However, we strongly feel that it can be developed into a strong and useful computational tool for researchers.

ACKNOWLEDGEMENTS

We would like to thank Prof. Nir Friedman and his Ph.D. student Tommy Kaplan for guiding us through the project.

REFERENCES

[1] Siegel, S. Nonparametric Statistics for the Behavioral Sciences (international student edition).

[2] O'Mahony, M. Sensory Evaluation of Food: Statistical Methods and Procedures.

[3] Conover, W. J. (1998). Practical Nonparametric Statistics (3rd Ed.).

[4] Abdi, H. (2007). Bonferroni and Sidak corrections for multiple comparisons. In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage.

[5] Allison, D.B., G.L. Gadbury, M. Heo, J.R. Fernandez, C.-K. Lee, T.A. Prolla, and R. Weindruch. 2002. A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 39: 1-20.

[6] Benjamini, Y., and Hochberg, Y. 1995. Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. B 57: 289-300.

[7] Martinez-Pastor MT, et al. (1996) The Saccharomyces cerevisiae zinc finger proteins Msn2p and Msn4p are required for transcriptional induction through the stress response element (STRE). EMBO J 15(9): 2227-35.

[8] Gorner W, et al. (1998) Nuclear localization of the C2H2 zinc finger protein Msn2p is regulated by stress and protein kinase A activity. Genes Dev 12(4): 586-97.

[9] Passmore S, et al. (1989) A protein involved in minichromosome maintenance in yeast binds a transcriptional enhancer conserved in eukaryotes. Genes Dev 3(7): 921-35.

[10] Elble R and Tye BK (1991) Both activation and repression of a-mating-type-specific genes in yeast require transcription factor Mcm1. Proc Natl Acad Sci U S A 88(23): 10966-70.

[11] Lydall D, et al. (1991) A new role for MCM1 in yeast: cell cycle regulation of SWI5 transcription. Genes Dev 5(12B): 2405-19.

[12] Tedford K, et al. (1997) Regulation of the mating pheromone and invasive growth responses in yeast by two MAP kinase substrates. Curr Biol 7(4): 228-38.

[13] Liu H, et al. (1993) Elements of the yeast pheromone response pathway required for filamentous growth of diploids. Science 262(5140): 1741-4.

[14] Koch C, et al. (1993) A role for the transcription factors Mbp1 and Swi4 in progression from G1 to S phase. Science 261(5128): 1551-7.

[15] Lesuisse E and Labbe P (1995) Effects of cadmium and of YAP1 and CAD1/YAP2 genes on iron metabolism in the yeast Saccharomyces cerevisiae. Microbiology 141 (Pt 11): 2937-43.

[16] Uemura H and Jigami Y (1992) Role of GCR2 in transcriptional activation of yeast glycolytic genes. Mol Cell Biol 12(9): 3834-42.

[17] Holland MJ, et al. (1987) The GCR1 gene encodes a positive transcriptional regulator of the enolase and glyceraldehyde-3-phosphate dehydrogenase gene families in Saccharomyces cerevisiae. Mol Cell Biol 7(2): 813-20.

[18] GCR2, a new mutation affecting glycolytic gene expression in Saccharomyces cerevisiae. Mol Cell Biol 10(12): 6389-96.

[19] Nishimura H, et al. (1992) Cloning and characteristics of a positive regulatory gene, THI2 (PHO6), of thiamin biosynthesis in Saccharomyces cerevisiae. FEBS Lett 297(1-2): 155-8.

[20] Nosaka K, et al. (1994) Isolation and characterization of the THI6 gene encoding a bifunctional thiamin-phosphate pyrophosphorylase/hydroxyethylthiazole kinase from Saccharomyces cerevisiae. J Biol Chem 269(48): 30510-6.

[21] Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA. EMBO J 17(21): 6327-36.

[22] Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol 17(7): 3640-8.

[23] Blaiseau PL, et al. (1997) Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol 17(7): 3640-8.

[24] Kuruvilla FG, et al. (2001) Carbon- and nitrogen-quality signaling to translation are mediated by distinct GATA-type transcription factors. Proc Natl Acad Sci U S A 98(13): 7283-8.

[25] Cooper TG (2002) Transmitting the signal of excess nitrogen in Saccharomyces cerevisiae from the Tor proteins to the GATA factors: connecting the dots. FEMS Microbiol Rev 26(3): 223-38.

[26] Cox KH, et al. (1999) Genome-wide transcriptional analysis in S. cerevisiae by mini-array membrane hybridization. Yeast 15(8): 703-13.

[27] Ng DT, et al. (2000) The unfolded protein response regulates multiple aspects of secretory and membrane protein biogenesis and endoplasmic reticulum quality control. J Cell Biol 150(1): 77-88.

[28] Xie Y and Varshavsky A (2001) RPN4 is a ligand, substrate, and transcriptional regulator of the 26S proteasome: a negative feedback circuit. Proc Natl Acad Sci U S A 98(6): 3056-61.

[29] Morrow BE, et al. (1989) Proteins that bind to the yeast rDNA enhancer. J Biol Chem 264(15): 9061-8.

[30] Gil R, et al. (1991) RCS1, a gene involved in controlling cell size in Saccharomyces cerevisiae. Yeast 7(1): 1-14.

[31] Yamaguchi-Iwai Y, et al. (1996) Iron-regulated DNA binding by the AFT1 protein controls the iron regulon in yeast. EMBO J 15(13): 3377-84.

[32] Simon M, et al. (1991) The Saccharomyces cerevisiae ADR1 gene is a positive regulator of transcription of genes encoding peroxisomal proteins. Mol Cell Biol 11(2): 699-704.

APPENDIX A

We now describe some of the ENRICH functions in more detail.

ENRICH Orientation:
Help - Print the documentation of all functions, or of one specified function.
Whos - Print all variables currently in use by the user.
Load - Load a new file into the program and return a data handle for it.
Save - Save the data of a given handle to a file.

Mathematical Operations:
AddCon - Add a constant numeric value to every cell of the matrix.
Transpose - Turn rows into columns and vice versa.
LogScale - Convert all matrix values to a logarithmic scale in any base. Presenting data on a logarithmic scale can be helpful when the data covers a large range of values; the logarithm reduces this to a more manageable range.
Normal - Normalize all matrix values to the standard normal scale, i.e. so that the expected value is 0 and the variance is 1.
AvgR/AvgC - Calculate the average value of the rows/columns respectively.
Convert2bin - Convert the numeric values of a matrix to binary values using a given threshold (assigning 1 to values above the threshold and 0 to values below the threshold).
Negate - Reverse the sign of all the values in the matrix.
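As a rough illustration of the mathematical operations listed above, the following Python sketch applies them to a matrix held as a plain list of rows. It is not the ENRICH Perl implementation; the Python function names and signatures are hypothetical. Two behaviours are assumptions, since the text does not specify them: Normal is taken to standardize over the whole matrix rather than per row, and values exactly equal to the threshold are mapped to 0.

```python
import math
from statistics import mean, pstdev
from typing import List

Matrix = List[List[float]]  # assumed representation: a list of numeric rows


def add_con(m: Matrix, c: float) -> Matrix:
    """AddCon: add the constant c to every cell."""
    return [[v + c for v in row] for row in m]


def transpose(m: Matrix) -> Matrix:
    """Transpose: turn rows into columns and vice versa."""
    return [list(col) for col in zip(*m)]


def log_scale(m: Matrix, base: float = 2.0) -> Matrix:
    """LogScale: logarithm of every cell in the given base.
    Raises ValueError for non-positive values."""
    return [[math.log(v, base) for v in row] for row in m]


def normal(m: Matrix) -> Matrix:
    """Normal: rescale to mean 0 and variance 1.
    Assumption: statistics are taken over the whole matrix."""
    values = [v for row in m for v in row]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("all values are equal; cannot normalize")
    return [[(v - mu) / sigma for v in row] for row in m]


def avg_r(m: Matrix) -> List[float]:
    """AvgR: average of each row."""
    return [mean(row) for row in m]


def avg_c(m: Matrix) -> List[float]:
    """AvgC: average of each column."""
    return [mean(col) for col in zip(*m)]


def convert2bin(m: Matrix, threshold: float) -> List[List[int]]:
    """Convert2bin: 1 above the threshold, 0 otherwise.
    Assumption: a value equal to the threshold maps to 0."""
    return [[1 if v > threshold else 0 for v in row] for row in m]


def negate(m: Matrix) -> Matrix:
    """Negate: reverse the sign of every cell."""
    return [[-v for v in row] for row in m]
```

Each function returns a new matrix and leaves its input unchanged, so operations can be chained, for example `avg_r(log_scale(add_con(m, 1.0)))` to shift the values before taking logarithms.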

Data Queries:
SelectRowsByName/SelectColsByName - Select from the matrix only the rows/columns respectively whose headers either contain or exactly match a specific word.
SelectRowsBySum - Select from the matrix only rows whose numeric values sum to more than a certain threshold.
GetGenes - Return all the matrix row headers.
GetRowsNum/GetColsNum - Return the number of rows/columns respectively of the matrix.
ExtractPvals - Extract and write to a file the column headers that received a P-value below a specified threshold.
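The data queries described above can be sketched in Python as follows. This is an illustrative sketch rather than the ENRICH Perl code: the `LabelledMatrix` type, the function signatures, the `exact` flag, and the choice to pass P-values as a list parallel to the column headers are all hypothetical, since the text does not say how ENRICH stores them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabelledMatrix:
    """Assumed representation: row headers (genes), column headers, numeric cells."""
    row_names: List[str]
    col_names: List[str]
    values: List[List[float]]


def _matches(header: str, word: str, exact: bool) -> bool:
    return header == word if exact else word in header


def select_rows_by_name(m: LabelledMatrix, word: str, exact: bool = False) -> LabelledMatrix:
    """SelectRowsByName: keep rows whose header contains (or exactly equals) word."""
    keep = [i for i, name in enumerate(m.row_names) if _matches(name, word, exact)]
    return LabelledMatrix([m.row_names[i] for i in keep],
                          list(m.col_names),
                          [list(m.values[i]) for i in keep])


def select_cols_by_name(m: LabelledMatrix, word: str, exact: bool = False) -> LabelledMatrix:
    """SelectColsByName: keep columns whose header contains (or exactly equals) word."""
    keep = [j for j, name in enumerate(m.col_names) if _matches(name, word, exact)]
    return LabelledMatrix(list(m.row_names),
                          [m.col_names[j] for j in keep],
                          [[row[j] for j in keep] for row in m.values])


def select_rows_by_sum(m: LabelledMatrix, threshold: float) -> LabelledMatrix:
    """SelectRowsBySum: keep rows whose values sum to more than threshold."""
    keep = [i for i, row in enumerate(m.values) if sum(row) > threshold]
    return LabelledMatrix([m.row_names[i] for i in keep],
                          list(m.col_names),
                          [list(m.values[i]) for i in keep])


def get_genes(m: LabelledMatrix) -> List[str]:
    """GetGenes: return all row headers."""
    return list(m.row_names)


def get_rows_num(m: LabelledMatrix) -> int:
    return len(m.row_names)


def get_cols_num(m: LabelledMatrix) -> int:
    return len(m.col_names)


def extract_pvals(col_names: List[str], pvals: List[float],
                  threshold: float, path: str) -> List[str]:
    """ExtractPvals: write the column headers with P-value below threshold
    to path, one per line, and return them."""
    hits = [name for name, p in zip(col_names, pvals) if p < threshold]
    with open(path, "w") as fh:
        for name in hits:
            fh.write(name + "\n")
    return hits
```

For example, with row headers `MSN2`, `MSN4` and `GCR1`, the call `select_rows_by_name(m, "MSN")` keeps the first two rows, while the same call with `exact=True` keeps none, because no header equals `MSN` exactly.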

APPENDIX B

Additional information and software, including the ENRICH Perl script, the EnrichFunctions Perl module, a running example, and further results and figures, are available at the ENRICH web site: http://www.huji.ac.il/~manor460/enrich
