
- 1 -

Version 4.0

RNA-Seq & Microarray

DEG Analysis Manual v4.1

- 2 -

<Contents>

1. Differentially Expressed Gene (DEG) Analysis (ExDEGA v.1.6.6)

2. Clustering Heatmap (MeV Software)

3. Pathway Analysis (KEGG Mapper)

4. Functional Annotation Analysis (DAVID)

5. Gene Set Enrichment Analysis (GSEA) Analysis (MSigDB)

6. Protein-Protein Interaction (PPI) Analysis (STRING)

- 3 -

1. Differentially Expressed Gene (DEG) Analysis (ExDEGA v.1.6.6)

EBIOGEN provides customers with a dedicated ExDEGA (Excel-based Differentially Expressed Gene Analysis) data report for NGS, microarray, and antibody array services. ExDEGA is an Excel-based data analysis tool that includes convenient functions such as data mining and graphic visualization. It is designed to be user-friendly for researchers who are unfamiliar with data analysis and Excel, and it is updated continuously.

The ExDEGA setup file and data are provided once the NGS, microarray, or antibody array experiment is complete. Unzip the ExDEGA.zip file and run ExDEGASetup.exe (Figure 1-1); the ExDEGA data report then opens automatically. If other Excel files are already open, close them all and open the data report again.

Figure 1-1. ExDEGA Set Up

- 4 -

In the ExDEGA Report.xls file, the Gene Ontology (GO) analysis tool is on the left, the mRNA expression data is in the middle, and the Differentially Expressed Gene (DEG) analysis tool is on the right (Figure 1-2). Common GOs are preconfigured in Gene Category, and they can be edited by adding or modifying a gene list in Gene Category Settings. Using the Gene Category and DEG analysis functions together narrows the data to meaningful genes quickly. DEG analysis lets the user select significantly differentially expressed genes and visualize gene expression data more effectively. With these functions, NGS, microarray, or antibody array data can be analyzed easily in ExDEGA.

Figure 1-2. mRNA Expression Data Format Made by EBIOGEN

- 5 -

1-1. Gene category

Functional grouping is an efficient way to analyze mRNA expression data from tens of thousands of genes. Most biologists use the gene ontology (GO) database and pathway databases for biological function analysis. The 15 GOs pre-established in Gene Category are commonly studied in biology. For example, to analyze genes related to aging, select ‘Aging’ in the Gene Category (Figure 1-3). Multiple categories can be selected at once: ‘AND’ keeps genes that belong to every selected GO, and ‘OR’ keeps genes that belong to at least one of them.

Figure 1-3. Gene Ontology (Aging) Selection
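The ‘AND’ and ‘OR’ options described above correspond to set intersection and set union. The following is a minimal Python sketch of that idea, not ExDEGA's own implementation; the category contents and gene names (GeneA, GeneB, ...) are hypothetical placeholders.

```python
# Hypothetical category membership; gene names are placeholders only.
categories = {
    "Aging": {"GeneA", "GeneB", "GeneC"},
    "Apoptosis": {"GeneA", "GeneD", "GeneE"},
}


def filter_genes(selected, mode):
    """Return genes in all selected categories (AND) or in any of them (OR)."""
    gene_sets = [categories[name] for name in selected]
    if mode == "AND":
        return set.intersection(*gene_sets)
    if mode == "OR":
        return set.union(*gene_sets)
    raise ValueError("mode must be 'AND' or 'OR'")


and_genes = filter_genes(["Aging", "Apoptosis"], "AND")
or_genes = filter_genes(["Aging", "Apoptosis"], "OR")
print(sorted(and_genes))  # ['GeneA']
print(sorted(or_genes))   # ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE']
```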

If the GO you are interested in is not in Gene Category, other GOs can be added from the QuickGO site. To modify the Gene Category settings, first click the ‘View All Data’ button, then click the ‘Gene Category Settings’ button (Figure 1-4). Clicking the ‘?’ button opens instructions on how to add a GO.

Figure 1-4. Gene Category Settings

- 6 -

If you have a gene list for another functional group, you can create a new gene category manually: 1) click the ‘Gene Category Settings’ button, 2) select ‘New’, 3) enter a name for the new gene category, 4) enter (or copy and paste) the desired gene list, and 5) click ‘OK’ to save it (Figure 1-5-a, b).

Figure 1-5-a. Adding Genes to Make a New Gene Category

Figure 1-5-b. Adding Genes to Make a New Gene Category

- 7 -

1-2. Significant Gene selection

In the DEG Analysis section on the right, the ‘Significant Gene Selection’ window filters the full result set down to the genes that differ significantly between control and test samples. In Figure 1-6, the fold change is set to 2.00, the normalized data (log2) to 4.00, and the t-test p-value to 0.05, which filters 59 genes out of 24,496 in total. The fold change, p-value, and normalized data (log2) thresholds can be adjusted to suit the results. The p-value is calculated only for data with replicates.

The ‘AND’ and ‘OR’ functions in Significant Gene Selection follow the same concept as in section 1-1: ‘AND’ keeps genes that are significantly different in every selected sample comparison, and ‘OR’ keeps genes that are significantly different in at least one of them.

Figure 1-6. Selection of Significantly Expressed Genes
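The three thresholds in Significant Gene Selection can be read as a single filter. The sketch below illustrates that logic in Python using the cutoffs from Figure 1-6 (fold change 2.00, normalized data (log2) 4.00, p-value 0.05). The data values are made up, and two details are assumptions rather than documented ExDEGA behavior: a fold change of 1/2 or less is treated as a 2-fold decrease, and the expression cutoff is applied to the higher of the two samples.

```python
# (gene_id, fold_change, log2_control, log2_test, p_value); made-up values.
genes = [
    ("gene1", 2.5, 3.8, 5.1, 0.01),
    ("gene2", 0.4, 6.0, 4.7, 0.03),
    ("gene3", 3.0, 1.0, 2.6, 0.02),   # fails the expression-level cutoff
    ("gene4", 2.2, 5.0, 6.1, 0.20),   # fails the p-value cutoff
    ("gene5", 1.3, 7.0, 7.4, 0.001),  # fails the fold-change cutoff
]


def is_significant(fc, log2_a, log2_b, p, fc_cut=2.0, expr_cut=4.0, p_cut=0.05):
    """Apply fold-change, expression-level, and p-value cutoffs together."""
    changed = fc >= fc_cut or fc <= 1.0 / fc_cut  # assumed: up or down by fc_cut
    expressed = max(log2_a, log2_b) >= expr_cut   # assumed: either sample may pass
    return changed and expressed and p < p_cut


selected = [row[0] for row in genes if is_significant(*row[1:])]
print(selected)  # ['gene1', 'gene2']
```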

- 8 -

Gene Category and Significant Gene Selection can be used together. If you select Cell differentiation in the Gene Category as in Figure 1-7, only 5 genes from the A and B samples remain. These 5 genes are the significantly differentially expressed genes related to cell differentiation.

Figure 1-7. Significantly Differentially Expressed Genes in Cell Differentiation

To visualize the results after setting up Significant Gene Selection, click the 'Filter Gene Category Chart' button. A pie graph and a bar graph pop up after a moment, showing the ratio and number of differentially expressed genes in each GO. Clicking a part of either graph automatically filters the corresponding up- or down-regulated genes in the ExDEGA spreadsheet (Figure 1-8). The numbers above the bars in the bar graph are gene counts.

Figure 1-8. Gene Category Chart

- 9 -

1-3. Analysis Graph

A scatter plot, volcano plot, and Venn diagram can be easily drawn through Analysis Graph (Figure

1-9).

Figure 1-9. Analysis Graph Tool

1-3-1. Scatter plot

For a scatter plot, first choose two variables and set the Fold Threshold Line value, then click ‘Graph View’. The scatter plot is created automatically for the selected condition. In the plot, 1) the x- and y-axes are relative expression levels, 2) red dots are genes with a higher expression level on the y-axis than on the x-axis, and 3) green dots are genes with a higher expression level on the x-axis than on the y-axis (Figure 1-10). Clicking a spot in the plot displays its gene symbol; right-clicking removes it. To display multiple genes at the same time, copy the corresponding gene ID list into the ‘Gene Select (ID Input)’ window and click ‘Add’.

Figure 1-10. Analysis Graph Tool – Scatter Plot

- 10 -

1-3-2. Volcano plot

The Volcano Plot works almost the same way as the Scatter Plot. Select the variables, set the Fold Threshold Line value and the p-value, then click ‘Graph View’. The volcano plot is created automatically for the selected comparison condition on the left. In the plot, 1) the x-axis is the fold change on a log2 scale, 2) the y-axis is the p-value on a -log10 scale, and 3) red or green dots are genes that changed significantly under the configured thresholds (Figure 1-11). Clicking a spot in the plot displays its gene symbol; right-clicking removes it. To display multiple genes at the same time, copy the corresponding gene ID list into the ‘Gene Select (ID Input)’ window and click ‘Add’.

Figure 1-11. Analysis Graph Tool – Volcano Plot
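The volcano plot axes described above can be reproduced by hand for a single gene. This is a small Python sketch under stated assumptions: the function name and the "up"/"down" labels are illustrative, and treating a fold change of 1/2 or less as a significant decrease is an assumption, not documented ExDEGA behavior.

```python
import math


def volcano_point(fold_change, p_value, fc_cut=2.0, p_cut=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    x = math.log2(fold_change)   # x-axis: fold change on a log2 scale
    y = -math.log10(p_value)     # y-axis: p-value on a -log10 scale
    if p_value < p_cut and fold_change >= fc_cut:
        label = "up"
    elif p_value < p_cut and fold_change <= 1.0 / fc_cut:
        label = "down"
    else:
        label = "not significant"
    return x, y, label


x, y, label = volcano_point(4.0, 0.001)
print(x, y, label)
```

A 4-fold increase with p = 0.001 lands at about x = 2, y = 3 and is labeled "up"; a gene with a large p-value stays "not significant" regardless of its fold change.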

- 11 -

1-3-3. Venn diagram

Venn diagrams for all possible logical relations between 2, 3 or 4 samples can be created. To draw

a Venn Diagram, select Sample Comparison first. Then set the Fold Change value and p-value and

click the ‘Diagram View’ (Figure 1-12). Up to 4 sample comparisons can be selected.

Figure 1-12. Analysis Graph Tool – Venn Diagram

The numbers in the Venn diagram results (Figure 1-13) are read as follows, based on the preset conditions: 1) an italic number is the number of up-regulated genes, 2) a red number is the number of genes regulated in opposite directions between sample comparisons (contra-regulated), and 3) an underlined number is the number of down-regulated genes.

Figure 1-13. An Example of Up, Down, and Contra-regulated in Venn Diagram
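The up, down, and contra-regulated counts for an overlapping region can be sketched with sets. The Python example below is an illustration only; the gene names and direction calls are hypothetical, and it covers just the two-comparison case.

```python
# Hypothetical direction calls per comparison: +1 = up, -1 = down.
calls = {
    "B/A": {"g1": 1, "g2": 1, "g3": -1, "g4": -1},
    "C/A": {"g1": 1, "g3": 1, "g4": -1, "g5": -1},
}


def classify_overlap(first, second):
    """Split genes shared by two comparisons into up, down, and contra-regulated."""
    shared = first.keys() & second.keys()
    up = {g for g in shared if first[g] > 0 and second[g] > 0}
    down = {g for g in shared if first[g] < 0 and second[g] < 0}
    contra = shared - up - down  # opposite directions in the two comparisons
    return up, down, contra


up, down, contra = classify_overlap(calls["B/A"], calls["C/A"])
print(len(up), len(down), len(contra))  # 1 1 1
```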

- 12 -

To see which genes fall in a region of the Venn diagram, place the mouse cursor on that region and right-click. For example, to see the genes that are up-regulated only in B/A, right-click the B/A area of the Venn diagram and select ‘Up-regulated’. Three genes are then filtered in the Excel spreadsheet (Figure 1-14).

Figure 1-14. Filtering 2-fold Up-regulated Gene List in Venn Diagram

All images produced by ExDEGA can be saved by right-clicking the plot or Venn diagram and selecting 'Save image' (Figure 1-15).

Figure 1-15. Saving Image

- 13 -

1-4. Clustering Heatmap Support

DEG Analysis in ExDEGA supports data mining through Significant Gene Selection or the Venn diagram, and it makes it easy to create a Clustering Heatmap for the filtered gene list.

MeV is the recommended Clustering Heatmap program. ExDEGA can automatically generate an input file that MeV can import; how to perform clustering in MeV is described in 2. Clustering Heatmap (MeV Software) on page 15.

Two types of data can be exported from ExDEGA as Clustering Heatmap input for the filtered gene list (Figure 1-16). To use fold change values, check ‘Fold change’ under Type and the sample comparisons under Export Data Select, then click ‘Data Export’ and save the result as a tab-delimited text file. To use expression values (normalized data), check ‘Z-score’ and follow the same steps. The z-score, which indicates how many standard deviations a value lies from the mean, is available only when there are three or more samples. The formula for the standard score (z-score) is given below:

Z-score = {Normalized data (log10) – average of Normalized data (log10)} / standard deviation of Normalized data (log10)

Figure 1-16. Clustering Heatmap Support
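The z-score formula above can be checked with a few lines of Python. This is a sketch, not ExDEGA's code: the expression values are made up, and the use of the sample standard deviation (n - 1) is an assumption, since the manual does not say which form is used. The sketch also shows that the logarithm base does not affect the result, because changing the base only rescales every value by the same positive constant.

```python
import math
import statistics


def z_scores(values):
    """Z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1): an assumption
    return [(v - mean) / sd for v in values]


raw = [100.0, 200.0, 400.0]  # made-up expression values for three samples
z_log10 = z_scores([math.log10(v) for v in raw])
z_log2 = z_scores([math.log2(v) for v in raw])
print(z_log10)
print(z_log2)
```

Both lists come out as approximately -1, 0, and 1, so a z-score computed from log2-normalized data matches one computed from log10 values.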

- 14 -

1-5. Selected Gene Plot & Gene Search

The ‘Selected Gene Plot’ tool draws a graph of the expression patterns of selected genes. It accepts genes filtered by Significant Gene Selection as well as genes chosen by the researcher. To create the plot, copy the ID list of the selected genes, paste it into the Selected Gene Plot window, and click ‘Expression Plot View’. Two plots pop up: one showing the normalized data (log2) and one showing the fold change (log2) values (Figure 1-17).

‘Gene Search’ finds genes by keyword. For example, entering ‘insulin’ in the gene search box automatically finds and filters all genes that contain the word ‘insulin’ in the Excel data sheet (Figure 1-18).

Figure 1-17. Gene Graph

Figure 1-18. Genes Related to Insulin

- 15 -

2. Clustering Heatmap (MeV Software)

MeV, developed by the Dana-Farber Cancer Institute in the United States, is a free analysis program for microarray and mRNA-seq data. It provides clustering and statistical analyses (K-means clustering, hierarchical clustering, t-test, Significance Analysis of Microarrays (SAM), Gene Set Enrichment Analysis, and EASE). Visit the web site to download the latest programs and manuals (MeV software download web site: https://sourceforge.net/projects/mev-tm4/).

To use MeV, first 1) download MeV, 2) unzip the file, and 3) run the installer, ‘MeV’ or ‘TMEV’ (Figure 2-1). Then confirm that three windows appear when MeV opens, as shown in Figure 2-2. Data analysis is performed in the ‘Multiple Array Viewer’. To create one, click ‘File’ and ‘New’ on the ‘MultiExperiment Viewer’ bar. Several Multiple Array Viewers can be created.

Figure 2-1. Installation File for MeV Program


- 16 -

Figure 2-2. MeV Program Windows

Clustering analysis can be performed with MeV. The input file saved from ‘Clustering Heatmap Support’ (page 13) can be used directly. Alternatively, you can prepare your own list of genes for clustering: open a new Excel file, then copy and paste the gene names with their fold change or normalized values. The file must be saved as a ‘tab-delimited text file’ (Figure 2-3) and limited to 20,000 genes. Depending on the number of samples, even about 15,000 genes may fail to be analyzed.

Figure 2-3. An Example of Data Format
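A tab-delimited text file can also be produced outside Excel. The Python sketch below writes one with the standard csv module; the column layout (a gene name column followed by one value column per comparison), the header names, the values, and the file name `mev_input.txt` are all assumptions for illustration and should be matched to Figure 2-3.

```python
import csv
import io

# Made-up rows: gene name followed by one value per sample comparison.
rows = [
    ("gene1", 1.32, -0.58),
    ("gene2", -2.10, 0.75),
]


def to_tab_delimited(header, data_rows):
    """Render a header and data rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(data_rows)
    return buffer.getvalue()


text = to_tab_delimited(("Gene", "B/A", "C/A"), rows)
print(text)

# To save the file for loading into MeV (hypothetical file name):
# with open("mev_input.txt", "w", newline="") as handle:
#     handle.write(text)
```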

- 17 -

After the input data is saved, click ‘File’ and ‘Load Data’ on the ‘Multiple Array Viewer’ of the MeV

program (Figure 2-4). Click ‘Browse’ and select the input file.

Figure 2-4. Data Uploading Method

Click ‘Analysis’, ‘Clustering’, and ‘HCL’ (Figure 2-5).

Figure 2-5. Hierarchical Clustering Selection

- 18 -

Several options can be selected for the clustering analysis (Figure 2-6). ‘Gene Tree’ clusters genes that have similar fold change or normalized values. ‘Sample Tree’ clusters samples that show similar gene expression patterns. Among the many options, ‘Euclidean Distance’ and ‘Average linkage clustering’ are widely used in research. When the setup is complete, click ‘OK’.

Figure 2-6. Hierarchical Clustering Options
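To make the ‘Euclidean Distance’ and ‘Average linkage clustering’ options concrete, here is a small pure-Python sketch of agglomerative clustering on three made-up gene profiles. It illustrates the general technique only and is not MeV's implementation; the gene names and values are hypothetical.

```python
import math
from itertools import combinations

# Made-up expression profiles (one value per sample) for three genes.
profiles = {
    "g1": (2.0, 2.1, -1.0),
    "g2": (1.9, 2.0, -1.1),
    "g3": (-2.0, -1.8, 1.0),
}


def average_linkage(cluster_a, cluster_b):
    """Mean Euclidean distance over all cross-cluster gene pairs."""
    dists = [math.dist(profiles[a], profiles[b]) for a in cluster_a for b in cluster_b]
    return sum(dists) / len(dists)


def hierarchical_clustering(names):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in names]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        history.append((a, b, average_linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


merges = hierarchical_clustering(list(profiles))
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The two similar profiles (g1 and g2) merge first at a small distance, and g3 joins last at a much larger one, which is what the branch lengths in the tree diagram represent.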

- 19 -

The HCL clustering result appears on the left side, and clicking ‘HCL Tree’ shows the HCL tree on the right side (Figure 2-7). In the example in Figure 2-7, the top tree diagram is the result of sample clustering and the left tree diagram is the result of gene clustering. Each tree diagram has a distance scale bar for measuring the length of the tree. A shorter distance means that the expression patterns of the genes or samples are more similar, whereas a longer distance means that they are more different.

Figure 2-7. A Result of Hierarchical Clustering

The size and color of the clustering image can be modified (Figure 2-8).

Figure 2-8. Clustering Size Option

- 20 -

The range of the color scale bar (lower limit, midpoint value, and upper limit) can be set by clicking ‘Display’ and ‘Set Color Scale Limits’ (Figure 2-9). Generally, the lower and upper limits are set to the same magnitude and the midpoint value is set to 0, as illustrated in Figure 2-9. Up-regulated genes are shown in red and down-regulated genes in blue.

Figure 2-9. Color Scale Option

To save the image, click ‘File’ and ‘Save Image’. The file name must include a file extension, such as .jpg (Figure 2-10).

Figure 2-10. Saving Clustering Image

- 21 -

3. Pathway Analysis (KEGG Mapper)

Pathway analysis with KEGG Mapper helps find the pathways related to genes obtained from NGS, microarray, and antibody array results. The procedure for using the KEGG mapping tool is described in Figure 3-1.

Figure 3-1. Process of Pathway Analysis by Using KEGG Mapper

Pathway analysis is simple with ExDEGA. Figure 3-2 shows how to import genes selected by a 2-fold change and normalized data (log2) > 4 into KEGG Mapper. The KEGG input values are located between ‘Raw Data(RC)’ and ‘Annotation’. First, specify the genes using ‘Fold change’, ‘Normalized Data’, and ‘p-value’ (available only when replicates were run) in ‘Significant Gene Selection’ in the right filter, then click the sample comparison to apply the settings. Next, copy both KEGG input columns, Entrez ID and FC Color (the #Number values shown in black), for use in KEGG Mapper.

Figure 3-2. Process of Making KEGG Mapper Input Data in ExDEGA

Process steps (text recovered from the process figure):
1) Copy the Entrez ID and fold change color columns.
2) Open the KEGG Mapper website (Search&Color Pathway): http://www.genome.jp/kegg/tool/map_pathway2.html
3) Paste the copied items and run the pathway search.
4) Check the pathway results and search for pathways of interest.

- 22 -

Steps for pathway analysis in KEGG Mapper are as follows: 1) visit the KEGG Mapper website (http://www.genome.jp/kegg/tool/map_pathway2.html), 2) enter the organism code in ‘Search against’ (if you do not know the organism code, click ‘org’ and search for it as described in Figure 3-3), 3) select ‘KEGG identifiers’ for the primary ID, 4) paste the Entrez ID and Color data copied from ‘KEGG input’ in ExDEGA into the 'Enter objects one per line followed by bgcolor, fgcolor' box, 5) check ‘Include Aliases’ and ‘Use uncolored diagrams’, and 6) click ‘Exec’.

Figure 3-3. Process of Setting Up KEGG Mapper
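The ‘one object per line followed by bgcolor, fgcolor’ input can be pictured with a short Python sketch. ExDEGA already produces this data in its KEGG input columns, so the code is only an illustration of the line format: the Entrez IDs and fold changes are hypothetical, and the use of the color names red and green (rather than the #Number values ExDEGA exports) and a black foreground are assumptions.

```python
# Hypothetical Entrez Gene IDs with fold changes (test/control).
fold_changes = {"1001": 3.2, "1002": 0.3, "1003": 1.1}


def kegg_color_lines(fc_by_id, fc_cut=2.0):
    """Build 'id bgcolor,fgcolor' lines for up- and down-regulated genes."""
    lines = []
    for gene_id, fc in fc_by_id.items():
        if fc >= fc_cut:
            bgcolor = "red"      # up-regulated
        elif fc <= 1.0 / fc_cut:
            bgcolor = "green"    # down-regulated
        else:
            continue             # unchanged genes are left out
        lines.append(f"{gene_id} {bgcolor},black")
    return lines


kegg_lines = kegg_color_lines(fold_changes)
print("\n".join(kegg_lines))
```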

The ‘Pathway Search Result’ from KEGG Mapper is displayed as illustrated in Figure 3-4. The listed pathways are related to the genes you entered, and the number beside each pathway name is the number of input genes found in that pathway. Click the number to see the genes. Click the name of a pathway you are interested in to draw its pathway map. Red indicates up-regulated genes and green indicates down-regulated genes. To save the pathway map image, right-click it and select ‘Save As’. To keep the items linked to the image, save the page as ‘HTML’.


- 23 -

Figure 3-4. Pathway Search Result in KEGG Mapper

- 24 -

4. Functional Annotation Analysis (DAVID)

DAVID provides a comprehensive set of functional annotation tools, based on numerous databases, for understanding the biological meaning of genes derived from NGS, microarray, and antibody array results. The process is described in Figure 4-1.

Figure 4-1. Process of Functional Annotation Analysis by Using DAVID

DAVID cannot analyze more than 3,000 genes, so first select fewer than 3,000 genes. Significantly differentially expressed genes extracted from mRNA-Seq data in ExDEGA can be used, as in Chapters 2 and 3. Visit the DAVID homepage (http://david.abcc.ncifcrf.gov/) and click ‘Functional Annotation’ (Figure 4-2).

Figure 4-2. DAVID Homepage

Process steps (text recovered from the process figure):
1) Website access: open http://david.abcc.ncifcrf.gov/ and click ‘Functional Annotation’.
2) Steps 1 to 4: copy and paste the gene list (gene symbol, GenBank No., or others), select the identifier, check ‘Gene List’, and click ‘Submit List’.
3) Database check: click ‘Chart’ for Gene Ontology, Pathway, or another DB, then identify the genes of interest in the ‘Chart’ and their corresponding terms.


- 25 -

Step 1 (Enter Gene List): copy the ‘Gene Symbol’ list from ExDEGA (or GenBank numbers, if you have them) and paste it into the ‘A: Paste a list’ box. Step 2 (Select Identifier): select ‘OFFICIAL_GENE_SYMBOL’ (or ‘GENBANK_ACCESSION’ if GenBank numbers are used). Step 3 (List Type): check ‘Gene List’. Step 4 (Submit List): click ‘Submit List’. Finally, read the popup message and click ‘확인’ (OK) to confirm it.

Figure 4-3. Process of Functional Annotation Analysis in DAVID

If the species you need is not shown in ‘Current Background’, as in Figure 4-4, select the correct species with the ‘Population Manager’ on the ‘Background’ page and click ‘Use’.

Only the number of genes shown in ‘Species(number)’ on the ‘List’ page is used in the functional annotation analysis, because only those genes are identified in the database even if more genes were entered.

- 26 -

Figure 4-4. Specifying Species Information

To review the results, click ‘+’ beside one of the lists, click ‘Chart’, then click one of the ‘Terms’ in the popup window. Figure 4-5 shows an example for ‘Gene_Ontology’: click ‘+’ beside ‘Gene_Ontology’ and click ‘Chart’ on ‘GOTERM_BP_FAT’. The relevant biological processes (55 chart records in this case) pop up. In the new window, clicking one of the terms opens QuickGO with its information. The genes related to a GO can be identified by clicking the bar under ‘Genes’.

- 27 -

Figure 4-5. Results of Gene Ontology Analysis

The ‘Pathways’ results are checked in the same way (Figure 4-6). Click ‘+’ beside ‘Pathways’ and click ‘Chart’ on ‘KEGG_PATHWAY’. A list of relevant pathways (2 pathways in this case) pops up. In the new window, click one of the pathways to see its image. A red star in the image marks a gene that you entered in the previous step. Click a gene to see its details.

Figure 4-6. A Result of Pathway Analysis

- 28 -

DAVID is useful for GO and pathway analysis. However, DAVID works only from the input data, so a small number of genes, whether entered or matched, may produce no GO or pathway results. By default, DAVID reports terms with at least two genes and an EASE score below 0.1. These criteria can be adjusted in ‘Option’. The ‘Help and Tool Manual’ for DAVID is located at the top of the window, as shown in Figure 4-7.

Figure 4-7. Help and Tool Manual for DAVID Tool
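The EASE score used in the default threshold is commonly described as a more conservative variant of the one-sided Fisher exact test, computed after removing one gene from the list hits. The Python sketch below follows that description under simplifying assumptions: the function names are hypothetical, the list size is left unchanged when one hit is removed, and the exact contingency table DAVID builds may differ. The counts in the example are made up.

```python
from math import comb


def fisher_upper_tail(hits, list_size, pop_hits, pop_size):
    """P(X >= hits) under the hypergeometric distribution (one-sided Fisher test)."""
    max_hits = min(list_size, pop_hits)
    total = comb(pop_size, list_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


def ease_score(hits, list_size, pop_hits, pop_size):
    """Same test with one gene removed from the list hits (more conservative)."""
    return fisher_upper_tail(max(hits - 1, 0), list_size, pop_hits, pop_size)


def passes_default(hits, score, min_count=2, max_ease=0.1):
    """Default criteria from the text: at least two genes and EASE below 0.1."""
    return hits >= min_count and score < max_ease


# Made-up counts: 3 of 300 list genes in a term that holds 40 of 30,000 genes.
fisher_p = fisher_upper_tail(3, 300, 40, 30000)
ease_p = ease_score(3, 300, 40, 30000)
print(fisher_p, ease_p, passes_default(3, ease_p))
```

Because one hit is removed, the EASE score is always at least as large as the plain Fisher p-value, which is why terms supported by very few genes rarely pass the threshold.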

- 29 -

5. Gene Set Enrichment Analysis (GSEA) Analysis (MSigDB)

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a

priori defined set of genes shows statistically significant, concordant differences between two

biological states. The analysis process is shown in Figure 5-1.

Figure 5-1. Process of GSEA Analysis

Visit MSigDB, click ‘Investigate Gene Sets’, and enter a registered email address to log in. If you have not registered yet, register first in order to view the MSigDB gene sets and/or download the GSEA software (Figure 5-2 and Figure 5-3).

Figure 5-2. GSEA Main Page

Process steps (text recovered from the process figure):
1) Website access: open http://software.broadinstitute.org/gsea/msigdb/index.jsp, click ‘Investigate gene sets’ in the left menu, enter your email, and click ‘login’.
2) Enter gene list: copy and paste the gene list (gene symbol or Entrez Gene ID) into the gene identifier box, select the DB you want under Compute Overlaps, choose the options, and click ‘compute overlaps’.
3) Analysis results: check the enriched functions and pathways and save them as Excel; check the gene/gene set overlap matrix.

- 30 -

Figure 5-3. GSEA Login Page

Enter the list of genes (Gene Symbol, Entrez Gene ID, or public ID) in ‘Gene Identifiers’, select the DB of interest under ‘Compute Overlaps’, and click ‘compute overlaps’ at the bottom (Figure 5-4). For more information on the gene sets in a DB, click its name (shown in blue).

Figure 5-4. Investigating Gene Sets

After the analysis is complete, the results of GSEA analysis (Gene Set and Gene/geneset overlap

matrix) are available as shown in Figure 5-5 and Figure 5-6.
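The two result views can be pictured as simple set operations: an overlap count per gene set and a gene-by-gene-set membership matrix. The Python sketch below illustrates only that idea; it is not how MSigDB computes its results, and the gene and gene set names are hypothetical placeholders.

```python
# Hypothetical input genes and gene sets; names are placeholders.
input_genes = ["g1", "g2", "g3", "g4"]
gene_sets = {
    "SET_A": {"g1", "g2", "g9"},
    "SET_B": {"g2", "g3", "g4", "g8"},
}


def overlap_counts(genes, sets):
    """Number of input genes found in each gene set."""
    return {name: len(set(genes) & members) for name, members in sets.items()}


def overlap_matrix(genes, sets):
    """For each input gene, record which gene sets contain it."""
    return {
        gene: {name: gene in members for name, members in sets.items()}
        for gene in genes
    }


counts = overlap_counts(input_genes, gene_sets)
matrix = overlap_matrix(input_genes, gene_sets)
print(counts)  # {'SET_A': 2, 'SET_B': 3}
print(matrix["g2"])  # {'SET_A': True, 'SET_B': True}
```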

- 31 -

Figure 5-5. A Result of GSEA Analysis (Gene Set)

Figure 5-6. A Result of GSEA Analysis (Gene/Gene-set Overlap Matrix)

- 32 -

6. Protein-Protein Interaction (PPI) Analysis (STRING)

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database which statistically

analyzes known and predicted protein-protein interactions. The interactions include physical and

functional associations to build and analyze interactome networks. The analysis process is shown

in Figure 6-1.

Figure 6-1. Process of Analysis of Protein-Protein Interaction by STRING

Note that the web-based STRING accepts fewer than 100 genes per analysis. ExDEGA makes it easy to prepare meaningful data for protein-protein interaction analysis: sort and select the genes you want to analyze in ExDEGA, then copy their gene symbols or Entrez Gene IDs. Visit the STRING homepage (http://string-db.org/), click ‘Multiple proteins’, and paste the list into the ‘List Of Names’ box. Select the scientific name of the species under ‘Organism’, then click ‘Search’ (Figure 6-2).

Process steps (text recovered from the process figure):
1) Website access: open http://string-db.org/ and click ‘Multiple proteins’.
2) Input gene list: copy and paste the gene list (gene symbol or Entrez Gene ID; fewer than 100 genes), enter the organism (e.g., Homo sapiens, Mus musculus), and click ‘Search’.
3) Network and analysis: click ‘Continue’ to construct the network and check the results, then click ‘Analysis’ to check the enriched functions, interactions, etc.

- 33 -

Figure 6-2. Multiple Proteins Search

As shown in Figure 6-3, confirm that the listed proteins match the input genes. If there are no problems, click ‘Continue’ to proceed.

Figure 6-3. Gene Confirmation Steps

When the analysis is complete, a network like the one in Figure 6-4 is displayed; it is built from the STRING DB. To check ‘Functional enrichments in your network’ as in Figure 6-5, click ‘Analysis’. To view all items with an FDR below 0.5, click ‘More’.

- 34 -

Figure 6-4. A Result of STRING Network

Figure 6-5. A Result of Functional Enrichments

- 35 -

If you click a function of interest in the ‘Functional enrichments in your network’ results, its genes are highlighted in red in the network figure (Figure 6-6). For more details about a gene you are interested in, click it in the network figure (Figure 6-7).

Figure 6-6. Selecting One of the Functions

Figure 6-7. Gene Details

- 36 -

The ‘Legend’ tab provides a detailed description of the nodes, edges, and input genes (Figure 6-8).

To save the network image and gene information, click ‘Tables Exports’ (Figure 6-9).

Figure 6-8. Legend of Network

Figure 6-9. Exporting of Network
