iBEReX
Biomedical Entity-Relationship Explorer
TUTORIAL
http://berex.korea.ac.kr
Table of Contents:
CHAPTER 1: INTRODUCTION TO iBEReX
1. iBEReX Overview
1.1. Query iBEReX
1.2. iBEReX Visualization Page
1.3. Top/Bottom Panel
1.4. Visualization Panel
1.4.1. Menu
1.4.2. Save File
1.4.3. Expand Nodes
1.4.4. Add Shortest Paths By Option
1.4.5. Network Layout Option
1.4.6. Help
1.4.7. Visualization Tool
1.4.8. Delete/Expand Options
1.4.9. Search Option
1.5. Entity-Relationship Panel
1.5.1. Detailed Information on Entity (node)
1.5.2. Entity-Relationship Type
1.5.3. Entity-Relationship Tree
1.5.4. Graph Edit
1.5.5. Other Query Options
CHAPTER 2: Context-Specific Subnetwork Discovery (COSSY) Analysis Option
2. COSSY Overview
2.1. COSSY Analysis Option in iBEReX
2.2. File Formats
2.2.1. Gene cluster text format (*.gct)
2.2.2. Categorical class format (*.cls)
2.2.3. Chip format (*.chip) (optional)
2.3. COSSY Results
2.4. Linking Two Genes by Shortest Paths
REFERENCES
CHAPTER 1: INTRODUCTION TO iBEReX
1. iBEReX Overview
iBEReX is a new web-based biomedical knowledge integration, search and exploration tool. iBEReX
integrates eight popular biomedical databases and delineates an integrated network by combining the
multi-layered biomedical entity-relationships extracted from these databases. When users enter keywords
to search the integrated network, iBEReX returns a subnetwork that matches the keywords. iBEReX also
allows users to upload their gene expression profiles, and automatically discovers the important
subnetworks to differentiate the phenotypes from their data. Users can interactively explore the resulting
subnetwork. Compared to other biomedical database exploration tools, iBEReX has the following
advantages: 1) reliable information extracted from widely used databases of proven utility in
biomedical research; 2) integrated networks constructed by gathering information from these various
databases; 3) interactive exploration and visualization of the resulting networks; 4) context-specific
subnetwork discovery from high-throughput transcriptomics data; and 5) a user-driven approach to
automatically generate plausible and/or unexpected relationships from the derived relevant subnetwork.
The underlying iBEReX database is based on the recently published BEReX (Jeon et al., 2014) and
integrates the subnetwork discovery algorithm COSSY (Saha et al., 2014) for high-throughput
transcriptomics analysis. iBEReX uses the open-source Cytoscape Web code for its interactive and
powerful visualization features. We believe that iBEReX will be a useful bioinformatics tool for
biologists exploring complex biomedical entity-relationship networks.
iBEReX Webpage: http://berex.korea.ac.kr
This website is free and open to all users; there is no login requirement.
Figure 1.1: Homepage of iBEReX.
1.1. Query iBEReX
Figure 1.2: Two types of query inputs.
Users can query and explore biomedical entity-relationships in iBEReX (Figure 1.2) with a simple
keyword search or by uploading a list of keywords to the web server. Here, we use Example 1 to
illustrate the functionality of iBEReX. Clicking Example 1 populates the query box with three
keywords: BCR, ABL1, and imatinib. Clicking the “submit” button then visualizes the
entity-relationships between these keywords.
NOTE: iBEReX accepts any keywords as the query inputs (e.g. gene symbols, diseases, pathways, gene
ontology terms, drug names, etc.).
1.2. iBEReX Visualization Page
Figure 1.3: The visualization page of iBEReX.
The results page of iBEReX contains four panels: the top panel, the visualization panel, the entity-
relationship panel, and the bottom panel.
Top Panel: Clicking the iBEReX logo in the top panel returns the user to the homepage.
Visualization Panel: The visualization panel contains the “Menu,” “Layout,” and “Help”
options at the top left corner, the visualization screen in the middle, and the “visualization tool” at the
bottom right corner. This panel is the main screen in which the user interactively navigates and explores
the biomedical entity-relationship network.
Entity-Relationship Panel: The entity-relationship panel provides the detailed information on the entity,
the relationship, the entity-relationship, and the option to edit the network (nodes and edges).
Bottom Panel: The bottom panel presents the query list which displays the list of keywords queried by
the user, and the color legend representing the different biomedical entities. The user can select from this
query list to highlight the entity in the visualization panel. The user can also select all nodes of a certain
type by clicking on the legend of the desired type.
1.3. Top/Bottom Panel
Figure 1.4: Top panel of the visualization page.
By clicking the iBEReX logo, users can return to the homepage of the iBEReX server.
Figure 1.5: Bottom panel of the visualization page.
The bottom panel contains the query list and the color legend for the nodes in the graph. As illustrated in
Figure 1.5, the query list includes ABL1 and BCR (two gene symbols) and imatinib (a drug name),
which are the query terms in Example 1. Clicking one of these entities highlights it
in the visualization panel. For example, if “ABL1” is selected from the list, the visualization panel shows
a gray circle around “ABL1,” indicating that this gene has been selected. The entity-relationship panel
also displays information about the selected gene (Figure 1.6). To apply an operation
collectively to all nodes of the same type, users can click on the legend of the desired type and then
perform the operation. For example, suppose that a user wants to change the color of all disease nodes in
the current graph from the default color (white) to red. The user can click on the “disease” legend, which
will highlight all disease nodes in the graph, and then change the node color in the “Graph Edit” tab in the
entity-relationship panel on the right.
Figure 1.6: ABL1 is selected and a gray circle highlights “ABL1” in the visualization panel. The
entity-relationship panel displays the detailed information of the selected gene.
1.4. Visualization Panel
1.4.1. Menu
Three options are available in the “Menu,” including “Save,” “Expand by,” and “Add shortest paths by.”
1.4.2. Save File
Figure 1.7: Save option.
Users can save the network (entity-relationship) in iBEReX in the following formats: PNG, SVG, PDF,
XGMML, GraphML, and SIF.
1.4.3. Expand Nodes
Figure 1.8: Expand option.
By default, iBEReX expands the graph by the five highest-ranking entities each time the “expand”
operation (explained in Section 1.4.8) is executed. Users can change this number by selecting the
“Expand by” option under “Menu.”
1.4.4. Add Shortest Paths By Option
Figure 1.9: Add shortest paths by number of nodes option.
By default, iBEReX adds a single top-scoring shortest path between two selected nodes upon a user’s
request (Section 2.4). Users can change this number by selecting the “Add shortest paths by” option under
“Menu.”
1.4.5. Network Layout Option
Figure 1.10: Layout option.
Users can display the network using the automatic graph layout options provided under the “Layout”
menu. Three options are currently available: Radial (default), Circle, and Tree. Figure 1.11 illustrates the
three different layouts from querying Example 1.
Figure 1.11: Three layout options: A. Radial, B. Circle, and C. Tree.
1.4.6. Help
Figure 1.12: Help options.
The “Help” tab has two options: (i) “About,” which describes the iBEReX version, and (ii) “Tutorial,”
which is this document.
1.4.7. Visualization Tool
Figure 1.13: Functions of the Visualization tool.
The visualization tool enables users to pan the entire network, zoom in and out, and expand the
visualization panel to fit the screen.
1.4.8. Delete/Expand Options
Figure 1.14: Delete/Expand options.
If users right-click on a node (entity), a pop-up menu appears; through this menu, users can
delete/expand the relationships with other entities (e.g. genes/proteins, diseases/symptoms, drugs,
pathways, gene ontology terms, miRNAs, and transcription factors). Again, by default, five entities are
added at a time (see Section 1.4.3). For example, Figure 1.15 is obtained by right-clicking “ABL1” and
selecting “Expand by Gene/Protein.” Five new genes/proteins are added to the network.
Figure 1.15: Expand by Gene/Protein.
1.4.9. Search Option
Figure 1.16: Searching “SRC” in the Example 1 query.
Users can search for a particular entity in the visualization panel by using the “Search” box. If the entity is
already present in the current network, it is highlighted. If the entity is not present in the
current network, it is added to the network. Figure 1.16 illustrates searching for
“SRC” in Example 1. Since SRC is not present in the Example 1 network, it has been added to the
network.
1.5. Entity-Relationship Panel
The entity-relationship panel contains four tabs that provide detailed information and link-out options for
the entity and relationship selected in the visualization panel. It also provides edit options for the
nodes and edges in the network.
Figure 1.17: Options in the Entity-Relationship Panel.
1.5.1. Detailed Information on Entity (node)
For example, if “ABL1” is selected in the visualization panel, the “Detail Info” tab displays the
information regarding this entity (Figure 1.18) and provides link-out options to external databases.
Figure 1.18: Detailed information on ABL1.
1.5.2. Entity-Relationship Type
Users can customize the entity-relationship type to be displayed in the visualization panel by clicking the
“Entity-Relationship Type” tab. By default, all types of entity-relationship are displayed. Users can
uncheck types in the drop-down menu as illustrated in Figure 1.19.
Figure 1.19: Entity-Relationship Type for ABL1.
1.5.3. Entity-Relationship Tree
Users can explore all the entity-relationships of the selected entity (node) in the “Entity-Relationship
Tree” tab. Figure 1.20 illustrates the entity-relationship tree of “ABL1”; Figure 1.20B and Figure 1.20C
show all the genes and all the pathways connected to “ABL1,” respectively.
Figure 1.20: Entity-Relationship Tree.
1.5.4. Graph Edit
The “Graph Edit” tab allows users to customize the nodes and edges in the visualization panel. Figure
1.21 illustrates the options in the “Graph Edit” tab.
Figure 1.21: Options in Graph Edit tab.
1.5.5. Other Query Options
Users can specify the colors of biomedical entities (e.g. genes, drugs, etc.) in their query by appending “|”
followed by the desired color at the end of each query line. iBEReX will automatically extract and color
the biomedical entity according to the color specified by the user.
See Example 2 (Figure 1.22):
ITGAX | red
ITGB2 | green
EZR | blue
Figure 1.22: Color Options in Example 2 query.
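The color syntax above can be prepared or checked before pasting it into the query box. The following is a minimal sketch, not iBEReX's own parser: the function name `parse_query_lines` is hypothetical, and treating a line without “|” as a keyword with no color is an assumption.

```python
def parse_query_lines(text):
    """Split each query line into (keyword, color).

    color is None when the line has no "|" or nothing follows it.
    (Hypothetical helper; iBEReX's server-side parsing is not documented here.)
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        keyword, _, color = line.partition("|")
        color = color.strip()
        entries.append((keyword.strip(), color if color else None))
    return entries


# Example 2 from this tutorial, plus one keyword without a color.
query = """ITGAX | red
ITGB2 | green
EZR | blue
imatinib"""

for keyword, color in parse_query_lines(query):
    print(keyword, color)
```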
CHAPTER 2: Context-Specific Subnetwork Discovery (COSSY) Analysis Option
2. COSSY Overview
The algorithm COSSY discovers important subnetworks that can differentiate between two phenotypes
(context) (Saha et al., 2014). It automatically finds differentially expressed subnetworks of closely
interacting molecules from molecular interaction networks by using gene expression profiles. COSSY
enables users to analyze their gene expression profile datasets on the iBEReX web server
(http://berex.korea.ac.kr).
2.1. COSSY Analysis Option in iBEReX
Figure 2.1: Context-Specific Subnetwork Discovery (COSSY) and Analysis of Gene Expression
Profiles Webpage.
To perform the subnetwork discovery analysis, users need to provide gene expression profiles (in the
format of *.gct), a class label (in the format of *.cls), and an optional chip file (in the format of *.chip).
1. Click on the “Analysis” tab.
2. Upload gene expression profile dataset file (File format: *.gct).
3. If the gene expression profile dataset already uses gene symbols (e.g. RNA-seq), then
check “no need to map probe ID -> gene ID.” If the gene expression profile dataset contains
probe IDs, then uncheck this option and provide the chip file (File format: *.chip).
4. Upload class labels (File format: *.cls).
5. Click “Analyze & Explore.”
The right panel of the “Analysis” tab provides an example of a gene expression profile dataset
(leukemia2.gct), a chip file (leukemia2.chip), and a class file (leukemia2.cls). We will use this example to
illustrate the functionality of COSSY. Clicking the “Analyze & Explore” button starts the COSSY
analysis. The execution typically takes several minutes (depending on the number of samples) to return
the results.
NOTE:
Once a COSSY query is submitted, iBEReX acknowledges the receipt of the user dataset and shows a
link to the result page. Since a COSSY job typically takes several minutes, users may not see the result
graph on the result page immediately after the submission. Users are advised to keep the
browser open and refresh the result page until the result graph appears, or to bookmark the URL of the
result page and visit it later.
2.2. File Formats
The iBEReX server uses common file formats for data input in the COSSY analysis. The detailed
description of the file formats used here is available online at
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats.
2.2.1. Gene cluster text format (*.gct)
This is a tab-delimited file representing an expression dataset. The first line is fixed and contains only
the version string.
Format:
#1.2
The second line contains the number of genes followed by the number of samples in the dataset.
Format:
(#genes)(tab)(#samples)
The third line contains column titles of the expression dataset. The first two columns are “Name” and
“Description,” and the other columns represent the expression profiles.
Format:
Name(tab)Description(tab)(sample_1_name)(tab)(sample_2_name)(tab)...(sample_N_name)
The expression data start from the fourth line. Each line represents the expression levels of a gene across
all the samples.
Format:
(gene name)(tab)(gene description)(tab)(col_1_data)(tab)(col_2_data)(tab)...(col_N_data)
The “Name” column must contain unique identifiers: microarray probe IDs for microarray
data, or RNA-seq IDs for RNA-seq profiles.
Note: For microarray probe annotations, users must provide a separate chip file, which is mapped
against the “Name” column of the gct file. For RNA-seq annotations, users must provide the gene
symbols in the “Description” column.
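Before uploading, it can help to confirm that a file follows the layout described above. The sketch below is one possible check under stated assumptions: the function name `read_gct` and the toy probe and sample names are hypothetical, and the checks only mirror the rules listed in this section, not what the iBEReX server actually validates.

```python
import io


def read_gct(handle):
    """Parse a GCT stream into (sample_names, rows).

    Each row is a (name, description, values) tuple.
    Raises ValueError when the layout described in this section is violated.
    """
    first = handle.readline().strip()
    if first != "#1.2":
        raise ValueError(f"first line must be #1.2, got {first!r}")
    # Second line: (#genes)(tab)(#samples)
    n_genes, n_samples = (int(x) for x in handle.readline().split()[:2])
    # Third line: Name, Description, then one column title per sample.
    header = handle.readline().rstrip("\r\n").split("\t")
    if [h.lower() for h in header[:2]] != ["name", "description"]:
        raise ValueError("third line must start with Name and Description")
    samples = header[2:]
    rows = []
    for line in handle:
        if not line.strip():
            continue  # tolerate trailing blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2 + n_samples:
            raise ValueError(f"wrong number of columns for {fields[0]!r}")
        rows.append((fields[0], fields[1], [float(v) for v in fields[2:]]))
    if len(samples) != n_samples or len(rows) != n_genes:
        raise ValueError("declared dimensions do not match the data")
    if len({name for name, _, _ in rows}) != len(rows):
        raise ValueError("the Name column must contain unique identifiers")
    return samples, rows


# Toy dataset with made-up probe IDs and sample names.
demo = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "probe_a\tGENE1\t1.0\t2.5\t3.0\n"
    "probe_b\tGENE2\t0.5\t0.1\t4.2\n"
)
samples, rows = read_gct(io.StringIO(demo))
print(samples)
print(rows[0])
```

For a file on disk, the same function can be called with an open text file handle in place of `io.StringIO`.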
2.2.2. Categorical class format (*.cls)
This is a tab- or space-delimited file.
The first line contains three numbers.
Format:
(number of samples)(tab)(number of classes)(tab)1
The second line begins with a hash symbol (#), followed by the names of the classes.
Format:
# (tab)(class_0_name)(tab)(class_1_name)
The third and last line contains the class labels of all the samples.
Format:
(sample_1_class)(tab)(sample_2_class)(tab)...(sample_N_class)
Note: COSSY works with binary categories; therefore, two classes exist. The higher label in the lexical
order is chosen as the positive class by default. For example, if “0” and “1” are the two class labels,
then “0” is the negative class (control), and “1” is the positive class (case). Similarly, the “tumor” class
and “normal” class represent the positive and negative classes, respectively.
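The positive/negative rule in the note can be previewed locally. This is a minimal sketch, assuming that Python's default string ordering is an acceptable stand-in for the “lexical order” mentioned above; the function name `read_cls` is hypothetical and the server's exact comparison is not documented here.

```python
def read_cls(text):
    """Parse cls text into (class_names, labels, negative, positive).

    The positive class is the higher label in plain string order
    (an assumption standing in for the "lexical order" rule).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError("a cls file is expected to have exactly three lines")
    # First line: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    if not lines[1].startswith("#"):
        raise ValueError("the second line must start with '#'")
    class_names = lines[1][1:].split()
    labels = lines[2].split()
    if len(labels) != n_samples:
        raise ValueError("number of labels does not match the sample count")
    distinct = sorted(set(labels))
    if n_classes != 2 or len(distinct) != 2:
        raise ValueError("COSSY works with exactly two classes")
    negative, positive = distinct
    return class_names, labels, negative, positive


demo = "4 2 1\n# normal tumor\nnormal tumor tumor normal\n"
class_names, labels, negative, positive = read_cls(demo)
print("negative (control):", negative)
print("positive (case):", positive)
```

Splitting on any whitespace accepts both the tab-delimited and the space-delimited variants of the file.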
2.2.3. Chip format (*.chip) (optional)
This file contains the annotation information about a microarray. It is a tab-delimited file containing three
columns: 1) Probe Set ID, 2) Gene Symbol, and 3) Gene Title.
Format:
(probe_id)(tab)(gene_symbol)(tab)(gene_title)
Note: COSSY can handle the case where a probe is mapped to multiple genes, i.e. where there are multiple
gene symbols for a probe. In that case, the gene symbols must be separated by a semicolon (;) or three
consecutive slashes (///).
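The two separators in the note can be handled with a single split. The sketch below is illustrative only: the function name `read_chip` and the probe IDs are hypothetical, and skipping a header row that begins with “Probe Set ID” is an assumption about how such files are commonly laid out.

```python
import re


def read_chip(text):
    """Map each probe ID to its list of gene symbols.

    Symbols separated by ';' or '///' are split into separate entries.
    """
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        if fields[0].strip().lower() == "probe set id":
            continue  # assumed optional header row
        symbols = [s.strip() for s in re.split(r";|///", fields[1]) if s.strip()]
        mapping[fields[0].strip()] = symbols
    return mapping


# Made-up probe IDs; the gene symbols appear elsewhere in this tutorial.
demo = (
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "p1\tABL1\texample title\n"
    "p2\tBCR /// ABL1\texample title\n"
    "p3\tMYC;GRB2\texample title\n"
)
chip = read_chip(demo)
print(chip["p2"])
print(chip["p3"])
```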
2.3. COSSY Results
Figure 2.2: Subnetworks identified by COSSY from the Leukemia (Example) data set.
From the COSSY analysis, the list of genes identified from the example (leukemia) dataset includes
PDPK1, PIK3CA, RPS6KB2, PRKCI, PLCG2, PPP3CB, PPP3CC, GRB2, CREB1, BIRC5, MYC, and
WNT4. The red and green colors represent genes that are over-expressed and under-expressed, respectively,
in Case (AML in this example) as compared to Control (ALL in this example). These genes were used as the query in
iBEReX, and listed on the bottom panel in the webpage. Users may click on any gene in this list, and the
corresponding gene will be highlighted in the visualization panel. The current implementation of COSSY
is based on KEGG pathways (Saha et al., 2014). Given a user expression profile dataset, COSSY
identifies the top 10 phenotype-correlated subnetworks from KEGG pathways, and selects the top five
most differentially expressed genes from each of the top 10 subnetworks. Each five-gene group is reported
as a Molecular Interaction Subnetwork (MIS). Note that the number of genes in an MIS may be less than
five when microarray profiles are used because multiple microarray probes may map to one gene ID. The
example in Figure 2.2 shows that MIS #1 consists of only four genes because two different probes in MIS #1
are mapped to the same gene ID. Also note that two extra genes, PPP3CA and STAT3, were added in the
visualization panel showing the top three MISs (Figure 2.2). Since an MIS is formed by the top five
differentially expressed genes from a subnetwork, the genes in an MIS may not be fully connected to each
other. To glue the genes together in an MIS, iBEReX introduces the least number of additional required
genes into the MIS.
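The reason an MIS can contain fewer than five genes can be illustrated with a toy calculation. This is only a sketch of the probe-to-gene collapse described above, not COSSY's actual selection procedure; the function name, probe IDs, gene names, and scores are all hypothetical.

```python
def genes_from_top_probes(probe_scores, probe_to_gene, top_n=5):
    """Take the top_n probes by score and collapse them to unique gene IDs."""
    top = sorted(probe_scores, key=probe_scores.get, reverse=True)[:top_n]
    genes = []
    for probe in top:
        gene = probe_to_gene[probe]
        if gene not in genes:  # two probes for one gene count once
            genes.append(gene)
    return genes


# Made-up differential-expression scores for six probes.
scores = {"p1": 3.2, "p2": 2.9, "p3": 2.5, "p4": 2.1, "p5": 1.8, "p6": 0.4}
# p1 and p3 map to the same (made-up) gene.
gene_of = {"p1": "GENE_A", "p2": "GENE_B", "p3": "GENE_A",
           "p4": "GENE_C", "p5": "GENE_D", "p6": "GENE_E"}

mis = genes_from_top_probes(scores, gene_of)
print(len(mis), mis)
```

Here the five top-scoring probes yield only four distinct genes, which mirrors the situation described for MIS #1.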
Figure 2.3: Analysis Results.
The MISs are reported in order of the strength of their correlation with the phenotypes in the user
profiles. By default, iBEReX displays the top three MISs. Users may increase or decrease the number of
MISs to be included in the visualization panel by sliding the MIS rule in the “Analysis Results” tab
(Figure 2.3). Figure 2.4 illustrates the result after sliding the MIS rule from 3 (default) to 4.
Figure 2.4: Changing the “Top Phenotype-Correlated Subnetworks” from top three to top four.
Figure 2.5: Changing the “Top Phenotype-Correlated Subnetworks” from top three to top two.
Figure 2.5 illustrates the option of reducing the top three MISs to the top two MISs, using the MIS rule in
the “Analysis Results” tab. Here, only the differentially expressed genes identified in the top two MISs and
a glue gene (PPP3CA) are visualized as a subnetwork. Users can click on the gene list in the bottom panel
to highlight the genes in the visualization panel. Users can also add other entity-relationships by
right-clicking the highlighted genes.
2.4. Linking Two Genes by Shortest Paths
Figure 2.6: Linking two genes by shortest paths.
In iBEReX, users can connect isolated subnetworks by adding the nodes that form the shortest paths
between two selected biomedical entities. Here, we illustrate how to connect two remote subnetworks
by selecting RPS6KB2 and PPP3CB. Once the two genes are selected (by shift-click), right-click on one
of the selected genes. A pop-up menu will be displayed (Figure 2.6). Select the “Add shortest paths”
option from this pop-up menu to find biomedical entities that link these two nodes.
Figure 2.7: Results of linking the two genes by shortest paths.
As illustrated in Figure 2.7, a GO term, “Immunity, innate” (purple node), is added to connect RPS6KB2
and PPP3CB. Users can expand or delete nodes from this network by repeating these procedures. By
default, shortest paths are added one at a time. However, users can change this by clicking “Menu” and
“Add shortest paths by” in the drop-down menu to select the desired number of paths to add (see Section
1.4.4).
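The idea of linking two nodes through intermediate entities can be sketched with a breadth-first search on a toy graph. This is not iBEReX's implementation: how iBEReX scores and ranks paths is not described in this tutorial, the function name is hypothetical, and the nodes “X1” and “X2” are made-up placeholders. Only the link through “Immunity, innate” comes from Figure 2.7.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one path with the fewest edges from source to target, or None."""
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Toy undirected graph stored as adjacency lists.
toy = {
    "RPS6KB2": ["Immunity, innate", "X1"],
    "Immunity, innate": ["RPS6KB2", "PPP3CB"],
    "X1": ["RPS6KB2", "X2"],
    "X2": ["X1", "PPP3CB"],
    "PPP3CB": ["Immunity, innate", "X2"],
}
print(shortest_path(toy, "RPS6KB2", "PPP3CB"))
```

The two-edge route through the GO term is preferred over the three-edge route through the placeholder nodes because breadth-first search reaches the target along the fewest edges first.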
References:
Jeon, M., Lee, S., Lee, K., Tan, A.C., and Kang, J. (2014). BEReX: Biomedical Entity-Relationship
eXplorer. Bioinformatics. 30:135-136. [PMID: 24149052]
Saha, A., Tan, A.C., and Kang, J. (2014). Automatic Context-Specific Subnetwork Discovery from Large
Interaction Networks. PLoS ONE. 9(1): e84227. [doi:10.1371/journal.pone.0084227]