Blast2GO Teaching Exercises –SOLUTIONS
Ana Conesa and Stefan Gotz, 2012
BioBam Bioinformatics S.L., Valencia, Spain
Contents
1 Annotate 10 sequences with Blast2GO
2 Perform a complete annotation process with Blast2GO
3 Creating GO-DAGs and Pies
4 Enrichment Analysis with Blast2GO - FatiGO
5 Functional Analysis/Data Mining
1 Annotate 10 sequences with Blast2GO
(Please note that results may vary slightly depending on the parameters used and on the database versions!)
1.1 Annotate 10 sequences with Blast2GO
• BLAST against NCBI nr database:
Check the progress of the BLAST run on the Application messages tab. How long does it take to complete? Are all sequences successfully blasted?
It should take only 2-3 minutes. All sequences are successfully blasted and are now shown in orange.
• Launch mapping: That’s just a click.
• Browsing Blast results:
Place the cursor on one sequence and right-click. The single sequence menu appears. Selecting the “Show Blast Results” option fills the Blast results tab with BLAST information. Double-click on the upper bar to enlarge the window and to enable the scroll bars. You can inspect the different BLAST hits and their percentage of similarity, number of HSPs, reading frame, etc.
• GO graph:
On the single sequence menu, click on “Draw Graph of Mapping-Results” with highlighted annotations. The graph appears with the annotation score of each GO term. Annotating terms are the most specific terms of each branch that surpass the annotation score threshold (default = 55).
• Export top-blast data: Here, only information on the “best-blast-hit” is given.
• How many GO terms have you fetched for each sequence?
Sequence        1   2   3   4   5   6   7   8   9  10
Number of GOs  13   6   3  10   2   2  11   7   2   2
• Annotate the sequences. How many GO terms do you obtain for each sequence?
Here is the table with the annotation results:
Sequence        1   2   3   4   5   6   7   8   9  10
Number of GOs   5   2   1   7   -   -   8   2   -   -
1.2 Let’s check some annotations more in detail
• Obtain the annotation DAG of Sequence 7 (single sequence menu). Interpret and save the“molecular function” graph (see Figure 1).
In this graph you can see the annotation score for all candidate GO terms (GO terms with description attached). The selected GO terms (octagonal boxes) are those that surpass the annotation threshold and are the most specific terms in their branch. Other terms are also above the threshold but do not appear in the annotation because more specific terms in the same branch are above the cutoff as well.
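As a rough illustration of this selection rule, the following Python sketch picks the most specific terms above the threshold from a small toy graph. The parent links, the scores and the helper names (ancestors, select_annotations) are assumptions made for this example; they are not taken from Blast2GO or from the real Seq7 graph. Reading "surpass" as "greater than or equal to" is also an assumption.

```python
from typing import Dict, List, Set

# Toy fragment of a molecular-function graph (child -> parents).
# Links and scores are illustrative assumptions, not the real Seq7 data.
PARENTS: Dict[str, List[str]] = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "nucleotide binding": ["binding"],
    "ATP binding": ["nucleotide binding"],
    "catalytic activity": ["molecular_function"],
    "transferase activity": ["catalytic activity"],
}
SCORES: Dict[str, float] = {
    "molecular_function": 100,
    "binding": 95,
    "nucleotide binding": 90,
    "ATP binding": 50,
    "catalytic activity": 100,
    "transferase activity": 100,
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable by following parent links from `term`."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def select_annotations(scores: Dict[str, float],
                       parents: Dict[str, List[str]],
                       threshold: float = 55.0) -> Set[str]:
    """Keep terms at or above the threshold that have no passing descendant."""
    passing = {term for term, score in scores.items() if score >= threshold}
    covered: Set[str] = set()
    for term in passing:
        covered |= ancestors(term, parents)  # ancestors of a passing term
    return passing - covered


selected_default = select_annotations(SCORES, PARENTS)
selected_strict = select_annotations(SCORES, PARENTS, threshold=95.0)
print(sorted(selected_default))
print(sorted(selected_strict))
```

In this toy graph, raising the threshold from 55 to 95 replaces "nucleotide binding" by its more general parent "binding", which is the kind of change a stricter threshold can produce.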
• Re-annotate sequences 1 and 8 at an annotation threshold of 80. How does the annotation change? The steps to re-annotate are:
◦ De-select all sequences at the sequences check box.
◦ Select sequences 1 and 8.
◦ Go to Annotation and select Reset Annotation.
◦ Run annotation step again having only these 2 sequences selected.
Figure 1: GO mapping and annotation of seq 7 (single sequence graph of Seq7, molecular function branch; each node shows the GO term and its annotation score)
◦ Sequence 1 now has only 2 GO terms (3 fewer) and sequence 8 now has only 1 GO term.
◦ Both sequences lost information. Some terms disappeared, others changed to more general ones.
• There are a number of sequences with mapping but without annotation. What happened? Try to annotate them manually. Tip: go to the Blast results of these sequences to learn about them and decide on the functions you would give to these sequences. Go to the Gene Ontology resource www.geneontology.org and look for appropriate GO terms. Add these manually to the sequences and mark them as manually annotated.
These sequences do not have annotations because the obtained terms are root GOs. By browsing the Blast results some functions can be proposed:
◦ Sequence 5: GO:0016021, integral to membrane
◦ Sequence 10: GO:0016020, membrane
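If you prefer to prepare such manual annotations outside the program, a minimal Python sketch for writing them as tab-separated lines could look as follows. The sequence names Seq5 and Seq10 are placeholders for whatever IDs your project uses, and the two-column layout (sequence ID, GO ID) is an assumption about the .annot file; check it against a file exported from your own project.

```python
import csv
import io
import re
from typing import List, Tuple

# Proposed manual annotations from this exercise.
# "Seq5" and "Seq10" are hypothetical sequence names.
MANUAL: List[Tuple[str, str]] = [
    ("Seq5", "GO:0016021"),   # integral to membrane
    ("Seq10", "GO:0016020"),  # membrane
]

GO_ID = re.compile(r"GO:\d{7}")


def to_annot(rows: List[Tuple[str, str]]) -> str:
    """Serialise (sequence ID, GO ID) pairs as tab-separated text."""
    for _seq, go_id in rows:
        if not GO_ID.fullmatch(go_id):
            raise ValueError(f"not a GO identifier: {go_id!r}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


annot_text = to_annot(MANUAL)
print(annot_text, end="")
```

The text can then be saved with the extension .annot and loaded into the project.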
1.3 Let’s augment/modify the annotations
• Get InterPro annotation for these sequences. How long does it take?
It takes about 5 minutes for 10 sequences. 8 sequences obtain InterProScan results. Only 5 of them are linked to GO terms.
• Merge InterPro results with the existing (blast-based) annotations (AnnotScore=55). How much does your annotation improve?
After merging, 2 GO terms are added and 0 GO terms are removed for being too general. Now 27 terms are assigned to our sequences. 38 (redundant) GO terms obtained through InterPro could be used to confirm existing ones. In this example, 0 InterPro-based GO terms were more general than the ones already assigned.
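The categories reported for the merge (added, confirmed, too general) can be pictured with a small set-based Python sketch for a single sequence. This is only an interpretation of the counts above, not the program's actual merge code; the function name merge_interpro and the toy ancestor table are assumptions.

```python
from typing import Dict, Set

# Toy ancestor table (term -> all of its more general terms); illustrative only.
ANCESTORS: Dict[str, Set[str]] = {
    "ATP binding": {"nucleotide binding", "binding"},
    "nucleotide binding": {"binding"},
    "binding": set(),
    "kinase activity": {"catalytic activity"},
    "catalytic activity": set(),
}


def merge_interpro(existing: Set[str], interpro: Set[str],
                   ancestors: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Classify InterPro-derived terms against the existing annotations."""
    existing_ancestors: Set[str] = set()
    for term in existing:
        existing_ancestors |= ancestors.get(term, set())

    confirmed = interpro & existing                          # already assigned
    too_general = (interpro - existing) & existing_ancestors  # parent of an assigned term
    added = interpro - confirmed - too_general                # genuinely new

    added_ancestors: Set[str] = set()
    for term in added:
        added_ancestors |= ancestors.get(term, set())
    removed = existing & added_ancestors   # old terms made redundant by a new one
    merged = (existing | added) - added_ancestors

    return {"confirmed": confirmed, "too_general": too_general,
            "added": added, "removed": removed, "merged": merged}


case_a = merge_interpro({"ATP binding"},
                        {"ATP binding", "binding", "kinase activity"}, ANCESTORS)
case_b = merge_interpro({"binding"}, {"ATP binding"}, ANCESTORS)
print(case_a)
print(case_b)
```

In the second call the existing term "binding" is dropped because the InterPro-derived "ATP binding" is more specific, which corresponds to a term being "removed for being too general".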
Figure 2: InterProScan results (bar chart “Merge InterPro Annotation Results”: number of annotations before, after, confirmed and too general)
• Run Annex on these sequences. How does your annotation improve?
After Annex, several new GO terms have been obtained based on already existing molecular function terms, some terms have been replaced by more specific ones, and some others have been confirmed.
Figure 3: Annex results (bar chart “Annex Results”: number of annotations previous, actual, new, replaced and confirmed)
• Get KEGG maps for these sequences. For how many sequences do you obtain KEGG results?
There are 3 sequences for which an Enzyme Code has been obtained. These codes map to 10 KEGG metabolic pathways. The position of each enzyme in the KEGG pathway is highlighted, each enzyme in a different color.
• Get the GOSlim of these sequences. How many GO terms do you have now?
Here the table with annotation results:
Sequence        1   2   3   4   5   6   7   8   9  10
Before GOSlim   5   2   1   7   -   -   8   4   -   -
After GOSlim    6   5   2   9   -   -   8   4   2   2
GOSlim annotations are not necessarily fewer in number than the normal GO annotations at the single sequence level, but the diversity of GO terms is reduced (see Figure 4).
• Export annotation results in different formats (.annot, GeneSpring, Sequence Table and BestHit). Open these files with OpenOffice Spreadsheet. Which format do you like the most?
Every format has a function. The GeneSpring format is good for understanding results, while the .annot format is appropriate for performing calculations and for importing results into other applications. The table formats are also useful for browsing annotations.
1.4 Extra exercise: Merge two .dat files
• Save the B2G project in two separate files. Close the project and join the 2 .dat files again with B2G.
The steps are:
◦ Save project as result.dat
◦ De-select all sequences at the sequence check box
◦ Select the last 5 sequences
◦ Go to Select menu and delete selected sequences
◦ Save as result1.dat
◦ Close project
◦ Load result.dat
◦ De-select the last 5 sequences
◦ Go to Select menu and delete selected sequences
◦ Save as result2.dat
◦ Close Project
◦ Load result1.dat
◦ Go to Tools and select Add .dat to existing project, and then select result2.dat
◦ The original annotation project is restored
Figure 4: GOSlim graph (GOSlim combined graph of the biological_process branch; each node shows its number of sequences)
2 Perform a complete annotation process with Blast2GO
(Please note that results may vary slightly depending on the parameters used and on the database versions!)
2.1 Annotation of 1100 sequences with Blast2GO
• e-Values and similarities: The e-value ranges from 1E-3 to 1E-130. Most sequences have an e-value between 1E-10 and 1E-70. The sequence similarity ranges from approx. 40% to approx. 95%, and then drops. We can also observe a peak at 100%, which could correspond to “self-hits” or to sequence patterns with 100% similarity.
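To reproduce an e-value distribution of this kind outside the program, the hits can be binned by the exponent X of 1e-X. The Python sketch below shows one way to do this; the function name, the bin width and the example e-values are assumptions chosen for illustration and are not data from this exercise.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def evalue_exponent_histogram(evalues: Iterable[float],
                              bin_width: int = 10) -> Dict[int, int]:
    """Count hits per bin of X, where the e-value is written as 1e-X."""
    counts: Counter = Counter()
    for evalue in evalues:
        if evalue <= 0.0:
            continue  # an e-value of exactly 0 has no finite exponent
        exponent = -math.log10(evalue)
        lower_edge = int(exponent // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical e-values of BLAST hits.
example_evalues = [1e-3, 2e-12, 5e-15, 4e-45, 3e-68, 2e-130]
hist = evalue_exponent_histogram(example_evalues)
for lower_edge, n_hits in hist.items():
    print(f"1e-{lower_edge} .. 1e-{lower_edge + 10}: {n_hits} hit(s)")
```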
Figure 5: E-value distribution (x-axis: E-value (1e-X); y-axis: number of hits)
Figure 6: Sequence similarity distribution (x-axis: #positives/alignment-length; y-axis: number of hits)
• Mapping: The majority of sequences have annotations inferred from electronic annotation; even so, approx. 350 sequences also have annotations inferred from direct assay. The next two evidence codes are also “non-experimental” ones: inferred from computational analysis and inferred from sequence similarity.
Looking at the source databases, we can observe that the majority of annotations are obtained from the UniProt Knowledge Base.
The mean GO level is 5.4 and approx. 2800 annotations could be assigned.
2.2 Augment annotation via InterPro and Annex
• About 600 sequences have an InterProScan result and about 30% of them could be linked to a GO term. However, for this dataset only 8 additional sequences could be annotated through InterPro domains.
• Through Annex, the number of annotations increased from about 3500 to over 4100 by adding complementary terms derived from the existing molecular functions.
Figure 7: Distribution of GO term levels for each GO category (P, F, C; x-axis: GO level; y-axis: number of annotations; total annotations = 2839, mean level = 5.391, std. deviation = 1.778)
2.3 Try different annotation strategies
• Excluding several evidence codes with the more restrictive settings drastically reduced the number of assigned annotations: only about 10% of the sequences could be annotated successfully. With the permissive settings, over 50% of the sequences were annotated, which is only 70 sequences more than with the default parameters (see Figure 8).
(a) default parameters (b) permissive parameters
(c) restrictive parameters
Figure 8: Annotation results
• The GO-level distribution chart shows that the number of annotated sequences is much lower in the restrictive mode, while the mean annotation level stays more or less the same. On closer inspection, the Cellular Component category seems to be the one least affected by the restrictive mode (see Figure 9).
(a) default parameters (b) permissive parameters
(c) restrictive parameters
Figure 9: GO level distributions
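The effect of the three strategies compared in this section can be pictured with a simplified Python sketch. It assumes that a candidate GO term is scored as the best product of hit similarity and evidence-code weight and that it is annotated when this score reaches the threshold of 55. This is a deliberate simplification for illustration, not the full annotation rule of the program, and the weights shown are hypothetical values rather than the program defaults.

```python
from typing import Dict, List, Tuple

THRESHOLD = 55.0

# Hypothetical evidence-code weights, for illustration only.
DEFAULT_WEIGHTS: Dict[str, float] = {"IDA": 1.0, "ISS": 0.8, "IEA": 0.7}


def restrictive(weights: Dict[str, float]) -> Dict[str, float]:
    """Set every weight lower than 1 to 0."""
    return {code: (w if w >= 1.0 else 0.0) for code, w in weights.items()}


def permissive(weights: Dict[str, float]) -> Dict[str, float]:
    """Set every weight to 1."""
    return {code: 1.0 for code in weights}


def direct_score(hits: List[Tuple[float, str]],
                 weights: Dict[str, float]) -> float:
    """Best similarity (in %) times evidence-code weight over all hits."""
    return max((sim * weights.get(code, 0.0) for sim, code in hits), default=0.0)


def annotated_under(hits: List[Tuple[float, str]]) -> Dict[str, bool]:
    """Report whether the term passes the threshold in each strategy."""
    modes = {
        "default": DEFAULT_WEIGHTS,
        "restrictive": restrictive(DEFAULT_WEIGHTS),
        "permissive": permissive(DEFAULT_WEIGHTS),
    }
    return {name: direct_score(hits, w) >= THRESHOLD for name, w in modes.items()}


# Each hit is (similarity in %, evidence code of the GO term it carries).
strong_electronic = annotated_under([(80.0, "IEA")])
weak_electronic = annotated_under([(70.0, "IEA")])
experimental = annotated_under([(90.0, "IDA")])
print(strong_electronic)
print(weak_electronic)
print(experimental)
```

Under these assumptions, only terms backed by experimental evidence survive the restrictive mode, while the permissive mode additionally rescues terms that narrowly miss the threshold with the default weights.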
3 Creating GO-DAGs and Pies
3.1 Creating GO-DAGs and Pies
• Create the complete graphs for all 3 GO branches. Can you extract any conclusion?
All graphs are really big. The only thing you can conclude is that the Biological Process branch bears much more information than the other two.
• Use the seq and score filters to reduce the number of GO terms. Try one type of filter at a time. What does the resulting graph look like? Which filtering value gives you a good view of the data? Can you easily see “important” terms? How? Which ones?
Setting a “seq filter” shrinks the graph from the lowest nodes upwards. The score filter makes some intermediate nodes disappear, creating an “odd” graph with many links and few nodes.
By setting the seq filter to 30 you can see some highlighted nodes such as “response to stress”, “regulation of transcription, DNA-dependent”, “translation” and “transport”.
By setting the filter on the score value (also 30), we get an even more compact graph. Now all nodes are intensely colored and it is more difficult to find the relevant terms, but we also see functions such as “response to stress”, “regulation of transcription, DNA-dependent” and “translation”, which are among the dominant functions.
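Why the seq filter prunes the graph from the lowest nodes can be seen in a small Python sketch. It assumes that the number of sequences of a node counts every sequence annotated to that node or to one of its more specific terms; the toy graph, the sequence names and the function names are invented for this example.

```python
from typing import Dict, List, Set

# Toy graph (child -> parents) and toy annotations; illustrative only.
PARENTS: Dict[str, List[str]] = {
    "response to stress": ["response to stimulus"],
    "response to stimulus": ["biological_process"],
    "translation": ["biological_process"],
    "biological_process": [],
}
ANNOTATIONS: Dict[str, List[str]] = {
    "s1": ["response to stress"],
    "s2": ["response to stress", "translation"],
    "s3": ["translation"],
    "s4": ["response to stimulus"],
}


def seqs_per_node(annotations: Dict[str, List[str]],
                  parents: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Collect, for each node, the sequences annotated to it or below it."""
    node_seqs: Dict[str, Set[str]] = {}
    for seq, terms in annotations.items():
        visited: Set[str] = set()
        stack = list(terms)
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            node_seqs.setdefault(node, set()).add(seq)
            stack.extend(parents.get(node, []))  # propagate upwards
    return node_seqs


def seq_filter(node_seqs: Dict[str, Set[str]], min_seqs: float) -> Dict[str, int]:
    """Keep only nodes with at least `min_seqs` sequences."""
    return {node: len(seqs) for node, seqs in node_seqs.items()
            if len(seqs) >= min_seqs}


counts = seqs_per_node(ANNOTATIONS, PARENTS)
kept = seq_filter(counts, min_seqs=3)
print(kept)
```

Because counts can only grow towards the root, the specific terms at the bottom of the graph are the first to fall below the cutoff.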
• Perform a GOSlim on these data (use the plant-specific one). Create the DAG. How does it compare to the previous graphs?
The graph is much more compact. You can still find important terms such as response to stimulus, but many other nodes are not represented.
• Generate pie charts with normal and multilevel pies and bar charts. Try out different filters until you get a useful summary. Which functions are more abundant? Give a summary of the functions represented in this sub-array.
We can obtain a good summary with the bar chart (see Figure 10) and the multilevel pie (see Figure 11), but the pies by level always have too many sectors to be useful.
From both analyses we can conclude that the main functions in this dataset are: response to stimulus, translation, metabolic processes, transport, ...
3.2 Extra exercise: Pie charts with Excel and custom-colored graphs
• Export the graph data as .txt and open it in Excel. Try to reproduce some of the charts you obtained with Blast2GO.
Here I simply have the counts for the different GOs. I also have the level, so it is possible to create a bar chart and a normal pie for one level. The multilevel pie is more difficult since you need the relationships between nodes and branches.
• Make a custom-colored graph with the top 100 GO terms ordered by the number of annotated sequences.
From the table above we order the GO terms by the score. We take the first 100 terms and number them from 1 down to 0. The column with GO IDs gets duplicated, and the file with the columns GO-ID, GO-ID, value is saved with the extension .annot. Import the file into Blast2GO and create a combined graph without filters but using the option “color byDesc”. The result can be seen in Figure 12.
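The spreadsheet steps for the custom-colored graph can also be sketched in Python. The sketch ranks the terms by their count, keeps the top N and writes the three columns GO-ID, GO-ID, value. It reads "number them from 1 to 0" as evenly spaced values that decrease from 1 to 0 with the rank, which is an interpretation; the counts are invented and the function name is hypothetical.

```python
from typing import Dict, List


def color_annot_lines(term_counts: Dict[str, int], top_n: int = 100) -> List[str]:
    """Build tab-separated lines: GO-ID, GO-ID, value (1 for the top term, 0 for the last)."""
    ranked = sorted(term_counts.items(), key=lambda item: item[1], reverse=True)[:top_n]
    n = len(ranked)
    lines = []
    for rank, (go_id, _count) in enumerate(ranked):
        value = 1.0 if n == 1 else 1.0 - rank / (n - 1)
        lines.append(f"{go_id}\t{go_id}\t{value:.3f}")
    return lines


# Hypothetical sequence counts per GO term.
example_counts = {
    "GO:0006810": 80,   # transport
    "GO:0006950": 120,  # response to stress
    "GO:0006412": 95,   # translation
    "GO:0006355": 10,   # dropped when top_n=3
}
lines = color_annot_lines(example_counts, top_n=3)
print("\n".join(lines))
```

The resulting text can be saved with the extension .annot and imported as described above.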
Figure 10: Bar-Chart for the biological processes
Figure 11: Multi-Level Pie Chart
Figure 12: Custom-colored graph
4 Enrichment Analysis with Blast2GO - FatiGO
(NOTE: Results may vary depending on used parameters and different database versions!)
For example: the enriched term “response to chemical stimulus” has 114 contigs in the test set and 366 in the reference set. The term obtained an unadjusted p-value of 1.6E−6 and an adjusted (FDR) p-value of 6.9E−3. Since 6.9E−3 is below 0.05, the term is statistically over-represented after multiple testing correction.
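For readers who want to see what a one-tailed Fisher's exact test for a single term computes, here is a minimal Python sketch using only the standard library. The counts 114 and 366 come from the example above, but the sizes of the test and reference sets are not given in this document, so the values 500 and 3000 are purely hypothetical and the printed p-value will not match the reported one. Treating the two sets as disjoint is also an assumption.

```python
from math import comb


def fisher_one_tailed(test_hits: int, test_size: int,
                      ref_hits: int, ref_size: int) -> float:
    """P(X >= test_hits) under the hypergeometric null (over-representation)."""
    total = test_size + ref_size   # all sequences, sets assumed disjoint
    hits = test_hits + ref_hits    # all sequences carrying the term
    denominator = comb(total, test_size)
    upper = min(hits, test_size)
    tail = sum(
        comb(hits, k) * comb(total - hits, test_size - k)
        for k in range(test_hits, upper + 1)
    )
    return tail / denominator


# 114 and 366 are from the text; the set sizes 500 and 3000 are hypothetical.
p_example = fisher_one_tailed(test_hits=114, test_size=500,
                              ref_hits=366, ref_size=3000)
print(f"one-tailed p-value with assumed set sizes: {p_example:.3g}")

# Small sanity check: the classic 2x2 table [[3, 1], [1, 3]].
p_small = fisher_one_tailed(test_hits=3, test_size=4, ref_hits=1, ref_size=4)
print(p_small)
```

The p-value obtained this way is the unadjusted one; the FDR value reported by the program additionally corrects for the number of GO terms tested.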
Below we can see the unfiltered “enriched graph” of molecular functions (see Figure 13). This graph was saved as a PDF.
A more compact graph of these results was generated as a “thinned graph” with an FDR filter of 0.05 (see Figure 14).
Finally, the list of the most specific (tip-terms) AND enriched terms per GO branch was generated (see Figure 15).
Figure 13: Enriched molecular functions (without filter)
Enriched nodes in the graph: structural molecule activity GO:0005198 (FDR 7.0E-3, p-value 4.8E-6), structural constituent of ribosome GO:0003735 (FDR 7.0E-3, p-value 4.0E-6), iron ion binding GO:0005506 (FDR 3.3E-2, p-value 6.8E-5)
Figure 14: Enriched molecular functions (thinned out with a 0.05 FDR filter)
Nodes shown: molecular_function GO:0003674, structural molecule activity GO:0005198, structural constituent of ribosome GO:0003735 and iron ion binding GO:0005506; 5 intermediate terms are collapsed
Figure 15: List of the most specific (tip-terms) AND enriched terms per GO branch.
5 Functional Analysis/Data Mining
The analysis pipeline would be as follows:
• Load the .dat file into B2G
• Go to Tools and use the add.annot function to include the manual.annot file in the .dat project
• Go to Enrichment Analysis
• Select the StressSelection.txt file and perform a one-tailed statistical test.
• Once results are obtained, save them as default.gossip.txt
• Select all sequences
• Go to the annotation menu and reset annotation
• Change annotation parameters. Set all evidence code weights lower than 1 to 0
• Re-annotate
• Repeat the Fisher Analysis and save as strict.gossip.txt
• Select all sequences
• Go to the annotation menu and reset annotation
• Change annotation parameters. Set all evidence code weights to 1
• Re-annotate
• Repeat the Fisher Analysis and save as permissive.gossip.txt
• Outside B2G, in Excel, create a “gossip-result” annotation file: a 2-column file with the name of the strategy (default, strict or permissive) in the first column and the significant terms obtained in each analysis in the second. Save this file as a delimited text file with the extension .annot
• Load this file into Blast2GO (File menu, Load annotations)
• Make a Combined graph selecting “Node Information = with Seqs”. Here you can see the common and the differing nodes
• To create a graph with only the nodes that were enriched with all 3 strategies, make the graph with the Seq filter parameter set to 2.5. Export the results as .txt
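The last steps of this pipeline treat each strategy as if it were a sequence, so a node with 3 "sequences" is a term that was enriched with all three strategies. The same bookkeeping can be sketched in Python. The term sets below are invented placeholders rather than results of this exercise, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical significant terms per strategy (placeholders, not real results).
RESULTS: Dict[str, Set[str]] = {
    "default": {"GO:0005198", "GO:0003735", "GO:0005506"},
    "strict": {"GO:0003735"},
    "permissive": {"GO:0005198", "GO:0003735"},
}


def gossip_annot_rows(results: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Rows of the 2-column file: strategy name, significant GO term."""
    return [(strategy, term)
            for strategy in sorted(results)
            for term in sorted(results[strategy])]


def terms_in_at_least(results: Dict[str, Set[str]], min_strategies: float) -> List[str]:
    """Terms found by at least `min_strategies` strategies (cf. Seq filter = 2.5)."""
    counts = Counter(term for terms in results.values() for term in terms)
    return sorted(term for term, n in counts.items() if n >= min_strategies)


rows = gossip_annot_rows(RESULTS)
annot_text = "".join(f"{strategy}\t{term}\n" for strategy, term in rows)
common = terms_in_at_least(RESULTS, 2.5)
print(annot_text, end="")
print(common)
```

A cutoff of 2.5 behaves like "exactly 3" here because a term can be counted at most once per strategy.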