Editing Functionality - Apollo Workshop
Apollo: Collaborative genome annotation editing
A workshop for the International Aphid Genome Consortium research community.
Monica Munoz-Torres, PhD | @monimunozto
Berkeley Bioinformatics Open-Source Projects (BBOP), Environmental Genomics & Systems Biology Division, Lawrence Berkeley National Laboratory
Webinar - 22 March, 2017
http://GenomeArchitect.org
The annotation window
Basic data visualization
USER-CREATED ANNOTATIONS
EVIDENCE TRACKS
ANNOTATOR PANEL
Annotator Panel tabs: Annotations, Organism, Users, Groups, Admin, Tracks, Reference Sequence
Removable Annotator Panel
Annotation details & exon boundaries
[Figure: ‘Annotations’ tab listing gene and mRNA entries, with numbered callouts 1 and 2.]
Navigating to an annotation
[Figure: ‘Annotations’ tab listing gene and mRNA entries.]
Displaying tracks with supporting data
[Figure: ‘Tracks’ tab.]
Navigating to ‘Reference Sequence’ (i.e. assembly fragments: scaffolds, chromosomes, etc.)
[Figure: ‘Ref Sequence’ tab.]
Additional functionality
• Share a location
• Switch organisms
• Leave a session
• Hide/show Annotator Panel
Begin with a new gene model
BECOMING ACQUAINTED WITH APOLLO
Annotator panel.
• Choose appropriate evidence from list of “Tracks” on annotator panel.
• Select & drag elements from evidence track into the ‘User-created Annotations’ area.
• Hovering over annotation in progress brings up an information pop-up.
Creating a new annotation
Adding a gene model
The sequence track
‘Zoom to base level’ reveals the sequence track.
Color exons by CDS from the ‘View’ menu.
Zoom in/out with keyboard: Shift + arrow keys up/down
Toggle reference DNA sequence and translation frames in forward strand.
Also, toggle models in either direction.
Curating simple cases
“Simple case”:
- the predicted gene model is correct or nearly correct, and
- this model is supported by evidence that completely or mostly agrees with the prediction.
- evidence that extends beyond the predicted model is assumed to be non-coding sequence.
The following are simple modifications.
BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES
Editing functionality
• A confirmation box will warn you if the receiving transcript is not on the same strand as the element from where the ‘new’ exon originated.
• Check ‘Start’ and ‘Stop’ signals after each edit.
ADDING EXONS
Editing functionality
Example: Adding an exon supported by experimental data
• RNAseq reads show evidence in support of a transcribed product that was not predicted.
• Add exon by dragging up one of the RNAseq reads.
If transcript alignment data are available & extend beyond your original annotation, you may extend or add UTRs.
1. Right click at the exon edge and ‘Zoom to base level’.
2. Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.
ADDING UTRs
To add a new spliced UTR to an existing annotation, also follow the procedure for adding an exon, or to ‘Set as X’ end’.
In some cases all the data may disagree with the annotation, in other cases some data support the annotation and some of the data support one or more alternative transcripts.
Try to annotate as many alternative transcripts as are well supported by the data.
MATCHING EXON BOUNDARY TO EVIDENCE
To modify an exon boundary and match data in the evidence tracks: select both the offending exon and the element with the correct boundary, then right click on the annotation to select ‘Set 3’ end’ or ‘Set 5’ end’ as appropriate.
1. Two exons from different tracks sharing the same start/end coordinates display a red bar to indicate matching edges.
2. Selecting the whole annotation or one exon at a time, use this edge-matching function and scroll along the length of the annotation, verifying exon boundaries against available data. Use square [ ] brackets to scroll from exon to exon. Use curly { } brackets to scroll from annotation to annotation.
3. Check if cDNA / RNAseq reads lack one or more of the annotated exons or include additional exons.
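The edge-matching check described in these steps can be sketched in Python. This is only an illustration under stated assumptions: exons are modeled as (start, end) tuples on the same strand, and `shared_edges` is a hypothetical helper written for this example, not part of Apollo.

```python
def shared_edges(annotation_exons, evidence_exons):
    """Report which annotated exon boundaries also occur in an evidence feature.

    Both arguments are lists of (start, end) tuples on the same strand.
    (Hypothetical data model, chosen for illustration only.)
    """
    evidence_starts = {start for start, _ in evidence_exons}
    evidence_ends = {end for _, end in evidence_exons}
    report = []
    for start, end in annotation_exons:
        report.append({
            "exon": (start, end),
            "start_matches": start in evidence_starts,
            "end_matches": end in evidence_ends,
        })
    return report


# Made-up coordinates: the second exon disagrees at its end,
# and the third exon has no support at either edge.
annotation = [(100, 200), (300, 400), (500, 650)]
rnaseq_read = [(100, 200), (300, 410)]

report = shared_edges(annotation, rnaseq_read)
unsupported = [r["exon"] for r in report
               if not (r["start_matches"] or r["end_matches"])]
```

Exons listed in `unsupported` are the ones worth inspecting at base level, which mirrors step 3 above.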
CHECK FOR EXON INTEGRITY
Double click selects the entire model
Evidence Tracks Area
‘User-created Annotations’ Track
Edge-matching
Apollo’s editing logic (brain):
• selects longest ORF as CDS
• recalculates ORF after each edit, unless set
ORFs - setting & recalculating
Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement, without examining each exon at the base level.
Non-canonical splices are indicated with orange circles with a white exclamation point inside, placed over the edge of the offending exon.
Canonical splice sites:
forward strand: 5’-…exon]GT/AG[exon…-3’
reverse strand, not reverse-complemented: 3’-…exon]GA/TG[exon…-5’
SPLICE SITES
Zoom to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. GC donor), they should be flagged with a comment.
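A minimal Python sketch of the splice-site check, for illustration only. It assumes the intron is given as 0-based, half-open coordinates on the forward-strand reference; `classify_intron` and its return labels are hypothetical names for this example and do not reflect Apollo's internal code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_intron(reference, start, end, strand):
    """Classify the intron at reference[start:end] by its terminal dinucleotides.

    For a reverse-strand transcript the intron is reverse-complemented first,
    so the donor/acceptor are always read 5' to 3' along the transcript.
    """
    intron = reference[start:end].upper()
    if strand == "-":
        intron = reverse_complement(intron)
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical"
    if (donor, acceptor) == ("GC", "AG"):
        return "gc_donor"  # may be left as is, but flag it with a comment
    return "non_canonical"


# Made-up sequence: exon + intron (GT...AG) + exon
forward = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
forward_call = classify_intron(forward, 6, 18, "+")
reverse_call = classify_intron(reverse_complement(forward), 6, 18, "-")
```

Both calls describe the same intron seen from either strand, so each should be reported as canonical.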
Exon/intron splice site error warning
Curated model
Editing functionality
Example: Adjusting exon boundaries supported by experimental data
Apollo calculates the longest possible open reading frame (ORF) that includes canonical ‘Start’ and ‘Stop’ signals within the predicted exons.
If ‘Start’ appears to be incorrect, modify it by selecting an in-frame ‘Start’ codon further up or downstream, depending on evidence (e.g. proteins, RNAseq).
It may be present outside the predicted gene model, within a region supported by another evidence track.
In very rare cases, the actual ‘Start’ codon may be non-canonical (non-ATG).
‘Start’ AND ‘Stop’ SITES
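The "longest ORF" idea can be sketched as follows. This is a naive illustration, not Apollo's actual algorithm: it assumes the exons have already been spliced into one transcript string, considers only ATG starts and TAA/TAG/TGA stops, and ignores partial ORFs. `longest_orf` is a hypothetical helper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return (start, end) of the longest ATG..stop ORF, or None.

    Coordinates are 0-based on the spliced transcript; end is exclusive
    and includes the stop codon.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start since the last stop
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


# Made-up transcript with two candidate ORFs in different frames
transcript = "CCATGGCTAAATGAAACCCGGGTGAAT"
orf = longest_orf(transcript)
cds = transcript[orf[0]:orf[1]]
```

Choosing a different in-frame ‘Start’ by hand, as described above, amounts to overriding the `start` this search would pick.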
Curating complex cases
Evidence may support joining two or more different gene models. Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!
1. In ‘User-created Annotations’ area shift-click to select an intron from each gene model and right click to select the ‘Merge’ option from the menu.
2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.
3. Check the resulting translation by querying a protein database e.g. UniProt, NCBI nr. Add comments to record that this annotation is the result of a merge.
MERGE TWO GENE PREDICTIONS ON THE SAME SCAFFOLD
BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES
Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement, without examining each exon at the base level.
One or more splits may be recommended when:
- different segments of the predicted protein align to two or more different gene families
- predicted protein doesn’t align to known proteins over its entire length
- Transcript data may support a split; BUT first, verify whether they are alternative transcripts.
SPLIT A GENE PREDICTION
DNA Track
‘User-created Annotations’ Track
ANNOTATE FRAMESHIFTS AND CORRECT SINGLE-BASE ERRORS
Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.
CORRECTING SELENOCYSTEINE CONTAINING PROTEINS
1. Apollo allows annotators to make single base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence. If you determine that you need to make one of these changes, zoom in to the nucleotide level and right click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.
2. The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor’s current location. The ‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature asks for the string of nucleotide residues that will replace the ones on the DNA track.
3. Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu ‘Get Sequence’ option. Since the underlying genomic sequence is reflected in all annotations that include the modified region, you should alert the curators of your organism’s database using the ‘Comments’ section to report the CDS edits.
4. In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from the menu.
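The three alteration types in step 2 can be illustrated with a small Python sketch. It assumes 0-based positions and a plain string for the reference; `apply_alteration` is a hypothetical function for this example, not an Apollo API. As in Apollo, the original reference is never modified: each call returns an edited copy.

```python
def apply_alteration(reference, kind, position, value):
    """Return an edited copy of `reference` (0-based `position`).

    kind == "insertion":    `value` is the string inserted to the right of `position`
    kind == "deletion":     `value` is the number of bases removed, starting at `position`
    kind == "substitution": `value` is the string replacing the bases starting at `position`
    """
    if kind == "insertion":
        return reference[:position + 1] + value + reference[position + 1:]
    if kind == "deletion":
        return reference[:position] + reference[position + value:]
    if kind == "substitution":
        return reference[:position] + value + reference[position + len(value):]
    raise ValueError(f"unknown alteration kind: {kind}")


# Made-up reference sequence
reference = "ATGAAATAG"
with_insertion = apply_alteration(reference, "insertion", 2, "C")
with_deletion = apply_alteration(reference, "deletion", 3, 1)
with_substitution = apply_alteration(reference, "substitution", 6, "C")
```

After any such edit, the transcript and protein sequences would need to be recalculated from the edited copy, which corresponds to step 3.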
ANNOTATING FRAMESHIFTS, CORRECTING SINGLE-BASE ERRORS & SELENOCYSTEINES
Adding metadata
Information Editor
Isoforms at BIPAA:
If the gene you are annotating does not have multiple isoforms, add metadata only on left side of the Information Editor (i.e. under gene).
If the gene you are annotating has multiple isoforms, you should populate the right panel (mRNA / transcript) for each isoform, adding a letter (A, B, C, …) at the end of the name to distinguish each isoform.
History
Keeping track of each edit
Annotations, annotation edits, and History are stored in a centralized database.
Checklist
Follow this checklist until you are satisfied the annotation is the best representation of the underlying biology.
And remember to…
– comment to validate your annotation, even if you made no changes to an existing model. Think of comments as your ‘vote of confidence’.
– add a comment to inform the community of unresolved issues you think this model may have.
Always Remember: Apollo curation is a community effort so please use comments to communicate the reasons for your annotation. Your comments will be visible to everyone.
COMPLETING THE ANNOTATION
• Check ‘Start’ and ‘Stop’ sites.
• Check splice sites: most splice sites display these residues: …]5’-GT/AG-3’[…
• Check if you can annotate UTRs, for example using RNA-Seq data:
– align it against relevant genes/gene family
– blastp against NCBI’s RefSeq or nr
• Check & comment gaps in the genome.
• Additional functionality may be necessary:
– merge 2 gene predictions - same scaffold
– ‘merge’ 2 gene predictions - different scaffolds
– split a gene prediction
– annotate frameshifts
– annotate selenocysteines, correcting single-base and other assembly errors, etc.
• Add:
– Important project information in the form of comments.
– IDs for this gene model in public or private databases via DBXRefs, e.g. GenBank ID, gene symbol(s), common name(s), synonyms.
– Comments about the changes you made to each gene model, if any.
– Any appropriate functional assignments, e.g. via BLAST + HMM (e.g. InterProScan), RNA-Seq or other data of your own, literature searches, etc.
CHECKLIST for accuracy and integrity
MANUAL ANNOTATION CHECKLIST
Example
Apis mellifera genome data in Apollo
1. Evidence in support of protein coding gene models.
1.1 Consensus Gene Sets: Official Gene Set v3.2; Official Gene Set v1.0
1.2 Consensus Gene Sets comparison: OGSv3.2 genes that merge OGSv1.0 and RefSeq genes; OGSv3.2 genes that split OGSv1.0 and RefSeq genes
1.3 Protein Coding Gene Predictions Supported by Biological Evidence: NCBI Gnomon; Fgenesh++ with RNASeq training data; Fgenesh++ without RNASeq training data; NCBI RefSeq Protein Coding Genes and Low Quality Protein Coding Genes
1.4 Ab initio protein coding gene predictions: Augustus Set 12, Augustus Set 9, Fgenesh, GeneID, N-SCAN, SGP2
1.5 Transcript Sequence Alignment: NCBI ESTs, Apis cerana RNA-Seq, Forager Bee Brain Illumina Contigs, Nurse Bee Brain Illumina Contigs, Forager RNA-Seq reads, Nurse RNA-Seq reads, Abdomen 454 Contigs, Brain and Ovary 454 Contigs, Embryo 454 Contigs, Larvae 454 Contigs, Mixed Antennae 454 Contigs, Ovary 454 Contigs, Testes 454 Contigs, Forager RNA-Seq HeatMap, Forager RNA-Seq XY Plot, Nurse RNA-Seq HeatMap, Nurse RNA-Seq XY Plot
1. Evidence in support of protein coding gene models (Continued).
1.6 Protein homolog alignment: Acep_OGSv1.2, Aech_OGSv3.8, Cflo_OGSv3.3, Dmel_r5.42, Hsal_OGSv3.3, Lhum_OGSv1.2, Nvit_OGSv1.2, Nvit_OGSv2.0, Pbar_OGSv1.2, Sinv_OGSv2.2.3, Znev_OGSv2.1, Metazoa_Swissprot
2. Evidence in support of non-protein coding gene models
2.1 Non-protein coding gene predictions: NCBI RefSeq Noncoding RNA; NCBI RefSeq miRNA
2.2 Pseudogene predictions: NCBI RefSeq Pseudogene
Follow along
Access Apollo
Servers: 1 = http://tinyurl.com/apollo-bipaa1; 2 = http://tinyurl.com/apollo-bipaa2
Ceramidase
Ceramidase is an enzyme that cleaves fatty acids from ceramide, producing sphingosine (SPH), which in turn is phosphorylated by a sphingosine kinase to form sphingosine-1-phosphate (S1P). Ceramide, SPH, and S1P are bioactive lipids that mediate cell proliferation, differentiation, apoptosis, adhesion, and migration.
It has come to our attention that the honey bee Apis mellifera ortholog of Ceramidase is fragmented into 2 or more genes in the current gene set (Official Gene Set v3.2).
Interrogate the genome using Blat
Search all genomic sequences
Blat results
Click on a high-scoring segment pair (hsp) to navigate and highlight the region.
i5K Workspace@NAL
BIPAA resources - blast
BIPAA resources - Apollo
You may find candidate genes from blast results by entering their coordinates in the ‘Search’ box in the main window.
Create a new annotation
Drag and drop ‘GB40335-RA’
Transcriptomic data support longer gene
RNA-Seq reads support a large intron and additional exons located about 20 kbp downstream (3’) of the last predicted exon for GB40335-RA.
Drag and drop ‘GB40336-RA’
Merge transcripts
Select one exon from each gene model, holding down the ‘Shift’ key. Then, select ‘Merge’ from the right-click menu to bring the gene models together.
Note non-canonical splice sites.
Exon not supported by RNA-Seq data
At the end of GB40335-RA, select the last exon and right-click to choose the ‘Delete’ option.
Fix remaining non-canonical splice site
Now on the other offending exon (was first exon of GB40336-RA), use RNA-seq reads - or use ‘Set Downstream Splice Acceptor’, or drag the intron/exon boundary manually - to use a canonical splice site.
Retrieve resulting peptide, compare to public databases
Results from NCBI blastp vs nr
Add metadata in ‘Information Editor’
Don’t forget!
Nice to have
PubMed Identifiers
Gene Ontology terms
Comments
Curation with IAGC & BIPAA
Accessing the genome home page
Each genome hosted on BIPAA has a dedicated home page, accessible from AphidBase, ParWaspDB or LepidoDB. Access may be restricted for some species; if so, log in with your BIPAA account.
To create a BIPAA account visit http://bipaa.genouest.org/account
For Phylloxera, visit http://bipaa.genouest.org/sp/daktulosphaira_vitifoliae/
1. A computationally predicted consensus gene set has been generated using multiple lines of evidence in MAKER.
2. BIPAA will integrate consensus computational predictions with manual annotations to produce an updated, official gene set (OGS):
Attention!
• If it’s not on either track, it won’t make the OGS!
• If it’s there and it shouldn’t be, it will still make the OGS!
Curation process with IAGC & BIPAA
3. In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation. Use caution!
4. Isoforms: drag original and alternatively spliced form to ‘User-created Annotations’ area.
If the gene you are annotating does not have multiple isoforms, add metadata only on left side of the Information Editor (i.e. under gene).
If the gene you are annotating has multiple isoforms, you should populate the right panel (mRNA / transcript) for each isoform, adding a letter (A, B, C, …) at the end of the name to distinguish each isoform.
5. If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area, copy the gene ID in the Name field, and mark with ‘Delete’ radio button on the Information Editor.
6. Overlapping interests? Collaborate to reach agreement.
7. Follow guidelines for IAGC & BIPAA, at https://bipaa.genouest.org/is/how-to-annotate-a-genome/
Annotation report
To avoid mistakes, a personal report is generated each night for each annotator, giving access to the list of annotated genes, and the possible corresponding errors and warnings (e.g. missing symbol, wrong name, etc.).
Updating the OGS
• A new OGS is released regularly by merging the original OGS with the manual curation set.
• If a manually curated gene overlaps a gene predicted by Maker, we keep the manual annotation and replace the automated one.***
• Question from moni: Overlapping CDS? Or just overlapping coordinates?
• Gene IDs are conserved between each OGS release, a suffix being incremented when a gene is modified (structure, as well as associated information like Name or Symbol).
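The update rules above can be sketched in Python, with several loudly stated assumptions. Overlap is taken to mean overlapping coordinates on the same scaffold, which is exactly the point the open question above leaves unresolved. Gene IDs are assumed to look like `<base>.<n>`, and a curated gene is assumed to inherit the ID of the first predicted gene it replaces with the suffix incremented. All names, the ID format, and the data layout are hypothetical and do not describe BIPAA's actual pipeline.

```python
def overlaps(a, b):
    """True if two genes share coordinates on the same scaffold (assumed definition)."""
    return (a["scaffold"] == b["scaffold"]
            and a["start"] <= b["end"] and b["start"] <= a["end"])


def bump_suffix(gene_id):
    """Increment the numeric suffix of an ID of the assumed form '<base>.<n>'."""
    base, _, version = gene_id.rpartition(".")
    return f"{base}.{int(version) + 1}"


def update_ogs(predicted, curated):
    """Merge predicted and manually curated genes; manual annotations win on overlap."""
    kept = [g for g in predicted if not any(overlaps(g, m) for m in curated)]
    replaced = []
    for m in curated:
        hits = [g for g in predicted if overlaps(g, m)]
        new_id = bump_suffix(hits[0]["id"]) if hits else m["id"]
        replaced.append({**m, "id": new_id})
    return sorted(kept + replaced, key=lambda g: (g["scaffold"], g["start"]))


# Made-up genes: the curated model overlaps G1.1 but not G2.1
predicted = [
    {"id": "G1.1", "scaffold": "s1", "start": 100, "end": 500},
    {"id": "G2.1", "scaffold": "s1", "start": 1000, "end": 1500},
]
curated = [{"id": "manual-1", "scaffold": "s1", "start": 450, "end": 900}]

new_ogs = update_ogs(predicted, curated)
new_ids = [g["id"] for g in new_ogs]
```

Switching `overlaps` to compare CDS intervals instead of gene coordinates would change which predictions get replaced, which is why the definition matters.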
PUBLIC DEMO
APOLLO ON THE WEB: instructions
Public Honey bee demo available at:
genomearchitect.org/demo/
Username: [email protected]
Password: demo
APOLLO demonstration
Demonstration video available at https://youtu.be/VgPtAP_fvxY
Apollo Development
BBOP: Suzi Lewis (Principal Investigator), Nathan Dunn (Technical Lead), Moni Munoz-Torres (Project Manager)
Christine Elsik’s Lab, University of Missouri: Deepak Unni
JBrowse, Ian Holmes’ Lab, University of California, Berkeley: Eric Yao
Thank you for your attention!
Berkeley Bioinformatics Open-Source Projects, Environmental Genomics & Systems Biology,
Lawrence Berkeley National Laboratory
Funding
• Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI.
• BBOP is also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Collaborators
• Ian Holmes, Eric Yao, UC Berkeley (JBrowse)
• Chris Elsik, Deepak Unni, U of Missouri (Apollo)
• Monica Poelchau, USDA/NAL (Apollo)
• i5k Community
berkeleybop.org
UNIVERSITY OF CALIFORNIA
Suzanna Lewis & Chris Mungall
Seth Carbon (Noctua / AmiGO)
Nathan Dunn (Apollo)
Monica Munoz-Torres (Apollo / GO)
Jeremy Nguyen Xuan (Monarch Init.)