editing functionality - apollo workshop

Post on 11-Apr-2017


Apollo: Collaborative genome annotation editing

A workshop for the International Aphid Genome Consortium research community.

Monica Munoz-Torres, PhD | @monimunozto

Berkeley Bioinformatics Open-Source Projects (BBOP), Environmental Genomics & Systems Biology Division, Lawrence Berkeley National Laboratory

Webinar - 22 March, 2017

http://GenomeArchitect.org

the annotation window

Basic data visualization

USER-CREATED ANNOTATIONS

EVIDENCE TRACKS

ANNOTATOR PANEL


Annotator Panel tabs: Annotations | Organism | Users | Groups | Admin | Tracks | Reference Sequence

1. Removable Annotator Panel

2. Annotation details & exon boundaries (Annotations: gene, mRNA)

Navigating to an annotation: ‘Annotations’ tab (gene, mRNA)

Displaying tracks with supporting data: ‘Tracks’ tab

Navigating to ‘Reference Sequence’ (i.e. assembly fragments: scaffolds, chromosomes, etc.)

Ref Sequence

Additional functionality

Share a location

Switch organisms

Leave a session

Hide/show Annotator Panel

begin with a new gene model

BECOMING ACQUAINTED WITH APOLLO

• Choose appropriate evidence from list of “Tracks” on annotator panel.

• Select & drag elements from evidence track into the ‘User-created Annotations’ area.

• Hovering over annotation in progress brings up an information pop-up.

Creating a new annotation

Adding a gene model


the sequence track

‘Zoom to base level’ reveals the sequence track.


Color exons by CDS from the ‘View’ menu.

Zoom in/out with keyboard: shift + arrow keys up/down

Toggle reference DNA sequence and translation frames in forward strand.

Also, toggle models in either direction.

curating simple cases

“Simple case”:
- the predicted gene model is correct or nearly correct, and
- this model is supported by evidence that completely or mostly agrees with the prediction.
- evidence that extends beyond the predicted model is assumed to be non-coding sequence.

The following are simple modifications.

BECOMING ACQUAINTED WITH APOLLO SIMPLE CASES

Editing functionality

• A confirmation box will warn you if the receiving transcript is not on the same strand as the element from where the ‘new’ exon originated.

• Check ‘Start’ and ‘Stop’ signals after each edit.

ADDING EXONS

Editing functionality

Example: Adding an exon supported by experimental data

• RNAseq reads show evidence in support of a transcribed product that was not predicted.

• Add exon by dragging up one of the RNAseq reads.

If transcript alignment data are available & extend beyond your original annotation, you may extend or add UTRs.

1. Right click at the exon edge and ‘Zoom to base level’.

2. Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.

ADDING UTRs

To add a new spliced UTR to an existing annotation, also follow the procedure for adding an exon, or use ‘Set as X’ end’.

In some cases all the data may disagree with the annotation; in other cases some data support the annotation and some support one or more alternative transcripts.

Try to annotate as many alternative transcripts as are well supported by the data.

MATCHING EXON BOUNDARY TO EVIDENCE


To modify an exon boundary and match data in the evidence tracks: select both the offending exon and the element with the correct boundary, then right click on the annotation to select ‘Set 3’ end’ or ‘Set 5’ end’ as appropriate.

1. Two exons from different tracks sharing the same start/end coordinates display a red bar to indicate matching edges.

2. Selecting the whole annotation or one exon at a time, use this edge-matching function and scroll along the length of the annotation, verifying exon boundaries against available data. Use square [ ] brackets to scroll from exon to exon. Use curly { } brackets to scroll from annotation to annotation.

3. Check if cDNA / RNAseq reads lack one or more of the annotated exons or include additional exons.
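The edge-matching check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Apollo's implementation: the `Exon` tuple, the function name `unmatched_edges` and the example coordinates are all hypothetical, and exons are assumed to lie on the same strand.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) coordinates of one exon.
Exon = Tuple[int, int]


def unmatched_edges(annotation: List[Exon], evidence: List[Exon]) -> List[Tuple[str, int]]:
    """List annotation exon boundaries that no evidence exon shares.

    Starts are compared only with starts and ends only with ends,
    mirroring the idea of a red bar on a matching edge.
    """
    evidence_starts = {start for start, _ in evidence}
    evidence_ends = {end for _, end in evidence}
    unmatched = []
    for start, end in annotation:
        if start not in evidence_starts:
            unmatched.append(("start", start))
        if end not in evidence_ends:
            unmatched.append(("end", end))
    return unmatched


# Made-up coordinates: the first start and the last end lack support.
annotation = [(100, 200), (300, 400), (500, 650)]
rnaseq_read = [(120, 200), (300, 400), (500, 600)]
print(unmatched_edges(annotation, rnaseq_read))
```

Boundaries returned by such a check are the ones worth inspecting at base level; an empty list would correspond to red bars on every exon edge.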

CHECK FOR EXON INTEGRITY

Double click selects the entire model

Evidence Tracks Area

‘User-created Annotations’ Track

Edge-matching

Apollo’s editing logic (brain):
• selects longest ORF as CDS
• recalculates ORF after each edit, unless set

ORFs - setting & recalculating

Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement, without examining each exon at the base level.

Non-canonical splices are indicated with orange circles with a white exclamation point inside, placed over the edge of the offending exon.

Canonical splice sites:

forward strand: 5’-…exon]GT/AG[exon…-3’

reverse strand, not reverse-complemented: 3’-…exon]GA/TG[exon…-5’

SPLICE SITES

Zoom to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. GC donor), they should be flagged with a comment.
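As a rough illustration of what the splice site warning checks, the sketch below reads the first and last two bases of an intron on the transcribed strand and compares them with GT/AG. The function names, the 0-based half-open coordinates and the toy sequence are assumptions made for this example; Apollo's own check may be implemented differently.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_splice_sites(genome: str, intron_start: int, intron_end: int, strand: str):
    """Return (donor, acceptor) dinucleotides read 5'->3' on the transcribed strand.

    intron_start/intron_end are 0-based, half-open forward-strand coordinates
    (an assumption of this sketch).
    """
    intron = genome[intron_start:intron_end]
    if strand == "-":
        intron = reverse_complement(intron)
    return intron[:2].upper(), intron[-2:].upper()


def is_canonical(genome: str, intron_start: int, intron_end: int, strand: str) -> bool:
    """True when the intron starts with GT and ends with AG."""
    return intron_splice_sites(genome, intron_start, intron_end, strand) == ("GT", "AG")


# Toy sequence: exon, 12-base intron, exon.
genome = "AAACCC" + "GTAAGTTTTCAG" + "GGGTTT"
print(intron_splice_sites(genome, 6, 18, "+"))  # ('GT', 'AG')
print(is_canonical(genome, 6, 18, "+"))         # True

# A GC donor is flagged as non-canonical, although it may still be valid.
gc_donor = "AAACCC" + "GCAAGTTTTCAG" + "GGGTTT"
print(is_canonical(gc_donor, 6, 18, "+"))       # False
```

A non-canonical result here is a prompt to zoom in and add a comment, not proof that the model is wrong.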

Exon/intron splice site error warning

Curatedmodel

Editing functionality

Example: Adjusting exon boundaries supported by experimental data

Apollo calculates the longest possible open reading frame (ORF) that includes canonical ‘Start’ and ‘Stop’ signals within the predicted exons.

If ‘Start’ appears to be incorrect, modify it by selecting an in-frame ‘Start’ codon further up or downstream, depending on evidence (e.g. proteins, RNAseq).

It may be present outside the predicted gene model, within a region supported by another evidence track.

In very rare cases, the actual ‘Start’ codon may be non-canonical (non-ATG).
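The "longest ORF" rule can be illustrated with a small Python sketch. It scans the three forward frames of an already spliced transcript for the longest stretch that begins with ATG and ends at the first in-frame stop. The function name and the toy sequence are hypothetical, and this is a simplified stand-in rather than Apollo's actual code (for instance, it ignores non-ATG starts and read-through stops).

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG...stop ORF, 0-based half-open.

    Only the three forward frames of the spliced transcript are scanned.
    Returns None when no complete ORF exists.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        orf_start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if orf_start is None and codon == "ATG":
                orf_start = pos  # first in-frame start opens a candidate ORF
            elif orf_start is not None and codon in STOP_CODONS:
                candidate = (orf_start, pos + 3)
                if best is None or candidate[1] - candidate[0] > best[1] - best[0]:
                    best = candidate
                orf_start = None  # look for further ORFs in this frame
    return best


transcript = "GGATGAAACCCTAGTT"
start, end = longest_orf(transcript)
print(start, end, transcript[start:end])  # 2 14 ATGAAACCCTAG
```

When evidence points to a different ‘Start’, the annotator overrides this automatic choice by hand, as described above.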

‘Start’ AND ‘Stop’ SITES

curating complex cases

Evidence may support joining two or more different gene models. Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!

1. In ‘User-created Annotations’ area shift-click to select an intron from each gene model and right click to select the ‘Merge’ option from the menu.

2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.

3. Check the resulting translation by querying a protein database e.g. UniProt, NCBI nr. Add comments to record that this annotation is the result of a merge.

MERGE TWO GENE PREDICTIONS ON THE SAME SCAFFOLD

BECOMING ACQUAINTED WITH APOLLO COMPLEX CASES

Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement, without examining each exon at the base level.

One or more splits may be recommended when:
- different segments of the predicted protein align to two or more different gene families
- predicted protein doesn’t align to known proteins over its entire length
- Transcript data may support a split; BUT
- first, verify whether they are alternative transcripts.

SPLIT A GENE PREDICTION

DNA Track

‘User-created Annotations’ Track

ANNOTATE FRAMESHIFTS AND CORRECT SINGLE-BASE ERRORS

Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.

CORRECTING SELENOCYSTEINE CONTAINING PROTEINS


1. Apollo allows annotators to make single base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence. If you determine that you need to make one of these changes, zoom in to the nucleotide level and right click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.

2. The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor’s current location. The ‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature asks for the string of nucleotide residues that will replace the ones on the DNA track.

3. Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu ‘Get Sequence’ option. Since the underlying genomic sequence is reflected in all annotations that include the modified region, you should alert the curators of your organism’s database using the ‘Comments’ section to report the CDS edits.

4. In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from the menu.
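To make the three kinds of modification concrete, here is a minimal Python sketch that applies insertions, deletions and substitutions to a copy of a reference sequence while leaving the reference itself unchanged. The `Alteration` class, its field names and the 0-based positions are hypothetical, and "inserted to the right of the cursor" is interpreted here as "after the selected base", which is an assumption. Alterations are also assumed not to overlap.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Alteration:
    kind: str           # "insertion", "deletion" or "substitution"
    position: int       # 0-based index of the selected base on the reference
    residues: str = ""  # bases to insert or to substitute
    length: int = 0     # number of bases to delete


def apply_alterations(reference: str, alterations: Iterable[Alteration]) -> str:
    """Return an altered copy of the reference; the reference is not modified."""
    seq = reference
    # Apply from right to left so that earlier positions stay valid.
    for alt in sorted(alterations, key=lambda a: a.position, reverse=True):
        p = alt.position
        if alt.kind == "insertion":
            seq = seq[:p + 1] + alt.residues + seq[p + 1:]
        elif alt.kind == "deletion":
            seq = seq[:p] + seq[p + alt.length:]
        elif alt.kind == "substitution":
            seq = seq[:p] + alt.residues + seq[p + len(alt.residues):]
        else:
            raise ValueError(f"unknown alteration kind: {alt.kind}")
    return seq


reference = "ATGAAATAG"
print(apply_alterations(reference, [Alteration("insertion", 2, residues="C")]))
print(apply_alterations(reference, [Alteration("deletion", 3, length=1)]))
print(apply_alterations(reference, [Alteration("substitution", 6, residues="C")]))
print(reference)  # still the original, 'frozen' sequence
```

Keeping the alterations as a separate list, rather than editing the sequence in place, reflects the point that the assembly stays frozen and only the derived transcript and protein sequences change.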

ANNOTATING FRAMESHIFTS, CORRECTING SINGLE-BASE ERRORS & SELENOCYSTEINES

adding metadata

Information Editor

Isoforms at BIPAA:

If the gene you are annotating does not have multiple isoforms, add metadata only on left side of the Information Editor (i.e. under gene).

If the gene you are annotating has multiple isoforms, you should populate the right panel (mRNA / transcript) for each isoform, adding a letter (A, B, C, …) at the end of the name to distinguish each isoform.


history


Keeping track of each edit

Annotations, annotation edits, and History are stored in a centralized database.

checklist

Follow this checklist until you are satisfied the annotation is the best representation of the underlying biology.

And remember to…

– comment to validate your annotation, even if you made no changes to an existing model. Think of comments as your ‘vote of confidence’.

– add a comment to inform the community of unresolved issues you think this model may have.

Always Remember: Apollo curation is a community effort so please use comments to communicate the reasons for your annotation. Your comments will be visible to everyone.

COMPLETING THE ANNOTATION

• Check ‘Start’ and ‘Stop’ sites.

• Check splice sites: most splice sites display these residues …]5’-GT/AG-3’[…

• Check if you can annotate UTRs, for example using RNA-Seq data:
– align it against relevant genes/gene family
– blastp against NCBI’s RefSeq or nr

• Check & comment gaps in the genome.

• Additional functionality may be necessary:
– merge 2 gene predictions - same scaffold
– ‘merge’ 2 gene predictions - different scaffolds
– split a gene prediction
– annotate frameshifts
– annotate selenocysteines, correcting single-base and other assembly errors, etc.

• Add:
– Important project information in the form of comments.
– IDs for this gene model in public or private databases via DBXRefs, e.g. GenBank ID, gene symbol(s), common name(s), synonyms.
– Comments about the changes you made to each gene model, if any.
– Any appropriate functional assignments, e.g. via BLAST + HMM (e.g. InterProScan), RNA-Seq or other data of your own, literature searches, etc.

CHECKLIST for accuracy and integrity

MANUAL ANNOTATION CHECKLIST

example

Apis mellifera genome data in Apollo


1. Evidence in support of protein coding gene models.

1.1 Consensus Gene Sets: Official Gene Set v3.2, Official Gene Set v1.0

1.2 Consensus Gene Sets comparison: OGSv3.2 genes that merge OGSv1.0 and RefSeq genes; OGSv3.2 genes that split OGSv1.0 and RefSeq genes

1.3 Protein Coding Gene Predictions Supported by Biological Evidence: NCBI Gnomon, Fgenesh++ with RNASeq training data, Fgenesh++ without RNASeq training data, NCBI RefSeq Protein Coding Genes and Low Quality Protein Coding Genes

1.4 Ab initio protein coding gene predictions: Augustus Set 12, Augustus Set 9, Fgenesh, GeneID, N-SCAN, SGP2

1.5 Transcript Sequence Alignment: NCBI ESTs, Apis cerana RNA-Seq, Forager Bee Brain Illumina Contigs, Nurse Bee Brain Illumina Contigs, Forager RNA-Seq reads, Nurse RNA-Seq reads, Abdomen 454 Contigs, Brain and Ovary 454 Contigs, Embryo 454 Contigs, Larvae 454 Contigs, Mixed Antennae 454 Contigs, Ovary 454 Contigs, Testes 454 Contigs, Forager RNA-Seq HeatMap, Forager RNA-Seq XY Plot, Nurse RNA-Seq HeatMap, Nurse RNA-Seq XY Plot


1. Evidence in support of protein coding gene models (Continued).

1.6 Protein homolog alignment: Acep_OGSv1.2, Aech_OGSv3.8, Cflo_OGSv3.3, Dmel_r5.42, Hsal_OGSv3.3, Lhum_OGSv1.2, Nvit_OGSv1.2, Nvit_OGSv2.0, Pbar_OGSv1.2, Sinv_OGSv2.2.3, Znev_OGSv2.1, Metazoa_Swissprot

2. Evidence in support of non protein coding gene models

2.1 Non-protein coding gene predictions: NCBI RefSeq Noncoding RNA, NCBI RefSeq miRNA

2.2 Pseudogene predictions: NCBI RefSeq Pseudogene

follow along

Access Apollo

Servers: 1 = http://tinyurl.com/apollo-bipaa1, 2 = http://tinyurl.com/apollo-bipaa2

Ceramidase

Ceramidase is an enzyme that cleaves fatty acids from ceramide, producing sphingosine (SPH), which in turn is phosphorylated by a sphingosine kinase to form sphingosine-1-phosphate (S1P). Ceramide, SPH, and S1P are bioactive lipids that mediate cell proliferation, differentiation, apoptosis, adhesion, and migration.

It has come to our attention that the honey bee Apis mellifera ortholog of Ceramidase is fragmented into 2 or more genes in the current gene set (Official Gene Set v3.2).

Interrogate the genome using Blat


Search all genomic sequences

Blat results


Click on a high-scoring segment pair (hsp) to navigate and highlight the region.

i5K Workspace@NAL

BIPAA resources - blast


BIPAA resources - Apollo

You may find candidate genes from blast results using the ‘Search’ box with coordinates in main window.

Create a new annotation


Drag and drop ‘GB40335-RA’

Transcriptomic data support longer gene

RNA-Seq reads support a large intron and additional exons located about 20 kbp downstream (3’) of the last predicted exon for GB40335-RA.

Transcriptomic data support longer gene


Drag and drop ‘GB40336-RA’

Merge transcripts


Select one exon from each gene model, holding down the ‘Shift’ key. Then, select ‘Merge’ from right-click menu to bring gene models together.

Note non-canonical splice sites.

Exon not supported by RNA-Seq data


At the end of GB40335-RA, select last exon and right-click to choose the ‘Delete’ option.

Fix remaining non-canonical splice site

For the other offending exon (formerly the first exon of GB40336-RA), use RNA-Seq reads, ‘Set Downstream Splice Acceptor’, or manual dragging of the intron/exon boundary to reach a canonical splice site.

Retrieve resulting peptide, compare to public databases


Results from NCBI blastp vs nr


Add metadata in ‘Information Editor’


Don’t forget!

Nice to have

Add metadata in ‘Information Editor’


PubMed Identifiers

Gene Ontology terms

Comments

curation with IAGC & BIPAA

Accessing the genome home page

Each genome hosted on BIPAA has a dedicated home page, accessible from AphidBase, ParWaspDB or LepidoDB. Access may be restricted for some species; if so, log in with your BIPAA account.

To create a BIPAA account visit http://bipaa.genouest.org/account

For Phylloxera, visit http://bipaa.genouest.org/sp/daktulosphaira_vitifoliae/


1. A computationally predicted consensus gene set has been generated using multiple lines of evidence in MAKER.

2. BIPAA will integrate consensus computational predictions with manual annotations to produce an updated, official gene set (OGS):

Attention!
• If it’s not on either track, it won’t make the OGS!
• If it’s there and it shouldn’t be, it will still make the OGS!

Curation process with IAGC & BIPAA


3. In some cases algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation. Use caution!

4. Isoforms: drag original and alternatively spliced form to ‘User-created Annotations’ area.

If the gene you are annotating does not have multiple isoforms, add metadata only on left side of the Information Editor (i.e. under gene).

If the gene you are annotating has multiple isoforms, you should populate the right panel (mRNA / transcript) for each isoform, adding a letter (A, B, C, …) at the end of the name to distinguish each isoform.

5. If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area, copy the gene ID in the Name field, and mark with ‘Delete’ radio button on the Information Editor.

6. Overlapping interests? Collaborate to reach agreement.

7. Follow guidelines for IAGC & BIPAA, at https://bipaa.genouest.org/is/how-to-annotate-a-genome/


Annotation report

To avoid mistakes, a personal report is generated each night for each annotator, giving access to the list of annotated genes, and the possible corresponding errors and warnings (e.g. missing symbol, wrong name, etc.).


Updating the OGS

• A new OGS is released regularly by merging the original OGS with the manual curation set.

• If a manually curated gene overlaps a gene predicted by Maker, we keep the manual annotation and replace the automated one.***

• Question from moni: Overlapping CDS? Or just overlapping coordinates?

• Gene IDs are conserved between OGS releases; a suffix is incremented when a gene is modified (its structure, or associated information such as Name or Symbol).
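The replacement rule in the bullets above can be sketched as follows. This is a hypothetical illustration, not BIPAA's pipeline: the `Gene` record and the function names are invented for the example, and "overlap" is assumed to mean overlapping coordinates on the same scaffold and strand, which is exactly the point the open question above leaves undecided. The ID suffix scheme is not modelled because its format is not given here.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    scaffold: str
    start: int
    end: int
    strand: str


def overlaps(a: Gene, b: Gene) -> bool:
    """Assumed definition: same scaffold and strand, intersecting coordinates."""
    return (a.scaffold == b.scaffold and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def update_ogs(predicted: List[Gene], curated: List[Gene]) -> List[Gene]:
    """Keep every curated gene and drop predicted genes that overlap one."""
    kept = [g for g in predicted if not any(overlaps(g, m) for m in curated)]
    return sorted(kept + list(curated), key=lambda g: (g.scaffold, g.start))


# Made-up identifiers and coordinates.
predicted = [Gene("P1", "s1", 100, 500, "+"), Gene("P2", "s1", 1000, 1500, "+")]
curated = [Gene("M1", "s1", 400, 900, "+")]
print([g.gene_id for g in update_ogs(predicted, curated)])  # ['M1', 'P2']
```

Switching to a CDS-based overlap test would only require changing `overlaps`, which is why it is kept as a separate function in this sketch.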

PUBLIC DEMO

APOLLO ON THE WEB: instructions

Public Honey bee demo available at:

genomearchitect.org/demo/

Username: demo@demo.com

Password: demo

APOLLO demonstration

Demonstration video available at https://youtu.be/VgPtAP_fvxY

Apollo Development

BBOP: Suzi Lewis (Principal Investigator), Nathan Dunn (Technical Lead), Moni Munoz-Torres (Project Manager)

Christine Elsik’s Lab, University of Missouri: Deepak Unni

JBrowse, Ian Holmes’ Lab, University of California, Berkeley: Eric Yao

Thank you for your attention!

Berkeley Bioinformatics Open-Source Projects, Environmental Genomics & Systems Biology,

Lawrence Berkeley National Laboratory

Funding
• Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI.
• BBOP is also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231

Collaborators
• Ian Holmes, Eric Yao, UC Berkeley (JBrowse)
• Chris Elsik, Deepak Unni, U of Missouri (Apollo)
• Monica Poelchau, USDA/NAL (Apollo)
• i5k Community

berkeleybop.org

UNIVERSITY OF CALIFORNIA

Suzanna Lewis & Chris Mungall

Seth Carbon (Noctua / AmiGO), Nathan Dunn (Apollo), Monica Munoz-Torres (Apollo / GO), Jeremy Nguyen Xuan (Monarch Init.)
