Genome project management resources at the National Agricultural Library
Chris Childers and Monica Poelchau
USDA-ARS, National Agricultural Library


Page 1: Genome project management resources at the National … · 2017-08-16 · Genome project management resources at the National Agricultural Library Chris Childers and Monica Poelchau

Genome project management resources at the National Agricultural Library

Chris Childers and Monica Poelchau

USDA-ARS, National Agricultural Library


So you have a genome project. Where will you store your data?

• Make your data available through NCBI when applicable (or other INSDC organizations).

• To make your data even more useful for your community, consider also making it available in a taxon-specific repository.

• Advantages for you:
  – Greater visibility for your dataset
  – Value-added tools for searching, browsing, and analysis
  – Curation tools to improve annotation quality
  – Help with data management
  – Increasing mandate from journals and funding bodies to make research data fully accessible post-publication [1, 2]

[1] http://www.nature.com/authors/policies/data/data-availability-statements-data-citations.pdf
[2] https://obamawhitehouse.archives.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-


So you have a genome project. Where will you store your data?

• Advantages for the scientific community:
  – Helps facilitate knowledge discovery for humans (and sometimes machines);
  – Easier to find data for comparative analyses;
  – Promotes reproducible research;
  – General repositories (e.g. GenBank) may not meet the needs for storing all data types, in particular non-standard data types (e.g. phenotypic data).


Genome data management resources for arthropods – how to choose

• What species is the data from?
  – Many taxon-specific genome databases are here at this workshop

• What kind of data do you have?
  – Raw data, genome assemblies, transcriptome assemblies, and gene annotations can and should all be stored at NCBI (or another INSDC organization)
  – Some or all of these data types can also be made accessible at genome databases (just ask)
  – Generic repositories (e.g. Dryad, Ag Data Commons) can be used for data types that don’t fit the mold


The i5k Workspace@NAL

• We support any ‘orphaned’ arthropod genome project.
  – Connect researchers to the data
  – Create standardized tools for accessing the data in useful ways
  – Provide resources to facilitate manual curation projects

• Supported data types:
  – Genome assembly
  – Anything that you can map to or predict from the genome assembly

• Main requirements:
  – Genome assembly needs to be in GenBank/ENA/DDBJ
  – Data should be public (no private repositories)
  – Manual annotation only occurs at one genome database at a time

Genome Project Trajectory

• Research plan
• Genome sequencing
• Genome assembly
• Automated annotation of genome assembly
• Manual curation
• Official gene set (OGS) generation
• Biological insights/Publication
• Data access for the broader community
• Genome project maintenance


The i5k Workspace@NAL

Our background:

• Originally set up to support genomes sequenced as part of the i5k initiative

• i5k: International effort to prioritize insect genomes for sequencing; provide guidelines for genome sequencing and curation; and seek funding

• i5k goal: coordinate the sequencing and assembly of 5000 insect or related arthropod genomes

• Brief introduction to i5k at the beginning of the i5k session on Thursday


[Diagram: overview of the i5k Workspace@NAL, https://i5k.nal.usda.gov/]

Submission
• ‘Frozen’ genome assembly
• Automated annotations
• Ancillary data files (e.g. RNA-Seq alignments)

Challenges
• Non-standard data formatting
• Failure to submit all metadata (e.g. sample origin; analysis methods)

Tools
• Custom BLAST interface
• Apollo manual curation tool
• JBrowse genome browser
• HMMER
• Clustal

Services
• Manual annotation quality control
• Official gene set generation

Resources
• Tutorials
• Organism information page
• Bulk data downloads
• Gene pages


i5k Workspace content – 57 species and counting

• Many other datasets mapped to, or predicted from, each genome assembly (gene predictions, transcriptomes, RNA-Seq, etc.)

Order          Quantity    Order          Quantity
Amphipoda      1           Hemiptera      7
Araneae        3           Hymenoptera    14
Blattodea      1           Lepidoptera    2
Calanoida      1           Odonata        1
Coleoptera     7           Orthoptera     1
Diplura        1           Scorpiones     1
Diptera        13          Thysanoptera   1
Ephemeroptera  1           Trichoptera    1
Harpacticoida  1
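As a quick consistency check, the per-order counts in the table can be tallied against the headline figure of 57 species. The snippet below is only an illustrative sketch: the dictionary is transcribed by hand from the slide, and the function names are made up for this example rather than taken from any i5k Workspace code.

```python
# Per-order species counts, transcribed by hand from the slide's table.
SPECIES_PER_ORDER = {
    "Amphipoda": 1, "Araneae": 3, "Blattodea": 1, "Calanoida": 1,
    "Coleoptera": 7, "Diplura": 1, "Diptera": 13, "Ephemeroptera": 1,
    "Harpacticoida": 1, "Hemiptera": 7, "Hymenoptera": 14, "Lepidoptera": 2,
    "Odonata": 1, "Orthoptera": 1, "Scorpiones": 1, "Thysanoptera": 1,
    "Trichoptera": 1,
}


def total_species(counts):
    """Sum the species counts over all taxonomic orders."""
    return sum(counts.values())


def largest_orders(counts, n=3):
    """Return the n best-represented orders (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```

With the table as transcribed, `total_species(SPECIES_PER_ORDER)` gives 57 across 17 orders, matching the slide title, and Hymenoptera and Diptera together account for 27 of those species.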


Community annotation at the i5k Workspace

• What is community annotation?
  – Scientists collectively examine and improve gene models (usually computationally predicted)

• Why annotate?
  – Verify quality of automated gene predictions
  – Improve gene models for specific analyses
  – Link gene models to existing literature and ontologies

• Our community: Over 400 registered annotators have curated over 10,000 gene models using the Apollo software


Community annotation at the i5k Workspace

Our support for community annotation includes:

• Access to a large community of curators

• Tutorials, guidelines, webinars

• Registration mechanism for new annotators

• One-on-one support

• Software to evaluate changes between curated and original annotations (Chien-Yueh Lee, https://github.com/chienyuehlee/gff-cmp-cat)


QC and OGS pipeline

• QC program corrects common formatting errors from the curation process

• OGS generation program merges curated models with one designated gene set using curator-supplied information

• Still in development, already 6 OGSs produced (Mei-Ju Chen)

Apollo output → Error checking → Curator fixes → Merge with one designated gene set → Official Gene Set
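The slides do not describe how the OGS generation program works internally, so the following is only a rough sketch of the merge step under stated assumptions. It assumes that the "curator-supplied information" amounts to each curated model listing the IDs of the designated-set models it supersedes. The `GeneModel` class, its `replaces` field, and `merge_gene_sets` are hypothetical names invented for this illustration; they are not part of the actual i5k Workspace pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    model_id: str
    source: str            # "designated" or "curated"
    replaces: tuple = ()   # hypothetical: IDs of designated models this one supersedes


def merge_gene_sets(designated, curated):
    """Merge curated models into one designated gene set.

    Designated models named in any curated model's `replaces` field are
    dropped; all other designated models are kept, followed by the curated
    models. Raises ValueError if a curated model points at an unknown ID,
    which is the kind of problem an error-checking step would catch first.
    """
    replaced = {old_id for model in curated for old_id in model.replaces}
    known = {model.model_id for model in designated}
    unknown = replaced - known
    if unknown:
        raise ValueError(f"curated models replace unknown IDs: {sorted(unknown)}")
    kept = [model for model in designated if model.model_id not in replaced]
    return kept + list(curated)
```

For example, merging designated models `g1`, `g2`, `g3` with a curated model that replaces `g2` and a second, entirely new curated model yields `g1`, `g3`, and the two curated models. A real pipeline would also have to handle splits, merges, and coordinate-level conflicts, which this sketch ignores.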


Genome already hosted elsewhere?

• You can also use our tools to query the datasets that we host.


Other resources at the NAL: The Ag Data Commons

• Hosts any dataset funded by the USDA

• Landing page

• Citable DOI

• https://data.nal.usda.gov/

• Nine i5k datasets already available


What we’ll talk about tomorrow

1. Background: What is the i5k Workspace?
2. Submitting data
3. Finding data at the i5k Workspace
   1. General search/Content types
   2. Data downloads
   3. BLAST
   4. Clustal(s)
   5. HMMER
   6. JBrowse/Apollo
4. Improving data at the i5k Workspace via community annotation
   1. See Monica Munoz-Torres’ workshop for full use of Apollo


Need more information?

i5k Workspace@NAL:
• https://i5k.nal.usda.gov/
• https://github.com/NAL-i5K/
• Poster during the Friday session

The i5k initiative:
• New website: http://i5k.github.io/

Ag Data Commons:
• https://data.nal.usda.gov/


Acknowledgements

The NAL Team
• Gary Moore
• Susan McCarthy
• Yu-yu Lin
• Mei-Ju Chen

Workspace alumni
• Chien-Yueh Lee
• Han Lin
• Jun-Wei Lin
• Vijaya Tsavatapalli

i5k Workspace@NAL advisory committee
• Jay Evans
• Kevin Hackett
• Simon Liu
• Ursula Pieper

• i5k Coordinating Committee
• i5k Pilot Project
• Apollo & JBrowse Development Teams
• GMOD/Tripal community
• The AgBioData consortium
• All of our users and contributors!