genome project management resources at the national … · 2017-08-16 · genome project management...
TRANSCRIPT
Genome project management resources at the National Agricultural Library
Chris Childers and Monica Poelchau
USDA-ARS, National Agricultural Library
So you have a genome project. Where will you store your data?
• Make your data available through NCBI when applicable (or other INSDC organizations).
• To make your data even more useful for your community, consider also making it available in a taxon-specific repository.
• Advantages for you:
– Greater visibility for your dataset
– Value-added tools for searching, browsing, and analysis
– Curation tools to improve annotation quality
– Help with data management
– Increasing mandate from journals and funding bodies to make research data fully accessible post-publication [1, 2]
[1] http://www.nature.com/authors/policies/data/data-availability-statements-data-citations.pdf
[2] https://obamawhitehouse.archives.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-
So you have a genome project. Where will you store your data?
• Advantages for the scientific community:
– Helps facilitate knowledge discovery for humans (and sometimes machines);
– Easier to find data for comparative analyses;
– Promotes reproducible research;
– General repositories (e.g. GenBank) may not meet the needs for storing all data types, in particular non-standard data types (e.g. phenotypic data).
Genome data management resources for arthropods: how to choose
• What species is the data from?
– Many taxon-specific genome databases are here at this workshop
• What kind of data do you have?
– Raw data, genome assemblies, transcriptome assemblies, and gene annotations can and should all be stored at NCBI (or other INSDC organization)
– Some or all of these data types can also be made accessible at genome databases (just ask)
– Generic repositories (e.g. Dryad, Ag Data Commons) can be used for data types that don’t fit the mold
The i5k Workspace@NAL
• We support any ‘orphaned’ arthropod genome project.
– Connect researchers to the data
– Create standardized tools for accessing the data in useful ways
– Provide resources to facilitate manual curation projects
• Supported data types:
– Genome assembly
– Anything that you can map to or predict from the genome assembly
• Main requirements:
– Genome assembly needs to be in GenBank/ENA/DDBJ
– Data should be public (no private repositories)
– Manual annotation only occurs at one genome database at a time
Genome Project Trajectory
• Research plan
• Genome sequencing
• Genome assembly
• Automated annotation of genome assembly
• Manual curation
• Official gene set (OGS) generation
• Biological insights / Publication
• Data access for the broader community
• Genome project maintenance
The i5k Workspace@NAL
Our background:
• Originally set up to support genomes sequenced as part of the i5k initiative
• i5k: International effort to prioritize insect genomes for sequencing; provide guidelines for genome sequencing and curation; and seek funding
• i5k goal: coordinate the sequencing and assembly of 5000 insect or related arthropod genomes
• Brief introduction to i5k at the beginning of the i5k session on Thursday
i5k Workspace@NAL (https://i5k.nal.usda.gov/)
• Submission: ‘frozen’ genome assembly; automated annotations; ancillary data files (e.g. RNA-Seq alignments)
• Tools: tutorials; custom BLAST interface; Apollo manual curation tool; JBrowse genome browser; HMMer; Clustal
• Services: manual annotation quality control; official gene set generation
• Resources: organism information page; bulk data downloads; gene pages
• Challenges: non-standard data formatting; failure to submit all metadata (e.g. sample origin, analysis methods)
i5k Workspace content: 57 species and counting
• Many other datasets mapped to, or predicted from, each genome assembly (gene predictions, transcriptomes, RNA-Seq, etc.)
Order            Quantity   Order          Quantity
Amphipoda        1          Hemiptera      7
Araneae          3          Hymenoptera    14
Blattodea        1          Lepidoptera    2
Calanoida        1          Odonata        1
Coleoptera       7          Orthoptera     1
Diplura          1          Scorpiones     1
Diptera          13         Thysanoptera   1
Ephemeroptera    1          Trichoptera    1
Harpacticoida    1
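As a quick consistency check, the per-order counts in the table can be tallied against the "57 species" figure in the slide title. The snippet below is only an illustrative sketch; the dictionary simply transcribes the table, and the variable names are arbitrary.

```python
# Species counts per arthropod order, transcribed from the table above.
counts = {
    "Amphipoda": 1, "Araneae": 3, "Blattodea": 1, "Calanoida": 1,
    "Coleoptera": 7, "Diplura": 1, "Diptera": 13, "Ephemeroptera": 1,
    "Harpacticoida": 1, "Hemiptera": 7, "Hymenoptera": 14, "Lepidoptera": 2,
    "Odonata": 1, "Orthoptera": 1, "Scorpiones": 1, "Thysanoptera": 1,
    "Trichoptera": 1,
}

total = sum(counts.values())
print(f"{len(counts)} orders, {total} species")
```

The tally comes to 57 species across 17 orders, matching the slide title.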
Community annotation at the i5k Workspace
• What is community annotation?
– Scientists collectively examine and improve gene models (usually computationally predicted)
• Why annotate?
– Verify quality of automated gene predictions
– Improve gene models for specific analyses
– Link gene models to existing literature and ontologies
• Our community: Over 400 registered annotators have curated over 10,000 gene models using the Apollo software
Community annotation at the i5k Workspace
Our support for community annotation includes:
• Access to a large community of curators
• Tutorials, guidelines, webinars
• Registration mechanism for new annotators
• One-on-one support
• Software to evaluate changes between curated and original annotations (Chien-Yueh Lee, https://github.com/chienyuehlee/gff-cmp-cat)
QC and OGS pipeline
• QC program corrects common formatting errors from the curation process
• OGS generation program merges curated models with one designated gene set using curator-supplied information
• Still in development, but 6 OGSs already produced (Mei-Ju Chen)
Apollo output → Error checking → Curator fixes → Merge with one designated gene set → Official Gene Set
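The merge step in this pipeline can be pictured with a small sketch. This is not the i5k Workspace's actual OGS program: the `GeneModel` class, the `replaces` field (standing in for the curator-supplied information), the model IDs, and the `merge_gene_sets` function are all hypothetical, and real gene models carry far more structure (exons, CDS, attributes) than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """A deliberately simplified gene model (hypothetical structure)."""
    model_id: str
    scaffold: str
    start: int
    end: int
    # IDs of predicted models that this curated model supersedes. Assumed
    # here to be the kind of information a curator would supply.
    replaces: tuple = ()


def merge_gene_sets(designated, curated):
    """Combine curated models with the predicted models they do not replace."""
    replaced_ids = {old_id for model in curated for old_id in model.replaces}
    kept = [m for m in designated if m.model_id not in replaced_ids]
    merged = kept + list(curated)
    # Sort by genomic position so the output order is deterministic.
    return sorted(merged, key=lambda m: (m.scaffold, m.start, m.end, m.model_id))


# Hypothetical designated (automated) gene set.
designated = [
    GeneModel("PRED0001", "Scaffold1", 100, 900),
    GeneModel("PRED0002", "Scaffold1", 1200, 2500),
    GeneModel("PRED0003", "Scaffold2", 50, 700),
]
# One curated model that fuses two predictions into a single gene.
curated = [
    GeneModel("CUR0001", "Scaffold1", 100, 2500, replaces=("PRED0001", "PRED0002")),
]

ogs = merge_gene_sets(designated, curated)
ogs_ids = [m.model_id for m in ogs]
print(ogs_ids)
```

In this toy case the two predictions superseded by the curated model are dropped, and the untouched prediction on Scaffold2 is carried over unchanged.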
Genome already hosted elsewhere?
• You can also use our tools to query the datasets that we host.
Other resources at the NAL: The Ag Data Commons
• Hosts any dataset funded by the USDA
• Landing page
• Citable DOI
• https://data.nal.usda.gov/
• Nine i5k datasets already available
What we’ll talk about tomorrow
1. Background: What is the i5k Workspace?
2. Submitting data
3. Finding data at the i5k Workspace
   1. General search / Content types
   2. Data downloads
   3. BLAST
   4. Clustal(s)
   5. HMMER
   6. JBrowse/Apollo
4. Improving data at the i5k Workspace via community annotation
   1. See Monica Munoz-Torres’ workshop for full use of Apollo
Need more information?
i5k Workspace@NAL:
• https://i5k.nal.usda.gov/
• https://github.com/NAL-i5K/
• Poster during the Friday session
The i5k initiative:
• New website: http://i5k.github.io/
Ag Data Commons:
• https://data.nal.usda.gov/
Acknowledgements
The NAL Team
• Gary Moore
• Susan McCarthy
• Yu-yu Lin
• Mei-Ju Chen
Workspace alumni
• Chien-Yueh Lee
• Han Lin
• Jun-Wei Lin
• Vijaya Tsavatapalli
i5k Workspace@NAL advisory committee
• Jay Evans
• Kevin Hackett
• Simon Liu
• Ursula Pieper
• i5k Coordinating Committee
• i5k Pilot Project
• Apollo & JBrowse Development Teams
• GMOD/Tripal community
• The AgBioData consortium
• All of our users and contributors!