Genomic Infrastructure for NGS
The small, medium and large stuff that helps get the job done
Associate Professor Mik Black
Department of Biochemistry, University of Otago

Brief aside: who am I?
· Background in statistics: my (rather diverse and collaborative) research involves the development and application of statistical methods for problems in human disease genomics.
· Heavily involved in the establishment of two government-funded national infrastructure initiatives in New Zealand:
  - NZGL (New Zealand Genomics Ltd) - inter-university collaboration in genomics and bioinformatics.
  - NeSI (NZ eScience Infrastructure) - cross-institutional (universities and Crown Research Institutes) collaboration in high performance computing and eResearch.
· Formerly the bioinformatics team leader for NZGL (2012), and still an active team member...

Overview - this talk will cover...
· Large-scale institutional infrastructure: out-sourcing the generation and analysis of genomic data.
· Community infrastructure: shared genomics resources.
· Personal infrastructure: tools and techniques for getting the most out of genomic (or any) data.
But no Game of Thrones...
First though: I'm a statistician - let's generate some data.

Outsourcing your genomics
· Many highly competitive (and high quality) genomics providers exist.
  - Offer various levels of flexibility in terms of the types of data they will/can generate.
  - Provide a variety of services, from basic data generation to "bundled" full-service options which take tissue samples and produce analytic results.
  - Some providers also provide web- or VM-based analysis options.
· Preferences depend on many things...
  - Your own competence/capacity for handling genomic data
  - Previous experience with a provider
  - Word of mouth
  - Price...

Your experiment: what to think about?
· What is the question you are trying to answer?
· Have you designed an appropriate sequencing experiment?
  - will it provide the answer to your question?
  - is there a better way to get to that answer?
· Once you have the data, what will you do with it?
  - Storage and processing? On-site or off? Space and CPU requirements?
  - Do you have access to the skills/resources required to complete the analysis task?
Ask (and answer) these questions BEFORE you generate the data
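As a rough illustration of the "space requirements" question, here is a minimal back-of-envelope sketch for the size of uncompressed FASTQ output. Everything in it is an assumption made for illustration: the function names, the 60-byte read header, and the example experiment (12 samples, 30 million read pairs each, 100 bp reads) are hypothetical, and real files will differ (compression, header format, intermediate alignment files).

```python
def fastq_bytes_per_read(read_length: int, header_bytes: int = 60) -> int:
    """Approximate uncompressed size of one four-line FASTQ record.

    header_bytes is an assumed average header length, not a standard value.
    """
    # header line + sequence line + "+" separator line + quality line,
    # each terminated by a single newline byte.
    return (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)


def estimate_fastq_gb(n_samples: int, fragments_per_sample: int,
                      read_length: int, paired_end: bool = True,
                      header_bytes: int = 60) -> float:
    """Estimate total uncompressed FASTQ volume in gigabytes (10^9 bytes)."""
    reads_per_fragment = 2 if paired_end else 1
    total_reads = n_samples * fragments_per_sample * reads_per_fragment
    total_bytes = total_reads * fastq_bytes_per_read(read_length, header_bytes)
    return total_bytes / 1e9


# Hypothetical experiment: 12 samples, 30 million read pairs each, 100 bp reads.
example_gb = estimate_fastq_gb(n_samples=12,
                               fragments_per_sample=30_000_000,
                               read_length=100)
print(f"Approx. uncompressed FASTQ volume: {example_gb:.1f} GB")
```

Treat the result as a lower bound for planning: alignment and analysis outputs typically need to be stored alongside the raw reads.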

Outsourcing your bioinformatics
· What are you outsourcing?
  - Quality assessment and basic bioinformatics?
  - Generic data analysis?
  - Domain-specific analysis?
  - Tailored analysis for your specific question?
· Make sure a full analysis plan is in place before committing to the work.
  - If possible, have the plan inspected by an independent "expert".
DON'T underestimate the value that an expert team of bioinformaticians can bring to your project, but DO make sure you know what you will be getting from them (and the cost...)

Outsourcing your bioinformatics
· Quality assessment: this will usually be provided with the data. Don't be afraid to ask for more information (and even more QA).
· Basic bioinformatics: e.g., quality trimming/filtering and alignment to a reference genome - make sure you are very clear about what you want (if you know): trim/filter parameters, genome build, organism (!), aligner, parameters...
· Generic data analysis: e.g., variant calling, differential expression, etc.
· Tailored analysis for your specific question: this requires the most specification (and input from you) and should involve a provider with a background in this area.
ALWAYS specify that you require the code used to perform the analysis - you need to know what was done every step of the way
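One way to be "very clear about what you want" is to write the request down in a structured form that both you and the provider can check. The sketch below is only an illustration: the `AnalysisSpec` class, its field names, and the example values (organism, genome build, aligner, trimming thresholds) are all hypothetical, not a format that any provider requires.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class AnalysisSpec:
    """Hypothetical record of what is being requested from a provider."""
    organism: str
    genome_build: str
    aligner: str
    aligner_parameters: dict = field(default_factory=dict)
    trim_filter_parameters: dict = field(default_factory=dict)
    deliver_code: bool = True  # always ask for the code that was run

    def missing_fields(self) -> list:
        """Return the names of required text fields that were left blank."""
        required = ("organism", "genome_build", "aligner")
        return [name for name in required if not getattr(self, name).strip()]


# Example values are placeholders chosen for illustration only.
spec = AnalysisSpec(
    organism="Homo sapiens",
    genome_build="GRCh37",
    aligner="bwa mem",
    trim_filter_parameters={"min_base_quality": 20, "min_read_length": 50},
)

print("Missing fields:", spec.missing_fields())
print(json.dumps(asdict(spec), indent=2))
```

The JSON output can be attached to the analysis plan, and compared later against the code that the provider delivers.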

Community resources: what is available?
· Databases/Browsers (the big players)
  - NCBI (http://www.ncbi.nlm.nih.gov/)
  - Ensembl (http://www.ensembl.org/)
  - UCSC (https://genome.ucsc.edu/)
  - MANY domain-specific options (check the Nucleic Acids Research annual database issue)
· Software tools:
  - GenomeSpace (Galaxy, GenePattern, Cytoscape, ...)
  - R/Bioconductor
  - That scary command line thing...
· Generic data sources:
  - GEO, ArrayExpress, inSilicoDB, SRA, dbGaP, EGA, ...

GenePattern: web-based analysis platform
http://www.broadinstitute.org/cancer/software/genepattern/

Galaxy: web-based analysis platform
https://usegalaxy.org/

GenomeSpace: joining up the cool stuff...
http://www.genomespace.org/

Bioconductor: where they get the cool stuff...
http://bioconductor.org/

Local computational resources
· Quite local:
  - NCI (HPC for bioinformatics): http://www.ncisf.org
  - Genomics Virtual Lab: http://genome.edu.au
  - NeCTAR (HPC/eResearch resources): http://nectar.org.au
· Local(ish):
  - NZGL (customized environment): http://nzgenomics.org.nz
  - NeSI ("raw" HPC): http://nesi.org.nz

Training - "I'm sure I can do it better..."
· Most researchers aren't looking to out-source the investigative component of their research.
  - Data analysis is a fundamental part of scientific investigation.
  - Many investigators want to "own" the entire data processing/analysis process, others don't.
· Training: there are MANY opportunities available for up-skilling:
  - Bioplatforms Australia: http://www.bioplatforms.com.au
  - Institute based (e.g., IMB Winter School...)
  - NZGL: http://nzgenomics.co.nz
  - Online courses (what to choose...??):
    - FREE R/Bioconductor material: http://bioconductor.org

Bioconductor course material
http://bioconductor.org/help/course-materials/2014/

Training - know what you need
· Bioinformatics is a VERY broad field: what is it that you want to learn?
  - Early-stage analysis: QA/QC, alignment
  - More specialized: variant calling (SNPs, CNV, other SV), assembly, RNA-seq count generation, metagenomics...
  - Further downstream: analysis of "processed" data (clustering, prediction, pathways, network reconstruction, phylogeny...)
· Specialization is a GOOD thing, if you can afford it
  - it's great to be a jack-of-all-trades, but the "master-of-none" trade-off can be a problem.
  - it makes sense to invest your time where it will be most effective: learn the skills most relevant to what you are trying to accomplish with your research.

Training - "I'm sure I can do it better..."
DO take advantage of training opportunities, but DON'T overestimate what is being provided.
There is only so much we can teach in a few days...
Learn "enough to be dangerous", and then find a good "bioinformatics buddy" to keep you from going astray - know your limitations.

Reproducible research
· We are currently (I hope) in the midst of a "reproducibility revolution"
  - increased emphasis on sharing all aspects of our research.
  - strong emphasis on the use (and development) of open source tools that build on existing frameworks.
  - move (by many) towards the use of frameworks for ensuring that we are doing "reproducible research".
· The R computing environment provides a good example of this, but there are a number of others (e.g., iPython notebooks).
  - RStudio (http://rstudio.com) includes R markdown (v2) by default.
  - Facilitates the production of high-quality output (HTML, PDF, even Word!) with embedded analysis and results.
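The core idea behind these frameworks (results that can be regenerated exactly, with a record of how they were produced) can be sketched in a few lines. This is a minimal Python illustration, not R markdown itself: the `run_analysis` function, the seed value and the simulated data are all hypothetical stand-ins for a real analysis.

```python
import json
import platform
import random


def run_analysis(seed: int) -> dict:
    """Toy 'analysis': simulate data and summarise it, recording provenance."""
    # A private generator means the seed alone determines the simulated data.
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(100)]
    mean = sum(data) / len(data)
    return {
        "seed": seed,
        "n": len(data),
        "mean": mean,
        # Record the software environment next to the result.
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }


# Hypothetical seed; rerunning with the same seed reproduces the same mean.
report = run_analysis(seed=2014)
print(json.dumps(report, indent=2))
```

Tools such as R markdown do the same thing at document scale: the code, its output and the surrounding text are generated together, so the report cannot drift away from the analysis.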

RStudio interface
http://rstudio.com

R markdown output
http://rmarkdown.rstudio.com/

Tools for collaboration
· GenomeSpace (e.g., Galaxy and GenePattern) provides domain-specific tools for the collaborative sharing of data and analyses.
· A number of groups combine cloud-based tools in an ad hoc fashion to generate a collaborative research environment:
  - storage provision (e.g., Dropbox, Google Drive)
  - code sharing/editing (e.g., GitHub, Bitbucket)
  - reproducible research (e.g., rmarkdown, iPython notebooks)
  - shared/collaborative web-based analysis (e.g., RStudio Server, Shiny Server).
Although seemingly haphazard, this approach provides a lot of flexibility for incorporating new tools as they emerge.

Summary
· Large-scale infrastructure - know what you are getting:
  - Clearly define plans and expectations in terms of the data and analysis that you are paying for.
  - Ensure you have the resources needed to complete the project.
· Community infrastructure - know what is available:
  - Generic and domain-specific resources exist which can facilitate, streamline and complement your research.
· Personal infrastructure - know what you are doing:
  - Know your tools, and develop (and follow!) an analysis plan.
  - The "reproducible research" paradigm offers a valuable set of resources to help ensure reproducibility.

In closing
· These slides are not meant to contain a comprehensive summary of the issues and opportunities surrounding genomics and bioinformatics infrastructure.
· If you still have questions after this session, please feel free to come and have a chat with me over the next couple of days (or email):

A (non-exhaustive) list of useful local links:
· Australia:
  - QFAB*: http://qfab.org
  - AGRF*: http://agrf.org.au
  - Aus. Bioinformatics Network*: http://australianbioinformatics.net
  - Bioplatforms Australia: http://bioplatforms.com.au
  - Ensembl/EMBL resources (local)*: http://braembl.org.au
· New Zealand:
  - NZGL: http://nzgenomics.co.nz
  - NZ Bioinformatics Institute: http://www.bioinformatics.org.nz
  - NeSI: http://nesi.org.nz
* These guys are all giving presentations this week...