Genomic Infrastructure for NGS The small, medium and large stuff that helps get the job done Associate Professor Mik Black Department of Biochemistry, University of Otago


Page 1: S S GG N N r r o o f f e e r r u u t t c c u u r r t t s s ...bioinformatics.org.au/ws14/wp-content/uploads/ws14/sites/5/2014/08/... · national infrastructure initiatives ... ·

Genomic Infrastructure for NGS
The small, medium and large stuff that helps get the job done

Associate Professor Mik Black
Department of Biochemistry, University of Otago

Brief aside: who am I?

· Background in statistics: my (rather diverse and collaborative) research involves the development and application of statistical methods for problems in human disease genomics.
· Heavily involved in the establishment of two government-funded national infrastructure initiatives in New Zealand:
  - NZGL (New Zealand Genomics Ltd) - inter-university collaboration in genomics and bioinformatics.
  - NeSI (NZ eScience Infrastructure) - cross-institutional (universities and Crown Research Institutes) collaboration in high performance computing and eResearch.
· Formerly the bioinformatics team leader for NZGL (2012), and still an active team member...

Overview - this talk will cover...

· Large-scale institutional infrastructure: out-sourcing the generation and analysis of genomic data.
· Community infrastructure: shared genomics resources.
· Personal infrastructure: tools and techniques for getting the most out of genomic (or any) data.

But no Game of Thrones...

First though: I'm a statistician - let's generate some data.

Outsourcing your genomics

· Many highly competitive (and high quality) genomics providers exist.
  - Offer various levels of flexibility in terms of the types of data they will/can generate.
  - Provide a variety of services, from basic data generation to "bundled" full-service options which take tissue samples, and produce analytic results.
  - Some providers also provide web- or VM-based analysis options.
· Preferences depend on many things...
  - Your own competence/capacity for handling genomic data
  - Previous experience with a provider
  - Word of mouth
  - Price....

Your experiment: what to think about?

· What is the question you are trying to answer?
· Have you designed an appropriate sequencing experiment?
  - will it provide the answer to your question?
  - is there a better way to get to that answer?
· Once you have the data, what will you do with it?
  - Storage and processing? On-site or off? Space and CPU requirements?
  - Do you have access to the skills/resources required to complete the analysis task?

Ask (and answer) these questions BEFORE you generate the data
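To put rough numbers on the "space" part of the storage question, here is a minimal Python sketch that estimates uncompressed FASTQ size from read count and read length. The 60-byte average header, and the example run of 12 samples with 30 million read pairs of 100 bp each, are assumptions chosen purely for illustration; they are not figures from this talk, and compressed files or alignment output will be a different size.

```python
def estimate_fastq_bytes(n_reads: int, read_length: int, paired: bool = True,
                         header_bytes: int = 60) -> int:
    """Rough uncompressed FASTQ size in bytes.

    Each FASTQ record has four lines: header, bases, '+' separator, qualities.
    header_bytes is an ASSUMED average header length, not a measured value.
    """
    # bases + qualities (one byte per base each), header, '+' line, 4 newlines
    bytes_per_record = 2 * read_length + header_bytes + 1 + 4
    # paired-end runs produce two records per fragment
    n_records = n_reads * (2 if paired else 1)
    return n_records * bytes_per_record


def to_gigabytes(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical experiment: 12 samples, 30 million read pairs each, 100 bp.
    per_sample = estimate_fastq_bytes(30_000_000, 100, paired=True)
    total = 12 * per_sample
    print(f"per sample: {to_gigabytes(per_sample):.1f} GB uncompressed")
    print(f"12 samples: {to_gigabytes(total):.1f} GB uncompressed")
```

Even a crude estimate like this is enough to start the "on-site or off?" conversation before any data exists.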

Outsourcing your bioinformatics

· What are you outsourcing?
  - Quality assessment and basic bioinformatics?
  - Generic data analysis?
  - Domain-specific analysis?
  - Tailored analysis for your specific question?
· Make sure a full analysis plan is in place before committing to the work.
  - If possible, have the plan inspected by an independent "expert".

DON'T underestimate the value that an expert team of bioinformaticians can bring to your project, but DO make sure you know what you will be getting from them (and the cost...)

Outsourcing your bioinformatics

· Quality assessment: this will usually be provided with the data. Don't be afraid to ask for more information (and even more QA).
· Basic bioinformatics: e.g., quality trimming/filtering and alignment to a reference genome - make sure you are very clear about what you want (if you know): trim/filter parameters, genome build, organism (!), aligner, parameters...
· Generic data analysis: e.g., variant calling, differential expression, etc.
· Tailored analysis for your specific question: this requires the most specification (and input from you) and should involve a provider with a background in this area.

ALWAYS specify that you require the code used to perform the analysis - you need to know what was done every step of the way
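One way to be "very clear about what you want" is to write the specification down in a form that can be checked for gaps. The Python sketch below is only an illustration of that idea: the class name, field names and example values are hypothetical, and the list of fields simply mirrors the items named on these slides (organism, genome build, trim/filter parameters, aligner, aligner parameters, delivery of code).

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AnalysisSpec:
    """Items to agree on with a provider before work starts.

    Field names are illustrative; None means "not yet specified".
    """
    organism: Optional[str] = None
    genome_build: Optional[str] = None
    trim_filter_parameters: Optional[str] = None
    aligner: Optional[str] = None
    aligner_parameters: Optional[str] = None
    code_to_be_delivered: Optional[bool] = None


def unspecified_items(spec: AnalysisSpec) -> List[str]:
    """Return the names of fields that are still unset (None)."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]


def ready_to_commit(spec: AnalysisSpec) -> bool:
    """True only if every item is specified and the code will be supplied."""
    return not unspecified_items(spec) and spec.code_to_be_delivered is True


if __name__ == "__main__":
    # Example values are placeholders, not recommendations.
    draft = AnalysisSpec(organism="Homo sapiens", genome_build="GRCh37")
    print("Still to agree:", unspecified_items(draft))
    print("Ready to commit:", ready_to_commit(draft))
```

The point of the design is that "we never discussed that" shows up as an explicit unset field rather than as a surprise after the data has been processed.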

Community resources: what is available?

· Databases/Browsers (the big players)
  - NCBI (http://www.ncbi.nlm.nih.gov/)
  - Ensembl (http://www.ensembl.org/)
  - UCSC (https://genome.ucsc.edu/)
  - MANY domain-specific options (check the Nucleic Acids Research annual database issue)
· Software tools:
  - GenomeSpace (Galaxy, GenePattern, Cytoscape,...)
  - R/Bioconductor
  - That scary command line thing....
· Generic data sources:
  - GEO, ArrayExpress, inSilicoDB, SRA, dbGaP, EGA,....

GenePattern: web-based analysis platform

http://www.broadinstitute.org/cancer/software/genepattern/

Galaxy: web-based analysis platform

https://usegalaxy.org/

GenomeSpace: joining up the cool stuff...

http://www.genomespace.org/


Bioconductor: where they get the cool stuff...

http://bioconductor.org/

Local computational resources

· Quite local:
  - NCI (HPC for bioinformatics): http://www.ncisf.org
  - Genomics Virtual Lab: http://genome.edu.au
  - NeCTAR (HPC/eResearch resources): http://nectar.org.au
· Local(ish):
  - NZGL (customized environment): http://nzgenomics.org.nz
  - NeSI ("raw" HPC): http://nesi.org.nz

Training - "I'm sure I can do it better..."

· Most researchers aren't looking to out-source the investigative component of their research.
  - Data analysis is a fundamental part of scientific investigation.
  - Many investigators want to "own" the entire data processing/analysis process, others don't.
· Training: there are MANY opportunities available for up-skilling:
  - Bioplatforms Australia: http://www.bioplatforms.com.au
  - Institute based (e.g., IMB Winter School...)
  - NZGL: http://nzgenomics.co.nz
  - Online courses (what to choose...??):
    - FREE R/Bioconductor material: http://bioconductor.org

Bioconductor course material

http://bioconductor.org/help/course-materials/2014/

Training - know what you need

· Bioinformatics is a VERY broad field: what is it that you want to learn?
  - Early-stage analysis: QA/QC, alignment
  - More specialized: variant calling (SNPs, CNV, other SV), assembly, RNA-seq count generation, metagenomics...
  - Further downstream: analysis of "processed" data (clustering, prediction, pathways, network reconstruction, phylogeny...)
· Specialization is a GOOD thing, if you can afford it
  - it's great to be a jack-of-all-trades, but the "master-of-none" trade-off can be a problem.
  - makes sense to invest your time where it will be most effective: learn the skills most relevant to what you are trying to accomplish with your research.

Training - "I'm sure I can do it better..."

DO take advantage of training opportunities, but DON'T overestimate what is being provided.

There is only so much we can teach in a few days...

Learn "enough to be dangerous", and then find a good "bioinformatics buddy" to keep you from going astray - know your limitations.

Reproducible research

· We are currently (I hope) in the midst of a "reproducibility revolution"
  - increased emphasis on sharing all aspects of our research.
  - strong emphasis on the use (and development) of open source tools that build on existing frameworks.
  - move (by many) towards the use of frameworks for ensuring that we are doing "reproducible research".
· The R computing environment provides a good example of this, but there are a number of others (e.g., iPython notebooks).
  - Rstudio (http://rstudio.com) includes R markdown (v2) by default.
  - Facilitates the production of high-quality output (HTML, PDF, even Word!) with embedded analysis and results.
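The frameworks mentioned here (R markdown, iPython notebooks) keep the code and results together in one document. A smaller habit in the same spirit, sketched below in Python using only the standard library, is to save a provenance record next to every result: a checksum of the input, the parameters used, and the software environment. This is a minimal illustration under assumptions, not part of any of the tools above; the function names, the toy input and the `min_count` parameter are all hypothetical.

```python
import hashlib
import json
import platform
import sys
from typing import Any, Dict


def sha256_of_bytes(data: bytes) -> str:
    """Hex SHA-256 digest, used to fingerprint the exact input analysed."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(input_data: bytes, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Collect what is needed to say later how a result was produced."""
    return {
        "input_sha256": sha256_of_bytes(input_data),
        "parameters": parameters,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    # Toy in-memory input standing in for a real data file.
    toy_input = b"sample,count\nA,10\nB,12\n"
    record = provenance_record(toy_input, {"min_count": 5})
    # In practice this JSON would be written alongside the analysis output.
    print(json.dumps(record, indent=2, sort_keys=True))
```

If the input file or a parameter changes, the record changes with it, which makes "which version of the data was this run on?" an answerable question.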

Rstudio interface

http://rstudio.com

R markdown output

http://rmarkdown.rstudio.com/

Tools for collaboration

· GenomeSpace (e.g., Galaxy and GenePattern) provide domain-specific tools for the collaborative sharing of data and analyses.
· A number of groups combine cloud-based tools in an ad hoc fashion to generate a collaborative research environment:
  - storage provision (e.g., Dropbox, Google Drive)
  - code sharing/editing (e.g., Github, Bitbucket)
  - reproducible research (e.g., rmarkdown, iPython notebooks)
  - shared/collaborative web-based analysis (e.g., RStudio Server, Shiny Server).

Although seemingly haphazard, this approach provides a lot of flexibility for incorporating new tools as they emerge.

Summary

· Large-scale infrastructure - know what you are getting:
  - Clearly define plans and expectations in terms of the data and analysis that you are paying for.
  - Ensure you have the resources needed to complete the project.
· Community infrastructure - know what is available:
  - Generic and domain-specific resources exist which can facilitate, streamline and complement your research.
· Personal infrastructure - know what you are doing:
  - Know your tools, and develop (and follow!) an analysis plan.
  - The "reproducible research" paradigm offers a valuable set of resources to help ensure reproducibility.

In closing

These slides are not meant to contain a comprehensive summary of the issues and opportunities surrounding genomics and bioinformatics infrastructure.

If you still have questions after this session, please feel free to come and have a chat with me over the next couple of days (or email):

[email protected]

A (non-exhaustive) list of useful local links:

· Australia:
  - QFAB*: http://qfab.org
  - AGRF*: http://agrf.org.au
  - Aus. Bioinformatics Network*: http://australianbioinformatics.net
  - Bioplatforms Australia: http://bioplatforms.com.au
  - Ensembl/EMBL resources (local)*: http://braembl.org.au
· New Zealand:
  - NZGL: http://nzgenomics.co.nz
  - NZ Bioinformatics Institute: http://www.bioinformatics.org.nz
  - NeSI: http://nesi.org.nz

* These guys are all giving presentations this week...