engaging biologists with big data using interactive genome … · 2015-12-04 · engaging...



Current GEP Members

The Genomics Education Partnership (GEP) began in 2006 with 16 members and has grown steadily. GEP members represent a very diverse group of schools, both public and private, large and small, with varying educational missions and diverse student populations.

Currently there are >100 affiliated schools; >60 faculty per year are engaged in GEP research, and >1,000 undergraduates participate each year. Faculty generally join by attending a one-week workshop at WUSTL. Shared work (done in summer Alumni Workshops) is organized on the GEP website (curriculum development, publications, etc.).

We find that institutional characteristics have little correlation with student success, indicating that diverse students in diverse settings benefit from curriculum-based research experiences of this type.

[Figure: GEP members by year joined, 2006-2015.]

http://galaxyproject.org http://usegalaxy.org

The Galaxy platform is an open-source, Web-based platform for analyzing large biomedical datasets. Galaxy’s key motivations are:

1.  Accessibility for everyone: scientists can use Galaxy’s Web-based interface to run complex analyses on large datasets using computing clusters or cloud computing with no programming; programmers can use Galaxy through its API, which provides programmatic access to all Galaxy functionality.

2.  Reproducibility for all analyses: all analysis details, including input datasets, tool versions, and parameter settings, are saved so that an analysis can be precisely repeated by anyone with access to the analysis.

3.  Web-based collaborative science: analyses can be shared with collaborators through a Web link, published to the entire Web, and included in Galaxy Pages, which are online, interactive research supplements.
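The reproducibility point (item 2) amounts to saving, for every analysis, the input datasets, tool versions, and parameter settings. The Python sketch below illustrates that idea only; it is not Galaxy's actual data model. The names `AnalysisRecord`, `checksum`, and `fingerprint`, as well as the example tool name and parameters, are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest identifying one input dataset."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class AnalysisRecord:
    """Hypothetical record of what is needed to repeat one analysis step."""
    tool_name: str
    tool_version: str
    # input name -> checksum of the input data
    input_checksums: dict = field(default_factory=dict)
    # parameter settings exactly as supplied by the user
    parameters: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Stable hash of the record; dictionary key order does not matter."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative data: two identical runs and one run with a newer tool version.
reads = b"ACGTACGT"
run_a = AnalysisRecord("example_aligner", "1.0",
                       {"reads": checksum(reads)},
                       {"max_mismatches": 2, "seed": 7})
run_b = AnalysisRecord("example_aligner", "1.0",
                       {"reads": checksum(reads)},
                       {"seed": 7, "max_mismatches": 2})
run_c = AnalysisRecord("example_aligner", "1.1",
                       {"reads": checksum(reads)},
                       {"max_mismatches": 2, "seed": 7})
```

In this sketch, two runs with the same inputs, tool version, and parameters share a fingerprint, while changing only the tool version yields a different one, which is why all three kinds of detail have to be recorded.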

Engaging Biologists with Big Data Using Interactive Genome Annotation

Jeremy Goecks¹, Wilson Leung², and Sarah C.R. Elgin²

¹George Washington University and ²Washington University in St. Louis

Project Goal: combine two successful and long-running projects—the Genomics Education Partnership and the Galaxy Project—to create an integrated, Web-based, and scalable environment (G-OnRamp) that will enable biologists to utilize large genomic datasets for interactive annotation of any genome, an activity that can serve as an introduction and training for “big data” biomedical analyses.

The Genomics Education Partnership (GEP)
http://gep.wustl.edu

Primary goals:
•  Incorporate genomics and bioinformatics into the undergraduate curriculum
•  Engage undergraduates in genomics research

Central organization:
•  Hosts training workshops for GEP faculty / TAs
•  Develops & maintains web framework for projects
•  Hosts shared curriculum & assessment

Student photos taken by GEP faculty Michael Rubin (University of Puerto Rico – Cayey) and Heather Eisler (University of the Cumberlands)

Faculty members have collaboratively developed a variety of ways to use the GEP approach in their teaching:
•  Short (10 hrs) modules in a genetics course
•  Longer modules within molecular biology laboratory courses
•  Stand-alone genomics lab courses
•  Independent research studies

Results produced by GEP students are reconciled and used in subsequent scientific publications [e.g., Leung et al. 2015, G3. 5(5):719-40].

Workflow
•  Public “draft” genomes
•  Divide into overlapping student projects (40-100 kb)
•  Sequence Improvement:
    •  Sequence and assembly improvement
    •  Optional wet bench experiment: PCR/sequencing of gaps
    •  Collect projects, compare and verify final consensus sequence
•  Annotation:
    •  Evidence-based gene annotation
    •  Optional evidence-based TSS and motif annotation
    •  Collect projects, compare and confirm annotations
•  Reassemble into high quality annotated sequence
•  Analyze and publish results
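The "divide into overlapping student projects (40-100 kb)" step can be sketched in Python as below. This is a minimal illustration, not the GEP's actual procedure: the function name `split_into_projects` and the 10 kb overlap are assumptions, since the poster gives only the project size range.

```python
def split_into_projects(genome_length, project_size=100_000, overlap=10_000):
    """Split [0, genome_length) into overlapping half-open (start, end) windows.

    project_size and overlap are in base pairs. The overlap value is an
    assumption made for illustration. The final window may be shorter than
    project_size (and could fall below 40 kb for some genome lengths).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if not 0 <= overlap < project_size:
        raise ValueError("overlap must be >= 0 and smaller than project_size")

    step = project_size - overlap  # distance between consecutive window starts
    projects = []
    start = 0
    while True:
        end = min(start + project_size, genome_length)
        projects.append((start, end))
        if end == genome_length:  # the whole sequence is now covered
            break
        start += step
    return projects


# Illustrative 250 kb region split into 100 kb projects with 10 kb overlaps.
projects = split_into_projects(250_000)
```

Because neighbouring windows share sequence, features that span a project boundary are seen by more than one student, which supports the later "compare and confirm" step.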

Training Benefits

Students are challenged to analyze and evaluate available evidence (assembled on the GEP UCSC genome browser) to create optimal gene models, often in the face of contradictory evidence, and to explore other genomic features (right). GEP students report substantial learning gains, which improve significantly with more time invested (bottom).

GEP Challenges can be Addressed by Galaxy

GEP provides an ideal use case for training scientists to work with big data, but there are several challenges that Galaxy can address.

World-wide Galaxy Usage

Galaxy is used by tens of thousands of scientists throughout the world and is increasing in popularity.

Galaxy Features for End-to-End Analysis of Large -omics Datasets

•  Thousands of analysis tools from simple to advanced for genomics, proteomics, metabolomics, chemoinformatics, and more

•  Web interface scales to large collections of datasets for batch analysis

•  Integration with many public databases making it simple to combine private and public data

•  Graphical workflow editor to create multi-step, reproducible analyses of individual datasets or collections

•  Visual analytics for visualizing datasets generated from analyses and analyzing data within a visualization

•  Share or publish any Galaxy dataset, history, workflow, or visualization using a Web link

•  Only need a Web browser to access all Galaxy features

More Powerful Workflows

Arbitrary # of Inputs (... paired).

Run applications in parallel (one per input).

Merged output for subsequent processing.
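The pattern described here (an arbitrary number of inputs, one application run per input in parallel, outputs merged for later steps) can be illustrated with a plain-Python analogy. This is not how Galaxy implements it; `gc_content`, `run_on_collection`, and `merge_outputs` are hypothetical names, and the GC-content calculation merely stands in for an analysis tool.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence: str) -> float:
    """Stand-in 'tool': fraction of G and C bases in one sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def run_on_collection(tool, inputs: dict, max_workers: int = 4) -> dict:
    """Run `tool` once per input, in parallel, keeping each input's name."""
    names = list(inputs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs
        results = list(pool.map(tool, (inputs[name] for name in names)))
    return dict(zip(names, results))


def merge_outputs(outputs: dict) -> list:
    """Merge per-input results into one tab-separated table, sorted by name."""
    return [f"{name}\t{value:.2f}" for name, value in sorted(outputs.items())]


# Illustrative collection of two named inputs.
collection = {"sample_b": "GGCC", "sample_a": "ATGC"}
merged = merge_outputs(run_on_collection(gc_content, collection))
```

The same function handles two inputs or two hundred, which is the point of applying a workflow step to a collection rather than to individual datasets.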

GEP Challenge | Galaxy Feature to Address Challenge
Difficult to set up and integrate GEP computational tools | Automated installation and configuration
Cannot be easily extended to organisms beyond Drosophila | Can be configured to work with any organism and with multiple organisms at once
Limited flexibility to add custom analyses and data into the curriculum | Supports completely customizable workflows and analyses
Difficult to share and collaborate across physically-distributed sites | Web-based collaboration framework for sharing all Galaxy objects

Acknowledgements

G-OnRamp supported by NIH grant HG008843-01. GEP supported by HHMI grant #52005780, NSF grant #1431407 and WUSTL. Galaxy supported by NIH grant HG006620-04 and GWU.

Contact

Sarah C.R. Elgin, [email protected]

GEP + Galaxy = G-OnRamp

G-OnRamp Goals:
•  Create a custom Galaxy server to power interactive annotation of any genome
•  Provide an interactive, Web-based platform that can scale to support world-wide big data biomedical training through interactive genome annotation
•  Foster the growth of the GEP and other educational communities to increase the participation of undergraduates and the broader scientific community in genomics research

G-OnRamp Features:
•  Analysis workflows for creating multiple, complementary datasets for genome annotation:
    •  Gene prediction models and homology results
    •  ChIP-Seq peak calls for transcription factor binding sites and histone modifications
    •  Splice junction and transcript predictions from RNA-Seq
    •  Methylated regions identified through the analysis of bisulfite sequencing results
•  Interactive, Web-based tools and visualizations for:
    •  Viewing annotation evidence
    •  Placing labels on genomic regions
    •  Facilitating distributed and collaborative annotations
    •  Reconciling annotations produced by multiple individuals
•  Workflows, tools, and visualizations will be agnostic to the organism:
    •  Facilitate the analyses and annotations of non-model organisms
    •  Ensure that G-OnRamp can reach as broad an audience as possible
•  Easy for individuals to use and install:
    •  Public servers powered by national cybercomputing infrastructure (iPlant and XSEDE)
    •  Self-contained installation package (virtual machine with all the dependencies already installed and configured)
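Reconciling annotations produced by multiple individuals, one of the features listed above, could take many forms. The Python sketch below shows one simple possibility, a majority vote over submitted coordinates. It is an assumption made for illustration, not the method used by the GEP or planned for G-OnRamp, and the names `reconcile`, `min_agreement`, and the example gene identifiers are hypothetical.

```python
from collections import Counter


def reconcile(submissions: dict, min_agreement: int = 2):
    """Reconcile feature coordinates submitted by several annotators.

    submissions maps annotator -> {feature_id: (start, end)}.
    Returns (consensus, conflicts): consensus holds coordinates backed by at
    least min_agreement annotators and a strict majority of the calls for
    that feature; conflicts lists the competing calls for everything else.
    """
    calls_by_feature = {}
    for features in submissions.values():
        for feature_id, coords in features.items():
            calls_by_feature.setdefault(feature_id, []).append(coords)

    consensus, conflicts = {}, {}
    for feature_id, calls in calls_by_feature.items():
        coords, votes = Counter(calls).most_common(1)[0]
        if votes >= min_agreement and votes > len(calls) / 2:
            consensus[feature_id] = coords
        else:
            # Flag for manual review instead of guessing
            conflicts[feature_id] = sorted(set(calls))
    return consensus, conflicts


# Illustrative submissions from three students.
submissions = {
    "student_1": {"geneA": (100, 500), "geneB": (900, 1500)},
    "student_2": {"geneA": (100, 500), "geneB": (950, 1500)},
    "student_3": {"geneA": (100, 500)},
}
consensus, conflicts = reconcile(submissions)
```

Here all three students agree on `geneA`, so it enters the consensus, while the two differing calls for `geneB` are returned as a conflict for a person to resolve against the evidence.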

Validating G-OnRamp using GEP:
•  GEP faculty will serve as beta testers to ensure that G-OnRamp meets real educational needs
•  Provide continuous feedback to help guide the development of G-OnRamp
•  Help test and revise curriculum and training materials during workshops

[Figure: Mean scores on the 20 learning gain items in the SURE survey, comparing Q1 (1-10 hrs.), Q4 (>36 hrs.), and SURE (Summer Research). Labeled items: understanding the research process, ability to analyze data, independence. Source: Shaffer CD et al. 2014, CBE Life Sci Educ. 13(1):111-30.]

Free public Galaxy instance at http://usegalaxy.org

Registered users can use the high-performance computing resources on the main public Galaxy instance to run -omics data analyses for free. Users run ~130,000 analyses each month on the server.

Public servers are available for anyone to use: http://bit.ly/gxyservers

Analysis Tools in Galaxy

Nearly all command-line tools can be integrated into Galaxy, and thousands already have been.

[Figure: Number of registered users on Galaxy Main by year, 2006-2015 (y-axis: number of users, 0 to 60,000).]

[Figure: New repositories and total repositories by year, 2011-2015 (y-axis: count).]