
Introduction to iPlant

Dan Stanzione

The iPlant Collaborative

September 16th, 2013

The iPlant Collaborative: Cyberinfrastructure for the Plant Sciences

“BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day.

BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because that would take weeks. Instead, it sends computer disks containing the data, via FedEx.”
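To see why shipping disks can beat the network, here is a minimal back-of-the-envelope sketch. Only the 2,000 genomes per day figure comes from the quote; the per-genome data size and the link speed are assumptions chosen for illustration, not BGI's actual numbers.

```python
def transfer_days(data_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to push data_bytes through a link running at full speed."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


GENOMES_PER_DAY = 2_000      # from the quote above
BYTES_PER_GENOME = 100e9     # ASSUMPTION: ~100 GB of raw data per human genome
LINK_BPS = 1e9               # ASSUMPTION: a dedicated 1 Gbit/s link

daily_output_bytes = GENOMES_PER_DAY * BYTES_PER_GENOME
days = transfer_days(daily_output_bytes, LINK_BPS)
print(f"One day of output: {daily_output_bytes / 1e12:.0f} TB, "
      f"about {days:.1f} days to transmit")
```

Under these assumptions a single day of sequencing output (200 TB) needs roughly 18.5 days on the wire, which is consistent with the "weeks" in the quote.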

The Problem of Big Data in Biology

The Problem of Big Data in Biology: A decade’s progress

2003: ABI 3730 Sequencer. Human Genome: $2.7 Billion, 13 Years

2012: Oxford Nanopore MinION. Human Genome: $900, 6 Hours
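The size of that decade's change is easier to appreciate as ratios. The sketch below uses only the four figures on the slide; the 365.25-day average year is an assumption made for the unit conversion.

```python
HOURS_PER_YEAR = 365.25 * 24  # ASSUMPTION: average year length, leap days included

cost_2003_usd = 2.7e9                  # Human Genome, 2003 (from the slide)
cost_2012_usd = 900                    # Human Genome, 2012 (from the slide)
time_2003_hours = 13 * HOURS_PER_YEAR  # 13 years
time_2012_hours = 6                    # 6 hours

cost_fold = cost_2003_usd / cost_2012_usd
time_fold = time_2003_hours / time_2012_hours

print(f"Cost fell by a factor of about {cost_fold:,.0f}")
print(f"Time fell by a factor of about {time_fold:,.0f}")
```

With these inputs the cost ratio is 3,000,000 and the time ratio is about 19,000.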


Data-intensive biology will mean getting biologists comfortable with new technology…

1973: Sharp, Sambrook, Sugden

Gel Electrophoresis Chamber, $250

1958: Matt Meselson &

Ultracentrifuge, $500,000

The Problem of Big Data in Biology

…hopefully comfortable enough to minimize the technology and focus on the biology.

What is iPlant?

The iPlant Collaborative is a community-driven organization building cyberinfrastructure for the plant (and animal) sciences.

What is Cyberinfrastructure?

Cyberinfrastructure is the coordinated aggregate of software, hardware and other technologies, as well as human expertise, required to support current and future discoveries in science and engineering.

--Fran Berman

Cyberinfrastructure consists of computing systems, data storage systems, instruments and data repositories, visualization environments, and people, linked together by software and networks to improve research productivity and enable breakthroughs not otherwise possible. --Craig Stewart

At iPlant, we make computation, data storage, cloud services, and software tools easily available to informaticians and researchers, leveraging existing CI investments.

The Stampede System at the Texas Advanced Computing Center

The iPlant Community

Over 9,000 researchers now access iPlant services or data, in diverse areas from ecology to epigenomics

“In my collaboration with Nathan, the tasks they used to call me to do as the ‘bioinformatics expert’, the students now do on their own through the Discovery Environment” -U. of Minnesota

“What my users used to call me for, they now do on their own through Atmosphere. Now I can scale up my user community” -U. of Wisconsin

“The resources available change your research landscape… the amounts and types of analyses that you do.” Haibao Tang, J. Craig Venter Institute

Scientific Achievements through iPlant’s Open Infrastructure

• In conjunction with the BIEN project, generating Range maps for thousands of species (can compute in a few hours)

• The 1KP project has stored tens of millions of sequence reads with iPlant and now has a richer catalog of plant data with iPlant than with NCBI; annotation used 2.6 million hours of BLAST

• In conjunction with USDA and researchers at Iowa State, pipelines in place which can re-sequence cattle/buffalo at 3 hours per animal.

• Speedups of thousands on certain GWAS & network comparison problems

• Integration of AgMIP agriculture models and data

Twig to Genome


Brief History of iPlant

• Start of software development – Sep 2009

• First prototypes delivered to public – Apr 2010

• Discovery Environment release with user-driven tool integration – Jul 2011

• Public launch of Atmosphere, “Powered by iPlant” – PAG 2012

Design Principles (this slide first shown in March, 2009)

• Ad hoc, reconfigurable analysis workflows

• Community-led authorship of new components

• Re-use of existing codebases

• Computational scalability

• Enable sharing and collaboration

• Presentation of well-designed APIs

• Attractive visualization of results

• Facile data integration


The iPlant Collaborative: Cyberinfrastructure for the Plant Sciences

• The iPlant CI is designed as infrastructure.

• This means it is a platform upon which other projects can build.

• Use of the iPlant infrastructure can take one of several forms:

– Storage

– Computation

– Hosting

– Web Services

– Scalability

iPlant Today

• Today, iPlant provides a robust, maturing (and still growing) CI for thousands of users, leveraging other large scale NSF investments in hardware, software, and networking.

• While iPlant has many applications in many forms, the core CI Components are the:

– Discovery Environment

– Atmosphere “Cloud” Platform

– Application Programmer Interfaces

– Data Store

– Foundation of Computation provided by NSF XSEDE

• iPlant can be used to run large scale, complex workflows using distributed, international resources.

• The rest of the speakers will show the novel ways iPlant is used to enable science!

iPlant Growth Over Time

[Figure: timeline of iPlant growth, 2008–2012, with the following labels]

• Grand Challenge team conferences, whitepapers and requirements gathering

• 192 registered users (collaboration tools)

• 778 registered users (collaborative tools)

• 93 users of DE

• 2200+ registered users (iPlant CI)

• 8800+ registered users

• >400 TB storage with growth rate of 1 TB/day

• Now ~12,000
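The storage figures above (>400 TB, growing at about 1 TB/day) invite a quick capacity projection. This is a minimal sketch assuming the growth stays linear and that 1 PB is counted as 1,000 TB; neither assumption comes from the slides.

```python
def days_until(target_tb: float,
               current_tb: float = 400.0,
               growth_tb_per_day: float = 1.0) -> float:
    """Days until storage reaches target_tb, ASSUMING constant linear growth."""
    if target_tb <= current_tb:
        return 0.0
    return (target_tb - current_tb) / growth_tb_per_day


ONE_PETABYTE_TB = 1_000  # ASSUMPTION: decimal petabyte
print(f"Days until 1 PB: {days_until(ONE_PETABYTE_TB):.0f}")
```

At the stated rate the store would pass one petabyte in about 600 days; faster-than-linear growth would shorten that.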

• For a challenge as broad as “plant science,” focus on specific applications/tools is a moving target, and never enough.

• Most important to build a *platform* that can support diverse and constantly evolving needs. “Cyberinfrastructure” is, in fact, infrastructure. The platform can lift all the apps, not select winners and losers.

“The useful lifetime of our analysis tool chains is now 6 months”

-Matthew Trunnel, Broad Institute

The iPlant Collaborative: Cyberinfrastructure for the Plant Sciences

We have designed iPlant to be consistent with the pillars of CIF21

• High Performance Computing

• Data and Data Analysis

• Virtual Organization

• Learning and Workforce

The iPlant Collaborative: Cyberinfrastructure Philosophy

[Figure: diagram with the labels End Users, Computational Users, TeraGrid, XSEDE]


The iPlant Collaborative: Ways to access iPlant

• Atmosphere: For virtual hosting of web apps, sites, databases.

• iPlant Data Storage: All data large and small

• The Discovery Environment: Integrated Web apps.

• DNASubway: Annotation and more

• Standalone Apps: TNRS, TreeViewer, PhytoBisque, etc.

• The API: For programmers embedding iPlant CI capabilities

• Command line for experts (thru TeraGrid/XSEDE)

We’ll cover each of these in more detail throughout the workshop.

The iPlant Collaborative: Practical Benefits

• Powerful computational resources (Data analysis and storage)

• Experimental verifiability, reproducibility, provenance

• Interconnected resources / multiple levels of access

• Facilitation of collaboration

• Scalability/extensibility

The iPlant Collaborative: Scalable Computation for High Throughput Inquiry

• About 500,000 compute cores available in a variety of platforms.

• Up to 1 TB shared memory

[Figure labels: TACC Ranger, TACC Lonestar, PSC Blacklight, TACC Corral, EBI Web Services]

The iPlant Collaborative: Scalable Computation for High Throughput Inquiry

• Chris Pires, U. of Missouri

– Assembly of Brassica genomes on shared memory systems

• Haibao Tang, JCVI

– “The resources available change your research landscape – the amounts and types of analyses that you do.”

The iPlant Collaborative: iPlant Discovery Environment

• A rich web client

– Consistent interface to bioinformatics tools

– Portal for users who won’t want to interact with lower level infrastructure

• An integrated, extensible system of applications and services

– Additional intelligence above low level APIs

– Provenance, Collaboration, etc.

DE Growth

• Extensibility has meant an exponential curve as platform matures.

• Major DE Releases in last year:

– January 2011: Extensible by programmers

– July 2011: Extensible by Users

– January 2012: Extensible workflow, too.

[Figure: Discovery Environment Tools Available vs. Time. x-axis: 2008–2011; y-axis: Tools (0–120)]

Today ~400 tools and counting!

The iPlant Collaborative: Project Atmosphere™: Custom Cloud Computing

• API-compatible implementation of Amazon EC2/S3 interfaces

• Virtualize the execution environment for applications and services

• Up to 12 core / 48 GB instances

• Access to Cloud Storage + EBS

• Run servers, CloudBurst desktop use cases. Big data and the desktop are co-local again!

>60 hosted applications in Atmosphere today, including users from USDA, Forest Service, database providers, etc. (30 more for postdocs and grad students for training classes)

Atmosphere – Exemplar Users

• Nathan Miller, Wisconsin

• “What my users used to call me for, they now do on their own through Atmosphere. Now I can scale up my user community” – Nathan Miller


The iPlant Collaborative: Data Store

Fast data transfers via parallel, non-TCP file transfer

• Move large (>2 GB) files with ease

Multiple, consistent access modes

• iPlant API

• iPlant web apps

• Desktop mount (FUSE/DAV)

• Java applet (iDrop)

• Command line

Fine-grained ACL permissions

• Sharing made simple

Access and a storage allocation are automatic with your iPlant account
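As an illustration of what "fine-grained ACL permissions" can mean in practice, here is a minimal sketch of per-user access levels on a single data object. The class, the level names (read, write, own), the user names, and the path are all hypothetical; this is not the Data Store's actual interface.

```python
from enum import IntEnum
from typing import Dict


class Permission(IntEnum):
    """Hypothetical access levels, ordered from weakest to strongest."""
    NONE = 0
    READ = 1
    WRITE = 2
    OWN = 3


class DataObject:
    """A stored file with its own access control list (user -> level)."""

    def __init__(self, path: str, owner: str) -> None:
        self.path = path
        self.acl: Dict[str, Permission] = {owner: Permission.OWN}

    def share(self, granting_user: str, target_user: str,
              level: Permission) -> None:
        # Only a user holding OWN on this object may change its ACL.
        if self.acl.get(granting_user, Permission.NONE) < Permission.OWN:
            raise PermissionError(f"{granting_user} cannot share {self.path}")
        self.acl[target_user] = level

    def allows(self, user: str, needed: Permission) -> bool:
        # A user not listed in the ACL has no access at all.
        return self.acl.get(user, Permission.NONE) >= needed


# Hypothetical usage: alice shares one file read-only with bob.
reads = DataObject("/hypothetical/home/alice/reads.fastq", owner="alice")
reads.share("alice", "bob", Permission.READ)
print(reads.allows("bob", Permission.READ))
print(reads.allows("bob", Permission.WRITE))
```

Keeping the list on each object, rather than on a whole account, is what makes the sharing fine-grained: one file can be shared without exposing anything else.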

The iPlant API

• The API is the Application Programmer Interface.

• This is the way bioinformatics tools and data get integrated with iPlant.

• The API is out there now. Avoiding the cardinal sin of API support:

– Release lots of versions, each incompatible with the last.

– Our approach: Incremental releases; each release will add new areas of functionality, not change old syntax.

• 2011 Usage: 96,000 true hits, 1,000 large jobs, 37 apps and 3 XSEDE sites supported.
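The "add new areas, do not change old syntax" rule can be checked mechanically. Below is a minimal sketch, assuming each release is described as a mapping from call name to its parameter names. The call names and releases are invented for illustration and are not the real iPlant API.

```python
from typing import Dict, List, Tuple

# A release is modelled as: call name -> tuple of parameter names.
Release = Dict[str, Tuple[str, ...]]


def breaking_changes(old: Release, new: Release) -> List[str]:
    """List every way `new` breaks a client written against `old`."""
    problems = []
    for call, params in old.items():
        if call not in new:
            problems.append(f"removed: {call}")
        elif new[call] != params:
            problems.append(f"changed: {call}")
    return problems


# Hypothetical releases, for illustration only.
release_1: Release = {"io.list": ("path",), "jobs.submit": ("app", "inputs")}
release_2: Release = {**release_1, "data.transform": ("path", "format")}
release_bad: Release = {"io.list": ("folder",), "data.transform": ("path", "format")}

print(breaking_changes(release_1, release_2))
print(breaking_changes(release_1, release_bad))
```

A purely additive release such as `release_2` yields an empty list, while `release_bad` is flagged for renaming a parameter and dropping a call.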


API Exemplar Users

• Carol Lushbough, BioExtract

• BioExtract, a DBI-funded portal, is currently being rewritten to take advantage of the API.

• Carol is running jobs on Lonestar through the API.


The API has also been adopted outside plants, by Apache Airavata, CyberGIS, and some XSEDE sites.

• A number of other applications are “Powered by iPlant” but developed by our team on top of the infrastructure.

• In response to specific grand challenge team requests for things that needed their own web presence.

• TNRS, My-Plant, and more.


• “Powered by iPlant” is the moniker for a variety of ways of using the iPlant infrastructure underneath another application that communicates with users; usually outside the iPlant project.

• Other major projects have adopted the iPlant CI as their underlying infrastructure (some completely, some in limited ways – more on this later).

iPlant APIs

Resources

iPlant Advanced Collaborative Support

• Based on the old TeraGrid AUS and XSEDE Extended Collaborative Support

• Provide a computing expert for an extended period of time to rebuild a popular tool for scalability, or other key functionality.

• Could be scaling, infovis, or just information architecture help.

Future of iPlant

• NSF invited iPlant to submit a renewal proposal in the summer of 2012.

• The proposal was submitted in September, and received a site visit (December 2012).

• Approved by the National Science Board at the May, 2013 meeting.

• A renewal may be formally announced, hypothetically, when a certain agency gets the press release out the door.

The iPlant Collaborative: A virtual organization

UA, TACC, CSHL

Metadata, Data, Tools, Workflows, Viz

Executive Team: Steve Goff, Dan Stanzione

Staff: Greg Abram, Sonali Aditya, Roger Barthelson, Brad Boyle, Todd Bryan, Gordon Burleigh, John Cazes, Mike Conway, Karen Cranston, Rion Doodey, Andy Edmonds, Dmitry Fedorov, Michael Gatto, Utkarsh Gaur, Cornel Ghiban, Michael Gonzales, Hariolf Häfele, Matthew Hanlon, Anthony Heath, Barbara Heath, Matthew Helmke, Natalie Henriques, Uwe Hilgert, Nicole Hopkins, Eun-Sook Jeong, Logan Johnson, Chris Jordan, B.D. Kim, Kathleen Kennedy, Mohammed Khalfan, Seung-jin Kim, Lars Koersterk, Sangeeta Kuchimanchi, Kristian Kvilekval, Aruna Lakshmanan, Sue Lauter, Tina Lee, Andrew Lenards, Zhenyuan Lu, Eric Lyons, Naim Matasci, Sheldon McKay, Robert McLay, Angel Mercer, Dave Micklos, Nathan Miller, Steve Mock, Martha Narro, Praveen Nuthulapati, Shannon Oliver, Shiran Pasternak, William Peil, Titus Purdin, J.A. Raygoza Garay, Dennis Roberts, Jerry Schneider, Bruce Schumaker, Sriramu Singaram, Edwin Skidmore, Brandon Smith, Mary Margaret Sprinkle, Sriram Srinivasan, Josh Stein, Lisa Stillwell, Kris Urie, Peter Van Buren, Hans Vasquez-Gross, Matthew Vaughn, Fusheng Wei, Jason Williams, John Wregglesworth, Weijia Xu, Jill Yarmchuk

Faculty Advisors & Collaborators: Ali Akoglu, Greg Andrews, Kobus Barnard, Sue Brown, Thomas Brutnell, Michael Donoghue, Casey Dunn, Brian Enquist, Damian Gessler, Ruth Grene, John Hartman, Matthew Hudson, Dan Kliebenstein, Jim Leebens-Mack, David Lowenthal, Robert Martienssen, B.S. Manjunath, Nirav Merchant, David Neale, Brian O’Meara, Sudha Ram, David Salt, Mark Schildhauer, Doug Soltis, Pam Soltis, Edgar Spalding, Alexis Stamatakis, Ann Stapleton, Lincoln Stein, Val Tannen, Todd Vision, Doreen Ware, Steve Welch, Mark Westneat

Students: Peter Bailey, Jeremy Beaulieu, Devi Bhattacharya, Storme Briscoe, Ya-Di Chen, John Donoghue, Steven Gregory, Yekatarina Khartianova, Monica Lent, Amgad Madkour, Aniruddha Marathe, Kurt Michaels, Dhanesh Prasad, Andrew Predoehl, Jose Salcedo, Shalini Sasidharan, Gregory Striemer, Jason Vandeventer, Kuan Yang

Postdocs: Barbara Banbury, Jamie Estill, Bindu Joseph, Christos Noutsos, Brad Ruhfel, Stephen A. Smith, Chunlao Tang, Lin Wang, Liya Wang, Norman Wickett

iPlant Collaborative Data Mining with iPlant

Workshop Goals

• Demonstrate some of the ways iPlant CI can advance your science

• Familiarize you with iPlant tools and services

• Jumpstart your project’s involvement with iPlant through a direct consultation

• Help you add your “voice” to the iPlant user community

Thanks!

• Thank you for taking the time to participate in this workshop; we hope to make it productive, and for those staying for tomorrow and the next day, increasingly hands-on.

• Questions?