Empowering Bioinformatics Workflows Using the Lustre Wide Area File System across a 100 Gigabit Network Stephen Simms Manager, High Performance File Systems Indiana University [email protected]


Post on 11-Jan-2016


Page 1

Empowering Bioinformatics Workflows Using the Lustre Wide Area File System across a 100 Gigabit Network

Stephen Simms
Manager, High Performance File Systems

Indiana University [email protected]

Page 2

Today’s talk brought to you by NCGAS

• Funded by National Science Foundation
• Large memory clusters for assembly
• Bioinformatics consulting for biologists
• Optimized software for better efficiency
• Open for business at: http://ncgas.org

Page 3

Data

• In the 21st Century everything is data
– Patient data
– Nutritional data
– Musical data

• Raw material for
– Scientific advancement
– Technological development

Page 4

Better Technology = More Data

Page 5

Better Telescopes

• ODI – One Degree Imager
– WIYN (Wisconsin, Indiana, Yale, NOAO) telescope in Arizona
– ODI will provide 1 billion pixels/image

• Pan-STARRS
– Providing 1.4 billion pixels/image
– Currently has over 1 Petabyte of images stored

[Image: One Degree Imager, 32k x 32k CCD]
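The "1 billion pixels/image" figure follows directly from the 32k x 32k CCD size. A minimal check, assuming "32k" means 32 × 1024 pixels per side (the slide does not say which convention it uses):

```python
def pixel_count(width: int, height: int) -> int:
    """Number of pixels in a width x height detector."""
    return width * height

# Assumption: "32k" is read as 32 * 1024 = 32768 pixels per side.
side = 32 * 1024
odi_pixels = pixel_count(side, side)
print(f"{odi_pixels:,} pixels per image")  # 1,073,741,824 pixels per image
```

Under that reading the detector has 2^30 pixels, which rounds to the 1 billion quoted on the slide.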

Page 6

Better Televisions

• Ultra High Definition Television (UHDTV)
– 16 times more pixels than HDTV
– Last month LG began sales of 84” UHDTV
– Tested at the 2012 Summer Olympics
– Storage media lags behind

Page 7

Genomics

• Next Gen sequencers are generating more data and getting cheaper

• Sequencing is:
– Becoming commoditized at large centers
– Multiplying at individual labs

• Analytical capacity has not kept up
– Storage support
– Computational support (thousand points solution)
– Bioinformatics support

Page 8

Data Capacitor

NSF funded in 2005
535 Terabytes Lustre storage (currently 1.1 PB)
24 servers with 10Gb NICs
Short to mid-term storage

http://www.flickr.com/photos/shadowstorm/404158384/
http://www.flickr.com/photos/dvd5/163647219/
http://www.flickr.com/photos/vidiot/431357888/

Page 9

The Lustre Filesystem

• Open Source

• Supports many thousands of client systems

• Supports petabytes of storage

• Over 240 GB/s measured throughput at ORNL

• Scalable

– aggregates separate servers for performance

– user specified “stripes”

• Standard POSIX interface
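The "stripes" bullet can be made concrete with a small sketch. Lustre's plain striped layout spreads a file round-robin across several storage objects, which is how separate servers are aggregated for performance. The function and parameter names below are illustrative only and are not part of any Lustre API:

```python
from typing import Tuple

def locate(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file byte offset to (stripe index, byte offset inside that
    stripe's object), assuming simple round-robin striping."""
    chunk = offset // stripe_size            # which stripe-sized chunk of the file
    stripe_index = chunk % stripe_count      # chunks rotate across the stripes
    object_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return stripe_index, object_offset

MIB = 1024 * 1024
# Hypothetical layout: 4 stripes of 1 MiB each.
for off in (0, 1 * MIB, 4 * MIB, 5 * MIB + 10):
    print(off, locate(off, MIB, 4))
```

With four stripes, consecutive 1 MiB chunks land on different objects, so a large sequential read or write can keep four servers busy at once.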

Page 10

Lustre

Scalable Object Storage

[Diagram: Client, MDS (metadata server), OSS (object storage server)]

Page 11

Computation

Page 12

Workflow - The Data Lifecycle

Page 13

http://www.flickr.com/photos/davesag/4307240/in/set-799526/

Page 14

Data Lifecycle – Centralized Storage

Page 15

NCGAS Cyberinfrastructure at IU

• Mason large memory cluster (512 GB/node)

• Quarry cluster (16 GB/node)

• Data Capacitor (1.1 PB)

• Research File System (RFS)

• Research Database Cluster for structured data

• Bioinformaticians and software engineers

Page 16

Galaxy: Make it easier for Biologists

• Galaxy interface provides a “user friendly” window to NCGAS resources

• Supports many bioinformatics tools

• Available for both research and instruction.

[Figure: "Computational Skills" axis running from LOW to HIGH, with the labels "Common" and "Rare"]

Page 17

GALAXY.IU.EDU Model

Virtual box hosting Galaxy.IU.edu

The host for each tool is configured to meet IU needs

[Diagram: Quarry, Mason, Data Capacitor, RFS]

UITS/NCGAS establishes tools, hardens them, and moves them into production.

A custom Galaxy tool can be made to import data from the RFS to the DC.

Individual labs can get duplicate boxes – provided they support it themselves.

Policies on the DC guarantee that untouched data is removed with time.

Page 18

Increasing DC’s Utility

• If we’re getting high speed performance across campuses
– What could we do across longer distances?

• Empower geographically distributed workflows
• Facilitate data sharing among colleagues
• Provide data everywhere all the time

Page 19

2006 - 10 Gb Lustre WAN

977 MB/s between ORNL and IU
Using a single Dell 2950 client
Across 10Gb TeraGrid connection
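To put the 977 MB/s figure in context, it helps to express it as a fraction of the 10 Gb link. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 Gb = 10^9 bits); the slide does not state which convention was used, and the helper name is made up for illustration:

```python
def link_utilization(throughput_mb_per_s: float, link_gbit_per_s: float) -> float:
    """Fraction of a link's nominal bit rate used by a given byte throughput.
    Assumes decimal megabytes and gigabits."""
    bits_per_s = throughput_mb_per_s * 1_000_000 * 8
    return bits_per_s / (link_gbit_per_s * 1_000_000_000)

print(f"{link_utilization(977, 10):.1%}")  # 78.2%
```

Under those assumptions a single client was filling roughly three quarters of the 10 Gb TeraGrid connection.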

Page 20

2007 Bandwidth Challenge Win: Five Applications Simultaneously

• Acquisition and Visualization
– Live Instrument Data: Chemistry
– Rare Archival Material: Humanities

• Acquisition, Analysis, and Visualization
– Trace Data: Computer Science
– Simulation Data: Life Science, High Energy Physics

Page 21

Beyond a Demo

• To make Lustre across the Wide Area Network useful, and more than a demo, we needed to be able to span heterogeneous name spaces
– In Unix each user has a UID
– It could differ from system to system
– We created a method to preserve ownership across systems
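The slide does not describe how ownership is preserved, so the following is only a sketch of the general idea: a lookup table that translates a (site, local UID) pair into the UID the file system should record. The site names, UID values, and fallback behaviour are all hypothetical and are not taken from IU's implementation:

```python
from typing import Dict, Tuple

# Hypothetical table: (site name, UID at that site) -> UID stored by the file system.
UID_MAP: Dict[Tuple[str, int], int] = {
    ("site-a", 1001): 5001,
    ("site-b", 2417): 5001,  # the same person, with a different UID at another site
    ("site-b", 1001): 5002,  # the same number, but a different person
}

# Assumption: unmapped users fall back to the conventional "nobody" UID.
NOBODY_UID = 65534

def map_uid(site: str, remote_uid: int) -> int:
    """Translate a site-local UID into the file system's UID."""
    return UID_MAP.get((site, remote_uid), NOBODY_UID)

print(map_uid("site-a", 1001), map_uid("site-b", 2417), map_uid("site-c", 42))
```

The point of the table is that UID 1001 at two sites need not be the same person, while one person with two different UIDs should still own the same files.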

Page 22

IU’s Data Capacitor WAN Filesystem

• Funded by Indiana University in 2008

• Put into production in April of 2008

• 360TB of storage available as production service

• Centralized short-term storage for resources nationwide:

– Simplifies use of distributed resources

– Projects space exists for mid-term storage

Page 23

Gas Giant Planet Research

Page 24

2010: Lustre WAN at 100Gb

Page 25

100 Gbit Testbed – Full Duplex Results

[Diagram labels: 16*8 Gbit/s; 16*20 Gbit/s DDR IB; 100GbE; 5*40 Gbit/s QDR IB; 16*8 Gbit/s]

Writing to Freiberg: 10.8 GB/s
Writing to Dresden: 11.1 GB/s
Total: 21.9 GB/s

Page 26

100 Gbit Testbed – Uni-Directional Efficiency

Unidirectional Lustre: 11.79 GByte/s (94.4%)

TCP/IP: 98.5 Gbit/s (98.5%)

Link: 100 Gbit/s (100.0%)

Page 27

2011: SCinet Research Sandbox

• Supercomputing 2011, Seattle
– Joint effort of SCinet and Technical Program

• Software Defined Networking and 100 Gbps
– From Seattle to Indianapolis (2,300 miles)

• Demonstrations using Lustre WAN
– network
– benchmark
– applications

Page 28

Network, Hardware and Software

• Internet2 and ESnet, 50.5 ms RTT
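A 50.5 ms round trip matters because of the bandwidth-delay product: the amount of data that must be in flight to keep a long link full. A quick calculation, assuming the full 100 Gbps line rate from the demonstration and decimal units (the function name is illustrative):

```python
def bandwidth_delay_product_bytes(link_gbit_per_s: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to saturate a link with the given RTT."""
    bits_in_flight = link_gbit_per_s * 1e9 * (rtt_ms / 1000.0)
    return bits_in_flight / 8

bdp = bandwidth_delay_product_bytes(100, 50.5)
print(f"{bdp / 1e6:.2f} MB in flight")  # 631.25 MB in flight
```

In other words, clients and servers together need several hundred megabytes of outstanding requests before the link can run at full speed, which is why wide area file system performance depends on how much concurrent I/O is kept in flight.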

Page 29

Network, Hardware and Software

Page 30

Application Results

• Applications
– Peak: 6.2 GB/s
– Sustained: 5.6 GB/s

Page 31

NCGAS Workflow Demo at SC 11

• STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence

• STEP 2: sequence alignment to a known reference genome

• STEP 3: SNP detection to scan the alignment result for new polymorphisms

[Diagram: Bloomington, IN and Seattle, WA]
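The three steps can be illustrated with a deliberately tiny toy pipeline. This is not the software used in the demo: the reference sequence, reads, quality threshold, and brute-force matching are all made up for illustration, and real aligners and SNP callers are far more sophisticated.

```python
from typing import List, Tuple

REFERENCE = "ACGTACGTTAGC"  # made-up reference sequence

def preprocess(reads: List[Tuple[str, List[int]]], min_mean_quality: float = 20) -> List[str]:
    """Step 1: keep reads whose mean base quality meets the threshold."""
    return [seq for seq, quals in reads if sum(quals) / len(quals) >= min_mean_quality]

def align(read: str, reference: str) -> int:
    """Step 2: return the reference offset with the fewest mismatches (brute force)."""
    best_offset, best_mismatches = 0, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset

def detect_snps(read: str, offset: int, reference: str) -> List[Tuple[int, str, str]]:
    """Step 3: list (position, reference base, read base) for each mismatch."""
    return [(offset + i, reference[offset + i], base)
            for i, base in enumerate(read) if base != reference[offset + i]]

reads = [
    ("ACGTACCT", [30] * 8),  # high quality, differs from the reference at one base
    ("TTAGC", [5] * 5),      # low quality, dropped in step 1
]
for read in preprocess(reads):
    offset = align(read, REFERENCE)
    print(read, offset, detect_snps(read, offset, REFERENCE))
```

Each step consumes the previous step's output, which is why a shared file system visible from both sites lets the stages run where the compute is without copying intermediate files by hand.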

Page 32

Monon 100

• Provides 100Gb connectivity between IU and Chicago

• Internet2 deploying 100Gb networks nationally
• New opportunities for sharing Big Data
• New opportunities for moving Big Data

Page 33

[Figure: bandwidth scale from 0 to 100 Gbps]

• Commodity Internet (1 Gbps but highly variable)
• Ultra SCSI 160 Disk (1.2 Gbps, 160 MBps)
• NLR to Sequencing Centers (10 Gbps/link)
• IU Data Capacitor WAN (20 Gbps throughput)
• DDR3 SDRAM (51.2 Gbps, 6.4 GBps)
• Internet2 (100 Gbps)
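One way to read these bandwidth figures is to ask how long a data set would take to move at each rate. The sketch below uses a hypothetical 1 TB data set (decimal terabytes) and assumes each link runs at its nominal rate with no protocol overhead, which real transfers will not achieve:

```python
# Nominal rates taken from the figure, in Gbps.
LINKS_GBPS = {
    "Commodity Internet": 1,
    "NLR to sequencing centers (per link)": 10,
    "IU Data Capacitor WAN": 20,
    "Internet2": 100,
}

def transfer_seconds(size_tb: float, rate_gbps: float) -> float:
    """Ideal time to move size_tb terabytes at rate_gbps gigabits per second."""
    return size_tb * 1e12 * 8 / (rate_gbps * 1e9)

for name, rate in LINKS_GBPS.items():
    print(f"{name}: {transfer_seconds(1, rate) / 60:.1f} minutes per TB")
```

At nominal rates the same terabyte drops from over two hours on a 1 Gbps path to under a minute and a half at 100 Gbps.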

Page 34

NCGAS Logical Model

[Diagram labels]
• Data sources: Your Friendly Neighborhood Sequencer; Your Friendly Regional Sequencing Lab; Your Friendly National Sequencing Center
• Links: 10 Gbps; 100 Gbps
• Storage: Lustre WAN File System; Data Capacitor (NO data storage charges); Amazon Cloud Storage ($80 – 120 per TB per month)
• Compute: NCGAS Mason (free for NSF users); IU POD (12 cents per core hour); Amazon EC2 (20 cents per core hour)
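The prices in the logical model can be turned into a rough comparison. The rates below are the ones shown on the slide; the workload (1,000 core hours and 10 TB held for one month) is a made-up example, and the helper names are illustrative:

```python
# Rates from the slide, in US dollars.
COMPUTE_USD_PER_CORE_HOUR = {
    "NCGAS Mason (NSF users)": 0.00,
    "IU POD": 0.12,
    "Amazon EC2": 0.20,
}
AMAZON_STORAGE_USD_PER_TB_MONTH = (80, 120)  # low and high end of the quoted range

def compute_cost(core_hours: float, usd_per_core_hour: float) -> float:
    """Cost of a job at a flat per-core-hour rate."""
    return core_hours * usd_per_core_hour

def storage_cost_range(terabytes: float, months: float) -> tuple:
    """(low, high) cost of storage at the quoted per-TB-per-month range."""
    return tuple(terabytes * months * rate for rate in AMAZON_STORAGE_USD_PER_TB_MONTH)

# Hypothetical workload: 1,000 core hours, 10 TB stored for one month.
for name, rate in COMPUTE_USD_PER_CORE_HOUR.items():
    print(f"{name}: ${compute_cost(1000, rate):.2f}")
print("Amazon storage: $%d to $%d" % storage_cost_range(10, 1))
```

For this example the storage bill is larger than the compute bill on either paid platform, which is the contrast the slide draws with the Data Capacitor's lack of storage charges.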

Page 35

National Center for Genome Analysis Support (NCGAS)

• Using high speed networks like I2 and the Monon 100, the DC-WAN facility will ingest data from laboratories with next generation sequencers and serve reference data sets from sources like NCBI.

• Data will be processed using IU’s Cyberinfrastructure

Page 36

Special Thanks To

• NCGAS – Bill Barnett and Rich LeDuc
• IU’s High Performance Systems Group
• Application owners and IU’s HPA Team
• IU’s Data Capacitor Team
• Matt Davy, Tom Johnson, Ed Balas, Jeff Ambern, Martin Swany
• Andrew Lee, Chris Robb, Matthew Zekauskas and Internet2
• Evangelos Chaniotakis, Patrick Dorn and ESnet

• Brocade – 10Gb Cards, 100Gb Cards, and optics
• Ciena – 100 Gb optics
• DDN – 2 SFA 10K
• IBM – iDataPlex nodes
• Internet2, ESnet – Network link and equipment
• Whamcloud – Lustre support

Page 37

Thank you!

Stephen Simms

[email protected]

High Performance File Systems

[email protected]