
Page 1: Data Area Report

Data Area Report

Chris Jordan, Data Working Group Lead, TACC

Kelly Gaither, Data and Visualization Area Director, TACC

April 2009

Page 2: PY4 Data Area Characteristics

PY4 Data Area Characteristics

• Relatively stable software and user tools

• Relatively dynamic site/machine configuration
  – New sites and systems
  – Older systems being retired

• TeraGrid emphasis on broadening participation
  – Campus Champions
  – Science Gateways
  – Underrepresented disciplines


Page 3: PY4 Areas of Emphasis

PY4 Areas of Emphasis

• Improve campus-level access mechanisms

• Provide support for gateways and other “mobile” computing models

• Improve clarity of documentation

• Enhance user ability to manage complex datasets across multiple resources

• Develop a comprehensive plan for future developments in the Data area

• Production deployments of Lustre-WAN, and a path to global file systems


Page 4: Data Working Group Coordination

Data Working Group Coordination

• Led by Chris Jordan

• Meets bi-weekly to discuss current issues

• Has membership from each RP

• Attendees are a blend of system administrators and software developers


Page 5: Wide-Area and Global File Systems

Wide-Area and Global File Systems

• A TeraGrid global file system is a highly requested user service.

• A global file system implies a file system mounted on most TeraGrid resources.
  – No single file system can be mounted across all TG resources.

• Deploying wide-area file systems, however, is possible with technologies such as GPFS-WAN.
  – GPFS-WAN has licensing issues and isn’t available for all platforms.

• Lustre-WAN is promising for both licensing and compatibility reasons.

• Additional technologies such as pNFS will be necessary to make a file system global.

Page 6: Lustre-WAN Progress

Lustre-WAN Progress

• Initial production deployment of Indiana’s Data Capacitor Lustre-WAN on IU’s BigRed and PSC’s Pople.
  – Declared production in PY4 (involved testing and implementation of security enhancements).

• In PY4, successful testing and commitment to production on LONI’s QueenBee, TACC’s Ranger/Lonestar, NCSA’s Mercury/Abe, and SDSC’s IA64 (expected to go into production before PY5).
  – Additional sites (NICS, Purdue) will begin testing in Q4 PY4; a mount-check sketch follows below.

• Also in PY4, ongoing work to improve performance and authentication infrastructure, in parallel with production deployment.
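To make the testing step concrete, here is a minimal sketch of the kind of mount check a site might run before declaring a wide-area file system usable on a resource. It is illustrative only: the mount point /datacapacitor is a hypothetical placeholder (the real path varies by site and is not given in this report), and actual acceptance testing would also cover performance and security.

```python
import os
import tempfile

# Hypothetical mount point for a Lustre-WAN file system; not a path
# taken from this report.
MOUNT_POINT = "/datacapacitor"

def check_wan_mount(mount_point=MOUNT_POINT):
    """Report whether a wide-area file system is mounted and writable."""
    if not os.path.ismount(mount_point):
        return "not mounted"
    try:
        # Verify basic write access by creating and removing a temp file.
        with tempfile.NamedTemporaryFile(dir=mount_point):
            pass
        return "mounted, writable"
    except OSError:
        return "mounted, but read-only or permission denied"

if __name__ == "__main__":
    print(MOUNT_POINT, "->", check_wan_mount())
```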

Page 7: CTSS Efforts in the Data Area

CTSS Efforts in the Data Area

• In PY4, created data kits:
  – data movement kit
  – data management kit
  – wide area file systems kit

• Currently reworking data kits to include:
  – new client-level kits to express functionality and accessibility more clearly
  – new server-level kits to report more accurate information on server configurations
  – broadened use cases
  – requirements for more complex functionality (managing, not just moving, data)
  – improved information services to support science gateways and automated resource selection (see the sketch below)
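As a sketch of what automated resource selection against kit information could look like, the snippet below matches a required set of kits to resources that advertise them. The registry contents, kit names, and resource names are hypothetical placeholders; the actual CTSS information services interface is not described in this report.

```python
# Hypothetical stand-in for kit availability data; in practice this would
# come from TeraGrid information services.
KIT_REGISTRY = [
    {"resource": "ranger.tacc", "kits": {"data-movement", "wide-area-fs"}},
    {"resource": "bigred.iu",   "kits": {"data-movement", "data-management"}},
    {"resource": "pople.psc",   "kits": {"data-movement"}},
]

def select_resources(required_kits):
    """Return resources advertising every required kit."""
    needed = set(required_kits)
    return [entry["resource"]
            for entry in KIT_REGISTRY
            if needed <= entry["kits"]]

# A gateway needing both data movement and a wide-area file system:
print(select_resources({"data-movement", "wide-area-fs"}))  # ['ranger.tacc']
```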


Page 8: Data/Collections Management

Data/Collections Management

• In PY4, tested new infrastructure for data replication and management across TeraGrid resources (iRODS); a replication sketch follows below.

• In PY4, assessed archive replication and transition challenges.

• In PY4, gathered requirements for data management clients in CTSS.
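Below is a minimal sketch of the replication pattern this kind of testing exercises, driven through the standard iRODS icommands. The storage resource names and zone path are assumptions for illustration, and an initialized iRODS session (via iinit) is presumed.

```python
import subprocess

SOURCE_RESC = "tacc-disk"      # hypothetical iRODS storage resource
REPLICA_RESC = "sdsc-archive"  # hypothetical replication target

def replicate(local_file, collection="/tgZone/home/demo"):
    """Ingest a file into iRODS, then create a second replica elsewhere."""
    logical_path = f"{collection}/{local_file}"
    # Register the file into iRODS on the source resource.
    subprocess.run(["iput", "-R", SOURCE_RESC, local_file, logical_path],
                   check=True)
    # Create a replica on another resource.
    subprocess.run(["irepl", "-R", REPLICA_RESC, logical_path], check=True)

replicate("results.dat")
```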


Page 9: Data Collection Highlights

Data Collection Highlights

• Large data collections:
  – MODIS Satellite Imagery of the Earth. Remote sensing data from the Center for Space Research. Grows by ~2.4 GB/day, is widely used by earth scientists, and has many derivative products. (6 TB)
  – Purdue Terrestrial Observatory. Remote sensing data. (1.4 TB)
  – Alaska Herbarium collection. High-resolution scans of > 223,000 plant specimens from Alaska and the Circumpolar North. (1.5 TB)

• Hosting of data collection services within VMs (provides efficient delivery of services related to modest-scale data sets):
  – FlyBase: a key resource for Drosophila genomics. Front end hosted within a VM. (2.3 GB)
  – MutDB: a web-services data resource that delivers information on the known effects of mutations in genes (across taxa).


Page 10: Data Architecture (1)

Data Architecture (1)

• Two primary categories of use for data movement tools in the TeraGrid:
  – Users moving data to or from a location outside the TeraGrid
  – Users moving data between TeraGrid resources
  – (Frequently, users will need to do both within the span of a given workflow)

• Moving data to/from a location outside the TeraGrid:
  – Tends to involve smaller numbers of files and less overall data to move
  – Problems are primarily with usability, due to availability or ease of use


Page 11: Data Architecture (2)

Data Architecture (2)

• Moving data between TeraGrid resources:
  – Datasets tend to be larger
  – Users are more concerned with performance, high reliability, and ease of use

• General trend we have seen: as the need for data movement has increased, both the complexity of the deployments and the frustrations of users have increased.


Page 12: Data Architecture (3)

Data Architecture (3)

• This is an area in which we think we can have a significant impact:
  – Users want reliability, ease of use, and in some cases high performance
  – How the technology is implemented should be transparent to the user
  – User-initiated data movement, particularly on large systems, has proven to create problems with contention for disk resources


Page 13: Data Architecture (4)

Data Architecture (4)

• Data Movement Requirements:
  – R1: Users need reliable, easy-to-use file transfer tools for moving data from outside the TeraGrid to resources inside the TeraGrid.
  – R2: Users need reliable, high-performance, easy-to-use file transfer tools for moving data from one TeraGrid resource to another.
  – R3: Tools providing transparent data movement are needed on large systems with a low storage-to-flops ratio.

(SSH/SCP with the high-performance networking patches (HPN-SCP), SCP-based transfers to GridFTP nodes - RSSH. A transfer sketch follows below.)
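To illustrate R1/R2 in practice, here is a minimal sketch of an inter-site transfer using globus-url-copy, the GridFTP command-line client in common use at the time. The host names and paths are hypothetical placeholders, and a valid grid proxy (grid-proxy-init) is assumed.

```python
import subprocess

# Hypothetical GridFTP endpoints; real TeraGrid host names are not
# given in this report.
SRC = "gsiftp://gridftp.ranger.tacc.example.org/work/user/input.dat"
DST = "gsiftp://gridftp.bigred.iu.example.org/scratch/user/input.dat"

subprocess.run(
    ["globus-url-copy",
     "-p", "4",             # four parallel TCP streams for WAN throughput
     "-tcp-bs", "4194304",  # 4 MB TCP buffer for high bandwidth-delay paths
     SRC, DST],
    check=True,
)
```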


Page 14: Data Architecture (5)

Data Architecture (5)

• Users continue to request a single file system that is shared across all resources.

• Wide area file systems have proven to be a real possibility through the production operation of GPFS-WAN.

• There are still significant technical and licensing issues that prevent GPFS-WAN from becoming a global WAN-FS solution.


Page 15: Data Architecture (6)

Data Architecture (6)

• Network architecture on the petascale systems is proving to be a challenge: only a few router nodes are connected directly to wide-area networks, and the rest of the compute nodes are routed through them. Wide-area file systems often need direct access.

• It has become clear that no single solution will provide a production global wide-area network file system.
  – R4: The “look and feel,” or appearance, of a global wide-area file system with high availability and high reliability. (Lustre-WAN, pNFS)


Page 16: Data Architecture (7)

Data Architecture (7)

• Until recently, visualization and, in many cases, data analysis have been considered post-processing tasks requiring some sort of data movement.

• With the introduction of petascale systems, we are seeing data set sizes grow to a scale that prohibits data movement or makes it necessary to minimize it.

• We anticipate that scheduled data movement is one way to guarantee that data is present at the time it is needed; a sketch follows below.
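Below is a minimal sketch of the scheduled-data-movement idea: stage input data so it arrives before a reserved job start time. It illustrates only the concept behind tools such as DMOVER; the local copy used as a stand-in transfer, the paths, and the timing policy are all hypothetical, not DMOVER's actual interface.

```python
import sched
import shutil
import time
from datetime import datetime

scheduler = sched.scheduler(time.time, time.sleep)

def stage(src, dst):
    """Stand-in for a real wide-area transfer (e.g., GridFTP)."""
    shutil.copy(src, dst)
    print(f"staged {src} -> {dst} at {datetime.now():%H:%M:%S}")

# Suppose the compute job is reserved to start 60 seconds from now; schedule
# staging 30 seconds ahead of that so the data is in place when the job runs.
job_start = time.time() + 60
scheduler.enterabs(job_start - 30, 1, stage, ("input.dat", "/tmp/input.dat"))
scheduler.run()  # blocks until the staging event has fired
```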


Page 17: Data Architecture (8)

Data Architecture (8)

• Visualization and data analysis tools have not been designed to be data-aware: they assume that data can be read into memory and that applications and tools need not be concerned with exotic file access mechanisms.

– R5: Ability to schedule data availability for post-processing tasks. (DMOVER)

– R6: Availability of data mining/data analysis tools that are more data-aware. (Currently working with the VisIt developers to modify open-source software, leveraging work done on parallel Mesa; see the sketch below.)
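As a sketch of what "more data-aware" means for an analysis tool, the snippet below computes a mean over a file too large to read at once by memory-mapping it and visiting fixed-size chunks instead of assuming the whole dataset fits in memory. The file name, dtype, and layout are hypothetical placeholders; this is not VisIt code.

```python
import numpy as np

# Memory-map a large flat array of 32-bit floats instead of loading it.
data = np.memmap("field.f32", dtype=np.float32, mode="r")

CHUNK = 1 << 20  # visit one million values at a time
total, count = 0.0, 0
for start in range(0, data.shape[0], CHUNK):
    block = np.asarray(data[start:start + CHUNK])  # only this slice is read
    total += float(block.sum())
    count += block.size

print("mean =", total / count)
```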


Page 18: Data Architecture (9)

Data Architecture (9)

• Many TeraGrid sites provide effectively unlimited archival storage to compute-allocated users.

• Almost none of these sites have a firm policy requiring or allowing them to delete data after a triggering event.

• The volume of data flowing into and out of particular archives is already increasing drastically, in some cases exponentially, beyond the ability of the disk caches and tape drives currently allocated.

– R7: The TeraGrid must provide better organized, more capable, and more logically unified access to archival storage for the user community. (Proposal to NSF for a unified approach to archival storage)


Page 19: Plans for PY5

Plans for PY5

• Implement Data Architecture recommendations:
  – User portal integration
  – Data collections infrastructure
  – Archival replication services
  – Continued investigation of new location-independent access mechanisms (PetaShare, REDDnet)

• Complete production deployments of Lustre-WAN

• Develop plans for next-generation Lustre-WAN and pNFS technologies

• Work with the CTSS team on continued improvements to Data kit implementations
