Page 1: The Collaboratory:  computing environments and infrastructure for structural biology research

The Collaboratory: computing environments and infrastructure for structural biology research

Timothy M. McPhillips
Stanford Synchrotron Radiation Laboratory

Page 2: The Collaboratory:  computing environments and infrastructure for structural biology research

What is the Collaboratory?

Technically: an R&D program funded by NIH

• NIH’s definition of a Collaboratory: “A laboratory without walls.”
• Pilot program to investigate if collaboration and remote access tools could improve the efficiency of NCRR resources.
• Supplement to the NCRR grant that funds the SMB group.
• Currently funds three full-time employees in the SMB group: Thomas Eriksson, Ken Sharp, and Tim McPhillips.
• Funding has been extended through the end of the NCRR parent grant; the Collaboratory program will be renewed within the context of the parent grant in 2005.

In practice: a group-wide effort to create a coherent computational research environment for our users

• Goal is to provide users with a coherent, overarching system for collecting data and solving structures--not just a bunch of tools.
• Software development, systems management, instrument design, hardware development, beam line automation, maintenance of equipment, etc--all are critical to the Collaboratory.

• Everyone in the PX group contributes to the Collaboratory effort.

Page 3: The Collaboratory:  computing environments and infrastructure for structural biology research

The core Collaboratory development team

Page 4: The Collaboratory:  computing environments and infrastructure for structural biology research

“Something there is that doesn’t love a wall…”

What kind of walls has the Collaboratory removed?

• Walls between beam lines: Users can move between beam lines and find the same computer systems, user accounts and file systems wherever they go.

• Walls of geographical distance: Users can access the beam line, computing resources, and their data from anywhere in the world.

• Walls between collaborators: Local and remote coworkers can see samples, monitor the beam line, view data, and share data collection sessions.

• Walls between detectors and disk storage: High performance network and file server allows users to collect data from large area detectors at maximum speed.

• Walls between data and solved structures: High performance computers enable users to process their data and solve structures in real time.

• And coming down this year: Walls between traditional and web-based applications; walls between users and support staff; and walls between users and archived data.

Page 5: The Collaboratory:  computing environments and infrastructure for structural biology research

…but “good fences make good neighbors!”

What kind of fences has the Collaboratory put up?

• Fences between user groups: Each user group’s data is secure from snooping, theft, and tampering by other groups.

• Fences between networks: Computer systems at the beam lines are protected from network disturbances elsewhere at SSRL; instrument control computers are on an isolated network.

• Fences that keep users from damaging equipment remotely: Access control and rights restrictions in Blu-Ice make remote control of beam lines safe.

• Fences between computer systems and crackers: High level of security means users need not worry about data loss or system downtime due to marauders from the Internet.

Page 6: The Collaboratory:  computing environments and infrastructure for structural biology research
Page 7: The Collaboratory:  computing environments and infrastructure for structural biology research

Implications of the automated sample mounting system

• SSRL cassette design allows hundreds of pre-frozen crystals to be examined without entering the hutch.

• Automatic crystal centering system allows the crystal to be aligned automatically in the beam.

• In 2003, users of the robot on 11-1 entered the hutch only once to install cassettes in dispensing dewar.

• In 2004, users will not be allowed to use robot if they re-enter hutch after cassettes are loaded under staff supervision.

• Cassettes of crystals can be shipped to beam line via FEDEX.

• Cassettes can be placed in the hutch by staff, allowing users to work remotely.

• Local and remote users will have equal access to the hutch when using the robot (i.e., none).

• In theory, many users of the sample mounting robot need not come on site at all.

BUT -- Need appropriate computing, network, and software infrastructure to enable remote access to full experimental capabilities of beam lines.

Page 8: The Collaboratory:  computing environments and infrastructure for structural biology research

Collaboratory tools and sample mounting robots will allow SSRL users to work completely remotely in 2004

Blu-Ice for beam line control
• Can run locally or remotely.
• Multiple copies may run simultaneously.
• Security features prevent unsafe actions.

Beam line video system
• Monitor sample in beam, experimental hardware, and crystals under microscope.
• Video streams may be viewed via Blu-Ice or through a web browser.

Archive System
• Back up data to multi-terabyte robot tape system at SDSC over network.
• Simple web interface for data archival and retrieval.
• No need to use backup tapes.

Remote Unix desktop
• Fully functional Unix desktop environment.
• Blu-Ice and all data processing software may be run remotely.
• Free ICA client from Citrix.

Page 9: The Collaboratory:  computing environments and infrastructure for structural biology research

Why a high capacity, long term data archive is needed

Need a replacement for tapes
• Tapes age and medium formats change rapidly.
• Storage capacity and reliability of tapes are limited.
• Much manual book-keeping is needed to keep track of data stored on tapes.

Need to support large-area CCD detectors
• Three Q315 detectors and a MAR 325 will each be generating 20-70 MB of image data every 5 seconds when the SPEAR3 upgrade is complete.
• RAID data storage at SSRL will be 24 TB in 2004--all that data must be backed up somehow!
• Need to archive data as rapidly as it is collected.

Need to support high-throughput structural biology
• Automated beam lines will generate huge amounts of data.
• Large numbers of samples and targets require that metadata be stored and tracked systematically.
• Data must be archived automatically and easy to retrieve.
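
The detector figures above imply a sustained archive rate that is easy to estimate. The following is a rough back-of-the-envelope sketch, not a measurement: it assumes all four detectors collect continuously at the quoted rate and that 1 TB = 1,000,000 MB, neither of which is stated on the slide.

```python
# Back-of-the-envelope archive throughput from the figures quoted on the slide.
# Assumptions (not from the slide): all four detectors collect continuously,
# and 1 TB = 1,000,000 MB.

DETECTORS = 4                 # three Q315 detectors plus one MAR 325
FRAME_MB = (20, 70)           # quoted range of image data per 5-second interval
FRAME_PERIOD_S = 5
RAID_TB = 24                  # planned RAID capacity for 2004
MB_PER_TB = 1_000_000
SECONDS_PER_DAY = 86_400


def aggregate_rate_mb_s(frame_mb: float) -> float:
    """Combined data rate of all detectors in MB/s."""
    return DETECTORS * frame_mb / FRAME_PERIOD_S


def days_to_fill_raid(frame_mb: float) -> float:
    """Days of continuous collection needed to fill the RAID."""
    seconds = RAID_TB * MB_PER_TB / aggregate_rate_mb_s(frame_mb)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for frame_mb in FRAME_MB:
        rate = aggregate_rate_mb_s(frame_mb)
        per_day_tb = rate * SECONDS_PER_DAY / MB_PER_TB
        print(f"{frame_mb} MB per interval: {rate:.0f} MB/s aggregate, "
              f"{per_day_tb:.2f} TB/day, "
              f"RAID full in {days_to_fill_raid(frame_mb):.1f} days")
```

Under these assumptions the aggregate rate is 16 to 56 MB/s, so the 24 TB RAID would fill in roughly 5 to 17 days of continuous collection, which is why the archive has to keep pace with data collection.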

Page 10: The Collaboratory:  computing environments and infrastructure for structural biology research

High Performance Storage System and Storage Resource Broker at SDSC

High Performance Storage System (HPSS)
• Long term data storage system at SDSC.
• Currently stores over 344 TB of data in 18 million files.
• Currently provides 0.9 PB of storage.

Storage Resource Broker (SRB)
• Client-server middleware for accessing heterogeneous resources over the network.
• May be used to store and retrieve data on the HPSS at SDSC.
• Powerful metadata querying system allows data sets to be accessed based on their attributes.
• Data sets can be replicated over multiple resources.

The challenge
• Capabilities of HPSS and SRB far exceed the perceived needs of our beam line users.
• Educating users to effectively use these systems for managing their data is a challenge.
• Our users need a customized interface with simplified functionality.
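
To illustrate what accessing data sets "based on their attributes" means, here is a minimal plain-Python sketch of attribute-based lookup. It does not use the SRB API; the `DataSet` class, the `find_by_attributes` function, and the sample catalog entries are all hypothetical and invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """A hypothetical archived data set with free-form attributes."""
    collection: str
    attributes: dict = field(default_factory=dict)


def find_by_attributes(catalog, **wanted):
    """Return data sets whose attributes match every requested key/value."""
    return [
        ds for ds in catalog
        if all(ds.attributes.get(key) == value for key, value in wanted.items())
    ]


# Invented sample entries; none of these names come from the slides.
catalog = [
    DataSet("sample_a_run1", {"beamline": "11-1", "year": 2003}),
    DataSet("sample_a_run2", {"beamline": "other", "year": 2003}),
    DataSet("sample_b_run1", {"beamline": "11-1", "year": 2004}),
]

matches = find_by_attributes(catalog, beamline="11-1", year=2003)
print([ds.collection for ds in matches])
```

The point of the sketch is that the user names the properties of the data wanted rather than the directory where it lives; a simplified interface can hide everything else.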

Page 11: The Collaboratory:  computing environments and infrastructure for structural biology research

InQ SRB client for Microsoft Windows

SRB client applications
• Users must be able to upload data, download data, and view the data in the archive.
• Users perform these functions via SRB client applications.

InQ for Microsoft Windows
• InQ is the easiest to use client provided by SDSC.
• Individual files or entire folders may be uploaded or downloaded.
• Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of InQ
• Runs only on Microsoft Windows platforms.
• Windows is not the major platform used at synchrotron light sources or in crystallography research labs.
• No batch job capability for long archive jobs.
• Exposes confusing SRB features and terminology (resources, containers, collections, etc).

Page 12: The Collaboratory:  computing environments and infrastructure for structural biology research

MySRB web browser-based SRB client

MySRB
• MySRB is a powerful web-based SRB client.
• Can be run from standard web browsers.
• Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of MySRB
• No way to upload or download more than one file at a time.
• The otherwise rich functionality and powerful features are confusing to users.

The bottom line:
• Additional infrastructure must be designed and implemented in order to make the SRB a viable storage system for crystallographic data.
• A browser-based user interface is ideal.

Page 13: The Collaboratory:  computing environments and infrastructure for structural biology research

The Collaboratory interface for using the SRB archive

Simple archive job definition
• Users may rapidly browse their data sets at SSRL.
• Directory contents are listed in the browser window.
• Directories may be navigated by clicking on directory names.
• Files to be uploaded may be filtered according to a list of wildcards.
• Subdirectories may be archived recursively.
• The only SRB related information required is the name of the new data collection to create.

Convenient web browser interface
• Users may define archive jobs over the web from anywhere in the world using any common type of computer.
• Users need only log in to the Collaboratory portal with their Unix account name and password.
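
The job definition described above (a source directory, a list of wildcards, a recursion flag, and a collection name) can be sketched as follows. This is an illustration under assumptions, not the Collaboratory's actual code: the `ArchiveJob` class, its field names, and `select_files` are hypothetical.

```python
import fnmatch
import os
from dataclasses import dataclass


@dataclass
class ArchiveJob:
    """Hypothetical archive job definition (field names are illustrative)."""
    source_dir: str
    collection: str            # name of the new data collection to create
    patterns: tuple = ("*",)   # wildcard filters, e.g. ("*.img", "*.log")
    recursive: bool = True


def select_files(job: ArchiveJob) -> list:
    """List the files under job.source_dir that match any wildcard."""
    selected = []
    for dirpath, dirnames, filenames in os.walk(job.source_dir):
        if not job.recursive:
            dirnames.clear()   # stop os.walk from descending into subdirectories
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in job.patterns):
                selected.append(os.path.join(dirpath, name))
    return sorted(selected)
```

For example, `select_files(ArchiveJob("/data/user/run1", "run1_backup", ("*.img",)))` would list every `.img` file below that (made-up) directory; the user supplies nothing SRB-specific beyond the collection name.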

Page 14: The Collaboratory:  computing environments and infrastructure for structural biology research

Monitoring archive jobs and downloading data

Batch operation
• Archive job runs in background once definition is confirmed.
• Browser does not hang during archival.
• New jobs may be started while previously defined jobs are in progress.
• A job status page indicates definitions and status of all running jobs.
• E-mail is sent to the user when a job is complete.

Similar interface for data download
• Users browse their archived data sets in exactly the same fashion.
• Data may be downloaded from the archive to a directory at SSRL (analogous to an upload job).
• Another option is to download selected files in one or more tar files directly to any computer on the Internet.
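
The batch behaviour described above (jobs run in the background, new jobs can be started meanwhile, a status page lists them, and the user is notified on completion) can be sketched with a thread pool. This is a minimal sketch under assumptions, not the actual archive system: `JobTracker` is a hypothetical name, and the `notify` callback merely stands in for sending e-mail.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class JobTracker:
    """Runs archive jobs in the background and reports their status."""

    def __init__(self, notify=print, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}                  # job id -> (description, Future)
        self._ids = itertools.count(1)
        self._notify = notify            # stand-in for sending e-mail

    def submit(self, description, work, *args):
        """Start work(*args) in the background and return a job id at once."""
        job_id = next(self._ids)
        future = self._pool.submit(work, *args)
        self._jobs[job_id] = (description, future)
        future.add_done_callback(
            lambda f: self._notify(f"Job {job_id} ({description}) finished"))
        return job_id

    def status(self):
        """Snapshot for a job status page: job id -> (description, state)."""
        def state(future):
            if future.done():
                return "failed" if future.exception() else "complete"
            return "running" if future.running() else "queued"
        return {jid: (desc, state(fut)) for jid, (desc, fut) in self._jobs.items()}

    def wait(self):
        """Block until every submitted job has finished (useful in scripts)."""
        self._pool.shutdown(wait=True)
```

Because `submit` returns immediately, the caller (here, a web request handler) never blocks on the archive transfer itself, which is what keeps the browser from hanging.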

Page 15: The Collaboratory:  computing environments and infrastructure for structural biology research

Significant infrastructure is required to provide this “simple” interface--but the payoff is huge.

Authentication Gateway Server
• Java servlet that provides a common authentication protocol for all Collaboratory applications.
• Used to authenticate archive system users.
• All web-based Collaboratory software is being updated to use this single authentication server.
• Support for the authentication server has already been integrated into Blu-Ice/DCS.
• Allows users to navigate between web applications seamlessly without authenticating multiple times.
• Will allow access to be controlled based on the beam port schedule.
• Will allow users to start web-based applications from within Blu-Ice without requiring the user to authenticate again within the browser.
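
The single-sign-on idea behind the Authentication Gateway Server can be sketched as a shared session registry: the user authenticates once, receives a session id, and every application validates that id instead of asking for the password again. The real server is a Java servlet whose protocol is not described in these slides; the `SessionStore` class below, its method names, and the one-hour lifetime are assumptions made for illustration.

```python
import secrets
import time


class SessionStore:
    """Hypothetical in-memory session registry shared by several applications."""

    def __init__(self, check_password, lifetime_s=3600, clock=time.time):
        self._check_password = check_password   # callable(user, password) -> bool
        self._lifetime_s = lifetime_s            # assumed session lifetime
        self._clock = clock                      # injectable for testing
        self._sessions = {}                      # session id -> (user, expiry time)

    def login(self, user, password):
        """Authenticate once; return a session id, or None on failure."""
        if not self._check_password(user, password):
            return None
        session_id = secrets.token_urlsafe(32)   # unguessable random identifier
        self._sessions[session_id] = (user, self._clock() + self._lifetime_s)
        return session_id

    def validate(self, session_id):
        """Return the user name for a live session, or None if unknown or expired."""
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        user, expiry = entry
        if self._clock() >= expiry:
            del self._sessions[session_id]
            return None
        return user
```

Any application that receives the session id (a web tool, or a program launched from Blu-Ice) calls `validate` rather than prompting for credentials; a schedule-based access rule could be added as one more check inside `validate`.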

Impersonation Server
• Unix daemon that can run any non-interactive program on behalf of any Unix user.
• Enables web applications to run background jobs for a user with the actual rights of the Unix user account.
• Accepts commands via the HTTP protocol.
• Verifies authentication information with the Authentication Server.
• Used by the Collaboratory archive system to list directories in the web browser and run background archive jobs as the user.
• Will enable fluorescence scans and autochooch to be executed by the scripting engine in DCSS.
• Will allow further analyses to be initiated by the beam line control system automatically.
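
The core of the impersonation idea (check the session with the authentication service, then run a non-interactive command under that user's account) can be sketched as below. This is not the Impersonation Server's implementation: the HTTP layer is omitted, `run_for_user` and its parameters are hypothetical, and the sketch relies on the `user` argument of Python's `subprocess` module (Python 3.9+, POSIX only).

```python
import subprocess


def run_for_user(session_id, argv, validate, runner=subprocess.run):
    """Run a non-interactive command as the Unix user who owns the session.

    validate: callable(session_id) -> user name or None (the authentication check).
    runner:   injectable for testing; defaults to subprocess.run.
    """
    user = validate(session_id)
    if user is None:
        raise PermissionError("invalid or expired session")
    # user= switches to that account before the program starts, so the calling
    # daemon needs sufficient privileges (typically root). A real daemon would
    # also set the group and supplementary groups; that is left out here.
    return runner(
        argv,
        user=user,
        stdin=subprocess.DEVNULL,   # non-interactive: no terminal input
        capture_output=True,
        text=True,
        check=False,
    )
```

Because the command runs with the user's own rights, ordinary Unix file permissions decide what a web-initiated job may read or write; the web application itself needs no special access to user data.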

Page 16: The Collaboratory:  computing environments and infrastructure for structural biology research

Projects for the next year

Integration of web-based Collaboratory tools
• A new web-based environment for monitoring beam lines and viewing results will be developed over the next year.
• The diffraction image viewer, beam line video web application, and archive system will be integrated into this system.
• Will enable real-time monitoring of beam line operations and experimental results via the web.
• Layout of user interface will likely mimic Blu-Ice’s tab look and feel to leverage user familiarity and experience.
• Currently investigating tools for rapidly developing powerful web-based applications in a component-based framework (e.g., WebObjects).

Page 17: The Collaboratory:  computing environments and infrastructure for structural biology research

Projects for the next year

Web-based proposal management system
• Provide all SSRL users with web-browser based tools for submitting proposals and beam time requests; updating personal information; and viewing personalized beam time schedules.
• Facilitate communication with user administration and user support staff.
• Integrate with production SSRL database system, eliminate older user interfaces and reporting tools.
• SSRL will run a separate instance of the Authentication Gateway Server for this purpose.
• Users will be able to use this system to specify which Unix accounts are enabled to collect data at the beam line when a particular proposal is active. No more editing the MySQL table!
• First new interfaces will be rolled out by the end of 2003; major features will likely be released in late 2004.

Page 18: The Collaboratory:  computing environments and infrastructure for structural biology research

Collaboratory projects for the next 5 years…

Ice-Floe
• Provide users with the databases, user interfaces, and project management capabilities required to make maximum use of high-throughput structural biology resources.
• Present users with a high-level interface to automated beam lines and automated structure determination systems.
• Enable users to focus on the workflow of carrying out their research rather than the details of each operation.

Ice-Breaker
• Develop an open protocol for communicating with beam line automation systems.
• Work with developers at other light sources to make the protocol compatible across a large fraction of structural biology beam lines worldwide.
• Enable anyone to develop their own interface to automated beam lines, support in-house LIMS, interface to other software packages, etc.
• Allow users to choose the interface most useful to them, independent of the light source.

Page 19: The Collaboratory:  computing environments and infrastructure for structural biology research

Where we’re going: data grids, compute grids and experimental resource grids