


MAGAZINE OF CYBERINFRASTRUCTURE & COMPUTATIONAL SCIENCE

VOL. 21, NO. 1, FALL 2005

San Diego Supercomputer Center

SDSC Data Experts Help Reunite Hurricane Katrina Evacuees

Building Katrina's Superlist


FROM THE DIRECTOR

2  Data and Disaster
Using SDSC Data Cyberinfrastructure, Center experts make a difference in responding to natural disasters, from Hurricane Katrina to the Indian Ocean tsunami and preparing for earthquakes.

4  Building Katrina's Superlist
SDSC database researchers lend expertise to reunite victims of Hurricane Katrina, helping build a Web-accessible missing and found list merging thousands of names from multiple sources.

THE SAN DIEGO SUPERCOMPUTER CENTER
In 2005, the San Diego Supercomputer Center, SDSC, celebrates two decades of enabling international science and engineering discoveries through advances in computational science and high-performance computing. Continuing this legacy into the era of cyberinfrastructure, SDSC is a strategic resource to academia and industry, providing leadership in Data Cyberinfrastructure, particularly with respect to data curation, management, and preservation, data-oriented high-performance computing, and cyberinfrastructure-enabled science and engineering. Primarily funded by the NSF, SDSC is an organized research unit of the University of California, San Diego and one of the founding sites of NSF's TeraGrid. For more information see www.sdsc.edu.

SDSC INFORMATION
Fran Berman, Director

Vijay Samalam, Executive Director

San Diego Supercomputer Center
University of California, San Diego
9500 Gilman Drive MC 0505
La Jolla, CA 92093-0505

Phone: 858-534-5000

Fax: 858-534-5152

[email protected]
Greg Lund, Director of Communications
[email protected]

ENVISION
EnVision magazine is published by SDSC and presents leading-edge research in cyberinfrastructure and computational science.

ISSN 1521-5334
EDITOR: Paul Tooby
CONTRIBUTORS: Cassie Ferguson, Lynne Friedmann, Greg Lund, Paul Tooby, Ashley Wood

DESIGN: Beyond The Imagination Graphics

Any opinions, conclusions, or recommendations in this publication are those of the author(s) and do not necessarily reflect the views of NSF, other funding organizations, SDSC, or UC San Diego. All brand names and product names are trademarks or registered trademarks of their respective holders.

© 2005 The Regents of the University of California

COVER GRAPHIC: Weather satellite view of Hurricane Katrina, image courtesy NOAA/NESDIS. Cover graphic, Ben Tolo, SDSC.


FEATURES

7  Tapping Plants for Fuel
SDSC Strategic Applications Collaboration helps researchers scale up CHARMM molecular dynamics code, probing details of a key enzyme to accelerate conversion of cellulose into the renewable fuel, ethanol.

10  Swami, the Next Generation Biology Workbench
Updating an established favorite with new technologies increases the power, openness, and future extensibility of the Biology Workbench.

13  The Elusive Neutrino: New Window on the Violent Universe
SDSC TeraGrid data and computing resources help physicists validate powerful new neutrino "telescope" to probe the most violent events in the Universe.

16  Faster Workflows: Scientific Workflow Automation with Kepler
SDSC helps open source tool streamline scientists' tasks and aid collaboration, bringing the benefits of cyberinfrastructure to a wide range of disciplines.

20  The Big Picture: Building Mosaics of the Entire Sky
SDSC Strategic Applications Collaboration helps stitch together millions of images in the 10 terabyte 2-Micron All Sky Survey, allowing astronomers to explore large-scale patterns in the Universe.

24  SDSC News

THE BACK COVER
Long-term Records of Global Surface Temperature


From the Director

Over the last few years, the world has been both sobered and galvanized by the immense impact of large-scale natural disasters. The immediate and massive loss of life and property during the Indian Ocean tsunami and Hurricane Katrina, as well as the longer-term impacts on the local, national, and global economy and ecosystems from these events, will have repercussions for many, many years to come.

The interdisciplinary science, engineering, and information technology staff at SDSC work each day with researchers and educators, supporting and serving users and enabling advances and new discovery through Cyberinfrastructure. SDSC staff recently applied this experience to help the Red Cross amalgamate critical survivor data. Our dedicated and generous team worked around the clock to create one list that has become the central Red Cross resource for people trying to locate lost loved ones in the wake of Hurricane Katrina.

For SDSC and our colleagues in the science, engineering, and technology communities, the opportunity to give back to the broader society at a time of real need, using approaches, tools, and techniques pioneered for academic research, is immensely gratifying.

Katrina was not the first experience with disaster data for SDSC staff, who have also worked with tsunami data from the Indian Ocean and earthquake data from the Southern California Earthquake Center. In this issue of EnVision, we focus on ways in which SDSC's Data Cyberinfrastructure and the expertise developed from working with science and engineering communities can make a difference more broadly—through predicting, responding to, and evaluating the impact of large-scale natural disasters.

THREE PERSPECTIVES
For both survivors and family and friends, sifting through a rapidly increasing number of websites with survivor data to find a missing loved one (30+ sites when SDSC was brought in) is a time-consuming and onerous task. Using database technologies originally designed to coordinate multiple science and engineering data collections, SDSC staff worked with the Red Cross, the National Institute for Urban Search and Rescue, San Diego State University, Microsoft, and others to create an amalgamated "missing and found" persons list with many thousands of names compiled from separate sources. The technical challenges included translating and coordinating multiple data formats, cleaning and "de-duping" the data (removing duplicates), and adding a "fuzzy" search capability that allows close but inexact queries to return useful information. The resulting KatrinaSafe.com website provides one-stop shopping through a "meta-list" of survivor data. More about SDSC's Katrina efforts can be found in the cover story on page 4.

Katrina data was collected and used in real time during the first intense weeks following the hurricane. Following the December 2004 Indian Ocean tsunami, more than 20 National Science Foundation-funded scientific reconnaissance teams, supported by the SDSC-based Network for Earthquake Engineering Simulation Cyberinfrastructure Center (NEESit), went to work in Asia, gathering voluminous data from the tsunami—the deadliest in recorded history. For scientists and engineers, data from the tsunami provides a unique and irreplaceable data collection, invaluable for validating and improving the accuracy of models of tsunami behavior. More about the development and management of the Indian Ocean Tsunami Data Repository can be found on page 6.

Of course, the best time to understand the implications of a disaster is before it happens. In the last year, a consortium of geoscientists and earthquake engineers working with the Southern California Earthquake Center (SCEC) partnered with two dozen SDSC staff to run the largest-scale simulation to date of a large magnitude earthquake in southern California. The "TeraShake" simulation modeled a 600 by 300 kilometer area (which included the Los Angeles basin) at 200 meter resolution (grid points every 200 meters). The original simulation ran for a week on SDSC's DataStar and produced 47 terabytes of data (over 4.5 times the printed materials in the Library of Congress). The simulations are ongoing at SDSC, producing hundreds of terabytes of data that can help lead to improved hazard estimates and safer structures.

All of us at SDSC would like to dedicate this issue to the scientists, engineers, and technologists in our community, both at SDSC and elsewhere at many institutions and agencies, who have worked so hard to lend their time and expertise during recent disasters. Your efforts have made an immense difference.

– Dr. Francine Berman, SDSC Director

Hurricane Katrina, which devastated New Orleans and the Gulf Coast in the costliest natural disaster in the nation's history, seen from a NOAA weather satellite as the eye makes landfall. SDSC database experts provided vital assistance to the American Red Cross by consolidating multiple lists of Katrina evacuees into a single, easy-to-use missing and found "superlist." NOAA/NESDIS.



Building Katrina's Superlist
Reuniting Hurricane Evacuees: Hundreds of thousands of people were displaced and separated by Hurricane Katrina. SDSC data expertise helped reconnect them by providing unified missing and found lists to the Red Cross.
by Greg Lund
Photos © American Red Cross

Hurricane Katrina crashed ashore on August 29th to become one of the most damaging storms in US history. Where the winds blew down homes, families scattered. Where floods inundated homes, people ran or swam for their lives. Thousands of people simply fled. They ended up in shelters, athletic arenas, or in homes across the Gulf states and across the country.

That flight started a great separation. Parents were separated from their children, and husbands separated from wives. Some were only a few miles from one another, but had no idea how to tell family and friends they were safe. Others ended up thousands of miles away from each other without a clue where to look. The Red Cross alone housed evacuees in 1,150 shelters across 27 states and the District of Columbia.

Efforts to reunite lost families began even before Katrina's winds and rain diminished. Missing persons lists sprang up all over, seemingly every place that housed evacuees. The Red Cross, newspapers, and websites all began independent lists to try to reunite loved ones. By the Red Cross' tally, more than 287,000 names would ultimately be registered online.

DATA EXPERTISE
SDSC couldn't do much to stem the floodwaters, but Dr. Chaitan Baru, a renowned data scientist at SDSC, and his team could do something about the deluge of missing persons' data threatening to choke available resources and dramatically slow down the reunification process.

It started with a phone call to Baru less than 24 hours after first reports of levee breaches in New Orleans. Dr. Eric Frost, an associate professor in the Department of Geological Sciences at San Diego State University and an SDSC collaborator, was working with the National Institute for Urban Search and Rescue and the Red Cross to provide first responders before-and-after satellite images of devastated sections of the Gulf Coast. He called to ask for extra space in SDSC's large data storage facility for these images. Richard Moore, SDSC's director of production services, quickly agreed and marshaled staffers to provide server space for Frost's effort.

This was followed the next day by a request that would eventually leverage knowledge gained from research in managing large data sets at SDSC. The Red Cross needed help in dealing with the growing data from missing persons lists. In the first few days after the hurricane, dozens of separate lists were posted online, ranging from CNN, the International Committee of the Red Cross, and MSNBC to smaller Gulf Coast newspapers. There was no uniformity in the information gathered, and one list might look completely different from the next but still contain many of the same names. So the Red Cross asked for help in sorting 911 calls reporting missing people.

Baru enlisted the help of SDSC database experts Jerry Rowley and Vishu Nandigam to begin the task of amalgamating the names from more than 30 independent lists into one coordinated and easy-to-use master list. It was a large task, much easier discussed than actually done.

That meant ditching any Labor Day weekend plans for the SDSC team, which had grown to include Neil Cotofana, Peter Shin, Hector Jasso, L.J. Ding, and Roger Unwin. Baru's team had significant experience in database management, but this task was unique. It started around a whiteboard with the team trying to find all available missing persons lists, determine how to compile data from lists that differed substantially in the type of information contained, and then combine that data into one list. In addition, it was a race against time to get a list together as quickly as possible to start reconnecting families.
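As a rough illustration of the kind of list merging and de-duplication involved, the sketch below collapses records from differently formatted lists onto a normalized key. The field names and normalization rules are hypothetical and are not SDSC's actual pipeline.

```python
# Illustrative sketch only: field names and normalization rules are hypothetical.

def normalize(record):
    """Reduce a record from any source list to a comparable key."""
    first = record.get("first_name", "").strip().lower()
    last = record.get("last_name", "").strip().lower()
    city = record.get("city", "").strip().lower()
    return (first, last, city)

def merge_lists(*source_lists):
    """Merge several missing/found lists, keeping one record per normalized key."""
    merged = {}
    for source in source_lists:
        for record in source:
            key = normalize(record)
            if key in merged:
                # A duplicate: keep the first record, but fill in any missing fields.
                for field, value in record.items():
                    merged[key].setdefault(field, value)
            else:
                merged[key] = dict(record)
    return list(merged.values())

# Two lists in different formats that report the same person.
news_list = [{"first_name": "Jane", "last_name": "Doe", "city": "New Orleans"}]
shelter_list = [{"first_name": "JANE ", "last_name": "doe", "city": "New Orleans",
                 "shelter": "Houston Astrodome"}]
print(merge_lists(news_list, shelter_list))   # one merged record, not two
```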

"We immediately deployed a data upload site where the Red Cross and other data providers could upload their files containing names of safe and missing people. We commandeered one of our computer systems at SDSC and used it to develop a database of all the names," said Baru. "At the same time, we requested some of our staff volunteers to begin investigating approaches for 'fuzzy' matching on names, since we knew that the incoming data was going to be of highly varying quality."

Jerry Rowley coordinated the daily activities of the group at SDSC. "Our original plan was to also provide a website to support searching the names in the single list that we were creating," said Rowley. "Over the Labor Day weekend, some of our staff were working on downloading, cleaning, and loading the data, while others developed a new website along with a search interface."

The initial fuzzy search was based only on the first and last name of a person. To implement a more powerful search, SDSC contacted commercial vendors of fuzzy search software, and one of them, Identity Systems, offered the software for free use and immediately flew one of their customer support technicians from New York City to San Diego. SDSC researcher Doug Greer is using the software to efficiently match the names of missing people with the names in the amalgamated safe list at SDSC.
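A much simpler stand-in for what a "close but inexact" match on first and last names means is sketched below using only Python's standard library; the production system relied on Identity Systems' commercial software, and the threshold value here is an assumption chosen for illustration.

```python
# Minimal stand-in for fuzzy name matching; not the commercial matcher actually used.
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity of two 'First Last' strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def fuzzy_lookup(query_name, safe_list, threshold=0.85):
    """Return safe-list names that closely, but perhaps inexactly, match the query."""
    return [name for name in safe_list
            if name_similarity(query_name, name) >= threshold]

safe_list = ["Jane Doe", "John Q. Public", "Janet Dole"]
print(fuzzy_lookup("Jain Doe", safe_list))    # catches the misspelled "Jane Doe"
```

A production matcher must also handle nicknames, phonetic spellings, and swapped fields, which is part of why commercial software was brought in.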

EFFECTIVE TEAMWORK
On Monday, September 5, as the refugee chaos was reaching its zenith, the collaboration between SDSC, the Red Cross, and eventually Microsoft kicked into high gear. The Red Cross had the largest list of names, Microsoft was developing a website, www.katrinasafe.com, and SDSC had the data expertise. Together the partners raced against time to access as much missing person data as possible, refine it, and place it on one readily available and prominent website.

Left: SDSC database researcher Jerry Rowley coordinated day-to-day efforts of the SDSC team. B. Tolo.

Above: SDSC data researcher Chaitan Baru led the Center's efforts to provide emergency data management help to the American Red Cross in responding to Hurricane Katrina. B. Tolo.

SDSC signed a nondisclosure agreement with the Red Cross, in effect promising to keep the data on individuals safe and sound. That paved the way for a deluge of Red Cross names gathered from the ever-growing number of shelters in the South.

"It was tremendously encouraging to see the spirit with which our staff responded to this call for help. People dropped their plans for the long weekend and simply went ahead with the tasks they were given with gusto. We pulled together a team across multiple departments at SDSC, and everyone worked together with great enthusiasm," said Baru. "We were also extremely impressed with the high degree of professionalism of the Red Cross staff, it was clear we were working with a group of individuals very committed to the cause of helping others."

The researchers were aware that the data they were dealing with was extremely sensitive. It contained information on individuals, including their addresses, phone numbers, and, occasionally, other private information. For example, in some cases individuals voluntarily provided their Social Security numbers. "As the central aggregator of all this data, we are extremely careful about how we treat this information," said Baru. "We 'sanitize' and 'anonymize' the data before passing it to others who are not from the Red Cross."
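What "sanitizing" a record might look like in miniature is sketched below; the field names, the choice to hash the name, and the overall procedure are hypothetical and do not describe SDSC's actual process.

```python
# Hypothetical illustration only; not SDSC's actual sanitization procedure.
import hashlib

SENSITIVE_FIELDS = {"ssn", "phone", "street_address", "first_name", "last_name"}

def sanitize(record):
    """Drop sensitive fields and replace the name with a one-way hash, so records
    can still be counted and cross-checked without exposing identities."""
    full_name = f"{record.get('first_name', '')} {record.get('last_name', '')}"
    cleaned = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    cleaned["name_hash"] = hashlib.sha256(full_name.lower().encode()).hexdigest()[:12]
    return cleaned

record = {"first_name": "Jane", "last_name": "Doe", "ssn": "123-45-6789",
          "phone": "504-555-0100", "shelter": "Houston Astrodome"}
print(sanitize(record))   # {'shelter': 'Houston Astrodome', 'name_hash': '...'}
```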

SDSC is filling a business-to-business (B2B) role in this effort, Baru explained. The amalgamated list of safe and missing names that SDSC creates is then provided to Microsoft, which is filling the business-to-customer role, providing access to this information to the public through the KatrinaSafe.com website.

"In our B2B role, SDSC is offering a search service to businesses. After we receive lists of names of missing people from private companies, government agencies, and nonprofits, we do a fuzzy match of these missing people names with our safe list, and then return that information directly to the requesting institution," said Baru. "This information is kept private between that institution and SDSC."

GOING PUBLIC
The partnership was a success. Two lists were developed, a list of the missing and a list of the found. Data from the amalgamated list started flowing from SDSC on Tuesday, September 6, and SDSC has since provided more than 486,000 records, covering both safe and missing persons, to the KatrinaSafe.com site.

Whether thousands of miles apart or very nearby, separated families often had no way to find each other except Internet lists such as KatrinaSafe.com that SDSC staff helped create.

URL: www.katrinasafe.com

© American Red Cross

SDSC SUPPORTS TSUNAMI DATA COLLECTION
Following the devastating Indian Ocean tsunami in December 2004, more than 20 National Science Foundation-funded scientific reconnaissance teams went to work in Asia, gathering voluminous data from the tsunami—the deadliest in recorded history. The data collected will be preserved, curated, and made available for use by researchers. This unique and irreplaceable resource will help them better understand the broad impact of tsunamis on communities, buildings, ecology, and people, as well as guiding preparations for future events.

The data preservation effort is being led by SDSC's Network for Earthquake Engineering Simulation (NEES) Cyberinfrastructure Center (NEESit) team. This group operates and supports an extensive central information technology infrastructure for earthquake engineers and researchers.

"The teams are working with regional partners to gather and translate data in support of research projects aimed at better understanding the impact of this unique event," said SDSC's Anke Kamrath, director of NEESit. "We're helping to capture and preserve these findings to support research endeavors on this tsunami which may continue for decades or even centuries from now." The tsunami data is scheduled to be made publicly available through the NEES Tsunami Data Repository at tsunami.nees.org starting in 2006.

The NSF-sponsored tsunami researchers have looked at many different types of tsunami impacts on the world. Teams focusing on collecting social, environmental, geological, ecological, and other kinds of data spent several months in the region. Information collected ranges from social data on human behavior and tsunami responses to remote sensing satellite data on ecological impacts and data on tsunami deposits and sedimentology.

As data continued to pour in, additional SDSC staff volunteered to help clean the data, including Geoff Avila, Linda Ferri, John Helly, Cynthia Lee, Jon Meyer, Hannes Neidner, Jane Park, Bob Sinkovits, and Donna Turner.

The search for evacuees took new forms. Large Gulf Coast area companies impacted by Katrina asked SDSC to mine the data lists to help locate their employees who had been scattered. The National Center for Missing and Exploited Children offered names of missing kids to be matched against the amalgamated data. Government agencies, too, took part in the massive effort to locate evacuees.

"SDSC provides a comprehensive set of data storage and analysis tools and technologies for the science and engineering research and education communities," said Dr. Francine Berman, SDSC Director. "All of the staff at SDSC want to help, and we are delighted that we can use our data tools and technologies to facilitate the difficult and important job of helping identify and reconnect Katrina's survivors."

Throughout the Katrina effort there was another thought in everyone's mind: Let's put together a data system to handle the next disaster. "This experience has clearly demonstrated the need for a rapid IT deployment capability during emergencies," said Baru. "We're glad that the capacity and expertise available at SDSC was able to serve the need in this case, and learning from this experience, we've decided to launch a new initiative at SDSC called SURGE, for SDSC URGent Response."

– Greg Lund is Director of Communications at SDSC.


Tapping Plants for Fuel
by Cassie Ferguson

Cellulase Enzyme
Top and side views of the enzyme cellulase, processing along the top of a fiber of the sugar polymer cellulose. The enzyme breaks cellulose into smaller sugar molecules called beta-glucose, which are then fermented to make the renewable fuel, ethanol. J. Brady, M. Himmel, L. Zhong, M. Crowley, C. Brooks III, and M. Nimlos.

Imagine mowing your lawn and then dumping the grass clippings into the gas tank of your car. Inside your tank, the grasses are digested and converted into ethanol—a high-performance, clean-burning, renewable fuel. You avoid the astronomical cost of filling up with old-fashioned petroleum, and the U.S. avoids the costly environmental, climate, and security issues of depending on nonrenewable fossil fuel. While tapping yard clippings as a source of gas might still be something found only in movies, the use of plant material as a major energy source has attracted nationwide attention, with ethanol blends already being offered at the pump.

But the process of producing ethanol remains slow and expensive, and researchers are trying to formulate more efficient, economical methods—a challenge that hinges on speeding up a key molecular reaction being investigated in a Strategic Applications Collaboration between researchers at SDSC, the Department of Energy's National Renewable Energy Laboratory (NREL), Cornell University, The Scripps Research Institute, and the Colorado School of Mines.

The interest in ethanol, commonly known as grain alcohol, is being driven by the combination of rising petroleum prices and government subsidies for so-called biofuel, a mixture of 15 percent (by volume) gasoline and 85 percent ethanol, known as E85, which sells for an average of 45 cents less per gallon than gasoline. Efforts to mitigate climate change are also spurring the growth of such renewable fuels, which add far less net greenhouse gas to the atmosphere than burning fossil fuels because the step of growing plant material removes carbon dioxide. In August 2005, President George W. Bush signed a comprehensive energy bill that included a requirement to increase the production of biofuels including ethanol and biodiesel from 4 to 7.5 billion gallons within the next 10 years.

While most people are familiar with the process used to turn plant material—such as hops—into ethanol-containing beverages like beer, that process is slow, expensive, and the end product too impure for energy use. To produce ethanol for energy use on a massive scale, researchers are trying to perfect the conversion of "biomass"—plant matter such as trees, grasses, byproducts from agricultural crops, and other biological material—via industrial conversion in "biorefineries."

PARTICIPANTS
MARK NIMLOS, MIKE HIMMEL, NREL
JAMES MATTHEWS, JOHN BRADY, CORNELL
LINGHAO ZHONG, PENN STATE
XIANGHONG QIAN, COLORADO SCHOOL OF MINES
MICHAEL CROWLEY, CHARLES BROOKS, TSRI
MIKE CLEARY, GIRI CHUKKAPALLI, SDSC/UCSD

PROJECT LEADERS
JOHN BRADY, CORNELL
MIKE HIMMEL, NREL


"Cellulose is the most abundant plant material on earth and a largely untapped source of renewable energy," said project manager Mike Cleary, who is coordinating SDSC's role in the project. "So this collaboration is addressing not just a significant problem in enzymology but a problem of huge potential benefit to society."

MODELING A MOLECULAR MACHINE
The central bottleneck in making the biomass-to-fuel conversion process more efficient is the current slow rate of breakdown of the woody parts of plants—cellulose—by the enzyme cellulase, which also happens to be expensive to produce. The enzyme complex of cellulases, made up of proteins, acts as a catalyst to mediate and speed this chemical reaction, turning cellulose into sugars.

Scientists want to understand this process at the molecular level so that they can learn how to enhance the reaction. Using molecular dynamics simulations, which model the movement of the enzyme at the atomic scale, the researchers want to determine if the kinetics of the enzyme agree with models based on biochemical and genetic studies. By probing in minute detail how the enzyme makes contact with cellulose at the molecular level, the researchers hope to speed up the process and make it more cost effective by discovering ways the enzyme can be altered through genetic engineering.

The cellulase enzyme complex is actually a collection of protein enzymes, each of which plays a specific role in breaking down cellulose into smaller molecules of sugar called beta-glucose. The smaller sugar molecules are then fermented with microbes, typically yeast, to make the fuel, ethanol. One of the parts of the enzyme complex, cellobiohydrolase (CBH I), acts as a "molecular machine" that attaches to bundles of cellulose, pulls up a single strand of the sugar, and puts it onto a molecular conveyor belt where it is chopped into the smaller pieces. In order to make this process more efficient through bioengineering, researchers will need a detailed molecular-level understanding of how the cellulase enzyme functions. But the system has been difficult to study because it is too small to be directly observed under a microscope while too large for traditional molecular mechanics modeling.

To explore the intricate molecular dynamics of cellulase, researchers at NREL have turned to CHARMM (Chemistry at HARvard Molecular Mechanics), a suite of modeling software for macromolecular simulations, including energy minimization, molecular dynamics, and Monte Carlo simulations. The widely used community code, originally developed in 1983 in the laboratory of Martin Karplus at Harvard University, models how atoms interact.

In the cellulase modeling, CHARMM is used to explore the ensemble configurations and protein structure, the interactions of the protein with the cellulose substrate, and the interactions of water with both. Not only are the NREL simulations the first to simultaneously model the cellulase enzyme, cellulose substrate, and surrounding water, they are among the largest molecular systems ever modeled. In particular, the researchers are interested in how cellulase aligns and attaches itself to cellulose, how the separate parts of cellulase—called protein domains—work with one another, and the effect of water on the overall system. And they are also investigating which of the over 500 amino acids that make up the cellulase protein are central to the overall workings of the "machine" as it chews up cellulose.

To the biochemists in the collaboration, the simulation is like a stop-motion film of a baseball pitcher throwing a curveball. In real life the process occurs too quickly to evaluate visually, but by breaking down the throw into a step-by-step process, observers can find out the precise role of velocity, trajectory, movement, and arm angle. In simulations on SDSC's DataStar supercomputer, the researchers have modeled a portion of the enzyme, the type 1 cellulose binding domain, on a surface of crystalline cellulose in a box of water. The modeling revealed how the amino acids of the domain orient themselves when they interact with crystalline cellulose as well as how the interaction disrupts the layer of water molecules that lie on top of the cellulose, providing a detailed glimpse of this intricate molecular dance.

PUSHING THE ENVELOPE
The NREL cellulose model includes over 800,000 atoms, including the surrounding water, the cellulose, and the enzyme—an enormous structure to model computationally. According to the researchers, an accurate understanding of what is happening will require the capability to scale up their simulation to run for 50 nanoseconds in the reaction—an extremely long amount of time in molecular terms and highly demanding in computational terms (there are one billion nanoseconds in one second). To reach 50 nanoseconds, the researchers must calculate 25 million time-steps at two femtoseconds per time step (one femtosecond is one quadrillionth of a second).

However, the sheer size of the model is beyond the limit of the current capabilities of the CHARMM simulation code, which has been difficult to scale as the number of computer processors grows larger, since the code was originally written to model thousands, not hundreds of thousands, of atoms. The SAC partners have worked to enhance CHARMM to scale to larger numbers of atoms and to run on some of the largest resources available to academic scientists in the U.S., including DataStar (recently expanded to 15.6 teraflops), TeraGrid (4.4 teraflops), and BlueGene (5.7 teraflops).

Molecular Machine
In this artist's rendering, the large translucent enzyme cellulase, resembling a dinosaur, pulls up strands of cellulose, digests them in its belly, and excretes them as smaller pieces of sugar. M. Himmel, NREL, DOE Biomass Program.


Top: CHARMM
Chemistry at HARvard Molecular Mechanics, or CHARMM, a suite of modeling software for macromolecular simulations, is used to model the complete system of the enzyme cellulase digesting the sugar cellulose in water (reddish haze). J. Brady, M. Himmel, L. Zhong, M. Crowley, C. Brooks III, and M. Nimlos.

Bottom: Cellulase Enzyme
Side view of the enzyme cellulase, processing along the top of a fiber of the sugar cellulose. J. Brady, M. Himmel, L. Zhong, M. Crowley, C. Brooks III, and M. Nimlos.

To determine how much time the large-scale CHARMM simulations require, a calculation on DataStar found that a series of 500-step simulations on a 711,887-atom system for one picosecond (one thousandth of a nanosecond) required 12 minutes on 64 processors and 9 minutes on 128 processors. Because of scaling issues, a full nanosecond run will require 1,000 times more time than these benchmarking runs, so that full-scale simulations are expected to require nearly one million CPU hours.
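A quick back-of-the-envelope check, using only the figures quoted above and assuming the benchmark rate simply scales to longer runs, shows how the one-million-CPU-hour estimate follows:

```python
# Back-of-the-envelope check using only the figures quoted in the article.
fs_per_step = 2                                  # femtoseconds of simulated time per step
target_ns = 50                                   # desired simulated time, in nanoseconds
steps_needed = target_ns * 1_000_000 / fs_per_step
print(f"steps for {target_ns} ns: {steps_needed:,.0f}")          # 25,000,000

minutes_per_ps = 9                               # benchmark: 1 ps (500 steps) on 128 CPUs
processors = 128
cpu_hours_per_ns = minutes_per_ps / 60 * 1000 * processors       # 1 ns = 1,000 ps
print(f"CPU-hours per ns: {cpu_hours_per_ns:,.0f}")              # ~19,200
print(f"CPU-hours for {target_ns} ns: {target_ns * cpu_hours_per_ns:,.0f}")  # ~960,000
```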

To extend the capabilities of the CHARMM simulation code to this unprecedented scale, SDSC's Giri Chukkapalli, a computational scientist, along with Scripps' Michael Crowley, a software developer in Charles Brooks' lab, have reengineered parts of CHARMM to be more efficient running as a parallel, rather than serial, application. In particular, the researchers in the SAC collaboration have targeted a number of subroutines in the code, which are being altered to speed up its performance on 256 and 512 processors.

A MODEL PARTNERSHIP
Outreach on the part of SDSC resulted in this large cross-agency collaboration based on a team approach, with interdisciplinary participation by biochemists from NREL, enzymologists and carbohydrate chemists from Cornell, software developers from TSRI, and computational scientists at SDSC. To validate and gauge the accuracy of the CHARMM simulations, the models are studied by James Matthews and John Brady of Cornell, and Linghao Zhong at Penn State, who compare the simulated version of the overall action of the cellulase complex with experimental results. Similarly, the chemists at NREL, including Mark Nimlos, Mike Himmel, and Xianghong Qian at the Colorado School of Mines, interpret the biochemical findings. In addition to assisting with the software development and scaling to be able to run larger simulations, SDSC is also the key site for computation since the center houses compute resources such as DataStar, with capabilities far beyond those available at the other collaborators.

"We were looking for opportunities for collaboration with other agencies," said Cleary. "SDSC has unique expertise to offer in improving community codes like CHARMM and other molecular dynamics tools like AMBER." It turned out that Cleary, along with other SDSC staff, knew some of the researchers who had been working on the cellulase problem at NREL and the other sites. Their work was an ideal fit for an SDSC SAC collaboration, with each group lending its expertise to the project.

The collaboration, funded by the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy and the Office of the Biomass Program, fit the mission of SDSC's SAC program—to enhance the effectiveness of computational science and engineering research conducted by nationwide academic users. The goal of these collaborations is to develop a synergy between the academic researchers and SDSC staff that accelerates the researchers' efforts by using SDSC resources most effectively and enabling new science on relatively short timescales of three to 12 months. And beyond the project results, the hope is to discover and develop general solutions that will benefit not only the selected researchers but also their entire academic communities and other high-performance computing users.

In this case, beyond being able to model cellulase digesting cellulose to improve the production of ethanol, the improvements to CHARMM are opening the door so that the software, running on cutting-edge hardware systems, can simulate many other large-scale biological systems. In turn, that will allow scientists to pose entirely new questions, opening novel avenues for research, said Cleary.

According to Chukkapalli, "We're excited about the advent of new architectures that provide massive amounts of computing power. The questions from biophysics, structural biology, and biochemistry that have been only dreams in the minds of computational chemists are now on the verge of being studied in realistic simulations."

–Cassie Ferguson is a freelance science writer.

RELATED LINKS
National Renewable Energy Laboratory's Biomass Research Page: www.nrel.gov/biomass
SDSC Strategic Applications Collaborations Program: www.sdsc.edu/user_services/sac

REFERENCES
Mark R. Nimlos, Stan Bower, Michael E. Himmel, Michael F. Crowley, Giridhar Chukkapalli, Michael J. Cleary, and John Brady. (May 2005). Computational Modeling of the Interaction of the Binding Domain of T. reesei Cel7A with Cellulose. Poster presentation at the 27th Symposium on Biotechnology for Fuels and Chemicals, Denver, CO.
Sheehan, J. & Himmel, M. (1999). Enzymes, energy, and the environment: A strategic perspective on the US Department of Energy's Research and Development Activities for Bioethanol. Biotechnology Progress, 15, 817-827.


The Next Generation Biology Workbench
by Lynne Friedmann

A free, Web-based interface that links major molecular biology databases with analysis programs—accessed thousands of times a week by scientists and students worldwide—is about to become even better.

In May, SDSC announced the award of $2.2 million from the National Institutes of Health (NIH) to build on the groundbreaking "Biology Workbench," introduced nearly a decade ago by researcher Shankar Subramaniam to provide broad access to many biology software tools and data resources through a Web-based, point-and-click computational environment. The Biology Workbench also offers the advantage of speed, letting researchers complete within hours work that once took days, weeks, or even months.

The Next Generation Biology Workbench (NGBW), also known as Swami, is a unique and versatile biological analysis environment that allows users to search up-to-date protein and nucleic acid sequence databases. Searching is integrated with access to a wide variety of analysis and modeling tools, all within a point-and-click interface that eliminates file format compatibility problems. This has made the Biology Workbench an indispensable resource for scientists who work in bioinformatics, an emerging discipline in which researchers collect, integrate, model, and analyze the explosion of biological data produced by such efforts as the Human Genome Project.

In addition to adding to fundamental scientific knowledge, this research can lead to improved understanding of disease and open the door to development of new treatments and drug discovery. The NGBW prototype, as well as links to the current Biology Workbench, can be found at www.ngbw.org.

"I have been using Biology Workbench on a regular basis for the last three to four years," said Vanderbilt University assistant professor Mark de Caestecker. "It has proved to be an invaluable tool for the analysis and design of gene and protein constructs used in a range of different experiments in my laboratory. The Biology Workbench has the most comprehensive and easy-to-use applications I have come across." In his research, de Caestecker studies stem cell differentiation in kidney development, cancer, and tissue injury repair, and also researches cellular signaling in relation to hypertension.

PARTICIPANTS
MIKE CLEARY, KEVIN FOWLER, MARK MILLER, GREGORY QUINN, ASHTON TAYLOR, AND ROGER UNWIN, SDSC/UCSD
SHANKAR SUBRAMANIAM, UCSD AND SDSC/UCSD
CELESTE BROWN, U. IDAHO

A Tool for Education
Using the Biology Workbench in the Bioinformatics Teaching Lab at the University of Idaho. Michael Placke.

UPDATING A FAVORITE
The researchers are making the Next Generation Biology Workbench even more useful by expanding its present offering of 65 tools. Like the original, the NGBW will continue as a free Web resource that offers access to data, data storage, software tools, and computational resources that help researchers mine the information in many popular protein and nucleic acid sequence databases. NIH funding will support the construction of up-to-date features such as improved user interfaces and an expandable architecture that will allow the NGBW to continue to evolve in the future in response to new developments in technology, biology, and the needs of scientists.

"There have been huge leaps in the technologies used in building cyberinfrastructure since the original Biology Workbench was created," said Mark A. Miller, SDSC project leader for the new grant. Work will be done in phases, with a beta release planned for April 2006. According to Miller, there are some upgrades that can be accomplished in a matter of months, while others will take a year or two to accomplish. For example, the current workbench integrates information from 33 public databases, which are downloaded into a flat file. Using the less powerful technology of the flat file format places significant limitations on search functionality. Therefore, a major goal of the Next Generation project is to adopt a relational database format in which the information is broken down into tables and categories, which then allows more complex queries, or scientific questions, to be answered.
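The difference a relational format makes can be illustrated with a toy example. The schema and records below are invented and do not reflect the actual NGBW database design; the point is only that a structured question becomes a one-line query instead of an ad hoc scan of a flat file.

```python
# Hypothetical illustration of the flat-file vs. relational contrast; the schema
# and records are invented, not the actual NGBW design.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE sequences
              (accession TEXT, organism TEXT, length INTEGER, source_db TEXT)""")
db.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?)", [
    ("SEQ001", "Homo sapiens", 142, "database_A"),
    ("SEQ002", "Homo sapiens", 310, "database_A"),
    ("SEQ003", "Mus musculus", 215, "database_B"),
])

# A structured question that a flat text file could only answer with ad hoc parsing:
# which human sequences longer than 145 residues do we hold, and from which source?
rows = db.execute("""SELECT accession, source_db FROM sequences
                     WHERE organism = 'Homo sapiens' AND length > 145""").fetchall()
print(rows)   # [('SEQ002', 'database_A')]
```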

"Software developers always want to make things very elegant so they can later expand and make them more modular," said Miller. "We do want that, but we don't want to make people wait five years for the next product. So our focus is giving users something today and then making it more elegant underneath."

Other improvements will include enhanced visualization and data management capabilities, along with work to make sure that these services are available even to users with only a lower-speed dial-up modem. This will enable a wide range of users to pose sophisticated questions, even if they don't have access to advanced computing resources.

While not losing sight of the researcher with limited financial resources for whom the Biology Workbench was originally developed, enhancements to the Next Generation Biology Workbench are expected to also capture more high-end users. The current workbench, running on a Sun computer, allows access to more than 32,000 active users, the majority of whom look at single sequences, submitting more than 120,000 requests for analysis, or jobs, monthly. This is not high demand, and consequently not very expensive computationally.

"We could probably handle four times that amount without breathing hard," said Miller. "But the current system is limited because of how the file system is structured. I believe we can design the Next Generation Biology Workbench so it will be fully expandable in the future."

A TOOL FOR EDUCATION
Because navigating the interface is comparable to learning the Windows or Macintosh operating systems, it didn't take long for instructors to embrace the original Biology Workbench as a teaching tool. Responding to this growing user segment, SDSC researchers are partnering with colleagues at the National Center for Supercomputing Applications (NCSA), where the workbench was initially developed, in developing an educational component. And consulting with educators in the early stages of the Next Generation Biology Workbench design, Miller explains, "will keep us from getting too far off the beam making a nice architecture but the wrong functions." The outcome will be a dedicated component or view for students and teachers, called the Student Biology Workbench.

This is welcome news to instructors like Celeste Brown, Bioinformatics Coordinator of the Initiative for Bioinformatics and Evolutionary Studies at the University of Idaho. "Ten years ago a graduate student found the Biology Workbench (on the Web) and brought it to my attention," she said. "I've been using it for teaching ever since."

Brown is not alone. A Web search reveals a wide range of lesson plans designed specifically with the Biology Workbench in mind. Many are from smaller institutions of higher education, such as the University of Idaho. The use of the Biology Workbench in this setting should be welcome news to the NIH, which supports an initiative to encourage bioinformatics training in states that have historically not received as high a level of government grant dollars as, say, the state of California. In this program, 23 states and Puerto Rico qualify for additional NIH support for faculty development and enhancement of research infrastructure under the Institutional Development Award (IDeA) Program.

Easy Visualization
The NGBW will feature lightweight server-side tools for visualizing and manipulating protein molecules, such as the scalable vector graphics viewer shown here. NGBW.

Updating a Favorite: Next Generation Biology Workbench Interface.


Facilitating Science
An example of the kind of protein rendering that users can easily produce with the Workbench. To make the Next Generation Biology Workbench usable by as many researchers and students as possible, the project is developing tools that run on machines at SDSC so that users don't have to download and install software on their own machines. NGBW.

UNDER THE HOOD
A core goal of the project is to improve the Workbench using new technologies. The design goals include using an architecture that allows the NGBW to be freely available for distribution, and developed within the available budget and a short time frame. The team has addressed these issues by leveraging the Java Enterprise Edition software stack as implemented by the open-source JBoss 4.0 Application Server. The new workbench stores and retrieves data from relational databases by mapping Java objects to relational entities using JBoss' Hibernate persistence library. The user works with this data through a user-friendly Web front-end that is implemented using the Apache Struts web-application framework.

"Because many users do not have authorization or sometimes the ability to install programs on their computers, we're developing very lightweight visualization tools that run effectively from the server side at SDSC," said Miller. This ensures that a wider range of users can benefit from NGBW data and tools.

Another challenge is how to support the wide variety of analytical tools made available within the Workbench, since such tools typically have very specific input and output format requirements. To do this, Miller explains, the developers must "wrap" each separate program, putting a translator between it and the central NGBW architecture, so that data supplied by the user can be passed to any of the analytical programs in the Workbench in the language it can interpret. In turn, the translator returns output to the user in a common format. And since the developers are using a well-defined common language, this also makes it straightforward for other developers to create their own tools to be added to or work with the Workbench.
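The NGBW itself is built on the Java EE stack described above; the following Python sketch is only a language-agnostic illustration of the "wrapper" idea, with an invented toy tool and formats.

```python
# Language-agnostic sketch of the wrapper/translator pattern described above;
# the tool, its formats, and the class below are invented for illustration.

class ToolWrapper:
    """Translate between the workbench's common format and one tool's own format."""
    def __init__(self, name, to_tool_format, from_tool_format, run_tool):
        self.name = name
        self.to_tool_format = to_tool_format      # common format -> tool input
        self.from_tool_format = from_tool_format  # tool output -> common format
        self.run_tool = run_tool                  # the wrapped analysis program

    def run(self, data_in_common_format):
        tool_input = self.to_tool_format(data_in_common_format)
        tool_output = self.run_tool(tool_input)
        return self.from_tool_format(tool_output)

# A toy "analysis program" with its own plain-text input and output conventions.
def toy_length_tool(plain_text):
    return str(len(plain_text.replace("\n", "")))

length_wrapper = ToolWrapper(
    name="toy_length_tool",
    # Common format here is assumed to be FASTA text; strip header lines for the tool.
    to_tool_format=lambda fasta: "".join(
        line for line in fasta.splitlines() if not line.startswith(">")),
    from_tool_format=lambda out: f"sequence length: {out}",
    run_tool=toy_length_tool,
)

print(length_wrapper.run(">seq1\nMKTAYIAKQR\n"))  # sequence length: 10
```

Because every wrapped tool exposes the same run() interface, adding a new analysis program only requires writing its two small translators.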

While there isn't a requirement that IDeA states use the Biology Workbench to train students in bioinformatics, many do.

At the University of Idaho, an evolutionary perspective is the emphasis of all biology training. "But let's face it. Nobody really likes development lab where you put an egg into a Petri dish and watch how it develops into a chicken," notes Brown. "A tool like the Biology Workbench puts things in context and reinforces scientific principles learned in earlier classes. Besides, students are used to a lot more technology than they were ten years ago."

Brown uses the Biology Workbench in an introductory enzyme lab course to access 3-D structures from the database. Students not only see what's going on in the reactions they set up, they can then consider how the structures evolved. "I want students to understand that there are databases out there that have all this nucleotide and protein sequence information," said Brown. "I also want them to realize that it's easy to get to and there are a lot of tools out there to help them analyze what's in those databases."

Outside the classroom, researchers have become aware of the Biology Workbench through scientific publications. In many cases, at the end of a commercial software review, the Biology Workbench is mentioned as a free solution that scientists might also wish to consider. When the Next Generation Biology Workbench is ready for release, there is travel support in the NIH grant for a "roll out" of its new capabilities at a series of major national scientific meetings.

"The original Biology Workbench created a new paradigm for integrating biological information and tools, giving easy access to researchers as well as students," said Shankar Subramaniam, Professor of Bioengineering and Director of the Bioinformatics Program at UCSD. "It's rewarding to see the growing interest in this resource, and bringing the workbench forward using modern technologies will make this important tool more versatile and available to an even broader range of users."

When Miller recently posted an Internet request for testimonials about the Biology Workbench he soon heard from students, teachers, and researchers from the four corners of the earth. Feedback includes such superlatives as "invaluable," "easy to use," "faster and more efficient than other tools," and "critical for completion of my thesis research."

Professor of Biology Darrel Stafford at the University of North Carolina, Chapel Hill wrote, "We used it extensively in our search for the gene for epoxide reductase which was published in the March 5th issue of Nature."

As far as Gabriel M. Belfort, an MD/PhD candidate at Boston University School of Medicine, is concerned, "not having the Biology Workbench would be the functional equivalent of replacing my computer with an abacus."

The Next Generation Biology Workbench team at SDSC includes Mark Miller, PhD, PI; Mike Cleary, PhD, co-PI and user advocate; Shankar Subramaniam, PhD, co-PI; Kevin Fowler, senior software architect; Roger Unwin, database engineer; Gregory Quinn, PhD, senior interface engineer; and Ashton Taylor, artist. Celeste Brown, PhD, of the University of Idaho, is education advisor and bioinformatics coordinator.

–Lynne Friedmann is a freelance science writer living in Solana Beach, California.

URL: www.ngbw.org

REFERENCE
Subramaniam, S. The Biology Workbench—a seamless database and analysis environment for the biologist. Proteins, 32(1):1-2, 1998.


THE ELUSIVE NEUTRINO: New Window on the Violent Universe
SDSC's TeraGrid Data and Computing Resources Help Validate AMANDA Neutrino Telescope
by Paul Tooby

Scientists have long sought ways to map the Universe and explore its most violent phenomena, from mysterious gamma ray bursts and supernovae to the black holes that inhabit active nuclei in the centers of galaxies. In their quest, some researchers are now focusing on subatomic particles called neutrinos, which show promise of being valuable messengers. In contrast to other particles or light, which are absorbed, bent, or scattered in their travels, the tiny, almost massless neutrino is able to travel virtually unimpeded across the vast distances of space to reach the Earth. Aided by the TeraGrid network, cluster, and massive data resources at SDSC at UC San Diego, physicists are developing a new kind of telescope, AMANDA-II, the Antarctic Muon and Neutrino Detector Array, to observe these neutrinos and decipher their tales about the location and inner workings of the cataclysmic events in which they originated.

Researcher Andrea Silvestri and Professor Steven Barwick of the Physics and Astronomy Department at the University of California, Irvine (UCI), along with many other scientists, are beginning to use the multipurpose AMANDA-II high-energy neutrino telescope at the South Pole to seek answers to a broad array of questions in physics and astrophysics. Observing neutrinos can shed light on fundamental problems such as the origins of cosmic rays and the search for dark matter and other exotic particles, as well as serving as a monitor for supernovas in the Milky Way.

However, the same qualities that let neutrinos travel freely across the universe also make them extremely difficult to detect. To further complicate matters, the vast majority of neutrinos that reach the Earth are produced nearby through cosmic ray collisions in the Earth's atmosphere, potentially masking the rarer, distant-origin neutrinos the scientists are seeking. How can particle astrophysicists Silvestri and Barwick filter out the mass of unwanted information from their data and tease out the tiny signal of high-energy neutrinos from far away?

TAMING A FLOOD OF DATA
Scientists have steadily increased the size and effectiveness of the AMANDA telescope since it began collecting data in 1997. In AMANDA-II, the data acquisition electronics were upgraded with Transient Waveform Recorders that capture the complete waveform for each event detected. The researchers expect that several important goals will benefit by as much as a factor of 10 from the additional information gathered, including improvements in reconstructing muon cascades, the search for diffuse sources of ultra-high energy neutrinos, and the search for neutrino point sources.

However, with this progress in gathering data come new challenges, and the telescope now produces a flood of information, growing from one terabyte to 15 terabytes per year even in compressed form—about the same amount of information as in the entire printed collection of the Library of Congress or the data on 3,500 DVDs.

To analyze this immense data collection, Silvestri and Barwick turned to the large-scale data and computing capabilities of the NSF TeraGrid facility at SDSC. "The more capable the AMANDA telescope becomes, the more information we gather about neutrinos, which greatly helps our science," said Silvestri. "But this also means that to analyze all this data, we need the expertise and high-end resources of SDSC and the TeraGrid."

The first step was to transfer the 15 terabytes of raw AMANDA neutrino data over a high-speed network from UCI into a Storage Resource Broker (SRB) data archive at SDSC. Having developed the advanced SDSC SRB data management tool and installed more than one petabyte of online disk and more than six petabytes of archival storage capacity, SDSC is ideally suited to house and analyze massive data sets. And as the TeraGrid was designed to do, the high-speed network allowed the researchers to transparently access their massive data archive housed at SDSC for use on TeraGrid computational resources at various sites, speeding their research.

AMANDA-II Neutrino Telescope
To probe the most violent events in the Universe, scientists use the NSF-funded AMANDA neutrino telescope to seek evidence of elusive high-energy neutrinos. Aided by SDSC and TeraGrid data and compute resources, they successfully analyzed 15 terabytes of data. AMANDA, UW-Madison, photo Robert Morse.

PROJECT LEADERS
STEVEN BARWICK AND ANDREA SILVESTRI, UCI


FINDING A NEUTRINO IN A HAYSTACK
The 15 terabytes of the full AMANDA-II waveform data collected for one year during 2003 contains some two billion experimental events, and the challenge the scientists faced was to identify the few neutrinos among the millions of times greater number of background muon events. Scientists measure neutrinos by detecting muons, which are subatomic particles produced in the rare interaction of a neutrino with other matter.

In their analysis, the researchers processed and filtered the experimental data and reconstructed each individual event. By running sophisticated algorithms on the TeraGrid through numerous iterations using likelihood-based statistical methods, the researchers analyzed the full 15 terabytes of experimental data. This process was highly data- and compute-intensive, and only by having access to the resources of the massive SDSC online disk and tape storage and some 70,000 CPU hours on the TeraGrid supercomputer were the researchers able to carry out their data analysis. Typical jobs ran on 512 processors using one to two gigabytes of memory.

Finally, the researchers succeeded in distinguishing the faint signal of 1,112 atmospheric neutrinos from the billions of extraneous events. When they compared their results to standard analyses they found good agreement, confirming that the AMANDA instrument and SDSC-aided data analysis can produce the same physics results as previous data. In particular, the angular distribution of the atmospheric neutrino sample extracted from the standard data set agreed well with the new AMANDA data, with all 1,112 neutrinos originating in the northern hemisphere and distributed across the sky in a fairly uniform way, as expected.
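The core geometric idea behind separating atmospheric-muon background from neutrino candidates can be sketched in a few lines, though the real analysis relied on likelihood-based track reconstruction and many additional quality cuts run as large TeraGrid jobs. The event fields and zenith threshold below are hypothetical.

```python
# Toy sketch of the geometric idea only; not the actual AMANDA analysis chain.
# Event fields and the zenith threshold are hypothetical.

def is_upgoing(event, min_zenith_deg=100.0):
    """Keep tracks reconstructed as travelling upward through the detector.

    A zenith angle near 0 deg means the track arrived from directly overhead
    (a downgoing atmospheric muon); a zenith angle above 90 deg means the track
    points upward, as expected for a neutrino that passed through the Earth
    before interacting near the detector."""
    return event["zenith_deg"] > min_zenith_deg

events = [
    {"id": 1, "zenith_deg": 12.0},    # downgoing atmospheric muon: rejected
    {"id": 2, "zenith_deg": 155.0},   # upgoing track: neutrino candidate
    {"id": 3, "zenith_deg": 95.0},    # near-horizontal: rejected by this cut
]

candidates = [e["id"] for e in events if is_upgoing(e)]
print(candidates)   # [2]
```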

In validating the AMANDA instrument and analysis, the researchers also investigated whether the observed neutrinos were generated from collisions with the Earth’s atmosphere, which produces a uniform spatial distribution of neutrino events across the sky, or whether some of the neutrinos were created by a source of extraterrestrial origin, which would be expected to produce a more concentrated event cluster in the direction of the source. Their statistical analysis of the data showed that the observed regions of the sky were compatible with atmospheric neutrino events, without significant event clusters that might indicate an extraterrestrial source.

Evidence of a Neutrino
One of the 1,112 observed high-energy neutrinos detected over one year as it traveled through the AMANDA instrument. In this side view of the detector looking upward, the lines of small black dots represent the strings of photo sensors, with colored circles marking sensors that detected light in this event (larger circles indicate more intense light). The earliest-arriving pulses (red light) are lowest in the array with later-arriving (green) light detected higher up, indicating an upward-travelling neutrino that passed through the Earth to reach the detector. A. Silvestri, UCI.

Using Ice to Measure Neutrinos
The Antarctic Muon and Neutrino Detector Array (AMANDA), buried a mile deep in the Antarctic ice, is a new kind of telescope. Spanning a volume three times the size of the Eiffel Tower, it uses the clear ice without background light as a detector for sensitive measurements of rare neutrinos. Dan Brennan, UW-Madison.

NEUTRINOS: A NEW WINDOW ON THE UNIVERSE

Neutrinos are subatomic particles of neutral electrical charge and very small mass. Moving at nearly the speed of light, they participate only in weak and gravitational interactions, and can thus travel virtually unimpeded, giving a new view of the Universe. However, this also makes them very difficult to detect. When a high-energy muon-neutrino has a rare collision in the ice of the AMANDA detector, a muon particle is emitted that initially travels faster than the local speed of light in the ice. The muon radiates a sort of “shock wave” of light—analogous to the sonic boom of a supersonic aircraft. Traveling through the clear ice to the photodetectors of the AMANDA neutrino telescope, this Cherenkov light can reveal the presence of a high-energy neutrino. By correlating the light’s arrival times in the 3-D photodetector array, scientists can pinpoint the direction of neutrino-emitting objects, which can then be further studied.



The scientists explained that it has been an enormous undertaking, requiring many years and the efforts of diverse groups and specialities working together to develop and validate an entirely new kind of telescope such as AMANDA. “This is a major result for the AMANDA-II neutrino telescope and broader research community,” said Silvestri. “It’s the first validation that we can in fact perform valid neutrino analysis with the new generation of instrument, its much larger data stream, and all the steps of our analysis, and we couldn’t have done it without SDSC and the TeraGrid data and compute resources.”

Moreover, since their initial analysis used only part of the complete information contained in the AMANDA-II waveform data, the researchers are now developing new software tools to exploit the full information available. The scientists expect the additional information to improve their ability to resolve even smaller differences in energy and angle. This will be crucial in their continuing search for the hard-to-detect energetic extraterrestrial neutrinos that may hold the answers to many fundamental questions about the Universe.

A NEW KIND OF TELESCOPE
The fulfillment of a 40-year dream, AMANDA was designed to overcome the obstacles to detecting elusive neutrinos, and shows promise of giving scientists a startlingly broader view of the Universe through the window of these high-energy particles. AMANDA is an ingenious new kind of “telescope” that senses neutrinos instead of light from above as have all telescopes since the time of Galileo (see sidebar). And unlike normal telescopes,

Neutrino Sky Map
Sky map from the AMANDA neutrino telescope shows that the locally produced atmospheric neutrino background detected to date is quite uniform, without evidence of strong sources. This is part of the first validation of the new AMANDA-II neutrino telescope and the researchers’ analysis system, which relied on large-scale TeraGrid data and compute resources at SDSC. Scale at right indicates excess from mean background events. A. Silvestri, UCI.

REFERENCES
A. Silvestri et al., Performance of AMANDA-II using data from Transient Waveform Recorders, Proceedings of 29th International Cosmic Ray Conference, Pune, India, August 3-10, 2005.
A. Silvestri et al., The AMANDA Neutrino Telescope, Proceedings of International School of Cosmic Ray Astrophysics, Erice, Italy, July 2-13, 2004.

RELATED LINKS
Antarctic Muon and Neutrino Detector Array (AMANDA)
amanda.uci.edu

Andrea Silvestri
www.ps.uci.edu/~silvestri

Steven Barwick
www.ps.uci.edu/physics/barwick.html

which always face upward, AMANDA can also look downward, using the size of the Earth to “filter out” the extraneous downward-moving atmospheric muons, which are about a million times more abundant, and in this way detect high-energy neutrinos in the intermediate range from distant parts of the Universe. Only such neutrinos are able to pass through the whole Earth after entering in the Northern Hemisphere to reach the AMANDA telescope at the South Pole.

Occasionally, one of these upward-moving high-energy neutrinos will interact with an oxygen atom in the ice near the AMANDA array to produce a cascade of light-emitting muon particles. This light can travel long distances through the clear ice at the South Pole, which is free of competing background light, until it is picked up by the sensitive AMANDA photodetectors that gather this indirect evidence of the passage of a neutrino. The telescope can also search for even more energetic neutrinos by looking for downward-moving neutrinos of ultra-high energies.

The AMANDA neutrino telescope continues to grow in power, and currently consists of some 700 photon detectors arranged like beads on vertical strings, lowered into 19 holes in the ice at the South Pole. The holes are distributed across a circular area, creating a cylindrical volume of ice that serves as the detector some 120 meters in diameter and 500 meters tall, with its top about 1,500 meters below the ice cap’s surface. Each photon detector module consists of a photomultiplier tube housed in a tough, pressure-resistant hollow sphere, with electrical and optical connectors attached. After a hole is bored in the ice with heated water, the string of detector modules is lowered into the water-filled hole, which then freezes solid, locking the detectors permanently in place.

In the future, the research will be scaled up even further in the NSF IceCube project, a much larger one-kilometer cube telescope array that will produce 20 times as much data, one terabyte per day or some 300 terabytes annually. Silvestri points out that “this will drive the need for even larger data, computational, and network resources at SDSC to better answer questions about the most energetic events in the history of the Universe.”

– Paul Tooby is a senior science writer at SDSC and editor of EnVision Magazine.



Scientific Workflow Automation with Kepler

The advent of the World Wide Web has brought a whole new world of data, tools, and online services within the reach of scientists. But this wealth of opportunities also brings a new set of challenges, and researchers can be overwhelmed by the sheer volume and complexity of resources. To overcome this problem, researchers at SDSC at UC San Diego, the National Center for Ecological Analysis and Synthesis (NCEAS) at UC Santa Barbara, and their partners have initiated an interdisciplinary collaboration to develop Kepler, a tool for scientific workflow management. By helping organize and automate scientific tasks, Kepler lets scientists take full advantage of today’s complex software and Web services.

“For scientists, a good workflow tool is invaluable,” said Kim Baldridge, a computational chemist at SDSC and professor at the University of Zurich. “It relieves researchers of the drudgery of tedious manual steps, but more importantly it can dramatically expand our ability to think bigger and ask new questions that were simply too complex or time-consuming before.” Her research group has developed modules known as “actors” to carry out workflows in Kepler, leading to published results in the RESURGENCE project, which makes use of computational grids and Web services in computational chemistry.

Kepler is named after the Ptolemy software from UC Berkeley on which it is built, and the new tool is part of the emerging cyberinfrastructure, or integrated technologies for doing today’s science. The open-source Kepler project grew out of the need for an analytical workflow tool in the Science Environment for Ecological Knowledge, or SEEK, project. The researchers from SDSC and UCSB who initiated Kepler worked with partners in the Department of Energy Scientific Discovery through Advanced Computing Program. Kepler has now grown to include more than half a dozen scientific projects, and the researchers plan to release a beta version in the coming months.

In its simplest form, Kepler may be thought of as a sort of “scientific robot” that relieves researchers of repetitive tasks so that they can focus on their science. In addition to increasing the efficiency of scientists’ own workflows, Kepler will also give researchers increased capabilities to communicate and work together—searching for, integrating, and sharing data and workflows in large-scale collaborative environments.

“With Kepler, scientists from many disciplines can automate complex workflows, without having to become expert programmers,” said Bertram Ludäscher, one of the initiators of the Kepler project and an SDSC Fellow and associate professor of Computer Science at UC Davis. “Kepler’s flexibility and its visual programming interface make it easy for scientists to create both low-level ‘plumbing workflows’ to move data around and start jobs on remote computers, as well as high-level data analysis pipelines that chain together standard or custom algorithms from different scientific domains. And beyond automation, being able to document and reproduce workflows is a major objective of scientific workflow systems like Kepler.”
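The actor-oriented idea itself is simple to illustrate. The toy Python sketch below is not Kepler's API (Kepler actors are Java components built on Ptolemy II, described later in this article); it only shows the pattern of reusable components chained by a director according to the flow of data, with invented component names.

# Conceptual sketch only: the actor/director dataflow idea in miniature.
from typing import Callable, Iterable, List

class Actor:
    """A reusable processing step: consumes a stream of items, emits a stream of items."""
    def __init__(self, name: str, fn: Callable[[Iterable], Iterable]):
        self.name = name
        self.fn = fn

    def fire(self, inputs: Iterable) -> List:
        return list(self.fn(inputs))

class SequentialDirector:
    """A trivial 'director': runs actors in order, wiring each output to the next input."""
    def run(self, actors: List[Actor], source: Iterable) -> List:
        data = list(source)
        for actor in actors:
            data = actor.fire(data)
            print(f"{actor.name}: {len(data)} items")
        return data

# A toy pipeline of the kind described above: fetch -> clean -> analyze
fetch   = Actor("fetch",   lambda xs: (x for x in xs))
clean   = Actor("clean",   lambda xs: (x for x in xs if x is not None))
analyze = Actor("analyze", lambda xs: [sum(xs) / len(xs)] if xs else [])

result = SequentialDirector().run([fetch, clean, analyze], [1.0, None, 2.0, 3.0, None])
print("mean of cleaned values:", result[0])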

An important factor in ensuring that Kepler will be broadly useful across multiple scientific disciplines is its organization as an open source consortium. Participants in an open source project collaborate in building, maintaining, and peer-reviewing a common software tool. The source code is made publicly available without charge, and those who use the software are encouraged to contribute to its development—finding and fixing errors and adding new features that benefit the entire community. The Linux operating system and tools from the Apache Software Foundation are well-known examples of open source efforts.

PROJECT PARTICIPANTS
ILKAY ALTINTAS, KIM BALDRIDGE, ZHIJIE GUAN, EFRAT JAEGER-FRANK, NANDITA MANGAL, STEVE MOCK, SDSC/UCSD
SHAWN BOWERS, BERTRAM LUDÄSCHER, UC DAVIS
CHAD BERKLEY, DANIEL HIGGINS, MATT JONES, JING TAO, UCSB
CHRISTOPHER BROOKS, EDWARD A. LEE, STEPHEN NEUENDORFFER, YANG ZHAO, UCB
ZHENGANG CHENG, MLADEN VOUK, NCSU
TOBIN FRICKE, U. OF ROCHESTER
TIMOTHY MCPHILLIPS, NDDP
A. TOWN PETERSON, ROD SPEARS, U. KANSAS
KIM BALDRIDGE, WIBKE SUDHOLT, U. ZURICH
TERENCE CRITCHLOW, XIAOWEN XIN, LLNL

Kepler Interface
The visual programming interface in the Kepler scientific workflow tool makes it easy for scientists to create low-level “plumbing workflows” to move data around and start jobs on remote computers as well as high-level data analysis pipelines that chain together standard or custom algorithms from different scientific domains.

by Paul Tooby

SDSC Helps Open Source Tool Streamline Scientists’ Tasks and Aid Collaboration




“The fact that Kepler is open source encourages researchers to join the collaboration and build their own components, leveraging the infrastructure, and providing the vitality of a community approach to more rapidly extend Kepler’s capabilities,” said Ilkay Altintas, director of SDSC’s Scientific Workflow Automation Technologies (SWAT) lab, which brings together scientific workflow efforts at SDSC under one umbrella. Researchers interested in scientific workflow technologies are invited to contact the lab to learn more.

TOWARD A GENERIC TOOL
When scientists search for relevant data sources and then undertake multi-step workflows, they must typically carry out and keep track of these complex steps in manual, ad hoc ways as they export and import data from one step to another across diverse environments. As a first step toward automating these tasks, scientists and computer scientists may collaborate on building a custom workflow tool. But this is an expensive and time-consuming process in which the software must generally be developed and maintained in an individual effort for each application.

To overcome these limitations, the Kepler initiative is developing a generic tool and environment that builds on existing technologies and will work in a wide range of applications to capture, automate, and manage researchers’ actions as they carry out scientific workflows. The initial effort has brought together computer scientists with domain scientists in the disciplines of ecology, biology, chemistry, oceanography, geosciences, nuclear physics, and astronomy.

“With Kepler, we’re giving scientists an intuitive tool that they can use to build their own workflows, which can include emerging Grid-based approaches to distributed computation,” said Kepler co-initiator Matt Jones, a co-principal investigator and project manager for the SEEK project. “And in order to build a workflow environment that is effective across multiple domains of science, we’re working with a growing range of projects to ensure the widest possible usefulness of the infrastructure.”

In addition to the Ptolemy project of UC Berkeley described below, which serves as the framework for Kepler, the collaboration currently includes the following projects that span a range of scientific fields:

• Ecology: SEEK
• Biology, Nuclear Physics, and Astrophysics: Scientific Discovery through Advanced Computing Program in the Department of Energy (DOE SciDAC)
• Biology: NSF Encyclopedia of Life project (EOL) at SDSC
• Geosciences: the NSF GEON project, building a cyberinfrastructure for the geosciences
• Environmental science and sensor networks: the NSF Real-time Observatories, Applications, and Data management Network (ROADNet)
• Computational chemistry: the NSF RESURGENCE project
• Data mining tools: Cyberinfrastructure Laboratory for Ecological Observing Systems (CLEOS) at SDSC
• Distributed data integration: National Laboratory for Advanced Data Research (NLADR), a collaboration between SDSC and the National Center for Supercomputing Applications (NCSA).

Kepler is used in a wide variety of ways in these projects. In the Encyclopedia of Life project, the integrated Genome Annotation Pipeline software uses the Application Level Scheduling Parameter Sweep Template (APST) in month-long grid computing jobs that would be far more difficult without a workflow tool. First, Kepler prepares the databases and submits the computing job. Then it continues in a monitoring mode that checks on the execution and updates the corresponding database. In the event of a failure, the most recent update can be retrieved from the database, greatly

Revealing the Earth
Light Detection And Ranging (LIDAR) data is an important new tool for studying the Earth’s surface, especially where heavy vegetation makes traditional aerial photography ineffective. Kepler workflows in the GEON project aid analysis of multi-terabyte LiDAR data. Image of the San Andreas fault in California shows tree canopy (left) from the first return signal and “virtual deforestation” (right) based on the last return signal, making the fault line clearly visible, running N.W. to S.E. C. Crosby, ASU.



users to compose complex workflows simply by stringing together individual actors, linking them according to the flow of data, and nesting them to represent desired levels of abstraction. In addition to Ptolemy’s considerable built-in capabilities, which include more than 100 actors (or processing components) and directors (or workflow engines), the Kepler collaborators are continually adding new ones that extend the system, and have already contributed more than 100 additional actors.

AUTOMATION CHALLENGES
To capture the actions that scientists carry out in conducting their research and to automate these steps, the flow of data from one analytical step to another is described in Kepler in a formal, computer-readable workflow language. Among technical issues the researchers are facing in developing Kepler are:

• identifying factors that improve interoperability such as supporting shared workflow component repositories between Kepler, Taverna, Triana, SCIRun, DiscoveryNet, GeoVista Studio, and others
• managing heterogeneous mixtures of models of computation, including controlling space, time, and context
• handling distributed computation and code migration in scientific workflows
• facilitating the efficient use of existing scientific codes in workflow environments
• choosing or developing languages that are suitable for representing scientific workflows
• ensuring usability of the Kepler interface for the diverse group of scientists adopting Kepler.

As they resolve these technical challenges, Kepler developers must work closely with domain scientists in order to ensure that the resulting software meets the scientists’ needs.

ENHANCING COLLABORATION
Beyond automating the steps of a given project, workflows captured in Kepler are intended to promote communication and collaboration for scientists in diverse

Modeling Molecules
Kepler workflows are being tested in studies involving this ligand/protein system in the RESURGENCE project. Image shows molecular electrostatic potential map of a ligand in a binding site of a mutated enzyme PLP-dependent alanine racemase from G. stearothermophilus. K. Baldridge and C. Amoreira.

Monitoring Lake Ecology
The SDSC Cyberinfrastructure Laboratory for Ecological Observing Systems (CLEOS) lab helps researchers gather data on lake ecology in the Collaborative Lake Metabolism Project. Kepler workflows can speed gathering, managing, and analyzing this complex multi-sensor streaming data, giving scientists new insights into lakes, including rapid events such as overturning. CLMP.

simplifying recovery. Kepler also makes it easy for scientists to execute a new task simply by double-clicking on and changing the parameters of an existing task. All of these capabilities let scientists accomplish genomic research much more rapidly.
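The submit-monitor-checkpoint-resume pattern described above can be sketched in a few lines. The Python below is not the actual EOL/APST pipeline code; the task names, progress table, and failure model are invented to show how recording each completed step in a small database lets a rerun pick up where a failed run left off.

# Illustrative sketch only: checkpointed pipeline execution with a tiny SQLite progress table.
import sqlite3
import random

TASKS = ["prepare_db", "submit_job", "annotate_batch_1", "annotate_batch_2", "publish"]

def run_task(name: str) -> None:
    """Stand-in for a real pipeline step; occasionally fails to exercise recovery."""
    if random.random() < 0.2:
        raise RuntimeError(f"{name} failed")
    print(f"ran {name}")

def run_pipeline(db_path: str = "progress.sqlite") -> bool:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (task TEXT PRIMARY KEY)")
    finished = {row[0] for row in conn.execute("SELECT task FROM done")}
    try:
        for task in TASKS:
            if task in finished:
                print(f"skipping {task} (already recorded)")   # resume point
                continue
            run_task(task)
            conn.execute("INSERT INTO done VALUES (?)", (task,))
            conn.commit()                                      # checkpoint after each step
        return True
    except RuntimeError as err:
        print(f"stopped: {err}; rerun to resume from the last checkpoint")
        return False
    finally:
        conn.close()

random.seed(7)
while not run_pipeline():
    pass  # a real monitor would wait and retry; here we simply rerun until the pipeline completes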

To advance genomic research, biologists in the SciDAC Program at the DOE study co-regulated genes. In their research, they try to identify promoters and develop models of transcription factor binding sites that play key roles in the expression of genes. The scientists use Kepler to help execute a series of data analysis and querying steps in which they move the results of each successive step from one Web resource to another. By automating these steps, the researchers save hours or days of time, speeding their results and allowing them to undertake problems on larger scales than previously possible.

Ecologists in the SEEK project study problems that include invasive diseases such as the West Nile virus. West Nile virus is spread through mosquitoes feeding on migrating birds in a complex dual-vector process. The researchers develop predictions for where and how fast this kind of disease will spread. To do this, ecologists access online data sets about where the mosquitoes and birds are observed to live and migrate. Then they use Web-based ecological niche modeling tools to correlate this information with climate data, computing predictions for where the birds and mosquitoes are likely to be found. Automating these steps with Kepler can make it feasible to produce accurate predictions for the spread of an invasive disease far more quickly than previously possible.
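As a rough illustration of what an ecological niche model does, the toy sketch below fits a simple climate "envelope" to synthetic occurrence records and flags map cells whose climate falls inside it. The SEEK tools and data services are far richer than this; every value, variable, and grid here is invented.

# Illustrative sketch only: a toy climate-envelope model on synthetic data.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic climate grid: mean temperature (C) and annual rainfall (mm) on a 50x50 map
temp = rng.uniform(5, 35, size=(50, 50))
rain = rng.uniform(200, 1600, size=(50, 50))

# Synthetic occurrence records: (temperature, rainfall) at sites where the vector was observed
occurrences = np.column_stack([rng.uniform(18, 30, 200), rng.uniform(600, 1200, 200)])

# Envelope model: a cell is "suitable" if each climate variable lies within the
# 5th-95th percentile range of values observed at occurrence sites
t_lo, t_hi = np.percentile(occurrences[:, 0], [5, 95])
r_lo, r_hi = np.percentile(occurrences[:, 1], [5, 95])
suitable = (temp >= t_lo) & (temp <= t_hi) & (rain >= r_lo) & (rain <= r_hi)

print(f"temperature envelope: {t_lo:.1f}-{t_hi:.1f} C, rainfall envelope: {r_lo:.0f}-{r_hi:.0f} mm")
print(f"{suitable.mean():.0%} of map cells predicted suitable for the vector")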

Automating workflows can yield similar benefits in a wide range of other scientific fields, and a growing number of projects and individuals are contributing to the Kepler open source project. More information on members and contributors can be found on the Kepler website at http://kepler-project.org/.

LEVERAGING EXISTING TECHNOLOGY
To explore whether they could build on existing technologies, the Kepler team surveyed available tools. Ptolemy, a project of the Center for Hybrid and Embedded Software Systems led by Professor Edward Lee of UC Berkeley, focuses on modeling, simulation, and design of concurrent, real-time embedded computing systems. The Kepler team realized that although Ptolemy had been developed for a very different purpose, it had capabilities that would provide a mature platform for the needs of Kepler in designing and executing scientific workflows. Ptolemy II, published as open source software, is the current base version of the Kepler infrastructure.

Ptolemy provides a set of Java language packages that support heterogeneous, concurrent modeling, design, and execution. Among Ptolemy’s strengths are support for a number of precisely-defined models of computation such as streaming, and a concurrent dataflow paradigm for process networks that is appropriate for modeling and executing many scientific workflows. Ptolemy’s programming approach is activity-based, or “actor-oriented” in Ptolemy terminology, which makes it easier to design the reusable components that scientists need. Ptolemy also has an intuitive graphical user interface called Vergil that allows



domains—a crucial capability for today’s large-scale interdisciplinary collaborations. “Through its systematic approach to scientific workflows, Kepler can fulfill the important function of publishing analyses, models, data transformation programs, and derived data sets,” said Kepler co-initiator Jones. “This gives scientists a way to track the provenance of derived data sets produced through workflow transformations, which is essential to being able to identify appropriate data sets for integration and further research.”

In addition to distributing new Kepler actors that automate specific tasks, scientists can publish the results of workflows, storing the formal workflow descriptions of the steps carried out in a Web-accessible repository such as one of the metadata catalogs that are part of the SEEK EcoGrid. Kepler developers are working on extensions that will allow scientists to easily publish their workflows and share them with colleagues in flexible ways.

“We’re now starting to add semantic capabilities to Kepler,” said Shawn Bowers, project scientist at the UC Davis Genome Center where he works with professor Ludäscher on workflow and data integration technology for the SEEK project. “These include domain-specific ontologies acting as ‘semantic types’ for datasets, which will let scientists use the concepts of their own fields to search for and discover data and services, link to, and integrate data sets in both local and distributed grid environments.”

Scientists are also interested in the potential of Kepler and related tools to power comprehensive “science environments,” which they envision will follow accelerating growth paths as scientists are rapidly and seamlessly able to find out about and build on the previous work of their own and collaborating groups. “As the Kepler environment gains momentum and becomes more robust and reliable,” explains SDSC’s Altintas, “the body of resources that scientists can build upon grows larger, and more groups and scientific domains are joining this open collaboration.”

West Nile Virus
In the United States, West Nile virus is transmitted by infected mosquitoes, primarily members of Culex species. © 2005 Jupiter Images Corporation.

– Paul Tooby is a senior science writer at SDSC and editor of EnVision Magazine.

RELATED LINK
Kepler Scientific Workflow Project
kepler-project.org

REFERENCES
I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, Kepler: An Extensible System for Design and Execution of Scientific Workflows, 16th International Conference on Scientific and Statistical Database Management (SSDBM’04), 21-23 June 2004, Santorini Island, Greece.

B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, Y. Zhao, Scientific Workflow Management and the Kepler System, Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, to appear.

Predicting West Nile Virus
Prediction of 2003 West Nile Virus cases (black dots) by researchers in the SEEK project using ecological niche modeling to project forward 2002 cases, based on 2003 environmental conditions. Illinois 2003 reported cases (yellow squares) are overlaid on the prediction. Kepler workflows can speed such predictions, which the researchers found to be surprisingly accurate using only limited case information. A.T. Peterson, U. Kansas.



The old expression “can’t see the forest for the trees” is just as apt for astronomers who observe the heavens as for viewers of earthly woodlands. As telescopes become ever more powerful, they give astronomers deeper views that reveal dramatic new details of narrow areas of the sky. But without a unified view of larger regions, scientists may miss large-scale patterns that can help them make sense of the individual objects they see.

What if astronomers had tools that let them see not only deeper into space but also broader areas of the sky—even the whole sky—in a unified image collection? That dream is now becoming a reality through advances in both telescopes and the supporting data cyberinfrastructure of software, data resources, and supercomputers being pioneered in a Strategic Applications Collaboration involving SDSC experts and astronomers from Caltech and NASA’s Jet Propulsion Laboratory (JPL).

Using the Montage software, the researchers are stitching together into mosaics millions of individual images in the 2-Micron All Sky Survey, known as

PROJECT LEADERS
BRUCE BERRIMAN, JOHN GOOD, THOMAS PRINCE, AND ROY WILLIAMS, CALTECH
REAGAN MOORE, SDSC/UCSD

PARTICIPANTS
ANASTASIA CLOWER LAITY, CALTECH
ATILLA BERGOU, JOSEPH JACOB, AND DANIEL KATZ, JPL
LEESA BRIEGER AND GEORGE KREMENEK, SDSC/UCSD

2MASS, which contains an enormous 10 terabytes of data (10 terabytes is about 2,500 DVDs, around the size of the printed collection of the Library of Congress). A derived image product on this scale has never before been produced, and the resulting mosaics will let astronomers study structures as complete entities, opening the door to exploring previously unseen large-scale relationships in the Universe.

“You’ll underestimate the true complexity of what’s going on if you look only at narrow regions of the sky,” said astronomer John Good of the Montage project at Caltech and JPL. “Having mosaics of the entire sky will help researchers construct 3-D maps of larger regions, giving us a much clearer understanding of how these structures evolved—we’re finding that it’s a very dynamic, explosive process.”

For example, astronomers are finding that viewing the “big picture” can give them new insights into the complexities of large star formation regions that include a mixture of point sources, barely extended sources, and very complex extended structures. A large mosaic map of the region and its environs helps scientists understand the physical processes shaping it—the dynamic interplay of massive stars and their radiation fields with the surrounding medium of gas and dust. These powerful tools can help unravel the mysteries underlying the birth of new stars.

To produce the individual images in the 2MASS Survey, astronomers scanned the entire sky, using new instruments with an 80,000-fold improvement in sensitivity over earlier surveys. By looking at the sky in near-infrared light, the survey provides a “window” that is distinct from visible light, and can penetrate dust clouds of the Milky Way Galaxy to reveal an enormous number of objects that can’t be seen in visible light. Funded by NASA and the National Science Foundation (NSF), the 2MASS survey is led by the University of Massachusetts, with

Above: All-Sky Projection
An all-sky projection of the long-wavelength data from the Infrared Astronomical Satellite. The unusual form comes from the Aitoff projection, which maps the whole sphere into an equal-area projection. IPAC, Caltech.

Left: Milky Way Mosaic
This striking image of dust clouds in the plane of the Milky Way (running diagonally from upper left to lower right) is a three-color mosaic produced by SDSC and Montage from observations in the 2MASS all-sky survey. With 10,000 individual pixels on a side, the mosaic gives astronomers a view vastly expanded in both detail and width, opening the door to new insights into the large-scale structure of the Milky Way. Montage, IPAC, Caltech, SDSC/UCSD.




processing, construction, and distribution of the data products done by the Infrared Processing and Analysis Center (IPAC) at Caltech and JPL. Available through the IPAC Infrared Science Archive (IRSA), the 2MASS survey includes millions of high-resolution digital images.

In this collaboration, SDSC researchers are working with colleagues in the Montage project from IPAC and the Center for Advanced Computing Research (CACR) as well as the National Virtual Observatory (NVO) at Caltech and JPL.

THE WHOLE IS GREATER…
The Montage software being used to join individual 2MASS images can produce mosaics to user-specified parameters of projection, coordinates, size, rotation, and spatial sampling. The algorithms Montage uses are computationally intensive, and the 10 terabytes of data in the survey is huge. For this project the mosaic plates are 6 degrees on a side, and to cover the entire sky, SDSC staff scientist Leesa Brieger is generating 1,734 plates for each of the three near-infrared bands. Every 6 degree plate incorporates between 800 and 4,000 input files, with an average of about 1,500.

Brieger is working closely with the scientists in this Strategic Applications Collaboration to accelerate their research by developing ways to use large-scale SDSC resources most effectively. In addition to aiding this research project, the collaboration also aims to develop general solutions that will benefit the wider academic community. In processing the images, Brieger has developed an efficient workflow that first warps the input images onto the same projection, and then integrates the overlapping parts of adjacent images. Finally, a key task that Montage carries out is to apply a background model that rectifies to a common level the varying backgrounds of the millions of different images acquired at different times, under conditions with different background light, or sky emissions. The software does this by removing terrestrial and telescope features in the images in a way that can be tracked and described quantitatively.
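The background-matching idea can be illustrated on a toy scale. The Python sketch below is not the Montage code: it simply estimates the constant background difference between two overlapping synthetic images from their overlap region, rectifies one to the other's level, and co-adds them. The image sizes, offsets, and noise are invented.

# Illustrative sketch only: background matching and co-addition for two overlapping images.
import numpy as np

rng = np.random.default_rng(3)
sky = rng.normal(100.0, 5.0, size=(100, 160))   # "true" sky shared by both exposures

# Two exposures of the same field taken under different sky brightness
img_a = sky[:, :100] + 12.0 + rng.normal(0, 1, (100, 100))   # covers sky columns 0-99
img_b = sky[:, 60:]  +  3.0 + rng.normal(0, 1, (100, 100))   # covers sky columns 60-159

# The images overlap in sky columns 60-99 (40 columns)
overlap_a = img_a[:, 60:]
overlap_b = img_b[:, :40]

# Estimate the relative background from the overlap, then rectify B to A's level
offset = np.mean(overlap_a - overlap_b)
img_b_rect = img_b + offset

# Co-add: average where the images overlap, single coverage elsewhere
mosaic = np.empty_like(sky)
mosaic[:, :60] = img_a[:, :60]
mosaic[:, 60:100] = 0.5 * (img_a[:, 60:] + img_b_rect[:, :40])
mosaic[:, 100:] = img_b_rect[:, 40:]
print(f"estimated background difference: {offset:.2f} (true value 9.00)")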

The researchers anticipate that the project will take between 70,000 and 100,000 processor hours to complete on the TeraGrid at SDSC. The process is quite parallel, Brieger explains, and can be scaled up to run on more than 1,000 processors. Creating the mosaics is computationally demanding, requiring substantially more time than is needed to move the 10 terabytes of input data to SDSC and to store the expected 7 terabytes of output mosaics. The researchers are staging the input data on the local parallel file system and temporarily storing the output data locally until it is archived to join other long-term scientific data collections at SDSC as an important new

War and Peace Nebula
The region of star formation NGC 6357 is located about 7 degrees west of the Galactic center. This 2.5 degree square mosaic was constructed from near-infrared 2MASS and mid-infrared MSX images available through IRSA. Montage, Caltech.



Galactic Plane
Full-resolution mosaic of 2-Micron All Sky Survey (2MASS) galactic plane data constructed with Montage. IPAC, Caltech.


resource for astronomers.

Managing the mosaicking process with such massive amounts of data is presenting interesting new challenges. For example, with today’s state-of-the-art hardware and file systems, the data retrieval error rates are very low, on the order of one error in every trillion bytes. With such low error rates, an astronomer computing a single mosaic with Montage using a few thousand images typically won’t encounter any problem.

However, as SDSC tackles the larger challenge of scaling up to compute thousands of mosaics for the entire 10 terabyte survey, even this tiny error rate can become significant. “When dealing with a multi-terabyte data collection with millions of files you can’t ignore even small error rates,” said Brieger. “The real contribution we make at SDSC is to help identify the problems that show up as scientists scale up their data and computing, and develop robust, fault-tolerant tools that let them get science done at the very large scale required at the frontiers of today’s science.”
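A quick back-of-envelope calculation shows why. The sketch below multiplies the quoted error rate by rough, assumed read volumes (file sizes and re-read counts are guesses for illustration) and adds the kind of simple retry wrapper a fault-tolerant workflow layers on top of flaky reads.

# Back-of-envelope sketch: why a one-in-a-trillion-bytes error rate still matters at
# survey scale. The byte counts and failure model below are illustrative assumptions.
ERROR_RATE_PER_BYTE = 1e-12               # "one error in every trillion bytes"
single_mosaic_bytes = 3_000 * 2e6         # a few thousand images of a couple of MB each (assumed)
survey_bytes_read   = 10e12 * 5           # 10 TB survey, each byte re-read ~5 times (assumed)

print(f"expected errors for one mosaic:  {single_mosaic_bytes * ERROR_RATE_PER_BYTE:.4f}")
print(f"expected errors for full survey: {survey_bytes_read * ERROR_RATE_PER_BYTE:.1f}")

def with_retries(action, attempts=3):
    """Re-run a flaky step a few times before giving up and flagging it for repair."""
    for i in range(attempts):
        try:
            return action()
        except IOError as err:
            print(f"attempt {i + 1} failed ({err}); retrying")
    raise IOError("step failed after retries; mark plate for reprocessing")

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] == 1:                    # deterministic toy: fail once, then succeed
        raise IOError("simulated read error")
    return "data"

print(with_retries(flaky_read))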

A key resource for the project is SDSC’s massive data storage resources—more than one petabyte (one million gigabytes) of online disk backed by six petabytes of archival tape, along with the SDSC Storage Resource Broker (SRB) software for organizing, moving, and archiving massive data collections. “SDSC’s end-to-end data and computational capabilities are enabling the astronomy community to reach the milestone of being able to process the entire 10 terabyte 2MASS sky survey,” said Reagan Moore, Distinguished Scientist and director of the Data Intensive Computing Environments (DICE) group at SDSC. “The 2MASS survey is archived using the SRB in SDSC’s HPSS archival storage, and with the help of SDSC expertise and resources, the Montage software is able to produce mosaics of the entire survey.”

AN ATLAS OF THE ENTIRE SKY
An important step, the researchers explain, is to validate the mosaics against the original 2MASS survey images from which they were derived. The completed science-grade mosaics are then stored along with information on all of the modifications of the raw images, tracked in a quantitative way so that astronomers can know exactly how each plate was produced. And the mosaics are being made in accordance with Hyperatlas standards and will form part of the growing Hyperatlas coordinated by Roy Williams of Caltech, which will be available for public access through the NVO.

By producing a collection of uniform mosaics for the entire 2MASS all-sky survey, the collaboration represents an important step forward for astronomers. And when the large-scale mosaic project is extended so that a number of surveys in different wavelengths are re-projected to the same pixel grid, or map projection, then images from surveys in different wavelengths can be directly overlaid and jointly compared and analyzed in ways never before possible.

This is exciting, explains Williams, because it opens the door for large-scale data mining of the “federated pixels,” that is, the background-adjusted mosaicked images that are linked through a common catalog and available in a common projection. Astronomers will be able to ask complex, “big picture” questions that both explore the different wavelengths of various surveys and extend across large spatial structures in the Universe—a new mode of inquiry.

In addition to the 2MASS survey, Caltech researchers are re-projecting the 3 terabyte Digital Palomar Observatory Sky Survey collection, also stored in an SRB collection at SDSC, and preparing to do the same with the Palomar-Quest survey, already 13 terabytes in size and growing by one terabyte a month. Eventually the Sloan Digital Sky Survey (SDSS) may form part of the collection as well.

Collaborators in the project include Thomas Prince, Montage Principal Investigator, Bruce Berriman, Montage project manager, Anastasia Clower Laity, John Good, and Roy Williams, NVO co-director, of Caltech; Joseph Jacob, Daniel Katz, and Atilla Bergou of JPL; and Leesa Brieger, George Kremenek, and Reagan Moore of SDSC. Montage is funded by the NASA Earth Science Technology Office, Computational Technologies Project.

– Paul Tooby is a senior science writer at SDSC and editor of EnVision Magazine.

RELATED LINKS
National Virtual Observatory (NVO)
www.us-vo.org

Montage
montage.ipac.caltech.edu

SDSC Strategic Applications Collaborations (SAC)
www.sdsc.edu/user_services/sac/index.html



SDSC News

SDSC EXPANDS DATASTAR SUPERCOMPUTER TO 15.6 TERAFLOPS

To better support the extreme data-intensive needs of U.S. science and engineering researchers and educators, SDSC has expanded the capacity and capability of its DataStar supercomputer to more than 2,048 processors. Through the addition of 96 8-way IBM Power 4+ p655 compute nodes, users will now have access to one of the most powerful computers in the nation available to the open academic community.

The expanded DataStar will provide SDSC users 50 percent more capacity, helping meet the heavy demand for the center’s compute time. In addition, the aggregate memory as well as the size of DataStar’s parallel file system will almost double. This will provide 7.3 terabytes of aggregate memory and 115 terabytes of parallel file system disk storage, giving researchers the ability to compute and output more data in research areas such as astronomy, geosciences, fluid dynamics, and others. SDSC is the TeraGrid site with particular responsibility for data-intensive computing, and the expansion will greatly improve overall performance, providing a more powerful tool for users.

“Data-intensive computing has rapidly become a principal mode of scientific exploration. This expansion of SDSC’s DataStar system demonstrates SDSC’s and the NSF’s recognition that the best tools must be made available to the science and engineering communities,” said José Muñoz, deputy director of the Office of Cyberinfrastructure at the National Science Foundation (NSF). “SDSC has been a leader in data-intensive computing, and this upgrade maintains that leadership. The capabilities being made available through this expansion, coupled with SDSC’s excellent scientific and support staff, are paramount for continued U.S. leadership in science, engineering, and education.”

DIRECTOR BERMAN CO-CHAIRS NSF SBE-CISE WORKSHOP

The National Science Foundation funded the Social, Behavioral, and Economic Sciences/Computer and Information Science and Engineering (SBE/CISE) Workshop on “Cyberinfrastructure for the Social and Behavioral Sciences” in recognition of NSF’s role in enabling, promoting, and supporting science and engineering research and education. The workshop was designed to help social, behavioral, and economic researchers identify their needs for infrastructure, their potential for helping CISE develop this infrastructure for engineering and all the sciences, and their capacity for assessing the societal impacts of cyberinfrastructure. SDSC Director Fran Berman co-chaired the workshop along with Henry Brady of the University of California, Berkeley. More than 80 leading CISE and SBE scientists were brought together at Airlie House in Virginia on March 15-16, 2005 to discuss six areas: cyberinfrastructure tools for the social and behavioral sciences; cyberinfrastructure-mediated interaction; organization of cyberinfrastructure and cyberinfrastructure-enabled organizations; malevolence and cyberinfrastructure; the economics of cyberinfrastructure; and finally, the impact of cyberinfrastructure on jobs and income. For more information and the final workshop report see http://vis.sdsc.edu/sbe/.

A protein structure visualized by the Molecular Biology Toolkit. J. Moreland, SDSC/Calit2 Synthesis Center.

LIBRARY OF CONGRESS AND NSF ANNOUNCE DIGARCH DIGITAL PRESERVATION AWARDS

The Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Science Foundation awarded 11 university teams a total of $3 million to undertake pioneering research to support the long-term management of digital information. These awards are the outcome of a partnership between the two agencies to develop the first digital-preservation research grants program.

The Library is implementing a national digital preservation strategy, which involves building a collaborative network of partners to collect at-risk content and to develop new preservation tools. Research supported under these awards will help produce the technological breakthroughs needed to keep very large bodies of digital content securely preserved and accessible over many years.

SDSC, a world leader in preservation technologies, will participate in the Digital Preservation Lifecycle Management project, demonstrating this process for video content. Researchers will develop and document a practical preservation process for mixed collections of both legacy and “born digital” video material. SDSC will also participate with the Scripps Institution of Oceanography and the Woods Hole Oceanographic Institution in the Multi-Institution Testbed for Scalable Digital Archiving to develop a multi-terabyte digital repository to preserve data from more than 1,600 oceanographic research projects.

SDSC HOSTS THREE SUMMER WORKSHOPS

SDSC’s in-demand educational activities included hosting three summer workshops—its first Blue Gene Users Workshop, the second annual GEON Cyberinfrastructure Summer Institute, and the 11th Annual Computing Institute for Scientists and Researchers. SDSC’s expertise in handling extreme data was the overarching theme in all three workshops:

• The Blue Gene Users Workshop, held July 7–8, 2005 at SDSC, emphasized hands-on access for 18 researchers from a variety of scientific fields who require the large number of processors on SDSC’s newest supercomputer. The Blue Gene system, with 2,048 compute processors in a single rack, is nicknamed “Intimidata” because it is specially configured with 128 I/O nodes to support data-intensive computing. The hands-on sessions were complemented by lectures from experts at SDSC and the Lawrence Livermore National Laboratory, home of the world’s largest Blue Gene system.

• Thirty-eight researchers attended the second annual GEON-hosted Cyberinfrastructure Summer Institute for Geoscientists (CSIG). The attendees included graduate students, postdocs, and researchers in geoscience and information technology from a number of agencies and more than 30 institutions around the U.S. and as far away as Japan, Korea, and the United Kingdom. The popular one-week educational program, held from July 18-22, gives geoscientists an “IT headstart” in using powerful new information technologies, or cyberinfrastructure, to enable a new generation of geoscience discoveries.

• The 11th Annual Computing Institute at SDSC hosted 30 researchers for a week-long lecture series and hands-on laboratory focusing on managing large data sets with “extreme I/O.” Attendees included graduate students, scientists, and researchers representing 17 institutions in the U.S. and abroad. The program, held July 25-29, was designed to provide researchers with a solid introduction to the concepts and tools available in both established and new technologies for the creation, manipulation, dissemination, and analysis of large data sets.

SDSC LAUNCHES ‘DATA CENTRAL’

Interdisciplinary collaborations and community-shared data collections and databases are becoming increasingly important to the progress of science. For this reason, SDSC launched Data Central, which allows users to make their data collections and databases publicly available to a wide community of users. The site also offers hosting and long-term archiving that provide researchers with easy-to-use and complete data management services, freeing them to focus on their research. With storage facilities of more than one petabyte (1,000 terabytes) of online disk and six petabytes of archival tape storage, SDSC currently hosts more than 50 publicly available data collections from bee behavior videos and multi-terabyte all-sky astronomy image collections to data from the Library of Congress.

Eligible researchers can request a data allocation (whether or not they have a compute allocation) that permits expanded access to SDSC’s facilities for data collection hosting, database hosting, and long-term archiving. To request a data allocation, fill out the simple, fast online form, which can be submitted automatically online. To learn more, see the Data Central site at http://datacentral.sdsc.edu/.

CENTER INTRODUCES MOLECULAR BIOLOGY TOOLKIT

SDSC has introduced the National Institutes of Health (NIH)-funded Molecular Biology Toolkit (MBT), a set of Java-based software libraries for manipulating, analyzing, and visualizing information about proteins, DNA, and RNA. This first major release of the MBT runs under the Linux, Windows, Mac OS X, and IRIX operating systems—an important advantage since very few off-the-shelf packages enable applications to run seamlessly on several different computer platforms. The MBT includes source code, example applications, a Programmer’s Guide, an Application Program Interface (API) document, a Build Guide, and a Binary Installation Guide.

The toolkit provides Java classes for efficiently loading, managing, and manipulating protein structure and sequence data. The MBT also provides a rich set of graphical 3-D and 2-D visualization modules which can be plugged together to produce applications that have sophisticated graphical user interfaces. And the core data I/O and manipulation tools can also be used to write completely non-graphical applications—to implement pure analysis codes, for example, or to produce a non-graphical back end for Web-based applications. More information is at http://mbt.sdsc.edu/.

15.6 teraflops SDSC DataStar supercomputer. A. Decker.

Thirty researchers attend 11th Annual Computing Institute held this summer at SDSC. B. Tolo.


LONG-TERM RECORDS OF GLOBAL SURFACE TEMPERATURE
This visualization of global surface temperature is part of a National Science Digital Library collection archived at SDSC. Preservation technologies developed at SDSC are capable of maintaining educational and scientific data for decades or more. This is essential to scientists’ efforts to understand and predict important phenomena such as global temperature and climate change, and to enable the NSDL to educate the next generation of researchers. More information on the NSDL is at http://nsdl.org/. NSDL collection support C. Cowart, SDSC/UCSD; image created by G. Shirah and J. Kermond, NASA GSFC Scientific Visualization Studio.

SUBSCRIPTIONS
For a free subscription to ENVISION, send the information requested below to:

Gretchen Rauen, SDSC
University of California, San Diego
9500 Gilman Drive, MC 0505
La Jolla, CA 92093-0505

[email protected], 858-534-5111

www.sdsc.edu

NAME

TITLE

INSTITUTION

ADDRESS

CITY, STATE, AND ZIP

COUNTRY

E-MAIL

University of California, San Diego
San Diego Supercomputer Center
9500 Gilman Drive, MC 0505
La Jolla, CA 92093-0505

ADDRESS SERVICE REQUESTED

NON-PROFIT ORG
U.S. POSTAGE

WESTERN GRAPHICS

P A I D

SAN DIEGO SUPERCOMPUTER CENTER