
Peer Review
A newsletter for the VUMC research community
FALL 2002

Bioinformatics gains momentum
New planning grant, shared efforts move research forward

If you ask ten different scientists, “What is bioinformatics?” you will likely hear ten different responses. There will be common elements – computers and databases top the list – but the definition will depend on who’s doing the defining. Nancy Lorenzi, assistant vice chancellor for Health Affairs, sums it up nicely. “Bioinformatics is like an amoeba,” she says. “It comes in various shapes and sizes.”

At Vanderbilt Medical Center, advancing bioinformatics is “a very distributed effort,” says Mark Magnuson, assistant vice chancellor for Research. “Bioinformatics is probably involved in one way or another in seven or more of our shared resources, and individual laboratories are making use of these tools as well.” Magnuson has spearheaded an effort to bring different investigators together over lunch, to stimulate discussion and development of new bioinformatics resources. Although the definition of bioinformatics may be vague, he says, there is momentum at Vanderbilt for improving and expanding efforts in the bioinformatics arena.

So what is bioinformatics? According to the National Institutes of Health, bioinformatics is “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”

“The term bioinformatics is not very specific to any one effort,” says Al George, director of Genetic Medicine. “Bioinformatics is a toolbox that can be configured to perform a wide variety of research needs; it is not one flavor.”

Mary Edgerton, director of the Molecular Profiling and Data Mining Shared Resource, sees bioinformatics as a spectrum “that goes from storage and retrieval of massive amounts of information to techniques used to interpret that information.” In her view, bioinformatics is a large field with many subspecialties, including biomedical informatics and computational biology.

Others, including the NIH, define these disciplines as distinct entities, recognizing that there is significant overlap and activity at their interfaces.

Lorenzi explains that biomedical informatics usually refers to clinical or medical informatics – the application of computing to patient-related data – while the term bioinformatics more often implies the computing tools that handle the research information of the genomic era.

Vanderbilt is particularly strong in the realm of biomedical informatics. The department of Biomedical Informatics and its faculty members, including chair Randy Miller and Bill Stead, are internationally recognized for initiatives like StarChart and WizOrder. Bioinformatics for basic science areas of research has been slower to develop at Vanderbilt, but thanks to the efforts of many investigators, it is gaining ground.

A newly awarded pre-Center grant to develop a National Program of Excellence in Biomedical Computing will allow Vanderbilt to build on existing efforts to establish the organizational and infrastructure components needed for a full-fledged Center.

This issue of Peer Review takes a look at some of the areas at VUMC where bioinformatics plays a key role. Shared resources like the microarray, proteomics, structural biology, and imaging resources rely heavily on bioinformatics tools and are developing some of their own. The bioinformatics cores of the Program in Human Genetics and VICC are writing code and developing databases to store, retrieve and analyze information. Investigators in the department of Biomedical Informatics are writing new algorithms to find relationships among the elements of huge genomic datasets. And ongoing efforts seek to recruit new faculty members with bioinformatics expertise and to expand and improve computing resources such as the VAMPIRE system.

The stories in this issue do not present an exhaustive review of bioinformatics at the Medical Center. They are intended to highlight some of the advances, to provide a glimpse into the changing shape and movement of the amoeba that is bioinformatics.

–Leigh MacMillan

Planning a national center

Vanderbilt is one of a select group of institutions to receive a planning grant to develop a National Program of Excellence in Biomedical Computing (NPEBC). The NPEBCs are part of the NIH’s Biomedical Information Science and Technology Initiative (BISTI), which is aimed at making optimal use of computer science and technology to address problems in biology and medicine. The hope is that NPEBCs around the country will:

• promote bioinformatics and biocomputational research that enables the advancement of biomedical research

• develop useful and interoperable informatics and computational tools for biomedical research

• establish mutually beneficial collaborations between biomedical researchers and informatics and computation researchers

• train a new generation of bioinformatics and biocomputation scientists

Bill Stead and Mark Magnuson are the PI/co-PI on Vanderbilt’s grant, which proposes a “linked knowledge model” as an organizational framework to maximize multidisciplinary research. The model draws linkages between the realms of discovery/applications and information technology to create new knowledge, hypotheses and models for applications.

“The model recognizes that disparate groups working together will achieve more than they could working independently,” says Nancy Lorenzi.

Vanderbilt’s two-year planning grant will support efforts to establish the linked model organizational structure. The grant also will support two to four development projects that clearly demonstrate the link between biology and informatics tool development, initiatives in education and career development, and a core coordinator position to address issues such as “one-stop shopping” for researchers and data management services. An executive steering committee, coordinated by Stead, Magnuson, Lorenzi, and John Manning, will oversee implementation of the linked knowledge model.

“We hope this pre-Center grant will pave the road to a full-scale National Program of Excellence in Biomedical Computing,” Magnuson says. “We are thrilled that Vanderbilt has a part in this important initiative.”

–Leigh MacMillan

[Photos: Bill Stead and Mark Magnuson]

Cancer team links clinical, molecular data

Mary Edgerton erases the large red and purple cat – her daughter’s artwork – and begins to fill her office whiteboard with interconnected circles, short dashes, and long arrows. Speaking quickly as she draws, she explains how databases and computer algorithms can be used to link, merge, and mine clinical information and related molecular data.

That’s the plan, anyway. And Edgerton and her colleagues are well on their way to what she says some call a holy grail of bioinformatics. “The idea is to build databases that link our clinical information and our molecular information and to do it in such a way that we can search on several parameters across all the databases,” says Edgerton, director of the Molecular Profiling and Data Mining Shared Resource of the Vanderbilt-Ingram Cancer Center. “Everyone wants to do this, and nobody’s done it.”

Edgerton and her colleagues envision a warehouse with multiple databases – one with an inventory of banked tissues, one with the clinical information associated with each tissue sample, and one with microarray and proteomic data. The linkage between the databases comes from a unique “identifier” – a barcode – assigned to a tumor tissue at the surgical pathology bench. This barcode travels with the tissue as it is used for molecular experiments, such as gene expression microarray or proteomic studies.

The tissue and clinical databases will be a tremendous resource, Edgerton says. “If, for example, an investigator says ‘I would like to know how many women between the ages of 29 and 35 developed node negative breast cancer between one and two centimeters,’ that investigator will be able to search the database and find out how much of that tissue we have stored. The banked tissues might then be used for high throughput molecular analyses.”
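To make the barcode-linked design concrete, here is a minimal sketch in Python, using the standard library’s sqlite3 module, of how such a query might run across two linked tables. The table names, column names, and values are invented for illustration; they are not the warehouse’s actual schema.

import sqlite3

# Toy version of two warehouse tables linked by a shared barcode.
# All table names, column names, and values here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tissue_inventory (
    barcode TEXT PRIMARY KEY,   -- assigned at the surgical pathology bench
    organ TEXT,
    grams_banked REAL
);
CREATE TABLE clinical_records (
    barcode TEXT REFERENCES tissue_inventory(barcode),
    sex TEXT,
    age INTEGER,
    diagnosis TEXT,             -- controlled-vocabulary term
    node_status TEXT,
    tumor_size_cm REAL
);
INSERT INTO tissue_inventory VALUES ('VU-0001', 'breast', 0.8);
INSERT INTO clinical_records VALUES ('VU-0001', 'F', 31, 'breast carcinoma', 'negative', 1.4);
""")

# Edgerton's example: women aged 29 to 35 with node-negative breast
# cancer between one and two centimeters -- one query spans both tables.
rows = conn.execute("""
    SELECT t.barcode, t.grams_banked
    FROM tissue_inventory t
    JOIN clinical_records c ON c.barcode = t.barcode
    WHERE c.sex = 'F'
      AND c.age BETWEEN 29 AND 35
      AND c.diagnosis = 'breast carcinoma'
      AND c.node_status = 'negative'
      AND c.tumor_size_cm BETWEEN 1.0 AND 2.0
""").fetchall()
print(rows)   # -> [('VU-0001', 0.8)]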

The team is using research in the lung cancer SPORE (Specialized Program of Research Excellence) as a launchpad for developing the first set of linked databases in the warehouse. The effort has involved determining what clinical information needs to be included and developing a standard descriptive nomenclature. This is important, Edgerton says, because doctors may describe the same thing differently, for example, “metastatic carcinoma to the lymph node” versus “lymph node with metastatic carcinoma present.”

Using a controlled vocabulary to construct the database prevents investigators from having to think of every synonym when they are performing a database search. The challenge, Edgerton says, is defining a vocabulary that allows easy searching and that is simultaneously flexible enough to adequately describe the tissue pathology.
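A minimal sketch of how a controlled vocabulary removes the synonym problem, assuming a hand-built mapping table; the terms and function below are illustrative, not the resource’s actual vocabulary:

# Hypothetical controlled-vocabulary normalizer: every synonym is
# mapped to a single canonical term at data-entry time, so searches
# only ever need the canonical form.
CANONICAL = {
    "metastatic carcinoma to the lymph node": "lymph node metastasis",
    "lymph node with metastatic carcinoma present": "lymph node metastasis",
}

def normalize(phrase: str) -> str:
    key = " ".join(phrase.lower().split())  # case- and whitespace-insensitive
    return CANONICAL.get(key, key)          # fall back to the cleaned phrase

assert normalize("Metastatic carcinoma to the  lymph node") == \
       normalize("lymph node with metastatic carcinoma present")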

Standards for clinical descriptors and for the storage of gene expression microarray data are being developed internationally, Edgerton says. “We work with the experimentalists to maintain expertise in these standards and comply with them, in the way that we structure our databases and what terms we use.”

In building the clinical database, and making provisions for access to it, Edgerton and her team also are responsible for maintaining the security and confidentiality of patient information. The lung cancer clinical database is nearing completion, Edgerton says, and it will be used as a template for other organ systems.

In the future, Edgerton would like to add histological images to the clinical database as one of the tissue characteristics. “Search and retrieval methods based on image features is an active area of research,” Edgerton says. “Then, as opposed to simply being able to search based on demographics, stage, histopathological name, and so on, we might actually also be able to search based on image characteristics. That would be really exciting.”

Edgerton’s ultimate vision for the linked databases is to use them for data mining expeditions. She is experimenting with various algorithms to analyze the data “in a fashion that combines something we know clinically with the molecular profile to get at cause-and-effect.” Hidden in the hundreds and hundreds of datasets that populate the databases, Edgerton believes, are answers about the molecular mechanisms underlying disease. It’s up to bioinformatics to find them.

–Leigh MacMillan

Mary Edgerton is spearheading an effort to build a warehouse of linked databases. She hopes that linking clinical and molecular information will yield answers about the molecular mechanisms underlying disease. (Photo: Dana Johnson)

Genetics group tackles range of computing needs

With its 11 computer programmers, the Bioinformatics Core is one of the largest bioinformatics groups on the Medical Center campus. The operation spans from the core’s home base on the fifth floor of Light Hall to the Information Technology Services networking center on Peabody campus, where the VAMPIRE supercomputer is housed.

The Bioinformatics Core was organized by Jonathan Haines as part of the Program in Human Genetics, and is under the scientific direction of Jason Moore. The efforts of the facility are focused primarily on genetics-related projects, though not exclusively.

“We’ve taken on a range of projects, from molecular genetics to population genetics to genomics to microarray work,” Moore says. “And we’re getting into proteomics now, too.”

As part of the services offered, the staff programmers provide database design services and support, Web page design and support services, Web interfaces to the databases, and software design in various programming languages. Scheduling of programming and design services can be arranged with Janey Wang, the core’s manager.

In addition, the core manages Vanderbilt’s subscription to the Celera database, and supports a wide variety of bioinformatics software packages, such as the Wisconsin GCG Package for nucleic acid and protein sequence analysis. Charles Alexander, who is responsible for this area of the core’s services, conducts training workshops and is available for one-on-one training in the use of Celera and the various software products.

According to Moore, it’s a simple matter for the bioinformatics core to coordinate services with other core facilities, such as the Microarray Shared Resource. While the microarray core is responsible for processing the raw data generated there and formatting it for the researcher, the bioinformatics core is able to generate the databases that facilitate analysis of the data.

For example, Moore and Shawn Levy, director of the Microarray Shared Resource, are both principal investigators on a Program Project Grant, directed by Jacek Hawiger, whose goal is to identify markers of inflammation in the blood.

“The group is trying to identify genes that are turned on and off during the inflammatory process,” Moore says. “We’re working together to create a seamless database and analysis system for that project, which will require a lot of microarray work.”

Moore and his programmers have devised a number of computational methods for analyzing microarray data and genetic epidemiological data that wouldn’t be possible, he says, without the use of a supercomputer such as VAMPIRE. With its 110 linked CPUs, VAMPIRE, which stands for Vanderbilt Multiple Processor Integrated Research Engine, provides the boosted speed and power needed to perform sophisticated computations in a reasonable time frame.

One example of VAMPIRE’s power can be seen in its application to the proteomics work of Richard Caprioli, whose lab is investigating whether mass spectrometry can be used to correlate tumor proteins with tumor grade. Comparing mass spectra of different tissue samples to identify common proteins can be problematic, however, since the protein peaks sometimes shift in a non-linear fashion. Moore devised an algorithm that corrects for such shifts.

In the graphic depiction of results from this algorithm, identical proteins appear as dots within vertical bins, the color of each dot reflecting the relative abundance of that protein. Variation in mass among the proteins is evident in the “wobble” seen in the stacked tower of dots.

The algorithm must simultaneously figure out the optimal size of the bins and which proteins belong in each. Analyzing 500 tissue samples using this algorithm, as in one of Caprioli’s brain tumor studies, can be computationally intensive, to say the least. Even with VAMPIRE, it takes a couple of days to run the program; without VAMPIRE, Moore says, it would be impossible.
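As a toy illustration of the binning idea (not Moore’s actual algorithm), the sketch below groups peaks whenever neighboring masses fall within a fixed tolerance. The function name, tolerance, and masses are invented, and real alignment must also correct for the non-linear shifts described above.

from itertools import chain

# Toy mass-peak binning: pool the peak masses from all spectra, sort
# them, and start a new bin wherever the gap between neighboring
# masses exceeds a tolerance. Each resulting bin collects the
# "wobbling" observations of one putative protein.
def bin_peaks(spectra, tolerance=0.5):
    masses = sorted(chain.from_iterable(spectra))
    bins, current = [], [masses[0]]
    for m in masses[1:]:
        if m - current[-1] <= tolerance:
            current.append(m)        # same putative protein
        else:
            bins.append(current)     # close the bin, start another
            current = [m]
    bins.append(current)
    return bins

# Three spectra whose shared peaks wobble slightly in mass.
spectra = [[1001.2, 2503.8], [1001.5, 2504.1], [1000.9, 2503.6]]
print(bin_peaks(spectra))            # two bins, one per common protein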

Those responsible for VAMPIRE, including Alan Tackett, who has direct oversight of the system, are anxious to give it more teeth. Moore envisions a resource large enough to serve the entire University. Moore, along with Tackett and Paul Sheldon, a professor in the department of Physics, has been marinating the idea of creating a scientific computing center. Attempts to find the funding needed for this leap include an application to the NIH’s High End Instrumentation Program, an NSF proposal, and a proposal to the University’s Academic Venture Capital Fund.

“I think VAMPIRE will be a tremendous asset for anybody doing computational studies,” Moore says, “not just in bioinformatics, but across the university. The system will open doors for a lot of people to do things they never dreamed they could do.”

–Mary Beth Gardiner

Jason Moore directs the Bioinformatics Core of the Program in Human Genetics. The core’s 11 computer programmers provide database design services and support, Web interfacing, and software design in various programming languages. (Photo: Dana Johnson)

ITS assures computing infrastructure

As director of Vanderbilt’s Information Technology Services, Glen Miller sees his contribution to the Medical Center’s bioinformatics program as part and parcel of the ITS core mission: to support the computer infrastructure of the entire University.

ITS makes itself useful to the Medical Center in several ways, Miller says. Highest on the list of priorities is providing the best possible data network facilities between research partners and facilities within and outside of Vanderbilt.

“I think that working with Network Computing Services, which provides the network backbone in the Medical Center buildings, we are on a very sound path of building the Medical Center and the University networks as one network in structure,” he says. “We want to try to eliminate the barriers between the two and to make it easier for folks to cross disciplines in their research and to share resources. It’s taken a while, but I think that’s working well.”

In addition, the group is intent on providing scientific computing and storage facilities by nurturing collaborative initiatives among researchers. Staffers Mary Dietrich and Alan Tackett are working on finding new ways to fund expansion of VAMPIRE, the high-speed, multi-processor computing system housed at ITS. Dietrich is the ITS Academic Liaison and Tackett is director of VAMPIRE. Due to the “explosive” needs of researchers for storing things like MRI data, Miller says, the staff is working on ways to dramatically beef up storage capability, including the formation of coalitions for purchasing tape storage.

Miller believes ITS also serves the more general role of being an “aggregate point” for researchers with unmet needs. For example, the group is working on ways to improve online delivery and version tracking of software, including variations of the UNIX operating system.

The issues of concern with UNIX-type operating systems – including LINUX, which is used widely on campus – are those of security. According to Miller, users need to understand and protect themselves from the inherent security risks, and those risks vary as software is revised.

“It’s not necessarily that anyone else in the outside world wants access to their information,” he explains. “It’s just that Vanderbilt has this great big, fat pipeline to the Internet, and when we have these powerful computers sitting here, it’s a wonderful target for people to take over some of that resource for their own purposes.”

For more information or to contact ITS, visit http://www.vanderbilt.edu/its/about.php.

–Mary Beth Gardiner

Information Technology Services director Glen Miller and his team are working to eliminate barriers between the Medical Center and University computing networks. (Photo: Dana Johnson)

Bioinformatics permeates microarray core

About 1200 pages. That’s how many pages of data the average microarray experiment produces. “You can imagine the inefficiency that would be involved if you tried to read through one of these data files without using a computer,” says Shawn Levy, director of the Vanderbilt Microarray Shared Resource.

Using a computer – using bioinformatics tools – to analyze microarray data is not just efficient, it is essential. “Bioinformatics reduces the dimensionality of the data down to something comprehensible,” Levy says. “That’s a lot of what bioinformatics does; it enables people to see and recognize patterns that they can’t see without a computer.”

Like the pattern of gene expression changes in a tumor sample compared to a normal tissue sample.

Microarrays have made their mark as a tool for examining gene expression, particularly in the fields of cancer and autoimmune diseases, Levy says. The tens of thousands of DNA “spots” on a single array allow investigators to probe entire genomes. “Assuming a mammalian gene number of somewhere in the range of 30,000 to 50,000,” Levy says, “microarrays can offer a true genetic snapshot of a cell, or of an organ, or of a patient biopsy.”

With single experiments that result in over a million collected data points, the need for sophisticated bioinformatic analysis tools is obvious. These tools range from pre-packaged software – with modifications – to custom-tailored programs.

In the world of microarrays, bioinformatics impacts more than data analysis. “We have a very large need for bioinformatics to make the analysis possible,” Levy says.

It starts with the annotation of the libraries of clones used to make the microarrays. “What gene are you looking at, what is its sequence, what protein does it produce, what is its function, where is it expressed in the cell...you can see how the information explodes,” Levy explains.

And then, he says, there’s the issue of keeping track of clones that are in 96-well plates for PCR before they get spotted onto microscope slides at a density of 10,000 spots per slide. Which spot is the DNA from well B7 on plate 4?
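That bookkeeping question can be answered mechanically once a printing order is fixed. A minimal sketch, assuming a simple row-major layout that is invented for illustration and is not the core’s actual scheme:

# Hypothetical clone-tracking helper: map a (plate, well) pair to the
# index of the spot printed on the slide, assuming clones are spotted
# plate by plate in row-major well order (A1, A2, ..., H12).
ROWS, COLS = 8, 12                          # standard 96-well plate geometry

def spot_index(plate: int, well: str) -> int:
    row = ord(well[0].upper()) - ord("A")   # 'B' -> 1
    col = int(well[1:]) - 1                 # '7' -> 6
    return (plate - 1) * ROWS * COLS + row * COLS + col

# Which spot is the DNA from well B7 on plate 4?
print(spot_index(4, "B7"))                  # -> 306 under this layout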

These considerations don’t begin to take into account the bioinformatics associated with the samples that come in. What is the source of the RNA? How was it isolated? How was it labeled? What technologies were used to hybridize, wash and scan the microarrays?

The microarray field is in the process of developing standards for all of these issues, Levy says, which is important because the technology used to arrive at a final answer impacts that answer. It’s not like DNA sequencing, where there is a definitive “right” answer, Levy points out. “Microarray technology is not at the point yet where you could send the same RNA sample to five different labs and get exactly the same answer, and this puts a lot more weight on the shoulders of the informatics that support it.”

The Vanderbilt Microarray Shared Resource is focusing its bioinformatics development efforts on tools for data management and access – tools that create an “electronic lab notebook” of sorts, Levy says.

He plans for the tools to offer interactive ways for users to keep track of what samples were run, which comparisons were made, which microarray was used, and what genes are on the microarray. “We’re trying to create the tools that pave the road to analysis,” he says. And following the analysis, when there is a putative answer, the tools Levy and colleagues are developing will offer high throughput ways for users to understand the genes that have been identified.

Other Vanderbilt investigators, including groups in the Program in Human Genetics (Jason Moore) and in Biostatistics (Yu Shyr), are developing new bioinformatics tools for microarray data analysis. “They’re really on the cutting edge of microarray analysis tools,” Levy says. “We do more of the bioinformatic bookkeeping.”

In addition to creating new bioinformatics tools, Levy and colleagues in the microarray core are working to improve microarray technology and to develop new applications. Particularly promising, Levy says, are techniques that allow investigators to use a single cell’s worth of RNA for gene expression profiling. “This opens up the clinical biopsy arena,” Levy says. “With properly handled tissue from a needle biopsy, we can produce a gene expression report.”

Microarrays are also being applied to efforts to detect chromosomal abnormalities, like changes in DNA copy number, and to sequence DNA and detect single nucleotide polymorphisms (SNPs). The fluorescent technologies used for microarray studies can be adapted to applications like in situ hybridizations that traditionally relied on radioactivity, Levy says.

–Leigh MacMillan

Shawn Levy directs the Microarray Shared Resource, which utilizes bioinformatics not only for microarray analysis, but also to make that analysis possible. (Photo: Dana Johnson)


Servers, database improve image access

During bursts of dynamic imaging, the microscopes in Vanderbilt’s Cell Imaging Shared Resource can generate data at a rate of more than 240 megabytes per minute, says David Piston, the core’s scientific director. That’s about one CD-ROM’s worth of images every three minutes, he says.

Where do you store all of these data, and how do you access and analyze the images efficiently? These are the kinds of questions being addressed by imaging bioinformatics, Piston says. He was recently awarded a Vanderbilt Discovery Grant to implement a plan for high volume image data storage, management, and processing.

Piston’s plan involves creating a centralized repository for storing images and structuring that repository so that access is efficient. “Simply having a large amount of disk space doesn’t guarantee access to it in a timely fashion,” he says.

Traditionally, data have been stored on individual disk servers. The problem with this scheme, Piston says, is that an individual – during large file transfers – can monopolize a single server and the network, leaving other disk servers sitting idle. To overcome this problem, Piston will generate a parallel file system – essentially a group of servers transferring data in parallel, instead of independently.

To manage the stored images for analysis and sharing with other investigators, Piston will implement a database scheme developed by the Open Microscopy Environment (OME, http://www.openmicroscopy.org). OME is an open source software project to develop a database-driven system for quantitative analysis of biological images, Piston says. It is a collaborative effort among academic and industrial labs that was started in 2000. The first versions of OME programs – standardized file formats for image data and database schema – are nearing completion.

Having an image database will allow researchers to systematically deposit data and selectively extract images from simple queries to the database. Keyword searches or other cross-referencing techniques will provide investigators with the opportunity to compare images from different laboratories and experiments, Piston says.

Piston has teamed up with Alan Tackett at Information Technology Services to install, maintain and back up the parallel servers. “In addition to using existing expertise at ITS, this arrangement will allow much wider access to these services once we get them established,” Piston says. The server and database system will benefit many different groups, including users of the Cell Imaging Shared Resource, the In Vivo Imaging Center, which Piston also directs, and the newly created Vanderbilt Institute of Imaging Science under the leadership of John Gore.

“We expect this server and database system to greatly advance image informatics and to accelerate progress in biological and biomedical research at Vanderbilt,” Piston says.

In addition to providing user-friendly storage and access to image data, Piston expects the new system to speed data analysis. The OME database will interface with both existing commercial image processing software and with new analysis routines.

Piston and colleagues are developing one such new routine for specialized particle analysis. They are using so-called deformable models to track moving subcellular particles – such as mitochondria and insulin granules, which they image to study glucose-stimulated insulin secretion – over the time course of an experiment. The new semi-automatic and automatic methods will allow analysis, processing, and visualization of large three- and four-dimensional datasets.

“The server and database system will make these kinds of analyses possible,” Piston says.

–Leigh MacMillan

A Vanderbilt Discovery Grant is funding David Piston’s plan for high volume image data storage, management and processing. (Photo: Dana Johnson)

Resource aids structure analysis, modeling

Providing support for the specialized needs of the growing population of structural biologists on campus, the Computational Biology Resource is one of four core technological facets of Vanderbilt’s Center for Structural Biology. The resource provides outreach to scientists in other disciplines, as well, on projects that cross over into the structural biology field. In fact, you could say that the mission of the resource, in general, is to promote thinking on a molecular level, says director Jarrod Smith.

Smith and Walter Chazin, director of the Center, have spent the past two years building the Computational Biology Resource. Though the resource resembles a traditional core facility, it differs in a fundamental way. The facility provides the resources – the hardware, software, and expertise to use them – but the investigator needing the services provides the manpower to come and actually do the work.

The resource assists with traditional bioinformatics needs, such as deciding how to use information once it is collected. But the bulk of what they do, according to Smith, falls more into one of the “fringe definitions” of bioinformatics.

“Our piece of bioinformatics in structural biology would be determining, analyzing, and modeling biomolecular structure,” he says.

To that end, the resource boasts an impressive collection of high-end equipment.

“Currently, we’ve got a 64-processor Silicon Graphics Origin supercomputer, as well as a 32-processor Linux cluster, for a total of 96 CPUs dedicated to running biomolecular simulations, biomolecular analysis tools, and ab initio calculations for small molecules,” Smith says.

Even though the VAMPIRE computer cluster has more CPUs, the bandwidth on the Origin’s tightly coupled parallel system is greater, and its latency – the amount of time it takes for a packet of information to travel between two processors – is very low. These qualities are critical to many algorithms used in computational biology.

“That’s what makes a supercomputer to us,” Smith says. “It’s not only how fast the processors are, it’s how tightly coupled they are and how you can use the thing as a whole. The Origin has 32 gigabytes of shared memory; all the processors see that memory and they all see it with the same bandwidth.”

The facility also offers what Smith calls a “visualization lab.” At this time, there are four high-end Silicon Graphics Octane workstations, specifically designed for real-time, 3-D graphics rendering.

“This is where people go to bring their molecules up into the computer, rotate them around, and interact with them,” he says. “We also have the capability to use stereoscopic glasses, so you can actually see the object in three dimensions. That’s an important tool for researchers doing modeling and drug design, or for X-ray crystallographers needing to put atoms into electron density clouds.”

The software packages available through the resource are listed on their Web page (http://structbio.vanderbilt.edu/comp/), though the list grows faster than they are able to update the site. It’s a good idea to ask if you don’t see the one you’re interested in, Smith says. Each package comes with instructions on how to install and begin, plus links to manuals and customized user hints. For some of the packages, Smith has written tutorials for more advanced interactions with the software.

As far as infrastructure goes, the resource has several servers that handle Web, database, software, and file serving tasks, and they have just installed a two-terabyte RAID (redundant array of inexpensive disks) array that dramatically increases storage capacity.

“The RAID device does its work in the background,” says Smith. “The advantage of that is you can get massive amounts of storage in one easy-to-maintain system. And there is a redundancy built in, so you could lose a disk and nobody would know except us.”

The two terabytes of space – that’s 2000 gigabytes – are necessary to handle the data collected at the NMR Center, the simulations that are run on the supercomputers, and the data brought back by the X-ray crystallographers from the synchrotron.

Smith and his staff, which currently consists of two system administrators, are available as consultants to share their expertise in how to get started and how to solve particular problems. Ultimately, they plan to offer workshops in addition to one-on-one interactions.

“The goal is that over time the required knowledge and skills to take advantage of these technologies will start to percolate through the community,” Smith says.

Smith got his Ph.D. at the Scripps Research Institute, jointly advised by Chazin, an NMR spectroscopist, and David Case, a computational biologist.

“My background is in NMR structure determination, but most of the work I did was in the computational methods for resolving the structures,” he says. “This job really takes advantage of my skills. Plus, I enjoy helping people solve problems, and that’s perhaps the most important aspect of the job.”

–Mary Beth Gardiner

The Computational Biology Resource of the Center for Structural Biology boasts a 64-processor Silicon Graphics Origin supercomputer. The system is key to determining, analyzing, and modeling biomolecular structure, says Jarrod Smith, the resource’s director. (Photo: Dana Johnson)

Bioinformatics yields protein answers

Bioinformatics is the key final step in assuring that the proteomics shared resource can do what it does – identify proteins.

Those proteins might be the ones that change in cells treated with a new chemotherapy drug, or they might be the ones associated with a large complex. Whatever the proteins, the proteomics laboratory draws on several different methods to separate them, and then uses mass spectrometry and bioinformatics to identify them.

“We generate the mass spectrometry data and then rely on the bioinformatics field to get our answers,” says David Friedman, director of the proteomics laboratory, which was established as a component of the Mass Spectrometry Research Center under the leadership of Richard Caprioli.

A common approach for identifying proteins uses 2D-gel electrophoresis to separate mixtures of proteins based on physical attributes – isoelectric point and molecular weight. Individual proteins – spots on the 2D-gel – can be cut out of the gel, digested into peptides, and analyzed by mass spectrometry.

This technology is most often directed to finding the proteins that are changing, for example under different experimental conditions, or in diseased tissue versus normal tissue. For higher throughput, the core takes advantage of fluorescent dye labels and laser imaging. “This is another way we use bioinformatics,” Friedman says. “We can directly compare two or three samples, labeled with different dyes and separated at the same time. The computer algorithm will tell us who’s changing, who’s not changing, and by how much. It’s very powerful.”

The core’s automated system allows users to select spots for automatic sampling, digestion, and mass spectrometry analysis. Each protein has a “characteristic signature of tryptic peptides,” Friedman says. Bioinformatic search algorithms compare an experimental “signature” to a theoretical digest of every protein in a selected database and return a match, if one exists in the database.

“Our approach is completely dependent on the protein being in the database,” Friedman says. “We rely on the databases being properly annotated, maintained, and continuously updated.” The core makes use of databases containing complete annotated proteins as well as those for expressed sequence tags (ESTs).
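The matching step can be sketched as a peptide-mass-fingerprint comparison. The sketch below is illustrative only – the protein names, masses, and tolerance are invented, and the core’s actual search software is far more sophisticated:

# Toy peptide-mass-fingerprint search: score each database protein by
# how many of its theoretical tryptic-peptide masses match the
# experimental "signature" within a tolerance, and report the best hit.
DATABASE = {
    "protein_A": [842.5, 1045.6, 1479.8, 2211.1],
    "protein_B": [927.3, 1479.8, 1630.9, 2465.2],
}

def best_match(signature, database, tolerance=0.3):
    def score(theoretical):
        return sum(
            any(abs(obs - theo) <= tolerance for theo in theoretical)
            for obs in signature
        )
    name, peptides = max(database.items(), key=lambda kv: score(kv[1]))
    return name if score(peptides) > 0 else None   # None: not in database

observed = [842.4, 1045.7, 2211.0]      # experimental peptide masses
print(best_match(observed, DATABASE))   # -> protein_A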

The search algorithms for matching experimental mass spectra are either commercially available or free, Friedman says. Like the databases, these algorithms are regularly updated and improved.

The algorithms have to be especially powerful to conduct searches on data from complex mixtures of proteins. Andrew Link, assistant professor of Microbiology & Immunology, and collaborators developed a technology and analysis algorithm called SEQUEST to directly analyze and identify all of the proteins present in a purified protein complex. To speed the analysis, Link built a 20-node parallel processor. The parallel processor, Friedman says, makes the experiment possible – reducing the database search to hours as opposed to many days.

Improvements to processing speed will likely be the extent of bioinformatics development efforts for the proteomics shared resource, Friedman says.

“We’re advancing the field of proteomics by improving the technologies for protein separation and detection and by developing new technologies,” Friedman says. “We rely on the expertise of bioinformaticians to keep database searching state-of-the-art.”

–Leigh MacMillan

David Friedman directs the proteomics laboratory, part of the Mass Spectrometry Research Center. Mass spectrometry and bioinformatics are used to identify proteins separated by 2D-gel electrophoresis and other methods. (Photo: Dana Johnson)

Algorithms unravel cause-and-effect networks

The changes in computational capabilities that have occurred in the four short years since Constantin Aliferis received his Ph.D. in Intelligent Systems/Medical Informatics make him shake his head in disbelief.

“When I was a graduate student, it was inconceivable that someone would be thinking of routinely creating diagnostic, prognostic, and treatment models with thousands of variables,” says Aliferis, who earned an M.D. before embarking on graduate studies. “Everyone would have laughed at you if you suggested it. But there have been tremendous strides in the technology and science of machine learning that make it possible to do this now.”

This is true, he says, for classification models – models that group patients into disease groups or predict response to treatment, for example. It is not possible to create models that reveal cause-and-effect relationships among all of the measured variables for datasets with thousands of variables. Not yet, anyway. “Causal discovery” – determining these cause-and-effect relationships – is the realm where Aliferis and colleagues are concentrating their efforts.

Aliferis directs the Discovery Systems Laboratory, a unit of the department of Biomedical Informatics that is dedicated to creating and applying new algorithms and systems for biodiscovery. The Discovery Systems Lab includes three faculty members from the department (Aliferis, Ioannis Tsamardinos, Erik Boczko), five collaborating faculty members, and a strong team of staff members and students. Aliferis credits the laboratory’s existence to the vision of leaders like Lee Limbird, Mark Magnuson, and Bill Stead.

The challenge in developing algorithms for causal discovery is mostly the sheer size of the networks to be solved, Aliferis says. The sequencing of the human genome and high throughput technologies like gene expression microarrays and mass spectrometry/proteomics make datasets with thousands of variables commonplace. But deducing a detailed network of cause-and-effect interactions among the genes or proteins in these datasets “is known to be intractable in the worst case,” he says.

Instead of studying the whole network – the “global network” – the Discovery Systems Laboratory team has focused on the “local causal neighborhood” around a variable of interest. “Instead of trying to learn how 15,000 genes interact, each of them with every other gene, for example, we concentrate on specific target genes and try to find the minimum set of immediate causes and effects of those genes,” Aliferis says.
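As a rough illustration of the local idea (not the lab’s actual algorithm), the sketch below grows a candidate neighborhood for one target variable using conditional-independence tests approximated with partial correlations. The threshold, test, and data are all invented for illustration.

import numpy as np

# Rough sketch of local causal-neighborhood discovery for one target:
# keep only variables that stay correlated with the target after
# conditioning on the variables kept so far (partial correlation via
# least-squares residuals).
def residuals(y, Z):
    if Z.shape[1] == 0:
        return y - y.mean()
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

def local_neighborhood(X, target, threshold=0.2):
    keep = []
    for v in range(X.shape[1]):
        if v == target:
            continue
        Z = X[:, keep] if keep else np.empty((X.shape[0], 0))
        r = np.corrcoef(residuals(X[:, v], Z), residuals(X[:, target], Z))[0, 1]
        if abs(r) > threshold:
            keep.append(v)
    return keep

rng = np.random.default_rng(0)
cause = rng.normal(size=500)
target = cause + 0.5 * rng.normal(size=500)   # direct cause of target
noise = rng.normal(size=500)                  # unrelated variable
X = np.column_stack([cause, target, noise])
print(local_neighborhood(X, target=1))        # -> [0]: only the true cause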

The investigators are having success with this local approach. And they hope to apply it to larger networks. “If you can learn locally what’s going on, as a next step, why can’t you go back and piece everything together to create as complete a picture of the full network as possible?” Aliferis asks. “It’s a divide and conquer approach that we’re trying, spearheaded by Ioannis Tsamardinos.

“We are extremely excited by the fact that right now we’re the only lab we know of that can do such large-scale local discovery.”

In preliminary results, the Discovery Systems Laboratory’s algorithms were able to deduce local networks in a structural biology pharmacological dataset with 140,000 variables. The analysis took an hour and a half on a single desktop computer. Using the VAMPIRE supercomputer, the analysis takes a few minutes.

In another example, the algorithms were used to analyze lung cancer microarray data produced outside Vanderbilt. The models successfully distinguished between cancer and normal cells, between squamous carcinomas and adenocarcinomas, and between metastatic and non-metastatic adenocarcinomas. And the models revealed novel and interesting causal structure around genes known to be implicated in lung cancer, Aliferis says.

The limitation right now is having large enough sample sizes. A network with between four and eight genes around each target gene – the estimated connectivity of gene networks in eukaryotic cells – will require a sample size in the hundreds, Aliferis says. That means, for example, that if you are interested in a particular gene and its local network in squamous cell lung cancer, you would need to collect samples from a few hundred patients for microarray experiments. The investigators are looking forward to drawing on data generated by the various cancer SPORE (Specialized Program of Research Excellence) projects, which are expected to have large datasets.

“Up until now we’ve primarily been laying the methodological groundwork for the future,” Aliferis says. “As we start applying our methods and see how exciting the initial results are, we become more convinced that we will be able to use our algorithms to learn complex models of disease and that these models will have significant implications for both clinical care and biodiscovery.”

–Leigh MacMillan

Constantin Aliferis leads the Discovery Systems Laboratory, a unit of the department of Biomedical Informatics dedicated to creating new computer algorithms for analyzing large datasets. (Photo: Dana Johnson)

Wanted: bioinformatic talent

For the past year or so, scientific leaders across Vanderbilt have been methodically trying to determine how to give Vanderbilt more of a presence in bioinformatics. The first step, ironically enough, was to define bioinformatics. Because the term is used so broadly, it seemed important to get a more global perspective.

“We spent a whole year bringing in a group of distinguished scientists from across the world, all of whom have recognition in this area but in very different aspects of it,” says Heidi Hamm, “to come and spend a couple of days with us, to give a seminar and to let us pick their brains. We asked them what direction they thought Vanderbilt should go, and we asked for names of people they thought we should recruit.”

Hamm was the chairperson of the committee, called the Transinstitutional Bioinformatics Initiative Recruitment Team, or T-BIRT, charged with exploring options and making recommendations. The team consisted of faculty members representing each of the major areas on campus that make use of bioinformatics tools, including proteomics, computational and structural biology, mathematical modeling, biological sciences, imaging, computer science, and medical bioinformatics.

Other T-BIRT members included Constantin Aliferis, Mary Edgerton, Jason Moore, Richard Caprioli, Terry Lybrand, Walter Chazin, Martin Egli, Jim Staros, Emmanuele DiBenedetto, David Piston, and Benoit Dawant.

“Together we came up with the names of people we’d like to come and visit,” Hamm says. “By the end of the year, we had a really nice body of shared experience of what we want to create here at Vanderbilt. It took a while for us all to get on the same page, but it was a lot of fun and we learned a lot during the year. After one more university-wide tally of existing computational and informatics talent, we will be ready to launch recruitment of key, transformative bioinformaticians.”

One individual who came for a visit and was successfully recruited is Erik Boczko, who started at Vanderbilt on July 1. Boczko, who holds doctoral degrees in both biophysics and mathematics, is an expert in the area of machine learning, a subfield of artificial intelligence. With the algorithms he builds, he attempts to cluster different subsets of genes identified from massive arrays of gene profiling data based on some variable, such as a signaling pathway or a specific function. Boczko is working with Constantin Aliferis in the department of Biomedical Informatics.

Many of the consultants advised enriching computer science and applied mathematics on campus, as well as adding more biologically oriented computational scientists, in order to build a truly significant and competitive bioinformatics program with national stature. The visitors also emphasized finding the right leaders to recruit to Vanderbilt.

Finding the right people will be a challenge because a lot of institutions are trying to do this at the same time, Hamm says. “It’s so competitive that the very good senior leaders are being recruited by multiple places. Young people might have their own area of specialization, but the challenge would be getting the critical mass of a group of young bioinformaticians that work well together.

“What we’re trying to say is we want to put Vanderbilt into the next grand phase of biological research,” she continues. “We want to bring in people who can help us inject more quantitative approaches into biology, but we want to avoid the pitfalls of getting the wrong people. It’s a very fast-moving field, so you want to have people who are at the forefront, who are thinking about what the next approaches are and how to go about getting there.”

–Mary Beth Gardiner

Merging signaling and math

An ambitious collaborative program is emerging with the intent to mathematically model signal transduction and the way networks of signaling pathways impact cellular phenomena such as cell shape change and cell migration. The effort to bring quantitative approaches to biological problems is being led by a group of mathematicians headed by Emmanuele DiBenedetto, and combines the expertise of groups as diverse as mathematicians, computer scientists, biological scientists, and bioinformaticians.

These collaborations may grow into a cross-disciplinary, multi-institutional center pending the outcome of a proposal recently submitted to the NIH by Heidi Hamm and DiBenedetto. The center, to be called the Vanderbilt-Meharry Center of Excellence in Modeling Complex Signaling Networks, would facilitate interaction among scientists from Vanderbilt University and Meharry Medical College.

The major challenge, Hamm says, would be to “seamlessly integrate experimental data with theoretical and systematic modeling.” Computational bioinformatic techniques would be used to integrate information from multiple sources, including experimental data such as gene expression data, protein-protein interaction data, genomic sequence data, subcellular localization data, and lipidomics data. These data would feed into and enrich mathematical models of the process, which would then be used to design targeted experiments for study of specific pathway components.

The long-term goals of the proposed center are:

• to investigate the spatio-temporal dynamics of signal transduction through a single pathway, and subsequently through multiple pathways simultaneously engaged

• to visualize the dynamics of diffusion of aqueous and lipidic second messengers

• to develop mathematical and computer models of signal transduction through various signaling pathways

• to develop theoretical and computer research tools to model these signaling systems and to cross-validate models and data

The center’s mission would also include a strong educational component, with the aim of producing a new generation of students equally versed in wet-bench experimentation, theoretical elaborations, and computer implementations. The educational effort is already well underway, with the formation of the Biomath Study Group and a number of collaborative teams on campus modeling diverse aspects of signaling.

–Mary Beth Gardiner

Peer Review
VUMC News & Public Affairs
CCC-3312 MCN (2390)
www.mc.vanderbilt.edu/peerreview

Address comments to the editor: [email protected]

Editor: Leigh MacMillan
Contributing writer: Mary Beth Gardiner
Design and layout: Medical Art Group
Illustration: Matt Gore
Director of VUMC publications: Wayne Wood