



Petascale Computational Systems
Gordon Bell and Jim Gray, Microsoft Research
Alex Szalay, The Johns Hopkins University

A balanced cyberinfrastructure is necessary to meet growing data-intensive scientific needs.

WEB TECHNOLOGIES

Most scientific disciplines have long had both empirical and theoretical branches. In the past 50 years, many disciplines—ranging from physics to ecology to linguistics—have also grown a third, computational branch. Computational science emerged from the inability to find closed-form solutions for complex mathematical models. Computers make it possible to simulate such models.

In recent years, computational science has evolved to include information management to deal with the flood of data resulting from

• new scientific instruments that, driven by Moore's law, double their data output every year or so;
• the ability to economically store petabytes of data online; and
• the Internet and Grid, which make archived data accessible to anyone, anywhere.

Acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Parallel computers can solve these problems within minutes or hours.

However, the computational efforts of most statistical analysis and data-mining algorithms increase superlinearly. Many tasks involve computing statistics among sets of data points in some metric space. For example, pair algorithms on N points scale as N². If the data increases a thousandfold, the work and time can grow by a factor of a million. Many clustering algorithms scale even worse and are infeasible for terabyte-scale datasets.
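To make the superlinear growth concrete, here is a minimal Python sketch (not from the column; the function name and sample sizes are illustrative) that counts the work of a naive all-pairs statistic and shows how it grows when the dataset grows a thousandfold.

```python
# Minimal sketch: work of a naive all-pairs (N^2) statistic versus dataset size.
# The function name and sample sizes are illustrative, not from the column.

def pairwise_work(n_points: int) -> int:
    """Number of point pairs a naive pair algorithm must examine."""
    return n_points * (n_points - 1) // 2

small = pairwise_work(1_000)          # ~5 x 10^5 pairs
large = pairwise_work(1_000_000)      # ~5 x 10^11 pairs

# A 1,000x larger dataset costs ~1,000,000x more work, as the column argues.
print(f"work ratio: {large / small:,.0f}x for 1,000x more data")
```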

DATA-CENTRIC COMPUTATION

Next-generation computational systems will generate and analyze petascale information stores. For example, the BaBar detector at the Stanford Linear Accelerator Center currently processes and reprocesses a Pbyte of event data; about 60 percent of the system's hardware budget is for storage and IO bandwidth (www-db.cs.wisc.edu/cidr/papers/P06.pdf).

The Atlas (http://atlasexperiment.org) and CMS (www.cmsinfo.cern.ch) particle detection systems have requirements at least 100 times higher. The Large Synoptic Survey Telescope (www.lsst.org) has needs in the same range: petaoperations per second of processing and tens of Pbytes of storage.

BUILDING BALANCED SYSTEMS

System performance has been improving in line with Moore's law and will continue to do so as multicore processors replace single-processor chips and memory hierarchies evolve. Within five years, a simple, shared-memory multiprocessor will deliver about half a teraoperation per second.

Much of the effort in building Beowulf clusters and supercomputing centers has focused on CPU-intensive TOP500 systems (www.top500.org). Meanwhile, in most sciences the amount of both experimental and simulated data has been increasing even faster than Moore's law because instruments are getting much better and cheaper and storage costs have been decreasing dramatically.

Amdahl's laws

Four decades ago, Gene Amdahl coined many rules of thumb for computer architects:

• Parallelism—if a computation has a serial part S and a parallel component P, then the maximum speedup is (S + P)/S (worked out below).
• Balanced system—a system needs a bit of I/O per second per instruction per second.
• Memory—the Mbyte/MIPS ratio (α) in a balanced system is 1.
• Input/output—programs do one I/O per 50,000 instructions.
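As a worked form of the parallelism rule (the processor count N is added notation, not the column's): with serial time S, parallelizable time P, and N processors,

$$ \text{speedup}(N) \;=\; \frac{S + P}{S + P/N} \quad\longrightarrow\quad \frac{S + P}{S} \quad \text{as } N \to \infty, $$

so a job that is 10 percent serial (S = 0.1, P = 0.9) can never run more than (0.1 + 0.9)/0.1 = 10 times faster, however many processors it uses.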

Although α has increased and caused a slight reduction in I/O density, these “laws” are still generally valid (http://computer.org/proceedings/icde/0506/05060003abs.htm).

In addition, computer systems typically allocate a comparable budget for RAM and for disk storage, which is about 100 times less expensive per Tbyte than RAM. Table 1 captures this 1:100 RAM:disk capacity ratio, along with Amdahl's laws applied to various system powers.

Table 1. Amdahl's laws applied to various system powers.

Operations per second | RAM      | Disk I/O bytes/s | Disks for that bandwidth (at 100 Mbytes/s/disk) | Disk byte capacity (100x RAM) | Disks for that capacity (at 1 Tbyte/disk)
10^9                  | Gigabyte | 10^8             | 1                                               | 10^11                         | 1
10^12                 | Terabyte | 10^11            | 1,000                                           | 10^14                         | 100
10^15                 | Petabyte | 10^14            | 1,000,000                                       | 10^17                         | 100,000
10^18                 | Exabyte  | 10^17            | 1,000,000,000                                   | 10^20                         | 100,000,000
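As a cross-check on Table 1, here is a minimal Python sketch of my own that derives each row from Amdahl's balanced-system and memory rules plus the 1:100 RAM:disk ratio; the per-disk figures come from the table's own assumptions, and the table rounds results to powers of ten.

```python
# Minimal sketch deriving Table 1 from Amdahl's rules of thumb.
# Assumptions from the column: 1 byte of RAM and 1 bit (1/8 byte) of I/O per
# instruction per second, disk capacity 100x RAM, 100 Mbytes/s and 1 Tbyte per disk.

def balanced_system(ops_per_sec: float) -> dict:
    ram_bytes = ops_per_sec                 # memory law: 1 Mbyte per MIPS
    io_bytes_per_sec = ops_per_sec / 8      # balanced-system law: 1 bit of I/O per op/s
    disk_bytes = 100 * ram_bytes            # 1:100 RAM:disk capacity ratio
    return {
        "RAM (bytes)": ram_bytes,
        "I/O (bytes/s)": io_bytes_per_sec,
        "disks for bandwidth": io_bytes_per_sec / 100e6,  # 100 Mbytes/s per disk
        "disk capacity (bytes)": disk_bytes,
        "disks for capacity": disk_bytes / 1e12,          # 1 Tbyte per disk
    }

# Giga-, tera-, peta-, and exascale rows; values agree with Table 1 to within
# the table's order-of-magnitude rounding.
for power in (1e9, 1e12, 1e15, 1e18):
    print(f"{power:.0e} ops/s -> {balanced_system(power)}")
```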



Scaled to a petaoperations-per-second machine, Amdahl's laws imply the need for

• parallel software to use that processor array and a million disks in parallel;
• a Pbyte of RAM;
• 100 Tbytes/s of I/O bandwidth and an I/O fabric to support it;
• one million disk devices to deliver that bandwidth (at 100 Mbytes/s/disk); and
• 100,000 disks storing 100 Pbytes of data produced and consumed (at 1 Tbyte/disk), which is 10 times fewer than the number of disks required by the bandwidth requirement.

A million disks to support a petascale processor's IO needs is a daunting number. If a petascale system is configured with fewer disks, the processors will probably spend most of their time waiting for IO and memory—as is often the case today.

Petascale systems

There are precedents for such petascale distributed systems at Google, Yahoo!, AOL, and MSN (http://doi.acm.org/10.1145/945450). These systems have tens of thousands of processing nodes (approximating a petaoperation per second) and have about 100,000 locally attached disks to deliver the requisite bandwidth. Although they aren't commodity systems, they're in everyday use in many data centers.

Once empirical or simulation data is captured, huge computational resources are needed to analyze the data and visualize the results. Analysis tasks involving Pbytes of information require petascale storage and I/O bandwidth. In addition, the data must be reprocessed each time a new algorithm emerges or researchers pose a fundamental new question, generating even more I/O.

More importantly, to be useful, these databases require the ability to process information at a semantic level. The data must be curated with metadata, stored under a schema with a controlled vocabulary, and organized for quick and efficient temporal, spatial, and associative search. Petascale database systems will be a major part of any successful petascale computational facility and require substantial software investment.

DATA LOCALITY

Moving a byte of data across the Internet has a well-defined cost (http://research.microsoft.com/research/pubs/view.aspx?tr_id=655). Moving data to a remote computing facility is worthwhile only if performing the analysis requires more than 100,000 CPU cycles per byte of data. SETI@home, cryptography, and signal processing have such CPU-intensive profiles. However, most scientific tasks are more in line with Amdahl's laws and much more information-intensive, with CPU:IO ratios well below 10,000:1.

For less CPU-intensive tasks, colocating the computation with the data is preferable. In a data-intensive world where Pbytes are common, it's important to colocate computing power with the databases rather than moving the data across the Internet to a "free" CPU. If the data must be moved, it makes sense to store a copy at the destination for later reuse.

Managing this data movement and caching poses a substantial software challenge. Much current middleware assumes that data movement is free and discards copied data after use.
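The 100,000-cycles-per-byte rule of thumb can be written as a tiny decision helper. This is a minimal sketch of my own; the function name, the default threshold constant, and the example numbers are illustrative, not from the column.

```python
# Minimal sketch of the data-locality rule of thumb: ship the data to a remote
# CPU only if the analysis burns well over ~100,000 CPU cycles per byte moved.
# Names and example numbers are illustrative, not from the column.

CYCLES_PER_BYTE_BREAK_EVEN = 100_000

def worth_moving(total_cpu_cycles: float, bytes_moved: float,
                 threshold: float = CYCLES_PER_BYTE_BREAK_EVEN) -> bool:
    """True if the job is CPU-intensive enough to justify shipping the data."""
    return total_cpu_cycles / bytes_moved > threshold

# A SETI@home-style signal search, assumed here at ~10^7 cycles/byte: move the data.
print(worth_moving(total_cpu_cycles=1e19, bytes_moved=1e12))   # True

# A scan-heavy science query, assumed here at ~10^4 cycles/byte: keep the
# computation next to the data (and cache any copy that must be moved).
print(worth_moving(total_cpu_cycles=1e16, bytes_moved=1e12))   # False
```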

COMPUTATIONAL PROBLEM SIZES

Scientific computation task sizes depend on the product of many independent factors. Quantities formed as a product of independent random variables follow a lognormal distribution (E.W. Montroll and M.F. Shlesinger, "Maximum Entropy Formalism, Fractals, Scaling Phenomena, and 1/f Noise: A Tale of Tails," J. Statistical Physics, vol. 32, no. 2, 1983, pp. 209-230). As a result, the sizes of scientific computational problems obey a power law wherein the problem size and the number of such problems are inversely proportional—there are a small number of huge jobs and a huge number of small jobs.
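To see why products of independent factors produce such heavy tails, here is a minimal simulation sketch of my own; the factor count and ranges are arbitrary illustrative choices, not values from the column.

```python
# Minimal sketch: a task size formed as the product of many independent random
# factors is lognormal, so log(size) is roughly normal and the size tail is heavy.
# Factor counts and ranges are arbitrary illustrative choices.
import math
import random

random.seed(0)

def task_size(n_factors: int = 10) -> float:
    """Product of independent factors, e.g. resolution x timesteps x channels ..."""
    size = 1.0
    for _ in range(n_factors):
        size *= random.uniform(0.5, 20.0)   # one independent factor
    return size

sizes = [task_size() for _ in range(100_000)]
logs = [math.log10(s) for s in sizes]
mean_log = sum(logs) / len(logs)
median = sorted(sizes)[len(sizes) // 2]

print(f"mean of log10(size): {mean_log:.2f}")   # well-behaved in log space
print(f"median size: {median:,.0f}")
print(f"largest size: {max(sizes):,.0f}")       # a few huge jobs dwarf the typical one
```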

This situation is quite evident in US computing today. Thirty years ago, supercomputers were the mainstay of computational science. However, today's four-tier architecture—including tier-1 supercomputers, tier-2 regional centers, tier-3 departmental Beowulf clusters, and tier-4 single workstations—reflects the problem-size power law.

BALANCED CYBERINFRASTRUCTURE

What's the best allocation of cyberinfrastructure investments in light of Amdahl's laws, the problem-size power law, and the move to data-centric science? There certainly must be two high-end tier-1 international data centers serving each discipline that

• allow competition,
• encourage design diversity, and
• leapfrog each other every two years.

These tier-1 facilities should contain much of science's huge archival datasets and can only be built as a national or international priority.

What should US government agencies and industry do about the other tiers? They could make funding the tier-2 and tier-3 systems entirely the universities' responsibility, but that would be a mistake.

We believe that available resources should be allocated to benefit the broadest cross-section of the scientific community. Given the power-law distribution of problem sizes, this means that about half of funding agency resources should be spent on tier-1 centers at the petascale level and the other half dedicated to tier-2 and tier-3 centers on a cost-sharing basis, as Figure 1 shows.

Figure 1. Balanced cyberinfrastructure. Government should fund tier-1 centers and half of tier-2 and tier-3 centers on a cost-sharing basis. (The chart shows funding shares of 1/2 for tier 1 and 1/4 each for tiers 2 and 3, agency matching at the lower tiers, and relative numbers of centers of about 2, 20, and 200.)

One of the most data-intensive science projects to date, the CERN Large Hadron Collider (http://lhc.web.cern.ch/lhc), has adopted exactly such a multitiered architecture. The hierarchy of an increasing number of tier-2 and tier-3 analysis facilities provides impedance matching between the individual scientists and the huge tier-1 data archives. At the same time, the tier-2 and tier-3 nodes provide complete replication of tier-1 datasets.

EXAMPLE TIER-2 NODE

Most funding for tier-2 and tier-3 centers today splits costs between the federal government and the host institution. It's difficult for universities to obtain private donations for computing resources because they depreciate so quickly. Donors generally prefer to give money for buildings or endowed positions, which have long-term staying value. Government funding is therefore crucial for tier-2 and tier-3 centers in a cost-sharing arrangement with the hosting institution.

For example, The Johns Hopkins University received a National Science Foundation grant toward the computers for a tier-2 center it's building. JHU matched the NSF funds 125 percent to provide the hosting facility and staff to run it. Other institutions have had similar experiences setting up large computing facilities. The price of the computers themselves is less than half a center's total cost, and universities can meet those infrastructure demands only if federal agencies seed the tier-2 and tier-3 centers.

Placing all the financial resources at one end of the power-law distribution would create an unnatural infrastructure incapable of meeting the increasingly data-centric requirements of most midscale scientific experiments. At the system level, focusing on CPU harvesting would also create an imbalance. Funding agencies should support balanced systems, not just CPU farms, as well as petascale IO and networking. They should also allocate resources for a balanced tier-1 through tier-3 cyberinfrastructure. ■

Gordon Bell is a senior researcher in the Media Presence Research Group at Microsoft's Bay Area Research Center (BARC). Contact him at [email protected].

Jim Gray is a distinguished engineer in the Scalable Servers Research Group at BARC. Contact him at [email protected].

Alex Szalay is a professor in the Department of Physics and Astronomy at The Johns Hopkins University. Contact him at [email protected].

Editor: Simon S.Y. Shim, Department of Computer Engineering, San Jose State University; [email protected]

