What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Upload: jaden-miller

Post on 27-Mar-2015


TRANSCRIPT

Page 1: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

What I have been Doing

Peta Bumps

10k$ TB

Scaleable Computing

Sloan Digital Sky Survey

Page 2: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Sense of scale

• How fat is your pipe?

• Fattest pipe on MS campus is the WAN!

20 MBps: disk / ATM / OC3

90 MBps: PCI

94 MBps: coast to coast

300 MBps: OC48 = G2, or memcpy()

Page 3: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Redmond/Seattle, WA

San Francisco, CA

New York

Arlington, VA

5626 km, 10 hops

Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (high speed connectivity consortium), DARPA

Page 4: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

The Path DC -> SEA

C:\tracert -d 131.107.151.194

Tracing route to 131.107.151.194 over a maximum of 30 hops

 0                           ------- DELL 4400 Win2K WKS, Arlington Virginia, ISI, Alteon GbE
 1   16 ms  <10 ms  <10 ms  140.173.170.65   ------- Juniper M40 GbE, Arlington Virginia, ISI Interface ISIe
 2  <10 ms  <10 ms  <10 ms  205.171.40.61    ------- Cisco GSR OC48, Arlington Virginia, Qwest DC Edge
 3  <10 ms  <10 ms  <10 ms  205.171.24.85    ------- Cisco GSR OC48, Arlington Virginia, Qwest DC Core
 4  <10 ms  <10 ms   16 ms  205.171.5.233    ------- Cisco GSR OC48, New York, New York, Qwest NYC Core
 5   62 ms   63 ms   62 ms  205.171.5.115    ------- Cisco GSR OC48, San Francisco, CA, Qwest SF Core
 6   78 ms   78 ms   78 ms  205.171.5.108    ------- Cisco GSR OC48, Seattle, Washington, Qwest Sea Core
 7   78 ms   78 ms   94 ms  205.171.26.42    ------- Juniper M40 OC48, Seattle, Washington, Qwest Sea Edge
 8   78 ms   79 ms   78 ms  208.46.239.90    ------- Juniper M40 OC48, Seattle, Washington, PNW Gigapop
 9   78 ms   78 ms   94 ms  198.48.91.30     ------- Cisco GSR OC48, Redmond Washington, Microsoft
10   78 ms   78 ms   94 ms  131.107.151.194  ------- Compaq SP750 Win2K WKS, Redmond Washington, Microsoft, SysKonnect GbE

Page 5: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

750 Mbps over 5000 km (957 Mbps multi-stream)

~ 4e15 bit meters per second

4 Peta bmps (“peta bumps”) single-stream tcp/ip throughput

5 Peta bmps multi-stream

Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (high speed connectivity consortium), DARPA

Page 6: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

“PetaBumps”

• 751 Mbps for 300 seconds = (~28 GB)

single-thread single-stream tcp/ip desktop-to-desktop out of the box performance*

• 5626 km x 751 Mbps = ~ 4.2e15 bit meter / second ~ 4.2 Peta bmps

• Multi-stream is 952 Mbps ~ 5.2 Peta bmps

• 4470 byte MTUs were enabled on all routers.

• 20 MB window size
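The single-stream figures on this slide can be re-derived with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, decimal units (1 GB = 1e9 bytes) are assumed, and the 78 ms round-trip time is taken from the tracert output earlier in the transcript rather than from this slide.

```python
def bit_meters_per_second(mbps, km):
    """Throughput times distance, the "bmps" metric used on the slide."""
    return mbps * 1e6 * km * 1e3


def gigabytes_moved(mbps, seconds):
    """Data volume in decimal GB for a sustained rate over a time span."""
    return mbps * 1e6 * seconds / 8 / 1e9


def bandwidth_delay_megabytes(mbps, rtt_ms):
    """Bytes in flight needed to keep the pipe full (bandwidth x RTT)."""
    return mbps * 1e6 * (rtt_ms / 1e3) / 8 / 1e6


if __name__ == "__main__":
    print(f"{bit_meters_per_second(751, 5626):.3e} bit meters / second")
    print(f"{gigabytes_moved(751, 300):.1f} GB in 300 seconds")
    # Assumption: 78 ms RTT, read off the traceroute slide.
    print(f"{bandwidth_delay_megabytes(751, 78):.1f} MB bandwidth-delay product")
```

Under these assumptions the bandwidth-delay product comes out at roughly 7 MB, so the 20 MB window quoted on the slide would be comfortably larger than what a single stream needs in flight.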

Page 7: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey
Page 8: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Pointers

• The single-stream submission: http://research.microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm

• The multi-stream submission: http://research.Microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm

• The code: http://research.Microsoft.com/~gray/papers/speedy.htm (speedy.h, speedy.c)

• And a PowerPoint presentation about it: http://research.Microsoft.com/~gray/papers/Windows2000_WAN_Speed_Record.ppt

Page 9: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

What I have been Doing

Peta Bumps

10k$ TB

Scaleable Computing

Sloan Digital Sky Survey

Page 10: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

TPC-C high performance clusters

Standard transaction processing benchmark

Mix of 5 simple transaction types.

Database scales with workload

Measures balanced system.

Page 11: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Scalability Successes

• Single Site Clusters
– Billions of transactions per day
– Tera-Ops & Peta-Bytes (10 k node clusters)
– Micro-dollar/transaction

• Hardware + Software advances
– TPC & Sort examples (2x/year)
– Many other examples

[Chart: Records Sorted per Second (doubles every year) and GB Sorted per Dollar (doubles every year), 1985 to 2000, log scale from 1.E-03 to 1.E+06]

[Chart: tpmC vs Time, Jan-95 to Jan-99, 0 to 90,000 tpmC]

Page 12: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Progress since Jan 99: Running out of gas?

• 50% better peak perf (not 2x)

• 2x better Price/Performance

• At a cost ceiling: systems cost 7M$-13M$

• June 98 result: “hero” effort (off-scale good!) (Compaq/Alpha/Oracle 96 cpu, 8 node cluster, 102,542 tpmC @ 139$/tpmC, 5/5/98)
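TPC-C price/performance is the priced system cost divided by tpmC, so the two numbers quoted for the “hero” result imply a system price. A minimal sketch of that arithmetic follows; the function name is hypothetical and nothing here comes from the benchmark report itself.

```python
def implied_system_price(tpmc, dollars_per_tpmc):
    """Back out the priced system cost from throughput and $/tpmC."""
    return tpmc * dollars_per_tpmc


if __name__ == "__main__":
    price = implied_system_price(102_542, 139)
    print(f"implied price: {price / 1e6:.2f} M$")
```

With the slide's figures this works out to a little over 14 M$, above the 7M$-13M$ range quoted for the other systems.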

[Chart: tpmC vs Time, Jan-95 to Jan-99, 0 to 90,000 tpmC]

[Chart: tpmC vs Time, Jan-95 to Jan-00, 0 to 240,000 tpmC; annotation: “Out’a gas?”]

Page 13: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

• First proof point of commoditized scale-out

• 1.7x better performance, 3x better price/performance

• 4M$ vs 7M$-13M$

• Much more to do, but…great start!

[Chart: tpmC vs Time, Jan-95 to Jan-99, 0 to 90,000 tpmC]

[Chart: tpmC vs Time, Jan-95 to Jan-00, 0 to 240,000 tpmC; annotation: “2/17/00: back on Schedule!!”]

Page 14: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Year 2000 Sort Results

Penny (Daytona and Indy): 4.5 GB (45 M records), 886 seconds on a $1010 Win2K/Intel system. HMsort. Brad Helmkamp, Keith McCready, Stenograph LLC

Minute (Daytona): 7.6 GB in 60 seconds. Ordinal Nsort, SGI 32 cpu Origin, IRIX

Minute (Indy): 21.8 GB (218 M records) in 56.51 sec. NOW+HPVMsort, 64 nodes, WinNT. Luis Rivera, Xianan Zhang, Andrew Chien, UCSD

TeraByte (Daytona): 49 minutes. David Cossock, Sam Fineberg, Pankaj Mehra, John Peck. 68x2 Compaq Tandem, Sandia Labs

TeraByte (Indy): 1057 seconds. SPsort, 1952 SP cluster, 2168 disks. Jim Wyllie

Datamation: 1 M records in .998 seconds. Mitsubishi DIAPRISM Hardware Sorter with HP 4 x 550MHz Xeon PC server + 32 SCSI disks, Windows NT4. Shinsuke Azuma, Takao Sakuma, Tetsuya Takeo, Takaaki Ando, Kenji Shirai, Mitsubishi Electric Corp.
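The Penny entry pairs a run time with a system price because PennySort charges for machine time, with the system price spread over three years of use. The sketch below shows that accounting under the assumption of a 365-day year; the function name is hypothetical.

```python
SECONDS_IN_THREE_YEARS = 3 * 365 * 24 * 3600  # assumed depreciation period


def run_cost_dollars(system_price_dollars, run_seconds):
    """Cost of a run when the system price is amortized over three years."""
    return system_price_dollars * run_seconds / SECONDS_IN_THREE_YEARS


if __name__ == "__main__":
    cost = run_cost_dollars(1010, 886)
    print(f"886 s on a $1010 system costs about ${cost:.4f}")
```

For the 886 second run on the $1010 system this gives just under one cent, which is consistent with the entry's category.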

Page 15: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

What’s a Balanced System?

[Diagram labels: System Bus, PCI Bus, PCI Bus]

Page 16: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Rules of Thumb in Data Engineering

• Moore’s law -> an address bit per 18 months.
• Storage grows 100x/decade (except 1000x last decade!)
• Disk data of 10 years ago now fits in RAM (iso-price).
• Device bandwidth grows 10x/decade – so need parallelism
• RAM:disk:tape price is 1:10:30 going to 1:10:10
• Amdahl’s speedup law: with serial part S and parallel part P, maximum speedup is (S+P)/S
• Amdahl’s IO law: bit of IO per instruction/second (TBps per 10 TOP! 50,000 disks per 10 teraOP: 100 M$)
• Amdahl’s memory law: byte per instruction/second (going to 10) (1 TB RAM per TOP: 1 TeraDollars)
• PetaOps anyone?
• Gilder’s law: aggregate bandwidth doubles every 8 months.
• 5 Minute rule: cache disk data that is reused in 5 minutes.
• Web rule: cache everything!

http://research.Microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
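The Amdahl rules on this slide reduce to short formulas. The following sketch works through the speedup bound and the IO law for a 10 teraOP machine. The function names are hypothetical, and the 25 MBps per-disk rate is an assumption (it is the write rate quoted on a later slide, not a number given here).

```python
import math


def amdahl_max_speedup(serial, parallel):
    """Upper bound on speedup when only the parallel part gets faster."""
    return (serial + parallel) / serial


def io_bytes_per_second(ops_per_second):
    """Amdahl's IO law: one bit of IO per instruction per second."""
    return ops_per_second / 8


def disks_needed(ops_per_second, disk_bytes_per_second):
    """Disks required to supply the IO rate the law calls for."""
    return math.ceil(io_bytes_per_second(ops_per_second) / disk_bytes_per_second)


if __name__ == "__main__":
    print(amdahl_max_speedup(serial=1, parallel=9))
    ten_teraops = 10e12
    print(f"{io_bytes_per_second(ten_teraops) / 1e12:.2f} TBps")
    # Assumption: each disk sustains 25 MBps.
    print(disks_needed(ten_teraops, 25e6), "disks")
```

Under that per-disk assumption the disk count matches the 50,000 quoted on the slide, and 100 M$ across 50,000 disks would correspond to about 2 k$ per disk.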

Page 17: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Cheap Storage

• Disks are getting cheap:

• 7 k$/TB disks (25 x 40 GB disks @ 230$ each)

[Chart: Price vs disk capacity; price ($) vs raw disk unit size (0 to 60 GB), IDE and SCSI series, with linear fits y = 5.7156x + 47.857 and y = 15.895x + 13.446]

[Chart: raw k$/TB vs disk unit size (0 to 60 GB), IDE and SCSI series; 7 k$/TB marked]
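The two linear fits from the price chart can be turned into a cost per terabyte. This is only a sketch: the transcript does not say which fit belongs to IDE and which to SCSI, so they are labelled A and B here, and 1 TB = 1000 GB is assumed.

```python
FIT_A = (5.7156, 47.857)   # slope ($/GB), intercept ($)
FIT_B = (15.895, 13.446)


def disk_price_dollars(fit, size_gb):
    """Price of one disk according to a linear fit y = slope * x + intercept."""
    slope, intercept = fit
    return slope * size_gb + intercept


def kdollars_per_tb(fit, size_gb):
    """Raw k$/TB when a terabyte is built from disks of the given size."""
    dollars_per_gb = disk_price_dollars(fit, size_gb) / size_gb
    return dollars_per_gb * 1000 / 1000  # $/GB -> $/TB -> k$/TB


if __name__ == "__main__":
    for name, fit in (("A", FIT_A), ("B", FIT_B)):
        print(f"fit {name}: {kdollars_per_tb(fit, 40):.1f} k$/TB at 40 GB")
```

At 40 GB, fit A gives about 6.9 k$/TB, close to the 7 k$/TB headline, while fit B gives about 16 k$/TB.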

Page 18: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Cheap Storage or Balanced System

• Low cost storage (2 x 1.5k$ servers): 10K$ TB
– 2x (1K$ system + 8x70GB disks + 100MbEthernet)

• Balanced server (9k$/.5 TB)
– 2x800Mhz (2k$)
– 256 MB (500$)
– 8 x 73 GB drives (4K$)
– Gbps Ethernet + switch (1.5k$)
– 18k$ TB, 36K$/RAIDED TB

Page 19: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

160 GB, 2k$ (now); 300 GB by year end.

• 4x40 GB IDE (2 hot pluggable)
– (1,100$)

• SCSI-IDE bridge
– 200$

• Box
– 500 Mhz cpu
– 256 MB SRAM
– Fan, power, Enet
– 700$

• Or 8 disks/box: 600 GB for ~3K$ (or 300 GB RAID)

Page 20: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Hot Swap Drives for Archive or Data Interchange

• 25 MBps write (so can write N x 74 GB in 3 hours)

• 74 GB/overnite = ~N x 2 MB/second @ 19.95$/nite
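The “~N x 2 MB/second” figure treats an overnight shipment of disks as a network link. A rough sketch of that conversion is below; the function names are hypothetical, decimal units are assumed, and the 10 hour transit time in the usage example is an illustrative guess rather than a number from the slide.

```python
def effective_mb_per_second(n_disks, gb_per_disk, transit_hours):
    """Average transfer rate of shipping n disks that take transit_hours to arrive."""
    return n_disks * gb_per_disk * 1e3 / (transit_hours * 3600)


def transit_hours_for_rate(gb_per_disk, mb_per_second):
    """Transit time at which one shipped disk averages the given rate."""
    return gb_per_disk * 1e3 / mb_per_second / 3600


if __name__ == "__main__":
    print(f"{transit_hours_for_rate(74, 2):.1f} hours per 74 GB disk at 2 MB/s")
    # Assumption: 10 hours door to door, 4 disks in the box.
    print(f"{effective_mb_per_second(4, 74, 10):.1f} MB/s for 4 disks")
```

One 74 GB disk averaging 2 MB/s corresponds to a little over ten hours in transit, which fits the “overnite” label.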

Page 21: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Doing Studies of IO bandwidth

• SCSI & IDE bandwidth
– ~15-30 MBps sequential
– SCSI 10krpm ~ 110 kaps @ 600$
– IDE 7.2krpm ~ 80 kaps @ 250$

• Get 2 disks for the price of 1
– More bandwidth for reads
– RAID
– 10K$ raid TB by 2001

Page 22: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

What I have been Doing

Peta Bumps

10k$ TB

Scaleable Computing

Sloan Digital Sky Survey

Page 23: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

The Sloan Digital Sky Survey

A project run by the Astrophysical Research Consortium (ARC)

The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study

SLOAN Foundation, NSF, DOE, NASA

Goal: To create a detailed multicolor map of the Northern Sky over 5 years, with a budget of approximately $80M

Data Size: 40 TB raw, 1 TB processed

Page 24: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Scientific Motivation

Create the ultimate map of the Universe: The Cosmic Genome Project!

Study the distribution of galaxies: What is the origin of fluctuations? What is the topology of the distribution?

Measure the global properties of the Universe: How much dark matter is there?

Local census of the galaxy population: How did galaxies form?

Find the most distant objects in the Universe: What are the highest quasar redshifts?

Page 25: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

First Light Images

Telescope: First light May 9th 1998

Equatorial scans

Page 26: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

The First Stripes

Camera: 5 color imaging of >100 square degrees

Multiple scans across the same fields

Photometric limits as expected

Page 27: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

SDSS Data Flow

Page 28: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

SDSS Data Products

All raw data saved in a tape vault at Fermilab

Object catalog: 400 GB, parameters of >10^8 objects

Redshift Catalog: 1 GB, parameters of 10^6 objects

Atlas Images: 1.5 TB, 5 color cutouts of >10^8 objects

Spectra: 60 GB, in a one-dimensional form

Derived Catalogs: 20 GB (clusters, QSO absorption lines)

4x4 Pixel All-Sky Map: 60 GB, heavily compressed

Page 29: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

Distributed Implementation

[Diagram labels: User Interface, Analysis Engine, Master, SX Engine, Objectivity Federation, and four Slave nodes, with Objectivity and RAID at the nodes]

Page 30: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

What We Have Been Doing

• Helping move the data to SQL
– Database design
– Data loading

• Experimenting with queries on a 4 M object DB
– 20 questions like “find gravitational lens candidates”
– Queries use parallelism, most run in a few seconds. (auto parallel)
– Some run in hours (neighbors within 1 arcsec)
– EASY to ask questions.

• Helping with an “outreach” website: SkyServer

• Personal goal: Try datamining techniques to “re-discover” Astronomy

[Chart: Color Magnitude Diff/Ratio Distribution; counts (log scale, 1.E+0 to 1.E+7) vs magnitude diff/ratio (-30 to 30) for u-g, g-r, r-i, i-z]
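One plausible reason a “neighbors within 1 arcsec” query is slow is that a naive formulation compares every pair of objects. The sketch below illustrates that brute-force form in Python. It is not the SQL used on the project: the `(object_id, ra_deg, dec_deg)` tuple layout and both function names are hypothetical.

```python
import math
from itertools import combinations

ARCSEC_IN_DEGREES = 1.0 / 3600.0


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def neighbor_pairs(objects, radius_deg=ARCSEC_IN_DEGREES):
    """Return id pairs closer than radius_deg by checking every pair, O(n^2)."""
    pairs = []
    for (id_a, ra_a, dec_a), (id_b, ra_b, dec_b) in combinations(objects, 2):
        if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
            pairs.append((id_a, id_b))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample: objects 1 and 2 are 0.72 arcsec apart in declination.
    sample = [(1, 10.0, 0.0), (2, 10.0, 0.0002), (3, 10.5, 0.0)]
    print(neighbor_pairs(sample))
```

For 4 M objects this pairwise form would mean on the order of 8e12 comparisons, which is why practical versions restrict candidates first, for example by sorting or bucketing on position.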

Page 31: What I have been Doing Peta Bumps 10k$ TB Scaleable Computing Sloan Digital Sky Survey

What I have been Doing

Peta Bumps

10k$ TB

Scaleable Computing

Sloan Digital Sky Survey