TRANSCRIPT
1
[Storage-scale graphic: Kilo through Yotta]
Data Centric Computing
Jim Gray
Microsoft Research
Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 14 Oct 1999
2
Put Everything in Future (Disk) Controllers
(it's not "if", it's "when?")
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 14 Oct 1999
Acknowledgements:
Dave Patterson explained this to me long ago.
Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
3
First Disk, 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24" disks
• 1200 rpm
• 100 ms access
• 35 k$/y rent
• Included computer & accounting software (tubes, not transistors)
4
10 years later
[Photo: disk drive, 1.6 meters]
5
Disk Evolution
• Capacity: 100x in 10 years
  – 1 TB 3.5" drive in 2005
  – 20 GB as 1" micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is super computer!
[Storage-scale graphic: Kilo through Yotta]
6
Disks are becoming computers
• Smart drives
• Camera with micro-drive
• Replay / TiVo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
[Diagram: Disk Ctlr + 1 GHz cpu + 1 GB RAM; Comm: Infiniband, Ethernet, radio…; Applications: Web, DBMS, Files; OS]
7
Data Gravity: Processing Moves to Transducers
(smart displays, microphones, printers, NICs, disks)
[Diagram: Storage, Network, and Display, each with its own ASIC]
Today: P = 50 mips, M = 2 MB
In a few years: P = 500 mips, M = 256 MB
Processing decentralized:
• moving to data sources
• moving to power sources
• moving to sheet metal
? The end of computers ?
8
It's Already True of Printers
Peripheral = CyberBrick
• You buy a printer
• You get:
  – several network interfaces
  – a Postscript engine (cpu, memory, software, a spooler (soon))
  – and… a print engine.
9
The (Absurd?) Consequences of Moore's Law
• 256-way NUMA?
• Huge main memories: now 500 MB - 64 GB; then 10 GB - 1 TB
• Huge disks: now 20-200 GB 3.5" disks; then 0.1 - 1 TB disks
• Petabyte storage farms (that you can't back up or restore)
• Disks >> tapes. "Small" disks: one platter, one inch, 10 GB
• SAN convergence: 1 GBps point-to-point is easy
• 1 GB RAM chips
• MAD at 200 Gbpsi
• Drives shrink one quantum
• 10 GBps SANs are ubiquitous
• 1 bips cpus for 10$
• 10 bips cpus at high end
10
The Absurd Design?
• Further segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl's laws: bus: 10 B/ips; io: 1 b/ips
[Diagram: Processors (~1 Tips) — RAM (~1 TB) — Disks (~100 TB), with 10 TBps and 100 GBps links]
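The slide's "Amdahl's laws" figures can be checked directly: a balanced system needs 10 bytes of bus bandwidth and 1 bit of I/O per instruction per second. A quick sketch (the helper name `balanced_bandwidths` is mine, for illustration):

```python
def balanced_bandwidths(ips):
    """Return (bus_bytes_per_s, io_bytes_per_s) for a cpu of `ips`,
    using Amdahl's rules quoted above: 10 B/ips bus, 1 b/ips io."""
    bus = 10 * ips        # 10 bytes of bus traffic per ips
    io = ips / 8          # 1 bit of I/O per ips, converted to bytes
    return bus, io

bus, io = balanced_bandwidths(1e12)   # ~1 Tips, as on the slide
print(f"bus: {bus/1e12:.0f} TBps, io: {io/1e9:.0f} GBps")
```

For a 1 Tips processor farm this gives ~10 TBps of bus and ~125 GBps of I/O, i.e. the order-of-magnitude 10 TBps / 100 GBps figures in the diagram.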
11
What's a Balanced System?
(40+ disk arms / cpu)
[Diagram: System Bus connected to PCI Buses]
13
Observations re TPC-C, TPC-H systems
• More than ½ the hardware cost is in disks
• Most of the mips are in the disk controllers
• 20 mips/arm is enough for TPC-C
• 50 mips/arm is enough for TPC-H
• Need 128 MB to 256 MB/arm
• Ref:
  – Gray & Shenoy: "Rules of Thumb…"
  – Keeton, Riedel, Uysal: PhD theses.
? The end of computers ?
16
When each disk has 1 bips, no need for 'cpu'
17
Implications
Conventional:
• Offload device handling to NIC/HBA
• Higher-level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and cluster parallelism is important.
Radical:
• Move app to NIC/device controller
• Higher-higher level protocols: CORBA / COM+.
• Cluster parallelism is VERY important.
[Diagram: Central Processor & Memory vs Terabyte/s Backplane]
18
Interim Step: Shared Logic
• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2 x Gbps Ethernet
• General-purpose OS (except NetApp)
• 10 k$/TB to 50 k$/TB
• Shared:
  – sheet metal
  – power
  – support/config
  – security
  – network ports
Examples: Snap™ ~1 TB 12x80 GB NAS; NetApp™ ~0.5 TB 8x70 GB NAS; Maxtor™ ~2 TB 12x160 GB NAS
20
Gordon Bell's Seven Price Tiers
• 10$: wrist-watch computers
• 100$: pocket/palm computers
• 1,000$: portable computers
• 10,000$: personal computers (desktop)
• 100,000$: departmental computers (closet)
• 1,000,000$: site computers (glass house)
• 10,000,000$: regional computers (glass castle)
Super-Server: costs more than 100,000$. "Mainframe": costs more than 1 M$. Must be an array of processors, disks, tapes, comm ports.
21
Bell's Evolution of Computer Classes
Technology enables two evolutionary paths:
1. constant performance, decreasing cost
2. constant price, increasing performance
[Chart: Log Price vs Time — Mainframes (central), Minis (dep't.), WSs, PCs (personals)]
1.26 = 2x/3 yrs = 10x/decade; 1/1.26 = .8
1.6 = 4x/3 yrs = 100x/decade; 1/1.6 = .62
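The two growth constants at the bottom of the slide work out as claimed; a quick check of the compounding:

```python
# 1.26x/year compounds to ~2x in 3 years and ~10x in a decade;
# 1.6x/year compounds to ~4x in 3 years and ~100x in a decade.
slow, fast = 1.26, 1.6
print(round(slow**3, 2), round(slow**10, 1))   # ~2x in 3 yrs, ~10x in 10
print(round(fast**3, 2), round(fast**10, 0))   # ~4.1x in 3 yrs, ~110x in 10
```

So "100x/decade" for the fast path is a round-number approximation (the exact figure is ~110x).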
22
NAS vs SAN
• Network Attached Storage:
  – file servers
  – database servers
  – application servers
  – (it's a slippery slope, as Novell showed)
• Storage Area Network:
  – a lower life form
  – block server: get block / put block
  – wrong abstraction level (too low-level)
  – security is VERY hard to understand (who can read that disk block?)
SCSI and iSCSI are popular. High-level interfaces are better.
23
How Do They Talk to Each Other?
• Each node has an OS.
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other:
  – WebServices/SOAP? CORBA? COM+? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed-system story.
[Diagram: two node stacks — Applications over RPC?, datagrams, streams, SIO — joined by a SAN]
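To make the "high-level interfaces" point concrete, here is a minimal RPC sketch using Python's stdlib XML-RPC as a stand-in for the SOAP/CORBA/COM+/RMI options the slide lists. The node exposes a named-record operation rather than raw blocks; `get_record` and its data are hypothetical.

```python
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy
import threading

RECORDS = {1: "alpha", 2: "beta"}   # hypothetical node-local data

def get_record(key):
    """High-level interface: named record in, value out -- not sectors."""
    return RECORDS.get(key, "")

# One "disk node": serve the operation over RPC on an ephemeral port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(get_record)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Another node calls it by name, never seeing blocks or sectors.
print(ServerProxy(f"http://localhost:{port}").get_record(1))
```

The security and naming questions the slide raises live entirely at this interface level, which is exactly why the abstraction chosen matters.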
25
The Slippery Slope
• If you add function to the server,
• then you add more function to the server.
• Function gravitates to data.
[Spectrum: Nothing = Sector Server … Something = Fixed App Server … Everything = App Server]
26
Why Not a Sector Server?
(let's get physical!)
• Good idea; that's what we have today.
• But:
  – cache added for performance
  – sector remap added for fault tolerance
  – error reporting and diagnostics added
  – SCSI commands (reserve,…) are growing
  – sharing problematic (space mgmt, security,…)
• Slipping down the slope to a 1-D block server
27
Why Not a 1-D Block Server?
Put A LITTLE on the Disk Server
• Tried and true design:
  – HSC - VAX cluster
  – EMC
  – IBM Sysplex (3980?)
• But look inside:
  – has a cache
  – has space management
  – has error reporting & management
  – has RAID 0, 1, 2, 3, 4, 5, 10, 50,…
  – has locking
  – has remote replication
  – has an OS
  – security is problematic
  – low-level interface moves too many bytes
28
Why Not a 2-D Block Server?
Put A LITTLE on the Disk Server
• Tried and true design:
  – Cedar -> NFS
  – file server, cache, space,…
  – open file is many fewer msgs
• Grows to have:
  – directories + naming
  – authentication + access control
  – RAID 0, 1, 2, 3, 4, 5, 10, 50,…
  – locking
  – backup/restore/admin
  – cooperative caching with client
29
Why Not a File Server?
Put a Little on the 2-D Block Server
• Tried and true design:
  – NetWare, Windows, Linux, NetApp, Cobalt, SNAP, WebDAV,…
• Yes, but look at NetWare:
  – file interface grew
  – became an app server (mail, DB, Web,…)
  – NetWare had a primitive OS: hard to program, so it optimized the wrong thing
30
Why Not Everything?
Allow Everything on the Disk Server (thin clients)
• Tried and true design:
  – mainframes, minis,…
  – web servers,…
  – encapsulates data
  – minimizes data moves
  – scaleable
• It is where everyone ends up.
• All the arguments against are short-term.
31
The Slippery Slope
• If you add function to the server,
• then you add more function to the server.
• Function gravitates to data.
[Spectrum: Nothing = Sector Server … Something = Fixed App Server … Everything = App Server]
32
Disk = Node
• has magnetic storage (1 TB?)
• has processor & DRAM
• has SAN attachment
• has execution environment
[Stack: Applications; Services: DBMS; File System, RPC,…; OS Kernel: SAN driver, disk driver]
33
Hardware
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives
• $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB
• 3 weeks from ordering to operational
Slide courtesy of Brewster Kahle, @ Archive.org
34
Disk as Tape
• Tape is unreliable, specialized, slow, low-density, not improving fast, and expensive.
• Using removable hard drives to replace tape's function has been successful.
• When a "tape" is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
• Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle, @ Archive.org
35
Disk as Tape: What Format?
• Today I send NTFS/SQL disks.
• But that is not a good format for Linux.
• Solution: ship NFS/CIFS/ODBC servers (not disks).
• Plug the "disk" into the LAN:
  – DHCP, then file or DB server via standard interface.
  – Web Service in the long term.
36
Some Questions
• Will the disk folks deliver?
• What is the product?
• How do I manage 1,000 nodes (disks)?
• How do I program 1,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?
37
Will the disk folks deliver? Maybe!
[Chart: Total Hard Drive Unit Shipments, 1986-2001, units in millions (0-250). Source: DiskTrend/IDC]
Not a pretty picture (lately)
38
Most Disks are Personal
• 85% of disks are desktop/mobile (not SCSI).
• Personal media is AT LEAST 50% of the problem.
• How to manage your shoebox of:
  – documents
  – voicemail
  – photos
  – music
  – videos
[Charts: 17.7 GB by size, and files by number — Music 6.9 GB (1.8K files, 180 CDs); Archive 5.1 GB (477 folders, 18.7K files); Video 2.6 GB (10 hours, low-res); Working 2.3 GB (432 folders, 2.9K files); Mail 0.7 GB (43K msgs); My Books 98 MB; 27.1K files & 42K .msg overall; file types: .doc/html, .jpg, .gif, .xls, .ppt, .tif, .pdf]
39
What is the Product?
(see next section on media management)
• Concept: plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• "PC"
• File server appliance
• Data archive/interchange appliance
• Web appliance
• Email appliance
• Application appliance
• Router appliance
[Diagram: a box whose only connections are power and network]
40
Auto-Manage Storage
• 1980 rule of thumb:
  – a DataAdmin per 10 GB, a SysAdmin per mips
• 2000 rule of thumb:
  – a DataAdmin per 5 TB
  – a SysAdmin per 100 clones (varies with app)
• Problem:
  – 5 TB is 50 k$ today, 5 k$ in a few years.
  – Admin cost >> storage cost!
• Challenge:
  – automate ALL storage admin tasks
41
How do I manage 1,000 nodes?
• You can't manage 1,000 x (for any x).
• They manage themselves.
  – You manage exceptional exceptions.
• Auto-manage:
  – plug & play hardware
  – auto load-balance & placement of storage & processing
  – simple parallel programming model
  – fault masking
• Some positive signs — few admins at:
  – Google: 10k nodes, 2 PB
  – Yahoo!: ? nodes, 0.3 PB
  – Hotmail: 10k nodes, 0.3 PB
42
How do I program 1,000 nodes?
• You can't program 1,000 x (for any x).
• They program themselves.
  – You write embarrassingly parallel programs.
  – Examples: SQL, Web, Google, Inktomi, HotMail,…
  – PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto-parallelism is ESSENTIAL.
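A minimal illustration of the embarrassingly parallel style the slide advocates: you write one per-partition function and a framework maps it over the data, SQL/Google-style. Here a process pool stands in for 1,000 disk-nodes; `scan_partition` and the partitions are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

def scan_partition(part):
    """Per-node work: e.g. count matching records in one partition."""
    return sum(1 for rec in part if rec % 2 == 0)

if __name__ == "__main__":
    # Ten partitions of 100 records each, standing in for per-disk data.
    partitions = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    with ProcessPoolExecutor() as pool:
        # The "framework" does the parallelism; the programmer wrote
        # only the sequential per-partition function above.
        total = sum(pool.map(scan_partition, partitions))
    print(total)  # each 100-record partition has 50 evens -> 500
```

The point is the division of labor: the per-partition function knows nothing about nodes, placement, or failures.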
43
Plug & Play Software
• RPC is standardizing (SOAP/HTTP, COM+, RMI/IIOP):
  – gives huge TOOL LEVERAGE
  – solves the hard problems: naming, security, directory service, operations,…
• Commoditized programming environments:
  – FreeBSD, Linux, Solaris,… + tools
  – NetWare + tools
  – WinCE, WinNT,… + tools
  – JavaOS + tools
• Apps gravitate to data.
• A general-purpose OS on a dedicated ctlr can run apps.
44
It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?): a geo-plex.
• Scrub it continuously (look for errors).
• On failure:
  – use the other copy until the failure is repaired,
  – refresh the lost copy from the safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space).
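The 12-day figure follows directly from the numbers on the slide:

```python
# A petabyte moved at 1 GBps: how long does the restore take?
PB = 1e15            # bytes
rate = 1e9           # 1 GBps in bytes/second
days = PB / rate / 86_400
print(f"{days:.1f} days")   # ~11.6 days, the slide's "12 days"
```

This is why the slide's answer is not a faster restore but never restoring: keep two live copies and repair from the surviving one.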
52
CyberBricks
• Disks are becoming supercomputers.
• Each disk will be a file server, then a SOAP server.
• Multi-disk bricks are transitional.
• The long-term brick will have an OS per disk.
• Systems will be built from bricks.
• There will also be:
  – network bricks
  – display bricks
  – camera bricks
  – …
53
[Title slide repeated: Data Centric Computing — Jim Gray, Microsoft Research, Research.Microsoft.com/~Gray/talks, FAST 2002, Monterey, CA, 14 Oct 1999]
54
Communications Excitement!!
[2x2 matrix — Point-to-Point vs Broadcast, Immediate vs Time-Shifted:
  immediate / point-to-point: conversation, money
  immediate / broadcast: lecture, concert
  time-shifted / point-to-point: mail
  time-shifted / broadcast: book, newspaper
 — all mediated by a Network + DB, i.e. a DataBase]
It's ALL going electronic.
Information is being stored for analysis (so ALL database).
Analysis & automatic processing are being added.
Slide borrowed from Craig Mundie
55
Information Excitement!
• But comm just carries information.
• The real value added is:
  – information capture & rendering: speech, vision, graphics, animation,…
  – information storage & retrieval,
  – information analysis.
56
Information At Your Fingertips
• All information will be in an online database (somewhere).
• You might record everything you:
  – read: 10 MB/day, 400 GB/lifetime (5 disks today)
  – hear: 400 MB/day, 16 TB/lifetime (2 disks/year today)
  – see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (150 disks/year, maybe someday)
• Data storage, organization, and analysis are the challenge: text, speech, sound, vision, graphics, spatial, time…
• Information at Your Fingertips:
  – make it easy to capture
  – make it easy to store, organize & analyze
  – make it easy to present & access
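The lifetime totals follow from the per-day rates, assuming roughly a 100-year lifetime (an assumption; the slide does not state the lifespan used):

```python
# Lifetime capture volumes from the slide's per-day rates.
LIFETIME_DAYS = 100 * 365
read = 10e6 * LIFETIME_DAYS    # 10 MB/day of reading
hear = 400e6 * LIFETIME_DAYS   # 400 MB/day of audio
see  = 40e9 * LIFETIME_DAYS    # 40 GB/day of video (1 MB/s, ~11 waking hrs)
print(f"read ~{read/1e9:.0f} GB, hear ~{hear/1e12:.1f} TB, see ~{see/1e15:.2f} PB")
```

This reproduces the slide's rounded figures: ~365 GB ("400 GB"), ~14.6 TB ("16 TB"), and ~1.5 PB ("1.6 PB").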
57
How much information is there?
• Soon everything can be recorded and indexed.
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk, "How much information is there?": http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian, "How much information": http://www.sims.berkeley.edu/research/projects/how-much-info/
[Scale graphic, Kilo through Yotta, with examples: A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded!]
(Small-end prefixes: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto)
58
Why Put Everything in Cyberspace?
• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots
[Diagram: Point-to-Point OR Broadcast × Immediate OR Time-Delayed; Locate, Process, Analyze, Summarize]
59
Disk Storage Cheaper than Paper
• File cabinet:
    cabinet (4 drawer)       250$
    paper (24,000 sheets)    250$
    space (2x3 @ 10$/ft2)    180$
    total                    700$  →  3 ¢/sheet
• Disk (160 GB = 300$):
    ASCII: 100 M pages  →  0.0001 ¢/sheet (10,000x cheaper)
    Image: 1 M photos   →  0.03 ¢/sheet (100x cheaper)
• Store everything on disk
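The per-sheet arithmetic behind the slide can be reproduced directly. (Computed strictly, the ASCII figure comes out at ~0.0003 ¢/sheet rather than the slide's 0.0001 ¢, but the 10,000x and 100x ratios hold.)

```python
# Cents per sheet for the filing cabinet vs a $300, 160 GB disk.
cabinet_cents = 700 * 100 / 24_000    # $700 over 24,000 sheets
ascii_cents   = 300 * 100 / 100e6     # $300 over 100 M ASCII pages
image_cents   = 300 * 100 / 1e6       # $300 over 1 M scanned photos

print(f"paper {cabinet_cents:.1f}c, ascii {ascii_cents:.4f}c, image {image_cents:.2f}c")
print(f"ratios: ~{cabinet_cents/ascii_cents:,.0f}x and ~{cabinet_cents/image_cents:.0f}x")
```

Either way the conclusion survives: paper is thousands of times more expensive per page than online disk.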
60
Gordon Bell's MainBrain™: Digitize Everything
A BIG shoebox?
• Scans: 20 k "pages", TIFF @ 300 dpi — 1 GB
• Music: 2 k "tracks" — 7 GB
• Photos: 13 k images — 2 GB
• Video: 10 hrs — 3 GB
• Docs: 3 k (ppt, word,…) — 2 GB
• Mail: 50 k messages — 1 GB
Total: 16 GB
61
Gary Starkweather
• Scan EVERYTHING
• 400 dpi TIFF
• 70k "pages" ~ 14 GB
• OCR all scans (98% OCR recognition accuracy)
• All indexed (5-second access to anything)
• All on his laptop.
62
• Q: What happens when the personal terabyte arrives?
• A: Things will run SLOWLY… unless we add good software.
63
Summary
• Disks will morph to appliances.
• Main barriers to this happening:
  – lack of cool apps
  – cost of information management