Gray @ Nortel, 20 April 1999. Scaleable Computing. Jim Gray, Microsoft Research. [email protected] http://research.Microsoft.com/~Gray/talks/ Outline: The bandwidth revolution; ScaleUp, ScaleOut; TerraServer (Barclay, Slutz, Gray)

Upload: ariana-schultz

Post on 27-Mar-2015


TRANSCRIPT

  • Slide 1

Gray @ Nortel, 20 April 1999
Scaleable Computing
Jim Gray, Microsoft Research
[email protected]
http://research.Microsoft.com/~Gray/talks/
Outline: The bandwidth revolution; ScaleUp, ScaleOut; TerraServer (Barclay, Slutz, Gray)

  • Slide 2

Gilder's Law: 3x bandwidth/year for 25 more years
Today: 10 Gbps per channel; 4 channels per fiber: 40 Gbps; 32 fibers/bundle = 1.2 Tbps/bundle
In lab: 3 Tbps/fiber (400 x WDM)
In theory: 25 Tbps per fiber
1 Tbps = USA 1996 WAN bisection bandwidth
1 fiber = 25 Tbps

  • Slide 3

Networking: BIG!! Changes coming!
Technology: 1 GBps bus now; 1 Gbps links now; 1 Tbps links in 10 years; fast & cheap switches
Standard wires for interconnect: processor-processor; processor-device (=processor)
Deregulation WILL work someday
Software: improving; user-level Net-IO
Software challenge: reduce software tax on messages
Today: 30 K ins + 10 ins/byte
Goal: 1 K ins + .01 ins/byte

  • Slide 4

Technology (hardware)
NOW:
CPU: nearing 1 BIPS, but CPI rising fast (2-10), so less than 100 mips; 1$/mips to 10$/mips
DRAM: 3 $/MB
DISK: 20 $/GB
TAPE: 20 GB/tape, 6 MBps; lags disk; 2$/GB offline, 15$/GB nearline
BUS/SAN: 10/1 GBps
WAN: 0.1 Mbps
2003 Forecast (10x better):
CPU: 1 bips real (smp); 0.1$ - 1$/mips
DRAM: 1 Gb chip; 0.1 $/MB
Disk: 10 GB smart cards; 500GB RAID5 packs (NTinside); 3$/GB
BUS/SAN: 100/10 GBps
WAN: 1 Gbps

  • Slide 5

Microsoft SAN Infrastructure
WinSock Direct Path: 110 MBps (that's B not b); 10% cpu (not 200%)
Network faster than most IO attachments
Figure labels: App, Winsock, MsAfd, U, K, AFD, TCP, IP, NDIS, MiniPort, HW, HwSPI, Switch, VIA

  • Slide 6

SAN: Standard Interconnect
Gbps SAN: 110 MBps; PCI: 70 MBps; UW Scsi: 40 MBps; FW scsi: 20 MBps; scsi: 5 MBps
LAN faster than memory bus?
1 GBps links in lab.
100$ port cost soon
Port is computer
Winsock: 110 MBps (10% cpu utilization at each end)
RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?

  • Slide 7

Outline: The bandwidth revolution; ScaleUp, ScaleOut; TerraServer (Barclay, Slutz, Gray)

  • Slide 8

Latency: How Far Away is the Data?
Figure: relative latency of each storage level, with a distance and travel-time analogy.
Storage levels: Registers; On Chip Cache; On Board Cache; Memory; Disk; Tape/Optical Robot
Relative latencies: 1, 2, 10, 100, 10^6, 10^9
Analogy labels: My Head; This Room; This Campus; Sacramento; Pluto; Andromeda
Time labels: 1 min; 10 min; 1.5 hr; 2 Years; 2,000 Years

  • Slide 9

System On A Chip
Integrate Processing with memory on one chip: chip is 75% memory now; 1MB cache >> 1960 supercomputers; 256 Mb memory chip is 32 MB!; IRAM, CRAM, PIM, ... projects abound
Integrate Networking with processing on one chip: system bus is a kind of network; ATM, FiberChannel, Ethernet, ... logic on chip; direct IO (no intermediate bus)
Functionally specialized cards shrink to a chip.

  • Slide 10

Scaleability: Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: cluster has inexpensive parts
Figure labels: SMP Super Server; Departmental Server; Personal System; Cluster of PCs

  • Slide 11

There'll be Billions Trillions Of Clients
Every device will be intelligent: doors, rooms, cars
Computing will be ubiquitous

  • Slide 12

Billions Of Clients Need Millions Of Servers
All clients networked to servers; may be nomadic or on-demand
Fast clients want faster servers
Servers provide: shared data, control, coordination, communication
Figure labels: Mobile clients; Fixed clients; Server; Superserver; Clients; Servers; Trillions; Billions

  • Slide 13

Thin Client Support (FAT SERVERS)
TSO comes to NT: lower per-client costs
Figure labels: Windows NT Server; Terminal Server; Dedicated Windows terminal; Existing Desktop PC; MS-DOS, UNIX, Mac clients; Net PC

  • Slide 14

FAT STORAGE SERVERS: Windows 2000 IntelliMirror
Extends CMU Coda File System ideas
Files and settings mirrored on client and server
Great for disconnected users
Facilitates roaming
Easy to replace PCs
Optimizes network performance

  • Slide 15

SMP -> nUMA: BIG FAT SERVERS
Directory based caching lets you build large SMPs
Every vendor building a HUGE SMP: 256 way; 3x slower remote memory
8-level memory hierarchy: L1, L2 cache; DRAM; remote DRAM (3, 6, 9, ...); disk cache; disk; tape cache; tape
Needs: 64 bit addressing; nUMA sensitive OS (not clear who will do it); or Hypervisor like IBM LSF, Stanford Disco (www-flash.stanford.edu/Hive/papers.html)
Not certain what happens next

  • Slide 16

Thesis: Many little beat few big
Smoking, hairy golf ball
How to connect the many little parts?
How to program the many little parts?
Fault tolerance & Management?
Figure labels: $1 million, $100 K, $10 K; Mainframe, Mini, Micro, Nano; 14", 9", 5.25", 3.5", 2.5", 1.8"; 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk ram; event-horizon on chip; VM reincarnated; multi-program cache, on-chip SMP
Pico Processor labels: 10 pico-second ram; 10 nano-second ram; 10 microsecond ram; 10 millisecond disc; 10 second tape archive; 1 MM^3
Capacity labels: 1 MB; 100 MB; 10 GB; 1 TB; 100 TB

  • Slide 17

4 B PCs (1 Bips, .1 GB dram, 10 GB disk, 1 Gbps Net, B=G): The Bricks of Cyberspace
Cost 1,000 $
Come with: NT; DBMS; high speed Net; system management; GUI / OOUI; tools
Compatible with everyone else
CyberBricks

  • Slide 18

Super Server: 4T Machine
Array of 1,000 4B machines: 1 bips processors; 1 BB DRAM; 10 BB disks; 1 Bbps comm lines; 1 TB tape robot
A few megabucks
Challenge: manageability, programmability, security, availability, scaleability, affordability
As easy as a single system
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
Figure: Cyber Brick, a 4B machine (CPU, 50 GB Disc, 5 GB RAM)

  • Slide 19

Scale OUT: Clusters Have Advantages
Fault tolerance: spare modules mask failures
Modular growth without limits: grow by adding small modules
Parallel data search: use multiple processors and disks
Clients and servers made from the same stuff
Inexpensive: built with commodity CyberBricks

  • Slide 20

1988: IBM DB2 + CICS Mainframe, 65 tps
IBM 4391
Simulated network of 800 clients
2m$ computer
Staff of 6 to do benchmark
2 x 3725 network controllers
16 GB disk farm: 4 x 8 x .5GB
Refrigerator-sized CPU

  • Slide 21

1987: Tandem Mini @ 256 tps
14 M$ computer (Tandem)
A dozen people (1.8M$/y)
False floor, 2 rooms of machines
Simulate 25,600 clients
32 node processor array
40 GB disk array (80 drives)
Staff: OS expert; Network expert; DB expert; Performance expert; Hardware experts; Admin expert; Auditor; Manager

  • Slide 22

1997: 9 years later
1 Person and 1 box = 1250 tps
1 Breadbox ~ 5x 1987 machine room
23 GB is hand-held
One person does all the work
Cost/tps is 100,000x less: 5 micro dollars per transaction
Hardware: 4x200 Mhz cpu; 1/2 GB DRAM; 12 x 4GB disk; 3 x 7 x 4GB disk arrays
One person is: Hardware expert, OS expert, Net expert, DB expert, App expert

  • Slide 23

What Happened? Where did the 100,000x come from?
Moore's law: 100X (at most)
Software improvements: 10X (at most)
Commodity Pricing: 100X (at least)
Total: 100,000X
100x from commodity: DBMS was 100K$ to start, now 1k$ to start; IBM 390 MIPS is 7.5K$ today; Intel MIPS is 10$ today; commodity disk is 50$/GB vs 1,500$/GB; ...
Figure labels: mainframe, mini, micro; axes: time, price

  • Slide 24

Computers shrink to a point
Disks 100x in 10 years: 2 TB 3.5" drive
Shrink to 1" is 200GB
Disk is super computer!
This is already true of printers and terminals
Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta

  • Slide 25

Tera Byte Backplane
TODAY: disk controller is 10 mips risc engine with 2MB DRAM; NIC is similar power
SOON: will become 100 mips systems with 100 MB DRAM
They are nodes in a federation (can run Oracle on NT in disk controller).
Advantages: uniform programming model; great tools; security; economics (cyberbricks); move computation to data (minimize traffic)
All Device Controllers will be Cray 1s
Figure label: Central Processor & Memory

  • Slide 26

It's Already True of Printers: Peripheral = CyberBrick
You buy a printer
You get: several network interfaces; a Postscript engine (cpu, memory, software, a spooler (soon)); and a print engine.
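The factor breakdown on Slide 23 can be multiplied out as a quick check. This is a minimal sketch, not part of the talk: the dictionary keys are illustrative labels, and the numbers are the slide's own round figures (which it marks "at most" or "at least"), not measurements.

```python
# Round-number gains quoted on Slide 23 (labels are illustrative).
factors = {
    "moores_law": 100,         # "100X (at most)"
    "software": 10,            # "10X (at most)"
    "commodity_pricing": 100,  # "100X (at least)"
}

# Independent gains compound multiplicatively.
total = 1
for gain in factors.values():
    total *= gain

# Price ratios behind the commodity factor, using the dollar figures on the slide.
commodity_examples = {
    "dbms_entry_price": 100_000 / 1_000,  # 100K$ then vs 1k$ now
    "dollars_per_mips": 7_500 / 10,       # IBM 390 MIPS vs Intel MIPS
    "disk_dollars_per_gb": 1_500 / 50,    # 1,500$/GB vs commodity 50$/GB
}

print(total)
for name, ratio in commodity_examples.items():
    print(name, ratio)
```

The individual commodity ratios differ from one another, so the slide's 100X for commodity pricing reads as a round representative figure rather than a single measured ratio.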
  • Slide 27

Functionally Specialized Cards: Storage, Network, Display
Each card: P mips processor; M MB DRAM; ASIC
Today: P = 50 mips, M = 2 MB
In a few years: P = 200 mips, M = 64 MB

  • Slide 28

Implications
Conventional: offload device handling to NIC/HBA; higher level protocols: I2O, NASD, VIA; SMP and Cluster parallelism is important.
Radical: move app to NIC/device controller; higher-higher level protocols: DCOM; Cluster parallelism is VERY important.
Figure label: Central Processor & Memory

  • Slide 29

How Do They Talk to Each Other?
Each node has an OS
Each node has local resources: a federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other: DCOM? IIOP? RMI? One or all of the above.
Huge leverage in high-level interfaces.
Same old distributed system story.
Figure labels (shown for each of two nodes): Applications; RPC?; streams; datagrams; VIAL/VIPL; joined by Wire(s)

  • Slide 30

Disk = Node
has magnetic storage (100 GB?)
has processor & DRAM
has SAN attachment
has execution environment
Figure labels: OS Kernel; SAN driver; Disk driver; File System; RPC, ...
Services; DBMS; Applications

  • Slide 31

Scaleability: Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: cluster has inexpensive parts
Figure labels: SMP Super Server; Departmental Server; Personal System; Cluster of PCs

  • Slide 32

HotMail: ~300 Computers
FreeBSD and Solaris

  • Slide 33

Microsoft.com: ~150 nodes

  • Slide 34

Other Clusters
16-node Cluster: 64 cpus; 2 TB of disk; decision support
45-node Compaq Cluster: 140 cpus; 14 GB DRAM; 4 TB RAID disk; OLTP (Debit Credit); 1 B tpd (14 k tps)

  • Slide 35

Berkeley NOW (network of workstations) Project
http://now.cs.berkeley.edu/
105 nodes: Sun UltraSparc 170, 128 MB, 2x2GB disk
Myrinet interconnect (2x160MBps per node); SBus (30MBps) limited
GLUNIX layer above Solaris
Inktomi (HotBot search)
NAS Parallel Benchmarks
Crypto cracker
Sort 9 GB per second

  • Slide 36

NCSA Super Cluster
National Center for Supercomputing Applications, University of Illinois @ Urbana
512 Pentium II cpus, 2,096 disks, SAN
Compaq + HP + Myricom + WindowsNT
A Super Computer for 3M$
Classic Fortran/MPI programming
DCOM programming model
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html

  • Slide 37

Outline: The bandwidth revolution; ScaleUp, ScaleOut; TerraServer (Barclay, Slutz, Gray): a scaleup example

  • Slide 38

Some Tera-Byte Databases
Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
The Web: 1 TB of HTML
TerraServer: 1 TB of images
Several other 1 TB (file) servers
Hotmail: 7 TB of email
Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
EOS/DIS (picture of planet each week): 15 PB by 2007
Federal Clearing house: images of checks, 15 PB by 2006 (7 year history)
Nuclear Stockpile Stewardship Program: 10 Exabytes (???!!)
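To see where the databases on Slide 38 fall on the Kilo-to-Yotta scale, the sizes can be bucketed by prefix. A minimal sketch, assuming decimal prefixes (powers of 1000), which the slide does not specify; `prefix_for` and the dictionary keys are hypothetical names introduced for illustration.

```python
# The scale printed on the slide, smallest to largest.
PREFIXES = ["Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]


def prefix_for(num_bytes: int) -> str:
    """Return the largest prefix whose unit (1000**k bytes) fits in num_bytes."""
    name = "(bytes)"
    for power, prefix in enumerate(PREFIXES, start=1):
        if num_bytes >= 1000 ** power:
            name = prefix
    return name


# Sizes as quoted on Slide 38, converted to bytes with decimal units (assumption).
databases = {
    "The Web (HTML)": 1 * 10**12,
    "TerraServer images": 1 * 10**12,
    "Hotmail email": 7 * 10**12,
    "Sloan Digital Sky Survey (raw)": 40 * 10**12,
    "EOS/DIS by 2007": 15 * 10**15,
    "Nuclear Stockpile Stewardship": 10 * 10**18,
}

for name, size in databases.items():
    print(f"{name}: {prefix_for(size)}")
```

Integer arithmetic is used throughout so that sizes sitting exactly on a prefix boundary are classified without floating-point rounding.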
  • Slide 39

Info Capture
You can record everything you see or hear or read.
What would you do with it?
How would you organize & analyze it?
Video: 8 PB per lifetime (10GBph)
Audio: 30 TB (10KBps)
Read or write: 8 GB (words)
See: http://www.lesk.com/mlesk/ksg97/ksg.html
Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
Figure labels: A letter; A novel; A Movie; Library of Congress (text); LoC (image); All Disks; All Tapes

  • Slide 40

Michael Lesk's Points
www.lesk.com/mlesk/ksg97/ksg.html
Soon everything can be recorded and kept
Most data will never be seen by humans
Precious Resource: Human attention
Auto-Summarization and Auto-Search will be a key enabling technology.

  • Slide 41

The TerraServer
http://www.terraserver.microsoft.com/

  • Slide 42

Database & application UI
Concept: User navigates an almost seamless image of earth
Coverage: range from 70N to 70S; today: 35% U.S., 1% outside U.S.
Source Imagery: 4 TB, 1 sq meter/pixel Aerial (USGS - 60,000 46Mb B&W - 151Mb Color IR files); 1 TB, 1.56 meter/pixel Satellite (Spin-2 - 2400 300 Mb B&W)
Display Imagery: 200x200 pixel images, subsample to build image pyramid
Nav Tools: 1.5 m place names; Click-on Coverage map; Expedia & Virtual Globe map; Pick of the week
Figure labels: 1.6 x 1.6 km; city view; .8 x .8 km; 8m thumbnail; .4 x .4 km; browse; 200x200 m tile

  • Slide 43

Image Data
USGS DOQ: 4 TB; 6 TB Coming
DRG: 50,000 Topo Maps, adding now
Spin-2: 1 TB, WorldWide
New Data Coming

  • Slide 44

Software Architecture
Figure labels: The Internet; Web Client (IE, Netscape, HTML, Java Viewer); Image Delivery Application; Image Commerce Site(s) (SQL Server; SPIN-2/USGS Store Active Server Pages; Microsoft Site Server EE 3.0); Terra-Server Web Site (Internet Information Server 4.0; Terra-Server Active Server Pages; Active Data Object; ODBC); Terra-Server DB (SQL Server 7.0; Terra-Server Stored Procedures)
Numbers in figure: 35; 34; 13; 19; 24; 39 (14 Img) (8 Place)

  • Slide 45

How Images are Found
Coverage Map: 19%
Expedia Map: 22%
Name Search: 40%
Famous Places: 18%
Geo Coordinate: 1%

  • Slide 46

TerraServer: Lots of Web Hits
Today: 1.7 billion web hits
1 TB, largest SQL DB on the Web
100 qps average, 1,000 Qps peak
1.5 B SQL queries so far

Summary        Total    Average   Max
Unique Users   17 M     69 k      150 k
Sessions       24 M     94 k      172 k
Hits           1.7 B    6.8 M     29 M
Page Views     274 M    1.1 M     6.6 M
DB Queries     1.5 B    5.8 M     18 M
Image Xfers    1.3 B    5.0 M     15 M

As of Feb 28, 1999

  • Slide 47

Logical Schema: Image Data & Meta Data
Image data: lookup by UGrid or ZGrid ID plus resolution
Lookups are fast: indices are in DRAM (auto-magically by SQL)
SQL manages all the tiles and indices
Images are brought in on demand
Gazetteer: index on: image, place, type; image, state, type; image, state, country, type; image, place, state, type; image, place, country, type
All lookups are fast
Figure labels: Country Name; State Name; Place Name; PlaceType; Feature Type; Where Am I; Img Meta; Tile Meta; Jump Img; Browse Img; Tile Img; Theme Meta Information; Spin Frame Meta; Thumb Img

  • Slide 48

Image Load and Update
Figure labels: DLT Tape; tar; Metadata; Load DB; Active Server Pages; Cut & Load Scheduling System; Staging Disk; Image Cutter; Merge; Dither; Image Pyramid From base; JPEG tiles; TerraLoader; ODBC Tx; TerraServer SQL DBMS

  • Slide 49

TerraServer Administrator Web Site
Accessible by Microsoft, SPIN-2, and USGS
Web browser forms to: edit Famous Places list; modify Image Status fields; define new TerraServer Administrators

  • Slide 50

Backup and Recovery
Using Legato Networker integrated with SQL Backup/Restore Utility
Fast, incremental, differential, online
Restore: fast, incremental (file oriented), not online.
Figure labels: SQL Server Enterprise Manager; DBA Maintenance; SQL Performance Monitor; Load & Backup & Recovery

  • Slide 51

Site Configuration
Figure labels: 9710 TimberWolf; Enterprise Storage Array (9 HSZ70 Ultra-SCSI dual redundant controllers; 324 9 GB Seagate disks); Compaq 5500 4x200mhz Web Servers (label shown six times); To the Web

  • Slide 52

The Microsoft TerraServer Hardware
Compaq AlphaServer 8400
8x400Mhz Alpha cpus
10 GB DRAM
324 9.2 GB StorageWorks Disks: 3 TB raw, 2.4 TB of RAID5
STK 9710 tape robot (4 TB)
WindowsNT 4 EE, SQL Server 7.0

  • Slide 53

File System Config
Use StorageWorks to form 28 RAID5 sets
Each raid set has 11 disks (16 spare drives)
Use NTFS to form 4 595GB NT volumes, each striped over 7 Raid sets on 7 controllers
Create 26 20,000MB files on F:, 27 on G:
DB is File Group of 53 files (1.011TB)
Volumes: F:, G:, H:, I:

  • Slide 54

SQL 7 TerraServer Availability
Operating for 9 months: 6400 hrs
Unscheduled outage: 36.5 minutes (99.9905% scheduled up)
Scheduled outage: 60 minutes
Availability: 99.96% overall up
No NT failures (ever)
One SQL7 Beta2 bug
No failures in July, Aug, Oct, Dec, Jan, Feb, Mar

  • Slide 55

Things we did right...
Use a database to store images: simplify management; can dynamically load data into tables while viewing application is active
Simple X, Y Z-Grid navigation system
Used ImgStatus to control logical presence of the image in the app
Stitching tiles together from multiple input images to form seamless mosaic
Offering two forms of seamless: time based (SPIN-2) and theme based (DOQ)

  • Slide 56

TS 3: Things are changing...
Square Tiles, power of 2 size (200x200)
Power of 2 zoom levels (2:1, 4:1, 8:1, etc.), so uniform tile size on each zoom (variable ground size per tile)
Indexing system independent of tile size
Digital Raster Graphics (Topo maps)
Layered Maps (Topo merge with DOQ)
Integrate with other applications and services
Later: Digital Elevation Models (DEMs); other foreign data sources (EU, etc.)

  • Slide 57

What TerraServer Shows
Can serve huge databases on Internet for about a penny a page view, mostly phone bill (!).
Advertising pays more than a penny a page.
Commodity tools do scale fairly far.
A few people (3 developers, 1 operator) using power tools can build an impressive web site

  • Slide 58

Thank You!
SPIN-2
Tom Barclay did most of this app, Slutz and Gray helped.

  • Slide 59

Outline: The bandwidth revolution; ScaleUp, ScaleOut; TerraServer (Barclay, Slutz, Gray)

  • Slide 60

end

  • Slide 61

Windows NT Versus UNIX: Best Results on an SMP
SemiLog plot shows 3x (2 year) lead by UNIX
See www.tpc.org

  • Slide 62

TPC C Improvements (MS SQL)
250%/year on Price, 100%/year performance
40% hardware, 100% software, 100% PC Technology

  • Slide 63

Price Breakdown (6 months old)

  • Slide 64

(dis) Economy Of Scale
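Slide 56 describes power-of-2 zoom levels with a uniform tile size, which means the ground distance covered by one tile doubles at each level. The sketch below works that out under two assumptions taken from Slide 42 (200x200 pixel tiles and a 1 meter/pixel base resolution); the function name and the level numbering are illustrative and are not TerraServer's own scheme.

```python
TILE_PIXELS = 200          # tile edge in pixels (Slide 42: 200x200 pixel images)
BASE_METERS_PER_PIXEL = 1  # assumed base resolution (Slide 42: 1 meter/pixel aerial)


def tile_ground_size_m(zoom_level: int) -> int:
    """Ground edge length, in meters, of one tile at a power-of-2 zoom level.

    Level 0 is the base imagery. Each further level subsamples 2:1, so the
    same 200-pixel tile spans twice the ground distance of the level below.
    """
    meters_per_pixel = BASE_METERS_PER_PIXEL * 2 ** zoom_level
    return TILE_PIXELS * meters_per_pixel


for level in range(4):
    print(f"level {level}: {2 ** level} m/pixel, "
          f"tile covers {tile_ground_size_m(level)} m per side")
```

Under these assumptions the first four levels span 200 m, 400 m, 800 m, and 1,600 m per side, which is consistent with the tile and view extents labeled on Slide 42.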