1 bell & gray 4/15 / 95 parallel database systems a snap application gordon bell 450 old oak...

38
1 ell & Gray 4/15 / 95 Parallel Database Parallel Database Systems Systems A SNAP Application A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 [email protected] Jim Gray 310 Filbert, SF CA 94133 [email protected] Network Platform

Upload: hailey-lockhart

Post on 10-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

1Bell & Gray 4/15 / 95

Parallel Database SystemsParallel Database Systems A SNAP Application A SNAP Application

Gordon Bell

450 Old Oak CourtLos Altos, CA 94022

[email protected]

Jim Gray

310 Filbert, SF CA [email protected]

NetworkPlatform

Page 2: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

2Bell & Gray 4/15 / 95

OutlineOutline

• Cyberspace Pep Talk: •Databases are the dirt of Cyberspace•Billions of clients mean millions of servers

• Parallel Imperative:• Hardware trend: Many little devices• Consequence: Servers are arrays of commodity components• PC’s are the bricks of Cyberspace• Must automate parallel {design / operation / use}• Software parallelism via dataflow & Data Partitioning

• Parallel database techniques• Parallel execution of many little jobs (OLTP)• Data Partitioning• Pipeline Execution• Automation techniques)

• Summary

Page 3: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

3Bell & Gray 4/15 / 95

Kinds Of Information ProcessingKinds Of Information Processing

Point-to-Point Broadcast

Immediate

TimeShifted

conversationmoney

lectureconcert

mail booknewspaper

NetNetworkwork

DataDataBaseBase

Its ALL going electronicImmediate is being stored for analysis (so ALL database)Analysis & Automatic Processing are being added

Page 4: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

4Bell & Gray 4/15 / 95

Why Put Everything in Cyberspace?Why Put Everything in Cyberspace?

Low rentmin $/byte

Shrinks timenow or later

Shrinks spacehere or there

Automate processingknowbots

Point-to-Point OR Broadcast

Imm

edia

te O

R T

ime

Del

ayed

Network

DataBase

LocateProcessAnalyzeSummarize

Page 5: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

6Bell & Gray 4/15 / 95

Database Store ALL Data Database Store ALL Data TypesTypes

• The New World:•Billions of objects•Big objects (1MB)•Objects have behavior

(methods)

• The Old World:

– Millions of objects

– 100-byte objects

People

Name Address Papers Picture Voice

Mike

Won

David NY

Berk

Austin

People

Name Address

Mike

Won

David NY

Berk

Austin Paperless officeLibrary of congress onlineAll information online entertainment publishing businessInformation Network, Knowledge Navigator, Information at your fingertips

Page 6: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

7Bell & Gray 4/15 / 95

Magnetic Storage Cheaper than PaperMagnetic Storage Cheaper than Paper

• File Cabinet: cabinet (4 drawer) 250$paper (24,000 sheets) 250$space (2x3 @ 10$/ft2) 180$total 700$3 ¢/sheet

•Disk: disk (8 GB =) 4,000$ASCII: 4 m pages0.1 ¢/sheet (30x cheaper)

• Image: 200 k pages2 ¢/sheet (similar to paper)

• Store everything on disk

Page 7: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

8Bell & Gray 4/15 / 95

Cyberspace Demographics Cyberspace Demographics

• Computer History:

• most computers are smallNEXT: 1 Billion X for some X

(phone?)

• most of the money is inclients and wiring1990: 50% desktop1995: 75% desktop

1950 National Computer1960 Corporate Computer1970 Site Computer1980 Departmental Computer1990 Personal Computer2000 ?

100M

1M

1KSUPER MAIN

FRAMEMINI WS PC

po

pu

lati

on

1B$

100B$

10B$

Rev

enu

e

Page 8: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

9Bell & Gray 4/15 / 95

Billions of Clients Billions of Clients

• Every device will be “intelligent”•Doors, rooms, cars, ...

•Computing will be ubiquitous

Page 9: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

10Bell & Gray 4/15 / 95

Billions of Clients Need Billions of Clients Need Millions of ServersMillions of Servers

mobileclients

fixed clients

server

superserver

Clients

Servers

Super ServersLarge DatabasesHigh Traffic shared data

All clients are networked to serversmay be nomadic or on-demand

Fast clients want faster servers

Servers provide

data,

control,

coordination

communication

Page 10: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

17Bell & Gray 4/15 / 95

OutlineOutline

• Cyberspace Pep Talk: • Databases are the dirt of Cyberspace• Billions of clients mean millions of servers

• Parallel Imperative:•Hardware trend: Many little devices•Consequence: Server arrays of commodity parts•PC’s are the bricks of Cyberspace•Must automate parallel {design / operation / use}•Software parallelism via dataflow & Data Partitioning

• Parallel database techniques• Parallel execution of many little jobs (OLTP)• Data Partitioning• Pipeline Execution• Automation techniques)

• Summary

Page 11: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

18Bell & Gray 4/15 / 95

Hardware trends: Few generic parts: CPU RAM Disk & Tape arrays

ATM for LAN/WAN?? for CAN

?? for OS

These parts will be inexpensive (commodity components)Systems will be arrays of these partsSoftware challenge: how to program arrays

1 M$100 K$ 10 K$

Mainframe MiniMicro Nano

9"5.25" 3.5" 2.5" 1.8"

Moore’s Law RestatedMoore’s Law RestatedMany Little Won over Few BigMany Little Won over Few Big

Page 12: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

19Bell & Gray 4/15 / 95

Future SuperServerFuture SuperServer

Array of processors, disks, tapescomm lines

Challenge:How to program itMust use parallelism

Pipeline hide latency

Partitionbandwidthscaleup

1,000 discs = 10 Terrorbytes

100 Tape Transports

= 1,000 tapes = 1 PetaByte

100 Nodes 1 Tips

Hig

h S

pe

ed N

etw

ork

( 10

Gb

/s)

Page 13: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

21Bell & Gray 4/15 / 95

The Hardware is in Place and Then A Miracle Occurs

SNAPSNAPScaleable Network And Platforms

Commodity Distributed OSCommodity Distributed OS built onCommodity PlatformsCommodity Network Interconnect

?

Page 14: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

22Bell & Gray 4/15 / 95

Why Parallel Access To Data?Why Parallel Access To Data?

1 Terabyte

10 MB/s

At 10 MB/s1.2 days to scan

1 Terabyte

1,000 x parallel1.3 minute SCAN.

Parallelism: divide a big problem into many smaller ones

to be solved in parallel.

BANDWIDTH

Page 15: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

23Bell & Gray 4/15 / 95

DataFlow ProgrammingDataFlow ProgrammingPrefetch & Postwrite Hide Latency Prefetch & Postwrite Hide Latency

• Can't wait for the data to arrive• Need a memory that gets the data in advance ( 100MB/S)

• Solution: •Pipeline from source (tape, disc, ram...) to cpu cache•Pipeline results to destination LATENCY

Page 16: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

24Bell & Gray 4/15 / 95

The New Law of ComputingThe New Law of Computing

Grosch's Law:

Parallel Law: Needs

Linear Speedup and Linear ScaleupNot always possible

1 MIPS1 $

1,000 $

1,000 MIPS

2x $ is 2x performance

1 MIPS1 $

1,000 MIPS 32 $.03$/MIPS

2x $ is 4x performance

Page 17: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

25Bell & Gray 4/15 / 95

Parallelism: Performance is the GoalParallelism: Performance is the Goal

Goal is to get 'good' performance.

Law 1: parallel system should be faster than serial system

Law 2: parallel system should give near-linear scaleup or

near-linear speedup orboth.

Parallelism is faster, not cheaper:trades money for time.

Page 18: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

27Bell & Gray 4/15 / 95

The Perils of ParallelismThe Perils of Parallelism

Startup: Creating processesOpening filesOptimization

Interference: Device (cpu, disc, bus)logical (lock, hotspot, server, log,...)

Skew: If tasks get very small, variance > service time

Processors & Discs

The Good Speedup Curve

Processors & Discs

Three Perils

Sta

rtu

p

Inte

rfer

enc

e

Ske

w

Processors & Discs

A Bad Speedup Curve

Linearity

No Parallelism Benefit

Page 19: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

29Bell & Gray 4/15 / 95

Kinds of Parallel ExecutionKinds of Parallel Execution

Pipeline

Partition outputs split N ways inputs merge M ways

Any Sequential Program

Any Sequential Program

SequentialSequential

SequentialSequential Any Sequential Program

Any Sequential Program

Page 20: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

30Bell & Gray 4/15 / 95

Data RiversData Rivers Split + Merge StreamsSplit + Merge Streams

River

M ConsumersN producers

Producers add records to the river, Consumers consume records from the riverPurely sequential programming.River does flow control and buffering

does partition and merge of data recordsRiver = Exchange operator in Volcano.

N X M Data Streams

Page 21: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

31Bell & Gray 4/15 / 95

Partitioned Data and ExecutionPartitioned Data and Execution

A...E F...J K...N O...S T...Z

A Table

Count Count Count Count Count

Count

Spreads computation and IO among processors

Partitioned data gives NATURAL execution parallelism

Page 22: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

32Bell & Gray 4/15 / 95

Partitioned + Merge + Pipeline Partitioned + Merge + Pipeline Execution Execution

A...E F...J K...N O...S T...Z

Merge

Join

Sort

Join

Sort

Join

Sort

Join

Sort

Join

Sort

Pure dataflow programming Gives linear speedup & scaleup

But, top node may be bottleneckSo....

Page 23: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

33Bell & Gray 4/15 / 95

N xM way ParallelismN xM way Parallelism

A...E F...J K...N O...S T...Z

Merge

Join

Sort

Join

Sort

Join

Sort

Join

Sort

Join

Sort

Merge Merge

N inputs, M outputs, no bottlenecks.

Page 24: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

34Bell & Gray 4/15 / 95

Why are Relational OperatorsWhy are Relational OperatorsSuccessful for Parallelism?Successful for Parallelism?

Relational data model uniform operatorson uniform data streamClosed under composition

Each operator consumes 1 or 2 input streamsEach stream is a uniform collection of dataSequential data in and out: Pure dataflow

partitioning some operators (e.g. aggregates, non-equi-join, sort,..)

requires innovation

AUTOMATIC PARALLELISM

Page 25: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

35Bell & Gray 4/15 / 95

SQL SQL a NonProcedural Programming Languagea NonProcedural Programming Language

• SQL: functional programming language describes answer set.

• Optimizer picks best execution plan•Picks data flow web (pipeline), •degree of parallelism (partitioning)•other execution parameters (process placement, memory,...)

GUI

Schema

Plan

Monitor

Optimizer

ExecutionPlanning

Rivers

Executors

Page 26: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

36Bell & Gray 4/15 / 95

Database Systems “Hide” Parallelism Database Systems “Hide” Parallelism

•Automate system management via tools•data placement•data organization (indexing)•periodic tasks (dump / recover / reorganize)

•Automatic fault tolerance•duplex & failover• transactions

•Automatic parallelism•among transactions (locking)•within a transaction (parallel execution)

Page 27: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

37Bell & Gray 4/15 / 95

Success StoriesSuccess Stories

•Online Transaction Processing •many little jobs•SQL systems support 3700 tps-A

(24 cpu, 240 disk)

•SQL systems support 21,000 tpm-C (110 cpu, 800 disk)

•Batch (decision support and Utility)• few big jobs, parallelism inside•Scan data at 100 MB/s•Linear Scaleup to 50 processors

tran

sact

ion

s /

sec

hardware

recs

/ se

c

hardware

Page 28: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

38Bell & Gray 4/15 / 95

Kinds of Partitioned DataKinds of Partitioned Data

Split a SQL table to subset of nodes & disks

Partition within set:Range Hash Round Robin

A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z

Good for equijoins, range queriesgroup-by

Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning

Good for equijoins Good to spread load

Page 29: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

41Bell & Gray 4/15 / 95

Picking Data RangesPicking Data Ranges

Disk PartitioningFor range partitioning, sample load on disks.

Cool hot disks by making range smallerFor hash partitioning,

Cool hot disks by mapping some buckets to others

River PartitioningUse hashing and assume uniform If range partitioning, sample data and use

histogram to level the bulk

Teradata, Tandem, Oracle use these tricks

Page 30: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

42Bell & Gray 4/15 / 95

Parallel Data ScanParallel Data Scan

Select imagefrom landsatwhere date between 1970 and 1990and overlaps(location, :Rockies) and snow_cover(image) >.7;

Temporal

Spatial

Image

date loc image

Landsat

1/2/72.........4/8/95

33N120W.......34N120W

Assign one process per processor/disk:find images with right data & locationanalyze image, if 70% snow, return it

image

Answer

date, location, & image tests

Page 31: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

44Bell & Gray 4/15 / 95

Parallel AggregatesParallel Aggregates

For aggregate function, need a decomposition strategy:

count(S) = count(s(i)), ditto for sum()avg(S) = ( sum(s(i))) / count(s(i))and so on...

For groups,sub-aggregate groups close to the sourcedrop sub-aggregates into a hash river.

A...E F...J K...N O...S T...Z

A Table

Count Count Count Count Count

Count

Page 32: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

46Bell & Gray 4/15 / 95

Sub-sortsgenerateruns

Mergeruns

Range or Hash Partition River

River is range or hash partitioned

Scan or other source

Parallel SortParallel Sort

M input N output Sort design

Disk and mergenot needed if sort fits in memory

Scales linearly because

6

12= => 2x slowerlog(10 ) 6

log(10 ) 12

Sort is benchmark from hell for shared nothing machinesnet traffic = disk bandwidth, no data filtering at the source

Page 33: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

47Bell & Gray 4/15 / 95

Blocking Operators =Short PiplelinesBlocking Operators =Short Piplelines

An operator is blocking, if it does not produce any output, until it has consumed all its input

Examples:Sort, Aggregates, Hash-Join (reads all of one operand)

Blocking operators kill pipeline parallelismMake partition parallelism all the more important.

Sort RunsScan

Sort Runs

Sort Runs

Sort Runs

Tape File SQL Table Process

Merge Runs

Merge Runs

Merge Runs

Merge Runs

Table Insert

Index Insert

Index Insert

Index Insert

SQL Table

Index 1

Index 2

Index 3

Database LoadTemplate hasthree blocked phases

Page 34: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

50Bell & Gray 4/15 / 95

Hash JoinHash Join

Hash smaller table into N buckets (hope N=1)

If N=1 read larger table, hash to smallerElse, hash outer to disk then

bucket-by-bucket hash join.

Purely sequential data behavior

Always beats sort-merge and nestedunless data is clustered.

Good for equi, outer, exclusion joinLots of papers,

products just appearing (what went wrong?)

Hash reduces skew

Right Table

LeftTable

HashBuckets

Page 35: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

51Bell & Gray 4/15 / 95

Observation: Execution “easy”Observation: Execution “easy”Automation “hard”Automation “hard”

It is “easy” to build a fast parallel execution environment(no one has done it, but it is just programming)

It is hard to write a robust and world-class query optimizer.There are many tricksOne quickly hits the complexity barrier

Common approach:Pick best sequential planPick degree of parallelism based on bottleneck analysisBind operators to processPlace processes at nodesPlace scratch files near processesUse memory as a constraint

Page 36: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

52Bell & Gray 4/15 / 95

Systems That Work This WaySystems That Work This Way

Shared NothingTeradata: 400 nodesTandem: 110 nodesIBM / SP2 / DB2: 48 nodesATT & Sybase 112 nodesInformix/SP2 48 nodes

Shared DiskOracle 170 nodesRdb 24 nodes

Shared MemoryInformix 9 nodes RedBrick ? nodes

CLIENTS

MemoryProcessors

CLIENTS

CLIENTS

Page 37: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

53Bell & Gray 4/15 / 95

Research Research ProblemsProblems

• Automatic data placement (partition: random or organized)

• Automatic parallel programming (process placement)

• Parallel concepts, algorithms & tools

• Parallel Query Optimization

• Execution Techniques load balance, checkpoint/restart, pacing,

1,000 discs = 10 Terrorbytes

100 Tape Transports

= 1,000 tapes = 1 PetaByte

100 Nodes 1 Tips

Hig

h S

pe

ed N

etw

ork

( 10

Gb

/s)

Page 38: 1 Bell & Gray 4/15 / 95 Parallel Database Systems A SNAP Application Gordon Bell 450 Old Oak Court Los Altos, CA 94022 GBell@Microsoft.com Jim Gray 310

54Bell & Gray 4/15 / 95

SummarySummary

• Cyberspace is Growing• Databases are the dirt of cybersspace

PCs are the bricks, Networks are the morter. Many little devices: Performance via Arrays of {cpu, disk ,tape}

• Then a miracle occurs: a scaleable distributed OS and net•SNAP: Scaleable Networks and Platforms

• Then parallel database systems give software parallelism•OLTP: lots of little jobs run in parallel•Batch TP: data flow & data partitioning•Automate processor & storage array administration •Automate processor & storage array programming

• 2000 platforms as easy as 1 platform.