CS 347 Lecture 9B 1
CS 347: Parallel and Distributed
Data Management
Notes X: S4
Hector Garcia-Molina
Material based on:
S4
• Platform for processing unbounded data streams
  – general purpose
  – distributed
  – scalable
  – partially fault tolerant
(whatever this means!)
Data Stream
Example events:
  [A=34, B=abcd, C=15, D=TRUE]
  [A=3, B=def, C=0, D=FALSE]
  [A=34, B=abcd, C=78]
Terminology: event (data record), key, attribute
Question: Can an event have duplicate attributes for the same key? (I think so...)

Stream is unbounded, generated by, say, user queries, purchase transactions, phone calls, sensor readings, ...
S4 Processing Workflow
user-specified “processing unit”

[A=34, B=abcd, C=15, D=TRUE]
Inside a Processing Unit
[Diagram: a processing unit contains processing elements PE1, PE2, ..., PEn, one per key value (key=a, key=b, ..., key=z)]
Example:
• Stream of English quotes
• Produce a sorted list of the top K most frequent words
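The top-K example above can be sketched as two kinds of S4-style processing elements: one keyed PE per word that counts occurrences, and a single PE that maintains the top-K list. This is a minimal, hypothetical sketch; the class and function names are illustrative, not S4's actual API.

```python
class WordCountPE:
    """One PE per word (keyed PE): counts occurrences of its word."""
    def __init__(self, word):
        self.word = word
        self.count = 0
    def on_event(self):
        self.count += 1
        return (self.word, self.count)

class TopKPE:
    """Single PE that maintains the current top-K list."""
    def __init__(self, k):
        self.k = k
        self.counts = {}
    def on_event(self, word, count):
        self.counts[word] = count
        return sorted(self.counts.items(), key=lambda wc: -wc[1])[:self.k]

def process_stream(quotes, k):
    word_pes = {}          # PEs created dynamically, one per key (word)
    topk = TopKPE(k)
    result = []
    for quote in quotes:
        for word in quote.lower().split():
            pe = word_pes.setdefault(word, WordCountPE(word))
            result = topk.on_event(*pe.on_event())
    return result
```

Note the `setdefault` call: PEs come into existence the first time their key value is seen, mirroring the dynamic PE creation described later.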
Processing Nodes
[Diagram: events are routed to processing nodes by hash(key), with hash values 1..m selecting among m nodes; inside a node, PEs handle key=a, key=b, ...]
Dynamic Creation of PEs
• As a processing node sees new key attributes, it dynamically creates new PEs to handle them
• Think of PEs as threads
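Putting the two slides together, routing plus dynamic PE creation can be sketched as below. This is an illustrative toy, not S4 code; the node count and the PE state layout are assumptions.

```python
M = 4  # number of processing nodes (illustrative)

nodes = [dict() for _ in range(M)]   # node -> {key: PE state}

def route(event):
    """Route an event to a node by hashing its key; create the PE lazily."""
    key = event["key"]
    node = nodes[hash(key) % M]
    pe_state = node.setdefault(key, {"count": 0})   # dynamic PE creation
    pe_state["count"] += 1
    return pe_state["count"]
```

Every event with the same key lands on the same node and the same PE, which is what makes per-key computation possible without coordination.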
Another View of Processing Node
Failures
• Communication layer detects node failures and provides failover to standby nodes
• What happens to events in transit during a failure? (My guess: events are lost!)
How do we do DB operations on top of S4?
• Selects & projects are easy!
• What about joins?
What is a Stream Join?
• For a true join, we would need to store all inputs forever! Not practical...
• Instead, define a window join:
  – at time t a new R tuple arrives
  – it only joins with the previous w S tuples
[Diagram: two input streams, R and S]
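The window join just defined can be sketched directly: keep only the last w tuples of each stream, and join each arriving tuple against the other stream's window. A minimal illustrative sketch (symmetric in R and S); names are made up.

```python
from collections import deque

def make_window_join(w):
    r_window, s_window = deque(maxlen=w), deque(maxlen=w)
    def on_tuple(rel, tup):
        if rel == "R":
            r_window.append(tup)
            return [(tup, s) for s in s_window]   # join with last w S tuples
        else:
            s_window.append(tup)
            return [(r, tup) for r in r_window]   # join with last w R tuples
    return on_tuple
```

`deque(maxlen=w)` silently discards the oldest tuple on overflow, which is exactly the window semantics.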
One Idea for Window Join
Streams R and S; example events:
  [key=2, rel=R, C=15, D=TRUE]
  [key=2, rel=S, C=0, D=FALSE]
code for PE:
  for each event e:
    if e.rel = R:
      store e in Rset (last w)
      for s in Sset: output join(e, s)
    else ...

“key” is the join attribute
One Idea for Window Join
Is this right??? (enforcing the window on a per-key-value basis)

Maybe add sequence numbers to events to enforce the correct window?
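The sequence-number suggestion can be sketched as follows: stamp each event with a global sequence number at the source, so a keyed PE can enforce the window over the whole stream rather than per key value. This is an illustrative sketch of that idea, not anything S4 provides.

```python
def make_keyed_pe(w):
    """One PE per join-key value; w is the global window size."""
    s_set = []  # [(seqno, s_tuple)] S events seen by this PE
    def on_event(seqno, rel, tup):
        if rel == "R":
            # join only with S tuples whose global seqno is inside [seqno-w, seqno)
            return [(tup, s) for sn, s in s_set if seqno - w <= sn < seqno]
        s_set.append((seqno, tup))
        return []
    return on_event
```

Because the window test uses the global sequence number, an S tuple that is recent *for this key* but old *in the stream* is correctly excluded.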
Another Idea for Window Join
Streams R and S; example events:
  [key=fake, rel=R, C=15, D=TRUE]
  [key=fake, rel=S, C=0, D=FALSE]
code for PE:
  for each event e:
    if e.rel = R:
      store e in Rset (last w)
      for s in Sset:
        if e.C = s.C then output join(e, s)
    else if e.rel = S ...

All R & S events have “key=fake”; say the join key is C.
Another Idea for Window Join
All R & S events have “key=fake”; say the join key is C. The entire join is done in one PE: no parallelism!
Do You Have a Better Idea for Window Join?
Streams R and S; example events:
  [key=?, rel=R, C=15, D=TRUE]
  [key=?, rel=S, C=0, D=FALSE]
Final Comment: Managing state of PE
Who manages state?
  S4: the user does
  Muppet: the system does
Is state persistent?

[A=34, B=abcd, C=15, D=TRUE]
state
CS 347: Parallel and Distributed
Data Management
Notes X: Hyracks
Hector Garcia-Molina
Hyracks
• Generalization of map-reduce
• Infrastructure for “big data” processing
• Material here based on:
Appeared in ICDE 2011
A Hyracks data object:
• Records partitioned across N sites
• Simple record schema is available (more than just key-value)
Operations
[Diagram: an operator, with a distribution rule on its outputs]
Operations (parallel execution)
[Diagram: operators op 1, op 2, op 3 running in parallel]
Example: Hyracks Specification
Example: Hyracks Specification
initial partitions; input & output is a set of partitions

mapping to new partitions; 2 replicated partitions
Notes
• Job specification can be done manually or automatically
Example: Activity Node Graph
Example: Activity Node Graph
sequencing constraint
activities
Example: Parallel Instantiation
Example: Parallel Instantiation
stage (starts after its input stages finish)
System Architecture
Map-Reduce on Hyracks
Library of Operators:
• File readers/writers
• Mappers
• Sorters
• Joiners (various types)
• Aggregators
• Can add more
Library of Connectors:
• N:M hash partitioner
• N:M hash-partitioning merger (input sorted)
• N:M range partitioner (with partition vector)
• N:M replicator
• 1:1
• Can add more!
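The first connector in the list can be sketched in a few lines: each of N sender partitions routes a record to one of M receivers by hashing a field. A toy illustration (field choice and counts are assumptions, not Hyracks API):

```python
def hash_partition(sender_outputs, m, field=0):
    """N:M hash partitioner: sender_outputs is one record list per sender."""
    receivers = [[] for _ in range(m)]
    for records in sender_outputs:
        for rec in records:
            # all records with the same field value go to the same receiver
            receivers[hash(rec[field]) % m].append(rec)
    return receivers
```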
Hyracks Fault Tolerance: Work in progress?
Hyracks Fault Tolerance: Work in progress?
save output to files
The Hadoop/MR approach: save partial results to persistent storage; after a failure, redo all the work needed to reconstruct the missing data
Hyracks Fault Tolerance: Work in progress?
Can we do better? Maybe: each process retains previous results until no longer needed?
Pi output: r1, r2, r3, r4, r5, r6, r7, r8
(older results have “made way” to the final result; the rest are the current output)
CS 347: Parallel and Distributed
Data Management
Notes X: Pregel
Hector Garcia-Molina
Material based on:
• In SIGMOD 2010
• Note there is an open-source version of Pregel called GIRAPH
Pregel
• A computational model/infrastructurefor processing large graphs
• Prototypical example: Page Rank
[Diagram: example graph with vertices a, b, d, e, f, x; a and b point to x]

PR[i+1, x] = f(PR[i, a]/na, PR[i, b]/nb)
Pregel
• Synchronous computation in iterations
• In one iteration, each node:
  – gets messages from neighbors
  – computes
  – sends data to neighbors
Pregel vs Map-Reduce/S4/Hyracks/...
• In Map-Reduce, S4, Hyracks, ..., the workflow is separate from the data
• In Pregel, the data (the graph) drives the data flow
Pregel Motivation
• Many applications require graph processing
• Map-Reduce and other workflow systems are not a good fit for graph processing
• Need to run graph algorithms on many processors
Example of Graph Computation
Termination
• After each iteration, each vertex votes to halt or not
• If all vertexes vote to halt, the computation terminates
Vertex Compute (simplified)
• Available data for iteration i:
  – InputMessages: { [from, value] }
  – OutputMessages: { [to, value] }
  – OutEdges: { [to, value] }
  – MyState: value
• OutEdges and MyState are remembered for next iteration
Max Computation
• change := false
• for [f, w] in InputMessages do
    if w > MyState.value then [ MyState.value := w; change := true ]
• if (superstep = 1) OR change then
    for [t, w] in OutEdges do add [t, MyState.value] to OutputMessages
  else vote to halt
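The Max computation above can be run in a tiny Pregel-style loop: synchronous supersteps, vote-to-halt, and reactivation on message arrival. A minimal in-memory sketch under those assumptions (not Pregel's actual API):

```python
def pregel_max(out_edges, init):
    """out_edges: vertex -> list of targets; init: vertex -> initial value."""
    state = dict(init)                      # vertex -> MyState.value
    inbox = {v: [] for v in state}
    superstep, active = 1, set(state)
    while active:
        outbox = {v: [] for v in state}
        next_active = set()
        for v in state:
            change = False
            for w in inbox[v]:              # process InputMessages
                if w > state[v]:
                    state[v], change = w, True
            if superstep == 1 or change:    # send on OutEdges, stay active
                for t in out_edges.get(v, []):
                    outbox[t].append(state[v])
                next_active.add(v)
            # else: vote to halt
        inbox, active = outbox, next_active
        # a halted vertex is reactivated by an incoming message
        active |= {v for v in state if inbox[v]}
        superstep += 1
    return state
```

On the chain a → b → c with values 3, 1, 2, the maximum propagates down the chain and every vertex converges to 3.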
Page Rank Example
Page Rank Example
[Figure annotations: iteration count; iterate through InputMessages; MyState.value; shorthand: send the same message to all targets]
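The PageRank program on the slide can be sketched in the same vertex-centric style, following the Pregel paper's example: run a fixed number of supersteps; each vertex sums its incoming messages, updates to 0.15/N + 0.85 × sum, and sends PR/num_out_edges to its neighbors. Graph and iteration count here are illustrative.

```python
def pregel_pagerank(out_edges, supersteps=30):
    n = len(out_edges)
    pr = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t] += pr[v] / len(targets)   # same message to all targets
        pr = {v: 0.15 / n + 0.85 * inbox[v] for v in out_edges}
    return pr
```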
Single-Source Shortest Paths
Architecture
[Diagram: master coordinates workers A, B, C; input data 1 and input data 2]
graph has nodes a, b, c, d, ...; sample record: [a, value]
Architecture
[Diagram: worker A gets vertexes a, b, c; worker B gets d, e; worker C gets f, g, h]
partition the graph and assign vertexes to workers
Architecture
read input data: worker A forwards input values to the appropriate workers
Architecture
run superstep 1
Architecture
at the end of superstep 1, workers send messages and report to the master: halt?
Architecture
run superstep 2
Architecture
checkpoint
Architecture
checkpoint: write MyState, OutEdges, and InputMessages (or OutputMessages) to stable store
Architecture
if a worker dies, find a replacement & restart from the latest checkpoint
Interesting Challenge
• How best to partition graph for efficiency?
[Diagram: example graph with vertices a, b, d, e, f, g, x]
CS 347: Parallel and Distributed
Data Management
Notes X: BigTable, HBase, Cassandra
Hector Garcia-Molina
Sources
• HBASE: The Definitive Guide, Lars George, O’Reilly Publishers, 2011.
• Cassandra: The Definitive Guide, Eben Hewitt, O’Reilly Publishers, 2011.
• BigTable: A Distributed Storage System for Structured Data, F. Chang et al, ACM Transactions on Computer Systems, Vol. 26, No. 2, June 2008.
Lots of Buzz Words!
• “Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon’s dynamo and its data model on Google’s Big Table.”
• Clearly, it is buzz-word compliant!!
Basic Idea: Key-Value Store
Table T:
  key   value
  k1    v1
  k2    v2
  k3    v3
  k4    v4
Basic Idea: Key-Value Store
keys are sorted
• API:
  – lookup(key) → value
  – lookup(key range) → values
  – getNext → value
  – insert(key, value)
  – delete(key)
• Each row has a timestamp
• Single-row actions are atomic
  (but not persistent in some systems?)
• No multi-key transactions
• No query language!
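Because the keys are kept sorted, range lookups are cheap. A toy in-memory sketch of the API bullets above (purely illustrative, not any real system's interface):

```python
import bisect

class KVTable:
    def __init__(self):
        self.keys, self.values = [], []          # kept sorted by key
    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value               # overwrite existing key
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)
    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        return self.values[i] if i < len(self.keys) and self.keys[i] == key else None
    def lookup_range(self, lo, hi):              # sorted keys -> cheap range scan
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return self.values[i:j]
    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i]
            del self.values[i]
```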
Fragmentation (Sharding)
Table (k1–k10) split into tablets:
  server 1: k1–k4
  server 2: k5–k6
  server 3: k7–k10
• use a partition vector
• “auto-sharding”: the vector is selected automatically
tablet
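The partition vector idea reduces to a binary search over split points: the vector's entries mark where one server's key range ends and the next begins. A small sketch with illustrative split points (note the comparison is lexicographic on strings here):

```python
import bisect

partition_vector = ["k5", "k7"]     # split points (illustrative)
servers = ["server 1", "server 2", "server 3"]

def server_for(key):
    # keys < k5 -> server 1; k5 <= keys < k7 -> server 2; the rest -> server 3
    return servers[bisect.bisect_right(partition_vector, key)]
```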
Tablet Replication
Tablet (k7–k10) replicated: server 3 (primary), server 4 (backup), server 5 (backup)
• Cassandra:
  – Replication Factor (# of copies)
  – R/W Rule: One, Quorum, All
  – Placement Policy (e.g., Rack Unaware, Rack Aware, ...)
  – Read all copies (return the fastest reply; do repairs if necessary)
• HBase: Does not manage replication, relies on HDFS
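The R/W rule matters because, with N replicas, a read is guaranteed to overlap the latest write only when R + W > N. A small sketch of that check, with the usual meanings of One/Quorum/All (constants illustrative):

```python
LEVELS = {
    "One": lambda n: 1,
    "Quorum": lambda n: n // 2 + 1,
    "All": lambda n: n,
}

def overlapping(read_level, write_level, n):
    """True if every read set intersects every write set (R + W > N)."""
    r, w = LEVELS[read_level](n), LEVELS[write_level](n)
    return r + w > n
```

For example, Quorum reads plus Quorum writes on 3 replicas overlap; One plus One does not, which is why the weaker levels need read repair.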
Need a “directory”
• For each table name and key: the server that stores the key, plus its backup servers
• Can be implemented as a special table.
Tablet Internals
memory segment: [k3=v3, k8=v8, k9=delete, k15=v15]
disk segments: [k2=v2, k6=v6, k9=v9, k12=v12], [k4=v4, k5=delete, k10=v10, k20=v20, k22=v22]
Design Philosophy (?): The primary scenario is where all data is in memory. Disk storage was added as an afterthought.
Tablet Internals
flush periodically
• a tablet is the merge of all segments (files)
• disk segments are immutable
• writes are efficient; reads are only efficient when all data is in memory
• periodically reorganize into a single segment
tombstone
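Reading a tablet as "the merge of all segments" can be sketched directly: newer segments win, and "delete" tombstones hide older values. An illustrative sketch (segment layout assumed, not any system's on-disk format):

```python
TOMBSTONE = "delete"

def merge_segments(segments):
    """segments: oldest-first list of {key: value-or-TOMBSTONE} dicts."""
    merged = {}
    for seg in segments:          # later (newer) segments overwrite earlier ones
        merged.update(seg)
    # tombstones hide deleted keys in the merged view
    return {k: v for k, v in merged.items() if v != TOMBSTONE}
```

Using the segment contents from the slide, k5 and k9 disappear from the merged tablet because their newest entries are tombstones.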
Column Family
K    A     B     C     D     E
k1   a1    b1    c1    d1    e1
k2   a2    null  c2    d2    e2
k3   null  null  null  d3    e3
k4   a4    b4    c4    e4    e4
k5   a5    b5    null  null  null
Column Family
• for storage, treat each row as a single “super value”
• the API provides access to sub-values
  (use family:qualifier to refer to sub-values, e.g., price:euros, price:dollars)
• Cassandra allows a “super-column”: two-level nesting of columns (e.g., column A can have sub-columns X & Y)
Vertical Partitions
can be manually implemented as vertical partitions:
  K–A: k1=a1, k2=a2, k4=a4, k5=a5
  K–B: k1=b1, k4=b4, k5=b5
  K–C: k1=c1, k2=c2, k4=c4
  K–D,E: k1=(d1,e1), k2=(d2,e2), k3=(d3,e3), k4=(e4,e4)
Vertical Partitions
column family
• good for sparse data
• good for column scans
• not so good for tuple reads
• are atomic updates to a row still supported?
• the API supports actions on the full table, mapped to actions on the column tables
• the API supports column “project”
• to decide on a vertical partition, you need to know the access patterns
Failure Recovery (BigTable, HBase)
[Diagram: tablet server (memory + write-ahead log) storing data in GFS or HDFS; the master node pings tablet servers; a spare tablet server stands by]
Failure recovery (Cassandra)
• No master node, all nodes in “cluster” equal
access any table in the cluster at any server; that server sends requests to the other servers
CS 347: Parallel and Distributed
Data Management
Notes X: MemCacheD
Hector Garcia-Molina
MemCacheD
• General-purpose distributed memory caching system
• Open source
What MemCacheD Should Be (but ain't)
[Diagram: data sources 1–3, each with a cache (caches 1–3), together presented as one distributed cache; client calls get_object(X)]
What MemCacheD Is (as far as I can tell)
[Diagram: independent caches 1–3 next to data sources 1–3; the client calls put(cache 1, MyName, X) and later get_object(cache 1, MyName)]
What MemCacheD Is
the cache can purge MyName at any time

each cache is a hash table of (name, value) pairs

no connection between the caches
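The picture above can be sketched as follows: the application picks the cache server itself (e.g., by hashing the name), and each cache is just a bounded hash table that may evict entries at any time. Everything here is illustrative, not the actual memcached protocol.

```python
from collections import OrderedDict

class Cache:
    """A bounded (name, value) hash table with LRU-ish eviction."""
    def __init__(self, capacity):
        self.capacity, self.table = capacity, OrderedDict()
    def put(self, name, value):
        self.table.pop(name, None)
        self.table[name] = value
        if len(self.table) > self.capacity:    # evict: "purge whenever"
            self.table.popitem(last=False)
    def get(self, name):
        return self.table.get(name)            # None on a miss

caches = [Cache(capacity=2) for _ in range(3)]

def cache_for(name):
    return caches[hash(name) % len(caches)]    # client-side server choice
```

The key point is that the caches never talk to each other or to the data sources; routing and cache-miss handling are entirely the client's job.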
CS 347: Parallel and Distributed
Data Management
Notes X: ZooKeeper
Hector Garcia-Molina
ZooKeeper
• Coordination service for distributed processes
• Provides clients with a high-throughput, highly available, memory-only file system
[Diagram: clients connect to ZooKeeper, which stores a tree of znodes rooted at /, e.g., znode /a/d/e (each znode has state)]
ZooKeeper Servers
[Diagram: clients connect to ZooKeeper servers; each server holds a replica of the state]
ZooKeeper Servers
read: handled locally by the server the client is connected to
ZooKeeper Servers
write: propagated & synced to all replicas

writes are totally ordered, using the Zab algorithm
Failures
if your server dies, just connect to a different one!
ZooKeeper Notes
• Differences from a file system:
  – all nodes can store data
  – storage size is limited
• API: insert node, read node, read children, delete node, ...
• Can set triggers (watches) on nodes
• Clients and servers must know all servers
• ZooKeeper works as long as a majority of servers are available
• Writes are totally ordered; reads are ordered w.r.t. writes
CS 347: Parallel and Distributed
Data Management
Notes X: Kestrel
Hector Garcia-Molina
Kestrel
• Kestrel server handles a set of reliable, ordered message queues
• A server does not communicate with other servers (advertised as good for scalability!)
[Diagram: clients talk to independent Kestrel servers, each holding its own queues]
Kestrel
[Diagram: one client does put q, another does get q; each server journals its queues to a log]
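A "reliable, ordered message queue" backed by a log, as pictured, can be sketched as an in-memory deque plus an append-only journal that is replayed after a crash. File handling is simplified away here; all names are illustrative, not Kestrel's actual implementation.

```python
from collections import deque

class JournaledQueue:
    def __init__(self, journal=None):
        self.items = deque()
        self.journal = journal if journal is not None else []
        for op, item in self.journal:      # replay the log on startup
            if op == "put":
                self.items.append(item)
            else:
                self.items.popleft()
    def put(self, item):
        self.journal.append(("put", item)) # log the operation, then apply it
        self.items.append(item)
    def get(self):
        item = self.items.popleft()
        self.journal.append(("get", None))
        return item
```

Replaying every `put` and `get` reconstructs exactly the items still in the queue, in order, which is what makes the queue survive a server restart.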