Remote Reference Counting
Distributed Garbage Collection with Low Communication and Computation Overhead
www.cs.technion.ac.il/~assaf/publications/gc.ps
TRANSCRIPT
Distributed Systems
• Consist of nodes:
– Lowest level: local address space
– Next level: disk partition, processor
– Top level: local net
• Interaction through message passing
• Failures:
– Due to hardware or software problems
– Disconnection: due to network overload, reboot, ...
Distributed GC
• Motivations:
– Transparent object management
– Storage management is complex - not to be handled by users
• Goals:
– Efficiency
– Scalability
– Fault tolerance
Distributed GC
• The main problem:
– A section of GC code running on one node must verify that no other node needs an object before collecting it
• Result:
– Many modules must cooperate closely, leading to a tight binding between supposedly independent modules
Distributed GC
• Problems with simple approaches:
– Determining the status of a remote node is costly
– Asynchronous systems ⟹ inconsistent data
– Failures
Remote References
• Terminology:
– Owner - the node that contains the object
– Client - a node that has a reference to the object
• Creation:
– A reference to an object crosses node boundaries
– Side effect of message passing
• Duplication:
– A client of a remote object sends a receiver node a reference to that object
Naive Reference Counting
• Keep a reference count for each object
• Upon duplication or creation, inform the owner to update the counter by sending it a control message
• Problems:
– Increases communication overhead
– Loss or duplication of messages
– Race between decrement/increment messages
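The decrement/increment race can be seen in a minimal sketch (illustrative, not the paper's code): the owner's counter for object V starts at 1 because node RA holds the only reference; RA forwards the reference to RC and deletes its own copy, so one decrement and one increment travel to the owner, in an order the asynchronous network does not guarantee.

```python
# Naive reference counting at the owner: apply counter updates in the
# order they arrive, collecting the object the moment the count hits 0.
def run(owner_inbox):
    counter = 1          # one counted reference to V, held by RA
    collected = False
    for delta in owner_inbox:   # control messages in arrival order
        counter += delta
        if counter == 0:
            collected = True    # owner reclaims V as soon as count hits 0
    return collected

# Increment (from RC) processed first: V is safe.
assert run([+1, -1]) is False
# Decrement (from RA) processed first: V is collected prematurely,
# even though RC still holds a live reference to it.
assert run([-1, +1]) is True
```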
Race Conditions in Naive Reference Counting:
Decrement/Increment
[Diagram: RB owns V with Counter_V = 1; RA holds the reference, sends &V to RC and deletes its own copy. RA's decrement (-1) is processed by the owner before RC's increment (+1), so the counter momentarily reaches 0 and V is collected while RC still references it.]
Race Conditions in Naive Reference Counting:
Increment/Decrement
[Diagram: the same scenario with the messages processed in the opposite order (+1 before -1): Counter_V goes 1 → 2 → 1 and V survives.]
Avoiding Race by Acknowledge Messages
[Diagram: RA sends &V to RC but keeps its own reference until the increment is acknowledged: Counter_V goes 1 → 2 when the +1 arrives, RB sends an ack, and only then is the decrement sent - so the counter can never drop to 0 prematurely.]
Weighted Reference Counting
• Each referenced object has a partial weight and a total weight
• Object creation: total weight = partial weight = an even value > 0
[Diagram: object V on node RB with Total = 64, Partial = 64.]
Weighted Reference Counting: Reference Duplication
[Diagram: RA holds a reference to V with Partial_V = 32 and forwards &V to RC. The partial weight is halved and sent with the reference: RA keeps Partial_V = 16 and sends &V/16; Total_V = 64 at the owner RB is untouched.]
Weighted Reference Counting: Reference Deletion
[Diagram: RC deletes its reference (Partial_V = 16) and sends its partial weight to the owner, which subtracts it from the total: Total_V = 64 - 16 = 48. RA's Partial_V = 32 is unaffected.]
Weighted Reference Counting
• Invariant: Total_V = Σ Partial_V (the total weight equals the sum of all partial weights)
• When the total weight equals the owner's partial weight, there are no remote references
• Advantage: eliminates increment messages, and therefore race conditions
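The weight bookkeeping above can be sketched as follows (an illustrative toy, not the paper's code; the dictionary and function names are invented for the example). Duplication is purely local, and only deletion sends a message to the owner:

```python
TOTAL = {}   # owner side: object -> total weight

def create(obj, weight=64):
    TOTAL[obj] = weight
    return weight              # partial weight carried by the initial reference

def duplicate(partial):
    # halve the sender's partial weight; no message to the owner is needed
    assert partial >= 2, "weight underflow"
    half = partial // 2
    return half, partial - half   # (weight kept, weight sent with the reference)

def delete(obj, partial):
    # the deleted reference's partial weight is returned to the owner
    TOTAL[obj] -= partial

p0 = create("V", 64)
p_keep, p_sent = duplicate(p0)        # RA forwards &V to RC: 32 kept, 32 sent
p_keep2, p_sent2 = duplicate(p_sent)  # RC forwards it again: 16 + 16
# invariant: the total weight equals the sum of all partial weights
assert TOTAL["V"] == p_keep + p_keep2 + p_sent2 == 64
delete("V", p_keep2); delete("V", p_sent2); delete("V", p_keep)
assert TOTAL["V"] == 0                 # no references remain: V is garbage
```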
Weighted Reference Counting
• Shortcomings:
– Weight underflow
• Possible solutions:
– Use partial weights that are powers of 2 and keep only the exponent
– [Yu-Cox] "Stop the world", last-resort global trace
– Not resilient to message loss or duplication:
• Loss may cause garbage objects to remain uncollected
• Duplication may cause an object to be prematurely collected
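The powers-of-2 refinement mentioned above can be sketched in a few lines (an illustrative assumption about how the exponent trick works, not code from the paper): if every partial weight is 2^k, a reference only needs to store k, halving becomes k-1, and underflow is simply k reaching 0.

```python
def duplicate(k):
    # split a partial weight of 2^k into two halves of 2^(k-1);
    # only the exponent is stored and transmitted
    if k == 0:
        # underflow: the weight can no longer be split - fall back to
        # another mechanism (e.g. the last-resort global trace)
        raise OverflowError("weight underflow")
    return k - 1, k - 1

k = 6                             # partial weight 64, stored as exponent 6
k_keep, k_sent = duplicate(k)
assert 2 ** k_keep + 2 ** k_sent == 2 ** k   # weight is conserved
```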
Indirect Reference Counting
• A stub contains strong and weak locators:
– Strong: refers to a scion in the sender node; used only for distributed GC
– Weak: refers to the node where the target object is located; used to invoke the target object in a single hop
• Duplication is performed locally, without informing the owner node:
– The weak reference is sent along with the message containing the reference
Indirect Reference Duplication
[Diagram: RB owns V; RA holds a stub whose strong locator points to a scion on RB and whose weak locator points to V's node. RA duplicates the reference to RC locally (&scion_B, &scion_A): RC's new stub keeps the weak locator to V's node, while its strong locator points to a new scion on RA, so deletions propagate along the chain RC → RA → RB.]
Indirect Reference Deletion
[Diagram sequence: when RC deletes its reference, a decrement message (-1) follows its strong locator to the scion on RA; once RA's own reference and its scion count reach zero, RA in turn notifies the scion on RB, which finally discards its scion for V.]
Indirect Reference Counting
• Advantages:
– Unlimited number of duplications
– Access to the object in one hop through the weak locator
• Disadvantages:
– Not resilient to message failures
– Messages are sent whenever an object is deleted
Reference Listing
• The object’s owner allocates a table of outgoing pointers (scions), one for each client that owns a reference to the object
• Client nodes hold tables of incoming pointers (stubs)
[Diagram: node RB owns object X and keeps a scion per client (A, C); client nodes RA and RC each hold a stub for X pointing back to B.]
Use of Timestamps
[Diagram: owner RB sends client RC a reference stamped &X/1; RC's "delete X/1" message crosses a newer reference &X/2 in transit. When "delete X/1" arrives, its timestamp is older than the scion's (2), so it is ignored and the scion survives.]
Reference Listing
• Advantages:
– Resilience to message duplication when timestamps are used
– Resilience to node failure: the owner can prompt a client to send a live/delete message
– The owner may explicitly query about a reference suspected to be part of a distributed garbage cycle
– The owner can decide whether or not to keep objects referred to by a crashed client node until it recovers
• Disadvantages:
– Memory overhead
– Doesn't collect cycles of garbage
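The timestamp rule behind the duplication-resilience claim can be sketched as follows (names and table layout are illustrative, not the paper's implementation): the owner stamps each reference it sends, records the latest stamp in the scion, and discards a delete message whose stamp is stale.

```python
scions = {}   # owner side: client -> timestamp of the newest reference sent

def send_reference(client, ts):
    # record (or refresh) the scion for this client with the send timestamp
    scions[client] = ts

def on_delete(client, ts):
    # discard the scion only if the delete matches the latest send;
    # a delete carrying an older timestamp is stale and must be ignored
    if scions.get(client) == ts:
        del scions[client]

send_reference("C", ts=1)      # owner B sends &X/1 to client C
send_reference("C", ts=2)      # ...and later sends &X/2
on_delete("C", ts=1)           # the delete for &X/1 arrives late: ignored
assert "C" in scions           # C is still listed as a client of X
on_delete("C", ts=2)
assert "C" not in scions       # now the scion is discarded
```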
Remote Reference Counting
• Advantages:
– Overhead depends only on the number of nodes in the system:
• Independent of pointer operations
• Independent of heap size
– Messages are sent only during GC, when the chance of collecting an object is very high
– Independent of consistency protocols and global order of operations
Remote Reference Counting
• Disadvantages:
– Doesn't collect cycles of garbage
– Dependent on the number of nodes in the system
The System Model
• Communication through a reliable asynchronous message-passing system:
– Messages are never lost, duplicated or altered
– Messages can be delayed or arrive out of order
• Processors can share objects
• Objects can be replicated
Local and Remote Counters
• A local counter and a remote counter are attached to every shared object
• Local_i(X):
– Increased by m when node i receives a message containing m pointers to X
– Otherwise maintained as in traditional reference counting
– When Local_i(X) = 0, node i is clean - it has no references to X
Local and Remote Counters
• Remote_i(X):
– Increased by m when some object Y containing m pointers to X is sent from node i
– Decreased by m when some object Y containing m pointers to X is received at node i
– The sum Σ_i Remote_i(X) is the number of pointers to X in transit in the system
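The counter rules above give the in-transit invariant directly; a minimal sketch (illustrative names, not the paper's code) makes it concrete: sending m pointers raises the sender's remote counter, delivery raises the receiver's local counter and lowers its remote counter, so Σ_i Remote_i(X) always equals the number of pointers to X still in flight.

```python
local = {"i": 1, "j": 0, "k": 0}    # Local_n(X) per node
remote = {"i": 0, "j": 0, "k": 0}   # Remote_n(X) per node
in_transit = []                     # messages (dst, m) not yet delivered

def send(src, dst, m):
    # node src ships an object containing m pointers to X
    remote[src] += m
    in_transit.append((dst, m))

def deliver():
    # the oldest in-flight message reaches its destination
    dst, m = in_transit.pop(0)
    local[dst] += m
    remote[dst] -= m

send("i", "j", 2)   # i sends j an object holding 2 pointers to X
# invariant: the sum of remote counters counts the pointers in transit
assert sum(remote.values()) == 2 == sum(m for _, m in in_transit)
deliver()
assert sum(remote.values()) == 0 and local["j"] == 2
```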
The Algorithm - Layout
• Build a spanning tree covering all the nodes
• Collection of object X:
– The root sends signals to all its children
– Inner nodes pass the signal down
– When a leaf is clean, it sends up a token
– An inner node sends up a token when it has received tokens from all its children and is clean
– When the root has received tokens from all its children, it checks a condition C:
• If C = true, X is garbage
• Otherwise - another wave begins
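The wave above can be sketched as a recursive pass over a small spanning tree (a simplification with no concurrent mutation, so only the cleanliness check and the remote-counter sum of the root's condition are exercised; the tree shape and all names are illustrative):

```python
tree = {"root": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}

def wave(node, local, remote):
    # signal goes down; the token coming back up carries
    # (is the whole subtree clean?, sum of remote counters in the subtree)
    clean, delta = local[node] == 0, remote[node]
    for child in tree[node]:
        c_clean, c_delta = wave(child, local, remote)
        clean, delta = clean and c_clean, delta + c_delta
    return clean, delta

# one pointer to X is in transit between a and c: delta sums to 0 overall
local = {"root": 0, "a": 0, "b": 0, "c": 0, "d": 0}
remote = {"root": 0, "a": 1, "b": 0, "c": -1, "d": 0}
clean, delta_fin = wave("root", local, remote)
assert clean and delta_fin == 0     # condition holds: X can be reclaimed
local["b"] = 1                      # a live local reference on node b...
assert wave("root", local, remote)[0] is False   # ...blocks collection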
The Algorithm
[Diagrams: signals propagate down the spanning tree; clean nodes (Local(x) = 0) answer with tokens flowing up. S denotes the set of nodes that haven't sent a token yet.]
• R0 ≡ all the nodes outside S are clean
• R = R0
Example: R0 falsification
[Diagram: node j, already outside S (its token sent), receives a message Y:=Z carrying a pointer to X from node i in S and becomes dirty: Local_i(x) = 1, Remote_i(x) = 1, Local_j(x) = 2, Remote_j(x) = -1. R0 is false even though every token reported its sender clean.]
The Algorithm
• Use the remote counters to count pointers sent and received
• Definition of δ_i:
– for a node i outside S, δ_i is the value held in Remote_i(X) when i sent its token
– for a node i in S, δ_i is the current value held in Remote_i(X)
• Δ = Σ_i δ_i
• Δ_fin = Δ at the end of the wave
The Algorithm
• A leaf sends in its token the value of its remote counter
• An inner node sends up the sum of its remote counter and those of its descendants
• R1 ≡ Δ > 0
• R = R0 ∨ R1
Example (cont.)
[Diagram: the pointer sent from i to j is still counted in Remote_i(x) = 1 (i is in S), while j's token reported its counter before the message arrived, so Δ = 1 ⟹ R1 is true and the wave correctly refuses to collect X.]
Example: R1 Falsification
[Diagram sequence: node j (outside S, dirty) sends W:=Y carrying a pointer to X to node k, raising Remote_j(x) from -1 to 0; when k receives it, Local_k(x) = 2 and Remote_k(x) = -1. The tokens already reported the old values, so Δ = 0 ⟹ R1 is false, even though node j is dirty.]
The Algorithm
• Detect whether Δ may have decreased due to a node in S:
– Initially paint all nodes white
– A node that decreases its Remote(X) turns black
• R2 ≡ at least one node in S is black
• R = R0 ∨ R1 ∨ R2
Example: R2 Falsification
[Diagram sequence: node k is still in S when it receives W:=Y, decreasing Remote_k(x) to -1 and turning black, so R2 holds. But k then drops its references (Local_k(x) = 0), sends its token and leaves S; once no node in S is black, R2 is false again even though node j is dirty.]
The Algorithm
• Propagate the color information:
– A node that is black or has received a black token transmits a black token
– Otherwise, it transmits a white token
– A node that transmits a black token becomes white
• R3 ≡ some node in S has a black token
• R = R0 ∨ R1 ∨ R2 ∨ R3
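The propagation rule above is a simple disjunction per node; a tiny sketch (illustrative, not the paper's code) shows how one black leaf taints every token on its path to the root:

```python
def token_color(node_black, child_tokens):
    # a node transmits a black token if it is black itself or if any
    # token it received from a child was black; otherwise white
    return "black" if node_black or "black" in child_tokens else "white"

assert token_color(True, []) == "black"                    # a black leaf
assert token_color(False, ["white", "black"]) == "black"   # taint moves up
assert token_color(False, ["white", "white"]) == "white"
```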
Example (cont.)
[Diagram: node k's blackness travels upward in its token, so even after k has sent its token and left S, the black token held along the path to the root keeps R3 true.]
The Algorithm
• C ≡ [S = {root} ∧ the root is white ∧ Local_root(X) = 0 ∧ all tokens at the root are white ∧ Δ_fin = 0]
• Once the root has received tokens from all its children and Local_root(x) = 0, it checks C:
– C = true ⟹ object X is garbage
– Otherwise - the root becomes white and initiates another wave
Correctness Proof
• Layout:
– Show that R = (R0 ∨ R1 ∨ R2 ∨ R3) is an invariant
– C = true ⟹ (R1 ∨ R2 ∨ R3) = false ⟹ R0 = true ⟹ object X is garbage
R0 ∨ R1 ∨ R2 ∨ R3 is invariant
• Assume by negation that R is false
• Look at the wave in which R first became false:
– R = false ⟹ R0 = false ⟹ some node outside S was dirty
– Let i be the first node outside S to become dirty
• Case 1: R became false before i first became dirty
– Implies that some node became dirty before i - impossible by the definition of i
R0 ∨ R1 ∨ R2 ∨ R3 is invariant
• Case 2: R became false after i first became dirty
– i received a message containing a pointer to X after sending its token
– Case 2.1: the message was sent in a previous wave
• More pointers sent than received ⟹ Δ > 0 at the beginning of the wave
– If Δ doesn't decrease ⟹ R1 = true
– Otherwise: some node becomes black ⟹ R2 ∨ R3 = true
R0 ∨ R1 ∨ R2 ∨ R3 is invariant
– Case 2.2: the message was sent in the current wave
• The message could have been sent only by a node j with Local(X) > 0 inside S
• j increased δ_j after sending the message:
– If δ_j was < 0 before, then some node turned black before i became dirty ⟹ R2 ∨ R3 = true until the end of the wave
– Otherwise Δ > 0 after j increased it ⟹ R1 = true until the end of the wave or until some node becomes black
Correctness Proof (cont.)
• If the root hasn't received a black token, S = {root}, the root is white and Δ_fin = 0, then there are no messages in transit with pointers to object X:
– No node became black during the wave ⟹ Δ didn't decrease ⟹ no messages were sent during the wave by nodes in S
– Δ ≥ 0 at the beginning and Δ_fin = 0 ⟹ Δ = 0 for the duration of the wave ⟹ no message was in transit at the beginning of the wave
– No node outside S can receive a message and become dirty
Correctness Proof (cont.)
• If the root hasn't received a black token, S = {root}, the root is white and Δ_fin = 0, then R0 = true:
– R2 ∨ R3 = false
– R1 = false
– R is invariant
• If the root hasn't received a black token, S = {root}, Δ_fin = 0 and the root is white and clean, then object X can be safely reclaimed:
– R0 = true - all nodes outside S are clean
– The root is clean
– There are no pointers in transit
Liveness Proof
• RRC doesn't reclaim cycles
• An unreferenced object is one referenced neither from the local memory of any node nor from any traveling message
Liveness Proof
• If an object is unreferenced, it will eventually be reclaimed by RRC:
– For all nodes, Local(X) = 0
– All nodes will eventually send a token
– If at the root C = false, another wave begins:
• No messages with pointers to X exist ⟹ no node will turn black
• No pointers to X exist ⟹ none will be sent ⟹ Δ = 0
• ⟹ C = true at the end of the wave
Liveness Proof
• If a garbage object is not reachable from any garbage cycle, it will eventually be reclaimed by RRC
[Diagram sequence: a garbage chain of objects X, X1, X2, X3 is reclaimed link by link - the unreferenced end X3 is collected first, which leaves X2 unreferenced for the next wave, and so on until X itself is reclaimed.]
Distributed Shared Memory (DSM)
• Software providing an abstraction of shared memory, running on networked workstations
• Each workstation's memory acts as a cache
• No explicit message exchange by the application - data is shared through the virtual shared memory
Millipage DSM
• Implements MULTIVIEW:
– Enables fine-grained sharing in page-based DSMs
– Eliminates false sharing
• Each object is mapped to a different virtual page, called a minipage:
– One node is the manager of the minipage
• Handles page faults - read/write requests
• Invalidation of a minipage = discarding it from local memory
– The current version implements sequential consistency
RRC Message Waves
• A global tree is built during initialization
• A wave begins when the local counter at the root becomes 0
• Communication may be asynchronous - an RRC message can be delayed and sent together with other RRC or DSM messages
• Discard messages are sent only upon memory reuse
Example
[Diagram sequence on three nodes i, j and k running the Millipage protocol (P), with node i managing minipages that hold objects X and Y:
– j and k issue Read(X); shipping the minipage raises Remote_i(X) to 2, and each delivery sets Local(X) = 1 and Remote(X) = -1 at the receiver.
– Y is fetched the same way (Remote_i(Y) = 2, Local_j(Y) = Local_k(Y) = 1, Remote_j(Y) = Remote_k(Y) = -1), and minipage X is later invalidated (PageInvalidate(X)) on nodes that stop using it.
– When every Local(Y) drops to 0, the root sends signals down the tree; tokens flow back up and Y is identified as garbage and reclaimed.]
Performance Evaluation
• The system:
– 8 Pentium II 300 MHz machines
– Windows NT Workstation 4.0 SP3
– 128 MB RAM each
– Workstations interconnected by a switched Myrinet LAN
• Benchmarks:
– Allocate objects and don't free them
– Executed a number of times in a non-stop manner
Benchmarks
• Water - a parallel application from the field of molecular dynamics
• LU Decomposition - factors a dense matrix A into the product of a lower triangular matrix L and an upper triangular matrix U
• Integer Sort - sorts N integer values in parallel
• Successive Over-Relaxation - input: a two-dimensional grid; in each iteration every grid element is updated to the average of its four neighboring elements
• Traveling Salesman Problem - finds the minimum-cost, simple, cyclic tour in a weighted graph
Application Suite

                                  Water    LU      TSP    IS      SOR
No. of Runs                       3        30      5      10      3
Shared Memory                     1MB      240MB   4MB    160KB   24.6MB
No. of Objects                    1542     510     250    2580    6150
Garbage Creation Rate (obj/sec)   27       57      1321   2       5082
Speedup on 8 Nodes                6.5      4.9     5.5    7.4     7.0
RRC Communication Cost
• 1-2 waves are enough to detect an object as garbage
[Chart: number of RRC messages per object for IS, Water, LU, TSP and SOR, all in the range 0-3.5.]
RRC Communication Cost
• Communication complexity is independent of the number of pointer operations:
– Simulating different rates of pointer operations showed no change in the number of GC messages
• Efficiency relies on 2 observations:
– Object use is usually localized in time
– The node that created the object is usually the last to use it
RRC Communication Cost
• To improve performance:
– Tokens and signals can be combined or piggybacked on other messages
– RRC can be turned off or delayed when best performance is desired
Scalability
• Problem: GC waves span all the processes in the system
• The increase in overhead is less than linear, also due to:
– Increased garbage creation rate
– Increased number of page faults
– Increased number of "discard" messages
• GC overhead in a single node is independent of the number of signals and tokens sent
Scalability
[Chart: GC overhead (%) vs. number of processors (1-8) for IS (2 obj/sec), Water (27 obj/sec), LU (57 obj/sec) and TSP (1321 obj/sec); overhead stays within 0-4.5%.]
Scalability
[Chart: speedup vs. number of processors (1-8) for SOR, LU, IS, Water and TSP, compared against the linear-speedup line.]
Collection in Granularity Larger than Objects
• Expected to decrease the number of GC messages
• Tested on SOR:
– A single minipage contains several objects instead of only one
Collection in Granularity of Pages
[Chart: overhead (%, 0-50% range) vs. matrix rows per minipage (1, 2, 4, 8, 16, 32), split into total RRC overhead, RRC message-processing overhead, and RRC overhead other than message processing.]
Collection in Granularity of Pages
• Advantages:
– Reduction in memory overhead
– Easier organization of the free list
– Cycles contained entirely within a page are collected
• Disadvantages:
– Delay in reclamation
• Not a significant problem, according to the memory-locality principle
– Creation of false cycles
CPU Time - Root
[Chart: stacked breakdown (0-100%) of the root node's GC-related CPU time for IS, Water, LU, TSP and SOR: invalidation, work on DSM page receive, work on DSM page send, pointer operations, allocation, and GC message processing.]
CPU Time - Inner Node
[Chart: the same stacked CPU-time breakdown (0-100%) for an inner node, over IS, Water, LU, TSP and SOR.]
RRC - Conclusions
• A GC algorithm that works correctly in a reliable asynchronous message-passing distributed system
• Successfully implemented as a WIN32 library on Windows NT on top of MILLIPAGE
• 2-3 messages to identify a garbage object, independent of reference-graph mutations
• The use of a reference-counting technique ensures low computational overhead
RRC - Conclusions (cont.)
• Scalable - the number of GC messages sent by a single node is independent of the number of nodes
• Communication overhead improves as collection granularity increases
• Unable to collect cycles