Seven O’Clock: A New Distributed GVT Algorithm using Network
Atomic Operations
David Bauer, Garrett Yaun, Christopher Carothers (Computer Science)
Murat Yuksel, Shivkumar Kalyanaraman (ECSE)
Global Virtual Time
Defines a lower bound on any unprocessed event in the system.
Defines the point beyond which events should not be reclaimed.
Imperative that the GVT computation operate as efficiently as possible.
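As a hedged formalization of these two properties (notation is mine, not from the slides): if U_i(t) denotes the set of timestamps of all unprocessed and in-transit events at processor i at wall-clock time t, then

\[
\mathrm{GVT}(t) \;=\; \min_i \, \min U_i(t),
\]

so no computation can ever roll back to a time earlier than GVT(t), and memory held by events with timestamps below it can safely be reclaimed.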
Key Problems
Simultaneous Reporting Problem: arises “because not all processors will report their local minimum at precisely the same instant in wall-clock time.”
Transient Message Problem: a message is delayed in the network and neither the sender nor the receiver considers that message in their respective GVT calculation.
Asynchronous Solution: create a synchronization, or “cut”, across the distributed simulation that divides events into two categories: past and future.
Consistent Cut: a cut where there is no message scheduled in the future of the sending processor, but received in the past of the destination processor.
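A hedged formalization of the consistent-cut condition (notation mine): let c_i be the cut point of processor i, and let a message m be sent by processor i at wall-clock time s(m) and received by processor j at time r(m). The cut is consistent when

\[
s(m) > c_i \;\Longrightarrow\; r(m) > c_j ,
\]

i.e., no message crosses from the sending processor's future into the destination processor's past.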
Mattern’s GVT Algorithm
Construct cut via message-passing
Cost: O(log N) with a tree, O(N) with a ring
With a large number of processors, the free event pool may be exhausted while waiting for the GVT computation to complete
Fujimoto’s GVT Algorithm
Construct cut using shared memory flag
Cost: O(1)
Limited to shared-memory architectures
Sequentially consistent memory model ensures proper causal order
Memory Model
Sequential consistency does not mean instantaneous
Memory events are only guaranteed to be causally ordered
Is there a method to achieve sequentially consistent shared memory in a loosely coordinated, distributed environment?
GVT Algorithm Differences
                             Fujimoto             7 O’Clock          Mattern             Samadi
Cost of Cut Calculation      O(1)                 O(1)               O(N) or O(log N)    O(N) or O(log N)*
Parallel / Distributed       P                    P+D                P+D                 P+D
Global Invariant             Shared Memory Flag   Real Time Clock    Message Passing     Message Passing
Independent of Event Memory  N                    Y                  N                   N

*cost of the algorithm is much higher
Network Atomic Operations
Goal: each processor observes the “start” of the GVT computation at the same instance of wall clock time
Definition: An NAO is an agreed upon frequency in wall clock time at which some event is logically observed to have happened across a distributed system.
[Figure: wall-clock timeline on which “Compute GVT” and “Update Tables” operations repeat at each NAO; these are possible operations provided by a complete sequentially consistent memory model.]
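To make the NAO mechanism concrete, here is a minimal sketch, assuming already-synchronized clocks, of how each processor could observe the same NAO boundaries purely from wall-clock time; the period value, function names, and use of CLOCK_REALTIME are illustrative assumptions, not the authors' implementation.

```c
/* Minimal sketch (not the paper's implementation) of detecting NAO
 * expirations, assuming clocks are already synchronized.  Names and the
 * NAO period value are illustrative assumptions. */
#include <stdint.h>
#include <time.h>

static const double nao_period = 0.001;   /* agreed-upon NAO interval in seconds (assumed) */
static uint64_t last_nao_seen = 0;        /* index of the last NAO interval observed */

/* Synchronized wall-clock time in seconds; CLOCK_REALTIME stands in for
 * whatever synchronized clock the system provides. */
static double wallclock_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns nonzero at most once per NAO interval.  Because every processor
 * divides the same wall-clock timeline by the same agreed-upon period, all
 * processors observe the "start" of the GVT computation at (logically) the
 * same instant of wall-clock time. */
int nao_expired(void)
{
    uint64_t current = (uint64_t)(wallclock_now() / nao_period);
    if (current > last_nao_seen) {
        last_nao_seen = current;
        return 1;   /* time to compute the local minimum for GVT */
    }
    return 0;
}
```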
Clock Synchronization
• Assumption: all processors share a highly accurate, common view of wall-clock time.
• Basic building block: the CPU timestamp counter
– Measures time in clock cycles, so a gigahertz CPU clock has a granularity of about 10^-9 s
– Sending events across the network has much coarser granularity, depending on the technology: about 10^-6 s on 1000Base-T
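A small sketch of this building block, shown for x86 with GCC/Clang's __rdtsc intrinsic (Itanium exposes an equivalent interval time counter); the fixed CPU_HZ value is an assumption and would need calibration in practice.

```c
/* Sketch: reading the CPU timestamp counter, the basic building block for
 * the common wall-clock view.  CPU_HZ is an assumed, pre-calibrated clock
 * frequency; real deployments must calibrate and guard against drift. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>        /* __rdtsc() */

#define CPU_HZ 1.3e9          /* assumed 1.3 GHz clock */

static double cycles_to_seconds(uint64_t cycles)
{
    return (double)cycles / CPU_HZ;
}

int main(void)
{
    uint64_t start = __rdtsc();
    /* ... work whose duration we want at ~1e-9 s granularity ... */
    uint64_t elapsed = __rdtsc() - start;
    printf("elapsed: %.9f s\n", cycles_to_seconds(elapsed));
    return 0;
}
```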
Clock Synchronization
• Issues: clock synchronization, drift, and jitter
• Ostrovsky and Patt-Shamir:
– provably optimal clock synchronization
– clocks have drift and message latency may be unbounded
• Well-researched problem in distributed computing; we used a simplified approach
– the simplified approach is helpful in determining whether the system is working properly
Max Send Δt
• Definition: max_send_delta_t is the maximum of
– the worst-case bound on the time to send an event through the network
– twice the synchronization error
– twice the maximum clock drift over the simulation time
• Adds a small amount of time to the NAO expiration
– Similar to sequentially consistent memory
• Overcomes:
– the transient message problem, clock drift/jitter, and clock synchronization error
Max Send Δt: Clock Drift
• Clock drift causes CPU clocks to become unsynchronized
– Long-running simulations may require multiple synchronizations
– Or, we account for it in the NAO
• Max Send Δt overcomes clock drift by ensuring no event “falls between the cracks”
Max Send Δt
• What if clocks are not well synchronized?
– Let Dmax be the maximum clock drift.
– Let Smax be the maximum synchronization error.
• Solution: redefine tmax as
t'max = max(tmax, 2*Dmax, 2*Smax)
• In practice both Dmax and Smax are very small in comparison to tmax (a code sketch follows the figure below).
[Figure: LP1 and LP2 on a wall-clock timeline; the interval tmax between GVT computations is padded by Dmax on either side.]
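A minimal sketch of the padded bound described above; the function and parameter names are illustrative.

```c
/* Sketch of the padded bound t'_max = max(t_max, 2*D_max, 2*S_max).
 * Names are illustrative; the values would come from measurement. */
#include <math.h>

double max_send_delta_t(double t_max,   /* worst-case network send time        */
                        double d_max,   /* maximum clock drift over the run    */
                        double s_max)   /* maximum clock synchronization error */
{
    return fmax(t_max, fmax(2.0 * d_max, 2.0 * s_max));
}
```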
Transient Message Problem
• Max Send Δt: worst-case bound on the time to send an event through the network
– guarantees events are accounted for by either the sender or the receiver
Simultaneous Reporting Problem
• The problem arises when processors do not start the GVT computation simultaneously
• Seven O’Clock starts simultaneously across all CPUs, so the problem cannot occur
[Figure: example NAO cut with events A through E; one processor reports LVT = 7, another LVT = min(5, 9) = 5, giving GVT = min(5, 7) = 5.]
Simulation: Seven O’Clock GVT Algorithm
– Assumptions:
• Each processor has a highly accurate clock
• A message-passing interface without acknowledgements is available
• The worst-case bound tmax on the time to transmit a message through the network is known
[Figure: LP1 through LP4 on a wall-clock timeline with cut points at successive NAOs; GVT #1 and GVT #2 are each padded by tmax; events with timestamps 5, 7, 9, 10, and 12 give LVT = min(5, 9), LVT = min(7, 9), and GVT = min(5, 7).]
– Properties:
• a clock-based algorithm for distributed processors
• creates a sequentially consistent view of distributed memory
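Putting the pieces together, a hedged per-processor sketch of one Seven O’Clock step: the clock-based NAO triggers the cut at no messaging cost, and an MPI-style min-reduction (my assumption, not necessarily the authors' mechanism) then combines the local minima into the GVT.

```c
/* Minimal per-processor sketch of one Seven O'Clock step, assuming an
 * MPI-style min-reduction combines the local minima once the clock-based
 * cut has been taken.  All function names are illustrative, not taken
 * from the authors' implementation. */
#include <float.h>
#include <mpi.h>

extern int    nao_expired(void);             /* from the NAO sketch above            */
extern double local_min_unprocessed(void);   /* min timestamp of unprocessed events  */
extern double local_min_in_transit(void);    /* min timestamp of events sent but
                                                possibly not yet received (covered
                                                by the max_send_delta_t padding)     */
extern void   fossil_collect(double gvt);    /* reclaim events with timestamp < GVT  */

void seven_oclock_step(void)
{
    if (!nao_expired())
        return;                              /* not yet "seven o'clock" */

    /* Every processor reaches this point at logically the same wall-clock
     * instant, so the cut itself costs no control messages. */
    double lvt = local_min_unprocessed();
    double in_transit = local_min_in_transit();
    if (in_transit < lvt)
        lvt = in_transit;

    /* Combine the local minima; a min all-reduce stands in for whatever
     * aggregation the simulator actually uses. */
    double gvt = DBL_MAX;
    MPI_Allreduce(&lvt, &gvt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    fossil_collect(gvt);                     /* GVT bounds all future rollbacks */
}
```

The cut itself requires no control messages; only the combination of the local minima involves communication.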
Limitations
• NAOs cannot be “forced”
– agreed-upon intervals cannot change
• Simulation end time
– worst case: a complete NAO passes with only one event remaining to process
– amortized over the entire run time, the cost is O(1)
• Exhausted event pool
– requires tuning to ensure enough optimistic memory is available
Uniqueness
• The only real-time-based GVT algorithm
• Zero-cost consistent cut is truly scalable
– O(1) cost is optimal
• The only algorithm that is entirely independent of available event memory
– Event memory is only loosely tied to the GVT algorithm
Performance Analysis: Models
r-PHOLD
• PHOLD with reverse computation
• Modified to control the percentage of remote events (normally 75%)
• Destinations are still chosen using a uniform random number generator, with all LPs as possible destinations
TCP-Tahoe
• TCP-Tahoe ring of campus networks topology
• Same topology design as used by PDNS in MASCOTS ’03
• Model limitations required us to increase the number of LAN routers in order to simulate the same network
Performance Analysis: Clusters

             Itanium Cluster           NetSim Cluster               Sith Cluster
Location     RPI                       RPI                          Georgia Tech
Total Nodes  4                         40                           30
Total CPUs   16                        80                           60
Total RAM    64 GB                     20 GB                        180 GB
CPU          Quad Itanium-2, 1.3 GHz   Dual Intel, 800 MHz          Dual Itanium-2, 900 MHz
Network      Myrinet 1000Base-T        ½ 100Base-T, ½ 1000Base-T    Ethernet 1000Base-T
Itanium Cluster: r-PHOLD, CPUs allocated round-robin
Maximize distribution (round-robin among nodes) versus maximize parallelization (use all CPUs on a node before using additional nodes)
NetSim Cluster: comparing 10% and 25% remote events (using 1 CPU per node)
TCP Model Topology
[Figure: a single campus network, and 10 campus networks connected in a ring.]
Our model contained 1,008 campus networks in a ring, simulating > 540,000 nodes.
Itanium Cluster: TCP results using 2 and 4 nodes
Sith Cluster: TCP model using 1 CPU per node and 2 CPUs per node
Future Work & Conclusions
• Investigate the “power” of different models by computing a spectral analysis
– GVT now in the frequency domain
– Determine the maximum length of rollbacks
• Investigate new ways of measuring performance
– Models are too large to run sequentially
– Account for hardware effects (even in a NOW there are fluctuations in hardware performance)
– Account for model-to-LP mapping
– Account for different cases, i.e., 4 CPUs distributed across 1, 2, and 4 nodes