tornado: maximizing locality and concurrency in a shared memory multiprocesor operating system by:...
TRANSCRIPT
Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocesor Operating System
By: Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm
Presented by: Holly Grimes
Locality
In uniprocessors Spatial Locality—access neighboring memory
locations Temporal Locality—access memory locations that
where accessed recently
In multiprocessors, locality involves each processor using data from its own cache
Why is Locality Important? Modern multiprocessors exhibit
Higher memory latency Large write-sharing costs Large cache lines (false sharing) Larger system sizes Large secondary caches NUMA effects
Why is Locality Important?
Why is Locality Important?
Why is Locality Important?
Why is Locality Important?
Why is Locality Important?
Why is Locality Important?
Why is Locality Important? Sharing the counter requires moving it back
and forth between the CPU caches Solution??
Split the counter into an array of integers Each CPU gets its own counter
Achieving Locality
Achieving Locality
Achieving Locality
Achieving Locality Using an array of counters seems to solve our
problem… But what happens if both array elements map
to the same cache line? False sharing
Solution: Pad each array element
Different elements map to different cache lines
Comparing Counter Implementations
Why is Locality Important? Modern multiprocessors exhibit
Higher memory latency Large write-sharing costs Large cache lines (false sharing) Larger system sizes Large secondary caches NUMA effects
NUMA Effects
NUMA Effects
Achieving Locality in an OS We’ve seen several ways to achieve locality in
the implementation of a counter
Now we extend these concepts to see how locality can be achieved in an OS
Tornado’s Approach – Built a new OS from the ground up Make locality the primary design goal
Tornado’s Approach
Locality
Independence
User Application
Locality
Independence
OS Implementation
Illustration by Philip Howard
Tornado’s Locality-Maximizing features
Object-oriented Design
Clustered Objects
A New Locking Strategy
Protected Procedure Calls
Object-oriented Design
Object-oriented Structure Each OS resource is represented as a
separate object All locks and data structures are internal to
the objects localizes and encapsulates the locks and data
This structure allows different resources to be managed without accessing shared data structures without acquiring shared locks
Simplifies OS implementation modular design
Example: Memory Management Objects
HAT
Process
Region FCM
COR
Region FCM
COR
DRAM
HAT Hardware Address TranslationFCM File Cache ManagerCOR Cached Object RepresentativeDRAM Memory manager
Illustration by Philip Howard
Object-oriented Design
For each resource, this design provides one object to be shared by all CPUs Comparable to having one copy of the counter
shared among all CPUs
To maximize locality, something more needs to be done
Clustered Objects
Clustered Objects Clustered objects are composed of a set of
representative objects There can be one rep for the system, one rep per
processor, or one rep for a cluster of processors Clients access a clustered object using a
common reference to the object Each call to an object using this reference is
automatically directed to the appropriate rep Clients do not need to know anything about the
location or organization of the reps to use a clustered object
Impact on locality is similar to the effect of padded arrays on the counter
Clustered Objects
Keeping Clustered Objects Consistent
When a clustered object has multiple reps, there must be a way of keeping the reps consistent
Coordination between reps can happen via Shared Memory Protected Procedure Calls
Clustered Object Implementation Each processor has a translation table
The table is located at the same virtual address in every processor
For each clustered object, the table contains a pointer to the rep that serves the given processor
A clustered object reference is just a pointer into this table
Reps for a clustered object are created dynamically when they are first accessed Dynamic rep creation is dealt with by the global
miss handler
Clustered Object Implementation
A New Locking Strategy
Synchronization
Two kinds of synchronization issues must be dealt with
Using locks to protect data structures
Ensuring the existence of needed data structures
Locks
Tornado uses spin-then-block locks to minimize the overhead of the lock/unlock instructions
Tornado limits lock contention by Encapsulating the locks in objects to limit the
scope of locks Using clustered objects to provide multiple copies
of a lock
Existence Guarantees Traditional Approach: use locks to protect object
references Prevents races where one process destroys an object
that another process is using Tornado’s Approach: semi-automatic garbage
collection Garbage collection destroys a clustered object only
when there are no more references to the object When clients use an existing reference to access a
clustered object, they have a guarantee that the referenced object still exists
References can be safely accessed without the use of (global) locks
Protected Procedure Calls
Interprocess Communication
Tornado uses Protected Procedure Calls (PPCs) to bring locality and concurrency to interprocess communication
A PPC is a call from a client object to a server object Acts like a clustered object call that passes between the
protection domains of the client and server processes Advantages of PPC
Client requests are always serviced on their local processor Clients and servers share the CPU in a manner similar to
handoff scheduling The server has one thread of control for each client request
Protected Procedure Calls
Performance
Conclusions
Tornado’s increased locality and concurrency has produced a scalable OS design for multiprocessors
This locality was provided by several key system features
An object-oriented design Clustered objects A new locking strategy Protected procedure calls