Cellular Disco
Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, Mendel Rosenblum
Presented by: Sagnik Bhattacharya
Overview
• Problems of current shared-memory multiprocessors and our requirements
• Cellular Disco as a solution
  – architecture
  – prototype
  – hardware fault containment
  – CPU management
  – memory management
  – statistics
• Cellular Disco and ubiquitous environments
• Conclusion
Problem
• Extending modern operating systems to run efficiently on shared-memory multiprocessors is hard.
• Software development has not kept pace with hardware development.
• Common operating systems fail beyond 12 processors.
What we need….
• the system should be reliable
• it should be scalable
• it should be fault-tolerant
• it should not take too much development time or effort
Traditional approaches
• Hardware partitioning: lacks resource sharing; makes physical clusters.
• Software-centric approaches (significant development time and cost):
  – modify an existing OS
  – develop a new OS
A scenario….
[Diagram: a Smart Space control unit driving a bank of processors; when a processor fails, no rebooting is necessary.]
Solution: Cellular Disco
• Extension of previous work: Disco
• Uses the concept of virtual machine monitors
• Partitions the multiprocessor system into virtual clusters.
Virtual Machine Monitor
[Diagram sequence: the virtual machine monitor runs directly on the hardware and hosts two virtual machines. VM1 runs Win NT on virtual µPs 1, 2, 3; VM2 runs IRIX 6.2 on virtual µPs 1, 3, 5, 8. A guest OS issues an I/O request; the monitor traps the request and performs the I/O on the physical device; it then sends a virtual interrupt back to the guest.]
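The I/O animation above is the classic trap-and-emulate path. A minimal sketch of that flow, with all class and method names hypothetical (not taken from the Cellular Disco sources):

```python
# Sketch of the trap-and-emulate I/O path shown in the diagrams.
# All names are hypothetical illustrations, not the real Cellular Disco API.

class Device:
    """A physical device backend owned by the monitor."""
    def perform(self, request):
        return f"done:{request['op']}"

class VM:
    """A guest; the monitor posts virtual interrupts into it."""
    def __init__(self):
        self.interrupts = []
    def raise_virtual_interrupt(self, device_id, result):
        self.interrupts.append((device_id, result))

class VirtualMachineMonitor:
    def __init__(self, devices):
        self.devices = devices
        self.pending = []

    def on_privileged_io(self, vm, request):
        # 1. The guest's privileged I/O instruction traps into the monitor.
        # 2. The monitor performs the real I/O on the physical device.
        result = self.devices[request["device_id"]].perform(request)
        self.pending.append((vm, request["device_id"], result))

    def deliver_interrupts(self):
        # 3. On re-entry to the guest, post a virtual interrupt with the result.
        while self.pending:
            vm, dev, result = self.pending.pop(0)
            vm.raise_virtual_interrupt(dev, result)
```

The key design point the slides make is that the guest OS never touches the hardware directly; the monitor is the only layer that does real I/O.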
Issues it addresses
• Scalability
• NUMA awareness
• Hardware fault containment
• Resource management
Basic Cellular Disco Architecture
Prototype
• Runs on a 32-processor SGI Origin 2000.
• Supports shared-memory systems based on the MIPS R10000 architecture.
• The prototype runs piggybacked on IRIX 6.4.
• The host OS is made dormant and is only used to invoke some device drivers.
Hardware Virtualization
• Physical resources: visible to a virtual machine
• Machine resources: actual resources, allocated by Cellular Disco
• CD operates in the kernel mode of the MIPS processor
• CD intercepts all system calls.
Resource Management
• CPU management: each processor maintains its own run queue
• Memory management: memory-borrowing mechanism
• Each OS instance is given only as many resources as it can handle. Large applications are split, and communication between the parts is established using shared-memory regions.
CPU Management
• VCPU migration:
  – intra-node (37 µsec)
  – inter-node (520 µsec)
  – inter-cell (1520 µsec)
VCPU migration
[Diagram sequence: cells built from nodes of CPUs, joined by the Cellular Disco interconnect. A VCPU migrates (1) between CPUs within a node, (2) between nodes within a cell, and (3) between cells.]
CPU Management (contd.)
• CPU balancing: idle balancer and periodic balancer
Load Balancing Scenario: Idle balancer
[Diagram sequence: four CPUs; CPU2 is idle while VC B1 waits on another CPU's run queue. CPU2's idle balancer asks "Does this have enough cache affinity to CPU2?"; the answer is NO, and in the final frame VC B1 has been moved off its original queue.]
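One way to read the scenario above is that the idle CPU scans other run queues for a VCPU whose cache affinity is low enough that stealing it is cheap. A sketch under that reading, with a hypothetical data model (the `cache_affinity` field and threshold are illustrative, not from the paper):

```python
# Sketch of an idle balancer's steal decision (hypothetical model: each
# queued VCPU carries an estimate of its remaining cache affinity).

def find_victim(idle_cpu, run_queues, affinity_limit):
    """Scan other CPUs' run queues for a VCPU worth stealing: one whose
    remaining cache affinity to its current CPU is below the limit."""
    for cpu, queue in run_queues.items():
        if cpu == idle_cpu:
            continue  # never steal from ourselves
        for vcpu in queue:
            if vcpu["cache_affinity"] < affinity_limit:
                queue.remove(vcpu)   # safe: we return immediately
                return vcpu
    return None  # nothing cheap to steal; stay idle
```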
Periodic Balancer
• Does a depth-first traversal of the load tree.
• Checks the load difference between two siblings; ignores it if the difference is < 2.
• If the difference is >= 2, performs load balancing when the benefit exceeds the cost.
[Diagram: load tree with root 4, children 1 and 3, and leaves 1, 0, 2, 1. The leaf-pair differences of 1 are ignored; the sibling difference of 2 between the children triggers balancing.]
Gang Scheduling
• For each physical CPU, we select the VCPU that is to run on it.
• The VCPU selected is the highest-priority gang-runnable VCPU:
  – all non-idle VCPUs of that VM are either
    • running, or
    • waiting on run queues of processors running lower-priority VMs.
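The gang-runnable condition above translates into a simple predicate. The per-VCPU dictionary fields here are a hypothetical representation, not the real data structures:

```python
# Sketch of the gang-runnable test (hypothetical model: each VCPU records
# its state and, if queued, the priority of the VM its CPU is running).

def gang_runnable(vm_vcpus, vm_prio):
    """A VM is gang-runnable if every non-idle VCPU is either already
    running or waiting on a CPU that is running a lower-priority VM."""
    for vcpu in vm_vcpus:
        if vcpu["state"] in ("idle", "running"):
            continue
        # Queued VCPU: OK only if it would preempt a lower-priority VM.
        if vcpu["cpu_running_prio"] >= vm_prio:
            return False
    return True
```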
Example
[Diagram sequence: three physical CPUs (µP1–µP3), each with a currently executing VCPU and a priority-ordered wait queue. VM1 owns VCs 1, 3, 8 (idle); VM2 owns VCs 2, 4, 6 (idle), 7; VM3 owns VCs 5, 9. VM3 becomes gang-runnable: VC5 and VC9 are scheduled together as the new executing VCPUs, and the displaced VCPUs move onto the new wait queues.]
Memory Management
• Each cell maintains its own freelist and allocates memory to other cells in its allocation preference list on request (via RPC).
• Speed: 758 µsec for 4 MB.
• A threshold is set for the minimum amount of local free memory.
• Paging is avoided as far as possible.
Memory Borrowing
• freelist: list of free pages in the cell
• allocation preference list: list of cells from which borrowing memory is more beneficial than paging.
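Using the thresholds from the example slides (borrow below 16 MB, lend only above 32 MB, 4 MB per grant), the borrowing decision can be sketched as follows; the cell representation is a hypothetical simplification and the real protocol runs over inter-cell RPC:

```python
# Sketch of threshold-driven memory borrowing between cells.
# Thresholds are the ones shown in the example slides; the dict-based
# cell model is a hypothetical stand-in for the real RPC protocol.
BORROW_THRESHOLD = 16  # MB: below this, a cell starts borrowing
LEND_THRESHOLD = 32    # MB: below this, a cell refuses to lend
CHUNK = 4              # MB granted per successful request

def try_borrow(cell, preference_list):
    """Ask cells on the allocation preference list, in order, until one lends."""
    if cell["free"] >= BORROW_THRESHOLD:
        return None                      # no need to borrow
    for lender in preference_list:
        if lender["free"] < LEND_THRESHOLD:
            continue                     # this lender would refuse
        lender["free"] -= CHUNK
        cell["free"] += CHUNK
        return lender["name"]
    return None                          # nobody could lend; fall back to paging
```

The ordering of the preference list is what encodes "borrowing from this cell beats paging": nearby cells come first.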
Memory Borrowing
[Diagram sequence: five cells with freelists of different sizes, a borrowing threshold of 16 MB, and a lending threshold of 32 MB. A cell whose freelist falls below the borrowing threshold asks cells on its allocation preference list; a cell below the lending threshold refuses (and cannot be asked); a cell above the lending threshold gives 4 MB to the requester.]
Memory Management (contd.)
• Paging algorithm: second-chance FIFO.
• Page-sharing information is kept in control data structures.
• Cellular Disco traps all read and write requests made by the operating systems.
Second-chance FIFO
• A reference bit is added to each page in the FIFO scheme.
• Every time the page is accessed, the bit is set to 1.
• If a page is selected by FIFO and its reference bit is 1, the bit is set to 0 and another page is looked for.
• A page becomes the target (victim) page if it is selected by FIFO and its reference bit is 0.
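The steps above translate directly into code. A minimal sketch (page and reference-bit representation hypothetical):

```python
# Sketch of second-chance FIFO victim selection as described above.
from collections import deque

def select_victim(fifo, ref_bit):
    """Pop the oldest page; if its reference bit is set, clear the bit,
    give the page a second chance at the back of the queue, and keep
    looking; the first page found with bit 0 is the victim."""
    while True:
        page = fifo.popleft()
        if ref_bit[page]:
            ref_bit[page] = 0       # recently used: clear bit, requeue
            fifo.append(page)
        else:
            return page             # bit already 0: evict this page
```

Note the loop always terminates: even if every bit is 1, the first pass clears them all, so the original oldest page comes around again with bit 0.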
Example
[Diagram sequence: a page fault occurs. The FIFO's oldest page has reference bit 1, so the bit is cleared and the page gets a second chance; the next page, with reference bit 0, becomes the victim.]
Hardware fault-containment
• The failure rate increases with the number of processors.
• Cellular Disco is internally structured as a set of semi-independent cells.
• A failure in one cell does not impact VMs running in other cells (localization of faults).
• Assumption: CD is a trusted software layer.
Cellular Structure
Fault in one cell does not affect others
Hardware fault-containment (contd.)
• Communication modes: fast inter-processor RPC and messages.
• Side benefit: software fault containment, i.e., individual OS crashes do not impact the system.
Hardware-Fault recovery
• liveset: set of still-functioning nodes.
• Failure: removal from the liveset.
• Recovery: reinsertion into the liveset.
• Virtual machines dependent on the failed cell are terminated.
• Memory dependencies are updated when a cell fails.
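The liveset bookkeeping described above can be sketched as follows. The per-VM dependency sets are a hypothetical model; the real system also repairs memory dependencies, which is omitted here:

```python
# Sketch of liveset maintenance on node failure and recovery.
# Hypothetical model: each VM records the set of node IDs it depends on.

def handle_failure(liveset, failed_nodes, vms):
    """Remove failed nodes from the liveset; terminate every VM that
    depends on any of them, keeping only unaffected VMs."""
    liveset = liveset - failed_nodes
    survivors = {}
    for name, deps in vms.items():
        if deps & failed_nodes:
            continue                 # dependent on the failed cell: terminated
        survivors[name] = deps
    return liveset, survivors

def handle_recovery(liveset, recovered_nodes):
    """Reinsert recovered nodes into the liveset."""
    return liveset | recovered_nodes
```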
Example
[Diagram sequence: six nodes in three cells run VM1, VM2, and VM3; liveset = 1,2,3,4,5,6. A fault ("BOOM") takes out the cell hosting VM1 and VM3: the affected nodes are removed, the liveset shrinks to 5,6, and only VM2 survives. A recovery interrupt later reinserts the repaired nodes, restoring the liveset to 1,2,3,4,5,6.]
Fault-Recovery overhead
Virtualization Overheads
(The first column shows the execution time on IRIX 6.4; the second shows the execution time on Cellular Disco.)
Cellular Disco and Ubiquitous environments
• Provides raw computational power for our smart spaces.
• More importantly, it does not fail outright: fault recovery is present.
• Adaptable to new operating systems.
Grey Areas
• Will the source simplicity remain if it is not piggybacked on IRIX 6.4?
• Will it work on non-uniform multiprocessor systems?
  – Probable solution: development of a hardware virtualization standard.
In conclusion….
• Cellular Disco presents a midway path between hardware- and software-directed techniques.
• It can be used as the central control unit for our smart spaces because it is scalable and fault-tolerant.