EECC756 - Shaaban — lec # 10, Spring 2000, 4-13-2000
Parallel System Performance: Evaluation & Scalability
• Factors affecting parallel system performance:
  – Algorithm-related, parallel program-related, architecture/hardware-related.
• Workload-Driven Quantitative Architectural Evaluation:
  – Select applications or a suite of benchmarks to evaluate the architecture on a real or simulated machine.
  – From measured performance results compute performance metrics:
    • Speedup, System Efficiency, Redundancy, Utilization, Quality of Parallelism.
  – Application Models of Parallel Computers: how the speedup of an application is affected subject to specific constraints:
    • Fixed-load Model.
    • Fixed-time Model.
    • Fixed-memory Model.
• Performance Scalability:
  – Definition.
  – Conditions of scalability.
  – Factors affecting scalability.
Factors Affecting Parallel System Performance
• Parallel Algorithm-related:
  – Available concurrency and profile, grain, uniformity, patterns.
  – Required communication/synchronization, uniformity and patterns.
  – Data size requirements.
  – Communication-to-computation ratio.
• Parallel program-related:
  – Programming model used.
  – Resulting data/code memory requirements, locality, and working set characteristics.
  – Parallel task grain size.
  – Assignment: dynamic or static.
  – Cost of communication/synchronization.
• Hardware/Architecture-related:
  – Total CPU computational power available.
  – Shared address space vs. message passing.
  – Communication network characteristics.
  – Memory hierarchy properties.
Performance Evaluation of Uniprocessors
• Design modification decisions are made only after quantitative performance evaluation.
• For existing systems: comparison of results.
• For future systems: careful extrapolation from known quantities.
• A wide base of programs leads to standard benchmarks:
  – Measured on a wide range of machines and successive generations.
• Measurements and technology assessment lead to proposed features.
• Simulation of new feature designs:
  – A simulator is developed that can run with and without a feature.
  – Benchmarks are run through the simulator to obtain results.
  – Together with cost and complexity, decisions are made.
Performance Isolation: Microbenchmarks
• Microbenchmarks: small, specially written programs to isolate performance characteristics:
  – Processing.
– Local memory.
– Input/output.
– Communication and remote access (read/write, send/receive)
– Synchronization (locks, barriers).
– Contention.
Types of Workloads/Benchmarks
– Kernels: matrix factorization, FFT, depth-first tree search.
– Complete Applications: ocean simulation, crew scheduling, database.
– Multiprogrammed Workloads.

• Spectrum: Multiprogrammed workloads — Applications — Kernels — Microbenchmarks
  – Toward multiprogrammed workloads and applications: realistic and complex; the higher-level interactions are what really matters.
  – Toward kernels and microbenchmarks: easier to understand, controlled, repeatable; reveal basic machine characteristics.

Each has its place:
Use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance.
Sample Workload/Benchmark Suites
• Numerical Aerodynamic Simulation (NAS):
  – Originally pencil-and-paper benchmarks.
• SPLASH/SPLASH-2:
  – Shared-address-space parallel programs.
• ParkBench:
  – Message-passing parallel programs.
• ScaLapack:
  – Message-passing kernels.
• TPC:
  – Transaction processing.
• SPEC-HPC
• . . .
Multiprocessor Simulation
• Simulation runs on a uniprocessor (can be parallelized too):
  – Simulated processes are interleaved on the processor.
• Two parts to a simulator:
  – Reference generator: plays the role of the simulated processors and schedules simulated processes based on simulated time.
  – Simulator of the extended memory hierarchy: simulates operations (references, commands) issued by the reference generator.
• Coupling or information flow between the two parts varies:
  – Trace-driven simulation: from generator to simulator.
  – Execution-driven simulation: in both directions (more accurate).
• The simulator keeps track of simulated time and detailed statistics.
Execution-Driven Simulation
• The memory hierarchy simulator returns simulated time information to the reference generator, which is used to schedule simulated processes.

[Figure: reference generator (processes P1 … Pp) coupled to a memory and interconnect simulator (caches $1 … $p and memories Mem1 … Memp joined by a network).]
Parallel Performance Metrics Revisited
• Degree of Parallelism (DOP): For a given time period, reflects the number of processors in a specific parallel computer actually executing a particular parallel program.
• Average Parallelism:
  – Given maximum parallelism = m
  – n homogeneous processors
  – Computing capacity of a single processor: Δ
  – Total amount of work W (instructions, computations):

        W = Δ ∫_{t1}^{t2} DOP(t) dt

    or, as a discrete summation:

        W = Δ Σ_{i=1}^{m} i · t_i        where Σ_{i=1}^{m} t_i = t2 − t1

    and t_i is the total time that DOP = i.
  – The average parallelism A:

        A = (1 / (t2 − t1)) ∫_{t1}^{t2} DOP(t) dt

    In discrete form:

        A = ( Σ_{i=1}^{m} i · t_i ) / ( Σ_{i=1}^{m} t_i )
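As a quick numeric sketch of the discrete formulas above — the DOP profile and the per-processor capacity Δ below are made-up illustration values, not from the lecture:

```python
# Sketch: total work W and average parallelism A from a discrete DOP
# profile. t[i] = total time (seconds) during which DOP = i; delta is
# the computing capacity of one processor (operations/second).
t = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}    # DOP -> time spent at that DOP
delta = 1e9                              # assumed: 1 Gop/s per processor

W = delta * sum(i * ti for i, ti in t.items())            # total work
A = sum(i * ti for i, ti in t.items()) / sum(t.values())  # average parallelism
```

Here the program spends most of its time at low DOP, so A comes out well below the maximum parallelism of 8.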
Parallel Performance Metrics Revisited
Asymptotic Speedup:

  Execution time with one processor:

      T(1) = Σ_{i=1}^{m} t_i(1) = Σ_{i=1}^{m} W_i / Δ

  Execution time with an infinite number of available processors:

      T(∞) = Σ_{i=1}^{m} t_i(∞) = Σ_{i=1}^{m} W_i / (i Δ)

  Asymptotic speedup S∞:

      S∞ = T(1) / T(∞) = ( Σ_{i=1}^{m} W_i ) / ( Σ_{i=1}^{m} W_i / i )
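A minimal numeric check of the asymptotic speedup ratio — the work profile below is made up, and Δ cancels out of the ratio so it is omitted:

```python
# Sketch: asymptotic speedup from a work profile. W[i] = amount of work
# executed at DOP = i (arbitrary units).
W = {1: 10.0, 2: 20.0, 4: 40.0}

T1 = sum(Wi for Wi in W.values())          # proportional to T(1)
Tinf = sum(Wi / i for i, Wi in W.items())  # proportional to T(inf)
S_inf = T1 / Tinf                          # asymptotic speedup
```

With this profile most of the work is highly parallel, and S∞ works out to 70/30 ≈ 2.33.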
Parallel Performance Metrics Revisited
• Arithmetic Mean Performance:
  Let R_i be the execution rates (instructions per second) of m programs or benchmarks, i = 1, 2, …, m.
  – With equal weighting:

        R_a = ( Σ_{i=1}^{m} R_i ) / m

  – With weight distribution π = {f_i | i = 1, 2, …, m}:

        R_a* = Σ_{i=1}^{m} f_i R_i

• Geometric Mean Performance:
  – With equal weighting:

        R_g = ( Π_{i=1}^{m} R_i )^{1/m}

  – With weight distribution π = {f_i | i = 1, 2, …, m}:

        R_g* = Π_{i=1}^{m} R_i^{f_i}
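A short sketch of the four means above on made-up benchmark rates (the rates and weights are illustration values only):

```python
import math

# Sketch: arithmetic and geometric mean execution rates, equal-weighted
# and weighted. R[i] are execution rates (e.g., MIPS); f sums to 1.
R = [100.0, 200.0, 400.0]
f = [0.5, 0.3, 0.2]

Ra = sum(R) / len(R)                                # arithmetic mean
Ra_w = sum(fi * Ri for fi, Ri in zip(f, R))         # weighted arithmetic
Rg = math.prod(R) ** (1.0 / len(R))                 # geometric mean
Rg_w = math.prod(Ri ** fi for fi, Ri in zip(f, R))  # weighted geometric
```

Note how the geometric mean (200) sits below the arithmetic mean (≈233) for the same rates, as it always does for unequal values.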
Harmonic Mean Performance
• Arithmetic mean execution time per instruction:

      T_a = (1/m) Σ_{i=1}^{m} T_i = (1/m) Σ_{i=1}^{m} 1/R_i

• The harmonic mean execution rate across m benchmark programs:

      R_h = 1 / T_a = m / ( Σ_{i=1}^{m} 1/R_i )

• Weighted harmonic mean execution rate with weight distribution π = {f_i | i = 1, 2, …, m}:

      R_h* = 1 / ( Σ_{i=1}^{m} f_i / R_i )

• Harmonic Mean Speedup for a program with n parallel execution modes:

      S = T_1 / T* = 1 / ( Σ_{i=1}^{n} f_i / R_i )
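A small sketch of the harmonic-mean formulas; the two-mode rate profile below is made up (mode rates normalized so the sequential mode has rate 1):

```python
# Sketch: harmonic mean rate and harmonic-mean speedup for a program
# with two execution modes. f[i] = fraction of work in mode i.
R = [1.0, 4.0]    # mode rates: sequential, and 4-processor mode
f = [0.2, 0.8]    # fraction of work executed in each mode

Rh = len(R) / sum(1.0 / Ri for Ri in R)           # harmonic mean
Rh_w = 1.0 / sum(fi / Ri for fi, Ri in zip(f, R)) # weighted harmonic mean
S = 1.0 / sum(fi / Ri for fi, Ri in zip(f, R))    # harmonic-mean speedup
```

Because mode rates are normalized to R_1 = 1, the weighted harmonic mean rate and the speedup coincide here (both 2.5).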
Harmonic Mean Speedup for n Execution Mode Multiprocessor System
Fig 3.2 page 111: see handout
Parallel Performance Metrics Revisited
Efficiency, Utilization, Redundancy, Quality of Parallelism
• System Efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the execution time in unit time steps:
  – Speedup factor: S(n) = T(1) / T(n)
  – System efficiency for an n-processor system: E(n) = S(n)/n = T(1) / [n T(n)]
• Redundancy: R(n) = O(n) / O(1)
• Utilization: U(n) = R(n) E(n) = O(n) / [n T(n)]
• Quality of Parallelism: Q(n) = S(n) E(n) / R(n) = T³(1) / [n T²(n) O(n)]
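The definitions above chain together directly; a sketch with made-up measurements (the operation and time counts are illustration values):

```python
# Sketch: speedup, efficiency, redundancy, utilization, and quality of
# parallelism from measured unit operations O and unit time steps T.
T1, Tn = 1000.0, 150.0    # T(1), T(n)
O1, On = 1000.0, 1200.0   # O(1), O(n): parallel run does extra work
n = 8

S = T1 / Tn          # speedup S(n)
E = S / n            # efficiency E(n)
Rd = On / O1         # redundancy R(n)
U = Rd * E           # utilization U(n) = O(n) / (n T(n))
Q = S * E / Rd       # quality of parallelism Q(n)
```

In this example the 8 processors are fully busy (U = 1.0), but 20% redundant work drags the quality of parallelism below the raw speedup.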
Parallel Performance Example
• The execution time T for three parallel programs is given in terms of processor count P and problem size N.
• In each case, we assume that the total computation work performed by an optimal sequential algorithm scales as N + N².
1. For the first parallel algorithm: T = N + N²/P
   This algorithm partitions the computationally demanding O(N²) component of the algorithm but replicates the O(N) component on every processor. There are no other sources of overhead.
2. For the second parallel algorithm: T = (N + N²)/P + 100
   This algorithm optimally divides all the computation among all processors but introduces an additional cost of 100.
3. For the third parallel algorithm: T = (N + N²)/P + 0.6P²
   This algorithm also partitions all the computation optimally but introduces an additional cost of 0.6P².
• All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100. However, they behave differently in other situations, as shown next.
• With N = 100, all three algorithms perform poorly for larger P, although Algorithm (3) does noticeably worse than the other two.
• When N = 1000, Algorithm (2) is much better than Algorithm (1) for larger P.
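The three execution-time models can be checked numerically; this sketch reproduces the speedup-of-about-10.8 claim at P = 12, N = 100:

```python
# Sketch: speedups of the three parallel algorithms above, relative to
# a sequential time of N + N**2.
def speedups(N: float, P: float) -> tuple[float, float, float]:
    seq = N + N**2
    t1 = N + N**2 / P                  # Algorithm 1: replicated O(N) part
    t2 = (N + N**2) / P + 100.0        # Algorithm 2: fixed cost 100
    t3 = (N + N**2) / P + 0.6 * P**2   # Algorithm 3: cost growing as P**2
    return seq / t1, seq / t2, seq / t3
```

For example, `speedups(100, 12)` gives roughly (10.8, 10.7, 10.9), while at N = 1000 and P = 100 Algorithm 2's fixed cost is amortized and it clearly beats Algorithm 1.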
Parallel Performance Example (continued)
All algorithms achieve Speedup ≈ 10.8 when P = 12 and N = 100.
With N = 1000, Algorithm (2) performs much better than Algorithm (1) for larger P.
  Algorithm 1: T = N + N²/P
  Algorithm 2: T = (N + N²)/P + 100
  Algorithm 3: T = (N + N²)/P + 0.6P²
A Parallel Performance Measures Example
• O(1) = T(1) = n³
• O(n) = n³ + n²log₂n,  T(n) = 4n³/(n + 3)
Fig 3.4 page 114, Table 3.1 page 115, Table 3.2 page 117: see handout
Parallel Performance Metrics Revisited: Amdahl’s Law
• Harmonic Mean Speedup (i = number of processors used, R_i = i):

      S = T_1 / T* = 1 / ( Σ_{i=1}^{n} f_i / R_i )

• In the case π = {f_i for i = 1, 2, …, n} = (α, 0, 0, …, 1−α), the system is running sequential code with probability α and utilizing n processors with probability (1−α), with other processor modes not utilized.

  Amdahl’s Law:

      S_n = n / (1 + (n − 1) α)

      S → 1/α as n → ∞

  Under these conditions the best speedup is upper-bounded by 1/α.
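A minimal sketch of the closed form above (α is the sequential fraction; overhead is assumed zero):

```python
# Sketch of Amdahl's Law for the fixed-load case.
def amdahl_speedup(alpha: float, n: int) -> float:
    """S_n = n / (1 + (n - 1) * alpha); bounded above by 1/alpha."""
    return n / (1.0 + (n - 1) * alpha)
```

For α = 0.1, twelve processors yield only `amdahl_speedup(0.1, 12)` ≈ 5.7, and no processor count pushes the speedup past 1/α = 10.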
Application Models of Parallel Computers
• If the workload W or problem size s is unchanged, then:
  – The efficiency E decreases rapidly as the machine size n increases, because the overhead h increases faster than the machine size.
• The condition of a scalable parallel computer solving a scalable parallel problem exists when:
  – A desired level of efficiency is maintained by increasing the machine size and problem size proportionally.
  – In the ideal case the workload curve is a linear function of n (linear scalability in problem size).
• Application Workload Models for Parallel Computers — bounded by limited memory, limited tolerance to interprocess communication (IPC) latency, or limited I/O bandwidth:
  – Fixed-load Model: corresponds to a constant workload.
  – Fixed-time Model: constant execution time.
  – Fixed-memory Model: limited by the memory bound.
The Isoefficiency Concept
• Workload w as a function of problem size s: w = w(s)
• h: total communication and other overhead, as a function of problem size s and machine size n: h = h(s, n)
• The efficiency of a parallel algorithm implemented on a given parallel computer can be defined as:

      E = w(s) / ( w(s) + h(s, n) )

• Isoefficiency Function: E can be rewritten as E = 1 / (1 + h(s, n)/w(s)). To maintain a constant E, w(s) should grow in proportion to h(s, n), or:

      w(s) = ( E / (1 − E) ) · h(s, n)

  C = E/(1 − E) is a constant for a fixed efficiency E. The isoefficiency function is defined as follows:

      f_E(n) = C · h(s, n)

  If the workload w(s) grows as fast as f_E(n), then a constant efficiency can be maintained for the algorithm-architecture combination.
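A sketch of the constant-efficiency condition, using an assumed (hypothetical) overhead growth h(s, n) = n·log₂n and a target efficiency of 0.8:

```python
import math

# Sketch: growing the workload along the isoefficiency function keeps
# efficiency constant. The overhead model below is an assumption for
# illustration, not a property of any particular machine.
def efficiency(w: float, h: float) -> float:
    return w / (w + h)

E_target = 0.8
C = E_target / (1.0 - E_target)    # C = E/(1-E) = 4
for n in (2, 8, 64, 1024):
    h = n * math.log2(n)           # assumed overhead h(s, n)
    w = C * h                      # workload grown per f_E(n) = C*h(s,n)
    assert abs(efficiency(w, h) - E_target) < 1e-12
```

If the workload were held fixed instead, the same loop would show efficiency falling as n grows, which is the motivation for the isoefficiency function.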
Speedup Performance Laws: Fixed-Workload Speedup
When DOP = i > n (n = number of processors), the execution time of W_i is:

      t_i(n) = (W_i / (i Δ)) ⌈i/n⌉

Total execution time:

      T(n) = Σ_{i=1}^{m} (W_i / (i Δ)) ⌈i/n⌉

If DOP = i ≤ n, then t_i(n) = t_i(∞) = W_i / (i Δ).

The fixed-load speedup factor is defined as the ratio of T(1) to T(n):

      S_n = T(1) / T(n) = ( Σ_{i=1}^{m} W_i ) / ( Σ_{i=1}^{m} (W_i / i) ⌈i/n⌉ )

Let Q(n) be the total system overheads on an n-processor system:

      S_n = T(1) / ( T(n) + Q(n) ) = ( Σ_{i=1}^{m} W_i ) / ( Σ_{i=1}^{m} (W_i / i) ⌈i/n⌉ + Q(n) )

The overhead delay Q(n) is both application- and machine-dependent and difficult to obtain in closed form.
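The fixed-load speedup is easy to evaluate once the work profile is known; a sketch (the profile and overhead Q are made-up values, and Δ cancels in the ratio):

```python
import math

# Sketch: fixed-load (Amdahl-style) speedup from a work profile.
# W maps DOP i -> work W_i; Q is the total overhead in the same units.
def fixed_load_speedup(W: dict, n: int, Q: float = 0.0) -> float:
    T1 = sum(W.values())
    Tn = sum((Wi / i) * math.ceil(i / n) for i, Wi in W.items())
    return T1 / (Tn + Q)
```

For W = {1: 10, 8: 80}, four processors must execute the DOP-8 work in two rounds (⌈8/4⌉ = 2), giving speedup 3.0, while eight processors give 4.5.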
Amdahl’s Law for Fixed-Load Speedup
• For the special case where the system either operates in sequential mode (DOP = 1) or a perfect parallel mode (DOP = n), the fixed-load speedup is simplified to:

      S_n = (W_1 + W_n) / (W_1 + W_n / n)

  We assume here that the overhead factor Q(n) = 0.
  For the normalized case where W_1 + W_n = 1, with W_1 = α and W_n = 1 − α, the equation is reduced to the previously seen form of Amdahl’s Law:

      S_n = n / (1 + (n − 1) α)
Fixed-Time Speedup
• Goal: run the largest problem size possible on a larger machine with about the same execution time.

  Let m' be the maximum DOP for the scaled-up problem, and W'_i be the scaled workload with DOP = i.
  In general, W'_i > W_i for 2 ≤ i ≤ m' and W'_1 = W_1.

  The fixed-time condition T(1) = T'(n) requires:

      Σ_{i=1}^{m} W_i = Σ_{i=1}^{m'} (W'_i / i) ⌈i/n⌉ + Q(n)

  The speedup is given by S'_n = T'(1) / T'(n):

      S'_n = ( Σ_{i=1}^{m'} W'_i ) / ( Σ_{i=1}^{m'} (W'_i / i) ⌈i/n⌉ + Q(n) ) = ( Σ_{i=1}^{m'} W'_i ) / ( Σ_{i=1}^{m} W_i )
Gustafson’s Fixed-Time Speedup
• For the special fixed-time speedup case where DOP can either be 1 or n, and assuming Q(n) = 0:

      S'_n = ( Σ_{i=1}^{m'} W'_i ) / ( Σ_{i=1}^{m} W_i ) = (W'_1 + W'_n) / (W_1 + W_n) = (W_1 + n W_n) / (W_1 + W_n)

  where W'_1 = W_1 and W'_n = n W_n.

  Assuming W_1 + W_n = 1, with α = W_1 and 1 − α = W_n:

      S'_n = T(1) / T'(n) = α + n (1 − α) = n − (n − 1) α
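The closed form above is a one-liner; the sketch below contrasts it with the fixed-load case:

```python
# Sketch of Gustafson's fixed-time speedup: the parallel portion of the
# workload is scaled by n, so speedup grows nearly linearly in n.
def gustafson_speedup(alpha: float, n: int) -> float:
    """S'_n = alpha + n * (1 - alpha) = n - (n - 1) * alpha."""
    return n - (n - 1) * alpha
```

For α = 0.1 and n = 12, the fixed-time speedup is 10.9, versus only about 5.7 under the fixed-load Amdahl model with the same α: scaling the problem keeps the processors productive.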
Fixed-Memory Speedup
• Let M be the memory requirement of a given problem.
• Let W = g(M), or M = g⁻¹(W), relate workload to memory, where:

      W = Σ_{i=1}^{m} W_i          workload for sequential execution

      W* = Σ_{i=1}^{m*} W*_i       scaled workload on n nodes

  The memory bound for an active node is g⁻¹( Σ_{i=1}^{m*} W*_i ).

  The fixed-memory speedup is defined by:

      S*_n = ( Σ_{i=1}^{m*} W*_i ) / ( Σ_{i=1}^{m*} (W*_i / i) ⌈i/n⌉ + Q(n) )

  Assuming W*_n = g*(nM) = G(n) g(M) = G(n) W_n, either sequential or perfect parallelism, and Q(n) = 0:

      S*_n = (W*_1 + W*_n) / (W*_1 + W*_n / n) = (W_1 + G(n) W_n) / (W_1 + G(n) W_n / n)
Scalability Metrics
• The study of scalability is concerned with determining the degree of matching between a computer architecture and an application algorithm, and whether this degree of matching continues to hold as problem and machine sizes are scaled up.
• Basic scalability metrics affecting the scalability of the system for a given problem:
  – Machine size n          – Clock rate f
  – Problem size s          – CPU time T
  – I/O demand d            – Memory capacity m
  – Communication overhead h(s, n), where h(s, 1) = 0
  – Computer cost c
  – Programming overhead p
Parallel Scalability Metrics

[Figure: the scalability of an architecture/algorithm combination, influenced by machine size, hardware cost, CPU time, I/O demand, memory demand, programming cost, problem size, and communication overhead.]
Revised Asymptotic Speedup, Efficiency
• Revised Asymptotic Speedup:
  – s: problem size.
  – T(s, 1): minimal sequential execution time on a uniprocessor.
  – T(s, n): minimal parallel execution time on an n-processor system.
  – h(s, n): lump sum of all communication and other overheads.

      S(s, n) = T(s, 1) / ( T(s, n) + h(s, n) )

• Revised Asymptotic Efficiency:

      E(s, n) = S(s, n) / n
Parallel System Scalability
• Scalability (informal, restrictive definition):
  A system architecture is scalable if the system efficiency E(s, n) = 1 for all algorithms with any number of processors n and any problem size s.
• Scalability Definition (more formal):
  The scalability Φ(s, n) of a machine for a given algorithm is defined as the ratio of the asymptotic speedup S(s, n) on the real machine to the asymptotic speedup S_I(s, n) on the ideal realization of an EREW PRAM:

      S_I(s, n) = T(s, 1) / T_I(s, n)

      Φ(s, n) = S(s, n) / S_I(s, n) = T_I(s, n) / T(s, n)
Example: Scalability of Network Architectures for Parity Calculation
Table 3.7 page 142: see handout
Programmability vs. Scalability

[Figure: increased scalability (vertical axis) vs. increased programmability (horizontal axis); message-passing multicomputers with distributed memory rank higher in scalability, multiprocessors with shared memory rank higher in programmability, and ideal parallel computers combine both.]
MPPs Scalability Issues
• Problems:
– Memory-access latency.
– Interprocess communication complexity or synchronization overhead.
– Multi-cache inconsistency.
– Message-passing overheads.
– Low processor utilization and poor system performance for very large system sizes.
• Possible Solutions:– Low-latency fast synchronization techniques.
– Weaker memory consistency models.
– Scalable cache coherence protocols.
– Realize shared virtual memory.
– Improved software portability; standard parallel and distributed operating system support.
Scalable Machines
• Design Choices:
– Custom-designed or commodity nodes?
– Capability of node-to-network interface
– Supporting programming models?
• What does hardware scalability mean?
  – Avoids inherent design limits on resources.
– Bandwidth increases with machine size P
– Latency does not.
– Cost increases slowly with P.
Limited Scaling of a Bus
• Bus: each level of the system design is grounded in the scaling limits at the layers below and in assumptions of close coupling between components.

  Characteristic               Bus
  Physical length              ~ 1 ft
  Number of connections        fixed
  Maximum bandwidth            fixed
  Interface to comm. medium    memory interface
  Global order                 arbitration
  Protection                   virtual -> physical
  Trust                        total
  OS                           single
  Comm. abstraction            HW
Scaling of Workstations in a LAN?
• No clear limit to physical scaling; no global order; consensus difficult to achieve.
• Independent failure and restart.

  Characteristic               Bus                  LAN
  Physical length              ~ 1 ft               km
  Number of connections        fixed                many
  Maximum bandwidth            fixed                ???
  Interface to comm. medium    memory interface     peripheral
  Global order                 arbitration          ???
  Protection                   virtual -> physical  OS
  Trust                        total                none
  OS                           single               independent
  Comm. abstraction            HW                   SW
Bandwidth Scalability
• What fundamentally limits bandwidth?
  – A single set of wires.
• Must have many independent wires.
• Connect modules through switches.
• Bus vs. network switch.

[Figure: typical switch organizations — bus, multiplexers, crossbar — connecting processor/memory (P, M) modules through switches (S).]
Dancehall MP Organization
• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency?

[Figure: processors with caches (P, $) on one side of a scalable switched network, memory modules (M) on the other.]
Generic Distributed Memory Organization
• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency?

[Figure: nodes, each with processor, cache, memory, and communication assist (CA), connected by a scalable switched network.]
Key System Scaling Property
• Large number of independent communication paths between nodes:
  => allows a large number of concurrent transactions using different wires, initiated independently.
• No global arbitration.
• The effect of a transaction is only visible to the nodes involved:
  – Effects are propagated through additional transactions.
Network Latency Scaling Example
• Max distance: log n hops
• Number of switches: ~ n log n
• Overhead = 1 µs, BW = 64 MB/s, 200 ns per hop
• Pipelined:
  T64(128)   = 1.0 µs + 2.0 µs + 6 hops × 0.2 µs/hop  = 4.2 µs
  T1024(128) = 1.0 µs + 2.0 µs + 10 hops × 0.2 µs/hop = 5.0 µs
• Store and Forward:
  T64sf(128)   = 1.0 µs + 6 hops × (2.0 + 0.2) µs/hop  = 14.2 µs
  T1024sf(128) = 1.0 µs + 10 hops × (2.0 + 0.2) µs/hop = 23 µs
(The 2.0 µs term is the transfer time of a 128-byte message at 64 MB/s.)
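The latency model above can be sketched as a small function (the network parameters are the ones stated on the slide: 1 µs overhead, 64 MB/s links, 0.2 µs per hop, log₂n hops):

```python
import math

# Sketch: one-way message latency (microseconds) in an n-node network
# with log2(n) hops, under cut-through (pipelined) vs. store-and-forward.
def latency_us(n_nodes: int, msg_bytes: int, cut_through: bool) -> float:
    overhead = 1.0                 # fixed software overhead, us
    transfer = msg_bytes / 64.0    # us at 64 MB/s (64 bytes per us)
    hops = int(math.log2(n_nodes))
    if cut_through:                # pay the transfer time only once
        return overhead + transfer + hops * 0.2
    return overhead + hops * (transfer + 0.2)   # pay it at every hop
```

This reproduces the slide's numbers: 4.2 µs and 5.0 µs pipelined, versus 14.2 µs and 23 µs store-and-forward, showing why cut-through routing scales so much better with hop count.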
Cost Scaling
• cost(p, m) = fixed cost + incremental cost(p, m)
• Bus-based SMP?
• Ratio of processors : memory : network : I/O?
• Parallel efficiency(p) = Speedup(p) / p
• Costup(p) = Cost(p) / Cost(1)
• Cost-effective: speedup(p) > costup(p)
• Is super-linear speedup needed for cost-effectiveness? No: because of the fixed cost, costup(p) grows more slowly than p, so speedup(p) can exceed costup(p) without being super-linear.
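A sketch of the cost-effectiveness test; the cost model (a large fixed component plus a per-processor increment) is a hypothetical illustration, not data from the lecture:

```python
# Sketch: cost-effectiveness check, speedup(p) > costup(p).
def costup(p: int, fixed: float = 100.0, per_proc: float = 10.0) -> float:
    """Cost(p)/Cost(1) under an assumed fixed-plus-incremental model."""
    return (fixed + per_proc * p) / (fixed + per_proc)

def amdahl(alpha: float, p: int) -> float:
    return p / (1.0 + (p - 1) * alpha)
```

With α = 0.05 and p = 16, the speedup is about 9.1 while the costup is only about 2.4: far from linear speedup, yet clearly cost-effective.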
nCUBE/2 Machine Organization
• Entire machine synchronous at 40 MHz.

[Figure: single-chip node (MMU, instruction fetch and decode, 64-bit integer and IEEE floating-point units, operand cache, DRAM interface, DMA channels, router), the basic module, and the hypercube network configuration.]
CM-5 Machine Organization

[Figure: diagnostics, control, and data networks connecting control processors, processing partitions, and an I/O partition; each node has a SPARC processor on an MBUS, an FPU, cache controller with SRAM, a network interface (NI), and DRAM banks with controllers and vector units.]