TRANSCRIPT
Lecture 13 (part 1): Thread Level Parallelism (6)
EEC 171 Parallel Architectures
John Owens
UC Davis
Credits
• © John Owens / UC Davis 2007–8.
• Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Kathy Yelick / UCB 2007, © Wen-Mei Hwu / David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–7, © John Lazzaro / UCB 2006, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asanović / Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998.
Outline
• Interconnection Networks
• Grab bag:
• Amdahl’s Law
• Novices & Parallel Programming
• Interconnect Technologies
Interconnection Networks: © Timothy Mark Pinkston and José Duato... with major presentation contribution from José Flich
Network Topology: Preliminaries and Evolution
• One switch suffices to connect a small number of devices
– Number of switch ports limited by VLSI technology, power consumption, packaging, and other such cost constraints
• A fabric of interconnected switches (i.e., switch fabric or network fabric) is needed when the number of devices is much larger
– The topology must make a path(s) available for every pair of devices—property of connectedness or full access (What paths?)
• Topology defines the connection structure across all components
– Bisection bandwidth: the minimum bandwidth of all links crossing a network split into two roughly equal halves
– Full bisection bandwidth: Network BW_Bisection = Injection (or Reception) BW_Bisection = N/2 (see the sketch after this slide)
– Bisection bandwidth mainly affects performance
• Topology is constrained primarily by local chip/board pin-outs; secondarily (if at all) by global bisection bandwidth
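To make the full-bisection condition concrete, here is a minimal Python sketch (not part of the original slides; the topology names and bisection link counts are the standard textbook values for a ring, a 2-D mesh, and a hypercube). It simply checks whether the links crossing a balanced cut can carry the injection bandwidth of half the nodes, i.e., bisection links ≥ N/2 in units of one link's bandwidth.

```python
# Full bisection bandwidth holds when the links crossing a balanced cut can
# carry the injection bandwidth of half the nodes: bisection_links >= N/2
# (counting bandwidth in units of a single link).

def has_full_bisection(num_nodes, bisection_links):
    return bisection_links >= num_nodes / 2

examples = [
    ("16-node bidirectional ring", 16, 2),       # a ring is cut in exactly 2 places
    ("4 x 4 2-D mesh",             16, 4),       # one column of 4 links crosses the cut
    ("16-node hypercube",          16, 16 // 2), # an n-cube's bisection is N/2 links
]

for name, n, cut in examples:
    verdict = "full" if has_full_bisection(n, cut) else "less than full"
    print(f"{name}: bisection = {cut} links -> {verdict} bisection bandwidth")
```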
Network Topology: Centralized Switched (Indirect) Networks
• Crossbar network
– Crosspoint switch complexity increases quadratically with the number of crossbar input/output ports, N, i.e., grows as O(N²)
– Has the property of being non-blocking
[Figure: 8 × 8 crossbar connecting input ports 0–7 to output ports 0–7]
Network Topology: Centralized Switched (Indirect) Networks
• Multistage interconnection networks (MINs)
– Crossbar split into several stages consisting of smaller crossbars
– Complexity grows as O(N × log N), where N is # of end nodes (see the routing sketch after the figure)
– Inter-stage connections represented by a set of permutation functions
[Figure: Omega topology, perfect-shuffle exchange (8 ports)]
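A small sketch (helper names are made up for illustration; not from the slides) of the two points above: an N-port Omega MIN needs (N/2)·log2 N two-by-two switches versus N² crosspoints for a crossbar, and a packet self-routes by destination-tag routing, consuming one destination bit per stage after each perfect shuffle. The last line also previews the blocking property discussed a couple of slides later: two packets headed to different destinations can still contend for the same switch output.

```python
# Destination-tag routing through an N-input Omega network (N a power of two).
# Each of the log2(N) stages is a perfect shuffle of the wires followed by a
# column of N/2 two-by-two switches; bit i of the destination (MSB first)
# selects the upper (0) or lower (1) switch output at stage i.

def omega_path(src, dst, n_bits):
    """Return the wire occupied after each stage for a packet src -> dst."""
    pos, path = src, []
    mask = (1 << n_bits) - 1
    for stage in range(n_bits):
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask   # perfect shuffle (rotate left)
        bit = (dst >> (n_bits - 1 - stage)) & 1              # destination-tag bit
        pos = (pos & ~1) | bit                               # 2x2 switch exchange
        path.append((stage, pos))
    assert pos == dst
    return path

def conflicts(path_a, path_b):
    """Blocking: two packets need the same switch output at the same stage."""
    return bool(set(path_a) & set(path_b))

N_BITS = 3                                                   # 8-port Omega, 3 stages
print("2x2 switches in Omega:", (1 << N_BITS) // 2 * N_BITS) # (N/2) * log2 N = 12
print("crosspoints in crossbar:", (1 << N_BITS) ** 2)        # N^2 = 64
print("0->1 vs 4->0 conflict?", conflicts(omega_path(0, 1, N_BITS),
                                          omega_path(4, 0, N_BITS)))
```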
Network Topology: Centralized Switched (Indirect) Networks
[Figure: 16-port, 4-stage butterfly network]
Network Topology: Centralized Switched (Indirect) Networks
• Reduction in MIN switch cost comes at the price of performance
– Network has the property of being blocking
– Contention is more likely to occur on network links
› Paths from different sources to different destinations share one or more links
[Figure: a blocked path in the 8-port blocking topology (MIN) versus conflict-free paths in the non-blocking 8 × 8 crossbar]
Network Topology: Centralized Switched (Indirect) Networks
• How to reduce blocking in MINs? Provide alternative paths!
– Use larger switches (can equate to using more switches)
› Clos network: minimally three stages (non-blocking)
» A larger switch in the middle of two other switch stages provides enough alternative paths to avoid all conflicts
– Use more switches
› Add (logk N) − 1 stages, mirroring the original topology
» Rearrangeably non-blocking
» Allows for non-conflicting paths
» Doubles network hop count (distance), d
» Centralized control can rearrange established paths
› Benes topology: 2(log2 N) − 1 stages (rearrangeably non-blocking)
» Recursively applies the three-stage Clos network concept to the middle-stage set of switches to reduce all switches to 2 × 2
Network Topology: Centralized Switched (Indirect) Networks
[Figure: 16-port, 7-stage Clos network = Benes topology]
Network Topology: Centralized Switched (Indirect) Networks
[Figure: alternative paths from 0 to 1 in the 16-port, 7-stage Clos (Benes) network]
Network Topology: Centralized Switched (Indirect) Networks
[Figure: alternative paths from 4 to 0 in the 16-port, 7-stage Clos (Benes) network]
Network Topology
Myrinet-2000 Clos Network for 128 Hosts
• Backplane of the M3-E128 Switch
• M3-SW16-8F fiber line card (8 ports)
http://myri.com
Network Topology: Distributed Switched (Direct) Networks
• Bidirectional Ring networks
– N switches (3 × 3) and N bidirectional network links
– Simultaneous packet transport over disjoint paths
– Packets must hop across intermediate nodes
– Shortest direction usually selected (N/4 hops, on average; see the sketch below)
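A quick numerical check (illustrative only; not from the slides) of the N/4 average-hop figure for a bidirectional ring when every packet takes the shorter direction around the ring.

```python
# Average hop count in an N-node bidirectional ring when each packet takes the
# shorter direction; for large N this approaches N/4, as quoted on the slide.

def avg_ring_hops(n):
    hops = [min(d, n - d) for d in range(1, n)]  # distances to the other n-1 nodes
    return sum(hops) / len(hops)

for n in (8, 16, 64, 1024):
    print(f"N={n:5d}: average hops = {avg_ring_hops(n):7.2f}  (N/4 = {n/4:.2f})")
```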
Network Topology: Distributed Switched (Direct) Networks
• Fully connected and ring topologies delimit the two extremes
• The ideal topology:
– Cost approaching a ring
– Performance approaching a fully connected (crossbar) topology
• More practical topologies (see the sketch after this slide):
– k-ary n-cubes (meshes, tori, hypercubes)
› k nodes connected in each dimension, with n total dimensions
› Symmetry and regularity
» network implementation is simplified
» routing is simplified
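A small sketch (the function name is made up for illustration) of the basic counts for a k-ary n-cube: k^n nodes, with a diameter of n·(k−1) hops for a mesh and n·⌊k/2⌋ hops for a torus with wraparound links; a hypercube is simply the k = 2 case.

```python
# Basic size/diameter counts for k-ary n-cubes (k nodes per dimension, n dims).
# A hypercube is the k = 2 case; a 2-D mesh or torus is the n = 2 case.

def kary_ncube(k, n, torus=True):
    nodes = k ** n
    # Longest shortest path: (k - 1) hops per dimension for a mesh,
    # floor(k / 2) per dimension when wraparound links are present (torus).
    diameter = n * (k // 2 if torus else k - 1)
    return nodes, diameter

for label, k, n, torus in [("4-ary 2-cube (4x4 torus)",          4, 2, True),
                           ("4-ary 2-cube mesh (4x4 grid)",      4, 2, False),
                           ("2-ary 4-cube (16-node hypercube)",  2, 4, True)]:
    nodes, diam = kary_ncube(k, n, torus)
    print(f"{label}: {nodes} nodes, diameter {diam} hops")
```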
Network Topology: Distributed Switched (Direct) Networks
[Figure: 2D mesh or grid of 16 nodes; 2D torus of 16 nodes; hypercube of 16 nodes (16 = 2^4, so n = 4)]
Network bisection ≤ full bisection bandwidth!
“Performance Analysis of k-ary n-cube Interconnection Networks,” W. J. Dally, IEEE Trans. on Computers, Vol. 39, No. 6, pp. 775–785, June 1990.
Network Topology
Topological Characteristics of Commercial Machines
Columns: max. number of nodes [× # CPUs]; basic network topology; injection [reception] node BW in MB/s; # of data bits per link per direction; raw network link BW per direction in MB/s; raw network bisection BW (bidirectional) in GB/s.
• Intel ASCI Red Paragon: 4,510 [× 2]; 2-D mesh 64 × 64; 400 [400]; 16 bits; 400; 51.2
• IBM ASCI White SP Power3 [Colony]: 512 [× 16]; BMIN with 8-port bidirectional switches (fat tree or Omega); 500 [500]; 8 bits (+1 bit of control); 500; 256
• Intel Thunder Itanium2 Tiger4 [QsNet II]: 1,024 [× 4]; fat tree with 8-port bidirectional switches; 928 [928]; 8 bits (+2 control for 4b/5b enc); 1,333; 1,365
• Cray XT3 [SeaStar]: 30,508 [× 1]; 3-D torus 40 × 32 × 24; 3,200 [3,200]; 12 bits; 3,800; 5,836.8
• Cray X1E: 1,024 [× 1]; 4-way bristled 2-D torus (~23 × 11) with express links; 1,600 [1,600]; 16 bits; 1,600; 51.2
• IBM ASC Purple pSeries 575 [Federation]: >1,280 [× 8]; BMIN with 8-port bidirectional switches (fat tree or Omega); 2,000 [2,000]; 8 bits (+2 bits of control); 2,000; 2,560
• IBM Blue Gene/L eServer Sol. [Torus Net]: 65,536 [× 2]; 3-D torus 32 × 32 × 64; 612.5 [1,050]; 1 bit (bit serial); 175; 358.4
Routing, Arbitration, and Switching: Routing
• Performed at each switch, regardless of topology
• Defines the “allowed” path(s) for each packet (Which paths?)
• Needed to direct packets through network to intended destinations
• Ideally:
– Supply as many routing options to packets as there are paths provided by the topology, and evenly distribute network traffic among network links using those paths, minimizing contention
• Problems: situations that cause packets never to reach their destination
– Livelock
› Arises from an unbounded number of allowed non-minimal hops
› Solution: restrict the number of non-minimal (mis)hops allowed
– Deadlock
› Arises from a set of packets being blocked waiting only for network resources (i.e., links, buffers) held by other packets in the set
› Probability increases with increased traffic & decreased availability
(A dimension-order routing sketch follows this slide.)
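Dimension-order (XY) routing, which several of the on-chip networks later in this lecture use ("XY DOR"), is one standard way to rule out routing-induced deadlock on a mesh: every packet exhausts its X hops before taking any Y hop, so channel dependencies can never form a cycle. A minimal sketch with illustrative coordinates (not from the slides):

```python
# Dimension-order (XY) routing on a 2-D mesh: route fully in X, then in Y.
# Because every path acquires channels in a fixed dimension order, the channel
# dependency graph has no cycles, which rules out routing-induced deadlock.

def xy_route(src, dst):
    """Yield the sequence of (x, y) nodes visited from src to dst."""
    (x, y), (dx, dy) = src, dst
    while x != dx:                       # all X hops first
        x += 1 if dx > x else -1
        yield (x, y)
    while y != dy:                       # then all Y hops
        y += 1 if dy > y else -1
        yield (x, y)

print(list(xy_route((0, 0), (2, 3))))
# [(1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```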
Routing, Arbitration, and Switching: Routing
• Common forms of deadlock:
– Routing-induced deadlock
[Figure: routing of packets in a 2D mesh and the corresponding channel dependency graph; ci = channel i, si = source node i, di = destination node i, pi = packet i]
“A Formal Model of Message Blocking and Deadlock Resolution in Interconnection Networks,” S. Warnakulasuriya and T. Pinkston, IEEE Trans. on Parallel and Distributed Systems, Vol. 11, No. 3, pp. 212–229, March 2000.
On-Chip Networks (OCNs)
Columns: year built; number of network ports [cores or tiles + other ports]; basic network topology; link bandwidth [link clock speed]; # of chip metal layers, flow control, # of VCs; # of data bits per link per direction; routing, arbitration, switching.
• MIT Raw [General Dynamic Network] (2002): 16 ports [16 tiles]; 2-D mesh 4 × 4; 0.9 GB/s [225 MHz, clocked at proc speed]; 6 layers, credit-based, no VCs; 32 bits; XY DOR with request-reply deadlock recovery, RR arbitration, wormhole
• IBM POWER5 (2004): 7 ports [2 PE cores + 5 other ports]; crossbar; [1.9 GHz, clocked at proc speed]; 7 layers, handshaking, no virtual channels; 256 b inst fetch, 64 b for stores, 256 b loads; shortest-path, non-blocking, circuit switch
• U.T. Austin TRIPS EDGE [Operand Network] (2005): 25 ports [25 execution unit tiles]; 2-D mesh 5 × 5; 5.86 GB/s [533 MHz clk scaled by 80%]; 7 layers, on/off flow control, no VCs; 110 bits; XY DOR, distributed RR arbitration, wormhole
• U.T. Austin TRIPS EDGE [On-Chip Network] (2005): 40 ports [16 L2 tiles + 24 network interface tiles]; 2-D mesh 10 × 4; 6.8 GB/s [533 MHz clk scaled by 80%]; 7 layers, credit-based flow control, 4 VCs; 128 bits; XY DOR, distributed RR arbitration, VCT switched
• Sony/IBM/Toshiba Cell BE [Element Interconnect Bus] (2005): 12 ports [1 PPE and 8 SPEs + 3 other ports for memory and I/O interface]; ring (4 total, 2 in each direction); 25.6 GB/s [1.6 GHz, clocked at half the proc speed]; 8 layers, credit-based flow control, no VCs; 128 bits data (+16 bits tag); shortest-path, tree-based RR arbitration (centralized), pipelined circuit switch
• Sun UltraSPARC T1 processor (2005): up to 13 ports [8 PE cores + 4 L2 banks + 1 shared I/O]; crossbar; 19.2 GB/s [1.2 GHz, clocked at proc speed]; 9 layers, handshaking, no VCs; 128 b both for the 8 cores and the 4 L2 banks; shortest-path, age-based arbitration, VCT switched
Examples of Interconnection Networks
Cell Broadband Engine Element Interconnect Bus
• Cell BE is successor to PlayStation 2's Emotion Engine
– 300 MHz MIPS-based
– Uses two vector elements
– 6.2 GFLOPS (Single Precision)
– 72 KB Cache + 16 KB Scratch Pad RAM
– 240 mm² on 0.25-micron process
• PlayStation 3 uses the Cell BE*
– 3.2 GHz POWER-based
– Eight SIMD (Vector) Processor Elements
– >200 GFLOPS (Single Precision)
– 544 KB cache + 2 MB Local Store RAM
– 235 mm² on 90-nanometer SOI process
*Sony has decided to use only 7 SPEs for the PlayStation 3 to improve yield. Eight SPEs will be assumed for the purposes of this discussion.
Examples of Interconnection Networks
Cell Broadband Engine Element Interconnect Bus
• Cell Broadband Engine (Cell BE): 200 GFLOPS
– 12 Elements (devices) interconnected by EIB:
› One 64-bit Power processor element (PPE) with aggregate bandwidth of 51.2 GB/s
› Eight 128-bit SIMD synergistic processor elements (SPE) with local store, each with a bandwidth of 51.2 GB/s
› One memory interface controller (MIC) element with memory bandwidth of 25.6 GB/s
› Two configurable I/O interface elements: 35 GB/s (out) and 25 GB/s (in) of I/O bandwidth
– Element Interconnect Bus (EIB):
› Four unidirectional rings (two in each direction) each connect the heterogeneous 12 elements (end node devices)
› Data links: 128 bits wide @ 1.6 GHz; data bandwidth: 25.6 GB/s
› Provides coherent and non-coherent data transfer
› Should optimize network traffic flow (throughput) and utilization while minimizing network latency and overhead
Examples of Interconnection Networks
Examples of Interconnection Networks
Examples of Interconnection Networks
Cell Broadband Engine Element Interconnect Bus
• Element Interconnect Bus (EIB)
– Packet size: 16 B – 128 B (no headers); pipelined circuit switching
– Credit-based flow control (command bus central token manager)
– Two-stage, dual round-robin centralized network arbiter
– Allows up to 64 outstanding requests (DMA)
› 64 Request Buffers in the MIC; 16 Request Buffers per SPE
– Latency: 1 cycle/hop; transmission time (largest packet) 8 cycles
– Effective bandwidth: peak 307.2 GB/s, max. sustainable 204.8 GB/s (see the sketch below)
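A quick arithmetic check (assuming GB/s means 10^9 bytes/s, consistent with the slide's figures) of where the EIB numbers come from: a 128-bit data link at 1.6 GHz gives 25.6 GB/s per element, and 12 elements give the 307.2 GB/s peak; the 204.8 GB/s sustainable figure is the slide's quoted maximum, not derived here.

```python
# Back-of-the-envelope check of the EIB numbers quoted above.
link_bits = 128          # data width per element link
clock_hz  = 1.6e9        # EIB clock (half the 3.2 GHz processor clock)
elements  = 12           # PPE + 8 SPEs + MIC + 2 I/O interfaces

per_link_gbs = link_bits / 8 * clock_hz / 1e9
print("per-element data bandwidth:", per_link_gbs, "GB/s")             # 25.6
print("peak if all 12 elements move data:", elements * per_link_gbs, "GB/s")  # 307.2
```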
Examples of Interconnection Networks
Blue Gene/L 3D Torus Network
• 360 TFLOPS (peak)
• 2,500 square feet
• Connects 65,536 dual-processor nodes and 1,024 I/O nodes
– One processor for computation; other meant for communication
Examples of Interconnection Networks
Blue Gene/L 3D Torus Network
Packaging hierarchy (from the figure):
• Chip (node): 2 processors; 2.8/5.6 GF/s; 512 MB
• Compute card: 2 chips (1 × 2 × 1); 5.6/11.2 GF/s; 1.0 GB
• Node card: 32 chips (4 × 4 × 2), 16 compute + 0–2 I/O cards; 90/180 GF/s; 16 GB
• Rack: 32 node cards; 2.8/5.6 TF/s; 512 GB
• System: 64 racks (64 × 32 × 32); 180/360 TF/s; 32 TB
Node distribution: two nodes on a 2 × 1 × 1 compute card, 16 compute cards + 2 I/O cards on a 4 × 4 × 2 node board, 16 node boards on an 8 × 8 × 8 midplane, 2 midplanes on a 1,024-node rack; 8.6 meters maximum physical link length
www.ibm.com
Examples of Interconnection Networks
Blue Gene/L 3D Torus Network
• Main network: 32 × 32 × 64 3-D torus
– Each node connects to six other nodes
– Full routing in hardware
• Links and Bandwidth (see the sketch after this slide)
– 12 bit-serial links per node (6 in, 6 out)
– Torus clock speed runs at 1/4th of processor rate
– Each link is 1.4 Gb/s at target 700-MHz clock rate (175 MB/s)
– High internal switch connectivity to keep all links busy
› External switch input links: 6 at 175 MB/s each (1,050 MB/s aggregate)
› External switch output links: 6 at 175 MB/s each (1,050 MB/s aggregate)
› Internal datapath crossbar input links: 12 at 175 MB/s each
› Internal datapath crossbar output links: 6 at 175 MB/s each
› Switch injection links: 7 at 175 MB/s each (2 cores, each with 4 FIFOs)
› Switch reception links: 12 at 175 MB/s each (2 cores, each with 7 FIFOs)
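A quick check of the per-link and aggregate torus figures quoted above; this is pure arithmetic on the slide's own numbers, with no additional assumptions.

```python
# Quick check of the Blue Gene/L torus link numbers quoted above.
link_gbps   = 1.4                        # bit-serial link at the target 700 MHz clock
per_link_mb = link_gbps * 1000 / 8       # 175 MB/s per link per direction
print("per link:", per_link_mb, "MB/s")
print("6 output links:", 6 * per_link_mb, "MB/s aggregate")   # 1,050 MB/s
print("6 input links :", 6 * per_link_mb, "MB/s aggregate")   # 1,050 MB/s
```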
Encountering Amdahl's Law
• Speedup due to enhancement E is:
Speedup with E = Exec time w/o E / Exec time with E
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected. Then:
ExTime w/ E = ExTime w/o E × ((1 − F) + F/S)
Speedup w/ E = 1 / ((1 − F) + F/S)
Challenges of Parallel Processing
• Application parallelism ⇒ primarily via new algorithms that have better parallel performance
• Long remote latency impact ⇒ both by architect and by the programmer
• For example, reduce frequency of remote accesses either by
• Caching shared data (HW)
• Restructuring the data layout to make more accesses local (SW)
Examples: Amdahl’s Law
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time.
• Speedup w/ E =
• What if it’s usable only 15% of the time?
• Speedup w/ E =
• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less
Speedup w/ E = 1 / ((1-F) + F/S)
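A minimal sketch evaluating the formula above for the two examples on this slide and for the 100-processor claim; the numeric answers are computed here rather than taken from the slides.

```python
# Amdahl's Law: speedup = 1 / ((1 - F) + F/S)
def amdahl(F, S):
    return 1.0 / ((1.0 - F) + F / S)

print(amdahl(0.25, 20))     # enhancement usable 25% of the time -> ~1.31
print(amdahl(0.15, 20))     # usable 15% of the time             -> ~1.17
print(amdahl(0.9999, 100))  # 0.01% serial on 100 processors     -> ~99
```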
Challenges of Parallel Processing
• Second challenge is long latency to remote memory
• Suppose 32 CPU MP, 2 GHz, 200 ns remote memory, all local accesses hit memory hierarchy and base CPI is 0.5. (Remote access = 200/0.5 = 400 clock cycles.)
• What is performance impact if 0.2% instructions involve remote access?
• 1.5X
• 2.0X
• 2.5X
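A sketch of the CPI calculation behind this question, using only the numbers given on the slide; the computed slowdown is about 2.6×, closest to the 2.5× choice above.

```python
# Effective CPI when a fraction of instructions stall on remote memory.
base_cpi      = 0.5
remote_cycles = 200e-9 / (1 / 2e9)   # 200 ns at 2 GHz = 400 clock cycles
remote_frac   = 0.002                # 0.2% of instructions access remote memory

cpi = base_cpi + remote_frac * remote_cycles
print("effective CPI:", cpi)                       # 1.3
print("slowdown vs. all-local:", cpi / base_cpi)   # 2.6x
```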
Challenges of Parallel Processing
• Application parallelism ⇒ primarily via new algorithms that have better parallel performance
• Long remote latency impact ⇒ both by architect and by the programmer
• For example, reduce frequency of remote accesses either by
• Caching shared data (HW)
• Restructuring the data layout to make more accesses local (SW)
How hard is parallel programming anyway?
• Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers
• Lorin Hochstein, Jeff Carver, Forrest Shull, Sima Asgari, Victor Basili, Jeffrey K. Hollingsworth, Marvin V. Zelkowitz
• Supercomputing 2005
Why use students for testing?
• First, multiple students are routinely given the same assignment to perform, and thus we are able to conduct experiments in a way to control for the skills of specific programmers.
• Second, graduate students in a HPC class are fairly typical of a large class of novice HPC programmers who may have years of experience in their application domain but very little in HPC-style programming.
• Finally, due to the relatively low costs, student studies are an excellent environment to debug protocols that might be later used on practicing HPC programmers.
Tests run
A final important independent variable is programmer
experience. In the studies reported here, the majority of sub-
jects were novice HPC developers (not surprisingly, as the
studies were run in a classroom environment).
4.2 Dependent Variables
Our studies measured the following as outcomes of the
HPC development practices applied:
A primary measure of the quality of the HPC solution was
the speedup achieved, that is, the relative execution time of a
program running on a multi-processor system compared to a
uniprocessor system. In this paper, all values reported for
speedup were measured when the application was run on 8
parallel processors, as this was the largest number of proces-
sors that was feasible for use in our classroom environments.
A primary measure of cost was the amount of effort re-
quired to develop the final solution, for a given problem and
given approach. The effort undertaken to develop a serial solu-
tion included the following activities: thinking/planning the
solution, coding, and testing. The effort undertaken to develop
a parallel solution included all of the above as well as tuning
the parallel code (i.e. improving performance through optimiz-
ing the parallel instructions). HPC development for the studies
presented in this paper was always done having a serial ver-
sion available.
Another outcome variable studied was the code expansion
factor of the solutions. In order to take full advantage of the
parallel processors, HPC codes can be expected to include
many more lines of code (LOC) than serial solutions. For ex-
ample, message-passing approaches such as MPI require a
significant amount of code to deal with communication across
different nodes. The expansion factor is the ratio of LOC in a
parallel solution to LOC in a serial solution of the same prob-
lem.
Finally, we look at the cost per LOC of solutions in each
of the various approaches. This value is another measure (in
person-hours) of the relative cost of producing code in each of
the HPC approaches.
4.3 Studies Described in this Paper
To validate our methodology, we selected a subset of the
cells in the table for which we have already been able to
gather sufficient data to draw conclusions (i.e., the gray-
shaded cells in Table 1), where the majority of our data lies at
this time
Studies analyzed in this paper include:
C0A1. This data was collected in Fall 2003, from a
graduate-level course with 16 students. Subjects were asked to
implement the “game of life” program in C on a cluster of
PCs, first using a serial solution and then parallelizing the so-
lution with MPI.
C1A1. This data was from a replication of the C0A1 as-
signment in a different graduate-level course at the same uni-
versity in Spring 2004. 10 subjects participated.
C2A2. This data was collected in Spring 2004, from a
graduate-level course with 27 students. Subjects were asked to
parallelize the “grid of resistors” problem in a combination of
Matlab and C on a cluster of PCs. Given a serial version, sub-
Serial MPI OpenMP Co-Array
Fortran
StarP XMT
Nearest-Neighbor Type Problems
Game of Life C3A3 C3A3
C0A1
C1A1
C3A3
Grid of Resistors C2A2 C2A2 C2A2 C2A2
Sharks & Fishes C6A2 C6A2 C6A2
Laplace’s Eq. C2A3 P2A3
SWIM C0A2
Broadcast Type Problems
LU Decomposition C4A1
Parallel Mat-vec C3A4
Quantum Dynamics C7A1
Embarrassingly Parallel Type Problems
Buffon-Laplace Nee-
dle
C2A1
C3A1
C2A1
C3A1
C2A1
C3A1
(Miscellaneous Problem Types)
Parallel Sorting C3A2 C3A2 C3A2
Array Compaction C5A1
Randomized Selection C5A2
Table 1: Matrix describing the problem space of HPC studies being run. Columns show the parallel program-
ming model user. Rows shows the assignment, grouped by communication pattern required. Each study is indi-
cated with a label CxAy, identifying the participating class (C) and the assignment (A). Studies analyzed in this
paper are grey-shaded.
What They Learned
• Novices are able to achieve speedup on a parallel machine.
• MPI and OpenMP both require more {code, cost per line, effort} than serial implementations
• MPI takes more effort than OpenMP
jects were asked to produce an HPC solution first in MPI and
then in OpenMP.
C3A3. This data was collected in Spring 2004, from a
graduate-level course with 16 students. Subjects were asked to
implement the “game of life” program in C on a cluster of
PCs, first as a serial and then as an MPI and OpenMP version.
5 Hypotheses Investigated
The data collected from the classroom studies allow us to
address a number of issues about how novices learn to develop
HPC codes, by looking within and across cells on our matrix.
In Sections 5.1-5.4 we compare the two parallel programming
models, MPI and OpenMP, to serial development to derive a
better understanding of the relationship between serial devel-
opment and parallel development. Then in Section 5.5 we
compare MPI and OpenMP to each other to get a better under-
standing of their relationship with regard to programmer pro-
ductivity.
In the analysis presented in Sections 5.2-5.5, we used the
paired t-test [7] to compare MPI to OpenMP. For example, we
used a paired t-test to investigate whether there was any dif-
ference in the LOC required to implement the same solution in
different HPC approaches for the same subject. This statistical
test calculates the signed difference between two values for
the same subject (e.g. the LOC required for an OpenMP im-
plementation and an MPI implementation of the same prob-
lem). The output of the test is based on the mean of the differ-
ences for all subjects: If this mean value is large it will tend to
indicate a real and significant difference across the subjects for
the two different approaches. Conversely, if the mean is close
to zero then it would tend to indicate that both approaches
performed about the same (i.e. that there was no consistent
effect due to the different approaches). By making a within-
subject comparison we avoid many threats to validity, by
holding factors such as experience level, background, general
programming ability, etc., constant.
5.1 Achieving Speedup
A key question was whether novice developers could
achieve speedup at all. It is highly relevant for characterizing
HPC development, because it addresses the question of
whether the benefits of HPC solutions can be widely achieved
or will only be available to highly skilled developers. A survey
of HPC experts (conducted at the DARPA-sponsored HPCS
project meeting in January 2004) indicated that 87% of the
experts surveyed felt that speedup could be achieved by nov-
ices, although no rigorous data was cited to bolster this asser-
tion. Based on this information, we posed the following hy-
pothesis:
H1 Novices are able to achieve speedup on a parallel ma-
chine
To evaluate this hypothesis, we evaluated speedup in two
ways: 1) Comparing the parallel version of the code to a serial
version of the same application, or 2) comparing the parallel
version of the code run on multiple processors to the same
version of the code run on one processor.
For the game of life application, we have data from 3
classroom studies for both MPI and OpenMP approaches.
These data are summarized in Table 2, which shows the paral-
lel programming model, the application being developed, and
the programming language used in each study for which data
was collected. The OpenMP programs were run on shared
memory machines. The MPI programs were run on clusters.
Data
set
Programming
Model
Speedup on
8 processors
Speedup w.r.t. serial version
C1A1 MPI mean 4.74, sd 1.97, n=2
C3A3 MPI mean 2.8, sd 1.9, n=3
C3A3 OpenMP mean 6.7, sd 9.1, n=2
Speedup w.r.t. parallel version run on 1 processor
C0A1 MPI mean 5.0, sd 2.1, n=13
C1A1 MPI mean 4.8, sd 2.0, n=3
C3A3 MPI mean 5.6, sd 2.5, n=5
C3A3 OpenMP mean 5.7, sd 3.0, n=4
Table 2: Mean, standard deviation, and number of subjects for
computing speedup. All data sets are for C implementations of
the Game of Life.
Although there are too few data points to say with much
rigor, the data we do have supports our hypothesis H1. Nov-
ices seem able to achieve about 4.5x speedup (on 8 proces-
sors) for the game of life application over the serial version.
Speedup on 8 processors is consistently about 5 compared to
the parallel version run on one processor, for this application.
In addition to the specific hypothesis evaluated, we also
observed that OpenMP seemed to result in speedup values
near the top of that range, but too few data points are available
to make a meaningful comparison and too few processors
were used to compare the parallelism between programming
models.
5.2 Code expansion factor
Although the number of lines of code in a given solution
is not important on its own terms, it is a useful descriptive
measure of the solution produced. Different programming
paradigms are likely to require different types of optimization
their solution; for example, OpenMP typically seems to re-
quire adding only a few key lines of code to a serial program,
while MPI typically requires much more significant changes
to produce a working solution. Therefore we can pose the fol-
lowing hypotheses:
H2 An MPI implementation will require more code than its
corresponding serial implementation
H3 An OpenMP implementation will require more code than
its corresponding serial implementation.
We can use the data from both C2A2 and C3A3 (the first
and second rows of the matrix) to investigate how the size of
the solution (measured in LOC) changes from serial to various
HPC approaches. A summary of the datasets is presented in
Table 3. (In the C2A2 assignment, students were given an
existing serial solution to use as their starting point. It is in-
Ethernet Performance
• Achieves close to theoretical bw even with default 1500 B/message
• Broadcom part: 31 µs latency
Performance Characteristics of Dual-Processor HPC Cluster Nodes Based on 64-bit Commodity Processors, Purkayastha et al.
Myrinet
• Interconnect designed with clustering in mind
• 2 Gbps links, possibly 2 physical links (so 4 Gb/s)
• $850/node up to 128 nodes, $1737/node up to 1024 nodes
• MPI latency 6–7 µs
• TCP/IP latency 27–30 µs
• Strong support!
[Figure 2: Myrinet performance — bandwidth (Mbps) vs. message size (bytes) for MPI and TCP on single- and dual-port NICs, with Dell built-in Gigabit Ethernet for reference]
Scalable Coherent Interface
• Not a “switched” interconnect
• Max scalability: 64–100 nodes for 2D torus ($1095/node), 640–1000 for 3D torus ($1595/node)
• MPI: 4 µs latency, 1830 Mbps
• TCP/IP: 900 Mbps
[Figure 3: SCI performance — bandwidth (Mbps) vs. message size (bytes) for SCI MPI and TCP on a Tyan 2466N node, with Myrinet single-port MPI and Dell/Broadcom Gigabit Ethernet for reference]
The big payoff is that the switch elements can be very
simple since they do not perform any routing calcula-
tions. In addition the software design is based on an OS-
bypass like interface to provide low-latency and low-
CPU overhead.
Current Myrinet hardware utilizes a 2 Gbps link speed
with PCI-X based NICs providing one or two optical
links. The dual-port NIC virtualizes the two ports to
provide an effective 4 Gbps channel. The downside to
the two port NIC is cost. Both the cost of the NIC and
the extra switch port it requires. Myrinet is designed to
be scalable. Until recently the limit has been 1024
nodes, but that limit has been removed in the latest
software. However, there are reports that the network
mapping is not yet working reliably with more than
1024 nodes. The cost of Myrinet runs around $850/node
up to 128 nodes, beyond that a second layer of switches
must be added increasing the cost to $1737/node for
1024 nodes. The dual port NICs effectively double the
infrastructure and thus add at least $800 per node to the
cost.
Myricom provides a full open-source based software
stack with support for a variety of OS’s and architec-
tures. Though some architectures, such as PPC64, are
not tuned as well as others. One of the significant
plusses for Myrinet is its strong support for TCP/IP.
Generally TCP/IP performance has nearly matched the
MPI performance, albeit with higher latency.
Figure 2 illustrates the performance of both the single
and dual port NICs with the Broadcom based Gigabit
Ethernet data included from Figure 1 for reference. In
both cases the MPI latencies are very good at 6-7
µseconds and TCP latencies of 27-30 µseconds. Also
both NICs achieve around 90% of their link bandwidth
with over 3750 Mbps on the dual NIC and 1880 Mbps
on the single port NIC. TCP/IP performance is also
excellent with peak performance of 1880 Mbps on the
single port NIC and over 3500Mbps on the dual port
NIC.
2.3. Scalable Coherent Interface3
The Scalable Coherent Interface (SCI) from Dolphin
solutions is the most unique interconnect in this study
in that it is the only interconnect that is not switched.
Instead the nodes are connected in either a 2D wrapped
mesh or a 3D torus depending on the total size of the
cluster. The NIC includes intelligent ASICs that handle
all aspects of pass through routing, thus pass through is
very efficient and does not impact the host CPU at all.
However, one downside is that when a node goes down
(powered off) its links must be routed around thus im-
pacting messaging in the remaining system.
Since the links between nodes are effectively shared the
system size is limited by how many nodes you can ef-
fectively put in a loop before it is saturated. Currently
that number is in the range of 8-10 leading to total scal-
ability in the range of 64-100 nodes for a 2D configura-
tion and 640-1000 nodes for the 3D torus. Because there
are no switches the cost of the systems scales linearly
with the number of nodes, $1095 for the 2D NIC and
$1595 for the 3D NIC including cables. Unfortunately
cable length is a significant limitation with a preferred
length of 1m or less, though 3-5m cables can be used if
necessary. This poses quite a challenge in cabling up a
systems since each node must connected to 4 or 6 other
nodes.
Dolphin provides an open source driver and a 3rd party
MPICH based MPI is under development. However,
Quadrics
• “a premium interconnect choice on high-end systems such as the Compaq AlphaServer SC”
• Max 4096 ports
• $2400/port for small system up to $4078/port for 1024 node system
• MPI: 2–3 µs latency, 6370 Mbps bw
[Figure 4: Quadrics performance — bandwidth (Mbps) vs. message size (bytes) for Quadrics MPI, with Myrinet single-port MPI and Dell/Broadcom Gigabit Ethernet for reference]
currently we have not gotten the MPI to function cor-
rectly in some cases. An alternative to the open source
stack is a package from Scali4. This adds $70 per node,
but does provide good performance. Unfortunately the
Scali packages are quite tied to the Red Hat and Suse
distributions that they support. Indeed it is difficult to
get the included driver to work on kernels other than the
default distribution kernels.
Figure 3 illustrates the performance of SCI on a Tyan
2466N (dual AMD Athlon) based node. Since Dolphin
does not currently offer a PCI-X based NIC it is un-
likely that the performance would be substantially dif-
ferent in the Dell 2650 nodes used for the other tests.
The Scali MPI and driver were used for these tests. The
MPI performance is quite good with a latency of 4
µseconds and peak performance of 1830Mbps nearly
matching that of the single port Myrinet NIC, which is
a PCI-X, based adapter. However, the TCP/IP perform-
ance is much less impressive as it barely gets above
900 Mbps and more importantly proved unreliable in
our tests.
2.4. Quadrics5
The Quadrics QSNET network has been known mostly
as a premium interconnect choice on high-end systems
such as the Compaq AlphaServer SC. On systems such
as the SC the nodes tend to be larger SMPs with a
higher per node cost than the typical cluster. Thus the
premium cost of Quadrics has posed less of a problem.
Indeed some systems are configured with dual Quadrics
NICs to further increase performance and to get around
the performance limitation of a single PCI slot.
The QSNET system basically consists of intelligent
NICs with an on-board IO processor connected via cop-
per cables (up to 10m) to 8 port switch chips arranged
in a fat tree. Quadrics has recently released an updated
version of their interconnect called QSNet II based on
ELAN4 ASICs. Along with the new NICs Quadrics has
introduced new smaller switches, which has brought
down the entry point substantially. In addition the limit
on the total port count has been increased to 4096 from
the original 1024. Still it remains a premium option
with per port costs starting at $2400 for a small sys-
tem, $2800 per port for a 64 way system, up to $4078
for a 1024 node system.
On the software side quadrics provides an open source
software stack including the driver, userland and MPI.
The DMA engine offloads most of the communications
onto the IO processor on the NIC. This includes the
ability to perform DMA on paged virtual memory ad-
dresses eliminating the need to register and pin memory
regions. Unfortunately their supported software configu-
ration also requires a licensed, non-open source resource
manager (RMS). In our experience the RMS system
was by far the hardest part to get working.
Figure 4 illustrates the performance of Quadrics using
the MPI interface. TCP/IP is also provided, but we were
unable to get it to build on our system in time for this
paper due to incompatibilities with our compiler ver-
sion. The MPI performance is extremely good with
latencies of 2-3 µseconds and peak performance of 6370
Mbps. Indeed this is the lowest latency we have seen on
these nodes.
2.5. Infiniband
Infiniband has received a great deal of attention over the
past few years even though actual products are just be-
ginning to appear. Infiniband was designed by an indus-
try consortium to provide high bandwidth communica-
tions for a wide range of purposes. These purposes
range from a low-level system bus to a general purpose
interconnect to be used for storage as well as inter-node
communications. This range of purposes leads to the
hope that Infiniband will enjoy commodity-like pricing
due to its use by many segments of the computer mar-
ket. Another promising development is the use of In-
finiband as the core system bus. Such a system could
provide external Infiniband ports that would connect
directly to the core system bridge chip bypassing the
PCI bus altogether. This would not only lower the cost,
but also provide a significantly lower latency. Another
significant advantage for Infiniband is that is designed
with scalable performance. The basic Infiniband link
speed is 2.5Gbps (known as a 1X link). However the
links are designed to bonded into higher bandwidth con-
nections with 4 link channels (aka a 4X links) provid-
ing 10Gbps and 12 link channels (aka 12X links) pro-
viding 30 Gbps.
Infiniband
• Defined by industry consortium
• Scalable—base is 2.5 Gb/s link, scales to 30 Gb/s
• Current parts are 4x (10 Gb/s)
• $1200–1600/node
• Also used as system interconnect
• 6750 Mb/s, 6–7 µs
[Figure 5: Infiniband performance — bandwidth (Mbps) vs. message size (bytes) for Quadrics MPI, Infiniband (Infinicon MPI, OSU MVICH, LAM MPI, Infinicon SDP-TCP), Myrinet single-port MPI, and Dell/Broadcom Gigabit Ethernet]
[Figure 6: 10 Gigabit Ethernet performance — bandwidth (Mbps) vs. message size (bytes) for Quadrics MPI, Infiniband (Infinicon MPI, Infinicon SDP-TCP), 10 Gigabit Ethernet, Myrinet dual-port TCP, Myrinet single-port MPI, and Dell/Broadcom Gigabit Ethernet]
Current Infiniband implementations are available as PCI-X based HCAs and switch chips utilizing 4X (10 Gbps) links. The switch chips can be configured for other link speeds as well, including 12X links. This makes switch aggregation somewhat easier, since a switch can be configured with 4X links to connect to nodes and 12X links to connect to other switches. Current Infiniband pricing is enjoying the benefits of aggressive venture capital funding while the various vendors attempt to define their market. Thus there are a variety of vendors competing with slightly different HCAs and switches, even though the core silicon in most current implementations is from Mellanox. Current pricing runs roughly $1200–1600 per node depending on the vendor and cluster size (pricing would be higher for very large clusters).
One of the primary ways the vendors are attempting to differentiate their products is through their software stack. This has created a general reluctance to release the software stack as full open source. In addition, many of the software elements are still very much in development. For our tests we attempted to use a couple of MPI implementations, but found the latest code unstable; instead we used an older implementation. Also included is a non-open-source stack from Infinicon Corporation.
Figure 5 plots the performance of Infiniband versus a couple of the other technologies. Clearly there are substantial differences between the different MPI implementations. The Infinicon MPI shows quite good peak performance (over 6,700 Mbps); the other two peak out lower. All three show a large drop-off when the message size exceeds the on-CPU cache size. In addition, current latencies of 6–7 µseconds are a bit higher than the other OS-bypass technologies.
2.6. 10 Gigabit Ethernet
10 Gigabit Ethernet is the next step in the Ethernet series. Because of its heritage, its design inherits all the advantages and disadvantages of previous Ethernet implementations. The primary advantages include interoperability with previous Ethernet generations, wide portability, and a ubiquitous interface (TCP/IP). The primary disadvantages have always been relatively high latency and CPU load, as well as expensive switches. In the present case a portion of the cost is being driven by the cost of the optics, which are currently quite expensive by themselves. Some Gigabit Ethernet switches are now offering 10 Gigabit uplink ports essentially at the cost of installing the optics, which is making it somewhat more practical to trunk multiple Gigabit Ethernet switches together. However, full 10 Gigabit Ethernet switches are still fairly expensive at around $10K per switch port, even though that figure has dropped substantially in the last year. In addition, 10 Gigabit Ethernet switches are currently fairly limited in port count, with even large switches offering only a modest number of ports. The cost and low port density clearly make clusters based on 10 Gigabit Ethernet unlikely in the near term. In the longer term these issues will likely mitigate as the technology commoditizes.
Figure 6 plots the performance of two Intel 10 Gigabit Ethernet ports against samples of the other technologies. Clearly the performance of 10 Gigabit Ethernet is disappointing, with performance topping out around 3,700 Mbps.
10 Gb Ethernet
• Similar tradeoffs to previous Ethernets
• $10k per switch port
• Only 3700 Mb/s
Application Results
However, this does appear to be a common limit of TCP/IP among the other technologies, and thus may be in part limited by the overhead of TCP/IP itself.
3. Application Performance
Beyond raw network performance there are other issues that can dramatically affect the overall performance of the interconnect. Chief among the issues is how the software and hardware deal with running more processes than there are processors, especially in the case of applications that use auxiliary processes as data servers. This type of process typically must block in a receive call. Under many MPI implementations this results in a polling behavior that can take substantial CPU time away from the compute processes. In our local environment the dominant application is the quantum chemistry package GAMESS. Thus we have run a sample communication-intensive calculation on each of the interconnect choices to see just how much impact the network has on the timings.
[Table 1: GAMESS timings on 1, 2, and 4 nodes for Gigabit Ethernet, Myrinet (MPI), Myrinet (TCP), Infiniband, and Quadrics — one block with two processes per node and one block with one process per node]
Table 1 lists some timings for a calculation that utilizes a dual-process model. One process functions as the compute process and the second process acts as a data server for a pseudo global shared memory segment. The first block of timings overloads the CPUs with two compute and two data-server processes on each dual-CPU node. The second block runs only one compute and one data-server process per node.
One obvious result is that the fastest network does not necessarily produce the fastest timing for this application. When the CPUs are not overloaded (i.e., one compute and one data server per node) the timings are fairly similar for all interconnect types. However, when the number of processes per node is doubled the results are more interesting. First off, Myrinet turns in substantially faster timing for a single node. Most likely this indicates a good intra-node message passing implementation. However, when running on more nodes the Myrinet TCP run is faster than the others. The reason the MPIs are likely slower is the way many MPIs handle blocking in a receive call: many implementations assume that each MPI process essentially owns a CPU and thus implement a polling mechanism that eats CPU time. For applications such as GAMESS this results in a substantial loss of performance. In addition, many of the interconnects seem to perform worse when there are lots of processes utilizing the interconnect at the same time; by this we mean there seems to be substantial overhead in multiplexing and demultiplexing the messages when multiple processes are simultaneously active.
4. Conclusions
The good news is that today there are several good choices for a high-speed interconnect on a cluster at a range of prices. At the low end, Gigabit Ethernet has emerged as a solid option with a low cost per node up to several hundred nodes. Of course Gigabit Ethernet is also the lowest performing network in the survey, but its solid, ubiquitous software support makes it a good choice for applications that do not require cutting-edge communications performance.
In the middle of the pack in terms of cost and performance are SCI and Myrinet. Both products offer good performance at a moderate cost. SCI has some interest due to its flat cost per node as the cluster size is scaled up. However, the software stacks available for SCI have several issues, some of which significantly impact the performance of the system. Myrinet, on the other hand, offers a complete open-source software package that is the most solid software stack surveyed apart from TCP/IP-based Ethernet. The current Myrinet hardware provides good performance at a reasonable cost and is thus a quite solid choice for most applications.
Infiniband stands out as a technology with a great deal of promise, but quite a few rough edges. Chief among the rough edges are the problems with the varied software stacks. Currently it is possible to set up an Infiniband-based network and find a usable software stack. However, such a network will require a bit more maintenance as the software is revised.
In the long term, 10 Gigabit Ethernet will likely play a role in clusters, but it will likely take at least a couple more years before the price becomes competitive. During that time CPUs and busses will also increase in …
Table 1: GAMESS results