crayyy xt and cray xe system overview - nersc: national energy
TRANSCRIPT
Cray XT and Cray XEy ySystem Overview
Customer Documentation and Training
Overview Topics
• System Overview– Cabinets, Chassis, and Blades– Compute and Service Nodes– Components of a Node
Opteron ProcessorOpteron ProcessorSeaStar ASIC • Portals API DesignGemini ASIC
• System Networks• Interconnection Topologies
10/18/2010 2Cray Private
Cray XT System
10/18/2010 3Cray Private
System Overview
GigE
X
YZ
10 GigE
GigE
Fibre
SMW
RAIDSubsystem
Channels
Compute node
Login nodeg
Network node
Boot /Syslog/Database nodes
I/O and Metadata nodes10/18/2010 4Cray Private
Cabinet
– The cabinet contains three chassis, a blower for cooling, a power distribution unit (PDU), a control system (CRMS), and the compute and service blades (modules)
– All components of the system are air cooled
A blower in the bottom of the cabinet cools the blades within the cabinet• Other rack-mounted devices
within the cabinet have theirwithin the cabinet have their own internal fans for cooling
– The PDU is located behind the blower in the back of the cabinet
10/18/2010 5Cray Private
Liquid Cooled CabinetsHeat exchanger(LC cabinets only)
Heat exchanger(XT5-HE LC only)
Cage 2Cage 248Vdc flexible
buses
Cage 1 Cage 1
0 1 2 3 4 5 6 7
Cage 0
Cage inletair temp
Heat exchanger(LC cabinets only)
Cage 0backplaneassembly
Cage VRMs
Cage IDcontroller Interconnect
network cableconnection
Blower speedcontroller (VFD)
Blower
air tempsensor(slot 3 rail)
Airflowconditioner
Heat exchanger(XT5-HE LC only)
L1 controller
48Vdc shelf 3
48Vdc shelf 2
48Vdc shelf 1
PDUo eline filter
XDP temperature& humidity sensor(behind perforatedpanel in someXT5-HE LCcabinets)
XDP interfaceboard (behindperforated panel,XT5-HE LC only)
10/18/2010 6Cray Private
Blades
• The system contains two types of blades:– Compute blades
4 SeaStar or 2 Gemini ASICs4 nodes
– Service blades– Service bladesSIO (XT systems)• 4 SeaStar ASICs• 2 nodes• One dual-slot PCI-X or PCIe riser assemblies per node XIO (XE systems)XIO (XE systems)• 2 Gemini ASICs• 4 nodes
O i l l t PCI i d ( t th b t d )• One single slot PCIe riser per node (except on the boot node)
10/18/2010 7Cray Private
PCI Cards
• Cray supports the following types of PCIe cards:– Gigabit Ethernet– 10Gigabit Ethernet– Fibre Channel (FC2, FC4, and FC8)
I fi iB d– InfiniBand
10/18/2010 8Cray Private
Node Components
– Opteron processorSingle-, dual-, quad-, six-, eight-, twelve-core versions
N d– Node memory2- to 8-GB of DDR1 memory in Cray XT3 compute nodes2- to 8-GB of DDR1 memory in Cray XT service (SIO) nodes2 to 8 GB of DDR1 memory in Cray XT service (SIO) nodes4- to 16-GB of DDR2 memory in Cray XT4 compute nodes4- to 32-GB of DDR2 memory in Cray XT5 compute nodes8 t 64 GB f DDR3 i C XT6 t d8- to 64-GB of DDR3 memory in Cray XT6 compute nodes8- to 64-GB of DDR3 memory in Cray XE6 compute nodes8- to 16-GB of DDR2 memory in Cray XE6 service (XIO) nodesy y ( )
– SeaStar or Gemini ASIC
10/18/2010 9Cray Private
Cray XT4 Compute Blade
Node memory
Node VRMNode 1
N d 2
Node 0
Node 2
Node 3
SeaStar
Node 3
L0 controller
48Vdc to 12Vdc VRMs
L0 controller
10/18/2010 10Cray Private
XT4 Node Block Diagram
Opteron
M s
lots
processorX86-64
4 D
IMM
HyperTransportLink
SeaStarHigh-speednetwork (HSN)(interconnect)( )
L0 controller
10/18/2010 11Cray Private
SIO Service Blade
PCI cardPCI card
Node 3
Node 01
23
PCI riser0
PCI riser(PCI-X or PCIe)
10/18/2010 12Cray Private
SeaStar ASIC
– Direct HyperTransport connection to Opteron– The DMA engine transfers data between the host
d th t kmemory and the networkSeparate read and write sides in the DMA engineThe Opteron does not directly load/store to network p yOn the receive side a set of content addressable memory (CAM) locations are used to route incoming messages• There are 256 CAM entries• There are 256 CAM entries
– Designed for Sandia Portals API
10/18/2010 13Cray Private
SeaStar Diagram
MemoryDDR-
SDRAMinterface
Northbridge
Core
interface
HyperTransportInterface
HyperTransportInterface
SeaStarDMA
engineHyper
TransportRx
To PCI-X onService blades
Memory
engine TransportRou
x
y
SSI PPU
Memoryter
y
z
L0 controller
SSI PPUr
10/18/2010 14Cray Private
Node Connection Bandwidths
+Y
Opteroncore(s)
ory
GB
/s HyperTransport
+Y
Mem
o
HyperTransportCray XT3 6.4 GB/s bidirectional
2.17 GB/s sustainedCray XT4 8.0 GB/s bidirectional
9.6
G +Z
MemoryCray XT3 DDR1 6.4 GB/s (400 MHz DIMMs)
5.7 GB/s sustained51.6 ns latency - measured
Cray XT4 DDR2 10.4 GB/s (667 MHz DIMMs)8 5 GB/s sustained
9.6 GB/s SeaStar 9.6 GB/s-X +X
8.5 GB/s sustained50 ns latency - measured
Cray XT4 DDR2 12.8 GB/s (800 MHz DIMMs)
Cray XT5 DDR2 10.4 GB/s (667 MHz DIMMs)Cray XT5 DDR2 12.8 GB/s (800 MHz DIMMs)G
B/s
High Speed Network9.6 GB/s bidirectional6.5 GB/s sustained
Cray XT5 DDR2 12.8 GB/s (800 MHz DIMMs)
9.6
-Y
-Z
10/18/2010 15Cray Private
Portals API Design
– Designed to message among thousands of nodesPerformance is critical only in terms of scalabilityS i d b h l li ti i ll d tSuccess is measured by how large an application is allowed to scale, not by a 2-node ping-pong time
– Designed to avoid scalability limitationsIs network independentBypasses OS – no memory copies into or out of the kernel; few interruptsinterruptsBypasses application – does not require activity by the application to ensure progressReliable ordered delivery of messages between two nodesReliable, ordered delivery of messages between two nodes
10/18/2010 16Cray Private
Portals API Design
– Receiver (target) managed – no reserved memory for thousands of potential senders
I ti l li it ti i t bli h d ithIs connectionless; no explicit connection is established with other processes in the application• When connected, maintains a minimum amount of stateThe target process determines how to respond to the incoming message• The target node can choose to accept or ignore a message
from any node• Destination of any message is not an address
– Instead “Match bits” enable the receiver to place itInstead Match bits enable the receiver to place itUnexpected messages are handled by the unexpected message queueAll buffers are in user spaceAll buffers are in user space
10/18/2010 17Cray Private
X6 Compute Blade with Gemini
Node 3
N d 2 Gemini 1Node 2 Gemini 1
Node 1
Gemini 0
Node 0
Gemini 0
10/18/2010 18Cray Private
XE6 Node Block Diagram
slot
s
OpteronOpteron
slot
s
4 D
IMM
sprocessorX86-64
processorX86-64
4 D
IMM
s
HyperTransportLinks
SeaStar (XT)Gemini (XE)
High-speednetwork (HSN)(interconnect)( )
L0 controller
10/18/2010 19Cray Private
XIO Service Blades
• Four nodes; each node contains:– One AMD Opteron processor with up to 16 GB of DDR2
memoryProcessor is a six-core Opteron
– A connection to a Gemini ASICA connection to a Gemini ASIC– Voltage regulating modules (VRMs)
• L0 controllerL0 controller• Gemini mezzanine card
– Contains two Gemini ASICsContains two Gemini ASICs• Four PCIe risers
– One riser per nodep
10/18/2010 20Cray Private
XIO Blade
Node 3node 3 connects to the lower mezzanine
Node 2 Gemini 1
Node 1node 1 connects to the lower
Node 0
Gemini 0mezzanine
10/18/2010 21Cray Private
XIO Riser Configuration
PCIe G2Inner
DDR2memory
x16 HyperTransport
Upper bay
Opteronnode 3
PCIe G2bridge
Innermezzanine
DDR2memory
x16 HyperTransport
Gemini
HyperTransport
HyperTransport
75 Watt slot
225 Watt slot
Opteronnode 2
PCIe G2bridge
Outer mezzanine
DDR2memory
x16 HyperTransport
HyperTransport
Opteronnode 1
PCIe G2bridge
Innermezzanine
memory
x16 HyperTransport
HyperTransportnode 1
OpteronPCIe G2Outer
DDR2memory
x16 HyperTransport
GeminiHyperTransport
75 Watt slot
225 Watt slot
Opteronnode 0bridgemezzanine
Lower bay10/18/2010 22Cray Private
XIO Boot Node Configuration
PCIe G2Inner
DDR2memory
x16 HyperTransport
Upper bay
Opteronnode 3
PCIe G2bridge
Innermezzanine
DDR2memory
x16 HyperTransport
Gemini
HyperTransport
HyperTransport
Opteronnode 2
PCIe G2bridge
Outer mezzanine
DDR2memory
x16 HyperTransport
HyperTransport
Opteronnode 1
PCIe G2bridge
Innermezzanine
memory
x16 HyperTransport
HyperTransportnode 1
OpteronPCIe G2Outer boot
DDR2memory
x8 HyperTransport
GeminiHyperTransport
Opteronnode 0bridgemezzanine
Lower bay10/18/2010 23Cray Private
Opteron Processors
CPU
L1 I cache L1 D cache
L2 cache
CPU
L1 caches L1 caches
L2 cache
CPU CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2L2 cache
System request queue
CrossbarMemory
System request queue
CrossbarMemory
System request queue
Crossbar
L3 cache
Controller
DDR
Controller
DDR
CrossbarMemory
Controller
DDR
Hyper-TransportLinks
Hyper-TransportLinks Hyper-
TransportLinks
AMD Opteron single-core processorCray XT3 SystemSocket 940: DDR1 buffered DIMMs
AMD Opteron quad-core processorCray XT4 System Socket AM2: DDR2 unbuffered DIMMs
AMD Opteron dual-core processorCray XT3 SystemSocket 940: DDR1 buffered DIMMs
and 3 HyperTransports and one HyperTransportCray XT5 SystemSocket F: DDR2 registered (buffered)
DIMMsand 3 HyperTransports
and 3 HyperTransportsCray XT4 SystemSocket AM2: DDR2 unbuffered DIMMs
and one HyperTransport
AMD Opteron hex-coreprocessor
10/18/2010 24Cray Private
Opteron Processors
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
CPU
L1 s
L2
System request queue
Crossbar
L3 cache
L2 L2
System request queue
Crossbar
L3 cache
System request queue
Crossbar
L3 cache
CrossbarMemory
Controller
DDR
Hyper-TransportLinks
MemoryController
DDR
Hyper-TransportLinks
MemoryController
DDR
Hyper-TransportLinks
AMD Opteron six-core processorCray XT5 SystemSocket F: DDR2 unbuffered DIMMs
AMD Opteron twelve-core processorDual-die socket
(for an eight-core socket remove twoand 3 HyperTransportsL3 Cache: 6 MB
(for an eight core socket remove twoCPUs from each side)
Cray XT6 and Cray XE6 SystemsSocket G34: DDR3 unbuffered DIMMs
and 4 HyperTransportsL3 cache: each die has 6 MBs, 12 MB per socket
10/18/2010 25Cray Private
Gemini
• Gemini is designed to pass data to and from the network with less control
G i i i it d f SMP– Gemini is suited for SMP– Gemini allows for adaptive routing
Hardware support for PGAS languages– Hardware support for PGAS languagesRemote memory address translations, as with Cray X2 systems
– Atomic memory operationsy p• The Gemini chip uses HyperTransport version 3
10/18/2010 26Cray Private
Gemini Block Diagram
Opteron Opteron Opteron Opteron
Node 0 Node 1
HyperTransportand NIC 0
HyperTransportand NIC 1
Router
L0 processor
10/18/2010 27Cray Private
Terminology
• BTE – Block Transfer Engine• DMAPP – Distributed Memory Application• FMA – Fast Memory Access• GHAL – Generic Hardware Abstraction Layery• GNI – Generic Network Interface• PGAS – Partitioned Global Address SpacePGAS Partitioned Global Address Space
– Programming language extensions such as Unified Parallel C (UPC) and Co-Array Fortran (CAF)
10/18/2010 28Cray Private
Performance and Resiliency Features
• Automatic link-level and HT3 retries– HT3 supports an improved CRC
• Congestion feedback to enable routing around network bottlenecks
P k t d d i i b ff– Packets are reordered in receive bufferSeparate completion notification when all are stored
• Automatic link failure handling with software help• Automatic link failure handling with software help– To route around a failed link, system traffic must be
quiesced to prevent deadlock and preserve orderingThis feature is also used to support hot blade swap
• Router errors are detected and reported at the point of error
10/18/2010 29Cray Private
Gemini ASIC Diagram
HT3 HT3
NIC 0 NIC 1
Tile 0, coordinate 0,0 Tile 7, coordinate 0,7
Netlink block
48-port YARC router
Tile 47, coordinate 5,7
10/18/2010 30Cray Private
Node Types
– Compute nodesNo user loginO l f ti f ll l li tiOnly for execution of parallel applications Diskless - no permanent storageCNL operating system• Linux symmetrical multi-processing (SMP)• Supports OpenMP programming model• apinit part of ALPS (Application Level Placement Scheduler)• apinit part of ALPS (Application Level Placement Scheduler)
– Service NodesProvide services based on hardware/software configurationgNode names: boot node, SDB, syslog, login, I/O, and network nodesRun Linux operating system - SLESRun Linux operating system SLES
10/18/2010 31Cray Private
Service Nodes
– Boot nodeFirst node bootedH it t fil tHas its own root file systemExports the “shared root” file system to the other service nodes
– System database (SDB) nodey ( )Stores system configuration information and system stateImplemented with MySQL Pro
– syslog nodeCollects syslog data from other service nodes
– Login nodes– Login nodesProvide users with a familiar Linux environmentEdit, compile, submit, and monitor interactive or batch jobs• PBS Pro is the standard batch system
10/18/2010 32Cray Private
Service Nodes
– I/O nodesAttach to RAID storage devicesP f t DMA I/O th h th HSNPerform remote DMA I/O through the HSNFunction as a metadata server (MDS) and object storage servers (OSSs) for Lustre file systems
– Network nodes10 Gigabit Ethernet (10GbE) adapterConnect to other file serversConnect to other file servers
10/18/2010 33Cray Private
Service Node Software Components
– ALPS - Application Level Placement SchedulerApplication placement, launch and management for CNLI l d l d d li tIncludes several daemons and client processes
– RCA – Resiliency Communication AgentPart of the kernel, on all nodesPart of the kernel, on all nodesSends a heartbeat message to the SMW (CRMS/HSS/CMS)Detects and responds to failing application launchers
10/18/2010 34Cray Private
System Networks
• The High-speed Network (HSN) or interconnect network– Connects the service and compute nodes
The preferred topology is a folded torus• A torus is a circle• Folding enables the use of the shortest cables possible toFolding enables the use of the shortest cables possible to
connect all nodes within a dimension • Depending on the class, the topology may be a meshThe class of the system governs the configurationThe class of the system governs the configuration
– Applications move data across the HSN• The Hardware Supervisory System (HSS) networkThe Hardware Supervisory System (HSS) network
– Monitors and controls the system– Ethernet network with the System Management y g
Workstation (SMW) is at the top of the network
10/18/2010 35Cray Private
Interconnect Network Topology Classes
• Configurations are based on topology classes:– Class 0 contains 1 to 9 chassis (1 to 3 cabinets)
1 row– Class 1 contains 4 to 16 cabinets
1 row1 row– Class 2 contains 16 to 48 cabinets
2 equal length rows– Class 3 contains 48 to 320 cabinets
Three rows of equal lengthA h i di i• A mesh in one or more dimensions
Even number of 4 or more equal length rows• Torus in all dimensions
10/18/2010 36Cray Private
A Single-chassis Configuration
• The basic building block is a single chassis– A chassis is 1 x 4 x 8– A chassis is 1 x 4 x 8
Dimensions are: X x Y x Z– Each node on a blade is
connected in the Y dimensionconnected in the Y dimension (mezzanine)
– Each node in a chassis is connected in the Z dimension ad
e
(backplane)– All X-dimension connections
are cables
ars
in a
bla
y
4 S
eaS
tax
y
z
10/18/2010 37Cray Private
Network Connectors
10/18/2010 38Cray Private
Class 0 Topology
10/18/2010 39Cray Private
Class 0 Cable Drawing, 3 Cabinets
ZY
X
10/18/2010 40Cray Private
Class 1 Topology
10/18/2010 41Cray Private
Class 1 Cable Drawing, 4 Cabinets
10/18/2010 42Cray Private
Class 2 Topology
Y
X Z
10/18/2010 43Cray Private
Class 2 Cable Drawing, 16 Cabinets
X
Z
Y
10/18/2010 44Cray Private
Class 3 Topology
Z
X
Z
Y
10/18/2010 45Cray Private
HSN Cabling Photo
10/18/2010 46Cray Private
Cray XT5m
• Cray XT5m topology• 1 to 18 chassis,1 to 6
cabinetscabinets• All cabinets are in a single
row • The system scales in units ofThe system scales in units of
chassis• 2D torus topology (y and z)
• (Y x 4) x 8 topology(Y x 4) x 8 topology • Y is the number of
populated chassis in the systemy
10/18/2010 47Cray Private
Identifying Components
• System components are labeled according to physical ID (HSS Identification), node ID, IP address, or class
Component Format DescriptionSystem s0, all All components attached to the SMW.
Cabinet cX-Y Cabinet number and row; this is the L1 host name.;
Cage cX-Yc# Physical cage in cabinet: 0, 1, 2. Cages are numbered from bottom to top.
Bl d l t X Y # # Physical blade slot in cage:0 – 7, numbered fromBlade or slot cX-Yc#s# Physical blade slot in cage:0 7, numbered from left to right; this is the L0 hosts name.
Node cX-Yc#s#n# Opteron Chip on a blade: 0 - 3 for compute blades, 0 and 3 for SIO blades
SeaStar ASIC cX-Yc#s#s# Cray SeaStar ASIC on a blade: 0 – 3
Gemini ASIC cX-Yc#s#g# Cray Gemini ASIC on a blade: 0 – 1
X Y # # #l# f S S S CLinkcX-Yc#s#s#l#cX-Yc#s#g#l#
Link port of a SeaStar ASIC: 0 – 5Link port of a Gemini ASIC: 0 - 57
10/18/2010 48Cray Private
Node ID (NID)
• The Node ID is a unique hexadecimal or decimal number that identifies each CLE node
Th NID fl t th d l ti i th t k– The NID reflects the node location in the networkIn a SeaStar based system, the NIDs are assigned on 128-number boundaries for each cabinet In a Gemini based system, NIDs are sequential
– NID format is nidnnnnn (decimal) when it refers to a hostname of a node for example: nid00003hostname of a node, for example: nid00003
10/18/2010 49Cray Private
Cabinet 0 NID Numbering Example6,
67
0, 7
1
4, 7
5
8, 7
9
2, 8
3
6, 8
7
0, 9
1
4, 9
5
4, 9
5
2, 9
3
0, 9
1
8, 8
9
6, 8
7
4, 8
5
2, 8
3
0, 8
1
Cray XT (SeaStar based) system Cray XE (Gemini based) system 64
, 65,
66
68, 6
9, 7
0
72, 7
3, 7
4
76, 7
7, 7
8
80, 8
1, 8
2
84, 8
5, 8
6
88, 8
9, 9
0
92, 9
3, 9
4
64, 6
5, 9
4
66, 6
7, 9
2
68, 6
9, 9
0
70, 7
1, 8
8
72, 7
3, 8
6
74, 7
5, 8
4
76, 7
7, 8
2
78, 7
9, 8
0
3, 3
4, 3
5
7, 3
8, 3
9
1, 4
2, 4
3
5, 4
6, 4
7
9, 5
0, 5
1
3, 5
4, 5
5
7, 5
8, 5
9
1, 6
2, 6
3
3, 6
2, 6
3
5, 6
0, 6
1
7, 5
8, 5
9
9, 5
6, 5
7
5, 4
0, 4
1
3, 4
2, 4
3
5, 5
0, 5
1
7, 4
8, 4
9
3 7 1 15 19 23 27 31
32, 3
3
36, 3
7
40, 4
1
44, 4
5
48, 4
9
52, 5
3
56, 5
7
60, 6
1
31 29 27 25 23 21 19 17
32, 3
3
34, 3
5
36, 3
7
38, 3
9
54, 5
5
52, 5
3
44, 4
5
46, 4
7
0, 1
, 2, 3
4, 5
, 6, 7
8, 9
, 10,
1
2, 1
3, 1
4,
6, 1
7, 1
8,
0, 2
1, 2
2,
4, 2
5, 2
6,
8, 2
9, 2
0,
0, 1
, 30,
3
2, 3
, 28,
2
4, 5
, 26,
2
6, 7
, 24,
2
8, 9
, 22,
2
0, 1
1, 2
0,
2, 1
3, 1
8,
4, 1
5, 1
6,
8 12 16 20 24 28 0 2 4 6 8 10 12 14
10/18/2010 50Cray Private
Cray System
10/18/2010 Cray Private 51
External Services for Cray Systems
– Implementation of services external to the Cray XT and Cray XE systems to address customer requirements
M fl iblMore flexible user accessMore options for data management, data protectionLeverage commodity components in customer-specific implementationsProvides faster access to new devices and technologies Repeatable solutions that remain open to custom configurationRepeatable solutions that remain open to custom configurationEach solution can be used, scaled, and configured independently
esFS esLogin esDM
10/18/2010 52Cray Private
Specifically
– esFSProvides globally shared data between multiple systems
C t d th• Cray systems and othersProvides access to other file systemsData Virtualization Service (DVS) is used to project Panasas or StorNext to the compute nodes
– esLoginLess dependence on a single Cray system for data access andLess dependence on a single Cray system for data access and compilesAn enhanced user environment
l d h• larger memory, swap space, and more horsepower• Increased availability of data and system services to users
– esDMesDMMore options for data management and data protection
10/18/2010 53Cray Private
esLogin
Cray XE System
Service/MOM nodeService/MOM node
NIC
XE | AP Clientside commands
XE ProgrammingEnvironment
Batch Client Software
Modules
NIC NIC NIC
MS
10/18/2010 Cray Private 54
esMS User network Other services