cosc 6385 computer architecture - data level parallelism...
TRANSCRIPT
1
COSC 6385
Computer Architecture
- Data Level Parallelism (III)
The Intel Larrabee, Intel Xeon
Phi and IBM Cell processors
Edgar Gabriel
Fall 2020
References• Intel Larrabee:
[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J.
Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan:
“Larrabee: a many-core x86 architecture for visual computing”,
ACM Trans. Graph., Vol. 27, No. 3. (August 2008), pp. 1-15.
http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
• IBM Cell processor:
[2] C. R. Johns, D. A. Brokenshire
“Introductioon to the Cell Broadband Engine Architecture”,
IBM Journal of Research and Development, vol. 51, no. 5, pp. 503-519
http://www.research.ibm.com/journal/rd/515/johns.pdf
[3] M. Kistler, M. Perrone, F. Petrini,
“Cell Multiprocessor Communication Network: Built for Speed”
IEEE Micro, vol. 26, no. 3, pp .10-23
ttp://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf
1
2
2
Larrabee Motivation
• Comparison of two architectures with the same number
of transistors
– Half the performance of a single stream for the simplified
core
– 40x increase for multi-stream executions
2 out-of-order
cores
10 in-order
cores
Instruction issue 4 2
VPU per core 4-wide SSE 16-wide
L2 cache size 4 MB 4 MB
Single stream 4 per clock 2 per clock
Vector
throughput
8 per clock 160 per clock
Larrabee Overview
• Many-core visual computing architecture
• Based on x86 CPU cores
– Extended version of the regular x86 instruction set
– Supports subroutines and page faulting
• Number of x86 cores can vary depending on the
implementation and processor version
• Fixed functional units for texture filtering
– Other graphical operations such as rasterization or post-
shader blending done in software
3
4
3
Larrabee Overview (II)
Image Source: [1]
Overview of a Larrabee Core (I)
Image Source: [1]
5
6
4
Overview of a Larrabee Core (I)
• x86 core derived from the Pentium processor
– No out-of-order execution
• Standard Pentium instruction set with the addition of
– 64 bit instructions
– Instructions for pre-fetching data into L1 and L2 cache
– Support for 4 simultaneous threads, separate registers for
each thread
• Each core is augmented with a wide vector processor
(VPU)
• 32kb L1 Instruction cache, 32 kb L1 Data Cache
• 256 KB of ‘local subset’ of the L2 cache
– Coherent L2 cache across all cores
Vector Processing Unit in Larrabee
• 16-wide VPU executing integer, single- and double
precision floating point operations
• VPU supports gather-scatter operations
– The 16 elements are loaded or can be stored from up to
16 different addresses
• Support for predicated instructions using a mask control
register (if-then-else statements)
7
8
5
Inter-Processor Ring Network
• Bi-directional ring network
• 512 bits-wide per direction
• Routing decisions done before injecting message into
the network
Larrabee Programming Models
• Most application can be executed without modification
due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
– API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler will
generate automatically the codes to use the VPUs.
• Alternative parallel approaches
– Intel threading building blocks
– Larrabee specific OpenMP directives
9
10
6
Larrabee Performance
Image Source: [1]
Intel Xeon Phi Processor
• First generation of Intel MIC (Many Integrated Cores)
architecture
• 60 cores / 1.0 GHz
• 512-bit wide vector engine
• 32 Kb L1 I/D cache,
• 512 Kb L2 cache (per core)
• Up to 1 TFLOPS double-precision performance
• 8 Gb GDDR5 memory and 320 Gb/s bandwidth
• Standard PCIe x16 form factor
11
12
7
IBM Cell Overview (I)
• Cell Broadband Architecture (CBEA) defined by a
consortium of IBM, Sony, and Toshiba
• Originally targeting the multi-media industry
– E.g. Playstation 3, Toshiba HDTV, etc.
• Sold as regular compute-blades also by IBM
– IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
– one (or more) general purpose processor element (PPE)
and
– (one or) more synergistic processor elements (SPEs)
13
14
8
Cell Architecture block diagram
Image Source: [2]
• Two generations available so far:
– Cell BE:
• 204.8 GFLOPS single precision peak performance
• 14.6 GFLOPS double precision peak performance
– PowerXCell 8i (2008):
• 204.8 GFLOPS single precision peak performance
• 102.4 GFLOPS double precision peak performance
– Both have 1 PPE and 8 SPEs
15
16
9
General Purpose Processor (PPE)
• Based on the IBM PowerPC processor
– Supports multiple simultaneous operating environments
(virtualization)
– E.g. can execute an instance of a real-time operating
system and an instance of a non-real-time operating
system
• Performs management and application control
functions
Synergistic Processor Element (SPE)
• SIMD processor used for offloading compute-intensive,
data parallel operations from the PPE
• Each SPE has its own local storage and can access data
only from the local storage
– Current versions of the Cell processors: 256k local storage
• The local storage is connected to the main memory
through a Memory Flow Controller (MFC)
– MFC moves data from main memory to local storage or
between two SPEs.
17
18
10
MFC commands
Image Source: [2]
Synergistic Processor Element (SPE) (II)
• Each SPE has 128 registers
• Each register is 128 bits wide which can be used to
hold
– Sixteen 8-bit integers or
– Eight 16-bit integers or
– Four 32-bit integers or single precision floating-point
numbers
– Two 64-bit integers or double precision floating point
numbers
• Most instructions supported by the synergistic processor
unit utilize all elements in a register -> SIMD
19
20
11
Simplified representation of a current
Cell processor
Image Source: [3]
Element Interconnect Bus
• PPE and SPEs communicate through the Element
Interconnect Bus
– Contains a shared command bus
• Sets up end-to-end transactions
• Used for coherence protocols
– Point-to-point data interconnect
• Four 16-byte-wide rings, two used for clockwise data
transfers, two for counter-clockwise data transfers
• Each ring transfer 128 byte packets ( = cache block
size of an SPE)
• Communication costs between two SPEs can vary
between 1 hop and 6 hops
– Overall bandwidth: 204.8 GB/s
21
22
12
Comparison IBM Cell and Intel
Larrabee• Both use a large number of small and simple cores
• Both use high-bandwidth ring bus to communicate
between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a
heterogeneous process (difference between PPE and
SPE)
• IBM Cell requires data to be moved explicitly to the
‘local store’, while Larrabee can address any memory
area
– Programm for the Cell have to be written taking the
limited amount of memory available for a SPE into
account
23