TRANSCRIPT
IBM Research
© 2008
Feeding the Multicore Beast: It's All About the Data!
Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept.
[email protected]
Outline
History: Data challenge
Motivation for multicore
Implications for programmers
How Cell addresses these implications
Examples
• 2D/3D FFT – Medical Imaging, Petroleum, general HPC…
• Green's Functions – Seismic Imaging (Petroleum)
• String Matching – Network Processing: DPI & Intrusion Detection
• Neural Networks – Finance
The Hungry Beast
Processor
(“beast”)
Data
(“food”)
Data Pipe
Pipe too small = starved beast
Pipe big enough = well-fed beast
Pipe too big = wasted resources
If flops grow faster than pipe capacity…
… the beast gets hungrier!
Move the food closer
Example: Intel Tulsa
– Xeon MP 7100 series
– 65nm, 349mm2, 2 Cores
– 3.4 GHz @ 150W
– ~54.4 SP GFlops
– http://www.intel.com/products/processor/xeon/index.htm
Large cache on chip
– ~50% of area
– Keeps data close for
efficient access
If the data is local,
the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
Data
Cache
If the data set doesn’t fit in cache
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don’t fit
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
Data
Cache
Cache size steadily increasing
Implications
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore even more demanding on cache than uni-core
Power Density – The Fundamental Problem
[Chart: power density (W/cm²), 1 to 1000 on a log scale, vs. gate length (1.5 µm down to 0.07 µm) for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III; the trend passes the power density of a hot plate and heads toward that of a nuclear reactor]
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32
What’s causing the problem?
Gate dielectric is approaching a fundamental limit (a few atomic layers)
[Chart: power density (W/cm²), 0.001 to 1000 on a log scale, vs. gate length (1 µm down to 0.01 µm), with the gate stack shown at Tox = 11Å near the 65 nm node]
Power, signal jitter, etc...
Diminishing Returns on Frequency
[Chart: chip clock speed (MHz, log scale from 10² to 10⁴) vs. year, 1990–2010, with frequency-driven design points flattening out]
In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
Power vs. Performance Trade-Offs
[Chart: relative power vs. relative performance (0 to 5) for alternative design points]
We need to adapt our algorithms to get performance out of multicore
Implications of Multicore
There are more mouths to feed
– Data movement will take center stage
Complexity of cores will stop increasing
… and has started to decrease in some cases
Complexity increases will center around communication
Assumption
– Achieving a significant % of peak performance is important
Feeding the Cell Processor
8 SPEs each with
– LS
– MFC
– SXU
PPE
– OS functions
– Disk IO
– Network IO
64-bit Power Architecture with VMX
[Diagram: PPE (PPU with PXU, L1, and L2) and eight SPEs (each an SPU/SXU with LS and MFC) connected by the EIB (up to 96B/cycle); 16B/cycle per SPE port; 32B/cycle to L2; MIC to dual XDR™ memory and BIC to FlexIO™, 16B/cycle each (2x outbound)]
Cell Approach: Feed the beast more efficiently
Explicitly “orchestrate” the data flow between main
memory and each SPE’s local store
– Use SPE’s DMA engine to gather & scatter data between
main memory and local store
– Enables detailed programmer control of data flow
• Get/Put data when & where you want it
• Hides latency: Simultaneous reads, writes & computes
– Avoids restrictive HW cache management
• Unlikely to determine optimal data flow
• Potentially very inefficient
– Allows more efficient use of the existing bandwidth
BOTTOM LINE:
It’s all about the data!
Cell Comparison: ~4x the FLOPS @ ~½ the power
Both 65nm technology
(to scale)
Memory Managing Processor vs. Traditional General Purpose Processor
[Die photos: IBM, AMD, and Intel general-purpose processors compared with the Cell BE]
Examples of Feeding Cell
2D and 3D FFTs
Seismic Imaging
String Matching
Neural Networks (function approximation)
Feeding FFTs to Cell
[Diagram: input image streamed buffer-by-buffer; each buffer's tiles are transposed and written to the transposed image]
SIMDized data
DMAs double buffered
Pass 1 (row FFTs): For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
Pass 2 (column FFTs, on the transposed image): For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
3D FFTs
Long strides thrash the cache
Cell DMA allows prefetch
[Diagram: N×N×N volume; accesses run at stride 1, stride N, and stride N²; DMA gathers single elements within a data envelope]
Feeding Seismic Imaging to Cell
New G at each (x,y)
Radial symmetry of G reduces BW requirements
[Diagram: data grid with Green's function centered at (X,Y)]
Output(x,y) = Σ_{i,j} D(x+i, y+j) · G(x, y, i, j)
Feeding Seismic Imaging to Cell
[Diagram: data grid partitioned into vertical slabs across SPE 0 through SPE 7]
Feeding Seismic Imaging to Cell
For each X
– Load next column of data
– Load next column of indices
– For each Y
• Load Green’s functions
• SIMDize Green’s functions
• Compute convolution at (X,Y)
– Cycle buffers
[Diagram: data buffer of height H and width 2R+1, Green's index buffer, and a convolution window of radius R centered at (X,Y)]
Feeding String Matching to Cell
Find (lots of) substrings in (long) string
Build graph of words & represent as DFA
Problem: Graph doesn’t fit in LS
Sample Word List:
“the”
“that”
“math”
Feeding Neural Networks to Cell
Neural net function F(X)
– RBF, MLP, KNN, etc.
If too big for LS, BW Bound
N Basis functions: dot product + nonlinearity
D Input dimensions
DxN Matrix of parameters
[Diagram: input X fed through network F to produce the output]
Convert BW Bound to Compute Bound
Split function over multiple SPEs
Avoids unnecessary memory traffic
Reduce compute time per SPE
Minimal merge overhead
[Diagram: per-SPE partial results merged into the final output]
Moral of the Story: It’s All About the Data!
The data problem is growing: multicore
Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Abstract
Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore processor path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor's design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.