Tomorrow's Computing Engines: February 3, 1998, Symposium on High-Performance Computer Architecture
Tomorrow's Computing Engines, WJD, Feb 3, 1998
Tomorrow’s Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture
William J. Dally
Computer Systems Laboratory
Stanford University
billd@csl.stanford.edu
Focus on Tomorrow, not Yesterday
Generals tend to always fight the last war
Computer architects tend to always design the last computer
– old programs
– old technology assumptions
Some Previous “Wars” (1/3)
– MARS Router (1984)
– Torus Routing Chip (1985)
– Network Design Frame (1988)
– Reliable Router (1994)
Some Previous “Wars” (2/3)
– MDP Chip
– J-Machine
– Cray T3D
– MAP Chip
Tomorrow’s Computing Engines
• Driven by tomorrow’s applications - media
• Constrained by tomorrow’s technology
90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000
• Quote from Scott Kirkpatrick of IBM (talk abstract)
• Media applications include
– video encode/decode
– polygon & image-based graphics
– audio processing: compression, music, speech recognition/synthesis
– modulation/demodulation at audio and video rates
• These applications involve stream processing
• So do
– radar processing: SAR, STAP, MTI ...
Typical Media Kernel: Image Warp and Composite
• Read 10,000 pixels from memory
• Perform 100 16-bit integer operations on each pixel
• Test each pixel
• Write the 3,000 result pixels that pass back to memory
• Little reuse of data fetched from memory
– each pixel is used once
• Little interaction between pixels
– very insensitive to operation latency
• Challenge is to maximize bandwidth
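The shape of this kernel can be sketched as follows. The per-pixel arithmetic and the pass test are illustrative stand-ins, not from the talk; only the memory pattern matches the bullets above:

```python
def warp_and_composite(pixels):
    """Read each pixel once, transform it, test it, and emit survivors.

    pixels: flat list of 16-bit integer pixel values, read once from memory.
    Returns only the pixels that pass the per-pixel test.
    """
    results = []
    for p in pixels:
        v = p
        # stand-in for ~100 16-bit integer operations per pixel
        for _ in range(100):
            v = (v * 3 + 7) & 0xFFFF
        # stand-in per-pixel test; only passing pixels are written back
        if v & 0x8000:
            results.append(v)
    return results

out = warp_and_composite(list(range(10_000)))
```

Note the streaming structure: no pixel is revisited and iterations are independent, so the kernel is bandwidth-bound rather than latency-bound.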
Telepresence: A Driving Application
Pipeline: Acquire 2D Images → Extract Depth (3D Images) → Segmentation → Model Extraction → Compression → Channel → Decompression → Rendering → Display of the 3D scene

Most kernels: latency insensitive, with a high ratio of arithmetic to memory references
Tomorrow’s Technology is Wire Limited
• Lots of devices
• A little faster
• Slow wires
Technology scaling makes communication the scarce resource
1997 (0.35 µm): 64Mb DRAM; 16 64-bit FP processors at 400MHz; 18mm die, 12,000 wire tracks, 1 clock to cross the chip
2007 (0.10 µm): 4Gb DRAM; 1K 64-bit FP processors at 2.5GHz; 32mm die, 90,000 wire tracks, 20 clocks to cross the chip
On-chip wires are getting slower
For a wire of length y with resistance R and capacitance C per unit length, wire delay is tw = R·C·y². Scaling feature size by s (= 0.5) per generation:

– x2 = s·x1 (0.5×)
– R2 = R1/s² (4×)
– C2 = C1 (1×)
– tw2 = R2·C2·y² = tw1/s² (4×): a fixed-length wire gets slower
– tw2/tg2 = tw1/(tg1·s³) (8×): and slower still relative to the gate delay tg
– v = 0.5·(tg·R·C)^(−1/2) (m/s): repeated-wire velocity, so v2 = v1·s^(1/2) (0.7×)
– v·tg = 0.5·(tg/(R·C))^(1/2) (m/gate): distance reached per gate delay, so v2·tg2 = v1·tg1·s^(3/2) (0.35×)
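These scaling factors can be checked numerically for one generation, under the slide's assumptions (linear scale factor s = 0.5, gate delay scaling as tg2 = s·tg1, constant wire length y):

```python
s = 0.5  # linear scale factor per technology generation

# fixed-length wire: R per unit length grows as 1/s^2, C is constant
wire_delay_ratio = 1 / s**2        # tw2/tw1
relative_delay_ratio = 1 / s**3    # (tw2/tg2)/(tw1/tg1), with tg2 = s*tg1
repeated_velocity_ratio = s**0.5   # v2/v1, from v = 0.5*(tg*R*C)**-0.5
reach_per_gate_ratio = s**1.5      # (v2*tg2)/(v1*tg1)
```

The numbers reproduce the slide's 4×, 8×, 0.7×, and 0.35× annotations.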
Bandwidth and Latency of Modern VLSI
[Figure: log-log plot of latency and bandwidth versus module size, with the chip boundary marked; on-chip modules see far higher bandwidth and lower latency than off-chip accesses.]
Architecture for Locality: Exploit High On-Chip Bandwidth

[Diagram: off-chip RAM connects through 2GB/s of pin bandwidth to an on-chip vector register file (50GB/s), which feeds 104 32-bit ALUs through a switch (500GB/s).]
Tomorrow’s Computing Engines
• Aimed at media processing
– stream based
– latency tolerant
– low-precision
– little reuse
– lots of conditionals
• Use the large number of devices available on future chips
• Make efficient use of scarce communication resources
– bandwidth hierarchy
– no centralized resources
• Approach the performance of a special-purpose processor
Why do Special-Purpose Processors Perform Well?
Lots (100s) of ALUs, fed by dedicated wires and memories
Care and Feeding of ALUs
[Diagram: a conventional datapath; the instruction pointer, instruction cache, instruction register, and register file supply instruction bandwidth and data bandwidth to a single ALU. The 'feeding' structure dwarfs the ALU.]
Three Key Problems
• Instruction bandwidth
• Data bandwidth
• Conditional execution
A Bandwidth Hierarchy
[Diagram: four SDRAM chips feed a streaming memory at 1.6GB/s; the streaming memory feeds a vector register file at 50GB/s; the register file feeds the ALU clusters at 500GB/s, with 13 ALUs per cluster.]

• Solves the data bandwidth problem
• Matched to the bandwidth curve of the technology
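The three levels and their ratios relative to off-chip bandwidth can be tabulated directly from the numbers above:

```python
# Bandwidth hierarchy from the slide, in GB/s
hierarchy_gbs = {
    "streaming memory (SDRAM)": 1.6,
    "vector register file": 50.0,
    "ALU clusters": 500.0,
}

base = hierarchy_gbs["streaming memory (SDRAM)"]
# each level's bandwidth as a multiple of off-chip memory bandwidth
ratios = {level: bw / base for level, bw in hierarchy_gbs.items()}
```

Each level supplies roughly an order of magnitude more bandwidth than the level below it, which is what lets the ALUs run far faster than the pins could ever feed them directly.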
A Streaming Memory System
[Diagram: address generators send index and data streams through a crossbar to per-bank reorder queues, which front the SDRAM banks.]
Streaming Memory Performance
[Chart: bank reorder-queue effectiveness; cycles per access versus queue size (1 to infinite), falling from roughly 1.75 toward 1.0 as the queue deepens.]

• Exploit latency insensitivity for improved bandwidth
• 1.75:1 performance improvement from a relatively short reorder queue
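A toy model shows the effect (bank count, bank busy time, and the reference stream are illustrative, not Imagine's parameters): references address random SDRAM banks, a bank is busy for a few cycles after each access, and a reorder queue lets the memory system issue any pending reference whose bank is free instead of stalling in order:

```python
import random

NBANKS = 4      # SDRAM banks
BUSY = 3        # cycles a bank stays busy after an access
NREFS = 10_000  # references in the stream

def cycles_per_access(queue_size, seed=1):
    rng = random.Random(seed)
    refs = [rng.randrange(NBANKS) for _ in range(NREFS)]
    free_at = [0] * NBANKS  # cycle at which each bank is next free
    queue, issued, cycle, i = [], 0, 0, 0
    while issued < NREFS:
        # refill the reorder queue from the in-order reference stream
        while len(queue) < queue_size and i < NREFS:
            queue.append(refs[i])
            i += 1
        # issue at most one reference per cycle, to any free bank in the queue
        for k, bank in enumerate(queue):
            if free_at[bank] <= cycle:
                free_at[bank] = cycle + BUSY
                queue.pop(k)
                issued += 1
                break
        cycle += 1
    return cycle / NREFS

in_order = cycles_per_access(queue_size=1)    # stalls on every bank conflict
reordered = cycles_per_access(queue_size=16)  # picks around busy banks
```

Because the kernels are latency-insensitive, reordering references costs nothing at the program level but pulls cycles/access down toward the one-issue-per-cycle limit.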
Compound Vector Operations: 1 Instruction Does Lots of Work

[Diagram: a compound vector instruction such as LD Vd Vx is expanded by the address generator and a local control store (µIP) into wide microinstructions (multiple Op/Ra/Rb fields) operating on vector registers V0..V7.]

One compound vector instruction (50b) replaces the microinstruction stream it expands into:
300b/µinst × 20 µinst/op × 1000 elements/vector = 6 × 10^6 b
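The arithmetic above, spelled out:

```python
uinst_bits = 300         # bits per microinstruction
uinsts_per_op = 20       # microinstructions per compound operation
elems_per_vector = 1000  # elements per vector

# microinstruction traffic that one compound vector instruction replaces
expanded_bits = uinst_bits * uinsts_per_op * elems_per_vector

cv_inst_bits = 50  # size of one compound vector instruction
reduction = expanded_bits // cv_inst_bits
```

The 50-bit instruction stands in for 6 × 10^6 bits of microcode; the control store replays the expansion locally, so that traffic never competes for scarce instruction bandwidth.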
Scheduling by Simulated Annealing
• List scheduling assumes global communication
– does poorly when communication is exposed
• View scheduling as a CAD problem (place and route)
– generate a naïve 'feasible' schedule
– iteratively improve the schedule by moving operations

[Diagram: ready operations are placed onto an ALUs × time grid.]
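The approach can be sketched on a toy version of the problem: place operations on an (ALU, timeslot) grid and anneal away resource conflicts. The energy function, move set, and cooling schedule here are illustrative; the real scheduler also models dependences and exposed communication:

```python
import math
import random
from collections import Counter

def anneal_schedule(n_ops=40, n_alus=4, n_slots=12, steps=20_000, seed=0):
    """Anneal a placement of n_ops onto an (ALU, timeslot) grid."""
    rng = random.Random(seed)
    # naive 'feasible' starting schedule: random placement
    place = [(rng.randrange(n_alus), rng.randrange(n_slots))
             for _ in range(n_ops)]

    def energy(p):
        # conflicts: pairs of operations on the same ALU in the same slot
        counts = Counter(p)
        return sum(c * (c - 1) // 2 for c in counts.values())

    e, temp = energy(place), 5.0
    for _ in range(steps):
        i = rng.randrange(n_ops)
        old = place[i]
        place[i] = (rng.randrange(n_alus), rng.randrange(n_slots))  # move one op
        e_new = energy(place)
        # accept improvements; accept worsening moves with Boltzmann probability
        if e_new <= e or rng.random() < math.exp((e - e_new) / temp):
            e = e_new
        else:
            place[i] = old  # undo the rejected move
        temp *= 0.9997      # geometric cooling
    return e

final_conflicts = anneal_schedule()
```

Accepting occasional worsening moves early (high temperature) is what lets the schedule escape the local minima that trap a purely greedy improver.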
Typical Annealing Schedule
[Chart: schedule cost versus annealing iteration (1 to ~18,000), falling from 166 to 13, with an annotation marking where the energy function was changed.]
Conventional Approaches to Data-Dependent Conditional Execution
[Diagram: three ways to execute a data-dependent conditional (test x > 0 in block A, then run B or C before rejoining at J and K):
– data-dependent branch: control flow selects the B or C path
– speculation: execute past the test and squash on a mispredict ('whoops'); speculative loss ≈ D × W ~ 1000 operations
– predication: compute y = (x > 0) and guard both paths with 'if y' / 'if ~y', giving an exponentially decreasing duty factor as conditionals nest]
Zero-Cost Conditionals
• Most approaches to conditional operations are costly
– branching control flow: dead issue slots on mispredicted branches
– predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' go idle
• Conditional vectors
– append an element to an output stream depending on a case variable

[Diagram: a result stream is steered by a {0,1} case stream into output stream 0 or output stream 1.]
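Functionally, a conditional stream behaves like this sketch (Python lists stand in for the hardware stream buffers):

```python
def conditional_split(data, case):
    """Append each element of `data` to output stream 0 or 1 per `case`.

    Unlike predication, no execution slots go idle: every element does
    useful work, and each output stream stays dense.
    """
    out = ([], [])
    for x, c in zip(data, case):
        out[c].append(x)
    return out

values = [5, -2, 7, 0, -9]
fail, keep = conditional_split(values, [int(v > 0) for v in values])
# fail = [-2, 0, -9]; keep = [5, 7]
```

Downstream kernels then process each dense output stream at full duty factor, instead of skipping masked-off elements.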
Application Sketch - Polygon Rendering
[Diagram: rendering as a stream pipeline: triangle vertices (V1, V2, V3, each with X, Y, RGB, UV) are expanded into spans (Y, X1, X2, with interpolated RGB and UV), spans into pixels (X, Y, RGB, UV), and pixels into textured pixels (X, Y, RGB after the UV texture lookup).]
Status
• Working simulator of Imagine
• Simple kernels running on the simulator
– FFT
• Applications being developed
– depth extraction, video compression, polygon rendering, image-based graphics
• Circuit/Layout studies underway
Acknowledgements
• Students/Staff
– Don Alpert (Intel)
– Chris Buehler (MIT)
– J.P. Grossman (MIT)
– Brad Johanson
– Ujval Kapasi
– Brucek Khailany
– Abelardo Lopez-Lagunas
– Peter Mattson
– John Owens
– Scott Rixner
• Helpful Suggestions
– Henry Fuchs (UNC)
– Pat Hanrahan
– Tom Knight (MIT)
– Marc Levoy
– Leonard McMillan (MIT)
– John Poulton (UNC)
Conclusion
• Work toward tomorrow's computing engines
• Targeted toward media processing
– streams of low-precision samples
– little reuse
– latency tolerant
• Matched to the capabilities of communication-limited technology
– explicit bandwidth hierarchy
– explicit communication between units
– communication exposed
• Insight not numbers