![Page 1: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/1.jpg)
Designing Memory Systems for Tiled Architectures
Anshuman Gupta
September 18, 2009
![Page 2: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/2.jpg)
Multi-core Processors are abundant
Multi-cores increase the compute resources on the chip without increasing hardware complexity
This keeps power consumption within budget.
Examples: AMD Phenom (4-core), Sun Niagara 2 (8-core), Tile64 (64-core), Intel Polaris (80-core)
![Page 3: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/3.jpg)
Multi-Core Processors are underutilized
Single-thread code:
…
b = a + 4 … (0)
c = b * 8 … (1)
d = c - 2 … (2)
e = b * b … (3)
f = e * 3 … (4)
g = f + d … (5)
…
[Figure: serial execution and parallel execution schedules for instructions (0)-(5)]
Software gets the responsibility of utilizing the cores with parallel instruction streams
Hard to parallelize applications.
![Page 4: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/4.jpg)
Tiled Architectures increase Utilization by enabling Parallelization
Tiled architectures are a class of multi-core architectures
They provide mechanisms to facilitate automatic parallelization of single-threaded programs
Fast On-Chip Networks (OCNs) connect the cores
OCN communication latencies are on the order of 2 + (distance between tiles) cycles*
*Latency for the RAW inter-ALU OCN
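The latency rule of thumb above can be turned into a small calculation. This is only a sketch: it assumes that "distance between tiles" means Manhattan hop count on a 2D mesh, and the function name `ocn_latency` and the coordinate convention are hypothetical.

```python
def ocn_latency(src, dst, base_cycles=2):
    """Estimated inter-ALU latency in cycles between two tiles.

    src and dst are (x, y) tile coordinates. The distance is assumed to be
    the Manhattan hop count, which the slide does not state explicitly.
    """
    (sx, sy), (dx, dy) = src, dst
    hops = abs(sx - dx) + abs(sy - dy)
    return base_cycles + hops


if __name__ == "__main__":
    print(ocn_latency((0, 0), (0, 1)))  # neighbouring tiles
    print(ocn_latency((0, 0), (3, 3)))  # opposite corners of a 4x4 array
```

Under these assumptions, neighbouring tiles communicate in 3 cycles and the corners of a 4x4 array in 8 cycles.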
![Page 5: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/5.jpg)
Automatic Parallelization on Tiled Architectures
Single-thread code:
…
b = a + 4 … (0)
c = b * 8 … (1)
d = c - 2 … (2)
e = b * b … (3)
f = e * 3 … (4)
g = f + d … (5)
…
[Figure: schedules for instructions (0)-(5) on multi-cores and on a tiled architecture]
In tiled architectures, dependent instructions can be placed on multiple cores with low penalty, due to cheap inter-ALU communication.
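To see why communication cost decides whether splitting dependent instructions pays off, the six-instruction example can be scheduled under a toy model. The sketch assumes one cycle per instruction, in-order issue on each core, and a fixed inter-core latency. The placement `SPLIT` and the latency values are hypothetical and are not taken from the slide's figure.

```python
# Dependence graph of the six-instruction example: instruction -> producers.
DEPS = {0: [], 1: [0], 2: [1], 3: [0], 4: [3], 5: [4, 2]}


def finish_time(placement, comm_latency):
    """Cycle at which the last instruction completes.

    placement maps instruction id -> core id. Each instruction takes one
    cycle, each core issues in program order, and a value that crosses cores
    arrives comm_latency cycles after it is produced (all assumptions).
    """
    done = {}       # instruction -> cycle at which its result is ready
    core_free = {}  # core -> first cycle at which it can issue again
    for instr in sorted(DEPS):
        core = placement[instr]
        start = core_free.get(core, 0)
        for producer in DEPS[instr]:
            arrival = done[producer]
            if placement[producer] != core:
                arrival += comm_latency
            start = max(start, arrival)
        done[instr] = start + 1
        core_free[core] = start + 1
    return max(done.values())


SERIAL = {i: 0 for i in DEPS}                 # everything on one core
SPLIT = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # hypothetical two-core placement

if __name__ == "__main__":
    print("serial:", finish_time(SERIAL, 0))
    for latency in (1, 10):
        print("split, latency", latency, ":", finish_time(SPLIT, latency))
```

In this model the serial schedule takes 6 cycles, the split schedule takes 5 cycles with a 1-cycle network, and 14 cycles with a 10-cycle network, so the same placement helps or hurts depending on the communication cost.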
![Page 6: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/6.jpg)
Why aren’t tiled architectures used everywhere?
What if we add some memory instructions?
Single-thread code:
…
(*b) = a + 4 … (0)
c = (*b) * 8 … (1)
(*d) = c - 2 … (2)
e = (*h) * 4 … (3)
f = e * 3 … (4)
g = f + (*i) … (5)
…
[Figure: schedules for instructions (0)-(5) on multi-cores and on a tiled architecture]
Automatic parallelization is still very difficult due to slow resolution of remote memory dependencies
Tiled architecture memory systems have a special requirement: Fast Memory Dependence Resolution
![Page 7: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/7.jpg)
Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
Analysis of Existing Work
Future Work and Conclusion
![Page 8: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/8.jpg)
Memory Dependence
foo (int * a, int * b)
{
  *a = …
  … = *b
}

| Static Analysis | Type | a address | b address |
| --- | --- | --- | --- |
| No | No | 0x1000 | 0x2000 |
| Must | True | 0x1000 | 0x1000 |
| May | True | 0x1000 | 0x1000 |
| May | False | 0x1000 | 0x2000 |

[Figure column "Static placement": placement of "*a = …" and "… = *b" for each case]
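The table can be read as a function of two inputs: what static analysis could prove, and what the addresses turn out to be at run time. The sketch below encodes that reading. The function name and the string labels are hypothetical.

```python
def classify_dependence(static_result, addr_a, addr_b):
    """Dependence type for a store through `a` followed by a load through `b`.

    static_result is what the compiler could prove: "no", "must" or "may".
    Only a "may" result needs the run-time addresses to be classified.
    """
    aliases = addr_a == addr_b
    if static_result == "no":
        if aliases:
            raise ValueError("analysis proved no dependence, yet addresses alias")
        return "No"
    if static_result == "must":
        if not aliases:
            raise ValueError("analysis proved a dependence, yet addresses differ")
        return "True"
    if static_result == "may":
        # The compiler could not disambiguate: true if the addresses match,
        # false (an artificial ordering constraint) otherwise.
        return "True" if aliases else "False"
    raise ValueError(f"unknown static result: {static_result!r}")


if __name__ == "__main__":
    print(classify_dependence("may", 0x1000, 0x1000))
    print(classify_dependence("may", 0x1000, 0x2000))
```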
![Page 9: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/9.jpg)
Memory Coherence
A coherent memory space provides the abstraction of a single data buffer with a single read/write port
Hierarchical implementations of shared memory
◦ Require coherence protocols to provide the same abstraction
[Figure: Core 0 executes "Write A = 1" and Core 1 executes "Read A", shown against a single shared buffer and against per-core caches over shared memory; a dependence signal links the write to the read while A changes from 0 to 1]
![Page 10: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/10.jpg)
Improving Memory Dependence Resolution
Memory dependence resolution performance depends on
◦ True Dependence Performance
◦ False Dependence Performance
◦ Coherence System Performance
![Page 11: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/11.jpg)
True Dependence Resolution
Delay 1: determined by the signaling stage
◦ Earlier is better
Delay 2: determined by the signaling delay inside the ordering mechanism
◦ Faster is better
Delay 3: determined by the stalling stage
◦ Later is better
Delays 1 and 3 are determined by the resolution model
[Figure: source and destination pipelines showing the signal stage, the signal, and the stall stage, annotated with delays 1, 2 and 3]
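One way to see how the three delays interact is a simple stall model. It is only a sketch under assumptions the slide does not spell out: delay 1 is taken as the number of cycles from source issue until the signal is sent, delay 2 as the transit time of the signal, and delay 3 as the number of cycles from destination issue until it must have the signal. All names are hypothetical.

```python
def stall_cycles(signal_stage, signal_delay, stall_stage, dest_start=0):
    """Cycles the destination instruction waits for the source's signal.

    signal_stage: cycles after source issue (cycle 0) when the signal is sent.
    signal_delay: cycles the signal spends in the ordering mechanism.
    stall_stage:  cycles after destination issue when the signal is needed.
    dest_start:   cycle at which the destination issues.
    """
    signal_arrives = signal_stage + signal_delay
    signal_needed = dest_start + stall_stage
    return max(0, signal_arrives - signal_needed)


if __name__ == "__main__":
    print(stall_cycles(4, 3, 2))  # baseline
    print(stall_cycles(1, 3, 2))  # earlier signaling
    print(stall_cycles(4, 1, 2))  # faster signal
    print(stall_cycles(4, 3, 6))  # later stall
```

In this model each of the three improvements named above (earlier signaling, faster signal, later stall) reduces the stall relative to the baseline.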
![Page 12: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/12.jpg)
False Dependence Resolution
False dependencies occur when
◦ Static analysis cannot disambiguate
◦ Memory dependence encoding is not partial
For false dependencies, the dependent instruction should ideally not wait for any signal
◦ Runtime Disambiguation: the address comparison is done in hardware to declare the dependent instruction free
◦ Speculation: the dependent instruction is issued speculatively, assuming the dependence is false
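Runtime disambiguation amounts to an address comparison against older pending stores. The sketch below shows that check in software form. The function name and the convention that `None` marks a store whose address is not yet computed are assumptions made for illustration.

```python
def can_issue_load(load_addr, older_store_addrs):
    """Return True if the load is free of dependences on older pending stores.

    older_store_addrs holds the addresses of older stores that have not yet
    completed. None stands for an address that is still unknown, which has
    to be treated conservatively as a possible match.
    """
    for store_addr in older_store_addrs:
        if store_addr is None or store_addr == load_addr:
            return False
    return True


if __name__ == "__main__":
    print(can_issue_load(0x2000, [0x1000]))  # false dependence, load is free
    print(can_issue_load(0x1000, [0x1000]))  # true dependence, load must wait
```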
![Page 13: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/13.jpg)
Fast Data Access
Local L1 caches can help decrease average latencies
◦ No network delays
Cache Coherence (CC)
◦ Dynamic access: data location not known statically
◦ Expensive dynamic access in the absence of CC
![Page 14: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/14.jpg)
What features to look out for?

| Arch | L1 Local | CC | Ordering Point | Resolution | Encoding | Speculation | Runtime Disambiguation |
| --- | --- | --- | --- | --- | --- | --- | --- |
![Page 15: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/15.jpg)
Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
◦ RAW
◦ WaveScalar
◦ EDGE
Analysis of Existing Work
Future Work and Conclusion
![Page 16: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/16.jpg)
RAW
A highly static tiled architecture
Array of simple in-order MIPS cores
Scalar Operand Network (SON) for fast inter-ALU communication
Shared address space, local caches and shared DRAMs
No cache coherence mechanism
◦ Software cache management through flush and invalidation
*Taylor et al., IEEE Micro 2002
![Page 17: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/17.jpg)
Artifacts of Software Cache Management
Difficult to keep track of the most up-to-date version of a memory address
All memory accesses can be categorized as
◦ Static Access: the location of the cache line is known statically
◦ Dynamic Access: a runtime lookup is required to determine the location of the cache line
  These are really expensive (36 vs 7)
![Page 18: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/18.jpg)
Static-Dynamic Access Ordering
Two static accesses
◦ Synchronization over SON
Dependence between a static and a dynamic access
◦ Synchronization over SON between
  the static access
  the static requestor or receiver for the dynamic access
Execute side resolution
No speculative runahead
False dependencies are as expensive as true dependencies
![Page 19: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/19.jpg)
Summary

| Arch | L1 Local | CC | Ordering Point | Resolution | Encoding | Speculation | Runtime Disambiguation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAWsd | Yes | No | OCN | Exec-side | Partial | No | No |
![Page 20: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/20.jpg)
Dynamic Access Ordering
Execute side resolution is very expensive
Resolution is done late in the memory system
Static ordering point
◦ Turnstile tile
◦ One per equivalence class
◦ Equivalence class: the set of all memory operations that can access the same memory address
Requests are sent on the static SON to the turnstile
◦ Receives in memory order
In-order dynamic network channels
![Page 21: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/21.jpg)
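Summary

| Arch | L1 Local | CC | Ordering Point | Resolution | Encoding | Speculation | Runtime Disambiguation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAWsd | Yes | No | OCN | Exec-side | Partial | No | No |
| RAWdd | Yes | No | Turnstile | Secondary Mem-side | Partial | No | Yes |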
![Page 22: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/22.jpg)
Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
◦ RAW
◦ WaveScalar
◦ EDGE
Analysis of Existing Work
Future Work and Conclusion
![Page 23: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/23.jpg)
WaveScalar
A fully dynamic Tiled Architecture with Memory Ordering
Clusters arranged in a 2D array connected by a dynamic mesh network
Each tile has a store buffer and a banked data cache
Secondary memory system made up of L2 caches around the tiles
Cache coherence
*Swanson et al., MICRO 2003
![Page 24: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/24.jpg)
Memory Ordering
WaveScalar preserves memory ordering by using a sequence number for each memory operation in a wave
◦ Unique
◦ Indicates age
Each memory operation also stores its predecessor’s and successor’s sequence numbers
◦ Use “?” if not known at compile time
There cannot be a memory operation whose possible predecessor has its successor marked as “?”, and vice versa
◦ MEM-NOPs
A request is allowed to go ahead if its predecessor has issued
In hardware this ordering is managed in the store buffers
◦ A single store buffer is responsible for handling all memory requests for a dynamic wave
[Figure: Load A <0>, Store B <1> and Load C <2>, then annotated with <predecessor, own, successor> triples: Load A <.,0,?>, Store B <0,1,2>, Load C <?,2,.>; with Nop <0,2,3> the triples become Store B <0,1,3> and Load C <?,3,.>; the store buffer holds Load C <?,3,.>, Store B <0,1,3> and Load A <.,0,?>, with the resolved links shown as Load A <.,0,1> and Load C <1,3,.>]
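The issue rule above can be sketched as a small software model of one store buffer. This is an illustration under assumptions, not WaveScalar's hardware design: the class name `WaveStoreBuffer`, the `"?"` and `"."` markers as Python strings, and the instant issue of ready operations are all hypothetical. An operation issues when the most recently issued operation is its named predecessor, or when that operation names it as successor, which covers the case where one of the two links is "?".

```python
UNKNOWN = "?"  # link not known at compile time
NONE = "."     # no predecessor or successor inside the wave


class WaveStoreBuffer:
    """Toy model of sequence-number ordering for one dynamic wave."""

    def __init__(self):
        self.waiting = []        # arrived operations that have not issued
        self.issued = []         # names of issued operations, in issue order
        self.last_issued = None  # (pred, seq, succ) of the most recent issue

    def _ready(self, op):
        _, pred, seq, _ = op
        if self.last_issued is None:
            return pred == NONE  # only the first operation of the wave
        _, last_seq, last_succ = self.last_issued
        # Either this op names the last issued op as its predecessor, or the
        # last issued op names this op as its successor.
        return pred == last_seq or last_succ == seq

    def arrive(self, name, pred, seq, succ):
        self.waiting.append((name, pred, seq, succ))
        progress = True
        while progress:
            progress = False
            for op in list(self.waiting):
                if self._ready(op):
                    self.waiting.remove(op)
                    self.issued.append(op[0])
                    self.last_issued = op[1:]
                    progress = True
                    break


if __name__ == "__main__":
    buf = WaveStoreBuffer()
    buf.arrive("Load C", UNKNOWN, 3, NONE)
    buf.arrive("Store B", 0, 1, 3)
    buf.arrive("Load A", NONE, 0, UNKNOWN)
    print(buf.issued)
```

With the requests arriving in reverse order, nothing issues until Load A arrives, after which the chain unwinds in memory order.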
![Page 25: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/25.jpg)
Removing False Load Dependencies
Sequence number based ordering is highly restrictive
◦ Loads are stalled on previous loads
Each memory operation has a ripple number, which is the last store’s sequence number
A memory operation can issue if the op with its ripple number has issued
◦ Loads can issue out of order (OoO)
Stores still have total ordering
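The ripple rule can be illustrated with a short check of which operations may issue at a given moment. The sketch makes one assumption beyond the slide: a store is taken to wait for every earlier operation, which is a conservative way of keeping stores totally ordered. The function name and the tuple layout are hypothetical.

```python
def issuable(ops, issued):
    """Names of the operations that may issue now under ripple ordering.

    ops:    list of (name, kind, seq, ripple); ripple is the sequence number
            of the most recent earlier store, or None if there is none.
    issued: set of sequence numbers that have already issued.
    """
    ready = []
    for name, kind, seq, ripple in ops:
        if seq in issued:
            continue
        if kind == "load":
            # A load only waits for the last store before it.
            ok = ripple is None or ripple in issued
        else:
            # Assumption of this sketch: a store waits for all earlier ops.
            ok = all(s in issued for _, _, s, _ in ops if s < seq)
        if ok:
            ready.append(name)
    return ready


OPS = [
    ("Load A", "load", 0, None),
    ("Load B", "load", 1, None),
    ("Store C", "store", 2, None),
    ("Load D", "load", 3, 2),
    ("Load E", "load", 4, 2),
]

if __name__ == "__main__":
    print(issuable(OPS, set()))
    print(issuable(OPS, {0, 1, 2}))
```

Here Load A and Load B are both ready at the start and can issue in either order, while Load D and Load E become ready together once Store C has issued.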
![Page 26: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/26.jpg)
Summary

| Arch | L1 Local | CC | Ordering Point | Resolution | Encoding | Speculation | Runtime Disambiguation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAWsd | Yes | No | OCN | Exec-side | Partial | No | No |
| RAWdd | Yes | No | Turnstile | Secondary Mem-side | Partial | No | Yes |
| WaveScalar | No | Yes | Store Buffer | Primary Mem-side | Store Total | No | No |
![Page 27: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/27.jpg)
Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
◦ RAW
◦ WaveScalar
◦ EDGE
Analysis of Existing Work
Future Work and Conclusion
![Page 28: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/28.jpg)
EDGE
A partially dynamic Tiled Architecture with block execution
Array of tiles connected over fast OCNs
Primary memory system is distributed over tiles
Each such tile has an address-interleaved
◦ Data cache
◦ Load Store Queue
Distributed Secondary Memory System
Cache Coherence
*S. Sethumadhavan et al., ICCD ‘06
![Page 29: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/29.jpg)
Memory Ordering
Unique 5-bit tag called LSID, used for
◦ Completion of block execution
◦ Ordering of memory operations
Data tiles (DTs) get a list of all LSIDs in a block during the fetch stage
When a memory operation reaches a DT
◦ Its LSID is sent to all the DTs
A request is issued if all requests with earlier LSIDs have completed
◦ Memory side dependence resolution
When all memory operations have completed, the block is committed
[Figure: a control tile, execution tiles and interleaved data tiles executing a block with Ld A <0>, Ld B <1>, St C <2>, Ld C <3>; each data tile holds the LSID list <0,1,2,3> and is annotated with LSIDs as they arrive, ending with <0,1,2,3>, 3,2]
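The LSID rule can be sketched as a toy model that collapses all data tiles into one object. The class name `BlockMemoryOrder` is hypothetical, and the sketch assumes a request completes as soon as it is issued, which ignores cache latency and the speculation that EDGE layers on top.

```python
class BlockMemoryOrder:
    """Toy model of LSID-based memory ordering for one block."""

    def __init__(self, lsids):
        self.order = sorted(lsids)  # LSID list delivered at fetch time
        self.arrived = set()        # requests that have reached memory
        self.completed = []         # LSIDs in the order they completed

    def arrive(self, lsid):
        self.arrived.add(lsid)
        # Issue every arrived request whose earlier LSIDs have all completed.
        while len(self.completed) < len(self.order):
            next_lsid = self.order[len(self.completed)]
            if next_lsid not in self.arrived:
                break
            self.completed.append(next_lsid)

    def can_commit(self):
        return len(self.completed) == len(self.order)


if __name__ == "__main__":
    block = BlockMemoryOrder([0, 1, 2, 3])
    for lsid in (1, 0, 3, 2):  # requests reach memory out of order
        block.arrive(lsid)
        print(lsid, block.completed, block.can_commit())
```

Requests that arrive early simply wait, so the block commits only after the last missing LSID (here 2) shows up.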
![Page 30: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/30.jpg)
Dependence Speculation
EDGE memory ordering is very restrictive
◦ Total memory order
Loads execute speculatively
An earlier store to the same address causes a squash
◦ A predictor is used to reduce squashes
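The squash condition can be written down as a short check over the order in which requests reach memory. This is a sketch with hypothetical names. It models only the detection of a mis-speculated load, not the dependence predictor or the re-execution that follows a squash.

```python
def find_squashes(arrivals):
    """LSIDs of speculatively executed loads that must be squashed.

    arrivals: list of (lsid, kind, addr) in the order the requests reach
    memory, where kind is "load" or "store". A load is squashed when a store
    with an earlier LSID and the same address arrives after the load ran.
    """
    executed_loads = []  # (lsid, addr) of loads that already executed
    squashed = []
    for lsid, kind, addr in arrivals:
        if kind == "load":
            executed_loads.append((lsid, addr))
            continue
        for load_lsid, load_addr in executed_loads:
            later_load = load_lsid > lsid
            if later_load and load_addr == addr and load_lsid not in squashed:
                squashed.append(load_lsid)
    return squashed


if __name__ == "__main__":
    in_order = [(0, "load", "A"), (1, "load", "B"),
                (2, "store", "C"), (3, "load", "C")]
    reordered = [(1, "load", "B"), (0, "load", "A"),
                 (3, "load", "C"), (2, "store", "C")]
    print(find_squashes(in_order))
    print(find_squashes(reordered))
```

When the store to C arrives after the later load of C has already run, that load is the only one squashed; the loads of A and B are false dependences and keep their speculative results.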
![Page 31: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/31.jpg)
Summary

| Arch | L1 Local | CC | Ordering Point | Resolution | Encoding | Speculation | Runtime Disambiguation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAWsd | Yes | No | OCN | Exec-side | Partial | No | No |
| RAWdd | Yes | No | Turnstile | Secondary Mem-side | Partial | No | Yes |
| WaveScalar | No | Yes | Store Buffer | Primary Mem-side | Store Total | No | No |
| EDGE | No | Yes | LSQ | Primary Mem-side | Total | Yes | No |
![Page 32: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/32.jpg)
Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
Analysis of Existing Work
Future Work and Conclusion
![Page 33: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/33.jpg)
True Dependence Optimization

| Arch | L1 Local | CC | Ordering Point | Resolution | Encoding | Speculation | Runtime Disambiguation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAWsd | Yes | No | OCN | Exec-side | Partial | No | No |
| RAWdd | Yes | No | Turnstile | Secondary Mem-side | Partial | No | Yes |
| WaveScalar | No | Yes | Store Buffer | Primary Mem-side | Store Total | No | No |
| EDGE | No | Yes | LSQ | Primary Mem-side | Total | Yes | No |
![Page 34: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/34.jpg)
Memory Side Resolution allows more Overlap
[Figure: three diagrams labeled RAWsd, EDGE/WaveScalar (E/WS) and RAWdd, each with Requestor A, Requestor B and a Home Node, plus a Tag Buffer and a Turnstile; timeline bars show Request A, Response A, Coherence delay A, Request B, Response B and Coherence delay B]
*The lengths of the bars do not indicate delays
![Page 35: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/35.jpg)
Network Stalls should be avoided
Execute Side Resolution (e)
◦ RAWsd
Memory Side Resolution (m)
◦ EDGE, WaveScalar
RAW dynamic ordering (mt)
◦ Network delay to memory system is overlapped
[Figure: pipeline stages F, Na, E, N$, Tp, Nm, Ts, M, Nc, Nr, W annotated with the points used by e, m and mt]
![Page 36: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/36.jpg)
False Dependence Optimization

| Arch | L1 Local | CC | Ordering Point | Resolution | Encoding | Speculation | Runtime Disambiguation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAWsd | Yes | No | OCN | Exec-side | Partial | No | No |
| RAWdd | Yes | No | Turnstile | Secondary Mem-side | Partial | No | Yes |
| WaveScalar | No | Yes | Store Buffer | Primary Mem-side | Store Total | No | No |
| EDGE | No | Yes | LSQ | Primary Mem-side | Total | Yes | No |

Partial ordering reduces false deps
Speculation on false deps reduces stalls
Disambiguation should be done early
![Page 37: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/37.jpg)
Outline
Motivation
Preserving Memory Ordering
Memory Ordering in Existing Work
Analysis of Existing Work
Future Work and Conclusion
![Page 38: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/38.jpg)
What’s a Good Tiled Architecture Memory System?
Local caches for fast L1 hits
Cache coherence support for ease of programmability and no dynamic access delays
Fast True Dependence Resolution
◦ Performance comparable to same-core placement of operations
◦ Late stalls
◦ Early signaling
Reduction of false dependencies through partial memory operation ordering
Fast False Dependence Resolution
◦ Performance comparable to same-core placement of operations
◦ Early runtime memory disambiguation
◦ Speculative memory requests
![Page 39: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/39.jpg)
Conclusion
Auto-parallelization on tiled architectures can benefit from fast memory dependence resolution
◦ Multi-core memory systems were not designed with this goal
Performance of both true and false dependence resolution should be comparable to that of dependent memory instructions placed on the same core
The ISA should support partial memory operation ordering to avoid artificial false dependencies
The memory system should have local caches and cache coherence for performance and programmability
Thank You!
Questions?
![Page 40: Designing Memory Systems for Tiled Architectures Anshuman Gupta September 18, 2009 1](https://reader035.vdocument.in/reader035/viewer/2022062516/56649dbe5503460f94ab2014/html5/thumbnails/40.jpg)
Dynamic Accesses are expensive
X looks up a global address list and sends a dynamic request to owner Y
Y is interrupted, data is fetched and a dynamic request is sent to Z
Z is interrupted, data is stored in its local cache
One table lookup, two interrupt handlers and two dynamic requests make dynamic loads expensive
Lifted portions represent processor occupancy, while unlifted portions represent network latency