1 E. Bolotin – The Power of Priority, NoCs 2007
The Power of Priority: NoC based Distributed Cache Coherency
Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny
QNoC Research Group
EE Department, Technion, Haifa, Israel
Chip Multi-Processor (CMP)
Dual-Core
Monolithic shared cache
[Figure: multi-core CMP with processors P0 to P7 and a distributed L2 cache, banks numbered 0 to 63]
Multi-Core
Large cache
Shared cache
Distributed cache
NoC-based: How?
Future Cache - Physics Perspective
• Large cache → large access time
• Global wire delay
[Figure: Global Wires Delay. Global wire delay vs. gate delay across process technology nodes from 250 to 32 nm, log scale from 0.1 to 100. Source: ITRS 2003]
• Distance reached in a single cycle
Today: ~25% of chip
In 10 years: ~1% of chip
[Figure: fraction of chip reachable in 1 clock cycle. Source: Keckler et al., ISSCC 2003]
Large monolithic cache is not scalable
NUCA - Non Uniform Cache Architecture
NUCA = non-uniform access times
Banked cache over NoC:
• Smaller bank → smaller access time
• Multiple banks → multiple ports
• Closer bank → smaller access time
Cache-line placement policy
• Static NUCA (SNUCA)
• Dynamic NUCA (DNUCA)
Sources: Kim et al., ASPLOS 2002; Beckmann et al., MICRO 2004
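To make the non-uniform access time concrete, here is a minimal sketch of static placement (SNUCA) on a grid of banks. The 8x8 grid, the 64-byte line, the per-hop and per-bank latencies, and the low-order-bit mapping are all illustrative assumptions, not values taken from the cited papers.

```python
# Illustrative SNUCA model: every parameter below is an assumption.
GRID = 8            # assumed 8x8 grid of L2 banks
LINE_BYTES = 64     # assumed cache-line size
HOP_CYCLES = 2      # assumed NoC latency per hop
BANK_CYCLES = 4     # assumed access time of one small bank


def snuca_bank(addr: int) -> int:
    """Static placement: low-order line-address bits select the bank."""
    return (addr // LINE_BYTES) % (GRID * GRID)


def bank_xy(bank: int) -> tuple[int, int]:
    """Row-major (x, y) position of a bank on the grid."""
    return bank % GRID, bank // GRID


def access_cycles(src_xy: tuple[int, int], addr: int) -> int:
    """Round trip: hops to the bank, the bank access, hops back."""
    bx, by = bank_xy(snuca_bank(addr))
    hops = abs(bx - src_xy[0]) + abs(by - src_xy[1])
    return 2 * hops * HOP_CYCLES + BANK_CYCLES


near = access_cycles((0, 0), 0 * LINE_BYTES)    # bank 0, next to the requester
far = access_cycles((0, 0), 63 * LINE_BYTES)    # bank 63, opposite corner
print(near, far)
```

Under these assumptions the same requester sees very different latencies depending on which bank holds the line. A dynamic policy (DNUCA) would try to move frequently used lines closer, at the price of having to search for them.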
Issues in NUCA-based CMP
[Figure: CMP with processors P0 to P7 and distributed L2 banks 0 to 63]
• NoC performance → CMP performance
• Cache coherency and transaction order (correctness)
• Search (in DNUCA)
• Different traffic types (e.g. fetch vs. prefetch)
• Synchronization (locks)
NoC Services for CMP?
Cache Coherency over NoC
[Figure: CMP with processors P0 to P7 and distributed L2 banks 0 to 63]
How do we maintain coherency over NoC?
• Snooping
• Central directory
• Distributed directory
[Figure: cache bank with distributed directory. Each cache line in the bank is stored alongside its own directory entry (status, vec., D)]
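A distributed directory can be pictured as one small record per cache line, stored in the same bank as the line. The sketch below is a hypothetical layout: the three-state encoding, the sharer bit-vector, the `DirEntry` and `home_bank` names, and the modulo bank mapping are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


@dataclass
class DirEntry:
    """Hypothetical per-line directory entry kept next to the line in its bank."""
    state: State = State.INVALID
    sharers: int = 0          # bit i set means processor i holds a copy

    def add_sharer(self, cpu: int) -> None:
        self.sharers |= 1 << cpu
        self.state = State.SHARED

    def set_owner(self, cpu: int) -> None:
        self.sharers = 1 << cpu
        self.state = State.MODIFIED

    def holders(self) -> list[int]:
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]


NUM_BANKS = 64  # assumed, matching banks 0 to 63 in the figure


def home_bank(line_addr: int) -> int:
    """Assumed mapping: the entry lives in the bank that holds the line."""
    return line_addr % NUM_BANKS


entry = DirEntry()
entry.add_sharer(1)
entry.add_sharer(2)
print(entry.state, entry.holders())
```

Because each bank only tracks its own lines, no single node has to see every coherence request, which is the main contrast with snooping and with a central directory.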
Distributed Cache Coherency
Example: Simple read transaction
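[Figure: the L1 cache of P0 and the L2 bank with its directory, connected through the NoC]
1. READ REQ
2. READ RESP (data transfer)
Directory state after the transaction: P0-Shared
Cache access → multiple NoC transactions
Legend: Ctrl. packet, Data packet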
Read Transaction of Modified Block
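[Figure: two panels showing the L1 caches of P0 and P2 and the L2 bank with its directory, connected through the NoC]
First panel (directory state: P2-MOD.):
1. READ EXCL. REQ
2. READ RESP (data transfer)
Second panel (directory state: P0-SHARED):
3. READ REQ
4. WR BACK REQ
5. WR BACK RESP (data transfer)
6. READ RESP (data transfer)
Legend: Ctrl. packet, Data packet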
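The read of a modified block can be sketched as a message list produced by the directory. The step names follow the slide. The source and destination of each message, the dictionary layout, and the `handle_read` name are assumptions, and the final state simply mirrors the "P0-SHARED" label in the figure.

```python
Message = tuple[str, str, str, str]   # (step, source, destination, packet kind)


def handle_read(entry: dict, requester: str) -> list[Message]:
    """Messages exchanged for one read, as seen by the home L2 bank."""
    msgs: list[Message] = [("READ REQ", requester, "L2", "ctrl")]
    if entry["state"] == "MODIFIED":
        owner = entry["holders"][0]
        # The only valid copy is in the owner's L1, so fetch it first.
        msgs.append(("WR BACK REQ", "L2", owner, "ctrl"))
        msgs.append(("WR BACK RESP", owner, "L2", "data"))
        entry["holders"] = [requester]      # assumed, as labelled in the figure
    elif requester not in entry["holders"]:
        entry["holders"].append(requester)
    msgs.append(("READ RESP", "L2", requester, "data"))
    entry["state"] = "SHARED"
    return msgs


line = {"state": "MODIFIED", "holders": ["P2"]}
for step, src, dst, kind in handle_read(line, "P0"):
    print(f"{step:13s} {src} -> {dst} ({kind})")
```

In this sketch one L2 read becomes four NoC packets, two short control packets and two long data packets, and the requester waits for all of them.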
Read Exclusive of Shared Block
[Figure: two panels showing the L1 caches of P0, P1, and P2 and the L2 bank with its directory, connected through the NoC]
First panel (directory state: P1-Shared, P2-Shared):
1. READ REQ (shown twice)
Second panel (directory state: P0-MOD.):
3. READ EXCL. REQ
5. INVALID. ACK (shown twice)
6. READ EXCL. RESP (data transfer)
Legend: Ctrl. packet, Data packet
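The cost of a read-exclusive grows with the number of sharers, because every sharer must be invalidated and must acknowledge. The sketch below counts the packets. Message endpoints and the `handle_read_exclusive` name are assumptions, and the invalidation request is assumed to be sent as one unicast per sharer before the acknowledgements return.

```python
Message = tuple[str, str, str, str]   # (step, source, destination, packet kind)


def handle_read_exclusive(sharers: list[str], requester: str) -> list[Message]:
    """Messages for a read-exclusive on a line currently held as Shared."""
    others = [p for p in sharers if p != requester]
    msgs: list[Message] = [("READ EXCL. REQ", requester, "L2", "ctrl")]
    msgs += [("INVALID. REQ", "L2", p, "ctrl") for p in others]   # one unicast each
    msgs += [("INVALID. ACK", p, "L2", "ctrl") for p in others]
    msgs.append(("READ EXCL. RESP", "L2", requester, "data"))
    return msgs


msgs = handle_read_exclusive(["P1", "P2"], "P0")
ctrl = sum(1 for m in msgs if m[3] == "ctrl")
data = sum(1 for m in msgs if m[3] == "data")
print(len(msgs), ctrl, data)
```

With n other sharers the transaction takes 2n + 2 packets in this model, and all but one of them are short control packets.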
Basic NoC to Support CMP
Off-the-shelf (Vanilla) NoC:
• Grid of wormhole routers
• Unicast only
• Ordering in network: static routing, no virtual channels
• Smart interfaces
[Figure: Vanilla NoC carrying the read-exclusive transaction among the L1 caches of P0, P1, and P2 and the L2 directory: 3. READ EXCL. REQ, 4. INVALID. REQ, 5. INVALID. ACK (shown twice), 6. READ EXCL. RESP (data transfer); directory state: P0-MOD.]
Can We Do Better?
Observations: L2 Access
A) Delay = queueing + NoC transactions
B) All NoC transactions are equally important
C) NoC transactions consist of:
• Short ctrl. packets
• Long data packets
Idea: Differentiate between Ctrl. and Data
Solution: Preemptive Priority NoC
Give priority to short ctrl. packets
[Figure: the read-exclusive transaction over the NoC, steps 3 to 6, with directory state P0-MOD.]
Preemptive Priority NoC: QNoC
Service Levels:
• Dedicated wormhole buffer
• Preemptive priority scheduling
[Figure: Multiple SL link. One physical link between an output and an input port carries service levels SL 0 to SL 3]
[Figure: Multiple SL Router (QNoC). Input ports with a buffer of size BufSize per service level (SL 0 to SL 3), a crossbar, a scheduler with credit-based control, and output ports]
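Preemptive priority scheduling with a dedicated buffer per service level can be sketched as follows. This is a simplified model, not the QNoC router itself: it arbitrates one output port flit by flit, ignores credits and the crossbar, and assumes SL 0 is the highest priority. The `OutputPort` class and its methods are hypothetical names.

```python
from collections import deque

NUM_SL = 4  # service levels SL 0..SL 3; SL 0 is assumed to be the highest priority


class OutputPort:
    """One output port with a dedicated FIFO per service level."""

    def __init__(self) -> None:
        self.buffers = [deque() for _ in range(NUM_SL)]

    def enqueue(self, sl: int, packet: str, flits: int) -> None:
        for i in range(flits):
            self.buffers[sl].append((packet, i))

    def next_flit(self):
        """Pick the next flit from the highest-priority non-empty buffer.

        Arbitration happens per flit, so a newly arrived high-priority
        packet preempts a lower-priority worm in mid-transfer.
        """
        for buf in self.buffers:
            if buf:
                return buf.popleft()
        return None


port = OutputPort()
port.enqueue(3, "data", 4)
trace = [port.next_flit()]          # data flit 0 starts on the link
port.enqueue(0, "ctrl", 1)          # a ctrl packet arrives mid-transfer
while (flit := port.next_flit()) is not None:
    trace.append(flit)
print(trace)
```

The dedicated buffers are what make preemption safe in a wormhole network: the interrupted data worm keeps its flits in its own buffer and resumes once the control packet has passed.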
Example: Vanilla NoC
Without contention:
X: delay of long packet
δ: delay of short packet
[Figure: Vanilla NoC example between nodes A and B. Transaction 1: Long Data. Transaction 2: Short Req., Long Resp.]
Blue delay ~ X
Red delay ~ 2X+δ
Average delay ~ 1.5X
Example: Priority NoC
Without contention:
X: delay of long packet
δ: delay of short packet
Vanilla NoC example:
Blue delay = X
Red delay = 2X+δ
Average delay ~ 1.5X
Priority NoC example:
Blue delay = X+δ
Red delay = X+δ
Average delay ~ X
Potential delay reduction ~ 0.5X
[Figure: nodes A and B. Transaction 1: Long Data. Transaction 2: Short Req., Long Resp.]
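The delays in the two examples can be checked with a few lines of arithmetic. The sketch assumes that blue is Transaction 1 (a single long data packet from A to B), that red is Transaction 2 (a short request from A to B followed by a long response back), and that only the request competes with the long data packet for the link. The values chosen for X and δ are arbitrary.

```python
def vanilla_delays(X: float, delta: float) -> tuple[float, float]:
    """No priorities: the short request queues behind the long data packet."""
    t1 = X                    # long data goes first, uncontended
    t2 = X + delta + X        # wait X, send the request, then the long response
    return t1, t2


def priority_delays(X: float, delta: float) -> tuple[float, float]:
    """Preemptive priority: the short request interrupts the long data packet."""
    t1 = X + delta            # long data is paused while the request passes
    t2 = delta + X            # the request goes at once, then the long response
    return t1, t2


def average(pair: tuple[float, float]) -> float:
    return sum(pair) / len(pair)


X, delta = 100.0, 2.0         # arbitrary example values with delta << X
van = vanilla_delays(X, delta)
pri = priority_delays(X, delta)
print(van, average(van))
print(pri, average(pri))
print(average(van) - average(pri))
```

The saving is (X - δ)/2, which approaches 0.5X when the control packet is much shorter than the data packet. Transaction 1 pays only δ extra, while Transaction 2 no longer waits a full X.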
Priority NoC: Different Destinations
Very important in wormhole:
• When ctrl. packet is blocked by other worms
[Figure: a Short Req. packet and a Long Data packet heading to different destinations]
Protocol Correctness
[Figure: the L1 caches of P0 and P1 and the L2 bank with its directory exchanging 1. Read Req., 2. Read Resp., 3. Read Excl. Req., 4. Invalidation Req.]
Legend: High Priority (ctrl.), Low Priority (data)
Need state-preserving serialization of transactions in the processor interface
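One possible reading of this requirement: with priorities, a high-priority invalidation can overtake the low-priority read response that the directory sent earlier, so the processor interface must apply them in the directory's order. The sketch below is an assumption about how such an interface could behave, not the mechanism described in the talk. The `ProcessorInterface` class and its methods are hypothetical.

```python
class ProcessorInterface:
    """Hypothetical L1-side interface that re-serializes overtaken messages."""

    def __init__(self) -> None:
        self.l1: dict[int, str] = {}          # line -> state held in L1
        self.pending_reads: set[int] = set()  # lines with an outstanding read
        self.deferred: set[int] = set()       # invalidations that arrived early

    def send_read(self, line: int) -> None:
        self.pending_reads.add(line)

    def on_invalidate(self, line: int) -> bool:
        """Return True if the acknowledgement may be sent right away."""
        if line in self.pending_reads:
            self.deferred.add(line)           # the response is still in flight
            return False
        self.l1.pop(line, None)
        return True

    def on_read_resp(self, line: int) -> bool:
        """Return True if a deferred invalidation is applied and acked now."""
        self.pending_reads.discard(line)
        if line in self.deferred:
            self.deferred.discard(line)       # data is used once, not cached
            return True
        self.l1[line] = "SHARED"
        return False


iface = ProcessorInterface()
iface.send_read(0x40)
ack_now = iface.on_invalidate(0x40)           # the invalidation overtook the data
ack_later = iface.on_read_resp(0x40)
print(ack_now, ack_later, 0x40 in iface.l1)
```

Without the deferral, the interface would acknowledge the invalidation for a line it does not yet hold and then install the late response as Shared, leaving a stale copy next to the new owner.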
Numerical Evaluation
• CMP simulator (SIMICS)
Simulate parallel benchmarks
Obtain L2-cache access traces
• QNoC simulator (OPNET)
Simulate distributed coherence protocol over NoC
Measure total RD/RX L2-access delay
Measure total program throughput
[Figure: simulated CMP with processors P0 to P7 and distributed L2 banks 0 to 63]
Priority NoC: Results
• Short ctrl. packet gets high priority
• Long data packet gets low priority
Delay Reduction vs. Network Load
[Figure: RD Delay - Apache. Av. delay of L2-Read in Apache [cycles] vs. link capacity (1, 4, 16 Gbps), Vanilla NoC vs. Priority-based NoC. Data labels as extracted: 234, 5762, 286, 1301, 994]
[Figure: RD/RX Delay Reduction - Apache. Av. delay reduction of L2-transaction in Apache [%] vs. link capacity (1, 4, 16 Gbps), for Read and Read Exclusive]
Priority NoC: Several Benchmarks
Delay Reduction
[Figure: L2 Access Delay Reduction by Priority-based NoC. Delay reduction [%] for Read and Read Exclusive on apache, zeus, fft, ocean, radix. Data labels as extracted: 22.6, 31.8, 19.6, 28.4, 13.5, 25.3, 18.3, 32.9, 22.3, 28.0]
Program Speedup
[Figure: Total Program Speedup by Priority-based NoC. Speedup [%] on apache, zeus, fft, ocean, radix. Data labels as extracted: 9.4, 8.7, 9.0, 8.6, 5.0]
So Far: The Power of Priority
• Simplicity - Almost for Free
• Significant CMP Speed-up
Good For:
• Coherency
• Traffic differentiation (e.g. Fetch vs. Pre-Fetch)
• Search in DNUCA
• Synchronization (Locks)
[Figure: CMP with processors P0 to P7 and distributed L2 banks 0 to 63]
Advanced Support Functions
• Special Broadcast for Short Messages
Broadcast service (e.g. search in DNUCA)
Wormhole broadcast is slow and expensive
S&F (store-and-forward) broadcast embedded in wormhole
• Virtual Ring
No additional cost
For invalidation multicast
Snooping or synchronization
[Figure: broadcast from a source S across the grid of processors P0 to P7 and banks 0 to 63. Legend: Source, Replicating, Forwarding]
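A virtual ring can be read as a logical ring laid over the existing mesh, where every ring hop is an ordinary mesh link, so no extra wires are needed. The sketch below builds one such ring for an 8x8 grid. The construction and the `virtual_ring` name are assumptions made for illustration and are not taken from the talk.

```python
def virtual_ring(rows: int, cols: int) -> list[tuple[int, int]]:
    """Return a cyclic (row, col) order in which neighbours are mesh-adjacent.

    This particular construction needs an even number of rows and cols >= 2.
    """
    if rows % 2 or cols < 2:
        raise ValueError("needs an even number of rows and at least 2 columns")
    ring = [(0, c) for c in range(cols)]                 # row 0, left to right
    for r in range(1, rows):                             # snake over columns 1..cols-1
        cs = range(cols - 1, 0, -1) if r % 2 else range(1, cols)
        ring += [(r, c) for c in cs]
    ring += [(r, 0) for r in range(rows - 1, 0, -1)]     # return up column 0
    return ring


def is_mesh_ring(ring: list[tuple[int, int]]) -> bool:
    """Check that consecutive nodes, including the wrap-around, are one hop apart."""
    return all(
        abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
        for a, b in zip(ring, ring[1:] + ring[:1])
    )


ring = virtual_ring(8, 8)
print(len(ring), is_mesh_ring(ring))
```

A message injected once and forwarded node to node along this order reaches every router exactly once, which is the property an invalidation multicast, a snoop, or a synchronization token would rely on.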
Summary
NoC at the service of CMP!
• Shared cache over NoC
• Priority is powerful
• Built-in support functions