fist: a fast, lightweight, fpga-friendly packet latency...
TRANSCRIPT
FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency
Estimator for NoC Modeling in Full-System Simulations
5/3/2011
Michael K. Papamichael, James C. Hoe, Onur [email protected], [email protected], [email protected]
Computer Architecture Lab at
Our work was supported by NSF. We thank Xilinxand Bluespec for their FPGA and tool donations.
CALCM Computer Architecture Lab at Carnegie Mellon
Simulation in Computer Architecture
Slow for large-scale multiprocessor studies
Full-system fidelity + long benchmarks
How can we make it faster?
Speed, accuracy, flexibility trade-off
Full-system simulators sacrifice accuracy for speed and flexibility
Speed
Accuracy Flexibility
Accelerate simulation with FPGAs
Can simulate up to millions of gates
2
Orders of magnitude simulation speedup
CALCM Computer Architecture Lab at Carnegie Mellon
The FIST Project
3
Explores fast NoC models for full-system simulations FPGA-friendly, but avoid direct implementation Low error, many topologies, >10M packets/sec
Simpler requirements of full-system simulation Estimate packet latencies, capture high-order effects
L2L1P
L2L1P
L2L1P
L2L1P
R
R
R
R
FPGA
(Xilinx LX110T)
FPGA area requirements
for state-of-the-art mesh NoC*
4x4
5x5
*NoC RTL from http://nocs.stanford.edu/router.html
CALCM Computer Architecture Lab at Carnegie Mellon
FIST Approach
View NoC as set of routers/links Abstract router into black-box Represent by load-delay curves Specific to each router configuration and traffic pattern
R
R
R
R R
R R
R R
NN
N
N
N
NN
N N
R
R
R R
RR
4
N
RStateLogic
BuffersRRLa
ten
cy
Load
CALCM Computer Architecture Lab at Carnegie Mellon
N
N
N
N
N
N
FIST Approach
Treat each hop as a set of load-delay curves Trade-off between model complexity and fidelity
Keep track of load at each node To track router load monitor traffic over window of time
R
R
R
R
R
RN
N
N
R
R
R
5
Late
ncy
Load
CALCM Computer Architecture Lab at Carnegie Mellon
N
N
N
N
N
NN
N
N
FIST in Action
Route packet from source to destination Determine routers that will be traversed
Sum up the delays for each traversed router Index load-delay curves using current load at each router
R
R
R
R
R
R
R
R
R
S
D
packet
delay
6
CALCM Computer Architecture Lab at Carnegie Mellon
Outline
Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions
CALCM Computer Architecture Lab at Carnegie Mellon
Outline
Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions
CALCM Computer Architecture Lab at Carnegie Mellon
Fee
db
ack
Putting FIST Into Context
Train Curves Use Curves
Network models within full-system simulators Model network within a broader simulated system
Assign delay to each packet traversing the network
Traffic generated by real workloads
Up
date
d C
urve
s
Detailed network models Cycle-accurate network simulators (e.g. BookSim)
Analytical network models
Typically study networks under synthetic traffic patterns
9
CALCM Computer Architecture Lab at Carnegie Mellon
Offline and Online FIST
Offline FIST Detailed network simulator generates curves offline Can use synthetic or actual workload traffic Load curves into FIST and run experiment
Detailed Network Model
Online FIST (tolerates dynamic changes in network behavior) Initialization of curves same as offline Periodically run detailed network simulator on the side Compare accuracy and, if necessary, update curves
Detailed Network Model
Provide feedback and receive updated curves
10
CALCM Computer Architecture Lab at Carnegie Mellon
Online Training in Action
El
05
101520253035404550
0
66
13
2
19
8
26
4
33
0
39
6
46
2
18
84
19
50
20
16
20
82
21
48
22
14
22
80
23
46
24
12
24
78
Late
ncy
Ellapsed Cycles (in 1000s)
Actual Latency
Estimated Latency
El
…
…
…
Elapsed cycles (in 1000s)
Example with no initial training
Before Training
After Training
CALCM Computer Architecture Lab at Carnegie Mellon
FIST Applicability
“FIST-Friendly” Networks
Exhibit stable, predictable behavior as load fluctuates
Actual traffic similar to training traffic
FIST Limitations
Depends on fidelity, representativeness of training models
Higher loads and large buffers can limit FIST’s accuracy
High network load increased packet latency variance
Large buffers increased range of observed packet latencies
Cannot capture fine-grain packet interactions
Cannot replace cycle-accurate detailed network models
FIST only as good as its training data12
CALCM Computer Architecture Lab at Carnegie Mellon
Applying FIST to NoCs
NoCs affected by on-chip limitations and scarce resources
Employ simple routing algorithms
Usually simple deterministic routing
Operate at low loads
NoCs usually over-provisioned to handle worst-case
Have been observed to operate at low injection rates
Small buffers
On-chip abundance of wires reduces buffering requirements
Amount of buffering in NoCs is limited or even eliminated
NoCs are “FIST-Friendly”13
CALCM Computer Architecture Lab at Carnegie Mellon
Outline
Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions
CALCM Computer Architecture Lab at Carnegie Mellon
FIST Implementations
PacketDescriptors
Router Elements
Src
Dest
Size
CalculateDelay
Pickrouters
Routing Logic
Packet
Delays
Partial DelaysB
RA
M
Load Tracker
Curve
Software Implementation of FIST (written in C++) Implements online and offline FIST models
Hardware Implementation (written in Bluespec) Precisely replicates software-based FIST Block diagram of architecture
15
Handle load tracking
& delay queries
CALCM Computer Architecture Lab at Carnegie Mellon
Peeking Under The Hood
Similar issues arise for load tracking & dynamic training
Tracking Latency T 567 4 23 H
R0
S D
R1 R2
Wormhole-routed
Injection latencyTraversal latencies
?
Packet Latency = 30
Latency =
Latency =
Latency =
4 5 6 7 TH
TH
0 10 20 30
TH
5 15 25
2 3
4 5 6 72 3
52 3
R0R1R2 4
R0R1R2
Store-and-forwardPacket Latency = 30
Latency = 9
Latency = 14
Latency = 7
4 5 6 7 TH
TH
0 10 20 30
TH
5 15 25
2 3
4 5 6 72 3
5 6 742 3
R0R1R2
R0R1R2
Use separate injection and traversal latency curves per router
CALCM Computer Architecture Lab at Carnegie Mellon
Methodology
Examined online and offline FIST models Replaced cycle-accurate NoC model in tiled CMP simulator
Network and system configuration 4x4, 8x8, 16x16 wormhole-routed mesh Each network node hosts core+coherent L1 and a slice of L2
Multiprogrammed and multithreaded workloads 26 SPEC CPU2006 benchmarks of varying network intensity 8 SPLASH-2 and 2 PARSEC workloads
Traffic generated by cache misses
Consists of control, data and coherence packets
Offline and Online FIST models with two curves per router Curves represent injection and traversal latency at each router Initial training using uniform random synthetic traffic
Please see paper for more details!17
CALCM Computer Architecture Lab at Carnegie Mellon
Accuracy Results (offline)
8x8 mesh using FIST offline model
Average Latency and Aggregate IPC Error
-15%
-10%
-5%
0%
5%
10%
15%Latency ErrorIPC Error
MT (SPL/PAR)MP (High)MP (Med)MP (Low)
Late
ncy
/IP
CEr
ror
(in
%)
IPC Error < 4%
Latency Error < 8%
18
CALCM Computer Architecture Lab at Carnegie Mellon
Accuracy Results (online)
-0.1
-0.05
0
0.05
0.1
Late
ncy
Err
or
Latency Error
IPC Error
MT (SPL/PAR)MP (High)MP (Med)MP (Low)
8x8 mesh using FIST online model
Average Latency and Aggregate IPC Error
Both Latency and IPC Error below 3%
19
Late
ncy
/IP
CEr
ror
(in
%)
-10%
-5%
0%
5%
10%
CALCM Computer Architecture Lab at Carnegie Mellon
What about a very simple model?
Very high error for both latency and IPC!20
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
Late
ncy
Erro
r
Latency Error
IPC Error
MT (SPL/PAR)MP (High)MP (Med)MP (Low)
Late
ncy
/IP
CEr
ror
(in
%)
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
Late
ncy
Erro
r
Latency Error
IPC ErrorLatency ErrorIPC Error
-10%
-20%
0%
20%
40%
60%
80%
-40%
-60%
-80%
8x8 mesh using hop-based model How does simple network model affect high-order results?
FIST models always within this range
CALCM Computer Architecture Lab at Carnegie Mellon
Performance Results
Qualitative comparison
10K
100K
1M
10M
100M
SW FIST
HW FIST
Complexity / Level of Detail
Sp
ee
d (
pa
cke
ts/s
ec)
21
1K
SW NoC Sims(e.g. BookSim)
SW-based speedup results for 16x16 mesh Offline FIST: 43x Online FIST: 18x
HW-based speedup (offline): ~3-4 orders of magnitude
CALCM Computer Architecture Lab at Carnegie Mellon
Hardware Implementation Results
FPGA resource usage & clock frequency
Different mesh configurations
Xilinx Virtex-5 LX155T FPGA
FIST Model Direct Implementation
Size FPGA Area Freq. FPGA Area Freq.
4x4 4% 380 MHz 61% 130 MHz
8x8 15% 263 MHz - -
12x12 34% 250 MHz - -
16x16 60% 214 MHz - -
20x20 94% 200 MHz - -
FIST can scale to large NoCs with many routers
Will not fit
CALCM Computer Architecture Lab at Carnegie Mellon
Outline
Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions
CALCM Computer Architecture Lab at Carnegie Mellon
Related Work
Vast body of work on network modeling
Analytical models, hardware prototyping, etc.
Abstract network modeling
Performance vs. accuracy trade-off studies [Burger 95]
Load-delay curve representation of network [Lugones 09]
FPGAs for network modeling
Cycle-accurate fidelity at the cost of limited scalability
Time-multiplexing can help with scalability [Wang 10]
But still suffer from high implementation complexity
CALCM Computer Architecture Lab at Carnegie Mellon
Conclusions & Future Directions
Conclusions Full-system simulators can tolerate small inaccuracies
FIST can provide fast SW- or HW-based NoC models
SW model provides 18x-43x average speedup w/ <2% error
HW model can scale to 100s routers with >1000x speedup
NoCs within a CMP are “FIST-friendly”
But not all networks good candidates for FIST modeling
Future Directions FPGA-friendly NoC models at multiple levels of fidelity
Configurable generation of hardware NoC models
CALCM Computer Architecture Lab at Carnegie Mellon
Thanks!
Questions?