fist: a fast, lightweight, fpga-friendly packet latency ...mpapamic/research/fist - nocs11...

1
El 0 5 10 15 20 25 30 35 40 45 50 0 66 132 198 264 330 396 462 1950 2016 2082 2148 2214 2280 2346 2412 2478 Latency Actual Latency Estimated Latency Elapsed cycles (in 1000s) Example of online training with no initial training Before Training After Training FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations The FIST Project FIST Approach Simulation in Computer Architecture Acknowledgements Funding for this work has been provided by NSF CCF-0811702. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations and support. Michael K. Papamichael, James C. Hoe, Onur Mutlu Carnegie Mellon University, Pittsburgh, PA USA [email protected] , [email protected], [email protected] Related Work and Conclusions FIST-based Network Models Computer Architecture Lab at Slow for large-scale multiprocessor studies Full-system fidelity + long benchmarks How can we make it faster? Speed, accuracy, flexibility trade-off Full-system simulators sacrifice accuracy for speed and flexibility Speed Accuracy Flexibility Accelerate simulation with FPGAs Can simulate up to millions of gates Orders of magnitude simulation speedup Explores fast NoC models for full-system simulations FPGA-friendly , but avoid direct implementation Low error, many topologies, >10M packets/sec Simpler requirements of full-system simulation Estimate packet latencies, capture high-order effects R R R R FPGA area requirements for state-of-the-art mesh NoC* 5x5 *NoC RTL from http://nocs.stanford.edu/router.html View NoC as set of routers/links Abstract router into black-box Represent by load-delay curves Specific to each router configuration and traffic pattern R R R R R R R R R N N N N N N N N N R R R R R R N R State Logic Buffers R R Latency Load Feedback Train Curves Use Curves Network models within full-system simulators Model network within a broader simulated system Assign delay to each packet traversing the network Traffic generated by real workloads Updated Curves Detailed network models Cycle-accurate network simulators (e.g. BookSim) Analytical network models Typically study networks under synthetic traffic patterns Offline FIST Detailed network simulator generates curves offline Can use synthetic or actual workload traffic Load curves into FIST and run experiment Detailed Network Model Online FIST (tolerates dynamic changes in network behavior) Initialization of curves same as offline Periodically run detailed network simulator on the side Compare accuracy and, if necessary, update curves Detailed Network Model Provide feedback and receive updated curves Putting FIST Into Context FIST Applicability Packet Descriptors Router Elements Src Dest Size Calculate Latency Pick routers Routing Logic Packet Delays Partial Delays BRAM Load Tracker Curves Software Implementation of FIST (written in C++) Examined online and offline FIST models Replaced cycle-accurate NoC model in tiled CMP simulator Network and system configuration 4x4, 8x8, 16x16 wormhole-routed mesh Each network node hosts core+coherent L1 and a slice of L2 Multiprogrammed and multithreaded workloads 26 SPEC CPU2006 benchmarks of varying network intensity 8 SPLASH-2 and 2 PARSEC workloads Traffic generated by cache misses Consists of control, data and coherence packets Offline and Online FIST models with two curves per router Curves represent injection and traversal latency at each router Initial training using uniform random synthetic traffic Please see paper for more details! Hardware Implementation (written in Bluespec) Precisely replicates software-based FIST 3-4 orders of magnitude speedup (offline FIST) 8x8 mesh using FIST offline model -15% -10% -5% 0% 5% 10% 15% Latency Error IPC Error MT (SPL/PAR) MP (High) MP (Med) MP (Low) Latency/IPC Error (in %) IPC Error < 4% Latency Error < 8% 0.05 0.1 Latency Error IPC Error MT (SPL/PAR) MP (High) MP (Med) MP (Low) Both Latency and IPC Error below 3% Latency/IPC Error (in %) -10% -5% 0% 5% 10% 8x8 mesh using FIST online model FPGA resource usage & clock frequency Virtex-5 LX155T Virtex-6 LX760 Size BRAMs LUTs Freq. BRAMs LUTs Freq. 4x4 8 1% 380 MHz 8 0% 448 MHz 8x8 32 5% 263 MHz 32 1% 443 MHz 12x12 72 11% 250 MHz 72 2% 375 MHz 16x16 128 20% 214 MHz 129 5% 375 MHz 20x20 200 32% 200 MHz 201 8% 319 MHz 24x24 - - - 289 12% 312 MHz Related Work Abstract network modeling Performance vs. accuracy trade-off studies [Burger 95] Load-delay curve representation of network [Lugones 09] FPGAs for network modeling Cycle-accurate fidelity at the cost of limited scalability Time-multiplexing can help with scalability [Wang 10] But still suffer from high implementation complexity Conclusions Full-system simulators can tolerate small inaccuracies FIST can provide fast SW- or HW-based NoC models SW model provides 18x-43x average speedup w/ <2% error HW model can scale to 100s routers with >1000x speedup NoCs are “FIST-friendly” But not all networks good candidates for FIST modeling Future Directions FPGA-friendly NoC models at multiple levels of fidelity Configurable generation of hardware NoC models Latency Error Latency Error IPC Error MT (SPL/PAR) MP (High) MP (Med) MP (Low) Very high error for both latency and IPC! Latency/IPC Error (in %) Latency Error IPC Error -60% -20% 0% 20% 40% 60% 80% -40% -80% FIST models always within this range Comparison against simple hop-based model Online Training in Action “FIST-Friendly” Networks Exhibit stable, predictable behavior as load fluctuates Actual traffic similar to training traffic FIST Limitations Depends on fidelity, representativeness of training models Higher loads and large buffers can limit FIST’s accuracy High network load increased packet latency variance Large buffers increased range of observed packet latencies Cannot capture fine-grain packet interactions Cannot replace cycle-accurate detailed network models NoCs are “FIST-Friendly” Employ simple routing algorithms Operate at low loads Small buffers 8x8 mesh using simple hop-based model Latency and IPC accuracy for FIST-based models Evaluation FPGA Implementation of FIST Handle load tracking & delay queries Speedup for 16x16 mesh using offline FIST: 43x Speedup for 16x16 mesh using online FIST: 18x Methodology

Upload: vulien

Post on 28-May-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency ...mpapamic/research/FIST - NOCS11 poster.pdf · Periodically run detailed network simulator on the side ... High network load

El

05

101520253035404550

0 66 132

198

264

330

396

462

1884

1950

2016

2082

2148

2214

2280

2346

2412

2478

Late

ncy

Ellapsed Cycles (in 1000s)

Actual Latency

Estimated Latency

El

Elapsed cycles (in 1000s)

Example of online training with no initial training

Before Training

After Training

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations

The FIST Project FIST ApproachSimulation in Computer Architecture

Acknowledgements Funding for this work has been provided by NSF CCF-0811702. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations and support.

Michael K. Papamichael, James C. Hoe, Onur MutluCarnegie Mellon University, Pittsburgh, PA USA

[email protected] , [email protected], [email protected]

Related Work and Conclusions

FIST-based Network Models

Computer Architecture Lab at

Slow for large-scale multiprocessor studies

Full-system fidelity + long benchmarks

How can we make it faster?

Speed, accuracy, flexibility trade-off

Full-system simulators sacrifice accuracy for speed and flexibility

Speed

Accuracy Flexibility

Accelerate simulation with FPGAs

Can simulate up to millions of gates

Orders of magnitude simulation speedup

Explores fast NoC models for full-system simulations FPGA-friendly, but avoid direct implementation Low error, many topologies, >10M packets/sec

Simpler requirements of full-system simulation Estimate packet latencies, capture high-order effects

L2L1P

L2L1P

L2L1P

L2L1P

R

R

R

R

FPGA (Xilinx LX110T)

FPGA area requirements

for state-of-the-art mesh NoC*

4x4

5x5

*NoC RTL from http://nocs.stanford.edu/router.html

View NoC as set of routers/links Abstract router into black-box Represent by load-delay curves Specific to each router configuration and traffic pattern

R

R

R

R R

R R

R R

NN

N

N

N

NN

N N

R

R

R R

RR

N

RStateLogic

BuffersRRLa

ten

cy

Load

Feed

bac

k

Train Curves Use Curves

Network models within full-system simulators Model network within a broader simulated system

Assign delay to each packet traversing the network

Traffic generated by real workloads

Up

dated

Cu

rves

Detailed network models Cycle-accurate network simulators (e.g. BookSim)

Analytical network models

Typically study networks under synthetic traffic patterns

Offline FIST Detailed network simulator generates curves offline Can use synthetic or actual workload traffic Load curves into FIST and run experiment

Detailed Network Model

Online FIST (tolerates dynamic changes in network behavior) Initialization of curves same as offline Periodically run detailed network simulator on the side Compare accuracy and, if necessary, update curves

Detailed Network Model

Provide feedback and receive updated curves

Putting FIST Into Context FIST Applicability

PacketDescriptors

Router Elements

SrcDest

Size

CalculateLatency

Pickrouters

Routing Logic

Packet

Delays

Partial Delays

BRAM

Load Tracker

Curves

Software Implementation of FIST (written in C++) Examined online and offline FIST models

Replaced cycle-accurate NoC model in tiled CMP simulator

Network and system configuration 4x4, 8x8, 16x16 wormhole-routed mesh Each network node hosts core+coherent L1 and a slice of L2

Multiprogrammed and multithreaded workloads 26 SPEC CPU2006 benchmarks of varying network intensity 8 SPLASH-2 and 2 PARSEC workloads

Traffic generated by cache misses Consists of control, data and coherence packets

Offline and Online FIST models with two curves per router Curves represent injection and traversal latency at each router Initial training using uniform random synthetic traffic

Please see paper for more details!

Hardware Implementation (written in Bluespec) Precisely replicates software-based FIST 3-4 orders of magnitude speedup (offline FIST)

8x8 mesh using FIST offline model

-15%

-10%

-5%

0%

5%

10%

15%Latency ErrorIPC Error

MT (SPL/PAR)MP (High)MP (Med)MP (Low)

Late

ncy

/IP

CEr

ror

(in

%)

IPC Error < 4%

Latency Error < 8%

-0.1

-0.05

0

0.05

0.1

Late

ncy

Err

or

Latency Error

IPC Error

MT (SPL/PAR)MP (High)MP (Med)MP (Low)

Both Latency and IPC Error below 3%

Late

ncy

/IP

CEr

ror

(in

%)

-10%

-5%

0%

5%

10%

8x8 mesh using FIST online model

FPGA resource usage & clock frequency

Virtex-5 LX155T Virtex-6 LX760

Size BRAMs LUTs Freq. BRAMs LUTs Freq.

4x4 8 1% 380 MHz 8 0% 448 MHz

8x8 32 5% 263 MHz 32 1% 443 MHz

12x12 72 11% 250 MHz 72 2% 375 MHz

16x16 128 20% 214 MHz 129 5% 375 MHz

20x20 200 32% 200 MHz 201 8% 319 MHz

24x24 - - - 289 12% 312 MHz

Related Work Abstract network modeling

Performance vs. accuracy trade-off studies [Burger 95]

Load-delay curve representation of network [Lugones 09]

FPGAs for network modeling Cycle-accurate fidelity at the cost of limited scalability

Time-multiplexing can help with scalability [Wang 10]

But still suffer from high implementation complexity

Conclusions Full-system simulators can tolerate small inaccuracies

FIST can provide fast SW- or HW-based NoC models SW model provides 18x-43x average speedup w/ <2% error

HW model can scale to 100s routers with >1000x speedup

NoCs are “FIST-friendly” But not all networks good candidates for FIST modeling

Future Directions FPGA-friendly NoC models at multiple levels of fidelity

Configurable generation of hardware NoC models

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Late

ncy

Erro

r

Latency Error

IPC Error

MT (SPL/PAR)MP (High)MP (Med)MP (Low)

Very high error for both latency and IPC!

Late

ncy

/IP

CEr

ror

(in

%)

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Late

ncy

Erro

r

Latency Error

IPC ErrorLatency Error

IPC Error

-60%

-20%

0%

20%

40%

60%

80%

-40%

-80%

FIST models always within this range

Comparison against simple hop-based model

Online Training in Action

“FIST-Friendly” Networks Exhibit stable, predictable behavior as load fluctuates

Actual traffic similar to training traffic

FIST Limitations Depends on fidelity, representativeness of training models

Higher loads and large buffers can limit FIST’s accuracy High network load increased packet latency variance

Large buffers increased range of observed packet latencies

Cannot capture fine-grain packet interactions

Cannot replace cycle-accurate detailed network models

NoCs are “FIST-Friendly” Employ simple routing algorithms Operate at low loads Small buffers

8x8 mesh using simple hop-based model

Latency and IPC accuracy for FIST-based models

Evaluation

FPGA Implementation of FIST

Handle load tracking

& delay queries

Speedup for 16x16 mesh using offline FIST: 43xSpeedup for 16x16 mesh using online FIST: 18x

Methodology