
Page 1: Hybrid System Emulation

Hybrid System Emulation

Taeweon Suh

Computer Science Education
Korea University

January 2010

Page 2: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 3: Hybrid System Emulation

Scope

[Block diagram: CPU connected to the North Bridge over the FSB (Front-Side Bus); the North Bridge connects to Main Memory (DDR2) and, over DMI (Direct Media Interface), to the South Bridge]

A typical computer system (up to Core 2)

Page 4: Hybrid System Emulation

Scope (Cont.)

[Block diagram: CPU connected to the North Bridge over QuickPath (Intel) or HyperTransport (AMD); the North Bridge connects to Main Memory (DDR3) and, over DMI (Direct Media Interface), to the South Bridge]

A Nehalem-based computer system

Page 5: Hybrid System Emulation

Scope (Cont.)

[Block diagram: a multi-core CPU (each core with its own L1 and L2) connected to the North Bridge over the FSB; the North Bridge connects to Main Memory (DDR2) and, over DMI, to the South Bridge. The CPU-FSB region is marked as the scope of this talk]

Page 6: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 7: Hybrid System Emulation

Background

Computer architecture research has mostly been done with software simulation
– Pros
  Relatively easy to implement
  Flexibility
  Observability
  Debuggability
– Cons
  Simulation time
  Difficulty modeling the real world, such as I/O

Page 8: Hybrid System Emulation

Background (Cont.)

What is the alternative?
– FPGA (Field-Programmable Gate Array)
  Reconfigurability
  – Programmable hardware
  – Short turn-around time
  High operating frequency
  Observability and debuggability
  Many IPs provided
  – CPUs, memory controllers, etc.

Page 9: Hybrid System Emulation

Background (Cont.)

FPGA capability example
– Reconfigurable Pentium

[Photos: a real Pentium next to a reconfigurable Pentium implemented on an FPGA]

Page 10: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 11: Hybrid System Emulation

Related Work

MemorIES (2000)
– Memory Instrumentation and Emulation System, from IBM T.J. Watson
– L3 cache and/or coherence protocol emulation
  Plugged into the 6xx bus of an RS/6000 SMP machine
– Passive emulator

Page 12: Hybrid System Emulation

Related Work (Cont.)

RAMP
– Research Accelerator for Multiple Processors
– Parallel computer architecture
  Multi-core HW/SW research
– Full emulator
– Multi-disciplinary project by UC Berkeley, Stanford, CMU, UT-Austin, MIT and Intel

[Photo: BEE2 board with FPGAs]

Page 13: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 14: Hybrid System Emulation

Hybrid System Emulation

Combination of an FPGA and a real system
– The FPGA is deployed in a system of interest
– The FPGA interacts with the system
  Monitors transactions from the system
  Provides feedback to the system
– System-level active emulation
– Runs workloads on a real system
– Research, measure, and evaluate the emulated components in a full-system configuration

The FPGA is deployed on the FSB in this research

Page 15: Hybrid System Emulation

Hybrid System Emulation: Experiment Setup

[Photos: Intel server system, Pentium-III, and FPGA board]

[Block diagram: Pentium-III and FPGA sharing the front-side bus (FSB) with the North Bridge and 2GB SDRAM]

Use an Intel server system equipped with two Pentium-IIIs
Replace one Pentium-III with an FPGA
– The FPGA actively participates in transactions on the FSB

Page 16: Hybrid System Emulation

Hybrid System Emulation: Front-Side Bus (FSB)

FSB protocol
– 7-stage pipelined bus (Pentium-III)
  Request1, request2, error1, error2, snoop, response, data

How does the FPGA participate in FSB transactions?
– Snoop stall
  Part of the cache coherence mechanism
  Delays the snoop response
– Cache-to-cache transfer
  Part of the cache coherence mechanism
  Provides data from a processor's cache to the requester via the FSB

Page 17: Hybrid System Emulation

Hybrid System Emulation: Cache Coherence Protocol

Example: MESI protocol
– Snoop-based protocol
– Intel implements MESI (Modified, Exclusive, Shared, Invalid)

[Diagram: two Pentium-IIIs (P0, P1) and main memory behind the North Bridge, stepping through an example on cache line 1234]
1. P0: read. P0 caches the line in E state
2. P1: read. A "shared" snoop response leaves the line S in both caches
3. P1: write (abcd). P0's copy is invalidated (I); P1's copy becomes M
4. P0: read. P1 asserts a "snoop stall", then supplies abcd via cache-to-cache transfer; both copies end up S
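The four-step walk-through above condenses to a few lines of code. Below is a minimal, illustrative C sketch of the snoop-side MESI transitions; the event model and the `snoop` helper are assumptions made for illustration, not the deck's (or Intel's) actual implementation.

```c
#include <stdio.h>

/* MESI states, as implemented by Intel processors (per the slide). */
typedef enum { M, E, S, I } mesi_t;

/* How one cache's copy reacts to a transaction snooped on the bus.
 * bus_write != 0 means another processor is writing the line. */
static mesi_t snoop(mesi_t cur, int bus_write, int *cache_to_cache)
{
    *cache_to_cache = 0;
    if (cur == I) return I;
    if (bus_write) return I;            /* invalidation traffic          */
    if (cur == M) *cache_to_cache = 1;  /* owner supplies the data       */
    return S;                           /* after a snoop stall, line is S */
}

int main(void)
{
    int c2c;
    mesi_t p0 = E, p1 = I;              /* step 1: P0 read, line is E    */
    p0 = snoop(p0, 0, &c2c); p1 = S;    /* step 2: P1 read, both S       */
    p0 = snoop(p0, 1, &c2c); p1 = M;    /* step 3: P1 write (abcd)       */
    p1 = snoop(p1, 0, &c2c); p0 = S;    /* step 4: P0 read, P1 supplies
                                           abcd via cache-to-cache       */
    printf("P0=%d P1=%d c2c=%d\n", p0, p1, c2c);  /* both S, c2c=1 */
    return 0;
}
```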

Page 18: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation †
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

† Erico Nurvitadhi, Jumnit Hong and Shih-Lien Lu, "Active Cache Emulator," IEEE Transactions on VLSI Systems, 2008

Page 19: Hybrid System Emulation

L3 Cache Emulation: Methodology

– Implement the L3 tags in the FPGA
– On a miss, inject snoop stalls and store the tag information
  "New" memory access latency (= L3 miss latency) = snoop stalls + memory access latency
– On a hit, inject no snoop stall
  L3 latency (= L3 hit latency) = memory access latency

[Block diagram: Pentium-III (L1, L2) and FPGA (L3 tag) on the FSB with the North Bridge and 2GB SDRAM; on a miss the FPGA stretches the snoop phase, on a hit the data arrives at plain memory latency]
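Per FSB memory transaction, the tag-only emulation reduces to one decision: stall or not. Here is a hedged C sketch of that logic, assuming a direct-mapped organization and fixed sizes for illustration (the actual design parameterizes both, and its state machines track up to 8 in-flight FSB transactions):

```c
#include <stdint.h>
#include <stdbool.h>

#define L3_SIZE (8ull * 1024 * 1024)  /* assumed 8MB emulated L3        */
#define BLOCK   64u                   /* assumed 64B block size         */
#define NSETS   ((uint32_t)(L3_SIZE / BLOCK))

static uint32_t tags[NSETS];
static bool     valid[NSETS];

/* Returns the number of snoop-stall cycles to inject for this access.
 * Hit: zero stalls, so the access completes at plain memory latency
 * (which plays the role of the L3 hit latency).
 * Miss: the tag is allocated and the snoop phase is stretched, so
 * snoop stalls + memory latency emulates the L3 miss latency. */
unsigned l3_emulate(uint64_t paddr, unsigned miss_stalls)
{
    uint64_t block = paddr / BLOCK;
    uint32_t set   = (uint32_t)(block % NSETS);
    uint32_t tag   = (uint32_t)(block / NSETS);

    if (valid[set] && tags[set] == tag)
        return 0;                     /* L3 hit: no snoop stall         */

    valid[set] = true;                /* L3 miss: fill the tag ...      */
    tags[set]  = tag;
    return miss_stalls;               /* ... and delay the snoop phase  */
}
```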

Page 20: Hybrid System Emulation

L3 Cache Emulation: Experiment Environment

Operating system
– Windows XP

Validation of the emulated L3 cache
– RightMark Memory Analyzer †

† RightMark Memory Analyzer, http://cpu.rightmark.org/products/rmma.shtml

Page 21: Hybrid System Emulation

L3 Cache Emulation: Experiment Result

[Plot: RightMark Memory Analyzer result; access latency (CPU cycles / nsec) versus working set size, with plateaus for the L1 cache, L2 cache, emulated L3 cache, and main memory]

Page 22: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency †
– HW/SW Co-Simulation
Conclusions

† Taeweon Suh, Shih-Lien Lu and Hsien-Hsin S. Lee, "An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems," 17th FPL, 2007

Page 23: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Methodology

– Implement an L2 cache in the FPGA
– Save evicted cache lines into that cache
– Supply the data via cache-to-cache transfer when the P-III next requests it
– Measure benchmark execution times and compare with the baseline

[Block diagram: Pentium-III (MESI) and FPGA (D$) on the FSB with the North Bridge and 2GB SDRAM; the FPGA answers re-reads of evicted lines with cache-to-cache transfers]
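Per FSB transaction, the FPGA's role in this experiment boils down to two cases: capture write-backs, and answer full-line reads that hit its cache with a cache-to-cache transfer. A minimal C sketch, assuming a direct-mapped cache with 32-byte lines (matching the P-III line size); the names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE   32u                     /* Pentium-III cache line size    */
#define NLINES (256u * 1024 / LINE)    /* assumed 256KB, direct-mapped   */

struct line { bool valid; uint32_t tag; uint8_t data[LINE]; };
static struct line cache[NLINES];

/* Write-back observed on the FSB: capture the evicted line. */
void on_writeback(uint64_t paddr, const uint8_t *data)
{
    uint32_t idx = (uint32_t)((paddr / LINE) % NLINES);
    cache[idx].valid = true;
    cache[idx].tag   = (uint32_t)((paddr / LINE) / NLINES);
    memcpy(cache[idx].data, data, LINE);
}

/* Full-line read observed on the FSB: if we hold the line, claim it
 * in the snoop phase and drive the data phase (cache-to-cache
 * transfer); otherwise stay silent and let main memory respond. */
bool on_read(uint64_t paddr, uint8_t *out)
{
    uint32_t idx = (uint32_t)((paddr / LINE) % NLINES);
    uint32_t tag = (uint32_t)((paddr / LINE) / NLINES);
    if (!cache[idx].valid || cache[idx].tag != tag)
        return false;                  /* miss: memory supplies the data */
    memcpy(out, cache[idx].data, LINE);
    return true;                       /* hit: cache-to-cache transfer   */
}
```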

Page 24: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Environment

Operating system
– Red Hat Linux 2.4.20-8

Natively run SPEC2000 benchmarks
– The choice of benchmarks does not affect the evaluation, as long as a reasonable amount of bus traffic is generated

The FPGA sends statistics to a PC via UART
– # cache-to-cache transfers per second
– # invalidations per second

Page 25: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results

[Chart: average # cache-to-cache transfers per second vs. cache size in the FPGA (1KB to 256KB) for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf and their average; the average spans roughly 433.3K/sec to 804.2K/sec across cache sizes]

Page 26: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)

[Chart: execution time increase over baseline (sec) vs. cache size in the FPGA (1KB to 256KB), per benchmark and on average; callouts mark increases of 191 and 171 seconds against an average baseline execution time of 5635 seconds (93 min)]

Average execution time increase
– Baseline: benchmark execution on a single P-III without the FPGA
  Data is always supplied from main memory

Page 27: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Run-time Breakdown

Run-time estimation with a 256KB cache in the FPGA:

                      Invalidation traffic   Cache-to-cache transfer
Latencies             5 ~ 10 FSB cycles      10 ~ 20 FSB cycles
Estimated run-times   69 ~ 138 seconds       381 ~ 762 seconds

Note that the execution time increased by 171 seconds on average, out of the baseline's average total execution time of 5635 seconds
Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase

Cache-to-cache transfer on the P-III server system is NOT as efficient as main memory access!

Page 28: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation †
Conclusions

† Taeweon Suh, Hsien-Hsin S. Lee, and John Shen, "Initial Observations of Hardware/Software Co-Simulation using FPGA in Architecture Research," 2nd WARFP, 2006

Page 29: Hybrid System Emulation

HW/SW Co-Simulation: Motivation

Gain the advantages of both software simulation and hardware emulation
– Flexibility
– High speed

Idea
– Offload heavy software routines into the FPGA
– The remaining simulator interacts with the FPGA

Page 30: Hybrid System Emulation

HW/SW Co-Simulation: Communication Method

Communication between the P-III and the FPGA
– Use the FSB as the communication medium
– Allocate one page in memory for communication
– Send data to the FPGA: write-through cache mode ("write" bus transaction)
– Receive data from the FPGA: cache-to-cache transfer ("read" bus transaction)

[Block diagram: Pentium-III (MESI) and FPGA on the FSB with the North Bridge and 2GB SDRAM; stores to the shared page appear as write transactions, loads are answered by the FPGA with cache-to-cache transfers]
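From the software side, the protocol is just loads and stores to the dedicated page. A hedged C sketch; `comm`, the offsets, and the volatile-pointer idiom are illustrative assumptions (the real setup maps the page through a Linux device driver and marks it write-through):

```c
#include <stdint.h>

/* One 4KB page, mapped write-through by the device driver so every
 * store becomes a "write" bus transaction the FPGA can observe. */
static volatile uint32_t *comm;       /* set up by the driver (mmap)   */

/* Send: a plain store; write-through mode pushes it onto the FSB.    */
static inline void fpga_send(unsigned off, uint32_t val)
{
    comm[off] = val;
}

/* Receive: a plain load; the FPGA claims the line during the snoop
 * phase and supplies the value by cache-to-cache transfer.           */
static inline uint32_t fpga_recv(unsigned off)
{
    return comm[off];
}
```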

Page 31: Hybrid System Emulation

HW/SW Co-Simulation: Co-Simulation Results

Preliminary experiment with SimpleScalar for a correctness check
– Implement a simple function (mem_access_latency) in the FPGA

Benchmark   Baseline (h:m:s)   Co-simulation (h:m:s)   Difference (h:m:s)
mcf         2:18:38            2:20:50                 + 0:02:12
bzip2       3:03:58            3:06:50                 + 0:02:52
crafty      2:56:38            2:59:28                 + 0:02:50
eon-cook    2:43:52            2:45:45                 + 0:01:53
gcc-166     3:45:30            3:48:56                 + 0:03:26
parser      3:34:57            3:37:27                 + 0:02:30
perl        2:42:30            2:45:50                 + 0:03:20
twolf       2:43:30            2:45:28                 + 0:01:58

Page 32: Hybrid System Emulation

HW/SW Co-Simulation: Analysis & Learnings

Reasons for the slowdown
– FSB access is expensive
– The offloaded function (mem_access_latency) is too simple
– Device driver overhead

Success criteria
– Time-consuming software routines
– Reasonable FPGA access frequency

Page 33: Hybrid System Emulation

HW/SW Co-Simulation: Research Opportunity

Multi-core research
– Implement the distributed lowest-level caches and an interconnection network, such as a ring or mesh, in the FPGA

[Diagram: eight CPUs (CPU0 to CPU7), each with L1 and L2, attached through ring interfaces to distributed L3 slices implemented in the FPGA]

Page 34: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 35: Hybrid System Emulation

Conclusions

Hybrid system emulation
– Deploy an FPGA at a place of interest in a system
– System-level active emulation
– Takes advantage of an existing system

Presented 3 use cases in computer architecture research
– L3 cache emulation
– Evaluation of coherence traffic efficiency
– HW/SW co-simulation

FPGA-based emulation provides an alternative to software simulation

Page 36: Hybrid System Emulation

Questions, Comments?

Thanks for your attention!

Page 37: Hybrid System Emulation


Backup Slides

Page 38: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Cache Coherence Protocol

Example: MESI protocol
– Snoop-based protocol
– Intel implements MESI (Modified, Exclusive, Shared, Invalid)

[Diagram: P0, P1 and main memory behind the North Bridge, stepping through the same example on cache line 1234]
1. P0: read. P0 caches the line in E state
2. P1: read. A "shared" snoop response leaves the line S in both caches
3. P1: write (abcd). P0's copy is invalidated (I); P1's copy becomes M
4. P0: read. P1 supplies abcd via cache-to-cache transfer; both copies end up S

Page 39: Hybrid System Emulation

L3 Cache Emulation: Motivation

Software simulation has limitations
– Simulation time
– Reduced datasets and workloads
  Results could be off by 100% or more

Passive emulation has limitations
– Monitors transactions only
– The impact of the emulated component on the system cannot be modeled

Full-system simulation requires much more effort
– Takes much longer to develop
  Develop a full system
  Adapt the workload to a new system

Page 40: Hybrid System Emulation

L3 Cache Emulation: Motivation (Cont.)

Active Cache Emulation (ACE)
– Takes advantage of an existing system
– Deploys the emulated component at a place of interest

Page 41: Hybrid System Emulation

L3 Cache Emulation: HW Design

Implemented modules in the FPGA
– State machines
  Keep track of up to 8 FSB transactions
– L3 tags
  The emulated L3 size varies from 1MB to 64MB
  The block size varies from 32B to 512B
– Statistics module

[Block diagram: Xilinx Virtex-II FPGA on the FSB, containing 8 FSB-pipeline state machines, the L3 cache tag array, and statistics registers read out to a PC via UART and to a logic analyzer]
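Since both the emulated L3 capacity (1MB to 64MB) and the block size (32B to 512B) are parameters, the tag-array geometry must be derived per configuration. A small C sketch of that arithmetic over the 36-bit physical addresses the P-III FSB carries (A[35:3]); the deck does not state the associativity, so direct-mapped is an illustrative assumption:

```c
#include <stdio.h>
#include <stdint.h>

/* Tag-array geometry for a direct-mapped emulated L3 over a 36-bit
 * physical address space. */
struct geom { unsigned sets, offset_bits, index_bits, tag_bits; };

static struct geom l3_geom(uint64_t cache_bytes, unsigned block_bytes)
{
    struct geom g = {0, 0, 0, 0};
    g.sets = (unsigned)(cache_bytes / block_bytes);
    while ((1u << g.offset_bits) < block_bytes) g.offset_bits++;
    while ((1u << g.index_bits)  < g.sets)      g.index_bits++;
    g.tag_bits = 36 - g.offset_bits - g.index_bits;
    return g;
}

int main(void)
{
    /* The two extremes from the slide: 1MB/32B and 64MB/512B. */
    struct geom a = l3_geom(1ull << 20, 32);   /* 32768 sets, 16 tag bits  */
    struct geom b = l3_geom(1ull << 26, 512);  /* 131072 sets, 10 tag bits */
    printf("%u sets, %u-bit tags\n", a.sets, a.tag_bits);
    printf("%u sets, %u-bit tags\n", b.sets, b.tag_bits);
    return 0;
}
```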

Page 42: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: HW Design

Implemented modules in the FPGA
– State machines
  Keep track of FSB transactions
  – Taking evicted data from the FSB
  – Initiating cache-to-cache transfers
– Direct-mapped caches
  The cache size in the FPGA varies from 1KB to 256KB
  Note that the Pentium-III has a 256KB 4-way set-associative L2
– Statistics module

[Block diagram: Xilinx Virtex-II FPGA on the FSB, containing 8 state machines (write-back, cache-to-cache, and the rest), a direct-mapped cache (tag + data), and statistics registers read out to a PC via UART and to a logic analyzer]

Page 43: Hybrid System Emulation

HW/SW Co-Simulation: Implementation

Hardware (FPGA) implementation
– State machines
  Monitor bus transactions on the FSB
  Check bus transaction types (read or write)
  Manage cache-to-cache transfers
– Software functions moved into the FPGA
– Statistics counters

Software implementation
– Linux device driver
  A specific physical address is needed for communication
  Allocate one page of memory for FPGA access via the Linux device driver
– Simulator modification for accessing the FPGA

Page 44: Hybrid System Emulation

L3 Cache Emulation: Experiment Results (Cont.)

Comparison with SimpleScalar simulation

[Chart omitted]

Page 45: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Motivation

Why is it important?
– Understand the impact of coherence traffic on system performance
– Reflect it in the communication architecture

Problems with traditional methods
– They evaluate the protocols themselves
– Software simulations
– Experiments on SMP machines: ambiguous

Solution
– A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency

Page 46: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)

[Chart: average increase of invalidation traffic per second vs. cache size in the FPGA (1KB to 256KB) for the eight benchmarks and their average; the average spans roughly 157.5K/sec to 306.8K/sec]

Page 47: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)

[Chart: average hit rate (%) in the FPGA's cache vs. cache size (1KB to 256KB); the average ranges from 16.9% to 64.89%]

Hit rate = (# cache-to-cache transfers) / (# data reads of a full cache line)

Page 48: Hybrid System Emulation

Motivation

Traditionally, evaluations of coherence protocols focused on reducing the bus traffic incurred by the state transitions of the protocols
– Trace-based simulations were mostly used for these protocol evaluations

Software simulations are too slow to perform broad-range analysis of system behavior
– In addition, it is very difficult to model the real world exactly, such as I/O

The system-wide performance impact of coherence traffic has not been explicitly investigated using real systems

This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA

Page 49: Hybrid System Emulation

Motivation and Contribution

Motivation
– The memory wall keeps getting higher
  Important to understand the impact of communication among processors
– Traditionally, evaluation of coherence protocols focused on the protocols themselves
  Software-based simulation
– FPGA technology
  The original Pentium fits into one Xilinx Virtex-4 LX200
– Recent emulation efforts
  MemorIES (ASPLOS 2000)
  RAMP consortium (BEE2 board)

Contribution
– A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique

Page 50: Hybrid System Emulation

Cache Coherence Protocols

A well-known technique for data consistency among multiprocessors with caches

Classification
– Snoop-based protocols
  Rely on broadcasting on a shared bus
  – Based on shared memory
  Symmetric access to main memory
  Limited scalability
  Used to build small-scale multiprocessor systems
  – Very popular in servers and workstations
– Directory-based protocols
  Message-based communication via an interconnection network
  – Based on distributed shared memory (DSM)
  Cache-coherent non-uniform memory access (ccNUMA)
  Scalable
  Used to build large-scale systems
  Actively studied in the 1990s

Page 51: Hybrid System Emulation

Cache Coherence Protocols (Cont.)

Snoop-based protocols
– Invalidation-based protocols
  Invalidate shared copies when writing
  1980s: Write-once, Synapse, Berkeley, and Illinois
  Current protocols adopt different combinations of the states (M, O, E, S, and I)
  – MEI: PowerPC 750, MIPS64 20Kc
  – MSI: Silicon Graphics 4D series
  – MESI: Pentium class, AMD K6, PowerPC 601
  – MOESI: AMD64, UltraSPARC
– Update-based protocols
  Update shared copies when writing
  Dragon and Firefly protocols

Page 52: Hybrid System Emulation

Cache Coherence Protocols (Cont.)

Directory-based protocols
– Memory-based schemes
  Keep the directory at the granularity of a cache line in the home node's memory
  – One dirty bit, and one presence bit per node
  Storage overhead due to the directory
  Examples: Stanford DASH, Stanford FLASH, MIT Alewife, and SGI Origin
– Cache-based schemes
  Keep only a head pointer for each cache line in the home node's directory
  – Forward and backward pointers are kept in each node's caches
  Long latency due to serialization of messages
  Examples: Sequent NUMA-Q, Convex Exemplar, and Data General

Page 53: Hybrid System Emulation

Emulation Initiatives for Protocol Evaluation

RPM (mid-to-late '90s)
– Rapid Prototyping engine for Multiprocessors, from the University of Southern California
– ccNUMA full-system emulation
  A SPARC IU/FPU core is used as the CPU in each node; the rest (L1, L2, etc.) is implemented with 8 FPGAs
  Nodes are connected through Futurebus+

Page 54: Hybrid System Emulation

FPGA Initiatives for Evaluation

Other cache emulators
– RACFCS (1997)
  Reconfigurable Address Collector and Flying Cache Simulator, from Yonsei Univ. in Korea
  Plugged into the Intel486 bus
  – Passively collects addresses
– HACS (2002)
  Hardware Accelerated Cache Simulator, from Brigham Young Univ.
  Plugged into the FSB of a Pentium-Pro-based system
– ACE (2006)
  Active Cache Emulator, from Intel Corp.
  Plugged into the FSB of a Pentium-III-based system

Page 55: Hybrid System Emulation

Background (Cont.)

Example

[Image omitted]

Page 56: Hybrid System Emulation

Hybrid System Emulation: Experiment Setup (Cont.)

[Photos: the Intel server system with a Pentium-III and the FPGA board, connected to a logic analyzer and to the host PC via UART]

Page 57: Hybrid System Emulation

Experimental Setup (Cont.)

[Photo: the FPGA board, showing the Xilinx Virtex-II FPGA, the FSB interface, logic analyzer ports, and LEDs]

Page 58: Hybrid System Emulation

FSB Protocol: Snoop Stall

[Waveform: the seven FSB pipeline stages (request1, request2, error1, error2, snoop, response, data) with ADS# and address A[35:3]#; HIT# and HITM# are re-asserted during the snoop phase to signal snoop stalls, delaying the snoop result while a new transaction starts on the pipelined bus]

Page 59: Hybrid System Emulation

FSB Protocol: Cache-to-Cache Transfer

[Waveform: the seven FSB pipeline stages with ADS#, A[35:3]#, HIT#, HITM#, TRDY#, DRDY#, DBSY# and D[63:0]#; HITM# asserted in the snoop phase signals a snoop hit on a modified line, TRDY# indicates the memory controller is ready to accept data, and the owning cache drives data0 through data3 (one 32-byte line) in the data phase while a new transaction starts]

Page 60: Hybrid System Emulation

Evaluation Methodology

Goal
– Measure the intrinsic delay of coherence traffic and evaluate its efficiency

Shortcomings of a multiprocessor environment
– Nearly impossible to isolate the impact of coherence traffic on system performance
– Even worse, there are non-deterministic factors
  Arbitration delay
  Stalls in the pipelined bus

[Diagram: four processors (each MESI) and a memory controller with main memory on a shared bus, exchanging cache-to-cache transfers]

Page 61: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Run-time Breakdown

Run-time estimation with a 256KB cache in the FPGA:

                      Invalidation traffic   Cache-to-cache transfer
Latencies             5 ~ 10 FSB cycles      10 ~ 20 FSB cycles
Estimated run-times   69 ~ 138 seconds       381 ~ 762 seconds

Estimated time = (avg. occurrences / sec) x (avg. total execution time) x (clock period / cycle) x (latency of each traffic type)

Note that the execution time increased by 171 seconds on average, out of the baseline's average total execution time of 5635 seconds
Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase!

Coherence traffic on the P-III server system is NOT as efficient as main memory access
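The estimate is a straight product of the four factors above. A hedged C rendering, useful as a sanity check; the 133 MHz FSB frequency and the occurrence rate used in `main` are stand-in assumptions (the deck gives only the cycle-count latencies, the 5635-second baseline, and the per-second traffic charts):

```c
#include <stdio.h>

/* Estimated time (sec) = occurrences/sec x total execution time (sec)
 *                      x clock period (sec/cycle) x latency (cycles). */
static double estimated_time(double occ_per_sec, double total_sec,
                             double fsb_hz, double latency_cycles)
{
    return occ_per_sec * total_sec * (1.0 / fsb_hz) * latency_cycles;
}

int main(void)
{
    /* Invalidation traffic at 5 and 10 FSB cycles against the deck's
     * 5635-second average baseline; ~300K/sec is a stand-in rate. */
    double lo = estimated_time(300e3, 5635.0, 133e6, 5.0);
    double hi = estimated_time(300e3, 5635.0, 133e6, 10.0);
    printf("%.0f ~ %.0f seconds\n", lo, hi);  /* ballpark of 69 ~ 138 */
    return 0;
}
```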

Page 62: Hybrid System Emulation

Conclusion

Proposed a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
– Coherence traffic on the P-III-based Intel server system is not as efficient as expected
  The main reason is that, in MESI, main memory must be updated at the same time as a cache-to-cache transfer

Opportunities for performance enhancement
– For faster cache-to-cache transfer
  Cache-line buffers in the memory controller
  – As long as buffer space is available, the memory controller can take the data
  MOESI would help shorten the latency
  – Main memory need not be updated on a cache-to-cache transfer
– For faster invalidation traffic
  Advancing the snoop phase to an earlier stage

Page 63: Hybrid System Emulation

HW/SW Co-Simulation: Motivation

Software simulation
– Pros
  Flexible, observable, easy to implement
– Cons
  Intolerable simulation time

Hardware emulation
– Pros
  Significant speedup
  Concurrent execution
– Cons
  Much less flexible and observable
  Low-level design takes longer to implement and validate

Page 64: Hybrid System Emulation

Communication Details

All FSB signals are mapped to FPGA pins

Encoding software function arguments in the FSB address (SimpleScalar example)
– For the 4KB page:
  Set its attribute to write-through mode
  The lower 12 bits of the FSB address bus are free to use
  The upper 24 bits are consumed by TLB translation

[Diagram: Pentium-III (MESI) and Xilinx Virtex-II on the FSB]
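Because the page is 4KB, the low 12 address bits pass through translation untouched, so the simulator can encode a small opcode and operand in them and let the FPGA decode the physical address it snoops. A hedged C sketch of one possible encoding; the field widths and names are illustrative assumptions:

```c
#include <stdint.h>

/* Illustrative split of the 12 untranslated bits: a 4-bit function
 * opcode and an 8-bit operand index. The FPGA recovers both fields
 * from the physical address it observes on the FSB. */
#define OP_SHIFT   8
#define OP_MASK    0xFu
#define ARG_MASK   0xFFu

static volatile uint8_t *page;        /* 4KB write-through page (driver) */

/* Encode (op, arg) into the page offset; the store itself is the
 * message, and the value written can carry a data byte. */
static inline void fpga_call(unsigned op, unsigned arg, uint8_t data)
{
    page[((op & OP_MASK) << OP_SHIFT) | (arg & ARG_MASK)] = data;
}

/* FPGA side (conceptually): decode the snooped FSB address. */
static inline void decode(uint32_t fsb_addr, unsigned *op, unsigned *arg)
{
    *op  = (fsb_addr >> OP_SHIFT) & OP_MASK;
    *arg =  fsb_addr & ARG_MASK;
}
```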

Page 65: Hybrid System Emulation

HW/SW Co-Simulation: Co-Simulation Results Analysis

FSB access is expensive
– ~20 FSB cycles (≈ 160 CPU cycles) per transfer
  One cache line (32 bytes) has to be transferred for each cache-to-cache transfer
  The P-III's MESI requires main memory to be updated on every cache-to-cache transfer

The "mem_access_latency" function is too simple
– Even in software simulation it takes at most a few dozen CPU cycles

Device driver overhead
– System overhead due to the device driver
– It occupies one TLB entry that would otherwise be used by the simulation

Time-consuming software routines and a reasonable FPGA access frequency are needed to benefit from a hardware implementation

Page 66: Hybrid System Emulation

Conclusions

Proposed a new co-simulation methodology

Preliminary co-simulation using SimpleScalar proves the correctness of the methodology
– Hardware/software implementation
– Communication between the P-III and the FPGA via the FSB
– Linux driver

Co-simulation results indicate
– FSB (bus) access is expensive
– Linux driver overhead also needs to be overcome
– Time-consuming blocks need to be emulated

Multi-core co-simulation would benefit from FPGAs
– Implement distributed low-level caches and an interconnection network, which would be complex enough to benefit from hardware modeling