
Page 1: Hybrid System Emulation

Hybrid System Emulation

Taeweon Suh

Computer Science Education
Korea University

January 2010

Page 2: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 3: Hybrid System Emulation

Scope

[Block diagram: CPU connected to the North Bridge over the FSB (Front-Side Bus); the North Bridge connects to Main Memory (DDR2) and, over DMI (Direct Media Interface), to the South Bridge]

A typical computer system (up to Core 2)

Page 4: Hybrid System Emulation

Scope (Cont.)

[Block diagram: CPU connected to the North Bridge over QuickPath (Intel) or HyperTransport (AMD); the North Bridge connects to Main Memory (DDR3) and, over DMI (Direct Media Interface), to the South Bridge]

A Nehalem-based computer system

Page 5: Hybrid System Emulation

Scope (Cont.)

[Block diagram: a multi-core CPU (each core with its own L1 and L2) connected to the North Bridge over the FSB; the North Bridge connects to Main Memory (DDR2) and, over DMI, to the South Bridge. The CPU-FSB region is marked as the scope of this talk]

Page 6: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 7: Hybrid System Emulation

Background

Computer architecture research has mostly been done with software simulation
– Pros
  Relatively easy to implement
  Flexibility
  Observability
  Debuggability
– Cons
  Simulation time
  Difficulty modeling the real world, such as I/O

Page 8: Hybrid System Emulation

Background (Cont.)

What is the alternative?
– FPGA (Field-Programmable Gate Array)
  Reconfigurability
  – Programmable hardware
  – Short turn-around time
  High operating frequency
  Observability and debuggability
  Many IPs provided
  – CPUs, memory controllers, etc.

Page 9: Hybrid System Emulation

Background (Cont.)

FPGA capability example
– Reconfigurable Pentium

[Photos: a real Pentium next to a reconfigurable Pentium implemented on an FPGA]

Page 10: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 11: Hybrid System Emulation

Related Work

MemorIES (2000)
– Memory Instrumentation and Emulation System, from IBM T.J. Watson
– L3 cache and/or coherence protocol emulation
  Plugged into the 6xx bus of an RS/6000 SMP machine
– Passive emulator

Page 12: Hybrid System Emulation

Related Work (Cont.)

RAMP
– Research Accelerator for Multiple Processors
– Parallel computer architecture
  Multi-core HW/SW research
– Full emulator
– Multi-disciplinary project by UC Berkeley, Stanford, CMU, UT-Austin, MIT and Intel

[Photo: BEE2 board with FPGAs]

Page 13: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 14: Hybrid System Emulation

Hybrid System Emulation

Combination of an FPGA and a real system
– The FPGA is deployed in a system of interest
– The FPGA interacts with the system
  Monitors transactions from the system
  Provides feedback to the system
– System-level active emulation
– Runs workloads on a real system
– Research, measure, and evaluate the emulated components in a full-system configuration

The FPGA is deployed on the FSB in this research

Page 15: Hybrid System Emulation

Hybrid System Emulation: Experiment Setup

[Photos: Intel server system, Pentium-III, and FPGA board]

[Block diagram: Pentium-III and FPGA sharing the front-side bus (FSB) with the North Bridge and 2GB SDRAM]

Use an Intel server system equipped with two Pentium-IIIs
Replace one Pentium-III with an FPGA
– The FPGA actively participates in transactions on the FSB

Page 16: Hybrid System Emulation

Hybrid System Emulation: Front-Side Bus (FSB)

FSB protocol
– 7-stage pipelined bus (Pentium-III)
  Request1, request2, error1, error2, snoop, response, data

How does the FPGA participate in FSB transactions?
– Snoop stall
  Part of the cache coherence mechanism
  Delays the snoop response
– Cache-to-cache transfer
  Part of the cache coherence mechanism
  Provides data from a processor's cache to the requester via the FSB

Page 17: Hybrid System Emulation

Hybrid System Emulation: Cache Coherence Protocol

Example: MESI protocol
– Snoop-based protocol
– Intel implements MESI (Modified, Exclusive, Shared, Invalid)

[Diagram: two Pentium-IIIs (P0, P1) and main memory behind the North Bridge, stepping through an example on cache line 1234]
1. P0: read. P0 caches the line in E state
2. P1: read. A "shared" snoop response leaves the line S in both caches
3. P1: write (abcd). P0's copy is invalidated (I); P1's copy becomes M
4. P0: read. P1 asserts a "snoop stall", then supplies abcd via cache-to-cache transfer; both copies end up S
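The four-step walk-through above condenses to a few lines of code. Below is a minimal, illustrative C sketch of the snoop-side MESI transitions; the event model and the `snoop` helper are assumptions made for illustration, not the deck's (or Intel's) actual implementation.

```c
#include <stdio.h>

/* MESI states, as implemented by Intel processors (per the slide). */
typedef enum { M, E, S, I } mesi_t;

/* How one cache's copy reacts to a transaction snooped on the bus.
 * bus_write != 0 means another processor is writing the line. */
static mesi_t snoop(mesi_t cur, int bus_write, int *cache_to_cache)
{
    *cache_to_cache = 0;
    if (cur == I) return I;
    if (bus_write) return I;            /* invalidation traffic          */
    if (cur == M) *cache_to_cache = 1;  /* owner supplies the data       */
    return S;                           /* after a snoop stall, line is S */
}

int main(void)
{
    int c2c;
    mesi_t p0 = E, p1 = I;              /* step 1: P0 read, line is E    */
    p0 = snoop(p0, 0, &c2c); p1 = S;    /* step 2: P1 read, both S       */
    p0 = snoop(p0, 1, &c2c); p1 = M;    /* step 3: P1 write (abcd)       */
    p1 = snoop(p1, 0, &c2c); p0 = S;    /* step 4: P0 read, P1 supplies
                                           abcd via cache-to-cache       */
    printf("P0=%d P1=%d c2c=%d\n", p0, p1, c2c);  /* both S, c2c=1 */
    return 0;
}
```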

Page 18: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation †
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

† Erico Nurvitadhi, Jumnit Hong and Shih-Lien Lu, "Active Cache Emulator," IEEE Transactions on VLSI Systems, 2008

Page 19: Hybrid System Emulation

L3 Cache Emulation: Methodology

– Implement the L3 tags in the FPGA
– On a miss, inject snoop stalls and store the tag information
  "New" memory access latency (= L3 miss latency) = snoop stalls + memory access latency
– On a hit, inject no snoop stall
  L3 latency (= L3 hit latency) = memory access latency

[Block diagram: Pentium-III (L1, L2) and FPGA (L3 tag) on the FSB with the North Bridge and 2GB SDRAM; on a miss the FPGA stretches the snoop phase, on a hit the data arrives at plain memory latency]
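Per FSB memory transaction, the tag-only emulation reduces to one decision: stall or not. Here is a hedged C sketch of that logic, assuming a direct-mapped organization and fixed sizes for illustration (the actual design parameterizes both, and its state machines track up to 8 in-flight FSB transactions):

```c
#include <stdint.h>
#include <stdbool.h>

#define L3_SIZE (8ull * 1024 * 1024)  /* assumed 8MB emulated L3        */
#define BLOCK   64u                   /* assumed 64B block size         */
#define NSETS   ((uint32_t)(L3_SIZE / BLOCK))

static uint32_t tags[NSETS];
static bool     valid[NSETS];

/* Returns the number of snoop-stall cycles to inject for this access.
 * Hit: zero stalls, so the access completes at plain memory latency
 * (which plays the role of the L3 hit latency).
 * Miss: the tag is allocated and the snoop phase is stretched, so
 * snoop stalls + memory latency emulates the L3 miss latency. */
unsigned l3_emulate(uint64_t paddr, unsigned miss_stalls)
{
    uint64_t block = paddr / BLOCK;
    uint32_t set   = (uint32_t)(block % NSETS);
    uint32_t tag   = (uint32_t)(block / NSETS);

    if (valid[set] && tags[set] == tag)
        return 0;                     /* L3 hit: no snoop stall         */

    valid[set] = true;                /* L3 miss: fill the tag ...      */
    tags[set]  = tag;
    return miss_stalls;               /* ... and delay the snoop phase  */
}
```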

Page 20: Hybrid System Emulation

L3 Cache Emulation: Experiment Environment

Operating system
– Windows XP

Validation of the emulated L3 cache
– RightMark Memory Analyzer †

† RightMark Memory Analyzer, http://cpu.rightmark.org/products/rmma.shtml

Page 21: Hybrid System Emulation

L3 Cache Emulation: Experiment Result

[Plot: RightMark Memory Analyzer result; access latency (CPU cycles / nsec) versus working set size, with plateaus for the L1 cache, L2 cache, emulated L3 cache, and main memory]

Page 22: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency †
– HW/SW Co-Simulation
Conclusions

† Taeweon Suh, Shih-Lien Lu and Hsien-Hsin S. Lee, "An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems," 17th FPL, 2007

Page 23: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Methodology

– Implement an L2 cache in the FPGA
– Save evicted cache lines into that cache
– Supply the data via cache-to-cache transfer when the P-III next requests it
– Measure benchmark execution times and compare with the baseline

[Block diagram: Pentium-III (MESI) and FPGA (D$) on the FSB with the North Bridge and 2GB SDRAM; the FPGA answers re-reads of evicted lines with cache-to-cache transfers]
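Per FSB transaction, the FPGA's role in this experiment boils down to two cases: capture write-backs, and answer full-line reads that hit its cache with a cache-to-cache transfer. A minimal C sketch, assuming a direct-mapped cache with 32-byte lines (matching the P-III line size); the names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE   32u                     /* Pentium-III cache line size    */
#define NLINES (256u * 1024 / LINE)    /* assumed 256KB, direct-mapped   */

struct line { bool valid; uint32_t tag; uint8_t data[LINE]; };
static struct line cache[NLINES];

/* Write-back observed on the FSB: capture the evicted line. */
void on_writeback(uint64_t paddr, const uint8_t *data)
{
    uint32_t idx = (uint32_t)((paddr / LINE) % NLINES);
    cache[idx].valid = true;
    cache[idx].tag   = (uint32_t)((paddr / LINE) / NLINES);
    memcpy(cache[idx].data, data, LINE);
}

/* Full-line read observed on the FSB: if we hold the line, claim it
 * in the snoop phase and drive the data phase (cache-to-cache
 * transfer); otherwise stay silent and let main memory respond. */
bool on_read(uint64_t paddr, uint8_t *out)
{
    uint32_t idx = (uint32_t)((paddr / LINE) % NLINES);
    uint32_t tag = (uint32_t)((paddr / LINE) / NLINES);
    if (!cache[idx].valid || cache[idx].tag != tag)
        return false;                  /* miss: memory supplies the data */
    memcpy(out, cache[idx].data, LINE);
    return true;                       /* hit: cache-to-cache transfer   */
}
```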

Page 24: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Environment

Operating system
– Red Hat Linux 2.4.20-8

Natively run SPEC2000 benchmarks
– The choice of benchmarks does not affect the evaluation, as long as a reasonable amount of bus traffic is generated

The FPGA sends statistics to a PC via UART
– # cache-to-cache transfers per second
– # invalidations per second

Page 25: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results

[Chart: average # cache-to-cache transfers per second vs. cache size in the FPGA (1KB to 256KB) for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf and their average; the average spans roughly 433.3K/sec to 804.2K/sec across cache sizes]

Page 26: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)

[Chart: execution time increase over baseline (sec) vs. cache size in the FPGA (1KB to 256KB), per benchmark and on average; callouts mark increases of 191 and 171 seconds against an average baseline execution time of 5635 seconds (93 min)]

Average execution time increase
– Baseline: benchmark execution on a single P-III without the FPGA
  Data is always supplied from main memory

Page 27: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Run-time Breakdown

Run-time estimation with a 256KB cache in the FPGA:

                      Invalidation traffic   Cache-to-cache transfer
Latencies             5 ~ 10 FSB cycles      10 ~ 20 FSB cycles
Estimated run-times   69 ~ 138 seconds       381 ~ 762 seconds

Note that the execution time increased by 171 seconds on average, out of the baseline's average total execution time of 5635 seconds
Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase

Cache-to-cache transfer on the P-III server system is NOT as efficient as main memory access!

Page 28: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation †
Conclusions

† Taeweon Suh, Hsien-Hsin S. Lee, and John Shen, "Initial Observations of Hardware/Software Co-Simulation using FPGA in Architecture Research," 2nd WARFP, 2006

Page 29: Hybrid System Emulation

HW/SW Co-Simulation: Motivation

Gain the advantages of both software simulation and hardware emulation
– Flexibility
– High speed

Idea
– Offload heavy software routines into the FPGA
– The remaining simulator interacts with the FPGA

Page 30: Hybrid System Emulation

HW/SW Co-Simulation: Communication Method

Communication between the P-III and the FPGA
– Use the FSB as the communication medium
– Allocate one page in memory for communication
– Send data to the FPGA: write-through cache mode ("write" bus transaction)
– Receive data from the FPGA: cache-to-cache transfer ("read" bus transaction)

[Block diagram: Pentium-III (MESI) and FPGA on the FSB with the North Bridge and 2GB SDRAM; stores to the shared page appear as write transactions, loads are answered by the FPGA with cache-to-cache transfers]
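From the software side, the protocol is just loads and stores to the dedicated page. A hedged C sketch; `comm`, the offsets, and the volatile-pointer idiom are illustrative assumptions (the real setup maps the page through a Linux device driver and marks it write-through):

```c
#include <stdint.h>

/* One 4KB page, mapped write-through by the device driver so every
 * store becomes a "write" bus transaction the FPGA can observe. */
static volatile uint32_t *comm;       /* set up by the driver (mmap)   */

/* Send: a plain store; write-through mode pushes it onto the FSB.    */
static inline void fpga_send(unsigned off, uint32_t val)
{
    comm[off] = val;
}

/* Receive: a plain load; the FPGA claims the line during the snoop
 * phase and supplies the value by cache-to-cache transfer.           */
static inline uint32_t fpga_recv(unsigned off)
{
    return comm[off];
}
```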

Page 31: Hybrid System Emulation

HW/SW Co-Simulation: Co-Simulation Results

Preliminary experiment with SimpleScalar for a correctness check
– Implement a simple function (mem_access_latency) in the FPGA

Benchmark   Baseline (h:m:s)   Co-simulation (h:m:s)   Difference (h:m:s)
mcf         2:18:38            2:20:50                 + 0:02:12
bzip2       3:03:58            3:06:50                 + 0:02:52
crafty      2:56:38            2:59:28                 + 0:02:50
eon-cook    2:43:52            2:45:45                 + 0:01:53
gcc-166     3:45:30            3:48:56                 + 0:03:26
parser      3:34:57            3:37:27                 + 0:02:30
perl        2:42:30            2:45:50                 + 0:03:20
twolf       2:43:30            2:45:28                 + 0:01:58

Page 32: Hybrid System Emulation

HW/SW Co-Simulation: Analysis & Learnings

Reasons for the slowdown
– FSB access is expensive
– The offloaded function (mem_access_latency) is too simple
– Device driver overhead

Success criteria
– Time-consuming software routines
– Reasonable FPGA access frequency

Page 33: Hybrid System Emulation

HW/SW Co-Simulation: Research Opportunity

Multi-core research
– Implement the distributed lowest-level caches and an interconnection network, such as a ring or mesh, in the FPGA

[Diagram: eight CPUs (CPU0 to CPU7), each with L1 and L2, attached through ring interfaces to distributed L3 slices implemented in the FPGA]

Page 34: Hybrid System Emulation

Agenda

Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions

Page 35: Hybrid System Emulation

Conclusions

Hybrid system emulation
– Deploy an FPGA at a place of interest in a system
– System-level active emulation
– Takes advantage of an existing system

Presented 3 use cases in computer architecture research
– L3 cache emulation
– Evaluation of coherence traffic efficiency
– HW/SW co-simulation

FPGA-based emulation provides an alternative to software simulation

Page 36: Hybrid System Emulation

Questions, Comments?

Thanks for your attention!

Page 37: Hybrid System Emulation


Backup Slides

Page 38: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Cache Coherence Protocol

Example: MESI protocol
– Snoop-based protocol
– Intel implements MESI (Modified, Exclusive, Shared, Invalid)

[Diagram: P0, P1 and main memory behind the North Bridge, stepping through the same example on cache line 1234]
1. P0: read. P0 caches the line in E state
2. P1: read. A "shared" snoop response leaves the line S in both caches
3. P1: write (abcd). P0's copy is invalidated (I); P1's copy becomes M
4. P0: read. P1 supplies abcd via cache-to-cache transfer; both copies end up S

Page 39: Hybrid System Emulation

L3 Cache Emulation: Motivation

Software simulation has limitations
– Simulation time
– Reduced datasets and workloads
  Results could be off by 100% or more

Passive emulation has limitations
– Monitors transactions only
– The impact of the emulated component on the system cannot be modeled

Full-system simulation requires much more effort
– Takes much longer to develop
  Develop a full system
  Adapt the workload to a new system

Page 40: Hybrid System Emulation

L3 Cache Emulation: Motivation (Cont.)

Active Cache Emulation (ACE)
– Takes advantage of an existing system
– Deploys the emulated component at a place of interest

Page 41: Hybrid System Emulation

L3 Cache Emulation: HW Design

Implemented modules in the FPGA
– State machines
  Keep track of up to 8 FSB transactions
– L3 tags
  The emulated L3 size varies from 1MB to 64MB
  The block size varies from 32B to 512B
– Statistics module

[Block diagram: Xilinx Virtex-II FPGA on the FSB, containing 8 FSB-pipeline state machines, the L3 cache tag array, and statistics registers read out to a PC via UART and to a logic analyzer]
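Since both the emulated L3 capacity (1MB to 64MB) and the block size (32B to 512B) are parameters, the tag-array geometry must be derived per configuration. A small C sketch of that arithmetic over the 36-bit physical addresses the P-III FSB carries (A[35:3]); the deck does not state the associativity, so direct-mapped is an illustrative assumption:

```c
#include <stdio.h>
#include <stdint.h>

/* Tag-array geometry for a direct-mapped emulated L3 over a 36-bit
 * physical address space. */
struct geom { unsigned sets, offset_bits, index_bits, tag_bits; };

static struct geom l3_geom(uint64_t cache_bytes, unsigned block_bytes)
{
    struct geom g = {0, 0, 0, 0};
    g.sets = (unsigned)(cache_bytes / block_bytes);
    while ((1u << g.offset_bits) < block_bytes) g.offset_bits++;
    while ((1u << g.index_bits)  < g.sets)      g.index_bits++;
    g.tag_bits = 36 - g.offset_bits - g.index_bits;
    return g;
}

int main(void)
{
    /* The two extremes from the slide: 1MB/32B and 64MB/512B. */
    struct geom a = l3_geom(1ull << 20, 32);   /* 32768 sets, 16 tag bits  */
    struct geom b = l3_geom(1ull << 26, 512);  /* 131072 sets, 10 tag bits */
    printf("%u sets, %u-bit tags\n", a.sets, a.tag_bits);
    printf("%u sets, %u-bit tags\n", b.sets, b.tag_bits);
    return 0;
}
```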

Page 42: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: HW Design

Implemented modules in the FPGA
– State machines
  Keep track of FSB transactions
  – Taking evicted data from the FSB
  – Initiating cache-to-cache transfers
– Direct-mapped caches
  The cache size in the FPGA varies from 1KB to 256KB
  Note that the Pentium-III has a 256KB 4-way set-associative L2
– Statistics module

[Block diagram: Xilinx Virtex-II FPGA on the FSB, containing 8 state machines (write-back, cache-to-cache, and the rest), a direct-mapped cache (tag + data), and statistics registers read out to a PC via UART and to a logic analyzer]

Page 43: Hybrid System Emulation

HW/SW Co-Simulation: Implementation

Hardware (FPGA) implementation
– State machines
  Monitor bus transactions on the FSB
  Check bus transaction types (read or write)
  Manage cache-to-cache transfers
– Software functions moved into the FPGA
– Statistics counters

Software implementation
– Linux device driver
  A specific physical address is needed for communication
  Allocate one page of memory for FPGA access via the Linux device driver
– Simulator modification for accessing the FPGA

Page 44: Hybrid System Emulation

L3 Cache Emulation: Experiment Results (Cont.)

Comparison with SimpleScalar simulation

[Chart omitted]

Page 45: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Motivation

Why is it important?
– Understand the impact of coherence traffic on system performance
– Reflect it in the communication architecture

Problems with traditional methods
– They evaluate the protocols themselves
– Software simulations
– Experiments on SMP machines: ambiguous

Solution
– A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency

Page 46: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)

[Chart: average increase of invalidation traffic per second vs. cache size in the FPGA (1KB to 256KB) for the eight benchmarks and their average; the average spans roughly 157.5K/sec to 306.8K/sec]

Page 47: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)

[Chart: average hit rate (%) in the FPGA's cache vs. cache size (1KB to 256KB); the average ranges from 16.9% to 64.89%]

Hit rate = (# cache-to-cache transfers) / (# data reads of a full cache line)

Page 48: Hybrid System Emulation

Motivation

Traditionally, evaluations of coherence protocols focused on reducing the bus traffic incurred by the state transitions of the protocols
– Trace-based simulations were mostly used for these protocol evaluations

Software simulations are too slow to perform broad-range analysis of system behavior
– In addition, it is very difficult to model the real world exactly, such as I/O

The system-wide performance impact of coherence traffic has not been explicitly investigated using real systems

This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA

Page 49: Hybrid System Emulation

Motivation and Contribution

Motivation
– The memory wall keeps getting higher
  Important to understand the impact of communication among processors
– Traditionally, evaluation of coherence protocols focused on the protocols themselves
  Software-based simulation
– FPGA technology
  The original Pentium fits into one Xilinx Virtex-4 LX200
– Recent emulation efforts
  MemorIES (ASPLOS 2000)
  RAMP consortium (BEE2 board)

Contribution
– A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique

Page 50: Hybrid System Emulation

Cache Coherence Protocols

A well-known technique for data consistency among multiprocessors with caches

Classification
– Snoop-based protocols
  Rely on broadcasting on a shared bus
  – Based on shared memory
  Symmetric access to main memory
  Limited scalability
  Used to build small-scale multiprocessor systems
  – Very popular in servers and workstations
– Directory-based protocols
  Message-based communication via an interconnection network
  – Based on distributed shared memory (DSM)
  Cache-coherent non-uniform memory access (ccNUMA)
  Scalable
  Used to build large-scale systems
  Actively studied in the 1990s

Page 51: Hybrid System Emulation

Cache Coherence Protocols (Cont.)

Snoop-based protocols
– Invalidation-based protocols
  Invalidate shared copies when writing
  1980s: Write-once, Synapse, Berkeley, and Illinois
  Current protocols adopt different combinations of the states (M, O, E, S, and I)
  – MEI: PowerPC 750, MIPS64 20Kc
  – MSI: Silicon Graphics 4D series
  – MESI: Pentium class, AMD K6, PowerPC 601
  – MOESI: AMD64, UltraSPARC
– Update-based protocols
  Update shared copies when writing
  Dragon and Firefly protocols

Page 52: Hybrid System Emulation

Cache Coherence Protocols (Cont.)

Directory-based protocols
– Memory-based schemes
  Keep the directory at the granularity of a cache line in the home node's memory
  – One dirty bit, and one presence bit per node
  Storage overhead due to the directory
  Examples: Stanford DASH, Stanford FLASH, MIT Alewife, and SGI Origin
– Cache-based schemes
  Keep only a head pointer for each cache line in the home node's directory
  – Forward and backward pointers are kept in each node's caches
  Long latency due to serialization of messages
  Examples: Sequent NUMA-Q, Convex Exemplar, and Data General

Page 53: Hybrid System Emulation

Emulation Initiatives for Protocol Evaluation

RPM (mid-to-late '90s)
– Rapid Prototyping engine for Multiprocessors, from the University of Southern California
– ccNUMA full-system emulation
  A SPARC IU/FPU core is used as the CPU in each node; the rest (L1, L2, etc.) is implemented with 8 FPGAs
  Nodes are connected through Futurebus+

Page 54: Hybrid System Emulation

FPGA Initiatives for Evaluation

Other cache emulators
– RACFCS (1997)
  Reconfigurable Address Collector and Flying Cache Simulator, from Yonsei Univ. in Korea
  Plugged into the Intel486 bus
  – Passively collects addresses
– HACS (2002)
  Hardware Accelerated Cache Simulator, from Brigham Young Univ.
  Plugged into the FSB of a Pentium-Pro-based system
– ACE (2006)
  Active Cache Emulator, from Intel Corp.
  Plugged into the FSB of a Pentium-III-based system

Page 55: Hybrid System Emulation

Background (Cont.)

Example

[Image omitted]

Page 56: Hybrid System Emulation

Hybrid System Emulation: Experiment Setup (Cont.)

[Photos: the Intel server system with a Pentium-III and the FPGA board, connected to a logic analyzer and to the host PC via UART]

Page 57: Hybrid System Emulation

Experimental Setup (Cont.)

[Photo: the FPGA board, showing the Xilinx Virtex-II FPGA, the FSB interface, logic analyzer ports, and LEDs]

Page 58: Hybrid System Emulation

FSB Protocol: Snoop Stall

[Waveform: the seven FSB pipeline stages (request1, request2, error1, error2, snoop, response, data) with ADS# and address A[35:3]#; HIT# and HITM# are re-asserted during the snoop phase to signal snoop stalls, delaying the snoop result while a new transaction starts on the pipelined bus]

Page 59: Hybrid System Emulation

FSB Protocol: Cache-to-Cache Transfer

[Waveform: the seven FSB pipeline stages with ADS#, A[35:3]#, HIT#, HITM#, TRDY#, DRDY#, DBSY# and D[63:0]#; HITM# asserted in the snoop phase signals a snoop hit on a modified line, TRDY# indicates the memory controller is ready to accept data, and the owning cache drives data0 through data3 (one 32-byte line) in the data phase while a new transaction starts]

Page 60: Hybrid System Emulation

Evaluation Methodology

Goal
– Measure the intrinsic delay of coherence traffic and evaluate its efficiency

Shortcomings of a multiprocessor environment
– Nearly impossible to isolate the impact of coherence traffic on system performance
– Even worse, there are non-deterministic factors
  Arbitration delay
  Stalls in the pipelined bus

[Diagram: four processors (each MESI) and a memory controller with main memory on a shared bus, exchanging cache-to-cache transfers]

Page 61: Hybrid System Emulation

Evaluation of Coherence Traffic Efficiency: Run-time Breakdown

Run-time estimation with a 256KB cache in the FPGA:

                      Invalidation traffic   Cache-to-cache transfer
Latencies             5 ~ 10 FSB cycles      10 ~ 20 FSB cycles
Estimated run-times   69 ~ 138 seconds       381 ~ 762 seconds

Estimated time = (avg. occurrences / sec) x (avg. total execution time) x (clock period / cycle) x (latency of each traffic type)

Note that the execution time increased by 171 seconds on average, out of the baseline's average total execution time of 5635 seconds
Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase!

Coherence traffic on the P-III server system is NOT as efficient as main memory access
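The estimate is a straight product of the four factors above. A hedged C rendering, useful as a sanity check; the 133 MHz FSB frequency and the occurrence rate used in `main` are stand-in assumptions (the deck gives only the cycle-count latencies, the 5635-second baseline, and the per-second traffic charts):

```c
#include <stdio.h>

/* Estimated time (sec) = occurrences/sec x total execution time (sec)
 *                      x clock period (sec/cycle) x latency (cycles). */
static double estimated_time(double occ_per_sec, double total_sec,
                             double fsb_hz, double latency_cycles)
{
    return occ_per_sec * total_sec * (1.0 / fsb_hz) * latency_cycles;
}

int main(void)
{
    /* Invalidation traffic at 5 and 10 FSB cycles against the deck's
     * 5635-second average baseline; ~300K/sec is a stand-in rate. */
    double lo = estimated_time(300e3, 5635.0, 133e6, 5.0);
    double hi = estimated_time(300e3, 5635.0, 133e6, 10.0);
    printf("%.0f ~ %.0f seconds\n", lo, hi);  /* ballpark of 69 ~ 138 */
    return 0;
}
```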

Page 62: Hybrid System Emulation

Conclusion

Proposed a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
– Coherence traffic on the P-III-based Intel server system is not as efficient as expected
  The main reason is that, in MESI, main memory must be updated at the same time as a cache-to-cache transfer

Opportunities for performance enhancement
– For faster cache-to-cache transfer
  Cache-line buffers in the memory controller
  – As long as buffer space is available, the memory controller can take the data
  MOESI would help shorten the latency
  – Main memory need not be updated on a cache-to-cache transfer
– For faster invalidation traffic
  Advancing the snoop phase to an earlier stage

Page 63: Hybrid System Emulation

HW/SW Co-Simulation: Motivation

Software simulation
– Pros
  Flexible, observable, easy to implement
– Cons
  Intolerable simulation time

Hardware emulation
– Pros
  Significant speedup
  Concurrent execution
– Cons
  Much less flexible and observable
  Low-level design takes longer to implement and validate

Page 64: Hybrid System Emulation

Communication Details

All FSB signals are mapped to FPGA pins

Encoding software function arguments in the FSB address (SimpleScalar example)
– For the 4KB page:
  Set its attribute to write-through mode
  The lower 12 bits of the FSB address bus are free to use
  The upper 24 bits are consumed by TLB translation

[Diagram: Pentium-III (MESI) and Xilinx Virtex-II on the FSB]
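Because the page is 4KB, the low 12 address bits pass through translation untouched, so the simulator can encode a small opcode and operand in them and let the FPGA decode the physical address it snoops. A hedged C sketch of one possible encoding; the field widths and names are illustrative assumptions:

```c
#include <stdint.h>

/* Illustrative split of the 12 untranslated bits: a 4-bit function
 * opcode and an 8-bit operand index. The FPGA recovers both fields
 * from the physical address it observes on the FSB. */
#define OP_SHIFT   8
#define OP_MASK    0xFu
#define ARG_MASK   0xFFu

static volatile uint8_t *page;        /* 4KB write-through page (driver) */

/* Encode (op, arg) into the page offset; the store itself is the
 * message, and the value written can carry a data byte. */
static inline void fpga_call(unsigned op, unsigned arg, uint8_t data)
{
    page[((op & OP_MASK) << OP_SHIFT) | (arg & ARG_MASK)] = data;
}

/* FPGA side (conceptually): decode the snooped FSB address. */
static inline void decode(uint32_t fsb_addr, unsigned *op, unsigned *arg)
{
    *op  = (fsb_addr >> OP_SHIFT) & OP_MASK;
    *arg =  fsb_addr & ARG_MASK;
}
```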

Page 65: Hybrid System Emulation

HW/SW Co-Simulation: Co-Simulation Results Analysis

FSB access is expensive
– ~20 FSB cycles (≈ 160 CPU cycles) per transfer
  One cache line (32 bytes) has to be transferred for each cache-to-cache transfer
  The P-III's MESI requires main memory to be updated on every cache-to-cache transfer

The "mem_access_latency" function is too simple
– Even in software simulation it takes at most a few dozen CPU cycles

Device driver overhead
– System overhead due to the device driver
– It occupies one TLB entry that would otherwise be used by the simulation

Time-consuming software routines and a reasonable FPGA access frequency are needed to benefit from a hardware implementation

Page 66: Hybrid System Emulation

Conclusions

Proposed a new co-simulation methodology

Preliminary co-simulation using SimpleScalar proves the correctness of the methodology
– Hardware/software implementation
– Communication between the P-III and the FPGA via the FSB
– Linux driver

Co-simulation results indicate
– FSB (bus) access is expensive
– Linux driver overhead also needs to be overcome
– Time-consuming blocks need to be emulated

Multi-core co-simulation would benefit from FPGAs
– Implement distributed low-level caches and an interconnection network, which would be complex enough to benefit from hardware modeling