


International Journal of Software Engineering & Applications (IJSEA), Vol.4, No.2, March 2013

DOI: 10.5121/ijsea.2013.4201

A TAXONOMY OF PERFORMANCE ASSURANCE METHODOLOGIES AND ITS APPLICATION IN HIGH PERFORMANCE COMPUTER ARCHITECTURES

Hemant Rotithor

Microprocessor Architecture Group (IDGa), Intel Corporation, Hillsboro, OR 97124, USA

[email protected]

ABSTRACT

This paper presents a systematic approach to the complex problem of high confidence performance assurance of high performance architectures, based on methods used over several generations of industrial microprocessors. A taxonomy is presented for performance assurance through three key stages of a product life cycle: high level performance, RTL performance, and silicon performance. The proposed taxonomy includes two components: an independent performance assurance space for each stage and a correlation performance assurance space between stages. It provides detailed insight into the performance assurance space in terms of the coverage provided, taking into account the capabilities and limitations of the tools and methodologies used at each stage. An application of the taxonomy to cases described in the literature and to high performance Intel architectures is shown. The proposed work should be of interest to manufacturers of high performance microprocessor/chipset architectures and has not been discussed in the literature.

KEYWORDS

Taxonomy, high performance microprocessor, performance assurance, computer architecture, modeling

1. INTRODUCTION

Phases in the design of a high performance architecture include: generating ideas for performance improvement, evaluation of those ideas, designing a micro-architecture to implement the ideas, and building silicon implementing the key ideas. At each stage, potential performance improvements need to be tested with high confidence. The three stages of developing a high performance architecture correspond to three levels of abstraction for performance assurance: high level (HL) performance, RTL performance, and silicon performance. Performance assurance consists of performance analysis of key ideas at a high level, performance correlation of the implementation of a micro-architecture of these ideas to the high level analysis/expectations, and performance measurement on the silicon implementing the micro-architecture. Examples of high performance architectures include microprocessors, special purpose processors, memory controller and IO controller chipsets, accelerators, etc.

A successful high performance architecture seeks major performance improvement over the previous generation and competitive products in the same era. Significant resources are applied in developing methodologies that provide high confidence in meeting performance targets. A high performance architecture may result in several products with different configurations, each of which has a separate performance target. For example, a CPU core may be used in server,


desktop, and mobile products with different cache sizes, core/uncore frequencies, and numbers of memory channels/sizes/speeds. A performance assurance scheme should provide high confidence in the performance of each product. We propose a generalized taxonomy of performance assurance methods that has been successfully deployed for delivering high performance architectures over several generations of CPUs/chipsets. The proposed taxonomy is regular and designed to highlight key similarities and differences in different performance methodologies. Such an insight is not available in the existing literature.

2. BACKGROUND

Literature pertaining to performance related taxonomies has focused on specific aspects of performance evaluation, primarily on workloads and simulation methods or on application specific performance issues, for example, a taxonomy for imaging performance [1]. A taxonomy of hardware supported measurement approaches and instrumentation for multi-processor performance is considered in [2]. A taxonomy for test workload generation is considered in [3], which covers aspects of valid test workload generation, and in [4], which considers process execution characteristics. A proposal for a software performance taxonomy was discussed in [5]. Work on performance simulation methods, their characteristics, and application is described in [6-9]. Another example describes specific aspects of validating pre-silicon performance verification of the HubChip chipset [10]. Other related work focuses on performance verification techniques for processors and SOCs and describes specific methods used and experience from using them [11-14]. The literature, while addressing specific aspects of performance verification, addresses only part of the issues needed for complete performance assurance of a complex high performance architecture. Significantly more effort is needed in producing high performance architectures, and the goal of this paper is to provide a complete picture of that effort in the form of a unified taxonomy, a picture that cannot be gathered through the glimpses of pieces described in the literature. This paper covers key aspects of product life-cycle performance assurance methods and proposes a taxonomy to encapsulate these methods in a high level framework. We show in a later section how the proposed taxonomy covers subsets of the performance verification methods described in the literature, and its application to real world high performance architectures.

Section 3 provides motivation for development of the taxonomy. Section 4 describes the proposed taxonomy. Section 5 discusses examples of application of the proposed taxonomy. Section 6 concludes the paper.

3. MOTIVATION FOR THE PROPOSED TAXONOMY

Product performance assurance is not a new problem, and manufacturers of high performance architectures have provided snapshots of subsets of the work done [10-14]. This paper unifies the key methods employed in performance assurance from inception to the delivery of silicon. Such a taxonomy is useful in the following ways:

a. It depicts how high confidence performance assurance is conducted for modern microprocessors/chipsets, based on experience over several generations of products.

b. It provides new insight into the total solution space of performance assurance methods employed for real high performance chips and a common framework within which new methods can be devised and understood.

c. It provides a rational basis for comparison of the different methods employed and shows similarities and differences between the methods employed at each stage of performance assurance.


d. It exposes the complexity, flexibility, and trade-offs involved in the total task, and provides a basis for identifying the adequacy of performance assurance coverage obtained with different solutions and any potential gaps that might exist and can be filled to improve coverage.

e. It provides a framework for assessing risk with respect to product performance against initial expectations set through planning or competitive assessment, and a high level framework for creating a detailed performance assurance execution plan.

Why is it important to look at a detailed framework for the components of performance assurance? To understand this, it is useful to go through the process of specification of performance requirements and their evaluation through the product life cycle:

• Performance targets for a new architecture and its derived products are set via careful planning for the time frame when it is introduced, to make it competitive.

• A set of high level ideas to reach the performance targets is investigated via a high level model, and a subset of these ideas is selected for implementation.

• A micro-architecture for implementing the selected ideas is designed and an RTL (register transfer level) model is created.

• Silicon implementing the RTL model is created and tested.

Performance evaluation is necessary at each stage to meet the set targets. The tools used for performance analysis at each stage differ greatly in their capabilities, coverage, accuracy, and speed. Table 1 shows how various attributes of performance assurance at each stage compare. A high level performance model can be developed rapidly, can project performance for a modest number and size of workloads, and allows stimulus to be injected and observed at fine granularity, but it may not capture all micro-architecture details. Performance testing with an RTL model needs longer development time, runs slowly, and can project performance only for a small set of workloads over short durations, but it captures the details of the micro-architecture. Performance testing with silicon can run the full set of workloads, captures all details of the micro-architecture, and provides significant coverage of the performance space; however, the ability to inject stimulus and the observability of results are limited. The goals of performance testing in these stages are also different. In the high level model, the goal is feature definition and initial performance projections that help reach the targets, and evaluation of performance trade-offs against micro-architecture changes needed later relative to the initial definition, to see if performance is still acceptable. The goal of RTL performance testing is to validate that the policies/algorithms specified by the high level feature definition are correctly implemented, correlated on a preselected set of tests on key metrics, and that performance is regularly regressed against implementation changes. Silicon performance is what is seen by the customer of the product; its goal is to test that the initial performance targets are met and can be published externally, and it also provides key insights for development of the next architecture via data measured with any programmable features and de-features in the chip.

Considering these differences in the capabilities and goals of performance assurance at eachstage, thinking of performance in a monolithic manner does not help one easily comprehend thecomplete space needed to deliver high performance architectures. It is important to tackleperformance assurance at each stage of development process with a clear understanding of thegoals, capabilities, and limitations to understand the scope and gaps in coverage that is addressedby the proposed taxonomy.


Table 1. Comparison of attributes of performance testing with different abstraction levels.

Attribute | HL Performance | RTL Performance | Silicon Performance
Development time | Low | Modest | High
Workload size and length | Modest | Short | Long
Stimulus injection granule | Fine | Fine | Coarse
Observation granule | Fine | Fine | Coarse
Result speed | Modest | Slow | Fast
Micro-architecture detail captured/tested (accuracy) | Low | High | High
Perf space coverage | Modest | Modest | High
Goal | High level arch partitions, pre-si feature defn, pre-si perf projection, implementation cost vs. perf tradeoff | Validate that arch policies get implemented in RTL, maintain projected performance | Validate expected silicon performance from the part, provide input for next generation arch, perf over competition or next process shrink

4. A TAXONOMY FOR PERFORMANCE ASSURANCE

The total performance assurance space (PA) consists of the cross product of two spaces: the independent performance assurance space (IPA) and the correlation performance assurance space (CPA). IPA marks the space covered by independently testing each of the three abstraction levels, whereas CPA marks the space covered by correlating performance between combinations of abstraction levels. Examples of IPA space performance testing include: performance comparison with a feature on vs. off, performance comparison with the previous generation, performance sensitivity to key micro-architecture parameters (policies, pipeline latency, buffer sizes, bus width, speeds, etc.), benchmark score projections, transaction flow visual defect analysis (pipeline bubbles), idle/loaded latency and peak bandwidth measurements, multi-source traffic interference impact, etc. The CPA space correlates measurements done in one space to those done in another space with comparable configurations on various metrics, to identify miscorrelations and gain confidence. Coverage in both spaces is needed to get high confidence in performance. We discuss each level of abstraction and propose a taxonomy consisting of the following four components.

Let us denote:

α as the high level performance space,
β as the RTL performance space,
γ as the silicon performance space,
θ as the correlation performance assurance space (CPA) of the individual spaces (α, β, γ).

Then the taxonomy for the performance assurance space for high performance architectures, denoted PA, is given as:


α X β X γ X θ, or IPA X CPA (1)

Where X denotes a Cartesian product of individual spaces. IPA is marked by α X β X γ.

4.1. High level performance assurance space α

Figure 1 depicts the high level (HL) performance assurance space. The components of the space exploit symmetry in providing coverage in all spaces, to generate a regular taxonomy.

α ∈ Analysis method (λ) X Stimulus (φ) X Component granularity (µ) X Transaction source (η) X Metric (ρ) X Configuration (ξ) (2)

Where:

Component granularity (µ) ∈ Platform, full chip, cluster, combination (3)

Analysis method (λ) ∈ Analytical model, simulation model, emulation, combination (4)

Stimulus (φ) ∈ Complete workload/benchmark, samples of execution traces, synthetic/directed workload, combination (5)

Traffic source (η) ∈ Single source, multiple sources (6)

Metric (ρ) ∈ Benchmark score, throughput/runtime, latency/bandwidth, meeting area/power/complexity constraints, combination (7)

Configuration (ξ) ∈ Single configuration, multiple configurations (8)
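To make the structure of equations (2)-(8) concrete, the following sketch (illustrative only; the set names and representation are ours and are not part of the taxonomy definition) enumerates the nominal α space as a Cartesian product of its component sets. An assurance plan then selects and prunes points from this enumeration, as discussed later in this section.

```python
# Illustrative sketch: enumerate the nominal HL performance assurance
# space alpha of equations (2)-(8) as a Cartesian product of its components.
from itertools import product

ANALYSIS_METHOD = ["analytical model", "simulation model", "emulation", "combination"]
STIMULUS = ["complete workload/benchmark", "execution trace samples",
            "synthetic/directed workload", "combination"]
GRANULARITY = ["platform", "full chip", "cluster", "combination"]
TRAFFIC_SOURCE = ["single source", "multiple sources"]
METRIC = ["benchmark score", "throughput/runtime", "latency/bandwidth",
          "area/power/complexity constraints", "combination"]
CONFIGURATION = ["single configuration", "multiple configurations"]

# alpha holds every nominal combination; not every point is valid or equally
# important, as noted later in the text.
alpha = list(product(ANALYSIS_METHOD, STIMULUS, GRANULARITY,
                     TRAFFIC_SOURCE, METRIC, CONFIGURATION))
print(len(alpha))  # 4 * 4 * 4 * 2 * 5 * 2 = 1280 nominal points
```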

Figure 1: IPA-space of high level performance assurance

High level performance analysis may be done using an analytical model, a simulation model, emulation, or a combination of these methods. Analytical models are suitable for rapid high level analysis of architectural partitions when the behavior and stimulus are well understood or can be abstracted as such. Simulation modeling may be trace or execution driven and can incorporate more details of the behavior, to get higher confidence in performance analysis under complex behavior and irregular stimulus. Emulation is suitable when an emulation platform is available and speed of execution is important. A behavioral high level simulation model may describe different units at different abstraction levels (accuracy) and gets progressively more accurate with respect to the implementation details as the RTL is coded and correlated; the HL model serves as a reference in later stages. A combination method can also be used, for example, a spreadsheet model that combines an analytical model with input from a simulation model, if it is too expensive to simulate the underlying system with adequate accuracy and speed.

We may choose to test the system at different levels of component granularity. It is possible to test at platform level (where the device under test is a component of the user platform), at full chip level, where the device under test is a chip implementing the high performance architecture, for example, in a high volume manufacturing tester, at a large cluster level within the chip (for example, the out of order execution unit or the last level cache in the uncore), or we may target all of these, depending on which pieces are critical for product performance. The test stimulus and test environment for each component granularity may differ and need infrastructure support to create comparable stimulus, configuration, etc. for performance correlation.

Stimulus may be provided in several forms depending on the device under test. We may use a complete workload execution on a high level model, short trace samples from execution of a workload (e.g., running on a previous generation platform or a new architecture simulator) driving a simulation model, or synthetic/directed tests that exercise a specific performance feature or a cluster level latency and bandwidth characterization. Synthetic stimulus may target, for example, idle or loaded latencies, cache hit/miss and memory page hit/miss bandwidth, peak read/write interconnect bandwidth (BW), etc. Synthetic stimulus can also be directed toward testing the performance of new high risk features that may span the micro-architecture. Synthetic stimulus is targeted toward testing a specific behavior and/or metric, whereas a real workload trace captures combinations of micro-architecture conditions and flows that a synthetic behavior may not generate, and both are important for getting good coverage. Synthetic and real workload stimuli may converge if the workload is a synthetic kernel and traces from its execution are used to drive a simulator; however, in most cases the differentiation can be maintained. Stimulus may also be a combination of these stimuli. The selected method depends on the speed of execution of the model and the importance of the metric and workloads.

For traffic sources, depending on the device under test, we may test with a single traffic source or a combination of traffic sources. Examples of a single traffic source are CPU multi-core traffic, integrated graphics traffic, or IO traffic, which might be used to characterize core, graphics, or IO performance with a new feature. We may use a combination of the above traffic sources to find interesting micro-architecture performance bottlenecks; examples include buffer sizes, forward progress mechanisms, and coherency conflict resolution mechanisms.

Various metrics are used in evaluation. If the benchmark can be run on the HL model/silicon, a benchmark score is used. If components of the benchmark or short traces of workload execution are used, throughput (CPI) or run time is used. If performance testing is targeted at a specific cluster, we may use the latency of access or the bandwidth to the unit as a metric. For a performance feature to be viable, it also needs to meet area, power, and complexity constraints in implementation. The addition of a new feature may need certain die area and incur leakage and dynamic power that impact TDP (thermal design power) and battery life. Based on the performance gain from a new feature and its impact on area/power, the feature may or may not be viable depending on the product level guidelines, and this needs to be evaluated during the HL and RTL performance stages. Design/validation complexity of implementing the performance feature is a key constraint for timely delivery. We may use a combination of these metrics depending on the evaluation plan.
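As a concrete illustration of how these metrics are derived from raw measurements, the short sketch below (our own example; the counter names, frequency, and numbers are hypothetical and not tied to any specific tool) computes CPI, bandwidth, and average latency from event counts of the kind a simulator or hardware counters might expose.

```python
# Illustrative sketch: derive the metrics discussed above from raw counts.

def cpi(cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction, the throughput-style metric used for traces."""
    return cycles / instructions_retired

def bandwidth_gb_per_s(bytes_transferred: int, cycles: int, freq_ghz: float) -> float:
    """Achieved bandwidth over an interval of `cycles` at `freq_ghz`."""
    seconds = cycles / (freq_ghz * 1e9)
    return bytes_transferred / seconds / 1e9

def avg_latency_ns(occupancy_cycles: int, completed_requests: int, freq_ghz: float) -> float:
    """Average request latency from queue occupancy and completions (1 GHz = 1 cycle/ns)."""
    return occupancy_cycles / completed_requests / freq_ghz

# Hypothetical interval: 2.0e9 cycles at 3.2 GHz, 1.6e9 instructions, 48 GB moved.
print(cpi(2_000_000_000, 1_600_000_000))                   # 1.25 CPI
print(bandwidth_gb_per_s(48 * 10**9, 2_000_000_000, 3.2))  # 76.8 GB/s
```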


There may be more than one product configuration supported with a given architecture. Several possibilities exist: do complete performance testing on all configurations, a subset of the performance testing on all configurations, or a subset of the performance testing on a subset of configurations that differ in key ways, trading off effort against performance risk. The exact configurations and the performance testing done with each configuration depend on the context; the proposed taxonomy differentiates between how much testing is done for each. An example of multiple configurations for a core/uncore is its use in several desktop, mobile, and server configurations that differ in key attributes (cache size, number of cores, core/uncore frequency, DRAM speed/size/channels, PCI lanes, etc.).

Not all combinations generated in the HL space are valid, feasible, or equally important. For example, although in principle one could specify an analytical model at platform granularity to measure a benchmark score, creating such a model with the desired accuracy may not be feasible. Performance testing with one configuration and traffic source may be more extensive than for other combinations due to the significance attached to those tests. A performance architect will specify the relevant components of the space that are deemed significant in a performance assurance execution plan. We do not enumerate key combinations, as their significance differs depending on the context.
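Continuing the enumeration sketch given earlier, the fragment below (again illustrative; the only feasibility rule encoded is the example from this paragraph, and `alpha` refers to the earlier enumeration) prunes nominal points from the α space to form the candidate set that an architect would then weight by significance in the execution plan.

```python
# Illustrative sketch: prune the enumerated alpha space with feasibility rules
# before the architect weights the surviving points by significance.

def is_feasible(point: tuple) -> bool:
    method, stimulus, granularity, source, metric, config = point
    # Example from the text: an analytical model at platform granularity
    # projecting a benchmark score is unlikely to reach the desired accuracy.
    if (method, granularity, metric) == (
            "analytical model", "platform", "benchmark score"):
        return False
    return True

def build_candidate_plan(space):
    """Keep only feasible points from an enumerated IPA space."""
    return [p for p in space if is_feasible(p)]

# Usage with the earlier enumeration: plan_points = build_candidate_plan(alpha)
```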

4.2. RTL performance assurance space β

Figure 2 depicts the RTL performance assurance space.

β ∈ Stimulus (φ) X Component granularity (µ) X Transaction source (η) X Metric (ρ) X Configuration (ξ) (9)

Where:

Stimulus (φ) ∈ Samples of execution traces, synthetic/directed workload, combination (10)

Component granularity (µ) ∈ Full chip, cluster, combination (11)

Traffic source (η) ∈ Single source, multiple sources (12)

Metric (ρ) ∈ Throughput/runtime, latency/bandwidth, meeting area/power/complexity constraints, combination (13)

Configuration (ξ) ∈ Single configuration, multiple configurations (14)

Components of the RTL performance assurance space are symmetric with the high level components, except for the following key differences arising from differences in environments. Performance testing is done on an RTL model that generally runs slowly since it captures micro-architecture details. Running large benchmarks is thus generally hard without large compute capacity, and it is best to use short workload test snippets or directed tests. The execution results may be visually inspected or measured using performance checker rules on result log files.
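As an illustration of the kind of performance checker rule mentioned above, the sketch below (the log format, counter name, and tolerance are hypothetical and do not describe an actual production flow) parses an RTL simulation result log and flags latencies that fall outside an expected band.

```python
# Illustrative sketch: a simple performance checker rule over an RTL result log.
import re

LAT_RE = re.compile(r"LLC_MISS_LATENCY_CYCLES\s*=\s*(\d+)")

def check_llc_miss_latency(log_text: str, expected_cycles: int, tol: float = 0.05) -> bool:
    """Pass only if every reported latency is within tol of the expected value."""
    samples = [int(m) for m in LAT_RE.findall(log_text)]
    if not samples:
        return False  # a missing measurement is itself a checker failure
    return all(abs(s - expected_cycles) <= tol * expected_cycles for s in samples)

# Hypothetical log snippet from a short directed test.
log = ("cycle 1200 LLC_MISS_LATENCY_CYCLES = 182\n"
       "cycle 3400 LLC_MISS_LATENCY_CYCLES = 185\n")
print(check_llc_miss_latency(log, expected_cycles=180))  # True (within 5%)
```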


Figure 2: IPA-space of RTL performance assurance

4.3. Silicon performance assurance space γ

Figure 3 depicts the silicon performance assurance space.

γ ∈ Stimulus (φ) X Component granularity (µ) X Transaction source (η) X Metric (ρ) X Configuration (ξ) (15)

Where:

Stimulus (φ) ∈ Complete workload/benchmark, synthetic/directed workload, combination (16)

Component granularity (µ) ∈ Platform, full chip, combination (17)

Traffic source (η) ∈ Single source, multiple sources (18)

Metric (ρ) ∈ Benchmark score, throughput/runtime, latency/bandwidth, combination (19)

Configuration (ξ) ∈ Single configuration, multiple configurations (20)

Components of silicon performance assurance are symmetric to the other spaces, with notable differences related to the accessibility/observability limitations noted earlier. Thus, for devices under test, stimulus and component granularity are limited to full chip/platform.

Figure 3: IPA-space of silicon performance assurance


4.4. Correlation performance assurance (CPA) space θ

Figure 4 shows the four components of CPA, using definitions symmetric to the IPA space:

Let τ denote the correlation space between RTL and high level performance,
Let ϖ denote the correlation space between high level and silicon performance,
Let ∂ denote the correlation space between RTL and silicon performance,
Let Ω denote the correlation space between HL, RTL, and silicon performance.

Then CPA θ is given as:

θ ∈ τ X ϖ X ∂ X Ω (21)

τ, ϖ, ∂, Ω ∈ Stimulus (φ) X Component granularity (µ) X Transaction source (η) X Metric (ρ) X Configuration (ξ) (22)

For τ:
Stimulus (φ) ∈ Samples of execution traces, synthetic/directed workload, combination
Component granularity (µ) ∈ Full chip, cluster, combination
Traffic source (η) ∈ Single source, multiple sources
Metric (ρ) ∈ Throughput/runtime, latency/bandwidth, area/power/complexity constraint, combination
Configuration (ξ) ∈ Single configuration, multiple configurations

For ϖ:
Stimulus (φ) ∈ Complete workload/benchmark, synthetic/directed workload, combination
Component granularity (µ) ∈ Platform, full chip, combination
Traffic source (η) ∈ Single source, multiple sources
Metric (ρ) ∈ Benchmark score, throughput/runtime, latency/bandwidth, combination
Configuration (ξ) ∈ Single configuration, multiple configurations

For ∂:
Stimulus (φ) ∈ Synthetic/directed workload
Component granularity (µ) ∈ Full chip
Traffic source (η) ∈ Single source, multiple sources
Metric (ρ) ∈ Throughput/runtime, latency/bandwidth, combination
Configuration (ξ) ∈ Single configuration, multiple configurations

For Ω:
Stimulus (φ) ∈ Synthetic/directed workload
Component granularity (µ) ∈ Full chip
Traffic source (η) ∈ Single source, multiple sources
Metric (ρ) ∈ Throughput/runtime, latency/bandwidth, combination
Configuration (ξ) ∈ Single configuration, multiple configurations

The CPA space denotes the part of the total coverage that is obtained by correlating between IPA spaces using comparable stimulus, metrics, traffic sources, components, and configurations. This coverage is necessary because we are not able to test everything in the individual spaces, due to the limitations discussed earlier, and the correlation space improves that coverage. In the CPA space, high priority is placed on correlating the performance of the RTL model with the high level model. The high level model runs fast enough to project benchmark level performance, and if the two models correlate, the high level model serves as a good proxy for what we may expect as a benchmark level projection for the RTL. The significance of each correlation space may differ. We have discussed the individual components of each space earlier, and their definition is not repeated here for brevity.
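To illustrate what such a correlation check can look like in practice, the sketch below (entirely illustrative; the test names, values, and tolerance are hypothetical) flags HL-model versus RTL miscorrelations over a preselected test list, in the spirit of the τ space defined above.

```python
# Illustrative sketch: flag HL vs. RTL miscorrelations on a preselected test list.

def miscorrelations(hl_results: dict, rtl_results: dict, rel_tol: float = 0.03):
    """Return tests whose RTL result deviates from the HL projection by more than rel_tol."""
    flagged = []
    for test, hl_value in hl_results.items():
        rtl_value = rtl_results.get(test)
        if rtl_value is None:
            flagged.append((test, "missing RTL result"))
        elif abs(rtl_value - hl_value) / hl_value > rel_tol:
            flagged.append((test, f"HL={hl_value}, RTL={rtl_value}"))
    return flagged

hl = {"mem_page_hit_bw": 25.6, "idle_latency_ns": 62.0}   # HL model projections
rtl = {"mem_page_hit_bw": 24.1, "idle_latency_ns": 63.0}  # RTL measurements
print(miscorrelations(hl, rtl))  # the bandwidth test deviates ~5.9% and is flagged
```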

The PA taxonomy for high performance architectures provides a new way to look at the complete performance assurance space, one that is easily understood and extended using a well defined and regular set of criteria. The criteria used in defining the performance assurance space represent a key set of issues that an architect would need to resolve while designing the solution. This does not mean it includes every possible issue, as a taxonomy based on such an endeavor would be unwieldy. The selected criteria are relevant to all abstraction levels, capture key issues that need to be addressed, and allow any significant differences between the levels to be isolated. We discuss application of this taxonomy in the next section.

Figure 4: Correlation performance assurance space (CPA)

5. APPLICATION AND CONSIDERATIONS

5.1. Solution Spaces and Coverage

Figure 5 shows that the proposed taxonomy partitions the total performance assurance space into seven distinct spaces. The IPA is marked by spaces 1, 2, and 3. The CPA space is marked by spaces 4, 5, 6, and 7, which overlap the IPA spaces. Table 2 illustrates the high level characteristics of each space and shows what areas they may cover. The table is meant to be illustrative, not an exhaustive coverage of each space. For example, if synthetic/directed stimulus is missing from the selected solution in all components, and only real workloads/traces are used as stimulus, there may be a hole in testing the peak bandwidth of key micro-architecture components. If synthetic/directed tests were present only in silicon performance, then the testing gap may propagate through HL and RTL performance until silicon and may be expensive to fix later. Similar considerations apply to dropping testing of a high risk feature from one or more of the spaces using synthetic/directed tests; in these cases, real workload traces may not find a performance problem with the feature without explicit directed testing, which may result in a potential performance coverage hole. Similar coverage comments apply to the CPA space in the table, depending on what coverage is sought. For detailed gap/risk assessment, more details of each component of the solution need to be specified in an assurance plan and the combinations reviewed over the PA space, for example, the models needed for evaluation, the list of workloads, details of synthetic tests targeting specific behaviors/features, details of clusters, traffic sources, and detailed metrics and configurations.
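Once an assurance plan is written down, this kind of stimulus coverage hole can be detected mechanically. The sketch below (a toy representation introduced only for illustration; the stage names and stimulus classes mirror the taxonomy, everything else is hypothetical) reports stages that lack synthetic/directed stimulus.

```python
# Illustrative sketch: detect stages whose planned stimulus set lacks a
# required class, e.g. synthetic/directed tests for peak-bandwidth micros.

plan = {
    "HL":      {"execution traces", "synthetic/directed"},
    "RTL":     {"execution traces"},                 # hole: no directed tests
    "Silicon": {"complete workloads", "synthetic/directed"},
}

def stimulus_holes(plan: dict, required: str = "synthetic/directed"):
    """Return stages missing the required stimulus class."""
    return [stage for stage, stimuli in plan.items() if required not in stimuli]

print(stimulus_holes(plan))  # ['RTL']
```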


Figure 5: Solution Spaces of Performance Assurance Methods (spaces 1-7 across the overlapping HL performance, RTL performance, and silicon performance regions)

Depending on a product's life stage and goals, coverage in all spaces may not be equally important. For example, for a product design to deliver the expected performance, covering space 4 (RTL performance correlation with the high level model) may be more important than covering space 5, which would test micro-architecture defeatures, hardware performance counters/events, etc. Similarly, space 7 may be higher priority than space 5, and one can make coverage and effort tradeoffs/prioritization that way.

Table 2: Example of coverage provided by each solution space

Performance Validation Space | Coverage

Space 1 (IPA-α) | HL performance. High level architecture partitioning and feature definition with less micro-architecture detail (more refined as the micro-architecture is defined); set product performance projections/expectations

Space 2 (IPA-γ) | Silicon performance. Product performance projections published for various benchmarks with the silicon implementation, or measuring and comparing performance with competitive products; tune parameters to optimize performance (BIOS settings)

Space 3 (IPA-β) | RTL performance. Performance testing after functional coding at unit/cluster level with details of the micro-architecture implemented; transaction flow inspection for defects (bubbles)

Space 4 (CPA-τ) | Verify that the RTL implements the algorithms specified in the architecture specification derived from high level analysis, by correlating with the HL model using short tests and snippets of workloads. Validate and correlate changes to the micro-architecture required by implementation complexity, and their performance impact

Space 5 (CPA-∂) | Test/validate cases that have performance impact and need details of the micro-architecture not implemented in the HL model; examples: product defeatures, rare architecture/micro-architecture corner cases with short full chip tests, hardware performance counters and events, and other performance observability hooks

Space 6 (CPA-ϖ) | Test full benchmark execution and correlate silicon performance to that projected with the high level model, to see if targets are met when the full implementation is considered; provides a method for correlating pre- and post-silicon measurements and validating pre-silicon methodologies; also useful for providing input for next generation CPUs with targeted studies of features and defeatures

Space 7 (CPA-Ω) | The intersection of all three methods, used to test performance pillars in all cases; for example, running full chip micros/directed tests for key component latencies and bandwidths and high risk features, which can be regularly tracked in a regression suite as the RTL and silicon steppings change


Below we illustrate the application of the taxonomy to performance verification described in the literature, and then show more complete applications of the taxonomy to specific examples of high performance Intel processors and MCH (memory controller hub) chipsets. These examples depict how the performance verification work in the literature can be described under the proposed framework and how the taxonomy extends to testing with real chips.

5.2. Application Examples

The application of the taxonomy to work done in the literature is shown only for the specific methods discussed in these papers and does not reflect on whether the products described were limited to the testing shown here. We consider the examples discussed in [10, 11, 12, 13]. In [10], Doering et al. consider performance verification for the high performance PERCS Hub chip developed by IBM, which binds several cores and IO. This work largely relates to high level (analytical (queueing) + simulation (OMNET)) modeling and VHDL RTL correlation for the chipset. In the proposed taxonomy, the work described in the paper would be classified under the CPA space and the HL-RTL correlation (τ) branch of CPA as follows:

HL-RTL Correlation: Stimulus = trace driven, Component granularity = full chip, Traffic source = multiple, Metric = multiple (latency, throughput), Configuration = single

In [11], Holt et al. describe system level performance verification of a multi-core SOC. Two methods of performance verification are described in that paper: top down and bottom up verification. Under the proposed taxonomy, the top down performance verification would be described under IPA HL performance assurance (α), and the bottom up verification would be described under the CPA HL-RTL correlation (τ) branch, as follows:

(Top down) IPA HL performance: Analysis method = emulation, Stimulus = synthetic, Component granularity = full chip, Traffic source = multiple, Metric = combination (latency/BW, throughput), Configuration = multiple

(Bottom up) CPA HL-RTL correlation: Stimulus = synthetic, Component granularity = full chip, Traffic source = multiple, Metric = combination (unloaded latency, throughput), Configuration = single

In [12, 13], Bose et al. describe architecture performance verification of IBM's PowerPC™ processors. Under the proposed taxonomy, the work described there would be included in the CPA space and the HL-RTL correlation (τ) branch of CPA as follows:

HL-RTL Correlation: Stimulus = combination, Component granularity = full chip, Traffic source = single (CPU core), Metric = combination (latency/BW, throughput), Configuration = multiple (Power3, Power4)

These examples show that the performance verification work in the literature focuses on subsets of the PA space, and that there is no clear definition of the whole space. The proposed taxonomy achieves two goals: it describes the total space, and it provides a consistent terminology for describing parts of the total space. The classification above also shows high level similarities and differences in the methods used in these cases.

Next, we show the application of the proposed taxonomy to three examples in Table 3: an IA™ CPU core, an MCH chipset, and a memory controller cluster. The first table shows the IPA space and the second table shows the CPA space. The taxonomy mapping for each example is illustrative; other solutions are possible depending on the context.


For IPA HL core performance, a combination of an analytical model during early exploration and a simulation model of the architecture is used. The stimulus is a combination of directed tests for specific latency/BW characterization and real workload benchmark traces for high quality coverage. The directed tests also cover new features introduced in the architecture. The testing is done at a combination of cluster and full chip granularity. The single source of traffic is IA™ core workloads/traces, and the measurement granularity is a combination of benchmark score, throughput/run time, and latency and BW of targeted units. Since the core is used in multiple configurations (desktop, mobile, server), testing is done with multiple configurations. For core RTL testing, similar considerations apply as for high level testing, except that the metrics are a combination of run time/throughput and latency/bandwidth and the stimulus is a combination of traces and synthetic workloads. For core silicon testing, the stimulus consists of a combination of complete workloads and directed full chip tests, and the component granularity is platform and full chip. Other considerations for metric and configurations with RTL and silicon performance are comparable to those of HL performance.

For IPA MCH chipset performance testing (the chipset column), one significant difference is in the traffic source. The core had a single source of traffic, whereas the MCH binds multiple sources that include cores, IO, and graphics. The performance testing for the MCH is therefore done with multiple sources of transactions and a combination of metrics. If the MCH functionality is integrated into an uncore or a SOC, it would have a comparable IPA scheme.

A memory controller (MC) is a cluster within the uncore or MCH, and its performance testing is shown as the third example. It can also be left as a part of uncore cluster/MCH testing if that is considered adequate. In this example, we consider the memory controller as a modular component that may be used in more than one architecture and thus needs to be independently tested for high confidence. High level IPA testing of a memory controller is done with a simulation model and synthetic micros directed at the performance aspects of a memory controller, which test core timings, turnarounds, latency, and BW under various read/write mixes and page hit/miss proportions. It can be tested with multiple traffic sources and different memory configurations (number of ranks, DIMMs, speeds, timings, etc.). For silicon testing, memory controller performance is tested with a combination of synthetic workloads and benchmarks (streams), etc.

The CPA space for all four components is shown in the second table. For example, for CPU core HL-RTL correlation, a combination of short real workload traces along with synthetic workloads is tested on the HL model and RTL at a combination of full chip and cluster granularity. The workload source is an IA core, and a combination of the metrics throughput (for workload traces) and latency/BW (for synthetic workloads) is used for correlation. This correlation is done on multiple configurations. For HL-silicon correlation, a combination of full chip latency/BW micros and benchmarks is run, and correlation is done for benchmark scores and latency/BW metrics. This correlation also helps improve the HL model accuracy and provides a useful reference for development of next generation processors. For the CPU core, RTL-silicon correlation is done on a single configuration, whereas the other three correlations are done on multiple configurations. This illustrates an example of trading effort vs. coverage at low risk, since RTL-silicon correlation covers cases that are uncommon from a performance perspective and gets adequate testing on a single configuration. The HL-RTL-silicon correlation testing is done with targeted synthetic full chip micros that test the core metrics that are key for product performance; the testing is done at full chip with a combination of throughput and latency/BW metrics in multiple configurations. Similar considerations apply to the chipset and memory controller CPA space.


Table 3: Example of application of taxonomy to real world examples

IPA (stage / attribute) | CPU core | Chipset | Memory Controller unit

High Level Testing
Analysis method | Combination | Simulation | Simulation
Stimulus | Combination | Combination | Synthetic
Component granularity | Combination | Combination | Cluster
Traffic src | Single | Multiple (IA/IO/GFX) | Multiple
Metric | Combination | Combination | Latency/BW
Configs | Multiple | Multiple | Multiple

RTL
Stimulus | Combination | Combination | Synthetic
Component granularity | Combination | Combination | Cluster
Traffic src | Single | Multiple | Multiple
Metric | Combination | Combination | Latency/BW
Configs | Multiple | Multiple | Multiple

Silicon
Stimulus | Combination | Combination | Combination
Component granularity | Combination | Combination | Platform
Traffic src | Single | Multiple | Multiple
Metric | Combination | Multiple | Latency/BW
Configs | Multiple | Multiple | Multiple


CPA (correlation / attribute) | CPU core | Chipset | Memory Controller unit

HL-RTL
Stimulus | Combination | Synthetic | Synthetic
Component granularity | Combination | Full chip | Cluster
Traffic src | Single | Multiple (IA/IO/GFX) | Multiple
Metric | Combination | Combination | Latency/BW
Configs | Multiple | Single | Multiple

HL-Silicon
Stimulus | Combination | Synthetic | Synthetic
Component granularity | Combination | Full chip | Full chip
Traffic src | Single | Multiple | Multiple
Metric | Combination | Combination | Latency/BW
Configs | Multiple | Single | Multiple

RTL-Silicon
Stimulus | Synthetic | Synthetic | Synthetic
Component granularity | Full chip | Full chip | Full chip
Traffic src | Single | Single | Single
Metric | Combination | Throughput/run time | Latency/BW
Configs | Single | Single | Single

HL-RTL-Silicon
Stimulus | Synthetic | Synthetic | Synthetic
Component granularity | Full chip | Full chip | Full chip
Traffic src | Single | Multiple | Single
Metric | Combination | Latency/BW | Latency/BW
Configs | Multiple | Single | Multiple


6. CONCLUSIONS

This paper presented a systematic approach to the complex problem of performance assurance of high performance architectures manufactured in high volume, based on methods successfully deployed over several generations of Intel cores/chipsets, expressed as a unified taxonomy. The taxonomy extensively considers performance assurance through three key stages of a product, namely high level product performance, RTL performance, and silicon performance, and such a treatment has not been discussed in the literature previously. The proposed taxonomy incorporates the capabilities and limitations of the performance tools used at each stage and helps one construct a complete high level picture of the performance testing that needs to be done at each stage. An application of the taxonomy to examples in the literature and to real world examples of a CPU core, MCH chipset, and memory controller cluster was shown.

The key advantages of the proposed taxonomy are: it shows at a high level where the performance assurance methods need to differ; it makes one think through all phases of a product, starting from the high level until silicon; and enumeration of the taxonomy in a detailed performance assurance execution plan identifies whether there are holes in the performance testing that either need to be filled or whose concomitant risk must be appropriately assessed. The taxonomy helps with resource planning and with mapping out and delivering a successful high performance product.

The proposed taxonomy has been successfully used in the performance assurance of Intel's Nehalem/Westmere CPUs and several generations of chipsets. This systematic approach has been instrumental in identifying many pre-silicon performance issues early on, and any corner cases identified in silicon, owing to the several cross checks embedded in the methodology. It has helped create a rigorous performance assurance plan. The proposed work is new and should be of interest to manufacturers of high performance architectures.

7. REFERENCES

[1] Don Williams, Peter D. Burns, Larry Scarff, (2009) "Imaging performance taxonomy", Proc. SPIE 7242, 724208, doi:10.1117/12.806236, 19 January 2009, San Jose, CA, USA.

[2] Mink, A.; Carpenter, R.J.; Nacht, G.G.; Roberts, J.W.; (1990) "Multiprocessor performance-measurement instrumentation", Computer, Volume 23, Issue 9, DOI: 10.1109/2.58219, Pages 63-75.

[3] Mamrak, S.A.; Abrams, M.D.; (1979) "Special Feature: A Taxonomy for Valid Test Workload Generation", Computer, Volume 12, Issue 12, DOI: 10.1109/MC.1979.1658577, Pages 60-65.

[4] Oliver, R.L.; Teller, P.J.; (1999) "Are all scientific workloads equal?", Performance, Computing and Communications Conference, 1999 (IPCCC '99), IEEE International, DOI: 10.1109/PCCC.1999.749450, Pages 284-290.

[5] Mary Hesselgrave, (2002) "Panel: constructing a performance taxonomy", WOSP '02: Proceedings of the 3rd International Workshop on Software and Performance, July 2002.

[6] S. Mukherjee, S. Adve, T. Austin, J. Emer, P. Magnusson, (2002) "Performance simulation tools", Computer, Volume 35, Issue 2, February 2002, Page 38.

[7] S. Mukherjee, S. Reinhardt, B. Falsafi, M. Litzkow, S. Huss-Lederman, M. Hill, J. Larus, and D. Wood, (2000) "Fast and portable parallel architecture simulators: Wisconsin Wind Tunnel II", IEEE Concurrency, vol. 8, no. 4, pp. 12-20, Oct.-Dec. 2000.

[8] Heekyung Kim, Dukyoung Yun, (2009) "Scalable and re-targetable simulation techniques for systems", CODES+ISSS '09: Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, NY, 2009.

[9] Hoe, James C.; Burger, Doug; Emer, Joel; Chiou, Derek; Sendag, Resit; Yi, Joshua; (2010) "The Future of Architectural Simulation", IEEE Micro, Volume 30, Issue 3, DOI: 10.1109/MM.2010.56, Pages 8-18.


[10] Andreas Doering and Hanspeter Ineichen, "Visualization of Simulation Results for the PERCS Hub Chip Performance Verification", Proc. SIMUTools 2011, 4th ICST Conference on Simulation Tools and Techniques, March 21-25, Barcelona, Spain.

[11] Holt, J.; Dastidar, J.; Lindberg, D.; Pape, J.; Peng Yang; "System-level Performance Verification of Multicore Systems-on-Chip", Microprocessor Test and Verification (MTV), 2009 10th International Workshop on, DOI: 10.1109/MTV.2009.10, 2009, Pages 83-87.

[12] Surya, S.; Bose, P.; Abraham, J.A.; "Architectural performance verification: PowerPC", Computer Design: VLSI in Computers and Processors, 1994 (ICCD '94), Proceedings, IEEE International Conference on, DOI: 10.1109/ICCD.1994.331922, 1994, Pages 344-347.

[13] Bose, P.; "Ensuring dependable processor performance: an experience report on pre-silicon performance validation", Dependable Systems and Networks, 2001 (DSN 2001), International Conference on, DOI: 10.1109/DSN.2001.941432, 2001, Pages 481-486.

[14] Richter, K.; Jersak, M.; Ernst, R.; "A formal approach to MpSoC performance verification", Computer, Volume 36, Issue 4, DOI: 10.1109/MC.2003.1193230, 2003, Pages 60-67.

Authors

Hemant Rotithor received his M.S. and Ph.D. in Electrical and Computer Engineering from IIT Bombay and the University of Kentucky. He taught at Worcester Polytechnic Institute and worked at DEC on compiler performance analysis; he is currently working at Intel Corporation in Hillsboro, Oregon, in the microprocessor architecture group. At Intel Corporation, he has worked on the performance of many generations of microprocessors and chipsets. He has several patents issued in the areas of uncore micro-architecture performance, memory scheduling, and power management. He has published papers on performance analysis, distributed computing, and validation.