
Performance evaluation of system architectures with validated input data

Humayun Khalid

Dell Computer Corporation, Enterprise Server Performance, 1000 Cassat Cove, Austin, TX 78753, USA

Received 10 February 1999; accepted 20 October 1999

Abstract

In this paper we have extended our methodology, presented earlier in [1], for generating and validating representative traces. The technique was applied to a realistic and difficult multimedia benchmark suite called MediaMark. We have also introduced a new metric, the K-metric (Khalid metric), that was used for validation. The aim of the present research is to demonstrate that the proposed methodology can succeed even for complex and challenging benchmarks such as multimedia benchmarks. Earlier, our methodology was shown to outperform the popular contemporary techniques for tracing the relatively simple and primitive suite of applications contained within the SPEC95 benchmark suite [1]. Experimental results in this article demonstrate that our methodology works even in worst-case scenarios.
© 2000 Elsevier Science B.V. All rights reserved.

Keywords: Simulation; Performance evaluation; Validation; Benchmarks

1. Introduction

Trace-driven simulations use real traces, collected during the execution of actual benchmarks, as inputs to simulators. They have therefore been widely used by design engineers and performance analysts for performance evaluation of computer systems. The accuracy of such simulations in predicting actual machine performance for a variety of benchmarks is essentially determined by the accuracy of the simulators and the accuracy of the input data (traces) fed to them. In this paper, we address the issue of trace sampling and validation needed for accurate trace-driven simulations. This is already an important problem for the SPEC95 benchmark suite, which contains an aggregate of 800 billion instructions [2], and it becomes far more severe as we move to a relatively complex and challenging benchmark suite like MediaMark [3].

Several approaches have been suggested for generating reduced traces from workloads [4-7]. The cited strategies are effective in reducing the amount of data that must be processed; however, it is not clear how useful the resulting sampled traces are in predicting the behavior of a processor or system on the full workloads they are intended to represent. This is known as the problem of trace representativeness. In fact, we have already demonstrated the superiority of our rudimentary tracing methodology over the popular methodologies in [1].

This paper presents a new trace sampling methodology. We introduce a new metric called


the K-metric, which is used to validate the static characteristics of a given trace. Section 2 presents the K-metric and the other definitions, notations, and terminology used throughout the text and in the tables. Section 3 describes the proposed trace sampling and validation technique. Section 4 presents the simulation results and discussion. Section 5 contains the conclusions.

2. Notations, definitions and formulae

This section contains the following terminology, definitions, and notations:

K-metric (%): Defined as the weighted average of the dynamic instruction mix difference between a binary and its trace, expressed by the following equation (a short computational sketch follows this list):

K-metric = \sum_{i=1}^{T} x_i \cdot y_i,

where T is the total number of instruction types, x_i is the percentage difference between the binary and the trace for instruction type i, and y_i is the average percentage of instruction type i.

Binary Mill. Instr.: Dynamic binary path length (for a benchmark) in millions of instructions.
Trace Mill. Instr.: Path length for a trace in millions of instructions.
ipc delta (%): Percentage difference in instructions per cycle between a binary and its trace.
ipc: Instructions per cycle.
IC-miss delta (%): Percentage difference in level-1 instruction-cache misses (per instruction) between a binary and its trace.
IC miss (%): Percentage of level-1 instruction-cache misses per instruction.
DC-miss delta (%): Percentage difference in level-1 data-cache misses (per instruction) between a binary and its trace.
DC miss (%): Percentage of level-1 data-cache misses per instruction.
Emul code delta (%): Percentage difference in emulation code between a binary and its trace.
Emul (%): Percentage of emulation code in the dynamic binary/trace path length.
fp_op (%): Percentage of floating-point instructions in the dynamic binary/trace path length.
int_op (%): Percentage of integer instructions in the dynamic binary/trace path length.
fp_ls (%): Percentage of floating-point load/store instructions in the dynamic binary/trace path length.
int_ls (%): Percentage of integer load/store instructions in the dynamic binary/trace path length.
Brnch (%): Percentage of branch instructions in the dynamic binary/trace path length.
Cache (%): Percentage of cache instructions in the dynamic binary/trace path length.
Other (%): Percentage of miscellaneous instructions in the dynamic binary/trace path length, e.g. eieio, isync, sync, rfi, mfspr, mtspr.
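To make the computation concrete, the following Python sketch implements the equation above under one plausible reading of its terms (an assumption, since the text does not spell them out): x_i is taken as the relative percentage difference for instruction type i, using the binary/trace average as the base, and y_i as the average share of type i expressed as a fraction. Under this reading the product x_i * y_i reduces to the absolute percentage-point difference for type i, which reproduces, for example, the DVC_Pupp (0.05) and Tartus3D (0.00) K-metric values of Table 3 from the mixes in Tables 1 and 2. The function name and data layout are illustrative only.

    def k_metric(binary_mix, trace_mix):
        """Weighted average of the instruction-mix difference between a
        binary and its trace (one plausible reading; see the note above).

        binary_mix, trace_mix: dicts mapping instruction type -> percentage
        of the dynamic path length (the seven categories defined above).
        """
        total = 0.0
        for itype, b in binary_mix.items():
            t = trace_mix[itype]
            avg = (b + t) / 2.0           # average percentage of type i
            if avg == 0.0:
                continue                  # type absent in both: no contribution
            x = 100.0 * abs(b - t) / avg  # percentage difference for type i
            y = avg / 100.0               # average share of type i as a fraction
            total += x * y                # equals abs(b - t) in percentage points
        return total

    # Worked example: the DVC_Pupp rows of Tables 1 and 2.
    binary = {"fp_op": 0.00, "int_op": 60.48, "fp_ls": 1.82, "int_ls": 29.85,
              "Brnch": 6.63, "Cache": 0.68, "Other": 0.47}
    trace = {"fp_op": 0.00, "int_op": 60.48, "fp_ls": 1.82, "int_ls": 29.89,
             "Brnch": 6.64, "Cache": 0.68, "Other": 0.47}
    print(round(k_metric(binary, trace), 2))  # prints 0.05, matching Table 3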

3. A trace sampling methodology and validation of the sampled traces

The intuition behind the proposed trace sampling methodology is that a representative reduced trace should preserve both the static and dynamic characteristics seen in the executable. Our technique ensures that the static and dynamic characteristics of a sampled trace remain within an acceptable predefined threshold.

We begin by examining the dynamic and static behavior of a given executable. This is done by monitoring the aggregate dynamic instructions per cycle (ipc), dynamic instruction mix (IM), percentage of emulation instructions (Emul), and cache statistics using the PowerPC performance monitor (PowerPC PM); for more information on the PowerPC PM refer to [8]. Next, we identify regions in the executable that exhibit behavior consistent with the aggregate dynamic behavior. This is achieved by sampling dynamic information for segments of the executable; the sample size depends on the size of the executable and the sampling efficiency. These steps yield the segments of the executable whose dynamic properties resemble the aggregate dynamic characteristics. These segments are then traced using appropriate tracing tool(s), and the trace segments are concatenated in the order they were sampled to maintain temporal consistency between the sampled trace and the executable.

The fourth step is to check the static characteristics of the trace, which in our methodology means its instruction mix. The instruction mix divides all instructions into seven categories (for ease of computation): fp_op, int_op, fp_ls, int_ls, Brnch, Cache, and Other (see Section 2 for definitions). Once the static mix of the trace is obtained, we compare it to the static mix of the binary using the K-metric (see Section 2), a weighted average of the instruction mix difference between a binary and its trace. The reason for defining the K-metric is that a weighted average is more meaningful here than its unweighted counterparts: a large relative difference between the counts of a given instruction type in a binary and its trace has little significance if the total number of such instructions is small in the binary and its corresponding trace. If the static characteristics of a sampled trace are not within the specified threshold limits (usually within 10-15%), we modify the sampling interval and repeat the sampling experiments, identifying new dynamic segments of interest. Lastly, we compare the final static and dynamic characteristics of the resultant sampled trace with those of the executable.

(PowerPC and PowerPC Performance Monitor are trademarks of International Business Machines Corporation. SPEC is a trademark of the Standard Performance Evaluation Corporation. MacOS is a trademark of Apple Inc.)
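The control flow of this methodology can be summarized in the short Python sketch below. It is only an illustration: sample_dynamic_stats, matches_aggregate, trace_segments, and static_mix are hypothetical stand-ins, supplied by the caller, for the performance-monitor sampling, tracing, and mix-extraction tools described above; k_metric is the function sketched in Section 2.

    def build_validated_trace(executable, aggregate_stats,
                              sample_dynamic_stats, matches_aggregate,
                              trace_segments, static_mix, k_metric,
                              threshold=15.0, interval=1_000_000):
        """Resample an executable until the sampled trace's static mix is
        within `threshold` percent of the binary's (10-15% per Section 3)."""
        while True:
            # Steps 1-2: sample dynamic stats (ipc, mix, Emul, cache) over
            # windows of `interval` instructions, and keep the segments whose
            # behavior matches the aggregate dynamic characteristics.
            samples = sample_dynamic_stats(executable, interval)
            segments = [s for s in samples
                        if matches_aggregate(s, aggregate_stats)]
            # Step 3: trace the chosen segments and concatenate them in
            # sampling order to preserve temporal consistency.
            trace = trace_segments(executable, segments)
            # Step 4: validate the static characteristics (instruction mix).
            if k_metric(static_mix(executable), static_mix(trace)) <= threshold:
                return trace
            # Step 5: otherwise adjust the sampling interval and resample.
            interval //= 2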

4. Simulation results and discussion

The MediaMark benchmark suite is made up of eight Macintosh applications. It was developed at the Somerset Design Center with the intention of spanning the entire spectrum of multimedia applications run on Apple's Macintosh computers. These benchmarks pose a severe challenge to sampling methodologies due to several characteristics, including non-loopy code, complex embedded loops, real-time requirements, relatively heavy emulation code, and a relatively large proportion of instructions accessing memory (load/store). A brief description of the MediaMark workloads is provided in [3].

Using the methodology presented in Section 3, representative sampled traces were generated for MediaMark on a PowerPC machine running MacOS. The suite was chosen primarily for the relatively high complexity of its benchmarks. No comparison was made with existing popular trace sampling techniques, because the considerably better performance of our technique was already demonstrated in [1] with the rudimentary version of our methodology on the relatively simple SPEC95 benchmarks [2].

Tables 1 and 2 list runtime binary and trace statistics, respectively, for all the benchmarks in the MediaMark suite.

Table 1
MediaMark binary statistics

Benchmark   fp_op (%)  int_op (%)  fp_ls (%)  int_ls (%)  Brnch (%)  Cache (%)  Other (%)  ipc    IC miss (%)  DC miss (%)  Emul (%)
DVC_Pupp    0.00       60.48       1.82       29.85       6.63       0.68       0.47       1.30   0.16         0.37         2.42
Mpeg-1      0.05       49.40       0.64       31.84       13.96      0.83       3.27       0.99   0.31         0.59         3.57
Tartus3D    5.29       66.37       2.26       21.49       4.17       0.00       0.39       0.86   0.08         0.94         0.37
PS_Gaus     0.00       51.84       0.04       31.87       13.56      0.01       2.65       0.96   0.24         0.51         4.97
PS_Rot      0.02       60.75       1.70       19.55       12.10      0.00       5.84       0.75   0.58         0.82         10.71
FreeHand    0.00       34.03       0.80       37.54       20.21      0.01       7.38       0.54   2.92         0.94         4.20
InfiniD     2.79       30.13       4.39       27.32       16.46      0.36       18.55      0.45   3.15         0.44         13.00
Adobe_Pre   0.17       56.32       0.52       19.70       22.23      0.03       1.01       1.09   0.29         0.36         5.46


For each benchmark, 11 parameters or data points were captured. The first seven columns of Tables 1 and 2 provide the static characteristic data for the binary/trace, while the remaining four columns contain dynamic characteristic data. Several aspects of the multimedia benchmarks in the MediaMark suite become apparent from the tables. The benchmarks consist of predominantly integer code. Between one-quarter and one-third of the dynamic instructions in most of the binaries access the memory hierarchy (load/store instructions), and as much as one-quarter of the dynamic instructions result in branch activity. Also, despite a considerable amount of emulation, the instructions executed per cycle are fairly high. Comparing the data in Tables 1 and 2 demonstrates the performance of our methodology: the similarity of the numbers in the two tables shows how representative each trace is with respect to its corresponding binary.

Table 3 summarizes and compares the data from Tables 1 and 2. The sampled trace size for each benchmark was restricted to fewer than approximately 100 million instructions. Table 3 quantitatively emphasizes the success of our methodology: the proposed technique achieved highly representative traces with small aggregate errors and a considerable amount of data compression. An approximate aggregate error of around 10-15% from all major sources of error (sampling, simulator, and hardware measurements) is a remarkable result. Our results further show that in almost all instances the traces exhibit dynamic behavior (ipc) within 10% of the corresponding binary; for FreeHand the delta was 11.11%, only slightly above 10%. Instruction- and data-cache misses differ by relatively higher percentages, but this is not significant because the miss rates were very low to begin with. The instruction mix difference between each trace and its corresponding binary was also found to be very low (below 5% for all the applications), as shown by the K-metric. Lastly, the emulation code was noted to be within acceptable bounds.

Table 2
MediaMark trace statistics

Benchmark   fp_op (%)  int_op (%)  fp_ls (%)  int_ls (%)  Brnch (%)  Cache (%)  Other (%)  ipc    IC miss (%)  DC miss (%)  Emul (%)
DVC_Pupp    0.00       60.48       1.82       29.89       6.64       0.68       0.47       1.28   0.14         0.42         2.64
Mpeg-1      0.07       50.21       0.63       32.13       12.98      0.95       2.96       0.95   0.28         0.52         3.99
Tartus3D    5.29       66.37       2.26       21.49       4.17       0.00       0.39       0.95   0.07         1.07         0.37
PS_Gaus     0.00       51.10       0.04       31.83       13.98      0.08       2.92       0.96   0.27         0.46         5.46
PS_Rot      0.02       62.32       0.93       19.33       11.46      0.00       5.91       0.82   0.64         0.72         12.19
FreeHand    0.00       34.01       0.80       37.55       20.22      0.01       7.39       0.48   2.89         0.98         3.70
InfiniD     2.97       29.93       4.60       26.70       16.43      0.38       18.97      0.45   2.81         0.49         11.31
Adobe_Pre   0.15       55.80       0.39       19.29       23.25      0.03       1.05       1.18   0.26         0.39         5.57

Table 3
MediaMark trace validation data

Benchmark   Binary Mill. Instr.  Trace Mill. Instr.  ipc delta (%)  IC-miss delta (%)  DC-miss delta (%)  K-metric (%)  Emul code delta (%)
DVC_Pupp    2921                 75                  1.54           12.50              13.51              0.05          9.09
Mpeg-1      7608                 65                  4.04           9.68               11.86              2.59          11.76
Tartus3D    3175                 96                  10.47          12.50              13.83              0.00          0.00
PS_Gaus     1617                 97                  0.00           12.50              9.80               3.20          9.86
PS_Rot      2184                 73                  9.33           10.34              12.20              3.63          13.82
FreeHand    578                  89                  11.11          1.03               4.26               0.04          11.90
InfiniD     1733                 93                  0.00           10.79              11.36              1.70          13.00
Adobe_Pre   10,348               78                  8.25           10.34              8.33               2.19          2.01
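As a worked check of the delta definitions of Section 2 (assuming, as the published numbers suggest, that each delta is computed relative to the binary value), the DVC_Pupp row of Table 3 follows directly from Tables 1 and 2:

    ipc delta = 100 \times |1.28 - 1.30| / 1.30 \approx 1.54%
    IC-miss delta = 100 \times |0.14 - 0.16| / 0.16 = 12.50%
    DC-miss delta = 100 \times |0.42 - 0.37| / 0.37 \approx 13.51%

The same row also illustrates the data compression achieved: a 2921-million-instruction binary is represented by a 75-million-instruction trace, a reduction of roughly 39 to 1.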


5. Conclusions

In this paper, we have presented a trace sampling and validation scheme to evaluate the representativeness of a sampled trace and its suitability for analyzing the performance of a target system. Experimental results illustrated the excellent performance of the proposed methodology. A rigorous, realistic, and challenging multimedia benchmark suite called MediaMark was used as a test case. Earlier, in [1], we showed that a rudimentary version of our technique far outperforms contemporary techniques on a fairly simple and standard benchmark suite like SPEC95.

References

[1] H. Khalid, Trace sampling, in: Proceedings of SPECTS'98, 19-22 July 1998, pp. 231-237.
[2] B. Case, SPEC95 retires SPEC92, Microprocessor Report, August 1995.
[3] H. Khalid, Tracing multimedia benchmarks with five degrees of validation, Computer Architecture News 27 (3) (1999) 43-48.
[4] S.K. Das, E.E. Johnson, Accuracy of filtered traces, in: Proceedings of the 1995 IEEE IPCCC'95, March 1995, pp. 82-86.
[5] R.E. Kessler et al., A comparison of trace-sampling techniques for multi-megabyte caches, IEEE Transactions on Computers 43 (6) (1994) 664-675.
[6] M. Kobayashi, Memory reference metrics and instruction trace sampling, in: Proceedings of the 1997 IEEE IPCCC'97, February 1997, pp. 1-7.
[7] P.K. Dubey, R. Nair, Profile driven generation of trace samples, in: Proceedings of ICCD'96, 1996, pp. 217-224.
[8] C. Roth, PowerPC performance monitor evolution, in: Proceedings of the 1997 IEEE IPCCC'97, February 1997, pp. 331-336.

Humayun Khalid was born in Karachi, Pakistan. He received his B.S.E.E. (Magna Cum Laude) and M.S.E.E. (Graduate Citation) degrees from the City College of New York, and his Ph.D. degree from the City University of New York. He was a research assistant, research associate, and member of the research group at the Computer Engineering Research Laboratory (CERL) of the City University of New York, and a lecturer at the City College for over three years (1993-1996). He has been a reviewer for several national and international conferences and journals. At present he is affiliated with the enterprise server performance team at Dell Computer Corporation, where he is responsible for performance evaluation of enterprise servers, performance projections for future platforms, modeling, architectural tradeoff analysis, design recommendations for next-generation platforms, and design support. He has published over 50 high-quality research papers in refereed journals, conferences, and technical reports. Dr. Khalid is a member of IEEE, SCS, and IEEP. His research interests include parallel computer architecture, high-performance computing, computer networks, performance evaluation, applied artificial neural networks, multimedia, and image processing.
