Lecture 2: Performance Evaluation
Performance definition, benchmarks, summarizing performance, Amdahl’s law, and CPI

What Does Performance Mean?
Response time
– A simulation program finishes in 5 minutes
Throughput
– A web server serves 5 million requests per second
Other metrics
– MIPS (million instructions per second)
– MFLOPS (million floating-point operations per second)
– Clock frequency

Execution Time
Processor design is concerned with the processor time consumed by program execution. Shorter execution time =>
– Shorter response time
– Higher throughput
Execution time = #inst × CPI × cycle time
– What affects #inst, CPI, and cycle time?
– Almost every design decision can be interpreted through its effect on these three factors
Any other metric is meaningful only if it is consistent with execution time
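
A minimal sketch of the execution time equation. The instruction count, CPI values, and clock rate below are hypothetical numbers chosen for illustration, not measurements from the lecture.

```python
def execution_time(inst_count, cpi, cycle_time_s):
    """Execution time = #inst x CPI x cycle time, in seconds."""
    return inst_count * cpi * cycle_time_s

# Hypothetical program: 1e9 instructions, CPI of 2.0, 1 GHz clock (1 ns cycle).
base = execution_time(1e9, 2.0, 1e-9)

# Same program after a hypothetical design change that lowers CPI to 1.5.
improved = execution_time(1e9, 1.5, 1e-9)

print(f"base: {base:.2f} s, improved: {improved:.2f} s")
```

Any of the three factors can be varied the same way to see its effect on execution time.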

Performance of Computers
Performance is defined for a program and a machine.
How to compare computers? We need benchmark programs:
– Real applications: scientific programs, compilers, text-processing software, image processing
– Modified applications: providing portability and focus
– Kernels: good for isolating the performance of individual features
   Lmbench: measures latency and bandwidth of memory, file system, networking, etc.
– Toy benchmarks
– Synthetic benchmarks: matching an average execution profile

Performance Comparison
“X is n times faster than Y”:

$$ n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} $$

n is the speedup if we are considering an enhancement, optimization, etc.
What does “improving” mean?
– Improve performance: decrease execution time, increase throughput
– Improve execution time: decrease execution time
– Degrade performance: the reverse of the above; brings a “negative” speedup (n < 1)
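
The definition of n can be sketched in a few lines. The two execution times below are hypothetical; `speedup` is a helper name introduced here for illustration.

```python
def speedup(time_y, time_x):
    """Return n such that machine X is n times faster than machine Y."""
    return time_y / time_x

# Hypothetical timings: Y needs 10 s, X needs 5 s for the same program.
n = speedup(time_y=10.0, time_x=5.0)

# Performance is the reciprocal of execution time, so the ratio is the same.
n_from_performance = (1 / 5.0) / (1 / 10.0)

print(n, n_from_performance)
```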

Benchmark Suite
A benchmark suite is a collection of benchmarks covering a variety of applications
– Alleviates the weakness of any single benchmark
– More representative for computer designers evaluating their designs
– Benchmarks test both the computer and the compiler, and in many cases the OS
Desktop benchmarks: CPU, memory, and graphics performance
Server benchmarks: throughput-oriented, I/O and OS intensive
Embedded benchmarks: measuring the ability to meet deadlines and save power

Summarizing Performance
Given the performance of a set of programs, how do we evaluate the performance of the machines?

                 A      B      C
P1 (secs)        1     10     20
P2 (secs)     1000    100     20
Total (secs)  1001    110     40

Which computer is the “best” one?

Arithmetic Mean
Total execution time / (number of programs)

$$ \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i $$

– Simple and intuitive
– Representative if the user runs the programs an equal number of times
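
As a sketch, the arithmetic mean can be applied to the P1 and P2 times of machines A, B, and C from the Summarizing Performance slide. The dictionary layout and function name are choices made here for illustration.

```python
def arithmetic_mean(times):
    """(1/n) * sum of Time_i."""
    return sum(times) / len(times)

# Execution times in seconds for [P1, P2], taken from the slide's table.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

means = {machine: arithmetic_mean(t) for machine, t in times.items()}
best = min(means, key=means.get)

print(means, "best:", best)
```

With equal weighting, machine C has the lowest mean execution time.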

Weighted Arithmetic Mean
Give (different) weights to different programs
– Considering the frequencies of the programs in the workload

$$ \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i, \qquad \sum_{i=1}^{n} \text{Weight}_i = 1 $$
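
A small sketch of how weights can change the ranking. The execution times are the A, B, C values from the Summarizing Performance slide; the 0.9 / 0.1 weights are a hypothetical workload in which P1 runs far more often than P2.

```python
import math

def weighted_arithmetic_mean(times, weights):
    """Sum of Weight_i * Time_i, with the weights required to sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))

# Execution times in seconds for [P1, P2], from the slide's table.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

# Hypothetical workload: P1 makes up 90% of runs, P2 the remaining 10%.
weights = [0.9, 0.1]

weighted = {m: weighted_arithmetic_mean(t, weights) for m, t in times.items()}
best = min(weighted, key=weighted.get)

print(weighted, "best:", best)
```

Under these assumed weights machine B comes out ahead, whereas equal weights favor C.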

Geometric Means
Based on performance relative to a reference machine

$$ \text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i} $$

Relative performance is consistent across different reference machines
– If C is 2x faster than B (using B as the reference) and B is 2x faster than A (A as the reference), then C is 4x faster than A (A as the reference)

$$ \frac{\text{Geometric mean}(X_i)}{\text{Geometric mean}(Y_i)} = \text{Geometric mean}\left(\frac{X_i}{Y_i}\right) $$
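
The consistency property can be checked numerically. This sketch uses the A, B, C times from the Summarizing Performance slide and assumes Python 3.8 or later for `math.prod`; the helper names are introduced here for illustration.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

# Execution times in seconds for [P1, P2], from the slide's table.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

def normalized_gm(machine, reference):
    """Geometric mean of execution time ratios against a reference machine."""
    ratios = [t / r for t, r in zip(times[machine], times[reference])]
    return geometric_mean(ratios)

# Compare C against B, once with A as the reference and once with B.
via_a = normalized_gm("C", "A") / normalized_gm("B", "A")
via_b = normalized_gm("C", "B") / normalized_gm("B", "B")

print(via_a, via_b)
```

Both routes give the same relative result, so the choice of reference machine does not change the comparison.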

Harmonic Mean
Given speedups s1, s2, …, s_n, the average speedup by harmonic mean is
n / (1/s1 + 1/s2 + … + 1/s_n)
Why not arithmetic mean?
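
One way to approach the question is to compare both means with the speedup of the total run. The sketch below assumes, as a simplification not stated on the slide, that every program took the same time before the enhancement; the speedups and the 10-second baseline are hypothetical.

```python
def harmonic_mean(speedups):
    """n / (1/s1 + 1/s2 + ... + 1/s_n)."""
    return len(speedups) / sum(1.0 / s for s in speedups)

def arithmetic_mean(values):
    return sum(values) / len(values)

speedups = [2.0, 4.0]   # hypothetical per-program speedups
old_time = 10.0         # assumed equal baseline time per program, in seconds

# Actual speedup of the whole workload, computed from total times.
total_old = old_time * len(speedups)
total_new = sum(old_time / s for s in speedups)
actual = total_old / total_new

print(harmonic_mean(speedups), arithmetic_mean(speedups), actual)
```

Under the equal-baseline assumption the harmonic mean matches the actual overall speedup, while the arithmetic mean overstates it.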

Amdahl’s Law
We know about performance: defining, measuring, and summarizing
How do we maximize performance gains from the beginning of our design?
Principle: Make the Common Case Fast!

Amdahl’s Law
Predict the overall speedup from the “local speedup” of an enhancement, provided the fraction of time in which the enhancement is used is known.
– “Local speedup” is related to design and optimization objectives, such as doubling the CPU frequency or halving the cache latency
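
Amdahl’s Law

$$ \text{Execution time}_{new} = \text{Execution time}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right] $$

$$ \text{Speedup}_{overall} = \frac{\text{Execution time}_{old}}{\text{Execution time}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} $$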
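
A minimal sketch of the overall speedup formula, showing how the unenhanced fraction limits the gain. The 40% fraction and the list of local speedups are hypothetical values.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

fraction = 0.4  # hypothetical: the enhancement applies to 40% of execution time

# Ever larger local speedups yield diminishing overall gains.
for local in (2, 10, 100, 1e6):
    print(local, round(amdahl_speedup(fraction, local), 3))

# The overall speedup can never exceed 1 / (1 - F).
upper_bound = 1.0 / (1.0 - fraction)
print("upper bound:", round(upper_bound, 3))
```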
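
Equation Based on Instruction Types

$$ \text{CPU time} = \text{CPU clock cycles} \times \text{Clock cycle time} $$

$$ \text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i $$

$$ \text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time} $$

$$ \text{CPI} = \sum_{i=1}^{n} \text{Instruction frequency}_i \times \text{CPI}_i $$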
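
The per-type equations can be sketched as follows. The three-class instruction mix, the instruction count, and the 1 ns cycle time are hypothetical values used only to exercise the formulas.

```python
def average_cpi(mix):
    """CPI = sum of frequency_i * CPI_i; mix is a list of (frequency, CPI) pairs."""
    return sum(freq * cpi for freq, cpi in mix)

def cpu_time(instruction_count, mix, clock_cycle_time_s):
    """CPU time = IC * CPI * clock cycle time, in seconds."""
    return instruction_count * average_cpi(mix) * clock_cycle_time_s

# Hypothetical mix: 50% ALU (CPI 1), 30% load/store (CPI 2), 20% branch (CPI 3).
mix = [(0.5, 1.0), (0.3, 2.0), (0.2, 3.0)]

cpi = average_cpi(mix)
seconds = cpu_time(1e9, mix, 1e-9)  # 1e9 instructions on a 1 GHz clock

print(f"CPI = {cpi:.2f}, CPU time = {seconds:.2f} s")
```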

Make Design Choice Using CPU Time Equation
Assume we need to improve the performance of a graphics engine:

            FP     FPSQR   Other
Frequency   25%    2%      75%
CPI         4.0    20      1.33

Alternative 1: reduce CPI_FPSQR from 20 to 2
Alternative 2: reduce CPI_FP from 4 to 2.5
Which one is better? Calculate the speedups.
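
One possible way to work the comparison is sketched below. It assumes that FPSQR instructions are counted within the 25% FP share (the listed frequencies otherwise sum to 102%), and that instruction count and clock cycle time stay unchanged, so the speedup equals the ratio of CPIs.

```python
# Values from the slide's table.
freq_fp, freq_fpsqr, freq_other = 0.25, 0.02, 0.75
cpi_fp, cpi_fpsqr, cpi_other = 4.0, 20.0, 1.33

# Original CPI, with FPSQR assumed to be part of the FP class.
cpi_original = freq_fp * cpi_fp + freq_other * cpi_other

# Alternative 1: CPI of FPSQR drops from 20 to 2.
cpi_alt1 = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Alternative 2: CPI of all FP instructions drops from 4 to 2.5.
cpi_alt2 = freq_fp * 2.5 + freq_other * cpi_other

speedup_alt1 = cpi_original / cpi_alt1
speedup_alt2 = cpi_original / cpi_alt2

print(f"alt 1: {speedup_alt1:.2f}, alt 2: {speedup_alt2:.2f}")  # alt 1: 1.22, alt 2: 1.23
```

Under these assumptions, improving all FP instructions gives a slightly larger speedup than the much larger improvement to FPSQR alone.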

Amdahl’s Law
Choice one: speed up FP square root by 10x
Choice two: speed up all FP instructions by 1.6x
20% of execution time is spent in FP square root, 50% in all FP instructions
Which choice is better?
Implication: optimize for the common case first

SPEC CPU Benchmark
SPEC: Standard Performance Evaluation Corporation
CPU-intensive benchmarks for evaluating the processor performance of workstations
Four generations: SPEC89, SPEC92, SPEC95, and SPEC2000
Two types of programs: INT and FP
Emphasis on memory system performance in SPEC2000

SPEC CPU2000 Profiling
Dynamic instruction mix

Instruction      Int avg   FP avg
Load int         26%       15%
Store int        10%       2%
Load fp          -         15%
Store fp         -         7%
Add              19%       23%
All fp inst      -         41%
Cond br.         12%       4%
All ctrl inst    16%       4%

Other SPEC Benchmarks
SPECviewperf and SPECapc: 3D graphics performance
SPEC JVM98: performance of client-side Java virtual machines
SPEC JBB2000: server-client Java application
SPEC WEB99: evaluating WWW servers
SPEC HPC96: parallel and distributed computing

Server Benchmarks
SPEC CPU2000, WEB99, SFS97
TPC: measuring the ability of a system to handle transactions
– TPC-C: online transaction processing (OLTP) benchmark (for bank systems)
– TPC-H: ad hoc decision support
– TPC-R: decision support with standard queries
– TPC-W: simulating a business-oriented transactional web server

Embedded Benchmark
EEMBC (Embedded Microprocessor Benchmark Consortium) benchmarks
– Based on kernel performance
– Five classes: automotive/industrial, consumer, networking, office automation, and telecommunications
Embedded benchmarks are not yet mature