where has this performance improvement come from? technology –more transistors per chip –faster...

Where Has This Performance Improvement Come From?

• Technology– More transistors per chip

– Faster logic

• Machine Organization/Implementation– Deeper pipelines

– More instructions executed in parallel

• Instruction Set Architecture– Reduced Instruction Set Computers (RISC)

– Multimedia extensions

– Explicit parallelism

• Compiler technology– Finding more parallelism in code

– Greater levels of optimization

Measurement and Evaluation

Architecture is an iterative process:• Search the possible design space• Make selections • Evaluate the selections made

Good IdeasGood Ideas

Mediocre IdeasBad Ideas

Cost /PerformanceAnalysis

Good measurement tools are required to accurately evaluate the selection.

Measurement Tools

• Benchmarks

• Simulation (many levels)– ISA, RTL, Gate, Transistor

• Cost, Delay, and Area Estimates

• Queuing Theory

• Rules of Thumb

• Fundamental Laws

The Bottom Line: Performance (and Cost)

• Time to run the task (Execution Time)– Execution time, response time, latency

• Tasks per day, hour, week, sec, ns … (Performance)– Throughput, bandwidth

Plane

Boeing 747

BAD/Sud Concorde

Speed

610 mph

1350 mph

DC to Paris

6.5 hours

3 hours

Passengers

470

132

Throughput (pmph)

286,700

178,200

Performance and Execution Time

Execution time and performance are reciprocals

ExTime(Y) Performance(X)

--------- = ---------------ExTime(X) Performance(Y)

• Speed of Concorde vs. Boeing 747

• Throughput of Boeing 747 vs. Concorde

Performance Terminology

“X is n% faster than Y” means:ExTime(Y) Performance(X) n

--------- = -------------- = 1 + -----

ExTime(X) Performance(Y) 100

n = 100(Performance(X) - Performance(Y))

Performance(Y)

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

n = 100(ExTime(Y) - ExTime(X))

ExTime(X)

Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

n = 100(ExTime(Y) - ExTime(X))

ExTime(X)

n = 100(15 - 10)

10

n = 50%

Amdahl's LawSpeedup due to enhancement E: ExTime w/o E Performance w/ E

Speedup(E) = ------------- = -------------------

ExTime w/ E Performance w/o E

Suppose that enhancement E accelerates a fraction

Fractionenhanced of the task by a factor Speedupenhanced,

and the remainder of the task is unaffected.

What are the new execution time and the overall

speedup due to the enhancement?

Amdahl’s Law

ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced

Speedupoverall =ExTimeold

ExTimenew

Speedupenhanced

=

1

(1 - Fractionenhanced) + Fractionenhanced

Speedupenhanced

Example of Amdahl’s Law

• Floating point instructions improved to run 2X; but only 10% of the time was spent on these instructions.

Speedupoverall =

ExTimenew =

Example of Amdahl’s Law

• Floating point instructions improved to run 2X; but only 10% of the time was spent on these instructions.

Speedupoverall = ExTimenew = 1 0.95

= 1.053

ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold

ExTimeold

• The new machine is 5.3% faster for this mix of instructions.

Make The Common Case Fast

• All instructions require an instruction fetch, only a fraction require a data fetch/store.

– Optimize instruction access over data access

• Programs exhibit locality

- 90% of time in 10% of code- Temporal Locality (items referenced recently)

- Spatial Locality (items referenced nearby)

• Access to small memories is faster– Provide a storage hierarchy such that the most frequent

accesses are to the smallest (closest) memories.

Reg'sCache

Memory Disk / Tape

Hardware/Software Partitioning

• The simple case is usually the most frequent and the easiest to optimize!

• Do simple, fast things in hardware and be sure the rest can be handled correctly in software.

Would you handled these in hardware or software:

• Integer addition?

• Accessing data from disk?

• Floating point square root?

Performance Factors

• The number of instructions/program is called the instruction count (IC).

• The average number of cycles per instruction is called the CPI.

• The number of seconds per cycle is the clock period.

• The clock rate is the multiplicative inverse of the clock period and is given in cycles per second (or MHz).

• For example, if a processor has a clock period of 5 ns, what is it’s clock rate?

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle



Aspects of CPU Performance





Instr. Cnt CPI Clock RateProgram

Compiler

Instr. Set

Organization

Technology

Marketing Metrics

MIPS = Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6

• Not effective for machines with different instruction sets

• Not effective for programs with different instruction mixes

• Uncorrelated with performance

MFLOPs = FP Operations / Time * 10^6

• Machine dependent

• Often not where time is spent

•Peak - maximum able to achieve

•Native - average for a set of benchmarks

•Relative - compared to another platform

Normalized MFLOPS:

add,sub,compare,mult 1

divide, sqrt 4

exp, sin, . . . 8

Normalized MFLOPS:

add,sub,compare,mult 1

divide, sqrt 4

exp, sin, . . . 8

Programs to Evaluate Processor Performance

• (Toy) Benchmarks– 10-100 line program

– e.g.: sieve, puzzle, quicksort

• Synthetic Benchmarks– Attempt to match average frequencies of real workloads

– e.g., Whetstone, dhrystone

• Kernels– Time critical excerpts

• Real Benchmarks

Benchmarks

• Benchmark mistakes– Only average behavior represented in test workload

– Loading level controlled inappropriately

– Caching effects ignored

– Buffer sizes not appropriate

– Ignoring monitoring overhead

– Not ensuring same initial conditions

– Collecting too much data but doing too little analysis

• Benchmark tricks– Compiler wired to optimize the workload

– Very small benchmarks used

– Benchmarks manually translated to optimize performance

SPEC: System Performance Evaluation Cooperative

• First Round SPEC CPU89– 10 programs yielding a single number

• Second Round SPEC CPU92– SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point

programs)

– Compiler flags can be set differently for different programs

• Third Round SPEC CPU95– new set of programs: SPEC CINT95 (8 integer programs) and SPEC CFP95 (10

floating point)

– Single flag setting for all programs

• Fourth Round SPEC CPU2000– new set of programs: SPEC CINT2000 (12 integer programs) and SPEC

CFP2000 (14 floating point)

– Single flag setting for all programs

– Programs in C, C++, Fortran 77, and Fortran 90

SPEC 2000• 12 integer programs:

– 2 Compression

– 2 Circuit Placement and Routing

– C Programming Language Compiler

– Combinatorial Optimization

– Chess, Word Processing

– Computer Visualization

– PERL Programming Language

– Group Theory Interpreter

– Object-oriented Database.

– Written in C (11) and C++ (1)

• 14 floating point programs:– Quantum Physics

– Shallow Water Modeling

– Multi-grid Solver

– 3D Potential Field

– Parabolic / Elliptic PDEs

– 3-D Graphics Library

– Computational Fluid Dynamics

– Image Recognition

– Seismic Wave Simulation

– Image Processing

– Computational Chemistry

– Number Theory / Primality Testing

– Finite-element Crash Simulation

– High Energy Nuclear Physics

– Pollutant Distribution

– Written in Fortran (10) and C (4)

Other SPEC Benchmarks• JVM98:

– Measures performance of Java Virtual Machines

• SFS97: – Measures performance of network file server (NFS) protocols

• Web99: – Measures performance of World Wide Web applications

• HPC96:– Measures performance of large, industrial applications

• APC, MEDIA, OPC– Meausre performance of graphics applications

• For more information about the SPEC benchmarks see: http://www.spec.org.

Conclusions on Performance

• A fundamental rule in computer architecture is to make the common case fast.

• The most accurate measure of performance is the execution time of representative real programs (benchmarks).

• Execution time is dependent on the number of instructions per program, the number of cycles per instruction, and the clock rate.

• When designing computer systems, both cost and performance need to be taken into account.

where has this performance improvement come from? technology –more transistors per chip –faster...

Documents

extimeold x

response time

new execution time

mix of instructions

extimey performancex

enhancement e

performance improvement

fraction fractionenhanced