where has this performance improvement come from? technology –more transistors per chip –faster...
TRANSCRIPT
Where Has This Performance Improvement Come From?
• Technology– More transistors per chip
– Faster logic
• Machine Organization/Implementation– Deeper pipelines
– More instructions executed in parallel
• Instruction Set Architecture– Reduced Instruction Set Computers (RISC)
– Multimedia extensions
– Explicit parallelism
• Compiler technology– Finding more parallelism in code
– Greater levels of optimization
Measurement and Evaluation
Architecture is an iterative process:• Search the possible design space• Make selections • Evaluate the selections made
Good IdeasGood Ideas
Mediocre IdeasBad Ideas
Cost /PerformanceAnalysis
Good measurement tools are required to accurately evaluate the selection.
Measurement Tools
• Benchmarks
• Simulation (many levels)– ISA, RTL, Gate, Transistor
• Cost, Delay, and Area Estimates
• Queuing Theory
• Rules of Thumb
• Fundamental Laws
The Bottom Line: Performance (and Cost)
• Time to run the task (Execution Time)– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)– Throughput, bandwidth
Plane
Boeing 747
BAD/Sud Concorde
Speed
610 mph
1350 mph
DC to Paris
6.5 hours
3 hours
Passengers
470
132
Throughput (pmph)
286,700
178,200
Performance and Execution Time
Execution time and performance are reciprocals
ExTime(Y) Performance(X)
--------- = ---------------ExTime(X) Performance(Y)
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
Performance Terminology
“X is n% faster than Y” means:ExTime(Y) Performance(X) n
--------- = -------------- = 1 + -----
ExTime(X) Performance(Y) 100
n = 100(Performance(X) - Performance(Y))
Performance(Y)
Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?
n = 100(ExTime(Y) - ExTime(X))
ExTime(X)
Example
Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?
n = 100(ExTime(Y) - ExTime(X))
ExTime(X)
n = 100(15 - 10)
10
n = 50%
Amdahl's LawSpeedup due to enhancement E: ExTime w/o E Performance w/ E
Speedup(E) = ------------- = -------------------
ExTime w/ E Performance w/o E
Suppose that enhancement E accelerates a fraction
Fractionenhanced of the task by a factor Speedupenhanced,
and the remainder of the task is unaffected.
What are the new execution time and the overall
speedup due to the enhancement?
Amdahl’s Law
ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced
Speedupoverall =ExTimeold
ExTimenew
Speedupenhanced
=
1
(1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
Example of Amdahl’s Law
• Floating point instructions improved to run 2X; but only 10% of the time was spent on these instructions.
Speedupoverall =
ExTimenew =
Example of Amdahl’s Law
• Floating point instructions improved to run 2X; but only 10% of the time was spent on these instructions.
Speedupoverall = ExTimenew = 1 0.95
= 1.053
ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold
ExTimeold
• The new machine is 5.3% faster for this mix of instructions.
Make The Common Case Fast
• All instructions require an instruction fetch, only a fraction require a data fetch/store.
– Optimize instruction access over data access
• Programs exhibit locality
- 90% of time in 10% of code- Temporal Locality (items referenced recently)
- Spatial Locality (items referenced nearby)
• Access to small memories is faster– Provide a storage hierarchy such that the most frequent
accesses are to the smallest (closest) memories.
Reg'sCache
Memory Disk / Tape
Hardware/Software Partitioning
• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware and be sure the rest can be handled correctly in software.
Would you handled these in hardware or software:
• Integer addition?
• Accessing data from disk?
• Floating point square root?
Performance Factors
• The number of instructions/program is called the instruction count (IC).
• The average number of cycles per instruction is called the CPI.
• The number of seconds per cycle is the clock period.
• The clock rate is the multiplicative inverse of the clock period and is given in cycles per second (or MHz).
• For example, if a processor has a clock period of 5 ns, what is it’s clock rate?
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
Aspects of CPU Performance
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
Instr. Cnt CPI Clock RateProgram
Compiler
Instr. Set
Organization
Technology
Marketing Metrics
MIPS = Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6
• Not effective for machines with different instruction sets
• Not effective for programs with different instruction mixes
• Uncorrelated with performance
MFLOPs = FP Operations / Time * 10^6
• Machine dependent
• Often not where time is spent
•Peak - maximum able to achieve
•Native - average for a set of benchmarks
•Relative - compared to another platform
Normalized MFLOPS:
add,sub,compare,mult 1
divide, sqrt 4
exp, sin, . . . 8
Normalized MFLOPS:
add,sub,compare,mult 1
divide, sqrt 4
exp, sin, . . . 8
Programs to Evaluate Processor Performance
• (Toy) Benchmarks– 10-100 line program
– e.g.: sieve, puzzle, quicksort
• Synthetic Benchmarks– Attempt to match average frequencies of real workloads
– e.g., Whetstone, dhrystone
• Kernels– Time critical excerpts
• Real Benchmarks
Benchmarks
• Benchmark mistakes– Only average behavior represented in test workload
– Loading level controlled inappropriately
– Caching effects ignored
– Buffer sizes not appropriate
– Ignoring monitoring overhead
– Not ensuring same initial conditions
– Collecting too much data but doing too little analysis
• Benchmark tricks– Compiler wired to optimize the workload
– Very small benchmarks used
– Benchmarks manually translated to optimize performance
SPEC: System Performance Evaluation Cooperative
• First Round SPEC CPU89– 10 programs yielding a single number
• Second Round SPEC CPU92– SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating point
programs)
– Compiler flags can be set differently for different programs
• Third Round SPEC CPU95– new set of programs: SPEC CINT95 (8 integer programs) and SPEC CFP95 (10
floating point)
– Single flag setting for all programs
• Fourth Round SPEC CPU2000– new set of programs: SPEC CINT2000 (12 integer programs) and SPEC
CFP2000 (14 floating point)
– Single flag setting for all programs
– Programs in C, C++, Fortran 77, and Fortran 90
SPEC 2000• 12 integer programs:
– 2 Compression
– 2 Circuit Placement and Routing
– C Programming Language Compiler
– Combinatorial Optimization
– Chess, Word Processing
– Computer Visualization
– PERL Programming Language
– Group Theory Interpreter
– Object-oriented Database.
– Written in C (11) and C++ (1)
• 14 floating point programs:– Quantum Physics
– Shallow Water Modeling
– Multi-grid Solver
– 3D Potential Field
– Parabolic / Elliptic PDEs
– 3-D Graphics Library
– Computational Fluid Dynamics
– Image Recognition
– Seismic Wave Simulation
– Image Processing
– Computational Chemistry
– Number Theory / Primality Testing
– Finite-element Crash Simulation
– High Energy Nuclear Physics
– Pollutant Distribution
– Written in Fortran (10) and C (4)
Other SPEC Benchmarks• JVM98:
– Measures performance of Java Virtual Machines
• SFS97: – Measures performance of network file server (NFS) protocols
• Web99: – Measures performance of World Wide Web applications
• HPC96:– Measures performance of large, industrial applications
• APC, MEDIA, OPC– Meausre performance of graphics applications
• For more information about the SPEC benchmarks see: http://www.spec.org.
Conclusions on Performance
• A fundamental rule in computer architecture is to make the common case fast.
• The most accurate measure of performance is the execution time of representative real programs (benchmarks).
• Execution time is dependent on the number of instructions per program, the number of cycles per instruction, and the clock rate.
• When designing computer systems, both cost and performance need to be taken into account.