Lecture 2: Quantifying Performance
Topics: Speedup, Amdahl's law, Execution time
Readings: Chapter 1
August 26, 2015
CSCE 513 Computer Architecture
– 2 – CSCE 513 Fall 2015
Overview
Last time: speed-up; power wall, ILP wall, the move to multicore; definition of computer architecture (Lecture 1, slides 1-29?)
New: syllabus and other course pragmatics; website (not shown); dates
Figure 1.9 trends: CPUs, memory, network, disk. Why geometric mean? Speed-up again; Amdahl's Law
– 3 – CSCE 513 Fall 2015
Instruction Set Architecture (ISA)
"Myopic view of computer architecture"
ISAs – appendices A and K
• 80x86
• ARM
• MIPS
– 4 – CSCE 513 Fall 2015
MIPS Register Usage (Figure 1.4)
Ref. CAAQA
– 5 – CSCE 513 Fall 2015
MIPS Instructions Fig 1.5: Data Transfers
Ref. CAAQA
– 6 – CSCE 513 Fall 2015
MIPS Instructions Fig 1.5: Arithmetic/Logical
Most significant bit is bit 0; least significant bit is #63
Ref. CAAQA
– 7 – CSCE 513 Fall 2015
MIPS Instructions Fig 1.5: Control
Condition codes set by ALU operations
PC-relative branches
Jumps
Jump-and-link
Return address on function call?
Return address
Ref. CAAQA
– 8 – CSCE 513 Fall 2015
MIPS Instruction Format (RISC)
Ref. CAAQA
– 9 – CSCE 513 Fall 2015
New World: "Computer Architecture is back"
"Computer architects must design a computer to meet functional requirements as well as price, power, performance, and availability goals"
Patterson, David A.; Hennessy, John L. (2011). Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design), Kindle Locations 944-945. Elsevier Science. Kindle Edition.
YouTube: Google "Computer Architecture is back Patterson"
– 10 – CSCE 513 Fall 2015
Fig 1.7 Requirement Challenges for Computer Architects
Level of software compatibility
Operating system requirements
Standards
Ref. CAAQA
– 11 – CSCE 513 Fall 2015
Fig 1.10 Performance over last 25-40 years
Processors
Ref. CAAQA
– 12 – CSCE 513 Fall 2015
Fig 1.10 Performance over last 25-40 years
Memory
Ref. CAAQA
– 13 – CSCE 513 Fall 2015
Fig 1.10 Performance over last 25-40 years
Networks
Disk
Ref. CAAQA
– 14 – CSCE 513 Fall 2015
Fig 1.10 Performance over last 25-40 years
Processors
Ref. CAAQA
– 15 – CSCE 513 Fall 2015
Quantitative Principles of Design
Take advantage of parallelism
Principle of locality: temporal locality, spatial locality
Focus on the common case
Amdahl's Law
Ref. CAAQA
– 16 – CSCE 513 Fall 2015
Taking Advantage of Parallelism
Logic parallelism – carry-lookahead adder
Word parallelism – SIMD
Instruction pipelining – overlap fetch and execute
Multithreading – executing independent instructions at the same time
Speculative execution – doing work before it is known to be needed (e.g., past an unresolved branch)
Ref. CAAQA
– 17 – CSCE 513 Fall 2015
Principle of Locality
Rule of thumb (Zipf's law?? Not really):
A program spends 90% of its execution time in only 10% of the code.
So what do you try to optimize?
Locality of memory references:
Temporal locality
Spatial locality
– 18 – CSCE 513 Fall 2015
Amdahl's Law
Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the system is limited by the fraction of the time the enhancement can be used:

Speedup_overall = 1 / [(1 - Frac_enhanced) + Frac_enhanced / Speedup_enhanced]
Ref. CAAQA
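As a minimal sketch (function and variable names are my own, not from the text), the formula above can be computed directly:

```python
def amdahl_speedup(frac_enhanced, speedup_enhanced):
    """Overall speedup when a fraction frac_enhanced of execution time
    is accelerated by a factor of speedup_enhanced (Amdahl's Law)."""
    return 1.0 / ((1.0 - frac_enhanced) + frac_enhanced / speedup_enhanced)

# Web-server example from the next slide: 60% of time is I/O (unchanged),
# so 40% of the time is computation sped up 10x.
print(round(amdahl_speedup(0.4, 10), 4))  # 1.5625
```

Note how the unenhanced 60% caps the overall gain far below the 10x component speedup.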
– 19 – CSCE 513 Fall 2015
Amdahl's with Fractional Use Factor
Example: Suppose we are considering an enhancement to a web server. The enhanced CPU is 10 times faster on computation but the same speed on I/O. Suppose also that 60% of the time is waiting on I/O.

Speedup_overall = 1 / [(1 - Frac_enhanced) + Frac_enhanced / Speedup_enhanced]
Ref. CAAQA
– 20 – CSCE 513 Fall 2015
Amdahl's Law revisited
Speedup = (execution time without enhancement) / (execution time with enhancement) = T_wo / T_with
Notes
1. The enhancement will be used only a portion of the time.
2. If it will be rarely used, then why bother trying to improve it?
3. Focus on the improvements that have the highest fraction of use time, denoted Frac_enhanced.
4. Note Frac_enhanced is always less than 1.
Then
Ref. CAAQA
– 21 – CSCE 513 Fall 2015
Amdahl’s with Fractional Use FactorAmdahl’s with Fractional Use Factor
])1[(*ExecTimeExecTime oldnewenhanced
enhancedenhanced Speedup
FracFrac
])1[(
1
)/()(
enhanced
enhancedenhanced
newoldoverall
SpeedupFrac
Frac
ExecTimeExecTimeSpeedup
Ref. CAAQA
– 22 – CSCE 513 Fall 2015
Amdahl's with Fractional Use Factor
Example: Suppose we are considering an enhancement to a web server. The enhanced CPU is 10 times faster on computation but the same speed on I/O. Suppose also that 60% of the time is waiting on I/O. Then Frac_enhanced = 0.4 and Speedup_enhanced = 10:

Speedup_overall = 1 / [(1 - 0.4) + 0.4/10] = 1 / (0.6 + 0.04) = 1/0.64 = 1.5625
Ref. CAAQA
– 23 – CSCE 513 Fall 2015
Graphics Square Root Enhancement (p. 40)
NewDesign1 FPSQRT
• FPSQR is 20% of execution time; speed it up 10 times
NewDesign2 FP
• improve all FP by 1.6; FP = 50% of execution time
Ref. CAAQA
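As a sketch of how the two designs compare under Amdahl's Law (reading the slide as: FPSQR is 20% of execution time, sped up 10 times; all FP is 50% of execution time, sped up 1.6 times):

```python
def amdahl_speedup(frac, enhancement):
    # Amdahl's Law: overall speedup from enhancing a fraction of execution time
    return 1.0 / ((1.0 - frac) + frac / enhancement)

design1 = amdahl_speedup(0.20, 10)   # NewDesign1: speed up FPSQR only
design2 = amdahl_speedup(0.50, 1.6)  # NewDesign2: speed up all FP
print(round(design1, 4), round(design2, 4))  # 1.2195 1.2308
```

Improving the frequent case (all FP) wins, even though its per-operation speedup is much smaller.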
– 24 – CSCE 513 Fall 2015
Geometric Means vs Arithmetic Means
Ref. CAAQA
– 25 – CSCE 513 Fall 2015
Comparing 2 computers: Spec_Ratios
Ref. CAAQA
– 26 – CSCE 513 Fall 2015
Performance Measures
Response time (latency) -- time between start and completion
Throughput (bandwidth) -- rate -- work done per unit time
Processor speed -- e.g., 1 GHz
When does it matter? When does it not?

Speedup = (execution time without enhancement) / (execution time with enhancement)
Ref. CAAQA
– 27 – CSCE 513 Fall 2015
Availability

Module Availability = MTTF / (MTTF + MTTR)
Ref. CAAQA
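A minimal sketch of the availability formula; the MTTF and MTTR values below are illustrative assumptions, not from the text:

```python
def module_availability(mttf_hours, mttr_hours):
    """Fraction of time a module is operational: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical disk: MTTF of 1,000,000 hours, MTTR of 24 hours
print(module_availability(1_000_000, 24))
```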
– 28 – CSCE 513 Fall 2015
MTTF Example
Ref. CAAQA
– 29 – CSCE 513 Fall 2015
Comparing Performance (fig 1.15)

             Computer A   Computer B   Computer C
Program P1        1           10           20
Program P2     1000          100           20
Total time     1001          110           40

Comparing two programs executing on three machines.
Faster-than relationships:
A is 10 times faster than B on program 1
B is 10 times faster than A on program 2
C is 50 times faster than A on program 2
... giving 3 × 2 = 6 comparisons (3-choose-2 computer pairs × 2 programs)
So what is the relative performance of these machines?
Ref. CAAQA
– 30 – CSCE 513 Fall 2015
fig 1.15 Total Execution Times

             Computer A   Computer B   Computer C
Program P1        1           10           20
Program P2     1000          100           20
Total time     1001          110           40

Comparing two programs executing on three machines.
So now what is the relative performance of these machines? B is 1001/110 = 9.1 times as fast as A.
Arithmetic mean execution time = (1/n) × Σ Time_i
Ref. CAAQA
– 31 – CSCE 513 Fall 2015
Weighted Execution Times (fig 1.15)

             Computer A   Computer B   Computer C
Program P1        1           10           20
Program P2     1000          100           20
Total time     1001          110           40

Now assume that we know that P1 will run 90% and P2 10% of the time.
So now what is the relative performance of these machines?
time_A = 0.9 × 1 + 0.1 × 1000 = 100.9
time_B = 0.9 × 10 + 0.1 × 100 = 19
Relative performance of A to B = 100.9/19 = 5.31
Ref. CAAQA
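The weighted-time calculation above can be sketched directly; the numbers come from fig 1.15 and the 90%/10% workload mix on the slide:

```python
times = {"A": {"P1": 1, "P2": 1000}, "B": {"P1": 10, "P2": 100}}
weights = {"P1": 0.9, "P2": 0.1}  # how often each program runs

# Weighted execution time per machine: sum of weight x time over programs
weighted = {m: sum(weights[p] * t[p] for p in weights) for m, t in times.items()}
print(weighted["A"], weighted["B"])             # 100.9 19.0
print(round(weighted["A"] / weighted["B"], 2))  # 5.31
```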
– 32 – CSCE 513 Fall 2015
Geometric Means
Compare ratios of performance to a standard.
Using A as the standard:
program 1: B ratio = 10/1 = 10; C ratio = 20/1 = 20
program 2: B ratio = 100/1000 = .1; C ratio = 20/1000 = .02
B is "twice as fast" as C using A as the standard.
Using B as the standard:
program 1: A ratio = 1/10 = .1; C ratio = 20/10 = 2
program 2: A ratio = 1000/100 = 10; C ratio = 20/100 = .2
So now compare the A-standard and B-standard ratios to each other: you get the same 10 and .1. So what?
Ref. CAAQA
– 33 – CSCE 513 Fall 2015
Geometric Means (fig 1.17)
Measure performance ratios to a standard machine

                  Normalized to A       Normalized to B       Normalized to C
                  A     B      C        A      B     C        A      B     C
P1                1.0   10.0   20.0     .1     1.0   2.0      .05    .5    1.0
P2                1.0   .1     .02      10     1.0   .2       50.    5.0   1.0
Arithmetic mean   1.0   5.05   10.01    5.05   1.0   1.1      25.03  2.75  1.0
Geometric mean    1.0   1.0    .63      1.0    1.0   .63      1.58   1.58  1.0
Total time        1.0   .11    .04      9.1    1.0   .36      25.03  2.75  1.0
Ref. CAAQA
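A sketch of the geometric-mean computation behind fig 1.17, using the execution times from fig 1.15 (variable names are my own):

```python
from math import prod

# Execution times from fig 1.15 (per program, per machine)
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

def geo_mean(xs):
    # Geometric mean: n-th root of the product of n values
    return prod(xs) ** (1.0 / len(xs))

# Ratios normalized to machine A (machine's time / A's time, per program)
ratios = {m: [t / a for t, a in zip(ts, times["A"])] for m, ts in times.items()}
gmeans = {m: geo_mean(r) for m, r in ratios.items()}
print(round(gmeans["B"], 2), round(gmeans["C"], 2))  # 1.0 0.63, matching the table
```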
– 34 – CSCE 513 Fall 2015
CPU Performance Equation
Almost all computers use a clock running at a fixed rate.
Clock rate, e.g., 1 GHz (clock period = 1 ns)
Instruction Count (IC)
CPI = CPUclockCyclesForProgram / InstructionCount
CPUtime = IC × CyclesPerInstruction × ClockCycleTime

CPUtime = CPUclockCyclesForProgram × ClockCycleTime
        = CPUclockCyclesForProgram / ClockRate
Ref. CAAQA
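The equation can be exercised with a tiny helper; the program size, CPI, and clock rate below are illustrative assumptions:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPUtime = IC x CPI / ClockRate, in seconds."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 10^9 instructions, CPI = 2.0, 1 GHz clock
print(cpu_time(1e9, 2.0, 1e9))  # 2.0 (seconds)
```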
– 35 – CSCE 513 Fall 2015
CPU Performance Equation
CPUtime = Instruction Count × CPI × Clock cycle time

CPUtime = (Instructions/Program) × (ClockCycles/Instruction) × (Seconds/ClockCycle)
        = Seconds/Program

CPUcycles = Σ (i = 1 to n) IC_i × CPI_i
Ref. CAAQA
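The summation can be sketched over a hypothetical instruction mix (class names, counts, and CPIs below are invented for illustration):

```python
# Hypothetical instruction mix: (class, IC_i, CPI_i)
mix = [
    ("ALU",    5_000_000, 1.0),
    ("load",   2_000_000, 2.0),
    ("branch", 1_000_000, 1.5),
]

cpu_cycles = sum(ic * cpi for _name, ic, cpi in mix)  # sum of IC_i x CPI_i
total_ic = sum(ic for _name, ic, _cpi in mix)
print(cpu_cycles)              # 10500000.0 total cycles
print(cpu_cycles / total_ic)   # average CPI = 1.3125
```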
– 36 – CSCE 513 Fall 2015
Fallacies and Pitfalls
1. Pitfall: Falling prey to Amdahl's law.
2. Pitfall: A single point of failure.
3. Fallacy: The cost of the processor dominates the cost of the system.
4. Fallacy: Benchmarks remain valid indefinitely.
5. Fallacy: The rated mean time to failure of disks is 1,200,000 hours (almost 140 years), so disks practically never fail.
6. Fallacy: Peak performance tracks observed performance.
7. Pitfall: Fault detection can lower availability.
Ref. CAAQA
– 37 – CSCE 513 Fall 2015
List of Appendices
Ref. CAAQA
– 38 – CSCE 513 Fall 2015
Homework Set #2
1. 1.8 a-d (change 2015 to 2025 throughout the question)
2. 1.9
3. 1.12
4. 1.18
5. Matrix multiply (mm.c will be emailed and placed on the website)
   a. Compile with gcc -S
   b. Compile with gcc -O2 -S and note differences

George K. Zipf (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.
– 39 – CSCE 513 Fall 2015
1.8 [10/15/15/10/10] <1.4, 1.5> One challenge for architects is that the design created today will require several years of implementation, verification, and testing before appearing on the market. This means that the architect must project what the technology will be like several years in advance. Sometimes, this is difficult to do.
a. [10] <1.4> According to the trend in device scaling observed by Moore's law, the number of transistors on a chip in 2015 should be how many times the number in 2005?
b. [15] <1.5> The increase in clock rates once mirrored this trend. Had clock rates continued to climb at the same rate as in the 1990s, approximately how fast would clock rates be in 2015?
c. [15] <1.5> At the current rate of increase, what are the clock rates now projected to be in 2015?
d. [10] <1.4> What has limited the rate of growth of the clock rate, and what are architects doing with the extra transistors now to increase performance?
Patterson, David A.; Hennessy, John L. (2011). Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design), Kindle Locations 2203-2217. Elsevier Science. Kindle Edition.
– 40 – CSCE 513 Fall 2015
1.9 [10/10] <1.5> You are designing a system for a real-time application in which specific deadlines must be met. Finishing the computation faster gains nothing. You find that your system can execute the necessary code, in the worst case, twice as fast as necessary.
a. [10] <1.5> How much energy do you save if you execute at the current speed and turn off the system when the computation is complete?
b. [10] <1.5> How much energy do you save if you set the voltage and frequency to be half as much?
Patterson, David A.; Hennessy, John L. (2011). Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design), Kindle Locations 2218-2224. Elsevier Science. Kindle Edition.
– 41 – CSCE 513 Fall 2015
1.12 [20/20/20] <1.1, 1.2, 1.7> In a server farm such as that used by Amazon or eBay, a single failure does not cause the entire system to crash. Instead, it will reduce the number of requests that can be satisfied at any one time.
a. [20] <1.7> If a company has 10,000 computers, each with an MTTF of 35 days, and it experiences catastrophic failure only if 1/3 of the computers fail, what is the MTTF for the system?
b. [20] <1.1, 1.7> If it costs an extra $1000, per computer, to double the MTTF, would this be a good business decision? Show your work.
c. [20] <1.2> Figure 1.3 shows, on average, the cost of downtimes, assuming that the cost is equal at all times of the year. For retailers, however, the Christmas season is the most profitable (and therefore the most costly time to lose sales). If a catalog sales center has twice as much traffic in the fourth quarter as every other quarter, what is the average cost of downtime per hour during…
Patterson, David A.; Hennessy, John L. (2011). Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design), Kindle Locations 2250-2257. Elsevier Science. Kindle Edition.
– 42 – CSCE 513 Fall 2015
1.18 [10/20/20/20/25] <1.10> When parallelizing an application, the ideal speedup is speeding up by the number of processors. This is limited by two things: the percentage of the application that can be parallelized and the cost of communication. Amdahl's law takes into account the former but not the latter.
a. [10] <1.10> What is the speedup with N processors if 80% of the application is parallelizable, ignoring the cost of communication?
b. [20] <1.10> What is the speedup with 8 processors if, for every processor added, the communication overhead is 0.5% of the original execution time?
c. [20] <1.10> What is the speedup with 8 processors if, for every time the number of processors is doubled, the communication overhead is increased by 0.5% of the original execution time?
– 43 – CSCE 513 Fall 2015
d. [20] <1.10> What is the speedup with N processors if, for every time the number of processors is doubled, the communication overhead is increased by 0.5% of the original execution time?
e. [25] <1.10> Write the general equation that solves this question: What is the number of processors with the highest speedup in an application in which P% of the original execution time is parallelizable, and, for every time the number of processors is doubled, the communication is increased by 0.5% of the original execution time?
Patterson, David A.; Hennessy, John L. (2011). Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design), Kindle Locations 2327-2331. Elsevier Science. Kindle Edition.