Introduction and Performance Analysis

Upload: drsaagrawal

Post on 22-Dec-2015

Page 1: Introduction and Performance Analysis

1

Computer Architecture

Performance

What do we mean by the performance of a computer?

Two important metrics:
• Response Time or Latency – Time taken to complete a single job. Smaller is better.
• Throughput – Number of jobs done per unit of time. Larger is better.

Does one imply the other?
• Yes. E.g., if latency decreases, throughput will increase.
• No. E.g., in pipelining, the latency of a single instruction may have to be increased in order to increase throughput.
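The pipelining case can be made concrete with a small sketch. All numbers here are hypothetical (a 5-stage design, 1 ns of logic per stage, 0.2 ns of overhead per pipeline register); they are chosen only to show latency going up while throughput goes up too.

```python
# Hypothetical numbers, chosen only to illustrate the trade-off.
STAGES = 5
STAGE_DELAY_NS = 1.0      # assumed logic delay of each stage
LATCH_OVERHEAD_NS = 0.2   # assumed extra delay added by each pipeline register

# Unpipelined: an instruction passes through all the logic before the next one starts.
unpipelined_latency_ns = STAGES * STAGE_DELAY_NS
unpipelined_throughput = 1.0 / unpipelined_latency_ns   # instructions per ns

# Pipelined: the clock period must cover one stage plus its register.
cycle_ns = STAGE_DELAY_NS + LATCH_OVERHEAD_NS
pipelined_latency_ns = STAGES * cycle_ns
# Once the pipeline is full, one instruction completes every cycle.
pipelined_throughput = 1.0 / cycle_ns                   # instructions per ns

print(f"latency:    {unpipelined_latency_ns:.1f} ns -> {pipelined_latency_ns:.1f} ns")
print(f"throughput: {unpipelined_throughput:.2f} -> {pipelined_throughput:.2f} instr/ns")
```

Under these assumptions each instruction takes longer from start to finish, yet more instructions finish per unit of time.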

CPU Performance Equation

$$\text{CPU Time} = \text{Clock Cycles Needed} \times \text{Clock Cycle Time}$$

$$\text{CPU Time} = \frac{\text{No. of Instructions} \times \text{Clocks Per Instruction}}{\text{Clock Rate}}$$

Is this a measure of response time or of throughput?
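A minimal sketch of the CPU performance equation as a function. The function name and the sample numbers (1 million instructions, CPI of 2, 1 GHz clock) are illustrative assumptions, not values from the slides.

```python
def cpu_time_seconds(instruction_count, clocks_per_instruction, clock_rate_hz):
    """CPU time = clock cycles needed * clock cycle time."""
    clock_cycles = instruction_count * clocks_per_instruction
    clock_cycle_time = 1.0 / clock_rate_hz
    return clock_cycles * clock_cycle_time

# Hypothetical baseline: 1 million instructions, CPI = 2, 1 GHz clock.
baseline = cpu_time_seconds(1_000_000, 2.0, 1e9)

# Changing one factor at a time, leaving the other two fixed.
fewer_instructions = cpu_time_seconds(800_000, 2.0, 1e9)
lower_cpi = cpu_time_seconds(1_000_000, 1.5, 1e9)
faster_clock = cpu_time_seconds(1_000_000, 2.0, 2e9)

for label, t in [("baseline", baseline),
                 ("fewer instructions", fewer_instructions),
                 ("lower CPI", lower_cpi),
                 ("faster clock", faster_clock)]:
    print(f"{label:>20}: {t * 1e3:.3f} ms")
```

Each of the three factors can shorten CPU time on its own, provided the other two do not get worse as a side effect.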

How can we Improve Performance?

• No. of Instructions can be reduced by:
– Better ISA
– Better Compiler
– Better Algorithm

• Clocks Per Instruction can be reduced by:
– Better Hardware Design
– Make the common case faster

• Clock Rate can be increased by:
– Hardware Design

Numerical Assignment

A computer (3.06 GHz) has the following CPI:

Instruction Type    A    B    C
CPI                 1    2    3

An algorithm may be implemented in two ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

Instruction Type    A    B    C
I1                  0    2    2
I2                  2    2    1

1. Which implementation has fewer instructions?
2. What is the average CPI for each implementation? Which implementation is faster?
3. What is the total time taken to execute I1 and I2?
4. What can you say about the MIPS rating?

1. No. of instructions for I1 = 0 + 2 + 2 = 4 M

No. of instructions for I2 = 2 + 2 + 1 = 5 M

Hence I1 has fewer instructions.

2. Clocks required by I1 = 2*2 + 2*3 = 10 M

Clocks required by I2 = 2*1 + 2*2 + 1*3 = 9 M

Average CPI for I1 = 10/4 = 2.5

Average CPI for I2 = 9/5 = 1.8

I2 is faster because it requires fewer clock cycles, even though I1 uses fewer instructions.

3. Total time for I1 = 10 M / 3.06 GHz = 3.27 ms

Total time for I2 = 9 M / 3.06 GHz = 2.94 ms

4. MIPS rating = Million Instructions Per Second. This can be calculated from:

• CPI and clock rate of the machine

MIPS = Clock Rate / (CPI * 10^6)

• Total execution time and instruction count

MIPS = Instruction Count / (Total Execution Time * 10^6)

MIPS rating for I1 = 3060 / 2.5 = 1224 MIPS

MIPS rating for I2 = 3060 / 1.8 = 1700 MIPS

The MIPS rating for I2 is higher than the MIPS rating for I1. This is as expected, since I2 has the shorter execution time.
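The arithmetic of this assignment can be checked with a short sketch. The data comes from the assignment; the function and variable names are illustrative.

```python
CLOCK_RATE_HZ = 3.06e9
CPI = {"A": 1, "B": 2, "C": 3}
# Instruction counts in millions, per instruction type.
COUNTS_MILLIONS = {
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 2, "B": 2, "C": 1},
}

def analyze(counts, cpi, clock_rate_hz):
    """Return instruction count, average CPI, execution time and MIPS rating."""
    instructions = sum(counts.values()) * 1e6
    cycles = sum(n * cpi[kind] for kind, n in counts.items()) * 1e6
    time_s = cycles / clock_rate_hz
    return {
        "instructions": instructions,
        "avg_cpi": cycles / instructions,
        "time_ms": time_s * 1e3,
        "mips": instructions / (time_s * 1e6),
    }

results = {name: analyze(c, CPI, CLOCK_RATE_HZ) for name, c in COUNTS_MILLIONS.items()}
for name, r in results.items():
    print(f"{name}: CPI = {r['avg_cpi']:.2f}, "
          f"time = {r['time_ms']:.2f} ms, MIPS = {r['mips']:.0f}")
```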

Probable Conclusions

1. Total Number of instructions is definitely not a good metric.

2. MIPS is a good metric.

Numerical Assignment

A computer (3.06 GHz) has the following CPI:

Instruction Type    A    B    C
CPI                 5    2    3

An algorithm may be implemented in two ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

Instruction Type    A    B    C
I1                  0    2    2
I2                  1    2    0

1. Which implementation has fewer instructions?
2. What is the average CPI for each implementation? Which implementation is faster?
3. What is the total time taken to execute I1 and I2?
4. What can you say about the MIPS rating?

1. No. of instructions for I1 = 0 + 2 + 2 = 4 M

No. of instructions for I2 = 1 + 2 + 0 = 3 M

Hence I2 has fewer instructions.

2. Clocks required by I1 = 2*2 + 2*3 = 10 M

Clocks required by I2 = 1*5 + 2*2 = 9 M

Average CPI for I1 = 10/4 = 2.5

Average CPI for I2 = 9/3 = 3

I2 is faster because it requires fewer clock cycles.

3. Total time for I1 = 10 M / 3.06 GHz = 3.27 ms

Total time for I2 = 9 M / 3.06 GHz = 2.94 ms

4. MIPS rating for I1 = 3060 / 2.5 = 1224 MIPS

MIPS rating for I2 = 3060 / 3 = 1020 MIPS

The MIPS rating for I1 is higher than the MIPS rating for I2. This is unexpected, since I2 has the shorter execution time.

Conclusion

MIPS is also not a good metric for overall system performance.
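The disagreement between MIPS and execution time in this second assignment can be reproduced with a short sketch. The data comes from the assignment; the helper names are illustrative.

```python
CLOCK_RATE_HZ = 3.06e9
CPI = {"A": 5, "B": 2, "C": 3}
# Instruction counts in millions, per instruction type.
COUNTS_MILLIONS = {
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 1, "B": 2, "C": 0},
}

def exec_time_s(counts):
    """Execution time = total clock cycles / clock rate."""
    cycles = sum(n * CPI[kind] for kind, n in counts.items()) * 1e6
    return cycles / CLOCK_RATE_HZ

def mips(counts):
    """MIPS = instruction count / (execution time * 10^6)."""
    instructions = sum(counts.values()) * 1e6
    return instructions / (exec_time_s(counts) * 1e6)

# Pick the "winner" under each metric.
fastest_by_time = min(COUNTS_MILLIONS, key=lambda k: exec_time_s(COUNTS_MILLIONS[k]))
highest_mips = max(COUNTS_MILLIONS, key=lambda k: mips(COUNTS_MILLIONS[k]))

print("Shortest execution time:", fastest_by_time)
print("Highest MIPS rating:    ", highest_mips)
```

The two metrics pick different winners here: I2 finishes first, while I1 has the higher MIPS rating.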

Conclusion

Total execution time is always the better metric because it combines all the factors. It cannot be replaced by considering any one of the following alone:

1. MIPS

2. Total number of instructions

3. Clock rate

Measuring Performance

Now that we know that performance depends on the program being run, which program(s) should be used to measure performance?

Benchmarks.

Benchmarks

• Benchmarks are a set of programs specifically chosen for measuring performance.

• Types of Benchmarks
– Real Programs
– Kernel
  • Extracts the key feature from a program
– Component
– Synthetic
  • Dhrystone – Integer and String Arithmetic
  • Whetstone – Floating Point
– I/O
– Parallel

Challenges

1. Vendors may tune a benchmark to make it run better on their platform. At times this is permitted.

2. Benchmarks give a set of data rather than a single performance number.

3. Benchmarks concentrate only on computational power.

Popular Benchmarks

• SPEC – Standard Performance Evaluation Corporation
– Floating point
– Integer
– Web
– Graphics

• TPC – Transaction Processing Performance Council
– Web Server
– Transaction Processing
– Decision Support Systems

• BAPCo – Business Applications Performance Corporation
– Popular business applications

• EEMBC – Embedded Microprocessor Benchmark Consortium
– Embedded Applications

Statistical Summarization of Data

For the response time metric: Arithmetic Mean

For the throughput metric: Harmonic Mean or Geometric Mean

SPEC uses the Geometric Mean.
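A minimal sketch of the three means, using Python's standard statistics module (geometric_mean requires Python 3.8 or later). The sample values are hypothetical, not measured benchmark results.

```python
import statistics

# Hypothetical measurements for three benchmark programs.
response_times_s = [2.0, 4.0, 8.0]   # time per job: summarize with the arithmetic mean
rates_jobs_per_s = [1.0, 2.0, 4.0]   # throughput: summarize with a harmonic or geometric mean

arithmetic = statistics.mean(response_times_s)
harmonic = statistics.harmonic_mean(rates_jobs_per_s)
geometric = statistics.geometric_mean(rates_jobs_per_s)

print(f"arithmetic mean of times: {arithmetic:.3f} s")
print(f"harmonic mean of rates:   {harmonic:.3f} jobs/s")
print(f"geometric mean of rates:  {geometric:.3f} jobs/s")
```

For positive values that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean the largest, so the choice of mean changes the summary number that gets reported.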

Are Benchmarks enough?

Benchmarks give the overall performance. To optimize performance, however, it may be necessary to know which instruction or section of the program takes the most time.

Profilers do this job.

Profiling or Dynamic Program Analysis

• Program behavior is analysed as it is being run.

• Techniques used
– Instruction Set Simulation
– Hardware Interrupts
– OS Hooks
– Code Instrumentation

• Examples: Intel VTune, gprof
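As a small illustration of profiling, the sketch below uses Python's standard cProfile and pstats modules, standing in for tools such as VTune or gprof. The two functions being profiled are hypothetical examples written for this sketch.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Sum of squares of 0..n-1, computed with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_sum(n):
    """The same sum, computed with the closed-form formula."""
    return (n - 1) * n * (2 * n - 1) // 6

def workload():
    return slow_sum(200_000), fast_sum(200_000)

# Collect timing data while the program runs.
profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Format the five most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists each function with its call count and time, which points to the section of the program where most of the time is spent.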

Simulation

• Building the actual system is difficult. Simulation is cost effective.

• Beneficial for learning/improving some aspect of architecture.

• Simulators available are:
– Keil – Instruction Simulator
– Little Man Computer Simulator – Simulator of a machine
– Cacheprof – Cache Simulator

Moore’s Law (1965)

Moore's Law states that the number of transistors on a single chip at the same price will double every 18 to 24 months.
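The doubling rule can be turned into a simple projection. This is only a sketch: the starting count of one million transistors is a hypothetical value, and the function name is illustrative.

```python
def projected_transistors(initial_count, years, doubling_period_months):
    """Transistor count after `years`, doubling once every `doubling_period_months`."""
    doublings = (years * 12) / doubling_period_months
    return initial_count * 2 ** doublings

initial = 1_000_000  # hypothetical starting transistor count
for period in (18, 24):
    count = projected_transistors(initial, years=6, doubling_period_months=period)
    print(f"doubling every {period} months: {count:,.0f} transistors after 6 years")
```

Over six years, the difference between an 18-month and a 24-month doubling period is already a factor of two (16x growth versus 8x).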

Implication?

As more transistors are packed into the same chip area, they become smaller and switch faster. Hence circuits become faster and the clock rate increases.

Moore’s Law, in combination with other factors such as ILP (Instruction Level Parallelism), was responsible for major performance improvements for a long time.

Trends in Computing (Intel Processors)

                          Fastest processor reported in text, 2003    Current fastest processor, 2008
Processor name            Pentium 4                                   Intel® Core™ i7-965 Processor Extreme Edition
Processor speed           3.20 GHz                                    3.20 GHz
Primary level cache       12KB + 8KB                                  4x32KB
Secondary cache           512 KB                                      4x256KB Level 2 cache
Third level cache         2 MB                                        Unified inclusive 8MB L3

Observations

Processor Speed or Clock Rate has not changed!!!

What does the "4x" in the cache sizes stand for?

The Answer

Multi-Core Approach: the additional transistors are being used to pack more cores onto a chip, rather than to increase clock speed.

Why?

1. Power Wall

2. Memory Wall

3. ILP Wall (no more instruction level parallelism left to exploit)

Topics for further Study

• Papers
– Performance papers
– Memory Wall

• Software
– Intel VTune or any other profiling tool
– Little Man Computer Simulator or any other simulator apart from Keil

Amdahl’s Law

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected by improvement}$$

What does this mean?

Even if we substantially increase the performance of any one component, the overall performance may not improve substantially.

A new architecture reduces the time taken by memory instructions by 50%. If memory instructions account for 50% of the total time taken, what is the overall improvement in performance?

Told = 100, Tnew = 50/2 + 50 = 25 + 50 = 75. Execution time is reduced by 25%.
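The worked example can be checked with a short sketch of Amdahl's Law. It assumes an improvement factor of 2 for the memory instructions, which is what the figure of 25 in the arithmetic above corresponds to; the function name is illustrative.

```python
def time_after_improvement(affected, unaffected, improvement_factor):
    """Amdahl's Law: only the affected part of the time is divided by the factor."""
    return affected / improvement_factor + unaffected

t_old = 100.0
# Memory instructions take 50 of the 100 time units and are assumed to run 2x faster.
t_new = time_after_improvement(affected=50.0, unaffected=50.0, improvement_factor=2.0)

time_reduction = (t_old - t_new) / t_old
speedup = t_old / t_new

print(f"Tnew = {t_new:.0f}")
print(f"execution time reduced by {time_reduction:.0%}")
print(f"overall speedup = {speedup:.2f}x")
```

Doubling the speed of half the work gives an overall speedup of only about 1.33x, because the unaffected half still takes the same time.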

Which is better?

a. A 20% increase in the performance of instructions that account for 90% of execution time.

b. A 90% increase in the performance of instructions that account for 20% of execution time.
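The answer depends on how "x% increase in performance" is read, so the sketch below tries two readings, both of which are assumptions: (1) the affected instructions run (1 + x) times faster, and (2) the time they take shrinks by x%. The function name is illustrative.

```python
def new_time_fraction(affected_fraction, improvement_factor):
    """Amdahl's Law with the original execution time normalized to 1."""
    return affected_fraction / improvement_factor + (1.0 - affected_fraction)

# Reading 1 (assumption): an x% performance increase is a speed factor of 1 + x.
a_speed = new_time_fraction(0.90, 1.20)   # about 0.85 of the original time
b_speed = new_time_fraction(0.20, 1.90)   # about 0.905 of the original time

# Reading 2 (assumption): the affected time shrinks by x%, a factor of 1 / (1 - x).
a_time = new_time_fraction(0.90, 1.0 / (1.0 - 0.20))   # 0.9 * 0.8 + 0.1
b_time = new_time_fraction(0.20, 1.0 / (1.0 - 0.90))   # 0.2 * 0.1 + 0.8

print(f"reading 1: a -> {a_speed:.3f}, b -> {b_speed:.3f}")
print(f"reading 2: a -> {a_time:.3f}, b -> {b_time:.3f}")
```

Under the first reading option a wins, since it touches far more of the execution time. Under the second reading both options leave about 82% of the original time, so they tie.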