CPU Performance Enhancements

CS2052 Computer Architecture

Computer Science & Engineering

University of Moratuwa

Dilum Bandara

Dilum.Bandara@uom.lk

Pipelining – It’s Natural!

Laundry example

Amal, Bimal, Chamal, & Dinal each have one load of clothes to wash, dry, & fold

Washer takes 30 minutes

Dryer takes 40 minutes

Folder takes 20 minutes

[Figure: the four loads, labeled A, B, C, & D]

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads

If they learned pipelining, how long would laundry take?

[Figure: timeline from 6 PM to midnight; the four loads run back to back, each taking 30 + 40 + 20 minutes]

Pipelined Laundry – Start Work ASAP

Pipelined laundry takes 3.5 hours for 4 loads

[Figure: timeline starting at 6 PM; the loads overlap, giving segments of 30, 40, 40, 40, 40, & 20 minutes]
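The two totals (6 hours sequential, 3.5 hours pipelined) can be checked with a short calculation. The sketch below is illustrative only: the function names are invented here, and it assumes a load moves to the next stage as soon as that stage is free, with loads allowed to wait between stages.

```python
# Stage times from the laundry example, in minutes.
STAGES = [30, 40, 20]  # washer, dryer, folder


def sequential_minutes(loads, stages=STAGES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGES):
    """Each load enters a stage once it has left the previous stage
    and that stage is free."""
    stage_free_at = [0] * len(stages)  # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(stages):
            start = max(t, stage_free_at[i])
            t = start + duration
            stage_free_at[i] = t
        finish = t
    return finish


print(sequential_minutes(4) / 60)  # 6.0 hours
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the 40-minute dryer is the bottleneck stage.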

Pipelining Lessons

Pipelining doesn’t reduce the latency of a single task

It improves the throughput of the entire workload

Pipeline rate is limited by the slowest pipeline stage

Multiple tasks operate simultaneously

Potential speedup = number of pipeline stages

Unbalanced pipeline stage lengths reduce speedup

Time to fill the pipeline & time to drain/flush it reduce speedup

[Figure: pipelined laundry timeline starting at 6 PM; segments of 30, 40, 40, 40, 40, & 20 minutes]

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm
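The effect of fill and drain time on speedup can be sketched for an idealized pipeline. This is a simplified model rather than anything from the slides: it assumes all stages take exactly one time unit, and the function name is made up for illustration.

```python
def pipeline_speedup(tasks, stages):
    """Speedup of an ideal pipeline with equal-length stages.

    Unpipelined: every task takes `stages` time units, one after another.
    Pipelined:   `stages` units to fill, then one task completes per unit.
    """
    unpipelined = tasks * stages
    pipelined = stages + (tasks - 1)
    return unpipelined / pipelined


# With 5 stages, the speedup approaches (but never reaches) 5 as tasks grow.
for n in (1, 4, 100, 10_000):
    print(n, round(pipeline_speedup(n, stages=5), 2))
```

With only a few tasks, most of the time goes to filling and draining, so the speedup stays well below the number of stages.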

Instruction Level Parallelism (ILP)

CPU Pipelines

Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline

5-stage MIPS pipeline

Pipeline With a Branch Penalty Due to a Taken Branch

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm
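One common way to estimate the cost of taken-branch penalties is to add the average stall cycles per instruction to the base CPI. The sketch below uses that textbook-style model; every number in it (branch frequency, taken rate, penalty) is a hypothetical value chosen for illustration, not a figure from this lecture.

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, penalty_cycles):
    """Average cycles per instruction when each taken branch stalls the pipeline.

    branch_fraction: share of instructions that are branches
    taken_fraction:  share of those branches that are taken
    penalty_cycles:  stall cycles paid for each taken branch
    """
    stall_per_instruction = branch_fraction * taken_fraction * penalty_cycles
    return base_cpi + stall_per_instruction


# Hypothetical workload: 20% branches, 60% of them taken, 2-cycle penalty.
cpi = effective_cpi(base_cpi=1.0, branch_fraction=0.2,
                    taken_fraction=0.6, penalty_cycles=2)
print(round(cpi, 2))  # 1.24
```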

Superscalar Architectures

Executes more than 1 instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm

Intel Hyper Threading (HT)

Introduced with Intel Pentium 4

Allows 2 different resources of the CPU to be used at the same time

While the 1st thread (instruction) is working with integers (integer unit), the 2nd thread can work on floating point numbers (floating point unit)

OS sees 2 logical CPUs

Achieved through a mix of shared, replicated, & partitioned chip resources such as:

Registers

Arithmetic units

Cache memory
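The logical CPUs that the OS sees can be queried from a program. A minimal sketch using Python's standard library is shown below; note that `os.cpu_count()` reports logical CPUs, so on a machine with Hyper Threading enabled the value is typically twice the number of physical cores (an assumption about the machine, not something the call itself tells you).

```python
import os

# Number of logical CPUs reported by the OS (None if it cannot be determined).
logical_cpus = os.cpu_count()
print("Logical CPUs seen by the OS:", logical_cpus)
```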

Amdahl’s Law

What’s the maximum expected improvement to an overall system when only part of it is improved?

Amdahl said this relationship is not linear

Amdahl’s Law (Cont.)

Best you could ever hope to do:

$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$
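For reference, this best-case bound follows from the usual general form of Amdahl's Law. The symbol Speedup_enhanced does not appear on this slide; it is introduced here to denote the factor by which the enhanced fraction runs faster. Letting that factor grow without limit gives the bound:

```latex
\[
\text{Speedup}_{\text{overall}}
  = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}})
      + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\;\xrightarrow{\;\text{Speedup}_{\text{enhanced}} \to \infty\;}\;
\frac{1}{1 - \text{Fraction}_{\text{enhanced}}}
\]
```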

Amdahl’s Law – Example

Floating point instructions improved to run 2X, but only 10% of actual instructions are FP

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}}$$

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$
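The arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch; the function and parameter names are chosen here for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only `fraction_enhanced` of the execution time
    is made `speedup_enhanced` times faster."""
    new_time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1 / new_time


def amdahl_limit(fraction_enhanced):
    """Best possible speedup: the enhanced part takes no time at all."""
    return 1 / (1 - fraction_enhanced)


print(round(amdahl_speedup(0.10, 2), 3))  # 1.053 (the example above)
print(round(amdahl_limit(0.10), 3))       # 1.111 (even with infinitely fast FP)
```

Even an infinitely fast floating point unit would give only about an 11% improvement here, because 90% of the instructions are untouched.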

Moore’s Law – Today’s Status

Moore’s Law – No. of transistors on a chip tends to double about every 2 years

Transistor count still rising

Clock speed flattening sharply

Source: www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg
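The "doubles about every 2 years" rule can be written as simple exponential growth. The sketch below is only a back-of-the-envelope model; the starting count of 1 million transistors is a hypothetical value, not data from the chart.

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Transistor count after `years`, assuming one doubling per period."""
    return initial_count * 2 ** (years / doubling_period_years)


# Hypothetical chip with 1 million transistors: 10 years means 5 doublings.
print(projected_transistors(1_000_000, 10))  # 32000000.0
```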

Dual Core

Introduced with the IBM POWER4

However, AMD brought it to the consumer market

Combines 2 independent CPUs & their respective caches onto a single silicon chip

Provide better performance improvement than

True parallelism

Multi-Core

Source: www.anandtech.com/show/5174/why-ivy-bridge-is-still-quad-core

Multi-Core (Cont.)

Source: www.legitreviews.com/intel-core-i7-4770k-haswell-3-5ghz-quad-core-cpu-review_2203

Multi-Core (Cont.)

Source: www.hardwarecanucks.com/news/cpu/intel-launch-8-core-xeon-nehalemex/

Multi-Cores + Hyper Threading

Source: www.notebookcheck.net/Intel-Core-i7-Notebook-Processor-Clarksfield.21025.0.html

Many-Cores

Graphics Processing Unit (GPU)

NVIDIA & ATI

SIMD – Single Instruction Multiple Data

Intel Xeon Phi

General purpose

[Figures: NVIDIA Tesla 2070; Intel Xeon Phi]

Example Specifications

                                      GTX 480         Tesla 2070      Tesla K80
Peak double precision FP performance  650 Gigaflops   515 Gigaflops   2.91 Teraflops
Peak single precision FP performance  1.3 Teraflops   1.03 Teraflops  8.74 Teraflops
CUDA cores                            480             448             4992
Frequency of CUDA cores               1.40 GHz        1.15 GHz        560/875 MHz
Memory size (GDDR5)                   1536 MB         6 GB            24 GB
Memory bandwidth                      177.4 GB/sec    150 GB/sec      480 GB/sec
ECC memory                            No              Yes             Yes

CPU vs. GPU Architecture

GPU devotes more transistors for computation

Multithreaded SIMD Processor

Source: Computer Architecture by John L. Hennessy and David A. Patterson

NVIDIA CUDA Architecture

Intel Xeon Phi

Source: www.pcgameshardware.de/Xeon-Phi-Hardware-256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/

Intel Xeon Phi (Cont.)

Source: www.altera.com/technology/system-design/articles/2012/multicore-many-core.html

Power Consumption

Dynamic energy

Consumed when a transistor switches from 0 → 1 or 1 → 0

$\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$

Dynamic power

$\frac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$

Static power consumption

$\text{Current}_{\text{static}} \times \text{Voltage}$

Scales with the number of transistors

Reducing voltage reduces energy

Reducing clock rate reduces power, not energy

Power gating rather than only taking out the clock signal
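The difference between reducing voltage and reducing clock rate can be seen by plugging numbers into the dynamic energy and power formulas. The values below (capacitive load, voltage, frequency) are hypothetical, chosen only to make the ratios easy to read.

```python
def dynamic_energy(capacitive_load, voltage):
    """Energy per transition: 1/2 x capacitive load x voltage^2 (joules)."""
    return 0.5 * capacitive_load * voltage ** 2


def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power: energy per transition x switching frequency (watts)."""
    return dynamic_energy(capacitive_load, voltage) * frequency


# Hypothetical operating point: 1 nF effective load, 1.0 V, 2 GHz.
C, V, F = 1e-9, 1.0, 2e9
base_energy = dynamic_energy(C, V)
base_power = dynamic_power(C, V, F)

# Halving the clock rate halves power, but energy per transition is unchanged.
print(dynamic_power(C, V, F / 2) / base_power)   # 0.5
print(dynamic_energy(C, V) / base_energy)        # 1.0

# Lowering the voltage by 15% cuts energy per transition to about 72%.
print(round(dynamic_energy(C, 0.85 * V) / base_energy, 4))  # 0.7225
```

Because energy depends on the square of the voltage, a modest voltage reduction saves more energy than the same percentage change in any other term.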
