Multicores and Parallel Architectures


Page 1: Multicores and Parallel Architectures

CDA 3101 Spring 2016

Introduction to Computer Organization

Multicore / Multiprocessor Architectures

7, 12 April 2016

Page 2: Multicores and Parallel Architectures


Multicore Architectures
- Introduction – What are Multicores?
- Why Multicores? Power and Performance Perspectives
- Multiprocessor Architectures
- Conclusion

CDA 3101 – Fall 2011 Copyright © 2011 Prabhat Mishra

Page 3: Multicores and Parallel Architectures
Page 4: Multicores and Parallel Architectures

How to Reduce Power Consumption: Multicore

One core at 2 GHz vs. two cores at 1 GHz each
- Same performance
- Two 1 GHz cores require half the power/energy
  - Power ∝ frequency² (worked out below)
  - A 1 GHz core needs one-fourth the power of a 2 GHz core

New challenges – performance: how to utilize the cores. It is difficult to find enough parallelism in programs to keep all these cores busy.
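
To check the arithmetic behind the claim above (a quick sketch, assuming the slide's simplified model in which power scales with the square of clock frequency, ignoring voltage and leakage effects):

    \[
      P \propto f^{2}
      \;\Rightarrow\;
      P_{1\,\mathrm{GHz}} = \left(\tfrac{1}{2}\right)^{2} P_{2\,\mathrm{GHz}} = \tfrac{1}{4}\,P_{2\,\mathrm{GHz}},
      \qquad
      2 \times P_{1\,\mathrm{GHz}} = \tfrac{1}{2}\,P_{2\,\mathrm{GHz}}.
    \]

So two 1 GHz cores can (ideally) match the throughput of one 2 GHz core for roughly half the power, provided the workload can actually be split across both cores.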

Page 5: Multicores and Parallel Architectures

Reducing Energy Consumption

[www.transmeta.com]

- Pentium: max temp = 105.5 °C
- Crusoe: max temp = 48.2 °C

Both processors are running the same multimedia application.

Infrared Cameras (FLIR) can be used to detect thermal distribution.

Page 6: Multicores and Parallel Architectures


Introduction: a never-ending story …

Complex applications demand faster computation. How far did we get with uniprocessors?

Parallel processors now play a major role – the logical way to improve performance:
- Connect multiple microprocessors
- Not much is left to gain from ILP exploitation
- Server and embedded software already have parallelism

Multiprocessor architectures will become increasingly attractive due to the slowdown in uniprocessor advances.

Page 7: Multicores and Parallel Architectures


Levels of Parallelism

Bit-level parallelism: 1970 to ~1985
- 4-bit, 8-bit, 16-bit, 32-bit microprocessors

Instruction-level parallelism: ~1985 to today
- Pipelining
- Superscalar
- VLIW
- Out-of-order execution / dynamic instruction scheduling

Process-level or thread-level parallelism
- Servers are parallel
- Desktop dual-processor PCs
- Multicore architectures (CPUs, GPUs)

Page 8: Multicores and Parallel Architectures


Taxonomy of Parallel Architectures (Flynn Classification)

SISD (Single Instruction, Single Data)
- Uniprocessors

MISD (Multiple Instruction, Single Data)
- Multiple processors operate on a single data stream
- No commercial prototypes; can be thought of as successive refinement of a given set of data by multiple processing units

SIMD (Single Instruction, Multiple Data)
- Examples: Illiac-IV, CM-2
- Simple programming model, low overhead, and flexibility
- All custom integrated circuits

MIMD (Multiple Instruction, Multiple Data)
- Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
- Flexible, but difficult to program – no unifying model of parallelism
- Uses off-the-shelf microprocessors
- MIMD in practice: designs with <= 128 processors

Page 9: Multicores and Parallel Architectures


MIMD

Two types:
- Centralized shared-memory multiprocessors
- Distributed-memory multiprocessors

MIMD exploits thread-level parallelism
- The program should have at least n threads or processes for a MIMD machine with n processors
- Threads can be of different types: independent programs, or parallel iterations of a loop (extracted by the compiler) – see the sketch below
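
To make the loop-level case concrete, here is a minimal host-side C++ sketch (not from the slides; the function names, array size, and the choice of four threads are illustrative). The iterations of a data-parallel loop are split into contiguous chunks, one chunk per thread, so each core of an MIMD shared-memory machine works on its own part of the array:

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Each thread scales its own contiguous chunk of the array:
    // thread-level (MIMD) parallelism on a shared-memory multiprocessor.
    void scale_chunk(float* a, int begin, int end, float k) {
        for (int i = begin; i < end; ++i)
            a[i] *= k;
    }

    void scale_parallel(float* a, int n, float k, int num_threads) {
        std::vector<std::thread> workers;
        int chunk = (n + num_threads - 1) / num_threads;
        for (int t = 0; t < num_threads; ++t) {
            int begin = t * chunk;
            int end   = std::min(n, begin + chunk);
            if (begin < end)
                workers.emplace_back(scale_chunk, a, begin, end, k);
        }
        for (auto& w : workers)
            w.join();                      // wait until every core is done
    }

    int main() {
        std::vector<float> data(1 << 20, 1.0f);
        scale_parallel(data.data(), static_cast<int>(data.size()), 2.0f, 4);
    }

With only one runnable thread, the remaining cores simply sit idle – exactly the utilization problem mentioned earlier.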

Page 10: Multicores and Parallel Architectures


Centralized Shared-Memory Multiprocessor

Page 11: Multicores and Parallel Architectures


Centralized Shared-Memory Multiprocessor

A small number of processors share a centralized memory
- Use multiple buses or switches
- Multiple memory banks

Main memory has a symmetric relationship to all processors and uniform access time from any processor
- SMP: symmetric shared-memory multiprocessor
- UMA: uniform memory access architecture

Increases in processor performance and memory bandwidth requirements make the centralized-memory paradigm less attractive.

Page 12: Multicores and Parallel Architectures


Distributed-Memory Multiprocessors

Page 13: Multicores and Parallel Architectures


Distributed-Memory Multiprocessors

Distributing memory has two benefits
- Cost-effective way to scale memory bandwidth
- Reduces local memory access time

Communicating data between processors is more complex and has higher latency.

Two approaches for data communication
- Shared address space (but not centralized memory)
  - The same physical address refers to the same memory location
  - DSM: distributed shared-memory architectures
  - NUMA: non-uniform memory access, since the access time depends on the location of the data
- Logically disjoint address spaces – multicomputers

Page 14: Multicores and Parallel Architectures


Small-Scale Shared Memory

Caches serve to:
- Increase bandwidth versus bus/memory
- Reduce latency of access
Valuable for both private data and shared data.

What about cache consistency? Example (write-through caches):

    Time  Event                     Cache A  Cache B  X (memory)
     0                                                    1
     1    CPU A reads X                1                  1
     2    CPU B reads X                1        1         1
     3    CPU A stores 0 into X        0        1         0

Page 15: Multicores and Parallel Architectures


Example: Cache Coherence Problem

Processors see different values for u after event 3
- With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, so processes accessing main memory may see a very stale value
- Unacceptable for programming, and it happens frequently! (A small software-level sketch follows the figure below.)

[Figure: three processors P1, P2, and P3, each with a private cache, connected to memory and I/O devices. Memory initially holds u = 5; two processors read u (events 1 and 2) and cache the value 5, one processor then writes u = 7 (event 3), and the subsequent reads of u (events 4 and 5) may still return the stale value 5.]
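
For intuition about what the hardware must guarantee, here is a minimal host-side C++ sketch (not from the slides; the variable names are illustrative, and the std::atomic flag keeps the program well-defined). One thread updates a shared value in its own cache; it is the cache-coherence protocol, together with the release/acquire ordering on the flag, that makes the update visible to the reading core rather than a stale copy:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int u = 5;                           // shared data, like "u" in the figure
    std::atomic<bool> ready{false};      // publication flag

    void writer() {
        u = 7;                           // lands in the writer core's cache first
        ready.store(true, std::memory_order_release);
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        std::printf("reader sees u = %d\n", u);   // prints 7, not the stale 5
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }

Without hardware coherence (or without the ordering on the flag), the reading core could keep returning its old cached value, which is exactly the problem shown in the figure.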

Page 16: Multicores and Parallel Architectures

4 C’s: Sources of Cache Misses

Compulsory misses (aka cold-start misses)
- First access to a block

Capacity misses
- Due to finite cache size
- A replaced block is later accessed again

Conflict misses (aka collision misses)
- In a non-fully-associative cache
- Due to competition for entries in a set
- Would not occur in a fully associative cache of the same total size

Coherence misses
- Caused by the coherence mechanism, e.g., a block invalidated by another processor's write

Page 17: Multicores and Parallel Architectures

Graphics Processing Units (GPUs)

- Moore’s Law will come to an end
- Many complicated solutions; one simple solution – SPATIAL PARALLELISM
- SIMD model (single instruction, multiple data streams)
- GPUs have a SIMD grid with a local & shared memory model


Page 18: Multicores and Parallel Architectures

GPUs – Nvidia CUDA Hierarchy


- Map each process to a thread
- Group threads into blocks
- Group blocks into grids for efficient memory access
- Also, memory-coalescing operations for faster data transfer
(A minimal kernel sketch follows below.)
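
To make the hierarchy concrete, here is a minimal CUDA sketch (not from the slides; the kernel name, launch wrapper, and block size of 256 are illustrative, while threadIdx, blockIdx, blockDim, and the <<<grid, block>>> launch syntax are standard CUDA). Each thread computes one element, threads are grouped into blocks, and blocks form the grid; consecutive threads touch consecutive elements, which is the access pattern that lets the hardware coalesce memory operations:

    #include <cuda_runtime.h>

    // One thread per output element: the global index combines the block
    // index, the block size, and the thread index within the block.
    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                        // guard: the grid may overshoot n
            c[i] = a[i] + b[i];
    }

    void launch_vec_add(const float* d_a, const float* d_b, float* d_c, int n) {
        int threadsPerBlock = 256;                                        // threads grouped in a block
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // blocks grouped in a grid
        vec_add<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
    }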

Page 19: Multicores and Parallel Architectures

Graphics Processing Units (GPUs)

Nvidia Fermi GPU – 3 GB DRAM, 512 cores

CUDA architecture:
- Thread
- Thread block
- Grid of thread blocks
- Intelligent CUDA compiler

Page 20: Multicores and Parallel Architectures

Nvidia Tesla 20xx GPU Board


Page 21: Multicores and Parallel Architectures

Graphics Processing Units (GPUs)

Nvidia Maxwell GM100 – 8 GB + 6,144 cores

CUDA architecture:
- Threads can be spawned internally
- 32 cores per streaming multiprocessor
- 128 KB L1 and 2 MB L2 cache
- v5.2+ CUDA compiler

Page 22: Multicores and Parallel Architectures

GPU Problems and Solutions

- GPUs are designed for graphics rendering
- GPUs are not designed for general-purpose computing!! (no unifying model of parallelism)
- Memory hierarchy:
  - Local memory – fast, small (MBs)
  - Shared memory – slower, larger
  - Global memory – slow, GBytes
- How to circumvent data movement cost? (see the sketch below)
  - Clever hand coding: costly, app-specific
  - Automatic coding: sub-optimal, needs software support


Page 23: Multicores and Parallel Architectures

Advantages and Disadvantages

- GPUs provide fast parallel computing
- GPUs work best for parallel solutions
- Sequential programs can actually run slower

Amdahl’s Law describes the speedup on N processors (a worked example follows below):

    Speedup = 1 / (S + P/N)

where P is the fraction of the program that is parallel and S is the fraction that is sequential.
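
A worked example with illustrative numbers (not from the slides): suppose 90% of a program is parallel (P = 0.9, S = 0.1) and it runs on 512 GPU cores.

    \[
      \text{Speedup} = \frac{1}{S + P/N}
                     = \frac{1}{0.1 + 0.9/512}
                     \approx 9.8,
      \qquad
      \lim_{N \to \infty} \frac{1}{0.1 + 0.9/N} = 10 .
    \]

The sequential 10% caps the speedup at 10x no matter how many cores are added, which is why inherently sequential programs gain little from GPUs.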


Page 24: Multicores and Parallel Architectures

Multicore CPUs

Intel Nehalem:
- Servers, HPC arrays
- 45 nm circuit technology

Intel Xeon:
- 2001 to present
- 2 to >60 cores
- Workstations with multiple cores
- Laptops – heat dissipation?

[Figure: dual-Nehalem system]

Page 25: Multicores and Parallel Architectures

Intel Multicore CPU Performance

[Figure: performance of Intel multicore CPUs compared with a single core]

Page 26: Multicores and Parallel Architectures

Conclusions

- Parallel machines need parallel solutions; inherently sequential programs don’t benefit much from parallelism
- Two main types of parallel architectures:
  - SIMD – single instruction, multiple data streams
  - MIMD – multiple instruction, multiple data streams
- Modern parallel architectures (multicores):
  - GPUs – exploit SIMD parallelism for general-purpose parallel computing solutions
  - CPUs – multicore CPUs are more amenable to MIMD parallel applications