multi-core architectures

58
Multi-Core Multi-Core Architecture Architecture s s 1

Upload: valtina-tony

Post on 03-Jan-2016

70 views

Category:

Documents


0 download

DESCRIPTION

Multi-Core Architectures. Single-Core Computer. Single-Core CPU chip. the single core. Multi-Core Architectures. This lecture is about a new trend in computer architecture: Replicate multiple processor cores on a single die. Core 1. Core 2. Core 3. Core 4. Multi-core CPU chip. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Multi-Core Architectures

Multi-Core Multi-Core ArchitecturesArchitectures

1

Page 2: Multi-Core Architectures

Single-Core ComputerSingle-Core Computer

2

Page 3: Multi-Core Architectures

Single-Core CPU chipSingle-Core CPU chip

3

the single core

Page 4: Multi-Core Architectures

Multi-Core ArchitecturesMulti-Core Architectures• This lecture is about a new trend in computer

architecture: Replicate multiple processor cores on a single die.

4

Core 1 Core 2 Core 3 Core 4

Multi-core CPU chip

Page 5: Multi-Core Architectures

Multi-Core CPU chipMulti-Core CPU chip• The cores fit on a single processor socket

• Also called CMP (Chip Multi-Processor)

5

core

1

core

2

core

3

core

4

Page 6: Multi-Core Architectures

The Cores Run In ParallelThe Cores Run In Parallel

6

core

1

core

2

core

3

core

4

thread 1 thread 2 thread 3 thread 4

Page 7: Multi-Core Architectures

Within each core, threads are time-Within each core, threads are time-sliced (just like on a uniprocessor)sliced (just like on a uniprocessor)

7

core

1

core

2

core

3

core

4

several threads

several threads

several threads

several threads

Page 8: Multi-Core Architectures

Interaction with the Operating Interaction with the Operating

SystemSystem

• OS perceives each core as a separate processor

• OS scheduler maps threads/processes to different cores

• Most major OS support multi-core today:Windows, Linux, Mac OS X, …

8

Page 9: Multi-Core Architectures

Why multi-core ?Why multi-core ?• Difficult to make single-core

clock frequencies even higher • Deeply pipelined circuits:

o heat problemso speed of light problemso difficult design and verificationo large design teams necessaryo server farms need expensive

air-conditioning• Many new applications are multithreaded • General trend in computer architecture (shift towards

more parallelism)

9

Page 10: Multi-Core Architectures

Instruction-level parallelismInstruction-level parallelism• Parallelism at the machine-instruction level

• The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.

• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years

10

Page 11: Multi-Core Architectures

Thread-level parallelism Thread-level parallelism (TLP)(TLP)

• This is parallelism on a more coarser scale

• Server can serve each client in a separate thread (Web server, database server)

• A computer game can do AI, graphics, and physics in three separate threads

• Single-core superscalar processors cannot fully exploit TLP

• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP

11

Page 12: Multi-Core Architectures

MultiprocessorsMultiprocessors

• Multiprocessor is any computer with several processors

• SIMDo Single instruction, multiple dataoModern graphics cards

• MIMDoMultiple instructions, multiple data

12

Lemieux cluster,Pittsburgh

supercomputing center

Page 13: Multi-Core Architectures

Multiprocessor Memory TypesMultiprocessor Memory Types

• Shared memory:In this model, there is one (large) common shared memory for all processors

• Distributed memory:In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else

13

Page 14: Multi-Core Architectures

Multi-core processor is a special kind of a Multi-core processor is a special kind of a multiprocessor: multiprocessor:

All processors are on the same chipAll processors are on the same chip

• Multi-core processors are MIMD:Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).

• Multi-core is a shared memory multiprocessor:All cores share the same memory

14

Page 15: Multi-Core Architectures

What applications benefit What applications benefit from multi-core?from multi-core?

• Database servers• Web servers (Web commerce)• Compilers• Multimedia applications• Scientific applications,

CAD/CAM• In general, applications with

Thread-level parallelism(as opposed to instruction-level parallelism)

15

Each can run on itsown core

Page 16: Multi-Core Architectures

More examplesMore examples• Editing a photo while recording a TV show through a

digital video recorder

• Downloading software while running an anti-virus program

• “Anything that can be threaded today will map efficiently to multi-core”

• BUT: some applications difficult to parallelize

16

Page 17: Multi-Core Architectures

A technique complementary to multi-core:A technique complementary to multi-core:

Simultaneous multithreading Simultaneous multithreading • Problem addressed:

The processor pipeline can get stalled:oWaiting for the result

of a long floating point (or integer) operation

oWaiting for data to arrive from memory

Other execution unitswait unused

17

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROM

BTBL2 C

ache

and

Con

trol

Bus

Source: Intel

Page 18: Multi-Core Architectures

Simultaneous multithreading (SMT)Simultaneous multithreading (SMT)

• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core

• Weaving together multiple “threads” on the same core

• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units

18

Page 19: Multi-Core Architectures

Without SMT, only a single thread Without SMT, only a single thread can run at any given timecan run at any given time

19

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROMBTBL2 C

ache

and

Con

trol

Bus

Thread 1: floating point

Page 20: Multi-Core Architectures

Without SMT, only a single thread Without SMT, only a single thread can run at any given timecan run at any given time

20

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROMBTBL2 C

ache

and

Con

trol

Bus

Thread 2:integer operation

Page 21: Multi-Core Architectures

SMT processor: both threads can SMT processor: both threads can run concurrentlyrun concurrently

21

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROMBTBL2 C

ache

and

Con

trol

Bus

Thread 1: floating pointThread 2:integer operation

Page 22: Multi-Core Architectures

But: Can’t simultaneously use the But: Can’t simultaneously use the same functional unitsame functional unit

22

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROMBTBL2 C

ache

and

Con

trol

Bus

Thread 1 Thread 2

This scenario isimpossible with SMTon a single core(assuming a single integer unit)IMPOSSIBLE

Page 23: Multi-Core Architectures

SMT not a “true” parallel processorSMT not a “true” parallel processor

• Enables better threading (e.g. up to 30%)

• OS and applications perceive each simultaneous thread as a separate “virtual processor”

• The chip has only a single copy of each resource

• Compare to multi-core: each core has its own copy of resources

23

Page 24: Multi-Core Architectures

Multi-core: threads can run on Multi-core: threads can run on separate coresseparate cores

24

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROM

BTBL2 C

ache

and

Con

trol

Bus

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROM

BTBL2 C

ache

and

Con

trol

Bus

Thread 1 Thread 2

Page 25: Multi-Core Architectures

Multi-core: threads can run on Multi-core: threads can run on separate coresseparate cores

25

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROM

BTBL2 C

ache

and

Con

trol

Bus

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROM

BTBL2 C

ache

and

Con

trol

Bus

Thread 3 Thread 4

Page 26: Multi-Core Architectures

Combining Multi-core and SMTCombining Multi-core and SMT

• Cores can be SMT-enabled (or not)

• The different combinations:o Single-core, non-SMT: standard uniprocessoro Single-core, with SMT oMulti-core, non-SMToMulti-core, with SMT: our fish machines

• The number of SMT threads:2, 4, or sometimes 8 simultaneous threads

• Intel calls them “hyper-threads”

26

Page 27: Multi-Core Architectures

SMT Dual-core: all four threads SMT Dual-core: all four threads can run concurrentlycan run concurrently

27

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROM

BTBL2 C

ache

and

Con

trol

Bus

BTB and I-TLB

Decoder

Trace Cache

Rename/Alloc

Uop queues

Schedulers

Integer Floating Point

L1 D-Cache D-TLB

uCode ROM

BTBL2 C

ache

and

Con

trol

Bus

Thread 1 Thread 3 Thread 2 Thread 4

Page 28: Multi-Core Architectures

Comparison: multi-core vs SMTComparison: multi-core vs SMT

• Multi-core:o Since there are several cores,

each is smaller and not as powerful(but also easier to design and manufacture)

oHowever, great with thread-level parallelism

• SMTo Can have one large and fast superscalar coreoGreat performance on a single threadoMostly still only exploits instruction-level

parallelism28

Page 29: Multi-Core Architectures

The memory hierarchyThe memory hierarchy• If simultaneous multithreading only:

o all caches shared

• Multi-core chips:o L1 caches privateo L2 caches private in some architectures and shared

in others

• Memory is always shared

29

Page 30: Multi-Core Architectures

““Fish” machinesFish” machines• Dual-core Intel Xeon processors

• Each core is hyper-threaded

• Private L1 caches

• Shared L2 caches

30

memory

L2 cache

L1 cache L1 cacheC O

R E

1

C O

R E

0

hyper-threads

Page 31: Multi-Core Architectures

Designs with private L2 cachesDesigns with private L2 caches

31

memory

L2 cache

L1 cache L1 cacheC O

R E

1

C O

R E

0

L2 cache

memory

L2 cache

L1 cache L1 cacheC O

R E

1

C O

R E

0

L2 cache

Both L1 and L2 are private

Examples: AMD Opteron, AMD Athlon, Intel Pentium D

L3 cache L3 cache

A design with L3 caches

Example: Intel Itanium 2

Page 32: Multi-Core Architectures

Private vs shared cachesPrivate vs shared caches• Advantages of private:

o They are closer to core, so faster accesso Reduces contention

• Advantages of shared:o Threads on different cores can share the same

cache dataoMore cache space available if a single (or a few)

high-performance thread runs on the system

32

Page 33: Multi-Core Architectures

The cache coherence problemThe cache coherence problem

• Since we have private caches: How to keep the data consistent across caches?

• Each core should perceive the memory as a monolithic array, shared by all the cores

33

Page 34: Multi-Core Architectures

The cache coherence problemThe cache coherence problem

34

Suppose variable x initially contains 15213

Core 1 Core 2 Core 3 Core 4

One or more levels of

cache

One or more levels of

cache

One or more levels of

cache

One or more levels of

cache

Main memoryx=15213

multi-core chip

Page 35: Multi-Core Architectures

35

The cache coherence problemThe cache coherence problem

Core 1 reads x

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=15213

One or more levels of

cache

One or more levels of

cache

One or more levels of

cache

Main memoryx=15213

multi-core chip

Page 36: Multi-Core Architectures

36

The cache coherence problemThe cache coherence problem

Core 2 reads x

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=15213

One or more levels of

cachex=15213

One or more levels of

cache

One or more levels of

cache

Main memoryx=15213

multi-core chip

Page 37: Multi-Core Architectures

37

The cache coherence problemThe cache coherence problem

Core 1 writes to x, setting it to 21660

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=21660

One or more levels of

cachex=15213

One or more levels of

cache

One or more levels of

cache

Main memoryx=21660

multi-core chip

assuming write-through caches

Page 38: Multi-Core Architectures

38

The cache coherence problemThe cache coherence problem

Core 2 attempts to read x… gets a stale copy

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=21660

One or more levels of

cachex=15213

One or more levels of

cache

One or more levels of

cache

Main memoryx=21660

multi-core chip

Page 39: Multi-Core Architectures

Solutions for cache coherenceSolutions for cache coherence• This is a general problem with multiprocessors, not

limited just to multi-core

• There exist many solution algorithms, coherence protocols, etc.

• A simple solution:invalidation-based protocol with snooping

39

Page 40: Multi-Core Architectures

Inter-core busInter-core bus

40

Core 1 Core 2 Core 3 Core 4

One or more levels of

cache

One or more levels of

cache

One or more levels of

cache

One or more levels of

cache

Main memory

multi-core chip

inter-corebus

Page 41: Multi-Core Architectures

Invalidation protocol with snoopingInvalidation protocol with snooping

• Invalidation:If a core writes to a data item, all other copies of this data item in other caches are invalidated

• Snooping: All cores continuously “snoop” (monitor) the bus connecting the cores.

41

Page 42: Multi-Core Architectures

42

The cache coherence problemThe cache coherence problemRevisited: Cores 1 and 2 have both read x

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=15213

One or more levels of

cachex=15213

One or more levels of

cache

One or more levels of

cache

Main memoryx=15213

multi-core chip

Page 43: Multi-Core Architectures

43

The cache coherence problemThe cache coherence problem

Core 1 writes to x, setting it to 21660

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=21660

One or more levels of

cachex=15213

One or more levels of

cache

One or more levels of

cache

Main memoryx=21660

multi-core chip

assuming write-through caches

INVALIDATEDsendsinvalidationrequest

inter-corebus

Page 44: Multi-Core Architectures

44

The cache coherence problemThe cache coherence problem

After invalidation:

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=21660

One or more levels of

cache

One or more levels of

cache

One or more levels of

cache

Main memoryx=21660

multi-core chip

Page 45: Multi-Core Architectures

45

The cache coherence problemThe cache coherence problemCore 2 reads x. Cache misses, and loads the new copy.

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=21660

One or more levels of

cachex=21660

One or more levels of

cache

One or more levels of

cache

Main memoryx=21660

multi-core chip

Page 46: Multi-Core Architectures

46

Alternative to invalidate protocol: Alternative to invalidate protocol: update protocolupdate protocol

Core 1 writes x=21660:

Core 1 Core 2 Core 3 Core 4

One or more levels of

cachex=21660

One or more levels of

cachex=21660

One or more levels of

cache

One or more levels of

cache

Main memoryx=21660

multi-core chip

assuming write-through caches

UPDATED

broadcastsupdatedvalue inter-core

bus

Page 47: Multi-Core Architectures

47

Invalidation V.S. UpdateInvalidation V.S. Update

• Multiple writes to the same location– invalidation: only the first time– update: must broadcast each write

(which includes new variable value)

• Invalidation generally performs better:it generates less bus traffic

Page 48: Multi-Core Architectures

48

Invalidation ProtocolsInvalidation Protocols

• This was just the basic invalidation protocol

• More sophisticated protocols use extra cache state bits

• MSI, MESI (Modified, Exclusive, Shared, Invalid)

Page 49: Multi-Core Architectures

Programming for Multi-CoreProgramming for Multi-Core• Programmers must use threads or processes

• Spread the workload across multiple cores

• Write parallel algorithms

• OS will map threads/processes to cores

49

Page 50: Multi-Core Architectures

Thread Safety Very ImportantThread Safety Very Important

• Pre-emptive context switching: context switch can happen AT ANY TIME

• True concurrency, not just uniprocessor time-slicing

• Concurrency bugs exposed much faster with multi-core

50

Page 51: Multi-Core Architectures

Assigning threads to the coresAssigning threads to the cores

• Each thread/process has an affinity mask

• Affinity mask specifies what cores the thread is allowed to run on

• Different threads can have different masks

• Affinities are inherited across fork()

51

Page 52: Multi-Core Architectures

Affinity masks are bit vectorsAffinity masks are bit vectors

• Example: 4-way multi-core, without SMT

52

1011

core 3 core 2 core 1 core 0

• Process/thread is allowed to run on cores 0,2,3, but not on core 1

Page 53: Multi-Core Architectures

Affinity masks when multi-core Affinity masks when multi-core and SMT combinedand SMT combined

• Separate bits for each simultaneous thread

• Example: 4-way multi-core, 2 threads per core

53

1

core 3 core 2 core 1 core 0

1 0 0 1 0 1 1

thread1

• Core 2 can’t run the process• Core 1 can only use one simultaneous thread

thread0

thread1

thread0

thread1

thread0

thread1

thread0

Page 54: Multi-Core Architectures

Default AffinitiesDefault Affinities• Default affinity mask is all 1s:

all threads can run on all processors

• Then, the OS scheduler decides what threads run on what core

• OS scheduler detects skewed workloads, migrating threads to less busy processors

54

Page 55: Multi-Core Architectures

Process migration is costlyProcess migration is costly• Need to restart the execution pipeline

• Cached data is invalidated

• OS scheduler tries to avoid migration as much as possible: it tends to keeps a thread on the same core

• This is called soft affinity

55

Page 56: Multi-Core Architectures

Hard affinitiesHard affinities

• The programmer can prescribe her own affinities (hard affinities)

• Rule of thumb: use the default scheduler unless a good reason not to

56

Page 57: Multi-Core Architectures

When to set your own When to set your own affinitiesaffinities

• Two (or more) threads share data-structures in memoryomap to same core so that can share cache

• Real-time threads:Example: a thread running a robot controller:- must not be context switched, or else robot can go unstable- dedicate an entire core just to this thread

57

Source: Sensable.com

Page 58: Multi-Core Architectures

ConclusionConclusion• Multi-core chips an

important new trend in computer architecture

• Several new multi-core chips in design phases

• Parallel programming techniques likely to gain importance

58