arquitetura e organização de processadores - ufrgsflavio/ensino/cmp237/aula17.pdf · arquitetura...

Universidade Federal do Rio Grande do Sul
Instituto de Informática

Programa de Pós-Graduação em Computação

Processor Architecture and Organization

Lecture 17

Multi-core architectures

Motivation

• Future applications will require still more performance
• Power is current bottleneck
• Performance driven by higher frequency and ILP hits power wall
• Performance may be obtained by multiple processors running at lower frequencies
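
The last bullet can be illustrated with the standard first-order model of dynamic CMOS power, P ≈ C·V²·f. The sketch below is only an illustration: the capacitance, voltages, and frequencies are made-up values, and it assumes that the workload parallelizes perfectly across two cores and that the slower clock permits a lower supply voltage.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """First-order dynamic CMOS power: P = C * V^2 * f (watts)."""
    return c_eff * vdd ** 2 * freq_hz

# Hypothetical numbers, chosen only for illustration.
C_EFF = 1e-9  # effective switched capacitance per core (farads)

# One core at full frequency and nominal voltage.
single = dynamic_power(C_EFF, vdd=1.2, freq_hz=2.0e9)

# Two cores at half the frequency; assume the slower clock allows 0.9 V.
dual = 2 * dynamic_power(C_EFF, vdd=0.9, freq_hz=1.0e9)

# Same aggregate cycles per second (2 x 1 GHz), if the work parallelizes.
print(f"single core: {single:.2f} W")
print(f"dual core:   {dual:.2f} W")
print(f"ratio:       {dual / single:.2f}")
```

Under these assumptions the saving comes entirely from the quadratic voltage term; at equal voltage the two configurations would dissipate the same dynamic power.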

Types of parallelism

• Single processor core
  – ILP – Instruction-level parallelism
  – VLIW – Very Long Instruction Word
  – SIMD – Single Instruction Multiple Data
  – SMT – Simultaneous Multi-Threading
• Multiprocessing
  – SMP – Symmetrical Multi-Processing
  – CMP – Chip Multi-Processing (usually homogeneous)
  – MPSoC – Multi-Processor SoCs (usually heterogeneous)

Parallelism

Massive parallelism required in the foreseeable future

                        2003    2009    2015
Frequency (MHz)          300     600    1500
Gigaops/s                0.3      14    2458
Operations per cycle       1      23    1638

Source: ITRS Roadmap, 2003
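
The "operations per cycle" row follows from the other two rows: operations per second divided by cycles per second. A minimal sketch that recomputes it from the table values (the function name is illustrative, and the observation that the table appears to truncate the result is an assumption):

```python
# Values copied from the ITRS table.
years = [2003, 2009, 2015]
freq_mhz = [300, 600, 1500]
gigaops = [0.3, 14, 2458]

def ops_per_cycle(gops, mhz):
    """Operations per cycle = (operations per second) / (cycles per second)."""
    return (gops * 1e9) / (mhz * 1e6)

for year, g, f in zip(years, gigaops, freq_mhz):
    # The table values appear to be truncated to whole operations.
    print(year, f"{ops_per_cycle(g, f):.2f}")
```

The required parallelism grows much faster than the clock: frequency rises 5x between 2003 and 2015, while operations per cycle rise by more than three orders of magnitude.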

Complexity of media applications

[Figure: operations per sample (vertical axis, 10 to 100K) versus sampling rate per second (horizontal axis, 100K to 10G), with diagonal lines of constant throughput from 10 MOPS to 100 TOPS. Plotted workloads: H.263 QCIF, MPEG1 CIF, MPEG2 MP@ML, and MPEG2 MP@HL encode, and MPEG2 MP@ML decode.]

Source: Shen, SIPS 2003

Parallelism

[Figure: increasing hardware threads per socket, 2003 to 2013, on a scale from 1 to 100. The curve moves from hyper-threading through the multi-core era (scalar and parallel applications) to the many-core era (massively parallel applications).]

Source: www.intel.com

General-purpose processing

• Tera-level computing involves three distinct types of workloads, or computing capabilities:
  – Recognition: the ability to recognize patterns and models of interest to a specific user or application scenario
  – Mining: the ability to mine large amounts of real-world data for the patterns or models of interest
  – Synthesis: the ability to synthesize large datasets or a virtual world based on the patterns or models of interest
• Intel foresees a multi-core architecture that is scalable, adaptable, and programmable

Source: www.intel.com

General-purpose solutions - IBM

• IBM Power4
  – 2 cores, f = 1.4 GHz, 174 Mtransistors
  – Single clock over entire die
  – Power = 85 W
    • One core may be turned off
  – Trend: multiple processors on die, bus communication, shared cache
• 4 Power4 chips into single module
  – Chips connected via 4 128-bit buses
  – Up to 128 MB L3 cache
  – Bus speed = ½ processor speed
  – Total throughput = 35 GB/s
  – Trend: multiple processors on MCM, on-module communication, huge cache

Source: Franza, MPSoC’05

General-purpose solutions - Sun

• Sun UltraSPARC IV
  – 2 cores, f = 1.8 GHz
  – Shared 2 MB L2 cache
  – 300 Mtransistors
• Sun Niagara
  – 8 cores
  – 4 threads per core
  – Shared 3 MB L2 cache
  – To be released in 2006

[Figure: Sun Niagara block diagram. Eight 4-way MT SPARC pipes and the I/O shared functions connect through a crossbar to a 4-way banked L2 cache, which connects to the memory controllers & I/O.]

General-purpose solutions - AMD

• AMD dual-core Opteron
  – 2 cores, f = 1.8 GHz
  – 106 Mtransistors
  – Power = 70 W
  – 2 x 1 MB L2 caches
  – Unshared caches


General-purpose solutions - Intel

• Intel Pentium D
  – 2 HT processors on MCM
  – 2 x 1 MB L2 caches, unshared
  – f = 3.2 GHz
  – 230 Mtransistors
• Intel Itanium Montecito
  – 2 VLIW cores, f = 1.5 GHz
  – Power = 100 W
  – 1.72 Btransistors
  – 2 x 12 MB L3 asynchronous caches
  – Multiple clock domains
  – Power management
    • Dynamic voltage and frequency adjustment


CMP with shared cache

• Advantages
  – Low communication latency between the cores
  – The interface between the cache and the I/O is used only for off-chip communication
  – The cache can be dynamically allocated among the cores
• Disadvantages
  – Higher complexity
  – Need for higher cache bandwidth
• Examples
  – IBM Power 4/5
  – Sun UltraSPARC-IV+

CMP with shared I/O

• Advantages
  – Simpler than the shared-cache model
  – No need to go off-chip for communication between the cores
• Disadvantages
  – Wasted resources because the caches are not shared
  – The bandwidth between the cache and the bus is shared by in-chip and off-chip traffic
• Examples
  – Intel Itanium 2 (Montecito)
  – AMD Opteron Dual-Core

CMP with shared package

• Advantages
  – Requires no changes to the CPU logic
  – Short design time relative to the other models
• Disadvantages
  – Communication latency between the CPUs
  – Limits the frequency of the interconnect bus
• Examples
  – Intel Pentium D (Smithfield)
  – Intel Pentium D (Presler)
  – Intel Xeon (Dempsey)

MPSoC issues

• Heterogeneous vs. homogeneous multi-processing: trade-off between programmability and efficiency
  – Heterogeneous ISAs
  – DSP processors for media applications
  – Hardwired blocks
  – Configurable processors
  – Heterogeneous memory systems and address spaces
  – Heterogeneous interconnects
• MPSoCs are custom architectures, derived from configurable platforms, driven by standards
  – Standards usually define I/O relationships, not algorithms

MPSoC issues

• Programming model and software development tools
• Memory model
  – Heterogeneous memory systems are harder to program
  – Support for real-time constraints and performance
• Communication architecture
  – Support for real-time constraints and performance
• Design methodologies and tools
  – How to configure a platform to meet application constraints?
  – Time-to-market requires support from tools
  – Market for tools is too limited
  – More simulation-oriented (ASIC tools are more synthesis-oriented)

Examples of multi-cores for the embedded market

• ST Nomadik
• Cell - IBM / Sony / Toshiba
• ARM11 MPCore
• Toshiba media processor MeP
• NEC MP-211
• Panasonic UniPhier
• Infineon 3G-baseband MPSoC

ST Nomadik platform

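[Figure: ST Nomadik platform block diagram. A general-purpose ARM CPU with cache, a system DMA, and embedded memory sit alongside loosely-coupled sub-systems: symmetrical multi-media DSPs, each with its own cache, DMA, and hardware accelerators (HW1, HW2), plus graphics acceleration. The platform connects to memory, storage & connectivity and to peripheral interfaces.]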

Source: Artieri, MPSoC’05

Nomadik - MPSoC benefits

• High-computing performance
  – Multiple non-interfering domains of intense activity, each having its own processor, DMA services, and hardware accelerators for data intensive functions
  – Hardware acceleration embedding standard functions
  – Highest and predictable performance through a careful bus and memory hierarchy design
• Low-power
  – Intrinsic low-power sub-systems
  – Fine grain power management at sub-system level
  – Leakage management by switching on & off sub-systems


Nomadik - MPSoC benefits

• Software flexibility
  – General-purpose CPU allows fast porting of new features
  – Performance through optimization on DSP with reasonable effort
  – Full performance at low power using HW functions
• Three levels from simplest to most advanced usage
  – Monolithic general-purpose CPU
  – Monolithic general-purpose CPU, multiple symmetrical DSPs
  – Monolithic general-purpose CPU, multiple symmetrical DSPs, hardware accelerators


Nomadik - Multi-media DSP processor profile

• Short pipeline, high VLIW parallelism efficiency
  – 1 convolution tap per cycle (2 loads + 2 pointer updates + 1 multiplication + 1 MAC)
• Incremental architectural evolution, no race for frequency
• Floating point unit
  – IEEE754 compliant
  – Division and square root operation
• SIMD support
• Low power
  – Level 0 cache for power saving
  – Low-power instructions
  – Massive gated clock physical implementation
• Programmed only in ANSI C
  – Reduced learning curve and development time
  – Allow seamless DSP architecture evolution
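
To make the "1 convolution tap per cycle" figure concrete, the sketch below spells out the work inside one tap of an FIR filter: two loads, one multiply-accumulate, and two pointer updates. On a VLIW DSP these operations could be issued together in a single cycle; in Python they simply run one after another. The function and variable names are illustrative, and mapping the slide's operation count onto this loop is one plausible reading, not the vendor's description.

```python
def fir_output(samples, coeffs, n):
    """Compute one FIR output y[n] = sum over k of coeffs[k] * samples[n - k].

    Assumes n >= len(coeffs) - 1 so that every sample index is valid.
    """
    acc = 0.0
    x_ptr = n  # pointer into the sample buffer
    h_ptr = 0  # pointer into the coefficient buffer
    for _ in range(len(coeffs)):
        x = samples[x_ptr]  # load 1
        h = coeffs[h_ptr]   # load 2
        acc += x * h        # multiply and accumulate (MAC)
        x_ptr -= 1          # pointer update 1
        h_ptr += 1          # pointer update 2
    return acc

samples = [1.0, 2.0, 3.0, 4.0]
coeffs = [0.5, 0.25]
print(fir_output(samples, coeffs, n=3))
```

With one tap per cycle, a filter with K coefficients would cost roughly K cycles per output sample, which is why this figure is a common yardstick for media DSPs.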


Nomadik - Memory hierarchy and bus

• Becomes the main design bottleneck
  – Memory cache hierarchy
  – Bus matrix
  – Usage of shared embedded memory to offload bandwidth from external memory
  – Smart caching in embedded memory is key
    • Managed by software
    • Hardware controlled
  – L1-cache at sub-system level is sized in accordance with average latency
• A very manageable bottleneck
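
Sizing a cache "in accordance with average latency" can be reasoned about with the textbook average memory access time formula: hit time plus miss rate times miss penalty, applied level by level. The sketch below is an assumption-laden illustration, not the Nomadik design flow: all cycle counts and miss rates are invented, and the two-level structure (sub-system L1, then shared embedded memory, then external SDRAM) is only a simplified reading of the hierarchy.

```python
def average_latency(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical cycle counts, for illustration only.
SDRAM_LATENCY = 100  # external memory
EMBEDDED_HIT = 10    # shared embedded memory
L1_HIT = 1           # sub-system L1 cache

def subsystem_latency(l1_miss_rate, embedded_miss_rate):
    # L1 misses go to embedded memory; misses there go off-chip.
    embedded = average_latency(EMBEDDED_HIT, embedded_miss_rate, SDRAM_LATENCY)
    return average_latency(L1_HIT, l1_miss_rate, embedded)

# A larger L1 is modelled here simply as a lower L1 miss rate.
for miss in (0.10, 0.05, 0.02):
    print(miss, subsystem_latency(miss, embedded_miss_rate=0.2))
```

The same function also shows why the embedded memory matters: lowering its miss rate reduces the penalty seen by every L1 miss.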


Nomadik - Memory hierarchy

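[Figure: Nomadik memory hierarchy. Each sub-system has its own DMA and L1 cache. The sub-systems and the system DMA share the embedded memory (L2 cache), reached with very high bandwidth and low latency. The external mass memory (SDRAM) is the bandwidth bottleneck, with high latency.]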

Nomadik – Software platform

[Figure: Nomadik software platform, layered from top to bottom]

• User interface: gaming, MP3 player, telephone, messaging, browser, PIM
• High-level client API
• Communication infrastructure, security framework, multi-media framework, Java, power management, telephony, networking
• Operating system core (kernel, device drivers, file system, …): Symbian, WinCE, Linux
• Low-level API (HCL)
• Multi-media accelerators & audio-video codecs (MP3, AAC, Midi, … MPEG4, H.264, …); communication interfaces (UARTs, USB, BT, …); peripheral interfaces (LCD, cameras, memory, …)


Nomadik - Software overview

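[Figure: Nomadik software overview. On the ARM, applications run over middleware, drivers, and HCL on top of an open OS. A Component Manager sits between the ARM side and the DSPs, each of which runs a sub-system OS with firmware (FW). The Nomadik kernel spans the ARM and the DSPs.]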

Nomadik - Programming model

• Nomadik kernel
  – A set of system services and API on which
    • Open OS drivers are built
    • Sub-system firmware is built
  – Open OS agnostic
  – Provides execution resource abstraction for user applications and firmware



• Component = process = service
  – A dynamically downloadable object
• Component Manager
  – A unique gateway to all sub-systems
  – Aware of all sub-system resources’ state and activity
  – Transparently execute a component on any of the sub-systems
  – Manage the life cycle of a component
    • Create, start, stop, kill component instances
    • Apply policy rules
  – Memory management
    • Image installation
    • Memory allocation
    • Garbage collection
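
The Component Manager's role as a single gateway can be sketched as a small class that tracks sub-system load and drives the create / start / stop / kill life cycle. Everything below is hypothetical: the class, method, and sub-system names are invented for illustration, and the "least-loaded sub-system" placement is an assumed policy rule, not the actual Nomadik API or policy.

```python
from enum import Enum

class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"
    KILLED = "killed"

class ComponentManager:
    """Toy gateway that places components on sub-systems and tracks state."""

    def __init__(self, subsystems):
        # Number of live instances per sub-system (the "resource state").
        self.load = {name: 0 for name in subsystems}
        self.instances = {}  # instance id -> record
        self._next_id = 0

    def create(self, component_name):
        # Assumed policy rule: place on the least-loaded sub-system.
        target = min(self.load, key=self.load.get)
        self.load[target] += 1
        iid = self._next_id
        self._next_id += 1
        self.instances[iid] = {
            "name": component_name,
            "subsystem": target,
            "state": State.CREATED,
        }
        return iid

    def start(self, iid):
        self.instances[iid]["state"] = State.RUNNING

    def stop(self, iid):
        self.instances[iid]["state"] = State.STOPPED

    def kill(self, iid):
        record = self.instances[iid]
        if record["state"] is not State.KILLED:
            self.load[record["subsystem"]] -= 1  # release the resource
            record["state"] = State.KILLED

mgr = ComponentManager(["audio_dsp", "video_dsp"])
a = mgr.create("mp3_decoder")
b = mgr.create("h264_decoder")
mgr.start(a)
print(mgr.instances[a])
print(mgr.instances[b])
```

The caller never names a sub-system: it only asks for a component, which is the sense in which execution is "transparent".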



• Sub-system OS
  – Real-time micro task scheduler
  – Communication and synchronization services
• A sound execution framework
  – Clear separation between invocation (component manager side) and execution (component instances)
  – Highly scalable and flexible
  – Best use of platform resources


Nomadik - Tool support

• Multiple core approach
  – ARM
    • No a priori choice: whatever is available from the market for both compilation and debug
  – Multi-media based sub-systems
    • Dedicated and optimized tools for
      – Compilation
      – Simulation and analysis
      – Debug and trace
• Compilation
  – All C-based approach, no assembly code
  – Highly optimized and robust ANSI C compiler
  – DSP extensions matching the ITU/ETSI basic operation package
  – Multi-platform tools


ARM11 MPCore

• OS support: AMP vs. SMP
• Asymmetric multiprocessor (AMP)
  – Programmer statically allocates tasks
  – Uses a distributed view of memory
    • Synchronization and communication via explicit message passing mechanism
  – Same model as traditionally used in heterogeneous designs
    • Workloads are partitioned and manually offloaded to specific processors
• Symmetric multiprocessing (SMP)
  – OS dynamically allocates tasks to CPU
  – Programmer uses a shared view of memory
    • Synchronization and communication via common state in shared memory
  – Normally homogeneous CPU arrangement
    • Workloads are partitioned and dynamically shared between any processors
• OS related requirements
  – Cache coherency
  – Generic interrupt controller
  – Watchdog timer per processor

Source: Zivojnovic, MPSoC’05
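
The difference between the two programming views can be sketched with ordinary threads standing in for processors. This is an analogy only, under loud assumptions: Python threads are not ARM11 cores, and the function names are invented. The SMP-style version synchronizes through common state protected by a lock; the AMP-style version gives each worker a statically assigned chunk and collects results through explicit messages.

```python
import queue
import threading

def smp_sum(chunks):
    """SMP style: workers update common state in shared memory."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # synchronization on shared state
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def amp_sum(chunks):
    """AMP style: statically assigned work, results sent as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # explicit message passing

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)

data = [list(range(0, 50)), list(range(50, 100))]
print(smp_sum(data), amp_sum(data))
```

Both functions compute the same sum; what changes is where the coordination lives. The shared-state version depends on coherent memory (hence the cache coherency requirement), while the message-passing version works even when the workers see separate memories.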

Toshiba MeP (Media Processor)

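[Figure: Toshiba MeP architecture. A MeP module contains the MeP CPU core with instruction RAM/cache and data RAM/cache, hardware extensions (DSP unit, UCI unit, VLIW co-processor, HW engine), a local bus, a bus bridge, and DMA. Configurable processors 1 to N, each a MeP module, form a heterogeneous multiprocessor on a global bus.]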

Source: Matsui, MPSoC’05

Toshiba MeP (Media Processor)

• Configurable processor – MeP-C2 core
• Base processor:
  – 32-bit RISC
  – 5-stage pipeline
  – 350 MHz
  – 50 Kgates
• Configuration
  – memory size
  – optional instructions
  – bus width (32/64 bits)
  – interrupt (# channels, # levels)
  – debug support unit
• User extensions
  – User Custom Instruction (UCI) Unit – single-cycle ALU instructions
  – DSP unit – multi-cycle ALU instructions
  – VLIW co-processor – 2-way or 3-way
  – up to 10 hardware engines
  – control register extension up to 4 Kwords

Source: Matsui, MPSoC’05
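
The configuration limits listed for the MeP core can be captured in a small record with a validity check. This is a hypothetical sketch: the class, its field names, and the convention that 0 means "no VLIW co-processor" are assumptions made for illustration and do not correspond to any Toshiba tool or file format.

```python
from dataclasses import dataclass

@dataclass
class MePConfig:
    """Hypothetical record of some options listed on the slide."""
    bus_width_bits: int = 32
    vliw_ways: int = 0  # 0 means no VLIW co-processor (assumed convention)
    hw_engines: int = 0
    has_uci_unit: bool = False
    has_dsp_unit: bool = False

    def validate(self):
        """Return a list of violated limits (empty if the config is valid)."""
        errors = []
        if self.bus_width_bits not in (32, 64):
            errors.append("bus width must be 32 or 64 bits")
        if self.vliw_ways not in (0, 2, 3):
            errors.append("VLIW co-processor must be 2-way or 3-way")
        if not 0 <= self.hw_engines <= 10:
            errors.append("at most 10 hardware engines")
        return errors

video = MePConfig(bus_width_bits=64, vliw_ways=3, hw_engines=4)
broken = MePConfig(bus_width_bits=48, hw_engines=12)
print(video.validate())
print(broken.validate())
```

In a heterogeneous MPSoC each processor would carry its own such record, which is what makes the cores "configurable processors 1 to N" rather than identical copies.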

Toshiba MeP (Media Processor)

• Example of application: MPSoC
  – 4 MeP processors
    • Main control
    • Filter
    • Video processor, with MPEG4 / H.264 codec accelerators
    • Audio DSP, with DSP extension

Panasonic UniPhier

• Market: home electronics equipment – TV, DVD, cell phones
• DPP encourages future signal processing functions
• DPP is an optional part for cell phones
• Hardware engines are normally ASIC design parts for standardized functions
• Complex functions which are not yet standardized are realized by DPP

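[Figure: UniPhier processor structure, captioned "Excursion units". Three blocks are shown: the Instruction Parallel Processor (IPP) with its control unit, the Data Parallel Processor (DPP) with its processing element array, and the hardware engine. They are labelled Fundamental, Extension, and Extension.]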

Source: Nishitani, MPSoC’05

NEC MP211

• Market: cell phones
  – Current business acceleration
• Different OSs in component processors
• Poor future expandability due to single DSP

[Figure: NEC MP211 block diagram. Three ARM926 CPUs and one SPX-K602 DSP connected by a multi-layer AHB.]

Source: Nishitani, MPSoC’05

Massive multi-core

• Cisco CRS-1 Carrier Router System
• Continuous operation, service flexibility, extended longevity
• 92 Terabits per second
• Software programmable network processor (SPP)
• Each SPP processes 40 Gbps
• Parallel array of 188 Xtensa-based SPP processors

Source: Fu, MPSoC’05
