Universidade Federal do Rio Grande do Sul
Instituto de Informática
Programa de Pós-Graduação em Computação
Processor Architecture and Organization
Lecture 17
Multi-core architectures
Motivation
• Future applications will require still more performance
• Power is current bottleneck
• Performance driven by higher frequency and ILP hits power wall
• Performance may be obtained by multiple processors running at lower frequencies
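The last two bullets can be made concrete with the standard first-order model of CMOS dynamic power, P = C · V² · f. The sketch below is only illustrative: the capacitance, voltages and frequencies are made-up values, and it assumes that the workload splits perfectly across two cores and that halving the frequency permits a lower supply voltage.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """First-order CMOS dynamic power in watts: P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


# Hypothetical numbers, chosen only to illustrate the trend.
C_EFF = 1e-9  # effective switched capacitance per core (farads)

# One core running at full speed.
single = dynamic_power(C_EFF, vdd=1.2, freq_hz=2e9)

# Two cores at half the frequency; the lower frequency is assumed to
# allow a lower supply voltage (1.0 V here is an assumption).
dual = 2 * dynamic_power(C_EFF, vdd=1.0, freq_hz=1e9)

print(f"single core: {single:.2f} W")
print(f"dual core:   {dual:.2f} W")
```

Under these assumptions the two slower cores deliver the same aggregate cycle rate for less power, because the voltage enters the formula squared.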
Types of parallelism
• Single processor core
– ILP – Instruction-level parallelism
– VLIW – Very Long Instruction Word
– SIMD – Single Instruction Multiple Data
– SMT – Simultaneous Multi-Threading
• Multiprocessing
– SMP – Symmetrical Multi-Processing
– CMP – Chip Multi-Processing (usually homogeneous)
– MPSoC – Multi-Processor SoCs (usually heterogeneous)
Parallelism
Massive parallelism required in the foreseeable future
                        2003    2009    2015
Frequency (MHz)          300     600    1500
Gigaops/s                0.3      14    2458
Operations per cycle       1      23    1638
Source: ITRS Roadmap, 2003
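The "operations per cycle" row follows from the other two rows: it is the operation rate divided by the clock rate. A minimal sketch of that arithmetic, using only the figures from the table (the dictionary layout and function name are illustrative):

```python
# Values copied from the ITRS 2003 table.
roadmap = {
    2003: {"freq_mhz": 300, "gigaops": 0.3},
    2009: {"freq_mhz": 600, "gigaops": 14},
    2015: {"freq_mhz": 1500, "gigaops": 2458},
}


def ops_per_cycle(gigaops, freq_mhz):
    """Operations per clock cycle = (operations/s) / (cycles/s)."""
    return (gigaops * 1e9) / (freq_mhz * 1e6)


for year, row in roadmap.items():
    print(year, round(ops_per_cycle(row["gigaops"], row["freq_mhz"]), 1))
```

Truncated to integers, the results match the last row of the table. Frequency grows 5x between 2003 and 2015, so almost all of the required gain has to come from parallelism.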
Complexity of media applications
[Figure: operations per sample (10 to 100K) versus sampling rate per second (100K to 10G), with performance levels marked from 10 MOPS to 100 TOPS. Plotted workloads: H.263 QCIF, MPEG1 CIF, MPEG2 MP@ML and MPEG2 MP@HL, labelled for encode and decode.]
Source: Shen, SIPS 2003
Parallelism
[Figure: increasing HW threads per socket (log scale, 1 to 100) from 2003 to 2013: hyper-thread, then the multi-core era (scalar and parallel applications), then the many-core era (massively parallel applications).]
Source: www.intel.com
General-purpose processing
• Tera-level computing involves three distinct types of workloads, or computing capabilities:
– Recognition: the ability to recognize patterns and models of interest to a specific user or application scenario
– Mining: the ability to mine large amounts of real-world data for the patterns or models of interest
– Synthesis: the ability to synthesize large datasets or a virtual world based on the patterns or models of interest
• Intel foresees a multi-core architecture that is scalable, adaptable, and programmable
Source: www.intel.com
General-purpose solutions - IBM
• IBM Power4
– 2 cores, f = 1.4 GHz, 174 Mtransistors
– Single clock over entire die
– Power = 85 W
  • One core may be turned off
– Trend: multiple processors on die, bus communication, shared cache
• 4 Power4 chips into single module
– Chips connected via 4 128-bit buses
– Up to 128 MB L3 cache
– Bus speed = ½ processor speed
– Total throughput = 35 GB/s
– Trend: multiple processors on MCM, on-module communication, huge cache
Source: Franza, MPSoC’05
General-purpose solutions - Sun
• Sun UltraSPARC IV
– 2 cores, f = 1.8 GHz
– Shared 2 MB L2 cache
– 300 Mtransistors
• Sun Niagara
– 8 cores
– 4 threads per core
– Shared 3 MB L2 cache
– To be released in 2006
[Figure: Sun Niagara block diagram with eight 4-way MT SPARC pipes, a crossbar, a 4-way banked L2 cache, memory controllers & I/O, and I/O shared functions.]
Source: Franza, MPSoC’05
General-purpose solutions - AMD
• AMD dual-core Opteron
– 2 cores, f = 1.8 GHz
– 106 Mtransistors
– Power = 70 W
– 2 x 1 MB L2 caches
– Unshared caches
Source: Franza, MPSoC’05
General-purpose solutions - Intel
• Intel Pentium D
– 2 HT processors on MCM
– 2 x 1 MB L2 caches, unshared
– f = 3.2 GHz
– 230 Mtransistors
• Intel Itanium Montecito
– 2 VLIW cores, f = 1.5 GHz
– Power = 100 W
– 1.72 Btransistors
– 2 x 12 MB L3 asynchronous caches
– Multiple clock domains
– Power management
  • Dynamic voltage and frequency adjustment
Source: Franza, MPSoC’05
CMP with shared cache
• Advantages
– Low communication latency between the cores
– The interface between the cache and I/O is used only for off-chip communication
– The cache can be dynamically allocated among the cores
• Disadvantages
– Higher complexity
– The cache needs more bandwidth
• Examples
– IBM Power 4/5
– Sun UltraSPARC-IV+
CMP with shared I/O
• Advantages
– Simpler than the shared-cache model
– Communication between the cores does not have to leave the chip
• Disadvantages
– Resources are wasted because the caches are not shared
– The bandwidth between the cache and the bus is shared by on-chip and off-chip traffic
• Examples
– Intel Itanium 2 (Montecito)
– AMD Opteron Dual-Core
CMP with shared package
• Advantages
– Requires no changes to the CPU logic
– Short design time relative to the other models
• Disadvantages
– Latency of communication between the CPUs
– Limits the frequency of the interconnect bus
• Examples
– Intel Pentium D (Smithfield)
– Intel Pentium D (Presler)
– Intel Xeon (Dempsey)
MPSoC issues
• Heterogeneous vs. homogeneous multi-processing: trade-off between programmability and efficiency
– Heterogeneous ISAs
– DSP processors for media applications
– Hardwired blocks
– Configurable processors
– Heterogeneous memory systems and address spaces
– Heterogeneous interconnects
• MPSoCs are custom architectures, derived from configurable platforms, driven by standards
– Standards usually define I/O relationships, not algorithms
MPSoC issues
• Programming model and software development tools
• Memory model
– Heterogeneous memory systems are harder to program
– Support for real-time constraints and performance
• Communication architecture
– Support for real-time constraints and performance
• Design methodologies and tools
– How to configure a platform to meet application constraints?
– Time-to-market requires support from tools
– Market for tools is too limited
– More simulation-oriented (ASIC tools are more synthesis-oriented)
Examples of multi-cores for the embedded market
• ST Nomadik
• Cell - IBM / Sony / Toshiba
• ARM11 MPCore
• Toshiba media processor MeP
• NEC MP-211
• Panasonic UniPhier
• Infineon 3G-baseband MPSoC
ST Nomadik platform
[Figure: Nomadik block diagram. Blocks: general-purpose ARM CPU with caches; system DMA; embedded memory; memory, storage & connectivity peripheral interfaces; symmetrical multi-media DSPs, each with cache, DMA and hardware accelerators (HW1, HW2); graphics acceleration. The blocks are organized as loosely-coupled sub-systems.]
Source: Artieri, MPSoC’05
Nomadik - MPSoC benefits
• High-computing performance
– Multiple non-interfering domains of intense activity, each having its own processor, DMA services, and hardware accelerators for data intensive functions
– Hardware acceleration embedding standard functions
– Highest and predictable performance through a careful bus and memory hierarchy design
• Low-power
– Intrinsic low-power sub-systems
– Fine grain power management at sub-system level
– Leakage management by switching on & off sub-systems
Source: Artieri, MPSoC’05
Nomadik - MPSoC benefits
• Software flexibility
– General-purpose CPU allows fast porting of new features
– Performance through optimization on DSP with reasonable effort
– Full performance at low power using HW functions
• Three levels from simplest to most advanced usage
– Monolithic general-purpose CPU
– Monolithic general-purpose CPU, multiple symmetrical DSPs
– Monolithic general-purpose CPU, multiple symmetrical DSPs, hardware accelerators
Source: Artieri, MPSoC’05
Nomadik - Multi-media DSP processor profile
• Short pipeline, high VLIW parallelism efficiency
– 1 convolution tap per cycle (2 loads + 2 pointer updates + 1 multiplication + 1 MAC)
• Incremental architectural evolution, no race for frequency
• Floating point unit
– IEEE754 compliant
– Division and square root operation
• SIMD support
• Low power
– Level 0 cache for power saving
– Low-power instructions
– Massive gated clock physical implementation
• Programmed only in ANSI C
– Reduced learning curve and development time
– Allow seamless DSP architecture evolution
Source: Artieri, MPSoC’05
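To see what "1 convolution tap per cycle" packs into a single VLIW cycle, the sketch below spells out the work of one tap as separate steps. It is a plain sequential illustration, not Nomadik code: the function name and the data are hypothetical, the sample window is assumed to be already ordered to line up with the coefficients, and the multiplication and accumulation that the slide counts separately are folded into one statement.

```python
def fir_output(samples, coeffs):
    """Compute one filter output as the sum of coeffs[k] * samples[k]."""
    acc = 0.0
    s_ptr = 0  # pointer into the sample window
    c_ptr = 0  # pointer into the coefficient array
    for _ in range(len(coeffs)):
        x = samples[s_ptr]  # load 1
        h = coeffs[c_ptr]   # load 2
        acc += x * h        # multiply and accumulate (MAC)
        s_ptr += 1          # pointer update 1
        c_ptr += 1          # pointer update 2
    return acc


window = [1.0, 2.0, 3.0, 4.0]  # hypothetical sample window
taps = [0.5, 0.25, 0.25, 0.0]  # hypothetical filter coefficients
print(fir_output(window, taps))
```

A scalar processor spends several cycles on each loop iteration; the slide's claim is that the DSP issues all of these operations together, so each iteration costs one cycle.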
Nomadik - Memory hierarchy and bus
• Becomes the main design bottleneck
– Memory cache hierarchy
– Bus matrix
– Usage of shared embedded memory to offload bandwidth from external memory
– Smart caching in embedded memory is key
  • Managed by software
  • Hardware controlled
– L1-cache at sub-system level is sized in accordance with average latency
• A very manageable bottleneck
Source: Artieri, MPSoC’05
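One common way to reason about sizing a cache against average latency is the textbook average memory access time: hit time plus miss rate times miss penalty. The sketch below applies it with hypothetical cycle counts and miss rates (none of them come from the slides) to compare an L1 miss served by embedded memory with one that goes to external memory.

```python
def average_latency(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time in cycles."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical figures: the same L1 cache, with misses served either by
# on-chip embedded memory or by external SDRAM.
embedded = average_latency(hit_cycles=1, miss_rate=0.05, miss_penalty_cycles=10)
external = average_latency(hit_cycles=1, miss_rate=0.05, miss_penalty_cycles=100)

print(embedded, external)
```

With these numbers the same miss rate costs four times as much on average when misses leave the chip, which is the motivation for caching in embedded memory.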
Nomadik - Memory hierarchy
[Figure: sub-systems, each with its own DMA and L1 cache, connect to the embedded memory (L2 cache) over a path labelled "very high bandwidth, low latency"; the system DMA connects to the external mass memory (SDRAM) over a path labelled "bandwidth bottleneck, high latency".]
Source: Artieri, MPSoC’05
Nomadik – Software platform
[Figure: software stack, top to bottom]
• User interface: gaming, MP3 player, telephone, messaging, browser, PIM
• High-level client API
• Communication infra-structure, security framework, multi-media framework, Java, power management, telephony, networking
• Operating system core (kernel, device drivers, file system, …): Symbian, WinCE, Linux
• Low-level API (HCL)
• Multi-media accelerators & audio-video codec (MP3, AAC, Midi, … MPEG4, H.264, …); communication interfaces (UARTs, USB, BT, …); peripheral interfaces (LCD, cameras, memory, …)
Source: Artieri, MPSoC’05
Nomadik - Software overview
[Figure: on the ARM, applications run over middleware, drivers and HCL on top of an open OS and the component manager; each DSP runs its own OS with firmware (FW); the Nomadik kernel spans the ARM and the DSPs.]
Nomadik - Programming model
• Nomadik kernel
– A set of system services and API on which
  • Open OS drivers are built
  • Sub-system firmware is built
– Open OS agnostic
– Provides execution resource abstraction for user applications and firmware
Source: Artieri, MPSoC’05
Nomadik - Programming model
• Component = process = service
– A dynamically downloadable object
• Component Manager
– A unique gateway to all sub-systems
– Aware of all sub-system resources’ state and activity
– Transparently execute a component on any of the sub-systems
– Manage the life cycle of a component
  • Create, start, stop, kill component instances
  • Apply policy rules
– Memory management
  • Image installation
  • Memory allocation
  • Garbage collection
Source: Artieri, MPSoC’05
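The component manager's role as a single gateway can be sketched as a small class. Everything here is hypothetical: the class, method and sub-system names are invented for illustration, and the least-loaded placement rule merely stands in for the unspecified "policy rules"; it is not the Nomadik API.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"
    KILLED = "killed"


class ComponentManager:
    """Hypothetical gateway that places components on sub-systems."""

    def __init__(self, subsystems):
        # Number of live component instances per sub-system.
        self.load = {name: 0 for name in subsystems}
        self.instances = {}  # instance id -> record
        self._next_id = 0

    def create(self, component_name):
        # Assumed policy rule: place on the least-loaded sub-system.
        target = min(self.load, key=self.load.get)
        self.load[target] += 1
        iid = self._next_id
        self._next_id += 1
        self.instances[iid] = {
            "name": component_name,
            "subsystem": target,
            "state": State.CREATED,
        }
        return iid

    def start(self, iid):
        self.instances[iid]["state"] = State.RUNNING

    def stop(self, iid):
        self.instances[iid]["state"] = State.STOPPED

    def kill(self, iid):
        record = self.instances[iid]
        self.load[record["subsystem"]] -= 1  # release the resource
        record["state"] = State.KILLED

    def where(self, iid):
        return self.instances[iid]["subsystem"]


manager = ComponentManager(["arm", "audio_dsp", "video_dsp"])
a = manager.create("mp3_decoder")
b = manager.create("h264_decoder")
manager.start(a)
manager.kill(a)
print(manager.where(a), manager.where(b), manager.load)
```

The caller never names a sub-system: it asks for a component and the manager decides where it runs, which is the "transparent execution" the slide describes.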
Nomadik - Programming model
• Sub-system OS
– Real-time micro task scheduler
– Communication and synchronization services
• A sound execution framework
– Clear separation between invocation (component manager side) and execution (component instances)
– Highly scalable and flexible
– Best use of platform resources
Source: Artieri, MPSoC’05
Nomadik - Tool support
• Multiple core approach
– ARM
  • No a priori choice: whatever is available from the market for both compilation and debug
– Multi-media based sub-systems
  • Dedicated and optimized tools for
    – Compilation
    – Simulation and analysis
    – Debug and trace
• Compilation
– All C-based approach, no assembly code
– Highly optimized and robust ANSI C compiler
– DSP extensions matching the ITU/ETSI basic operation package
– Multi-platform tools
Source: Artieri, MPSoC’05
ARM11 MPCore
• OS support: AMP vs. SMP
• Asymmetric multiprocessor (AMP)
– Programmer statically allocates tasks
– Uses a distributed view of memory
  • Synchronization and communication via explicit message passing mechanism
– Same model as traditionally used in heterogeneous designs
  • Workloads are partitioned and manually offloaded to specific processors
• Symmetric multiprocessing (SMP)
– OS dynamically allocates tasks to CPU
– Programmer uses a shared view of memory
  • Synchronization and communication via common state in shared memory
– Normally homogeneous CPU arrangement
  • Workloads are partitioned and dynamically shared between any processors
• OS related requirements
– Cache coherency
– Generic interrupt controller
– Watchdog timer per processor
Source: Zivojnovic, MPSoC’05
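The two programming styles can be contrasted in a few lines. This is only an analogy, under the assumption that Python threads stand in for processors: the first half uses static allocation with explicit messages (AMP style), the second lets any worker take any pending item and synchronizes through shared state (SMP style). All names are illustrative.

```python
import queue
import threading


# --- AMP style: static allocation, explicit message passing ----------
def amp_worker(inbox, outbox):
    # This "processor" only ever runs the one task assigned to it.
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown message
            break
        outbox.put(msg * msg)


inbox, outbox = queue.Queue(), queue.Queue()
dsp = threading.Thread(target=amp_worker, args=(inbox, outbox))
dsp.start()
for value in (1, 2, 3):
    inbox.put(value)  # work is offloaded by sending a message
inbox.put(None)
dsp.join()
amp_results = [outbox.get() for _ in range(3)]

# --- SMP style: shared memory, any worker takes any item -------------
pending = [1, 2, 3]
shared = {"total": 0}
lock = threading.Lock()


def smp_worker():
    while True:
        with lock:  # synchronize on the shared work list
            if not pending:
                return
            item = pending.pop()
        result = item * item
        with lock:  # synchronize on the shared result
            shared["total"] += result


workers = [threading.Thread(target=smp_worker) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(amp_results, shared["total"])
```

In the AMP half the programmer decides which worker does what; in the SMP half the distribution of items between the two workers is decided at run time, which is why shared state needs the lock.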
Toshiba MeP (Media Processor)
[Figure: MeP module built around the MeP CPU core with instruction RAM/cache and data RAM/cache; HW extensions: DSP unit, UCI unit, VLIW co-processor, HW engine; local bus with bus bridge and DMA. Configurable processors 1 to N attach to a global bus, forming a heterogeneous multiprocessor.]
Source: Matsui, MPSoC’05
Toshiba MeP (Media Processor)
• Configurable processor – MeP-C2 core
• Base processor:
– 32-bit RISC
– 5-stage pipeline
– 350 MHz
– 50 Kgates
• Configuration
– memory size
– optional instructions
– bus width (32/64 bits)
– interrupt (# channels, # levels)
– debug support unit
• User extensions
– User Custom Instruction (UCI) Unit – single-cycle ALU instructions
– DSP unit – multi-cycle ALU instructions
– VLIW co-processor – 2-way or 3-way
– up to 10 hardware engines – control register extension up to 4 Kwords
Source: Matsui, MPSoC’05
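The configuration options and extension limits listed above can be captured as a small record with a validity check. The class and field names below are hypothetical and do not correspond to any Toshiba tool; only the limits (32/64-bit bus, 2-way or 3-way VLIW, up to 10 hardware engines) come from the slide.

```python
from dataclasses import dataclass


@dataclass
class MepConfig:
    """Hypothetical description of one configured MeP-style core."""
    bus_width_bits: int = 32
    vliw_ways: int = 0  # 0 means no VLIW co-processor
    hardware_engines: int = 0
    has_uci_unit: bool = False
    has_dsp_unit: bool = False

    def validate(self):
        """Return the list of violated limits (empty if valid)."""
        errors = []
        if self.bus_width_bits not in (32, 64):
            errors.append("bus width must be 32 or 64 bits")
        if self.vliw_ways not in (0, 2, 3):
            errors.append("VLIW co-processor must be 2-way or 3-way")
        if not 0 <= self.hardware_engines <= 10:
            errors.append("at most 10 hardware engines")
        return errors


video = MepConfig(bus_width_bits=64, vliw_ways=3, hardware_engines=2)
broken = MepConfig(bus_width_bits=48, hardware_engines=12)
print(video.validate())
print(broken.validate())
```

Each processor in a heterogeneous MeP-based design would get its own record of this kind, differing in which extensions are enabled.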
Toshiba MeP (Media Processor)
• Example of application: MPSoC
– 4 MeP processors
  • Main control
  • Filter
  • Video processor, with MPEG4 / H.264 codec accelerators
  • Audio DSP, with DSP extension
Panasonic UniPhier
• Market: home electronics equipment – TV, DVD, cell phones
• DPP encourages future signal processing functions
• DPP is an optional part for cell phones
• Hardware engines are normally ASIC design parts for standardized functions
• Complex functions which are not yet standardized are realized by DPP
[Figure: Instruction Parallel Processor (IPP) as the fundamental part; extensions: Data Parallel Processor (DPP), with control unit and processing element array, and hardware engine.]
Source: Nishitani, MPSoC’05
NEC MP211
• Market: cell phones
– Current business acceleration
• Different OSs in component processors
• Poor future expandability due to single DSP
[Figure: multi-layer AHB connecting ARM926 (CPU0), ARM926 (CPU1), ARM926 (CPU1) and the DSP SPX-K602.]
Source: Nishitani, MPSoC’05
Massive multi-core
• CISCO CRS-1 Carrier Router System
• Continuous operation, service flexibility, extended longevity
• 92 Terabits per second
• Software programmable network processor (SPP)
• Each SPP processes 40 Gbps
• Parallel array of 188 Xtensa-based SPP processors
Source: Fu, MPSoC’05