Parallel Architectures
Martino Ruggiero (martino.ruggiero@unibo.it)
Why Multicores?
• The SPECint performance of the hottest chips grew by 52% per year from 1986 to 2002, and then grew by only about 20% over the next three years (roughly 6% per year).
• Diminishing returns from uniprocessor designs
[from Patterson & Hennessy]
Power Wall
• The design goal for the late 1990s and early 2000s was to drive the clock rate up. This was done by packing more transistors onto a smaller chip.
• Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques.
[from Patterson & Hennessy]
Roadmap for CPU Clock Speed: Circa 2005
Here is the result of the best thinking in 2005: by 2015, the clock speed of the top "hot chip" would be in the 12-15 GHz range.
[from Patterson & Hennessy]
The CPU Clock Speed Roadmap (A Few Revisions Later)
This reflects the practical experience gained with dense chips that were literally "hot": they radiated considerable thermal power and were difficult to cool. Law of physics: all electrical power consumed is eventually radiated as heat.
[from Patterson & Hennessy]
The Multicore Approach
• Multiple cores on the same chip, each core being:
– Simpler
– Slower
– Less power demanding
The Memory Gap
• Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost.
[Figure: relative performance of CPU vs. memory, 1980-2005, on a log scale from 1 to 100,000; the gap widens over time]
[from Patterson & Hennessy]
Transition to Multicore
[Figure: sequential application performance over time]
Parallel Architectures
• Definition: "A parallel architecture is a collection of processing elements that cooperate and communicate to solve large problems fast."
• Questions about parallel architectures:
– How many processing elements are there?
– How powerful are the processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What HW and SW primitives are exposed to the programmer?
– Does it translate into performance?
Flynn Taxonomy of parallel computers

                              Data streams
                              Single      Parallel
Instruction     Single        SISD        SIMD
streams         Multiple      MISD        MIMD

M.J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, vol. 54, pp. 1900-1909, Dec. 1966.
• Flynn's Taxonomy provides a simple, but very broad, classification of computer architectures:
• Single Instruction, Single Data (SISD)
– A single processor with a single instruction stream, operating sequentially on a single data stream.
• Single Instruction, Multiple Data (SIMD)
– A single instruction stream is broadcast to every processor; all processors execute the same instructions in lock-step on their own local data streams.
• Multiple Instruction, Multiple Data (MIMD)
– Each processor can independently execute its own instruction stream on its own local data stream.
• SISD machines are the traditional single-processor, sequential computers, also known as the von Neumann architecture, as opposed to "non-von" parallel computers.
• SIMD machines are synchronous, with more fine-grained parallelism: they run a large number of parallel processes, one for each data element in a parallel vector or array.
• MIMD machines are asynchronous, with more coarse-grained parallelism: they run a smaller number of parallel processes, one per processor, each operating on the large chunk of data local to that processor.
Single Instruction/Single Data Stream: SISD
• Sequential computer
• No parallelism in either the instruction or the data stream
• Examples of the SISD architecture are traditional uniprocessor machines
[Figure: a single processing unit fed by one instruction stream and one data stream]
Multiple Instruction/Single Data Stream: MISD
• A computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized (for example, certain kinds of array processors)
• No longer commonly encountered; mainly of historical interest only
Single Instruction/Multiple Data Stream: SIMD
• A computer that exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized
– e.g., SIMD instruction extensions or a Graphics Processing Unit (GPU)
• Single control unit
• Multiple datapaths (processing elements, PEs) running in parallel
– PEs are interconnected and exchange/share data as directed by the control unit
– Each PE performs the same operation on its own local data
Multiple Instruction/Multiple Data Streams: MIMD
• Multiple autonomous processors simultaneously executing different instructions on different data
• MIMD architectures include multicores and Warehouse-Scale Computers (datacenters)
Parallel Computing Architectures: Memory Model
[Figure: three memory models.
– Centralized memory, shared address space: UMA (Uniform Memory Access), the symmetric multiprocessor (SMP); processors reach a single shared memory through an interconnection.
– Physically distributed memory, shared address space: NUMA (Non-Uniform Memory Access), the distributed-shared-memory multiprocessor; each processor has a local memory, all connected by an interconnection.
– Physically distributed memory, private address spaces: MPP (Massively Parallel Processors), the message-passing (shared-nothing) multiprocessor; processors with local memories communicate by send/receive over an interconnection.]
Parallel Architecture = Computer Architecture + Communication Architecture
Question: how do we organize and distribute memory in a multicore architecture?
• Two classes of multiprocessors with respect to memory organization:
1. Centralized-memory multiprocessor
2. Physically distributed-memory multiprocessor
• Two classes of multiprocessors with respect to addressing:
1. Shared address space
2. Private address spaces
Memory Performance Metrics
• Latency is the overhead in setting up a connection between processors for passing data.
– This is the most crucial problem for all parallel architectures: obtaining good performance over a range of applications depends critically on low latency for accessing remote data.
• Bandwidth is the amount of data per unit time that can be passed between processors.
– It needs to be large enough to support efficient passing of large amounts of data between processors, as well as collective communications and I/O for large data sets.
• Scalability is how well latency and bandwidth scale with the addition of more processors.
– This is usually only a problem for architectures with many cores.
Distributed Shared Memory Architecture: NUMA
• The data set is distributed among the processors:
– each processor accesses only its own data from local memory
– if data from another section of memory (i.e., another processor) is required, it is obtained by a remote access
• Much larger latency for accessing non-local data, but can scale to large numbers (thousands) of processors for many applications
– Advantage: scalability
– Disadvantage: locality problems and connection congestion
• The aggregated memory of the whole system appears as one single address space.
[Figure: processors P1, P2, P3, each with a local memory M1, M2, M3, connected by a communication network to a host processor]
Distributed Memory: Message-Passing Architectures
• Each processor is connected to exclusive local memory
– i.e., no other CPU has direct access to it
• Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
• Each CPU runs a serial process that can communicate with processes on other CPUs by means of the network.
• Non-blocking vs. blocking communication
• Message-passing (MPI) problems:
– All data layout must be handled by software
– Message passing has high software overhead
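To make the programming model concrete, here is a minimal sketch of explicit message passing between two processes using the standard MPI C API; the process count, buffer size, and message tag are illustrative assumptions, not details from the slides.

```cuda
// Minimal two-process message-passing sketch (standard MPI API, host code only).
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 0;        // illustrative message tag
    double buf[4] = {0};      // data lives in each process's private local memory

    if (rank == 0) {
        for (int i = 0; i < 4; i++) buf[i] = i;                  // producer fills its local buffer
        MPI_Send(buf, 4, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);    // blocking send to rank 1
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             // blocking receive from rank 0
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```

Note how the data layout (who owns `buf`, when it is sent) is entirely the programmer's responsibility, which is exactly the software overhead the slide points out.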
[Figure: a message-passing machine; each node couples a processor (P) with a local memory (Mem) and a network interface (NI), and the nodes communicate over an interconnect network]
Shared Memory Architecture: UMA
[Figure: four processors, each with a primary and a secondary cache, connected by a bus to a global memory]
• Each processor has access to all the memory, through a shared memory bus and/or a communication network
– Memory bandwidth and latency are the same for all processors and all memory locations.
• Lower latency for accessing non-local data, but difficult to scale to large numbers of processors; usually used for small numbers (order 100 or less) of processors.
Shared memory candidates
[Figure: three shared-memory candidates.
– Shared main memory: each processor has its own primary and secondary cache; only the global memory is shared.
– Shared secondary cache: each processor has its own primary cache; a single secondary cache and the global memory are shared.
– Shared primary cache: a single primary cache, a single secondary cache, and the global memory are shared by all processors.]
• Caches are used to reduce latency and to lower bus traffic
• Hardware must ensure that caches and memory are consistent (cache coherence)
• A hardware mechanism must be provided to support process synchronization
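As an illustration of the kind of synchronization primitive such hardware support enables, here is a minimal sketch of a spin lock built on an atomic test-and-set operation; the use of C++ std::atomic_flag, the thread count, and the counter workload are illustrative assumptions rather than anything prescribed by the slides.

```cuda
// Host-side sketch: a spin lock built on atomic test-and-set (C++11, valid as CUDA host code).
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;  // the lock word shared by all threads
long counter = 0;                               // shared data protected by the lock

void acquire() {
    // test-and-set: atomically set the flag and return its previous value;
    // spin while another thread already holds the lock
    while (lock_flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
}

void release() {
    lock_flag.clear(std::memory_order_release);
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; t++) {
        workers.emplace_back([] {
            for (int i = 0; i < 100000; i++) {
                acquire();
                counter++;        // critical section on shared data
                release();
            }
        });
    }
    for (auto &w : workers) w.join();
    std::printf("counter = %ld (expected 400000)\n", counter);
    return 0;
}
```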
Challenges of Parallel Processing
• The two biggest performance challenges in using multiprocessors:
– Insufficient parallelism
• The problem of inadequate application parallelism must be attacked primarily in software, with new algorithms that have better parallel performance.
– Long-latency remote communication
• Reducing the impact of long remote latency can be attacked both by the architecture and by the programmer.
Amdahl's Law
• The speedup due to an enhancement E is

    Speedup w/ E = (Execution time w/o E) / (Execution time w/ E)

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected. Then

    Execution time w/ E = Execution time w/o E * [(1 - F) + F/S]

    Speedup w/ E = 1 / [(1 - F) + F/S]
Amdahl's Law
• Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?

    Speedup = 1 / [(1 - F) + F/S]      (non-sped-up part plus sped-up part)
            = 1 / (0.5 + 0.5/2)
            = 1 / (0.5 + 0.25)
            = 1.33
Amdahl's Law
• If the portion of the program that can be parallelized is small, then the speedup is limited: the non-parallel portion limits the performance.
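The limit is easy to see numerically. The small sketch below (plain host code; the choice of parallel fractions and processor counts is purely illustrative) evaluates the Amdahl formula above, including the worked example from the slide.

```cuda
// Evaluate Amdahl's law: speedup = 1 / ((1 - F) + F/S)
// F = fraction of the task that is enhanced/parallelized, S = speedup of that fraction.
#include <cstdio>

double amdahl_speedup(double F, double S) {
    return 1.0 / ((1.0 - F) + F / S);
}

int main() {
    // Worked example from the slide: half the program accelerated by a factor of 2
    std::printf("F=0.50, S=   2 -> speedup = %.2f\n", amdahl_speedup(0.5, 2.0));  // 1.33

    // Illustrative sweep: even with many processors, a small serial part caps the speedup
    const double fractions[] = {0.5, 0.9, 0.99};
    const double procs[]     = {4, 16, 256};
    for (double F : fractions)
        for (double S : procs)
            std::printf("F=%.2f, S=%4.0f -> speedup = %6.2f\n", F, S, amdahl_speedup(F, S));
    return 0;
}
```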
Strong and Weak Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
– Strong scaling: speedup is achieved on a parallel processor without increasing the size of the problem
– Weak scaling: speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• A larger problem is needed to amortize the sources of OVERHEAD (additional code, not present in the original sequential program, needed to execute the program in parallel); a toy cost model of this effect is sketched below.
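The following toy model makes the amortization argument concrete; the per-run overhead term and the problem sizes are illustrative assumptions, not measurements from the slides.

```cuda
// Toy model of strong vs. weak scaling with a fixed parallel-execution overhead per run.
// T_par(N, P) = N/P + overhead, where N is the problem size and P the processor count.
#include <cstdio>

double speedup(double N, int P, double overhead) {
    double t_seq = N;                   // sequential time ~ problem size (arbitrary units)
    double t_par = N / P + overhead;    // parallel time: divided work plus overhead
    return t_seq / t_par;
}

int main() {
    const double overhead = 50.0;       // assumed fixed overhead (communication, extra code)
    const double N0 = 1000.0;           // baseline problem size (assumption)
    for (int P = 1; P <= 64; P *= 4) {
        double strong = speedup(N0, P, overhead);       // problem size kept fixed
        double weak   = speedup(N0 * P, P, overhead);   // problem size grows with P
        std::printf("P=%2d  strong-scaling speedup=%6.2f   weak-scaling speedup=%6.2f\n",
                    P, strong, weak);
    }
    return 0;
}
```

Under this model the strong-scaling speedup saturates at N0/overhead, while the weak-scaling speedup keeps growing because the overhead becomes relatively smaller as the problem grows.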
• Symmetric shared-memory machines usually support the caching of both shared and private data.
• Private data are used by a single processor, while shared data are used by multiple processors.
• When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Since no other processor uses the data, the program behavior is identical to that on a uniprocessor.
• When shared data are cached, the shared value may be replicated in multiple caches. This replication also reduces contention for shared data items that are being read by multiple processors simultaneously.
• Caching of shared data, however, introduces a new problem: cache coherence.
Symmetric Shared-Memory Architectures
[Figure: four processors, each with a primary and a secondary cache, connected by a bus to a global memory]
Example Cache Coherence Problem
[Figure: processors P1, P2, P3, each with a private cache ($), connected by a bus to memory and I/O devices. Event sequence: (1) P1 reads u and caches u = 5; (2) P3 reads u and caches u = 5; (3) P3 writes u = 7; (4) P1 reads u; (5) P2 reads u.]
– The cores see different values for u after event 3.
– With write-back caches, the value written back to memory depends on the order in which the caches flush or write back the value.
– Unacceptable for programming, and it happens frequently!
Keeping Multiple Caches Coherent
• The architect's job: with shared memory, keep the cached values coherent
• Idea: when any processor has a cache miss or performs a write, notify the other processors via the interconnection network
– If a block is only being read, many processors can have copies
– If a processor writes, invalidate all other copies
• A shared written result can "ping-pong" between caches
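A minimal software sketch of the write-invalidate idea, applied to the u = 5 / u = 7 example above; the cache and bus data structures are illustrative assumptions, since real hardware does this with snooping cache controllers rather than code.

```cuda
// Toy simulation of write-invalidate for a single memory location "u".
// Each cache holds at most one copy of u, tagged valid/invalid; a write by one
// processor invalidates all other copies, so their next reads miss and refetch.
#include <cstdio>

const int NPROC = 3;
struct CacheLine { bool valid; int value; };
CacheLine cache[NPROC] = {};     // private caches of P1..P3 (indices 0..2)
int memory_u = 5;                // value of u in main memory

int read_u(int p) {
    if (!cache[p].valid)                     // miss: fetch the block from memory
        cache[p] = {true, memory_u};
    return cache[p].value;
}

void write_u(int p, int v) {
    for (int q = 0; q < NPROC; q++)          // "broadcast" on the bus:
        if (q != p) cache[q].valid = false;  // invalidate all other copies
    cache[p] = {true, v};                    // update the writer's own copy
    memory_u = v;                            // write through to memory
}

int main() {
    std::printf("P1 reads u: %d\n", read_u(0));  // 5
    std::printf("P3 reads u: %d\n", read_u(2));  // 5
    write_u(2, 7);                               // P3 writes u = 7, other copies invalidated
    std::printf("P1 reads u: %d\n", read_u(0));  // 7, not a stale 5
    std::printf("P2 reads u: %d\n", read_u(1));  // 7
    return 0;
}
```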
Shared Memory Multiprocessor
Use snoopy mechanism to keep all processors’ view of memory coherent
[Figure: three processors, each behind a snoopy cache, share a memory bus with the physical memory and a DMA engine serving the disks]
Example: Write-Through Invalidate
• The other cached copies of u must be invalidated before the write in step 3 completes
• Write-update uses more of the broadcast medium's bandwidth, so all recent SMP multicores use write-invalidate
[Figure: the same three-processor example; with write-through invalidate, P3's write of u = 7 invalidates the other cached copies, so P1 and P2 subsequently read u = 7]
The Need for a More Scalable Protocol
• Snoopy schemes do not scale because they rely on broadcast
• Hierarchical snoopy schemes have the root as a bottleneck
• Directory-based schemes allow scaling
– They avoid broadcasts by keeping track of all CPUs caching a memory block, and then using point-to-point messages to maintain coherence
– They allow the flexibility to use any scalable point-to-point network
Scalable Approach: Directories
• Every memory block has associated directory information
– it keeps track of copies of cached blocks and their states
– on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
– in scalable networks, communication with the directory and the copies is through network transactions
• There are many alternatives for organizing the directory information
Basic Operation of a Directory
• k processors
• With each cache block in memory: k presence bits and 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
• Read from main memory by processor i:
– If the dirty bit is OFF: read from main memory; turn p[i] ON.
– If the dirty bit is ON: recall the line from the dirty processor (downgrading its cache state to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i.
• Write to main memory by processor i:
– If the dirty bit is OFF: send invalidations to all caches that have the block; turn the dirty bit ON; supply the data to i; turn p[i] ON; ...
[Figure: processor/cache nodes connected by an interconnection network; the memory carries a directory with one presence bit per node and a dirty bit per block]
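The read and write handlers above translate almost line for line into code. The following is a minimal sketch of one directory entry under that protocol; the data types, the recall/invalidate helpers, and the handling of a write when another processor already owns the block are illustrative assumptions.

```cuda
// Toy directory entry for one memory block, following the operations described above.
#include <cstdio>

const int K = 4;                         // number of processors

struct DirEntry {
    bool present[K] = {false};           // k presence bits: which caches hold a copy
    bool dirty = false;                  // dirty bit: some cache owns a modified copy
    int  owner = -1;                     // which processor owns the block when dirty
    int  mem_value = 0;                  // the block's value in main memory
};

// Placeholders for the point-to-point messages a real directory would send.
void send_invalidate(int p) { std::printf("  invalidate copy at P%d\n", p); }
int recall_from_owner(DirEntry &d) {
    std::printf("  recall dirty line from P%d\n", d.owner);
    return d.mem_value;                  // owner's value, modeled here as already written back
}

int directory_read(DirEntry &d, int i) {
    if (!d.dirty) {                      // dirty bit OFF: serve from memory
        d.present[i] = true;             // turn p[i] ON
        return d.mem_value;
    }
    int v = recall_from_owner(d);        // dirty bit ON: recall line, owner downgrades to shared
    d.mem_value = v;                     // update memory
    d.dirty = false;                     // turn the dirty bit OFF
    d.present[i] = true;                 // turn p[i] ON
    return v;                            // supply the recalled data to i
}

void directory_write(DirEntry &d, int i, int value) {
    if (!d.dirty) {
        for (int p = 0; p < K; p++)      // invalidate all caches that have the block
            if (d.present[p] && p != i) { send_invalidate(p); d.present[p] = false; }
    } else if (d.owner != i) {
        recall_from_owner(d);            // case not detailed on the slide: recall previous owner
        d.present[d.owner] = false;
    }
    d.dirty = true;                      // turn the dirty bit ON
    d.owner = i;
    d.present[i] = true;                 // turn p[i] ON
    d.mem_value = value;                 // supply the data to i (modeled as the new value)
}

int main() {
    DirEntry u; u.mem_value = 5;
    directory_read(u, 0);                // P0 reads
    directory_read(u, 2);                // P2 reads
    directory_write(u, 2, 7);            // P2 writes: P0's copy is invalidated
    std::printf("P0 re-reads u = %d\n", directory_read(u, 0));
    return 0;
}
```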
Real Manycore Architectures
• ARM Cortex-A9
• GPU
• P2012
ARM Cortex-A9 Processors
• 98% of mobile phones use at least one ARM processor
• 90% of embedded 32-bit systems use ARM
• The Cortex-A9 processors are the highest-performance ARM processors implementing the full richness of the widely supported ARMv7 architecture.
Cortex-A9 CPU
• Superscalar, out-of-order instruction execution
– Any of the four subsequent pipelines can select instructions from the issue queue
• Advanced instruction fetch and branch prediction
• Up to four instruction cache line prefetches pending
– Further reduces the impact of memory latency so as to maintain instruction delivery
• Between two and four instructions per cycle forwarded continuously into instruction decode
• Counters for performance monitoring
The Cortex-A9 MPCore Multicore Processor
• Design-configurable processor supporting between 1 and 4 CPUs
• Each processor may be independently configured for its cache sizes, FPU, and NEON
• Snoop Control Unit (SCU)
• Accelerator Coherence Port (ACP)
Snoop Control Unit and Accelerator Coherence Port
• The SCU is responsible for managing:
– the interconnect,
– arbitration,
– communication,
– cache-to-cache and system memory transfers,
– cache coherence
• The Cortex-A9 MPCore processor also exposes these capabilities to other system accelerators and non-cached DMA-driven mastering peripherals:
– to increase performance
– to reduce system-wide power consumption by sharing access to the processor's cache hierarchy
• This system coherence also reduces the software complexity otherwise involved in maintaining software coherence within each OS driver.
What is GPGPU?
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
– Programmability
– Precision
– Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics
– The GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating-point (FP) computation
• Applications (see GPGPU.org)
– Game effects (FX), physics, image processing
– Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Motivation 1: Computational Power
– GPUs are fast...
– GPUs are getting faster, faster
Motivation 2: Flexible, Precise, and Cheap
– Modern GPUs are deeply programmable
• Solidifying high-level language support
– Modern GPUs support high precision
• 32-bit floating point throughout the pipeline
• High enough for many (not all) applications
CPU-Style Cores to GPU Cores
[Slide sequence: start from a CPU-"style" core; slim it down; replicate it (two, four, sixteen cores); add ALUs to each core so that, e.g., 128 elements are processed in parallel; but what about branches?]
Stalls!
• Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
• Memory access latency = hundreds to thousands of cycles
• We have removed the fancy caches and logic that help avoid stalls.
• But we have LOTS of independent work items.
• Idea #3: interleave the processing of many elements on a single core to avoid stalls caused by high-latency operations.
[Slide sequence: hiding stalls by switching among many groups of elements; the result is throughput!]
NVIDIA Tesla
• Three key ideas:
– Use many "slimmed-down" cores running in parallel
– Pack the cores full of ALUs (by sharing the instruction stream across groups of work items)
– Avoid latency stalls by interleaving the execution of many groups of work items / threads / ...
• When one group stalls, work on another group
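These three ideas are what a CUDA kernel launch exposes to the programmer: many lightweight threads grouped into blocks, groups sharing an instruction stream, and the hardware switching between groups to hide latency. The SAXPY kernel below is a minimal, generic sketch; the vector size and launch configuration are illustrative assumptions.

```cuda
// Minimal CUDA sketch: many thread blocks of many threads give the hardware
// plenty of independent groups to interleave while memory accesses are in flight.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                          // 1M elements (assumption)
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // ... initialize x and y on the device or copy them from the host ...

    int threads = 256;                              // threads per block
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover n
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```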
On-Chip Memory
• Each multiprocessor has on-chip memory of the following four types:
– One set of local 32-bit registers per processor
– A parallel shared memory that is shared by all scalar processor cores and is where the shared memory space resides
– A read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, which is a read-only region of device memory
– A read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, which is a read-only region of device memory; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering
• The local and global memory spaces are read-write regions of device memory and are not cached.
Shared Memory
• Is on-chip:
– much faster than global memory
– divided into equally sized memory banks
– as fast as a register when there are no bank conflicts
• Successive 32-bit words are assigned to successive banks
• Each bank has a bandwidth of 32 bits per clock cycle.
[Figures: examples of shared memory access patterns without bank conflicts, and examples of access patterns with bank conflicts]
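As a code-level companion to those access-pattern figures, here is a hedged sketch of the two patterns; the tile size, launch configuration, and the assumption of 32 banks are illustrative.

```cuda
// Sketch of bank-conflict-free vs. bank-conflicting shared memory access.
// Assumption: 32 banks, successive 32-bit words mapped to successive banks (as described above).
#include <cuda_runtime.h>

#define TILE 64

__global__ void conflict_free(float *out) {
    __shared__ float buf[TILE];
    int t = threadIdx.x;
    buf[t] = (float)t;          // stride-1: thread i hits bank (i mod 32), no conflicts
    __syncthreads();
    out[t] = buf[t];
}

__global__ void two_way_conflict(float *out) {
    __shared__ float buf[2 * TILE];
    int t = threadIdx.x;
    buf[2 * t] = (float)t;      // stride-2: threads i and i+16 of a warp hit the same bank,
    __syncthreads();            // giving 2-way bank conflicts; the accesses are serialized
    out[t] = buf[2 * t];
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, TILE * sizeof(float));
    conflict_free<<<1, TILE>>>(d_out);      // fast access pattern
    two_way_conflict<<<1, TILE>>>(d_out);   // same result, slower shared memory accesses
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```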
Global Memory: Coalescing
• The device is capable of reading 4-byte, 8-byte, or 16-byte words from global memory into registers in a single instruction.
• Global memory bandwidth is used most efficiently when the simultaneous memory accesses can be coalesced into a single memory transaction of 32, 64, or 128 bytes.
[Figures: examples of coalesced and non-coalesced global memory access patterns]
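Again as a companion to those figures, a hedged sketch of the two patterns; the array size, stride, and launch configuration are illustrative assumptions.

```cuda
// Sketch of coalesced vs. strided (non-coalesced) global memory access.
#include <cuda_runtime.h>

__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];     // neighboring threads touch neighboring words:
                            // the accesses of a warp coalesce into few wide transactions
}

__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];  // neighboring threads touch words far apart:
                                  // many separate transactions, wasted bandwidth
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    int threads = 256, blocks = (n + threads - 1) / threads;
    coalesced_copy<<<blocks, threads>>>(in, out, n);     // efficient pattern
    strided_copy<<<blocks, threads>>>(in, out, n, 32);   // same work, poor bandwidth use
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```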
NVIDIA's Fermi-Generation CUDA Compute Architecture
The key architectural highlights of Fermi are:
• Third-generation Streaming Multiprocessor (SM)
– 32 CUDA cores per SM, 4x over GT200
– 8x the peak double-precision floating-point performance over GT200
• Second-generation Parallel Thread Execution ISA
– Unified address space with full C++ support
– Optimized for OpenCL and DirectCompute
• Improved memory subsystem
– NVIDIA Parallel DataCache hierarchy with configurable L1 and unified L2 caches
– Improved atomic memory operation performance
• NVIDIA GigaThread Engine
– 10x faster application context switching
– Concurrent kernel execution
– Out-of-order thread block execution
– Dual overlapped memory transfer engines
Third-Generation Streaming Multiprocessor
• 512 high-performance CUDA cores
– Each SM features 32 CUDA processors
– Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU)
• 16 load/store units
– Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock
– Supporting units load and store the data at each address to cache or DRAM
• Four Special Function Units
– Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal, and square root
P2012 Introduction
• The P2012 cluster is the computing node of the P2012 fabric
• The P2012 cluster has two variants:
– a homogeneous computing variant,
– a heterogeneous computing variant.
• A single architecture serves both variants.
[Figure: the P2012 fabric, a grid of P2012 clusters connected to a fabric controller and system bridges]
P2012 Cluster Main Features
• Symmetric multiprocessing
• Uniform memory access within the cluster
• Non-uniform memory access between clusters
• Up to 16 + 1 processors per cluster
• Up to 30.6 GOPS peak per cluster (assuming non-SIMD extension) at 600 MHz
• Up to 20.4 GFLOPS (32-bit) peak per cluster at 600 MHz
• 2 DMA channels allowing up to 6.4 GB/s data transfer
• HW support for synchronization:
– Fast barrier (within a cluster only) in ~4 cycles for 16 processors
– Flexible barrier in ~20 cycles for 16 processors
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements
• High level of customization through:
– the number of STxP70 processing elements,
– the STxP70 extensions (ISA customization),
– up to 32 user-defined HWPEs,
– memory sizes,
– the banking factor of the shared memory.
P2012 Cluster Overview: P2012 Cluster Architecture
[Figure: the multi-core subsystem (ENCore <N>) and its global interconnect interface]
• N x STxP70 cores
• 2xN-banked shared data memory
• N-to-2M logarithmic interconnect (memory)
• Peripheral logarithmic interconnect
• Runtime accelerator (HWS)
• Timers
• Cluster interfaces (I/O)
P2012 Cluster Overview: P2012 Cluster Architecture
[Figure: the cluster controller (CC) alongside the multi-core subsystem (ENCore <N>)]
• 1 STxP70-based cluster processor
• 16 KB P$ and TCDM
• CC peripherals (boot, ...)
• Clock, variability, and power controller (CVP)
• Cluster controller interconnect
P2012 Cluster Overview: P2012 Cluster Architecture
[Figure: the Debug and Test Unit (DTU) attached to the multi-core subsystem (ENCore <N>) and the cluster controller (CC)]
• Provides controllability and observability to the application developer
• Breakpoint propagation inside the cluster and across the fabric
P2012 Cluster Overview: P2012 Cluster Architecture
[Figure: custom HW processing elements and the streaming interface (SIF)]
• P x HW processing elements
• Stream Flow local interconnect (LIC)
• HWPE to/from LIC interfaces (HWPE_WPR)
• CC to/from LIC interface (SIF)
P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture
[Figure: full P2012 cluster architecture. The ENCore <N> multi-core subsystem contains STxP70 cores #1..#N (STxP70+FPx #1..#16 in the floating-point variant), each with a 16KB P$; a shared Tightly Coupled Data Memory (TCDM) of memory banks #1..#2xN (up to #32); a logarithmic interconnect to the TCDM; a peripheral logarithmic interconnect; an HWS runtime accelerator; timers; and DMA channels #0 and #1. The cluster controller (CC) contains an STxP70 cluster processor (CP) with 16KB P$ and a 32-KB TCDM, the CVP, CC peripherals, and the CC interconnect (CCI). HWPEs #1..#P attach through HWPE_WPR and SIF interfaces to a Stream Flow local interconnect. ENC2EXT, EXT2PER, and EXT2MEM ports, a global interconnect interface, an ENCore<N>-CC interface, and the Debug and Test Unit (DTU) complete the cluster.]
This slide highlights the STxP70 processing element:
• 32-bit RISC processor
• 16 KB P$, no local data memory
• 600 MHz in 32 nm
• Variable-length ISA
• Up to two instructions executed per cycle
• Configurable core
• Extendible through its ISA
• Complete software development tool chain
P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture
[Cluster architecture figure as above; this slide highlights the logarithmic interconnect (TCDM)]
• Parametric multi-core crossbar with a logarithmic structure
• Reduced arbitration complexity
• Round-robin arbitration scheme
• Up to N memory accesses per cycle
• Test-and-set support
P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture
[Cluster architecture figure as above; this slide highlights the DMA channels]
• Supports 1D and 2D transfers
• Up to 3.2 GB/s peak per DMA channel
• Supports up to 16 outstanding transactions
• Out-of-order (OoO) support
P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture
[Cluster architecture figure as above; this slide highlights the clock, variability, and power controller (CVP)]
• Ultrafast frequency adaptation (power control)
• Continuous critical-path monitoring (dynamic bin sampling)
• Continuous thermal sensing (temperature control)
P2012 Cluster Overview (Cont'd): P2012 Cluster Architecture
[Cluster architecture figure as above; this slide highlights the global interconnect]
• Highly flexible and configurable interconnect
• Asynchronous implementation
• Low-area or high-performance targets
• Natural GALS enabler
• High robustness to variations
LD/ST and DMA Memory Transfers
• Intra-cluster:
– LD/ST (UMA)
– DMA: from/to TCDM to/from HWPE
• Inter-cluster:
– LD/ST (NUMA)
– DMA: L1 to/from L1
• Cluster to/from L2 memory:
– LD/ST (NUMA)
– DMA: L1 to/from L2
• Cluster to/from L3 memory (through the system bridge):
– LD/ST (NUMA)
– DMA: L1 to/from L3
[Figure: the P2012 fabric, a grid of P2012 clusters with a fabric controller, an L2 memory (L2-MEM), and system bridges]
P2012 as GP Accelerator
[Figure: the P2012 fabric as a general-purpose accelerator next to an ARM host; clusters 0-3 each contain an L1 TCDM, a fabric controller (FC) manages the fabric, an L2 memory sits inside the fabric, and L3 (DRAM) is shared with the ARM host]
Summary
• The P2012 cluster includes up to 16 + 1 STxP70 cores, delivering up to 30.6 GOPS and 20.4 GFLOPS peak.
• ~7 GB/s DMA transfers
• Symmetric multiprocessing in a UMA fashion within a cluster; shared data memory in a NUMA fashion between clusters.
• Fast multiprocessor synchronization thanks to HW support
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements
Mobile SoC in 2012: NVIDIA Tegra II SoC (2011)
• Features:
– TSMC 40nm (LP/G)
– Dual-core A9, 1-1.2 GHz (G)
– GPU, etc., at 330-400 MHz (LP)
– GeForce ULV (8 shaders)
– 2 separate Vdd rails
– 1 MB L2$
– 32-bit LPDDR2 (600 MHz DR)
• A few (2, 4, 8) high-power processors (ARM): we need to handle power peaks
• Efficient accelerator fabrics with many (tens of) PEs: we need to improve efficiency
• Lots of (cool) memory, but we need more