maximizing heterogeneous system performance with arm ...€¦ · 28/06/2017  · easy adoption and...

23
© ARM 2017 Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Teratec June 2017 Systems and software group, ARM

Upload: others

Post on 12-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

Title 44pt sentence case

Affiliations 24pt sentence case

20pt sentence case

© ARM 2017

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Neil Parris, Director of product marketing

Teratec June 2017

Systems and software group, ARM

Page 2: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 2

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Intelligent flexible cloud to enable new use cases

Compute

Acceleration

Page 3: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 3

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Distributing intelligence efficiently at scale

Cortex-A75: Ground-breaking performance Cortex-A55: High-efficiency redefined

Page 4: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 4

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Unveiling DynamIQ – New single cluster design

▪ New application processors

▪ Cortex-A75 and Cortex-A55 CPUs

▪ Built for DynamIQ technology

▪ Private L2 and shared L3 caches

▪ Local cache close to processors

▪ L3 cache shared between all cores

▪ DynamIQ shared unit

▪ Contains L3, Snoop Control Unit (SCU) and

all cluster interfaces

▪ New AMBA 5 CHI bus interface

..

AMBA4 ACE or AMBA5 CHI

SCU

Shared L3 cacheACP

Cortex-A5532b/64b CPU

Private L2 cache

Async BridgesPeripheral Port

Cortex-A7532b/64b CPU

Private L2 cache

DynamIQ Shared Unit (DSU)

CoreLink CMN-600

DDR4-3200 DRAM

DMC-620 DMC-620

Page 5: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 5

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Open Source

UEFI bootARM Trusted Firmware

Linux Kernel

Fixed Virtual Platform (System Model)

SCP

Peripherals DDR4-3200 DRAM

DMC-620

MMU-500

NIC-450

PCIe/25GbE Storage

MCP

CoreLink CMN-600 Coherent Mesh Network

Cortex-A75Cortex-A75Cortex-A75Cortex-A75

DynamIQ& L3 Cache

Cortex-A75

DSU

Cortex-A75Cortex-A75Cortex-A75Cortex-A75

DynamIQ& L3 Cache

Cortex-A75

DSU

CoreSight

GIC-600

Non-ARM IP ARM IP

DMC-620

Build higher performance systems

▪ Boost performance

▪ Up to 7.5x more compute

▪ Innovations to increase throughput

▪ Up to 3x more packets per second

▪ Tailor solutions from edge to cloud

▪ Scale from 1 to 256 CPUs on a single chip

▪ Multichip connectivity supporting CCIX

▪ Scale to 1000+ CPUs

1 to 32 Clusters

1 to 8Controllers

Page 6: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 6

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

CMN-600: Delivering maximum compute density

7.5xcompute

5x throughput

32 Cortex-A72

CoreLink CCN-508

4x DMC-520

64 Cortex-A75

CoreLink CMN-600

8x DMC-620

16 Cortex-A57

CoreLink CCN-504

2x DMC-520

Rela

tive

Perf

orm

ance

Compute = measured by specint2k6_rate

Throughput = achieved requested bandwidth

Same process node and test conditions.

3x

0

2.5

5

7.5

Page 7: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 7

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Scalable solutions from core to edge

Access point Data center compute

System cache

AcceleratorCortex-A

CoreLink CMN-600

DMC-620

IO

NIC-450

IO

DMC-620100GbEPCIe

DMC-620 DMC-620

DMC-620

DM

C-6

20

DM

C-6

20

DM

C-6

20

DM

C-6

20

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CPU CPU CPU CPU

CoreLink CMN-600Bandwidth 1 TB/s20 GB/s

System cache 128MB0MB

DDR channels 81

Cortex-A CPUs 2561

Scalable system configurations

Page 8: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 8

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

ARM AMBA 5 CHI coherency protocol

▪ Credited, non-blocking, coherence protocol

▪ Capable of achieving high frequencies (>2GHz)

▪ Layered architecture definition

▪ Protocol/Routing/Link/Physical layers

▪ Flexible topologies for target applications

Messages

Flits

Phits

Packets

Protocol

Routing

Link

Physical

Protocol

Routing

Link

Physical

Topology agnostic

Page 9: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 9

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Innovations to increase throughput

▪ Up to 3x more throughput with cache stashing

▪ Used for acceleration, network, storage use-cases

▪ Critical data stays on-chip

▪ Accelerator or IO selects data placement

▪ Low latency access by a CPU or group of CPUs

▪ Stash to any cache level

▪ CPU L2 cache – private to a single CPU thread

▪ DynamIQ L3 Cache – shared by CPUs in a cluster

▪ Agile System Cache – shared by all DynamIQ clusters

Accelerator

or IO

CoreLink CMN-600

DMC-620 DMC-620

Agile System Cache

DDR4 DDR4

L3 Cache

L2 Cache

CPU

L2 Cache

CPU

L2 Cache

CPU

L2 Cache

Cortex-A

Agile System Cache

L3 Cache

L2 Cache

CPU

L2 Cache

CPU

L2 Cache

CPU

L2 Cache

Cortex-A

Stash critical data to any

cache level

Page 10: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 10

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Customized, tiered acceleration platform

Closely coupled cluster accelerator• Low latency control and data exchange

Global

Accelerator

or IO

DMC-620

Cortex-A

DSU

Loca

l

Acc

ele

rato

r

DMC-620

Off-chip coherent accelerator• High performance CCIX connectivity

Global SoC accelerator or IO• Direct, high bandwidth path to memory

Agile System Cache Agile System Cache

DDR4 DDR4

CoreLink CMN-600Remote

Accelerator

Agile System Cache• Shared heterogenous system cache

Page 11: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 11

Text 54pt sentence case

11

Text 54pt sentence case Coherent multichip

Page 12: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 12

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Interconnects needs at different scale

SoC interconnect

Connectivity for on-chip processors, accelerator, IO and memory

elements.

Server node interconnect - ‘scale-up’

Simple multichip interconnect (typically PCIe) topology on a PCB

motherboard with simple switches and expansion connectors

Rack interconnect - ‘scale-out’

Scale-out capabilities with complex topologies connecting 1000’s of

server nodes and storage elements.

Page 13: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 13

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Scale-up server node compute and acceleration

DMC-620

DMC-620 DMC-620

DMC-620

CoreLink CMN-600

Shared virtual memory system

DMC-620

DMC-620 DMC-620

DMC-620

CoreLink CMN-600

Accelerator

Smart

Network

Persistent

Memory

Page 14: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 14

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Cache coherent interconnect for accelerators

▪ Extends cache coherency to the multi-chip use cases

▪ Intelligent network/storage, machine learning, image/video

analytics, search and 4G/5G wireless

▪ Open standard to foster innovation, collaboration and

choice to solve emerging workload challenges

▪ Hardware specification available for designs start for

member companies

http://www.ccixconsortium.com

Page 15: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 15

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

CCIX multichip connectivity and topologies

▪ New class of interconnect providing high performance, low

latency for new accelerators use cases

▪ CCIX defines up to 25GT/s (3x performance*)

▪ Examining 56GT/s (7x performance*) and beyond

▪ Enabling low latency via light transaction layer

▪ Flexible, scalable interconnect topologies

▪ Flexible point-to-point, daisy chained and switched topologies

▪ Simplified deployment by leveraging existing PCIe

hardware and software infrastructure

▪ Runs on existing PCIe transport layer and management stack

▪ Coexist with legacy PCIe designs

Compute Node

Accelerator

Smart

Network

Persistent

Memory

Switch

* Note: Based on PCIe Gen3 Performance

Page 16: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 16

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Extending the benefits of cache coherency

▪ CCIX provides true peer processing with shared memory

▪ Accelerators may cache data from processor memory

▪ Coherency eliminates the software and DMA overhead of

transferring data between devices

▪ Coherency protocol easily adaptable to other transport in

the future

▪ Supports all major instruction set architectures (ISA) enabling

innovation and flexibility in accelerator system design

Compute Node

Accelerator

DDR Memory

Compute Node

Accelerator

DDR MemoryDDR Memory

Cache

Cache

shared virtual memory

shared virtual memory

Page 17: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 17

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

CCIX example request to home data flows

Memory

Accelerator shares processor memory

Requestor

Cache

Home

Daisy chain to shared processor memory

Requestor

Cache

Memory

Cache Cache Cache

Memory

Cache Cache

Shared processor and accelerator memory

Memory

Memory

Cache Cache

Shared memory with aggregation

Memory

Page 18: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 18

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Today’s existing PCIe integration

PC

Ie

Tra

nsa

ctio

n

Lay

er D

ata

Lin

k L

ayer

PH

Y (

16G

pbs)

16 Lanes

DMC-620

DMC-620 DMC-620

DMC-620

CoreLink CMN-600 XP

RN

I AXI

Cadence IP

Example CMN-600 mesh designPCIe IP connects to a

CMN IO interface via AXI

Page 19: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 19

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Coherent Multi-chip Link (CML)

PC

Ie

Tra

nsa

ctio

n

Lay

er D

ata

Lin

k L

ayer

PH

Y (

16G

pbs)

16 Lanes

DMC-620

DMC-620 DMC-620

DMC-620

CoreLink CMN-600 XP

CM

LR

NI

CXS

AXI

Cadence IP

Example CMN-600 mesh design CML converts CHI to

CCIX messages

CXS - CCIX stream

interface

Page 20: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 20

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Cadence CCIX integration

PC

Ie

Tra

nsa

ctio

n

Lay

er

CC

IX

Tra

nsa

ctio

n

Lay

er

Dat

a Lin

k L

ayer

PH

Y (

up t

o 25Gpbs)

16 Lanes

DMC-620

DMC-620 DMC-620

DMC-620

CoreLink CMN-600 XP

CM

LR

NI

CXS

AXI

Cadence IP

Example CMN-600 mesh design Low latency CCIX

transaction layer

Support up to 25Gbps vs

16Gbps PCIe Gen4

Page 21: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 21

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

CCIX 25Gbps technology demonstration

▪ 3X faster transfer speed with CCIX vs existing PCIe

Gen3 solutions

▪ Transferring of a data pattern at 25 Gbps between

two FPGAs

▪ Channel comprised of an Amphenol/FCI PCI Express

CEM connector and a trace card

▪ Transceivers are electrically compliant with CCIX

▪ Fastest data transfer between accelerators over PCI

Express connections https://forums.xilinx.com/t5/Xcell-Daily-Blog/CCIX-Tech-Demo-Proves-

25Gbps-Performance-over-PCIe/ba-p/767484

https://youtu.be/JpUSAcnn7VA

Xilinx and Amphenol FCI first public CCIX technology demo

Page 22: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

© ARM 2017 22

Title 40pt sentence case

Bullets 24pt sentence case

bullets 20pt sentence case

Scale-up server node performance with CCIX

▪ CCIX is a class of interconnect providing high performance, low latency for

new accelerator use cases

▪ Easy adoption and simplified development by leveraging today’s data centers

infrastructure

▪ Optimize CCIX SoCs with ARM CoreLink CMN-600 and CCIX controller IP

▪ Server, FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs

DMC-620

DMC-620 DMC-620

DMC-620

CoreLink CMN-600

DMC-620

DMC-620 DMC-620

DMC-620

CoreLink CMN-600

Accelerator

Smart

Network

Persistent

Memory

Page 23: Maximizing heterogeneous system performance with ARM ...€¦ · 28/06/2017  · Easy adoption and simplified development by leveraging today’s data centers infrastructure Optimize

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited

(or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be

trademarks of their respective owners.

Copyright © 2017 ARM Limited

© ARM 2017