use-cases for remote memory in the unimem architecture

15
Use-cases for Remote Memory in the Unimem Architecture Computer Architecture and VLSI Systems (CARV) Laboratory FORTH – ICS Nikolaos Kallimanis Manolis Marazakis Emmanouil Skordalakis

Upload: others

Post on 10-Apr-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Use-cases for Remote Memory in the Unimem Architecture

Use-cases for Remote Memory in the Unimem Architecture

C o m pute r A rc h i te c t ure a n d V L S I Sy st e ms ( C A RV ) L a b o rato r yF O RT H – I C S

Nikolaos Kallimanis Manolis Marazakis Emmanouil Skordalakis

Page 2: Use-cases for Remote Memory in the Unimem Architecture

Unimem Architecture

2

Communication mechanisms of the Unimem architecture:

1. Load/Store instructions across remote nodes.

2. Every page of physical memory is cacheable only in a single node.

3. Efficiently copying large amounts of memory from/to remote nodes.

4. Send and receive of small atomic messages in a low latency manner.

Unimem

Interconnect

Node 1 Node 2

Node 3 Node 4

Page 3: Use-cases for Remote Memory in the Unimem Architecture

3

How to exercise the Unimem

remote memory?

Page 4: Use-cases for Remote Memory in the Unimem Architecture

Unimem’s APIs

4

Apps(Unimem

optimized)

Apps(unmodified)

Apps(unmodified)

Runtimes(Unimem oriented)

Global Shared Address Space

Remote Memory SWAP

Apps(unmodified)

System Software - Unimem testbed(drivers, etc.)

HW - Unimem testbed(RDMA, virtual mailbox, virtual packetizer, remote memory access)

Unimem Sockets(modified libc)

Unimem Interfaces(Remote DMA, etc.)

Page 5: Use-cases for Remote Memory in the Unimem Architecture

Exercising Remote Memory

GSAS - Global Shared Address Space

1. Global Shared Address Space across system’s remote nodes.

2. It is mostly implemented based on mechanisms for sending/receiving small atomic messages.

3. API resembles to shared memory communication.

4. Applications can use this API for synchronization and for using remote memory.

5. Data are cached in the node that reside on → cacheable at single node.

KRAM - Remote Memory SWAP

1. It uses remote node's (unused) memory to create a SWAP.

2. It uses the Xilinx CDMA engine, or memcpy for transferring data.

3. Transparent to user application (unmodified apps).

4. Extends the memory that applications can use.

5. Data are cached in the local processor that application runs →cacheable at single node.

5

In both cases data are cached at a single node (Unimem property).

No complex hw-coherence protocols.

Flexibility.

Page 6: Use-cases for Remote Memory in the Unimem Architecture

Overview of the GSAS environment

GSAS Address Space

Node 1

0x0001-0000-0000-0000

0x0002-0000-0000-0000

0xFFFF-0000-0000-0000

0x0003-0000-0000-0000 …

Node 2

Node N

6

o64-bit address space.

oThe first 16 bits contain the routing information. node-id.

oThe remaining 48 bits indexing the memory of each node.

node-id (16 bits)

Page 7: Use-cases for Remote Memory in the Unimem Architecture

Overview of the GSAS environment

7

oThere is an atomic service at each node that serves requests.

oAtomic service is running on core 0 on every node of the system.

oApps and the atomic service communicate through small atomic messages with low latency.

oThere is a user-space library that handles the requests on the issuer side.

Unimem

Interconnect

Node 2

Packetizer Mailbox

Node 1

Packetizer Mailbox

Node 3

Packetizer Mailbox

Node 4

Packetizer Mailbox

Atomic Service

Page 8: Use-cases for Remote Memory in the Unimem Architecture

Overview of the GSAS environment

An application that uses the GSAS API is able to:o Allocation of memory in any

remote node.

o Spawning a new process on any remote node.

o Executing atomic operations (i.e., CAS, FAD, SWAP, etc.) on any remote memory location.

Low latency primitives (current prototype).o ≈ 2.0 μsec for a local issued

atomic instruction.o ≈ 3.9 μsec for a remote issued

atomic instruction.

GSAS Addresses

0x0001-0000-0000-0000

0x0002-0000-0000-0000

0xFFFF-0000-0000-0000

0x0003-0000-0000-0000

App

8

Page 9: Use-cases for Remote Memory in the Unimem Architecture

Performance/DHT on GSASGSAS use case example:oAn in-memory concurrent

Distributed Hash Table (DHT) is based on GSAS.

o It supports:DhtPut→ Store pairs of ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩

items in GSAS address space.

DhtGet→ Retrieves the value that corresponds to some 𝑘𝑒𝑦.

o Experiments on Unimem testbed (8 Trenz nodes):‒ Zynq MP Ultrascale+ SoC.

‒ 4 Arm A53 cores & 2 GB of local DDR4.

‒ Each thread executes pairs of DhtPut & DhtGet operations.

‒ Throughput of operations/sec is measured for different # of nodes.

‒ Experiments were performed for 1, 2 and 3 threads per node.

9

0

50000

100000

150000

200000

1 2 3 4 5 6 7 8

Thro

ugh

pu

t o

pe

rati

on

s/se

c

# nodes

1 Thread

2 Threads

3 Threads

Page 10: Use-cases for Remote Memory in the Unimem Architecture

KRAM – Concept/Architecture

10

250 MB are dedicated for the remote allocator service.

The maximum amount of local memory that local applications can use is 1.75 GB.

For more memory, KRAM creates a swap device to extend the memory.

𝑛0

𝑛1

Remoteallocator

KRAMKramtool

Remoteallocator

KRAMKramtool

𝑛1attached

unattached1.75 GB

1 GB

2 GB

𝑛0attached

unattached1.75 GB

0.8 GB

2 GB

Unused main memoryUsed main memory

Unused reserved memoryUsed reserved memory

Page 11: Use-cases for Remote Memory in the Unimem Architecture

KRAM -Requesting More Memory

11

1. User requests a swap device of a specific size.

2. KRAM requests memory from the local remote allocator service.

3. The remote allocator communicates with neighbor allocators for more memory.

Remoteallocator

KRAMKramtool

𝑛0

𝑛1

Remoteallocator

KRAMKramtool

𝑛1attached

unattached1.75 GB

1 GB

2 GB

𝑛0attached

unattached1.75 GB

0.8 GB

2 GB

Unused main memoryUsed main memory

Unused reserved memoryUsed reserved memory

Page 12: Use-cases for Remote Memory in the Unimem Architecture

KRAM – Allocating Remote Memory

12

4. The remote allocator of someneighbor sends the physicaladdress of the free remotememory.

5. The remote allocator sendsthe address to KRAM.

6. KRAM creates a swap device.

Remoteallocator

KRAMKramtool

𝑛0

𝑛1

Remoteallocator

KRAMKramtool

𝑛1attached

unattached1.75 GB

1 GB

2 GB

𝑛0attached

unattached1.75 GB

0.8 GB

2 GB

Unused main memoryUsed main memory

Unused reserved memoryUsed reserved memory

Page 13: Use-cases for Remote Memory in the Unimem Architecture

KRAM – Swapping

13

Now, Applications on node 𝑛0 are able to use:

- 1.75GB from local memory

- Some memory from remotes

Remoteallocator

KRAMKramtool

Remoteallocator

KRAMKramtool

𝑛1attached

unattached

1.75 GB

1 GB

2 GB

𝑛0 attached

𝑛0

𝑛1

𝑛0attached

unattached1.75 GB

0.8 GB

2 GB

Unused main memoryUsed main memory

Unused reserved memoryUsed reserved memory

Page 14: Use-cases for Remote Memory in the Unimem Architecture

KRAM – Stream Performance

14

3 nodes with specifications:• 2 GB RAM (1.75 GB main, 0.25GB reserved)• Zynq MP Ultrascale+ SoC

The Stream benchmark which calculates MB/s using the Copy, Scale, Add and Triad functions.

6 runs were performed, with dma mode and with memcpy mode

1. Local memory only.2. Remote memory of 48 MB.3. Remote memory of 96 MB.4. Remote memory of 144 MB.

1.6GB local 1.6GB local48MB swap

1.6GB local96MB swap

1.6GB local144MB swap

1.6GB local 1.6GB local48MB swap

1.6GB local96MB swap

1.6GB local144MB swap

memcpy memcpy memcpy memcpy dma dma dma dma

0

500

1000

1500

2000

2500

3000

3500

4000

4500

MB

/s

Copy Scale Add Triad

Page 15: Use-cases for Remote Memory in the Unimem Architecture

Thank You

15