
Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures

MARCO CERIANI† SIMONE SECCHI∗ ANTONINO TUMEO‡ ORESTE VILLA‡ GIANLUCA PALERMO†

CARL 2013, December 6, 2013

†Politecnico di Milano - DEI, 20133, Milano, Italy. {mceriani,gpalermo}@elet.polimi.it ∗Università degli Studi di Cagliari - DIEE, 09123, Cagliari, Italy. [email protected] ‡Pacific Northwest National Laboratory, Richland, WA. [email protected]

New generation of irregular HPC applications

- Complex Networks
- Community Detection
- Bioinformatics
- Knowledge Discovery
- Semantic Databases
- Language Understanding
- Pattern Recognition
- Big Science


Characteristics of Emerging Irregular Applications


- Use pointer- or linked-list-based data structures
  - Graphs, unbalanced trees, unstructured grids
  - Fine-grained data accesses
- Very large datasets
  - Far more than what is currently available on single cluster nodes
  - Very difficult to partition without generating load imbalance
- Very poor spatial and temporal locality
  - Unpredictable network and memory accesses
  - Memory and network bandwidth limited!
- Large amounts of parallelism (e.g., each vertex or each edge in the graph)
  - But irregular control flow, e.g. `if (vertex == x) z; else k;`
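Pointer chasing, one of the kernels evaluated later in the talk, distills these characteristics. A minimal C sketch (the struct layout and names are illustrative, not from the prototype): every hop is a data-dependent load with no spatial locality, so each iteration stalls on the previous one and only latency tolerance keeps the memory system busy.

```c
#include <stddef.h>

/* Linked-list node: each hop is a data-dependent load with essentially
 * no spatial or temporal locality, the access pattern that defeats
 * caches and hardware prefetchers. */
struct node {
    struct node *next;
    long payload;
};

/* Chase the chain and accumulate payloads. The address of each load is
 * only known once the previous load completes, so the core cannot
 * overlap them; multithreading hides this latency instead. */
long chase(struct node *head)
{
    long sum = 0;
    for (struct node *p = head; p != NULL; p = p->next)
        sum += p->payload;
    return sum;
}
```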

Objective

- We aim at designing a full-system architecture for irregular applications starting from off-the-shelf cores
  - Big datasets imply a multi-node architecture
- We do it by:
  - Introducing custom hardware and software components that optimize the architecture for executing multi-node irregular applications
  - Employing an FPGA prototype to validate the approach


Supporting Irregular Applications


[Figure: the three supporting mechanisms: fast context switching, fine-grain global address space, hardware synchronization]

- Fast context switching: tolerates latencies
- Fine-grain global address space: removes partitioning requirements, simplifies code development
- Hardware synchronization: increases performance on synchronization-intensive workloads

Why a prototype?

- Hardware components designed at the register-transfer level
  - Stronger validation than a simulator
  - Enables capturing primary performance issues
  - Exposes hardware implementation challenges
- Higher speed than a simulation infrastructure
  - Allows faster iterations between hardware and software
- Software layer can be co-developed and evaluated with the hardware


Node Architecture Overview


- MicroBlaze processors
  - Connected to private scratchpads
  - All access a shared external DDR3 memory
- Internal interconnection: AXI
- External interconnection: Aurora
- Three custom hardware components
  - GMAS: Global Memory Access Scheduler
  - GNI: Global Network Interface
  - GSYNC: Global SYNChronization module
- Support for lightweight software multithreading

Programming model

- Global address space
  - Shared-memory programming model on top of a distributed-memory machine
  - The developer allocates and frees memory areas in the global address space by using standard memory allocation primitives
- The Application Programming Interface (API) provides:
  - Extended malloc and free primitives that support allocation in the shared global memory space and in the node-local memory space
  - POSIX-like thread management: thread creation, join, yield
  - Synchronization routines: lock, spinning lock, unlock, barrier
- Applications are developed with a Single Program Multiple Data (SPMD) approach
  - Each thread executes the same code on different elements of the dataset
- In the current prototype, thread contexts are stored in private scratchpads and do not migrate
  - Potential load imbalance, but faster context switching
  - Alternative approach: storing contexts in the global address space and prefetching them into the scratchpads
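The slides do not give the actual API names, so the sketch below uses hypothetical stand-ins (`gm_malloc_global`, `gm_malloc_local`, `gm_free`, `spmd_scale`) to show the flavor of the programming model: extended allocation primitives for the two memory spaces, plus the SPMD convention that each thread runs the same code on a strided slice of the dataset. Here the allocators simply fall back to the host allocator; on the prototype the global variant would return an address in the scrambled global address space spanning all nodes.

```c
#include <stdlib.h>

/* Hypothetical extended allocation primitives (names are illustrative).
 * On the prototype, the global variant allocates in the shared global
 * memory space and the local variant in node-local memory; here both
 * just delegate to malloc so the sketch is runnable. */
void *gm_malloc_global(size_t bytes) { return malloc(bytes); }
void *gm_malloc_local(size_t bytes)  { return malloc(bytes); }
void  gm_free(void *p)               { free(p); }

/* SPMD work split: thread t of nthreads handles elements
 * t, t + nthreads, t + 2*nthreads, ... of the shared dataset. */
void spmd_scale(long *data, int n, int t, int nthreads)
{
    for (int i = t; i < n; i += nthreads)
        data[i] *= 2;
}
```

With four threads, calling `spmd_scale(data, n, t, 4)` for t = 0..3 covers the whole array with no partitioning of the data itself, which is exactly what the global address space buys.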


Quad-Board Prototyping Platform

- 4 Xilinx Virtex-6 ML605 boards (Virtex-6 LX240T devices)
- Xilinx ISE Embedded Design Suite 13.4
- Prototyped a quad-node system


GMAS


- One per core
- Forwards memory operations from the cores to the memories
- Enables scrambled global address space support
- Hosts Load/Store Queues (LSQs) for long-latency memory operations
- Provides thread IDs to the core
- Provides the interface to the GSYNC
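The exact scrambling hash used by the GMAS is not given in the slides; a plausible sketch, assuming a simple round-robin interleave at the 8-byte granularity quoted in the experimental setup, is shown below. Consecutive 8-byte blocks rotate across the 4 nodes, which spreads fine-grained accesses evenly and removes the need to partition data explicitly.

```c
#include <stdint.h>

#define NODES 4   /* quad-node prototype */
#define BLOCK 8   /* scrambling granularity in bytes, from the setup slide */

/* Which node owns a given global address? Under the assumed round-robin
 * interleave, block k of the global space lives on node k mod NODES. */
int home_node(uint64_t gaddr)
{
    return (int)((gaddr / BLOCK) % NODES);
}

/* Byte offset of that address inside its home node's memory: every
 * NODES-th block lands on the same node, packed contiguously. */
uint64_t local_offset(uint64_t gaddr)
{
    uint64_t block = gaddr / BLOCK;
    return (block / NODES) * BLOCK + gaddr % BLOCK;
}
```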

GMAS Operation

- When a core emits a memory operation
  - The GMAS descrambles it and verifies its destination
  - If it is local (local memories, local part of the global address space), it is directly forwarded to the destination memory
  - If it is remote:
    - The request is sent to the GNI
    - The information related to the memory operation is saved in the LSQ block, and the pending bit is set
    - A canary value is sent to the core, setting the redo bit
    - An interrupt is triggered, starting a context switch
- When the reply to the remote reference comes back
  - The pending bit is reset, allowing the source thread to be scheduled
  - When the thread is scheduled, it re-executes the memory operation and the redo bit is reset
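The steps above can be sketched as a small state machine over one LSQ entry. The slides only name the pending and redo bits; the canary value, field widths, and function names below are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CANARY 0xDEADBEEFu   /* placeholder value handed to the core
                                while the remote reference is in flight
                                (the real value is not specified) */

/* One Load/Store Queue entry per outstanding remote reference. */
struct lsq_entry {
    uint64_t addr;
    bool pending;   /* set when the request leaves for the GNI */
    bool redo;      /* set when the core received the canary and must
                       re-execute the memory operation later */
};

enum { NODE_LOCAL, NODE_REMOTE };

/* Core emits a load: local ones are forwarded straight to memory;
 * remote ones park in the LSQ and the core gets the canary so the
 * interrupt handler can switch to another thread. */
uint32_t gmas_load(struct lsq_entry *e, uint64_t addr, int where,
                   uint32_t local_value)
{
    if (where == NODE_LOCAL)
        return local_value;
    e->addr = addr;
    e->pending = true;      /* request sent to the GNI */
    e->redo = true;         /* thread must redo this access */
    return CANARY;
}

/* Reply arrives from the network: clear pending so the source thread
 * becomes schedulable again. */
void gmas_reply(struct lsq_entry *e) { e->pending = false; }

/* Re-execution of the memory operation clears the redo bit and
 * finally delivers the real value. */
uint32_t gmas_redo(struct lsq_entry *e, uint32_t value)
{
    e->redo = false;
    return value;
}
```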


GNI


- One GNI per node
- Interfaces AXI with the network (Aurora)
- Translates the internal network protocol to the external network protocol and vice versa
- A packet contains a header with the source node plus the original AXI transaction
- The destination GNI translates the incoming transaction, executes the memory operation, and sends back the result
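A toy encode/execute pair illustrates the packet flow; the slides only say the packet carries a header with the source node plus the original AXI transaction, so the field widths and the word-addressed memory model here are assumptions.

```c
#include <stdint.h>

/* Illustrative subset of an AXI transaction as forwarded over Aurora. */
struct axi_txn {
    uint64_t addr;
    uint32_t data;
    uint8_t  is_write;
};

/* External-network packet: source node in the header so the destination
 * GNI can route the reply back, transaction forwarded verbatim. */
struct gni_packet {
    uint8_t  src_node;
    struct axi_txn txn;
};

/* Sender side: wrap the internal AXI transaction for the external link. */
struct gni_packet gni_encode(uint8_t src, struct axi_txn t)
{
    struct gni_packet p = { src, t };
    return p;
}

/* Receiver side: execute the memory operation against local memory and
 * produce the value for the reply (0 for writes in this sketch). */
uint32_t gni_execute(struct gni_packet p, uint32_t *mem)
{
    uint64_t word = p.txn.addr / 4;   /* toy word-addressed memory */
    if (p.txn.is_write) { mem[word] = p.txn.data; return 0; }
    return mem[word];
}
```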

GSYNC

- One GSYNC per node
- Implements a lock table of configurable size
  - Each GSYNC stores the locks for the addresses on its own node
  - Direct mapping: multiple addresses share the same lock (aliasing)
- When a core writes to the lock register of the GMAS
  - A load is sent to the GSYNC addressing the related lock bit
  - The GSYNC handles the load as a bit swap and returns the current value in the slot
  - Locks not taken are retried in software
- When a core writes to the unlock register of the GMAS
  - A store with value 0 is sent to the GSYNC addressing the related lock bit
- Remote GSYNCs are accessed through the GNI as normal remote memory operations
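The direct-mapped lock table with bit-swap semantics can be sketched as follows. The hash from address to slot is an assumption (the slides only say direct mapping with aliasing), and the table size is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 8192   /* illustrative; configurable on the prototype */

/* Direct-mapped lock table: each address hashes to one lock bit, so
 * multiple addresses can alias to the same lock. Aliasing is safe but
 * conservative: unrelated addresses may contend for one bit. */
static bool lock_table[TABLE_SIZE];

static unsigned slot(uint64_t addr) { return (addr / 8) % TABLE_SIZE; }

/* The GSYNC handles a lock request as an atomic bit swap: set the bit
 * and return the previous value. A return of false means the caller
 * acquired the lock; true means it was held, retry in software. */
bool gsync_trylock(uint64_t addr)
{
    unsigned s = slot(addr);
    bool was_held = lock_table[s];
    lock_table[s] = true;
    return was_held;
}

/* Unlock is a plain store of 0 to the lock bit. */
void gsync_unlock(uint64_t addr) { lock_table[slot(addr)] = false; }
```

In hardware the swap happens inside the GSYNC in a single operation, which is what makes it atomic without any bus-level read-modify-write.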


Experimental setup

- 4 nodes
- From 1 to 32 MicroBlazes per node
- From 1 to 4 threads per MicroBlaze
- 512 MB per node; 32 MB as local memory, the rest exposed in the global address space, for a total of 1920 MB
- Scrambling: 8 bytes; GSYNC lock table: 8196 entries
- Bandwidth: 1.5 Gbps (500 Mbit per channel), with 1/3 overhead for headers (1 Gbps effective)
- Frequency: 100 MHz
- Delays:
  - Context switch: 232 cycles (41 ISR launch, 65 save context, 20 launch scheduler, 50 load context, 24 interrupt reset, 50 exit ISR)
  - Round trip for a remote memory reference: 403 cycles
- Applications:
  - Pointer chasing
  - Breadth-First Search (BFS)


Area of the Hardware Components


- Area relative to a Virtex-6 LX240T device

Experimental results - Pointer Chasing


- Bandwidth utilization increases with the number of cores
- Bandwidth utilization also increases with the number of threads
  - However, the system is saturated with 3 threads
- Utilization decreases with 3 and 4 threads at 32 cores with respect to 16 cores, because of higher contention on the internal interconnection

Experimental results - BFS


- 100,000 vertices, 80 neighbors on average, 3,998,706 traversed edges
- Throughput increases with the number of cores
  - Biggest increase from 4 to 8 cores
- Increasing the number of threads from 1 to 3 increases performance
  - However, with 4 threads performance decreases: increased contention on the GSYNC for the locks (BFS is synchronization intensive)

Conclusions

- Presented the set of hardware and software components that enables efficient execution of irregular applications on a many-core multi-node system, starting from off-the-shelf cores
  - Support for a global address space and long-latency remote memory operations (GMAS)
  - Fine-grained hardware synchronization (GSYNC)
  - Integrated network interface (GNI)
  - Fast software multithreading (with hardware-supported scheduling)
- Introduced an FPGA prototype of the proposed design
- Validated the prototype with two typical irregular kernels
  - Scaling in bandwidth utilization and performance when increasing cores and threads


Thank you for your attention!

Questions? [email protected]
