Parallela Board with Epiphany Coprocessor

Martin Kruliš

6. 1. 2015 (v1.0)


About Adapteva

Adapteva Company
◦ Small fabless semiconductor company
◦ Founded in 2008
◦ Main objective is to design massively parallel chips with emphasis on power efficiency
    - First company to design a chip that is expected to scale beyond 1000 cores
◦ Current products
    - Epiphany processor (16-core and 64-core versions)
    - Parallela board
◦ Parallela University Program started this year

Parallela Board

Labeled components:
◦ 16-core Epiphany coprocessor
◦ Zynq dual-core ARM Cortex-A9 (with integrated FPGA)
◦ 1 GB SDRAM
◦ 1 Gb Ethernet
◦ 2x μUSB
◦ μSD
◦ μHDMI
◦ Expansion slots


Parallela Architecture


Epiphany Coprocessor


Epiphany Architecture

Coprocessor
◦ 32-bit RISC cores with superscalar architecture
◦ 32 KB local memory per core (1-cycle latency)
    - Divided into four independent banks
◦ IEEE 754 compliant floating-point instruction set
◦ Two DMA channels

eMesh (Network-on-Chip)
◦ Both on-chip and off-chip communication
◦ No specific API; works with memory transactions

eLink (Chip-to-Chip Links)
◦ 4 I/O ports for external communication


Coprocessor Cores
◦ Simple in-order RISC architecture
    - Most instructions take 1 cycle
    - 8-stage dual-issue pipeline
    - Instruction set optimized for signal processing
◦ Separate integer and floating-point ALU
◦ 64x 32-bit registers (shared by the IALU and the FPU)
    - Load-store architecture
    - Register file accesses per cycle: 3 reads/1 write by the FPU, 2 reads/1 write by the IALU, and 1 load/store
◦ Performance
    - 16-core version: ~2 GFLOPS per core
    - 64-core version: ~1.6 GFLOPS per core
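Multiplying the per-core figures by the core counts gives the aggregate peak rates. This is plain arithmetic on the numbers quoted on the slide, not a measured result:

```latex
\begin{align*}
P_{16} &\approx 16 \times 2\,\text{GFLOPS} = 32\,\text{GFLOPS} \\
P_{64} &\approx 64 \times 1.6\,\text{GFLOPS} = 102.4\,\text{GFLOPS}
\end{align*}
```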



Memory Model
◦ Internal memory of each node is mapped into global memory



Local Memory
◦ Divided into four banks with independent controllers
◦ Each clock cycle each bank may perform:
    - Send 64-bit word to program sequencer
    - Transfer 64-bit word between memory and registers
    - Receive 64-bit word from eMesh interface
    - Local DMA sends 64-bit word to eMesh interface
◦ Memory order model
    - Local reads and writes follow strong memory model
    - Non-local transactions follow weak memory model
        - Operations may not propagate in the same order
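To make the bank split concrete, the following minimal sketch maps a local memory offset to a bank index. The sizes come from the slides (32 KB in four banks); the assumption that the banks are equal-sized and laid out contiguously from offset 0 is mine, and the names `bank_of` and `same_bank` are hypothetical helpers, not part of any SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the slides: 32 KB of local memory split into four banks. */
enum {
    LOCAL_MEM_SIZE = 32 * 1024,
    NUM_BANKS      = 4,
    BANK_SIZE      = LOCAL_MEM_SIZE / NUM_BANKS   /* 8 KB under the assumption below */
};

/* ASSUMPTION: banks are equal-sized and contiguous, bank 0 starting at offset 0. */
unsigned bank_of(uint32_t local_offset)
{
    assert(local_offset < LOCAL_MEM_SIZE);
    return local_offset / BANK_SIZE;
}

/* Returns nonzero when both offsets are served by the same bank controller. */
int same_bank(uint32_t a, uint32_t b)
{
    return bank_of(a) == bank_of(b);
}
```

Under this assumed layout, code and data placed in different banks are served by different controllers, which is one reason to keep, for example, instructions and frequently accessed buffers apart.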



eMesh
◦ 2D topology with nearest-neighbor connections
◦ 3 orthogonal (independent) meshes
    - cMesh – on-chip write transactions (8 B/cycle)
    - xMesh – off-chip write transactions (1 B/cycle)
    - rMesh – read requests (1 request per 8 cycles)
◦ Edge connections may be interfaced with other Epiphany chips
    - Or with other types of buses (off-core memory, I/O ports, …)
◦ Writes are strongly favored over reads
    - Write transactions are 16x faster
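The peak rates quoted for the write meshes translate into a simple lower bound on transfer time. The sketch below is only a back-of-the-envelope estimate: it ignores hop latency and contention, and `ideal_write_cycles` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Peak write rates quoted on the slide, in bytes per cycle. */
enum { CMESH_BYTES_PER_CYCLE = 8, XMESH_BYTES_PER_CYCLE = 1 };

/* Ideal number of cycles needed to write `bytes` at `rate` bytes per cycle,
   rounded up. Ignores routing latency and contention on the mesh. */
uint64_t ideal_write_cycles(uint64_t bytes, unsigned rate)
{
    assert(rate > 0);
    return (bytes + rate - 1) / rate;
}
```

For an 8 KB buffer this gives at least 1024 cycles on the cMesh and at least 8192 cycles on the xMesh, which illustrates why off-chip traffic is worth minimizing.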






eMesh Routing
◦ Upper 12 bits of the address form the address of the core
    - 6 bits – row index, 6 bits – column index
◦ Each node uses a simple routing algorithm
◦ Nodes use round-robin arbitration to avoid deadlock
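The address layout can be sketched in a few lines of C. The 12-bit core address with 6-bit row and column fields is taken from the slide. Treating addresses as 32 bits wide (which leaves a 20-bit offset) and placing the row bits above the column bits are assumptions. The `next_hop` function shows one plausible dimension-ordered scheme (column first, then row, with row indices growing southward); the slides do not spell out the actual algorithm, so it is an illustration only. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* From the slide: upper 12 bits = core address (6-bit row, 6-bit column).
   ASSUMPTION: 32-bit addresses, row bits above column bits, 20-bit offset. */
enum { OFFSET_BITS = 20, COORD_BITS = 6, COORD_MASK = 0x3F };

uint32_t global_address(unsigned row, unsigned col, uint32_t offset)
{
    assert(row <= COORD_MASK && col <= COORD_MASK);
    assert(offset < (1u << OFFSET_BITS));
    uint32_t core_id = ((uint32_t)row << COORD_BITS) | col;
    return (core_id << OFFSET_BITS) | offset;
}

unsigned row_of(uint32_t addr)    { return (addr >> (OFFSET_BITS + COORD_BITS)) & COORD_MASK; }
unsigned col_of(uint32_t addr)    { return (addr >> OFFSET_BITS) & COORD_MASK; }
uint32_t offset_of(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1u); }

typedef enum { HOP_LOCAL, HOP_EAST, HOP_WEST, HOP_NORTH, HOP_SOUTH } hop_t;

/* ASSUMPTION: dimension-ordered routing, columns resolved before rows,
   rows numbered from north to south. Illustrative only. */
hop_t next_hop(unsigned my_row, unsigned my_col, uint32_t dest)
{
    unsigned r = row_of(dest), c = col_of(dest);
    if (c > my_col) return HOP_EAST;
    if (c < my_col) return HOP_WEST;
    if (r > my_row) return HOP_SOUTH;
    if (r < my_row) return HOP_NORTH;
    return HOP_LOCAL;   /* destination is this node */
}
```

With these assumptions, offset 0x6000 in the core at row 32, column 8 has the global address 0x80806000, and a node only needs its own coordinates and the destination address to pick the outgoing direction.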



DMA
◦ Two DMA channels per node
◦ 2D addressing awareness, flexible strides
◦ Local-external memory and external-external memory transfers
◦ Completion signaling by HW interrupt
◦ Master and slave modes
    - Slave DMA is controlled by external I/O or another DMA
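What "2D addressing with flexible strides" buys you can be shown with a software model of such a transfer. The sketch below is not the hardware interface: the `copy2d_t` descriptor and its field names are hypothetical and do not mirror the DMA registers. It only illustrates how an inner count, an outer count and two strides describe a rectangular block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor of a 2D strided transfer (illustrative names). */
typedef struct {
    size_t elem_size;    /* bytes per element */
    size_t inner_count;  /* elements per row of the block */
    size_t outer_count;  /* number of rows in the block */
    size_t src_stride;   /* bytes between consecutive source rows */
    size_t dst_stride;   /* bytes between consecutive destination rows */
} copy2d_t;

/* Software model: copies outer_count rows of inner_count elements each. */
void copy2d(void *dst, const void *src, const copy2d_t *d)
{
    const unsigned char *s = src;
    unsigned char *t = dst;
    size_t row_bytes = d->elem_size * d->inner_count;

    assert(d->src_stride >= row_bytes && d->dst_stride >= row_bytes);
    for (size_t r = 0; r < d->outer_count; ++r)
        memcpy(t + r * d->dst_stride, s + r * d->src_stride, row_bytes);
}
```

A single descriptor of this kind is enough to pull a tile out of a larger row-major matrix into a compact buffer, which is the access pattern a tiled matrix multiplication needs.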



Programming

Epiphany SDK
◦ Separate compilation for host and coprocessor code
    - Epiphany uses e-gcc and e-objcopy
◦ The host runtime provides a way to
    - Detect the coprocessor
    - Allocate memory, transfer data
    - Execute precompiled binaries on the coprocessor

OpenCL
◦ The coprocessor is perceived as an OpenCL accelerator
◦ Each core is a compute unit, on-chip memory is local memory, …


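Host Code Example

#include <unistd.h>   /* usleep */
#include <e-hal.h>    /* Epiphany host library */

e_platform_t platform;
e_epiphany_t dev;
unsigned i, j, coreid;
int flag;   /* type is an assumption; emem, emsg and _BufSize come from setup code not shown */

e_init(NULL);                     /* initialize the host runtime */
e_reset_system();
e_get_platform_info(&platform);   /* detect the coprocessor */

/* open a workgroup covering the whole chip and load the precompiled binary */
e_open(&dev, 0, 0, platform.rows, platform.cols);
e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols,
             E_TRUE);             /* start flag, per the eSDK e-hal signature */

for (i = 0; i < platform.rows; ++i)
    for (j = 0; j < platform.cols; ++j) {
        /* global core ID: 6-bit row index followed by 6-bit column index */
        coreid = (i + platform.row) * 64 + j + platform.col;
        usleep(100000);                                    /* give the core time to run */
        e_read(&emem, 0, 0, 0x0, emsg, _BufSize);          /* message from external memory */
        e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));   /* flag from the core's local memory */
        ...
    }

e_close(&dev);
e_finalize();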


Example

Matrix Multiplication
◦ Using naïve algorithm
◦ Square N×N matrices
◦ N is divisible by the number of cores
    - Each core computes its corresponding tile of the result matrix
◦ Both input matrices and the output matrix fit into the total amount of local memory
    - A smart plan of the computations and the data transfers can be devised
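The tile-per-core decomposition can be sketched on the host as follows. This is a sequential stand-in rather than Epiphany code: it assumes the cores form a grid × grid arrangement and that N is divisible by the grid size, and the function names are hypothetical. Each call to `multiply_tile` does the work one core would do with the naïve algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Computes the tile of C = A * B owned by the "core" at (core_row, core_col).
   Matrices are n x n, row-major.
   ASSUMPTION: cores form a grid x grid arrangement and n is divisible by grid. */
void multiply_tile(const float *a, const float *b, float *c,
                   size_t n, size_t grid, size_t core_row, size_t core_col)
{
    assert(grid > 0 && n % grid == 0);
    assert(core_row < grid && core_col < grid);
    size_t tile = n / grid;

    for (size_t i = core_row * tile; i < (core_row + 1) * tile; ++i)
        for (size_t j = core_col * tile; j < (core_col + 1) * tile; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)   /* naive dot product of row i and column j */
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Sequential stand-in for running one kernel instance per core. */
void multiply_all(const float *a, const float *b, float *c, size_t n, size_t grid)
{
    for (size_t r = 0; r < grid; ++r)
        for (size_t q = 0; q < grid; ++q)
            multiply_tile(a, b, c, n, grid, r, q);
}
```

Because the tiles of the result are disjoint, the per-core calls are independent; what a real implementation has to plan is when each core gets the rows of A and columns of B it needs.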


Matrix Multiplication

A tiles are rotated vertically in each column.
B tiles are rotated horizontally in each row.


Discussion
