![Page 1: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/1.jpg)
Parallela Board with Epiphany Coprocessor
Martin Kruliš
6. 1. 2015
by Martin Kruliš (v1.0)
![Page 2: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/2.jpg)
About Adapteva
Adapteva Company
◦ Small fabless semiconductor company
◦ Founded in 2008
◦ Main objective is to design massively parallel chips with emphasis on power efficiency
    First company to design a chip that is expected to scale beyond 1000 cores
◦ Current products
    Epiphany processor (16-core and 64-core versions)
    Parallela board
◦ Parallela University Program started this year
![Page 3: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/3.jpg)
Parallela Board
Components labeled on the board photo:
◦ 16-core Epiphany coprocessor
◦ Zynq dual-core ARM Cortex-A9 (with integrated FPGA)
◦ 1 GB SDRAM
◦ 1 Gb Ethernet
◦ μSD
◦ μHDMI
◦ μUSB (two ports)
◦ Expansion slots
![Page 4: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/4.jpg)
Parallela Architecture
![Page 5: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/5.jpg)
Epiphany Coprocessor
![Page 6: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/6.jpg)
Epiphany Architecture
Coprocessor
◦ 32-bit RISC cores with superscalar architecture
◦ 32 KB local memory per core (1-cycle latency)
    Divided into four independent banks
◦ IEEE 754 compliant floating-point instruction set
◦ Two DMA channels
eMesh (Network-on-Chip)
◦ Both on-chip and off-chip communication
◦ No specific API, works with memory transactions
eLink (Chip-to-Chip Links)
◦ 4 I/O ports for external communication
![Page 7: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/7.jpg)
Epiphany Architecture
Coprocessor Cores
◦ Simple in-order RISC architecture
    Most instructions take 1 cycle
    8-stage dual-issue pipeline
    Instruction set optimized for signal processing
◦ Separate integer and floating-point ALU
◦ 64x 32-bit registers (for both IALU and FPU)
    Load-store architecture
    Per cycle: 3 reads/1 write by the FPU, 2 reads/1 write by the IALU, and 1 load/store
◦ Performance
    16 cores ~ 2 GFLOPS each, 64 cores ~ 1.6 GFLOPS each
![Page 8: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/8.jpg)
Epiphany Architecture
Memory Model
◦ Internal memory of each node is mapped into global memory
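The mapping of a node's internal memory into the global address space can be sketched in C. This is a minimal sketch that assumes the layout given on the eMesh Routing slide (upper 12 bits of a 32-bit address hold the core ID as a 6-bit row and a 6-bit column), which leaves 20 bits for the offset inside the core. The helper name `global_address` and the constants are hypothetical, not part of the SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [31:26] row, [25:20] column, [19:0] offset within the core. */
enum { COORD_BITS = 6, CORE_OFFSET_BITS = 20 };

/* Hypothetical helper: global address of a location in the local
   memory of the core at mesh coordinates (row, col). */
static uint32_t global_address(unsigned row, unsigned col, uint32_t local_offset)
{
    assert(row < (1u << COORD_BITS) && col < (1u << COORD_BITS));
    assert(local_offset < (1u << CORE_OFFSET_BITS));

    uint32_t core_id = ((uint32_t)row << COORD_BITS) | (uint32_t)col;
    return (core_id << CORE_OFFSET_BITS) | local_offset;
}
```

For example, offset 0x6000 in the core at row 32, column 8 would map to 0x80806000 under these assumptions.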
![Page 9: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/9.jpg)
Epiphany Architecture
Local Memory
◦ Divided into four banks with independent controllers
◦ Each clock cycle each bank may perform:
    Send a 64-bit word to the program sequencer
    Transfer a 64-bit word between memory and registers
    Receive a 64-bit word from the eMesh interface
    Send a 64-bit word to the eMesh interface via the local DMA
◦ Memory order model
    Local reads and writes follow a strong memory model
    Non-local transactions follow a weak memory model
        Operations may not propagate in the order in which they were issued
![Page 10: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/10.jpg)
Epiphany Architecture
eMesh
◦ 2D topology with nearest-neighbor connections
◦ 3 orthogonal (independent) meshes
    cMesh – on-chip write transactions (8 B/cycle)
    xMesh – off-chip write transactions (1 B/cycle)
    rMesh – read requests (1 request per 8 cycles)
◦ Edge connections may be interfaced with
    other Epiphany chips
    or other types of buses (off-core memory, I/O ports, …)
◦ Significantly favors write operations over read operations
    Write transactions are 16x faster
![Page 11: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/11.jpg)
Epiphany Architecture
eMesh
![Page 12: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/12.jpg)
Epiphany Architecture
eMesh Routing
◦ The upper 12 bits of an address identify the core
    6 bits – row index, 6 bits – column index
◦ Each node uses a simple routing algorithm
◦ Nodes use round-robin arbitration to avoid deadlock
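A per-node routing decision based on the 12-bit core ID might look like the sketch below. The slide does not spell out the algorithm, so the comparison order (column first, then row) and the orientation (rows grow southwards, columns grow eastwards) are assumptions, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned row, col; uint32_t offset; } mesh_addr_t;

typedef enum { HOP_LOCAL, HOP_NORTH, HOP_SOUTH, HOP_WEST, HOP_EAST } hop_t;

/* Split a 32-bit address into 6-bit row, 6-bit column and 20-bit offset. */
static mesh_addr_t decode_address(uint32_t addr)
{
    mesh_addr_t a;
    a.row = (addr >> 26) & 0x3Fu;
    a.col = (addr >> 20) & 0x3Fu;
    a.offset = addr & 0xFFFFFu;
    return a;
}

/* ASSUMPTION: the column is compared first, then the row.
   The node at (node_row, node_col) forwards the transaction one hop
   towards the destination, or delivers it locally on a match. */
static hop_t next_hop(unsigned node_row, unsigned node_col, uint32_t dest_addr)
{
    assert(node_row < 64 && node_col < 64);
    mesh_addr_t d = decode_address(dest_addr);

    if (d.col < node_col) return HOP_WEST;
    if (d.col > node_col) return HOP_EAST;
    if (d.row < node_row) return HOP_NORTH;
    if (d.row > node_row) return HOP_SOUTH;
    return HOP_LOCAL;
}
```

Because every node only compares two small integers, no routing tables are needed; the address itself carries the route.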
![Page 13: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/13.jpg)
Epiphany Architecture
DMA
◦ Two DMA channels per node
◦ 2D addressing awareness, flexible strides
◦ Local-external memory and external-external memory transfers
◦ Completion signaling by HW interrupt
◦ Master and slave modes
    Slave DMA is controlled by external I/O or another DMA
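What "2D addressing with flexible strides" means can be illustrated by a software emulation of such a transfer. The descriptor below is hypothetical and is not the hardware register layout; it only shows how an inner and an outer loop with independent source and destination strides allow, for instance, a tile to be cut out of a larger row-major matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 2D transfer descriptor (NOT the Epiphany DMA register format). */
typedef struct {
    size_t elem_size;            /* bytes per element */
    size_t inner_count;          /* elements per row of the transfer */
    size_t outer_count;          /* number of rows */
    ptrdiff_t src_inner_stride;  /* bytes between consecutive source elements */
    ptrdiff_t dst_inner_stride;  /* bytes between consecutive destination elements */
    ptrdiff_t src_outer_stride;  /* bytes between source row starts */
    ptrdiff_t dst_outer_stride;  /* bytes between destination row starts */
} dma2d_desc_t;

/* Software emulation of the strided copy described by the descriptor. */
static void dma2d_copy(const dma2d_desc_t *d, const void *src, void *dst)
{
    assert(d != NULL && src != NULL && dst != NULL);
    const uint8_t *src_row = src;
    uint8_t *dst_row = dst;

    for (size_t r = 0; r < d->outer_count; ++r) {
        const uint8_t *s = src_row;
        uint8_t *t = dst_row;
        for (size_t c = 0; c < d->inner_count; ++c) {
            memcpy(t, s, d->elem_size);
            s += d->src_inner_stride;
            t += d->dst_inner_stride;
        }
        src_row += d->src_outer_stride;
        dst_row += d->dst_outer_stride;
    }
}
```

Copying a 2x2 tile out of a 4x4 float matrix would use an inner stride of `sizeof(float)` on both sides, a source outer stride of `4 * sizeof(float)` and a destination outer stride of `2 * sizeof(float)`.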
![Page 14: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/14.jpg)
Programming
Epiphany SDK
◦ Separate compilation for host and coprocessor code
    Epiphany uses e-gcc and e-objcopy
◦ The host runtime provides a way to
    Detect the coprocessor
    Allocate memory, transfer data
    Execute precompiled binaries on the coprocessor
OpenCL
◦ The coprocessor is perceived as an OpenCL accelerator
◦ Each core is a compute unit, on-chip memory is local memory, …
![Page 15: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/15.jpg)
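Programming
Host Code Example
#include <unistd.h>   /* usleep */
#include <e-hal.h>    /* Epiphany host library */

/* Declarations below the first two are not shown on the slide; types are assumed. */
e_platform_t platform;
e_epiphany_t dev;
e_mem_t emem;         /* shared buffer; must be set up with e_alloc() before it is read */
char emsg[_BufSize];
unsigned i, j, coreid, flag;

e_init(NULL);
e_reset_system();
e_get_platform_info(&platform);
/* Open a workgroup covering the whole chip and load the program on every core.
   The trailing start flag (E_TRUE) is missing on the slide and is assumed here. */
e_open(&dev, 0, 0, platform.rows, platform.cols);
e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);
for (i = 0; i < platform.rows; ++i)
    for (j = 0; j < platform.cols; ++j) {
        /* Absolute core ID: row in the upper 6 bits, column in the lower 6 bits. */
        coreid = (i + platform.row) * 64 + j + platform.col;
        usleep(100000);
        e_read(&emem, 0, 0, 0x0, emsg, _BufSize);        /* message from the shared buffer */
        e_read(&dev, i, j, 0x6000, &flag, sizeof(flag)); /* flag from the core's local memory */
        ...
    }
e_close(&dev);
e_finalize();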
![Page 16: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/16.jpg)
Example
Matrix Multiplication
◦ Using naïve algorithm
◦ Square matrices (N × N)
◦ N is divisible by number of cores
    Each core computes its corresponding tile of the result matrix
◦ Both input matrices and the output matrix fit into the total amount of local memory
    A smart plan of the computations and the data transfers can be devised
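The memory constraint gives a rough upper bound on N. As a worked estimate, assume single-precision (4-byte) elements and the 16-core chip with 32 KB of local memory per core; the three N × N matrices must then satisfy

$$
3 \cdot 4N^2 \le 16 \cdot 32 \cdot 1024 = 524288
\;\Longrightarrow\; N^2 \le 43690.\overline{6}
\;\Longrightarrow\; N \le 209
$$

The largest such N divisible by 16 is 208. Local memory also has to hold the program code and the stack, so the practical limit is lower than this bound.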
![Page 17: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/17.jpg)
Example
Matrix Multiplication
A tiles are rotated vertically in each column
B tiles are rotated horizontally in each row
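The assignment of result tiles to cores can be sketched in plain C. This is a host-side emulation under stated assumptions: the cores form a square `grid` x `grid` arrangement, N is divisible by `grid`, and matrices are stored row-major. The rotation of the A and B tiles between cores is not modeled; each emulated core simply reads the full input matrices. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the tile of C owned by the core at (core_row, core_col).
   a, b and c are row-major n x n matrices. */
static void multiply_tile(const float *a, const float *b, float *c,
                          size_t n, size_t grid,
                          size_t core_row, size_t core_col)
{
    assert(grid > 0 && n % grid == 0);
    assert(core_row < grid && core_col < grid);

    const size_t tile = n / grid;          /* tile edge length */
    const size_t row0 = core_row * tile;   /* first row of this core's tile */
    const size_t col0 = core_col * tile;   /* first column of this core's tile */

    for (size_t i = row0; i < row0 + tile; ++i)
        for (size_t j = col0; j < col0 + tile; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Sequential stand-in for running the same kernel on every core. */
static void multiply_on_grid(const float *a, const float *b, float *c,
                             size_t n, size_t grid)
{
    for (size_t r = 0; r < grid; ++r)
        for (size_t q = 0; q < grid; ++q)
            multiply_tile(a, b, c, n, grid, r, q);
}
```

Since the tiles of C are disjoint, the per-core calls are independent and could run concurrently; on the real device the interesting part is keeping each core supplied with the A and B tiles it needs, which is what the tile rotation addresses.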
![Page 18: Martin Kruliš 6. 1. 2015 by Martin Kruliš (v1.0)1](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d825503460f94a681b2/html5/thumbnails/18.jpg)
Discussion