Inventing the Future of Computing
An Alternative to GPU Acceleration For Mobile Platforms
Andreas Olofsson [email protected]
50th DAC
June 5th, Austin, TX
Adapteva Achieves 3 “World Firsts”
1. First commercial chip to reach 50 GFLOPS/W
2. First coprocessor with an open source OpenCL SDK
3. First semiconductor company to successfully crowd-source a project
Copyright © Adapteva. All rights reserved.
What is Adapteva
Company History:
• Fabless semiconductor company founded in 2008
• 16-core 65nm Epiphany-III chip product sampling since May 2011
• 64-core 28nm Epiphany-IV chip product sampling since July 2012
• Parallella open computing platform launched in October 2012
Notable Achievements:
• #1 in microprocessor energy efficiency
• 4 chips on $2.5M in raised capital
• $2M in total revenue to date
• 5K customers, 6,300 boards pre-sold
• 18 Patents pending
Our guiding light
• Efficient
• Robust
• Heterogeneous
• Parallel
Any Reason to Think the Future of Computing is NOT Parallel?
• No Electronic Computing: until 1943
• “Von Neumann Age” (Serial Computing): 1943-2013?
• Parallel Computing: 2013-??
A Practical Start: True Heterogeneous Computing
[Figure: a system-on-chip combining four big CPUs, an FPGA, 1000’s of small RISC CPUs, a GPU, and an analog block. Workload labels on the figure: O/S, Application, Math, Weird Math, Graphics, “The Joker”.]
The Accelerator Challenge
[Figure: two execution timelines built from Application, Move Data, Context Switch, and Something Else blocks.]
• Status Quo Approach (~1.3X speedup): a limited accelerator, with the application interrupted by many Move Data and Context Switch steps
• “Smart” Coprocessor (>10X speedup?): the same application with fewer Move Data and Context Switch steps
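One way to see why the status quo approach gains so little is an Amdahl-style estimate in which data movement and context switches add time on top of the accelerated work. The sketch below is a rough model, not something taken from the slides: `offload_speedup` and every input value are hypothetical, chosen only to illustrate how a small offloadable fraction plus overhead caps the overall speedup.

```python
def offload_speedup(offload_fraction, accel_factor, overhead_fraction):
    """Overall speedup when part of a program is moved to an accelerator.

    offload_fraction and overhead_fraction are fractions of the original
    host-only run time; accel_factor is how much faster the accelerator
    runs the offloaded part. All inputs here are illustrative assumptions.
    """
    host_time = 1.0 - offload_fraction            # work that stays on the host
    accel_time = offload_fraction / accel_factor  # offloaded work, now faster
    # Data moves and context switches are pure added time.
    new_time = host_time + accel_time + overhead_fraction
    return 1.0 / new_time


# A limited accelerator: only 30% of the run time can be offloaded.
limited = offload_speedup(offload_fraction=0.30, accel_factor=10,
                          overhead_fraction=0.05)

# A more general coprocessor: 95% offloaded, with less overhead.
smart = offload_speedup(offload_fraction=0.95, accel_factor=20,
                        overhead_fraction=0.02)

print(f"limited accelerator: {limited:.2f}x")
print(f"smart coprocessor:   {smart:.2f}x")
```

With these made-up inputs the first case lands near the ~1.3X figure and the second is several times higher; the shape of the model matters here, not the specific numbers.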
The Epiphany Coprocessor
[Figure: an Epiphany tile containing a 1 GHz RISC core, local memory, a multicore framework, and a router.]
• <20pJ / FLOP!
• MIMD/Task-Parallel Accelerator
• Coprocessor for ARM/x86 Host
• 32-128KB Local Memory
• 1.6 GFLOPS Per Core @ ~25mW
• Packet Based Network-On-Chip With 100GB/s Bisection BW
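The energy-per-operation claim can be checked against the per-core figures on the same slide. The sketch below is simple unit arithmetic; the helper name `picojoules_per_flop` is hypothetical, and it takes the ~25mW figure at face value even though the slide marks it as approximate.

```python
def picojoules_per_flop(gflops, milliwatts):
    """Energy per floating-point operation, in picojoules."""
    flops_per_second = gflops * 1e9
    watts = milliwatts * 1e-3
    joules_per_flop = watts / flops_per_second
    return joules_per_flop * 1e12


# Per-core figures quoted on the slide: 1.6 GFLOPS at roughly 25 mW.
core_pj = picojoules_per_flop(gflops=1.6, milliwatts=25)
core_gflops_per_watt = 1.6 / 0.025

print(f"{core_pj:.1f} pJ/FLOP")
print(f"{core_gflops_per_watt:.0f} GFLOPS/W")
```

This works out to about 15.6 pJ per FLOP, or 64 GFLOPS/W per core, which is consistent with the "<20pJ / FLOP" label.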
Epiphany-IV -- GLOBALFOUNDRIES 28SLP IP
• 64 CPUs
• IEEE Floating Point (SP)
• 800 MHz Max Frequency
• 100 GFLOPS Performance
• 6.4 GB/s IO BW
• 200 GB/s peak NOC BW
• 1.6 TB/sec on chip memory BW
• 25 Billion Messages/sec
• 2MB on chip memory
• 10 mm2 total silicon area in 28nm
• 2 Watt total chip power
• 324 ball 15x15mm BGA
• Sampling since July, 2012
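Several of the chip-level figures above follow from per-core numbers. The sketch below is a back-of-the-envelope check that assumes 2 floating-point operations per core per cycle and 32KB of local memory per core. Both are assumptions, chosen to be consistent with the 1.6 GFLOPS per core and 32KB+ figures quoted elsewhere in the talk; this slide does not state them.

```python
CORES = 64
CLOCK_GHZ = 0.8        # 800 MHz max frequency (from the slide)
CHIP_POWER_W = 2.0     # 2 Watt total chip power (from the slide)
FLOPS_PER_CYCLE = 2    # assumption: e.g. one multiply-add per cycle
LOCAL_MEM_KB = 32      # assumption: 32KB of local memory per core

peak_gflops = CORES * CLOCK_GHZ * FLOPS_PER_CYCLE
total_mem_mb = CORES * LOCAL_MEM_KB / 1024
gflops_per_watt = peak_gflops / CHIP_POWER_W

print(f"peak performance: {peak_gflops:.1f} GFLOPS")
print(f"on-chip memory:   {total_mem_mb:.0f} MB")
print(f"efficiency:       {gflops_per_watt:.1f} GFLOPS/W")
```

Under these assumptions the results are roughly 102 GFLOPS, 2MB, and 51 GFLOPS/W, in line with the 100 GFLOPS, 2MB, and 50 GFLOPS/W figures quoted in the slides.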
[Figure: Epiphany-IV chip diagram with four eLink IO blocks; each tile holds a 1 GHz high-performance RISC CPU, 32KB+ distributed local memory, a multicore communication framework, and a router.]
| Epiphany ANSI-C Benchmarks (Cycles) | Naïve C | Optimal C | Theoretical | C-Efficiency |
|---|---|---|---|---|
| 8x8 Matrix Multiplication | 2852 | 773 | 512 | 66% |
| 16 Tap FIR Filter (32 points) | 1562 | 620 | 512 | 82% |
| Bi-quad IIR Filter (32 points) | n/a | 991 | 768 | 77% |
| Dot-product (256 point) | 800 | 557 | 256 | 49% |

1 day per benchmark (compare to GPUs?)
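The C-Efficiency column appears to be the theoretical cycle count divided by the cycle count of the optimized C version. That definition is an assumption, since the slide does not spell it out. The sketch below recomputes the column under that assumption so the quoted values can be compared.

```python
# (benchmark, optimal C cycles, theoretical cycles, efficiency quoted on the slide)
benchmarks = [
    ("8x8 Matrix Multiplication", 773, 512, 66),
    ("16 Tap FIR Filter (32 points)", 620, 512, 82),
    ("Bi-quad IIR Filter (32 points)", 991, 768, 77),
    ("Dot-product (256 point)", 557, 256, 49),
]


def c_efficiency(optimal_cycles, theoretical_cycles):
    """Percentage of the theoretical cycle count reached by optimized C.

    Assumed definition: 100 * theoretical / optimal.
    """
    return 100.0 * theoretical_cycles / optimal_cycles


for name, optimal, theoretical, quoted in benchmarks:
    computed = c_efficiency(optimal, theoretical)
    print(f"{name}: computed {computed:.1f}%, slide {quoted}%")
```

Under this assumed definition the first three rows agree with the slide to within a point. The dot-product row comes out nearer 46% than the quoted 49%, so the slide may use a different definition or rounding for that entry.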
| | Adapteva E64 (800 MHz) | Tilera GX36 (1.4GHz) | Intel Xeon L5640 (2.2GHz) | Nvidia Tegra-2 (1GHz) |
|---|---|---|---|---|
| CoreMark™ Score | 77,912 | 165,276 | 118,571 | 5,866 |
| # Cores | 64 | 36 | 8 | 2 |
| Power | 2W | ~30-50W | ~50-100W | ~1-2W |
| 1024-Core Chip | 2,493,184 | n/a | n/a | n/a |

Server Level Performance at 2 Watts!!
Architecture Comparison

| Technology | FPGA | DSP | GPU | CPU | Epiphany |
|---|---|---|---|---|---|
| Process | 28nm | 40nm | 28nm | 32nm | 28nm |
| Programming | VHDL | OCL/C++/C | CUDA/OCL | OCL/C/C++ | OCL/C/C++ |
| Area (mm^2) | 590 | 108 | 294 | 216 | 10 |
| Chip Power (W) | 40 | 22 | 135 | 130 | 2 |
| “CPUs” | n/a | 8 | 32 | 4 | 64 |
| Max GFLOPS | 1500 | 160 | 3000 | 115 | 102 |
| GHz * Cores | n/a | 12 | 32 | 14.4 | 51.2 |
| Compile Time | Hours | Minutes | Minutes | Minutes | Minutes |
| L1 Memory | 6MB | 512KB | 2.5MB | 256KB | 2MB |

• Peak performance means very little
• No magic bullet!
• Efficiency is everything
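The "efficiency is everything" point can be made concrete by dividing the peak figures in the comparison table by power and by area. The sketch below does only that; the `efficiency` helper is hypothetical, and the inputs are the table's peak numbers rather than measured application performance.

```python
# Rows copied from the comparison table: (area in mm^2, chip power in W, max GFLOPS)
chips = {
    "FPGA": (590, 40, 1500),
    "DSP": (108, 22, 160),
    "GPU": (294, 135, 3000),
    "CPU": (216, 130, 115),
    "Epiphany": (10, 2, 102),
}


def efficiency(area_mm2, power_w, gflops):
    """Return (GFLOPS per watt, GFLOPS per mm^2) from peak figures."""
    return gflops / power_w, gflops / area_mm2


for name, row in chips.items():
    per_watt, per_mm2 = efficiency(*row)
    print(f"{name:9s} {per_watt:6.1f} GFLOPS/W {per_mm2:6.1f} GFLOPS/mm^2")

# Which entry has the highest peak GFLOPS per watt?
best_per_watt = max(chips, key=lambda name: efficiency(*chips[name])[0])
print("best GFLOPS/W:", best_per_watt)
```

These ratios still rest on peak numbers, which the slide itself cautions mean very little, so they are best read as a rough ranking rather than a benchmark.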
Epiphany: A Truly Scalable Architecture
[Figure: log-scale plot of Performance (GFLOPS, axis ticks from 1 to 16,384) against # Epiphany Cores (16, 64, 256, 1024, 4096), with power labels 0.35W, 1.4W, 5.7W, 23W, and 92W along the curve.]
A Single Unified Instruction Set Architecture!
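The scaling chart can be read as roughly constant power per core. The sketch below assumes that the five power labels pair with the five core counts in order, and that performance scales linearly at the 1.6 GFLOPS per core quoted earlier in the talk. Both are assumptions, since the extracted chart text does not state them.

```python
# Assumption: power labels on the chart pair with core counts in order.
cores = [16, 64, 256, 1024, 4096]
power_w = [0.35, 1.4, 5.7, 23, 92]

GFLOPS_PER_CORE = 1.6  # per-core figure quoted earlier; linear scaling assumed

for n, watts in zip(cores, power_w):
    mw_per_core = 1000 * watts / n       # chip power spread evenly over cores
    gflops = n * GFLOPS_PER_CORE         # assumed linear scaling
    print(f"{n:5d} cores: {gflops:8.1f} GFLOPS, {watts:5.2f} W, "
          f"{mw_per_core:.1f} mW/core")
```

Under this pairing the power per core stays close to 22 mW across the whole range, which is what linear scaling of a tiled design would look like.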
How the $#@% Do We Program This Thing?
Epiphany Programming Models
MODEL #1: DATA PARALLEL MODEL
• OpenCL programmable
• Easy integration with C/C++
• OpenMP/MPI roadmap
MODEL #2: WORKER BEE MODEL
• Great for up to 2GFLOPS
• Supports standard C/C++
• “Cloud on a chip”
[Figure: two grids of 16 mini CPUs, each attached to an X86/ARM/FPGA host. One grid is labelled with a single task (Task1), the other with four tasks (Task1 to Task4).]
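The two models can be contrasted with a host-side analogy. The sketch below uses Python's standard `concurrent.futures` thread pool as a stand-in for the grid of mini CPUs. It does not use the Epiphany SDK or OpenCL, and the kernel, the task names, and the helper functions are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 16  # stand-in for a 16-core grid


def scale_chunk(chunk):
    """Data-parallel kernel: the same operation applied to every chunk."""
    return [2 * x for x in chunk]


def data_parallel(data, pool):
    """Model #1: split one data set and run one kernel on every worker."""
    size = max(1, len(data) // NUM_WORKERS)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    results = pool.map(scale_chunk, chunks)  # results keep chunk order
    return [x for chunk in results for x in chunk]


def worker_bee(tasks, pool):
    """Model #2: hand different, independent tasks to different workers."""
    futures = {name: pool.submit(fn, *args) for name, (fn, args) in tasks.items()}
    return {name: future.result() for name, future in futures.items()}


with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    doubled = data_parallel(list(range(64)), pool)
    answers = worker_bee(
        {
            "Task1": (sum, ([1, 2, 3],)),
            "Task2": (max, ([4, 9, 2],)),
            "Task3": (sorted, ([3, 1, 2],)),
            "Task4": (len, ("epiphany",)),
        },
        pool,
    )

print(doubled[:4], answers)
```

In the data-parallel case every worker runs the same function on a different slice of the data. In the worker-bee case each worker runs ordinary, unrelated code, which is closer to the "cloud on a chip" description.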
Parallel Programming Frameworks
Erlang SystemC Intel TBB Co-Fortran Lisp Janus
Scala Haskell Pragmas Fortress Hadoop Linda
Smalltalk CUDA Clojure UPC PVM Alef
Julia OpenCL Go X10 Posix XC
Occam OpenHMPP ParaSail APL Simulink Charm++
Occam-pi OpenMP Ada Labview Ptolemy StreamIt
Verilog OpenACC C++ Amp Rust Sisal Star-P
VHDL Cilk Chapel MPI MCAPI ?????????
Stupid Hurdles That Hinder Collaboration
• Proprietary SDKs and programming frameworks
• Lack of datasheets/documents
• Closed source drivers
• Expensive lock-in hardware
• NDA requirements
• Exclusive access
Open HW is now following the same successful path as open SW!
Parallella: Our “Secret Weapon”
• A $99 single-board “parallel” computer that runs Linux
• Open source (SDK, board files, drivers) (github.com/parallella)
• Open documentation (adapteva.com/all-documents)
• Open to all (forums.parallella.org)
The Parallella Board
[Figure: the Parallella board with its components labelled.]
• Zynq dual-core ARM-A9 (with FPGA logic)
• 16-core Epiphany coprocessor
• 1GB SDRAM
• Gigabit Ethernet
• uUSB (x2)
• uSD
• uHDMI
• 5V DC
Parallella Kickstarter Campaign
• 5,000 customers
• 6,300 boards “pre-sold” in 4 weeks
• 67 countries, all 50 US states
• 50-75% of backers are developers
• 12,000 more signups since Jan 1st
• Backer Application Interest:
• Software Defined Radio
• Ray tracing/rendering
• Image processing
• Robotics
• Gaming
• Cryptography
• Parallel computing research
• Distributed Computing
• Machine Learning
• HPC
Epiphany IP Conclusions
• #1 in processor energy efficiency at 70 GFLOPS/Watt (core)
• Silicon proven in GLOBALFOUNDRIES 28SLP node
• Only multicore IP that is scalable to 1000’s of cores on chip
• Easier to use than GPGPUs