© 2015 Continuum Analytics- Confidential & Proprietary
Blowing the Doors Off Your Bottlenecks with Python on AMD APUs
Stan Seibert Continuum Analytics
December 8, 2015
My Background
• Trained in physics
• Using Python for data analysis for 10 years
• Using GPUs for data analysis for 7 years
• Currently lead the High Performance Analytics team at Continuum
OUR HISTORY
[Chart: yearly values for 2012 to 2015 on a 0 to 140 axis, broken down into Ops, Sales & Mktg, and Devl & Eng]
OUR TEAM
Global Community 2M+
Investors General Catalyst | BuildGroup
Global Presence Americas | EMEA
July 2012 V1 | Anaconda
June 2013 10K/mon Anaconda downloads
Sept 2014 100K/mon Anaconda downloads
Enterprise Customers 30+
Industries Financial Services | Government | Health & Life Sciences | Retail & CPG | Oil & Gas | High Tech
OSS Contributors 75+
OUR BEGINNING
• Travis Oliphant & Peter Wang co-founded in 2012
• Team includes OSS authors: NumPy, SciPy, PyTables, Pandas, Jupyter/IPython
• Vision: foundational tools for next generation data scientists
May 2015 150K/mon Anaconda downloads
May 2014 V2 | Anaconda
Agenda
1. Numba: A Compiler for Python
2. HSA: Bringing the CPU and GPU together
3. Numba+HSA Examples
4. Conclusion
NUMBA: A POWERFUL & FAST PYTHON COMPILER
Designed specifically for math-intensive algorithms and NumPy arrays
Can accelerate Python functions by 2x to 200x, approaching the speeds of C or FORTRAN
Numba
A Powerful and Fast Python Compiler
How Does Numba Work?
[Diagram: Python Function (bytecode) → Bytecode Analysis → Numba IR → Rewrite IR → Type Inference (using the Function Arguments) → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!]
@jit
def do_math(a, b):
    …

>>> do_math(x, y)
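A minimal sketch of the decorator usage shown on this slide, assuming Numba is installed (the fallback decorator is only there so the example still runs without it). The name do_math comes from the slide; its body and the sample arrays are invented for illustration. Compilation happens on the first call, when Numba can infer types from the actual arguments.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback so the sketch runs without Numba: a no-op decorator.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True)
def do_math(a, b):
    # Illustrative body (not from the talk): sum of element-wise products.
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total


x = np.arange(4, dtype=np.float64)   # [0, 1, 2, 3]
y = np.ones(4, dtype=np.float64)
result = do_math(x, y)               # first call triggers compilation
```

Later calls with the same argument types reuse the compiled machine code instead of recompiling.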
Supported Platforms
OS
• Windows (7 and later)
• OS X (10.7 and later)
• Linux (~RHEL 5 and later)
HW
• 32 and 64-bit x86 CPUs
• Experimental support for ARMv7 (Raspberry Pi 2)
• AMD GPUs supporting HSA
• NVIDIA GPUs that support CUDA
SW
• Python 2 and 3
• NumPy 1.6 through 1.9
Questions?
HSA BRINGING THE CPU AND GPU TOGETHER
What is HSA?
Heterogeneous System Architecture (HSA)
HSA is a multi-vendor standard for creating chips with CPU and GPU cores that work together and share the same memory.
This standard includes an API for loading compute kernels, launching tasks, and communicating between CPU and GPU.
Compute kernels are written in HSAIL (HSA Intermediate Language).
Why HSA?
• Manually moving data between CPU and GPU memory spaces adds code complexity and execution overhead
• Traditional GPU programming tends to force algorithms to fit into “all-CPU” or “all-GPU” categories
• HSA makes it easier to let each core do what it is good at:
  • CPU: low latency sequential calculations
  • GPU: high throughput data parallel calculations
The HSA Programming Model
[Figure labels: Grid, Work-group, Work-item]
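As in OpenCL, the grid is divided into work-groups, and each work-item can be addressed either by a global index or by a (work-group, local index) pair. The plain-Python arithmetic below sketches that relationship for a one-dimensional grid; the helper name and the sizes are made up for illustration and nothing here touches a GPU.

```python
def decompose(global_id, group_size):
    """Split a 1-D global work-item index into (group id, local id)."""
    return divmod(global_id, group_size)


grid_size = 12      # total work-items in the grid (illustrative)
group_size = 4      # work-items per work-group (illustrative)

layout = [decompose(g, group_size) for g in range(grid_size)]

# Recombining each pair gives back the global index.
rebuilt = [grp * group_size + loc for grp, loc in layout]
```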
NUMBA & HSA EXAMPLES
Hardware and Software Requirements
• Ubuntu Linux 14.04 64-bit
• Kaveri or Carrizo APU (Numba tested with A10-7850K, A10-7800P)
• At least 4 GB of system memory
• Example code on GitHub: https://github.com/ContinuumIO/Numba-HSA-Webinar/
Setup Instructions
• Install drivers from: https://github.com/HSAFoundation/HSA-Docs-AMD/wiki/HSA-Platforms-&-Installation
• Download and install 64-bit Linux Miniconda from: http://conda.pydata.org/miniconda.html
• Run the following commands:
    conda create -n hsa_webinar python=3.4 \
        numba libhlc pandas bokeh matplotlib basemap jupyter
    source activate hsa_webinar
    export LD_LIBRARY_PATH=/opt/hsa/lib:$LD_LIBRARY_PATH
    jupyter notebook
EXAMPLE #1: CREATING A UFUNC
Sample Data Set
• Geographic point data
  • Latitude, Longitude in degrees
  • Distance computations involve a lot of math
• Sample data comes from satellite-observed lightning strikes on Earth, but could easily be:
  • Geotagged social media posts
  • GPS tracking information for fleet vehicles
  • Geocoded customer addresses
Task: Geographic Locality
• Given a large collection of points, what is the distance of each from a target point?
• How many are within a given range?
What is a ufunc?
A Universal function (ufunc) is a special function that broadcasts over elements of a NumPy array.
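For instance, NumPy's built-in np.sqrt and np.add are ufuncs. The short sketch below, using made-up values, shows one applied element by element and one broadcasting a scalar and a column against a row.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(a)              # unary ufunc applied to every element
shifted = np.add(a, 10.0)       # the scalar is broadcast across the array
table = np.add(a[:, None], a)   # shapes (3, 1) and (3,) broadcast to (3, 3)

is_ufunc = isinstance(np.add, np.ufunc)
```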
Parallelizing Ufuncs
• Ufunc computations are inherently parallel
• Numba can auto-parallelize a user-created ufunc for many platforms, including HSA
• Developer does not need to know any details about GPU scheduling
Computing Distance
http://en.wikipedia.org/wiki/Great-circle_distance#Computational_formulas
[Code listing not captured in this transcript. Slide annotations: Type signature, Device function, Select ufunc target]
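The code on this slide did not survive in the transcript, so the following is only a sketch of what a great-circle ufunc built with numba.vectorize might look like, not the speaker's listing. It assumes the haversine form of the formula and a mean Earth radius of 6371 km, folds everything into one function rather than using a separate device function, and selects the 'cpu' target so that it runs without an APU (the talk selected its HSA target at this point). If Numba is missing, it falls back to np.vectorize.

```python
import math
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Fallback so the sketch runs without Numba (slow, pure Python).
    def vectorize(*args, **kwargs):
        def wrap(func):
            return np.vectorize(func, otypes=[np.float64])
        return wrap

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


# The type signature comes first, then the target selection.
@vectorize(['float64(float64, float64, float64, float64)'], target='cpu')
def great_circle_km(lat1, lon1, lat2, lon2):
    # Haversine form of the great-circle distance; inputs in degrees.
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp against rounding so asin never sees a value above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)   # a quarter of the equator
zero = great_circle_km(12.5, 77.0, 12.5, 77.0)   # a point to itself
```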
Calling the function
No special syntax to call a GPU ufunc!
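In other words, the call site looks like any other NumPy call: arrays and scalars are passed in directly and an array comes back. The sketch below uses a NumPy-only stand-in for the compiled ufunc so that it runs anywhere; the coordinates and the 500 km radius are invented, and the final line answers the "how many are within a given range?" question from the task slide.

```python
import numpy as np


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance built from NumPy ufuncs, so it broadcasts."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dlam = np.radians(lon2) - np.radians(lon1)
    a = (np.sin((phi2 - phi1) / 2) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(np.minimum(1.0, a)))


# Made-up points in degrees; the talk used a lightning-strike data set.
lats = np.array([0.0, 0.0, 45.0])
lons = np.array([0.0, 1.0, 90.0])
target_lat, target_lon = 0.0, 0.0

# Arrays for the points, scalars for the target: broadcasting does the rest.
dist = great_circle_km(lats, lons, target_lat, target_lon)
within_500km = int(np.count_nonzero(dist < 500.0))
```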
Performance
Performance Tips and Tricks
• Prefer 32-bit over 64-bit data
• GPUs are fast at special math functions
• Don’t force it: If it is easier to do a calculation on the CPU, do it there!
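To illustrate the first tip: converting coordinates to 32-bit floats halves the amount of memory that has to be read and written, at the cost of some precision. This is a small NumPy sketch with made-up data; whether the rounding error is acceptable depends on the application.

```python
import numpy as np

lat64 = np.linspace(-90.0, 90.0, 1_000_000)   # float64 by default
lat32 = lat64.astype(np.float32)              # explicit down-conversion

bytes64 = lat64.nbytes
bytes32 = lat32.nbytes

# Worst-case rounding error introduced by the conversion, in degrees.
max_err = float(np.max(np.abs(lat64 - lat32)))
```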
Pro-tip: Compiling a function for CPU and GPU targets
• Use numba.vectorize as a function:
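A hedged sketch of the idea: calling numba.vectorize(...) as an ordinary function on an undecorated Python function returns a compiled ufunc, so the same source can be compiled once per target. The talk's second target was HSA; because that needs an APU and the HSA runtime, this sketch substitutes Numba's 'parallel' (multi-core CPU) target, and falls back to np.vectorize if Numba is not installed. The function scaled_sum is invented for illustration.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Fallback so the sketch runs without Numba.
    def vectorize(*args, **kwargs):
        def wrap(func):
            return np.vectorize(func, otypes=[np.float32])
        return wrap


def scaled_sum(x, y):
    # Plain Python function; illustrative body, not from the talk.
    return 2.0 * x + y


signatures = ['float32(float32, float32)']

# Same source, one compiled ufunc per target.
by_target = {t: vectorize(signatures, target=t)(scaled_sum)
             for t in ('cpu', 'parallel')}

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
results = {t: f(x, y) for t, f in by_target.items()}
```

Keeping the undecorated function around also makes it easy to compare targets on the same input.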
Questions on Example #1?
EXAMPLE #2: CREATING AN HSA KERNEL
Task: Compute Distance Matrix
• Compute the distance between all pairs of points
• Common first step in route planning, clustering, etc.
• Could do this with ufunc, but let’s write a kernel function instead
Mapping to GPU work-items
[Figure: 6 × 6 distance matrix with rows and columns numbered 0 to 5 and zeros on the diagonal, labelled workitem 0 through workitem 5]
Note: There are more efficient ways to divide the work than this!
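The mapping can be emulated in plain Python to make the idea concrete. This sketch assumes one work-item per matrix row, which is how the figure appears to be laid out, and replaces the GPU scheduler with an ordinary loop. The helper names, the sample coordinates, and the haversine formula are illustrative choices, and no Numba or HSA API is exercised. In a real kernel the index i would be supplied by the runtime rather than passed as an argument.

```python
import math
import numpy as np


def distance_km(lat1, lon1, lat2, lon2):
    # Scalar helper, playing the role of a device function.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))


def row_kernel(i, lat, lon, out):
    # Body executed by work-item i: fill row i of the matrix.
    if i >= lat.shape[0]:      # guard in case the grid is larger than the data
        return
    for j in range(lat.shape[0]):
        out[i, j] = distance_km(lat[i], lon[i], lat[j], lon[j])


def launch(kernel, n_workitems, *args):
    # Sequential stand-in for the GPU scheduler.
    for i in range(n_workitems):
        kernel(i, *args)


lat = np.array([0.0, 0.0, 45.0, -30.0, 60.0, 10.0])
lon = np.array([0.0, 90.0, 45.0, 120.0, -75.0, 10.0])
dmat = np.zeros((6, 6))
launch(row_kernel, 6, lat, lon, dmat)
```

Because the matrix is symmetric with a zero diagonal, a row-per-work-item split computes every distance twice, which is one reason more efficient divisions exist.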
Creating a Device Function
Creating a Kernel Function
Calling a Kernel
Performance
Performance Tips and Tricks
• Use lots of work-items
• Minimize branch divergence
• Learn from other GPU APIs: OpenCL and CUDA are very similar to HSA
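One pattern borrowed from OpenCL and CUDA practice is to launch one work-item per data element, round the grid up to a whole number of work-groups, and let the surplus work-items exit through a bounds check. The arithmetic is sketched below with illustrative sizes; the talk does not say whether HSA requires this padding, so treat it as an assumption.

```python
def padded_grid(n_items, group_size):
    """Smallest multiple of group_size that covers n_items."""
    return ((n_items + group_size - 1) // group_size) * group_size


n_points = 1000          # illustrative problem size
group_size = 64          # illustrative work-group size

grid = padded_grid(n_points, group_size)
idle = grid - n_points   # work-items that only hit the bounds check
```

Only the last work-group contains work-items that take the early-exit branch, so the divergence cost of the guard stays small.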
Questions on Example #2?
CONCLUSION
Conclusion
• Create high performing CPU or GPU code in Python with Numba!
• HSA lets you process data using the GPU and the CPU, without the overhead of memory copies
• Numba + HSA is a great combination
• The Jupyter notebook used in this demo can be downloaded here: https://github.com/ContinuumIO/Numba-HSA-Webinar
• For more documentation: http://numba.pydata.org/numba-doc/0.22.1/hsa/index.html
What’s Next?
• Boltzmann Initiative: HSA+ for FirePro GPU cards
• HSA code for APUs will be portable to FirePro cards with few changes
• Stay tuned for more updates!
Resources
AMD Developer Central
• Additional Developer Resources: developer.amd.com
• Follow AMD Developer Central: twitter.com/AMDDevCentral
• This and other webinars posted to YouTube: www.youtube.com/user/AMDDevCentral
Continuum Analytics
• Website: https://continuum.io
• Twitter: @ContinuumIO
• For more information on Numba: http://numba.pydata.org
• Get help optimizing your Python code! Contact [email protected] for a code assessment
Q & A