gpus for data science (rapids)...execute end-to end data science and analytics pipelines on gpus...

41
Sangmoon Lee, [email protected] GPUS FOR DATA SCIENCE (RAPIDS)

Upload: others

Post on 20-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

Sangmoon Lee, [email protected]

GPUS FOR DATA SCIENCE (RAPIDS)

Page 2: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

2

Artificial IntelligenceComputer GraphicsGPU Computing

NVIDIA“THE AI COMPUTING COMPANY”

Page 3: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

3

What is RAPIDS

Page 4: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

4

RAPIDS

Opens GPU for Data Science

RAPIDS is a set of open source libraries

for GPU accelerating data preparation

and machine learning.

OSS website: http://www.rapids.ai/

RAPIDS: Rapid Accelerate Platform Integrated for Data Science

Page 5: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

5

In GPU Memory

cuXFilter

Visualization

Data Preparation VisualizationModel Training

cuML

Machine Learning

cuGraph

Graph AnalyticsDeep Learning

cuDF

Analytics

RAPIDSGPU Accelerated End-to-End Data Science pipeline

www.rapids.ai

Page 6: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

6

Why RAPIDS

Page 7: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

7

REALITIES OF DATA

Page 8: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

8

WHAT IS TYPICAL DATASET SIZE

Faster, Repeat analysis

Page 9: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

9

WHAT % OF TIME IS DEVOTED TO:

1. Data Visualization2. Logistic Regression3. Cross Validation4. Decision Trees5. Random Forests…

Page 10: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

10

MOST USED DATA-FRAME METHODS

1. pd.read_csv2. pd.DataFrame3. df.append4. df.mean5. df.head6. df.drop7. df.sum8. df.to_csv9. df.get…

Page 11: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

11

RAPIDS Benefits

Page 12: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

12

FASTER DATA SCIENCE WORKFLOWOpen Source, GPU-accelerated End-to-End Data Science

All

DataETL

Manage Data

Structured

Data Store

Data Preparation

Training

Model Training

Visualization

Evaluate

Scoring

Deploy

Slow Training Times for Data Scientists

Reduce Data Prep, Training Times, Improve Models

All

DataETL Structured

Data Store

Data Preparation

Model Training

Visualization ScoringAll

DataETL

Manage Data

Structured

Data Store

Data Preparation

Training

Model Training

Visualization

Evaluate

Inference

Deploy

Page 13: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

13

DATA PROCESSING EVOLUTION

Page 14: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

14

DAY IN THE LIFE OF A DATA SCIENTIST

Page 15: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

15

RAPIDS SW Stack

Page 16: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

16

CUDA

PYTHON

APACHE ARROW in GPU Memory

DASK

DEEP LEARNING

FRAMEWORKS

CUDNN

RAPIDS

CUMLCUDF CUGRAPH

RAPIDS – SOFTWARE STACKExecute end-to end data science and analytics pipelines on GPUs

Data Preparation VisualizationModel Training

cuDF cuML cuGRAPH

Apache Arrow

A columnar, in-memory data structure that delivers

efficient and fast data interchange with flexibility

to support complex data models.

Page 17: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

17

CUDA

PYTHON

APACHE ARROW in GPU Memory

DASK

DEEP LEARNING

FRAMEWORKS

CUDNN

RAPIDS

CUMLCUDF CUGRAPH

RAPIDS – SOFTWARE STACKExecute end-to end data science and analytics pipelines on GPUs

Data Preparation VisualizationModel Training

cuDF cuML cuGRAPH

DASK

A distributed computing project with:

Python interface, modular design, adaptability to IT

infrastructure (Kubernetes, YARN clusters).

RAPIDS can process data split across many GPUs in one

server, or many GPUs distributed across a cluster for

scalability to support large datasets.

Page 18: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

18

CUDA

PYTHON

APACHE ARROW in GPU Memory

DASK

DEEP LEARNING

FRAMEWORKS

CUDNN

RAPIDS

CUMLCUDF CUGRAPH

RAPIDS – SOFTWARE STACKExecute end-to end data science and analytics pipelines on GPUs

Data Preparation VisualizationModel Training

cuDF cuML cuGRAPHcuDF

A DataFrame manipulation library based on Apache

Arrow that accelerates loading, filtering, and

manipulation of data for model training data

preparation.

The Python bindings of the core-accelerated CUDA

DataFrame manipulation primitives mirror the

pandas interface for seamless onboarding of pandas

users.

cuIO FOR CSV

Page 19: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

19

CUDA

PYTHON

APACHE ARROW in GPU Memory

DASK

DEEP LEARNING

FRAMEWORKS

CUDNN

RAPIDS

CUMLCUDF CUGRAPH

RAPIDS – SOFTWARE STACKExecute end-to end data science and analytics pipelines on GPUs

Data Preparation VisualizationModel Training

cuDF cuML cuGRAPH

cuML

A collection of GPU-accelerated machine

learning libraries that will provide GPU versions

of all machine learning algorithms available in

scikit-learn..

Page 20: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

20

CUDA

PYTHON

APACHE ARROW in GPU Memory

DASK

DEEP LEARNING

FRAMEWORKS

CUDNN

RAPIDS

CUMLCUDF CUGRAPH

RAPIDS – SOFTWARE STACKExecute end-to end data science and analytics pipelines on GPUs

Data Preparation VisualizationModel Training

cuDF cuML cuGRAPH

cuGRAPH

A framework and collection of graph

analytics libraries that seamlessly

integrate into the RAPIDS data

science platform.

Page 21: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

21

cuDF CODE EXAMPLES

Page 22: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

22

PANDAS (CPU) VS CUDF (GPU) DATAFRAMECode comparison

Page 23: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

23

DATA LOADING FROM FILEcuDF code example

Page 24: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

24

LOADING DATA INTO A GPU DATAFRAME

Create an empty DataFrame, and add a column. Create a DataFrame with two columns.

Load a CSV file into a GPU DataFrame. Use Pandas to load a CSV file, and copy its content into a GPU DF.

cuDF code examples

Page 25: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

25

WORKING WITH GPU DATAFRAMEScuDF code examples

Return the first three rows as a new DataFrame.

Find the mean and standard deviation of a column.Count number of occurrences per value, and unique values.

Transform column values with a custom function.

Change the data type of a column.

Row slicing with column selection.

Page 26: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

26

QUERY, SORT, GROUP, JOIN, …cuDF code examples

Query the columns of a DataFrame with a boolean expression.

Return the first ‘n’ rows ordered by ‘columns’ in ascending order.

Join columns with other DataFrame on index.

One-hot encoding

Group by column with aggregate function.

Merge two DataFrames.

Sort a column by its values.

Page 27: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

27

RAPIDS - SCIKIT_LEARNRegression: CPU vs GPU coding import cudf;

df = cudf.DataFrame({‘x’: x, ‘y’: y_noisy})

Page 28: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

28

CLUSTERINGExample

Page 29: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

29

DBSCAN

Page 30: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

30

RAPIDS

cuML

cuDF

Page 31: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

31

COMPARE CPU VS GPU

Page 32: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

32

CPU vs GPU

PCA

Training results:

CPU: 57.1 seconds

GPU: 4.28 seconds

System: AWS p3.8xlargeCPUs: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, 32 vCPU cores, 244 GB RAMGPU: Tesla V100 SXM2 16GBDataset: https://github.com/rapidsai/cuml/tree/master/python/notebooks/data

Principal Component Analysis (PCA)

…Now!Before…

Page 33: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

33

…Now!Before…

k-Nearest Neighbors (KNN)

CPU vs GPU

KNN

Training results:

CPU: ~9 minutes

GPU: 1.12 seconds

System: AWS p3.8xlargeCPUs: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, 32 vCPU cores, 244 GB RAMGPU: Tesla V100 SXM2 16GBDataset: https://github.com/rapidsai/cuml/tree/master/python/notebooks/data

Page 34: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

34

THE RAPIDS (USECASE)

2,290

1,956

1,999

1,948

169

157

0 500 1,000 1,500 2,000 2,500

20 CPU Nodes

30 CPU Nodes

50 CPU Nodes

100 CPU Nodes

DGX-2

5x DGX-1

0 2,000 4,000 6,000 8,000 10,000

20 CPU Nodes

30 CPU Nodes

50 CPU Nodes

100 CPU Nodes

DGX-2

5x DGX-1

cuML — XGBoost

2,741

1,675

715

379

42

19

0 1,000 2,000 3,000

20 CPU Nodes

30 CPU Nodes

50 CPU Nodes

100 CPU Nodes

DGX-2

5x DGX-1

End-to-EndcuIO (for tabular) / cuDF —Load and Data Preparation

Benchmark

200GB CSV dataset; Data preparation includes joins, variable transformations.

CPU Cluster Configuration

CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark

DGX Cluster Configuration

5x DGX-1 on InfiniBand network

Time in seconds — Shorter is better

cuIO / cuDF (Load and Data Preparation) Data Conversion XGBoost

9X 12X 2000X

Page 35: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

35

Roadmap

Page 36: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

36

cuDF — ROADMAP

Libraries

daskgdf: Distributed Computing pygdf using Dask; Support for multi-GPU, multi-node

pygdf: Python bindings for libgdf (Pandas like API for DataFrame manipulation)

libgdf: CUDA C++ Apache Arrow GPU DataFrame and operators (Join, GroupBy, Sort, etc.)

Memory Allocation Requirement

Budget 2-3X dataset size for cuDF working memory

Multi-GPU Multi-node Roadmap

Analytics

Availability Multi-GPU Multi-NodePeer-to-peer

Data Sharing*

Now Yes Yes Yes

*Note: No peer-to-peer data sharing means computation performed via map/reduce style programming in Dask

libgdfCUDA C/C++ helper function

pygdfPython Bindings

daskgdfDistributed Computing

Page 37: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

37

TODAY – RAPIDS 0.6GTC San Jose 2019

Last updated 10.16.18

SG Single GPU

MG Multi-GPU

MGMN Multi-GPU Multi-Node

Page 38: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

38

ROAD TO 1.0Q4 2019 – RAPIDS 0.12

Last updated 10.16.18

SG Single GPU

MG Multi-GPU

MGMN Multi-GPU Multi-Node

Page 39: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

39

XGBOOST SUPPORT SPARK NOW

Apache Spark

Easy to use through Scala/Java API using an existing Spark cluster – Python API coming soon

YARN for managing resources

Dask

Simple to use Python API that is infrastructure agnostic

Kubernetes for managing resources

Both Apache Spark and Dask support loading CSV, Parquet, ORC, and other file formats from local disk, HDFS, S3, GS, Azure Storage

Scaling using Apache Spark

Near identical performance for both Apache Spark and Dask

Page 40: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

40

RAPIDS RESOURCE

http://www.rapids.ai/

cuDF Documentation: https://rapidsai.github.io/projects/cudf/en/latest/

cuML Documentation: https://rapidsai.github.io/projects/cuml/en/latest/

cuGraph Documentation: https://rapidsai.github.io/projects/cugraph/en/0.6.0/index.html

Dask Documentation: https://docs.dask.org/en/latest/

Arrow Documentation: https://arrow.apache.org/docs/

GitHub: https://github.com/RAPIDSai

Page 41: GPUS FOR DATA SCIENCE (RAPIDS)...Execute end-to end data science and analytics pipelines on GPUs Data Preparation Model Training Visualization cuDF cuML cuGRAPH Apache Arrow A columnar,

Sangmoon Lee, [email protected]

THANK YOU