CudaTree (GTC 2014)

CudaTree: Training Random Forests on the GPU
Alex Rubinsteyn (NYU, Mount Sinai) @ GTC on March 25th, 2014

DESCRIPTION

Random Forests have become an extremely popular machine learning algorithm for making predictions from large and complicated data sets. The currently highest-performing implementations of Random Forests all run on the CPU. We implemented a Random Forest learner for the GPU (using PyCUDA and runtime code generation) which outperforms the currently preferred libraries (scikit-learn and wiseRF). The "obvious" parallelization strategy (one thread block per tree) results in poor performance. Instead, we developed a more nuanced collection of kernels to handle various tradeoffs between the number of samples and the number of features.

TRANSCRIPT

Page 1: CudaTree (GTC 2014)

CudaTree: Training Random Forests on the GPU

Alex Rubinsteyn (NYU, Mount Sinai) @ GTC on March 25th, 2014

Page 2: CudaTree (GTC 2014)

What’s a Random Forest?

trees = []
for i in 1 .. T:
    Xi, Yi = random_sample(X, Y)
    t = RandomizedDecisionTree()
    t.fit(Xi, Yi)
    trees.append(t)

A bagged ensemble of randomized decision trees. Learning by uncorrelated memorization.

Few free parameters, popular for “data science”
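
To make the pseudocode above runnable, here is a minimal sketch of the same bagging loop using scikit-learn's DecisionTreeClassifier (one of the CPU baselines mentioned later in the talk) as the randomized base learner; the function names and defaults below are ours, not CudaTree's.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, Y, n_trees=100, seed=0):
    # Bagged ensemble of randomized decision trees (illustrative sketch).
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # random_sample(X, Y) in the pseudocode: a bootstrap sample of the rows.
        idx = rng.randint(0, n_samples, size=n_samples)
        # max_features="sqrt" gives the per-split random feature subset.
        t = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
        t.fit(X[idx], Y[idx])
        trees.append(t)
    return trees

def predict_forest(trees, X):
    # Majority vote over the ensemble (assumes non-negative integer class labels).
    votes = np.stack([t.predict(X).astype(int) for t in trees])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)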

Page 3: CudaTree (GTC 2014)

Decision Tree Training

Random Forest trees are trained like normal decision trees (CART), but use a random subset of features at each split.

bestScore = bestThresh = None
for i in RandomSubset(nFeatures):
    Xi, Yi = sort(X[:, i], Y)
    for nLess, thresh in enumerate(Xi):
        score = Impurity(nLess, Yi)
        if bestScore is None or score < bestScore:
            bestScore = score
            bestThresh = thresh
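
The Impurity(nLess, Yi) call above is left abstract on the slide; the usual choice for classification is Gini impurity, weighted by the number of samples on each side of the split. A minimal NumPy sketch (the helper names are ours, not CudaTree's):

import numpy as np

def gini(counts):
    # Gini impurity of a label histogram: 1 - sum over classes of p_c^2.
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_impurity(left_counts, right_counts):
    # Score of a candidate threshold: sample-weighted average of child impurities.
    n_left, n_right = left_counts.sum(), right_counts.sum()
    return (n_left * gini(left_counts) + n_right * gini(right_counts)) / (n_left + n_right)

For example, weighted_impurity(np.array([3, 1]), np.array([0, 4])) is about 0.1875, lower (better) than a split that leaves both classes mixed on both sides.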

Page 4: CudaTree (GTC 2014)

Random Forests: Easy to Parallelize?

• Each tree gets trained independently!

• Simple parallelization strategy: “each processor trains a tree”

• Works great for multi-core implementations (wiseRF, scikit-learn, &c)

• Terrible for GPU

Page 5: CudaTree (GTC 2014)

Random Forest GPU Training Strategies

• Tree per Thread: naive/slow

• Depth-First: data-parallel threshold selection for one tree node

• Breadth-First: learn a whole level of the tree simultaneously (threshold per block)

• Hybrid: use Depth-First training until a node has too few samples, then switch to Breadth-First

Page 6: CudaTree (GTC 2014)

Depth-First Training

Thread block = subset of samples

Page 12: CudaTree (GTC 2014)

Depth-First Training: Algorithm Sketch

(1) Parallel Prefix Scan: compute label count histograms

(2) Map: evaluate the impurity score for all feature thresholds

(3) Reduce: which feature/threshold pair has the lowest impurity score?

(4) Map: mark whether each sample goes left or right at the split

(5) Shuffle: keep samples/labels on both sides of the split contiguous
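
A NumPy sketch of these five steps for a single node, written by us to mirror the kernel pipeline (this is not the actual PyCUDA code; cumsum, argmin, boolean masks, and flatnonzero stand in for the scan, map, reduce, and shuffle kernels):

import numpy as np

def split_node(X, y, feature_subset, n_classes):
    # One depth-first node split expressed as scans, maps, and reduces.
    # Assumes integer class labels 0..n_classes-1 and at least 2 samples in the node.
    n = len(y)
    best = (np.inf, None, None)                          # (score, feature, threshold)
    for f in feature_subset:
        order = np.argsort(X[:, f])
        xs, ys = X[order, f], y[order]
        # (1) Prefix scan: per-class label counts to the left of each candidate split.
        onehot = np.eye(n_classes, dtype=np.int64)[ys]
        left = np.cumsum(onehot, axis=0)[:-1]
        right = left[-1] + onehot[-1] - left
        # (2) Map: weighted Gini impurity of every candidate threshold.
        n_left = np.arange(1, n)[:, None]
        n_right = n - n_left
        score = (n_left.ravel() * (1 - ((left / n_left) ** 2).sum(1)) +
                 n_right.ravel() * (1 - ((right / n_right) ** 2).sum(1))) / n
        # (3) Reduce: lowest-impurity threshold for this feature (and across features).
        i = np.argmin(score)
        if score[i] < best[0]:
            best = (score[i], f, (xs[i] + xs[i + 1]) / 2.0)
    _, f, thresh = best
    # (4) Map: mark which side of the split each sample falls on.
    goes_left = X[:, f] <= thresh
    # (5) Shuffle: stable partition so each child's samples stay contiguous.
    perm = np.concatenate([np.flatnonzero(goes_left), np.flatnonzero(~goes_left)])
    return X[perm], y[perm], int(goes_left.sum()), f, thresh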

Page 13: CudaTree (GTC 2014)

Breadth-First Training

Thread block = tree node

Page 17: CudaTree (GTC 2014)

Breadth-First Training: Algorithm Sketch

Uses the same sequence of data-parallel operations (scan label counts, map impurity evaluation, &c) as Depth-First training, but within each thread block.

Fast at the bottom of the tree (lots of small tree nodes), insufficient parallelism at the top.
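
The parallelism argument can be seen with a toy host-side model (ours, greatly simplified): one thread block is assigned to each node of the current level, so the number of active blocks roughly doubles per level and only becomes large enough to fill the GPU near the bottom of the tree.

def blocks_per_level(n_samples, min_node_size=32):
    # Toy model: nodes split roughly in half; each node of a level maps to one thread block.
    frontier = [n_samples]            # node sizes at the current tree level
    level = 0
    while frontier:
        print("level %d: %d thread blocks" % (level, len(frontier)))
        frontier = [half
                    for size in frontier if size > min_node_size
                    for half in (size // 2, size - size // 2)]
        level += 1

blocks_per_level(1_000_000)           # 1 block at the root, ~30k blocks near the leaves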

Page 18: CudaTree (GTC 2014)

Hybrid Training

★ c = number of classes

★ n = total number of samples

★ f = number of features considered at each split

Add a tree node to the Breadth-First queue when it contains fewer samples than:

3702 + 1.58c + 0.0577n + 21.84f

(coefficients from a machine-specific regression, needs to be generalized)
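
Written out as code (a direct transcription of the rule above; the function names are ours and the coefficients are the machine-specific ones from the slide):

def breadth_first_cutoff(c, n, f):
    # c = number of classes, n = total number of samples,
    # f = number of features considered at each split.
    return 3702 + 1.58 * c + 0.0577 * n + 21.84 * f

def use_breadth_first(node_sample_count, c, n, f):
    # Train a node depth-first while it is large; once it holds fewer samples
    # than the cutoff, push it onto the breadth-first queue instead.
    return node_sample_count < breadth_first_cutoff(c, n, f)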

Page 19: CudaTree (GTC 2014)

GPU Algorithms vs. Number of Features

• Randomly generated synthetic data

• Performance relative to sklearn 0.14

Page 20: CudaTree (GTC 2014)

Benchmark Data

Dataset     Samples   Features   Classes
ImageNet    10k       4k         10
CIFAR100    50k       3k         100
covertype   581k      57         7
poker       1M        11         10
PAMAP2      2.87M     52         10
intrusion   5M        41         24

Page 21: CudaTree (GTC 2014)

Benchmark Results

Dataset     wiseRF    sklearn 0.15   CudaTree (Titan)   CudaTree + wiseRF
ImageNet    23s       13s            27s                25s
CIFAR100    160s      180s           197s               94s
covertype   107s      73s            67s                52s
poker       117s      98s            59s                58s
PAMAP2      1,066s    241s           934s               757s
intrusion   667s      1,682s         199s               153s

6-core Xeon E5-2630, 24GB, GTX Titan, n_trees = 100, features_per_split = sqrt(n)

Page 22: CudaTree (GTC 2014)

Thanks!

• Installing: pip install cudatree

• Source: https://github.com/EasonLiao/CudaTree

• Credit: Yisheng Liao did most of the hard work; Russell Power & Jinyang Li were the sanity check brigade.