sage: self-tuning approximation for graphics engines

SAGE: Self-Tuning Approximation for Graphics

Engines

Mehrzad Samadi1, Janghaeng Lee1, D. Anoushe Jamshidi1, Amir Hormati2, and Scott Mahlke1

University of Michigan1, Google Inc.2

December 2013

Compilers creating custom processorsUniversity of MichiganElectrical Engineering and Computer Science

Approximate Computing

• Different domains:– Machine Learning– Image Processing– Video Processing– Physical Simulation– …

Higher performanceLower power consumption

Less work

Quality: 100% 95%

90% 85%

Ubiquitous Graphics Processing Units

Super Computers Desktops Cell Phones

• Wide range of devices

• Mostly regular applications• Works on large data sets

Servers

Good opportunity for automatic approximation

SAGE Framework

• Simplify or skip processing• Computationally expensive• Lowest impact on the output quality

Output Error

Speedup

• Self-Tuning Approximation on Graphics Engines• Write the program once• Automatic approximation• Self-tuning dynamically

Overview

SAGE FrameworkInput Program

ApproximationMethods

Approximate Kernels

Tuning Parameters

Static Compiler

Runtime system

Static Compilation

Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

Runtime System

Preprocessing

Tuning Execution Calibration

Quality

Speedup

Tuning T T + C T + 2C

Tuning

Preprocessing

Calibration

Runtime System

Preprocessing

Tuning Execution Calibration

Quality

Speedup

Tuning T T + C T

Approximation Methods

Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

Atomic Operations

// Compute histogram of colors in an image __global__ void histogram(int n, int* color, int* bucket) int tid = threadIdx.x + blockDim.x * blockIdx.x; int nThreads = gridDim.x * blockDim.x; for ( int i = tid ; tid < n; tid += nThreads) int c = colors[i]; atomicAdd(&bucket[c], 1);

• Atomic operations update a memory location such that the update appears to happen atomically

Atomic Operations

Threads

Atomic Operations

Threads

Atomic Operations

Threads

Atomic Operations

Threads

Atomic Add

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 310

Conflicts per Warp

Slowdown

Atomic Operation Tuning

Iterations

T0 T32 T64 T96 T128 T160

// Compute histogram of colors in an image __global__ void histogram(int n, int* color, int* bucket) int tid = threadIdx.x + blockDim.x * blockIdx.x; int nThreads = gridDim.x * blockDim.x; for ( int i = tid ; tid < n; tid += nThreads) int c = colors[i]; atomicAdd(&bucket[c], 1);

Iterations

T0 T32 T64 T96 T128 T160

• SAGE skips one iteration per thread• To improve the performance, it drops the iteration with the maximum number of conflicts

T0 T32 T64 T96 T128 T160

It drops 50% of iterations• SAGE skips one iteration per thread• To improve the performance, it drops the iteration with the maximum number of conflicts

T0 T32 T64

Drop rate goes down to 25%

Dropping One Iteration

0Conflicts

Iteration No.

After optimizationCD 0

Conflict Detection

Max conflictIteration 0Iteration 1Iteration 2

Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

Data PackingThreads

Memory

Data PackingThreads

Memory

Data PackingThreads

Memory

Data PackingThreads

Memory

Quantization• Preprocessing finds min and max of the input

sets and packs the data

• During execution, each thread unpacks the data and transforms the quantization level to data by applying a linear transformation

Min Max

0 1 2 3 4 5 6 7

Quantization Levels

Quantization• Preprocessing finds max and min of the input

sets and packs the data

• During execution, each thread unpacks the data and transforms the quantization level to data by applying a linear transformation

Min Max

0 1 2 3 4 5 6 7

Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

Thread Fusion

Threads

Memory

Memory ≈ ≈ ≈ ≈

Thread Fusion

Threads

Memory

Thread Fusion

Computation

Output writing

ComputationOutput writing

Thread Fusion

T0 T1 T254 T255 T0 T1 T254 T255

Block 0 Block 1

ComputationOutput writing

Thread Fusion

T0 T127

Block 0 Block 1

T0 T127

Reducing the number of threads per block results in poor resource utilization

Block Fusion

T0 T254

Block 0 & 1 fused

T255T1

Runtime

How to Compute Output Quality?

Accurate Version

ApproximateVersion

EvaluationMetric

• High overhead• Tuning should find a good enough

approximate version as fast as possible• Greedy algorithm

Tuning

K(0,0) Quality = 100%Speedup = 1x

TOQ = 90%

K(x,y)

Tuning parameter of the First optimization

Tuning parameter of the Second optimization

Tuning

K(1,0) K(0,1)94%1.15X

96%1.5X

TOQ = 90%

Tuning

K(1,0) K(0,1)

K(1,1) K(0,2)

94%1.15X

96%1.5X

94%2.5X

95%1.6X

TOQ = 90%

Tuning

K(1,0) K(0,1)

K(1,1) K(0,2)

K(2,1) K(1,2)

94%1.15X

96%1.5X

94%2.5X

95%1.6X

87%2.8X

Final ChoiceTuning Path

TOQ = 90%

Evaluation

Experimental Setup

• Backend of Cetus compiler

• GPU– NVIDIA GTX 560

• 2GB GDDR 5

• CPU– Intel Core i7

• Benchmarks– Image processing– Machine Learning

K-Means

1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 1011061111160

Output Quality

Accumulative Speedup

Confidence

0 50 100 150 200 250 300 350 40040

Calibration points

CI = 99%

CI = 95%

CI = 90%

After checking 50 samples, we will be 93% confident that 95% of the outputs satisfy the TOQ threshold𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟 ∝𝑝𝑟𝑖𝑜𝑟× h𝑙𝑖𝑘𝑒𝑙𝑖 𝑜𝑜𝑑

Beta Uniform Binomial

Calibration Overhead

0 20 40 60 80 100 120 140 160 180 2000

Calibration Interval

Gaussian

K-Means

Calibration Overhead(%)

K-Means

Histogram

MeanShift

Binarization

Dynamic

Mean Filter

Gaussian

1 2 3 4

Speedup 1 2 3 4

Speedup

Performance

SAGE SAGE

TOQ = 95% TOQ = 90%Loop Perforation Loop Perforation

GeometricMean

Conclusion• Automatic approximation is possible

• SAGE automatically generates approximate kernels with different parameters

• Runtime system uses tuning parameters to control the output quality during execution

• 2.5x speedup with less than 10% quality loss compared to the accurate execution

SAGE: Self-Tuning Approximation for Graphics

Engines

Mehrzad Samadi1, Janghaeng Lee1, D. Anoushe Jamshidi1, Amir Hormati2, and Scott Mahlke1

University of Michigan1, Google Inc.2

December 2013

Compilers creating custom processorsUniversity of MichiganElectrical Engineering and Computer Science

Backup Slides

What Does Calibration Miss?

10 20 30 40 50 60 70 80 90 1000%

flower(250) grandma(800) deadline(900) foreman(300) harbour(300)

ice(240) akiyo(300) carphone(380) crew(300) galleon(350)

Calibration Interval

SAGE Can Control The Output Quality

909192939495969798991001

7Naïve Bayes Fuzzy Kmeans Mean Filter

Output Quality(%)

Distribution of Errors

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%0%

HistogramKmeansNaïve BayesFuzzy KmeansSVMDynamicMean FilterGaussianBinarizationMeanshift

sage: self-tuning approximation for graphics engines

Documents

document10 - suzuki katana · functional range...

approximation algorithms and hardness of approximation ipm

sage 100 gestion commerciale -...

sage 9.1 reference manual: diophantine approximation ·...

overview tuning defined tuning in the us the tuning process...

account information: manage bank feeds · sage xero canada...

beyond mact ddddd: heater and boiler tuning from a best...

sage hrms installation guide/media/site/sage...

sage sage reference online collections sage reference online

tuning derby - apache db...

from scale to revenue€¦ · quickbooks yes sage 50 – us...

data sheet sky73101-11: 1930–1990 mhz high performance...

sage crm & sage crm

sage: self-tuning approximation for graphics engines mehrzad...

sage university sage erp...

meshfree approximation with matlab - lecture ii: rbf ... ·...

7 scaling performance via self-tuning approximation for...

approximation algorithms - lehigh...

salesforce integration with sage erp - sage x3 / sage 100 /...

sage 50 canadian edition and spanish/bilingual...