deep learning compiler - sampl: home€¦ · © 2018, amazon web services, inc. or its affiliates....

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon/Intel Confidential

AWS AI

Deep Learning Compiler

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Acknowledgement


Amazon Sagemaker Neo

Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge

Product targets• Amazon Rekognition• AWS DeepLens• Amazon Lex• …• And a lot of internal/external

products

Hardware targets• Intel CPU, Intel graphics• ARM CPU, ARM GPU• Nvidia GPU• FPGA• ASIC• …


CONV Kernel tuning


Intel Xeon Platinum 8000-series CPUs (Skylake)

• Multi-cores• E.g., EC2 c5.9xlarge: 1 processor with 18 cores.

• AVX-512 supported• 512-bit width registers (ZMM)• E.g. vfmadd231ps -1664(%rax,%r13){1to16},

%zmm0, %zmm1


CONV optimizationData layout is important!conv = tvm.compute(oshape, lambda n, oc, oh, ow: tvm.sum( data[n, ic, oh*stride+kh, ow*stride+kw] * kernel[oc, ic, kh, kw], axis=[ic, kh, kw]),) for (n, 0, N):

for (oc, 0, OC): for (oh, 0, OH): for (ow, 0, OW): Out[n, oc, oh, ow] = 0 // init Out for (ic, 0, IC): for (kh, 0, KH): for (kw, 0, KW): // Out += In * Kernel

• NCHW -> NHWC • NCHW -> NCHW[x]c

• OIHW-> OIHW[x]i[y]o

in_height

in_width

kernel_width

kernel_height

out_width

out_heightout_channel(# of kernel)

in_channel


CONV optimizationUtilize the AVX-512 ISA well

(broadcast) Load input to DRAM; Load kernels to ZMM; // up to 16 float32 vfmadd input, kernel, output Store output back to DRAM

in_height

in_width

kernel_width

kernel_height

out_width

out_heightout_channel(# of kernel)

in_channel

ow_inner

inputs kernels

ZMM_0

ZMM_1 - ZMM_{ow_inner}

+ ×

DRAM

outputs

vectorized FMA

Load 31 inputs to DRAM; Load kernels to ZMM; vfmadd input_1, kernel, output_1 vfmadd input_2, kernel, output_2… vfmadd input_31, kernel, output_31 Store output_{1…31} back to DRAM


Intel Graphics on Amazon DeepLens

Hardware Configs: Intel HD Graphics 500 (Intel’s Gen 9)• On-die integrated GPU• 12 EUs, 0.55 GHz• 7 physical threads per EU, 2 128-bit FPUs per EU• 105.6 GFLOPS peak performance• Work items in the same SIMD group form a subgroup sharing 4KB GRFs

• Intel Opencl ext: cl_intel_subgroups

• Shares the main memory with CPU


Instruction examples and corresponding TVM instructions• intel_sub_group_block_read/write ⇒ cache_read/write(buffer,

“warp”, [result])

• Intel_sub_group_shuffle ⇒ storage_align(axis, 16) and bind it to threads

Convolution:• Work items work on a certain block of workloads to utilize local memory• Layout transform for coalescing memory accesses• Utilize cl_intel_subgroups operations


Graph-level optimization


Graph-level layout optimization

Data

CONV

BATCH_NORM

RELU POOLING

CONV

FLATTEN

NCHW

NCHW

NCHW

Undef

NCHW

KernelOIHW

Mean / VarianceC

Kernel

Data

CONV_NCHW16c

BATCH_NORM

RELU POOLING

CONV_NCHW16c

FLATTEN

NCHW16c

NCHW16c

NCHW16c

NCHW16c

Kernel

Mean / Variance

Kernel

LayoutTransformNCHW

NCHW16c

LayoutTransformOIHW16i16o OIHW

LayoutTransform CC16c

OIHW LayoutTransform

LayoutTransform

NCHW16c

NCHWUndef

OIHW16i16oOIHW

optimizedlayout

LayoutTransform forparameters can be

pre-computed duringcompile time.

AlterOpLayout

NCHW

NCHW


Graph/tensor co-optimization

CONVi LayoutTransform CONVj

LayoutTransform CONVk LayoutTransform

CONVl

ELEWISE_ADD

LayoutTransform

CONV

LayoutTransform ?1 2 3

N-2 N-1 N

Yes

NoCONV schemes

CONV computing time: varies alongwith different CONV schemes

Layout Transform time: variesalong with different CONV schemes

Dynamic programming + necessary heuristics


End-to-end results Batch size = 1

Intel CPU Intel Graphics

0

0.5

1

1.5

2

2.5

ResNet-18

ResNet-34

ResNet-50

ResNet-101

ResNet-152

VGG-11

VGG-13

VGG-16

VGG-19

DenseNet-121

DenseNet-161

DenseNet-169

DenseNet-201

Inception-v3

MobileNetSSD

MXNet OpenVINO TVM

00.20.40.60.81

1.2

ResNet-18

ResNet-34

ResNet-50

ResNet-101

ResNet-152 SSD

OpenVINO TVM


Other functionalities


Runtime multi-threadingUse a customized thread pool for CPU targets• Lock-free queue using C++ atomics• Thread-binding to physical cores• Cache line padding


ResNet-152 VGG-19

DenseNet-121 Inception-v3


Graph Annotation

GPU

CPU

GPU

GPU

Annotation Copy node insertion Optimization/Compilation Runtime

GPU node CPU node Data copy node

GPU

CPU

GPU

GPU

TVM op

TVM op

TVM op

TVM op

GPU

CPU

GPU

GPU

CPU lib

GPU lib

CPU lib

GPU lib

graph lib file


Quantization on Intel CPUs

Hardware support : Fast INT8 operations with INT32 accumulationINT8 conv2d kernel requires new schedule

• Performs reduction in groups of 4 INT8 elements to INT32 elements• FP32 schedule does not require in-vector reduction

0

0.5

1

1.5

2

2.5

3

3.5

W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 W11 W12 W13 W14 W15 W16 W17 W18 W19 W20 W21 W22 W23 W24 W25

Spee

dup

norm

alize

d to

FP32

sche

dule

s

Workloads

INT8 schedules speedup for varying workloads of conv2d


ASICs – AWS inferentia


Takeaways


Takeaways• Industry needs an open standard compiler for DL

• AWS working on the TVM stack

mailto:[email protected]





• We are eager to collaborate with the community• Talk to us, we have 10+ people here today!






• We are eager to collaborate with the community• Talk to us, we have 10+ people here today!

• We are hiring!• Write to Vin Sharma ([email protected]) or Yida Wang

([email protected])



deep learning compiler - sampl: home€¦ · © 2018, amazon web services, inc. or its affiliates....

Documents