deep learning compiler - sampl: home€¦ · © 2018, amazon web services, inc. or its affiliates....
TRANSCRIPT
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon/Intel Confidential
AWS AI
Deep Learning Compiler
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Acknowledgement
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Sagemaker Neo
Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge
Product targets• Amazon Rekognition• AWS DeepLens• Amazon Lex• …• And a lot of internal/external
products
Hardware targets• Intel CPU, Intel graphics• ARM CPU, ARM GPU• Nvidia GPU• FPGA• ASIC• …
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CONV Kernel tuning
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Intel Xeon Platinum 8000-series CPUs (Skylake)
• Multi-cores• E.g., EC2 c5.9xlarge: 1 processor with 18 cores.
• AVX-512 supported• 512-bit width registers (ZMM)• E.g. vfmadd231ps -1664(%rax,%r13){1to16},
%zmm0, %zmm1
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CONV optimizationData layout is important!conv = tvm.compute(oshape, lambda n, oc, oh, ow: tvm.sum( data[n, ic, oh*stride+kh, ow*stride+kw] * kernel[oc, ic, kh, kw], axis=[ic, kh, kw]),) for (n, 0, N):
for (oc, 0, OC): for (oh, 0, OH): for (ow, 0, OW): Out[n, oc, oh, ow] = 0 // init Out for (ic, 0, IC): for (kh, 0, KH): for (kw, 0, KW): // Out += In * Kernel
• NCHW -> NHWC • NCHW -> NCHW[x]c
• OIHW-> OIHW[x]i[y]o
in_height
in_width
kernel_width
kernel_height
out_width
out_heightout_channel(# of kernel)
in_channel
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CONV optimizationUtilize the AVX-512 ISA well
(broadcast) Load input to DRAM; Load kernels to ZMM; // up to 16 float32 vfmadd input, kernel, output Store output back to DRAM
in_height
in_width
kernel_width
kernel_height
out_width
out_heightout_channel(# of kernel)
in_channel
ow_inner
inputs kernels
ZMM_0
ZMM_1 - ZMM_{ow_inner}
+ ×
DRAM
outputs
vectorized FMA
Load 31 inputs to DRAM; Load kernels to ZMM; vfmadd input_1, kernel, output_1 vfmadd input_2, kernel, output_2… vfmadd input_31, kernel, output_31 Store output_{1…31} back to DRAM
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Intel Graphics on Amazon DeepLens
Hardware Configs: Intel HD Graphics 500 (Intel’s Gen 9)• On-die integrated GPU• 12 EUs, 0.55 GHz• 7 physical threads per EU, 2 128-bit FPUs per EU• 105.6 GFLOPS peak performance• Work items in the same SIMD group form a subgroup sharing 4KB GRFs
• Intel Opencl ext: cl_intel_subgroups
• Shares the main memory with CPU
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Instruction examples and corresponding TVM instructions• intel_sub_group_block_read/write ⇒ cache_read/write(buffer,
“warp”, [result])
• Intel_sub_group_shuffle ⇒ storage_align(axis, 16) and bind it to threads
Convolution:• Work items work on a certain block of workloads to utilize local memory• Layout transform for coalescing memory accesses• Utilize cl_intel_subgroups operations
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Graph-level optimization
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Graph-level layout optimization
Data
CONV
BATCH_NORM
RELU POOLING
CONV
FLATTEN
NCHW
NCHW
NCHW
Undef
NCHW
KernelOIHW
Mean / VarianceC
Kernel
Data
CONV_NCHW16c
BATCH_NORM
RELU POOLING
CONV_NCHW16c
FLATTEN
NCHW16c
NCHW16c
NCHW16c
NCHW16c
Kernel
Mean / Variance
Kernel
LayoutTransformNCHW
NCHW16c
LayoutTransformOIHW16i16o OIHW
LayoutTransform CC16c
OIHW LayoutTransform
LayoutTransform
NCHW16c
NCHWUndef
OIHW16i16oOIHW
optimizedlayout
LayoutTransform forparameters can be
pre-computed duringcompile time.
AlterOpLayout
NCHW
NCHW
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Graph/tensor co-optimization
CONVi LayoutTransform CONVj
LayoutTransform CONVk LayoutTransform
CONVl
ELEWISE_ADD
LayoutTransform
CONV
LayoutTransform ?1 2 3
N-2 N-1 N
Yes
NoCONV schemes
CONV computing time: varies alongwith different CONV schemes
Layout Transform time: variesalong with different CONV schemes
Dynamic programming + necessary heuristics
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
End-to-end results Batch size = 1
Intel CPU Intel Graphics
0
0.5
1
1.5
2
2.5
ResNet-18
ResNet-34
ResNet-50
ResNet-101
ResNet-152
VGG-11
VGG-13
VGG-16
VGG-19
DenseNet-121
DenseNet-161
DenseNet-169
DenseNet-201
Inception-v3
MobileNetSSD
MXNet OpenVINO TVM
00.20.40.60.81
1.2
ResNet-18
ResNet-34
ResNet-50
ResNet-101
ResNet-152 SSD
OpenVINO TVM
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Other functionalities
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Runtime multi-threadingUse a customized thread pool for CPU targets• Lock-free queue using C++ atomics• Thread-binding to physical cores• Cache line padding
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ResNet-152 VGG-19
DenseNet-121 Inception-v3
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Graph Annotation
GPU
CPU
GPU
GPU
Annotation Copy node insertion Optimization/Compilation Runtime
GPU node CPU node Data copy node
GPU
CPU
GPU
GPU
TVM op
TVM op
TVM op
TVM op
GPU
CPU
GPU
GPU
CPU lib
GPU lib
CPU lib
GPU lib
graph lib file
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Quantization on Intel CPUs
Hardware support : Fast INT8 operations with INT32 accumulationINT8 conv2d kernel requires new schedule
• Performs reduction in groups of 4 INT8 elements to INT32 elements• FP32 schedule does not require in-vector reduction
0
0.5
1
1.5
2
2.5
3
3.5
W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 W11 W12 W13 W14 W15 W16 W17 W18 W19 W20 W21 W22 W23 W24 W25
Spee
dup
norm
alize
d to
FP32
sche
dule
s
Workloads
INT8 schedules speedup for varying workloads of conv2d
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ASICs – AWS inferentia
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Takeaways
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Takeaways• Industry needs an open standard compiler for DL
• AWS working on the TVM stack
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Takeaways• Industry needs an open standard compiler for DL
• AWS working on the TVM stack
• We are eager to collaborate with the community• Talk to us, we have 10+ people here today!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Takeaways• Industry needs an open standard compiler for DL
• AWS working on the TVM stack
• We are eager to collaborate with the community• Talk to us, we have 10+ people here today!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Takeaways• Industry needs an open standard compiler for DL
• AWS working on the TVM stack
• We are eager to collaborate with the community• Talk to us, we have 10+ people here today!
• We are hiring!• Write to Vin Sharma ([email protected]) or Yida Wang