
Page 1:

Model Compression

Joseph E. Gonzalez, Co-director of the RISE Lab

[email protected]

Page 2:

What is the Problem Being Solved?

• Large neural networks are costly to deploy
  • Gigaflops of computation, hundreds of MB of storage

• Why are they costly?
  • Added computation requirements adversely affect throughput/latency/energy
  • Added memory requirements adversely affect:
    • download/storage of model parameters (OTA)
    • throughput and latency through caching
    • Energy! (5 pJ for an SRAM cache read, 640 pJ for DRAM vs. 0.9 pJ for a FLOP)
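To put those numbers in perspective, simple arithmetic on the figures above gives

$$\frac{640\ \text{pJ (DRAM read)}}{0.9\ \text{pJ (FLOP)}} \approx 711,$$

i.e., fetching one operand from DRAM costs roughly 700x the energy of the arithmetic it feeds, which is why reducing memory traffic can matter even more than reducing FLOPs.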

Page 3:

Approaches to “Compressing” Models

• Architectural Compression
  • Layer Design → typically using factorization techniques to reduce storage and computation
  • Pruning → eliminating weights, layers, or channels to reduce storage and computation from large pre-trained models

• Weight Compression
  • Low Bit Precision Arithmetic → weights and activations are stored and computed using low bit precision
  • Quantized Weight Encoding → weights are quantized and stored using dictionary encodings.
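The dictionary-encoding idea is easiest to see in code. Below is a minimal NumPy sketch (mine, not the deck's; the function names are made up for illustration): cluster the weights with a small k-means codebook and store low-bit codeword indices instead of floats.

```python
import numpy as np

def dictionary_encode(weights, n_codewords=16, n_iters=10):
    """Quantize a weight tensor to a small codebook (1-D k-means sketch)."""
    flat = weights.ravel()
    # Initialize codewords spread evenly across the weight range.
    codebook = np.linspace(flat.min(), flat.max(), n_codewords)
    for _ in range(n_iters):
        # Assign each weight to its nearest codeword.
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        # Move each codeword to the mean of the weights assigned to it.
        for k in range(n_codewords):
            if np.any(idx == k):
                codebook[k] = flat[idx == k].mean()
    return idx.reshape(weights.shape).astype(np.uint8), codebook

def dictionary_decode(indices, codebook):
    return codebook[indices]

# 16 codewords -> 4-bit indices: ~8x smaller than float32 storage,
# ignoring the (tiny) codebook itself.
w = np.random.randn(256, 256).astype(np.float32)
indices, codebook = dictionary_encode(w)
w_hat = dictionary_decode(indices, codebook)
print(np.abs(w - w_hat).max())  # quantization error
```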

Page 4:

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (2017)

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun

Megvii Inc. (Face++)

Page 5:

Related work (at the time)

• SqueezeNet (2016) – aggressively leverages 1x1 (point-wise) convolutions to reduce the inputs to 3x3 convolutions.
  • 57.5% accuracy (comparable to AlexNet)
  • 1.2M parameters → compressed down to 0.47 MB

• MobileNetV1 (2017) – aggressively leverages depth-wise separable convolutions to achieve:
  • 70.6% accuracy on ImageNet
  • 569M mult-adds
  • 4.2M parameters

Page 6:

Background

Page 7:

Regular Convolution

[Figure: a 3x3 kernel sliding over an input volume (Height x Width x Input Channels) to produce an output volume whose depth is the number of Output Channels.]

Computational complexity (for a 3x3 kernel): width x height x output channels x input channels x filter size

O(H · W · C_out · C_in · 3 · 3)

Combines information across space and across channels.
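As a concrete check, here is the same count in a short PyTorch sketch (my example; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Regular 3x3 convolution: every output channel mixes every input channel
# over a 3x3 spatial window.
H, W, C_in, C_out, K = 56, 56, 64, 128, 3
conv = nn.Conv2d(C_in, C_out, kernel_size=K, padding=1, bias=False)

x = torch.randn(1, C_in, H, W)
y = conv(x)  # shape: (1, C_out, H, W)

# One K*K*C_in dot product per output element.
mult_adds = H * W * C_out * C_in * K * K
print(y.shape, f"{mult_adds / 1e6:.1f}M mult-adds")
```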

Page 8:

1x1 Convolution (Point Convolution)

[Figure: a 1x1 kernel mapping Input Channels to Output Channels at every spatial position of the Height x Width volume.]

Computational complexity:

O(H · W · C_out · C_in)

Combines information across channels only.

Page 9:

Depthwise Convolution

[Figure: a separate 3x3 kernel applied to each input channel independently; Output Channels = Input Channels.]

Computational complexity (for a 3x3 kernel):

O(H · W · C · 3 · 3)

Combines information across space only.
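In PyTorch this is just a group convolution with one group per channel (a sketch, not from the deck):

```python
import torch
import torch.nn as nn

H, W, C = 56, 56, 64

# groups=C gives each channel its own 3x3 filter: no cross-channel mixing.
dwconv = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False)

x = torch.randn(1, C, H, W)
y = dwconv(x)  # (1, C, H, W)

# Weight shape (C, 1, 3, 3): one single-channel filter per input channel,
# so the cost is H * W * C * 3 * 3, a factor of C_out cheaper than a
# regular convolution.
print(dwconv.weight.shape)
```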

Page 10:

MobileNet Layer Architecture

[Figure: a 3x3 depthwise convolution (spatial aggregation) followed by a 1x1 pointwise convolution (channel aggregation).]

Computational cost:

O(H · W · C_in · 3 · 3) + O(H · W · C_in · C_out)
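A quick back-of-the-envelope comparison (my numbers, not the deck's) shows why this factorization pays off, and previews the observation on the next slide:

```python
# One layer at H = W = 56 with 128 input and output channels, 3x3 kernel.
H = W = 56
C_in = C_out = 128
K = 3

regular = H * W * C_out * C_in * K * K  # full convolution
depthwise = H * W * C_in * K * K        # spatial aggregation
pointwise = H * W * C_in * C_out        # channel aggregation
separable = depthwise + pointwise

print(f"regular:   {regular / 1e6:6.1f}M mult-adds")
print(f"separable: {separable / 1e6:6.1f}M mult-adds "
      f"({regular / separable:.1f}x cheaper)")
# The pointwise term is ~93% of `separable` here, consistent with the
# "95% of compute in 1x1 convolutions" quote on the next slide.
```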

Page 11:

Observation from MobileNet Paper

“MobileNet spends 95% of it’s computation time in 1x1 convolutions which also has 75% of the parameters as can be seen in Table 2.”

• Idea: eliminate the dense 1x1 conv but still achieve mixing of channel information?
  • Pointwise (1x1) group convolution
  • Channel shuffle

Page 12:

Group Convolution

[Figure: the input channels are split into g groups; each group of output channels is computed with a 3x3 kernel from only its own group of input channels.]

Used in AlexNet to partition the model across GPUs.

Computational complexity (for a 3x3 kernel and g groups):

O(H · W · C_out · C_in · 3 · 3 / g)

Combines some information across space and across channels.

Can we apply this to 1x1 (pointwise) convolution?
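In PyTorch, the same `groups` argument expresses this directly (a sketch with arbitrary sizes):

```python
import torch
import torch.nn as nn

H, W, C_in, C_out, g = 56, 56, 64, 128, 4

# Each of the g groups convolves C_in/g input channels into C_out/g output
# channels, so parameters and mult-adds shrink by a factor of g.
gconv = nn.Conv2d(C_in, C_out, kernel_size=3, padding=1,
                  groups=g, bias=False)

x = torch.randn(1, C_in, H, W)
y = gconv(x)  # (1, C_out, H, W)

dense_params = C_out * C_in * 3 * 3
print(gconv.weight.numel(), "params vs", dense_params, "for a dense conv")
```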

Page 13:

Pointwise (1x1) Group Convolution

[Figure: a 1x1 convolution applied independently within each of g channel groups.]

Computational complexity (for g groups):

O(H · W · C_out · C_in / g)

Combines some information across channels.

Issue: if applied repeatedly, the channel groups remain independent.

Page 14:

Channel Shuffle

• Permute channels between group convolution stages
  • Each group should get a channel from each of the previous groups.

• No arithmetic operations, but does require data movement
  • Good or bad for hardware?

[Figure: two stacked group convolutions with the channels shuffled between them.]
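The standard reshape-transpose implementation (a sketch; the deck shows no code, but this is how the operation is usually written):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Permute channels so each group receives channels from every group
    of the previous stage. Pure data movement: no arithmetic."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # View channels as a (groups, c // groups) grid, transpose, flatten.
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Channels 0..5 in 3 groups of 2 become [0, 2, 4, 1, 3, 5].
x = torch.arange(6).float().view(1, 6, 1, 1)
print(channel_shuffle(x, groups=3).flatten().tolist())
```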

Page 15:

ShuffleNet Architecture

[Figure: the ShuffleNet unit next to the MobileNet unit, in stride = 1 and stride = 2 variants. Based on residual networks.]
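Putting the pieces together, a stride-1 ShuffleNet unit looks roughly like this (a sketch following the paper's description; exact BatchNorm/ReLU placement and the stride-2 concatenation variant are simplified or omitted):

```python
import torch
import torch.nn as nn

class ShuffleNetUnit(nn.Module):
    """Stride-1 ShuffleNet unit (sketch): 1x1 group conv -> channel shuffle
    -> 3x3 depthwise conv -> 1x1 group conv, with a residual connection."""

    def __init__(self, channels: int, groups: int, bottleneck: int):
        super().__init__()
        self.groups = groups
        # 1x1 group conv reduces the width into a bottleneck.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, groups=groups, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
        )
        # 3x3 depthwise conv: spatial aggregation only.
        self.spatial = nn.Sequential(
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1,
                      groups=bottleneck, bias=False),
            nn.BatchNorm2d(bottleneck),
        )
        # 1x1 group conv expands back to the input width.
        self.expand = nn.Sequential(
            nn.Conv2d(bottleneck, channels, 1, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        out = self.reduce(x)
        # Channel shuffle between the two group convs (see earlier sketch).
        n, c, h, w = out.shape
        out = (out.view(n, self.groups, c // self.groups, h, w)
                  .transpose(1, 2).reshape(n, c, h, w))
        out = self.expand(self.spatial(out))
        return torch.relu(x + out)  # residual add

unit = ShuffleNetUnit(channels=240, groups=3, bottleneck=60)
print(unit(torch.randn(1, 240, 28, 28)).shape)  # (1, 240, 28, 28)
```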

Page 16:

Alternative Visualization

Images from https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d

[Figure: the ShuffleNet unit and the MobileNet unit drawn with the spatial dimension on one axis and the channel dimension on the other.]

Notice the cost of the 1x1 convolution.

Page 17:

ShuffleNet Architecture

Increased width when increasing the number of groups → constant FLOPs

Observation:

Shallow networks need more channels (width) to maintain accuracy.
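Why do more groups permit more width at constant FLOPs? For the pointwise group convolutions (my arithmetic, not the slide's), with C input and output channels:

$$\text{FLOPs} \propto \frac{C^2}{g} \quad\Longrightarrow\quad C \propto \sqrt{g}\ \text{at a fixed budget},$$

so increasing g frees compute that can be spent on extra channels, which matters most for the shallow, narrow configurations mentioned above.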

Page 18:

What are the Metrics of Success?

• Reduction in network size (parameters)

• Reduction in computation (FLOPs)

• Accuracy

• Runtime (latency)

Page 19:

Comparisons to Other Architectures

• Less computation and more accurate than MobileNet

• Can be configured to match the accuracy of other models while using less compute.

Page 20:

Runtime Performance on a Mobile Processor (Qualcomm Snapdragon 820)

• Faster and more accurate than MobileNet

• Caveats
  • Evaluated using a single thread
  • Unclear how this would perform on a GPU … no numbers reported

Page 21:

Model Size

• They don't report the memory footprint of the model
  • An ONNX implementation is 5.6 MB → ~1.4M parameters

• MobileNet reports model size
  • 4.2M parameters → ~16 MB

• Generally relatively small

Page 22:

Limitations and Future Impact

• Limitations
  • Decreases arithmetic intensity
  • They disable pointwise group convolution on smaller input layers due to "performance issues"

• Future Impact
  • Not yet as widely used as MobileNet

• Discussion:
  • Could it potentially benefit from hardware optimization?