© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Marshall Tappen and Ernesto Gonzalez
Amazon Fulfillment Technologies
November 30, 2016
MAC301
Transforming Industrial Processes with Deep Learning
What to Expect from the Session
• Description of how Amazon Fulfillment Technologies has
used computer vision to improve our processes.
• Walk through how we combined deep learning and
traditional computer vision to automate an industrial
process.
• What are the challenges and the opportunity created by
deep learning classifiers?
Overview of fulfillment process
One thing you have to understand about
fulfillment centers
Bins can hold anything
Misplaced inventory “disappears”
Amazon Confidential 5
Associate rearranged inventory when picking items.
We call this an inventory defect.
Items fall out of pods
Our solution: use computer vision to locate
inventory defects
First step: get a physical system to capture
images
[Figure labels: Station, Outbound frame, Inbound frame, Totes and conveyance]
Capture a set of images as the pod arrives at the station
[Diagram labels: Arrival Image Tower, Departure Image Tower, Station]
Associate interacts with pod
Photographed again as pod leaves
General strategy
• We want to take advantage of deep learning.
• The cameras capture images of an entire pod, but we
need data at the bin level.
• We will have a two-step process:
1. Extracting bins from images
2. Analyzing bin images
Computer vision step 1: pod image to bin
images
No problem, use 2-D barcodes!
Bands block the barcodes
Solution: if we can detect the trays
And we can detect the sides
We have a set of points to match with a recipe of the
pod’s geometry
Map the coordinate system of the database to
the face of the pod in the image
Detecting the side of a pod: downsample image
and convert to grayscale
[Figure: 2046 X 2046 image, downsampled to a 512 X 512 image]
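The downsample-and-grayscale step can be sketched in a few lines. This is a minimal NumPy illustration, not the production code: the function names, the Rec. 601 luma weights, and the block-averaging approach are all assumptions. The image sizes on the slide are not an exact integer ratio, so a real pipeline would more likely use an interpolating resize (for example OpenCV's `cv2.resize`).

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to (H, W) grayscale using Rec. 601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

def downsample(gray, factor):
    """Shrink by an integer factor by averaging each factor x factor block."""
    h, w = gray.shape
    h, w = h - h % factor, w - w % factor  # crop so the blocks tile exactly
    blocks = gray[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Hypothetical stand-in for a pod image (kept small so the example runs quickly).
pod_rgb = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
pod_small = downsample(to_grayscale(pod_rgb), factor=4)
print(pod_small.shape)  # (16, 16)
```

Working at lower resolution keeps the later template-matching step cheap, which matters when the code runs on CPU.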
Correlate* with left rail template
Filter
* In practice, we use normalized cross-correlation
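To make the footnote concrete, here is a small, unoptimized sketch of normalized cross-correlation written directly in NumPy. The function name and the toy stripe data are hypothetical. The slides say the system uses OpenCV but do not say which call; `cv2.matchTemplate` with `cv2.TM_CCOEFF_NORMED` computes the same kind of score far more efficiently.

```python
import numpy as np

def normalized_cross_correlation(image, template):
    """Slide `template` over `image`; return a map of scores in [-1, 1]."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    scores = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom > 0:  # flat patches have no defined correlation; leave them at 0
                scores[y, x] = (p * t).sum() / denom
    return scores

# Hypothetical toy data: a bright vertical stripe stands in for a pod rail.
image = np.zeros((12, 20))
image[:, 7:9] = 1.0
rail_template = np.zeros((5, 6))
rail_template[:, 2:4] = 1.0

scores = normalized_cross_correlation(image, rail_template)
best_y, best_x = np.unravel_index(np.argmax(scores), scores.shape)
```

Because both the patch and the template are mean-centered and scaled by their norms, the score does not change when the image gets uniformly brighter or gains contrast, which is the usual reason to prefer the normalized form over plain correlation.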
Threshold
Fit a line (similar process for the other side)
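The threshold and line-fitting steps might look like the sketch below. It assumes a correlation response map is already available; the function name, the threshold value, and the synthetic response map are hypothetical. Since a rail is close to vertical in the image, the sketch fits x as a function of y so that the slope stays finite.

```python
import numpy as np

def fit_rail_line(scores, threshold):
    """Threshold a response map and fit x = slope * y + intercept to the survivors."""
    ys, xs = np.nonzero(scores >= threshold)
    if len(ys) < 2 or ys.max() == ys.min():
        raise ValueError("need responses on at least two rows to fit a line")
    slope, intercept = np.polyfit(ys, xs, deg=1)
    return slope, intercept

# Hypothetical response map: strong responses along a slightly tilted rail.
scores = np.zeros((40, 40))
for y in range(40):
    scores[y, 10 + y // 10] = 0.9
scores[5, 30] = 0.3  # weak clutter response, removed by the threshold

slope, intercept = fit_rail_line(scores, threshold=0.5)
```

A plain least-squares fit is sensitive to outliers that survive the threshold; a robust fit would be a natural next step, but the slides do not say which method was used.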
We can detect trays in the same way
Now we have locations to tie the virtual template to the image!
Transformation between image and pod
physical coordinates is called a homography
We can verify that it works by calculating the boundary of each bin in the image and coloring it in.
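A homography can be estimated from four or more point correspondences with the direct linear transform. The sketch below is one standard way to do it in NumPy, not necessarily what the production system does; the pod dimensions, pixel coordinates, and bin rectangle are made-up values. OpenCV offers `cv2.findHomography` for the same purpose.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform: find H (3x3) with dst ~ H @ src in homogeneous coordinates."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    h = vt[-1].reshape(3, 3)  # null-space vector gives H up to scale
    return h / h[2, 2]

def apply_homography(h, points):
    """Map an (N, 2) list of points through H, dividing out the homogeneous coordinate."""
    pts = np.asarray(points, dtype=np.float64)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical pod face corners in physical units (meters) and their detected pixel positions.
pod_corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 2.0)]
image_corners = [(60.0, 40.0), (450.0, 55.0), (440.0, 480.0), (70.0, 470.0)]
H = estimate_homography(pod_corners, image_corners)

# Hypothetical bin rectangle from the pod's geometry, mapped into the image.
bin_in_pod = [(0.0, 0.5), (0.5, 0.5), (0.5, 0.8), (0.0, 0.8)]
bin_in_image = apply_homography(H, bin_in_pod)
```

Mapping each bin's corners this way gives the polygon to crop (or to color in, as the verification step describes).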
How can we use computer vision?
• Automatic identification of every item?
How can we use computer vision?
• Automatic identification of every item? (TOO HARD)
• Automatic counting of every item?
What does computer vision need to tell us?
• Automatic identification of every item? (TOO HARD)
• Automatic counting of every item? (TOO HARD)
Instead, we can look for changes
[Image pair: inbound to the station; outbound from the station]
Our first attempt was with hand-engineered
computer vision
It’s hard!
Must be robust to items rolling or shuffling inside
the bin, illumination changes, specularity, etc.
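To see why illumination alone is a problem, consider a deliberately naive change detector based on pixel differences. This is an illustration of the failure mode only; it is not the hand-engineered model from the talk, and the function name and thresholds are invented.

```python
import numpy as np

def naive_change_detected(before, after, pixel_delta=30, changed_fraction=0.05):
    """Flag a change when enough pixels differ by more than pixel_delta gray levels."""
    diff = np.abs(before.astype(np.int16) - after.astype(np.int16))
    return bool((diff > pixel_delta).mean() > changed_fraction)

rng = np.random.default_rng(0)
bin_before = rng.integers(40, 160, size=(32, 32)).astype(np.uint8)
# Same bin contents, but the lighting is uniformly 50 gray levels brighter.
bin_brighter = np.clip(bin_before.astype(np.int16) + 50, 0, 255).astype(np.uint8)

print(naive_change_detected(bin_before, bin_before))    # False
print(naive_change_detected(bin_before, bin_brighter))  # True (a false alarm)
```

Nothing in the bin moved, yet the detector fires. Items that roll or shift without being added or removed cause the opposite kind of confusion.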
The big insight
• We realized our problem was just binary classification.
• Two images in, one label out.
• Why not try this deep-learning thing?
We did the simplest thing possible
• Take the first image,
convert it to grayscale,
and put it in the red
channel of a new image
• Take the second image
and put it in the blue
channel
• Now, we have a single
image to pass to the
neural network
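The channel-packing trick described above can be sketched as follows. The function name is hypothetical, and two details are assumptions the slide does not state: the second image is also treated as grayscale, and the unused green channel is left at zero. The index order here is RGB; OpenCV arrays are BGR by default, so the indices would swap there.

```python
import numpy as np

def pack_pair(before_gray, after_gray):
    """Pack two same-sized grayscale images into one 3-channel image."""
    if before_gray.shape != after_gray.shape:
        raise ValueError("both bin images must have the same shape")
    packed = np.zeros(before_gray.shape + (3,), dtype=np.uint8)
    packed[..., 0] = before_gray  # red channel: image taken before the interaction
    packed[..., 2] = after_gray   # blue channel: image taken after the interaction
    return packed                 # green channel stays zero (assumption)

# Hypothetical bin crops.
before = np.full((4, 4), 120, dtype=np.uint8)
after = np.full((4, 4), 200, dtype=np.uint8)
pair_image = pack_pair(before, after)
print(pair_image.shape)  # (4, 4, 3)
```

The appeal of this encoding is that an off-the-shelf image classifier can consume the pair without any architectural change.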
It worked great!
[Figure: results for the best hand-engineered model, the CIFAR CNN, and Krizhevsky’s CNN]
Processing pipeline
Pod Image → Bin Extraction → Bin Images → Defect Detection
Implementation details
• Implemented in OpenCV in Python
• C++ extensions for some steps
• Neural net uses Caffe
• Trained on G2 instances
• Runs on CPU in FC server room
• Can tolerate latency in our current usage pattern
Software architecture
[Diagram: software architecture spanning the edge device(s), the site server room, and AWS. Labels recoverable from the slide:]
• Services labeled EC2: Inventory Event Correlator, VBI Service, Remote Count Website (Defect Detection), Inventory Bin Count Elimination
• Other components: Capture Event, Data Router, Bin Extraction Process, Auto Count Process, Local Storage Service, Camera Controller, File Pusher, Barcode Extraction, Applications
• Operations: Put Pod Face Images, Put Bin Images, Get Pod Images, Get Bin Image, Get Bin Defect Result, Get Bin Space Available, Get Work for Remote Counting
• Transports and storage: SNS, SQS, HTTP POST, DynamoDB
How can we use computer vision?
• Automatic identification of every item? (TOO HARD)
• Automatic counting of every item?
Could we just count the number of items in the
bin?
• At this point, we have lots of data.
• Some of it has errors from inventory defects, but
networks have proven resilient to this kind of thing.
• Why not just train a network to directly count the items in a bin?
Using a convolutional neural network
• We used the Caffe implementation of GoogLeNet [1]
[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Maps cleanly onto classification paradigm
• Treat it as a multi-class classification problem
[Diagram: neural network producing one score per class; scores shown: 0.1, 0.2, 0.4, 0.4]
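Treating counting as multi-class classification means the network emits one score per possible count and the prediction is the highest-scoring class. A minimal sketch, assuming class index k simply means "k items in the bin" (the slides do not spell out the class design) and using made-up logits:

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def predict_count(logits):
    """Return the most likely item count and the full probability vector."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    return int(np.argmax(probs)), probs

# Hypothetical network outputs for classes "0 items", "1 item", "2 items", "3 items".
count, probs = predict_count([0.1, 2.0, 0.3, -1.0])
print(count)  # 1
```

The probability of the winning class is also a convenient confidence signal: low-confidence bins could be routed to a person for counting, although the slides do not say whether that is how the system decides.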
This saved the project
• Hit the targets we needed
• Eliminated a lot of hardware (no more before/after shots
needed)
• Made the project cost effective
• Here is what we learned:
• Don’t focus on algorithms, focus on DATA
How else can we use this data?
• We want to find free space
in the bin without having to
label data.
• We can guess from
dimensions of items.
• But where is the space?
2.0
1.0
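One way to guess emptiness from item dimensions, with no manual labeling, is to compare the total item volume with the bin volume. This is a sketch under assumptions: the function name, the volume-based formula, and every dimension below are hypothetical, and the slides do not give the actual formula.

```python
def estimated_emptiness(bin_dims_cm, item_dims_cm):
    """Fraction of bin volume not accounted for by the listed items, clipped to [0, 1]."""
    bin_volume = bin_dims_cm[0] * bin_dims_cm[1] * bin_dims_cm[2]
    used_volume = sum(length * width * height for length, width, height in item_dims_cm)
    return min(1.0, max(0.0, 1.0 - used_volume / bin_volume))

# Hypothetical bin and catalog item dimensions, in centimeters.
bin_dims = (40.0, 30.0, 25.0)
items = [(20.0, 15.0, 10.0), (10.0, 10.0, 5.0)]
score = estimated_emptiness(bin_dims, items)
```

An estimate like this ignores how items are actually packed and any inventory defects in the records, so it is only a rough proxy label.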
Train model to predict emptiness from an image
[Diagram labels: GoogLeNet, Conv (3*3), Avg Pool, Conv, Emptiness score]
This is a noisy, probably incorrect estimate!
But we can use layers in the network to find where the
space actually is!
[Diagram labels: GoogLeNet, Conv (3*3), 1024 channels, 3*3, Avg Pool, Conv, emptiness score]
[Figure panels: Original image, Activation map, Binary map]
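The step from an intermediate layer to a binary map might be sketched as below. The slides do not say how the activation map is derived from the 1024-channel layer, so the channel averaging, min-max normalization, and 0.5 threshold here are all assumptions, and the feature tensor is synthetic.

```python
import numpy as np

def activation_to_binary_map(features, threshold=0.5):
    """features: (channels, H, W) activations taken from before the average pool."""
    activation = features.mean(axis=0)  # collapse channels into one spatial map
    lo, hi = activation.min(), activation.max()
    if hi > lo:
        activation = (activation - lo) / (hi - lo)  # rescale to [0, 1]
    else:
        activation = np.zeros_like(activation)
    return activation, activation >= threshold

# Synthetic 1024-channel feature tensor with a strongly activated top-right region.
rng = np.random.default_rng(0)
features = rng.random((1024, 7, 7)) * 0.1
features[:, 0:3, 4:7] += 1.0

activation, binary = activation_to_binary_map(features)
print(int(binary.sum()))  # 9
```

In practice the coarse map would then be upsampled to the bin image's resolution before being overlaid on the original image; that step is omitted here.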
And it works!
We are releasing a dataset
Takeaways
• We have great pattern recognition machinery now.
• Focus on the data:
• How can you get lots of it?
• What can you get for free?
• How much labeling do you really need?
• Is there a proxy problem?
Thank you!
Remember to complete
your evaluations!