SSD: Single Shot MultiBox Detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, Alexander C. Berg
Slides by: Sulabh Shrestha
Receptive Field
Ref: https://cv-tricks.com/object-detection/single-shot-multibox-detector-ssd/
▪ Deep feature maps
  ▪ Smaller size
  ▪ Larger receptive fields
  ▪ May miss small objects
▪ Shallow feature maps
  ▪ Larger size
  ▪ Smaller receptive fields
  ▪ May not be able to see larger objects
▪ Use multiple feature maps, each detecting objects sized to its receptive field
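The deep-vs-shallow trade-off above can be made concrete by computing the receptive field of a small conv stack. This is an illustrative sketch (the layer list and function name are assumptions, not the actual VGG configuration from the slides):

```python
# Sketch: how the receptive field grows with depth (illustrative layer stack).
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output.
    Returns the receptive field size (in input pixels) of one top-level unit."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two conv+pool stages: deeper layers see a much larger input window
shallow = [(3, 1), (3, 1), (2, 2)]
deep = shallow + [(3, 1), (3, 1), (2, 2)]
print(receptive_field(shallow), receptive_field(deep))  # 6 vs 16
```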
Architecture
▪ Base network + extra feature layers
▪ No FC layers
▪ Specific feature maps are responsive to a particular scale of objects
  ▪ Not necessarily the same as the receptive field
  ▪ A hyper-parameter
  ▪ Dependent on the data
(figure: 8x8 feature map vs 4x4 feature map)
Base Network
▪ VGG-16
▪ Pool5 changed:
  ▪ 3x3 kernel instead of 2x2
  ▪ Stride 1 instead of 2
▪ First two FC layers replaced by convolutions
  ▪ DeepLab LargeFOV style (atrous)
▪ Last FC layer removed altogether
▪ No dropout used
▪ Conv4_3 also used for prediction
  ▪ 4th group of convolutions, 3rd kernel
Ref: Very Deep Convolutional Networks for Large-Scale Image Recognition
Multiple Default Boxes
▪ Similar to the anchor boxes of Faster R-CNN
▪ Example feature map:
  ▪ m x n, with p channels
▪ For each location (i, j):
  ▪ Multiple default boxes (k of them)
  ▪ A 3 x 3 x p convolutional filter per output
  ▪ Confidence of each class ci, i ∈ [1, C]
  ▪ Offsets x, y, w, h
  ▪ (C + 4) outputs per default box
▪ Total outputs for one feature map: m * n * k * (C + 4)
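The per-feature-map output count above can be sanity-checked with a one-liner (the function name and example numbers are illustrative):

```python
# Sketch: total predictions from one m x n feature map with k default boxes
# per location, each producing num_classes confidences + 4 box offsets.
def ssd_head_outputs(m, n, k, num_classes):
    return m * n * k * (num_classes + 4)

# Example: an 8x8 map, 4 default boxes per cell, 21 classes (VOC + background)
print(ssd_head_outputs(8, 8, 4, 21))  # 8 * 8 * 4 * 25 = 6400
```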
Scale and Aspect Ratio
▪ How many default boxes per location?
▪ Scale
  ▪ Related to, but not exactly the same as, the receptive field
  ▪ If m feature maps are used for prediction, scales are spaced linearly between smin = 0.2 and smax = 0.9
  ▪ Eg. s = 0.2, img-size = 300 → default box size = 0.2 * 300 = 60
▪ Aspect ratios (ar): {1, 2, 3, 1/2, 1/3} → k boxes
  ▪ Width: wk = sk * √ar
  ▪ Height: hk = sk / √ar
  ▪ Eg. s = 0.2, img-size = 300:
    ▪ ar = 1 → w = 0.2 * 300 = 60, h = 0.2 * 300 = 60
    ▪ ar = 2 → w = 0.2 * √2 * 300 ≈ 85, h = 0.2 / √2 * 300 ≈ 42
    ▪ ar = 1/2 → w = 0.2 * √(1/2) * 300 ≈ 42, h = 0.2 / √(1/2) * 300 ≈ 85
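The scale and aspect-ratio rules above can be sketched in pure Python (function names are illustrative; the linear spacing of scales between smin and smax follows the SSD paper):

```python
import math

def feature_map_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced scales for m prediction feature maps."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]

def default_box_sizes(scale, img_size, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """(width, height) in pixels per aspect ratio: w = s*sqrt(ar)*img,
    h = s/sqrt(ar)*img, so all boxes at one scale have equal area."""
    return [(round(scale * math.sqrt(ar) * img_size),
             round(scale / math.sqrt(ar) * img_size))
            for ar in aspect_ratios]

print(feature_map_scales(6))        # 0.2 ... 0.9 in equal steps
print(default_box_sizes(0.2, 300))  # ar=1 -> (60, 60), ar=2 -> (85, 42), ...
```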
Training
• Base network pre-trained on the ImageNet CLS-LOC dataset
• Fine-tuned on the respective detection dataset
• Matching strategy
  • Any default box with IoU(default box, ground truth) > 0.5 → positive
  • Simplifies the learning problem
  • An object can be detected in multiple overlapping default boxes
• Loss
  • Confidence loss (c)
    • Softmax loss over the classes
  • Localization loss (x, y, w, h)
    • Smooth L1 loss
    • Ground-truth box (g) vs. default box (l)
Ref: https://github.com/rbgirshick/py-faster-rcnn/files/764206/SmoothL1Loss.1.pdf
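A minimal sketch of the two training ingredients above, IoU matching against the 0.5 threshold and the Smooth L1 localization loss (pure Python; function names are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def smooth_l1(x):
    """Smooth L1: quadratic near zero (0.5*x^2), linear (|x| - 0.5) for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

# A default box is a positive match when IoU with a ground truth > 0.5
gt, default_box = (0, 0, 100, 100), (10, 10, 110, 110)
print(iou(gt, default_box) > 0.5)  # True: overlap is ~0.68
```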
Results
PASCAL VOC2007 test detection results
PASCAL VOC2012 test detection results
Inference
• Filter boxes with low confidence
• NMS with 0.45 IOU
• Take top 200 detections
• Better mAP and faster FPS on VOC2007 test data
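The inference steps above (confidence filter, 0.45-IoU NMS, top-200 cut) can be sketched as a greedy loop; this is an illustrative pure-Python version, not the paper's implementation:

```python
def iou(a, b):
    """Intersection-over-union of (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
    return inter / union

def nms(boxes, scores, conf_threshold=0.01, iou_threshold=0.45, top_k=200):
    """Greedy NMS: drop low-confidence boxes, then repeatedly keep the
    highest-scoring box and suppress overlaps above iou_threshold."""
    order = [i for i in sorted(range(len(scores)),
                               key=lambda i: scores[i], reverse=True)
             if scores[i] >= conf_threshold]
    keep = []
    while order and len(keep) < top_k:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 above the threshold
```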
Analysis
• Better than two-stage networks:
  • Single network for both localization and classification
• Better than YOLO:
  • Uses multiple feature maps
  • Uses many more default boxes
  • No FC layer
    • Faster inference
    • Fewer parameters
  • Smaller input size
    • Faster R-CNN: 600 min. size
    • YOLO: 448 x 448
Ablation Studies - 1
• Data augmentation helps
  • Original image
  • Random sampled patch
  • Sampled patch with minimum IoU of 0.1, 0.3, 0.5, 0.7, or 0.9
• More default boxes help
• Using FC instead of convolution (atrous):
  • Similar result
  • ~20% slower
Ablation Studies - 2
• Use different numbers of feature maps
  • Similar total number of default boxes, to make the comparison fair
• More feature maps is better
  • Up to a certain extent
• Not using boundary default boxes is better
  • Avoids default boxes lying outside the image
Thank you
Questions?