Lecture 10: Other Applications of CNNs
Bohyung [email protected]
CSED703R: Deep Learning for Visual Recognition (2017F)
Applications of Convolutional Neural Networks
• Face recognition and verification
• Person re-identification
• Text region detection
• Style transfer
• Object generation
• Visual attention and saliency
• Visual analogy
• Autonomous driving
• Many others
Neural Artistic Style
• Main goal
§ Synthesizing an image from two inputs, one providing the content and the other the style
§ Exploiting a CNN pretrained for image classification
• CNN
§ VGG 19-layer network without fully connected layers
§ No fine-tuning
§ Average pooling: improves gradient flow and gives more appealing results
[Gatys16] L. A. Gatys, A. S. Ecker, M. Bethge: Image Style Transfer Using Convolutional Neural Networks. CVPR 2016
Method
[Figure: the content image and the style image are the inputs to the CNN. A loss on feature maps (content) and a loss on feature-map correlations (style) are added to produce the final output.]
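Optimization
• Loss
§ $p$ and $P^l$: original content image and its feature map in the $l$-th layer
§ $x$ and $F^l$: generated image and its feature map in the $l$-th layer
§ $a$ and $S^l$: original style image and its feature map in the $l$-th layer
§ $F^l, P^l, S^l \in \mathbb{R}^{N_l \times M_l}$, where $N_l$ is the number of feature maps and $M_l$ is the size of each feature map, $M_l = \mathrm{width}_l \times \mathrm{height}_l$
§ $G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$, $G^l \in \mathbb{R}^{N_l \times N_l}$: correlation of feature maps of the generated image in the $l$-th layer
§ $A^l_{ij} = \sum_k S^l_{ik} S^l_{jk}$, $A^l \in \mathbb{R}^{N_l \times N_l}$: correlation of feature maps of the style image in the $l$-th layer
$$\mathcal{L}_{total}(p, a, x) = \alpha \mathcal{L}_{content}(p, x, l) + \beta \mathcal{L}_{style}(a, x)$$
$$\mathcal{L}_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$
$$\mathcal{L}_{style}(a, x) = \frac{1}{2} \sum_{l}^{L} \frac{w_l}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$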
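The losses above can be sketched in NumPy, assuming the feature maps have already been extracted from the network as arrays of shape (N_l, M_l). The function names and the default values of `alpha` and `beta` are illustrative choices, not part of the lecture.

```python
import numpy as np

def gram(features):
    """Correlation (Gram) matrix of an (N_l, M_l) feature map."""
    return features @ features.T

def content_loss(F, P):
    """0.5 * sum_ij (F_ij - P_ij)^2 for one chosen layer."""
    return 0.5 * np.sum((F - P) ** 2)

def style_layer_loss(F, S):
    """Per-layer style term: sum_ij (G_ij - A_ij)^2 / (4 N_l^2 M_l^2)."""
    n, m = F.shape
    G, A = gram(F), gram(S)
    return np.sum((G - A) ** 2) / (4.0 * n**2 * m**2)

def total_loss(F_content, P, F_layers, S_layers, weights, alpha=1.0, beta=1e3):
    """alpha * L_content + beta * L_style (alpha, beta defaults are assumptions)."""
    style = 0.5 * sum(
        w * style_layer_loss(F, S)
        for w, F, S in zip(weights, F_layers, S_layers)
    )
    return alpha * content_loss(F_content, P) + beta * style
```

In practice `F_content` and `P` would come from one layer (such as conv4_2) and the style lists from several layers, each with its own weight.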
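Optimization
• Error back-propagation
§ Content: select a particular layer such as conv4_2
§ Style: use conv*_1 with equal weights ($w_l = 0.2$)
$$\mathcal{L}_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$
$$\mathcal{L}_{style}(a, x) = \frac{1}{2} \sum_{l}^{L} w_l E_l, \qquad E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$
Update rule (content):
$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} F^l_{ij} - P^l_{ij} & \text{if } F^l_{ij} \ge 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$$
Update rule (style):
$$\frac{\partial \mathcal{L}_{style}}{\partial F^l_{ij}} = \frac{\partial \mathcal{L}_{style}}{\partial E_l} \frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{w_l}{2 N_l^2 M_l^2} \left( (F^l)^\top (G^l - A^l) \right)_{ji} & \text{if } F^l_{ij} \ge 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$$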
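The style gradient can be checked numerically. The sketch below covers only the per-layer term E_l (the factor w_l / 2 is left out) and compares the analytic gradient with central finite differences. The helper names are hypothetical, and the feature map in the usage is assumed non-negative so that the ReLU mask has no effect.

```python
import numpy as np

def style_layer_loss(F, A):
    """E_l = sum_ij (G_ij - A_ij)^2 / (4 N_l^2 M_l^2), with G = F F^T."""
    n, m = F.shape
    G = F @ F.T
    return np.sum((G - A) ** 2) / (4.0 * n**2 * m**2)

def style_layer_grad(F, A):
    """Analytic dE_l/dF: ((F^T (G - A))^T)_ij / (N_l^2 M_l^2), zero where F < 0."""
    n, m = F.shape
    G = F @ F.T
    grad = (F.T @ (G - A)).T / (n**2 * m**2)
    return np.where(F >= 0, grad, 0.0)

def numeric_grad(F, A, eps=1e-5):
    """Central finite differences, one entry of F at a time."""
    grad = np.zeros_like(F)
    for idx in np.ndindex(*F.shape):
        step = np.zeros_like(F)
        step[idx] = eps
        grad[idx] = (style_layer_loss(F + step, A)
                     - style_layer_loss(F - step, A)) / (2 * eps)
    return grad
```

Calling both functions on the same non-negative feature map should give matching arrays up to finite-difference error.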
Generated Images
[Figure: a source image rendered in Style 1 (The Starry Night) and Style 2 (The Scream)]
More Examples
Balance between Content and Style
Multiple Styles
[Figure: a source image combined with Style 1 (The Starry Night) and Style 2 (The Scream)]
Face Verification
• Definition
§ Given two faces, determine whether they belong to the same person or not.
§ Binary decision by one-to-one matching
• Related problems
§ Face detection: finding faces
§ Face recognition: multi-class classification problem
• Standard pipeline
§ Face detection → Face alignment → Feature extraction → Binary classification
Siamese Network
• Discriminative Deep Metric Learning (DDML)
§ Learning a distance metric
§ Representation learning
§ Two branches share weights.
§ Objective
$$d_f^2(x_i, x_j) = \left\| f(x_i) - f(x_j) \right\|_2^2$$
$$\ell_{ij} \left( \tau - d_f^2(x_i, x_j) \right) > 1$$
$\ell_{ij} = 1$: same ID
$\ell_{ij} = -1$: different ID
$\tau > 1$
[Hu2014] J. Hu, J. Lu, Y.-P. Tan: Discriminative Deep Metric Learning for Face Verification in the Wild. CVPR 2014
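The DDML constraint can be illustrated with a small NumPy sketch that takes embeddings f(x) as plain vectors. The hinge penalty is a simple surrogate chosen here for illustration and is not claimed to be the loss used in [Hu2014]. All function names are hypothetical.

```python
import numpy as np

def ddml_distance(fx_i, fx_j):
    """Squared Euclidean distance between two embeddings."""
    return float(np.sum((fx_i - fx_j) ** 2))

def ddml_satisfied(fx_i, fx_j, label, tau):
    """True when label * (tau - d^2) > 1, with label = +1 (same ID) or -1."""
    return label * (tau - ddml_distance(fx_i, fx_j)) > 1

def ddml_hinge(fx_i, fx_j, label, tau):
    """Illustrative hinge penalty: zero when the constraint holds."""
    return max(0.0, 1.0 - label * (tau - ddml_distance(fx_i, fx_j)))
```

With tau = 2, a same-identity pair must have squared distance below 1 and a different-identity pair must have squared distance above 3.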
DeepID I
• Deep hidden IDentity features (DeepID)
[Sun2014a] Y. Sun, X. Wang, X. Tang: Deep Learning Face Representation from Predicting 10,000 Classes. CVPR 2014
97.45% verification accuracy: close to human performance (97.53%)
CNN Architecture
• CNN architecture
$$y_j = \max\left( 0, \sum_i x_i^1 w_{i,j}^1 + \sum_i x_i^2 w_{i,j}^2 + b_j \right)$$
Multiple scales
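Verification Algorithm
• Joint Bayesian
§ $x = \mu + \epsilon$, where $\mu$ is the face identity and $\epsilon$ is the intra-class variation
§ $\mu \sim \mathcal{N}(0, S_\mu)$ and $\epsilon \sim \mathcal{N}(0, S_\epsilon)$
§ Compute $r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}$, which has a closed-form solution.
• Neural networks
[Figure labels: highly correlated subfeature (640D); 60 groups]
Comparison between Two Verifiers
• Joint Bayesian is better than the neural network.
[Figure: test accuracy (%) versus the number of classes used for training]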
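The Joint Bayesian ratio can be evaluated directly from the two Gaussian hypotheses instead of the closed form. Under the model x = mu + epsilon, a same-identity pair (H_I) shares mu, so the two features have cross-covariance S_mu, while a different-identity pair (H_E) has zero cross-covariance. The sketch below assumes S_mu and S_eps are already estimated; the function names are hypothetical.

```python
import numpy as np

def gaussian_logpdf(v, cov):
    """Log-density of a zero-mean Gaussian with covariance cov at v."""
    d = v.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = v @ np.linalg.solve(cov, v)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def joint_bayesian_ratio(x1, x2, s_mu, s_eps):
    """r(x1, x2) = log P(x1, x2 | H_I) - log P(x1, x2 | H_E)."""
    d = x1.shape[0]
    zero = np.zeros((d, d))
    total = s_mu + s_eps
    cov_same = np.block([[total, s_mu], [s_mu, total]])   # shared identity
    cov_diff = np.block([[total, zero], [zero, total]])   # independent identities
    v = np.concatenate([x1, x2])
    return gaussian_logpdf(v, cov_same) - gaussian_logpdf(v, cov_diff)
```

A positive ratio favors the same-identity hypothesis; thresholding it gives the binary verification decision.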
Results
Comparison of state-of-the-art face verification methods on LFW
Method | Accuracy (%) | No. of points | No. of outside images | Feature dimension
Joint Bayesian [8] | 92.42 (o) | 5 | 99,773 | 2000 × 4
ConvNet-RBM [31] | 92.52 (o) | 3 | 87,628 | N/A
CMD+SLBP [17] | 92.58 (u) | 3 | N/A | 2302
Fisher vector faces [29] | 93.03 (u) | 9 | N/A | 128 × 2
Tom-vs-Pete classifiers [2] | 93.30 (o+r) | 95 | 20,639 | 5000
High-dim LBP [9] | 95.17 (o) | 27 | 99,773 | 2000
TL Joint Bayesian [6] | 96.33 (o+u) | 27 | 99,773 | 2000
DeepFace [32] | 97.25 (o+u) | 6 + 67 | 4,400,000 + 3,000,000 | 4096 × 4
DeepID on CelebFaces | 96.05 (o) | 5 | 87,628 | 150
DeepID on CelebFaces+ | 97.20 (o) | 5 | 202,599 | 150
DeepID on CelebFaces+ & TL | 97.45 (o+u) | 5 | 202,599 | 150
Human-level performance: 97.53
r: restricted training protocol, where the 6000 face pairs given by LFW are used for 10-fold cross-validation
u: unrestricted training protocol, where more training pairs can be generated from LFW using identity labels
o: using outside training data, without using training data from LFW
o+r: using both outside data and LFW data in the restricted protocol for training
o+u: using both outside data and LFW data in the unrestricted protocol for training
TL: Joint Bayesian transfer learning from CelebFaces+ to LFW
DeepID II
• Joint identification-verification
§ Face identification: increases the inter-personal variations by drawing DeepID2 features extracted from different identities apart
§ Face verification: reduces the intra-personal variations by pulling DeepID2 features extracted from the same identity together
• Feature extraction
$$f = \mathrm{Conv}(x, \theta_c)$$
[Sun2014b] Y. Sun, Y. Chen, X. Wang, X. Tang: Deep Learning Face Representation by Joint Identification-Verification. NIPS 2014
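Training CNN
• Two loss functions
§ Identification loss: cross-entropy
$$\mathrm{Ident}(f, t, \theta_{id}) = -\sum_{i=1}^{n} p_i \log \hat{p}_i = -\log \hat{p}_t$$
§ Verification loss
$$\mathrm{Verif}(f_i, f_j, y_{ij}, \theta_{ve}) = \begin{cases} \frac{1}{2} \left\| f_i - f_j \right\|_2^2 & \text{if } y_{ij} = 1 \\ \frac{1}{2} \max\left( 0, m - \left\| f_i - f_j \right\|_2 \right)^2 & \text{if } y_{ij} = -1 \end{cases}$$
$f = \mathrm{Conv}(x, \theta_c)$
$\theta_{id}$: parameters of the softmax layer
$\theta_{ve} = m$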
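The two losses can be written as a short NumPy sketch. It assumes the identification head outputs raw class scores (`logits`) that are turned into probabilities by a softmax; the function names are illustrative.

```python
import numpy as np

def ident_loss(logits, target):
    """Cross-entropy -log p_hat_t, computed with a numerically stable log-softmax."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[target])

def verif_loss(f_i, f_j, y_ij, m):
    """Pull same-identity pairs together, push others at least m apart."""
    dist = float(np.linalg.norm(f_i - f_j))
    if y_ij == 1:
        return 0.5 * dist**2
    return 0.5 * max(0.0, m - dist) ** 2
```

For a different-identity pair the loss vanishes once the features are farther apart than the margin m, so only pairs that are too close contribute gradients.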
Verification Algorithm
• Feature extraction
§ Detect 21 facial landmarks by the SDM algorithm and align faces globally
§ Crop 400 face patches with variations in positions, scales, color channels, and horizontal flipping
• ConvNet
§ 200 CNNs: generate 400 DeepID2 feature vectors with horizontal flipping
§ Feature vector: 160D
• Feature dimensionality reduction
§ Select 25 patches in a greedy manner
§ PCA from 25 × 160D to 180D
• Joint Bayesian for verification
[Figure: the 25 selected face patches]
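The dimensionality reduction step can be sketched as follows: the 25 selected 160-D vectors per face are concatenated into a 4000-D vector and projected to 180-D with PCA. This is a generic SVD-based PCA, not the authors' implementation, and the random array in the usage only stands in for real DeepID2 features.

```python
import numpy as np

def pca_reduce(features, out_dim=180):
    """Project (n_faces, 25, 160) features to (n_faces, out_dim) with PCA."""
    flat = features.reshape(features.shape[0], -1)   # (n_faces, 4000)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:out_dim]
    return centered @ components.T, components

# Usage with stand-in data: 200 faces, 25 patches, 160-D features each.
rng = np.random.default_rng(0)
reduced, components = pca_reduce(rng.normal(size=(200, 25, 160)))
```

The reduced vectors would then be the inputs to the Joint Bayesian verifier.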
Results on LFW
Method | Accuracy (%)
High-dim LBP [4] | 95.17 ± 1.13
TL Joint Bayesian [2] | 96.33 ± 1.08
DeepFace [21] | 97.35 ± 0.25
DeepID [20] | 97.45 ± 0.26
GaussianFace [13] | 98.52 ± 0.66
DeepID2 | 99.15 ± 0.13
Human-level performance: 97.53
FaceNet
• Architecture
§ Direct mapping between face images and embedded points
• Triplet loss
§ Based on large margin nearest neighbor (LMNN)
[Schroff15] F. Schroff, D. Kalenichenko, J. Philbin: FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
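The slide does not spell out the triplet loss, so the sketch below uses its usual form: an anchor should be closer to a positive (same identity) than to a negative (different identity) by a margin alpha, measured in squared Euclidean distance. The function name and the default margin are assumptions made for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + alpha), per triplet."""
    pos = np.sum((anchor - positive) ** 2, axis=-1)
    neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, pos - neg + alpha)
```

The function accepts either single embeddings or batches of shape (n, d), returning one loss value per triplet.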
Discriminative vs. Generative CNN
• Discriminative CNN: image → CNN → object class, viewpoint, style, …
• Generative CNN: object class, viewpoint, style, … → CNN → image
Goal
• Generate an object based on high-level inputs such as
§ Class
§ Orientation with respect to the camera
§ Additional parameters
  • Rotation, translation, zoom
  • Stretching horizontally or vertically
  • Hue, saturation, brightness
• Knowledge transfer
§ Generative CNN learns the manifold of chairs.
§ Interpolation between viewpoints and different objects
[Dosovitskiy15] A. Dosovitskiy, J. T. Springenberg, T. Brox: Learning to Generate Chairs with Convolutional Neural Networks. CVPR 2015
Data
• Using 3D chair model dataset [Aubry14]
§ Original dataset: 1393 chair models, 62 viewpoints (31 azimuth angles × 2 elevation angles)
§ Sanitized version: 809 models, tight cropping, resizing to 128x128
• Notations
§ $D = \{ (c^1, v^1, \theta^1), (c^2, v^2, \theta^2), \ldots, (c^N, v^N, \theta^N) \}$
  • $c$: class label
  • $v$: viewpoint
  • $\theta$: additional parameters
§ $O = \{ (x^1, s^1), (x^2, s^2), \ldots, (x^N, s^N) \}$
  • $x$: target RGB output image
  • $s$: segmentation mask
[Aubry14] M. Aubry, D. Maturana, A. Efros, J. Sivic: Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment Using a Large Dataset of CAD Models. CVPR 2014
Network Architecture
[Figure: the generative network $g = u \circ h$, composed of the sub-networks $h$ and $u$]
32M parameters altogether
Operations
• Unpooling: 2x2
• Deconvolution: 5x5
• ReLU
[Figure: fixed-location unpooling]
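Fixed-location 2x2 unpooling can be sketched as below, assuming the fixed location is the top-left corner of each 2x2 block, with zeros elsewhere. The function name is illustrative.

```python
import numpy as np

def unpool_2x2(x):
    """Upsample an (H, W) map to (2H, 2W), placing each value top-left in its block."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=x.dtype)
    out[::2, ::2] = x
    return out
```

Because the position is fixed, no pooling switches need to be stored, which suits a generator that has no matching pooling layers.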
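Training
• Objective function
§ Minimizing the Euclidean error in 2D of
  • Reconstruction of the segmented-out chair image
  • Segmentation mask, $s$
$$\min_W \sum_{i=1}^{N} \lambda \left\| u_{RGB}(h(c^i, v^i, \theta^i)) - T_{\theta^i}(x^i \cdot s^i) \right\|_2^2 + \left\| u_{segm}(h(c^i, v^i, \theta^i)) - T_{\theta^i} s^i \right\|_2^2$$
§ The first term is the RGB stream and the second term is the segmentation stream.
[Figure: visualization of uconv-3 layer filters in the 128x128 network]
[Saxe14] A. M. Saxe, J. L. McClelland, S. Ganguli: Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. ICLR 2014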
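The per-sample objective can be sketched in NumPy. For brevity the sketch assumes the targets `x` and `s` have already been transformed by T, and the default value of `lam` is a placeholder rather than the value used in [Dosovitskiy15].

```python
import numpy as np

def chair_loss(rgb_pred, seg_pred, x, s, lam=1.0):
    """lam * ||rgb_pred - x * s||^2 + ||seg_pred - s||^2 for one sample.

    rgb_pred, x: (H, W, 3) images; seg_pred, s: (H, W) masks.
    """
    target_rgb = x * s[..., None]          # segmented-out chair image
    rgb_term = np.sum((rgb_pred - target_rgb) ** 2)
    seg_term = np.sum((seg_pred - s) ** 2)
    return lam * rgb_term + seg_term
```

Multiplying the target image by the mask means the RGB stream is only asked to reproduce the chair, not the background.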
Network Capacity
[Figure: generated chairs under translation, rotation, zoom, stretch, saturation, brightness, and color changes]
Morphing Different Chairs
[Figure label: viewpoints in training set]
Autonomous Driving
• Two previous approaches
§ Mediated perception: parsing the entire scene to make a driving decision (e.g., Mobileye, Google)
§ Behavior reflex: directly mapping an input image to a driving decision by a regressor (ALVINN, LeCun et al.)
[Figure: three paths from input image to driving control: mediated perception, behavior reflex, and direct perception (ours)]
[Chen15] C. Chen, A. Seff, A. Kornhauser, J. Xiao: DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. ICCV 2015
DeepDriving
• Direct perception
§ Estimating the affordance for driving
§ Simple input to the model using a few key perception indicators
§ Compact yet complete descriptions of the scene for vehicle control
• Approach
§ Built upon a deep convolutional neural network
§ Trained and tested on TORCS (The Open Racing Car Simulator)
§ Learning for estimating affordance related to autonomous driving
§ Simpler than the mediated perception approach
§ More interpretable than the typical behavior reflex approach
Platform
• System architecture
[Figure: TORCS, the CNN, and the driving controller communicate through shared memory. TORCS provides the image and speed, the CNN outputs indicators such as angle, toMarking, and dist, and the driving controller outputs the driving controls.]
• Environment
§ Focusing on highway driving with multiple lanes
§ Three configurations: a road of one lane, two lanes, or three lanes
[Figure: (a) one-lane, (b) two-lane, left, (c) two-lane, right, (d) three-lane, (e) inner lane marking, (f) boundary lane marking]
Convolutional Neural Network
• Prediction of affordance indicators
[Figure: (a) angle; (b) in lane: toMarking_LL, toMarking_ML, toMarking_MR, toMarking_RR; (c) in lane: dist_LL, dist_MM, dist_RR; (d) on marking: toMarking_L, toMarking_M, toMarking_R; (e) on marking: dist_L, dist_R; (f) overlapping area between the activation ranges of the in-lane system and the on-marking system]
[Figure: CNN outputs: angle (always), in-lane system (toMarking_LL, dist_LL, …), on-marking system (toMarking_L, dist_L, …)]
Visualization of Learned Models
[Figure: response maps of the KITTI-based ConvNet model and the TORCS-based ConvNet model]
DeepDriving Demo
Analogy
PARIS : FRANCE :: BEIJING : CHINA
[Figure: points labeled Paris, France, Beijing, and China]
Slide credit: Scott Reed
Visual Analogy Making
[Figure: visual analogies of the form A : B :: C : ? for changing color, changing shape, and changing size]
Visual Analogy Making
• Concept
§ Learns an encoder function $f: \mathbb{R}^D \to \mathbb{R}^K$ mapping images into a space where analogies can be performed
§ Learns a decoder $g: \mathbb{R}^K \to \mathbb{R}^D$ mapping back to the image space
$$d = \arg\max_{w \in V} \cos\left( f(w), f(b) - f(a) + f(c) \right)$$
[Figure: infer the relationship, then transform the query]
[Reed15] S. Reed, Y. Zhang, Y. Zhang, H. Lee: Deep Visual Analogy-Making. NIPS 2015
40
ℒD¨¨ = argmax≠,Æ,�,Ç ∈Ø
` − ï b k − b © + b ™R
R
ℒ∞±E = argmax≠,Æ,�,Ç ∈Ø
` − ï b ™ +≤×j b k − b © ×Rb ™ R
R
ℒ¨II≥ = argmax≠,Æ,�,Ç ∈Ø
` − ï b ™ + ℎ b k − b © ; b ™R
R
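The three objectives differ only in the increment added to f(c) before decoding. A minimal sketch of the three increments, operating on embedding vectors: the tensor product is assumed to contract the first two axes of W, and the one-hidden-layer ReLU network standing in for h is a hypothetical choice.

```python
import numpy as np

def t_add(fa, fb, fc):
    """Additive increment: f(b) - f(a), independent of the query."""
    return fb - fa

def t_mul(fa, fb, fc, W):
    """Multiplicative increment: W x_1 [f(b) - f(a)] x_2 f(c), with W of shape (K, K, K)."""
    return np.einsum("ijk,i,j->k", W, fb - fa, fc)

def t_deep(fa, fb, fc, W1, W2):
    """Deep increment: a small MLP applied to the concatenation [f(b) - f(a); f(c)]."""
    hidden = np.maximum(0.0, W1 @ np.concatenate([fb - fa, fc]))
    return W2 @ hidden
```

Unlike the additive form, the multiplicative and deep forms depend on the query embedding f(c), so the same relationship can produce different increments at different points of the manifold.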
Optimization
• Regularization
§ For accurate analogy completion by image manifold traversal
§ Making the transformation match the difference of encoder embeddings
$$R = \sum_{(a,b,c,d) \in \mathcal{A}} \left\| f(d) - f(c) - T(f(a), f(b), f(c)) \right\|_2^2$$
$$T(x, y, z) = \begin{cases} y - x & \text{for } \mathcal{L}_{add} \\ W \times_1 [y - x] \times_2 z & \text{for } \mathcal{L}_{mul} \\ \mathrm{MLP}([y - x; z]) & \text{for } \mathcal{L}_{deep} \end{cases}$$
• Training
§ With backpropagation using SGD
§ Combined loss: $\mathcal{L} + \alpha R$, $\alpha = 0.01$
Algorithm 1: Manifold traversal by analogy, with transformation function T (Eq. 5)
Given images a, b, c, and N (# steps)
z ← f(c)
for i = 1 to N do
    z ← z + T(f(a), f(b), z)
    x_i ← g(z)
return generated images x_i (i = 1, ..., N)
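Algorithm 1 translates almost line by line into Python. In the sketch below the encoder `f`, decoder `g`, and transformation `T` are passed in as functions; the identity encoder/decoder and the additive T used in the usage are stand-ins for trained networks.

```python
import numpy as np

def manifold_traversal(f, g, T, a, b, c, n_steps):
    """Repeatedly apply the a -> b transformation to c in embedding space."""
    fa, fb = f(a), f(b)
    z = f(c)
    outputs = []
    for _ in range(n_steps):
        z = z + T(fa, fb, z)     # one step along the manifold
        outputs.append(g(z))     # decode the current embedding
    return outputs

# Usage with stand-ins: identity encoder/decoder and the additive T.
identity = lambda v: v
additive = lambda fa, fb, z: fb - fa
steps = manifold_traversal(
    identity, identity, additive,
    np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([5.0, 5.0]), 3,
)
```

With the additive T every step adds the same offset; the multiplicative and deep variants recompute the increment from the current z at each step.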
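Shape Predictions: Additive Model
[Figure: rotate, scale, and shift analogies; columns show ref, out, query, and predictions at t = 1, 2, 3, 4]
Shape Predictions: Multiplicative Model
[Figure: rotate, scale, and shift analogies; columns show ref, out, query, and predictions at t = 1, 2, 3, 4]
Shape Predictions: Deep Model
[Figure: rotate, scale, and shift analogies; columns show ref, out, query, and predictions at t = 1, 2, 3, 4]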
Results for Analogy Models
• Transforming shapes
Model | Rotation steps 1 / 2 / 3 / 4 | Scaling steps 1 / 2 / 3 / 4 | Translation steps 1 / 2 / 3 / 4
L_add | 8.39 / 11.0 / 15.1 / 21.5 | 5.57 / 6.09 / 7.22 / 14.6 | 5.44 / 5.66 / 6.25 / 7.45
L_mul | 8.04 / 11.2 / 13.5 / 14.2 | 4.36 / 4.70 / 5.78 / 14.8 | 4.24 / 4.45 / 5.24 / 6.90
L_deep | 1.98 / 2.19 / 2.45 / 2.87 | 3.97 / 3.94 / 4.37 / 11.9 | 3.84 / 3.81 / 3.96 / 4.61
[Figure: rows for rotation (+rot), scaling (+scl), and translation (+trans); columns show ref, ground truth (gt), query, and four repeated applications of the transformation]
Learning Disentangled Features
• Objective function
$$\mathcal{L}_{dis} = \sum_{(a,b,c) \in \mathcal{B}} \left\| c - g\left( s \cdot f(a) + (1 - s) \cdot f(b) \right) \right\|_2^2$$
Algorithm 2: Disentangling training update. The switches s determine which units from f(a) and f(b) are used to reconstruct image c.
Given input images a, b and target c
Given switches s ∈ {0, 1}^K
z ← s · f(a) + (1 − s) · f(b)
Δθ ∝ ∂/∂θ (||g(z) − c||²₂)
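One disentangling update can be sketched with a linear decoder g(z) = Dz standing in for the deconvolutional network. Only the decoder is updated here, which is a simplification: in the algorithm above θ covers all parameters. The names and the learning rate are hypothetical.

```python
import numpy as np

def mix_units(fa, fb, s):
    """z = s * f(a) + (1 - s) * f(b): switches pick units from a or b."""
    return s * fa + (1 - s) * fb

def recon_loss(D, z, c):
    """||g(z) - c||^2 for the linear stand-in decoder g(z) = D z."""
    return float(np.sum((D @ z - c) ** 2))

def decoder_step(D, z, c, lr=0.01):
    """One gradient-descent step on D; d/dD ||Dz - c||^2 = 2 (Dz - c) z^T."""
    grad = 2.0 * np.outer(D @ z - c, z)
    return D - lr * grad

fa, fb = np.array([1.0, 2.0]), np.array([3.0, 4.0])
s = np.array([1.0, 0.0])            # unit 0 from a, unit 1 from b
z = mix_units(fa, fb, s)
c = np.zeros(2)
D0 = np.eye(2)
D1 = decoder_step(D0, z, c)
```

In the pose and identity setting, s would select the identity units from one image and the pose units from the other.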
Disentangling and Analogy Making
[Figure: images a, b, and c are each encoded into identity and pose units; the increment function T is used to obtain the identity and pose units of d. Caption: disentangling identity]
Classification and Analogy Making
[Figure: same layout with an added attribute classifier. Caption: separate classification for identity]
Results for Disentangled Features
• Transferring animation
§ Disentangling pose from identity
§ Pose transformations are modeled by deep additive interactions
Model | spellcast | thrust | walk | slash | shoot | average
L_add | 41.0 | 53.8 | 55.7 | 52.1 | 77.6 | 56.0
L_dis | 40.8 | 55.8 | 52.6 | 53.5 | 79.8 | 56.5
L_dis+cls | 13.3 | 24.6 | 17.2 | 18.9 | 40.8 | 23.0
Results for Extrapolation
[Figure: extrapolation for walk, thrust, and rotate; columns show ref, output, query, and predictions at steps -4 to -1 and +1 to +4]
Summary
• Proposing novel deep architectures that can perform visual analogy making by simple operations in an embedding space
§ Convolutional encoder-decoder networks
§ Modeling transformations by vector addition in the embedding space works for simple problems, but multi-layer neural networks are better.
• Combining analogy and disentangling training methods
§ Analogy representations can overcome limitations of disentangled representations by learning the transformation manifold.