Support Vector Machines and the Dalal-Triggs Detector (rtc12/cse586/lectures/svmanddalaltriggs_6pp.pdf)
TRANSCRIPT
Overview • Recall last class: Boosting is a way of generating a strong classifier as a weighted ensemble of weak ones
• Today: Support Vector Machine (SVM) training generates a strong classifier directly
• Case Study: Dalal and Triggs pedestrian detector
Support Vector Machines
SVM slides from Kristen Grauman, UT-Austin
Other good resources:
• Presentation slides by Christoph Lampert: https://sites.google.com/site/christophlampert/teaching/kernel-methods-for-object-recognition
• Simple tutorial document by Chris Williams: http://www.inf.ed.ac.uk/teaching/courses/iaml/docs/svm.pdf
• Video lecture by Pat Winston: https://www.youtube.com/watch?v=_PwhiWxHK8o
Linear classifiers
Find linear function to separate positive and negative examples
Lines in R²
Let
$$\mathbf{w} = \begin{bmatrix} a \\ c \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}$$
Then the line $ax + cy + b = 0$ can be written $\mathbf{w} \cdot \mathbf{x} + b = 0$.
Linear classifiers
• Find linear function to separate positive and negative examples:
positive: $\mathbf{x}_i \cdot \mathbf{w} + b \ge 0$
negative: $\mathbf{x}_i \cdot \mathbf{w} + b < 0$
Which line is best?
Support Vector Machines (SVMs)
• Discriminative classifier based on optimal separating line (for 2d case)
• Maximize the margin between the positive and negative training examples
Support vector machines
• Want the line that maximizes the margin.
positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \ge 1$
negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \le -1$
For support vectors, $\mathbf{x}_i \cdot \mathbf{w} + b = \pm 1$
[Figure: separating line $\mathbf{w}\cdot\mathbf{x}+b = 0$ with margin boundaries $\mathbf{w}\cdot\mathbf{x}+b = \pm 1$; the points on the boundaries are the support vectors]
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Lines in R²
Let $\mathbf{w} = \begin{bmatrix} a \\ c \end{bmatrix}$ and $\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}$, so $ax + cy + b = 0$ becomes $\mathbf{w} \cdot \mathbf{x} + b = 0$.
Distance from a point $(x_0, y_0)$ to the line:
$$D = \frac{ax_0 + cy_0 + b}{\sqrt{a^2 + c^2}} = \frac{\mathbf{w}^\top \mathbf{x} + b}{\|\mathbf{w}\|}$$
Support vector machines
• Want the line that maximizes the margin.
positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \ge 1$
negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \le -1$
For support vectors, $\mathbf{x}_i \cdot \mathbf{w} + b = \pm 1$
Distance between point and line: $\dfrac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\|\mathbf{w}\|}$
For support vectors: $\dfrac{\mathbf{w}^\top \mathbf{x} + b}{\|\mathbf{w}\|} = \dfrac{\pm 1}{\|\mathbf{w}\|}$, so the margin is
$$M = \frac{1}{\|\mathbf{w}\|} - \left(-\frac{1}{\|\mathbf{w}\|}\right) = \frac{2}{\|\mathbf{w}\|}$$
Therefore, the margin is $2 / \|\mathbf{w}\|$.
Finding the maximum margin line
1. Maximize margin $2 / \|\mathbf{w}\|$
2. Correctly classify all training data points:
positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \ge 1$
negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \le -1$
Quadratic optimization problem:
Minimize $\frac{1}{2}\mathbf{w}^\top \mathbf{w}$
Subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$
One constraint for each training point. Note the sign trick: multiplying by the label $y_i$ folds the positive and negative inequalities into a single constraint.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
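This primal problem is small enough to hand to a generic constrained optimizer. Below is a minimal sketch (not the solver Burges describes) using scipy's SLSQP on made-up toy data; the variable layout p = [w1, w2, b] is just a convenience for this example:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data (hypothetical, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                      # p = [w1, w2, b]
    return 0.5 * p[:2] @ p[:2]         # (1/2) w^T w

# One inequality constraint per training point: y_i (w . x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```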
Finding the maximum margin line
• Solution: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
(each learned weight $\alpha_i$ is nonzero only for the support vectors $\mathbf{x}_i$)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Finding the maximum margin line
• Solution: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $b = y_i - \mathbf{w} \cdot \mathbf{x}_i$ (for any support vector)
• Classification function:
$$\mathbf{w} \cdot \mathbf{x} + b = \sum_i \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b$$
$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) = \operatorname{sign}\Big(\sum_i \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b\Big)$$
If $f(\mathbf{x}) < 0$, classify as negative; if $f(\mathbf{x}) > 0$, classify as positive.
• Notice that it relies on an inner product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$
• (Solving the optimization problem also involves computing the inner products $\mathbf{x}_i \cdot \mathbf{x}_j$ between all pairs of training points)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
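As a minimal sketch, assuming the weights α_i, labels y_i, support vectors x_i, and bias b have already been found by the optimizer, the classification function is just a weighted sum of inner products:

```python
import numpy as np

def svm_classify(x, support_vectors, alphas, labels, b):
    """f(x) = sign( sum_i alpha_i * y_i * (x_i . x) + b )"""
    score = sum(a * y * (sv @ x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return np.sign(score)   # +1: positive class, -1: negative class
```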
Questions
• What if the features are not 2d?
• What if the data is not linearly separable?
• What to do for more than two classes?
Questions
• What if the features are not 2d?
– Generalizes to d dimensions: replace the line with a "hyperplane"
• What if the data is not linearly separable?
• What to do for more than two classes?
Planes in R³
Let
$$\mathbf{w} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$$
Then the plane $ax + by + cz + d = 0$ can be written $\mathbf{w} \cdot \mathbf{x} + d = 0$.
Distance from a point $(x_0, y_0, z_0)$ to the plane:
$$D = \frac{ax_0 + by_0 + cz_0 + d}{\sqrt{a^2 + b^2 + c^2}} = \frac{\mathbf{w}^\top \mathbf{x} + d}{\|\mathbf{w}\|}$$
Hyperplanes in Rⁿ
Hyperplane $H$ is the set of all vectors $\mathbf{x} \in \mathbb{R}^n$ which satisfy:
$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = 0, \quad \text{i.e.} \quad \mathbf{w}^\top \mathbf{x} + b = 0$$
Distance from a point $\mathbf{x}$ to hyperplane $H$:
$$D(H, \mathbf{x}) = \frac{\mathbf{w}^\top \mathbf{x} + b}{\|\mathbf{w}\|}$$
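The distance formula is a one-liner; a quick worked check (plane and point chosen arbitrarily for illustration):

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

# Point (1, 2, 3) and plane x + 2y - 2z + 4 = 0:
# (1 + 4 - 6 + 4) / sqrt(1 + 4 + 4) = 3 / 3 = 1
print(distance_to_hyperplane(np.array([1., 2., -2.]), 4.0, np.array([1., 2., 3.])))
```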
Questions
• What if the features are not 2d?
• What if the data is not linearly separable?
Nonlinear SVMs
[Figure slides from Andrew Zisserman]
The Kernel Trick
• Recall we transformed linear regression into nonlinear regression using a feature vector Φ(x) and, ultimately, the "kernel trick."
• We use the same kernel trick here to transform a linear classifier into a nonlinear one.
Example Kernels
[Figure slides from Andrew Zisserman]
Nonlinear SVMs
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
$$K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$$
• This gives a nonlinear decision boundary in the original feature space:
$$\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$$
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
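A sketch of this kernelized decision rule. The Gaussian (RBF) kernel is used here as one common example (it is not singled out by the slide); the bandwidth gamma is a hypothetical parameter, and α_i, y_i, x_i, b are assumed to come from a trained SVM:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.5):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2): an inner product in a lifted space."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def kernel_classify(x, support_vectors, alphas, labels, b, K=rbf_kernel):
    """sign( sum_i alpha_i * y_i * K(x_i, x) + b )"""
    score = sum(a * y * K(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return np.sign(score)
```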
Questions
• What if the features are not 2d?
• What if the data is not linearly separable?
• What to do for more than two classes?
Multi-class SVMs
• Achieve a multi-class classifier by combining a number of binary classifiers
• One vs. all
– Training: learn an SVM for each class vs. the rest
– Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value (see the sketch after this list)
• One vs. one
– Training: learn an SVM for each pair of classes
– Testing: each learned SVM "votes" for a class to assign to the test example
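A minimal one-vs-all sketch with scikit-learn (one library choice among many; the three-class toy data is made up for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up 3-class toy data: 20 points per class around three different means.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2)) + np.repeat(np.eye(3, 2) * 4, 20, axis=0)
y = np.repeat([0, 1, 2], 20)

# Training: one binary SVM per class (that class vs. the rest).
svms = [LinearSVC().fit(X, (y == c).astype(int)) for c in range(3)]

# Testing: assign the class whose SVM returns the highest decision value.
def predict(x):
    scores = [svm.decision_function(x.reshape(1, -1))[0] for svm in svms]
    return int(np.argmax(scores))

print(predict(np.array([4.0, 0.0])))   # should land in class 0's region
```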
Software for SVMs
SVMs for recognition:
1. Define a vector representation for each example.
2. Select a kernel function.
3. Compute pairwise kernel values between labeled examples.
4. Give this "kernel matrix" to SVM optimization software to identify support vectors & weights.
5. To classify a new example: compute kernel values between the new input and the support vectors, apply weights, check the sign of the output (see the sketch below).
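A sketch of steps 3-5 using scikit-learn's precomputed-kernel interface, one possible choice of "SVM optimization software"; the random data and the plain linear kernel are placeholders for a real representation and kernel:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 10))                 # step 1: vector representation
y_train = (rng.normal(size=40) > 0).astype(int)
X_test = rng.normal(size=(5, 10))

K_train = X_train @ X_train.T                       # step 3: pairwise kernel matrix
svm = SVC(kernel="precomputed").fit(K_train, y_train)  # step 4: find SVs & weights

K_test = X_test @ X_train.T                         # step 5: kernels vs. training pts
print(svm.predict(K_test))                          # weighted sums, sign checked
```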
Case Study: Pedestrian Detector
Navneet Dalal and Bill Triggs, "Histograms of Oriented Gradients for Human Detection," CVPR 2005
Dalal and Triggs CVPR'05
• Detect upright pedestrians
• Histogram of oriented gradient feature vector
• Linear SVM classifier; sliding window detector
64×128 HoG descriptor
HoG Feature Extraction: Cells
64×128 detection window; compute gradients.
Each cell contains a histogram of gradient orientations, weighted by gradient magnitude.
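A rough sketch of the cell-histogram computation under simplifying assumptions (hard binning, no bilinear vote interpolation or Gaussian weighting, and np.gradient standing in for the paper's [-1, 0, 1] filter):

```python
import numpy as np

def cell_histograms(image, cell=8, nbins=9):
    """Per-cell histograms of gradient orientation, weighted by magnitude."""
    gy, gx = np.gradient(image.astype(float))      # simple gradient estimate
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180     # unsigned orientation, [0, 180)
    H, W = image.shape
    hists = np.zeros((H // cell, W // cell, nbins))
    for i in range(H // cell):
        for j in range(W // cell):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            bins = (a / (180 / nbins)).astype(int) % nbins
            np.add.at(hists[i, j], bins, m)        # magnitude-weighted votes
    return hists
```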
2/26/15
7
HoG Feature Extraction: Blocks
"Each scalar cell response contributes several components to the final descriptor vector, each normalized with respect to a different block. This may seem redundant but good normalization is critical and including overlap significantly improves the performance." (Dalal & Triggs, CVPR'05)
2×2 block of cells, normalized, then concatenated into the final descriptor vector
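Continuing the sketch above: a 2×2-cell block slides over the cell grid with a one-cell stride, each block is normalized independently, and all blocks are concatenated, so each cell histogram appears (up to) four times under different normalizations:

```python
import numpy as np

def block_descriptor(hists, eps=1e-5):
    """hists: (n_cells_y, n_cells_x, n_bins) array of cell histograms."""
    ny, nx, _ = hists.shape
    blocks = []
    for i in range(ny - 1):                            # one-cell stride
        for j in range(nx - 1):
            v = hists[i:i+2, j:j+2].ravel()            # 2x2 cells -> one block vector
            blocks.append(v / np.sqrt(v @ v + eps**2)) # per-block L2 normalization
    return np.concatenate(blocks)                      # final HoG descriptor
```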
HoG Design Choices
Parameters:
• Gradient scale
• Orientation bins
• Block overlap area
Other choices:
• RGB or Lab, color or grayscale
• Block normalization: L2-Hys, $v \leftarrow v / \sqrt{\|v\|_2^2 + \varepsilon^2}$ (followed by clipping and renormalizing), or L1-sqrt, $v \leftarrow \sqrt{v / (\|v\|_1 + \varepsilon)}$
[Figure: cell, center bin, and block layout for the R-HOG/SIFT and C-HOG descriptor variants]
Parameter / design choices were guided by extensive experimentation to determine empirical effects on detector performance (e.g., miss rate).
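The two normalization schemes as code, a sketch following the formulas above (the 0.2 clipping threshold for L2-Hys follows Lowe's SIFT convention, which the paper adopts):

```python
import numpy as np

def l2_hys(v, eps=1e-5, clip=0.2):
    """L2-Hys: L2-normalize, clip components at 0.2, then renormalize."""
    v = v / np.sqrt(v @ v + eps**2)
    v = np.minimum(v, clip)
    return v / np.sqrt(v @ v + eps**2)

def l1_sqrt(v, eps=1e-5):
    """L1-sqrt: v <- sqrt( v / (||v||_1 + eps) )."""
    return np.sqrt(v / (np.abs(v).sum() + eps))
```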
Dalal & Triggs Detector
• Default detector configuration:
– RGB colour space with no gamma correction;
– [−1, 0, 1] gradient filter with no smoothing;
– linear gradient voting into 9 orientation bins in 0°–180°;
– 16×16 pixel blocks of four 8×8 pixel cells;
– Gaussian spatial window with σ = 8 pixels;
– L2-Hys (Lowe-style clipped L2 norm) block normalization;
– block spacing stride of 8 pixels (hence 4-fold coverage of each cell);
– 64×128 detection window;
– linear SVM classifier.
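For comparison, scikit-image's hog() can produce a descriptor matching most of this configuration. This is not the authors' code; parameter names are scikit-image's, and its behavior differs from the paper in details such as the gradient filter and spatial weighting:

```python
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)            # one 64x128 detection window (rows x cols)
descriptor = hog(window,
                 orientations=9,            # 9 bins over 0-180 degrees
                 pixels_per_cell=(8, 8),    # 8x8 pixel cells
                 cells_per_block=(2, 2),    # 16x16 pixel blocks of four cells
                 block_norm="L2-Hys")       # clipped L2 block normalization
print(descriptor.shape)                     # (3780,) = 7 * 15 blocks * 36 values each
```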
Detector Architecture
Learning phase:
1. Create normalised training data set
2. Encode images into feature vectors
3. Learn binary classifier
Detection phase:
1. Scan image at all scales and locations
2. Run classifier to obtain object/non-object decisions
3. Fuse multiple detections in 3-D position & scale space
4. Output object detections with bounding boxes
Positive and negative examples
+ thousands more…
+ millions more…
Person detection with HoG & linear SVM
[Dalal and Triggs, CVPR 2005]
Soft (C=0.01) linear SVM trained with SVMLight.
To detect people at all locations and scales:
• Sliding window using learnt HOG template (see the sketch below)
• Post-processing using non-maxima suppression
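A bare-bones sliding-window sketch for a single image scale. Here `classify` stands for any window-scoring function (e.g., HoG features plus the learned linear SVM), and scanning a resized-image pyramid handles the "all scales" part:

```python
import numpy as np

def sliding_window_detect(image, classify, step=8, win=(128, 64)):
    """Scan a fixed-size window over one image scale; keep positive scores."""
    detections = []
    H, W = image.shape
    for r in range(0, H - win[0] + 1, step):
        for c in range(0, W - win[1] + 1, step):
            score = classify(image[r:r+win[0], c:c+win[1]])
            if score > 0:                      # SVM decision value above threshold
                detections.append((r, c, score))
    return detections
```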
Non-maximum Suppression across Scales
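The paper fuses detections by mean-shift mode finding in 3-D position/scale space; a simpler, now-common stand-in is greedy overlap-based NMS, sketched here on (r1, c1, r2, c2, score) boxes:

```python
def non_max_suppress(detections, overlap_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it too much (by intersection-over-union), repeat on the rest."""
    def iou(a, b):
        r1, c1 = max(a[0], b[0]), max(a[1], b[1])
        r2, c2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, r2 - r1) * max(0, c2 - c1)
        area = lambda d: (d[2] - d[0]) * (d[3] - d[1])
        return inter / (area(a) + area(b) - inter)
    kept = []
    for d in sorted(detections, key=lambda d: -d[4]):   # best score first
        if all(iou(d, k) <= overlap_thresh for k in kept):
            kept.append(d)
    return kept
```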
Dalal and Triggs Summary
• HoG feature representation
• Linear SVM classifier; sliding window detector
• Non-maximum suppression across scale
• Use of detector performance metrics to guide tuning of system parameters
• Detection rate 90% at 10⁻⁴ FP per window
• Slower than Viola-Jones detector