Optimizing Average Precision using Weakly Supervised Data
Aseem Behl, IIIT Hyderabad
Under supervision of:
Dr. M. Pawan Kumar (INRIA Paris), Prof. C.V. Jawahar (IIIT Hyderabad)
Input x
Output y = “Using Computer”
Latent Variable h
Jumping
Phoning
Playing Instrument
Reading
Riding Bike
Riding Horse
Running
Taking Photo
Using Computer
Walking
Train Input xi Output yi
Action Classification
Aim - To estimate accurate model parameters by optimizing average precision with weakly supervised data
• Preliminaries
• Previous Work
• Our Framework
• Results
• Conclusion
Outline
Binary Classification
• Several problems in computer vision can be formulated as binary classification tasks.
• Running example: Action Classification, i.e., automatically deciding whether an image contains a person performing an action of interest (such as ‘jumping’ or ‘walking’).
• A binary classifier widely employed in computer vision is the support vector machine (SVM).
Conventional SVMs
• Input examples xi (vector)
• Output labels yi (either +1 or -1)
• SVM learns a hyperplane w
• Predictions are sign(wTΦ(xi))
• Training involves solving the following:
minw ½||w||² + CΣiξi
s.t. ∀i : yi(wTΦ(xi)) ≥ 1 − ξi
• The sum of slacks Σiξi upper bounds the 0/1 loss.
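The constrained QP above is equivalent to the unconstrained hinge-loss objective ½||w||² + CΣi max(0, 1 − yi·wTΦ(xi)). As a minimal sketch (not the QP solver used in practice), the toy below trains it with plain subgradient descent on hypothetical 2D data:

```python
def train_linear_svm(data, C=1.0, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * <w, x_i>),
    the unconstrained form of the SVM objective."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for epoch in range(1, epochs + 1):
        grad = list(w)  # gradient of the regularizer 0.5*||w||^2
        for x, y in data:
            score = sum(wd * xd for wd, xd in zip(w, x))
            if y * score < 1:  # margin constraint violated: hinge is active
                for d in range(dim):
                    grad[d] -= C * y * x[d]
        for d in range(dim):
            w[d] -= grad[d] / epoch  # decaying step size 1/epoch
    return w

def predict(w, x):
    return 1 if sum(wd * xd for wd, xd in zip(w, x)) >= 0 else -1

# Hypothetical toy data; the constant third feature plays the role of a bias term.
data = [([2.0, 1.0, 1.0], 1), ([1.5, 2.0, 1.0], 1),
        ([-1.0, -1.5, 1.0], -1), ([-2.0, -0.5, 1.0], -1)]
w = train_linear_svm(data, C=10.0)
```

On this separable toy the learned hyperplane classifies all training points correctly.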
Structural SVM (SSVM)
• Generalization of the SVM to structured output spaces.
Learning:
minw ½||w||² + CΣiξi
s.t. ∀ŷi : wTΨ(xi,yi) − wTΨ(xi,ŷi) ≥ Δ(yi,ŷi) − ξi
• Joint score for the correct label must be at least as large as for an incorrect label plus the loss.
• The number of constraints is |dom(y)|, which can be intractably large.
• At least one constraint is tight at the solution: the “most violated constraint”.
Prediction:
ypred = argmaxy wTΨ(x,y)
• Maximize the score over all possible outputs.
Structural SVM Learning
Original SVM Problem
• Exponential constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.
Slide taken from Yue et al. (2007)
Structural SVM Learning
1: Solve the SVM objective function using only the current working set of constraints.
2: Using the model learned in step 1, find the most violated constraint from the exponential set of constraints.
3: If the constraint returned in step 2 is more violated than the most violated constraint in the working set by some small constant, add it to the working set.
Repeat steps 1–3 until no additional constraints are added, then return the most recent model trained in step 1.
Steps 1-3 are guaranteed to loop for at most a polynomial number of iterations.
[Tsochantaridis et al. 2005]
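Steps 1–3 can be sketched in code. This is a toy instance (a 3-class multiclass SVM standing in for a general structured model), and the inner QP of step 1 is approximated with subgradient descent; the working-set logic, not the solver, is the point of the sketch:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def joint_feature(x, y, k):
    """Psi(x, y): x placed in the block of the weight vector for class y."""
    f = [0.0] * (len(x) * k)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def violation(w, x, y, yh, k):
    """Margin violation of the constraint for label yh (before subtracting slack)."""
    loss = 0.0 if yh == y else 1.0
    return loss + dot(w, joint_feature(x, yh, k)) - dot(w, joint_feature(x, y, k))

def most_violated_label(w, x, y, k):
    """Step 2: loss-augmented inference over the (here, small) label set."""
    return max(range(k), key=lambda yh: violation(w, x, y, yh, k))

def solve_working_set(data, work, C, k, steps=300):
    """Step 1: approximately solve the SSVM objective restricted to the
    working set (subgradient descent stands in for the usual QP solver)."""
    dim = len(data[0][0]) * k
    w = [0.0] * dim
    for t in range(1, steps + 1):
        grad = list(w)  # gradient of the 0.5*||w||^2 term
        for i, (x, y) in enumerate(data):
            cons = [yh for j, yh in work if j == i]
            if not cons:
                continue
            yh = max(cons, key=lambda c: violation(w, x, y, c, k))
            if violation(w, x, y, yh, k) > 0:  # hinge active for example i
                fy, fh = joint_feature(x, y, k), joint_feature(x, yh, k)
                for d in range(dim):
                    grad[d] += C * (fh[d] - fy[d])
        for d in range(dim):
            w[d] -= grad[d] / t  # decaying step size
    return w

def cutting_plane_train(data, C=10.0, k=3, eps=1e-2):
    work, w = [], [0.0] * (len(data[0][0]) * k)
    while True:
        added = False
        for i, (x, y) in enumerate(data):
            yh = most_violated_label(w, x, y, k)
            # current slack: worst violation among constraints already in the set
            cur = max([violation(w, x, y, c, k) for j, c in work if j == i],
                      default=0.0)
            if violation(w, x, y, yh, k) > cur + eps:  # step 3: add constraint
                work.append((i, yh))
                added = True
        if not added:
            return w
        w = solve_working_set(data, work, C, k)

def predict(w, x, k=3):
    return max(range(k), key=lambda y: dot(w, joint_feature(x, y, k)))

# Hypothetical separable 3-class toy data.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2)]
w = cutting_plane_train(data)
```

The outer loop terminates because each pass either adds a previously unseen constraint (a finite set) or stops.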
Weak Supervision• Supervised learning involves the onerous task of collecting detailed annotations for each training sample.
• Financially infeasible as the size of the datasets grow.
• Weak supervision – the additional annotations hi are unknown.
• This makes for a more complex machine learning problem.
Weak Supervision – Challenges
• Find the best additional annotation for the positive examples.
• Identify the bounding box of the ‘jumping’ person in positive images.
• Consider all possible values of the annotations for negative samples as negative examples.
• Ensure that the scores of ‘jumping’ person bounding boxes are higher than the scores of all possible bounding boxes in the negative images.
Latent SVM (LSVM)
• Extends SSVM to incorporate hidden information.
• This information is considered as part of the label.
• Not observed during training.
minw ½||w||² + CΣiξi
s.t. ∀ŷi, ĥi : maxhi wTΨ(xi,yi,hi) − wTΨ(xi,ŷi,ĥi) ≥ Δ(yi,ŷi) − ξi
• Non-convex objective: a difference of convex functions.
• The CCCP algorithm converges to a local minimum.
Concave-Convex Procedure (CCCP)
1. Repeat until convergence:
2. Iteratively approximate the concave portion of the objective
• Impute the hidden variables
3. Update the parameters using the imputed values of the hidden variables
• Solve the resulting convex SSVM problem
Steps 2–3 are guaranteed to converge to a local minimum in a polynomial number of iterations.
[Yuille and Rangarajan, 2003]
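A minimal sketch of CCCP for a latent SVM, on a hypothetical multiple-instance toy: each positive bag contains one informative candidate (the latent choice), while negative bags contain only distractors. Imputation fixes h for the positives; the convex update is approximated here with subgradient descent rather than a full SSVM solver:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_instance(w, bag):
    """Impute the latent variable: the highest-scoring candidate in the bag."""
    return max(bag, key=lambda x: dot(w, x))

def cccp_train(pos_bags, neg_bags, C=10.0, outer=5, inner=300):
    dim = len(pos_bags[0][0])
    w = [0.0] * dim
    for _ in range(outer):
        # Step 2: linearize the concave part by imputing h for the positives.
        imputed = [best_instance(w, bag) for bag in pos_bags]
        # Step 3: solve the resulting convex problem (subgradient descent
        # stands in for a convex SSVM solver here).
        for t in range(1, inner + 1):
            grad = list(w)  # gradient of 0.5*||w||^2
            for x in imputed:                 # positives, latent h held fixed
                if dot(w, x) < 1:
                    for d in range(dim):
                        grad[d] -= C * x[d]
            for bag in neg_bags:              # negatives: the constraint must
                x = best_instance(w, bag)     # hold for every candidate h
                if dot(w, x) > -1:
                    for d in range(dim):
                        grad[d] += C * x[d]
            for d in range(dim):
                w[d] -= grad[d] / t           # decaying step size
    return w

def bag_score(w, bag):
    return dot(w, best_instance(w, bag))

# Hypothetical bags: positives contain one 'signal' candidate, negatives don't.
pos_bags = [[(1.0, 1.0), (0.1, -0.3)], [(0.9, 1.1), (-0.2, 0.1)]]
neg_bags = [[(0.1, -0.2), (-0.3, 0.1)], [(-0.1, -0.1), (0.2, -0.4)]]
w = cccp_train(pos_bags, neg_bags)
```

After training, positive bags score above zero and every candidate in the negative bags scores below zero, matching the weak-supervision requirement described earlier.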
Average Precision (AP)
Example: two rankings with identical 0/1 loss (0.40) but different AP-loss (0.24 vs. 0.36).
• AP is the most commonly used accuracy measure for binary classification.
• AP is the average of the precision scores at the rank locations of each positive sample.
• AP-loss depends on the ranking of the samples.
• 0/1 loss depends only on the number of incorrectly classified samples.
• A machine learning algorithm optimizing 0/1 loss might learn a very different model than one optimizing AP.
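Both measures can be computed directly from a ranking. A small sketch, with a hypothetical pair of rankings that have the same 0/1 loss but different AP (the numbers here are illustrative, not the slide's):

```python
def average_precision(ranked_labels):
    """AP of a ranking: the mean of the precision values measured at the
    rank of every positive sample (labels: 1 = positive, 0 = negative)."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

def zero_one_loss(ranked_labels, k):
    """0/1 loss when the top-k ranked samples are predicted positive."""
    errors = sum(1 for l in ranked_labels[:k] if l == 0)
    errors += sum(1 for l in ranked_labels[k:] if l == 1)
    return errors / len(ranked_labels)

# Two rankings with the same 0/1 loss but different AP-loss (1 - AP):
a = [1, 0, 1, 0]   # AP = (1/1 + 2/3) / 2 = 0.833..., AP-loss = 0.166...
b = [0, 1, 0, 1]   # AP = (1/2 + 2/4) / 2 = 0.5,      AP-loss = 0.5
```

Both rankings misplace two of four samples at the classification threshold (0/1 loss 0.5), yet their AP-losses differ, which is exactly why the loss being optimized matters.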
Notation
X: input samples {xi, i = 1,…,n}
Y: ranking matrix, s.t. Yij = +1 if xi is ranked higher than xj, 0 if xi and xj are ranked the same, and −1 if xi is ranked lower than xj
HP: additional annotations for the positives {hi, i ∈ P}
HN: additional annotations for the negatives {hj, j ∈ N}
AP(Y, Y∗): AP of ranking Y with respect to the true ranking Y∗
∆(Y, Y∗): AP-loss = 1 − AP(Y, Y∗)
Joint feature vector: Ψ(X,Y,{HP,HN}) = (1/(|P|·|N|)) Σi Σj Yij (Φi(hi) − Φj(hj))
AP-SVM
• AP-SVM optimizes the correct AP-loss function as opposed to the 0/1 loss.
Learning:
minw ½||w||² + Cξ
s.t. ∀Y : wTΨ(X,Y*,H) − wTΨ(X,Y,H) ≥ Δ(Y,Y*) − ξ
• Constraints are defined for each incorrect ranking Y.
• The joint discriminant score for the correct ranking must be at least as large as for an incorrect ranking plus the loss.
Prediction:
Yopt = argmaxY wTΨ(X,Y,H)
• After learning w, a prediction is made by sorting the samples (xk,hk) in descending order of wTΦk(hk).
AP-SVM – Exponential Constraints
• For Average Precision, the true labeling is a ranking where the positive examples are all ranked in the front, e.g.,
• An incorrect labeling would be any other ranking, e.g.,
• Exponential number of incorrect rankings.
• Thus an exponential number of constraints.
Finding Most Violated Constraint
• Structural SVM requires a subroutine to find the most violated constraint.
• Subroutine is dependent on formulation of loss function and joint feature representation.
• Yue et al. devised an efficient algorithm for this subroutine in the case of optimizing AP.
Finding Most Violated Constraint
• AP is invariant to the order of examples within the positive and the negative set.
• The joint SVM score is optimized by sorting examples in descending order of their individual scores.
• The problem reduces to finding an interleaving between two sorted lists of examples.
Finding Most Violated Constraint
• Start with the perfect ranking.
• Consider swapping adjacent positive/negative examples.
• Find the best feasible ranking of the negative example.
• Repeat for the next negative example.
• Never swap past previous negative examples.
• Repeat until all negative examples have been considered.
Slide taken from Yue et al. (2007)
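For intuition, the loss-augmented objective Δ(Y,Y*) + wTΨ(X,Y,·) can be maximized by brute force over all interleavings of the two sorted lists on a tiny example; the greedy procedure above computes the same maximizer without enumeration. The scores below are hypothetical:

```python
from itertools import combinations

def most_violated_ranking(pos_scores, neg_scores):
    """Brute-force loss-augmented inference for AP: maximize
    Delta(Y, Y*) + w^T Psi(X, Y) over all interleavings of the two
    score-sorted lists (exponential, so only viable for tiny examples)."""
    P, N = len(pos_scores), len(neg_scores)
    pos = sorted(range(P), key=lambda i: -pos_scores[i])
    neg = sorted(range(N), key=lambda j: -neg_scores[j])
    best, best_val = None, float("-inf")
    for slots in combinations(range(P + N), P):  # ranks taken by positives
        it_p, it_n = iter(pos), iter(neg)
        order = [('p', next(it_p)) if r in slots else ('n', next(it_n))
                 for r in range(P + N)]
        # Delta(Y, Y*) = 1 - AP of this ranking
        hits, precs = 0, []
        for rank, (kind, _) in enumerate(order, start=1):
            if kind == 'p':
                hits += 1
                precs.append(hits / rank)
        delta = 1.0 - sum(precs) / P
        # w^T Psi = (1/(P*N)) * sum_ij Y_ij * (s_i - s_j)
        rank_of = {item: r for r, item in enumerate(order)}
        score = sum((1 if rank_of[('p', i)] < rank_of[('n', j)] else -1)
                    * (pos_scores[i] - neg_scores[j])
                    for i in range(P) for j in range(N)) / (P * N)
        if delta + score > best_val:
            best_val, best = delta + score, order
    return best, best_val

# Hypothetical scores: one negative outscores a positive.
order, val = most_violated_ranking([2.0, 0.5], [1.0, -1.0])
```

Here the maximizer interleaves the strong negative above the weak positive, so its objective value exceeds that of the perfect ranking, i.e. the constraint is violated.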
Hypothesis
Optimizing correct loss function is important for weakly supervised learning.
Latent Structural SVM (LSSVM)
• Introduces a margin between the maximum score for the ground-truth output and all other pairs of outputs and additional annotations.
• Compares scores between two different sets of annotations.
Learning:
minw ½||w||² + Cξ
s.t. ∀Y, H : maxĤ{wTΨ(X,Y*,Ĥ)} − wTΨ(X,Y,H) ≥ Δ(Y,Y*) − ξ
Latent Structural SVM (LSSVM)
Prediction:
(Yopt,Hopt) = argmaxY,H wTΨ(X,Y,H)
Latent Structural SVM (LSSVM)
Disadvantages:
•Prediction: LSSVM uses an unintuitive prediction rule.
•Learning: LSSVM optimizes a loose upper-bound on the AP-loss.
•Optimization: Exact loss-augmented inference is computationally inefficient.
Latent AP-SVM - Prediction
Step 1: Find the best hi for each sample: Hopt = argmaxH wTΨ(X,Y,H)
Step 2: Sort the samples according to their best scores: Yopt = argmaxY wTΨ(X,Y,Hopt)
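The two-step prediction rule can be sketched directly; the linear score and the candidate sets below are hypothetical:

```python
def latent_ap_svm_predict(samples, score):
    """Latent AP-SVM prediction: (1) independently pick the best latent value
    for each sample, (2) rank the samples by the resulting best scores."""
    best = [max(candidates, key=score) for candidates in samples]
    best_scores = [score(h) for h in best]
    order = sorted(range(len(samples)), key=lambda i: -best_scores[i])
    return order, best_scores

# Hypothetical setup: candidate windows per sample, linear score w^T h.
w = [2.0, 1.0]
score = lambda h: w[0] * h[0] + w[1] * h[1]
samples = [[(0.5, 0.5), (1.0, 1.0)],    # best score 3.0
           [(0.25, 0.25), (0.5, 0.0)],  # best score 1.0
           [(2.0, 0.0)]]                # best score 4.0
order, best_scores = latent_ap_svm_predict(samples, score)
```

Unlike the joint LSSVM rule, the latent choice here is made per sample before ranking, which matches the intuitive "detect, then rank" pipeline.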
Latent AP-SVM - Learning
• Finds the best assignment of values HP such that the score for the correct ranking is higher than the score for an incorrect ranking, regardless of the choice of HN.
• Compares scores between same sets of additional annotation.
minw ½||w||² + Cξ
s.t. ∀Y, HN : maxHP{wTΨ(X,Y*,{HP,HN}) − wTΨ(X,Y,{HP,HN})} ≥ Δ(Y,Y*) − ξ
Latent AP-SVM - Learning
• Constraints of latent AP-SVM are a subset of LSSVM constraints.
• The optimal solution of latent AP-SVM has a lower objective value than the LSSVM solution.
• Latent AP-SVM provides a valid upper-bound on the AP-loss.
Latent AP-SVM provides a tighter upper-bound on the AP Loss
Latent AP-SVM - Optimization
1. Initialize the parameters w0.
2. Repeat until convergence:
3. Impute the additional annotations for the positives.
• Independently choose each additional annotation in HP. Complexity: O(nP·|H|)
4. Update the parameters using the cutting-plane algorithm.
• Maximize over HN and Y independently. Complexity: O(nP·nN)
Action Classification
Input x
Output y = “Using Computer”
Latent Variable h
PASCAL VOC 2011 action classification: 4846 images, 10 action classes; 2424 trainval & 2422 test images
Features- 2400 activation scores of action-specific poselets & 4 object activation scores
Action Classification
• 5-fold cross validation on the ‘trainval’ dataset.
• Statistically significant increase in performance:
• 6/10 classes over LSVM
• 7/10 classes over LSSVM
• Overall improvement:
• 5% over LSVM
• 4% over LSSVM
Action Classification
• X-axis corresponds to the amount of supervision provided.
• Y-axis corresponds to the mean average precision.
• As the amount of supervision decreases, the gap in performance between latent AP-SVM and the baseline methods increases.
Action Classification
• Performance on the test set of PASCAL VOC 2011.
• Increase in performance:
• all classes over LSVM
• 8/10 classes over LSSVM
• Overall improvement:
• 5.1% over LSVM
• 3.7% over LSSVM
Object Detection
• PASCAL VOC 2007 object detection dataset.
• 9963 images over 20 object categories.
• 5011 trainval & 4952 test images.
• Features – 4096-dimensional activation vector of the penultimate layer of a trained Convolutional Neural Network (CNN).
Input x, Output y = “Aeroplane”, Latent Variable h
(An average of 2000 candidate windows per image, generated using the selective-search algorithm.)
Object Detection
• 5-fold cross validation on the ‘trainval’ dataset.
• Statistically significant increase in performance for 15/20 classes over LSVM.
• Superior performance partially attributed to the better localization of objects by LAP-SVM during training.
Object Detection
• Performance on the test set of PASCAL VOC 2007.
• Increase in performance for 19/20 classes over LSVM.
• Overall improvement of 7% over LSVM.
• We also get improved results on the IIIT 5K-WORD dataset.
Conclusion
• Proposed novel formulation that obtains accurate ranking by minimizing a carefully designed upper bound on the AP loss.
• Showed the theoretical benefits of our method.
• Demonstrated advantage of our approach on challenging machine learning problems.
Thank you