Optimizing Average Precision using Weakly Supervised Data
Aseem Behl, IIIT Hyderabad
Under supervision of:
Dr. M. Pawan Kumar (INRIA Paris), Prof. C.V. Jawahar (IIIT Hyderabad)
Input x
Output y = “Using Computer”
Latent Variable h
Jumping
Phoning
Playing Instrument
Reading
Riding Bike
Riding Horse
Running
Taking Photo
Using Computer
Walking
Train Input xi Output yi
Action Classification
Aim - To estimate accurate model parameters by optimizing average precision with weakly supervised data
• Preliminaries
• Previous Work
• Our Framework
• Results
• Conclusion
Outline
Binary Classification
• Several problems in computer vision can be formulated as binary classification tasks.
• Running example: Action Classification, i.e., automatically deciding whether an image contains a person performing an action of interest (such as ‘jumping’ or ‘walking’).
• A binary classifier widely employed in computer vision is the support vector machine (SVM).
Conventional SVMs
• Input examples xi (vector)
• Output labels yi (either +1 or -1)
• SVM learns a hyperplane w
• Predictions are sign(wTΦ(xi))
• Training involves solving the following:
minw ½||w||² + CΣiξi
s.t. ∀i : yi(wTΦ(xi)) ≥ 1 − ξi
• The sum of slacks Σiξi upper bounds the 0/1 loss.
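The constrained QP above is equivalent to the unconstrained hinge-loss objective ½||w||² + CΣi max(0, 1 − yi·wTΦ(xi)). As a minimal sketch (not the QP solver used in practice), the toy below trains it with plain subgradient descent on hypothetical 2D data:

```python
def train_linear_svm(data, C=1.0, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * <w, x_i>),
    the unconstrained form of the SVM objective."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for epoch in range(1, epochs + 1):
        grad = list(w)  # gradient of the regularizer 0.5*||w||^2
        for x, y in data:
            score = sum(wd * xd for wd, xd in zip(w, x))
            if y * score < 1:  # margin constraint violated: hinge is active
                for d in range(dim):
                    grad[d] -= C * y * x[d]
        for d in range(dim):
            w[d] -= grad[d] / epoch  # decaying step size 1/epoch
    return w

def predict(w, x):
    return 1 if sum(wd * xd for wd, xd in zip(w, x)) >= 0 else -1

# Hypothetical toy data; the constant third feature plays the role of a bias term.
data = [([2.0, 1.0, 1.0], 1), ([1.5, 2.0, 1.0], 1),
        ([-1.0, -1.5, 1.0], -1), ([-2.0, -0.5, 1.0], -1)]
w = train_linear_svm(data, C=10.0)
```

On this separable toy the learned hyperplane classifies all training points correctly.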
Structural SVM (SSVM)
• Generalization of the SVM to structured output spaces.
Learning:
minw ½||w||² + CΣiξi
s.t. ∀ŷi : wTΨ(xi,yi) − wTΨ(xi,ŷi) ≥ Δ(yi,ŷi) − ξi
• Joint score for the correct label must be at least as large as for an incorrect label plus the loss.
• The number of constraints is |dom(y)|, which can be intractably large.
• At least one constraint is tight at the solution: the “most violated constraint”.
Prediction:
ypred = argmaxy wTΨ(x,y)
• Maximize the score over all possible outputs.
Structural SVM Learning
Original SVM Problem
• Exponential constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.
Slide taken from Yue et al. (2007)
Structural SVM Learning
1: Solve the SVM objective function using only the current working set of constraints.
2: Using the model learned in step 1, find the most violated constraint from the exponential set of constraints.
3: If the constraint returned in step 2 is more violated than the most violated constraint in the working set by some small constant, add it to the working set.
Repeat steps 1–3 until no additional constraints are added, then return the most recent model trained in step 1.
Steps 1-3 are guaranteed to loop for at most a polynomial number of iterations.
[Tsochantaridis et al. 2005]
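Steps 1–3 can be sketched in code. This is a toy instance (a 3-class multiclass SVM standing in for a general structured model), and the inner QP of step 1 is approximated with subgradient descent; the working-set logic, not the solver, is the point of the sketch:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def joint_feature(x, y, k):
    """Psi(x, y): x placed in the block of the weight vector for class y."""
    f = [0.0] * (len(x) * k)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def violation(w, x, y, yh, k):
    """Margin violation of the constraint for label yh (before subtracting slack)."""
    loss = 0.0 if yh == y else 1.0
    return loss + dot(w, joint_feature(x, yh, k)) - dot(w, joint_feature(x, y, k))

def most_violated_label(w, x, y, k):
    """Step 2: loss-augmented inference over the (here, small) label set."""
    return max(range(k), key=lambda yh: violation(w, x, y, yh, k))

def solve_working_set(data, work, C, k, steps=300):
    """Step 1: approximately solve the SSVM objective restricted to the
    working set (subgradient descent stands in for the usual QP solver)."""
    dim = len(data[0][0]) * k
    w = [0.0] * dim
    for t in range(1, steps + 1):
        grad = list(w)  # gradient of the 0.5*||w||^2 term
        for i, (x, y) in enumerate(data):
            cons = [yh for j, yh in work if j == i]
            if not cons:
                continue
            yh = max(cons, key=lambda c: violation(w, x, y, c, k))
            if violation(w, x, y, yh, k) > 0:  # hinge active for example i
                fy, fh = joint_feature(x, y, k), joint_feature(x, yh, k)
                for d in range(dim):
                    grad[d] += C * (fh[d] - fy[d])
        for d in range(dim):
            w[d] -= grad[d] / t  # decaying step size
    return w

def cutting_plane_train(data, C=10.0, k=3, eps=1e-2):
    work, w = [], [0.0] * (len(data[0][0]) * k)
    while True:
        added = False
        for i, (x, y) in enumerate(data):
            yh = most_violated_label(w, x, y, k)
            # current slack: worst violation among constraints already in the set
            cur = max([violation(w, x, y, c, k) for j, c in work if j == i],
                      default=0.0)
            if violation(w, x, y, yh, k) > cur + eps:  # step 3: add constraint
                work.append((i, yh))
                added = True
        if not added:
            return w
        w = solve_working_set(data, work, C, k)

def predict(w, x, k=3):
    return max(range(k), key=lambda y: dot(w, joint_feature(x, y, k)))

# Hypothetical separable 3-class toy data.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2)]
w = cutting_plane_train(data)
```

The outer loop terminates because each pass either adds a previously unseen constraint (a finite set) or stops.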
Weak Supervision• Supervised learning involves the onerous task of collecting detailed annotations for each training sample.
• Financially infeasible as the size of the datasets grow.
• Weak supervision – the additional annotations hi are unknown.
• This makes for a more complex machine learning problem.
Weak Supervision – Challenges
• Find the best additional annotation for the positive examples.
• Identify the bounding box of the ‘jumping’ person in positive images.
• Consider all possible values of the annotations for negative samples as negative examples.
• Ensure that the scores of ‘jumping’ person bounding boxes are higher than the scores of all possible bounding boxes in the negative images.
Latent SVM (LSVM)
• Extends SSVM to incorporate hidden information.
• This information is considered as part of the label.
• Not observed during training.
minw ½||w||² + CΣiξi
s.t. ∀ŷi, ĥi : maxhi wTΨ(xi,yi,hi) − wTΨ(xi,ŷi,ĥi) ≥ Δ(yi,ŷi) − ξi
• Non-convex objective: a difference of convex functions.
• The CCCP algorithm converges to a local minimum.
Concave-Convex Procedure (CCCP)
1. Repeat until convergence:
2. Iteratively approximate the concave portion of the objective
• Impute the hidden variables
3. Update the parameters using the imputed values of the hidden variables
• Solve the resulting convex SSVM problem
Steps 2–3 are guaranteed to converge to a local minimum in a polynomial number of iterations.
[Yuille and Rangarajan, 2003]
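A minimal sketch of CCCP for a latent SVM, on a hypothetical multiple-instance toy: each positive bag contains one informative candidate (the latent choice), while negative bags contain only distractors. Imputation fixes h for the positives; the convex update is approximated here with subgradient descent rather than a full SSVM solver:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_instance(w, bag):
    """Impute the latent variable: the highest-scoring candidate in the bag."""
    return max(bag, key=lambda x: dot(w, x))

def cccp_train(pos_bags, neg_bags, C=10.0, outer=5, inner=300):
    dim = len(pos_bags[0][0])
    w = [0.0] * dim
    for _ in range(outer):
        # Step 2: linearize the concave part by imputing h for the positives.
        imputed = [best_instance(w, bag) for bag in pos_bags]
        # Step 3: solve the resulting convex problem (subgradient descent
        # stands in for a convex SSVM solver here).
        for t in range(1, inner + 1):
            grad = list(w)  # gradient of 0.5*||w||^2
            for x in imputed:                 # positives, latent h held fixed
                if dot(w, x) < 1:
                    for d in range(dim):
                        grad[d] -= C * x[d]
            for bag in neg_bags:              # negatives: the constraint must
                x = best_instance(w, bag)     # hold for every candidate h
                if dot(w, x) > -1:
                    for d in range(dim):
                        grad[d] += C * x[d]
            for d in range(dim):
                w[d] -= grad[d] / t           # decaying step size
    return w

def bag_score(w, bag):
    return dot(w, best_instance(w, bag))

# Hypothetical bags: positives contain one 'signal' candidate, negatives don't.
pos_bags = [[(1.0, 1.0), (0.1, -0.3)], [(0.9, 1.1), (-0.2, 0.1)]]
neg_bags = [[(0.1, -0.2), (-0.3, 0.1)], [(-0.1, -0.1), (0.2, -0.4)]]
w = cccp_train(pos_bags, neg_bags)
```

After training, positive bags score above zero and every candidate in the negative bags scores below zero, matching the weak-supervision requirement described earlier.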
Average Precision (AP)
Example: two rankings with identical 0/1 loss (0.40) but different AP-loss (0.24 vs. 0.36).
• AP is the most commonly used accuracy measure for binary classification.
• AP is the average of the precision scores at the rank locations of each positive sample.
• AP-loss depends on the ranking of the samples.
• 0/1 loss depends only on the number of incorrectly classified samples.
• A machine learning algorithm optimizing 0/1 loss might learn a very different model than one optimizing AP.
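Both measures can be computed directly from a ranking. A small sketch, with a hypothetical pair of rankings that have the same 0/1 loss but different AP (the numbers here are illustrative, not the slide's):

```python
def average_precision(ranked_labels):
    """AP of a ranking: the mean of the precision values measured at the
    rank of every positive sample (labels: 1 = positive, 0 = negative)."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

def zero_one_loss(ranked_labels, k):
    """0/1 loss when the top-k ranked samples are predicted positive."""
    errors = sum(1 for l in ranked_labels[:k] if l == 0)
    errors += sum(1 for l in ranked_labels[k:] if l == 1)
    return errors / len(ranked_labels)

# Two rankings with the same 0/1 loss but different AP-loss (1 - AP):
a = [1, 0, 1, 0]   # AP = (1/1 + 2/3) / 2 = 0.833..., AP-loss = 0.166...
b = [0, 1, 0, 1]   # AP = (1/2 + 2/4) / 2 = 0.5,      AP-loss = 0.5
```

Both rankings misplace two of four samples at the classification threshold (0/1 loss 0.5), yet their AP-losses differ, which is exactly why the loss being optimized matters.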
Notation
X: input samples {xi, i = 1,…,n}
Y: ranking matrix, s.t. Yij = +1 if xi is ranked higher than xj, 0 if xi and xj are ranked the same, and −1 if xi is ranked lower than xj
HP: additional annotations for the positives {hi, i ∈ P}
HN: additional annotations for the negatives {hj, j ∈ N}
AP(Y, Y∗): AP of ranking Y with respect to the true ranking Y∗
∆(Y, Y∗): AP-loss = 1 − AP(Y, Y∗)
Joint feature vector: Ψ(X,Y,{HP,HN}) = (1/(|P|·|N|)) Σi Σj Yij (Φi(hi) − Φj(hj))
AP-SVM
• AP-SVM optimizes the correct AP-loss function as opposed to the 0/1 loss.
Learning:
minw ½||w||² + Cξ
s.t. ∀Y : wTΨ(X,Y*,H) − wTΨ(X,Y,H) ≥ Δ(Y,Y*) − ξ
• Constraints are defined for each incorrect ranking Y.
• The joint discriminant score for the correct ranking must be at least as large as for an incorrect ranking plus the loss.
Prediction:
Yopt = argmaxY wTΨ(X,Y,H)
• After learning w, a prediction is made by sorting the samples (xk,hk) in descending order of wTΦk(hk).
AP-SVM – Exponential Constraints
• For Average Precision, the true labeling is a ranking where the positive examples are all ranked in the front, e.g.,
• An incorrect labeling would be any other ranking, e.g.,
• Exponential number of incorrect rankings.
• Thus an exponential number of constraints.
Finding Most Violated Constraint
• Structural SVM requires a subroutine to find the most violated constraint.
• Subroutine is dependent on formulation of loss function and joint feature representation.
• Yue et al. devised an efficient algorithm for this subroutine in the case of optimizing AP.
Finding Most Violated Constraint
• AP is invariant to the order of examples within the positive and the negative set.
• The joint SVM score is optimized by sorting examples in descending order of their individual scores.
• The problem reduces to finding an interleaving between two sorted lists of examples.
Finding Most Violated Constraint
• Start with the perfect ranking.
• Consider swapping adjacent positive/negative examples.
• Find the best feasible ranking of the negative example.
• Repeat for the next negative example.
• Never swap past previous negative examples.
• Repeat until all negative examples have been considered.
Slide taken from Yue et al. (2007)
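For intuition, the loss-augmented objective Δ(Y,Y*) + wTΨ(X,Y,·) can be maximized by brute force over all interleavings of the two sorted lists on a tiny example; the greedy procedure above computes the same maximizer without enumeration. The scores below are hypothetical:

```python
from itertools import combinations

def most_violated_ranking(pos_scores, neg_scores):
    """Brute-force loss-augmented inference for AP: maximize
    Delta(Y, Y*) + w^T Psi(X, Y) over all interleavings of the two
    score-sorted lists (exponential, so only viable for tiny examples)."""
    P, N = len(pos_scores), len(neg_scores)
    pos = sorted(range(P), key=lambda i: -pos_scores[i])
    neg = sorted(range(N), key=lambda j: -neg_scores[j])
    best, best_val = None, float("-inf")
    for slots in combinations(range(P + N), P):  # ranks taken by positives
        it_p, it_n = iter(pos), iter(neg)
        order = [('p', next(it_p)) if r in slots else ('n', next(it_n))
                 for r in range(P + N)]
        # Delta(Y, Y*) = 1 - AP of this ranking
        hits, precs = 0, []
        for rank, (kind, _) in enumerate(order, start=1):
            if kind == 'p':
                hits += 1
                precs.append(hits / rank)
        delta = 1.0 - sum(precs) / P
        # w^T Psi = (1/(P*N)) * sum_ij Y_ij * (s_i - s_j)
        rank_of = {item: r for r, item in enumerate(order)}
        score = sum((1 if rank_of[('p', i)] < rank_of[('n', j)] else -1)
                    * (pos_scores[i] - neg_scores[j])
                    for i in range(P) for j in range(N)) / (P * N)
        if delta + score > best_val:
            best_val, best = delta + score, order
    return best, best_val

# Hypothetical scores: one negative outscores a positive.
order, val = most_violated_ranking([2.0, 0.5], [1.0, -1.0])
```

Here the maximizer interleaves the strong negative above the weak positive, so its objective value exceeds that of the perfect ranking, i.e. the constraint is violated.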
Hypothesis
Optimizing correct loss function is important for weakly supervised learning.
Latent Structural SVM (LSSVM)
• Introduces a margin between the maximum score for the ground-truth output and all other pairs of outputs and additional annotations.
• Compares scores between two different sets of annotations.
Learning:
minw ½||w||² + Cξ
s.t. ∀Y, H : maxĤ{wTΨ(X,Y*,Ĥ)} − wTΨ(X,Y,H) ≥ Δ(Y,Y*) − ξ
Latent Structural SVM (LSSVM)
Prediction:
(Yopt,Hopt) = argmaxY,H wTΨ(X,Y,H)
Latent Structural SVM (LSSVM)
Disadvantages:
•Prediction: LSSVM uses an unintuitive prediction rule.
•Learning: LSSVM optimizes a loose upper-bound on the AP-loss.
•Optimization: Exact loss-augmented inference is computationally inefficient.
Latent AP-SVM - Prediction
Step 1: Find the best hi for each sample: Hopt = argmaxH wTΨ(X,Y,H)
Step 2: Sort the samples according to their best scores: Yopt = argmaxY wTΨ(X,Y,Hopt)
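The two-step prediction rule can be sketched directly; the linear score and the candidate sets below are hypothetical:

```python
def latent_ap_svm_predict(samples, score):
    """Latent AP-SVM prediction: (1) independently pick the best latent value
    for each sample, (2) rank the samples by the resulting best scores."""
    best = [max(candidates, key=score) for candidates in samples]
    best_scores = [score(h) for h in best]
    order = sorted(range(len(samples)), key=lambda i: -best_scores[i])
    return order, best_scores

# Hypothetical setup: candidate windows per sample, linear score w^T h.
w = [2.0, 1.0]
score = lambda h: w[0] * h[0] + w[1] * h[1]
samples = [[(0.5, 0.5), (1.0, 1.0)],    # best score 3.0
           [(0.25, 0.25), (0.5, 0.0)],  # best score 1.0
           [(2.0, 0.0)]]                # best score 4.0
order, best_scores = latent_ap_svm_predict(samples, score)
```

Unlike the joint LSSVM rule, the latent choice here is made per sample before ranking, which matches the intuitive "detect, then rank" pipeline.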
Latent AP-SVM - Learning
• Finds the best assignment of values HP such that the score for the correct ranking is higher than the score for an incorrect ranking, regardless of the choice of HN.
• Compares scores between same sets of additional annotation.
minw ½||w||² + Cξ
s.t. ∀Y, HN : maxHP{wTΨ(X,Y*,{HP,HN}) − wTΨ(X,Y,{HP,HN})} ≥ Δ(Y,Y*) − ξ
Latent AP-SVM - Learning
• Constraints of latent AP-SVM are a subset of LSSVM constraints.
• The optimal solution of latent AP-SVM has a lower objective value than the LSSVM solution.
• Latent AP-SVM provides a valid upper-bound on the AP-loss.
Latent AP-SVM provides a tighter upper-bound on the AP Loss
Latent AP-SVM - Optimization
1. Initialize the parameters w0.
2. Repeat until convergence:
3. Impute the additional annotations for the positives.
• Independently choose each additional annotation in HP. Complexity: O(nP·|H|)
4. Update the parameters using the cutting-plane algorithm.
• Maximize over HN and Y independently. Complexity: O(nP·nN)
Action Classification
Input x
Output y = “Using Computer”
Latent Variable h
PASCAL VOC 2011 action classification: 4846 images, 10 action classes; 2424 trainval & 2422 test images
Features- 2400 activation scores of action-specific poselets & 4 object activation scores
Action Classification
• 5-fold cross validation on the ‘trainval’ dataset.
• Statistically significant increase in performance:
• 6/10 classes over LSVM
• 7/10 classes over LSSVM
• Overall improvement:
• 5% over LSVM
• 4% over LSSVM
Action Classification
• X-axis corresponds to the amount of supervision provided.
• Y-axis corresponds to the mean average precision.
• As the amount of supervision decreases, the gap in performance between latent AP-SVM and the baseline methods increases.
Action Classification
• Performance on the test set of PASCAL VOC 2011.
• Increase in performance:
• all classes over LSVM
• 8/10 classes over LSSVM
• Overall improvement:
• 5.1% over LSVM
• 3.7% over LSSVM
Object Detection
• PASCAL VOC 2007 object detection dataset.
• 9963 images over 20 object categories.
• 5011 trainval & 4952 test images.
• Features – 4096-dimensional activation vector of the penultimate layer of a trained Convolutional Neural Network (CNN).
Input x, Output y = “Aeroplane”, Latent Variable h
(An average of 2000 candidate windows per image, generated using the selective-search algorithm.)
Object Detection
• 5-fold cross validation on the ‘trainval’ dataset.
• Statistically significant increase in performance for 15/20 classes over LSVM.
• Superior performance partially attributed to the better localization of objects by LAP-SVM during training.
Object Detection
• Performance on the test set of PASCAL VOC 2007.
• Increase in performance for 19/20 classes over LSVM.
• Overall improvement of 7% over LSVM.
• We also get improved results on the IIIT 5K-WORD dataset.
Conclusion
• Proposed novel formulation that obtains accurate ranking by minimizing a carefully designed upper bound on the AP loss.
• Showed the theoretical benefits of our method.
• Demonstrated advantage of our approach on challenging machine learning problems.
Thank you