
High Level Computer Vision

Part-Based Models for Object Class Recognition Part 2

Bernt Schiele, Mario Fritz

https://www.mpi-inf.mpg.de/hlcv

High Level Computer Vision - May 16, 2018

Class of Object Models: Part-Based Models / Pictorial Structures

• Pictorial Structures [Fischler & Elschlager 1973]
  ‣ Model has two components
    - parts (2D image fragments)
    - structure (configuration of parts)

“State-of-the-Art” in Object Class Representations

• Bag of Words Models (BoW)
  ‣ object model = histogram of local features
  ‣ e.g. local features around interest points
  ‣ BoW: no spatial relationships
• Global Object Models
  ‣ object model = global object feature
  ‣ e.g. HOG (Histogram of Oriented Gradients)
  ‣ e.g. HOG: fixed spatial relationships
• Part-Based Object Models
  ‣ object model = models of parts & spatial topology model
  ‣ e.g. constellation model or ISM (Implicit Shape Model)
  ‣ e.g. ISM: flexible spatial relationships
• But: What is the Ideal Notion of Parts here?
• And: Should those Parts be Semantic?

Constellation of Parts

[Figure: fully connected shape model over parts x1, ..., x6]

Weber, Welling, Perona, ’00; Fergus, Zisserman, Perona, ’03

Constellation Model

• Joint model for appearance and structure (= shape)
  ‣ X: positions, A: part appearance, S: scale
  ‣ h: Hypothesis = assignment of features (in the image) to parts (of the model)

[Figure: Gaussian shape pdf; probability of detection; Gaussian part appearance pdf; Gaussian relative scale pdf over log(scale)]

Weber, Welling, Perona, ’00; Fergus, Zisserman, Perona, ’03

Object Detection: ISM (Implicit Shape Model)

• Appearance of parts: Implicit Shape Model (ISM) [Leibe, Seemann & Schiele, CVPR 2005]

Spatial Models for Categorization

• Fully connected shape model (parts x1, ..., x6)
  ‣ e.g. Constellation Model
  ‣ Parts fully connected
  ‣ Recognition complexity: $O(N^P)$
  ‣ Method: Exhaustive search
• “Star” shape model (parts x1, ..., x6)
  ‣ e.g. ISM (Implicit Shape Model)
  ‣ Parts mutually independent
  ‣ Recognition complexity: $O(NP)$
  ‣ Method: Generalized Hough Transform
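To make the gap between the two models concrete, the sketch below counts the work each one implies. It assumes N is the number of image features and P the number of model parts, reads the fully connected case as "every assignment of features to parts" ($N^P$) and the star case as "every feature considered once per part" ($N \cdot P$). The function names are illustrative and not taken from the slides.

```python
def fully_connected_hypotheses(n_features: int, n_parts: int) -> int:
    """Feature-to-part assignments visited by an exhaustive search: N^P."""
    return n_features ** n_parts


def star_model_votes(n_features: int, n_parts: int) -> int:
    """(feature, part) pairs visited when parts are handled independently: N * P."""
    return n_features * n_parts


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to show the growth rates.
    for n, p in [(10, 3), (100, 6)]:
        print(n, p, fully_connected_hypotheses(n, p), star_model_votes(n, p))
```

Even for a modest image with 100 features and a 6-part model, the exhaustive count is many orders of magnitude larger than the voting count, which is one reason fully connected models tend to use only a handful of parts.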

Implicit Shape Model: What are Good Parts?

[Leibe,Schiele@bmvc03][Leibe,Leonardis,Schiele@ijcv08]

• Parts of the Implicit Shape Model
  ‣ “parts” = feature clusters
  ‣ lots of “parts” (on the order of 1,000 - 10,000 codebook entries): the more the better!
  ‣ “parts” are mostly non-semantic
• “parts” = (mostly non-semantic) feature clusters is also true for
  ‣ bag of words models
  ‣ constellation model (far fewer “parts”, but still feature clusters)
  ‣ ...

Part-Based Models - Overview Today

• Last Week:
  ‣ Part-Based Models based on Manual Labeling of Parts
    - Detection by Components, Multi-Scale Parts
  ‣ The Constellation Model
    - automatic discovery of parts and part-structure
  ‣ The Implicit Shape Model (ISM)
    - star-model of part configurations, parts obtained by clustering interest-points
• Today:
  ‣ Pictorial Structures Model
  ‣ Learning Object Model from CAD Data
  ‣ Deformable Parts Model (DPM)
  ‣ Discussion: Semantic Parts vs. Discriminative Parts


People Detection: partISM

• Appearance of parts: Implicit Shape Model (ISM) [Leibe, Seemann & Schiele, CVPR 2005]
• Part decomposition and inference: Pictorial structures model [Felzenszwalb & Huttenlocher, IJCV 2005]

[Figure: body-part positions x1, ..., x8 and xo; image evidence]

$p(L \mid D) \propto p(D \mid L)\, p(L)$

Pictorial Structures Model

• Two Components
  ‣ Prior (capturing possible part configurations): $p(L)$
  ‣ Likelihood of Parts (capturing part appearance): $p(D \mid L)$

$p(L \mid D) = \frac{p(D \mid L)\, p(L)}{p(D)} \propto p(D \mid L)\, p(L)$

where $L$ denotes the body-part positions and $D$ the image evidence.
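A toy illustration of the proportionality may help. The sketch below assumes a finite set of three candidate configurations with made-up prior and likelihood values, which is far simpler than the continuous pose space of the lecture. It shows that dividing by $p(D)$ only rescales the scores, so the best configuration can be found from $p(D \mid L)\, p(L)$ alone.

```python
def posterior(likelihood, prior):
    """Normalize p(D|L) * p(L) over a finite list of configurations."""
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # p(D), constant with respect to L
    return [u / evidence for u in unnormalized]


# Hypothetical values for three candidate configurations.
prior = [0.5, 0.3, 0.2]        # p(L)
likelihood = [0.1, 0.6, 0.3]   # p(D|L)

post = posterior(likelihood, prior)
best = max(range(len(post)), key=lambda k: post[k])

if __name__ == "__main__":
    print(post, best)
```

The configuration with the highest prior is not the one selected here, because the image evidence favors another candidate.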

Pictorial Structures: Model Components

• Body is represented as flexible configuration of body parts

[Figure: local features, AdaBoost, likelihood of part N (... orientation K), part posteriors, estimated pose]

Figure 2. (left) Kinematic prior learned on the multi-view and multi-articulation dataset from [15]. The mean part position is shown using blue dots; the covariance of the part relations in the transformed space is shown using red ellipses. (right) Several independent samples from the learned prior (for ease of visualization given fixed torso position and orientation).

[14], and use AdaBoost [7] to train discriminative part classifiers. Our detectors are evaluated densely and are bootstrapped to improve performance. Strong detectors of that type have been commonplace in the pedestrian detection literature [1, 12, 13, 24]. In these cases, however, the employed body models are often simplistic. A simple star model for representing part articulations is, for example, used in [1], whereas [12] does not use an explicit part representation at all. This precludes the applicability to strongly articulated people and consequently these approaches have been applied to upright people detection only.

We combine this discriminative appearance model with a generative pictorial structures approach by interpreting the normalized classifier margin as the image evidence that is being generated. As a result, we obtain a generic model for people detection and pose estimation, which not only outperforms recent work in both areas by a large margin, but is also surprisingly simple and allows for exact and efficient inference.

More related work: Besides the already mentioned related work there is an extensive literature on both people (and pedestrian) detection, as well as on articulated pose estimation. A large amount of work has been advocating strong body models, and another substantial set of related work relies on powerful appearance models.

Strong body models have appeared in various forms. A certain focus has been the development of non-tree models. [17] imposes constraints not only between limbs on the same extremity, but also between extremities, and relies on integer programming for inference. Another approach incorporate self-occlusion in a non-tree model [8]. Either approach relies on matching simple line features, and only appears to work on relatively clean backgrounds. In contrast, our method also works well on complex, cluttered backgrounds. [20] also uses non-tree models to improve occlusion handling, but still relies on simple features, such as color. A fully connected graphical model for representing articulations is proposed in [2], which also uses discriminative part detectors. However, the method has several restrictions, such as relying on absolute part orientations, which makes it applicable to people in upright poses only. Moreover, the fully connected graph complicates inference. Other work has focused on discriminative tree models [16, 18], but due to the use of simple features, these methods fall short in terms of performance. [25] proposes a complex hierarchical model for pruning the space of valid articulations, but also relies on relatively simple features. In [5] discriminative training is combined with strong appearance representation based on HOG features, however the model is applied to detection only.

Discriminative part models have also been used in conjunction with generative body models, as we do here. [11, 21], for example, use them as proposal distributions (“shouters”) for MCMC or nonparametric belief propagation. Our paper, however, directly integrates the part detectors and uses them as the appearance model.

2. Generic Model for People Detection and Pose Estimation

To facilitate reliable detection of people across a wide variety of poses, we follow [4] and assume that the body model is decomposed into a set of parts. Their configuration is denoted as $L = \{l_0, l_1, \ldots, l_N\}$, where the state of part $i$ is given by $l_i = (x_i, y_i, \theta_i, s_i)$. $x_i$ and $y_i$ is the position of the part center in image coordinates, $\theta_i$ is the absolute part orientation, and $s_i$ is the part scale, which we assume to be relative to the size of the part in the training set.

Depending on the task, the number of object parts may vary (see Figs. 2 and 3). For upper body detection (or pose estimation), we rely on 6 different parts: head, torso, as well as left and right lower and upper arms. In case of full body detection, we additionally consider 4 lower body parts: left and right upper and lower legs, resulting in a 10 part model. For pedestrian detection we do not use arms, but add feet, leading to an 8 part model.

Given the image evidence $D$, the posterior of the part configuration $L$ is modeled as $p(L \mid D) \propto p(D \mid L)\, p(L)$, where $p(D \mid L)$ is the likelihood of the image evidence given a particular body part configuration. In the pictorial structures approach $p(L)$ corresponds to a kinematic tree prior. Here, both these terms are learned from training data, either from generic data or trained more specifically for the application at hand. To make such a seemingly generic and simple approach work well, and to compete with more specialized models on a variety of tasks, it is necessary to carefully pick the appropriate prior $p(L)$ and an appropriate image likelihood $p(D \mid L)$. In Sec. 2.1, we will first introduce our generative kinematic model $p(L)$, which closely follows the pictorial structures approach [4]. In Sec. 2.2, we will then introduce our discriminatively trained appearance model $p(D \mid L)$.

$p(L \mid D) \propto p(D \mid L)\, p(L)$

‣ $p(L \mid D)$: posterior over body poses
‣ $p(D \mid L)$: likelihood of observations
‣ $p(L)$: prior on body poses

[Andriluka,Roth,Schiele@cvpr09]

Kinematic Tree Prior (modeling the structure)

• Represent pairwise part relations [Felzenszwalb & Huttenlocher, IJCV’05]

$p(L) = p(l_0) \prod_{(i,j) \in E} p(l_i \mid l_j)$

$p(l_2 \mid l_1) = \mathcal{N}\left(T_{12}(l_2) \mid T_{21}(l_1), \Sigma_{12}\right)$

[Figure: part locations $l_1$, $l_2$ relative to the joint; transformed part locations; kinematic tree over parts $l_1, \ldots, l_{10}$]
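The tree-structured prior can be sketched in a few lines. The version below is a simplification under stated assumptions: each part state is reduced to (x, y, theta) without scale, the transformation T is taken to map a part state to the image position of the shared joint, and the Gaussian is placed on that 2D joint position only. The helper names, offsets, and covariance values are hypothetical and only illustrate the structure $\log p(L) = \log p(l_0) + \sum_{(i,j) \in E} \log p(l_i \mid l_j)$.

```python
import math


def joint_position(part, offset):
    """Map a part state (x, y, theta) to the image position of a joint.

    offset is the joint location in the part's own coordinate frame.
    """
    x, y, theta = part
    ox, oy = offset
    c, s = math.cos(theta), math.sin(theta)
    return (x + c * ox - s * oy, y + s * ox + c * oy)


def log_gaussian_2d(point, mean, cov):
    """Log density of a 2D Gaussian with covariance ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form via the closed-form inverse of a 2x2 matrix.
    quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return -0.5 * quad - math.log(2.0 * math.pi) - 0.5 * math.log(det)


def log_tree_prior(parts, edges, log_root_prior=0.0):
    """Sum the pairwise log terms over the edges of the kinematic tree."""
    total = log_root_prior
    for child, parent, child_offset, parent_offset, cov in edges:
        child_joint = joint_position(parts[child], child_offset)
        parent_joint = joint_position(parts[parent], parent_offset)
        total += log_gaussian_2d(child_joint, parent_joint, cov)
    return total


# Hypothetical two-part model: torso (root) and head joined at the neck.
cov = ((4.0, 0.0), (0.0, 4.0))
edges = [("head", "torso", (0.0, 10.0), (0.0, -20.0), cov)]
aligned = log_tree_prior({"torso": (0, 0, 0.0), "head": (0, -30, 0.0)}, edges)
shifted = log_tree_prior({"torso": (0, 0, 0.0), "head": (10, -30, 0.0)}, edges)
```

A configuration whose two parts agree on where the joint lies scores higher than one where the head is displaced sideways. Because each term couples only a part and its parent, the sum decomposes along the tree, which is the property that tree-based inference relies on.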

Kinematic Tree Prior

• Prior parameters: $\{T_{ij}, \Sigma_{ij}\}$
• Parameters of the prior are estimated with maximum likelihood

[Figure 2: kinematic prior; mean pose (left) and several independent samples (right)]
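For the Gaussian part of the prior, maximum likelihood estimation has a closed form: the sample mean and the sample covariance normalized by the number of samples. The sketch below assumes the training data has already been reduced to a list of 2D vectors (for instance, differences between transformed part locations), and it does not cover how the transformations themselves are estimated. The function name and the sample values are illustrative.

```python
def fit_gaussian_ml(samples):
    """Maximum likelihood mean and covariance of a multivariate Gaussian.

    The covariance is normalized by n (not n - 1), which is the ML estimate.
    """
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    cov = [
        [
            sum((s[r] - mean[r]) * (s[c] - mean[c]) for s in samples) / n
            for c in range(dim)
        ]
        for r in range(dim)
    ]
    return mean, cov


# Hypothetical training samples.
mean, cov = fit_gaussian_ml([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])

if __name__ == "__main__":
    print(mean, cov)
```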


Likelihood Model

• Assumption:
  ‣ evidence (image features) for each part is independent of all other parts:

$p(D \mid L) = \prod_{i=0}^{N} p(d_i \mid l_i)$

• The assumption is clearly not correct, but
  ‣ allows efficient computation
  ‣ works rather well in practice
  ‣ training data for different body parts should cover “all” appearances
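Under the independence assumption, the likelihood of the full configuration is a product of per-part terms, which in practice is usually accumulated as a sum of logarithms. The following minimal sketch assumes the per-part likelihood values are already available as positive numbers; the values shown are made up.

```python
import math


def log_likelihood(part_likelihoods):
    """log p(D|L) as the sum of log p(d_i | l_i) over all parts."""
    return sum(math.log(p) for p in part_likelihoods)


# Hypothetical per-part likelihoods for a three-part model.
parts = [0.5, 0.2, 0.1]
total = log_likelihood(parts)

if __name__ == "__main__":
    print(total, math.exp(total))
```

Working in the log domain keeps the product of many small per-part values from underflowing, and it turns the factorization into a sum that combines naturally with the log of the tree prior.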

Likelihood Model

• Build on recent advances in object detection:
  ‣ state-of-the-art image descriptor: Shape Context [Belongie et al., PAMI’02; Mikolajczyk & Schmid, PAMI’05]
  ‣ dense representation
  ‣ discriminative model: AdaBoost classifier for each body part
    - Shape Context: 96 dimensions (4 angular, 3 radial, 8 gradient orientations)
    - Feature Vector: concatenate the descriptors inside part bounding box
    - head: 4032 dimensions
    - torso: 8448 dimensions
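The dimensions quoted above are consistent with each other, which can be checked with a little arithmetic. The sketch below derives the descriptor size from the bin counts and then the number of descriptors concatenated per part. The per-part descriptor counts are inferred from the quoted dimensions and are not stated on the slide.

```python
SHAPE_CONTEXT_DIM = 4 * 3 * 8  # angular x radial x gradient-orientation bins


def descriptors_in_box(feature_dim, descriptor_dim=SHAPE_CONTEXT_DIM):
    """Number of descriptors whose concatenation gives feature_dim."""
    count, remainder = divmod(feature_dim, descriptor_dim)
    if remainder:
        raise ValueError("feature_dim is not a multiple of descriptor_dim")
    return count


head_descriptors = descriptors_in_box(4032)
torso_descriptors = descriptors_in_box(8448)

if __name__ == "__main__":
    print(SHAPE_CONTEXT_DIM, head_descriptors, torso_descriptors)
```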

Likelihood Model

• Part likelihood derived from the boosting score:

$\tilde{p}(d_i \mid l_i) = \max\left( \frac{\sum_t \alpha_{i,t}\, h_t(x(l_i))}{\sum_t \alpha_{i,t}},\ \varepsilon_0 \right)$

  ‣ $l_i$: part location
  ‣ $h_t$: decision stump output
  ‣ $\alpha_{i,t}$: decision stump weight
  ‣ $\varepsilon_0$: small constant to deal with part occlusions
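The formula can be sketched directly. The version below assumes the decision stump outputs are +1 or -1 and have already been evaluated at the candidate part location; the stump weights, outputs, and the default value of the small constant are made-up illustration values.

```python
def part_likelihood(stump_outputs, stump_weights, epsilon0=1e-4):
    """Normalized boosting margin, clipped from below by epsilon0.

    epsilon0 = 1e-4 is an assumed value; the slides only call it small.
    """
    weighted = sum(a * h for a, h in zip(stump_weights, stump_outputs))
    total = sum(stump_weights)
    return max(weighted / total, epsilon0)


# Hypothetical classifier with three stumps.
weights = [0.5, 0.3, 0.2]
confident = part_likelihood([+1, +1, -1], weights)   # mostly positive votes
rejected = part_likelihood([-1, -1, +1], weights)    # mostly negative votes

if __name__ == "__main__":
    print(confident, rejected)
```

Dividing by the sum of the weights bounds the margin to the range from -1 to 1, and the lower clip keeps the pseudo-likelihood strictly positive, so a single occluded part cannot drive the product over all parts to zero.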

Likelihood Model

[Figure: input image with part likelihood maps for torso, head, and upper leg; comparison of [Ramanan, NIPS’06] with our part likelihoods]

Part-Based Model: 2D Human Pose Estimation

[Andriluka,Roth,Schiele@cvpr09]

Pictorial Structures for Human Pose Estimation: What are Good Parts?

• Parts of the Pictorial Structures Model
  ‣ “parts” = semantic body parts
  ‣ pose estimation = estimation of body part configuration
  ‣ semantic body parts make it possible to use motion capture data, etc. to improve the kinematic tree prior
  ‣ non-semantic parts (e.g. in the ISM model) are more difficult to generalize across human body poses


Back to the Future: Learning Shape Models from 3D CAD Data

• 3D Computer Aided Design (CAD) Models
  ‣ Computer graphics, game design
  ‣ Polygonal meshes + texture descriptions
  ‣ semantic part annotations (may) exist
• Can we learn Object Class Models directly from 3D CAD data?
  ‣ Issue: Transition between 3D CAD models and 2D real-world images

[Stark,Goesele,Schiele@bmvc10]


Three Tools to Meet the Challenge

1. Shape-based appearance abstraction

‣ Non-photorealistic rendering

‣ Local shape + global geometry

2. Discriminative part detectors

‣ Robust local shape features

‣ AdaBoost classifiers

3. Powerful spatial model

‣ Full covariance

‣ Efficient DDMCMC inference

[Stark,Goesele,Schiele@bmvc10]

Shape-based Appearance Abstraction ➸ Non-Photorealistic Rendering

• Learn shape models from rendered images

‣ we do NOT render photo-realistically or with texture

‣ but focus on 3D CAD model edges (which mimic real-world image edges)

[Figure: rendered edge types: mesh creases, silhouette, part boundaries; combined into final edges (hidden edges removed)]

[Stark,Goesele,Schiele@bmvc10]

Shape: Local Shape + Global Geometry

• Part-based object class representation

‣ Semantic parts from 3D CAD models: left front wheel, left front door, etc.

[Figure: parts (local shape) + global geometry, per viewpoint: Left, Front left, Front, ...]

[Stark,Goesele,Schiele@bmvc10]

Qualitative Results

[Figure: example detections for the Back-right and Right viewpoint models]

• Three strongest true positive detections per viewpoint model

• Observations

‣ Accurate part localization

‣ Predicted viewpoints do match

[Stark,Goesele,Schiele@bmvc10]

Shape Model learned from 3D CAD-Data: What are Good Parts?

• Parts of the Shape Model:

‣ “parts” = just a means to enable correspondence across 3D-models

‣ semantics of parts:

- in our case: yes - because of the employed 3D models

- but: semantics neither necessary nor important

[Figure: viewpoint models: Front left, Front, Left, ...]

slide courtesy: Pedro Felzenszwalb

Starting Point: Sliding Window Method

• Sliding Window Based People Detection:

‣ ‘slide’ detection window over all positions & scales

‣ Scan Image → Extract Feature Vector → Classify Feature Vector → Non-Maxima Suppression

• Two Important Questions:

1) which feature vector

2) which classifier

• For example: HOG Pedestrian Detector

- HOG descriptor

- linear SVM
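
The four pipeline stages can be sketched as below. This is an illustrative outline, not the detector from the lecture: the function names, the nearest-neighbour rescaling, the greedy overlap-based suppression and the toy mean-intensity feature are all assumptions chosen to keep the example dependency-free.

```python
import numpy as np

def sliding_window_detect(image, window, weights, bias, extract,
                          stride=8, scales=(1.0, 0.5), threshold=0.0):
    """Scan all positions and scales; return rows [x, y, w, h, score] in original coordinates."""
    win_h, win_w = window
    detections = []
    for s in scales:
        # Nearest-neighbour rescaling of the image by factor s.
        rows = (np.arange(int(image.shape[0] * s)) / s).astype(int)
        cols = (np.arange(int(image.shape[1] * s)) / s).astype(int)
        scaled = image[np.ix_(rows, cols)]
        for y in range(0, scaled.shape[0] - win_h + 1, stride):
            for x in range(0, scaled.shape[1] - win_w + 1, stride):
                feat = extract(scaled[y:y + win_h, x:x + win_w])   # feature vector
                score = float(weights @ feat + bias)               # linear classifier
                if score > threshold:
                    detections.append([x / s, y / s, win_w / s, win_h / s, score])
    return np.array(detections).reshape(-1, 5)

def non_maxima_suppression(dets, max_overlap=0.5):
    """Greedy NMS: keep the best box, drop boxes whose IoU with it exceeds max_overlap."""
    order = np.argsort(-dets[:, 4])
    keep = []
    while order.size:
        best, rest = order[0], order[1:]
        keep.append(best)
        x1 = np.maximum(dets[best, 0], dets[rest, 0])
        y1 = np.maximum(dets[best, 1], dets[rest, 1])
        x2 = np.minimum(dets[best, 0] + dets[best, 2], dets[rest, 0] + dets[rest, 2])
        y2 = np.minimum(dets[best, 1] + dets[best, 3], dets[rest, 1] + dets[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        union = dets[best, 2] * dets[best, 3] + dets[rest, 2] * dets[rest, 3] - inter
        order = rest[inter / union <= max_overlap]
    return dets[keep]

# Toy usage (assumed data): a bright 16x16 square, mean intensity as the only feature.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0
dets = sliding_window_detect(image, (16, 16), np.array([1.0]), -0.4,
                             extract=lambda patch: np.array([patch.mean()]),
                             scales=(1.0,))
kept = non_maxima_suppression(dets, max_overlap=0.3)
```

Swapping `extract` for a HOG descriptor and `weights`, `bias` for a trained linear SVM would give the example configuration named on the slide.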

Histogram of Oriented Gradients (HOG): Static Feature Extraction

[Figure: input image and detection window]
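
A simplified sketch of static HOG feature extraction for one detection window. It assumes 8x8-pixel cells, 9 unsigned orientation bins and a single global L2 normalisation; the overlapping block normalisation and bin interpolation of the full descriptor are left out for brevity, and `hog_cells` is a hypothetical name.

```python
import numpy as np

def hog_cells(window, cell=8, bins=9):
    """One magnitude-weighted gradient-orientation histogram per cell, concatenated."""
    window = window.astype(float)
    gy, gx = np.gradient(window)                     # derivatives along rows and columns
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation in [0, 180)
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    n_y, n_x = window.shape[0] // cell, window.shape[1] // cell
    hist = np.zeros((n_y, n_x, bins))
    for cy in range(n_y):
        for cx in range(n_x):
            sl = (slice(cy * cell, (cy + 1) * cell), slice(cx * cell, (cx + 1) * cell))
            hist[cy, cx] = np.bincount(bin_idx[sl].ravel(),
                                       weights=mag[sl].ravel(), minlength=bins)
    feat = hist.ravel()
    return feat / (np.linalg.norm(feat) + 1e-9)      # global L2 normalisation

# Toy usage (assumed data): a vertical step edge has purely horizontal gradients.
window = np.zeros((16, 16))
window[:, 8:] = 1.0
feat = hog_cells(window)
```

For the step edge all gradient energy falls into the first orientation bin of each cell that touches the edge, which is the kind of local shape evidence the descriptor is meant to capture.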

$F \cdot \phi(p, H) < 0$

$F \cdot \phi(p, H) > 0$

Efficient Computation

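• Overall score:

$$\mathrm{score}(p_0, \dots, p_n) = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i) - \sum_{i=1}^{n} d_i \cdot (dx_i^2, dy_i^2)$$

• Maximization can be done separately:

$$\begin{aligned}
\mathrm{score}(p_0) &= \max_{p_1, \dots, p_n} \mathrm{score}(p_0, \dots, p_n) \\
&= F_0 \cdot \phi(H, p_0) + \max_{p_1} \left( F_1 \cdot \phi(H, p_1) - d_1 \cdot (dx_1^2, dy_1^2) \right) + \cdots + \max_{p_n} \left( F_n \cdot \phi(H, p_n) - d_n \cdot (dx_n^2, dy_n^2) \right)
\end{aligned}$$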
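
A small sketch of why the maximization separates: for a fixed root location each part term depends only on its own placement, so every part can be maximised on its own. The code assumes the filter responses $F_i \cdot \phi(H, p_i)$ are already available as 2-D arrays, that displacements are measured relative to a per-part anchor offset from the root, and that all names and the random toy data are illustrative. Each part is maximised by brute force here; generalized distance transforms are the usual way to make this step fast.

```python
import numpy as np

def part_term(resp, anchor, d, p0):
    """max over p_i of  resp[p_i] - d · (dx^2, dy^2)  for a fixed root location p0 = (y, x)."""
    ys, xs = np.mgrid[0:resp.shape[0], 0:resp.shape[1]]
    dx = xs - (p0[1] + anchor[1])      # horizontal displacement from the anchor
    dy = ys - (p0[0] + anchor[0])      # vertical displacement from the anchor
    return np.max(resp - d[0] * dx**2 - d[1] * dy**2)

def root_score(root_resp, part_resps, anchors, defs, p0):
    """score(p0): root response plus independently maximised part terms."""
    return root_resp[p0] + sum(part_term(r, a, d, p0)
                               for r, a, d in zip(part_resps, anchors, defs))

# Toy data (assumed): random response maps for one root filter and two part filters.
rng = np.random.default_rng(0)
root_resp = rng.normal(size=(5, 5))
part_resps = [rng.normal(size=(5, 5)) for _ in range(2)]
anchors = [(0, -1), (1, 1)]            # (dy, dx) anchor offsets relative to the root
defs = [(0.1, 0.1), (0.2, 0.05)]       # deformation weights d_i for (dx^2, dy^2)
p0 = (2, 2)
s = root_score(root_resp, part_resps, anchors, defs, p0)
```

With n parts and L candidate locations, the joint search over all placements costs on the order of L^n evaluations per root location, while the separated form costs n·L.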

SVM training

• Classifier scores an example x using:

$$f(x) = \beta \cdot \phi(x)$$

‣ model parameter: $\beta$

‣ feature vector: $\phi(x)$

• Linear SVM:

‣ objective: maximize margin (for best generalization)

[Figure: margin of the linear SVM]

SVM training

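• Training data:

$$D = (\langle x_1, y_1 \rangle, \dots, \langle x_n, y_n \rangle) \quad \text{with } y_i \in \{-1, 1\}$$

• Constraints:

$$f(x_i) \geq +1 \ \text{for } y_i = +1, \qquad f(x_i) \leq -1 \ \text{for } y_i = -1$$

$$\Rightarrow\ y_i f(x_i) \geq +1 \quad \Rightarrow\ 0 \geq 1 - y_i f(x_i)$$

• Training error:

$$\sum_{i=1}^{n} \max(0, 1 - y_i f(x_i))$$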
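
A short worked example of the training error, assuming a linear scoring function whose feature vectors are stacked as rows of a matrix. The names `training_error`, `Phi` and `beta` and the numbers are made up for illustration.

```python
import numpy as np

def training_error(beta, Phi, y):
    """Sum over examples of max(0, 1 - y_i f(x_i)) with f(x_i) = beta · Phi[i]."""
    margins = y * (Phi @ beta)                 # y_i f(x_i) for every example
    return np.sum(np.maximum(0.0, 1.0 - margins))

Phi = np.array([[2.0, 0.0],                    # f = 2.0, y = +1: constraint satisfied
                [0.5, 0.0],                    # f = 0.5, y = +1: violated by 0.5
                [-1.0, 0.0]])                  # f = -1.0, y = -1: satisfied with equality
y = np.array([1, 1, -1])
beta = np.array([1.0, 0.0])
err = training_error(beta, Phi, y)             # 0 + 0.5 + 0 = 0.5
```

Only the second example contributes: it lies on the correct side of the decision boundary but inside the margin, so it is penalised even though it is classified correctly.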

SVM training

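• Two objectives:

‣ maximize margin:

$$\min \frac{1}{2} \|\beta\|^2$$

‣ minimize training error:

$$\min \sum_{i=1}^{n} \max(0, 1 - y_i f(x_i))$$

• Therefore minimize (primal formulation)

$$L(\beta) = \min_{\beta} \left( \frac{1}{2} \|\beta\|^2 + \sum_{i=1}^{n} \max(0, 1 - y_i f(x_i)) \right)$$

• Hinge loss:

$$H(z) = \max(0, 1 - z)$$

[Figure: plot of the hinge loss H(z) over z]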
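
One possible way to minimise the primal objective is plain subgradient descent, sketched below. The optimiser, the step-size schedule and the toy data are assumptions made for illustration; the slides do not say which solver is used, and practical implementations typically add a weighting constant between the two terms.

```python
import numpy as np

def svm_objective(beta, Phi, y):
    """0.5 * ||beta||^2 plus the summed hinge loss H(y_i f(x_i))."""
    return 0.5 * beta @ beta + np.sum(np.maximum(0.0, 1.0 - y * (Phi @ beta)))

def train_linear_svm(Phi, y, steps=2000, lr=0.01):
    """Subgradient descent on the primal objective; returns the best iterate seen."""
    beta = np.zeros(Phi.shape[1])
    best, best_obj = beta.copy(), svm_objective(beta, Phi, y)
    for t in range(1, steps + 1):
        violated = y * (Phi @ beta) < 1.0      # examples with non-zero hinge loss
        # Subgradient: beta from the margin term, -y_i phi(x_i) for each violated example.
        grad = beta - (y[violated][:, None] * Phi[violated]).sum(axis=0)
        beta = beta - (lr / np.sqrt(t)) * grad # decaying step size
        obj = svm_objective(beta, Phi, y)
        if obj < best_obj:
            best, best_obj = beta.copy(), obj
    return best

# Toy data (assumed): 1-D points with a constant bias feature appended.
Phi = np.array([[2.0, 1.0], [3.0, 1.0], [-2.0, 1.0], [-3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
beta = train_linear_svm(Phi, y)
```

Because the objective is not differentiable where a margin equals exactly 1, the iterates do not decrease monotonically, which is why the sketch keeps track of the best iterate.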

slides from Dan Huttenlocher

What are Ideal Parts for Part-Based Object Models?

• Parts can/may

‣ be semantic body parts, e.g. for articulated human body pose estimation

‣ be feature clusters (typically many clusters) (e.g. ISM, constellation model, BoW)

‣ support learnability of discriminant appearance (e.g. DPM model)

‣ enable correspondence across 3D models

‣ ...

• in all those cases: the most important property is that “parts” facilitate correspondence across object instances

What are Ideal Parts for Part-Based Object Models?

• Multiple motivations for part based models exist:

‣ intuitiveness: semantic meaning of parts/attributes is attractive (e.g. enables use of language sources)

‣ learnability: sharing of parts/attributes across instances/classes

‣ scalability: transferability of parts/attributes across classes

‣ …

• in general, parts support learnability and scalability when they facilitate correspondence

‣ across object instances

‣ across object classes

‣ across modalities (e.g. from language to visual appearance)

‣ and semantics is only a secondary concern (for “intuitiveness”)

What are Ideal Parts for Part-Based Object Models?

the Good, the Bad, and the Ugly

thanks to: Micha Andriluka, Bastian Leibe, Sandra Ebert, Mario Fritz, Diane Larlus, Marcus Rohrbach, Paul Schnitzspan, Stefan Roth, Michael Stark, Michael Goesele