
DS-GA-1005: Inference and Representation, Fall 2015

Problem Set 5: Structure learning & Gaussian processes
Due: Tuesday, December 1, 2015 at 3pm (uploaded to NYU Classes).

Your submission should include a PDF file called “solutions.pdf” with your written solutions, separate output files, and all of the code that you wrote.

Important: See problem set policy on the course web site.

For this assignment, you are allowed to use basic graph packages (e.g., for representing and working with undirected graphs, or for finding the maximum spanning tree), but are not permitted to use any machine learning, graphical models, or probabilistic inference packages.

1. Tree factorization. Let T denote the edges of a tree-structured pairwise Markov random field with vertices V. For the special case of trees, prove that any distribution p_T(x) corresponding to a Markov random field over T admits a factorization of the form:

p_T(x) = \prod_{(i,j) \in T} \frac{p_T(x_i, x_j)}{p_T(x_i)\, p_T(x_j)} \prod_{j \in V} p_T(x_j),    (1)

where p_T(x_i, x_j) and p_T(x_i) denote the pairwise and singleton marginals of the distribution p_T, respectively.

Hint: consider the Bayesian network where you choose an arbitrary node to be the root and direct all edges away from the root. Show that this is equivalent to the MRF. Then, looking at the BN’s factorization, reshape it into the required form.
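For concreteness, here is the hint carried out on the smallest nontrivial case (our illustration, not part of the problem statement): rooting the chain x_1 - x_2 - x_3 at x_1 gives the Bayesian network factorization

p_T(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) = \frac{p(x_1, x_2)}{p(x_1)\, p(x_2)} \cdot \frac{p(x_2, x_3)}{p(x_2)\, p(x_3)} \cdot p(x_1)\, p(x_2)\, p(x_3),

which is exactly the form of Eq. 1; the general argument repeats this manipulation edge by edge down the rooted tree.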

2. Chow-Liu algorithm. When doing object detection in images, context can be very helpful. For example, if “car” and “road” are present in an image, then it is likely that “building” and “sky” are present as well (see Figure 1). In recent work, a tree-structured Markov random field (see Figure 2) was shown to be particularly useful for modeling the prior distribution over which objects are present in images and for using this prior to improve object detection [1].

You will replicate some of the results from [1] (it is not necessary to read this paper to complete this assignment). Specifically, you will implement the Chow-Liu algorithm (1968) for maximum likelihood learning of tree-structured Markov random fields [2]. See also Section 26.3 of Murphy’s book for a brief overview.

Figure 1: Using context within object detection for computer vision. [1]


Figure 2: Pairwise MRF of object class presences in images [1]. Red edges denote negative correlations between classes. The thickness of each edge represents the strength of the link. You will be learning this MRF in question 2(a).


The goal of learning is to find the tree-structured distribution p_T(x) that maximizes the log-likelihood of the training data D = {x}:

\max_T \max_{\theta_T} \sum_{x \in D} \log p_T(x; \theta_T).

Recall from Lecture 10 that for a fixed structure T, the maximum likelihood parameters for an MRF have a property called moment matching, meaning that the learned distribution will have marginals p_T(x_i, x_j) equal to the empirical marginals p̂(x_i, x_j) computed from the data D, i.e., p̂(x_i, x_j) = count(x_i, x_j)/|D|, where count(x_i, x_j) is the number of data points in D with X_i = x_i and X_j = x_j. Thus, using the factorization from Eq. 1, the learning task is reduced to solving

\max_T \sum_{x \in D} \log \left[ \prod_{(i,j) \in T} \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)} \prod_{j \in V} \hat{p}(x_j) \right].

We can simplify the quantity being maximized over T as follows (let N = |D|):

\begin{aligned}
&= \sum_{x \in D} \Bigg( \sum_{(i,j) \in T} \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)} + \sum_{j \in V} \log \hat{p}(x_j) \Bigg) \\
&= \sum_{(i,j) \in T} \sum_{x \in D} \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)} + \sum_{j \in V} \sum_{x \in D} \log \hat{p}(x_j) \\
&= \sum_{(i,j) \in T} \sum_{x_i, x_j} N \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)} + \sum_{j \in V} \sum_{x_j} N \hat{p}(x_j) \log \hat{p}(x_j) \\
&= N \Bigg( \sum_{(i,j) \in T} I_{\hat{p}}(X_i, X_j) - \sum_{j \in V} H_{\hat{p}}(X_j) \Bigg),
\end{aligned}

where I_{\hat{p}}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)} is the empirical mutual information of variables X_i and X_j, and H_{\hat{p}}(X_i) is the empirical entropy of variable X_i. Since the entropy terms are not a function of T, they can be ignored for the purpose of finding the maximum likelihood tree structure. We conclude that the maximum likelihood tree can be obtained by finding the maximum-weight spanning tree of the complete graph with edge weight I_{\hat{p}}(X_i, X_j) on each edge (i, j).

The Chow-Liu algorithm then consists of the following three steps:

(a) Compute each edge weight based on the empirical mutual information.

(b) Find a maximum-weight spanning tree (MST) via Kruskal’s or Prim’s algorithm.

(c) Output a pairwise MRF with edge potentials φ_ij(x_i, x_j) = p̂(x_i, x_j) / (p̂(x_i) p̂(x_j)) for each (i, j) ∈ T and node potentials φ_i(x_i) = p̂(x_i).
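As a concrete reference point, here is a minimal sketch of steps (a) and (b) in Python (our code, not a prescribed implementation; it assumes the data has been loaded as an N × M binary NumPy array, and the function names are ours):

    import numpy as np

    def mutual_information(data, i, j):
        # Empirical mutual information I(Xi; Xj) between binary columns i and j.
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_ij = np.mean((data[:, i] == a) & (data[:, j] == b))
                p_i = np.mean(data[:, i] == a)
                p_j = np.mean(data[:, j] == b)
                if p_ij > 0:
                    mi += p_ij * np.log(p_ij / (p_i * p_j))
        return mi

    def chow_liu_tree(data):
        # Kruskal's algorithm: scan edges by decreasing weight, keeping an edge
        # whenever it joins two different components (tracked with union-find).
        m = data.shape[1]
        edges = sorted(((mutual_information(data, i, j), i, j)
                        for i in range(m) for j in range(i + 1, m)), reverse=True)
        parent = list(range(m))

        def find(u):
            while parent[u] != u:
                parent[u] = parent[parent[u]]  # path compression
                u = parent[u]
            return u

        tree = []
        for w, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj:  # connects two components, so it cannot create a cycle
                parent[ri] = rj
                tree.append((i, j))
        return tree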

We have one random variable X_i ∈ {0, 1} for each object type (e.g., “car” or “road”) specifying whether this object is present in a given image. For this problem, you are provided with a matrix of dimension N × M, where N = 4367 is the number of images in the training set and M = 111 is the number of object types. This data is in the file “chowliu-input.txt”, and the file “names.txt” specifies the object names corresponding to each column.
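A minimal way to read the provided files (assuming whitespace-separated values, which is worth verifying against your copy of the data):

    import numpy as np

    data = np.loadtxt("chowliu-input.txt")    # shape (4367, 111), entries in {0, 1}
    names = open("names.txt").read().split()  # object name for each of the 111 columns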

Implement the Chow-Liu algorithm described above to learn the maximum likelihood tree-structured MRF from the data provided. Your code should output the MRF in the standard UAI format described here:

http://www.hlt.utdallas.edu/~vgogate/uai14-competition/modelformat.html
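A sketch of a writer for that format, based on our reading of the conventions at the link above (MARKOV preamble, variable count, cardinalities, clique count, one scope line per clique, then one table per clique in the same order); node_pot and edge_pot are our names for the potentials from step (c), so double-check the details against the format description:

    def write_uai(path, m, tree, node_pot, edge_pot):
        # m binary variables; tree is a list of (i, j) edges.
        with open(path, "w") as f:
            f.write("MARKOV\n")
            f.write("%d\n" % m)
            f.write(" ".join(["2"] * m) + "\n")
            f.write("%d\n" % (m + len(tree)))
            for i in range(m):                 # singleton clique scopes
                f.write("1 %d\n" % i)
            for i, j in tree:                  # pairwise clique scopes
                f.write("2 %d %d\n" % (i, j))
            for i in range(m):                 # node potential tables
                f.write("2\n%g %g\n" % (node_pot[i][0], node_pot[i][1]))
            for i, j in tree:                  # edge potential tables;
                f.write("4\n")                 # last scope variable changes fastest
                f.write(" ".join("%g" % edge_pot[(i, j)][a][b]
                                 for a in (0, 1) for b in (0, 1)) + "\n")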

3. A major computational burden in using Gaussian processes is inverting the covariance matrix, which is necessary for both parameter inference and for prediction. Is it possible to have a “precision function” p(x_i, x_j) that provides the (i, j)th entry of the precision matrix (inverse covariance matrix), analogous to the covariance function k(x_i, x_j), which takes as input the data points x_i and x_j and computes the (i, j)th entry of the covariance matrix? Please explain why or why not.

4. For four different covariance functions, make side-by-side plots showing, for different parameter values, (a) the shape of the covariance function and (b) samples from the Gaussian process with this covariance function. Make sure to include both compactly and non-compactly supported covariance functions, as well as differentiable and non-differentiable ones. (You should write code to do this; do not use an existing Gaussian process software package.)
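As a starting point, a minimal sketch of the sampling step for one kernel (the squared-exponential kernel and the grid below are our choices, not a prescribed answer; you still need to cover the four required kernel types):

    import numpy as np
    import matplotlib.pyplot as plt

    def squared_exponential(xi, xj, ell=1.0):
        # k(xi, xj) = exp(-(xi - xj)^2 / (2 ell^2)); differentiable, non-compact support.
        return np.exp(-0.5 * ((xi - xj) / ell) ** 2)

    x = np.linspace(-5, 5, 200)
    K = squared_exponential(x[:, None], x[None, :])
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # jitter for numerical stability

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(x, squared_exponential(x, 0.0))   # kernel shape as a function of distance
    ax1.set_title("covariance function k(x, 0)")
    for _ in range(3):                         # if K = L L^T, then L z ~ N(0, K)
        ax2.plot(x, L @ np.random.randn(len(x)))
    ax2.set_title("samples from the GP")
    plt.show()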

References

[1] Myung Jin Choi, Joseph J. Lim, Antonio Torralba, and Alan S. Willsky. Exploiting hierarchical context on a large database of object categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[2] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462–467, 1968.