experiments in graph-based semi-supervised learning for ...partha pratim talukdar * (search labs,...

Post on 16-Apr-2020

11 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Partha Pratim Talukdar* (Search Labs, MSR)

Fernando Pereira (Google)

Experiments in Graph-based Semi-Supervised Learning

for Class-Instance Acquisition

ACL, 2010 (Uppsala, Sweden) *Work done while at the Univ. of Pennsylvania

Class-Instance Acquisition

2

Class-Instance Acquisition

• Given an entity, assign human readable descriptors to it– e.g., Toyota is a car manufacturer, japanese

company, multinational company, ...

2

Class-Instance Acquisition

• Given an entity, assign human readable descriptors to it– e.g., Toyota is a car manufacturer, japanese

company, multinational company, ...

• Large scale, open domain

2

Class-Instance Acquisition

• Given an entity, assign human readable descriptors to it– e.g., Toyota is a car manufacturer, japanese

company, multinational company, ...

• Large scale, open domain

• Applications– Web search, Advertising, etc.

2

Graph-based Class-Instance Acquisition

3

Graph-based Class-Instance Acquisition

Bob Dylan (0.95)Johnny Cash (0.87)

Billy Joel (0.82)

Set 2Set 1Billy Joel (0.75)

Johnny Cash (0.73)

PatternTable Mining

3

Graph-based Class-Instance Acquisition

Bob Dylan (0.95)Johnny Cash (0.87)

Billy Joel (0.82)

Set 2Set 1Billy Joel (0.75)

Johnny Cash (0.73)

PatternTable Mining

3

Musician Singer

Graph-based Class-Instance Acquisition

Bob Dylan (0.95)Johnny Cash (0.87)

Billy Joel (0.82)

Set 2Set 1Billy Joel (0.75)

Johnny Cash (0.73)

PatternTable Mining

Cluster ID

3

Musician Singer

Graph-based Class-Instance Acquisition

Bob Dylan (0.95)Johnny Cash (0.87)

Billy Joel (0.82)

Set 2Set 1Billy Joel (0.75)

Johnny Cash (0.73)

PatternTable Mining

Cluster ID

ExtractionConfidence

3

Musician Singer

Graph-based Class-Instance Acquisition

Bob Dylan (0.95)Johnny Cash (0.87)

Billy Joel (0.82)

Set 2Set 1Billy Joel (0.75)

Johnny Cash (0.73)

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

PatternTable Mining

Cluster ID

ExtractionConfidence

3

Musician Singer

Graph-based Class-Instance Acquisition

Bob Dylan (0.95)Johnny Cash (0.87)

Billy Joel (0.82)

Set 2Set 1Billy Joel (0.75)

Johnny Cash (0.73)

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

SeedClasses

PatternTable Mining

Cluster ID

ExtractionConfidence

3

Musician Singer

Graph-based Class-Instance Acquisition

Bob Dylan (0.95)Johnny Cash (0.87)

Billy Joel (0.82)

Set 2Set 1Billy Joel (0.75)

Johnny Cash (0.73)

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

SeedClasses

PatternTable Mining

Cluster ID

ExtractionConfidence

3

Musician Singer

Can we infer that Bob Dylan is also a Musician, as that is missing in current

extractions?

Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

4

Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]

InitializationSet 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

Musician 1.0

Musician 1.0

Musician 1.0

SeedLabels

4

PredictedLabels

Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]

Iteration 1Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

Musician 1.0

Musician 0.8Musician 1.0

Musician 1.0

SeedLabels

4

PredictedLabels

Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]

Iteration 2Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

Musician 1.0

Musician 0.8Musician 1.0

Musician 1.0

SeedLabels

Musician 0.6

4

PredictedLabels

• Graph-based Class-Instance Acquisition– Overview & previous work

• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)

– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)

• Additional Semantic Constraints

• Summary

Outline

5

• Graph-based Class-Instance Acquisition– Overview & previous work

• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)

– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)

• Additional Semantic Constraints

• Summary

Outline

5

Comments on the Constructed Graph

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

6

Comments on the Constructed Graph

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.

6

Comments on the Constructed Graph

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.

Initial Clusters

6

Comments on the Constructed Graph

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.

Initial Clusters

6

Coupling Node:Force (softly) all instance nodes connected to it to have similar class labels, exploiting the Smoothness requirement.

Comments on the Constructed Graph

Set 2

Set 1

Bob Dylan

Johnny Cash

Billy Joel

0.95

0.87

0.82

0.73

0.75

Musician 1.0

Musician 1.0

Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.

Initial Clusters

Seed classes can be different from the cluster IDs

6

Coupling Node:Force (softly) all instance nodes connected to it to have similar class labels, exploiting the Smoothness requirement.

LP-ZGL[Zhu et al., ICML 2003]

• m labels

• W : edge weight matrix

• Yul: weight of label l on node u

• Yul: seed weight of label l on node u

• S: diagonal matrix, non-zero for seed nodes

LP-ZGL[Zhu et al., ICML 2003]

• m labels

• W : edge weight matrix

• Yul: weight of label l on node u

• Yul: seed weight of label l on node u

• S: diagonal matrix, non-zero for seed nodes

LP-ZGL[Zhu et al., ICML 2003]

arg minY

m!

l=1

!

u,v

Wuv(Yul ! Yvl)2; s.t. SYl = SYl

• m labels

• W : edge weight matrix

• Yul: weight of label l on node u

• Yul: seed weight of label l on node u

• S: diagonal matrix, non-zero for seed nodes

LP-ZGL[Zhu et al., ICML 2003]

arg minY

m!

l=1

!

u,v

Wuv(Yul ! Yvl)2; s.t. SYl = SYl

Smooth

• m labels

• W : edge weight matrix

• Yul: weight of label l on node u

• Yul: seed weight of label l on node u

• S: diagonal matrix, non-zero for seed nodes

LP-ZGL[Zhu et al., ICML 2003]

arg minY

m!

l=1

!

u,v

Wuv(Yul ! Yvl)2; s.t. SYl = SYl

Match Seeds (hard)Smooth

Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]

8

Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]

8

argminY

m+1�

l=1

��SY l − SY l�2 + µ1

u,v

Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2

• m labels, +1 dummy label

• M = W� +W is the symmetrized weight matrix

• Y vl: weight of label l on node v

• Y vl: seed weight for label l on node v

• S: diagonal matrix, nonzero for seed nodes

• Rvl: regularization target for label l on node v

Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]

8

argminY

m+1�

l=1

��SY l − SY l�2 + µ1

u,v

Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2

• m labels, +1 dummy label

• M = W� +W is the symmetrized weight matrix

• Y vl: weight of label l on node v

• Y vl: seed weight for label l on node v

• S: diagonal matrix, nonzero for seed nodes

• Rvl: regularization target for label l on node v

Match Seeds (soft)

Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]

8

argminY

m+1�

l=1

��SY l − SY l�2 + µ1

u,v

Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2

• m labels, +1 dummy label

• M = W� +W is the symmetrized weight matrix

• Y vl: weight of label l on node v

• Y vl: seed weight for label l on node v

• S: diagonal matrix, nonzero for seed nodes

• Rvl: regularization target for label l on node v

Match Seeds (soft) Smooth

Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]

8

argminY

m+1�

l=1

��SY l − SY l�2 + µ1

u,v

Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2

• m labels, +1 dummy label

• M = W� +W is the symmetrized weight matrix

• Y vl: weight of label l on node v

• Y vl: seed weight for label l on node v

• S: diagonal matrix, nonzero for seed nodes

• Rvl: regularization target for label l on node v

Match Seeds (soft) SmoothMatch Priors(Regularizer)

Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]

8

argminY

m+1�

l=1

��SY l − SY l�2 + µ1

u,v

Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2

• m labels, +1 dummy label

• M = W� +W is the symmetrized weight matrix

• Y vl: weight of label l on node v

• Y vl: seed weight for label l on node v

• S: diagonal matrix, nonzero for seed nodes

• Rvl: regularization target for label l on node v

MAD has extra regularization compared

to LP-ZGL

Match Seeds (soft) SmoothMatch Priors(Regularizer)

MAD (contd.)

9

Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�

Zv ← Svv + µ1�

u �=v Mvu + µ2 ∀v ∈ Vrepeat

for all v ∈ V doY v ← 1

Zv

�(SY )v + µ1Mv·Y + µ2Rv

end foruntil convergence

MAD (contd.)

9

Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�

Zv ← Svv + µ1�

u �=v Mvu + µ2 ∀v ∈ Vrepeat

for all v ∈ V doY v ← 1

Zv

�(SY )v + µ1Mv·Y + µ2Rv

end foruntil convergence

• Easily Parallelizable: Scalable

MAD (contd.)

9

Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�

Zv ← Svv + µ1�

u �=v Mvu + µ2 ∀v ∈ Vrepeat

for all v ∈ V doY v ← 1

Zv

�(SY )v + µ1Mv·Y + µ2Rv

end foruntil convergence

• Easily Parallelizable: Scalable• Importance of a node can be discounted

MAD (contd.)

9

Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�

Zv ← Svv + µ1�

u �=v Mvu + µ2 ∀v ∈ Vrepeat

for all v ∈ V doY v ← 1

Zv

�(SY )v + µ1Mv·Y + µ2Rv

end foruntil convergence

• Easily Parallelizable: Scalable• Importance of a node can be discounted• Convergence guarantee under mild conditions

Experiments with Public Datasets

10

Experiments with Public Datasets

• Previous research used proprietary datasets– difficult to reproduce and extend

10

Experiments with Public Datasets

• Previous research used proprietary datasets– difficult to reproduce and extend

• Public datasets are used in the paper– Freebase: relational tables from multiple sources

– TextRunner (Ritter+, 2009): Open-domain IE System

– YAGO (Suchanek+, 2007): KB curated from Wikipedia and WordNet

– Gold standard hypernyms: [Pantel+, 2009], WordNet

10

Experiments with Public Datasets

• Previous research used proprietary datasets– difficult to reproduce and extend

• Public datasets are used in the paper– Freebase: relational tables from multiple sources

– TextRunner (Ritter+, 2009): Open-domain IE System

– YAGO (Suchanek+, 2007): KB curated from Wikipedia and WordNet

– Gold standard hypernyms: [Pantel+, 2009], WordNet

• Available at: http://www.talukdar.net/datasets/class_inst/10

Graph Stats

11

Statistics of Graphs used in Experiments

Evaluation Metric: Mean Reciprocal Rank (MRR)

12

MRR =1

|test-set|�

v∈test-set

1

rankv(class(v))

Evaluation Metric: Mean Reciprocal Rank (MRR)

12

MRR =1

|test-set|�

v∈test-set

1

rankv(class(v))

Billy JoelLinguist 0.4Musician 0.3

...

Gold Label: MusicianMRR: 0.5 (1/2)

Graph-based SSL Comparisons (1)

13

Graph-based SSL Comparisons (1)

13

0.15

0.2

0.25

0.3

0.35

170 x 2 170 x 10

TextRunner Graph, 170 WordNet Classes

Me

an

Re

cip

roca

l R

an

k (

MR

R)

Amount of Supervision

LP-ZGL Adsorption MAD

Graph with175k nodes,529k edges.

Graph-based SSL Comparisons (2)

14

0.25

0.285

0.32

0.355

0.39

192 x 2 192 x 10

Freebase-2 Graph, 192 WordNet Classes

Me

an

Re

cip

roca

l R

an

k (

MR

R)

Amount of Supervision

LP-ZGL Adsorption MAD

Graph with303k nodes,2.3m edges.

When is MAD most effective?

15

When is MAD most effective?

15

0

0.1

0.2

0.3

0.4

0 7.5 15 22.5 30

Re

lati

ve I

ncr

eas

e i

n M

RR

by M

AD

ove

r L

P-Z

GL

Average Degree

When is MAD most effective?

15

0

0.1

0.2

0.3

0.4

0 7.5 15 22.5 30

Re

lati

ve I

ncr

eas

e i

n M

RR

by M

AD

ove

r L

P-Z

GL

Average Degree

MAD is most effective in dense graphs, where there is greater need for

regularization.

• Graph-based Class-Instance Acquisition– Overview & previous work

• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)

– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)

• Additional Semantic Constraints

• Summary

Outline

16

• Graph-based Class-Instance Acquisition– Overview & previous work

• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)

– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)

• Additional Semantic Constraints

• Summary

Outline

16

Can Additional Semantic Constraints Help ?

17

Can Additional Semantic Constraints Help ?

17

Set 2

Set 1

Isaac Newton

Johnny Cash

Bob Dylan

Can Additional Semantic Constraints Help ?

17

Instances with shared

attributes are likely to be

from the same class.

Set 2

Set 1

Isaac Newton

Johnny Cash

Bob Dylan

has_attribute-albums

Can Additional Semantic Constraints Help ?

17

Instances with shared

attributes are likely to be

from the same class.

Set 2

Set 1

Isaac Newton

Johnny Cash

Bob Dylan

has_attribute-albums

Can Additional Semantic Constraints Help ?

17

Instances with shared

attributes are likely to be

from the same class.

Set 2

Set 1

Isaac Newton

Johnny Cash

Bob Dylan

Graph-based representation makes it easy to incorporate such constraints!

Better Classes with YAGO Attributes

18

Better Classes with YAGO Attributes

18 0.3

0.338

0.375

0.413

0.45

170 WordNet Classes, 10 Seeds per Class, using Adsorption

Me

an

Re

cip

roca

l R

an

k (

MR

R)

TextRunner GraphYAGO GraphTextRunner + YAGO Graph

Better Classes with YAGO Attributes

18 0.3

0.338

0.375

0.413

0.45

170 WordNet Classes, 10 Seeds per Class, using Adsorption

Me

an

Re

cip

roca

l R

an

k (

MR

R)

TextRunner GraphYAGO GraphTextRunner + YAGO Graph

Graph constructed from TextRunner (UWash) output, 175k nodes, 529k edges

Better Classes with YAGO Attributes

18 0.3

0.338

0.375

0.413

0.45

170 WordNet Classes, 10 Seeds per Class, using Adsorption

Me

an

Re

cip

roca

l R

an

k (

MR

R)

TextRunner GraphYAGO GraphTextRunner + YAGO Graph

Graph constructed from output of YAGO Knowledge Base, 142k nodes, 777k edges

Better Classes with YAGO Attributes

18 0.3

0.338

0.375

0.413

0.45

170 WordNet Classes, 10 Seeds per Class, using Adsorption

Me

an

Re

cip

roca

l R

an

k (

MR

R)

TextRunner GraphYAGO GraphTextRunner + YAGO Graph

Combined graph, with 237k nodes, 1.3m edges

Better Classes with YAGO Attributes

18 0.3

0.338

0.375

0.413

0.45

170 WordNet Classes, 10 Seeds per Class, using Adsorption

Me

an

Re

cip

roca

l R

an

k (

MR

R)

TextRunner GraphYAGO GraphTextRunner + YAGO Graph

Better Classes with YAGO Attributes

18 0.3

0.338

0.375

0.413

0.45

170 WordNet Classes, 10 Seeds per Class, using Adsorption

Me

an

Re

cip

roca

l R

an

k (

MR

R)

TextRunner GraphYAGO GraphTextRunner + YAGO Graph

Additional semantic constraints help improve performance significantly.

Classes for Attributes

• Classes assigned to attribute nodes

19

has_isbn

wordnet_bookwordnet_maganize

Classes for Attributes

• Classes assigned to attribute nodes

19

has_isbn

wordnet_bookwordnet_maganize

Classes for Attributes

• Classes assigned to attribute nodes

19

has_isbn

wordnet_bookwordnet_maganize

Evidence that attributes propagate right classes.

Summary

20

Summary

• Compared three graph-based SSL algorithms for class-instance acquisition– MAD is consistently most effective, particularly in

dense graphs

20

Summary

• Compared three graph-based SSL algorithms for class-instance acquisition– MAD is consistently most effective, particularly in

dense graphs

• Better class-instance acquisition with additional semantic constraints– easy to add such constraints in graph-based

representation

20

Summary

• Compared three graph-based SSL algorithms for class-instance acquisition– MAD is consistently most effective, particularly in

dense graphs

• Better class-instance acquisition with additional semantic constraints– easy to add such constraints in graph-based

representation

• Datasets available: www.talukdar.net/datasets/class_inst/

– for reproducibility and extension

20

Thank You!http://www.talukdar.net/datasets/class_inst/

top related