1 discriminative training of clustering functions theory and experiments with entity identification...

29
1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Upload: anna-parrish

Post on 18-Jan-2016

217 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

1

Discriminative Training ofClustering Functions

Theory and Experiments with EntityIdentification

Xin Li & Dan Roth

University of Illinois, Urbana-Champaign

Page 2: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 2

Outline

Clustering Current approaches Some problems

Making Clustering a Learning problem

Supervised Discriminative Clustering Framework

The Reference Problem: Entity Identification

within & across documents.

Page 3: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 3

Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission 's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.

Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘

Document 3: David Kennedy was born in Leicester, England in 1959.  …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996). 

Kennedy

The Reference Problem

Page 4: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 4

Entity Identification in Text

Goal: Given names, within or across documents, identify real-world entities behind them.

Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities.

Approaches: A generative Model [Li, Morie, Roth, NAACL’04] A discriminative approach [Li, Morie, Roth, AAAI’04]Other works [on citation, and

more: Milche et. al; Bilenko et. al.,…]

6

Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.

Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedyconfronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘

Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).

Kennedy

Intuitively, a discriminative approach, requires using some similarity measure between names, followed up by clustering into clusters that represent entities.

Page 5: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 5

Clustering

An optimization procedure that takes

A collection of data elements A distance (similarity) measure on the space of

data elements A Partition Algorithm

Attempts to:

Optimize some quality with respect to the given distance metric.

Page 6: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 6

Clustering: Example (k=4)

Page 7: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 7

An Optimization Problem:

Data: X = {x1,x2,…} Cluster Names: C = {1,2,3,…,K} The Euclidean Distance: d(x1,x2) = [ (x1-x2)T(x1-x2)]1/2

Find a mapping f: X C That minimizes:

j x 2 Cj

d(x,j )2

Where j = 1/m x 2 Cj x mean of elements in the k-th

cluster

Example: K-means Clustering

Page 8: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 8

Many NLP Applications

Class-based language models: group similar words together based on their semantics

(Dagan et. al 99, Lee et. al ; Pantel and Lin, 2002).

Document categorization; and topic identification (Karypis, Han 99,02).

Co-reference resolution – build coreference chain of noun phrases (Cardie,

Wagstaff 99).

In all cases – fixed metric distance; tuned for the application and the data. (and the algorithm?)

Page 9: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 9

Clustering: Metric and Algorithm

d1(x,x’) = [(f1 - f1’) 2+(f2 - f2’) 2]1/2 d2(x,x’) = |(f1+ f2)-(f1’+f2’)|

(a) Single-Linkage with Euclidean

(b) K-Means with Euclidean

(c) K-Means with a Linear Metric

There is no ‘universal’ distance metric that is appropriate for all clustering algorithms

How do we make sure we have an appropriate one, that reflects the task/designer intentions?

Page 10: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 10

Additional information in Clustering

Page 11: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 11

Traditional Clustering Framework

Typically, unsupervised; no learning. More recently: work on metric learning with

supervision: [Bilenko&Mooney 03, 04, Xing et. al.’03, Schultz & Joachims’03,

Bach & Jordan03] Learning a metric; then cluster Learning while clustering (algorithm specific)

A partition function

h(S) = Ad(S)

unlabeled data set S

partition h(S)

distance metric d

clustering algorithm A+

K-meansX = {x1,x2,…}, C = {c1,c2,

…,ck}

Euclidean Distance:d(x, x’) = [(x- x’)T(x- x’)]1/2

Page 12: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 12

labeled data set S

supervisedlearner

Training Stage:

Goal: h*=argmin errS(h,p)

distance metric d

clustering algorithm A+

unlabeled data set S’

partition h(S’)

Application Stage: h(S’ )

A partition function

h(S) = Ad(S)

Supervised Discriminative Clustering (SDC)

Incorporates supervision directly into metric training process; Training is driven by true clustering error Computed via the chosen data partition algorithm.

Page 13: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 13

Elements of SDC: Partition Function and Error

Goal: A partition function h maps a set of data points S to a partition h(S) of S. (outcome of a clustering

algorithm) [Note difference from multi-class classification]

The partition function h is a function of the parameterized

distance metric: d(x1,x2) = wi |xi1- xi

2| Error: Given a labeled data set S; p(S) = {(xi,ci)}1

m, the correct partition, and

a fixed clustering algorithm A, the training process attempts to find d*, minimizing the

clustering error:

d*= argmind errS(h,p), where h(S)=Ad(S).Optimal (given) PartitionLearned Partition

Page 14: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 14

A Supervised Clustering Error

errS(h,p) = 1/|S|2 ij [d(xi,xj)*Aij +(D-d(xi,xj))*Bij]

(as opposed to a quality function that depends only on the distance)

Two types of errors in pairwise prediction:(xi,xj) h `together’ or ‘apart’

False negative: Aij = I [p(xi)=p(xj) & h(xi)h(xj)],

False positive: Bij = I [p(xi) p(xj) & h(xi)=h(xj)],

D = maxij d(xi,xj ) . (See paper for a comparison with other error functions)

Page 15: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 15

Training the distance function

Initialize the distance Metric d

Cluster S using algorithm h=Ad

Evaluate ErrS(h,p)

Update d

S

A gradient descent based algorithm

Page 16: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 16

Training the distance function

Gradient descent Alg.

Learns a metric Iteratively by adjusting the parameter vector by a small amount in the direction that would most reduce the error.

Page 17: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 17

Entity Identification in Text

Goal: Given names, within or across documents, identify real-world entities behind them.

Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities.

Approaches: A generative Model [Li, Morie, Roth,

NAACL’04] A discriminative approach [Li, Morie, Roth, AAAI’04]

6

Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.

Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedyconfronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘

Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).

Kennedy

Page 18: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 18

Feature Extraction: (John F. Kennedy, President Kennedy )= (1, 2 , …)

1. Fixed distance: distance (similarity) metric d for names. d (John F. Kennedy, President Kennedy) 0.6 d (Chicago Cubs, Cubs) 0.6 d (United States, USA) 0.7

2. A learned distance function parameterized as a Linear function over features (kernelized):

d(John F. Kennedy, President Kennedy ) = wi i Make it a pairwise classifier:

h (John F. Kennedy, President Kennedy ) = `together’ iff wi i <= 0.5

The distance function can be trained separately, to optimize partition quality, or via SDC, to minimize Error. (via gradient descent)

Parameterized Distance Metrics for Name Matching

John F. Kennedy President Kennedy?

Page 19: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 19

Features

Relational features that are extracted from a pair of strings, taking into account relative positions of tokens, substring relations, etc.

Page 20: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 20

Experimental Setting Names of people, locations and organizations.

John F. Kennedy, Bush, George W. Bush U.S.A, United States, and America University of Illinois, U. of I., IBM, International Business

Machines. 300 randomly picked New York Times news articles. 8,600 names annotated by a named entity tagger

and manually verified. Training sets contain names labeled with its global

entity.John F. Kennedy Kennedy1

President Kennedy Kennedy1,

David Kennedy Kennedy2.

Data is available from http://l2r.cs.uiuc.edu/~cogcomp/

Page 21: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 21

Gain from Metric Learning while Clustering

SoftTFIDF (Cohen et. al): Fixed metric LMR (Li, Morie, Roth, AAAI’04) learned metric via a pairwise classifier;

relational features extracted from pairs of strings; feedback from pairwise labels; SDC: trains a linear weighted distance metric for the single-link clustering

algorithm with labeled pairs of 600 names.

Page 22: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 22

Influence of Data Size

Page 23: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 23

Different Clustering Algorithms

Difference across clustering algorithm is not as significant as difference obtained from learning a good metric via SDC.

Page 24: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 24

Summary

A framework for Metric Learning for Clustering that is guided by global supervision with clustering as part of the feedback loop.

A parameterized distance metric is learned in a way that depends on the specific clustering algorithm used.

Significant improvement shown on the Reference Problem: Entity Identification Across documents.

Page 25: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 25

K=16

Intuition behind SDC

d

Page 26: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 26

Relational features: do not depend on specific tokens in the two names, but depend on some abstraction over tokens.

Honorific Equal: Mr., Mrs., President, Prof. Nickname: Thomas, Tom Edit Distance.

Relational Features

John DavisKennedy

2

1

1

2 3

John Kennedy

Toward Concept-Based Text Understanding and Mining

Page 27: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 27

How to employ transitivity between names ?

Toward Concept-Based Text Understanding and Mining

Michael Jordan

Michael Jordan,

Clustering: splitting a set of names. Distance Metrics: Edit distance, SoftTFIDF, Jora-Winkler

Clustering Algorithms: Single-Link, Complete-Link, K-means, graph cut.

Page 28: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 28

Outline

Clustering Current approaches Some problems

Making Clustering a Learning problem Supervised Discriminative Clustering Framework

The Reference Problem: Entity Identification in within & across document.

Page 29: 1 Discriminative Training of Clustering Functions Theory and Experiments with Entity Identification Xin Li & Dan Roth University of Illinois, Urbana-Champaign

Page 29

Entity Identification in Text

Goal: Given names, within or across documents, identify real-world entities behind them.

Problem Definition: Given a set of names and their semantic types, [people], [locations] [Organizations] partition them into groups that refer to different entities.

Approaches: A generative Model [Li, Morie, Roth,

NAACL’04] A discriminative approach [Li, Morie, Roth, AAAI’04]

6

Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding ``no persuasive evidence'' to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was ``probably'' assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.

Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedyconfronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, ``I do not speak for my church on public matters, and the church does not speak for me.'‘

Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).

Kennedy

Challenge: millions of entities in the world, but in training, we can only see names of a limited number of entities.