
Relation Extraction for Academic Collaboration

10-709 Project Proposal

Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang

Jan 26, 2006

Relation Extraction

We want: CollaboratesWith( <x>, <y> ) where <x>, <y> are of type ‘person’

Two redundant sources of information for co-training:

- Extraction Patterns to find Relations expressed in surface text or tables on the web

- Rote learner keeps track of Relations it is told about, aggregating evidence in the form of confidence scores when Relations are multiply-extracted from different sources

Sketch of a Co-Training Algorithm

Let: R = a set of Relations; P = a set of Extraction Patterns

Initialize: R <- seed Relations, P <- seed Patterns

do, until termination condition is reached:

1. For each p in P, where p is of the form ( “before context”, <x>, “between context”, <y>, “after context” ), query Google using the literal context strings in the Pattern to retrieve text windows from which a set of Relations ( <x>, <y> ) can be extracted.

2. For each new Relation, compute new confidence score and add it to R, combining evidence if necessary.

3. Weed out any r in R whose confidence is below a threshold or, optionally, any r whose arguments are unlikely to be of type person.

4. For each r in R, where r is of the form ( <x>, <y> ), query Google to retrieve a set of text windows containing the strings <x> and <y>. From these text windows, generalize a set of Patterns ( “before context”, <x>, “between context”, <y>, “after context” ).

5. For each new Pattern, compute new confidence score and add it to P, combining evidence if necessary.

6. Weed out any p in P whose confidence is below a threshold.
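The six steps above can be sketched in Python as follows. This is only a minimal illustration under loud assumptions: a three-sentence in-memory CORPUS stands in for Google, the names in it are invented, person names are matched with a crude two-word regular expression, generalized Patterns keep only the "between" context, evidence is combined with a simplified MYCIN-style update, and the loop stops when R and P stop growing. None of these choices come from the proposal, which leaves the termination condition and the confidence metric open.

```python
import re

# ASSUMPTION: a toy in-memory corpus replaces the Google queries of steps 1 and 4.
CORPUS = [
    "Alice Smith in collaboration with Bob Lee built the system.",
    "Alice Smith co-authored with Bob Lee a short survey.",
    "Carol Diaz co-authored with Dan Wu a small parser.",
]
NAME = r"([A-Z][a-z]+ [A-Z][a-z]+)"  # crude two-word person-name matcher


def extract(pattern):
    """Step 1: apply a (before, between, after) Pattern to every text window."""
    before, between, after = pattern
    regex = (re.escape(before) + NAME + r"\s+" + re.escape(between)
             + r"\s+" + NAME + re.escape(after))
    return [m.groups() for text in CORPUS for m in re.finditer(regex, text)]


def generalize(relation):
    """Step 4: turn the text between <x> and <y> into a new Pattern."""
    x, y = relation
    regex = re.escape(x) + r"\s+(.+?)\s+" + re.escape(y)
    found = []
    for text in CORPUS:
        m = re.search(regex, text)
        if m:
            found.append(("", m.group(1), ""))  # before/after contexts left empty
    return found


def combine(old, evidence):
    """Simplified MYCIN-style update: move `evidence` of the way from old to 1.0."""
    return old + (1.0 - old) * evidence


def co_train(seed_relations, seed_patterns, threshold=0.3, max_iters=5):
    R, P = dict(seed_relations), dict(seed_patterns)
    seen = set()  # (pattern, relation) pairs already counted as evidence
    for _ in range(max_iters):
        sizes_before = (len(R), len(P))
        for p, p_conf in list(P.items()):            # steps 1-2
            for r in extract(p):
                if (p, r) not in seen:
                    seen.add((p, r))
                    R[r] = combine(R.get(r, 0.0), p_conf)
        R = {r: c for r, c in R.items() if c >= threshold}   # step 3
        for r, r_conf in list(R.items()):            # steps 4-5
            for p in generalize(r):
                if (p, r) not in seen:
                    seen.add((p, r))
                    P[p] = combine(P.get(p, 0.0), r_conf)
        P = {p: c for p, c in P.items() if c >= threshold}   # step 6
        if (len(R), len(P)) == sizes_before:         # assumed termination condition
            break
    return R, P


R, P = co_train({}, {("", "in collaboration with", ""): 0.9})
print(sorted(R))
print(sorted(P))
```

Starting from the single seed Pattern, the sketch first extracts one Relation, generalizes a second Pattern from it, and uses that Pattern to reach a second Relation. Tracking (pattern, relation) pairs in `seen` is one way to avoid counting the same evidence twice across iterations.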

Coverage as a Confidence Measure

Confidence for an Extraction Pattern p

- For each r in R, query Google to see if p can extract r

- Coverage is the number of relations in R extractable by p divided by |R|

Confidence for a Relation r

- For each p in P, query Google to see if p can extract r

- Similarly, coverage is the number of patterns in P that can extract r divided by |P|
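As a rough illustration of the two coverage measures, the sketch below computes both ratios. The `can_extract` callback is hypothetical: it stands for "query Google to see if p can extract r" and is replaced here by a lookup in a hand-written set, with placeholder pattern and relation names that are not from the proposal.

```python
def pattern_coverage(pattern, relations, can_extract):
    """Fraction of relations in R that the pattern can extract."""
    relations = list(relations)
    if not relations:
        return 0.0
    hits = sum(1 for r in relations if can_extract(pattern, r))
    return hits / len(relations)


def relation_coverage(relation, patterns, can_extract):
    """Fraction of patterns in P that can extract the relation."""
    patterns = list(patterns)
    if not patterns:
        return 0.0
    hits = sum(1 for p in patterns if can_extract(p, relation))
    return hits / len(patterns)


# ASSUMPTION: placeholder data; a real system would issue web queries instead.
PATTERNS = ["p1", "p2"]
RELATIONS = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
EXTRACTABLE = {
    ("p1", ("a", "b")), ("p1", ("c", "d")), ("p1", ("e", "f")),
    ("p2", ("a", "b")),
}


def can_extract(pattern, relation):
    return (pattern, relation) in EXTRACTABLE


print(pattern_coverage("p1", RELATIONS, can_extract))         # 3 of 4 relations
print(relation_coverage(("a", "b"), PATTERNS, can_extract))   # 2 of 2 patterns
print(relation_coverage(("g", "h"), PATTERNS, can_extract))   # 0 of 2 patterns
```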

Combining Confidence Scores

Given:

- a Relation with confidence c

- extracted again; pattern has confidence p

- new confidence score of s (may be < c)

One idea: MYCIN Calculus [Shortliffe 76]

new confidence = c + ( 1 - c ) * p * s

Intuitively, going a fraction p * s of the way from old confidence c to maximal confidence 1.0

Another idea:

new confidence = ( c + p * s ) / ( 1 + c * p * s )

Confidences increase monotonically, stay between 0 and 1.0, but never reach 1.0 (as long as c and p * s are themselves below 1.0)
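The two update rules can be compared side by side with a small sketch. The function names `mycin_update` and `ratio_update` are made up for this illustration, and the input values are arbitrary.

```python
def mycin_update(c, p, s):
    """Move a fraction p * s of the way from c toward 1.0."""
    return c + (1.0 - c) * p * s


def ratio_update(c, p, s):
    """Alternative rule: (c + p*s) / (1 + c*p*s)."""
    evidence = p * s
    return (c + evidence) / (1.0 + c * evidence)


# Arbitrary example values: old confidence 0.5, pattern confidence 0.8, new score 0.5.
print(mycin_update(0.5, 0.8, 0.5))   # 0.5 + 0.5 * 0.4
print(ratio_update(0.5, 0.8, 0.5))   # 0.9 / 1.2

# Repeated extraction by the same pattern: confidence grows but stays below 1.0.
history = [0.5]
for _ in range(5):
    history.append(ratio_update(history[-1], 0.8, 0.5))
print(history)
```

With these inputs the MYCIN-style rule gives about 0.70 and the alternative about 0.75, so the second rule rewards the same evidence slightly more at this operating point.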

Example Seed Data for Co-Training

Extraction Patterns

- <x> “in collaboration with” <y>

- <x> “joint work with” <y>

- Patterns that extract information from tables, lists of citations, etc...

Relations

- CollaboratesWith( mbilotti, ehn )

- CollaboratesWith( jbetter, teruko )

- ...

Extraction Pattern Examples

Query: “in collaboration with” site:web.mit.edu/biology/www
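A query like the one above, and the extraction step that follows it, might be sketched as below. The helper names `build_query` and `apply_pattern` are hypothetical, the sample text window and the people in it are invented, and the name matcher is a crude capitalized-word heuristic rather than real Named Entity Recognition. No search engine is actually called.

```python
import re

# ASSUMPTION: a person name is two or more capitalized words in a row.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"


def build_query(context, site=""):
    """Quote the literal context string and optionally restrict to a site."""
    query = f'"{context}"'
    if site:
        query += f" site:{site}"
    return query


def apply_pattern(between, window):
    """Extract ( <x>, <y> ) pairs from one text window using the between context."""
    regex = rf"({NAME})\s+{re.escape(between)}\s+({NAME})"
    return [(m.group(1), m.group(2)) for m in re.finditer(regex, window)]


print(build_query("in collaboration with", "web.mit.edu/biology/www"))

# Invented text window, standing in for a search result snippet.
window = "The assay was developed by Jane Roe in collaboration with John Doe at the institute."
print(apply_pattern("in collaboration with", window))
```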

Open Questions

Additional useful sources of information:

- Anchor text and link structure: advisor-advisee cross-refs, department or lab organization

- Heuristics or Named Entity Recognition to weed out relation arguments that are not people

Confidence metrics for patterns, relations

Methods of combining confidence scores

Termination condition
