Relation Extraction for Academic Collaboration
10-709 Project Proposal
Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang
Jan 26, 2006
Relation Extraction
We want: CollaboratesWith( <x>, <y> ) where <x>, <y> are of type ‘person’
Two redundant sources of information for co-training:
- Extraction Patterns find Relations expressed in surface text or tables on the web
- A rote learner keeps track of the Relations it is told about, aggregating evidence in the form of confidence scores when a Relation is extracted multiple times from different sources
Sketch of a Co-Training Algorithm
Let: R = a set of Relations; P = a set of Extraction Patterns
Initialize: R <- seed Relations, P <- seed Patterns
Do, until a termination condition is reached:
1. For each p in P, where p is of the form ( “before context”, <x>, “between context”, <y>, “after context” ), query Google using the literal context strings in the Pattern to retrieve text windows from which a set of Relations ( <x>, <y> ) can be extracted.
2. For each new Relation, compute a new confidence score and add it to R, combining evidence if necessary.
3. Weed out any r in R whose confidence is below a threshold and, optionally, any r whose arguments are unlikely to be of type person.
4. For each r in R, where r is of the form ( <x>, <y> ), query Google to retrieve a set of text windows containing the strings <x> and <y>. From these text windows, generalize a set of Patterns ( “before”, <x>, “between”, <y>, “after” ).
5. For each new Pattern, compute a new confidence score and add it to P, combining evidence if necessary.
6. Weed out any p in P whose confidence is below a threshold.
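The six steps above can be sketched roughly as follows. This is a minimal illustration, not the proposed system. The `search` callable is a hypothetical stand-in for the Google queries, `NAME` is a crude capitalized-word matcher standing in for person detection, induced Patterns keep only the "between" context, `combine` uses a MYCIN-style update with the extractor's confidence as evidence, and `max_rounds` is a placeholder for the termination condition, which the proposal leaves open.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Relation = Tuple[str, str]        # (x, y)
Pattern = Tuple[str, str, str]    # (before, between, after) literal contexts
Search = Callable[[List[str]], List[str]]  # hypothetical: terms -> text windows

# Assumption: a run of capitalized words is treated as a person name.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def combine(old: float, evidence: float) -> float:
    """MYCIN-style update: go `evidence` of the way from old toward 1.0."""
    return old + (1.0 - old) * evidence


def extract_relations(p: Pattern, windows: List[str]) -> List[Relation]:
    """Step 1: apply one Pattern to text windows, collecting (x, y) pairs."""
    before, between, after = (re.escape(s) for s in p)
    regex = re.compile(before + NAME + r"\s+" + between + r"\s+" + NAME + after)
    return [(m.group(1), m.group(2)) for w in windows for m in regex.finditer(w)]


def induce_patterns(r: Relation, windows: List[str]) -> List[Pattern]:
    """Step 4: keep the text found between x and y as a new Pattern."""
    x, y = (re.escape(s) for s in r)
    regex = re.compile(x + r"\s+(.+?)\s+" + y)
    return [("", m.group(1), "") for w in windows for m in regex.finditer(w)]


def co_train(
    seed_relations: Dict[Relation, float],
    seed_patterns: Dict[Pattern, float],
    search: Search,
    threshold: float = 0.5,
    max_rounds: int = 3,
) -> Tuple[Dict[Relation, float], Dict[Pattern, float]]:
    R, P = dict(seed_relations), dict(seed_patterns)
    # Each (pattern, relation) pairing counts as evidence only once per direction.
    seen_r: Set[Tuple[Pattern, Relation]] = set()
    seen_p: Set[Tuple[Pattern, Relation]] = set()
    for _ in range(max_rounds):
        # Steps 1-2: extract Relations with each Pattern and update confidences.
        for p, p_conf in list(P.items()):
            windows = search([s for s in p if s])
            for r in extract_relations(p, windows):
                if (p, r) not in seen_r:
                    seen_r.add((p, r))
                    R[r] = combine(R.get(r, 0.0), p_conf)
        # Step 3: drop low-confidence Relations.
        R = {r: c for r, c in R.items() if c >= threshold}
        # Steps 4-5: induce Patterns from each Relation and update confidences.
        for r, r_conf in list(R.items()):
            for p in induce_patterns(r, search(list(r))):
                if (p, r) not in seen_p:
                    seen_p.add((p, r))
                    P[p] = combine(P.get(p, 0.0), r_conf)
        # Step 6: drop low-confidence Patterns.
        P = {p: c for p, c in P.items() if c >= threshold}
    return R, P
```

For experimentation, `search` can be backed by a small in-memory list of sentences (return those containing every term) before wiring in a real search service.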
Coverage as a Confidence Measure
Confidence for an Extraction Pattern p:
- For each r in R, query Google to see if p can extract r
- Coverage is the number of Relations in R extractable by p, divided by |R|
Confidence for a Relation r:
- For each p in P, query Google to see if p can extract r
- Similarly, coverage is the number of Patterns in P that can extract r, divided by |P|
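As a small sketch of the two coverage measures: the `can_extract` predicate below is a hypothetical stand-in for "query Google to see if p can extract r", and returning 0.0 for an empty set is an assumption made here to avoid dividing by zero.

```python
from typing import Callable, Hashable, Iterable

CanExtract = Callable[[Hashable, Hashable], bool]  # hypothetical: (pattern, relation) -> bool


def pattern_coverage(p: Hashable, relations: Iterable[Hashable], can_extract: CanExtract) -> float:
    """Fraction of the Relations in R that Pattern p can extract."""
    relations = list(relations)
    if not relations:
        return 0.0  # assumption: empty R gives zero coverage
    return sum(1 for r in relations if can_extract(p, r)) / len(relations)


def relation_coverage(r: Hashable, patterns: Iterable[Hashable], can_extract: CanExtract) -> float:
    """Fraction of the Patterns in P that can extract Relation r."""
    patterns = list(patterns)
    if not patterns:
        return 0.0  # assumption: empty P gives zero coverage
    return sum(1 for p in patterns if can_extract(p, r)) / len(patterns)
```

For example, a Pattern that can extract 2 of the 4 Relations in R gets coverage 0.5.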
Combining Confidence Scores
Given:
- A Relation with confidence c
- Extracted again; the pattern has confidence p
- New confidence score of s (may be < c)
One idea: MYCIN Calculus [Shortliffe 76]
- new confidence = c + (1 - c) * p * s
- Intuitively, going a fraction p * s of the way from the old confidence c to the maximal confidence 1.0
Another idea:
- new confidence = (c + p * s) / (1 + c * p * s)
- Confidences increase monotonically and stay between 0 and 1.0, but never reach 1.0
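The two combination rules can be compared directly in code. This is a sketch: the function names are illustrative, and all three inputs are assumed to lie in [0, 1], with c the existing confidence, p the extracting pattern's confidence, and s the score of the new extraction.

```python
def mycin_update(c: float, p: float, s: float) -> float:
    """MYCIN-style rule: move p * s of the way from c toward 1.0."""
    return c + (1.0 - c) * p * s


def bounded_update(c: float, p: float, s: float) -> float:
    """Alternative rule: (c + p*s) / (1 + c*p*s)."""
    e = p * s
    return (c + e) / (1.0 + c * e)


if __name__ == "__main__":
    c = 0.5
    for _ in range(3):
        # Repeated evidence with p = 0.8, s = 0.5 pushes c upward each time.
        c = mycin_update(c, 0.8, 0.5)
        print(round(c, 4))
```

With c = 0.5, p = 0.8 and s = 0.5, the MYCIN-style rule gives about 0.7 and the alternative gives about 0.75. Neither rule lowers the confidence, and both stay below 1.0 as long as c and p * s are below 1.0.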
Example Seed Data for Co-Training
Extraction Patterns:
- <x> “in collaboration with” <y>
- <x> “joint work with” <y>
- Patterns that extract information from tables, lists of citations, etc.
Relations:
- CollaboratesWith( mbilotti, ehn )
- CollaboratesWith( jbetter, teruko )
- ...
Extraction Pattern Examples
Query: “in collaboration with” site:web.mit.edu/biology/www
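A query of this shape could be assembled from a Pattern's literal context strings. The helper below is a hypothetical sketch: `build_query` is not part of any search API, and it only formats a string with quoted phrases plus an optional `site:` restriction.

```python
from typing import Optional, Tuple

Pattern = Tuple[str, str, str]  # (before, between, after) literal contexts


def build_query(pattern: Pattern, site: Optional[str] = None) -> str:
    """Quote each non-empty context string; optionally restrict to one site."""
    parts = ['"%s"' % s for s in pattern if s]
    if site:
        parts.append("site:" + site)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(("", "in collaboration with", ""), "web.mit.edu/biology/www"))
```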
Open Questions
Additional useful sources of information:
- Anchor text and link structure: advisor-advisee cross-refs, department or lab organization
- Heuristics or Named Entity Recognition to weed out relation arguments that are not people
Confidence metrics for patterns, relations
Methods of combining confidence scores
Termination condition