4. Relationship Extraction

Part 4 of Information Extraction, Sunita Sarawagi
CS 652, Peter Lindes, 9/7/2012



TRANSCRIPT

Page 1: 4. Relationship Extraction

4. Relationship Extraction
Part 4 of Information Extraction
Sunita Sarawagi
9/7/2012

Page 2: 4. Relationship Extraction

The Problem

• Relate extracted entities
  – Unstructured text not partitioned into records
• Various competitions
  – MUC
  – ACE
  – BioCreAtIvE II Protein-Protein Interaction

Page 3: 4. Relationship Extraction

Groups of Relationships

• ACE:
  – located at, near, part, role, social
  – for entities: person, organization, facility, location, and geo-political entity
• Biomedical: gene-disease, protein-protein, subcellular localizations
• NAGA knowledge base: 26 relationships such as: isA, bornInYear, establishedInYear, hasWonPrize, locatedIn, politicianOf, …

Page 4: 4. Relationship Extraction

Three Problem Levels

• First case:
  – Entities preidentified in unstructured text
  – Given a pair of entities, find the type of relationship
• Second case:
  – Given relationship type r and entity name e
  – Extract entities with which e has relationship r
• Third case:
  – Open-ended corpus (the web)
  – Given relationship type r, find entity pairs

Page 5: 4. Relationship Extraction

Given Entity Pair, Find Relationship

• R: set of relationship types
• R+: R plus a special member for “other”
• x: a “snippet” of text (might be a sentence)
• E1 and E2 in x
• Identify relationships in R+ between E1 and E2
• Resources available:
  – Surface tokens
  – Part-of-speech tags
  – Syntactic parse tree structure
  – Dependency graph
• Use these clues to classify (x, E1, E2) into one of R+
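A minimal sketch (not from the slides) of the feature-based view of this classification: turn (x, E1, E2) into surface-token features that a standard classifier could then map into R+. The feature names and the example sentence are illustrative assumptions.

    def extract_features(tokens, e1_span, e2_span):
        """Surface-token features for snippet x with entity spans E1, E2."""
        (s1, t1), (s2, t2) = e1_span, e2_span
        between = tokens[t1:s2]                  # tokens between the entities
        feats = {"between=" + w.lower(): 1 for w in between}
        feats["e1_head=" + tokens[s1].lower()] = 1
        feats["e2_head=" + tokens[s2].lower()] = 1
        feats["num_between"] = len(between)
        return feats

    tokens = "Larry Page co-founded Google in 1998".split()
    print(extract_features(tokens, (0, 2), (3, 4)))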

Page 6: 4. Relationship Extraction

Parse Tree

[Figure: syntactic parse tree of an example sentence]

Page 7: 4. Relationship Extraction

Dependency Graph

[Figure: dependency graph of an example sentence]

Page 8: 4. Relationship Extraction

Methods to Extract Relationships

• Feature-based methods
  – String form, orthographic type, POS tag, etc.
  – Features from the dependency graph
  – Features from the word sequence
  – Features from parse trees
• Kernel-based methods
  – Kernel function K(X, X’) captures similarity (toy sketch below)
  – Support Vector Machine (SVM) classifier
• Rule-based methods
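A toy kernel sketch (not from the slides), assuming instances are (tokens, E1 span, E2 span) triples. K counts shared tokens between the entities, which is a valid kernel (the inner product of binary indicator vectors); a real system would plug K into an SVM, e.g. scikit-learn's SVC with kernel="precomputed".

    def between_tokens(instance):
        tokens, e1_span, e2_span = instance
        return {t.lower() for t in tokens[e1_span[1]:e2_span[0]]}

    def K(x, x_prime):
        """Toy kernel: overlap of the between-entity token sets."""
        return len(between_tokens(x) & between_tokens(x_prime))

    a = ("Google acquired YouTube in 2006".split(), (0, 1), (2, 3))
    b = ("Oracle acquired Sun Microsystems".split(), (0, 1), (2, 4))
    print(K(a, b))  # the shared token "acquired" gives K = 1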

Page 9: 4. Relationship Extraction

Given Relationship, Find Entity Pairs

• Given one or more relationship types
• Find all occurrences in a corpus
• Open document collection
• No labeled unstructured training data
• Instead, seeding for each relationship type is used

Page 10: 4. Relationship Extraction

Seed Data for Relationship Type r

• The types of entities that are arguments of r
  – Often specified at a high level, e.g. proper noun, common noun, numeric, etc.
  – Types such as “Person” or “Company” require patterns to recognize them
• A seed database S of entities that have r
  – May include negative examples
• A seed set of manually coded patterns
  – Easy for generic relationships, e.g. hypernym or meronym (part-of)
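One way (not from the slides) to represent this seeding input in code, using the example relation from a later slide. All field names and the sample entries are illustrative assumptions.

    # Seeding for one relationship type r = "IsPhDAdvisorOf".
    seed = {
        "relation": "IsPhDAdvisorOf",
        "arg_types": ("Person", "Person"),              # Tr1, Tr2
        "examples": [                                   # seed database S
            ("Donald Knuth", "Robert Sedgewick", True),      # positive
            ("Donald Knuth", "Stanford University", False),  # negative
        ],
        "patterns": ["<E2> was advised by <E1>"],       # manually coded pattern
    }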

Page 11: 4. Relationship Extraction

3 Steps for Relationship Extraction

• Start with the above seeding data
  – A corpus D
  – Relationship types r1, …, rk
  – Entity types Tr1, Tr2 for each r
  – A set S of examples (Ei1, Ei2, ri), 1 ≤ i ≤ N
• 1: Use S to learn extraction patterns M
• 2: Use a subset of patterns to create candidates
• 3: Validation: select a subset based on statistical tests
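A minimal, runnable rendering (not from the slides) of the three steps for a single relationship type, where a "pattern" is simply the token context between two entities. The threshold and helper names are illustrative assumptions.

    from collections import Counter

    def context(sent, e1, e2):
        """Tokens between e1 and e2 when e1 precedes e2, else None."""
        if e1 in sent and e2 in sent:
            i, j = sent.index(e1), sent.index(e2)
            if i < j:
                return " ".join(sent[i + 1:j])
        return None

    def extract(D, seeds, entities, min_count=2):
        # 1: learn extraction patterns M from the seed pairs.
        M = {context(s, e1, e2) for s in D for (e1, e2) in seeds}
        M.discard(None)
        # 2: use the patterns to create candidate pairs.
        cands = [(a, b) for s in D for a in entities for b in entities
                 if a != b and context(s, a, b) in M]
        # 3: validate with a corpus-wide count threshold.
        return {p for p, n in Counter(cands).items() if n >= min_count}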

Page 12: 4. Relationship Extraction

Example Data

• Relationships: “IsPhDAdvisorOf”, “Acquired”
• Entity types: “(Person, Person)”, “(Company, Company)”

Page 13: 4. Relationship Extraction

Learn Patterns from Seed Triples

• Assume only one relationship for each pair
• Thus each example for r is negative for r’
• 1: Find sentences with entity pairs
  – For (E1, E2, r), query for “E1 NEAR E2” (proximity sketch below)
  – Filter out sentences where E1, E2 don’t match Tr1, Tr2
• 2: Filter sentences for the relationship
• 3: Learn patterns from sentences
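A sketch (not from the slides) reading the “E1 NEAR E2” query as a token-proximity test plus a crude type filter. The window size and the looks_like_person stand-in are illustrative assumptions; a real system would use proper recognizers for Tr1 and Tr2.

    def near(sent, e1, e2, window=10):
        """True if both entities occur within `window` tokens of each other."""
        if e1 in sent and e2 in sent:
            return abs(sent.index(e1) - sent.index(e2)) <= window
        return False

    def looks_like_person(token):
        return token.istitle()      # stand-in for a Person-type recognizer

    sents = [s.split() for s in [
        "Knuth advised Sedgewick at Stanford",
        "Sedgewick wrote a book",
    ]]
    hits = [s for s in sents
            if near(s, "Knuth", "Sedgewick")
            and looks_like_person("Knuth") and looks_like_person("Sedgewick")]
    print(len(hits))  # 1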

Page 14: 4. Relationship Extraction

Filtering Sentences

• Example: [figure with an example sentence; not captured in the transcript]
• Banko: a simple heuristic using the length of dependency links (sketch below)
• This fails for the above example
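A sketch (not from the slides) of that style of heuristic: keep a sentence only when the dependency path between the two entities is short. The tree is given here as a head array, and the threshold is an illustrative assumption.

    def dep_path_len(heads, a, b):
        """Edges on the path between tokens a and b in a dependency tree,
        where heads[i] is the index of token i's head (-1 at the root)."""
        def ancestors(i):
            chain = []
            while i != -1:
                chain.append(i)
                i = heads[i]
            return chain
        pa, pb = ancestors(a), ancestors(b)
        lca = min(set(pa) & set(pb), key=pa.index)   # lowest common ancestor
        return pa.index(lca) + pb.index(lca)

    # "Google acquired YouTube": both entities hang off the verb.
    heads = [1, -1, 1]
    print(dep_path_len(heads, 0, 2) <= 4)   # True -> keep the sentence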

Page 15: 4. Relationship Extraction

Learn Patterns from Sentences

• Formulate as a standard classification problem
• Two practical problems:
  – No guarantee of positive examples (bag sketch below)
    • Bunescu and Mooney: use SVM
  – Many sentences for each pair
    • Bunescu and Mooney: down-weight correlated terms
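A sketch (not from the slides) of why no single sentence is a guaranteed positive: collecting every sentence that mentions a seed pair yields a bag of sentences with only a pair-level label. This builds the bags only; the SVM training itself is out of scope, and the names are illustrative assumptions.

    from collections import defaultdict

    def make_bags(corpus, seed_pairs):
        """Group every sentence mentioning a seed pair into one labeled bag."""
        bags = defaultdict(list)
        for sent in corpus:
            for (e1, e2, label) in seed_pairs:
                if e1 in sent and e2 in sent:
                    bags[(e1, e2, label)].append(sent)
        return bags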

Page 16: 4. Relationship Extraction

Extract Candidate Entity Pairs

• Learned model M: (x, E1, E2) -> r
• Simple method: sequential scan over D (sketch below)
  – Look for Tr1, Tr2, then apply M
• Large, indexed corpus: retrieve relevant sentences
  – Use keyword search
    • Pattern-based
    • Keyword-based
  – Agichtein and Gravano: iterative solution
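A sketch (not from the slides) of the simple sequential-scan method, with a stub model M standing in for the classifier learned earlier. The trigger-word rule inside M and the sample corpus are illustrative assumptions.

    def M(sent, e1, e2):
        """Stub model: 'Acquired' if the trigger word sits between e1 and e2."""
        i, j = sent.index(e1), sent.index(e2)
        if i < j and "acquired" in sent[i + 1:j]:
            return "Acquired"
        return "other"

    def scan(D, companies):
        for sent in D:                                    # sequential scan over D
            present = [e for e in companies if e in sent] # Tr1/Tr2 lookup
            for a in present:
                for b in present:
                    r = M(sent, a, b) if a != b else "other"
                    if r != "other":
                        yield (a, b, r)

    D = ["Google acquired YouTube in 2006".split()]
    print(list(scan(D, ["Google", "YouTube"])))  # [('Google', 'YouTube', 'Acquired')]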

Page 17: 4. Relationship Extraction

Validate Extracted Relationships

• Extraction has high error rates
• Validation based on corpus-wide statistics
• Probabilities based on counts of occurrences (sketch below)
  – Extract only high-confidence relationships
• Rare relationships:
  – Use contextual patterns
  – Alternative: correct entity boundary errors
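A sketch (not from the slides) of one simple count-based validator: score each extracted pair by the fraction of its corpus mentions that match a learned pattern, and keep only high-confidence pairs. The counts and the 0.8 threshold are illustrative assumptions.

    def confidence(pair, mentions, matches):
        """Fraction of a pair's corpus mentions that matched a pattern."""
        return matches.get(pair, 0) / max(mentions.get(pair, 0), 1)

    mentions = {("Google", "YouTube"): 10, ("Google", "1998"): 3}
    matches = {("Google", "YouTube"): 9, ("Google", "1998"): 1}

    accepted = [p for p in mentions if confidence(p, mentions, matches) >= 0.8]
    print(accepted)  # [('Google', 'YouTube')]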

Page 18: 4. Relationship Extraction

Summary

• Setting 1: entities already marked
  – Feature-based and kernel-based methods
  – Clues from word sequence, parse trees, and dependency graphs
  – Training data with labeled relationships
• Setting 2: open corpus, given relationship types
  – No labeled unstructured data
  – Seed database of (E1, E2, r) examples
  – Bootstrapping from seed data
  – Filter based on relevancy
• Accuracy:
  – 50%-70% for closed benchmark datasets
  – Lots of special-case handling for the web

Page 19: 4. Relationship Extraction

Further Readings

• Concentrated here on binary relationships
• Natural extension: records with multi-way relationships
• Requires cross-sentence analysis:
  – Co-reference resolution
  – Discourse analysis
• Much literature on this topic
• Future research: discovering relevant relationship types