Information Extraction

Sources:
• Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
• Hobbs, J. R., & Riloff, E. (2010). Information extraction. Handbook of Natural Language Processing, 2.

CONTEXT

History

• Genesis = recognition of named entities (organization & people names)

• Online access pushes towards:
  – personal desktops -> structured databases
  – scientific publications -> structured records
  – Internet -> structured fact-finding queries

Driving workshops / conferences

– 1987-97: MUC (Message Understanding Conference)
  Filling slots, named entities & coreference (95-)
– 1999-08: ACE (Automatic Content Extraction)
  "supporting various classification, filtering, and selection applications by extracting and representing language content"
– 2008-now: TAC (Text Analysis Conference)
  • Knowledge Base Population (09-11)
  • Others: Textual entailment, Summarization, QA (until 2009)

Example: MUC

0. MESSAGE: ID TST1-MUC3-0001
1. MESSAGE: TEMPLATE 1
2. INCIDENT: DATE 02 FEB 90
3. INCIDENT: LOCATION GUATEMALA: SANTO TOMAS (FARM)
4. INCIDENT: TYPE ATTACK
5. INCIDENT: STAGE OF EXECUTION ACCOMPLISHED
6. INCIDENT: INSTRUMENT ID -
7. INCIDENT: INSTRUMENT TYPE -
8. PERP: INCIDENT CATEGORY TERRORIST ACT
9. PERP: INDIVIDUAL ID "GUERRILLA COLUMN" / "GUERRILLAS"
10. PERP: ORGANIZATION ID "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
11. PERP: ORGANIZATION CONFIDENCE REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
12. PHYS TGT: ID "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
13. PHYS TGT: TYPE GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
14. PHYS TGT: NUMBER 1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
15. PHYS TGT: FOREIGN NATION -
16. PHYS TGT: EFFECT OF INCIDENT -
17. PHYS TGT: TOTAL NUMBER -
18. HUM TGT: NAME "CEREZO"
19. HUM TGT: DESCRIPTION "PRESIDENT": "CEREZO" "CIVILIAN"
20. HUM TGT: TYPE GOVERNMENT OFFICIAL: "CEREZO" CIVILIAN: "CIVILIAN"
21. HUM TGT: NUMBER 1: "CEREZO" 1: "CIVILIAN"
22. HUM TGT: FOREIGN NATION -
23. HUM TGT: EFFECT OF INCIDENT NO INJURY: "CEREZO" DEATH: "CIVILIAN"
24. HUM TGT: TOTAL NUMBER -

Applications

• Enterprise Applications
  – News Tracking (terrorists, disease)
  – Customer care (linking mails to products, etc.)
  – Data Cleaning
  – Classified Ads
• Personal Information Management (PIM)
• Scientific Applications (e.g. bio-informatics)
• Web Oriented
  – Citation databases
  – Opinion databases
  – Community websites (DBLife, Rexa - UMASS)
  – Comparison Shopping
  – Ad Placement on Webpages
  – Structured Web Searches

IE - Taxonomy

• Types of structures extracted
  – Entities, Records, Relationships
  – Open/Closed IE
• Sources
  – Granularity of extraction
  – Heterogeneity: machine generated, (semi)structured, open
• Input resources
  – Structured DB
  – Labelled Unstructured Text
  – Preprocessing (tokenizer, chunker, parser, …)

Process (I)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules evaluated by humans

Process (II)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt

Process (III)

• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF
• Decomposition into a series of subproblems
  – Complex words, basic phrases, complex phrases, events and merging

Process (IV)

• Annotated documents
• Relevant & non-relevant documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF

Process (V)

• Annotated documents
• Relevant & non-relevant documents
• Seeds -> bootstrapping
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF

RECOGNIZING ENTITIES / FILLING SLOTS

Rule based systems

• Rules to mark an entity (or more)
  – Before the start of the entity
  – Tokens of the entity
  – After the end of the entity
• Rules to mark the boundaries
• Conflicts between rules
  – Larger span
  – Merge (if same action)
  – Order the rules
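
The conflict-resolution strategies listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Match` record, the `resolve_conflicts` helper and the toy matches are hypothetical and not taken from the sources. It combines two of the listed policies (prefer the larger span, then fall back on rule order) and does not implement merging.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """One rule firing on a token span (hypothetical record for illustration)."""
    start: int    # index of the first token (inclusive)
    end: int      # index after the last token (exclusive)
    label: str    # entity type assigned by the rule
    rule_id: int  # position of the rule in an ordered rule list (lower = earlier)


def resolve_conflicts(matches: List[Match]) -> List[Match]:
    """Keep a non-overlapping subset: larger spans first, then earlier rules."""
    ranked = sorted(matches, key=lambda m: (-(m.end - m.start), m.rule_id))
    kept: List[Match] = []
    for m in ranked:
        # Two spans are disjoint if one ends before the other starts.
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


# Toy input for the tokens: University of Washington in Seattle
matches = [
    Match(0, 3, "ORG", 1),  # "University of Washington"
    Match(2, 3, "LOC", 0),  # "Washington" (overlaps the ORG span)
    Match(4, 5, "LOC", 0),  # "Seattle"
]
resolved = resolve_conflicts(matches)
print([(m.start, m.end, m.label) for m in resolved])
```

Here the shorter "Washington" match is dropped because it overlaps a longer span, even though its rule comes earlier in the rule order.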

Entity Extraction – rule based

Learning rules

• Algorithms are based on
  – Coverage [how many cases are covered by the rule]
  – Precision
• Two approaches
  – Top-down (e.g. FOIL): start with coverage = 100%
  – Bottom-up: start with precision = 100%
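
To make the two quantities concrete, here is a minimal sketch that scores a candidate rule on labelled examples. The definitions are assumptions chosen for illustration: coverage is taken as the fraction of positive examples the rule matches, and precision as the fraction of matched examples that are positive. The helper name and the toy data are hypothetical.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, bool]  # (text, is_positive)


def coverage_and_precision(rule: Callable[[str], bool],
                           examples: List[Example]) -> Tuple[float, float]:
    """Return (coverage, precision) of a rule on labelled examples."""
    covered = [label for text, label in examples if rule(text)]
    positives = sum(1 for _, label in examples if label)
    true_pos = sum(covered)  # True counts as 1
    coverage = true_pos / positives if positives else 0.0
    precision = true_pos / len(covered) if covered else 0.0
    return coverage, precision


examples = [
    ("Mr. Smith", True),
    ("Mr. Jones", True),
    ("Dr. Brown", True),
    ("Mr. Coffee", False),
    ("the table", False),
]

# Top-down search starts from the most general rule (matches everything).
general = coverage_and_precision(lambda t: True, examples)
# Specialising the rule trades coverage for precision.
specific = coverage_and_precision(lambda t: t.startswith("Mr."), examples)
print(general, specific)
```

A top-down learner would keep adding conditions to the general rule while precision improves; a bottom-up learner would start from a rule matching a single positive example and relax it.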

Rules – Autoslog

• Rule Learning
  – Look at sentences containing targets
  – Heuristic: looking for a linguistic pattern

Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks, 811–811.

Rules – LIEP

Huffman, S. B. (2005). Learning information extraction patterns from examples.

Learn (sets of meta-heuristics) by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob,named),object(named,CEO)].
Followed by generalization (matching + disjunction)

Statistical models

• How to label
  – IOB sequences (Inside, Outside, Beginning)
  – Sequences
  – Segmentation
  – Grammar based (longer dependencies)

Example of IOB labelling:
Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I.

• Many ML models:
  – HMM
  – ME, CRF
  – SVM
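
The IOB example above can be turned back into segments with a short decoder. This is a minimal sketch, assuming plain B/I/O tags without entity types and a `token/tag` text format; the function name `iob_to_segments` is hypothetical.

```python
from typing import List, Sequence, Tuple


def iob_to_segments(tagged: Sequence[Tuple[str, str]]) -> List[List[str]]:
    """Group tokens into segments: B opens a segment, I extends it, O closes it."""
    segments: List[List[str]] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag == "B":
            if current:
                segments.append(current)
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open segment
            if current:
                segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


sentence = (
    "Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I "
    "bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B "
    "Salvador/I this/B morning/I"
)
tagged = [tuple(item.rsplit("/", 1)) for item in sentence.split()]
segments = [" ".join(seg) for seg in iob_to_segments(tagged)]
print(segments)
```

Note that two adjacent segments ("San Salvador" and "this morning") are separated only by the B tag, which is why the scheme needs B in addition to I and O.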

Statistical models (cont’d)

• Features
  – Word
  – Orthographic
  – Dictionary
  – …
• First order
  – Position:
  – Segment:
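
As an illustration of the three feature families named above (word, orthographic, dictionary), the sketch below builds a feature dictionary for one token position. The feature names, the `token_features` helper and the tiny city dictionary are assumptions made for this example, not the feature set used in the sources.

```python
from typing import Dict, List, Set


def token_features(tokens: List[str], i: int,
                   dictionary: Set[str]) -> Dict[str, object]:
    """Features for the token at position i (hypothetical feature names)."""
    word = tokens[i]
    return {
        # Word features: the token itself and its left neighbour.
        "word": word.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        # Orthographic features: surface shape of the token.
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Dictionary feature: membership in an external word list.
        "in_dictionary": word.lower() in dictionary,
    }


tokens = ["in", "downtown", "San", "Salvador"]
cities = {"salvador"}  # toy dictionary for illustration only
features = token_features(tokens, 3, cities)
print(features)
```

A sequence model such as a CRF would typically receive one such dictionary per token, possibly combined with the label of the previous position.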

Examples of features

Statistical models (cont’d)

• Learning:
  – Likelihood
  – Max-Margin

PREDICTING RELATIONSHIPS

Overall

• Goal: classify (E1,E2,x), i.e. predict the relationship between entities E1 and E2 mentioned in text x
• Features
  – Surface tokens (words, entities)
    [Entity label of E1 = Person, Entity label of E2 = Location]
  – Parse tree (syntactic, dependency graph)
    [POS = (noun,verb,noun), flag = “(1,none,2)”, type = “dependency”]

Models

• Standard classifier (e.g. SVM)
• Kernel-based methods
  – e.g. measure of common properties between two paths in the dependency tree
  – Convolution based kernels
• Rule-based methods

Extracting entities for a set of relationships

• Three steps
  – Learn extraction patterns for the seeds
    • Find documents where entities appear close to each other
    • Filtering
  – Generate candidate triplets
    • Pattern or keyword-based
  – Validation
    • # of occurrences
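
The three steps above can be sketched on a toy corpus. Everything here is a simplifying assumption made for illustration: patterns are just the literal text found between two seed entities (at most 30 characters apart), candidate entities are single capitalized words, and validation is a plain occurrence threshold. The function names and sentences are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def learn_patterns(seeds: Iterable[Pair], sentences: List[str]) -> Set[str]:
    """Step 1: collect the text linking seed entities that occur close together."""
    patterns: Set[str] = set()
    for e1, e2 in seeds:
        rx = re.compile(re.escape(e1) + r"(.{1,30}?)" + re.escape(e2))
        for s in sentences:
            m = rx.search(s)
            if m:
                patterns.add(m.group(1))
    return patterns


def generate_candidates(patterns: Set[str], sentences: List[str]) -> Counter:
    """Step 2: apply each pattern to find new (entity, entity) candidates."""
    name = r"([A-Z][a-z]+)"  # crude stand-in for an entity recognizer
    counts: Counter = Counter()
    for p in patterns:
        rx = re.compile(name + re.escape(p) + name)
        for s in sentences:
            for e1, e2 in rx.findall(s):
                counts[(e1, e2)] += 1
    return counts


def validate(counts: Counter, min_count: int = 2) -> Set[Pair]:
    """Step 3: keep candidates seen at least min_count times."""
    return {pair for pair, n in counts.items() if n >= min_count}


sentences = [
    "Paris is the capital of France.",
    "Rome is the capital of Italy.",
    "Everyone knows Rome is the capital of Italy.",
    "Lima is the capital of Peru.",
    "Madrid, the capital of Spain, is large.",
]
patterns = learn_patterns({("Paris", "France")}, sentences)
counts = generate_candidates(patterns, sentences)
accepted = validate(counts)
print(patterns, accepted)
```

In a real bootstrapping loop the accepted pairs would be added to the seed set and the three steps repeated; the occurrence threshold is what limits drift from noisy patterns.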

MANAGEMENT

Summary

• Performance
  – Document selection: subset, crawling
  – Queries to DB: related entities (top-k retrieval)
• Handling changes
  – Detecting when a page has changed
• Integration
  – Detecting duplicate entities
  – Redundant extractions (open IE)

EVALUATION

Metrics

• Metrics
  – Precision-Recall
  – F-measure (-> harmonic mean of precision and recall)
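
A small worked example of these metrics, assuming exact-match scoring over sets of extracted items and the balanced F-measure (F1). The span tuples and labels below are made up for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def precision_recall_f1(predicted: Set[Span],
                        gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and their harmonic mean (F1)."""
    tp = len(predicted & gold)  # items that are both predicted and correct
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(0, 4, "PERP"), (5, 8, "INSTRUMENT"), (9, 12, "TARGET"), (14, 16, "LOC")}
predicted = {(0, 4, "PERP"), (5, 8, "INSTRUMENT"), (14, 16, "DATE")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With 2 correct items out of 3 predicted and 4 expected, precision is 2/3, recall is 1/2, and F1 is 4/7, which is lower than the arithmetic mean because the harmonic mean penalizes the weaker of the two.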

The 60% barrier