A SURVEY ON INFORMATION EXTRACTION FROM DOCUMENTS USING STRUCTURES OF SENTENCES

Chikayama Taura Lab. M1 Mitsuharu Kurita

Post on 03-Jan-2016


INTRODUCTION

Current search systems are based on two assumptions:

1. Users send words, not sentences.
2. The aim is to find documents that are related to the query words.

As a result, users unconsciously learn to select words that are likely to appear near the target information.

In some cases this clue does not work well.


INTRODUCTION

For more convenient access to information, two kinds of analysis are needed:

- Analysis of the details of the question
  - To know the target information
- Analysis of the information in the retrieved documents
  - To find the requested information

Information Extraction


OUTLINE

- Introduction
- Overview of Information Extraction (IE)
- IE with pattern matching
- IE with sentence structures
  - Frequent substructure
  - Shortest path between 2 words
  - Applying the kernel method for structured data
- Conclusion


INFORMATION EXTRACTION

What is Information Extraction?
- A task in natural language processing
- Addresses the extraction of information from texts
  - Not the retrieval of documents
- Originated with an international conference named MUC

Message Understanding Conference (MUC)
- A competition in IE among research groups
- Set a new information extraction task for each edition, held between 1987 and 1997


MUC COMPETITION

An example of a MUC task: the MUC-3 terrorism domain
- Input: news articles (some of which describe terrorism events)
- Output: the instances involved in each incident


MUC COMPETITION

Pattern matching or linguistic analysis?
- At that time (1987-1997), there were many obstacles to using advanced natural language processing
- Therefore, most competitors adopted pattern matching to find instances


EXAMPLE OF PATTERN MATCHING

CIRCUS [92 Lehnert et al.]
- Each pattern consists of a “trigger word” and a “linguistic pattern”

Pattern: kidnap-passive
  Trigger: “kidnap”
  Linguistic pattern: “<subject> passive-verb”
  Variable: “target”

Example: “The mayor was kidnapped by terrorists.”
1. “kidnap” activates the pattern
2. “was kidnapped” is a passive verb phrase
3. The subject “mayor” is the target
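The three steps of the kidnap-passive example can be sketched in a few lines of Python. This is only an illustration, not the CIRCUS implementation: the `Pattern` class, the `apply_pattern` function, and the regular expression that stands in for real passive-voice detection are all assumptions made for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pattern:
    name: str      # e.g. "kidnap-passive"
    trigger: str   # word stem that activates the pattern
    variable: str  # slot that receives the extracted phrase


def apply_pattern(pattern: Pattern, sentence: str) -> Optional[dict]:
    """Toy stand-in for '<subject> passive-verb': subject + was/were + trigger...ed."""
    regex = re.compile(
        r"^(?:the\s+|a\s+|an\s+)?(?P<subject>.+?)\s+(?:was|were)\s+"
        + re.escape(pattern.trigger) + r"\w*ed\b",
        re.IGNORECASE,
    )
    match = regex.search(sentence)
    if match is None:
        return None  # trigger absent, or not in a passive construction
    return {pattern.variable: match.group("subject")}


kidnap_passive = Pattern(name="kidnap-passive", trigger="kidnap", variable="target")
result = apply_pattern(kidnap_passive, "The mayor was kidnapped by terrorists.")
active = apply_pattern(kidnap_passive, "Terrorists kidnapped the mayor.")
```

Here `result` holds the subject in the “target” slot, while the active sentence does not fire the pattern at all, which is why a separate pattern would be needed for each construction.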


PROBLEMS OF PATTERN MATCHING

- It takes a huge amount of time to create patterns
  - In many cases, they were written by hand
- Patterns depend heavily on the target domain
  - It is difficult to adapt them to a new task

Hence: automatic construction of patterns


THE EARLIEST AUTOMATIC PATTERN GENERATION

AutoSlog [93 Riloff et al.]
- Creates the patterns for CIRCUS automatically
- Training data: articles in which the target words are tagged
- Created 1237 patterns from 1500 tagged texts
  - Only 450 of them were judged to be valid by a human

Example: “The mayor was kidnapped by terrorists.”

Pattern: kidnap-passive
  Trigger: “kidnap”
  Linguistic pattern: “<subject> passive-verb”
  Variable: “target”


Recently it has become possible to use deeper linguistic analysis.

Some studies address new IE tasks by combining these linguistic resources with machine learning approaches.


SENTENCE STRUCTURES

Dependency structure
- Describes modification relations between words
- One sentence forms a tree structure

Predicate-argument structure
- Describes the semantic relations between predicates and their arguments
- One sentence forms a graph structure
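To make the tree/graph distinction concrete, here is a small sketch of both structures for the sentence “The mayor was kidnapped by terrorists.” The attachments and the `ARG1`/`ARG2` labels are illustrative assumptions, not the output of any particular parser.

```python
# Dependency structure: each word points to the word it modifies (its head).
dependency_heads = {
    "The": "mayor",
    "mayor": "kidnapped",
    "was": "kidnapped",
    "kidnapped": None,        # root of the tree
    "by": "kidnapped",
    "terrorists": "by",
}


def is_tree(heads):
    """True if there is exactly one root and every word reaches it without a cycle."""
    roots = [word for word, head in heads.items() if head is None]
    if len(roots) != 1:
        return False
    for word in heads:
        seen = set()
        while heads[word] is not None:
            if word in seen:
                return False
            seen.add(word)
            word = heads[word]
    return True


# Predicate-argument structure: labeled edges from a predicate to its arguments.
predicate_arguments = [
    ("kidnapped", "ARG1", "terrorists"),  # who kidnaps
    ("kidnapped", "ARG2", "mayor"),       # who is kidnapped
    ("by", "ARG1", "kidnapped"),
    ("by", "ARG2", "terrorists"),
    ("The", "ARG1", "mayor"),
]


def incoming_counts(edges):
    """Count how many predicates take each word as an argument."""
    counts = {}
    for _predicate, _label, argument in edges:
        counts[argument] = counts.get(argument, 0) + 1
    return counts


tree_ok = is_tree(dependency_heads)
incoming = incoming_counts(predicate_arguments)
```

In the dependency structure every word has a single head, so the result is a tree. In the predicate-argument structure a word such as “terrorists” is an argument of more than one predicate, so the result is a general graph.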


DIFFICULTIES OF USING STRUCTURED DATA

- Most machine learning algorithms treat data as feature vectors
- It is difficult to express structured data (e.g. trees, graphs) as vectors

Ways to use sentence structures for IE:
- Frequent substructures
- Shortest paths between 2 words
- Applying the kernel method for structured data


IE WITH SUBGRAPHS OF SENTENCE STRUCTURES

On-Demand Information Extraction (ODIE) [06 Sekine et al.]
- Creates extraction patterns on demand and extracts information with them

[Figure: system pipeline. Query → article database → relevant articles → dependency analyzer → dependency trees → frequent subtree mining → subtree patterns → table of information]
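The idea of mining frequent substructures from dependency trees can be sketched as follows. This is a deliberately simplified stand-in, not the miner used in ODIE: it only counts downward chains of labels (parent to descendant) instead of arbitrary subtrees, and the toy trees, the `COM`/`MNY` class labels on the nodes, and the function names are assumptions for illustration.

```python
from collections import Counter

# Toy dependency trees as (label, [children]); entity names are assumed to be
# already generalized to classes such as COM (company) and MNY (money).
trees = [
    ("acquire", [("COM", []), ("COM", []), ("for", [("MNY", [])])]),
    ("acquire", [("COM", []), ("COM", [])]),
    ("buy", [("COM", []), ("COM", []), ("for", [("MNY", [])])]),
]


def paths_from(tree):
    """Yield every chain of labels that starts at the root of `tree`."""
    label, children = tree
    yield (label,)
    for child in children:
        for path in paths_from(child):
            yield (label,) + path


def downward_paths(tree):
    """Yield every parent-to-descendant chain with at least two labels."""
    label, children = tree
    for child in children:
        for path in paths_from(child):
            yield (label,) + path
        yield from downward_paths(child)


def frequent_paths(trees, min_support):
    """Keep the chains that occur in at least `min_support` trees."""
    support = Counter()
    for tree in trees:
        support.update(set(downward_paths(tree)))  # count each tree once
    return {path: n for path, n in support.items() if n >= min_support}


patterns = frequent_paths(trees, min_support=2)
```

With `min_support=2`, only the chains shared by at least two of the toy trees survive. A real frequent subtree miner generalizes this to branching subtrees and uses pruning to stay efficient.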


EXPERIMENTAL RESULTS

Generated patterns
- Patterns found for the query “merger and acquisition” (M&A):

  <COM1> <agree to buy> <COM2> <for MNY>
  <COM1> <will acquire> <COM2> <for MNY>
  <a MNY merger> <of COM1> <and COM2>

Extracted information
- For the query “acquire, acquisition, merger, buy, purchase”
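A rough sketch of how patterns like these could fill a table of information is shown below. It makes strong simplifying assumptions: the subtree patterns are flattened into word sequences, entities in the text are already wrapped in `<COM>` and `<MNY>` tags, and the company names and amounts are made up for illustration. None of this is taken from the ODIE system itself.

```python
import re

# Regex fragment for one tagged entity; {t} is the tag, {n} the slot name.
TAG = r"<{t}>(?P<{n}>.+?)</{t}>"


def compile_pattern(template):
    """Turn e.g. 'COM1 will acquire COM2 for MNY' into a regex over tagged text."""
    parts = []
    for token in template.split():
        slot = re.fullmatch(r"(COM|MNY)(\d*)", token)
        if slot:
            parts.append(TAG.format(t=slot.group(1), n=token))
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def extract_rows(templates, sentences):
    """Return one table row (slot name -> text) per sentence that matches a pattern."""
    compiled = [compile_pattern(t) for t in templates]
    rows = []
    for sentence in sentences:
        for regex in compiled:
            match = regex.search(sentence)
            if match:
                rows.append(match.groupdict())
                break
    return rows


templates = [
    "COM1 agree to buy COM2 for MNY",
    "COM1 will acquire COM2 for MNY",
    "a MNY merger of COM1 and COM2",
]
sentences = [  # hypothetical, pre-tagged input
    "<COM>Alpha Corp</COM> will acquire <COM>Beta Inc</COM> for <MNY>$2 billion</MNY>.",
    "A <MNY>$5 million</MNY> merger of <COM>Gamma Ltd</COM> and <COM>Delta Co</COM> was announced.",
]
rows = extract_rows(templates, sentences)
```

Each row maps the slots `COM1`, `COM2`, and `MNY` to the matched text, which corresponds to one line of the extracted table.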


EXPERIMENTAL RESULTS

- Very quick construction of patterns
  - In MUC, participants were allowed one month
  - ODIE takes only a few minutes to return the result
- No training corpus is needed
  - ODIE learns extraction patterns from the data
- Information about recurring events can be extracted well
  - Mergers and acquisitions
  - Nobel prize winners


IE WITH SHORTEST PATH BETWEEN WORDS

Extraction of interacting protein pairs [06 Yakushiji et al.]
- Extracts interacting protein pairs from biomedical articles
- Focuses on the shortest path between 2 protein names on the predicate-argument structure
- Discriminates with a Support Vector Machine (SVM)

Example: “Entity1 is interacted with a hydrophilic loop region of Entity2.”

[Figure: predicate-argument structure of the example sentence, with nodes entity1, be, interact, with, region, a, hydrophilic, loop, of, entity2]
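Finding the shortest path between the two entity nodes is a standard breadth-first search. The sketch below runs it on the example sentence. The edge list is an assumed reading of the figure (the transcript preserves only the node labels), and the graph is treated as undirected for simplicity.

```python
from collections import deque

# Assumed edges of the predicate-argument graph for the example sentence.
edges = [
    ("be", "entity1"), ("be", "interact"),
    ("interact", "entity1"), ("with", "interact"),
    ("with", "region"), ("a", "region"),
    ("hydrophilic", "region"), ("loop", "region"),
    ("of", "region"), ("of", "entity2"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the node sequence from start to goal, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


path = shortest_path(edges, "entity1", "entity2")
```

The modifiers “a”, “hydrophilic”, and “loop” fall off the path, leaving only the words that connect the two proteins. That connecting chain is what serves as the extraction pattern.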


PATTERN GENERATION

Variation of patterns
- The extracted patterns alone are not enough
- Divide the patterns into parts and combine the parts into new patterns

[Figure: a pattern divided into Entity, Main, Prep, and Entity parts; example nodes: X, interact, Y, with, protein, region, of]
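The divide-and-combine step can be pictured as taking a cross product over the parts. The sketch below is an assumption about the general idea only: the part inventories (including “bind” and “to”) are hypothetical, and patterns are written as flat strings rather than graph fragments.

```python
from itertools import product

# Hypothetical parts obtained by dividing extracted patterns.
main_parts = ["X interact", "X bind"]
prep_parts = ["with", "to"]
entity_parts = ["Y", "region of Y"]


def combine(mains, preps, entities):
    """Build every Main + Prep + Entity combination as a new candidate pattern."""
    return [" ".join(parts) for parts in product(mains, preps, entities)]


candidates = combine(main_parts, prep_parts, entity_parts)
```

Two choices for each of the three parts already give eight candidates, including combinations never seen in the training data. This is why a validation step is needed afterwards.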


PATTERN GENERATION

Validation of patterns
- Some of these patterns are inappropriate
- Each pattern is scored by how well it fits the training data

[Formula: feature vector, defined in terms of TP and FP]

TP: True Positive
FP: False Positive


SUPPORT VECTOR MACHINE (SVM)

- 2-class linear classifier
- Divides the data space with a hyperplane
- Margin maximization
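To illustrate what “margin” means here, the sketch below computes the geometric margin of two candidate hyperplanes on a toy data set. It does not train an SVM. The data points and the two hyperplanes are made up, and an actual SVM would search over all separating hyperplanes for the one with the largest margin.

```python
import math


def decision(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, samples):
    """Distance from the hyperplane to the closest sample (negative if one is misclassified)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) for x, y in samples) / norm


# Toy 2-class data: (point, label) with labels +1 / -1.
samples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

margin_a = geometric_margin((1.0, 1.0), -3.0, samples)   # hyperplane x1 + x2 = 3
margin_b = geometric_margin((1.0, 0.0), -1.5, samples)   # hyperplane x1 = 1.5
```

Both hyperplanes separate the two classes, but the first leaves more room around the closest points, so margin maximization prefers it over the second.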


EXPERIMENTAL RESULTS

Learning: AImed corpus
- 225 abstracts of biomedical papers
- Annotated with protein names and interactions

Extraction: MEDLINE
- 14 million titles and 8 million abstracts

Extracted data
- 7775 protein pairs
- 64.0% precision
- 83.8% recall


IE WITH THE KERNEL METHOD ON SENTENCE STRUCTURES

Kernel method (e.g. SVM)
- Data are used only in the form of dot products
- If you can calculate the dot product directly, you do not have to compute the vectors themselves
- Furthermore, you can use other functions in place of the dot product, as long as they meet certain conditions (symmetry and positive semi-definiteness)

[Figure: raw data, kernel function, vector space, classifier]
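A small numeric check shows why the dot product can be computed without building the vectors. The sketch uses the degree-2 polynomial kernel on 2-dimensional inputs, a textbook example chosen for illustration and not taken from the surveyed papers; the function names are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map whose dot product equals (x . y) ** 2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Same value as dot(phi(x), phi(y)), computed without building phi."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))   # goes through the 3-dimensional vector space
implicit = poly_kernel(x, y)     # works on the raw data directly
```

Both routes give the same number up to floating-point rounding. For trees and graphs, the explicit vector space may be huge or impractical to enumerate, so only the kernel route is feasible.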


RELATION EXTRACTION

Relation extraction with a tree kernel [04 Culotta et al.]
- Classifies the relation between 2 entities
  - 5 entity types (person, organization, geo-political entity, location, facility)
  - 5 major types of relations (at, near, part, role, social)
- Classifies the smallest subtree of the dependency tree that includes both entities
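Locating the smallest subtree that includes both entities amounts to finding their lowest common ancestor in the dependency tree. The sketch below is one possible reading: it keeps only the nodes on the two paths up to that ancestor, whereas an implementation might also keep other descendants. The sentence, the head assignments, and the function names are assumptions for illustration.

```python
def ancestors(heads, node):
    """Return the chain from `node` up to the root, inclusive."""
    chain = [node]
    while heads[node] is not None:
        node = heads[node]
        chain.append(node)
    return chain


def smallest_common_subtree(heads, a, b):
    """Return (root, nodes) of the smallest subtree connecting a and b."""
    path_a = ancestors(heads, a)
    path_b = ancestors(heads, b)
    on_path_b = set(path_b)
    root = next(n for n in path_a if n in on_path_b)  # lowest common ancestor
    nodes = set(path_a[: path_a.index(root) + 1])
    nodes |= set(path_b[: path_b.index(root) + 1])
    return root, nodes


# Hypothetical dependency heads for "Officials said troops advanced near Tikrit".
heads = {
    "Officials": "said",
    "said": None,
    "troops": "advanced",
    "advanced": "said",
    "near": "advanced",
    "Tikrit": "near",
}
root, nodes = smallest_common_subtree(heads, "troops", "Tikrit")
```

The words above the common ancestor (“Officials”, “said”) are dropped, so the classifier only sees the part of the sentence that links the two entities.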


TREE KERNEL

- Represents the similarity between 2 tree-shaped data items
- Calculated as the sum of the similarities of nodes

Algorithm (flowchart):
1. Start: enqueue the root node pair
2. Is the queue empty? If yes, return the similarity and end
3. If no, dequeue a node pair
4. Add its similarity
5. Find all child node sequence pairs whose nodes have common main features
6. Enqueue the child node pairs and go back to step 2


CALCULATION OF TREE KERNEL

Features of nodes
- Each node carries a set of features, some of which are designated as main features
- The similarity between two nodes is defined as the number of features they have in common (excluding the main features)
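The queue-based computation and the node similarity can be combined in a short sketch. It simplifies the method in two ways that should be kept in mind: every pair of children with the same main feature is matched (rather than enumerating child node sequences), and no decay factor is applied. The `Node` class, the single-label main feature, and the feature strings are assumptions.

```python
from collections import deque


class Node:
    def __init__(self, main, features=(), children=()):
        self.main = main                  # main feature, used only for matching
        self.features = set(features)     # remaining features, used for similarity
        self.children = list(children)


def node_similarity(a, b):
    """Number of common features, not counting the main feature."""
    return len(a.features & b.features)


def tree_kernel(root_a, root_b):
    if root_a.main != root_b.main:
        return 0
    total = 0
    queue = deque([(root_a, root_b)])      # enqueue the root node pair
    while queue:                           # loop until the queue is empty
        a, b = queue.popleft()             # dequeue a node pair
        total += node_similarity(a, b)     # add the similarity
        for child_a in a.children:         # enqueue matching child pairs
            for child_b in b.children:
                if child_a.main == child_b.main:
                    queue.append((child_a, child_b))
    return total


tree_1 = Node("A", {"pos=VB", "type=none"}, [
    Node("B", {"pos=NN", "type=person"}),
    Node("C", {"pos=IN"}),
])
tree_2 = Node("A", {"pos=VB"}, [
    Node("B", {"pos=NN", "type=org"}),
    Node("D", {"pos=IN"}),
])
similarity = tree_kernel(tree_1, tree_2)
```

Only the root pair and the B pair are compared, because C and D do not share a main feature. Nodes that fail to match contribute nothing, no matter how similar their other features are.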


CALCULATION OF TREE KERNEL

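[Figure: worked example on two trees. The first tree has nodes A, B, C, D, E; the second has nodes A’, B’, C’, D’, E’, F’. The figure enumerates the matching child node sequence pairs between them.]

X and X’ denote nodes whose main features are common.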


EXPERIMENTAL RESULTS

Data set: ACE corpus
- 800 annotated documents (gathered from newspapers and broadcasts)
- 5 entity types (person, organization, geo-political entity, location, facility)
- 5 major types of relations (at, near, part, role, social)

Kernel               Precision (%)  Recall (%)
Bag-of-words kernel  47.0           10.0
Tree kernel          69.6           25.3
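The slide reports only precision and recall. To compare the two kernels with a single number, one can compute the F-measure (the harmonic mean of precision and recall) from the table. This is an added calculation, not a figure reported on the slide.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, here %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as given in the table above.
results = {
    "Bag-of-words kernel": (47.0, 10.0),
    "Tree kernel": (69.6, 25.3),
}
f_scores = {name: f_measure(p, r) for name, (p, r) in results.items()}
```

The tree kernel comes out at roughly 37.1 against roughly 16.5 for the bag-of-words kernel. In both cases the low recall dominates the score.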


CONCLUSION

Overview of Information Extraction
- The aim of information extraction
- The recent movement toward using deep linguistic resources

Ways to use sentence structures for IE
- Difficulties of using structured data in machine learning
- Three different approaches to exploit them