
Page 1: Features, Formalized

1

Features, Formalized

Stephen Mayhew, Hyung Sul Kim

Page 2: Features, Formalized

2

Outline
• What are features?
• How are they defined in NLP tasks in general?
• How are they defined specifically for relation extraction? (Kernel methods)

Page 3: Features, Formalized

3

What are features?

Page 4: Features, Formalized

4

Feature Extraction Pipeline
1. Define Feature Generation Functions (FGF)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors

Page 5: Features, Formalized

5

Feature Generation Functions
• When we say "features", we are often actually talking about FGFs.
• An FGF defines a relation over the instance space.
• For example, let the instance be the phrase "little brown cow". The relation containsWord(w) is active (= 1) three times: containsWord(little), containsWord(brown), containsWord(cow).

Page 6: Features, Formalized

6

Feature Generation Functions
Let R be an enumerable collection of relations on the instance space X. A Feature Generation Function is a mapping

Φ : X → 2^R

that maps each x ∈ X to the set of all elements in R that are satisfied by x.

Common notation for an FGF: Φ(x) = { r ∈ R : r(x) = 1 }

Page 7: Features, Formalized

7

Feature Generation Functions
Example:

"Gregor Samsa woke from troubled dreams."

Let R = { isCap(·), hasLen4(·), endsWithS(·) }

Define an FGF Φ over R and apply it to the instance:
isCap(Gregor), isCap(Samsa), hasLen4(woke), hasLen4(from), endsWithS(dreams)
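To make this concrete, here is a minimal Python sketch of the FGF example above; the relation names and the sentence come from the slide, while the tokenization and implementation details are assumptions made for illustration.

```python
# Minimal sketch of an FGF, following the slide's example. The relations
# isCap, hasLen4, and endsWithS come from the slide; their implementations
# and the simple tokenization are assumptions for illustration.

def is_cap(word):
    return word[0].isupper()

def has_len4(word):
    return len(word) == 4

def ends_with_s(word):
    return word.endswith("s")

RELATIONS = {"isCap": is_cap, "hasLen4": has_len4, "endsWithS": ends_with_s}

def fgf(tokens):
    """Map an instance (a token list) to the set of active grounded relations."""
    active = set()
    for token in tokens:
        word = token.strip(".,!?")
        for name, relation in RELATIONS.items():
            if relation(word):
                active.add(f"{name}({word})")
    return active

print(sorted(fgf("Gregor Samsa woke from troubled dreams.".split())))
# ['endsWithS(dreams)', 'hasLen4(from)', 'hasLen4(woke)', 'isCap(Gregor)', 'isCap(Samsa)']
```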

Page 8: Features, Formalized

8

Feature Extraction Pipeline
1. Define Feature Generation Functions (FGF)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors

Page 9: Features, Formalized

9

Lexicon
Apply our FGF to all input data. This creates grounded features and indexes them:
…
3534: hasWord(stark)
3535: hasWord(stamp)
3536: hasWord(stampede)
3537: hasWord(starlight)
…
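A sketch of the lexicon step under the same assumptions: apply an FGF (here a hypothetical hasWord FGF) to all training data and assign each grounded feature the next free index. The particular indices 3534 to 3537 on the slide depend on the corpus and are not reproduced here.

```python
# Sketch of building a lexicon: apply an FGF to every training example and
# give each grounded feature a unique integer index. The hasWord FGF, the toy
# training data, and the incremental indexing scheme are assumptions.

def has_word_fgf(tokens):
    return {f"hasWord({token.lower()})" for token in tokens}

class Lexicon:
    def __init__(self):
        self.feature_to_index = {}

    def add(self, feature):
        # Assign the next free index the first time a feature is seen.
        if feature not in self.feature_to_index:
            self.feature_to_index[feature] = len(self.feature_to_index)
        return self.feature_to_index[feature]

lexicon = Lexicon()
training_data = ["the stark starlight", "a stampede left its stamp"]  # toy corpus
for sentence in training_data:
    for feature in has_word_fgf(sentence.split()):
        lexicon.add(feature)

print(lexicon.feature_to_index)
```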

Page 10: Features, Formalized

10

Feature Extraction Pipeline
1. Define Feature Generation Functions (FGF)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors

Page 11: Features, Formalized

11

Translate examples to feature space
From Lexicon:
…
98: hasWord(In)
…
241: hasWord(the)
…
3534: hasWord(stark)
3535: hasWord(stamp)
3536: hasWord(stampede)
3537: hasWord(starlight)
…

“In the stark starlight”

<98, 241, 3534, 3537>
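Continuing the sketch, translating an example into feature space is just a lexicon lookup; grounded features that never entered the lexicon are dropped. The lexicon entries below mirror the indices shown on the slide but are otherwise arbitrary.

```python
# Sketch: map an example to the sparse indices of its active features,
# using a toy lexicon (a dict from grounded feature to index). The indices
# mirror the slide's example; everything else is an assumption.
lexicon = {
    "hasWord(in)": 98,
    "hasWord(the)": 241,
    "hasWord(stark)": 3534,
    "hasWord(stamp)": 3535,
    "hasWord(stampede)": 3536,
    "hasWord(starlight)": 3537,
}

def to_feature_vector(tokens, lexicon):
    active = {f"hasWord({token.lower()})" for token in tokens}
    # Features missing from the lexicon (unseen at training time) are skipped.
    return sorted(lexicon[f] for f in active if f in lexicon)

print(to_feature_vector("In the stark starlight".split(), lexicon))
# [98, 241, 3534, 3537]
```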

Page 12: Features, Formalized

12

Feature Extraction Pipeline
1. Define Feature Generation Functions (FGF)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors

Easy.

Page 13: Features, Formalized

13

Feature Extraction Pipeline (Testing)

1. FGFs are already defined
2. Lexicon is already defined
3. Translate examples into feature space
4. Learning with vectors

No surprises here.

Page 14: Features, Formalized

14

Structured Pipeline - Training
1. Define Feature Generation Functions (FGF)
   (Note: in this case the FGF is defined over input-output pairs, Φ(x, y))
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors

Exactly the same as before!

Page 15: Features, Formalized

15

Structured Pipeline - Testing
Remember, the FGF is Φ(x, y).

Now we don't have the gold y to use, but the idea is very similar: for every possible y we create features.
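A small sketch of what an FGF over input-output pairs can look like, assuming a toy tagging task with a hypothetical label set; at test time the same FGF is applied to every candidate y and the model scores each resulting feature set.

```python
# Sketch of a structured FGF: grounded features are generated over (x, y).
# At training time y is the gold label; at test time we generate features for
# every candidate y. The label set and feature templates are assumptions.
LABELS = ["PER", "LOC", "ORG", "O"]  # hypothetical label set

def structured_fgf(tokens, position, label):
    word = tokens[position]
    return {
        f"word={word.lower()}&y={label}",
        f"isCap={word[0].isupper()}&y={label}",
    }

def candidate_features(tokens, position):
    # One feature set per possible label; a linear model would score each.
    return {label: structured_fgf(tokens, position, label) for label in LABELS}

print(candidate_features("Gregor Samsa woke".split(), 0))
```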

Page 16: Features, Formalized

16

Automatic Feature Generation
Two ways to look at this:
1. Creating an FGF
   This is a black art, not even intuitive for humans to do.
2. Choosing the best subset of a closed set
   This is possible; algorithms exist.

Page 17: Features, Formalized

17

Exploiting Syntactico-Semantic Structures for Relation Extraction
Before doing the hard task of relation classification, apply some easy heuristics to recognize:
• Premodifiers: [the [Seattle] Zoo]
• Possessives: [[California's] Governor]
• Prepositions: [officials] in [California]
• Formulaics: [Medford], [Massachusetts]

These 4 structures cover 80% of the mention pairs (in ACE 2004)

Chan and Roth, ACL 2011

Page 18: Features, Formalized

18

Kernels for Relation Extraction

Hyung Sul Kim

Page 19: Features, Formalized

19

Kernel Tricks
• Borrowed a few slides from the ACL 2012 tutorial on Kernels in NLP by Moschitti

Page 20: Features, Formalized

20

Page 21: Features, Formalized

21

Page 22: Features, Formalized

22

Page 23: Features, Formalized

23

Page 24: Features, Formalized

24

All We Need is
• K(x1, x2) = ϕ(x1) · ϕ(x2)
• Computing K(x1, x2) is possible without mapping x to ϕ(x)
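A small numeric illustration of this point using the quadratic kernel K(x1, x2) = (x1 · x2)^2, which equals the inner product of the explicit degree-2 feature maps without ever constructing them; the vectors are arbitrary toy data, not from the slides.

```python
import numpy as np

# Kernel trick illustration: the quadratic kernel (x1 . x2)^2 equals
# phi(x1) . phi(x2), where phi(x) lists all degree-2 monomials x_i * x_j,
# but the kernel itself never builds phi explicitly.
def quadratic_kernel(x1, x2):
    return float(np.dot(x1, x2)) ** 2

def explicit_phi(x):
    # Explicit degree-2 feature map: all ordered products x_i * x_j.
    return np.outer(x, x).ravel()

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, -1.0, 2.0])
assert np.isclose(quadratic_kernel(x1, x2),
                  np.dot(explicit_phi(x1), explicit_phi(x2)))
print(quadratic_kernel(x1, x2))  # 20.25
```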

Page 25: Features, Formalized

25

Linear Kernels with Features (Zhou et al., 2005)

• Pairwise binary-SVM training
• Features:
  • Words
  • Entity Types
  • Mention Level
  • Overlap
  • Base Phrase Chunking
  • Dependency Tree
  • Parse Tree
  • Semantic Resources

Page 26: Features, Formalized

26

Word Features

Feature  Description                                                                          Example
WM1      bag-of-words in M1                                                                   {they}
HM1      head word of M1                                                                      they
WM2      bag-of-words in M2                                                                   {their, children}
HM2      head word of M2                                                                      children
HM12     combination of HM1 and HM2                                                           <they, children>
WBNULL   when no word in between                                                              0
WBFL     the only word in between, when only one word in between                              0
WBF      first word in between, when at least two words in between                            do
WBL      last word in between, when at least two words in between                             put
WBO      other words in between except first and last, when at least three words in between   not
BM1F     first word before M1                                                                 0
BM1L     second word before M1                                                                0
AM2F     first word after M2                                                                  in
AM2L     second word after M2                                                                 a
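A sketch of how a few of these word features could be computed from a tokenized sentence and two mention spans. The sentence below is implied by the example column of the table; the span convention (token offsets) and the helper itself are assumptions, and only a subset of the features is shown.

```python
# Sketch: a few Zhou et al. (2005)-style word features from a tokenized
# sentence and two mention spans given as (start, end) token offsets.
# The span convention and feature subset are assumptions for illustration.
def word_features(tokens, m1, m2):
    feats = {}
    between = tokens[m1[1]:m2[0]]                 # words between M1 and M2
    feats["WM1"] = set(tokens[m1[0]:m1[1]])       # bag-of-words in M1
    feats["WM2"] = set(tokens[m2[0]:m2[1]])       # bag-of-words in M2
    feats["WBNULL"] = int(len(between) == 0)      # no word in between
    if len(between) == 1:
        feats["WBFL"] = between[0]
    elif len(between) >= 2:
        feats["WBF"], feats["WBL"] = between[0], between[-1]
        feats["WBO"] = set(between[1:-1])         # words in between except first/last
    if m2[1] < len(tokens):
        feats["AM2F"] = tokens[m2[1]]             # first word after M2
    return feats

tokens = "they do not put their children in a".split()
print(word_features(tokens, m1=(0, 1), m2=(4, 6)))
# WM1={they}, WM2={their, children}, WBF=do, WBL=put, WBO={not}, AM2F=in
```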

Page 27: Features, Formalized

27

Entity Types, Mention Level, Overlap

Feature  Description                                                      Example 1    Example 2
ET12     combination of mention entity types (PER, ORG, FAC, LOC, GPE)   <PER, PER>   <GPE, LOC>
ML12     combination of mention levels (NAME, NOMINAL, PRONOUN)          <PRO, NOM>   <NAM, NAM>
#MB      number of other mentions in between                             0            0
#WB      number of words in between                                      3            0
M1>M2    1 if M2 is included in M1                                       0            1
M1<M2    1 if M1 is included in M2                                       0            0

Page 28: Features, Formalized

28

Base Phrase Chunking

Feature   Description                                                                                   Example
CPHBNULL  when no phrase in between                                                                     0
CPHBFL    the only phrase head, when only one phrase in between                                         0
CPHBF     first phrase head in between, when at least two phrases in between                            JAPAN
CPHBL     last phrase head in between, when at least two phrase heads in between                        KILLED
CPHBO     other phrase heads in between except first and last, when at least three phrases in between   0
CPHBM1F   first phrase head before M1                                                                   0
CPHBM1L   second phrase head before M1                                                                  0
CPHAM2F   first phrase head after M2                                                                    0
CPHAM2L   second phrase head after M2                                                                   0

Page 29: Features, Formalized

30

Performance of Features (F1 Measure)

Bar chart: F1 (scale 0 to 60) as feature groups are added cumulatively: Words, + Entity Type, + Mention Level, + Overlap, + Chunking, + Dependency Tree, + Parse Tree, + Semantic Resources.

Page 30: Features, Formalized

31

Performance Comparison

Year  Authors  Method  F-Measure

2005 Zhou et al. Linear Kernels with Handcrafted Features 55.5

Page 31: Features, Formalized

32

Syntactic Kernels (Zhao and Grishman, 2005)

• Syntactic Kernels (composite of 5 kernels):
  • Argument Kernel
  • Bigram Kernel
  • Link Sequence Kernel
  • Dependency Path Kernel
  • Local Dependency Kernel

Page 32: Features, Formalized

33

Bigram Kernel

• All unigrams and bigrams in the text from M1 to M2

Unigram    Bigram
they       they do
do         do not
not        not put
put        put their
their      their children
children
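A minimal sketch of this kernel: represent the token span from M1 to M2 by its unigrams and bigrams and let the kernel be the dot product of the two count vectors. The function names are illustrative, not from the paper.

```python
from collections import Counter

# Sketch of a bigram kernel: the span between M1 and M2 is represented by its
# unigram and bigram counts, and the kernel is the dot product of two bags.
def ngram_bag(tokens):
    unigrams = list(tokens)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(unigrams + bigrams)

def bigram_kernel(tokens1, tokens2):
    bag1, bag2 = ngram_bag(tokens1), ngram_bag(tokens2)
    return sum(count * bag2[ngram] for ngram, count in bag1.items())

span = "they do not put their children".split()
print(ngram_bag(span))
# unigrams: they, do, not, put, their, children
# bigrams: they do, do not, not put, put their, their children
```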

Page 33: Features, Formalized

34

Dependency Path Kernel

Example sentence: "That's because Israel was expected to retaliate against Hezbollah forces in areas controlled by Syrian troops."

Page 34: Features, Formalized

35

Performance Comparison

Year  Authors  Method  F-Measure

2005 Zhou et al. Linear Kernels with Handcrafted Features 55.5

2005 Zhao and Grishman Syntactic Kernels (Composite of 5 Kernels) 70.35

Page 35: Features, Formalized

36

Composite Kernel (Zhang et al., 2006)

• Composite of two kernels:
  • Entity Kernel (linear kernel with entity-related features given by the ACE datasets)
  • Convolution Tree Kernel (Collins and Duffy, 2001)
• Two ways to combine the two kernels (see the sketch below):
  • Linear combination
  • Polynomial expansion
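The two composition schemes can be written down directly. A sketch, where k_entity and k_tree are the two component kernels, and the mixing weight alpha and the polynomial degree are assumed hyperparameters; here the polynomial expansion applies a polynomial kernel to the entity kernel before combining.

```python
# Sketch of the two ways to combine an entity kernel and a tree kernel:
# a linear combination, and a polynomial expansion of the entity kernel
# before combining. alpha and degree are assumed hyperparameters.
def linear_combination(k_entity, k_tree, x1, x2, alpha=0.4):
    return alpha * k_entity(x1, x2) + (1 - alpha) * k_tree(x1, x2)

def polynomial_expansion(k_entity, k_tree, x1, x2, alpha=0.4, degree=2):
    poly_entity = (k_entity(x1, x2) + 1) ** degree
    return alpha * poly_entity + (1 - alpha) * k_tree(x1, x2)
```

Both remain valid kernels, since sums and products of kernels (with non-negative weights) are themselves kernels.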

Page 36: Features, Formalized

37

Convolution Tree Kernel (Collins and Duffy, 2001)

An example tree is shown on the slide. K(x1, x2) can be computed efficiently, in O(|x1| · |x2|).
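A compact sketch of the Collins and Duffy (2001) convolution tree kernel: C(n1, n2) counts common subtrees rooted at a pair of nodes, and the kernel sums C over all node pairs, which is O(|x1| · |x2|) pairs. The nested-tuple tree encoding, the decay parameter lam, and the toy trees are assumptions for illustration.

```python
from functools import lru_cache

# Sketch of the Collins & Duffy (2001) convolution tree kernel.
# A tree is a nested tuple (label, child, child, ...); a leaf word is a string.
# common_subtrees(n1, n2) is the C(n1, n2) recursion with decay lam, and
# tree_kernel sums it over all node pairs.

def nodes(tree):
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from nodes(child)

def production(node):
    # The production at a node: its label plus its children's labels.
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])

@lru_cache(maxsize=None)
def common_subtrees(n1, n2, lam):
    if not isinstance(n1, tuple) or not isinstance(n2, tuple):
        return 0.0                          # leaf words match only via their parents
    if production(n1) != production(n2):
        return 0.0
    if all(isinstance(c, str) for c in n1[1:]):
        return lam                          # identical preterminal production
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        score *= 1.0 + common_subtrees(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=0.4):
    return sum(common_subtrees(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("V", "sat")))
t2 = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("V", "sat")))
print(tree_kernel(t1, t2))
```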

Page 37: Features, Formalized

38

Relation Instance Spaces

Chart comparing F1 across the relation instance spaces: 51.3, 61.9, 59.2, 60.4.

Page 38: Features, Formalized

39

Performance Comparison

Year  Authors  Method  F-Measure

2005 Zhou et al. Linear Kernels with Handcrafted Features 55.5

2005 Zhao and Grishman Syntactic Kernels (Composite of 5 Kernels) 70.35

2006 Zhang et al. Entity Kernel + Convolution Tree Kernel 72.1

Page 39: Features, Formalized

40

Context-Sensitive Tree Kernel (Zhou et al., 2007)

• Motivating example: "John and Mary got married", an instance of what they call the predicate-linked category (10% of cases)
• PT: 63.6; Context-Sensitive Tree Kernel: 73.2

Page 40: Features, Formalized

41

Performance Comparison

Year  Authors  Method  F-Measure

2005 Zhou et al. Linear Kernels with Handcrafted Features 55.5

2005 Zhao and Grishman Syntactic Kernels (Composite of 5 Kernels) 70.35

2006 Zhang et al. Entity Kernel + Convolution Tree Kernel 72.1

2007 Zhou et al. (Zhou et al., 2005) + Context-sensitive Tree Kernel 75.8

Page 41: Features, Formalized

42

Best Kernel (Nguyen et al., 2009)

• Use multiple kernels on:
  • Constituent trees
  • Dependency trees
  • Sequential structures
• Design 5 different kernel composites from 4 tree kernels and 6 sequential kernels

Page 42: Features, Formalized

43

Convolution Tree Kernels on 4 Special Trees

Individual tree-kernel F1 scores for PET, DW, GR, and GRW (chart on slide): 68.9, 56.3, 60.2, 58.5.

PET + GR = 70.5
DW + GR = 61.8

Page 43: Features, Formalized

44

Word Sequence Kernels on 6 Special Sequences

SK1. Sequence of terminals (lexical words) in the PET
  e.g. T2-LOC washington, U.S. T1-PER officials
SK2. Sequence of part-of-speech (POS) tags in the PET
  e.g. T2-LOC NN, NNP T1-PER NNS
SK3. Sequence of grammatical relations in the PET
  e.g. T2-LOC pobj, nn T1-PER nsubj
SK4. Sequence of words in the DW
  e.g. Washington T2-LOC In working T1-PER officials GPE U.S.
SK5. Sequence of grammatical relations in the GR
  e.g. pobj T2-LOC prep ROOT T1-PER nsubj GPE nn
SK6. Sequence of POS tags in the DW
  e.g. NN T2-LOC IN VBP T1-PER NNS GPE NNP

Individual F1 scores for SK1 through SK6 (chart on slide): 61.0, 60.8, 61.6, 59.7, 59.8, 59.7.

SK1 + SK2 + SK3 + SK4 + SK5 + SK6 = 69.8

Page 44: Features, Formalized

45

Word Sequence Kernels (Cancedda et al., 2003)

• Extended sequence kernels
• Map to high-dimensional spaces using every subsequence (see the sketch below)
• Penalties for:
  • common subsequences (using IDF)
  • longer subsequences
  • non-contiguous subsequences
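A deliberately naive sketch of a gap-weighted word subsequence kernel in this spirit: every length-n word subsequence is a feature, weighted by lam raised to the length of the span it covers, so longer and non-contiguous matches are penalized. The brute-force enumeration, the parameters n and lam, and the example sentences are assumptions; the IDF down-weighting of common subsequences mentioned on the slide is omitted, and real implementations use a dynamic-programming formulation.

```python
from collections import defaultdict
from itertools import combinations

# Naive sketch of a gap-weighted word subsequence kernel: each subsequence of
# n words is a feature weighted by lam ** (span it covers), so gappy and long
# matches contribute less. Brute force for clarity; n and lam are assumptions.
def subsequence_weights(tokens, n, lam):
    weights = defaultdict(float)
    for positions in combinations(range(len(tokens)), n):
        subseq = tuple(tokens[i] for i in positions)
        span = positions[-1] - positions[0] + 1
        weights[subseq] += lam ** span
    return weights

def sequence_kernel(tokens1, tokens2, n=2, lam=0.5):
    w1 = subsequence_weights(tokens1, n, lam)
    w2 = subsequence_weights(tokens2, n, lam)
    return sum(weight * w2[subseq] for subseq, weight in w1.items())

s1 = "officials in Washington".split()
s2 = "officials working in Washington".split()
print(sequence_kernel(s1, s2))  # shared pairs like (officials, Washington) match
```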

Page 45: Features, Formalized

46

Performance Comparison

Year Authors Method F-Measure

2005 Zhou et al. Linear Kernels with Handcrafted Features 55.5

2005 Zhao and Grishman Syntactic Kernels (Composite of 5 Kernels) 70.35

2006 Zhang et al. Entity Kernel + Convolution Tree Kernel 72.1

2007 Zhou et al. (Zhou et al., 2005) + Context-sensitive Tree Kernel 75.8

2009 Nguyen et al. Multiple Tree Kernels + Multiple Sequence Kernels 71.5

(Zhang et al., 2006): F-measure 68.9 in our settings.

(Zhou et al., 2007): "Such heuristics expand the tree and remove unnecessary information allowing a higher improvement on RE. They are tuned on the target RE task so although the result is impressive, we cannot use it to compare with pure automatic learning approaches, such as our models."

Page 46: Features, Formalized

47

Topic Kernel (Wang et al., 2011)

• Use Wikipedia InfoBox to learn topics of relations (like topics of words) based on co-occurrences

Topics   Top Relations
Topic 1  active_years_end_date, career_end, final_year, retired
Topic 2  commands, part_of, battles, notable_commanders
Topic 3  influenced, school_tradition, notable_ideas, main_interests
Topic 4  destinations, end, through, post_town
Topic 5  prizes, award, academy_awards, highlights
Topic 6  inflow, outflow, length, maxdepth
Topic 7  after, successor, ending_terminus
Topic 8  college, almamater, education
…

Page 47: Features, Formalized

48

Overview

Page 48: Features, Formalized

49

Performance Comparison

Year Authors Method F-Measure

2005 Zhou et al. Linear Kernels with Handcrafted Features 55.5

2005 Zhao and Grishman Syntactic Kernels (Composite of 5 Kernels) 70.35

2006 Zhang et al. Entity Kernel + Convolution Tree Kernel 72.1

2007 Zhou et al. (Zhou et al., 2005) + Context-sensitive Tree Kernel 75.8

2009 Nguyen et al. Multiple Tree Kernels + Multiple Sequence Kernels 71.5

2011 Wang et al. Entity Features + Word Features + Dependency Path + Topic Kernels 73.24