A Self Learning Universal Concept Spotter
By Tomek Strzalkowski and Jin Wang
Original slides by Iman Sen
Edited by Ralph Grishman
Introduction
When this paper appeared (1996), most named entity taggers were hand-coded; work on supervised learning for NE was just beginning.
The Universal Spotter was one of the first procedures proposed for unsupervised learning of semantic categories of names and noun phrases.
Basic Idea
Start with some examples and/or contexts for things to spot (the ‘seed’) & a large corpus
Exploit redundancy of evidence: we may be able to classify a name both because we know the
name itself and because we know its context
Use seed examples to learn indicative contexts and use these contexts to learn “new” items.
Initially precision is high, recall very low. Iterations should increase recall, while (hopefully)
maintaining/improving precision.
Seeds: What we are looking for
The seed is the initial information provided by the user, in the form of either examples or contextual
information. Examples are taken from the text ("Microsoft",
"toothbrushes"). Contextual information can also be specified (both internal
and external). For example, "name ends with Co." (internal) or "appears after produced" (external).
The Cyclic Process
1. Build context rules from the seed examples.
2. Use these rules to find further examples of this concept in the corpus.
3. As we find more examples of the concepts, we can find more contextual information.
4. Selectively expand context rules using these new contexts.
5. Repeat.
Simple Example
• Suppose we have the seeds “Co” and “Inc” initially and the following text.
"Henry Kaufman is president of Henry Kaufman & Co., ... president of
Gabelli Funds Inc.; Claude N. Rosenberg is named president of
Thomson S.A. ..."
• Use “Co” and “Inc” to pick out Henry Kaufman & Co and Gabelli Funds Inc.
• Use these newly found examples to extract contextual information, such as "president of" appearing before each of the entities.
• Use “president of” to find “Thomson S.A.”
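The walk-through above can be sketched with a couple of regular expressions. This is a hypothetical illustration, not the authors' implementation: the quoted text is lightly normalized, and the patterns are hand-tuned to this one snippet.

```python
import re

TEXT = ("Henry Kaufman is president of Henry Kaufman & Co. , president of "
        "Gabelli Funds Inc. ; Claude N. Rosenberg is named president of "
        "Thomson S.A.")

# Step 1: use the seed suffixes "Co." / "Inc." to pick out capitalized
# word sequences ending in one of them.
seed_pat = re.compile(r'((?:[A-Z&][\w.&]*\s)+(?:Co|Inc)\.)')
found = seed_pat.findall(TEXT)
# found: ['Henry Kaufman & Co.', 'Gabelli Funds Inc.']

# Step 2: the seed matches share the preceding context "president of";
# using that context as a rule also spots the new name "Thomson S.A.".
ctx_pat = re.compile(r'president of\s+((?:[A-Z&][\w.&]*\s?)+)')
new = [m.strip() for m in ctx_pat.findall(TEXT)]
# new: ['Henry Kaufman & Co.', 'Gabelli Funds Inc.', 'Thomson S.A.']
```

Note how the context rule recovers "Thomson S.A." even though it matches neither seed suffix; that is the redundancy of evidence the method exploits.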
The Classification Task
So our goal is to decide whether a sequence of words represents a desired entity/concept.
This is done by calculating significance weights (SW) of evidence items [features], and then combining them.
The Process: In Detail
• Initially some preprocessing is done including tokenization, POS tagging and lexical normalization or stemming.
• POS tagging helps to delineate which sequences of words might contain the desired entities.
• These become the 'candidate items'.
“Evidence Items” [features]
Consider a sequence of words W1, W2, ..., Wm of interest in the text (the central unit). There is a window of size n on either side of the central unit in which one looks for contextual information.
Then do the following:
Make up pairs of (word, position), where position is one of preceding context (p), central unit (s), or following context (f), for all words within the window of size n. Similarly make up pairs of (bigram, position).
Make up triples of (word, position, distance) for the same words, where distance is measured from the edge of the central unit: preceding words are counted back from W1, following words forward from Wm
(for words within W1 through Wm, take the distance from Wm).
An Example of Evidence Items
Example: ... boys kicked the door with rage ... with window n=2, and central unit, “the door”.
The generated tuples (called evidence items) are :
(boys, p), (kicked, p), (the, s),
(door, s), (with, f), (rage, f),
((boys, kicked), p), ((the, door), s),
((with, rage), f), (boys, p, 2),
(kicked, p, 1), (the, s, 2), (door, s, 1),
(with, f, 1), (rage, f, 2),
((boys, kicked), p, 1), ((the, door), s, 1),
((with, rage), f, 1)
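A minimal sketch of this tuple generation, assuming the conventions above: distances on the preceding side count back from W1, elsewhere from Wm, and bigrams are kept only when they fall entirely within one region (the function name is mine, not the paper's):

```python
def evidence_items(tokens, start, end, n=2):
    """Generate (word, pos) / (bigram, pos) pairs and the corresponding
    (.., pos, distance) triples for the central unit tokens[start:end],
    with a context window of size n on either side.
    pos: 'p' = preceding context, 's' = central unit, 'f' = following."""
    items = []
    lo, hi = max(0, start - n), min(len(tokens), end + n)
    # Unigram pairs and triples.
    for i in range(lo, hi):
        w = tokens[i]
        if i < start:                       # preceding: distance back from W1
            items += [(w, 'p'), (w, 'p', start - i)]
        elif i < end:                       # central unit: distance from Wm
            items += [(w, 's'), (w, 's', end - i)]
        else:                               # following: distance from Wm
            items += [(w, 'f'), (w, 'f', i - end + 1)]
    # Bigram pairs and triples, kept only when wholly inside one region.
    for i in range(lo, hi - 1):
        bg = (tokens[i], tokens[i + 1])
        if i + 1 < start:
            items += [(bg, 'p'), (bg, 'p', start - i - 1)]
        elif i >= start and i + 1 < end:
            items += [(bg, 's'), (bg, 's', end - i - 1)]
        elif i >= end:
            items += [(bg, 'f'), (bg, 'f', i - end + 1)]
    return items
```

Running this on the slide's example, `evidence_items(['boys', 'kicked', 'the', 'door', 'with', 'rage'], 2, 4, n=2)`, reproduces exactly the 18 tuples listed above.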
Calculating Significance Weights for Evidence Items
Candidate items may be classified into two groups, accepted (A) and rejected (R).
Use these groups to calculate SW:
SW(t) = (f(t,A) - f(t,R)) / (f(t,A) + f(t,R))   if f(t,A) + f(t,R) > s
SW(t) = 0                                       otherwise
where s is a constant to filter noise and f(x,X) is the frequency of x in X.
• SW takes values between -1.0 and 1.0.
• For some e > 0, SW(t) > e is taken as positive evidence
and SW(t) < -e is taken as negative evidence.
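This definition translates directly into a few lines of Python. A small sketch, assuming frequency tables are plain dicts; the default noise threshold s=3 is an assumed value, since the paper leaves s as a tunable constant:

```python
def significance_weight(t, freq_A, freq_R, s=3):
    """SW(t): normalized difference of t's frequency in the accepted
    set A and the rejected set R; 0 when the total frequency does not
    exceed the noise-filtering constant s (s=3 is an assumption)."""
    fa = freq_A.get(t, 0)
    fr = freq_R.get(t, 0)
    if fa + fr > s:
        return (fa - fr) / (fa + fr)
    return 0.0
```

For instance, an evidence item seen 8 times with accepted candidates and 2 times with rejected ones gets SW = (8 - 2) / (8 + 2) = 0.6, while an item seen only 3 times in total falls below the threshold and contributes nothing.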
Combining SW weights
These SW weights for a given candidate item are then combined; if the result exceeds a threshold, the candidate is accepted and becomes available during the next tagging stage.
The primary scheme used by the authors for combining is:
x o y = x + y - xy   if x > 0 and y > 0
x o y = x + y + xy   if x < 0 and y < 0
x o y = x + y        otherwise
Note: Values still remain within [-1.0, 1.0]
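The combining operator and a fold over a candidate's evidence weights can be sketched as follows (`combine_all` and its zero starting value are my additions for illustration):

```python
from functools import reduce

def combine(x, y):
    """The 'o' operator: agreeing evidence reinforces the score while
    the result stays within [-1.0, 1.0]."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y

def combine_all(weights):
    """Fold all of a candidate item's SW values into a single score."""
    return reduce(combine, weights, 0.0)
```

For example, combine(0.5, 0.5) = 0.75 and combine(-0.5, -0.5) = -0.75, so two moderately positive (or negative) pieces of evidence yield a stronger combined score, while mixed evidence simply sums: combine(0.5, -0.2) = 0.3.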
Bootstrapping
The basic bootstrapping process then looks like this:
Procedure Bootstrapping
  Collect seeds
  loop
    Training phase (calculate SW for each evidence item)
    Tagging phase (combine SWs for each candidate item)
  until satisfied
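The pseudocode above can be fleshed out into a runnable sketch. This is my simplified reconstruction, not the authors' code: candidates are pre-extracted (item, evidence-items) pairs, the noise threshold s is set to 0 so the toy data is not filtered out, and the acceptance threshold 0.5 is an assumption.

```python
from collections import Counter
from functools import reduce

def sw(t, fA, fR, s=0):
    """Significance weight; s=0 disables noise filtering for toy data."""
    a, r = fA[t], fR[t]
    return (a - r) / (a + r) if a + r > s else 0.0

def combine(x, y):
    """The authors' combining scheme for two weights in [-1, 1]."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y

def bootstrap(candidates, seeds, cycles=2, threshold=0.5):
    """candidates: list of (item, evidence_item_list); seeds: items
    accepted up front. Each cycle = training phase (recount evidence
    frequencies in the accepted set A vs. the rejected set R) followed
    by a tagging phase (combine SWs and threshold the score)."""
    accepted = set(seeds)
    for _ in range(cycles):
        fA, fR = Counter(), Counter()
        for item, ev in candidates:                 # training phase
            (fA if item in accepted else fR).update(ev)
        tagged = set()
        for item, ev in candidates:                 # tagging phase
            score = reduce(combine, (sw(t, fA, fR) for t in ev), 0.0)
            if score > threshold:
                tagged.add(item)
        accepted = tagged | set(seeds)
    return accepted
```

On a toy version of the earlier example, two seed companies sharing the context "president of" are enough to pull in "Thomson S.A." while leaving "the door" rejected.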
Experiments and Results
Organization names: trained on a 7 MB WSJ corpus, tested on 10 selected articles.
• With seed context features: precision 97%, recall 49%.
• Reached P = 95% and R = 90% after the 4th cycle.
• A similar experiment for identifying products gave lower performance (about 70% precision, 70% recall).