A Self Learning Universal Concept Spotter
By Tomek Strzalkowski and Jin Wang
Original slides by Iman Sen
Edited by Ralph Grishman
Introduction
When this paper appeared (1996), most named entity taggers were hand-coded; work on supervised learning for NE was just beginning.
The Universal Spotter was one of the first procedures proposed for unsupervised learning of semantic categories of names and noun phrases.
Basic Idea
Start with some examples and/or contexts for things to spot (the ‘seed’) & a large corpus
Exploit redundancy of evidence: we may be able to classify a name both because we know the
name itself and because we know its context
Use seed examples to learn indicative contexts and use these contexts to learn “new” items.
Initially precision is high, recall very low. Iterations should increase recall, while (hopefully)
maintaining/improving precision.
Seeds: What we are looking for
The seed is the initial information provided by the user, in the form of either examples or contextual
information. Examples are taken from the text ("Microsoft",
"toothbrushes"). Contextual information can also be specified (both internal
and external). For example, "name ends with Co." (internal) or "appears after produced" (external).
The Cyclic Process
1. Build context rules from the seed examples.
2. Use these rules to find further examples of this concept in the corpus.
3. As we find more examples of the concepts, we can find more contextual information.
4. Selectively expand context rules using these new contexts.
5. Repeat.
Simple Example
• Suppose we have the seeds “Co” and “Inc” initially and the following text.
"Henry Kaufman is president of Henry Kaufman & Co., ... president of
Gabelli Funds Inc.; Claude N. Rosenberg is named president of
Thomson S.A. ..."
• Use “Co” and “Inc” to pick out Henry Kaufman & Co and Gabelli Funds Inc.
• Use these newly found examples to extract contextual information, such as "president of" appearing before each of the entities.
• Use “president of” to find “Thomson S.A.”
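The walk-through above can be sketched with a couple of regular expressions. This is a hypothetical illustration, not the authors' implementation: the quoted text is lightly normalized, and the patterns are hand-tuned to this one snippet.

```python
import re

TEXT = ("Henry Kaufman is president of Henry Kaufman & Co. , president of "
        "Gabelli Funds Inc. ; Claude N. Rosenberg is named president of "
        "Thomson S.A.")

# Step 1: use the seed suffixes "Co." / "Inc." to pick out capitalized
# word sequences ending in one of them.
seed_pat = re.compile(r'((?:[A-Z&][\w.&]*\s)+(?:Co|Inc)\.)')
found = seed_pat.findall(TEXT)
# found: ['Henry Kaufman & Co.', 'Gabelli Funds Inc.']

# Step 2: the seed matches share the preceding context "president of";
# using that context as a rule also spots the new name "Thomson S.A.".
ctx_pat = re.compile(r'president of\s+((?:[A-Z&][\w.&]*\s?)+)')
new = [m.strip() for m in ctx_pat.findall(TEXT)]
# new: ['Henry Kaufman & Co.', 'Gabelli Funds Inc.', 'Thomson S.A.']
```

Note how the context rule recovers "Thomson S.A." even though it matches neither seed suffix; that is the redundancy of evidence the method exploits.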
The Classification Task
So our goal is to decide whether a sequence of words represents a desired entity/concept.
This is done by calculating significance weights (SW) of evidence items [features], and then combining them.
The Process: In Detail
• Initially some preprocessing is done including tokenization, POS tagging and lexical normalization or stemming.
• POS tagging helps to delineate which sequences of words might contain the desired entities.
• These become the 'candidate items'.
“Evidence Items” [features]
Consider a sequence of words W1, W2, ..., Wm of interest in the text (the central unit). There is a window of size n on either side of the central unit in which one looks for contextual information.
Then do the following:
Make up pairs of (word, position), where position is one of preceding context (p), central unit (s), or following context (f), for all words within the window of size n. Similarly make up pairs of (bigram, position).
Make up triples of (word, position, distance) for the same words, where distance is measured from the edge of the central unit: preceding words are counted back from W1, following words forward from Wm
(for words within W1 through Wm, take the distance from Wm).
An Example of Evidence Items
Example: ... boys kicked the door with rage ... with window n=2, and central unit, “the door”.
The generated tuples (called evidence items) are :
(boys, p), (kicked, p), (the, s),
(door, s), (with, f), (rage, f),
((boys, kicked), p), ((the, door), s),
((with, rage), f), (boys, p, 2),
(kicked, p, 1), (the, s, 2), (door, s, 1),
(with, f, 1), (rage, f, 2),
((boys, kicked), p, 1), ((the, door), s, 1),
((with, rage), f, 1)
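A minimal sketch of this tuple generation, assuming the conventions above: distances on the preceding side count back from W1, elsewhere from Wm, and bigrams are kept only when they fall entirely within one region (the function name is mine, not the paper's):

```python
def evidence_items(tokens, start, end, n=2):
    """Generate (word, pos) / (bigram, pos) pairs and the corresponding
    (.., pos, distance) triples for the central unit tokens[start:end],
    with a context window of size n on either side.
    pos: 'p' = preceding context, 's' = central unit, 'f' = following."""
    items = []
    lo, hi = max(0, start - n), min(len(tokens), end + n)
    # Unigram pairs and triples.
    for i in range(lo, hi):
        w = tokens[i]
        if i < start:                       # preceding: distance back from W1
            items += [(w, 'p'), (w, 'p', start - i)]
        elif i < end:                       # central unit: distance from Wm
            items += [(w, 's'), (w, 's', end - i)]
        else:                               # following: distance from Wm
            items += [(w, 'f'), (w, 'f', i - end + 1)]
    # Bigram pairs and triples, kept only when wholly inside one region.
    for i in range(lo, hi - 1):
        bg = (tokens[i], tokens[i + 1])
        if i + 1 < start:
            items += [(bg, 'p'), (bg, 'p', start - i - 1)]
        elif i >= start and i + 1 < end:
            items += [(bg, 's'), (bg, 's', end - i - 1)]
        elif i >= end:
            items += [(bg, 'f'), (bg, 'f', i - end + 1)]
    return items
```

Running this on the slide's example, `evidence_items(['boys', 'kicked', 'the', 'door', 'with', 'rage'], 2, 4, n=2)`, reproduces exactly the 18 tuples listed above.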
Calculating Significance Weights for Evidence Items
Candidate items may be classified into two groups, accepted (A) and rejected (R).
Use these groups to calculate SW:
SW(t) = (f(t,A) - f(t,R)) / (f(t,A) + f(t,R))   if f(t,A) + f(t,R) > s
SW(t) = 0                                       otherwise
where s is a constant to filter noise and f(x,X) is the frequency of x in X.
• SW takes values between -1.0 and 1.0.
• For some e > 0, SW(t) > e is taken as positive evidence
and SW(t) < -e is taken as negative evidence.
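This definition translates directly into a few lines of Python. A small sketch, assuming frequency tables are plain dicts; the default noise threshold s=3 is an assumed value, since the paper leaves s as a tunable constant:

```python
def significance_weight(t, freq_A, freq_R, s=3):
    """SW(t): normalized difference of t's frequency in the accepted
    set A and the rejected set R; 0 when the total frequency does not
    exceed the noise-filtering constant s (s=3 is an assumption)."""
    fa = freq_A.get(t, 0)
    fr = freq_R.get(t, 0)
    if fa + fr > s:
        return (fa - fr) / (fa + fr)
    return 0.0
```

For instance, an evidence item seen 8 times with accepted candidates and 2 times with rejected ones gets SW = (8 - 2) / (8 + 2) = 0.6, while an item seen only 3 times in total falls below the threshold and contributes nothing.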
Combining SW weights
These SW weights for a given candidate item are then combined; if the result exceeds a threshold, the candidate is accepted and becomes available during the next tagging stage.
The primary scheme used by the authors for combining is:
x o y = x + y - xy   if x > 0 and y > 0
x o y = x + y + xy   if x < 0 and y < 0
x o y = x + y        otherwise
Note: Values still remain within [-1.0, 1.0]
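The combining operator and a fold over a candidate's evidence weights can be sketched as follows (`combine_all` and its zero starting value are my additions for illustration):

```python
from functools import reduce

def combine(x, y):
    """The 'o' operator: agreeing evidence reinforces the score while
    the result stays within [-1.0, 1.0]."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y

def combine_all(weights):
    """Fold all of a candidate item's SW values into a single score."""
    return reduce(combine, weights, 0.0)
```

For example, combine(0.5, 0.5) = 0.75 and combine(-0.5, -0.5) = -0.75, so two moderately positive (or negative) pieces of evidence yield a stronger combined score, while mixed evidence simply sums: combine(0.5, -0.2) = 0.3.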
Bootstrapping
The basic bootstrapping process then looks like this:
Procedure Bootstrapping
  Collect seeds
  loop
    Training phase (calculate SW for each evidence item)
    Tagging phase (combine SWs for each candidate item)
  until satisfied
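The pseudocode above can be fleshed out into a runnable sketch. This is my simplified reconstruction, not the authors' code: candidates are pre-extracted (item, evidence-items) pairs, the noise threshold s is set to 0 so the toy data is not filtered out, and the acceptance threshold 0.5 is an assumption.

```python
from collections import Counter
from functools import reduce

def sw(t, fA, fR, s=0):
    """Significance weight; s=0 disables noise filtering for toy data."""
    a, r = fA[t], fR[t]
    return (a - r) / (a + r) if a + r > s else 0.0

def combine(x, y):
    """The authors' combining scheme for two weights in [-1, 1]."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y

def bootstrap(candidates, seeds, cycles=2, threshold=0.5):
    """candidates: list of (item, evidence_item_list); seeds: items
    accepted up front. Each cycle = training phase (recount evidence
    frequencies in the accepted set A vs. the rejected set R) followed
    by a tagging phase (combine SWs and threshold the score)."""
    accepted = set(seeds)
    for _ in range(cycles):
        fA, fR = Counter(), Counter()
        for item, ev in candidates:                 # training phase
            (fA if item in accepted else fR).update(ev)
        tagged = set()
        for item, ev in candidates:                 # tagging phase
            score = reduce(combine, (sw(t, fA, fR) for t in ev), 0.0)
            if score > threshold:
                tagged.add(item)
        accepted = tagged | set(seeds)
    return accepted
```

On a toy version of the earlier example, two seed companies sharing the context "president of" are enough to pull in "Thomson S.A." while leaving "the door" rejected.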
Experiments and Results
Organization names: trained on a 7 MB WSJ corpus, tested on 10 selected articles.
• With seed context features: precision 97%, recall 49%.
• Reached P = 95% and R = 90% after the 4th cycle.
• A similar experiment for identifying products gave lower performance (about 70% precision, 70% recall).