
A Survey of Bootstrapping Techniques in Natural Language Processing

Daniel Waegel
Department of Computer Science

University of Delaware
Newark, DE 19711

[email protected]

Abstract

This paper surveys a selection of seminal bootstrapping articles and presents an in-depth discussion of the algorithms and the relative strengths and weaknesses of the techniques used. We also examine the problems these authors attempt to solve and analyze when it is appropriate to utilize a bootstrapping algorithm. Finally, we discuss some of the limitations inherent in any bootstrapping approach.

1 Introduction

Statistical approaches to information extraction and natural language processing tasks require vast amounts of information in order to perform acceptably and produce reliable results. Training empirical algorithms requires huge corpora. A corpus of unannotated, unmarked natural language text is not often suitable for training purposes on most tasks; algorithms typically need their training data to be annotated in some fashion in order to be able to extract and learn the salient textual features.

Unfortunately, annotated corpora are in short supply. Manually annotating even a small corpus is incredibly time-consuming and is infeasible for larger ones. Automatically tagging a corpus with the necessary information is usually not possible, since the annotations needed for training are typically the information we are trying to find in the first place (Agichtein and Gravano, 2000).

For example, an empirical algorithm that attempts to identify whether a given article is written in a subjective or objective fashion would typically train on a corpus that is marked up with annotations indicating subjectivity and objectivity in some fashion. The algorithm would learn from this annotated training corpus and then (hopefully!) be able to perform this task effectively on unannotated natural language text.

Bootstrapping provides an alternative to painstaking manual annotation. The techniques reviewed herein are all trained on unannotated corpora. The implementers provide their algorithm with a small number of carefully chosen seeds. These seeds are then used as a starting point to gather other, similar terms from an unannotated training corpus, which are in turn used to gather even more terms, and so on. In essence, the algorithm "pulls itself up by its bootstraps", hence the name.

In the following section, we discuss in detail a selection of widely cited bootstrapping methods and the problems they attempt to solve. First, we discuss the Snowball algorithm used by Agichtein and Gravano (2000) to extract relations from unannotated text. Next, we investigate the Basilisk algorithm (Thelen and Riloff, 2002) used for the task of semantic categorization. We also examine the NOMEN bootstrapping system and its application to two tasks: semantic categorization (Lin et al., 2003) and learning generalized names (Yangarber et al., 2002). Finally, we cover how the bootstrapping of extraction patterns can identify elusive subjective language using two different methodologies (Riloff et al., 2003; Riloff and Wiebe, 2003).

We then examine the corpora used in these papers and evaluate the results reported by the authors. We strive to identify the strengths and weaknesses of the various bootstrapping approaches and their suitability for the different applications.

In the last sections we discuss and draw conclusions about the merits of bootstrapping more generally. We demonstrate that it is a powerful and flexible technique that can extract many different types of information with minimal human interaction. We also muse upon the limitations and drawbacks of bootstrapping techniques and what can be done to mitigate such concerns.

2 Bootstrapping Algorithms

While there is great variation between implementations, bootstrapping approaches in natural language processing all follow the same general format:

1. Start with an empty list of things.[1]

2. Initialize this list with carefully chosen seeds.

3. Leverage the things in the list to find more things from a training corpus.

4. Score those newly found things; add the best ones to the list.

5. Repeat from step 3. Stop after a set number of iterations or some other stop condition.

The intuition for bootstrapping stems from the observation that words from the same semantic category tend to appear in similar patterns and similar contexts. For example, the words "water" and "soda" are both in the semantic BEVERAGE category, and in a large text they will likely both be found in phrases containing drank or imbibed. The core intuition is that by searching for patterns like these we can find other BEVERAGEs, and then by finding patterns that contain those, we can find even more.

The core feature of a bootstrapping algorithm is that each iteration is fed the same type of data as its input that it produces as its output. The output of the first iteration is used as the input of the second iteration, and so on. It should not be surprising that choosing the initial set of data (the seeds) from which all other data is "grown" is a critical factor (arguably the most critical factor) in the performance of the algorithm.
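To make this loop concrete, the following Python sketch shows the skeleton shared by the systems surveyed below. It is only an illustration: the caller supplies find_candidates and score_candidate, which stand in for the system-specific pattern induction and scoring steps described in the following subsections, and the parameter values are placeholders rather than settings from any of the surveyed papers.

def bootstrap(seeds, corpus, find_candidates, score_candidate,
              n_iterations=50, n_accept=5):
    """Generic bootstrapping skeleton: grow a lexicon from a few seeds.

    find_candidates(lexicon, corpus) -> set of candidate terms
    score_candidate(candidate, lexicon, corpus) -> float
    """
    lexicon = set(seeds)
    for _ in range(n_iterations):
        # Step 3: leverage the current lexicon to gather new candidates.
        candidates = find_candidates(lexicon, corpus) - lexicon
        if not candidates:
            break  # stop condition: nothing left to learn
        # Step 4: score the candidates and keep only the best few.
        ranked = sorted(candidates,
                        key=lambda c: score_candidate(c, lexicon, corpus),
                        reverse=True)
        lexicon.update(ranked[:n_accept])
    return lexicon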

2.1 Snowball

This paper by Agichtein and Gravano (2000) outlines an attempt to identify and extract structured relations between named entities from unstructured text. In this context, a relation is defined as a table or listing that maps one entity onto another (as in a relational database). The example used throughout the article is mapping an organization entity (e.g., Microsoft or Boeing) onto its headquarters (a location entity such as Redmond for Microsoft).

[1] Usually words or phrases, but it can be any representation of language (such as regular expressions, tuples, etc.).

These relations are represented as a tuple of the form <O, L>. The goal of this research is to allow much more nuanced responses to questions such as "Where is Microsoft headquartered?".

Snowball is built on top of the DIPRE (Dual Iterative Pattern Relation Expansion) method by Brin (1999). The central idea behind both of these, termed pattern relation duality, is that a "good" extraction pattern will produce good tuples given a large enough body of text, and a "good" set of tuples can have good extraction patterns deduced from it. The only user-provided input to the Snowball system is the training corpus and a small handful of manually compiled relations (the seeds). The initial seeds provided by Agichtein and Gravano (2000) were:

ORGANIZATION    HEADQUARTERS
MICROSOFT       REDMOND
EXXON           IRVING
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

Table 1: Seed tuples provided to Snowball.

The system then searches the training corpus looking for documents where both terms in a seed tuple occur in close proximity to one another. It analyzes the connecting text and surrounding context and generates an extraction pattern represented by a Snowball pattern, a 5-tuple of the form <left, tag1, middle, tag2, right>. The left, middle, and right terms are weighted vectors of words that represent the respective left, middle, and right contexts. The other two terms in the 5-tuple, tag1 and tag2, are named-entity tags such as ORGANIZATION or LOCATION. Since the patterns can only possibly match named entities of the proper type, a preprocessing step is required to tag the corpus with a named-entity tagger. An example Snowball pattern generated by the system is the 5-tuple <{<the, 0.2>}, LOCATION, {<-, 0.5>, <based, 0.5>}, ORGANIZATION, { }>, which matches text of the form "the <LOCATION>-based <ORGANIZATION>".

In addition to the Snowball pattern 5-tuples, each document snippet S that contains two named entities with the correct named-entity tags (ORGANIZATION and LOCATION in our examples) is also assigned a corresponding 5-tuple. Three weighted vectors lS, rS, and mS are created from the text surrounding the two named entities. The size of the lS and rS vectors is limited by a window of size w. The vector weights indicate the relative importance, measured as a function of the frequency, of each term in the context. The vectors are scaled so their norm is 1, and then the vectors are multiplied by a scaling factor to indicate each vector's relative importance.[2]

Snowball generates a 5-tuple for each string that contains a co-occurring pair of named entities found in one of the seed tuples. Once all of these are gathered, they are clustered using a simple single-pass clustering algorithm. Their proximity is measured against a similarity threshold τ_sim using the following similarity metric, given two 5-tuples t1 and t2:

Match(t1, t2) = l1 · l2 + m1 · m2 + r1 · r2   if the tags match,
                0                             otherwise.        (1)

After clustering, Snowball patterns are formed by collapsing the left, middle, and right vectors of the 5-tuples in each cluster into their centroids: the left vectors are represented by their centroid l̄s, the middle vectors by the centroid m̄s, and the right vectors by r̄s. Since the similarity metric Match(t1, t2) guarantees that only 5-tuples with the same tags will be similar, the Snowball pattern for the cluster is built as <l̄s, t1, m̄s, t2, r̄s>.
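A minimal sketch of the Match computation is shown below, assuming each 5-tuple stores its left, middle, and right contexts as sparse term-weight dictionaries; the data layout and function names are illustrative rather than taken from Agichtein and Gravano's implementation.

def dot(v1, v2):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    return sum(weight * v2.get(term, 0.0) for term, weight in v1.items())

def match(t1, t2):
    """Similarity between two 5-tuples <left, tag1, middle, tag2, right>,
    per Equation (1): zero unless the named-entity tags agree."""
    l1, tag1a, m1, tag2a, r1 = t1
    l2, tag1b, m2, tag2b, r2 = t2
    if (tag1a, tag2a) != (tag1b, tag2b):
        return 0.0
    return dot(l1, l2) + dot(m1, m2) + dot(r1, r2)

# Example: comparing the "<LOCATION>-based <ORGANIZATION>" pattern from the
# text against a snippet tuple with a similar middle context.
pattern = ({"the": 0.2}, "LOCATION", {"-": 0.5, "based": 0.5}, "ORGANIZATION", {})
snippet = ({"the": 0.3}, "LOCATION", {"-": 0.4, "based": 0.6}, "ORGANIZATION", {})
print(match(pattern, snippet))  # 0.06 + 0.5 + 0.0 = 0.56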

After the Snowball patterns are generated, the corpus can be scanned to locate additional relations (tuples of the form <O, L> in our examples). The algorithm for doing so is outlined in Algorithm 1.

For every sentence S containing an organization and a location tagged by the named-entity tagger, a 5-tuple tS = <ls, t1, ms, t2, rs> is generated. If there exists a pattern P represented by a 5-tuple tP such that Match(tS, tP) ≥ τ_sim, the Snowball system generates a candidate tuple <O, L>.

Thus, after the algorithm is done, each candidate tuple will have one or more patterns associated with it. Each candidate-pattern association has a similarity score associated with it.

The final step is to measure the system's confidence that the patterns and tuples produced are likely to be "good".

[2] In the implementation, Agichtein and Gravano (2000) assign the middle vector higher weights than the left or right vectors, with the intuition that the middle vector is the most important.

foreach snippet ∈ corpus do
    {<O, L>, <ls, t1, ms, t2, rs>} ← CreateOccurrence(snippet);
    TC ← <O, L>;
    SimBest ← 0;
    foreach p ∈ Patterns do
        sim ← Match(<ls, t1, ms, t2, rs>, p);
        if sim ≥ τ_sim then
            UpdatePatternSelectivity(p, TC);
            if sim ≥ SimBest then
                SimBest ← sim;
                PBest ← p;
            end
        end
    end
    if SimBest ≥ τ_sim then
        CandidateTuples[TC].Patterns[PBest] ← SimBest;
    end
end
return CandidateTuples;

Algorithm 1: Extracting new tuples from Snowball patterns
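The pseudo-code above can be rendered in Python roughly as follows. This is a sketch that reuses the match function from the earlier snippet and assumes a hypothetical create_occurrence helper that turns a snippet into its <O, L> tuple and context 5-tuple; the pattern-selectivity bookkeeping used later for pattern confidence is omitted for brevity.

from collections import defaultdict

def extract_candidate_tuples(snippets, patterns, create_occurrence, match,
                             tau_sim=0.6):
    """Sketch of Algorithm 1: associate each candidate <O, L> tuple with the
    Snowball pattern that matches its context best (tau_sim is illustrative)."""
    candidate_tuples = defaultdict(dict)  # <O, L> -> {pattern index: score}
    for snippet in snippets:
        t_c, context = create_occurrence(snippet)  # (<O, L>, context 5-tuple)
        sim_best, p_best = 0.0, None
        for i, p in enumerate(patterns):
            sim = match(context, p)
            if sim >= tau_sim and sim >= sim_best:
                sim_best, p_best = sim, i
        if p_best is not None:
            candidate_tuples[t_c][p_best] = sim_best
    return candidate_tuples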

The confidence in a pattern p is simply the ratio of its positive matches (those matched tuples that agree with known data) to its positive and negative matches (those matched tuples that disagree with known data). The confidence in a candidate tuple Conf(T) is defined as

Conf(T) = 1 − ∏_{i=0}^{|P|} (1 − Conf(Pi) · Match(Ci, Pi)),        (2)

where Pi ∈ P is each pattern that generated T and Match(Ci, Pi) is the similarity score between the candidate T and the pattern Pi.
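Equation (2) amounts to treating each supporting pattern as independent evidence for the candidate tuple. A small sketch of the computation, with the input layout assumed here:

def tuple_confidence(supporting_evidence):
    """Confidence in a candidate tuple, per Equation (2).

    supporting_evidence is a list of (pattern_confidence, match_score) pairs,
    one pair for each pattern that generated the candidate tuple.
    """
    remaining_doubt = 1.0
    for pattern_conf, match_score in supporting_evidence:
        remaining_doubt *= 1.0 - pattern_conf * match_score
    return 1.0 - remaining_doubt

# A tuple generated by two fairly reliable patterns scores higher than one
# generated by a single weak pattern.
print(tuple_confidence([(0.8, 0.9), (0.6, 0.7)]))  # ~0.84
print(tuple_confidence([(0.3, 0.5)]))              # 0.15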

Any candidates with a low confidence score are discarded, and the rest are added to the pool of accepted relations. This newly enlarged pool, which originally starts with the manually chosen seeds, forms the basis for the next cycle of bootstrapping.

2.2 Basilisk

The Basilisk algorithm (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge), developed by Thelen and Riloff (2002), utilizes a bootstrapping approach to developing a semantic lexicon of nouns. This is a categorization task: search for nouns in an unannotated corpus and assign them to one of six semantic categories.[3] These ad hoc categories were manually selected by the authors because the corpus is comprised of articles written about terrorism.[4]

Knowledge of semantic class information is integral to the success of a huge assortment of natural language processing tasks, such as question answering, information extraction (Thelen and Riloff, 2002), and summarization (Lin et al., 2003). Large semantic dictionaries such as WordNet provide a robust and readily accessible semantic lexicon, but Thelen and Riloff (2002) argue that these are insufficient for a number of reasons: they can never be fully comprehensive (due to neologisms, idioms, esoteric verbiage, etc.), and a general-purpose lexicon will not usually contain domain-specific jargon and vocabulary.

Basilisk, like all bootstrapping algorithms, must be seeded with carefully selected terms in order to be effective. Thelen and Riloff (2002) seeded their algorithm by sorting the words in the corpus by frequency and manually identifying the 10 most frequent words for each of the six semantic categories.

Prior to the actual bootstrapping process, the algorithm runs an extraction pattern learner[5] on the corpus to extract every noun phrase in one of three forms: subject ("<subject> was arrested"), direct object ("murdered <direct object>"), or prepositional phrase ("collaborated with <pp object>").

The bootstrapping algorithm then proceeds as per Algorithm 2.

In step 1, the extraction patterns must be evaluated and scored. Basilisk uses the RlogF metric for this task. This is defined as:

RlogF(pattern_i) = (Fi / Ni) · log2(Fi)        (3)

where Fi is the number of nouns extracted by pattern_i known to belong to a semantic category and Ni is the total number of nouns extracted by pattern_i. This metric rewards patterns that are highly selective (have a high precision) as well as patterns that are moderately selective but very prolific (have moderate precision but high recall).
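A direct rendering of the RlogF score, assuming a pattern's extractions and the current lexicon are available as sets:

import math

def rlogf(extracted_nouns, lexicon):
    """RlogF score for a single pattern, per Equation (3).

    extracted_nouns: set of all nouns the pattern extracted (N_i = its size).
    lexicon: set of nouns currently known to belong to the category.
    """
    n_i = len(extracted_nouns)
    f_i = len(extracted_nouns & lexicon)
    if n_i == 0 or f_i == 0:
        return float("-inf")  # a pattern extracting no known members is useless
    return (f_i / n_i) * math.log2(f_i)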

[3] The semantic categories are: BUILDING, EVENT, HUMAN, LOCATION, TIME, and WEAPON.

[4] See Section 3 for more details regarding the corpus.

[5] The learner, the AutoSlog system by Riloff, is briefly mentioned for completeness. A full description is outside the scope of this survey.

Generate all extraction patterns in the corpus and record their extractions.
lexicon ← {seed words}
i ← 0

BOOTSTRAPPING
1. Score all extraction patterns
2. pattern_pool ← top 20 + i patterns
3. candidate_word_pool ← extractions from pattern_pool
4. Score each candidate ∈ candidate_word_pool
5. Add top 5 candidates to lexicon
6. i ← i + 1
7. Go to Step 1

Algorithm 2: Pseudo-code for the Basilisk algorithm

In step 2, the best-scoring patterns are placed in the pattern pool. In the first cycle, Thelen and Riloff (2002) set this parameter to the top 20 patterns. During each subsequent cycle, the pool size is expanded by one. This prevents the pool from becoming "stale"; with a constant pool size, eventually all valid candidates would be extracted by the same top-scoring 20 patterns and only invalid candidates would remain.

In step 3, every extraction produced by the patterns in the pattern pool is added to the candidate word pool, unless it is already present in the lexicon. Next, these candidates need to be scored, which is done using the scoring metric:

score(word_i) = ( Σ_{j=1}^{Pi} log2(Fj + 1) ) / Pi        (4)

where Pi is the number of patterns that extract word_i, and Fj is the number of unique known category members extracted by pattern_j. After scoring, the best candidates are added to the lexicon, the pools are flushed, and the bootstrapping cycle starts over with the enlarged lexicon.
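The candidate scoring of Equation (4) averages the (logged) strength of the patterns that found a word, so a word extracted by several reliable patterns beats one extracted by a single prolific pattern. A sketch, with the input layout assumed here:

import math

def score_candidate_word(extracting_patterns, pattern_extractions, lexicon):
    """Average-log score for a candidate word, per Equation (4).

    extracting_patterns: the patterns that extracted the candidate word.
    pattern_extractions: dict mapping each pattern to the set of nouns it extracts.
    lexicon: current set of known category members.
    """
    if not extracting_patterns:
        return 0.0
    total = 0.0
    for p in extracting_patterns:
        f_j = len(pattern_extractions[p] & lexicon)  # known members found by p
        total += math.log2(f_j + 1)
    return total / len(extracting_patterns)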

2.3 NOMEN

Two of the articles we survey use the NOMEN algorithm, although the goals of each paper were very different. In much the same vein of research as the previous section, Lin et al. (2003) investigate the suitability of bootstrapping for unsupervised learning of semantic classification of terms. We survey this research alongside the previously discussed work of Thelen and Riloff (2002) to examine whether dissimilar bootstrapping approaches to the same problem will have similar results. This gives us some insight into the robustness of the bootstrapping paradigm.[6]

The NOMEN algorithm was first formally described in the literature by Yangarber et al. (2002), where it is used to address the problem of learning generalized names in a biomedical context. Generalized names, such as mad cow disease or Ebola hemorrhagic fever, often lack the (full) capitalization cues afforded to proper names, and detection methods must rely on other metrics to attain reasonable performance. The authors particularly focus on the confounding factor of semantic ambiguity in the bootstrapping process. It is common for names in the biomedical context to refer to both a disease and a symptom, resulting in ambiguity and greatly increasing the difficulty of distinguishing between the two semantic classes. E. coli and encephalitis are two examples of this provided by Yangarber et al. (2002).

Pseudo-code for NOMEN can be found in Algorithm 3. The following details are taken from Lin et al. (2003), as we feel that description is more readable. There are only two small differences between the implementations, which we note below.

The NOMEN algorithm requires its training set to undergo significant pre-processing: the training corpus is run through a zoner, a sentence splitter, a tokenizer/lemmatizer, and a part-of-speech tagger. The zoner is necessary to extract the actual content because the training corpus is a collection of mailing list correspondence with mailing headers and footers. The splitter and tokenizer break the corpus into word tokens, the lemmatizer converts words to their base form, and the results are finally tagged by the part-of-speech tagger.

After pre-processing, the first step is choosing the seeds. Yangarber et al. (2002) chose the 10 most common names for each semantic category from an "IE database of more than 10,000 records". This source is different from the corpora used in training and testing.

[6] We use the term "robust" here not in the software sense of handling abnormal situations gracefully, but rather to indicate a broad capacity to solve problems despite varying approaches and implementations. This is contrasted with a weak or fragile paradigm, which can only adequately solve problems under a specific set of artificially chosen parameters and/or conditions. Assessing the strengths of classes of algorithms is a major issue in computer science, as failed approaches go unpublished in the literature.

NOMEN BOOTSTRAPPING
1. Tag accepted names in corpus
2. For each tag, generate a pattern p = [l−3 l−2 l−1 <t> l+1 l+2 l+3]
3. For every learner, generate pos(p), neg(p), and unk(p)
4. For every pattern, compute acc(p) and conf(p):
   Discard(p) if acc(p) < θ_prec, otherwise Score(p).
   Accept the n top-scoring patterns.
5. For each accepted pattern p, add unk(p) to candidates
6. For every learner, score each candidate name t:
   Mt ← set of accepted patterns matching any instance of t.
   Discard(t) if Mt lacks mass or balance, otherwise Rank(t).
   Each learner accepts the top-scoring k percent of names (maximum of m).
7. Repeat until no more names can be learned

Algorithm 3: Pseudo-code for the NOMEN algorithm

Lin et al. (2003) run multiple experiments using different seeds and often list them without explaining their origin; we presume they are manually selected. The seeds are then placed in the pool of accepted names.

The next step is to tag each instance of every accepted name occurring in the corpus, e.g., <disease>mad cow disease</disease>. Then, for both the opening and closing tags, an extraction pattern is generated around the tag using a context window. This window size was set to 3 in this work. The pattern is then generalized in some way by replacing some elements in the context window with wildcards. It is completely unclear from either paper what the generalized pattern would look like or what kinds of text it would match.

There is one learner per semantic category. For each learner, the generalized patterns are then matched against the training corpus. Since each pattern was created from either the opening or the closing tag, the pattern will only match either the left or the right boundary of a name. NOMEN uses a regular expression of the form [Adj* Noun+] to determine where the other boundary of the word ends. Each learner then checks whether the matched name is a positive (already in this learner's pool of accepted names), negative (already in another learner's pool), or unknown (in no pool) name.

Each pattern, for each learner, is then evaluated using the number of unique matches of each type: pos(p), neg(p), and unk(p). The pattern is scored using accuracy and confidence metrics:

acc(p) = |pos(p)| / (|pos(p)| + |neg(p)|)        (5)

conf(p) = (|pos(p)| − |neg(p)|) / (|pos(p)| + |neg(p)| + |unk(p)|)        (6)

Patterns with an accuracy of less than θ_prec are removed from consideration. The rest are given a final score of Score(p) = conf(p) · log|pos(p)|. The top n patterns for each learner are then accepted.
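A compact sketch of this pattern-scoring step follows; the counts are passed in directly, and the θ_prec value shown is only a placeholder, not the setting used in either paper.

import math

def score_pattern(pos, neg, unk, theta_prec=0.8):
    """Score a NOMEN pattern from its unique-match counts, per Equations (5)
    and (6). Returns None for patterns that should be discarded."""
    if pos + neg == 0:
        return None
    acc = pos / (pos + neg)                    # Equation (5)
    if acc < theta_prec:
        return None                            # discard low-accuracy patterns
    conf = (pos - neg) / (pos + neg + unk)     # Equation (6)
    return conf * math.log(pos)                # Score(p) = conf(p) · log|pos(p)|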

Candidate names are then drawn from the union of the unknown names of accepted patterns (unk(p)). Each candidate t must have an acceptable mass (the set Mt of accepted patterns that generated t must contain 2 or more patterns) and be balanced (at least one accepted pattern matched both the left and right name boundaries). Names that meet these criteria are scored:

rank(t) = 1 − ∏_{p ∈ Mt} (1 − conf(p))        (7)

and each learner then finally accepts up to the top-scoring k percent, but no more than m names, in any given cycle.[7] The cycle is then repeated until no more acceptable names exist.
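Candidate ranking mirrors the Snowball tuple confidence: each supporting pattern independently reduces the remaining doubt about the name. A sketch under the assumptions above, with the left/right balance check omitted for brevity:

def rank_candidate(matching_pattern_confs, min_mass=2):
    """Rank a candidate name from the confidences of the accepted patterns
    that matched it, per Equation (7). Returns None if the candidate lacks
    sufficient mass."""
    if len(matching_pattern_confs) < min_mass:
        return None
    doubt = 1.0
    for conf in matching_pattern_confs:
        doubt *= 1.0 - conf
    return 1.0 - doubt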

2.4 AutoSlog-TS

Riloff and Wiebe (2003) use bootstrapping to identify subjective expressions. Empirical methods have previously been leveraged to create lists of words and n-grams statistically associated with subjective language. Previous work has also resulted in lists of manually compiled subjective words.[8]

However, subjective ideas are expressed in myriad forms, many of which are very hard to detect using conventional natural language processing approaches.

[7] Both implementations used a k of 5% and an m of 5.

[8] A well-known example is the "Verbs of Desire" class composed by Beth Levin.

Sarcastic, metaphorical, and idiomatic phrases are cases where the expression of subjectivity is easily overlooked by empirical methods; all utilize words which may be objective in isolation but become strongly subjective when placed in context.[9] While some idioms, such as kicked the bucket, may be easily captured by an n-gram of sufficient length, many others contain pronouns or proper nouns (as in dealt John a mortal blow).

Riloff and Wiebe (2003) use an unnamed variant of their previously defined AutoSlog-TS algorithm to address some of these issues. The work is notable in that it bootstraps extraction patterns instead of words. An extraction pattern is essentially a regular expression to capture some lexico-syntactic pattern in text, such as <x> drives <y> up the wall, where x and y can match any arbitrary part of speech (a noun or noun phrase would make sense in this case). Riloff and Wiebe (2003) hypothesize that extraction patterns are better suited for capturing non-compositional meaning than other representations like words or n-grams.

Particularly worth noting is that this process is the only one that does not use manually selected seeds. The original set of extraction patterns is instead created by what is termed a high-precision subjective classifier (HP-Subj). This classifier flags sentences as subjective or objective using a large list of lexical items. These items are words, n-grams, and other subjective clues compiled from a wide variety of prior works which showed them to be good indicators of subjectivity. There is also a corresponding high-precision objectivity classifier (HP-Obj) that labels sentences it believes are objective. It functions somewhat differently: it labels sentences that have a dearth of subjective clues within them and in the surrounding textual context.

The AutoSlog-TS algorithm takes as input the set of relevant (subjective) and irrelevant (objective) sentences flagged by the HP-Subj and HP-Obj classifiers. First, syntactic templates generated from prior work are exhaustively applied to the training corpus, so that extraction patterns are generated for every possible instantiation of the templates that appears in the training corpus. The list of templates and examples of learned extraction patterns are shown in Table 2.

Then, these extraction patterns are matched against every sentence in the training corpus.

[9] This is called non-compositional meaning as it defies the linguistic Principle of Compositionality.

SYNTACTIC FORM            EXAMPLE PATTERN
<subj> passive-verb       <subj> was satisfied
<subj> act-verb           <subj> complained
<subj> act-verb dobj      <subj> dealt blow
<subj> verb infinitive    <subj> appear to be
<subj> aux noun           <subj> has position
active-verb <dobj>        endorsed <dobj>
infinitive <dobj>         to condemn <dobj>
verb infinitive <dobj>    get to know <dobj>
noun aux <dobj>           fact is <dobj>
noun prep <np>            opinion on <np>
active-verb prep <np>     agrees with <np>
pass-verb prep <np>       was mad about <np>
infinitive prep <np>      to resort to <np>

Table 2: Syntactic templates and example patterns instantiated from them (Riloff and Wiebe, 2003)

The frequency with which each extraction pattern occurs in known subjective and objective sentences (labeled earlier by the HP-Subj and HP-Obj classifiers) is calculated, and this information is used to rank each pattern using the scoring function:

Pr(subj | pattern_i) = subjfreq(pattern_i) / freq(pattern_i)        (8)

where subjfreq(pattern_i) is the number of times the pattern matched part of a known subjective sentence and freq(pattern_i) is the total number of times it matched any sentence. This is essentially the "precision" of the pattern.

After all patterns are scored, two threshold measures are used to select the patterns deemed most subjective: patterns are selected for which freq(pattern_i) ≥ θ1 and Pr(subj|pattern_i) ≥ θ2. The selected patterns are used to label additional sentences as subjective; these new sentences are then mixed in with the original sentences, and the extraction pattern learner is re-run to complete the bootstrapping cycle.
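A sketch of the threshold-based selection step, with the pattern counts supplied as a dictionary; the threshold values below are placeholders rather than the settings reported in the paper.

def select_subjective_patterns(pattern_counts, theta_freq=5, theta_prob=0.9):
    """Select subjective extraction patterns, per Equation (8) and the two
    thresholds described above.

    pattern_counts maps each pattern to (subj_freq, total_freq): how often it
    matched a known-subjective sentence and how often it matched any sentence.
    """
    selected = []
    for pattern, (subj_freq, total_freq) in pattern_counts.items():
        if total_freq >= theta_freq and subj_freq / total_freq >= theta_prob:
            selected.append(pattern)
    return selected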

2.5 MetaBoot + Basilisk

Previously mentioned approaches have used a single algorithm to perform bootstrapping. Another approach, employed by Riloff et al. (2003), is to use multiple algorithms simultaneously to achieve a synergistic effect. This can perform well if the combined algorithms each find a different set of terms. The goal is again to identify subjective language. Specifically, it is framed as a classification problem: classify sentences in the corpus as either subjective or objective. The bootstrapping is performed on subjective nouns rather than on extraction patterns as is the case in Riloff and Wiebe (2003).

As the Basilisk algorithm was already discussed in Section 2.2 and MetaBoot is not the core focus of this approach, we will only provide a brief description of how the MetaBoot algorithm functions.[10] Like Basilisk, it was originally designed for semantic categorization tasks. It is re-purposed in this paper for learning subjective nouns. It is seeded with words belonging to the desired class (subjective in this case), and then creates extraction patterns by instantiating templates that extract every noun phrase in the corpus. The patterns are scored based upon the number of seeds extracted. The best pattern is saved and then the head noun from every extracted noun phrase is accepted into the category. The patterns are then re-scored and the cycle repeats.

What differentiates MetaBoot from the previously discussed algorithms is that it has a second level of bootstrapping. After each cycle completes, all nouns put into the dictionary during the cycle are scored based on the number of patterns that extracted them. Only the five best are allowed to remain.
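A rough sketch of one such cycle, based only on the description above, might look like the following. The data layout, inner-loop length, and the assumption that it is the best pattern's extractions that get accepted are all illustrative; see Riloff and Jones (1999) for the actual algorithm.

def metaboot_cycle(lexicon, patterns, pattern_extractions, n_keep=5,
                   inner_iterations=10):
    """One MetaBoot-style cycle: an inner pattern/noun bootstrapping loop
    followed by a second-level filter that keeps only the best-supported nouns.

    pattern_extractions maps each pattern to the set of head nouns it extracts.
    """
    grown = set(lexicon)
    used_patterns = []
    for _ in range(inner_iterations):
        unused = [p for p in patterns if p not in used_patterns]
        if not unused:
            break
        # Score each remaining pattern by how many current category members
        # it extracts, save the best one, and accept its head nouns.
        best = max(unused, key=lambda p: len(pattern_extractions[p] & grown))
        used_patterns.append(best)
        grown |= pattern_extractions[best]
    # Second level of bootstrapping: of the nouns added this cycle, keep only
    # those supported by the most patterns.
    new_nouns = grown - set(lexicon)
    support = lambda noun: sum(noun in pattern_extractions[p] for p in used_patterns)
    survivors = sorted(new_nouns, key=support, reverse=True)[:n_keep]
    return set(lexicon) | set(survivors)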

Both algorithms were used in tandem and run for 400 iterations. Two were used because they captured different sets of subjective nouns while being provided with the same seeds. Since the objective is to produce a lexicon of subjective nouns, this was a desirable feature: the authors would have had to come up with different seed words in order to produce more results from a single system. The final lexicon was the union of the output of the two algorithms.

3 Bootstrapping Corpora

All the corpora used in these experiments were unannotated. This is because bootstrapping is a process that needs no additional markup and is therefore ideal for tasks on unannotated natural language text. Furthermore, many of the tasks are such that annotations would be unhelpful, e.g., building lexicons. Agichtein and Gravano (2000), for the Snowball algorithm, used the North American News Text Corpus spanning from 1995 to mid-1997. They split this corpus into a training and a testing set: the training set was the 178,000 documents from 1996, and the testing set was the 142,000 documents from 1995 and 1997.

[10] For a full description, see Riloff and Jones (1999).

Thelen and Riloff (2002), with the Basilisk algorithm, utilized the MUC-4 Proceedings corpus from 1992. This contained 1,700 documents focusing on terrorism.

Yangarber et al. (2002) and Lin et al. (2003) both used the ProMED corpus for their experiments with the NOMEN algorithm. This corpus contained articles written in the biomedical field. Yangarber et al. (2002) used 5,100 ProMED articles spanning the years 1999 to 2001, while Lin et al. (2003) used a smaller subset containing only 1,400 articles. The latter also performed some experiments on learning proper names in Chinese. Since there are no case distinctions in Chinese, this bears some similarity to the task of learning generalized names by Yangarber et al. (2002). They used a training corpus consisting of approximately 700,000 words from the Beijing University Institute of Computational Linguistics corpus.

Riloff and Wiebe (2003) and Riloff et al. (2003), using AutoSlog-TS and MetaBoot + Basilisk respectively, both utilized the U.S. Foreign Broadcast Information Service (FBIS) corpus. This corpus contains news stories in multiple languages, paired with their translations in English (e.g., pairs of English-Chinese stories). From this corpus, 302,163 sentences were used.

4 Experimental Results

As discussed earlier, Agichtein and Gravano (2000) performed the task of attempting to extract <ORGANIZATION, LOCATION> relations, specifically a company and its headquarters. The precision and recall of their algorithm, as compared to a baseline, the DIPRE system that Snowball is based on, and an alternate version of Snowball that discards punctuation (Snowball-plain), are detailed in Figure 1. The baseline is computed by tabulating each co-occurrence of an organization and a location and assuming the most frequently co-occurring location is the headquarters of the organization. The x-axis in these charts represents successive runs, where the number of times a tuple needs to be found before it is accepted is increased. As we can see, the baseline does quite well, but Snowball still comes out on top in both precision and recall. The results are reported after 3 iterations because Snowball's results quickly stabilize (see Figure 4).

Figure 1: Precision and recall of the Snowball system (Agichtein and Gravano, 2000)

Thelen and Riloff (2002) utilized their Basilisk algorithm to build a semantic lexicon for six categories. The authors manually scored each word in the resulting lexicon for correctness and did the same for an implementation of MetaBoot that they used as a baseline. Results were compared at various points between 100 and 1000 bootstrapping iterations. Basilisk substantially outperformed MetaBoot at every iteration, especially when it learned its semantic categories in tandem instead of sequentially.

Yangarber et al. (2002) compared their implementation of the NOMEN algorithm to several manually constructed reference lists, which they used as their gold standard for precision and recall. Figure 2 shows the results of their best score (Dis + Loc + Sym + Other) for the semantic category of disease names.

This score was obtained by simultaneously bootstrapping multiple semantic categories at the same time.

Figure 2: NOMEN precision/recall (Yangarber et al., 2002)

Without a baseline to compare to, it is difficult to evaluate success (see Section 5.1 for a discussion of this problem). However, Lin et al. (2003) performed some similar experiments but greatly increased the number of semantic categories being simultaneously learned. The resulting heightened precision, shown in Figure 3, confirms the trend.

Figure 3: NOMEN precision/recall (Lin et al., 2003)

The AutoSlog-TS system detailed in Riloff and Wiebe (2003) performed quite well according to the metrics reported. Their system learned 17,073 patterns that tended to extract subjective expressions. To evaluate these, they ran different subsets of the patterns on a manually annotated test set that contained 54% subjective sentences. The resulting precision when extracting subjective sentences ranged between 71% and 85%.

The MetaBoot + Basilisk goal pursued by Riloff et al. (2003) was also building a semantic lexicon. In Table 3 we see that the resulting lexicon was manually reviewed and tagged by the authors as either strongly or weakly subjective words. Here we see how merging the lexicons produced by each algorithm yields a final lexicon that is significantly larger. To give some insight into the accuracy of the system, after 1000 generations Basilisk had a lexicon containing approximately 55% subjective words; MetaBoot contained approximately 29%. After 2000 generations, these figures had fallen to roughly 44% and 25%, respectively.

             B     M     B ∩ M   B ∪ M
StrongSubj   372   192   110     454
WeakSubj     453   330   185     598
Total        825   522   295     1052

Table 3: Subjective words in lexicon (Riloff et al., 2003)

5 Discussion

The previous sections discussed the philosophy behind bootstrapping and provided a high-level overview of how several seminal bootstrapping algorithms are implemented. We then evaluated what kind of data they were trained on and how they performed. In the following sections, we discuss the strengths, weaknesses, and idiosyncrasies of the bootstrapping model.

5.1 Evaluation Methodologies

The bootstrapping tasks surveyed here were trained and tested on unannotated corpora. This posed difficulties when attempting to evaluate the performance of the bootstrapper, as there was often no gold standard against which to compare the bootstrapping results.

Agichtein and Gravano (2000) used their Snowball system to extract Organization → Headquarters relations from an unannotated corpus, but then had no gold standard against which to check. They ultimately addressed the lack of an ideal set for their extracted relations by downloading a publicly available directory of company information and parsing out the requisite values. Lin et al. (2003) manually compiled two gold-standard lists themselves in order to have something to compare their results against. Riloff et al. (2003) and Riloff and Wiebe (2003) both had to manually annotate their resulting lexicons in order to ascertain whether they even contained any correct results. Even after manually finding the subjective words in their results, they still did not have any kind of baseline to compare to. These difficulties in evaluating performance are a hindrance to the development and adoption of bootstrapping algorithms.

Evaluation is further confused by the fact that some bootstrapping algorithms are evaluated using different standards than others. Yangarber et al. (2002) note when presenting their results that algorithms can have their recall and precision scored for each instance of a word correctly tagged in the corpus (termed instance-based or token-based evaluation), or they can be scored only once per word correctly found (type-based). They note that which method is used often depends on the end goal: if the goal is to tag each instance in the corpus with a semantic category, then token-based evaluation makes sense; if the goal is to build a lexicon, then type-based evaluation makes sense. However, since both objectives can be achieved by the same algorithm just keeping track of different things, this can make comparison efforts difficult.

5.2 Role of Bootstrapping

As broad, all-encompassing hierarchical semantic databases like WordNet continue to mature, it is reasonable to wonder if techniques for identifying semantic categories will become obsolete. However, there are many reasons why this is unlikely, although such techniques may be relegated to niche roles. General semantic databases like WordNet are not designed for domain-specific lingo and jargon, and techniques like bootstrapping may be the only solution for such tasks (Thelen and Riloff, 2002).

Additionally, specific tasks may always need bootstrapping. Snowball bootstraps specific relations; it maps a semantic class onto a related semantic class. Basilisk bootstraps nouns into arbitrary semantic classes. We saw it used for both the identification of subjective words and the semantic categorization of other words. There is a practically unlimited number of potential ad hoc semantic classes and relations between them. At their core, bootstrapping algorithms are not limited to learning general semantic categories: they are useful for learning any category or relationship for which words appear in similar linguistic phrases or contexts.

5.3 Weaknesses of Bootstrapping

5.3.1 Semantic Drift

A well-known flaw in bootstrapping is a phenomenon known as semantic drift or creep. This occurs when, after multiple iterations, the bootstrapper wanders away from the original semantic meaning of the seeds and begins to accept incorrect or undesirable entities.

This can occur if entities have more than one semantic word sense; Yangarber et al. (2002) note that bootstrapping disease names may also begin to pick up symptoms because some symptoms and diseases share names (e.g., encephalitis). It can also occur if words have similar meaning and usage but different undertones.

Consider a hypothetical bootstrapper that was seeded with discussion. On further iterations, it might pick up debate → disagreement → argument → altercation → brawl, if they all appeared in similar contexts in the training corpus (for example, "I had a(n) <x> with him during dinner"). While each iteration may be semantically similar to the last word added (discussion and debate, argument and altercation, etc.), the new words share less and less semantic meaning with the original seed word. By the end, brawl is hardly related at all to discussion.

There are several ways to guard against this kind of semantic creep. The chief line of defense is the evaluation of candidates before adding them to the system. All of the bootstrapping systems surveyed had some kind of candidate evaluation; most used a confidence measure. The effectiveness of this confidence measure is clarified by comparing the performance of the Snowball system to DIPRE (see Figure 4). The DIPRE system peaks and then quickly plummets since "it has no way to prevent unreliable tuples from being seed[s] for its next iteration".

Figure 4: Effects of confidence measure, DIPRE vs. Snowball

Another technique, employed by multiple authors (Thelen and Riloff, 2002; Yangarber et al., 2002; Lin et al., 2003), is the parallel or simultaneous learning of various semantic categories, instead of learning them serially. This offers an advantage that may not be immediately obvious: each category's seeds will accept the terms most closely related to it during each iteration. When terms are forbidden from being shared between categories, this effectively partitions the search space and limits how far a category can grow before being constrained by the others. Another way simultaneous learning can help was discussed by Lin et al. (2003). Their system not only prohibited semantic categories from including terms already included in another category, but also discouraged each semantic category from learning patterns that matched terms included in too many other categories. In this way, it provided a self-stabilizing or auto-correcting feature to the bootstrapping process to help keep it "on track". These techniques, however, can only be utilized when the search space is partitionable.

5.3.2 Stop Conditions

One major drawback of bootstrapping is the difficulty of knowing when to stop the iterative cycles. Some (Riloff et al., 2003) stopped after a set and seemingly arbitrary number of iterations, while others (Lin et al., 2003) stopped when their algorithm had exhaustively added or rejected all remaining candidates. The metrics used to determine whether a bootstrapping algorithm should terminate are often of a dubious nature. For example, the confidence measures used by the majority of the bootstrapping algorithms we surveyed cannot make any guarantees about precision or recall; they can simply tell whether what the system already has is arbitrarily similar to what it is now finding.

5.3.3 Seeding Decisions

As we have seen, the choice of seeds is arguably the most critical step in bootstrapping. However, most authors simply chose the most frequently occurring words in their corpus that they quickly identified as belonging to the category they were interested in. While this ensures that the greatest amount of contextual information will be available to learn from, it does nothing to ensure the quality of those contexts. It is easy to imagine improperly chosen seeds that would pick up a large number of extraction patterns that, in turn, extract very poor additional words, ultimately producing poor results. None of the surveyed literature addressed how to find seed choices beyond taking the most frequent words. It would be interesting to read literature on how different seeding choices impact the final performance of the system and on guidelines or rules for making seeding decisions.

6 Conclusion

We examined in detail multiple approaches to bootstrapping for natural language processing tasks. We explored not only a variety of different goals but also multiple techniques.

We conclude that bootstrapping is a powerful method for extracting useful data from enormous volumes of unstructured text. However, bootstrapping systems must be carefully seeded and constrained to avoid growing in unwanted directions or becoming overly diluted with irrelevant data. The choice of seeds is pivotal to the success of the bootstrapping process, and it is not at all clear how to determine what the "best" seeds might be.

Bootstrapping is amenable to a wide variety of natural language processing tasks, such as semantic categorization, relation extraction, and subjectivity analysis. Bootstrapping systems are well suited to natural language tasks due to their ability to learn from and navigate the syntactically rich, loosely structured, and extremely complex nature of natural language.

References

Agichtein, Eugene and Gravano, Luis. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 85–94. ACM.

Brin, Sergey. 1999. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases, pages 172–183. Springer Berlin Heidelberg.

Thelen, Michael and Riloff, Ellen. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 214–221. Association for Computational Linguistics.

Yangarber, Roman, Lin, Winston, and Grishman, Ralph. 2002. Unsupervised learning of generalized names. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics.

Lin, Winston, Yangarber, Roman, and Grishman, Ralph. 2003. Bootstrapped learning of semantic classes from positive and negative examples. In Proceedings of the ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data, Volume 4, No. 4.

Riloff, Ellen, Wiebe, Janyce, and Wilson, Theresa. 2003. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, Volume 4, pages 25–32. Association for Computational Linguistics.

Riloff, Ellen and Wiebe, Janyce. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 105–112. Association for Computational Linguistics.

Riloff, Ellen and Jones, Rosie. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI/IAAI, pages 474–479.