Using Wikipedia for Hierarchical Finer Categorization of Named Entities

Aasish Pappu

Language Technologies Institute

Carnegie Mellon University

PACLIC 2009

Outline

1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
6 Conclusion

1 Introduction

• A structured and organized encyclopedic corpus is a suitable training corpus.
  – covers a wide range of topics
  – provides hyperlinks

• In this paper, we:
  1) discuss the usability of Wikipedia
  2) induce the WordNet and Wikipedia domain taxonomies into the feature space
  3) use Maximum Entropy and SVM classifiers

2 Related Work

• Kazama and Torisawa (2007)
  – extracted gloss text
• Dakka and Cucerzan (2008)
  – tagged the Wikipedia data
• Bunescu and Pasca (2006)
  – built a disambiguation system

3 Corpus Creation

• 10-18-2007 English version of Wikipedia
• 2 million articles
• 292,384 categories
• a taxonomy with a depth of about 10
  – 5882 Wikipedia Stub categories
  – 105 domains

3.1 Categories in Wikipedia
3.2 Named entity categories
3.3 Procedure

3.1 Categories in Wikipedia

• taxonomy
  – constituted by categories
  – categories are linked to other categories across depth and breadth
• contains cycles
  – tackled by Zesch and Gurevych (2007)
• the Wikipedia taxonomy is not a tree
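
Because the category graph contains cycles, any traversal of it has to remember which categories it has already seen. The following is a minimal sketch of that idea, not the method of Zesch and Gurevych. It assumes a hypothetical toy graph stored as a dict from category to subcategories; the function name, the `max_depth` parameter, and the category names are illustrative only.

```python
from collections import deque

def category_depths(graph, root, max_depth=10):
    """Breadth-first walk recording the shortest depth of each category.

    A category already present in `depths` is never queued again,
    so a cycle in the graph cannot cause an endless loop.
    """
    depths = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        if depths[cat] >= max_depth:
            continue  # do not expand below the depth limit
        for sub in graph.get(cat, []):
            if sub not in depths:
                depths[sub] = depths[cat] + 1
                queue.append(sub)
    return depths

# Hypothetical toy graph with a cycle: Science -> Physics -> Science
toy_graph = {
    "Science": ["Physics", "Biology"],
    "Physics": ["Optics", "Science"],
    "Biology": [],
}
depths = category_depths(toy_graph, "Science")
print(depths)
```

Breadth-first order is used here so that each category gets its shortest distance from the root, which is one reasonable way to define depth in a graph that is not a tree.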

3.2 Named entity categories

• the domain hierarchy
  – 17 basic domains
  – 88 sub-domains

• Categories are chosen so as to avoid bias towards any particular domain
• rules for choosing the set of categories
  – ensure diversity in the categorization task
  – ensure the selected categories are balanced
  – within each domain, consider the category whose parameters are each closest to the mean value for that domain
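
The "closest to the mean" rule can be sketched as follows. This is only an illustration under stated assumptions: the slides do not list the parameters or the distance measure, so the two numeric parameters in the toy data and the relative-deviation score are hypothetical choices.

```python
from statistics import mean

def closest_to_mean(categories):
    """Return the category whose parameters lie nearest the domain mean.

    `categories` maps a category name to a tuple of numeric parameters.
    Each deviation is divided by the parameter's mean so that parameters
    on different scales contribute comparably (an assumed design choice).
    """
    names = list(categories)
    n_params = len(categories[names[0]])
    means = [mean(categories[n][i] for n in names) for i in range(n_params)]

    def deviation(name):
        return sum(
            abs(categories[name][i] - means[i]) / means[i] if means[i] else 0.0
            for i in range(n_params)
        )

    return min(names, key=deviation)

# Hypothetical domain with three candidate categories and two parameters each.
domain = {"A": (10, 100), "B": (20, 210), "C": (30, 290)}
chosen = closest_to_mean(domain)
print(chosen)
```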

3.3 Procedure

• extract named entity phrases
  – using the Stanford POS tagger
• extract typed dependency relationships
• extract the content words around a named entity
  – collect the NPs (noun phrases) and VPs (verb phrases)

1) First, look for redirected and disambiguated article titles that match the first name of the named entity.

2) If there is more than one such title, choose the target title using the minimum edit distance metric.

3) Pick all articles that fall under the same category as the target article.

4) Among them, look for the articles that fall under the special categories chosen for the classification task.

5) Find the article that shares the maximum number of categories with the target article, and label the target article with its special category.
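
Steps 2 to 5 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy articles, and the tie-breaking choices (first minimum wins, alphabetically first special category) are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def pick_target_title(entity, candidate_titles):
    """Step 2: choose the candidate title with minimum edit distance."""
    return min(candidate_titles,
               key=lambda t: edit_distance(entity.lower(), t.lower()))

def label_target(target, article_categories, special_categories):
    """Steps 3-5: among articles sharing a category with the target and
    carrying a special category, take the one with the largest overlap."""
    target_cats = article_categories[target]
    best_label, best_overlap = None, 0
    for article, cats in article_categories.items():
        if article == target:
            continue
        overlap = len(target_cats & cats)
        specials = cats & special_categories
        if specials and overlap > best_overlap:
            best_label, best_overlap = sorted(specials)[0], overlap
    return best_label

# Hypothetical toy data.
article_categories = {
    "Paris": {"Capitals in Europe", "Cities in France"},
    "Lyon": {"Cities in France", "Cities"},
    "Paris Hilton": {"American socialites", "People"},
}
special = {"Cities", "People"}
target = pick_target_title("Pariss", ["Paris", "Paris Hilton"])
label = label_target(target, article_categories, special)
print(target, label)
```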

• About 10,000 samples
  – Training: 75%
  – Testing: 25%


4 Features

• four types of feature sets
  – one syntactic feature set
  – three semantic feature sets

4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features

4.1 Typed Dependency Feature

• phrase structure parse
  – represents the nesting of multi-word constituents
• dependency parse
  – represents dependencies between individual words
• dependency relations give a clue about the probable semantic relations that can be associated with the named entity
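
To make the idea concrete, here is a small sketch of turning typed dependencies into features for one named entity. It assumes the parser output is already available as `(relation, governor, dependent)` triples; the feature string format and the example sentence are hypothetical, not taken from the paper.

```python
def dependency_features(triples, entity):
    """Keep every typed dependency that touches the entity.

    Each feature records the relation name and the word at the other end,
    plus whether that word is the governor or the dependent.
    """
    features = []
    for relation, governor, dependent in triples:
        if dependent == entity:
            features.append(f"{relation}:gov={governor}")
        elif governor == entity:
            features.append(f"{relation}:dep={dependent}")
    return features

# Hypothetical triples for "Mozart, a composer, composed symphonies".
triples = [
    ("nsubj", "composed", "Mozart"),
    ("dobj", "composed", "symphonies"),
    ("appos", "Mozart", "composer"),
]
features = dependency_features(triples, "Mozart")
print(features)
```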

4.2 Hypernyms

• a semantically specific hypernym feature is preferred
  – the hypernyms of all synsets are ordered inversely by their depth in the hypernym tree
  – the deepest hypernym in the lot is chosen as the target feature for that content word
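
The selection rule above can be sketched without WordNet itself. The example below uses a hypothetical toy taxonomy and sense inventory in place of the real hypernym tree, so the node names and the two dictionaries are assumptions made purely for illustration.

```python
# Hypothetical toy taxonomy: each node maps to its parent (None marks the root).
PARENT = {
    "entity": None,
    "organism": "entity",
    "person": "organism",
    "artist": "person",
    "musician": "artist",
    "artifact": "entity",
    "instrument": "artifact",
}

# Hypothetical sense inventory: content word -> one hypernym per sense.
SENSE_HYPERNYMS = {
    "pianist": ["musician"],
    "bass": ["instrument", "musician"],
}

def depth(node):
    """Number of edges between a node and the root of the toy tree."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d

def deepest_hypernym(word):
    """Order the word's hypernyms from deepest to shallowest, take the first."""
    ordered = sorted(SENSE_HYPERNYMS[word], key=depth, reverse=True)
    return ordered[0]

print(deepest_hypernym("bass"))
```

A deeper node sits further from the root and is therefore more specific, which is why the deepest candidate serves as the feature.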

4.3 Domain based features

4.3.1 WordNet domains
4.3.2 Wikipedia domains
4.3.3 WDH vs. Wikipedia Domain System

4.3.1 WordNet domains

• Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH).

• There are 5 top-level domains and 46 basic domains in WDH.

4.3.2 Wikipedia domains

• Wikipedia is indexed
• each content word is searched in the index to find the categories with the largest number of pages containing that word
• pages in which the word appears as a hyperlink are weighted double compared with pages that contain the word without a hyperlink
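
The weighting scheme can be illustrated with a short sketch. It assumes each indexed page is represented by its categories plus two word sets (words occurring as link text and words occurring as plain text); this page layout, the field names, and the toy pages are hypothetical and stand in for whatever index was actually used.

```python
from collections import Counter

def category_scores(pages, word):
    """Score categories by the pages that contain `word`.

    A page where the word appears as a hyperlink counts 2,
    a page where it appears only as plain text counts 1.
    """
    scores = Counter()
    for page in pages:
        if word in page["linked_words"]:
            weight = 2
        elif word in page["plain_words"]:
            weight = 1
        else:
            continue
        for category in page["categories"]:
            scores[category] += weight
    return scores

# Hypothetical toy index of three pages.
pages = [
    {"categories": ["Music"], "plain_words": {"guitar"}, "linked_words": set()},
    {"categories": ["Music", "Instruments"], "plain_words": set(), "linked_words": {"guitar"}},
    {"categories": ["Sports"], "plain_words": {"football"}, "linked_words": set()},
]
scores = category_scores(pages, "guitar")
print(scores.most_common(1))
```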

4.3.3 WDH vs. Wikipedia Domain System

5 Experiments

5.1 Experiment 1: Feature-wise model
5.2 Experiment 2: Feature combination model
5.3 Experiment 3: Error analysis

6 Conclusion

• presented a named entity categorization system
  – employs Wikipedia categories as classes
• adapted the hierarchical categorization of Wikipedia
  – mine relations among named entities
