graph-based analysis of espresso-style minimally-supervised bootstrapping algorithms jan 15, 2010...

52
Graph-based Analysis of Espresso- style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Upload: piers-griffith

Post on 27-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Graph-based Analysis of Espresso-styleMinimally-supervised Bootstrapping

Algorithms

Jan 15, 2010

Mamoru KomachiNara Institute of Science and Technology

Page 2: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Corpus-based extraction of semantic knowledge

2

Singapore

Hong Kong

___ visa Hong Kong

China

___ history

Australia

Egypt

Instance Pattern New instance

Alternate step by step

Input Output(Extracted from corpus)

Singapore visa

Singapore map

Page 3: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Semantic drift is the central problem of bootstrapping

Singapore

card

visa ___ Australia

messages

greeting __

card

words

Instance Pattern New instance

Errors propagate to successive iteration

Input Output(Extracted from corpus)

Semantic category changed!

Generic patternsPatterns co-occurring with many irrelevant instances

Generic patternsPatterns co-occurring with many irrelevant instances

3

Page 4: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Main contributions of this work

1. Suggest a parallel between semantic drift in Espresso [Pantel and Pennachiotti, 2006] style bootstrapping and topic drift in HITS [Kleinberg, 1999]

2. Solve semantic drift using “relatedness” measure (regularized Laplacian) instead of “importance” measure (HITS authority) used in link analysis community

4

Page 5: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Table of contents

5

4.Graph-based Analysis of Espresso-style Bootstrapping Algorithms

3.Espresso-style Bootstrapping Algorithms

2. Overview of Bootstrapping Algorithms

5.Word Sense Disambiguation

6.Bilingual Dictionary Construction

7.Learning Semantic Categories

Page 6: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Espresso Algorithm [Pantel and Pennacchiotti, 2006]

• Repeat• Pattern extraction• Pattern ranking• Pattern selection• Instance extraction• Instance ranking• Instance selection

• Until a stopping criterion is met

6

Page 7: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Pattern/instance ranking in Espresso

Score for pattern p Score for instance i

7

r (p)

pmi(i, p)

max pmir(i)

iI

I

r(i)

pmi(i, p)

max pmir (p)

pP

P

,**,,*,

,,log),(

pyx

ypxpipmi

p: patterni: instanceP: set of patternsI: set of instancespmi: pointwise mutual informationmax pmi: max of pmi in all the patterns

and instances

Page 8: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Espresso uses pattern-instance matrix Afor ranking patterns and instances

|P|×|I|-dimensional matrix holding the (normalized) pointwise mutual information (pmi) between patterns and instances

8

1 2 ... i ... |I|

1

2

:

p

:

|P|

[A]p,i = pmi(p,i) / maxp,i pmi(p,i)[A]p,i = pmi(p,i) / maxp,i pmi(p,i)

instance indices

pattern indices

Page 9: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

p = pattern score vectori = instance score vectorA = pattern-instance matrix

Pattern/instance ranking in Espresso

9

... pattern ranking

... instance ranking

Reliable instances are supported by reliable patterns, and vice versa

|P| = number of patterns|I| = number of instances

normalization factorsto keep score vectors not too large

Page 10: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Espresso Algorithm [Pantel and Pennacchiotti, 2006]

• Repeat• Pattern extraction• Pattern ranking• Pattern selection• Instance extraction• Instance ranking• Instance selection

• Until a stopping criterion is met

10

For graph-theoretic analysis,

we will introduce 3 simplifications to Espresso

For graph-theoretic analysis,

we will introduce 3 simplifications to Espresso

Page 11: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

First simplification

• Compute the pattern-instance matrix• Repeat

• Pattern extraction• Pattern ranking• Pattern selection• Instance extraction• Instance ranking• Instance selection

• Until a stopping criterion is met

11

Simplification 1Remove pattern/instance extraction steps

Instead, pre-compute all patterns and instances once in the beginning of the algorithm

Simplification 1Remove pattern/instance extraction steps

Instead, pre-compute all patterns and instances once in the beginning of the algorithm

Page 12: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Second simplification

• Compute the pattern-instance matrix• Repeat

• Pattern ranking• Pattern selection

• Instance ranking• Instance selection

• Until a stopping criterion is met

12

Simplification 2Remove pattern/instance selection stepswhich retain only highest scoring k patterns / m instances for the next iterationi.e., reset the scores of other items to 0

Instead,retain scores of all patterns and instances

Simplification 2Remove pattern/instance selection stepswhich retain only highest scoring k patterns / m instances for the next iterationi.e., reset the scores of other items to 0

Instead,retain scores of all patterns and instances

Page 13: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Third simplification

• Compute the pattern-instance matrix• Repeat

• Pattern ranking

• Instance ranking

• Until a stopping criterion is met

13

Until score vectors p and i converge

Simplification 3No early stopping i.e., run until convergence

Simplification 3No early stopping i.e., run until convergence

Page 14: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Simplified Espresso

Input• Initial score vector of seed instances• Pattern-instance co-occurrence matrix A

Main loopRepeat

Until i and p converge

OutputInstance and pattern score vectors i and p

14

... pattern ranking

... instance ranking

Page 15: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Simplified Espresso is HITS

Simplified Espresso = HITS in a bipartite graph whose adjacency matrix is A

Problem

No matter which seed you start with, the same instance is always ranked topmostSemantic drift (also called topic drift in HITS)

15

The ranking vector i tends to the principal eigenvector of ATA as the iteration proceedsregardless of the seed instances!

Page 16: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

How about Espresso?

Espresso has heuristics not present in Simplified Espresso

• Early stopping• Pattern and instance selection

Do these heuristics really help reduce semantic drift?

16

Page 17: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Experiments on semantic drift

Does the heuristics in original Espresso help reduce drift?

Page 18: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Word sense disambiguation task of Senseval-3 English Lexical Sample

Predict the sense of “bank”

18

… the financial benefits of the bank (finance) 's employee package ( cheap mortgages and pensions, etc ) , bring this up to …

In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden

Possibly aligned to water a sort of bank(???) by a rushing river.

Training instances are annotated with their sense

Predict the sense of target word in the test set

Page 19: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Word sense disambiguation by Espresso

Seed instance = the instance to predict its senseSystem output = k-nearest neighbor (k=3)Heuristics of Espresso• Pattern and instance selection

• # of patterns to retain p=20 (increase p by 1 on each iteration)

• # of instance to retain m=100 (increase m by 100 on each iteration)

• Early stopping19

Seed instanceSeed instance

Page 20: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Convergence process of Espresso

20

Heuristics in Espresso helps reducing semantic drift

(However, early stopping is required for optimal performance)

Output the most frequent sense regardless of input

Espresso

Simplified Espresso

Most frequent sense (baseline)

Semantic drift occurs (always outputs the most frequent sense)

Page 21: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Learning curve of Espresso: per-sense breakdown

21

# of most frequent sense predictions increases

Precision for infrequent senses worsens even with

original Espresso

Most frequent sense

Other senses

Page 22: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Summary: Espresso and semantic drift

Semantic drift happens because• Espresso is designed like HITS• HITS gives the same ranking list regardless of

seeds

Some heuristics reduce semantic drift• Early stopping is crucial for optimal performance

Still, these heuristics require• many parameters to be calibrated• but calibration is difficult

22

Page 23: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Main contributions of this work

1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)

2. Solve semantic drift by graph kernels used in link analysis community

23

Page 24: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Q. What caused drift in Espresso?

A. Espresso's resemblance to HITS

HITS is an importance computation method(gives a single ranking list for any seeds)

Why not use a method for another type of link analysis measure - which takes seeds into account?"relatedness" measure

(it gives different rankings for different seeds)

24

Page 25: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

The regularized Laplacian kernel

• A relatedness measure• Takes higher-order relations into account• Has only one parameter

25

Normalized Graph Laplacian

R n ( L)n (I L) 1

n0

Regularized Laplacian matrix

A: adjacency matrix of the graphD: (diagonal) degree matrix

β: parameterEach column of Rβ gives the rankings relative to a node

Page 26: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Label propagation algorithm [Zhou et al. 2004]

• Form the affinity matrix W defined by if i != j and Wii = 0.

• Construct the matrix in which D is a diagonal matrix with its (i,i)-

element equal to the sum of the i-th row of W.

• Iterate until convergence, where alpha is a parameter in

(0,1)

• Let F* denote the limit of the sequence {F(t)}. Label each point xi as a label

26

X is a set of instancesand xi is an instance

Page 27: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Laplacian label propagation algorithm

• Form the affinity matrix W defined by where A is a row-normalized instance-pattern co-

occurrence matrix

• Construct the normalized Laplacian matrix

• Iterate until convergence, where alpha is a parameter in

(0,1)

• Let F* denote the limit of the sequence {F(t)}. Label each point xi as a label

27

Page 28: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

The sequence {F(t)} converges to F* = (1-α)(I-αS)-1Y

Proof:• Suppose F(0) = Y.• By iterative algorithm,

• Since 0 < α < 1 and the eigenvalues of (-L) in [-1, 1]

• Hencefor classification, which is equivalent to

28

This is also a form of the regularized Laplacian kernel

Page 29: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Word Sense Disambiguation

Evaluation of regularized Laplacian against Espresso

Page 30: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Label prediction of “bank” (F measure)

Algorithm Most frequent sense

Other senses

Simplified Espresso 100.0 0.0

Espresso (after convergence)

100.0 30.2

Espresso (optimal stopping) 94.4 67.4

Regularized Laplacian (β=10-

2)92.1 62.8

30

The regularized Laplacian keeps high recall for infrequent senses

Espresso suffers from semantic drift(unless stopped at optimal stage)

Page 31: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

WSD on all nouns in Senseval-3

algorithm F measur

e

Most frequent sense (baseline)

54.5

HyperLex 64.6

PageRank 64.6

Simplified Espresso 44.1

Espresso (after convergence) 46.9

Espresso (optimal stopping) 66.5

Regularized Laplacian (β=10-

2)67.1

31

Outperforms other graph-based methods

Espresso needs optimal stopping to achieve an equivalent performance

Page 32: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Regularized Laplacian is stable across a parameter

32

Page 33: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Learning Semantic Categories

Evaluation of regularized Laplacian by large-scaled web data

Page 34: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Graph-based approach to learn semantic knowledge

• We showed that some of bootstrapping algorithms can be regarded as computing similarity over a graph

34

Singapore

Hong Kong___ map

___ visa

UFJ

ANA

___ history

China___ airlines

Never tried large-scale data->Does it scale?

Tested on word sense disambiguation task->Is it task-independent?

Never tried large-scale data->Does it scale?

Tested on word sense disambiguation task->Is it task-independent?

Page 35: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Bootstrapping achieves high precision, but requires calibration in practice

Pros• Dedicated to Japanese web search query logs• Outperforms previous methods on this task• Orders of magnitude faster than previous

bootstrapping methods

Cons• Needs to control many parameters and

configurations• Has little theoretical background• Does not scale to large corpus

35

Page 36: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Graph-based methods are simple but effective on

large-scale data such as web documents

Pros• Can scale to massive amount of raw data• Firm mathematical background (Can employ

classical optimization methods)

Cons• Computational efficiency (Can be approximated)• No obvious “good” graph• Requires computational resource (CPU, disk,

memory, etc) and technical know-how

36

Page 37: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Quetchup algorithm (QUEry Term CHUnk Processor)

• Using clickthrough logs as source of information

• Semi-supervised learning with Laplacian Label Propagation

• Efficient computation over a graph

37

Page 38: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Learning semantic categories from clickthrough graph

• Query “Singapore” –(click)-> http://www.cikm2009.org/

38

Singapore

Hong Kong http://en.wikipedia.org/wiki/Hong_Kong

http://www.bk.mufg.jp/

UFJ

ANA

http://www.singaporair.com/hk.jsp

Chinahttp://www.china-airlines.co.jp/

http://www.ana.co.jp/

http://www.acl-ijcnlp-2009.org/

http://www.cikm2009.org/

Page 39: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Experimental setting

• Search logs: Japanese web search logs collected in August 2008; retained the top 10 million distinct queries

• Target categories: Travel and Finance [Komachi and Suzuki, 2008]

• Evaluation: precision at k; relative recall [Pantel and Ravichandran, 2004]

39

where RA|B is relative recall of system A given B,CX is the number of correct output from system X,C is the true number of the correct instances,PX is precison of system X, and |X| is the number of input instances for system X

Page 40: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Seed instances for each category

Category Seed

Travel jal (Japan Airlines),ana (All Nippon Airways),jr (Japan Railways), じゃらん (jalan: online travel guide site), his (H.I.S.Co.,Ltd.: travel agency)

Finance みずほ銀行 (Mizuho Bank),三井住友銀行 (Sumitomo Mitsui Banking Corporation), jcb,新生銀行 (Shinsei Bank), 野村證券 (Nomura Securities)

40

Page 41: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Precision of Travel domain

41

The regularized Laplacian gives best precision

Page 42: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Precision of Finance domain

42

Again, the regularized Laplacian gives best precision

Page 43: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Relative recall of Travel domain

Clickthrough logs not only achieve high precision but also high recall

43

Page 44: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Relative recall of Finance domain

44

Page 45: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Relative recall of Quetchup with varying click:query

45

Page 46: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Precision of Quethcup with varying click:query

46

Page 47: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Instances and patterns with the top 10 highest scores

System

Instance Pattern

Quetchupclick

じゃらん 宿泊 (jalan accommodation), じゃラン (jalan), ジャラン (jalan), jarann, jaran, じゃらん net (jalan.net), jalan, じゅらん (julan), ana 予約 (ana reservation), ana.co.jp

www.jalan.net/, www.ana.co.jp/, www.his-j.com/, www.jreast.co.jp/, www.jtb.co.jp/, www.jtb.co.jp/ace/, www.westjr.co.jp/, www.jtb.co.jp/kaigai/, nippon.his.co.jp/, www.jr.cyberstation.ne.jp/

Quetchupquery

中部発 (Traveling from Midland), his 関西 (his Kansai), 伊平屋島 (Iheya Island), ホテルコンチネンタル横浜 (Hotel Continental Yokohama), げんじいの森 (Genjii-no-mori; spa), フジサファリパーク (Fuji Safari Park), ad-box, アダコミ (adacomi; offensive), スカイチーム (SkyTeam), ノースウェスト (Northwest)

# 時刻表 (timetable), # 国内旅行 (domestic tour), # 宿泊 (accommodation), # 北海道 (Hokkaido), # 関西 (Kansai), # 九州 (Kyushu), # マイレージ (mileage), # 名古屋 (Nagoya), # 沖縄 (Okinawa), # 温泉 (spa)

Tchai 静鉄バス (Seitetsu Bus), 相鉄バス (Sotetsu Bus), 函館バス (Hakodate Bus), 大阪地下鉄 (Osaka subway), 琴電 (Kotoden railways), 地下鉄御堂筋線 (Subway Midosuji Line), 芸陽バス (Geiyo Bus), 新京成バス (Shin-keisei Bus), jr 阪和線 (Japan Railways Hanna Line), 常磐線 (Joban Line)

# 時刻表 , # 路線図 (route map), # 運賃 (fare), # 料金 (fare), # 定期 (season ticket), # 運行状況 (service situation), # 路線 (route), # 定期代 (season ticket fare), # 定期券 (season ticket), # 時刻

47

Page 48: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Random samples of extracted instancesType (Frequency)

Example

Transportation (54) 広島 新幹線 (Hiroshima Super Express), 東海道線 (Tokaido Line), jr飯田線 (JR Iida Line), jr 博多 (JR Hakata), 京都 新幹線 (Kyoto Super Express)

Accommodation (10)

ホテルビーナス (Hotel Venus), リーガロイヤルホテル大阪 (Rihga Royal Hotel Osaka), www.route-inn.co.jp, ホテル京阪ユニバーサル・シティ (Hotel Keihan Universal City), 札幌全日空ホテル (ANA Hotel Sapporo)

Travel Information (10)

外務省 安全 (Foreign Ministry safety), チケットショップ 大阪 (ticket shop Osaka), 観光 関西 (Site seeing Kansai), 高山観光協会 (Takayama Tourism Association), グーグル ナビ (Google navi)

Travel Agency (6) jr おでかけネット (Odekake net), 近畿ツー (Kinki Tourist), タビックス 静岡 (Tabix Shizuoka), フレックスインターナショナル (Flex International), オリオンツアー (Orion Tour)

Others (2) プロテカ (Proteca; bag for travel), jal 紀行倶楽部 (JAL Travel Club)

Not Related (20) 格安航空チケット 海外 (discount flight ticket overseas), 新幹線予約状況 (Super express reservation situation), 新幹線 時刻表 (Super express timetable), 温泉宿 (spa accommodation), 新幹線 停車駅 (Super express stops), 虎 (tiger), youtubu 海外ドラマ (overseas drama), 法務部採用 (legal department recruitment), おくりびと (Okuribito; film), 社会人野球 (amateur baseball)

48

Page 49: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Learning curve of Quetchupclick

The more the data, the higher the precision

49

Page 50: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Sensitivity to Quetchupclick to parameter α

Clickthrough graph seems much denser than query graph, and large alpha (less weight on initial labels) yields better performance than small alpha

50

Page 51: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Conclusions

• Semantic drift in Espresso is a parallel form of topic drift in HITS

• The regularized Laplacian reduces semantic drift in two tasks: word sense disambiguation and named entity extraction• inherently a relatedness measure ( importance

measure)

51

Page 52: Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Jan 15, 2010 Mamoru Komachi Nara Institute of Science and Technology

Future work

• Investigate if a similar analysis is applicable to a wider class of bootstrapping algorithms (including co-training)

• Investigate the influence of seed selection to bootstrapping algorithms and propose a way to select effective seed instances

• Explore relation between Markov random walk and label propagation methods

52