graph-based analysis of espresso-style minimally-supervised bootstrapping algorithms jan 15, 2010...

Graph-based Analysis of Espresso-styleMinimally-supervised Bootstrapping

Algorithms

Jan 15, 2010

Mamoru KomachiNara Institute of Science and Technology

Corpus-based extraction of semantic knowledge

2

Singapore

Hong Kong

___ visa Hong Kong

China

___ history

Australia

Egypt

Instance Pattern New instance

Alternate step by step

Input Output(Extracted from corpus)

Singapore visa

Singapore map

Semantic drift is the central problem of bootstrapping

Singapore

card

visa ___ Australia

messages

greeting __

card

words

Instance Pattern New instance

Errors propagate to successive iteration

Input Output(Extracted from corpus)

Semantic category changed!

Generic patternsPatterns co-occurring with many irrelevant instances

Generic patternsPatterns co-occurring with many irrelevant instances

3

Main contributions of this work

1. Suggest a parallel between semantic drift in Espresso [Pantel and Pennachiotti, 2006] style bootstrapping and topic drift in HITS [Kleinberg, 1999]

2. Solve semantic drift using “relatedness” measure (regularized Laplacian) instead of “importance” measure (HITS authority) used in link analysis community

4

Table of contents

5

4.Graph-based Analysis of Espresso-style Bootstrapping Algorithms

3.Espresso-style Bootstrapping Algorithms

2. Overview of Bootstrapping Algorithms

5.Word Sense Disambiguation

6.Bilingual Dictionary Construction

7.Learning Semantic Categories

Espresso Algorithm [Pantel and Pennacchiotti, 2006]

• Repeat• Pattern extraction• Pattern ranking• Pattern selection• Instance extraction• Instance ranking• Instance selection

• Until a stopping criterion is met

6

Pattern/instance ranking in Espresso

Score for pattern p Score for instance i

7

r (p)

pmi(i, p)

max pmir(i)

iI

I

r(i)

pmi(i, p)

max pmir (p)

pP

P

,**,,*,

,,log),(

pyx

ypxpipmi

p: patterni: instanceP: set of patternsI: set of instancespmi: pointwise mutual informationmax pmi: max of pmi in all the patterns

and instances

Espresso uses pattern-instance matrix Afor ranking patterns and instances

|P|×|I|-dimensional matrix holding the (normalized) pointwise mutual information (pmi) between patterns and instances

8

1 2 ... i ... |I|

1

2

:

p

:

|P|

[A]p,i = pmi(p,i) / maxp,i pmi(p,i)[A]p,i = pmi(p,i) / maxp,i pmi(p,i)

instance indices

pattern indices

p = pattern score vectori = instance score vectorA = pattern-instance matrix

Pattern/instance ranking in Espresso

9

... pattern ranking

... instance ranking

Reliable instances are supported by reliable patterns, and vice versa

|P| = number of patterns|I| = number of instances

normalization factorsto keep score vectors not too large

Espresso Algorithm [Pantel and Pennacchiotti, 2006]

• Repeat• Pattern extraction• Pattern ranking• Pattern selection• Instance extraction• Instance ranking• Instance selection


10

For graph-theoretic analysis,

we will introduce 3 simplifications to Espresso

For graph-theoretic analysis,

we will introduce 3 simplifications to Espresso

First simplification

• Compute the pattern-instance matrix• Repeat

• Pattern extraction• Pattern ranking• Pattern selection• Instance extraction• Instance ranking• Instance selection


11

Simplification 1Remove pattern/instance extraction steps

Instead, pre-compute all patterns and instances once in the beginning of the algorithm

Simplification 1Remove pattern/instance extraction steps

Instead, pre-compute all patterns and instances once in the beginning of the algorithm

Second simplification


• Pattern ranking• Pattern selection

• Instance ranking• Instance selection


12

Simplification 2Remove pattern/instance selection stepswhich retain only highest scoring k patterns / m instances for the next iterationi.e., reset the scores of other items to 0

Instead,retain scores of all patterns and instances

Simplification 2Remove pattern/instance selection stepswhich retain only highest scoring k patterns / m instances for the next iterationi.e., reset the scores of other items to 0

Instead,retain scores of all patterns and instances

Third simplification


• Pattern ranking

• Instance ranking


13

Until score vectors p and i converge

Simplification 3No early stopping i.e., run until convergence

Simplification 3No early stopping i.e., run until convergence

Simplified Espresso

Input• Initial score vector of seed instances• Pattern-instance co-occurrence matrix A

Main loopRepeat

Until i and p converge

OutputInstance and pattern score vectors i and p

14

... pattern ranking

... instance ranking

Simplified Espresso is HITS

Simplified Espresso = HITS in a bipartite graph whose adjacency matrix is A

Problem

No matter which seed you start with, the same instance is always ranked topmostSemantic drift (also called topic drift in HITS)

15

The ranking vector i tends to the principal eigenvector of ATA as the iteration proceedsregardless of the seed instances!

How about Espresso?

Espresso has heuristics not present in Simplified Espresso

• Early stopping• Pattern and instance selection

Do these heuristics really help reduce semantic drift?

16

Experiments on semantic drift

Does the heuristics in original Espresso help reduce drift?

Word sense disambiguation task of Senseval-3 English Lexical Sample

Predict the sense of “bank”

18

… the financial benefits of the bank (finance) 's employee package ( cheap mortgages and pensions, etc ) , bring this up to …

In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden

Possibly aligned to water a sort of bank(???) by a rushing river.

Training instances are annotated with their sense

Predict the sense of target word in the test set

Word sense disambiguation by Espresso

Seed instance = the instance to predict its senseSystem output = k-nearest neighbor (k=3)Heuristics of Espresso• Pattern and instance selection

• # of patterns to retain p=20 (increase p by 1 on each iteration)

• # of instance to retain m=100 (increase m by 100 on each iteration)

• Early stopping19

Seed instanceSeed instance

Convergence process of Espresso

20

Heuristics in Espresso helps reducing semantic drift

(However, early stopping is required for optimal performance)

Output the most frequent sense regardless of input

Espresso

Simplified Espresso

Most frequent sense (baseline)

Semantic drift occurs (always outputs the most frequent sense)

Learning curve of Espresso: per-sense breakdown

21

# of most frequent sense predictions increases

Precision for infrequent senses worsens even with

original Espresso

Most frequent sense

Other senses

Summary: Espresso and semantic drift

Semantic drift happens because• Espresso is designed like HITS• HITS gives the same ranking list regardless of

seeds

Some heuristics reduce semantic drift• Early stopping is crucial for optimal performance

Still, these heuristics require• many parameters to be calibrated• but calibration is difficult

22

Main contributions of this work

1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)

2. Solve semantic drift by graph kernels used in link analysis community

23

Q. What caused drift in Espresso?

A. Espresso's resemblance to HITS

HITS is an importance computation method(gives a single ranking list for any seeds)

Why not use a method for another type of link analysis measure - which takes seeds into account?"relatedness" measure

(it gives different rankings for different seeds)

24

The regularized Laplacian kernel

• A relatedness measure• Takes higher-order relations into account• Has only one parameter

25

Normalized Graph Laplacian

R n ( L)n (I L) 1

n0

Regularized Laplacian matrix

A: adjacency matrix of the graphD: (diagonal) degree matrix

β: parameterEach column of Rβ gives the rankings relative to a node

Label propagation algorithm [Zhou et al. 2004]

• Form the affinity matrix W defined by if i != j and Wii = 0.

• Construct the matrix in which D is a diagonal matrix with its (i,i)-

element equal to the sum of the i-th row of W.

• Iterate until convergence, where alpha is a parameter in

(0,1)

• Let F* denote the limit of the sequence {F(t)}. Label each point xi as a label

26

X is a set of instancesand xi is an instance

Laplacian label propagation algorithm

• Form the affinity matrix W defined by where A is a row-normalized instance-pattern co-

occurrence matrix

• Construct the normalized Laplacian matrix

• Iterate until convergence, where alpha is a parameter in

(0,1)

• Let F* denote the limit of the sequence {F(t)}. Label each point xi as a label

27

The sequence {F(t)} converges to F* = (1-α)(I-αS)-1Y

Proof:• Suppose F(0) = Y.• By iterative algorithm,

• Since 0 < α < 1 and the eigenvalues of (-L) in [-1, 1]

• Hencefor classification, which is equivalent to

28

This is also a form of the regularized Laplacian kernel

Word Sense Disambiguation

Evaluation of regularized Laplacian against Espresso

Label prediction of “bank” (F measure)

Algorithm Most frequent sense

Other senses

Simplified Espresso 100.0 0.0

Espresso (after convergence)

100.0 30.2

Espresso (optimal stopping) 94.4 67.4

Regularized Laplacian (β=10-

2)92.1 62.8

30

The regularized Laplacian keeps high recall for infrequent senses

Espresso suffers from semantic drift(unless stopped at optimal stage)

WSD on all nouns in Senseval-3

algorithm F measur

e

Most frequent sense (baseline)

54.5

HyperLex 64.6

PageRank 64.6

Simplified Espresso 44.1

Espresso (after convergence) 46.9

Espresso (optimal stopping) 66.5

Regularized Laplacian (β=10-

2)67.1

31

Outperforms other graph-based methods

Espresso needs optimal stopping to achieve an equivalent performance

Regularized Laplacian is stable across a parameter

32

Learning Semantic Categories

Evaluation of regularized Laplacian by large-scaled web data

Graph-based approach to learn semantic knowledge

• We showed that some of bootstrapping algorithms can be regarded as computing similarity over a graph

34

Singapore

Hong Kong___ map

___ visa

UFJ

ANA

___ history

？

China___ airlines

Never tried large-scale data->Does it scale?

Tested on word sense disambiguation task->Is it task-independent?

Never tried large-scale data->Does it scale?

Tested on word sense disambiguation task->Is it task-independent?

Bootstrapping achieves high precision, but requires calibration in practice

Pros• Dedicated to Japanese web search query logs• Outperforms previous methods on this task• Orders of magnitude faster than previous

bootstrapping methods

Cons• Needs to control many parameters and

configurations• Has little theoretical background• Does not scale to large corpus

35

Graph-based methods are simple but effective on

large-scale data such as web documents

Pros• Can scale to massive amount of raw data• Firm mathematical background (Can employ

classical optimization methods)

Cons• Computational efficiency (Can be approximated)• No obvious “good” graph• Requires computational resource (CPU, disk,

memory, etc) and technical know-how

36

Quetchup algorithm (QUEry Term CHUnk Processor)

• Using clickthrough logs as source of information

• Semi-supervised learning with Laplacian Label Propagation

• Efficient computation over a graph

37

Learning semantic categories from clickthrough graph

• Query “Singapore” –(click)-> http://www.cikm2009.org/

38

Singapore

Hong Kong http://en.wikipedia.org/wiki/Hong_Kong

http://www.bk.mufg.jp/

UFJ

ANA

http://www.singaporair.com/hk.jsp

Chinahttp://www.china-airlines.co.jp/

http://www.ana.co.jp/

http://www.acl-ijcnlp-2009.org/

http://www.cikm2009.org/

Experimental setting

• Search logs: Japanese web search logs collected in August 2008; retained the top 10 million distinct queries

• Target categories: Travel and Finance [Komachi and Suzuki, 2008]

• Evaluation: precision at k; relative recall [Pantel and Ravichandran, 2004]

39

where RA|B is relative recall of system A given B,CX is the number of correct output from system X,C is the true number of the correct instances,PX is precison of system X, and |X| is the number of input instances for system X

Seed instances for each category

Category Seed

Travel jal (Japan Airlines),ana (All Nippon Airways),jr (Japan Railways), じゃらん (jalan: online travel guide site), his (H.I.S.Co.,Ltd.: travel agency)

Finance みずほ銀行 (Mizuho Bank),三井住友銀行 (Sumitomo Mitsui Banking Corporation), jcb,新生銀行 (Shinsei Bank), 野村證券 (Nomura Securities)

40

Precision of Travel domain

41

The regularized Laplacian gives best precision

Precision of Finance domain

42

Again, the regularized Laplacian gives best precision

Relative recall of Travel domain

Clickthrough logs not only achieve high precision but also high recall

43

Relative recall of Finance domain

44

Relative recall of Quetchup with varying click:query

45

Precision of Quethcup with varying click:query

46

Instances and patterns with the top 10 highest scores

System

Instance Pattern

Quetchupclick

じゃらん宿泊 (jalan accommodation), じゃラン (jalan), ジャラン (jalan), jarann, jaran, じゃらん net (jalan.net), jalan, じゅらん (julan), ana 予約 (ana reservation), ana.co.jp

www.jalan.net/, www.ana.co.jp/, www.his-j.com/, www.jreast.co.jp/, www.jtb.co.jp/, www.jtb.co.jp/ace/, www.westjr.co.jp/, www.jtb.co.jp/kaigai/, nippon.his.co.jp/, www.jr.cyberstation.ne.jp/

Quetchupquery

中部発 (Traveling from Midland), his 関西 (his Kansai), 伊平屋島 (Iheya Island), ホテルコンチネンタル横浜 (Hotel Continental Yokohama), げんじいの森 (Genjii-no-mori; spa), フジサファリパーク (Fuji Safari Park), ad-box, アダコミ (adacomi; offensive), スカイチーム (SkyTeam), ノースウェスト (Northwest)

# 時刻表 (timetable), # 国内旅行 (domestic tour), # 宿泊 (accommodation), # 北海道 (Hokkaido), # 関西 (Kansai), # 九州 (Kyushu), # マイレージ (mileage), # 名古屋 (Nagoya), # 沖縄 (Okinawa), # 温泉 (spa)

Tchai 静鉄バス (Seitetsu Bus), 相鉄バス (Sotetsu Bus), 函館バス (Hakodate Bus), 大阪地下鉄 (Osaka subway), 琴電 (Kotoden railways), 地下鉄御堂筋線 (Subway Midosuji Line), 芸陽バス (Geiyo Bus), 新京成バス (Shin-keisei Bus), jr 阪和線 (Japan Railways Hanna Line), 常磐線 (Joban Line)

# 時刻表 , # 路線図 (route map), # 運賃 (fare), # 料金 (fare), # 定期 (season ticket), # 運行状況 (service situation), # 路線 (route), # 定期代 (season ticket fare), # 定期券 (season ticket), # 時刻

47

Random samples of extracted instancesType (Frequency)

Example

Transportation (54) 広島新幹線 (Hiroshima Super Express), 東海道線 (Tokaido Line), jr飯田線 (JR Iida Line), jr 博多 (JR Hakata), 京都新幹線 (Kyoto Super Express)

Accommodation (10)

ホテルビーナス (Hotel Venus), リーガロイヤルホテル大阪 (Rihga Royal Hotel Osaka), www.route-inn.co.jp, ホテル京阪ユニバーサル・シティ (Hotel Keihan Universal City), 札幌全日空ホテル (ANA Hotel Sapporo)

Travel Information (10)

外務省安全 (Foreign Ministry safety), チケットショップ大阪 (ticket shop Osaka), 観光関西 (Site seeing Kansai), 高山観光協会 (Takayama Tourism Association), グーグルナビ (Google navi)

Travel Agency (6) jr おでかけネット (Odekake net), 近畿ツー (Kinki Tourist), タビックス静岡 (Tabix Shizuoka), フレックスインターナショナル (Flex International), オリオンツアー (Orion Tour)

Others (2) プロテカ (Proteca; bag for travel), jal 紀行倶楽部 (JAL Travel Club)

Not Related (20) 格安航空チケット海外 (discount flight ticket overseas), 新幹線予約状況 (Super express reservation situation), 新幹線時刻表 (Super express timetable), 温泉宿 (spa accommodation), 新幹線停車駅 (Super express stops), 虎 (tiger), youtubu 海外ドラマ (overseas drama), 法務部採用 (legal department recruitment), おくりびと (Okuribito; film), 社会人野球 (amateur baseball)

48

Learning curve of Quetchupclick

The more the data, the higher the precision

49

Sensitivity to Quetchupclick to parameter α

Clickthrough graph seems much denser than query graph, and large alpha (less weight on initial labels) yields better performance than small alpha

50

Conclusions

• Semantic drift in Espresso is a parallel form of topic drift in HITS

• The regularized Laplacian reduces semantic drift in two tasks: word sense disambiguation and named entity extraction• inherently a relatedness measure ( importance

measure)

51

Future work

• Investigate if a similar analysis is applicable to a wider class of bootstrapping algorithms (including co-training)

• Investigate the influence of seed selection to bootstrapping algorithms and propose a way to select effective seed instances

• Explore relation between Markov random walk and label propagation methods

52

graph-based analysis of espresso-style minimally-supervised bootstrapping algorithms jan 15, 2010...

Documents