Post on 27-Dec-2015
Japanese Abbreviation Expansion with Query and Clickthrough Logs
Kei Uchiumi†, Mamoru Komachi‡, Keigo Machinaga, Toshiyuki Maezawa†, Toshinori Satou†, Yoshinori Kobayashi†
†: Yahoo Japan Corporation
‡: Nara Institute of Science and Technology
Recently: hard to compile manually
• Time-consuming to construct a dictionary
• Requires domain knowledge
The web grows rapidly
• Even harder to maintain an up-to-date dictionary
Our purpose: generating an abbreviation dictionary from web search logs
Excellent resource for many NLP applications in the web domain
Clickthrough logs
• Learning semantic categories [Komachi et al. 2009]
• Named entity extraction [Jain et al. 2010]
Search query logs
• Query alteration [Hagiwara et al. 2009]
• Acquiring semantic categories [Sekine et al. 2007]
Main contributions
1. Novel re-ranking method that combines web query logs and clickthrough logs
2. First attempt to automatically recognize full spellings given a Japanese abbreviation
This method is used at Yahoo! Japan as an assistant tool for building dictionaries
Agenda
1. Introduction
2. Query reformulation based on noisy channel model
   1. Query Abbreviation model
   2. Query Language Model
3. Evaluation
4. Related work
5. Conclusion
Reformulation flow
[Figure: reformulation flow. Offline part: clickthrough logs → clickthrough graph; search query logs → query language model. Online part: query q → candidates c1, c2, c3, … → reranking → outputs ca, cb, cc, …]
Query Abbreviation Model
Label propagation on clickthrough graph
[Figure: clickthrough graph. Query nodes: abc, american broadcasting corporation, alphabet song, austrian ballet company. URL nodes: www.abc-tokyo.com, abcnews.go.com, www.alphabetsong.org, en.wikipedia.org]
The depth of the color of each line indicates the relatedness between the nodes it connects.
The depth of the color of each node represents its relatedness to the seed.
Problems of applying [Komachi et al. 2009] to our query reformulation task
1. Extracted not only synonymous expressions but also semantically related ones
2. Failed to alleviate semantic drift because it uses normalized frequency
Preliminary experiments showed that [Komachi et al. 2009] cannot be applied directly to our task
One-step approximation prevents extracting non-synonymous expressions
The one-step approximation extracts queries landing on the same URL by 1-hop label propagation.
These queries are possibly synonyms of the seed, so they can serve as corrections without changing its meaning.
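The one-step approximation can be sketched as a walk from the seed query to its clicked URLs and back to the queries that clicked those URLs. This is only an illustration under stated assumptions: the function name `one_hop_candidates` and the toy log are hypothetical, with node names borrowed from the clickthrough graph figure.

```python
from collections import defaultdict


def one_hop_candidates(click_pairs, seed):
    """Return the queries that share at least one clicked URL with `seed`."""
    url_to_queries = defaultdict(set)
    query_to_urls = defaultdict(set)
    for query, url in click_pairs:
        url_to_queries[url].add(query)
        query_to_urls[query].add(url)

    candidates = set()
    for url in query_to_urls[seed]:        # seed query -> clicked URLs
        candidates |= url_to_queries[url]  # clicked URLs -> other queries
    candidates.discard(seed)
    return candidates


# Hypothetical (query, URL) click pairs, not real log data.
click_pairs = [
    ("abc", "abcnews.go.com"),
    ("american broadcasting corporation", "abcnews.go.com"),
    ("abc", "www.alphabetsong.org"),
    ("alphabet song", "www.alphabetsong.org"),
    ("austrian ballet company", "www.abc-tokyo.com"),
    ("ballet tokyo", "www.abc-tokyo.com"),
]
print(sorted(one_hop_candidates(click_pairs, "abc")))
```

In this toy log, "austrian ballet company" is not returned because it shares no clicked URL with the seed.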
Using normalized PMI [Bouma, 2009] as a countermeasure against semantic drift
PMI assigns high scores to low-frequency events
Using it naively makes the clickthrough graph dense
Cutting off the negative values
Edges are represented as the (i,j)-th elements of the matrix W
• Cut off values lower than the threshold θ (θ ≥ 0)
• The range of Wij can be normalized to [0,1]
• Prevents W from being dense
• Reduces the noise in the data
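A minimal sketch of how such edge weights could be computed, assuming the standard definition NPMI(q,u) = ln(p(q,u) / (p(q) p(u))) / (−ln p(q,u)) and probabilities estimated from click counts. The function name `npmi_weights` and the counts are hypothetical; only the threshold value 0.1 comes from the parameter slide.

```python
import math
from collections import Counter


def npmi_weights(click_counts, theta=0.1):
    """Map (query, url) -> NPMI weight, dropping weights below theta."""
    total = sum(click_counts.values())
    q_count, u_count = Counter(), Counter()
    for (q, u), n in click_counts.items():
        q_count[q] += n
        u_count[u] += n

    weights = {}
    for (q, u), n in click_counts.items():
        p_qu = n / total
        if p_qu == 1.0:
            npmi = 1.0  # a single pair holds all the mass; limit value of NPMI
        else:
            pmi = math.log(p_qu / ((q_count[q] / total) * (u_count[u] / total)))
            npmi = pmi / -math.log(p_qu)
        if npmi >= theta:  # with theta >= 0, kept weights lie in [0, 1]
            weights[(q, u)] = npmi
    return weights


# Hypothetical click counts, not real log data.
click_counts = {
    ("abc", "abcnews.go.com"): 30,
    ("american broadcasting corporation", "abcnews.go.com"): 20,
    ("abc", "en.wikipedia.org"): 5,
    ("alphabet song", "en.wikipedia.org"): 45,
}
W = npmi_weights(click_counts)
for edge, weight in sorted(W.items()):
    print(edge, round(weight, 3))
```

The pair ("abc", "en.wikipedia.org") has negative NPMI in this toy data, so it is cut off and the graph stays sparse.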
Character n-gram query language model
c is a contiguous sequence of n characters.
c = (x0, x1, …, xn-1)
A language model estimated from search query logs
P(c) represents the likelihood of c as a query
Character n-grams are robust for Japanese web NLP
• Hard to compute the likelihood of neologisms with a word n-gram language model
• Characters themselves carry essential semantic information in Chinese and Japanese [Asahara and Matsumoto, 2004] [Huang and Zhao, 2006]
Using character 5-grams for the query language model
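A minimal sketch of a character 5-gram query language model with a length-normalized score. The slides state only the n-gram order and the division by string length; the add-one smoothing, the padding symbols "^" and "$", the class name `CharNgramLM` and the training queries are assumptions made for illustration.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram model with add-one smoothing (smoothing is an assumption)."""

    def __init__(self, queries, n=5):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for q in queries:
            self.vocab.update(q + "$")           # "$" marks the end of a query
            padded = "^" * (n - 1) + q + "$"     # "^" pads the start of a query
            for i in range(n - 1, len(padded)):
                context = padded[i - n + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1

    def log_prob(self, text):
        """Log-likelihood of `text` as a query."""
        n = self.n
        padded = "^" * (n - 1) + text + "$"
        v = len(self.vocab) + 1                  # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]
            total += math.log((self.ngrams[context + padded[i]] + 1)
                              / (self.contexts[context] + v))
        return total

    def score(self, text):
        """Log-likelihood divided by the length of the candidate's string."""
        return self.log_prob(text) / max(len(text), 1)


# Hypothetical query log, not real data.
lm = CharNgramLM(["video on demand", "video on demand", "video games", "on demand tv"])
print(lm.score("video on demand"), lm.score("zzzz qqqq"))
```

A query seen in the log scores higher than an unseen string, which is the property the reranking step relies on.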
Japanese abbreviation expansion data set
Test set
• 1,916 abbreviations of the types ‘Acronym’, ‘Kanji’ and ‘Kana’
• Collected from the Japanese version of Wikipedia
• Removed single letters and duplicates
Training set
Clickthrough logs
• 2009/10/22 – 2009/11/9 and 2010/1/1 – 2010/1/16
• About 17,000,000 pairs of queries and URLs
• Cut off pairs that occurred fewer than 10 times
Web search query logs
• 2009/8/1 – 2010/1/27
• About 52,000,000 unique queries
• Cut off queries that occurred fewer than 10 times
Judgment guideline
Table 1: Correction patterns for abbreviation expansion
1 | Acronym for its English expansion
2 | Acronym for its Japanese orthography
3 | Japanese abbreviation for its Japanese orthography
4 | Japanese abbreviation for its English orthography

Table 2: Examples of abbreviation and correction pairs
Correction pattern | Abbreviation | Correct candidates
1 | adf | Asian dub foundation
2 | ana | 全日本空輸株式会社 (All Nippon Airways)
3 | ハンスト | ハンガーストライキ (Hunger Strike)
4 | イラレ | illustrator
Evaluation measure
• The agreement rate of judgment of abbreviation/expansion pair: 47.0 %
• Cohen’s kappa measure κ = 0.63
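For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The slides do not say how the judgments were encoded, so the binary labels below are hypothetical and the function name `cohens_kappa` is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgments (1 = correct expansion, 0 = incorrect).
annotator_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```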
Comparison methods
Evaluated reranking performance on 50 candidates extracted from clickthrough logs
• Candidates are extracted by the one-step approximation
Compared three reranking methods
1. Ranking using the abbreviation model (AM) only
2. Reranking using the language model (LM) only
3. Reranking using both AM and LM
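Under a noisy channel reading, the combined method would score a candidate c for a query q by P(q|c) · P(c). The sketch below sums the two terms in log space. The slides do not give the exact combination, so the function `rerank`, the unweighted sum and all scores are assumptions for illustration only.

```python
import math


def rerank(candidates, am_prob, lm_log_score):
    """Sort candidates by log P(q|c) plus a log-scale LM score, best first."""
    def combined(c):
        return math.log(am_prob[c]) + lm_log_score[c]
    return sorted(candidates, key=combined, reverse=True)


# Hypothetical scores for the query "vod"; not taken from the experiments.
candidates = ["ビデオオンデ", "ビデオ・オン・デマンド"]
am_prob = {"ビデオオンデ": 0.020, "ビデオ・オン・デマンド": 0.015}      # abbreviation model
lm_log_score = {"ビデオオンデ": -3.5, "ビデオ・オン・デマンド": -2.0}  # language model

am_only = sorted(candidates, key=lambda c: am_prob[c], reverse=True)
both = rerank(candidates, am_prob, lm_log_score)
print(am_only[0], both[0])
```

With these made-up numbers the abbreviation model alone prefers the fragment, while the combined score moves the full spelling to the top.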
Reranking with the query language model improves both precision and coverage at top-10
QAM: query abbreviation model. QLM: query language model.
k | QAM precision | QAM coverage | QLM precision | QLM coverage | QLM+QAM precision | QLM+QAM coverage
1 | 0.114 | 0.114 | 0.157 | 0.157 | 0.161 | 0.161
3 | 0.112 | 0.256 | 0.142 | 0.278 | 0.157 | 0.321
5 | 0.121 | 0.341 | 0.128 | 0.346 | 0.142 | 0.392
10 | 0.114 | 0.453 | 0.102 | 0.425 | 0.115 | 0.465
30 | 0.087 | 0.536 | 0.078 | 0.529 | 0.082 | 0.542
50 | 0.073 | 0.557 | 0.073 | 0.557 | 0.073 | 0.557
The result of using only QAM is equivalent to the method of Komachi et al. (2009) using NPMI instead of raw frequency
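The definitions of precision and coverage did not survive in this transcript. One plausible reading, consistent with the two measures being equal at k = 1 in the table, is sketched below: precision at k is the share of top-k slots filled by correct candidates, and coverage at k is the share of queries with at least one correct candidate in the top k. Both definitions, the function names and the rankings are assumptions.

```python
def precision_at_k(ranked, gold, k):
    """Share of the top-k slots, over all queries, that hold a correct candidate."""
    hits = sum(1 for q, cands in ranked.items()
               for c in cands[:k] if c in gold.get(q, set()))
    return hits / (k * len(ranked))


def coverage_at_k(ranked, gold, k):
    """Share of queries with at least one correct candidate in the top k."""
    covered = sum(1 for q, cands in ranked.items()
                  if any(c in gold.get(q, set()) for c in cands[:k]))
    return covered / len(ranked)


# Hypothetical rankings built from the example candidates; the order is made up.
ranked = {
    "vod": ["ビデオオンデ", "ビデオ・オン・デマンド"],
    "ilo": ["国際労働機関", "国際労働期間"],
    "pr": ["プラ"],
}
gold = {
    "vod": {"ビデオ・オン・デマンド"},
    "ilo": {"国際労働機関"},
    "pr": {"パブリック・リレーションズ"},
}
for k in (1, 2):
    print(k, precision_at_k(ranked, gold, k), coverage_at_k(ranked, gold, k))
```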
Examples of inputs and their candidate corrections
Input | Candidates
写植 | 写真植字 (photocomposition), 写植 方, 漫画
満鉄 | 南満州鉄道株式会社 (South Manchuria Railway Corporation)
はねトび | はねるのとびら, はねるのトびら
vod | ビデオオンデ, ビデオ・オン・デマンド (Video on Demand)
ilo | 国際労働機関 (International Labour Organization), 国際労働期間
pr | パブリック・リレーションズ (public relations), prohoo!マ, プラ
Blue: correct. Red: incorrect.
Error Analysis
Table 3: Types of errors
1 | A partially correct query
2 | A correct query but with an additional attribute word
3 | A related but not abbreviated term
Besides the above reasons: 280 out of 1,916 queries did not exist in the clickthrough logs
A partially correct query
The likelihood of the partial query becomes higher than that of its correct spelling
Although the likelihood was divided by the length of the candidate's string, this still fails to filter out fragments of queries
vod: ビデオオンデ, ビデオオンデマンド (Video on Demand)
A correct query but with an additional attribute word
写植: 写真植字 意味, 写真植字 (photocomposition)
Candidates include combinations of correct queries and commonly used attribute words
e.g. “* 意味 (* meaning)”, “* とは (what does * mean?)”, etc.
857 queries that co-occurred with these attribute words were classified as incorrect.
A related but not abbreviated term
A number of abbreviations coincide with other general nouns
e.g. “dog (DOG: Disk Original Group)”
Hard to expand these abbreviations correctly at present
Related Work
Spelling correction based on edit distance
1. Using a noisy channel model with a language model created from query logs [Cucerzan and Brill, 2004]
2. Reranking method applying a neural net to the spelling correction candidates obtained from Cucerzan's method [Gao et al. 2010] [Sun et al. 2010]
Synonym extraction
1. Using similarity based on the JS divergence of commonly clicked URL distributions between queries [Wei et al. 2009]
Query expansion
1. Proposed a unified approach using CRFs with extended feature functions [Guo et al. 2008]
6. Conclusion
Proposed a query expansion method using web search logs
In experiments, a combination of label propagation and a language model outperformed methods using either label propagation or a language model alone
In the future, will address this task as a ranking problem using discriminative learning
PageRank on a query graph
Edges represent common co-occurring words between queries
Expected to assign higher scores to correct queries than QLM and QAM do
[Figure: query graph with the nodes 国際労働機関, 国際労働機関 とは, 国際労働機関 意味 and 国際労働機関 役割 (role)]
Partial queries do not co-occur with attribute words frequently
Parameters
Construction of a clickthrough graph
• The threshold θ of elements Wij was set to 0.1
• The parameter α for label propagation was set to 0.0001
Construction of a language model
• Character 5-gram
• Likelihood was divided by the length of the candidate's string
Correct candidate types
Table: Correct candidate types
1 | Named entity
2 | Common expression
3 | Japanese meaning of the common expression
[Komachi et al. 2009]
Suggested that normalized frequency causes semantic drift
Suggested using relative frequency as a countermeasure against semantic drift
P-values of Wilcoxon's signed rank test
Compared models | P-value
QAM and QAM+QLM | 0.055
QLM and QAM+QLM | 7.79e-10
Comparison of the harmonic mean of precision and coverage for each model, with rank k from 1 to 50
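The plotted curves are not preserved in this transcript, but the harmonic mean can be recomputed from the precision and coverage values in the results table. The sketch below does so; the function and variable names are illustrative.

```python
def harmonic_mean(precision, coverage):
    """Harmonic mean of precision and coverage (0 when both are 0)."""
    if precision + coverage == 0:
        return 0.0
    return 2 * precision * coverage / (precision + coverage)


# (precision, coverage) at each k, copied from the results table.
results = {
    "QAM": {1: (0.114, 0.114), 3: (0.112, 0.256), 5: (0.121, 0.341),
            10: (0.114, 0.453), 30: (0.087, 0.536), 50: (0.073, 0.557)},
    "QLM": {1: (0.157, 0.157), 3: (0.142, 0.278), 5: (0.128, 0.346),
            10: (0.102, 0.425), 30: (0.078, 0.529), 50: (0.073, 0.557)},
    "QLM+QAM": {1: (0.161, 0.161), 3: (0.157, 0.321), 5: (0.142, 0.392),
                10: (0.115, 0.465), 30: (0.082, 0.542), 50: (0.073, 0.557)},
}
for model, rows in results.items():
    print(model, {k: round(harmonic_mean(p, c), 3) for k, (p, c) in rows.items()})
```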
Query abbreviation model
Uses the label propagation method on a clickthrough graph (based on [Komachi et al. 2009])
The probability given by label propagation can be regarded as the conditional probability P(q|c)
Label propagation is mathematically identical to random walk with restart [Tong and Faloutsos, KDD 06]
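A minimal sketch of label propagation from a seed query on a weighted clickthrough graph. The slides do not spell out the update rule, so this assumes a common form, F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2). Only α = 0.0001 comes from the parameter slide; the function name `propagate` and the edge weights are hypothetical.

```python
import math
from collections import defaultdict


def propagate(edges, seed, alpha=0.0001, iterations=20):
    """edges: {(query, url): weight > 0}. Returns {node: score} for the seed label."""
    # Queries and URLs share one node namespace here, which is fine for a toy graph.
    neighbors = defaultdict(dict)
    for (q, u), w in edges.items():
        neighbors[q][u] = w
        neighbors[u][q] = w
    degree = {node: sum(nb.values()) for node, nb in neighbors.items()}

    initial = {node: 1.0 if node == seed else 0.0 for node in neighbors}
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node, nb in neighbors.items():
            # Symmetrically normalized weights spread the neighbors' scores.
            spread = sum(w / math.sqrt(degree[node] * degree[other]) * scores[other]
                         for other, w in nb.items())
            updated[node] = alpha * spread + (1 - alpha) * initial[node]
        scores = updated
    return scores


# Hypothetical edge weights on the nodes of the clickthrough graph figure.
edges = {
    ("abc", "abcnews.go.com"): 0.8,
    ("american broadcasting corporation", "abcnews.go.com"): 0.9,
    ("abc", "www.alphabetsong.org"): 0.2,
    ("alphabet song", "www.alphabetsong.org"): 0.9,
    ("abc", "www.abc-tokyo.com"): 0.3,
    ("austrian ballet company", "www.abc-tokyo.com"): 0.7,
}
scores = propagate(edges, "abc")
queries = ["american broadcasting corporation", "austrian ballet company", "alphabet song"]
print(sorted(queries, key=scores.get, reverse=True))
```

With α this small, the scores of other queries are dominated by the query → URL → query term, which is one way to read the one-step approximation described earlier.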