Measuring Semantic Similarity between Words Using HowNet ICCSIT 2008 Liuling DAI , Yuning XIA , Bin LIU , ShiKun WU School of Computer Science, Beijing Institute of Technology


Measuring Semantic Similarity between Words Using HowNet

ICCSIT 2008

Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU

School of Computer Science, Beijing Institute of Technology


HowNet

• W_C=工夫
• DEF={Ability|能力:host={human|人}}
• DEF={Strength|力量:host={group|群體}{human|人}}
• DEF={time|時間}

• Word: 工夫
• Concept: {Ability|能力:host={human|人}}
• Sememe: Ability|能力
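
The word / concept / sememe hierarchy above can be illustrated with a small sketch. The DEF strings are copied from the slide (with spacing normalized); the regular expression and the helper names `all_sememes` and `primary_sememe` are hypothetical and are not part of any HowNet API.

```python
import re

# DEF strings for the word 工夫, copied from the slide (spacing normalized).
DEFS = [
    "{Ability|能力:host={human|人}}",
    "{Strength|力量:host={group|群體}{human|人}}",
    "{time|時間}",
]

# Assumption: a sememe is written as English|Chinese right after an opening brace.
SEMEME = re.compile(r"\{([A-Za-z]+)\|([^:{}]+)")


def all_sememes(definition):
    """Return every (english, chinese) sememe pair in a DEF string, in order."""
    return SEMEME.findall(definition)


def primary_sememe(definition):
    """Return the first sememe of a DEF string."""
    return all_sememes(definition)[0]


# One word maps to several concepts (DEFs); each concept is built from sememes.
for definition in DEFS:
    print(primary_sememe(definition), all_sememes(definition))
```

Each DEF is one concept of the word, and the first sememe in a DEF is the one a later slide calls the primary sememe.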


Algorithms

• Similarity between sememes
• Similarity between concepts
• Similarity between words
• Amendment with thesaurus


Similarity between sememes

• Strategy 1

• Strategy 2

• d: Distance between S1 and S2
• h: Depth of the first common parent node of the two sememes
• α, β: Parameters to adjust d, h
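
As a rough illustration of how d and h can be obtained from a sememe tree and turned into a score, here is a minimal sketch. The equations for Strategy 1 and Strategy 2 are not reproduced in this transcript, so the scoring function below is an assumption: it uses the form exp(-α·d)·tanh(β·h) known from Li et al.'s WordNet measure and should not be read as either of the authors' strategies. The toy tree, the convention that the root has depth 1, and all function names are likewise hypothetical.

```python
import math

# Hypothetical toy sememe tree (child -> parent); not taken from HowNet.
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
    "artifact": "thing",
}


def path_to_root(sememe):
    """List of nodes from the sememe up to the root, inclusive."""
    path = []
    while sememe is not None:
        path.append(sememe)
        sememe = PARENT[sememe]
    return path


def distance_and_depth(s1, s2):
    """Return (d, h): path length between s1 and s2, and depth of their first common parent."""
    p1, p2 = path_to_root(s1), path_to_root(s2)
    common = next(node for node in p1 if node in p2)  # first common parent
    d = p1.index(common) + p2.index(common)
    h = len(path_to_root(common))  # assumption: depth counted in nodes, root = 1
    return d, h


def sememe_similarity(s1, s2, alpha=0.2, beta=0.16):
    """Assumed scoring form: decays with distance d, grows with depth h."""
    d, h = distance_and_depth(s1, s2)
    return math.exp(-alpha * d) * math.tanh(beta * h)


print(distance_and_depth("human", "animal"))
print(sememe_similarity("human", "animal"), sememe_similarity("human", "artifact"))
```

Whatever the exact formula, the intended behaviour is the same: a shorter path d and a deeper common parent h should both raise the similarity.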


Similarity between concepts

• Word “Doctor”
• DEF={human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}}

• Human → Primary sememe
• Status, own … → Modifying sememe
• Possession, domain … → Descriptors


Similarity between concepts

• P, Q: Two concepts. Assume P has fewer modifying sememes than Q.
• P_i, Q_j: The i-th and j-th modifying sememes of P and Q.
• S, T: Descriptor sets of P and Q.
• α, β, γ: Weights of the three parts.
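
The three-part structure can be sketched as follows. The concept-similarity equation itself is not reproduced in this transcript, so the combination below is an assumption: a weighted sum of (1) primary-sememe similarity, (2) the average best match of each P_i among the Q_j, and (3) the Jaccard overlap of the descriptor sets S and T. The `Concept` class, the `exact_match` placeholder, and the second example concept are hypothetical; the default weights are the values listed on the evaluation slide.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    primary: str
    modifiers: list = field(default_factory=list)
    descriptors: set = field(default_factory=set)


def exact_match(a, b):
    """Placeholder sememe similarity: 1.0 if identical, else 0.0."""
    return 1.0 if a == b else 0.0


def concept_similarity(p, q, sem_sim=exact_match, alpha=0.54, beta=0.36, gamma=0.1):
    # Make P the concept with fewer modifying sememes, as the slide assumes.
    if len(p.modifiers) > len(q.modifiers):
        p, q = q, p
    primary = sem_sim(p.primary, q.primary)
    if p.modifiers:
        # Assumed matching rule: each P_i is paired with its most similar Q_j.
        best = [max(sem_sim(pi, qj) for qj in q.modifiers) for pi in p.modifiers]
        modifying = sum(best) / len(best)
    else:
        modifying = 1.0 if not q.modifiers else 0.0
    union = p.descriptors | q.descriptors
    descriptor = len(p.descriptors & q.descriptors) / len(union) if union else 1.0
    return alpha * primary + beta * modifying + gamma * descriptor


# "Doctor" loosely follows the slide's DEF; the second concept is made up.
doctor = Concept("human", ["own", "Status", "education", "HighRank", "most"],
                 {"possession", "domain", "modifier", "degree", "possessor"})
other = Concept("human", ["teach", "education"], {"domain"})
print(concept_similarity(doctor, other))
```

Swapping the arguments so that P is always the smaller concept keeps the measure symmetric, and passing `sem_sim` as a parameter lets any sememe-level measure be plugged in.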


Similarity between words

• One word may have many concepts.
• Choose the most similar pair of concepts.
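
Choosing the most similar pair amounts to taking the maximum concept similarity over all pairs of concepts of the two words. A minimal sketch, in which `word_similarity`, the toy `concept_sim`, and the second word's concept list are hypothetical:

```python
from itertools import product


def word_similarity(concepts1, concepts2, concept_sim):
    """Similarity of the most similar concept pair; 0.0 if either word has no concepts."""
    return max(
        (concept_sim(c1, c2) for c1, c2 in product(concepts1, concepts2)),
        default=0.0,
    )


def toy_concept_sim(c1, c2):
    """Stand-in concept similarity for the demo: identical labels score 1.0."""
    return 1.0 if c1 == c2 else 0.2


# Concepts are represented here only by their primary sememe, for brevity.
gongfu = ["Ability", "Strength", "time"]   # the three DEFs of 工夫
other_word = ["time", "space"]             # made-up second word
print(word_similarity(gongfu, other_word, toy_concept_sim))
```

Taking the maximum means an ambiguous word is compared through whichever of its senses is closest to the other word.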


Amendment with thesaurus

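• Some words are missing from HowNet, and some of its DEFs are too coarse.

• Uses the Chinese thesaurus Tongyici Cilin (同義詞詞林). Note: this should be the Extended Edition of Tongyici Cilin from the Information Retrieval Lab (IR-Lab) of Harbin Institute of Technology.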

• d : Distance between W1 and W2


Similarity between words

• Sim1: Eq. 6 (Similarity in HowNet)
• Sim2: Eq. 7 (Similarity in Tongyici Cilin)
• α, β, γ, η: Parameters to scale the weights of the two parts.


Evaluation

• Dataset
  – RG-65
    • Rubenstein and Goodenough established synonymy judgments for 65 pairs of nouns. They invited 51 human judges to assign every pair a score between 0.0 and 4.0 to indicate semantic similarity.
  – MC-28
    • Miller and Charles followed this idea and restricted themselves to 30 pairs of nouns selected from Rubenstein and Goodenough’s list, divided equally among words with high, intermediate, and low similarity.

• To measure similarity between Chinese words, RG-65 was translated into Chinese manually.


Evaluation

• Parameters
  – Similarity between sememes
    • Strategy 1: α = 1.6, β = 0.16
    • Strategy 2: α = 0.2, β = 0.16
  – Similarity between concepts
    • α = 0.54, β = 0.36, γ = 0.1
  – Similarity between words
    • On Chinese dataset: α = 0.95, β = 0.05, γ = 0.95, η = 0.05
    • On English dataset: α = 0.95, β = 0.05, γ = 0.45, η = 0.55


Result

– HAPI: HowNet_Get_Concept_Similarity in HowNet API


Result

• In addition, they compare their results to eight groups of measures that rely on WordNet.
• Table 1. Correlation coefficients of the algorithms

Approach        RG-28    MC-28    RG-65
Hirst-St.Onge   0.671    0.682    0.732
Jiang           0.67     0.682    0.732
Leacock         0.801    0.82     0.852
Lin             0.773    0.814    0.834
Resnik          0.706    0.763    0.8
Yang            0.889    0.921    0.897
Li              0.8914   0.882    N/A
Alvarez         0.9      0.913    N/A
S1-English      0.9238   0.9074   0.8764
S2-English      0.9286   0.9056   0.8744
HAPI-English    0.5371   0.5113   0.6089
S1-Chinese      0.8617   0.8401   0.8958
S2-Chinese      0.8679   0.846    0.895
HAPI-Chinese    0.5328   0.5001   0.6752
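
Each entry in Table 1 is a correlation between an algorithm's scores and the human ratings for the same word pairs. The slides do not say which coefficient is used; the sketch below assumes Pearson's r, and the ratings and scores in it are made-up numbers, not data from RG-65 or MC-28.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: human ratings on the 0.0-4.0 scale, algorithm scores in [0, 1].
human_ratings = [3.9, 3.1, 1.7, 0.4, 0.1]
algorithm_scores = [0.95, 0.70, 0.45, 0.20, 0.05]
print(pearson(human_ratings, algorithm_scores))
```

Because the coefficient is invariant to linear rescaling, the 0.0 to 4.0 human scale and a 0 to 1 similarity score can be compared directly.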


RG-65


MC-30 & RG-30