Measuring Semantic Similarity between Words Using HowNet
ICCSIT 2008
Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU
School of Computer Science, Beijing Institute of Technology
HowNet
• W_C=工夫
• DEF={Ability|能力:host={human|人}}
• DEF={Strength|力量:host={group|群體}{human|人}}
• DEF={time|時間}
• Word : 工夫
• Concept : {Ability|能力:host={human|人}}
• Sememe : Ability|能力
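The word / concept / sememe layering above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the regular expression, and the name `primary_sememe` are assumptions made for this example and are not part of HowNet or its API. The sketch treats the first sememe inside a DEF as the concept's leading sememe, as in the "Sememe" bullet above.

```python
import re

# Hypothetical in-memory layout: one word maps to several concepts (DEFs).
HOWNET_ENTRIES = {
    "工夫": [
        "{Ability|能力:host={human|人}}",
        "{Strength|力量:host={group|群體}{human|人}}",
        "{time|時間}",
    ],
}

# The leading sememe is the text between the opening brace and the
# first ':' or '}' (assumed simplification of the DEF syntax).
_LEADING = re.compile(r"\{\s*([^:{}]+?)\s*[:}]")


def primary_sememe(definition: str) -> str:
    """Return the first sememe of a DEF string such as '{time|時間}'."""
    match = _LEADING.match(definition.strip())
    if match is None:
        raise ValueError(f"not a DEF string: {definition!r}")
    return match.group(1)


if __name__ == "__main__":
    for definition in HOWNET_ENTRIES["工夫"]:
        print(primary_sememe(definition))
```

A real DEF parser would also have to handle nested roles such as `host=` and `modifier=`; the sketch deliberately ignores them.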
Algorithms
• Similarity between sememes
• Similarity between concepts
• Similarity between words
• Amendment with thesaurus
Similarity between sememes
• Strategy 1
• Strategy 2
• d : Distance between S1 and S2
• h : Depth of the first common parent node of the two sememes
• α , β : Parameters to adjust d and h
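The formulas for the two strategies are not reproduced in this transcript, so the following is only a rough sketch of how d and h could be obtained and combined. The toy taxonomy, the function names, and above all the scoring function exp(-α·d)·tanh(β·h) are assumptions chosen for illustration (a common way to combine path length and depth); they should not be read as the authors' Strategy 1 or Strategy 2.

```python
import math

# Toy sememe taxonomy (hypothetical): child -> parent, root has no parent.
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
    "time": "entity",
}


def path_to_root(sememe):
    """List of nodes from the sememe up to the root, inclusive."""
    path = [sememe]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def distance_and_depth(s1, s2):
    """Return (d, h): path length between s1 and s2, and the depth of
    their first common parent (root has depth 0)."""
    steps_from_s1 = {node: i for i, node in enumerate(path_to_root(s1))}
    for steps_from_s2, node in enumerate(path_to_root(s2)):
        if node in steps_from_s1:
            d = steps_from_s1[node] + steps_from_s2
            h = len(path_to_root(node)) - 1
            return d, h
    raise ValueError("sememes share no ancestor")


def sememe_similarity(s1, s2, alpha, beta):
    """ASSUMED scoring function: decreases with d, increases with h."""
    d, h = distance_and_depth(s1, s2)
    return math.exp(-alpha * d) * math.tanh(beta * h)


if __name__ == "__main__":
    # alpha and beta here are illustrative inputs only.
    print(distance_and_depth("human", "animal"))
    print(sememe_similarity("human", "animal", alpha=0.2, beta=0.16))
```

With this assumed function, two sememes whose only common parent is the root score 0, and the score shrinks as the path between them grows.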
Similarity between concepts
• Word “Doctor”
• DEF={human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}}
• Human → Primary sememe
• Status, own … → Modifying sememes
• Possession, domain … → Descriptors
Similarity between concepts
• P , Q : Two concepts. Assume P has fewer modifying sememes than Q.
• P_i , Q_j : The i-th and j-th modifying sememes of P and Q.
• S , T : Descriptor sets of P and Q
• α, β, γ : Weights of the three parts
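The concept-level equation itself is missing from this transcript. As a hedged sketch, the three parts (primary sememe, modifying sememes, descriptors) can be combined as a weighted sum. Everything below is an assumption made for illustration: the `Concept` class, the best-match averaging over modifying sememes, the Jaccard overlap for descriptor sets, and the exact-match sememe similarity used in the demo. The default weights are the α, β, γ values listed for concepts in the evaluation slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Concept:  # hypothetical container for a parsed DEF
    primary: str
    modifiers: List[str] = field(default_factory=list)
    descriptors: Set[str] = field(default_factory=set)


def modifier_part(p: Concept, q: Concept, sememe_sim: Callable) -> float:
    """Average best match of each P_i against all Q_j (assumed scheme)."""
    if len(p.modifiers) > len(q.modifiers):
        p, q = q, p  # make P the concept with fewer modifying sememes
    if not p.modifiers:
        return 1.0 if not q.modifiers else 0.0
    best = [max(sememe_sim(pi, qj) for qj in q.modifiers) for pi in p.modifiers]
    return sum(best) / len(best)


def descriptor_part(s: Set[str], t: Set[str]) -> float:
    """Jaccard overlap of the descriptor sets S and T (assumed scheme)."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0


def concept_similarity(p, q, sememe_sim, alpha=0.54, beta=0.36, gamma=0.1):
    """Weighted sum of the three parts; the additive form is an assumption."""
    return (alpha * sememe_sim(p.primary, q.primary)
            + beta * modifier_part(p, q, sememe_sim)
            + gamma * descriptor_part(p.descriptors, q.descriptors))


def exact_match(s1: str, s2: str) -> float:
    """Stand-in sememe similarity: 1.0 for identical sememes, else 0.0."""
    return 1.0 if s1 == s2 else 0.0


if __name__ == "__main__":
    doctor = Concept("human", ["own", "Status", "education", "HighRank", "most"],
                     {"possession", "domain", "modifier", "degree", "possessor"})
    other = Concept("human", ["teach"], {"agent"})  # made-up concept
    print(concept_similarity(doctor, other, exact_match))
```

In practice `exact_match` would be replaced by a graded sememe similarity based on d and h.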
Similarity between words
• One word may have many concepts.
• Choose the most similar pair of concepts.
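The "most similar pair" rule can be sketched as a maximum over all concept pairs of the two words. The function names and the Jaccard stand-in for concept similarity are assumptions for this example; any concept-level similarity can be passed in instead.

```python
from itertools import product


def word_similarity(concepts_a, concepts_b, concept_sim):
    """Similarity of two words = best score over all pairs of their concepts."""
    scores = [concept_sim(p, q) for p, q in product(concepts_a, concepts_b)]
    return max(scores) if scores else 0.0


def jaccard(p, q):
    """Stand-in concept similarity: overlap of two sememe sets."""
    union = p | q
    return len(p & q) / len(union) if union else 1.0


if __name__ == "__main__":
    # Concepts reduced to bags of sememes (a simplification for the demo).
    gongfu = [
        frozenset({"Ability", "human"}),
        frozenset({"Strength", "group", "human"}),
        frozenset({"time"}),
    ]
    other_word = [frozenset({"time"})]  # hypothetical second word
    print(word_similarity(gongfu, other_word, jaccard))  # prints 1.0
```

Taking the maximum means a polysemous word such as 工夫 is compared through whichever of its senses is closest to the other word.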
Amendment with thesaurus
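• Some words are missing from HowNet, and some of its DEFs are too coarse.
• Use the Chinese thesaurus Tongyici Cilin (同義詞詞林); more precisely, this should be the Tongyici Cilin Extended Version (同義詞詞林擴展版) from the Information Retrieval Lab (IR-Lab) at Harbin Institute of Technology.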
• d : Distance between W1 and W2
Similarity between words
• Sim1 : Eq. 6 (Similarity in HowNet)
• Sim2 : Eq. 7 (Similarity in Tongyici Cilin)
• α, β, γ, η : Parameters to scale the weights of the two parts.
Evaluation
• Dataset
– RG-65
• Rubenstein and Goodenough established synonymy judgments for 65 pairs of nouns. They invited 51 human judges to assign every pair a score between 0.0 and 4.0 to indicate semantic similarity.
– MC-28
• Miller and Charles followed this idea and restricted themselves to 30 pairs of nouns selected from Rubenstein and Goodenough’s list, divided equally among words with high, intermediate, and low similarity.
• To measure similarity between Chinese words, RG-65 was translated into Chinese manually.
Evaluation
• Parameters
– Similarity between sememes
• Strategy 1 : α = 1.6 , β = 0.16
• Strategy 2 : α = 0.2 , β = 0.16
– Similarity between concepts
• α = 0.54 , β = 0.36 , γ = 0.1
– Similarity between words
• On Chinese dataset : α = 0.95 , β = 0.05 , γ = 0.95 , η = 0.05
• On English dataset : α = 0.95 , β = 0.05 , γ = 0.45 , η = 0.55
Result
– HAPI : HowNet_Get_Concept_Similarity in HowNet API
Result
• In addition, they compare their results with eight groups of measures that rely on WordNet.
• Table 1. Correlation coefficients of the algorithms
Approach        RG-28    MC-28    RG-65
Hirst-St.Onge   0.671    0.682    0.732
Jiang           0.67     0.682    0.732
Leacock         0.801    0.82     0.852
Lin             0.773    0.814    0.834
Resnik          0.706    0.763    0.8
Yang            0.889    0.921    0.897
Li              0.8914   0.882    N/A
Alvarez         0.9      0.913    N/A
S1-English      0.9238   0.9074   0.8764
S2-English      0.9286   0.9056   0.8744
HAPI-English    0.5371   0.5113   0.6089
S1-Chinese      0.8617   0.8401   0.8958
S2-Chinese      0.8679   0.846    0.895
HAPI-Chinese    0.5328   0.5001   0.6752
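Each number in Table 1 is a correlation coefficient between an algorithm's scores and the human ratings for a set of word pairs. The slides do not say which coefficient is used; the sketch below assumes Pearson's r, and the scores in the demo are made-up values, not data from RG-65 or MC-28.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    human_ratings = [3.9, 3.1, 1.2, 0.4]      # made-up ratings on a 0.0-4.0 scale
    system_scores = [0.95, 0.80, 0.35, 0.10]  # made-up similarity scores
    print(round(pearson(human_ratings, system_scores), 4))
```

Because the coefficient is invariant to linear rescaling, the 0.0 to 4.0 human scale and a 0 to 1 similarity score can be compared directly.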
[Figure slides: RG-65; MC-30 & RG-30]