measuring semantic similarity between words using hownet iccsit 2008 liuling dai, yuning xia, bin...

Measuring Semantic Similarity between Words Using HowNet

ICCSIT 2008

Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU

School of Computer Science, Beijing Institute of Technology

HowNet

• W_C=工夫
• DEF={Ability|能力:host={human|人}}
• DEF={Strength|力量:host={group|群體}{human|人}}
• DEF={time|時間}

• Word : 工夫
• Concept : {Ability|能力:host={human|人}}
• Sememe : Ability|能力
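The word / concept / sememe hierarchy above can be pictured as a small data structure. The sketch below is only an illustration: the class names `Concept` and `WordEntry` and the `relations` field are hypothetical and are not part of HowNet or its API.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One DEF of a word: a primary sememe plus descriptor -> sememes relations."""
    primary: str
    relations: dict = field(default_factory=dict)


@dataclass
class WordEntry:
    """A word together with all of its concepts (DEFs)."""
    word: str
    concepts: list


# The three DEFs listed for the word 工夫 on the slide.
gongfu = WordEntry(
    word="工夫",
    concepts=[
        Concept("Ability|能力", {"host": ["human|人"]}),
        Concept("Strength|力量", {"host": ["group|群體", "human|人"]}),
        Concept("time|時間"),
    ],
)
```

Here one word owns several concepts, and each concept is built from sememes, which is the structure the similarity measures on the following slides operate on.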

Algorithms

• Similarity between sememes
• Similarity between concepts
• Similarity between words

• Amendment with thesaurus

Similarity between sememes

• Strategy 1

• Strategy 2

• d : Distance between S1 and S2
• h : Depth of the first common parent node of the two sememes
• α , β : Parameters to adjust d, h
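The formulas for the two strategies did not survive on this slide. As a rough illustration of how d, h, α and β can interact, the sketch below uses two forms that are common in the taxonomy-based similarity literature: a distance-only form, α / (d + α), and a depth-aware form, exp(-α·d) · tanh(β·h). These are assumptions chosen for illustration and are not confirmed to be the authors' Strategy 1 and Strategy 2; the function names are hypothetical.

```python
import math


def sim_distance_only(d, alpha):
    """Assumed form: similarity decays with path distance d; equals 1 when d == 0."""
    return alpha / (d + alpha)


def sim_distance_depth(d, h, alpha, beta):
    """Assumed form: decays with distance d, grows with depth h of the common parent."""
    return math.exp(-alpha * d) * math.tanh(beta * h)
```

Both forms stay within [0, 1] for non-negative d and h. The second rewards sememe pairs whose first common parent lies deeper in the hierarchy, which is the role the slide assigns to h.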

Similarity between concepts

• Word “Doctor”
• DEF={human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}}

• human → Primary sememe
• Status, own … → Modifying sememes
• possession, domain … → Descriptors
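The three roles in the “Doctor” DEF can be pulled apart mechanically. The sketch below is a minimal regex-based split, assuming that the first sememe in the string is the primary sememe, every later sememe is a modifying sememe, and every `name={` token is a descriptor. The helper `split_def` is hypothetical and is not a HowNet API call.

```python
import re

# The DEF of "Doctor" from the slide, with stray spaces removed.
DEF = ("{human|人:{own|有:possession={Status|身分:domain={education|教育},"
       "modifier={HighRank|高等:degree={most|最}}},possessor={~}}}")


def split_def(definition):
    """Return (primary sememe, modifying sememes, descriptors) for a DEF string."""
    # A sememe looks like "{english|中文"; capture only the English half.
    sememes = re.findall(r"\{([A-Za-z]+)\|", definition)
    # A descriptor looks like "name={".
    descriptors = re.findall(r"([A-Za-z]+)=\{", definition)
    return sememes[0], sememes[1:], descriptors


primary, modifying, descriptors = split_def(DEF)
```

A real DEF parser would track brace nesting; the flat split is enough to show which tokens fall into which of the three groups.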

Similarity between concepts

• P , Q : Two concepts. Assume P has fewer modifying sememes.
• P_i , Q_j : The i-th and j-th modifying sememes of P and Q, respectively.
• S , T : Descriptor sets of P and Q
• α , β , γ : Weights of the three parts
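The combining formula itself is missing from the slide. One plausible reading of "weights of the three parts" is a weighted sum of a primary-sememe part, a modifying-sememe part and a descriptor part. The sketch below assumes exactly that, and further assumes that each P_i is matched to its most similar Q_j and the matches are averaged. Both choices, and the function name, are assumptions; the default weights are the values reported on the Evaluation slide.

```python
def concept_similarity(primary_sim, modifying_sims, descriptor_sim,
                       alpha=0.54, beta=0.36, gamma=0.1):
    """Assumed weighted sum of three part similarities, each in [0, 1].

    modifying_sims[i][j] is the similarity between P_i and Q_j, where P is
    the concept with fewer modifying sememes.
    """
    if modifying_sims:
        # Assumption: match each P_i to its best Q_j, then average.
        best = [max(row) for row in modifying_sims]
        modifying_part = sum(best) / len(best)
    else:
        # Assumption: no modifying sememes contributes nothing.
        modifying_part = 0.0
    return alpha * primary_sim + beta * modifying_part + gamma * descriptor_sim
```

Because the default weights sum to 1, the result stays in [0, 1] whenever the three inputs do.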

Similarity between words

• One word may have many concepts.
• Choose the most similar pair of concepts.
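Choosing the most similar pair amounts to a maximum over all concept pairs of the two words. A minimal sketch follows, where `concept_sim` is any concept-level similarity function supplied by the caller; the function name `word_similarity` and the fallback value of 0.0 for a word with no concepts are assumptions.

```python
from itertools import product


def word_similarity(concepts_a, concepts_b, concept_sim):
    """Similarity of two words = best similarity over all pairs of their concepts."""
    return max(
        (concept_sim(p, q) for p, q in product(concepts_a, concepts_b)),
        default=0.0,  # assumption: a word without concepts has similarity 0
    )


# Toy concept similarity for demonstration only: 1.0 for identical labels, else 0.2.
def toy_concept_sim(p, q):
    return 1.0 if p == q else 0.2


score = word_similarity(["ability", "time"], ["strength", "time"], toy_concept_sim)
```

In the toy call the two words share the concept "time", so the best pair dominates and the other, less similar pairs are ignored.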

Amendment with thesaurus

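• Some words are missing from HowNet, and some DEFs in HowNet are too coarse.

• The Chinese thesaurus Tongyici Cilin (同義詞詞林) is used to compensate. Note: this should be the extended edition of Tongyici Cilin (同義詞詞林擴展版) from the Information Retrieval Lab (IR-Lab) of Harbin Institute of Technology.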

• d : Distance between W1 and W2

Similarity between words

• Sim1 : Eq. 6 (Similarity in HowNet)
• Sim2 : Eq. 7 (Similarity in Tongyici Cilin)
• α , β , γ , η : Parameters to scale the weights of the two parts.

Evaluation

• Dataset
  – RG-65
    • Rubenstein and Goodenough established synonymy judgments for 65 pairs of nouns. They invited 51 human judges to assign every pair a score between 0.0 and 4.0 to indicate semantic similarity.
  – MC-28
    • Miller and Charles followed this idea and restricted themselves to 30 pairs of nouns selected from Rubenstein and Goodenough’s list, divided equally among words with high, intermediate and low similarity.

• To measure similarity between Chinese words, RG-65 was translated into Chinese manually.

Evaluation

• Parameters
  – Similarity between sememes
    • Strategy 1 : α = 1.6 , β = 0.16
    • Strategy 2 : α = 0.2 , β = 0.16
  – Similarity between concepts
    • α = 0.54 , β = 0.36 , γ = 0.1
  – Similarity between words
    • On Chinese dataset : α = 0.95 , β = 0.05 , γ = 0.95 , η = 0.05
    • On English dataset : α = 0.95 , β = 0.05 , γ = 0.45 , η = 0.55

Result

• HAPI : HowNet_Get_Concept_Similarity in HowNet API

Result

• In addition, the authors compare their results with eight measures that rely on WordNet.
• Table 1. Correlation coefficients of the algorithms

Approach         RG-28    MC-28    RG-65
Hirst-St.Onge    0.671    0.682    0.732
Jiang            0.67     0.682    0.732
Leacock          0.801    0.82     0.852
Lin              0.773    0.814    0.834
Resnik           0.706    0.763    0.8
Yang             0.889    0.921    0.897
Li               0.8914   0.882    N/A
Alvarez          0.9      0.913    N/A
S1-English       0.9238   0.9074   0.8764
S2-English       0.9286   0.9056   0.8744
HAPI-English     0.5371   0.5113   0.6089
S1-Chinese       0.8617   0.8401   0.8958
S2-Chinese       0.8679   0.846    0.895
HAPI-Chinese     0.5328   0.5001   0.6752
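The slide does not say which correlation coefficient is reported. A common choice for these benchmarks is Pearson's r between the human ratings and the scores produced by a measure, and the sketch below computes it under that assumption. The sample ratings and scores are made-up toy values, not data from RG-65 or from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values: human ratings on the 0.0-4.0 scale and system scores in [0, 1].
human_ratings = [3.9, 3.1, 1.2, 0.4]
system_scores = [0.95, 0.80, 0.35, 0.10]
r = pearson(human_ratings, system_scores)
```

A value of r close to 1 means the measure ranks and spaces word pairs much as the human judges did, which is how the rows of Table 1 are meant to be compared.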

MC-30 & RG-30
