


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2011; 23:2496–2510
Published online 22 September 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.1816

Measuring semantic similarity between words by removing noise and redundancy in web snippets

Zheng Xu*,†, Xiangfeng Luo, Jie Yu and Weimin Xu

School of Computer Engineering and Science, High Performance Computing Center, Shanghai University, Shanghai 200072, China

SUMMARY

Semantic similarity measures play important roles in many Web-related tasks such as Web browsing and query suggestion. Because taxonomy-based methods cannot deal with continually emerging words, Web-based methods have recently been proposed to solve this problem. Because of the noise and redundancy hidden in the Web data, robustness and accuracy are still challenges. In this paper, we propose a method integrating page counts and snippets returned by Web search engines. Semantic snippets and the number of already displayed search results are used to remove noise and redundancy in the Web snippets (a 'Web snippet' includes the title, summary, and URL of a Web page returned by a search engine). A method integrating page counts, semantic snippets, and the number of already displayed search results is then proposed. The proposed method does not need any human-annotated knowledge (e.g., ontologies) and can easily be applied to Web-related tasks (e.g., query suggestion). A correlation coefficient of 0.851 against the Rubenstein–Goodenough benchmark dataset shows that the proposed method outperforms the existing Web-based methods by a wide margin. Moreover, the proposed semantic similarity measure significantly improves the quality of query suggestion over some page counts based methods. Copyright © 2011 John Wiley & Sons, Ltd.

Received 24 May 2010; Revised 2 May 2011; Accepted 29 May 2011

KEY WORDS: semantic similarity; information retrieval; query suggestion; Web search

1. INTRODUCTION

The study of semantic similarity between words is a problem with a long history both in the academic community and industry. Examples include word sense disambiguation [1], knowledge flow generation [2–5], image retrieval [6], multimodal document retrieval [7], topic detection [8], and automatic hypertext linking [9].

Recently, with the rapid development of the Web, semantic similarity measures have also become important in many Web-related tasks. In query suggestion [10], for example, Google's (www.google.com) 'Search related to' and Yahoo's (www.yahoo.com) 'Also Try', a set of words semantically related to the query is suggested to help the user find what he or she really needs, thus improving the user's search experience and retrieval effectiveness.

Besides Web-related tasks, semantic similarity measures have also been used in Semantic Web-related applications including automatic annotation of Web pages [11] and semistructured data search [12], etc. For example, 'apple' is a fruit and also a company. A user who searches for 'apple' on the Web may be interested in either one of these different senses. In this case, words semantically related to 'apple' can be clustered by semantic similarity to automatically annotate the query.

*Correspondence to: Zheng Xu, School of Computer Engineering and Science, High Performance Computing Center, Shanghai University, Shanghai 200072, China.

†E-mail: [email protected]

Copyright © 2011 John Wiley & Sons, Ltd.


MEASURING SEMANTIC SIMILARITY BETWEEN WORDS 2497

Even in the economic arena, semantic similarity measures play a vital role. In the sponsored search model, search engines are paid by businesses that are interested in displaying ads for their site alongside the search results. Businesses bid for keywords, and their ad is displayed when the keyword is queried to the search engine. An important problem in this process is 'keyword generation': given a business that is interested in launching a campaign, suggest keywords that are related to that campaign. In that case, keywords with high semantic similarity to the business can be used as potential bid suggestions.

In the past years, there has been much research on developing Web-based similarity measures, which can be divided into three categories:

(1) Measures that rely on the number of returned hits. Page counts are useful information sources provided by Web search engines. For example, the page count for the query 'Microsoft CEO AND Bill Gates' in Google is 191,000, whereas the page count for the query 'Microsoft CEO AND Steve Ballmer' is 844,000. In fact, 'Steve Ballmer' is more similar to 'Microsoft CEO' than 'Bill Gates' because the page count for 'Microsoft CEO' AND 'Steve Ballmer' is greater than that for 'Microsoft CEO' AND 'Bill Gates'.

(2) Measures that download a number of the top-ranked documents and then apply text processing techniques. Resnik [13] employs the Web as a parallel corpus; new words or entities appear on the Web immediately. Because of the huge number of Web pages and the high growth of the Web, it is unrealistic to analyze all Web pages accordingly. Instead, these measures use a Web search engine, which is an efficient interface to vast information retrieval on the Web. The basic assumption behind these measures is that similarity of context implies similarity of meaning, that is, words appearing in similar lexical environments have a close semantic relation.

(3) Measures that combine both (1) and (2).

In this paper, we focus on the problem of Web-based semantic similarity computation between words. The existing methods are flawed because they do not deal with noise and redundancy. The noise is mainly caused by the huge number of Web pages. Given the scale of the Web, words might occur arbitrarily in some pages, which may decrease the accuracy of the page counts. Besides the page counts, the top-ranked documents or snippets are also affected by the noise. The random co-occurrences may reduce the accuracy when using these documents. On the other hand, besides the noise in the Web data, redundancy is another factor which reduces the accuracy of semantic similarity measures. Some studies [14] suggest that around 1.7% to 7% of the Web pages visited by crawlers are near-duplicate pages. Many duplicate Web pages make page counts unreliable. In this work, semantic snippets containing the query word pair p and q in one sentence are used to remove the noise, instead of all top-ranking snippets. Moreover, the number of already displayed search results provided by Google is used to remove redundancy in Web snippets. The major contributions of this work are as follows:

(1) Most of the previous methods only use page counts or snippets. In this paper, we propose a method which integrates page counts and snippets to measure semantic similarity between words. The proposed method runs automatically, does not need any human-annotated knowledge (e.g., ontologies), and can easily be applied to Web-related tasks (e.g., query suggestion).

(2) To the best of our knowledge, there is no previous study on removing noise and redundancy in Web snippets. In this paper, semantic snippets are automatically extracted to remove noise in Web snippets. Meanwhile, the number of already displayed search results provided by Google is used to remove redundancy in Web snippets.

(3) We conduct experiments on the Rubenstein–Goodenough (R–G) dataset and 80 Test of English as a Foreign Language (TOEFL) questions to evaluate the proposed method. The proposed method achieves a high correlation coefficient of 0.851 with human ratings, which significantly improves the accuracy of measuring semantic similarity between words over the existing Web-based methods.

(4) The proposed semantic similarity measure significantly improves the quality of query suggestion against some popular Web search engines (e.g., Google and Yahoo), thereby verifying


the capability of the proposed measure to capture semantic similarity by removing noise and redundancy in the Web snippets.

The remainder of the paper is organized as follows. In Section 2, we discuss related work. The proposed method is described in Section 3. Experimental results are discussed in Section 4. An application to query suggestion using the proposed method is given in Section 5. Finally, some conclusions are presented in Section 6.

2. RELATED WORK

Many different methods of semantic similarity measures between words have been proposed, which can be divided into two aspects: taxonomy-based methods and Web-based methods. Taxonomy-based methods use information theory and a hierarchical taxonomy, such as WordNet (http://wordnet.princeton.edu), to measure semantic similarity. On the contrary, Web-based methods use the Web as a live and active corpus instead of a hierarchical taxonomy.

Resnik [1] proposed to use information content to evaluate semantic similarity between words. He argued that the information content of a concept c can be quantified as the negative log likelihood, −log p(c), where p(c) is the probability of encountering an instance of concept c. Based on the idea of information content, Pedersen [15] developed a software package named WordNet::Similarity (http://search.cpan.org/dist/WordNet-Similarity) to measure semantic similarity between a pair of concepts. Rada [16] pointed out that the distance between two words in a taxonomy is a more natural and direct way of evaluating their semantic similarity: the shorter the distance from one word to the other, the more similar they are. Richardson and Smeaton [17] used a formula considering density, depth, and strength of edge to measure semantic similarity between words. Type of link, depth, and density were considered in [18]. Jiang [19] presented a combined model of information content and distance of two words to measure semantic similarity between words. Erk [20] used a vector space model and random walks to measure semantic similarity between words. In his paper [21], Li explored the determination of semantic similarity from a number of information sources, which consist of structural semantic information from a lexical taxonomy and information content from a corpus. To investigate how information sources could be used effectively, a variety of strategies for using various possible information sources were implemented. Because new words are constantly being created, and new senses are assigned to existing words as well, manually maintaining thesauri such as WordNet to capture these new words and senses is costly, if not impossible, which makes the taxonomy-based methods not feasible in Web-related tasks.

Different from taxonomy-based methods, Turney [22] defined a pointwise mutual information (PMI-IR) measure, which uses the number of hits returned by a Web search engine to recognize synonyms. Another method, named co-occurrence double check (CODC), was proposed by Chen [23]. He explored the Web as a live corpus. This method depends heavily on the ranking algorithm of the search engine. Sahami [24] defined a similarity kernel function to determine the similarity of words using Google. He also showed the use of his kernel function in a large-scale system for suggesting related queries to search engine users. A new corpus-based method for calculating the semantic similarity of two target words was presented in [25]. The authors called their method second order co-occurrence PMI (SOC-PMI). They used mutual information (MI) to sort lists of important neighbor words of the two target words. Bollegala [26] used page counts and snippets provided by Web search engines to measure semantic similarity. He applied some automatically extracted lexico-syntactic patterns from snippets in his method, in which 200 patterns are extracted from 4,562,471 unique patterns obtained from the top 900 snippets. Because the top snippets change across time, the regeneration of a huge number (usually 5 million) of unique patterns makes this method time-consuming. Consequently, the extracted patterns impact significantly on this method.

Overall, the existing Web-based methods lack the related mechanisms to process the noise and redundancy in Web data.


3. MEASURING SEMANTIC SIMILARITY BETWEEN WORDS

3.1. Outline

Given a pair of words p and q, the proposed method of measuring semantic similarity can be viewed as a four-stage process.

(1) Because page counts are useful information resources provided by Web search engines, four popular page counts based scores are considered to measure semantic similarity between words.

(2) Many words might occur arbitrarily in some snippets, which brings noise to the page counts of the pair of words p and q. Therefore, the noise in snippets should be removed to improve the accuracy of semantic similarity measures.

(3) Besides the noise in snippets, many Web pages may be duplicates. There are many causes for the existence of duplicate pages: typographical errors and versioned, mirrored, or plagiarized documents [14]. Thus, these duplicate pages should be removed to improve the accuracy of semantic similarity measures.

(4) After removing noise and redundancy in the Web data, a method integrating page counts and snippets returned by Web search engines is proposed.

The following sections discuss these four steps in turn.

3.2. Page counts based semantic similarity measures between words

Page counts mean the number of Web pages containing the query q. For example, the page count of the query 'Obama' in Google is 297,000,000‡. Moreover, the page count for the query 'q AND p' can be considered as a measure of co-occurrence of queries q and p. For the remainder of this paper, we use the notation N(q) to denote the page count of the query q in Google. However, the respective page counts for the word pair p and q are not enough for measuring semantic similarity. The page count for the query 'p AND q' should also be considered. For example, when we query 'Obama' and 'United States' together in Google, we find 110,000,000 Web pages, that is, N(Obama ∩ United States) = 110,000,000.

Page counts based semantic similarity measures are based on the co-occurrence of two words. The core idea is that 'you shall know a word by the company it keeps' [27]. In this section, four popular co-occurrence measures (i.e., Jaccard, Overlap, Dice, and PMI) are used to measure semantic similarity from page counts (Equations (1)–(4)).

Jaccard(p, q) = N(p ∩ q) / (N(p) + N(q) − N(p ∩ q)),  (1)

where p ∩ q denotes the conjunction query 'p AND q';

Overlap(p, q) = N(p ∩ q) / min(N(p), N(q)),  (2)

Dice(p, q) = 2 · N(p ∩ q) / (N(p) + N(q)).  (3)

According to probability and information theory, the mutual information (MI) of two random variables is a quantity that measures the mutual dependence of the two variables. PMI is a variant of MI (Equation (4)):

PMI(p, q) = log( N · N(p ∩ q) / (N(p) · N(q)) ) / log N,  (4)

where N is the number of Web pages in the search engine, which is set to N = 10^11 according to the number of indexed pages reported by Google.

‡The page counts of this paper are obtained in September 2008.
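The four page counts based scores of Equations (1)–(4) follow directly from the three counts. The sketch below is our own minimal Python illustration; the function names and the example counts are ours, not the paper's:

```python
from math import log

N_TOTAL = 1e11  # N: pages indexed by the search engine, as assumed in Equation (4)

def jaccard(n_p, n_q, n_pq):
    # Equation (1): co-occurrence count over the size of the union
    return n_pq / (n_p + n_q - n_pq)

def overlap(n_p, n_q, n_pq):
    # Equation (2): co-occurrence count over the smaller page count
    return n_pq / min(n_p, n_q)

def dice(n_p, n_q, n_pq):
    # Equation (3): twice the co-occurrence count over the sum of counts
    return 2 * n_pq / (n_p + n_q)

def pmi(n_p, n_q, n_pq, n_total=N_TOTAL):
    # Equation (4): pointwise mutual information, normalized by log N
    return log(n_total * n_pq / (n_p * n_q)) / log(n_total)

# Hypothetical page counts for a word pair (illustrative values only)
n_p, n_q, n_pq = 1_000_000, 1_000_000, 10_000
print(jaccard(n_p, n_q, n_pq), overlap(n_p, n_q, n_pq),
      dice(n_p, n_q, n_pq), pmi(n_p, n_q, n_pq))
```

Note that the normalization by log N makes the PMI score independent of the logarithm base.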


Using page counts as a measure of semantic similarity between words has the following drawbacks:

(1) Ignoring the noise in Web data. Given the scale of the Web, two words may appear on some pages accidentally. Thus, the adverse effects attributable to random co-occurrences should be reduced to improve the accuracy of semantic similarity measures.

(2) Ignoring the redundancy in Web data. Besides the noise in the co-occurrence of two words p and q, many pages may be duplicates. In other words, the page counts given by the Web search engine may not be accurate for the query. Therefore, the adverse effects attributable to duplicate pages should be removed to improve the accuracy of semantic similarity measures.

3.3. Removing noise by extracting semantic snippets

As mentioned in Section 3.2, the noise in Web data may reduce the accuracy of semantic similarity measures, so N(p ∩ q) in page counts based semantic similarity measures should be revised. Given the scale of the Web, it is impractical to find random co-occurrences of two words p and q by scanning all Web pages. Fortunately, the snippets returned by search engines give us a chance to remove noise.

Snippets are returned by Web search engines alongside the search results. A snippet, usually a short text with no more than 30 words, provides important semantic information. For example, Figure 1 is a snippet returned for the query 'computer AND hardware'. In Figure 1, 'computer' and 'hardware' appear in one sentence, from which we can obtain the semantic relation between them. The phrase 'components of' indicates a 'part of' semantic relation between 'hardware' and 'computer'.

In fact, not all snippets provide a correct or useful semantic relation between two words p and q. For example, Figure 2 is a snippet for the query 'food AND fruit'. Different from Figure 1, the snippet in Figure 2 cannot provide correct or useful semantic information, although it is the top result returned by Google, because the two words are not in one sentence. From the comparison of the snippets in Figures 1 and 2, the following definition is given:

Definition (Semantic Snippets). Semantic snippets are those which contain the two words p and q in one sentence.

According to the definition, the snippet in Figure 1 is a semantic snippet, which provides a useful semantic relation between the two words p and q. Thus, semantic snippets can be used to decide whether two words appear on some pages accidentally or not. In fact, the idea of semantic snippets is similar to using a context window to determine co-occurrence. The comparison between semantic snippets and window-based co-occurrence is given in Section 4.
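Deciding whether a snippet is semantic amounts to a sentence split followed by a containment check. The sketch below uses our own simplifying assumptions (naive punctuation-based sentence splitting and plain substring matching), which the paper does not specify:

```python
import re

def is_semantic_snippet(snippet, p, q):
    # A semantic snippet contains both words p and q in one sentence.
    # The sentence splitter here is a crude regex; a real system would
    # use a proper sentence segmenter.
    sentences = re.split(r"[.!?]+", snippet.lower())
    return any(p.lower() in s and q.lower() in s for s in sentences)

def semantic_snippet_proportion(snippets, p, q):
    # SS(p ∩ q): fraction of semantic snippets among the top-n snippets.
    if not snippets:
        return 0.0
    return sum(is_semantic_snippet(s, p, q) for s in snippets) / len(snippets)
```

For instance, a snippet whose single sentence mentions both 'computer' and 'hardware' is counted as semantic, whereas a snippet whose sentences each contain only one of the two words is not.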

3.4. Removing redundancy by the number of already displayed search results

Besides the noise in the Web data, redundancy is another factor which reduces the accuracy of semantic similarity measures. Many duplicate Web pages make N(p), N(q), and N(p ∩ q) in page counts based semantic similarity measures unreliable. Because Google provides the URL§ of each search result, it seems intuitive that redundancy could be detected by scanning these Web pages. Unfortunately, given the huge number of Web pages and the high growth rate of the Web, it is impractical to analyze each Web search result directly and separately. Meanwhile, because different Web sites have different HTML formats, it is unfeasible to parse the different Web sites at Web scale.

Figure 1. A snippet of the query 'computer AND hardware' provided by Google.

Figure 2. A snippet of the query 'food AND fruit' provided by Google, which cannot give useful semantic information.

Instead of using duplicate detection algorithms (e.g., All-Pairs [28]), the 'repeat the search' interface provided by Google is used to remove the redundancy in Web data. When a query is submitted to Google, the number of search results returned is up to 1000. In order to show the most relevant results, Google omits some entries very similar to the already displayed search results. For example, the 'repeat the search' value of the query 'Obama' is 451, which means Google has omitted some entries very similar to the 451 already displayed search results. The number of already displayed search results can be used for removing duplicate snippets of the query.

3.5. Integrating page counts, semantic snippets, and the number of already displayed search results

In this section, we describe a robust semantic similarity measure integrating page counts, semantic snippets, and the number of already displayed search results.

Strategy 1. Semantic similarity between words is a combination of page counts based measures and semantic snippets. The steps of this strategy are as follows:

(1) Issue p, q, and p AND q as queries to the Web search engine (in this paper, we use Google), respectively;
(2) Get N(p), N(q), and N(p ∩ q);
(3) Compute the proportion of semantic snippets in the top n snippets returned by the query p AND q, which is denoted as SS(p ∩ q);
(4) Replace N(p ∩ q) by N(p ∩ q) · SS(p ∩ q).

This strategy revises N(p ∩ q) of page counts based measures by using semantic snippets, which removes the noise in Web data. For example, according to this strategy, PMI can be revised as Snippet-based PMI (SPMI) as follows (Equation (5)):

SPMI(p, q) = log₂( N(p ∩ q) · N · SS(p ∩ q) / (N(p) · N(q)) ) / log₂ N.  (5)

It is worth noting that the number of top snippets (n) used for computing the proportion of semantic snippets impacts significantly on the results of strategy 1. Different numbers of snippets may change the accuracy of semantic similarity measures. A detailed discussion of this factor is given in Section 4.
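Given the quantities from steps (1)–(3), strategy 1 is a one-line revision of PMI. A sketch (the zero-handling when SS(p ∩ q) = 0 is our own choice; the paper does not discuss it):

```python
from math import log2

def spmi(n_p, n_q, n_pq, ss_pq, n_total=1e11):
    # Equation (5): replace N(p ∩ q) by N(p ∩ q) * SS(p ∩ q), the
    # noise-corrected co-occurrence count, inside PMI.
    effective_pq = n_pq * ss_pq
    if effective_pq == 0:
        return 0.0  # log is undefined at 0; our assumption, not the paper's
    return log2(n_total * effective_pq / (n_p * n_q)) / log2(n_total)
```

With SS(p ∩ q) = 1 (every top snippet is semantic), SPMI reduces to plain PMI; smaller proportions shrink the score.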

Strategy 2. Semantic similarity between words is a combination of page counts based measures and the number of already displayed search results. The steps of this strategy are as follows:

(1) Issue p, q, and p AND q as queries to the Web search engine (in this paper, we use Google), respectively;
(2) Get N(p), N(q), and N(p ∩ q);
(3) Get the number of already displayed search results of p, q, and p AND q, which are denoted as R(p), R(q), and R(p ∩ q);
(4) Replace N(p), N(q), and N(p ∩ q) in page counts measures by N(p) · R(p), N(q) · R(q), and N(p ∩ q) · R(p ∩ q).

This strategy revises N(p), N(q), and N(p ∩ q) of page counts based measures by using the number of already displayed search results given by Google, which removes the redundancy in the Web data.

§Google also provides cached copies of Web pages.


For example, according to this strategy, PMI can be revised as Page-based PMI (PPMI) as follows (Equation (6)):

PPMI(p, q) = log( N(p ∩ q) · N · R(p ∩ q) / (N(p) · R(p) · N(q) · R(q)) ) / log N.  (6)

Strategy 3. Semantic similarity between words is a combination of strategies 1 and 2, which considers not only semantic snippets but also the number of already displayed search results. For example, according to this strategy, PMI can be revised as Snippet and Result based PMI (SRPMI) as follows (Equation (7)):

SRPMI(p, q) = 0, if SS(p ∩ q) < α;
SRPMI(p, q) = log( N(p ∩ q) · N · SS(p ∩ q) · R(p ∩ q) / (N(p) · R(p) · N(q) · R(q)) ) / log N, otherwise.  (7)

As in strategy 1, the threshold α impacts significantly on the results of this strategy. A detailed discussion of the threshold α is given in Section 4.
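Strategy 3 combines both corrections behind the threshold test of Equation (7). A sketch; the default α below is an arbitrary placeholder of ours, since the best threshold is only determined experimentally in Section 4:

```python
from math import log

def srpmi(n_p, n_q, n_pq, ss_pq, r_p, r_q, r_pq, alpha=0.2, n_total=1e11):
    # Equation (7): word pairs whose snippets rarely put p and q in one
    # sentence are scored 0; otherwise apply both corrections at once.
    if ss_pq < alpha:
        return 0.0
    num = n_pq * n_total * ss_pq * r_pq
    den = n_p * r_p * n_q * r_q
    return log(num / den) / log(n_total)
```

The threshold acts as a hard noise filter: pairs that co-occur only accidentally (low SS(p ∩ q)) are zeroed out instead of merely down-weighted.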

4. EXPERIMENTS AND EVALUATIONS

Our experiments include three aspects: (1) selecting the best page counts based semantic similarity measure; (2) selecting the best strategy for measuring semantic similarity between words; and (3) comparing the proposed method with other existing Web-based methods.

4.1. The benchmark dataset for evaluation

In this section, the R–G benchmark dataset is introduced. Rubenstein and Goodenough [29] proposed a dataset containing 28 word pairs rated by a group of 51 human subjects, which is a reliable benchmark for evaluating semantic similarity measures. The higher the correlation coefficient against the R–G ratings, the more accurate the method for measuring semantic similarity between words.
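The evaluation statistic itself is the ordinary Pearson correlation between a measure's scores and the human ratings over the 28 word pairs; a minimal sketch:

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation coefficient between method scores (xs) and
    # human similarity ratings (ys) over the benchmark word pairs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value of 1 indicates perfect agreement with the human ranking, 0 no linear relation, and −1 perfect disagreement.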

4.2. Selecting the best page counts based semantic similarity measures

Besides the R–G dataset, we also select 80 synonym test questions from the TOEFL (http://www.ets.org) to decide the best page counts measure. Consider the following TOEFL question: the word 'appreciated' is closest in meaning to which of the following?

.A/ ‘Experienced’ .B/ ‘Recognized’ .C/ ‘Explored’ .D/ ‘Increased’

The right answer is 'Recognized'. We use the four page counts based semantic similarity measures to decide which answer is correct. The higher the accuracy on the 80 TOEFL questions, the

[Figure 3 plots the correlation coefficients: Jaccard 0.26, Overlap 0.411, Dice 0.277, PMI 0.534.]

Figure 3. The correlation coefficient of four page counts based semantic similarity measures against the Rubenstein–Goodenough dataset.


[Figure 4 plots the accuracies: Jaccard 47.50%, Overlap 40.00%, Dice 46.25%, PMI 52.50%.]

Figure 4. The accuracy of four page counts based semantic similarity measures against the 80 Test of English as a Foreign Language questions dataset.

Table I. Length of window versus accuracy.

Number of top snippets | k = 25 | k = 12 | k = 6 | k = 5 | k = 4 | k = 3
100 | 0.679 | 0.687 | 0.711 | 0.723 | 0.732 | 0.722
200 | 0.702 | 0.709 | 0.726 | 0.736 | 0.744 | 0.741
300 | 0.713 | 0.718 | 0.736 | 0.744 | 0.752 | 0.751
400 | 0.714 | 0.718 | 0.737 | 0.747 | 0.756 | 0.759
500 | 0.715 | 0.719 | 0.738 | 0.748 | 0.759 | 0.765
600 | 0.714 | 0.718 | 0.735 | 0.744 | 0.754 | 0.760
700 | 0.722 | 0.725 | 0.736 | 0.745 | 0.754 | 0.760
800 | 0.730 | 0.732 | 0.738 | 0.746 | 0.755 | 0.761
900 | 0.738 | 0.740 | 0.744 | 0.751 | 0.759 | 0.764

more accurate the method for measuring semantic similarity between words. Figures 3 and 4 show the results on the R–G dataset and the 80 TOEFL questions. They show that PMI performs best, achieving the highest correlation coefficient against the R–G dataset and the highest accuracy on the 80 TOEFL questions among the four methods. That fact enables us to select PMI when integrating semantic snippets and the number of already displayed search results.
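Answering a TOEFL question with any of these measures is an argmax over the four candidates. A sketch with a hand-made similarity table; the scores below are invented for illustration only, not data from the paper:

```python
def answer_question(question_word, candidates, similarity):
    # Choose the candidate closest in meaning to the question word,
    # i.e., the one with the highest similarity score.
    return max(candidates, key=lambda c: similarity(question_word, c))

# Hypothetical similarity scores, for illustration only
scores = {("appreciated", "experienced"): 0.21,
          ("appreciated", "recognized"): 0.64,
          ("appreciated", "explored"): 0.12,
          ("appreciated", "increased"): 0.33}
sim = lambda a, b: scores.get((a, b), 0.0)
print(answer_question("appreciated",
                      ["experienced", "recognized", "explored", "increased"],
                      sim))  # -> recognized
```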

4.3. Integrating pointwise mutual information and semantic snippets

In this section, we investigate the performance of strategy 1, which integrates page counts based semantic similarity measures and semantic snippets. Before giving the results of strategy 1, we first introduce the context-window-based methods, because the idea of semantic snippets is similar to them. The context-window-based methods employ a context window of fixed size (K words) for occurrence extraction. Different from our method, the context-window-based methods treat words appearing within a window of length K as a real co-occurrence. In order to compare the accuracy of strategy 1 and the context-window-based methods, Table I shows the best K in the context-window-based methods. As mentioned in Section 3.5, Equation (5) is used to measure semantic similarity between words. Moreover, because the number of snippets returned by Google is up to 1000¶, we compute the correlation coefficient with the R–G dataset for different numbers of snippets. The experimental results are summarized in Figure 5, from which some conclusions can be drawn.

(1) The correlation coefficient against the R–G dataset of SPMI is monotonically increasing with the number of snippets. The method for extracting semantic snippets may account for that result. Different numbers of snippets yield different proportions of semantic snippets. The

¶Usually, the number of snippets is lower than 1000 because many duplicate snippets are omitted by Google. In this paper, we measure only up to 900.


0.660.670.680.690.7

0.710.720.730.740.750.760.770.78

100 200 300 400 500 600 700 800 900

number of snippetsco

rrel

atio

n co

effi

cien

t

Figure 5. Correlation coefficient of SPMI with different number of snippets.

probability of finding appropriate SS.p\q/ in SPMI increases with the number of snippets.That fact enables us to set the number of snippets to 900.

(2) SPMI significantly improves accuracy over PMI on the correlation coefficient against the R–G dataset. The best correlation coefficient of SPMI is 0.7695, a 44% increase over PMI. Unlike PMI, SPMI uses semantic snippets to remove the noise in Web data, which makes the co-occurrence of the two words p and q more accurate.

(3) SPMI performs better than the context-window-based methods on the correlation coefficient against the R–G dataset. The best correlation coefficient of the context-window-based methods is 0.765. This result shows that sentence-based occurrence is more accurate than window-based occurrence.

Therefore, the high improvement rate of SPMI proves the accuracy of using semantic snippets to measure semantic similarity between words.
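To make the distinction between window-based and sentence-based occurrence concrete, here is a small illustrative sketch. The code and function names are our own (hypothetical, not from the paper): the window-based check counts p and q as co-occurring whenever they fall within K tokens of each other, while the semantic-snippet check requires them to appear in the same sentence.

```python
import re

def window_cooccurrence(text, p, q, k=5):
    """Window-based: p and q co-occur if they appear within k tokens."""
    tokens = re.findall(r"\w+", text.lower())
    pos_p = [i for i, t in enumerate(tokens) if t == p]
    pos_q = [i for i, t in enumerate(tokens) if t == q]
    return any(abs(i - j) <= k for i in pos_p for j in pos_q)

def sentence_cooccurrence(snippet, p, q):
    """Sentence-based: the snippet is 'semantic' if p and q share a sentence."""
    for sentence in re.split(r"[.!?]", snippet.lower()):
        words = set(re.findall(r"\w+", sentence))
        if p in words and q in words:
            return True
    return False

snippet = "A gem is a jewel. Stones are mined worldwide."
print(window_cooccurrence(snippet, "gem", "jewel", k=5))  # window check
print(sentence_cooccurrence(snippet, "gem", "jewel"))     # sentence check
```

Note that a window check can fire across a sentence boundary, which is exactly the kind of spurious co-occurrence the sentence-based semantic snippets are designed to exclude.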

4.4. Integrating pointwise mutual information and the number of already displayed search results

In this section, we investigate the performance of strategy 2, which integrates page counts based semantic similarity measures and the number of already displayed search results. As aforementioned in Section 3.5, Equation 6 is used to measure semantic similarity between words. Different from strategy 1, whose accuracy increases monotonically with the number of snippets, strategy 2 is independent of the number of snippets. Some experimental results can be summarized.

(1) RPMI improves accuracy over PMI on the correlation coefficient against the R–G dataset. The correlation coefficient of RPMI is 0.542, a 1.5% increase over PMI. The reason is that RPMI uses the number of already displayed search results provided by Google, which removes the redundancy in Web data.

(2) The improvement rate of RPMI is lower than that of SPMI, which means the noise in Web data affects the accuracy of measuring semantic similarity between words more than the redundancy does. Some studies [14] suggest that around 1.7% to 7% of the Web pages visited by crawlers are near-duplicate pages. This low percentage of duplicate pages makes SPMI perform better than RPMI.
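As a rough sketch of how the number of already displayed results could enter a PMI-style score: Equation 6 is defined in Section 3.5 and not reproduced in this excerpt, so the discounting scheme below is purely our own assumption. One plausible shape is to discount the co-occurrence count by the fraction of results that survive the search engine's duplicate filtering.

```python
import math

def rpmi(h_p, h_q, h_pq, displayed_pq, claimed_pq, n):
    """Hypothetical RPMI-style sketch (the paper's Equation 6 may differ).

    displayed_pq: number of already displayed (deduplicated) results;
    claimed_pq: page count reported for the query "p AND q".
    The ratio discounts the co-occurrence count for near-duplicate pages.
    """
    if 0 in (h_p, h_q, h_pq, claimed_pq):
        return 0.0
    dedup_ratio = min(displayed_pq / claimed_pq, 1.0)
    adjusted_pq = h_pq * dedup_ratio
    return math.log2((adjusted_pq / n) / ((h_p / n) * (h_q / n)))
```

With this shape, a query whose displayed results equal its claimed count is scored exactly as plain PMI, while heavy duplication lowers the score.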

4.5. Integrating SPMI and RPMI

In this section, we investigate the performance of strategy 3, which integrates SPMI and RPMI. As aforementioned in Section 3.5, Equation 7 is used to measure semantic similarity between words. Moreover, because the threshold α may change the accuracy of semantic similarity measures, we compute the correlation coefficient against the R–G dataset with different thresholds α to investigate its effect. The experimental results are presented in Figure 6. The best correlation coefficient of SRPMI is 0.851 when the threshold α is 0.38, a 10.6% and 57% increase over SPMI and RPMI, respectively. The reason is that SRPMI not only uses semantic snippets to remove the noise in the Web


Figure 6. Correlation coefficient of SRPMI with different threshold α (0.746 at α = 0.15, 0.743 at 0.21, 0.719 at 0.22, 0.745 at 0.24, 0.779 at 0.25, 0.777 at 0.27, 0.834 at 0.35, and 0.851 at 0.38).

data but also uses the number of already displayed search results to remove redundancy in the Web data, which significantly improves the accuracy of semantic similarity measures. Therefore, SRPMI, which integrates page counts, semantic snippets, and the number of already displayed search results, is the best strategy for measuring semantic similarity.

4.6. Comparing SRPMI with existing Web-based methods

In this section, some existing Web-based semantic similarity measures proposed in previous work, including Sahami [24], CODC [23], and SemSim [26], are compared with the proposed method (SRPMI). The reasons for selecting these three methods are 1) SemSim [26] claimed the best correlation coefficient among the existing Web-based semantic similarity measures, and 2) Sahami [24] only considered snippets and CODC [23] only considered page counts. Results are shown in Tables II and III. SRPMI earns the highest correlation coefficient of 0.851 in our experiments.

SemSim indicates the second-best correlation coefficient of 0.797. SemSim uses automatically extracted lexico-syntactic patterns from text snippets to measure semantic similarity between words. In SemSim, 200 patterns are extracted from 4,562,471 unique patterns obtained from the top 900 snippets. Because the top snippets change over time, regenerating this huge number (usually about 5 million) of unique patterns makes SemSim time-consuming. Consequently, the extracted patterns significantly influence SemSim.

Co-occurrence Double Checking (CODC), proposed by Chen, indicates a correlation coefficient of 0.723. The CODC measure is defined as Equation (8):

\[
\mathrm{CODC}(p,q) =
\begin{cases}
0 & \text{if } f(p@q) = 0 \\
e^{\log\left[\frac{f(p@q)}{N(p)} \cdot \frac{f(q@p)}{N(q)}\right]^{\alpha}} & \text{otherwise.}
\end{cases}
\tag{8}
\]

Table II. Properties of different metrics (1 means satisfying the property).

Metric | Web search engine | Page counts | Snippets | Lexico-syntactic patterns | WordNet ontology | Download documents | External knowledge | Removing noise | Removing redundancy
PMI    | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Sahami | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
CODC   | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
SemSim | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0
SPMI   | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0
RPMI   | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1
SRPMI  | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1


Therein, f(p@q) denotes the number of occurrences of p in the top-ranking snippets for the query q in Google. α is a constant in CODC and is set to 0.15. Even though a word pair p and q may be semantically similar, one might not always find q among the top snippets for p. Therefore, the CODC measure reports zero similarity scores for many word pairs in the R–G dataset.

The similarity measure proposed by Sahami reflects a correlation coefficient of 0.621. This method only uses snippets to measure semantic similarity between words, ignoring the page counts provided by Google. Moreover, Sahami did not compare his method with other Web-based methods.

Unlike other Web-based methods, SRPMI has the following advantages:

(1) SRPMI integrates page counts and snippets to measure semantic similarity between words. PMI only uses page counts, reflecting a correlation coefficient of 0.534. Sahami only uses snippets, reflecting a correlation coefficient of 0.621.

(2) SRPMI uses semantic snippets and the number of already displayed search results to remove noise and redundancy in the Web data, which improves the accuracy of semantic similarity measures. CODC and SemSim concentrate on all top snippets, ignoring the fact that not all snippets provide useful semantic information. Their correlation coefficients are 0.723 and 0.797, respectively, both less accurate than SRPMI.

Overall, the results in Table III suggest that the proposed SRPMI performs better than some existing Web-based methods.

Table III. Semantic similarity of human ratings and baselines on Rubenstein–Goodenough (R–G) dataset.

Word pair | R–G ratings | PMI | Sahami | CODC | SemSim | SRPMI (α = 0.38)
Chord–smile | 0.02 | 0.146 | 0.090 | 0 | 0 | 0
Rooster–voyage | 0.04 | 0.152 | 0.197 | 0 | 0.017 | 0
Noon–string | 0.04 | 0.152 | 0.082 | 0 | 0.018 | 0
Glass–magician | 0.44 | 0.214 | 0.143 | 0 | 0.18 | 0
Monk–slave | 0.57 | 0.200 | 0.095 | 0 | 0.375 | 0
Coast–forest | 0.85 | 0.186 | 0.248 | 0 | 0.405 | 0.171
Monk–oracle | 0.91 | 0.151 | 0.045 | 0 | 0.328 | 0
Lad–wizard | 0.99 | 0.164 | 0.149 | 0 | 0.22 | 0
Forest–graveyard | 1 | 0.184 | 0 | 0 | 0.547 | 0
Food–rooster | 1.09 | 0.150 | 0.075 | 0 | 0.06 | 0.123
Coast–hill | 1.26 | 0.173 | 0.293 | 0 | 0.874 | 0.153
Car–journey | 1.55 | 0.142 | 0.189 | 0.290 | 0.286 | 0.127
Crane–implement | 2.37 | 0.186 | 0.152 | 0 | 0.133 | 0.156
Brother–lad | 2.41 | 0.182 | 0.236 | 0.379 | 0.344 | 0.161
Bird–crane | 2.63 | 0.186 | 0.223 | 0 | 0.879 | 0.165
Bird–cook | 2.63 | 0.177 | 0.058 | 0.502 | 0.593 | 0.177
Food–fruit | 2.69 | 0.175 | 0.181 | 0.338 | 0.998 | 0.167
Brother–monk | 2.74 | 0.187 | 0.267 | 0.547 | 0.377 | 0.192
Asylum–madhouse | 3.04 | 0.245 | 0.212 | 0 | 0.773 | 0.230
Furnace–stove | 3.11 | 0.234 | 0.310 | 0.928 | 0.889 | 0.234
Magician–wizard | 3.21 | 0.220 | 0.233 | 0.671 | 1 | 0.211
Journey–voyage | 3.58 | 0.202 | 0.524 | 0.417 | 0.996 | 0.188
Coast–shore | 3.6 | 0.203 | 0.381 | 0.518 | 0.945 | 0.187
Implement–tool | 3.66 | 0.185 | 0.419 | 0.419 | 0.684 | 0.169
Boy–lad | 3.82 | 0.181 | 0.471 | 0 | 0.974 | 0.185
Automobile–car | 3.92 | 0.169 | 1 | 0.686 | 0.98 | 0.158
Midday–noon | 3.94 | 0.228 | 0.289 | 0.856 | 0.819 | 0.190
Gem–jewel | 3.94 | 0.198 | 0.211 | 1 | 0.686 | 0.237
Correlation coefficient | 1 | 0.534 | 0.621 | 0.723 | 0.797 | 0.851

R–G, Rubenstein–Goodenough; PMI, pointwise mutual information; CODC, co-occurrence double checking.


5. USING SRPMI ON QUERY SUGGESTION

In this section, we apply the proposed semantic similarity measure in a real-world application: query suggestion.

Queries to Web search engines are usually short and ambiguous, which provides insufficient information about users' needs for effectively retrieving relevant Web pages. To address this problem, query suggestion is implemented by most search engines (e.g., Google's 'Search related to' and Yahoo's 'Also Try'). Query suggestion provides a list of words semantically related to the user's initial query, which improves the user's search experience and retrieval effectiveness. Obviously, the proposed semantic similarity measure between words can be used to select appropriate suggested words. The steps of our query suggestion method based on SRPMI are as follows:

(1) Issue p as a query to Google;
(2) Extract concepts of the query p as candidate words for query suggestion. Our concept extraction method is inspired by the famous problem of finding frequent patterns in data mining [30]. When the query p is submitted to the Web search engine, a set of snippets is obtained for identifying the concepts. According to the theory of cognitive science [31], if a concept appears frequently in the snippets of the query p, it represents an important concept related to the query p because it coexists in close proximity with the query p in the top Web search results;
(3) Compute the semantic similarity between the extracted concepts and the query p by SRPMI. The concepts are sorted in descending order according to their semantic similarity to the query p;
(4) Select concepts with high semantic similarity as query suggestions.
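The steps above can be sketched as a small pipeline. Search-engine access and the similarity function are stubbed out; `fetch_snippets`, `extract_concepts`, and the toy data below are hypothetical helpers of our own, not the paper's implementation.

```python
from collections import Counter
import re

def extract_concepts(snippets, query, top_n=20):
    """Step 2 sketch: frequent terms in the snippets serve as candidates."""
    counts = Counter()
    for s in snippets:
        counts.update(w for w in re.findall(r"\w+", s.lower()) if w != query)
    return [w for w, _ in counts.most_common(top_n)]

def suggest(query, fetch_snippets, srpmi, k=8):
    """Steps 1-4: fetch snippets, extract candidates, rank by similarity."""
    snippets = fetch_snippets(query)                       # step 1
    candidates = extract_concepts(snippets, query)         # step 2
    scored = {c: srpmi(query, c) for c in candidates}      # step 3
    ranked = sorted(scored, key=scored.get, reverse=True)  # step 3 (sort)
    return ranked[:k]                                      # step 4

# Toy run with stubbed search and similarity functions.
fake_snippets = lambda q: ["msn messenger chat", "msn hotmail mail", "msn messenger"]
fake_srpmi = lambda p, c: {"messenger": 0.9, "hotmail": 0.7}.get(c, 0.1)
print(suggest("msn", fake_snippets, fake_srpmi, k=2))
```

In a real deployment the stubs would be replaced by actual search-engine calls and the SRPMI score; the pipeline shape (extract candidates, score, sort, truncate) stays the same.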

In order to evaluate our SRPMI based query suggestion method, some human raters were invited to rate the query suggestion results of Google, Yahoo, and our method. Each human rater evaluated the query suggestions on a 5-point scale [24], defined as:

1: Suggested concept is totally off topic.
2: Suggested concept is not as good as original query.
3: Suggested concept is basically same as original query.
4: Suggested concept is potentially better than original query.
5: Suggested concept is fantastic: should suggest this query because it might help a user find what they are looking for if they issued it instead of the original query.

It is obvious that the higher the average rating a method receives from the human raters, the better the method.

The queries for evaluation are obtained from the Google Zeitgeist||. The Google Zeitgeist tracks popular queries on the Web monthly. The reasons for using the Google Zeitgeist are as follows:

(1) The queries in the Google Zeitgeist are popularly searched by users around the world, so they are more likely to be known by the human raters. If a rater knows the meaning of a query exactly, he or she may rate the suggested concepts more correctly;

(2) Because the queries in the Google Zeitgeist are popular, if useful query suggestions are found, they could potentially be applicable to a large number of search engine users who have the same information needs.

Table IV lists the queries used for our evaluation. These queries are the top 10 searched queries from year 2004 to 2008 in Google. In Table V, we provide an example of query suggestion for 'MSN'. The results of the evaluation of our SRPMI based query suggestion method are shown in Figure 7. From Figure 7, it is apparent that our method performs better than Google and Yahoo because the ratings of our method are higher than those of Google and Yahoo for each year's queries. Indeed, we see that the human rating of our method is higher than 3 in Figure 7, which indicates that the suggested concepts are basically related to the query.

|| http://www.google.com/intl/en/press/zeitgeist.html.


Table IV. The queries used for the query suggestion application, which are the top 10 searched queries from year 2004 to 2008 in Google.

Year | 2008 | 2007 | 2006 | 2005 | 2004
Ranking 1 | Sarah Palin | iPhone | Bebo | Myspace | Britney Spears
Ranking 2 | Beijing 2008 | Badoo | Myspace | Ares | Paris Hilton
Ranking 3 | Facebook login | Facebook | World cup | Baidu | Christina Aguilera
Ranking 4 | Tuenti | Dailymotion | Metacafe | Wikipedia | Pamela Anderson
Ranking 5 | Heath Ledger | Webkinz | Radioblog | Orkut | Chat
Ranking 6 | Obama | Youtube | Wikipedia | iTunes | Games
Ranking 7 | Nasza Klasa | Ebuddy | Video | Sky News | Carmen Electra
Ranking 8 | Wer kennt wen | Second life | Rebelde | World of Warcraft | Orlando Bloom
Ranking 9 | Euro 2008 | Hi5 | Mininova | Green Day | Harry Potter
Ranking 10 | Jonas brothers | Club penguin | Wiki | Leonardo da Vinci | Mp3

Table V. The suggested concepts for Messenger (MSN) on different methods.

Query | SRPMI | Google search related | Yahoo also try
MSN | Microsoft | Japan | Hotmail
    | Shopping | Names | Messenger
    | Weather | Games | Games
    | Movies | Groups | Plus
    | Sports | Web | 7.5
    | Network | Dollies | Groups
    | Hotmail | Plus | Music
    | Games | Blocker | Names

Figure 7. The average rating of our method, Google, and Yahoo on the queries from year 2004 to 2008 (our method: 3.39, 3.00, 3.20, 3.37, and 3.27; Google: 2.64, 2.24, 2.30, 2.40, and 2.50; Yahoo: 2.55, 2.59, 2.36, 2.72, and 2.51, for 2008 through 2004, respectively).

6. CONCLUSIONS

In this paper, we focus on the problem of Web-based semantic similarity computation between words. In our view, given the noise and redundancy in Web data, words might occur arbitrarily in some pages. Methods for measuring semantic similarity between words should remove the noise and redundancy to improve accuracy and robustness.

Four page counts based methods, including Jaccard, Overlap, Dice, and PMI, are proposed as the candidate measures for semantic similarity. Semantic snippets, instead of all top-ranking snippets, are used to address the noise in Web data. The semantic snippets are snippets that contain the query word pair p and q in one sentence, which removes noise from the search results. Moreover, the number of already displayed search results provided by Google is used to remove redundancy in Web snippets. Three different strategies integrating page counts, semantic snippets, and the number of already displayed search results are proposed for measuring semantic similarity between words.

In addition, the 80 TOEFL questions and the R–G dataset are used to select the best page counts based method. The results show that PMI performs better than the other three methods. The R–G dataset is used to select the best strategy. The results show that the proposed strategy performs better than


just using page counts, which proves the accuracy of using semantic snippets and the number of already displayed search results in semantic similarity measures. Comparisons of the proposed method with some existing Web-based methods, including CODC, Sahami, and SemSim, are conducted. A correlation coefficient of 0.851 against the R–G benchmark dataset shows that the proposed method outperforms the existing Web-based methods by a wide margin.

Finally, we apply the proposed semantic similarity measure in a real-world application: query suggestion. The proposed semantic similarity measure significantly improves the quality of query suggestion over some popular Web search engines (e.g., Google and Yahoo), thereby verifying the capability of the proposed method to capture semantic similarity by removing noise and redundancy in Web snippets.

ACKNOWLEDGEMENTS

This research work is supported by the Shanghai Science and Technology Commission (grant 09JC1406200), the National Science Foundation of China (grants 91024012, 61071110, 61003249, 90818004 and 90612010), the Science Foundation of Shanghai University (grant SHUCX111001), and the Shanghai Leading Academic Discipline Project (J50103).

REFERENCES

1. Resnik P. Semantic similarity in a taxonomy: an information based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 1999; 11:95–130.

2. Luo X, Hu Q, Xu W, Yu Z. Discovery of textual knowledge flow based on the management of knowledge maps. Concurrency and Computation: Practice and Experience 2008; 20:1791–1806.

3. Luo X, Xu Z, Li Q, Hu Q, Yu J, Tang X. Generation of similarity knowledge flow for intelligent browsing based on semantic link networks. Concurrency and Computation: Practice and Experience 2009; 21:2018–2032.

4. Luo X, Yu J, Li Q, Liu F, Xu Z. Building web knowledge flows based on interactive computing with semantics. New Generation Computing 2010; 28:113–120.

5. Zhang S, Luo X, Chen J, Xu Z, Yu J, Xu W. Measuring knowledge delivery quantity of associated knowledge flow. In Proceedings of the Fourth International Conference on Semantics, Knowledge and Grid. IEEE Computer Society: Washington, DC, 2008; 117–124.

6. Smeulders A, Worring M, Santini S, Gupta A, Jain R. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000; 22(12):1349–1380.

7. Srihari R, Zhang Z, Rao A. Intelligent indexing and semantic retrieval of multimodal documents. Information Retrieval 2000; 2:245–275.

8. Makkonen J, Ahonen-Myka H, Salmenkivi M. Simple semantics in topic detection and tracking. Information Retrieval 2004; 7:347–368.

9. Green SJ. Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering 1999; 11(5):713–730.

10. Vojnovic M, Cruise J, Gunawardena D, Marbach P. Ranking and suggesting popular items. IEEE Transactions on Knowledge and Data Engineering 2009; 21(8):1133–1146.

11. Cimiano P, Handschuh S. Towards the self-annotating web. In Proceedings of the 13th International World Wide Web Conference. ACM Press: New York, NY, 2004; 462–471.

12. Schenkel R, Theobald A, Weikum G. Semantic similarity search on semistructured data with the XXL search engine. Information Retrieval 2005; 8:521–545.

13. Resnik P, Smith NA. The Web as a parallel corpus. Computational Linguistics 2003; 29(3):349–380.

14. Xiao C, Wang W, Lin X, Yu JX. Efficient similarity joins for near duplicate detection. In Proceedings of the 17th International World Wide Web Conference. ACM Press: New York, NY, 2008; 131–140.

15. Pedersen T, Patwardhan S, Michelizzi J. WordNet::Similarity – measuring the relatedness of concepts. Human Language Technology Conference, 2004; 38–41.

16. Rada R, Mili H, Bicknell E, Blettner M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics 1989; 19(1):17–30.

17. Richardson R, Smeaton F. Using WordNet in a knowledge-based approach to information retrieval. Working Paper CA-0395, School of Computer Applications, Dublin City University, Ireland, 1999.

18. Sussna M. Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the Second International Conference on Information and Knowledge Management. ACM Press: New York, NY, 1993; 67–74.

19. Jiang JJ, Conrath DW. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference on Research in Computational Linguistics, 1997.

20. Herdagdelen A, Erk K. Measuring semantic relatedness with vector space models and random walks. Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, 2009; 50–53.


21. Li Y, Bandar ZA, McLean D. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 2003; 15(4):871–882.

22. Tversky A. Features of similarity. Psychological Review 1977; 84(4):327–352.

23. Chen H, Lin M, Wei Y. Novel association measures using web search with double checking. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 2006; 1009–1016.

24. Sahami M, Heilman TD. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International World Wide Web Conference. ACM Press: New York, NY, 2006; 377–386.

25. Islam A, Inkpen D. Second order co-occurrence PMI for determining the semantic similarity of words. Proceedings of the International Conference on Language Resources and Evaluation, 2006; 1033–1038.

26. Bollegala D, Matsuo Y, Ishizuka M. Measuring semantic similarity between words using web search engines. In Proceedings of the 16th International World Wide Web Conference. ACM Press: New York, NY, 2007; 757–766.

27. Firth JR. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis. Philological Society: Oxford, 1957.

28. Bayardo RJ, Ma Y, Srikant R. Scaling up all pairs similarity search. In Proceedings of the 16th International World Wide Web Conference. ACM Press: New York, NY, 2007; 131–140.

29. Rubenstein H, Goodenough JB. Contextual correlates of synonymy. Communications of the ACM 1965; 8(10):627–633.

30. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Vol. 22. ACM Press: New York, NY, 1993; 207–216.

31. Church KW, Hanks P. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989; 76–83.
