

Proceedings of the Eighteenth Conference on Computational Language Learning, pages 181–190, Baltimore, Maryland, USA, June 26-27 2014. © 2014 Association for Computational Linguistics

Probabilistic Modeling of Joint-context in Distributional Similarity

Oren Melamud§, Ido Dagan§, Jacob Goldberger♦, Idan Szpektor†, Deniz Yuret‡
§ Computer Science Department, Bar-Ilan University
♦ Faculty of Engineering, Bar-Ilan University
† Yahoo! Research Israel
‡ Koç University
{melamuo,dagan,goldbej}@{cs,cs,eng}.biu.ac.il, [email protected], [email protected]

Abstract

Most traditional distributional similarity models fail to capture syntagmatic patterns that group together multiple word features within the same joint context. In this work we introduce a novel generic distributional similarity scheme under which the power of probabilistic models can be leveraged to effectively model joint contexts. Based on this scheme, we implement a concrete model which utilizes probabilistic n-gram language models. Our evaluations suggest that this model is particularly well-suited for measuring similarity for verbs, which are known to exhibit richer syntagmatic patterns, while maintaining comparable or better performance with respect to competitive baselines for nouns. Following this, we propose our scheme as a framework for future semantic similarity models, leveraging the substantial body of work that exists in probabilistic language modeling.

1 Introduction

The Distributional Hypothesis is commonly phrased as "words which are similar in meaning occur in similar contexts" (Rubenstein and Goodenough, 1965). Distributional similarity models following this hypothesis vary in two major aspects, namely the representation of the context and the respective computational model. Probably the most prominent class of distributional similarity models represents context as a vector of word features and computes similarity using feature vector arithmetic (Lund and Burgess, 1996; Turney et al., 2010). To construct the feature vectors, the context of each target word token¹, which is commonly a word window around it, is first broken into a set of individual independent words.

¹ We use word type to denote an entry in the vocabulary, and word token for a particular occurrence of a word type.

Then the weights of the entries in the word feature vector capture the degree of association between the target word type and each of the individual word features, independently of one another.

Despite its popularity, it was suggested that the word feature vector approach misses valuable information, which is embedded in the co-location and inter-relations of words (e.g. word order) within the same context (Ruiz-Casado et al., 2005). Following this motivation, Ruiz-Casado et al. (2005) proposed an alternative composite-feature model, later adopted in (Agirre et al., 2009). This model adopts a richer context representation by considering entire word window contexts as features, while keeping the same computational vector-based model. Although showing interesting potential, this approach suffers from a very high-dimensional feature space, resulting in data sparseness problems. Therefore, it requires exceptionally large learning corpora to consider large windows effectively.

A parallel line of work adopted richer context representations as well, with a different computational model. These works utilized neural networks to learn low dimensional continuous vector representations for word types, which were found useful for measuring semantic similarity (Collobert and Weston, 2008; Mikolov et al., 2013). These vectors are trained by optimizing the prediction of target words given their observed contexts (or variants of this objective). Most of these models consider each observed context as a joint set of context words within a word window.

In this work we follow the motivation in the previous works above to exploit richer joint-context representations for modeling distributional similarity. Under this approach, the set of features in the context of each target word token is considered to jointly reflect on the meaning of the target word type.


To further facilitate this type of modeling, we propose a novel probabilistic computational scheme for distributional similarity, which leverages the power of probabilistic models and addresses the data sparseness challenge associated with large joint-contexts. Our scheme is based on the following probabilistic corollary to the distributional hypothesis:

(1)    "words are similar in meaning if they are likely to occur in the same contexts"

To realize this corollary, our distributional similarity scheme assigns high similarity scores to word pairs a and b for which a is likely in the contexts that are observed for b, and vice versa. The scheme is generic in the sense that various underlying probabilistic models can be used to provide the estimates for the likelihood of a target word given a context. This allows concrete semantic similarity models based on this scheme to leverage the capabilities of probabilistic models, such as established language models, which typically address the modeling of joint-contexts.

We hypothesize that an underlying model that can capture syntagmatic patterns in large word contexts, yet is flexible enough to deal with data sparseness, is desired. It is generally accepted that the semantics of verbs in particular are correlated with their syntagmatic properties (Levin, 1993; Hanks, 2013). This provides grounds to expect that such a model has the potential to excel for verbs. To capture syntagmatic patterns, we choose in this work standard n-gram language models as the basis for a concrete model implementing our scheme. This choice is inspired by recent work on learning syntactic categories (Yatbaz et al., 2012), which successfully utilized such language models to represent word window contexts of target words. However, we note that other richer types of language models, such as class-based (Brown et al., 1992) or hybrid (Tan et al., 2012) models, can be seamlessly integrated into our scheme.

Our evaluations suggest that our model is indeed particularly advantageous for measuring semantic similarity for verbs, while maintaining comparable or better performance with respect to competitive baselines for nouns.

2 Background

In this section we provide additional details regarding previous works that we later use as baselines in our evaluations.

To implement the composite-feature approach, Ruiz-Casado et al. (2005) used a Web search engine to compare entire window contexts of target word types. For example, a single feature that could be retrieved this way for the target word like is "Children cookies and milk". They showed good results on detecting synonyms in the 80-question multiple-choice TOEFL test. Agirre et al. (2009) constructed composite-feature vectors using an exceptionally large 1.6 Teraword learning corpus. They found that this approach outperforms the traditional independent feature vector approach on a subset of the WordSim353 test-set (Finkelstein et al., 2001), which is designed to test the more restricted relation of semantic similarity (to be distinguished from looser semantic relatedness). We are not aware of additional works following this approach of using entire word windows as features.

Neural networks have been used to train language models that are based on low dimensional continuous vector representations for word types, also called word embeddings (Bengio et al., 2003; Mikolov et al., 2010). Although originally designed to improve language models, later works have shown that such word embeddings are useful in various other NLP tasks, including measuring semantic similarity with vector arithmetic (Collobert and Weston, 2008; Mikolov et al., 2013). Specifically, the recent work by Mikolov et al. (2013) introduced the CBOW and Skip-gram models, achieving state-of-the-art results in detecting semantic analogies. The CBOW model is trained to predict a target word given the set of context words in a word window around it, where this context is considered jointly as a bag-of-words. The Skip-gram model is trained to predict each of the context words independently given the target word.

3 Probabilistic Distributional Similarity

3.1 Motivation

In this section we briefly demonstrate the benefits of considering joint-contexts of words. As an illustrative example, we note that the target words like and surround may share many individual word features, such as "school" and "campus", in the sentences "Mary's son likes the school campus" and "The forest surrounds the school campus".


This potentially implies that individual features may not be sufficient to accurately reflect the difference between such words. Alternatively, we could use the following composite features to model the contexts of these words: "Mary's son the school campus" and "The forest the school campus". This would discriminate better between like and surround. However, in this case sentences such as "Mary's son likes the school campus" and "John's son loves the school campus" will not provide any evidence for the similarity between like and love, since "Mary's son the school campus" is a different feature than "John's son the school campus".

In the remainder of this section we propose a modeling scheme and then a concrete model, which can predict that like and love are likely to occur in each other's joint-contexts, whereas like and surround are not, and then assign similarity scores accordingly.

3.2 The probabilistic similarity scheme

We now present a computational scheme that realizes our proposed corollary (1) to the distributional hypothesis and facilitates robust probabilistic modeling of joint contexts. First, we slightly rephrase this corollary as follows: "words a and b are similar in meaning if word b is likely in the contexts of a and vice versa". We denote the probability of an occurrence of a target word b given a joint-context c by p(b|c). For example, p(love|"Mary's son the school campus") is the probability of the word love being the filler of the 'place-holder' in the given joint-context "Mary's son the school campus". Similarly, we denote by p(c|a) the probability of a joint-context c given a word a which fills its place-holder. We now propose psim(b|a) to reflect how likely b is in the joint-contexts of a. We define this measure as:

(2)    psim(b|a) = Σ_c p(c|a) · p(b|c)

where c goes over all possible joint-contexts in the language.

To implement this measure we need to find an efficient estimate for psim(b|a). The most straightforward strategy is to compute simple corpus count ratio estimates for p(b|c) and p(c|a), denoted p#(b|c) = count(b,c) / count(∗,c) and p#(c|a) = count(a,c) / count(a,∗). However, when considering large joint-contexts for c, this approach becomes similar to the composite-feature approach, since it is based on co-occurrence counts of target words with large joint-contexts.

Therefore, we expect in this case to encounter the data sparseness problems mentioned in Section 1, where semantically similar word type pairs that share only a few or no identical joint-contexts yield very low psim(b|a) estimates.

To address the data sparseness challenge and adopt more advanced context modeling, we aim to use a more robust underlying probabilistic model θ for our scheme, and denote the probabilities estimated by this model by pθ(b|c) and pθ(c|a). We note that, contrary to the count ratio model, given a robust model θ, such as a language model, pθ(b|c) and pθ(c|a) can be positive even if the target words b and a were not observed with the joint-context c in the learning corpus.

While using pθ(b|c) and pθ(c|a) to estimate the value of psim(b|a) addresses the sparseness challenge, it introduces a computational challenge. This is because estimating psim(b|a) would require computing the sum over all of the joint-contexts in the learning corpus, regardless of whether they were actually observed with either word type a or b. For that reason we choose a middle ground approach, estimating p(b|c) with θ while using a count ratio estimate for p(c|a), as follows. We denote the collection of all joint-contexts observed for the target word a in the learning corpus by Ca, where |Ca| = count(a, ∗). For example, Clike = {c1 = "Mary's son the school campus", c2 = "John's daughter to read poetry", ...}. We note that this collection is a multiset, where the same joint-context can appear more than once.

We now approximate psim(b|a) from Equation (2) as follows:

(3)    psimθ(b|a) = Σ_c p#(c|a) · pθ(b|c) = (1 / |Ca|) · Σ_{c∈Ca} pθ(b|c)

We note that this formulation still addresses sparseness of data by using a robust model, such as a language model, to estimate pθ(b|c). At the same time, it requires our model to sum only over the joint-contexts in the collection Ca, since contexts not observed for a yield p#(c|a) = 0. Even so, since the size of these context collections grows linearly with the corpus size, considering all observed contexts may still present a scalability challenge.


Nevertheless, we expect our approximation psimθ(b|a) to converge with a reasonable sample size from a's joint-contexts. Therefore, in order to bound computational complexity, we limit the size of the context collections used to train our model to a maximum of N by randomly sampling N entries from larger collections. In all our experiments we use N = 10,000; higher values of N yielded negligible performance differences. Overall, we see that our model estimates psimθ(b|a) as the average probability predicted for b in (a large sample of) the contexts observed for a.

Finally, we define our similarity measure for target word types a and b:

(4)    simθ(a, b) = √( psimθ(b|a) · psimθ(a|b) )

As intended, this similarity measure promotes word pairs in which b is likely in the contexts of a and vice versa. Next, we describe a model which implements this scheme with an n-gram language model as a concrete choice for θ.
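The following sketch illustrates Equations (3) and (4) in Python. It is our illustration rather than the authors' code, and assumes a hypothetical lm_prob(word, context) function returning pθ(b|c) and a mapping observed_contexts from each target word to its collection Ca of observed joint-contexts:

```python
import random
from math import sqrt

def psim(b, contexts_of_a, lm_prob, max_contexts=10000):
    """Equation (3): average probability the model assigns to b as the
    place-holder filler over (a sample of) the joint-contexts observed for a."""
    if len(contexts_of_a) > max_contexts:
        contexts_of_a = random.sample(contexts_of_a, max_contexts)
    return sum(lm_prob(b, c) for c in contexts_of_a) / len(contexts_of_a)

def sim(a, b, observed_contexts, lm_prob):
    """Equation (4): geometric mean of the two directional psim scores."""
    return sqrt(psim(b, observed_contexts[a], lm_prob) *
                psim(a, observed_contexts[b], lm_prob))
```

The geometric mean keeps the measure symmetric, so a high score requires both directional estimates to be high.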

3.3 Probabilistic similarity using language models

In this work we focus on the word window context representation, which is the most common. We define a word window of order k around a target word as a window with up to k words to each side of the target word, not crossing sentence boundaries. The word window does not include the target word itself, but rather a 'place-holder' for it.
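To make the context representation concrete, here is a small sketch (ours, not from the paper) of extracting an order-k word window around a target token, with a place-holder token standing in for the target word:

```python
PLACEHOLDER = "___"  # stands in for the target word inside its window context

def window_context(sentence_tokens, target_index, k):
    """Return the order-k word window around the target token, replacing the
    target itself with a place-holder; windows do not cross sentence boundaries."""
    left = sentence_tokens[max(0, target_index - k):target_index]
    right = sentence_tokens[target_index + 1:target_index + 1 + k]
    return left + [PLACEHOLDER] + right

# window_context("Mary 's son likes the school campus".split(), 3, 3)
# -> ['Mary', "'s", 'son', '___', 'the', 'school', 'campus']
```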

Since word windows are sequences of words, probabilistic language models are a natural choice of a model θ for estimating pθ(b|c). Language models assign likelihood estimates to sequences of words using approximation strategies. In this work we choose n-gram language models, aiming to capture syntagmatic properties of the word contexts, which are sensitive to word order. To approximate the probability of long sequences of words, n-gram language models compute the product of the estimated probability of each word in the sequence conditioned on at most the n − 1 words preceding it. Furthermore, they use 'discounting' methods to improve the estimates of conditional probabilities when learning data is sparse. Specifically, in this work we use the Kneser-Ney n-gram model (Kneser and Ney, 1995).

We compute pθ(b|c) as follows:

(5)    pθ(b|c) = pθ(b, c) / pθ(c)

where pθ(b, c) is the probability of the word sequence comprising the word window c, in which the word b fills the place-holder. For instance, for c = "I drive my to work every" and b = car, pθ(b, c) is the estimated language model probability of "I drive my car to work every". pθ(c) is the marginal probability of pθ(∗, c) over all possible words in the vocabulary.²
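As an illustration of Equation (5), the following sketch (ours) computes pθ(b|c) from a generic sequence-scoring function. The seq_logprob function is an assumed wrapper returning the log-probability an n-gram model (e.g. Kneser-Ney) assigns to a token sequence; as footnote 2 notes, the brute-force marginalization over the vocabulary shown here is computationally intensive:

```python
import math

PLACEHOLDER = "___"  # marks the target word position within the window context

def p_target_given_context(b, context_tokens, seq_logprob, vocabulary):
    """Equation (5): p_theta(b|c) = p_theta(b, c) / p_theta(c), where p_theta(c)
    marginalizes over all vocabulary words filling the place-holder."""
    def fill(word):
        return [word if tok == PLACEHOLDER else tok for tok in context_tokens]
    joint = math.exp(seq_logprob(fill(b)))
    marginal = sum(math.exp(seq_logprob(fill(w))) for w in vocabulary)
    return joint / marginal
```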

4 Experimental Settings

Although sometimes used interchangeably, it is common to distinguish between semantic similarity and semantic relatedness (Budanitsky and Hirst, 2001; Agirre et al., 2009). Semantic similarity is used to describe 'likeness' relations, such as the relations between synonyms, hypernym-hyponyms, and co-hyponyms. Semantic relatedness refers to a broader range of relations, including also meronymy and various other associative relations, as in 'pencil-paper' or 'penguin-Antarctica'. In this work we focus on semantic similarity and evaluate all compared methods on several semantic similarity tasks.

Following previous works (Lin, 1998; Riedl and Biemann, 2013), we use Wordnet to construct large scale gold standards for semantic similarity evaluations. We perform the evaluations separately for nouns and verbs to test our hypothesis that our model is particularly well-suited for verbs. To further evaluate our results on verbs we use the verb similarity test-set released by Yang and Powers (2006), which contains pairs of verbs associated with semantic similarity scores based on human judgements.

4.1 Compared methods

We compare our model with a traditional feature vector model, the composite-feature model (Agirre et al., 2009), and the recent state-of-the-art word embedding models, CBOW and Skip-gram (Mikolov et al., 2013), all trained on the same learning corpus and evaluated on equal grounds.

We denote the traditional feature vector baseline by IFV W-k, where IFV stands for "Independent-Feature Vector" and k is the order of the context word window considered.

² Computing pθ(c) by summing over all possible place-holder filler words, as we did in this work, is computationally intensive. However, this can be done more efficiently by implementing customized versions of (at least some) n-gram language models with little computational overhead, e.g. by counting the learning corpus occurrences of n-gram templates in which one of the elements matches any word.


Similarly, we denote the composite-feature vector baseline by CFV W-k, where CFV stands for "Composite-Feature Vector". This baseline constructs traditional-like feature vectors, but considers entire word windows around target word tokens as single features. In both of these baselines we use Cosine as the vector similarity measure and positive pointwise mutual information (PPMI) for the feature vector weights. PPMI is a well-known variant of pointwise mutual information (Church and Hanks, 1990), and the combination of Cosine with PPMI was shown to perform particularly well by Bullinaria and Levy (2007).
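For concreteness, a minimal sketch (ours) of PPMI weighting and Cosine similarity over sparse co-occurrence counts; counts[word][feature] is assumed to hold the word-feature co-occurrence counts collected from the learning corpus:

```python
import math
from collections import defaultdict

def ppmi_vectors(counts):
    """Turn raw word-feature co-occurrence counts into PPMI-weighted vectors:
    PPMI(w, f) = max(0, log(p(w, f) / (p(w) * p(f))))."""
    total = sum(c for feats in counts.values() for c in feats.values())
    w_tot = {w: sum(feats.values()) for w, feats in counts.items()}
    f_tot = defaultdict(float)
    for feats in counts.values():
        for f, c in feats.items():
            f_tot[f] += c
    return {w: {f: max(0.0, math.log(c * total / (w_tot[w] * f_tot[f])))
                for f, c in feats.items()}
            for w, feats in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(weight * v.get(f, 0.0) for f, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

For the IFV baseline each feature is an individual context word; for the CFV baseline each feature is an entire word window.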

We denote Mikolov's CBOW and Skip-gram baseline models by CBOW W-k and SKIP W-k respectively, where k denotes again the order of the window used to train these models. We used Mikolov's word2vec utility³ with standard parameters (600 dimensions, negative sampling 15) to learn the word embeddings, and Cosine as the vector similarity measure between them.
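The paper uses Mikolov's original word2vec tool; as an illustrative stand-in (not what the authors ran), roughly comparable models can be trained with the gensim library:

```python
from gensim.models import Word2Vec

# Toy stand-in for the preprocessed learning corpus (an iterable of token lists).
sentences = [["mary", "son", "likes", "the", "school", "campus"],
             ["the", "forest", "surrounds", "the", "school", "campus"]]

# sg=0 trains CBOW, sg=1 trains Skip-gram; dimensions and negative sampling follow the paper.
cbow = Word2Vec(sentences, vector_size=600, window=4, negative=15, sg=0, min_count=1)
skip = Word2Vec(sentences, vector_size=600, window=4, negative=15, sg=1, min_count=1)

print(cbow.wv.similarity("likes", "surrounds"))  # Cosine similarity between embeddings
```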

As the underlying probabilistic language model for our method we use the Berkeley implementation⁴ (Pauls and Klein, 2011) of the Kneser-Ney n-gram model with the default discount parameters. We denote our model PDS W-k, where PDS stands for "Probabilistic Distributional Similarity" and k is the order of the context word window. In order to avoid giving our model the unfair advantage of tuning the order n of the language model as an additional parameter, we use a fixed n = k + 1. This means that the conditional probabilities that our n-gram model learns consider a scope of up to half the size of the window, which is the distance in words between the target word and either end of the window. We note that this is the smallest reasonable value for n, as smaller values effectively mean that there will be context words within the window that are more than n words away from the target word, and therefore will not be considered by our model.

As learning corpus we used the first CD of the freely available Reuters RCV1 dataset (Rose et al., 2002). This learning corpus contains approximately 100M words, which is comparable in size to the British National Corpus (BNC) (Aston, 1997). We first applied part-of-speech tagging and lemmatization to all words.

³ http://code.google.com/p/word2vec
⁴ http://code.google.com/p/berkeleylm/

Then we represented each word w in the corpus as the pair [pos(w), lemma(w)], where pos(w) is a coarse-grained part-of-speech category and lemma(w) is the lemmatized form of w. Finally, we converted every pair [pos(w), lemma(w)] that occurs less than 100 times in the learning corpus to the pair [pos(w), ?], which represents all rare words of the same part-of-speech tag. Ignoring rare words is a common practice used in order to clean up the corpus and reduce the vocabulary size (Gorman and Curran, 2006; Collobert and Weston, 2008).
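A sketch of this preprocessing step (our illustration; the tagger and lemmatizer are assumed to have been applied already, and lemma_counts to hold corpus frequencies):

```python
RARE = "?"  # shared symbol for rare words of the same part-of-speech tag

def preprocess_token(pos_tag, lemma, lemma_counts, min_count=100):
    """Represent a corpus token as a (coarse POS, lemma) pair, collapsing lemmas
    seen fewer than min_count times into a per-POS rare-word symbol."""
    if lemma_counts.get((pos_tag, lemma), 0) < min_count:
        return (pos_tag, RARE)
    return (pos_tag, lemma)
```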

The above procedure resulted in a word vocabulary of 27K words. From this vocabulary we constructed a target verb set with over 2.5K verbs by selecting all verbs that exist in Wordnet (Fellbaum, 2010). We repeated this procedure to create a target noun set with over 9K nouns. We used our learning corpus for all compared methods and had them assign a semantic similarity score to every pair of verbs and every pair of nouns in these target sets. These scores were later used in all of our evaluations.

4.2 Wordnet evaluation

There is a shortage of large scale test-sets for semantic similarity. Popular test-sets such as WordSim353 and the TOEFL synonyms test contain only 353 and 80 test items respectively, and therefore make it difficult to obtain statistically significant results. To automatically construct larger-scale test-sets for semantic similarity, we sampled large target word subsets from our corpus and used Wordnet as a gold standard for their semantically similar words, following related previous evaluations (Lin, 1998; Riedl and Biemann, 2013). We constructed two test-sets for our primary evaluation, one for verb similarity and another for noun similarity.

To perform the verb similarity evaluation, we randomly sampled 1,000 verbs from the target verb set, where the probability of each verb being sampled is set to be proportional to its frequency in the learning corpus. Next, for each sampled verb a we constructed a Wordnet-based gold standard set of semantically similar words. In this set, each verb a′ is annotated as a 'synonym' of a if at least one of the senses of a′ is a synonym of any of the senses of a. In addition, each verb a′ is annotated as a 'semantic neighbor' of a if at least one of the senses of a′ is a synonym, co-hyponym, or a direct hypernym/hyponym of any of the senses of a.


We note that by definition all verbs annotated as synonyms of a are annotated as semantic neighbors as well. Next, per each verb a and evaluated method, we generated a ranked list of all other verbs, induced according to the similarity scores of this method.
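As an illustration of how such a gold standard can be built (ours; the paper does not specify its tooling), NLTK's WordNet interface provides the required sense relations:

```python
from nltk.corpus import wordnet as wn

def semantic_neighbors(verb):
    """Lemmas that are a synonym, co-hyponym, or direct hypernym/hyponym
    of any sense of the given verb (the 'semantic neighbor' gold standard)."""
    neighbors = set()
    for sense in wn.synsets(verb, pos=wn.VERB):
        related = [sense] + sense.hypernyms() + sense.hyponyms()
        for hyper in sense.hypernyms():
            related += hyper.hyponyms()  # co-hyponyms share a direct hypernym
        for synset in related:
            neighbors.update(lemma.name() for lemma in synset.lemmas())
    neighbors.discard(verb)
    return neighbors
```

Restricting the related synsets to the verb's own senses (the [sense] term alone) yields the stricter 'synonym' gold standard.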

Finally, we evaluated the compared methods on two tasks, 'synonym detection' and 'semantic neighbor detection'. In the synonym detection task we evaluated the methods' ability to retrieve, within the top-n entries of their ranked lists, as many verbs as possible that are annotated as 'synonyms' in our gold standard. Similarly, we evaluated all methods on the 'semantic neighbors' task. The synonym detection task is designed to evaluate the ability of the compared methods to identify a more restrictive interpretation of semantic similarity, while the semantic neighbor detection task does the same for a somewhat broader interpretation.

We repeated the above procedure for the nouns, sampling 1,000 target nouns, constructing the noun Wordnet-based gold standards, and evaluating on the two semantic similarity tasks.

4.3 VerbSim evaluation

The publicly available VerbSim test-set contains 130 verb pairs, each annotated with an average of 6 human judgements of semantic similarity (Yang and Powers, 2006). We extracted a 107-pair subset of this dataset for which all verbs are in our learning corpus. Following works such as Yang and Powers (2007) and Agirre et al. (2009), we compared the Spearman correlations between the verb-pair similarity scores assigned by the compared methods and the manually annotated scores in this dataset.
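A minimal sketch of this evaluation (ours), using SciPy's Spearman correlation; model_score is an assumed function returning a method's similarity score for a verb pair:

```python
from scipy.stats import spearmanr

def verbsim_correlation(verb_pairs, gold_scores, model_score):
    """Spearman correlation between model similarity scores and the human
    judgements over the VerbSim verb pairs."""
    predicted = [model_score(a, b) for (a, b) in verb_pairs]
    rho, _p_value = spearmanr(gold_scores, predicted)
    return rho
```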

5 Results

For each method and verb a in our 1,000 tested verbs, we used the Wordnet gold standard to compute the precision at top-1, top-5 and top-10 of the ranked list generated by this method for a. We then computed mean precision values averaged over all verbs for each of the compared methods, denoted as P@1, P@5 and P@10. The detailed report of P@10 results is omitted for brevity, as they behave very similarly to P@5. We varied the context window order used by all methods to test its effect on the results. We measured the same metrics for nouns.
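The precision metrics can be computed as in the following sketch (ours); ranked_lists maps each target word to its ranked candidate list and gold_sets to its gold standard set:

```python
def precision_at_k(ranked_list, gold_set, k):
    """Fraction of the top-k ranked candidates that appear in the gold standard."""
    return sum(1 for w in ranked_list[:k] if w in gold_set) / k

def mean_precision_at_k(ranked_lists, gold_sets, k):
    """Mean P@k over all target words (e.g. the 1,000 sampled verbs)."""
    return sum(precision_at_k(ranked_lists[a], gold_sets[a], k)
               for a in ranked_lists) / len(ranked_lists)
```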

The results of our Wordnet-based 1,000 verbs evaluation are presented in the upper part of Figure 1.

The results show a significant improvement of our method over all baselines, with a margin of 2 to 3 points on the synonym detection task and 5 to 7 points on the semantic neighbor detection task. Our best performing configurations are PDS W-3 and PDS W-4, outperforming all other baselines on both tasks and in all precision categories. This difference is statistically significant at p < 0.001 using a paired t-test in all cases except for P@1 in the synonym detection task. Within the baselines, the composite feature vector (CFV) performs somewhat better than the independent feature vector (IFV) baseline, and both methods perform best around a window order of two, with gradual decline for larger windows. The word embedding baselines, CBOW and SKIP, perform comparably to the feature vector baselines and to one another, with best performance achieved around a window order of four.

When gradually increasing the context window order within the range of up to 4 words, our PDS model shows improvement. This is in contrast to the feature vector baselines, whose performance declines for context window orders larger than 2. This suggests that our approach is able to take advantage of larger contexts in comparison to standard feature vector models. The decline in performance for the independent feature vector baseline (IFV) may be related to the fact that independent features farther away from the target word are generally more loosely related to it. This seems consistent with previous works, where narrow windows of the order of two words performed well (Bullinaria and Levy, 2007; Agirre et al., 2009; Bruni et al., 2012), and in particular so when evaluating semantic similarity rather than relatedness. On the other hand, the decline in performance for the composite feature vector baseline (CFV) may be attributed to the data sparseness phenomenon associated with larger windows. The performance of the word embedding baselines (CBOW and SKIP) starts declining very mildly only for window orders larger than 4. This might be attributed to the fact that these models assign lower weights to context words the farther away they are from the center of the window.

The results of our Wordnet-based 1,000 nouns evaluation are presented in the lower part of Figure 1. These results are partly consistent with the results achieved for verbs, but with a couple of notable differences.


Figure 1: Mean precision scores as a function of window order, obtained against the Wordnet-based gold standard, on both the verb and noun test-sets with both the synonym and semantic neighbor detection tasks. "P@n" stands for precision in the top-n words of the ranked lists. Note that the Y-axis scale varies between graphs.

First, though our model still outperforms or performs comparably to all other baselines, in this case the advantage of our model over the feature vector baselines is much more moderate and not statistically significant. Second, the word embedding baselines generally perform worst (with CBOW performing a little better than SKIP), and our model outperforms them in both P@5 and P@10 with a margin of around 2 points for the synonym detection task and 3-4 points for the neighbor detection task, with statistical significance at p < 0.001.

Next, to reconfirm the particular applicability of our model to verb similarity, as apparent from the Wordnet evaluation, we performed the VerbSim evaluation and present the results in Table 1.

We compared the Spearman correlation obtained for the top-performing window order of each of the evaluated methods in the Wordnet verbs evaluation. We present two sets of results. The 'all scores' results follow the standard evaluation procedure, considering all similarity scores produced by each method. In the 'top-100 scores' results, for each method we converted to zero the scores it assigned to word pairs where neither of the words is among the top-100 most similar words of the other, and then performed the evaluation with these revised scores. This procedure focuses on evaluating the quality of the methods' top-100 ranked word lists.


Method      All scores   Top-100 scores
PDS W-4     0.616        0.625
CFV W-2     0.477        0.497
IFV W-2     0.467        0.546
SKIP W-4    0.469        0.512
CBOW W-5    0.528        0.469

Table 1: Spearman correlation values obtained for the VerbSim evaluation. Each method was evaluated with the optimal window order found in the Wordnet verbs evaluation.

The results show that our method outperforms all baselines by a margin of more than 8 points, with scores of 0.616 and 0.625 in the 'all scores' and 'top-100 scores' evaluations respectively. Though not statistically significant, due to the small test-set size, these results support the ones from the Wordnet evaluation, suggesting that our model performs better than the baselines on measuring verb similarity.

In summary, our results suggest that in the absence of a robust context modeling scheme it is hard for distributional similarity models to effectively leverage larger word window contexts for measuring semantic similarity. It appears that this is somewhat less of a concern when it comes to noun similarity, as the simple feature vector models reach near-optimal performance with small word windows of order 2, but it is an important factor for verb similarity. In his recent book, Hanks (2013) claims that, contrary to nouns, computational models that are to capture the meanings of verbs must consider their syntagmatic patterns in text. Our particularly good results on verb similarity suggest that our modeling approach is able to capture such information in larger context windows. We further conjecture that the reason the word embedding baselines did not do as well as our model on verb similarity might be their particular choice of joint-context formulation, which is not sensitive to word order. However, these conjectures should be further validated with additional evaluations in future work.

6 Future Directions

In this paper we investigated the potential for improving distributional similarity models by jointly modeling the occurrence of several features under the same context.

We evaluated several previous works with different context modeling approaches and suggest that the type of the underlying context modeling may have a significant effect on the performance of the semantic model. Furthermore, we introduced a generic probabilistic distributional similarity approach, which can leverage the power of established probabilistic language models to effectively model joint-contexts for the purpose of measuring semantic similarity. Our concrete model utilizing n-gram language models outperforms several competitive baselines on semantic similarity tasks, and appears to be particularly well-suited for verbs. In the remainder of this section we describe some potential future directions that can be pursued.

First, the performance of our generic scheme is largely inherited from the nature of its underlying language model. Therefore, we see much potential in exploring the use of other types of language models, such as class-based (Brown et al., 1992), syntax-based (Pauls and Klein, 2012) or hybrid (Tan et al., 2012) models. Furthermore, a similar approach to ours could be attempted in word embedding models. For instance, our syntagmatic joint-context modeling approach could be investigated by word embedding models to generate better embeddings for verbs.

Another direction relates to the well known tendency of many words, and particularly verbs, to assume different meanings (or senses) under different contexts. To address this phenomenon, context sensitive similarity and inference models have been proposed (Dinu and Lapata, 2010; Melamud et al., 2013). Similarly to many semantic similarity models, our current model aggregates information from all observed contexts of a target word type, regardless of its different senses. However, we believe that our approach is well suited to address context sensitive similarity with proper enhancements, as it considers joint-contexts that can more accurately disambiguate the meaning of target words. As an example, it is possible to consider the likelihood of word b occurring in a subset of the contexts observed for word a which is biased towards a particular sense of a.

Finally, we note that our model is not a classic vector space model, and therefore common vector composition approaches (Mitchell and Lapata, 2008) cannot be directly applied to it. Instead, other methods, such as similarity of compositions (Turney, 2012), should be investigated to extend our approach for measuring similarity between phrases.


Acknowledgments

This work was partially supported by the Israeli Ministry of Science and Technology grant 3-8705, the Israel Science Foundation grant 880/12, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT), and the Scientific and Technical Research Council of Turkey (TUBITAK, grant number 112E277).

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of NAACL. Association for Computational Linguistics.

Guy Aston and Lou Burnard. 1997. The BNC Handbook: Exploring the British National Corpus with SARA.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of ACL.

Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Kenneth W. Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Georgiana Dinu and Mirella Lapata. 2010. Measuring distributional similarity in context. In Proceedings of EMNLP.

Christiane Fellbaum. 2010. WordNet. Springer.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web. ACM.

James Gorman and James R. Curran. 2006. Scaling distributional similarity to large corpora. In Proceedings of ACL.

Patrick Hanks. 2013. Lexical Analysis: Norms and Exploitations. MIT Press.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing. IEEE.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers.

Oren Melamud, Jonathan Berant, Ido Dagan, Jacob Goldberger, and Idan Szpektor. 2013. A two level model for context sensitive inference rules. In Proceedings of ACL.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL.

Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models. In Proceedings of ACL.

Adam Pauls and Dan Klein. 2012. Large-scale syntactic language modeling with treelets. In Proceedings of ACL.

Martin Riedl and Chris Biemann. 2013. Scaling to large³ data: An efficient and effective method to compute distributional thesauri. In Proceedings of EMNLP.

Tony Rose, Mark Stevenson, and Miles Whitehead. 2002. The Reuters Corpus Volume 1 - from yesterday's news to tomorrow's language resources. In LREC.

Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM.

Maria Ruiz-Casado, Enrique Alfonseca, and Pablo Castells. 2005. Using context-window overlapping in synonym discovery and ontology extension. In Proceedings of RANLP.

Ming Tan, Wenli Zhou, Lei Zheng, and Shaojun Wang. 2012. A scalable distributed syntactic, semantic, and lexical language model. Computational Linguistics, 38(3):631–671.

Peter D. Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.

Peter D. Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44(1):533–585, May.

Dongqiang Yang and David M. W. Powers. 2006. Verb similarity on the taxonomy of WordNet. In the 3rd International WordNet Conference (GWC-06).

Dongqiang Yang and David M. W. Powers. 2007. An empirical investigation into grammatically constrained contexts in predicting distributional similarity. In Australasian Language Technology Workshop 2007, pages 117–124.

Mehmet Ali Yatbaz, Enis Sert, and Deniz Yuret. 2012. Learning syntactic categories using paradigmatic representations of word context. In Proceedings of EMNLP.
