DESCRIPTION
Presentation slides for the ACM IHI 2010 talk "The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts"

TRANSCRIPT
The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts
Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
Topics
Natural Language Processing
Semantic ambiguity
Disambiguation versus Discrimination
Text (Context) Representations
  Latent Semantic Analysis (LSA)
  Word Co-occurrence (Schütze)
Cluster contexts to discover word senses
Discrimination experiments with MEDLINE
Discrimination or Disambiguation?
Word sense disambiguation assigns a sense to a target word in context by selecting from a pre-existing set of possibilities (classification)
  Are you a river bank or a money bank?
Word sense discrimination assigns a target word in context to a cluster without regard to any pre-existing sense inventory (discovery)
  How many ways is bank being used?
Discrimination Input: target word contexts
Wellbutrin is used to treat depression.
A deep groove or depression is painful.
It left a tender red depression on his wrist.
Counseling and medication help depression.
Discrimination Output: clusters of contexts
Cluster 1:
  Wellbutrin is used to treat depression.
  Counseling and medication help depression.
Cluster 2:
  A deep groove or depression is painful.
  It left a tender red depression on his wrist.
Word Sense Discrimination!
This is one way to identify senses in the first place – sense inventories aren't static
  Word sense discrimination – identify senses
  Craft a definition – sense labeling
When doing searches we often only need to know that a word is being used in several distinct senses (but may not care exactly what they are)
  depression (economic, indentation, condition)
  apache (helicopter, Native American, software)
  John Smith (names often shared)
The Goal
To carry out word sense discrimination based strictly on empirical evidence in the text that contains our target words (local), or other text we can easily obtain (global)
Be knowledge-lean and avoid dependence on existing sense inventories, ontologies, etc.
Language independent
Domain independent
Discover new senses
First Order Methods
Represent each target word context with a feature vector that shows which unigram or bigram features occur within it
  Other features can be used, including part of speech tags, syntactic info, etc.
Results in a context by feature matrix
  Each row is a context to be clustered
  Each context contains a target word
Cluster: all the contexts in the same cluster are presumed to use the target word in the same sense
First Order Representations: context by features
(2) A deep groove or depression is painful.
(3) It left a tender red depression on his wrist.

      deep  depression  groove  left  painful  red  tender  wrist
(2)    1        1          1      0      1      0      0      0
(3)    0        1          0      1      0      1      1      1

But ... we know that painful and tender are very similar – we just can't see it here ...
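To make the first-order representation concrete, here is a minimal Python sketch (not the SenseClusters implementation) that builds a context-by-feature matrix for the two example contexts; the stop list and whitespace tokenization are simplifying assumptions:

```python
# Build binary first-order vectors: one row per context, one column per
# unigram feature; stop list and tokenization are simplified for illustration.
contexts = [
    "a deep groove or depression is painful",
    "it left a tender red depression on his wrist",
]
stop_list = {"a", "or", "is", "it", "on", "his"}

# Feature set: all non-stop-word unigrams, sorted for stable column order.
features = sorted({w for c in contexts for w in c.split()} - stop_list)

# Context-by-feature matrix: 1 if the feature occurs in the context, else 0.
matrix = [[1 if f in c.split() else 0 for f in features] for c in contexts]

print(features)
for row, c in zip(matrix, contexts):
    print(row, "<-", c)
```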
Look to the Second Order
“You shall know a word by the company it keeps” (J.R. Firth, 1957)
  ... know a friend by their friends
  ... words co-occur with other words
Words that occur in similar contexts will tend to have similar meanings (Zellig Harris, 1954) – the “Distributional Hypothesis”
  ... know a friend by the places they go
Look to the Second Order
Replace each word in a target word context with a vector that reveals something about that word
Replace a word by the company it keeps
  Feature by feature matrix (Schütze) – a word co-occurrence matrix
Replace a word by the places it has been
  Feature by context matrix (LSA) – a term by document matrix
Feature by Feature Matrix
... to replace a word by the company it keeps ...

             hurt  Wellbutrin  sore  bruise  ...
medication    1        1        0      0
counseling    0        1        0      0
tender        1        0        1      1
painful       1        0        1      1
red           1        0        1      0
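The following toy sketch shows how such a word co-occurrence matrix can be accumulated; the sentences and counts are invented purely for illustration:

```python
# Accumulate symmetric co-occurrence counts: the vector for a word is its
# row, i.e. the company it keeps. All data here is invented.
from collections import defaultdict
from itertools import combinations

sentences = [
    "medication can hurt",
    "tender sore bruise hurt",
    "painful sore bruise hurt",
]

cooc = defaultdict(int)
for s in sentences:
    for w1, w2 in combinations(s.split(), 2):
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

vocab = sorted({w for s in sentences for w in s.split()})
for w in ("tender", "painful"):
    print(w, [cooc[(w, v)] for v in vocab])
```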
Feature by Context Matrix
... to replace a word by the places it has been ...

             (100)  (101)  (102)  (103)  ...
medication     0      0      1      1
counseling     0      1      1      0
tender         1      1      0      1
painful        1      1      0      1
red            0      0      1      1
Second Order Representations
Replace each word in a target word context with a vector
  Feature by feature (Schütze): the company it keeps
  Feature by context (LSA): the places it has been
Remove all words that don't have vectors
Average all word vectors together and represent the context with that averaged vector
Do the same with all other target word contexts, then cluster
Second Order Representations
(2) A deep groove or depression is painful.
(3) It left a tender red depression on his wrist.
Nothing matches in the first order representation, but in the second order, since painful and tender ...
  ... both occur with hurt, there is some similarity between (2) and (3)
  ... both occur in documents 100, 101, and 103, there is some similarity between (2) and (3)
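A minimal sketch of the averaging step, reusing the toy word vectors from the feature-by-feature matrix above (all numbers illustrative); the second-order similarity between contexts (2) and (3) comes out well above zero:

```python
# Average the available word vectors of a context; words with no vector
# (here "deep") are simply removed, as described above.
import numpy as np

# Rows from the toy feature-by-feature matrix: hurt, Wellbutrin, sore, bruise.
word_vectors = {
    "tender":  np.array([1.0, 0.0, 1.0, 1.0]),
    "painful": np.array([1.0, 0.0, 1.0, 1.0]),
    "red":     np.array([1.0, 0.0, 1.0, 0.0]),
}

def context_vector(words):
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

c2 = context_vector(["deep", "painful"])   # context (2)
c3 = context_vector(["tender", "red"])     # context (3)

# Cosine similarity is now nonzero, unlike the first-order overlap.
cos = c2 @ c3 / (np.linalg.norm(c2) * np.linalg.norm(c3))
print(round(float(cos), 3))
```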
The Question
Which method of representing contexts is best able to discover and discriminate among senses?
First order feature vectors: traditional vector space (o1-Ngram)
Replace word by the company it keeps: Schütze (o2-SC)
Replace word by the places it has been: LSA (o2-LSA)
Experimental Methodology
Collect contexts with a given target word
Identify lexical features (unigrams or bigrams) within the contexts or in other global data
Use these features to represent contexts using first or second order methods
Perform SVD (optional)
Cluster
  Number of clusters automatically discovered
  Generate a label for each cluster
Evaluate
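For the optional SVD step, a minimal numpy sketch; the matrix here is the tiny first-order example from earlier, whereas the real context-by-feature matrices are large and sparse:

```python
# Truncated SVD: project contexts onto the top k singular dimensions.
import numpy as np

X = np.array([[1, 1, 1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 1, 1, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]   # each row is a context in k dimensions
print(X_reduced.shape)         # (2, 2)
```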
Lexical Features
Unigrams
  High frequency words that aren't in a stop list
  Stop list typically made up of function words like the, for, and, but, etc.
Bigrams
  Two word sequences (separated by up to 8 words) that occur more often than chance
  Selected using Fisher's Exact Test (left sided), p-value = .99
  Bigrams made up of stop words excluded
Features can be identified in the target word contexts (local), or in some other global set of data
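A minimal sketch of the bigram test using scipy's fisher_exact; the 2x2 contingency table counts are invented, and a bigram occurring far more often than chance has a left-sided p-value near 1:

```python
# Select a candidate bigram (w1, w2) with Fisher's exact test (left-sided).
from scipy.stats import fisher_exact

# Invented counts over 10,000 word pairs:
#                  w2 next      w2 not next
# w1 occurs          30              70
# w1 absent          40            9860
table = [[30, 70],
         [40, 9860]]

_, p = fisher_exact(table, alternative="less")

# A bigram that occurs more often than chance has a left-sided p-value
# near 1; per the slides, keep it when p >= .99.
if p >= 0.99:
    print("bigram selected, p =", round(p, 6))
```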
Clustering
Repeated Bisections
  Starts with all contexts in one cluster, then repeatedly partitions (in two) to optimize the criterion function
  Partitioning done via k-means with k=2
I2 criterion function
  Finds the average pairwise similarity between each context in a cluster and the cluster centroid, summed across all clusters
Implemented in Cluto
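A minimal sketch of repeated bisections, using scikit-learn's KMeans as the k=2 partitioner; Cluto's criterion-driven choice of which cluster to split next is simplified here to "split the largest":

```python
# Repeatedly bisect clusters with 2-means until k clusters remain.
import numpy as np
from sklearn.cluster import KMeans

def repeated_bisections(X, k):
    clusters = [np.arange(len(X))]   # start with everything in one cluster
    while len(clusters) < k:
        # Simplification: bisect the largest cluster, rather than the split
        # that best improves the criterion function.
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.random.rand(20, 8)            # 20 contexts, 8 features (toy data)
for c in repeated_bisections(X, 3):
    print(sorted(c.tolist()))
```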
Cluster Stopping
Find k where the criterion function stops improving
PK2 (Hartigan, 1975): ratio of the criterion function at successive values of k
PK3: twice the criterion function at k divided by the sum of its values at (k-1) and (k+1)
PK2 and PK3 stop when these ratios are within 1 standard deviation of 1
Gap Statistic (Tibshirani, 2001): compares observed data with a reference sample of noise; find the k with the greatest divergence from noise
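A minimal sketch of the PK2 and PK3 measures, assuming crfun[k] holds the criterion function value at each k (the values below are invented):

```python
# Compute the PK2 and PK3 cluster-stopping measures from criterion values.
import numpy as np

crfun = {1: 10.0, 2: 16.0, 3: 21.0, 4: 21.6, 5: 21.9}

# PK2: ratio of the criterion function at successive values of k.
pk2 = {k: crfun[k] / crfun[k - 1] for k in range(2, 6)}

# PK3: twice the criterion function at k over the sum at (k-1) and (k+1).
pk3 = {k: 2 * crfun[k] / (crfun[k - 1] + crfun[k + 1]) for k in range(2, 5)}

# Per the slides, clustering stops once these ratios settle to within one
# standard deviation of 1, i.e. adding more clusters no longer helps.
for name, scores in (("PK2", pk2), ("PK3", pk3)):
    std = np.std(list(scores.values()))
    print(name, {k: round(v, 3) for k, v in scores.items()}, "std:", round(std, 3))
```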
Evaluation
Map discovered clusters to "actual" clusters, finding the assignment of discovered clusters to actual clusters that maximizes agreement
Assignment Problem
  Hungarian (Kuhn-Munkres) Algorithm
Precision, Recall, F-measure
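A minimal sketch of the cluster-to-sense mapping with scipy's Hungarian algorithm implementation, using the confusion matrix from the worked example in the extra slides:

```python
# Map discovered clusters to actual senses so that agreement is maximized.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: discovered clusters; columns: actual senses; entries: shared contexts.
agreement = np.array([[10, 10],
                      [10, 30],
                      [40,  0]])

# linear_sum_assignment minimizes cost, so negate to maximize agreement.
rows, cols = linear_sum_assignment(-agreement)
matched = agreement[rows, cols].sum()
print("clusters", rows.tolist(), "-> senses", cols.tolist(),
      "agreement:", int(matched))
```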
Experiments
Isolate context representations, holding most everything else equal
Focus on biomedical text
  Ambiguities exist, often rather fine grained and not well represented in existing resources
  Automatic mapping of terms to concepts is of great practical importance
  Relatively small amounts of manually annotated evaluation data are available, so create a new collection automatically (imperfectly)
Experimental Data
Randomly select 60 MeSH preferred terms, and pair them randomly
  Relatively unambiguous and moderately specific terms
  Medical Subject Headings – used to index medical journal articles
“Create” 30 new ambiguous terms (pseudo-words) that conflate the terms in a pair
  COLON-&-LEG
  PATIENT_CARE-&-OSTEOPOROSIS
Experimental Data
Replace all occurrences of each member of a pair with the new conflated term
Select 1,000 – 10,000 MEDLINE abstracts that contain each pseudo-word
  Create a 50/50 split of the two “senses”
Discriminate into some number of clusters
Evaluate with F-measure
  All in one cluster results in 50%
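A minimal sketch of the pseudo-word conflation step; the regular expression handling is a simplified stand-in for the actual corpus processing:

```python
# Conflate two terms into one ambiguous pseudo-word token.
import re

def conflate(text, term_a, term_b, pseudo):
    # Replace every occurrence of either term with the conflated pseudo-word.
    pattern = re.compile(rf"\b({re.escape(term_a)}|{re.escape(term_b)})\b",
                         flags=re.IGNORECASE)
    return pattern.sub(pseudo, text)

print(conflate("The colon was resected; the left leg was swollen.",
               "colon", "leg", "COLON-&-LEG"))
# -> The COLON-&-LEG was resected; the left COLON-&-LEG was swollen.
```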
Experimental Settings
Unigrams and bigrams as features, selected from target word contexts
First order and second order methods
SVD optional with second order methods
Clustering with repeated bisections and I2
Cluster stopping with PK2, PK3, and Gap
Evaluation relative to 30 pseudo-words
Experimental Results: F-Measure
Experimental Results: discovered k
(See the results table in the extra slides.)
Discussion of Results
Second order methods robust and accurate
  o2-SC overall more accurate and better at predicting the number of senses
  SVD degrades results
First order unigrams effective but brittle
Conflated words not perfect but useful
Knowing the company a word keeps tells us (a bit) more about its meaning than knowing the places it has been
Ongoing and Future Work
Averaging all vectors seems “coarse”; create context representations so that dominant word vectors stand out and noise recedes
Evaluate on more than 2-way distinctions, using manually created gold standard data
Use information in dictionaries and ontologies when we can, but don't be tied to them
UMLS::Similarity – free open source package that measures similarity and relatedness between concepts in the UMLS
http://search.cpan.org/dist/UMLS-Similarity/
Thank you!
All experiments were run with SenseClusters, a freely available open source package from the University of Minnesota, Duluth
http://senseclusters.sourceforge.net
  Download software (for Linux)
  Publications
  Web interface for experiments
The creation of SenseClusters was funded by an NSF CAREER Award (#0092784). This particular study was supported by a grant from NIH/NLM (1R01LM009623-01A2).
Extra Slides
Experimental Data
colon(s|ic)? & legs? | patient care & osteoporosis | blood transfusions? & ventricular functions? |
randomized controlled trials? & haplotypes? | vasodilations? & bronchoalveolar lavages? |
toluenes? & thinking | duodenal ulcers? & clonidines? | myomas? & appetites? |
glycolipids? & prenatal care | thoracic surger(y|ies) & cytogenetic analys(is|es) |
measles virus(es)? & tissue extracts? | lanthanums? & curiums? |
adrenal insufficienc(y|ies) & (recurrent )?laryngeal nerves? | glucokinases? & xeroderma pigmentosums? |
polyvinyl alcohols? & polyribosomes? | urethral strictures? & resistance training |
cholesterol esters? & premature births? | odontoblasts? & anurias? |
brain infarctions? & health resources? | turbinates? & aphids? |
cochlear nerves? & (protein )?kinases? inhibitors? | hematemesis & gemfibrozils? |
nectars? & work of breathing | fusidic acids? & dicarboxylic acids? | brucellas? & potassium iodides? |
walkers? & primidones? | hepatitis( b)? & flavoproteins? | prognathisms? & plant roots? |
plant proteins? & (persistent )?vegetative states? | prophages? & porphyrias?
Evaluation

       COLON   LEG
C1       10     10     20
C2       10     30     40
C3       40      0     40
         60     40    100

Map discovered clusters to actual senses (C3 -> COLON, C2 -> LEG; C1 is left unassigned):

       COLON   LEG
C3      [40]     0     40
C2       10    [30]    40
C1       10     10     20
         60     40    100

Precision = 70/80 = 87.5%
Recall = 70/100 = 70%
F-Measure = 2 * (87.5 * 70) / (87.5 + 70) = 77.8%
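The same numbers, computed in a short Python sketch (the cluster-to-sense assignment is hard-coded to the one found above):

```python
# Reproduce the worked evaluation: C3 -> COLON, C2 -> LEG, C1 unassigned.
matched = 40 + 30               # contexts that agree with the mapped sense
assigned = 40 + 40              # contexts in clusters that received a sense
total = 100                     # all contexts

precision = matched / assigned  # 70 / 80  = 0.875
recall = matched / total        # 70 / 100 = 0.70
f_measure = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
# -> 0.875 0.7 0.778
```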
Experimental Results
F-measure for each method and cluster-stopping rule, with the average number of discovered clusters (k) in parentheses:

       o1-big       o1-uni       o2-sc        o2-sc-svd    o2-lsa       o2-lsa-svd
PK2    64.63 (5.5)  75.24 (4.0)  90.74 (2.2)  57.52 (5.3)  84.16 (2.9)  57.89 (5.3)
PK3    75.08 (3.8)  84.24 (3.0)  90.68 (2.4)  69.44 (2.5)  87.43 (2.3)  67.85 (2.4)
Gap    65.51 (6.2)  87.50 (1.9)  88.57 (2.2)  50.00 (1.0)  83.93 (2.3)  49.56 (1.3)
References
LSI: Deerwester, S., et al. (1988) Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science, 25, pp. 36-40.
Word Co-occurrences: Firth, J. R. (1957) Papers in Linguistics 1934-1951. London: Oxford University Press.
Distributional Hypothesis: Harris, Z. (1954) Distributional structure. Word, 10(2-3), pp. 146-162.
LSA: Landauer, T. K., and Dumais, S. T. (1997) A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, pp. 211-240.
Schütze: Schütze, H. (1998) Automatic word sense discrimination. Computational Linguistics, 24(1), pp. 97-123.