DESCRIPTION
Presentation slides for the ACM IHI 2010 talk "The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts"

TRANSCRIPT
The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts
Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
Topics
Natural Language Processing
Semantic ambiguity
Disambiguation versus Discrimination
Text (Context) Representations
  Latent Semantic Analysis (LSA)
  Word Co-occurrence (Schütze)
Cluster contexts to discover word senses
Discrimination experiments with MEDLINE
Discrimination or Disambiguation?
Word sense disambiguation assigns a sense to a target word in context by selecting from a pre-existing set of possibilities (classification)
  Are you a river bank or a money bank?
Word sense discrimination assigns a target word in context to a cluster without regard to any pre-existing sense inventory (discovery)
  How many ways is bank being used?
Discrimination Input: target word contexts
Wellbutrin is used to treat depression.
A deep groove or depression is painful.
It left a tender red depression on his wrist.
Counseling and medication help depression.
Discrimination Output: clusters of contexts
Cluster 1:
  Wellbutrin is used to treat depression.
  Counseling and medication help depression.
Cluster 2:
  A deep groove or depression is painful.
  It left a tender red depression on his wrist.
Word Sense Discrimination!
This is one way to identify senses in the first place – sense inventories aren't static
  Word sense discrimination – identify senses
  Craft a definition – sense labeling
When doing searches we often only need to know that a word is being used in several distinct senses (but may not care exactly what they are)
  depression (economic, indentation, condition)
  apache (helicopter, Native American, software)
  John Smith (names often shared)
The Goal
To carry out word sense discrimination based strictly on empirical evidence in the text that contains our target words (local), or other text we can easily obtain (global)
Be knowledge-lean and avoid dependence on existing sense inventories, ontologies, etc.
Language independent
Domain independent
Discover new senses
First Order Methods
Represent each target word context with a feature vector that shows which unigram or bigram features occur within it
  Other features can be used, including part of speech tags, syntactic info, etc.
Results in a context by feature matrix
  Each row is a context to be clustered
  Each context contains a target word
Cluster: all the contexts in the same cluster are presumed to use the target word in the same sense
First Order Representations: context by features
(2) A deep groove or depression is painful.
(3) It left a tender red depression on his wrist.

      deep  depression  groove  left  painful  red  tender  wrist
(2)    1        1          1      0      1      0      0      0
(3)    0        1          0      1      0      1      1      1

But ... we know that painful and tender are very similar – we just can't see it here ...
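To make the first-order representation concrete, here is a minimal Python sketch (not the SenseClusters implementation) that builds a context-by-feature matrix for the two example contexts; the stop list and whitespace tokenization are simplifying assumptions:

```python
# Build binary first-order vectors: one row per context, one column per
# unigram feature; stop list and tokenization are simplified for illustration.
contexts = [
    "a deep groove or depression is painful",
    "it left a tender red depression on his wrist",
]
stop_list = {"a", "or", "is", "it", "on", "his"}

# Feature set: all non-stop-word unigrams, sorted for stable column order.
features = sorted({w for c in contexts for w in c.split()} - stop_list)

# Context-by-feature matrix: 1 if the feature occurs in the context, else 0.
matrix = [[1 if f in c.split() else 0 for f in features] for c in contexts]

print(features)
for row, c in zip(matrix, contexts):
    print(row, "<-", c)
```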
Look to the Second Order
“You shall know a word by the company it keeps” (J.R. Firth, 1957)
  ... know a friend by their friends
  ... words co-occur with other words
Words that occur in similar contexts will tend to have similar meanings (Zellig Harris, 1954) – the “Distributional Hypothesis”
  ... know a friend by the places they go
Look to the Second Order
Replace each word in a target word context with a vector that reveals something about that word
Replace a word by the company it keeps
  Feature by feature matrix (Schütze) – a word co-occurrence matrix
Replace a word by the places it has been
  Feature by context matrix (LSA) – a term by document matrix
Feature by Feature Matrix
... to replace a word by the company it keeps ...

             hurt  Wellbutrin  sore  bruise  ...
medication    1        1        0      0
counseling    0        1        0      0
tender        1        0        1      1
painful       1        0        1      1
red           1        0        1      0
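The following toy sketch shows how such a word co-occurrence matrix can be accumulated; the sentences and counts are invented purely for illustration:

```python
# Accumulate symmetric co-occurrence counts: the vector for a word is its
# row, i.e. the company it keeps. All data here is invented.
from collections import defaultdict
from itertools import combinations

sentences = [
    "medication can hurt",
    "tender sore bruise hurt",
    "painful sore bruise hurt",
]

cooc = defaultdict(int)
for s in sentences:
    for w1, w2 in combinations(s.split(), 2):
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

vocab = sorted({w for s in sentences for w in s.split()})
for w in ("tender", "painful"):
    print(w, [cooc[(w, v)] for v in vocab])
```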
Feature by Context Matrix
... to replace a word by the places it has been ...

             (100)  (101)  (102)  (103)  ...
medication     0      0      1      1
counseling     0      1      1      0
tender         1      1      0      1
painful        1      1      0      1
red            0      0      1      1
Second Order Representations
Replace each word in a target word context with a vector
  Feature by feature (Schütze): the company it keeps
  Feature by context (LSA): the places it has been
Remove all words that don't have vectors
Average all word vectors together and represent the context with that averaged vector
Do the same with all other target word contexts, then cluster
Second Order Representations
(2) A deep groove or depression is painful.
(3) It left a tender red depression on his wrist.
Nothing matches in the first order representation, but in the second order, since painful and tender ...
  ... both occur with hurt, there is some similarity between (2) and (3)
  ... both occur in documents 100, 101, and 103, there is some similarity between (2) and (3)
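A minimal sketch of the averaging step, reusing the toy word vectors from the feature-by-feature matrix above (all numbers illustrative); the second-order similarity between contexts (2) and (3) comes out well above zero:

```python
# Average the available word vectors of a context; words with no vector
# (here "deep") are simply removed, as described above.
import numpy as np

# Rows from the toy feature-by-feature matrix: hurt, Wellbutrin, sore, bruise.
word_vectors = {
    "tender":  np.array([1.0, 0.0, 1.0, 1.0]),
    "painful": np.array([1.0, 0.0, 1.0, 1.0]),
    "red":     np.array([1.0, 0.0, 1.0, 0.0]),
}

def context_vector(words):
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

c2 = context_vector(["deep", "painful"])   # context (2)
c3 = context_vector(["tender", "red"])     # context (3)

# Cosine similarity is now nonzero, unlike the first-order overlap.
cos = c2 @ c3 / (np.linalg.norm(c2) * np.linalg.norm(c3))
print(round(float(cos), 3))
```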
The Question
Which method of representing contexts is best able to discover and discriminate among senses?
First order feature vectors: traditional vector space (o1-Ngram)
Replace word by the company it keeps: Schütze (o2-SC)
Replace word by the places it has been: LSA (o2-LSA)
Experimental Methodology
Collect contexts with a given target word
Identify lexical features (unigrams or bigrams) within the contexts or in other global data
Use these features to represent contexts using first or second order methods
Perform SVD (optional)
Cluster
  Number of clusters automatically discovered
  Generate a label for each cluster
Evaluate
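For the optional SVD step, a minimal numpy sketch; the matrix here is the tiny first-order example from earlier, whereas the real context-by-feature matrices are large and sparse:

```python
# Truncated SVD: project contexts onto the top k singular dimensions.
import numpy as np

X = np.array([[1, 1, 1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 1, 1, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]   # each row is a context in k dimensions
print(X_reduced.shape)         # (2, 2)
```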
Lexical Features
Unigrams
  High frequency words that aren't in a stop list
  Stop list typically made up of function words like the, for, and, but, etc.
Bigrams
  Two word sequences (separated by up to 8 words) that occur more often than chance
  Selected using Fisher's Exact Test (left sided), p-value = .99
  Bigrams made up of stop words excluded
Features can be identified in the target word contexts (local), or in some other global set of data
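A minimal sketch of the bigram test using scipy's fisher_exact; the 2x2 contingency table counts are invented, and a bigram occurring far more often than chance has a left-sided p-value near 1:

```python
# Select a candidate bigram (w1, w2) with Fisher's exact test (left-sided).
from scipy.stats import fisher_exact

# Invented counts over 10,000 word pairs:
#                  w2 next      w2 not next
# w1 occurs          30              70
# w1 absent          40            9860
table = [[30, 70],
         [40, 9860]]

_, p = fisher_exact(table, alternative="less")

# A bigram that occurs more often than chance has a left-sided p-value
# near 1; per the slides, keep it when p >= .99.
if p >= 0.99:
    print("bigram selected, p =", round(p, 6))
```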
Clustering
Repeated Bisections
  Starts with all contexts in one cluster, then repeatedly partitions (in two) to optimize the criterion function
  Partitioning done via k-means with k=2
I2 criterion function
  Finds the average pairwise similarity between each context in a cluster and the cluster centroid, summed across all clusters
Implemented in Cluto
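A minimal sketch of repeated bisections, using scikit-learn's KMeans as the k=2 partitioner; Cluto's criterion-driven choice of which cluster to split next is simplified here to "split the largest":

```python
# Repeatedly bisect clusters with 2-means until k clusters remain.
import numpy as np
from sklearn.cluster import KMeans

def repeated_bisections(X, k):
    clusters = [np.arange(len(X))]   # start with everything in one cluster
    while len(clusters) < k:
        # Simplification: bisect the largest cluster, rather than the split
        # that best improves the criterion function.
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(biggest)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.random.rand(20, 8)            # 20 contexts, 8 features (toy data)
for c in repeated_bisections(X, 3):
    print(sorted(c.tolist()))
```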
Cluster Stopping
Find k where the criterion function stops improving
PK2 (Hartigan, 1975): ratio of the criterion function at successive values of k
PK3: twice the criterion function at k divided by the sum of its values at (k-1) and (k+1)
PK2 and PK3 stop when these ratios are within 1 standard deviation of 1
Gap Statistic (Tibshirani, 2001): compares observed data with a reference sample of noise; find the k with the greatest divergence from noise
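A minimal sketch of the PK2 and PK3 measures, assuming crfun[k] holds the criterion function value at each k (the values below are invented):

```python
# Compute the PK2 and PK3 cluster-stopping measures from criterion values.
import numpy as np

crfun = {1: 10.0, 2: 16.0, 3: 21.0, 4: 21.6, 5: 21.9}

# PK2: ratio of the criterion function at successive values of k.
pk2 = {k: crfun[k] / crfun[k - 1] for k in range(2, 6)}

# PK3: twice the criterion function at k over the sum at (k-1) and (k+1).
pk3 = {k: 2 * crfun[k] / (crfun[k - 1] + crfun[k + 1]) for k in range(2, 5)}

# Per the slides, clustering stops once these ratios settle to within one
# standard deviation of 1, i.e. adding more clusters no longer helps.
for name, scores in (("PK2", pk2), ("PK3", pk3)):
    std = np.std(list(scores.values()))
    print(name, {k: round(v, 3) for k, v in scores.items()}, "std:", round(std, 3))
```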
Evaluation
Map discovered clusters to "actual" clusters, finding the assignment of discovered clusters to actual clusters that maximizes agreement
Assignment Problem
  Hungarian (Kuhn-Munkres) Algorithm
Precision, Recall, F-measure
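A minimal sketch of the cluster-to-sense mapping with scipy's Hungarian algorithm implementation, using the confusion matrix from the worked example in the extra slides:

```python
# Map discovered clusters to actual senses so that agreement is maximized.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: discovered clusters; columns: actual senses; entries: shared contexts.
agreement = np.array([[10, 10],
                      [10, 30],
                      [40,  0]])

# linear_sum_assignment minimizes cost, so negate to maximize agreement.
rows, cols = linear_sum_assignment(-agreement)
matched = agreement[rows, cols].sum()
print("clusters", rows.tolist(), "-> senses", cols.tolist(),
      "agreement:", int(matched))
```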
Experiments
Isolate context representations, holding most everything else equal
Focus on biomedical text
  Ambiguities exist, often rather fine grained and not well represented in existing resources
  Automatic mapping of terms to concepts is of great practical importance
  Relatively small amounts of manually annotated evaluation data are available, so create a new collection automatically (imperfectly)
Experimental Data
Randomly select 60 MeSH preferred terms, and pair them randomly
  Relatively unambiguous and moderately specific terms
  Medical Subject Headings – used to index medical journal articles
“Create” 30 new ambiguous terms (pseudo-words) that conflate the terms in a pair
  COLON-&-LEG
  PATIENT_CARE-&-OSTEOPOROSIS
Experimental Data
Replace all occurrences of each member of a pair with the new conflated term
Select 1,000 – 10,000 MEDLINE abstracts that contain each pseudo-word
  Create a 50/50 split of the two “senses”
Discriminate into some number of clusters
Evaluate with F-measure
  All in one cluster results in 50%
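A minimal sketch of the pseudo-word conflation step; the regular expression handling is a simplified stand-in for the actual corpus processing:

```python
# Conflate two terms into one ambiguous pseudo-word token.
import re

def conflate(text, term_a, term_b, pseudo):
    # Replace every occurrence of either term with the conflated pseudo-word.
    pattern = re.compile(rf"\b({re.escape(term_a)}|{re.escape(term_b)})\b",
                         flags=re.IGNORECASE)
    return pattern.sub(pseudo, text)

print(conflate("The colon was resected; the left leg was swollen.",
               "colon", "leg", "COLON-&-LEG"))
# -> The COLON-&-LEG was resected; the left COLON-&-LEG was swollen.
```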
Experimental Settings
Unigrams and bigrams as features, selected from target word contexts
First order and second order methods
SVD optional with second order methods
Clustering with repeated bisections and I2
Cluster stopping with PK2, PK3, and Gap
Evaluation relative to 30 pseudo-words
Experimental Results: F-Measure
Experimental Results: discovered k
(See the results table in the extra slides.)
Discussion of Results
Second order methods robust and accurate
  o2-SC overall more accurate and better at predicting the number of senses
  SVD degrades results
First order unigrams effective but brittle
Conflated words not perfect but useful
Knowing the company a word keeps tells us (a bit) more about its meaning than knowing the places it has been
Ongoing and Future Work
Averaging all vectors seems “coarse”; create context representations so that dominant word vectors stand out and noise recedes
Evaluate on more than 2-way distinctions, using manually created gold standard data
Use information in dictionaries and ontologies when we can, but don't be tied to them
UMLS::Similarity – free open source package that measures similarity and relatedness between concepts in the UMLS
http://search.cpan.org/dist/UMLS-Similarity/
Thank you!
All experiments were run with SenseClusters, a freely available open source package from the University of Minnesota, Duluth
http://senseclusters.sourceforge.net
  Download software (for Linux)
  Publications
  Web interface for experiments
The creation of SenseClusters was funded by an NSF CAREER Award (#0092784). This particular study was supported by a grant from NIH/NLM (1R01LM009623-01A2).
Extra Slides
Experimental Data
colon(s|ic)? & legs? | patient care & osteoporosis | blood transfusions? & ventricular functions? |
randomized controlled trials? & haplotypes? | vasodilations? & bronchoalveolar lavages? |
toluenes? & thinking | duodenal ulcers? & clonidines? | myomas? & appetites? |
glycolipids? & prenatal care | thoracic surger(y|ies) & cytogenetic analys(is|es) |
measles virus(es)? & tissue extracts? | lanthanums? & curiums? |
adrenal insufficienc(y|ies) & (recurrent )?laryngeal nerves? | glucokinases? & xeroderma pigmentosums? |
polyvinyl alcohols? & polyribosomes? | urethral strictures? & resistance training |
cholesterol esters? & premature births? | odontoblasts? & anurias? |
brain infarctions? & health resources? | turbinates? & aphids? |
cochlear nerves? & (protein )?kinases? inhibitors? | hematemesis & gemfibrozils? |
nectars? & work of breathing | fusidic acids? & dicarboxylic acids? | brucellas? & potassium iodides? |
walkers? & primidones? | hepatitis( b)? & flavoproteins? | prognathisms? & plant roots? |
plant proteins? & (persistent )?vegetative states? | prophages? & porphyrias?
Evaluation

       COLON   LEG
C1       10     10     20
C2       10     30     40
C3       40      0     40
         60     40    100

Map discovered clusters to actual senses (C3 -> COLON, C2 -> LEG; C1 is left unassigned):

       COLON   LEG
C3      [40]     0     40
C2       10    [30]    40
C1       10     10     20
         60     40    100

Precision = 70/80 = 87.5%
Recall = 70/100 = 70%
F-Measure = 2 * (87.5 * 70) / (87.5 + 70) = 77.8%
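The same numbers, computed in a short Python sketch (the cluster-to-sense assignment is hard-coded to the one found above):

```python
# Reproduce the worked evaluation: C3 -> COLON, C2 -> LEG, C1 unassigned.
matched = 40 + 30               # contexts that agree with the mapped sense
assigned = 40 + 40              # contexts in clusters that received a sense
total = 100                     # all contexts

precision = matched / assigned  # 70 / 80  = 0.875
recall = matched / total        # 70 / 100 = 0.70
f_measure = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
# -> 0.875 0.7 0.778
```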
Experimental Results
F-measure for each method and cluster-stopping rule, with the average number of discovered clusters (k) in parentheses:

       o1-big       o1-uni       o2-sc        o2-sc-svd    o2-lsa       o2-lsa-svd
PK2    64.63 (5.5)  75.24 (4.0)  90.74 (2.2)  57.52 (5.3)  84.16 (2.9)  57.89 (5.3)
PK3    75.08 (3.8)  84.24 (3.0)  90.68 (2.4)  69.44 (2.5)  87.43 (2.3)  67.85 (2.4)
Gap    65.51 (6.2)  87.50 (1.9)  88.57 (2.2)  50.00 (1.0)  83.93 (2.3)  49.56 (1.3)
References
LSI: Deerwester, S., et al. (1988) Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science, 25, pp. 36-40.
Word Co-occurrences: Firth, J. R. (1957) Papers in Linguistics 1934-1951. London: Oxford University Press.
Distributional Hypothesis: Harris, Z. (1954) Distributional structure. Word, 10(2-3), pp. 146-162.
LSA: Landauer, T. K., and Dumais, S. T. (1997) A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, pp. 211-240.
Schütze: Schütze, H. (1998) Automatic word sense discrimination. Computational Linguistics, 24(1), pp. 97-123.