COMP423: Intelligent Agent. Text Representation.
Menu: bag of words – phrase – semantics – ...
Bag of words
• Vector Space Model
– Documents are term vectors
– Tf.Idf for term weights
– Cosine similarity
• Limitations:
– Word semantics
– Semantic distance between words
– Word order
– Word importance
– ...
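A minimal sketch of the vector space model above (pure Python, with a hypothetical toy corpus; a real system would use a library such as scikit-learn):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf.idf vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # tf.idf weight: raw term frequency times log inverse document frequency.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["the", "car", "is", "fast"],
        ["the", "automobile", "is", "fast"],
        ["horses", "run", "in", "fields"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # nonzero: shared "the", "is", "fast"
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```

Note how the toy corpus also exposes the first limitation: "car" and "automobile" contribute nothing to the similarity because bag-of-words has no word semantics.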
Consider Word Order
• N-grams model
– Bi-grams: two words as a phrase; some are not really phrases
– Tri-grams: three words; usually not worth the cost
• Phrase based
– Use part of speech, e.g. select noun phrases
– Regular expressions, chunking: expensive to write the patterns
• Mixed results
• One example [Furnkranz98]
– The representation is evaluated on a Web categorization task (university pages classified as STUDENT, FACULTY, STAFF, DEPARTMENT, etc.)
– A Naive Bayes (NB) classifier and Ripper are used
– Results (words vs. words+phrases) are mixed
• Accuracy improved for NB but not for Ripper
• Precision at low recall greatly improved
• Some phrasal features are highly predictive for certain classes, but in general have low coverage
• More recent work by [Yuefeng Li 2010, KDD]
– Applied to classification, positive results
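The n-gram extraction discussed above can be sketched in a few lines (a toy illustration; real phrase-based pipelines would add POS filtering on top):

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
print(ngrams(tokens, 2))  # bi-grams
print(ngrams(tokens, 3))  # tri-grams
```

The output shows why many bi-grams "are not really phrases": ('the', 'quick') is a valid bi-gram but not a meaningful phrase, which is what motivates POS-based noun-phrase selection.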
Word semantics
• Using external resources
– Early work
• WordNet
• Cyc
– Wikipedia
– The Web
• Mixed results
– Recall is usually improved, but precision is hurt
• Disambiguation is critical
Wordnet
• WordNet’s organization
– The basic unit is the synset = synonym set
– A synset is equivalent to a concept
– E.g. senses of “car” (synsets to which “car” belongs):
• {car, auto, automobile, machine, motorcar}
• {car, railcar, railway car, railroad car}
• {cable car, car}
• {car, gondola}
• {car, elevator car}
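The synset idea can be sketched without external dependencies, using the five “car” senses above as a toy inventory (in practice NLTK’s WordNet corpus provides this via `wn.synsets('car')`):

```python
# Toy synset inventory mirroring the WordNet senses of "car" listed above.
SYNSETS = [
    {"car", "auto", "automobile", "machine", "motorcar"},
    {"car", "railcar", "railway car", "railroad car"},
    {"cable car", "car"},
    {"car", "gondola"},
    {"car", "elevator car"},
]

def senses(word):
    """Return all synsets (senses) a word belongs to."""
    return [s for s in SYNSETS if word in s]

def synonymous(w1, w2):
    """Two words are synonymous in some sense if a synset contains both."""
    return any(w1 in s and w2 in s for s in SYNSETS)

print(len(senses("car")))              # "car" is ambiguous: 5 senses
print(synonymous("auto", "motorcar"))  # True
print(synonymous("auto", "gondola"))   # False
```

This also makes the disambiguation problem concrete: indexing “car” by synset requires choosing which of the five synsets a given occurrence belongs to.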
WordNet is useful for IR
• Indexing with synsets has proven effective [Gonzalo98]
• It improves recall because it involves mapping synonyms into the same indexing object
• It improves precision if only relevant senses are considered
– E.g. a query for “jaguar” in the car sense retrieves only documents about the jaguar car
Mixed results
Concept indexing with WordNet [Scott98, Scott99]
• Using synsets and hypernyms with Ripper
• Fails because they do not perform WSD (Word Sense Disambiguation)
[Junker97]
• Using synsets and hypernyms as generalization operators in a specialized rule learner
• Fails because the proposed learning method gets lost in the hypothesis space
[Fukumoto01]
• Synsets and (limited) hypernyms for SVM, no WSD
• Improvement on less populated categories
In general
• Given that there is no reliable WSD algorithm for (fine-grained) WordNet senses, current approaches do not perform WSD
• Improvements in small categories
• But I believe full, perfect WSD is not required.
Word importance
• Feature selection: needs a corpus for training
– Document frequency
– Information Gain (IG)
– Chi-square
• Keyword extraction
• Feature extraction
• Others
– Using Wikipedia as training and testing data
– Using the Web
– Bringing order to words
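A minimal sketch of one of the feature-selection scores above, chi-square, computed from a 2x2 term/class contingency table (the counts in the example are made up for illustration):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class contingency table.
    n11: docs in the class containing the term
    n10: docs in the class without the term
    n01: docs outside the class containing the term
    n00: docs outside the class without the term"""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# A term occurring only inside the class scores high...
print(chi_square(10, 0, 0, 10))
# ...while a term independent of the class scores zero.
print(chi_square(5, 5, 5, 5))
```

Terms are then ranked by this score (per class, or maximized over classes) and only the top-ranked ones are kept as features.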
Semantic distance between word pairs
– Thesaurus based: WordNet
– Corpus based:
• e.g. Latent Semantic Analysis
• Statistical thesaurus: co-occurrence
– Normalized Google Distance (NGD)
– Wikipedia based
• Wikipedia Link Measure (WLM)
• Explicit Semantic Analysis: ESA (state of the art)
NGD: Motivation and Goals
• To represent meaning in a computer-digestible form
• To establish semantic relations between common names of objects
• To utilise the largest database in the world: the Web
NGD definition
NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)})
where
• x = word one (e.g. 'horse')
• y = word two (e.g. 'rider')
• N = normalising factor (often M)
• M = the cardinality of the set of all pages on the web
• f(x) = the number of pages in which x occurs; f(x, y) = the number of pages containing both x and y
• Because of log N, NGD is stable as the web grows
Example
NGD(horse, rider):
• "horse" returns 46,700,000 pages
• "rider" returns 12,200,000 pages
• "horse rider" returns 2,630,000 pages
• Google indexed 8,058,044,651 pages
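Plugging the example counts into the NGD formula (a small sketch; natural logs are used, but any base gives the same value since it cancels in the ratio):

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts:
    fx, fy: pages containing x and y respectively,
    fxy: pages containing both, n: total pages indexed."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Counts from the slide for "horse", "rider", "horse rider",
# and the number of pages Google had indexed at the time.
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))
# -> 0.443: "horse" and "rider" are fairly closely related
```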
Wikipedia Link similarity measure
• Inlinks
• Outlinks
• Shared inlinks and outlinks; average of the two
– Inlinks: formula borrowed from NGD
– Outlinks:
• w(l, A): the weight of link l for article A, similar to the inverse document frequency
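A sketch of the two components above, assuming the inlink side reuses the NGD-style formula on sets of linking articles (converted to a similarity as 1 minus the distance) and the outlink side weights each link target idf-style; the link sets and counts here are toy assumptions:

```python
import math

def inlink_sim(a_in, b_in, total_articles):
    """NGD-style similarity over the sets of articles linking IN to a and b."""
    common = len(a_in & b_in)
    if common == 0:
        return 0.0
    dist = (math.log(max(len(a_in), len(b_in))) - math.log(common)) / \
           (math.log(total_articles) - math.log(min(len(a_in), len(b_in))))
    return 1.0 - min(dist, 1.0)

def link_weight(target, total_articles, inlink_counts):
    """w(l, A): idf-like weight; rarely-linked targets are more informative."""
    return math.log(total_articles / inlink_counts[target])

# Toy article IDs: two concepts sharing most of their inlinks are close.
print(inlink_sim({1, 2, 3}, {2, 3, 4}, 1000))
print(inlink_sim({1, 2, 3}, {1, 2, 3}, 1000))  # identical inlinks -> 1.0
```

The overall WLM relatedness is then the average of this inlink similarity and an outlink similarity built from the w(l, A) weights.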
Bag of concepts
• WikiMiner, WLM, by Ian Witten, David Milne, Anna Huang
– Wikipedia based approach
– Concepts are anchor texts
• Can be phrases
• Also a way to select important words
– Use shared inlinks and outlinks to estimate the semantic distance between concepts
– New document similarity measure
• There should be other ways
– to define concepts
– to select concepts
– to compare concepts
– ...