COMP423: Intelligent Agent. Text Representation.
Menu: bag of words – phrase – semantics – ...
Bag of words
• Vector Space Model
– Documents are term vectors
– Tf.Idf for term weights
– Cosine similarity
• Limitations:
– Word semantics
– Semantic distance between words
– Word order
– Word importance
– ...
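A minimal sketch of the vector space model above (pure Python, with a hypothetical toy corpus; a real system would use a library such as scikit-learn):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf.idf vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # tf.idf weight: raw term frequency times log inverse document frequency.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["the", "car", "is", "fast"],
        ["the", "automobile", "is", "fast"],
        ["horses", "run", "in", "fields"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # nonzero: shared "the", "is", "fast"
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```

Note how the toy corpus also exposes the first limitation: "car" and "automobile" contribute nothing to the similarity because bag-of-words has no word semantics.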
Consider Word Order
• N-grams model
– Bi-grams: two words as a phrase; some are not really phrases
– Tri-grams: three words; usually not worth the cost
• Phrase based
– Use part of speech, e.g. select noun phrases
– Regular expressions, chunking: expensive to write the patterns
• Mixed results
• One example [Furnkranz98]
– The representation is evaluated on a Web categorization task (university pages classified as STUDENT, FACULTY, STAFF, DEPARTMENT, etc.)
– A Naive Bayes (NB) classifier and Ripper are used
– Results (words vs. words+phrases) are mixed
• Accuracy improved for NB but not for Ripper
• Precision at low recall greatly improved
• Some phrasal features are highly predictive for certain classes, but in general have low coverage
• More recent work by [Yuefeng Li 2010, KDD]
– Applied to classification, positive results
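The n-gram extraction discussed above can be sketched in a few lines (a toy illustration; real phrase-based pipelines would add POS filtering on top):

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
print(ngrams(tokens, 2))  # bi-grams
print(ngrams(tokens, 3))  # tri-grams
```

The output shows why many bi-grams "are not really phrases": ('the', 'quick') is a valid bi-gram but not a meaningful phrase, which is what motivates POS-based noun-phrase selection.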
Word semantics
• Using external resources
– Early work
• WordNet
• Cyc
– Wikipedia
– The Web
• Mixed results
– Recall is usually improved, but precision is hurt
• Disambiguation is critical
Wordnet
• WordNet’s organization
– The basic unit is the synset = synonym set
– A synset is equivalent to a concept
– E.g. senses of “car” (synsets to which “car” belongs):
• {car, auto, automobile, machine, motorcar}
• {car, railcar, railway car, railroad car}
• {cable car, car}
• {car, gondola}
• {car, elevator car}
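The synset idea can be sketched without external dependencies, using the five “car” senses above as a toy inventory (in practice NLTK’s WordNet corpus provides this via `wn.synsets('car')`):

```python
# Toy synset inventory mirroring the WordNet senses of "car" listed above.
SYNSETS = [
    {"car", "auto", "automobile", "machine", "motorcar"},
    {"car", "railcar", "railway car", "railroad car"},
    {"cable car", "car"},
    {"car", "gondola"},
    {"car", "elevator car"},
]

def senses(word):
    """Return all synsets (senses) a word belongs to."""
    return [s for s in SYNSETS if word in s]

def synonymous(w1, w2):
    """Two words are synonymous in some sense if a synset contains both."""
    return any(w1 in s and w2 in s for s in SYNSETS)

print(len(senses("car")))              # "car" is ambiguous: 5 senses
print(synonymous("auto", "motorcar"))  # True
print(synonymous("auto", "gondola"))   # False
```

This also makes the disambiguation problem concrete: indexing “car” by synset requires choosing which of the five synsets a given occurrence belongs to.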
WordNet is useful for IR
• Indexing with synsets has proven effective [Gonzalo98]
• It improves recall because it involves mapping synonyms into the same indexing object
• It improves precision if only relevant senses are considered
– E.g. a query for “jaguar” in the car sense retrieves only documents about the jaguar car
Mixed results
Concept indexing with WordNet [Scott98, Scott99]
• Using synsets and hypernyms with Ripper
• Fails because they do not perform WSD (Word Sense Disambiguation)
[Junker97]
• Using synsets and hypernyms as generalization operators in a specialized rule learner
• Fails because the proposed learning method gets lost in the hypothesis space
[Fukumoto01]
• Synsets and (limited) hypernyms for SVM, no WSD
• Improvement on less populated categories
In general
• Given that there is no reliable WSD algorithm for (fine-grained) WordNet senses, current approaches do not perform WSD
• Improvements in small categories
• But I believe full, perfect WSD is not required.
Word importance
• Feature selection: needs a corpus for training
– Document frequency
– Information Gain (IG)
– Chi-square
• Keyword extraction
• Feature extraction
• Others
– Using Wikipedia as training and testing data
– Using the Web
– Bringing order to words
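A minimal sketch of one of the feature-selection scores above, chi-square, computed from a 2x2 term/class contingency table (the counts in the example are made up for illustration):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class contingency table.
    n11: docs in the class containing the term
    n10: docs in the class without the term
    n01: docs outside the class containing the term
    n00: docs outside the class without the term"""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# A term occurring only inside the class scores high...
print(chi_square(10, 0, 0, 10))
# ...while a term independent of the class scores zero.
print(chi_square(5, 5, 5, 5))
```

Terms are then ranked by this score (per class, or maximized over classes) and only the top-ranked ones are kept as features.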
Semantic distance between word pairs
– Thesaurus based: WordNet
– Corpus based:
• e.g. Latent Semantic Analysis
• Statistical thesaurus: co-occurrence
– Normalized Google Distance (NGD)
– Wikipedia based
• Wikipedia Link Measure (WLM)
• Explicit Semantic Analysis: ESA (state of the art)
NGD: Motivation and Goals
• To represent meaning in a computer-digestible form
• To establish semantic relations between common names of objects
• To utilise the largest database in the world: the Web
NGD definition
NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)})
where
• x = word one (e.g. 'horse')
• y = word two (e.g. 'rider')
• N = normalising factor (often M)
• M = the cardinality of the set of all pages on the web
• f(x) = the number of pages in which x occurs; f(x, y) = the number of pages containing both x and y
• Because of log N, NGD is stable as the web grows
Example
NGD(horse, rider):
• "horse" returns 46,700,000 pages
• "rider" returns 12,200,000 pages
• "horse rider" returns 2,630,000 pages
• Google indexed 8,058,044,651 pages
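Plugging the example counts into the NGD formula (a small sketch; natural logs are used, but any base gives the same value since it cancels in the ratio):

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts:
    fx, fy: pages containing x and y respectively,
    fxy: pages containing both, n: total pages indexed."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Counts from the slide for "horse", "rider", "horse rider",
# and the number of pages Google had indexed at the time.
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))
# -> 0.443: "horse" and "rider" are fairly closely related
```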
Wikipedia Link similarity measure
• Inlinks
• Outlinks
• Shared inlinks and outlinks; average of the two
– Inlinks: formula borrowed from NGD
– Outlinks:
• w(l, A): the weight of link l for article A, similar to the inverse document frequency
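A sketch of the two components above, assuming the inlink side reuses the NGD-style formula on sets of linking articles (converted to a similarity as 1 minus the distance) and the outlink side weights each link target idf-style; the link sets and counts here are toy assumptions:

```python
import math

def inlink_sim(a_in, b_in, total_articles):
    """NGD-style similarity over the sets of articles linking IN to a and b."""
    common = len(a_in & b_in)
    if common == 0:
        return 0.0
    dist = (math.log(max(len(a_in), len(b_in))) - math.log(common)) / \
           (math.log(total_articles) - math.log(min(len(a_in), len(b_in))))
    return 1.0 - min(dist, 1.0)

def link_weight(target, total_articles, inlink_counts):
    """w(l, A): idf-like weight; rarely-linked targets are more informative."""
    return math.log(total_articles / inlink_counts[target])

# Toy article IDs: two concepts sharing most of their inlinks are close.
print(inlink_sim({1, 2, 3}, {2, 3, 4}, 1000))
print(inlink_sim({1, 2, 3}, {1, 2, 3}, 1000))  # identical inlinks -> 1.0
```

The overall WLM relatedness is then the average of this inlink similarity and an outlink similarity built from the w(l, A) weights.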
Bag of concepts
• WikiMiner, WLM, by Ian Witten, David Milne, Anna Huang
– Wikipedia based approach
– Concepts are anchor texts
• Can be phrases
• Also a way to select important words
– Use shared inlinks and outlinks to estimate the semantic distance between concepts
– New document similarity measure
• There should be other ways
– to define concepts
– to select concepts
– to compare concepts
– ...