Basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer
Token stream: Friends Romans Countrymen
Linguistic modules
Modified tokens: friend roman countryman
Indexer
Inverted index
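The pipeline stages can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `tokenize`, `normalize`, and `build_index` are hypothetical names, and `normalize` is a crude stand-in for the linguistic modules (it lower-cases and strips a final "s", so it yields "countrymen", not the "countryman" a real lemmatizer would produce).

```python
import re
from collections import defaultdict

def tokenize(text):
    # Deliberately naive tokenizer: keep maximal runs of ASCII letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    # Stand-in for the linguistic modules: lower-case and strip a plural "s".
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def build_index(docs):
    # docs maps doc_id -> text; the result maps term -> sorted list of doc ids.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            postings[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_index(docs)
```

Here `index["friend"]` lists both documents, while `index["countrymen"]` lists only document 1.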
Parsing a document
What format is it in? pdf/word/excel/html?
What language is it in? What character set is in use?
Plain ASCII, UTF-8, UTF-16,…
Each of these is a classification problem, with many complications…
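As a rough illustration of the character-set question, one crude heuristic is trial decoding: try a few candidate encodings in order and keep the first that decodes without error. The function name `guess_encoding` and the candidate list are assumptions for this sketch; real systems treat this as a classification problem and use more evidence than decodability.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "utf-16")):
    # Return the first candidate encoding that decodes `data` without error,
    # or None if none of them does. Order matters: ASCII is the strictest.
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

For example, `guess_encoding("héllo".encode("utf-8"))` returns `"utf-8"`, since the bytes are not valid ASCII but are valid UTF-8.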
Tokenization: Issues
Chinese/Japanese: no spaces between words, so a unique tokenization is not always guaranteed
Dates/amounts in multiple formats
フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )
Scripts mixed in the example: Katakana, Hiragana, Kanji, “Romaji”
What about DNA sequences? ACCCGGTACGCAC...
The definition of tokens determines what you can search!
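Why tokenization without spaces may not be unique can be seen with a small sketch that enumerates every way to split a string into dictionary words. The lexicon and the English example string are made up for illustration; the same ambiguity arises in languages written without word separators.

```python
def segmentations(text, lexicon):
    # Return every way to split `text` into a sequence of lexicon words.
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

# Hypothetical toy lexicon.
lexicon = {"no", "now", "where", "here", "nowhere"}
splits = segmentations("nowhere", lexicon)
```

With this lexicon, "nowhere" has three valid tokenizations: no + where, now + here, and nowhere.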
Case folding
Reduce all letters to lower case
Exception: upper case in mid-sentence? e.g., General Motors, USA vs. usa
Morgen will ich in MIT … Is this the
German “mit”?
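One possible reading of the exception can be sketched as follows: lower-case sentence-initial tokens, but leave a token that carries capitals in mid-sentence untouched. The function `fold_case` is a hypothetical helper, and the heuristic is language dependent (German, for instance, capitalizes every noun).

```python
def fold_case(tokens):
    # Lower-case tokens, but keep a token with capitals when it is not
    # sentence-initial, since such capitals often mark names or acronyms.
    folded = []
    sentence_start = True
    for tok in tokens:
        if tok in {".", "!", "?"}:
            folded.append(tok)
            sentence_start = True
            continue
        if sentence_start or tok.islower():
            folded.append(tok.lower())
        else:
            folded.append(tok)
        sentence_start = False
    return folded

result = fold_case(["Morgen", "will", "ich", "in", "MIT", "."])
```

In this sketch "Morgen" is folded because it opens the sentence, while "MIT" is kept and therefore stays distinct from the German "mit".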
Stemming
Reduce terms to their “roots”; language dependent
e.g., automate(s), automatic, automation all reduced to automat.
e.g., casa, casalinga, casata, casamatta, casolare, casamento, casale, rincasare, case reduced to cas
Porter’s algorithm
Commonest algorithm for stemming English
Conventions + 5 phases of reductions
Phases are applied sequentially
Each phase consists of a set of commands
Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
Full morphological analysis brings only modest benefit!
sses → ss, ies → i, ational → ate, tional → tion
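The longest-suffix convention can be illustrated with the four sample rules above. This is a minimal sketch, not Porter's algorithm: the real stemmer spreads its rules over several phases and attaches conditions to them, all of which are omitted here. `RULES` and `apply_longest_suffix_rule` are names chosen for this example.

```python
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest_suffix_rule(word, rules=RULES):
    # Among the rules whose suffix matches, apply the one with the longest suffix.
    matching = [(suf, rep) for suf, rep in rules if word.endswith(suf)]
    if not matching:
        return word
    suf, rep = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suf)] + rep
```

For "relational" both "ational" and "tional" match; the longer suffix wins, giving "relate", whereas "conditional" only matches "tional" and becomes "condition".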
Thesauri
Handle synonyms and homonyms
Hand-constructed equivalence classes
e.g., car = automobile
e.g., macchina = automobile = spider
List of words important for a given domain
For each word it specifies a list of correlated words (usually synonyms, polysemous terms, or phrases for complex concepts).
Co-occurrence pattern: BT (broader term), NT (narrower term)
Vehicle (BT) → Car → Fiat 500 (NT)
How to use it in a search engine?
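One common way to use a thesaurus in a search engine is query expansion: each query term is replaced by its whole equivalence class. The sketch below assumes hand-built classes like the ones above; `build_classes` and `expand_query` are hypothetical helpers, and expansion is only one of several possible uses.

```python
def build_classes(groups):
    # Map every word to the equivalence class it belongs to.
    # A word listed in several groups gets the union of those groups.
    classes = {}
    for group in groups:
        for word in group:
            classes.setdefault(word, set()).update(group)
    return classes

def expand_query(terms, classes):
    # Replace each term by all members of its class, without duplicates.
    expanded = []
    for term in terms:
        for candidate in sorted(classes.get(term, {term})):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

classes = build_classes([{"car", "automobile"}])
query = expand_query(["fast", "car"], classes)
```

The expanded query then also matches documents that say "automobile" where the user typed "car".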
Statistical properties of texts
Tokens are not distributed uniformly; they follow the so-called “Zipf Law”
Few tokens are very frequent
A middle-sized set has medium frequency
Many are rare
The first 100 tokens sum up to 50% of the text; many of these tokens are stopwords
The k-th most frequent term has frequency approximately proportional to 1/k; equivalently, the product of the frequency (f) of a token and its rank (r) is almost a constant
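The rank-frequency relation is easy to check on any token list. The sketch below computes, for each token, its rank, its frequency, and their product; the toy token list is constructed for illustration and is not real corpus data.

```python
from collections import Counter

def rank_frequency(tokens):
    # Return (rank, token, frequency, rank * frequency), most frequent first.
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(r, tok, f, r * f) for r, (tok, f) in enumerate(ranked, start=1)]

# Toy data built so that the top ranks follow f ~ 1/r.
tokens = "a a a a a a b b b c c d".split()
table = rank_frequency(tokens)
```

On real text the product in the last column would be expected to stay roughly constant over a wide range of ranks, not exactly constant.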
The Zipf Law, in detail
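$f = c\,|T| / r$, equivalently $r \cdot f = c\,|T|$
General Law: $f = c\,|T| / r^{z}$ (the basic law is the case $z = 1$; the bound below needs $z > 1$)
Sum after the k-th element is $\le f_k \, k / (z-1)$
For the initial top elements it is a constant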
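The bound on the tail sum can be derived by comparing the sum with an integral. The derivation below assumes the general law has the form $f_r = c\,|T| / r^{z}$ with $z > 1$, where $f_r$ denotes the frequency of the token of rank $r$; this notation is introduced here for the derivation.

```latex
\sum_{r > k} f_r
  = \sum_{r > k} \frac{c\,|T|}{r^{z}}
  \le \int_{k}^{\infty} \frac{c\,|T|}{x^{z}} \, dx
  = \frac{c\,|T|\, k^{1-z}}{z-1}
  = \frac{c\,|T|}{k^{z}} \cdot \frac{k}{z-1}
  = \frac{f_k \, k}{z-1}
```

The inequality holds because $x^{-z}$ is decreasing, so each term $r^{-z}$ is at most the integral of $x^{-z}$ over $[r-1, r]$.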
Consequences of Zipf Law
There are a few very frequent tokens that do not discriminate. These are the so-called “stop words”
English: to, from, on, and, the, ...
Italian: a, per, il, in, un,…
There are many tokens that occur only once in a text and thus discriminate poorly (errors?).
English: Calpurnia
Italian: Precipitevolissimevolmente (o, paklo)
Words with medium frequency → words that discriminate
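Selecting the medium-frequency words can be sketched as a simple band-pass filter on token counts. The thresholds `low` and `high` are hypothetical parameters that would have to be tuned per collection; the toy sentence is made up for illustration.

```python
from collections import Counter

def medium_frequency_terms(tokens, low, high):
    # Keep tokens whose frequency lies in [low, high]: this drops both the
    # very frequent stop-word-like tokens and the tokens that occur too rarely.
    counts = Counter(tokens)
    return sorted(t for t, f in counts.items() if low <= f <= high)

tokens = "the cat and the dog and the bird saw the cat".split()
selected = medium_frequency_terms(tokens, low=2, high=3)
```

On such a tiny sample a function word like "and" still slips through, which is why the cut-offs only make sense on a large collection.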
Other statistical properties of texts
The number of distinct tokens grows as $O(|T|^{\beta})$, where $\beta < 1$ (the so-called “Heaps Law”)
Hence the token length is $\Omega(\log |T|)$
Interesting words are the ones with medium frequency (Luhn)
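The sublinear growth of the vocabulary can be observed by counting distinct tokens while scanning a text. The sketch below records the vocabulary size at regular intervals; `vocabulary_growth` is a hypothetical helper, and the sample sentence is far too short to show the actual growth curve.

```python
def vocabulary_growth(tokens, step):
    # Record (tokens seen, distinct tokens seen) after every `step` tokens.
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return points

tokens = "to be or not to be that is the question".split()
points = vocabulary_growth(tokens, step=5)
```

Plotting such points for a large collection on log-log axes would be one way to estimate the exponent of the growth.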