TRANSCRIPT
Information Retrieval and Web Search
Text properties (Note: some of the slides in this set have been adapted from the course taught by Prof. James Allan at U. Massachusetts, Amherst)
Instructor: Rada Mihalcea
Class web page: http://www.cs.unt.edu/~rada/CSCE5300
Slide 2
Last time
•Text preprocessing
  – Required for processing the text collection prior to indexing
  – Required for processing the query prior to retrieval

•Main steps
  – Tokenization
  – Stopword removal
  – Lemmatization / Stemming
    • Porter stemmer
    • New stemming algorithm?
Slide 3
Today’s topics
•Text properties
  – Distribution of words in language
  – Why are they important?

•Zipf’s Law

•Heaps’ Law
Slide 4
Statistical Properties of Text
•How is the frequency of different words distributed?
•How fast does vocabulary size grow with the size of a corpus?
•Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.
Slide 5
Word Frequency
•A few words are very common.
  – The 2 most frequent words (e.g., “the”, “of”) can account for about 10% of word occurrences.

•Most words are very rare.
  – Half the words in a corpus appear only once; these are called hapax legomena (Greek for “read only once”).
•Called a “heavy tailed” distribution, since most of the probability mass is in the “tail”
Slide 6
Sample Word Frequency Data (from B. Croft, UMass)
Slide 7
Zipf’s Law
•Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f).

•Zipf (1949) “discovered” that:

    f · r = k   (for constant k)

•If the probability of the word of rank r is p_r and N is the total number of word occurrences:

    p_r = f / N = A / r   (for A ≈ 0.1, a corpus-independent constant)
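The relation above can be sketched numerically; the values of A and N below are illustrative, not taken from any particular corpus:

```python
# Minimal sketch of Zipf's law: with constant A and a corpus of N word
# occurrences, the predicted frequency of the rank-r word is
# f(r) = A * N / r, so the product f(r) * r equals the constant k = A * N
# at every rank. A and N here are assumed example values.

A = 0.1            # corpus-independent constant from the slide
N = 1_000_000      # hypothetical number of word occurrences

def zipf_frequency(r: int) -> float:
    """Predicted frequency of the word at rank r under Zipf's law."""
    return A * N / r

# f * r stays (essentially) constant across ranks:
for r in (1, 10, 100):
    print(r, zipf_frequency(r), zipf_frequency(r) * r)
```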
Slide 8
Zipf and Term Weighting
• Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.
Slide 9
Predicting Occurrence Frequencies

• By Zipf, a word appearing n times has rank r_n = AN/n.

• Several words may occur n times; assume rank r_n applies to the last of these.

• Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times.
• So, the number of words appearing exactly n times is:

    I_n = r_n − r_{n+1} = AN/n − AN/(n+1) = AN / (n(n+1))
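As a quick numerical check of the formula (A and N are again assumed example values):

```python
# Sketch of the slide's derivation: with constant A and N total word
# occurrences, the predicted number of words appearing exactly n times is
# I_n = A*N / (n*(n+1)).

A = 0.1
N = 1_000_000

def words_with_frequency(n: int) -> float:
    """Predicted count of words occurring exactly n times."""
    return A * N / (n * (n + 1))

# n = 1 gives A*N / 2, i.e. half the vocabulary is hapax legomena:
print(words_with_frequency(1))
```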
Slide 10
Predicting Word Frequencies (cont’d)
•Assume the highest-ranked term occurs once and therefore has rank D = AN/1 = AN (the total number of distinct words).

•Fraction of words with frequency n is:

    I_n / D = 1 / (n(n+1))

•Fraction of words appearing only once is therefore 1/2.
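The fraction 1/(n(n+1)) can be checked exactly with rational arithmetic; note how the series telescopes toward 1:

```python
# Sketch: the predicted fraction of the vocabulary with frequency exactly n
# is 1 / (n*(n+1)); for n = 1 this is 1/2, matching the observation that
# about half the words in a corpus are hapax legomena.

from fractions import Fraction

def vocab_fraction(n: int) -> Fraction:
    """Predicted fraction of distinct words occurring exactly n times."""
    return Fraction(1, n * (n + 1))

print(vocab_fraction(1))                                # 1/2
print(sum(vocab_fraction(n) for n in range(1, 1000)))   # telescopes toward 1
```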
Slide 11
Occurrence Frequency Data (from B. Croft, UMass)
Slide 12
Does Real Data Fit Zipf’s Law?
• A law of the form y = k·x^c is called a power law.

• Zipf’s law is a power law with c = −1.

• On a log-log plot, power laws give a straight line with slope c:

    log y = log k + c · log x

• Zipf is quite accurate except for very high and low rank.
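The slope property gives a simple way to estimate the exponent: fit a least-squares line through (log x, log y). A small stdlib-only sketch, using exact Zipfian data as input:

```python
# Sketch: on a log-log scale a power law y = k * x**c becomes the line
# log y = log k + c * log x, so a least-squares fit of log y against
# log x recovers the exponent c as the slope.

import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Exact Zipfian data (k = 100000, c = -1) yields slope ≈ -1:
ranks = list(range(1, 101))
freqs = [100000 / r for r in ranks]
print(loglog_slope(ranks, freqs))
```

On real rank-frequency data the fitted slope deviates from −1 at the extremes, which is exactly the mismatch the slide mentions.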
Slide 13
Fit to Zipf for Brown Corpus
k = 100,000
Slide 14
Mandelbrot (1954) Correction
•The following more general form gives a bit better fit:
    f = P / (r + ρ)^B   (for constants P, B, ρ)
Slide 15
Mandelbrot Fit
P = 10^5.4, B = 1.15, ρ = 100
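Plugging the fitted constants into the Mandelbrot form shows how it flattens the curve at low ranks compared to pure Zipf (where f(1)/f(2) = 2):

```python
# Sketch of the Mandelbrot correction with the fitted constants quoted on
# the slide: f(r) = P / (r + rho)**B, P = 10**5.4, B = 1.15, rho = 100.

P, B, rho = 10 ** 5.4, 1.15, 100

def mandelbrot_frequency(r: int) -> float:
    """Predicted frequency of the rank-r word under the Mandelbrot fit."""
    return P / (r + rho) ** B

# Frequency still decreases with rank, but much more gently at the top:
print(mandelbrot_frequency(1), mandelbrot_frequency(10), mandelbrot_frequency(100))
```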
Slide 16
Explanations for Zipf’s Law
• Zipf’s explanation was his “principle of least effort”: a balance between the speaker’s desire for a small vocabulary and the hearer’s desire for a large one.
• Debate (1955-61) between Mandelbrot and H. Simon over explanation.
• Li (1992) shows that just random typing of letters including a space will generate “words” with a Zipfian distribution.
  – http://linkage.rockefeller.edu/wli/zipf/
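Li's observation is easy to reproduce; the alphabet size, text length, and seed below are arbitrary choices for illustration:

```python
# Hedged simulation in the spirit of Li (1992): typing letters and spaces
# uniformly at random still yields "words" whose frequency falls off
# steeply with rank, roughly as Zipf's law predicts.

import random
from collections import Counter

random.seed(0)
alphabet = "abcde "   # 5 letters plus a space, chosen arbitrarily
text = "".join(random.choice(alphabet) for _ in range(200_000))
counts = Counter(text.split())

freqs = sorted(counts.values(), reverse=True)
# Short "words" dominate; frequency drops sharply with rank:
print(freqs[0], freqs[9], freqs[99])
```

The intuition: there are few short letter strings and many long ones, so short "words" occur far more often, producing the characteristic rank-frequency decay without any linguistic mechanism.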
Slide 17
Zipf’s Law Impact on IR
•Good News: Stopwords will account for a large fraction of text so eliminating them greatly reduces inverted-index storage costs.
•Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult since they are extremely rare.
Slide 18
Exercise
•Assuming Zipf’s Law with a corpus-independent constant A = 0.1, what is the fewest number of most common words that together account for more than 25% of word occurrences (i.e., the minimum m such that at least 25% of word occurrences are among the m most common words)?
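One way to work the exercise: under Zipf's law the rank-r word has probability A/r, so the m most common words cover A · (1 + 1/2 + … + 1/m) of all occurrences, and we can search for the smallest m whose coverage reaches the target:

```python
# Hedged solution sketch for the exercise: accumulate A / r over ranks
# r = 1, 2, ... until the running coverage reaches the target fraction.

A = 0.1

def smallest_m(target: float) -> int:
    """Smallest m such that the top-m words cover at least `target`."""
    coverage, m = 0.0, 0
    while coverage < target:
        m += 1
        coverage += A / m
    return m

print(smallest_m(0.25))  # → 7
```

Since 0.1 · H_6 ≈ 0.245 and 0.1 · H_7 ≈ 0.259 (H_m the m-th harmonic number), seven words suffice.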
Slide 19
Vocabulary Growth
•How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
•This determines how the size of the inverted index will scale with the size of the corpus.
•Vocabulary not really upper-bounded due to proper names, typos, etc.
Slide 20
Heaps’ Law
•If V is the size of the vocabulary and n is the length of the corpus in words:

    V = K·n^β   (with constants K, β, where 0 < β < 1)

•Typical constants:
  – K ≈ 10–100
  – β ≈ 0.4–0.6 (approximately square root)
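A minimal sketch of the law; K = 50 and β = 0.5 are arbitrary picks from the typical ranges above, not measured constants:

```python
# Sketch of Heaps' law with illustrative constants: V = K * n**beta
# predicts the vocabulary size for a corpus of n word occurrences.

K, beta = 50, 0.5   # assumed example values from the typical ranges

def vocabulary_size(n: int) -> float:
    """Predicted number of unique words in a corpus of n tokens."""
    return K * n ** beta

print(vocabulary_size(1_000_000))  # ≈ 50 * 1000 = 50,000 unique words
```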
Slide 21
Heaps’ Law Data
Slide 22
Explanation for Heaps’ Law
•Can be derived from Zipf’s law by assuming documents are generated by randomly sampling words from a Zipfian distribution.
•Heaps’ Law also holds for distributions of other data
  – Our own experiments on the types of questions asked by users show similar behavior
Slide 23
Exercise
•We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpora:
  – For 100,000 words, there are 50,000 unique words
  – For 500,000 words, there are 150,000 unique words

  – Estimate the vocabulary size for the 1,000,000-word corpus
  – How about for a corpus of 1,000,000,000 words?
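One way to work this exercise: two measurements determine the Heaps' law constants exactly, since dividing the two equations V = K·n^β eliminates K. A sketch:

```python
# Hedged solution sketch: fit V = K * n**beta to the two given data
# points, then extrapolate to the larger corpus sizes.

import math

n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

# Ratio of the two equations: v2/v1 = (n2/n1)**beta, so
beta = math.log(v2 / v1) / math.log(n2 / n1)   # log(3)/log(5) ≈ 0.683
K = v1 / n1 ** beta

def vocabulary(n: int) -> float:
    """Extrapolated vocabulary size for a corpus of n words."""
    return K * n ** beta

print(round(vocabulary(1_000_000)))      # ≈ 240,000 unique words
print(round(vocabulary(1_000_000_000)))  # ≈ 27 million unique words
```

Note the fitted β ≈ 0.68 sits slightly above the typical 0.4–0.6 range quoted earlier, which is plausible for small hypothetical corpora like these.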