ITCS 6265 IR & Web Mining
ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining
Lecture 1: Boolean retrieval
UNC Charlotte, Fall 2009

Course administrivia
Web site: http://www.cs.uncc.edu/~wwu18/itcs6265/
Work/Grading: Assignments 20%, Project 15%, Paper presentation 5%, Midterm 30%, Final 30%
Required textbook: Introduction to Information Retrieval, by Manning et al.

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Unstructured data: the structure and semantics of the data are implicit, not specified in a way that eases computer processing.
Structured data: e.g., records in relational databases.

Unstructured data in 1680
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia. Problems with this approach:
  Slow (for large corpora, e.g., billions or trillions of words)
  Other operations (e.g., find the word Romans near countrymen) are not feasible
  No ranked retrieval (best documents to return); covered in later lectures
Need an index!
Sec. 1.1

Take one: Index in term-document incidence matrix
Columns are documents (plays); rows are words (~32k). An entry is 1 if the play contains the word, 0 otherwise.

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1             0          0       0        1
Brutus              1                  1             0          1       0        0
Caesar              1                  1             0          1       1        1
Calpurnia           0                  1             0          0       0        0
Cleopatra           1                  0             0          0       0        0
mercy               1                  0             1          1       1        1
worser              1                  0             1          1       1        0
…

Query: Brutus AND Caesar but NOT Calpurnia
Sec. 1.1

Term vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them.
110100 AND 110111 AND 101111 = 100100.
Sec. 1.1
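
The bitwise AND above can be checked with a few lines of Python. This is a minimal sketch, assuming each term vector is stored as a 6-bit integer whose leftmost bit is the first play; the names `VECTORS`, `complement` and `decode` are illustrative, not from the slides.

```python
# Plays in the column order used on the slide; the leftmost bit is the first play.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors copied from the matrix rows.
VECTORS = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

MASK = (1 << len(PLAYS)) - 1  # six 1-bits, keeps the complement within 6 bits


def complement(bits):
    """Bitwise NOT restricted to the number of plays."""
    return ~bits & MASK


def decode(bits):
    """Map a result vector back to the titles of the plays whose bit is set."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


result = VECTORS["Brutus"] & VECTORS["Caesar"] & complement(VECTORS["Calpurnia"])
print(format(result, "06b"), decode(result))
```

The result vector 100100 decodes to Antony and Cleopatra and Hamlet.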

Answers to query
Antony and Cleopatra (Act III, Scene ii)
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
Hamlet (Act III, Scene ii)
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Sec. 1.1

Basic assumptions of Information Retrieval
Collection: a fixed set of documents
Goal: retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
Note: an information need is the topic of interest. It differs from a query, which the user formulates based on that need and then submits to a retrieval system to find relevant documents.
But many things might go wrong in this process …
Sec. 1.1

The classic search model
TASK: Get rid of mice in a politically correct way
  ↓ (possible mis-conception)
Info Need: Info about removing mice without killing them
  ↓ (possible mis-translation)
Verbal form: How do I trap mice alive?
  ↓ (possible mis-formulation)
Query: mouse trap
  ↓
SEARCH ENGINE (searches the Corpus) → Results → Query Refinement → back to Query

How good are the retrieved docs?
Precision: fraction of retrieved docs that are relevant to the user’s information need
Recall: fraction of relevant docs in the collection that are retrieved
More precise definitions and measurements to follow in later lectures
Sec. 1.1
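
The two measures can be sketched in Python as follows, assuming the retrieved and relevant documents are given as sets of docIDs. The docIDs in the example are made up for illustration, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved docs that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # assumed convention for an empty result list
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant docs that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: 4 docs returned, 5 docs relevant, 2 in common.
retrieved_docs = {1, 2, 3, 4}
relevant_docs = {2, 4, 5, 6, 7}
print(precision(retrieved_docs, relevant_docs))  # 2 of 4 retrieved are relevant
print(recall(retrieved_docs, relevant_docs))     # 2 of 5 relevant are retrieved
```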

Can the matrix scale to a larger collection?
Shakespeare’s collected works: ~36 plays
Now consider N = 1 million documents, each with about 1000 words. Note that this is still several orders of magnitude smaller than the Web!
At an average of 6 bytes/word (including spaces/punctuation), that is 6 GB (= 1M * 1000 * 6 bytes) of data in the documents.
Say there are M = 500K distinct terms among these.

Can’t build the matrix
A 500K x 1M matrix has half a trillion 0’s and 1’s.
But it has no more than one billion 1’s. Why? Each of the 1M documents has about 1000 words, so each document contributes at most 1000 ones.
The matrix is extremely sparse, i.e., almost all of its entries are zeros.
What would be a better representation? We only record the 1 positions.
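
The sizes quoted on these two slides follow from simple arithmetic. The sketch below recomputes them from the slide's assumptions (1M documents, about 1000 words each, 6 bytes per word, 500K distinct terms); the variable names are illustrative.

```python
N_DOCS = 1_000_000        # N: number of documents
WORDS_PER_DOC = 1_000     # approximate words per document
BYTES_PER_WORD = 6        # average, including spaces/punctuation
M_TERMS = 500_000         # M: number of distinct terms

collection_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD  # raw text size
matrix_cells = M_TERMS * N_DOCS                             # entries in the 0/1 matrix
max_ones = N_DOCS * WORDS_PER_DOC                           # upper bound on 1-entries
density = max_ones / matrix_cells                           # upper bound on fraction of 1s

print(collection_bytes)  # 6 * 10**9 bytes, i.e. 6 GB
print(matrix_cells)      # 5 * 10**11 cells, i.e. half a trillion
print(max_ones)          # 10**9, i.e. one billion
print(density)           # at most 0.2% of the cells are 1
```

Under these assumptions at least 99.8% of the matrix is zeros, which is what motivates storing only the 1 positions.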

Take two: Inverted index
For each term T, we must store a list of all documents that contain T.
Do we use an array or a list for this?

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

Postings are sorted by docID (more later on why).
What happens if the word Caesar is added to document 14?
Sec. 1.2
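
To make the array option concrete, here is a minimal sketch that stores each postings list as a sorted Python list (an array) inside a dictionary keyed by term. The assignment of postings to terms assumes Caesar holds the short list 13 16, and the helper `add_posting` is a hypothetical name.

```python
from bisect import bisect_left

# Dictionary: term -> postings list of docIDs, kept sorted.
index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [13, 16],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings, keeping them sorted and unique."""
    postings = index.setdefault(term, [])
    pos = bisect_left(postings, doc_id)      # binary search for the insert position
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)         # shifts every later entry one slot right


add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

Adding Caesar to document 14 places 14 between 13 and 16. With an array, the entries after the insert position have to be shifted, which is the cost the slide's question points at.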

Inverted index
Linked lists generally preferred to arrays
  Dynamic space allocation
  Insertion of terms into documents easy
  But have space overhead due to pointers

Dictionary   Postings lists (each docID entry is a posting)
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
Sec. 1.2

Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends Romans Countrymen
  ↓ Linguistic modules
Normalized tokens (index terms): friend roman countryman
  ↓ Indexer
Inverted index:
friend     → 2 4
roman      → 1 2
countryman → 13 16
(More on these steps later.)
Sec. 1.2
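
The three stages can be imitated in a few lines of Python. This is only a toy sketch: the regular-expression tokenizer is an assumption, and the `LEMMAS` table is a hypothetical stand-in for real linguistic modules (stemming or lemmatization), which later lectures cover.

```python
import re

# Hypothetical stand-in for linguistic modules: a tiny hand-written lemma table.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Tokenizer: split the text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)


def normalize(tokens):
    """Linguistic modules (toy version): lowercase, then look up a lemma."""
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]


def index_document(index, doc_id, text):
    """Indexer: add doc_id to the postings list of every term in the text."""
    for term in set(normalize(tokenize(text))):
        postings = index.setdefault(term, [])
        if doc_id not in postings:
            postings.append(doc_id)
            postings.sort()  # keep postings sorted by docID


index = {}
index_document(index, 4, "Friends, Romans, countrymen.")  # docIDs are made up
index_document(index, 2, "Friends")
print(normalize(tokenize("Friends, Romans, countrymen.")))
print(index)
```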

Indexer steps
Sequence of (normalized token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Sec. 1.2

Sorting
Sort by term, then by doc ID. This is the core indexing step.

Term       Doc #
ambitious  2
be         2
brutus     1
brutus     2
caesar     1
caesar     2
caesar     2
capitol    1
did        1
enact      1
hath       2
I          1
I          1
i'         1
it         2
julius     1
killed     1
killed     1
let        2
me         1
noble      2
so         2
the        1
the        2
told       2
was        1
was        2
with       2
you        2
Sec. 1.2

Merging
Multiple entries for the same term in the same document are merged.
Frequency information is added. (Why frequency? Will discuss later.)

Term       Doc #  Term freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
caesar     1      1
caesar     2      2
capitol    1      1
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
was        1      1
was        2      1
with       2      1
you        2      1
Sec. 1.2

Grouping (by term) & splitting (into dictionary and postings lists)
Each dictionary entry (term, doc freq, coll freq) points to its postings list of (doc #, term freq) entries.

Term       Doc freq  Coll freq  Postings (doc #, term freq)
ambitious  1         1          (2, 1)
be         1         1          (2, 1)
brutus     2         2          (1, 1) (2, 1)
caesar     2         3          (1, 1) (2, 2)
capitol    1         1          (1, 1)
did        1         1          (1, 1)
enact      1         1          (1, 1)
hath       1         1          (2, 1)
I          1         2          (1, 2)
i'         1         1          (1, 1)
it         1         1          (2, 1)
julius     1         1          (1, 1)
killed     1         2          (1, 2)
let        1         1          (2, 1)
me         1         1          (1, 1)
noble      1         1          (2, 1)
so         1         1          (2, 1)
the        2         2          (1, 1) (2, 1)
told       1         1          (2, 1)
was        2         2          (1, 1) (2, 1)
with       1         1          (2, 1)
you        1         1          (2, 1)
Sec. 1.2
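
The indexer steps (emit pairs, sort, merge with frequencies, group by term) can be sketched end to end on the two example documents. This is a simplified illustration, not the course's implementation: the tokenizer is an assumed regular expression, and unlike the slide tables it lowercases every token, so "I" appears as "i".

```python
import re
from itertools import groupby

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def token_pairs(docs):
    """Step 1: emit (term, doc_id) pairs in document order."""
    return [(tok, doc_id)
            for doc_id, text in docs.items()
            for tok in re.findall(r"[a-z']+", text.lower())]


def build_index(docs):
    """Steps 2-4: sort the pairs, merge duplicates, group postings by term."""
    pairs = sorted(token_pairs(docs))              # sort by term, then by doc ID
    index = {}
    for (term, doc_id), group in groupby(pairs):   # identical pairs are adjacent
        term_freq = len(list(group))               # merged entry keeps a count
        index.setdefault(term, []).append((doc_id, term_freq))
    return index


index = build_index(DOCS)
for term in ("brutus", "caesar", "killed"):
    postings = index[term]
    doc_freq = len(postings)                       # number of docs containing the term
    coll_freq = sum(tf for _, tf in postings)      # total occurrences in the collection
    print(term, doc_freq, coll_freq, postings)
```

For caesar this gives doc freq 2, coll freq 3 and postings (1, 1) (2, 2), matching the dictionary and postings shown on the slide.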

Where do we pay in storage?
Same dictionary and postings as on the previous slide.
Terms: the dictionary entries (Term, N docs, Coll freq)
Pointers: the links from dictionary entries into the postings (Doc #, Freq)
Will quantify the storage later.
Sec. 1.2

The index we just built
How do we process a query? (Today’s focus)
Later: what kinds of queries can we process?
Sec. 1.3

Query processing: AND
Consider processing the query: Brutus AND Caesar
  Locate Brutus in the Dictionary; retrieve its postings.
  Locate Caesar in the Dictionary; retrieve its postings.
  Intersect (sometimes called “merge”) the two postings:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Sec. 1.3
The Intersect/Merge algorithm
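
A minimal sketch of the standard two-pointer intersection, assuming each postings list is a Python list of docIDs sorted in increasing order. The function name `intersect` is chosen here for illustration.

```python
def intersect(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

Each loop iteration advances at least one pointer, so the number of iterations is bounded by the combined length of the two lists. That bound relies on both lists being sorted by docID.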

The complexity of merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
Sec. 1.3

Boolean queries: Exact match
The Boolean retrieval model lets the user pose any query that is a Boolean expression:
  Boolean queries use AND, OR and NOT to join query terms
  Views each document as a set of words
  Is precise: a document either matches the condition or it does not
Primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you’re getting.
Sec. 1.3

Example: WestLaw http://www.westlaw.com/
http://library.uncc.edu/electronic/display.php?resource=434 (accessible to uncc)
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
Tens of terabytes of data; 700,000 users
Majority of users still use Boolean queries
Example:
  Information need: What is the statute of limitations in cases involving the federal tort claims act?
  Query: LIMIT! /3 STATUTE ACT /S FEDERAL /2 TORT /3 CLAIM
  /3 = within 3 words, /S = in same sentence
Sec. 1.4

Example: WestLaw http://www.westlaw.com/
Another example:
  Information need: Requirements for disabled people to be able to access a workplace
  Query: disabl! /p access! /s work-site work-place (employment /3 place)
Note that SPACE is disjunction, not conjunction!
Long, precise queries; proximity operators; incrementally developed; not like web search
Professional searchers often like Boolean search: precision, transparency and control
But that doesn’t mean Boolean queries actually work better…
Sec. 1.4

Boolean queries: More general merges
Exercise: Adapt the merge for the queries:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)? What can we achieve?
Sec. 1.3
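
One possible answer to the first part of the exercise, offered as a sketch rather than the official solution: AND NOT can reuse the two-pointer walk over sorted lists, keeping a docID from the first list only when the second list does not contain it. The function name `and_not` is illustrative.

```python
def and_not(p1, p2):
    """Return docIDs in sorted list p1 that do not occur in sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # p2 has no entry equal to p1[i]
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # excluded: the docID is in both lists
            j += 1
        else:
            j += 1                 # p2 is behind, catch up
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```

This walk still advances at least one pointer per step, so it stays within O(x+y). Brutus OR NOT Caesar is different in kind: its answer contains every document that lacks Caesar, so it needs the set of all docIDs and its cost grows with the collection size rather than with x+y.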

Merging
What about an arbitrary Boolean formula?
  (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in “linear” time? Linear in what?
Can we do better?
Sec. 1.3

Query optimization
What is the best order for query processing?
Consider a query that is an AND of t terms.
For each of the t terms, get its postings, then AND them together.

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 16 21 34
Caesar    → 13 16

Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3

Query optimization example
Process in order of increasing document frequency: start with the smallest set, then keep cutting further.
(This is why we kept the freq in the dictionary.)

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

Execute the query as (Caesar AND Brutus) AND Calpurnia.
Sec. 1.3
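
The ordering heuristic can be sketched as follows, assuming postings are sorted Python lists and using the list length as the document frequency. `intersect_many` is a hypothetical helper name, and the block repeats a two-pointer `intersect` so that it runs on its own.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND together t postings lists, shortest (rarest term) first."""
    ordered = sorted(postings_lists, key=len)   # increasing document frequency
    if not ordered:
        return []
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break                               # intermediate result already empty
        result = intersect(result, postings)
    return result


brutus = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [1, 2, 3, 5, 8, 13, 21, 34]
caesar = [13, 16]
print(intersect_many([brutus, calpurnia, caesar]))
```

Starting from the two-entry Caesar list means no intermediate result can exceed two entries, so each later intersection has very little left to scan on one side.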

More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
Get freq’s for all terms.
Estimate the size of each OR by the sum of its freq’s (conservative).
Process in increasing order of OR sizes.
Sec. 1.3

Exercise
Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope  87009
marmalade     107913
skies         271658
tangerine     46653
trees         316812
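
One way to approach the exercise is to apply the sum-of-frequencies estimate mechanically. The sketch below does that with the frequencies from the table; the names `FREQ`, `QUERY` and `estimated_size` are illustrative, and the resulting order is what the heuristic suggests, not a guaranteed optimum.

```python
# Term frequencies from the exercise table.
FREQ = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# The query as a conjunction of OR groups.
QUERY = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]


def estimated_size(or_group):
    """Conservative upper bound on an OR result: the sum of the term freqs."""
    return sum(FREQ[term] for term in or_group)


order = sorted(QUERY, key=estimated_size)
for group in order:
    print(" OR ".join(group), estimated_size(group))
```

With these numbers the estimate puts (kaleidoscope OR eyes) first at 300321, then (tangerine OR trees) at 363465, then (marmalade OR skies) at 379571.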

What is ahead?
Part I: IR & Search Engines
  Positional index: phrase/proximity queries
  Tolerant retrieval: wildcard, spelling correction, etc.
  Efficient index construction
  Index compression
  Ranked retrieval: term weighting, vector space model
  Evaluation of IR system
  Relevance feedback & query expansion
  Web search, crawling, and indexing

Part II: Text mining techniques
  Classification: given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
    Naïve Bayes classifier, vector space classifier, support vector machine
  Clustering: given a set of docs, group them into clusters based on their contents.
    Flat and hierarchical clustering, latent semantic indexing

Other text mining tasks (if time permits)
  Entity extraction
  Summarization
  Opinion retrieval
  Sentiment analysis
  …

Part III: Web data mining
  Content mining: Web data extraction, web data integration, etc.
  Linkage analysis & structure mining: authority, hub, PageRank, community discovery, etc.
  Usage mining: Web/application server log, user browsing history, etc.