CONCEPTS AND CHALLENGES OF TEXT RETRIEVAL FOR SEARCH ENGINES
PRE-CONFERENCE TUTORIAL
by Gan Keng Hoon
16th August 2016
1
THIS TUTORIAL
Overview: Text Retrieval & Search Engine
Concept: Basics of Text Retrieval
Challenges: Semantics & Specific Case: Expert Search Engine
2
Search
3
What Do People Search for?
Fu Yuanhui
How to get free Pokeball?
How to write thesis in three months?
keynote speaker ICAICTA 2016
4
What Do People Expect?
How to get free Pokeball
5
Behind the Click?
6
Quiz: Which one is not a Search Engine?
7
Types of Search Engine
Web Search Engine: Google, Yahoo, Bing
Domain-Specific Search Engine: Medline/PubMed, Microsoft Academic
Desktop Search Engine: Copernic
8
Connecting Two Ends
Search Collection
Web, Domain Specific, Personal, Enterprise, etc.
Web Sites, Journal Articles, News, Images, Videos, Audio, Scanned Documents, Tweets, Posts, Reviews, etc.
Information Needs
I want to know more about the keynote speeches of ICAICTA 2016.
I need more Pokeballs, free of charge...
What’s so funny about Fu Yuanhui?
Scholarship ending soon, three months left to submit my thesis...
9
A Conceptual Model for Text Retrieval
Information Needs
Query
Search Collection
Document Representation
Retrieved Documents
Indexing
Formulation
Retrieval Function
Relevance Feedback
Natural Language Content Analysis
10
Natural Language Content Analysis
11
Search Collection (Retrieval Unit)
Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.
A retrieval unit can be:
• Part of a document, e.g. a paragraph, a slide, a page, etc.
• In different structures, e.g. HTML, XML, plain text, etc.
• Of different sizes/lengths.
12
Document Representation
Full Text Representation
• Keep everything. Complete.
• Requires huge resources.
• Too much may not be good.
Reduced (Partial) Content Representation
• Remove unimportant content, e.g. stopwords.
• Standardize to reduce overlapping content, e.g. stemming.
• Retain only important content, e.g. noun phrases, headers, etc.
13
Document Representation
Think of a representation as a way of storing the document.
Bag of Words Model
Store a document as the bag (multiset) of its words, disregarding grammar and even word order.
Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"
From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }
The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
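The bag-of-words construction can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercasing and whitespace splitting are enough for these two punctuation-free documents; the function names are hypothetical and not part of the tutorial.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; enough for these simple examples.
    return text.lower().split()

def build_vocabulary(docs):
    # Collect distinct words in order of first appearance across all documents.
    vocab = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return vocab

def bag_of_words(doc, vocab):
    # Count each vocabulary word in the document; word order is discarded.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
print(vectors[0])
print(vectors[1])
```

The printed word list and count vectors should match the two documents' vectors on the slide.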
14
Information Needs & Query
Information Needs != Query
Recall the information needs:
Query: icaicta 2016 keynote
Information Need: I want to know more about the keynote speeches of ICAICTA 2016.
Query: free pokeball
Information Need: I need more Pokeballs. I don’t want to pay. No cheat codes.
15
Retrieved Documents
From the original collection, a subset of documents is obtained.
What factors determine which documents to return?
Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and the query based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. Output a ranked list and display it to the user; the top ones are more relevant as judged by the system.
16
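A minimal sketch of the four term-matching steps, assuming "similarity" is simply the number of distinct terms shared by query and document. The three short documents are borrowed from the vector space example later in this tutorial; the function names are hypothetical.

```python
def tokenize(text):
    return text.lower().split()

def overlap_score(query, doc):
    # Steps 1-2: similarity = number of distinct terms the query and document share.
    return len(set(tokenize(query)) & set(tokenize(doc)))

def rank(query, collection):
    # Step 3: score every document, then sort by decreasing similarity
    # (ties broken by document id so the output is deterministic).
    scored = [(overlap_score(query, text), doc_id) for doc_id, text in collection.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Step 4: return the ranked list, dropping documents with no term in common.
    return [(doc_id, score) for score, doc_id in scored if score > 0]

collection = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
ranking = rank("straits times", collection)
print(ranking)
```

Counting shared terms treats every term as equally important; the weighting schemes discussed under Retrieval Function refine this.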
Indexing
Convert documents into a representation or data structure that improves the efficiency of retrieval.
Generate a set of useful terms called indexes.
Why?
• Many different words are used in texts, but not all are important.
• Among the important words, some are more contextually relevant.
Some basic processes involved:
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File
17
Indexing (Tokenization)
Convert a sequence of characters into a sequence of tokens with some basic meaning.
“The cat chases the mouse.”
the | cat | chases | the | mouse
“Bigcorp's 2007 bi-annual report showed profits rose 10%.”
bigcorp | 2007 | biannual | report | showed | profits | rose | 10%
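One way to produce such tokens is a small regular-expression tokenizer. The rules below (lowercasing, dropping a possessive 's, joining hyphenated words, keeping a trailing %) are assumptions chosen to handle these two sentences, not the tutorial's own tokenizer.

```python
import re

def tokenize(text):
    # Lowercase, drop possessive 's, and join hyphenated words.
    text = text.lower()
    text = re.sub(r"'s\b", "", text)
    text = text.replace("-", "")
    # Keep runs of letters/digits, with an optional trailing percent sign.
    return re.findall(r"[a-z0-9]+%?", text)

cat_tokens = tokenize("The cat chases the mouse.")
report_tokens = tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")
print(cat_tokens)
print(report_tokens)
```

Whatever rules are chosen, the same function should be applied to both documents and queries so that their tokens can match.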
18
Indexing (Tokenization)
A token can be a single term or multiple terms.
“Samsung Galaxy S7 Edge, redefines what a phone can do.”
samsung galaxy s7 edge | redefines | what | a | phone | can | do
or
samsung | galaxy s7 edge | redefines | what | a | ….
19
Indexing (Tokenization)
Common Issues
1. Capitalized words can have a different meaning from lower-case words
Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire
2. Apostrophes can be a part of a word, a part of a possessive, or just a mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
20
Indexing (Tokenization)
3. Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to steps for documents
21
Indexing (Stopping)
Top 50 Words from AP89 News Collection
Recall: indexes should be useful terms that link to a document.
Are the terms in the figure on the right useful?
22
Indexing (Stopping)
Stopword list can be created from high-frequency words or based on a standard list
Lists are customized for applications, domains, and even parts of documents
e.g., “click” is a good stopword for anchor text
Best policy: index all words in documents, then decide which words to use at query time?
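A stopword list derived from high-frequency words can be sketched as follows. This is an illustrative example on the two toy documents from the Bag of Words slide, assuming whitespace tokenization; `build_stopword_list` and `remove_stopwords` are hypothetical names, and the `top_n` cut-off is an arbitrary choice.

```python
from collections import Counter

def build_stopword_list(docs, top_n):
    # Count every token in the collection and keep the top_n most frequent.
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return {word for word, _ in counts.most_common(top_n)}

def remove_stopwords(tokens, stopwords):
    # Applied at query time, so the index itself can still hold every word.
    return [t for t in tokens if t not in stopwords]

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
stopwords = build_stopword_list(docs, top_n=1)
query_terms = remove_stopwords("the cat sat on the hat".split(), stopwords)
print(stopwords)
print(query_terms)
```

On a real collection the cut-off would be much larger and the list would normally be reviewed by hand, since some frequent words are still useful in particular domains.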
23
Indexing (Stemming)
Many morphological variations of words
• inflectional (plurals, tenses)
• derivational (making verbs nouns etc.)
In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words to a common stem; this usually involves removing suffixes
Can be done at indexing time or as part of query processing (like stopwords)
24
Indexing (Stemming)
Porter Stemmer
Algorithmic stemmer used in IR experiments since the 70s
Consists of a series of rules designed to remove the longest possible suffix at each step
Produces stems, not words
Example: Step 1 (right figure)
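The figure is not available in this transcript, so as an illustration here is a sketch of step 1a as defined in Porter's original algorithm (sses → ss, ies → i, ss → ss, s → removed). The rule set shown on the slide may be a revised version, and a full stemmer has several more steps.

```python
def porter_step_1a(word):
    # Rules are tried in order, so the longest matching suffix is handled first.
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> removed
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The output for "ponies" is "poni", which shows why the stemmer is said to produce stems rather than words.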
25
Indexing (Phrases)
Recall: meaningful tokens, e.g. phrases, make better indexes.
Text processing issue – how are phrases recognized?
Three possible approaches:
• Identify syntactic phrases using a part-of-speech (POS) tagger
• Use word n-grams
• Store word positions in indexes and use proximity operators in queries
26
Indexing (Phrases)
Example Noun Phrases
* Other methods, e.g. n-grams
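Word n-grams are the simplest of these approaches to sketch, since they need no tagger. The example below slides a window of n tokens over the Samsung sentence used earlier; `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(tokens, n):
    # Slide a window of n tokens across the sequence; each window is one candidate phrase.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "samsung galaxy s7 edge redefines what a phone can do".split()
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Every window becomes a candidate phrase, so n-grams produce many non-phrases such as "what a" alongside useful ones such as "samsung galaxy"; frequency filtering is one common way to prune them.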
27
Indexing (Inverted Index)
Recall, indexes are designed to support search.
Each index term is associated with an inverted list
• Contains lists of documents, or lists of word occurrences in documents, and other information.
• Each entry is called a posting.
• The part of the posting that refers to a specific document or location is called a pointer.
• Each document in the collection is given a unique number.
• Lists are usually document-ordered (sorted by document number).
28
Indexing (Inverted Index)
Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish
29
Indexing (Inverted Index)
Simple inverted index.
30
Indexing (Inverted Index)
Inverted index with counts.
Supports better ranking algorithms.
31
Indexing (Inverted Index)
Inverted index with positions.
Supports proximity matching.
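The three index variants can be illustrated with one small structure. The sample Wikipedia sentences are not reproduced in this transcript, so the sketch below uses the two toy documents from the Bag of Words slide instead; document numbers and word positions start at 1, and the function name is hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions]}; documents are numbered in order, starting at 1.
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs, start=1):
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]
index = build_positional_index(docs)

# Simple inverted index: which documents contain the term.
print(sorted(index["cat"]))
# Inverted index with counts: how often the term occurs in each document.
print({doc_id: len(positions) for doc_id, positions in index["the"].items()})
# Inverted index with positions: where the term occurs in each document.
print(dict(index["the"]))
```

Storing positions is the most general form: document lists and counts can both be derived from it, at the cost of a larger index.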
32
Retrieval Function
Ranking
Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.
33
Retrieval Function (Vector Space Model)
Rank-based method.
Documents and query represented by a vector of term weights.
Collection represented by a matrix of term weights.
34
Retrieval Function (Vector Space Model)
D1: new straits times
D2: new straits daily
D3: north borneo times
Vector of useful terms:
      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1
35
Retrieval Function (Vector Space Model)
      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176
idf (borneo) = log(3/1) = 0.477
idf (daily) = log(3/1) = 0.477
idf (new) = log(3/2) = 0.176
idf (north) = log(3/1) = 0.477
idf (straits) = log(3/2) = 0.176
idf (times) = log(3/2) = 0.176
then multiply by tf
tf.idf weight
Term frequency weight measures importance in document.
Inverse document frequency measures importance in collection: idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t.
Note: Doc Length, Term Location, Term Semantic Meaning
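The weights in the table can be recomputed with a short script. This sketch assumes a base-10 logarithm and raw term counts as tf, which is what the numbers on the slide imply; other tf.idf variants normalize tf or smooth idf.

```python
import math

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
# Alphabetical vocabulary: borneo, daily, new, north, straits, times.
vocab = sorted({term for text in docs.values() for term in text.split()})

def idf(term):
    # log10(N / n_t): N documents in total, n_t of them contain the term.
    n_t = sum(1 for text in docs.values() if term in text.split())
    return math.log10(len(docs) / n_t)

def tfidf_vector(text):
    # Raw term count multiplied by idf, one weight per vocabulary term.
    tokens = text.split()
    return [round(tokens.count(term) * idf(term), 3) for term in vocab]

weights = {doc_id: tfidf_vector(text) for doc_id, text in docs.items()}
for doc_id, vector in weights.items():
    print(doc_id, vector)
```

Terms that occur in fewer documents (borneo, daily, north) receive the higher weight, which is the intended effect of idf.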
36
Retrieval Function (Vector Space Model)
Documents are ranked by the distance between the points representing the query and the documents.
A similarity measure is more common than a distance or dissimilarity measure, e.g. cosine correlation.
37
Retrieval Function (Vector Space Model)
Consider two documents D1, D2 and a query Q
Q = “straits times”
Compare against collection, D1 = “new straits times”
(borneo, daily, new, north, straits, times)
Q = (0, 0, 0, 0, 0.176, 0.176)
D1 = (0, 0, 0.176, 0, 0.176, 0.176)
D2 = (0, 0.477, 0.176, 0, 0.176, 0)
$$\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)(0.176^2 + 0.176^2)}} = 0.816$$
Find Cosine(D2, Q).
Which document is more relevant?
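The cosine computation can be checked with a few lines of Python, using the tf.idf vectors given on this slide. The `cosine` helper is a plain implementation of the dot product divided by the product of the vector lengths; returning 0.0 for an all-zero vector is an added convention.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_a * norm_b)

# Term order: borneo, daily, new, north, straits, times
q = [0, 0, 0, 0, 0.176, 0.176]
d1 = [0, 0, 0.176, 0, 0.176, 0.176]
d2 = [0, 0.477, 0.176, 0, 0.176, 0]

print(round(cosine(d1, q), 3))
print(round(cosine(d2, q), 3))
```

The document with the larger cosine value is ranked first for the query.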
38
Evaluation
It is a must to evaluate the retrieval function, preprocessing steps, etc.
Standard Collection
• Task specific
• Human experts are used to judge relevant results.
Performance Metric
• Precision
• Recall
39
Evaluation (Collection)
Test collections consisting of documents, queries, and relevance judgments, e.g.,
40
Evaluation (Collection)
Example query and narrative for the gold standard.
41
Evaluation (Effectiveness Measures)
A is the set of relevant documents, B is the set of retrieved documents
Recall = |A ∩ B| / |A|
Precision = |A ∩ B| / |B|
42
Evaluation (Ranking Effectiveness)
43
Evaluation (Ranking Effectiveness)
Recall@4 = 3/4 Precision@4 = 3/4
Recall@2 = 2/4 Precision@2 = 2/2
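Precision and recall at a rank cut-off k can be sketched as below. The ranked list from the slide's figure is not in this transcript, so the ranking and relevance judgments here are hypothetical, chosen only so that they reproduce the four values above.

```python
def precision_at_k(ranking, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k

def recall_at_k(ranking, relevant, k):
    # Fraction of all relevant documents that appear in the top k.
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / len(relevant)

# Hypothetical ranked result list and relevance judgments (4 relevant documents).
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d2", "d4", "d6"}

for k in (2, 4):
    print(k, precision_at_k(ranking, relevant, k), recall_at_k(ranking, relevant, k))
```

Moving further down the ranking can only keep or raise recall, while precision usually falls as non-relevant documents are included.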
44
Challenges
Social Texts, e.g. Tweets, Posts
Hard question. Hard Disk?
Named Entity
Various levels and aspects of annotations
45
Challenges
Small Data
Specific search
Improve semantics extensively
Big Data
Multimodal retrieval
Connecting many media
46
Case: Adding Semantics Bibliography
Improve Search Results Display
Facet-based semantic
Useful Terms
Demo: ir.cs.usm.my
THANK YOU
[email protected]
49