concepts and challenges of text retrieval for search engine


Upload: gan-keng-hoon

Post on 17-Feb-2017


TRANSCRIPT

Page 1: Concepts and Challenges of Text Retrieval for Search Engine

CONCEPTS AND CHALLENGES OF TEXT RETRIEVAL FOR SEARCH ENGINES

PRE-CONFERENCE TUTORIAL

by Gan Keng Hoon

16th August 2016


Page 2: Concepts and Challenges of Text Retrieval for Search Engine

THIS TUTORIAL

Overview: Text Retrieval & Search Engine
Concept: Basics of Text Retrieval
Challenges: Semantics & Specific
Case: Expert Search Engine

Page 3: Concepts and Challenges of Text Retrieval for Search Engine

Search

Page 4: Concepts and Challenges of Text Retrieval for Search Engine

What Do People Search for?

Fu Yuanhui

How to get free Pokeball ?

How to write thesis in three month ?

keynote speaker ICAICTA 2016


Page 5: Concepts and Challenges of Text Retrieval for Search Engine

What Do People Expect?

How to get free Pokeball

Page 6: Concepts and Challenges of Text Retrieval for Search Engine

Behind the Click?


Page 7: Concepts and Challenges of Text Retrieval for Search Engine

Quiz: Which one is not a Search Engine?


Page 8: Concepts and Challenges of Text Retrieval for Search Engine

Types of Search Engines

Web Search Engine: Google, Yahoo, Bing

Domain-Specific Search Engine: Medline/PubMed, Microsoft Academic

Desktop Search Engine: Copernic

Page 9: Concepts and Challenges of Text Retrieval for Search Engine

Connecting Two Ends

Search Collection: Web, Domain Specific, Personal, Enterprise, etc.

Information Needs:

I want to know more about the keynote speeches of ICAICTA 2016.

I need more Pokeballs, free of charge...

What's so funny about Fu Yuanhui?

Scholarship ending soon, three months left to submit my thesis...

Web Sites, Journal Articles, News, Images, Videos, Audio, Scanned Documents, Tweets, Posts, Reviews, etc.

Page 10: Concepts and Challenges of Text Retrieval for Search Engine

A Conceptual Model for Text Retrieval

Information Needs

Query

Search Collection

Document Representation

Retrieved Documents

Indexing

Formulation

Retrieval Function

Relevance Feedback

Natural Language Content Analysis


Page 11: Concepts and Challenges of Text Retrieval for Search Engine

Natural Language Content Analysis


Page 12: Concepts and Challenges of Text Retrieval for Search Engine

Search Collection (Retrieval Unit)

Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.

A retrieval unit can be:
Part of a document, e.g. a paragraph, a slide, a page, etc.
In different structural forms, e.g. HTML, XML, plain text, etc.
Of different sizes/lengths.

Page 13: Concepts and Challenges of Text Retrieval for Search Engine

Document Representation

Full Text Representation
Keep everything. Complete. Requires huge resources. Too much may not be good.

Reduced (Partial) Content Representation
Remove unimportant content, e.g. stopwords.
Standardize to reduce overlapping content, e.g. stemming.
Retain only important content, e.g. noun phrases, headers, etc.

Page 14: Concepts and Challenges of Text Retrieval for Search Engine

Document Representation

Think of a representation as a way of storing the document.

Bag of Words Model: store the document as the bag (multiset) of its words, disregarding grammar and even word order.

Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"

From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }

The list has 8 distinct words.

Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
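The bag-of-words example can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and lowercasing; the function names `build_vocabulary` and `bag_of_words` are made up for illustration and do not come from the tutorial.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the distinct lowercase words, in order of first appearance."""
    vocabulary = []
    seen = set()
    for text in documents:
        for word in text.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


documents = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(text, vocabulary) for text in documents]

print(vocabulary)
print(vectors)
```

Word order is lost on purpose: only the counts per vocabulary word are kept.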

Page 15: Concepts and Challenges of Text Retrieval for Search Engine

Information Needs & Query

Information Needs != Query

Recall the information needs:

Query: icaicta 2016 keynote
Information Need: I want to know more about the keynote speeches of ICAICTA 2016.

Query: free pokeball
Information Need: I need more Pokeballs. I don’t want to pay. No cheat codes.

Page 16: Concepts and Challenges of Text Retrieval for Search Engine

Retrieved Documents

From the original collection, a subset of documents is obtained.

What factors determine which documents to return?

Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and the query, based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. Output a ranked list and display it to the user; the top documents are the most relevant as judged by the system.
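The four steps of simple term matching can be sketched as follows. This is only an illustration: it assumes that "similarity" is the number of distinct terms shared by the query and the document, and it uses three short documents chosen for the example. The names `overlap_score` and `rank` are hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive choice)."""
    return text.lower().split()


def overlap_score(query, document):
    """Similarity = number of distinct query terms found in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))


def rank(query, collection):
    """Return (doc_id, score) pairs with score > 0, most similar first."""
    scored = [(doc_id, overlap_score(query, text))
              for doc_id, text in collection.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    # Sort by decreasing score; break ties by document id for stable output.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


collection = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}
ranking = rank("straits times", collection)
print(ranking)
```

Counting shared terms treats every term as equally important, which is the limitation that term weighting tries to address.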

Page 17: Concepts and Challenges of Text Retrieval for Search Engine

Indexing

Convert documents into a representation or data structure that improves the efficiency of retrieval.

Generate a set of useful terms called indexes.

Why?
A wide variety of words is used in texts, but not all are important.
Among the important words, some are more contextually relevant.

Some basic processes involved:
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File

Page 18: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Tokenization)

Convert a sequence of characters into a sequence of tokens with some basic meaning.

“The cat chases the mouse.”
Tokens: the | cat | chases | the | mouse

“Bigcorp's 2007 bi-annual report showed profits rose 10%.”
Tokens: bigcorp | 2007 | biannual | report | showed | profits | rose | 10%
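A very simple tokenizer makes the idea concrete. The sketch below assumes one particular rule (lowercase, then keep runs of letters and digits); this rule is an assumption for illustration, not the tutorial's tokenizer, and it deliberately splits on apostrophes and hyphens to show how much the chosen rule affects the resulting tokens.

```python
import re

# Assumed rule: a token is a maximal run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return runs of letters and digits as tokens."""
    return TOKEN_PATTERN.findall(text.lower())


simple = tokenize("The cat chases the mouse.")
harder = tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")

print(simple)
print(harder)
```

With this rule, "Bigcorp's" yields the stray token "s", "bi-annual" becomes two tokens, and the percent sign is lost, so a production tokenizer needs more careful rules.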

Page 19: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Tokenization)

A token can be a single term or multiple terms.

“Samsung Galaxy S7 Edge, redefines what a phone can do.”

Single-term tokens: samsung | galaxy | s7 | edge | redefines | what | a | phone | can | do

or

Multi-term tokens: samsung galaxy | s7 edge | redefines | what | a | ….

Page 20: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Tokenization)

Common Issues

1. Capitalized words can have a different meaning from lower-case words.
Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire

2. Apostrophes can be part of a word, part of a possessive, or just a mistake.
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Page 21: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Tokenization)

3. Numbers can be important, including decimals.
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations.
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Note: the tokenizing steps for queries must be identical to the steps for documents.

Page 22: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Stopping)

Top 50 Words from AP89 News Collection

Recall: indexes should be useful terms that link to a document.

Are the terms in the figure on the right useful?

Page 23: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Stopping)

A stopword list can be created from high-frequency words or based on a standard list.

Lists are customized for applications, domains, and even parts of documents,
e.g., “click” is a good stopword for anchor text.

Is the best policy to index all words in documents and decide which words to use at query time?
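Both ways of obtaining a stopword list can be sketched briefly. The example below is an illustration under assumptions: `STANDARD_STOPWORDS` is a tiny made-up list, not a real standard list, and `frequent_words` simply takes the most frequent words in a toy collection.

```python
from collections import Counter

# Hypothetical short list; real standard lists are much longer.
STANDARD_STOPWORDS = {"the", "a", "of", "and", "to", "in", "on"}


def frequent_words(documents, top_k):
    """Build a stopword list from the top_k most frequent words."""
    counts = Counter(word for text in documents for word in text.lower().split())
    return {word for word, _ in counts.most_common(top_k)}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in stopwords]


documents = ["The cat sat on the hat", "The dog ate the cat and the hat"]
from_frequency = frequent_words(documents, top_k=1)
filtered = remove_stopwords("the cat sat on the hat".split(), STANDARD_STOPWORDS)

print(from_frequency)
print(filtered)
```

Applying `remove_stopwords` at query time rather than at indexing time keeps every word in the index, which matches the policy raised on this slide.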

Page 24: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Stemming)

Many morphological variations of words:
inflectional (plurals, tenses)
derivational (making verbs into nouns, etc.)

In most cases, these have the same or very similar meanings.

Stemmers attempt to reduce morphological variations of words to a common stem; this usually involves removing suffixes.

Stemming can be done at indexing time or as part of query processing (like stopwords).

Page 25: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Stemming)

Porter Stemmer

Algorithmic stemmer used in IR experiments since the 70s.

Consists of a series of rules designed to strip the longest possible suffix at each step.

Produces stems, not words.

Example: Step 1 (right figure)
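The figure with the step 1 rules is not part of this transcript. As an illustration only, the sketch below assumes the four suffix rules of step 1a from Porter's original 1980 description (SSES to SS, IES to I, SS to SS, S to nothing); the figure on the slide may show a different variant. The rules are tried longest suffix first, and only the first match is applied.

```python
# Assumed rules: step 1a of the original Porter stemmer, longest suffix first.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word):
    """Apply the first matching suffix rule; leave the word alone otherwise."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "caress", "cats", "mouse"]:
    print(word, "->", porter_step_1a(word))
```

The output "poni" for "ponies" shows why the stemmer is said to produce stems rather than words.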

Page 26: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Phrases)

Recall from tokenization: meaningful tokens, e.g. phrases, make better indexes.

Text processing issue: how are phrases recognized?

Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in queries
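Of the three approaches, word n-grams are the simplest to sketch. The helper name `word_ngrams` below is made up for illustration, and the input is assumed to be already tokenized.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["samsung", "galaxy", "s7", "edge"]
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

N-grams need no linguistic analysis, but they also produce candidates such as "galaxy s7" that are not meaningful phrases, so they are usually filtered, for example by frequency.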

Page 27: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Phrases)

Example: Noun Phrases

* Other methods, such as n-grams

Page 28: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Inverted Index)

Recall: indexes are designed to support search.

Each index term is associated with an inverted list.
It contains lists of documents, or lists of word occurrences in documents, and other information.
Each entry is called a posting.
The part of the posting that refers to a specific document or location is called a pointer.
Each document in the collection is given a unique number.
Lists are usually document-ordered (sorted by document number).
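A simple inverted index can be sketched with a dictionary that maps each term to a document-ordered list of document numbers. The three sentences below are hypothetical toy documents, not the Wikipedia sentences used in the tutorial's figures, and the function name is an assumption.

```python
from collections import defaultdict


def build_inverted_index(collection):
    """Map each term to the sorted list of document numbers containing it."""
    index = defaultdict(list)
    for doc_id in sorted(collection):  # document-ordered postings
        for term in set(collection[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


# Hypothetical toy collection; each document has a unique number.
collection = {
    1: "tropical fish live in warm water",
    2: "fish tanks hold tropical fish",
    3: "warm water suits aquarium plants",
}
index = build_inverted_index(collection)

# A two-term AND query is the intersection of two inverted lists.
both = sorted(set(index["tropical"]) & set(index["water"]))
print(index["fish"], both)
```

Looking up a term now costs one dictionary access instead of a scan over every document.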

Page 29: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Inverted Index)

Sample collection: four sentences from the Wikipedia entry for Tropical Fish.

Page 30: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Inverted Index)

Simple inverted index.

Page 31: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Inverted Index)

Inverted index with counts.

Supports better ranking algorithms.
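Adding counts means each posting stores a (document number, count) pair. The sketch below uses hypothetical toy documents rather than the tutorial's sample collection, and the function name is made up for illustration.

```python
from collections import Counter, defaultdict


def build_index_with_counts(collection):
    """Map each term to a document-ordered list of (doc_id, count) postings."""
    index = defaultdict(list)
    for doc_id in sorted(collection):
        counts = Counter(collection[doc_id].lower().split())
        for term, count in counts.items():
            index[term].append((doc_id, count))
    return dict(index)


# Hypothetical toy collection.
collection = {
    1: "tropical fish live in warm water",
    2: "fish tanks hold tropical fish",
    3: "warm water suits aquarium plants",
}
index = build_index_with_counts(collection)
print(index["fish"])
```

The counts are the raw term frequencies that weighting schemes such as tf.idf need at ranking time.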

Page 32: Concepts and Challenges of Text Retrieval for Search Engine

Indexing (Inverted Index)

Inverted index with positions.

Supports proximity matching.
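Storing word positions makes proximity queries possible. The following is a minimal sketch under assumptions: positions are counted from 1, the toy documents are hypothetical, and `within_window` is an illustrative helper that checks whether one term occurs shortly after another.

```python
from collections import defaultdict


def build_positional_index(collection):
    """Map each term to {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id in sorted(collection):
        words = collection[doc_id].lower().split()
        for position, term in enumerate(words, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def within_window(index, first, second, window):
    """Doc ids where `second` occurs at most `window` positions after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        if any(0 < q - p <= window
               for p in first_postings[doc_id]
               for q in second_postings[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical toy collection.
collection = {
    1: "tropical fish live in warm water",
    2: "fish tanks hold tropical fish",
    3: "warm water suits aquarium plants",
}
index = build_positional_index(collection)
phrase_hits = within_window(index, "tropical", "fish", window=1)
print(index["fish"], phrase_hits)
```

A window of 1 behaves like an exact phrase match for two words; larger windows give looser proximity operators.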

Page 33: Concepts and Challenges of Text Retrieval for Search Engine

Retrieval Function

Ranking: documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm.

Page 34: Concepts and Challenges of Text Retrieval for Search Engine

Retrieval Function (Vector Space Model)

Rank-based method.

Documents and the query are each represented by a vector of term weights.

The collection is represented by a matrix of term weights.

Page 35: Concepts and Challenges of Text Retrieval for Search Engine

Retrieval Function (Vector Space Model)

D1: new straits times
D2: new straits daily
D3: north borneo times

Vector of useful terms:

     borneo  daily  new  north  straits  times
D1   0       0      1    0      1        1
D2   0       1      1    0      1        0
D3   1       0      0    1      0        1

Page 36: Concepts and Challenges of Text Retrieval for Search Engine

Retrieval Function (Vector Space Model)

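tf.idf weight

Term frequency weight measures importance in the document (in this example, the raw count of the term in the document).

Inverse document frequency measures importance in the collection: idf(t) = log(N / n_t), using base-10 logarithms, where N is the number of documents and n_t is the number of documents containing term t.

idf(borneo) = log(3/1) = 0.477
idf(daily) = log(3/1) = 0.477
idf(new) = log(3/2) = 0.176
idf(north) = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times) = log(3/2) = 0.176

Then multiply by tf:

     borneo  daily  new    north  straits  times
D1   0       0      0.176  0      0.176    0.176
D2   0       0.477  0.176  0      0.176    0
D3   0.477   0      0      0.477  0        0.176

Note: other factors to consider include document length, term location, and term semantic meaning.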
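The weights in the table can be recomputed with a short script. This sketch assumes raw term counts for tf and a base-10 logarithm for idf, which is what the numbers on the slide imply; the function names are made up for illustration.

```python
import math


def idf(term, documents):
    """log10(N / n_t); assumes the term occurs in at least one document."""
    n_t = sum(1 for tokens in documents.values() if term in tokens)
    return math.log10(len(documents) / n_t)


def tfidf_vector(tokens, vocabulary, documents):
    """One weight per vocabulary term: raw count times idf."""
    return [tokens.count(term) * idf(term, documents) for term in vocabulary]


documents = {
    "D1": "new straits times".split(),
    "D2": "new straits daily".split(),
    "D3": "north borneo times".split(),
}
# Alphabetical order: borneo, daily, new, north, straits, times.
vocabulary = sorted({term for tokens in documents.values() for term in tokens})
weights = {doc_id: tfidf_vector(tokens, vocabulary, documents)
           for doc_id, tokens in documents.items()}

for doc_id, vector in weights.items():
    print(doc_id, [round(w, 3) for w in vector])
```

Terms that occur in only one document (borneo, daily, north) receive the highest weight, because they discriminate best between documents.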

Page 37: Concepts and Challenges of Text Retrieval for Search Engine

Retrieval Function (Vector Space Model)

Documents are ranked by the distance between the points representing the query and the documents.

A similarity measure is more common than a distance or dissimilarity measure, e.g. cosine correlation.

Page 38: Concepts and Challenges of Text Retrieval for Search Engine

Retrieval Function (Vector Space Model)

Consider two documents D1, D2 and a query Q.

Q = “straits times”

Compare against the collection, e.g. D1 = “new straits times”.

Term order: (borneo, daily, new, north, straits, times)

Q = (0, 0, 0, 0, 0.176, 0.176)

D1 = (0, 0, 0.176, 0, 0.176, 0.176)

D2 = (0, 0.477, 0.176, 0, 0.176, 0)

\[
\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)(0.176^2 + 0.176^2)}} = 0.816
\]

Find Cosine(D2, Q). Which document is more relevant?
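The cosine computation can be checked in code, which also answers the exercise for D2. This is a minimal sketch using the vectors from the slide; returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math


def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_u * norm_v)


# Term order: borneo, daily, new, north, straits, times.
Q = [0, 0, 0, 0, 0.176, 0.176]
D1 = [0, 0, 0.176, 0, 0.176, 0.176]
D2 = [0, 0.477, 0.176, 0, 0.176, 0]

scores = {"D1": cosine(D1, Q), "D2": cosine(D2, Q)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

D1 shares both query terms while D2 shares only one and also carries the heavily weighted term "daily", so D1 scores higher.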

Page 39: Concepts and Challenges of Text Retrieval for Search Engine

Evaluation

Evaluating the retrieval function, preprocessing steps, etc. is a must.

Standard Collection
Task specific.
Human experts are used to judge relevant results.

Performance Metrics
Precision
Recall

Page 40: Concepts and Challenges of Text Retrieval for Search Engine

Evaluation (Collection)

Test collections consist of documents, queries, and relevance judgments.

Page 41: Concepts and Challenges of Text Retrieval for Search Engine

Evaluation (Collection)

Example query and narrative for the gold standard.

Page 42: Concepts and Challenges of Text Retrieval for Search Engine

Evaluation (Effectiveness Measures)

A is the set of relevant documents; B is the set of retrieved documents.
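With A and B as sets, the two measures follow the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The sketch below uses made-up document ids; returning 0.0 for an empty set is an assumed convention.

```python
def precision(relevant, retrieved):
    """|A intersect B| / |B|: how much of what was retrieved is relevant."""
    if not retrieved:
        return 0.0  # assumed convention when nothing is retrieved
    return len(relevant & retrieved) / len(retrieved)


def recall(relevant, retrieved):
    """|A intersect B| / |A|: how much of what is relevant was retrieved."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(relevant & retrieved) / len(relevant)


# Hypothetical judgments and system output.
A = {"d1", "d2", "d3", "d4"}   # relevant documents
B = {"d1", "d2", "d5"}         # retrieved documents

p = precision(A, B)
r = recall(A, B)
print(p, r)
```

Retrieving more documents can only keep recall the same or raise it, while precision usually falls, which is why the two are reported together.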

Page 43: Concepts and Challenges of Text Retrieval for Search Engine

Evaluation (Ranking Effectiveness)


Page 44: Concepts and Challenges of Text Retrieval for Search Engine

Evaluation (Ranking Effectiveness)

Recall@4 = 3/4 Precision@4 = 3/4

Recall@2 = 2/4 Precision@2 = 2/2

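For a ranked list, precision and recall are computed over the top k results only. The ranking and relevance judgments below are hypothetical, chosen so that the output matches the values shown on this slide (four relevant documents in total, three of them in the top four).

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k results."""
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical ranked output and relevance judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d2", "d4", "d9"}

for k in (2, 4):
    print(k, precision_at_k(ranking, relevant, k), recall_at_k(ranking, relevant, k))
```

Moving from k = 2 to k = 4 raises recall from 2/4 to 3/4 but lowers precision from 2/2 to 3/4.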

Page 45: Concepts and Challenges of Text Retrieval for Search Engine

Challenges

Social Texts, e.g. Tweets, Posts

Hard question. Hard Disk?

Named Entity

Various levels and aspects of annotations

Page 46: Concepts and Challenges of Text Retrieval for Search Engine

Challenges

Small Data
Specific search
Improve semantics extensively

Big Data

Multimodal retrieval

Connecting many media

Page 47: Concepts and Challenges of Text Retrieval for Search Engine

Case: Adding Semantics Bibliography

Page 48: Concepts and Challenges of Text Retrieval for Search Engine

Improve Search Results Display

Facet-based semantic

Useful Terms

Demo: ir.cs.usm.my

Page 49: Concepts and Challenges of Text Retrieval for Search Engine

THANK YOU