concepts and challenges of text retrieval for search engine

CONCEPTS AND CHALLENGES OF TEXT RETRIEVAL FOR SEARCH ENGINES

PRE-CONFERENCE TUTORIAL

by Gan Keng Hoon

16th August 2016

1

THIS TUTORIAL

Overview: Text Retrieval & Search Engine

Concept: Basics of Text Retrieval

Challenges: Semantics & Specific

Case: Expert Search Engine

2

Search

3

What Do People Search for?

Fu Yuanhui

How to get free Pokeball ?

How to write thesis in three month ?

keynote speaker ICAICTA 2016

4

What Do People Expect ? How to get free Pokeball

5

Behind the Click?

6

Quiz: Which one is not a Search Engine?

7

Type of Search Engine

Web Search Engine: Google, Yahoo, Bing

Domain Specific Search Engine: Medline/Pubmed, Microsoft Academic

Desktop Search Engine: Copernic

8

Connecting Two Ends

Search Collection

Web, Domain Specific, Personal, Enterprise, Etc.

Information Needs

I want to know more about the keynote speeches of ICAICTA 2016.

I need more Pokeballs Free Of Charge…

What’s so funny about Fu Yuan Hui??

Scholarship ending soon, three months left to submit my thesis….

Web Sites, Journal Articles, News, Images, Videos, Audio, Scanned Documents, Tweets, Posts, Reviews, Etc…

9

A Conceptual Model for Text Retrieval

Information Needs

Query

Search Collection

Document Representation

Retrieved Documents

Indexing

Formulation

Retrieval Function

Relevance Feedback

Natural Language Content Analysis

10

Natural Language Content Analysis

11

Search Collection (Retrieval Unit)

Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, etc.

A retrieval unit can be
• part of a document, e.g. a paragraph, a slide, a page, etc.
• in different structures, e.g. HTML, XML, plain text, etc.
• of different sizes/lengths.

12

Document Representation

Full Text Representation
• Keep everything. Complete.
• Requires huge resources.
• Too much may not be good.

Reduced (partial) Content Representation
• Remove unimportant content, e.g. stopwords.
• Standardize to reduce overlapping content, e.g. stemming.
• Retain only important content, e.g. noun phrases, headers, etc.

13

Document Representation

Think of representation as a way of storing the document.

Bag of Words Model
Store a document as the bag (multiset) of its words, disregarding grammar and even word order.

Document 1: "The cat sat on the hat"
Document 2: "The dog ate the cat and the hat"

From these two documents, a word list is constructed:
{ the, cat, sat, on, hat, dog, ate, and }

The list has 8 distinct words.
Document 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Document 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
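The bag-of-words construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the names `tokenize`, `vocabulary` and `vectors` are chosen here for the example and are not from the tutorial.

```python
from collections import Counter

documents = [
    "The cat sat on the hat",
    "The dog ate the cat and the hat",
]

def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    return text.lower().split()

# Build the word list in order of first appearance.
vocabulary = []
for text in documents:
    for word in tokenize(text):
        if word not in vocabulary:
            vocabulary.append(word)

# Count how often each vocabulary word occurs in each document.
vectors = []
for text in documents:
    counts = Counter(tokenize(text))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order inside each document is lost: only the counts per vocabulary word remain.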

14

Information Needs & Query

Information Needs != Query

Recall the information needs

Query: icaicta 2016 keynote
Information Need: I want to know more about the keynote speeches of ICAICTA 2016.

Query: free pokeball
Information Need: I need more Pokeballs. I don’t want to pay. No cheat codes.

15

Retrieved Documents

From the original collection, a subset of documents is obtained.

What is the factor that determines which documents to return?

Simple Term Matching Approach
1. Compare the terms in a document and the query.
2. Compute the “similarity” between each document in the collection and the query based on the terms they have in common.
3. Sort the documents in order of decreasing similarity with the query.
4. Output a ranked list and display it to the user; the top ones are more relevant as judged by the system.

16
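The four term-matching steps can be sketched as follows. This is only one possible reading, assuming that "similarity" is simply the number of distinct terms shared by query and document; the function name `rank_by_term_overlap` and the sample query are hypothetical.

```python
def rank_by_term_overlap(query, documents):
    """Rank documents by how many distinct terms they share with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        doc_terms = set(text.lower().split())
        # Steps 1 and 2: compare terms and count the ones in common.
        similarity = len(query_terms & doc_terms)
        scored.append((doc_id, similarity))
    # Step 3: sort by decreasing similarity.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

documents = {
    "Document 1": "The cat sat on the hat",
    "Document 2": "The dog ate the cat and the hat",
}

# Step 4: the ranked list shown to the user.
ranking = rank_by_term_overlap("dog hat", documents)
print(ranking)
```

Counting shared terms treats every term as equally important, which is the limitation that term weighting addresses.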

Indexing

Convert documents into a representation or data structure to improve the efficiency of retrieval.

To generate a set of useful terms called indexes.

Why?
• A wide variety of words is used in texts, but not all are important.
• Among the important words, some are more contextually relevant.

Some basic processes involved
• Tokenization
• Stop Words Removal
• Stemming
• Phrases
• Inverted File

17

Indexing (Tokenization)

Convert a sequence of characters into a sequence of tokens with some basic meaning.

“The cat chases the mouse.”

the | cat | chases | the | mouse

“Bigcorp's 2007 bi-annual report showed profits rose 10%.”

bigcorp | 2007 | biannual | report | showed | profits | rose | 10%

18

Indexing (Tokenization)

A token can be a single term or multiple terms.

“Samsung Galaxy S7 Edge, redefines what a phone can do.”

samsung | galaxy | s7 | edge | redefines | what | a | phone | can | do

or

samsung galaxy s7 edge | redefines | what | a | ….

19

Indexing (Tokenization)

Common Issues

1. Capitalized words can have a different meaning from lower-case words

Bush fires the officer. Query: Bush fire
The bush fire lasted for 3 days. Query: bush fire

2. Apostrophes can be a part of a word, a part of a possessive, or just a mistake

rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

20

Indexing (Tokenization)

3. Numbers can be important, including decimals

nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations

I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Note: tokenizing steps for queries must be identical to steps for documents
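A deliberately naive tokenizer makes these issues concrete. The sketch below assumes one simple rule (lowercase, then keep runs of letters and digits); it is not the tokenizer used in the tutorial, and `tokenize` is a name chosen for the example. The same function is applied to both document and query, as the note requires.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

document_tokens = tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")
query_tokens = tokenize("Bush fire")

print(tokenize("The cat chases the mouse."))
print(document_tokens)
print(query_tokens)
print(tokenize("I.B.M."))
```

With this rule the apostrophe in "Bigcorp's" leaves a stray "s", the hyphen splits "bi-annual" in two, the percent sign is dropped, "Bush" and "bush" become indistinguishable, and "I.B.M." falls apart into single letters. Each of these is a design decision a real tokenizer has to make explicitly.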

21

Indexing (Stopping)

Top 50 Words from AP89 News Collection

Recall: index terms should be useful links to a document.

Are the terms in the figure on the right useful?

22

Indexing (Stopping)

Stopword list can be created from high-frequency words or based on a standard list

Lists are customized for applications, domains, and even parts of documents

e.g., “click” is a good stopword for anchor text

Best policy is to index all words in documents, make decisions about which words to use at query time?
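As a rough sketch of the first option (a stopword list built from high-frequency words), the code below counts words over a tiny collection and treats the most frequent ones as stopwords. The cut-off `top_n` and the function names are assumptions for illustration; a real list would be derived from a much larger collection or taken from a standard list.

```python
from collections import Counter

def build_stopword_list(documents, top_n):
    """Return the top_n most frequent words in the collection."""
    counts = Counter(
        word for text in documents for word in text.lower().split()
    )
    return {word for word, _ in counts.most_common(top_n)}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in stopwords]

documents = [
    "The cat sat on the hat",
    "The dog ate the cat and the hat",
]

stopwords = build_stopword_list(documents, top_n=1)
filtered = remove_stopwords("the cat sat on the hat".split(), stopwords)
print(stopwords)
print(filtered)
```

Applying `remove_stopwords` at query time rather than at indexing time keeps all words in the index, in line with the policy suggested above.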

23

Indexing (Stemming)

Many morphological variations of words
• inflectional (plurals, tenses)
• derivational (making verbs nouns etc.)

In most cases, these have the same or very similar meanings

Stemmers attempt to reduce morphological variations of words to a common stem
• usually involves removing suffixes

Can be done at indexing time or as part of query processing (like stopwords)

24

Indexing (Stemming)

Porter Stemmer

Algorithmic stemmer used in IR experiments since the 70s

Consists of a series of rules designed to strip the longest possible suffix at each step

Produces stems, not words

Example: Step 1 (right figure)
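To give a feel for the rule style, here is a sketch of step 1a as defined in the original 1980 Porter algorithm (SSES to SS, IES to I, SS unchanged, S removed). This is an assumption about which rules to show: the figure on the slide is not available here and may list a different variant of step 1. The function name `porter_step_1a` is chosen for this example.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter stemmer to a lowercase word.

    Rules are tried from the longest suffix to the shortest, and only
    the first matching rule is applied.
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for word in ["caresses", "ponies", "caress", "cats", "mouse"]:
    print(word, "->", porter_step_1a(word))
```

The output "poni" shows why the stemmer is said to produce stems rather than words.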

25

Indexing (Phrases)

Recall: meaningful tokens, e.g. phrases, make better indexes.

Text processing issue – how are phrases recognized?

Three possible approaches:
• Identify syntactic phrases using a part-of-speech (POS) tagger
• Use word n-grams
• Store word positions in indexes and use proximity operators in queries
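The second approach, word n-grams, is the simplest to sketch: every run of n consecutive tokens becomes a candidate phrase. The helper name `word_ngrams` and the sample tokens are chosen for illustration.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "samsung galaxy s7 edge".split()
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

N-grams need no linguistic analysis, but they also generate many candidates that are not meaningful phrases, so they are usually filtered by frequency.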

26

Indexing (Phrases)

Example: Noun Phrases

* Other methods, e.g. N-Gram

27

Indexing (Inverted Index)

Recall, indexes are designed to support search.

Each index term is associated with an inverted list
• Contains lists of documents, or lists of word occurrences in documents, and other information.
• Each entry is called a posting.
• The part of the posting that refers to a specific document or location is called a pointer.
• Each document in the collection is given a unique number.
• Lists are usually document-ordered (sorted by document number).

28

Indexing (Inverted Index)

Sample collection: 4 sentences from the Wikipedia entry for Tropical Fish

29

Indexing (Inverted Index)

Simple inverted index.

30

Indexing (Inverted Index)

Inverted index with counts.

Support better ranking algorithms.

31

Indexing (Inverted Index)

Inverted index with positions.

Support proximity matching.
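The three index variants can be sketched with one data structure. Because the Tropical Fish sample collection is not reproduced here, the example reuses the three short documents D1 to D3 from the vector space model slides, numbered 1 to 3. The function names and the dictionary layout are assumptions for illustration; production indexes use compressed posting lists instead.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each term to {document number: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in sorted(documents.items()):
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_match(index, first, second):
    """Return documents where `second` occurs right after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = index.get(second, {}).get(doc_id, [])
        if any(position + 1 in following for position in positions):
            hits.append(doc_id)
    return sorted(hits)

documents = {
    1: "new straits times",
    2: "new straits daily",
    3: "north borneo times",
}
index = build_positional_index(documents)

# Simple inverted index: which documents contain the term.
print(sorted(index["times"]))
# Inverted index with counts: how often the term occurs per document.
print({doc_id: len(pos) for doc_id, pos in index["times"].items()})
# Inverted index with positions: enables proximity (here, phrase) matching.
print(phrase_match(index, "straits", "times"))
```

The simple and count-based indexes are derived here from the positional one, which shows that each variant stores strictly more information than the previous one.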

32

Retrieval Function

Ranking

Documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm

33

Retrieval Function (Vector Space Model)

Rank-based method.

Documents and query represented by a vector of term weights.

Collection represented by a matrix of term weights.

34

Retrieval Function (Vector Space Model)

      borneo  daily  new  north  straits  times
D1    0       0      1    0      1        1
D2    0       1      1    0      1        0
D3    1       0      0    1      0        1

D1: new straits times
D2: new straits daily
D3: north borneo times

Vector of useful terms

35


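      borneo  daily  new    north  straits  times
D1    0       0      0.176  0      0.176    0.176
D2    0       0.477  0.176  0      0.176    0
D3    0.477   0      0      0.477  0        0.176

idf(borneo) = log(3/1) = 0.477
idf(daily) = log(3/1) = 0.477
idf(new) = log(3/2) = 0.176
idf(north) = log(3/1) = 0.477
idf(straits) = log(3/2) = 0.176
idf(times) = log(3/2) = 0.176
then multiply by tf

tf.idf weight

Term frequency weight measures importance in the document (the table uses the raw count of the term in the document).

Inverse document frequency measures importance in the collection: idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t (base-10 logarithm in the values shown).

Note: other factors to consider are document length, term location, and term semantic meaning.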
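The weights in the table can be reproduced with a short script. This is a sketch under the assumptions that match the worked values: raw term counts for tf, a base-10 logarithm for idf, and whitespace tokenization. The variable and function names are chosen for the example.

```python
import math
from collections import Counter

docs = {
    "D1": "new straits times",
    "D2": "new straits daily",
    "D3": "north borneo times",
}

# Alphabetical vocabulary: borneo, daily, new, north, straits, times.
vocab = sorted({term for text in docs.values() for term in text.split()})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for text in docs.values() for term in set(text.split()))
idf = {term: math.log10(n_docs / df[term]) for term in vocab}

def tfidf_vector(text):
    """Raw term frequency multiplied by idf, in vocabulary order."""
    tf = Counter(text.split())
    return [tf[term] * idf[term] for term in vocab]

weights = {name: tfidf_vector(text) for name, text in docs.items()}
for name, vector in weights.items():
    print(name, [round(weight, 3) for weight in vector])
```

Terms that occur in two of the three documents ("new", "straits", "times") receive the lower weight 0.176, while terms unique to one document receive 0.477.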

36


Documents are ranked by the distance between the points representing the query and the documents

A similarity measure is more common than a distance or dissimilarity measure, e.g. cosine correlation

37

Retrieval Function (Vector Space Model)

Consider two documents D1, D2 and a query Q

Q = “straits times”

Compare against collection, D1 = “new straits times”

(borneo, daily, new, north, straits, times)

Q = (0, 0, 0, 0, 0.176, 0.176)

D1 = (0, 0, 0.176, 0, 0.176, 0.176)

D2 = (0, 0.477, 0.176, 0, 0.176, 0)

\[
\mathrm{Cosine}(D_1, Q) = \frac{0 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0 + 0 \cdot 0 + 0.176 \cdot 0.176 + 0.176 \cdot 0.176}{\sqrt{(0.176^2 + 0.176^2 + 0.176^2)\,(0.176^2 + 0.176^2)}} = 0.816
\]

Find Cosine(D2, Q).

Which document is more relevant?
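For checking the exercise, the cosine measure can be sketched as a small function applied to the vectors given above. The function name `cosine` and the zero-vector guard are choices made for this example.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no weighted term
    return dot / (norm_a * norm_b)

# Term order: borneo, daily, new, north, straits, times.
q = [0, 0, 0, 0, 0.176, 0.176]
d1 = [0, 0, 0.176, 0, 0.176, 0.176]
d2 = [0, 0.477, 0.176, 0, 0.176, 0]

print("Cosine(D1, Q) =", round(cosine(d1, q), 3))
print("Cosine(D2, Q) =", round(cosine(d2, q), 3))
```

Running the script prints both scores; the document with the larger cosine is ranked first.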

38

Evaluation

Evaluating the retrieval function, preprocessing steps, etc. is a must.

Standard Collection
• Task specific
• Human experts are used to judge relevant results.

Performance Metric
• Precision
• Recall

39

Evaluation (Collection)

Test collections consisting of documents, queries, and relevance judgments, e.g.,

40

Evaluation (Collection)

Example query and narrative for a gold standard.

41

Evaluation (Effectiveness Measures)

A is the set of relevant documents, B is the set of retrieved documents

Recall = |A ∩ B| / |A|

Precision = |A ∩ B| / |B|

42

Evaluation (Ranking Effectiveness)

43

Evaluation (Ranking Effectiveness)

Recall@4 = 3/4 Precision@4 = 3/4

Recall@2 = 2/4 Precision@2 = 2/2
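Precision and recall at a rank cut-off k can be sketched by applying the set-based definitions to the top k results. The ranking figure from the slide is not available here, so the document identifiers and the ranking below are hypothetical, chosen only so that they reproduce the values quoted above (four relevant documents, three of them in the top four).

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall; relevant is A, retrieved is B."""
    hits = len(set(relevant) & set(retrieved))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_recall_at_k(relevant, ranking, k):
    """Evaluate only the top k documents of a ranked list."""
    return precision_recall(relevant, ranking[:k])

# Hypothetical judgments and ranking (not taken from the slide figure).
relevant = {"d1", "d2", "d4", "d9"}
ranking = ["d1", "d2", "d3", "d4", "d5"]

for k in (2, 4):
    precision, recall = precision_recall_at_k(relevant, ranking, k)
    print(f"Precision@{k} = {precision:.2f}, Recall@{k} = {recall:.2f}")
```

Moving the cut-off from 2 to 4 raises recall but lowers precision, which is the usual trade-off between the two measures.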

44

Challenges

Social Texts, e.g. Tweets, Posts

Hard question. Hard Disk?

Named Entity

Various levels and aspects of annotations

45

Challenges

Small Data
• Specific search
• Improve semantics extensively

Big Data

Multi-modal retrieval

Connecting many media

46

Case: Adding Semantics Bibliography

Improve Search Results Display

Facet-based semantic

Useful Terms

Demo: ir.cs.usm.my

THANK YOU

49
