Introduction to Information Retrieval
Artificial Intelligence
Avi Rosenfeld

Copyright © Victor Lavrenko, Hopkins IR Workshop 2005

What Is Information Retrieval?

• Searching for information on the Internet
  – "Rabbi Dr. Google" (Bing?)
  – A flourishing and important research field...
• Searching for relevant information across many sources (websites, images, etc.)


What Is the Difference Between Queries in Databases and in Information Retrieval?

            Databases                               Information Retrieval
Data        Structured (key)                        Unstructured (web)
Fields      Clear semantics (SSN, age)              No fields (other than text)
Queries     Defined (relational algebra, SQL)       Free text (“natural language”), Boolean
Matching    Exact (results are always “correct”)    Imprecise (need to measure effectiveness)


The Information Retrieval System

[Figure: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]


Ranking Results

• Early IR focused on set-based retrieval
  – Boolean queries, set of conditions to be satisfied
  – document either matches the query or not
    • like classifying the collection into relevant / non-relevant sets
  – still used by professional searchers
  – “advanced search” in many systems
• Modern IR: ranked retrieval
  – free-form query expresses user’s information need
  – rank documents by decreasing likelihood of relevance
  – many studies have found it more effective than set-based retrieval

Examples

Stages of a Search Engine

• Building the data repository (web crawler)
• Building the indexes (indexing)
  – Cleaning the data of duplicates, stemming
• Building the answer
  – Processing the query (removing stop words)
  – Ranking the results (PageRank)
• Analyzing the results
  – False positive / false negative
  – Recall / precision

Indexing Process


• Text acquisition
  – identifies and stores documents for indexing
• Text transformation
  – transforms documents into index terms or features
• Index creation
  – takes index terms and creates data structures (indexes) to support fast searching

Web Crawler
  – Identifies and acquires documents for search engine
  – http://en.wikipedia.org/wiki/Web_crawler

• A web crawler is a type of bot or program that scans the WWW automatically and systematically.
• A selection policy defines which pages to download.
• A revisit policy defines when to check pages for changes.
• A politeness policy defines how to avoid overloading sites and bringing down their servers.
• A parallelization policy defines how to coordinate between the different crawlers.


Text Acquisition
• Feeds
  – Real-time streams of documents
    • e.g., web feeds for news, blogs, video, radio, tv
  – RSS is common standard
    • RSS “reader” can provide new XML documents to search engine
• Conversion
  – Convert variety of documents into a consistent text plus metadata format
    • e.g. HTML, XML, Word, PDF, etc. → XML
  – Convert text encoding for different languages
    • Using a Unicode standard like UTF-8

• (http://www.search-engines-book.com/slides/)


Controlling Crawling
• Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
• Robots.txt file can be used to control crawlers
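As a rough illustration of how a crawler might honor robots.txt, the sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the URLs are invented for the example; a real crawler would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
# A polite crawler checks every URL before requesting it.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")
# Seconds to wait between requests, if the site asks for a delay.
delay = parser.crawl_delay("ExampleBot")
```

Checking `can_fetch` and sleeping for `crawl_delay` between requests is one simple way to implement the politeness policy mentioned earlier.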

Simple Crawler Thread

Compression

• Text is highly redundant (or predictable)
• Compression techniques exploit this redundancy to make files smaller without losing any of the content
• Compression of indexes covered later
• Popular algorithms can compress HTML and XML text by 80%
  – e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
  – may compress large files in blocks to make access faster
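To get a feel for how much redundancy markup contains, the following sketch compresses a synthetic HTML fragment with DEFLATE through Python's `zlib` module. The fragment is made up and far more repetitive than a typical page, so the saving it shows is only illustrative.

```python
import zlib

def compression_saving(data: bytes) -> float:
    """Fraction of the original size saved by DEFLATE at the highest level."""
    compressed = zlib.compress(data, 9)
    return 1 - len(compressed) / len(data)

# Synthetic, highly repetitive HTML (an assumption for the example).
html = b"<ul>" + b"".join(
    b"<li class='item'>entry %d</li>" % i for i in range(500)
) + b"</ul>"

saving = compression_saving(html)
# Compression is lossless: decompressing returns the original bytes.
restored = zlib.decompress(zlib.compress(html, 9))
```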

Duplicate Detection
• Exact duplicate detection is relatively easy
• Checksum techniques
  – A checksum is a value that is computed based on the content of the document
    • e.g., sum of the bytes in the document file
  – Possible for files with different text to have same checksum
• Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
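The difference between a plain byte sum and a position-sensitive checksum can be seen in a small sketch. `byte_sum_checksum` is a hypothetical helper written for this example; `zlib.crc32` is the standard library's CRC-32.

```python
import zlib

def byte_sum_checksum(data: bytes) -> int:
    """Naive checksum: sum of the bytes, ignoring their order."""
    return sum(data) % 2**32

# Two different texts made of the same bytes in a different order.
doc_a, doc_b = b"stop", b"pots"

same_sum = byte_sum_checksum(doc_a) == byte_sum_checksum(doc_b)
# CRC depends on byte positions, so the reordering changes the value.
same_crc = zlib.crc32(doc_a) == zlib.crc32(doc_b)
```

Here the byte sums collide while the CRC values differ, which is why position-aware functions are preferred for duplicate detection.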

Removing Noise

• Many web pages contain text, links, and pictures that are not directly related to the main content of the page
• This additional material is mostly noise that could negatively affect the ranking of the page
• Techniques have been developed to detect the content blocks in a web page
  – Non-content material is either ignored or reduced in importance in the indexing process

Noise Example

Text Transformation
• Parser
  – Processing the sequence of text tokens in the document to recognize structural elements
    • e.g., titles, links, headings, etc.
  – Tokenizer recognizes “words” in the text
    • must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
  – Markup languages such as HTML, XML often used to specify structure and metatags
    • Tags used to specify document elements
      – E.g., <h2> Overview </h2>
    • Document parser uses syntax of markup language (or other formatting) to identify structure
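A minimal tokenizer sketch shows how the issues listed above turn into concrete decisions. The choices made here (lowercase everything, keep internal hyphens and apostrophes, split on every other non-alphanumeric character) are assumptions for illustration, not the rules of any particular search engine.

```python
import re

# A token is a run of letters/digits, optionally joined by single
# internal hyphens or apostrophes (e.g. "state-of-the-art", "o'neill's").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("Mr. O'Neill's state-of-the-art U.S.A.")
```

Note that this policy splits "U.S.A." into three single-letter tokens, which is exactly the kind of side effect a production tokenizer has to decide how to handle.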

Index Creation
• Document Statistics
  – Gathers counts and positions of words and other features
  – Used in ranking algorithm
• Weighting
  – Computes weights for index terms
  – Used in ranking algorithm
  – e.g., tf.idf weight
    • Combination of term frequency in document and inverse document frequency in the collection

Zipf’s Law
• Distribution of word frequencies is very skewed
  – a few words occur very often, many words hardly ever occur
  – e.g., two most common words (“the”, “of”) make up about 10% of all word occurrences in text documents
• Zipf’s “law”:
  – observation that rank (r) of a word times its frequency (f) is approximately a constant (k)
    • assuming words are ranked in order of decreasing frequency
  – i.e., r·f ≈ k or r·Pr ≈ c, where Pr is probability of word occurrence and c ≈ 0.1 for English
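One way to check the r·Pr ≈ c observation on a collection is to tabulate rank times probability for each word. The sketch below does this for a token list; the tiny input is a toy example, so its values will not be near 0.1, which is only expected on large English corpora.

```python
from collections import Counter

def rank_times_probability(tokens):
    """Return (rank, word, frequency, rank * probability) rows,
    with words ranked by decreasing frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq / total))
    return rows

# Toy input (an assumption for the example).
rows = rank_times_probability("the the the the of of a".split())
```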


Top 50 Words from AP89


Term Frequency (TF)
• Observation:
  – key words tend to be repeated in a document
• Modify our similarity measure:
  – give more weight if word occurs multiple times

$$ sim(D, Q) = \sum_{q \in Q} tf_D(q) $$

• Problem:
  – biased towards long documents
  – spurious occurrences
  – normalize by length:

$$ sim(D, Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} $$


Inverse Document Frequency (IDF)

• Observation:
  – rare words carry more meaning: cryogenic, apollo
  – frequent words are linguistic glue: of, the, said, went
• Modify our similarity measure:
  – give more weight to rare words
    … but don’t be too aggressive (why?)
  – df(q) … total number of documents that contain q
  – |C| … total number of documents in the collection

$$ sim(D, Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} \cdot \log \frac{|C|}{df(q)} $$


tf*idf weighting schema

• tf = term frequency
  – frequency of a term/keyword in a document
  – The higher the tf, the higher the importance (weight) for the doc.
• df = document frequency
  – no. of documents containing the term
  – distribution of the term
• idf = inverse document frequency
  – the unevenness of term distribution in the corpus
  – the specificity of term to a document
  – The more the term is distributed evenly, the less it is specific to a document

weight(t,D) = tf(t,D) * idf(t)
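The weighting can be sketched in a few lines, assuming tf is normalized by document length and idf is log(|C| / df), as in the similarity measure given earlier. The function name and the three-document collection are made up for illustration.

```python
import math
from collections import Counter

def tfidf_similarity(query_terms, doc_tokens, collection):
    """Sum over query terms of (tf / |D|) * log(|C| / df)."""
    tf = Counter(doc_tokens)
    n_docs = len(collection)
    score = 0.0
    for q in query_terms:
        df = sum(1 for doc in collection if q in doc)
        if df == 0 or tf[q] == 0:
            continue  # term absent from the collection or from this document
        score += (tf[q] / len(doc_tokens)) * math.log(n_docs / df)
    return score

# Toy collection of tokenized documents (an assumption for the example).
collection = [
    ["apollo", "mission", "the", "moon"],
    ["the", "cat", "said", "the", "dog"],
    ["the", "cryogenic", "tank", "of", "apollo"],
]
query = ["apollo", "the"]
scores = [tfidf_similarity(query, doc, collection) for doc in collection]
```

Because "the" appears in every document, its idf is log(1) = 0 and it contributes nothing; only "apollo" separates the documents.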


Stopwords / Stoplist

• function words do not bear useful information for IR
  of, in, about, with, I, although, …
• Stoplist: contains stopwords, not to be used as index
  – Prepositions
  – Articles
  – Pronouns
  – Some adverbs and adjectives
  – Some frequent words (e.g. document)
• The removal of stopwords usually improves IR effectiveness
• A few “standard” stoplists are commonly used.


Stemming

• Reason:
  – Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming:
  – Removing some endings of word
    computer, compute, computes, computing, computed, computation → comput

Stemming
• Generally a small but significant effectiveness improvement
  – can be crucial for some languages
  – e.g., 5-10% improvement for English, up to 50% in Arabic

Words with the Arabic root ktb

Stemming

• Two basic types
  – Dictionary-based: uses lists of related words
  – Algorithmic: uses program to determine related words
• Algorithmic stemmers
  – suffix-s: remove ‘s’ endings assuming plural
    • e.g., cats → cat, lakes → lake, wiis → wii
    • Many false negatives: supplies → supplie
    • Some false positives: ups → up
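The suffix-s stemmer is simple enough to write out in full. This sketch follows the rule as stated (strip one trailing "s"); the length guard is an added assumption so that a lone "s" is left alone.

```python
def suffix_s_stem(word: str) -> str:
    """Remove a single trailing 's', assuming it marks a plural."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word

stems = {w: suffix_s_stem(w) for w in ["cats", "lakes", "wiis", "supplies", "ups"]}
```

The results reproduce the errors listed above: "supplies" becomes "supplie", which no longer matches "supply", and "ups" is conflated with "up".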


Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)

• Step 1: plurals and past participles
  – SSES -> SS                  caresses -> caress
  – (*v*) ING ->                motoring -> motor
• Step 2: adj->n, n->v, n->adj, …
  – (m>0) OUSNESS -> OUS        callousness -> callous
  – (m>0) ATIONAL -> ATE        relational -> relate
• Step 3:
  – (m>0) ICATE -> IC           triplicate -> triplic
• Step 4:
  – (m>1) AL ->                 revival -> reviv
  – (m>1) ANCE ->               allowance -> allow
• Step 5:
  – (m>1) E ->                  probate -> probat
  – (m > 1 and *d and *L) -> single letter    controll -> control
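The conditions in parentheses refer to Porter's measure m, the number of vowel-consonant sequences in the stem. The sketch below computes m and applies a handful of the rules listed above in a single pass. It is a simplified illustration, not the full algorithm: the real stemmer applies its steps in sequence and has many more rules, and the `RULES` table and function names here are invented for the example.

```python
VOWELS = set("aeiou")

def is_consonant(word: str, i: int) -> bool:
    """'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Number of vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    kinds = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "v" and b == "c")

# (suffix, replacement, m must be greater than this value)
RULES = [
    ("sses", "ss", -1),      # no condition on m
    ("ousness", "ous", 0),
    ("ational", "ate", 0),
    ("icate", "ic", 0),
    ("ance", "", 1),
    ("al", "", 1),
]

def apply_rules(word: str) -> str:
    """Apply the first rule whose suffix matches, if its condition holds."""
    for suffix, replacement, min_m in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > min_m else word
    return word
```

For instance, `apply_rules("allowance")` strips "ance" because the stem "allow" has m = 2.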

N-Grams

• Frequent n-grams are more likely to be meaningful phrases
• N-grams form a Zipf distribution
  – Better fit than words alone
• Could index all n-grams up to specified length
  – Much faster than POS tagging
  – Uses a lot of storage
    • e.g., document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
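The storage figure can be verified directly: a document of 1,000 words has 999 bigrams, 998 trigrams, 997 four-grams and 996 five-grams, which sum to 3,990. A short sketch (the function name is an assumption) that enumerates word n-grams:

```python
def word_ngrams(tokens, min_n=2, max_n=5):
    """All word n-grams with min_n <= n <= max_n, as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

# A stand-in 1,000-word document; only its length matters here.
count = len(word_ngrams(["w"] * 1000))
small = word_ngrams(["tropical", "fish", "tank"], 2, 3)
```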

Google N-Grams

• Web search engines index n-grams
• Google sample:
• Most frequent trigram in English is “all rights reserved”
  – In Chinese, “limited liability corporation”

Document Structure and Markup

• Some parts of documents are more important than others

• Document parser recognizes structure using markup, such as HTML tags
  – Headers, anchor text, bolded text all likely to be important
  – Metadata can also be important
  – Links used for link analysis

Example Web Page

Link Analysis

• Links are a key component of the Web
• Important for navigation, but also for search
  – e.g., <a href="http://example.com" >Example website</a>
  – “Example website” is the anchor text
  – “http://example.com” is the destination link
  – both are used by search engines
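Extracting the anchor text and destination link from markup can be sketched with Python's standard `html.parser`. The `AnchorExtractor` class and `extract_links` helper are hypothetical names for this example, and the sketch ignores complications such as nested tags or relative URLs.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects (destination link, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html: str):
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links

links = extract_links('<a href="http://example.com" >Example website</a>')
```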


PageRank in Google

• Assign a numeric value to each page
• The more a page is referred to by important pages, the more this page is important

$$ PR(A) = (1 - d) + d \sum_{i} \frac{PR(I_i)}{C(I_i)} $$

  where I_1, I_2, … are the pages that link to A and C(I_i) is the number of outgoing links of I_i

• d: damping factor (0.85)
• Many other criteria: e.g. proximity of query words
  – “…information retrieval …” better than “… information … retrieval …”
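The PageRank formula PR(A) = (1 - d) + d · Σ PR(I_i) / C(I_i) can be evaluated by simple iteration. The sketch below is a minimal version under stated assumptions: every page has at least one outgoing link, all scores start at 1.0, and a fixed number of iterations stands in for a convergence test. The three-page graph is made up for illustration.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """out_links maps each page to the list of pages it links to."""
    pr = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_pr = {}
        for page in out_links:
            # Each in-linking page shares its score equally among its out-links.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this graph C ends up with the highest score because it is linked to by both other pages, and A comes second because it receives all of C's score.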

PageRank
• Billions of web pages, some more informative than others
• Links can be viewed as information about the popularity (authority?) of a web page
  – can be used by ranking algorithm
• Inlink count could be used as simple measure
• Link analysis algorithms like PageRank provide more reliable ratings
  – less susceptible to link spam

Indexes

• Indexes are data structures designed to make search faster

• Text search has unique requirements, which lead to unique data structures
• Most common data structure is inverted index
  – general name for a class of structures
  – “inverted” because documents are associated with words, rather than words with documents
    • similar to a concordance

Indexes and Ranking

• Indexes are designed to support search
  – faster response time, supports updates
• Text search engines use a particular form of search: ranking
  – documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm
• What is a reasonable abstract model for ranking?
  – enables discussion of indexes without details of retrieval model

Abstract Model of Ranking

Inverted Index

• Each index term is associated with an inverted list
  – Contains lists of documents, or lists of word occurrences in documents, and other information
  – Each entry is called a posting
  – The part of the posting that refers to a specific document or location is called a pointer
  – Each document in the collection is given a unique number
  – Lists are usually document-ordered (sorted by document number)
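A simple inverted index of this kind can be built in a few lines. In this sketch each posting is just a document number, documents are numbered from 1 in the order given, and whitespace splitting stands in for a real tokenizer; all three are assumptions for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a document-ordered list of document numbers."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        # set() so a document appears once per term, however often it occurs
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(["tropical fish", "fish tank", "tropical fish tank"])
```

Because documents are processed in increasing number order, every inverted list comes out document-ordered without an explicit sort.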


Inverted List Information to be Published

Word (key)  Address1  Address2  Address3  Address4  Address5  Address6
a           111-1111  111-1112  111-1113  111-1114  111-1115  111-1116
aardvark    111-4323
the         111-1111  111-1112  111-1113  111-1114  111-1115  111-1116
zoo         123-4214  123-9714  333-9714
zygote      548-4342

Simple Inverted Index

Inverted Index with counts

• supports better ranking algorithms

Inverted Index with positions

• supports proximity matches

Proximity Matches

• Matching phrases or words within a window
  – e.g., "tropical fish", or “find tropical within 5 words of fish”
• Word positions in inverted lists make these types of query features efficient
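To show why stored positions make proximity queries cheap, here is a rough sketch of a positional index with a phrase match and a window match. The function names and the three toy documents are assumptions; positions are counted from 1 and whitespace splitting stands in for a tokenizer.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {document number -> list of word positions}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs, start=1):
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def within_window(index, first, second, window):
    """Documents where the two terms occur within `window` words of each other."""
    a, b = index.get(first, {}), index.get(second, {})
    return [
        doc_id
        for doc_id in sorted(set(a) & set(b))
        if any(abs(p2 - p1) <= window for p1 in a[doc_id] for p2 in b[doc_id])
    ]

def phrase_match(index, first, second):
    """Documents where `second` immediately follows `first`."""
    a, b = index.get(first, {}), index.get(second, {})
    return [
        doc_id
        for doc_id in sorted(set(a) & set(b))
        if any(p + 1 in b[doc_id] for p in a[doc_id])
    ]

docs = [
    "tropical fish live in warm water",
    "fish from tropical regions",
    "tropical plants need light and water unlike most fish",
]
positional = build_positional_index(docs)
near = within_window(positional, "tropical", "fish", 5)
phrase = phrase_match(positional, "tropical", "fish")
```

Only the inverted lists of the two query terms are touched; the documents themselves are never rescanned.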

Query Process


• User interaction
  – supports creation and refinement of query, display of results
• Ranking
  – uses query and indexes to generate ranked list of documents
• Evaluation
  – monitors and measures effectiveness and efficiency (primarily offline)

Evaluation – False Positive / Negative

                            Predicted Label
                            Positive (A)           Negative (B)
Known Label  Positive (A)   True Positive (TP)     False Negative (FN)
             Negative (B)   False Positive (FP)    True Negative (TN)

Definitions

Measure      Formula                          Intuitive Meaning
Precision    TP / (TP + FP)                   The percentage of positive predictions that are correct.
Recall       TP / (TP + FN)                   The percentage of positive labeled instances that were predicted as positive.
Specificity  TN / (TN + FP)                   The percentage of negative labeled instances that were predicted as negative.
Accuracy     (TP + TN) / (TP + TN + FP + FN)  The percentage of predictions that are correct.

Example

                            Predicted Label
                            Positive (A)    Negative (B)
Known Label  Positive (A)   500             100
             Negative (B)   500             10,000

Precision = 50% (500/1000)
Recall = 83% (500/600)
Accuracy = 95% (10500/11100)
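The arithmetic of this example can be reproduced with a short function. The sketch uses FN = 100, the value implied by the stated recall of 500/600 and the accuracy denominator of 11,100; the function name is an assumption.

```python
def evaluation_measures(tp, fn, fp, tn):
    """Compute the four measures from the confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Counts from the example: TP = 500, FN = 100, FP = 500, TN = 10,000.
measures = evaluation_measures(tp=500, fn=100, fp=500, tn=10_000)
```

Accuracy is high mostly because true negatives dominate the collection, which is why precision and recall are usually the more informative measures in retrieval.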


General form of precision/recall

[Figure: precision (vertical axis, up to 1.0) plotted against recall (horizontal axis, up to 1.0)]

- Precision changes w.r.t. Recall (not a fixed point)
- Systems cannot be compared at a single Precision/Recall point
- Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)
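One common way to compute the 11-point average is to interpolate: at each recall level, take the highest precision reached at that recall or any higher one. The slide does not spell out the interpolation rule, so that convention is an assumption here, as are the function name and the toy ranking.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """ranked_relevance: one boolean per ranked result, best rank first."""
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += bool(is_relevant)
        points.append((hits / total_relevant, hits / rank))

    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)

# Toy ranking: relevant documents at ranks 1 and 3, two relevant in total.
avg = eleven_point_average_precision([True, False, True, False, False], 2)
```

For this ranking the interpolated precision is 1.0 at recall levels 0.0 to 0.5 and 2/3 at levels 0.6 to 1.0.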

Effectiveness Measures
A is set of relevant documents, B is set of retrieved documents

Classification Errors

• False Positive (Type I error)
  – a non-relevant document is retrieved
• False Negative (Type II error)
  – a relevant document is not retrieved
  – 1 - Recall
• Precision is used when probability that a positive result is correct is important


Evaluation Metrics (set-based)
• Precision
  – Proportion of a retrieved set that is relevant
  – Precision = |relevant ∩ retrieved| ÷ |retrieved| = P( relevant | retrieved )
• Recall
  – proportion of all relevant documents in the collection included in the retrieved set
  – Recall = |relevant ∩ retrieved| ÷ |relevant| = P( retrieved | relevant )
• Single-number measures:
  – F1 = 2PR / (P+R) … harmonic mean of precision and recall
  – Breakeven: (rank-based) point where Recall = Precision
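The set-based definitions translate directly into set operations. In this sketch documents are identified by arbitrary numbers and the function name is an assumption; empty sets are mapped to 0.0 to avoid division by zero.

```python
def set_based_metrics(relevant: set, retrieved: set):
    """Return (precision, recall, F1) for a retrieved set."""
    overlap = len(relevant & retrieved)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example: 4 relevant documents, 3 retrieved, 2 in common.
precision, recall, f1 = set_based_metrics({1, 2, 3, 4}, {3, 4, 5})
```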

Caching

• Query distributions similar to Zipf
  – About ½ each day are unique, but some are very popular
• Caching can significantly improve efficiency
  – Cache popular query results
  – Cache common inverted lists

• Inverted list caching can help with unique queries

• Cache must be refreshed to prevent stale data
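A query-result cache with both ideas from this slide, keeping popular results and refreshing stale ones, might look like the sketch below. The `QueryCache` class, its parameters, and the least-recently-used eviction policy are all assumptions for illustration; `search_fn` stands for whatever function actually runs the query against the index.

```python
import time
from collections import OrderedDict

class QueryCache:
    """Least-recently-used cache whose entries expire after ttl_seconds."""

    def __init__(self, search_fn, max_entries=1000, ttl_seconds=300.0,
                 clock=time.monotonic):
        self._search_fn = search_fn
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # query -> (time stored, results)

    def search(self, query):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            self._entries.move_to_end(query)  # mark as recently used
            return entry[1]
        # Miss or stale entry: run the real search and store fresh results.
        results = self._search_fn(query)
        self._entries[query] = (now, results)
        self._entries.move_to_end(query)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return results
```

Wrapping the engine as `QueryCache(run_query).search("tropical fish")`, where `run_query` is a hypothetical search function, would hit the index only on the first request within the time-to-live.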

Search Engine Optimization (SEO)