
Page 1: 3 Understanding Search

Understanding Search Engines

Page 2: 3 Understanding Search

Basic Definitions: Search Engine

Search engines are information retrieval (IR) systems designed to help find specific information stored in digital server and database systems.

Search engines are meant to minimize both the time required to find information, and the amount of information which must be searched.

Page 3: 3 Understanding Search

Our focus is on Web Information Retrieval, not traditional IR.

· Web IR means “search within the world’s largest, linked document collection.”

· This document collection is growing at a rate that is almost impossible to know.

· Links arise and disappear at an unknown rate.

Page 4: 3 Understanding Search

Methods of IR and Search

· Boolean Search

· Vector Space Model Search

· Probabilistic Model Search

· Meta Search

Page 5: 3 Understanding Search

Boolean Search

· One of the earliest and simplest computerized IR methods.

· Applies Boolean algebraic operations (AND, OR, NOT) to user keywords.

· AND = both x and y satisfied (intersection, ∩)

· OR = either x or y satisfied (union, ∪)

· NOT = only x, not y (set difference, ∖)
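A minimal sketch of these operations in Python, using a toy inverted index (the documents, terms, and IDs here are hypothetical, for illustration only):

```python
# Boolean retrieval sketch: each term maps to the set of document IDs
# containing it, so AND/OR/NOT become plain set operations.
# The index contents are hypothetical, for illustration only.
index = {
    "car":         {1, 2, 5},
    "maintenance": {2, 3},
    "auto":        {4, 5},
}

def AND(x, y):  # both conditions: intersection
    return index.get(x, set()) & index.get(y, set())

def OR(x, y):   # either condition: union
    return index.get(x, set()) | index.get(y, set())

def NOT(x, y):  # only x, not y: set difference
    return index.get(x, set()) - index.get(y, set())

print(AND("car", "maintenance"))  # {2}
print(OR("car", "auto"))          # {1, 2, 4, 5}
print(NOT("car", "maintenance"))  # {1, 5}
```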

Page 6: 3 Understanding Search

Boolean Search 2

+’s

· Simple. Fast. Manageable.

—’s

· Simplistic: car + maintenance ≠ auto care (polysemy and synonymy).

· Assumes the user has strong familiarity with the topic domain.

· Limited; best used for specific topics with a small vocabulary.

Page 7: 3 Understanding Search

Vector Space Model Search

· Developed in the early 1960s by Gerard Salton.

· Transforms text into numeric vectors and matrices, then uses matrix analysis techniques to discern features and semantic relationships.
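A toy sketch of the idea, assuming a hypothetical three-document corpus: documents become TF-IDF vectors, and cosine similarity against the query vector yields a ranked result list.

```python
# Vector space model sketch: turn documents into TF-IDF vectors and
# rank them by cosine similarity to the query vector.
# The three-document corpus is hypothetical, for illustration only.
import math
from collections import Counter

docs = ["car maintenance guide",
        "auto care and repair",
        "history of the automobile"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
N = len(docs)

def idf(term):
    df = sum(term in doc for doc in tokenized)  # document frequency
    return math.log(N / df) if df else 0.0

def tfidf(tokens):
    tf = Counter(tokens)
    return [tf[t] * idf(t) for t in vocab]  # terms outside vocab are dropped

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_vecs = [tfidf(doc) for doc in tokenized]
query_vec = tfidf("car repair".split())

# Relevance scores come out automatically, so results arrive ranked.
for score, text in sorted(zip((cosine(query_vec, v) for v in doc_vecs), docs),
                          reverse=True):
    print(f"{score:.3f}  {text}")
```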

Page 8: 3 Understanding Search

Vector Space Model 2

+’s

· Incredibly powerful tool for keeping track of evolving meanings and shifting vocabularies.

· Automatically produces relevance scores, thereby returning ranked search results.

—’s

· Computationally intense; requires massive computing power and cannot scale up to deal with massive (web-sized) document sets.

Page 9: 3 Understanding Search

Probabilistic Model Search

Uses a probability model to guess which documents a user will find relevant. The key to this model’s effectiveness is the set of initial conditions.

One of the most powerful initial conditions is an index of a user’s search history/search tendency.

Another initial condition is the search term. Some powerful search algorithms begin by broadening the search terms to include conceptually related documents.

Most appropriate for enterprises where complete understanding of an evolving topic domain or wordspace is mission critical (e.g., Grapeshot).
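The slides don’t name a particular algorithm, but BM25 is the classic probabilistic ranking function; here is a minimal sketch on a hypothetical corpus, ignoring search-history conditioning.

```python
# BM25 sketch: scores a document for a query using term frequency,
# inverse document frequency, and length normalization.
# The corpus is hypothetical; k1 and b are standard default parameters.
import math
from collections import Counter

docs = ["car maintenance schedule",
        "auto care and auto repair",
        "probabilistic models of relevance"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
k1, b = 1.5, 0.75

def bm25(query, tokens):
    tf = Counter(tokens)
    score = 0.0
    for term in query.split():
        df = sum(term in d for d in tokenized)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
        score += idf * num / den
    return score

for text, toks in zip(docs, tokenized):
    print(f"{bm25('auto repair', toks):.3f}  {text}")
```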

Page 10: 3 Understanding Search

Probabilistic Model 2

+’s

· Very powerful tool. Uses evolving meanings and shifting vocabularies to expand the search vectors.

· Cutting edge. This is the area of greatest research interest, and greatest value generation. In other words, this is where the money is.

—’s

· When there is no history, you have to start with assumptions; that can be devastating to relevance.

· Very hard to build, and therefore very expensive. Like, unbelievably expensive. Megabucks.

Page 11: 3 Understanding Search

Meta Search

If one search engine is good (but has drawbacks), why not combine them?

That’s a MetaSearch engine.

Queries are sent to multiple engines, or multiple processors.

As you would expect, this can be very accurate, but very slow.

When they’re wrong, they’re monumentally wrong.
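A sketch of the merge step, with the individual engines stubbed out. Real metasearch would call live engines; reciprocal rank fusion (RRF) is one common way to combine ranked lists, though the slides don’t specify a method.

```python
# Metasearch sketch: fan one query out to several engines and fuse the
# ranked lists with reciprocal rank fusion (RRF). The engines here are
# hypothetical stubs returning canned results.
def engine_a(query):
    return ["url1", "url2", "url3"]  # stub: pretend ranked results

def engine_b(query):
    return ["url2", "url4", "url1"]  # stub: pretend ranked results

def metasearch(query, engines, k=60):
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            # Earlier ranks contribute more; k damps the tail.
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(metasearch("car maintenance", [engine_a, engine_b]))
```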

Page 12: 3 Understanding Search

To make the perfect web search engine, you must deal with the web’s externalities:

1. You will have to search through the largest document set in the known universe.

2. That document set is changing.

3. The set is self-organizing; or, more accurately, the set is completely disorganized.

4. It is hyperlinked.

Page 13: 3 Understanding Search

The perfect web search engine: A Huge Document Set

The web is, in fact, too big to accurately measure.

JAN 2004: 10,000,000,000+ pages

FEB 2007: 25,000,000,000+ pages

(These are surface-web counts; the Deep Web is not included.)

Page 14: 3 Understanding Search

The perfect web search engine: A Changing Document Set

Cho and Garcia-Molina, 2000. “The evolution of the Web and implications for an incremental crawler.” Proceedings of the 26th International Conference on Very Large Data Bases.

· 40% of pages in the sample changed within 7 days

· 23% changed within 24 hours

· Growth rate is unknown, but significant

Page 15: 3 Understanding Search

The perfect web search engine: A Self-Organizing Set

There are no standards for content, minimal control over structure, and no rules for formats. The data are volatile: subject to error, dishonesty, link-rot, and file disappearance.

Data exist in multiple formats; in duplicate; or they don’t exist until a specific request.

Data are re-created for many different uses and conditions (shopping, research, entertainment, way-finding).

Page 16: 3 Understanding Search

The perfect web search engine: A Hyperlinked Set

Thank God.

The availability of hyperlinks creates an additional layer of meaning. This also places the web document set into a relational framework that can be very accurately described using a branch of mathematics called topology.

Hyperlinks (the only new form of punctuation created in the last 500 years) allow us to do ranked searches.

Page 17: 3 Understanding Search

Designing a precise* search mechanism

1. Crawler Module

2. Page Repository

3. Indexing Module

4. Indexes

5. Query Module

6. Ranking Module

Page 18: 3 Understanding Search

The Pieces (Google style)

Page 19: 3 Understanding Search

The Crawler Module (CM)

A distributed system of software robots (bots, spiders) designed to examine and record the content and structure of pages within a defined domain.

· The CM gives bots root URLs to start from.

· Spiders consume resources! (bandwidth, quotas)

· Crawlers should conform to ethical crawling practices (robots.txt).
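A sketch of the robots.txt check using Python’s standard library (the URLs and the user-agent name are placeholders):

```python
# "Ethical crawling" sketch: check robots.txt before fetching a page,
# using Python's standard library. The URLs and the user-agent name
# ("MyCrawlerBot") are placeholders, not a real crawler.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the site's robots.txt

url = "https://example.com/some/page.html"
if rp.can_fetch("MyCrawlerBot", url):
    print("allowed to fetch:", url)  # a real bot would download the page here
else:
    print("robots.txt disallows:", url)
```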

Page 20: 3 Understanding Search

The Page Repository

Temporary storage for full page contents and link structure.

Valuable and popular pages can be stored for longer term.

Page 21: 3 Understanding Search

Indexing Module

· A software processor that applies a compression algorithm.

· For content, the algorithm generates an inverted file index.

· Also yields structure indexes and special-purpose indexes (for PDFs and video).

For example, a book-style inverted index maps each term to its locations:

Software, 2

Processor, 3

Compression, 7

Algorithm, 8, 12

Index, 17

Indexes, 21, 25
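A sketch of how an indexing module might build such an inverted file from raw text (toy documents, hypothetical):

```python
# Indexing sketch: build an inverted file mapping each term to the
# (document ID, position) pairs where it occurs. Toy corpus, hypothetical.
from collections import defaultdict

docs = {
    1: "software processor applies a compression algorithm",
    2: "the algorithm generates an inverted file index",
}

inverted = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        inverted[term].append((doc_id, pos))

print(inverted["algorithm"])  # [(1, 5), (2, 1)]
```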

Page 22: 3 Understanding Search

Indexes

Storage area for inverted files and other processed page results. These are the valuable assets of an Internet Search company.

Page 23: 3 Understanding Search

The Query Module

The software that handles user queries; it interacts with the ranking module, the indexes, and the page repository. It must be fast! In February 2003, Google reported serving 250,000,000 searches per day, about 2,894 queries per second (Langville & Meyer, 2006).

Page 24: 3 Understanding Search

The Ranking Module

The software that examines the hyperlink structure and calculates a page’s value.
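The module’s internals are deferred to next class, but a minimal sketch of the best-known link-based calculation, PageRank by power iteration, looks like this (tiny hypothetical link graph):

```python
# Ranking sketch: PageRank by power iteration on a tiny hypothetical
# link graph. This is the classic Brin/Page formulation, previewed on
# the next slide; the slides don't give the module's internals.
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
n = len(pages)
d = 0.85             # damping factor
rank = {p: 1.0 / n for p in pages}

for _ in range(50):  # iterate toward the stationary distribution
    new = {p: (1 - d) / n for p in pages}
    for p, outs in links.items():
        share = rank[p] / len(outs)   # each page splits its rank evenly
        for q in outs:
            new[q] += d * share
    rank = new

for p in pages:
    print(p, round(rank[p], 3))  # more/better inlinks -> higher score
```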

Page 25: 3 Understanding Search

3 Guys and 2 Theses

Sergey Brin, Larry Page, and Jon Kleinberg.

HITS and PageRank™

More on this next class.

Page 26: 3 Understanding Search

An Excellent History (the key reference text)

Amy Langville and Carl Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

Page 27: 3 Understanding Search

Questions & Discussion

Ask questions now.