
Page 1: 3 Understanding Search

Understanding Search Engines

Page 2: 3 Understanding Search

Basic Definitions: Search Engine

Search engines are information retrieval (IR) systems designed to help find specific information stored in digital server and database systems.

Search engines are meant to minimize both the time required to find information, and the amount of information which must be searched.

Page 3: 3 Understanding Search

Our focus is on Web Information Retrieval, not traditional IR.

· Web IR means “search within the world’s largest, linked document collection.”

· This document collection is growing at a rate that is almost impossible to know.

· Links arise and disappear at an unknown rate.

Page 4: 3 Understanding Search

Methods of IR and Search

· Boolean Search

· Vector Space Model Search

· Probabilistic Model Search

· Meta Search

Page 5: 3 Understanding Search

Boolean Search

· One of the earliest and simplest computerized IR methods.

· Applies Boolean algebraic operations (AND, OR, NOT) to user keywords.

· AND = both x and y satisfied (intersection, ∩)

· OR = either x or y satisfied (union, ∪)

· NOT = only x, not y (set difference, ∖)
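A minimal sketch of these operations in Python, using a toy inverted index (the documents, terms, and IDs here are hypothetical, for illustration only):

```python
# Boolean retrieval sketch: each term maps to the set of document IDs
# containing it, so AND/OR/NOT become plain set operations.
# The index contents are hypothetical, for illustration only.
index = {
    "car":         {1, 2, 5},
    "maintenance": {2, 3},
    "auto":        {4, 5},
}

def AND(x, y):  # both conditions: intersection
    return index.get(x, set()) & index.get(y, set())

def OR(x, y):   # either condition: union
    return index.get(x, set()) | index.get(y, set())

def NOT(x, y):  # only x, not y: set difference
    return index.get(x, set()) - index.get(y, set())

print(AND("car", "maintenance"))  # {2}
print(OR("car", "auto"))          # {1, 2, 4, 5}
print(NOT("car", "maintenance"))  # {1, 5}
```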

Page 6: 3 Understanding Search

Boolean Search 2

+’s

· Simple. Fast. Manageable.

—’s

· Simplistic: car + maintenance ≠ auto care (polysemy and synonymy).

· Assumes the user has strong familiarity with the topic domain.

· Limited; best used for specific topics with a small vocabulary.

Page 7: 3 Understanding Search

Vector Space Model Search

· Developed in the early 1960s by Gerard Salton.

· Transforms text into numeric vectors and matrices, then uses matrix analysis techniques to discern features and semantic relationships.
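A toy sketch of the idea, assuming a hypothetical three-document corpus: documents become TF-IDF vectors, and cosine similarity against the query vector yields a ranked result list.

```python
# Vector space model sketch: turn documents into TF-IDF vectors and
# rank them by cosine similarity to the query vector.
# The three-document corpus is hypothetical, for illustration only.
import math
from collections import Counter

docs = ["car maintenance guide",
        "auto care and repair",
        "history of the automobile"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
N = len(docs)

def idf(term):
    df = sum(term in doc for doc in tokenized)  # document frequency
    return math.log(N / df) if df else 0.0

def tfidf(tokens):
    tf = Counter(tokens)
    return [tf[t] * idf(t) for t in vocab]  # terms outside vocab are dropped

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_vecs = [tfidf(doc) for doc in tokenized]
query_vec = tfidf("car repair".split())

# Relevance scores come out automatically, so results arrive ranked.
for score, text in sorted(zip((cosine(query_vec, v) for v in doc_vecs), docs),
                          reverse=True):
    print(f"{score:.3f}  {text}")
```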

Page 8: 3 Understanding Search

Vector Space Model 2

+’s

· Incredibly powerful tool for keeping track of evolving meanings and shifting vocabularies.

· Automatically produces relevance scores, thereby returning ranked search results.

—’s

· Computationally intense; requires massive computing power and cannot scale up to deal with massive (web-sized) document sets.

Page 9: 3 Understanding Search

Probabilistic Model Search

Uses a probability model to guess which documents a user will find relevant. The key to this model’s effectiveness is the set of initial conditions.

One of the most powerful initial conditions is an index of a user’s search history/search tendency.

Another initial condition is the search term. Some powerful search algorithms begin by broadening the search terms to include conceptually related documents.

Most appropriate for enterprises where complete understanding of an evolving topic domain or wordspace is mission critical (e.g., Grapeshot).
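The slides don’t name a particular algorithm, but BM25 is the classic probabilistic ranking function; here is a minimal sketch on a hypothetical corpus, ignoring search-history conditioning.

```python
# BM25 sketch: scores a document for a query using term frequency,
# inverse document frequency, and length normalization.
# The corpus is hypothetical; k1 and b are standard default parameters.
import math
from collections import Counter

docs = ["car maintenance schedule",
        "auto care and auto repair",
        "probabilistic models of relevance"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
k1, b = 1.5, 0.75

def bm25(query, tokens):
    tf = Counter(tokens)
    score = 0.0
    for term in query.split():
        df = sum(term in d for d in tokenized)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
        score += idf * num / den
    return score

for text, toks in zip(docs, tokenized):
    print(f"{bm25('auto repair', toks):.3f}  {text}")
```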

Page 10: 3 Understanding Search

Probabilistic Model 2

+’s

· Very powerful tool. Uses evolving meanings and shifting vocabularies to expand the search vectors.

· Cutting edge. This is the area of greatest research interest, and greatest value generation. In other words, this is where the money is.

—’s

· When there is no history, you have to start with assumptions; that can be devastating to relevance.

· Very hard to build, and therefore very expensive. Like, unbelievably expensive. Megabucks.

Page 11: 3 Understanding Search

Meta Search

If one search engine is good (but has drawbacks), why not combine them?

That’s a MetaSearch engine.

Queries are sent to multiple engines, or multiple processors.

As you would expect, this can be very accurate, but very slow.

When they’re wrong, they’re monumentally wrong.
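A sketch of the merge step, with the individual engines stubbed out. Real metasearch would call live engines; reciprocal rank fusion (RRF) is one common way to combine ranked lists, though the slides don’t specify a method.

```python
# Metasearch sketch: fan one query out to several engines and fuse the
# ranked lists with reciprocal rank fusion (RRF). The engines here are
# hypothetical stubs returning canned results.
def engine_a(query):
    return ["url1", "url2", "url3"]  # stub: pretend ranked results

def engine_b(query):
    return ["url2", "url4", "url1"]  # stub: pretend ranked results

def metasearch(query, engines, k=60):
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            # Earlier ranks contribute more; k damps the tail.
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(metasearch("car maintenance", [engine_a, engine_b]))
```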

Page 12: 3 Understanding Search

To make the perfect web search engine, you must deal with the web’s externalities:

1. You will have to search through the largest document set in the known universe.

2. That document set is changing.

3. The set is self-organizing; or, more accurately, the set is completely disorganized.

4. It is hyperlinked.

Page 13: 3 Understanding Search

The perfect web search engine: A Huge Document Set

The web is, in fact, too big to accurately measure.

JAN 2004: 10,000,000,000+ pages

FEB 2007: 25,000,000,000+ pages

(These are surface-web counts; the Deep Web is not included.)

Page 14: 3 Understanding Search

The perfect web search engine: A Changing Document Set

Cho and Garcia-Molina, 2000. “The evolution of the Web and implications for an incremental crawler.” Proceedings of the 26th International Conference on Very Large Data Bases.

· 40% of pages in the sample changed within 7 days

· 23% changed within 24 hours

· Growth rate is unknown, but significant

Page 15: 3 Understanding Search

The perfect web search engine: A Self-Organizing Set

There are no standards for content, minimal control over structure, and no rules for formats. The data are volatile: subject to error, dishonesty, link-rot, and file disappearance.

Data exist in multiple formats; in duplicate; or they don’t exist until a specific request.

Data are re-created for many different uses and conditions (shopping, research, entertainment, way-finding).

Page 16: 3 Understanding Search

The perfect web search engine: A Hyperlinked Set

Thank God.

The availability of hyperlinks creates an additional layer of meaning. This also places the web document set into a relational framework that can be very accurately described using a branch of mathematics called topology.

Hyperlinks (the only new form of punctuation created in the last 500 years) allow us to do ranked searches.

Page 17: 3 Understanding Search

Designing a precise* search mechanism

1. Crawler Module

2. Page Repository

3. Indexing Module

4. Indexes

5. Query Module

6. Ranking Module

Page 18: 3 Understanding Search

The Pieces (Google style)

Page 19: 3 Understanding Search

The Crawler Module (CM)

A distributed system of software robots (bots, spiders) designed to examine and record the content and structure of pages within a defined domain.

· The CM gives bots root URLs to start from.

· Spiders consume resources! (bandwidth, quotas)

· Crawlers should conform to ethical crawling practices (robots.txt).
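A sketch of the robots.txt check using Python’s standard library (the URLs and the user-agent name are placeholders):

```python
# "Ethical crawling" sketch: check robots.txt before fetching a page,
# using Python's standard library. The URLs and the user-agent name
# ("MyCrawlerBot") are placeholders, not a real crawler.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the site's robots.txt

url = "https://example.com/some/page.html"
if rp.can_fetch("MyCrawlerBot", url):
    print("allowed to fetch:", url)  # a real bot would download the page here
else:
    print("robots.txt disallows:", url)
```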

Page 20: 3 Understanding Search

The Page Repository

Temporary storage for full page contents and link structure.

Valuable and popular pages can be stored for longer term.

Page 21: 3 Understanding Search

Indexing Module

· A software processor that applies a compression algorithm.

· For content, the algorithm generates an inverted file index.

· Also yields structure indexes and special-purpose indexes (for PDFs and video).

For example, a book-style inverted index maps each term to its locations:

Software, 2

Processor, 3

Compression, 7

Algorithm, 8, 12

Index, 17

Indexes, 21, 25
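A sketch of how an indexing module might build such an inverted file from raw text (toy documents, hypothetical):

```python
# Indexing sketch: build an inverted file mapping each term to the
# (document ID, position) pairs where it occurs. Toy corpus, hypothetical.
from collections import defaultdict

docs = {
    1: "software processor applies a compression algorithm",
    2: "the algorithm generates an inverted file index",
}

inverted = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        inverted[term].append((doc_id, pos))

print(inverted["algorithm"])  # [(1, 5), (2, 1)]
```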

Page 22: 3 Understanding Search

Indexes

Storage area for inverted files and other processed page results. These are the valuable assets of an Internet Search company.

Page 23: 3 Understanding Search

The Query Module

The software that handles user queries; it interacts with the ranking module, the indexes, and the page repository. It must be fast! In February 2003, Google reported serving 250,000,000 searches per day, about 2,894 queries per second (Langville & Meyer, 2006).

Page 24: 3 Understanding Search

The Ranking Module

The software that examines the hyperlink structure and calculates a page’s value.
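The module’s internals are deferred to next class, but a minimal sketch of the best-known link-based calculation, PageRank by power iteration, looks like this (tiny hypothetical link graph):

```python
# Ranking sketch: PageRank by power iteration on a tiny hypothetical
# link graph. This is the classic Brin/Page formulation, previewed on
# the next slide; the slides don't give the module's internals.
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
n = len(pages)
d = 0.85             # damping factor
rank = {p: 1.0 / n for p in pages}

for _ in range(50):  # iterate toward the stationary distribution
    new = {p: (1 - d) / n for p in pages}
    for p, outs in links.items():
        share = rank[p] / len(outs)   # each page splits its rank evenly
        for q in outs:
            new[q] += d * share
    rank = new

for p in pages:
    print(p, round(rank[p], 3))  # more/better inlinks -> higher score
```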

Page 25: 3 Understanding Search

3 Guys and 2 Theses

Sergey Brin, Larry Page, and Jon Kleinberg.

HITS and PageRank™

More on this next class.

Page 26: 3 Understanding Search

An Excellent History (the key reference text)

Amy Langville and Carl Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

Page 27: 3 Understanding Search

Questions & Discussion

Ask questions now.