Working of Web Search Engines



Abstract

The amount of information on the web is

growing rapidly, as well as the number of new

users inexperienced in the art of web research.

True search engines crawl the web, and then

automatically generate their listings. If you

change your web pages, search engine crawlers

will eventually find these changes, and that can

affect your listing. Page titles, body copy, meta

tags (sometimes) and other elements all play a

role in how each search engine evaluates the

relevancy of your page (and hence its ranking).

There are plenty of ways to cater to a search engine’s crawlers and change a site to help

improve its rankings. One such search engine is Google. This report goes through the

different generations of web search engines, the

simplified algorithm used for Page Ranking and 

an overview of the Google Architecture. It is

important to know how search engines work

in order to get the best out of them.

1. Introduction

An Internet search engine is a specialized tool that

helps us find information on the World Wide

Web. A technical encyclopedia, WhatIs.com,

provides an accurate definition of a search engine.

“A search engine is a coordinated set of programs

that includes:

•  A spider (also called a "crawler" or a

"bot") that goes to every page or

representative pages on every Web site

that wants to be searchable and reads it,

using hypertext links on each page to discover and read a site's other pages

•  A program that creates a huge index

(sometimes called a "catalog") from the

pages that have been read

•  A program that receives your search

request, compares it to the entries in the

index, and returns results to you.”

(WhatIs.com, 2001.)

In essence, the search engine bots crawl web

pages and use links to help them navigate to other pages. The search engine then indexes those pages

into its database. When a searcher sends a search

query, the search engine compares the web pages

in the index to find documents that are relevant to

the search query. Based on its algorithm, the

search engine returns results to the searcher in the

search engine result page (SERP).

The search engine algorithm is a set of rules that

a search engine follows, in order to return the

most relevant results. Search engines fail to return

relevant results sometimes, and that is why they

need to improve their algorithm constantly. The

algorithms determine the placement of web

documents in the organic or natural search results,

which are typically displayed on the left side of 

the screen in the SERPs, as illustrated in Figure 1.

Figure 1

Search engine algorithms are very closely kept

industry secrets, because of the fierce competition

in the field. Another reason for search engines to

keep their algorithms private is search engine

spam. If webmasters knew the exact algorithm of 

a search engine, they could manipulate the results

in their favor quite easily. By testing different

tactics, website owners sometimes find out

elements of the algorithms and act accordingly to

boost their ranking in the SERPs. Therefore,

changes in the algorithms are often due to

increased search engine spam.

There are dozens of search engines which are

used by billions of people every day. Among

them are popular ones like Google, Yahoo,

and Bing.

The web creates new challenges for information

retrieval. The amount of information on the web is growing rapidly. People are likely to surf the

web using its link graph, often starting with high


quality human maintained indices such as Yahoo!

or with search engines like Lycos, AltaVista etc.

Human maintained lists cover popular topics

effectively but are subjective, expensive to build

and maintain, slow to improve, and cannot cover

all esoteric topics.

Automated search engines that rely on keyword matching usually return too many low-quality

matches.

To make matters worse, some advertisers

attempt to gain people’s attention by taking

measures meant to mislead automated search

engines, and there are also spammers who want to

influence the web search results.

2. A Brief History of Search Engines

The history of Internet search engines dates

back to 1990, when Alan Emtage, a student at

McGill University in Montreal, developed a search

engine called Archie. As there was no World

Wide Web at that time, Archie operated in a

system called File Transfer Protocol (FTP). In

June 2003, Matthew Gray developed the first

robot on the Web called the Wanderer. Referred to

as the mother of search engines, World Wide Web

Wanderer captured URLs on the web and stored

them in the first ever web database, Wandex.

Other improved web robots soon followed and

search engines began categorizing web pages in

databases, instead of just crawling and listing

them. In 1994 Galaxy, Lycos and WebCrawler

were launched, bringing search engine indexing to

a more advanced state. A small directory project

by two Stanford University Ph.D. candidates,

David Filo and Jerry Yang, was also introduced in

1994, which the creators called Yahoo! This small

directory has since turned into a multi-billion

dollar company and is currently one of the biggest

online search providers.

Many search engines that are still major players

in the search arena were established in the

following years, including AltaVista, Excite,

Inktomi, HotBot and Ask Jeeves.

Excite was introduced in 1993 by six Stanford

University students. It used statistical analysis of 

word relationships to aid in the search process.

Today it's a part of the AskJeeves Company.

EINet Galaxy (Galaxy) was established in 1994

as part of the MCC Research Consortium at the

University of Texas, in Austin. It was eventually

purchased from the University and, after being

transferred through several companies, is a

separate corporation today. It was created as a

directory, containing Gopher and telnet search

features in addition to its Web search feature.

Jerry Yang and David Filo created Yahoo in

1994. It started out as a listing of their favorite Web sites. What made it different was that each

entry, in addition to the URL, also had a

description of the page. Within a year the two

received funding and Yahoo, the corporation, was

created.

Later in 1994, WebCrawler was introduced. It

was the first full-text search engine on the

Internet; the entire text of each page was indexed

for the first time.

Lycos introduced relevance retrieval, prefix matching, and word proximity in 1994. It was a

large search engine, indexing over 60 million

documents in 1996; the largest of any search

engine at the time. Like many of the other search

engines, Lycos was created in a university

atmosphere at Carnegie Mellon University by Dr.

Michael Mauldin.

Infoseek went online in 1995. It didn't really

bring anything new to the search engine scene. It

is now owned by the Walt Disney Internet Group, and the domain forwards to Go.com.

Alta Vista also began in 1995. It was the first

search engine to allow natural language inquiries

and advanced searching techniques. It also

provides a multimedia search for photos, music,

and videos.

Inktomi started in 1996 at UC Berkeley. In June

of 1999 Inktomi introduced a directory search

engine powered by "concept induction"

technology. "Concept induction," according to the company, "takes the experience of human analysis

and applies the same habits to a computerized

analysis of links, usage, and other patterns to

determine which sites are most popular and the

most productive." Inktomi was purchased by

Yahoo in 2003.

AskJeeves and Northern Light were both

launched in 1997.

Google was launched in 1997 by Sergey Brin

and Larry Page as part of a research project at Stanford University. It uses inbound links to rank

sites. In 1998 MSN Search and the Open


Directory were also started. Today MSN

Search (Live) is known as Bing.

3. Three types of Search Engines 

The term "search engine" is often used

generically to describe crawler-based search

engines, human-powered directories, and hybrid search engines. These types of search engines

gather their listings in different ways, through

crawler-based searches, human-powered

directories, and hybrid searches.

3.1.  Crawler-based search engines

Crawler-based search engines, such as Google

(http://www.google.com), create their listings

automatically. They "crawl" or "spider" the web,

then people search through what they have found.

If web pages are changed, crawler-based search engines eventually find these changes, and that

can affect how those pages are listed. Page titles,

body copy and other elements all play a role.

3.2.  Human-powered directories

A human-powered directory, such as the Open

Directory Project depends on humans for its

listings. (Yahoo!, which used to be a directory,

now gets its information from the use of 

crawlers.) A directory gets its information from

submissions, which include a short description to the directory for the entire site, or from editors

who write one for sites they review. A search

looks for matches only in the descriptions

submitted. Changing web pages, therefore, has no

effect on how they are listed. Techniques that are

useful for improving a listing with a search

engine have nothing to do with improving a

listing in a directory.

3.3.  Hybrid search engines

Today, it is extremely common for crawler-type and human-powered results to be combined when

conducting a search. Usually, a hybrid search

engine will favor one type of listings over

another. For example, MSN (now Bing) is more

likely to present human-powered listings from

LookSmart.

4. How do Search Engines Work?

Many Internet nomads are confounded when

they enter a search query and get back a set of over 10,000 “relevant” hits, viewable in batches of 

10. There are occasions when the searcher will

plow through the list hoping to find the perfect

link, but sometimes come across other factors at

work that cause inappropriate results to rise to the

top of the list. One of the factors that can lead to

this type of misinformation may be erroneous

assumptions by searchers as to what’s really going

on “behind the curtain.”

“Search engine” is the popular term for an

Information Retrieval (IR) system.

Before a search engine can tell you where a file

or document is, it must be found. To find

information on the hundreds of millions of Web

pages that exist, a search engine employs special

software robots, called spiders, to build lists of the

words found on Web sites. When a spider is

building its lists, the process is called Web

crawling. In order to build and maintain a useful

list of words, a search engine's spiders have to

look at a lot of pages.

4.1. High-Level Design Architecture of a Web Crawler

Figure 2

A Web crawler is a computer program that

browses the World Wide Web in a methodical,

automated manner or in an orderly fashion. Other

terms for Web crawlers are ants, automatic

indexers, bots, Web spiders, Web robots, etc.

The behavior of a Web crawler is the outcome

of a combination of policies:

•  a selection policy that states which pages to

download,

•  a re-visit policy that states when to check 

for changes to the pages,

•  a politeness policy that states how to avoid

overloading Web sites, and

•  a parallelization policy that states how to

coordinate distributed Web crawlers.
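To make these policies concrete, here is a minimal sketch of a single-threaded crawler loop in Python. The allowed-scheme check standing in for a selection policy, the fixed per-host delay standing in for a politeness policy, the visit-once set standing in for a re-visit policy, and all names and limits are illustrative assumptions, not the design of any real crawler.

```python
import re
import time
import urllib.parse
import urllib.request
from collections import deque

# Illustrative policy parameters (assumptions, not any real engine's values).
ALLOWED_SCHEMES = {"http", "https"}   # selection policy: which URLs are worth downloading
CRAWL_DELAY_SECONDS = 1.0             # politeness policy: pause between requests to one host
MAX_PAGES = 50                        # stop condition for this sketch

HREF_RE = re.compile(r"""href=["'](.*?)["']""", re.IGNORECASE)

def crawl(seed_url):
    frontier = deque([seed_url])      # URLs waiting to be downloaded
    seen = {seed_url}                 # re-visit policy reduced to "visit each URL once"
    last_fetch = {}                   # host -> time of the previous request (politeness)
    pages = {}                        # url -> raw HTML

    while frontier and len(pages) < MAX_PAGES:
        url = frontier.popleft()
        host = urllib.parse.urlparse(url).netloc

        # Politeness: wait if this host was contacted too recently.
        wait = CRAWL_DELAY_SECONDS - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip pages that fail to download
        last_fetch[host] = time.time()
        pages[url] = html

        # Extract hyperlinks and apply the selection policy before queueing them.
        for href in HREF_RE.findall(html):
            link = urllib.parse.urljoin(url, href)
            if urllib.parse.urlparse(link).scheme in ALLOWED_SCHEMES and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

A production crawler would additionally honor robots.txt, schedule re-visits based on how often pages are observed to change, and partition the frontier across many machines, which is where the parallelization policy comes in.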


consistent data structure that all the downstream

processes can handle. The need for a well-formed,

consistent format is of relative importance in

direct proportion to the sophistication of later

steps of document processing. Step 2 is important

because the pointers stored in the inverted file will

enable a system to retrieve various sized units: site, page, document, section, paragraph, or

sentence.

Step 4: Identify potential indexable elements in

documents

Identifying potential indexable elements in

documents dramatically affects the nature and

quality of the document representation that the

engine will search against. In designing the

system, we must define the following: What is a

term? Is it the alphanumeric characters between

blank spaces or punctuation? If so, what about noncompositional phrases (phrases where the

separate words do not convey the meaning of the

phrase, like skunk works or hot dog), multiword

proper names, or interword symbols such as

hyphens or apostrophes that can denote the

difference between “small business men” vs.

“small-business men”? Each search engine depends

on a set of rules that its document processor must

execute to determine what action is to be taken by

the “tokenizer,” i.e., the software used to define a ‘term’ suitable for indexing.
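As a rough illustration of such tokenizer rules, the sketch below treats a term as a run of letters and digits, optionally joined by internal hyphens or apostrophes, so that "small-business" and "site's" each survive as single terms; the regular expression is an assumed example of one possible rule set, not the rules of any particular engine.

```python
import re

# One possible definition of a "term": letters and digits, optionally joined by
# internal hyphens or apostrophes ("small-business", "site's").
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    """Return lower-cased index-term candidates from raw document text."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]

print(tokenize("Small-business men read the site's pages."))
# ['small-business', 'men', 'read', 'the', "site's", 'pages']
```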

Step 5: Delete stop words

This step helps save system resources by

eliminating from further processing, as well as

potential matching, those terms that have little

value in finding useful documents in response to a

customer’s query. This step used to matter much

more than it does now when memory has become

so much cheaper and systems so much faster, but

since stop words may comprise up to 40 percent of 

text words in a document, it still has some significance.

A stop word list typically consists of those word

classes known to convey little substantive

meaning, such as articles (a, the), conjunctions

(and, but), interjections (oh, but), prepositions (in,

over), pronouns (he, it), and forms of the “to be”

verb (is, are). To delete stop words, an algorithm

compares index term candidates in the documents

against a stop word list and eliminates certain

terms from inclusion in the index for searching.

Step 6: Stem terms

Stemming removes word suffixes, perhaps

recursively in layer after layer of processing. The

process has two goals. In terms of efficiency,

stemming reduces the number of unique words in

the index, which in turn reduces the storage space

required for the index and speeds up the search

process.

In terms of effectiveness, stemming improves

recall by reducing all forms of a word to a base or stemmed form. For example, if a user asks for

analyze, he or she may also want documents

which contain analysis, analyzing, analyzer,

analyzes, and analyzed. Therefore, the document

processor stems document terms to analy- so that

documents which include various forms of analy-

will have equal likelihood of being retrieved,

which would not occur if the engine only indexed

variant forms separately and required the user to

enter all. Of course, stemming does have a

downside. It may negatively affect precision in that all forms of a stem will match, when, in fact,

a successful query for the user would have come

from matching only the word form actually used

in the query.

Systems may implement either a strong

stemming algorithm or a weak stemming

algorithm. A strong stemming algorithm will strip

off inflectional suffixes (-s, -es, -ed) and

derivational suffixes (-able, -aciousness, -ability),

while a weak stemming algorithm will strip off only the inflectional suffixes (-s, -es, -ed).
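A minimal sketch of stop-word deletion (Step 5) and weak stemming (Step 6) follows; the tiny stop list and the crude suffix-stripping rules are illustrative assumptions, and real systems use far larger stop lists and more careful stemmers such as the Porter algorithm.

```python
# Illustrative stop list; real systems use a few hundred entries.
STOP_WORDS = {"a", "an", "the", "and", "but", "oh", "in", "over", "he", "it", "is", "are"}

# Weak stemming: strip only the inflectional suffixes mentioned above (-s, -es, -ed).
INFLECTIONAL_SUFFIXES = ("es", "ed", "s")

def remove_stop_words(terms):
    return [t for t in terms if t not in STOP_WORDS]

def weak_stem(term):
    for suffix in INFLECTIONAL_SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

terms = ["the", "analyzed", "analyzes", "documents", "and", "searched"]
print([weak_stem(t) for t in remove_stop_words(terms)])
# ['analyz', 'analyz', 'document', 'search'] -- "analyzed" and "analyzes" collapse to one stem
```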

Step 7: Extract index entries

Having completed steps 1 through 6, the

document processor extracts the remaining entries

from the original document. For example, the

following paragraph shows the full text as sent to

a search engine for processing: “Milosevic's

comments, carried by the official news agency

Tanjug, cast doubt over the governments at the

talks, which the international community has

called to try to prevent an all-out war in the Serbian province. President Milosevic said it was

well known that Serbia and Yugoslavia were

firmly committed to resolving problems in

Kosovo, which is an integral part of Serbia,

peacefully in Serbia with the participation of the

representatives of all ethnic communities, Tanjug

said. Milosevic was speaking during a meeting

with British Foreign Secretary Robin Cook, who

delivered an ultimatum to attend negotiations in a

week's time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province.

Cook earlier told a conference that Milosevic had

agreed to study the proposal.”


Steps 1 through 6 reduce this text for searching

to the following text: “Milosevic comm carri offic

new agen Tanjug cast doubt govern talk interna

commun

call try prevent all-out war Serb province

President Milosevic said well known Serbia

Yugoslavia firm commit resolv problem Kosovo integr part Serbia peace Serbia particip representa

ethnic commun Tanjug said Milosevic speak 

meeti British Foreign Secretary Robin Cook 

deliver ultimat attend negoti week time autonomy

propos Kosovo ethnic Alban lead province Cook 

earl told conference Milosevic agree study

propos”

The output of step 7 is then inserted and stored

in an inverted file that lists the index entries and

an indication of their position and frequency of 

occurrence. The specific nature of the index entries, however, will vary based on the decision

in Step 4 concerning what constitutes an

“indexable term.” More sophisticated Document

Processors will have phrase recognizers, as well as

Named Entity recognizers and Categorizers, to

ensure index entries such as Milosevic are tagged

as a person and entries such as Yugoslavia and

Serbia as countries.

Step 8: Compute weights.

Weights are assigned to terms in the index file.

The simplest search engines simply assign a

binary weight: 1 for presence and 0 for absence.

The more sophisticated the search engine, the

more complex the weighting scheme. Measuring

the frequency of occurrence of a term in the

document creates more sophisticated weighting,

with length-normalization of frequencies still

more sophisticated.

Extensive experience in Information Retrieval

research over many years has clearly demonstrated

that the optimal weighting comes from use of term frequency/inverse document frequency (tf/idf).

This algorithm measures the frequency of 

occurrence of each term within a document. Then

it compares that frequency against the frequency

of occurrence in the entire database.

Not all terms are good discriminators; that is,

they don’t all single out one document from

another very well. A simple example would be the

word “THE.” This word appears in too many

documents to help distinguish one from another. A less obvious example would be the word

“antibiotic.” In a sports database, when we

compare each document to the database as a

whole, the term “antibiotic” would probably be a

good discriminator among documents, and

therefore would be assigned a high weight.

Conversely, in a database devoted to health or

medicine, “antibiotic” would probably be a poor

discriminator, since it occurs very often. The tf/idf 

weighting scheme assigns higher weights to those terms that really distinguish one document from

the others.
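The toy example below computes tf/idf weights over a three-document collection to show why a term like "antibiotic" scores higher than a common term in the same collection; the raw-frequency tf and plain logarithmic idf used here are one of many possible variants and are assumptions made purely for illustration.

```python
import math

# Toy collection: each "document" is already tokenized and stemmed.
docs = [
    ["antibiotic", "cures", "infection"],
    ["football", "match", "score"],
    ["match", "report", "football", "football"],
]

def tf_idf(term, doc, collection):
    tf = doc.count(term)                                     # term frequency in this document
    df = sum(1 for d in collection if term in d)             # documents containing the term
    idf = math.log(len(collection) / df) if df else 0.0      # rarer terms get a higher idf
    return tf * idf

# "antibiotic" appears in only one document, so it discriminates well...
print(tf_idf("antibiotic", docs[0], docs))   # 1 * ln(3/1), approximately 1.10
# ...while "football" is common in this toy collection and is weighted lower per occurrence.
print(tf_idf("football", docs[2], docs))     # 2 * ln(3/2), approximately 0.81
```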

Step 9: Create index

The index or inverted file is the internal data

structure that stores the index information and that

will be searched for each query. Inverted files

range from a simple listing of every alphanumeric

sequence in a set of documents/pages being

indexed along with the overall identifying

numbers of the documents in which that sequence

occurs, to a more linguistically complex list of entries, their tf/idf weights, and pointers to where

inside each document the term occurs. The more

complete the information in the index, the better

the search results.
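A minimal sketch of such an inverted file, storing (document id, position) postings per term, is shown below; the exact layout is an illustrative assumption rather than the format any engine actually uses.

```python
from collections import defaultdict

def build_inverted_index(tokenized_docs):
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, terms in enumerate(tokenized_docs):
        for position, term in enumerate(terms):
            index[term].append((doc_id, position))
    return index

docs = [
    ["milosevic", "comment", "tanjug"],
    ["tanjug", "report", "kosovo"],
]
index = build_inverted_index(docs)
print(index["tanjug"])
# [(0, 2), (1, 0)] -> the term occurs in doc 0 at position 2 and in doc 1 at position 0
```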

4.2.2. Query Processor

Query processing has seven possible steps,

though a system can cut these steps short and

proceed to match the query to the inverted file at

any of a number of places during the processing.

Document processing shares many steps with query processing. More steps and more documents

make the process more expensive for processing

in terms of computational resources and

responsiveness. However, the longer the wait for

results, the higher the quality of results. Thus,

search system designers must choose what is most

important to their users, time or quality. Publicly

available search engines usually choose time over

very high quality because they have too many

documents to search against. The steps in query

processing are as follows (with the option to stop processing and start matching indicated as

“Matcher”):

1. Tokenize query terms

2. Recognize query terms vs. special operators

---------------------------> Matcher

3. Delete stop words

4. Stem words

5. Create query representation

---------------------------> Matcher

6. Expand query terms

7. Compute weights

---------------------------> Matcher


Figure 4

Step 1: Tokenize query terms 

As soon as a user inputs a query, the search

engine, whether a keyword-based system or a full

Natural Language Processing (NLP) system, must

tokenize the query stream, i.e., break it down into

understandable segments. Usually a token is

defined as an alphanumeric string that occurs

between white space and/or punctuation.

Step 2: Recognize query terms vs. special

operators 

Since users may employ special operators in

their query, including Boolean, adjacency, or

proximity operators, the system needs to parse the

query first into query terms and operators. These

operators may occur in the form of reserved

punctuation (e.g., quotation marks) or reserved terms in a specialized format (e.g., AND, OR). In

the case of an NLP system, the query processor

will recognize the operators implicitly in the

language used no matter how they might be

expressed (e.g., prepositions, conjunctions,

ordering).

At this point, a search engine may take the list

of query terms and search them against the

inverted file. In fact, this is the point at which the

majority of publicly available search engines

perform their search.
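The sketch below separates plain query terms from a small set of reserved operators and quoted phrases; the operator set and parsing rules are assumptions chosen for illustration, not the query syntax of any particular engine.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

# Quoted phrases, reserved operators (upper case only), or plain word tokens.
QUERY_RE = re.compile(r"\"[^\"]+\"|\bAND\b|\bOR\b|\bNOT\b|[A-Za-z0-9'-]+")

def parse_query(query):
    """Split a query string into content terms and reserved operators."""
    terms, operators = [], []
    for token in QUERY_RE.findall(query):
        if token in OPERATORS:
            operators.append(token)
        elif token.startswith('"'):
            terms.append(token.strip('"').lower())   # treat a quoted phrase as a single term
        else:
            terms.append(token.lower())
    return terms, operators

print(parse_query('"search engine" AND ranking NOT spam'))
# (['search engine', 'ranking', 'spam'], ['AND', 'NOT'])
```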

Steps 3 and 4: Delete stop words and stem

words

Some search engines will go further and stop-

list and stem the query, similar to the processes

described in the Document Processor section. The

stop list might also contain words from commonly

occurring querying phrases, such as “I’d like information about….” However, since most

publicly available search engines encourage very

short queries, as evidenced in the size of query

window they provide, they may drop these two

steps.

Step 5: Creating the query representation 

How each particular search engine creates a

query representation depends on how the system

does its matching. If a statistically based matcher

is used, then the query must match the statistical

representations of the documents in the system.

Good statistical queries should contain many

synonyms and other terms in order to create a full

representation. If a Boolean matcher is utilized,

then the system must create logical sets of the

terms connected by AND, OR, or NOT.

The NLP system will recognize single terms,

phrases, and Named Entities. If it uses any

Boolean logic, it will also recognize the logical

operators from Step 2 and create a representation

containing logical sets of the terms to be AND’d, OR’d, or NOT’d. At this point, a search engine

may take the query representation and perform the

search against the inverted file. More advanced

search engines may take two further steps.

Step 6: Expand query terms

Since users of search engines usually include

only a single statement of their information needs

in a query, it becomes highly probable that the

information they need may be expressed using

synonyms, rather than the exact query terms, in

the documents that the search engine searches against. Therefore, more sophisticated systems

may expand the query into all possible

synonymous terms and perhaps even broader and

narrower terms.

This process approaches what search

intermediaries did for end-users in the earlier days

of commercial search systems. Then

intermediaries might have used the same

controlled vocabulary or thesaurus used by the

indexers who assigned subject descriptors to documents.

Today, resources such as WordNet are generally

available, or specialized expansion facilities may


take the initial query and enlarge it by adding

associated vocabulary.
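As a rough sketch of this kind of expansion, the example below looks up WordNet synonyms through the NLTK interface; it assumes NLTK and its WordNet corpus are installed (for example via nltk.download('wordnet')), and synonym expansion is only one of many possible strategies.

```python
from nltk.corpus import wordnet   # assumes NLTK and its WordNet corpus are installed

def expand_term(term, max_synonyms=5):
    """Return the original term plus a handful of WordNet synonyms."""
    synonyms = {term}
    for synset in wordnet.synsets(term):
        for lemma in synset.lemma_names():
            synonyms.add(lemma.replace("_", " ").lower())
            if len(synonyms) > max_synonyms:
                return sorted(synonyms)
    return sorted(synonyms)

def expand_query(terms):
    expanded = []
    for term in terms:
        expanded.extend(expand_term(term))
    return expanded

print(expand_query(["car"]))
# e.g. ['auto', 'automobile', 'car', 'machine', 'motorcar', ...]
```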

Step 7: Compute query term weights (assuming

more than one query term)

The final step in query processing involves

computing weights for the terms in the query.

Sometimes the user controls this step by indicating either how much to weight each term or

simply which term or concept in the query matters

most and must appear in each retrieved document

to ensure relevance.

Leaving the weighting up to the user is

uncommon because research has shown that users

are not particularly good at determining the

relative importance of terms in their queries. They

can’t make this determination for several reasons.

First, they don’t know what else exists in the

database, and document terms are weighted by being compared to the database as a whole.

Second, most users seek information about an

unfamiliar subject, so they may not know the

correct terminology. Few search engines

implement system-based query weighting, but

some do an implicit weighting by treating the first

term(s) in a query as having higher significance.

They use this information to provide a list of 

documents/pages to the user. After this final step,

the expanded, weighted query is searched against the inverted file of documents.

4.2.3. Search and Matching Functions

How systems carry out their search and

matching functions differs according to which

theoretical model of IR underlies the system’s

design philosophy.

Searching the inverted file for documents which

meet the query requirements, referred to simply as

“matching,” is typically a standard binary search

no matter whether the search ends after the first

two, five, or all seven steps of query processing.

While the computational processing required for

simple, un-weighted, non-Boolean query matching

is far simpler than when the model is an NLP-

based query within a weighted, Boolean model, it

also follows that the simpler the document

representation, the query representation, and the

matching algorithm, the less relevant the results,

except for very simple queries, such as one-word,

non-ambiguous queries seeking the most generally

known information.

Having determined which subset of documents

or pages match the query requirements to some

degree, a similarity score is computed between the

query and each document/page based on the

scoring algorithm used by the system. Scoring

algorithms base their rankings on the

presence/absence of query term(s), term

frequency, tf/idf, Boolean logic fulfillment, or

query term weights. Some search engines use

scoring algorithms not based on document contents, but rather on relations among

documents or past retrieval history of 

documents/pages.

After computing the similarity of each

document in the subset of documents, the system

presents an ordered list to the user. The

sophistication of the ordering of the documents

again depends on the model the system uses, as

well as the richness of the document and query

weighting mechanisms. For example, search

engines that only require the presence of any alphanumeric string from the query occurring

anywhere, in any order, in a document would

produce a very different ranking from one by a

search engine that performed linguistically correct

phrasing for document and query representation

and that utilized the proven tf/idf weighting

scheme.
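One common scoring function of this kind is cosine similarity between weighted term vectors, sketched below with made-up weights; real engines combine many more signals, so this is purely an illustration.

```python
import math

def cosine_similarity(query_weights, doc_weights):
    """Score a document vector against a query vector (both are dicts of term -> weight)."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

query = {"kosovo": 1.2, "talks": 0.8}                       # made-up query term weights
documents = {
    "doc1": {"kosovo": 0.9, "talks": 0.5, "serbia": 0.4},   # made-up document term weights
    "doc2": {"football": 1.1, "talks": 0.2},
}
ranked = sorted(documents, key=lambda d: cosine_similarity(query, documents[d]), reverse=True)
print(ranked)   # ['doc1', 'doc2'] -- doc1 shares more query terms with higher weights
```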

However, the search engine determines rank,

and the ranked results list goes to the user, who

can then simply click and follow the system’s internal pointers to the selected document/page.

More sophisticated systems will go even further

at this stage and allow the user to provide some

relevance feedback or to modify their query based

on the results they have seen. If either of these is

available, the system will then adjust its query

representation to reflect this value-added feedback 

and rerun the search with the improved query to

produce either a new set of documents or a simple

re-ranking of documents from the initial search.
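A classical way to fold such feedback into the query representation is Rocchio-style reweighting, sketched below; the coefficients are conventional textbook values and the whole routine is an assumed illustration, not the feedback mechanism of any specific engine.

```python
from collections import defaultdict

def rocchio(query_weights, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward documents judged relevant and away from the others."""
    new_query = defaultdict(float)
    for term, weight in query_weights.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:                       # documents the user marked as relevant
        for term, weight in doc.items():
            new_query[term] += beta * weight / max(len(relevant_docs), 1)
    for doc in nonrelevant_docs:                    # documents the user rejected
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / max(len(nonrelevant_docs), 1)
    # Negative weights are usually dropped before the search is rerun.
    return {term: w for term, w in new_query.items() if w > 0}

query = {"jaguar": 1.0}
relevant = [{"jaguar": 0.8, "car": 0.9}]
nonrelevant = [{"jaguar": 0.7, "cat": 1.0}]
print(rocchio(query, relevant, nonrelevant))
# roughly {'jaguar': 1.5, 'car': 0.68} -- "car" joins the query, "cat" is suppressed
```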

4.3. Ranking

Google's rise to success was in large part due to a

patented algorithm called PageRank that helps

rank web pages that match a given search string.

When Google was a Stanford research project, it

was nicknamed BackRub because the technology

checks backlinks to determine a site's importance.

Previous keyword-based methods of ranking

search results, used by many search engines that

were once more popular than Google, would rank 

pages by how often the search terms occurred in the page, or how strongly associated the search

terms were within each resulting page. The

PageRank algorithm instead analyzes human-


generated links assuming that web pages linked

from many important pages are themselves likely

to be important. The algorithm computes a

recursive score for pages, based on the weighted

sum of the PageRanks of the pages linking to

them. PageRank is thought to correlate well with

human concepts of importance. In addition to PageRank, Google over the years has added many

other secret criteria for determining the ranking of 

pages on result lists, reported to be over 200

different indicators. The details are kept secret due

to spammers and in order to maintain an

advantage over Google's competitors.

PageRank is a link analysis algorithm, named

after Larry Page and used by the Google Internet

search engine that assigns a numerical weighting

to each element of a hyperlinked set of documents,

such as the World Wide Web, with the purpose of "measuring" its relative importance within the set.

The algorithm may be applied to any collection of 

entities with reciprocal quotations and references.

The numerical weight that it assigns to any given

element E is referred to as the PageRank of E and

denoted by PR(E).

Google describes PageRank:

“ PageRank reflects our view of the importance of 

web pages by considering more than 500 million

variables and 2 billion terms. Pages that we believe are important pages receive a higher

PageRank and are more likely to appear at the top

of the search results.

PageRank also considers the importance of each

page that casts a vote, as votes from some pages

are considered to have greater value, thus giving

the linked page greater value. We have always

taken a pragmatic approach to help improve

search quality and create useful products, and our

technology uses the collective intelligence of the

web to determine a page's importance.”

The name "PageRank" is a trademark of Google,

and the PageRank process has been patented (U.S.

Patent 6,285,999). However, the patent is assigned

to Stanford University and not to Google. Google

has exclusive license rights on the patent from

Stanford University. The university received 1.8

million shares of Google in exchange for use of 

the patent; the shares were sold in 2005 for

$336 million.

PageRank results from a mathematical algorithm based on the graph created by all World

Wide Web pages as nodes and hyperlinks, taking

into consideration authority hubs like Wikipedia

(however, Wikipedia is actually a sink rather than

a hub because it uses nofollow on external links).

The rank value indicates an importance of a

particular page. A hyperlink to a page counts as a

vote of support. The PageRank of a page is

defined recursively and depends on the number

and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by

many pages with high PageRank receives a high

rank itself. If there are no links to a web page there

is no support for that page.

Numerous academic papers concerning

PageRank have been published since Page and

Brin's original paper. In practice, the PageRank 

concept has proven to be vulnerable to

manipulation, and extensive research has been

devoted to identifying falsely inflated PageRank 

and ways to ignore links from documents withfalsely inflated PageRank.

Other link-based ranking algorithms for Web

pages include the HITS algorithm invented by Jon

Kleinberg (used by Teoma and now Ask.com), the

IBM CLEVER project, and the TrustRank 

algorithm.

PageRank is a probability distribution used to

represent the likelihood that a person randomly

clicking on links will arrive at any particular page.

PageRank can be calculated for collections of documents of any size. It is assumed in several

research papers that the distribution is evenly

divided among all documents in the collection at

the beginning of the computational process. The

PageRank computations require several passes,

called "iterations", through the collection to adjust

approximate PageRank values to more closely

reflect the theoretical true value.

A probability is expressed as a numeric value

between 0 and 1. A 0.5 probability is commonly

expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there

is a 50% chance that a person clicking on a

random link will be directed to the document with

the 0.5 PageRank.

Assume a small universe of four web pages: A,

B, C and D. The initial approximation of 

PageRank would be evenly divided between these

four documents. Hence, each document would

begin with an estimated PageRank of 0.25.

In the original form of PageRank, initial values were simply 1. This meant that the sum of all

pages was the total number of pages on the web.

Later versions of PageRank (see the formulas


below) would assume a probability distribution

between 0 and 1. Here a simple probability

distribution will be used—hence the initial value

of 0.25.

If pages B, C, and D each only link to A, they

would each confer 0.25 PageRank to A. All

PageRank PR( ) in this simplistic system would thus gather to A, because all links would be

pointing to A: PR(A) = PR(B) + PR(C) + PR(D) = 0.75.

Suppose that page B has a link to

page C as well as to page A, while page D has

links to all three pages. The value of the link-votes

is divided among all the outbound links on a page.

Thus, page B gives a vote worth 0.125 to

page A and a vote worth 0.125 to page C. Only

one third of D's PageRank is counted for A's

PageRank (approximately 0.083).

In other words, the PageRank conferred by an

outbound link is equal to the document's own

PageRank score divided by the normalized

number of outbound links L( ) (it is assumed that

links to specific URLs only count once per

document).

In the general case, the PageRank value for any

page u can be expressed as:

PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}

i.e., the PageRank value for a page u is dependent

on the PageRank values for each page v out of the

set B_u (this set contains all pages linking to

page u), divided by the number L(v) of links from

page v.

Figure 5 below explains this in a simpler manner:

Figure 5

Mathematical PageRanks (out of 100) for a

simple network (PageRanks reported by Google

are rescaled logarithmically). Page C has a higher

PageRank than Page E, even though it has fewer

links to it; the link it has is of a much higher

value. A web surfer who chooses a random link on

every page (but with 15% likelihood jumps to a

random page on the whole web) is going to be on

Page E for 8.1% of the time. (The 15% likelihood

of jumping to an arbitrary page corresponds to a

damping factor of 85%.) Without damping, all

web surfers would eventually end up on Pages A,

B, or C, and all other pages would have PageRank 

zero. Page A is assumed to link to all pages in the

web, because it has no outgoing links. 
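Putting the formula and the damping factor from the caption together, the sketch below computes PageRank by simple iteration over the four-page example from the text (B links to A and C, C links to A, D links to A, B, and C); treating the link-less page A as linking to every page, and the choice of 50 iterations, are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly apply PR(u) = (1 - d)/N + d * sum(PR(v)/L(v)) over the pages v linking to u."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                     # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {}
        for u in pages:
            incoming = 0.0
            for v in pages:
                out = links[v] if links[v] else pages    # a page with no outlinks "links" to every page
                if u in out:
                    incoming += pr[v] / len(out)
            new_pr[u] = (1 - damping) / n + damping * incoming
        pr = new_pr
    return pr

# Four-page example from the text: B -> A and C; C -> A; D -> A, B and C; A has no outgoing links.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
print(pagerank(links))   # page A ends up with the largest share of PageRank
```

Each pass through the loop corresponds to one of the "iterations" described earlier, and the values stabilize after a modest number of passes.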


The Panda Update

Google’s recent Panda (a.k.a. “Farmer”) ranking

algorithm update is intended to provide higher

rankings for quality content rather than quantity.

The biggest sites hurt in the change seem to be the

“content farms”.

Wikipedia defines a “content farm” as “a company that employs large numbers of often

freelance writers to generate large amounts of 

textual content which is specifically designed to

satisfy algorithms for maximal retrieval by

automated search engines. Their main goal is to

generate advertising revenue through attracting

reader page views.” In other words, spammy

content designed to fool Google into ranking it higher.

To prevent spammers from gaming the system,

Google does not divulge what specific changes it has made to its algorithm.

Google formed their definition of low quality by

asking outside testers to rate sites by answering

questions such as:

•  Would you be comfortable giving this site

your credit card?

•  Would you be comfortable giving medicine

prescribed by this site to your kids?

•  Do you consider this site to be authoritative?

•  Would it be okay if this was in a magazine?

•  Does this site have excessive ads?

Sites that fared poorly on these questions were the ones whose rankings were to decrease.

5. Conclusion

Search engines play an important role in accessing

content over the internet; they fetch the pages

requested by the user.

They have made the internet, and the information

on it, just a click away. The need for better

search engines only increases. The search engine

sites are among the most popular websites.

Search engines are not a place to get answers to

questions but a place to find information. So it is equally

important for one to know how search engines

work and how to get the best out of them.

6. References 

[1] Wikipedia, "Web search engine", http://en.wikipedia.org/wiki/Web_search_engine

[2] How Stuff Works, http://www.howstuffworks.com

[3] WebReference.com

[4] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"

[5] E. Liddy, "How a Search Engine Works", http://www.cnlp.org/publications/02HowASearchEngineWorks.pdf