Working of Web Search Engines
Abstract
The amount of information on the web is
growing rapidly, as well as the number of new
users inexperienced in the art of web research.
True search engines crawl the web, and then
automatically generate their listings. If you
change your web pages, search engine crawlers
will eventually find these changes, and that can
affect your listing. Page titles, body copy, meta
tags (sometimes) and other elements all play a
role in how each search engine evaluates the
relevancy of your page (and hence its ranking).
There are plenty of ways to cater to a search engine's crawlers and change a site to help improve its rankings. One such engine is the Google Web Search Engine. This report goes through the different generations of web search engines, the simplified algorithm used for Page Ranking, and an overview of the Google Architecture. It is important to know how search engines work in order to get the best out of them.
1. Introduction

An Internet search engine is a specialized tool that
helps us find information on the World Wide
Web. A technical encyclopedia, WhatIs.com,
provides an accurate definition of a search engine.
“A search engine is a coordinated set of programs
that includes:
• A spider (also called a "crawler" or a
"bot") that goes to every page or
representative pages on every Web site
that wants to be searchable and reads it,
using hypertext links on each page to discover and read a site's other pages
• A program that creates a huge index
(sometimes called a "catalog") from the
pages that have been read
• A program that receives your search
request, compares it to the entries in the
index, and returns results to you.”
(WhatIs.com, 2001)
In essence, the search engine bots crawl web
pages and use links to help them navigate to other pages. The search engine then indexes those pages
into its database. When a searcher sends a search
query, the search engine compares the web pages
in the index to find documents that are relevant to
the search query. Based on its algorithm, the
search engine returns results to the searcher in the
search engine result page (SERP).
The search engine algorithm is a set of rules that
a search engine follows, in order to return the
most relevant results. Search engines fail to return
relevant results sometimes, and that is why they
need to improve their algorithm constantly. The
algorithms determine the placement of web
documents in the organic or natural search results,
which are typically displayed on the left side of
the screen in the SERPs, as illustrated in Figure 1.
Figure 1
Search engine algorithms are very closely kept
industry secrets, because of the fierce competition
in the field. Another reason for search engines to
keep their algorithms private is search engine
spam. If webmasters knew the exact algorithm of
a search engine, they could manipulate the results
in their favor quite easily. By testing different
tactics, website owners sometimes find out
elements of the algorithms and act accordingly to
boost their ranking in the SERPs. Therefore,
changes in the algorithms are often due to
increased search engine spam.
There are dozens of search engines which are
used by billions of people every day. These include popular ones like Google, Yahoo, and Bing.
The web creates new challenges for information
retrieval. The amount of information on the web is growing rapidly. People are likely to surf the web using its link graph, often starting with high
quality human maintained indices such as Yahoo!
or with search engines like Lycos, AltaVista etc.
Human maintained lists cover popular topics
effectively but are subjective, expensive to build
and maintain, slow to improve, and cannot cover
all esoteric topics.
Automated search engines that rely on keyword matching usually return too many low-quality
matches.
To make matters worse, some advertisers
attempt to gain people’s attention by taking
measures meant to mislead automated search
engines; there are also spammers who want to influence web search results.
2. A Brief History of Search Engines
The history of Internet search engines dates
back to 1990, when Alan Emtage, a student at
McGill University in Montreal developed a search
engine called Archie. As there was no World
Wide Web at that time, Archie operated in a
system called File Transfer Protocol (FTP). In June 1993, Matthew Gray developed the first robot on the Web, called the Wanderer. Referred to
as the mother of search engines, World Wide Web
Wanderer captured URLs on the web and stored
them in the first ever web database, Wandex.
Other improved web robots soon followed and
search engines began categorizing web pages in
databases, instead of just crawling and listing
them. In 1994 Galaxy, Lycos and WebCrawler
were launched, bringing search engine indexing to
a more advanced state. A small directory project
by two Stanford University Ph.D. candidates,
David Filo and Jerry Yang was also introduced in
1994, which the creators called Yahoo! This small
directory has since turned into a multi-billion
dollar company and is currently one of the biggest
online search providers.
Many search engines that are still major players
in the search arena were established in the
following years, including AltaVista, Excite,
Inktomi, HotBot and Ask Jeeves.
Excite was introduced in 1993 by six Stanford
University students. It used statistical analysis of
word relationships to aid in the search process.
Today it's a part of the AskJeeves Company.
EINet Galaxy (Galaxy) was established in 1994
as part of the MCC Research Consortium at the
University of Texas, in Austin. It was eventually
purchased from the University and, after being
transferred through several companies, is a
separate corporation today. It was created as a
directory, containing Gopher and telnet search
features in addition to its Web search feature.
Jerry Yang and David Filo created Yahoo in
1994. It started out as a listing of their favorite Web sites. What made it different was that each
entry, in addition to the URL, also had a
description of the page. Within a year the two
received funding and Yahoo, the corporation, was
created.
Later in 1994, WebCrawler was introduced. It
was the first full-text search engine on the
Internet; the entire text of each page was indexed
for the first time.
Lycos introduced relevance retrieval, prefix matching, and word proximity in 1994. It was a
large search engine, indexing over 60 million
documents in 1996; the largest of any search
engine at the time. Like many of the other search
engines, Lycos was created in a university
atmosphere at Carnegie Mellon University by Dr.
Michael Mauldin.
Infoseek went online in 1995. It didn't really
bring anything new to the search engine scene. It
is now owned by the Walt Disney Internet Group and the domain forwards to Go.com.
AltaVista also began in 1995. It was the first search engine to allow natural language inquiries
and advanced searching techniques. It also
provides a multimedia search for photos, music,
and videos.
Inktomi started in 1996 at UC Berkeley. In June
of 1999 Inktomi introduced a directory search
engine powered by "concept induction"
technology. "Concept induction," according to the company, "takes the experience of human analysis
and applies the same habits to a computerized
analysis of links, usage, and other patterns to
determine which sites are most popular and the
most productive." Inktomi was purchased by
Yahoo in 2003.
AskJeeves and Northern Light were both
launched in 1997.
Google was launched in 1997 by Sergey Brin
and Larry Page as part of a research project at Stanford University. It uses inbound links to rank
sites. In 1998 MSN Search and the Open
Directory were also started. MSN Search (later Live Search) is known today as Bing.
3. Three Types of Search Engines
The term "search engine" is often used
generically to describe crawler-based search
engines, human-powered directories, and hybrid search engines. These types of search engines gather their listings in different ways.
3.1. Crawler-based search engines
Crawler-based search engines, such as Google
(http://www.google.com), create their listings
automatically. They "crawl" or "spider" the web,
then people search through what they have found.
If web pages are changed, crawler-based search engines eventually find these changes, and that
can affect how those pages are listed. Page titles,
body copy and other elements all play a role.
3.2. Human-powered directories
A human-powered directory, such as the Open
Directory Project depends on humans for its
listings. (Yahoo!, which used to be a directory,
now gets its information from the use of
crawlers.) A directory gets its information from
submissions, which include a short description to the directory for the entire site, or from editors
who write one for sites they review. A search
looks for matches only in the descriptions
submitted. Changing web pages, therefore, has no
effect on how they are listed. Techniques that are
useful for improving a listing with a search
engine have nothing to do with improving a
listing in a directory.
3.3. Hybrid search engines
Today, it is extremely common for crawler-type and human-powered results to be combined when
conducting a search. Usually, a hybrid search
engine will favor one type of listings over
another. For example, MSN (now Bing) is more likely to present human-powered listings from LookSmart.
4. How Do Search Engines Work?

Many Internet nomads are confounded when they enter a search query and get back a set of over 10,000 "relevant" hits, viewable in batches of
10. There are occasions when the searcher will
plow through the list hoping to find the perfect
link, but sometimes come across other factors at
work that cause inappropriate results to rise to the
top of the list. One of the factors that can lead to
this type of misinformation may be erroneous
assumptions by searchers as to what’s really going
on "behind the curtain."
"Search engine" is the popular term for an Information Retrieval (IR) system.
Before a search engine can tell you where a file
or document is, it must be found. To find
information on the hundreds of millions of Web
pages that exist, a search engine employs special
software robots, called spiders, to build lists of the
words found on Web sites. When a spider is
building its lists, the process is called Web
crawling. In order to build and maintain a useful
list of words, a search engine's spiders have to
look at a lot of pages.

4.1. High-Level Design Architecture of a Web Crawler
Figure 2
A Web crawler is a computer program that
browses the World Wide Web in a methodical,
automated manner or in an orderly fashion. Other
terms for Web crawlers are ants, automatic
indexers, bots, Web spiders, Web robots, etc.
The behavior of a Web crawler is the outcome
of a combination of policies:
• a selection policy that states which pages to
download,
• a re-visit policy that states when to check
for changes to the pages,
• a politeness policy that states how to avoid
overloading Web sites, and
• a parallelization policy that states how to
coordinate distributed Web crawlers.
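As a rough, single-threaded illustration of how these four policies can fit together, consider the Python sketch below. The fetch_page and extract_links helpers are hypothetical stand-ins for a page downloader and an HTML link extractor, and the crawl delay and re-visit interval are arbitrary example values; a real crawler would also honor robots.txt and would split the frontier across many machines under its parallelization policy.

import time
import urllib.parse
from collections import deque

CRAWL_DELAY = 1.0      # politeness policy: seconds between requests to one host
REVISIT_AFTER = 86400  # re-visit policy: re-download a page after one day

def is_selectable(url):
    # Selection policy: only download ordinary HTTP(S) pages.
    return urllib.parse.urlparse(url).scheme in ("http", "https")

def crawl(seed_urls, fetch_page, extract_links):
    frontier = deque(seed_urls)   # URLs waiting to be downloaded
    last_fetch = {}               # url  -> time it was last downloaded
    last_hit = {}                 # host -> time of the last request to it
    while frontier:
        url = frontier.popleft()
        if not is_selectable(url):
            continue
        if time.time() - last_fetch.get(url, 0.0) < REVISIT_AFTER:
            continue              # fetched recently; nothing to re-check yet
        host = urllib.parse.urlparse(url).netloc
        pause = CRAWL_DELAY - (time.time() - last_hit.get(host, 0.0))
        if pause > 0:
            time.sleep(pause)     # politeness: do not overload a single host
        page = fetch_page(url)    # hypothetical downloader
        last_hit[host] = last_fetch[url] = time.time()
        frontier.extend(extract_links(page))   # hypothetical link extractor
    return last_fetch

A distributed version would partition the frontier (for example, by host) across crawler processes, which is where the parallelization policy comes in.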
4.2.1. Document Processor

The first steps of document processing normalize incoming documents into a consistent data structure that all the downstream processes can handle. The need for a well-formed, consistent format is of relative importance in direct proportion to the sophistication of later steps of document processing. Step 2 (breaking the document stream into its retrievable units) is important because the pointers stored in the inverted file will enable a system to retrieve various sized units: site, page, document, section, paragraph, or sentence.
Step 4: Identify potential indexable elements in
documents
Identifying potential indexable elements in
documents dramatically affects the nature and
quality of the document representation that the
engine will search against. In designing the
system, we must define the following: What is a
term? Is it the alphanumeric characters between
blank spaces or punctuation? If so, what about noncompositional phrases (phrases where the
separate words do not convey the meaning of the
phrase, like skunk works or hot dog), multiword
proper names, or interword symbols such as
hyphens or apostrophes that can denote the
difference between "small business men" vs. "small-business men"? Each search engine depends
on a set of rules that its document processor must
execute to determine what action is to be taken by
the “tokenizer,” i.e., the software used to define a‘term’ suitable for indexing.
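As an illustration of what such tokenizer rules might look like, the small sketch below treats a term as a run of alphanumeric characters optionally joined by a hyphen or an apostrophe, so that "small-business" survives as a single token; this rule set is an assumption for the example, since every engine defines its own notion of a term.

import re

# Illustrative rule: a term is an alphanumeric run, optionally joined by '-' or
# an apostrophe, lower-cased. Real document processors add phrase and
# proper-name recognition on top of this.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]

print(tokenize("Small-business men met at the skunk works."))
# ['small-business', 'men', 'met', 'at', 'the', 'skunk', 'works']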
Step 5: Delete stop words
This step helps save system resources by
eliminating from further processing, as well as
potential matching, those terms that have little
value in finding useful documents in response to a
customer's query. This step used to matter much more than it does now that memory has become so much cheaper and systems so much faster, but since stop words may comprise up to 40 percent of text words in a document, it still has some significance.
A stop word list typically consists of those word
classes known to convey little substantive
meaning, such as articles (a, the), conjunctions
(and, but), interjections (oh, but), prepositions (in,
over), pronouns (he, it), and forms of the “to be”
verb (is, are). To delete stop words, an algorithm
compares index term candidates in the documents
against a stop word list and eliminates certain
terms from inclusion in the index for searching.

Step 6: Stem terms
Stemming removes word suffixes, perhaps
recursively in layer after layer of processing. The
process has two goals. In terms of efficiency,
stemming reduces the number of unique words in
the index, which in turn reduces the storage space
required for the index and speeds up the search
process.
In terms of effectiveness, stemming improves
recall by reducing all forms of a word to a base or stemmed form. For example, if a user asks for
analyze, he or she may also want documents
which contain analysis, analyzing, analyzer,
analyzes, and analyzed. Therefore, the document
processor stems document terms to analy- so that
documents which include various forms of analy-
will have equal likelihood of being retrieved,
which would not occur if the engine only indexed
variant forms separately and required the user to
enter all. Of course, stemming does have a
downside. It may negatively affect precision in that all forms of a stem will match, when, in fact,
a successful query for the user would have come
from matching only the word form actually used
in the query.
Systems may implement either a strong
stemming algorithm or a weak stemming
algorithm. A strong stemming algorithm will strip
off inflectional suffixes (-s, -es, -ed) and
derivational suffixes (-able, -aciousness, -ability),
while a weak stemming algorithm will strip off only the inflectional suffixes (-s, -es, -ed).
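The sketch below illustrates steps 5 and 6 together: it drops terms found on a small stop word list and then applies a weak stemmer that strips a single inflectional suffix. Both the word list and the suffix rules are illustrative assumptions, not any engine's actual tables.

STOP_WORDS = {"a", "the", "and", "but", "oh", "in", "over", "he", "it",
              "is", "are"}                        # articles, conjunctions, etc.
INFLECTIONAL_SUFFIXES = ("ies", "es", "ed", "s")  # weak stemming only

def weak_stem(term):
    # Strip one inflectional suffix; a strong stemmer would also remove
    # derivational suffixes such as -able or -ability, possibly recursively.
    for suffix in INFLECTIONAL_SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def normalize(tokens):
    return [weak_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize(["the", "analyzer", "analyzes", "and", "stores", "documents"]))
# ['analyzer', 'analyz', 'stor', 'document'] -- note the weak stemmer leaves
# the derivational -er of "analyzer" untouched.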
Step 7: Extract index entries
Having completed steps 1 through 6, the
document processor extracts the remaining entries
from the original document. For example, the
following paragraph shows the full text as sent to
a search engine for processing: “Milosevic's
comments, carried by the official news agency
Tanjug, cast doubt over the governments at the
talks, which the international community has
called to try to prevent an all-out war in the Serbian province. President Milosevic said it was
well known that Serbia and Yugoslavia were
firmly committed to resolving problems in
Kosovo, which is an integral part of Serbia,
peacefully in Serbia with the participation of the
representatives of all ethnic communities, Tanjug
said. Milosevic was speaking during a meeting
with British Foreign Secretary Robin Cook, who
delivered an ultimatum to attend negotiations in a
week's time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province.
Cook earlier told a conference that Milosevic had
agreed to study the proposal.”
Steps 1 through 6 reduce this text for searching
to the following text: “Milosevic comm carri offic
new agen Tanjug cast doubt govern talk interna
commun
call try prevent all-out war Serb province
President Milosevic said well known Serbia
Yugoslavia firm commit resolv problem Kosovo integr part Serbia peace Serbia particip representa
ethnic commun Tanjug said Milosevic speak
meeti British Foreign Secretary Robin Cook
deliver ultimat attend negoti week time autonomy
propos Kosovo ethnic Alban lead province Cook
earl told conference Milosevic agree study
propos”
The output of step 7 is then inserted and stored
in an inverted file that lists the index entries and
an indication of their position and frequency of
occurrence. The specific nature of the index entries, however, will vary based on the decision
in Step 4 concerning what constitutes an
“indexable term.” More sophisticated Document
Processors will have phrase recognizers, as well as
Named Entity recognizers and Categorizers, to
ensure index entries such as Milosevic are tagged
as a person and entries such as Yugoslavia and
Serbia as countries.
Step 8: Compute weights

Weights are assigned to terms in the index file. The simplest of search engines simply assign a
binary weight: 1 for presence and 0 for absence.
The more sophisticated the search engine, the
more complex the weighting scheme. Measuring
the frequency of occurrence of a term in the
document creates more sophisticated weighting,
with length-normalization of frequencies still
more sophisticated.
Extensive experience in Information Retrieval
research over many years has clearly demonstrated
that the optimal weighting comes from use of term frequency/inverse document frequency (tf/idf).
This algorithm measures the frequency of
occurrence of each term within a document. Then
it compares that frequency against the frequency
of occurrence in the entire database.
Not all terms are good discriminators; that is,
they don’t all single out one document from
another very well. A simple example would be the
word “THE.” This word appears in too many
documents to help distinguish one from another. A less obvious example would be the word
“antibiotic.” In a sports database, when we
compare each document to the database as a
whole, the term “antibiotic” would probably be a
good discriminator among documents, and
therefore would be assigned a high weight.
Conversely, in a database devoted to health or
medicine, “antibiotic” would probably be a poor
discriminator, since it occurs very often. The tf/idf
weighting scheme assigns higher weights to those terms that really distinguish one document from
the others.
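One standard formulation of tf/idf, sketched below, multiplies a term's length-normalized frequency in a document by the logarithm of how rare that term is across the collection; the exact weighting formula differs between engines, so this is just one common variant.

import math
from collections import Counter

def tf_idf_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["football", "score", "goal", "antibiotic"],  # a sports page mentioning an antibiotic
        ["football", "score", "match"],
        ["football", "goal", "match"]]
print(tf_idf_weights(docs)[0])
# "antibiotic" is rare in this sports-like collection, so it gets the highest
# weight in the first document; "football", which occurs everywhere, gets zero.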
Step 9: Create index
The index or inverted file is the internal data
structure that stores the index information and that
will be searched for each query. Inverted files
range from a simple listing of every alphanumeric
sequence in a set of documents/pages being
indexed along with the overall identifying
numbers of the documents in which that sequence
occurs, to a more linguistically complex list of entries, their tf/idf weights, and pointers to where
inside each document the term occurs. The more
complete the information in the index, the better
the search results.
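To make that structure concrete, the toy sketch below builds an inverted file that records, for each term, the documents it occurs in together with its frequency and positions there; the field names and layout are illustrative assumptions rather than any particular engine's format.

from collections import defaultdict

def build_inverted_file(documents):
    """documents: {doc_id: list of already-processed terms}.
    Returns {term: {doc_id: {"freq": n, "positions": [...]}}}."""
    index = defaultdict(dict)
    for doc_id, terms in documents.items():
        for position, term in enumerate(terms):
            posting = index[term].setdefault(doc_id, {"freq": 0, "positions": []})
            posting["freq"] += 1
            posting["positions"].append(position)
    return dict(index)

index = build_inverted_file({
    "doc1": ["milosevic", "comm", "carri", "offic", "new", "agen", "tanjug"],
    "doc2": ["cook", "told", "conference", "milosevic", "agree", "study"],
})
print(index["milosevic"])
# {'doc1': {'freq': 1, 'positions': [0]}, 'doc2': {'freq': 1, 'positions': [3]}}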
4.2.2. Query Processor
Query processing has seven possible steps,
though a system can cut these steps short and
proceed to match the query to the inverted file at
any of a number of places during the processing.
Document processing shares many steps with query processing. More steps and more documents
make the process more expensive for processing
in terms of computational resources and
responsiveness. However, the longer the wait for
results, the higher the quality of results. Thus,
search system designers must choose what is most
important to their users, time or quality. Publicly
available search engines usually choose time over
very high quality because they have too many
documents to search against. The steps in query
processing are as follows (with the option to stop processing and start matching indicated as
“Matcher”):
1. Tokenize query terms
2. Recognize query terms vs. special operators
3. ---------------------------> Matcher
4. Delete stop words
5. Stem words
6. Create query representation
7. ---------------------------> Matcher
8. Expand query terms
9. Compute weights
10. ---------------------------> Matcher
Figure 4
Step 1: Tokenize query terms
As soon as a user inputs a query, the search
engine, whether a keyword-based system or a full
Natural Language Processing (NLP) system, must
tokenize the query stream, i.e., break it down into
understandable segments. Usually a token is
defined as an alphanumeric string that occurs
between white space and/or punctuation.
Step 2: Recognize query terms vs. special
operators
Since users may employ special operators in
their query, including Boolean, adjacency, or
proximity operators, the system needs to parse the
query first into query terms and operators. These
operators may occur in the form of reserved
punctuation (e.g., quotation marks) or reserved terms in specialized format (e.g., AND, OR). In
the case of an NLP system, the query processor
will recognize the operators implicitly in the
language used no matter how they might be
expressed (e.g., prepositions, conjunctions,
ordering).
At this point, a search engine may take the list
of query terms and search them against the
inverted file. In fact, this is the point at which the
majority of publicly available search engines
perform their search.
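As a small sketch of steps 1 and 2, the example below splits a raw query into quoted phrases, reserved Boolean operators, and plain terms; the operator set and the quoting syntax are assumptions made for the illustration.

import re

OPERATORS = {"AND", "OR", "NOT"}
# Either a quoted phrase or a whitespace-delimited token (illustrative grammar).
QUERY_RE = re.compile(r'"([^"]+)"|(\S+)')

def parse_query(query):
    phrases, operators, terms = [], [], []
    for quoted, word in QUERY_RE.findall(query):
        if quoted:
            phrases.append(quoted.lower())
        elif word in OPERATORS:
            operators.append(word)
        else:
            terms.append(word.lower())
    return phrases, operators, terms

print(parse_query('"search engine" AND ranking NOT spam'))
# (['search engine'], ['AND', 'NOT'], ['ranking', 'spam'])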
Steps 3 and 4: Delete stop words and stem
words
Some search engines will go further and stop-
list and stem the query, similar to the processes
described in the Document Processor section. The
stop list might also contain words from commonly
occurring querying phrases, such as "I'd like information about…." However, since most
publicly available search engines encourage very
short queries, as evidenced in the size of query
window they provide, they may drop these two
steps.
Step 5: Creating the query representation
How each particular search engine creates a
query representation depends on how the system
does its matching. If a statistically based matcher
is used, then the query must match the statistical
representations of the documents in the system. Good statistical queries should contain many
synonyms and other terms in order to create a full
representation. If a Boolean matcher is utilized,
then the system must create logical sets of the
terms connected by AND, OR, or NOT.
The NLP system will recognize single terms,
phrases, and Named Entities. If it uses any
Boolean logic, it will also recognize the logical
operators from Step 2 and create a representation
containing logical sets of the terms to be AND'd, OR'd, or NOT'd. At this point, a search engine
may take the query representation and perform the
search against the inverted file. More advanced
search engines may take two further steps.
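For a Boolean matcher, the query representation can be evaluated directly as set operations over the postings of the inverted file, as in the sketch below; the index layout (a term mapped to the set of documents containing it) is an assumption for the example.

def boolean_match(index, required, optional=(), excluded=()):
    """index: {term: set of doc ids}. Required terms are AND'd, optional
    terms OR'd in, and excluded terms NOT'd out."""
    docs = (set.intersection(*(index.get(t, set()) for t in required))
            if required else set())
    for term in optional:
        docs |= index.get(term, set())
    for term in excluded:
        docs -= index.get(term, set())
    return docs

index = {"search": {1, 2, 3}, "engine": {2, 3}, "spam": {3}}
print(boolean_match(index, required=["search", "engine"], excluded=["spam"]))
# {2}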
Step 6: Expand query terms
Since users of search engines usually include
only a single statement of their information needs
in a query, it becomes highly probable that the
information they need may be expressed using
synonyms, rather than the exact query terms, in
the documents that the search engine searches against. Therefore, more sophisticated systems
may expand the query into all possible
synonymous terms and perhaps even broader and
narrower terms.
This process approaches what search
intermediaries did for end-users in the earlier days
of commercial search systems. Then
intermediaries might have used the same
controlled vocabulary or thesaurus used by the
indexers who assigned subject descriptors to documents.
Today, resources such as WordNet are generally
available, or specialized expansion facilities may
take the initial query and enlarge it by adding
associated vocabulary.
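The sketch below illustrates this kind of expansion with a tiny hand-written synonym table standing in for a resource such as WordNet; both the table and the policy of appending every synonym are assumptions made for the example.

# Stand-in thesaurus; a production system would consult WordNet or a
# domain-specific controlled vocabulary instead of a hard-coded dictionary.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "doctor": ["physician"],
    "buy": ["purchase"],
}

def expand_query(terms):
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["buy", "used", "car"]))
# ['buy', 'purchase', 'used', 'car', 'automobile', 'vehicle']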
Step 7: Compute query term weight (assuming
more than one query term)
The final step in query processing involves
computing weights for the terms in the query.
Sometimes the user controls this step by indicating either how much to weight each term or
simply which term or concept in the query matters
most and must appear in each retrieved document
to ensure relevance.
Leaving the weighting up to the user is
uncommon because research has shown that users
are not particularly good at determining the
relative importance of terms in their queries. They
can’t make this determination for several reasons.
First, they don’t know what else exists in the
database, and document terms are weighted by being compared to the database as a whole.
Second, most users seek information about an
unfamiliar subject, so they may not know the
correct terminology. Few search engines
implement system-based query weighting, but
some do an implicit weighting by treating the first
term(s) in a query as having higher significance.
They use this information to provide a list of
documents/pages to the user. After this final step,
the expanded, weighted query is searched against the inverted file of documents.
4.2.3. Search and Matching Functions
How systems carry out their search and
matching functions differs according to which
theoretical model of IR underlies the system’s
design philosophy.
Searching the inverted file for documents which
meet the query requirements, referred to simply as
“matching,” is typically a standard binary search
no matter whether the search ends after the first
two, five, or all seven steps of query processing.
While the computational processing required for
simple, un-weighted, non-Boolean query matching
is far simpler than when the model is an NLP-
based query within a weighted, Boolean model, it
also follows that the simpler the document
representation, the query representation, and the
matching algorithm, the less relevant the results,
except for very simple queries, such as one-word,
non-ambiguous queries seeking the most generally
known information.
Having determined which subset of documents
or pages match the query requirements to some
degree, a similarity score is computed between the
query and each document/page based on the
scoring algorithm used by the system. Scoring
algorithms base their rankings on the
presence/absence of query term(s), term
frequency, tf/idf, Boolean logic fulfillment, or
query term weights. Some search engines use
scoring algorithms not based on document contents, but rather on relations among
documents or past retrieval history of
documents/pages.
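One widely used content-based scoring function, not named explicitly in this report, is the cosine similarity between tf/idf vectors of the query and of each candidate document; the sketch below assumes those vectors are already available as term-to-weight dictionaries.

import math

def cosine_score(query_vec, doc_vec):
    """query_vec, doc_vec: {term: tf/idf weight}. Higher score = more similar."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    norm = (math.sqrt(sum(w * w for w in query_vec.values())) *
            math.sqrt(sum(w * w for w in doc_vec.values())))
    return dot / norm if norm else 0.0

def rank(query_vec, candidate_docs):
    # candidate_docs: {doc_id: weight vector}; returns doc ids, best match first.
    return sorted(candidate_docs,
                  key=lambda d: cosine_score(query_vec, candidate_docs[d]),
                  reverse=True)

docs = {"page1": {"kosovo": 0.9, "serbia": 0.4},
        "page2": {"football": 0.8, "kosovo": 0.1}}
print(rank({"kosovo": 1.0}, docs))   # ['page1', 'page2']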
After computing the similarity of each
document in the subset of documents, the system
presents an ordered list to the user. The
sophistication of the ordering of the documents
again depends on the model the system uses, as
well as the richness of the document and query
weighting mechanisms. For example, search
engines that only require the presence of any alphanumeric string from the query occurring
anywhere, in any order, in a document would
produce a very different ranking from one by a
search engine that performed linguistically correct
phrasing for document and query representation
and that utilized the proven tf/idf weighting
scheme.
However, the search engine determines rank,
and the ranked results list goes to the user, who
can then simply click and follow the system's internal pointers to the selected document/page.
More sophisticated systems will go even further
at this stage and allow the user to provide some
relevance feedback or to modify their query based
on the results they have seen. If either of these is
available, the system will then adjust its query
representation to reflect this value-added feedback
and rerun the search with the improved query to
produce either a new set of documents or a simple
re-ranking of documents from the initial search.
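A classic way to fold such relevance feedback back into the query representation, not named in this report but common in IR systems, is a Rocchio-style update; the sketch below applies it to term-weight dictionaries, with blending weights chosen only for illustration.

def rocchio_update(query_vec, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """query_vec and each element of relevant/nonrelevant: {term: weight}.
    Moves the query toward documents the user marked relevant and away
    from those marked non-relevant."""
    terms = set(query_vec)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    updated = {}
    for term in terms:
        pos = (sum(d.get(term, 0.0) for d in relevant) / len(relevant)
               if relevant else 0.0)
        neg = (sum(d.get(term, 0.0) for d in nonrelevant) / len(nonrelevant)
               if nonrelevant else 0.0)
        weight = alpha * query_vec.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:
            updated[term] = weight
    return updated

print(rocchio_update({"kosovo": 1.0},
                     relevant=[{"kosovo": 0.6, "serbia": 0.5}],
                     nonrelevant=[{"football": 0.9}]))
# Adds weight for "serbia", keeps "kosovo", and drops "football" entirely.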
4.3. Ranking

Google's rise to success was in large part due to a
patented algorithm called PageRank that helps
rank web pages that match a given search string.
When Google was a Stanford research project, it
was nicknamed BackRub because the technology
checks backlinks to determine a site's importance.
Previous keyword-based methods of ranking
search results, used by many search engines that
were once more popular than Google, would rank
pages by how often the search terms occurred in the page, or how strongly associated the search terms were within each resulting page. The PageRank algorithm instead analyzes human-
generated links, assuming that web pages linked
from many important pages are themselves likely
to be important. The algorithm computes a
recursive score for pages, based on the weighted
sum of the PageRanks of the pages linking to
them. PageRank is thought to correlate well with
human concepts of importance. In addition to PageRank, Google over the years has added many
other secret criteria for determining the ranking of
pages on result lists, reported to be over 200
different indicators. The details are kept secret due
to spammers and in order to maintain an
advantage over Google's competitors.
PageRank is a link analysis algorithm, named
after Larry Page and used by the Google Internet
search engine that assigns a numerical weighting
to each element of a hyperlinked set of documents,
such as the World Wide Web, with the purpose of "measuring" its relative importance within the set.
The algorithm may be applied to any collection of
entities with reciprocal quotations and references.
The numerical weight that it assigns to any given
element E is referred to as the PageRank of E and
denoted by PR(E).
Google describes PageRank:
“ PageRank reflects our view of the importance of
web pages by considering more than 500 million
variables and 2 billion terms. Pages that we believe are important pages receive a higher
PageRank and are more likely to appear at the top
of the search results.
PageRank also considers the importance of each
page that casts a vote, as votes from some pages
are considered to have greater value, thus giving
the linked page greater value. We have always
taken a pragmatic approach to help improve
search quality and create useful products, and our
technology uses the collective intelligence of the
web to determine a page's importance."
The name "PageRank" is a trademark of Google,
and the PageRank process has been patented (U.S.
Patent 6,285,999). However, the patent is assigned
to Stanford University and not to Google. Google
has exclusive license rights on the patent from
Stanford University. The university received 1.8
million shares of Google in exchange for use of
the patent; the shares were sold in 2005 for $336 million.
A PageRank results from a mathematical algorithm based on the graph created by all World Wide Web pages as nodes and hyperlinks as edges, taking
into consideration authority hubs like Wikipedia
(however, Wikipedia is actually a sink rather than
a hub because it uses nofollow on external links).
The rank value indicates an importance of a
particular page. A hyperlink to a page counts as a
vote of support. The PageRank of a page is
defined recursively and depends on the number
and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by
many pages with high PageRank receives a high
rank itself. If there are no links to a web page there
is no support for that page.
Numerous academic papers concerning
PageRank have been published since Page and
Brin's original paper. In practice, the PageRank
concept has proven to be vulnerable to
manipulation, and extensive research has been
devoted to identifying falsely inflated PageRank
and ways to ignore links from documents with falsely inflated PageRank.
Other link-based ranking algorithms for Web
pages include the HITS algorithm invented by Jon
Kleinberg (used by Teoma and now Ask.com), the
IBM CLEVER project, and the TrustRank
algorithm.
PageRank is a probability distribution used to
represent the likelihood that a person randomly
clicking on links will arrive at any particular page.
PageRank can be calculated for collections of documents of any size. It is assumed in several
research papers that the distribution is evenly
divided among all documents in the collection at
the beginning of the computational process. The
PageRank computations require several passes,
called "iterations", through the collection to adjust
approximate PageRank values to more closely
reflect the theoretical true value.
A probability is expressed as a numeric value
between 0 and 1. A 0.5 probability is commonly
expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there
is a 50% chance that a person clicking on a
random link will be directed to the document with
the 0.5 PageRank.
Assume a small universe of four web pages: A,
B, C and D. The initial approximation of
PageRank would be evenly divided between these
four documents. Hence, each document would
begin with an estimated PageRank of 0.25.
In the original form of PageRank, initial values were simply 1. This meant that the sum of PageRank over all pages was the total number of pages on the web.
Later versions of PageRank (see the formulas
below) would assume a probability distribution
between 0 and 1. Here a simple probability
distribution will be used—hence the initial value
of 0.25.
If pages B, C, and D each only link to A, they
would each confer 0.25 PageRank to A. All
PageRank in this simplistic system would thus gather to A, because all links would be pointing to A: PR(A) = PR(B) + PR(C) + PR(D) = 0.75.
Suppose instead that page B has a link to
page C as well as to page A, while page D has
links to all three pages. The value of the link-votes
is divided among all the outbound links on a page.
Thus, page B gives a vote worth 0.125 to
page A and a vote worth 0.125 to page C. Only
one third of D's PageRank is counted for A's PageRank (approximately 0.083); that is, PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3.
In other words, the PageRank conferred by an
outbound link is equal to the document's own
PageRank score divided by the normalized
number of outbound links L(·) (it is assumed that
links to specific URLs only count once per
document).
In the general case, the PageRank value for any page u can be expressed as:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

i.e., the PageRank value for a page u is dependent on the PageRank values for each page v out of the set B_u (this set contains all pages linking to page u), divided by the number L(v) of links from page v.
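This formula can be turned into a short iterative computation. The sketch below uses the four-page example from the text and also includes the damping factor discussed with Figure 5 (a 15% chance of jumping to a random page), so the numbers it prints are illustrative rather than a reproduction of Google's actual implementation.

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: list of pages it links to}. Returns {page: PageRank}."""
    pages = list(links)
    n = len(pages)
    # A page with no outgoing links is treated as linking to every page,
    # as the discussion of Figure 5 below does for page A.
    out = {p: (links[p] if links[p] else pages) for p in pages}
    rank = {p: 1.0 / n for p in pages}   # start from a uniform distribution
    for _ in range(iterations):
        rank = {p: (1 - damping) / n
                   + damping * sum(rank[v] / len(out[v])
                                   for v in pages if p in out[v])
                for p in pages}
    return rank

# The four-page example from the text: B links to A and C, C links only to A,
# and D links to A, B, and C; page A itself has no outgoing links.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
print({p: round(r, 3) for p, r in pagerank(links).items()})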
Figure 5 below explains this in a simpler manner:
Figure 5
Mathematical PageRanks (out of 100) for a
simple network (PageRanks reported by Google
are rescaled logarithmically). Page C has a higher
PageRank than Page E, even though it has fewer
links to it; the link it has is of a much higher
value. A web surfer who chooses a random link on
every page (but with 15% likelihood jumps to a
random page on the whole web) is going to be on
Page E for 8.1% of the time. (The 15% likelihood
of jumping to an arbitrary page corresponds to a
damping factor of 85%.) Without damping, all
web surfers would eventually end up on Pages A,
B, or C, and all other pages would have PageRank
zero. Page A is assumed to link to all pages in the
web, because it has no outgoing links.
The Panda Update
Google's recent Panda (a.k.a. "Farmer") ranking algorithm update is intended to provide higher page rankings for quality of content rather than quantity. The biggest sites hurt by the change seem to be the "content farms".
Wikipedia defines a "content farm" as "a company that employs large numbers of often
freelance writers to generate large amounts of
textual content which is specifically designed to
satisfy algorithms for maximal retrieval by
automated search engines. Their main goal is to
generate advertising revenue through attracting
reader page views." In other words, spammy content designed to fool Google into ranking it higher.
To prevent spammers from gaming the system,
Google does not divulge what specific changes they've made to their algorithm.
Google formed their definition of low quality by
asking outside testers to rate sites by answering
questions such as:
• Would you be comfortable giving this site
your credit card?
• Would you be comfortable giving medicine
prescribed by this site to your kids?
• Do you consider this site to be authoritative?
• Would it be okay if this was in a magazine?
• Does this site have excessive ads?
If the answers indicated low quality, the site's PageRank was to decrease.
5. Conclusion

Search engines play an important role in accessing content over the Internet; they fetch the pages requested by the user.
They have made the Internet, and accessing information on it, just a click away. The need for better search engines only increases, and search engine sites are among the most popular websites.
Search engines are not a place to get questions answered but to find information, so it is equally important to know how search engines work and how to get the best out of them.
6. References
[1] Wikipedia. http://en.wikipedia.org/wiki/Web_search_engine
[2] How Stuff Works. http://www.howstuffworks.com
[3] WebReference.com
[4] The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page
[5] How a Search Engine Works, by Elizabeth Liddy. http://www.cnlp.org/publications/02HowASearchEngineWorks.pdf