Unit-5 Searching the Web
TRANSCRIPT
-
8/17/2019 Unit-5 Seraching the Web
Searching the Web
Agenda
1. Introduction
2. Challenges of Searching the Web
3. Measuring the Web
4. Search Engines (Google)
5. Web Directories
6. Metasearchers
7. Google Searching Guidelines
1. Introduction
• The Web can be seen as a very large, unstructured but ubiquitous database.
• So we need efficient tools to manage, retrieve and filter the information.
• There are 3 different forms of searching the Web:
  1. Search Engines, which index a portion of Web pages as a full-text database.
  2. Web Directories, which classify selected Web documents by subject.
  3. Searching by hyperlink structure.
2. Challenges of Searching the Web
• Problems with the data itself:
  - Distributed data
  - High percentage of volatile data
    • It is estimated that 40% of the Web changes every month.
  - Unstructured and redundant data
    • No conceptual model, no organization, no constraints.
    • By some estimates, about 30% of the Web is redundant.
  - Quality of data
    • Data can be false, invalid, outdated, poorly written or with many errors.
  - Heterogeneous data
    • Multiple media types, multiple formats, languages and alphabets.
• Problems regarding the user and their interaction with the retrieval system:
  - How to specify a query
  - How to interpret the answer provided by the system
3. Measuring the Web
4. Search Engine
• A search engine is a program designed to help find information stored on a computer system such as the World Wide Web or a personal computer.
• The search engine allows one to ask for content meeting specific criteria and retrieves a list of references that match those criteria.
• Two main architectures:
  1. Centralized: using crawlers, information is gathered into a single site, where it is indexed; the site then processes all user queries.
  2. Distributed: searching is a coordinated effort of many information gatherers and brokers.
4.1 Centralized Architecture
• Most search engines use a centralized crawler-indexer architecture.
• Components: Crawlers, Index, Query Engine, and Interface.
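The crawler, index and query engine components can be sketched in a few lines. This is only an illustration under stated assumptions: the PAGES dictionary stands in for the Web (no real fetching), and the function names crawl, build_index and query are hypothetical, not taken from any particular search engine.

```python
from collections import defaultdict

# Hypothetical in-memory "Web": URL -> page text (stands in for HTTP fetches).
PAGES = {
    "http://a.example/": "search engines index web pages",
    "http://b.example/": "web directories classify pages by subject",
}

def crawl(pages):
    """Crawler: gather (url, text) pairs into the single central site."""
    return list(pages.items())

def build_index(documents):
    """Index: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents:
        for word in text.lower().split():
            index[word].add(url)
    return index

def query(index, terms):
    """Query engine: return the URLs that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms.split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

index = build_index(crawl(PAGES))
print(query(index, "web pages"))
```

The interface component would sit in front of query, turning user input into a term string and formatting the returned URLs.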
4.2 Distributed Architecture
• Harvest is an example of a distributed architecture.
• Main drawback: requires the coordination of several Web servers.
• Components:
  1. Gatherers:
     • Extract information from the documents stored on one or more Web servers.
     • Can handle documents in many formats: HTML, PDF, PostScript, etc.
  2. Broker: provides the indexing mechanism and query interface.
  3.
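The gatherer and broker roles can be illustrated with a small sketch. It is not Harvest's actual code or protocol: the class names, the in-memory documents and the example URLs are all assumptions made for illustration.

```python
from collections import defaultdict

class Gatherer:
    """Extracts indexing information from the documents of one server."""

    def __init__(self, documents):
        self.documents = documents  # hypothetical: document URL -> text

    def gather(self):
        # Emit (word, url) pairs. A real gatherer would first parse
        # formats such as HTML or PDF before extracting terms.
        for url, text in self.documents.items():
            for word in set(text.lower().split()):
                yield word, url

class Broker:
    """Builds an index from several gatherers and answers queries."""

    def __init__(self, gatherers):
        self.index = defaultdict(set)
        for gatherer in gatherers:
            for word, url in gatherer.gather():
                self.index[word].add(url)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))

g1 = Gatherer({"http://s1.example/a": "harvest gatherer"})
g2 = Gatherer({"http://s2.example/b": "harvest broker"})
broker = Broker([g1, g2])
print(broker.search("harvest"))
```

Each gatherer only touches its own server's documents, which is the point of the design: the coordination cost noted above comes from having to run and connect these pieces across servers.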
4.3 About Google
• The name "Google" is a play on the word "googol", which refers to the number represented by 1 followed by one hundred zeros.
• Google receives over 200 million queries each day through its various services.
• As of January 2006, Google has indexed 9.7 billion web pages, 1.3 billion images, and over one billion Usenet messages; in total, approximately 12 billion items. It also caches much of the content that it indexes.
User Interfaces
Google Services and Tools
Source: http://en.wikipedia.org/wiki/List_of_Google_services_and_tools
How Google Works
Google finds important pages
• The idea is that the documents on the web have different degrees of "importance".
• Google will show the most important pages first.
• The idea is that more important pages are likely to be more relevant to any query than non-important pages.
Google
Google PageRank
Google System Features
[Figure: link diagram with nodes labeled A; T1, T2, ..., Tn; C1, C2, ..., Cm]
• PageRank
Example
• PageRank
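A minimal sketch of how a PageRank-style importance score could be computed. It assumes the formulation given in Brin and Page's original paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor (0.85 here). The toy graph and the function name pagerank are illustrative assumptions, not Google's implementation.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum the contributions of every page T that links to this
            # page, each divided by C(T), the number of links out of T.
            incoming = sum(
                rank[t] / len(links[t])
                for t in links
                if page in links[t]
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Toy graph: A is linked to by both B and C, so it should score highest.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Repeating the update until the scores stop changing is the simplest way to solve the recursive definition; a production system would need to handle pages with no outgoing links and far larger graphs.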
• For example, the word "civil" might occur in documents 3, 8, 22, 56, 68, and 92, while the word "war" might occur in documents 2, 8, 15, 22, 68, and 77.
• Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:
  1. Find the set of pages that contain the user's query somewhere
  2.
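The first step, finding the documents that contain every query word, amounts to intersecting posting lists. The sketch below uses illustrative document IDs for "civil" and "war"; the postings dictionary and the intersect function are assumptions for illustration, not Google's data structures.

```python
# Illustrative posting lists: word -> sorted list of document IDs.
postings = {
    "civil": [3, 8, 22, 56, 68, 92],
    "war": [2, 8, 15, 22, 68, 77],
}

def intersect(a, b):
    """Walk two sorted posting lists, keeping the IDs present in both."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

matches = intersect(postings["civil"], postings["war"])
print(matches)  # [8, 22, 68]
```

Because both lists are sorted, the intersection takes a single pass over each list rather than comparing every pair of IDs.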
Web crawler
• A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner.
• Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches.
• It starts with a list of URLs ...
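A crawler of this kind can be sketched as a breadth-first traversal from a list of starting URLs. To keep the example runnable offline, the WEB dictionary is a hypothetical stand-in for real pages and fetch is any function returning (text, links) for a URL; none of these names come from a real crawler.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: URL -> (text, links).
WEB = {
    "http://seed.example/": ("home", ["http://seed.example/a",
                                      "http://seed.example/b"]),
    "http://seed.example/a": ("page a", ["http://seed.example/b"]),
    "http://seed.example/b": ("page b", ["http://seed.example/"]),
}

def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first and keep a copy of each one."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid revisits
    store = {}                # URL -> copied page text
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:      # unreachable or unknown page
            continue
        text, links = page
        store[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store

copies = crawl(["http://seed.example/"], WEB.get)
print(list(copies))
```

The stored copies are what an indexer would process afterwards. A real crawler would also need politeness delays, robots.txt handling and URL normalization, which are omitted here.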
Google Search Engine Architecture
• URL Server - provides URLs to be fetched
• Crawler is distributed
• Store Server - compresses and stores pages
• Repository - holds pages for indexing
• Indexer - parses documents, records words, positions, font size, capitalization
• Lexicon - list of unique words found
• Barrels hold ...
• Anchors - keep information about links found in web pages
• URL Resolver - converts relative URLs to absolute
• Sorter - generates Doc Index
• Doc Index - inverted index of all words in all documents (except stop words)
• Links - stores info about links to each page (used for PageRank)
• PageRank - computes a rank for each page retrieved
• Searcher - answers queries
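Two of the components listed above are easy to illustrate: the URL Resolver and the Indexer. The sketch below uses the standard library's urljoin for resolution; the stop-word list, the function names and the example URLs are assumptions, and the index keeps only word positions (not font size or capitalization).

```python
from urllib.parse import urljoin

STOP_WORDS = {"the", "a", "of", "in"}  # illustrative stop-word list

def resolve(base_url, href):
    """URL Resolver step: convert a relative URL to an absolute one."""
    return urljoin(base_url, href)

def index_document(doc_id, text, inverted):
    """Indexer step: record each non-stop word with its position."""
    for position, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            inverted.setdefault(word, []).append((doc_id, position))

inverted = {}
index_document(1, "the history of the civil war", inverted)
absolute = resolve("http://example.com/docs/page.html", "../img/logo.png")

print(absolute)
print(inverted)
```

Recording positions alongside document IDs is what later allows phrase and proximity matching, since the searcher can check that query words occur next to each other.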
5. Web Directories
• Web directory: a classification of Web pages by subject.
• Principles:
  - Classification is by a hierarchical taxonomy.
  - Directory may be specific to a subject, a region, a language.
  - Pages are submitted and reviewed before they are included.
  - Automatic classification is not successful enough.
• Advantage:
  - If found, the answer will be useful in most cases.
• Disadvantages:
  - Classification is not specialized enough.
  - Not all Web pages are classified.
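A hierarchical taxonomy of this kind can be modeled as a tree of categories with reviewed pages at the leaves. The category names and URLs below are made up for illustration, and the nested-dictionary layout is just one possible representation.

```python
# Hypothetical taxonomy: nested dicts are categories, lists hold pages
# that were submitted and reviewed.
directory = {
    "Science": {
        "Computer Science": {
            "Information Retrieval": ["http://ir.example/"],
        },
    },
    "Regional": {"Europe": ["http://eu.example/"]},
}

def lookup(tree, path):
    """Follow a category path such as 'Science/Computer Science'."""
    node = tree
    for category in path.split("/"):
        node = node[category]
    return node

def all_pages(node):
    """Collect every page filed under a category or its subcategories."""
    if isinstance(node, list):
        return list(node)
    pages = []
    for child in node.values():
        pages.extend(all_pages(child))
    return pages

print(lookup(directory, "Regional/Europe"))
print(all_pages(directory["Science"]))
```

Browsing a directory corresponds to walking down one path of the tree, which is why a page is only found if an editor has filed it under a category the user thinks to open.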
6. Metasearchers
• Metasearcher: a web server that sends a given query to several search engines and Web directories, collects the answers and unifies them.
  - Examples: Metacrawler, Savvysearch, MetaSearch, Mamma.
• Advantages:
  - Combine the results of many sources.
  - Save users from the need to pose queries to multiple searchers.
  - Ability to sort the results by different attributes.
  - Pages retrieved by multiple searchers are more relevant.
  - Improve coverage: individual searchers cover a small fraction of the Web.
• Issues:
  - How to translate the given query to the specific language of each search engine?
  - How to rank the unified results?
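One simple way to approach the ranking issue is to score each URL by the sum of its reciprocal ranks across engines, so that pages returned by several searchers rise to the top. This is only one possible scheme, chosen here for illustration; the result lists and the function name merge_results are hypothetical and do not describe how any of the metasearchers named above work.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Unify ranked URL lists from several engines into one ranking."""
    score = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # A URL earns more for a high position and accumulates
            # credit from every engine that returned it.
            score[url] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(score, key=lambda u: (-score[u], u))

engine_a = ["http://x.example/", "http://y.example/"]
engine_b = ["http://y.example/", "http://z.example/"]
unified = merge_results([engine_a, engine_b])
print(unified)
```

Here http://y.example/ ranks first because both engines returned it, which matches the observation that pages retrieved by multiple searchers tend to be more relevant.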
Web “Agents”
Passive: Personalized Information Gatherer
Examples: BargainBot (Aoun 96), ShopBot (Etzioni et al 96)
Similar to MUC information extraction task
Identify ...
Active: Dialog with server
- Fills out product information forms interactively, specific to each site
- Use POST to submit data
- Analyze ...
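Submitting form data with POST can be sketched using the Python standard library. The request is only constructed here, not sent; the URL and the field names are hypothetical, since each site defines its own form.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_form_request(url, fields):
    """Build (but do not send) a POST request carrying form fields."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

request = build_form_request(
    "http://shop.example/search",                  # hypothetical site
    {"product": "cd player", "max_price": "100"},  # hypothetical fields
)
print(request.get_method(), request.data)
```

An agent would pass such a request to urllib.request.urlopen and then parse the returned page, which is the site-specific part of the work.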
Virtual Shopping
Web shopper
Book finder, CD finder
mortgage/loan rate negotiation
Stock trading
Bartering
Auctioning nonstandard goods
Examples of Web Agents
No fixed price
need for interactive value fixing
3 levels of interactive shopping
locate and purchase
negotiate
legal authority
Exchange of money
- Java marketplace (Awerbuch, Amir)
  - Negotiate for and sell value of CPU time
- Calendar apprentice
  - Meeting coordination
  - Constraint satisfaction and negotiation
  - have my calendar agent contact ...
Examples of Web Agents (cont.)