Unit-5 Searching the Web

Upload: rajeev-sahani

Post on 06-Jul-2018


  • 8/17/2019 Unit-5 Seraching the Web


    Searching the Web


    Agenda

    1. Introduction

    2. Challenges of Searching the Web

    3. Measuring the Web

    4. Search Engines (Google)

    5. Web Directories

    6. Metasearchers

    7. Google Searching Guidelines


    1. Introduction

    • The Web can be seen as a very large, unstructured but ubiquitous database.

    • So we need efficient tools to manage, retrieve and filter the information.

    • There are 3 different forms of searching the Web:

    1. Search Engines, which index a portion of Web pages as a full-text database.

    2. Web Directories, which classify selected Web documents by subject.

    3. Searching by hyperlink structure.


    2. Challenges of Searching the Web

    • Problems with the data itself:

    - Distributed data

    - High percentage of volatile data: it is estimated that 40% of the Web changes every month.

    - Unstructured and redundant data: no conceptual model, no organization, no constraints. By some estimates, about 30% of the Web is redundant.

    - Quality of data: data can be false, invalid, outdated, poorly written or with many errors.

    - Heterogeneous data: multiple media types, multiple formats, languages and alphabets.

    • Problems regarding the user and his interaction with the retrieval system:

    - How to specify a query

    - How to interpret the answer provided by the system.


    3. Measuring the Web


    4. Search Engine

    • A search engine is a program designed to help find information stored on a computer system such as the World Wide Web, or a personal computer.

    • The search engine allows one to ask for content meeting specific criteria and retrieves a list of references that match those criteria.

    • Two main architectures:

    1. Centralized: Using crawlers, information is gathered into a single site, where it is indexed; the site then processes all user queries.

    2. Distributed: Searching is a coordinated effort of many information gatherers and brokers.


    4.1 Centralized Architecture

    • Most search engines use a centralized crawler-indexer architecture.

    • Components: Crawlers, Index, Query Engine, and Interface.


    4.2 Distributed Architecture

    • Harvest is an example of a distributed architecture.

    • Main drawback: requires the coordination of several Web servers.

    • Components:

    1. Gatherers:

    - Extract information from the documents stored on one or more Web servers.

    - Can handle documents in many formats: HTML, PDF, Postscript, etc.

    2. Broker: provides the indexing mechanism and query interface.

    3.
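The division of labour between gatherers and a broker can be sketched as follows. This is a minimal illustration that uses in-memory dictionaries as stand-ins for Web servers; the class names `Gatherer` and `Broker`, their methods, and the sample URLs are hypothetical and are not Harvest's actual interfaces.

```python
from collections import defaultdict


class Gatherer:
    """Extracts indexing information (here: lowercase words) from the
    documents of one hypothetical Web server."""

    def __init__(self, server_docs):
        # server_docs maps a URL to the plain text of that document.
        self.server_docs = server_docs

    def gather(self):
        # Yield one (url, set_of_words) summary per document.
        for url, text in self.server_docs.items():
            yield url, set(text.lower().split())


class Broker:
    """Builds an index from gatherer summaries and answers queries."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> set of URLs

    def collect(self, gatherer):
        for url, words in gatherer.gather():
            for word in words:
                self.index[word].add(url)

    def query(self, word):
        return sorted(self.index.get(word.lower(), set()))


# Two gatherers, each responsible for a different (made-up) server.
g1 = Gatherer({"http://a.example/1": "web search engines",
               "http://a.example/2": "distributed systems"})
g2 = Gatherer({"http://b.example/1": "distributed web crawling"})

broker = Broker()
for g in (g1, g2):
    broker.collect(g)

print(broker.query("distributed"))
```

Each gatherer only touches its own server's documents, while the broker is the single place where queries are answered. This is the coordination cost the slide refers to.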


    4.3 About Google

    • The name "Google" is a play on the word "googol", which refers to the number represented by 1 followed by one hundred zeros.

    • Google receives over 200 million queries each day through its various services.

    • As of January 2006, Google had indexed 9.7 billion web pages, 1.3 billion images, and over one billion Usenet messages; in total, approximately 12 billion items. It also caches much of the content that it indexes.


    User Interfaces


    Google Services and Tools

    Source: http://en.wikipedia.org/wiki/List_of_Google_services_and_tools


    How Google Works


    Google finds important pages

    • The idea is that the documents on the web have different degrees of "importance".

    • Google will show the most important pages first.

    • The idea is that more important pages are likely to be more relevant to any query than non-important pages.
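One common way to quantify "importance" is link-based: a page is important if important pages link to it. The sketch below is a simplified power-iteration version of PageRank. It is an assumption about how such a score could be computed, not a description of Google's production method. The toy graph, the damping factor of 0.85 and the function name `pagerank` are illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to.
    Every link target must also appear as a key of `links`.
    Returns a dict of page -> score; the scores sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Tiny made-up graph: every other page links to "hub".
graph = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
}
scores = pagerank(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph "hub" collects links from all other pages and ends up ranked first, which matches the intuition that heavily linked pages are shown first.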


    Google


    Google Page


    Google System Features

    [Figure: diagram of page A with nodes T1, T2, ..., Tn and C1, C2, ..., Cm]

    • Page


    Example

    • Page


    • For example, the word "civil" might occur in documents 3, 8, 22, 56, 68, and 92, while the word "war" might occur in documents 2, 8, 15, 22, 68, and 77.

    • Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:

    1. Find the set of pages that contain the user's query somewhere

    2.
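Step 1 can be sketched with an inverted index that maps each word to a sorted list of document IDs; finding pages that contain every query word is then an intersection of those lists. The document IDs below are illustrative values, and the function names `intersect` and `find_pages` are hypothetical. Scoring the matches is not shown.

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted lists of document IDs."""
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


def find_pages(query, index):
    """Return IDs of documents containing every word of the query."""
    postings = [index.get(word, []) for word in query.lower().split()]
    if not postings:
        return []
    result = postings[0]
    for other in postings[1:]:
        result = intersect(result, other)
    return result


# Illustrative inverted index: word -> sorted document IDs.
index = {
    "civil": [3, 8, 22, 56, 68, 92],
    "war": [2, 8, 15, 22, 68, 77],
}

matches = find_pages("civil war", index)
print(matches)
```

Because both lists are sorted, the intersection needs only a single pass over each list rather than a comparison of every pair.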


    Web crawler

    • A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner.

    • Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches.

    • It starts with a list of U
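A methodical, automated traversal of this kind can be sketched as a breadth-first walk over links. The example below assumes the crawler begins from a list of starting addresses and uses an in-memory dictionary (`toy_web`) in place of real HTTP requests. The names `crawl`, `fetch_links` and `max_pages` are hypothetical.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl.

    seeds       -- list of starting addresses
    fetch_links -- function returning the addresses linked from a page
    max_pages   -- stop after visiting this many pages
    Returns the pages in the order they were visited.
    """
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds
    frontier = deque(unique_seeds)             # pages waiting to be visited
    seen = set(unique_seeds)                   # pages already queued
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)  # a real crawler would store the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# In-memory stand-in for the Web: page -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

order = crawl(["a"], lambda url: toy_web.get(url, []))
print(order)
```

The `seen` set is what keeps the crawler from looping forever on cycles such as a → c → a.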


    Google Search Engine Architecture

    • URL Server - provides URLs to be fetched

    • Crawler is distributed

    • Store Server - compresses and stores pages

    • Repository - holds pages for indexing

    • Indexer - parses documents, records words, positions, font size, capitalization

    • Lexicon - list of unique words found

    • Barrels hold

    • Anchors - keep information about links found in web pages

    • URL Resolver - converts relative URLs to absolute

    • Sorter - generates Doc Index

    • Doc Index - inverted index of all words in all documents (except stop words)

    • Links - stores info about links to each page (used for Pagerank)

    • Pagerank - computes a rank for each page retrieved

    • Searcher - answers queries
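As one concrete illustration of a component in this list, the URL Resolver's job of turning relative links into absolute ones can be sketched with Python's standard `urllib.parse` module. The helper name `resolve` and the example URLs are hypothetical; dropping the `#fragment` is an extra assumption, made because a fragment points inside the same document.

```python
from urllib.parse import urljoin, urldefrag


def resolve(base_url, href):
    """Convert a link found on the page at base_url into an absolute URL."""
    absolute = urljoin(base_url, href)
    # Remove any #fragment so the same page is not recorded twice.
    return urldefrag(absolute).url


base = "http://example.com/docs/index.html"

print(resolve(base, "page2.html#top"))
print(resolve(base, "../img/logo.png"))
print(resolve(base, "/about"))
print(resolve(base, "http://other.example/x"))
```

Links that are already absolute pass through unchanged, so the same function can be applied to every anchor on a page.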


    5. Web Directories

    • Web directory: A classification of Web pages by subject.

    • Principles:

    - Classification is by a hierarchical taxonomy.

    - Directory may be specific to a subject, a region, a language.

    - Pages are submitted and reviewed before they are included.

    - Automatic classification is not successful enough.

    • Advantage:

    - If found, the answer will be useful in most cases.

    • Disadvantage:

    - Classification is not specialized enough.

    - Not all Web pages are classified.
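A hierarchical taxonomy of this kind can be pictured as a tree whose inner nodes are categories and whose leaves are lists of reviewed pages. The sketch below uses nested dictionaries; the category names, URLs and the helpers `browse` and `all_pages` are made up for illustration.

```python
# Inner nodes are dicts of subcategories; leaves are lists of page URLs.
directory = {
    "Computers": {
        "Internet": {
            "Searching": ["http://search.example/"],
        },
    },
    "Science": {
        "Physics": ["http://physics.example/"],
    },
}


def browse(tree, path):
    """Follow a category path such as 'Computers/Internet' and return
    the subtree (or page list) found there."""
    node = tree
    for category in path.split("/"):
        node = node[category]
    return node


def all_pages(node):
    """Collect every page listed at or below a node."""
    if isinstance(node, list):
        return list(node)
    pages = []
    for child in node.values():
        pages.extend(all_pages(child))
    return pages


print(browse(directory, "Computers/Internet/Searching"))
print(all_pages(directory["Computers"]))
```

A user narrows the search by descending one category at a time, which is why results tend to be useful when found but coverage is limited to what has been classified.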


    6. Metasearchers

    • Metasearcher: Web server that sends a given query to several search engines and Web directories, collects the answers and unifies them.

    - Examples: Metacrawler, Savvysearch, MetaSearch, Mamma.

    • Advantages:

    - Combine the results of many sources.

    - Save users from the need to pose queries to multiple searchers.

    - Ability to sort the results by different attributes.

    - Pages retrieved by multiple searchers are more relevant.

    - Improve coverage: individual searchers cover a small fraction of the Web.

    • Issues:

    - How to translate the given query to the specific language of each search engine?

    - How to rank the unified results?
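The ranking issue has many possible answers. One simple option, shown here only as an assumption and not as the method of any metasearcher named above, is reciprocal rank fusion: each engine's result contributes 1/(k + rank) to a page's score, so pages returned by several engines rise to the top. The result lists, the constant `k = 60` and the function name `fuse` are illustrative.

```python
from collections import defaultdict


def fuse(result_lists, k=60):
    """Merge ranked result lists using reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken by URL for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Made-up answers from three hypothetical search engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
engine_c = ["u2", "u3", "u5"]

merged = fuse([engine_a, engine_b, engine_c])
print(merged)
```

Here "u2" is returned by all three engines and is ranked first, which reflects the advantage listed above that pages retrieved by multiple searchers are more relevant.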


    cs466-26 30

    Web “Agents”

    Passive: Personalized Information Gatherer

    Examples: BargainBot (Aoun 96), ShopBot (Etzioni et al 96)

    Similar to MUC information extraction task

    - Identif


    Active: Dialog with Server

    - Fills out product information forms interactively, specific to each site

    - Use POST to submit data

    - Anal


    Examples of Web Agents

    Virtual Shopping

    - Web shopper

    - Book finder, CD finder

    - mortgage/loan rate negotiation

    - Stock trading

    - Bartering, Auctioning nonstandard goods

    No fixed price: need for interactive value fixing

    3 levels of interactive shopping:

    - locate and purchase

    - negotiate

    - legal authority

    Exchange of mone


    Examples of Web Agents (cont.)

    - Java marketplace (Awerbuch, Amir)

    - Negotiate for and sell value of CPU time

    - Calendar apprentice

    - Meeting coordination

    - Constraint satisfaction and negotiation

    have my calendar agent contact