
A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm

2002.09.04
Dongwon Lee

Database Systems Lab.


Contents

• Introduction
• Crawler
• Problem of Web Crawler
• URL Ordering
• Focused Crawler
• Strategy for URL Ordering
• Ordering Model
• WIE*
• Conclusion


Introduction(1)

• Search engine
• Gathering web resources
• Shortcomings of search engines
  – Large answer set
  – Low precision
  – Destroying the hypertext structures of matching hyperdocuments
  – General concept queries
• Topic-driven web crawling


Introduction(2)

• Topic-driven crawling can reduce the search space and so improve the efficiency of a web search engine.
• It can be applied to special-purpose search engines.
  – Ex) Medical information retrieval, travel information retrieval, biology information retrieval


Web Crawler(1)

• A program that automatically traverses the Web via hyperlinks embedded in hypertext, newsgroup listings, directory structures, or database schemas.
• Gathers resources from the Web
• To ensure an index is kept as up to date as possible
• To achieve the broadest possible coverage of the Web


Web Crawler(2)

• Retrieving Module
• Processing Module
• Formatting Module
• URL Listing Module
• The order of traversing
  – Breadth-first
  – Depth-first
  – Better pages first
• How frequently the index is updated
  – Stars in the sky view

[Figure: crawler architecture. Labels: World Wide Web, Database, Retrieving Module, Processing Module, Formatting Module, URL Listing Module]
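
The three traversal orders listed on this slide differ only in how the next URL is taken from the crawl frontier. The sketch below illustrates that on an in-memory link graph. The names `crawl_order`, `toy_web`, and `scores` are hypothetical, the graph stands in for real page fetching, and the per-page score is an assumed stand-in for whatever "better page" measure a crawler uses.

```python
import heapq
from collections import deque


def crawl_order(graph, seed, strategy="bfs", score=None):
    """Return the order in which pages reachable from `seed` are visited.

    graph: dict mapping a page to the pages it links to (toy stand-in for the Web).
    strategy: "bfs", "dfs", or "best" (better pages first, ranked by `score`).
    """
    score = score or (lambda page: 0.0)
    visited, order = set(), []
    tie = 0  # insertion counter so equal scores pop in insertion order
    frontier = [(-score(seed), tie, seed)] if strategy == "best" else deque([seed])
    while frontier:
        if strategy == "bfs":
            page = frontier.popleft()          # oldest URL first
        elif strategy == "dfs":
            page = frontier.pop()              # newest URL first
        else:
            page = heapq.heappop(frontier)[2]  # highest-scoring URL first
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for link in graph.get(page, []):
            if link in visited:
                continue
            if strategy == "best":
                tie += 1
                heapq.heappush(frontier, (-score(link), tie, link))
            else:
                frontier.append(link)
    return order


# Hypothetical five-page web and made-up quality scores.
toy_web = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
scores = {"A": 0.0, "B": 0.1, "C": 0.9, "D": 0.8, "E": 0.05}
```

Calling `crawl_order(toy_web, "A", "best", scores.get)` visits the high-scoring page C before B, whereas breadth-first order visits B first.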


Problem of Web Crawler

• How frequently the index is updated
  – How old is the index entry for a Web page
    • Varies a lot: one day to two months
    • Stars in the sky view
  – Percentage of invalid links: 2-9%
• Not able to visit every possible page
  – Must periodically revisit pages
  – Storage capacity
  – Network bandwidth
• Shortcomings of information retrieval systems
  – Too large an answer set
  – Low precision
  – Etc.


URL Ordering

• BFS, DFS
• URL Ordering
  – Document similarity measure
  – PageRank
    • Inlinks/Outlinks
    • $PR(A) = (1-d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}$
  – URL measure
    • “.com”, “home”, “www”
• J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proc. 7th Intl. World Wide Web Conference, Brisbane, Australia, 1998.
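
The PageRank measure on this slide, PR(A) = (1-d) + d Σ PR(T_i)/C(T_i), can be evaluated by repeated substitution. The sketch below is one way to do that on a small link graph. The function name `pagerank`, the damping value d = 0.85, and the fixed iteration count are assumptions, not values given on the slide.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links: dict mapping each page to the list of pages it links to.
    C(T) is the number of outgoing links of T.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_count = {p: len(links.get(p, [])) for p in pages}

    # Invert the graph: for each page, which pages link to it.
    inlinks = {p: [] for p in pages}
    for src, targets in links.items():
        for t in targets:
            inlinks[t].append(src)

    pr = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inlinks[p])
            for p in pages
        }
    return pr


# Two pages with no inlinks both point at C, so C collects rank from each.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
```

A crawler that orders URLs by this measure would fetch C before A or B, since C ends up with the largest value.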


Focused Crawler

• Text Categorization Method
  – Naïve Bayes Algorithm
• Distiller
  – Authority
  – Hub
• Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a new approach to topic-specific Web resource discovery. 8th World Wide Web Conference.
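
The authority and hub scores named under the distiller can be illustrated with the basic hub/authority iteration (Kleinberg's HITS). This is a sketch of the plain, unweighted version only. The distiller in the cited paper may weight links differently, and the function name `hits` and the iteration count are assumptions.

```python
import math


def hits(links, iterations=20):
    """Basic hub/authority iteration on a link graph.

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) score dicts, each L2-normalized.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {p: v / norm for p, v in scores.items()}

    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        auth = normalize(auth)
        # A page is a good hub if it points to good authorities.
        hub = normalize({p: sum(auth[t] for t in links.get(p, [])) for p in pages})
    return hub, auth


# Hypothetical graph: H1 and H2 are link lists, A and B are content pages.
hub_scores, auth_scores = hits({"H1": ["A", "B"], "H2": ["A"]})
```

Here A receives links from both hubs and ends up as the top authority, while H1, which points at both content pages, ends up as the top hub.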


Strategy for URL Ordering

• A page is better if it has many outgoing links
• A page is better if its contents are more relevant to a certain topic
• The importance rate of the pages already passed influences the importance rate of the next page
• A Web Crawler using Hyperlink Structure and Hypertext Categorization Method, The 17th KIPS Spring Conference


Ordering Model(1)

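• Definitions
  – $P_k = \{\, p_i \mid p_i \text{ is a document that is passed by } p_k \,\}$
  – $L_k = \{\, p_i \mid p_i \text{ is a document that is linked from } p_k \,\}$
  – $C(p_i)$ = probability that $p_i$ is classified into the class
  – $d(p_i)$ = depth from $p_k$ to $p_i$
• Importance
  – $IPL(p_k) = C(p_k) \left( \frac{\#\text{ of outgoing links}}{\text{Max } \#\text{ of outgoing links}} \right)$
  – [Equation garbled in extraction: a sum over the passed documents $p_i \in P_k$ of a term built from $IPL(p)$, $C(p)$, and the # of outgoing links of $p$]

[Figure: example link tree. P1 at the top level; P2, P3, P4 at the second level; P5, P4, P5, P6 at the third level; depth labeled d(p_i)]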
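
The link-count term of the ordering model scales a page's topic relevance C(p) by its number of outgoing links relative to the maximum. The sketch below covers only that term and leaves out the contribution of previously passed pages. The names `initial_importance` and `rank_by_importance`, and the sample relevance values, are hypothetical.

```python
def initial_importance(relevance, n_outlinks, max_outlinks):
    """C(p) * (# of outgoing links / Max # of outgoing links).

    relevance: classifier probability C(p) that the page belongs to the topic.
    Returns 0.0 when no page has outgoing links, to avoid dividing by zero.
    """
    if max_outlinks <= 0:
        return 0.0
    return relevance * (n_outlinks / max_outlinks)


def rank_by_importance(pages):
    """Order pages by the term above, most important first.

    pages: dict mapping a URL to (relevance, number of outgoing links).
    """
    max_outlinks = max((n for _, n in pages.values()), default=0)
    scored = {
        url: initial_importance(rel, n, max_outlinks)
        for url, (rel, n) in pages.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Made-up pages: (relevance, outgoing links).
sample = {"p1": (0.9, 2), "p2": (0.6, 10), "p3": (0.8, 5)}
ordering = rank_by_importance(sample)
```

With these made-up numbers p2 ranks first: its relevance is lowest, but it has the most outgoing links, which is the trade-off the two "better page" criteria imply.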


Ordering Model(2)

[Figure: topic-specific crawler architecture. Components: WWW, Retrieving Module, Classifier Module, Link Module, Filter Module, URL Listing Module, Indexer, Index]


Image Crawler

• Filter useless images
  – Bullets, background images, …
• Find semantics for indexing
  – Use the document that contains the images
• Generate metadata automatically
  – MPEG-7


WIE*

• Automatic web page traversal and content extraction

• HPS (hyperlink prioritization search)
• Mechanism
  – Identify a URL
  – Travel the Web from a provided URL
  – Extract and collect information pieces by paragraph


WIE*-HPS(1)

• Notation
  – D = {d | d is a web page}
  – P = {p | p is a paragraph in d}
  – W = {w | w is a word in p}
  – l(d,p,w) is a hyperlink with a (d,p,w) status
  – L(D,P,W) is the hyperlink category to which l(d,p,w) belongs


WIE*-HPS(2)

• Notation
  – D1 = {d | d is an element of D and contains search keyword(s)}
  – D0 = {d | d is an element of D and not an element of D1}
  – P1 = {p | p is an element of P and contains search keyword(s)}
  – P0 = {p | p is an element of P and not an element of P1}
  – W1 = {w | w is an element of W and contains search keyword(s)}
  – W0 = {w | w is an element of W and not an element of W1}


WIE*-HPS(3)

• Link category
• LC7 > LC6 > LC4 > LC0

Link category   Page   Paragraph   Link
LC0             D0     P0          W0
LC1             D0     P0          W1
LC2             D0     P1          W0
LC3             D0     P1          W1
LC4             D1     P0          W0
LC5             D1     P0          W1
LC6             D1     P1          W0
LC7             D1     P1          W1
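
The eight link categories read as a three-bit code: page match (D), paragraph match (P), and link-word match (W). The sketch below assigns a category on that reading. The names `link_category` and `classify_link`, the case-insensitive substring test, and the use of the link's anchor text as its "word" are assumptions made for illustration.

```python
def link_category(page_matches, paragraph_matches, link_matches):
    """Map (D, P, W) match flags to the category number 0..7 (LC0..LC7)."""
    return (int(page_matches) << 2) | (int(paragraph_matches) << 1) | int(link_matches)


def classify_link(page_text, paragraph_text, anchor_text, keywords):
    """Assign a link category from the texts surrounding a hyperlink.

    Matching is a case-insensitive substring test (an assumption; the slides
    do not say how keyword containment is decided).
    """
    lowered = [k.lower() for k in keywords]

    def contains(text):
        text = text.lower()
        return any(k in text for k in lowered)

    return link_category(
        contains(page_text), contains(paragraph_text), contains(anchor_text)
    )


# The page mentions the keyword, but the link's paragraph and anchor do not.
category = classify_link(
    page_text="A survey of crawler designs. See also the site index.",
    paragraph_text="See also the site index.",
    anchor_text="site index",
    keywords=["crawler"],
)
```

The slide states only the priority LC7 > LC6 > LC4 > LC0. Ordering all eight categories by their number would be a further assumption.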


WIE*-HPS(4)

• Trend
  – Global trend
    • $TG = \frac{\#\text{ of matched webpages in the entire visits}}{\#\text{ of webpages in the entire visits}}$
  – Recent trend
    • $TR = \frac{\#\text{ of matched webpages in the recent visits}}{\#\text{ of webpages in the entire visits}}$
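
Both trend ratios can be maintained incrementally while crawling. The sketch below follows the slide's wording, where both ratios are divided by the number of pages in the entire visit history. The class name `TrendTracker` and the `window` parameter, which defines how many visits count as "recent", are assumptions, since the slides do not define the recent window.

```python
from collections import deque


class TrendTracker:
    """Track the global trend (TG) and recent trend (TR) over page visits."""

    def __init__(self, window=10):
        self.total = 0                      # pages visited so far
        self.matched = 0                    # visited pages that matched
        self.recent = deque(maxlen=window)  # match flags of the last `window` visits

    def record(self, matched):
        """Register one visited page and whether it matched the keywords."""
        self.total += 1
        self.matched += int(bool(matched))
        self.recent.append(bool(matched))

    def global_trend(self):
        # matched pages in the entire visits / pages in the entire visits
        return self.matched / self.total if self.total else 0.0

    def recent_trend(self):
        # matched pages in the recent visits / pages in the entire visits
        return sum(self.recent) / self.total if self.total else 0.0


tracker = TrendTracker(window=2)
for hit in (True, False, True, True):
    tracker.record(hit)
```

After these four visits the tracker holds three matches overall and two within the last two visits.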


WIE*-HPS(5)

• Termination Scheme
  – Comparing global trend and recent trend
  – No more hyperlinks
  – By the user’s decision
• Information Extraction
  – Extract paragraphs containing the keyword
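
The information extraction step, keeping the paragraphs that contain a keyword, can be sketched on plain text. This assumes paragraphs are separated by blank lines and that matching is a case-insensitive substring test. A real page would first need its HTML parsed, which is not shown. The function name `extract_paragraphs` is hypothetical.

```python
import re


def extract_paragraphs(text, keywords):
    """Return the paragraphs of `text` that contain at least one keyword.

    Paragraphs are taken to be blocks separated by blank lines (an assumption).
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    lowered = [k.lower() for k in keywords]
    return [p for p in paragraphs if any(k in p.lower() for k in lowered)]


page = (
    "Web crawlers gather pages.\n\n"
    "Unrelated text.\n\n"
    "A focused CRAWLER follows topics."
)
hits_found = extract_paragraphs(page, ["crawler"])
```

The middle paragraph is dropped because it never mentions the keyword. The other two are kept in document order.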


Further work & Discussion

• Intelligent WIE*
  – Learned WIE*
    • Fast learning mechanism
    • Reinforcement learning
    • Hypertext categorization
  – Advanced information extraction technique
  – Personalization technique
• Using link information
  – Hub and Authority
• Client-side light application(?)
  – Ex) Web browser plug-in


Information Retrieval System

[Figure: information retrieval system architecture. Components: User Interface (Query and Feedback), Query Processing, Crawling, Indexing, Document Repository, Search Engine, Knowledge Base, Learning, Inference Engine, Ad Hoc Information. Underlying technologies: Large Text (Multimedia) Database Tech., Data (Text) Mining Tech.]
