Chapter 6: Web Content Mining

Page 1: Chapter 6 Web Content Mining

Page 2: Web Mining

Web mining is the application of data mining techniques to the Web. It covers three areas:

1. web-usage mining
2. web-structure mining
3. web-content mining

Page 3: Web Usage Mining

Web-usage mining does not deal with the contents of web documents. Its goals are:

- to determine how a website's visitors use web resources
- to study their navigational patterns

The data used for web-usage mining is essentially secondary data.

Page 4: Web Structure Mining

Web-structure mining is concerned with the topology of the Web. It focuses on the data that organizes the content and facilitates navigation. The principal source of information is hyperlinks, which connect one page to another.

Chapter 8 presents web-structure mining.

Page 5: Web Content Mining

Web-content mining deals with primary data on the Web: the actual content of web documents. Its goal is to help users locate and extract information relevant to their needs. Web-content mining covers multiple data types: text, images, audio, and video. It also deals with crawling the Web and searching for information.

Page 6: Web Content Mining

Page 7: Web Content Mining

Web-content mining techniques are used to discover useful information from content on the Web:

- text
- audio
- video
- images
- metadata

Page 8: Origin of Web Data

Some web content is generated dynamically, using queries to database management systems. Other web content may be hidden from general users.

Page 9: Problems with Web Data

Web data poses several problems:

- distributed data
- large volume
- unstructured data
- redundant data
- variable quality of data
- an extreme percentage of volatile data
- varied data types

Page 10: Web Crawler

A web crawler is a computer program that navigates the hypertext structure of the Web. Crawlers are used to ease the formation of the indexes used by search engines. The page(s) that the crawler begins with are called the seed URLs. Every link from the first page is recorded and saved in a queue.

Page 11: Periodic Web Crawler

A periodic crawler builds an index by visiting a certain number of pages and then replaces the current index with the new one. It is known as a periodic crawler because it is activated periodically.

Page 12: Focused Web Crawlers

Focused crawlers are generally recommended because of the large size of the Web. A focused crawler visits only pages related to topics of interest. If a page is not pertinent, the entire set of possible pages below it is pruned.

Page 13: Web Crawler

The crawling process:

- Begin with a group of seed URLs, either submitted by users or taken from common URLs
- Traverse the links breadth-first or depth-first, extracting more URLs as pages are visited (see the sketch below)
- When numerous crawlers run at once, redundancy becomes a problem; one remedy is to partition the Web and assign one robot per partition
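A minimal sketch of a breadth-first crawl, assuming Python with the third-party requests and beautifulsoup4 packages (the packages and the max_pages cutoff are illustrative assumptions, not part of the slides):

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_urls, max_pages=100):
        """Breadth-first crawl: start from seed URLs, queue every link found."""
        queue = deque(seed_urls)
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            try:
                page = requests.get(url, timeout=5)
            except requests.RequestException:
                continue  # skip unreachable pages
            visited.add(url)
            # Extract more URLs from the page and record them in the queue
            soup = BeautifulSoup(page.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link not in visited:
                    queue.append(link)
        return visited

A depth-first variant would simply pop new URLs from the same end of the queue that links are appended to (a stack) instead of the opposite end.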

Page 14: Focused Crawler

The focused crawler structure consists of two major parts:

1. The distiller
2. The classifier

Page 15: The Distiller

The distiller verifies which pages contain links to other relevant pages; such pages are called hub pages. It identifies hypertext nodes that are considered good access points to more relevant pages (using the HITS algorithm).
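Since the slide attributes hub identification to HITS, here is a minimal sketch of the HITS hub/authority iteration, assuming Python and a toy link graph represented as a dict (the representation is an assumption for illustration):

    def hits(graph, iterations=20):
        """graph maps each page to the list of pages it links to.
        Returns (hub, authority) score dicts."""
        pages = set(graph) | {q for links in graph.values() for q in links}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # A page's authority is the sum of the hub scores of pages linking to it
            for p in pages:
                auth[p] = sum(hub[q] for q in graph if p in graph[q])
            # A page's hub score is the sum of the authority scores of the pages it links to
            for p in pages:
                hub[p] = sum(auth[q] for q in graph.get(p, []))
            # Normalize so the scores stay bounded
            for scores in (hub, auth):
                norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth

    # Example: hub, auth = hits({"a": ["b", "c"], "b": ["c"], "c": []})

Pages with high hub scores are the good access points the distiller is after.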

Page 16: The Hypertext Classifier

The hypertext classifier establishes a resource rating that estimates how advantageous it would be for the crawler to pursue the links out of a page. The classifier assigns a relevance score to each document with respect to the crawl topic, evaluating the relevance of hypertext documents for the given topic.

Page 17: Focused Crawler

The pages that the crawler visits are selected using a priority-based structure, governed by the priorities that the classifier and the distiller assign to pages.

Page 18: Focused Crawler: How It Works

1. The user identifies sample documents that are of interest.
2. The sample documents are classified against a hierarchical classification tree (a taxonomy).
3. These documents are used as the seed documents to begin the focused crawl.

Page 19: Focused Crawler

Each document is classified into a leaf node of the taxonomy tree.

One approach, hard focus, follows the links of a page only if some ancestor of its node has been marked as good.

Another approach, soft focus, computes the probability that a page d is relevant as

R(d) = Σ_{good(c)} P(c | d)

where c is a node in the tree (thus a class) and good(c) indicates that c has been labeled to be of interest. The priority for visiting a not-yet-visited page is the maximum of the relevance of the visited pages that point to it.
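A minimal sketch of the soft-focus relevance and visit priority, assuming Python; the data structures (a dict of class probabilities P(c | d) from some classifier, a set of good nodes, and a backlink map) are illustrative assumptions, not part of the slides:

    def relevance(class_probs, good_nodes):
        """Soft focus: R(d) = sum of P(c | d) over taxonomy nodes c marked good."""
        return sum(p for c, p in class_probs.items() if c in good_nodes)

    def visit_priority(page, backlinks, relevance_of_visited):
        """Priority of an unvisited page: the maximum relevance of the
        already-visited pages that point to it (0.0 if none do)."""
        sources = [q for q in backlinks.get(page, []) if q in relevance_of_visited]
        return max((relevance_of_visited[q] for q in sources), default=0.0)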

Page 20: Context Graph

Focused crawling research proposed the use of context graphs, which in turn led to the context focused crawler (CFC). The CFC performs crawling in two steps:

1. Context graphs and classifiers are constructed using a set of seed documents as a training set.
2. Crawling is performed using the classifiers to guide it.

How does it differ from the focused crawler? Context graphs are updated during the crawl.

Page 21: Context Graph

Page 22: Search Engines

Page 23: Search Engine

A search engine uses a 'spider' or 'crawler' that crawls the Web hunting for new or updated web pages to store in an index.

Page 24: Search Engine

Basic components of a search engine:

- The crawler/spider: gathers new or updated information from websites
- The index: stores the gathered information about websites
- The search software: searches through the huge index to generate an ordered list of useful search results

Page 25: Search Engine Mechanism

Page 26: Search Engines

The generic structure of all search engines is basically the same. However, the search results differ from search engine to search engine for the same search terms. Why?

Page 27: Responsibilities of Search Engines

- Document collection: choose the documents to be indexed
- Document indexing: represent the content of the selected documents
- Searching: translate the user's information need into a query and perform retrieval (search algorithms, ranking of web pages)
- Results: present the outcome

Page 28: Phases of Query Binding

Query binding is the process of translating a user's need into a search engine query.

Page 29: Phases of Query Binding

Query binding is a three-tier process:

1. The user formulates the information need into a question or a list of terms, using personal experience and vocabulary, and enters it into the search engine.
2. The search engine translates the words, possibly containing spelling errors, into processing tokens (a sketch follows this list).
3. The search engine uses the processing tokens to search the document database and retrieve the appropriate documents.
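A minimal sketch of step 2, assuming Python; the normalization choices (lowercasing, stripping punctuation, dropping a small stop-word list) are illustrative assumptions, not prescribed by the slides:

    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}  # assumed list

    def to_processing_tokens(query):
        """Translate a raw user query into normalized processing tokens."""
        words = re.findall(r"[a-z0-9]+", query.lower())  # lowercase, drop punctuation
        return [w for w in words if w not in STOP_WORDS]

    # to_processing_tokens("The Origin of Web data!") -> ['origin', 'web', 'data']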

Page 30: Types of Queries

- Boolean queries: connect the search words using logical operators such as AND or OR.
- Natural language queries: the user frames the query as a question or a statement.
- Thesaurus queries: the user selects the search term from a predefined set of terms determined by the retrieval system.

Page 31: Types of Queries (cont.)

- Fuzzy queries: require no exact specificity, handling misspellings and variations of the same word.
- Term searches: the most common type of query on the Web, in which the user provides a few words or phrases for the search.
- Probabilistic queries: the IR system retrieves documents ranked according to their estimated relevance.

Page 32: The Robot Exclusion

Why would developers prefer to exclude robots from parts of their websites? The robot exclusion protocol:

- indicates the restricted parts of a website to robots that visit the site
- gives crawlers/spiders ("robots") limited access to a website

Page 33: The Robot Exclusion

Website administrators and content providers can limit robot activity through two mechanisms:

- The Robots Exclusion Protocol: used by website administrators to specify which parts of the site should not be visited by a robot, by providing a file called robots.txt on their site.
- The Robots META tag: a special HTML META tag that can be used in any web page to indicate whether that page should be indexed or parsed for links.

Page 34: Example of the Robots META Tag

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

If a web page contains the above tag, a robot should neither index the document (indicated by NOINDEX) nor parse it for links (indicated by NOFOLLOW).

Page 35: The Robot Exclusion

Page 36: The Robot Exclusion

Page 37: Robots.txt

The line "User-agent: *" means the section applies to all robots. The line "Disallow: /" tells the robot that it should not visit any page on the site.

Page 38: Example-1

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

Page 39: Example-1 (cont.)

In this example, three directories are excluded. Note that you need a separate "Disallow" line for every URL prefix you want to exclude; you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines within a record, as blank lines are used to delimit multiple records.

Page 40: Example-2

To allow a single robot (all others are excluded):

User-agent: Google
Disallow:

User-agent: *
Disallow: /

What modifications to robots.txt would be needed if we wanted to exclude the Bing robot?
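One possible answer, sketched under the assumption that Bing's crawler identifies itself with the user-agent name bingbot:

    # Assumed user-agent name for Bing's crawler; all other robots remain allowed
    User-agent: bingbot
    Disallow: /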

Page 41: Important Considerations When Using robots.txt

- Robots can ignore your /robots.txt. In particular, malware robots that scan the Web for security vulnerabilities, and the email-address harvesters used by spammers, will pay no attention to it.
- The /robots.txt file is publicly available: anyone can see which sections of your server you don't want robots to use.
- It is therefore not advisable to use /robots.txt to hide information.

Page 42: More Examples

See the handout.

Page 43: Robots META Tag

robots.txt can only be managed by web administrators, whereas the META tag can be used by individual web-page authors. The robots META tag is placed in the <HEAD> section of the HTML page.

Page 44: Robot META Tag

<html>
<head>
<meta name="robots" content="noindex, nofollow">
...
<title>...</title>
</head>

Page 45: Content Terms

The content attribute accepts: ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW

- ALL = INDEX, FOLLOW
- NONE = NOINDEX, NOFOLLOW

Page 46: Content Combinations

<meta name="robots" content="index,follow"> is equivalent to <meta name="robots" content="all">

<meta name="robots" content="noindex,follow">

<meta name="robots" content="index,nofollow">

<meta name="robots" content="noindex,nofollow"> is equivalent to <meta name="robots" content="none">

Page 47: Exercise

Check whether the KSU website has a robot exclusion file, robots.txt.
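One way to check programmatically, as a sketch using Python's standard-library urllib.robotparser; the domain ksu.edu.sa is an assumption, so adjust it to the actual KSU website:

    from urllib.robotparser import RobotFileParser

    # Assumed location of the KSU website's robots.txt
    rp = RobotFileParser("https://ksu.edu.sa/robots.txt")
    rp.read()  # fetch and parse the file

    # May a generic robot fetch the home page?
    print(rp.can_fetch("*", "https://ksu.edu.sa/"))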

Page 48: Multimedia Information Retrieval

From the perspective of images and videos, a well-known content-based system for images is the Query by Image Content (QBIC) system. It uses:

- A three-dimensional color feature vector, where the distance measure is simple Euclidean distance.
- k-dimensional color histograms, where the bins of the histogram can be chosen by a partition-based clustering algorithm.
- A three-dimensional texture vector consisting of features that measure scale, directionality, and contrast; distance is computed as a weighted Euclidean distance measure, where the default weights are the inverse variances of the individual features (see the sketch below).
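A minimal sketch of the weighted Euclidean distance used for the texture vectors, assuming Python; the feature values and variances below are made-up placeholders for illustration:

    import math

    def weighted_euclidean(x, y, weights):
        """Weighted Euclidean distance between feature vectors x and y."""
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

    # Default weights: inverse variances of the individual texture features
    variances = [0.5, 2.0, 1.0]           # scale, directionality, contrast (example values)
    weights = [1.0 / v for v in variances]

    d = weighted_euclidean([0.2, 1.5, 0.7], [0.4, 1.1, 0.9], weights)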

Page 49: Multimedia Information Retrieval

The query can be expressed directly in terms of the feature representation itself. For instance: find images that are 40% blue in color and contain a texture with a specific coarseness property, or that have a specific layout.

Page 50: Multimedia Information Retrieval

An MIR system: www.hermitagemuseum.org/html_En/index.html

A QBIC layout search demo that gives a step-by-step demonstration of the search described in the text can be found at:
www.hermitagemuseum.org/fcgi-bin/db2www/qbicLayout.mac/qbic?selLang=English

Page 51: Multimedia Information Retrieval

As multimedia becomes a more extensively used data format, it is vital to deal with the issues of:

- metadata standards
- classification
- query matching
- presentation
- evaluation

in order to guarantee the development and deployment of efficient and effective multimedia information retrieval systems.