Chapter 6: Web Content Mining

Page 1: Chapter 6 Web Content Mining

Page 2: Web Mining

Web mining is the application of data mining techniques to the Web. It covers three areas:

1. web-usage mining
2. web-structure mining
3. web-content mining

Page 3: Web Usage Mining

Web-usage mining does not deal with the contents of web documents. Its goals are:

- to determine how a website's visitors use web resources
- to study their navigational patterns

The data used for web-usage mining is essentially secondary data.

Page 4: Web Structure Mining

Web-structure mining is concerned with the topology of the Web. It focuses on the data that organizes the content and facilitates navigation. The principal source of information is hyperlinks, which connect one page to another.

Chapter 8 presents web-structure mining.

Page 5: Web Content Mining

Web-content mining deals with primary data on the Web: the actual content of web documents. Its goal is to help users locate and extract information relevant to their needs. Web-content mining covers multiple data types: text, images, audio, and video. It also deals with crawling the Web and searching for information.

Page 6: Web Content Mining

Page 7: Web Content Mining

Web-content mining techniques are used to discover useful information from content on the Web:

- text
- audio
- video
- images
- metadata

Page 8: Origin of Web Data

Some web content is generated dynamically, using queries to database management systems. Other web content may be hidden from general users.

Page 9: Problems with Web Data

Web data poses several problems:

- distributed data
- large volume
- unstructured data
- redundant data
- variable quality of data
- an extreme percentage of volatile data
- varied data types

Page 10: Web Crawler

A web crawler is a computer program that navigates the hypertext structure of the Web. Crawlers are used to ease the formation of the indexes used by search engines. The page(s) that the crawler begins with are called the seed URLs. Every link from the first page is recorded and saved in a queue.

Page 11: Periodic Web Crawler

A periodic crawler builds an index by visiting a certain number of pages and then replaces the current index with the new one. It is known as a periodic crawler because it is activated periodically.

Page 12: Focused Web Crawlers

Focused crawlers are generally recommended because of the large size of the Web. A focused crawler visits only pages related to topics of interest. If a page is not pertinent, the entire set of possible pages below it is pruned.

Page 13: Web Crawler

The crawling process:

- Begin with a group of seed URLs, either submitted by users or taken from common URLs
- Traverse the links breadth-first or depth-first, extracting more URLs as pages are visited (see the sketch below)
- When numerous crawlers run at once, redundancy becomes a problem; one remedy is to partition the Web and assign one robot per partition
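A minimal sketch of a breadth-first crawl, assuming Python with the third-party requests and beautifulsoup4 packages (the packages and the max_pages cutoff are illustrative assumptions, not part of the slides):

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_urls, max_pages=100):
        """Breadth-first crawl: start from seed URLs, queue every link found."""
        queue = deque(seed_urls)
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            try:
                page = requests.get(url, timeout=5)
            except requests.RequestException:
                continue  # skip unreachable pages
            visited.add(url)
            # Extract more URLs from the page and record them in the queue
            soup = BeautifulSoup(page.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link not in visited:
                    queue.append(link)
        return visited

A depth-first variant would simply pop new URLs from the same end of the queue that links are appended to (a stack) instead of the opposite end.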

Page 14: Focused Crawler

The focused crawler structure consists of two major parts:

1. The distiller
2. The classifier

Page 15: The Distiller

The distiller verifies which pages contain links to other relevant pages; such pages are called hub pages. It identifies hypertext nodes that are considered good access points to more relevant pages (using the HITS algorithm).
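Since the slide attributes hub identification to HITS, here is a minimal sketch of the HITS hub/authority iteration, assuming Python and a toy link graph represented as a dict (the representation is an assumption for illustration):

    def hits(graph, iterations=20):
        """graph maps each page to the list of pages it links to.
        Returns (hub, authority) score dicts."""
        pages = set(graph) | {q for links in graph.values() for q in links}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # A page's authority is the sum of the hub scores of pages linking to it
            for p in pages:
                auth[p] = sum(hub[q] for q in graph if p in graph[q])
            # A page's hub score is the sum of the authority scores of the pages it links to
            for p in pages:
                hub[p] = sum(auth[q] for q in graph.get(p, []))
            # Normalize so the scores stay bounded
            for scores in (hub, auth):
                norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth

    # Example: hub, auth = hits({"a": ["b", "c"], "b": ["c"], "c": []})

Pages with high hub scores are the good access points the distiller is after.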

Page 16: The Hypertext Classifier

The hypertext classifier establishes a resource rating that estimates how advantageous it would be for the crawler to pursue the links out of a page. The classifier assigns a relevance score to each document with respect to the crawl topic, evaluating the relevance of hypertext documents for the given topic.

Page 17: Focused Crawler

The pages that the crawler visits are selected using a priority-based structure, governed by the priorities that the classifier and the distiller assign to pages.

Page 18: Focused Crawler: How It Works

1. The user identifies sample documents that are of interest.
2. The sample documents are classified against a hierarchical classification tree (a taxonomy).
3. These documents are used as the seed documents to begin the focused crawl.

Page 19: Focused Crawler

Each document is classified into a leaf node of the taxonomy tree.

One approach, hard focus, follows the links of a page only if some ancestor of its node has been marked as good.

Another approach, soft focus, computes the probability that a page d is relevant as

R(d) = Σ_{good(c)} P(c | d)

where c is a node in the tree (thus a class) and good(c) indicates that c has been labeled to be of interest. The priority for visiting a not-yet-visited page is the maximum of the relevance of the visited pages that point to it.
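A minimal sketch of the soft-focus relevance and visit priority, assuming Python; the data structures (a dict of class probabilities P(c | d) from some classifier, a set of good nodes, and a backlink map) are illustrative assumptions, not part of the slides:

    def relevance(class_probs, good_nodes):
        """Soft focus: R(d) = sum of P(c | d) over taxonomy nodes c marked good."""
        return sum(p for c, p in class_probs.items() if c in good_nodes)

    def visit_priority(page, backlinks, relevance_of_visited):
        """Priority of an unvisited page: the maximum relevance of the
        already-visited pages that point to it (0.0 if none do)."""
        sources = [q for q in backlinks.get(page, []) if q in relevance_of_visited]
        return max((relevance_of_visited[q] for q in sources), default=0.0)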

Page 20: Context Graph

Focused crawling research proposed the use of context graphs, which in turn led to the context focused crawler (CFC). The CFC performs crawling in two steps:

1. Context graphs and classifiers are constructed using a set of seed documents as a training set.
2. Crawling is performed using the classifiers to guide it.

How does it differ from the focused crawler? Context graphs are updated during the crawl.

Page 21: Context Graph

Page 22: Search Engines

Page 23: Search Engine

A search engine uses a 'spider' or 'crawler' that crawls the Web hunting for new or updated web pages to store in an index.

Page 24: Search Engine

Basic components of a search engine:

- The crawler/spider: gathers new or updated information from websites
- The index: stores the gathered information about websites
- The search software: searches through the huge index to generate an ordered list of useful search results

Page 25: Search Engine Mechanism

Page 26: Search Engines

The generic structure of all search engines is basically the same. However, the search results differ from search engine to search engine for the same search terms. Why?

Page 27: Responsibilities of Search Engines

- Document collection: choose the documents to be indexed
- Document indexing: represent the content of the selected documents
- Searching: translate the user's information need into a query and perform retrieval (search algorithms, ranking of web pages)
- Results: present the outcome

Page 28: Phases of Query Binding

Query binding is the process of translating a user's need into a search engine query.

Page 29: Phases of Query Binding

Query binding is a three-tier process:

1. The user formulates the information need into a question or a list of terms, using personal experience and vocabulary, and enters it into the search engine.
2. The search engine translates the words, possibly containing spelling errors, into processing tokens (a sketch follows this list).
3. The search engine uses the processing tokens to search the document database and retrieve the appropriate documents.
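A minimal sketch of step 2, assuming Python; the normalization choices (lowercasing, stripping punctuation, dropping a small stop-word list) are illustrative assumptions, not prescribed by the slides:

    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}  # assumed list

    def to_processing_tokens(query):
        """Translate a raw user query into normalized processing tokens."""
        words = re.findall(r"[a-z0-9]+", query.lower())  # lowercase, drop punctuation
        return [w for w in words if w not in STOP_WORDS]

    # to_processing_tokens("The Origin of Web data!") -> ['origin', 'web', 'data']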

Page 30: Types of Queries

- Boolean queries: connect the search words using logical operators such as AND or OR.
- Natural language queries: the user frames the query as a question or a statement.
- Thesaurus queries: the user selects the search term from a predefined set of terms determined by the retrieval system.

Page 31: Types of Queries (cont.)

- Fuzzy queries: require no exact specificity, handling misspellings and variations of the same word.
- Term searches: the most common type of query on the Web, in which the user provides a few words or phrases for the search.
- Probabilistic queries: the IR system retrieves documents ranked according to their estimated relevance.

Page 32: The Robot Exclusion

Why would developers prefer to exclude robots from parts of their websites? The robot exclusion protocol:

- indicates the restricted parts of a website to robots that visit the site
- gives crawlers/spiders ("robots") limited access to a website

Page 33: The Robot Exclusion

Website administrators and content providers can limit robot activity through two mechanisms:

- The Robots Exclusion Protocol: used by website administrators to specify which parts of the site should not be visited by a robot, by providing a file called robots.txt on their site.
- The Robots META tag: a special HTML META tag that can be used in any web page to indicate whether that page should be indexed or parsed for links.

Page 34: Example of the Robots META Tag

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

If a web page contains the above tag, a robot should neither index the document (indicated by NOINDEX) nor parse it for links (indicated by NOFOLLOW).

Page 35: The Robot Exclusion

Page 36: The Robot Exclusion

Page 37: Robots.txt

The line "User-agent: *" means the section applies to all robots. The line "Disallow: /" tells the robot that it should not visit any page on the site.

Page 38: Example-1

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

Page 39: Example-1 (cont.)

In this example, three directories are excluded. Note that you need a separate "Disallow" line for every URL prefix you want to exclude; you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines within a record, as blank lines are used to delimit multiple records.

Page 40: Example-2

To allow a single robot (all others are excluded):

User-agent: Google
Disallow:

User-agent: *
Disallow: /

What modifications to robots.txt would be needed if we wanted to exclude the Bing robot?
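One possible answer, sketched under the assumption that Bing's crawler identifies itself with the user-agent name bingbot:

    # Assumed user-agent name for Bing's crawler; all other robots remain allowed
    User-agent: bingbot
    Disallow: /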

Page 41: Important Considerations When Using robots.txt

- Robots can ignore your /robots.txt. In particular, malware robots that scan the Web for security vulnerabilities, and the email-address harvesters used by spammers, will pay no attention to it.
- The /robots.txt file is publicly available: anyone can see which sections of your server you don't want robots to use.
- It is therefore not advisable to use /robots.txt to hide information.

Page 42: More Examples

See the handout.

Page 43: Robots META Tag

robots.txt can only be managed by web administrators, whereas the META tag can be used by individual web-page authors. The robots META tag is placed in the <HEAD> section of the HTML page.

Page 44: Robot META Tag

<html>
<head>
<meta name="robots" content="noindex, nofollow">
...
<title>...</title>
</head>

Page 45: Content Terms

The content attribute accepts: ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW

- ALL = INDEX, FOLLOW
- NONE = NOINDEX, NOFOLLOW

Page 46: Content Combinations

<meta name="robots" content="index,follow"> is equivalent to <meta name="robots" content="all">

<meta name="robots" content="noindex,follow">

<meta name="robots" content="index,nofollow">

<meta name="robots" content="noindex,nofollow"> is equivalent to <meta name="robots" content="none">

Page 47: Exercise

Check whether the KSU website has a robot exclusion file, robots.txt.
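One way to check programmatically, as a sketch using Python's standard-library urllib.robotparser; the domain ksu.edu.sa is an assumption, so adjust it to the actual KSU website:

    from urllib.robotparser import RobotFileParser

    # Assumed location of the KSU website's robots.txt
    rp = RobotFileParser("https://ksu.edu.sa/robots.txt")
    rp.read()  # fetch and parse the file

    # May a generic robot fetch the home page?
    print(rp.can_fetch("*", "https://ksu.edu.sa/"))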

Page 48: Multimedia Information Retrieval

From the perspective of images and videos, a well-known content-based system for images is the Query by Image Content (QBIC) system. It uses:

- A three-dimensional color feature vector, where the distance measure is simple Euclidean distance.
- k-dimensional color histograms, where the bins of the histogram can be chosen by a partition-based clustering algorithm.
- A three-dimensional texture vector consisting of features that measure scale, directionality, and contrast; distance is computed as a weighted Euclidean distance measure, where the default weights are the inverse variances of the individual features (see the sketch below).
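A minimal sketch of the weighted Euclidean distance used for the texture vectors, assuming Python; the feature values and variances below are made-up placeholders for illustration:

    import math

    def weighted_euclidean(x, y, weights):
        """Weighted Euclidean distance between feature vectors x and y."""
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

    # Default weights: inverse variances of the individual texture features
    variances = [0.5, 2.0, 1.0]           # scale, directionality, contrast (example values)
    weights = [1.0 / v for v in variances]

    d = weighted_euclidean([0.2, 1.5, 0.7], [0.4, 1.1, 0.9], weights)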

Page 49: Multimedia Information Retrieval

The query can be expressed directly in terms of the feature representation itself. For instance: find images that are 40% blue in color and contain a texture with a specific coarseness property, or that have a specific layout.

Page 50: Multimedia Information Retrieval

An MIR system: www.hermitagemuseum.org/html_En/index.html

A QBIC layout search demo that gives a step-by-step demonstration of the search described in the text can be found at:
www.hermitagemuseum.org/fcgi-bin/db2www/qbicLayout.mac/qbic?selLang=English

Page 51: Multimedia Information Retrieval

As multimedia becomes a more extensively used data format, it is vital to deal with the issues of:

- metadata standards
- classification
- query matching
- presentation
- evaluation

in order to guarantee the development and deployment of efficient and effective multimedia information retrieval systems.