Your Logo

Chapter 7: Web Content Mining

DSCI 4520/5240

Dr. Nick Evangelopoulos

Xxxxxxxx

Introduction

Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page contents. These contents include:

- text

- audio

- video

- still images

- metadata

- hyperlinks

Introduction

Problems with Web data

Distributed data

Large volume

Unstructured data

Redundant data

Quality of data

High percentage of volatile data

Varied data

Introduction

Two approaches to Web content mining:

- Agent-based: software agents perform the content mining

- Database-oriented: the Web data is viewed as belonging to a database

Web Crawler

A crawler is a computer program that navigates the hypertext structure of the Web

Crawlers facilitate the creation of the indexes used by search engines

The page(s) that the crawler begins with are called the seed URLs.

A traditional crawler visits a number of pages, builds an index, and then replaces the current index

This type is known as a periodic crawler because it is activated periodically
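
The build-then-replace behavior of a periodic crawler can be sketched as follows. This is a minimal illustration, not the textbook's code: the in-memory `TOY_WEB` dictionary stands in for real page fetching, and the names `build_index` and `PeriodicCrawler` are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("web mining intro", ["a.example/crawl", "b.example/"]),
    "a.example/crawl": ("crawler builds index", ["a.example/"]),
    "b.example/": ("search engine index", []),
}


def build_index(seeds, max_pages):
    """Visit up to max_pages pages from the seeds; return a new inverted index."""
    index = {}
    seen = set(seeds)
    queue = deque(seeds)
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        text, links = TOY_WEB.get(url, ("", []))
        visited += 1
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


class PeriodicCrawler:
    def __init__(self, seeds, max_pages=100):
        self.seeds = list(seeds)
        self.max_pages = max_pages
        self.index = {}

    def run_once(self):
        # Build the complete new index first, then replace the current one.
        new_index = build_index(self.seeds, self.max_pages)
        self.index = new_index


crawler = PeriodicCrawler(["a.example/"])
crawler.run_once()
```

In practice `run_once` would be triggered by a scheduler at fixed intervals; the old index stays available for searches until the new one is complete.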

Web Crawler

Another type is the focused crawler

Generally recommended because of the large size of the Web

Visits pages related to topics of interest

If a page is not pertinent, the entire set of possible pages below it is pruned
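
The pruning rule can be sketched as below. The page data, the keyword test in `is_pertinent`, and all names are assumptions made for illustration; a real focused crawler would use a trained classifier rather than keyword matching.

```python
# Hypothetical pages: name -> (page text, outgoing links).
PAGES = {
    "root": ("data mining course", ["mining", "sports"]),
    "mining": ("web mining crawler", ["crawler"]),
    "sports": ("football scores", ["league"]),
    "league": ("data mining of league statistics", []),
    "crawler": ("focused crawler mining", []),
}
TOPIC_TERMS = {"mining", "crawler"}


def is_pertinent(text):
    """Crude stand-in for a topic classifier: any topic term present."""
    return bool(TOPIC_TERMS & set(text.split()))


def focused_crawl(seed):
    visited = set()
    relevant = []
    stack = [seed]
    while stack:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        text, links = PAGES[url]
        if not is_pertinent(text):
            continue  # prune: none of this page's links are followed
        relevant.append(url)
        stack.extend(links)
    return relevant, visited


relevant, visited = focused_crawl("root")
```

Here the page `league` is never visited because its only parent, `sports`, was judged not pertinent. This shows the trade-off of pruning: it saves work but can miss relevant pages that are reachable only through irrelevant ones.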

Web Crawler: Crawling Process

Begin with a group of URLs

Submitted by users

Common URLs

Breadth-first or depth-first

Extract more URLs

Numerous crawlers

Problem of redundancy

Partition the Web, with one robot per partition
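
The crawling process above can be sketched as a frontier loop over a toy link graph. `LINKS`, `crawl`, and `partition_of` are hypothetical names, and hashing the host name is only one assumed way to partition the Web among robots.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: URL -> URLs extracted from that page.
LINKS = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": [],
    "http://b.example/": ["http://a.example/1"],
}


def crawl(seeds, breadth_first=True):
    """Return the visit order starting from the seed URLs."""
    frontier = deque(seeds)
    seen = set(seeds)  # avoids visiting the same URL twice
    order = []
    while frontier:
        # A queue gives breadth-first order, a stack gives depth-first order.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def partition_of(url, n_robots):
    """Assign a URL to a robot by hashing its host name."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % n_robots


bfs_order = crawl(["http://a.example/"], breadth_first=True)
dfs_order = crawl(["http://a.example/"], breadth_first=False)
```

Because every URL on the same host maps to the same partition, two robots never fetch the same page, which addresses the redundancy problem.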

Focused Crawler

The focused crawler structure consists of two major parts:

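The distiller, which identifies hub pages that link to many relevant pages

The hypertext classifier, which rates each page's relevance to the topic of the crawl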

Focused Crawler

The crawler selects the pages to visit from a priority-based structure, ordered by the priorities that the classifier and the distiller assign to pages
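
A priority-based frontier of this kind can be sketched with a binary heap. The class name, the `relevance` and `hub_score` parameters, and the rule of simply adding the two scores are assumptions for illustration; the slides do not say how the two priorities are combined.

```python
import heapq
import itertools


class PriorityFrontier:
    """URLs waiting to be crawled, highest priority first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order
        self._queued = set()

    def push(self, url, relevance, hub_score=0.0):
        if url in self._queued:
            return
        self._queued.add(url)
        # Assumption: priority is the sum of the classifier's relevance
        # score and the distiller's score for the page.
        priority = relevance + hub_score
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.push("page-a", relevance=0.2)
frontier.push("page-b", relevance=0.9)
frontier.push("page-c", relevance=0.5, hub_score=0.6)
visit_order = [frontier.pop() for _ in range(len(frontier))]
```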

Focused Crawler

Sample documents are identified and classified based on a hierarchical classification tree

These documents are then used as the seed documents to begin the focused crawl

Context Graph

Work on focused crawling proposed the use of context graphs, which led to the context focused crawler (CFC)

The CFC performs crawling in two steps:

Context graphs and classifiers are constructed using a set of seed documents as a training set

Crawling is performed using the classifiers to guide it
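
The first step can be sketched as below. It assumes the usual layered definition of a context graph: the seed document is layer 0, and layer i holds the pages that link to a page in layer i-1. The `BACKLINKS` data and the function name are hypothetical, and the per-layer classifiers are not shown.

```python
# Hypothetical backlink data: page -> pages that link to it.
BACKLINKS = {
    "seed": ["p1", "p2"],
    "p1": ["p3"],
    "p2": ["p3", "p4"],
    "p3": [],
    "p4": ["seed"],
}


def build_context_graph(seed, backlinks, max_layers):
    """Return a list of layers; layers[i] is the set of pages i links away from the seed."""
    layers = [{seed}]
    assigned = {seed}
    for _ in range(max_layers):
        next_layer = set()
        for page in layers[-1]:
            for parent in backlinks.get(page, []):
                if parent not in assigned:
                    next_layer.add(parent)
        if not next_layer:
            break
        assigned |= next_layer
        layers.append(next_layer)
    return layers


layers = build_context_graph("seed", BACKLINKS, max_layers=2)
```

In the second step, a classifier trained on each layer would estimate how many links away a newly fetched page is from a target document, so that pages predicted to be closer are crawled first.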

Context Graph

Implementation of a Web Crawler

Wget is a free GNU utility that makes it possible to retrieve web documents

Wget supports the following Internet protocols:

HTTP (Hypertext Transfer Protocol)

FTP (File Transfer Protocol)

Wget can recursively browse through the structure of HTML documents and FTP directory trees
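
A recursive retrieval with Wget can be prepared from a script, as in this sketch. The helper `wget_command`, its default values, and the example URL are assumptions; the long options used are standard Wget options.

```python
def wget_command(url, depth=2, dest="mirror"):
    """Build the argument list for a recursive Wget download."""
    return [
        "wget",
        "--recursive",                 # follow links found in retrieved pages
        f"--level={depth}",            # maximum recursion depth
        "--no-parent",                 # never ascend above the start directory
        f"--directory-prefix={dest}",  # directory where files are saved
        url,
    ]


# Hypothetical start URL, used only for illustration.
cmd = wget_command("http://www.example.com/docs/")
```

On a machine with Wget installed, the list can be passed to `subprocess.run(cmd, check=True)` to perform the download.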

Commonly Used Options for Wget

Methods for Crawl Class

Crawl Class

Figure 7.7 Code from the main of Crawl class (Suitable for Java programmers)

The readContent Method of Crawl Class

Figure 7.8 Code from the readContent method of Crawl class (Suitable for Java programmers)

Code for Extracting Links from Crawl Class

Figure 7.9
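
The Java code of Figure 7.9 is not reproduced here. As an independent illustration of link extraction, written in Python rather than taken from the Crawl class, the sketch below collects the `href` targets of anchor tags and resolves them against the page's URL. The class name, function name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web_link = absolute.startswith(("http://", "https://"))
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


SAMPLE = (
    '<a href="/about.html">About</a> '
    '<a href="page2.html#top">Next</a> '
    '<a href="mailto:x@example.com">Mail</a>'
)
links = extract_links(SAMPLE, "http://www.example.com/docs/index.html")
```

The extracted URLs are what a crawler would add to its list of pages still to visit; non-Web schemes such as `mailto:` are skipped.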

Thank you for your attention