Ihr Logo Chapter 7 Web Content Mining DSCI 4520/5240 Dr. Nick Evangelopoulos Xxxxxxxx

Upload: gordon-miller

Post on 12-Jan-2016

Chapter 7: Web Content Mining

DSCI 4520/5240

Dr. Nick Evangelopoulos

Introduction

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page contents.

- textual

- audio

- video

- still images

- metadata

- hyperlinks

Introduction

Problems with Web data:

- Distributed data

- Large volume

- Unstructured data

- Redundant data

- Variable data quality

- High percentage of volatile data

- Varied data

Introduction

Two approaches to Web content mining:

- Agent-based: software agents perform the content mining

- Database-oriented: the Web data is viewed as belonging to a database

Web Crawler

A Web crawler is a computer program that navigates the hypertext structure of the Web

Crawlers facilitate the creation of the indexes used by search engines

The page(s) that the crawler begins with are called the seed URLs

One type of crawler visits a number of pages, builds an index, and then replaces the current index

It is known as a periodic crawler because it is activated periodically

Web Crawler

Another type is the focused crawler

It is generally recommended because of the large size of the Web

It visits only pages related to topics of interest

If a page is not pertinent, the entire set of possible pages below it is pruned
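
The pruning rule can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the in-memory `PAGES` dictionary stands in for the Web, and `is_relevant` is a toy keyword test rather than the classifier a real focused crawler would use.

```python
from collections import deque

# Hypothetical in-memory "web": page name -> (page text, outgoing links).
PAGES = {
    "seed": ("data mining course", ["a", "b"]),
    "a": ("web mining and crawlers", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("text mining methods", []),
    "d": ("mining of pasta recipes", []),
}


def is_relevant(text, topic_terms):
    """Toy relevance test: the page mentions at least one topic term."""
    return bool(set(text.lower().split()) & topic_terms)


def focused_crawl(seed, topic_terms):
    """Visit pages reachable from the seed, pruning everything below
    a page that is not relevant to the topic."""
    visited = set()
    relevant = []
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        text, links = PAGES[url]
        if not is_relevant(text, topic_terms):
            continue  # prune: links on this page are never followed
        relevant.append(url)
        frontier.extend(links)
    return relevant


print(focused_crawl("seed", {"mining"}))
```

In this toy data, page `d` mentions the topic but is reachable only through the irrelevant page `b`, so it is never visited. That is the cost of pruning: some pertinent pages can be missed in exchange for a much smaller crawl.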

Web Crawler: Crawling Process

- Begin with a group of URLs

  - Submitted by users

  - Common URLs

- Traverse breadth-first or depth-first

- Extract more URLs

- Numerous crawlers may run at once

  - Problem of redundancy

  - Partition the Web: one robot per partition
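
The crawling process can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production crawler: the `LINKS` dictionary is a hypothetical link graph used in place of real HTTP requests, and `partition_of` shows just one possible way (hashing the host name) to assign URLs to robots so that they do not duplicate each other's work.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def crawl(seeds, breadth_first=True):
    """Return pages in visiting order, starting from the seed URLs.

    A FIFO frontier gives breadth-first order; a LIFO frontier
    approximates depth-first order.
    """
    frontier = deque(seeds)
    seen = set(seeds)  # avoids fetching the same page twice
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):  # extract more URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def partition_of(url, n_robots):
    """Assign a URL to one of n_robots partitions by hashing its host."""
    host = urlsplit(url).hostname or url
    return zlib.crc32(host.encode("utf-8")) % n_robots


print(crawl(["A"], breadth_first=True))
print(crawl(["A"], breadth_first=False))
```

Because `partition_of` depends only on the host name, every page of a given site lands with the same robot, which keeps two robots from crawling the same site.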

Focused Crawler

The focused crawler structure consists of two major parts:

- The distiller

- The hypertext classifier

Focused Crawler

The crawler selects the pages to visit from a priority-based structure, ordered by the priorities that the classifier and the distiller assign to the pages
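
A priority-based frontier of this kind is commonly built on a heap. The sketch below is one possible arrangement in Python; the class name `PriorityFrontier` and the numeric scores are hypothetical stand-ins for whatever priorities the classifier and the distiller would actually produce.

```python
import heapq


class PriorityFrontier:
    """URL frontier that always returns the highest-priority page next."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal scores pop in insertion order
        self._seen = set()

    def push(self, url, score):
        """Queue a URL once, with a priority score (higher is better)."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        """Remove and return (url, score) for the best queued page."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.push("page-a", 0.2)  # scores here are made-up examples
frontier.push("page-b", 0.9)
frontier.push("page-c", 0.5)
while frontier:
    print(frontier.pop())
```

The pages come out in the order `page-b`, `page-c`, `page-a`, regardless of the order in which they were discovered.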

Focused Crawler

Sample documents are identified and classified based on a hierarchical classification tree

These documents are then used as the seed documents to begin the focused crawling

Context Graph

Work on focused crawling has proposed the use of context graphs, which led to the context focused crawler (CFC)

The CFC performs crawling in two steps:

- Context graphs and classifiers are constructed using a set of seed documents as a training set

- Crawling is performed using the classifiers to guide it
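
The first step can be sketched as follows. The slides do not say how a context graph is organized, so this Python example rests on an assumption taken from the general literature on context focused crawling: pages are grouped into layers by how many links separate them from a seed document, found by following links backwards. The `LINKS` data and the function name are hypothetical, and the training of per-layer classifiers is not shown.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub": ["dept"],
    "dept": ["target"],
    "blog": ["target"],
    "news": ["hub", "blog"],
    "target": [],
}


def build_context_layers(links, target, max_depth):
    """Group pages by their link distance to the target document.

    Layer 0 holds the target, layer 1 the pages linking to it,
    layer 2 the pages linking to layer 1, and so on.
    """
    # Invert the graph so that backlinks can be followed.
    backlinks = {}
    for src, outs in links.items():
        for dst in outs:
            backlinks.setdefault(dst, []).append(src)

    layer_of = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        depth = layer_of[page]
        if depth == max_depth:
            continue  # do not expand beyond the deepest layer
        for parent in backlinks.get(page, []):
            if parent not in layer_of:
                layer_of[parent] = depth + 1
                queue.append(parent)

    layers = [[] for _ in range(max_depth + 1)]
    for page, depth in layer_of.items():
        layers[depth].append(page)
    return [sorted(layer) for layer in layers]


print(build_context_layers(LINKS, "target", max_depth=2))
```

Under this assumption, a classifier trained on each layer could estimate how far a newly fetched page is from a relevant document, which is the kind of guidance the second step relies on.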

Context Graph

Implementation of a Web Crawler

Wget is a free GNU utility for retrieving Web documents

Wget supports these Internet protocols:

- HTTP (Hypertext Transfer Protocol)

- FTP (File Transfer Protocol)

It can recursively traverse the structure of HTML documents and FTP directory trees
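
As an illustration of recursive retrieval, the Python sketch below assembles a Wget command line and can optionally run it. The helper names, the example URL, and the default depth and directory are assumptions chosen for the example; the long-form options used (`--recursive`, `--level`, `--no-parent`, `--convert-links`, `--directory-prefix`) are standard Wget options.

```python
import shutil
import subprocess


def build_wget_command(url, depth=2, dest="mirror"):
    """Build the argument list for a depth-limited recursive download."""
    return [
        "wget",
        "--recursive",                  # follow links found in retrieved pages
        f"--level={depth}",             # maximum recursion depth
        "--no-parent",                  # never ascend above the start directory
        "--convert-links",              # rewrite links for local browsing
        f"--directory-prefix={dest}",   # where downloaded files are saved
        url,
    ]


def run_wget(url, depth=2, dest="mirror"):
    """Run Wget if it is installed; raise an error otherwise."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed on this system")
    return subprocess.run(build_wget_command(url, depth, dest), check=True)


# Only print the command here; call run_wget(...) to actually download.
print(" ".join(build_wget_command("http://example.com/docs/")))
```

Limiting the depth and using `--no-parent` keeps a recursive download from spreading across an entire site.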

Commonly Used Options for Wget

Methods for Crawl Class

Crawl class

Figure 7.7: Code from the main method of the Crawl class (suitable for Java programmers)

The readContent Method of Crawl Class

Figure 7.8: Code from the readContent method of the Crawl class (suitable for Java programmers)

Code for Extracting Links from Crawl Class

Figure 7.9
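
The code in Figure 7.9 is not reproduced in this transcript. As a stand-in, the following Python sketch shows one common way to extract links from a page using only the standard library. It is not the Crawl class from the chapter, which is written in Java; the names `LinkExtractor` and `extract_links` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = (
    '<a href="/a.html">A</a> '
    '<a href="http://other.org/b">B</a> '
    '<a name="anchor-only">no link</a>'
)
print(extract_links(sample, "http://example.com/dir/page.html"))
```

The extracted URLs are what a crawler would add to its frontier after fetching the page.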

Thank you for your attention