ihr logo chapter 7 web content mining dsci 4520/5240 dr. nick evangelopoulos xxxxxxxx
TRANSCRIPT
Ihr Logo
Chapter 7Web Content Mining
DSCI 4520/5240Dr. Nick EvangelopoulosXxxxxxxx
Your Logo
Introduction
Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page contents.
- textual
- audio
- video
- still images
- metadata
- hyperlinks
Your Logo
Introduction
Problems with the web data
Distributed data
Large volume
Unstructured data
Redundant data
Quality of data
Extreme percentage volatile data
Varied data
Your Logo
Introduction
Two approaches of web-content mining:
agent-based
software agents perform the content mining
database oriented
view the Web data as belonging to a database
Your Logo
Web Crawler
A computer program that navigates the hypertext structure of the web
Crawlers are used to ease the formation of indexes used by search engines
The page(s) that the crawler begins with are called the seed URLs.
Builds an index visiting number of pages and then replaces the current index
Known as a periodic crawler because it is activated periodically
Your Logo
Web Crawler
Another type is a Focused Crawler
Generally recommended for use due to large size of the Web
Visits pages related to topics of interest
If a page is not pertinent, the entire set of possible pages below it is pruned
Your Logo
Web Crawler Crawling process
Begin with group of URLs
Submitted by users
Common URLs
Breath-first or depth-first
Extract more URLs
Numerous crawlers
Problem of redundancy
Web partition robot per partition
Your Logo
Focused Crawler
The focused crawler structure consists of two major parts:
The distiller
The hypertext classifier
Your Logo
Focused Crawler
The pages that the crawler visits are selected using a priority-based structure managed by the priority associated with pages by the classifier and the distiller
Your Logo
Focused Crawler
Sample documents are identified and classified based on a hierarchical classification tree
Documents are used as the seed documents to begin the focused crawling
Your Logo
Context Graph
Focused crawling has proposed the use of context graphs, which in turn created the context focused crawler (CFC)
The CFC performs crawling in two steps:
Context graphs and classifiers are constructed using a set of seed documents as a training set
Crawling is performed using the classifiers to guide it
Your Logo
Content Graph
Your Logo
Implementation of a Web Crawler
Wget is a free GNU utility that makes it possible to retrieve web documents
Wget supports Internet protocols
HTTP (Hyper Text Transfer Protocol)
FTP (File Transfer Protocol)
Recursively browse through the structure of HTML documents and FTP directory trees
Your Logo
Commonly Used Options for Wget
Your Logo
Methods for Crawl Class
Your Logo
Crawl class
Figure 7.7 Code from the main of Crawl class (Suitable for Java programmers)
Your Logo
The readContent Method of Crawl Class
Figure 7.8 Code from the readContent method of Crawl class (Suitable for Java programmers)
Your Logo
Code for Extracting Links from Crawl Class
Figure 7.9
Your Logo
Thank you for your attention