Post on 23-Feb-2016
DSCI 5240 Graduate Presentation
Xxxxxx
Research paper: Web Mining Research: A Survey. SIGKDD Explorations, June 2000, Volume 2, Issue 1
Authors: R. Kosala and H. Blockeel
1. Introduction
2. Web Mining
3. Web Content Mining
4. Web Structure Mining
5. Web Usage Mining
6. Conclusion
Outline
The World Wide Web is a popular and interactive medium for disseminating information
Information users may encounter four problems:
1. Finding relevant information
a. low precision
b. low recall
2. Creating new knowledge out of the information available on the web (a data-triggered process)
3. Personalizing the information: people differ in the content and presentation of information they prefer
4. Learning about consumers or individual users: mass customizing or even personalizing
Introduction
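The low-precision and low-recall problems listed on this slide can be made concrete with a small calculation. This is only an illustrative sketch: the function name `precision_recall` and the document IDs are hypothetical and do not come from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 pages returned, 5 pages actually relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5", "d6", "d7"})
print(p, r)
```

Low precision means many of the returned pages are irrelevant. Low recall means many relevant pages are never returned.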
Definition: web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data
Four subtasks:
1. Resource finding: retrieving intended web documents
2. Information selection and pre-processing: selecting and pre-processing specific information
3. Generalization: discovering general patterns
4. Analysis: validation and/or interpretation of mined patterns
Web Mining
Web Mining and Information Retrieval
Definition: IR is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible.
Goal: indexing and searching for useful documents
Web Mining and Information Extraction
IE has the goal of transforming a collection of documents into information that is more readily digested and analyzed.
Compare IR and IE:
a. aims
b. fields
Web Mining
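The IR goal of "indexing and searching for useful documents" can be illustrated with a tiny inverted index. This is a minimal sketch under stated assumptions: the helper names `build_index` and `search`, the whitespace tokenization, and the three sample pages are all hypothetical. Real IR systems add ranking, stemming, and much more.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every query word (Boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical page contents, for illustration only.
docs = {
    "p1": "web mining survey",
    "p2": "web usage mining with server logs",
    "p3": "citation analysis of social networks",
}
index = build_index(docs)
print(sorted(search(index, "web mining")))
```

IR returns whole documents that match. IE would go one step further and pull structured facts out of those documents.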
Web Mining and the Agent Paradigm
Web mining is often viewed from or implemented within an agent paradigm:
1. User interface agents
2. Distributed agents
3. Mobile agents
Two approaches used to develop intelligent agents:
1. Content-based approach
2. Collaborative approach
Web Mining
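The difference between the content-based and collaborative approaches can be sketched with two toy recommenders. Everything here is an assumption made for illustration: the function names, the keyword-overlap scoring, and the sample data are hypothetical and not taken from the paper.

```python
def content_based(liked_keywords, pages):
    """Pick the page whose keywords overlap most with what the user liked."""
    scores = {name: len(liked_keywords & kws) for name, kws in pages.items()}
    return max(scores, key=scores.get)

def collaborative(user, likes):
    """Pick an unseen page liked by the most similar other user."""
    mine = likes[user]
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: len(mine & likes[u]))
    unseen = likes[nearest] - mine
    return sorted(unseen)[0] if unseen else None

# Hypothetical pages described by keywords (content-based input).
pages = {"a": {"mining", "web"}, "b": {"cooking"}, "c": {"web", "agents"}}
# Hypothetical sets of pages each user liked (collaborative input).
likes = {"ann": {"a", "c"}, "bob": {"a", "c", "d"}, "cat": {"b"}}

print(content_based({"web", "mining"}, pages))
print(collaborative("ann", likes))
```

The content-based function looks only at the pages themselves, while the collaborative one looks only at what other users did.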
Definition: discovering useful information from web page contents, data, and documents
Several types of data: text, image, audio, video, hyperlinks
Types of data structure:
1. Unstructured: free text
2. Semi-structured: HTML
3. More structured: data in tables or database-generated HTML pages
Web Content Mining
IR view
Unstructured documents:
a. Bag of words to represent unstructured documents
b. Features: Boolean or frequency based
c. Variations of the feature selection
d. Features could be reduced using different feature selection techniques
Semi-structured documents:
a. Uses richer representations for features
b. Uses common data mining methods
Web Content Mining
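The bag-of-words representation with Boolean and frequency-based features can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a sorted vocabulary. The function name `bag_of_words` and the two sample documents are hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocab, boolean_vectors, frequency_vectors) for a list of texts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    boolean, freq = [], []
    for d in docs:
        counts = Counter(d.lower().split())
        freq.append([counts[w] for w in vocab])                 # term counts
        boolean.append([1 if counts[w] else 0 for w in vocab])  # presence only
    return vocab, boolean, freq

# Hypothetical documents, for illustration only.
docs = ["web mining web usage", "web content mining"]
vocab, boolean, freq = bag_of_words(docs)
print(vocab)
print(boolean)
print(freq)
```

Word order is discarded in both variants. Feature selection would then drop columns of these vectors, for example very rare or very common words.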
DB view: tries to infer the structure of a web site or to transform a web site into a database
Methods:
a. Finding the schema of web documents
b. Building a web warehouse
c. Building a web knowledge base
d. Building a virtual database
Web Content Mining
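One very small instance of "transforming a web site into a database" is turning an HTML table into records. The sketch below uses Python's standard `html.parser` module. The class name `TableRows` and the sample markup are hypothetical, and it assumes a single flat table whose first row is the header.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical page fragment, for illustration only.
html = (
    "<table><tr><th>title</th><th>year</th></tr>"
    "<tr><td>Web Mining Research: A Survey</td><td>2000</td></tr></table>"
)
parser = TableRows()
parser.feed(html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

The header row plays the role of an inferred schema, and each remaining row becomes one database-style record.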
Interested in the structure of the hyperlinks within the web
Inspired by the study of social networks and citation analysis
Discover specific types of pages based on the incoming and outgoing links
Applications:
a. Discovering micro-communities in the web
b. Measuring the completeness of a web site
Web Structure Mining
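Classifying pages by their incoming and outgoing links can start from simple link counts. This is a rough sketch, not the method of any particular system: the function name `link_degrees` and the five-link graph are hypothetical, and real link analysis uses iterative scoring rather than raw counts.

```python
from collections import defaultdict

def link_degrees(links):
    """Return (in_degree, out_degree) dicts from (source, target) pairs."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for src, dst in links:
        out_deg[src] += 1
        in_deg[dst] += 1
    return dict(in_deg), dict(out_deg)

# Hypothetical hyperlinks between pages, for illustration only.
links = [("index", "a"), ("index", "b"), ("index", "c"), ("a", "c"), ("b", "c")]
in_deg, out_deg = link_degrees(links)

most_linked_to = max(in_deg, key=in_deg.get)   # many incoming links
most_linking = max(out_deg, key=out_deg.get)   # many outgoing links
print(most_linked_to, most_linking)
```

A page with many incoming links resembles a heavily cited paper in citation analysis. A page with many outgoing links resembles a survey or directory.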
Tries to predict user behavior from interaction with the web
Wide range of data
Two commonly used approaches:
a. Maps the usage data of the web server into relational tables before an adapted data mining technique is performed
b. Uses the log data directly by utilizing special pre-processing techniques
Problems:
a. Distinguishing among unique users, server sessions, and episodes in the presence of caching and proxy servers
b. Usage mining often relies on background or domain knowledge
Applications
Web Usage Mining
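The first approach, mapping server usage data into relational tables, might look like the sketch below. It assumes log lines in Common Log Format and loads them into an in-memory SQLite table. The regular expression, the table name `hits`, the function name `load_log`, and the sample lines are all hypothetical.

```python
import re
import sqlite3

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def load_log(lines):
    """Parse log lines and store them as rows of a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:  # skip lines that do not match the assumed format
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?)",
                (m["host"], m["time"], m["path"], int(m["status"])),
            )
    return conn

# Made-up log lines, for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2000:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [01/Jan/2000:10:00:05 +0000] "GET /papers.html HTTP/1.0" 200 2210',
    '10.0.0.2 - - [01/Jan/2000:10:01:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
]
conn = load_log(sample)
rows = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(rows)
```

Once the data is in a table, standard queries or adapted data mining techniques can run on it. The `host` column alone does not identify a unique user, because caching and proxy servers can hide or merge requests.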
Survey of research in the area of web mining
Three web mining categories: content, structure, and usage mining
Connections between the web mining categories and the related agent paradigm
Conclusions