data mining the web - staffaidvi/courses/06/dm/seminars2008/data mining...data mining the web oscar...

5/6/2008

1

Data mining the webOscar Djupfeldt, Avraz Hirori, Christoffer Kullman

Group 5

Introduction

• Structure mining

• Content mining

• Usage mining

5/6/2008

2

Problem

• There is a huge amount of information on the web

• The information available on the web is very diverse

• Information on the web is exists in almost all types of formats

• The web is semi-structured because of HTMLs nestsed structure

• The web is linked• Lots of redundant information

Web Crawling

• Getting data about web pages

• Web search

• Done in two ways:

▫ Uniformed graph search

▫ Guided, “informed” search

5/6/2008

3

Structure Mining

• Analyzes which places a web page points to and which places point to a web page

• To find relevant data about a web page

• Used for web search and social networks

Document Structure

• Look at the HTML- or XML-structure of a web page

• The structure can reveal which data in a document is relevant as well as providing a context to the information

5/6/2008

4

Document Structure

<title>Web Mining</title>

<meta name=”Author” content=”John Doe”>

--

<h2><big>Web Mining</big></h2>

--

<a href=”presentation.html”>Web Mining </a>

Document Structure

What do we get from this?

• Tagging and indexing

▫ Expand main index.

▫ Small, seperate indicies for faster access.

• Relevance ranking

▫ Terms location affects ranking. Different tags = different weights.

▫ Good at natural relevance ranking.

5/6/2008

5

Document Structure

Problems

• Invisible words

• Misleading text in titles, metatags, etc.

• Missing or incorrect information

Anchor text

• Most significant in use due

• Additional information

Link Analysis - Hyperlinks

• Independent evaluation of web page popularity or authority

• Idea is from social networks▫ Popularity▫ Authority▫ Prestige

• Ranking• Link analysis algorithms:

▫ PageRank▫ HITS

5/6/2008

6

Link Analysis - PageRank

• Democratic – a links to b = a “votes” for b.

• The “voter” is also analyzed

• Designed for the “random clicking user” -likelihood a user will reach a certain page

• Iterative process

Link Analysis - PageRank

• The damping factor

• Random page switch

• Sink pages

• No outbound links?

5/6/2008

7

Content Mining

• Focused on text mining

• Video

• Audio

• Images

• Structured records like tables

• Information retrieval

Information Retrieval

Information retrieval mainly uses two types of measurement, precision and recall.

• Precision: the proportion of correctly returned pages out of all the returned pages.

• Recall: of all the pages that are correct, how many are returned?

5/6/2008

8

Information Retrieval

We typically have one of the following:

• Perfect recall and very low precision

• Perfect precision and very low recall

Content Mining

• Web search engines

• Web directories

5/6/2008

9

Content Mining

• Classification

▫ Genetic algorithms

▫ Memory based reasoning or K-Nearest neighbors based classification (K-NN)

▫ Vector space model

Vector Space Model

• Page similarity

• Weighted vectors

▫ Term frequency-Inverse document frequency method

5/6/2008

10

Clustering

• Used to further help the user find what they are looking for

• Refine searches on a more general keyword

Clustering

5/6/2008

11

Usage Mining

• Web logsFile (or several files) automatically created and maintained by a server of activity performed by it.

Typically added• Client IP address• request date/time• page requested• HTTP code• bytes served• user agent• referer

Usage Mining

Web Usage Mining consists of three phases

• Preprocessing

• Patten discovery

• Patten analysis

5/6/2008

12

Preprocessing

• To clean up the data

• User identification

• Session identification

• Path completion

Pattern Discovery

• Clustering algorithm

• Dependency modeling

• Classification

• Association Rules

5/6/2008

13

Pattern Analysis

• SQL

• Filter out uninteresting rules

• Visualization techniques

▫ Graphic

▫ Color

Usage Mining

• Example

5/6/2008

14

Conclusions

• Useful for personalizing the Internet

• Good commercial use

• Simplifies searching and browsing

• Adds much needed structure to the data on the web

• Easy to manipulate

• No authority

• Needed

Bibliography• "PageRank." Wikipedia. 8 Apr. 2008. Accessed on 4 May 2008 <http://en.wikipedia.org/wiki/Web_mining>.

• "Vector Space Model." Wikipedia. 22 April 2008. Accessed on 6 May 2008 <http://en.wikipedia.org/wiki/Vector_space_model>.

• Larose, Markov. Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage. Hoboken, New Jersey: Wiley-Interscience, 2007.

• Nasraou, O. “Approaches to Mining the Web”, CECS 694 – Mining the Web for E-Commerce & Information Retrieval. University of Louisville, 21 Oct. 2004 <http://webmining.spd.louisville.edu/Websites/tutorials/Chapter2-approaches-mining-web.pdf>.

• Cooley, Deshpande, Srivastava, Tan. "Web Usage Mining: Discovery and Applications of Usage Patterns From Web Data." SIGKDD Explorations 1 (2000). Accessed on 4 May 2008 <http://www.sigkdd.org/explorations/issue.php?volume=1&issue=2&year=2000&month=01>.

• Sing, T. “Web Content Mining”. Oakland University. Accessed on 5 May 2008. <personalwebs.oakland.edu/~tsingh23/PresentationMain.ppt>.

• "Google Technology." Google. Accessed on 6 May 2008 <http://www.google.com/technology/>.

• “Introduction.” 19 Feb 2002, Accessed on 5 May 2008 <http://www2002.org/CDROM/refereed/643/node1.html>.

• "Introduction." 18 Feb. 2002. 5 May 2008 <http://www2002.org/CDROM/refereed/643/node1.html>.

data mining the web - staffaidvi/courses/06/dm/seminars2008/data mining...data mining the web oscar...

Documents