5/6/2008
Data mining the web
Oscar Djupfeldt, Avraz Hirori, Christoffer Kullman
Group 5
Introduction
• Structure mining
• Content mining
• Usage mining
Problem
• There is a huge amount of information on the web
• The information available on the web is very diverse
• Information on the web exists in almost all types of formats
• The web is semi-structured because of HTML's nested structure
• The web is linked
• Lots of redundant information
Web Crawling
• Getting data about web pages
• Web search
• Done in two ways:
▫ Uninformed graph search
▫ Guided, “informed” search
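A minimal sketch of the uninformed variant, assuming a breadth-first traversal over an in-memory link graph rather than real HTTP requests. The function name `crawl_bfs`, the `max_pages` limit and the toy pages are hypothetical and only serve the illustration; a guided crawler would replace the plain queue with a priority queue ordered by some relevance estimate.

```python
from collections import deque

def crawl_bfs(link_graph, start, max_pages=100):
    """Visit pages breadth-first, following links in the order they are found."""
    visited = []            # pages in the order they were crawled
    seen = {start}          # pages already queued, to avoid revisiting
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

# Hypothetical toy web: page -> pages it links to.
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["d.html"],
    "c.html": ["d.html", "a.html"],
    "d.html": [],
}
order = crawl_bfs(toy_web, "a.html")
print(order)
```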
Structure Mining
• Analyzes which places a web page points to and which places point to a web page
• To find relevant data about a web page
• Used for web search and social networks
Document Structure
• Look at the HTML- or XML-structure of a web page
• The structure can reveal which data in a document is relevant as well as providing a context to the information
Document Structure
<title>Web Mining</title>
<meta name="Author" content="John Doe">
--
<h2><big>Web Mining</big></h2>
--
<a href="presentation.html">Web Mining</a>
Document Structure
What do we get from this?
• Tagging and indexing
▫ Expand main index.
▫ Small, separate indices for faster access.
• Relevance ranking
▫ A term's location affects its ranking: different tags carry different weights.
▫ Good at natural relevance ranking.
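The idea that a term's location affects its ranking can be sketched with Python's standard `html.parser`. The weights in `TAG_WEIGHTS` are invented for illustration (the slides do not give concrete values), and the class name `WeightedTermParser` is hypothetical. Each term is scored with the highest weight among the tags that enclose it.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: an assumption made for this sketch only.
TAG_WEIGHTS = {"title": 5.0, "h2": 3.0, "a": 2.0}
DEFAULT_WEIGHT = 1.0
VOID_TAGS = {"meta", "br", "img", "hr", "link", "input"}  # tags with no closing tag

class WeightedTermParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag and anything nested in it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        weights = [TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags]
        weight = max(weights, default=DEFAULT_WEIGHT)
        for term in data.lower().split():
            self.scores[term] += weight

page = """<title>Web Mining</title>
<meta name="Author" content="John Doe">
<h2><big>Web Mining</big></h2>
<a href="presentation.html">Web Mining</a>
<p>mining text</p>"""

parser = WeightedTermParser()
parser.feed(page)
parser.close()
print(parser.scores.most_common())
```

A term that appears in the title and a heading ends up with a much higher score than one that appears only in body text, which is the effect the slide describes.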
Document Structure
Problems
• Invisible words
• Misleading text in titles, metatags, etc.
• Missing or incorrect information
Anchor text
• Most significant in use due
• Additional information
Link Analysis - Hyperlinks
• Independent evaluation of web page popularity or authority
• Idea is from social networks
▫ Popularity
▫ Authority
▫ Prestige
• Ranking
• Link analysis algorithms:
▫ PageRank
▫ HITS
Link Analysis - PageRank
• Democratic – a links to b = a “votes” for b.
• The “voter” is also analyzed
• Designed for the “random clicking user”: the likelihood that a user will reach a certain page
• Iterative process
Link Analysis - PageRank
• The damping factor
• Random page switch
• Sink pages
• No outbound links?
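The points above can be put together in a small power-iteration sketch. It assumes the commonly used damping factor of 0.85 and one common way of handling sink pages, namely spreading their rank evenly over all pages; the function name `pagerank`, the iteration count and the toy graph are hypothetical, and this is not presented as the exact procedure any search engine uses.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate the chance that a random clicker is on each page."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by sink pages (no outbound links) is shared by all pages.
        sink_mass = sum(rank[p] for p in pages if not links.get(p))
        # (1 - damping) models the random switch to an arbitrary page.
        new_rank = {p: (1.0 - damping) / n + damping * sink_mass / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # Each link is a "vote" worth an equal share of the voter's rank.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical toy graph: "d" is a sink page that nobody links to.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Because the vote of a page is weighted by that page's own rank, the "voter" is analyzed as well, and the ranks always sum to 1.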
Content Mining
• Focused on text mining
• Video
• Audio
• Images
• Structured records like tables
• Information retrieval
Information Retrieval
Information retrieval mainly uses two types of measurement, precision and recall.
• Precision: the proportion of correctly returned pages out of all the returned pages.
• Recall: the proportion of all correct pages that are actually returned.
Information Retrieval
We typically have one of the following:
• Perfect recall and very low precision
• Perfect precision and very low recall
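Both measures and the two extreme cases can be checked with a few lines of code. The page names and the helper `precision_recall` are made up for the example; "relevant" stands for the pages the slides call correct.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned pages."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # correctly returned pages
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

all_pages = {f"p{i}" for i in range(1, 11)}  # hypothetical collection of 10 pages
relevant = {"p1", "p2", "p5"}

# A typical result: some hits, some misses.
typical = precision_recall({"p1", "p2", "p3", "p4"}, relevant)

# Returning everything: perfect recall, low precision.
everything = precision_recall(all_pages, relevant)

# Returning a single sure hit: perfect precision, low recall.
single = precision_recall({"p1"}, relevant)

print(typical, everything, single)
```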
Content Mining
• Web search engines
• Web directories
Content Mining
• Classification
▫ Genetic algorithms
▫ Memory based reasoning or K-Nearest neighbors based classification (K-NN)
▫ Vector space model
Vector Space Model
• Page similarity
• Weighted vectors
▫ Term frequency-inverse document frequency (TF-IDF) method
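As a rough illustration, pages can be turned into weighted term vectors and compared by the cosine of the angle between them. The sketch assumes one common TF-IDF variant (relative term frequency multiplied by the unsmoothed logarithm of inverse document frequency) and whitespace tokenization; the function names and example documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "web mining finds patterns in web data",
    "web usage mining studies server logs",
    "cooking pasta with tomato sauce",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

The two web mining documents share weighted terms and get a positive similarity, while the unrelated document shares none and scores zero.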
Clustering
• Used to further help the user find what they are looking for
• Refine searches on a more general keyword
Usage Mining
• Web logs
▫ A file (or several files), automatically created and maintained by a server, recording the activity it performs.
• Typically recorded:
▫ Client IP address
▫ Request date/time
▫ Page requested
▫ HTTP code
▫ Bytes served
▫ User agent
▫ Referer
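To make the fields concrete, here is a sketch that parses one log line. It assumes the server writes the widely used Combined Log Format; the regular expression, the function name `parse_log_line` and the sample line are illustrative, and real log layouts vary with server configuration.

```python
import re

# Assumed layout: Combined Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were served.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# Made-up sample line using documentation-only addresses.
sample = ('192.0.2.1 - - [06/May/2008:10:15:32 +0200] "GET /index.html HTTP/1.1" '
          '200 2326 "http://example.com/start.html" "Mozilla/5.0"')
entry = parse_log_line(sample)
print(entry)
```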
Usage Mining
Web Usage Mining consists of three phases
• Preprocessing
• Pattern discovery
• Pattern analysis
Preprocessing
• To clean up the data
• User identification
• Session identification
• Path completion
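User and session identification can be sketched as follows. The example assumes two common heuristics that the slides do not spell out: a user is identified by the pair (IP address, user agent), and a gap of more than 30 minutes between requests starts a new session. The function name, the timeout value and the sample requests are all hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the slides

def identify_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, page) requests into per-user sessions."""
    sessions = {}
    for user, timestamp, page in sorted(requests, key=lambda r: (r[0], r[1])):
        user_sessions = sessions.setdefault(user, [])
        last_time = user_sessions[-1][-1][0] if user_sessions else None
        if last_time is not None and timestamp - last_time <= timeout:
            user_sessions[-1].append((timestamp, page))  # continue current session
        else:
            user_sessions.append([(timestamp, page)])     # start a new session
    return sessions

# Users are keyed by (IP address, user agent); values are made up.
user_a = ("192.0.2.1", "Mozilla/5.0")
user_b = ("192.0.2.7", "Mozilla/5.0")
requests = [
    (user_a, datetime(2008, 5, 6, 10, 0), "/index.html"),
    (user_b, datetime(2008, 5, 6, 10, 2), "/index.html"),
    (user_a, datetime(2008, 5, 6, 10, 5), "/products.html"),
    (user_a, datetime(2008, 5, 6, 11, 0), "/index.html"),
]
sessions = identify_sessions(requests)
for user, user_sessions in sessions.items():
    print(user, [[page for _, page in s] for s in user_sessions])
```

Path completion, which fills in page views hidden by browser or proxy caches, is not covered by this sketch.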
Pattern Discovery
• Clustering algorithm
• Dependency modeling
• Classification
• Association Rules
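As one example of pattern discovery, association rules of the form "visitors of page A also visit page B" can be mined from sessions using support and confidence. The sketch below only considers rules between single pages, and the thresholds, function name and sessions are invented for illustration; full algorithms such as Apriori also handle larger page sets.

```python
from collections import Counter
from itertools import permutations

def pair_rules(sessions, min_support=0.5, min_confidence=0.7):
    """Find rules a -> b between single pages that pass both thresholds."""
    n = len(sessions)
    page_sets = [set(session) for session in sessions]
    page_count = Counter(page for pages in page_sets for page in pages)
    # Count every ordered pair of distinct pages that occur in the same session.
    pair_count = Counter(
        pair for pages in page_sets for pair in permutations(sorted(pages), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                 # share of sessions containing both pages
        confidence = count / page_count[a]  # share of sessions with a that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)

# Hypothetical sessions: the pages visited in each one.
sessions = [
    ["/index", "/products", "/cart"],
    ["/index", "/products"],
    ["/index", "/about"],
    ["/products", "/cart"],
]
rules = pair_rules(sessions)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Raising the thresholds is the simplest way to filter out uninteresting rules before the analysis phase.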
Pattern Analysis
• SQL
• Filter out uninteresting rules
• Visualization techniques
▫ Graphic
▫ Color
Usage Mining
• Example
Conclusions
• Useful for personalizing the Internet
• Good commercial use
• Simplifies searching and browsing
• Adds much needed structure to the data on the web
• Easy to manipulate
• No authority
• Needed