Data Mining and the Web: Past, Present and Future
Minos N. Garofalakis
Bell Laboratories
Rajeev Rastogi
Bell Laboratories
S. Seshadri
Bell Laboratories
Kyuseok Shim
Bell Laboratories
Agenda
Today's WWW search tools are plagued by four problems:
  the abundance problem
  limited coverage of the Web
  a limited query interface
  limited customization to individual users
Data Mining Techniques
  Association rules, classification, clustering, etc.
Web Mining Techniques
  Hubs and Authorities
  Building Web Knowledge Base
  Mining the Structure of Web Documents
Web Mining Research Issues
  Mining Web Structure
  Improving Customization
  Extracting Information from Hypertext Documents
Introduction
Problems:
  The Web is a huge, diverse and dynamic collection of interlinked hypertext documents.
  Except for hyperlinks, the Web is largely unstructured.
  The contents of many Internet sources are hidden behind search interfaces.
  99% of the information on the Web is of no interest to 99% of the people.
Abundance problem
  The phenomenon of hundreds of irrelevant documents being returned in response to a search query.
Limited coverage of the Web
A limited query interface
  Based on syntactic keyword-oriented search.
Limited customization to individual users
Data Mining Techniques – Association Rules
A useful mechanism for discovering correlations among items belonging to customer transactions in a market basket database.
Rule form: “Body ⇒ Head [support, confidence]”.
Find all the rules X & Y ⇒ Z with minimum confidence and support:
  support, s: the probability that a transaction contains {X, Y, Z}
  confidence, c: the conditional probability that a transaction containing {X, Y} also contains Z
Association Rules example
For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Min. support 50%
Min. confidence 50%
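
The support and confidence figures for the rule A ⇒ C can be re-derived from the four example transactions. The following is a minimal sketch; the function names `support` and `confidence` are illustrative choices, not taken from the slides.

```python
from fractions import Fraction

# Transactions copied from the example table (transaction ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(body, head):
    """support(body together with head) divided by support(body)."""
    return support(set(body) | set(head)) / support(body)

print(support({"A", "C"}))        # 1/2, i.e. 50%
print(confidence({"A"}, {"C"}))   # 2/3, i.e. about 66.6%
```

Both values meet the 50% minimum thresholds, so A ⇒ C qualifies as a rule in this example.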
Association Rules – Apriori Algorithm
The Apriori algorithm is the most popular algorithm for computing association rules.
The Apriori principle: any subset of a frequent itemset must be frequent.
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
Apriori Algorithm Example
Database D
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D to get C1
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {4}       1
  {5}       3

L1
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {5}       3

C2 (generated from L1)
  itemset
  {1 2}
  {1 3}
  {1 5}
  {2 3}
  {2 5}
  {3 5}

Scan D to count C2
  itemset   sup
  {1 2}     1
  {1 3}     2
  {1 5}     1
  {2 3}     2
  {2 5}     3
  {3 5}     2

L2
  itemset   sup
  {1 3}     2
  {2 3}     2
  {2 5}     3
  {3 5}     2

C3 (generated from L2)
  itemset
  {2 3 5}

Scan D to get L3
  itemset   sup
  {2 3 5}   2
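
The trace above can be reproduced with a short program. This is a sketch under two assumptions: a minimum support count of 2 (which is what the L1 and L2 tables imply, since itemsets with support 1 are dropped) and a simple join-and-prune candidate generation. The name `apriori` and its parameters are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count each candidate contained in a transaction,
        # then keep only the candidates that reach the minimum support count.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})    # L1
    frequent = dict(current)
    k = 1
    while current:
        # Join step: union pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (Apriori principle): every k-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        current = count(candidates)                      # L(k+1)
        frequent.update(current)
        k += 1
    return frequent

# Database D from the example.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The prune step is where the Apriori principle pays off: {1 2 3} and {1 3 5} are never counted against the database because {1 2} and {1 5} are not frequent, leaving {2 3 5} as the only size-3 candidate.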
Data Mining Techniques – Classification
The goal is to induce a model or description for each class in terms of the attributes.
Classifiers are useful in the Web context to build taxonomies and topic hierarchies on Web pages.
Two-step process:
  Model construction
  Use the model in prediction
Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data
  NAME   RANK             YEARS   TENURED
  Mike   Assistant Prof   3       no
  Mary   Assistant Prof   7       yes
  Bill   Professor        2       yes
  Jim    Associate Prof   7       yes
  Dave   Assistant Prof   6       no
  Anne   Associate Prof   3       no

Classifier (Model)
  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction
Classifier -> Testing Data, then Unseen Data

Testing Data
  NAME      RANK             YEARS   TENURED
  Tom       Assistant Prof   2       no
  Merlisa   Associate Prof   7       no
  George    Professor        5       yes
  Joseph    Assistant Prof   7       yes

Unseen Data
  (Jeff, Professor, 4)
  Tenured?
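
The prediction step can be sketched by coding the rule from the model-construction slide and running it over the testing data and the unseen record. The function name `tenured` and the tuple layout are illustrative assumptions; the rule and the rows come from the slides.

```python
def tenured(rank, years):
    """Rule from the slide: rank is professor, or more than 6 years."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Estimate accuracy by comparing predictions with the known labels.
correct = sum(
    1 for _, rank, years, label in testing_data if tenured(rank, years) == label
)
accuracy = correct / len(testing_data)
print(accuracy)

# Classify the unseen record (Jeff, Professor, 4).
print(tenured("Professor", 4))
```

On this testing set the rule gets three of four records right: Merlisa has 7 years, so the rule predicts "yes" although her label is "no". For Jeff the rule answers "yes" because his rank is Professor.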
Classification
Supervised learning
  The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  New data is classified based on the training set.
Decision tree classifiers are popular since they are easily interpreted by humans and are efficient to build.
  Building phase
  Pruning phase
Data Mining Techniques – Clustering
Clustering is a useful technique for discovering interesting data distributions and patterns in the underlying data.
A cluster is a collection of data objects that are
  similar to one another within the same cluster (maximum intraclass similarity)
  dissimilar to the objects in other clusters (minimum interclass similarity)
Clustering is unsupervised classification: no predefined classes.
Common methods
  Partitioning method
  Hierarchical method
  Density-based method
  Grid-based method
  Model-based method
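
As an illustration of the partitioning method, here is a minimal k-means sketch. The slides do not name a specific algorithm, so the choice of k-means, the sample points, and the starting centroids are all assumptions made for this example.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters[0])
print(clusters[1])
```

No class labels are supplied anywhere: the grouping emerges from distances alone, which is what makes clustering unsupervised.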
Web Mining Techniques
Hubs and Authorities (J. Kleinberg, 1999)
  To discover the underlying Web structure and analyze the link topology.
  Authorities are highly referenced pages on the topic.
  Hubs are pages that “point” to many of the authorities.
  Hubs and authorities thus exhibit a strong mutually reinforcing relationship.
Building Web Knowledge Base
  By enumerating and organizing all Web occurrences of chosen subgraphs.
Mining the Structure of Web Documents
  XML
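
The mutually reinforcing relationship between hubs and authorities can be sketched as an iterative score update over a link graph. This is a simplified illustration, not Kleinberg's full procedure: the page names, the tiny graph, the fixed iteration count, and the function name `hits` are assumptions, and the graph is assumed to contain at least one link so the normalization never divides by zero.

```python
def hits(links, iterations=50):
    """Iteratively compute hub and authority scores for a link graph."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: page -> pages it links to.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
}
hub, auth = hits(links)
print(max(auth, key=auth.get))   # page with the highest authority score
print(sorted(hub, key=hub.get, reverse=True))
```

In this graph a1 is referenced by every hub, so it ends up with the highest authority score, while h1 and h2 outrank h3 as hubs because they point to both authorities.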
Web Mining Research Issues
Mining Web Structure
  These approaches only take into account hyperlink information and pay little or no attention to the content of Web pages.
Improving Customization
  Providing users with pages, sites and advertisements that are of interest to them.
  Automatically optimizing the design and organization of sites based on observed user patterns.
Extracting Information from Hypertext Documents
  Complicated, because HTML provides very little semantic information.
  With XML, it may be possible to transform the entire Web into one unified database.
Reference
Data Mining: Concepts and Techniques — Slides for Textbook —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Q & A
Thanks! ^_^