
Data Mining and the Web: Past, Present and Future

Minos N. Garofalakis

Bell Laboratories

[email protected]

Rajeev Rastogi

Bell Laboratories

[email protected]

S. Seshadri

Bell Laboratories

[email protected]

Kyuseok Shim

Bell Laboratories

[email protected]

Agenda

Today's Web search tools are plagued by four problems: the abundance problem, limited coverage of the Web, a limited query interface, and limited customization to individual users.

Data Mining Techniques: association rules, classification, clustering, etc.

Web Mining Techniques: Hubs and Authorities; Building Web Knowledge Bases; Mining the Structure of Web Documents.

Web Mining Research Issues: Mining Web Structure; Improving Customization; Extracting Information from Hypertext Documents.

Introduction

Problems:

The Web is a huge, diverse and dynamic collection of interlinked hypertext documents.

Except for hyperlinks, the Web is largely unstructured.

The contents of many Internet sources are hidden behind search interfaces.

99% of the information on the Web is of no interest to 99% of the people.

The abundance problem: the phenomenon of hundreds of irrelevant documents being returned in response to a search query.

Limited coverage of the Web.

A limited query interface, based on syntactic keyword-oriented search.

Limited customization to individual users.

Data Mining Techniques – Association Rules

A useful mechanism for discovering correlations among items belonging to customer transactions in a market basket database.

Rule form: “Body ⇒ Head [support, confidence]”.

Find all the rules X & Y ⇒ Z with minimum confidence and support:

support, s: the probability that a transaction contains {X, Y, Z}

confidence, c: the conditional probability that a transaction having {X, Y} also contains Z

Association Rules example

Min. support 50%, min. confidence 50%

Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A ⇒ C:

support = support({A, C}) = 50%

confidence = support({A, C}) / support({A}) = 50% / 75% = 66.7%
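The numbers in the example can be checked with a few lines of Python. This is a minimal sketch: the function names `support` and `confidence` are chosen here for illustration, and the four transactions are the ones in the example table.

```python
# Transactions from the example table, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(body, head):
    """Support of body and head together, divided by the support of body."""
    return support(set(body) | set(head)) / support(body)

rule_support = support({"A", "C"})
rule_confidence = confidence({"A"}, {"C"})
print(f"A => C: support = {rule_support:.1%}, confidence = {rule_confidence:.1%}")
```

Both values clear the 50% thresholds, so A ⇒ C qualifies as a rule for this database.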

Association Rules – Apriori Algorithm

The Apriori algorithm is the most popular algorithm for computing association rules.

The Apriori principle: any subset of a frequent itemset must be frequent.

Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

Apriori Algorithm Example

Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to obtain C1

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D to count C2

itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2

itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)

itemset
{2 3 5}

Scan D to obtain L3

itemset   sup.
{2 3 5}   2
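The level-wise procedure and the trace on database D can be reproduced with a short Python sketch. It assumes a minimum support count of 2, which is what the trace implies ({4}, with count 1, is dropped from L1); the function name `apriori` and its parameters are illustrative, not taken from the slides.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count each candidate contained in a transaction,
        # then keep only the candidates that reach the minimum support count.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    # L1: frequent single items.
    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 1
    while level:
        # Join step: unions of two frequent k-itemsets that have exactly k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (Apriori principle): every k-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent

database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(database, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The prune step is where the Apriori principle pays off: {1 2 3} and {1 3 5} are discarded without a database scan because {1 2} and {1 5} are not in L2, leaving {2 3 5} as the only candidate in C3.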

Data Mining Techniques – Classification

The goal is to induce a model or description for each class in terms of the attributes.

Classifiers are useful in the Web context to build taxonomies and topic hierarchies on Web pages.

A two-step process: (1) model construction; (2) use of the model in prediction.

Classification Process (1): Model Construction

Training Data

NAME    RANK            YEARS  TENURED
Mike    Assistant Prof  3      no
Mary    Assistant Prof  7      yes
Bill    Professor       2      yes
Jim     Associate Prof  7      yes
Dave    Assistant Prof  6      no
Anne    Associate Prof  3      no

Classification algorithms take the training data and produce the classifier (model):

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction

Testing Data (fed to the classifier)

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data (fed to the classifier)

(Jeff, Professor, 4)

Tenured?
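The prediction step can be sketched in Python by encoding the rule from the Model Construction slide and applying it to the testing data and to the unseen tuple. The function name `predict_tenured` is illustrative, and measuring accuracy on the testing data is shown here as the usual way of evaluating a classifier; the slides themselves do not report an accuracy figure.

```python
def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes' (otherwise 'no')."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the model's prediction with the known label of each testing tuple.
correct = sum(
    predict_tenured(rank, years) == label
    for _name, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)
print(f"accuracy on testing data: {accuracy:.0%}")

# Unseen data: (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print("Jeff tenured?", jeff)
```

The rule misclassifies Merlisa (7 years, not tenured), so it is right on three of the four testing tuples; for Jeff it answers "yes" because his rank is Professor.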

Classification

Supervised learning: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

New data is classified based on the training set.

Decision tree classifiers are popular since they are easily interpreted by humans and are efficient to build. They are constructed in two phases: a building phase and a pruning phase.
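As a rough illustration of what the building phase does at a single tree node, the sketch below scores a few candidate tests on the tenure training data and picks the one that yields the purest partitions. The use of Gini impurity as the split criterion, the helper names, and the particular candidate tests are all assumptions made for this example; the slides do not name a specific criterion.

```python
# Training data from the Model Construction slide: (rank, years, tenured).
training_data = [
    ("Assistant Prof", 3, "no"),
    ("Assistant Prof", 7, "yes"),
    ("Professor", 2, "yes"),
    ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 6, "no"),
    ("Associate Prof", 3, "no"),
]

def gini(rows):
    """Gini impurity of the class labels in rows (0.0 means a pure node)."""
    if not rows:
        return 0.0
    labels = [label for _rank, _years, label in rows]
    return 1.0 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))

def split_impurity(rows, test):
    """Size-weighted Gini impurity of the two partitions induced by test."""
    left = [r for r in rows if test(r)]
    right = [r for r in rows if not test(r)]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)

# Hypothetical candidate tests for the root node.
candidate_splits = {
    "rank == 'Professor'": lambda r: r[0] == "Professor",
    "years > 6": lambda r: r[1] > 6,
    "years > 2": lambda r: r[1] > 2,
}
scores = {name: split_impurity(training_data, test)
          for name, test in candidate_splits.items()}
best = min(scores, key=scores.get)
print(scores)
print("best split:", best)
```

A full builder would apply the same selection recursively to each partition until the nodes are pure, and the pruning phase would then cut back subtrees that fit noise in the training data.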

Data Mining Techniques – Clustering

Clustering is a useful technique for discovering interesting data distributions and patterns in the underlying data.

A cluster is a collection of data objects that are:

Similar to one another within the same cluster (maximum intraclass similarity).

Dissimilar to the objects in other clusters (minimum interclass similarity).

Clustering is unsupervised classification: there are no predefined classes.

Common methods:

Partitioning method

Hierarchical method

Density-based method

Grid-based method

Model-based method
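To make the partitioning method concrete, here is a minimal k-means sketch. k-means is a well-known partitioning algorithm, but choosing it here is an assumption, since the slides name only the method family. The sample points and the use of the first k points as initial centroids are also illustrative choices.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; returns (centroids, cluster index per point)."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2 + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[j])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # no centroid moved, so the partition is stable
        centroids = new_centroids
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

No class labels are supplied: the two groups emerge only from the distances between points, which is what distinguishes clustering from the supervised classification described earlier.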

Web Mining Techniques

Hubs and Authorities (J. Kleinberg, 1999)

Goal: to discover the underlying Web structure and analyze the link topology.

Authorities are highly referenced pages on the topic.

Hubs are pages that “point” to many of the authorities.

Hubs and authorities thus exhibit a strong mutually reinforcing relationship.

Building Web Knowledge Bases

By enumerating and organizing all Web occurrences of chosen subgraphs.

Mining the Structure of Web Documents

XML
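The mutually reinforcing relationship between hubs and authorities can be sketched as an iterative score update in the spirit of Kleinberg's algorithm: a page's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to. The tiny link graph, the page names, and the fixed iteration count below are hypothetical and serve only as an illustration.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum of the authority scores of the pages p points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise both score vectors to unit length so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages linking to two candidate authorities.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
print("authority scores:", sorted(auth.items(), key=lambda kv: -kv[1]))
print("hub scores:", sorted(hub.items(), key=lambda kv: -kv[1]))
```

In this graph a1 ends up as the strongest authority because all three hubs point to it, and h1 and h2 score higher as hubs than h3 because they point to both authorities.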

Web Mining Research Issues

Mining Web Structure.

These approaches only take into account hyperlink information and pay little or no attention to the content of Web pages.

Improving Customization.

Providing users with pages, sites and advertisements that are of interest to them.

Having sites automatically optimize their design and organization based on observed user patterns.

Extracting Information from Hypertext Documents.

Complicated, because HTML provides very little semantic information.

With XML, it may be possible to transform the entire Web into one unified database.

Reference

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (slides for the textbook). Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada. http://www.cs.sfu.ca

Q & A

Thanks! ^_^
