web mining issues


Upload: tacey

Post on 25-Feb-2016



TRANSCRIPT

Page 1: Web Mining Issues

Web Mining Issues

Size
  – >350 million pages
  – Grows at about 1 million pages a day
Diverse types of data

Page 2: Web Mining Issues

Web Mining Taxonomy

Page 3: Web Mining Issues

Crawlers

A robot (spider) traverses the hypertext structure of the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler: visits the entire Web (?) and replaces the index
Periodic Crawler: visits portions of the Web and updates a subset of the index
Incremental Crawler: selectively searches the Web and incrementally modifies the index
Focused Crawler: visits pages related to a particular subject
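
The traversal described on this slide can be sketched as a breadth-first walk over links. This is a minimal illustration, not a real crawler: the `WEB` dictionary stands in for the network, and every page name in it is made up.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
WEB = {
    "a.html": ("web mining overview", ["b.html", "c.html"]),
    "b.html": ("crawlers and spiders", ["c.html"]),
    "c.html": ("search engine index", ["a.html"]),
    "d.html": ("unreachable page", []),
}

def crawl(seed, web):
    """Breadth-first traversal from seed.

    Returns (index, visited), where index maps each word to the set of
    URLs whose text contains it.
    """
    index = {}
    visited = set()
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        frontier.extend(links)  # follow the hypertext structure
    return index, visited

index, visited = crawl("a.html", WEB)
```

Rebuilding `index` from scratch on every run corresponds to the traditional crawler; the periodic and incremental variants would update an existing index instead.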

Page 4: Web Mining Issues

Focused Crawler

Classifier also determines how useful outgoing links are
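
One way to picture the classifier's role is a best-first crawl in which a relevance score decides both whether a page is kept and how soon its links are followed. The sketch below is an assumption-heavy illustration: the keyword-overlap `relevance` function stands in for a trained classifier, and the pages, topic words, and threshold are all invented.

```python
import heapq

# Hypothetical pages: URL -> (text, outgoing links).
PAGES = {
    "home": ("general news portal", ["sports", "mining"]),
    "sports": ("football scores", ["home"]),
    "mining": ("web mining tutorial", ["pagerank"]),
    "pagerank": ("pagerank web mining notes", []),
}
TOPIC = {"web", "mining"}

def relevance(text):
    """Stand-in classifier: fraction of topic words present in the text."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def focused_crawl(seed, pages, threshold=0.5):
    """Visit pages best-first; follow links only out of the seed or relevant pages."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    relevant = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = pages[url]
        score = relevance(text)
        if score >= threshold:
            relevant.append(url)
        if score >= threshold or url == seed:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    # Assumption: a link is as useful as the page that cites it.
                    heapq.heappush(frontier, (-score, link))
    return relevant

result = focused_crawl("home", PAGES)
```

Here the off-topic "sports" page is still fetched, but its links are never followed, which is what keeps the crawl focused.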

Page 5: Web Mining Issues

Focused Crawler

Page 6: Web Mining Issues

Personalization

Web access or contents tuned to better fit the desires of each user.
Manual techniques identify a user's preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.
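
As a rough illustration of collaborative filtering, the sketch below scores pages a user has not rated by the ratings of other users, weighted by how similar those users are. Cosine similarity is one common choice and is an assumption here, as are the users, pages, and ratings.

```python
import math

# Hypothetical ratings: user -> {page: rating}.
RATINGS = {
    "ann": {"p1": 5, "p2": 4, "p3": 1},
    "bob": {"p1": 5, "p2": 5, "p4": 4},
    "cy": {"p1": 1, "p3": 5, "p5": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[p] * v[p] for p in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Rank pages the user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for page, rating in their.items():
            if page not in ratings[user]:
                scores[page] = scores.get(page, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

suggestions = recommend("ann", RATINGS)
```

Content-based filtering would replace the user-to-user similarity with a similarity between a page's contents and the user's profile.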

Page 7: Web Mining Issues

PageRank

Used by Google
Prioritizes pages returned from a search by looking at Web structure.
Importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.

Page 8: Web Mining Issues

PageRank (cont’d)

$PR(p) = c\,(PR(1)/N_1 + \dots + PR(n)/N_n)$
  – $PR(i)$: PageRank for a page $i$ which points to target page $p$.
  – $N_i$: number of links coming out of page $i$
Rank source $E$: $R = cAR + cE$
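
The recurrence can be evaluated by repeated substitution (power iteration). The sketch below uses the widely seen damped form, PR(p) = (1 - d)/n + d * sum(PR(i)/N_i), which corresponds to a uniform rank source; the damping value d = 0.85, the handling of pages with no outgoing links, and the link graph are assumptions, not taken from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: page -> list of pages it points to. Returns page -> rank (ranks sum to 1)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}  # uniform rank source
        for i, outs in links.items():
            if not outs:
                # Assumption: a page with no out-links spreads its rank evenly.
                for p in pages:
                    new[p] += d * pr[i] / n
            else:
                for p in outs:
                    new[p] += d * pr[i] / len(outs)  # PR(i) / N_i
        pr = new
    return pr

# Hypothetical link graph.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
```

Page "c" ends up ranked highest because it collects backlinks from three pages, while "d", which nothing points to, keeps only its share of the rank source.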

Page 9: Web Mining Issues

CLEVER

Identify authoritative and hub pages.
Authoritative Pages:
  – Highly important pages.
  – Best source for requested information.
Hub Pages:
  – Contain links to highly important pages.
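
Hub and authority scores are usually computed by mutual reinforcement, as in the HITS algorithm on which CLEVER builds: a page's authority is the sum of the hub scores of pages pointing to it, and its hub score is the sum of the authority scores of pages it points to. The sketch below is a bare-bones version with an invented link graph and a fixed iteration count.

```python
import math

def hits(links, iterations=20):
    """links: page -> list of pages it points to. Returns (hub, authority) score dicts."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, outs in links.items():
            for dst in outs:
                auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: two link collections pointing at three content pages.
LINKS = {"h1": ["x", "y"], "h2": ["x", "y", "z"], "x": [], "y": [], "z": []}
hub, auth = hits(LINKS)
```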

Page 10: Web Mining Issues

Web Usage Mining Applications

Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)

Page 11: Web Mining Issues

Web Usage Mining Activities

Preprocessing Web log
  – Cleanse
  – Remove extraneous information
  – Sessionize
    Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
  – Count patterns that occur in sessions
  – Pattern is a sequence of page references in a session.
  – Similar to association rules
    » Transaction: session
    » Itemset: pattern (or subset)
    » Order is important
Pattern Analysis

Page 12: Web Mining Issues

Web Usage Mining Issues

Identification of the exact user is not possible.
The exact sequence of pages referenced by a user cannot be determined because of caching.
Session not well defined
Security, privacy, and legal issues

Page 13: Web Mining Issues

Web Log Cleansing

Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
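
The three cleansing steps can be sketched as a single pass over the log. The record layout `(ip, url, status)`, the list of non-page file suffixes, and the `U1`/`P1` ID scheme are assumptions made for illustration.

```python
# Hypothetical log records: (source IP, requested URL, HTTP status).
LOG = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.gif", 200),
    ("10.0.0.2", "/missing.html", 404),
    ("10.0.0.2", "/index.html", 200),
    ("10.0.0.1", "/about.html", 200),
]
# Assumed suffixes for figures and code that are not page references.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")

def cleanse(log):
    """Drop error and non-page records; replace IPs and URLs with opaque IDs."""
    user_ids, page_ids = {}, {}
    cleaned = []
    for ip, url, status in log:
        if status >= 400:  # error record
            continue
        if url.lower().endswith(NON_PAGE_SUFFIXES):  # figure or code, not a page
            continue
        uid = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        pid = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((uid, pid))
    return cleaned

cleaned = cleanse(LOG)
```

IDs are assigned in order of first appearance, so the same IP or URL always maps to the same ID within one run without revealing the original value.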

Page 14: Web Mining Issues

Sessionizing

Divide Web log into sessions.
Two common techniques:
  – Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
  – All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
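
Both techniques can be sketched with one function that takes either limit. The record layout `(ip, timestamp in seconds, page)` and the parameter names `max_span` and `max_gap` are assumptions; the first technique is interpreted here as measuring the interval from the first reference of the session.

```python
from collections import defaultdict

def sessionize(log, max_span=None, max_gap=None):
    """Split a log into sessions per source IP.

    max_span: longest allowed time between the first and any later reference.
    max_gap:  longest allowed interclick time between consecutive references.
    Returns a list of (ip, [pages]) in IP order.
    """
    by_ip = defaultdict(list)
    for ip, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_ip[ip].append((ts, page))
    sessions = []
    for ip, hits in by_ip.items():
        current, start, prev = [], None, None
        for ts, page in hits:
            too_long = max_span is not None and start is not None and ts - start > max_span
            too_idle = max_gap is not None and prev is not None and ts - prev > max_gap
            if current and (too_long or too_idle):
                sessions.append((ip, current))  # close the running session
                current, start = [], None
            if start is None:
                start = ts
            current.append(page)
            prev = ts
        if current:
            sessions.append((ip, current))
    return sessions

# Hypothetical log; timestamps are seconds since some arbitrary origin.
LOG = [
    ("1.1.1.1", 0, "A"), ("1.1.1.1", 600, "B"), ("1.1.1.1", 1700, "C"),
    ("2.2.2.2", 100, "A"), ("1.1.1.1", 1800, "D"),
]
by_interval = sessionize(LOG, max_span=25 * 60)
by_interclick = sessionize(LOG, max_gap=20 * 60)
```

On this log the 25-minute interval splits the first IP's references into two sessions, while the 20-minute interclick threshold keeps them together, which shows how the two definitions can disagree.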

Page 15: Web Mining Issues

Episodes

Partially ordered set of pages
Serial episode: totally ordered with time constraint
Parallel episode: partially ordered with time constraint
General episode: partially ordered with no time constraint

Page 16: Web Mining Issues

DAG for Episode

Page 17: Web Mining Issues

Longest Common Subseries

Find the longest subseries that two series have in common.
Ex:
  – X = <10,5,6,9,22,15,4,2>
  – Y = <6,9,10,5,6,22,15,4,2>
  – Output: <22,15,4,2>
  – Sim(X,Y) = l/n = 4/9
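
The example output is a contiguous run, so the sketch below treats a subseries as a contiguous block and finds the longest one with dynamic programming. Taking n as the length of the longer series is an assumption that matches the 4/9 in the example.

```python
def longest_common_subseries(x, y):
    """Longest contiguous run of values that appears in both series."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run ending at x[i-2], y[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return x[best_end - best_len:best_end]

def sim(x, y):
    """l / n, with l the common run length and n the longer series length (assumed)."""
    return len(longest_common_subseries(x, y)) / max(len(x), len(y))

X = [10, 5, 6, 9, 22, 15, 4, 2]
Y = [6, 9, 10, 5, 6, 22, 15, 4, 2]
common = longest_common_subseries(X, Y)
similarity = sim(X, Y)
```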

Page 18: Web Mining Issues

Similarity based on Linear Transformation

Linear transformation function f
  – Convert a value from one series to a value in the second
Tolerated difference in results
Time value difference allowed

Page 19: Web Mining Issues

Distance between Strings

Cost to convert one to the other
Transformations
  – Match: Current characters in both strings are the same
  – Delete: Delete current character in input string
  – Insert: Insert current character in target string into input string
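
With only these three transformations, the cheapest conversion can be found with a standard dynamic-programming table. The slides do not give costs, so the sketch assumes a match is free and each delete or insert costs 1.

```python
def string_distance(source, target, delete_cost=1, insert_cost=1):
    """Minimum cost to turn source into target using match, delete, and insert."""
    m, n = len(source), len(target)
    # dist[i][j]: cost to convert source[:i] into target[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * delete_cost
    for j in range(1, n + 1):
        dist[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(dist[i - 1][j] + delete_cost,   # delete source[i-1]
                       dist[i][j - 1] + insert_cost)   # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])   # match, no cost
            dist[i][j] = best
    return dist[m][n]
```

Because there is no substitute operation, changing one character costs a delete plus an insert: `string_distance("cat", "cut")` is 2 under these assumed costs.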

Page 20: Web Mining Issues

Distance between Strings

Page 21: Web Mining Issues

Frequent Sequence

Page 22: Web Mining Issues

Frequent Sequence Example

Purchases made by customers
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
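
Support here is the fraction of customers whose purchase sequence contains the pattern, with itemsets matched in order. The purchase table from the slide did not survive, so the `CUSTOMERS` data below is invented; it was chosen only so that it yields the three support values listed.

```python
from fractions import Fraction

# Hypothetical purchase sequences, one list of itemsets per customer.
CUSTOMERS = [
    [{"A"}, {"B", "C"}, {"D"}],
    [{"B", "C"}, {"D"}, {"A"}],
    [{"A"}, {"D"}],
]

def contains(customer_seq, pattern):
    """True if each pattern itemset is a subset of a later and later purchase."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest purchase that covers this itemset.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next itemset must come strictly later
    return True

def support(pattern, sequences):
    """Fraction of sequences that contain the pattern."""
    return Fraction(sum(contains(s, pattern) for s in sequences), len(sequences))

s_ac = support([{"A"}, {"C"}], CUSTOMERS)
s_ad = support([{"A"}, {"D"}], CUSTOMERS)
s_bcd = support([{"B", "C"}, {"D"}], CUSTOMERS)
```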

Page 23: Web Mining Issues

Frequent Sequence Lattice

Page 24: Web Mining Issues

SPADE

Sequential Pattern Discovery using Equivalence classes
Divides lattice into equivalence classes and searches each separately.

Page 25: Web Mining Issues

SPADE Example

ID-List for Sequences of length 1:
Count for <{A}> is 3
Count for <{A},{D}> is 2
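
SPADE stores, for each sequence, an ID-list of (customer ID, event ID) pairs and builds longer sequences by joining the lists of shorter ones. The sketch below shows only that temporal join for two single items. The ID-list table from the slide is missing, so the `ID_LISTS` values are invented, chosen to reproduce the two counts above.

```python
# Hypothetical vertical database: item -> list of (customer ID, event ID).
ID_LISTS = {
    "A": [(1, 1), (2, 3), (3, 1)],
    "D": [(1, 3), (2, 2), (3, 2)],
}

def temporal_join(first, second):
    """ID-list of <{first},{second}>: events of second that follow first for the same customer."""
    return sorted({
        (sid2, eid2)
        for sid1, eid1 in first
        for sid2, eid2 in second
        if sid1 == sid2 and eid2 > eid1
    })

def count(id_list):
    """Number of distinct customers in which the sequence occurs."""
    return len({sid for sid, _ in id_list})

a_then_d = temporal_join(ID_LISTS["A"], ID_LISTS["D"])
count_a = count(ID_LISTS["A"])
count_ad = count(a_then_d)
```

Customer 2 buys D before A, so that pair is dropped by the join and the count falls from 3 to 2.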

Page 26: Web Mining Issues

Equivalence Classes

Page 27: Web Mining Issues

SPADE Algorithm