webmininglec

Data Mining Algorithms

Web Mining

Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web
• Introduction
• Web Content Mining
• Web Structure Mining
• Web Usage Mining

Introduction
• The Web is perhaps the single largest data source in the world.
• Web mining aims to extract and mine useful knowledge from the Web.
• It is a multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc.
• Due to the heterogeneity and lack of structure of Web data, mining is a challenging task.

Opportunities and Challenges
• The amount of information on the Web is huge and easily accessible.
• The coverage of Web information is very wide and diverse.
• Information of almost all types exists on the Web, e.g., structured tables, text, multimedia data, etc.
• Much of the Web information is semi-structured due to the nested structure of HTML code.
• Much of the Web information is linked: there are hyperlinks among pages within a site and across different sites.
• Much of the Web information is redundant: the same piece of information or its variants may appear in many pages.

Opportunities and Challenges (cont.)
• The Web is noisy. A Web page typically contains many kinds of information, e.g., main content, advertisements, navigation panels, copyright notices, etc.
• The Web consists of the surface Web and the deep Web.
  – Surface Web: pages that can be browsed using a browser.
  – Deep Web: content that can only be accessed through parameterized query interfaces.
• The Web is also about services.
• The Web is dynamic. Information on the Web changes constantly. Keeping up with and monitoring the changes are important issues.
• The Web is a virtual society. It is not only about data, information and services, but also about interactions among people, organizations and automated systems, i.e., communities.

Web Mining: Other Issues
• Size
  – More than 1,000 million pages; 227,225,642 web sites (Sep 2010, Netcraft Survey)
  – Grows at about 1 million pages a day
  – Google indexes more than 5 billion documents
• Diverse types of data
• So it is not possible to warehouse the Web or to apply conventional data mining to it directly

Web Data
• Web pages
• Intra-page structures (HTML, XML code)
• Inter-page structures (actual linkage structures between web pages)
• Usage data
• Supplemental data
  – Profiles
  – Registration information
  – Cookies

Web Mining Taxonomy
• Web Content Mining
  – Extends the work of basic search engines
• Web Structure Mining
  – Mines the structure (links, graph) of the Web
• Web Usage Mining
  – Analyzes logs of Web access
• Web mining applications include targeted advertising, recommendation engines, CRM, etc.

Web Content Mining
• Extends the work of basic search engines
• Web content mining: mining, extraction and integration of useful data, information and knowledge from Web page contents
• Search engines
  – IR application, keyword based, similarity between query and document
  – Crawlers, indexing
  – Profiles
  – Link analysis

Issues in Web Content Mining
• Developing intelligent tools for IR
  – Finding keywords and key phrases
  – Discovering grammatical rules and collocations
  – Hypertext classification/categorization
  – Extracting key phrases from text documents
  – Learning extraction models/rules
  – Hierarchical clustering
  – Predicting (word) relationships

Search Engine – Two Rank Functions
[Figure: search engine architecture. Components shown: Web pages, Web page parser, indexer, anchor text generator, Web graph constructor, metadata, forward index, inverted index, term dictionary (lexicon), URL dictionary, forward links, backward links (anchor text), Web topology graph, search. The two rank functions are relevance ranking (similarity based on content or text) and importance ranking (ranking based on link structure analysis).]

How Do We Find Similar Web Pages?
• Content-based approach
• Structure-based approach
• Combining both content and structure approaches

Relevance Ranking
• Inverted index: a data structure for supporting text queries, like the index in a book

[Figure: documents on disk are indexed to build the inverted index. Sample entries (term: document IDs):]
  aalborg: 3452, 11437, ...
  arm: 4, 19, 29, 98, 143, ...
  armada: 145, 457, 789, ...
  armadillo: 678, 2134, 3970, ...
  armani: 90, 256, 372, 511, ...
  ...
  zz: 602, 1189, 3209, ...
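The inverted index described above can be sketched in a few lines of Python. This is a minimal illustration, not how a production search engine stores its postings: the function names `build_inverted_index` and `search_all`, the whitespace tokenizer, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all(index, query):
    """Return the IDs of documents that contain every term of the query."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "web mining extracts knowledge from the web",
    2: "data mining algorithms",
    3: "web usage mining analyses web logs",
}
index = build_inverted_index(docs)
```

A query is answered by intersecting the posting lists of its terms, so `search_all(index, "web mining")` only has to look at the entries for "web" and "mining" rather than scanning every document.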

Crawlers
• A robot (spider) traverses the hypertext structure of the Web.
• Collects information from visited pages
• Used to construct indexes for search engines
• Traditional crawler: visits the entire Web and replaces the index
• Periodic crawler: visits portions of the Web and updates a subset of the index
• Incremental crawler: selectively searches the Web and incrementally modifies the index
• Focused crawler: visits pages related to a particular subject

Focused Crawler
• Only visits links from a page if that page is determined to be relevant.
• The classifier is static after the learning phase.
• Components:
  – Classifier, which assigns a relevance score to each page based on the crawl topic.
  – Distiller, which identifies hub pages.
  – Crawler, which visits pages based on classifier and distiller scores.

Focused Crawler (cont.)
• The classifier relates documents to topics.
• The classifier also determines how useful outgoing links are.
• Hub pages contain links to many relevant pages. They must be visited even if they do not have a high relevance score.
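The focused-crawling rule above (follow links only out of pages judged relevant) can be sketched on a toy in-memory "Web". Everything here is an assumption for illustration: the `web` dictionary stands in for real HTTP fetches, `relevance` is a keyword-overlap stand-in for a trained classifier, and the distiller (hub detection) is left out.

```python
import heapq

def relevance(text, topic_terms):
    """Toy classifier: fraction of topic terms that occur in the page text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def focused_crawl(web, seed, topic_terms, threshold=0.5):
    """Visit pages best-first; follow links only out of relevant pages."""
    frontier = [(-1.0, seed)]          # max-heap via negated priority
    seen = {seed}
    relevant = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text, topic_terms)
        if score < threshold:
            continue                    # irrelevant page: do not expand its links
        relevant.append(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # a link is prioritized by the score of the page it was found on
                heapq.heappush(frontier, (-score, link))
    return relevant

# Hypothetical pages: url -> (page text, outgoing links).
web = {
    "a": ("web mining overview", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("web usage mining logs", ["e"]),
    "d": ("web mining tutorial", []),
    "e": ("mining the web graph", []),
}
crawled = focused_crawl(web, "a", ["web", "mining"])
```

Page "d" is on topic but is reachable only through the irrelevant page "b", so this sketch never visits it. That is the gap the distiller is meant to close by forcing visits to hub pages regardless of their own relevance score.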

Virtual Web View
• A Multiple Layered DataBase (MLDB) is built on top of the Web.
• Each layer of the database is more generalized (and smaller) and more centralized than the one beneath it.
• Upper layers of the MLDB are structured and can be accessed with SQL-type queries.
• Translation tools convert Web documents to XML.
• Extraction tools extract desired information to place in the first layer of the MLDB.
• Higher levels contain more summarized data obtained through generalizations of the lower levels.

Multilevel Databases
[Figure: data types shown are text, image, audio, video, maps and games.]

Levels of an MLDB
• Layer 0:
  – Unstructured, massive and global information base.
• Layer 1:
  – Derived from lower layers.
  – Relatively structured.
  – Obtained by data analysis, transformation and generalization.
• Higher layers (Layer n):
  – Further generalization to form smaller, better structured databases for more efficient retrieval.

Web Query System
• These systems attempt to make use of:
  – A standard database query language (SQL)
  – Structural information about Web documents
  – Natural language processing for queries made in WWW searches
• Examples:
  – WebLog: restructures information extracted from Web sources.
  – W3QL: combines structure queries (organization of hypertext) and content queries (information retrieval techniques).

Architecture of a Global MLDB
[Figure: architecture of a global MLDB. Labels shown: Source 1, Source 2, ..., Source n; Generalized Data; Concept Hierarchy; Higher Levels; Resource Discovery (MLDB); Knowledge Discovery.]


Personalization
• Web access or contents are tuned to better fit the desires of each user.
• Manual techniques identify a user's preferences based on profiles or demographics.
• Collaborative filtering identifies preferences based on ratings from similar users.
• Content-based filtering retrieves pages based on similarity between pages and user profiles.

Applications
• ShopBot
• Bookmark organizer
• Recommender systems
• Intelligent search engines

Web Structure Mining
• Mines the structure (links, graph) of the Web
• Techniques
  – PageRank
  – CLEVER
• Creates a model of the Web organization.
• May be combined with content mining to retrieve important pages more effectively.

Web as a Graph
• Web pages are the nodes of a graph.
• Links are directed edges.
[Figure: example graph whose nodes are "my page", www.vesit.edu and www.google.com, connected by directed links.]

Link Structure of the Web
• Forward links (out-edges).
• Backward links (in-edges).
• Approximation of importance/quality: a page may be of high quality if it is referred to by many other pages, and by pages of high quality.

Authorities and Hubs
• An authority is a page that has relevant information about the topic.
• A hub is a page that has a collection of links to pages about that topic.
[Figure: a hub h pointing to authorities a1, a2, a3 and a4.]

PageRank
• Introduced by Brin and Page (1998).
• Mines the hyperlink structure of the Web to produce a 'global' importance ranking of every Web page.
• Used in the Google search engine.
• Web search results are returned in rank order.
• Treats a link like an academic citation.
• Assumption: highly linked pages are more 'important' than pages with few links.

PageRank
• Used by Google
• Prioritizes pages returned from a search by looking at Web structure.
• The importance of a page is calculated based on the number of pages that point to it (backlinks).
• Weighting is used to give more importance to backlinks coming from important pages.

PageRank: Main Idea
• A page has a high rank if the sum of the ranks of its back-links is high.
• Google uses a number of factors to rank search results:
  – proximity, anchor text, PageRank
• The benefits of PageRank are greatest for underspecified queries. For example, a 'Mumbai University' query using PageRank lists the university home page first.

Basic Idea
• Back-links coming from important pages convey more importance to a page.
• For example, if a Web page has a link from the Yahoo home page, it may be just one link, but it is a very important one.
• A page has a high rank if the sum of the ranks of its back-links is high.
• This covers both the case when a page has many back-links and the case when a page has a few highly ranked back-links.

Definition
• A page's rank is equal to the sum of the ranks of all the pages pointing to it, each divided by that page's number of outgoing links.

  $$\mathrm{Rank}(u) = \sum_{v \in B_u} \frac{\mathrm{Rank}(v)}{N_v}$$

  – $B_u$: set of pages with links to $u$
  – $N_v$: number of links from $v$

Simplified PageRank Example
• In the simplified form $\mathrm{Rank}(u) = c \sum_{v \in B_u} \mathrm{Rank}(v)/N_v$, Rank(u) is the rank of page u and c is a normalization constant (c < 1, to cover for pages with no outgoing links).

Expanded Definition
• $R(u)$: PageRank of page $u$
• $c$: factor used for normalization (< 1)
• $B_u$: set of pages pointing to $u$
• $N_v$: number of outbound links of $v$
• $R(v)$: PageRank of a page $v$ that points to $u$
• $E(u)$: distribution over Web pages to which a random surfer periodically jumps (set to 0.15)

  $$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)$$

Problem 1 - Rank Sink
• A cycle of pages that is pointed to by some incoming link but has no links leading out of the cycle.
• The loop will accumulate rank but never distribute it.

Problem 2 - Dangling Links
• Dangling links point to pages that have no outgoing links, so the rank they carry is not passed on.

PageRank (cont'd)
• PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
  – PR(i): PageRank of a page i that points to the target page p.
  – N_i: number of links coming out of page i.
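The PageRank recurrence can be computed iteratively. The sketch below is an assumption-laden illustration rather than Google's implementation: it uses the commonly seen damping-factor form, where each page receives (1 - d)/n plus d times the rank flowing in over its back-links, with d = 0.85 chosen to match the 0.15 jump value mentioned above. Rank held by pages without out-links is spread evenly, which is one simple way to handle the dangling-link problem. The four-page graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # rank held by pages without out-links is spread over all pages
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for v in pages:
            for u in links[v]:
                # v passes an equal share of its rank along each out-link
                new_rank[u] += damping * rank[v] / len(links[v])
        rank = new_rank
    return rank

# Hypothetical graph: every link target must also appear as a key.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this example page C collects back-links from A, B and D and ends up with the highest rank, while D, which nobody links to, keeps only the random-jump share.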

HITS
• Hyperlink-Induced Topic Search
• Based on a set of keywords, find a set of relevant pages, R.
• Identify hub and authority pages for these.
  – Expand R to a base set, B, of pages linked to or from R.
  – Calculate weights for authorities and hubs.
• Pages with the highest ranks in R are returned.


Authorities and Hubs (cont.)
• Good hubs are the ones that point to good authorities.
• Good authorities are the ones that are pointed to by good hubs.
[Figure: hubs h1 to h5 pointing to authorities a1 to a6.]

Finding Authorities and Hubs
• First, construct a focused sub-graph of the WWW.
• Second, compute hubs and authorities from the sub-graph.

Construction of Sub-graph
[Figure: construction of the sub-graph. Labels shown: Topic, Search Engine, Crawler, root set pages, expanded set pages (root set plus forward-link pages).]

Root Set and Base Set
• Use the query term to collect a root set of pages from a text-based search engine (Lycos, AltaVista).
[Figure: the root set.]

Root Set and Base Set (cont.)
• Expand the root set into a base set by including (up to a designated size cut-off):
  – All pages linked to by pages in the root set
  – All pages that link to a page in the root set
[Figure: the root set contained in the base set.]

Hubs & Authorities Calculation
• Iterative algorithm on the base set: authority weights a(p) and hub weights h(p).
  – Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p.
  – Repeat the following two operations (and then re-normalize a and h to have unit norm):

  $$a(p) = \sum_{q \,:\, q \text{ points to } p} h(q)$$

  $$h(p) = \sum_{q \,:\, p \text{ points to } q} a(q)$$

[Figure: pages v1, v2, v3 point to p and contribute h(v1), h(v2), h(v3) to a(p); p points to v1, v2, v3, which contribute a(v1), a(v2), a(v3) to h(p).]
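The two update operations can be written down directly. The following is a minimal sketch under assumptions: the function name `hits`, the fixed iteration count, the choice of the Euclidean norm for re-normalization, and the tiny four-page graph are all illustrative.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it points to."""
    pages = list(graph)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over pages q that point to p
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        # h(p) = sum of a(q) over pages q that p points to
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        # re-normalize both vectors to unit Euclidean norm
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical base set: two hub-like pages and two authority-like pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(graph)
```

Here a1 is pointed to by both hubs and so gets the larger authority weight, and h1 points to both authorities and so gets the larger hub weight. Pages with no in-links get authority 0, and pages with no out-links get hub weight 0.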

Example
[Figure: a four-page graph. Initial weights: every page has hub 0.45 and authority 0.45.]

Example (cont.)
[Figure: the same graph with updated (hub, authority) weights for the four pages: (0.9, 0.45), (1.35, 0.9), (0.45, 0.9), (0.45, 0.9).]

Algorithmic Outcome
• Applying iterative multiplication (power iteration) converges to the principal eigenvector for any "non-degenerate" initial vector.
• Hubs and authorities are the outcome of the process.
• The principal eigenvector contains the highest hubs and authorities.

Results
• Although HITS is only link-based (it completely disregards page content), results are quite good for many tested queries.
• Starting from a narrow topic, HITS tends to end up in a more general one.
• A peculiarity of hub pages: their many links can cause the algorithm to drift, because they can point to authorities on different topics.
• Pages from a single domain/website can dominate the result if they all point to one page, which is not necessarily a good authority.

Possible Enhancements
• Use weighted sums for link calculation.
• Take advantage of "anchor text", the text surrounding the link itself.
• Break hubs into smaller pieces. Analyze each piece separately, instead of the whole hub page as one.
• Disregard or minimize the influence of links inside one domain.
• IBM expanded HITS into CLEVER; it is not seen as a viable real-time search engine.

CLEVER
• Identifies authoritative and hub pages.
• Authoritative pages:
  – Highly important pages.
  – Best source for requested information.
• Hub pages:
  – Contain links to highly important pages.

CLEVER (cont.)
• The CLEVER algorithm is an extension of standard HITS that addresses the problems arising from standard HITS.
• CLEVER assigns a weight to each link based on the query terms and the end-points of the link. It also uses anchor text to set the link weights.
• It breaks large hub pages into smaller units so that each unit is focused on a single topic.
• When a large number of pages come from a single domain, it scales down their weights so that the domain does not dominate the result.

PageRank vs. HITS
• PageRank (Google)
  – Computed for all Web pages stored in the database prior to the query
  – Computes authorities only
  – Trivial and fast to compute
• HITS (CLEVER)
  – Performed on the set of retrieved Web pages for each query
  – Computes authorities and hubs
  – Easy to compute, but real-time execution is hard

Web Usage Mining
• Performs mining on Web usage data, or Web logs.
• A Web log is a listing of page reference data, also called a clickstream.
• Can be seen from the server perspective (better Web site design)
• Or from the client perspective (prefetching of Web pages, etc.)

Web Usage Mining Applications
• Personalization
• Improve the structure of a site's Web pages
• Aid in caching and prediction of future page references
• Improve the design of individual pages
• Improve the effectiveness of e-commerce (sales and advertising)
• Improve Web server performance (load balancing)

Web Usage Mining Activities
• Preprocessing the Web log
  – Cleanse
  – Remove extraneous information
  – Sessionize
    Session: sequence of pages referenced by one user at a sitting.
• Pattern discovery
  – Count patterns that occur in sessions
  – A pattern is a sequence of page references in a session.
  – Similar to association rules
    Transaction: session
    Itemset: pattern (or subset)
    Order is important
• Pattern analysis
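The sessionize step can be sketched as follows. Since a "sitting" is not precisely defined, this example assumes a common heuristic: a new session starts when a user is idle for more than 30 minutes. The record layout `(user, timestamp_seconds, page)`, the function name and the sample log are hypothetical.

```python
from collections import defaultdict

def sessionize(log, timeout=1800):
    """Split (user, timestamp_seconds, page) records into per-user sessions.

    A new session starts when the gap since the user's previous request
    exceeds `timeout` seconds (an assumed 30-minute cut-off).
    """
    by_user = defaultdict(list)
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for user, hits in by_user.items():
        current, last_ts = [], None
        for ts, page in hits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))   # close the previous session
                current = []
            current.append(page)
            last_ts = ts
        sessions.append((user, current))
    return sessions

# Hypothetical log records, deliberately out of order.
log = [
    ("u1", 0, "/home"), ("u2", 10, "/home"), ("u1", 60, "/products"),
    ("u1", 5000, "/home"), ("u2", 70, "/cart"), ("u1", 5100, "/contact"),
]
sessions = sessionize(log)
```

User u1 is split into two sessions because of the long gap between the requests at 60 s and 5000 s. Each resulting session then plays the role of a transaction in pattern discovery.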

Web Usage Mining Issues
• Identification of the exact user is not possible.
• The exact sequence of pages referenced by a user cannot be determined due to caching.
• A session is not well defined.
• Security, privacy, and legal issues

Web Usage Mining - Outcome
• Association rules
  – Find pages that are often viewed together
• Clustering
  – Cluster users based on browsing patterns
  – Cluster pages based on content
• Classification
  – Relate user attributes to patterns

Web Log Cleansing
• Replace the source IP address with a unique but non-identifying ID.
• Replace the exact URL of pages referenced with a unique but non-identifying ID.
• Delete error records and records containing non-page data (such as figures and code).
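The three cleansing rules can be sketched as a single pass over the log. The record fields (`ip`, `url`, `status`), the list of file suffixes treated as non-page data, and the use of HTTP status 400 and above as the error test are assumptions for this illustration.

```python
def cleanse(records, skip_suffixes=(".gif", ".jpg", ".png", ".css", ".js")):
    """Drop error and non-page records, then replace IPs and URLs by opaque IDs."""
    ip_ids, url_ids = {}, {}
    cleaned = []
    for rec in records:
        if rec["status"] >= 400:
            continue                          # error record
        if rec["url"].lower().endswith(skip_suffixes):
            continue                          # figure or code, not a page
        # setdefault hands out the next integer the first time a value is seen
        user_id = ip_ids.setdefault(rec["ip"], len(ip_ids) + 1)
        page_id = url_ids.setdefault(rec["url"], len(url_ids) + 1)
        cleaned.append((user_id, page_id))
    return cleaned

# Hypothetical raw log records.
records = [
    {"ip": "10.0.0.1", "url": "/index.html", "status": 200},
    {"ip": "10.0.0.1", "url": "/logo.gif", "status": 200},
    {"ip": "10.0.0.2", "url": "/index.html", "status": 200},
    {"ip": "10.0.0.2", "url": "/missing.html", "status": 404},
    {"ip": "10.0.0.1", "url": "/products.html", "status": 200},
]
cleaned = cleanse(records)
```

The output keeps only (user ID, page ID) pairs, so later mining steps can still tell users and pages apart without seeing the original addresses or URLs.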

Data Structures
• Keep track of patterns identified during the Web usage mining process
• Common techniques:
  – Trie
  – Suffix tree
  – Generalized suffix tree
  – WAP tree
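As an illustration of the first structure in the list, a trie can store page sequences so that shared prefixes are kept once and counted. This is a minimal sketch: the class and function names and the sample sessions are assumptions, and suffix trees or WAP trees would need more machinery than is shown here.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0     # number of inserted sequences that pass through this node

def insert(root, pages):
    """Add one page sequence, incrementing the count of every prefix."""
    node = root
    for page in pages:
        node = node.children.setdefault(page, TrieNode())
        node.count += 1

def prefix_count(root, pages):
    """Return how many inserted sequences start with `pages`."""
    node = root
    for page in pages:
        node = node.children.get(page)
        if node is None:
            return 0
    return node.count

# Hypothetical sessions, each a sequence of page identifiers.
root = TrieNode()
for session in (["A", "B", "C"], ["A", "B", "D"], ["A", "C"], ["B", "C"]):
    insert(root, session)
```

Looking up a pattern costs time proportional to the pattern length, independent of how many sessions were inserted.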

Web Usage Mining – Three Phases
http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Phase 1: Pre-processing
• Converts the raw data into the data abstraction needed to apply the data mining algorithm, either by:
  – Mapping the log data into relational tables before an adapted data mining technique is performed.
  – Using the log data directly by utilizing special pre-processing techniques.

Raw Data – Web Log
• Click stream: a sequential series of page view requests.
• User session: a delimited set of user clicks (click stream) across one or more Web servers.
• Server session (visit): a collection of user clicks to a single Web server during a user session.
• Episode: a subset of related user clicks that occur within a user session.

Phase 2: Pattern Discovery
• Pattern discovery uses techniques such as statistical analysis, association rules, clustering, classification, sequential patterns, and dependency modeling.

Phase 3: Pattern Analysis
• A process to gain knowledge about how visitors use a Web site in order to:
  – Prevent disorientation and help designers place important information/functions exactly where visitors look for them and in the way users need them.
  – Build an adaptive Web site server

Techniques for Web Usage Mining
• Construct a multidimensional view on the Weblog database
  – Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.
• Perform data mining on Weblog records
  – Find association patterns, sequential patterns, and trends of Web accessing
  – May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer
• Conduct studies to
  – Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping
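The "top N" queries mentioned above amount to grouping log records by one dimension and counting. A full OLAP engine does much more, but the idea can be sketched with the standard library. The record fields (`user`, `page`, `hour`) and the sample data are hypothetical.

```python
from collections import Counter

def top_n(records, field, n):
    """Return the n most frequent values of `field` with their counts."""
    return Counter(rec[field] for rec in records).most_common(n)

# Hypothetical cleaned Weblog records.
records = [
    {"user": "u1", "page": "/home", "hour": 9},
    {"user": "u1", "page": "/products", "hour": 9},
    {"user": "u2", "page": "/home", "hour": 10},
    {"user": "u1", "page": "/home", "hour": 14},
    {"user": "u3", "page": "/home", "hour": 9},
]
top_users = top_n(records, "user", 1)
top_pages = top_n(records, "page", 2)
top_hours = top_n(records, "hour", 1)
```

Each call corresponds to rolling the data up along one dimension (user, page or time period) and ranking the resulting cells by access count.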

Software for Web Usage Mining
• WEBMINER:
  – Introduces a general architecture for Web usage mining, automatically discovering association rules and sequential patterns from server access logs.
  – Proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns.
• WebLogMiner:
  – The Web log is filtered to generate a relational database
  – Data mining is performed on the Web log data cube and Web log database

WEBMINER
• SQL-like query
• A framework for Web mining
  – Association rules: using the Apriori algorithm
    40% of clients who accessed the Web page with URL /company/products/product1.html also accessed /company/products/product2.html
  – Sequential patterns:
    60% of clients who placed an online order in /company/products/product1.html also placed an online order in /company/products/product4.html within 15 days
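A statement such as "40% of clients who accessed product1.html also accessed product2.html" is the confidence of an association rule. The sketch below computes only that measure over a set of sessions. It is not the Apriori algorithm itself and not WEBMINER's query language, and the sessions are made-up data chosen to reproduce the 40% figure.

```python
def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    both = sum(consequent in s for s in with_antecedent)
    return both / len(with_antecedent)

P1 = "/company/products/product1.html"
P2 = "/company/products/product2.html"
# Hypothetical sessions: each is the set of pages one client accessed.
sessions = [
    {P1, P2}, {P1, P2}, {P1}, {P1}, {P1}, {P2}, {"/index.html"},
]
conf = confidence(sessions, P1, P2)
```

Five sessions contain product1.html and two of those also contain product2.html, giving a confidence of 2/5.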

WebLogMiner
• Database construction from the server log file:
  – Data cleaning
  – Data transformation
• Multi-dimensional Web log data cube construction and manipulation
• Data mining on the Web log data cube and Web log database

Mining the World-Wide Web
• Design of a Web Log Miner
  – The Web log is filtered to generate a relational database
  – A data cube is generated from the database
  – OLAP is used to drill down and roll up in the cube
  – OLAM is used for mining interesting knowledge
[Figure: Web log → (1) data cleaning → database → (2) data cube creation → data cube → (3) OLAP → sliced and diced cube → (4) mining → knowledge.]
