Data Mining Algorithms
Web Mining

Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web
• Introduction
• Web Content Mining
• Web Structure Mining
• Web Usage Mining

Introduction
• The Web is perhaps the single largest data source in the world.
• Web mining aims to extract and mine useful knowledge from the Web.
• A multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc.
• Due to the heterogeneity and lack of structure of Web data, mining is a challenging task.

Opportunities and Challenges
• The amount of information on the Web is huge and easily accessible.
• The coverage of Web information is very wide and diverse.
• Data of almost all types exist on the Web, e.g., structured tables, texts, multimedia data, etc.
• Much of the Web information is semi-structured due to the nested structure of HTML code.
• Much of the Web information is linked: hyperlinks connect pages within a site and across different sites.
• Much of the Web information is redundant: the same piece of information or its variants may appear in many pages.

Opportunities and Challenges (cont.)
• The Web is noisy. A Web page typically contains many kinds of information, e.g., main contents, advertisements, navigation panels, copyright notices, etc.
• The Web consists of the surface Web and the deep Web.
– Surface Web: pages that can be browsed using a browser.
– Deep Web: can only be accessed through parameterized query interfaces.
• The Web is also about services.
• The Web is dynamic. Information on the Web changes constantly. Keeping up with the changes and monitoring them are important issues.
• The Web is a virtual society. It is not only about data, information and services, but also about interactions among people, organizations and automatic systems, i.e. communities.

Web Mining Other Issues
• Size
– More than 1,000 million pages; 227,225,642 web sites (Sep 2010, Netcraft Survey)
– Grows at about 1 million pages a day
– Google indexes more than 5 billion documents
• Diverse types of data
• So it is not possible to warehouse the Web or to apply conventional data mining to it directly

Web Data
• Web pages
• Intra-page structures (HTML, XML code)
• Inter-page structures (actual linkage structures between web pages)
• Usage data
• Supplemental data
– Profiles
– Registration information
– Cookies

Web Mining Taxonomy
• Web Content Mining
– Extends work of basic search engines
• Web Structure Mining
– Mine structure (links, graph) of the Web
• Web Usage Mining
– Analyses logs of Web access
• Web mining applications include targeted advertising, recommendation engines, CRM, etc.

Web Content Mining
• Extends work of basic search engines
• Web content mining: mining, extraction and integration of useful data, information and knowledge from Web page contents
• Search engines
– IR application, keyword based, similarity between query and document
– Crawlers, indexing
– Profiles
– Link analysis

Issues in Web Content Mining
• Developing intelligent tools for IR
– Finding keywords and key phrases
– Discovering grammatical rules and collocations
– Hypertext classification/categorization
– Extracting key phrases from text documents
– Learning extraction models/rules
– Hierarchical clustering
– Predicting (word) relationships

Search Engine – Two Rank Functions
[Figure: search engine architecture. Components shown: Web pages; Web Page Parser; Indexer; Anchor Text Generator; Web Graph Constructor; Meta Data; Forward Index; Inverted Index; Term Dictionary (Lexicon); URL Dictionary; Forward Link; Backward Link (Anchor Text); Web Topology Graph; Search.]
• Rank functions:
– Relevance Ranking: similarity based on content or text
– Importance Ranking (Link Analysis): ranking based on link structure analysis

How Do We Find Similar Web Pages?
• Content based approach
• Structure based approach
• Combining both content and structure approach

Relevance Ranking
• Inverted index - a data structure for supporting text queries, like the index in a book
[Figure: documents stored on disks are indexed into an inverted index that lists, for each term, the ids of the documents containing it.]
aalborg    3452, 11437, ...
...
arm        4, 19, 29, 98, 143, ...
armada     145, 457, 789, ...
armadillo  678, 2134, 3970, ...
armani     90, 256, 372, 511, ...
...
zz         602, 1189, 3209, ...
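
As a rough illustration of the inverted index idea, the following sketch builds a term-to-document-id map from a tiny in-memory collection and answers a keyword query by intersecting posting lists. The document texts, the whitespace tokenizer and the function names `build_inverted_index` and `search` are assumptions made for this example, not part of the original slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical documents; the ids echo the style of the posting lists above.
docs = {
    4: "arm wrestling rules",
    19: "robot arm design",
    145: "spanish armada history",
}
index = build_inverted_index(docs)
print(index["arm"])
print(search(index, "robot arm"))
```

A real search engine would add tokenization, stemming and ranking on top of this lookup; the sketch only shows why a query touches the posting lists rather than every document.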

Crawlers
• A robot (spider) traverses the hypertext structure in the Web.
• Collects information from visited pages
• Used to construct indexes for search engines
• Traditional Crawler – visits entire Web and replaces index
• Periodic Crawler – visits portions of the Web and updates subset of index
• Incremental Crawler – selectively searches the Web and incrementally modifies index
• Focused Crawler – visits pages related to a particular subject

Focused Crawler
• Only visit links from a page if that page is determined to be relevant.
• Classifier is static after learning phase.
• Components:
– Classifier which assigns relevance score to each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier and distiller scores.

Focused Crawler (cont.)
• Classifier relates documents to topics.
• Classifier also determines how useful outgoing links are.
• Hub pages contain links to many relevant pages. They must be visited even if they do not have a high relevance score.

[Figure: Focused Crawler]
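
The focused-crawling rule (expand links only from pages judged relevant) can be sketched on a small in-memory "web". Everything here is an assumption for illustration: the `web` dictionary stands in for fetched pages, `relevance` is a toy keyword-overlap score in place of a trained classifier, and the distiller for hub pages is left out entirely.

```python
import heapq

def relevance(text, topic_terms):
    """Toy classifier: fraction of topic terms that occur in the page text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def focused_crawl(web, seed, topic_terms, threshold=0.5, limit=10):
    """Visit pages best-first; follow links only out of pages judged relevant."""
    frontier = [(-1.0, seed)]      # max-heap emulated with negated priorities
    seen = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score = relevance(text, topic_terms)
        visited.append((url, score))
        if score < threshold:
            continue               # irrelevant page: do not expand its links
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical pages: url -> (text, outgoing links)
web = {
    "a": ("data mining on the web", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("web mining and link analysis", ["e"]),
    "d": ("more recipes", []),
    "e": ("mining web logs", []),
}
result = focused_crawl(web, "a", ["web", "mining"])
print([url for url, _ in result])
```

Page "d" is never fetched because it is only reachable through the irrelevant page "b", which is the saving a focused crawler aims for.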

Virtual Web View
• Multiple Layered DataBase (MLDB) built on top of the Web.
• Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
• Upper layers of MLDB are structured and can be accessed with SQL type queries.
• Translation tools convert Web documents to XML.
• Extraction tools extract desired information to place in first layer of MLDB.
• Higher levels contain more summarized data obtained through generalizations of the lower levels.

Multilevel Databases
[Figure: content types shown: Text, Image, Audio, Video, Maps, Games]

Levels of a MLDB
• Layer 0:
– Unstructured, massive and global information base.
• Layer 1:
– Derived from lower layers.
– Relatively structured.
– Obtained by data analysis, transformation and generalization.
• Higher layers (Layer n):
– Further generalization to form smaller, better structured databases for more efficient retrieval.

Web Query System
• These systems attempt to make use of:
– Standard database query language – SQL
– Structural information about web documents
– Natural language processing for queries made in WWW searches.
• Examples:
– WebLog: Restructuring extracted information from Web sources.
– W3QL: Combines structure query (organization of hypertext) and content query (information retrieval techniques).

Architecture of a Global MLDB
[Figure: labels shown: Source 1, Source 2, ..., Source n; Generalized Data; Higher Levels; Concept Hierarchy; Resource Discovery (MLDB); Knowledge Discovery]

Personalization
• Web access or contents tuned to better fit the desires of each user.
• Manual techniques identify user’s preferences based on profiles or demographics.
• Collaborative filtering identifies preferences based on ratings from similar users.
• Content based filtering retrieves pages based on similarity between pages and user profiles.
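
A minimal sketch of user-based collaborative filtering, assuming ratings are stored as one dictionary per user and that cosine similarity is used to compare users (one common choice among several). The users, pages, ratings and the helper names `cosine` and `predict` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dict: item -> rating)."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = cosine(ratings[user], theirs)
        num += sim * theirs[item]
        den += sim
    return num / den if den else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"p1": 5, "p2": 4},
    "bob": {"p1": 5, "p2": 4, "p3": 5},
    "cat": {"p1": 1, "p4": 5, "p3": 1},
}
prediction = predict(ratings, "ann", "p3")
print(round(prediction, 2))
```

Because "ann" rates pages much like "bob" and unlike "cat", the predicted rating for "p3" is pulled towards bob's high rating.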

Applications
• ShopBot
• Bookmark Organizer
• Recommender Systems
• Intelligent Search Engines

Web Structure Mining
• Mine structure (links, graph) of the Web
• Techniques
– PageRank
– CLEVER
• Create a model of the Web organization.
• May be combined with content mining to more effectively retrieve important pages.

Web as a Graph
• Web pages as nodes of a graph.
• Links as directed edges.
[Figure: example graph whose nodes are "my page", www.vesit.edu and www.google.com, joined by directed links]

Link Structure of the Web
• Forward links (out-edges).
• Backward links (in-edges).
• Approximation of importance/quality: a page may be of high quality if it is referred to by many other pages, and by pages of high quality.
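
To make the forward-link and backward-link view concrete, the sketch below stores the Web graph as an adjacency list of out-edges and derives the in-edges from it. The specific edges are assumed for illustration; the original figure's exact links are not recoverable from the transcript.

```python
def backward_links(forward):
    """Invert a forward-link map (page -> pages it links to) into in-edges."""
    back = {page: [] for page in forward}
    for src, targets in forward.items():
        for dst in targets:
            back.setdefault(dst, []).append(src)
    return back

# Assumed example edges between the three pages named in the slide.
forward = {
    "my page": ["www.vesit.edu", "www.google.com"],
    "www.vesit.edu": ["www.google.com"],
    "www.google.com": [],
}
back = backward_links(forward)
print(back["www.google.com"])
```

In this toy graph www.google.com has the most backward links, which is the crude signal of importance that link analysis later refines.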

Authorities and Hubs
• An authority is a page which has relevant information about the topic.
• A hub is a page which has a collection of links to pages about that topic.
[Figure: a hub h pointing to authorities a1, a2, a3, a4]

PageRank
• Introduced by Brin and Page (1998).
• Mines the hyperlink structure of the Web to produce a ‘global’ importance ranking of every web page.
• Used in the Google search engine.
• Web search results are returned in rank order.
• Treats a link like an academic citation.
• Assumption: highly linked pages are more ‘important’ than pages with few links.

PageRank (cont.)
• Used by Google
• Prioritizes pages returned from a search by looking at Web structure.
• Importance of a page is calculated based on the number of pages which point to it (backlinks).
• Weighting is used to give more importance to backlinks coming from important pages.

PageRank: Main Idea
• A page has a high rank if the sum of the ranks of its back-links is high.
• Google utilizes a number of factors to rank the search results:
– proximity, anchor text, page rank
• The benefits of PageRank are greatest for underspecified queries. For example, the query ‘Mumbai University’ ranked with PageRank lists the university home page first.

Basic Idea
• Back-links coming from important pages convey more importance to a page.
• For example, if a web page has a link from the Yahoo home page, it may be just one link but it is a very important one.
• A page has high rank if the sum of the ranks of its back-links is high.
• This covers both the case when a page has many back-links and when a page has a few highly ranked back-links.

Definition
• A page’s rank is equal to the sum of the ranks of all the pages pointing to it, each divided by that page's number of outgoing links:
\[ \mathrm{Rank}(u) = \sum_{v \in B_u} \frac{\mathrm{Rank}(v)}{N_v} \]
– B_u = set of pages with links to u
– N_v = number of links from v

Simplified PageRank Example
• Rank(u) = rank of page u. In the simplified formula \( \mathrm{Rank}(u) = c \sum_{v \in B_u} \mathrm{Rank}(v)/N_v \), c is a normalization constant (c < 1 to cover for pages with no outgoing links).

Expanded Definition
\[ R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u) \]
– R(u): page rank of page u
– c: factor used for normalization (< 1)
– B_u: set of pages pointing to u
– N_v: number of outbound links of v
– R(v): page rank of page v that points to u
– E(u): distribution over web pages to which a random surfer periodically jumps (set to 0.15)

Problem 1 - Rank Sink
• A cycle of pages that is pointed to by some incoming link.
• The loop will accumulate rank but never distribute it.

Problem 2 - Dangling Links

PageRank (cont’d)
\[ PR(p) = c \left( \frac{PR(1)}{N_1} + \dots + \frac{PR(n)}{N_n} \right) \]
– PR(i): PageRank of a page i which points to target page p.
– N_i: number of links coming out of page i
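
The PageRank formula with the jump term, R(u) = c Σ R(v)/N_v + c E(u), can be sketched as a simple iteration. Assumptions in this sketch: E is uniform with total mass 0.15 (the slides only say "set to 0.15"), c is chosen each round so that the ranks sum to 1, every link target also appears as a key of `links`, and no page is dangling. The three-page graph and the function name `pagerank` are hypothetical.

```python
def pagerank(links, e_total=0.15, iterations=50):
    """Iterate R(u) = c * (sum of R(v)/N_v over v in B_u) + c * E(u).

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    e = e_total / n                      # assumed uniform jump distribution E(u)
    for _ in range(iterations):
        new = {p: e for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += rank[v] / len(outs)   # v shares its rank over N_v links
        total = sum(new.values())
        rank = {p: r / total for p, r in new.items()}  # multiply by c = 1/total
    return rank

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
```

Page C ends up with the highest rank because it receives all of B's rank and half of A's, while B only receives half of A's.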

HITS
• Hyperlink-Induced Topic Search
• Based on a set of keywords, find a set of relevant pages, R.
• Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
• Pages with highest ranks in R are returned.

Authorities and Hubs (cont.)
• Good hubs are the ones that point to good authorities.
• Good authorities are the ones that are pointed to by good hubs.
[Figure: hubs h1 to h5 pointing to authorities a1 to a6]

Finding Authorities and Hubs
• First, construct a focused sub-graph of the WWW.
• Second, compute hubs and authorities from the sub-graph.

Construction of Sub-graph
[Figure: labels shown: Topic; Search Engine; Crawler; Rootset Pages; Expanded set Pages; Rootset; Forward link pages]

Root Set and Base Set
• Use query term to collect a root set of pages from a text-based search engine (Lycos, Altavista).
[Figure: root set]

Root Set and Base Set (cont.)
• Expand root set into base set by including (up to a designated size cut-off):
– All pages linked to by pages in root set
– All pages that link to a page in root set
[Figure: root set contained in the larger base set]
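
The expansion of a root set into a base set can be sketched as follows. The link map, the page names and the function name `base_set` are hypothetical, and the cut-off is interpreted here as a cap on the total size of the base set, which is only one possible reading of "designated size cut-off".

```python
def base_set(root, links, cutoff=50):
    """Root set plus pages it links to and pages linking into it, capped at `cutoff`."""
    base = list(root)

    def add(page):
        if page not in base and len(base) < cutoff:
            base.append(page)

    for page in root:
        for target in links.get(page, []):   # pages linked to by the root set
            add(target)
    for source, targets in links.items():    # pages that link to a root page
        if any(t in root for t in targets):
            add(source)
    return base

# Hypothetical link structure: page -> pages it links to
links = {
    "r1": ["x"],
    "r2": ["y"],
    "z": ["r1"],
    "w": ["q"],
}
root = ["r1", "r2"]
print(base_set(root, links))
print(base_set(root, links, cutoff=4))
```

Page "w" stays outside the base set because it neither links to a root page nor is linked from one.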

Hubs & Authorities Calculation
• Iterative algorithm on the base set: authority weights a(p) and hub weights h(p).
– Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p.
– Repeat the following two operations (and then re-normalize a and h to have unit norm):
\[ a(p) = \sum_{q \,:\, q \text{ points to } p} h(q) \]
\[ h(p) = \sum_{q \,:\, p \text{ points to } q} a(q) \]
[Figure: when v1, v2 and v3 point to p, a(p) = h(v1) + h(v2) + h(v3); when p points to v1, v2 and v3, h(p) = a(v1) + a(v2) + a(v3).]
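
The two update rules and the re-normalization step can be written out as a short loop. This is a minimal sketch: the graph, the fixed iteration count and the names `hits` and `normalize` are assumptions, and the hub update uses the freshly computed authority weights, which is one common ordering.

```python
import math

def normalize(weights):
    """Scale weights to unit Euclidean norm."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: w / norm for p, w in weights.items()} if norm else weights

def hits(links, iterations=20):
    """links maps each page to the pages it points to; returns (authority, hub)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that point to p
        new_auth = {p: 0.0 for p in pages}
        for q, outs in links.items():
            for p in outs:
                new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p points to
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical base set: three hub-like pages pointing at three authority-like pages.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "h3": ["a1"],
}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here a1 becomes the top authority because every hub points to it, and h2 becomes the top hub because it points to every authority.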

Example
[Figure: four-node graph in which every node is labelled Hub 0.45, Authority 0.45]

Example (cont.)
[Figure: the same graph with updated (hub, authority) labels: (0.9, 0.45), (1.35, 0.9), (0.45, 0.9), (0.45, 0.9)]

Algorithmic Outcome
• Applying iterative multiplication (power iteration) to any “non-degenerate” initial vector converges to the principal eigenvector.
• Hubs and authorities are the outcome of the process.
• The principal eigenvector contains the highest hub and authority weights.

Results
• Although HITS is only link-based (it completely disregards page content), results are quite good in many tested queries.
• Starting from a narrow topic, HITS tends to end in a more general one.
• A peculiarity of hub pages: their many links can cause the algorithm to drift, because they can point to authorities in different topics.
• Pages from a single domain / website can dominate the result if they all point to one page, which is not necessarily a good authority.

Possible Enhancements
• Use weighted sums for link calculation.
• Take advantage of “anchor text”, the text surrounding the link itself.
• Break hubs into smaller pieces. Analyze each piece separately, instead of the whole hub page as one.
• Disregard or minimize influence of links inside one domain.
• IBM expanded HITS into Clever; not seen as a viable real-time search engine.

CLEVER
• Identify authoritative and hub pages.
• Authoritative pages:
– Highly important pages.
– Best source for requested information.
• Hub pages:
– Contain links to highly important pages.

CLEVER (cont.)
• The CLEVER algorithm is an extension of standard HITS and addresses the problems that arise with standard HITS.
• CLEVER assigns a weight to each link based on the terms of the query and the end-points of the link. It also uses anchor text to set the link weights.
• It breaks large hub pages into smaller units so that each unit is focused on a single topic.
• When a large number of pages come from a single domain, it scales down their weights to reduce the chance that the domain dominates the result.

PageRank vs. HITS
• PageRank (Google)
– computed for all web pages stored in the database prior to the query
– computes authorities only
– trivial and fast to compute
• HITS (CLEVER)
– performed on the set of retrieved web pages for each query
– computes authorities and hubs
– easy to compute, but real-time execution is hard

Web Usage Mining
• Performs mining on Web usage data or Web logs
• A web log is a listing of page reference data, also called a clickstream
• Can be seen from the server perspective: better web site design
• Or from the client perspective: prefetching of web pages etc.

Web Usage Mining Applications
• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales and advertising)
• Improve web server performance (load balancing)

Web Usage Mining Activities
• Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
  Session: sequence of pages referenced by one user at a sitting.
• Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
  Transaction: session
  Itemset: pattern (or subset)
  Order is important
• Pattern Analysis
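
Sessionizing can be sketched with a simple inactivity timeout. The 30-minute threshold is a commonly used heuristic and an assumption here, since the slides do not define how a "sitting" ends; the record layout `(user, timestamp_in_seconds, page)` and the function name `sessionize` are also hypothetical.

```python
def sessionize(records, timeout=1800):
    """Group (user, timestamp, page) records into sessions split by inactivity."""
    sessions = []
    current = {}                     # user -> (time of last request, open session)
    for user, ts, page in sorted(records, key=lambda r: r[1]):
        last = current.get(user)
        if last is None or ts - last[0] > timeout:
            session = [page]         # start a new session for this user
            sessions.append((user, session))
        else:
            session = last[1]        # continue the user's open session
            session.append(page)
        current[user] = (ts, session)
    return sessions

# Hypothetical log records; timestamps are seconds since some origin.
records = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u2", 100, "/home"),
    ("u1", 4000, "/home"),
]
for user, pages in sessionize(records):
    print(user, pages)
```

The last request by u1 arrives more than 30 minutes after the previous one, so it opens a second session for that user.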

Web Usage Mining Issues
• Identification of exact user not possible.
• Exact sequence of pages referenced by a user not possible due to caching.
• Session not well defined
• Security, privacy, and legal issues

Web Usage Mining - Outcome
• Association rules
– Find pages that are often viewed together
• Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
• Classification
– Relate user attributes to patterns

Web Log Cleansing
• Replace source IP address with unique but non-identifying ID.
• Replace exact URL of pages referenced with unique but non-identifying ID.
• Delete error records and records containing non-page data (such as figures and code)
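
The three cleansing steps can be sketched on a few log lines. Assumptions: the lines follow the Common Log Format, "error records" means HTTP status 4xx or 5xx, and "non-page data" is detected by file extension. The sample lines, the `U1`/`P1` id scheme and the function name `cleanse` are invented for the example.

```python
import re

# Captures client address, requested URL and status code from a Common Log Format line.
LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST) (\S+)[^"]*" (\d{3})')
NON_PAGE = (".gif", ".jpg", ".png", ".css", ".js")

def cleanse(lines):
    """Return (user_id, page_id) pairs with errors and non-page requests removed."""
    ip_ids, url_ids, out = {}, {}, []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue                                   # unparseable record
        ip, url, status = m.groups()
        if status[0] in "45" or url.lower().endswith(NON_PAGE):
            continue                                   # error or figure/code request
        uid = ip_ids.setdefault(ip, f"U{len(ip_ids) + 1}")    # non-identifying user id
        pid = url_ids.setdefault(url, f"P{len(url_ids) + 1}") # non-identifying page id
        out.append((uid, pid))
    return out

log = [
    '10.0.0.1 - - [10/Oct/2010:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2010:13:55:37 +0000] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2010:13:56:01 +0000] "GET /missing.html HTTP/1.0" 404 0',
    '10.0.0.2 - - [10/Oct/2010:13:56:09 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2010:13:57:12 +0000] "GET /products.html HTTP/1.0" 200 900',
]
print(cleanse(log))
```

Sequential ids such as U1 hide the address itself but still let later steps tell that two requests came from the same source.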

Data Structures
• Keep track of patterns identified during Web usage mining process
• Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
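
As an illustration of the first technique in the list, the sketch below stores sessions in a trie and counts how many sessions begin with a given page sequence. It only handles patterns anchored at the start of a session; suffix trees generalize this to patterns starting anywhere. The class and function names and the sample sessions are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # page -> TrieNode
        self.count = 0       # number of sessions passing through this node

def insert(root, session):
    """Add one session (a list of pages) to the trie, updating counts along its path."""
    node = root
    for page in session:
        node = node.children.setdefault(page, TrieNode())
        node.count += 1

def prefix_count(root, pattern):
    """Number of stored sessions that start with `pattern`."""
    node = root
    for page in pattern:
        node = node.children.get(page)
        if node is None:
            return 0
    return node.count

# Hypothetical sessions over pages A, B, C, D.
root = TrieNode()
for session in (["A", "B", "C"], ["A", "B", "D"], ["A", "C"]):
    insert(root, session)
print(prefix_count(root, ["A", "B"]))
```

Shared prefixes are stored once, so counting a pattern costs time proportional to the pattern length rather than to the number of sessions.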

Web Usage Mining – Three Phases
http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Phase 1: Pre-processing
• Converts the raw data into the data abstraction needed before the data mining algorithm can be applied
– Mapping the log data into relational tables before an adapted data mining technique is performed.
– Using the log data directly by utilizing special pre-processing techniques.

Raw data – Web log
• Click stream: a sequential series of page view requests
• User session: a delimited set of user clicks (click stream) across one or more Web servers.
• Server session (visit): a collection of user clicks to a single Web server during a user session.
• Episode: a subset of related user clicks that occur within a user session.

Phase 2: Pattern Discovery
• Pattern discovery uses techniques such as statistical analysis, association rules, clustering, classification, sequential patterns and dependency modeling.

Phase 3: Pattern Analysis
• A process to gain knowledge about how visitors use a Website in order to
– Prevent disorientation and help designers to place important information/functions exactly where the visitors look for them and in the way users need them.
– Build up an adaptive Website server

Techniques for Web usage mining
• Construct multidimensional view on the Weblog database
– Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.
• Perform data mining on Weblog records
– Find association patterns, sequential patterns, and trends of Web accessing
– May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer
• Conduct studies to
– Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping
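
The "top N" style of analysis can be approximated without a full OLAP engine by counting log records along one dimension at a time. The record fields (`user`, `page`, `hour`), the sample data and the function name `top_n` are assumptions for this sketch.

```python
from collections import Counter

def top_n(records, dimension, n=2):
    """Most frequent values of one dimension, as (value, count) pairs."""
    return Counter(r[dimension] for r in records).most_common(n)

# Hypothetical cleaned Weblog records.
records = [
    {"user": "u1", "page": "/home", "hour": 9},
    {"user": "u2", "page": "/home", "hour": 9},
    {"user": "u1", "page": "/products", "hour": 10},
    {"user": "u1", "page": "/home", "hour": 14},
    {"user": "u3", "page": "/products", "hour": 9},
]
print(top_n(records, "page", 1))
print(top_n(records, "user", 1))
print(top_n(records, "hour", 1))
```

A data cube precomputes such counts for combinations of dimensions, so that roll-up and drill-down do not have to rescan the log.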

Software for Web Usage Mining
• WEBMINER:
– introduces a general architecture for Web usage mining, automatically discovering association rules and sequential patterns from server access logs.
– proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns.
• WebLogMiner
– Web log is filtered to generate a relational database
– Data mining on web log data cube and web log database

WEBMINER
• SQL-like query
• A framework for Web mining
– Association rules: using Apriori algorithm
  40% of clients who accessed the Web page with URL /company/products/product1.html also accessed /company/products/product2.html
– Sequential patterns:
  60% of clients who placed an online order in /company/products/product1.html also placed an online order in /company/products/product4.html within 15 days
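
A statement such as "40% of clients who accessed page X also accessed page Y" is the confidence of the rule X → Y. The sketch below computes support and confidence directly from a list of sessions; it is not the Apriori algorithm itself, only the measures Apriori reports. The five sessions are made up so that the confidence comes out at 40%, and the function names are hypothetical.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    return sum(all(p in s for p in pages) for s in sessions) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain `consequent`."""
    with_a = [s for s in sessions if antecedent in s]
    if not with_a:
        return 0.0
    return sum(consequent in s for s in with_a) / len(with_a)

P1 = "/company/products/product1.html"
P2 = "/company/products/product2.html"

# Hypothetical sessions, each a set of accessed URLs.
sessions = [
    {P1, P2},
    {P1, P2, "/company/index.html"},
    {P1},
    {P1, "/company/index.html"},
    {P1, "/company/contact.html"},
]
print(confidence(sessions, P1, P2))
print(support(sessions, [P1, P2]))
```

Apriori would first find the page sets whose support exceeds a threshold and only then compute confidences for rules built from them.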

WebLogMiner
• Database construction from server log file:
– data cleaning
– data transformation
• Multi-dimensional web log data cube construction and manipulation
• Data mining on web log data cube and web log database

Mining the World-Wide Web
• Design of a Web Log Miner
– Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
[Figure: Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Mining → Knowledge]