
University of Fribourg

Web Mining &

Clickstream Analysis

Stefan Kolek Oezkan Kirmaci

Fribourg, 31.05.2006


Contents

Introduction
Definition
    What is Web Mining?
    What is Data Mining?
    Web Crawlers
    Web Mining Categories
        Web Content Mining
        Web Structure Mining
        Web Usage Mining
Implementation
Web Mining and Privacy
Seven Rules for Successful Web Mining
Future Directions
    Web Metrics and Measurements
    Process Mining
    Dynamic Aspect
    Rapidly Growing Web Data
    Personalization
Robot Detection and Filtering – Clickstream
Clickstream Analysis
    Path Analysis
    Clickstream Data
        Types of Clickstream Data
        The Sources of Clickstream Data
    Main Methods of User Tracking
    Privacy Concerns
    Benefits Resulting from a Clickstream Analysis
Conclusion
References
Tables
Figures


Introduction

In the midst of the explosive growth of information sources available on the World Wide Web, it has become necessary for Internet users to rely on automated tools to find desired information resources and to track and analyse usage patterns, because the potential of extracting valuable knowledge from the Web has been evident from the beginning. These factors give rise to the need for server-side and client-side intelligent systems that can effectively mine for knowledge, and they are the reason why Web Mining has grown rapidly, both in the research and practitioner communities (cp. Srivastava J., 2002). Web Mining can be seen as the collection of technologies to fulfil this potential; consequently, Web Mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web (cp. Cooley R., 1997). This paper provides a brief overview of Web Mining and Clickstream Analysis, presents their different categories and shows a possible way of inserting them into the "data-knowledge" chain. We also discuss problems that can occur during their implementation and outline some future directions.


Definition

What is Web Mining?

• Web Mining is the use of Data Mining techniques to automatically discover and extract information from Web documents and services (cp. O. Etzioni, 1996).

• Web Mining is an increasingly important and very active research field which adapts advanced machine learning techniques for understanding the complex information flow of the World Wide Web (cp. Nigam K., 2000).

What is Data Mining?

Data Mining can be defined in quite different ways. In practice, KDD (Knowledge Discovery in Databases1) is often used as a synonym for Data Mining, even though Data Mining is only one stage of the KDD process. This is illustrated in the following figure.

Figure 1, KDD-Process Model (cp. Fayyad et al., 1996)

• Ashby, Simms (1998): "Data Mining is the process of discovering meaningful new correlations, patterns and trends by "mining" large amounts of stored data using pattern recognition technologies, as well as statistical and mathematical techniques." (cit. Ashby, Simms after Schmidt-Thieme L., 2002)

• Berry, Linoff (1997):

"Data Mining is the exploration and analysis, by automatic and semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules." (cit. Berry, Linoff after Schmidt-Thieme L., 2002)

1 KDD is the non-trivial process of identifying valid, novel, potentially useful and clearly coherent patterns in data.


• Decker, Focardi (1995): Data Mining is ... "a problem solving methodology that finds logical or mathematical descriptions, eventually of a complex nature, of patterns and regularities in a set of data." (cit. Decker, Focardi after Schmidt-Thieme L., 2002)

• John (1997):

"Data Mining is the process of discovering advantageous patterns in data." (cit. John after Schmidt-Thieme L., 2002)

• Newquist (1996):

"Data Mining finds relationships and can help anticipate the future based on the past data." (cit. Newquist after Lars Schmidt-Thieme, 2002)

• Parsaye (1996):

"Data Mining is a decision support process where we look in large data bases for unknown and unexpected patterns of information." (cit. Parsaye after Schmidt-Thieme L., 2002)

Web Crawlers

The tools used to identify and extract resources from the Internet are usually referred to as Web Crawlers. Existing Web Crawlers can differ strongly in their conception, and therefore in their architecture, depending on how the resources are used and on the target application. Two types of Web Crawlers can be distinguished: the first type comprises crawlers that collect data by following links, for example Google's crawler; the second type comprises content-based crawlers, which include a relevance assessment of the data and thereby focus the crawling process (cp. Ehrig M., 2004).
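To make the distinction concrete, the following minimal sketch (our illustration, not code from the paper) implements both variants with Python's standard library: a plain breadth-first link-following crawler and, when a relevance predicate is supplied, a content-based (focused) crawler that only expands pages it judges relevant. The seed URL and the keyword are hypothetical.

# Minimal sketch of the two crawler types described above: a plain
# link-following crawler and a content-based ("focused") variant.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50, relevance=None):
    """Breadth-first crawl starting at seed_url.

    With relevance=None this behaves like a pure link-based crawler; with a
    predicate on the page text, only relevant pages are expanded (focused crawl).
    """
    seen, queue = {seed_url}, deque([seed_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # unreachable or non-HTML resource: skip it
        if relevance is not None and not relevance(html):
            continue  # focused crawler: do not expand irrelevant pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen


# Hypothetical example: a focused crawl that only expands pages mentioning "clickstream".
# pages = crawl("https://www.example.org", relevance=lambda t: "clickstream" in t.lower())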


Web Mining Categories

Figure 2, Web Mining Taxonomy (cp. Srivastava J., 2002)

In a substantial amount of the literature, Web Mining is broadly divided into three distinct categories, according to the kinds of data to be mined: Web Content Mining, Web Structure Mining and Web Usage Mining.

Web Content Mining

Web Content Mining describes the discovery of useful information from Web contents/data/documents; however, this can encompass a very broad range of data. One can easily observe that the amount of information accessible from the Web has increased tremendously in the last few years. Unlike in the past, data sources such as Gopher2, FTP3 and Usenet4 are either ported to or accessible from the Web. The growth in the amount of government information on the Web has been significant; for example, there are Digital Libraries accessible from the Web. Companies are taking advantage of the expansion of the Web by transforming their businesses and services electronically, and therefore many of the company databases that previously resided in legacy systems are also being ported to or made accessible from the Web.

2 Gopher is a distributed document search and retrieval network protocol designed for the Internet. Its goal was similar to that of the World Wide Web, and it has been almost completely displaced by the Web.

3 FTP, or File Transfer Protocol, is a commonly used protocol for exchanging files over any network that supports the TCP/IP (Transmission Control Protocol / Internet Protocol) protocol.

4 Usenet is a distributed Internet discussion system.


Of course, some Web Content Data is still hidden data which cannot be indexed. Inaccessible data is either generated dynamically as the result of queries and resides in DBMSs5, or is private. In general, the Web contains many kinds and types of data (cp. Hansen L.K., 2000). Web Content Mining can also be described as the process of extracting useful information from the contents of Web documents. Content Data corresponds to the collection of facts a Web page was designed to convey to its users. It may consist of text, images, audio, video, or structured records such as lists and tables. Text mining and its application to Web content has been the most widely researched area. Research issues addressed in text mining include topic discovery, extracting association patterns, clustering6 of web documents and classification of Web pages (cp. Srivastava J., 2002). In the current literature, text mining is described as a means to categorize text according to topic, to spot new topics and, in a broader sense, to create more intelligent searches, e.g. by WWW search engines (cp. Hansen L.K., 2000). Research activities in this field also involve techniques from other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). The research done in Web Content Mining can thus be viewed from two different points of view: the IR7 and the DB8 view. From the IR view, Web Content Mining mainly aims to assist or improve information finding, or to filter information for users, usually based on either inferred or solicited user profiles, while the DB view mainly tries to model the data on the Web and to integrate it so that more sophisticated queries than keyword-based search can be performed (cp. Kosala R., 2000).
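As a small, self-contained illustration of the text-mining building blocks listed above (our sketch, not from the paper), the following snippet represents two Web documents as bags of words and measures their similarity as the cosine of their term-frequency vectors; clustering and classification of Web pages build on exactly this kind of pairwise similarity. The two example texts are hypothetical.

# Sketch: bag-of-words representation and cosine similarity of two documents.
import math
import re
from collections import Counter

def term_frequencies(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(doc_a, doc_b):
    a, b = term_frequencies(doc_a), term_frequencies(doc_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

page_1 = "Web mining applies data mining techniques to web documents."
page_2 = "Clickstream analysis mines web usage data from server logs."
print(f"similarity: {cosine_similarity(page_1, page_2):.2f}")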

Web Structure Mining

Web Structure Mining seeks to discover the models underlying the link structures of the Web. These models are based on the topology of the hyperlinks, with or without descriptions of the links (cp. Kosala R., 2000). The structure of a typical Web graph consists of Web pages as nodes9 and hyperlinks as edges connecting related pages. Web Structure Mining can thus be regarded as the process of discovering structure information from the Web. Based on the kind of structural data used, this type of mining can be further divided into two categories: hyperlinks and document structure. Hyperlinks: A hyperlink is a structural unit that connects a Web page to a different location, either within the same Web page or on a different Web page. A hyperlink that connects to a different part of the same page is called an Intra-Document Hyperlink, and a hyperlink that connects two different pages is called an Inter-Document Hyperlink.

5 A database management system (DBMS) is a computer program (or more typically, a suite of them) designed to manage a database (a large set of structured data), and run operations on the data requested by numerous clients.

6 Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, Data Mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait – often proximity according to some defined distance measure.

7 Information retrieval (IR) is the art and science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data.

8 A database is an organized collection of data.

9 A node is a device that is connected as part of a computer network.


There has been a significant body of work on hyperlink analysis, and up-to-date surveys of it are available.
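The node/edge view and the intra-/inter-document distinction described above can be made concrete in a few lines of Python (our illustration, not from the paper); the example page and links are hypothetical.

# Sketch: classify the hyperlinks found on one page as intra-document (a
# "#section" anchor on the same page) or inter-document (another page), and
# collect the inter-document links as edges of the Web graph.
from urllib.parse import urljoin, urldefrag

def classify_links(page_url, hrefs):
    """Return (edges, intra, inter) for the hyperlinks found on page_url."""
    edges, intra, inter = set(), [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        target, fragment = urldefrag(absolute)     # strip the "#..." part
        if target == urldefrag(page_url)[0]:
            intra.append(href)                     # link within the same page
        else:
            inter.append(href)                     # link to another page
            edges.add((page_url, target))          # edge in the Web graph
    return edges, intra, inter

# Hypothetical example: three links found on one page.
edges, intra, inter = classify_links(
    "https://www.example.org/index.html",
    ["#contact", "products.html", "https://partner.example.com/"],
)
print(len(intra), "intra-document,", len(inter), "inter-document links")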

Document Structure: In addition, the content within a Web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM10) structures out of documents (cp. Srivastava J., 2002).

Web Usage Mining

Web Usage Mining attempts to make sense of the data generated by Web surfers' sessions and behaviours. While Web Content and Web Structure Mining use the real or primary data on the Web, Web Usage Mining mines the secondary data derived from the users' interactions with the Web. This data includes data from Web server access logs, proxy server and browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data that results from these interactions (cp. Kosala R., 2000). In short, Web Usage Mining is the application of Data Mining techniques to discover interesting usage patterns from Web Data. In order to understand and better serve the needs of Web-based applications, Usage Data captures the identity or origin of Web users along with their browsing behaviour at a Web site. Depending on the kind of Usage Data considered, Web Usage Mining can be further classified into three types: Web Server Data, Application Server Data and Application Level Data. Web Server Data: These correspond to the user logs that are collected at a Web server. Typical data collected at a Web server include IP addresses, page references, and access times of the users.

Application Server Data: Commercial application servers, e.g. WebLogic [BEA11], BroadVision [BV12], StoryServer [VIGN13], etc., have significant features in their frameworks that enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.

Application Level Data: Finally, new kinds of events can always be defined in an application, and logging can be turned on for them, generating histories of these specially defined events. The Usage Data can also be split into three different kinds of data based on the source of its collection: the server side, the client side, and the proxy side. The key issue is that on the server side there is an aggregate picture of the usage of a service by all users, while on the client side there is a complete picture of the usage of all services by a particular client, with the proxy side being somewhere in the middle (cp. Srivastava J., 2002).

10 Document Object Model (DOM) is a description of how an HTML or XML document is represented in an object-oriented fashion. DOM provides an application programming interface to access and modify the content, structure and style of the document.

11 BEA Systems is an American software company known for its WebLogic products.

12 BroadVision is an international software vendor of self-service web applications for electronic commerce.

13 StoryServer was the name the company Vignette gave to CNET's web publishing application, PRISM (Parallel Reduced Instruction Set Machine), when they bought it.


Table 1, Web Mining Categories (cp. Srivastava J., 2002)

Despite this categorization, it is important to emphasize that the distinctions between the above categories are not clear-cut. Web Content Mining might use text and links and even profiles that are either inferred or entered by the users. The user profiles are primarily used for user modelling applications or personal assistants; the same is also true for Web Structure Mining, which could use the information about the links in addition to the link structures (cp. Kosala, 2000). In this context, Ehrig, Hartmann and Schmitz also note that only the intelligent combination of different methods for analysing resources, and their sensible weighting, makes it possible to determine the connections between the resources and their relevance (cp. Ehrig M., 2004).


Implementation

The data collected through Web Mining is worthless if the resulting insights are not implemented correctly. Companies try to improve and personalize their Web sites to better suit their customers' needs. Instead of implementing the new findings directly, Spiliopoulou suggests first evaluating the Web site's current usage. According to Preece et al., "evaluation is concerned with gathering data about the usability of a design or product by a specified group of users for a particular activity within a specified environment or work context" (cit. Preece, J., 1994, after Spiliopoulou M., 2000). In order to put this definition into practice, a methodology for the extraction of knowledge from data is needed; Data Mining is such a methodology. The remaining questions are: What data is appropriate for the analysis? How do we define anything like "expected usage" so that it is possible to discover expected and unexpected navigation patterns? The first question is dealt with in the data preparation phase; the second one is resolved in the mining phase (see the figure below).

Figure 3, The process of discovering navigation patterns in a site (cp. Spiliopoulou M., 2000)

Once the data is collected, personalization can begin: Web personalization is the process of customizing a Web site to the needs of specific users, taking advantage of the knowledge acquired from the analysis of the users' navigational behaviour (Usage Data) in correlation with other information collected in the Web context, namely structure, content and User Profile Data. Due to the explosive growth of the Web, the domain of Web personalization has gained great momentum both in research and in commercial practice. The objective of a Web personalization system is to "provide users with the information they want or need, without expecting from them to ask for it explicitly" (cit. Mulvenna M. D., 2000, after Eirinaki M., 2003). At this point it is necessary to stress the difference between customisation and personalization. In customisation, the site can be adjusted to each user's preferences regarding its structure and presentation. Every time a registered user logs in, their customized home page is loaded. This process is performed either manually or semi-automatically. In personalization systems, modifications concerning the content or even the structure of a Web site are performed dynamically. Principal elements of Web personalization include (a) the categorization and pre-processing of Web Data, (b) the extraction of correlations between and across different kinds of such data, and (c) the determination of the actions that should be recommended by such a personalization system (cp. Mobasher B., 2000). In the context of Web personalization, Web Data are those data that can be collected and used in the personalization process. These data are classified into four categories according to Srivastava et al. (cp. Srivastava J., 2000).



1. Content Data are presented to the end-user appropriately structured. They can be simple text, images, or structured data, such as information retrieved from databases.

2. Structure Data represent the way content is organized. They can be either data entities used within a Web page, such as HTML or XML tags, or data entities used to put a Web site together, such as hyperlinks connecting one page to another.

3. Usage Data represent a Web site's usage, such as a visitor's IP address, time and date of access, complete path (files or directories) accessed, referrer's address, and other attributes that can be included in a Web access log.

4. User Profile Data provide information about the users of a Web site. A user profile contains demographic information (such as name, age, country, marital status, education, interests, etc.) for each user of a Web site, as well as information about the users' interests and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analysing Web usage logs.

Web Mining and Privacy

While there are many benefits to be gained from Web Mining, a clear drawback is the potential for severe violations of privacy. The public attitude towards privacy seems almost schizophrenic: people say one thing and do quite the opposite. People value their privacy, yet experience at major e-commerce portals shows that over 97% of all people accept cookies without problems, and most of them actually like the personalization features that can be provided on that basis. Spiekermann et al. (cp. Spiekermann, 2001) have demonstrated that people were willing to provide fairly personal information about themselves, completely irrelevant to the task at hand, if given the right stimulus to do so. Furthermore, explicitly drawing attention to information privacy policies had practically no effect. One explanation of this seemingly contradictory attitude towards privacy may be that we have a bi-modal view of privacy, namely: "I'd be willing to share information about myself as long as I get some (tangible or intangible) benefit from it, and as long as there is an implicit guarantee that the information will not be abused". The research issue generated by this attitude is the need to develop approaches, methodologies and tools that can be used to verify and validate that a Web service is indeed using an end-user's information in a manner consistent with its stated policies (cp. Srivastava J., 2002).


Seven Rules for Successful Web Mining

Seemingly trivial questions such as: Who are my visitors, and how do they become profitable clients? Which paths lead to the most orders? Which products are popular, and when? When and where are orders cancelled most often? often cannot be answered, because the necessary information is not being recorded; the log files of Web servers may be full of information, but much of it has little relevance. Simple analyses often try to compensate for imprecise measuring methods with clever approximations. For example, instead of analysing the actions of an individual user, such inexact methods may analyse the aggregated traffic of hundreds of users behind a firewall, or record the visit of one surfer as many sessions just because the IP address changed during the session. According to Schrader, CEO of the German IT company SinnerSchrader, these are just two of the reasons why the correct use of Web Mining is absolutely necessary (cp. Schrader M., 2003).

1. Generate the Right Data

The first rule for successful Web Mining is to make sure that the right and necessary information is being generated. With every server request the session ID, i.e. the most unambiguous identifier, must be recorded, so that the connection between different requests within a single session can be seen. That is the only way to analyse the actions of actual visitors. Obviously this must be done in an anonymous way due to data protection requirements. Furthermore, Schrader reminds us not to forget to delete the traces of search engines such as Google from the visitor statistics (cp. Schrader M., 2003).

2. Technology Follows Management

Advertising effectiveness is measured by ad clicks, cost per lead, or cost per registration. It is important for a service portal to know how frequently the different functions are used; every Web site can therefore be judged by content tracking. If these numbers are the information the e-business management needs, Web Mining must be set up so that this data is delivered. The Web Mining requirements should be defined by managers, not technicians. IT specialists need different key data to control performance, stability and security, so never let them define the requirements from the marketing point of view. Only if you ask the right questions will you get useful answers (cp. Schrader M., 2003).

3. Be Careful Using Standard Evaluations

Web Mining is effective when the analyses are integrated into the daily e-business process. Hence the data has to be delivered in a simple and coherent form. A Web Mining tool has to be suitable for daily use, so it must be user friendly. Web Mining should give decision makers a flexible, realistic and real-time view of all the information they use in Internet-related decisions. Thus, it is important to be careful with standard evaluations. A company, however, is not standard; it is individual and constantly exposed to change. System queries should be flexible and adjustable, so that modified requirements can be realized quickly without great effort.


Schrader therefore suggests an ASP model (Application Service Providing). In an ASP model, the Web Mining user accesses all analyses through a Web browser, which is more flexible than locally installed software; the data is held centrally, saving the user personnel and investment costs (cp. Schrader M., 2003).

4. Participation Creates Customer Orientation

An additional argument for ASP is that all relevant departments get easy, real-time access to the analyses: through ASP, Web Mining establishes an active engagement with the company's own Web site, and hence customer orientation, in different departments across the whole company. Schrader compares Web Mining to a Trojan horse with which customer orientation is carried into an organisation (cp. Schrader M., 2003).

5. Success is a Question of Interpretation

Web Mining delivers numbers and graphics that can help a company find answers to specific, relevant company questions. No software or algorithm is effective enough to interpret the data so that specific questions are answered by itself. The use of Web Mining can only be optimised by the right interpretation of the data; only then will the data lead to knowledge and to improved propositions. For example, a clickstream analysis may show that a large number of users of a Web site click on the shopping cart on the first page without choosing a product beforehand. The reasons for this behaviour are unclear and can only be discovered with more specific analysis. Possible interpretations are lack of experience ("before I shop, I need a shopping cart, as usual"), or curiosity, or habit because that is how the site worked in the past. In the case of inexperience, the users are most probably using e-shopping for the first time and are high-potential new clients. Curiosity could tell us that these users are not using e-shopping for the first time; for them the shopping cart on the homepage is relevant because it lets them start shopping quickly without spending too much time looking for it. The last group could be interpreted as our regular customers; irritating them should be avoided, so they should be informed about changes via e-mail (cp. Schrader M., 2003).

6. Test it!

Web Mining broadens the classical marketing research spectrum by making a complete survey possible in real time. Through a high level of automation in data collection and evaluation, Web Mining can be faster, more precise and cheaper than conventional market research. Furthermore, Web Mining is a unique test bed for planned optimisation measures: all changes can be tested very precisely in real time. For example, it is possible to create test groups with different shopping carts; the results, expressed in the usual key figures such as conversion rate, can later be compared with the results of normal operation, evaluated and interpreted (a small sketch of such a comparison follows after rule 7). Thanks to real-time data, interpretations and hypotheses can be turned almost instantly into concrete actions and tangible results, at a level of detail that conventional marketing research could only achieve at extremely high cost. This sort of application is also very well suited to testing advertising material or any sort of new content.


In short, within a given budget and time frame a whole re-launch can be tested before the actual launch date (cp. Schrader M., 2003).

7. ROI

Here Schrader underlines the importance of keeping the return on investment in sight. Today, every measure requires clear evidence that the investment will pay off within a relatively short period of time (cp. Schrader M., 2003).
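As an illustration of rule 6 (our sketch, not from Schrader's text), the following Python snippet compares the conversion rates of two hypothetical shopping-cart test groups with a two-proportion z-test; all visitor and buyer counts are invented for the example.

# Sketch: compare the conversion rates of two test groups with a
# two-proportion z-test (normal approximation). Figures are hypothetical.
from math import sqrt, erf

def conversion_test(visitors_a, buyers_a, visitors_b, buyers_b):
    """Return the conversion rates of both groups and a two-sided p-value."""
    rate_a = buyers_a / visitors_a
    rate_b = buyers_b / visitors_b
    pooled = (buyers_a + buyers_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_a - rate_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return rate_a, rate_b, p_value

# Hypothetical test: cart variant A vs. cart variant B.
rate_a, rate_b, p = conversion_test(5000, 240, 5000, 300)
print(f"variant A: {rate_a:.1%}, variant B: {rate_b:.1%}, p = {p:.3f}")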

Future Directions

As the Web and its usage continue to grow, they will generate ever more content, structure, and Usage Data, and the value of Web Mining will keep increasing (cp. Srivastava J., 2002). Outlined here are some research directions that must be pursued to ensure that we continue to develop Web Mining technologies that will enable this value to be realized.

Web Metrics and Measurements

In the spring of 2000, Amazon used the Web as an experimental apparatus: a 48-hour experiment involving more than one million user sessions on the live site was carried out before the decision to change Amazon's logo was made. The Web is attractive as an experimental apparatus not only because it provides the ability to measure human behaviour at a micro level; it also eliminates the bias of subjects knowing that they are participating in an experiment, and it allows the number of participants to be many orders of magnitude larger. Amazon was one of the first companies to develop Web metrics and measurement procedures so that various Web phenomena can be studied. Measuring the user impact of various proposed changes – on operational metrics such as site visits and visit/buy ratios, as well as on financial metrics such as revenue and profit – before a deployment decision is made is absolutely necessary. Hence research will continue in developing the right set of Web metrics and their measurement procedures (cp. Srivastava 2002).

Process Mining

The value of process information in understanding users' behaviour is vital in traditional shops, and the same is true in e-business. Research needs to be carried out on extracting process models from Usage Data, understanding how different parts of the process model impact various Web metrics of interest, and how the process models change in response to various changes that are made – changing stimuli to the user. Thus, clickstream is the key word for the future according to Srivastava et al. (cp. Srivastava J., 2002). Clickstream data provides the opportunity for a detailed look at the decision-making process itself, and knowledge extracted from it can be used for optimizing the process, influencing the process, etc.


Dynamic Aspect

Zhao et al. (cp. Zhao Q., 2005) criticize that existing Web Usage Mining techniques focus only on discovering knowledge based on statistical measures obtained from the static characteristics of Web Usage Data; they do not consider the dynamic nature of this data. Zhao therefore focuses on discovering novel knowledge by analysing the change patterns of historical web access sequence data and presents an algorithm called WAM-Miner to discover Web Access Motifs (WAMs). WAMs are web access patterns that never change or do not change significantly most of the time (if not always) in terms of their support values during a specific time period. WAMs are useful for many applications, such as intelligent web advertisement, web site restructuring, business intelligence, and intelligent web caching.

Rapidly Growing Web Data

Yen et al. (cp. Yen S.-J., 2005) point out that Web Data grows rapidly within a short period of time and that some of it may become outdated. The user behaviours may change when new Web Data is inserted into, and old Web Data is deleted from, the web logs. Therefore, the user behaviours must be re-discovered from the updated web logs. However, it is very time-consuming to re-discover the users' access patterns from scratch. Hence, many researchers have turned their attention to incremental mining in recent years and will continue to do so. The essence of incremental mining is that it utilizes the previous mining results and finds new patterns only in the inserted or deleted part of the web logs, so that the mining time can be reduced.

Personalization

Numerous firms are investing, and will increase their investments, in various means and methods to track online customers' navigation patterns and to analyse their characteristics and web behaviours. With a better understanding of the navigation patterns, firms can provide breakthrough customer service. Current frameworks, however, are limited to server-side tracking, and firms lose their customers' footprints once the customers leave their web sites. Future frameworks will track users and record their Internet activities in order to draw a link between Web customers' characteristics (such as personality traits) and their browsing behaviours. Needless to say, privacy problems will arise (cp. Ho S. Y., 2005).


Robot Detection and Filtering – Clickstream

Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. The importance of separating robot behaviour from human behaviour prior to extracting user behaviour knowledge from Usage Data has been illustrated by Kohavi (cp. Kohavi R., 2001). According to Tan et al. (cp. Tan P.-N., 2001), e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their Web sites. In addition, Web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to Web robots also make it more difficult to perform clickstream analysis effectively on the Web Data. Conventional techniques for detecting Web robots are often based on identifying the IP address and user agent of the Web clients. While these techniques are applicable to many well-known robots, they may not be sufficient to detect camouflaged and previously unknown robots. Tan et al. (cp. Tan P.-N., 2001) approached this problem by using the navigational patterns in the clickstream data to determine whether a session is due to a robot. Highly accurate classification models could be built using this approach, and these models are able to discover many camouflaged and previously unidentified robots.
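The conventional, heuristic side of robot detection mentioned above (user agent, plus a couple of simple navigational signals) can be sketched in a few lines of Python. This is our illustration, not the classification model of Tan et al.; the keyword list and thresholds are assumptions.

# Sketch: flag sessions that look like Web robots using simple heuristics.
ROBOT_AGENT_WORDS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(session):
    """`session` is a dict with 'user_agent' and a list of (timestamp, url) 'clicks'."""
    agent = session.get("user_agent", "").lower()
    if any(word in agent for word in ROBOT_AGENT_WORDS):
        return True
    urls = [url for _, url in session["clicks"]]
    if "/robots.txt" in urls:                 # humans rarely request robots.txt
        return True
    times = sorted(t for t, _ in session["clicks"])
    if len(times) >= 2:
        avg_gap = (times[-1] - times[0]) / (len(times) - 1)
        if avg_gap < 1.0:                     # less than one second per page
            return True
    return False

# Hypothetical session: declared robot user agent, requests robots.txt.
example = {"user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)",
           "clicks": [(0.0, "/robots.txt"), (0.4, "/index.html")]}
print(looks_like_robot(example))  # True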

Clickstream Analysis

Analysing the data recorded in web server logs14 – the so-called "clickstreams" saved on web and proxy servers while users visit them – is rapidly becoming one of the most important activities for companies in any sector as most businesses become e-businesses. Clickstream Analysis can reveal usage patterns on the company's web site and give a highly improved understanding of customer behaviour. This understanding can then be used to improve customer satisfaction with the web site and the company in general, yielding a huge business advantage (cp. Andersen J., Giversen A., Jensen A.H., Larsen R.S., Pedersen T.B., Skyt J., 2000). A lot of important information about the user can be found out using modern methods of Clickstream Analysis. Is the user's intention to browse, to search or to buy something? How much time (stickiness) does the user spend on the site? How does the visitor enter the site and how does he leave it? How do visitors travel inside the site? Does the visitor come back or not? Which pages of the site are viewed by the majority of visitors and which are loaded only sporadically? What is the visit-to-purchase ratio? All these are questions that can be answered thanks to Clickstream Analysis.

14 A server log is a file (or several files) automatically created and maintained by a server, recording the activity performed by it. (wikipedia.org)


Path Analysis

A web site consists of a hyperlinked set of pages. A single web site can be composed of anything from one to several thousand pages. During a visit to a web site, a visitor navigates through the site by clicking on hyperlinks, performing internal searches, or using his/her bookmarks to jump to pages and areas of interest. An example of such navigation is shown in Figure 4. As we can see, the visitor performs different types of moves: forward steps (e.g. from node A to B), backward steps (e.g. from node D to C), as well as forward jump steps, indicated by the arrow from H to I. Each of the pages viewed by this visitor is captured in the web log as a separate record. This sequence of moves is known as a "path", sometimes also called a "clickstream", and is recorded in the form of a log file. Paths can be analysed to determine the sequences in which users have navigated the web site. This information is also known as "e-intelligence" and is very advantageous to organizations that are active in e-commerce. The real challenge arises when many visitors navigate through a site that contains thousands and thousands of pages scattered across hundreds of web servers. The order in which visitors choose to view pages indicates their steps through the browsing (buying) process. The similarities and differences in the navigational behaviour of various classes of visitors, such as new visitors vs. repeat visitors, purchasers vs. non-purchasers, or first-time purchasers vs. repeat purchasers, can be described by several patterns and could hold clues for improving the web site design, offer personalization opportunities, and help streamline the e-commerce environment (cp. Theusinger Ch., Huber K.P., 2000).

Figure 4, A web site with a sample path (cp. Theusinger Ch., Huber K.P., 2000)
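A very simple form of path analysis (our illustration, not from Theusinger and Huber) is to count the page-to-page transitions observed in recorded paths and report, for each page, where visitors most often go next; the example paths are hypothetical.

# Sketch: count page-to-page transitions over recorded paths.
from collections import Counter, defaultdict

def transition_counts(paths):
    """paths: list of page sequences, e.g. ["A", "B", "C"]."""
    counts = defaultdict(Counter)
    for path in paths:
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return counts

paths = [["A", "B", "C", "D"], ["A", "B", "D"], ["A", "C", "D"]]
for page, nexts in transition_counts(paths).items():
    follower, count = nexts.most_common(1)[0]
    print(f"after {page}, visitors most often go to {follower} ({count} times)")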


Clickstream Data

The data needed to fulfil the above-mentioned tasks is derived from a web server log file, possibly supplemented by the other methods described later on. From a web server log file in Common Log Format (CLF) we know, for each click on the web site, the IP address15 of the user, the page (URL) the user requested, and the timestamp (down to the second) of the event. From these three facts, we can derive a great deal of extra information. First, we can identify individual users through the IP address and, through the IP address, the user's country of origin. The problems which occur when using the IP address as the only means of visitor identification are described later on. We can identify user sessions, including the start page, the end page and all pages visited between those two. A user session is an ordered sequence of clicks made by the same user, starting from the click where the user enters the site until the last click registered in the log for that user during that visit. The exit point, i.e. the end page, is determined from a maximum time-span "allowed" between clicks. This prevents the entrance click of the same user's next session from being registered as part of the previous one. It is also easily derived at what time of day and on which day of the week a user is on the site, how long a user stays on each page (except the last page of a session), the number of hits per page per user, and the number of hits in a specific site section per user. Finally, we can group users by the time between clicks and by the time of day/week they are on the site, and count the number of hits on a banner, in total or by day/week (cp. Andersen J., Giversen A., Jensen A.H., Larsen R.S., Pedersen T.B., Skyt J., 2000).
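The session reconstruction just described can be sketched in a few lines of Python (our illustration, not from the paper): clicks are parsed from CLF lines, grouped per IP address, and a new session is started whenever the gap between two clicks exceeds the maximum "allowed" time-span. The 30-minute threshold and the log lines are assumptions for the example.

# Sketch: derive user sessions from Common Log Format lines.
import re
from collections import defaultdict
from datetime import datetime, timedelta

CLF = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:\S+) (?P<url>\S+)')
MAX_GAP = timedelta(minutes=30)

def sessions_from_log(lines):
    clicks = defaultdict(list)                      # ip -> [(time, url), ...]
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue
        when = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        clicks[m["ip"]].append((when, m["url"]))
    sessions = []
    for ip, hits in clicks.items():
        hits.sort()
        current = [hits[0]]
        for prev, nxt in zip(hits, hits[1:]):
            if nxt[0] - prev[0] > MAX_GAP:          # gap too long: new session
                sessions.append((ip, current))
                current = []
            current.append(nxt)
        sessions.append((ip, current))
    return sessions

log = [
    '192.0.2.7 - - [31/May/2006:10:00:01 +0200] "GET /index.html HTTP/1.0" 200 512',
    '192.0.2.7 - - [31/May/2006:10:00:40 +0200] "GET /products.html HTTP/1.0" 200 743',
    '192.0.2.7 - - [31/May/2006:11:15:00 +0200] "GET /index.html HTTP/1.0" 200 512',
]
for ip, hits in sessions_from_log(log):
    print(ip, "start:", hits[0][1], "end:", hits[-1][1], f"({len(hits)} clicks)")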

Types of Clickstream Data

In general, the data retrieved from Clickstream Analysis can be divided into two different groups. The first type is transaction-based data; the second type is customer-based data, i.e. data aggregated by Customer-ID.

Transaction Data: Most of the transaction-based data can be retrieved directly from the log file. In this log file, usually only one record is reported for each click, containing the IP address, time stamp, referrer address, clicked page and server address. As users behind a router or a proxy server can all have the same IP address, an IP address is not suitable as an identification variable, so other techniques like cookies16 or Java servlets17 must be used. Otherwise it is very difficult to determine whether two clicks belong to the same visitor. A first step may be to distinguish transactions based on their purpose; the next step could be to perform association and sequence analysis to get insight into visited web sites and customer paths.

15 An IP address is a unique number that devices use in order to identify and communicate with each other on a computer network utilizing the Internet Protocol standard (IP). (wikipedia.org)

16 Treated later on.

17 A Java servlet is a small program that runs on a server and is analogous to a Java applet that runs within a web browser environment. Servlets are the Java counterpart to dynamic web content technologies such as CGI, PHP or ASP. Servlets can maintain state across many server transactions by using HTTP cookies, session variables or URL rewriting. (cp. http://en.wikipedia.org/wiki/Java_servlet; http://www.webopedia.com/TERM/S/servlet.html)


More sophisticated analyses need more information about the transactions, so new variables have to be calculated from the log file: number of clicks, average time per page (as well as maximum and minimum), weekday of the visit, time of the transaction (business hours, free time, night time), server, type of browser, number of pages visited and the order in which they were visited, first and second page as well as last and second-to-last page, etc. Based on this data about each customer, cluster analysis methods like K-means18 or Kohonen maps are well suited to finding segments of customers with similar behaviour. If, in addition, one event, such as the visit to a special page, can be defined as a business-relevant target, predictive modelling methods (decision trees, neural networks, and regression models) can also be used. To obtain good customer profiles, variables describing the characteristics of the customer should be added (cp. Theusinger Ch., Huber K.P., 2000).

Customer-Based Data: If available, this information is held in a data warehouse where all customer characteristics and historical information about click behaviour etc. are stored. To combine this information with the transactional data, the users must identify themselves when visiting the web site, so that the cookie ID can be matched with their names and the transactional data can be merged with the customer-relevant data. In an e-commerce application, combining all this data makes it possible to answer questions like: "What kind of customers do I have?" or "How can customers who are interested in a particular product be recognized for cross-selling?" Predictive modelling techniques will also be used in this context (cp. Theusinger Ch., Huber K.P., 2000).

Field              | W3C token        | Description
Date               | date             | The date on which the activity occurred
Time               | time             | The time at which the activity occurred
Client IP Address  | c-ip             | The IP address of the client that accessed your server
User Name          | cs-username      | The name of the authenticated user who accessed your server; anonymous users are represented by "-"
Service Name       | s-sitename       | The Internet service and instance number that was accessed by a client
Server Name        | s-computername   | The name of the server on which the log entry was generated
Server IP Address  | s-ip             | The IP address of the server on which the log entry was generated
Server Port        | s-port           | The port number the client was connected to
Method             | cs-method        | The action the client was trying to perform
URI Stem           | cs-uri-stem      | The resource accessed
URI Query          | cs-uri-query     | The query, if any, the client was trying to perform
Protocol Status    | sc-status        | The status of the action, in HTTP or FTP terms
Win32 Status       | sc-win32-status  | The status of the action, in terms used by Microsoft Windows
Bytes Sent         | sc-bytes         | The number of bytes sent by the server
Bytes Received     | cs-bytes         | The number of bytes received by the server
Time Taken         | time-taken       | The duration of the action, in milliseconds
Protocol Version   | cs-version       | The protocol (HTTP, FTP) version used by the client
Host               | cs-host          | The content of the host header
User Agent         | cs(User-Agent)   | The browser used on the client
Cookie             | cs(Cookie)       | The content of the cookie sent or received, if any
Referrer           | cs(Referer)      | The previous site visited by the user; this site provided a link to the current site

s = server actions, c = client actions, cs = client-to-server actions, sc = server-to-client actions

Table 2, Example of W3C Extended Log File Format (cp. Levene M., 2002)

18 The K-means algorithm is an algorithm to cluster objects based on attributes into k partitions
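The clustering step described under Transaction Data above can be made concrete with a compact, self-contained K-means implementation (our illustration, not the authors' tooling); the per-visitor feature vectors – number of clicks, average seconds per page, weekday of the visit – are hypothetical.

# Sketch: plain K-means clustering of per-visitor feature vectors.
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

# (clicks, average seconds per page, weekday 0-6) for six hypothetical visitors.
visitors = [(3, 40, 1), (4, 35, 2), (25, 8, 5), (30, 6, 6), (5, 45, 1), (28, 7, 6)]
centroids, labels = kmeans(visitors, k=2)
print(labels)   # e.g. [0, 0, 1, 1, 0, 1]: slow browsers vs. heavy clickers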


The Sources of Clickstream Data

There are four potential sources of Clickstream Data: the host server, a third party thanks to an agreement, a third party thanks to the network topology, and tracking programs in host computers.

Figure 5, Visualisation of a user's request to AltaVista (cp. Montgomery A.L., Srinivasan K., 2002)

Host Server: The host server (the computer of the site being visited) keeps a record of visits, called a server log (treated above). When a user requests a page, the server records identifying information (the user's IP (Internet Protocol) address, the previous URL visited, and the browser type) in the server log. In the example given in Figure 5, altavista.com would know that this user requested two different URLs from its server. An advantage of collecting Clickstream Data from server logs is that they contain a record of everything done at a Web site. Unfortunately, they can also be difficult to process because they contain so much data. A practical problem with many server logs is that they become large very quickly. The Internet portal Yahoo! received 1.2 billion page requests per day as of July 2001, which generates terabytes of information to be stored every day. Learning how to access and store this information can be incredibly challenging. Many companies provide products to summarize and analyse these logs (Accrue and IBM SurfAid, for example). Other companies (Accrue, CommerceTrends) provide consulting services that process Clickstream Data and aggregate it with online sales data and e-mail data to create a comprehensive picture of their client firm's business. To make its logs useful for later analysis, the server should record what it presented to the user. This is a problem because servers generally offer large amounts of content and change it frequently. Many URLs and the content these URLs contain are dynamically created, which means that requests for the same page by two different users may result in two different pages.


To manage the resulting data set and extract useful information, some companies scan each Web page presented to a user for titles or Meta Tags19. This process is called packet sniffing.

Third Party thanks to an Agreement: A major limitation of Clickstream Data collected from server logs is that the data come from only one site and do not enable tracking across sites. For example, in the example in Figure 5, AltaVista would not know what the user did before or after he or she left its Web site. One way around this is to provide links to content, such as graphic images, that are stored at other sites. For example, when DoubleClick provides graphic images at AltaVista, it can collect Clickstream Data from AltaVista's users without their intentionally visiting DoubleClick's Web site. Of course, AltaVista must be willing to place these images on its Web site. Large providers, such as DoubleClick and Engage, can leverage the large number of sites they do business with and track users across Web sites. For example, DoubleClick can remember that a visitor to iVillage.com is the same one that visited altavista.com last month, even though iVillage and AltaVista have no direct relationship. Other third parties that can collect Clickstream Data by working in conjunction with Web servers are Application Service Providers (ASPs)20 (Angara, Cogit, CoreMetrics, for example). Frequently, these companies do not provide graphics the user will see, as DoubleClick does, but 1x1-pixel images called Web Bugs. These graphics are so small that they are invisible to Web visitors. CoreMetrics holds a patent for one type of Web Bug. For example, walmart.com places Web Bugs on its site that allow CoreMetrics to record a visitor's movements at walmart.com. CoreMetrics can then analyse these movements and make recommendations to walmart.com about its Web design and marketing policies. Another example of a Web Bug is one used by zanybrainy.com in 2000: whenever a customer made a purchase at zanybrainy.com, a Web Bug relayed the purchase information to DoubleClick. Web Bugs allow one company to communicate information about its Web visitors to another company.

Third Party thanks to Network Topology: Clickstream Data can also be collected by a third party because of the network topology of the Internet. Users' requests for Web pages are relayed by a series of computers attached to the network before they finally reach their destination. Most users are connected to the Internet through an Internet service provider (ISP)21 or a commercial on-line service (COS), such as AOL. For example, when an AOL user makes a request, AOL's servers receive the request before it is relayed through the Internet. When an ISP or COS routes a request, it can record the routing information and create its own Clickstream Data set. (What AOL does with this information depends upon its privacy policy.) Because many ISPs and COSs cache22 their users' page requests, they do not always pass all requests on to the server. Instead, ISPs and COSs may serve many pages from their own archives to speed up responses to user requests.

19 Meta Tags are HTML elements used to provide structured metadata about a web page.

20 An application service provider (ASP) is a business that provides computer-based services to customers over a network.

21 An Internet service provider (ISP, also called Internet access provider or IAP) is a business or organization that offers users access to the Internet and related services.

22 Cache: to store the content of different web pages on a server near the end user in an attempt to speed up the loading of the page.


Unfortunately, this means that the server logs discussed in the previous section may contain only a subset of the users' page views.

Tracking Programs in Host Computers: A final – and perhaps the most reliable – source of Clickstream Data is a meter program that records what users do on their computers. This meter program can "watch" the user and record the URLs of all pages viewed in the browser window, as well as any other application programs the user is running. Because this program works on the user's computer, it avoids the problems with cached pages mentioned in the previous section. Such a program can also record how long users keep windows active, instead of simply when they request URLs. A drawback of this technique is that the individual must be willing to run such a program. Most companies that use this method to collect Clickstream Data are marketing research companies (Jupiter Media Metrix and A.C. Nielsen, for example). These companies create large, representative samples of households that own personal computers (PCs) and have consented to participate in such a panel. The companies use statistical techniques to project usage from these small samples to the overall population (cp. Montgomery A.L., Srinivasan K., 2002).

Main Methods of User Tracking

There are several methods that can be used to identify the visitor of a web page. The basic one is identification based on the IP address; more advanced ones use cookies or Java servlets. The most robust and reliable one is the registration of users. These types of identification require different levels of user involvement, and as a rule of thumb, the higher the user involvement, the higher the quality and reliability of the tracking.

IP Address: Identification based on the IP address does not require any active cooperation of the user, but it is very unreliable. A user can have a dynamic IP address or can surf the Internet from a corporate network in which many computers appear on the Internet under a single IP address. There is also the possibility of using a public proxy server, a service that is cheap and easy to obtain nowadays.

Cookies and Java Servlets: Identification based on cookies or Java servlets is more reliable. When a cookie is placed on a user's computer, it is possible to trace a user with a dynamic IP address across more than one visit. Using cookies we also know whether a returning user has a static or dynamic IP address and whether the user is behind a multi-user proxy server; this is indicated when different cookies arrive from the same static IP address. We can also see cookies from other web sites and use them to create a more detailed user profile; it is possible to read cookies from other sites by exploiting a security weakness in the current version of Internet Explorer. Cookie-based tracking is very reliable, but the visitor has to cooperate: only very basic computer knowledge is needed to delete a cookie, and once this is done, no further tracking is possible. (cp. Andersen J., Giversen A., Jensen A. H., Larsen R.S., Pedersen T.B., Skyt J., 2000)
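As an illustration of cookie-based identification, the sketch below shows how a site can assign each new browser a persistent visitor ID and recognize it on later visits even when the IP address changes. It is a simplified, framework-agnostic example using Python's standard library; the cookie name, ID format and function name are our own assumptions, not part of any cited system.

# Sketch of cookie-based visitor identification: assign a random ID on the
# first visit and read it back from the Cookie header on later requests.
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"   # hypothetical cookie name

def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_header_or_None) for one HTTP request."""
    cookies = SimpleCookie(cookie_header or "")
    if COOKIE_NAME in cookies:
        # Returning visitor: the browser sent our cookie back.
        return cookies[COOKIE_NAME].value, None
    # New visitor (or the cookie was deleted): issue a fresh persistent ID.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = str(60 * 60 * 24 * 365)  # keep for one year
    return visitor_id, out[COOKIE_NAME].OutputString()

# Example: the first request carries no cookie, the second one sends it back.
vid, set_cookie = identify_visitor(None)
print("new visitor:", vid, "->", set_cookie)
vid2, _ = identify_visitor(f"{COOKIE_NAME}={vid}")
print("returning visitor recognized:", vid2 == vid)

As noted in the text, this breaks down as soon as the visitor deletes the cookie: a new ID is then issued and the earlier history can no longer be linked to it.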


Registration of Users: The best way to track a user is to get him or her to register on the page. Before using the web page, the user logs in with a unique user name and password; afterwards it is very easy to track him or her on the site. The best-known online marketplaces, eBay and Amazon, can serve as good examples. The problem to tackle is motivating the user to register. At eBay and Amazon registration is a necessity, because without being registered the user cannot carry out any transactions. On other Internet sites, where there is no such necessity, users are often motivated to register by an additional benefit, such as a small present or a discount.
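Building on the registration approach, a common practical step is to link the anonymous, cookie-based history to the registered account at login time, so that clicks recorded before registration can be attributed to the same customer. The sketch below is our own illustration of this idea; it assumes the anonymous visitor ID from the previous example is available when the user logs in, and all function names are hypothetical.

# Sketch: linking an anonymous visitor ID to a registered user at login time,
# so that clicks recorded before registration can be attributed to the account.
from collections import defaultdict

clicks_by_visitor = defaultdict(list)   # click history keyed by anonymous visitor ID
visitors_by_user = defaultdict(set)     # registered user name -> all visitor IDs seen

def record_click(visitor_id, url):
    clicks_by_visitor[visitor_id].append(url)

def on_login(user_name, visitor_id):
    # Called when a registered user logs in from a browser with this cookie ID.
    visitors_by_user[user_name].add(visitor_id)

def history_for_user(user_name):
    # All clicks from every browser/cookie this user has ever logged in with.
    return [url
            for vid in visitors_by_user[user_name]
            for url in clicks_by_visitor[vid]]

# Example: anonymous browsing, then login under the user name "alice".
record_click("cookie-123", "/books/data-mining")
record_click("cookie-123", "/checkout")
on_login("alice", "cookie-123")
print(history_for_user("alice"))   # ['/books/data-mining', '/checkout']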

Privacy Concerns

As the importance of the web grows with users' extended use of the Internet for shopping and other activities, a big issue called privacy arises. How much can businesses learn about you by logging your actions on their web site? By exploiting a weakness in the current version of Internet Explorer it is possible to get cookies from other web sites a user has visited. The means to do this is redirecting the user to a special URL, which sends a specific cookie to the server. You have to know what you are looking for, i.e., to get a cookie you have to specify from which site you want it. If the user's name is present in a known cookie, the security hole can be used to obtain this information. Adding information from the other sites a user has visited enables extremely precise profiling of the user, which, among other things, can be used to emphasize the areas in which your company does better than the other companies in the same line of business that the user has visited. By combining Clickstream Data with cookies "borrowed" from other web sites, a more personal site can be built. The possibility of abusing this "open cookie jar" is clear: it is, for instance, possible to get a cookie from a user and then read the user's mail if he or she has a Hotmail account. If users want to be anonymous while browsing the Internet, they can be, but by being completely anonymous the benefits of personalized services are lost. To be anonymous, cookies must be disabled and users need to make sure that their IP address cannot identify them. Special privacy concerns arise in e-commerce. Contrary to normal shopping, users cannot be anonymous, because they have to give their address to the seller so that the seller knows where to send the purchased goods. Moreover, they have to give the seller credit card information in order to pay for the goods. The credit card information is the seller's security that the buyer will pay; the buyer's security is based on trust in the seller, so the seller must have a good reputation. The personal information the seller holds about its users can be misused; for instance, the seller could charge more money than agreed to the credit card. Businesses that abuse the trust of their customers will be unsuccessful, because rumours spread very fast on the web, while those that use the information to deliver a good, personalized and trustworthy service will be very successful. Building a trusted name in their customer group is crucial for e-commerce companies. The way to do this is generally good service, and the word will spread; the other possibility is advertising, using the Internet or other channels such as TV and radio. (cp. Andersen J., Giversen A., Jensen A. H., Larsen R.S., Pedersen T.B., Skyt J., 2000)


Benefits Resulting from a Clickstream Analysis

Clickstream Information can be derived from "raw" Clickstream Data. Here we discuss some of the most important uses, inspired by the needs of companies active in e-commerce. An important use of Clickstream Information is personalization of the site to the individual user's needs, e.g., delivering services and advertisements based on user interest, thereby improving the quality of user interaction and leading to higher customer loyalty. It is important to identify user groups and focus the site towards them. Not all users are equal; some are more profitable to the organization than others, e.g., because they are more likely to buy high-profit products. Thus, it is necessary to make sure that the site caters to the desired users. Using these groups, it is desirable to analyse banner23 efficiency for each user group and to optimise it by presenting the right banners to the right people. Analysis of customer interests for product development purposes is also very interesting. Customer interests are derived from what a user has looked at on the site, and especially from which sections of the site the user visits and spends time in. One of the design goals of most sites is high stickiness, meaning that users spend a long time productively on the site. A way to achieve this is to identify pages that often lead to undesired termination of user sessions, so-called "killer pages". It is important to find out if and why a page kills sessions for certain user groups while being popular with other groups. Another interesting question is whether the web site is used for different things on weekdays, weekends and holidays, and whether banners are used differently by the same users at different times. Finally, a cost/benefit analysis of the site, or parts of it, is of great value in order to determine whether it pays off, i.e., whether the number of users justifies the current cost of updating the site. (cp. Andersen J., Giversen A., Jensen A. H., Larsen R.S., Pedersen T.B., Skyt J., 2000)

23 A web banner or banner ad is a form of advertising on the World Wide Web. This form of online advertising entails embedding an advertisement into a web page.
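One of the analyses mentioned above, finding "killer pages" that disproportionately often terminate sessions, can be approximated with a few lines of code. The sketch below is our own illustration; it assumes that sessions have already been reconstructed (e.g. from server logs or cookies) and are given as ordered lists of page URLs, and the threshold values are arbitrary examples.

# Sketch: identifying "killer pages" - pages that are unusually often the
# last page of a session (sessions assumed to be reconstructed already).
from collections import Counter

def killer_pages(sessions, min_views=50, exit_rate_threshold=0.5):
    views = Counter(page for session in sessions for page in session)
    exits = Counter(session[-1] for session in sessions if session)
    candidates = []
    for page, n_views in views.items():
        if n_views < min_views:
            continue                       # ignore rarely seen pages
        exit_rate = exits[page] / n_views  # share of views that ended a session
        if exit_rate >= exit_rate_threshold:
            candidates.append((page, exit_rate))
    return sorted(candidates, key=lambda item: item[1], reverse=True)

# Tiny illustrative example (real analyses would use thousands of sessions).
sessions = [["/home", "/shoes", "/checkout"],
            ["/home", "/shoes"],
            ["/home", "/shipping-costs"],
            ["/home", "/shipping-costs"]]
print(killer_pages(sessions, min_views=2, exit_rate_threshold=0.5))

In practice, pages that are legitimate end points of a visit, such as an order-confirmation page, would of course be excluded before interpreting such a list, and the exit rates would be broken down by user group as described above.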


Conclusion

The Internet gives companies and their managers a perfect opportunity to learn more about their customers, their likes and dislikes, and the patterns of their behaviour. With the increasing volume and value of e-commerce (US online retail sales reached $86.3 billion in 2005, a growth of 24.6 percent compared to 2004), Data Mining is gaining in importance and is increasingly becoming an indispensable part of market research. We have covered only a small part of the Data Mining process in our paper. Web Mining and Clickstream Analysis is the part of this business that specializes in the behaviour of Internet users while they surf the web. It tracks all kinds of user actions: looking for information, looking for merchandise, comparing it and buying it. Every person is of course an individual, but there are patterns of conduct that make it possible to divide (potential) online customers into different groups and then to tailor the layout and services of web pages to their needs. Personalization is the trend, and we are witnessing a dynamic metamorphosis of the web, which is changing from a static into a dynamic space offering terabytes of new customer-related data every day. The potential is huge; it is up to the companies active in e-commerce to make the most of it.


References

• Montgomery A.L., Li S., Srinivasan K., Liechty J.C., November 2002 (third revision February 2004), Modeling Online Browsing and Path Analysis Using Clickstream Data, www.andrew.cmu.edu/~alm3/papers/purchase%20conversion.pdf

• Andersen J., Giversen A., Jensen A. H., Larsen R.S., Pedersen T.B., Skyt J., October 2000, Analyzing Clickstreams Using Subsessions, Technical Report 00-5001, Department of Computer Science, Aalborg University, Denmark, http://www.cs.auc.dk/~tbp/articles/R005001.pdf

• Ashby C., Simms S., 1998, Data Mining – Research Brief, http://theweb.badm.sc.edu/798gstud/simsj/Datamininggrb.html

• Ashby, Simms, 1998, after Schmidt-Thieme L., 2002, KDD, Data Mining und Web Mining, Theorie und Forschung, http://www.informatik.uni-freiburg.de/cgnm/lehre/wm-02w/webmining-1.pdf

• Berry M.J.A., Linoff G., 1997, Data Mining Techniques for Marketing, Sales and Customer Support, New York

• Berry, Linoff, 1997, after Schmidt-Thieme L., 2002, KDD, Data Mining und Web Mining, Theorie und Forschung, http://www.informatik.uni-freiburg.de/cgnm/lehre/wm-02w/webmining-1.pdf

• Cooley R., Mobasher B., Srivastava J., 1997, Web Mining: Information and Pattern discovery on the World Wide Web, Department of Computer Science and Engineering University of Minnesota, Minneapolis, USA, http://maya.cs.depaul.edu/~mobasher/papers/webminer-tai97.pdf

• Olson D.L., Elbaum S., Goddard S., Choobineh F., 2003, An E-Commerce Decision Support System Design for Web Customer Retention, University of Nebraska, USA, www.cse.unl.edu/~goddard/Papers/Conference/ECommerceDecisionSupportOlsonEtAl.pdf

• Decker K.M., Focardi S., 1995, Technology Overview: A report on Data Mining, CSCS-ETH, Swiss Scientific Computer Center

• Decker, Focardi, 1995 after Schmidt-Thieme L., 2002, KDD, Data Mining und Web Mining, Theorie und Forschung, http://www.informatik.uni-freiburg.de/cgnm/lehre/wm-02w/webmining-1.pdf

• Ehrig M., Hartmann J., Schmitz C., 2004, Ontologie-basiertes Web Mining, Institut AIFB, Universität Karlsruhe, FG Wissensverarbeitung, Universität Kassel, http://metis.ontoware.org/docs/ehrig_web_mining.pdf

• Eirinaki M., Vazirgiannis M., 2003, Web Mining for Web Personalization, ACM Transactions on Internet Technology, Vol. 3, No. 1, Athens University of Economics and Business, Greece, http://delivery.acm.org/10.1145/650000/643478/p1-eirinaki.pdf?key1=643478&key2=2606048411&coll=GUIDE&dl=GUIDE&CFID=71950429&CFTOKEN=54757960

• Etzioni, O., 1996, The World Wide Web: Quagmire or gold mine, Communications of the ACM, 39,11:65-68, http://nlp.uned.es/WebMining/Tema1.Introducci%F3n/kosala2000.pdf


• Hansen L.K., Sigurdsson S., Kolenda T., Nielsen F.A., Kjems U., Larsen J., 2000, Modeling Text with Generalizable Gaussian Mixtures, Proceedings of IEEE ICASSP’2000, Istanbul, Turkey, vol. VI, pp. 3494-3497, http://mole.imm.dtu.dk/thko_project/larsen.csda.pdf

• Ho S. Y., 2005, An Exploratory Study of Using a User Remote Tracker to Examine Web Users’ Personality Traits, Department of Accounting and Business Information Systems, Faculty of Economics and Commerce, The University of Melbourne, AUS, http://delivery.acm.org/10.1145/1090000/1089669/p659-ho.pdf?key1=1089669&key2=1514838411&coll=GUIDE&dl=GUIDE&CFID=71711545&CFTOKEN=52827898

• Han J., Chang K. C.-C., 2002, Data Mining for Web Intelligence, University of Illinois, USA, www-faculty.cs.uiuc.edu/~hanj/pdf/computer02.pdf

• John G., 1997, Enhancements to the Data Mining Process, Doctoral Dissertation, Department of Computer Science, Stanford University

• John, 1997, after Schmidt-Thieme L., 2002, KDD, Data Mining und Web Mining, Theorie und Forschung, http://www.informatik.uni-freiburg.de/cgnm/lehre/wm-02w/webmining-1.pdf

• Kohavi R., 2001, Mining E-Commerce Data: The Good, the Bad, the Ugly, KDD, San Francisco, California, http://ai.stanford.edu/~ronnyk/kddITrackTalk.pdf

• Kosala R., Blockeel H., 2000, Web Mining Research, Department of Computer Science, Katholieke Universiteit Leuven, Belgium, http://nlp.uned.es/WebMining/Tema1.Introducci%F3n/kosala2000.pdf

• Kothari R., Mittal R., Jain M., Mukesh, 2003, On Using Page Cooccurrences for Computing Clickstream Similarity, www.siam.org/meetings/sdm03/proceedings/sdm03_14.pdf

• Levene M., 2002, School of Computer Science and Information Systems Birkbeck University of London, London, Great Britain www.dcs.bbk.ac.uk/~mark/download/lec6_web_usage_mining.ppt

• Eirinaki M., Vazirgiannis M., 2001, Web Mining for Web Personalization, Athens University of Economics and Business, Greece, www.db-net.aueb.gr/magda/papers/TOIT-webmining_survey.pdf

• Smith M., 2005, Stages of Knowledge Discovery in Websites, Brigham Young University - Data Mining Lab, dml.cs.byu.edu/~smitty/publications/esi2005slides.pdf

• Mobasher B., Cooley R., Srivastava J., 2000, Automatic personalization based on Web Usage Mining, Commun. ACM, New York, USA, http://delivery.acm.org/10.1145/350000/345169/p142-mobasher.pdf?key1=345169&key2=3847548411&coll=GUIDE&dl=GUIDE&CFID=76590242&CFTOKEN=25230862

• Montgomery A.L., Srinivasan K., January 2002, Learning About Customers Without Asking, Graduate School of Industrial Administration, Carnegie Mellon University, Pittsburgh, USA http://www.andrew.cmu.edu/user/alm3/papers/online%20learning.pdf

• Mulvenna M. D., Anand S. S., Buchner A. G., 2000, Personalization on the net using Web Mining, Commun. ACM, after Eirinaki M., Vazirgiannis M., 2003, Web Mining for Web Personalization, ACM Transactions on Internet Technology, Vol. 3, No. 1, Athens University of Economics and Business, Greece, http://delivery.acm.org/10.1145/650000/643478/p1-eirinaki.pdf?key1=643478&key2=2606048411&coll=GUIDE&dl=GUIDE&CFID=71950429&CFTOKEN=54757960

• Newquist H.P., 1996, Data Mining: The AI Metamorphosis, Database Programming & Design, http://www.dbpd.com/newquist.htm

• Nigam K., McCallum A.K., Thrun S., Mitchell T., 2000, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, vol. 39, no. 2-3, pp. 103-134, http://mole.imm.dtu.dk/thko_project/larsen.csda.pdf

• Newquist, 1996, after Lars Schmidt-Thieme, KDD, Data Mining und Web Mining, Theorie und Forschung, 2002, http://www.informatik.uni-freiburg.de/cgnm/lehre/wm-02w/webmining-1.pdf

• Parsaye K., 1996, New Realms of Analysis, Database Programming & Design

• Parsaye, 1996, after Schmidt-Thieme L., 2002, KDD, Data Mining und Web Mining, Theorie und Forschung, http://www.informatik.uni-freiburg.de/cgnm/lehre/wm-02w/webmining-1.pdf

• Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S., Carey T., 1994, Human-Computer Interaction, after Spiliopoulou M., 2000, Web Usage Mining for Web site evaluation, Communications of the ACM, ACM Press, Berlin, http://delivery.acm.org/10.1145/350000/345167/p127-spiliopoulou.pdf?key1=345167&key2=8712048411&coll=&dl=GUIDE&CFID=71945341&CFTOKEN=48253524

• Schrader M., 2003, Sieben Regeln für erfolgreiches Web Mining, SinnerSchrader AG, Hamburg, http://www.s2analyse.de/s2a/de/content/docs/sinnerschrader_webmining01.pdf

• Spiekermann, S., Grossklags, J., Berendt, B., 2001, E-privacy in 2nd generation E-Commerce: privacy preferences versus actual behaviour, http://delivery.acm.org/10.1145/510000/501163/p38-spiekemann.pdf?key1=501163&key2=8768938411&coll=GUIDE&dl=GUIDE&CFID=69039794&CFTOKEN=16594318

• Spiliopoulou M., 2000, Web Usage Mining for Web site evaluation, Communications of the ACM, ACM Press, Berlin, http://delivery.acm.org/10.1145/350000/345167/p127-spiliopoulou.pdf?key1=345167&key2=8712048411&coll=&dl=GUIDE&CFID=71945341&CFTOKEN=48253524

• Srivastava J., Cooley R., Deshpande M., Tan P.-N., 2000, Web Usage Mining: Discovery and applications of usage patterns from Web Data, SIGKDD Explorations, ACM Press, University of Minnesota, http://delivery.acm.org/10.1145/850000/846188/p12-srivastava.pdf?key1=846188&key2=0387548411&coll=GUIDE&dl=ACM&CFID=76590549&CFTOKEN=19478325

• Srivastava J., Desikan P., Kumar V., 2002, Web Mining – Accomplishments & Future Directions, Computer Science Department, University of Minnesota, Minneapolis, http://www.ieee.org.ar/downloads/Srivastava-tut-paper.pdf

• Tan P.-N., Kumar V., 2001, Discovery of Web Robot Sessions Based on their Navigational Patterns, Department of Computer Science, University of Minnesota, Minneapolis, http://www.springerlink.com/(sis44v45daqusizavrsapo45/app/home/contribution.asp?referrer=parent&backto=issue,2,5;journal,23,41;linkingpublicationresults,1:100254,1

• Theusinger Ch., Huber K.P., 2000, Analyzing the footsteps of your customers, Stanford University, USA, http://ai.stanford.edu/~ronnyk/WEBKDD2000/papers/theusinger.pdf

• Yen S.-J., Lee Y.-S., Hsieh M.-C., 2005, An Efficient Incremental Algorithm for Mining Web Traversal Patterns, Department of Computer Science and Information Engineering, Ming Chuan University, National Taiwan University, Taipei, Taiwan, http://csdl.computer.org/dl/proceedings/icebe/2005/2430/00/24300274.pdf

• Zhao Q., Bhowmick S.S., Gruenwald L., 2005, WAMMiner: In the Search of Web Access Motifs from Historical Web Log Data, Nanyang Technological University, Singapore, The University of Oklahoma, USA, http://delivery.acm.org/10.1145/1100000/1099679/p421-zhao.pdf?key1=1099679&key2=1042648411&coll=GUIDE&dl=GUIDE&CFID=71711545&CFTOKEN=52827898

Tables

Table 1, Web Mining Categories (cp. Srivastava J., 2002), p. 9
Table 2, Example of W3C Extended Log File Format (cp. Levene M., 2002), p. 19

Figures

Figure 1, KDD-Process Model (cp. Fayyad et al., 1996), p. 4
Figure 2, Web Mining Taxonomy (cp. Srivastava J., 2002), p. 6
Figure 3, (cp. Spiliopoulou M., 2000), p. 10
Figure 4, A web site with a sample path (cp. Theusinger Ch., Huber K.P., 2000), p. 17
Figure 5, (cp. Montgomery A.L., Srinivasan K., January 2002), p. 20