
Clickstream Analysis on Web Usage Mining

1 Gnanasambandan P and 2 Poonkuzhali S

1 Ph.D. Scholar, Department of Computer Science and Engineering, Bharath Institute of Higher Education & Research, Chennai, India, [email protected]

2 Professor & Head, Department of Information Technology, Rajalakshmi Engineering College, [email protected]

ABSTRACT

Web usage mining is a productive field of research with a lot of scope for data analytics. The usage patterns of users can be used to make more informed business decisions. Page visits and consecutive hits on pages reveal usage patterns. Pattern discovery, pattern analysis, ratings and follow-up actions can all be based on web usage data. The proposed work presents insights into clickstream analysis for web usage mining based on time series analysis, text analysis and product-based analysis.

Keywords: clickstream analysis, data analytics, document term matrix, time series analysis, web usage mining

I. INTRODUCTION

Web usage mining is a field of machine learning and data mining that gives insight into the discovery, analysis and prediction of usage patterns. With many users now involved in collaborative content generation and a growing number of online transactions, there is huge scope for web usage mining. Web data is information-rich and highly valuable in e-commerce. This information can readily be put to use in recommendation systems or customer behavior analysis.

Customer behavior analysis is an important area that helps promote customer-friendly products; promotional offers can be made based on its recommendations. This information is very useful for promoting business through data analytics. Customer behavior modeling (CBM) designs a mathematical construct that represents the general behaviors identified in particular groups of customers, in order to anticipate how similar customers will behave under particular circumstances. CBM is based on data mining of user data, and each model is constructed to answer one question at one point in time. For example, a customer model can be used to forecast what a particular group of users will do in response to a particular marketing action. If the model is sound and the marketer follows the recommendations it generates, the marketer will find that a majority of the customers in the group respond as the model forecast.

Clickstream analysis works on collected data about customers' page visits and, in some cases, their consecutive visits. There is scope for analyzing the frequency of visits or the usage patterns. A clickstream is a record of a customer's behavior on the Internet, including every website and every page of every website that the user visits, how long the user was on a page or site, in what order the pages were visited, any newsgroups that the user participates in and even the e-mail addresses of mail that the user sends and receives. Both internet service providers and individual websites are capable of tracking a user's clickstream. Clickstream data is becoming very valuable to Internet marketers and advertisers.

Data analytics is the process of analyzing large data sets to uncover patterns, make forecasts, and examine and group all relevant information. It helps to increase business revenue, improve operational efficiency, optimize business operations and customer service efforts, respond more quickly to emerging market trends and gain a competitive edge over rivals, all with the goal of enhancing business performance. To perform any analytical operation, the analyst should first learn the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which lessons can be learned.

International Journal of Pure and Applied Mathematics, Volume 119, No. 16, 2018, 891-899. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/ (Special Issue)

II. RELATED WORKS

A. Web Usage Mining

(Chhavi Rana, 2012) discussed some challenges and future techniques for implementing web usage mining. (Rachit Adhvaryu, 2013) focused on usage mining of websites and described the relevance of web usage mining in deriving usage patterns, along with the other data mining functionalities used in web usage mining. (Randolph E. Bucklin, 2003) used recorded clickstream data to develop a model of visitors' browsing behavior on a website. (W. A. Awad, 2012) applied machine learning algorithms such as SVM, KNN and GIS to the web page classification problem and compared their behavior experimentally. (Vishwanath Bijalwan, 2014) categorized documents using a KNN-based machine learning algorithm to return the most relevant documents. (Yueh-Min Huang, 2006) modeled a Navigational Pattern Tree structure to find the frequent navigation sequences of a user and provide efficient recommendations. (Abdelghani Guerbas, 2013) proposed a framework for web log mining and online navigational pattern prediction that combines the DBSCAN and OPTICS algorithms. A KNN-based approach is then used to predict the online navigation pattern, where sessions are treated as documents, page references as terms and an online session as a query document.

(Yoon Ho Choa, 2002) proposed a methodology for personalized recommendations in an e-commerce environment. A decision tree induction technique is used to select the target customers, whose preferences across products are analyzed from clickstream data (the customers' previous shopping behavior on an e-commerce site). Different association rule sets are generated from multiple datasets. (Yoon Ho Choa & Jae Kyeong Kim, 2004) enhanced recommendation quality with a methodology based on web usage mining and product taxonomy, improving the performance of CF-based recommender systems. (S. Asharaf, 2004) introduced a clustering scheme called the Rough Fuzzy Clustering Approach, which combines rough set theory and fuzzy set theory. This approach generates meaningful abstractions from web access logs; for example, users whose web access logs fall in the same cluster may behave similarly. (C. E. Dinuca, 2012) focused on association finding as a data mining technique for extracting potentially useful information from web usage data and developed, in Java using NetBeans, a program that identifies page associations from sessions.

B. Clickstream Analysis


(Priya Kale, 2015) processed clickstream data using the Hortonworks Data Platform (HDP) for large-scale processing performance and visualized it with Power View tools to analyze how visitors behave on a website. (Dileep Kumar, 2016) suggested a plan to recognize user actions on commerce sites in order to discover a process pattern that implements business intelligence. The authors extracted the web access patterns and recognized the usage patterns of customers from the clickstream data of the websites. (R. Dale Wilson, 2010) illustrated how clickstream data is collected from B2B websites and used web analytics software to improve B2B website performance. (Randolph E. Bucklin, 2008) reviewed major developments in the analysis of clickstream data for understanding browsing and site usage behavior on the Internet, the role of the Internet, and shopping behavior on e-commerce websites. (D. P. Acharjya, 2016) explored the potential challenges and impact of big data, its research issues and the various tools associated with it. As a result, that article provides a platform for examining big data at various stages.

III. SYSTEM IMPLEMENTATION

The proposed system analyses user visits within the following scope. Data is collected from the visits of many users logging into the search engine. The data log is collected and pre-processed, and the resulting data is organized as a dataset. The dataset is then subjected to traffic analysis and usage analysis.

Traffic analysis can be carried out in a coarse-grained or a fine-grained fashion. Coarse-grained analysis examines the entire dataset to find the websites that receive the most traffic from the search engine, that is, the websites most users visit frequently. Fine-grained analysis probes the traffic behavior of individual customers, that is, the websites each user visits frequently.

Fig 1: System Architecture

For usage mining, coarse-grained and fine-grained analysis can again be done. In coarse-grained analysis, the most frequent searches and the resulting clicked URLs are analysed.

[Fig 1 components: Users, Search Engine, Clickstream Data Collection, Traffic Analysis, Usage Analysis, Time Series Analysis, Text Analysis, Visits Analysis]


On a fine-grained basis, insights can be gained into customer behavior. It is possible to form patterns of customer usage, e.g. which site a user prefers and at which time intervals the user visits it.
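The coarse-grained and fine-grained views described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' implementation; the record layout `(user_id, clicked_domain, hour_of_day)`, the sample records and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (user_id, clicked_domain, hour_of_day) records, for illustration only.
records = [
    (1, "google.com", 9),
    (1, "google.com", 21),
    (1, "amazon.com", 21),
    (2, "google.com", 9),
    (2, "myspace.com", 22),
    (2, "myspace.com", 23),
]

def coarse_traffic(records):
    """Count visits per domain across all users (coarse-grained view)."""
    return Counter(domain for _, domain, _ in records)

def fine_traffic(records):
    """Count visits per domain separately for each user (fine-grained view)."""
    per_user = defaultdict(Counter)
    for user, domain, _ in records:
        per_user[user][domain] += 1
    return per_user

def visit_hours(records):
    """Collect, per (user, domain), the hours of day at which visits occur."""
    hours = defaultdict(list)
    for user, domain, hour in records:
        hours[(user, domain)].append(hour)
    return hours

overall = coarse_traffic(records)
by_user = fine_traffic(records)
hours = visit_hours(records)

print(overall.most_common(1))          # most visited domain overall
print(by_user[2].most_common(1))       # domain preferred by user 2
print(hours[(2, "myspace.com")])       # when user 2 visits that domain
```

Keeping the per-user counters separate from the overall counter mirrors the distinction in the text: the same records answer both "which sites do most users visit" and "which site does this user prefer, and when".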

A. Text Analysis

Text analysis is performed by analyzing the frequency of terms in the documents. Two quantities are identified:

TF: Term Frequency, how frequently a word occurs in a given document.

IDF: Inverse Document Frequency, which measures the importance of a term across the collection of documents; terms that occur in fewer documents receive a higher weight.

This information is used to find the most-clicked sites, and the word cloud is formed on this basis.
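As a worked illustration of the two quantities, the sketch below computes TF and IDF for a toy collection. The paper does not state which TF-IDF variant it uses, so the formulas here (TF as a relative count within the document, IDF as the natural logarithm of the number of documents divided by the number of documents containing the term) are an assumption, as are the sample documents and function names.

```python
import math
from collections import Counter

def tf(term, doc):
    """Relative frequency of `term` within one tokenized document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / df): higher for terms that appear in fewer documents."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc` relative to the whole collection."""
    return tf(term, doc) * idf(term, docs)

# Hypothetical tokenized documents, for illustration only.
docs = [
    ["google", "maps", "google"],
    ["google", "news"],
    ["amazon", "books"],
]

for term in ("google", "maps", "amazon"):
    print(term, [round(tf_idf(term, doc, docs), 3) for doc in docs])
```

A term such as "google" that occurs in most documents gets a low IDF, so its weight stays modest even where its raw count is high.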

B. K-Means Clustering

Clustering is an unsupervised learning technique, used where the class labels are not known. K-Means is a prominent clustering algorithm and is explained below.

Algorithm for k-means clustering:

Let D = {d1, d2, d3, ..., dn} be the set of data points and V = {v1, v2, ..., vm} be the set of centers.

Step 1: Randomly select 'm' cluster centers.

Step 2: Compute the distance between each data point and each cluster center.

Step 3: Assign each data point to the cluster center that is nearest to it among all the cluster centers.

Step 4: Recompute each cluster center using:

$$v_l = \frac{1}{c_l} \sum_{d_j \in \text{cluster } l} d_j$$

where 'cl' represents the number of data points in the lth cluster and the sum runs over the data points assigned to that cluster.

Step 5: Recompute the distance between each data point and the newly obtained cluster centers.

Step 6: If no data point was reassigned then stop; otherwise repeat from Step 3.
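The steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: it uses squared Euclidean distance, picks the initial centers by sampling data points, keeps a center unchanged if its cluster becomes empty, and caps the number of iterations; the function names and sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points (Step 4)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, m, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, m)            # Step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Steps 2-3 (and 5 on later passes): assign each point to its nearest center.
        new_assignment = [
            min(range(m), key=lambda l: squared_distance(p, centers[l]))
            for p in points
        ]
        if new_assignment == assignment:       # Step 6: nothing was reassigned
            break
        assignment = new_assignment
        for l in range(m):                     # Step 4: recompute each center
            members = [p for p, a in zip(points, assignment) if a == l]
            if members:                        # assumption: keep old center if empty
                centers[l] = mean_point(members)
    return centers, assignment

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, m=2)
print(centers)
print(labels)
```

Fixing the seed makes a run repeatable; in practice k-means is sensitive to the initial centers, so several restarts with different seeds are commonly compared.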

IV. RESULTS AND DISCUSSION

The dataset used for clickstream analysis is taken from Stanford University. The data pertains to America Online (AOL) searches over three months in 2006. The dataset has 657426 unique user IDs and contains 36389567 records. It has been pre-processed, the user information is anonymized, and the fields of the dataset are described below.

i. AnonID - The anonymized ID that uniquely identifies a user.

ii. Query - The actual search text that the user entered.

iii. QueryTime - The time at which the search was made.


iv. ItemRank - If the user clicked on a result of the query, the rank of the clicked item in the result list.

v. ClickURL - If the user clicked on a result of the query, the domain portion of the URL that was clicked.
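A record with the five fields above can be read as sketched below. The sketch assumes a tab-separated file with a header row and assumes that ItemRank and ClickURL are empty when no result was clicked; the sample rows are made up and are not real AOL data, and `read_clicks` is a hypothetical helper name.

```python
import csv
import io

# Two made-up rows in the assumed tab-separated layout (not real AOL data).
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "101\texample query\t2006-03-01 10:15:00\t1\thttp://www.example.com\n"
    "101\tanother query\t2006-03-01 10:20:00\t\t\n"
)

def read_clicks(handle):
    """Parse tab-separated search records into a list of dictionaries."""
    # QUOTE_NONE: treat quote characters inside queries as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rank = row["ItemRank"]
        rows.append({
            "user": int(row["AnonID"]),
            "query": row["Query"],
            "time": row["QueryTime"],
            "rank": int(rank) if rank else None,   # None when nothing was clicked
            "url": row["ClickURL"] or None,
        })
    return rows

rows = read_clicks(io.StringIO(SAMPLE))
clicked = [r for r in rows if r["url"] is not None]
print(len(rows), len(clicked))
```

Separating searches that ended in a click from those that did not is the starting point for both the text analysis of clicked domains and the item-rank analysis.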

A. Text Analysis

Text analysis is performed by calculating the term-document frequency matrix. The query is extracted and analyzed, and the most frequent and least frequent domains searched, obtained from an implementation in RStudio, are listed below.
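The term-document counting described above can be illustrated with a small Python sketch. The paper's results come from R, so this is not the authors' code; in particular, the normalisation used here (splitting each URL at "://" and removing the remaining punctuation) is an assumption chosen to mimic term forms such as "http" and "wwwgooglecom" that appear in Table 1, and the sample URLs are hypothetical.

```python
import re
from collections import Counter

def url_terms(url):
    """Split a URL at '://' and strip remaining punctuation from each part."""
    parts = url.lower().split("://")
    terms = [re.sub(r"[^a-z0-9]", "", part) for part in parts]
    return [t for t in terms if t]

def term_document_matrix(urls):
    """Return (sorted terms, one row of term counts per URL 'document')."""
    docs = [Counter(url_terms(u)) for u in urls]
    terms = sorted(set().union(*docs))
    return terms, [[doc[t] for t in terms] for doc in docs]

def term_totals(urls):
    """Total count of each term over all documents."""
    return Counter(t for u in urls for t in url_terms(u))

# Hypothetical clicked URLs, for illustration only.
urls = [
    "http://www.google.com",
    "http://www.google.com",
    "http://www.example.org",
]

terms, matrix = term_document_matrix(urls)
totals = term_totals(urls)
print(terms)
print(matrix)
print(totals.most_common())
```

Under this normalisation the scheme prefix becomes a term of its own, which would explain why "http" dominates the most-visited column; filtering it out before ranking would give a cleaner list of domains.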

TABLE 1: FREQUENCY OF THE DOMAINS SEARCHED

Most Visited                    Least Visited
Domain            Count         Domain                    Count
http              1882089       zyrtecscilkacom           1
wwwgooglecom      36018         zyrtecsideeffectmcznet    1
wwwmyspacecom     17646         zyrtecvrmblogspotcom      1
wwwyahoocom       15207         zzgroupcom                1
enwikipediaorg    12104         zzxxccblogspiritcom       1
wwwamazoncom      10742         zzyxucscedu               1

The word cloud of these terms is shown in Fig 2. Packages such as tm, wordcloud and SnowballC are used to generate the word cloud in the R language.

B. Item Analysis

The item rank in this dataset has a maximum of 500. The item ranks are analyzed and clustered to find out how many searches end in a clicked URL. The algorithm used here is simple K-Means clustering (Fig 3). The result shows that many transactions end with item ranks between 300 and 500, which suggests that people who spend longer browsing tend to click on more results.


Fig 2: Wordcloud on Term Frequency

Fig 3: Cluster of ItemRank

V. CONCLUSION

The proposed work shows that web usage mining and clickstream analysis can help characterize the behavior of users on websites, with visits subjected to traffic and usage analysis. The customer behavior, the most-visited websites from the AOL data and the usage patterns are summarized. For larger datasets, the same analysis can be run in Hadoop environments, where more insights can be gained much more efficiently.

REFERENCES

Abdelghani Guerbas, Omar Addam, Omar Zaarour, Mohamad Nagi, Ahmad Elhajj, Mick Ridley, & Reda Alhajj (2013). Effective web log mining and online navigational pattern prediction. Knowledge-Based Systems (Elsevier), 49, 50-62.

D. P. Acharjya & P. Kauser Ahmed (2016). A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools. International Journal of Advanced Computer Science and Applications, 7(2).

S. Asharaf & M. Narasimha Murty (2004). A rough fuzzy approach to web usage categorization. Fuzzy Sets and Systems (Elsevier), 148, 119-129.

W. A. Awad (2012). Machine Learning Algorithms in Web Page Classification. International Journal of Computer Science & Information Technology, 4(5).

J. Borges & M. Levene (2000). Data Mining of User Navigation Patterns. Web Usage Analysis and User Profiling, San Diego, CA, USA, 31-39.

R. Cooley, B. Mobasher, & J. Srivastava (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. In Proc. IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 558-567.

Chhavi Rana (2012). A Study of Web Usage Mining Research Tools. Int. J. Advanced Networking and Applications, ISSN 0975-0290, 3(6), 1422-1429.

R. Dale Wilson (2010). Using clickstream data to enhance business-to-business web site performance. Journal of Business & Industrial Marketing, 25(3), 177-187.

Dileep Kumar & Padidem C. Nalini (2016). Discovery of Process Model Using Click Stream Analysis in Web Mining. International Journal of Pharmacy & Technology, 8(3), 16132-16138.

C. E. Dinucă (2012). An application for clickstream analysis. International Journal of Computers and Communications, 6(1), 68-75.

T. Niknam, E. Taherian Fard, & N. Pourjafarian (2011). An efficient algorithm based on modified imperialist competitive algorithm and K-means for data clustering. Engineering Applications of Artificial Intelligence, 24(2), 306-317.

Priya Kale, Siddhartha Ghosh, & Manoj Kumar Danthala (2015). Visualizing Website Clickstream Data with Apache Hadoop using Hortonworks. International Journal for Research in Applied Science & Engineering Technology, ISSN 2321-9653, 3(5), 7724-7728.

Rachit Adhvaryu, B. H. Gardi Vidyapith (2012). A Review Paper on Web Usage Mining and Pattern Discovery. Journal of Information, Knowledge and Research in Computer Engineering, ISSN 0975-6760, 2(2), 279.

Randolph E. Bucklin & Catarina Sismeiro (2008). Advances in Clickstream Data Analysis in Marketing. Journal of Interactive Marketing, 40(3), 249-267.

R. Rathipriya, K. Thangavel, & J. Bagyamani (2011). Evolutionary Biclustering of Clickstream Data. IJCSI International Journal of Computer Science, 8(3), 341-347.

Vishwanath Bijalwan, Vinay Kumar, Pinki Kumari, & Jordan Pascual (2014). KNN Based Machine Learning Approach for Text and Document Mining. International Journal of Database Theory and Application, 7(1), 61-70.

Yueh-Min Huang, Yen-Hung Kuo, Juei-Nan Chen, & Yu-Lin Jeng (2006). NP-miner: A real-time recommendation algorithm by using web usage mining. Knowledge-Based Systems (Elsevier), 19, 272-286.

Yoon Ho Choa, Jae Kyeong Kim, & Soung Hie Kim (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications (Elsevier), 23, 329-342.

Yoon Ho Choa & Jae Kyeong Kim (2004). Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26, 233-246.

M. J. Zaki (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 42(1-2), 31-60. Kluwer Academic Publishers.

R. Karpagam & S. Suganya (2016). Applications of Data Mining and Algorithms in Education - A Survey. International Journal of Innovations in Scientific and Engineering Research (IJISER), 3(4), 38-46.

Y. Zhao, G. Karypis, & U. Fayyad (2005). Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2), 141-168.
