

Vol. 6, No. 9, September 2015 ISSN 2079-8407

Journal of Emerging Trends in Computing and Information Sciences ©2009-2015 CIS Journal. All rights reserved.

http://www.cisjournal.org


Assessing Real World Applications of Data Mining With SAS Enterprise Miner (EM): A Technical Report for Teaching the Big Data Generation

1 Chon Abraham (Corresponding Author), 2 Margaret Poston

1 Associate Professor, Mason School of Business, College of William and Mary, 101 Ukrop Way, Williamsburg, US
2 Graduate Student, College of William and Mary, US

ABSTRACT

The explosive growth in data collection across different industries has made it necessary to have processes that can infer useful information from this data in a limited amount of time. This report describes data mining techniques and illustrates the practicality of data mining using SAS Enterprise Miner (EM), a leading data analytics software package, through two separate data sets and case studies provided by SAS as part of a teaching series. As academic institutions continue to revamp and develop curricula to meet the challenges of educating the “big data” generation, an assessment and overview of leading industry tools can help educators contextualize problems for students. This paper is an attempt to provide one.

Keywords: Big data, data mining, SAS Enterprise Miner, knowledge discovery in databases

1. INTRODUCTION

On a daily basis, a tremendous amount of data is collected and subsequently circulated. The amassment of large datasets has led to the fields of “Big Data” and “Big Data Analytics”, where rapid analysis is made possible with the use of computational techniques [Chiang and Storey 2012]. Websites like Google collect and store information about web searches. Credit card companies store information about who uses credit cards. Stores, both grocery and retail, collect information about what a person is purchasing, how much, and how often.

The number of web pages indexed by Google [2015] exceeded 60 trillion this year, which is twice the 30 trillion unique URLs reported in 2013 [Koetsier 2013] and a major leap from the 1 trillion reported in 2008 [Fan and Bifet 2013]. Storing 30 trillion unique web pages requires about 100 million gigabytes, or 100,000 terabytes, of space. The images shared each day on the photo-sharing website Flickr would require 3.6 terabytes of storage space per day. Every day 2.5 quintillion bytes of data are generated, and 90% of the data that exists in the world has been generated over the last two years [Wu et al. 2015]. The list of who is collecting data and what type of information is being collected is vast and growing daily.

Much of the information is collected to provide insight into a specific question or concern. Data mining is used to extract important descriptive and predictive information from these warehouses by utilizing certain tools and techniques.

However, a massive shortage of professionals who are skilled in data mining is changing the landscape of educational institutions [Noyes 2014]. Educators need insight into industry-grade tools and offerings that support classroom instruction, especially in the context of business. This paper is an attempt to provide an overview of SAS EM as a leading tool available to educators for teaching data mining. Cases and data are provided by the SAS educators teaching series, so similar case descriptions, data overviews, and output could be replicated or published elsewhere by educators using the software. The novelty of this paper lies in the additional insight provided from a graduate student's perspective and in the contextualization of the impact of data mining applications in industry.

There are hundreds of software applications that use specific types of algorithms to extract information, predictions, and decisions from these large datasets. Examples of these tools are SAS Enterprise Miner, IBM Intelligent Miner, R, Unica Pattern Recognition Workbench (PRW), Watson Analytics, IBM SPSS Modeler, GhostMiner, SAP, SGI MineSet, Oracle Darwin, Angoss KnowledgeSEEKER, Weka, RapidMiner, and several others [Mikut and Reischl 2011]. SAS EM was selected because of its leading status and its academic alliance, which makes the technology readily available for classroom use.

In the following sections, we describe what data mining is and how it is applied in the real world. We also briefly describe the classification of data mining techniques and algorithms. The application and usage of SAS Enterprise Miner in different business segments is discussed, and two case studies are used to illustrate the practical benefits of SAS Enterprise Miner in analyzing datasets. Lastly, the issues and challenges faced by the data mining industry are considered as topics suitable for discussion in an analytics curriculum.

2. FUNDAMENTALS OF DATA MINING

As the name indicates, data mining is the process of retrieving useful information from large amounts of data. The process of data mining helps in recognizing patterns and trends in the data. When the data becomes too big for manual analysis, automatic mining becomes useful [Kriegel et al. 2007]. Richard Watson [2013] describes data mining as “the search for relationships and global patterns that exist in large databases but are hidden in vast amounts of data”.

Data mining is often considered synonymous with Knowledge Discovery in Databases (KDD), which is defined as “the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]. The process of KDD is broken up into five phases: selection, preprocessing, transformation, data mining, and interpretation. Variations of these five steps are used by many of the data mining tools and techniques that are now popular in the market.

2.1 Practical Applications of Data Mining Across Industries

The edge of data mining over traditional statistical approaches lies in the amount of data this relatively new branch of computer science can handle. While statistical tests may be able to give a significant result with just 100 entries, data mining techniques may require millions or billions of records to decipher a useful pattern [Matignon and SAS Institute 2007].

Nearly all sectors and industries have benefited from the use of this technology. Its role in scientific and engineering applications has long been established: data mining has proven its worth in the fields of biology, chemistry, physics, remote sensing, and astronomy [Grossman 2001].

In healthcare and medicine, the mining of clinical records, biomedical research, and other experimental results has been used to make policy changes and has helped in defining scientific hypotheses for further analysis [Yoo et al. 2012].

Text mining of product reviews, such as those available on Amazon.com, has been found useful in analyzing user preferences and in predicting future changes in sales [Archak et al. 2011]. Data mining techniques have also been applied to predicting movie sales by mining online reviews [Yu et al. 2012].

Data mining is also applied in the field of education, where an explosive growth in educational data has led to the use of mining techniques to make managerial decisions [Kumar and Chadha 2011]. In higher education, data science has been used to improve students’ learning activities and course development [Kumar and Chadha 2011].

2.2 Data Driven Decision Making In Marketing

Data-driven decision making has been applied in industry for various purposes. The most common is the use of data science by online advertising agencies to deliver targeted ads. These companies handle billions of ad impressions in a day and process them to make decisions in milliseconds.

In the past decade, the most effective and revolutionary use of web mining has been carried out by the online retailer Amazon. By introducing cookie tracking to monitor users’ browsing habits, Amazon has been able to match product ads to each person’s preferences [Broderick and Grinberg 2013]. More recently, in 2013, Amazon used data mining to choose which TV show to produce from a group of 14 pilot episodes for its new video streaming service.

The online retailer collected data from one million viewers of the pilot episodes, including specifics such as viewing patterns, comments on videos, ratings, and number of shares, to assess the popularity of the TV shows [Sharma 2013]. Based on the analysis of this data, Amazon chose to produce Alpha House out of thousands of show ideas. The TV series went on to get a 7.5/10 rating on IMDb and 8/10 on TV.com.

A classic example of using mining techniques to predict consumer behavior comes from Wal-Mart. A New York Times article explains that Wal-Mart used information uncovered by data mining to make inventory decisions prior to the landfall of Hurricane Frances in 2004. Data mining revealed that Pop-Tarts and beer were among the top sellers prior to a hurricane’s arrival. The company later indicated that those items sold out more rapidly than usual [Hays 2004; Provost and Fawcett 2013].

The strategy of database marketing used by Wal-Mart has been applied by other retailers, such as Macy’s, which uses store credit card usage to collect data. The collected information is used to develop special discount offers that benefit frequent shoppers. Target and Wawa are among the stores that use credit cards to collect data about their customers. Companies even share or sell customer information. Airlines, car rental companies, and hotel chains, for instance, allow member numbers to be used for discounts with one another. Companies share information about when customers travel, the duration of the travel, and the type of travel. These companies can use this information to create travel packages for a specific type of customer or individual.

Casinos are using data mining in order to ensure patron loyalty. Harrah’s, one of the largest and most successful casino chains, has continued to be successful as a result of its loyalty rewards program. This program allows customers to identify themselves at stations all over the casino in order to earn points. These points “can bring them a stream of ‘comps’ - small complimentary gifts, such as meals and free hotel rooms” [Schofield 2004].


2.3 Classification of Data Mining Algorithms

There are two basic categories of data mining algorithms: descriptive, also called unsupervised learning, and predictive, also referred to as supervised learning [Ye 2013]. Descriptive or unsupervised methods work by measuring similarity between objects to establish relationships, whereas predictive or supervised methods first infer prediction rules, which are then applied to unclassified data. Clustering, association analysis, sequence discovery, and summarization are types of descriptive methods, whereas classification, regression, time series analysis, and prediction methods are predictive algorithms [Anderson 2012]. These algorithms are used in conjunction with statistical methods and visualization techniques. Some of the common methods for this purpose are decision trees, genetic algorithms, k-means clustering, and regression techniques. A brief overview of some of these algorithms is given below.

a. Association Analysis

An association analysis “identifies affinities existing among the collection of items in a given set of records”. Association analysis is often referred to as market basket analysis or affinity analysis. Association rules take the form Set A ⇒ Set B, meaning that the items in set A imply that the items in set B are also in the transaction.

In order to determine the strength of the association, the support and the confidence of each rule are determined. The support of a rule is the probability that the items in the two sets (on each side of the rule) occur together. Equation 1 shows how the support of A ⇒ B is calculated.

$$\text{support}(A \Rightarrow B) = \frac{\text{number of transactions containing every item in } A \text{ and } B}{\text{total number of transactions}}$$

Equation 1: Support Calculation

The confidence of A ⇒ B is the probability of a transaction containing the items in set B given that it contains the items in set A. Equation 2 shows how the confidence of A ⇒ B is calculated.

$$\text{confidence}(A \Rightarrow B) = \frac{\text{number of transactions containing every item in } A \text{ and } B}{\text{number of transactions containing the items in } A}$$

Equation 2: Confidence Calculation

One important thing to note is that cause and effect is not implied by high levels of support and confidence. In fact, there could potentially be no correlation between the two sets of interest. Also, the term confidence does not maintain the same meaning as in statistics.

Other measurements of the strength of an association are the expected confidence and the lift. The expected confidence of the rule treats each side of the rule as if it were an independent event. Consequently, the expected confidence of the rule A ⇒ B is calculated by dividing the number of transactions that include set B by the total number of transactions. The lift of A ⇒ B is a measure of association between A and B such that a lift greater than 1 indicates a positive correlation, a lift less than 1 indicates a negative correlation, and a lift equal to 1 indicates no correlation. The lift is calculated by Equation 3:

$$\text{lift}(A \Rightarrow B) = \frac{\text{confidence of the rule}}{\text{expected confidence of the rule}}$$

Equation 3: Lift of a Rule

An association analysis is used to determine if the purchase of an item implies that another item will also be purchased. Suppose there are 4 baskets with three items in each, as in Table 1.

Table 1: Items in grocery baskets

Basket 1   Basket 2   Basket 3   Basket 4
A, B, C    B, C, D    A, C, D    A, D, E
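The support, confidence, expected confidence, and lift defined above can be checked by hand for the four baskets in Table 1. The following is a minimal Python sketch, not SAS Enterprise Miner code; the function names are illustrative assumptions, and exact fractions are used so the results can be compared directly with hand calculations.

```python
from fractions import Fraction

# The four baskets from Table 1.
baskets = [
    {"A", "B", "C"},
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
]


def support(lhs, rhs, transactions):
    """Share of all transactions containing every item in lhs and rhs."""
    both = lhs | rhs
    hits = sum(1 for t in transactions if both <= t)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the share that also contain rhs.

    Assumes lhs occurs in at least one transaction.
    """
    with_lhs = [t for t in transactions if lhs <= t]
    hits = sum(1 for t in with_lhs if rhs <= t)
    return Fraction(hits, len(with_lhs))


def expected_confidence(rhs, transactions):
    """Share of all transactions containing rhs, ignoring lhs entirely."""
    hits = sum(1 for t in transactions if rhs <= t)
    return Fraction(hits, len(transactions))


def rule_measures(lhs, rhs, transactions=baskets):
    """Return (support, confidence, expected confidence, lift) for lhs => rhs."""
    lhs, rhs = set(lhs), set(rhs)
    conf = confidence(lhs, rhs, transactions)
    expected = expected_confidence(rhs, transactions)
    return support(lhs, rhs, transactions), conf, expected, conf / expected


for lhs, rhs in [("A", "E"), ("E", "A"), ("A", "C"), ("B", "C"), ("D", "E")]:
    s, c, e, l = rule_measures(lhs, rhs)
    print(f"Item {lhs} => Item {rhs}: support={s} confidence={c} expected={e} lift={l}")
```

For the rule Item A ⇒ Item C, for example, this gives a support of 1/2, a confidence of 2/3, an expected confidence of 3/4, and a lift of 8/9, so the rule has a slightly negative association even though half of the baskets contain both items.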

A few rules and their strength measures derived from the baskets are illustrated in Table 2. Notice that the support of a rule and the lift of a rule are symmetric: the support of Rule 1 and the support of Rule 2 are equal, as are their lifts. The confidence, however, is not symmetric.

Table 2: Association rules

Rule                        Support   Confidence   Expected Confidence   Lift
Rule 1: Item A ⇒ Item E     1/4       1/3          1/4                   4/3
Rule 2: Item E ⇒ Item A     1/4       1/1          3/4                   4/3
Rule 3: Item A ⇒ Item C     2/4       2/3          3/4                   8/9
Rule 4: Item B ⇒ Item C     2/4       2/2          3/4                   4/3
Rule 5: Item D ⇒ Item E     1/4       1/3          1/4                   4/3

b. Sequential Patterns

Identifying sequential patterns involves detecting “frequently occurring sequences from given records.” A sequence analysis is much like an association analysis but includes a time dimension. A sequence analysis is performed if a company wants to know the order in which a customer bought items. For instance, customers might buy gloves, hats, and scarves on separate trips to the store. Sequential patterns will tell the store if there is a particular order in which customers make their purchases.
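To make the time dimension of a sequence analysis concrete, the sketch below counts how many customers bought one item on an earlier visit than another, using the gloves, hats, and scarves example. The purchase histories are invented for illustration, and counting ordered pairs is a deliberate simplification; it is not the sequence algorithm used by any particular tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one list per customer, in visit order.
histories = {
    "c1": ["gloves", "hat", "scarf"],
    "c2": ["gloves", "scarf"],
    "c3": ["hat", "gloves", "scarf"],
    "c4": ["gloves", "hat"],
}


def ordered_pair_support(histories):
    """Share of customers who bought `first` on an earlier visit than `second`."""
    counts = Counter()
    for purchases in histories.values():
        # combinations() keeps input order, so every pair is (earlier, later).
        # A set is used so each customer counts at most once per pair.
        seen = {(a, b) for a, b in combinations(purchases, 2) if a != b}
        counts.update(seen)
    n = len(histories)
    return {pair: count / n for pair, count in counts.items()}


pair_support = ordered_pair_support(histories)
for pair, share in sorted(pair_support.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{pair[0]} then {pair[1]}: {share:.2f}")
```

In this invented data, three of the four customers bought gloves before a scarf, while no customer bought a scarf before gloves. An association analysis on the same data would treat the two directions identically; the ordering is what the sequence analysis adds.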


c. Classifying

Classifying involves separating “predefined classes…into mutually exclusive groups”. Classifications are made before an analysis is performed. For instance, if a regression were to be performed on data representing factors that influence tuition rates at colleges, classifications could be based on the state in which a college is located and on whether the school is private or public. Stores might classify customers as frequent shoppers, occasional shoppers, or infrequent shoppers. A frequent shopper might shop in a store once a week; an occasional shopper might shop in a store once a month; and an infrequent shopper might shop in the store once a year. In a regression analysis, these variables could be used as binary indicator variables.

d. Clustering

Clustering is similar to classifying in that it involves separating the data into classes. Clustering identifies unknown classes, rather than predetermined classes, within the data. A clustering analysis can also be referred to as unsupervised classification or segmenting. It can be useful in creating marketing strategies if it can uncover groups with unique profiles. It can also be used as a tool in developing predictive models. There are a number of methods for clustering, such as the k-means clustering algorithm.

e. Predictive

Prediction does what the name implies: it predicts a “future value of a variable”. Predictive models must

“provide a rule to transform a measurement into a prediction

have a means of choosing useful inputs from a potentially vast number of candidates

be able to adjust its complexity to compensate for noisy training data”
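The first two requirements in the list above can be illustrated with a decision stump, which is a decision tree with a single split. The sketch below is an assumption-laden illustration in plain Python: the training data, the input names `visits` and `basket`, and the restriction to rules of the form "value above threshold predicts 1" are all invented for the example. It does not address the third requirement; a stump is simply the least complex tree possible.

```python
# Hypothetical training data: each row is (inputs, label).
# Inputs: store visits per month and average basket value.
# Label: 1 if the customer responded to an offer, otherwise 0.
rows = [
    ({"visits": 1, "basket": 40}, 0),
    ({"visits": 2, "basket": 15}, 0),
    ({"visits": 3, "basket": 55}, 0),
    ({"visits": 5, "basket": 20}, 1),
    ({"visits": 6, "basket": 60}, 1),
    ({"visits": 8, "basket": 35}, 1),
]


def fit_stump(rows):
    """Choose the input and threshold with the fewest training errors.

    The candidate rule is always 'predict 1 when value > threshold'.
    Returns a tuple (training errors, input name, threshold).
    """
    best = None
    for name in rows[0][0]:
        values = sorted({x[name] for x, _ in rows})
        # Candidate thresholds are midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            errors = sum((x[name] > threshold) != bool(y) for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, name, threshold)
    return best


def predict(stump, x):
    """Apply the fitted rule to one new observation."""
    _, name, threshold = stump
    return int(x[name] > threshold)


stump = fit_stump(rows)
print(stump)
print(predict(stump, {"visits": 7, "basket": 10}))
```

On this data the search selects `visits` with a threshold of 4.0 and ignores `basket`, which shows input selection in its simplest form. A full decision tree repeats this search within each resulting group.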

Predictive modeling is done using a variety of statistical techniques, including decision trees, neural networks, and regression.

3. DATA MINING WITH SAS ENTERPRISE MINER (EM)

SAS EM is a leading platform for mining data, especially in the advanced and predictive analytics market segment [Gartner 2015]. The software uses a modification of the KDD process with five steps known as SEMMA: sample, explore, modify, model, and assess [Abell 2014; Al Ghoson 2010].

SAS Enterprise Miner has consistently been ranked among the top ten most popular data analytics tools for the past several years. The 16th annual KDnuggets Software Poll in 2015 ranked SAS Enterprise Miner as the ninth most popular tool used for real projects, with 11.3% of the 2900 votes [Piatetsky 2015].

SAS Enterprise Miner has its most solid standing in the advanced analytics market segment, where it dominates with a 36.2% market share according to IDC’s 2012 report [Vesset et al. 2013]. SAS was again ranked as the top supplier in the advanced and predictive analytics market, with a 35.4% market share, in IDC’s report based on a 2013 survey [Vesset et al. 2014]. In business analytics overall, SAS was ranked as the fifth-largest vendor by revenue in both 2012 and 2013, after Oracle, SAP, IBM, and Microsoft [Vesset et al. 2013; Vesset et al. 2014]. Its share of the business analytics market was 6.9% in 2012 and 6.3% in 2013.

In industry, SAS Enterprise Miner has been used for various applications of customer relationship management (CRM) and is widely used in a variety of commercial applications [Farooqi and Raza 2012]. SAS has also been used in mining customer value via direct marketing strategies [Wang et al. 2005]; here predictive or incremental response models are applied to a particular group of customers in a way that maximizes the profit return while minimizing the cost [Lee et al. 2013].

SAS has also been used in several other fields. In the healthcare market, for instance, the insurance company Highmark was able to build a fraud detection system based on real-time or near real-time analysis. The same company also used SAS to construct a decision tree model using patient symptoms, health history, and demographics that helped in maximizing revenues at the company. Another health services company, Healthways, used predictive models in SAS Enterprise Miner to minimize healthcare costs [Yoo et al. 2012].

As mentioned earlier, SAS approaches data mining in a five-step SEMMA process. Given a large dataset, the first step is to sample. During this step, a subset of the dataset is extracted. This subset must be “small enough to process” efficiently, yet large enough to contain significant data. If the subset of data is too large, the obvious issues of insufficient computer memory and lengthy processing time are encountered. On the other hand, if the subset is too small, the risk is that there is not enough data to accurately identify relationships and patterns. The procedure might identify a pattern that is relevant only to the small subset of data selected rather than to the full dataset.
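The sampling step described above can be sketched as a simple random sample drawn without replacement. This is an illustration in Python rather than the behaviour of any SAS node; the population size, sample size, and seed value are arbitrary assumptions chosen for the example.

```python
import random


def draw_sample(records, n, seed=12345):
    """Return a simple random sample of n records without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    return rng.sample(records, n)


# Hypothetical stand-in for a large dataset: 100,000 record identifiers.
population = list(range(100_000))
sample = draw_sample(population, 10_000)

print(len(sample), "records sampled,", len(set(sample)), "distinct")
```

Fixing the seed is a design choice that makes an analysis repeatable: every later step sees the same subset, so differences between runs come from the modelling choices rather than from the sample.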

Once an appropriate subset is selected, the next step in the process is to explore the data. This entails performing a preliminary investigation to identify trends or relationships that might be of significance and worth studying in more detail. The intention is that one becomes familiar with the dataset.

The data is modified next. Modifying data consists of “creating, selecting, and transforming the variables”. These modifications are made with the intention of building a model.


Modeling is the fourth step of this process. This step entails taking the variables from the modification step and applying “analytical tools” to ascertain a relationship which accurately predicts a desired result. No model will ever be perfect, but the goal is to create a model which is as good as possible.

The final step is to assess the model. In this stage, the model is evaluated to determine its relevance. This step can involve comparing multiple models to identify the best model for a specific situation.
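As a hedged illustration of the assessment step, the sketch below compares two candidate models on held-out validation data using the misclassification rate. The validation records, the two threshold rules, and the choice of misclassification rate as the comparison statistic are assumptions made for the example, not details taken from the SAS case material.

```python
# Hypothetical validation data: (visits per month, actual response).
validation = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]

# Two hypothetical candidate models, each mapping an input to a prediction.
candidates = {
    "threshold_2.5": lambda visits: int(visits > 2.5),
    "threshold_4.5": lambda visits: int(visits > 4.5),
}


def misclassification_rate(model, data):
    """Share of validation records the model predicts incorrectly."""
    errors = sum(model(x) != y for x, y in data)
    return errors / len(data)


rates = {name: misclassification_rate(model, validation)
         for name, model in candidates.items()}
best = min(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
print("selected:", best)
```

Neither rule is perfect on this data: the first misclassifies one record in eight and the second misclassifies three, so the first is selected. Using data that was not involved in fitting the models is what makes the comparison meaningful.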

SAS Enterprise Miner uses this process to organize the available tools and functions of the software. SAS uses these functions and statistical techniques to perform analysis on vast quantities of data.

4. CASE STUDIES

The following section provides two data mining case studies using SAS Enterprise Miner that are offered to educators free of charge along with access to the SAS EM platform. The case study overviews and datasets are provided by SAS as part of the teaching series to demonstrate the capabilities of the software [Matignon and SAS Institute 2007]. Thus, the case study descriptions, data sets, and output are freely used and can appear in any number of resources published by those employing the platform (e.g., http://mis.aug.edu/drjmatls/JMP%20Training%20Folders/Adv%20Analytics%20AAEM71%20July%202012/CA/Case%20Studies.doc). The first study, which is rather simple, involves the usage of web site services by a radio station’s listeners. The second study is about enrollment management at a college.

Case No 1: Web Site Usage Associations

A radio station collected data about the usage of its website by its listeners. The website provides a number of services, and the radio station would like to know if any unusual patterns exist in the combinations of services selected by its web users.

To begin the analysis using SAS Enterprise Miner, the diagram is created and the data source is defined. The next step is to begin exploring the data to become familiar with the database as well as to collect initial statistics. This is done by adding the StatExplore node to the diagram and running the path. The diagram is displayed in Figure 1.

Figure 1: Initial diagram

As seen in Table 3, there are only two variables: the URL and the web service that is selected. Table 4 shows that the ID (the URL) has 1,586,124 distinct values (levels), corresponding to unique web users. The Target (the web service) shows only 8 different levels. These are associated with the 8 web services provided to the listeners by the radio station.

Table 3: Variables in web site case study

Name     Model Role   Measurement Level   Description
ID       ID           Nominal             URL (with anonymous ID numbers)
TARGET   Target       Nominal             Web service selected

Table 4: Summary of variables in web site case

Variable   Role     Number of Levels
ID         ID       1586124
TARGET     TARGET   8

Plots displaying the frequency of the target (or web service) are also generated using the StatExplore node. Figure 2 is the bar graph of the usage of services, while Figure 3 presents the same information in the form of a pie chart.

Figure 2: Bar graph of usage of web site services


Figure 3: Pie graph of usage of web services

Both figures identify the eight levels of the target, corresponding to the eight services offered by the website, and the relative frequency of each in a sample of 10,000 observations. These services include news streams, archives, music streams, simulcast, external referrers, podcast, website, and live streams. Unfortunately, no further information is available about what each of the services entails. The 10,000 observations were randomly sampled from the database.

The website is the most frequently used service, followed by the podcast. Of the 10,000 observations, 40.44% involve the website and 31.38% involve the podcast service. The least utilized services are the external referrers and the live streams. External referrers appear in only 1.25% of the observations, and live streams are used the least, appearing in only 1.15% of the observations.

After exploring the data, the association tool is used to identify possible associations within the database. The diagram is updated with the addition of the Association node, as seen in Figure 4.

Figure 4: Final diagram for web site services

To obtain association rules, the default settings of the Association node were altered to set the number of items to process to 3 million and the minimum support percentage to 1. The results of running the path are 3 plots and a list of associations. The list of associations can be found in Appendix A.

The statistics plot of the website services usage in Figure 5 plots the support on the Y-axis against the confidence level on the X-axis. Notice that for the associations with three and four relations, the support level always falls below 5%. Also note that the associations with the highest support only involved two relations.

Figure 5: Statistics plot of web site services usage

Figure 6 is a plot of the rules based on which side of the rule the items belong to. The plot also uses the color of each point to display the confidence of the rule. The rules with the highest level of confidence belong to the column with red points (the third column from the right). The only rule with 100% confidence is that use of the live stream implies use of the website.

Figure 6: Rules matrix

The statistics line plot in Figure 7 plots the lift, expected confidence, confidence, and support for each of the rules, ordered by rule index. This graph indicates that the rules are indexed in descending order of lift. For the rule with 100% confidence identified in the rules matrix, this plot reveals more information: the index of the rule is 34, its lift is 1.74, its expected confidence is 57.52%, and its support is 2.15%.
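The statistics quoted for rule 34 are related by a simple ratio: lift is confidence divided by expected confidence. The sketch below is only an illustration in Python, not the Association node's implementation; the function name `rule_stats` and the toy data are hypothetical, and only the figures 100 and 57.52 come from the text.

```python
def rule_stats(transactions, lhs, rhs):
    """Support, confidence, expected confidence (percent) and lift of lhs ==> rhs.

    Assumes at least one transaction contains lhs and at least one contains rhs.
    """
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)           # contain the left side
    n_rhs = sum(1 for t in transactions if rhs <= t)           # contain the right side
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)  # contain both sides
    support = 100.0 * n_both / n
    confidence = 100.0 * n_both / n_lhs
    expected_confidence = 100.0 * n_rhs / n
    return support, confidence, expected_confidence, confidence / expected_confidence


# Hypothetical toy data: one set of services per user (not the case-study data).
toy_users = [
    {"LIVESTREAM", "WEBSITE"},
    {"WEBSITE", "PODCAST"},
    {"PODCAST"},
    {"WEBSITE"},
    {"ARCHIVE", "EXTREF"},
]
toy_stats = rule_stats(toy_users, {"LIVESTREAM"}, {"WEBSITE"})

# Check against the figures quoted for rule 34:
# confidence 100 and expected confidence 57.52 (both in percent).
rule_34_lift = 100.0 / 57.52
print(round(rule_34_lift, 2))  # 1.74
```

A lift above 1 means the right-hand side is more common among users matching the left-hand side than among all users.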

Figure 7: Statistics line plot

All of the information in the plots can be found in the output results in Appendix A. However, the plots provide visual depictions to better identify trends and important rules. The link graph in Figure 8 displays the associations between all of the services. The nodes are sized according to the count, and the thickness of the links indicates the level of confidence. Notice that the node corresponding to both the website and podcast is not connected to any of the other nodes.

In addition to the information discovered about the rule that the use of the live stream tool implies the use of the website, interesting conclusions drawn from this data are:

External referrers also pointed to the archives with 98% confidence. This is also the case for External referrers and users of website pointing to the archives;

Those who used live streams, podcasts, news services, or the simulcast tools were not as likely to go to the website; and,

Those who used the simulcast service were three times as likely to also use the news services.

Figure 8: Linked graph of website services

Case No 2: Enrollment Management

“In the fall of 2004, the administration of a large private university requested that the Office of Enrollment Management and the Office of Institutional Research work together to help identify prospective students who would most likely to enroll as new freshmen in the fall 2005 semester. The administration stated several goals for this project: increase new freshman enrollment, increase diversity, and increase SAT scores of entering students. Historically, inquiries numbered about 90,000+ students, and the university enrolled from 2400 to 2800 new freshmen each fall semester.”

4.1 Initial Observations

The description of the data can be found in Appendix B. The data set contains variables that describe the demographics of enrolled students, financials, correspondence, interests, and visits to the campus. A number of variables were rejected from use in the model. Some were rejected because too many data points were missing. The nominal variables that described academic interests and high school code were replaced with interval variables. The academic codes were replaced by variables that describe the rate at which the code was used over 5 years. The school code was replaced by the enrollment rate from each school over 5 years. The variables that described race and sex were rejected because they are not admissible factors for the decision process.

4.2 Descriptive Statistics

The next step in the analysis is to explore the data and to collect initial statistics. This is done by adding the Stat Explore node to the diagram and displaying the results, as in the radio website case study.

Table 5 displays the class variable summary statistics. Notice that the territory variable, which describes the recruitment area, is the only variable with a missing observation. Moreover, the variables that have a mode of zero also have a very high percentage of zeros, ranging from 51.01% to 97.11%.

Table 5: Class input variables summary statistics for enrollment management

Class Variable Summary Statistics

| Variable | Role | Numcat | NMiss | Mode | Mode Pct | Mode2 | Mode2 Pct |
|---|---|---|---|---|---|---|---|
| CAMPUS_VISIT | INPUT | 3 | 0 | 0 | 96.61 | 1 | 3.31 |
| INSTATE | INPUT | 2 | 0 | Y | 62.04 | N | 37.96 |
| REFERRAL_CNTCTS | INPUT | 6 | 0 | 0 | 96.46 | 1 | 3.21 |
| SOLICITED_CNTCTS | INPUT | 8 | 0 | 0 | 52.45 | 1 | 41.60 |
| TERRITORY | INPUT | 12 | 1 | 2 | 15.98 | 5 | 15.34 |
| TRAVEL_INIT_CNTCTS | INPUT | 7 | 0 | 0 | 67.00 | 1 | 29.90 |
| INTEREST | INPUT | 4 | 0 | 0 | 95.01 | 1 | 4.62 |
| MAILQ | INPUT | 5 | 0 | 5 | 69.33 | 2 | 12.80 |
| PREMIERE | INPUT | 2 | 0 | 0 | 97.11 | 1 | 2.89 |
| STUEMAIL | INPUT | 2 | 0 | 0 | 51.01 | 1 | 48.99 |
| ENROLL | TARGET | 2 | 0 | 0 | 96.86 | 1 | 3.14 |

Table 6 shows the distribution of the class target, which is enrollment. Enrollment is an indicator variable such that a 1 denotes a successful enrollment and a 0 a failure to enroll. Successful enrollment occurs in only 3.135% of the cases.

Table 6: Distribution of class target and segment variables

Distribution of Class Target and Segment Variables

| Variable | Role | Formatted Value | Frequency | Percent |
|---|---|---|---|---|
| ENROLL | TARGET | 0 | 88614 | 96.8650 |
| ENROLL | TARGET | 1 | 2868 | 3.1350 |


Table 7 displays the summary statistics for the interval variables. Notice that the average income and the distance variables are missing a number of data points. The distributions of these variables are plotted in Figure 9. Each of the eight plots indicates skewness in the distribution. The INIT_SPAN input is not as skewed as the other interval variables. The appearance of skewness suggests that a transformation might be necessary in order to obtain an accurate regression model.
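The effect a transformation has on skewness can be illustrated with a small sketch. The paper does not state which transformation, if any, was applied, so the log transform below is an assumption, and the values are hypothetical toy data rather than the enrollment inputs.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed input (for example, a count with a long upper tail).
toy_values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

# Assumed transformation: log(1 + x), which compresses the upper tail.
toy_logged = [math.log1p(v) for v in toy_values]

print(skewness(toy_logged) < skewness(toy_values))  # True
```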


Table 7: Interval variable summary statistics

Figure 9: Distribution of interval variables for enrollment management

4.3 Sampling

Since the chance of enrollment in the data is so small, a stratified sample is taken using the Sample tool. A Sample node is added to the diagram. (The final diagram can be found in Appendix C.) The sample is created such that each case that enrolled in the school is forced into the sample. Then, for each of the 2,868 enrollments, 7 other observations are included in the sample, resulting in a sample size of 22,944. Running the Sample node and exploring the results shows that this increases the percentage of cases that enroll to 12.5%. The results are displayed in Table 8.

Table 8: Summary statistics for sample in enrollment management case

Summary Statistics for Class Targets

Data=DATA

| Variable | Numeric Value | Formatted Value | Frequency | Percent |
|---|---|---|---|---|
| Enroll | 0 | 0 | 88614 | 96.8650 |
| Enroll | 1 | 1 | 2868 | 3.1350 |

Data=SAMPLE

| Variable | Numeric Value | Formatted Value | Frequency | Percent |
|---|---|---|---|---|
| Enroll | 0 | 0 | 20076 | 87.5 |
| Enroll | 1 | 1 | 2868 | 12.5 |
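The sampling scheme of Section 4.3 (keep every enrollment, then add 7 non-enrollments per enrollment) can be sketched in a few lines. This is an illustration in Python, not the Sample node's algorithm; the function name `sample_keep_events` and the seed are hypothetical, and only the class counts come from Table 6.

```python
import random


def sample_keep_events(labels, ratio=7, seed=0):
    """Return row indices: every event (label 1) plus `ratio` random non-events per event."""
    rng = random.Random(seed)
    events = [i for i, y in enumerate(labels) if y == 1]
    non_events = [i for i, y in enumerate(labels) if y == 0]
    chosen = rng.sample(non_events, ratio * len(events))  # sampling without replacement
    return events + chosen


# Class counts taken from Table 6: 2,868 enrollments and 88,614 non-enrollments.
labels = [1] * 2868 + [0] * 88614
sample_idx = sample_keep_events(labels)
event_rate = sum(labels[i] for i in sample_idx) / len(sample_idx)
print(len(sample_idx), event_rate)  # 22944 0.125
```

Because the sample over-represents enrollments, the true prior has to be supplied again at the decision stage, which is what Section 4.4 does.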

4.4 Decision Process

Recall that the administration wants to identify the students who are most likely to enroll at the college. A good candidate will have an above-average probability of enrollment. To incorporate this in the model, a Decision node is used. The Decision node allows the prior probability of enrollment of 3% to be used and the central decision rule to be created. The central decision rule is a matrix whose diagonal entries are the inverses of the prior probabilities. The matrix is used to predict enrollment when the estimated probability is greater than 3% (the prior probability). Otherwise, the applicant is considered unlikely to enroll.

4.5 Prediction Model (All Cases)

After the Decision node was set up to determine the central decision rule, the diagram was completed in order to perform a stepwise regression on the given data. To do this, the data is partitioned using a Data Partition node that uses 60% of the data for training and 40% of the data for validating the model. An Impute node is used next to fill in the missing data. The node uses the tree method for both class and interval variables. Also, under the indicator properties, unique missing indicator variables are utilized to create a binary indicator variable for every imputed variable. These binary indicator variables are then used as inputs. A stepwise regression was used with the entry and stay significance levels at the default setting of 0.05. The variables selected in the stepwise regression are then used by the Neural Network and Instate Regression nodes.
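The central decision rule of Section 4.4 can be checked numerically. The sketch below assumes a profit matrix with the inverse prior probabilities on the diagonal and zeros elsewhere; under that assumption, choosing the decision with the larger expected profit is equivalent to predicting enrollment whenever the estimated probability exceeds the prior. The function name `central_decision` is hypothetical, and the 3% prior is the figure quoted in the text.

```python
def central_decision(p_enroll: float, prior_enroll: float = 0.03) -> int:
    """Return 1 (likely to enroll) or 0, using an assumed inverse-prior profit matrix."""
    # Assumed profits: a correct "enroll" decision earns 1 / prior_enroll,
    # a correct "not enroll" decision earns 1 / (1 - prior_enroll), errors earn 0.
    expected_profit_enroll = p_enroll / prior_enroll
    expected_profit_not = (1.0 - p_enroll) / (1.0 - prior_enroll)
    return 1 if expected_profit_enroll > expected_profit_not else 0


for p in (0.01, 0.02, 0.05, 0.50):
    print(p, central_decision(p))
```

Algebraically, p / π > (1 - p) / (1 - π) reduces to p > π, so the threshold is the prior itself rather than the conventional 0.5.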

Table 9 displays the results from the Stepwise Regression node. The stepwise regression employs a logistic regression with a logit link. Both the count of self-initiated contacts and the rate of enrollment of the high school are highly significant in the model, with p-values below 0.0001. The student e-mail, though included, is not significantly different from zero, with a large p-value of 0.6482.

Table 9: Stepwise regression results

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr>ChiSq | Standardized Estimate | Exp(Est) |
|---|---|---|---|---|---|---|---|
| INTERCEPT | 1 | -12.1953 | 16.8329 | 0.52 | 0.4688 | | 0.000 |
| SELF_INIT_CNTCTS | 1 | 0.7069 | 0.0195 | 1315.95 | <.0001 | 0.8446 | 2.028 |
| HSCRAT | 1 | 16.5157 | 0.7711 | 458.71 | <.0001 | 0.7267 | 999.000 |
| STUEMAIL 0 | 1 | -7.6794 | 16.8321 | 0.21 | 0.6482 | | 0.000 |

Odds Ratio Estimates

| Effect | Point Estimate |
|---|---|
| SELF_INIT_CNTCTS | 2.028 |
| HSCRAT | 999.000 |
| STUEMAIL 0 VS 1 | <0.001 |


The regression equation for the stepwise node is:

\[
\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -12.20 + 0.71\,X_1 + 16.52\,X_2 - 7.68\,X_3
\]

where $X_1$ denotes self-initiated contacts, $X_2$ denotes the rate of enrollment from the high school over the past 5 years, and $X_3$ denotes if an e-mail address is supplied.

The odds ratio expresses the multiplicative change in the odds of a student enrolling in the college for a unit change in that particular input. Consequently, if the count of self-initiated contacts increases by one unit, the odds of that student enrolling are multiplied by about 2 (2.028). Notice that the odds ratios are unusual for the high school enrollment rate and the student e-mail. This is likely a result of a strong association within each of those inputs. For example, of the students who enrolled, the majority provided an e-mail address. Also, for some high schools, all of the students that applied enrolled or all of the students that applied chose not to enroll at this particular school. A neural network node is also processed using the variables selected by the stepwise regression.
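The odds ratio for self-initiated contacts can be reproduced from its coefficient in Table 9, since the odds ratio for a one-unit change is the exponential of the coefficient. The sketch below is a check in Python on the reported figures only. The treatment of STUEMAIL assumes deviation (effect) coding, which is not stated in the paper but is consistent with the reported "<0.001".

```python
import math

# Coefficient estimates copied from Table 9.
coef_self_init_cntcts = 0.7069
coef_stuemail_0 = -7.6794

# Odds ratio for one additional self-initiated contact.
or_one_contact = math.exp(coef_self_init_cntcts)
print(round(or_one_contact, 3))  # 2.028

# Odds multiplier for three additional contacts: the ratio compounds.
or_three_contacts = or_one_contact ** 3

# Assumption: STUEMAIL uses deviation (effect) coding, so the "0 vs 1"
# odds ratio is exp(2 * coefficient) rather than exp(coefficient).
or_stuemail_0_vs_1 = math.exp(2 * coef_stuemail_0)
print(or_stuemail_0_vs_1 < 0.001)  # True
```

The compounding in `or_three_contacts` shows why the number of contacts dominates the model: three extra contacts multiply the odds of enrollment roughly eightfold.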

Next, the INSTATE input is added to the regression model using the regression node named Instate Regression. This uses the variables selected in the stepwise regression and forces the INSTATE input to be included as well. This variable is included because a student’s enrollment decision is thought to depend on whether the student is in-state or out-of-state. The results of this regression are displayed in Table 10.

Table 10: Instate regression results

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr>ChiSq | Standardized Estimate | Exp(Est) |
|---|---|---|---|---|---|---|---|
| INTERCEPT | 1 | -12.0541 | 16.7449 | 0.52 | 0.4716 | | 0.000 |
| INSTATE N | 1 | -0.4145 | 0.0577 | 51.67 | <.0001 | | 0.661 |
| SELF_INIT_CNTCTS | 1 | 0.6889 | 0.0196 | 1233.22 | <.0001 | 0.8231 | 1.992 |
| HSCRAT | 1 | 16.2327 | 0.7553 | 461.95 | <.0001 | 0.7142 | 999.000 |
| STUEMAIL 0 | 1 | -7.3528 | 16.7443 | 0.19 | 0.6606 | | 0.001 |

Odds Ratio Estimates

| Effect | Point Estimate |
|---|---|
| INSTATE N VS Y | 0.437 |
| SELF_INIT_CNTCTS | 1.992 |
| HSCRAT | 999.000 |
| STUEMAIL 0 VS 1 | <0.001 |

Notice that whether a student is in-state or out-of-state is significant, with a p-value below 0.0001. Again, the student e-mail is not significant, since its p-value of 0.6606 is larger than the significance level of 0.05. Since the student e-mail coefficient is not statistically different from zero, the regression function for the instate regression is:

\[
\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -12.05 - 0.41\,X_1 + 0.69\,X_2 + 16.23\,X_3,
\]

where $X_1$ denotes instate or out-of-state status, $X_2$ denotes the number of self-initiated contacts, and $X_3$ denotes the rate of enrollment from the high school over the past 5 years. Notice that the coefficient estimates for all of the previously included variables decreased in magnitude with the addition of the instate or out-of-state status.

The initial splits in the decision tree provide insight into the predictions. Figure 10 shows that those who have made fewer than four self-initiated contacts are unlikely to enroll. Students who made fewer than three self-initiated contacts almost never enroll. This can be seen because, for both the training and validation data in the branch with fewer than 2.5 self-initiated contacts, the probability of a 0 (not enrolling in the school) is 99.6% and 99.7%, respectively.

Figure 10: Decision tree (initial splits)

Figure 11 indicates that the optimal number of leaves is 18, because the average profit levels off at that number. Adding more leaves gains nothing.


Figure 11: Average profit

4.6 Comparison of Models

The neural network, the regression including the instate variable, and the tree were all analyzed separately in the previous section. A tool aptly named the Model Comparison tool is used in the diagram to compare all of the models created. According to the results from running this node, shown in Table 11, the neural network has the highest validation profit for enrollment and the lowest validation average squared error.

Table 11: Model comparison output

Fit Statistics
Model selection based on _VAPROF_

| Selected Model | Model Node | Valid: Average Profit for Enroll | Valid: Average Squared Error | Valid: Misclassification Rate | Valid: Roc Index | Valid: Kolmogorov-Smirnov Statistic |
|---|---|---|---|---|---|---|
| Y | Neural | 1.88576 | 0.036167 | 0.07551 | 0.98120 | 0.88705 |
| | Reg2 | 1.86236 | 0.041097 | 0.07736 | 0.97691 | 0.86352 |
| | Tree | 1.88127 | 0.040314 | 0.12497 | 0.96481 | 0.88220 |
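The selection in Table 11 amounts to ranking the candidate models on one validation statistic. A minimal sketch in Python, using the values from the table; the dictionary layout and key names are hypothetical, and this is not the Model Comparison node's code.

```python
# Validation fit statistics copied from Table 11.
fit = {
    "Neural": {"profit": 1.88576, "ase": 0.036167, "misc": 0.07551, "roc": 0.98120, "ks": 0.88705},
    "Reg2":   {"profit": 1.86236, "ase": 0.041097, "misc": 0.07736, "roc": 0.97691, "ks": 0.86352},
    "Tree":   {"profit": 1.88127, "ase": 0.040314, "misc": 0.12497, "roc": 0.96481, "ks": 0.88220},
}

# Selection criterion used in the table: highest validation average profit.
selected = max(fit, key=lambda name: fit[name]["profit"])

# Cross-check with a second criterion: lowest validation average squared error.
lowest_ase = min(fit, key=lambda name: fit[name]["ase"])

print(selected, lowest_ase)  # Neural Neural
```

The two criteria agree here, but they need not in general; the tree, for example, ranks second on profit and on average squared error, but last on misclassification rate.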

The receiver operating characteristic (ROC) chart plots the trade-off between sensitivity and the false positive fraction across all selected fractions of the data. It provides a measure of the predictive accuracy of a model. All of the models appear strong, since the curves are far from the diagonal line. The neural model appears to perform slightly better than the other models, as its curve lies outside the other curves. This supports the previous finding that the neural network has the lowest average squared error.

Table 11 also indicates that the selected model has a validation ROC index, or rank decision statistic, of 98%. This implies that the separation between enrollments and failures to enroll is close to perfect. The decision tree gives some clarification as to why the ROC index is so high. It indicates that the number of self-initiated contacts with the school is key to whether or not the student decides to enroll. Recall that if a student initiated two or fewer contacts, it is almost guaranteed that the student will not enroll at the school.

Figure 12: ROC chart
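The ROC index can be read as the probability that a randomly chosen enrollee receives a higher predicted score than a randomly chosen non-enrollee, which is why it is described as a rank statistic. The sketch below computes it by direct pairwise comparison; the function name `roc_index` and the scores are hypothetical toy values, not output from the case study.

```python
def roc_index(scores, labels):
    """Area under the ROC curve via pairwise comparison of events and non-events."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A pair counts 1 if the event outranks the non-event, 0.5 if they tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and actual outcomes (1 = enrolled).
toy_scores = [0.90, 0.80, 0.35, 0.30, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0]

print(round(roc_index(toy_scores, toy_labels), 3))  # 0.889
```

In the toy data, 8 of the 9 enrollee and non-enrollee pairs are ranked correctly, giving 8/9. An index of 0.98, as in Table 11, means that almost every such pair is ordered correctly.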

The goal of this case study is to identify the students most likely to enroll at the college. This information can then be used by the admissions office to aid in deciding which students to offer acceptance. The results indicate that the largest factor is whether the student contacts the school and how many times he does so. If the student contacts the university on his own more than three times, he is likely to enroll in the college. If the student, on the other hand, contacts the university fewer than three times, it is almost certain that he will decline acceptance to the university.

4.7 Issues and Challenges In Data Mining

Protecting the privacy of those whose information is being collected is the most vital and widely discussed ethical issue in data mining. Automated data mining frequently raises issues of privacy, security, and governance [Sharmaa et al. 2013]. Data that originates from online platforms such as Gmail, Facebook, Twitter, and Amazon is stored and analyzed by companies and then used to engineer targeted ads and make recommendations. Wherever web mining has been applied, the issue of protecting privacy has been raised [Berendt 2012].

In this regard, the initial outcry over Facebook’s tracking of its users’ behavior is worth mentioning. In 2012, the social networking company bought data from the data mining firm Datalogix. With this data on 70 million U.S. households, Facebook was able to predict whether a user bought an item after seeing it marketed on Facebook. Users protested the targeted ads that followed the company’s venture into data mining, and as a result Facebook included an option to opt out of this service [Steel and Dembosky 2012].

Another controversial use of data mining is getting too personal with the customer’s online or in-store behavior. In a presentation given in 2010, Andrew Pole, a predictive analysis expert at Target, said that there is immense potential in focusing ads and offers around a family that is expecting a child, and that predicting a pregnancy would be a very lucrative sales opportunity. This presentation was dug up two years later; stories appeared in the New York Times [Duhigg 2012] and subsequently in Forbes [Hill 2012], and went viral within days. These reports criticized the lengths to which retail stores would go to offer targeted packages and coupons [Siegel 2013].

There are obvious benefits for companies and the government for using data mining techniques. Retail stores use the information to improve the efficiency in their operations and are able to cater to the demands of consumers more effectively. Organizations also use this strategy to personalize a customer’s experience, as seen with Harrah’s casinos. These uses and outcomes appear to be positive for both the corporation and the consumer. While the business is becoming more efficient in operations, consumers reap the benefit of enjoyable experiences, discounts, and special offers.

The government’s unconstrained access to the general public’s personal data, including telephone conversations, messages, credit card usage, and driving records, gives it the opportunity to mine and analyze the information as it pleases. In the best-case scenario, the authorities are able to use the acquired knowledge to prevent criminal activity and terrorist plots. On the other hand, the government’s use of data mining is viewed by some as an invasion of privacy and a violation of basic civil liberties. This objection arises in large part because people suspected of being terrorists, any person with a connection to a suspect, or a randomly selected person could be investigated in this manner. Law-abiding people, however, do not want their telephones tapped, their credit card information examined, or their driving records probed without their knowledge and without good reason. The collection and subsequent analysis of the data is seen as an infringement upon one’s privacy.

The most recent controversy in this regard was the revelation of the National Security Agency (NSA)’s surveillance program by Edward Snowden, a former employee of the organization [Greenwald 2013]. Snowden leaked surprising facts about the depth of the NSA’s invasion into personal communications; these revelations began in June 2013 and continued through the year. The story revealed that the NSA had records of millions of phone calls from the telecom company Verizon and also had access to the servers of large technology companies such as Apple, Facebook, Google, Microsoft, Skype, Yahoo, and YouTube [Lyon 2014]. Critics of the government’s unbridled invasion into personal data argue that in states of emergency the government has the right to suspend some rights, but that this level of blanket surveillance is unacceptable. While some experts are of the opinion that new laws need to be formulated to provide checks and balances in such instances of big data monitoring, others have said that the NSA’s PRISM initiative is completely against the Fourth Amendment and has no allowance in the current legal framework [Park and Wang 2013].

In an essay published in the Stanford Law Review in 2013, the authors emphasize the need for attention to both the positive and negative aspects of big data and government surveillance. It is a tremendous challenge to balance the risk and reward between invasions of privacy and crucial uses of personal data mining [Polonetsky and Tene 2013].

Another criticism of using data mining to detect terrorism concerns the accuracy of analysis results. Since the amount of information is so immense and “terrorism is so rare,” the occurrence of “false positives are inevitable and often more common than truly accurate results” [Stannard 2006]. This implies that accusing an innocent person is unavoidable. The severity of the consequences for terrorists is not something that an innocent person should be forced to endure.

4.8 Future Prospects

A step ahead in big data analytics would be the complete automation of data mining technologies. A combination of cognitive computing and big data would truly transform the present potential of data mining applications, as it would supersede the need for human supervision and speed up the process of inferring decisions and predictions from large datasets [Uddin 2013]. Cognitive computing mirrors the human thought process by building systems that use pattern recognition, data mining, and natural language processing. Successful applications of cognitive computing include IBM’s Watson, Google’s DeepMind, and Qualcomm’s Zeroth Platform [Delgado 2015]. Machine learning is now being deemed the next generation of data mining capabilities to enable cognitive computing.

5. CONCLUSION

Data mining has proven to have immense applications in marketing, education, and scientific research. There are a number of tools that are able to apply mining techniques; however, SAS Enterprise Miner has established and maintained its dominance in the industry over the past decade because of its robust statistical capabilities. In the two case studies of SAS Enterprise Miner, the mining tool was able to successfully elicit patterns of visitor behavior based on usage of services at the radio station website, and it also found the most crucial indicator of a student’s inclination to enroll at the institution in the second dataset.

SAS Enterprise Miner is a user-friendly platform that can be adjusted to an organization’s or project’s requirements with ease and therefore has vast applications at both the commercial and institutional level. Thus, this platform is a viable tool for incorporation into curricula for educating the big data generation.

REFERENCES

[1] Martha Abell. 2014. First Steps in DATA MINING with SAS ENTERPRISE MINER. CreateSpace Independent Publishing Platform, USA.

[2] Abdullah M. Al Ghoson. 2010. Decision Tree Induction & Clustering Techniques In SAS Enterprise Miner, SPSS Clementine, And IBM Intelligent Miner – A Comparative Analysis. Intl J Manag & Inf Syst 14, 57-70.

[3] Russell K. Anderson. 2012. Prediction Algorithms for Data Mining. John Wiley & Sons, Ltd, Chichester, UK.

[4] Nikolay Archak, Anindya Ghose and Panagiotis G Ipeirotis. 2011. Deriving the Pricing Power of Product Features by Mining Consumer Reviews. Management Science 57, 8, 1485-1509.

[5] Bettina Berendt. 2012. More than modelling and hiding: towards a comprehensive view of Web mining and privacy. Data Min Knowl Disc 24, 697-737.

[6] Ryan Broderick and Emanuella Grinberg. 2013. 10 ways you give up data without knowing it. Retrieved May 15, 2015 from http://edition.cnn.com/2013/06/13/living/buzzfeed-data-mining/

[7] Roger H. L. Chiang and Veda C. Storey. 2012. Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly 36, 4, 1165-1188.

[8] Rick Delgado. 2015. Cognitive Computing: Solving the Big Data Problem? Retrieved July 1, 2015 from http://www.kdnuggets.com/2015/06/cognitive-computing-solving-big-data-problem.html

[9] Charles Duhigg. 2012. How companies learn your secrets. Retrieved May 20, 2015 from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0

[10] Wei Fan and Albert Bifet. 2013. Mining big data: current status, and forecast to the future. SIGKDD Explor. Newsl. 14, 2, 1-5.

[11] Md Rashid Farooqi and Khalid Raza. 2012. A Comprehensive Study of CRM through Data Mining Techniques. In Proc of the Nat Conf (NCCIST 2011). New Delhi.

[12] Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth. 1996. Knowledge discovery and data mining: Towards a unifying framework. In Proc of the 2nd ACM int conf know disc and data mining (KDD). Portland, OR, 82-88.

[13] Gartner. 2015. Magic Quadrant for Advanced Analytics Platforms. http://www.gartner.com/technology/reprints.do?id=1-2A881DN&ct=150219&st=sb

[14] Google. 2015. How search works, from algorithms to answers. Retrieved April 28, 2015 from http://www.google.com/insidesearch/howsearchworks/thestory/

[15] Glenn Greenwald. 2013. NSA Prism program taps in to user data of Apple, Google and others. In The Guardian.

[16] Robert Grossman. 2001. Data mining for scientific and engineering applications. Kluwer Academic, Boston, Mass.

[17] Constance L. Hays. 2004. What Wal-Mart Knows About Customers' Habits. Retrieved April 15, 2015 from http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html

[18] Kashmir Hill. 2012. How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did. Retrieved May 20, 2015 from http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

[19] John Koetsier. 2013. How Google searches 30 trillion web pages, 100 billion times a month. Retrieved June 3, 2015 from http://venturebeat.com/2013/03/01/how-google-searches-30-trillion-web-pages-100-billion-times-a-month/

[20] Hans-Peter Kriegel, Karsten M Borgwardt, Peer Kroger, Alexey Pryakhin, Matthias Schubert and Arthur Zimek. 2007. Future trends in data mining. Data Min Knowl Disc 15, 1, 87-97.

[21] Varun Kumar and Anupama Chadha. 2011. An Empirical Study of the Applications of Data Mining Techniques in Higher Education. Intl J of Adv Comp Sci and Appl 2, 3, 80-84.

[22] Taiyeong Lee, Ruiwen Zhang, Xiangxiang Meng and Laura Ryan. 2013. Incremental Response Modeling Using SAS® Enterprise Miner. In Proceedings of the SAS Global Forum (2013). SAS Institute Inc.

[23] David Lyon. 2014. Surveillance, Snowden, and Big Data: Capacities, consequences, critique. Big Data & Society (July-Sep), 1-13.

[24] Randall Matignon and SAS Institute. 2007. Data mining using SAS Enterprise miner. Wiley-Interscience, Hoboken, N.J.

[25] Ralf Mikut and Markus Reischl. 2011. Data mining tools. Wiley Interdiscip Rev 1, 5, 431-443.

[26] Katherine Noyes. 2014. Educating the "Big Data" Generations. http://fortune.com/2014/05/27/educating-the-big-data-generation/

[27] Chanmin Park and Taehyung Wang. 2013. Big

Data and NSA Surveillance -- Survey of Technology and Legal Issues. IEEE Int Symp on Multimedia (ISM), 516-517.

[28] Gregory Piatetsky. 2015. R leads RapidMiner,

Python catches up, Big Data tools grow, Spark ignites. Retrieved June 30, 2015 from http://www.kdnuggets.com/2015/05/poll-r-rapidminer-python-big-data-spark.html

[29] Jules Polonetsky and Omer Tene. 2013. Privacy

and big data, making ends meet. The Stanford Law Review 66, 25.

[30] Foster Provost and Tom Fawcett. 2013. Data

Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data 1, 51-59.

[31] Jack Schofield. 2004. Casino Rewards Total

Loyalty. Retrieved April 25, 2015 from http://www.theguardian.com/technology/2004/jan/15/onlinesupplement

[32] Amol Sharma. 2013. Amazon Mines Its Data

Trove to Bet on TV's Next Hit. Retrieved May 15, 2015 from http://www.wsj.com/articles/SB10001424052702304200804579163861637839706

[33] Bhoj Raj Sharmaa, Daljeet Kaura and Manju.

2013. A Review on Data Mining: Its Challenges, Issues and Applications. Intl J of Curr Eng and Tech 3, 2, 695-700.

[34] Eric Siegel. 2013. Predictive Analytics: The

Power to Predict Who Will Click, Buy, Lie, or Die. John Wiley & Sons, Hoboken, New Jersey.

[35] Matthew B Stannard. 2006. U.S. PHONE-CALL

DATABASE IGNITES PRIVACY UPROAR / DATA MINING. Retrieved May 15, 2015 from http://www.sfgate.com/news/article/U-S-PHONE-CALL-DATABASE-IGNITES-PRIVACY-UPROAR-2535457.php

[36] Emily Steel and April Dembosky. 2012. Facebook

raises fears with ad tracking. Retrieved May 17, 2015 from

http://edition.cnn.com/2012/09/23/business/facebook-datalogix/

[37] Niaz Uddin. 2013. James Kobielus: Big data, cognitive computing and future of product. Retrieved July 2, 2015 from http://etalks.me/james-kobielus-big-data-cognitive-computing-and-future-of-product/

[38] Dan Vesset, Brian Mcdonough, David Schubmehl and Mark Wardely. 2013. IDC Business Analytics Software 2013–2017 Forecast and 2012 Vendor Shares.

[39] Dan Vesset, Brian Mcdonough, David Schubmehl, Alys Woodward, Mary Wardley and Carl W Olofson. 2014. Worldwide Business Analytics Software 2014–2018 Forecast and 2013 Vendor Shares.

[40] Ke Wang, Senqiang Zhou, Qiang Yang and Jack Man Shun Yeung. 2005. Mining Customer Value: From Association Rules to Direct Marketing. Data Min Knowl Disc 11, 57-79.

[41] Richard Thomas Watson. 2013. Data management: databases and organizations. John Wiley & Sons, New York.

[42] Xindong Wu, Xingquan Zhu, Gong-Qing Wu and Wei Ding. 2015. Data Mining With Big Data. IEEE Trans. Knowledge and Data Eng 26, 1, 97-107.

[43] Nong Ye. 2013. Data Mining: Theories, Algorithms, and Examples. CRC Press, Hoboken.

[44] Illhoi Yoo, Patricia Alafaireet, Miroslav Marinov, Keila Pena-Hernandez, Rajhita Gopidi, Jia-Fu Chang and Lei Hua. 2012. Data mining in healthcare and biomedicine: a survey of the literature. J Med Syst 36, 4, 2431-2448.

[45] Xiaohui Yu, Yang Liu, Xiangji Huang and Aijun An. 2012. Mining Online Reviews for Predicting Sales Performance: A Case Study in the Movie Domain. IEEE Trans. Knowledge and Data Eng 24, 4, 720-734.

APPENDICES

Website

RULE1 WEBSITE & EXTREF ==> ARCHIVE

RULE2 ARCHIVE ==> WEBSITE & EXTREF

RULE3 EXTREF ==> ARCHIVE

RULE4 ARCHIVE ==> EXTREF

RULE5 EXTREF ==> WEBSITE & ARCHIVE

RULE6 WEBSITE & ARCHIVE ==> EXTREF

RULE7 WEBSITE & SIMULCAST ==> PODCAST & MUSICSTREAM

RULE8 PODCAST & MUSICSTREAM ==> WEBSITE & SIMULCAST

RULE9 SIMULCAST & PODCAST ==> WEBSITE & MUSICSTREAM

RULE10 WEBSITE & MUSICSTREAM ==> SIMULCAST & PODCAST

RULE11 NEWS & MUSICSTREAM ==> SIMULCAST

RULE12 WEBSITE & NEWS ==> SIMULCAST

RULE13 WEBSITE & PODCAST & MUSICSTREAM ==> SIMULCAST

RULE14 SIMULCAST & MUSICSTREAM ==> NEWS

RULE15 NEWS ==> SIMULCAST & MUSICSTREAM

RULE16 PODCAST & MUSICSTREAM ==> SIMULCAST

RULE17 WEBSITE & SIMULCAST & PODCAST ==> MUSICSTREAM

RULE18 SIMULCAST & PODCAST ==> MUSICSTREAM

RULE19 WEBSITE & NEWS ==> MUSICSTREAM

RULE20 SIMULCAST & NEWS ==> MUSICSTREAM

RULE21 MUSICSTREAM ==> WEBSITE & SIMULCAST

RULE22 WEBSITE & SIMULCAST ==> MUSICSTREAM

RULE23 SIMULCAST ==> NEWS

RULE24 NEWS ==> SIMULCAST

RULE25 SIMULCAST ==> WEBSITE & MUSICSTREAM

RULE26 WEBSITE & MUSICSTREAM ==> SIMULCAST

RULE27 SIMULCAST ==> MUSICSTREAM

RULE28 MUSICSTREAM ==> SIMULCAST

RULE29 WEBSITE & SIMULCAST ==> ARCHIVE

RULE30 ARCHIVE ==> WEBSITE & SIMULCAST

RULE31 WEBSITE & SIMULCAST ==> NEWS

RULE32 WEBSITE & MUSICSTREAM ==> ARCHIVE

RULE33 ARCHIVE ==> WEBSITE & MUSICSTREAM

RULE34 LIVESTREAM ==> WEBSITE

RULE35 SIMULCAST & ARCHIVE ==> WEBSITE

RULE36 MUSICSTREAM & ARCHIVE ==> WEBSITE

RULE37 PODCAST & ARCHIVE ==> WEBSITE

RULE38 NEWS ==> MUSICSTREAM

RULE39 MUSICSTREAM ==> NEWS

RULE40 WEBSITE ==> ARCHIVE

RULE41 ARCHIVE ==> WEBSITE

RULE42 WEBSITE & MUSICSTREAM ==> NEWS

RULE43 SIMULCAST & PODCAST & MUSICSTREAM ==> WEBSITE

RULE44 EXTREF & ARCHIVE ==> WEBSITE

RULE45 EXTREF ==> WEBSITE

RULE46 SIMULCAST & MUSICSTREAM ==> WEBSITE & PODCAST

RULE47 PODCAST & MUSICSTREAM ==> WEBSITE

RULE48 SIMULCAST & PODCAST ==> WEBSITE

RULE49 PODCAST & NEWS ==> WEBSITE

RULE50 WEBSITE & ARCHIVE ==> MUSICSTREAM

RULE51 WEBSITE & ARCHIVE ==> SIMULCAST

RULE52 ARCHIVE ==> MUSICSTREAM

RULE53 ARCHIVE ==> SIMULCAST

RULE54 ARCHIVE ==> WEBSITE & PODCAST

RULE55 WEBSITE & NEWS ==> PODCAST

RULE56 WEBSITE & SIMULCAST & MUSICSTREAM ==> PODCAST

RULE57 SIMULCAST & MUSICSTREAM ==> WEBSITE

RULE58 SIMULCAST ==> WEBSITE & PODCAST

RULE59 MUSICSTREAM ==> WEBSITE & PODCAST

RULE60 MUSICSTREAM ==> WEBSITE

RULE61 SIMULCAST ==> WEBSITE

RULE62 NEWS & MUSICSTREAM ==> WEBSITE

RULE63 WEBSITE & SIMULCAST ==> PODCAST

RULE64 WEBSITE & MUSICSTREAM ==> PODCAST

RULE65 WEBSITE ==> PODCAST

RULE66 PODCAST ==> WEBSITE

RULE67 SIMULCAST & MUSICSTREAM ==> PODCAST

RULE68 SIMULCAST & NEWS ==> WEBSITE

RULE69 SIMULCAST ==> PODCAST

RULE70 WEBSITE & ARCHIVE ==> PODCAST

RULE71 ARCHIVE ==> PODCAST

RULE72 MUSICSTREAM ==> PODCAST

RULE73 NEWS ==> WEBSITE

RULE74 NEWS ==> PODCAST
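
Each rule listed above has the form "antecedent ==> consequent", with items joined by "&". As a minimal sketch of how such a rule can be read and evaluated, the Python below parses one rule line and computes the standard support, confidence, and lift measures. The function names (parse_rule, rule_metrics) and the five visitor sessions are hypothetical and serve only as illustration; they are not taken from the SAS data set or from Enterprise Miner itself.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def parse_rule(line: str) -> Tuple[str, FrozenSet[str], FrozenSet[str]]:
    """Split a line such as 'RULE1 WEBSITE & EXTREF ==> ARCHIVE'
    into its label, antecedent item set, and consequent item set."""
    label, _, body = line.strip().partition(" ")
    left, right = body.split("==>")
    antecedent = frozenset(item.strip() for item in left.split("&"))
    consequent = frozenset(item.strip() for item in right.split("&"))
    return label, antecedent, consequent


def rule_metrics(antecedent: FrozenSet[str],
                 consequent: FrozenSet[str],
                 transactions: List[Set[str]]) -> Dict[str, float]:
    """Compute support, confidence, and lift of antecedent ==> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n                       # P(antecedent and consequent)
    confidence = n_both / n_ante if n_ante else 0.0  # P(consequent | antecedent)
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical visitor sessions, invented for illustration only.
sessions: List[Set[str]] = [
    {"WEBSITE", "EXTREF", "ARCHIVE"},
    {"WEBSITE", "ARCHIVE"},
    {"WEBSITE", "PODCAST"},
    {"EXTREF", "ARCHIVE", "WEBSITE"},
    {"NEWS", "SIMULCAST"},
]

label, antecedent, consequent = parse_rule("RULE1 WEBSITE & EXTREF ==> ARCHIVE")
metrics = rule_metrics(antecedent, consequent, sessions)
print(label, metrics)
```

A lift above 1 indicates that the consequent occurs more often in sessions containing the antecedent than in sessions overall, which is the usual reason a rule is considered interesting.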

Enrollment management data

Model diagram

Fit statistics

Fit Statistics
Model selection based on _VAPROF_

Selected  Model   Valid: Average     Valid: Average  Valid: Misclassification  Valid: Roc  Valid: Kolmogorov-
Model     Node    Profit for Enroll  Squared Error   Rate                      Index       Smirnov Statistic

Y         Neural  1.88576            0.036167        0.07551                   0.98120     0.88705
          Reg2    1.86236            0.041097        0.07736                   0.97691     0.86352
          Tree    1.88127            0.040314        0.12497                   0.96481     0.88220
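
The fit statistics above flag the neural network with "Y" as the selected model. A minimal sketch of that selection step follows, assuming that _VAPROF_ corresponds to the "Valid: Average Profit for Enroll" column and that the model with the highest value is chosen. The dictionary keys are hypothetical names introduced here; the numbers are transcribed from the fit statistics.

```python
# Validation fit statistics transcribed from the SAS Enterprise Miner output.
# Key names are hypothetical labels chosen for this sketch.
fit_stats = [
    {"model": "Neural", "avg_profit": 1.88576, "avg_sq_error": 0.036167,
     "misclassification": 0.07551, "roc_index": 0.98120, "ks": 0.88705},
    {"model": "Reg2", "avg_profit": 1.86236, "avg_sq_error": 0.041097,
     "misclassification": 0.07736, "roc_index": 0.97691, "ks": 0.86352},
    {"model": "Tree", "avg_profit": 1.88127, "avg_sq_error": 0.040314,
     "misclassification": 0.12497, "roc_index": 0.96481, "ks": 0.88220},
]

# Assumption: selection picks the highest validation average profit.
best = max(fit_stats, key=lambda row: row["avg_profit"])

# Rank all candidates by the same criterion, best first.
ranking = [row["model"] for row in
           sorted(fit_stats, key=lambda row: row["avg_profit"], reverse=True)]

print("Selected model:", best["model"])
print("Ranking by validation average profit:", ranking)
```

Under this criterion the neural network ranks first, the decision tree second, and the regression model third, which agrees with the "Y" flag in the output. Sorting on a different key, such as "misclassification" in ascending order, gives a different ordering of the remaining models.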