Weka Tutorial


Contents

I Data Mining

1. INTRODUCTION
   1.1 Overview
   1.2 The Scope of Data Mining
   1.3 How Data Mining Works
   1.4 An Architecture for Data Mining
   1.5 Data Mining Techniques
   1.6 What are the issues in Data Mining?

II Weka Mining Tool

2. LAUNCHING WEKA

3. THE WEKA EXPLORER
   3.1 Section Tabs
   3.2 Status Box
   3.3 Log Button
   3.4 Graphical Output

4. PREPROCESSING
   4.1 Loading Data
   4.2 The Current Relation
   4.3 Working with Filters

5. CLASSIFICATION
   5.1 Selecting a Classifier
   5.2 Test Options
   5.3 The Class Attribute
   5.4 Training a Classifier
   5.5 The Classifier Output Text

6. CLUSTERING
   6.1 Selecting a Clusterer
   6.2 Cluster Modes
   6.3 Ignoring Attributes
   6.4 Working with Filters
   6.5 Learning Clusters

7. ASSOCIATING
   7.1 Setting Up
   7.2 Learning Associations

8. SELECTING ATTRIBUTES
   8.1 Searching and Evaluating
   8.2 Options
   8.3 Performing Selection

9. VISUALIZING
   9.1 The Scatter Plot Matrix
   9.2 Selecting an Individual 2D Scatter Plot
   9.3 Selecting Instances

III Implementation

10. Regression (Practical)
11. Classification (Practical)
12. Clustering (Practical)
13. Association (Practical)

References


Part I Data Mining


CHAPTER 1: INTRODUCTION

An Introduction to Data Mining

Discovering hidden value in your data warehouse

1.1 Overview

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today’s business environment, and a basic description shows how data warehouse architectures can evolve to deliver the value of data mining to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms


Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user’s point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

 

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining.

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.

1.2 The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information in a large database — for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

1.3 How Data Mining Works

How exactly is data mining able to tell you important things that you didn't know, or to predict what is going to happen next? The technique used to perform these feats is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda, that there are certain characteristics to the ocean currents, and that certain routes were likely taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics common to the locations of these sunken treasures. With this model in hand you sail off looking for treasure where your model indicates it most likely might be, given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known, and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. Once the model is built it can then be used in similar situations where you don't know the answer. For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population, just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desire, and of course you have the opportunity to do much better than random: you could use the business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.

1.4 An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.  


[Figure 1: Integrated Data Mining Architecture]

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business – summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users’ business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.

1.5 Data Mining Techniques

Linear regression:

In statistics, prediction is usually synonymous with regression of some form. There are a variety of different types of regression in statistics, but the basic idea is that a model is created that maps values from predictors in such a way that the lowest error occurs in making a prediction. The simplest form of regression, simple linear regression, contains just one predictor and a prediction. The relationship between the two can be mapped on a two-dimensional space and the records plotted, with the prediction values along the Y axis and the predictor values along the X axis. The simple linear regression model is then the line that minimizes the error between the actual prediction value and the point on the line (the prediction from the model). Graphically this would look as it does in Figure 1.3. Of the many possible lines that could be drawn through the data, the one that minimizes the distance between the line and the data points is the one chosen for the predictive model.

On average, if you guess the value on the line, it should represent an acceptable compromise among all the data at that point giving conflicting answers. Likewise, if there is no data available for a particular input value, the line will provide the best guess at a reasonable answer based on similar data.
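Stated symbolically (standard least-squares notation, not notation from this tutorial), simple linear regression fits

    \hat{y} = \beta_0 + \beta_1 x,
    \qquad
    (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

that is, the intercept and slope are chosen to minimize the summed squared vertical distances between the n observed points (x_i, y_i) and the line.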

Characterization:

Data characterization is a summarization of general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction. For example, one may want to characterize the OurVideoStore customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. Note that with a data cube containing a summarization of the data, simple OLAP operations fit the purpose of data characterization.

Discrimination:

Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental account shows fewer than 5. The techniques used for data discrimination are very similar to those used for data characterization, with the exception that data discrimination results include comparative measures.

Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the OurVideoStore manager to know what movies are often rented together, or whether there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form P ⇒ Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule RentType(X, "game") ∧ Age(X, "13-19") ⇒ Buys(X, "pop") [s=2%, c=55%] would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying a pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.
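In this notation, the two thresholds can be written out as follows (these are the standard definitions, stated here for clarity rather than taken from the tutorial):

    s(P \Rightarrow Q) = \Pr(P \wedge Q),
    \qquad
    c(P \Rightarrow Q) = \Pr(Q \mid P) = \frac{\Pr(P \wedge Q)}{\Pr(P)}

For the example rule above, s = 2% is the fraction of all transactions containing game rentals by 13-19-year-olds together with pop purchases, and c = 55% is the fraction of game-renting teenage transactions that also include pop.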

Classification: Classification analysis is the organization of data into given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, and the model is used to classify new objects. For example, after starting a credit policy, the OurVideoStore managers could analyze the customers’ behaviour with respect to their credit, and label accordingly the customers who received credit with three possible labels: “safe”, “risky” and “very risky”. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.

Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.

Clustering: Similar to classification, clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).

Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.

Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data. Evolution analysis models evolutionary trends in data, which makes it possible to characterize, compare, classify or cluster time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

It is common that users do not have a clear idea of the kind of patterns they can discover or need to discover from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.

1.6 What are the issues in Data Mining?

Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still pending issues have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.

Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behaviour understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

User interface issues: The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective data graphical presentation. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are “screen real-estate”, information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.


Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users’ needs differently. Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information. More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space, and usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This “curse” affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.

Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available without having to re-analyze the complete dataset.

Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data since we already have more data than we can handle, and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data at the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types: we are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources; different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

Profitable Applications

A wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage their customer relationships. Two critical factors for success with data mining are: a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on).

Some successful application areas include:

A pharmaceutical company can analyze its recent sales force activity and their results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.

A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.

A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database such as those provided by Dun & Bradstreet can yield a prioritized list of prospects by region.


A large consumer package goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach their target customer segments.

Each of these examples has clear common ground. They leverage the knowledge about customers implicit in a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects, and design targeted marketing strategies to best reach them.


Part II Weka Mining Tool


CHAPTER 2: LAUNCHING WEKA

The new menu-driven GUI in WEKA (class weka.gui.Main) succeeds the old GUI Chooser (class weka.gui.GUIChooser). Its MDI (“multiple document interface”) appearance makes it easier to keep track of all the open windows. If one prefers an SDI (“single document interface”) layout, one can invoke this with the option -gui sdi on the command line.

The buttons can be used to start the following applications:

• Explorer An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).

• Experimenter An environment for performing experiments and conducting statistical tests between learning schemes.

• KnowledgeFlow This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.

• SimpleCLI Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.

The menu consists of the following sections:

1. Program
• LogWindow Opens a log window that captures all that is printed to stdout or stderr. Useful for environments like MS Windows, where WEKA is normally not started from a terminal.


• Exit Closes WEKA.

2. Visualization Ways of visualizing data with WEKA.
• Plot For plotting a 2D plot of a dataset.
• ROC Displays a previously saved ROC curve.
• TreeVisualizer For displaying directed graphs, e.g., a decision tree.
• GraphVisualizer Visualizes XML BIF or DOT format graphs, e.g., for Bayesian networks.
• BoundaryVisualizer Allows the visualization of classifier decision boundaries in two dimensions.

3. Tools Other useful applications.
• ArffViewer An MDI application for viewing ARFF files in spreadsheet format.
• SqlViewer Represents an SQL worksheet, for querying databases via JDBC.

4. Help Online resources for WEKA can be found here.
• Weka homepage Opens a browser window with WEKA’s homepage.
• Online documentation Directs to the WekaDoc Wiki [4].
• HOWTOs, code snippets, etc. The general WekaWiki [3], containing lots of examples and HOWTOs around the development and use of WEKA.
• Weka on Sourceforge WEKA’s project homepage on Sourceforge.net.
• SystemInfo Lists some internals about the Java/WEKA environment, e.g., the CLASSPATH.
• About The infamous “About” box.

ARFF file:

Attribute-Relation File Format (ARFF) is the text file format used by Weka to store datasets. This kind of file is structured as follows (the "weather" relational database):

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes

The ARFF file contains two sections: the header and the data section. The first line of the header gives the relation name. Then there is the list of the attributes (@attribute...). Each attribute is associated with a unique name and a type. The latter describes the kind of data contained in the variable and what values it can have. The variable types are: numeric, nominal, string and date. The class attribute is by default the last one in the list. In the header section there can also be some comment lines, identified by a '%' at the beginning, which can describe the database content or give the reader information about the author. After that there is the data itself (@data); each line stores the attribute values of a single entry, separated by commas.
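For readers who prefer code to the GUI, here is a minimal sketch of loading such a file with Weka's Java API. It assumes the listing above has been saved as weather.arff in the working directory:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadArff {
        public static void main(String[] args) throws Exception {
            // DataSource infers the file format from the extension.
            Instances data = DataSource.read("weather.arff");
            // By convention the class attribute is the last one.
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println("Relation:   " + data.relationName());
            System.out.println("Instances:  " + data.numInstances());
            System.out.println("Attributes: " + data.numAttributes());
        }
    }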


CHAPTER 3: THE WEKA EXPLORER

3.1 Section Tabs

At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are greyed out. This is because it is necessary to open (and potentially pre-process) a data set before starting to explore the data. The tabs are as follows:

1. Preprocess. Choose and modify the data being acted on.

2. Classify. Train and test learning schemes that classify or perform regression.

3. Cluster. Learn clusters for the data.

4. Associate. Learn association rules for the data.

5. Select attributes. Select the most relevant attributes in the data.

6. Visualize. View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the Weka bird) stays visible regardless of which section you are in.

The Explorer can be easily extended with custom tabs. The Wiki article “Adding tabs in the Explorer” [7] explains this in detail.

3.2 Status Box

The status box appears at the very bottom of the window. It displays messages that keep you informed about what’s going on. For example, if the Explorer is busy loading a file, the status box will say that.

TIP: right-clicking the mouse anywhere inside the status box brings up a little menu. The menu gives two options:

1. Memory information. Display in the log box the amount of memory available to WEKA.

2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer needed and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running as a background task anyway.

3.3 Log Button

Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a record of what has happened. For people using the command line or the SimpleCLI, the log now also contains the full setup strings for classification, clustering, attribute selection, etc., so that it is possible to copy/paste them elsewhere. Options for dataset(s) and, if applicable, the class attribute still have to be provided by the user (e.g., -t for classifiers or -i and -o for filters).

3.4 Graphical Output

Most graphical displays in WEKA, e.g., the GraphVisualizer or the TreeVisualizer, support saving the output to a file. A dialog for saving the output can be brought up with Alt+Shift+left-click. Supported formats are currently Windows Bitmap, JPEG, PNG and EPS (encapsulated PostScript). The dialog also allows you to specify the dimensions of the generated image.


CHAPTER 4: PREPROCESSING

4.1 Loading Data

The first four buttons at the top of the preprocess section enable you to load data into WEKA:

1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/DatabaseUtils.props.)
4. Generate.... Enables you to generate artificial data from a variety of DataGenerators.

Using the Open file... button you can read files in a variety of formats: WEKA’s ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.

NB: This list of formats can be extended by adding custom file converters to the weka.core.converters package.

4.2 The Current Relation

Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation box (the “current relation” is the currently loaded data, which can be interpreted as a single relational table in database terminology) has three entries:

1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.
2. Instances. The number of instances (data points/records) in the data.
3. Attributes. The number of attributes (features) in the data.

4.3 Working with Filters

The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose button. By clicking this button it is possible to select one of the filters in WEKA. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box with the left mouse button brings up a GenericObjectEditor dialog box. A click with the right mouse button (or Alt+Shift+left click) brings up a menu where you can choose either to display the properties in a GenericObjectEditor dialog box, or to copy the current setup string to the clipboard.
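The same filtering can also be applied programmatically. A minimal sketch, using the Remove filter to drop the first attribute purely as an example (weather.arff as in Chapter 2):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class FilterSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            Remove remove = new Remove();
            remove.setAttributeIndices("1");  // drop the first attribute
            remove.setInputFormat(data);      // must be set before filtering
            Instances filtered = Filter.useFilter(data, remove);
            System.out.println(filtered.toSummaryString());
        }
    }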


CHAPTER 5: CLASSIFICATION

5.1 Selecting a Classifier

At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier, and its options. Clicking on the text box with the left mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right click (or Alt+Shift+left click) you can once again copy the setup string to the clipboard or display the properties in a GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.

5.2 Test Options

The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:

1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

Note: No matter which evaluation method is used, the model that is output is always the one built from all the training data. Further testing options can be set by clicking on the More options... button:

1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option is also selected by default.
3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier’s predictions is included in the output. This option is selected by default.
5. Store predictions for visualization. The classifier’s predictions are remembered so that they can be visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data!
7. Output additional attributes. If additional attributes need to be output alongside the predictions, e.g., an ID attribute for tracking misclassifications, then the index of this attribute can be specified here. The usual Weka ranges are supported; “first” and “last” are therefore valid indices as well (example: “first-3,6,8,12-last”).
8. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.
9. Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.
10. Preserve order for % Split. This suppresses the randomization of the data before splitting into train and test set.
11. Output source code. If the classifier can output the built model as Java source code, you can specify the class name here. The code will be printed in the “Classifier output” area.
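For reference, the Explorer's Cross-validation test mode corresponds roughly to the following minimal sketch in Weka's Java API. J48 is used purely as an example classifier, and the dataset is assumed to have at least as many instances as folds:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);
            J48 tree = new J48();               // a decision tree learner
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation with a fixed random seed.
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString()); // confusion matrix
        }
    }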

5.3 The Class Attribute

The classifiers in WEKA are designed to be trained to predict a single ‘class’ attribute, which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both.

By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

5.4 Training a Classifier

Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button.

When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below; but first we investigate the text that has been output.

5.5 The Classifier Output Text

The text in the Classifier output area has scroll bars allowing you to browse the results. Clicking with the left mouse button into the text area, while holding Alt and Shift, brings up a dialog that enables you to save the displayed output in a variety of formats (currently, BMP, EPS, JPEG and PNG). Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:

1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.
2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.
3. The results of the chosen test mode are broken down thus:
4. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
5. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier’s prediction accuracy.
6. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.
7. Source code (optional). This section lists the Java source code if one chose “Output source code” in the “More options” dialog.


CHAPTER 6 : CLUSTERING

6.1 Selecting a Clusterer

By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

6.2 Cluster Modes

The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split (Section 5.2), except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem, it may be helpful to disable this option.

6.3 Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

6.4 Working with Filters

The FilteredClusterer meta-clusterer offers the user the possibility to apply filters directly before the clusterer is learned. This approach eliminates the manual application of a filter in the Preprocess panel, since the data gets processed on the fly. It is useful if one needs to try out different filter setups.

6.5 Learning Clusters

The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and Visualize tree. The latter is grayed out when it is not applicable.
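A minimal code sketch of the same workflow, with SimpleKMeans chosen as an arbitrary example scheme; the class attribute is removed first, mirroring the Ignore attributes step in Section 6.3:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            // Drop the class attribute: clusterers work on unlabeled data.
            Remove remove = new Remove();
            remove.setAttributeIndices("last");
            remove.setInputFormat(data);
            Instances unlabeled = Filter.useFilter(data, remove);

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(2);      // k = 2, an arbitrary choice
            kmeans.buildClusterer(unlabeled);
            System.out.println(kmeans);    // prints the cluster centroids
        }
    }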


CHAPTER 7 : Associating

7.1 Setting Up

This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.

7.2 Learning Associations

Once appropriate parameters for the association rule learner have been set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.
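Programmatically, the same run looks roughly like this. Apriori is WEKA's default associator; note that it requires nominal attributes, so this sketch assumes an all-nominal dataset file, here a hypothetical weather.nominal.arff:

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AssociateSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.nominal.arff");
            Apriori apriori = new Apriori();
            apriori.setNumRules(10);         // report the 10 best rules
            apriori.buildAssociations(data);
            System.out.println(apriori);     // rules with support/confidence
        }
    }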


CHAPTER 8 : Selecting Attributes

8.1 Searching and Evaluating

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.

8.2 Options

The Attribute Selection Mode box has two options:

1. Use full training set. The worth of the attribute subset is determined using the full set of training data.
2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data.

As with Classify (Section 5.3), there is a drop-down box that can be used to specify which attribute to treat as the class.

8.3 Performing Selection

Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the classify panel. It is also possible to Visualize reduced data, or if you have used an attribute transformer such as PrincipalComponents, Visualize transformed data. The reduced/transformed data can be saved to a file with the Save reduced data... or Save transformed data... option.
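As a rough programmatic equivalent, the following sketch pairs the CfsSubsetEval evaluator with a BestFirst search (both standard Weka classes; the choice of this particular pair is just an example):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SelectSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval()); // worth of a subset
            selector.setSearch(new BestFirst());        // search strategy
            selector.SelectAttributes(data);  // capital S: Weka's method name
            System.out.println(selector.toResultsString());
        }
    }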


CHAPTER 9 : Visualizing

WEKA’s visualization section allows you to visualize 2D plots of the current relation.

9.1 The Scatter Plot Matrix

When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, colour coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the attribute used to colour the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to subsample the data. Note that changes will only come into effect once the Update button has been pressed.


9.2 Selecting an Individual 2D Scatter Plot

When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualization of the scatter plot you selected. (We described above how to visualize particular results in a separate window, for example classifier errors; the same visualization controls are used here.)

Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is used for the y-axis.

Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up.

To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The ‘X’ and ‘Y’ written beside the strips show what the current axes are (‘B’ is used for ‘both X and Y’).

Above the attribute strips is a slider labelled Jitter, which adds a random displacement to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different to just a single lonely instance.

9.3 Selecting Instances

There may be situations where it is helpful to select a subset of the data using the visualization tool. (A special case of this is the UserClassifier in the Classify panel, which lets you build your own classifier by interactively selecting instances.)

Below the y-axis selector button is a drop-down list button for choosing a selection method. A group of data points can be selected in four ways:

1. Select Instance. Clicking on an individual data point brings up a window listing its attributes. If more than one point appears at the same location, more than one set of attributes is shown.

2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.

3. Polygon. You can build a free-form polygon that selects the points inside it. Left-click to add vertices to the polygon, right-click to complete it. The polygon will always be closed off by connecting the first point to the last.

4. Polyline. You can build a polyline that distinguishes the points on one side from those on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as opposed to a polygon, which is always closed).


Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area. Clicking the Clear button erases the selected area without affecting the graph.

Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous removals and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.


Part III Implementation


CHAPTER 10 : Regression

Regression is the easiest technique to use, but it is also probably the least powerful (funny how those two always go hand in hand). The model can be as simple as one input variable and one output variable (called a scatter diagram in Excel, or an XY Diagram in OpenOffice.org). Of course, it can get more complex than that, including dozens of input variables. In effect, all regression models fit the same general pattern: there are a number of independent variables which, when taken together, produce a result, the dependent variable. The regression model is then used to predict the value of an unknown dependent variable, given the values of the independent variables.

Everyone has probably used or seen a regression model before, perhaps even mentally created one. The example that immediately comes to mind is pricing a house. The price of the house (the dependent variable) is the result of many independent variables: the square footage of the house, the size of the lot, whether the kitchen has granite countertops, whether the bathrooms are upgraded, and so on. So, if you've ever bought or sold a house, you've likely created a regression model to price it. You created the model based on other comparable houses in the neighborhood and what they sold for (the model), then put the values of your own house into this model to produce an expected price.

Let's continue this example of a house price-based regression model, and create some real data to examine. These are actual numbers from houses for sale in my neighborhood, and I will be trying to find the value for my own house. (I'll also be taking the output from this model to protest my property-tax assessment).

Table 1. House values for regression model

House size (square feet)   Lot size   Bedrooms   Granite   Upgraded bathroom?   Selling price

3529                        9191      6          0         0                    $205,000
3247                       10061      5          1         1                    $224,900
4032                       10150      5          0         1                    $197,900
2397                       14156      4          1         0                    $189,900
2200                        9600      4          0         1                    $195,000
3536                       19994      6          1         1                    $325,000
2983                        9365      5          0         1                    $230,000
3198                        9669      5          1         1                    ????

The good news (or bad news, depending on your point of view) is that this little introduction to regression barely scratches the surface, and that scratch is barely even noticeable. There are entire college semester courses on regression models that will teach you more about them than you probably even want to know. But this scratch gets you acquainted with the concept and suffices for our WEKA tests in this article. If you have a continued interest in regression models and all the statistical details that go into them, research the following terms with your favorite search engine: least squares, homoscedasticity, normal distribution, White tests, Lilliefors tests, R-squared, and p-values.

Building the data set for WEKA

To load data into WEKA, we have to put it into a format it understands. WEKA's preferred method for loading data is the Attribute-Relation File Format (ARFF), in which you define the type of data being loaded, then supply the data itself. In the file, you define each column and what each column contains. In the case of the regression model, you are limited to NUMERIC or DATE columns. Finally, you supply each row of data in a comma-delimited format. The ARFF file we'll be using with WEKA appears below. Notice in the rows of data that we've left out my house. Since we are creating the model, we cannot include my house in it, because its selling price is unknown.

Listing 1. WEKA file format

@RELATION house

@ATTRIBUTE houseSize NUMERIC
@ATTRIBUTE lotSize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingPrice NUMERIC

@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000

Loading the data into WEKA

Now that the data file has been created, it's time to create our regression model. Start WEKA, then choose the Explorer. You'll be taken to the Explorer screen, with the Preprocess tab selected. Click the Open file button and select the ARFF file you created in the section above. After selecting the file, your WEKA Explorer should look similar to the screenshot in Figure 3.

Figure 3. WEKA with house data loaded

In this view, WEKA allows you to review the data you're working with. The left section of the Explorer window outlines all of the columns in your data (Attributes) and the number of rows of data supplied (Instances). Selecting a column causes the right section of the Explorer window to display information about the data in that column. For example, selecting the houseSize column in the left section (it should be selected by default) changes the right section to show additional statistical information about the column. It shows that the maximum value in the data set for this column is 4,032 square feet and the minimum is 2,200 square feet. The average size is 3,131 square feet, with a standard deviation of 655 square feet. (Standard deviation is a statistical measure of how spread out the values are.) Finally, there's a visual way of examining the data, which you can see by clicking the Visualize All button. Due to the limited number of rows in this data set, the visualization is not as powerful as it would be if there were more data points (in the hundreds, for example).
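Incidentally, this loading step does not have to happen in the GUI. Below is a minimal sketch using WEKA's Java API, assuming weka.jar is on the classpath and that the ARFF from Listing 1 was saved as house.arff (the file name is our assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadHouseData {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; DataSource picks the right converter from the extension
        Instances data = new DataSource("house.arff").getDataSet();
        System.out.println("Relation:   " + data.relationName());  // house
        System.out.println("Attributes: " + data.numAttributes()); // 6
        System.out.println("Instances:  " + data.numInstances());  // 7
    }
}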

Enough looking at the data. Let's create a model and get a price for my house.

Creating the regression model with WEKA

To create the model, click on the Classify tab. The first step is to select the model we want to build, so WEKA knows how to work with the data and how to create the appropriate model:

1. Click the Choose button, then expand the functions branch.
2. Select the LinearRegression leaf.


This tells WEKA that we want to build a regression model. As you can see from the other choices, though, there are lots of possible models to build. Lots! This should give you a good indication of how we are only touching the surface of this subject. Also of note: There is another choice called SimpleLinearRegression in the same branch. Do not choose this because simple regression only looks at one variable, and we have six. When you've selected the right model, your WEKA Explorer should look like Figure 4.

Figure 4. Linear regression model in WEKA

Now that the desired model has been chosen, we have to tell WEKA where the data is that it should use to build the model. Though it may be obvious to us that we want to use the data we supplied in the ARFF file, there are actually different options, some more advanced than what we'll be using. The other three choices are Supplied test set, where you can supply a separate set of data to test the model; Cross-validation, which lets WEKA train and test on subsets of the supplied data and average the results; and Percentage split, where WEKA uses one portion of the supplied data to build the model and the rest to test it. These other choices are useful with different models, as we'll see in future articles. With regression, we can simply choose Use training set. This tells WEKA that to build our desired model, we can simply use the data set we supplied in our ARFF file.

Finally, the last step to creating our model is to choose the dependent variable (the column we are looking to predict). We know this should be the selling price, since that's what we're trying to determine for my house. Right below the test options, there's a combo box that lets you choose the dependent variable. The column sellingPrice should be selected by default. If it's not, please select it.

Now we are ready to create our model. Click Start. Figure 5 shows what the output should look like.

Figure 5. House price regression model in WEKA

Interpreting the regression model

WEKA doesn't mess around. It puts the regression model right there in the output, as shown in Listing 2.

Listing 2. Regression output

sellingPrice = (-26.6882 * houseSize) + (7.0551 * lotSize) + (43166.0767 * bedrooms) + (42292.0901 * bathroom) - 21661.1208

Listing 3 shows the results, plugging in the values for my house.


Listing 3. House value using regression model

sellingPrice = (-26.6882 * 3198) + (7.0551 * 9669) + (43166.0767 * 5) + (42292.0901 * 1) - 21661.1208

sellingPrice = 219,328
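If you would rather script this whole workflow than click through the Explorer, the following sketch uses WEKA's Java API to build the model and predict my house's price. It assumes weka.jar on the classpath, the house.arff file from Listing 1, and WEKA 3.7 or later (for DenseInstance); the exact coefficients printed may vary slightly across versions.

import weka.classifiers.functions.LinearRegression;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictHousePrice {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("house.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // sellingPrice is the class

        LinearRegression model = new LinearRegression();
        model.buildClassifier(data); // equivalent to "Use training set"
        System.out.println(model);   // prints the regression equation, as in Listing 2

        // My house: 3198 sq ft, 9669 lot, 5 bedrooms, granite, upgraded bathroom
        Instance myHouse = new DenseInstance(data.numAttributes());
        myHouse.setDataset(data);
        myHouse.setValue(0, 3198); // houseSize
        myHouse.setValue(1, 9669); // lotSize
        myHouse.setValue(2, 5);    // bedrooms
        myHouse.setValue(3, 1);    // granite
        myHouse.setValue(4, 1);    // bathroom

        System.out.println("Predicted price: " + model.classifyInstance(myHouse));
    }
}

Run as written, this should print a prediction close to the $219,328 worked out in Listing 3.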


CHAPTER 11 : Classification (Pr.)

WEKA data set

The data set we'll use for our classification example will focus on our fictional BMW dealership. The dealership is starting a promotional campaign, whereby it is trying to push a two-year extended warranty to its past customers. The dealership has done this before and has gathered 4,500 data points from past sales of extended warranties. The attributes in the data set are:

1. Income bracket [0=$0-$30k, 1=$31k-$40k, 2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k, 7=$501k+]
2. Year/month first BMW bought
3. Year/month most recent BMW bought
4. Whether they responded to the extended warranty offer in the past

Let's take a look at the Attribute-Relation File Format (ARFF) we'll use in this example.

Listing 2. Classification WEKA data

@attribute IncomeBracket {0,1,2,3,4,5,6,7}
@attribute FirstPurchase numeric
@attribute LastPurchase numeric
@attribute responded {1,0}

@data
4,200210,200601,0
5,200301,200601,1
...

Classification in WEKA

Load the data file bmw-training.arff (see Download) into WEKA using the same steps we've used up to this point. Note: This file contains only 3,000 of the 4,500 records that the dealership has in its records. We need to divide up our records so that some data instances are used to create the model and some are used to test the model, to ensure we didn't overfit it. Your screen should look like Figure 1 after loading the data.

Figure 1. BMW data loaded into WEKA

Like we did with the regression model, we select the Classify tab, then we select the trees node, then the J48 leaf (I don't know why this is the official name, but go with it).

BMW classification

At this point, we are ready to create our model in WEKA. Ensure that Use training set is selected so we use the data set we just loaded to create our model. Click Start and let WEKA run. The output from this model should look like the results in Listing 3.

Listing 3. Output from WEKA's classification model

Number of Leaves : 28

Size of the tree : 43

Time taken to build model: 0.18 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        1774               59.1333 %
Incorrectly Classified Instances      1226               40.8667 %
Kappa statistic                          0.1807
Mean absolute error                      0.4773
Root mean squared error                  0.4885
Relative absolute error                 95.4768 %
Root relative squared error             97.7122 %
Total Number of Instances             3000

=== Detailed Accuracy By Class ===

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.662     0.481     0.587       0.662    0.622       0.616      1
                 0.519     0.338     0.597       0.519    0.555       0.616      0
Weighted Avg.    0.591     0.411     0.592       0.591    0.589       0.616

=== Confusion Matrix ===

    a    b   <-- classified as
 1009  516 |   a = 1
  710  765 |   b = 0

What do all these numbers mean? How do we know if this is a good model? Where is this so-called "tree" I'm supposed to be looking for? All good questions. Let's answer them one at a time:

What do all these numbers mean? The important numbers to focus on here are the "Correctly Classified Instances" (59.1 percent) and the "Incorrectly Classified Instances" (40.9 percent). Other important numbers are in the "ROC Area" column, in the first row (0.616); I'll explain this number later, but keep it in mind. Finally, the "Confusion Matrix" shows the breakdown of the errors: taking "1" (responded) as the positive class, the 516 are responders incorrectly classified as non-responders (false negatives), and the 710 are non-responders incorrectly classified as responders (false positives).
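As a quick check, the headline accuracy can be recomputed by hand from the confusion matrix, since the diagonal cells hold the correctly classified instances:

correct  = 1009 + 765 = 1774
total    = 1009 + 516 + 710 + 765 = 3000
accuracy = 1774 / 3000 = 0.5913, the 59.1333 percent reported in the summary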

How do we know if this is a good model? Well, based on our accuracy rate of only 59.1 percent, I'd have to say that upon initial analysis, this is not a very good model.


Where is this so-called tree? You can see the tree by right-clicking on the model you just created, in the result list. On the pop-up menu, select Visualize tree. You'll see the classification tree we just created, although in this example, the visual tree doesn't offer much help. Our tree is pictured in Figure 3. The other way to see the tree is to look higher in the Classifier Output, where the text output shows the entire tree, with nodes and leaves.

Figure 3. Classification tree visualization

There's one final step to validating our classification tree, which is to run our test set through the model and ensure that the accuracy of the model when evaluating the test set isn't too different from that on the training set. To do this, in Test options, select the Supplied test set radio button and click Set. Choose the file bmw-test.arff, which contains 1,500 records that were not in the training set we used to create the model. When we click Start this time, WEKA will run this test data set through the model we already created and let us know how the model did. The output is shown below.


Classification tree test

Comparing the "Correctly Classified Instances" from this test set (55.7 percent) with the "Correctly Classified Instances" from the training set (59.1 percent), we see that the accuracy of the model is pretty close, which indicates that the model will not break down with unknown data, or when future data is applied to it.
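For reference, the same train-then-test run can be reproduced through WEKA's Java API. This is only a sketch, assuming weka.jar on the classpath and the two ARFF files from the Download; your exact percentages may differ slightly by WEKA version.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BmwClassification {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("bmw-training.arff").getDataSet();
        Instances test  = new DataSource("bmw-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1); // "responded" is the class
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train); // build on the 3,000 training records

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);             // score the 1,500 held-out records
        System.out.println(eval.toSummaryString()); // accuracy, kappa, error rates
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}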

However, because the accuracy of the model is so bad, correctly classifying only about 60 percent of the data records, we could take a step back and say, "Wow. This model isn't very good at all. It's barely above 50 percent, which I could get just by randomly guessing values." That's entirely true. That takes us to an important point that I wanted to secretly and slyly get across to everyone: Sometimes applying a data mining algorithm to your data will produce a bad model. This is especially true here, and it was on purpose.

I wanted to take you through the steps to producing a classification tree model with data that seems to be ideal for a classification model. Yet, the results we get from WEKA indicate that we were wrong. A classification tree is not the model we should have chosen here. The model we created tells us absolutely nothing, and if we used it, we might make bad decisions and waste money.

Does that mean this data can't be mined? The answer is another important point to data mining: the nearest-neighbor model, discussed in a future article, will use this same data set, but will create a model that's over 88 percent accurate. It aims to drive home the point that you have to choose the right model for the right data to get good, meaningful information.


CHAPTER 12 : Clustering (Pr.)

Data set for WEKA

The data set we'll use for our clustering example will focus on our fictional BMW dealership again. The dealership has kept track of how people walk through the dealership and the showroom, what cars they look at, and how often they ultimately make purchases. They are hoping to mine this data by finding patterns in it and by using clusters to determine if certain behaviors emerge among their customers. There are 100 rows of data in this sample, and each column describes a step that customers may reach in their BMW experience, with a 1 meaning they made it to this step or looked at this car, and a 0 meaning they didn't. Listing 4 shows the ARFF data we'll be using with WEKA.

Listing 4. Clustering WEKA data

@attribute Dealership numeric
@attribute Showroom numeric
@attribute ComputerSearch numeric
@attribute M5 numeric
@attribute 3Series numeric
@attribute Z4 numeric
@attribute Financing numeric
@attribute Purchase numeric

@data
1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,0
...

Clustering in WEKA

Load the data file bmw-browsers.arff into WEKA using the same steps we used to load data into the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the columns, the attribute data, the distribution of the columns, etc. Your screen should look like Figure 5 after loading the data.

Figure 5. BMW cluster data in WEKA


With this data set, we are looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear (this will be our preferred method of clustering for this article). Your WEKA Explorer window should look like Figure 6 at this point.


Figure 6. BMW cluster algorithm

Finally, we want to adjust the attributes of our cluster algorithm by clicking SimpleKMeans (not the best UI design here, but go with it). The only attribute of the algorithm we are interested in adjusting here is the numClusters field, which tells us how many clusters we want to create. (Remember, you need to know this before you start.) Let's change the default value of 2 to 5 for now, but keep these steps in mind later if you want to adjust the number of clusters created. Your WEKA Explorer should look like Figure 7 at this point. Click OK to accept these values.
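For completeness, here is what the configuration we just clicked through looks like in WEKA's Java API. This is a minimal sketch, assuming weka.jar on the classpath and the bmw-browsers.arff file; setNumClusters(5) mirrors the numClusters field we just edited.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BmwClustering {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bmw-browsers.arff").getDataSet();

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(5);    // same as the numClusters field in the GUI
        kmeans.buildClusterer(data); // note: clustering uses no class attribute

        System.out.println(kmeans);  // centroid table, as in Listing 5

        // Which cluster did each customer land in?
        for (int i = 0; i < data.numInstances(); i++) {
            System.out.println("Row " + i + " -> cluster "
                + kmeans.clusterInstance(data.instance(i)));
        }
    }
}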


Figure 7. Cluster attributes

At this point, we are ready to run the clustering algorithm. Remember that 100 rows of data with five data clusters would likely take a few hours of computation with a spreadsheet, but WEKA can spit out the answer in less than a second. Your output should look like Listing 5.

Listing 5. Cluster output

                                      Cluster#
Attribute         Full Data        0         1         2         3         4
                      (100)     (26)      (27)       (5)      (14)      (28)
=============================================================================
Dealership              0.6    0.9615    0.6667         1    0.8571         0
Showroom               0.72    0.6923    0.6667         0    0.5714         1
ComputerSearch         0.43    0.6538         0         1    0.8571    0.3214
M5                     0.53    0.4615     0.963         1    0.7143         0
3Series                0.55    0.3846    0.4444       0.8    0.0714         1
Z4                     0.45    0.5385         0       0.8    0.5714    0.6786
Financing              0.61    0.4615    0.6296       0.8         1       0.5
Purchase               0.39         0    0.5185       0.4         1    0.3214

Clustered Instances

0      26 ( 26%)
1      27 ( 27%)
2       5 (  5%)
3      14 ( 14%)
4      28 ( 28%)

How do we interpret these results? The output tells us how each cluster comes together: each number is the average value for everyone in the cluster, so a "1" means everyone in that cluster has a value of one for that attribute, and a "0" means everyone has a value of zero. Each cluster shows us a type of behavior in our customers, from which we can begin to draw some conclusions:

Cluster 0 — This group we can call the "Dreamers," as they appear to wander around the dealership, looking at cars parked outside on the lots, but trail off when it comes to entering the dealership, and worst of all, they don't purchase anything.

Cluster 1 — We'll call this group the "M5 Lovers" because they tend to walk straight to the M5s, ignoring the 3-series cars and the Z4. However, they don't have a high purchase rate — only 52 percent. This is a potential problem and could be a focus for improvement for the dealership, perhaps by sending more salespeople to the M5 section.

Cluster 2 — This group is so small we can call them the "Throw-Aways" because they aren't statistically relevant, and we can't draw any good conclusions from their behavior. (This happens sometimes with clusters and may indicate that you should reduce the number of clusters you've created.)

Cluster 3 — This group we'll call the "BMW Babies" because they always end up purchasing a car and always end up financing it. Here's where the data shows us some interesting things: It appears they walk around the lot looking at cars, then turn to the computer search available at the dealership. Ultimately, they tend to buy M5s or Z4s (but never 3-series). This cluster tells the dealership that it should consider making its search computers more prominent around the lots (outdoor search computers?), and perhaps making the M5 or Z4 much more prominent in the search results. Once the customer has made up his mind to purchase the vehicle, he always qualifies for financing and completes the purchase.

Cluster 4 — This group we'll call the "Starting Out With BMW" because they always look at the 3-series and never look at the much more expensive M5. They walk right into the showroom, choosing not to walk around the lot and tend to ignore the computer search terminals. While 50 percent get to the financing stage, only 32 percent ultimately finish the transaction. The dealership could draw the conclusion that these customers looking to buy their first BMWs know exactly what kind of car they want (the 3-series entry-level model) and are hoping to qualify for financing to be able to afford it. The dealership could possibly increase sales to this group by relaxing their financing standards or by reducing the 3-series prices.

One other interesting way to examine the data in these clusters is to inspect it visually. To do this, right-click on the Result List section of the Cluster tab (again, not the best-designed UI). One of the options from this pop-up menu is Visualize Cluster Assignments. A window will pop up that lets you play with the results and see them visually. For this example, change the X axis to M5 (Num), the Y axis to Purchase (Num), and the Color to Cluster (Nom). This will show us in a chart how the clusters are grouped in terms of who looked at the M5 and who purchased one. Also, turn up the Jitter to about three-fourths of the way maxed out, which will artificially scatter the plot points to allow us to see them more easily.


Do the visual results match the conclusions we drew from the results in Listing 5? Well, we can see in the X=1, Y=1 point (those who looked at M5s and made a purchase) that the only clusters represented here are 1 and 3. We also see that the only clusters at point X=0, Y=0 are 4 and 0. Does that match our conclusions from above? Yes, it does. Clusters 1 and 3 were buying the M5s, while cluster 0 wasn't buying anything, and cluster 4 was only looking at the 3-series. Figure 8 shows the visual cluster layout for our example. Feel free to play around with the X and Y axes to try to identify other trends and patterns.

Figure 8. Cluster visual inspection

CHAPTER 13 : Association (Pr.)

This example illustrates some of the basic elements of association rule mining using WEKA.

The sample data set used for this example is the "bank data".

Clicking on the "Associate" tab will bring up the interface for the association rule algorithms. The Apriori algorithm, which we will use, is the default algorithm selected. To change the parameters for this run (e.g., support, confidence, etc.), click on the text box immediately to the right of the "Choose" button. Note that this box, at any given time, shows the specific command-line arguments that are to be used for the algorithm. Here, you can specify various parameters associated with Apriori. Click on the "More" button to see the synopsis for the different parameters.


Parameter specification:


Once the parameters have been set, the command-line text box will show the new command line. We now click Start to run the program. This produces a set of association rules in the output window.
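The same Apriori run can be scripted as well. The sketch below uses WEKA's Java API and is only illustrative: the file name bank-data.arff is our assumption (Apriori needs nominal attributes, so numeric bank data would first have to be discretized), and the three setters mirror the support, confidence, and number-of-rules parameters discussed above.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BankAssociations {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank-data.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.1); // minimum support
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.setNumRules(10);              // how many rules to report
        apriori.buildAssociations(data);

        System.out.println(apriori); // prints the discovered rules
    }
}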


REFERENCES

WEBSITES

1. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
2. http://maya.cs.depaul.edu/classes/ect584/weka/associate.html
3. http://en.wikipedia.org/wiki/Weka_(machine_learning)
4. http://www.cs.waikato.ac.nz/ml/weka/
5. http://weka.wikispaces.com/space/content#content?q=cross&rid=2
6. http://en.wikipedia.org/wiki/Data_mining
7. http://databases.about.com/od/datamining/a/datamining.htm
8. http://www.thearling.com/text/dmwhite/dmwhite.htm
9. WekaDoc – http://weka.sourceforge.net/wekadoc/
