Upload: somaskhandan-chinnasamy

Post on 27-Aug-2014

WEKA Term Paper

Two public-domain data sets are selected to run regression and classification in order to draw business insight from the data.

4/19/2012


This term paper tries out some of the data analysis tools built into the WEKA data mining software. The data used in this term paper is taken from the public domain and is available on the internet. The sources of the data and other materials are given in the references.

WEKA (Waikato Environment for Knowledge Analysis) is software written in the Java programming language. It was developed at the University of Waikato, New Zealand. WEKA has built-in visualization tools and algorithms for data analysis and predictive modelling, with an easy-to-use GUI.

Example 1 - Bolts Optimization

The data come from an experiment on the effects of machine adjustments on the time to count bolts. The data appear as the STATS (Issue 10) Challenge. A manufacturer of automotive accessories provides hardware, e.g. nuts, bolts, washers and screws, to fasten the accessory to the car or truck. Hardware is counted and packaged automatically. Specifically, bolts are dumped into a large metal dish. A plate that forms the bottom of the dish rotates counterclockwise. This rotation forces bolts to the outside of the dish and up along a narrow ledge. Due to the vibration of the dish caused by the spinning bottom plate, some bolts fall off the ledge and back into the dish. The ledge spirals up to a point where the bolts are allowed to drop into a pan on a conveyor belt. As a bolt drops, it passes by an electronic eye that counts it. When the electronic counter reaches the pre-set number of bolts, the rotation is stopped and the conveyor belt is moved forward.

There are several adjustments on the machine that affect its operation. These include: a speed setting that controls the speed of rotation (SPEED1) of the plate at the bottom of the dish, the total number of bolts (TOTAL) to be counted, a second speed setting (SPEED2) that is used to change the speed of rotation (usually slowing it down) for the last few bolts, the number of bolts to be counted at this second speed (NUMBER2), and the sensitivity of the electronic eye (SENS). The sensitivity setting is there to ensure that the correct number of bolts is counted. Packaging too few bolts causes customer complaints. Packaging too many bolts increases costs. For each run conducted in this experiment the correct number of bolts was counted. From an engineering standpoint, if the correct number of bolts is counted, the sensitivity should not affect the time to count bolts.

The measured response is the time (TIME), in seconds, that it takes to count the desired number of bolts. In order to put times on an equal footing, the response to be analysed is the time to count 20 bolts (T20BOLT). The data cover 40 combinations of settings. RUN is the order in which the data were collected. A snippet of the data is given below:

@attribute RUN integer

@attribute SPEED1 integer

@attribute TOTAL integer

@attribute SPEED2 integer

@attribute NUMBER2 integer

@attribute SENS integer

@attribute TIME real

@attribute T20BOLT real

@data

25, 2, 10, 1.5, 0, 6, 5.70, 11.40

24, 2, 10, 1.5, 0, 10, 17.56, 35.12

30, 2, 10, 1.5, 2, 6, 11.28, 22.56

2, 2, 10, 1.5, 2, 10, 8.39, 16.78
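
The paper does not state how T20BOLT is derived, but in the four rows shown it matches TIME rescaled in proportion to the number of bolts counted. The following is a minimal sketch under that assumption; the function name `time_per_20_bolts` is introduced here for illustration only.

```python
import math

# (TOTAL, TIME, T20BOLT) copied from the four data rows shown above.
rows = [
    (10, 5.70, 11.40),
    (10, 17.56, 35.12),
    (10, 11.28, 22.56),
    (10, 8.39, 16.78),
]


def time_per_20_bolts(total, time_seconds):
    """Rescale a measured counting time to the time needed for 20 bolts.

    Assumes a simple proportional rescaling, which the paper does not state.
    """
    return time_seconds * 20 / total


for total, time_seconds, t20bolt in rows:
    # Each recomputed value should agree with the T20BOLT column.
    assert math.isclose(time_per_20_bolts(total, time_seconds), t20bolt)
```

If this reading is right, T20BOLT carries the same information as TIME and TOTAL together, which is worth keeping in mind when T20BOLT is used as an input for predicting TIME.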

The columns in the data set follow the order of the attribute names given above. We used a percentage split of 66% (66% of the instances for training, the remainder for testing) to build and validate the model in WEKA. Running the data through WEKA gives the following result:

Scheme:weka.classifiers.trees.M5P -M 4.0

Relation: bolts

Instances: 40

Attributes: 8

RUN

SPEED1

TOTAL

SPEED2

NUMBER2

SENS

TIME

T20BOLT

Test mode:split 66.0% train, remainder test

M5 pruned model tree:

(using smoothed linear models)

T20BOLT <= 62.365 : LM1 (29/4.058%)

T20BOLT > 62.365 :

| TOTAL <= 20 : LM2 (3/5.86%)

| TOTAL > 20 : LM3 (8/0.008%)

LM num: 1

TIME =

1.1824 * TOTAL

+ 0.4414 * NUMBER2

+ 0.7813 * T20BOLT

- 21.3755

LM num: 2

TIME =

0.0561 * RUN

+ 2.4037 * TOTAL

+ 1.0813 * T20BOLT

- 52.9476

LM num: 3

TIME =

0.0439 * RUN

+ 2.1194 * TOTAL

+ 1.2106 * T20BOLT

- 48.5376.
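
To make the printed model tree concrete, here is a minimal sketch that transcribes the split rules and the three linear models into a single function. It assumes the printed coefficients are the final (already smoothed) models, so its output may differ slightly from what WEKA itself would predict; the function name `predict_time` is introduced here for illustration.

```python
def predict_time(run, total, number2, t20bolt):
    """Route one instance through the printed tree and apply the leaf's model."""
    if t20bolt <= 62.365:
        # LM1
        return 1.1824 * total + 0.4414 * number2 + 0.7813 * t20bolt - 21.3755
    if total <= 20:
        # LM2
        return 0.0561 * run + 2.4037 * total + 1.0813 * t20bolt - 52.9476
    # LM3
    return 0.0439 * run + 2.1194 * total + 1.2106 * t20bolt - 48.5376


# Hypothetical settings, chosen only to show which leaf is used.
print(predict_time(run=1, total=30, number2=2, t20bolt=70.0))  # falls into LM3
print(predict_time(run=1, total=20, number2=2, t20bolt=70.0))  # falls into LM2
```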

WEKA produced three linear equations for calculating TIME, one for each leaf of the tree. They can be used to estimate how long the machine takes to count bolts at a given combination of settings, and so to choose the settings accordingly. This result is given by the M5P model tree learner (listed under the tree classifiers in WEKA). The summary of the result is given below:

=== Summary ===

Correlation coefficient 0.988

Mean absolute error 4.0921

Root mean squared error 6.559

Relative absolute error 12.9437 %

Root relative squared error 15.5953 %

Total Number of Instances 14.
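
For readers unfamiliar with these figures, the sketch below shows how such summary statistics are commonly defined. It uses the textbook definitions, with the mean of the actual values as the baseline predictor; WEKA's own baseline may be computed differently (for example from the training data), so this is an illustration of the quantities and not a reproduction of WEKA's code. The function name and the sample values are hypothetical.

```python
import math


def regression_summary(actual, predicted):
    """Compute correlation, MAE, RMSE and the two relative errors (in %)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    mean_pred = sum(predicted) / n

    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))

    # Baseline: always predict the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)

    cov = sum((a - mean_actual) * (p - mean_pred)
              for a, p in zip(actual, predicted))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)

    return {
        "correlation": cov / math.sqrt(base_sq * var_pred),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }


# Hypothetical values, not taken from the bolts data.
print(regression_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]))
```

A relative error of 100% means the model does no better than always predicting the mean, so values well below 100% indicate a useful model.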

We can say that the model derived by WEKA is a good one, as the relative absolute error (12.94%) and the root relative squared error (15.60%) are small and the correlation coefficient (0.988) is high. This model can be used for prediction and for tuning the system to give the desired output. The decision tree is given below:

The plot above shows the predicted values for the bolts data plotted against time. We can see that there is a clustering of data at the top end of the graph.

Example 2 - Auto Prices Risk Analysis

This data set consists of three types of entities:

(a) The specification of an auto in terms of various characteristics;

(b) Its assigned insurance risk rating;

(c) Its normalized losses in use as compared to other cars.

The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.), and represents the average loss per car per year.

The original data (from the UCI repository, http://www.ics.uci.edu/~mlearn/MLSummary.html) has 205 instances described by 26 attributes:

- 15 continuous
- 1 integer
- 10 nominal

We will be using 16 attributes for the WEKA analysis. The details of the attributes are given below:

1. Symboling: -3, -2, -1, 0, 1, 2, 3.
2. Normalized-losses: continuous from 65 to 256.
3. Wheel-base: continuous from 86.6 to 120.9.
4. Length: continuous from 141.1 to 208.1.
5. Width: continuous from 60.3 to 72.3.
6. Height: continuous from 47.8 to 59.8.
7. Curb-weight: continuous from 1488 to 4066.
8. Engine-size: continuous from 61 to 326.
9. Bore: continuous from 2.54 to 3.94.
10. Stroke: continuous from 2.07 to 4.17.
11. Compression-ratio: continuous from 7 to 23.
12. Horsepower: continuous from 48 to 288.
13. Peak-rpm: continuous from 4150 to 6600.
14. City-mpg: continuous from 13 to 49.
15. Highway-mpg: continuous from 16 to 54.
16. Price: continuous from 5118 to 45400.

The analysis using SMOreg in WEKA is given below:

Scheme:weka.classifiers.functions.SMOreg -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -L 0.0010 -W 1 -P 1.0E-12 -T 0.0010 -V" -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"

Relation: price

Instances: 159

Attributes: 16

symboling

normalized-losses

wheel-base

length

width

height

curb-weight

engine-size

bore

stroke

compression-ratio

horsepower

peak-rpm

city-mpg

highway-mpg

price

Test mode:split 66.0% train, remainder test

=== Classifier model (full training set) ===

SMOreg

weights (not support vectors):

+ 0.601 * (normalized) normalized-losses

- 0.8196 * (normalized) wheel-base

- 0.1909 * (normalized) length

+ 0.4036 * (normalized) width

+ 0.0513 * (normalized) height

- 0.1216 * (normalized) curb-weight

+ 0.0719 * (normalized) engine-size

- 0.0169 * (normalized) bore

+ 0.0439 * (normalized) stroke

- 0.0221 * (normalized) compression-ratio

- 0.0244 * (normalized) horsepower

+ 0.0191 * (normalized) peak-rpm

- 0.0924 * (normalized) city-mpg

+ 0.1199 * (normalized) highway-mpg

+ 0.1616 * (normalized) price

+ 0.5164

Number of kernel evaluations: 12720 (95.873% cached)

Time taken to build model: 0.07 seconds

=== Summary ===

Correlation coefficient 0.7459

Mean absolute error 0.6006

Root mean squared error 0.8551

Relative absolute error 59.7457 %

Root relative squared error 69.5818 %

Total Number of Instances 54

Using the regression equation, we can classify whether a car falls into the risky or the less risky category. After this analysis we ran a cluster analysis to see how many clusters the car data falls into. The WEKA output for k-means clustering is given below:

We can see that there are two clusters in the data. The grouping is based on the values of each attribute.

Analysis of Result

The regression equation we got from WEKA can be used to classify the data as risky or less risky. All we need to do is to input the values for the car and the equation will give an output from -3 to +3. With this equation we can easily classify the car as risky or less risky.
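
As a rough illustration of applying the SMOreg weights, the sketch below evaluates the printed linear equation for one car. Two things are assumptions here and not taken from the WEKA output: the inputs are min-max normalized using the attribute ranges listed in the attribute description (WEKA normalizes with the ranges seen in its own training data, which the output does not show), and the result is left on the model's normalized scale without mapping it back to the -3 to +3 symboling scale. The names `smoreg_score` and `normalize` are introduced here for illustration.

```python
# (min, max) per attribute, taken from the attribute list; a stand-in for the
# unknown ranges WEKA actually used when normalizing the training data.
RANGES = {
    "normalized-losses": (65, 256),
    "wheel-base": (86.6, 120.9),
    "length": (141.1, 208.1),
    "width": (60.3, 72.3),
    "height": (47.8, 59.8),
    "curb-weight": (1488, 4066),
    "engine-size": (61, 326),
    "bore": (2.54, 3.94),
    "stroke": (2.07, 4.17),
    "compression-ratio": (7, 23),
    "horsepower": (48, 288),
    "peak-rpm": (4150, 6600),
    "city-mpg": (13, 49),
    "highway-mpg": (16, 54),
    "price": (5118, 45400),
}

# Weights and intercept copied from the SMOreg output.
WEIGHTS = {
    "normalized-losses": 0.601,
    "wheel-base": -0.8196,
    "length": -0.1909,
    "width": 0.4036,
    "height": 0.0513,
    "curb-weight": -0.1216,
    "engine-size": 0.0719,
    "bore": -0.0169,
    "stroke": 0.0439,
    "compression-ratio": -0.0221,
    "horsepower": -0.0244,
    "peak-rpm": 0.0191,
    "city-mpg": -0.0924,
    "highway-mpg": 0.1199,
    "price": 0.1616,
}
INTERCEPT = 0.5164


def normalize(value, low, high):
    """Min-max scale a raw attribute value into the range 0..1."""
    return (value - low) / (high - low)


def smoreg_score(car):
    """Weighted sum of the normalized attributes plus the intercept."""
    return INTERCEPT + sum(
        weight * normalize(car[name], *RANGES[name])
        for name, weight in WEIGHTS.items()
    )


# Hypothetical car with every attribute at the low end of its range.
smallest_car = {name: low for name, (low, high) in RANGES.items()}
print(smoreg_score(smallest_car))
```

Because every normalized input is zero for this hypothetical car, the score reduces to the intercept. With a correlation coefficient of 0.7459 and a relative absolute error near 60%, such scores are best read as a rough indication of risk.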