dr. dale e. parson, assignment 4, comparing weka bayesian ... · weka’s visualize tab to find a...

CSC 458 Data Mining and Predictive Analytics I, Fall 2018

Dr. Dale E. Parson, Assignment 4: Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved oxygen levels, in an extension of Assignments 2 and 3.

Due by 11:59 PM on Friday December 7 via make turnitin. I will not accept late solutions after the end of Saturday December 8 because I need to post my solution to help with your exam preparation; assignments coming in after 9 AM on Sunday December 9 earn 0%.

There will be at least one in-class work session for this assignment, and unless you are registered for the 100% on-line sections, please attend with questions, either in the room, or at class time via Ultra. 100% on-line students are encouraged to attend in Old Main 158 or nearby labs at class time if schedules permit.

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

    cd $HOME
    mkdir DataMine    # This should already be there from assignment 2.
    cp ~parson/DataMine/bayes458fall2018.problem.zip DataMine/bayes458fall2018.problem.zip
    cd ./DataMine
    unzip bayes458fall2018.problem.zip
    cd ./bayes458fall2018

This is the directory from which you must run make turnitin by the project deadline to avoid a 10% per day late penalty.

If you run out of file space in your account, you can perform the following steps from within your DataMine/ directory. Be extremely careful, and do NOT use any file name wildcards. This will discard your results from previous assignments. If you wish to keep those, do not remove directories csc458fall2018assn1, csc458fall2018assn2, or linear458fall2018.

    rm -rf csc458fall2018assn1.problem.zip csc458fall2018assn1
    rm -rf csc458fall2018assn2.problem.zip csc458fall2018assn2
    rm -rf linear458fall2018.problem.zip linear458fall2018

You will see the following files in this bayes458fall2018 directory:

    readme.txt    Your answers to Q1 through Q10 below go here, in the required format.
    csc458fall2018assn4trainingset49K.arff    The ARFF file derived from assignment 3.
    makefile, checkfiles.sh, makelib    Files needed to make turnitin to get your solution to me.

How can you avoid running out of memory in Weka?

1. Run Weka using a command line or batch script that sets memory size. I run it this way on my Mac:

    java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar

That requires having the Java runtime environment (not necessarily the Java compiler) installed on your machine (true of campus PCs), and locating the path to the weka.jar Java archive that contains the Weka class libraries and other resources. This line lets Weka use up to 4,000 megabytes (about 4 GB) of memory. As for assignment 2, I have created batch file S:\ComputerScience\WEKA\WekaWith2GBcampus.bat for campus PCs, with handout data files in S:\ComputerScience\Parson\Weka\. I plan to create a 4 GB script S:\ComputerScience\WEKA\WekaWith4GBcampus.bat after I return to campus on November 8. Try using that. It will contain this command line:

    java -Xmx4096M -jar "S:\ComputerScience\WEKA\weka.jar"

2. Right-click results buffers in the Weka -> Classify window, or use Alt-click on Mac (control-click on PC), to Delete result buffer after you are done with one. They take up space. You can also save these results to text files via this menu.

3. Some of these models take a long time to execute. I have noted that condition in these instructions. In such cases, it may save time just to exit Weka and restart it via the command line or a batch file with a large memory limit, rather than just deleting result buffers.

PART I: Preparing your ARFF file. (30% of project grade.) Answer questions at steps 4 & 5.

1. Open csc458fall2018assn4trainingset49K.arff in Weka’s Preprocess tab.

2. Remove TimeOfYear because it is redundant with MinuteFromNewYear and MinuteOfYear. We are leaving month in the attribute set for now. (Note: Some machine learning algorithms such as J48 and other decision trees may perform better using partially redundant attributes. A low-resolution attribute such as TimeOfYear may contribute to a more general tree that is less prone to over-fitting than a high-resolution attribute such as MinuteOfYear; also, a redundant attribute may help to fine-tune a complex tree. However, the NaiveBayes statistical technique assumes statistical independence of non-class attributes, and may be more accurate after removing redundant attributes. We are keeping MinuteOfYear because we can always coarsen its resolution later via discretization. Once an attribute such as MinuteOfYear is in low-resolution form such as the 4-valued TimeOfYear, it is impossible to get the high resolution of MinuteOfYear back.)

3. Remove TimeOfDay because it is redundant with MinuteFromMidnite and MinuteOfDay. Reasoning is similar to that in step 2.

4. Remove MinuteFromNewYear because it is redundant with MinuteOfYear. Next, you can use Weka’s Visualize tab to find a numeric attribute that is not derived from datetime that correlates curvilinearly [1] with MinuteOfYear, or you can use your knowledge gained from assignments 2 and 3.

[1] https://www.merriam-webster.com/dictionary/curvilinear

What is a numeric attribute, not derived from a datetime attribute, that correlates curvilinearly with MinuteOfYear? (5 of the 30% for this question)

TempCelsius or OxygenMgPerLiter

5. Remove MinuteFromMidnite because it is redundant with MinuteOfDay. We are keeping MinuteOfDay because it correlates positively with an underlying mechanism for increasing dissolved oxygen found in the assignment 2 readings. What is this underlying mechanism? (5 of the 30% for this question)

Photosynthesis

6. Remove site_no and site_name because we don’t need them to remove instances in the way that we did for Assignment 3.

7. Create a new derived attribute HourOfDay by using the Weka unsupervised -> attribute filter AddExpression that divides MinuteOfDay by the number of minutes in an hour. Look at the statistics and graph in the right side of the Weka Preprocess tab to ensure that these attributes have the same distribution. After verifying that HourOfDay is an accurate representation of MinuteOfDay in terms of hours, remove MinuteOfDay. We are doing this because HourOfDay is easier to think about. There are only 24 possible hours from the previous midnight, in contrast to 1440 minutes. HourOfDay preserves the fine-grain resolution of MinuteOfDay in its fractional part.

8. Create a new derived attribute DayOfYear by using the Weka unsupervised -> attribute filter AddExpression that divides MinuteOfYear by the number of minutes in a day. Look at the statistics and graph in the right side of the Weka Preprocess tab to ensure that these attributes have the same distribution. After verifying that DayOfYear is an accurate representation of MinuteOfYear, remove MinuteOfYear. We are doing this because DayOfYear is easier to think about. There are only 365 possible days from the previous January 1, in contrast to over 525K minutes. DayOfYear preserves the fine-grain resolution of MinuteOfYear in its fractional part.
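The arithmetic behind steps 7 and 8 can be checked outside Weka with a short Python sketch. The function names and the sample values below are illustrative assumptions, not part of the handout. In Weka's AddExpression filter the same division is written against an attribute index, for example aN/60 or aN/1440, where N is the 1-based position of MinuteOfDay or MinuteOfYear in your own attribute list.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR  # 1440 minutes in a day


def hour_of_day(minute_of_day):
    """Step 7: fractional hours since the previous midnight."""
    return minute_of_day / MINUTES_PER_HOUR


def day_of_year(minute_of_year):
    """Step 8: fractional days since the previous January 1."""
    return minute_of_year / MINUTES_PER_DAY


# Hypothetical sample values, chosen only to show that the fractional
# part keeps the fine-grain resolution of the original attribute.
print(hour_of_day(810))      # 13.5, i.e. 1:30 PM
print(day_of_year(259200))   # 180.0
```

Because the division is a constant rescaling, the histogram shape in the Preprocess tab should be the same before and after, which is the check that steps 7 and 8 ask for.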

9. Discretize ONLY OxygenMgPerLiter into 10 discrete bins as in assignment 2. Bayesian analysis requires a nominal target attribute (a.k.a. class). Keep useEqualFrequency as False. Do NOT discretize any other numeric attributes at this time.
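To make the equal-width binning of step 9 concrete (useEqualFrequency left at False), here is a minimal Python sketch. It is not Weka's Discretize filter, and the helper names are assumptions. The sample readings are hypothetical, chosen only so that the cut points resemble the '(-inf-2.27]' style labels that appear in the Q2 answer.

```python
from bisect import bisect_left


def equal_width_cut_points(values, bins=10):
    """Return the bins-1 interior cut points that split [min, max] into equal widths."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + i * width for i in range(1, bins)]


def bin_index(value, cut_points):
    """Index of the right-closed interval (a-b] that contains value."""
    return bisect_left(cut_points, value)


def bin_label(index, cut_points):
    """Weka-style label such as '(-inf-2.27]' or '(19.63-inf)'."""
    low = "-inf" if index == 0 else f"{cut_points[index - 1]:.2f}"
    if index == len(cut_points):
        return f"({low}-inf)"
    return f"({low}-{cut_points[index]:.2f}]"


# Hypothetical oxygen readings in mg/L (NOT taken from the assignment data).
sample = [0.1, 5.0, 9.0, 12.5, 21.8]
cut_points = equal_width_cut_points(sample, bins=10)
for reading in sample:
    print(reading, bin_label(bin_index(reading, cut_points), cut_points))
```

Equal-width bins keep the interval size constant, so the number of instances per bin can vary widely; equal-frequency binning would do the opposite.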

10. Reorder the attributes to put OxygenMgPerLiter in the last (target) position, without disturbing the relative order of the other attributes. At the end of this step you MUST have these attributes in this order.

Save this as ARFF file csc458fall2018assn4trainingset49K.arff, over-writing the handout file. You must put this into your bayes458fall2018/ project directory before you run make turnitin. Work with csc458fall2018assn4trainingset49K.arff throughout the remainder of this assignment. We are using 10-fold cross validation with these 49K instances as the training & test dataset in this assignment.

Each of Q1 through Q10 is worth 7% of the total project grade.

Q1: On this initial set of attributes in this 49K set of measurements, run the following classifiers in the order shown below, and record only these results in your answer. See footnote [2] for the Kappa statistic.

ZeroR:
Correctly Classified Instances N N.N %
Kappa statistic N.N
Relative absolute error N.N %
Root relative squared error N.N %

OneR:
Correctly Classified Instances N N.N %
Kappa statistic N.N
Relative absolute error N.N %
Root relative squared error N.N %

J48:
Correctly Classified Instances N N.N %
Kappa statistic N.N
Relative absolute error N.N %
Root relative squared error N.N %

NaiveBayes:
Correctly Classified Instances N N.N %
Kappa statistic N.N
Relative absolute error N.N %
Root relative squared error N.N %

BayesNet:
Correctly Classified Instances N N.N %
Kappa statistic N.N
Relative absolute error N.N %
Root relative squared error N.N %

ZeroR
Correctly Classified Instances 14623 29.8027 %
Kappa statistic 0
Relative absolute error 100 %
Root relative squared error 100 %

OneR
Correctly Classified Instances 26906 54.8363 %
Kappa statistic 0.4209
Relative absolute error 56.64 %
Root relative squared error 106.4347 %

J48
Correctly Classified Instances 46295 94.3525 %
Kappa statistic 0.9291
Relative absolute error 8.8528 %
Root relative squared error 35.4751 %

NaiveBayes
Correctly Classified Instances 28786 58.6679 %
Kappa statistic 0.4865
Relative absolute error 63.7376 %
Root relative squared error 83.387 %

BayesNet
Correctly Classified Instances 35897 73.1606 %
Kappa statistic 0.6654
Relative absolute error 37.1416 %
Root relative squared error 70.7849 %

Examine the conditional probability table in the output of NaiveBayes and the graph of BayesNet. You can see the latter, partially illustrated on the next page, by Alt-clicking BayesNet in the Classify tab’s result list and selecting Visualize graph. Clicking a node in the graph shows its conditional probabilities. BayesNet is sometimes more accurate than NaiveBayes because NaiveBayes assumes statistical independence of the non-class attributes, while BayesNet does not. BayesNet attempts to model statistical interdependence among these attributes.

In the BayesNet illustration below, clicking OxygenMgPerLiter reveals the probability distribution of its 10 discretized bins. Clicking other nodes that are successors (downstream) in the directed acyclic graph reveals more complicated tables. In the illustrated table for TempCelsius below, BayesNet auto-discretizes TempCelsius, and then gives conditional probabilities for OxygenMgPerLiter’s bins, given discrete bins for TempCelsius. Note how the probability for the mid-level (12.88-14.51] bin of OxygenMgPerLiter changes going left-to-right from lower-to-higher TempCelsius.

BayesNet takes all of the probabilities in all graph nodes for a given bin of OxygenMgPerLiter, multiplies them together, normalizes the result in the range 0%-100%, and uses this number to predict the probability of that bin of the class (target attribute), given all other attribute value bins. While the graph below auto-generates from OxygenMgPerLiter as the class, it is possible to use expertise to hand-design a graph. Again, the main benefit of BayesNet over NaiveBayes in some cases is BayesNet’s non-assumption of conditional independence among the non-class attributes.

[2] From https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english: “The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a single classifier, but also to evaluate classifiers amongst themselves. In addition, it takes into account random chance (agreement with a random classifier), which generally means it is less misleading than simply using accuracy as a metric (an Observed Accuracy of 80% is a lot less impressive with an Expected Accuracy of 75% versus an Expected Accuracy of 50%).

    Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)

Not only can this kappa statistic shed light into how the classifier itself performed, the kappa statistic for one model is directly comparable to the kappa statistic for any other model used for the same classification task.”

Parson’s example: If you had a 6-sided die that had the value 1 on 5 sides, and 0 on the other, the random-chance expected accuracy of rolling a 1 would be 5/6 = 83.3%. Since the ZeroR classifier simply picks the most statistically likely class without respect to the other (non-target) attributes, it would pick an expected die value of 1 in this case, giving a random observed accuracy of 83.3%, and a Kappa of (.833 - .833) / (1 - .833) = 0.

Also from this linked site: “Landis and Koch considers 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect. Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor. It is important to note that both scales are somewhat arbitrary. At least two further considerations should be taken into account when interpreting the kappa statistic. First, the kappa statistic should always be compared with an accompanied confusion matrix if possible to obtain the most accurate interpretation. Second, acceptable kappa statistic values vary on the context. For instance, in many inter-rater reliability studies with easily observable behaviors, kappa statistic values below 0.70 might be considered low. However, in studies using machine learning to explore unobservable phenomena like cognitive states such as day dreaming, kappa statistic values above 0.40 might be considered exceptional.”
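The Kappa formula quoted in the footnote, and the die example that goes with it, can be reproduced with a small Python sketch. The function names are assumptions made for illustration. The confusion matrix is a hypothetical tally of six ideal rolls (five 1s and one 0) scored against a ZeroR-style classifier that always predicts 1, with expected accuracy computed from the row and column totals in the usual way for Cohen's kappa.

```python
def kappa(observed_accuracy, expected_accuracy):
    """Kappa = (observed - expected) / (1 - expected), as in the footnote."""
    return (observed_accuracy - expected_accuracy) / (1.0 - expected_accuracy)


def kappa_from_confusion(matrix):
    """Kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Expected accuracy: chance agreement implied by the row and column totals.
    expected = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(n)
    )
    return kappa(observed, expected)


# Hypothetical tally for the die example (not Weka output).
die_matrix = [
    [0, 1],  # actual 0: predicted 0 zero times, predicted 1 once
    [0, 5],  # actual 1: predicted 0 zero times, predicted 1 five times
]
print(kappa_from_confusion(die_matrix))  # 0.0
print(round(kappa(0.80, 0.75), 3))       # 0.2
print(round(kappa(0.80, 0.50), 3))       # 0.6
```

The last two lines show the point made in the quotation: the same 80% observed accuracy yields a much lower Kappa when the expected accuracy is 75% than when it is 50%.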

Q2: From NaiveBayes, copy & paste the mean row for month as it correlates with OxygenMgPerLiter in the 10 columns.

Attribute  '(range]'  '(range]'  '(range]'  '(range]'  '(range]'  '(range]'  '(range]'  '(range]'  '(range]'  '(range)'
month
  mean     N.N        N.N        N.N        N.N        N.N        N.N        N.N        N.N        N.N        N.N

Attribute  '(-inf-2.27]'  '(2.27-4.44]'  '(4.44-6.61]'  '(6.61-8.78]'  '(8.78-10.95]'  '(10.95-13.12]'  '(13.12-15.29]'  '(15.29-17.46]'  '(17.46-19.63]'  '(19.63-inf)'
month
  mean     7.7768         7.8294         7.1566         7.0054         7.213           8.7751           9.8712           4.988            5.7468           6.8421

What correlation, if any, can you see in the low-to-high ranges of OxygenMgPerLiter, going left to right in the table, to the correlated values of month? In particular, look at the value of month for the lowest range of OxygenMgPerLiter on the left, and the value of month for the highest range of OxygenMgPerLiter on the right. Use Weka’s Visualize tab to look at the correlation of either month or DayOfYear (easier to view) to the time of year. What aspect of the value of month in the right column contributes to the highest value of OxygenMgPerLiter for this table, based on our previous projects and discussions?

The exponential plant growth in late June to early July spikes the dissolved oxygen level briefly. Here is the DayOfYear graph.

Q3: From the BayesNet graph node for month, what happens to the probability for the highest-valued three bins of OxygenMgPerLiter (the bottom three rows of oxygen bins) as a function of month, for the range of month values that contains the month measurement in the rightmost column of NaiveBayes in Q2?

OxygenMgPerLiter has a temporary peak around the start of July that dies away quickly, again due to the seasonal burst in plant growth.
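The multiply-and-normalize prediction that the BayesNet discussion under Q1 describes can be sketched in a few lines of Python. This is a minimal sketch, not Weka's implementation: the three bin names and every probability below are hypothetical, and each table is assumed to already hold the probability of the observed attribute bin given each OxygenMgPerLiter bin. When every attribute depends only on the class, this is the NaiveBayes computation.

```python
def posterior(prior, conditionals):
    """Multiply the class prior by each conditional probability, then normalize.

    prior:        {class_bin: P(class_bin)}
    conditionals: list of {class_bin: P(observed attribute bin | class_bin)}
    """
    scores = {}
    for class_bin, probability in prior.items():
        score = probability
        for table in conditionals:
            score *= table[class_bin]
        scores[class_bin] = score
    total = sum(scores.values())
    return {class_bin: score / total for class_bin, score in scores.items()}


# Hypothetical numbers for three oxygen bins (NOT taken from the Weka output).
prior = {"low": 0.3, "mid": 0.5, "high": 0.2}
given_temp = {"low": 0.6, "mid": 0.3, "high": 0.1}  # P(observed TempCelsius bin | oxygen bin)
given_hour = {"low": 0.2, "mid": 0.3, "high": 0.5}  # P(observed HourOfDay bin | oxygen bin)

result = posterior(prior, [given_temp, given_hour])
for class_bin, probability in result.items():
    print(class_bin, round(probability, 4))
```

The normalized values sum to 1, and the bin with the largest value is the predicted class. Adding a redundant attribute multiplies in nearly the same evidence twice, which is one way to see why removing month can help NaiveBayes.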

Alt-click each result except NaiveBayes in the Classify tab’s result list and “Delete result buffer” to recover some storage. Note the value of Correctly Classified Instances for NaiveBayes with this full attribute set. Then, for each of the non-class attributes, starting at pH and working your way, one at a time, down through DayOfYear, perform the following steps in a loop:

A. Remove the next non-class attribute and run NaiveBayes.

B. If Correctly Classified Instances increases or stays the same after this removal, leave that attribute removed; otherwise (Correctly Classified Instances has decreased from its maximum NaiveBayes value so far), execute Undo to restore the attribute.

C. Note which attributes you have removed without a subsequent Undo to restore them.

D. You can use “Delete result buffer” to recover some storage. I kept only the NaiveBayes result with the greatest Correctly Classified Instances so far to help me keep track of this maximum.

E. Repeat steps A-D, one attribute at a time, until you have removed, tested, and conditionally restored each non-class attribute, one at a time, through DayOfYear, which is the last non-class attribute.

Attribute removed    Correctly Classified Instances    Action
All (none removed)   58.6679 %
pH                   50.1875 %                         restore
TempCelsius          49.3009 %                         restore
Conductance          56.3812 %                         restore
DischargeRate        65.4751 %                         do NOT restore
Month                66.2536 %                         do NOT restore
HourOfDay            65.8908 %                         restore
DayOfYear            63.9119 %                         restore

Q4: After completing the above steps, which attribute or attributes did you permanently remove?

DischargeRate, Month

Q5: Which of the permanently removed attribute(s) of Q4, if any, correlate with a remaining attribute, based on the analyses of assignments 2 and 3 and/or the relationships of the non-target attributes? With which of the remaining non-class attributes do these removed attribute(s) correlate?

Month correlates with DayOfYear, and partially with TempCelsius.

Other removed attribute(s) simply do(es) not correlate well with OxygenMgPerLiter, so removal decreases error in NaiveBayes. The removed attribute(s) answered for Q5, on the other hand, violate(s) the statistical independence assumption of NaiveBayes, and so removal reduces error introduced by violating this assumption.

Q6: Repeat step Q1 with this reduced attribute set and record the same results here for those same exact classifiers ZeroR, OneR, J48, NaiveBayes, and BayesNet.

ZeroR
Correctly Classified Instances 14623 29.8027 % (Q1 was 29.8027 %)
Kappa statistic 0
Relative absolute error 100 %
Root relative squared error 100 %

OneR
Correctly Classified Instances 26906 54.8363 % (Q1 was 54.8363 %)
Kappa statistic 0.4209
Relative absolute error 56.64 %
Root relative squared error 106.4347 %

J48

Correctly Classified Instances 46237 94.2343 % (Q1 was 94.3525 %)
Kappa statistic 0.9277
Relative absolute error 9.0029 %
Root relative squared error 35.7266 %

NaiveBayes
Correctly Classified Instances 32508 66.2536 % (Q1 was 58.6679 %)
Kappa statistic 0.5734
Relative absolute error 63.6889 %
Root relative squared error 79.414 %

BayesNet
Correctly Classified Instances 36415 74.2164 % (Q1 was 73.1606 %)
Kappa statistic 0.6774
Relative absolute error 39.7572 %
Root relative squared error 67.5141 %

Q7: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) improved accuracy in terms of Correctly Classified Instances? Why did it or they improve?

NaiveBayes and BayesNet, because we removed interdependence among the non-class attributes.

Q8: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) show decreased accuracy in terms of Correctly Classified Instances? Why did it or they get worse?

J48, because it does not require statistical independence among non-class attributes. Its tree can use these attributes for fine tuning.

Q9: In going from the full attribute set of Q1 to the reduced attribute set of Q6, which classifier(s) show no change in accuracy in terms of Correctly Classified Instances? Why did it or they show no change?

ZeroR and OneR show no change because they do not use the removed non-class attributes. ZeroR uses only the distribution of the class attribute, and OneR uses DayOfYear for both Q1 and Q6.

Q10: Remove Conductance and pH, leaving attributes TempCelsius, HourOfDay, DayOfYear, and OxygenMgPerLiter intact. We will attempt to see whether the three main contributors to OxygenMgPerLiter (water temperature, photosynthesis, and the exponential spike in plant growth around day 180) appear in cluster relationships. Run SimpleKMeans clustering with 6 clusters (leave other parameters at their defaults) and complete the table below by using Copy and Paste (Control-C or Command-C on Mac) from the Weka results. Make a pairwise comparison between the “Full Data” centroids and Clusters 0 through 5, i.e., pair “Full Data” with each of the others in turn and compare changes from the overall centroids of Full Data. Describe any correlations you see in changes for TempCelsius and OxygenMgPerLiter in going from Full Data to the respective Cluster 0 through 5. Do any of the other non-class attributes HourOfDay or DayOfYear show a similarly clear correlation with OxygenMgPerLiter?

Final cluster centroids:
                                 Cluster#
Attribute         Full Data  0    1    2    3    4    5
                  (49066.0)  (N)  (N)  (N)  (N)  (N)  (N)
=======================================================
TempCelsius
HourOfDay
DayOfYear
OxygenMgPerLiter

Final cluster centroids:
                                   Cluster#
Attribute         Full Data      0              1              2                3               4              5
                  (49066.0)      (6663.0)       (7394.0)       (7148.0)         (10222.0)       (13143.0)      (4496.0)
==========================================================================================================================
TempCelsius       18.451         23.4295        20.1835        8.2211           13.8225         24.0212        18.7282
HourOfDay         11.7522        17.7178        4.5455         12.2487          12.0294         10.642         16.5887
DayOfYear         213.0288       237.1607       202.9097       275.6897         203.9463        204.9601       138.5222
OxygenMgPerLiter  '(7.99-9.62]'  '(7.99-9.62]'  '(7.99-9.62]'  '(11.25-12.88]'  '(9.62-11.25]'  '(6.36-7.99]'  '(7.99-9.62]'

As TempCelsius goes up, OxygenMgPerLiter goes down; as TempCelsius goes down, OxygenMgPerLiter goes up. No other attribute shows a similarly clear correlation, although HourOfDay in early afternoon correlates with higher oxygen levels, which could be an indication of photosynthesis.
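The pairwise comparison that Q10 asks for can be tabulated with a short Python sketch. The TempCelsius centroids and OxygenMgPerLiter bins are copied from the answer table above; the helper name and the idea of counting how many bins a cluster shifts from Full Data are illustrative assumptions, not part of Weka's output.

```python
# Oxygen bins that appear in the answer table, in ascending order.
OXYGEN_BINS = ["(6.36-7.99]", "(7.99-9.62]", "(9.62-11.25]", "(11.25-12.88]"]

# (TempCelsius centroid, OxygenMgPerLiter bin)
FULL_DATA = (18.451, "(7.99-9.62]")
CLUSTERS = {
    0: (23.4295, "(7.99-9.62]"),
    1: (20.1835, "(7.99-9.62]"),
    2: (8.2211, "(11.25-12.88]"),
    3: (13.8225, "(9.62-11.25]"),
    4: (24.0212, "(6.36-7.99]"),
    5: (18.7282, "(7.99-9.62]"),
}


def compare_to_full(centroid):
    """Return (temperature change, oxygen bin shift) relative to Full Data."""
    temp, oxygen = centroid
    temp_delta = temp - FULL_DATA[0]
    bin_delta = OXYGEN_BINS.index(oxygen) - OXYGEN_BINS.index(FULL_DATA[1])
    return temp_delta, bin_delta


for number, centroid in CLUSTERS.items():
    temp_delta, bin_delta = compare_to_full(centroid)
    print(f"cluster {number}: TempCelsius {temp_delta:+.2f}, oxygen bin shift {bin_delta:+d}")
```

Clusters 2 and 3 are colder than Full Data and land in higher oxygen bins, while cluster 4, the warmest, drops one bin. Clusters 0, 1, and 5 stay in the Full Data bin, so the relationship shows up only where the temperature change is large enough to cross a bin boundary, or where another factor offsets it.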
