Introduction to Data Mining
TRANSCRIPT
Introduction to Data Mining
Kai Koenig@AgentK
Web/Mobile Developer since the late 1990s
Interested in: Java & JVM, CFML, Functional
Programming, Go, Android, Data Science
And this is my view of the world…
Me
1. What is Data Mining?
2. Concepts and Terminology
3. Weka
4. Algorithms
5. Dealing with Text
6. Java integration
Agenda
We are overwhelmed with data.
1. What is Data Mining?
Fundamentals
Why do we nowadays have SO MUCH data?
Reasons include:
- Cheap storage and better processing power
- Legal & Business requirements
- Digital hoarding
Fundamentals
Data Mining is all about going from data to useful and meaningful information.
- Recommendation in online shops
- Finding an “optimal” partner
- Weather prediction
- Judgement decisions (credit applications)
Fundamentals
A better definition
“Data Mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, often an economic one.”
(Prof. Dr. Ian Witten)
How can you express patterns?
Finding and applying rules
Tear Production Rate == reduced → none
Finding and applying rules
Age == young && Astigmatism == no → soft
A Result: Decision lists
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
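A decision list is applied top-down: the first rule whose condition matches the instance fires. A minimal plain-Java sketch of the list above (the `Rule` record and the string-map representation of an instance are hypothetical helpers, not Weka classes):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class DecisionList {
    // One rule: a condition on the instance plus the class it predicts.
    record Rule(Predicate<Map<String, String>> condition, String outcome) {}

    static final List<Rule> RULES = List.of(
        new Rule(i -> i.get("outlook").equals("sunny") && i.get("humidity").equals("high"), "no"),
        new Rule(i -> i.get("outlook").equals("rainy") && i.get("windy").equals("true"), "no"),
        new Rule(i -> i.get("outlook").equals("overcast"), "yes"),
        new Rule(i -> i.get("humidity").equals("normal"), "yes"),
        new Rule(i -> true, "yes") // "none of the above" default
    );

    // First matching rule wins.
    static String classify(Map<String, String> instance) {
        for (Rule r : RULES)
            if (r.condition().test(instance)) return r.outcome();
        throw new IllegalStateException("unreachable: default rule always matches");
    }

    public static void main(String[] args) {
        System.out.println(classify(
            Map.of("outlook", "sunny", "humidity", "high", "windy", "false")));
    }
}
```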
Not all rules are equal
Classification rules: predict an outcome
Association rules: rules that strongly associate different attribute values
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
2. Concepts and Terminology
Learning
What is Learning? And what is Machine Learning?
A good approach is:
“Things learn when they change their behaviour in a way that makes them perform better in the future”
Learning types
Classification learning
Association learning
Clustering
Numerical Prediction
Some basic terminology
The thing to be learned is the concept.
The output of a learning scheme is the concept description.
Classification learning is sometimes called supervised learning. The outcome is the class.
Examples are called instances.
Some more basic terminology
Discrete attribute values are usually called nominal values, while continuous attribute values are simply called numeric values.
Algorithms used to process data and find patterns are often called classifiers. There are lots of them and all of them can be heavily configured.
3. Weka
What is Weka?
Waikato Environment for Knowledge Analysis
Developed by a group in the Dept. of Computer Science at the University of Waikato in New Zealand.
Also, the weka is a flightless bird found only in New Zealand.
What is Weka?
Download for Mac OS X, Linux and Windows:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Weka is written in Java, comes either as a native application or an executable .jar file, and is licensed under GPL v3.
Getting data into Weka
Easiest and most common for experimenting: .arff
Also supported: CSV, JSON, XML, JDBC connections etc.
Filters in Weka can then be used to preprocess data.
Features
50+ Preprocessing tools
75+ Classification/Regression algorithms
~10 clustering algorithms
… and a package manager to load and install more if you want.
4. Algorithms
Classifiers
There are literally hundreds with lots of tuning options.
Main Categories:
- Rule-based (ZeroR, OneR, PART etc.)
- Tree-based (J48, J48graft, CART etc.)
- Bayes-based (NaiveBayes etc.)
- Functions-based (LR, Logistic etc.)
- Lazy (IB1, IBk etc.)
OneR
A very simple classifier that bases its rules on a single attribute.
For each attribute,
  For each value of that attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute value
  Calculate the error rate of the rules.
Choose the rules with the smallest error rate.
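The procedure above can be sketched in plain Java for nominal attributes. This is an illustrative toy on a hand-made dataset, not Weka's `weka.classifiers.rules.OneR`:

```java
import java.util.HashMap;
import java.util.Map;

public class OneR {
    /**
     * Learns a OneR rule set: tries every attribute, builds a value->class
     * rule per attribute, and keeps the attribute with the fewest errors.
     * The winning attribute index is written into chosenAttr[0].
     */
    static Map<String, String> bestRule(String[][] data, int numAttrs, int[] chosenAttr) {
        int bestErrors = Integer.MAX_VALUE;
        Map<String, String> best = null;
        for (int a = 0; a < numAttrs; a++) {
            // value -> (class -> count)
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (String[] row : data)
                counts.computeIfAbsent(row[a], v -> new HashMap<>())
                      .merge(row[numAttrs], 1, Integer::sum);
            // For each value, assign the most frequent class;
            // errors = instances not covered by that majority class.
            Map<String, String> rule = new HashMap<>();
            int errors = data.length;
            for (var e : counts.entrySet()) {
                var top = e.getValue().entrySet().stream()
                           .max(Map.Entry.comparingByValue()).get();
                rule.put(e.getKey(), top.getKey());
                errors -= top.getValue();
            }
            if (errors < bestErrors) { bestErrors = errors; best = rule; chosenAttr[0] = a; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy weather-style data: columns are outlook, temperature, class "play".
        String[][] data = {
            {"sunny", "hot", "no"}, {"sunny", "mild", "no"},
            {"overcast", "hot", "yes"}, {"rainy", "mild", "yes"}, {"rainy", "cool", "yes"}
        };
        int[] attr = new int[1];
        Map<String, String> rule = bestRule(data, 2, attr);
        System.out.println("attribute " + attr[0] + " -> " + rule);
    }
}
```

Here the outlook attribute wins: its majority-class rule makes no errors on the toy data, while temperature makes two.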
C4.5 (J48)
Produces a decision tree, derived from divide-and-conquer tree building techniques.
Decision trees are often verbose and need to be pruned. J48 uses post-pruning, which can in some instances be costly.
J48 usually provides a good balance of quality vs. cost (execution time etc.)
NaiveBayes
Very good and popular for document (text) classification.
Based on statistical modelling (Bayes formula of conditional probability)
In document classification we treat the existence or absence of a word as a Boolean attribute.
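As a toy illustration of that idea: each word's presence or absence is a Boolean attribute, class probabilities are multiplied per Bayes' formula, and Laplace smoothing keeps unseen words from zeroing out the product. This is a hand-rolled sketch, not Weka's NaiveBayes:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyNaiveBayes {
    Map<String, Integer> docCount = new HashMap<>();              // class -> #docs
    Map<String, Map<String, Integer>> wordCount = new HashMap<>(); // class -> word -> #docs containing it
    Set<String> vocab = new HashSet<>();
    int totalDocs = 0;

    void train(String label, String doc) {
        totalDocs++;
        docCount.merge(label, 1, Integer::sum);
        Map<String, Integer> wc = wordCount.computeIfAbsent(label, k -> new HashMap<>());
        // Presence, not frequency: each distinct word counted once per document.
        for (String w : new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")))) {
            wc.merge(w, 1, Integer::sum);
            vocab.add(w);
        }
    }

    String classify(String doc) {
        Set<String> words = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (String label : docCount.keySet()) {
            int n = docCount.get(label);
            double log = Math.log((double) n / totalDocs);  // class prior
            for (String w : vocab) {
                // Laplace-smoothed P(word present | class)
                double p = (wordCount.get(label).getOrDefault(w, 0) + 1.0) / (n + 2.0);
                log += Math.log(words.contains(w) ? p : 1 - p);
            }
            if (log > bestLog) { bestLog = log; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("grain", "corn wheat harvest");
        nb.train("grain", "wheat export corn");
        nb.train("other", "oil price market");
        System.out.println(nb.classify("corn wheat price"));
    }
}
```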
Training and Testing
We implicitly trained and tested our classifiers in the previous examples using Cross-Validation.
Training and Testing
Test data and Training data NEED to be different.
If you have only one dataset, split it up.
n-fold Cross-Validation:
- Divides your dataset into n parts, holding out each part in turn
- Trains with n-1 parts, tests with the held-out part
- Stratified CV is even better
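The fold splitting can be sketched as follows: a plain (non-stratified) shuffle-and-deal over instance indices, as a simplified illustration of what a CV run does internally:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrossValidation {
    /** Splits instance indices 0..size-1 into n folds after a seeded shuffle. */
    static List<List<Integer>> folds(int size, int n, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < size; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < n; f++) result.add(new ArrayList<>());
        for (int i = 0; i < size; i++) result.get(i % n).add(idx.get(i));
        return result;
    }

    public static void main(String[] args) {
        // 10-fold CV over 100 instances: each fold is the test set exactly once,
        // the other 9 folds together form the training set.
        List<List<Integer>> f = folds(100, 10, 1);
        for (int test = 0; test < f.size(); test++) {
            int trainSize = 100 - f.get(test).size();
            System.out.println("fold " + test + ": train " + trainSize
                               + ", test " + f.get(test).size());
        }
    }
}
```

Stratified CV additionally deals instances out per class so every fold preserves the class distribution.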
5. Dealing with Text
Bag of Words
Generally for document classification we treat a document as a bag of words and the existence or absence of a word is a Boolean attribute.
This results in a problem with very many attributes, each having just two values.
This is quite a bit different from the usual classification problem.
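Conceptually, the conversion looks like this: build a vocabulary over all documents, then emit one Boolean presence vector per document. A simplified, hypothetical stand-in for what Weka's StringToWordVector filter does:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class BagOfWords {
    /**
     * Builds a sorted vocabulary over all documents (written into vocabOut),
     * then one boolean word-presence vector per document.
     */
    static List<boolean[]> vectorize(List<String> docs, List<String> vocabOut) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String d : docs)
            vocab.addAll(Arrays.asList(d.toLowerCase().split("\\W+")));
        vocabOut.addAll(vocab);
        List<boolean[]> vectors = new ArrayList<>();
        for (String d : docs) {
            Set<String> words = new HashSet<>(Arrays.asList(d.toLowerCase().split("\\W+")));
            boolean[] v = new boolean[vocabOut.size()];
            for (int i = 0; i < vocabOut.size(); i++) v[i] = words.contains(vocabOut.get(i));
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> vocab = new ArrayList<>();
        List<boolean[]> vecs = vectorize(List.of("corn harvest up", "oil price up"), vocab);
        System.out.println(vocab);                       // one attribute per distinct word
        System.out.println(Arrays.toString(vecs.get(0)));
    }
}
```

Even on two tiny documents the vocabulary has five attributes; on a real corpus it quickly reaches thousands, which is why this differs from the usual classification problem.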
Filtered Classifiers
First step: use the FilteredClassifier with J48 and the StringToWordVector filter.
Example: Reuters Corn datasets (train/test)
We get 97% accuracy, but there’s still an issue here: investigate the confusion matrix.
Is accuracy the best way to evaluate quality?
Better approaches to evaluation
Accuracy: (a+d)/(a+b+c+d)
Recall: R = d/(c+d)
Precision: P = d/(b+d)
F-Measure: 2PR/(P+R)

False positive rate FP: b/(a+b)
True negative rate TN: a/(a+b)
False negative rate FN: c/(c+d)

Confusion matrix (rows: true class, columns: predicted class):

         predicted –   predicted +
true –        a             b
true +        c             d
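These formulas are a one-liner each. Plugging in counts consistent with the J48 grain experiment quoted on a later slide (38/57 grain docs, 544/547 non-grain docs, so a=544, b=3, c=19, d=38) shows why accuracy alone misleads: roughly 96% accuracy, but recall on the grain class is only about 0.67:

```java
public class Metrics {
    // Confusion matrix counts, following the slide's layout:
    // a = true -, predicted -    b = true -, predicted +
    // c = true +, predicted -    d = true +, predicted +
    static double accuracy(int a, int b, int c, int d) { return (double) (a + d) / (a + b + c + d); }
    static double recall(int c, int d)                 { return (double) d / (c + d); }
    static double precision(int b, int d)              { return (double) d / (b + d); }
    static double fMeasure(double p, double r)         { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        int a = 544, b = 3, c = 19, d = 38;
        double p = precision(b, d), r = recall(c, d);
        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f F=%.3f%n",
                accuracy(a, b, c, d), p, r, fMeasure(p, r));
    }
}
```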
ROC (threshold) curves
Area under the threshold curve determines the overall quality of a classifier.
NaiveBayesMultinomial
Often the best classifier for document classification. In particular:
- good ROC
- good results on the minority class (often what we want)
NaiveBayesMultinomial
J48: 96% accuracy, 38/57 on grain docs, 544/547 on non-grain docs, ROC 0.91
NaiveBayes: 80% accuracy, 46/57 on grain docs, 439/547 on non-grain docs, ROC 0.885
NaiveBayesMultinomial: 91% accuracy, 52/57 on grain docs, 496/547 on non-grain docs, ROC 0.973
NaiveBayesMultinomial
NaiveBayesMultinomial with stoplist, lowerCase and outputWords: 94% accuracy, 56/57 on grain docs, 504/547 on non-grain docs, ROC 0.978
Why? NBM is designed for text:
- based solely on word appearance
- can deal with multiple repetitions of a word
- faster than NaiveBayes
6. Java integration
Weka is written in Java
The UI is essentially a front end to a vast underlying data mining and machine learning API.
Obviously this fact invites us to use the API directly :)
Setting up a project (IntelliJ IDEA)
Create new Java project in IntelliJ
Import weka.jar
Import weka-src.jar
Off you go!
The main classes/packages you need…
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
Getting stuff done
Instances train = new Instances(bReader);
train.setClassIndex(train.numAttributes() - 1);

J48 j48 = new J48();
j48.buildClassifier(train);

Evaluation eval = new Evaluation(train);
eval.crossValidateModel(j48, train, 10, new Random(1));
You can also grab Java code off Weka UI
Photo Credits
https://www.flickr.com/photos/johnnystiletto/3339808858/
https://www.flickr.com/photos/theequinest/5056055144/
https://www.flickr.com/photos/flyingkiwigirl/17385243168
https://www.flickr.com/photos/x6e38/3440973490/
https://www.flickr.com/photos/42931449@N07/5418402840/
https://www.flickr.com/photos/gerardstolk/12194108005/
https://www.flickr.com/photos/zzpza/3269784239/in/
https://www.flickr.com/photos/internationaltransportforum/14258907973/
Get in touch
Kai Koenig
Email: [email protected]
www.ventego-creative.co.nz
Blog: www.bloginblack.de
Twitter: @AgentK