Slide 1: Introduction to Data Mining with Weka

Data Science and Business Analytics Denver Meetup
Nancy Abramson
Principal Data Scientist

Slide 2: Agenda

- Introduction
- What does Open Source mean?
- Data Science and Data Mining
- Open Source Data Mining Tools
- Weka
  - Overview
  - Profiling Demonstration
  - Analysis Demonstration
- Summary

Slide 3: Introduction – Who am I?

- Datasource Consulting employee for the past 3 years, developing, using, and evaluating open source and enterprise Business Intelligence tools
- New hire at spotXchange as Principal Data Scientist
- Bachelor of Science degree in Computer Science & Mathematics
- Master's in Applied Statistics
- Experience with databases, ETL, and analytics
- Using “open source” or “free software” for more than 25 years
- Market analysis in aerospace, financial, telephony, and retail

Slide 4: What is Open Source?

A software development project in which code is developed by peer production and collaboration, with the end product, source code, and documentation available at no cost to the public.

- Free access to source code
- Free redistribution
- Strong development community

Examples:

- Linux
- Hadoop
- Apache/Tomcat
- MySQL
- Weka

Slide 5: Data Science and Data Mining

Data Science process defined by Dr. DJ Patil, former head of Data Analytics at LinkedIn:

- Clean up and prepare the data
- Create measurable levers to increase the value of the business
- Monitor the state of metrics for changes
- Experiment with the results of the models

Traditional Data Mining is used for:

- Profiling data to check for quality, e.g. max, min, data types, and patterns between variables
- Finding relationships between variables or independent variables, e.g. clusters, regressions
- Checking variance of a measure over time
- Determining the level at which an experiment produced significant results
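
The profiling checks mentioned on this slide (min, max, data types) can be sketched in a few lines of Python. This is only an illustration and was not part of the talk; the `profile_column` helper, the treatment of `?` and empty strings as missing, and the sample values are all assumptions made for the example.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column: inferred type, missing count, and min/max or top value."""
    # Assumption: "?" and empty strings mark missing values.
    present = [v for v in values if v not in ("", "?", None)]
    missing = len(values) - len(present)
    if not present:
        return {"type": "empty", "missing": missing}
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # At least one value is not a number, so treat the column as nominal.
        counts = Counter(present)
        return {
            "type": "nominal",
            "missing": missing,
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0],
        }
    return {"type": "numeric", "missing": missing,
            "min": min(numbers), "max": max(numbers)}


# Hypothetical sample columns.
ages = ["39", "50", "38", "?"]
workclass = ["State-gov", "Private", "Private", "?"]
print(profile_column(ages))
print(profile_column(workclass))
```

A summary like this is roughly what a profiling pass reports per attribute: the type, how many values are missing, and either the numeric range or the dominant category.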

Slide 6: Profiling and Heavy Lifting

Fun Stuff

- See what you never thought possible

Name: Mr. Ed
Genus: Equus
Address: Apt 302, Manhattan, NY 10033

Slide 7: Data Mining Tools

Reference: http://www.phiresearchlab.org/downloads/OpenSourceDataMining.pdf

                    RapidMiner   Weka                          Orange               Rattle               Knime
URL                 Rapid-i.com  www.cs.waikato.ac.nz/ml/weka  www.ailab.si/orange  rattle.togaware.com  knime.org
Bayes Network       yes          yes                           yes                  no                   yes
Decision Tree       yes          yes                           yes                  yes                  yes
Neural Network      yes          yes                           no                   no                   yes
SVM                 yes          yes                           yes                  yes                  yes
Clustering          yes          yes                           yes                  yes                  yes
Association Rules   yes          yes                           yes                  yes                  yes
Ease of Use         Fair         Good                          Excellent            Good                 Good
Data Visualization  Good         Fair                          Excellent            Excellent            Fair

Slide 8: Weka Introduction

- Waikato Environment for Knowledge Analysis (WEKA)
- Developed by the University of Waikato, New Zealand
- Java-based, distributed under the GNU General Public License

- Explorer: preprocessing, attribute selection, learning, visualization
- Experimenter: testing and evaluating machine learning algorithms
- Knowledge Flow: data-flow interface to WEKA
- SimpleCLI

Slide 9

load, filter, analyze

Slide 10: Weka Pre-process Demo

- Load and view CSV data
- Compare pairs of attributes
- Examine min/max data values
- Compare nominal and numeric values
- Save in ARFF format

Data derived from the Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html

Slide 11: Attribute-Relation File Format

@relation workers

@attribute age numeric
@attribute workclass {' State-gov',' Self-emp-not-inc',' Private',' Federal-gov',' Local-gov',' ?',' Self-emp-inc',' Without-pay',' Never-worked'}
@attribute ' fnlwgt' numeric
:
@attribute ' wage' {' <=50K',' >50K'}

@data
39,' State-gov',77516,' Bachelors',13,' Never-married',' Adm-clerical',' Not-in-family',' White',' Male',2174,0,40,' United-States',' <=50K'
50,' Self-emp-not-inc',83311,' Bachelors',13,' Married-civ-spouse',' Exec-managerial',' Husband',' White',' Male',0,0,13,' United-States',' <=50K'
38,' Private',215646,' HS-grad',9,' Divorced',' Handlers-cleaners',' Not-in-family',' White',' Male',0,0,40,' United-States',' <=50K'
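
To make the ARFF layout on this slide concrete, here is a minimal Python sketch that turns a small CSV table into the same `@relation` / `@attribute` / `@data` structure. It is not Weka's own converter: the `csv_to_arff` helper is hypothetical, the type inference (numeric if every value parses as a number, nominal otherwise) is a simplifying assumption, and it assumes values contain no quote characters and does no special handling of missing values.

```python
import csv
import io


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def quote(value):
    """Wrap a name or value in single quotes (assumes it contains no quotes)."""
    return "'" + value + "'"


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    numeric = []
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        numeric.append(all(is_number(v) for v in column))
        if numeric[i]:
            lines.append("@attribute " + quote(name) + " numeric")
        else:
            # Nominal attribute: list its distinct values in sorted order.
            distinct = ",".join(quote(v) for v in sorted(set(column)))
            lines.append("@attribute " + quote(name) + " {" + distinct + "}")
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(v if numeric[i] else quote(v)
                              for i, v in enumerate(row)))
    return "\n".join(lines) + "\n"


# Three columns taken from the rows on the slide, leading spaces dropped.
SAMPLE_CSV = """age,workclass,wage
39,State-gov,<=50K
50,Self-emp-not-inc,<=50K
38,Private,<=50K
"""

print(csv_to_arff(SAMPLE_CSV, "workers"))
```

A nominal attribute built this way lists only the values present in the data, so a category that never appears in the sample (such as `>50K` here) would be missing from the declaration.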

Slide 12: Weka Classify Features

- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 15 attribute/subset evaluators + 10 search algorithms for feature selection
- 3 algorithms for finding association rules


Slide 13: Linear Regression

- Predicted attribute is continuous
- Correlation coefficient indicates how well the model fits the data
  - Measures the strength and the direction of a linear relationship
  - -1 ≤ r ≤ +1
  - A correlation with magnitude greater than 0.8 is generally described as strong, depending on the type of data
- Uses
  - Forecasting
  - Exploring factor effects

Demo: cpu.arff
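
As a rough illustration of the correlation coefficient discussed on this slide, the sketch below computes Pearson's r directly from its definition. The `pearson_r` name and the sample numbers are hypothetical and are not taken from cpu.arff.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: a perfectly linear pair and a noisier pair.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [1, 3, 2, 4]))
```

The first pair lies exactly on a line, so r is at its upper bound of +1; the second pair is still positively related, but less tightly.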

Slide 14: Classification

- Predicted attribute is categorical
- Implemented methods
  - Naïve Bayes
  - Decision trees and rules
  - Neural networks
  - Support vector machines

Demo: J48 decision tree with weather.arff
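
J48 is Weka's implementation of the C4.5 decision tree learner, which picks the attribute to split on using an entropy-based criterion. The sketch below shows plain information gain, a simpler relative of that criterion, and is not J48's actual code. The `entropy` and `information_gain` helpers and the four toy rows are hypothetical; the rows only imitate the style of the weather data and are not copied from weather.arff.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting rows on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # Weighted average entropy of the class within each branch of the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical toy rows, not the contents of weather.arff.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(best)
```

In these toy rows `outlook` separates the classes completely while `windy` tells us nothing, so a tree learner using this criterion would split on `outlook` first.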

Slide 15: That’s All

Nancy Abramson

?