The Artful Business of Data Mining: Computational Statistics with Open Source Tools

The Artful Business of Data Mining: Computational Statistics with Open Source Tools, Wednesday 20 March 13

Upload: david-coallier

Post on 11-Apr-2017

Category: Technology


TRANSCRIPT

Page 1: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

The Artful Business of Data Mining

Computational Statistics with Open Source Tools

Wednesday 20 March 13

Page 2: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

David Coallier
@davidcoallier

Page 3: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Data Scientist
At Engine Yard (.com)

Page 4: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Find Data

Page 5: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Clean Data

Page 6: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Analyse Data?

Page 7: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Analyse Data

Page 8: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Question Data

Page 9: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Report Findings

Page 10: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Data Scientist

Page 11: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Data Janitor

Page 12: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Actual Tasks

Page 13: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“If your model is elegant, it’s probably wrong”

Page 14: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“The Times they are a-Changing”

— Bob Dylan

Page 15: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Python & R

Page 16: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

SciPy
http://www.scipy.org

Page 17: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scipy.stats

Page 18: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scipy.stats
Descriptive Statistics

Page 19: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]

# Prints nobs, minmax, mean, variance, skewness and kurtosis
print(describe(s))
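A minimal sketch of reading individual statistics out of the `describe` result, assuming a reasonably recent SciPy where the result is a named tuple. The variable names are illustrative and not from the talk.

```python
from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]
summary = describe(s)

# The result is a named tuple, so each statistic can be read by name.
n_obs = summary.nobs
low, high = summary.minmax
mean = summary.mean
variance = summary.variance  # sample variance (ddof=1 by default)

print(n_obs, low, high, mean, variance)
```

Note that the reported variance is the sample variance, so it divides by n - 1 rather than n.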

Page 20: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scipy.stats
Probability Distributions

Page 21: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Example
Poisson Distribution

Page 22: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

$$f(k;\lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} \quad \text{for } k \ge 0$$

Page 23: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

from scipy.stats import poisson
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)

Page 24: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

print(p.mean())
print(p.sum())
...
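As a sanity check, the Poisson mass function λ^k e^(-λ) / k! can be evaluated by hand and compared with `poisson.pmf`. This is a sketch only; the helper `poisson_pmf` and the chosen values of `k` are illustrative assumptions, not part of the talk.

```python
import math
from scipy.stats import poisson

lam = 2  # rate parameter, matching the second argument used on the slide
ks = [1, 2, 3, 4]

def poisson_pmf(k, lam):
    """Evaluate lambda^k * exp(-lambda) / k! directly."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

by_hand = [poisson_pmf(k, lam) for k in ks]
by_scipy = poisson.pmf(ks, lam)

# Print both values side by side for each k.
for k, a, b in zip(ks, by_hand, by_scipy):
    print(k, round(a, 6), round(float(b), 6))
```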

Page 25: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NumPy
http://www.numpy.org/

Page 26: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NumPy
Linear Algebra

Page 27: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

Page 28: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

import numpy as np
x = np.array([[1, 0], [0, 1]])
val, vec = np.linalg.eig(x)  # eig returns (eigenvalues, eigenvectors)
np.linalg.eigvals(x)  # eigenvalues only

Page 29: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

>>> np.linalg.eig(x)
(array([ 1.,  1.]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))
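The identity matrix makes the output easy to read but hides how eigenvalues and eigenvectors pair up. A small sketch with a hypothetical symmetric matrix (not from the talk) shows that column `i` of the second return value belongs to eigenvalue `i`.

```python
import numpy as np

# A hypothetical symmetric matrix, chosen so the eigenvalues are not all equal.
a = np.array([[2.0, 1.0],
              [1.0, 2.0]])

values, vectors = np.linalg.eig(a)

# Column i of `vectors` pairs with values[i]: a @ v equals value * v.
for i in range(len(values)):
    v = vectors[:, i]
    assert np.allclose(a @ v, values[i] * v)

print(sorted(values))
```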

Page 30: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Matplotlib
Python Plotting

Page 31: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

statsmodels
Advanced Statistics Modeling

Page 32: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NLTK
Natural Language Toolkit

Page 33: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

scikit-learn
Machine Learning

Page 34: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

clf.predict([[2., 2.]])
>>> array([1])
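A slightly extended sketch of the same toy classifier, assuming a current scikit-learn. It also checks the training points and asks for class probabilities; `random_state=0` is an added assumption to keep the run repeatable.

```python
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, Y)

# The training points themselves are classified correctly.
train_pred = clf.predict(X)

# A point beyond [1, 1] falls on the same side of the split as class 1.
new_pred = clf.predict([[2.0, 2.0]])
new_proba = clf.predict_proba([[2.0, 2.0]])

print(train_pred, new_pred, new_proba)
```

With only two training points the tree has a single split, so this illustrates the API rather than a realistic model.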

Page 35: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

PyBrain
... Machine Learning

Page 36: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

PyMC
Bayesian Inference

Page 37: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Pattern
Web Mining for Python

Page 38: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

NetworkX
Study Networks

Page 39: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

MILK
MOAR machine LEARNING!

Page 40: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Pandas
easy-to-use data structures

Page 41: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

from pandas import DataFrame
x = DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])

# Count and average the rows where age is above 20
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
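The filter on the slide works in two steps, which the following sketch spells out. The names `people`, `over_20` and `selected` are illustrative and not from the talk.

```python
import pandas as pd

people = pd.DataFrame({"age": [26, 19, 21, 18]})

# A comparison on a column yields a boolean Series...
over_20 = people["age"] > 20

# ...which selects the matching rows when used as an index.
selected = people[over_20]

count = int(selected["age"].count())
mean_age = float(selected["age"].mean())
print(count, mean_age)
```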

Page 42: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

R

Page 43: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

RStudio
The IDE

Page 44: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

lubridate and zoo
Dealing with Dates...

Page 45: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

yy/mm/dd mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd 1363784094.513425
yy/mm different timezone
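The slide lists the formats lubridate and zoo are meant to tame in R. As a rough Python analogue using only the standard library, each format needs its own pattern. The sample strings below are hypothetical values invented for illustration; only the Unix timestamp is taken from the slide.

```python
from datetime import datetime, timezone

# Hypothetical sample values, one per string format named on the slide.
samples = [
    ("13/03/20", "%y/%m/%d"),
    ("03/20/13", "%m/%d/%y"),
    ("2013-03-20 12:54:54 +0000", "%Y-%m-%d %H:%M:%S %z"),
    ("13-03-20", "%y-%m-%d"),
]

parsed = [datetime.strptime(text, fmt) for text, fmt in samples]

# A Unix timestamp needs no format string, only an explicit time zone.
from_epoch = datetime.fromtimestamp(1363784094.513425, tz=timezone.utc)

for value in parsed + [from_epoch]:
    print(value.date())
```

All five values resolve to the same calendar day, which is the point: the representation varies far more than the information does.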

Page 46: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

reshape2
Reshape your Data

Page 47: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

ggplot2
Visualise your Data

Page 48: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

RCurl, RJSONIO
Find more Data

Page 49: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Hmisc
Miscellaneous useful functions

Page 50: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

forecast
Can you guess?

Page 51: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

garch
And rugarch

Page 52: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

quantmod
Statistical Financial Trading

Page 53: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

xts
Extensible Time Series

Page 54: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

igraph
Study Networks

Page 55: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

maptools
Read & View Maps

Page 56: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

library(maps)  # map() is provided by the maps package
map('state', region = row.names(USArrests),
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
    fill = TRUE)

Page 57: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Storage

Page 58: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Oppose
“big” Data

Page 59: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“Learn how to sample”
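One way to read this advice is that a modest random sample often answers the same question as the full data set. The sketch below is an illustration under stated assumptions: the "population" is simulated, and the sizes, seed and distribution are arbitrary choices, not figures from the talk.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 200,000 simulated response times (mean about 200).
population = [rng.expovariate(1 / 200.0) for _ in range(200_000)]

# A simple random sample without replacement, 1% of the population.
sample = rng.sample(population, 2_000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
print(round(pop_mean, 1), round(sample_mean, 1))
```

The two means should land close together, while the sample is a hundred times cheaper to store and process.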

Page 60: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Experiments

Page 61: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

What Do You Want to Answer?

Page 62: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Understand Your Audience

Page 63: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Scientific Reporting

Page 64: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Busy-ness
Time is money

Page 65: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Public Visualisation

Page 66: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Best Visualisation,
Bad Data

Page 67: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Best Forecasting models...
Bad Visualisation

Page 68: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 69: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 70: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Seanchaí

Page 71: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 72: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Feel it

Page 73: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 74: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 75: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Page 76: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

“Don’t be scared of bar charts.”

Page 77: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

Mathematical Statistics
Engineering Business
Economics
Curiosity

Page 78: The Artful Business of Data Mining: Computational Statistics with Open Source Tools

davidcoallier.github.com
@davidcoallier on Twitter
