The Artful Business of Data Mining: Computational Statistics with Open Source Tools
TRANSCRIPT
The Artful Business of Data Mining
Computational Statistics with Open Source Tools
Wednesday 20 March 13
David Coallier, @davidcoallier
Data Scientist at Engine Yard (.com)
Find Data
Clean Data
Analyse Data?
Analyse Data
Question Data
Report Findings
Data Scientist
Data Janitor
Actual Tasks
“If your model is elegant, it’s probably wrong”
“The Times they are a-Changing”
— Bob Dylan
Python & R
scipy.stats
scipy.stats: Descriptive Statistics
from scipy.stats import describe
s = [1, 2, 1, 3, 4, 5]
print(describe(s))
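A minimal sketch of reading individual fields from the result, assuming a SciPy version in which `describe` returns a named tuple with fields such as `nobs`, `minmax`, `mean` and `variance`:

```python
from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]
result = describe(s)

# describe() returns a named tuple; the variance uses ddof=1 by default
print(result.nobs)      # number of observations
print(result.minmax)    # (smallest value, largest value)
print(result.mean)
print(result.variance)
```

Accessing fields by name avoids depending on their position in the tuple.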
scipy.stats: Probability Distributions
Example: Poisson Distribution
$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \ge 0$
from scipy.stats import poisson
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
print(p.mean())
print(p.sum())
...
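As a sanity check, the Poisson mass function λ^k e^(-λ) / k! can be computed by hand and compared with `poisson.pmf`. This is only a sketch: the helper `poisson_pmf` and the values in `ks` are illustrative and not from the slides.

```python
import math
from scipy.stats import poisson

lam = 2                  # rate parameter λ, the second argument of pmf()
ks = [0, 1, 2, 3, 4]     # illustrative counts

def poisson_pmf(k, lam):
    """Poisson probability mass: lam^k * exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

by_hand = [poisson_pmf(k, lam) for k in ks]
by_scipy = poisson.pmf(ks, lam)

# Print both columns side by side for comparison
for k, a, b in zip(ks, by_hand, by_scipy):
    print(k, round(a, 6), round(float(b), 6))
```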
NumPy: Linear Algebra
$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
import numpy as np
x = np.array([[1, 0], [0, 1]])
# eig returns (eigenvalues, eigenvectors), in that order
val, vec = np.linalg.eig(x)
np.linalg.eigvals(x)  # eigenvalues only
>>> np.linalg.eig(x)
(array([ 1.,  1.]), array([[ 1.,  0.], [ 0.,  1.]]))
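The identity matrix makes for a trivial decomposition, so here is a small sketch with a different matrix (chosen for illustration, not from the slides) that checks the defining relation A v = λ v for each eigenpair:

```python
import numpy as np

# Example symmetric matrix, an assumption for illustration
a = np.array([[2.0, 1.0],
              [1.0, 2.0]])

values, vectors = np.linalg.eig(a)

# Column vectors[:, i] is the eigenvector paired with values[i]
for i in range(len(values)):
    v = vectors[:, i]
    assert np.allclose(a @ v, values[i] * v)

print(sorted(values))
```

Note that the eigenvectors are the columns, not the rows, of the second returned array.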
Matplotlib: Python Plotting
statsmodels: Advanced Statistics Modeling
NLTK: Natural Language Tool Kit
scikit-learn: Machine Learning
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict([[2., 2.]])  # array([1])
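A slightly larger sketch of the same fit-then-predict workflow. The toy data and its column meanings are hypothetical, and `random_state=0` is set only to make the run repeatable:

```python
from sklearn import tree

# Hypothetical toy data: [hours_studied, hours_slept] -> passed (1) or not (0)
X = [[1, 4], [2, 5], [3, 6], [6, 7], [7, 8], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

train_predictions = clf.predict(X)
print(train_predictions)
print(clf.score(X, y))  # accuracy measured on the training data itself
```

Accuracy on the training data says little about new data; a held-out test set would be the usual next step.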
PyBrain... Machine Learning
PyMC: Bayesian Inference
Pattern: Web Mining for Python
NetworkX: Study Networks
MILK: MOAR machine LEARNING!
Pandas: easy-to-use data structures
from pandas import DataFrame
x = DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
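The filtering step works through a boolean mask. A short sketch that makes the intermediate mask visible, using the same ages (the names `mask` and `older` are illustrative):

```python
import pandas as pd

x = pd.DataFrame({"age": [26, 19, 21, 18]})

mask = x["age"] > 20    # boolean Series, one entry per row
older = x[mask]         # keeps only the rows where the mask is True

print(mask.tolist())
print(older["age"].count())   # number of matching rows
print(older["age"].mean())    # their average age
```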
R
RStudio: The IDE
lubridate and zoo: Dealing with Dates...
yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm
different timezone
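The slides handle this in R with lubridate. As a rough Python sketch of the same problem, the helper below (hypothetical, standard library only) tries a fixed list of formats in order. The list and its order are assumptions, and ambiguous inputs such as 01/02/03 are read according to whichever format comes first:

```python
from datetime import datetime, timezone

# Candidate formats, tried in order (an assumption for this sketch)
FORMATS = ["%Y-%m-%d %H:%M:%S", "%y/%m/%d", "%y-%m-%d"]

def parse_date(text):
    """Return a datetime for the first interpretation that fits, else None."""
    try:
        # Bare numbers are treated as Unix timestamps in seconds (UTC)
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

for raw in ["2013-03-20 12:54:54", "13/03/20", "13-03-20",
            "1363784094.513425", "not a date"]:
    print(raw, "->", parse_date(raw))
```

Timestamps come back timezone-aware while the string formats come back naive, which is itself a small reminder of why mixed date inputs need care.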
reshape2: Reshape your Data
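reshape2 is an R package. For readers following along in Python, a loosely comparable wide-to-long reshape can be sketched with `pandas.melt`. The data below is made up for illustration, and this is an analogue rather than the reshape2 API:

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year
wide = pd.DataFrame({
    "city": ["Dublin", "Cork"],
    "2012": [10, 7],
    "2013": [12, 9],
})

# Long form: one row per (city, year) pair
long = pd.melt(wide, id_vars="city", var_name="year", value_name="count")
print(long)
```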
ggplot2: Visualise your Data
RCurl, RJSONIO: Find more Data
Hmisc: Miscellaneous useful functions
forecast: Can you guess?
garch and rugarch
quantmod: Statistical Financial Trading
xts: Extensible Time Series
igraph: Study Networks
maptools: Read & View Maps
library(maps)  # map() is provided by the maps package
map('state', region = c(row.names(USArrests)),
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
    fill = TRUE)
Storage
Oppose “big” Data
“Learn how to sample”
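One way to sample without loading everything into memory is reservoir sampling. The sketch below is a standard-library illustration of that technique, not something shown in the talk; the function name and the `seed` parameter are illustrative:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1)
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

picked = reservoir_sample(range(1_000_000), k=5, seed=42)
print(picked)
```

The stream is read once and only `k` items are held at any time, so the same function works on a file or a database cursor.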
Experiments
What Do You Want to Answer?
Understand Your Audience
Scientific Reporting
Busy-ness: Time is money
Public Visualisation
Best Visualisation, Bad Data
Best Forecasting models... Bad Visualisation
Seanchaí
Feelit
“Don’t be scared of bar charts.”
Mathematical Statistics, Engineering Business, Economics, Curiosity
davidcoallier.github.com, @davidcoallier on Twitter