
CLOJURE FOR DATA SCIENCE

WHY AM I GIVING THIS TALK?

I am in the final stages of writing Clojure for Data Science.

It will be published by http://packtpub.com later this year.

AM I QUALIFIED?

I co-founded and was CTO of a data analytics company.

I am a software engineer, not a statistician.

WHY IS DATA SCIENCE IMPORTANT?

The robots are coming!

The rise of the computational developer.

These trends influence the kinds of systems we are all expected to build.

WHY CLOJURE?

Clojure lends itself to interactive exploration and learning.

It has fantastic data-manipulation abstractions.

The JVM hosts many of the workhorse data storage and processing frameworks.

WHAT I WILL COVER

Distributions
Statistics
Visualisation with Quil
Correlation
Simple linear regression
Multivariable linear regression with Incanter
Break
Categorical data
Bayes classification
Logistic regression with Apache Commons Math
Clustering with Parkour and Apache Mahout

FOLLOW ALONG

The book's GitHub is available at

http://github.com/clojuredatascience

ch1-introduction
ch2-statistical-inference
ch3-linear-regression
ch5-classification
ch6-clustering

PART 1: SPOTTING ELECTION FRAUD

LOADING UK ELECTION DATA

Using Incanter's excel namespace:

(ns cljds.ch1.data
  (:require [incanter [core :as i] [excel :as xls]]
            [clojure.java.io :as io]))

(defn uk-data []
  (-> (io/resource "UK2010.xls")
      (str)
      (xls/read-xls)))

(i/view (uk-data))

IF YOU'RE FOLLOWING ALONG

git clone git@github.com:clojuredatascience/ch1-introduction.git

cd ch1-introduction

script/download-data.sh

lein run -e 1.1

COLUMN NAMES

(defn ex-1-1 []
  (i/col-names (uk-data)))

;; => ["Press Association Reference" "Constituency Name" "Region" "Election Year" "Electorate" "Votes" "AC" "AD" "AGS" "APNI" "APP" "AWL" "AWP" "BB" "BCP" "Bean" "Best" "BGPV" "BIB" "BIC" "Blue" "BNP" "BP Elvis" "C28" "Cam Soc" "CG" "Ch M" "Ch P" "CIP" "CITY" "CNPG" "Comm" "Comm L" "Con" "Cor D" "CPA" "CSP" "CTDP" "CURE" "D Lab" "D Nat" "DDP" "DUP" "ED" "EIP" "EPA" "FAWG" "FDP" "FFR" "Grn" "GSOT" "Hum" "ICHC" "IEAC" "IFED" "ILEU" "Impact" "Ind1" "Ind2" "Ind3" "Ind4" "Ind5" "IPT" "ISGB" "ISQM" "IUK" "IVH" "IZB" "JAC" "Joy" "JP" "Lab" "Land" "LD" "Lib" "Libert" "LIND" "LLPB" "LTT" "MACI" "MCP" "MEDI" "MEP" "MIF" "MK" "MPEA" "MRLP" "MRP" "Nat Lib" "NCDV" "ND" "New" "NF" "NFP" "NICF" "Nobody" "NSPS" "PBP" "PC" "Pirate" "PNDP" "Poet" "PPBF" "PPE" "PPNV" "Reform" "Respect" "Rest" "RRG" "RTBP" "SACL" "Sci" "SDLP" "SEP" "SF" "SIG" "SJP" "SKGP" "SMA" "SMRA" "SNP" "Soc" "Soc Alt" "Soc Dem" "Soc Lab" "South" "Speaker" "SSP" "TF" "TOC" "Trust" "TUSC" "TUV" "UCUNF" "UKIP" "UPS" "UV" "VCCA" "Vote" "Wessex Reg" "WRP" "You" "Youth" "YRDPL"]

ELECTORATE

(defn uk-electorate []
  (->> (uk-data)
       (i/$ "Electorate")
       (remove nil?)))

VARIANCE

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2

…EXPLAINED

\sum is `(reduce + …)`.

\sum_{i=1}^{n} is "for all xs".

(x_i - \mu_x)^2 is a function of x and the mean of x.

;; mean is needed by variance (not shown on the slide):
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

(defn variance [xs]
  (let [m (mean xs)
        n (count xs)
        square-error (fn [x] (Math/pow (- x m) 2))]
    (/ (reduce + (map square-error xs)) n)))

HISTOGRAM

(require '[incanter.charts :as c])

(defn ex-1-11 []
  (-> (uk-electorate)
      (c/histogram :nbins 20)
      (i/view)))

DISTRIBUTIONS AS MODELS

[PDF] http://cljds.com/matura-2013

POINCARÉ'S BREAD

Poincaré weighed his bread every day for a year.

He discovered that the weights of the bread followed a normal distribution, but that the peak was at 950g, whereas loaves of bread were supposed to be regulated at 1kg. He reported his baker to the authorities.

The next year Poincaré continued to weigh his bread from the same baker, who was now wary of giving him the lighter loaves. After a year the mean loaf weight was 1kg, but this time the distribution had a positive skew. This is consistent with the baker giving Poincaré only the heaviest of his loaves. The baker was reported to the authorities again.

HONEST BAKER

(require '[incanter.distributions :as d])

(defn honest-baker []
  (let [distribution (d/normal-distribution 1000 30)]
    (repeatedly #(d/draw distribution))))

(defn ex-1-16 []
  (-> (take 10000 (honest-baker))
      (c/histogram :nbins 25)
      (i/view)))

DISHONEST BAKER

(defn dishonest-baker []
  (let [distribution (d/normal-distribution 950 30)]
    (->> (repeatedly #(d/draw distribution))
         (partition 13)
         (map (partial apply max)))))

(defn ex-1-17 []
  (-> (take 10000 (dishonest-baker))
      (c/histogram :nbins 25)
      (i/view)))

THE IMPORTANCE OF VISUALISATION

Anscombe's Quartet: all have identical mean and variance.

SELECTION

(defn filter-election-year [data]
  (i/$where {"Election Year" {:$ne nil}} data))

(defn filter-victor-constituencies [data]
  (i/$where {"Con" {:$fn number?}
             "LD"  {:$fn number?}} data))

PROJECTION

(->> (uk-data)
     (filter-election-year)
     (filter-victor-constituencies)
     (i/$ ["Region" "Electorate" "Con" "LD"])
     (i/add-derived-column "Victors" ["Con" "LD"] +)
     (i/add-derived-column "Victors Share" ["Victors" "Electorate"] /)
     (i/view))

TWO VARIABLES: SCATTER PLOTS!

(defn ex-1-33 []
  (let [data (->> (uk-data)
                  (clean-uk-data)
                  (derive-uk-data))]
    (-> (scatter-plot ($ "Turnout" data)
                      ($ "Victors Share" data)
                      :x-label "Turnout"
                      :y-label "Victor's Share")
        (view))))

UK VOTES SCATTER

RUSSIA IS A BIGGER COUNTRY

WE NEED A BETTER VISUALIZATION

BINNING DATA

(defn bin [n-bins xs]
  (let [min-x (apply min xs)
        range-x (- (apply max xs) min-x)
        max-bin (dec n-bins)
        bin-fn (fn [x]
                 (-> x
                     (- min-x)
                     (/ range-x)
                     (* n-bins)
                     (int)
                     (min max-bin)))]
    (map bin-fn xs)))

(defn ex-1-10 []
  (->> (uk-electorate)
       (bin 10)
       (frequencies)))

;; => {0 1, 1 1, 2 4, 3 22, 4 130, 5 320, 6 156, 7 15, 9 1}

A 2D HISTOGRAM

(defn histogram-2d [xs ys n-bins]
  (-> (map vector (bin n-bins xs) (bin n-bins ys))
      (frequencies)))

(defn uk-histogram-2d []
  (let [data (->> (uk-data)
                  (clean-uk-data)
                  (derive-uk-data))]
    (histogram-2d ($ "Turnout" data)
                  ($ "Victors Share" data)
                  5)))

;; => {[2 1] 59, [3 2] 91, [4 3] 32, [1 0] 8, [2 2] 89, [3 3] 101, [4 4] 60, [0 0] 2, [1 1] 22, [2 3] 19, [3 4] 53, [0 1] 6, [1 2] 15, [2 4] 5, [1 3] 2, [0 3] 1, [3 0] 6, [4 1] 3, [3 1] 17, [4 2] 17, [2 0] 23}

QUIL

http://quil.info

VISUALIZATION WITH QUIL

(require '[quil.core :as q])

(defn ratio->grayscale [f]
  (-> f
      (* 255)
      (int)
      (min 255)
      (max 0)
      (q/color)))

(defn draw-histogram [data {:keys [n-bins size]}]
  (let [[width height] size
        x-scale (/ width n-bins)
        y-scale (/ height n-bins)
        max-value (apply max (vals data))
        setup (fn []
                (doseq [x (range n-bins)
                        y (range n-bins)]
                  (let [v (get data [x y] 0)
                        x-pos (* x x-scale)
                        y-pos (- height (* y y-scale))]
                    (q/fill (ratio->grayscale (/ v max-value)))
                    (q/rect x-pos y-pos x-scale y-scale))))]
    (q/sketch :setup setup :size size)))

A 2D HISTOGRAM

A COLOUR HEATMAP

Interpolate between the colours of the spectrum.

(defn ratio->heat [f]
  (let [colors [(q/color 0 0 255)     ;; blue
                (q/color 0 255 255)   ;; turquoise
                (q/color 0 255 0)     ;; green
                (q/color 255 255 0)   ;; yellow
                (q/color 255 0 0)]    ;; red
        f (-> f
              (max 0.000)
              (min 0.999)
              (* (dec (count colors))))]
    (q/lerp-color (nth colors f)
                  (nth colors (inc f))
                  (rem f 1))))

A FINISHED HEATMAP

$> lein run -e 1.36

CREDIT

"Statistical Detection of Election Irregularities", Proceedings of the National Academy of Sciences, by a team led by Santa Fe Institute External Professor Stefan Thurner.

INFERENCE

WHAT ARE STATISTICS ANYWAY?

They are estimates of values, based on a sample.

WHAT ARE PARAMETERS?

They are the true values, based on the entire population.

SAMPLE SIZE

The values converge as the sample size increases.

We can often only infer the population parameters.

Sample   Population
n        N
X̄        μX
SX       σX

REAGENT

(def population-mean 100)
(def population-sd 20)
(def sample-size 10)

REAGENT ATOMS

(require '[reagent.core :as r])

(defn randn [mean sd] (.. js/jStat -normal (sample mean sd)))

(defn normal-distribution [mean sd] (repeatedly #(randn mean sd)))

(def state (r/atom {:sample []}))

(defn update-sample! [state]
  (swap! state assoc :sample
         (->> (normal-distribution population-mean population-sd)
              (map int)
              (take sample-size))))

CREATE THE WIDGETS

(defn new-sample [state]
  [:button {:on-click #(update-sample! state)} "New Sample"])

(defn sample-list [state]
  [:div
   (let [sample (:sample @state)]
     [:div
      [:ul (for [n sample] [:li n])]
      [:dl
       [:dt "Sample Mean:"]
       [:dd (mean sample)]]])])

LAY OUT THE INTERFACE

(defn layout-interface []
  [:div
   [:h1 "Normal Sample"]
   [new-sample state]
   [sample-list state]])

;; Render the root component
(defn run []
  (r/render-component [layout-interface]
                      (.getElementById js/document "root")))

COMPILE THE CLJS

$> lein cljsbuild once

DEMO

STANDARD ERROR

It's the standard deviation of the sample means.

SE = \frac{\sigma_X}{\sqrt{n}}
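The deck later calls standard-deviation without defining it. A minimal sketch (mine, not from the slides), building on the variance function defined earlier:

(defn standard-deviation [xs]
  (Math/sqrt (variance xs)))

(defn standard-error [xs]
  (/ (standard-deviation xs)
     (Math/sqrt (count xs))))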

STANDARD ERROR

Let's see how the standard error changes with sample size.

DEMO

STANDARD DEVIATION PROPORTIONS

SMALL SAMPLES

The standard error is calculated from the population standard deviation, but we don't know it!

In practice the sample and population standard deviations are assumed to be the same above around 30 samples, but there is another distribution that models the loss of precision with small samples.

T-DISTRIBUTION

CALCULATING THE T-STATISTIC

Based entirely on our sample statistics:
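In symbols (a standard formulation, consistent with the code below):

t = \frac{\bar{x} - \mu_0}{s_x / \sqrt{n}}

where \mu_0 is the test mean, \bar{x} the sample mean, s_x the sample standard deviation and n the sample size.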

(defn t-statistic [sample test-mean]
  (let [sample-mean (mean sample)
        sample-size (count sample)
        sample-sd (standard-deviation sample)]
    (/ (- sample-mean test-mean)
       (/ sample-sd (Math/sqrt sample-size)))))

DEMO

WHY THIS INTEREST IN MEANS?

Because often when we want to know if a difference in populations is statistically significant, we'll compare the means.

HYPOTHESIS TESTING

By convention the data is assumed not to support what the researcher is looking for.

This conservative assumption is called the null hypothesis and denoted h₀.

The alternate hypothesis, h₁, can then only be supported with a given confidence interval.

SIGNIFICANCE

The greater the significance of a result, the more certainty we have that the null hypothesis can be rejected.

Let's use our range controller to adjust the significance threshold.

DEMO

PREDICTION

QUESTION

What was Olympic swimmer Mark Spitz' competition weight?

POPULATION OF OLYMPIC SWIMMERS

The Guardian has helpfully provided data on the vital statistics of Olympians:

http://www.theguardian.com/sport/datablog/2012/aug/07/olympics-2012-athletes-age-weight-height#data

WEIGHT HISTOGRAM

LOG-WEIGHT HISTOGRAM

LOG-NORMAL DISTRIBUTION

"A variable might be modeled as log-normal if it can be thought of as the multiplicative product of many independent random variables, each of which is positive. This is justified by considering the central limit theorem in the log-domain."

SCATTER PLOT

WHAT IS CORRELATION ANYWAY?

We've talked about variance in the context of the normal distribution.

COVARIANCE

How much things vary together!

\mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)
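A covariance function is called later (in pearsons-correlation) but never shown; a minimal Clojure sketch of the formula above (mine, not from the slides), assuming the mean helper defined earlier:

(defn covariance [xs ys]
  (let [mx (mean xs)
        my (mean ys)]
    (/ (reduce + (map (fn [x y]
                        (* (- x mx) (- y my)))
                      xs ys))
       (count xs))))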

CORRELATION

A few ways of measuring it, depending on whether your data is continuous or discrete.

http://xkcd.com/552/

PEARSON'S CORRELATION

Covariance divided by the product of standard deviations. It measures linear correlation.

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}

(defn pearsons-correlation [x y]
  (/ (covariance x y)
     (* (standard-deviation x)
        (standard-deviation y))))

PEARSON'S CORRELATION

If r is 0, it doesn't necessarily mean that the variables are not correlated. Pearson's correlation only measures linear relationships.

THIS IS A STATISTIC

The unknown population parameter for correlation is the Greek letter \rho. We are only able to calculate the sample statistic r.

How far we can trust r as an estimate of \rho will depend on two factors:

the size of the coefficient
the size of the sample

r_{X,Y} = \frac{\mathrm{cov}(X, Y)}{s_X s_Y}

SIMPLE LINEAR REGRESSION

y = \alpha + \beta x
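The slope and intercept code on the next slide implements the standard least-squares estimates:

\beta = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}, \qquad \alpha = \bar{y} - \beta\bar{x}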

SIMPLE LINEAR REGRESSION

(defn slope [x y] (/ (covariance x y) (variance x)))

(defn intercept [x y] (- (mean y) (* (mean x) (slope x y))))

(defn predict [a b x] (+ a (* b x)))

TRAINING A MODEL

(defn swimmer-data []
  (->> (athlete-data)
       ($where {"Height, cm" {:$ne nil}
                "Weight" {:$ne nil}
                "Sport" {:$eq "Swimming"}})))

(defn ex-3-12 []
  (let [data (swimmer-data)
        heights ($ "Height, cm" data)
        weights (log ($ "Weight" data))
        a (intercept heights weights)
        b (slope heights weights)]
    (println "Intercept: " a)
    (println "Slope: " b)))

MAKING A PREDICTION

(predict 1.691 0.0143 185)

;; => 4.3365

(i/exp (predict 1.691 0.0143 185))

;; => 76.44

Corresponding to a predicted weight of 76.4kg

In 1979, Mark Spitz was 79kg.

http://www.topendsports.com/sport/swimming/profiles/spitz-mark.htm

MORE DATA!

(defn features [dataset col-names]
  (->> (i/$ col-names dataset)
       (i/to-matrix)))

(defn gender-dummy [gender] (if (= gender "F") 0.0 1.0))

(defn ex-3-26 []
  (let [data (->> (swimmer-data)
                  (i/add-derived-column "Gender Dummy" ["Sex"] gender-dummy))
        x (features data ["Height, cm" "Age" "Gender Dummy"])
        y (i/log ($ "Weight" data))
        model (s/linear-model y x)]
    (:coefs model)))

;; => [2.2307529431422637 0.010714697827121089 0.002372188749408574 0.0975412532492026]

MULTIVARIABLE LINEAR REGRESSION

y = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n

LEAST SQUARES
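Least squares here is ordinary least squares: Incanter's linear-model picks the \theta minimising the sum of squared residuals \sum_i (y_i - \theta^T x_i)^2, which has the closed-form solution \hat{\theta} = (X^T X)^{-1} X^T y. (This formula is a standard reminder, not from the slides.)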

MAKING PREDICTIONS

y = \theta^T x

(defn predict [theta x] (-> (cl/t theta) (cl/* x) (first)))

(defn ex-3-27 []
  (let [data (->> (swimmer-data)
                  (i/add-derived-column "Gender Dummy" ["Sex"] gender-dummy))
        x (features data ["Height, cm" "Age" "Gender Dummy"])
        y (i/log ($ "Weight" data))
        model (s/linear-model y x)]
    (i/exp (predict (i/matrix (:coefs model))
                    (i/matrix [1 185 22 1])))))

;; => 78.46882772631697

HOW CLOSE?

The result is around 78.47kg.

Compared to 79kg, we were pretty close!

SUMMARY

Distributions
Statistics
Visualisation with Quil
Correlation
Linear regression
Multivariate linear regression with Incanter

INTERLUDE

CLASSIFICATION

QUESTION

Did all passengers on the Titanic have an equal chance of survival?

ON THE DATA

Data is based on Thomas Cason's Titanic3 dataset.

INSPECT THE DATA

Class  Survived  Name                                  Sex     Age
1      1         Allen, Miss. Elisabeth Walton         female  29
1      0         Allison, Mr. Hudson Joshua Creighton  male    30

CATEGORICAL VARIABLES

        Survived  Perished
Male    161       682
Female  339       127

STANDARD ERROR FOR A PROPORTION

SE = \sqrt{\frac{p(1 - p)}{n}}

(defn standard-error-proportion [p n]
  (-> (- 1 p)
      (* p)
      (/ n)
      (Math/sqrt)))

p = \frac{161 + 339}{682 + 127} = \frac{500}{809} \approx 0.62

SE = 0.013
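As a quick check (mine, not from the slides), plugging the proportion above and the total passenger count n = 1309 into standard-error-proportion reproduces the quoted value:

(standard-error-proportion (/ 500 809) 1309)
;; => 0.0134 (approximately)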

HOW SIGNIFICANT?

z = \frac{p_1 - p_2}{SE}

p_1, the proportion of women who survived, is \frac{339}{446} = 0.76

p_2, the proportion of men who survived, is \frac{161}{843} = 0.19

SE: 0.013

z = 20.36

This is essentially impossible.

MORE CATEGORIES

              Survived  Perished
First Class   200       123
Second Class  119       158
Third Class   181       528

OUR APPROACH DOESN'T SCALE

We can use a χ² test.

(defn ex-5-5 []
  (let [observations (i/matrix [[200 119 181]
                                [123 158 528]])]
    (s/chisq-test :table observations)))

How likely is it that this distribution occurred by chance?

{:X-sq 127.85915643930326, :col-levels (0 1 2), :row-margins {0 500.0, 1 809.0}, :table [matrix], :p-value 1.7208259588256175E-28, :df 2, :probs nil, :col-margins {0 323.0, 1 277.0, 2 709.0}, :E (123.37662337662337 199.62337662337663 105.80595874713522 171.1940412528648 270.8174178762414 438.1825821237586), :row-levels (0 1), :two-samp? true, :N 1309.0}

P-VALUE"The estimated probability of rejecting the null hypothesis

of a study question when that hypothesis is true."h0

PROBABILITY

https://xkcd.com/1132/

BAYES RULE

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

P(A) is the initial degree of belief in A (the prior).

BAYES TITANIC

P(\mathrm{survive}|\mathrm{female}) = \frac{P(\mathrm{female}|\mathrm{survive})\,P(\mathrm{survive})}{P(\mathrm{female})}

P(\mathrm{survive}|\mathrm{female}) = \frac{\frac{339}{500} \times \frac{500}{1309}}{\frac{446}{1309}} = \frac{339}{446}

BAYES CLASSIFICATION

P(\mathrm{survive}|\mathrm{third}, \mathrm{male}) = \frac{P(\mathrm{survive})\,P(\mathrm{third}|\mathrm{survive})\,P(\mathrm{male}|\mathrm{survive})}{P(\mathrm{third}, \mathrm{male})}

P(\mathrm{perish}|\mathrm{third}, \mathrm{male}) = \frac{P(\mathrm{perish})\,P(\mathrm{third}|\mathrm{perish})\,P(\mathrm{male}|\mathrm{perish})}{P(\mathrm{third}, \mathrm{male})}

Because the evidence is the same for all classes, we can cancel this out.

PARSE THE DATA

(titanic-samples)

;; => ({:survived true, :gender :female, :class :first, :embarked "S", :age "20-30"} {:survived true, :gender :male, :class :first, :embarked "S", :age "30-40"} ...)

IMPLEMENTING A NAIVE BAYES MODEL

(defn safe-inc [v]
  (inc (or v 0)))

(defn inc-class-total [model class]
  (update-in model [class :total] safe-inc))

(defn inc-predictors-count-fn [row class]
  (fn [model attr]
    (let [val (get row attr)]
      (update-in model [class attr val] safe-inc))))

IMPLEMENTING A NAIVE BAYES MODEL

(defn assoc-row-fn [class-attr predictors]
  (fn [model row]
    (let [class (get row class-attr)]
      (reduce (inc-predictors-count-fn row class)
              (inc-class-total model class)
              predictors))))

(defn naive-bayes [data class-attr predictors]
  (reduce (assoc-row-fn class-attr predictors) {} data))

NAIVE BAYES MODEL

(defn ex-5-6 []
  (let [data (titanic-samples)]
    (pprint (naive-bayes data :survived [:gender :class]))))

…produces the following output…

;; {false
;;  {:class {:third 528, :second 158, :first 123},
;;   :gender {:male 682, :female 127},
;;   :total 809},
;;  true
;;  {:class {:third 181, :second 119, :first 198},
;;   :gender {:male 161, :female 337},
;;   :total 498}}

MAKING PREDICTIONS

(defn n [model]
  (->> (vals model)
       (map :total)
       (apply +)))

(defn conditional-probability [model test class]
  (let [evidence (get model class)
        prior (/ (:total evidence)
                 (n model))]
    (apply * prior
           (for [kv test]
             (/ (get-in evidence kv)
                (:total evidence))))))

(defn bayes-classify [model test]
  (let [probs (map (fn [class]
                     [class (conditional-probability model test class)])
                   (keys model))]
    (-> (sort-by second > probs)
        (ffirst))))

DOES IT WORK?

(defn ex-5-7 []
  (let [data (titanic-samples)
        model (naive-bayes data :survived [:gender :class])]
    (bayes-classify model {:gender :male :class :third})))

;; => false

(defn ex-5-8 []
  (let [data (titanic-samples)
        model (naive-bayes data :survived [:gender :class])]
    (bayes-classify model {:gender :female :class :first})))

;; => true

WHY NAIVE?

Because it assumes all variables are independent. We know they are not (being male and being in third class are correlated, for example), but naive Bayes weights all attributes equally.

In practice it works surprisingly well, particularly where there are large numbers of features.

LOGISTIC REGRESSION

LOGISTIC REGRESSION

Logistic regression uses similar techniques to linear regression but guarantees an output only between 0 and 1.

Linear regression: h_\theta(x) = \theta^T x

Logistic regression: h_\theta(x) = g(\theta^T x)

where the sigmoid function is

g(z) = \frac{1}{1 + e^{-z}}

THE LOGISTIC FUNCTION

THE LOGISTIC FUNCTION

;; matrix is assumed to be clojure.core.matrix:
;; (require '[clojure.core.matrix :as matrix])
(defn logistic-function [theta]
  (let [tt (matrix/transpose (vec theta))
        z (fn [x] (- (matrix/mmul tt (vec x))))]
    (fn [x]
      (/ 1
         (+ 1 (Math/exp (z x)))))))

INTERPRETATION

(let [f (logistic-function [0])]
  (f [1])   ;; => 0.5
  (f [-1])  ;; => 0.5
  (f [42])) ;; => 0.5

(let [f (logistic-function [0.2])
      g (logistic-function [-0.2])]
  (f [5])  ;; => 0.73
  (g [5])) ;; => 0.27

COST FUNCTION

Cost varies between 0 and ∞ (a big number).

(defn cost-function [y y-hat]
  (- (if (zero? y)
       (Math/log (max (- 1 y-hat) Double/MIN_VALUE))
       (Math/log (max y-hat Double/MIN_VALUE)))))

;; avg (arithmetic mean) is assumed to be defined
(defn logistic-cost [ys y-hats]
  (avg (map cost-function ys y-hats)))
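Written out (a standard formulation matching the code above), the average logistic cost over m training examples is:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]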

CONVERTING TITANIC DATA TO FEATURES

(defn titanic-features []
  (remove (partial some nil?)
          (for [row (titanic-data)]
            [(:survived row)
             (:pclass row)
             (:sibsp row)
             (:parch row)
             (if (nil? (:age row)) 30 (:age row))
             (if (= (:sex row) "female") 1.0 0.0)
             (if (= (:embarked row) "S") 1.0 0.0)
             (if (= (:embarked row) "C") 1.0 0.0)
             (if (= (:embarked row) "Q") 1.0 0.0)])))

CALCULATING THE GRADIENT

(defn gradient-fn [h-theta xs ys]
  (let [g (fn [x y]
            (matrix/mmul (- (h-theta x) y) x))]
    (->> (map g xs ys)
         (matrix/transpose)
         (map avg))))

We transpose to calculate the average for each feature across all xs, rather than the average for each x across all features.

GRADIENT DESCENT

The cost function will be lowest when the parameters are at their optimum.

APACHE COMMONS MATH

Provides heavy lifting for running tasks like gradient descent.

(:import [org.apache.commons.math3.analysis
          MultivariateFunction MultivariateVectorFunction]
         [org.apache.commons.math3.optim
          InitialGuess MaxEval SimpleBounds OptimizationData
          SimpleValueChecker PointValuePair]
         [org.apache.commons.math3.optim.nonlinear.scalar
          ObjectiveFunction ObjectiveFunctionGradient GoalType]
         [org.apache.commons.math3.optim.nonlinear.scalar.gradient
          NonLinearConjugateGradientOptimizer
          NonLinearConjugateGradientOptimizer$Formula])

CLOJURE'S JAVA INTEROP

An object wrapper to represent a function: too many levels of indirection?!

(defn objective-function [f]
  (ObjectiveFunction.
   (reify MultivariateFunction
     (value [_ v]
       (apply f (vec v))))))

(defn objective-function-gradient [f]
  (ObjectiveFunctionGradient.
   (reify MultivariateVectorFunction
     (value [_ v]
       (double-array (apply f (vec v)))))))

GRADIENT DESCENT

(defn make-ncg-optimizer []
  (NonLinearConjugateGradientOptimizer.
   NonLinearConjugateGradientOptimizer$Formula/FLETCHER_REEVES
   (SimpleValueChecker. (double 1e-6) (double 1e-6))))

(defn initial-guess [guess]
  (InitialGuess. (double-array guess)))

(defn max-evaluations [n]
  (MaxEval. n))

(defn gradient-descent [f g estimate n]
  (let [options (into-array OptimizationData
                            [(objective-function f)
                             (objective-function-gradient g)
                             (initial-guess estimate)
                             (max-evaluations n)
                             GoalType/MINIMIZE])]
    (-> (make-ncg-optimizer)
        (.optimize options)
        (.getPoint)
        (vec))))

RUNNING GRADIENT DESCENT

(defn run-logistic-regression [data initial-guess]
  (let [points data ;; use the data argument (the slide re-computed (titanic-features) here)
        xs (->> points (map rest) (map #(cons 1 %)))
        ys (map first points)]
    (gradient-descent
     (fn [& theta]
       (let [f (logistic-function theta)]
         (logistic-cost (map f xs) ys)))
     (fn [& theta]
       (gradient-fn (logistic-function theta) xs ys))
     initial-guess
     2000)))

PRODUCING A MODEL

(defn ex-5-11 []
  (let [data (titanic-features)
        initial-guess (-> data
                          first
                          count
                          (take (repeatedly rand)))]
    (run-logistic-regression data initial-guess)))

MAKING PREDICTIONS

(def theta
  [0.690807824623404
   -0.9033828001369435
   -0.3114375278698766
   -0.01894319673287219
   -0.03100315579768661
   2.5894858366033273
   0.7939190708193374
   1.3711334887947388
   0.6672555257828919])

(defn round [x] (Math/round x))

(def logistic-model (logistic-function theta))

(defn ex-5-13 []
  (let [data (titanic-features)
        test (fn [x]
               (= (round (logistic-model (cons 1 (rest x))))
                  (round (first x))))
        results (frequencies (map test data))]
    (/ (get results true)
       (apply + (vals results)))))

;; => 1030/1309

EVALUATING THE CLASSIFIER

Cross-validation: we want to separate our test and training data sets.

Bias vs variance: your model may fail to generalise.

CLUSTERING

CLUSTERING

Find a grouping of a set of objects such that objects in the same group are more similar to each other than to those in other groups.

SIMILARITY MEASURES

Many to choose from: Jaccard, Euclidean.

For text documents the cosine measure is often chosen:

Good for high-dimensional spaces.
In positive spaces the similarity is between 0 and 1.

COSINE SIMILARITY

\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}

(defn cosine [a b]
  (let [dot-product (->> (map * a b)
                         (apply +))
        magnitude (fn [d]
                    (->> (map #(Math/pow % 2) d)
                         (apply +)
                         Math/sqrt))]
    (/ dot-product (* (magnitude a) (magnitude b)))))
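A quick REPL check (my example, not from the slides):

(cosine [3 0] [3 4])
;; => 0.6 (dot product 9, divided by magnitudes 3 × 5)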

CREATING SPARSE VECTORS

;; r is clojure.core.reducers, required later in the deck as
;; (require '[clojure.core.reducers :as r])

(def dictionary (atom {:count 0
                       :words {}}))

(defn add-word-to-dict [dict word]
  (if (get-in dict [:words word])
    dict
    (-> dict
        (update-in [:words] assoc word (get dict :count))
        (update-in [:count] inc))))

(defn update-words [dict doc word]
  (let [word-id (-> (swap! dict add-word-to-dict word)
                    (get-in [:words word]))]
    (update-in doc [word-id] #(inc (or % 0)))))

(defn document-vector [dict ngrams]
  (r/reduce (partial update-words dict) {} ngrams))

EXAMPLE

(->> (clojure.string/split "the quick brown fox jumps over the lazy dog" #"\W+")
     (document-vector dictionary))

;; => {7 1, 6 1, 5 1, 4 1, 3 1, 2 1, 1 1, 0 2}

@dictionary

;; => {:words {"dog" 7, "lazy" 6, "over" 5, "jumps" 4, "fox" 3, "brown" 2, "quick" 1, "the" 0}, :count 8}

STEMMING / STOPWORDS

http://clojars.org/stemmers

(stemmer/stems "it's lovely that you're musical")

;; => ("love" "music")

WHY?

(cosine-sparse
 (->> "music is the food of love"
      stemmer/stems
      (document-vector dictionary))
 (->> "war is the locomotive of history"
      stemmer/stems
      (document-vector dictionary)))

;; => 0.0

(cosine-sparse
 (->> "music is the food of love"
      stemmer/stems
      (document-vector dictionary))
 (->> "it's lovely that you're musical"
      stemmer/stems
      (document-vector dictionary)))

;; => 0.8164965809277259

EXAMPLE

(->> "it's lovely that you're musical"
     stemmer/stems
     (document-vector dictionary))

;; => {0 1, 2 1}

@dictionary

;; => {:count 6, :words {"histori" 5, "locomot" 4, "war" 3, "love" 2, "food" 1, "music" 0}}

MAHOUT

http://mahout.apache.org/

"The Apache Mahout™ project's goal is to build an environment for quickly creating scalable performant machine learning applications."

GET THE DATA

We're going to be clustering the Reuters dataset. Follow the readme instructions:

brew install mahout
script/download-reuters.sh
lein run -e 6.7
mahout seqdirectory -i data/reuters-txt -o data/reuters-sequencefile

VECTOR REPRESENTATION

Each document is converted into a vector representation.

All vectors share a dictionary providing a unique index for each word.

SEQUENCEFILES

Input (key / value):
org.apache.hadoop.io.Text / org.apache.hadoop.io.Text

Output, Vectors (key / value):
org.apache.hadoop.io.Text / org.apache.mahout.math.VectorWritable

Output, Dictionary (key / value):
org.apache.hadoop.io.Text / org.apache.mahout.math.IntWritable

PARKOUR

Parkour is a Clojure library for interacting with Hadoop.

It provides a thinner layer of abstraction than PigPen and Cascalog.

TF-IDF

Term frequency, inverse document frequency.

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)
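The idf term isn't expanded on the slide; the standard definition, consistent with the (Math/log (/ n df)) used later in words-idf-m, is:

\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

where N is the number of documents and the denominator counts the documents containing term t.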

WE NEED A UNIQUE ID

And we need to compute it in parallel.

PARKOUR MAPPING

(require '[clojure.core.reducers :as r]
         '[parkour.mapreduce :as mr])

(defn document->terms [doc]
  (clojure.string/split doc #"\W+"))

(defn document-count-m
  "Emits the unique words from each document"
  {::mr/source-as :vals}
  [documents]
  (->> documents
       (r/mapcat (comp distinct document->terms))
       (r/map #(vector % 1))))

SHAPE METADATA

:keyvals ;; Re-shape as vectors of key-val pairs.
:keys    ;; Just the keys from each key-value pair.
:vals    ;; Just the values from each key-value pair.

PLAIN OLD FUNCTIONS

(->> (document-count-m ["it's lovely that you're musical"
                        "music is the food of love"
                        "war is the locomotive of history"])
     (into []))

;; => [["love" 1] ["music" 1] ["music" 1] ["food" 1] ["love" 1] ["war" 1] ["locomot" 1] ["histori" 1]]

AND REDUCING…

(require '[parkour.io.dux :as dux]
         '[transduce.reducers :as tr])

(defn unique-index-r
  {::mr/source-as :keyvalgroups,
   ::mr/sink-as dux/named-keyvals}
  [coll]
  (let [global-offset (conf/get-long mr/*context*
                                     "mapred.task.partition" -1)]
    (tr/mapcat-state
     (fn [local-offset [word counts]]
       [(inc local-offset)
        (if (identical? ::finished word)
          [[:counts [global-offset local-offset]]]
          [[:data [word [[global-offset local-offset]
                         (apply + counts)]]]])])
     0
     (r/mapcat identity [coll [[::finished nil]]]))))

CREATING A JOB

(require '[parkour.graph :as pg]
         '[parkour.avro :as mra]
         '[abracad.avro :as avro])

(def long-pair (avro/tuple-schema [:long :long]))
(def index-value (avro/tuple-schema [long-pair :long]))

(defn df-j [dseq]
  (-> (pg/input dseq)
      (pg/map #'document-count-m)
      (pg/partition (mra/shuffle [:string :long]))
      (pg/reduce #'unique-index-r)
      (pg/output :data (mra/dsink [:string index-value])
                 :counts (mra/dsink [:long :long]))))

WRITING TO DISTRIBUTED CACHE

(require '[parkour.io.dval :as dval])

(defn calculate-offsets
  "Build map of offsets from dseq of counts."
  [dseq]
  (->> dseq
       (into [])
       (sort-by first)
       (reductions (fn [[_ t] [i n]]
                     [(inc i) (+ t n)])
                   [0 0])
       (into {})))

(defn df-execute [conf dseq]
  (let [[df-data df-counts] (pg/execute (df-j dseq) conf `df)
        offsets-dval (dval/edn-dval (calculate-offsets df-counts))]
    ...))
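A worked example (mine, not from the slides): given [partition, count] pairs, the reductions build a running total keyed by partition id, so each partition knows where its ids start:

(calculate-offsets [[0 2] [1 3]])
;; => {0 0, 1 2, 2 5}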

READING FROM DISTRIBUTED CACHE

(defn global-id
  "Use offsets to calculate unique id from global and local offset"
  [offsets [global-offset local-offset]]
  (+ local-offset (get offsets global-offset)))

(defn words-idf-m
  "Calculate the unique id and inverse document frequency for each word"
  {::mr/sink-as :keys}
  [offsets-dval n coll]
  (let [offsets @offsets-dval]
    (r/map (fn [[word [word-offset df]]]
             [word (global-id offsets word-offset) (Math/log (/ n df))])
           coll)))

(defn make-dictionary [conf df-data df-counts doc-count]
  (let [offsets-dval (dval/edn-dval (calculate-offsets df-counts))]
    (-> (pg/input df-data)
        (pg/map #'words-idf-m offsets-dval doc-count)
        (pg/output (mra/dsink [words]))
        (pg/fexecute conf `idf)
        (->> (r/map parse-idf)
             (into {}))
        (dval/edn-dval))))

CREATING TEXT VECTORS

(import '[org.apache.mahout.math RandomAccessSparseVector])

(defn create-sparse-vector [dictionary [id doc]]
  (let [vector (RandomAccessSparseVector. (count dictionary))]
    (doseq [[term freq] (-> doc document->terms frequencies)]
      (let [term-info (get dictionary term)]
        (.setQuick vector (:id term-info)
                   (* freq (:idf term-info)))))
    [id vector]))

(defn create-vectors-m [dictionary coll]
  (let [dictionary @dictionary]
    (r/map #(create-sparse-vector dictionary %)
           coll)))

THE FINISHED JOB

(import '[org.apache.hadoop.io Text]
        '[org.apache.mahout.math VectorWritable])

(defn tfidf [conf dseq dictionary-path vector-path]
  (let [doc-count (->> dseq (into []) count)
        [df-data df-counts] (pg/execute (df-j dseq) conf `df)
        dictionary-dval (make-dictionary conf df-data df-counts doc-count)]
    (write-dictionary dictionary-path dictionary-dval)
    (-> (pg/input dseq)
        (pg/map #'create-vectors-m dictionary-dval)
        (pg/output (seqf/dsink [Text VectorWritable] vector-path))
        (pg/fexecute conf `vectorize))))

(defn tool [conf input output]
  (let [dseq (seqf/dseq input)
        dictionary-path (doto (str output "/dictionary") fs/path-delete)
        vector-path (doto (str output "/vectors") fs/path-delete)]
    (tfidf conf dseq dictionary-path vector-path)))

(defn -main [& args]
  (System/exit (tool/run tool args)))

RUN THE JOB

(defn ex-6-14 []
  (let [input "data/reuters-sequencefile"
        output "data/parkour-vectors"]
    (tool/run vectorizer/tool [input output])))

K-MEANS


RUNNING CLUSTERING

script/run-kmeans.sh

#!/bin/bash

WORK_DIR=data
INPUT_DIR=${WORK_DIR}/parkour-vectors

mahout kmeans \
    -i ${INPUT_DIR}/vectors \
    -c ${WORK_DIR}/clusters-out \
    -o ${WORK_DIR}/kmeans-out \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 20 -k 5 -cd 0.01 -ow --clustering

mahout clusterdump \
    -i ${WORK_DIR}/kmeans-out/clusters-*-final \
    -o ${WORK_DIR}/clusterdump.txt \
    -d ${INPUT_DIR}/dictionary/part-r-00000 \
    -dt sequencefile \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    --pointsDir ${WORK_DIR}/kmeans-out/clusteredPoints \
    -b 100 -n 20 -sp 0 -e

HOW MANY CLUSTERS?

WHAT DID I LEAVE OUT?

Cluster quality measures
Spectral and LDA clustering
Collaborative filtering with Mahout
Random forests
Spark for movie recommendations with Sparkling
Graph data with Loom and Datomic
MapReduce with Cascalog and PigPen
Adapting algorithms for massive scale
Time series and forecasting
Dimensionality reduction, feature selection
More visualisation techniques
Lots more…

BOOK

Clojure for Data Science will be available in the second half of the year from http://packtpub.com.

http://cljds.com

SLIDES

http://github.com/henrygarner/clojure-data-science

THANK YOU!

Henry Garner

@henrygarner