
DIY ANALYTICS WITH APACHE SPARK

ADAM ROBERTS

London, 22nd June 2017: originally presented at Geecon

Important disclaimers

Copyright © 2017 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed "AS IS" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in a controlled, isolated environment. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer's responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.

Information within this presentation is accurate to the best of the author's knowledge as of the 4th of June 2017

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM's products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™, Global Business Services®, Global Technology Services®, Information on Demand, ILOG, LinuxONE™, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force®, System z® and z/OS are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners; a current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Apache Kafka and any other Apache project mentioned here and the Apache product logos including the Spark logo are trademarks of The Apache Software Foundation.

● Showing you how to get started from scratch: going from "I've heard about Spark" to "I can use it for..."

● Worked examples aplenty: lots of code
● Not intended to be scientifically accurate! Sharing ideas
● Useful reference material
● Slides will be hosted

Stick around for...

✔ Doing stuff yourself (within your timeframe and rules)

✔ Findings can be subject to bias: yours don’t have to be

✔ Trust the data instead

Motivation!

✔ Finding aliens with the SETI institute
✔ Genomics projects (GATK, Bluemix Genomics)
✔ IBM Watson services

Cool projects involving Spark

✔ Powerful machine(s)
✔ Apache Spark and a JDK
✔ Scala (recommended)
✔ Optional: a visualisation library for Spark output, e.g. Python with
  ✔ bokeh
  ✔ pandas

✔ Optional but not covered here: a notebook bundled with Spark like Zeppelin, or use Jupyter

Your DIY analytics toolkit

Toolbox from wikimedia: Tanemori derivative work: יוקיג'אנקי

Why listen to me?
● Worked on Apache Spark since 2014
● Helping IBM customers use Spark for the first time
● Resolving problems, educating service teams
● Testing on lots of IBM platforms since Spark 1.2: x86, Power, Z systems, all Java 8 deliverables...
● Fixing bugs in Spark/Java: contributing code and helping others to do so
● Working with performance tuning pros
● Code provided here has an emphasis on readability!

● What is it (why the hype)?
● How to answer questions with Spark
● Core Spark functions (the "bread and butter" stuff), plotting, correlations, machine learning
● Built-in utility functions to make our lives easier (labels, features, handling nulls)
● Examples using data from wearables: two years of activity

What I'll be covering today

Ask me later if you're interested in...
● Spark on IBM hardware
● IBM SDK for Java specifics
● Notebooks
● Spark using GPUs/GPUs from Java
● Performance tuning
● Comparison with other projects
● War stories fixing Spark/Java bugs

● You know how to write Java or Scala
● You've heard about Spark but never used it
● You have something to process!

What I assume...

This talk won’t make you a superhero!

● Know more about Spark – what it can/can't do
● Know more about machine learning in Spark
● Know that machine learning's still hard but in different ways

But you will...

Open source project (the most active for big data) offering distributed...

● Machine learning
● Graph processing
● Core operations (map, reduce, joins)

● SQL syntax with DataFrames/Datasets

✔ Build it yourself from source (requiring Git, Maven, a JDK) or

✔ Download a community built binary or
✔ Download our free Spark development package (includes IBM's SDK for Java)

Things you can process...
● File formats you could use with Hadoop
● Anything there's a Spark package for
● json, csv, parquet...

Things you can use with it...
● Kafka for streaming
● Hive tables
● Cassandra as a database
● Hadoop (using HDFS with Spark)
● DB2!

“What’s so good about it then?”

● Offers scalability and resiliency
● Auto-compression, fast serialisation, caching
● Python, R, Scala and Java APIs: eligible for Java based optimisations
● Distributed machine learning!

“Why isn’t everyone using it?”

● Can you get away with using spreadsheet software?
● Have you really got a large amount of data?
● Data preparation is very important!

How will you properly handle negative, null, or otherwise strange values in your data?

● Will you benefit from massive concurrency?
● Is the data in a format you can work with?
● Needs transforming first (and is it worth it)?

Not every problem is a Spark one!

● Not really real-time streaming ("micro-batching")
● Debugging in a largely distributed system with many moving parts can be tough

● Security: not really locked down out of the box (extra steps required by knowledgeable users: whole disk encryption or using other projects, SSL config to do...)

Implementation details...

Getting something up and running quickly

Run any Spark example in "local mode" first (from "spark"):

bin/run-example org.apache.spark.examples.SparkPi 100

Then run it on a cluster you can set up yourself:
Add hostnames in conf/slaves
sbin/start-all.sh
bin/run-example --master <your_master:7077> ...
Check for running Java processes: looking for workers/executors coming and going
Spark UI (default port 8080 on the master)

See: http://spark.apache.org/docs/latest/spark-standalone.html

lib is only with the IBM package

Running something simple

And you can use Spark's Java/Scala APIs with bin/spark-shell (a REPL!)

bin/spark-submit

java/scala -cp “$SPARK_HOME/jars/*”

PySpark not covered in this presentation – but fun to experiment with and lots of good docs online for you
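For example, a quick sanity check in the spark-shell REPL mentioned above might look like this (illustrative only, not from the original slides):

scala> val numbers = spark.range(1, 1000000)   // spark is the SparkSession the shell creates for you
scala> numbers.selectExpr("sum(id)").show()
+------------+
|     sum(id)|
+------------+
|499999500000|
+------------+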

Increasing the number of threads available for Spark processing in local mode (5.2 GB text file) – does it actually work?

--master local[1] real 3m45.328s

--master local[4] real 1m31.889s

time {
  echo "--master local[1]"
  $SPARK_HOME/bin/spark-submit --master local[1] --class MyClass WordCount.jar
}

time {
  echo "--master local[4]"
  $SPARK_HOME/bin/spark-submit --master local[4] --class MyClass WordCount.jar
}
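For reference, a minimal word count along the lines of the MyClass/WordCount.jar being timed above – a sketch only, since the talk doesn't show its source, and the input path is hypothetical:

import org.apache.spark.sql.SparkSession

object MyClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    // Hypothetical input file – swap in your own large text file
    val counts = spark.sparkContext.textFile(sys.env("HOME") + "/datasets/big.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
    println("Distinct words: " + counts.count())
    spark.stop()
  }
}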

“Anything else good about Spark?”

● Resiliency by replication and lineage tracking
● Distribution of processing via (potentially many) workers that can spawn (potentially many) executors

● Caching! Keep data in memory, reuse later
● Versatility and interoperability

APIs include Spark core, ML, DataFrames and Datasets, Streaming and GraphX...

● Read up on RDDs and ML material by Andrew Ng, Spark Summit videos, deep dives on Catalyst/Tungsten if you want to really get stuck in! This is a DIY talk

Recap – we know what it is now... and want to do some analytics!

● Data I’ll process here is for educational purposes only: road_accidents.csv

● Kaggle is a good place to practice – lots of datasets available for you

● Data I'm using is licensed under the Open Government License for public sector information

"accident_index","vehicle_reference","vehicle_type","towing_and_articulation","vehicle_manoeuvre","vehicle_location”,restricted_lane","junction_location","skidding_and_overturning","hit_object_in_carriageway","vehicle_leaving_carriageway","hit_object_off_carriageway","1st_point_of_impact","was_vehicle_left_hand_drive?","journey_purpose_of_driver","sex_of_driver","age_of_driver","age_band_of_driver","engine_capacity_(cc)","propulsion_code","age_of_vehicle","driver_imd_decile","driver_home_area_type","vehicle_imd_decile","NUmber_of_Casualities_unique_to_accident_ind ex","No_of_Vehicles_involved_unique_to_accident_index","location_easting_osgr","location_north ing_osgr","longitude","latitude","police_force","accident_severity","number_of_vehicles","number_of_casualties","date","day_of_week","time","local_authority_(district)","local_authority_(highway)","1st_road_class","1st_road_number","road_type","speed_limit","junction_detail","junction_control"," 2nd_road_class","2nd_road_number","pedestrian_crossing-human_control","pedestrian_crossing-physical_facilities","light_conditions","weather_conditions","road_surface_conditions","special_conditions_at_site","carriageway_hazards","urban_or_rural_area","did_police_officer_attend_scene_of_accident","lsoa_of_accident_location","casualty_reference","casualty_class","sex_of_casualty","age_of_casualty","age_band_of_casualty","casualty_severity","pedestrian_location","pedestrian_movement","car_passenger","bus_or_coach_passenger","pedestrian_road_maintenance_worker","casualty_type","casualty_home_area_type","casualty_imd_decile"

Features of the data (“columns”)

"201506E098757",2,9,0,18,0,8,0,0,0,0,3,1,6,1,45,7,1794,1,11,-1,1,-1,1,2,384980,394830,-2.227629,53.450014,6,3,2,1,"42250",2,1899-12-3012:56:00,102,"E08000003",5,0,6,30,3,4,6,0,0,0,1,1,1,0,0,1,2,"E01005288",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA, NA,NA

"201506E098766",1,9,0,9,0,8,0,0,0,0,4,1,6,2,25,5,1582, 2,1,-1,-1,-1,1,2,383870,394420,-2.244322,53.446296,6,3,2,1,"14/03/2015",7,1899-12-3015:55:00,102,"E08000003",3,5103,3,40,6,2,5,0,0,5,1,1,1

,0,0,1,1,"E01005178",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA, NA,NA,NA

Values (“rows”)

Spark way to figure this out?
groupBy* vehicle_type
sort** the results on count
vehicle_type maps to a code

First place: car
Distant second: pedal bike
Close third: van/goods HGV <= 3.5 T
Distant last: electric motorcycle

Type of vehicle involved in the most accidents?
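A minimal sketch of that groupBy*/sort** approach, assuming the accidents CSV has been loaded as shown later in the talk (here with Spark's built-in CSV reader):

import org.apache.spark.sql.SparkSession

object VehicleTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("VehicleTypes").master("local[*]").getOrCreate()
    val allAccidents = spark.read.option("header", "true")
      .csv(sys.env("HOME") + "/datasets/road_accidents.csv")
    // Count accidents per vehicle_type code; the ascending sort puts the biggest counts last
    allAccidents.groupBy("vehicle_type").count().sort("count").show(100)
    spark.stop()
  }
}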

Different column name this time, weather_conditions maps to a code again

First place: fine with no high winds
Second: raining, no high winds
Distant third: fine, with high winds
Distant last: snowing, high winds

groupBy* weather_conditions
sort** the results on count
weather_conditions maps to a code

What weather should I be avoiding?

First place: going ahead (!)
Distant second: turning right
Distant third: slowing or stopping
Last: reversing

Spark way...
groupBy* manoeuvre
sort** the results on count
manoeuvre maps to a code

Which manoeuvres should I be careful with?

“Why * and **?”

org.apache.spark functions that can run in a distributed manner

Spark code example – I'm using Scala
● Forced mutability consideration (val or var)
● Not mandatory to declare types (or "return ...")
● Check out "Scala for the Intrigued" on YouTube
● JVM based
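A quick, illustrative taste of those Scala points (not from the original slides):

object ScalaBasics {
  def main(args: Array[String]): Unit = {
    val fixed = 42              // val: immutable reference
    var counter = 0             // var: mutable, think before reaching for it
    counter += fixed
    def double(x: Int) = x * 2  // return type and the "return" keyword are inferred
    println(double(counter))    // prints 84
  }
}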

Scala main method I'll be using

object AccidentsExample {

def main(args: Array[String]) : Unit = {

}

}

Which age group gets in the most accidents?

Spark entrypoint

val session = SparkSession.builder().appName("Accidents").master("local[*]")

Creating a DataFrame: API we'll use to interact with data as though it's in an SQL table

val sqlContext = session.getOrCreate().sqlContext

val allAccidents = sqlContext.read.format("com.databricks.spark.csv"). option("header", "true"). load(myHome + "/datasets/road_accidents.csv")

allAccidents.show would give us a table like...

accident_index vehicle_reference vehicle_type towing_and_articulation

201506E098757 2 9 0

201506E098766 1 9 0

Group our data and save the result

...

val myAgeDF = groupCountSortAndShow(allAccidents, "age_of_casualty", true)

myAgeDF.coalesce(1). write.option("header", "true"). format("csv"). save("victims")

Runtime.getRuntime().exec("python plot_me.py")

def groupCountSortAndShow(df: DataFrame, columnName: String, toShow: Boolean): DataFrame = {
  val ourSortedData = df.groupBy(columnName).count().sort("count")
  if (toShow) ourSortedData.show()
  ourSortedData
}

"Hold on... what's that getRuntime().exec stuff?!"

It's calling my Python code to plot the CSV file

import glob, os, pandas
from bokeh.plotting import figure, output_file, show

path = r'victims'

all_files = glob.glob(os.path.join(path, "*.csv"))

df_from_each_file = (pandas.read_csv(f) for f in all_files)

df = pandas.concat(df_from_each_file, ignore_index=True)

plot = figure(plot_width=640,plot_height=640,title='Accident victims by age',

x_axis_label='Age of victim', y_axis_label='How many')

plot.title.text_font_size = '16pt'

plot.xaxis.axis_label_text_font_size = '16pt'

plot.yaxis.axis_label_text_font_size = '16pt'

plot.scatter(x=df.age_of_casualty, y=df['count'])

output_file('victims.html')
show(plot)

Bokeh gives us a graph like this

“What else can I do?”

You've got some JSON files...

“Best doom metal band please”

sqlContext.sql("SELECT name, average_rating from bands WHERE " + "genre == 'doom_metal'").sort(desc("average_rating")).show(1)

+--------------------+--------------+
|                name|average_rating|
+--------------------+--------------+
|      Bugle Infantry|             5|
+--------------------+--------------+
only showing top 1 row

val bandsDF = sqlContext.read.json(myHome + "/datasets/bands.json")
bandsDF.createOrReplaceTempView("bands")

import org.apache.spark.sql.functions._

{"id":"2","name":"Louder Bill","average_rating":"4.1","genre":"ambient"}{"id":"3","name":"Prey Fury","average_rating":"2","genre":"pop"}{"id":"4","name":"Unbranded Newsroom","average_rating":"4","genre":"rap"}

{"id":"5","name":"Bugle Infantry","average_rating":"5", "genre": "doom_metal"}

{"id":"1","name":"Into Latch","average_rating":"4.9","genre":"doom_metal"}

Randomly generated band names as of May the 18th 2017, zero affiliation on my behalf or IBM’s for any of these names...entirely coincidental if they do exist

"Great, but you mentioned some data collected with wearables and machine learning!"

Anonymised data gathered from Automatic, Apple Health, Withings, Jawbone Up

● Car journeys
● Sleeping activity (start and end time)
● Daytime activity (calories consumed, steps taken)
● Weight and heart rate
● Several CSV files
● Anonymised by subject gatherer before uploading anywhere! Nothing identifiable

Exploring the datasets: driving activity

val autoData = sqlContext.read.format("com.databricks.spark.csv"). option("header", "true").option("inferSchema", "true").load(myHome + "/datasets/geecon/automatic.csv"). withColumnRenamed("End Location Name", "Location"). withColumnRenamed("End Time", "Time")

Checking our data is sensible...

val colsWeCareAbout = Array("Distance (mi)", "Duration (min)", "Fuel Cost (USD)")

for (col <- colsWeCareAbout) {
  summarise(autoData, col)
}

def summarise(df: DataFrame, columnName: String) {
  averageByCol(df, columnName)
  minByCol(df, columnName)
  maxByCol(df, columnName)
}

def averageByCol(df: DataFrame, columnName: String) { println("Printing the average " + columnName) df.agg(avg(df.col(columnName))).show()

}

def minByCol(df: DataFrame, columnName: String) { println("Printing the minimum " + columnName) df.agg(min(df.col(columnName))).show()

}

def maxByCol(df: DataFrame, columnName: String) { println("Printing the maximum " + columnName) df.agg(max(df.col(columnName))).show()

}

Average distance (in miles): 6.88, minimum: 0.01, maximum: 187.03
Average duration (in minutes): 14.87, minimum: 0.2, maximum: 186.92
Average fuel cost (in USD): 0.58, minimum: 0.0, maximum: 14.35
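As an aside (not in the original talk), DataFrame.describe can produce count, mean, stddev, min and max for several columns in one call:

autoData.describe("Distance (mi)", "Duration (min)", "Fuel Cost (USD)").show()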

Looks OK - what’s the rate of Mr X visiting a certain place? Got a favourite gym day?

Slacking on certain days?
● Using Spark to determine chance of the subject being there
● Timestamps in the "Time" column need to become days of the week instead
● The start of a common theme: data preparation!

Explore the data first

|Vehicle|Start Location Name|Start Time|Location|Time| Distance (mi)|Duration (min)|Fuel Cost (USD)|Average MPG|Fuel Volume (gal)|Hard Accelerations|Hard Brakes|Duration Over 70 mph (secs)|Duration Over 75 mph (secs)| Duration Over 80 mph (secs)|Start Location Accuracy (meters)|End Location Accuracy (meters)|Tags|...

[Two sample rows shown: short trips in a 2005 Nissan Sentra on 4/3/2016, starting and ending at "PokeStop 12" (15:06–15:07 and 15:17–15:18), with small distance, duration and fuel cost values and no hard accelerations or brakes – the original table does not reproduce cleanly here]

autoData.show() ...

val preparedAutoData = sqlContext.sql("SELECT TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)) as Date, Location, " +
  "date_format(TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)), 'EEEE') as Day FROM auto_data")

preparedAutoData.show()

Timestamp fun: 4/03/2016 15:06 is no good!

+----------+-----------+------+
|      Date|   Location|   Day|
+----------+-----------+------+
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|   Michaels|Sunday|
+----------+-----------+------+
...
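As an aside (not from the talk), the same Date/Day preparation can be written with the DataFrame API instead of SQL – a sketch, assuming the autoData DataFrame from earlier:

import org.apache.spark.sql.functions.{col, date_format, to_date, unix_timestamp}

val preparedAutoData2 = autoData.select(
  to_date(unix_timestamp(col("Time"), "MM/dd/yyyy").cast("timestamp")).as("Date"),
  col("Location"),
  date_format(to_date(unix_timestamp(col("Time"), "MM/dd/yyyy").cast("timestamp")), "EEEE").as("Day"))
preparedAutoData2.show()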

def printChanceLocationOnDay(sqlContext: SQLContext, day: String, location: String) {

val allDatesAndDaysLogged = sqlContext.sql( "SELECT Date, Day " +"FROM prepared_auto_data " +"WHERE Day = '" + day + "'").distinct()

allDatesAndDaysLogged.show()

Scala function: give us all of the rows where the day is what we specified

+----------+------+
|      Date|   Day|
+----------+------+
|2016-10-17|Monday|
|2016-10-24|Monday|
|2016-04-25|Monday|
|2017-03-27|Monday|
|2016-08-15|Monday|

...

+----------+--------+------+
|      Date|Location|   Day|
+----------+--------+------+
|2016-04-04|     Gym|Monday|
|2016-11-14|     Gym|Monday|
|2017-01-09|     Gym|Monday|
|2017-02-06|     Gym|Monday|
+----------+--------+------+

val visits = sqlContext.sql("SELECT * FROM prepared_auto_data " +
  "WHERE Location = '" + location + "' AND Day = '" + day + "'")

visits.show()

var rate = Math.floor((Double.valueOf(visits.count()) /
  Double.valueOf(allDatesAndDaysLogged.count())) * 100)

println(rate + "% rate of being at the location '" + location + "' on " + day +
  ", activity logged for " + allDatesAndDaysLogged.count() + " " + day + "s")
}

Rows where the location and day match our query (passed in as parameters)

● 7% rate of being at the location 'Gym' on Monday, activity logged for 51 Mondays
● 1% rate of being at the location 'Gym' on Tuesday, activity logged for 51 Tuesdays
● 2% rate of being at the location 'Gym' on Wednesday, activity logged for 49 Wednesdays
● 6% rate of being at the location 'Gym' on Thursday, activity logged for 47 Thursdays
● 7% rate of being at the location 'Gym' on Saturday, activity logged for 41 Saturdays
● 9% rate of being at the location 'Gym' on Sunday, activity logged for 41 Sundays

val days = Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

for (day <- days) {
  printChanceLocationOnDay(sqlContext, day, "Gym")

}
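As an aside (not in the original talk), the same per-day rates could be computed in a single pass with the DataFrame API – a sketch, assuming the preparedAutoData DataFrame from earlier:

import org.apache.spark.sql.functions.{col, countDistinct, floor}

// Distinct logged dates per weekday, Gym visits per weekday, then combine into one table
val daysLogged = preparedAutoData.groupBy("Day").agg(countDistinct("Date").as("days_logged"))
val gymVisits = preparedAutoData.filter(col("Location") === "Gym")
  .groupBy("Day").count().withColumnRenamed("count", "visits")

daysLogged.join(gymVisits, Seq("Day"), "left_outer")
  .withColumn("rate_percent", floor(col("visits") / col("days_logged") * 100))
  .show()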

Which feature(s) are closely related to another - e.g. the time spent asleep?

Dataset has these features from Jawbone
● s_duration (the sleep time as well...)
● m_active_time
● m_calories
● m_distance
● m_steps
● m_total_calories
● n_bedtime (hmm)
● n_awake_time

How about correlations?

Very strong positive correlation for n_bedtime and s_asleep_time

Correlation between goal_body_weight and s_asleep time: -0.02

val shouldBeLow = sleepData.stat.corr("goal_body_weight", "s_duration")

println("Correlation between goal body weight and sleep duration: " + shouldBeLow)

val compareToCol = "s_duration"
for (col <- sleepData.columns) {
  if (!col.equals(compareToCol)) { // don't compare to itself...
    val corr = sleepData.stat.corr(col, compareToCol)
    if (corr > 0.8) {
      println("Very strong positive correlation for " + col + " and " + compareToCol)
    } else if (corr >= 0.5) {
      println("Positive correlation for " + col + " and " + compareToCol)
    }
  }
}

And something we know isn’t related?

“...can Spark help me to get a good sleep?”

Need to define a good sleep first
8 hours for this test subject

If duration is > 8 hours
good sleep = true, else false

I'm using 1 for true and 0 for false
We will label this data soon so remember this

Then we'll determine the most influential features on the value being true or false. This can reveal the interesting stuff!

Sanity check first: any good sleeps for Mr X?

Found 538 valid recorded sleep times and 129 were 8 or more hours in duration

val onlyDurations = sleepData.select("s_duration")
val NUM_HOURS = 8
val THRESHOLD = 60 * 60 * NUM_HOURS
val onlyGoodSleeps = onlyDurations.filter($"s_duration" >= THRESHOLD)

// Don't care if the sleep duration wasn't even recorded or it's 0
val onlyRecordedSleeps = onlyDurations.filter($"s_duration" > 0)

println("Found " + onlyRecordedSleeps.count() + " valid recorded " +
  "sleep times and " + onlyGoodSleeps.count() + " were " + NUM_HOURS +
  " or more hours in duration")

We will use machine learning: but first...
1) What do we want to find out? Main contributing factors to a good sleep
2) Pick an algorithm
3) Prepare the data
4) Separate into training and test data
5) Build a model with the training data (in parallel using Spark!)
6) Use that model on the test data
7) Evaluate the model
8) Experiment with parameters until reasonably accurate e.g. N iterations

Clustering algorithms such as K-means (unsupervised learning (no labels, cheap))
● Produce n clusters from data to determine which cluster a new item can be categorised as
● Identify anomalies: transaction fraud, erroneous data

Recommendation algorithms such as Alternating Least Squares
● Movie recommendations on Netflix?
● Recommended purchases on Amazon?
● Similar songs with Spotify?
● Recommended videos on YouTube?

Classification algorithms such as logistic regression
● Create model that we can use to predict where to plot the next item in a sequence (above or below our line of best fit)
● Healthcare: predict adverse drug reactions based on known interactions with similar drugs
● Spam filter (binomial classification)
● Naive Bayes

Which algorithms might be of use?

What does "Naive Bayes" have to do with my sleep quality?

Using evidence provided, guess what a label will be (1 or 0) for us: easy to use with some training data

0 = the label (category 0 or 1) e.g. 0 = low scoring athlete, 1 = high scoring
1:x = the score for a sporting event 1
2:x = the score for a sporting event 2
3:x = the score for a sporting event 3

bayes_data.txt (libSVM format)
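A couple of made-up lines to show the shape of that libSVM format (the scores are hypothetical, not the actual contents of bayes_data.txt):

0 1:12.5 2:9.0 3:11.2
1 1:21.4 2:18.9 3:24.0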

val model = new NaiveBayes().fit(trainingData)

val predictions = model.transform(testData)

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy = " + accuracy)

Test set accuracy = 0.82

val bayesData = sqlContext.read.format("libsvm").load("bayes_data.txt")

val Array(trainingData, testData) = bayesData.randomSplit(Array(0.7, 0.3))

Read it in, split it, fit it, transform and evaluate – all on one slide with Spark!

https://spark.apache.org/docs/2.1.0/mllib-naive-bayes.html

Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to the training data, it computes the conditional probability distribution of each feature given label, and then it applies Bayes’ theorem to compute the conditional probability distribution of label given an observation and use it for prediction.

Naive Bayes correctly classifies the data (giving it the right labels)

Feed some new data in for the model...

“Can I just use Naive Bayes on all of the sleep data?”

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

Possibilities – bear in mind that DataFrames are immutable, can't modify elements directly...

1) Spark has a .map function, how about that?
"map is a transformation that passes each dataset element through a function and returns a new RDD representing the results" - http://spark.apache.org/docs/latest/programming-guide.html
● Removes all other columns in my case... (new DataFrame with just the labels!)

2) Running a user defined function on each row?
● Maybe, but can Spark's internal SQL optimiser "Catalyst" see and optimise it? Probably slow

Labelling each row according to our “good sleep” criteria

Preparing the labels

Preparing the features is easier

val labelledSleepData = sleepData. withColumn("s_duration", when(col("s_duration") > THRESHOLD, 1). otherwise(0))

val assembler = new VectorAssembler().setInputCols(sleepData.columns).setOutputCol("features")

val preparedData = assembler.transform(labelledSleepData). withColumnRenamed("s_duration", "good_sleep")

"If duration is > 8 hours
good sleep = true, else false

I'm using 1 for true and 0 for false"

✔ Labelled data now

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

Trying to fit a model to the DataFrame now leads to...

s_asleep_time and n_bedtime (integers)

API docs: “Time user fell asleep. Seconds to/from midnight. If negative, subtract from midnight. If positive, add to midnight”

Solution in this example?
Change to positives only

Add the number of seconds in a day to whatever s_asleep_time's value is. Think it through properly when you try this if you’re done experimenting and want something reliable to use!

The problem...

New DataFrame where negative values are handled

val preparedSleepAsLabel = preparedData.withColumnRenamed("good_sleep", "label")

val secondsInDay = 24 * 60 * 60

val toModel = preparedSleepAsLabel.withColumn("s_asleep_time", (col("s_asleep_time")) + secondsInDay).
  withColumn("s_bedtime", (col("s_bedtime")) + secondsInDay)

toModel.createOrReplaceTempView("to_model_table")

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

Reducing your "feature space"
Spark's ChiSqSelector algorithm will work here
We want labels and features to inspect

val selector = new ChiSqSelector().setNumTopFeatures(10).setFeaturesCol("features").setLabelCol("good_sleep").setOutputCol("selected_features")

val model = selector.fit(preparedData)

val topFeatureIndexes = model.selectedFeatures
for (i <- 0 to topFeatureIndexes.length - 1) {
  // Get col names based on feature indexes
  println(preparedData.columns(topFeatureIndexes(i)))

}

Using ChiSq selector to get the top features

Feature selection tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and statistical learning behavior. ChiSqSelector implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports three selection methods: numTopFeatures, percentile, fpr: numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.

https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html#chisqselector

Transform values into a "features" column and only select columns we identified as influential

Earlier we did... toModel.createOrReplaceTempView("to_model_table")

val onlyInterestingColumns = sqlContext.sql("SELECT label, " + colNames.toString() +
  " FROM to_model_table")

val theAssembler = new VectorAssembler().setInputCols(onlyInterestingColumns.columns).setOutputCol("features")

val thePreparedData = theAssembler.transform(onlyInterestingColumns)

Top ten influential features (most to least influential)

Feature          Description from Jawbone API docs
s_count          Number of primary sleep entries logged
s_awake_time     Time the user woke up
s_quality        Proprietary formula, don't know
s_asleep_time    Time when the user fell asleep
s_bedtime        Seconds the device is in sleep mode
s_deep           Seconds of main "sound sleep"
s_light          Seconds of "light sleeps" during the sleep period
m_workout_time   Length of logged workouts in seconds
n_light          Seconds of light sleep during the nap
n_sound          Seconds of sound sleep during the nap

1) didn't label each row in the DataFrame yet
2) Naive Bayes can't handle our data in the current form
3) too many useless features

And after all that... we can generate predictions!

val Array(trainingSleepData, testSleepData) = thePreparedData.randomSplit(Array(0.7, 0.3))

val sleepModel = new NaiveBayes().fit(trainingSleepData)

val predictions = sleepModel.transform(testSleepData)

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy for labelled sleep data = " + accuracy)

Test set accuracy for labelled sleep data = 0.81 ...

Testing it with new data

val somethingNew = sqlContext.createDataFrame(Seq(

// Good sleep: high workout time, achieved a good amount of deep sleep, went to bed after midnight and woke at almost noon!

(0, Vectors.dense(0, 1, 42600, 100, 87659, 85436, 16138, 22142, 4073, 0)),

// Bad sleep, woke up early (5 AM), didn't get much of a deep sleep, didn't workout, bedtime 10.20 PM

(0, Vectors.dense(0, 0, 18925, 0, 80383, 80083, 6653, 17568, 0, 0))

)).toDF("label","features")

sleepModel.transform(somethingNew).show()

Sensible model created with outcomes we'd expect
Go to bed earlier, exercise more
I could have looked closer into removing the s_ variables so they're all m_ and diet information; exercise for the reader

Algorithms are producing these outcomes without domain specific knowledge

Last example: "does weighing more result in a higher heart rate?"
Will get the average of all the heart rates logged on a day when weight was measured

Lower heart rate day = weight was more?
Higher rate day = weight was less?

Maybe MLlib again? But all that preparation work...

How deeply involved with Spark do we usually need to get?

More data preparation needed, but there's a twist
Here I use data from two tables: weights, activities

+----------+------+
|      Date|weight|
+----------+------+
|2017-04-09| 220.4|
|2017-04-08| 219.9|
|2017-04-07| 221.0|
+----------+------+

only showing top 3 rows

becomes

Times are removed as we only care about dates

Include only heart beat readings when we have weight(s) measured: join on date used

+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.3|                  77.0|
|2017-02-09| 215.9|                  97.0|
|2017-02-09| 215.9|                 104.0|
|2017-02-09| 215.9|                  88.0|
+----------+------+----------------------+
...

Average the rate and weight readings by day

+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.7|                  77.0|
+----------+------+----------------------+
...

Should become this:

+----------+----------+--------------------------+
|      Date|avg weight|avg_heart_beats_per_minute|
+----------+----------+--------------------------+
|2017-02-13|     220.5|                        78|
+----------+----------+--------------------------+

...

DataFrame now looks like this...

+----------+---------------------------+-----------+
|      Date|avg(heart_beats_per_minute)|avg(weight)|
+----------+---------------------------+-----------+
|2016-04-25|                  85.933...|  196.46...|
|2017-01-06|                 93.8125...|      216.0|
|2016-05-03|                  83.647...|  198.35...|
|2016-07-26|                  84.411...|  192.69...|
+----------+---------------------------+-----------+

Something we can quickly plot!

Bokeh used again, no more analysis required

Used the same functions as earlier (groupBy, formatting dates) and also a join. Same plotting with different column names. No distinct correlation identified so moved on
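A rough sketch of that preparation (not the talk's exact code – the file names, timestamp column and exact averaging are assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, to_date}

object WeightVsHeartRate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WeightVsHeartRate").master("local[*]").getOrCreate()
    // Hypothetical CSV and column names for the exported weight and heart rate data
    val weights = spark.read.option("header", "true").option("inferSchema", "true")
      .csv(sys.env("HOME") + "/datasets/weights.csv")
    val activities = spark.read.option("header", "true").option("inferSchema", "true")
      .csv(sys.env("HOME") + "/datasets/activities.csv")

    // Drop the time portion so both tables share a plain Date column, then average per day
    val weightByDay = weights.withColumn("Date", to_date(col("timestamp")))
      .groupBy("Date").agg(avg("weight"))
    val heartRateByDay = activities.withColumn("Date", to_date(col("timestamp")))
      .groupBy("Date").agg(avg("heart_beats_per_minute"))

    // Inner join on Date keeps only days with both a weight and a heart rate reading
    weightByDay.join(heartRateByDay, "Date").show(4)
    spark.stop()
  }
}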

Still lots of questions we could answer with Spark using this data
● Any impact on mpg when the driver weighs much less than before?
● Which fuel provider gives me the best mpg?
● Which visited places have a positive effect on subject's weight?

● Analytics doesn’t need to be complicated: Spark’s good for the heavy lifting

● Sometimes best to just plot as you go – saves plenty of time

● Other harder things to worry about
Writing a distributed machine learning algorithm shouldn't be one of them!

“Which tools can I use to answer my questions?”

This question becomes easier

Infrastructure when you're ready to scale beyond your laptop
● Setting up a huge HA cluster: a talk on its own
● Who sets up then maintains the machines? Automate it all?
● How many machines do you need? RAM/CPUs?
● Who ensures all software is up to date (CVEs?)
● Access control lists?
● Hosting costs/providers?
● Reliability, fault tolerance, backup procedures...

Still got to think about...

● Use GPUs to train models faster
● DeepLearning4J?
● Writing your own kernels/C/JNI code (or a Java API like CUDA4J/Aparapi?)

● Use RDMA to reduce network transfer times

● Zero copy: RoCE or InfiniBand?

● Tune the JDK, the OS, the hardware

● Continuously evaluate performance: Spark itself, use -Xhealthcenter, your own metrics, various libraries...

● Go tackle something huge – join the alien search

● Combine Spark Streaming with MLlib to gain insights fast (a minimal sketch follows below)

● More informed decision making

And if you want to really show off with Spark
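To make the Streaming point from the list above concrete, here's a minimal sketch (not from the talk – the socket source and batch interval are arbitrary); each micro-batch could be scored with a previously trained MLlib model rather than just counted:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Hypothetical source: lines arriving on a local socket (e.g. started with `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()  // this is where a trained model could score each batch instead
    ssc.start()
    ssc.awaitTermination()
  }
}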

● Know more about Spark: what it can and can’t do (new project ideas?)

● Know more about machine learning in Spark
● Know that machine learning's still hard but in different ways

Data preparation, handling junk, knowing what to look for
Getting the data in the first place
Writing the algorithms to be used in Spark?

Recap – you should now...

● Built-in Spark functions are aplenty – try and stick to these
● You can plot your results by saving to a csv/json and using your existing favourite plotting libraries easily

● DataFrame (or Datasets) combined with ML = powerful APIs
● Filter your data – decide how to handle nulls!
● Pick and use a suitable ML algorithm
● Plot results

Points to take home...

Final points to consider...

Where would Spark fit in to your systems? A replacement or supplementary?

Give it a try with your own data and you might be surprised with the outcome

It's free and open source with a very active community!

Contact me directly: aroberts@uk.ibm.com

Questions?

● Automatic: log into the Automatic Dashboard https://dashboard.automatic.com/, on the bottom right, click export, choose what data you want to export (e.g. All)

● Fuelly: (Obtained Gas Cubby), log into the Fuelly Dashboard http://www.fuelly.com/dashboard, select your vehicle in Your Garage, scroll down to vehicle logs, select Export Fuel-ups or Export Services, select duration of export

● Jawbone: sign into your account at https://jawbone.com/, click on your name on the top right, choose Settings, click on the Accounts tab, scroll down to Download UP Data, choose which year you'd like to download data for

How did I access the data to process?

● Withings: log into the Withings Dashboard https://healthmate.withings.com click Measurement table, click the tab corresponding to the data you want to export, click download. You can go here to download all data instead: https://account.withings.com/export/

● Apple: launch the Health app, navigate to the Health Data tab, select your account in the top right area of your screen, select Export Health Data

● Remember to remove any sensitive personal information before sharing/showing/storing said data elsewhere! I am dealing with “cleansed” datasets with no SPI
