Agile Data Science 2.0 - Big Data Science Meetup


  • Building Full Stack Data Analytics Applications with Kafka and Spark

    Agile Data Science 2.0

    https://www.slideshare.net/rjurney/agile-data-science-20-big-data-science-meetup or

    http://bit.ly/agile_data_slides_2


  • Agile Data Science 2.0

    Russell Jurney

    2

    Data Engineer: 85%

    Data Scientist: 85%

    Visualization Software Engineer: 85%

    Writer: 85%

    Teacher: 50%

    Russell Jurney is a veteran data scientist and thought leader. He coined the
    term Agile Data Science in the book of that name from O'Reilly in 2012, which
    outlines the first agile development methodology for data science. Russell has
    constructed numerous full-stack analytics products over the past ten years and
    now works with clients helping them extract value from their data assets.

    Russell Jurney


    Principal Consultant at Data Syndrome

    Russell Jurney

    Data Syndrome, LLC

    Email: [email protected] | Web: datasyndrome.com

    Principal Consultant



    Product Consulting

    We build analytics products and systems

    consisting of big data viz, predictions,

    recommendations, reports and search.

    Corporate Training

    We offer training courses for data scientists and engineers,

    and for data science teams.

    Video Training

    We offer video training courses that rapidly

    acclimate you to a technology and technique.

  • Agile Data Science 2.0 4

    What makes data science agile data science?

    Theory

  • Agile Data Science 2.0 5

    Yes. Building applications is a fundamental skill for today's data scientist.

    Data Products or Data Science?

  • Agile Data Science 2.0 6

    Data Products or

    Data Science?

  • Agile Data Science 2.0 7

    If someone else has to start over and rebuild it, it ain't agile.

    Big Data or Data Science?

  • Agile Data Science 2.0 8

    Goal of Methodology

    The goal of agile data science in

  • Agile Data Science 2.0 9

    In analytics, the end-goal moves or is complex in nature, so we model as a

    network of tasks rather than as a strictly linear process.

    Critical Path

  • Agile Data Science 2.0

    Agile Data Science Manifesto

    10

    Seven Principles for Agile Data Science

    1. Iterate, iterate, iterate: tables, charts, reports, predictions.

    2. Ship intermediate output. Even failed experiments have output.

    3. Prototype experiments over implementing tasks.

    4. Integrate the tyrannical opinion of data in product management.

    5. Climb up and down the data-value pyramid as we work.

    6. Discover and pursue the critical path to a killer product.

    7. Get meta. Describe the process, not just the end-state.

  • Agile Data Science 2.0 11

    People will pay more for the things towards the top, but you need the things

    on the bottom to have the things above. They are foundational. See:

    Maslow's Theory of Needs.

    Data Value Pyramid

  • Agile Data Science 2.0 12

    Things we use to build the apps

    Tools

  • Agile Data Science 2.0

    Agile Data Science 2.0 Stack

    13

    Apache Spark: Batch and Realtime

    Apache Kafka: Realtime Queue

    MongoDB: Document Store

    Elasticsearch: Search

    Flask: Simple Web App

    Example of a high productivity stack for big data applications

  • Agile Data Science 2.0

    Flow of Data Processing

    14

    Tools and processes in collecting, refining, publishing and decorating data

    {"hello": "world"}

  • Data Syndrome: Agile Data Science 2.0

    Apache Spark Ecosystem

    15

    HDFS, Amazon S3, Spark, Spark SQL, Spark MLlib, Spark Streaming


  • Agile Data Science 2.0 16

    SQL or dataflow programming?

    Programming Models

  • Agile Data Science 2.0 17

    Describing what you want and letting the planner figure out how

    SQL

    SELECT associations2.object_id, associations2.term_id, associations2.cat_ID, associations2.term_taxonomy_id
    FROM (
      SELECT objects_tags.object_id, objects_tags.term_id, wp_cb_tags2cats.cat_ID, categories.term_taxonomy_id
      FROM (
        SELECT wp_term_relationships.object_id, wp_term_taxonomy.term_id, wp_term_taxonomy.term_taxonomy_id
        FROM wp_term_relationships
        LEFT JOIN wp_term_taxonomy ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
        ORDER BY object_id ASC, term_id ASC
      ) AS objects_tags
      LEFT JOIN wp_cb_tags2cats ON objects_tags.term_id = wp_cb_tags2cats.tag_ID
      LEFT JOIN (
        SELECT wp_term_relationships.object_id, wp_term_taxonomy.term_id AS cat_ID, wp_term_taxonomy.term_taxonomy_id
        FROM wp_term_relationships
        LEFT JOIN wp_term_taxonomy ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
        WHERE wp_term_taxonomy.taxonomy = 'category'
        GROUP BY object_id, cat_ID, term_taxonomy_id
        ORDER BY object_id, cat_ID, term_taxonomy_id
      ) AS categories ON wp_cb_tags2cats.cat_ID = categories.term_id
      WHERE objects_tags.term_id = wp_cb_tags2cats.tag_ID
      GROUP BY object_id, term_id, cat_ID, term_taxonomy_id
      ORDER BY object_id ASC, term_id ASC, cat_ID ASC
    ) AS associations2
    LEFT JOIN categories ON associations2.object_id = categories.object_id
    WHERE associations2.cat_ID <> categories.cat_ID
    GROUP BY object_id, term_id, cat_ID, term_taxonomy_id
    ORDER BY object_id, term_id, cat_ID, term_taxonomy_id

  • Agile Data Science 2.0 18

    Flowing data through operations to effect change

    Dataflow Programming
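    For contrast with the SQL on the previous slide, here is a minimal sketch of the
    dataflow style in PySpark. It assumes the on_time_dataframe used on the later
    slides; the grouped count of late arrivals is illustrative, not code from the book.

    # Dataflow style: push the data through a chain of operations instead of
    # declaring the result and letting a query planner work out the steps.
    # Assumes `on_time_dataframe` is the FAA on-time performance DataFrame.
    from pyspark.sql import functions as F

    late_flights_by_carrier = (
      on_time_dataframe
        .filter(on_time_dataframe.ArrDelayMinutes > 0)    # keep only late arrivals
        .groupBy("Carrier")                               # flow them into per-carrier groups
        .agg(F.count("*").alias("total_late_arrivals"))   # reduce each group to a count
        .orderBy(F.desc("total_late_arrivals"))           # rank carriers by lateness
    )
    late_flights_by_carrier.show(10)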

  • Agile Data Science 2.0 19

    The best of both worlds!

    SQL AND Dataflow

    Programming

    # Flights that were late arriving...
    late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)
    total_late_arrivals = late_arrivals.count()

    # Flights that left late but made up time to arrive on time...
    on_time_heros = on_time_dataframe.filter(
      (on_time_dataframe.DepDelayMinutes > 0)
      &
      (on_time_dataframe.ArrDelayMinutes <= 0)
    )
    total_on_time_heros = on_time_heros.count()

  • Agile Data Science 2.0 20

    FAA on-time performance data

    Data

  • Data Syndrome: Agile Data Science 2.0

    Collect and Serialize Events in JSON

    I never regret using JSON

    21
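    A minimal sketch of what collecting events as JSON looks like in practice: one
    self-describing JSON object per line, appended to a log. The event fields and
    file path are illustrative assumptions, not the book's collectors.

    # Append events to a JSON Lines log: one JSON object per line.
    import json
    from datetime import datetime, timezone

    def log_event(path, event):
      # Stamp the event and append it as a single line of JSON
      event["timestamp"] = datetime.now(timezone.utc).isoformat()
      with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

    log_event("data/events.jsonl", {"event": "search", "query": "SFO to JFK"})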

  • Data Syndrome: Agile Data Science 2.0

    FAA On-Time Performance Records

    95% of commercial flights

    22

    http://www.transtats.bts.gov/Fields.asp?table_id=236

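    A hedged sketch of getting these records into the stack: read the raw BTS CSV
    (its header row is shown on the next slide) and write it out as Parquet, which is
    what the later slides load. The input filename is an assumption about where the
    download landed, not a path from the book.

    # Convert the raw on-time performance CSV to Parquet for fast repeated access.
    on_time_raw = spark.read.csv(
      "data/On_Time_On_Time_Performance_2015.csv",  # assumed download location
      header=True,       # the first row is the field list shown on the next slide
      inferSchema=True   # let Spark infer numeric vs. string columns
    )
    on_time_raw.write.mode("overwrite").parquet("data/on_time_performance.parquet")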

  • Data Syndrome: Agile Data Science 2.0

    FAA On-Time Performance Records

    95% of commercial flights

    23

    "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID","Carrier","TailNum","FlightNum","OriginAirportID","OriginAirportSeqID","OriginCityMarketID","Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac","DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState","DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes","DepDel15","DepartureDelayGroups","DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime","ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15","ArrivalDelayGroups","ArrTimeBlk","Cancelled","CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime","Flights","Distance","DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay","FirstDepTime","TotalAddGTime","LongestAddGTime","DivAirportLandings","DivReachedDest","DivActualElapsedTime","DivArrDelay","DivDistance","Div1Airport","Div1AirportID","Div1AirportSeqID","Div1WheelsOn","Div1TotalGTime","Div1LongestGTime","Div1WheelsOff","Div1TailNum","Div2Airport","Div2AirportID","Div2AirportSeqID","Div2WheelsOn","Div2TotalGTime","Div2LongestGTime","Div2WheelsOff","Div2TailNum","Div3Airport","Div3AirportID","Div3AirportSeqID","Div3WheelsOn","Div3TotalGTime","Div3LongestGTime","Div3WheelsOff","Div3TailNum","Div4Airport","Div4AirportID","Div4AirportSeqID","Div4WheelsOn","Div4TotalGTime","Div4LongestGTime","Div4WheelsOff","Div4TailNum","Div5Airport","Div5AirportID","Div5AirportSeqID","Div5WheelsOn","Div5TotalGTime","Div5LongestGTime","Div5WheelsOff","Div5TailNum"

  • Data Syndrome: Agile Data Science 2.0

    openflights.org Database

    Airports, Airlines, Routes

    24

    http://openflights.org

  • Data Syndrome: Agile Data Science 2.0

    Scraping the FAA Registry

    Airplane Data by Tail Number

    25

  • Data Syndrome: Agile Data Science 2.0

    Wikipedia Airlines Entries

    Descriptions of Airlines

    26

  • Data Syndrome: Agile Data Science 2.0

    National Centers for Environmental Information

    Historical Weather Observations

    27

  • Agile Data Science 2.0 28

    Working our way up the data value pyramid

    Climbing the Stack

  • Agile Data Science 2.0 29

    Starting by plumbing the system from end to end

    Plumbing

  • Data Syndrome: Agile Data Science 2.0

    Publishing Flight Records

    Plumbing our master records through to the web

    30

  • Data Syndrome: Agile Data Science 2.0

    Publishing Flight Records to MongoDB

    Plumbing our master records through to the web

    31

    import pymongo
    import pymongo_spark

    # Important: activate pymongo_spark.
    pymongo_spark.activate()

    # Load the parquet file
    on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

    # Convert to RDD of dicts and save to MongoDB
    as_dict = on_time_dataframe.rdd.map(lambda row: row.asDict())
    as_dict.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')

  • Data Syndrome: Agile Data Science 2.0

    Publishing Flight Records to Elasticsearch

    Plumbing our master records through to the web

    32

    # Load the parquet file
    on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

    # Save the DataFrame to Elasticsearch
    on_time_dataframe.write.format("org.elasticsearch.spark.sql")\
      .option("es.resource", "agile_data_science/on_time_performance")\
      .option("es.batch.size.entries", "100")\
      .mode("overwrite")\
      .save()

  • Data Syndrome: Agile Data Science 2.0

    Putting Records on the Web

    Plumbing our master records through to the web

    33

    from flask import Flask, render_template, request
    from pymongo import MongoClient
    from bson import json_util

    # Set up Flask and Mongo
    app = Flask(__name__)
    client = MongoClient()

    # Controller: Fetch a flight and display it
    @app.route("/on_time_performance")
    def on_time_performance():
      carrier = request.args.get('Carrier')
      flight_date = request.args.get('FlightDate')
      flight_num = request.args.get('FlightNum')
      flight = client.agile_data_science.on_time_performance.find_one({
        'Carrier': carrier,
        'FlightDate': flight_date,
        'FlightNum': int(flight_num)
      })
      return json_util.dumps(flight)

    if __name__ == "__main__":
      app.run(debug=True)

  • Data Syndrome: Agile Data Science 2.0

    Putting Records on the Web

    Plumbing our master records through to the web

    34

  • Data Syndrome: Agile Data Science 2.0

    Putting Records on the Web

    Plumbing our master records through to the web

    35

  • Agile Data Science 2.0 36

    Getting to know your data

    Tables and Charts

  • Data Syndrome: Agile Data Science 2.0

    Tables in PySpark

    Back end development in PySpark

    37

    # Load the parquet file
    on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

    # Use SQL to look at the total flights by month across 2015
    on_time_dataframe.registerTempTable("on_time_dataframe")
    total_flights_by_month = spark.sql(
      """SELECT Month, Year, COUNT(*) AS total_flights
      FROM on_time_dataframe
      GROUP BY Year, Month
      ORDER BY Year, Month"""
    )

    # This rdd/asDict trick makes the rows print a little prettier. It is optional.
    flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())
    flights_chart_data.collect()

    # Save chart to MongoDB
    import pymongo_spark
    pymongo_spark.activate()
    flights_chart_data.saveToMongoDB(
      'mongodb://localhost:27017/agile_data_science.flights_by_month'
    )

  • Data Syndrome: Agile Data Science 2.0

    Tables in Flask and Jinja2

    Front end development in Flask: controller and template

    38

    # Controller: Fetch a flight chart
    @app.route("/total_flights")
    def total_flights():
      total_flights = client.agile_data_science.flights_by_month.find({}, sort = [
        ('Year', 1),
        ('Month', 1)
      ])
      return render_template('total_flights.html', total_flights=total_flights)

    {% extends "layout.html" %}
    {% block body %}
      <h2>Total Flights by Month</h2>
      <table>
        <tr>
          <th>Month</th>
          <th>Total Flights</th>
        </tr>
        {% for month in total_flights %}
        <tr>
          <td>{{month.Month}}</td>
          <td>{{month.total_flights}}</td>
        </tr>
        {% endfor %}
      </table>
    {% endblock %}

  • Data Syndrome: Agile Data Science 2.0

    Tables

    Visualizing data

    39

  • Data Syndrome: Agile Data Science 2.0

    Charts in Flask and d3.js

    Visualizing data with JSON and d3.js

    40

    # Serve the chart's data via an asynchronous request (formerly known as 'AJAX')
    @app.route("/total_flights.json")
    def total_flights_json():
      total_flights = client.agile_data_science.flights_by_month.find({}, sort = [
        ('Year', 1),
        ('Month', 1)
      ])
      return json_util.dumps(total_flights, ensure_ascii=False)

    var width = 960,
        height = 350;

    var y = d3.scale.linear()
      .range([height, 0]);

    // We define the domain once we get our data in d3.json, below
    var chart = d3.select(".chart")
      .attr("width", width)
      .attr("height", height);

    d3.json("/total_flights.json", function(data) {
      var defaultColor = 'steelblue';
      var modeColor = '#4CA9F5';

      var maxY = d3.max(data, function(d) { return d.total_flights; });
      y.domain([0, maxY]);

      var varColor = function(d, i) {
        if(d['total_flights'] == maxY) { return modeColor; }
        else { return defaultColor; }
      }

      var barWidth = width / data.length;
      var bar = chart.selectAll("g")
        .data(data)
        .enter()
        .append("g")
        .attr("transform", function(d, i) {
          return "translate(" + i * barWidth + ",0)";
        });

      bar.append("rect")
        .attr("y", function(d) { return y(d.total_flights); })
        .attr("height", function(d) { return height - y(d.total_flights); })
        .attr("width", barWidth - 1)
        .style("fill", varColor);

      bar.append("text")
        .attr("x", barWidth / 2)
        .attr("y", function(d) { return y(d.total_flights) + 3; })
        .attr("dy", ".75em")
        .text(function(d) { return d.total_flights; });
    });

  • Data Syndrome: Agile Data Science 2.0

    Charts

    Visualizing data

    41

  • Agile Data Science 2.0 42

    Exploring your data through interaction

    Reports

  • Data Syndrome: Agile Data Science 2.0

    Creating Interactive Ontologies from Semi-Structured Data

    Extracting and visualizing entities

    43

  • Data Syndrome: Agile Data Science 2.0

    Home Page

    Extracting and decorating entities

    44

  • Data Syndrome: Agile Data Science 2.0

    Airline Entity

    Extracting and decorating entities

    45

  • Data Syndrome: Agile Data Science 2.0

    Summarizing Airlines 1.0

    Describing entities in aggregate

    46

  • Data Syndrome: Agile Data Science 2.0

    Summarizing Airlines 2.0

    Describing entities in aggregate

    47

  • Data Syndrome: Agile Data Science 2.0

    Summarizing Airlines 3.0

    Describing entities in aggregate

    48

  • Data Syndrome: Agile Data Science 2.0

    Summarizing Airlines 4.0

    Describing entities in aggregate

    49

  • Agile Data Science 2.0 50

    Predicting the future for fun and profit

    Predictions

  • Data Syndrome: Agile Data Science 2.0

    Back End Design

    Deep Storage and Spark vs Kafka and Spark Streaming

    51

    Batch: Historical Data -> Train Model

    Realtime: Realtime Data -> Apply Model

  • Data Syndrome: Agile Data Science 2.0 52

    jQuery in the web client submits a form to create the prediction request, and
    then polls another URL every few seconds until the prediction is ready. The
    request generates a Kafka event, which a Spark Streaming worker processes
    by applying the model we trained in batch. Having done so, it inserts a record
    for the prediction in MongoDB, where the Flask app sends it to the web client
    the next time it polls the server.

    Front End Design

    /flights/delays/predict/classify_realtime/
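    A sketch of the two Flask endpoints this flow implies: one that turns the
    submitted form into a Kafka prediction request, and one the client polls until
    the Spark Streaming worker has written its answer to MongoDB. The topic name,
    response route and field handling are assumptions; only the Mongo collection
    name matches the code shown later.

    import json, uuid
    from flask import Flask, request, jsonify
    from kafka import KafkaProducer
    from pymongo import MongoClient

    app = Flask(__name__)
    client = MongoClient()
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    PREDICTION_TOPIC = "flight_delay_classification_request"  # assumed topic name

    @app.route("/flights/delays/predict/classify_realtime", methods=["POST"])
    def classify_realtime():
      # Wrap the form fields in a prediction request and emit it to Kafka
      prediction_request = request.form.to_dict()
      prediction_request["UUID"] = str(uuid.uuid4())
      producer.send(PREDICTION_TOPIC, json.dumps(prediction_request).encode("utf-8"))
      return jsonify({"status": "WAIT", "id": prediction_request["UUID"]})

    @app.route("/flights/delays/predict/classify_realtime/response/<unique_id>")
    def classify_realtime_response(unique_id):
      # Poll MongoDB for the record the streaming job writes back
      prediction = client.agile_data_science.flight_delay_classification_response.find_one(
        {"UUID": unique_id}
      )
      if prediction:
        return jsonify({"status": "OK", "prediction": prediction["Prediction"]})
      return jsonify({"status": "WAIT", "id": unique_id})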

  • Data Syndrome: Agile Data Science 2.0

    User Interface

    Where the user submits prediction requests

    53

  • Data Syndrome: Agile Data Science 2.0

    String Vectorization

    From properties of items to vector format

    54
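    A toy illustration of the idea, assuming the spark session from the other
    slides: a nominal string column becomes a numeric index, then gets packed
    together with numeric fields into a single feature vector. The rows are made
    up for illustration; the training script does the same to Carrier, Origin,
    Dest and Route.

    from pyspark.ml.feature import StringIndexer, VectorAssembler

    toy = spark.createDataFrame(
      [("AA", 14.0, 2475.0), ("WN", 0.0, 368.0), ("DL", 31.0, 1199.0)],
      ["Carrier", "DepDelay", "Distance"]
    )

    # String -> index
    indexed = StringIndexer(inputCol="Carrier", outputCol="Carrier_index") \
      .fit(toy).transform(toy)

    # Numbers + index -> one vector column
    vectorized = VectorAssembler(
      inputCols=["DepDelay", "Distance", "Carrier_index"],
      outputCol="Features_vec"
    ).transform(indexed)

    vectorized.select("Carrier", "Carrier_index", "Features_vec").show(truncate=False)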

  • Data Syndrome: Agile Data Science 2.0 55

    The scikit-learn version was 166 lines. Spark MLlib is very powerful!

    http://bit.ly/train_model_spark

    190 Line Model

    Complete 190-line listing of train_spark_mllib_model.py, walked through
    section by section on the following slides.


  • Data Syndrome: Agile Data Science 2.0

    Initializing the Environment

    Setting up the environment

    56

    #!/usr/bin/env python

    import sys, os, re

    # Pass date and base path to main() from airflow
    def main(base_path):

      # Default to "."
      try:
        base_path
      except NameError:
        base_path = "."
      if not base_path:
        base_path = "."

      APP_NAME = "train_spark_mllib_model.py"

      # If there is no SparkSession, create the environment
      try:
        sc and spark
      except NameError as e:
        import findspark
        findspark.init()
        import pyspark
        import pyspark.sql

        sc = pyspark.SparkContext()
        spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()

  • Data Syndrome: Agile Data Science 2.0

    Loading the Training Data

    Using DataFrames to load structured data

    57

    from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
    from pyspark.sql.types import StructType, StructField
    from pyspark.sql.functions import udf

    schema = StructType([
      StructField("ArrDelay", DoubleType(), True),       # "ArrDelay":5.0
      StructField("CRSArrTime", TimestampType(), True),  # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
      StructField("CRSDepTime", TimestampType(), True),  # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
      StructField("Carrier", StringType(), True),        # "Carrier":"WN"
      StructField("DayOfMonth", IntegerType(), True),    # "DayOfMonth":31
      StructField("DayOfWeek", IntegerType(), True),     # "DayOfWeek":4
      StructField("DayOfYear", IntegerType(), True),     # "DayOfYear":365
      StructField("DepDelay", DoubleType(), True),       # "DepDelay":14.0
      StructField("Dest", StringType(), True),           # "Dest":"SAN"
      StructField("Distance", DoubleType(), True),       # "Distance":368.0
      StructField("FlightDate", DateType(), True),       # "FlightDate":"2015-12-30T16:00:00.000-08:00"
      StructField("FlightNum", StringType(), True),      # "FlightNum":"6109"
      StructField("Origin", StringType(), True),         # "Origin":"TUS"
    ])

    input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
      base_path
    )
    features = spark.read.json(input_path, schema=schema)
    features.first()

  • Data Syndrome: Agile Data Science 2.0

    Checking for Nulls

    Checking the data for null values that would crash Spark MLlib

    58

    #
    # Check for nulls in features before using Spark ML
    #
    null_counts = [(column, features.where(features[column].isNull()).count())
                   for column in features.columns]
    cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
    print(list(cols_with_nulls))

  • Data Syndrome: Agile Data Science 2.0

    Adding a Feature

    Using DataFrame.withColumn to add a Route feature to the data

    59

    #
    # Add a Route variable to replace FlightNum
    #
    from pyspark.sql.functions import lit, concat

    features_with_route = features.withColumn(
      'Route',
      concat(
        features.Origin,
        lit('-'),
        features.Dest
      )
    )
    features_with_route.show(6)

  • Data Syndrome: Agile Data Science 2.0

    Bucketizing the Prediction Column

    Using Bucketizer to convert a continuous variable to a nominal one

    60

    #
    # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
    #
    from pyspark.ml.feature import Bucketizer

    # Set up the Bucketizer
    splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
    arrival_bucketizer = Bucketizer(
      splits=splits,
      inputCol="ArrDelay",
      outputCol="ArrDelayBucket"
    )

    # Save the bucketizer
    arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
    arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)

    # Apply the bucketizer
    ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
    ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()

  • Data Syndrome: Agile Data Science 2.0

    StringIndexing the String Columns

    Using StringIndexer to convert nominal fields to numeric ones

    61

    #
    # Extract features with the tools in pyspark.ml.feature
    #
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Turn category fields into indexes
    for column in ["Carrier", "Origin", "Dest", "Route"]:
      string_indexer = StringIndexer(
        inputCol=column,
        outputCol=column + "_index"
      )
      string_indexer_model = string_indexer.fit(ml_bucketized_features)
      ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

      # Drop the original column
      ml_bucketized_features = ml_bucketized_features.drop(column)

      # Save the pipeline model
      string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
        base_path,
        column
      )
      string_indexer_model.write().overwrite().save(string_indexer_output_path)

  • Data Syndrome: Agile Data Science 2.0

    Vectorizing the Numeric Columns

    Combining the numeric fields with VectorAssembler

    62

    # Combine continuous, numeric fields with indexes of nominal ones
    # ...into one feature vector
    numeric_columns = [
      "DepDelay", "Distance", "DayOfMonth", "DayOfWeek", "DayOfYear"
    ]
    index_columns = ["Carrier_index", "Origin_index", "Dest_index", "Route_index"]
    vector_assembler = VectorAssembler(
      inputCols=numeric_columns + index_columns,
      outputCol="Features_vec"
    )
    final_vectorized_features = vector_assembler.transform(ml_bucketized_features)

    # Save the numeric vector assembler
    vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
    vector_assembler.write().overwrite().save(vector_assembler_path)

    # Drop the index columns
    for column in index_columns:
      final_vectorized_features = final_vectorized_features.drop(column)

    # Inspect the finalized features
    final_vectorized_features.show()

  • Data Syndrome: Agile Data Science 2.0

    Training the Classifier Model

    Creating and training a RandomForestClassifier model

    63

    # Instantiate and fit random forest classifier on all the data
    from pyspark.ml.classification import RandomForestClassifier

    rfc = RandomForestClassifier(
      featuresCol="Features_vec",
      labelCol="ArrDelayBucket",
      predictionCol="Prediction",
      maxBins=4657,
      maxMemoryInMB=1024
    )
    model = rfc.fit(final_vectorized_features)

    # Save the new model over the old one
    model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
      base_path
    )
    model.write().overwrite().save(model_output_path)

  • Data Syndrome: Agile Data Science 2.0

    Evaluating the Classifier Model

    Using MulticlassClassificationEvaluator to check the accuracy of the model

    64

    # Evaluate model using test data
    predictions = model.transform(final_vectorized_features)

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(
      predictionCol="Prediction",
      labelCol="ArrDelayBucket",
      metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions)
    print("Accuracy = {}".format(accuracy))

    # Check the distribution of predictions
    predictions.groupBy("Prediction").count().show()

    # Check a sample
    predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)

  • Data Syndrome: Agile Data Science 2.0

    Running Main

    Just what it looks like

    65

    if __name__ == "__main__": main(sys.argv[1])

  • Data Syndrome: Agile Data Science 2.0 66

    Using the model in realtime via Spark Streaming!

    Deploying the Model

  • Data Syndrome: Agile Data Science 2.0

    Loading the Models

    Loading the models we trained in batch to reproduce the data pipeline

    67

    # ch08/make_predictions_streaming.py

    # Load the arrival delay bucketizer
    from pyspark.ml.feature import Bucketizer
    arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
    arrival_bucketizer = Bucketizer.load(arrival_bucketizer_path)

    # Load all the string field vectorizer pipelines into a dict
    from pyspark.ml.feature import StringIndexerModel

    string_indexer_models = {}
    for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
                   "Origin", "Dest", "Route"]:
      string_indexer_model_path = "{}/models/string_indexer_model_{}.bin".format(
        base_path,
        column
      )
      string_indexer_model = StringIndexerModel.load(string_indexer_model_path)
      string_indexer_models[column] = string_indexer_model

    https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch08/make_predictions_streaming.py

  • Data Syndrome: Agile Data Science 2.0

    Loading the Models

    Loading the models we trained in batch to reproduce the data pipeline

    68

    # ch08/make_predictions_streaming.py

    # Load the numeric vector assembler
    from pyspark.ml.feature import VectorAssembler
    vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
    vector_assembler = VectorAssembler.load(vector_assembler_path)

    # Load the classifier model
    from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
    random_forest_model_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
      base_path
    )
    rfc = RandomForestClassificationModel.load(
      random_forest_model_path
    )

    https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch08/make_predictions_streaming.py

  • Data Syndrome: Agile Data Science 2.0

    Connecting to Kafka

    Creating a direct stream to the Kafka queue containing our prediction requests

    69

    #
    # Process Prediction Requests in Streaming
    #
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createDirectStream(
      ssc,
      [PREDICTION_TOPIC],
      {
        "metadata.broker.list": BROKERS,
        "group.id": "0",
      }
    )

    object_stream = stream.map(lambda x: json.loads(x[1]))
    object_stream.pprint()

  • Data Syndrome: Agile Data Science 2.0

    Repeating the Pipeline

    Running the prediction requests through the same data flow as the training data

    70

    row_stream = object_stream.map(
      lambda x: Row(
        FlightDate=iso8601.parse_date(x['FlightDate']),
        Origin=x['Origin'],
        Distance=x['Distance'],
        DayOfMonth=x['DayOfMonth'],
        DayOfYear=x['DayOfYear'],
        UUID=x['UUID'],
        DepDelay=x['DepDelay'],
        DayOfWeek=x['DayOfWeek'],
        FlightNum=x['FlightNum'],
        Dest=x['Dest'],
        Timestamp=iso8601.parse_date(x['Timestamp']),
        Carrier=x['Carrier']
      )
    )
    row_stream.pprint()

    # Do the classification and store to Mongo
    row_stream.foreachRDD(classify_prediction_requests)

    ssc.start()
    ssc.awaitTermination()

  • Data Syndrome: Agile Data Science 2.0

    Repeating the Pipeline

    Running the prediction requests through the same data flow as the training data

    71

    def classify_prediction_requests(rdd):

      from pyspark.sql.types import StringType, IntegerType, DoubleType, DateType, TimestampType
      from pyspark.sql.types import StructType, StructField

      prediction_request_schema = StructType([
        StructField("Carrier", StringType(), True),
        StructField("DayOfMonth", IntegerType(), True),
        StructField("DayOfWeek", IntegerType(), True),
        StructField("DayOfYear", IntegerType(), True),
        StructField("DepDelay", DoubleType(), True),
        StructField("Dest", StringType(), True),
        StructField("Distance", DoubleType(), True),
        StructField("FlightDate", DateType(), True),
        StructField("FlightNum", StringType(), True),
        StructField("Origin", StringType(), True),
        StructField("Timestamp", TimestampType(), True),
        StructField("UUID", StringType(), True),
      ])

      prediction_requests_df = spark.createDataFrame(rdd, schema=prediction_request_schema)
      prediction_requests_df.show()

      from pyspark.sql.functions import lit, concat
      prediction_requests_with_route = prediction_requests_df.withColumn(
        'Route',
        concat(
          prediction_requests_df.Origin,
          lit('-'),
          prediction_requests_df.Dest
        )
      )
      prediction_requests_with_route.show(6)

      ...

  • Data Syndrome: Agile Data Science 2.0

    Repeating the Pipeline

    Running the prediction requests through the same data flow as the training data

    72

    for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
                   "Origin", "Dest", "Route"]:
      string_indexer_model = string_indexer_models[column]
      prediction_requests_with_route = string_indexer_model.transform(prediction_requests_with_route)

    # Vectorize numeric columns: DepDelay, Distance and index columns
    final_vectorized_features = vector_assembler.transform(prediction_requests_with_route)

    # Inspect the vectors
    final_vectorized_features.show()

    # Drop the individual index columns
    index_columns = ["Carrier_index", "DayOfMonth_index", "DayOfWeek_index", "DayOfYear_index",
                     "Origin_index", "Dest_index", "Route_index"]
    for column in index_columns:
      final_vectorized_features = final_vectorized_features.drop(column)

    # Inspect the finalized features
    final_vectorized_features.show()

    # Make the prediction
    predictions = rfc.transform(final_vectorized_features)

    # Drop the features vector and prediction metadata to give the original fields
    predictions = predictions.drop("Features_vec")
    final_predictions = predictions.drop("indices").drop("values").drop("rawPrediction").drop("probability")

    # Inspect the output
    final_predictions.show()

  • Data Syndrome: Agile Data Science 2.0

    Storing to Mongo

    Putting the result where our web application can access it

    73

    # Store to Mongo
    if final_predictions.count() > 0:
      final_predictions.rdd.map(lambda x: x.asDict()).saveToMongoDB(
        "mongodb://localhost:27017/agile_data_science.flight_delay_classification_response"
      )

  • Data Syndrome: Agile Data Science 2.0 74

    Experimental setup for iteratively improving the predictive model

    Improving the Model

  • Data Syndrome: Agile Data Science 2.0

    Experiment Setup

    Necessary to improve the model

    75

  • Data Syndrome: Agile Data Science 2.0 76

    155 additional lines to set up an experiment and add 3 new features to improve the model

    http://bit.ly/improved_model_spark

    345 L.O.C.

    Complete 345-line listing of the improved training script. Relative to the
    190-line baseline it adds CRSDepHourOfDay and CRSArrHourOfDay features derived
    from the scheduled departure and arrival times, cross-validates the random
    forest over repeated train/test splits on accuracy, weightedPrecision,
    weightedRecall and f1, and keeps pickled score and feature-importance logs
    between runs so each experiment is compared with the previous one. The
    experiment loop at its core is sketched below; full source at
    http://bit.ly/improved_model_spark
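    A condensed sketch of that experiment loop, assuming the final_vectorized_features
    DataFrame built on the earlier slides: repeat a random train/test split, score each
    split on several metrics, and average them. This is an illustrative condensation,
    not the full 345-line script.

    from collections import defaultdict
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    scores = defaultdict(list)
    metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
    split_count = 3

    for i in range(split_count):
      # Fresh 80/20 split for every run of the experiment
      training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])

      rfc = RandomForestClassifier(
        featuresCol="Features_vec",
        labelCol="ArrDelayBucket",
        predictionCol="Prediction",
        maxBins=4657,
      )
      model = rfc.fit(training_data)
      predictions = model.transform(test_data)

      # Score this split on each metric
      for metric_name in metric_names:
        evaluator = MulticlassClassificationEvaluator(
          labelCol="ArrDelayBucket",
          predictionCol="Prediction",
          metricName=metric_name
        )
        scores[metric_name].append(evaluator.evaluate(predictions))

    # Average each metric across splits to get a stable score for this experiment
    score_averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    print(score_averages)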


  • Data Syndrome: Agile Data Science 2.0 77

    Next steps for learning more about Agile Data Science 2.0

    Next Steps

  • Building Full-Stack Data Analytics Applications with Spark

    http://bit.ly/agile_data_science

    Available Now on O'Reilly Safari: http://bit.ly/agile_data_safari

    Agile Data Science 2.0


  • Agile Data Science 2.0 79

    Realtime Predictive Analytics

    Rapidly learn to build entire predictive systems driven by

    Kafka, PySpark, Spark Streaming, Spark MLlib, and a web

    front-end using Python/Flask and jQuery.

    Available for purchase at http://datasyndrome.com/video


    Data Syndrome

    Russell Jurney

    Principal Consultant

    Email: [email protected] | Web: datasyndrome.com

    Data Syndrome, LLC

    Product Consulting

    We build analytics products and systems consisting of

    big data viz, predictions, recommendations, reports and search.

    Corporate Training

    We offer training courses for data scientists and engineers,

    and for data science teams.

    Video Training

    We offer video training courses that rapidly

    acclimate you to a technology and technique.
