a full machine learning pipeline in scikit-learn vs in scala-spark: pros and cons
Post on 21-Apr-2017
TRANSCRIPT
A full Machine Learning pipeline in Scikit-learn vs Scala-Spark: pros and cons
Jose Quesada and David Anderson
@quesada, @alpinegizmo, @datascienceret

Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala.
Why this talk?
How do you get from a single-machine workload to a fully distributed one?
Answer: Spark machine learning
Is there something I'm missing out on by staying with Python?
Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.
You attend a Retreat, not a training
Mentors are world-class: CTOs, library authors, inventors, founders of fast-growing companies, etc. DSR accepts fewer than 5% of applications.
Strong focus on commercial awareness
5 years of working experience on average
30+ partner companies in Europe
DSR participants do a portfolio project
Why is DSR talking about Scala/Spark?
They are behind Scala. IBM is behind this. They hired us to make training materials.
Source: Spark 2015 infographic
(Chart: mindshare among data science badasses over time; subjective)
A talk should give you a superpower. Am I missing out?
Scala
"Scala offers the easiest refactoring experience that I've ever had, due to the type system." (Jacob, Coursera engineer)
Spark: basically distributed Scala
API: Scala, Java, Python, and R bindings
Libraries: SQL, streams, graph processing, machine learning
One of the most active open source projects
Spark will inevitably become the de-facto Big Data framework for Machine Learning and Data Science.
Dean Wampler, Lightbend
All under one roof (big Win)
Source: Spark 2015 infographic. Spark Core, Spark SQL, Spark Streaming, Spark.ml (machine learning), GraphX (graphs)
Spark Programming Model
(Diagram: input flows to the driver / SparkContext, which sends tasks to the workers and collects the results)
Data is partitioned; code is sent to the data
(Diagram: the same driver/worker picture, now with the data partitioned across the workers)
Example: word count
Input lines: "hello world", "foo bar", "foo foo bar", "bye world"
Data is immutable, and is partitioned across the cluster.
We get things done by creating new, transformed copies of the data. In parallel.
Splitting the lines yields the words: hello, world, foo, bar, foo, foo, bar, bye, world
Mapping each word to a pair yields: (hello, 1), (world, 1), (foo, 1), (bar, 1), (foo, 1), (foo, 1), (bar, 1), (bye, 1), (world, 1)
Some operations require a shuffle to group data together: the (word, 1) pairs become (hello, 1), (foo, 3), (bar, 2), (bye, 1), (world, 2)
Example: word count, in code:

lines = sc.textFile(input)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
                   .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile(output)

The map and reduceByKey steps are pipelined into the same Python executor. Nothing happens until the last line: saveAsTextFile is an "action" that forces evaluation of the RDD.
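The same computation can be sketched in plain Python (no Spark here; the list operations and the small reduce_by_key helper are stand-ins for the RDD API) to show exactly what each step produces:

```python
# Toy stand-ins for flatMap / map / reduceByKey on a single machine.
lines = ["hello world", "foo bar", "foo foo bar", "bye world"]

# flatMap: split every line into words, flattening the result
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: group by key, then combine the counts (this is the "shuffle")
def reduce_by_key(f, pairs):
    groups = {}
    for k, v in pairs:
        groups[k] = f(groups[k], v) if k in groups else v
    return groups

word_count = reduce_by_key(lambda x, y: x + y, pairs)
print(word_count)  # {'hello': 1, 'world': 2, 'foo': 3, 'bar': 2, 'bye': 1}
```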
RDD: Resilient Distributed Dataset
An immutable, partitioned collection of elements that can be operated on in parallel. Lazy. Fault-tolerant: missing partitions can be recomputed by using the lineage graph to rerun operations.
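A minimal sketch of that lineage idea, in plain Python rather than Spark (ToyRDD and its fields are invented for illustration): each dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # materialized data, or None
        self.parent = parent           # lineage: where this RDD came from
        self.fn = fn                   # lineage: how it was derived

    def map(self, f):
        # lazy: records the transformation, computes nothing yet
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, i):
        if self._partitions is not None:
            return self._partitions[i]
        # partition lost / never materialized: recompute from the parent
        return self.fn(self.parent.compute(i))

base = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)  # nothing computed yet (lazy)
print(doubled.compute(1))            # [6, 8], rebuilt via the lineage
```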
PySpark RDD Execution Model
Whenever you provide a lambda to operate on an RDD, each Spark worker forks a Python worker, and the data is serialized and piped to those Python workers.
When using Python, the SparkContext in Python is basically a proxy: py4j is used to launch a JVM and create a native Spark context, and py4j manages communication between the Python and Java SparkContext objects.
In the workers, some operations can be executed directly in the JVM. But if you've implemented a map function in Python, for example, a Python process is forked to execute this user-supplied mapping. Each thread in the Spark worker will have its own Python sub-process.
When the Python wrapper calls the underlying Spark code (written in Scala, running on a JVM), translation between the two environments and languages can be a source of bugs and issues.
Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.
Impact of this execution model: worker overhead (forking, serialization); the cluster manager isn't aware of Python's memory needs; very confusing error messages.
Spark DataFrames (and Datasets)
Based on RDDs, but tabular; something like SQL tables. Not Pandas.
Rescues Python from serialization overhead: df.filter(df.col("color") == "red") is processed entirely in the JVM, vs. rdd.filter(lambda x: x.color == "red"), which is not.
Python UDFs and maps still require serialization and piping to Python.
You can write (and register) Scala code, and then call it from Python.
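Why does the DataFrame version escape the Python overhead? A column expression is a small data structure the engine can inspect and push into the JVM, while a lambda is an opaque Python function. A toy sketch of the distinction in plain Python (ColEquals is an invented stand-in, not a Spark API):

```python
class ColEquals:
    """An inspectable predicate, like df.col("color") == "red"."""
    def __init__(self, col, value):
        self.col, self.value = col, value

    def __call__(self, row):
        return row[self.col] == self.value

rows = [{"color": "red", "n": 1}, {"color": "blue", "n": 2}]

opaque = lambda row: row["color"] == "red"  # engine sees only a callable
expr = ColEquals("color", "red")            # engine sees column and value

# Both select the same rows...
assert [r for r in rows if opaque(r)] == [r for r in rows if expr(r)]
# ...but only the expression exposes its structure, so an engine could
# push it down into a JVM-side (or columnar) reader:
print(expr.col, expr.value)
```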
DataFrame execution: unified across languages. The Python, Java/Scala, and R DataFrame API wrappers all create a logical plan (a DAG); Catalyst optimizes the plan, and Tungsten compiles it into executable code.
DataFrame performance
ML Workflow: Data Ingestion, Data Cleaning / Feature Engineering, Model Training, Testing and Validation, Deployment
Machine learning with scikit-learn: easy to use; rich ecosystem; limited to one machine (but see the sparkit-learn package).
Machine learning with Hadoop (in short: NO): each iteration is a new M/R job, and each job must store data in HDFS, so there is lots of overhead.
How Spark killed Hadoop map/reduceFar easier to program
More cost-effective since less hardware can perform the same tasks much faster
Can do real-time processing as well as batch processing
Can do ML, graphs
Hadoop offers just one Map / Reduce step, but many algorithms are iterative; it is disk-based, with long startup times.
Spark is a wholesale replacement for MapReduce that leverages lessons learned from MapReduce. The Hadoop community realized that a replacement for MR was needed. While MR has served the community well, it's a decade old and shows clear limitations and problems, as we've seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors have followed suit.
When it comes to one-pass ETL-like jobs, for example data transformation or data integration, MapReduce is the deal: this is what it was designed for.
Advantages for Hadoop: security, staffing.
Machine learning with Spark
Spark was designed for ML workloads: caching (reuse data); accumulators (keep state across iterations); functional, lazy, fault-tolerant.
Many popular algorithms are supported out of the box, and it is simple to productionalize models.
MLlib is RDD-based (the past); spark.ml is DataFrame-based (the future).
A sample use case for accumulators: gradient descent.
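That accumulator use case can be sketched in plain Python (not Spark): each simulated partition adds its partial gradient into one accumulator, and the driver applies the update. Fitting y = 2x with a single weight keeps the arithmetic obvious.

```python
# Two "partitions" of a dataset that follows y = 2x exactly.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
partitions = [data[:2], data[2:]]  # pretend the data lives on two workers

w = 0.0
lr = 0.02
for _ in range(200):
    grad_acc = 0.0                   # the accumulator: workers add, driver reads
    for part in partitions:          # in Spark these would run in parallel
        for x, y in part:
            grad_acc += 2 * (w * x - y) * x  # d/dw of the squared error
    w -= lr * grad_acc / len(data)   # driver applies the averaged gradient

print(round(w, 3))  # converges to 2.0
```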
Spark is an ecosystem of ML frameworks. Spark was designed by people who understood the needs of ML practitioners (unlike Hadoop): MLlib, Spark.ml, System.ml (IBM), Keystone.ml.
Spark.ML, the basics:
DataFrame: ML requires DFs holding vectors
Transformer: transforms one DF into another
Estimator: fit on a DF; produces a Transformer
Pipeline: chain of Transformers and Estimators
Parameter: there is a unified API for specifying parameters
Evaluator: scores a model's predictions
CrossValidator: model selection via grid search
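A toy sketch of those roles in plain Python (invented classes, not the real spark.ml API): fit() on an Estimator yields a Transformer, and a Pipeline threads data through both kinds of stage.

```python
class MeanCenterer:                      # an Estimator
    def fit(self, data):
        mean = sum(data) / len(data)
        return Shift(-mean)              # fitting yields a Transformer

class Shift:                             # a Transformer
    def __init__(self, offset):
        self.offset = offset

    def transform(self, data):
        return [x + self.offset for x in data]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):    # Estimator: fit it first
                stage = stage.fit(data)
            fitted.append(stage)
            data = stage.transform(data)
        return Pipeline(fitted)          # fitted pipeline: all Transformers

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = Pipeline([MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # [2.0]: 4.0 minus the fitted mean of 2.0
```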
Machine Learning scaling challenges that Spark solves: hyper-parameter tuning, ETL / feature engineering, and the model itself.
Q: What's the hardest scaling problem in data science? A: Adding people.
Spark.ml has a clean architecture and APIs that should encourage code sharing and reuse. A good first step: can you refactor some ETL code as a Transformer?
We don't see much sharing of components happening yet: entire libraries, yes; components, not so much. Perhaps because Spark has been evolving so quickly; e.g., a pull request implementing non-linear SVMs has been stuck for a year.
Structured types in Spark:

                 SQL       DataFrames    DataSets (Java/Scala only)
Syntax errors    Runtime   Compile time  Compile time
Analysis errors  Runtime   Runtime       Compile time
User experience: Spark.ml vs scikit-learn
Indexing categorical features: you are responsible for identifying and indexing categorical features.

val rfcd_indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("color_index")
  .fit(dataset)

val seo_indexer = new StringIndexer()
  .setInputCol("status")
  .setOutputCol("status_index")
  .fit(dataset)

Here Spark.ml departs from scikit-learn quite a bit.
Assembling features: you must gather all of your features into one Vector, using a VectorAssembler.

val assembler = new VectorAssembler()
  .setInputCols(Array("color_index", "status_index", ...))
  .setOutputCol("features")
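What the two Scala snippets above do can be sketched in plain Python (not spark.ml; the helper names are invented): index each categorical column by descending frequency, which is StringIndexer's default ordering, then concatenate the indexed columns into one feature vector, as VectorAssembler does.

```python
rows = [{"color": "red", "status": "ok"},
        {"color": "blue", "status": "ok"},
        {"color": "red", "status": "bad"}]

def fit_string_indexer(rows, col):
    # most frequent value gets index 0
    freq = {}
    for r in rows:
        freq[r[col]] = freq.get(r[col], 0) + 1
    ordered = sorted(freq, key=lambda v: -freq[v])
    return {v: i for i, v in enumerate(ordered)}

color_index = fit_string_indexer(rows, "color")    # {'red': 0, 'blue': 1}
status_index = fit_string_indexer(rows, "status")  # {'ok': 0, 'bad': 1}

def assemble(row):
    # concatenate the indexed columns into one feature vector
    return [color_index[row["color"]], status_index[row["status"]]]

print([assemble(r) for r in rows])  # [[0, 0], [1, 0], [0, 1]]
```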
Spark.ml vs scikit-learn: Pipelines (good news!). Spark ML and scikit-learn take the same approach:
Chain together Estimators and Transformers
Support non-linear pipelines (must be a DAG)
Unify parameter passing
Support for cross-validation and grid search
Can write your own custom pipeline stages
Here Spark.ml works just like scikit-learn. Good.
Spark.ml Transformer   Description                                        scikit-learn equivalent
Binarizer              Threshold numerical feature to binary              Binarizer
Bucketizer             Bucket numerical features into ranges              (none)
ElementwiseProduct     Scale each feature/column separately               (none)
HashingTF              Hash text/data to vector; scale by term frequency  FeatureHasher
IDF                    Scale features by inverse document frequency       TfidfTransformer
Normalizer             Scale each row to unit norm                        Normalizer
OneHotEncoder          Encode k-category feature as binary features       OneHotEncoder
PolynomialExpansion    Create higher-order features                       PolynomialFeatures
RegexTokenizer         Tokenize text using regular expressions            (part of text methods)
StandardScaler         Scale features to 0 mean and/or unit variance      StandardScaler
StringIndexer          Convert String feature to 0-based indices          LabelEncoder
Tokenizer              Tokenize text on whitespace                        (part of text methods)
VectorAssembler        Concatenate feature vectors                        FeatureUnion
VectorIndexer          Identify categorical features, and index           (none)
Word2Vec               Learn vector representation of words               (none)
Spark.ml vs scikit-learn: NLP tasks (thumbs up)
From https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-apache-spark-1-4.html
Graph stuff (GraphX, GraphFrames: not great)
Extremely easy to run monster algorithms on a cluster, but GraphX has no Python API.
GraphFrames are cool, and should provide access to the graph tools in Spark from Python; in practice, it didn't work too well.
Things we liked in Spark ML:
Architecture encourages building reusable pieces
Type safety, plus types are driving optimizations
Model fitting returns an object that transforms the data
Uniform way of passing parameters
It's interesting to use the same platform for ETL and model fitting
Very easy to parallelize ETL and grid search, or work with huge models
Disappointments using Spark ML:
Feature indexing and assembly can become tedious
Surprised by the maximum depth limit for trees: 30
Data exploration and visualization aren't easy in Scala
Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)
What is new for machine learning in Spark 2.0
The DataFrame-based Machine Learning API emerges as the primary ML API: with Spark 2.0, the spark.ml package, with its pipeline APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.
Machine learning pipeline persistence: users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
What is new for data structures in Spark 2.0
Unifying the API for streams and static data: infinite datasets get the same interface as DataFrames.
What have Spark and Scala ever given us?
Other than distributed dataframes, distributed machine learning, easy distributed grid search, distributed SQL, distributed stream analysis, more performance than MapReduce, an easier programming model, and easier deployment...
What have Spark and Scala ever given us?
Reminder: 25 videos explaining ML on Spark, for people who already know ML:
http://datascienceretreat.com/videos/data-science-with-scala-and-spark
Thank you for your attention!@quesada, @datascienceret