data science with spark - training at sparksummit (east)
TRANSCRIPT
![Page 1: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/1.jpg)
Data Science Training Spark
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner�"There are no facts, only interpretations." - Friedrich Nietzsche �
"If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions!------ W
e are not going to hang data by it’s legs !!
http://training.databricks.com/workshop/datasci.pdf
![Page 2: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/2.jpg)
ADVANCED: DATA SCIENCE WITH APACHE SPARK Data Science applications with Apache Spark combine the scalability of Spark and the distributed machine learning algorithms.
This material expands on the “Intro to Apache Spark” workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and leveraging cloud-based notebooks within a team context. Includes limited free accounts on Databricks Cloud.
Topics covered include:
Data transformation techniques based on both Spark SQL and functional programming in Scala and Python.
Predictive analytics based on MLlib, clustering with KMeans, building classifiers with a variety of algorithms and text analytics – all with emphasis on an iterative cycle of feature engineering, modeling, evaluation.
Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
Understand how the primitives like Matrix Factorization are implemented in a distributed parallel framework from the designers of MLlib
Several hands-on exercises using datasets such as Movielens, Titanic, State Of the Union speeches, and RecSys Challenge 2015.
Prerequisites:Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)Experience coding in Scala, Python, SQLHave some familiarity with Data Science topics (e.g., business use cases)
![Page 3: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/3.jpg)
Agenda
• Detailed)agenda)In)Google)doc)• https://docs.google.com/document/d/
1T9AkXUmL6gDYTpAEEgqsy9hfJtqGlavjnGzMqzUwDfE/edit)
![Page 4: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/4.jpg)
Goals • Patterns):)Data)wrangling)(Transform,)Model)&)Reason))with)Spark)
o Use)RDDs,)Transformations)and)Actions)in)the)context)of)a)Data)Science)Problem,)an)Algorithms))&)a)Dataset)
• Spend)time)working)through)MLlib))• Balance)between)internals)&)handsWon)
o Internals)from)Reza,)the)MLlib)lead)• ~65%)of)time)on)Databricks)Cloud)&)Notebooks)
o Take)the)time)to)get)familiar)with)the)Interface)&)the)Data)Science)Cloud)o Make*mistakes,*experiment,…*
• Good)Time)for)this)course,)this)version)o Will)miss)many)of)the)gory)details)as)the)framework)evolves)
• Summarized)materials)for)a)3)day)course)o Even)if)we)don’t)finish)the)exercises)today,)that)is)fine)o Complete)the)work)at)home)W)There*are*also**homework*notebooks*o Ask)us)questions)@ksankar,@pacoid,*@reza_zadeh,*@mhfalaki,*@andykonwinski,*@xmeng,*
@michaelarmbrust,*@tathadas*
![Page 5: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/5.jpg)
Tutorial)Outline:)
morning' a)ernoon'
o Welcome)+)Ge3ng)Started)(Krishna))o Databricks)Cloud)mechanics)(Andy)))
o Ex)3):)Clustering)B)In)which)we)explore)SegmenFng)Frequent)InterGallacFcHoppers)(Krishna))
o Ex)0:)PreBFlight)Check)(Krishna)) o Ex)4):)RecommendaFon)(Krishna))
o DataScience)DevOps)B)IntroducFon)to)Spark)(Krishna)) o Theory):)Matrix)FactorizaFon,)SVD,…)(Reza))
o Ex)1:)MLlib):)StaFsFcs,)Linear)Regression)(Krishna)) o OnBline)kBmeans,)spark)streaming)(Reza))
o MLlib)Deep)Dive)–)Lecture)(Reza))o Design)Philosophy,)APIs)
o Ex)2:)In)which)we)explore)Disasters,)Trees,)ClassificaFon)&)the)Kaggle)CompeFFon)(Krishna))o Random)Forest,)Bagging,)Data)DeBcorrelaFon)
o Ex)5):)Mood)of)the)UnionBText)AnalyFcs(Krishna))o In)which)we)analyze)the)Mood)of)the)naFon)from)
inferences)on)SOTU)by)the)POTUS)(State)of)the)Union)Addresses)by)The)President)Of)the)US))
o Deepdive)B)Leverage)parallelism)of)RDDs,)sparse)vectors,)etc)(Reza))
o Ex)99):)RecSys)2015)Challenge)(Krishna))
o Ask)Us)Anything)B)Panel)
![Page 6: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/6.jpg)
Introducing:)
Reza Zadeh @Reza_Zadeh
Hossein Falaki @mhfalaki
Andy Konwinski @andykonwinski
Krishna Sankar @ksankar
Paco Nathan @pacoid Xiangrui Meng
@xmeng
Michael Armbrust @michaelarmbrust
Tathagata Das @tathadas
![Page 7: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/7.jpg)
About Me
o Chief Data Scientist at BlackArrow.tv o Have been speaking at OSCON, PyCon, Pydata,
Strata et al o Reviewer “Machine Learning with Spark” o Picked up co-authorship Second Edition of “Fast
Data Processing with Spark” o Have done lots of things: • )'��!3!���%3!)+�� ).)-&.1,!3)#2���)-!-#)!+���$�%#(���• 1)33%-� ..*2�� %"���� )1%+%22���!5!�:��• �3!-$!1$2�� %"��%15)#%���+.4$����.,%�6.1*�)-����• �4%23��%#341%1�!3��!5!+�����#(..+�:�• �+!--)-'��!23%12��.,/43!3).-!+��)-!-#%�.1��3!3)23)#2��• �.+4-3%%1�!2��.".3)#2��4$'%�!3��)123��%'.�+%!'4%� .1+$��.,/%3)3).-2�
o @ksankar, doubleclix.wordpress.com [email protected]
The)Nuthead)band)!)
![Page 8: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/8.jpg)
Pre-requisites ① Register)&)Download)data)from)Kaggle.)
We)cannot)distribute)Kaggle)data.)Moreover)you)need)an)account)to)submit)entries)
a) Setup)an)account)in)Kaggle)(www.kaggle.com))b) We)will)be)using)the)data)from)the)competition)“Titanic:)
Machine)Learning)from)Disaster”)c) Download)data)from)
http://www.kaggle.com/c/titanicWgettingStarted)② Register)for)RecSys)2015)Competition)
a) http://2015.recsyschallenge.com/)
![Page 9: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/9.jpg)
Welcome + Getting Started
9:00
![Page 10: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/10.jpg)
Everyone will receive a username/password for one !of the Databricks Cloud shards. Use your laptop and browser to login there.
We find that cloud-based notebooks are a simple way to get started using Apache Spark – as the motto “Making Big Data Simple” states.
Please create and run a variety of notebooks on your account throughout the tutorial. These accounts will remain open long enough for you to export your work.
See the product page or FAQ for more details, or contact Databricks to register for a trial account.
10
Getting Started: Step 1
![Page 11: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/11.jpg)
11
Getting Started: Step 1 – Credentials
url: https://class01.cloud.databricks.com/user: student-777pass: 93ac11xq23z5150cluster: student-777
![Page 12: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/12.jpg)
12
Open in a browser window, then click on the navigation menu in the top/left corner:
Getting Started: Step 2
![Page 13: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/13.jpg)
13
The next columns to the right show folders, !and scroll down to click on databricks_guide
Getting Started: Step 3
![Page 14: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/14.jpg)
14
Scroll to open the 01 Quick Start notebook, then follow the discussion about using key features:
Getting Started: Step 4
![Page 15: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/15.jpg)
15
See /databricks-guide/01 Quick Start !
Key Features:
• Workspace / Folder / Notebook
• Code Cells, run/edit/move/comment
• Markdown
• Results
• Import/Export
Getting Started: Step 5
![Page 16: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/16.jpg)
16
Click on the Workspace menu and create your !own folder (pick a name):
Getting Started: Step 6
![Page 17: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/17.jpg)
17
Navigate to /_DataScience Hover on its drop-down menu, on the right side:
Getting Started: Step 7
![Page 18: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/18.jpg)
18
Navigate to /_DataScience Hover on its drop-down menu, on the right side:Click Clone:
Getting Started: Step 8
![Page 19: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/19.jpg)
19
Then create a clone of this folder in the folder that you just created:
Getting Started: Step 9
![Page 20: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/20.jpg)
20
Now let’s get started with the coding exercise!
We’ll define an initial Spark app in three lines of code:
Click on _00.pre-flight-check
Getting Started: Coding Exercise
![Page 21: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/21.jpg)
21
Attach your cluster – same as your username:
Getting Started: Step 10
![Page 22: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/22.jpg)
Introduction To Spark
![Page 23: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/23.jpg)
Data Science : The art of building a model with known knowns, which when let loose, works with unknown unknowns!
�$#�!���("&��!���&��#��%"����%���'������#'�&'���
http://smartorg.com/2013/07/valuepoint19/
The World Knowns&&&&&&Unknowns&
You UnKnown& Known&
o Others)know,)you)don’t) o What)we)do)
o Facts,)outcomes)or)scenarios)we)have)not)encountered,)nor)considered)
o “Black)swans”,)outliers,)long)tails)of)probability)distribuFons)
o Lack)of)experience,)imaginaFon)
o PotenFal)facts,)outcomes)we)are)aware,)but)not))with)certainty)
o StochasFc)processes,)ProbabiliFes)
o �#$)#��#$)#&�o ���%���%��'��#�&�)�� #$)�'��'�)�� #$)�
o �#$)#��# #$)#&�o ���'��&�'$�&�*��'��%���%��'��#�&�'��'�)��
#$)� #$)�)���$#�'� #$)�o ('�'��%���%���!&$��# #$)#��# #$)#&�
o ���%���%��'��#�&�)���$�#$'� #$)�)���$#�'� #$)�
![Page 24: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/24.jpg)
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data
Products o Data Scientist should tell a story
��'������#'�&'
��#$(#�����%&$
#�)�$��&���''�
%��'�
&'�'�&'��&�'��#
��#*�&$�')�%�
��#��#��%�����
''�%�
�'�&$�')�%���
#��#��%�#��'��
#��#*�&'�'�&'��
��#��
���$&����!!&���!
$(��%���
��'������#'�&'
��#$(#�����%&$
#�)�$��&�)$%&
���'�
&'�'�&'��&�'��#
��#*�&'�'�&'���
�#���)$%&���'
�
&$�')�%���#��
#��%�#��'��#��
#*�&$�')�%��
�#��#��%� ����!!��(
��%& �������!�
��
http://doubleclix.wordpress.com/2014/01/25/theWcuriousWcaseWofWtheWdataWscientistWprofession/
��%����&���%�� #��#�'���&�"(�����&��%�������'(&�%$)#�
![Page 25: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/25.jpg)
Data Science - Context
o Scalable)Model)Deployment)
o Big)Data)automation)&)purpose)built)appliances)(soft/hard))
o Manage)SLAs)&)response)times)
o Volume)o Velocity)o Streaming)Data)
o Canonical)form)o Data)catalog)o Data)Fabric)across)the)
organization)o Access)to)multiple)
sources)of)data))o Think)Hybrid)–)Big)
Data)Apps,)Appliances)&)Infrastructure)
Collect Store Transform
o Metadata)o Monitor)counters)&)
Metrics)o Structured)vs.)MultiW
structured)
o Flexible)&)Selectable)! Data)Subsets))! Attribute)sets)
o Refine)model)with)! Extended)Data)
subsets)! Engineered)
Attribute)sets)o Validation)run)across)a)
larger)data)set)
Reason Model Deploy
Data Management
Data Science
o Dynamic)Data)Sets)o 2)way)keyWvalue)tagging)of)
datasets)o Extended)attribute)sets)o Advanced)Analytics)
Explore Visualize Recommend Predict
o Performance)o Scalability)o Refresh)Latency)o InWmemory)Analytics)
o Advanced)Visualization)o Interactive)Dashboards)o Map)Overlay)o Infographics)
" Bytes to Business a.k.a. Build the full stack
" Find Relevant Data For Business
" Connect the Dots
![Page 26: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/26.jpg)
Volume
Velocity
Variety
Data Science - Context
Context
Connectedness
Intelligence
Interface
Inference
,��'��$��(#(&(�!�&�+�-�'��'���#�'�����%('���$%����
o Three Amigos o Interface = Cognition o Intelligence = Compute(CPU) & Computational(GPU) o Infer Significance & Causality
![Page 27: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/27.jpg)
Day in the life of a (super) Model
Intelligence
Inference
Data Representation
Interface
Algorithms)
Parameters)Agributes)
Data)(Scoring))
Model)SelecFon)
Reason)&)Learn)
Models)
Visualize,)Recommend,)Explore)
Model)Assessment)
Feature)SelecFon)Dimensionality)ReducFon)
![Page 28: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/28.jpg)
The'Sense'&'Sensibility'of'a'DataScien7st'DevOps'
Factory)=)OperaFonal)
Lab)=)InvesFgaFve)
http://doubleclix.wordpress.com/2014/05/11/theWsenseWsensibilityWofWaWdataWscientistWdevops/)
![Page 29: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/29.jpg)
Spark-The Stack
![Page 30: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/30.jpg)
RDD – The workhorse of Spark
o Resilient Distributed Datasets • �.++%#3).-�3(!3�#!-�"%�./%1!3%$�)-�/!1!++%+�
o Transformations – create RDDs • �!/���)+3%1�:�
o Actions – Get values • �.++%#3���!*%�:�
o We will apply these operations during this tutorial
![Page 31: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/31.jpg)
MLlib Hands-on
Stats, Linear Regression
9:20
![Page 32: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/32.jpg)
Algorithm spectrum
o ���"�##�! �o !��$�o �����o � #�������
�� �!��!"�#$�
o ��%#$�"� ��o ����o � �$�������o ���%��$���
� ���� �)
o �!�������$�"� ��
o ����o ��" ��#�
o ����
o ���$�o �!�$&�� �
����� ��o ��$%"��
��" � �)
Machine0
Learning0
Cute0Math0Ar6ficial0
Intelligence0
![Page 33: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/33.jpg)
Session – 1 : MLlib - Statistics & Linear Regression
1. Notebook):)01_StatsLRW1)① Read)car)data)② Stats)(Guided))③ Correlation)(Guided))④ Coding)ExerciseW21WTemplate)
(Correlation))2. Notebook):)02_StatsLRW2)
① CEW21WSolution)3. Linear)Regression)
① LR)(Guided))② CEW22WTemplate((LR)on)Car)
Data))
4. Notebook):)03_StatsLRW3)① CEW22WSolution)② Explain)
![Page 34: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/34.jpg)
Linear Regression - API LabeledPoint The)features)and)labels)of)a)data)point LinearModel weights,)intercept LinearRegressionModelBase predict()
LinearRegressionModel LinearRegressionWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)miniBatchFracFon=1.0,)
iniFalWeights=None,)regParam=1.0,)regType=None,)intercept=False)
LassoModel LeastBsquares)fit)with)an)L1)penalty)term. LassoWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)regParam=1.0,)
miniBatchFracFon=1.0,iniFalWeights=None)
RidgeRegressionModel LeastBsquares)fit)with)an)L2)penalty)term. RidgeRegressionWithSGD train(cls,)data,)iteraFons=100,)step=1.0,)regParam=1.0,)miniBatchFracFon=1.0,)
iniFalWeights=None)
![Page 35: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/35.jpg)
MLlib - Deep Dive • Design Philosophy & APIs
• Algorithms - Regression, SGD et al
• Interfaces
9:50
![Page 36: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/36.jpg)
Reza Zadeh
Distributed Machine Learning"on Spark
@Reza_Zadeh | http://reza-zadeh.com
![Page 37: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/37.jpg)
Outline Data flow vs. traditional network programming Spark computing engine Optimization Examples Matrix Computations MLlib + {Streaming, GraphX, SQL} Future of MLlib
![Page 38: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/38.jpg)
Data Flow Models Restrict the programming interface so that the system can do more automatically Express jobs as graphs of high-level operators » System picks how to split each operator into tasks
and where to run each task » Run parts twice fault recovery
Biggest example: MapReduce Map
Map
Map
Reduce
Reduce
![Page 39: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/39.jpg)
Spark Computing Engine Extends a programming language with a distributed collection data-structure » “Resilient distributed datasets” (RDD)
Open source at Apache » Most active community in big data, with 50+
companies contributing
Clean APIs in Java, Scala, Python Community: SparkR, soon to be merged
![Page 40: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/40.jpg)
Key Idea Resilient Distributed Datasets (RDDs) » Collections of objects across a cluster with user
controlled partitioning & storage (memory, disk, ...) » Built via parallel transformations (map, filter, …) » The world only lets you make make RDDs such that
they can be:
Automatically rebuilt on failure
![Page 41: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/41.jpg)
MLlib History MLlib is a Spark subproject providing machine learning primitives Initial contribution from AMPLab, UC Berkeley Shipped with Spark since Sept 2013
![Page 42: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/42.jpg)
MLlib: Available algorithms classification: logistic regression, linear SVM, "naïve Bayes, least squares, classification tree regression: generalized linear models (GLMs), regression tree collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) clustering: k-means|| decomposition: SVD, PCA optimization: stochastic gradient descent, L-BFGS
![Page 43: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/43.jpg)
Optimization At least two large classes of optimization problems humans can solve:" » Convex » Spectral
![Page 44: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/44.jpg)
Optimization Example: Convex Optimization
![Page 45: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/45.jpg)
Logistic Regression data$=$spark.textFile(...).map(readPoint).cache()$$w$=$numpy.random.rand(D)$$for$i$in$range(iterations):$$$$$gradient$=$data.map(lambda$p:$$$$$$$$$(1$/$(1$+$exp(Bp.y$*$w.dot(p.x))))$*$p.y$*$p.x$$$$$).reduce(lambda$a,$b:$a$+$b)$$$$$w$B=$gradient$$print$“Final$w:$%s”$%$w$
![Page 46: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/46.jpg)
Separable Updates Can be generalized for » Unconstrained optimization » Smooth or non-smooth » LBFGS, Conjugate Gradient, Accelerated
Gradient methods, …
![Page 47: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/47.jpg)
Logistic Regression Results
0 500
1000 1500 2000 2500 3000 3500 4000
1 5 10 20 30
Runn
ing T
ime
(s)
Number of Iterations
Hadoop Spark
110 s / iteration
first iteration 80 s further iterations 1 s
100 GB of data on 50 m1.xlarge EC2 machines !
![Page 48: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/48.jpg)
Behavior with Less RAM 68
.8
58.1
40.7
29.7
11.5
0
20
40
60
80
100
0% 25% 50% 75% 100%
Itera
tion
time
(s)
% of working set in memory
![Page 49: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/49.jpg)
Optimization Example: Spectral Program
![Page 50: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/50.jpg)
Spark PageRank Given directed graph, compute node importance. Two RDDs: » Neighbors (a sparse graph/matrix) » Current guess (a vector) "Using cache(), keep neighbor list in RAM
![Page 51: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/51.jpg)
Spark PageRank Using cache(), keep neighbor lists in RAM Using partitioning, avoid repeated hashing
Neighbors (id, edges)
Ranks (id, rank)
join
partitionBy
join join …
![Page 52: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/52.jpg)
PageRank Results 171
72
23
0
50
100
150
200
Tim
e pe
r ite
ratio
n (s)
Hadoop
Basic Spark
Spark + Controlled Partitioning
![Page 53: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/53.jpg)
Spark PageRank
Generalizes!to!Matrix!Multiplication,!opening!many!algorithms!from!Numerical!Linear!Algebra!
![Page 54: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/54.jpg)
Distributing Matrix Computations
![Page 55: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/55.jpg)
Distributing Matrices How to distribute a matrix across machines? » By Entries (CoordinateMatrix) » By Rows (RowMatrix) » By Blocks (BlockMatrix) All of Linear Algebra to be rebuilt using these partitioning schemes
As!of!version!1.3!
![Page 56: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/56.jpg)
Distributing Matrices Even the simplest operations require thinking about communication e.g. multiplication How many different matrix multiplies needed? » At least one per pair of {Coordinate, Row,
Block, LocalDense, LocalSparse} = 10 » More because multiplies not commutative
![Page 57: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/57.jpg)
Coffee-Break
Back at 10:45
![Page 58: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/58.jpg)
MLlib - Hands On #2 – Kaggle Competition Predicting Titanic Survivors: • Feature Engineering
• Classification Algorithms(Random forest)
• Submission & Leaderboard scores
10:45
![Page 59: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/59.jpg)
Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization o It’s Generalization that counts • �(%�&4-$!,%-3!+�'.!+�.&�,!#()-%�+%!1-)-'�)2�3.�'%-%1!+)9%�"%8.-$�3(%�%7!,/+%2�)-�3(%�31!)-)-'�2%3�
o Data alone is not enough • �-$4#3).-�-.3�$%$4#3).-����5%18��+%!1-%1�2(.4+$�%,".$8�2.,%�*-.6+%$'%�.1�!224,/3).-2�"%8.-$�3(%�$!3!�)3�)2�')5%-�)-�.1$%1�3.�'%-%1!+)9%�"%8.-$�)3�
o Machine Learning is not magic – one cannot get something from nothing • �-�.1$%1�3.�)-&%1��.-%�-%%$2�3(%�*-."2���3(%�$)!+2�• �-%�!+2.�-%%$2�!�1)#(�%7/1%22)5%�$!3!2%3�
A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
![Page 60: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/60.jpg)
Classification - Spark API • Logistic)Regression)• SVMWithSGD)• DecisionTrees)• Data)as)LabelledPoint)(we)will)see)in)a)moment))• DecisionTree.trainClassifier(data,)numClasses,)categoricalFeaturesInfo,)impurity="gini",)
maxDepth=4,)maxBins=100))• Impurity)–)“entropy”)or)“gini”)• maxBins)=)control)to)throttle)communication)at)the)expense)of)accuracy)
o Larger)=)Higher)Accuracy)o Smaller)=)less)communication)(as)#)of)bins)=)number)of)instances))
• Data)Adaptive)–)i.e.)decision)tree)samples)on)the)driver)and)figures)out)the)bin)spacing)i.e.)the)places)you)slice)for)binning)
• Spark*=*Intelligent*Framework*D*need*this*for*scale*
![Page 61: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/61.jpg)
Lookout for these interesting Spark features
• Concept)of)Labeled)Point)&)how)to)create)an)RDD)of)LPs)
• Print)the)tree)• Calculate)Accuracy)&)MSE)from)RDDs)
![Page 62: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/62.jpg)
Anatomy Of a Kaggle Competition
![Page 63: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/63.jpg)
Kaggle Data Science Competitions
o Hosts Data Science Competitions o Competition Attributes: • �!3!2%3�• �1!)-�• �%23���4",)22).-��• �)-!+��5!+4!3).-��!3!��%3�� %�$.-;3�2%%��
• �4+%2�• �),%�".7%$�• �%!$%1".!1$�• �5!+4!3).-�&4-#3).-�• �)2#422).-��.14,�• �1)5!3%�.1��4"+)#�
![Page 64: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/64.jpg)
Titanic)Passenger)Metadata)• Small)• 3)Predictors)
• Class)• Sex)• Age)• Survived?)
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
City Bike Sharing Prediction (Washington DC)
Walmart Store Forecasting
![Page 65: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/65.jpg)
Train.csv)Taken)from)Titanic)Passenger)Manifest)
Variable' Descrip7on'
Survived) 0BNo,)1=yes)
Pclass) Passenger)Class)()1st,2nd,3rd)))
Sibsp) Number)of)Siblings/Spouses)Aboard)
Parch) Number)of)Parents/Children)Aboard)
Embarked) Port)of)EmbarkaFon)o C)=)Cherbourg)o Q)=)Queenstown)o S)=)Southampton)
Titanic)Passenger)Metadata)• Small)• 3)Predictors)
• Class)• Sex)• Age)• Survived?)
![Page 66: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/66.jpg)
Test.csv)
Submission
o 418 lines; 1st column should have 0 or 1 in each line o Evaluation: • ��#.11%#3+8�/1%$)#3%$�
![Page 67: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/67.jpg)
Data Science “folk knowledge” (Wisdom of Kaggle) Jeremy’s Axioms
o Iteratively explore data o Tools • �7#%+��.1,!3���%1+���%1+� ..*���/!1*���
o Get your head around data • �)5.3��!"+%�
o Don’t over-complicate o If people give you data, don’t assume that you
need to use all of it o Look at pictures ! o History of your submissions – keep a tab o Don’t be afraid to submit simple solutions • %�6)++�$.�3()2�$41)-'�3()2�6.1*2(./�
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
![Page 68: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/68.jpg)
Session-2 : Kaggle, Classification & Trees
1. Notebook):)04_TitanicW01)① Read)Training)Data)② Henry)the)Sixth)Model)③ Submit)to)Kaggle)
2. Notebook):)05_TitanicW02)① Decision)Tree)Model)② CEW31)Template)
① Create)Randomforest)Model)② Predict)Testset)③ Submit)to)Kaggle)
3. Notebook):)06_TitanicW03)① CEW32)Solution)
① RandomForest)Model)② Predict)Testset)③ Submit)Solution)2)
② Discussion)about)Models)
![Page 69: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/69.jpg)
Trees, Forests & Classification
• Discuss)Random)Forest)o Boosting,)Bagging)o Data)deWcorrelation))
• Why)it)didn’t)do)better)in)Titanic)dataset)• Data)Science)Folk)Wisdom)
o http://www.slideshare.net/ksankar/dataWscienceWfolkWknowledge)
![Page 70: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/70.jpg)
Results Algorithm' Kaggle'Score'
Gender)based)model) ~0.7655)Rank):)1276)
Decision)Tree) ~0.775124,)Rank):)1147)
Random)Forest)) ~0.77512)Gender)Based)Model)
Why didn’t RF do better ? Bias/Variance See next slide
Henry0the0Sixth,0Part02,0Act04,0Scene020
![Page 71: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/71.jpg)
Why didn’t RF do better ? Bias/Variance
o High Bias • �4%�3.��-$%1&)33)-'�• �$$�,.1%�&%!341%2�• �.1%�2./()23)#!3%$�,.$%+�• �4!$1!3)#��%1,2��#.,/+%7�%04!3).-2�:�• �%#1%!2%�1%'4+!1)9!3).-�
o High Variance • �4%�3.��5%1&)33)-'�• �2%�&%6%1�&%!341%2�• �2%�,.1%�31!)-)-'�2!,/+%�• �-#1%!2%��%'4+!1)9!3).-�
Prediction Error!Training !Error!
Ref: Strata 2013 Tutorial by Olivier Grisel
Learning Curve
Need*more*features*or*more*complex*model*to*improve*
Need*more*data*to*improve*
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
hgp://www.slideshare.net/ksankar/dataBscienceBfolkBknowledge)
![Page 72: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/72.jpg)
Decision Tree – Best Practices
maxDepth) Tune)with)Data/Model)SelecFon)maxBins) Set)low,)monitor)communicaFons,)increase)if)needed)#)RDD)parFFons) Set)to)#)of)cores)
• Usually the recommendation is that the RDD partitions should be over partitioned ie “more partitions than cores”, because tasks take different times, we need to utilize the compute power and in the end they average out
• But for Machine Learning especially trees, all tasks are approx equal computationally intensive, so over partitioning doesn’t help
• Joe Bradley talk (reference below) has interesting insights
hgps://speakerdeck.com/jkbradley/mllibBdecisionBtreesBatBsfBscalaBbamlBmeetup)
DecisionTree.trainClassifier(data,)numClasses,)categoricalFeaturesInfo,)impurity="gini",)maxDepth=4,)maxBins=100))
![Page 73: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/73.jpg)
◦ “Output)of)weak)classifiers)into)a)powerful)commigee”)
◦ Final)PredicFon)=)weighted)majority)vote))◦ Later)classifiers)get)misclassified)points))! With)higher)weight,))! So)they)are)forced))! To)concentrate)on)them)◦ AdaBoost)(AdapFveBoosting))◦ BoosFng)vs)Bagging)
! Bagging)–)independent)trees)<B)Spark)shines)here)! BoosFng)–)successively)weighted)
Boosting " Goal!◦ Model Complexity (-)!◦ Variance (-)!◦ Prediction Accuracy (+)!
![Page 74: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/74.jpg)
◦ Builds)large)collecFon)of)deBcorrelated)trees)&)averages)them)◦ Improves)Bagging)by)selecFng)i.i.d*)random)variables)for)
spli3ng)◦ Simpler)to)train)&)tune)◦ “Do0remarkably0well,0with0very0liIle0tuning0required”)–)ESLII)◦ Less)suscepFble)to)over)fi3ng)(than)boosFng))◦ Many)RF)implementaFons)! Original)version)B)FortranB77)!)By)Breiman/Cutler)! Python,)R,)Mahout,)Weka,)Milk)(ML)toolkit)for)py),)matlab))! And)of)course,)Spark)!)
* i.i.d – independent identically distributed + http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Random Forests+ " Goal!◦ Model Complexity (-)!◦ Variance (-)!◦ Prediction Accuracy (+)!
![Page 75: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/75.jpg)
Random Forests • While)Boosting)splits)based)on)best)among)all)variables,)RF)splits)based)
on)best)among)randomly*chosen)variables)• )Simpler)because)it)requires)two)variables)–)no.)of)Predictors)(typically)√k))
&)no.)of)trees)(500)for)large)dataset,)150)for)smaller))• Error)prediction)
o For)each)iteration,)predict)for)dataset)that)is)not)in)the)sample)(OOB)data))o Aggregate)OOB)predictions)o Calculate)Prediction)Error)for)the)aggregate,)which)is)basically)the)OOB)estimate)of)error)rate)
• Can)use)this)to)search)for)optimal)#)of)predictors)o We)will)see)how)close)this)is)to)the)actual)error)in)the)Heritage)Health)Prize)
• Assumes)equal)cost)for)misWprediction.)Can)add)a)cost)function)• Proximity)matrix)&)applications)like)adding)missing)data,)dropping)
outliers))
Ref: R News Vol 2/3, Dec 2002 Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
![Page 76: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/76.jpg)
◦ Two)Step)! Develop)a)set)of)learners)! Combine)the)results)to)develop)a)composite)predictor)◦ Ensemble)methods)can)take)the)form)of:)! Using)different)algorithms,))! Using)the)same)algorithm)with)different)se3ngs)! Assigning)different)parts)of)the)dataset)to)different)
classifiers)◦ Bagging)&)Random)Forests)are)examples)of)
ensemble)method))Ref: Machine Learning In Action
Ensemble Methods " Goal!◦ Model Complexity (-)!◦ Variance (-)!◦ Prediction Accuracy (+)!
![Page 77: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/77.jpg)
Deepdive : Leverage parallelism of RDDs, sparse vectors, etc.
11:30
![Page 78: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/78.jpg)
Lunch
Back at 1:30
![Page 79: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/79.jpg)
Clustering - Hands On : • Normalization & Centering
• Clustering
• Optimizing k based on cohesively of the clusters (WSSE)
1:30
![Page 80: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/80.jpg)
Data Science “folk knowledge” (3 of A)
o More Data Beats a Cleverer Algorithm • �1�#.-5%12%+8�2%+%#3�!+'.1)3(,2�3(!3�),/1.5%�6)3(�$!3!�• �.-;3�./3),)9%�/1%,!341%+8�6)3(.43�'%33)-'�,.1%�$!3!�
o Learn many models, not Just One • �-2%,"+%2������(!-'%�3(%�(8/.3(%2)2�2/!#%�• �%3&+)7�/1)9%�• �'� !'')-'�� ..23)-'���3!#*)-'�
o Simplicity Does not necessarily imply Accuracy o Representable Does not imply Learnable • �423�"%#!42%�!�&4-#3).-�#!-�"%�1%/1%2%-3%$�$.%2�-.3�,%!-�)3�#!-�"%�+%!1-%$�
o Correlation Does not imply Causation o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o A few useful things to know about machine learning - by Pedro Domingos
! http://dl.acm.org/citation.cfm?id=2347755
![Page 81: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/81.jpg)
Session-3 : Clustering 1. Notebook):)07_ClusterW1)
① Read)Data)② Cluster)③ Modeling)ExerciseW41WTemplate)
2. Notebook):)08_ClusterW2)① MEW41WSolution)② Center)and)Scale)③ Cluster)④ Inspect)centroid)⑤ CEW42WTemplate):)Cluster)
Semantics)
3. Notebook):)09_ClusterW3)① CEW42)Solution)② Cluster)Semantics)W)Discussion)
![Page 82: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/82.jpg)
Clustering - Theory • Clustering)is)unsupervised)learning)• While)the)computers)can)dissect)a)dataset)into)“similar”)clusters,)it)
still)needs)human)direction)&)domain)knowledge)to)interpret)&)guide)
• Two)types:)o Centroid)based)clustering)–)kWmeans)clustering)o Tree)based)Clustering)–)hierarchical)clustering)
• Spark)implements)the)Scalable)Kmeans++))o Paper):)http://theory.stanford.edu/~sergei/papers/vldb12Wkmpar.pdf)
![Page 83: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/83.jpg)
Lookout for these interesting Spark features
• Application)of)Statistics)toolbox)• Center)&)Scale)RDD)• Filter)RDDs)
![Page 84: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/84.jpg)
Clustering - API • from)pyspark.mllib.clustering)import)KMeans)• Kmeans.train)• train(cls,)data,)k,)maxIterations=100,)runs=1,)initializationMode="kW
means||"))• K)=)number)of)clusters)to)create,)default=2)• initializationMode)=)The)initialization)algorithm.)This)can)be)either)
"random")to)choose)random)points)as)initial)cluster)centers,)or)"kWmeans||")to)use)a)parallel)variant)of)kWmeans++)(Bahmani)et)al.,)Scalable)KWMeans++,)VLDB)2012).)Default:)kWmeans||)
• KMeansModel.predict)• Maps)a)point)to)a)cluster)
![Page 85: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/85.jpg)
Interpretation
C#' AVG' Interpreta7on'
1)
2) ))Not)that)acFve)–)Give)them)flight)coupons)to)encourage)them)to)fly)more.)Ask)them)why)they)are)not)flying.)May)be)they)are)flying)to)desFnaFons)(say)Jupiter))where)InterGallacFc)has)less)gates)
3)
4)
5)
Best0Customers,0S6ll0lots0
of0nonLflying0miles0
Very0ac6ve0onLline.0Why0are0they0coming0to0us0instead0of0Amazon0?0
Note0:00
• This0is0just0a0sample0interpreta6on.0
• In0real0life0we0would0“noodle”0over0the0clusters0&0tweak0
them0to0be0useful,0interpretable0and0dis6nguishable.0
• May0be030is0more0suited0to0create0targeted0promo6ons0
![Page 86: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/86.jpg)
Epilogue • KMeans)in)Spark)has)enough)controls)• It)does)a)decent)job)• We)were)able)to)control)the)clusters)based)on)our)experience)(2)
cluster)is)too)low,)10)is)too)high,)5)seems)to)be)right))• We)can)see)that)the)Scalable)KMeans)has)control)over)runs,)
parallelism)et)al.)(Home)work):)Explore)the)scalability))• We)were)able)to)interpret)the)results)with)domain)knowledge)and)
arrive)at)a)scheme)to)solve)the)business)opportunity)• Naturally)we)would)tweak)the)clusters)to)fit)the)business)viability.)
20)clusters)with)corresponding)promotion)schemes)are)unwieldy,)even)if)the)WSSE)is)the)minimum.)
![Page 87: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/87.jpg)
Recommendation - Hands On : • Movie Lens - Medium Data
• Movie Lens - Large Data (Homework)
2:00
![Page 88: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/88.jpg)
hgps://doubleclix.wordpress.com/2014/03/07/aBglimpseBofBgoogleBnasaBpeterBnorvig/)
![Page 89: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/89.jpg)
Session-4 : Recommendation at Scale
1. NotebookW10_RecoW1)① Read)Movielens)medium)
data)② CEW51)Template)–)Partition)
Data)2. NotebookW11_RecoW2)
① CEW51)Solution)② ALS)Slide)③ Train)ALS)&)Predict)④ Calculate)Model)
Performance)
![Page 90: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/90.jpg)
Recommendation & Personalization - Spark
Automated*AnalyticsW)Let)Data)tell)story)Feature)Learning,)AI,)Deep)Learning)
Learning*Models*W)fit)parameters)as)it)gets)more)data))
Dynamic*Models*–)model)selection)based)on)context)
o Knowledge)Based)o Demographic)Based)o Content)Based)o Collaborative)Filtering)
o Item)Based)o User)Based)
o Latent)Factor)based)
o User)Rating)o Purchased)o Looked/Not)purchased)
Spark)(in)1.1.0))implements)the)user)based)ALS)collaboraFve)filtering)
Ref:))ALS)B)CollaboraFve)Filtering)for)Implicit)Feedback)Datasets,)Yifan)Hu);)AT&T)Labs.,)Florham)Park,)NJ);)Koren,)Y.);)Volinsky,)C.)ALSBWR)B)LargeBScale)Parallel)CollaboraFve)Filtering)for)the)Nezlix)Prize,)Yunhong)Zhou,)Dennis)Wilkinson,)Robert)Schreiber,)Rong)Pan)
![Page 91: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/91.jpg)
Spark Collaborative Filtering API • ALS.train(cls,)ratings,)rank,)iterations=5,)lambda_=0.01,)blocks=W1))
• ALS.trainImplicit(cls,)ratings,)rank,)iterations=5,)lambda_=0.01,)blocks=W1,)alpha=0.01))
• MatrixFactorizationModel.predict(self,)user,)product))
• MatrixFactorizationModel.predictAll(self,)usersProducts))
![Page 92: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/92.jpg)
Theory : Matrix Factorization, SVD,…
On-line k-means, spark streaming
2:30
![Page 93: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/93.jpg)
Singular Value Decomposition on Spark
![Page 94: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/94.jpg)
Singular Value Decomposition
![Page 95: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/95.jpg)
Singular Value Decomposition Two cases » Tall and Skinny » Short and Fat (not really) » Roughly Square SVD method on RowMatrix takes care of which one to call.
![Page 96: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/96.jpg)
Tall and Skinny SVD
![Page 97: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/97.jpg)
Tall and Skinny SVD
Gets%us%%%V%and%the%singular%values%
Gets%us%%%U%by%one%matrix%multiplication%
![Page 98: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/98.jpg)
Square SVD ARPACK: Very mature Fortran77 package for computing eigenvalue decompositions" JNI interface available via netlib-java" Distributed using Spark – how?
![Page 99: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/99.jpg)
Square SVD via ARPACK Only interfaces with distributed matrix via matrix-vector multiplies The result of matrix-vector multiply is small. The multiplication can be distributed.
![Page 100: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/100.jpg)
Square SVD
With 68 executors and 8GB memory in each, looking for the top 5 singular vectors
![Page 101: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/101.jpg)
Communication-Efficient All pairs similarity on Spark (DIMSUM)
![Page 102: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/102.jpg)
All pairs Similarity All pairs of cosine scores between n vectors » Don’t want to brute force (n choose 2) m » Essentially computes "Compute via DIMSUM » Dimension Independent Similarity
Computation using MapReduce
![Page 103: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/103.jpg)
Intuition Sample columns that have many non-zeros with lower probability. " On the flip side, columns that have fewer non-zeros are sampled with higher probability. " Results provably correct and independent of larger dimension, m.
![Page 104: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/104.jpg)
Spark implementation
![Page 105: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/105.jpg)
MLlib + {Streaming, GraphX, SQL}
![Page 106: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/106.jpg)
A General Platform
Spark Core
Spark Streaming"
real-time
Spark SQL structured
GraphX graph
MLlib machine learning
…
Standard libraries included with Spark
![Page 107: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/107.jpg)
Benefit for Users Same engine performs data extraction, model training and interactive queries
… DFS read
DFS write pa
rse DFS read
DFS write tra
in
DFS read
DFS write qu
ery
DFS
DFS read pa
rse
train
quer
y
Separate engines
Spark
![Page 108: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/108.jpg)
MLlib + Streaming As of Spark 1.1, you can train linear models in a streaming fashion, k-means as of 1.2 Model weights are updated via SGD, thus amenable to streaming More work needed for decision trees
![Page 109: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/109.jpg)
MLlib + SQL df = context.sql(“select latitude, longitude from tweets”) !
model = pipeline.fit(df) !
DataFrames in Spark 1.3! (March 2015) Powerful coupled with new pipeline API
![Page 110: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/110.jpg)
MLlib + GraphX
![Page 111: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/111.jpg)
Future of MLlib
![Page 112: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/112.jpg)
Goals for next version Tighter integration with DataFrame and spark.ml API
Accelerated gradient methods & Optimization interface
Model export: PMML (current export exists in Spark 1.3, but not PMML, which lacks distributed models)
Scaling: Model scaling (e.g. via Parameter Servers)
![Page 113: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/113.jpg)
Research Goal: General Distributed Optimization
Distribute%CVX%by%backing%CVXPY%with%
PySpark%%
EasyBtoBexpress%distributable%convex%
programs%%
Need%to%know%less%math%to%optimize%
complicated%objectives%
![Page 114: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/114.jpg)
Most active open source community in big data 200+ developers, 50+ companies contributing
Spark Community
0
50
100
150
Contributors in past year
![Page 115: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/115.jpg)
Continuing Growth
source: ohloh.net
Contributors per month to Spark
![Page 116: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/116.jpg)
Spark and ML Spark has all its roots in research, so we hope to keep incorporating new ideas!
![Page 117: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/117.jpg)
Coffee-Break
Back at 3:15
![Page 118: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/118.jpg)
Hands On : • Mood Of the Union
• RecSys 2015 Challenge
3:15
![Page 119: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/119.jpg)
The Art of ELO Ranking & Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
�� ����$������!"�$������ ����$������!"�$��������"����$!���%�"�!���
Ref : Who is #1, Princeton University Presshttps://doubleclix.wordpress.com/2015/01/20/theWartWofWnflWrankingWtheWeloWalgorithmWandWfivethirtyeight/)
![Page 120: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/120.jpg)
Session-5 : Mood Of the Union – Data Science on SOTU by POTUS
1. DataScience/12_SOTUW1)① Read)BO)② CEW61)Template):)Has)BO)
changed)since)2014)?)2. DataScience/13_SOTUW2)
① CEW61)Solution)② Read)GW)③ Preprocess)④ CEW62)Template):)What)
mood)the)country)was)in)1790W1796)vs.)2009W2015)
3. Notebook):)14_SOTUW3)① CEW62)Solution)② Homework)
① GWB)vs)Clinton)② WJC)vs)AL)
③ Discussions ) ))
![Page 121: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/121.jpg)
WJC)
Mood Of The Nation
JFK)
FDR)
GW)
AL)
JFK)
![Page 122: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/122.jpg)
Epilogue
• Interesting)Exercise)• Highlights)
o Map-reduce in a couple of lines ! • But it is not exactly the same as Hadoop Mapreduce (see the excellent blog by Sean Owen1)
o Set differences using substractByKey o Ability to sort a map by values (or any arbitrary function, for that matter)
• To)Explore)as)homework:)o TF-IDF in
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
hgp://blog.cloudera.com/blog/2014/09/howBtoBtranslateBfromBmapreduceBtoBapacheBspark/)
![Page 123: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/123.jpg)
Session-7 : Predict Buying Pattern Recsys 2015 Challenge
1. Notebook):)99_RecsysW2015W1)
1. Read)Data)2. Explore)Options)3. CEW71)
![Page 124: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/124.jpg)
After Hours
Homework
![Page 125: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/125.jpg)
Notebooks to Explore
• Run)thru)all)the)homework)in)you)Databricks)cloud)
![Page 126: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/126.jpg)
GraphX examples
![Page 127: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/127.jpg)
82
GraphX:
spark.apache.org/docs/latest/graphx-programming-guide.html
Key Points:
• graph-parallel systems• importance of workflows• optimizations
![Page 128: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/128.jpg)
83
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs!J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin!graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google!Grzegorz Czajkowski, et al. !googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
GraphX: Unified Graph Analytics on Spark!Ankur Dave, Databricks!databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf
Advanced Exercises: GraphX!databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
GraphX: Further Reading…
![Page 129: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/129.jpg)
// http://spark.apache.org/docs/latest/graphx-programming-guide.html
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
case class Peep(name: String, age: Int)
val nodeArray = Array(
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
(5L, Peep("Leslie", 45))
)
val edgeArray = Array(
Edge(2L, 1L, 7), Edge(2L, 4L, 2),
Edge(3L, 2L, 4), Edge(3L, 5L, 3),
Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)
val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)
val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
84
GraphX: Example – simple traversals
![Page 130: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/130.jpg)
85
GraphX: Example – routing problems
What is the cost to reach node 0 from any other node in the graph? This is a common use case for graph algorithms, e.g., Djikstra
![Page 131: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/131.jpg)
86
Run /<your folder>/_DataSCience/08.graphx !in your folder:
GraphX: Coding Exercise
![Page 132: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/132.jpg)
Case Studies
![Page 133: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/133.jpg)
Case Studies: Apache Spark, DBC, etc.
Additional details about production deployments for Apache Spark can be found at:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
https://databricks.com/blog/category/company/partners
http://go.databricks.com/customer-case-studies
88
![Page 134: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/134.jpg)
Case Studies: Automatic Labs
89
Spark Plugs Into Your Car !Rob Ferguson !spark-summit.org/east/2015/talk/spark-plugs-into-your-car
finance.yahoo.com/news/automatic-labs-turns-databricks-cloud-140000785.html
Automatic creates personalized driving habit dashboards
• wanted to use Spark while minimizing investment in DevOps
• provides data access to non-technical analysts via SQL
• replaced Redshift and disparate ML tools with single platform
• leveraged built-in visualization capabilities in notebooks to generate dashboards easily and quickly
• used MLlib on Spark for needed functionality out of the box
![Page 135: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/135.jpg)
Spark at Twitter: Evaluation & Lessons Learnt!Sriram Krishnan !slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter
• Spark can be more interactive, efficient than MR
• support for iterative algorithms and caching
• more generic than traditional MapReduce
• Why is Spark faster than Hadoop MapReduce?
• fewer I/O synchronization barriers
• less expensive shuffle
• the more complex the DAG, the greater the !performance improvement
90
Case Studies: Twitter
![Page 136: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/136.jpg)
Pearson uses Spark Streaming for next generation adaptive learning platformDibyendu Bhattacharyadatabricks.com/blog/2014/12/08/pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform.html
91
• Kafka + Spark + Cassandra + Blur, on AWS on a YARN cluster
• single platform/common API was a key reason to replace Storm with Spark Streaming
• custom Kafka Consumer for Spark Streaming, using Low Level Kafka Consumer APIs
• handles: Kafka node failures, receiver failures, leader changes, committed offset in ZK, tunable data rate throughput
Case Studies: Pearson
![Page 137: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/137.jpg)
Unlocking Your Hadoop Data with Apache Spark and CDH5Denny Leeslideshare.net/Concur/unlocking-your-hadoop-data-with-apache-spark-and-cdh5
92
• leading provider of spend management solutions and services
• delivers recommendations based on business users’ travel and expenses – “to help deliver the perfect trip”
• use of traditional BI tools with Spark SQL allowed analysts to make sense of the data without becoming programmers
• needed the ability to transition quickly between Machine Learning (MLLib), Graph (GraphX), and SQL usage
• needed to deliver recommendations in real-time
Case Studies: Concur
![Page 138: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/138.jpg)
Stratio Streaming: a new approach to Spark StreamingDavid Morales, Oscar Mendezspark-summit.org/2014/talk/stratio-streaming-a-new-approach-to-spark-streaming
93
• Stratio Streaming is the union of a real-time messaging bus with a complex event processing engine atop Spark Streaming
• allows the creation of streams and queries on the fly
• paired with Siddhi CEP engine and Apache Kafka
• added global features to the engine such as auditing and statistics
Case Studies: Stratio
![Page 139: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/139.jpg)
Collaborative Filtering with Spark !Chris Johnson !slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark
• collab filter (ALS) for music recommendation
• Hadoop suffers from I/O overhead
• show a progression of code rewrites, converting a !Hadoop-based app into efficient use of Spark
94
Case Studies: Spotify
![Page 140: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/140.jpg)
Guavus Embeds Apache Spark !into its Operational Intelligence Platform !Deployed at the World’s Largest TelcosEric Carrdatabricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
95
• 4 of 5 top mobile network operators, 3 of 5 top Internet backbone providers, 80% MSOs in NorAm
• analyzing 50% of US mobile data traffic, +2.5 PB/day
• latency is critical for resolving operational issues before they cascade: 2.5 MM transactions per second
• “analyze first” not “store first ask questions later”
Case Studies: Guavus
![Page 141: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/141.jpg)
Case Studies: Radius Intelligence
96
From Hadoop to Spark in 4 months, Lessons Learned !Alexis Roos!http://youtu.be/o3-lokUFqvA
• building a full SMB index took 12+ hours using !Hadoop and Cascading
• pipeline was difficult to modify/enhance
• Spark increased pipeline performance 10x
• interactive shell and notebooks enabled data scientists !to experiment and develop code faster
• PMs and business development staff can use SQL to !query large data sets
![Page 142: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/142.jpg)
Further Resources + Q&A
![Page 143: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/143.jpg)
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
![Page 144: Data Science with Spark - Training at SparkSummit (East)](https://reader034.vdocument.in/reader034/viewer/2022042607/55a5e3fd1a28ab3c368b4788/html5/thumbnails/144.jpg)