LARGE-SCALE ANALYTICS WITH APACHE SPARK THOMSON REUTERS R&D TWIN CITIES HADOOP USER GROUP
FRANK SCHILDER SEPTEMBER 22, 2014
THOMSON REUTERS • The Thomson Reuters Corporation
– 50,000+ employees – 2,000+ journalists at news desks worldwide – Offices in more than 100 countries – $12 billion in revenue per year
• Products: intelligent information for professionals and enterprises – Legal: WestlawNext legal search engine – Financial: Eikon financial platform; Datastream real-time share price data – News: REUTERS news – Science: Endnote, ISI journal impact factor, Derwent World Patent Index – Tax & Accounting: OneSource tax information
• Corporate R&D – Around 40 researchers and developers (NLP, IR, ML) – Three R&D sites in the US and one in the UK: Eagan, MN; Rochester, NY; NYC; and London – We are hiring… email me at [email protected]
OVERVIEW • Speed
– Data locality, scalability, fault tolerance
• Ease of Use – Scala, interactive Shell
• Generality – SparkSQL, MLLib
• Comparing ML frameworks – Vowpal Wabbit (VW) – Sparkling Water
• The Future
WHAT IS SPARK? Apache Spark is a fast and general engine for large-scale data processing.
• Speed: runs iterative MapReduce-style jobs faster through in-memory computation: Resilient Distributed Datasets (RDDs)
• Ease of use: enables interactive data analysis in Scala, Python, or Java; interactive Shell
• Generality: offers libraries for SQL, Streaming and large-scale analytics (graph processing and machine learning)
• Integrated with Hadoop: runs on Hadoop 2’s YARN cluster
ACKNOWLEDGMENTS • Matei Zaharia and the AMPLab and Databricks teams for fantastic learning material and tutorials on Spark
• Hiroko Bretz, Thomas Vacek, Dezhao Song, Terry Heinze for Spark and Scala support and running experiments
• Adam Glaser for his time as a TSAP intern
• Mahadev Wudali and Mike Edwards for letting us play in the “sandbox” (cluster)
SPEED
PRIMARY GOALS OF SPARK • Extend the MapReduce model to better support
two common classes of analytics apps: – Iterative algorithms (machine learning, graphs) – Interactive data mining (R, Python)
• Enhance programmability: – Integrate into Scala programming language – Allow interactive use from Scala interpreter – Make Spark easily accessible from other
languages (Python, Java)
MOTIVATION
• Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: – Iterative algorithms (machine learning, graphs) – Interactive data mining tools (R, Python)
• With current frameworks, apps reload data from stable storage on each query
HADOOP MAPREDUCE VS SPARK
SOLUTION: Resilient Distributed Datasets (RDDs) • Allow apps to keep working sets in memory for
efficient reuse
• Retain the attractive properties of MapReduce – Fault tolerance, data locality, scalability
• Support a wide range of applications
PROGRAMMING MODEL Resilient distributed datasets (RDDs)
– Immutable, partitioned collections of objects – Created through parallel transformations (map, filter,
groupBy, join, …) on data in stable storage – Functions follow the same patterns as Scala operations
on lists – Can be cached for efficient reuse
80+ Actions on RDDs – count, reduce, save, take, first, …
EXAMPLE: LOG MINING Load error messages from a log into memory, then interactively search for various patterns
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()
[Diagram: the driver ships tasks to workers; each worker reads one HDFS block (Block 1-3), caches the filtered messages (Cache 1-3), and returns results to the driver. Base RDD → transformed RDD → action.]

cachedMsgs.filter(_.contains("timeout")).count
cachedMsgs.filter(_.contains("license")).count
. . .
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
BEHAVIOR WITH NOT ENOUGH RAM
Iteration time (s) vs. % of working set in memory:

Cache disabled: 68.8   25%: 58.1   50%: 40.7   75%: 29.7   Fully cached: 11.5
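How much of the working set stays in memory is controlled through storage levels; a minimal sketch, with file paths as placeholders:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// partitions that do not fit in RAM are recomputed from lineage on access.
val inMemory = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY)

// MEMORY_AND_DISK instead spills partitions that do not fit to local disk.
val spillable = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_AND_DISK)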
RDD Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions
Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

Lineage: HDFS File --filter(_.startsWith(...))--> Filtered RDD --map(_.split(...))--> Mapped RDD
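The lineage can be inspected directly in the shell; a small sketch using RDD.toDebugString:

val messages = sc.textFile("hdfs://...")
  .filter(_.startsWith("ERROR"))
  .map(_.split('\t')(2))

// Prints the chain of parent RDDs (HDFS file -> filtered -> mapped)
// that Spark replays to rebuild any lost partition.
println(messages.toDebugString)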
Fault Recovery Results
Iteration time (s) for each of 10 iterations, with a failure in the 6th:

1: 119   2: 57   3: 56   4: 58   5: 58   6: 81 (failure; lost partitions rebuilt from lineage)   7: 57   8: 59   9: 57   10: 59
EASE OF USE
INTERACTIVE SHELL • Data analysis can be done in the interactive shell.
– Start on your local machine or a cluster – Use multiple local cores with local[n] – A Spark context is already set up for you: SparkContext sc
• Load data from anywhere (local, HDFS, Cassandra, Amazon S3 etc.):
• Start analyzing your data:
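A minimal shell session sketch (file paths and the HDFS address are placeholders):

// Start the shell with four local cores:
//   ./bin/spark-shell --master local[4]
// The shell provides a ready-made SparkContext as `sc`.

// Load data from a local file or from HDFS:
val local = sc.textFile("data.txt")
val hdfs  = sc.textFile("hdfs://namenode:8020/logs/app.log")

// Start analyzing: count the lines and look at the first one.
local.count()
local.first()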
ANALYZE YOUR DATA • Word count in one line:
• List the word counts:
• Broadcast variables (e.g. a dictionary or stop word list) are needed because local variables must be distributed to the workers:
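A sketch of all three steps, assuming local data.txt and stopwords.txt files:

// Word count in one line:
val counts = sc.textFile("data.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// List the word counts on the driver:
counts.collect().foreach(println)

// Broadcast a stop word list once to every worker instead of
// shipping it with each task:
val stopWords = sc.broadcast(scala.io.Source.fromFile("stopwords.txt").getLines().toSet)
val filtered = counts.filter { case (word, _) => !stopWords.value.contains(word) }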
RUN A SPARK SCRIPT
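A minimal standalone script sketch (the SimpleApp name and file paths are assumptions); build it into a jar and run it with spark-submit:

// SimpleApp.scala -- build into a jar, then run with:
//   ./bin/spark-submit --class SimpleApp --master local[4] simple-app.jar
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val errors = sc.textFile("hdfs://...").filter(_.startsWith("ERROR")).count()
    println("ERROR lines: " + errors)
    sc.stop()
  }
}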
PYTHON SHELL & IPYTHON • The interactive shell can also be started as a Python shell, called pySpark:
• Start analyzing your data in Python now:
• Since it’s Python, you may want to use IPython – (a command shell for interactive programming in your browser):
IPYTHON AND SPARK • The iPython notebook environment and pySpark:
– Document data analysis results – Carry out machine learning experiments – Visualize results with matplotlib or other visualization
libraries – Combine with NLP libraries such as NLTK
• PySpark does not offer the full functionality of Spark Shell in Scala (yet)
• Some bugs (e.g. problems with unicode)
PROJECTS AT R&D USING SPARK • Entity linking
– Alternative name extraction from Wikipedia, Freebase, free text, and ClueWeb12, a web collection several TB in size (planned)
• Large-scale text data analysis: – Creating fingerprints for entities/events – Temporal slot filling: Assigning a begin and end time
stamp to a slot filler (e.g. A is employee of company B from BEGIN to END)
– Large-Scale text classification of Reuters News Archive articles (10 years)
• Language model computation used for search query analysis
SPARK MODULES • Spark Streaming: – Processing real-time data streams (see the sketch after this list)
• Spark SQL: – Support for structured data (JSON, Parquet) and
relational queries (SQL)
• MLlib: – Machine learning library
• GraphX: – New graph processing API
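For the streaming module, a minimal sketch: a word count over 1-second batches, assuming text arrives on a local socket (host and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// One StreamingContext per application; a new batch arrives every second.
val ssc = new StreamingContext(sc, Seconds(1))

// Count words in text received on a socket.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()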
SPARKSQL
SPARK SQL • Relational queries expressed in – SQL – HiveQL – A Scala domain-specific language (DSL)
• New type of RDD: SchemaRDD – An RDD composed of Row objects – Schema defined explicitly, or inferred from a Parquet file, a JSON data set, or data stored in Hive
• Spark SQL is in alpha: the API may change in the future!
DEFINING A SCHEMA
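A sketch against the 1.1-era API, assuming a people.txt file of name,age rows and a hypothetical Person case class:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly converts RDDs of case classes to SchemaRDDs

// Build a SchemaRDD from a text file of "name,age" rows.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Register the RDD as a table and query it with SQL.
people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)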
MLLIB
MLLIB • A machine learning module that comes with Spark • Shipped since Spark 0.8.0
• Provides various machine learning algorithms for classification and clustering
• Sparse vector representation since 1.0.0
• New features in the recently released version 1.1.0: – Includes a standard statistics library (e.g. correlation, hypothesis testing, sampling) – More algorithms ported to Java and Python – More feature engineering: TF-IDF, singular value decomposition (SVD)
MLLIB • Provides various machine learning algorithms:
– Classification: • Logistic regression, support vector machine (SVM), naïve
Bayes, decision trees
– Regression: • Linear regression, regression trees
– Collaborative Filtering: • Alternating least squares (ALS)
– Clustering: • K-means
– Decomposition • Singular value decomposition (SVD), Principal component
analysis (PCA)
OTHER ML FRAMEWORKS • Mahout
• LIBLINEAR
• MATLAB
• Scikit-learn
• GraphLab
• R
• Weka
• Vowpal Wabbit
• BigML
LARGE-SCALE ML INFRASTRUCTURE • More data implies bigger training sets and richer
feature sets.
• More data with a simple ML algorithm often beats less data with a complicated ML algorithm
• Large-scale ML requires big data infrastructure: – Faster processing: Hadoop, Spark – Feature engineering: Principal Component Analysis,
Hashing trick, Word2Vec
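As an illustration, a minimal sketch of the hashing trick in Scala (the 2^18 feature-space size and the helper name hashFeatures are arbitrary choices):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hash arbitrary tokens into a fixed-size sparse feature vector,
// avoiding the need for a global token-to-index dictionary.
val numFeatures = 1 << 18

def hashFeatures(tokens: Seq[String]): Vector = {
  // Map each token to a non-negative index in [0, numFeatures).
  val indexed = tokens.map { t =>
    (((t.hashCode % numFeatures) + numFeatures) % numFeatures, 1.0)
  }
  // Sum collisions into per-index counts.
  val counts = indexed.groupBy(_._1).mapValues(_.map(_._2).sum).toSeq
  Vectors.sparse(numFeatures, counts)
}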
PREDICTIVE ANALYTICS WITH MLLIB
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
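A minimal training sketch against the 1.x MLlib API (the LIBSVM-format file path is a placeholder):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// Load labeled points in LIBSVM format and cache them for the iterative solver.
val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt").cache()

// Train a logistic regression model with 100 gradient descent iterations.
val model = LogisticRegressionWithSGD.train(training, 100)

// Evaluate 0/1 accuracy on the training set.
val correct = training.filter(p => model.predict(p.features) == p.label).count()
println("Training accuracy: " + correct.toDouble / training.count())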
VW AND MLLIB COMPARISON • We compared Vowpal Wabbit and MLlib in
December 2013 (work with Tom Vacek)
• Vowpal Wabbit (VW) is a large-scale ML tool developed by John Langford (Microsoft)
• Task: binary text classification task on Reuters articles – Ease of implementation – Feature Extraction – Parameter tuning – Speed – Accessibility of programming languages
VW VS. MLLIB • Ease of implementation
– VW: a user tool designed for ML, no programming required – MLlib: a programming library, though some built-in support is emerging (e.g. regularization)
• Feature Extraction – VW: specific capabilities for bi-grams, prefix etc. – MLlib: no limit in terms of creating features
• Parameter tuning – VW: no parameter search capability, but multiple parameters can be hand-tuned – MLlib: offers cross-validation
• Speed – VW: highly optimized, very fast even on a single machine with multiple cores – MLlib: fast with lots of machines
• Accessibility of programming languages – VW: written in C++, a few wrappers (e.g. Python) – MLlib: Scala, Python, Java
• Conclusion end of 2013: VW had a slight advantage, but MLlib has caught up in at least some of the areas (e.g. sparse feature representation)
FINDINGS SO FAR • Large-scale extraction is a great fit for Spark when
working with large data sets (> 1GB)
• Ease of use makes Spark an ideal framework for rapid prototyping.
• MLlib is a fast growing ML library, but “under development”
• Vowpal Wabbit has been shown to crunch even large data sets with ease.
[Chart: training time and 0/1 loss for VW, LIBLINEAR, and Spark local[4] on the binary text classification task]
OTHER ML FRAMEWORKS • Internship by Adam Glaser compared various ML
frameworks with 5 standard data sets (NIPS) – Mass-spectrometric data (cancer), handwritten digit
detection, Reuters news classification, and synthetic data sets – Data sets were not very big, but had up to 1,000,000 features
• Evaluated accuracy of the generated models and speed for training time
• H2O, GraphLab, and Microsoft Azure showed strong performance in terms of accuracy and training time.
ACCURACY
SPEED
WHAT IS NEXT? • 0xdata plans to release Sparkling Water in October 2014
• Microsoft Azure also offers a strong platform with multiple ML algorithms and an intuitive user interface
• GraphLab has GraphLab Canvas™ for visualizing your data and plans to incorporate more ML algorithms.
CAN’T DECIDE?
CONCLUSIONS
CONCLUSIONS • Apache Spark is the most active project in the Hadoop ecosystem
• Spark offers speed and ease of use because of – RDDs – The interactive shell – Easy integration of Scala, Java, and Python scripts
• Integrated in Spark are modules for – Easy data access via SparkSQL – Large-scale analytics via MLlib
• Other ML frameworks enable analytics as well
• Evaluate which framework is the best fit for your data problem
THE FUTURE? • Apache Spark will be a unified platform for running various kinds of workloads: – Batch – Streaming – Interactive
• And connect with different runtime systems – Hadoop – Cassandra – Mesos – Cloud – …
THE FUTURE? • Spark will extend its offering of large-scale
algorithms for doing complex analytics: – Graph processing – Classification – Clustering – …
• Other frameworks will continue to offer similar capabilities.
• If you can’t beat them, join them.
http://labs.thomsonreuters.com/about-rd-careers/
EXTRA SLIDES
Example: Logistic Regression Goal: find the best line separating two sets of points
[Diagram: two classes of points (+ and –) in the plane, with a random initial line converging toward the target separating line]
Example: Logistic Regression
// Load the points once and keep them cached across iterations
val data = spark.textFile(...).map(readPoint).cache()

// Start from a random weight vector
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  // One full gradient step over the cached data set
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop vs. Spark. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for each further iteration.]
Spark Scheduler
Dryad-like DAGs
Pipelines functions within a stage
Cache-aware work reuse & locality
Partitioning-aware to avoid shuffles
[Diagram: an example job DAG over RDDs A-G, split into Stage 1 (groupBy), Stage 2 (map, union), and Stage 3 (join), with cached data partitions marked]
Spark Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
  collect, reduce, count, save, lookupKey