LARGE-SCALE ANALYTICS WITH APACHE SPARK THOMSON REUTERS R&D TWIN CITIES HADOOP USER GROUP
FRANK SCHILDER SEPTEMBER 22, 2014
THOMSON REUTERS • The Thomson Reuters Corporation
– 50,000+ employees – 2,000+ journalists at news desks worldwide – Offices in more than 100 countries – $12 billion in revenue per year
• Products: intelligent information for professionals and enterprises – Legal: WestlawNext legal search engine – Financial: Eikon financial platform; Datastream real-time share price data – News: REUTERS news – Science: Endnote, ISI journal impact factor, Derwent World Patent Index – Tax & Accounting: OneSource tax information
• Corporate R&D – Around 40 researchers and developers (NLP, IR, ML) – Three R&D sites in the US and one in the UK: Eagan, MN; Rochester, NY; NYC; and London – We are hiring… email me at [email protected]
OVERVIEW • Speed
– Data locality, scalability, fault tolerance
• Ease of Use – Scala, interactive Shell
• Generality – SparkSQL, MLLib
• Comparing ML frameworks – Vowpal Wabbit (VW) – Sparkling Water
• The Future
WHAT IS SPARK? Apache Spark is a fast and general engine for large-scale data processing.
• Speed: runs iterative MapReduce-style jobs faster through in-memory computation: Resilient Distributed Datasets (RDDs)
• Ease of use: enables interactive data analysis in Scala, Python, or Java; interactive Shell
• Generality: offers libraries for SQL, Streaming and large-scale analytics (graph processing and machine learning)
• Integrated with Hadoop: runs on Hadoop 2’s YARN cluster
ACKNOWLEDGMENTS • Matei Zaharia and the AMPLab and Databricks teams for fantastic learning material and tutorials on Spark
• Hiroko Bretz, Thomas Vacek, Dezhao Song, Terry Heinze for Spark and Scala support and running experiments
• Adam Glaser for his time as a TSAP intern
• Mahadev Wudali and Mike Edwards for letting us play in the “sandbox” (cluster)
SPEED
PRIMARY GOALS OF SPARK • Extend the MapReduce model to better support
two common classes of analytics apps: – Iterative algorithms (machine learning, graphs) – Interactive data mining (R, Python)
• Enhance programmability: – Integrate into Scala programming language – Allow interactive use from Scala interpreter – Make Spark easily accessible from other
languages (Python, Java)
MOTIVATION
• Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: – Iterative algorithms (machine learning, graphs) – Interactive data mining tools (R, Python)
• With current frameworks, apps reload data from stable storage on each query
HADOOP MAPREDUCE VS SPARK
SOLUTION: Resilient Distributed Datasets (RDDs) • Allow apps to keep working sets in memory for
efficient reuse
• Retain the attractive properties of MapReduce – Fault tolerance, data locality, scalability
• Support a wide range of applications
PROGRAMMING MODEL Resilient distributed datasets (RDDs)
– Immutable, partitioned collections of objects – Created through parallel transformations (map, filter,
groupBy, join, …) on data in stable storage – Functions follow the same patterns as Scala operations
on lists – Can be cached for efficient reuse
80+ Actions on RDDs – count, reduce, save, take, first, …
EXAMPLE: LOG MINING Load error messages from a log into memory, then interactively search for various patterns
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()
[Diagram: the driver ships tasks to workers; each worker reads one HDFS block (Block 1-3), caches the filtered messages (Cache 1-3), and returns results to the driver. Base RDD → transformed RDD → action.]

cachedMsgs.filter(_.contains("timeout")).count
cachedMsgs.filter(_.contains("license")).count
. . .
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
BEHAVIOR WITH NOT ENOUGH RAM
Iteration time (s) vs. % of working set in memory:

Cache disabled: 68.8   25%: 58.1   50%: 40.7   75%: 29.7   Fully cached: 11.5
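How much of the working set stays in memory is controlled through storage levels; a minimal sketch, with file paths as placeholders:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// partitions that do not fit in RAM are recomputed from lineage on access.
val inMemory = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY)

// MEMORY_AND_DISK instead spills partitions that do not fit to local disk.
val spillable = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_AND_DISK)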
RDD Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions
Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

Lineage: HDFS File --filter(_.startsWith(...))--> Filtered RDD --map(_.split(...))--> Mapped RDD
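The lineage can be inspected directly in the shell; a small sketch using RDD.toDebugString:

val messages = sc.textFile("hdfs://...")
  .filter(_.startsWith("ERROR"))
  .map(_.split('\t')(2))

// Prints the chain of parent RDDs (HDFS file -> filtered -> mapped)
// that Spark replays to rebuild any lost partition.
println(messages.toDebugString)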
Fault Recovery Results
Iteration time (s) for each of 10 iterations, with a failure in the 6th:

1: 119   2: 57   3: 56   4: 58   5: 58   6: 81 (failure; lost partitions rebuilt from lineage)   7: 57   8: 59   9: 57   10: 59
EASE OF USE
INTERACTIVE SHELL • Data analysis can be done in the interactive shell.
– Start on your local machine or a cluster – Use multiple local cores with local[n] – A Spark context is already set up for you: SparkContext sc
• Load data from anywhere (local, HDFS, Cassandra, Amazon S3 etc.):
• Start analyzing your data:
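A minimal shell session sketch (file paths and the HDFS address are placeholders):

// Start the shell with four local cores:
//   ./bin/spark-shell --master local[4]
// The shell provides a ready-made SparkContext as `sc`.

// Load data from a local file or from HDFS:
val local = sc.textFile("data.txt")
val hdfs  = sc.textFile("hdfs://namenode:8020/logs/app.log")

// Start analyzing: count the lines and look at the first one.
local.count()
local.first()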
ANALYZE YOUR DATA • Word count in one line:
• List the word counts:
• Broadcast variables (e.g. a dictionary or stop word list) are needed because local variables must be distributed to the workers:
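A sketch of all three steps, assuming local data.txt and stopwords.txt files:

// Word count in one line:
val counts = sc.textFile("data.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// List the word counts on the driver:
counts.collect().foreach(println)

// Broadcast a stop word list once to every worker instead of
// shipping it with each task:
val stopWords = sc.broadcast(scala.io.Source.fromFile("stopwords.txt").getLines().toSet)
val filtered = counts.filter { case (word, _) => !stopWords.value.contains(word) }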
RUN A SPARK SCRIPT
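A minimal standalone script sketch (the SimpleApp name and file paths are assumptions); build it into a jar and run it with spark-submit:

// SimpleApp.scala -- build into a jar, then run with:
//   ./bin/spark-submit --class SimpleApp --master local[4] simple-app.jar
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val errors = sc.textFile("hdfs://...").filter(_.startsWith("ERROR")).count()
    println("ERROR lines: " + errors)
    sc.stop()
  }
}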
PYTHON SHELL & IPYTHON • The interactive shell can also be started as a Python shell, called pySpark:
• Start analyzing your data in Python now:
• Since it’s Python, you may want to use IPython – (a command shell for interactive programming in your browser):
IPYTHON AND SPARK • The iPython notebook environment and pySpark:
– Document data analysis results – Carry out machine learning experiments – Visualize results with matplotlib or other visualization
libraries – Combine with NLP libraries such as NLTK
• PySpark does not offer the full functionality of Spark Shell in Scala (yet)
• Some bugs (e.g. problems with unicode)
PROJECTS AT R&D USING SPARK • Entity linking
– Alternative name extraction from Wikipedia, Freebase, free text, and ClueWeb12, a web collection several TB in size (planned)
• Large-scale text data analysis: – Creating fingerprints for entities/events – Temporal slot filling: Assigning a begin and end time
stamp to a slot filler (e.g. A is employee of company B from BEGIN to END)
– Large-Scale text classification of Reuters News Archive articles (10 years)
• Language model computation used for search query analysis
SPARK MODULES • Spark Streaming: – Processing real-time data streams (see the sketch after this list)
• Spark SQL: – Support for structured data (JSON, Parquet) and
relational queries (SQL)
• MLlib: – Machine learning library
• GraphX: – New graph processing API
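For the streaming module, a minimal sketch: a word count over 1-second batches, assuming text arrives on a local socket (host and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// One StreamingContext per application; a new batch arrives every second.
val ssc = new StreamingContext(sc, Seconds(1))

// Count words in text received on a socket.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()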
SPARKSQL
SPARK SQL • Relational queries expressed in – SQL – HiveQL – A Scala domain-specific language (DSL)
• New type of RDD: SchemaRDD – An RDD composed of Row objects – Schema defined explicitly, or inferred from a Parquet file, a JSON data set, or data stored in Hive
• Spark SQL is in alpha: the API may change in the future!
DEFINING A SCHEMA
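A sketch against the 1.1-era API, assuming a people.txt file of name,age rows and a hypothetical Person case class:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly converts RDDs of case classes to SchemaRDDs

// Build a SchemaRDD from a text file of "name,age" rows.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Register the RDD as a table and query it with SQL.
people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)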
MLLIB
MLLIB • A machine learning module that comes with Spark • Shipped since Spark 0.8.0
• Provides various machine learning algorithms for classification and clustering
• Sparse vector representation since 1.0.0
• New features in the recently released version 1.1.0: – Includes a standard statistics library (e.g. correlation, hypothesis testing, sampling) – More algorithms ported to Java and Python – More feature engineering: TF-IDF, singular value decomposition (SVD)
MLLIB • Provides various machine learning algorithms:
– Classification: • Logistic regression, support vector machine (SVM), naïve
Bayes, decision trees
– Regression: • Linear regression, regression trees
– Collaborative Filtering: • Alternating least squares (ALS)
– Clustering: • K-means
– Decomposition • Singular value decomposition (SVD), Principal component
analysis (PCA)
OTHER ML FRAMEWORKS • Mahout
• LIBLINEAR
• MATLAB
• Scikit-learn
• GraphLab
• R
• Weka
• Vowpal Wabbit
• BigML
LARGE-SCALE ML INFRASTRUCTURE • More data implies bigger training sets and richer
feature sets.
• More data with a simple ML algorithm often beats less data with a complicated ML algorithm
• Large-scale ML requires big data infrastructure: – Faster processing: Hadoop, Spark – Feature engineering: Principal Component Analysis,
Hashing trick, Word2Vec
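As an illustration, a minimal sketch of the hashing trick in Scala (the 2^18 feature-space size and the helper name hashFeatures are arbitrary choices):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hash arbitrary tokens into a fixed-size sparse feature vector,
// avoiding the need for a global token-to-index dictionary.
val numFeatures = 1 << 18

def hashFeatures(tokens: Seq[String]): Vector = {
  // Map each token to a non-negative index in [0, numFeatures).
  val indexed = tokens.map { t =>
    (((t.hashCode % numFeatures) + numFeatures) % numFeatures, 1.0)
  }
  // Sum collisions into per-index counts.
  val counts = indexed.groupBy(_._1).mapValues(_.map(_._2).sum).toSeq
  Vectors.sparse(numFeatures, counts)
}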
PREDICTIVE ANALYTICS WITH MLLIB
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
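A minimal training sketch against the 1.x MLlib API (the LIBSVM-format file path is a placeholder):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// Load labeled points in LIBSVM format and cache them for the iterative solver.
val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt").cache()

// Train a logistic regression model with 100 gradient descent iterations.
val model = LogisticRegressionWithSGD.train(training, 100)

// Evaluate 0/1 accuracy on the training set.
val correct = training.filter(p => model.predict(p.features) == p.label).count()
println("Training accuracy: " + correct.toDouble / training.count())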
VW AND MLLIB COMPARISON • We compared Vowpal Wabbit and MLlib in
December 2013 (work with Tom Vacek)
• Vowpal Wabbit (VW) is a large-scale ML tool developed by John Langford (Microsoft)
• Task: binary text classification task on Reuters articles – Ease of implementation – Feature Extraction – Parameter tuning – Speed – Accessibility of programming languages
VW VS. MLLIB • Ease of implementation
– VW: a user tool designed for ML, no programming required – MLlib: a programming library, though some built-in support is emerging (e.g. regularization)
• Feature Extraction – VW: specific capabilities for bi-grams, prefix etc. – MLlib: no limit in terms of creating features
• Parameter tuning – VW: no parameter search capability, but multiple parameters can be hand-tuned – MLlib: offers cross-validation
• Speed – VW: highly optimized, very fast even on a single machine with multiple cores – MLlib: fast with lots of machines
• Accessibility of programming languages – VW: written in C++, a few wrappers (e.g. Python) – MLlib: Scala, Python, Java
• Conclusion end of 2013: VW had a slight advantage, but MLlib has caught up in at least some of the areas (e.g. sparse feature representation)
FINDINGS SO FAR • Large-scale extraction is a great fit for Spark when
working with large data sets (> 1GB)
• Ease of use makes Spark an ideal framework for rapid prototyping.
• MLlib is a fast growing ML library, but “under development”
• Vowpal Wabbit has been shown to crunch even large data sets with ease.
[Chart: training time and 0/1 loss for VW, LIBLINEAR, and Spark local[4] on the binary text classification task]
OTHER ML FRAMEWORKS • Internship by Adam Glaser compared various ML
frameworks with 5 standard data sets (NIPS) – Mass-spectrometric data (cancer), handwritten digit
detection, Reuters news classification, and synthetic data sets – Data sets were not very big, but had up to 1,000,000 features
• Evaluated accuracy of the generated models and speed for training time
• H2O, GraphLab, and Microsoft Azure showed strong performance in terms of accuracy and training time.
ACCURACY
SPEED
WHAT IS NEXT? • 0xdata plans to release Sparkling Water in October 2014
• Microsoft Azure also offers a strong platform with multiple ML algorithms and an intuitive user interface
• GraphLab has GraphLab Canvas™ for visualizing your data and plans to incorporate more ML algorithms.
CAN’T DECIDE?
CONCLUSIONS
CONCLUSIONS • Apache Spark is the most active project in the Hadoop ecosystem
• Spark offers speed and ease of use because of – RDDs – The interactive shell – Easy integration of Scala, Java, and Python scripts
• Integrated in Spark are modules for – Easy data access via SparkSQL – Large-scale analytics via MLlib
• Other ML frameworks enable analytics as well
• Evaluate which framework is the best fit for your data problem
THE FUTURE? • Apache Spark will be a unified platform for running various kinds of workloads: – Batch – Streaming – Interactive
• And connect with different runtime systems – Hadoop – Cassandra – Mesos – Cloud – …
THE FUTURE? • Spark will extend its offering of large-scale
algorithms for doing complex analytics: – Graph processing – Classification – Clustering – …
• Other frameworks will continue to offer similar capabilities.
• If you can’t beat them, join them.
http://labs.thomsonreuters.com/about-rd-careers/
EXTRA SLIDES
Example: Logistic Regression Goal: find the best line separating two sets of points
[Diagram: two classes of points (+ and –) in the plane, with a random initial line converging toward the target separating line]
Example: Logistic Regression
// Load the points once and keep them cached across iterations
val data = spark.textFile(...).map(readPoint).cache()

// Start from a random weight vector
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  // One full gradient step over the cached data set
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop vs. Spark. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for each further iteration.]
Spark Scheduler
Dryad-like DAGs
Pipelines functions within a stage
Cache-aware work reuse & locality
Partitioning-aware to avoid shuffles
[Diagram: an example job DAG over RDDs A-G, split into Stage 1 (groupBy), Stage 2 (map, union), and Stage 3 (join), with cached data partitions marked]
Spark Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
  collect, reduce, count, save, lookupKey