TRANSCRIPT
Submitted to: Dr. Sunnie Chung
Presented by: Sonal Deshmukh Jay Upadhyay
Introduction
• A data stream is a large amount of data that is created and arrives continuously at high speed. Examples: credit card transactions, Internet traffic data, sensor data, or network alarm data.
• MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using large clusters of commodity hardware.
• For iterative jobs, multiple map-reduce operations need to be performed sequentially, which involves very high disk I/O and high latency, making them too slow.
• Similarly, for interactive queries, data is read from disk each time the query is executed.
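The cost of re-reading input on every iteration can be illustrated with a toy pure-Python model (hypothetical numbers, not a benchmark): the map-reduce-style loop hits "disk" once per iteration, while the in-memory-style loop loads the data once and reuses it.

```python
# Toy model: iterative computation over a dataset, counting simulated disk reads.
dataset = list(range(1000))

def iterate_mapreduce_style(data, iterations):
    """Each iteration re-reads the input from 'disk'."""
    disk_reads = 0
    result = data
    for _ in range(iterations):
        disk_reads += 1            # simulated disk read per iteration
        result = [x + 1 for x in result]
    return result, disk_reads

def iterate_inmemory_style(data, iterations):
    """Input is read once, then kept in memory across iterations."""
    disk_reads = 1                 # single load, then cached
    result = data
    for _ in range(iterations):
        result = [x + 1 for x in result]
    return result, disk_reads

r1, reads1 = iterate_mapreduce_style(dataset, 10)
r2, reads2 = iterate_inmemory_style(dataset, 10)
print(reads1, reads2)  # 10 simulated reads vs. 1
```

Both loops compute the same result; only the number of (simulated) disk round-trips differs, which is the gap Spark's in-memory model closes.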
What is Spark?
• Fast and general computing system.
• Spark and its streaming version are built on top of Hadoop and perform data analysis on clusters.
• Improves over MapReduce:
  - In-memory computing primitives
  - General computation graphs
• Improves usability over MapReduce:
  - Rich APIs in Scala, Java, Python
  - Interactive extended Scala shell
Up to 100x faster (2-10x on disk)
Super-fast interactive analysis of Big Data
Spark Streaming
• Memory abstraction: efficiently share data across the different stages of a map-reduce job, or provide in-memory data sharing.
• The memory abstraction is the RDD (Resilient Distributed Dataset).
• Spark application: a driver program that runs the user's main function and executes various parallel operations on the cluster.
• Caches the data.
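The caching idea can be sketched with a toy stand-in for an RDD (pure Python, hypothetical class, not the Spark API): without caching, every operation recomputes the source; after caching, later operations are served from memory.

```python
class MiniRDD:
    """Toy stand-in for an RDD: records how often its source is recomputed."""
    def __init__(self, source_fn):
        self.source_fn = source_fn
        self.cached = None
        self.computes = 0

    def compute(self):
        if self.cached is not None:
            return self.cached     # served from memory, no recompute
        self.computes += 1
        return self.source_fn()

    def cache(self):
        self.cached = self.source_fn()
        self.computes += 1
        return self

uncached = MiniRDD(lambda: [x * 2 for x in range(5)])
uncached.compute(); uncached.compute()
print(uncached.computes)   # recomputed on every access: 2

cached = MiniRDD(lambda: [x * 2 for x in range(5)]).cache()
cached.compute(); cached.compute()
print(cached.computes)     # computed once, then reused: 1
```

In real Spark the equivalent is calling `.cache()` (or `.persist()`) on an RDD so that repeated actions reuse the in-memory partitions.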
Spark Ecosystem
• GraphX - graph computation engine that supports complex graph-processing algorithms efficiently and with improved performance. Example: PageRank
• MLlib - machine learning library built on top of Spark; supports many complex machine learning algorithms, which run up to 100x faster than map-reduce.
• Spark Streaming - supports analytical and interactive applications built on live streaming data.
• Spark SQL - used for querying structured data. Spark SQL allows users to ETL their data from its current format (like JSON, Parquet, or a database), transform it, and expose it for ad-hoc querying.
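To make the ETL-then-query flow concrete without a cluster, here is a stdlib-only Python sketch: parse JSON records, load them into an in-memory SQL table, and run an ad-hoc query. The data values are made up for illustration; in Spark SQL the analogous steps would be `spark.read.json` followed by `createOrReplaceTempView` and `spark.sql(...)`.

```python
import json
import sqlite3

# Hypothetical raw records as line-delimited JSON (the layout Spark SQL reads natively).
raw = "\n".join([
    json.dumps({"name": "Ada", "dept": "IT", "salary": 90}),
    json.dumps({"name": "Bob", "dept": "Finance", "salary": 70}),
    json.dumps({"name": "Eve", "dept": "IT", "salary": 80}),
])

# Extract + transform: parse each JSON line into a row tuple.
rows = [(r["name"], r["dept"], r["salary"]) for r in map(json.loads, raw.splitlines())]

# Load into a queryable table (Spark SQL equivalent: a temp view over a DataFrame).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, dept TEXT, salary INT)")
con.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)

# Ad-hoc query over the loaded data.
avg_by_dept = dict(con.execute(
    "SELECT dept, AVG(salary) FROM people GROUP BY dept"))
print(avg_by_dept["IT"], avg_by_dept["Finance"])  # 85.0 70.0
```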
Design
• Scala is a functional language that runs on top of the Java VM and is fully compatible with any Java-based library.
• In StreamDM, input/output stream datasets are divided into several discretized streams (DStreams).
• This allows combining batch-processing algorithms with streaming algorithms.
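The discretized-stream idea can be sketched in a few lines of pure Python (a toy model, not the StreamDM or Spark API): the incoming stream is chopped into small batches, and an ordinary batch algorithm is applied unchanged to each batch in turn.

```python
def discretize(stream, batch_size):
    """Chop an ordered stream into fixed-size micro-batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                # final partial batch

def batch_word_count(batch):
    """An ordinary batch algorithm, reused unchanged on each micro-batch."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["spark", "job", "spark", "tweet", "job", "spark"]
per_batch = [batch_word_count(b) for b in discretize(stream, 3)]
print(per_batch)  # [{'spark': 2, 'job': 1}, {'tweet': 1, 'job': 1, 'spark': 1}]
```

In Spark Streaming the batching is done by arrival time interval rather than count, but the principle of reusing batch logic on each micro-batch is the same.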
A StreamDM task
• Internal instance data structure: the Example
• Read as a DStream; a Reader class parses it
• Data mining algorithm implemented in a Learner
• The assignments or predictions from the Learner are evaluated by an Evaluator
• Finally, the results are output by a Writer class to disk, console, or HDFS
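The Reader → Learner → Evaluator pipeline can be mimicked with plain Python (names mirror the StreamDM roles but are hypothetical, not the real API). The evaluator below uses prequential (test-then-train) evaluation, a standard choice for streams.

```python
def reader(raw_lines):
    """Parse raw text lines into (features, label) examples."""
    for line in raw_lines:
        *features, label = line.split(",")
        yield [float(f) for f in features], label

class MajorityLearner:
    """Trivial learner: predict the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, features):
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)
    def train(self, features, label):
        self.counts[label] = self.counts.get(label, 0) + 1

def evaluate_prequential(examples, learner):
    """Test-then-train: predict each example before learning from it."""
    correct = total = 0
    for features, label in examples:
        if learner.predict(features) == label:
            correct += 1
        total += 1
        learner.train(features, label)
    return correct / total

raw = ["1.0,2.0,spam", "0.5,1.5,spam", "3.0,0.1,ham", "1.1,2.2,spam"]
accuracy = evaluate_prequential(reader(raw), MajorityLearner())
print(accuracy)  # 0.5
```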
Implemented Learner Algorithms
StreamDM contains implementations of the following classification algorithms, which implement the Classifier trait:
• Multinomial Naive Bayes: relative frequency of each word
• SGD Learner and Perceptron: optimizer for linear models
• Hoeffding Decision Trees: based on the decision tree algorithm
• Bagging: bootstrap sampling with replacement
and the following clustering algorithms:
• CluStream: clusters a set of instances from the stream
• StreamKM++: k-means algorithm
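As one concrete example, the "relative frequency of each word" idea behind Multinomial Naive Bayes can be sketched in pure Python (add-one smoothing is a standard assumption added here; the example data is made up):

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes over word counts, trained incrementally."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.class_counts = Counter()
        self.vocab = set()

    def train(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, words):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            v = len(self.vocab)
            score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
            for w in words:
                # relative frequency of the word in this class, add-one smoothed
                score += math.log((self.word_counts[label][w] + 1) / (total + v))
            return score
        return max(self.class_counts, key=log_score)

nb = MultinomialNB()
nb.train(["job", "alert", "engineer"], "IT")
nb.train(["job", "opening", "controller"], "Finance")
print(nb.predict(["engineer", "job"]))  # IT
```

Because `train` only increments counters, the model updates one example at a time, which is what makes it a natural fit for streams.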
Streaming Twitter data analysis using Spark for effective job search
Big Streaming Data
Why process Big Streaming Data? Fraud detection in bank transactions
Anomalies in sensor data
Cat videos in tweets
How to Process Big Streaming Data?
Use cases: website monitoring, fraud detection, ad monetization
Requirements:
• Scales to hundreds of nodes
• Achieves low latency
• Efficiently recovers from failures
• Integrates with batch and interactive processing
Distributed Stream Processing
[Diagram: raw data streams → distributed stream processing system → processed data]
What have people been doing?
• Build two stacks - one for batch processing, one for streaming
  - Often both process the same data
• Extremely painful to maintain two different stacks
  - Different programming models
  - Doubles implementation effort
• Existing frameworks cannot do both
  - Either stream processing of 100s of MB/s with low latency
  - Or batch processing of TBs of data with high latency
Spark Programming Model
> Resilient distributed datasets (RDDs) - Distributed collection of objects - Manipulated through parallel transformations
(map, filter, reduce, etc.) - Can be cached in memory across cluster - Automatically rebuilt on failure
> Programming Interface - Functional APIs in Scala, Java, Python - Interactive use from Scala & Python shell
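A flavour of the functional API, written in plain Python so it runs without a cluster (in real PySpark the corresponding calls are `sc.parallelize`, `map`, `filter`, and the `reduce` action):

```python
from functools import reduce

# Stand-in for sc.parallelize(range(1, 11))
data = list(range(1, 11))

# Transformations: map then filter (lazy in Spark, eager here).
squares = map(lambda x: x * x, data)
even_squares = filter(lambda x: x % 2 == 0, squares)

# Action: in Spark, reduce is what actually triggers the computation.
total = reduce(lambda a, b: a + b, even_squares)
print(total)  # 220
```

In Spark the same chain would also be automatically rebuilt from lineage if a partition were lost, which is the "resilient" part of RDDs.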
Application Model
[Diagram: Twitter → (OAuth, tweet stream) → Spark Cluster (filtering and classification) → Data Store ↔ (query/result) Client; deployed in the cloud or on-premise]
Tweet Classification based on Job Type
• "#IT #Job alert: VMware Quality System Engineer | VMware | #Palo Alto http://txxxx #Jobs" → IT
• "Blown an interview? Maybe not. Here's how to recover: http://t.co/xxx #career" → invalid
• "#Job #Germantown STxxx: Systems Administrator (Sxx): xx Project Overview: ... http://txxx" → IT
• "JOB OPENING: Project Financial Controls Specialist - IRC at Mxxx (Minneapolis, MN) http://txxxxj #job" → Finance
• "Axx Txx TRAINING #Transportation #Job: DRIVERS (#OKLAHOMA, OK) http://t.coxx" → Driving
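A hypothetical rule-based sketch of this classification step (pure Python; the keyword lists are illustrative assumptions, not the project's actual features): tweets with no job-related marker are labelled "invalid", and the rest are matched against per-category keywords.

```python
# Hypothetical keyword rules per job category.
CATEGORY_KEYWORDS = {
    "IT": ["engineer", "systems administrator", "vmware"],
    "Finance": ["financial", "controller", "accountant"],
    "Driving": ["drivers", "#transportation"],
}

def classify_tweet(tweet):
    text = tweet.lower()
    if "job" not in text:
        return "invalid"           # not a job vacancy tweet
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in text for k in keywords):
            return category
    return "invalid"               # vacancy, but no known category

print(classify_tweet("#IT #Job alert: VMware Quality System Engineer"))  # IT
print(classify_tweet("Blown an interview? Maybe not. #career"))          # invalid
```

In the deployed system this logic would run inside the Spark cluster's filtering-and-classification stage; the Naive Bayes classifier described earlier is the learned alternative to such hand-written rules.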
Results
[Bar chart: popular job categories tweeted, expressed as a % of total job vacancy tweets. X-axis: job category advertised (IT, Healthcare, Finance, Sales & Marketing); Y-axis: % among advertised job tweets, 0-25.]
Conclusion
• By focusing on practical real-world scenarios, the paper demonstrated that StreamDM is intuitive and ready for industry deployments.
• A scalable model for real-time analysis and filtering of job-vacancy-related tweets
• Classifies tweets into job categories
• Accesses tweets without following a Twitter account
• Used for streaming tweets
• Caters to growing data sizes
References
• http://silviu.maniu.info/publications/bifet2015streamdm.pdf
• http://spark.apache.org/docs/latest
• https://databricks.com/spark/about
• http://hortonworks.com/apache/spark/
• http://www.kdnuggets.com/2015/06/introduction-big-data-apache-spark.html
Thank You