TRANSCRIPT
Submitted to: Dr. Sunnie Chung
Presented by: Sonal Deshmukh Jay Upadhyay
Introduction
• A data stream is a large amount of data that is created and arrives continuously at high speed. Examples: credit card transactions, Internet traffic data, sensor data, or network alarm data.
• MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using large clusters of commodity hardware.
• For iterative jobs, multiple map-reduce operations need to be performed sequentially, which involves very high disk I/O and high latency, making them too slow.
• Similarly, for interactive queries, data is read from disk each time the query is executed.
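The cost of re-reading input on every iteration can be illustrated with a toy pure-Python model (hypothetical numbers, not a benchmark): the map-reduce-style loop hits "disk" once per iteration, while the in-memory-style loop loads the data once and reuses it.

```python
# Toy model: iterative computation over a dataset, counting simulated disk reads.
dataset = list(range(1000))

def iterate_mapreduce_style(data, iterations):
    """Each iteration re-reads the input from 'disk'."""
    disk_reads = 0
    result = data
    for _ in range(iterations):
        disk_reads += 1            # simulated disk read per iteration
        result = [x + 1 for x in result]
    return result, disk_reads

def iterate_inmemory_style(data, iterations):
    """Input is read once, then kept in memory across iterations."""
    disk_reads = 1                 # single load, then cached
    result = data
    for _ in range(iterations):
        result = [x + 1 for x in result]
    return result, disk_reads

r1, reads1 = iterate_mapreduce_style(dataset, 10)
r2, reads2 = iterate_inmemory_style(dataset, 10)
print(reads1, reads2)  # 10 simulated reads vs. 1
```

Both loops compute the same result; only the number of (simulated) disk round-trips differs, which is the gap Spark's in-memory model closes.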
What is Spark?
• Fast and general computing system.
• Spark and its streaming version are built on top of Hadoop and perform data analysis on clusters.
• Improves over MapReduce:
  - In-memory computing primitives
  - General computation graphs
• Improves usability over MapReduce:
  - Rich APIs in Scala, Java, Python
  - Interactive extended Scala shell
Up to 100x faster (2-10x on disk)
Super-fast interactive analysis of Big Data
Spark Streaming
• Memory abstraction: efficiently share data across the different stages of a map-reduce job, or provide in-memory data sharing.
• The memory abstraction is the RDD (Resilient Distributed Dataset).
• Spark application: a driver program that runs the user's main function and executes various parallel operations on the cluster.
• Caches the data.
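The caching idea can be sketched with a toy stand-in for an RDD (pure Python, hypothetical class, not the Spark API): without caching, every operation recomputes the source; after caching, later operations are served from memory.

```python
class MiniRDD:
    """Toy stand-in for an RDD: records how often its source is recomputed."""
    def __init__(self, source_fn):
        self.source_fn = source_fn
        self.cached = None
        self.computes = 0

    def compute(self):
        if self.cached is not None:
            return self.cached     # served from memory, no recompute
        self.computes += 1
        return self.source_fn()

    def cache(self):
        self.cached = self.source_fn()
        self.computes += 1
        return self

uncached = MiniRDD(lambda: [x * 2 for x in range(5)])
uncached.compute(); uncached.compute()
print(uncached.computes)   # recomputed on every access: 2

cached = MiniRDD(lambda: [x * 2 for x in range(5)]).cache()
cached.compute(); cached.compute()
print(cached.computes)     # computed once, then reused: 1
```

In real Spark the equivalent is calling `.cache()` (or `.persist()`) on an RDD so that repeated actions reuse the in-memory partitions.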
Spark Ecosystem
• GraphX - graph computation engine that supports complex graph-processing algorithms efficiently and with improved performance. Example: PageRank
• MLlib - machine learning library built on top of Spark; supports many complex machine learning algorithms, which run up to 100x faster than map-reduce.
• Spark Streaming - supports analytical and interactive applications built on live streaming data.
• Spark SQL - used for querying structured data. Spark SQL allows users to ETL their data from its current format (like JSON, Parquet, or a database), transform it, and expose it for ad-hoc querying.
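To make the ETL-then-query flow concrete without a cluster, here is a stdlib-only Python sketch: parse JSON records, load them into an in-memory SQL table, and run an ad-hoc query. The data values are made up for illustration; in Spark SQL the analogous steps would be `spark.read.json` followed by `createOrReplaceTempView` and `spark.sql(...)`.

```python
import json
import sqlite3

# Hypothetical raw records as line-delimited JSON (the layout Spark SQL reads natively).
raw = "\n".join([
    json.dumps({"name": "Ada", "dept": "IT", "salary": 90}),
    json.dumps({"name": "Bob", "dept": "Finance", "salary": 70}),
    json.dumps({"name": "Eve", "dept": "IT", "salary": 80}),
])

# Extract + transform: parse each JSON line into a row tuple.
rows = [(r["name"], r["dept"], r["salary"]) for r in map(json.loads, raw.splitlines())]

# Load into a queryable table (Spark SQL equivalent: a temp view over a DataFrame).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, dept TEXT, salary INT)")
con.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)

# Ad-hoc query over the loaded data.
avg_by_dept = dict(con.execute(
    "SELECT dept, AVG(salary) FROM people GROUP BY dept"))
print(avg_by_dept["IT"], avg_by_dept["Finance"])  # 85.0 70.0
```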
Design
• Scala is a functional language that runs on top of the Java VM and is fully compatible with any Java-based library.
• In StreamDM, input/output stream datasets are divided into several discretized streams (DStreams).
• This allows combining batch-processing algorithms with streaming algorithms.
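The discretized-stream idea can be sketched in a few lines of pure Python (a toy model, not the StreamDM or Spark API): the incoming stream is chopped into small batches, and an ordinary batch algorithm is applied unchanged to each batch in turn.

```python
def discretize(stream, batch_size):
    """Chop an ordered stream into fixed-size micro-batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                # final partial batch

def batch_word_count(batch):
    """An ordinary batch algorithm, reused unchanged on each micro-batch."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["spark", "job", "spark", "tweet", "job", "spark"]
per_batch = [batch_word_count(b) for b in discretize(stream, 3)]
print(per_batch)  # [{'spark': 2, 'job': 1}, {'tweet': 1, 'job': 1, 'spark': 1}]
```

In Spark Streaming the batching is done by arrival time interval rather than count, but the principle of reusing batch logic on each micro-batch is the same.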
A StreamDM task
• Internal instance data structure: the Example
• Read as a DStream; a Reader class parses it
• Data mining algorithm implemented in a Learner
• The assignments or predictions from the Learner are evaluated by an Evaluator
• Finally, the results are output by a Writer class to disk, console, or HDFS
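The Reader → Learner → Evaluator pipeline can be mimicked with plain Python (names mirror the StreamDM roles but are hypothetical, not the real API). The evaluator below uses prequential (test-then-train) evaluation, a standard choice for streams.

```python
def reader(raw_lines):
    """Parse raw text lines into (features, label) examples."""
    for line in raw_lines:
        *features, label = line.split(",")
        yield [float(f) for f in features], label

class MajorityLearner:
    """Trivial learner: predict the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, features):
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)
    def train(self, features, label):
        self.counts[label] = self.counts.get(label, 0) + 1

def evaluate_prequential(examples, learner):
    """Test-then-train: predict each example before learning from it."""
    correct = total = 0
    for features, label in examples:
        if learner.predict(features) == label:
            correct += 1
        total += 1
        learner.train(features, label)
    return correct / total

raw = ["1.0,2.0,spam", "0.5,1.5,spam", "3.0,0.1,ham", "1.1,2.2,spam"]
accuracy = evaluate_prequential(reader(raw), MajorityLearner())
print(accuracy)  # 0.5
```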
Implemented Learner Algorithms
StreamDM contains implementations of the following classification algorithms, which implement the Classifier trait:
• Multinomial Naive Bayes: relative frequency of each word
• SGD Learner and Perceptron: optimizer for linear models
• Hoeffding Decision Trees: based on the decision tree algorithm
• Bagging: bootstrap sampling with replacement
and the following clustering algorithms:
• CluStream: clusters a set of instances from the stream
• StreamKM++: k-means algorithm
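As one concrete example, the "relative frequency of each word" idea behind Multinomial Naive Bayes can be sketched in pure Python (add-one smoothing is a standard assumption added here; the example data is made up):

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes over word counts, trained incrementally."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.class_counts = Counter()
        self.vocab = set()

    def train(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, words):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            v = len(self.vocab)
            score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
            for w in words:
                # relative frequency of the word in this class, add-one smoothed
                score += math.log((self.word_counts[label][w] + 1) / (total + v))
            return score
        return max(self.class_counts, key=log_score)

nb = MultinomialNB()
nb.train(["job", "alert", "engineer"], "IT")
nb.train(["job", "opening", "controller"], "Finance")
print(nb.predict(["engineer", "job"]))  # IT
```

Because `train` only increments counters, the model updates one example at a time, which is what makes it a natural fit for streams.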
Streaming Twitter data analysis using Spark for effective job search
Big Streaming Data
Why process Big Streaming Data? Fraud detection in bank transactions
Anomalies in sensor data
Cat videos in tweets
How to Process Big Streaming Data?
Use cases: website monitoring, fraud detection, ad monetization
Requirements:
• Scales to hundreds of nodes
• Achieves low latency
• Efficiently recovers from failures
• Integrates with batch and interactive processing
Distributed Stream Processing
[Diagram: raw data streams → distributed stream processing system → processed data]
What have people been doing?
• Build two stacks - one for batch processing, one for streaming
  - Often both process the same data
• Extremely painful to maintain two different stacks
  - Different programming models
  - Doubles implementation effort
• Existing frameworks cannot do both
  - Either stream processing of 100s of MB/s with low latency
  - Or batch processing of TBs of data with high latency
Spark Programming Model
> Resilient distributed datasets (RDDs) - Distributed collection of objects - Manipulated through parallel transformations
(map, filter, reduce, etc.) - Can be cached in memory across cluster - Automatically rebuilt on failure
> Programming Interface - Functional APIs in Scala, Java, Python - Interactive use from Scala & Python shell
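A flavour of the functional API, written in plain Python so it runs without a cluster (in real PySpark the corresponding calls are `sc.parallelize`, `map`, `filter`, and the `reduce` action):

```python
from functools import reduce

# Stand-in for sc.parallelize(range(1, 11))
data = list(range(1, 11))

# Transformations: map then filter (lazy in Spark, eager here).
squares = map(lambda x: x * x, data)
even_squares = filter(lambda x: x % 2 == 0, squares)

# Action: in Spark, reduce is what actually triggers the computation.
total = reduce(lambda a, b: a + b, even_squares)
print(total)  # 220
```

In Spark the same chain would also be automatically rebuilt from lineage if a partition were lost, which is the "resilient" part of RDDs.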
Application Model
[Diagram: Twitter → (OAuth, tweet stream) → Spark Cluster (filtering and classification) → Data Store ↔ (query/result) Client; deployed in the cloud or on-premise]
Tweet Classification based on Job Type
• "#IT #Job alert: VMware Quality System Engineer | VMware | #Palo Alto http://txxxx #Jobs" → IT
• "Blown an interview? Maybe not. Here's how to recover: http://t.co/xxx #career" → invalid
• "#Job #Germantown STxxx: Systems Administrator (Sxx): xx Project Overview: ... http://txxx" → IT
• "JOB OPENING: Project Financial Controls Specialist - IRC at Mxxx (Minneapolis, MN) http://txxxxj #job" → Finance
• "Axx Txx TRAINING #Transportation #Job: DRIVERS (#OKLAHOMA, OK) http://t.coxx" → Driving
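A hypothetical rule-based sketch of this classification step (pure Python; the keyword lists are illustrative assumptions, not the project's actual features): tweets with no job-related marker are labelled "invalid", and the rest are matched against per-category keywords.

```python
# Hypothetical keyword rules per job category.
CATEGORY_KEYWORDS = {
    "IT": ["engineer", "systems administrator", "vmware"],
    "Finance": ["financial", "controller", "accountant"],
    "Driving": ["drivers", "#transportation"],
}

def classify_tweet(tweet):
    text = tweet.lower()
    if "job" not in text:
        return "invalid"           # not a job vacancy tweet
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in text for k in keywords):
            return category
    return "invalid"               # vacancy, but no known category

print(classify_tweet("#IT #Job alert: VMware Quality System Engineer"))  # IT
print(classify_tweet("Blown an interview? Maybe not. #career"))          # invalid
```

In the deployed system this logic would run inside the Spark cluster's filtering-and-classification stage; the Naive Bayes classifier described earlier is the learned alternative to such hand-written rules.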
Results
[Bar chart: popular job categories tweeted, expressed as a % of total job vacancy tweets. X-axis: job category advertised (IT, Healthcare, Finance, Sales & Marketing); Y-axis: % among advertised job tweets, 0-25.]
Conclusion
• By focusing on practical real-world scenarios, the paper demonstrated that StreamDM is intuitive and ready for industry deployments.
• A scalable model for real-time analysis and filtering of job-vacancy-related tweets
• Classifies tweets into job categories
• Accesses tweets without following a Twitter account
• Used for streaming tweets
• Caters to growing data sizes
References
• http://silviu.maniu.info/publications/bifet2015streamdm.pdf
• http://spark.apache.org/docs/latest
• https://databricks.com/spark/about
• http://hortonworks.com/apache/spark/
• http://www.kdnuggets.com/2015/06/introduction-big-data-apache-spark.html
Thank You