Introduction · cis.csuohio.edu/~sschung/CIS601/Sonal_Spark.pdf
Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay


Post on 20-May-2020


TRANSCRIPT

Page 1

       

 


Page 2

Introduction

• A data stream is a large volume of data that is generated continuously and arrives at high speed. Examples: credit card transactions, Internet traffic data, sensor data, and network alarm data.
• MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using large clusters of commodity hardware.
• For iterative jobs, multiple MapReduce operations need to be performed sequentially, which involves very high disk I/O and high latency, making them too slow.
• Similarly, for interactive queries, data is read from disk each time the query is executed.

Page 3

Streaming

Page 4

What is Spark?

• Fast and general computing system.
• Spark and its streaming version run on top of Hadoop and perform data analysis on clusters.
• Improves over MapReduce
  - In-memory computing primitives
  - General computation graphs
• Improves usability over MapReduce
  - Rich APIs in Scala, Java, Python
  - Interactive extended Scala shell

Up to 100x faster (2-10x on disk)

Super fast interactive analysis of Big Data

Page 5

Spark Streaming

• Memory abstraction: efficiently shares data across the different stages of a MapReduce job, that is, it provides in-memory data sharing.
• The memory abstraction is provided by the RDD (resilient distributed dataset).
• Spark application: a driver program that runs the user's main function and executes various parallel operations on the cluster.
• Caches the data.

 

Page 6

Spark Ecosystem

 

• GraphX: graph computation engine that supports complex graph processing algorithms efficiently. Example: PageRank.
• MLlib: machine learning library built on top of Spark. It supports many machine learning algorithms, which can run up to 100x faster than MapReduce.
• Spark Streaming: supports analytical and interactive applications built on live streaming data.
• Spark SQL: used for querying structured data. Spark SQL lets users ETL their data from its current format (such as JSON, Parquet, or a database), transform it, and expose it for ad hoc querying.
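The ETL flow described for Spark SQL (read semi-structured records, transform them, expose them for ad hoc queries) can be illustrated without a cluster. The sketch below is plain Python that uses the standard `json` and `sqlite3` modules as a stand-in. It is not the Spark SQL API, and the records and the `jobs` table are made up for illustration.

```python
import json
import sqlite3

# Hypothetical JSON-lines input; in Spark SQL this would be a JSON or Parquet source.
raw_lines = [
    '{"category": "it", "city": "Palo Alto"}',
    '{"category": "finance", "city": "Minneapolis"}',
    '{"category": "it", "city": "Germantown"}',
]

# Extract: parse each JSON record.
records = [json.loads(line) for line in raw_lines]

# Transform: normalize the category to upper case.
rows = [(record["category"].upper(), record["city"]) for record in records]

# Load: expose the cleaned rows as a table that can be queried ad hoc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (category TEXT, city TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)", rows)

# Ad hoc query: number of postings per category.
result = conn.execute(
    "SELECT category, COUNT(*) FROM jobs GROUP BY category ORDER BY category"
).fetchall()
print(result)
```

The same three steps map onto reading a source, applying transformations, and registering a queryable table in Spark SQL; only the engine differs.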

Page 7

Design

• Scala is a functional language that runs on the Java VM and is fully compatible with any Java-based library.
• In StreamDM, input/output stream datasets are divided into several discretized streams (DStreams).
• This allows batch processing algorithms to be combined with streaming algorithms.
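The idea behind a discretized stream is that a continuous stream is cut into small batches by time interval, so that ordinary batch code can run on each piece. The following is a minimal pure-Python sketch of that idea, not the Spark Streaming API; the function name `discretize` and the sample events are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def discretize(
    events: Iterable[Tuple[float, str]], batch_seconds: float
) -> Iterator[List[str]]:
    """Group (timestamp, payload) events into consecutive fixed-width time batches."""
    batch: List[str] = []
    window_end = None
    for timestamp, payload in events:
        if window_end is None:
            # The first event opens the first window.
            window_end = timestamp + batch_seconds
        # Close finished windows (possibly empty ones) until the event fits.
        while timestamp >= window_end:
            yield batch
            batch = []
            window_end += batch_seconds
        batch.append(payload)
    if batch:
        yield batch


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (3.2, "d")]
batches = list(discretize(events, batch_seconds=1.0))

# Any batch algorithm can now be applied to each micro-batch, here a simple count.
counts = [len(batch) for batch in batches]
print(batches, counts)
```

Note that an interval with no events still produces an (empty) batch, which keeps downstream processing aligned with wall-clock intervals.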

 

Page 8

A StreamDM task

• The internal instance data structure is the Example.
• Input is read from a DStream and parsed by a Reader class.
• The data mining algorithm is implemented in a Learner.
• The assignments or predictions from the Learner are evaluated by an Evaluator.
• Finally, the results are output by a Writer class to disk, console, or HDFS.
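The roles listed above (Example, Reader, Learner, Evaluator, Writer) can be sketched as a small pipeline. This is an illustrative Python sketch of the roles only, not StreamDM's Scala API: every class and function name below is hypothetical, the learner is a trivial majority-class predictor, and the test-then-train ordering is an assumption about how predictions are evaluated.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Example:
    """Internal instance representation: feature values plus a class label."""
    features: List[float]
    label: str


def read_example(line: str) -> Example:
    """Reader role: parse a 'f1,f2,...,label' line into an Example."""
    *values, label = line.strip().split(",")
    return Example([float(v) for v in values], label)


class MajorityClassLearner:
    """Learner role: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self, example: Example) -> Optional[str]:
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def train(self, example: Example) -> None:
        self.counts[example.label] += 1


class AccuracyEvaluator:
    """Evaluator role: accumulates how many predictions were correct."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def add(self, predicted: Optional[str], actual: str) -> None:
        self.total += 1
        if predicted == actual:
            self.correct += 1

    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


def write_result(evaluator: AccuracyEvaluator) -> str:
    """Writer role: emit the result (here to the console)."""
    line = f"accuracy={evaluator.accuracy():.2f} over {evaluator.total} examples"
    print(line)
    return line


def run_task(lines: Iterable[str]) -> AccuracyEvaluator:
    learner = MajorityClassLearner()
    evaluator = AccuracyEvaluator()
    for line in lines:
        example = read_example(line)
        evaluator.add(learner.predict(example), example.label)  # test first
        learner.train(example)  # then train on the same example
    return evaluator


# Hypothetical input stream.
stream = ["1.0,0.5,spam", "0.2,0.1,spam", "0.9,0.7,ham", "0.3,0.3,spam"]
evaluator = run_task(stream)
report = write_result(evaluator)
```

Keeping each role behind its own small interface is what lets a different learner or output target be swapped in without touching the rest of the task.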

Page 9

Implemented Learner Algorithms

StreamDM contains implementations of the following classification algorithms, which implement the Classifier trait:

• Multinomial Naive Bayes: uses the relative frequency of each word
• SGD Learner and Perceptron: optimizers for linear models
• Hoeffding Decision Trees: based on the decision tree algorithm
• Bagging: bootstrap sampling with replacement

It also contains the following clustering algorithms:

• CluStream: clusters sets of instances from the stream
• StreamKM++: based on the k-means algorithm
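For the Hoeffding decision trees listed above, the key ingredient is the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. A node is split once the gap between the two best attributes exceeds epsilon. The sketch below shows only that test in Python; the function names and the numbers are illustrative and are not taken from StreamDM.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that the true mean lies within epsilon of the
    observed mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(
    best_gain: float, second_gain: float, value_range: float, delta: float, n: int
) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative values: information gain on a two-class problem has range 1.
epsilon = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
clear_winner = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
too_close = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=1000)
print(epsilon, clear_winner, too_close)
```

Because epsilon shrinks as n grows, a node that cannot decide yet simply waits for more examples from the stream instead of storing them.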

Page 10

Streaming Twitter data analysis using Spark for effective job search

Page 11

Big  Streaming Data

Page 12

Why process Big Streaming Data?

Fraud detection in bank transactions

Anomalies in sensor data

Cat videos in tweets

Page 13

How to Process Big Streaming Data?

Use cases: website monitoring, fraud detection, ad monetization

Requirements for a distributed stream processing system:
• Scales to hundreds of nodes
• Achieves low latency
• Efficiently recovers from failures
• Integrates with batch and interactive processing

[Diagram: Raw Data Streams → Distributed Stream Processing System → Processed Data]

Page 14

What have people been doing?

• Build two stacks: one for batch processing, one for streaming
  - Often both process the same data
• Extremely painful to maintain two different stacks
  - Different programming models
  - Doubles implementation effort
• Existing frameworks cannot do both
  - Either stream processing of 100s of MB/s with low latency
  - Or batch processing of TBs of data with high latency

Page 15

Spark Programming Model

> Resilient distributed datasets (RDDs)
  - Distributed collection of objects
  - Manipulated through parallel transformations (map, filter, reduce, etc.)
  - Can be cached in memory across the cluster
  - Automatically rebuilt on failure

> Programming interface
  - Functional APIs in Scala, Java, Python
  - Interactive use from the Scala and Python shells
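The RDD properties above (lazy transformations, in-memory caching, rebuilding lost data from its lineage) can be mimicked on a single machine. The toy class below is a Python sketch for intuition only: `MiniRDD` and its `evict` method are hypothetical names, nothing here is distributed, and it is not the Spark API.

```python
import functools
from typing import Callable, Iterable, List, Optional


class MiniRDD:
    """Toy stand-in for an RDD: it stores how to compute its data (its lineage)
    rather than the data itself."""

    def __init__(self, compute: Callable[[], List]) -> None:
        self._compute = compute
        self._cache: Optional[List] = None
        self._use_cache = False
        self.computations = 0  # how many times the lineage was re-run

    @staticmethod
    def from_list(data: Iterable) -> "MiniRDD":
        items = list(data)
        return MiniRDD(lambda: list(items))

    # Transformations are lazy: they only describe a new dataset.
    def map(self, fn: Callable) -> "MiniRDD":
        return MiniRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate: Callable) -> "MiniRDD":
        return MiniRDD(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self) -> "MiniRDD":
        self._use_cache = True
        return self

    def evict(self) -> None:
        """Simulate losing the cached copy, for example after a node failure."""
        self._cache = None

    # Actions trigger the actual computation.
    def collect(self) -> List:
        if self._use_cache and self._cache is not None:
            return list(self._cache)
        self.computations += 1
        result = self._compute()
        if self._use_cache:
            self._cache = result
        return list(result)

    def reduce(self, fn: Callable):
        return functools.reduce(fn, self.collect())


numbers = MiniRDD.from_list(range(1, 6))
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = squares_of_evens.collect()   # computed from the lineage
second = squares_of_evens.collect()  # served from the cache
squares_of_evens.evict()             # cached copy is lost
third = squares_of_evens.collect()   # rebuilt from the lineage
total = squares_of_evens.reduce(lambda a, b: a + b)
print(first, total, squares_of_evens.computations)
```

The `computations` counter makes the trade-off visible: repeated actions on a cached dataset cost nothing extra, and a lost cache costs exactly one recomputation.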

Page 16

Application Model

[Diagram. Components: Twitter, Spark Cluster (Filtering and Classification), Data Store, Client. Deployment: Cloud or On-Premise. Edge labels: OAuth, Tweet Stream, Query, Result.]

Page 17

Tweet Classification Based on Job Type

Tweet: #IT #Job alert: VMware Quality System Engineer | VMware | #Palo Alto http://txxxx #Jobs
Job Category: IT

Tweet: Blown an interview? Maybe not. Here's how to recover: http://t.co/xxx #career
Job Category: invalid

Tweet: #Job #Germantown STxxx: Systems Administrator (Sxx): xx Project Overview: ... http://txxx
Job Category: IT

Tweet: JOB OPENING: Project Financial Controls Specialist - IRC at Mxxx (Minneapolis, MN) http://txxxxj #job
Job Category: Finance

Tweet: Axx Txx TRAINING #Transportation #Job: DRIVERS (#OKLAHOMA, OK) http://t.coxx
Job Category: Driving
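One way to produce labels like those in the table is a multinomial Naive Bayes text classifier, which scores each category by the (smoothed) relative frequency of the tweet's words. The slides do not state which classifier the application used, so the Python sketch below is an assumption for illustration: the `TweetNaiveBayes` class is hypothetical, and the training texts are the example tweets above with masked fragments and URLs removed.

```python
import math
import re
from collections import Counter, defaultdict
from typing import DefaultDict, List, Optional, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class TweetNaiveBayes:
    """Incremental multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.word_counts: DefaultDict[str, Counter] = defaultdict(Counter)
        self.vocabulary: Set[str] = set()

    def train(self, text: str, label: str) -> None:
        self.class_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def predict(self, text: str) -> Optional[str]:
        total_docs = sum(self.class_counts.values())
        words = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Training examples adapted from the table above.
training = [
    ("#IT #Job alert: VMware Quality System Engineer | VMware | #Palo Alto #Jobs", "IT"),
    ("Blown an interview? Maybe not. Here's how to recover: #career", "invalid"),
    ("#Job #Germantown Systems Administrator Project Overview", "IT"),
    ("JOB OPENING: Project Financial Controls Specialist (Minneapolis, MN) #job", "Finance"),
    ("TRAINING #Transportation #Job: DRIVERS (#OKLAHOMA, OK)", "Driving"),
]

classifier = TweetNaiveBayes()
for text, label in training:
    classifier.train(text, label)

# Hypothetical unseen tweets.
it_prediction = classifier.predict("Systems Engineer wanted #IT #job")
invalid_prediction = classifier.predict("Blown an interview? How to recover")
finance_prediction = classifier.predict("Financial Controls Specialist opening")
print(it_prediction, invalid_prediction, finance_prediction)
```

Because `train` only increments counters, the model can be updated one tweet at a time, which suits a streaming setting; tweets predicted as `invalid` would be filtered out before storage.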

Page 18

Results  

[Figure: bar chart "Popular Job Categories Tweeted, Expressed as a % of Total Job Vacancy Tweets". X-axis: Job Category Advertised (IT, Healthcare, Finance, Sales & Marketing). Y-axis: % Among Advertised Job Tweets, 0 to 25.]
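The quantity plotted in the chart, each category's share of all job vacancy tweets, is a simple ratio once tweets are labeled. The Python sketch below shows the arithmetic on a made-up list of labels; the labels are hypothetical and do not reproduce the values in the figure. Tweets labeled `invalid` are excluded from the denominator, on the assumption that they are not job vacancy tweets.

```python
from collections import Counter
from typing import Dict, List


def category_shares(labels: List[str]) -> Dict[str, float]:
    """Return each job category's share (in %) of all job vacancy tweets."""
    job_labels = [label for label in labels if label != "invalid"]
    counts = Counter(job_labels)
    total = len(job_labels)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical classifier output for eight tweets.
labels = ["IT", "IT", "Finance", "invalid", "Healthcare", "IT",
          "Sales & Marketing", "invalid"]
shares = category_shares(labels)
print(shares)
```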

Page 19

Conclusion

• By focusing on practical real-world scenarios, the paper demonstrated that StreamDM is intuitive and ready for industry deployments.
• A scalable model for real-time analysis and filtering of job-vacancy-related tweets:
  - Classifies tweets by job category
  - Accesses tweets without having to follow Twitter accounts
  - Used for streaming tweets
  - Caters to growing data size

Page 20

References  

• http://silviu.maniu.info/publications/bifet2015streamdm.pdf
• http://spark.apache.org/docs/latest
• https://databricks.com/spark/about
• http://hortonworks.com/apache/spark/
• http://www.kdnuggets.com/2015/06/introduction-big-data-apache-spark.html

Page 21

Thank  You