“Transform Real Time Data into Real Time Decisions” Asit Parija (@asitparija)

Upload: voquynh

Post on 08-Dec-2016



CUSTOMERS PARTNERSHIPS

OPEN SOURCE


FROM

RDBMS, ETL, EDW

• Only structured data
• $50K – 100K per TB
• Limited analytics

TO

RDBMS, Sensor Data, Web Logs
Hadoop/Spark: ETL + Long Term Storage
EDW: Query + Present

✓ Both structured and unstructured data
✓ 50x-100x cost savings: $1K per TB
✓ Expanded analytics with MapReduce/NoSQL etc.
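The 50x-100x figure follows directly from the per-TB prices quoted on this slide. A minimal check of that arithmetic, with variable names chosen here for illustration only:

```python
# Per-TB prices as quoted on the slide (USD).
rdbms_cost_per_tb = (50_000, 100_000)   # low and high end of the RDBMS/EDW range
hadoop_cost_per_tb = 1_000              # quoted Hadoop/Spark figure

# Savings factor at each end of the range.
savings = tuple(cost // hadoop_cost_per_tb for cost in rdbms_cost_per_tb)
print(f"{savings[0]}x to {savings[1]}x cheaper per TB")
```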


ETL Goals

• Make data processing more powerful
• Make data processing simpler
• Make data processing 100x faster than before
• What are the options?

 


What steered us into Spark

• Powerful in-memory processing
• Simple operators on data
• Debuggable API
• Efficient execution
• Universally distributed


What steered us into Pig

• DSL for ETL
• Rich operator library
• Extensible
• Pluggable
• Powerful ETL


Operator Mapping

Pig                                                        Spark
Load                                                       HadoopRDD
Store                                                      saveAsObjectFile
Filter                                                     MappedRDD + filter func
Group By (local rearrange, global rearrange & package)     Sort + Group by
…                                                          …
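To make the mapping concrete, here is a rough sketch in plain Python (deliberately not the Spark API) of how each Pig operator in the table can be read as a transformation over a collection of tuples. The function names and sample records are hypothetical and are not taken from the Pig on Spark code base.

```python
from itertools import groupby
from operator import itemgetter

def pig_load(lines, delimiter="\t"):
    """LOAD: split raw text lines into tuples (stands in for reading via HadoopRDD)."""
    return [tuple(line.split(delimiter)) for line in lines]

def pig_filter(records, predicate):
    """FILTER: keep only the records for which the predicate holds."""
    return [r for r in records if predicate(r)]

def pig_group_by(records, key_index):
    """GROUP BY: sort on the key, then collect adjacent records (sort + group by)."""
    ordered = sorted(records, key=itemgetter(key_index))
    return [(key, list(group))
            for key, group in groupby(ordered, key=itemgetter(key_index))]

# Made-up sample input: language, page title, hit count.
raw = ["en\tMain_Page\t42", "de\tHauptseite\t7", "en\tSpark\t3"]
records = pig_load(raw)
popular = pig_filter(records, lambda r: int(r[2]) > 5)
grouped = pig_group_by(records, 0)
```

Grouping is done by sorting first because `itertools.groupby` only merges adjacent equal keys, which mirrors the "Sort + Group by" entry in the table.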


Current Flow


Issues

• Scaling
• Performance
• Spark-specific operators (Cache)
• Pig on Spark unit tests
• Some specific joins and the rank operation


Filter Code Implementation

• https://bitbucket.org/SigmoidDev/spork/src/80a3e4626e4504c1829568942e0690abc79d239a/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FilterConverter.java?at=spork-1.0
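The linked FilterConverter is Java and is not reproduced here. As an illustration of the general idea only, a converter of this kind takes the upstream dataset plus a filter operator and returns a filtered dataset. The sketch below models that in plain Python; the class names, the `accepts` method, and the sample rows are all assumptions made for this example, not the Spork API.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple  # a Pig tuple, modeled here as a plain Python tuple

class FilterOperator:
    """Hypothetical stand-in for Pig's physical filter operator."""
    def __init__(self, condition: Callable[[Record], bool]):
        self.condition = condition

    def accepts(self, record: Record) -> bool:
        return bool(self.condition(record))

class FilterConverter:
    """Hypothetical converter: applies a filter operator to the upstream records."""
    def convert(self, predecessor: Iterable[Record],
                op: FilterOperator) -> Iterator[Record]:
        # Lazily keep the records the operator accepts, much as an RDD filter would.
        return (record for record in predecessor if op.accepts(record))

rows = [("alice", 31), ("bob", 17), ("carol", 45)]
adults = list(FilterConverter().convert(rows, FilterOperator(lambda r: r[1] >= 18)))
```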


Contribute

• Pig on Spark Umbrella Jira
• https://issues.apache.org/jira/browse/PIG-4059
• https://github.com/sigmoidanalytics/spork
• Issues


Benchmark

A Distinct operation on a 25-day wikistats dump (270 GB) took 4.25 minutes with Pig on Spark, compared with 30 minutes on MapReduce.
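For scale, the two timings above work out to roughly a 7x speedup. The short calculation below derives that and the implied throughput; it assumes the "270G" on the slide means 270 GB, and the variable names are illustrative.

```python
# Figures quoted on the slide; dataset size assumed to be in gigabytes.
pig_on_spark_minutes = 4.25
mapreduce_minutes = 30.0
dataset_gb = 270.0

speedup = mapreduce_minutes / pig_on_spark_minutes
spark_gb_per_min = dataset_gb / pig_on_spark_minutes
mapreduce_gb_per_min = dataset_gb / mapreduce_minutes

print(f"speedup: {speedup:.2f}x")
print(f"Pig on Spark: {spark_gb_per_min:.1f} GB/min, "
      f"MapReduce: {mapreduce_gb_per_min:.1f} GB/min")
```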


Mixing Streaming & Batch Processing

• Current state: different code for batch and stream
• Lambda Architecture
• One unified language to perform both
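The point about a unified language can be sketched as follows: the business logic is written once and reused by both the batch path and the streaming path, instead of being maintained twice. This is a plain-Python illustration with hypothetical function names and made-up events, not the talk's actual DSL.

```python
from collections import Counter
from typing import Iterable, List

def count_pages(events: Iterable[str]) -> Counter:
    """Shared logic: count events per page. Written once, used by both paths."""
    return Counter(events)

def run_batch(events: List[str]) -> Counter:
    """Batch path: process the whole dataset in one pass."""
    return count_pages(events)

def run_stream(micro_batches: Iterable[List[str]]) -> Counter:
    """Streaming path: apply the same logic per micro-batch and merge the results."""
    totals = Counter()
    for batch in micro_batches:
        totals.update(count_pages(batch))
    return totals

events = ["home", "docs", "home", "blog", "home", "docs"]
batch_result = run_batch(events)
stream_result = run_stream([events[:2], events[2:4], events[4:]])
```

Because both paths call `count_pages`, the batch and streaming results agree on the same input, which is the property that separate batch and stream code bases make hard to guarantee.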


What else is cool

CloudFlux                  SigmaStream
Cloud Deployment           PIG/SQL Like DSL
Fault Tolerance            Rich Stream operators
AutoScaling                Multiple Data source/Sink
Programmatic interface     Add custom Operators
Cloud Agnostic             Apache Spark Based
Apache License             Apache License

 


Thank You

India Office

Gulmohar Enclave Road,

Silver Spring Layout, Munnekollal

Bengaluru, Karnataka 560037

US Office

1343 Kingfisher Way

Sunnyvale, CA, 94087

+1 (760) 203 3257

[email protected]