Configuring and Optimizing Spark Applications With Ease


Page 1: Configuring and Optimizing Spark Applications with Ease

© Cloudera, Inc. All rights reserved.

Configuring and Optimizing Spark Applications With Ease

Nishkam Ravi, Cloudera

Ethan Chan, Stanford / Cloudera

Page 2: Configuring and Optimizing Spark Applications with Ease

Spark | MapReduce

Page 3: Configuring and Optimizing Spark Applications with Ease

val dataRDD = sparkContext.textFile("hdfs://...")
dataRDD.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()

Transformations: flatMap, map, reduceByKey | Action: collect()

RDD  

• Simple high-level API for data transformation

Page 4: Configuring and Optimizing Spark Applications with Ease

val dataRDD = sparkContext.textFile("hdfs://...")
dataRDD.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()

[Diagram: the Driver dispatches Tasks to Executors running on each Worker]

• Simple high-level API for data transformation
• General task graphs
  • filter, distinct, union, sortBy, ...
  • Transformations pipelined where possible

Page 5: Configuring and Optimizing Spark Applications with Ease

• Simple high-level API for data transformation
• General task graphs
  • filter, distinct, union, sortBy, ...
  • Transformations pipelined where possible
• In-memory compute
  • Great performance
  • Orders of magnitude faster than MapReduce
  • Fault tolerance
• One stack to rule them all
  • Procedural, ML, Graph, SQL, Streaming, etc.

val dataRDD = sparkContext.textFile("hdfs://...")
dataRDD.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
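The Scala snippet above compresses the whole job into one pipeline. As a rough illustration of the same flatMap → map → reduceByKey shape outside Spark, here is a plain-Java sketch using streams (the class name and sample input are invented for illustration; a single-JVM stream is of course not a distributed RDD):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // flatMap: split each line into words; map: pair each word with 1;
    // reduceByKey(_ + _): merge the 1s per word via Integer::sum.
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c")));
    }
}
```

The merge function passed to `toMap` plays the role of `reduceByKey`'s combine step.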

Page 6: Configuring and Optimizing Spark Applications with Ease

Cloudera Customer Use Cases

Core Spark:
• Financial Services: Portfolio Risk Analysis, ETL Pipeline Speed-Up, 20+ years of stock data
• Health: Identify disease-causing genes in the full human genome, Calculate Jaccard scores on health care data sets
• ERP: Optical Character Recognition and Bill Classification
• Data Services (1010): Trend analysis, Document classification (LDA), Fraud analytics

Spark Streaming:
• Financial Services: Online Fraud Detection
• Health: Incident Prediction for Sepsis
• Retail: Online Recommendation Systems, Real-Time Inventory Management
• Ad Tech: Real-Time Ad Performance Analysis

Page 7: Configuring and Optimizing Spark Applications with Ease

Problem  

• Too many knobs
• Developers don't always write good code
• Debugging is hard

Cloudera Spark Support Issues

Page 8: Configuring and Optimizing Spark Applications with Ease

Need for tools

• Not enough tools in the Big Data space
• The develop-deploy-debug cycle is complex

Page 9: Configuring and Optimizing Spark Applications with Ease

Auto-configuration | Performance Optimization | Debugging

Preventive Care | Corrective Care

Page 10: Configuring and Optimizing Spark Applications with Ease

Common pitfalls

• Executor (mis)configuration
  • Number, size
• Insufficient parallelism
  • Number of partitions
• YARN memory overhead
• Fetch failures
  • Timeout values, ulimit
• Caching and serialization
• Use of collect() and such
• Use of groupByKey() vs. reduceByKey()
• Use of rdd.forEach() vs. rdd.forEachPartition()
• GC tuning
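The groupByKey()-vs-reduceByKey() pitfall deserves a concrete picture: groupByKey materializes every value for a key before anything is summed, while reduceByKey folds values into one running result per key as it goes (map-side combine in Spark). A minimal sketch of the two shapes with plain Java collections (the class, method names, and data are illustrative, not Spark API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AggregationSketch {
    // groupByKey shape: collect every value for a key into a list first,
    // then sum. In Spark, all those values cross the shuffle.
    static Map<String, Integer> viaGroup(List<SimpleEntry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = pairs.stream()
                .collect(Collectors.groupingBy(SimpleEntry::getKey,
                        Collectors.mapping(SimpleEntry::getValue, Collectors.toList())));
        return grouped.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    // reduceByKey shape: fold each incoming value into a single running
    // sum per key, so only one partial result per key is ever kept.
    static Map<String, Integer> viaReduce(List<SimpleEntry<String, Integer>> pairs) {
        return pairs.stream()
                .collect(Collectors.toMap(SimpleEntry::getKey,
                        SimpleEntry::getValue, Integer::sum));
    }
}
```

Both produce the same answer; the difference is how much intermediate data survives to the final step.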

Page 11: Configuring and Optimizing Spark Applications with Ease

Auto-configuration (Design Options)

Use runtime information
• Run application and monitor runtime info
• Pros: more accurate
• Cons: high internal and external complexity; initial run needs configuration

Use load-time information: input data size, cluster information, application code
• Pros: zero external complexity; out-of-the-box usability and performance; configuration recommendations for performance
• Cons: not as accurate

Configuration language: num-exec = 2 * num-nodes
• Pros: more expressive
• Cons: user needs to specify heuristics; most heuristics are non-trivial

Page 12: Configuring and Optimizing Spark Applications with Ease

Auto-configuration (Design Options)

Use runtime information: run application and monitor runtime info
• Pros: more accurate
• Cons: high internal and external complexity; initial run needs a starter configuration

Use load-time information: input data size, cluster information, application code
• Pros: zero external complexity; out-of-the-box usability and performance; configuration recommendations for performance
• Cons: not as accurate

Configuration language
• num-exec = 2 * num-nodes
• Pros: more expressive
• Cons: user needs to specify heuristics; heuristics are non-trivial

Page 13: Configuring and Optimizing Spark Applications with Ease

Auto-configuration (Design Options)

Use runtime information: run application and monitor runtime info
• Pros: more accurate
• Cons: high internal and external complexity; initial run needs a starter configuration

Use load-time information
• Input data size
• Cluster information
• Application code
• Pros: zero external complexity; out-of-the-box usability and performance
• Cons: may not be as accurate

Configuration language: num-exec = 2 * num-nodes
• Pros: more expressive
• Cons: user needs to specify heuristics; most heuristics are non-trivial

Page 14: Configuring and Optimizing Spark Applications with Ease

Interface

Input:
• Cluster info
• Input data size
• Deploy mode
• Application code path
• ...

SparkAid (Auto-config + Optimizer)

Output:
1. spark-final.conf (configuration settings)
2. spark-conf.advice (configuration recommendations)
3. command line to execute the job
4. optimizedCode.scala (optimized code)
5. spark-code.advice (code recommendations)
6. optimization-report.txt (optimization report)

Page 15: Configuring and Optimizing Spark Applications with Ease

Input: Interactive Command Line

Page 16: Configuring and Optimizing Spark Applications with Ease

Output (1): Configuration File

• ~100 default settings
• 15-20 settings configured with heuristics
  • Executor memory
  • Executor cores
  • Storage level
  • ...

Page 17: Configuring and Optimizing Spark Applications with Ease

Output (2): Recommendations File

Recommendations for Spark and CM configurations:
• spark.yarn.executor.memoryOverhead: increase this if YARN containers fail or run out of memory
• spark.akka.timeout: increase if GC pauses cause problems
• spark.default.parallelism: try doubling this value for potential performance gains

Page 18: Configuring and Optimizing Spark Applications with Ease

Output (3): Command Line

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class pagerank \
  --properties-file spark-final.conf \
  --driver-cores 16 \
  --driver-memory 32g \
  /path/to/spark.jar

Page 19: Configuring and Optimizing Spark Applications with Ease

Design Overview

Auto-config
• Implements a set of heuristics
• Draws on past experience and experimentation
• Written in Java

Optimizer
• Parses source code
• Looks for performance bugs and fixes them
• Modifies source code
• Generates advice
• Written in Python

Debugger
• Analyzes runtime information (logs, stack, etc.)
• Helps visualize
• Generates suggestions (configuration and code changes)
• Not yet implemented

Page 20: Configuring and Optimizing Spark Applications with Ease

Integrate with Spark?

• Optimizer and Debugger are completely external to Spark
• Config defaults in Spark will likely improve over time
• Adding non-essential features to the Spark repo increases complexity and maintenance effort
• spark-packages.org and Cloudera Labs exist for that purpose

Page 21: Configuring and Optimizing Spark Applications with Ease

Auto-config Heuristics

Example #1: spark.executor.memory

int calculatedNumExecutorsPerNode = (int)(effectiveMemoryPerNode / idealExecutorMemory);
double finalExecutorMemory = idealExecutorMemory;
boolean recalculateFlag = false;
if (calculatedNumExecutorsPerNode > upperBound) {
    numExecutorsPerNode = upperBound;
    recalculateFlag = true;
} else if (calculatedNumExecutorsPerNode < lowerBound) {
    numExecutorsPerNode = lowerBound;
    recalculateFlag = true;
} else {
    numExecutorsPerNode = calculatedNumExecutorsPerNode;
    double currMemSizePerNode = idealExecutorMemory * numExecutorsPerNode;
    double leftOverMemPerNode = effectiveMemoryPerNode - currMemSizePerNode;
    if (leftOverMemPerNode > (idealExecutorMemory / 2)) {
        recalculateFlag = true;
    }
}
if (recalculateFlag) {
    finalExecutorMemory = effectiveMemoryPerNode / numExecutorsPerNode;
}
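To make the fragment above concrete, here is a self-contained variant with the free variables turned into parameters; the bounds and memory figures in the example run are hypothetical, chosen only to exercise both branches:

```java
public class ExecutorMemoryHeuristic {
    // Parameterized sketch of the heuristic above. Inputs are in GB;
    // lowerBound/upperBound clamp executors per node. Returns
    // { executorsPerNode, executorMemoryGB }.
    static double[] plan(double effectiveMemoryPerNode, double idealExecutorMemory,
                         int lowerBound, int upperBound) {
        int numExecutorsPerNode = (int) (effectiveMemoryPerNode / idealExecutorMemory);
        boolean recalculate = false;
        if (numExecutorsPerNode > upperBound) {
            numExecutorsPerNode = upperBound;
            recalculate = true;
        } else if (numExecutorsPerNode < lowerBound) {
            numExecutorsPerNode = lowerBound;
            recalculate = true;
        } else if (effectiveMemoryPerNode - idealExecutorMemory * numExecutorsPerNode
                   > idealExecutorMemory / 2) {
            recalculate = true;  // enough memory left over to redistribute
        }
        double finalExecutorMemory =
                recalculate ? effectiveMemoryPerNode / numExecutorsPerNode : idealExecutorMemory;
        return new double[] { numExecutorsPerNode, finalExecutorMemory };
    }

    public static void main(String[] args) {
        // Hypothetical node: 100 GB usable, 16 GB ideal, 1..5 executors per node.
        double[] p = plan(100, 16, 1, 5);
        System.out.printf("%d executors x %.0f GB%n", (int) p[0], p[1]);
    }
}
```

With 100 GB and a 5-executor cap, the clamp triggers and memory is recomputed as 100/5 = 20 GB per executor.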

Page 22: Configuring and Optimizing Spark Applications with Ease

Auto-config Heuristics

Example #2: spark.storage.level

if (inputDeserialized < availableMemory) {
    storageLevel = "MEMORY_ONLY";
} else if (inputUncompressedSerialized < availableMemory) {
    storageLevel = "MEMORY_ONLY_SER";
} else if (inputCompressedSerialized < availableMemory) {
    storageLevel = "MEMORY_ONLY_SER";
    rddCompress = "true";
} else {
    storageLevel = "MEMORY_AND_DISK_SER";
}

Page 23: Configuring and Optimizing Spark Applications with Ease

Auto-config Heuristics

Example #3: spark.default.parallelism

int totalCoresAvailable = (int)(numCoresPerNode * (numNodes - numJobs) * resourceFraction);
int calculatedParallelism = (int)(parallelismFactor * totalCoresAvailable);

Example #4: spark.shuffle.consolidateFiles

if (inputsTable.get("fileSystem").equals("ext4") || inputsTable.get("fileSystem").equals("xfs")) {
    shuffleConsolidateFiles = "true";
}

Page 24: Configuring and Optimizing Spark Applications with Ease

Performance Optimizer

Page 25: Configuring and Optimizing Spark Applications with Ease

Performance Optimizer

• Find performance bugs using poor man's static analysis
• Identify patterns and generate optimized code and recommendations
  • RDD-cache optimization
  • Storage-fraction optimization
  • Parallelism-level optimization
  • GroupByKey-replace optimization

Page 26: Configuring and Optimizing Spark Applications with Ease

Preprocessing
• Classes, functions, nested functions
• Loops, comments, line numbers, etc.

Analysis
• RDD identification
• UD (use-definition) analysis

Optimization
• Four optimizations

Code generation
• Separate phase

Page 27: Configuring and Optimizing Spark Applications with Ease

Optimizations

#1: RDD-cache optimization
• Find RDDs that can be cached
  • Being read in a loop
  • Not being assigned in the loop
• Insert rdd.cache() before the loop
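What such a "poor man's static analysis" check might look like, reduced to a line-scanning sketch (the helper and its regexes are hypothetical illustrations, not SparkAid's actual implementation; RDD names are assumed to be plain identifiers):

```java
import java.util.List;
import java.util.regex.Pattern;

public class CacheAdvisorSketch {
    // Hypothetical check: an RDD is a caching candidate if some line of the
    // loop body reads it but no line (re)assigns it.
    static boolean shouldCache(String rddName, List<String> loopBody) {
        Pattern read = Pattern.compile("\\b" + rddName + "\\b");
        Pattern assign = Pattern.compile("\\b(val|var)?\\s*" + rddName + "\\s*=");
        boolean isRead = loopBody.stream().anyMatch(l -> read.matcher(l).find());
        boolean isAssigned = loopBody.stream().anyMatch(l -> assign.matcher(l).find());
        return isRead && !isAssigned;
    }
}
```

Run against a PageRank-style loop, this would flag the links RDD (read every iteration, never reassigned) while skipping the ranks RDD that the loop reassigns.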

Page 28: Configuring and Optimizing Spark Applications with Ease

Optimizations

#1: RDD-cache optimization
• Find RDDs that can be cached (read in a loop, not assigned in it)
• Insert rdd.cache() before the loop

#2: Storage-fraction optimization
• Look for RDD cache instances
• If no RDD is being cached, set spark.storage.memoryFraction to 0.2

Page 29: Configuring and Optimizing Spark Applications with Ease

Optimizations

#3: Parallelism-level optimization
• Modify RDD instantiations to be consistent with the spark.default.parallelism value
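One way to picture this rewrite is as a textual transformation over the source: give textFile() an explicit partition count when none is present. The class below is a hypothetical stand-in for the optimizer's code-generation step, assuming the target parallelism has already been computed:

```java
import java.util.regex.Pattern;

public class ParallelismRewriteSketch {
    // Hypothetical rewrite: add an explicit partition count to textFile()
    // calls that don't already have one. The pattern only matches a lone
    // string-literal argument, so calls that already pass a count are
    // left untouched.
    static String addPartitions(String line, int parallelism) {
        Pattern p = Pattern.compile("textFile\\((\"[^\"]*\")\\)");
        return p.matcher(line).replaceAll("textFile($1, " + parallelism + ")");
    }
}
```

A real tool would work on the parsed program rather than raw text, but the effect on a line like `sc.textFile("hdfs:///data/input")` is the same.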

Page 30: Configuring and Optimizing Spark Applications with Ease

Optimizations

#3: Parallelism-level optimization
• Modify RDD instantiations to be consistent with the spark.default.parallelism value

#4: GroupByKey-replace optimization
• Recommend reduceByKey() instead of groupByKey()

Page 31: Configuring and Optimizing Spark Applications with Ease

Optimizer Output Files

Page 32: Configuring and Optimizing Spark Applications with Ease

Output (1): Optimized Code

object PageRank {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("PageRank")
    val sc = new SparkContext(conf)
    val key_links_txt = sc.textFile("hdfs://cloudera.com:8020/user/pagerank_input", 3136)
    val key_links = key_links_txt.map(line => line.split("\t"))
      .map(array => (array(0), array(1))).distinct().groupByKey()
    var ranks = key_links.mapValues(v => 1.0)
    key_links.cache()
    for (i <- 0 until 2) {
      val contributions = key_links.join(ranks).flatMap {
        case (pageId, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }
      ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
    }
    ranks.collect()
  }
}

Page 33: Configuring and Optimizing Spark Applications with Ease

Output (2): Code Advice

===================== GroupByKey() Recommendation ========================
Consider using reduceByKey() instead of groupByKey() if possible at Line 12:
key_links_txt.map(line => line.split("\t")).map(array => (array(0), array(1))).distinct().groupByKey()

Output (3): Optimization Report

Page 34: Configuring and Optimizing Spark Applications with Ease

Thank you

Ethan Chan | Performance Team (UIUC 2015, Stanford 2017)
Nishkam Ravi | Mentor
Silvius Rus | Manager

Demo

Page 35: Configuring and Optimizing Spark Applications with Ease

Uniting Spark and Hadoop: The One Platform Initiative

• Management: leverage Hadoop-native RM
• Usability
• Security: full support for Hadoop security and beyond
• Scale: enable 10k-node clusters
• Streaming: support for 80% of common stream processing workloads

Page 36: Configuring and Optimizing Spark Applications with Ease

Exploratory Effort

No plan to release in the immediate future

Page 37: Configuring and Optimizing Spark Applications with Ease

Thanks! Nishkam Ravi, Ethan Chan ([email protected])