Deploying and Administering Spark

Patrick Wendell, Databricks
Outline
- Spark components
- Cluster managers
- Hardware & configuration
- Linking with Spark
- Monitoring and measuring
Spark application
- Driver program: Java program that creates a SparkContext
- Executors: worker processes that execute tasks and store data
Cluster manager
The cluster manager grants executors to a Spark application.
Driver program
The driver program decides when to launch tasks on which executor. It needs full network connectivity to the workers.
Types of Applications
Long-lived / shared applications:
- Shark
- Spark Streaming
- Job Server (Ooyala)
Short-lived applications:
- Standalone apps
- Shell sessions
Long-lived applications may do multi-user scheduling within their allocation from the cluster manager.
Outline
- Spark components
- Cluster managers
- Hardware & configuration
- Linking with Spark
- Monitoring and measuring
Cluster Managers
Several ways to deploy Spark:
1. Standalone mode (on-site)
2. Standalone mode (EC2)
3. YARN
4. Mesos
5. SIMR [not covered in this talk]
Standalone Mode
- Bundled with Spark
- Great for a quick "dedicated" Spark cluster
- H/A mode for long-running applications (0.8.1+)
Standalone Mode
1. (Optional) Describe the amount of resources in conf/spark-env.sh:
   - SPARK_WORKER_CORES
   - SPARK_WORKER_MEMORY
2. List slaves in conf/slaves
3. Copy configuration to slaves
4. Start/stop using ./bin/start-all and ./bin/stop-all
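For step 1, a minimal conf/spark-env.sh sketch (the values here are illustrative examples, not recommendations):

```shell
# conf/spark-env.sh -- resources each worker may offer to Spark (example values)
export SPARK_WORKER_CORES=8      # cores Spark may use on this node
export SPARK_WORKER_MEMORY=16g   # memory Spark may use on this node
```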
Standalone Mode
Some support for inter-application scheduling: set spark.cores.max to limit the number of cores each application can use.
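In the 0.8-era API, per-application properties like this are set as Java system properties before the SparkContext is created. A minimal sketch (the master URL and app name are hypothetical):

```scala
// Sketch: cap this application at 10 cores across the standalone cluster.
// Spark <= 0.9 reads configuration from Java system properties, so the
// property must be set before the SparkContext is constructed.
System.setProperty("spark.cores.max", "10")
// val sc = new SparkContext("spark://master:7077", "MyApp")  // would honor the cap
```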
EC2 Deployment
- Launcher bundled with Spark
- Create a cluster in 5 minutes
- Sizes cluster for any EC2 instance type and number of nodes
- Used widely by the Spark team for internal testing
EC2 Deployment

./spark-ec2 -t [instance-type] -k [key-name] -i [path-to-key-file] \
  -s [num-slaves] -r [ec2-region] --spot-price=[spot-price] launch [cluster-name]
EC2 Deployment
Creates:
- Spark standalone cluster at <ec2-master>:8080
- HDFS cluster at <ec2-master>:50070
- MapReduce cluster at <ec2-master>:50030
Apache Mesos
- General-purpose cluster manager that can run Spark, Hadoop MR, MPI, etc.
- Simply pass mesos://<master-url> to SparkContext
- Optional: set spark.executor.uri to a pre-built Spark package in HDFS, created by make-distribution.sh
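A sketch of the optional executor-URI setting (the HDFS path and master URL are hypothetical; the tarball would be the one produced by make-distribution.sh):

```scala
// Sketch: tell Mesos executors where to fetch a pre-built Spark package.
// Set before constructing the SparkContext (0.8-era system-property config).
System.setProperty("spark.executor.uri", "hdfs://namenode:9000/spark-0.8.0.tar.gz")
// val sc = new SparkContext("mesos://master:5050", "MyApp")
```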
Mesos Run Modes
Fine-grained (default):
- Apps get static memory allocations, but share CPU dynamically on each node
Coarse-grained:
- Apps get static CPU and memory allocations
- Better predictability and latency, possibly at the cost of utilization
Hadoop YARN
In Spark 0.8.0:
- Runs standalone apps only, launching the driver inside the YARN cluster
- Supports YARN 0.23 to 2.0.x
Coming in 0.8.1:
- Interactive shell
- YARN 2.2.x support
- Support for hosting the Spark JAR in HDFS
YARN Steps
1. Build the Spark assembly JAR
2. Package your app into a JAR
3. Use the yarn.Client class:

SPARK_JAR=<SPARK_ASSEMBLY_JAR> ./spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR> --class <MAIN_CLASS> \
  --args <MAIN_ARGUMENTS> \
  --num-workers <N> \
  --master-memory <MASTER_MEM> \
  --worker-memory <WORKER_MEM> \
  --worker-cores <CORES_PER_WORKER>
More Info
Detailed docs about each of standalone mode, Mesos, YARN, and EC2:
http://spark.incubator.apache.org/docs/latest/cluster-overview.html
Outline
- Cluster components
- Deployment options
- Hardware & configuration
- Linking with Spark
- Monitoring and measuring
Where to run Spark?
If using HDFS, run on the same nodes or within the LAN:
1. Have dedicated (usually "beefy") nodes for Spark
2. Colocate Spark and MapReduce on shared nodes
Local Disks
- Spark uses disk for writing shuffle data and paging out RDDs
- Ideally have several disks per node in a JBOD configuration
- Set spark.local.dir with comma-separated disk locations
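One way to set spark.local.dir in this era is through SPARK_JAVA_OPTS in conf/spark-env.sh; a sketch with hypothetical mount points:

```shell
# conf/spark-env.sh -- spread shuffle files across two JBOD disks (example paths)
export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/disk1/spark,/mnt/disk2/spark"
```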
Memory
- Recommend an 8GB heap and up; generally, more is better
- For massive (>200GB) heaps, you may want to increase the number of executors per node (see SPARK_WORKER_INSTANCES)
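A conf/spark-env.sh sketch of splitting a very large node into multiple workers (the sizes are illustrative examples):

```shell
# conf/spark-env.sh -- run two worker processes per node instead of one huge heap
export SPARK_WORKER_INSTANCES=2   # worker processes per node
export SPARK_WORKER_MEMORY=100g   # memory per worker process
```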
Network/CPU
- For in-memory workloads, network and CPU are often the bottleneck
- Ideally use 10Gb Ethernet
- Works well on machines with multiple cores (since tasks run in parallel)
Environment-related configs
- spark.executor.memory: how much memory you will ask for from the cluster manager
- spark.local.dir: where Spark stores shuffle files
Outline
- Cluster components
- Deployment options
- Hardware & configuration
- Linking with Spark
- Monitoring and measuring
Typical Spark Application

sc = new SparkContext(<cluster-manager>…)
sc.addJar("/uber-app-jar.jar")
sc.textFile(XX) …reduceBy …saveAs

Created using Maven or sbt assembly
Linking with Spark
- Add an Ivy/Maven dependency in your project on the spark-core artifact
- If using HDFS, add a dependency on hadoop-client for your version, e.g. 1.2.0, 2.0.0-cdh4.3.1
- For YARN, also add spark-yarn
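As an sbt sketch (the coordinates below are era-appropriate examples, not prescriptions; CDH artifacts additionally need Cloudera's repository as a resolver):

```scala
// build.sbt fragment -- pick versions matching your cluster
libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.3.1"
// for YARN deployments:
// libraryDependencies += "org.apache.spark" % "spark-yarn_2.9.3" % "0.8.0-incubating"
```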
Hadoop Versions

| Distribution | Release | Maven Version Code |
| --- | --- | --- |
| CDH | 4.X.X | 2.0.0-mr1-cdh4.X.X |
| CDH | 4.X.X (YARN mode) | 2.0.0-cdh4.X.X |
| CDH | 3uX | 0.20.2-cdh3uX |
| HDP | 1.3 | 1.2.0 |
| HDP | 1.2 | 1.1.2 |
| HDP | 1.1 | 1.0.3 |

See Spark docs for details: http://spark.incubator.apache.org/docs/latest/hadoop-third-party-distributions.html
Outline
- Cluster components
- Deployment options
- Hardware & configuration
- Linking with Spark
- Monitoring and measuring
Monitoring
- Cluster manager UI
- Executor logs
- Spark driver logs
- Application web UI
- Spark metrics
Cluster Manager UI
- Standalone mode: <master>:8080
- Mesos and YARN have their own UIs
Executor Logs
- Stored by the cluster manager on each worker
- Default location in standalone mode: /path/to/spark/work
Spark Driver Logs
- Spark initializes log4j when the SparkContext is created
- Include a log4j.properties file on the classpath
- See the example in conf/log4j.properties.template
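A minimal log4j.properties sketch along the lines of the bundled template, logging INFO and above to the console:

```properties
# Sketch based on conf/log4j.properties.template
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```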
Application Web UI
http://spark-application-host:4040
(or use spark.ui.port to configure the port)
Shows executor / task / stage / memory status, etc.
Additional App UI pages (shown as screenshots in the original slides): Executors Page, Environment Page, Stage Information, Task Breakdown.
App UI Features
- Stages show where each operation originated in code
- All tables sortable by task length, locations, etc.
Metrics
- Configurable metrics based on Coda Hale's Metrics library
- Many Spark components can report metrics (driver, executor, application)
- Outputs: REST, CSV, Ganglia, JMX, JSON servlet
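A conf/metrics.properties sketch along the lines of the bundled template, using the CSV sink (the output directory is hypothetical):

```properties
# Sketch based on conf/metrics.properties.template: report every instance's
# metrics to CSV files every 10 seconds
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.directory=/tmp/spark-metrics/
```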
More details: http://spark.incubator.apache.org/docs/latest/monitoring.html
More Information
- Official docs: http://spark.incubator.apache.org/docs/latest
- Look for the Apache Spark parcel in CDH