BUILDING A REST JOB SERVER FOR INTERACTIVE SPARK AS A SERVICE
Romain Rigaux - Cloudera
Erick Tryzelaar - Cloudera
WHY?
• NOTEBOOKS
• EASY ACCESS FROM ANYWHERE
• SHARE SPARK CONTEXTS AND RDDs
• BUILD APPS
• SPARK MAGIC
• …
WHY SPARK AS A SERVICE?
MARRIED WITH THE FULL HADOOP ECOSYSTEM
WHY SPARK IN HUE?
HISTORY V1: OOZIE
THE GOOD
• It works
• Code snippet
• Submit through Oozie
• Shell action
THE BAD
• Very slow
• Batch only
[diagram: workflow.xml + snippet.py submitted through Oozie, output read back from stdout]
HISTORY V2: SPARK IGNITER
THE GOOD
• It works better
THE BAD
• Compile a jar
• Batch only, no shell
• No Python, R
• Security
• Single point of failure
[diagram: Implement (Scala) → Compile (jar) → Upload → Batch run → JSON output; based on the Ooyala Spark Job Server]
HISTORY V3: NOTEBOOK
THE GOOD
• Like spark-submit / Spark shells
• Scala / Python / R shells
• Jar / Python batch jobs
• Notebook UI
• YARN
THE BAD
• Beta?
[diagram: code snippets and batches submitted through Livy]
GENERAL ARCHITECTURE
[diagram: clients call the Livy REST API; Livy runs Spark applications on YARN]
LIVY SPARK SERVER
• REST web server in Scala for Spark submissions
• Interactive shell sessions or batch jobs
• Backends: Scala, Java, Python, R
• No dependency on Hue
• Open source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/
ARCHITECTURE
• Standard web service: a wrapper around spark-submit / the Spark shells
• YARN mode: Spark drivers run inside the cluster (resilient to crashes)
• No need to inherit any interface or compile code
• Extended to work with additional backends
LIVY WEB SERVER ARCHITECTURE
LOCAL "DEV" MODE vs. YARN MODE
LOCAL MODE
[diagram: the Livy Server (Scalatra → Session Manager → Session) spawns a local Spark Client hosting the Spark Context and the Spark Interpreter; the animation (steps 1-5) follows a client request through the Livy Server down to the Spark Interpreter and back]
YARN-CLUSTER MODE
PRODUCTION, SCALABLE
[diagram: the Livy Server (Scalatra → Session Manager → Session) talks to the YARN Master; the Spark Client, Spark Context and Spark Interpreter run on a YARN node, with Spark Workers on further nodes; the animation (steps 1-7) follows a request from the client through the Livy Server and YARN to the Spark Interpreter and back]
SESSION CREATION AND EXECUTION
% curl -X POST localhost:8998/sessions -d '{"kind": "spark"}'
{"id": 0, "kind": "spark", "log": [...], "state": "idle"}

% curl -X POST localhost:8998/sessions/0/statements -d '{"code": "1+1"}'
{"id": 0, "output": {"data": {"text/plain": "res0: Int = 2"}, "execution_count": 0, "status": "ok"}, "state": "available"}
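The same two calls, sketched as a small Python client. This is an illustrative sketch, not part of Livy: the `post` and `plain_text_result` helpers are hypothetical names, and a Livy server is assumed to be listening on localhost:8998.

```python
import json
from urllib import request

LIVY_URL = "http://localhost:8998"  # assumed local Livy server

def post(path, payload):
    """POST a JSON payload to the Livy REST API and decode the JSON reply."""
    req = request.Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def plain_text_result(statement):
    """Pull the text/plain result out of a statement response."""
    return statement["output"]["data"]["text/plain"]

# session = post("/sessions", {"kind": "spark"})
# result = post("/sessions/%d/statements" % session["id"], {"code": "1+1"})
# plain_text_result(result)
```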
BATCH OR INTERACTIVE
[diagram: the /batches endpoint takes Jar and Py files; the /sessions endpoint takes Scala, Python and R snippets; Livy runs both as Spark applications on YARN]
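The two endpoints take different payloads. A sketch of each, assuming the field names of Livy's REST API (`file`, `className`, `args`, `kind`); the paths and values are only examples:

```python
def batch_payload(file_path, class_name=None, args=()):
    """POST /batches: run a compiled jar or a .py file as a batch job."""
    payload = {"file": file_path, "args": list(args)}
    if class_name:
        payload["className"] = class_name  # entry point, for jar batches
    return payload

def session_payload(kind):
    """POST /sessions: open an interactive shell ("spark", "pyspark", ...)."""
    return {"kind": kind}
```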
SHELL OR BATCH?
[diagram: the same YARN layout in both cases; an interactive session wraps a running shell (e.g. pyspark) next to the Spark Context, while a batch wraps a spark-submit run]
LIVY INTERPRETERS
Scala, Python, R…
REMEMBER?
[diagram: the YARN-cluster layout again, with the Spark Interpreter highlighted next to the Spark Context on the YARN node]
INTERPRETERS
• Pipe stdin/stdout to a running shell
• Execute the code / send it to the Spark workers
• Perform magic operations
• One interpreter per language
• "Swappable" with other kernels (python, spark…)
[example: > println(1+1) prints 2]
INTERPRETER FLOW
[diagram, step by step: the Livy Server receives {"code": "1+1"}, pipes 1+1 into the interpreter's running shell (applying any magics on the way), the shell prints 2, and the interpreter answers with {"data": {"application/json": "2"}}]
INTERPRETER FLOW CHART
Receive lines → split into chunks
Chunks left? — No → send output to server
Magic chunk? — Yes → Magic! / No → execute chunk
Success → next chunk; failure → send error to server
Example of parsing:
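The flow chart above can be sketched as a loop. This is a toy model: Python's `eval` stands in for the running shell, and the only magic handled is a hypothetical `%json`.

```python
import json

def run_magic(chunk, scope):
    # Magic chunk: e.g. "%json expr" serializes the value as JSON.
    name, _, expr = chunk.partition(" ")
    if name == "%json":
        return {"application/json": json.dumps(eval(expr, scope))}
    raise ValueError("unknown magic: " + name)

def execute(code, scope):
    """Receive lines, split into chunks, run each as magic or plain code."""
    outputs, errors = [], []
    for chunk in code.splitlines():        # split into chunks
        if not chunk.strip():
            continue
        try:
            if chunk.startswith("%"):      # magic chunk?
                outputs.append(run_magic(chunk, scope))
            else:                          # execute chunk in the "shell"
                outputs.append({"text/plain": repr(eval(chunk, scope))})
        except Exception as exc:           # send error to server
            errors.append(str(exc))
            break
    return outputs, errors                 # send output to server
```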
INTERPRETER MAGIC
• table
• json
• plotting
• …
NO MAGIC
> 1+1
Interpreter: sparkIMain.interpret("1+1")
{"id": 0, "output": {"application/json": 2}}
JSON MAGIC
val lines = sc.textFile("shakespeare.txt")
val counts = lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .map { case (w, c) => Map("word" -> w, "count" -> c) }

Plain output:
[('', 506610), ('the', 23407), ('I', 19540) ...]

%json counts
Interpreter: sparkIMain.valueOfTerm("counts").toJson()
{"id": 0, "output": {"application/json": [{"count": 506610, "word": ""}, {"count": 23407, "word": "the"}, {"count": 19540, "word": "I"}, ...] ...}
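What the %json magic produces can be reproduced locally. A sketch that builds the same {"application/json": ...} shape from a word count (the function name is illustrative, not Livy's):

```python
from collections import Counter

def json_magic_output(text):
    """Count words and render them in the %json output shape."""
    counts = Counter(text.split(" "))                       # word -> count
    ranked = sorted(counts.items(), key=lambda wc: -wc[1])  # sortBy(-count)
    return {"application/json":
            [{"word": w, "count": c} for w, c in ranked]}
```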
TABLE MAGIC
(same counts RDD as before)

%table counts
Interpreter: sparkIMain.valueOfTerm("counts").guessHeaders().toList()
"application/vnd.livy.table.v1+json": {
  "headers": [{"name": "count", "type": "BIGINT_TYPE"},
              {"name": "name", "type": "STRING_TYPE"}],
  "data": [[23407, "the"], [19540, "I"], [18358, "and"], ...]
}
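A sketch of what guessHeaders().toList() might do: infer column names and types from a list of records and emit the table media type. The type names follow the slide; the inference rule (int → BIGINT_TYPE, everything else → STRING_TYPE) is an assumption.

```python
def to_table(records):
    """Guess headers from the first record, then flatten rows into columns."""
    headers = []
    first = records[0]
    for name in sorted(first):  # stable, alphabetical column order
        kind = "BIGINT_TYPE" if isinstance(first[name], int) else "STRING_TYPE"
        headers.append({"name": name, "type": kind})
    data = [[r[h["name"]] for h in headers] for r in records]
    return {"application/vnd.livy.table.v1+json":
            {"headers": headers, "data": data}}
```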
PLOT MAGIC
...barplot(sorted_data$count, names.arg=sorted_data$value,
  main="Resource hits", las=2,
  col=colfunc(nrow(sorted_data)), ylim=c(0, 300))

Interpreter:
> png('/tmp/plot.png')
> barplot(...)
> dev.off()
sparkIMain.interpret("png('/tmp/plot.png') … barplot … dev.off()")
File('/tmp/plot.png').read().toBase64()
{"data": {"image/png": "iVBORw0KGgoAAAANSUhEUgAAAe…"} ...}
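The last step of the plot magic, sketched: read the rendered image bytes and wrap them in an {"image/png": base64} output (the helper name is illustrative; in Livy this happens on the interpreter side).

```python
import base64

def image_output(png_bytes):
    """Base64-encode raw PNG bytes into the image/png output shape."""
    return {"data": {"image/png": base64.b64encode(png_bytes).decode("ascii")}}

# png_bytes would come from reading the file the shell wrote, e.g.:
# with open("/tmp/plot.png", "rb") as f:
#     image_output(f.read())
```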
PLUGGABLE INTERPRETERS
• Pluggable backends
• Livy's Spark backends
  - Scala
  - pyspark
  - R
• IPython/Jupyter support coming soon
JUPYTER BACKEND
• Re-using it
• Generic framework for interpreters
• 51 kernels
SPARK AS A SERVICE
REMEMBER AGAIN?
[diagram: the YARN-cluster layout once more: Livy Server → YARN Master → Spark Client, Spark Context and Spark Interpreter on a YARN node, Spark Workers on further nodes]
MULTI USERS
[diagram: one Livy Server, several sessions; each Spark Client gets its own Spark Context and Spark Interpreter on a YARN node]
SHARED CONTEXTS?
[diagram: several Spark Clients attached to a single session, sharing one Spark Context and Spark Interpreter on a YARN node]
SHARED RDD?
[diagram: the shared Spark Context now holds an RDD that all attached Spark Clients can reach]
SHARED RDDS?
[diagram: the shared Spark Context holds several RDDs, all reachable by every attached Spark Client]
SECURE IT?
[diagram: the same shared context and RDDs, with the Livy Server as the single entry point in front of Spark]
SPARK AS SERVICE
[diagram: clients reach Spark only through the Livy Server]
SHARING RDDS
[diagram: a PySpark shell hosts the RDD; other shells and Python shells reach it through Livy]

In the hosting PySpark shell:
r = sc.parallelize([])
srdd = ShareableRdd(r)

The RDD holds entries like {'ak': 'Alaska'} and {'ca': 'California'}.

From any client, through the REST API:
curl -X POST /sessions/0/statements -d '{"code": "srdd.get(\'ak\')"}'

Or with the Python wrapper:
states = SharedRdd('host/sessions/0', 'srdd')
states.get('ak')
DEMO TIME
https://github.com/romainr/hadoop-tutorials-examples/tree/master/notebook/shared_rdd
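A toy model of the demo's shared RDD. ShareableRdd and SharedRdd are the names from the slides; here a plain dict stands in for the real RDD, and the REST round-trip (POSTing srdd.get(key) as a statement) is collapsed into a direct call.

```python
class ShareableRdd:
    """Server side: wraps key/value records and answers get() calls."""
    def __init__(self, records):
        self.records = dict(records)  # in Livy this would be a real RDD

    def get(self, key):
        return {key: self.records[key]}

class SharedRdd:
    """Client side: in the demo this POSTs a statement to
    /sessions/<id>/statements; here it calls the server object directly."""
    def __init__(self, remote):
        self.remote = remote

    def get(self, key):
        return self.remote.get(key)
```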
SECURITY
• SSL support
• Persistent sessions
• Kerberos
SPARK MAGIC
• From Microsoft
• Python magics for working with remote Spark clusters
• Open source: https://github.com/jupyter-incubator/sparkmagic
FUTURE
• Move to an external repo?
• Security
• iPython/Jupyter backends and file format
• Shared named RDDs / contexts?
• Share data
• Spark-specific, language-generic, or both?
• Leverage Hue 4
https://issues.cloudera.org/browse/HUE-2990
LIVY'S CHEAT SHEET
• Open source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/
• Scala, Java, Python, R
• Type introspection for visualization
• YARN-cluster or local modes
• Code snippets / compiled jobs
• REST API
• Pluggable backends
• Magic keywords
• Failure resilient
• Security
BEDANKT! (THANK YOU!)
@gethue
USER GROUP: hue-user@
WEBSITE: http://gethue.com
LEARN: http://learn.gethue.com