Spark - pic.huodongjia.com

Spark, Cloud and Information Services Lab (CISL)


Post on 16-Oct-2020


Page 1:

Spark

(CISL)

Page 2:

Spark

• UC Berkeley, AMPLab, Matei Zaharia
• 2010: open-sourced under the BSD license
• 2014: Apache top-level project
• 863

Page 3:

Spark

• Resilient Distributed Datasets (RDDs)

• RDDs
• RDDs: map, filter, join, reduce, groupBy, …
• RDD lineage
• RDDs

Page 4:

Spark

val logs = spark.textFile("hdfs://…")

// Split each line into fields, keep the ERROR rows, and project the message column
val errMsgs = logs.map(_.split(","))
  .filter(_(0) == "ERROR")
  .map(_(1))

errMsgs.cache()

errMsgs.filter(_.contains("foo")).count()

// header: LEVEL,MSG
// INFO,msg1
// ERROR,msg2
// …
// …

[Figure: the Driver sends tasks (RDD transforms) to two Executors, which hold the partitions logs1/msgs1 and logs2/msgs2 and return results to the Driver.]

Page 5:

Spark

• Spark SQL
• Spark Streaming
• MLlib
• GraphX

Page 6:

Spark SQL

• DataFrames

val errMsgs = sqlCtx.read.format("csv")
  .option("header", "true") // the first line names the columns LEVEL and MSG
  .load("hdfs://…")
  .where("LEVEL = 'ERROR'")
  .select("MSG")

errMsgs.cache()

errMsgs.where("MSG like '%foo%'").count()

// header: LEVEL,MSG
// INFO,msg1
// ERROR,msg2
// …
// …

Page 7:

Spark

• Cluster managers: Hadoop YARN, Mesos, standalone
• Storage: HDFS, Cassandra, Azure, S3

Page 8:

Spark

• Azure Data Lake
• Spark
• HDInsight Spark

[Figure: Azure Data Lake. Labels: Analytics (analytics service, Clusters (HDInsight)), YARN, WebHDFS, Store (unstructured, semi-structured, structured), C#.]

Page 9:

Spark

• Azure Data Lake
• Spark
• HDInsight Spark

• Spark
• .NET

Page 10:

Bing

• "FastSML"
• TB

• Operational Intelligence
• …

Page 11:

Bing -- FastSML

[Figure: FastSML event merge pipeline. Labels: Raw Events (Click, UI-Layout), Kafka, Databus (stages 1 to 4), Event Merge Pipeline combining click (C) and UI (U) events, 10-minute app-time window, Merged Event, Kafka.]

Page 12:

FastSML + Spark?

Apache Storm (SCP.Net) + Kafka + Microsoft's

• Spark?
• FastSML?

• C#

Page 13:

Spark + .NET

.NET
• C#, Spark
• .NET
• Spark, .NET

Spark + .NET = Mobius!!!

Page 14:

Mobius

• August 2015
• CISL, ASG (Bing), 5
• November 2015
• MIT license

• V1.5.2, V1.6.0
• April: V1.6.1

Page 15:

Mobius@github

• Apache Spark Wiki, http://github.com/Microsoft/Mobius
• 758, 2
• 131, 4K

Page 16:

Mobius

C#, Spark

Page 17:

Word Count

Scala

val textFile = spark.textFile("hdfs://…")
// Split lines into words, pair each word with 1, and sum the counts per word
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://…")

C#

var textFile = sparkContext.TextFile(@"hdfs://…");
// Same pipeline, plus a final Map that formats each pair as "word,count"
var counts = textFile.FlatMap(line => line.Split(' '))
  .Map(word => new KeyValuePair<string, int>(word, 1))
  .ReduceByKey((x, y) => x + y)
  .Map(wc => string.Format("{0},{1}", wc.Key, wc.Value));

counts.SaveAsTextFile(@"hdfs://…");

Page 18:

Mobius

• Spark
• JVM–CLR (.NET VM)

• Spark: JVM
• C#: CLR

• PySpark, SparkR

Page 19:

[Figure: Mobius architecture. Driver: a C# Driver (CLR) paired with the Java Driver (JVM) over IPC sockets. Workers: three C# Workers (CLR), each paired with a Spark Executor (JVM) over IPC sockets. Each pairing is labeled with Method and Result.]

Page 20:

Driver-side Interop - DataFrame

[Figure: call sequence between the JVM (Java/Scala components) and the C# components. CSharpRunner is called by sparkclr-submit.cmd.]

1. CSharpBackend: launches Netty server creating proxy for JVM calls
2. Launches C# sub-process: Driver (user code)
3. SqlContext: init
4. Invokes JVM method to create context
5. SqlContext (Spark): create
6. createDF
7. Invokes JVM method to create DF
8. DataFrame (Spark): use jsc and create DF in JVM
9. DataFrame: C# DF has reference to DF in JVM
12. Invokes method on DF

SqlContext has reference to SC in JVM
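
The handle pattern in this sequence, where the C# object keeps only a reference to its JVM counterpart and forwards every call, can be illustrated with a small single-process sketch in Scala. It is not Mobius code: `BackendSketch`, `DataFrameHandle`, `DriverSketch`, the string ids, and the in-memory `Vector[String]` standing in for a DataFrame are all hypothetical, and no Netty server or socket is involved.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the JVM side: it owns the real objects and
// hands out opaque ids. None of these names are Mobius or Spark APIs.
object BackendSketch {
  private val objects = mutable.Map.empty[String, Vector[String]]
  private var nextId = 0

  private def register(rows: Vector[String]): String = {
    nextId += 1
    val id = s"obj-$nextId"
    objects(id) = rows
    id
  }

  // Single entry point, shaped like a proxied call: target id, method name, arguments.
  def invoke(target: Option[String], method: String, args: Seq[String]): String =
    (target, method) match {
      case (None, "createDF")   => register(args.toVector)
      case (Some(id), "filter") => register(objects(id).filter(_.contains(args.head)))
      case (Some(id), "count")  => objects(id).size.toString
      case _ => throw new IllegalArgumentException(s"unsupported call: $method")
    }
}

// Driver-side handle: it stores only the id and forwards every operation.
final class DataFrameHandle(val id: String) {
  def filter(substring: String): DataFrameHandle =
    new DataFrameHandle(BackendSketch.invoke(Some(id), "filter", Seq(substring)))

  def count(): Long =
    BackendSketch.invoke(Some(id), "count", Seq.empty).toLong
}

object DriverSketch {
  def createDF(rows: Seq[String]): DataFrameHandle =
    new DataFrameHandle(BackendSketch.invoke(None, "createDF", rows))
}
```

In this sketch a call such as `DriverSketch.createDF(rows).filter("ERROR").count()` never moves the rows to the driver side; only ids and the final count cross the `invoke` boundary.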

Page 21:

Executor-side Interop - RDD

[Figure: Spark Executor (Scala component) and C# Worker (C# component).]

• Spark calls Compute()
• Launch executable as sub-process
• Serialize data & user-implemented C# lambda and send through socket
• Serialize processed data and send through socket
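
Sending serialized records through a socket requires some way to mark record boundaries. The sketch below shows one common approach, length-prefixed framing, in Scala. It is an assumption for illustration only and not Mobius's actual wire format: `FramingSketch`, the `-1` end marker, and the UTF-8 string records are hypothetical, and an in-memory buffer stands in for the socket.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical framing: each record is a 4-byte length followed by that many
// bytes; a length of -1 marks the end of the stream.
object FramingSketch {
  val EndOfStream = -1

  def writeRecords(records: Seq[String], out: DataOutputStream): Unit = {
    records.foreach { record =>
      val bytes = record.getBytes(UTF_8)
      out.writeInt(bytes.length)
      out.write(bytes)
    }
    out.writeInt(EndOfStream)
    out.flush()
  }

  def readRecords(in: DataInputStream): List[String] = {
    val records = List.newBuilder[String]
    var length = in.readInt()
    while (length != EndOfStream) {
      val bytes = new Array[Byte](length)
      in.readFully(bytes) // blocks until the whole record has arrived
      records += new String(bytes, UTF_8)
      length = in.readInt()
    }
    records.result()
  }

  // Round trip through an in-memory buffer that stands in for the socket.
  def roundTrip(records: Seq[String]): List[String] = {
    val sink = new ByteArrayOutputStream()
    writeRecords(records, new DataOutputStream(sink))
    readRecords(new DataInputStream(new ByteArrayInputStream(sink.toByteArray)))
  }
}
```

With a real socket, the same two methods would wrap the socket's output and input streams; the framing logic would not change.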

Page 22:

• Driver
• SparkR
• Netty server, JVM

• Worker
• PySpark

Page 23:

Mobius

• Spark Executor, CSharpWorker

• DataFrame, CSharpWorker
• Spark Core codegen

• C# / Java

[Figure: inside an Executor, Transformation #1 and Transformation #2 each pass data between the Java Executor and a C# Worker, with serialization/deserialization (SER/DE) at each hop.]

Page 24:

Mobius

• Mobius

• Mobius

Page 25:

Mobius

• Spark, C#

Page 26:

Mobius

• Roslyn
• .NET

Page 27:

Mobius

• Roslyn
• Roslyn C# Interpreter

[Figure: labels: Assembly Service, Mobius SparkCtx, Spark Cluster (Master, three Workers).]

Page 28:

Mobius as a service

• Spark
• Pay-by-Job

[Figure: labels: clients (Jupyter, Zeppelin, Shell, …), REST message protocol, Hosted Service (Mobius Service, Mobius Shell), Spark Cluster (Master, three Workers).]

Page 29:

Mobius

• API

[Figure: labels: Routing Layer, two Mobius Service Endpoints, each with an Interpreter API, Mobius/Spark Shell, Hive Shell.]

Page 30:

Mobius

• Elasticity and Fault Tolerance
• Mobius Service endpoints

[Figure: labels: Routing Layer, Mobius Service Endpoints (…), each with an Interpreter API, Persistent Store.]

Page 31:

• Mobius
• Mobius, Spark
• @github.com/Microsoft/Mobius

• Any contribution is welcome!
• Jupyter/Zeppelin/…
• Puppet/Chef/…
• …

Page 32:

(CISL)

• Mobius
• Yarn++
• Apache REEF
• Rayon
• Tiered Storage
• …

Page 33:

• Mobius
• Mobius, Spark
• @github.com/Microsoft/Mobius

• Any contribution is welcome!
• Jupyter/Zeppelin/…
• Puppet/Chef/…
• …

• CISL
• [email protected]

Page 34:
Page 35:

Interactive Log Analysis

• Massive Cosmos log analysis: several PBs per day
• Rapid iterative drill-downs for DRIs to diagnose issues

[Figure: incident workflow from Alerting to Triage. Labels: Customer (or other DRIs); AutoPilot Watchdog Alerts; DRI (Designated Responsible Individual); Perf Counters (examine 1 hour, 2 hours … 14 days); Architects/Developers/Other DRIs; RDP to Cosmos machines (open individual connections for troubleshooting); Vendors/Secondary DRIs (eavesdrop and document incident); Scope Studio.]

Spark?

Page 36:

C# API for Spark

[Figure: Apache Spark exposes the Scala/Java API, SparkR, PySpark, and the C# API; Spark apps in C# sit on top of the C# API.]

Page 37:

Mobius Service Differentiators

• Better fault tolerance: session persistence and replay

[Figure: labels: Routing Layer, Mobius Service Endpoints (…), Persistent Store.]

Page 38:

Mobius Service Differentiators

• Elasticity and Fault Tolerance
• Auto-scale # of Mobius Service endpoints

Page 39:

Driver-side Interop - DataFrame

[Figure: call sequence between the JVM (Java/Scala components) and the C# components. CSharpRunner is called by sparkclr-submit.cmd.]

1. CSharpBackend: launches Netty server creating proxy for JVM calls
2. Launches C# sub-process: Driver (user code)
3. SqlContext: init
4. Invokes JVM method to create context
5. SqlContext (Spark): create
6. createDF
7. Invokes JVM method to create DF
8. DataFrame (Spark): use jsc and create DF in JVM
9. DataFrame: C# DF has reference to DF in JVM
10. Operation on DataFrame
11. Invokes JVM method
12. Invokes method on DF

SqlContext has reference to SC in JVM

Page 40:

Executor-side Interop - RDD

[Figure: CSharpRDD (Scala component) and C# Worker (C# component).]

• Spark calls Compute() on CSharpRDD
• Launch executable as sub-process
• Serialize data & user-implemented C# lambda and send through socket
• Serialize processed data and send through socket

CSharpRDD is used only when customized C# code is used in a transformation.

Page 41:

Mobius Performance Considerations

1. One CSharpWorker process for each JVM executor process
• PySpark forks a Python process for each task thread in JVM executor process
• New C# option of one thread for each task thread in JVM executor process

2. C# operations are pipelined when possible
• Map & Filter RDD operations in C# need data to be passed from JVM to C#, incurring the cost of serialization and deserialization
• C# operations are pipelined when possible to minimize data passing

3. DataFrame operations without C# UDFs do not require CSharpWorker
• Same execution plan optimization and code generation in Spark Core
• Perform the same as Scala applications
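
The effect of pipelining in point 2 can be modeled with plain function composition. The Scala sketch below is a toy model rather than Mobius's implementation: `PipelineSketch`, the `Stage` type, and the "crossings" counter are hypothetical, with each crossing standing for one JVM-to-C# round trip and its serialization and deserialization cost. It assumes a non-empty list of stages.

```scala
// Hypothetical model: a stage is a per-partition function, and each
// "crossing" stands for one JVM -> C# -> JVM round trip.
object PipelineSketch {
  type Stage = Iterator[Int] => Iterator[Int]

  // Unpipelined: every transformation is shipped across on its own.
  def runSeparately(data: Seq[Int], stages: Seq[Stage]): (Seq[Int], Int) = {
    var crossings = 0
    val out = stages.foldLeft(data) { (acc, stage) =>
      crossings += 1
      stage(acc.iterator).toList
    }
    (out, crossings)
  }

  // Pipelined: compose the stages into one function, then cross once.
  def runPipelined(data: Seq[Int], stages: Seq[Stage]): (Seq[Int], Int) = {
    val fused: Stage = stages.reduce(_ andThen _)
    (fused(data.iterator).toList, 1)
  }

  // Example stages: a Map followed by a Filter.
  val stages: Seq[Stage] = Seq[Stage](_.map(_ * 2), _.filter(_ % 3 == 0))
}
```

Both functions return the same records for the same input; only the number of crossings differs, which is the cost that pipelining is meant to reduce.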

Page 42:

Mobius Service Differentiators

• Elasticity and Fault Tolerance
• Use virtual actor model
• Auto-scale # of Mobius Service endpoints

[Figure: labels: Routing Layer, Mobius Service Endpoints (…), Persistent Store.]

Page 43:

Linux Support

• Mono and CoreCLR (ongoing) for Mobius on Linux
• GitHub project uses Travis for CI in Ubuntu 14.04.3 LTS
• Unit tests and samples (functional tests) are run
• [email protected]

Page 44:

CSharpRDD

• C#, CLR
• C# => JVM

• RDD<byte[]>
• C# worker

• Transformations are pipelined when possible