Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran

Spark @ Bloomberg: Dynamic Composable Analytics. Partha Nageswaran, Sudarshan Kadambi, Bloomberg L.P.

Upload: spark-summit

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Spark and Bloomberg by  Sudarshan Kadambi and Partha Nageswaran

Spark @ Bloomberg: Dynamic Composable Analytics

Partha Nageswaran, Sudarshan Kadambi

BLOOMBERG L.P.

Page 2

Spark at Bloomberg: Dynamic Composable Analytics

• Adaptation of Spark in Bloomberg is evolving from crafting stand-alone Spark Apps to Serverized Spark Apps

[Diagram: stand-alone Spark Apps, each on its own Spark Cluster, alongside a single Spark Cluster in which one Spark Server hosts multiple Spark Apps]

Page 3

A Couple of Tenets

• Preference for on-the-fly calculations over pre-computed values

• Support analytics on analytics, ad infinitum, in a dynamic manner

Page 4

Spark Serverization - Motivation

• Stand-alone Spark Apps on isolated clusters pose challenges:

– Management of clusters, replication of data, etc.

– Analytics are confined to specific content sets, making Cross-Asset Analytics much harder

– Need to handle Real-time ingestion in each App!

– Redundancy in:

» Crafting and Managing of RDDs/DFs

» Coding of the same or similar types of transforms/actions

[Diagram: two isolated Spark Clusters, each running its own Spark App]

Page 5

Spark Serverization - Approach

• Long-running process (Spark Server) maintains a shared SparkContext

– Spark Apps within the process share the same SparkContext

• Provide a Container-based approach to capabilities such as:

– Deploying and Managing Spark Applications

– Deploying, Managing and Sharing RDDs/DFs across Spark Apps

– Handling on-the-fly analytics on Streaming data

– Declarative orchestration of higher-order analytics on RDDs/DFs and across Spark Applications

[Diagram: a Spark Server holding one Spark Context, which hosts two Spark Apps and their DFs]

Page 6

Benefits of Container Approach

• Provide Lifecycle Management for Spark Apps

• Provide Lifecycle Management for RDDs/DFs

• Provide other declarative qualities of service, such as:

– Routing of Requests to appropriate Spark Apps

– Naming Service capabilities for RDDs/DFs

– Ingestion Services

– Securing access to Spark Apps, and RDDs/DFs

[Diagram: Spark Server providing Lifecycle Services, Naming Services, Security Services, Routing Services, Query Services and Ingestion Services]

Page 7

Introducing Managed DataFrames (MDFs)

• A Managed DataFrame (MDF) is a named DataFrame, optionally combined with Execution Metadata

– MDFs can be searched by name OR by any Column Name defined in the Schema of the corresponding DF

• Execution Metadata includes:

– Data Distribution metadata, which captures information about the data depth, histogram information, etc.

– E.g.: A managed DataFrame for pricing of stocks, representing 2 years of historical data, and another representing 30 years of historical data

[Diagram: two MDFs, each wrapping a Price DF<ID, Price>. MDF "ShallowPriceMDF" has Execution Metadata: 2 Yr Price History. MDF "DeepPriceMDF" has Execution Metadata: 30 Yr Price History]

Page 8

Introducing Managed DataFrames (MDFs)

– Data Derivation metadata, which are mathematical expressions that define how additional columns can be synthesized from existing columns in the schema

– E.g.: adjPrice is a derived Column, defined in terms of the base Price column

– In essence, an MDF with data derivation metadata has a Schema that is the union of the columns of the contained DF and the derived columns

[Diagram: two MDFs, each wrapping a Price DF<ID, Price>. MDF "ShallowPriceDF" has Execution Metadata: 2 Yr Price History; adjPrice = Price – 3% of Price. MDF "DeepPriceDF" has Execution Metadata: 30 Yr Price History; adjPrice = Price – 1.75% of Price]
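The derivation metadata described on this slide can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: `Derivation`, `ManagedDataFrame` and `withDerived` are hypothetical names, and a row is modelled as a plain column-to-value map so the sketch runs without Spark.

```scala
// Hypothetical model: a row is a column-name -> value map standing in for a DataFrame row.
final case class Derivation(column: String, expr: Map[String, Double] => Double)

final case class ManagedDataFrame(
    name: String,
    baseColumns: Seq[String],
    derivations: Seq[Derivation]
) {
  // Schema is the union of the contained DF's columns and the derived columns.
  def schema: Seq[String] = (baseColumns ++ derivations.map(_.column)).distinct

  // Synthesize each derived column from the columns already present in the row.
  def withDerived(row: Map[String, Double]): Map[String, Double] =
    derivations.foldLeft(row)((acc, d) => acc + (d.column -> d.expr(acc)))
}
```

With a derivation such as `adjPrice = Price – 3% of Price`, the schema reported by the MDF contains `adjPrice` even though the underlying DF only stores `ID` and `Price`.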

Page 9

Introducing MDF Registry

• A Registry, called the MDF Registry, within the Spark Server provides support for:

– Binding MDFs by Name

– Looking up MDFs by Name

– Looking up MDFs by a Column Name (an element of the MDF Schema), etc.

• The MDF Registry maintains a 'table' that associates the Name of the MDF with the DF reference and Columns in the DF

MDF Registry

Name            Columns          DF Ref.  MetaData
ShallowPriceDF  Price, adjPrice  …        …
DeepPriceDF     Price, adjPrice  …        …
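A minimal sketch of such a registry, assuming an in-memory table keyed by MDF name. `RegistryEntry`, `MdfRegistry`, `bind`, `lookup` and `lookupByColumn` are hypothetical names chosen for illustration, and the DF reference type is left generic so the example runs without Spark.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical registry row: Name, Columns, DF reference and metadata, as in the slide's table.
final case class RegistryEntry[DF](
    name: String,
    columns: Set[String],
    dfRef: DF,
    metadata: Map[String, String]
)

final class MdfRegistry[DF] {
  // Concurrent map, since several Request Processors may bind and look up at once.
  private val byName = TrieMap.empty[String, RegistryEntry[DF]]

  // Bind (or rebind) an MDF under its name.
  def bind(entry: RegistryEntry[DF]): Unit = byName.update(entry.name, entry)

  // Look up a single MDF by name.
  def lookup(name: String): Option[RegistryEntry[DF]] = byName.get(name)

  // Look up every MDF whose schema contains the given column, ordered by name.
  def lookupByColumn(column: String): Seq[RegistryEntry[DF]] =
    byName.values.filter(_.columns.contains(column)).toSeq.sortBy(_.name)
}
```

A lookup by column can return several MDFs (here both the shallow and deep price MDFs expose `adjPrice`), so the caller would need execution metadata to choose between them.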

Page 10

Introducing DF Function Transform Libraries (FTLs)

• Standard (Analytics) functions can be expressed as 'pre-canned' Spark Transforms/Actions in Function Transform Libraries (FTLs)

• Spark Apps can compose pre-canned transforms with other application logic transforms on MDFs, by looking up the pre-canned transforms from FTLs

– E.g.: convertedPriceDF = FTL.apply(prices, "ConvertCurrency", params);

Function Transform Library (FTL)

ConvertCurrency

df.join(rates.filter(rates("toCCY") === toCCY),
        df("CURRENCY") === rates("fromCCY") && df("DATE") === rates("DATE"))
  .select(df("ID"), df("DATE"), rates("RATE") * df("VALUE"), rates("toCCY"))
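The lookup-by-name idea behind `FTL.apply` can be sketched as a map from transform names to functions. This is an assumption-laden illustration, not the Bloomberg implementation: `PriceRow`, `RateRow` and the `Ftl` class are hypothetical, plain Scala collections stand in for DataFrames, and the inner join of the slide is written as a for-comprehension.

```scala
// Hypothetical row types standing in for the price and rates DataFrames.
final case class PriceRow(id: String, date: String, currency: String, value: Double)
final case class RateRow(fromCcy: String, toCcy: String, date: String, rate: Double)

final class Ftl(rates: Seq[RateRow]) {
  type Transform = (Seq[PriceRow], Map[String, String]) => Seq[PriceRow]

  // Mirrors the slide's join: match on currency and date, then scale the value by the rate.
  private val convertCurrency: Transform = (prices, params) => {
    val toCcy = params("toCCY")
    for {
      p <- prices
      r <- rates
      if r.toCcy == toCcy && r.fromCcy == p.currency && r.date == p.date
    } yield p.copy(currency = toCcy, value = p.value * r.rate)
  }

  // Pre-canned transforms, keyed by the name callers use to look them up.
  private val library: Map[String, Transform] = Map("ConvertCurrency" -> convertCurrency)

  def apply(prices: Seq[PriceRow], name: String, params: Map[String, String]): Seq[PriceRow] =
    library.get(name) match {
      case Some(transform) => transform(prices, params)
      case None            => throw new NoSuchElementException(s"No transform named $name")
    }
}
```

As with the inner join on the slide, rows with no matching rate for the requested date and currency are dropped from the result.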

Page 11

Introducing Request Processor

• Spark Apps (called Request Processors, RPs) within the Spark Server are implemented to comply with specifications

– These RPs are provided access to the Registry and FTLs

– Are responsible for composing transforms and actions on one or more MDFs

– May dynamically bind additional MDFs (materialized or otherwise) for use by other Apps

[Diagram: the Request Handler passes a request to a Request Processor, which looks up MDFs in the MDF Registry, applies a Function from the FTLs, decorates the MDFs with Transforms, and collects the result]

Page 12

Bloomberg Spark Server

[Diagram: a Client Query enters through the Request Handler (e.g. Apache CXF running in Tomcat); Request Processors run inside a shared Spark Context and use MDFs from the MDF Registry together with the ConvertCurrency function from the Function Transform Library (FTL)]

Page 13

Bloomberg Spark Server

[Diagram: Request Handler (e.g. Apache CXF running in Tomcat), three Request Processors inside a shared Spark Context, the MDF Registry with MDF1 and MDF2, the Function Transform Library (FTL) with currConv, and an Ingestion Manager; markers 1 and 2 number the steps]

Page 14

Bloomberg Spark Server

[Diagram: same components as on Page 13]

Page 15

Schema Repository

• Enterprise-wide data pipeline

• External (to Spark) schema repository and service

• Enables MDF lookup by a dataset schema element

• Analytic expressions can now be composed over data elements

Page 16

Execution Metadata

• Connection Identifiers

• Backing Stores

• Real-time Topics

• Storage Level & Refresh Rate

• Subset Predicate, etc.

Page 17

Cross-Domain Analytics

• Registration of pre-materialized DataFrames

• Collaborative analytics between application workflows

• Dynamic creation of Managed DataFrames

• Ad-hoc cross-domain analytics

Page 18

Subsetting

• High-value data sub-setted within Spark

• Reduce cost of querying external datastore

• Specified as a filter predicate at time of registration

• E.g. Member companies of popular indices [Dow 30, S&P 500, …] have records placed within Spark

Page 19

Subsetting

• Seamless stitching between data in Spark (DFsubset) and backing store (DFsubset’)

• (DFsubset ∪ DFsubset’).filter(query) = DFsubset.filter(query) ∪ DFsubset’.filter(query)

• Future: Predicate logic between query and subset predicates

• Dataset owners provided knobs for cost vs performance.

• Future: LRU cache
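The stitching identity on this slide says a filter distributes over the union of the in-Spark subset and the rest of the data in the backing store. A small sketch under stated assumptions: `Record` and `Stitching` are hypothetical names, and plain sequences stand in for DFsubset and DFsubset’.

```scala
final case class Record(ticker: String, price: Double)

object Stitching {
  // Split a dataset by the subset predicate given at registration time:
  // the first part lives in Spark, the second stays in the backing store.
  def split(all: Seq[Record], subset: Record => Boolean): (Seq[Record], Seq[Record]) =
    all.partition(subset)

  // Answer a query by filtering each side separately and unioning the results.
  def query(inSpark: Seq[Record], inStore: Seq[Record])(q: Record => Boolean): Seq[Record] =
    inSpark.filter(q) ++ inStore.filter(q)
}
```

Because each side is filtered on its own, the filter aimed at the backing store can be pushed down to that store rather than pulling unfiltered data into Spark first.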

Page 20

Ingestion: AB Swap

• Periodic data pull into Spark from the backing store

• Subset criteria applied during data retrieval

• Scenario: the backing store is kept continuously updated, external to Spark

• Avro deserialization pushed down into various connectors
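The slide does not spell out the swap mechanics. One plausible reading is that a fresh copy (B) is pulled from the backing store while the current copy (A) keeps serving queries, and the reference is then switched in a single step. The sketch below assumes exactly that; `AbSwap`, `current` and `refresh` are hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicReference

final class AbSwap[T](initial: T) {
  private val active = new AtomicReference[T](initial)

  // Readers always see a complete copy, never a partially loaded one.
  def current: T = active.get()

  // Build the replacement off to the side, publish it atomically, and
  // return the previous copy so the caller can release it.
  def refresh(pull: () => T): T = active.getAndSet(pull())
}
```

The subset criteria from the slide would be applied inside `pull`, so only the high-value rows are loaded on each periodic refresh.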

Page 21

Ingestion: Stream Reconciliation

• Analytics needs to be low-latency with respect to queries, but also with respect to data freshness

• Since data is being sub-setted within Spark, need to keep the subset up to date

• Datasets published to different Kafka topics.

• 1:1 mapping between datasets, topics and DStreams.

Page 22

Ingestion: Stream Reconciliation

[Diagram: a Backing Store supplying DFsubset, and a Real-Time Stream of updates U1, U2, U3 … UN. Processing steps labelled (Avro Deserialize, Subset Predicate), (updateStateByKey) and (foreachRDD, convert to DF) produce states S1, S2, S3 … SN and DataFrames up to DFN inside MDF A]
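The keyed-state step in the diagram can be illustrated without Spark Streaming by folding each micro-batch of updates into a map of per-key state. This is a sketch under assumptions: `Update`, its `seq` field and `Reconcile.step` are hypothetical, and the rule "highest sequence number wins" is an illustrative choice that the slides do not state.

```scala
// Hypothetical update record; `seq` is an assumed monotonically increasing version.
final case class Update(key: String, value: Double, seq: Long)

object Reconcile {
  // Fold one micro-batch into the keyed state, keeping the newest update per key.
  def step(state: Map[String, Update], batch: Seq[Update]): Map[String, Update] =
    batch.foldLeft(state) { (acc, u) =>
      acc.get(u.key) match {
        case Some(prev) if prev.seq >= u.seq => acc // stale or duplicate update, ignore
        case _                               => acc.updated(u.key, u)
      }
    }
}
```

Starting the state from the subset loaded out of the backing store and applying `step` per batch gives one state per batch, matching the S1 … SN sequence in the diagram.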

Page 23

Reference Counting

• An MDF contains multiple generations of DFs, being generated and destroyed

• Multiple generations are operated upon by RPs at any given point in time

• Reference counting to keep track of what DFs are being used and by whom

• Long-running queries aborted for forced reclamation
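A minimal sketch of per-generation reference counting, assuming each Request Processor is identified by a string id. `Generation`, `acquire`, `release`, `retire` and `reclaimable` are hypothetical names, and forced reclamation of long-running queries is left out.

```scala
final class Generation[DF](val epoch: Long, val df: DF) {
  private var holders = Map.empty[String, Int] // RP id -> number of open references
  private var retired = false

  // Returns false once the generation is retired, so no new query can pin it.
  def acquire(rp: String): Boolean = synchronized {
    if (retired) false
    else { holders = holders.updated(rp, holders.getOrElse(rp, 0) + 1); true }
  }

  def release(rp: String): Unit = synchronized {
    holders.get(rp) match {
      case Some(n) if n > 1 => holders = holders.updated(rp, n - 1)
      case Some(_)          => holders = holders - rp
      case None             => throw new IllegalStateException(s"$rp holds no reference")
    }
  }

  // Mark the generation as superseded by a newer one.
  def retire(): Unit = synchronized { retired = true }

  // Who is still using this generation.
  def heldBy: Set[String] = synchronized { holders.keySet }

  // Safe to drop only when retired and no Request Processor still holds it.
  def reclaimable: Boolean = synchronized { retired && holders.isEmpty }
}
```

Tracking holders by id, rather than keeping a bare counter, is what makes it possible to say by whom a generation is still in use.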

Page 24

Snapshot Consistency

• Multiple queries need to operate on the same snapshot of data

• How to achieve this, if the data is constantly changing underneath?

• Each DF within an MDF is associated with a time epoch

• Registry lookup with a reference time

• Time-align sub-setted DataFrames with data in the backing store
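Lookup with a reference time can be sketched as an ordered map from epoch to DF generation, returning the latest generation at or before the requested time. `Snapshots`, `publish` and `asOf` are hypothetical names, and epochs are assumed to be plain `Long` timestamps.

```scala
import scala.collection.immutable.TreeMap

final case class Snapshots[DF](byEpoch: TreeMap[Long, DF]) {
  // Register a new generation under its epoch.
  def publish(epoch: Long, df: DF): Snapshots[DF] =
    copy(byEpoch = byEpoch.updated(epoch, df))

  // Latest generation whose epoch is at or before the reference time.
  // TreeMap iterates in ascending key order, so the last match is the newest.
  def asOf(referenceTime: Long): Option[(Long, DF)] =
    byEpoch.takeWhile { case (epoch, _) => epoch <= referenceTime }.lastOption
}
```

Queries that pass the same reference time resolve to the same generation even if newer generations are published while they run.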

Page 25

Spark Challenges

• Low-latency performance consistency

• Efficient Stream reconciliation

• Spark Driver HA

• Strong consistency across contexts needs state externalization

Page 26

MDF Acknowledgements

Andrew Foster, Joe Davey, Shubham Chopra, Hamel Kothari, Nimbus Goehausen

Page 27

THANK YOU

[email protected]@bloomberg.net