what’s new in spark 2.0?files.meetup.com/19070069/161129 - michael... · ©2016 couchbase inc. 3...
TRANSCRIPT
What’s new in Spark 2.0?
©2015 Couchbase Inc. 2
©2016 Couchbase Inc. 3
Spark Overview
Apache Spark is a fast and general engine for large-scale data processing.
Spark 2.0
§ Largely compatible with 1.x
§ Simplifies API
§ 2000 patches from 280 contributors
http://www.slideshare.net/SparkSummit/simplifying-big-data-applications-with-apache-spark-20
Spark 2.0
§ Structured API Improvements
§ Whole-stage code generation
§ Structured Streaming
§ Simpler Setup
§ SQL 2003 Support
§ MLlib enhancements
§ Enhanced R support
§ …
Structured API Improvements
§ Dataset (typed) and DataFrame (untyped) are now unified
§ DataFrame == Dataset<Row>
§ Also used for Structured Streaming
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Whole-Stage Codegen
§ Second-generation Tungsten engine
§ Departing from Volcano Iterator Model
§ Also: Vectorization for more efficient batch-processing
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
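To illustrate what departing from the Volcano iterator model means, here is a minimal Python sketch, not Spark's actual generated code. It runs the same "sum the even values" query twice: once as a chain of operators that pull rows from each other one at a time, and once as the single fused loop that whole-stage code generation aims to produce. All class and function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


class Scan:
    """Leaf operator: hands out rows one at a time."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[int]:
        return iter(self.rows)


class Filter:
    """Pulls rows from its child and passes on those matching the predicate."""

    def __init__(self, child: Iterable[int], predicate: Callable[[int], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def __iter__(self) -> Iterator[int]:
        for row in self.child:
            if self.predicate(row):  # one indirect call per row
                yield row


class Sum:
    """Root operator: consumes its child and returns the total."""

    def __init__(self, child: Iterable[int]) -> None:
        self.child = child

    def execute(self) -> int:
        total = 0
        for row in self.child:
            total += row
        return total


def volcano_sum_even(rows: List[int]) -> int:
    # Iterator model: every row crosses several operator boundaries.
    return Sum(Filter(Scan(rows), lambda r: r % 2 == 0)).execute()


def fused_sum_even(rows: List[int]) -> int:
    # Fused form: the whole stage collapsed into one tight loop,
    # with no per-row calls between operators.
    total = 0
    for r in rows:
        if r % 2 == 0:
            total += r
    return total
```

Both functions return the same result; the difference lies only in how much per-row overhead sits between the scan and the aggregate.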
Structured Streaming (Experimental)
§ Tackling “continuous applications”
§ Integrated API with batch jobs
§ Better interaction with storage systems
§ Rich Integration into the rest of Spark
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
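The idea of an API integrated with batch jobs can be sketched in plain Python, without Spark. In the sketch below, the stream is treated as a table that only ever grows, and a running query keeps its result table up to date by processing just the newly arrived rows. The class and function names are hypothetical, and a word count stands in for an arbitrary aggregation.

```python
from collections import Counter
from typing import Dict, Iterable


class RunningWordCount:
    """Maintains the result of a 'count per word' query over a growing input."""

    def __init__(self) -> None:
        self.result: Counter = Counter()

    def process_batch(self, new_rows: Iterable[str]) -> Dict[str, int]:
        # Only the new rows are touched; earlier input is never rescanned.
        self.result.update(new_rows)
        return dict(self.result)


def batch_word_count(all_rows: Iterable[str]) -> Dict[str, int]:
    # The same query expressed as a one-off batch job over the complete input.
    return dict(Counter(all_rows))
```

After any number of increments, the running query holds the same answer a batch job would compute over all rows seen so far, which is the property that lets one API serve both cases.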
Simpler Setup
§ SparkSession subsumes SQLContext, HiveContext, …
§ One common entry point
SQL 2003 Support
§ Supports SQL 2003 Standard
§ Reworked native SQL parser
§ Native DDL command implementations
§ All kinds of subqueries now supported
§ Can now run all 99 TPC-DS queries
MLlib Enhancements
§ DataFrame as primary ML API
§ Model Persistence
§ Support for all language APIs in Spark: Scala, Java, Python & R
§ Support for nearly all ML algorithms in the DataFrame-based API
§ Support for single models and full Pipelines, both unfitted (a “recipe”) and fitted (a result)
§ Distributed storage using an exchangeable format
Q&A
Thanks!
Couchbase & Spark
Ecommerce runs on Couchbase
6 of the top 10 ecommerce companies in the United States
Online shopping
Omni channel services
Travel runs on Couchbase
3 of the top 3 global distribution systems worldwide
3 of the top 10 airlines
Online Video Streaming runs on Couchbase
6 of the top 10 North American and European broadcast television companies
Sports & Casino Gaming runs on Couchbase
6 of the top 10 online sports and casino gaming companies
Financial Services run on Couchbase
3 of the top 3 credit reporting companies
DaHorvath, http://up.picr.de/23770402by.jpg
Why Spark and Couchbase
Overview & Use-Cases
Use Cases
[Diagram: Operations and Analytics, with CB]
Analytics:
§ Recommendations
§ Predictive analytics
§ Fraud detection
Operations:
§ Catalog
§ Personalization
§ Mobile applications
Use Case: Operationalize Analytics / ML
[Diagram labels: Hadoop, Data Warehouse, Training Data, ML Model, CB, Model, Online Data, Serving, Predictions]
Adapted from: Databricks, Not Your Father’s Database, https://www.brighttalk.com/webcast/12891/196891
Use Case: Data Integration
[Diagram labels: RDBMS, S3, HDFS, ES, NoSQL]
Standalone Deployment
Side-By-Side Deployment
Access Patterns
From Spark to Couchbase and Back Again
Key-Value: Fetch/Store by Document ID
N1QL Query: Fetch by Criteria (“SQL”)
Map-Reduce Views: Materialized Indexes (Aggregation)
Streaming: Mutation Streams For Processing
Full Text: Search on Freeform Text
Couchbase Data Partitioning
Data Locality
§ RDD Location Hints based on the Cluster Map
§ Not available for N1QL or Views
§ Round robin - can’t give location hints
§ Backend is scatter gather with 1 node responding
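As a rough illustration of location hints derived from a cluster map, the sketch below hashes each document ID to a partition and looks up the node that owns it, so that work for a key can be scheduled near its data. This is a simplified model, not the connector's code: the CRC32-based formula, the partition count of 1024, the node names, and the document IDs are all assumptions made for the example.

```python
import zlib
from typing import Dict, List

NUM_VBUCKETS = 1024  # assumed partition count for this sketch


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a document ID to a partition number (assumed CRC32-based scheme)."""
    crc = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return ((crc >> 16) & 0x7FFF) % num_vbuckets


def preferred_locations(keys: List[str], cluster_map: List[str]) -> Dict[str, List[str]]:
    """Group keys by the node that owns their partition."""
    by_node: Dict[str, List[str]] = {}
    for key in keys:
        node = cluster_map[vbucket_for_key(key, len(cluster_map))]
        by_node.setdefault(node, []).append(key)
    return by_node


# Hypothetical cluster map: partition number -> owning node, spread over 3 nodes.
cluster_map = [f"node{vb % 3}.example.com" for vb in range(NUM_VBUCKETS)]
hints = preferred_locations(["airline_10", "airline_11", "airport_1254"], cluster_map)
```

Each entry in `hints` names a node and the keys it owns. A lookup like this only works when the key is known up front, which fits the slide's point that N1QL and view queries cannot provide such hints.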
N1QL Query
§ N1QL is a SQL service with JSON extensions
§ Uses Couchbase’s Global Secondary Indexes
§ Can run on any nodes within the cluster
§ Nodes with differing services can be added and removed as needed on the fly
Couchbase Query Architecture
[Diagram labels: Data Service (Projector & Router; Bucket #1, Bucket #2, ...); DCP Stream; Index Service (Supervisor: Index maintenance & Scan coordinator; Index #1, Index #2, Index #3, Index #4; ForestDB Storage Engine); Query Service (Query Processor, cbq-engine)]
Spark SQL Sources
TableScan: Scan all of the data and return it.
PrunedScan: Scan an index that matches only relevant data to the query at hand.
PrunedFilteredScan: Scan an index that matches only relevant data to the query at hand.
Predicate Conversion
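Predicate conversion can be sketched as a small translation step: filter objects handed over by the query planner are turned into a N1QL WHERE clause. The Python below is an illustrative model only, not the connector's implementation. The filter classes borrow names from Spark's data source filters (EqualTo, GreaterThan, In) but are defined locally, and the quoting rules are simplified assumptions.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass(frozen=True)
class EqualTo:
    attribute: str
    value: Any


@dataclass(frozen=True)
class GreaterThan:
    attribute: str
    value: Any


@dataclass(frozen=True)
class In:
    attribute: str
    values: List[Any]


Filter = Union[EqualTo, GreaterThan, In]


def literal(value: Any) -> str:
    """Render a Python value as a query literal (simplified, assumed escaping)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def to_n1ql(f: Filter) -> str:
    """Translate one filter into a predicate on a backtick-quoted field."""
    field = "`" + f.attribute + "`"
    if isinstance(f, EqualTo):
        return f"{field} = {literal(f.value)}"
    if isinstance(f, GreaterThan):
        return f"{field} > {literal(f.value)}"
    if isinstance(f, In):
        items = ", ".join(literal(v) for v in f.values)
        return f"{field} IN [{items}]"
    raise ValueError(f"unsupported filter: {f!r}")


def build_query(bucket: str, columns: List[str], filters: List[Filter]) -> str:
    """Combine pruned columns and converted predicates into one statement."""
    select = ",".join(f"`{c}`" for c in columns)
    query = f"SELECT {select} FROM `{bucket}`"
    if filters:
        query += " WHERE " + " AND ".join(to_n1ql(f) for f in filters)
    return query
```

For example, `build_query("travel-sample", ["name", "callsign"], [EqualTo("type", "airline")])` produces a statement of the same shape as the generated query in the connector log under Schema Inference.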
Schema Inference
N1QLRelation:28 - Inferring schema from bucket travel-sample with query 'SELECT META(`travel-sample`).id as `META_ID`, `travel-sample`.* FROM `travel-sample` WHERE `type` = 'airline' LIMIT 1000'
N1QLRelation:28 - Executing generated query: 'SELECT `name`,`callsign` FROM `travel-sample` WHERE `type` = 'airline''
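The first log line shows the schema being inferred from a sample of up to 1000 documents. A minimal sketch of that idea in Python follows: walk the sampled documents and record a type per field. The type names, the rule that conflicting types fall back to string, and the sample documents are all assumptions for illustration, not the connector's actual behavior.

```python
from typing import Any, Dict, List


def type_name(value: Any) -> str:
    """Map a JSON value to a coarse column type (names are illustrative)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "string"


def infer_schema(sample: List[Dict[str, Any]]) -> Dict[str, str]:
    """Union the fields seen across the sample, one type per field."""
    schema: Dict[str, str] = {}
    for doc in sample:
        for field, value in doc.items():
            seen = type_name(value)
            known = schema.get(field)
            if known is None or known == "null":
                schema[field] = seen
            elif seen not in ("null", known):
                schema[field] = "string"  # assumed fallback for conflicting types
    return schema


# Made-up documents shaped like rows returned by the sampling query.
sample = [
    {"META_ID": "doc_1", "name": "Example Air", "callsign": None, "id": 1},
    {"META_ID": "doc_2", "name": "Sample Wings", "callsign": "SAMPLE", "id": "2"},
]
schema = infer_schema(sample)
```

Because the schema comes from a bounded sample, fields that appear only in documents outside the sample would be missed; that trade-off is inherent to sampling rather than specific to this sketch.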
DCP and Spark Streaming
[Diagram labels: Replica, Indexing, …]
Structured Streaming Source
[Diagram labels: DCP Stream, Unbounded Table]
Adapted from https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
(Un)Structured Streaming?
Structured Streaming Sink
Couchbase Spark Connector 1.2.1
§ Spark 1.6.x support, including Datasets
§ DCP Flow Control
§ Enhanced Java APIs
Couchbase Spark Connector 2.0.0
§ Spark 2.0.x Support
§ Enhanced DCP Client
§ Experimental Structured Streaming
Resources
§ Spark Packages: https://spark-packages.org/package/couchbase/couchbase-spark-connector
§ Docs: http://docs.couchbase.com
§ Source: https://github.com/couchbase/couchbase-spark-connector
§ Bugs: https://issues.couchbase.com/browse/SPARKC
Q&A
Thanks!