Spark to DocumentDB Connector
Denny Lee, Principal Program Manager, Azure DocumentDB
Denny Lee
• Principal Program Manager for Azure DocumentDB
• 20+ years of experience in databases, distributed systems, data sciences, and software development at Microsoft, Concur, and Databricks
• Notable Projects:
  • Project Isotope: Incubation team for HDInsight
  • Yahoo! 24TB cube: Largest SSAS cube in production
@dennylee
A Brief Overview...
Elastically Scalable Throughput + Storage
Guaranteed low latency
Reads <10ms @ P99
Writes <15ms @ P99
Globally Distributed
Speaks your language
DocumentDB
REST over HTTPS/TCP
MongoDB wire protocol
drivers for MongoDB
Java .NET
Java .NET Ruby
…
Aggregations
Demo
Running Aggregations from Portal
Supports SUM, COUNT, MIN, MAX, AVG
Working on DISTINCT and GROUP BY
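As a rough illustration of the aggregates listed above, the sketch below pairs DocumentDB-style SQL aggregate query strings with a plain-Python emulation over a few in-memory documents. The sample documents and the choice of the `city` and `delay` fields are hypothetical (the field names are borrowed from later slides), and nothing here calls the DocumentDB service.

```python
# Hypothetical sample documents; field names follow those used elsewhere in this deck.
flights = [
    {"tripid": "100100", "city": "SEA", "dst": "POR", "delay": -5},
    {"tripid": "100101", "city": "SEA", "dst": "JFK", "delay": 12},
    {"tripid": "100102", "city": "SFO", "dst": "SEA", "delay": 30},
]

# DocumentDB-style aggregate queries, shown as strings only (not executed here).
QUERIES = {
    "COUNT": "SELECT VALUE COUNT(1) FROM c WHERE c.city = 'SEA'",
    "SUM": "SELECT VALUE SUM(c.delay) FROM c WHERE c.city = 'SEA'",
    "MIN": "SELECT VALUE MIN(c.delay) FROM c WHERE c.city = 'SEA'",
    "MAX": "SELECT VALUE MAX(c.delay) FROM c WHERE c.city = 'SEA'",
    "AVG": "SELECT VALUE AVG(c.delay) FROM c WHERE c.city = 'SEA'",
}


def emulate_aggregates(docs, city):
    """Compute the five aggregates locally over the delays of one city.

    Assumes at least one document matches; an empty match would need a guard.
    """
    delays = [d["delay"] for d in docs if d["city"] == city]
    return {
        "COUNT": len(delays),
        "SUM": sum(delays),
        "MIN": min(delays),
        "MAX": max(delays),
        "AVG": sum(delays) / len(delays),
    }


result = emulate_aggregates(flights, "SEA")
for name, query in QUERIES.items():
    print(f"{query}  ->  {result[name]}")
```

The point of server-side aggregates is that only the single aggregate value crosses the wire, rather than every matching document.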
Data Sciences: Apache Spark + DocumentDB
Demo
Notebook View: https://aka.ms/docdb-spark-graph
pyView: https://aka.ms/pydocdb-spark-graph
Code: https://aka.ms/docdb-spark-graph-code
Advantages
Data Science Scenarios
• Distributed Aggregations and Analytics
• Blazing Fast IoT Scenarios
• Updateable columns
• Push-down predicate filtering
Advantages
Distributed Aggregations and Analytics
Advantages
Blazing Fast IoT Scenarios
Flight information
global safety alerts
weather
Data Science Scenarios
Device Notifications
Web / REST API
Advantages
Updateable Columns
Flight information
Data Science Scenarios
Device Notifications
Web / REST API
{ tripid: "100100", delay: -5, time: "01:00:01" }
{ tripid: "100100", delay: -30, time: "01:00:01" }
{delay:-30}
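The update shown above (the delay for trip 100100 changing from -5 to -30) can be sketched as an upsert keyed by `tripid`. This is a minimal in-memory emulation, not the DocumentDB API: the `store` dictionary and the `upsert` helper are hypothetical names introduced only for illustration.

```python
# Hypothetical in-memory stand-in for a document collection, keyed by tripid.
store = {}


def upsert(doc):
    """Insert the document, or merge its fields into the existing one with the same tripid."""
    current = store.setdefault(doc["tripid"], {})
    current.update(doc)
    return current


# First write, then the in-place update of the same trip.
upsert({"tripid": "100100", "delay": -5, "time": "01:00:01"})
upsert({"tripid": "100100", "delay": -30, "time": "01:00:01"})

# Every consumer that reads after the update sees the new delay.
consumers = ["Data Science Scenarios", "Device Notifications", "Web / REST API"]
views = {name: {"delay": store["100100"]["delay"]} for name in consumers}
print(views)
```

Because the record is updated in place rather than appended, there is still exactly one document for the trip, and no consumer has to reconcile competing versions.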
Advantages
Push-down Predicate Filtering
Data Science Scenarios
Predicate: {city:SEA}
[Figure: DocumentDB index tree with branches locations, headquarter, exports; array positions 0 and 1; country nodes Germany, France, Belgium; city nodes Seattle, Paris, Moscow, Athens]
Matching documents: {city:SEA, dst: POR, ...}, {city:SEA, dst: JFK, ...}, {city:SEA, dst: SFO, ...}, {city:SEA, dst: YVR, ...}, {city:SEA, dst: YUL, ...}, ...
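To make the benefit of push-down filtering concrete, here is a small emulation that compares how many documents cross the wire with and without the predicate being evaluated at the source. The `SOURCE` list and the `fetch` helper are hypothetical; a real source would answer the predicate from its index rather than by scanning a list.

```python
# Hypothetical documents held at the data source.
SOURCE = [
    {"city": "SEA", "dst": "POR"},
    {"city": "SEA", "dst": "JFK"},
    {"city": "SFO", "dst": "SEA"},
    {"city": "YVR", "dst": "SEA"},
    {"city": "SEA", "dst": "SFO"},
]


def fetch(predicate=None):
    """Emulate the source: return the documents that would be sent over the wire."""
    if predicate is None:
        return list(SOURCE)
    return [d for d in SOURCE if all(d.get(k) == v for k, v in predicate.items())]


# Without push-down: transfer everything, then filter on the client.
transferred_all = fetch()
client_side = [d for d in transferred_all if d["city"] == "SEA"]

# With push-down: the predicate travels to the source, so only matches are transferred.
pushed_down = fetch({"city": "SEA"})

print("documents transferred without push-down:", len(transferred_all))
print("documents transferred with push-down:", len(pushed_down))
```

Both paths produce the same answer; the difference is only in how much data has to move before the answer is known.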
pyDocumentDB
[Diagram: Spark master node and worker nodes on one side, DocumentDB gateway node and data nodes on the other, with arrows numbered 1 to 3]
1. The connection is between the Spark master node and the DocumentDB gateway node.
2. The query is submitted from the DocumentDB gateway node to the data nodes. Results are sent back to the gateway node and then transmitted back to the Spark master node.
3. The Spark master node converts the dictionary to a DataFrame and distributes it out to the worker nodes.
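The three steps above can be sketched as follows. The `QueryDocuments` call name is assumed from the pydocumentdb SDK and should be checked against the SDK version in use; `FakeClient`, `query_to_rows`, and the placeholder collection link are hypothetical, added so the sketch runs without a DocumentDB account or a Spark cluster.

```python
def query_to_rows(client, coll_link, query):
    """Steps 1 and 2: send the query through the gateway and pull every result
    back to the caller (the Spark master/driver) as a list of dictionaries."""
    # QueryDocuments is assumed to match the pydocumentdb client method; verify before use.
    return list(client.QueryDocuments(coll_link, query))


class FakeClient:
    """Hypothetical stand-in for a pydocumentdb client. It ignores the query
    text and simply returns the documents it was constructed with."""

    def __init__(self, docs):
        self._docs = docs

    def QueryDocuments(self, coll_link, query, options=None):
        return iter(self._docs)


client = FakeClient([
    {"city": "SEA", "dst": "POR"},
    {"city": "SEA", "dst": "JFK"},
])
rows = query_to_rows(
    client,
    "dbs/<db>/colls/<coll>",  # placeholder link, not a real collection
    "SELECT c.city, c.dst FROM c WHERE c.city = 'SEA'",
)

# Step 3, with a live SparkSession, would be roughly:
#   df = spark.createDataFrame(rows)
print(rows)
```

Note that the whole result set is materialized on a single node before Spark distributes it, which is why this path slows down as the result grows.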
Spark-DocumentDB Connector (Java)
[Diagram: Spark master node and worker nodes on one side, DocumentDB gateway node and data nodes on the other, with arrows numbered 1 to 4]
Spark to DocumentDB Connector
1. The connection is between the Spark master node and the DocumentDB gateway node.
2. Map data is transmitted back to the Spark master node.
3. The query is submitted from the Spark worker nodes to the DocumentDB data nodes.
4. The data is transmitted back to the Spark worker nodes for further processing.
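The connector flow above can be emulated in plain Python to show how it differs from the single-funnel pyDocumentDB path. This sketch assumes the "map data" in step 2 is a map of partitions; `PARTITIONS`, `get_partition_map`, and `query_partition` are hypothetical names, and threads stand in for Spark worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partition map: partition id -> documents stored on that data node.
PARTITIONS = {
    0: [{"city": "SEA", "dst": "POR"}, {"city": "SFO", "dst": "SEA"}],
    1: [{"city": "SEA", "dst": "JFK"}],
    2: [{"city": "YVR", "dst": "SEA"}, {"city": "SEA", "dst": "YUL"}],
}


def get_partition_map():
    """Steps 1 and 2: the master asks the gateway which partitions exist."""
    return sorted(PARTITIONS)


def query_partition(partition_id, city):
    """Steps 3 and 4: one worker queries one data node directly and keeps the result."""
    return [d for d in PARTITIONS[partition_id] if d["city"] == city]


partition_ids = get_partition_map()

# Each thread plays the role of a worker reading its own partition in parallel.
with ThreadPoolExecutor(max_workers=len(partition_ids)) as workers:
    per_partition = list(workers.map(lambda p: query_partition(p, "SEA"), partition_ids))

print(per_partition)
```

The results stay split per partition, so no single node ever has to hold the full result set before further processing begins.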
Query Test Results
Query                            pyDocumentDB     Azure-DocumentDB-Spark
LIMIT 100                        0:00:00.774820   00:00:01.286
All Seattle flights (23K rows)   0:00:05.146107   00:00:01.582
All flights (~1.39M rows)        0:02:36.335267   00:00:08.899
More info at: https://github.com/Azure/azure-documentdb-spark/wiki/Query-Test-Runs
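To read the timings above as ratios, the sketch below converts each H:MM:SS value to seconds and divides the pyDocumentDB time by the connector time. The numbers are copied from the results above; the `to_seconds` helper and variable names are introduced here for illustration.

```python
def to_seconds(text):
    """Parse an H:MM:SS.fraction timing string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# (pyDocumentDB, Azure-DocumentDB-Spark) timings as reported in the results.
RESULTS = {
    "LIMIT 100": ("0:00:00.774820", "00:00:01.286"),
    "All Seattle flights (23K rows)": ("0:00:05.146107", "00:00:01.582"),
    "All flights (~1.39M rows)": ("0:02:36.335267", "00:00:08.899"),
}

# Ratio > 1 means the connector was faster; ratio < 1 means pyDocumentDB was faster.
speedups = {
    query: to_seconds(py_time) / to_seconds(connector_time)
    for query, (py_time, connector_time) in RESULTS.items()
}

for query, ratio in speedups.items():
    print(f"{query}: {ratio:.2f}x")
```

On these figures pyDocumentDB is faster for the small LIMIT 100 query, while the connector pulls ahead as the result set grows, by roughly 3x for the Seattle flights and roughly 17x for all flights.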
Query Test Results
Issue #   Issue Description
7         Improve push down predicates (e.g. take advantage of TOP/LIMIT, aggregations, etc.)
6         Schema-less query bug
5         Optimize computation push to partitions
3         Add Python wrapper / examples
2         Add Azure-DocumentDB-Spark connector as Spark package
More info at: https://github.com/Azure/azure-documentdb-spark/issues
Asks
Go to https://github.com/Azure/azure-documentdb-spark/ and try it out!
References:
• Real-time machine learning on globally-distributed data with Apache Spark and DocumentDB
• Accelerate real-time big-data analytics with the Spark to DocumentDB connector
Any questions?
• We’re on StackOverflow #azure-documentdb
• Email askdocdb@ or denny.lee@
Data Sciences: Apache Spark + DocumentDB
Example: Graph Structures
Graph Calculations: Degrees, PageRank
What is the most important airport (most flights in / out)?
from pyspark.sql.functions import desc
# Top 10 airports by number of inbound flights
(tripGraph.inDegrees
    .sort(desc("inDegree"))
    .limit(10))
Classic Graph Scenario: Flights
vertices = airports
edges = flights
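The graph framing above (airports as vertices, flights as edges) and the in-degree question can be sketched without Spark. This is a plain-Python stand-in for the idea, not the GraphFrames implementation; the flight list is hypothetical sample data.

```python
from collections import Counter

# Hypothetical flights; each edge goes from an origin airport to a destination airport.
edges = [
    ("SEA", "POR"),
    ("SEA", "JFK"),
    ("SFO", "SEA"),
    ("YVR", "SEA"),
    ("JFK", "SEA"),
]

# Vertices are every airport that appears at either end of a flight.
vertices = sorted({airport for edge in edges for airport in edge})

# In-degree counts inbound flights per airport; out-degree counts outbound flights.
in_degrees = Counter(dst for _, dst in edges)
out_degrees = Counter(src for src, _ in edges)

# Equivalent in spirit to sorting by inDegree descending and taking the top 10.
top = in_degrees.most_common(10)
print(top)
```

Degree counts are the simplest importance measure; PageRank refines the idea by also weighting each inbound flight by the importance of the airport it comes from.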