solving low latency query over big data with spark sql

Solving low latency query over big data with Spark SQLJulien Pierre

• Intro• Project’s mission• Use case• Learnings• Implementation details• The future of Spark in ASG

Agenda

• Senior Product Manager• ASG Shared Data team• Beijing• Working on:• Spark• Driving Microsoft internal Spark

community

• Worked on:• Office 365 Admin• Office 365 Customer Facing Reporting• Open sourced the Office 365 Splunk

Who am I?

Julien Pierre朱力安

What we do

Client DataFluency

Office

Modern Data Capability

Instrumentation & Ingestion

Processing & Storage

Reporting & Analytics

Information Management

Mobile-First Analytics Experience

Experimentation

Deliver interactive analytics capabilities for ASG teams to build great products

Increase analyst and engineer productivityCapitalize on our dataLeverage open technology to run a stable service

Project mission

Use Case

Where does Spark SQL help us?

Data Size

Query Latency

Kuntao Yu, engineerCustomer found an issue with the data“Why is the Created Date field = 0001/01/01?”

Is it by design or a bug?71.2 TB of Office Data in Spark

Data issue troubleshooting use case

What did Kuntao have to do? Writing & compiling query

Submitting & waiting in queueJob runtime

Get results inline in Zeppelin

Check data in Excel

Need to open the results in

1. Check affected records 2. Find root cause 3. Assert findings

Elapsed time

Cosmos

SparkSQL

SparkSQL with Cache

0 20 40 60 80 100 120 140 160 180 200

Write and Compile Query Submit and Wait in Job Queue Job Run Time

What did we learn in the process?

Enable self-serviceMinimize time to insightNotebooks at scale

Learnings

How did we do it?

Architecture

Mesos Cluster/HDFS

Job ManagerZookeepe

Job Frontend Web API

Spark Driver Host Pool

Spark Hive Thrift Server

Zeppelin Server

Avocado(Hive Query + Schedule Task)

Rover(Drag & Drop BI tool with Hive Code

Zeppelin Web UI

MetastoreDBHive

Loader

Cosmos Storage

Open Source

Microsoft

Cosmos -> HDFS

Partition 1

Partition 2

Partition n

Export Cosmos Partition

Partition 1

Partition 2

Partition n

Task 2

HDFS.copyFromLocalFile

Task n

Partition 1

Partition 2

Partition n

saveAsParquetFile

Task 2

Task n

Hive Loader

MetastoreDB

Hive Thrift Server

Hive Loader

Zeppelin Server

UserQueryQuery

User Experience in Avocado Task

User Experience in Avocado Task (Cont.)

What next?

Analytics as a Service in ASG

Data Inges

tServices

Clients

Transform Compute

Transform

Compute

Data Streams

Data Sets

EventProcessing

HDFSDataTransportatio

Spark Streaming Receiver

Analyst

ZeppelinNotebooks

Avocado

Simple query

Querylanguage

“Analyze”

“Debug”

“Mine”

“Glance”

Unified platform

Intelligence

Interactive analytics

Data Products

Better Digital

Experiences

Dual users

“Bing”

“Office”Synchronou

experience

Asynchronous

experience

The #DataAnalystHub

Questions?

@JulienJTPierrehttps://linkedin.com/in/julienjtpierre

solving low latency query over big data with spark sql

partition n task

task n partition

data size query latency

compile query

job queue job

cosmos sparksql sparksql

bi tool

copyfrom localfile

Technology

functional query optimization with sql -...

how to build your query engine in spark

query engines for hive: mr, spark, tez with llap –...

focus: querying large video datasets with low latency and...

the effects of interactive latency on exploratory visual...

database systems 12 distributed analytics · task 4.4 query...

low latency execution for apache spark

using an inverted index synopsis for query latency and

functional query optimization with sql...functional query...

polystore,julia,andproduc3vity inabig*dataworld · big data...

drizzle—low latency execution for apache spark: spark...

query performance visualizationin this paper, we perform a...

how to integrate spark mllib and apache solr to build...

multi-query optimization in wide-area streaming...

spinach: an ad-hoc query engine on top of spark...

sparql query processing with apache spark · 2017-05-20 ·...

real-time query-by-image video search system...500 video...

tsps: a real-time time series prediction systemtime and...

cdb: optimizing queries with crowd-based selections and...

making analytics viable in enterprises: potential routes...