data in the cloud - github pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018...

29
Data in the Cloud Happy 10 th ACM SoCC! Raghu Ramakrishnan CTO for Data, Technical Fellow

Upload: others

Post on 17-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Data in the CloudHappy 10th ACM SoCC!

Raghu RamakrishnanCTO for Data, Technical Fellow

Page 2: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

2010 2011 2012 2013 2014

2015 2016 2017 2018 2019

ACM SoCC Topics Over the Past 10 Years

Word clouds courtesy Carlo Curino

Page 3: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

2010 2011 2012 2013 2014

2015 2016 2017 2018 2019

ACM SoCC Topics After Filtering “data” and “cloud”

Word clouds courtesy Carlo Curino

Page 4: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

4

Page 5: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy
Page 6: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy
Page 7: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy
Page 8: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

TRADITIONAL

DATACENTER

Dedicated

storage,

high-end

deployment

TRADITIONAL

DATACENTER

Large,

high-end

deployment

TRADITIONAL

DATACENTER

Large,

high-end

deployment

TRADITIONAL

DATACENTER

Virtualized

servers, high-

end deployment

Azure StorageExchange Online

SharePoint Online

Azure Compute

Carbon Footprint of Cloud Computing

Electricity/core-hour Electricity/TB-year Electricity/mailbox-year Electricity/user-year

52%more efficient

71%more efficient

77%more efficient

22%more efficient

8

http://download.microsoft.com/download/7/3/9/739BC4AD-A855-436E-961D-9C95EB51DAF9/Microsoft_Cloud_Carbon_Study_2018.pdf

Page 9: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

AI in Operation & Optimization

IoT and big data platforms make it increasingly

easy to optimize datacenters

IoT telemetry, analytics and ML

optimization

Predictive maintenance

Capacity planning and workload

placement

Microsoft Confidential 9

Page 10: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Microsoft Confidential 10

Page 11: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy
Page 12: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy
Page 13: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy
Page 14: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Ubiquitous Data

Page 15: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy
Page 16: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Scale

Heterogeneity

Many engines

Many workloads

Data

silo

s

Elastic storage

Ela

stic

co

mp

ute

Network

latency

16

Page 17: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Azure SQL DB

Update-optimized

Meta data

XACT_STATE

Governance

Azure Cosmos DB

Document model

Meta data

XACT_STATE

Governance

OPERATIONAL

RE

LA

TIO

NA

LN

ON

-RE

LA

TIO

NA

L

ANALYTICS

Azure SQL DW

Analytics-optimized

Meta data

XACT_STATE

Governance

Spark, Hive, ML…

Data Lake

Meta data

Governance

Page 18: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Big Picture: Separation of Compute and State

Spark, Hive, ML… Azure SQL DW Azure SQL DB

Data Lake Update-optimizedAnalytics-optimized

Meta data Meta data Meta data

XACT_STATE XACT_STATE

Azure Cosmos DB

Document model

Meta data

XACT_STATE

Page 19: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Microsoft’s internal data lake

• A data lake for all teams @Microsoft

• Good developer tools

• Batch, Interactive, Streaming, ML

• Used across Office, Xbox, Azure, Windows, Ads, Bing, Skype, …

• Production jobs and experimentation

Microsoft’s Internal Big Data Service

Azure Data Lake StoreHDFS as a PaaS cloud service

• Microsoft’s serverless Big Data platform

• Fully aligned with Hadoop ecosystem and standards, with full support for Hadoop tools and engines as well as unique Microsoft capabilities

• Migrated to ADLS

• 1P = 3P

Enabling business growth:

Office productivity revenue (45%YoY)*

Intelligent Cloud (100% YoY)*

Bing search share doubles

By the numbers• 9+ Exabytes of data, 8+ billion files

• 100Ks of physical servers

• Millions of interactive queries

• Huge streaming pipelines

• 100Ks of daily batch jobs

• 15K+ developers

• 300+ teams

MSR/GSL Collaboration

J. Zhou et. al., SCOPE: parallel databases

meet MapReduce, VLDBJ 21(5)

R. Ramakrishnan et. Al., Azure Data Lake

Store, SIGMOD 2017

Apache YARN Federation

Page 20: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Traditional MPP DW Architecture

Data movement

channel

DQE communication channel

Compute

Remote storage

Adaptive cache

• DMS

• SQL

• SSD Cache

Meta data

Transactions

Premium Standard

Snapshot backupsData

Log

Page 21: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Cloud-Native Scale-Out, Data Heterogeneity

Data movement

channel

Compute

Remote storage

• ES

• SQL

• SSD Cache

• Distribution-less

Columnar files

StandardMeta data

Transactions

Centralized services

▪ Data and state separated from

compute

▪ Fault-tolerant scale-out

▪ Online scaling

▪ Data heterogeneity

➢ Converge DW and Lake

Page 22: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Task

Workload Tasks

A next generation distributed query engine (blend massive scale batch QP with scale up (small scale-out) interactive QP)

Polaris Concurrency –Workload Aware Scheduling

Workload Task Graph

Task-cost Driven Scheduling

Resource Aware Task Placement

%Resource

Demand

25

5

10Agg

5 Query 1

15 40

55

155

5

Query 2

25

510Agg

15 540

5

5

State Machine Execution

State

Transition

States

Waiting Execution

Executing

Failed

Completed

Edges are precedence constraints

Global Workload Graph that enables for

workload optimizations across queries

State Machines:

• Guarantees precedence constraints are satisfied

• Defines a formal model on how we recover from

failures

Page 23: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Scalability: All TPC-H Queries at 1PB Scale!Elastic DQP – Unlimited Scale

Page 24: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

MSR Collaboration

𝒔𝒊𝒛𝒆𝒐𝒇(𝒅𝒂𝒕𝒂)

𝑵

𝒔𝒊𝒛𝒆𝒐𝒇(𝒅𝒂𝒕𝒂)

𝑵

𝒔𝒊𝒛𝒆𝒐𝒇(𝒅𝒂𝒕𝒂)

𝑵

P. Antonopoulos, et. al,., Socrates: The New SQL Server in the Cloud. ACM SIGMOD 2019

Page 25: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

OLTP Data Warehouses NoSQLBig Data / Lake

Page 26: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Unified Data Suite and Governance

Spark, Hive, ML…

Data Lake Update-optimizedAnalytics-optimized

Meta data Meta data Meta data

XACT_STATE XACT_STATE

Azure Cosmos DB

Document storage

Global apps

Meta data

XACT_STATE

SQL

Governance

Page 27: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Big Picture: Must Simplify Usability and Governance

• Cloud

• Elastic compute and storage is transformative

• But compute-storage latency and bandwidth is key challenge

• Edge blurs cloud/on-prem separation

• ML

• An integral part of data processing, with a rapidly growing community of its own

• Implications for Data Management

• Rethink what belongs in a “DBMS”—ML, data governance

• Rethink data architectures from the ground up—OLTP/Analytics/HTAP

Page 28: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

App

logic

offline

online

Model Scoring

Featurization

Live Data

Decisions

other data

featurization Model

Training

Model Development / Training

offline featurization.

Model

optimizationModel

dep

loym

en

t

Modelpolicies

Data

Catalogs

orchestration

Governance

Model Tracking

& Provenance

Access

Control

Logs &

Telemetry

policies

A. Agrawal et al., Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML, CIDR 2020.

GSL Collaboration

Page 29: Data in the Cloud - GitHub Pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 ACM SoCC Topics After Filtering “data” and “cloud” Word clouds courtesy

Big Data and

Data warehousing

BI AI and ML

Data &Operational

Systems

A single pane of glass to…

Manage data lifecycle(collect, clean, publish, discover, curate, …)

Ensure Data Quality & Correctness

Assess data compliance, privacy & protection

Author & manage data policy(access, use, retention, location, sharing)

Unified Governance

Across Cloud, Edge, On-Prem