data in the cloud - github pages · 2020-03-06 · 2010 2011 2012 2013 2014 2015 2016 2017 2018...
TRANSCRIPT
Data in the CloudHappy 10th ACM SoCC!
Raghu RamakrishnanCTO for Data, Technical Fellow
2010 2011 2012 2013 2014
2015 2016 2017 2018 2019
ACM SoCC Topics Over the Past 10 Years
Word clouds courtesy Carlo Curino
2010 2011 2012 2013 2014
2015 2016 2017 2018 2019
ACM SoCC Topics After Filtering “data” and “cloud”
Word clouds courtesy Carlo Curino
4
TRADITIONAL
DATACENTER
Dedicated
storage,
high-end
deployment
TRADITIONAL
DATACENTER
Large,
high-end
deployment
TRADITIONAL
DATACENTER
Large,
high-end
deployment
TRADITIONAL
DATACENTER
Virtualized
servers, high-
end deployment
Azure StorageExchange Online
SharePoint Online
Azure Compute
Carbon Footprint of Cloud Computing
Electricity/core-hour Electricity/TB-year Electricity/mailbox-year Electricity/user-year
52%more efficient
71%more efficient
77%more efficient
22%more efficient
8
http://download.microsoft.com/download/7/3/9/739BC4AD-A855-436E-961D-9C95EB51DAF9/Microsoft_Cloud_Carbon_Study_2018.pdf
AI in Operation & Optimization
IoT and big data platforms make it increasingly
easy to optimize datacenters
IoT telemetry, analytics and ML
optimization
Predictive maintenance
Capacity planning and workload
placement
Microsoft Confidential 9
Microsoft Confidential 10
Ubiquitous Data
Scale
Heterogeneity
Many engines
Many workloads
Data
silo
s
Elastic storage
Ela
stic
co
mp
ute
Network
latency
16
Azure SQL DB
Update-optimized
Meta data
XACT_STATE
Governance
Azure Cosmos DB
Document model
Meta data
XACT_STATE
Governance
OPERATIONAL
RE
LA
TIO
NA
LN
ON
-RE
LA
TIO
NA
L
ANALYTICS
Azure SQL DW
Analytics-optimized
Meta data
XACT_STATE
Governance
Spark, Hive, ML…
Data Lake
Meta data
Governance
Big Picture: Separation of Compute and State
Spark, Hive, ML… Azure SQL DW Azure SQL DB
Data Lake Update-optimizedAnalytics-optimized
Meta data Meta data Meta data
XACT_STATE XACT_STATE
Azure Cosmos DB
Document model
Meta data
XACT_STATE
Microsoft’s internal data lake
• A data lake for all teams @Microsoft
• Good developer tools
• Batch, Interactive, Streaming, ML
• Used across Office, Xbox, Azure, Windows, Ads, Bing, Skype, …
• Production jobs and experimentation
Microsoft’s Internal Big Data Service
Azure Data Lake StoreHDFS as a PaaS cloud service
• Microsoft’s serverless Big Data platform
• Fully aligned with Hadoop ecosystem and standards, with full support for Hadoop tools and engines as well as unique Microsoft capabilities
• Migrated to ADLS
• 1P = 3P
Enabling business growth:
Office productivity revenue (45%YoY)*
Intelligent Cloud (100% YoY)*
Bing search share doubles
By the numbers• 9+ Exabytes of data, 8+ billion files
• 100Ks of physical servers
• Millions of interactive queries
• Huge streaming pipelines
• 100Ks of daily batch jobs
• 15K+ developers
• 300+ teams
MSR/GSL Collaboration
J. Zhou et. al., SCOPE: parallel databases
meet MapReduce, VLDBJ 21(5)
R. Ramakrishnan et. Al., Azure Data Lake
Store, SIGMOD 2017
Apache YARN Federation
Traditional MPP DW Architecture
Data movement
channel
DQE communication channel
Compute
Remote storage
Adaptive cache
• DMS
• SQL
• SSD Cache
Meta data
Transactions
Premium Standard
Snapshot backupsData
Log
Cloud-Native Scale-Out, Data Heterogeneity
Data movement
channel
Compute
Remote storage
• ES
• SQL
• SSD Cache
• Distribution-less
Columnar files
StandardMeta data
Transactions
Centralized services
▪ Data and state separated from
compute
▪ Fault-tolerant scale-out
▪ Online scaling
▪ Data heterogeneity
➢ Converge DW and Lake
Task
Workload Tasks
A next generation distributed query engine (blend massive scale batch QP with scale up (small scale-out) interactive QP)
Polaris Concurrency –Workload Aware Scheduling
Workload Task Graph
Task-cost Driven Scheduling
Resource Aware Task Placement
%Resource
Demand
25
5
10Agg
5 Query 1
15 40
55
155
5
Query 2
25
510Agg
15 540
5
5
State Machine Execution
State
Transition
States
Waiting Execution
Executing
Failed
Completed
Edges are precedence constraints
Global Workload Graph that enables for
workload optimizations across queries
State Machines:
• Guarantees precedence constraints are satisfied
• Defines a formal model on how we recover from
failures
Scalability: All TPC-H Queries at 1PB Scale!Elastic DQP – Unlimited Scale
MSR Collaboration
𝒔𝒊𝒛𝒆𝒐𝒇(𝒅𝒂𝒕𝒂)
𝑵
𝒔𝒊𝒛𝒆𝒐𝒇(𝒅𝒂𝒕𝒂)
𝑵
𝒔𝒊𝒛𝒆𝒐𝒇(𝒅𝒂𝒕𝒂)
𝑵
P. Antonopoulos, et. al,., Socrates: The New SQL Server in the Cloud. ACM SIGMOD 2019
OLTP Data Warehouses NoSQLBig Data / Lake
Unified Data Suite and Governance
Spark, Hive, ML…
Data Lake Update-optimizedAnalytics-optimized
Meta data Meta data Meta data
XACT_STATE XACT_STATE
Azure Cosmos DB
Document storage
Global apps
Meta data
XACT_STATE
SQL
Governance
Big Picture: Must Simplify Usability and Governance
• Cloud
• Elastic compute and storage is transformative
• But compute-storage latency and bandwidth is key challenge
• Edge blurs cloud/on-prem separation
• ML
• An integral part of data processing, with a rapidly growing community of its own
• Implications for Data Management
• Rethink what belongs in a “DBMS”—ML, data governance
• Rethink data architectures from the ground up—OLTP/Analytics/HTAP
App
logic
offline
online
Model Scoring
Featurization
Live Data
Decisions
other data
featurization Model
Training
Model Development / Training
offline featurization.
Model
optimizationModel
dep
loym
en
t
Modelpolicies
Data
Catalogs
orchestration
Governance
Model Tracking
& Provenance
Access
Control
Logs &
Telemetry
policies
A. Agrawal et al., Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML, CIDR 2020.
GSL Collaboration
Big Data and
Data warehousing
BI AI and ML
Data &Operational
Systems
A single pane of glass to…
Manage data lifecycle(collect, clean, publish, discover, curate, …)
Ensure Data Quality & Correctness
Assess data compliance, privacy & protection
Author & manage data policy(access, use, retention, location, sharing)
Unified Governance
Across Cloud, Edge, On-Prem