BUILD 11/19/19
© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. 1
Cloud Storage: Azure Blobs
CS 739, Fall 2019
Slide credit: Microsoft Corporation
Design Goals
• Question: What data types are needed? Why not just blob/key-value?
• Access patterns:
  • Blobs: whole-file access, fairly large, individually named objects
  • Tables: random access, small data, sequential scans, maybe unnamed objects
  • Queues: push/pull, ordered, unnamed objects
  • Drives: random block access
• Question: what else?
Access blob storage via the URL: http://<account>.blob.core.windows.net/
[Figure: a Location Service and two Storage Stamps. Each stamp has a load balancer (LB), Front-Ends, a Partition Layer, and a Stream Layer, with intra-stamp replication. Data access enters a stamp through its LB; inter-stamp (geo) replication connects the two stamps.]
Stream Layer (Distributed File System)
• Append-only distributed file system
• All data from the Partition Layer is stored in files (extents) in the Stream Layer
• An extent is replicated 3 times across different fault and upgrade domains
  • With random selection of where to place replicas, for fast MTTR
• Checksum all stored data
  • Verified on every client read
  • Scrubbed every few days
• Re-replicate on disk/node/rack failure or checksum mismatch
• Operations: open/close/delete/rename/append/concatenate/random read
[Figure: Stream Layer with Extent Nodes (EN) and three master nodes (M) coordinated via Paxos.]
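The "checksum all stored data, verified on every client read" behavior can be illustrated with a minimal sketch. It assumes CRC32 (from Python's zlib) as the checksum and uses hypothetical names (StoredBlock, store, read_verified); the slides do not say which checksum algorithm or block format the Stream Layer actually uses.

```python
import zlib
from dataclasses import dataclass


@dataclass
class StoredBlock:
    """Hypothetical stored record: payload plus the checksum computed at write time."""
    data: bytes
    checksum: int


class ChecksumMismatch(Exception):
    """Raised when stored data no longer matches its checksum."""


def store(data: bytes) -> StoredBlock:
    # Compute the checksum once, when the data is written.
    return StoredBlock(data=data, checksum=zlib.crc32(data))


def read_verified(block: StoredBlock) -> bytes:
    # Verify on every read. Per the slide, a mismatch leads to
    # re-replication; here we only signal the error to the caller.
    if zlib.crc32(block.data) != block.checksum:
        raise ChecksumMismatch("stored data failed checksum verification")
    return block.data
```

A periodic scrubber could call read_verified over every stored block to catch corruption in data that clients rarely read.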
Partition Layer
• Provides transaction semantics and strong consistency for Blobs, Tables, and Queues
• Stores and reads objects to/from extents in the Stream Layer
• Provides inter-stamp (geo) replication by shipping logs to other stamps
• Scalable object index via partitioning
[Figure: Partition Layer (Partition Servers, Partition Master, Lock Service) on top of the Stream Layer (Extent Nodes (EN) and Paxos-coordinated masters (M)).]
Front End Layer
• Stateless servers
• Authentication + authorization
• Request routing
[Figure: Front End Layer (FE servers) on top of the Partition Layer (Partition Servers, Partition Master, Lock Service) and the Stream Layer (Extent Nodes (EN) and Paxos-coordinated masters (M)).]
[Figure: an incoming write request passing through the Front End Layer (FE servers), the Partition Layer (Partition Servers, Partition Master, Lock Service), and the Stream Layer (Extent Nodes (EN) and Paxos-coordinated masters (M)), with an Ack returned for the request.]
Partition Layer: think BigTable
• Need a scalable index for the objects that can:
  • Spread the index across 100s of servers
  • Dynamically load balance
  • Dynamically change which servers are serving each part of the index based on load
Blob Index (one Storage Stamp), sorted by AccountName, ContainerName, BlobName:

Range   First key (Account Container Blob)   Last key (Account Container Blob)   Partition Server
A-H     aaaa aaaa aaaaa                      harry pictures sunrise              PS 1
H'-R    harry pictures sunset                richard videos soccer               PS 2
R'-Z    richard videos tennis                zzzz zzzz zzzzz                     PS 3

Partition Map (shown at the Partition Master and the Front-End Server): A-H: PS1, H'-R: PS2, R'-Z: PS3
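The range lookup implied by the partition map can be sketched as a binary search over the sorted range boundaries. This is a minimal illustration, not the actual front-end code: the names UPPER_BOUNDS, SERVERS, and lookup are hypothetical, and the boundary keys are the ones shown in the figure.

```python
import bisect

# Inclusive upper bound of each range partition, in sorted order, taken from
# the figure. The last partition (R'-Z) has no explicit upper bound.
UPPER_BOUNDS = [
    ("harry", "pictures", "sunrise"),  # A-H  -> PS1
    ("richard", "videos", "soccer"),   # H'-R -> PS2
]
SERVERS = ["PS1", "PS2", "PS3"]        # R'-Z -> PS3


def lookup(account: str, container: str, blob: str) -> str:
    """Return the partition server that owns the given blob key."""
    key = (account, container, blob)
    # bisect_left returns the index of the first upper bound that is >= key,
    # or len(UPPER_BOUNDS) if the key falls in the last range.
    return SERVERS[bisect.bisect_left(UPPER_BOUNDS, key)]
```

Because only the boundaries are stored, moving a range to another server or splitting a range changes a few map entries rather than any per-object state.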
[Figure: Partition Server data structures: Checkpoint File Tables and Blob Data, together with a Commit Log Stream and a Metadata Log Stream, with separate paths for Writes and Read/Query.]
Stream Layer: like GFS
[Figure: Stream //foo/myfile.data is an ordered list of pointers (Ptr E1, Ptr E2, Ptr E3, Ptr E4) to Extents E1 through E4; each extent is a sequence of blocks.]
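The stream/extent/block relationship in the figure can be sketched as a small data structure. This is a hedged illustration: the class and method names (Extent, Stream, append_block, seal_and_add_extent, read_all) are hypothetical, blocks are modeled as plain bytes, and the rule that only the last extent accepts appends is an assumption consistent with the append-only design and the sealing slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extent:
    """An extent is an ordered sequence of blocks (modeled here as bytes)."""
    name: str
    blocks: List[bytes] = field(default_factory=list)
    sealed: bool = False


@dataclass
class Stream:
    """A stream is an ordered list of pointers to extents."""
    name: str
    extents: List[Extent] = field(default_factory=list)

    def append_block(self, block: bytes) -> None:
        # Assumption: only the last extent accepts appends.
        last = self.extents[-1]
        if last.sealed:
            raise ValueError("last extent is sealed; add a new extent first")
        last.blocks.append(block)

    def seal_and_add_extent(self, name: str) -> None:
        # Seal the current last extent (if any), then point at a new one.
        if self.extents:
            self.extents[-1].sealed = True
        self.extents.append(Extent(name))

    def read_all(self) -> bytes:
        # Reading the stream concatenates blocks across extents in order.
        return b"".join(b for e in self.extents for b in e.blocks)
```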
[Figure: creating a stream/extent. The Partition Layer sends Create Stream/Extent to the Stream Master (SM, replicated via Paxos). The SM allocates the extent replica set on the extent nodes: EN1 Primary, EN2 and EN3 Secondary (Secondary A, Secondary B).]
[Figure: appending to an extent. The Partition Layer sends an Append to the replica set (EN1 Primary, EN2 and EN3 Secondary) and receives an Ack; the Stream Master replicas (SM, Paxos) are shown alongside.]
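The append path in the figure can be sketched as follows. This is a minimal in-memory model under stated assumptions: ExtentNode and append are hypothetical names, and the rule that an Ack is returned only when every replica has written the block is taken from the later slide that describes successful appends as "committed on all replicas".

```python
from typing import List


class ExtentNode:
    """Hypothetical in-memory stand-in for an extent node (EN)."""

    def __init__(self, name: str, reachable: bool = True) -> None:
        self.name = name
        self.reachable = reachable
        self.data = bytearray()

    def write(self, block: bytes) -> bool:
        # Returns False to simulate a failed or unreachable replica.
        if not self.reachable:
            return False
        self.data.extend(block)
        return True


def append(primary: ExtentNode, secondaries: List[ExtentNode], block: bytes) -> int:
    """Append via the primary; return the offset only if every replica wrote it."""
    offset = len(primary.data)
    results = [node.write(block) for node in [primary, *secondaries]]
    if not all(results):
        # No Ack. Replicas may now differ in length, which is the situation
        # the sealing slides deal with.
        raise IOError("append not committed on all replicas")
    return offset
```

Returning the offset only on full success means a reader that uses returned offsets never sees data that some replica lacks.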
[Figure: Stream //foo/myfile.dat with pointers Ptr E1 through Ptr E4 to Extents E1 through E4; a new Extent E5 is created and Ptr E5 is added to the end of the stream.]
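[Figure: sealing an extent, case 1. During an Append from the Partition Layer to EN 1 (Primary), EN 2 (Secondary A), and EN 3 (Secondary B), with EN 4 also shown, the Stream Master (SM, Paxos) asks for the current length, receives 120 and 120, and seals the extent at 120.]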
[Figure: sealing an extent, case 1 continued. A replica syncs with the SM and learns that the extent is sealed at 120.]
[Figure: sealing an extent, case 2. During an Append from the Partition Layer to EN 1 (Primary), EN 2 (Secondary A), and EN 3 (Secondary B), with EN 4 also shown, the SM asks for the current length, receives 120 and 100, and seals the extent at 100.]
[Figure: sealing an extent, case 2 continued. A replica syncs with the SM and learns that the extent is sealed at 100.]
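The two sealing cases can be summarized in a short sketch. It assumes the stream master seals at the smallest commit length among the replicas it can reach, which matches both figures (120 and 120 give 120; 120 and 100 give 100). The function name choose_seal_length and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional


def choose_seal_length(reported: Dict[str, Optional[int]]) -> int:
    """Pick the seal length from the commit lengths the stream master collected.

    `reported` maps a (hypothetical) extent node name to its commit length,
    or None if the stream master could not reach that node.
    """
    reachable = [length for length in reported.values() if length is not None]
    if not reachable:
        raise RuntimeError("no replica reachable; cannot seal")
    # Assumption: seal at the smallest reachable commit length.
    return min(reachable)
```

If an append is acknowledged only after all replicas commit it, acknowledged data lies at or below every replica's commit length, so sealing at the smallest reachable length does not discard acknowledged data. A replica that was unreachable during sealing later syncs with the SM to the sealed length.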
Network partition
• PS can talk to EN3
• SM cannot talk to EN3
• For Data Streams, the Partition Layer only reads from offsets returned from successful appends
  • Committed on all replicas
• Row and Blob Data Streams
  • Offset valid on any replica
• Safe to read from EN3
[Figure: Partition Server (PS), Stream Master replicas (SM), and extent nodes EN 1 (Primary), EN 2 (Secondary A), EN 3 (Secondary B).]
Network partition
• PS can talk to EN3
• SM cannot talk to EN3
• Logs are used on partition load
  • Commit and Metadata log streams
• Check commit length first
• Only read from
  • An unsealed replica, if all replicas have the same commit length
  • A sealed replica
[Figure: the Partition Server checks the commit length on EN 1 (Primary), EN 2 (Secondary A), and EN 3 (Secondary B); the extent is sealed, and EN1 and EN2 are used for loading.]
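The read rule for log streams can be sketched as a small eligibility check. This is an illustration under assumptions: Replica and readable_replicas are hypothetical names, and returning an empty list stands in for "seal the extent first, then retry".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    """Hypothetical view of one extent replica as seen by a partition server."""
    node: str
    commit_length: int
    sealed: bool


def readable_replicas(replicas: List[Replica]) -> List[Replica]:
    """Return the replicas a log stream may be loaded from."""
    sealed = [r for r in replicas if r.sealed]
    if sealed:
        # A sealed replica is always eligible.
        return sealed
    # Unsealed replicas are eligible only if all of them agree on the commit length.
    if len({r.commit_length for r in replicas}) == 1:
        return list(replicas)
    # Lengths disagree and nothing is sealed: no replica is eligible yet.
    return []
```

In the figure's scenario, the commit lengths disagree, the extent is sealed, and loading then proceeds from the sealed replicas on EN1 and EN2.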
Design Choices and Lessons Learned
• Multi-Data Architecture
  • Use extra resources to serve mixed workload for incremental costs
    • Blob -> storage capacity
    • Table -> IOPS
    • Queue -> memory
    • Drives -> storage capacity and IOPS
  • Multiple data abstractions from a single stack
    • Improvements at lower layers help all data abstractions
    • Simplifies hardware management
  • Tradeoff: single stack is not optimized for a specific workload pattern
• Append-only system
  • Greatly simplifies replication protocol and failure handling
    • Consistent and identical replicas up to the extent's commit length
    • Keep snapshots at no extra cost
    • Benefit for diagnosis and repair
    • Erasure coding
  • Tradeoff: GC overhead
• Separating compute from storage
  • Allows each to be scaled separately
    • Important for multitenant environment
    • Moving toward full bisection bandwidth between compute and storage
  • Tradeoff: latency/BW to/from storage