
2016 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.

Azure-consistent Object Storage in Microsoft Azure Stack

Ali Turkoglu Principal Software Engineering Manager

Microsoft

Mallikarjun Chadalapaka Principal Program Manager

Microsoft


Agenda

Context, Solution Stack, Architecture, ARM & SRP

ACS Architecture Deep Dive

Blob Service Architecture & Design

Questions/Discussion


Azure-consistent storage

Cloud storage for Azure Stack

Azure-consistent blobs, tables, queues, and storage accounts

Administrator manageability

Enterprise-private clouds or hosted clouds from service providers

IaaS (page blobs) + PaaS (block blobs, append blobs, tables, queues)

Builds on & enhances WS2016 Software-Defined Storage (SDS) platform capabilities

Presenter
Presentation Notes
Integral part of Microsoft Software-Defined Storage (SDS) vision - highly-reliable, resilient, and scalable cloud storage based on standard hardware, or can just as well leverage existing SAN investments

Azure-consistent Storage: Big Picture

[Figure: architecture diagram. Labels shown:]
• Application clients using Azure Account, Blob, Table & Queue APIs & tooling
• Microsoft Azure Stack Portal, Azure Storage cmdlets, ACS cmdlets, Azure CLI, and Client SDK
• Administrator
• Virtualized services
• Azure Stack storage cloud admin service; Resource Provider Cluster
• Tenant-facing storage cloud services: Blob service, Table service, Queue service, Account service; ACS Cluster
• Fabric services
• Blob back-end (multiple instances)
• Scale-Out File Server (SOFS) with Storage Spaces Direct (S2D)


Clustering Architecture

[Figure: component diagram. Labels shown: Administrator; Azure Stack storage cloud admin service; Resource Provider Cluster; Tenant-facing storage cloud services (Blob service, Table service, Queue service, Account service); ACS Cluster; Blob back-end (multiple instances); Scale-Out File Server (SOFS) with Storage Spaces Direct (S2D).]

• Azure Service Fabric (ASF) guest clusters for cloud services
• Hyper-converged Windows Server Failover Cluster (WSFC) host fabric
• WSFC enhances ASF cluster resiliency
  • HA VMs via Hyper-V host clustering
  • Anti-affinity policies on VMs to ensure that the VMs never all fail over to the same host
  • Application health monitoring on the Service Fabric service for timely detection of service hang situations
• WSFC & CSVFS* provide the basis for the blob service HA model

*Cluster Shared Volume File System


Relating Azure Storage Concepts

[Figure: concept hierarchy. Labels shown: Subscription; Resource Group; Storage Account; Container; Blob (Block Blob, Page Blob, Append Blob); Table; Queue.]

Presenter
Presentation Notes
Subscription
• One instance of an accepted “Offer”
• May contain different types of resources: Compute, Network, Storage, Web Sites, …
• Aggregation point for usage data reporting & billing
Resource Group
• Grouping of logically related resources, typically together to deliver a single business service, e.g. “My PayRoll Service”
Storage Account
• Top-level home for a set of storage objects
• Determines the starting URL for all contained objects, e.g. https://myaccount.blob.myserviceprovider.net/...
• Two keys are tied to the account to allow symmetric Shared Key authentication for account owners
Container
• Container for a group of logically related blobs within a storage account
Blob
• Objects with data, metadata and properties
• Can be block blobs (optimized for streaming) or page blobs (optimized for random read/write access and access of byte ranges)
• Signed or public access with Shared Access Signature authorization
Table
• Structured NoSQL data storage object
• Signed access with Shared Access Signature authorization
• Collection of entities (rows)

ARM & Resource Providers

• Azure Resource Manager (ARM) in Azure Stack
  • Provides an Azure-consistent resource management model
  • Clients can use REST APIs, PS cmdlets, or Portal experiences to manage an Azure Stack deployment
• Resource Provider (RP) is a plug-in which enables Azure-consistent management of a type of infrastructure
  • Compute Resource Provider (CRP)
  • Network Resource Provider (NRP)
  • Storage Resource Provider (SRP)
  • ….
• Users express desired state to ARM via templates
  • A template is a declarative statement about what the user wants
  • ARM drives all necessary orchestration, including imperative directives to RPs
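
As a rough illustration of the declarative model, the sketch below builds a minimal template as a Python dictionary and serializes it to JSON. The resource type `Microsoft.Storage/storageAccounts` and the schema URL follow the public Azure template format. The account name, location, API version, and account type are assumptions chosen for illustration and do not come from the slides.

```python
import json


def storage_account_template(name, location):
    """Return a minimal declarative template describing one storage account."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.Storage/storageAccounts",
                "name": name,
                "apiVersion": "2015-06-15",  # assumption: an API version the target stamp accepts
                "location": location,
                "properties": {"accountType": "Standard_LRS"},  # assumption
            }
        ],
    }


# The template states *what* is wanted; the orchestration is left to ARM and the RPs.
template = storage_account_template("myaccount", "local")
print(json.dumps(template, indent=2))
```

Note that the template contains no ordering or procedural steps: it only names the desired resource and its properties.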


ACS Management Model

[Figure: management model diagram. Labels shown: Azure Resource Manager; CRP; Storage Resource Provider; ACS data path services; Tenant Resources; Admin Resources; Microsoft Azure Stack Portal, and ACS cmdlets; Microsoft Azure Stack Portal, Azure Storage cmdlets, Azure CLI, and Client SDK.]


IaaS VM Storage

• All VM storage in Azure Stack resides in the blob store
• Every OS or data disk is a page blob
  • Each page blob is a ReFS file
  • Starts in REST API access mode
• CRP and SRP collaborate behind the scenes
• Page blob toggles to ‘SMB-only’ at VM run time
  • Super-optimized Hyper-V-over-SMB I/O path

Presenter
Presentation Notes
Lead with this value prop for the section

ACS Architecture Deep Dive


Key Requirements and Challenges for Object Storage on MAS

• Atomicity guarantees: 4 MB Put Page, 4 MB Put Block.
• Data consistency guarantees: after a successful atomic write, any reads after that write get the latest value.
• Immutable blocks that allow re-composition.
• Snapshot-isolated copy for read operations.
• Case-sensitive blob names.
• Historical 512-byte page alignment for page blobs (vs. ReFS 4K cluster size).
• Expensive list blob/enumeration, including blob metadata.
• Durable: synchronously stores 3 replicas before a write completes.
• Scalable: up to millions of blobs under a container and millions of containers.
• Highly available: 99.9% read/write for local region.
• Built with fault tolerance in mind at every component. Verified crash-consistent implementation.
• Specific to MAS architecture, Hyper-V access to page blobs over SMB is required.
• Public vs. private cloud scale differences:
  • I.e. Azure min deployment stamp is ~80 racks. A typical small-scale on-premises deployment needs 4 nodes, few racks.


ACS Architecture

• SRP (Storage Resource Provider): Integrates with ARM and exposes resource management REST interfaces for the storage service overall.
• FE: Provides the REST interface front end, consistent with Azure.
• WAC: Storage account management, user request authentication, and container operations.
• Blob Service: Implements the blob service backend. Stores block and page blob data in the file system/Chunk Store, and metadata in the ESENT DB.
• Table Master: Maps user tables to database/TS instances.
• Table Server: Handles table queries & transactions in databases.
• Storage: SOFS exposes a unified view of all tiered storage to compute nodes as CA shares. Provides fault tolerance & local replication.

[Figure: ACS component diagram. Layers shown: Virtual Services; Physical Services. Component labels: Load Balancer; FE (Front End) / WFE; SRP; Account/Container Service with WAC DB; Table Master with TM DB; Table Server with TS DB; Blob Service with BLOB DB, ChunkStore files, and page blob files, on SSU Node 1 and SSU Node 2; SOFS; CSV (ReFS); Storage Spaces & Pools on shared or S2D DAS disks; HA Clustering; ACS Component. Transports shown: HTTP, RPC, SMB.]

*Key interactions between the components are shown

Presenter
Presentation Notes
Remove Calabria references!

Blobs - Semantic Requirements: See MSDN

BLOCK BLOBS
• Client uploads individual immutable blocks with PUT-BLOCK for future inclusion in a block blob. Block size may be up to 4 MB. A blob can have up to 100,000 uncommitted blocks. Maximum size of the uncommitted block list is 400 GB.
• Followed by a PUT-BLOCK-LIST call to assemble the blob. Maximum size supported for a block blob is 200 GB and 50,000 committed blocks.
• Blocks must retain their identity to permit later PUT-BLOCK-LIST calls to re-arrange them.
• Unused blocks are lazily cleaned up after a PUT-BLOCK-LIST request. In the absence of PUT-BLOCK-LIST, uncommitted blocks are garbage collected after 7 days.
• Blob names are case-sensitive, at least one character long, and at most 1024 characters.
• All blob operations guarantee atomicity: an operation either happened as a whole or did not happen at all. There is no undetermined state at failure.
• For block blobs, each GET-BLOB request gets a snapshot-isolated copy of the blob data (or the request fails if this cannot be accommodated).

PAGE BLOBS
• Client creates a new empty page blob by calling PUT-BLOB. A page blob starts as sparse and its size can be up to 1 TB.
• Random read/write access.
• Client then calls PUT-PAGE to add content to the page blob. The PUT-PAGE operation writes a range of pages to a page blob. The PUT-PAGE operation must guarantee atomicity.
• Calling Put Page with the Update option performs an in-place write on the specified page blob. Any content in the specified page is overwritten with the update.
• Calling Put Page with the Clear option releases the storage space used by the specified page. Pages that have been cleared are no longer tracked as part of the page blob.
• Each range of pages submitted with Put Page for an update operation may be up to 4 MB in size. The start and end of the range must be aligned with 512-byte boundaries.
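
The Put Page size and alignment rules above can be sketched as a small validation function. This is only an illustration: the function name is hypothetical, and it assumes the inclusive byte-range convention used by the Azure Blob REST API, where the start offset is a multiple of 512 and the end offset is one less than a multiple of 512.

```python
PAGE_SIZE = 512                       # page alignment stated on the slide
MAX_PUT_PAGE_BYTES = 4 * 1024 * 1024  # 4 MB limit per update stated on the slide


def validate_put_page_range(start, end):
    """Validate an inclusive byte range [start, end]; return its length in bytes."""
    if start < 0 or end < start:
        raise ValueError("invalid range")
    if start % PAGE_SIZE != 0:
        raise ValueError("start offset must be a multiple of 512")
    if (end + 1) % PAGE_SIZE != 0:
        raise ValueError("end offset must be one less than a multiple of 512")
    length = end - start + 1
    if length > MAX_PUT_PAGE_BYTES:
        raise ValueError("range exceeds the 4 MB Put Page limit")
    return length


print(validate_put_page_range(0, 511))      # a single 512-byte page
print(validate_put_page_range(1024, 4095))  # six consecutive pages
```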


Azure Blobs Object API : See MSDN for details

• Common Blob Operations: Put Blob, Get Blob, Get/Set Blob Properties, Get/Set Blob Metadata, Lease Blob, Snapshot Blob, Copy Blob, Abort Copy Blob, Delete Blob
• Operations on Block Blobs: Put Block, Put Block List, Get Block List
• Operations on Append Blobs: Append Block (commits a new block of data to the end of an existing append blob)
• Operations on Page Blobs: Put Page, Get Page Ranges

Presenter
Presentation Notes
Properties: content type, encoding, request ID, origin, content disposition, etc. Metadata: arbitrary metadata in name:value pairs.

Block Blob Object Design Challenges on Traditional File System

Why not implement Block Blobs as file objects?
• Isolation/atomicity and unique composition requirements are the key offenders.
• Lack of any form of acceptable atomicity support on NTFS.
• Rename/Create path on the file system is prohibitively expensive.
• API semantics do not map to files: immutable vs. random access.
• Enumeration & rich metadata operations require an index, a DB, and transactions.
• SMB access to block blobs is not needed.
• Namespace should be in the database, not in the file system, to meet scale demands and other requirements.
• A kernel-mode filter driver would have been needed to implement stream maps and atomicity, but even then we would still need an index and a transactional DB in kernel mode, thus increasing the cost and complexity.
• Dedup integration complexity.


Key Design Decisions for Blob Service

• ReFS is used for the page blob implementation, for:
  • Page blob snapshots
  • Page blob 4 MB atomic write via ReFS duplicate extent (FSCTL_DUPLICATE_EXTENTS_TO_FILE)
• Page blob and block blob implementations must be in the same service to:
  • Share the DB implementation and metadata operations
  • Meet the REST API requirement that block and page blobs can be under the same container
• SMB is the Hyper-V data transport for page blobs because:
  • Page blobs are designed as (backed by) files
  • Highly optimized data path for Hyper-V over SMB
• RPC is chosen as the data transport from WFE to the Blob service because:
  • WFE cannot write directly to ChunkStore container files. Extending the ChunkStore implementation for this was deemed costly.
  • Alternatively, staging Put Blob/Block writes via a file would cause an additional write penalty.
  • Block blobs are NOT files.

Presenter
Presentation Notes
ReFS is used for the page blob implementation, for:
• Page blob snapshots
• Page blob 4 MB atomic write via ReFS duplicate extent (FSCTL_DUPLICATE_EXTENTS_TO_FILE)
• Page blob integrity stream requirements
Blob service should be deployed on physical nodes because:
• Blob service HA model can be designed to use WSFC and CSVFS
• ReFS volume/VHDX MUST not be attached directly to a VM, to meet the access requirement for Hyper-V
Page blob and block blob implementations must be in the same service to:
• Share the DB implementation and metadata operations
• Meet the REST API requirement that block and page blobs can be under the same container
Block blobs could be stored on an NTFS (in addition to ReFS) volume for the following reasons:
• ESE/NT DB provides integrity support and is proven to work on NTFS volumes
• ChunkStore provides integrity support and is proven to work on NTFS volumes
• Block blob snapshots are implemented as cloning stream maps
SMB is the data transport for page blobs because:
• Page blobs are designed as (backed by) files
• Compute requires SMB access to page blobs
RPC is chosen as the data transport (WFE to Blob service) for block blobs because:
• WFE cannot write directly to ChunkStore container files. Extending the ChunkStore implementation for this was deemed costly.
• Alternatively, staging Put Blob/Block writes via a file would cause an additional write penalty.
• Block

ACS Service Design Principles

• Keep it simple
  • Choose a good achievable limit in V1.
  • Build a solid base and iterate/optimize from there.
• More than API consistency
  • Guaranteed strong data consistency, durability, and transactions.
  • Build with high availability and fault tolerance in every component.
  • Design for management & architecture simplicity.
• Build on available technologies and do not reinvent the wheel
  • Depend on SOFS (with ReFS, Storage Spaces Direct) and CSVFS for highly available, redundant storage. No application-level data replication.
  • Use the ESE/NT (JetDB) engine for table data and blob metadata storage, transactions and indexing. Proven through Exchange deployments.
  • Leverage Dedup Chunk Store for block blob data storage/implementation.
  • Build on Service Fabric/WSFC for high availability, scaling out and load balancing.


Block Blob Service Design

• Global namespace exists in the DB.
• User-mode-only access/store.
• No file system access to the block blobs, neither for namespace nor for data.
• Use *Dedup ChunkStore implementation to store committed and uncommitted blocks, and the stream maps.
• Design for Azure parity from the get-go.
• Azure blob metadata is stored in the ESE/NT DB.
  • To optimize metadata-only operations
  • To implement Blob API GC semantics
• DB scope is a set of containers and their blobs' metadata.


Why ChunkStore?

• ChunkStore implements immutable file containers for chunks and stream maps.
• Append-only chunk insertion.
• Also used as part of the Deduplication feature, and proven.
• Integrity checks for detecting page corruptions, and read from a replica for in-place patching.
• Shallow and deep garbage collection.
• Data chunks shared by root blob and snapshots (deduplication).
• Guarantees crash-consistent file commit (precise order of operations guarantees data integrity at each stage).
• Various chunk size support (up to 4 MB).
• Efficient self-contained chunk ID referencing.
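
To make the append-only and integrity-check properties concrete, here is a minimal in-memory sketch. It is not the ChunkStore on-disk format, which the slides do not describe. The class name, the 8-byte header layout, and the use of the byte offset as a self-contained chunk ID are all assumptions made for illustration.

```python
import zlib


class AppendOnlyContainer:
    """Toy append-only container: chunks are only ever added at the end."""

    def __init__(self):
        self._data = bytearray()

    def insert_chunk(self, payload):
        """Append a chunk and return its ID (here, simply its byte offset)."""
        chunk_id = len(self._data)
        header = len(payload).to_bytes(4, "big") + zlib.crc32(payload).to_bytes(4, "big")
        self._data += header + payload
        return chunk_id

    def read_chunk(self, chunk_id):
        """Read a chunk back, verifying its checksum to detect corruption."""
        length = int.from_bytes(self._data[chunk_id:chunk_id + 4], "big")
        stored_crc = int.from_bytes(self._data[chunk_id + 4:chunk_id + 8], "big")
        payload = bytes(self._data[chunk_id + 8:chunk_id + 8 + length])
        if zlib.crc32(payload) != stored_crc:
            raise IOError("chunk checksum mismatch")
        return payload


container = AppendOnlyContainer()
first = container.insert_chunk(b"hello")
second = container.insert_chunk(b"world!")
print(first, second, container.read_chunk(second))
```

Because existing bytes are never rewritten, a chunk ID stays valid for as long as the container exists, which is what makes sharing chunks between a root blob and its snapshots straightforward in this toy model.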


Why ESE/NT?

• ESE: Extensible Storage Engine (ESE), also known as JET Blue, is an ISAM (Indexed Sequential Access Method) data storage technology. It provides transacted data update and retrieval.
• Transactions & index: the ESE database engine provides Atomic, Consistent, Isolated, Durable (ACID) transactions.
• Logging and crash recovery: ESE has write-ahead logging and snapshot isolation for guaranteed crash recovery. The application-defined data consistency is honored even in the event of a system crash.
• Good backup/restore support: ESE supports on-line backup, where one or more databases are copied along with log files in a manner that does not affect database operations.
• Page corruption detection, with read from replica and patch in place (FSCTL_MARK_HANDLE / MARK_HANDLE_READ_COPY).
• Automatic DB scan and corruption detection: the ESENT engine self-detects and auto-corrects checksum errors on a Jet database stored on a Spaces triple-replica remotely accessed via SMB/CSVFS.
• Used at scale in Exchange workloads.

Presenter
Presentation Notes
A transaction is a logical unit of processing delimited by BeginTransaction and CommitTransaction, or Rollback, operations. It allows applications to retrieve data only from reliable data states and maintains data consistency in the event of an unexpected process termination or system shutdown. An index is a persisted ordering of records in a table. Indexes are used for both sequential access to rows in the order defined, and for direct record navigation based on indexed column values.
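
The begin/commit/rollback behavior described in these notes can be illustrated with any transactional store. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in, since ESE has no standard Python binding. The table and column names are hypothetical and nothing here reflects the actual ESE schema used by ACS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, stream_id INTEGER)")
conn.execute("INSERT INTO blobs VALUES ('blob-a', 1)")
conn.commit()


def update_stream_id(conn, name, new_id, fail=False):
    """Update a blob record inside one transaction; roll back on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE blobs SET stream_id = ? WHERE name = ?", (new_id, name))
            if fail:
                raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        pass  # the partial update was rolled back


def current_stream_id(conn, name):
    return conn.execute("SELECT stream_id FROM blobs WHERE name = ?", (name,)).fetchone()[0]


update_stream_id(conn, "blob-a", 2, fail=True)
after_failure = current_stream_id(conn, "blob-a")  # still the old value
update_stream_id(conn, "blob-a", 3)
after_commit = current_stream_id(conn, "blob-a")   # the committed value
print(after_failure, after_commit)
```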

Page Blob Service Design

• Global namespace exists in the DB.
• Page blobs are files stored in a ReFS volume.
• A blob is a linear mapping to the file. There is no stream map.
• Page blobs also use the DB to store Azure blob metadata.
• To enable exclusive in-place access (direct via SMB) for Hyper-V, check-out and check-in semantics are supplied.
• No concurrent REST vs. file system access.

Presenter
Presentation Notes
Page blob access is through DB lookup. There is a blob name to file name mapping, and lookup. Global namespace exists in the DB.

Page Blobs - Compute vs. REST data access path

• Blob REST access is via the Blob service.
• Compute page blob access (Hyper-V accessing VHDs) is direct to the file over SMB, same path as today with RDMA etc.
• Once a blob is “checked out” for compute access, it is not accessible through REST.

[Figure: data access paths. Labels shown: CSU; Hyper-V Compute; SSU; SSU Node; WFE; BLOB Service; SOFS; SMB; Compute In-Place Page Blob VHD Access; Blob Files on ReFS.]
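
The check-out/check-in behavior for page blobs can be pictured as a two-state access mode. The sketch below is only a model of the rule that REST and SMB access are mutually exclusive. The class, method names, and error types are hypothetical, not the ACS implementation.

```python
class PageBlobAccess:
    """Toy model of the exclusive access mode of one page blob."""

    REST = "rest"
    SMB_ONLY = "smb-only"

    def __init__(self):
        self.mode = self.REST  # page blobs start in REST API access mode

    def check_out(self):
        """Hand the backing file to compute (Hyper-V over SMB)."""
        if self.mode == self.SMB_ONLY:
            raise RuntimeError("blob is already checked out")
        self.mode = self.SMB_ONLY

    def check_in(self):
        """Return the blob to REST access once compute releases it."""
        self.mode = self.REST

    def rest_request(self, operation):
        """Serve a REST operation only while the blob is not checked out."""
        if self.mode != self.REST:
            raise PermissionError("blob is checked out for compute access")
        return operation + " served by the Blob service"


disk = PageBlobAccess()
print(disk.rest_request("Put Page"))
disk.check_out()  # VM starts: REST access is now refused
disk.check_in()   # VM stops: REST access is available again
print(disk.rest_request("Get Page Ranges"))
```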

SSU


Blob Service Design Data Flow & Representation


PAGE Blob Design on ReFS

[Figure: Put Page data flow between the Blob Service front end and back end.]
(1) HTTP Put Page Z to PageBlobC
(2) Put Page Z using RPC to the Blob Service hosting PageBlobC
(3) Look up the filename for PageBlobC
(4) Build an in-memory buffer with alignment data from FilePageBlobC and the unaligned RPC data, then append the buffer to the shared staging file
(5) Duplicate the extent for Page Z from the shared staging file to FilePageBlobC
(6) Update the metadata for PageBlobC

Shared staging file layout: Header | Page X of PageBlobA | Page Y of PageBlobB | Page X+1 of PageBlobA | Page Z of PageBlobC | Unused
FilePageBlobC layout: Page X | Unset | Page Y | Page Z | Unset

Blobs table in metadata store:
Blob Name    Filename         Metadata
PageBlobA    FilePageBlobA    MetadataA
PageBlobB    FilePageBlobB    MetadataB
PageBlobC    FilePageBlobC    MetadataC
...

• Blob Service backend maintains its private log per managed volume to “stage” page writes. There is a single log for all page writes within a single volume. Arbitrary log write size exceeds the 4 MB REST requirement.
• Append-only writes to the staging log, carrying control, request parameters and page write data, use standard logging techniques to achieve atomic log writes.
• ReFS duplicate extents feature is used to atomically insert write data from the staging log into the target page blob.
• Final Commit call updates metadata and returns ETAG/LastModified.
• Crash consistency is achieved via staging log write, ReFS extent duplication, and sequenced metastore update, with log replay from the last valid checkpoint on a service restart.
• Staging log uses checkpointing to manage space consumption on the volume and volume restart recovery.
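
Step (4) above can be sketched as follows. The idea is that a 512-byte-aligned Put Page may not line up with the 4K ReFS cluster size mentioned earlier, so the write is widened to whole clusters using the current file content. The function name and the zero-fill for ranges past the end of the file are assumptions. The real service's buffer handling is not described on the slides.

```python
CLUSTER_SIZE = 4096  # ReFS 4K cluster size mentioned in the requirements


def build_aligned_buffer(existing, offset, data):
    """Return (aligned_start, buffer) covering whole clusters around the write.

    existing: current content of the page blob file
    offset:   byte offset of the incoming page write (512-byte aligned)
    data:     page data received over RPC
    """
    start = (offset // CLUSTER_SIZE) * CLUSTER_SIZE
    end = -(-(offset + len(data)) // CLUSTER_SIZE) * CLUSTER_SIZE  # round up
    # Alignment data: current content of the affected clusters, zero-filled
    # where the file is shorter than the aligned range (assumption).
    buffer = bytearray(existing[start:end].ljust(end - start, b"\0"))
    # Overlay the new page data at its position inside the aligned buffer.
    buffer[offset - start:offset - start + len(data)] = data
    return start, bytes(buffer)


old_file = b"\x01" * 8192
start, staged = build_aligned_buffer(old_file, 512, b"\xff" * 512)
print(start, len(staged))
```

The returned buffer is what would be appended to the staging file, so that the later extent duplication can operate on whole clusters.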


BLOCK BLOB – PUT BLOCK

• Blocks are uploaded individually first.
• Followed by a Put-Block-List call later.
• The Put Block operation does not alter the composition of the blob and does not change its existing stream map.
• Concurrent Put Block operations are allowed. However, block sharing across different blobs is not allowed.
• Crash consistency is achieved via an additional log write (not shown).
• Service needs to maintain an uncommitted blocks LV column in the DB:
  • To persist the Azure Block Id (64 bytes) to Chunk Id mapping for uncommitted blocks.
  • To search uncommitted blocks in a performant way at Put-Block-List (or at recovery).
  • To enforce/implement the GC policy required by the Blob API.

[Figure: Put Block data flow between WFE, Blob Service, and ChunkStore.]
(1) Put Block X for Blob-A
(2) Put Block (Block Id + Data)
(3) Insert Chunk, Get Chunk-Id
(4) Insert Blob Entry (if needed)
(5) Insert BlockId, Chunk-Id entry

Blobs Table: Blob-A | Metadata | ... | Stream-Id | Committed Blocks | UnCommitted Blocks
Uncommitted Blocks Table: Block X Id, Chunk-Id
ChunkStore: Stream Map Container; Block Data Container (Chunk X)
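
Steps (3) to (5) of the Put Block flow can be modeled with plain in-memory structures. In this sketch a list stands in for the ChunkStore block data container and dictionaries stand in for the DB tables. All names are hypothetical, and the additional log write that provides crash consistency is deliberately left out.

```python
class BlockBlobStore:
    """Toy model of the Put Block path (no logging, no persistence)."""

    def __init__(self):
        self.chunks = []       # stands in for the ChunkStore block data container
        self.blobs = {}        # Blobs table: blob name -> record
        self.uncommitted = {}  # Uncommitted Blocks table: (blob, block id) -> chunk id

    def put_block(self, blob_name, block_id, data):
        # (3) Insert the chunk and obtain its chunk ID.
        self.chunks.append(bytes(data))
        chunk_id = len(self.chunks) - 1
        # (4) Insert a blob entry if the blob is not known yet.
        self.blobs.setdefault(blob_name, {"stream_id": None})
        # (5) Record the block ID to chunk ID mapping as uncommitted.
        self.uncommitted[(blob_name, block_id)] = chunk_id
        return chunk_id


store = BlockBlobStore()
store.put_block("Blob-A", "X", b"block x data")
# The blob's composition is untouched: there is still no stream map.
print(store.blobs["Blob-A"], store.uncommitted)
```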


BLOCK BLOB – PUT BLOCK LIST

• Concurrent Put Block List calls to the same block blob are serialized by the blob service.
• Put Block List modifies the metadata as well, not just the composition of the blob.
• Put Block List searches among the committed and uncommitted sets, then allocates and inserts a stream map into ChunkStore.
• The blob entry in the Blobs table has the "current" stream map ID. This shows the current composition of the blob.
• Crash consistency is achieved by additional log writes (not shown).
• The Committed Block List LV (long value) column is needed:
  • To efficiently search referenced committed blocks at Put Block List.
  • To GC previously committed blocks that are no longer needed after a modifying Put-Block-List.

[Figure: Put Block List data flow between WFE, Blob Service, and ChunkStore.]
(1) Put Block List { X , Y } for Blob-A
(2) Put Block List (List + Metadata)
(3) Lookup Chunks
(4) Add new Stream Map
(5) Move remaining Chunks for GC
(6) Create Committed Blocks Table
(7) Update Stream-Id and Metadata

Blobs Table: Blob-A | Metadata | ... | Stream-Id | Committed Blocks | UnCommitted Blocks
Uncommitted Blocks Table and Committed Blocks Table entries shown: Block X Id, Chunk-Id; Block Y Id, Chunk-Id; Block Z Id, Chunk-Id; ...
ChunkStore: Stream Map Container (Stream Map 1); Block Data Container (Chunk X, Chunk Y)

Presenter
Presentation Notes
If a new Put-Block-List command modifies the composition, the Stream-Id is atomically changed within a DB transaction. The old Stream-Id is moved for GC. GETs have their isolated references to stream maps (not shown). A snapshot clones and inserts new stream maps (not shown). An LV column can hold up to 2 GB, though we use much less.
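
The Put Block List steps, including the handling of the old stream map, can be modeled in memory as below. This is a sketch under stated assumptions: the class and field names are hypothetical, a block ID is looked up among uncommitted blocks first and committed blocks second, and the DB transaction and log writes mentioned above are not modeled.

```python
import itertools


class BlockBlob:
    """Toy model of one block blob with stream maps and a GC queue."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.chunks = {}       # chunk id -> data
        self.stream_maps = {}  # stream id -> ordered list of chunk ids
        self.uncommitted = {}  # block id -> chunk id
        self.committed = {}    # block id -> chunk id
        self.stream_id = None  # "current" stream map of the blob
        self.gc_queue = []     # entries waiting for garbage collection

    def put_block(self, block_id, data):
        chunk_id = next(self._ids)
        self.chunks[chunk_id] = bytes(data)
        self.uncommitted[block_id] = chunk_id

    def put_block_list(self, block_ids):
        # (3) Look up the chunk for every listed block.
        resolved = []
        for block_id in block_ids:
            if block_id in self.uncommitted:
                resolved.append((block_id, self.uncommitted[block_id]))
            elif block_id in self.committed:
                resolved.append((block_id, self.committed[block_id]))
            else:
                raise KeyError(block_id)
        # (4) Add a new stream map describing the new composition.
        new_stream = next(self._ids)
        self.stream_maps[new_stream] = [chunk for _, chunk in resolved]
        # (5) Move chunks that are no longer referenced to GC.
        keep = {chunk for _, chunk in resolved}
        known = set(self.uncommitted.values()) | set(self.committed.values())
        for chunk_id in sorted(known - keep):
            self.gc_queue.append(("chunk", chunk_id))
        if self.stream_id is not None:
            self.gc_queue.append(("stream", self.stream_id))
        # (6) and (7) Replace the committed set and switch the stream ID.
        self.committed = dict(resolved)
        self.uncommitted = {}
        self.stream_id = new_stream

    def read(self):
        return b"".join(self.chunks[c] for c in self.stream_maps[self.stream_id])


blob = BlockBlob()
blob.put_block("X", b"aa")
blob.put_block("Y", b"bb")
blob.put_block_list(["X", "Y"])
print(blob.read())
```

Re-committing the same blocks in a different order only adds a new stream map; the data chunks themselves are never rewritten.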

Blob Service Crash Consistency


Example Crash Consistency Approach

[Figure: state diagram with states 1 to 7. Each state shows the versions of the data chunks (D0/D1), blob record (M0/M1), stream maps (S0/S1), and GC record (GC0/GC1). Transition labels shown: Insert Data Chunk; Overwrite? (Yes/No); Update blob with block ID; Insert stream map for overwritten block; Update blob with block ID, update GC with stream map; Commit Transaction; Orphaned chunk; Periodic GC; Full GC.]

Legend: D = data chunks state; S = stream maps state; M = blob record state; GC = GC record state. Red = in-memory entity; blue = on-disk entity. Further markers denote a modified entity, a crash, and GCing of chunks/streams.

State 1
• State: a valid metadata blob record that might or might not have a pre-existing blob record (M0), a valid stream map state that might have a stream for the existing blob, or no stream in case the blob does not exist (S0), and a valid GC metadata table that has no records related to this blob (GC0).
• Crash just before normal action: we remain in this state.
• Normal action: if the operation conditions succeed, insert the new data chunk into the chunk store. This takes us to state 2. Otherwise, if the operation conditions fail, remain in this state.
• Coverage ID: 1

State 2
• State: a valid metadata blob record (M0) that is not yet updated with the valid inserted data chunk ID (D1).
• Crash just before normal action: on-demand full GC. For more details about the full GC, please refer to the full GC crash consistency section.
• Normal action: if there is no uncommitted block with the same block ID, begin a transaction to update the metadata store with the newly inserted data chunk ID. This takes us to state 3. Otherwise, if there is an uncommitted block with the same block ID, create a stream map for it. This takes us to state 5.
• Coverage ID: if blob exists: 2, 301; else 2, 100

State 3
• State: an in-memory metadata update for the blob record (M1) that is updated with the valid data chunk ID (D1).
• Crash just before normal action: on-demand full GC. For more details about the full GC, please refer to the full GC crash consistency section.
• Normal action: commit the DB transaction. This takes us to state 4.
• Coverage ID: 101, 102

State 4
• State: a valid metadata update for the blob record (M1) that is updated with the valid inserted data chunk ID (D1).
• Crash just before normal action: will remain in this state.
• Normal action: this is a successful terminal state.
• Coverage ID: 103
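
States 1 to 4 can be exercised with a small simulation. A crash after the chunk insert but before the metadata commit leaves an orphaned chunk, which a later full GC removes. The class, the `crash_before_commit` flag, and the GC rule are simplifications invented for this sketch. The overwrite path through states 5 to 7 is not modeled.

```python
class Store:
    """Toy store with on-disk chunks and an on-disk blob metadata table."""

    def __init__(self):
        self.chunks = {}    # on-disk chunk store: chunk id -> data
        self.metadata = {}  # on-disk blob records: (blob, block id) -> chunk id
        self._next_id = 0

    def put_block(self, blob, block_id, data, crash_before_commit=False):
        # State 1 -> 2: insert the data chunk (D1); metadata is still M0.
        chunk_id = self._next_id
        self._next_id += 1
        self.chunks[chunk_id] = data
        # State 2 -> 3: build the metadata update in memory (M1).
        pending = dict(self.metadata)
        pending[(blob, block_id)] = chunk_id
        if crash_before_commit:
            # Crash: the in-memory update is lost and the chunk is orphaned.
            return None
        # State 3 -> 4: commit the transaction; M1 is now on disk.
        self.metadata = pending
        return chunk_id

    def full_gc(self):
        """Delete chunks that no blob record references; return their IDs."""
        live = set(self.metadata.values())
        orphans = [cid for cid in self.chunks if cid not in live]
        for cid in orphans:
            del self.chunks[cid]
        return orphans


store = Store()
store.put_block("Blob-A", "X", b"data", crash_before_commit=True)
print(store.metadata, store.full_gc())  # metadata unchanged, orphan reclaimed
store.put_block("Blob-A", "X", b"data")
print(store.metadata)                   # block now committed
```

In both outcomes the metadata only ever references chunks that exist, which is the invariant the state table is checking.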


Blob Service HA Model


Blob Service HA model on WSFC & CSVFS

Multiple instances of Blob Service, each running on a single physical machine.

Blobs are stored in a cluster file system (CSVFS).

Blob namespace is partitioned among the blob service instances.

Each Blob Service maintains a partition mapping table.

CSV volumes can move between nodes to remain highly available.

A Blob client maintains a mapping of the partition ID to the node name on which the partition is hosted.

The mapping can change due to node failover and CSV failover.

Presenter
Presentation Notes
Only one Blob Service instance is responsible for a particular partition, as defined by CSV ownership. In the diagram, the Blob Service on Node1 is responsible for Partition1 and Partition2. Each Blob Service maintains a partition mapping table that indicates on which CSV volume each partition lies.
For example, when Node1 crashes, CSV1 and CSV2 move to Node2/Node3. When this happens, the Blob Service instances running on those nodes become responsible for all the partitions on those two CSV volumes. For example, if CSV1 moves to Node2 and CSV2 to Node3, then Partition1 is now owned by Node2 and Partition2 by Node3.
A Blob client maintains a mapping of the partition ID to the node name on which the partition is hosted. The mapping can change due to node failover and CSV failover, because a partition can now be hosted on a different node. The client is designed to detect when the mapping changes and connect to the correct Blob Service instance; it has retry and reconnect logic.
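The ownership and retry behavior described in these notes can be sketched in Python. This is a toy model under stated assumptions, not the ACS client: `Cluster`, `BlobClient`, `NotOwner`, `fail_node`, and `request` are hypothetical names, Partition1 is assumed to live on CSV1 and Partition2 on CSV2, and the client refreshes its cache by asking the cluster model directly, since the slides do not say how the real client discovers the new owner.

```python
class NotOwner(Exception):
    """Raised by a node that does not currently own the requested partition."""


class Cluster:
    """Partition -> CSV volume is fixed; CSV volume -> node changes on failover."""

    def __init__(self, partition_to_csv, csv_to_node):
        self.partition_to_csv = dict(partition_to_csv)
        self.csv_to_node = dict(csv_to_node)

    def owner(self, partition):
        # Partition ownership follows ownership of the CSV volume it lies on.
        return self.csv_to_node[self.partition_to_csv[partition]]

    def fail_node(self, failed, new_owners):
        """Move each CSV hosted on the failed node to the node in new_owners."""
        for csv, node in list(self.csv_to_node.items()):
            if node == failed:
                self.csv_to_node[csv] = new_owners[csv]

    def serve(self, node, partition):
        if self.owner(partition) != node:
            raise NotOwner(partition)
        return f"{partition} served by {node}"


class BlobClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}  # partition ID -> node name

    def request(self, partition, max_retries=3):
        for _ in range(max_retries):
            node = self.cache.get(partition)
            if node is None:
                node = self.cache[partition] = self.cluster.owner(partition)
            try:
                return self.cluster.serve(node, partition)
            except NotOwner:
                del self.cache[partition]  # stale mapping: refresh and retry
        raise RuntimeError(f"no owner found for {partition}")


cluster = Cluster(
    {"Partition1": "CSV1", "Partition2": "CSV2"},
    {"CSV1": "Node1", "CSV2": "Node1"},
)
client = BlobClient(cluster)
before = client.request("Partition1")
cluster.fail_node("Node1", {"CSV1": "Node2", "CSV2": "Node3"})
after = client.request("Partition1")
```

The client only learns about a move when a request to the cached node is rejected, which keeps the common path free of lookups at the cost of one extra round trip after a failover.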

Questions?


Appendix


Differences against Azure Storage

No Standard_RAGRS or Standard_GRS

Premium_LRS: no performance limits or guarantees

No Azure Files yet

Usage meter differences:
No IaaS transactions in Storage Transactions
No internal vs. external traffic distinction in Data Transfer

No account type change, or custom domains

Presenter
Presentation Notes
Point-in-time view. Certain differences exist in the scope of storage manageability functionality: for example, changing the account type is not supported, custom domains are not supported, and only API-level consistency is offered for the Premium_LRS storage account type.

Blob Service Metadata Persistence Model

Stream Maps being evaluated to move to DB

Single table with mixed row/blob types


ACS HA Model

Service Fabric is used by WFE, TM/TS, WAC, and SRP instances for failover, load balancing, and deployment/upgrade.

All ACS service roles are co-locatable.

SRP collocates with other RPs in the same cluster.

Compute cluster VM health monitoring complements in-guest Service Fabric by providing VM-level recovery.

The Blob service runs as an HA service on the SSU cluster, directly on physical nodes, as a peer of SMB. It monitors CSV movements in the cluster and attaches to or detaches from the volumes accordingly.

The Blob service does not fail over; it runs in a multiple-active configuration.

[Figure: ACS HA model diagram. Labels: CSU, CSU, SSU; Failover Cluster with four SSU nodes, each running Blob Service; Service Fabric Cluster + VM Monitoring with HA VMs hosting the WAC, WFE, SRP, TS, TM, Metric, and Health roles; Scale Unit; legend: Active, Virtual, Physical.]

Presenter
Presentation Notes
Map components to their clusters. WinFabric provides a uniform way for deployment and update.
SRP, WAC, TM: Active -> Secondary; leverage WinFabric failover.
WFE: stateless, load balanced by SLB.
TS: WinFabric partitioned service, leveraged for failover and load balancing.
Blob service: resource group relocation.
Notes: We use WinFabric for WFE/SRP for deployment model uniformity and to leverage update domains.