

2016 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.

Azure-consistent Object Storage in Microsoft Azure Stack

Ali Turkoglu Principal Software Engineering Manager

Microsoft

Mallikarjun Chadalapaka Principal Program Manager

Microsoft


Agenda

Context, Solution Stack, Architecture, ARM & SRP

ACS Architecture Deep Dive

Blob Service Architecture & Design

Questions/Discussion


Azure-consistent storage

Cloud storage for Azure Stack

Azure-consistent blobs, tables, queues, and storage accounts

Administrator manageability

Enterprise-private clouds or hosted clouds from service providers

IaaS (page blobs) + PaaS (block blobs, append blobs, tables, queues)

Builds on & enhances WS2016 Software-Defined Storage (SDS) platform capabilities

Presenter
Presentation Notes
Integral part of Microsoft Software-Defined Storage (SDS) vision - highly-reliable, resilient, and scalable cloud storage based on standard hardware, or can just as well leverage existing SAN investments

Azure-consistent storage: Solution view

[Diagram labels]
Administrator: Microsoft Azure Stack Portal, Azure Storage cmdlets, ACS cmdlets, Azure CLI, Client SDK
Application clients using Azure Account, Blob, Table, Queue APIs, Microsoft Azure Storage Explorer & Tooling
Virtualized services: Resource Provider Cluster; Data services Cluster; Tenant-facing storage cloud services
Infrastructure services: Scale-Out File Server (SOFS) with Storage Spaces Direct (S2D); Blob back-end (one per node)


Clustering Architecture

Azure Service Fabric (ASF) guest clusters for cloud services

Hyper-converged Windows Server Failover Cluster (WSFC) host fabric

WSFC enhances ASF cluster resiliency
  HA VMs via Hyper-V host clustering
  Anti-affinity policies on VMs to ensure all VMs never fail over to the same host
  Application health monitoring on Service Fabric service for timely detection of service hang situations

WSFC & CSVFS* provide the basis for the blob service HA model
  *Cluster Shared Volume File System



Relating Azure Storage Concepts

[Diagram labels: Subscription, Resource Group, Storage Account, Container, Table, Queue, Block Blob, Page Blob, Append Blob]

Presenter
Presentation Notes
Subscription
  One instance of an accepted “Offer”
  May contain different types of resources: Compute, Network, Storage, Web Sites, …
  Aggregation point for usage data reporting & billing
Resource Group
  Grouping of logically related resources, typically together to deliver a single business service, e.g. “My PayRoll Service”
Storage Account
  Top-level home for a set of storage objects
  Determines the starting URL for all contained objects, e.g. https://myaccount.blob.myserviceprovider.net/...
  Two keys are tied to the account to allow symmetric Shared Key authentication for account owners
Container
  Container for a group of logically related blobs within a storage account
Blob
  Objects with data, metadata and properties
  Can be block blobs (optimized for streaming) or page blobs (optimized for random read/write access and access of byte ranges)
  Signed or public access with Shared Access Signature authorization
Table
  Structured NoSQL data storage object
  Signed access with Shared Access Signature authorization
  Collection of entities (rows)

ARM & Resource Providers

Azure Resource Manager (ARM) in Azure Stack
  Azure-consistent management
  Clients use REST, PowerShell (PS), or Portal

Resource Provider (RP) manages a type of infra
  Plug-in to ARM
  Compute (CRP), Network (NRP), Storage (SRP), …

Users express desired state via templates
  Template = declarative statement
  ARM: necessary orchestration + imperative directives to RPs

Presenter
Presentation Notes
Resource Group reference

Storage Resource Provider

Azure-consistent storage Management Model

[Diagram labels]
Azure Resource Manager; CRP; ACS data path services
Tenant Resources; Admin Resources
Microsoft Azure Stack Portal, and ACS cmdlets
Microsoft Azure Stack Portal, Azure Storage cmdlets, Azure CLI, and Client SDK


IaaS VM Storage

All VM storage in Azure Stack resides in blob store

Every OS or Data Disk is a page blob
  Page blob is a ReFS file
  Starts in REST API access mode

CRP and SRP collaborate behind the scenes
  Page blob toggles to ‘SMB-only’ at VM run time
  Super-optimized Hyper-V-over-SMB I/O path

Presenter
Presentation Notes
Lead with this value prop for the section

ACS Architecture Deep Dive


Key Requirements and Challenges for Object Storage on MAS

Atomicity guarantees

Data consistency guarantees

Immutable blocks

Snapshot-isolated copy for reads

512-byte page alignment for page blobs

Distributed List Blob (enumeration)

Durability: synchronous 3-copy replication

Scalable to millions of objects

99.9% high availability for read/write

Fault tolerance & crash consistency

No performance regression relative to Hyper-V over SMB

Adapt to smaller cloud scale

Presenter
Presentation Notes
Atomicity guarantees: 4 MB Put Page, 4 MB Put Block.
Data consistency guarantees: after a successful atomic write, any read after that write gets the latest value.
Immutable blocks that allow re-composition.
Snapshot-isolated copy for read operations.
Historical 512-byte page alignment for page blobs (vs. ReFS 4 KB cluster size).
Expensive List Blob/enumeration, including blob metadata.
Durable: synchronously stores 3 replicas before a write completes.
Scalable: up to millions of blobs under a container and millions of containers.
Highly available: 99.9% read/write for local region.
Built with fault tolerance in mind at every component. Verified crash-consistent implementation.
Specific to MAS architecture: Hyper-V access to page blobs over SMB is required.
Public vs. private cloud scale differences: e.g. the Azure minimum deployment stamp is ~80 racks. Typical small-scale on-premises deployment needs 4 nodes – few racks.

ACS Architecture

SRP (Storage Resource Provider): Integrates with ARM and exposes resource management REST interfaces for the storage service overall.

FE: Provides the REST interface Front End, consistent with Azure.

WAC: Storage account management, user request authentication, and container operations.

Blob Service: Implements the blob service backend. Stores block and page blob data in file system/Chunk Store, and metadata in ESENT DB.

Table Master: Maps user table to database/TS instances.

Table Server: Handles table query & transactions in databases.

Storage: SOFS exposes a unified view of all tiered storage to compute nodes as CA shares. Provides fault tolerance & local replication.

[Diagram labels, SSU Node 1 and SSU Node 2]
Virtual Services: Load Balancer, WFE / FE (Front End), Account/Container Service, SRP, Table Master, Table Server
Physical Services: Blob Service, SOFS, CSV (ReFS), Storage Spaces & Pools, Shared or S2D DAS Disks, HA Clustering
Stores: WAC DB, TM DB, TS DB, BLOB DB, ChunkStore File, Page Blob File
Transports: HTTP, RPC, SMB, SMB/RPC

*Key interactions between the components are shown

Presenter
Presentation Notes
Remove Calabria references!

Blobs - Semantic Requirements: See MSDN

BLOCK BLOBS

Client uploads individual immutable blocks with PUT-BLOCK for future inclusion in a block blob. Block size may be up to 4 MB. A blob can have up to 100,000 uncommitted blocks. Maximum size of the uncommitted block list is 400 GB.

Followed by a PUT-BLOCK-LIST call to assemble the blob. Maximum size supported for a Block Blob is 200 GB and 50,000 committed blocks.

Blocks must retain their identity to permit later PUT-BLOCK-LIST calls to re-arrange them.

Unused blocks are lazily cleaned up after a PUT-BLOCK-LIST request. In the absence of PUT-BLOCK-LIST, uncommitted blocks are garbage collected after 7 days.

Blob names are case-sensitive, at least one character long, and at most 1024 characters.

All Blob operations guarantee atomicity: an operation either happened as a whole or did not happen at all. There is no undetermined state at failure.

For Block Blobs, each GET-BLOB request gets a snapshot-isolated copy of the Blob data (or the request fails if this cannot be accommodated).

PAGE BLOBS

Client creates a new empty page blob by calling PUT-BLOB. A page blob starts as sparse and its size can be up to 1 TB.

Random read/write access

Client then calls PUT-PAGE to add content to the Page Blob. The PUT-PAGE operation writes a range of pages to a Page Blob. The PUT-PAGE operation must guarantee atomicity.

Calling Put Page with the Update option performs an in-place write on the specified page blob. Any content in the specified page is overwritten with the update.

Calling Put Page with the Clear option releases the storage space used by the specified page. Pages that have been cleared are no longer tracked as part of the page blob.

Each range of pages submitted with Put Page for an update operation may be up to 4 MB in size. The start and end of the range must be aligned with 512-byte boundaries.
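The Put Page limits above can be checked mechanically. The following is a minimal sketch, not the ACS implementation: the function name and error messages are hypothetical, and it assumes the inclusive byte-range convention of the Azure REST API (for example `bytes=0-511`), so the end offset is one less than a multiple of 512.

```python
PAGE_SIZE = 512                        # page alignment unit from the slide
MAX_PUT_PAGE_BYTES = 4 * 1024 * 1024   # 4 MB limit per Put Page update


def validate_put_page_range(start: int, end: int, blob_size: int) -> int:
    """Validate an inclusive byte range [start, end]; return its length.

    Hypothetical helper: raises ValueError when the range breaks one of
    the Put Page rules listed on the slide.
    """
    if start < 0 or end < start:
        raise ValueError("range must satisfy 0 <= start <= end")
    if start % PAGE_SIZE != 0:
        raise ValueError("start offset must be a multiple of 512")
    if (end + 1) % PAGE_SIZE != 0:
        raise ValueError("end offset must be one less than a multiple of 512")
    length = end - start + 1
    if length > MAX_PUT_PAGE_BYTES:
        raise ValueError("an update range may not exceed 4 MB")
    if end >= blob_size:
        raise ValueError("range extends past the end of the page blob")
    return length
```

For example, `validate_put_page_range(0, 511, 1024)` accepts a one-page write, while a range of `0` to `512` is rejected because its end is not page-aligned.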


Azure Blobs Object API: See MSDN for details

Common Blob Operations
  Put Blob, Get Blob, Get/Set Blob Properties, Get/Set Blob Metadata, Lease Blob, Snapshot Blob, Copy Blob, Abort Copy Blob, Delete Blob

Operations on Block Blobs
  Put Block, Put Block List, Get Block List

Operations on Append Blobs
  Append Block

Operations on Page Blobs
  Put Page, Get Page Ranges

Presenter
Presentation Notes
Properties: Content Type, Encoding, request id, origin, content disposition, etc.
Metadata: Arbitrary metadata in name:value pairs.
Append Block: commits a new block of data to the end of an existing append blob.

Block Blob Object Design Challenges on Traditional File System

Why not implement Block Blobs as file objects?

Isolation/atomicity and unique composition requirements are the key offenders.

Lack of any form of acceptable atomicity support on NTFS.

Rename/Create path on the file system is prohibitively expensive.

API semantics do not map to files: immutable vs. random access.

Enumeration & rich metadata operations require an index, a DB, and transactions.

Namespace should be in the database, not in the file system, to meet scale demands and other requirements.

Kernel mode filter driver complexity. Need for index & transaction support.

Presenter
Presentation Notes
Dedup integration complexity.
SMB access to block blobs is not needed.
A kernel mode filter driver would have been needed to implement stream maps and atomicity, but even then we would still need an index and transactional DB in kernel mode, thus increasing the cost and complexity.

ACS Service Design Principles

Keep it simple
  Achievable limit in V1.

More than API consistency

Build on available technologies and not reinvent the wheel
  Depend on SOFS (with ReFS, Storage Spaces Direct), CSVFS
  Use ESE/NT
  Leverage Dedup Chunk Store
  ServiceFabric/WSFC for high availability, scaling out and load balancing.

Presenter
Presentation Notes
Keep it simple
  Choose a good achievable limit in V1.
  Build a solid base and iterate/optimize from there.
More than API consistency
  Guaranteed strong data consistency, durability, and transaction.
  Build with high availability and fault tolerance in every component.
  Design for management & architecture simplicity.
Build on available technologies and not reinvent the wheel
  Depend on SOFS (with ReFS, Storage Spaces Direct), CSVFS for highly available, redundant storage. No application-level data replication.
  Use ESE/NT (JetDB) engine for table data and blob metadata storage, transaction and indexing. Proven through Exchange deployments.
  Leverage Dedup Chunk Store for block blob data storage/implementation.
  Build on ServiceFabric/WSFC for high availability, scaling out and load balancing.

Key Design Decisions for Blob Service

ReFS for Page Blobs: Snapshots/Extent Cloning, 4 MB atomic write

Page Blob and Block Blob share the same service.
  Share the DB/Metadata
  Block and page blobs are under the same container

Page Blobs as files
  Highly optimized data path for Hyper-V over SMB

Block Blobs as ChunkStore objects
  Block Blobs are not files, but immutable chunks

RPC as data transport from WFE to Blob service
  WFE cannot write directly to ChunkStore

Presenter
Presentation Notes
ReFS is used for Page Blob implementation for
  Page Blob snapshots
  Page Blob 4 MB atomic write via ReFS duplicate extent (FSCTL_DUPLICATE_EXTENTS_TO_FILE)
  Page Blob integrity stream requirements
Blob service should be deployed on physical nodes because
  Blob service HA model can be designed to use WSFC and CSVFS
  ReFS volume/VHDX MUST not be attached directly to a VM, to meet the access requirement for Hyper-V.
Page Blob and Block Blob implementations must be in the same service to
  Share the DB implementation and metadata operations
  Meet the REST API requirement to have block and page blobs that could be under the same container.
Block Blobs could be stored on an NTFS (in addition to ReFS) volume for the following reasons
  ESE/NT DB provides integrity support and is proven to work on NTFS volumes.
  ChunkStore provides integrity support and is proven to work on NTFS volumes.
  Block blob snapshots are implemented as cloning stream maps.
SMB is the data transport for Page Blobs because
  Page Blobs are designed as (backed by) files
  Compute requires SMB access to page blobs
RPC is chosen as data transport (WFE to Blob service) for Block Blobs because
  WFE cannot write directly to ChunkStore container files. Extending the ChunkStore implementation for this was deemed costly.
  Alternatively, staging Put Blob/Block writes via a file would cause an additional write penalty.
Block

Block Blob Service Design

Global namespace exists in the DB.

User mode only access/store.

No file system access to the block blobs; neither for namespace nor for data.

Use ChunkStore implementation to store committed and uncommitted blocks, and the stream maps.

Design for Azure parity from the get-go.

Azure blob metadata is stored in ESE/NT DB.
  To optimize metadata-only operations
  To implement Blob API GC semantics.

DB scope is a set of containers and their blobs' metadata.


Why ChunkStore?

Immutable file containers for chunk and stream maps.

Append-only chunk insertion design

Part of Deduplication feature, proven.

Integrity checks for detecting page corruptions, and read from a replica for in-place patching.

Shallow and deep garbage collection.

Data chunks shared by root blob and snapshots (deduplication)

Guarantee crash-consistent file commit (precise order of operations guarantees data integrity at each stage)

Various chunk size support (up to 4 MB)

Efficient self-contained chunk id referencing


Why ESE/NT?

ESE
  Extensible Storage Engine (ESE), also known as JET Blue, is an ISAM (Indexed Sequential Access Method) data storage technology. It provides transacted data update and retrieval.

Transactions & Index
  The ESE database engine provides Atomic Consistent Isolated Durable (ACID) transactions.

Logging and crash recovery
  ESE has write-ahead logging and snapshot isolation for guaranteed crash recovery. The application-defined data consistency is honored even in the event of a system crash.

Good Backup/Restore support
  ESE supports online backup where one or more databases are copied, along with log files, in a manner that does not affect database operations.

Page corruption detection and read from replica and patch in place (FSCTL_MARK_HANDLE / MARK_HANDLE_READ_COPY)
  Automatic DB scan and corruption detection
  ESENT engine self-detecting and auto-correcting checksum errors on a Jet database stored on a Spaces triple-replica remotely accessed via SMB/CSVFS

Used at scale in Exchange workloads.

Presenter
Presentation Notes
A transaction is a logical unit of processing delimited by BeginTransaction and CommitTransaction, or Rollback, operations. It allows applications to retrieve data only from reliable data states and maintains data consistency in the event of an unexpected process termination or system shutdown. An index is a persisted ordering of records in a table. Indexes are used for both sequential access to rows in the order defined, and for direct record navigation based on indexed column values.

Page Blob Service Design

Global namespace exists in the DB.

Page blobs are files stored in a ReFS volume.

Blob is a linear mapping to the file. There is no stream map.

Page blobs also use the DB to store Azure blob metadata.

To enable exclusive in-place access (direct via SMB) for Hyper-V, check-out and check-in semantics are supplied.

No concurrent REST vs. file system access.

Presenter
Presentation Notes
Page blob access is through DB lookup. There is a blob name to file name mapping, and lookup. Global namespace exists in the DB.
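The name lookup and the check-out/check-in rule can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class names, method names, and the use of `PermissionError` are hypothetical, and a real service would keep this state in the metadata DB rather than a Python dict.

```python
from enum import Enum


class AccessMode(Enum):
    REST = "rest"          # blob is served through the Blob service
    SMB_ONLY = "smb-only"  # blob is checked out for in-place Hyper-V access


class PageBlobCatalog:
    """Toy blob-name to file-name mapping with exclusive access modes."""

    def __init__(self):
        self._entries = {}  # blob name -> [file name, AccessMode]

    def create(self, blob_name: str, file_name: str) -> None:
        # New page blobs start in REST access mode.
        self._entries[blob_name] = [file_name, AccessMode.REST]

    def resolve_for_rest(self, blob_name: str) -> str:
        file_name, mode = self._entries[blob_name]
        if mode is AccessMode.SMB_ONLY:
            raise PermissionError(f"{blob_name} is checked out for compute")
        return file_name

    def check_out(self, blob_name: str) -> str:
        entry = self._entries[blob_name]
        if entry[1] is AccessMode.SMB_ONLY:
            raise PermissionError(f"{blob_name} is already checked out")
        entry[1] = AccessMode.SMB_ONLY
        return entry[0]  # file that compute opens directly over SMB

    def check_in(self, blob_name: str) -> None:
        self._entries[blob_name][1] = AccessMode.REST
```

Keeping the mode next to the name mapping means a single lookup answers both "which file backs this blob" and "may REST touch it right now".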

Page Blobs - Compute vs. REST data access path

Blob REST access is via the Blob service.

Compute page blob access (Hyper-V accessing VHDs) is direct to the file over SMB, same path as today with RDMA etc.

Once a blob is “checked out” for compute access, it is not accessible through REST.

[Diagram labels: CSU (Hyper-V Compute); SSU Node (WFE, BLOB Service, SOFS, Blob Files on ReFS); SMB Compute In-Place Page Blob VHD Access]


Blob Service Design Data Flow & Representation


PAGE Blob Design on ReFS

Single log file per volume.

Append-only writes to staging log with control

ReFS duplicate extents

Final Commit call updates metadata

Crash consistency via staging log write, ReFS extent duplication.

Check-pointing to manage space consumption and recovery.

[Diagram: Put Page data flow between Blob Service Frontend and Blob Service Backend]
(1) HTTP Put Page Z to PageBlobC
(2) Put Page Z using RPC to Blob Service hosting PageBlobC
(3) Look up filename for PageBlobC
(4) Build in-memory buffer with alignment data from FilePageBlobC and unaligned RPC data, then append buffer to shared staging file
(5) Duplicate extent for Page Z from shared staging file to FilePageBlobC
(6) Update metadata for PageBlobC

Shared staging file: Header | Page X of PageBlobA | Page Y of PageBlobB | Page X+1 of PageBlobA | Page Z of PageBlobC | Unused
FilePageBlobC: Page X | Unset | Page Y | Page Z | Unset

Blobs table in metadata store:
Blob Name | Filename | ... | Metadata
PageBlobA | FilePageBlobA | ... | MetadataA
PageBlobB | FilePageBlobB | ... | MetadataB
PageBlobC | FilePageBlobC | ... | MetadataC

Presenter
Presentation Notes
Blob Service backend maintains its private log per managed volume to “stage” page writes.
Single log for all page writes within a single volume. Arbitrary log write size exceeds the 4 MB REST requirement.
Append-only writes to the staging log with control, request parameters and page write data use standard logging techniques to achieve atomic log writes (checksum/CRC, log replay).
ReFS duplicate extents feature is used to atomically insert write data from the staging log into the target page blob.
Final Commit call updates metadata and returns ETAG/LastModified.
Crash consistency is achieved via staging log write, ReFS extent duplication, and sequenced metastore update, with log replay from the last valid checkpoint on a service restart.
Staging log uses checkpointing to manage space consumption on the volume and volume restart recovery.
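The staged write path can be imitated in a few lines to make the alignment step concrete. This is a sketch under stated assumptions and not the service code: the staging log is a Python list, the log record layout (start offset, buffer, CRC) is invented for illustration, and a slice assignment stands in for the ReFS duplicate-extents call. Only the 512-byte page size and 4 KB ReFS cluster size come from the slides.

```python
import zlib

PAGE = 512      # page blob alignment from the REST API
CLUSTER = 4096  # ReFS cluster size quoted in the notes


def build_aligned_buffer(blob_file, offset, data):
    """Widen a page-aligned write to cluster boundaries.

    The edges of the buffer are filled from the current file content,
    mirroring step (4) of the diagram.
    """
    if not data or offset % PAGE or len(data) % PAGE:
        raise ValueError("writes must be non-empty and 512-byte aligned")
    start = offset - offset % CLUSTER
    end = (offset + len(data) + CLUSTER - 1) // CLUSTER * CLUSTER
    if end > len(blob_file):
        raise ValueError("write extends past the pre-sized blob file")
    head = bytes(blob_file[start:offset])
    tail = bytes(blob_file[offset + len(data):end])
    return start, head + bytes(data) + tail


def staged_put_page(blob_file, staging_log, offset, data):
    start, buf = build_aligned_buffer(blob_file, offset, data)
    # Append-only log record with a checksum so replay can detect torn writes.
    staging_log.append((start, buf, zlib.crc32(buf)))
    # Stand-in for duplicating the extent from the staging file, step (5).
    blob_file[start:start + len(buf)] = buf


def replay(blob_file, staging_log):
    """Re-apply valid log records, stopping at the first corrupt one."""
    for start, buf, crc in staging_log:
        if zlib.crc32(buf) != crc:
            break
        blob_file[start:start + len(buf)] = buf
```

Because every record carries whole clusters, replaying the log after a crash is idempotent: applying a record twice leaves the file in the same state.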

BLOCK BLOB – PUT BLOCK

No change in composition of the blob

Concurrent put block operations are allowed.

No block sharing across different blobs

Uncommitted blocks LV column in DB
  Azure Block Id (64 bytes) to Chunk Id mapping
  Efficient search of uncommitted blocks at Put-Block-List (or at recovery)
  For GC policy required by the Blob API

[Diagram: Put Block data flow across WFE, Blob Service, and ChunkStore]
(1) Put Block X for Blob-A
(2) Put Block (Block Id + Data)
(3) Insert Chunk, Get Chunk-Id
(4) Insert Blob Entry (if needed)
(5) Insert BlockId, Chunk-Id entry

Blobs table: Blob-A | Metadata | ... | Stream-Id | Committed Blocks | Uncommitted Blocks
Uncommitted Blocks table: Block X Id, Chunk-Id
ChunkStore: Stream Map Container; Block Data Container (Chunk X)

Presenter
Presentation Notes
Blocks are uploaded individually first, followed by a Put-Block-List call later.
Put Block operation does not alter the composition of the blob and does not change its existing stream map.
Concurrent put block operations are allowed. However, block sharing across different blobs is not allowed.
Crash consistency is achieved via an additional log write (not shown).
Service needs to maintain the uncommitted blocks LV column in the DB:
  To persist the Azure Block Id (64 bytes) to Chunk Id mapping for uncommitted blocks.
  To search uncommitted blocks in a performant way at Put-Block-List (or at recovery).
  To enforce/implement the GC policy required by the Blob API.

BLOCK BLOB – PUT BLOCK LIST

Put Block List modifies metadata/composition

Allocates/inserts a stream map to ChunkStore.

Blob entry has the "current" stream map ID.

Committed Block List LV (long value) column

Snapshots clone and insert new stream maps.

[Diagram: Put Block List data flow across WFE, Blob Service, and ChunkStore]
(1) Put Block List { X , Y } for Blob-A
(2) Put Block List (List + Metadata)
(3) Lookup Chunks
(4) Add new Stream Map
(5) Move remaining Chunks for GC
(6) Create Committed Blocks Table
(7) Update Stream-Id and Metadata

Blobs table: Blob-A | Metadata | ... | Stream-Id | Committed Blocks | Uncommitted Blocks
Uncommitted Blocks table and Committed Blocks table: (Block Id, Chunk-Id) entries for Block X, Block Y, Block Z
ChunkStore: Stream Map Container (Stream Map 1); Block Data Container (Chunk X, Chunk Y)

Presenter
Presentation Notes
Put Block List searches among committed and uncommitted sets and allocates and inserts a stream map to ChunkStore.
If the new Put-Block-List command modifies the composition, Stream-Id is atomically changed within a DB transaction. Old StreamId is moved for GC.
Concurrent Put Block List calls to the same block blob are serialized by the blob service.
Blob entry has the "current" stream map ID. This shows the current composition of the blob.
Crash consistency is achieved by additional log writes (not shown).
GETs have their isolated references to stream maps (not shown).
Snapshots clone and insert new stream maps (not shown).
LV column can hold up to 2 GB, though we use much less.
Committed Block List LV (long value) column is needed
  To efficiently search referenced committed blocks at Put Block List.
  To GC previously committed blocks that are no longer needed after a modifying Put-Block-List.
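The Put Block and Put Block List bookkeeping described above can be sketched as an in-memory model. Everything here is illustrative: the class, its dictionaries, and the GC queue are hypothetical stand-ins for the ESE tables and ChunkStore containers, and the rule that an uncommitted block shadows a committed one with the same id is an assumption, not something the slides state.

```python
import itertools


class BlockBlobStore:
    """Toy model: chunks, stream maps, and per-blob block tables."""

    def __init__(self):
        self._next_chunk = itertools.count(1)
        self._next_stream = itertools.count(1)
        self.chunks = {}       # chunk id -> bytes (block data container)
        self.stream_maps = {}  # stream id -> ordered list of chunk ids
        self.blobs = {}        # blob name -> stream id + block tables
        self.gc_queue = []     # ("chunk" | "stream", id) awaiting GC

    def put_block(self, blob, block_id, data):
        # Insert the chunk first, then record block id -> chunk id.
        chunk_id = next(self._next_chunk)
        self.chunks[chunk_id] = bytes(data)
        entry = self.blobs.setdefault(
            blob, {"stream_id": None, "committed": {}, "uncommitted": {}})
        replaced = entry["uncommitted"].get(block_id)
        if replaced is not None:
            self.gc_queue.append(("chunk", replaced))
        entry["uncommitted"][block_id] = chunk_id
        return chunk_id

    def put_block_list(self, blob, block_ids):
        entry = self.blobs[blob]
        # Assumption: an uncommitted block shadows a committed one.
        known = {**entry["committed"], **entry["uncommitted"]}
        missing = [b for b in block_ids if b not in known]
        if missing:
            raise KeyError(f"unknown block ids: {missing}")
        committed = {b: known[b] for b in block_ids}
        stream_id = next(self._next_stream)
        self.stream_maps[stream_id] = [committed[b] for b in block_ids]
        # Chunks not referenced by the new composition go to GC.
        all_chunks = (set(entry["committed"].values())
                      | set(entry["uncommitted"].values()))
        for chunk_id in sorted(all_chunks - set(committed.values())):
            self.gc_queue.append(("chunk", chunk_id))
        if entry["stream_id"] is not None:
            self.gc_queue.append(("stream", entry["stream_id"]))
        # In the real service this switch happens inside a DB transaction.
        entry.update(stream_id=stream_id, committed=committed, uncommitted={})

    def get_blob(self, blob):
        stream_id = self.blobs[blob]["stream_id"]
        if stream_id is None:
            raise KeyError(f"{blob} has no committed content")
        return b"".join(self.chunks[c] for c in self.stream_maps[stream_id])
```

Re-ordering an existing blob with a second `put_block_list` call creates a new stream map and queues the old one for GC, which is why blocks must keep their identity after commit.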

Blob Service Crash Consistency


Example Crash Consistency Approach

[Figure: state diagram for a blob block write, states 1 to 7. Labels recoverable from the diagram: Insert Data Chunk; Overwrite? (Yes/No); Update blob with block ID; Insert stream map for overwritten block; Update GC with stream map; Commit Transaction; Orphaned chunk; Periodic GC; Full GC; Crash.]

Legend: D = data chunk state; S = stream map state; M = blob record state; GC = GC record state. Red = in-memory entity; blue = on-disk entity. The legend also marks modified entities, crash points, and GCing of chunks/streams.

State 1
State: A valid metadata blob record (M0), which may or may not be a pre-existing blob record; a valid stream map state (S0), which has a stream for the existing blob or no stream if the blob doesn’t exist; and a valid GC metadata table (GC0) that has no records related to this blob.
Crash just before normal action: We remain in this state.
Normal action: If the operation conditions succeed, insert the new data chunk into the chunk store. This takes us to state 2. Otherwise, if the operation conditions fail, remain in this state.
Coverage ID: 1

State 2
State: A valid metadata blob record (M0) that is not yet updated with the valid inserted data chunk ID (D1).
Crash just before normal action: On-demand full GC. For more details about the full GC, please refer to the full GC crash consistency section.
Normal action: If there is no uncommitted block with the same block ID, begin a transaction to update the metadata store with the newly inserted data chunk ID. This takes us to state 3. Otherwise, if there is an uncommitted block with the same block ID, create a stream map for it. This takes us to state 5.
Coverage ID: If blob exists: 2, 301. Else: 2, 100.

State 3
State: An in-memory metadata update for the blob record (M1), which is updated with the valid data chunk ID (D1).
Crash just before normal action: On-demand full GC. For more details about the full GC, please refer to the full GC crash consistency section.
Normal action: Commit the DB transaction. This takes us to state 4.
Coverage ID: 101, 102

State 4
State: A valid metadata update for the blob record (M1), which is updated with the valid inserted data chunk ID (D1).
Crash just before normal action: Will remain in this state.
Normal action: This is a successful terminal state.
Coverage ID: 103
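The write-then-commit ordering in states 1 to 4 can be illustrated with a small in-memory simulation. This is a minimal sketch, not the ACS implementation: `BlobStore`, `SimulatedCrash`, `put`, and `full_gc` are hypothetical names, and it deliberately ignores stream maps and GC records (states 5 to 7). It only shows that a crash between inserting the chunk and committing the metadata leaves the old blob record intact plus an orphaned chunk that a full GC can reclaim.

```python
from dataclasses import dataclass, field
import uuid


class SimulatedCrash(Exception):
    """Raised to model a process crash between two steps (hypothetical)."""


@dataclass
class BlobStore:
    chunks: dict = field(default_factory=dict)    # chunk ID -> bytes ("on disk")
    metadata: dict = field(default_factory=dict)  # blob name -> chunk ID ("on disk")

    def put(self, name, data, crash_before_commit=False):
        # State 1 -> 2: write the data chunk first; the blob record is unchanged.
        chunk_id = uuid.uuid4().hex
        self.chunks[chunk_id] = data
        # State 2 -> 3: stage the metadata update in memory only.
        staged = dict(self.metadata)
        staged[name] = chunk_id
        if crash_before_commit:
            raise SimulatedCrash(name)
        # State 3 -> 4: the commit makes the new blob record durable.
        self.metadata = staged
        return chunk_id

    def full_gc(self):
        # Reclaim chunks that no committed blob record refers to.
        live = set(self.metadata.values())
        orphans = [cid for cid in self.chunks if cid not in live]
        for cid in orphans:
            del self.chunks[cid]
        return orphans
```

Because the chunk is written before the metadata points at it, a crash at any step leaves either the old or the new blob record visible, never a record that refers to a missing chunk.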

Blob Service HA Model

Blob Service HA model on WSFC & CSVFS

Multiple instances, one on each physical machine.

Blobs are stored in a cluster file system (CSVFS).

Blob namespace is partitioned among the blob service instances.

Each Blob Service maintains a partition mapping table.

CSV volumes can move between nodes to remain highly available.

A Blob client maintains a mapping of the partition ID to the node name on which the partition is hosted.

The mapping can change due to node failover and CSV failover.

Presenter
Presentation Notes
Only one Blob Service instance is responsible for a particular partition, as defined by CSV ownership. In the diagram, the Blob Service on Node1 is responsible for Partition1 and Partition2. Each Blob Service maintains a partition mapping table that indicates on which CSV volume each partition lies. For example, when Node1 crashes, CSV1 and CSV2 move to Node2/Node3. When this happens, the Blob Service instances running on those nodes become responsible for all the partitions on those two CSV volumes. For example, if CSV1 moves to Node2 and CSV2 to Node3, then Partition1 is now owned by Node2 and Partition2 by Node3. A Blob client maintains a mapping of the partition ID to the node name on which the partition is hosted. The mapping can change due to node failover and CSV failover, because a partition can now be hosted on a different node. The client is designed to figure out when the mapping changes and connect to the correct Blob Service instance; it has retry and reconnect logic.
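The client-side mapping with retry and reconnect described in these notes can be sketched as follows. This is an illustrative toy under stated assumptions, not the actual client: `Cluster`, `BlobClient`, `NotOwner`, and the string result are hypothetical, and the "cluster" is an in-process object rather than a set of networked nodes.

```python
class NotOwner(Exception):
    """Hypothetical error a node returns when it no longer hosts the partition."""


class Cluster:
    """Toy stand-in for the cluster: partitions live on CSVs, CSVs are owned by nodes."""

    def __init__(self, partition_to_csv, csv_to_node):
        self.partition_to_csv = dict(partition_to_csv)
        self.csv_to_node = dict(csv_to_node)

    def owner_of(self, partition):
        return self.csv_to_node[self.partition_to_csv[partition]]

    def handle(self, node, partition, request):
        if self.owner_of(partition) != node:
            raise NotOwner(partition)
        return f"{node} served {request} on {partition}"


class BlobClient:
    def __init__(self, cluster, max_retries=3):
        self.cluster = cluster
        self.max_retries = max_retries
        self.partition_to_node = {}  # client-side cache: partition ID -> node name

    def send(self, partition, request):
        for _ in range(self.max_retries):
            node = self.partition_to_node.get(partition)
            if node is None:
                # Missing or invalidated entry: look up the current owner.
                node = self.cluster.owner_of(partition)
                self.partition_to_node[partition] = node
            try:
                return self.cluster.handle(node, partition, request)
            except NotOwner:
                # The mapping changed (node or CSV failover): drop it and retry.
                self.partition_to_node.pop(partition, None)
        raise RuntimeError(f"no owner found for {partition}")
```

In this sketch, moving CSV1 from Node1 to Node2 makes the next request fail once against the cached node, after which the client refreshes its mapping and reconnects to Node2.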

Table Service

Azure Table Semantics (see MSDN for details)

Data Model: Each row (entity) contains up to 1 MB of schema-less data. Each entity contains up to 255 properties, including 3 system properties (that is, up to 252 custom properties). PartitionKey + RowKey form the primary key for data queries in Tables. Supports filters, LINQ queries, and pagination for retrieving table entities.

Atomic: Supports both entity-level transactions and Entity Group Transactions within one partition, with a maximum of 100 entities.

Consistent: After a successful write, any reads that happen after the write get the latest value.

Durable: Synchronously stores 3 replicas before reporting success.

Scalable: Millions of entities in a single table. Millions of tables in the whole system.

Highly Available: 99.9% read/write availability for local replication.

PartitionKey   RowKey   TimeStamp       Status      …
A              Alice    May 29, 2016    “Online”    …
B              Brian    June 29, 2016   “Offline”   …
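The constraints on a group transaction stated above (one partition, at most 100 entities) can be expressed as a small pre-flight check. This is a hedged sketch: `validate_entity_group` and `MAX_BATCH_ENTITIES` are hypothetical names, entities are modeled as plain dictionaries, and the duplicate-key check is an added assumption rather than something the slide states.

```python
MAX_BATCH_ENTITIES = 100  # limit stated on the slide


def validate_entity_group(entities):
    """Check that a batch could run as a single group transaction.

    Returns the shared PartitionKey, or raises ValueError.
    """
    if not entities:
        raise ValueError("batch is empty")
    if len(entities) > MAX_BATCH_ENTITIES:
        raise ValueError("more than 100 entities in one batch")
    partitions = {e["PartitionKey"] for e in entities}
    if len(partitions) != 1:
        raise ValueError("all entities must share one PartitionKey")
    # Assumption: each (PartitionKey, RowKey) may appear only once per batch.
    keys = [(e["PartitionKey"], e["RowKey"]) for e in entities]
    if len(set(keys)) != len(keys):
        raise ValueError("duplicate primary key in batch")
    return partitions.pop()
```

With the sample rows above, Alice (partition A) and Brian (partition B) could not be written in one group transaction, because they live in different partitions.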

Presenter
Presentation Notes
FanZhang: Data Model -> APIs -> Properties. Talk about how WOSS Table addresses or deviates from each of the characteristics. Needs to scale out for both I/O-intensive and CPU-intensive workloads.

Azure Table API (see MSDN for details)

Common Table Operations

Query Tables, Create Table, Delete Table, Get Table ACL, Set Table ACL

Operations on Table Entities

Query Entities, Insert Entity, Insert Or Merge Entity, Insert Or Replace Entity, Update Entity, Merge Entity, Delete Entity
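The entity operations differ mainly in what they do when the entity already exists or is missing. The following in-memory sketch illustrates those semantics as they are documented for the Azure Table service; `EntityTable` and its method names are hypothetical, and ETag-based concurrency checks are left out.

```python
class EntityTable:
    """In-memory model of one table, keyed by (PartitionKey, RowKey)."""

    def __init__(self):
        self.rows = {}

    def insert(self, pk, rk, props):
        # Insert Entity: fails if the entity already exists.
        if (pk, rk) in self.rows:
            raise KeyError("entity already exists")
        self.rows[(pk, rk)] = dict(props)

    def update(self, pk, rk, props):
        # Update Entity: replaces all properties; the entity must exist.
        if (pk, rk) not in self.rows:
            raise KeyError("entity not found")
        self.rows[(pk, rk)] = dict(props)

    def merge(self, pk, rk, props):
        # Merge Entity: overlays the given properties; the entity must exist.
        if (pk, rk) not in self.rows:
            raise KeyError("entity not found")
        self.rows[(pk, rk)].update(props)

    def insert_or_replace(self, pk, rk, props):
        # Insert Or Replace Entity: replaces the whole entity, or creates it.
        self.rows[(pk, rk)] = dict(props)

    def insert_or_merge(self, pk, rk, props):
        # Insert Or Merge Entity: overlays properties, or creates the entity.
        self.rows.setdefault((pk, rk), {}).update(props)

    def delete(self, pk, rk):
        # Delete Entity: the entity must exist.
        del self.rows[(pk, rk)]
```

The practical difference is that the replace variants drop properties that are absent from the request, while the merge variants keep them.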

Presenter
Presentation Notes
Properties: content type, encoding, request ID, origin, content disposition, etc. Metadata: arbitrary metadata in name:value pairs.

Table Backend

Table master creates and keeps the mapping from table ranged partitions to Table Server instances in an Azure Service Fabric reliable collection, which is replicated across the cluster.

Mapping also cached in FE to expedite lookup.

One Table Server instance serves one DB including multiple user tables

Multiple TS instances share one process

TS instances acquire assigned database/ partition ranges from table master upon start

Use Service Fabric to achieve High Availability, Scale-out, and Load Balancing
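The lookup path in the points above (a range-partition mapping held by the table master and cached in the front end) might look like the sketch below. It assumes that "ranged partition" means contiguous key ranges split at sorted boundaries; `TableMaster`, `FrontEnd`, `get_mapping`, `reassign`, and `invalidate` are hypothetical names, and a plain Python list stands in for the Service Fabric reliable collection.

```python
import bisect


class TableMaster:
    """Toy table master: owns the range -> Table Server instance mapping."""

    def __init__(self, split_points, instances):
        # split_points are sorted, exclusive upper bounds; the last
        # instance serves every key at or above the final split point.
        assert len(instances) == len(split_points) + 1
        self.split_points = list(split_points)
        self.instances = list(instances)
        self.fetches = 0

    def get_mapping(self):
        self.fetches += 1
        return list(self.split_points), list(self.instances)

    def reassign(self, old, new):
        # Models a failover or load-balancing move of one range.
        self.instances = [new if i == old else i for i in self.instances]


class FrontEnd:
    """Toy front end that caches the mapping to avoid repeated lookups."""

    def __init__(self, master):
        self.master = master
        self.cached = None

    def route(self, partition_key):
        if self.cached is None:
            self.cached = self.master.get_mapping()
        split_points, instances = self.cached
        return instances[bisect.bisect_right(split_points, partition_key)]

    def invalidate(self):
        # Called, for example, after a request hits an instance that moved.
        self.cached = None
```

A binary search over the split points keeps each lookup logarithmic in the number of ranges, and the cache means the master is consulted only after an invalidation.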

[Figure: Table backend architecture. Components shown: Storage Layer, SoFS (tables T1 to T8); Service Fabric Cluster; HA VM 1 to HA VM 7; Table Server Instance 1 to 8; Table Master; WFE; WAC. Arrow labels: Create Table; Store Table list & properties; Query TS Mapping; Get partition assignment; Failover or LB; Table command & Response.]

Sample row shown in the figure (TableName = “CCADB32B-3955-41B8-B28B-062EE8791EE9”):

PartitionKey   RowKey   TimeStamp    PropertyBag
C&E            100000   2016/08/29   Project = “ACS” …

Presenter
Presentation Notes
4 key components: WFE, WAC, Table Master, Table Server. All services are stateless and persist data in the DB. Table Server is the scale-out unit. We need to scale out for thousands of tables, so TM maps each table to a processing unit, which is a Table Server instance. A Table Server instance loads one DB and serves only one user table. It is the smallest unit for failover and load balancing. Talk about failover and load reporting. Talk about Table limitations here. WinFabric: Table Master does not control which TS instance runs on which node. Windows Fabric helps achieve high availability, scale-out, and load balancing. It detects instance/node heartbeats and creates corresponding new instances on separate nodes. It assigns service instances across all nodes and looks up service endpoints based on service name and instance ID. It moves instances across nodes based on aggregated load metrics reported by service instances.

Questions?

Appendix

Differences against Azure Storage

No Standard_RAGRS or Standard_GRS

Premium_LRS: no performance limits or guarantees

No Azure Files yet

Usage meter differences:

No IaaS transactions in Storage Transactions

No internal vs. external traffic distinction in Data Transfer

No account type change, or custom domains

Presenter
Presentation Notes
Point-in-time view. Certain differences exist in the scope of storage manageability functionality. For example, changing the account type is not supported; custom domains are not supported; and only API-level consistency is offered for the Premium_LRS storage account type.

Blob Service Metadata Persistence Model

Stream Maps being evaluated to move to DB

Single table with mixed row/blob types

ACS HA Model

Service Fabric is used by WFE, TM/TS, WAC, and SRP instances for failover, load balancing, and deployment/upgrade.

All ACS service roles are co-locatable.

SRP co-locates with other RPs in the same cluster.

Compute cluster VM health monitoring complements in-guest Service Fabric by providing VM-level recovery.

The Blob service runs as an HA service on the SSU cluster, directly on physical nodes, as a peer of SMB. It monitors CSV movements in the cluster and attaches to or detaches from the volumes accordingly.

The Blob service does not fail over; it runs in a multiple-active configuration.

[Figure: ACS HA deployment. Labels recoverable from the diagram: CSU; SSU; Failover Cluster; Service Fabric Cluster + VM Monitoring; four SSU Nodes, each running the Blob Service; HA VMs hosting WAC, WFE, SRP, TS, TM, Metric, and Health; Active; Virtual; Physical; Scale Unit.]

Presenter
Presentation Notes
Map components to their clusters. WinFabric provides a uniform way to deploy and update. SRP, WAC, TM: Active -> Secondary; leverage WinFabric failover. WFE: stateless, load balanced by SLB. TS: WinFabric partitioned service, leveraged for failover and load balancing. Blob service: resource group relocation. Notes: We use WinFabric for WFE/SRP for deployment model uniformity and to leverage update domains.

Create Table Data Flow

[Figure: Create Table data flow across the Table backend. Components shown: Storage Layer, SoFS (tables T1 to T8); Service Fabric Cluster; HA VM 1 to HA VM 7; Table Server Instance 1 to 8; Table Master; WFE; WAC. Arrow labels: Create Table; Store Table list & properties. Numbered callouts 1 to 5 correspond to the steps below.]

1. WFE authenticates with Account service (WAC) and sends Table Creation request to WAC

2. WAC adds a new table entry in WAC metadata DB

3. WAC queries Table Master for corresponding Table Server Instance

4. WAC requests Table Server to create the new table

5. TS instance creates the user table in the DB
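The five steps above can be traced in a compact in-process sketch. Everything here is illustrative: `FrontEnd`, `AccountService`, `TableMaster`, and `TableServer` are hypothetical stand-ins for WFE, WAC, Table Master, and a TS instance; authentication is reduced to a shared-secret comparison; and the placement rule in `server_for` is invented for the example.

```python
class TableServer:
    def __init__(self, name):
        self.name = name
        self.db = {}  # user table name -> rows

    def create_table(self, table):
        self.db.setdefault(table, {})                 # step 5


class TableMaster:
    def __init__(self, servers):
        self.servers = servers

    def server_for(self, table):
        # Hypothetical placement rule: a stable sum of character codes.
        return self.servers[sum(map(ord, table)) % len(self.servers)]


class AccountService:  # stands in for WAC
    def __init__(self, master, keys):
        self.master = master
        self.keys = keys          # account -> secret
        self.metadata_db = {}     # (account, table) -> properties

    def authenticate(self, account, secret):
        return self.keys.get(account) == secret

    def create_table(self, account, table):
        self.metadata_db[(account, table)] = {}       # step 2
        ts = self.master.server_for(table)            # step 3
        ts.create_table(table)                        # step 4 (triggers step 5)
        return ts.name


class FrontEnd:  # stands in for WFE
    def __init__(self, wac):
        self.wac = wac

    def create_table(self, account, secret, table):
        if not self.wac.authenticate(account, secret):    # step 1
            raise PermissionError(account)
        return self.wac.create_table(account, table)
```

Note that the metadata entry (step 2) is written before the Table Server creates the table (step 5), so a failure in between would leave an entry without a backing table; the sketch does not attempt to model how that case is handled.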

Presenter
Presentation Notes
Give an example of failure case here.

Insert Entity Data Flow

[Figure: Insert Entity data flow across the Table backend. Components shown: Storage Layer, SoFS (tables T1 to T8); Service Fabric Cluster; HA VM 1 to HA VM 7; Table Server Instance 1 to 8; Table Master; WFE; WAC. Arrow labels: Query TS Mapping; Table command & Response. Numbered callouts 1 to 4 correspond to the steps below.]

1. WFE authenticates & authorizes requests with WAC (if not already cached by WFE)

2. WFE queries Table Master for the TS instance for the table (if it’s not already cached by WFE)

3. WFE sends Insert Entity request to specified TS instance

4. TS instance updates DB containing the corresponding table and sends success/failure back to WFE
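The role of the two WFE caches in steps 1 and 2 is easier to see in code. The following is a rough sketch under assumptions: `FrontEnd` and `TableServer` are hypothetical, the WAC and Table Master calls are passed in as plain callables (`authorize`, `lookup_ts`), and cache expiry or invalidation is not modeled.

```python
class TableServer:
    def __init__(self):
        self.tables = {}  # table name -> {(PartitionKey, RowKey): entity}

    def insert_entity(self, table, entity):
        rows = self.tables.setdefault(table, {})
        key = (entity["PartitionKey"], entity["RowKey"])
        if key in rows:
            return False  # failure: the entity already exists
        rows[key] = entity
        return True


class FrontEnd:
    def __init__(self, authorize, lookup_ts):
        self.authorize = authorize    # stands in for a call to WAC
        self.lookup_ts = lookup_ts    # stands in for a call to Table Master
        self.auth_cache = {}          # account -> bool
        self.ts_cache = {}            # table -> TS instance

    def insert_entity(self, account, table, entity):
        if account not in self.auth_cache:                 # step 1
            self.auth_cache[account] = self.authorize(account)
        if not self.auth_cache[account]:
            raise PermissionError(account)
        if table not in self.ts_cache:                     # step 2
            self.ts_cache[table] = self.lookup_ts(table)
        ts = self.ts_cache[table]
        return ts.insert_entity(table, entity)             # steps 3 and 4
```

On a warm cache, only steps 3 and 4 involve another service, which is the point of caching the authorization result and the TS mapping in the front end.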
