
Page 1: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)

Sep. 19, 2017 WebDB Forum Tokyo

Yasuharu Goto

Dragon: A Distributed Object Storage @Yahoo! JAPAN

(English Ver.)

Page 2: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


About me

• Yasuharu Goto

• Yahoo! JAPAN (2008-)

• Software Engineer

• Storage, Distributed Database Systems (Cassandra)

• Twitter: @ono_matope

• Lang: Go


Page 3: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Agenda

• About Dragon

• Architecture

• Issues and Future works


Page 4: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Dragon

Page 5: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Object Storage

• What is Object Storage?

• A storage architecture that manages data not as files but as objects.

• Instead of providing features like a file hierarchy, it provides high availability and scalability.

• (Typically) provides a REST API, so it can be used easily by applications.

• Popular products

• AWS: Amazon S3

• GCP: Google Cloud Storage

• Azure: Azure Blob Storage

• An essential component for modern web development.


Page 6: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Dragon

• A distributed Object Storage developed at Yahoo! JAPAN.

• Design Goals:

• High { performance, scalability, availability, cost efficiency }

• Written in Go

• Released in Jan/2016 (20 months in production)

• Scale

• deployed in 2 data centers in Japan

• Stores 20 billion objects / 11 PB.


Page 7: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Use Cases

• 250+ users in Y!J

• Various use cases

• media content

• user data, log storage

• backend for Presto (experimental)


• Yahoo! Auction (image)

• Yahoo! News/Topics (image)

• Yahoo! Display Ad Network (image/video)

• Yahoo! Blog (image)

• Yahoo! Smartphone Themes (image)

• Yahoo! Travel (image)

• Yahoo! Real Estate (image)

• Yahoo! Q&A (image)

• Yahoo! Reservation (image)

• Yahoo! Politics (image)

• Yahoo! Game (contents)

• Yahoo! Bookstore (contents)

• Yahoo! Box (user data)

• Netallica (image)

• etc...

Page 8: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


S3 Compatible API

• Dragon provides an S3 compatible API

• aws-sdk, aws-cli, CyberDuck...

• Implemented

• Basic S3 API (Service, Bucket, Object, ACL...)

• SSE (Server Side Encryption)

• TODO

• Multipart Upload API (to upload large objects up to 5TB)

• and more...


Page 9: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Performance (vs. Riak CS, as a reference)

• Dragon: API*1, Storage*3, Cassandra*3

• Riak CS: haproxy*1, stanchion*1, Riak (KV+CS)*3

• Same Hardware except for Cassandra and Stanchion.

[Charts: GET Object 10KB throughput and PUT Object 10KB throughput, in requests/sec over 1 to 400 client threads, comparing Riak CS and Dragon.]

Page 10: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Why?

Page 11: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Why did we build a new Object Storage?

• Octagon (2011-2017)

• Our 1st Generation Object Storage

• Up to 7 PB / 7 billion objects / 3,000 nodes at its peak

• used for personal cloud storage service, E-Book, etc...

• Problems of Octagon

• Low performance

• Unstable

• Expensive TCO

• Hard to operate

• We started to consider alternative products.


Page 12: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Requirements

• Our requirements

• High enough performance for our services

• High scalability to respond to rapid increases in data demand

• High availability with low operation cost

• High cost efficiency

• Mission

• To establish a company-wide storage infrastructure


Page 13: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Alternatives

• Existing Open Source Products

• Riak CS

• Some of our products introduced it, but it did not meet our performance requirements.

• OpenStack Swift

• Concerns about performance degradation as the object count increases.

• Public Cloud Providers

• cost inefficient

• We serve most of our services from our own data centers.

We needed a highly scalable storage system that runs on-premises.


Page 14: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Alternatives


OK, let’s build it ourselves!


Page 15: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Architecture

Page 16: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Architecture Overview

• Dragon consists of 3 components: API Nodes, Storage Cluster and MetaDB.

• API Node

• Provides the S3 compatible API and serves all user requests.

• Storage Node

• HTTP file servers that store BLOBs of uploaded objects.

• 3 nodes make up a VolumeGroup. BLOBs in each group are periodically synchronized.

• MetaDB

• Apache Cassandra cluster

• Stores the metadata of uploaded objects, including the locations of their BLOBs.


Page 17: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Architecture

[Diagram: API Nodes receive HTTP (S3 API) requests. BLOBs flow to the Storage Cluster, made up of VolumeGroup 01 (StorageNodes 1-3) and VolumeGroup 02 (StorageNodes 4-6), each node with volumes HDD1 and HDD2; metadata flows to the MetaDB.]

Page 18: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Architecture

API and Storage Nodes are written in Go.

[Diagram: the same architecture as the previous slide.]

Page 19: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Architecture

API Nodes periodically fetch and cache the VolumeGroup configuration from the MetaDB.

[Diagram: the same Storage Cluster as before; the API Nodes hold a cached copy of the configuration below.]

volumegroup configuration:

id  hosts              volumes
01  node1,node2,node3  HDD1, HDD2
02  node4,node5,node6  HDD1, HDD2
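As an illustration, the fetch-and-cache loop on an API Node might look like the following minimal Go sketch; gocql stands in as the Cassandra client, and the table and field names are assumptions based on this slide, not Dragon's actual code.

```go
// A minimal sketch of the fetch-and-cache loop; names are illustrative.
package sketch

import (
	"log"
	"sync"
	"time"

	"github.com/gocql/gocql"
)

// VolumeGroup mirrors one row of the volumegroup configuration table.
type VolumeGroup struct {
	ID      string
	Hosts   []string // e.g. node1, node2, node3
	Volumes []string // e.g. HDD1, HDD2
}

// GroupCache holds the API Node's in-memory copy of the configuration.
type GroupCache struct {
	mu     sync.RWMutex
	groups []VolumeGroup
}

// refresh reloads the configuration from the MetaDB (Cassandra).
func (c *GroupCache) refresh(session *gocql.Session) error {
	iter := session.Query(`SELECT id, hosts, volumes FROM volumegroup`).Iter()
	var gs []VolumeGroup
	var g VolumeGroup
	for iter.Scan(&g.ID, &g.Hosts, &g.Volumes) {
		gs = append(gs, g)
	}
	if err := iter.Close(); err != nil {
		return err
	}
	c.mu.Lock()
	c.groups = gs
	c.mu.Unlock()
	return nil
}

// run polls periodically, as the slide describes.
func (c *GroupCache) run(session *gocql.Session, every time.Duration) {
	for range time.Tick(every) {
		if err := c.refresh(session); err != nil {
			log.Printf("volumegroup refresh failed: %v", err)
		}
	}
}
```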

Page 20: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Upload

[Diagram: PUT bucket1/sample.jpg → API Node → ① HTTP PUT of the BLOB to StorageNodes 1-3 in VolumeGroup 01 → ② metadata {key: bucket1/sample.jpg, size: 1024 bytes, blob: volumegroup01/hdd1/...} written to the MetaDB.]

1. When a user uploads an object, the API Node first picks a VolumeGroup at random and transfers the object’s BLOB to the nodes in that VolumeGroup using HTTP PUT.

2. It then stores the metadata, including the BLOB location, in the MetaDB. (A minimal sketch of this flow follows.)
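A minimal Go sketch of the upload path; the object table, the BLOB path layout, and the helper names are assumptions for illustration, not Dragon's actual code, and the quorum handling is deferred to the Eventual Consistency slide later on.

```go
// A minimal sketch of the upload path; names are illustrative.
package sketch

import (
	"bytes"
	"fmt"
	"math/rand"
	"net/http"

	"github.com/gocql/gocql"
)

// volumeGroup is one cached row of the VolumeGroup configuration.
type volumeGroup struct {
	ID    string
	Hosts []string
}

func upload(session *gocql.Session, groups []volumeGroup, bucket, key string, blob []byte) error {
	// 1. Pick a VolumeGroup at random and HTTP PUT the BLOB to its Storage Nodes.
	g := groups[rand.Intn(len(groups))]
	location := fmt.Sprintf("/%s/hdd1/%s/%s", g.ID, bucket, key) // illustrative BLOB path
	for _, host := range g.Hosts {
		req, err := http.NewRequest(http.MethodPut, "http://"+host+location, bytes.NewReader(blob))
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err // quorum handling is covered on the Eventual Consistency slide
		}
		resp.Body.Close()
	}
	// 2. Store the metadata, including the BLOB location, in the MetaDB.
	return session.Query(
		`INSERT INTO object (bucket, key, mtime, status, metadata) VALUES (?, ?, ?, 'ACTIVE', ?)`,
		bucket, key, gocql.TimeUUID(),
		map[string]string{"size": fmt.Sprint(len(blob)), "blob": location},
	).Exec()
}
```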

Page 21: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Download

[Diagram: GET bucket1/sample.jpg → API Node → ① metadata {key: bucket1/sample.jpg, size: 1024 bytes, blob: volumegroup01/hdd1/...} read from the MetaDB → ② HTTP GET of the BLOB from a StorageNode in VolumeGroup 01.]

1. When a user downloads an object, the API Node retrieves its metadata from the MetaDB.

2. It then issues an HTTP GET to a Storage Node holding the BLOB, based on the metadata, and transfers the response to the user. (A minimal sketch follows.)
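A minimal Go sketch of the download path under the same assumed schema; in practice the Storage Node host would come from the cached VolumeGroup configuration, which is elided here.

```go
// A minimal sketch of the download path; names are illustrative.
package sketch

import (
	"io"
	"net/http"

	"github.com/gocql/gocql"
)

func download(session *gocql.Session, w http.ResponseWriter, host, bucket, key string) error {
	// 1. Retrieve the object's metadata: the first (newest) row of the partition.
	var mtime gocql.UUID
	var status string
	var meta map[string]string
	if err := session.Query(
		`SELECT mtime, status, metadata FROM object WHERE bucket = ? AND key = ? LIMIT 1`,
		bucket, key,
	).Scan(&mtime, &status, &meta); err != nil {
		return err
	}
	// 2. HTTP GET the BLOB from a Storage Node holding it, and stream the
	// response body through to the user.
	resp, err := http.Get("http://" + host + meta["blob"])
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(w, resp.Body)
	return err
}
```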

Page 22: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Failure Recovery

[Diagram: the two VolumeGroups as before; HDD1 on StorageNode 1 fails.]

When a hard disk fails...

Page 23: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Failure Recovery

[Diagram: VolumeGroup 01 with the failed HDD1 on StorageNode 1 replaced by a fresh drive.]

The drive is replaced, and the data that should be on it is recovered by transferring it from the other StorageNodes in the VolumeGroup.

Page 24: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Scaling out

When you add capacity to the cluster...

[Diagram: VolumeGroups 01 and 02 as before.]

volumegroup configuration:

id  hosts              volumes
01  node1,node2,node3  HDD1, HDD2
02  node4,node5,node6  HDD1, HDD2

Page 25: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Scaling out

• ... simply set up a new set of StorageNodes and update the VolumeGroup configuration.

[Diagram: a new VolumeGroup 03 (StorageNodes 7-9) added alongside VolumeGroups 01 and 02.]

volumegroup configuration:

id  hosts              volumes
01  node1,node2,node3  HDD1, HDD2
02  node4,node5,node6  HDD1, HDD2
03  node7,node8,node9  HDD1, HDD2

Page 26: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Why not Consistent Hash?

• Dragon’s distributed architecture is based on a mapping managed by the DB.

• Q. Why not Consistent Hash?


quoted from: http://docs.basho.com/riak/kv/2.2.3/learn/concepts/clusters/

• Consistent Hash

• Data is distributed uniformly by hash of key

• Used by many existing distributed systems

• e.g. Riak CS, OpenStack Swift

• No need for external DB to manage the map

Page 27: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Why not Consistent Hash?

• A. To be able to add storage capacity without rebalancing.

• Rebalancing heavily consumes disk I/O and network bandwidth, and often takes a long time.

• e.g. Adding 1 node to a fully utilized cluster of 10 × 720 TB nodes requires transferring (720 TB × 10) / 11 ≈ 655 TB to the new node; at 2 Gbps that takes about 30 days.

• Scaling a hash-based DB to more than 1,000 such large nodes is very challenging.

Page 28: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Other Pros/Cons

• Pros

• We can scale out MetaDB and BLOB Storage independently.

• Backend Storage Engine is pluggable.

• We can easily add or change the storage technology/class in the future

• Cons

• We need an external database to manage the mapping.

• BLOB load can be non-uniform.

• We’ll rebalance periodically.


Page 29: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Storage Node

Page 30: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Storage Hardware

• High density Storage Servers for cost efficiency

• We need to make use of the full potential of the hardware.


https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR90L.cfm

Page 31: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Storage Configuration

• HDDs are configured as independent logical volumes instead of RAID

• Reason 1: to reduce recovery time when an HDD fails.

[Diagram: a VolumeGroup of three StorageNodes, each with four independent volumes HDD1-HDD4.]

Page 32: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Storage Configuration

• Reason 2: RAID is slow for random access.

Configuration  Requests per sec
Non-RAID       178.9  (2.4x faster than RAID 0)
RAID 0          73.4
RAID 5          68.6

Throughput for a random-access workload, served by nginx; 4 HDDs; file size: 500 KB.

Page 33: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


File Persistence Strategy

• Dragon’s Storage Nodes use one file per BLOB.

• This strategy increases robustness by relying on a stable filesystem (ext4).

• However, file systems are known to handle large numbers of files poorly.

• It has been reported that Swift’s write performance degrades as the number of files increases.

• To get over this problem, Dragon uses a unique technique.


ref.1: “Operating an image storage service with OpenStack Swift” http://labs.gree.jp/blog/2014/12/11746/

ref.2: “A view from the image system | CyberAgent official engineer blog” http://ameblo.jp/principia-ca/entry-12140148643.html

Page 34: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


File Persistence Strategy

• Typical approach: write files evenly into directories created in advance.

• Swift writes files in this manner.

• As the number of files increases, the number of seeks increases and the write throughput decreases.

• The cost of updating dentries increases.

[Diagram: a hash function maps photo1.jpg ... photo4.jpg across 256 directories (01, 02, 03, ..., fe, ff).]

Seek count and throughput when randomly writing 3 million files into 256 directories. Implemented as a simple HTTP server; measured with ab, blktrace, and seekwatcher. (A sketch of this hash-based placement follows.)
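For contrast, a tiny Go sketch of the hash-based placement described above: the directory is derived from a hash of the file name, spreading writes over 256 pre-created directories. The MD5 choice is an assumption for illustration; the slide does not name a hash.

```go
// A sketch of hash-based file placement over 256 directories.
package sketch

import (
	"crypto/md5"
	"fmt"
	"os"
	"path/filepath"
)

func hashedPath(root, name string) string {
	sum := md5.Sum([]byte(name))
	dir := fmt.Sprintf("%02x", sum[0]) // one of 256 directories: 00 .. ff
	return filepath.Join(root, dir, name)
}

func writeFileHashed(root, name string, data []byte) error {
	p := hashedPath(root, name)
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	// With millions of files, these writes touch dentries spread all over
	// the disk, which is what drives the seek count up.
	return os.WriteFile(p, data, 0o644)
}
```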

Page 35: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Dynamic Partitioning

• Dynamic Partitioning approach:

1. Create sequentially numbered directories (partitions). API Nodes upload files into the latest directory.

2. Once the number of files in the partition reaches a threshold (1,000 here), the Storage Node creates the next partition and informs the API Nodes about it.

• The number of files per directory is kept constant by adding directories as needed. (A sketch follows the figure below.)

[Diagram: when the number of files in directory 0 exceeds approximately 1,000, Dragon creates directory 1 and uploads there; then directory 2, and so on.]
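A minimal Go sketch of the rollover logic, assuming a per-volume counter and the 1,000-files threshold from the slide; partition 0 is assumed to already exist.

```go
// A minimal sketch of dynamic partitioning rollover.
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

const filesPerPartition = 1000

type partitioner struct {
	mu      sync.Mutex
	root    string
	current int // latest partition (directory) number
	count   int // files written to the current partition
}

// nextPath returns where to store a new file, creating the next sequentially
// numbered directory once the current one reaches the threshold.
func (p *partitioner) nextPath(name string) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.count >= filesPerPartition {
		p.current++
		p.count = 0
		if err := os.MkdirAll(filepath.Join(p.root, fmt.Sprint(p.current)), 0o755); err != nil {
			return "", err
		}
	}
	p.count++
	return filepath.Join(p.root, fmt.Sprint(p.current), name), nil
}
```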

Page 36: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Dynamic Partitioning

• Comparison with the hash-based strategy; green is Dynamic Partitioning.

• Even as the file count increases, the seek count does not increase and throughput stays stable.

[Chart: writing files with the hash-based strategy (blue) vs. Dynamic Partitioning (green).]

Page 37: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Microbenchmark

Write throughput was maintained up to 10 million files on a single HDD.

[Chart: writing throughput when creating up to 10 million files. We synced and dropped caches after every 100,000 files created.]

Page 38: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Eventual Consistency

• To achieve high availability, writes to Storage Nodes use eventual consistency with quorum.

• An upload succeeds if writing to a majority of the 3 nodes succeeds.

• An Anti-Entropy Repair process periodically brings failed nodes back in sync. (A sketch of the quorum write follows the diagram below.)

[Diagram: an API Node PUTs a BLOB to StorageNodes 1-3 in VolumeGroup 01; a majority acknowledge OK.]
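A Go sketch of such a quorum write: the PUT succeeds if a majority (2 of 3) of the Storage Nodes acknowledge. The helper name and status handling are illustrative, not Dragon's actual code.

```go
// A sketch of a majority (quorum) write to the nodes of one VolumeGroup.
package sketch

import (
	"bytes"
	"fmt"
	"net/http"
)

func quorumPut(hosts []string, path string, blob []byte) error {
	ok := 0
	for _, h := range hosts {
		req, err := http.NewRequest(http.MethodPut, "http://"+h+path, bytes.NewReader(blob))
		if err != nil {
			continue
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			continue // a failed replica is caught up later by Anti-Entropy Repair
		}
		resp.Body.Close()
		if resp.StatusCode < 300 {
			ok++
		}
	}
	if ok <= len(hosts)/2 {
		return fmt.Errorf("quorum not reached: %d/%d", ok, len(hosts))
	}
	return nil
}
```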

Page 39: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Anti-Entropy Repair

• Anti-Entropy Repair

• A process that compares data between nodes, detects data that has not been replicated, and restores consistency.

[Diagram: Nodes A, B, and C each hold file1-file4; one node is missing file3, which is transferred from the other replicas.]

Page 40: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Anti-Entropy Repair

• Detects and corrects inconsistencies between Storage Nodes at partition granularity.

1. Calculate a hash of the names of the files in a partition.

2. Compare the hashes between the nodes in a VolumeGroup; if they do not match, there is an inconsistency.

3. If the hashes do not match, compare the files in the partition and transfer the missing files.

• The comparison is I/O efficient because the hashes can be cached and updates are concentrated in the latest partition. (A sketch of the partition hash follows the diagram below.)

[Diagram: partition hashes for partitions 01-03 on HDD2 of node1, node2, and node3. Partitions 01 and 02 match everywhere (60b725f..., e8191b3...); partition 03's hash differs on the node that is missing file1002.data (10c9c85c... vs. 97880df...), so file1002.data is transferred to node1.]
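A minimal Go sketch of step 1, the partition hash. SHA-1 over the file names is an assumed choice; the slide does not name the hash function.

```go
// A sketch of hashing the file names of one partition directory.
package sketch

import (
	"crypto/sha1"
	"fmt"
	"os"
)

// partitionHash hashes the file names in one partition directory.
// os.ReadDir returns entries sorted by filename, so every replica computes
// the hash over the same order.
func partitionHash(dir string) (string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	h := sha1.New()
	for _, e := range entries {
		h.Write([]byte(e.Name()))
		h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}
```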

Page 41: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


MetaDB

Page 42: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Cassandra

• Apache Cassandra

• High Availability

• Linear Scalability

• Low operation cost

• Eventual Consistency

• Cassandra does not support ACID transactions


Page 43: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Cassandra

• Tables

• VolumeGroup

• Account

• Bucket

• Object

• ObjectIndex


Page 44: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Object Table

• Object Table

• Table to retain Object Metadata

• size, BLOB location, ACL, Content-Type...

• Distributed evenly across the cluster by the partition key, which is composed of (bucket, key). (A hedged schema sketch follows the table below.)

bucket  key         mtime     status  metadata...
b1      photo1.jpg  uuid(t2)  ACTIVE  {size, location, acl...}
b1      photo2.jpg  uuid(t1)  ACTIVE  {size, location, acl...}
b3      photo1.jpg  uuid(t3)  ACTIVE  {size, location, acl...}

(bucket, key) is the Partition Key.
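A hedged reconstruction of what the Object table could look like in CQL, embedded as a Go constant; the column types are assumptions based on the slides, and the DESC clustering order is what the next slides rely on.

```go
// An assumed reconstruction of the Object table schema, not Dragon's actual DDL.
package sketch

const createObjectTable = `
CREATE TABLE object (
    bucket   text,
    key      text,
    mtime    timeuuid,         -- UUIDv1 based on creation time
    status   text,             -- ACTIVE / DELETED
    metadata map<text, text>,  -- size, BLOB location, ACL, Content-Type...
    PRIMARY KEY ((bucket, key), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);`
```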

Page 45: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


PUT Object

• Updating metadata

• Within each partition, metadata rows are clustered in descending order by a UUIDv1 based on creation time.

• When an object is overwritten, the metadata of the latest version is inserted at the top of the partition.

• Since records of multiple versions are kept, no inconsistency occurs even if the object is overwritten concurrently. (A sketch of the insert follows the table below.)

bucket  key         mtime     status  metadata...
b1      photo2.jpg  uuid(t5)  ACTIVE  {size, location, acl...}  <- PUT b1/photo2.jpg (time: t5)
                    uuid(t4)  ACTIVE  {size, location, acl...}  <- PUT b1/photo2.jpg (time: t4)
                    uuid(t1)  ACTIVE  {size, location, acl...}

mtime is the Clustering Column. photo2.jpg reaches consistency (t5 wins).
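A minimal gocql sketch of the insert, assuming the schema sketched above. gocql.TimeUUID() generates a UUIDv1 from the current time, so a concurrent overwrite becomes a distinct row rather than a conflicting update.

```go
// A minimal sketch of the metadata insert on PUT, under the assumed schema.
package sketch

import "github.com/gocql/gocql"

func putMetadata(session *gocql.Session, bucket, key string, meta map[string]string) error {
	return session.Query(
		`INSERT INTO object (bucket, key, mtime, status, metadata) VALUES (?, ?, ?, 'ACTIVE', ?)`,
		bucket, key, gocql.TimeUUID(), meta,
	).Exec()
}
```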

Page 46: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


GET Object

• Retrieving metadata

• Retrieve the first row of the partition with a SELECT query.

• Since the partition is sorted by creation time, the first row always reflects the current state of the object.

bucket  key         mtime     status  metadata...
b1      photo1.jpg  uuid(t5)  ACTIVE  {size, location, acl...}  <- returned (time: t5)
                    uuid(t3)  ACTIVE  {size, location, acl...}
b1      photo2.jpg  uuid(t1)  ACTIVE  {size, location, acl...}

SELECT * FROM object WHERE bucket = 'b1' AND key = 'photo1.jpg' LIMIT 1;

Page 47: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


DELETE Object

• Requesting deletion of an object

• Insert a row with DELETED status instead of deleting the row immediately. (A sketch follows the table below.)

bucket  key         mtime     status   metadata...
b1      photo1.jpg  uuid(t5)  ACTIVE   {size, location, acl...}
                    uuid(t3)  ACTIVE   {size, location, acl...}
b1      photo2.jpg  uuid(t7)  DELETED  N/A                       <- DELETE b1/photo2.jpg (time: t7)
                    uuid(t1)  ACTIVE   {size, location, acl...}
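Under the same assumed schema, logical deletion is just another insert: the DELETED row sorts to the top of the partition and masks the older rows.

```go
// A minimal sketch of logical deletion as an insert, under the assumed schema.
package sketch

import "github.com/gocql/gocql"

func deleteObject(session *gocql.Session, bucket, key string) error {
	return session.Query(
		`INSERT INTO object (bucket, key, mtime, status) VALUES (?, ?, ?, 'DELETED')`,
		bucket, key, gocql.TimeUUID(),
	).Exec()
}
```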

Page 48: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


GET Object (deleted)

• Retrieving metadata (when the object has been deleted)

• If the latest row retrieved has DELETED status, the object is considered logically deleted and an error is returned.

bucket  key         mtime     status   metadata...
b1      photo1.jpg  uuid(t5)  ACTIVE   {size, location, acl...}
                    uuid(t3)  ACTIVE   {size, location, acl...}
b1      photo2.jpg  uuid(t7)  DELETED  N/A                       <- returned (time: t7)
                    uuid(t1)  ACTIVE   {size, location, acl...}

SELECT * FROM object WHERE bucket = 'b1' AND key = 'photo2.jpg' LIMIT 1;

Page 49: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Object Garbage Collection

• Garbage Collection (GC)

• Periodically deletes the metadata and the linked BLOBs of overwritten or deleted objects.

• Performs a full scan of the Object table.

• The second and subsequent rows of each partition are garbage; GC deletes them. (A sketch follows the table below.)

bucket  key         mtime     status   metadata...
b1      photo1.jpg  uuid(t5)  ACTIVE   {size, location, acl...}
                    uuid(t3)  ACTIVE   {size, location, acl...}  <- garbage
b1      photo2.jpg  uuid(t7)  DELETED  N/A
                    uuid(t3)  ACTIVE   {size, location, acl...}  <- garbage
                    uuid(t1)  ACTIVE   {size, location, acl...}  <- garbage

GC uploads 0-byte tombstone files to delete the BLOBs.
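A hedged Go sketch of the GC pass over one partition; the real GC performs a full scan of the Object table, and uploadTombstone is a hypothetical helper.

```go
// A sketch of garbage-collecting one (bucket, key) partition.
package sketch

import "github.com/gocql/gocql"

func collectGarbage(session *gocql.Session, bucket, key string) error {
	iter := session.Query(
		`SELECT mtime, metadata FROM object WHERE bucket = ? AND key = ?`,
		bucket, key,
	).Iter()
	var mtime gocql.UUID
	var meta map[string]string
	first := true
	for iter.Scan(&mtime, &meta) {
		if first {
			first = false // the newest row is the object's current state: keep it
			continue
		}
		if loc := meta["blob"]; loc != "" {
			uploadTombstone(loc) // a 0-byte tombstone file deletes the BLOB
		}
		if err := session.Query(
			`DELETE FROM object WHERE bucket = ? AND key = ? AND mtime = ?`,
			bucket, key, mtime,
		).Exec(); err != nil {
			return err
		}
	}
	return iter.Close()
}

// uploadTombstone would PUT a 0-byte file over the BLOB on its Storage Nodes.
func uploadTombstone(location string) {}
```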

Page 50: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Object Garbage Collection

• GC completed

bucket  key         mtime     status   metadata...
b1      photo1.jpg  uuid(t5)  ACTIVE   {size, location, acl...}
b1      photo2.jpg  uuid(t7)  DELETED  N/A

We achieved concurrency control on an eventually consistent database by using partitioning and UUID clustering.

Page 51: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Issues and Future Plans

Page 52: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


ObjectIndex Table

• Objects in a bucket are stored in the ObjectIndex table, sorted in ascending order by key name, for the ListObjects API.

• Since the partitions would otherwise get extremely large, the objects in a bucket are split into 16 partitions. (A sketch of the split follows the tables below.)

ObjectIndex Table:

bucket   hash  key      metadata
bucket1  0     key0001  ...
               key0003  ...
               key0012  ...
               key0024  ...
               ...      ...
bucket1  1     key0004  ...
               key0009  ...
               key0011  ...
               ...      ...
bucket1  2     key0002  ...
               key0005  ...
               ...      ...
...      ...   ...      ...

(bucket, hash) is the Partition Key; key is the Clustering Column.

To respond, retrieve all 16 partitions and merge them:

key      metadata
key0001  ...
key0002  ...
key0003  ...
key0004  ...
key0005  ...
key0006  ...
key0007  ...
key0008  ...
...      ...
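A minimal sketch of the 16-way split; FNV-1a is an assumed hash, since the slide does not name one.

```go
// A sketch of choosing the index partition for an object's key.
package sketch

import "hash/fnv"

const indexSplit = 16

// indexPartition picks which index partition an object's row lives in:
// (bucket, indexPartition(key)) forms the partition key.
func indexPartition(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % indexSplit
}
```

ListObjects then reads all 16 partitions, each already sorted by key, and merges them into a single sorted listing.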

Page 53: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Issues

• ObjectIndex-related problems

• Some API requests cause many queries to Cassandra, resulting in high load and high latency.

• Because of a Cassandra limitation, the number of objects per bucket is restricted to 32 billion.

• We would like to eliminate this constraint on the number of objects by introducing a mechanism that dynamically splits the index partitions.


Page 54: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Future Plans

• Improvement of Storage Engine

• WAL (Write Ahead Log) based Engine?

• Erasure Coding?

• Serverless Architecture

• Push notifications to messaging queues such as Kafka and Pulsar

• Integration with other distributed systems

• Hadoop, Spark, Presto, etc...


Page 55: Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


Wrap up

• Yahoo! JAPAN is developing a large-scale object storage system named “Dragon”.

• “Dragon” is a highly scalable object storage platform.

• We will keep improving it to meet our new requirements.

• Thank you!