Ceph on All-Flash Storage – Breaking Performance Barriers
Axel Rosenberg, Sr. Technical Marketing Manager
April 28, 2015
Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
Designed for Big Data Workloads @ PB Scale
Content Repositories
• Mixed media containers, active archiving, backup, locality of data
• Large containers with application SLAs
Big Data Analytics
• Internet of Things, sensor analytics
• Time-to-value and time-to-insight
• Hadoop, NoSQL, Cassandra, MongoDB
Media Services
• High read-intensive access from billions of edge devices
• Hi-def video driving even greater demand for capacity and performance
• Surveillance systems, analytics
InfiniFlash System
• Ultra-dense All-Flash Appliance
- 512TB in 3U
- Best in class $/IOPS/TB
• Scale-out software for massive capacity
- Unified Content: Block, Object
- Flash-optimized software with programmable interfaces (SDK)
• Enterprise-class storage features
- Snapshots, replication, thin provisioning
IF500 – InfiniFlash OS (Ceph)
Ideal for large-scale object storage use cases
12G IOPS Performance Numbers
Innovating Performance @ Massive Scale: InfiniFlash OS
Ceph Transformed for Flash Performance and Contributed Back to Community
• 10x improvement for block reads, 2x improvement for object reads
Major Improvements to Enhance Parallelism
• Removed single Dispatch queue bottlenecks for OSD and Client (librados) layers
• Sharded thread pool implementation
• Major lock reordering
• Improved lock granularity – reader/writer locks
• Granular locks at the object level (illustrated in the sketch below)
• Optimized OpTracking path in the OSD, eliminating redundant locks
Messenger Performance Enhancements
• Message signing
• Socket Read aheads
• Resolved severe lock contentions
Backend Optimizations – XFS and Flash
• Reduced CPU usage by ~2 cores with improved object-ID-to-file-path resolution
• CPU and Lock optimized fast path for reads
• Disabled throttling for Flash
• Index Manager caching and Shared FdCache in filestore
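The finer-grained locking above can be pictured with a small illustrative sketch. This is not the InfiniFlash OS or Ceph source, and the class names (ObjectLockMap, etc.) are invented for the example: a coarse store-wide mutex is replaced by per-object reader/writer locks, so reads of different objects, and concurrent reads of the same object, no longer serialize behind a single lock.

```cpp
// Minimal sketch (not Ceph source): per-object reader/writer locking.
// A coarse store-wide mutex would serialize every read; here readers of the
// same object share a lock and writers take it exclusively.
#include <iostream>
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

class ObjectLockMap {
public:
    // Look up (or create) the lock that guards a single object.
    std::shared_mutex& lock_for(const std::string& oid) {
        std::lock_guard<std::mutex> g(map_lock_);   // short critical section
        auto& entry = locks_[oid];
        if (!entry) entry = std::make_unique<std::shared_mutex>();
        return *entry;
    }
private:
    std::mutex map_lock_;                           // guards only the lock map itself
    std::map<std::string, std::unique_ptr<std::shared_mutex>> locks_;
};

int main() {
    ObjectLockMap locks;
    std::map<std::string, int> store = {{"obj.a", 0}, {"obj.b", 0}};

    auto reader = [&](const std::string& oid) {
        std::shared_lock<std::shared_mutex> r(locks.lock_for(oid));  // shared: many readers
        volatile int v = store.at(oid);
        (void)v;
    };
    auto writer = [&](const std::string& oid) {
        std::unique_lock<std::shared_mutex> w(locks.lock_for(oid));  // exclusive: one writer
        ++store.at(oid);
    };

    std::vector<std::thread> t;
    for (int i = 0; i < 4; ++i) t.emplace_back(reader, "obj.a");   // reads run concurrently
    t.emplace_back(writer, "obj.b");                               // write to obj.b doesn't block them
    for (auto& th : t) th.join();
    std::cout << "obj.b = " << store["obj.b"] << "\n";
    return 0;
}
```

Whether the actual implementation uses shared mutexes or custom reader/writer locks is not stated on the slide; the point is only that read-mostly paths stop contending on one lock.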
Results!
Test Configuration – Single InfiniFlash System
Performance config: InfiniFlash with two storage controllers
2-node cluster (32 drives shared to each OSD node)
OSD nodes: 2 servers (Dell R720), 2x E5-2680 8C 2.8GHz 25MB cache, 4x 16GB RDIMM dual rank x4 (64GB), 1x Mellanox ConnectX-3 dual 40GbE, 1x LSI 9207 HBA
RBD clients: 4 servers (Dell R620), 2x E5-2680 10C 2.8GHz 25MB cache, 2x 16GB RDIMM dual rank x4 (32GB), 1x Mellanox ConnectX-3 dual 40GbE
Storage: InfiniFlash with 512TB, 2 OSD servers
InfiniFlash: 1 InfiniFlash connected with 64 x 1YX2 Icechips in A2 topology
Total storage: 64 x 8TB = 512TB (effective 430TB)
InfiniFlash firmware: FFU 1.0.0.31.1
Network: 40G switch
OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
LSI card / driver: SAS2308 (9207), mpt2sas
Mellanox 40Gbps NIC: MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)
Cluster configuration:
Ceph version: sndk-ifos-1.0.0.04 (0.86.rc.eap2)
Replication (default): 2, host-level replication
Pools, PGs and RBDs: 4 pools, 2048 PGs per pool, 2 RBDs from each pool
RBD size: 2TB
Number of monitors: 1
Number of OSD nodes: 2
OSDs per node: 32 (total OSDs = 32 x 2 = 64)
Performance Improvement: Stock Ceph vs IF OS 8K Random Blocks
Read performance improves 3x to 12x depending on block size
[Chart: IOPS and average latency (ms) for 8K random blocks, Stock Ceph (Giant) vs IFOS 1.0, at queue depths 1/4/16 and 0/25/50/75/100% read IOs]
• 2 RBD/client x 4 clients total • 1 InfiniFlash node with 512TB
Performance Improvement: Stock Ceph vs IF OS 64K Random Blocks
[Chart: IOPS and average latency (ms), Stock Ceph vs IFOS 1.0, at queue depths 1/4/16 and 0/25/50/75/100% read IOs]
• 2 RBD/client x 4 clients total • 1 InfiniFlash node with 512TB
Performance Improvement: Stock Ceph vs IF OS 256K Random Blocks
[Chart: IOPS and average latency (ms), Stock Ceph vs IFOS 1.0, at queue depths 1/4/16 and 0/25/50/75/100% read IOs]
For 256K blocks, max throughput reaches 5.8GB/s; bare-metal InfiniFlash performance is 7.5GB/s.
• 2 RBD/client x 4 clients total • 1 InfiniFlash node with 512TB
Test Configuration – 3 InfiniFlash Systems (128TB each)
Scaling with Performance 8K Random Blocks
[Chart: IOPS and average latency (ms) at queue depths 1/8/64 and 0/25/50/75/100% read IOs]
Performance scales linearly with additional InfiniFlash nodes
• 2 RBD/client x 5 clients • 3 InfiniFlash nodes with 128TB each
Scaling with Performance 64K Random Blocks
[Chart: IOPS and average latency (ms) at queue depths from 1 to 256 and 0/25/50/75/100% read IOs]
Performance scales linearly with additional InfiniFlash nodes
• 2 RBD/client x 5 clients • 3 InfiniFlash nodes with 128TB each
Scaling with Performance 256K Random Blocks
[Chart: IOPS and average latency (ms) at queue depths from 1 to 256 and 0/25/50/75/100% read IOs]
Performance scales linearly with additional InfiniFlash nodes
• 2 RBD/client x 5 clients • 3 InfiniFlash nodes with 128TB each
Open Source with SanDisk Advantage: InfiniFlash OS – Enterprise-Level Hardened Ceph
The innovation and speed of open source with the trustworthiness of enterprise-grade, web-scale testing and hardware optimization
Performance optimization for flash and hardware tuning
Hardened and tested for hyperscale deployments and workloads
Enterprise-class support and services from SanDisk
Risk mitigation through long-term support and a reliable long-term roadmap
Continual contribution back to the community
Enterprise Level Hardening
Testing at Hyperscale
• 9,000 hours of cumulative IO tests
• 1,100+ unique test cases
• 1,000 hours of cluster rebalancing tests
• 1,000 hours of IO on iSCSI
• Over 100 server-node clusters
• Over 4PB of flash storage
Failure Testing
• 2,000-cycle node reboot
• 1,000 abrupt node power cycles
• 1,000 storage failures
• 1,000 network failures
• IO for 250 hours at a stretch
IFOS on InfiniFlash
[Diagram: a compute farm of client applications accessing LUNs through SCSI targets and RBDs/RGW, with read and write IO flowing to OSDs on SAS-connected InfiniFlash enclosures (HSEB A / HSEB B) in the storage farm]
Disaggregated Architecture
Compute & Storage Disaggregation leads to Optimal Resource utilization
Independent Scaling of Compute and Storage
Optimized for Performance
Software & Hardware Configurations tuned for performance
Reduced Costs
Reduce the replica count with higher reliability of Flash
Choice of Full Replicas or Erasure Coded Storage pool on Flash
Flash + HDD with Data Tiering – Flash Performance with TCO of HDD
InfiniFlash OS performs automatic data placement and data movement between tiers, transparent to applications
User-defined policies for data placement on tiers
Can be used with erasure coding to further reduce TCO
Benefits
Flash based performance with HDD like TCO
Lower performance requirements on HDD tier enables use of denser and cheaper SMR drives
Denser and lower power compared to HDD only solution
InfiniFlash for High Activity data and SMR drives for Low activity data
[Diagram: compute farm; InfiniFlash tier plus HDD servers with 60+ HDDs per server]
Flash Primary + HDD Replicas – Flash Performance with TCO of HDD
Primary replica on InfiniFlash
HDD-based data node for the 2nd (local) replica
HDD-based data node for the 3rd (DR) replica
Higher affinity of the primary replica ensures much of the compute runs against InfiniFlash data
2nd and 3rd replicas on HDDs are primarily for data protection
The high throughput of InfiniFlash handles data protection and movement for all replicas without impacting application IO
Eliminates the cascade data-propagation requirement for HDD replicas
Flash-accelerated object performance for replica 1 allows denser and cheaper SMR HDDs for replicas 2 and 3
TCO Example – Object Storage: Scale-out Flash Benefits at the TCO of HDD
Note that operational/maintenance costs and performance benefits are not accounted for in these models.
@Scale operational costs demand flash
• Weekly failure rate for a 100PB deployment: 15-35 HDDs vs. 1 InfiniFlash card
• HDDs cannot handle simultaneous egress/ingress
• Long rebuild times, multiple failures
• Rebalancing of PBs of data results in service disruption
• Flash provides a guaranteed and consistent SLA
• Flash capacity utilization is far higher than HDD's due to reliability and ops
[Chart: 3-year TCO comparison (TCA + 3-year opex, $0-$80M) and total racks (0-100) for 96PB object storage: traditional object store on HDD; InfiniFlash object store, 3 full replicas on flash; InfiniFlash with erasure coding, all flash; InfiniFlash with flash primary and HDD copies]
Flash Card Performance**
Read throughput > 400MB/s
Read IOPS > 20K
Random read/write @ 4K, 90/10 mix: > 15K IOPS
Flash Card Integration
Alerts and monitoring
Latching integrated and monitored
Integrated air temperature sampling
InfiniFlash System
Capacity 512TB* raw
All-Flash 3U Storage System
64 x 8TB Flash Cards with Pfail
8 SAS ports total
Operational Efficiency and Resilience
Hot Swappable components, Easy FRU
Low power: 450W (avg), 750W (active)
MTBF 1.5+ million hours
Scalable Performance**
780K IOPS
7GB/s Throughput
Upgrade to 12GB/s in Q315
* 1TB = 1,000,000,000,000 bytes. Actual user capacity less.
** Based on internal testing of InfiniFlash 100. Test report available.
InfiniFlash™ System
The First All-Flash Storage System Built for High Performance Ceph
Thank You! @BigDataFlash #bigdataflash
©2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Enterprise IP LLC. All other product and company names are used for identification purposes and may be trademarks of their respective holder(s).
Messenger layer
Removed the Dispatcher and introduced a "fast path" mechanism for read/write requests (sketched below)
• Same mechanism is now present on the client side (librados) as well
Fine grained locking in message transmit path
Introduced an efficient buffering mechanism for improved throughput
Configuration options to disable message signing, CRC checks, etc.
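As a rough illustration of the fast-path idea (a simplified sketch, not the Ceph Messenger; Messenger, deliver and the handler names here are invented for the example): messages carrying read/write requests are handed straight to a registered fast handler on the receiving thread, while everything else still goes through the single queued dispatcher.

```cpp
// Simplified illustration (not the Ceph Messenger): inline "fast dispatch"
// for read/write requests vs. a single queued dispatcher thread.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Message { std::string op; int id; };

class Messenger {
public:
    using Handler = std::function<void(const Message&)>;

    Messenger(Handler slow, Handler fast)
        : slow_(std::move(slow)), fast_(std::move(fast)), dispatcher_([this] { drain(); }) {}

    ~Messenger() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_one();
        dispatcher_.join();
    }

    // Called on the thread that read the message off the wire.
    void deliver(const Message& msg) {
        if (msg.op == "read" || msg.op == "write") {
            fast_(msg);                       // fast path: no queue, no dispatcher hop
            return;
        }
        { std::lock_guard<std::mutex> g(m_); q_.push(msg); }
        cv_.notify_one();                     // slow path: single dispatch queue
    }

private:
    void drain() {
        std::unique_lock<std::mutex> lk(m_);
        while (true) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty() && done_) return;
            Message msg = q_.front(); q_.pop();
            lk.unlock();
            slow_(msg);                       // handled by the lone dispatcher thread
            lk.lock();
        }
    }

    Handler slow_, fast_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Message> q_;
    bool done_ = false;
    std::thread dispatcher_;                  // declared last so state exists before it starts
};

int main() {
    Messenger ms(
        [](const Message& m) { std::cout << "queued dispatch: " << m.op << " " << m.id << "\n"; },
        [](const Message& m) { std::cout << "fast dispatch:   " << m.op << " " << m.id << "\n"; });
    ms.deliver({"read", 1});
    ms.deliver({"osd_map", 2});
    ms.deliver({"write", 3});
    return 0;
}
```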
OSD Request Processing
Running with the Memstore backend revealed a bottleneck in the OSD thread pool code
The OSD worker thread pool mutex was heavily contended
Implemented a sharded worker thread pool; requests are sharded based on their PG (placement group) identifier (see the sketch below)
Configuration options to set number of shards and number of worker threads per shard
Optimized OpTracking path (Sharded Queue and removed redundant locks)
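A minimal sketch of the sharding idea (not Ceph's actual sharded thread pool code; ShardedOpQueue and Op are invented names): each shard owns its queue, mutex and worker threads, and an op is routed to a shard by its PG id, so ops for the same PG stay together while threads in different shards never touch the same lock. The config names in the comment (osd_op_num_shards, osd_op_num_threads_per_shard) are the upstream Ceph options this slide appears to refer to.

```cpp
// Simplified sketch (not Ceph code): ops are sharded by placement-group id
// so each shard has its own queue, mutex and workers, removing the single
// heavily contended pool-wide mutex.
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Op { uint64_t pg_id; std::function<void()> work; };

class ShardedOpQueue {
public:
    ShardedOpQueue(unsigned num_shards, unsigned threads_per_shard)
        : shards_(num_shards) {
        for (auto& s : shards_)
            for (unsigned i = 0; i < threads_per_shard; ++i)
                s.workers.emplace_back([&s] { worker(s); });
    }

    ~ShardedOpQueue() {
        for (auto& s : shards_) {
            { std::lock_guard<std::mutex> g(s.lock); s.stopping = true; }
            s.cond.notify_all();
        }
        for (auto& s : shards_)
            for (auto& t : s.workers) t.join();
    }

    void enqueue(Op op) {
        Shard& s = shards_[op.pg_id % shards_.size()];  // all ops of a PG land in one shard
        { std::lock_guard<std::mutex> g(s.lock); s.q.push(std::move(op)); }
        s.cond.notify_one();
    }

private:
    struct Shard {
        std::mutex lock;                      // contended only within this shard
        std::condition_variable cond;
        std::queue<Op> q;
        bool stopping = false;
        std::vector<std::thread> workers;
    };

    static void worker(Shard& s) {
        std::unique_lock<std::mutex> lk(s.lock);
        while (true) {
            s.cond.wait(lk, [&s] { return s.stopping || !s.q.empty(); });
            if (s.q.empty()) return;          // stopping and drained
            Op op = std::move(s.q.front()); s.q.pop();
            lk.unlock();
            op.work();                        // run outside the shard lock
            lk.lock();
        }
    }

    std::vector<Shard> shards_;   // cf. osd_op_num_shards / osd_op_num_threads_per_shard
};

int main() {
    ShardedOpQueue pool(5, 2);    // 5 shards, 2 worker threads each
    for (uint64_t pg = 0; pg < 10; ++pg)
        pool.enqueue({pg, [pg] { std::cout << "pg " << pg << " handled\n"; }});
    return 0;
}
```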
FileStore improvements
Took backend storage out of the picture by using a small working set (FileStore served data from the page cache)
Severe lock contention in the LRU FD (file descriptor) cache; implemented a sharded version of the LRU cache (see the sketch below)
The CollectionIndex (per-PG) object was being created on every IO request; implemented a cache for it, since PG info doesn't change often
Optimized “Object-name to XFS file name” mapping function
Removed redundant snapshot related checks in parent read processing path
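A simplified sketch of the sharded FD cache idea (not the FileStore code; ShardedLRUCache is an invented name): the LRU is split into independently locked shards keyed by a hash of the object name, so lookups for different objects rarely contend on the same mutex.

```cpp
// Simplified sketch (not Ceph's FDCache): a sharded LRU cache where each
// shard has its own lock, so FD lookups for different objects rarely
// contend on a single global LRU mutex.
#include <functional>
#include <iostream>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

class ShardedLRUCache {
public:
    ShardedLRUCache(size_t num_shards, size_t per_shard_capacity)
        : shards_(num_shards), capacity_(per_shard_capacity) {}

    // Return the cached fd for an object, or open_fn() and cache it.
    int get(const std::string& oid, const std::function<int()>& open_fn) {
        Shard& s = shards_[std::hash<std::string>{}(oid) % shards_.size()];
        std::lock_guard<std::mutex> g(s.lock);                    // per-shard lock only
        auto it = s.index.find(oid);
        if (it != s.index.end()) {
            s.lru.splice(s.lru.begin(), s.lru, it->second);       // move to MRU position
            return it->second->second;
        }
        if (s.lru.size() >= capacity_) {                          // evict least recently used
            s.index.erase(s.lru.back().first);
            s.lru.pop_back();                                     // real code would close() the fd
        }
        int fd = open_fn();
        s.lru.emplace_front(oid, fd);
        s.index[oid] = s.lru.begin();
        return fd;
    }

private:
    struct Shard {
        std::mutex lock;
        std::list<std::pair<std::string, int>> lru;               // MRU at front
        std::unordered_map<std::string,
            std::list<std::pair<std::string, int>>::iterator> index;
    };
    std::vector<Shard> shards_;
    size_t capacity_;
};

int main() {
    ShardedLRUCache fd_cache(16, 128);
    int next_fd = 3;
    int fd = fd_cache.get("rbd_data.1234.00000000", [&] { return next_fd++; });
    std::cout << "fd = " << fd << "\n";
    // Second lookup hits the cache and returns the same fd without "opening" again.
    std::cout << "fd again = "
              << fd_cache.get("rbd_data.1234.00000000", [&] { return next_fd++; }) << "\n";
    return 0;
}
```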
Inconsistent Performance Observation
Large performance variations on different pools across multiple clients
First client after cluster restart gets maximum performance irrespective of the pool
Continued degraded performance from clients starting later
Issue also observed on read I/O with unpopulated RBD images – ruled out filesystem issues
Performance counters show up to 3x increase in latency through the I/O path with no particular bottleneck
Issue with TCmalloc
Perf top shows a rapid increase in time spent in TCmalloc functions:
  14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
   7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
I/O from a different client causes new threads in the sharded thread pool to process the I/O
This causes memory movement between thread caches and increases alloc/free latency
JEmalloc and Glibc malloc do not exhibit this behavior
JEmalloc build option added to Ceph Hammer
Setting the TCmalloc tunable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a larger value (64MB) alleviates the issue (example below)
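For reference, a hedged example of applying that workaround. The environment variable is normally set before starting ceph-osd; the same limit can also be set programmatically through gperftools' MallocExtension interface (assuming the gperftools development headers are installed).

```cpp
// Hedged example: raising tcmalloc's aggregate thread-cache limit to 64MB.
// Normally done via the environment (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=67108864)
// before starting ceph-osd; gperftools also exposes the same knob at runtime.
// Build against gperftools, e.g.: g++ demo.cc -ltcmalloc
#include <cstddef>
#include <iostream>
#include <gperftools/malloc_extension.h>

int main() {
    const size_t kLimit = 64UL << 20;   // 64MB across all thread caches
    bool ok = MallocExtension::instance()->SetNumericProperty(
        "tcmalloc.max_total_thread_cache_bytes", kLimit);

    size_t value = 0;
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.max_total_thread_cache_bytes", &value);
    std::cout << (ok ? "set" : "failed to set")
              << ", current limit = " << value << " bytes\n";
    return 0;
}
```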
Client Optimizations
By default, Ceph turns Nagle's algorithm off (see the sketch below)
RBD kernel driver ignored TCP_NODELAY setting
Large latency variations at lower queue depths
Changes to RBD driver submitted upstream
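For context, a small sketch of what "turning Nagle off" means at the socket level; the RBD fix itself is a kernel-driver change, not application code. The userspace messenger sets TCP_NODELAY so small requests are sent immediately instead of being coalesced.

```cpp
// Disabling Nagle's algorithm on a TCP socket with TCP_NODELAY, as the
// userspace Ceph messenger does by default; the kernel RBD fix referenced
// above was a change in the kernel driver, not in application code.
#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int one = 1;   // small writes go out immediately instead of being coalesced
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0) {
        perror("setsockopt(TCP_NODELAY)");
        close(fd);
        return 1;
    }
    std::printf("TCP_NODELAY enabled on fd %d\n", fd);
    close(fd);
    return 0;
}
```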