Ceph on All-Flash Storage – Breaking Performance Barriers
Axel Rosenberg, Sr. Technical Marketing Manager
April 28, 2015
Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
Designed for Big Data Workloads @ PB Scale
Content Repositories
• Mixed media containers, active archiving, backup, locality of data
• Large containers with application SLAs
Big Data Analytics
• Internet of Things, sensor analytics
• Time-to-value and time-to-insight
• Hadoop, NoSQL, Cassandra, MongoDB
Media Services
• High read-intensive access from billions of edge devices
• Hi-def video driving even greater demand for capacity and performance
• Surveillance systems, analytics
InfiniFlash System
• Ultra-dense All-Flash Appliance
- 512TB in 3U
- Best in class $/IOPS/TB
• Scale-out software for massive capacity
- Unified Content: Block, Object
- Flash-optimized software with programmable interfaces (SDK)
• Enterprise-class storage features
- Snapshots, replication, thin provisioning
IF500 – InfiniFlash OS (Ceph)
Ideal for large-scale object storage use cases
12G IOPS Performance Numbers
Innovating Performance @ Massive Scale: InfiniFlash OS
Ceph Transformed for Flash Performance and Contributed Back to Community
• 10x improvement for block reads, 2x improvement for object reads
Major Improvements to Enhance Parallelism
• Removed single Dispatch queue bottlenecks for OSD and Client (librados) layers
• Sharded thread pool implementation
• Major lock reordering
• Improved lock granularity – reader/writer locks
• Granular locks at the object level (illustrated in the sketch below)
• Optimized OpTracking path in the OSD, eliminating redundant locks
Messenger Performance Enhancements
• Message signing
• Socket Read aheads
• Resolved severe lock contentions
Backend Optimizations – XFS and Flash
• Reduced CPU usage by ~2 cores with improved object-ID-to-file-path resolution
• CPU and Lock optimized fast path for reads
• Disabled throttling for Flash
• Index Manager caching and Shared FdCache in filestore
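The finer-grained locking above can be pictured with a small illustrative sketch. This is not the InfiniFlash OS or Ceph source, and the class names (ObjectLockMap, etc.) are invented for the example: a coarse store-wide mutex is replaced by per-object reader/writer locks, so reads of different objects, and concurrent reads of the same object, no longer serialize behind a single lock.

```cpp
// Minimal sketch (not Ceph source): per-object reader/writer locking.
// A coarse store-wide mutex would serialize every read; here readers of the
// same object share a lock and writers take it exclusively.
#include <iostream>
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

class ObjectLockMap {
public:
    // Look up (or create) the lock that guards a single object.
    std::shared_mutex& lock_for(const std::string& oid) {
        std::lock_guard<std::mutex> g(map_lock_);   // short critical section
        auto& entry = locks_[oid];
        if (!entry) entry = std::make_unique<std::shared_mutex>();
        return *entry;
    }
private:
    std::mutex map_lock_;                           // guards only the lock map itself
    std::map<std::string, std::unique_ptr<std::shared_mutex>> locks_;
};

int main() {
    ObjectLockMap locks;
    std::map<std::string, int> store = {{"obj.a", 0}, {"obj.b", 0}};

    auto reader = [&](const std::string& oid) {
        std::shared_lock<std::shared_mutex> r(locks.lock_for(oid));  // shared: many readers
        volatile int v = store.at(oid);
        (void)v;
    };
    auto writer = [&](const std::string& oid) {
        std::unique_lock<std::shared_mutex> w(locks.lock_for(oid));  // exclusive: one writer
        ++store.at(oid);
    };

    std::vector<std::thread> t;
    for (int i = 0; i < 4; ++i) t.emplace_back(reader, "obj.a");   // reads run concurrently
    t.emplace_back(writer, "obj.b");                               // write to obj.b doesn't block them
    for (auto& th : t) th.join();
    std::cout << "obj.b = " << store["obj.b"] << "\n";
    return 0;
}
```

Whether the actual implementation uses shared mutexes or custom reader/writer locks is not stated on the slide; the point is only that read-mostly paths stop contending on one lock.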
Results!
Test Configuration – Single InfiniFlash System
Performance config: InfiniFlash with two storage controllers
2-node cluster (32 drives shared to each OSD node)
OSD nodes: 2 servers (Dell R720), 2x E5-2680 8C 2.8GHz 25MB cache, 4x 16GB RDIMM dual rank x4 (64GB), 1x Mellanox ConnectX-3 dual 40GbE, 1x LSI 9207 HBA
RBD clients: 4 servers (Dell R620), 2x E5-2680 10C 2.8GHz 25MB cache, 2x 16GB RDIMM dual rank x4 (32GB), 1x Mellanox ConnectX-3 dual 40GbE
Storage: InfiniFlash with 512TB, 2 OSD servers
InfiniFlash: 1 InfiniFlash connected with 64 x 1YX2 Icechips in A2 topology
Total storage: 64 x 8TB = 512TB (effective 430TB)
InfiniFlash firmware: FFU 1.0.0.31.1
Network: 40G switch
OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
LSI card / driver: SAS2308 (9207), mpt2sas
Mellanox 40Gbps NIC: MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)
Cluster configuration:
Ceph version: sndk-ifos-1.0.0.04 (0.86.rc.eap2)
Replication (default): 2, host-level replication
Pools, PGs and RBDs: 4 pools, 2048 PGs per pool, 2 RBDs from each pool
RBD size: 2TB
Number of monitors: 1
Number of OSD nodes: 2
OSDs per node: 32 (total OSDs = 32 x 2 = 64)
Performance Improvement: Stock Ceph vs IF OS 8K Random Blocks
Read performance improves 3x to 12x depending on block size
[Chart: IOPS and average latency (ms) for 8K random blocks, Stock Ceph (Giant) vs IFOS 1.0, at queue depths 1/4/16 and 0/25/50/75/100% read IOs]
• 2 RBD/client x 4 clients total • 1 InfiniFlash node with 512TB
Performance Improvement: Stock Ceph vs IF OS 64K Random Blocks
[Chart: IOPS and average latency (ms), Stock Ceph vs IFOS 1.0, at queue depths 1/4/16 and 0/25/50/75/100% read IOs]
• 2 RBD/client x 4 clients total • 1 InfiniFlash node with 512TB
Performance Improvement: Stock Ceph vs IF OS 256K Random Blocks
[Chart: IOPS and average latency (ms), Stock Ceph vs IFOS 1.0, at queue depths 1/4/16 and 0/25/50/75/100% read IOs]
For 256K blocks, max throughput reaches 5.8GB/s; bare-metal InfiniFlash performance is 7.5GB/s.
• 2 RBD/client x 4 clients total • 1 InfiniFlash node with 512TB
Test Configuration – 3 InfiniFlash Systems (128TB each)
Scaling with Performance 8K Random Blocks
[Chart: IOPS and average latency (ms) at queue depths 1/8/64 and 0/25/50/75/100% read IOs]
Performance scales linearly with additional InfiniFlash nodes
• 2 RBD/client x 5 clients • 3 InfiniFlash nodes with 128TB each
Scaling with Performance 64K Random Blocks
[Chart: IOPS and average latency (ms) at queue depths from 1 to 256 and 0/25/50/75/100% read IOs]
Performance scales linearly with additional InfiniFlash nodes
• 2 RBD/client x 5 clients • 3 InfiniFlash nodes with 128TB each
Scaling with Performance 256K Random Blocks
[Chart: IOPS and average latency (ms) at queue depths from 1 to 256 and 0/25/50/75/100% read IOs]
Performance scales linearly with additional InfiniFlash nodes
• 2 RBD/client x 5 clients • 3 InfiniFlash nodes with 128TB each
Open Source with SanDisk Advantage: InfiniFlash OS – Enterprise-Level Hardened Ceph
The innovation and speed of open source with the trustworthiness of enterprise-grade, web-scale testing and hardware optimization
Performance optimization for flash and hardware tuning
Hardened and tested for hyperscale deployments and workloads
Enterprise-class support and services from SanDisk
Risk mitigation through long-term support and a reliable long-term roadmap
Continual contribution back to the community
Enterprise Level Hardening
Testing at Hyperscale
• 9,000 hours of cumulative IO tests
• 1,100+ unique test cases
• 1,000 hours of cluster rebalancing tests
• 1,000 hours of IO on iSCSI
• Over 100 server-node clusters
• Over 4PB of flash storage
Failure Testing
• 2,000-cycle node reboot
• 1,000 abrupt node power cycles
• 1,000 storage failures
• 1,000 network failures
• IO for 250 hours at a stretch
IFOS on InfiniFlash
[Diagram: a compute farm of client applications accessing LUNs through SCSI targets and RBDs/RGW, with read and write IO flowing to OSDs on SAS-connected InfiniFlash enclosures (HSEB A / HSEB B) in the storage farm]
Disaggregated Architecture
Compute & Storage Disaggregation leads to Optimal Resource utilization
Independent Scaling of Compute and Storage
Optimized for Performance
Software & Hardware Configurations tuned for performance
Reduced Costs
Reduce the replica count with higher reliability of Flash
Choice of Full Replicas or Erasure Coded Storage pool on Flash
Flash + HDD with Data Tiering – Flash Performance with TCO of HDD
InfiniFlash OS performs automatic data placement and data movement between tiers, transparent to applications
User-defined policies for data placement on tiers
Can be used with erasure coding to further reduce TCO
Benefits
Flash based performance with HDD like TCO
Lower performance requirements on HDD tier enables use of denser and cheaper SMR drives
Denser and lower power compared to HDD only solution
InfiniFlash for High Activity data and SMR drives for Low activity data
[Diagram: compute farm; InfiniFlash tier plus HDD servers with 60+ HDDs per server]
Flash Primary + HDD Replicas – Flash Performance with TCO of HDD
Primary replica on InfiniFlash
HDD-based data node for the 2nd (local) replica
HDD-based data node for the 3rd (DR) replica
Higher affinity of the primary replica ensures much of the compute runs against InfiniFlash data
2nd and 3rd replicas on HDDs are primarily for data protection
The high throughput of InfiniFlash handles data protection and movement for all replicas without impacting application IO
Eliminates the cascade data-propagation requirement for HDD replicas
Flash-accelerated object performance for replica 1 allows denser and cheaper SMR HDDs for replicas 2 and 3
TCO Example – Object Storage: Scale-out Flash Benefits at the TCO of HDD
Note that operational/maintenance costs and performance benefits are not accounted for in these models.
@Scale operational costs demand flash
• Weekly failure rate for a 100PB deployment: 15-35 HDDs vs. 1 InfiniFlash card
• HDDs cannot handle simultaneous egress/ingress
• Long rebuild times, multiple failures
• Rebalancing of PBs of data results in service disruption
• Flash provides a guaranteed and consistent SLA
• Flash capacity utilization is far higher than HDD's due to reliability and ops
[Chart: 3-year TCO comparison (TCA + 3-year opex, $0-$80M) and total racks (0-100) for 96PB object storage: traditional object store on HDD; InfiniFlash object store, 3 full replicas on flash; InfiniFlash with erasure coding, all flash; InfiniFlash with flash primary and HDD copies]
Flash Card Performance**
Read throughput > 400MB/s
Read IOPS > 20K
Random read/write @ 4K, 90/10 mix: > 15K IOPS
Flash Card Integration
Alerts and monitoring
Latching integrated and monitored
Integrated air temperature sampling
InfiniFlash System
Capacity 512TB* raw
All-Flash 3U Storage System
64 x 8TB Flash Cards with Pfail
8 SAS ports total
Operational Efficiency and Resilience
Hot Swappable components, Easy FRU
Low power: 450W (avg), 750W (active)
MTBF 1.5+ million hours
Scalable Performance**
780K IOPS
7GB/s Throughput
Upgrade to 12GB/s in Q315
* 1TB = 1,000,000,000,000 bytes. Actual user capacity less.
** Based on internal testing of InfiniFlash 100. Test report available.
InfiniFlash™ System
The First All-Flash Storage System Built for High Performance Ceph
Thank You! @BigDataFlash #bigdataflash
©2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Enterprise IP LLC. All other product and company names are used for identification purposes and may be trademarks of their respective holder(s).
Messenger layer
Removed the Dispatcher and introduced a "fast path" mechanism for read/write requests (sketched below)
• Same mechanism is now present on the client side (librados) as well
Fine grained locking in message transmit path
Introduced an efficient buffering mechanism for improved throughput
Configuration options to disable message signing, CRC checks, etc.
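As a rough illustration of the fast-path idea (a simplified sketch, not the Ceph Messenger; Messenger, deliver and the handler names here are invented for the example): messages carrying read/write requests are handed straight to a registered fast handler on the receiving thread, while everything else still goes through the single queued dispatcher.

```cpp
// Simplified illustration (not the Ceph Messenger): inline "fast dispatch"
// for read/write requests vs. a single queued dispatcher thread.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Message { std::string op; int id; };

class Messenger {
public:
    using Handler = std::function<void(const Message&)>;

    Messenger(Handler slow, Handler fast)
        : slow_(std::move(slow)), fast_(std::move(fast)), dispatcher_([this] { drain(); }) {}

    ~Messenger() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_one();
        dispatcher_.join();
    }

    // Called on the thread that read the message off the wire.
    void deliver(const Message& msg) {
        if (msg.op == "read" || msg.op == "write") {
            fast_(msg);                       // fast path: no queue, no dispatcher hop
            return;
        }
        { std::lock_guard<std::mutex> g(m_); q_.push(msg); }
        cv_.notify_one();                     // slow path: single dispatch queue
    }

private:
    void drain() {
        std::unique_lock<std::mutex> lk(m_);
        while (true) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty() && done_) return;
            Message msg = q_.front(); q_.pop();
            lk.unlock();
            slow_(msg);                       // handled by the lone dispatcher thread
            lk.lock();
        }
    }

    Handler slow_, fast_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Message> q_;
    bool done_ = false;
    std::thread dispatcher_;                  // declared last so state exists before it starts
};

int main() {
    Messenger ms(
        [](const Message& m) { std::cout << "queued dispatch: " << m.op << " " << m.id << "\n"; },
        [](const Message& m) { std::cout << "fast dispatch:   " << m.op << " " << m.id << "\n"; });
    ms.deliver({"read", 1});
    ms.deliver({"osd_map", 2});
    ms.deliver({"write", 3});
    return 0;
}
```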
OSD Request Processing
Running with the Memstore backend revealed a bottleneck in the OSD thread pool code
The OSD worker thread pool mutex was heavily contended
Implemented a sharded worker thread pool; requests are sharded based on their PG (placement group) identifier (see the sketch below)
Configuration options to set number of shards and number of worker threads per shard
Optimized OpTracking path (Sharded Queue and removed redundant locks)
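A minimal sketch of the sharding idea (not Ceph's actual sharded thread pool code; ShardedOpQueue and Op are invented names): each shard owns its queue, mutex and worker threads, and an op is routed to a shard by its PG id, so ops for the same PG stay together while threads in different shards never touch the same lock. The config names in the comment (osd_op_num_shards, osd_op_num_threads_per_shard) are the upstream Ceph options this slide appears to refer to.

```cpp
// Simplified sketch (not Ceph code): ops are sharded by placement-group id
// so each shard has its own queue, mutex and workers, removing the single
// heavily contended pool-wide mutex.
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Op { uint64_t pg_id; std::function<void()> work; };

class ShardedOpQueue {
public:
    ShardedOpQueue(unsigned num_shards, unsigned threads_per_shard)
        : shards_(num_shards) {
        for (auto& s : shards_)
            for (unsigned i = 0; i < threads_per_shard; ++i)
                s.workers.emplace_back([&s] { worker(s); });
    }

    ~ShardedOpQueue() {
        for (auto& s : shards_) {
            { std::lock_guard<std::mutex> g(s.lock); s.stopping = true; }
            s.cond.notify_all();
        }
        for (auto& s : shards_)
            for (auto& t : s.workers) t.join();
    }

    void enqueue(Op op) {
        Shard& s = shards_[op.pg_id % shards_.size()];  // all ops of a PG land in one shard
        { std::lock_guard<std::mutex> g(s.lock); s.q.push(std::move(op)); }
        s.cond.notify_one();
    }

private:
    struct Shard {
        std::mutex lock;                      // contended only within this shard
        std::condition_variable cond;
        std::queue<Op> q;
        bool stopping = false;
        std::vector<std::thread> workers;
    };

    static void worker(Shard& s) {
        std::unique_lock<std::mutex> lk(s.lock);
        while (true) {
            s.cond.wait(lk, [&s] { return s.stopping || !s.q.empty(); });
            if (s.q.empty()) return;          // stopping and drained
            Op op = std::move(s.q.front()); s.q.pop();
            lk.unlock();
            op.work();                        // run outside the shard lock
            lk.lock();
        }
    }

    std::vector<Shard> shards_;   // cf. osd_op_num_shards / osd_op_num_threads_per_shard
};

int main() {
    ShardedOpQueue pool(5, 2);    // 5 shards, 2 worker threads each
    for (uint64_t pg = 0; pg < 10; ++pg)
        pool.enqueue({pg, [pg] { std::cout << "pg " << pg << " handled\n"; }});
    return 0;
}
```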
FileStore improvements
Took backend storage out of the picture by using a small working set (FileStore served data from the page cache)
Severe lock contention in the LRU FD (file descriptor) cache; implemented a sharded version of the LRU cache (see the sketch below)
The CollectionIndex (per-PG) object was being created on every IO request; implemented a cache for it, since PG info doesn't change often
Optimized “Object-name to XFS file name” mapping function
Removed redundant snapshot related checks in parent read processing path
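A simplified sketch of the sharded FD cache idea (not the FileStore code; ShardedLRUCache is an invented name): the LRU is split into independently locked shards keyed by a hash of the object name, so lookups for different objects rarely contend on the same mutex.

```cpp
// Simplified sketch (not Ceph's FDCache): a sharded LRU cache where each
// shard has its own lock, so FD lookups for different objects rarely
// contend on a single global LRU mutex.
#include <functional>
#include <iostream>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

class ShardedLRUCache {
public:
    ShardedLRUCache(size_t num_shards, size_t per_shard_capacity)
        : shards_(num_shards), capacity_(per_shard_capacity) {}

    // Return the cached fd for an object, or open_fn() and cache it.
    int get(const std::string& oid, const std::function<int()>& open_fn) {
        Shard& s = shards_[std::hash<std::string>{}(oid) % shards_.size()];
        std::lock_guard<std::mutex> g(s.lock);                    // per-shard lock only
        auto it = s.index.find(oid);
        if (it != s.index.end()) {
            s.lru.splice(s.lru.begin(), s.lru, it->second);       // move to MRU position
            return it->second->second;
        }
        if (s.lru.size() >= capacity_) {                          // evict least recently used
            s.index.erase(s.lru.back().first);
            s.lru.pop_back();                                     // real code would close() the fd
        }
        int fd = open_fn();
        s.lru.emplace_front(oid, fd);
        s.index[oid] = s.lru.begin();
        return fd;
    }

private:
    struct Shard {
        std::mutex lock;
        std::list<std::pair<std::string, int>> lru;               // MRU at front
        std::unordered_map<std::string,
            std::list<std::pair<std::string, int>>::iterator> index;
    };
    std::vector<Shard> shards_;
    size_t capacity_;
};

int main() {
    ShardedLRUCache fd_cache(16, 128);
    int next_fd = 3;
    int fd = fd_cache.get("rbd_data.1234.00000000", [&] { return next_fd++; });
    std::cout << "fd = " << fd << "\n";
    // Second lookup hits the cache and returns the same fd without "opening" again.
    std::cout << "fd again = "
              << fd_cache.get("rbd_data.1234.00000000", [&] { return next_fd++; }) << "\n";
    return 0;
}
```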
Inconsistent Performance Observation
Large performance variations on different pools across multiple clients
First client after cluster restart gets maximum performance irrespective of the pool
Continued degraded performance from clients starting later
Issue also observed on read I/O with unpopulated RBD images – ruled out filesystem issues
Performance counters show up to 3x increase in latency through the I/O path with no particular bottleneck
Issue with TCmalloc
Perf top shows a rapid increase in time spent in TCmalloc functions:
  14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
   7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
I/O from a different client causes new threads in the sharded thread pool to process the I/O
This causes memory movement between thread caches and increases alloc/free latency
JEmalloc and Glibc malloc do not exhibit this behavior
JEmalloc build option added to Ceph Hammer
Setting the TCmalloc tunable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a larger value (64MB) alleviates the issue (example below)
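For reference, a hedged example of applying that workaround. The environment variable is normally set before starting ceph-osd; the same limit can also be set programmatically through gperftools' MallocExtension interface (assuming the gperftools development headers are installed).

```cpp
// Hedged example: raising tcmalloc's aggregate thread-cache limit to 64MB.
// Normally done via the environment (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=67108864)
// before starting ceph-osd; gperftools also exposes the same knob at runtime.
// Build against gperftools, e.g.: g++ demo.cc -ltcmalloc
#include <cstddef>
#include <iostream>
#include <gperftools/malloc_extension.h>

int main() {
    const size_t kLimit = 64UL << 20;   // 64MB across all thread caches
    bool ok = MallocExtension::instance()->SetNumericProperty(
        "tcmalloc.max_total_thread_cache_bytes", kLimit);

    size_t value = 0;
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.max_total_thread_cache_bytes", &value);
    std::cout << (ok ? "set" : "failed to set")
              << ", current limit = " << value << " bytes\n";
    return 0;
}
```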
Client Optimizations
By default, Ceph turns Nagle's algorithm off (see the sketch below)
RBD kernel driver ignored TCP_NODELAY setting
Large latency variations at lower queue depths
Changes to RBD driver submitted upstream
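For context, a small sketch of what "turning Nagle off" means at the socket level; the RBD fix itself is a kernel-driver change, not application code. The userspace messenger sets TCP_NODELAY so small requests are sent immediately instead of being coalesced.

```cpp
// Disabling Nagle's algorithm on a TCP socket with TCP_NODELAY, as the
// userspace Ceph messenger does by default; the kernel RBD fix referenced
// above was a change in the kernel driver, not in application code.
#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int one = 1;   // small writes go out immediately instead of being coalesced
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0) {
        perror("setsockopt(TCP_NODELAY)");
        close(fd);
        return 1;
    }
    std::printf("TCP_NODELAY enabled on fd %d\n", fd);
    close(fd);
    return 0;
}
```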