Ceph Day Beijing - SPDK for Ceph
Post on 22-Jan-2018
Ziye Yang, Senior Software Engineer
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© 2017 Intel Corporation.
• SPDK introduction and status update
• Current SPDK support in BlueStore
• Case study: Accelerate iSCSI service exported by Ceph
• SPDK support for Ceph in 2017
• Summary
The Problem: Software is becoming the bottleneck
The Opportunity: Use Intel software ingredients to unlock the potential of new media
Media: HDD, SATA NAND SSD, NVMe* NAND SSD, Intel® Optane™ SSD
I/O performance: <500 IO/s, >25,000 IO/s, >400,000 IO/s
Latency: >2 ms, <100 µs, <100 µs
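To make the "software is becoming the bottleneck" point concrete, the I/O rates above can be turned into an average time budget per I/O. This is a rough sketch: it treats each rate as a single serialized stream, which is a simplification and not something the slide states.

```python
# I/O rates quoted on the slide, in I/Os per second.
RATES_IO_PER_SEC = [500, 25_000, 400_000]

def per_io_budget_us(io_per_sec: float) -> float:
    """Average time available per I/O, in microseconds, at a given rate."""
    return 1_000_000 / io_per_sec

for rate in RATES_IO_PER_SEC:
    print(f"{rate:>7} IO/s -> {per_io_budget_us(rate):.1f} us per I/O")
# 500 IO/s leaves 2000.0 us per I/O; 400,000 IO/s leaves only 2.5 us.
```

At the highest rate, any software overhead of a few microseconds per I/O is comparable to the whole budget.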
Storage Performance Development Kit
Scalable and Efficient Software Ingredients
• User space, lockless, polled-mode components
• Up to millions of IOPS per core
• Designed for Intel Optane™ technology latencies
Intel® Platform Storage Reference Architecture
• Optimized for Intel platform characteristics
• Open source building blocks (BSD licensed)
• Available via spdk.io
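The "polled-mode" bullet above can be illustrated with a toy model. This is a minimal sketch in Python for readability only; SPDK itself is a C library, and the class and method names here (`FakeQueuePair`, `device_tick`, `process_completions`) are hypothetical stand-ins, not SPDK's API. The idea shown is that the submitting thread reaps completions itself in a loop instead of sleeping until an interrupt arrives.

```python
from collections import deque

class FakeQueuePair:
    """Toy stand-in for a device queue pair (hypothetical, not SPDK's API)."""

    def __init__(self):
        self._inflight = deque()   # submitted, not yet finished by the "device"
        self._completed = deque()  # finished, not yet reaped by the application

    def submit(self, io_id, callback):
        self._inflight.append((io_id, callback))

    def device_tick(self):
        """Simulate the device finishing the oldest in-flight I/O."""
        if self._inflight:
            self._completed.append(self._inflight.popleft())

    def process_completions(self, max_completions=0):
        """Reap finished I/Os and run their callbacks on the calling thread.

        max_completions=0 means no limit. Returns the number reaped.
        """
        reaped = 0
        while self._completed and (max_completions == 0 or reaped < max_completions):
            io_id, callback = self._completed.popleft()
            callback(io_id)
            reaped += 1
        return reaped

done = []
qp = FakeQueuePair()
for i in range(3):
    qp.submit(i, done.append)

# Polling loop: no interrupts and no locks; one thread owns the queue pair.
while len(done) < 3:
    qp.device_tick()
    qp.process_completions()

print(done)  # [0, 1, 2]
```

Because a single thread owns the queue pair, no locking is needed around it, which is the "lockless" half of the same bullet.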
Architecture
[Figure: SPDK architecture diagram. Components are marked Released, Q2’17, or Pathfinding.]
• Storage Protocols: iSCSI Target, NVMe-oF* Target, vhost-scsi Target, vhost-blk Target
• Storage Services: SCSI, NVMe, Block Device Abstraction (BDEV), Blobstore, BlobFS, Object
• BDEV modules: Ceph RBD, Linux Async IO, Blob bdev, 3rd Party, NVMe
• Drivers: NVMe Devices (NVMe* PCIe Driver, NVMe-oF* Initiator), Intel® QuickData Technology Driver
• Integration: RocksDB, Ceph
• Core: Application Framework
Benefits of using SPDK
SPDK: more performance from Intel CPUs, non-volatile media, and networking
• Up to 10X more IOPS/core for NVMe-oF* vs. Linux kernel
• Up to 8X more IOPS/core for NVMe vs. Linux kernel
• Faster TTM / fewer resources than developing components from scratch
• Provides future proofing as NVM technologies increase in performance
• Up to 350% better tail latency for RocksDB workloads
SPDK Updates: 17.03 Release (Mar 2017)
New components: broader set of use cases for SPDK libraries & ingredients
Blobstore
• Block allocator for applications
• Variable granularity, defaults to 4KB
BlobFS
• Lightweight, non-POSIX filesystem
• Page caching & prefetch
• Initially limited to DB file semantic requirements (e.g. file name and size)
RocksDB SPDK Environment
• Implement RocksDB using BlobFS
QEMU vhost-scsi Target
• Simplified I/O path to local QEMU guest VMs with unmodified apps
Existing components: feature and hardening improvements
NVMe over Fabrics Improvements
• Read latency improvement
• NVMe-oF Host (Initiator) zero-copy
• Discovery code simplification
• Quality, performance & hardening fixes
Current status
Fully realizing new media performance requires software optimizations
SPDK positioned to enable developers to realize this performance
SPDK available today via http://spdk.io
Help us build SPDK as an open source community!
Current SPDK support in BlueStore
New features
Support multiple threads for doing I/Os on NVMe SSDs via SPDK user space NVMe driver
Support running SPDK I/O threads on designated CPU cores in configuration file.
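Designating CPU cores for the SPDK I/O threads is typically done with a hexadecimal core mask, one bit per core. The helper below is a small sketch of that conversion; the function name is hypothetical. The Ceph option that takes such a mask is, from memory, `bluestore_spdk_coremask`; treat that name as an assumption and check it against the Ceph release in use.

```python
def cores_to_coremask(cores):
    """Return a hex bitmask string with one bit set per CPU core index."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << core  # bit N set means core N is selected
    return hex(mask)

print(cores_to_coremask([0, 1]))  # 0x3
print(cores_to_coremask([2, 3]))  # 0xc
```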
SPDK version upgrades in Ceph (now at 17.03)
Upgraded SPDK to 16.11 in Dec, 2016
Upgraded SPDK to 17.03 in April, 2017
Stability
Fixed several compilation issues and runtime bugs when using SPDK.
In total, 16 SPDK-related patches have been merged into BlueStore (mainly in the NVMEDEVICE module)
(From iStaury’s talk in SPDK PRC meetup 2016)
Block service exported by Ceph via iSCSI protocol
Cloud service providers that provision VM services can use iSCSI.
If Ceph could export a block service with good performance, it would be easy to connect those providers to a Ceph cluster solution.
[Figure: a client (APP, multipath device dm-1 over sdx and sdy, iSCSI initiator) connects to two iSCSI gateways (iSCSI target + RBD), which front a Ceph cluster of OSDs.]
iSCSI + RBD Gateway
Ceph server
CPU:Intel(R) Xeon(R) CPU E5-2660 v4 @2.00GHz
Four Intel P3700 SSDs
One OSD on each SSD, 4 OSDs in total
4 pools, PG number 512; one 10G image per pool
iSCSI target server (librbd+SPDK / librbd+tgt)
CPU:Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Only one core enabled
iSCSI initiator
CPU:Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
[Figure: iSCSI initiator, connected to the iSCSI target server (iSCSI target + librbd), connected to the Ceph server (OSD0, OSD1, OSD2, OSD3).]
iSCSI + RBD Gateway
One CPU core (IOPS by number of FIO jobs and images):
iSCSI type + op | 1 FIO + 1 img | 2 FIO + 2 img | 3 FIO + 3 img | SPDK iSCSI tgt/TGT ratio
TGT + 4k_randread | 10K | 20K | 20K |
SPDK iSCSI tgt + 4k_randread | 20K | 24K | 28K | 140%
TGT + 4k_randwrite | 6.5K | 9.5K | 18K |
SPDK iSCSI tgt + 4k_randwrite | 14K | 19K | 24K | 133%
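The ratio column appears to be the SPDK iSCSI target figure divided by the TGT figure at the highest stream count. That reading is an inference, but it reproduces the 140% and 133% on the slide, as this short check shows.

```python
# IOPS in thousands at 3 FIO jobs / 3 images with one CPU core, from the slide.
one_core_3_streams = {
    "4k_randread":  {"TGT": 20, "SPDK": 28},
    "4k_randwrite": {"TGT": 18, "SPDK": 24},
}

def ratio_percent(spdk_iops, tgt_iops):
    """SPDK throughput as a percentage of TGT throughput, rounded."""
    return round(100 * spdk_iops / tgt_iops)

for op, iops in one_core_3_streams.items():
    print(op, f"{ratio_percent(iops['SPDK'], iops['TGT'])}%")
# 4k_randread 140%
# 4k_randwrite 133%
```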
iSCSI + RBD Gateway
Two CPU cores (IOPS by number of FIO jobs and images):
iSCSI type + op | 1 FIO + 1 img | 2 FIO + 2 img | 3 FIO + 3 img | 4 FIO + 4 img | SPDK iSCSI tgt/TGT ratio
TGT + 4k_randread | 12K | 24K | 26K | 26K |
SPDK iSCSI tgt + 4k_randread | 37K | 47K | 47K | 47K | 181%
TGT + 4k_randwrite | 9.5K | 13.5K | 19K | 22K |
SPDK iSCSI tgt + 4k_randwrite | 16K | 24K | 25K | 27K | 123%
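Comparing the one-core and two-core results at the same load (3 FIO jobs, 3 images) gives a rough view of how each target scales with a second core. The sketch below only rearranges numbers already on the slides; the scaling factor is a derived figure, not one the talk reports.

```python
# (one-core IOPS, two-core IOPS) in thousands at 3 FIO jobs / 3 images.
iops_k_3_streams = {
    ("TGT", "4k_randread"): (20, 26),
    ("SPDK iSCSI tgt", "4k_randread"): (28, 47),
    ("TGT", "4k_randwrite"): (18, 19),
    ("SPDK iSCSI tgt", "4k_randwrite"): (24, 25),
}

def scaling(one_core, two_cores):
    """Throughput with two cores relative to one core."""
    return two_cores / one_core

for (target, op), (one, two) in iops_k_3_streams.items():
    print(f"{target:15s} {op:13s} {scaling(one, two):.2f}x")
```

On these numbers, random reads through the SPDK iSCSI target gain the most from the second core (about 1.68x), while random writes barely change for either target, which suggests the write path is limited elsewhere in this setup.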
Reading Comparison
[Figure: bar chart, 4K_randread IOPS (K) by number of streams]
Configuration | 1 stream | 2 streams | 3 streams
One core: TGT | 10 | 20 | 20
One core: SPDK-iSCSI | 20 | 24 | 28
Two cores: TGT | 12 | 24 | 26
Two cores: SPDK-iSCSI | 37 | 47 | 47
Writing Comparison
[Figure: bar chart, 4K_randwrite IOPS (K) by number of streams]
Configuration | 1 stream | 2 streams | 3 streams | 4 streams
One core: TGT | 6.5 | 9.5 | 18 |
One core: SPDK-iSCSI | 14 | 19 | 24 |
Two cores: TGT | 9.5 | 13.5 | 19 | 22
Two cores: SPDK-iSCSI | 16 | 24 | 25 | 27
SPDK support for Ceph in 2017
To make SPDK really useful in Ceph, we will continue the following work with partners:
Continue stability maintenance
– Version upgrades, compile-time and runtime bug fixes.
Performance enhancement
– Continue optimizing the NVMEDEVICE module based on customer and partner feedback.
New feature development:
– Occasionally pick up common requirements and feedback from the community, and possibly upstream those features in the NVMEDEVICE module
Proposals/opportunities for better leveraging SPDK
Multiple OSD support on same NVMe Device by using SPDK.
Leverage SPDK’s multiple process features in user space NVMe driver.
Risk: same as with the kernel driver, i.e., all OSDs on the device fail if the device fails.
Enhance cache support in NVMEDEVICE using SPDK
Need better cache/buffer strategy for Read/Write performance improvement.
Optimize RocksDB usage in BlueStore with SPDK’s Blobfs/Blobstore
Make RocksDB use SPDK’s Blobfs/Blobstore instead of the kernel file system for metadata management.
Leverage SPDK to accelerate the block service exported by Ceph
Optimization in front of Ceph
Use optimized Block service daemon, e.g., SPDK iSCSI target or NVMe-oF target
Introduce Cache policy in Block service daemon.
Store Optimization inside Ceph
Use SPDK’s user space NVMe driver instead of the kernel NVMe driver (already available)
May replace “BlueRocksEnv + Bluefs” with “BlobfsENV + Blobfs/Blobstore”.
[Figure: proposed architecture. Legend: existing SPDK app/module; existing Ceph service/component; optimized module to be developed (TBD in SPDK roadmap).]
• Export block service: SPDK optimized iSCSI target and SPDK optimized NVMe-oF target, with the SPDK Ceph RBD bdev module (leverages librbd/librados) and an SPDK cache module, in front of the Ceph RBD service
• Object stores: FileStore, KVStore, BlueStore
• BlueStore metadata path today: RocksDB, BlueRocksENV, Bluefs, Kernel/SPDK driver, NVMe device
• Proposed metadata path: RocksDB, SPDK BlobfsENV, SPDK Blobfs/Blobstore, SPDK NVMe driver, NVMe device
Accelerate block service exported by Ceph via SPDK
Even replace RocksDB?
Summary
SPDK proves to be useful for exploring the capability of fast storage devices (e.g., NVMe SSDs)
But it still needs a lot of development work to make SPDK useful for BlueStore at production quality.
Call to action:
Call for code contributions in the SPDK community
Call for leveraging SPDK for Ceph optimization; you are welcome to contact the SPDK dev team for help and collaboration.
Vhost-scsi Performance
SPDK provides 1 million IOPS with 1 core and 8x VM performance vs. kernel!
Features | Realized Benefit
High performance storage virtualization | Increased VM density
Reduced VM exit | Reduced tail latencies
System Configuration: Target system: 2x Intel® Xeon® E5-2695v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, 8x Intel® P3700 NVMe SSD (800GB), 4x per CPU socket, FW 8DV10102, Network: Mellanox* ConnectX-4 100Gb RDMA, direct connection between initiator and target; Initiator OS: CentOS* Linux* 7.2, Linux kernel 4.7.0-rc2, Target OS (SPDK): CentOS Linux 7.2, Linux kernel 3.10.0-327.el7.x86_64, Target OS (Linux kernel): CentOS Linux 7.2, Linux kernel 4.7.0-rc2 Performance as measured by: fio, 4KB Random Read I/O, 2 RDMA QP per remote SSD, Numjobs=4 per SSD, Queue Depth: 32/job
[Figure: VM cores vs. I/O processing cores for QEMU virtio-scsi, kernel vhost-scsi, and SPDK vhost-scsi (axis 0 to 30 cores; data labels 10, 10, 10 and 17, 8, 1)]
[Figure: I/Os handled per I/O processing core for QEMU virtio-scsi, kernel vhost-scsi, and SPDK vhost-scsi (axis 0 to 1,000,000)]
Alibaba* Cloud ECS Case Study: Write Performance
Source: http://mt.sohu.com/20170228/n481925423.shtml
Ali Cloud sees 300% improvement in IOPS and latency using SPDK
[Figure: Random Write Latency (usec) vs. Queue Depth (1 to 32), General Virtualization Infrastructure vs. Ali Cloud High-Performance Storage Infrastructure with SPDK]
[Figure: Random Write 4K IOPS vs. Queue Depth (1 to 32), General Virtualization Infrastructure vs. Ali Cloud High-Performance Storage Infrastructure with SPDK]
Alibaba* Cloud ECS Case Study: MySQL Sysbench
Source: http://mt.sohu.com/20170228/n481925423.shtml
Sysbench Update sees 4.6X QPS at 10% of the latency!
[Figure: MySQL Sysbench latency (ms) for Select and Update, General Virtualization Infrastructure vs. High Performance Virtualization with SPDK]
[Figure: MySQL Sysbench TPS/QPS for Select and Update, General Virtualization Infrastructure vs. High Performance Virtualization with SPDK]
SPDK Blobstore Vs. Kernel: Key Tail Latency
[Figure: db_bench 99.99th percentile latency (µs), lower is better, Readwrite workload: Kernel (256KB sync) vs. Blobstore (20GB Cache + Readahead); difference labeled 372%]
SPDK Blobstore reduces tail latency by 3.7X
99.99th percentile latency (µs) | Insert | Randread | Overwrite | Readwrite
Kernel (256KB Sync) | 366 | 6444 | 1675 | 122500
SPDK Blobstore (20GB Cache + Readahead) | 444 | 3607 | 1200 | 33052
[Figure: db_bench 99.99th percentile latency (µs), lower is better, for Insert, Randread, and Overwrite: Kernel (256KB sync) vs. Blobstore (20GB Cache + Readahead); differences labeled 21%, 44%, 28%]
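The percentage labels on the tail-latency charts can be checked against the table values. The sketch below assumes each label is the relative difference between the kernel and Blobstore figures; that interpretation is inferred from the numbers and is not spelled out on the slide.

```python
# 99.99th percentile latency in microseconds, from the table above.
kernel = {"Insert": 366, "Randread": 6444, "Overwrite": 1675, "Readwrite": 122500}
blobstore = {"Insert": 444, "Randread": 3607, "Overwrite": 1200, "Readwrite": 33052}

def reduction_percent(kernel_us, blobstore_us):
    """Latency reduction relative to the kernel figure (negative means higher)."""
    return 100 * (kernel_us - blobstore_us) / kernel_us

for workload in kernel:
    pct = reduction_percent(kernel[workload], blobstore[workload])
    speedup = kernel[workload] / blobstore[workload]
    print(f"{workload:10s} reduction {pct:6.1f}%  kernel/blobstore {speedup:.2f}x")
```

On these figures Randread and Overwrite drop by about 44% and 28%, and Readwrite improves by roughly 3.7x, in line with the slide. Insert is the one case where the Blobstore latency is higher, by about 21%.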
SPDK Blobstore Vs. Kernel: Key Transactions per sec
[Figure: db_bench key transactions (keys per second), higher is better, for Insert, Randread, Overwrite, and Readwrite; differences labeled 85%, 8%, 4%, ~0%]
Keys per second | Insert | Randread | Overwrite | Readwrite
Kernel (256KB Sync) | 547046 | 92582 | 51421 | 30273
SPDK Blobstore (20GB Cache + Readahead) | 1011245 | 99918 | 53495 | 29804
SPDK Blobstore improves insert throughput by 85%
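The 85% figure follows directly from the keys-per-second table. As a quick check, assuming the chart labels are throughput gains relative to the kernel result:

```python
# db_bench keys per second, from the table above.
kernel_kps = {"Insert": 547046, "Randread": 92582, "Overwrite": 51421, "Readwrite": 30273}
blobstore_kps = {"Insert": 1011245, "Randread": 99918, "Overwrite": 53495, "Readwrite": 29804}

def gain_percent(kernel, blobstore):
    """Throughput gain of Blobstore over the kernel result, in percent."""
    return 100 * (blobstore / kernel - 1)

for workload in kernel_kps:
    print(f"{workload:10s} {gain_percent(kernel_kps[workload], blobstore_kps[workload]):+.1f}%")
```

Insert gains about 85%, Randread about 8%, and Overwrite about 4%. Readwrite is slightly negative (between -1% and -2%), which the slide rounds to ~0%.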