Challenges in Using Persistent Memory in Gluster

Gluster Summit 2016

Dan Lambright
Storage System Software Developer
Adjunct Professor, University of Massachusetts Lowell
Aug. 23, 2016

Overview

● Technologies
  ○ Persistent memory, aka storage class memory (SCM)
  ○ Distributed storage
● Challenges
  ○ Network latency (studied Gluster)
  ○ Accelerating parts of the system with SCM
  ○ CPU latency (studied Ceph)

Storage Class Memory: What do we know / expect?

● Near-DRAM speeds
● Better wear endurance than SSDs (Intel's claim for 3D XPoint)
● API available (see the sketch below)
  ○ Crash-proof transactions
  ○ Byte or block addressable
● Likely to be at least as expensive as SSDs
● Fast random access
● Has support in Linux
● NVDIMMs are available today
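As a concrete illustration of the byte-addressable API, here is a minimal sketch using PMDK's libpmem: map a file on a DAX-capable filesystem, store to it like ordinary memory, then flush the stores to the persistence domain. The path /mnt/pmem0/example and the 4 KB size are placeholders, and the crash-proof transactions mentioned above would use the higher-level libpmemobj API instead.

/* Hedged sketch using PMDK's libpmem (not Gluster code): map a file on a
 * DAX-capable filesystem, store to it like ordinary memory (byte
 * addressable), then explicitly flush the stores to persistence. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t mapped_len;
    int    is_pmem;

    /* /mnt/pmem0 is assumed to be a DAX mount backed by an NVDIMM. */
    char *buf = pmem_map_file("/mnt/pmem0/example", 4096,
                              PMEM_FILE_CREATE, 0666,
                              &mapped_len, &is_pmem);
    if (buf == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(buf, "hello, persistent world");   /* byte-addressable store */

    if (is_pmem)
        pmem_persist(buf, mapped_len);        /* CPU cache flush, no syscall */
    else
        pmem_msync(buf, mapped_len);          /* fall back to msync() */

    pmem_unmap(buf, mapped_len);
    return 0;
}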

The Problem: Must lower latencies throughout the system: storage, network, CPU (see the worked example below)

Media                  Latency
HDD                    ~10 ms
SSD                    ~1 ms
SCM                    < 1 us
CPU (Ceph, approx.)    ~1000 us
Network (RDMA)         ~10-50 us
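A rough worked example using the slide's approximate figures (not measurements): once the medium drops from HDD to SCM, the network round trip and the CPU path dominate end-to-end latency, so they must shrink too.

/* Illustrative arithmetic only, using the approximate figures above. */
#include <stdio.h>

int main(void) {
    double network_us = 50.0;     /* RDMA round trip, upper bound from the table */
    double cpu_us     = 1000.0;   /* software path, Ceph approximation */
    double hdd_us     = 10000.0, scm_us = 1.0;

    double hdd_total = hdd_us + network_us + cpu_us;
    double scm_total = scm_us + network_us + cpu_us;

    printf("HDD-backed op: %.0f us total, media is %.0f%% of it\n",
           hdd_total, 100.0 * hdd_us / hdd_total);
    printf("SCM-backed op: %.0f us total, media is %.2f%% of it\n",
           scm_total, 100.0 * scm_us / scm_total);
    return 0;
}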

Server Side Replication: Latency cost to replicate across nodes

● "Primary copy": the primary updates the replicas in parallel (see the sketch below)
  ○ The primary processes reads and writes
  ○ Ceph's choice; also JBR
● Other design options
  ○ Read at the "tail" replica - the data there is always committed

[Diagram: client -> primary server -> Replica 1, Replica 2]
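A minimal sketch of the "primary copy" flow described above, with threads standing in for network sends to the replicas; illustrative only, not Ceph's or JBR's actual implementation.

/* Primary-copy replication sketch: fan the write out to the replicas in
 * parallel and acknowledge the client only after every replica ACKs. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_REPLICAS 2

struct replica_write {
    int         replica_id;
    const char *data;
};

/* Stand-in for sending the write to one replica and waiting for its ACK. */
static void *send_to_replica(void *arg) {
    struct replica_write *w = arg;
    usleep(50);    /* pretend network + media latency */
    printf("replica %d committed \"%s\"\n", w->replica_id, w->data);
    return NULL;
}

static int primary_write(const char *data) {
    pthread_t            tid[NUM_REPLICAS];
    struct replica_write req[NUM_REPLICAS];

    /* 1. Apply locally (elided), 2. update replicas in parallel. */
    for (int i = 0; i < NUM_REPLICAS; i++) {
        req[i].replica_id = i + 1;
        req[i].data = data;
        pthread_create(&tid[i], NULL, send_to_replica, &req[i]);
    }
    /* 3. Only after all replicas ACK does the client see success; this
     *    wait is the latency cost of server-side replication. */
    for (int i = 0; i < NUM_REPLICAS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

int main(void) {
    primary_write("hello");
    printf("ack to client\n");
    return 0;
}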

Client Side Replication: Latency price lower than server-side software replication

● Uses more client-side bandwidth
● The client likely has a slower network link than the servers do

[Diagram: client writes directly to Replica 1 and Replica 2]

Improving Network Latency: RDMA

● Avoids OS data copies; frees the CPU from the transfer
● Application must manage buffers
  ○ Reserve memory up front, sized properly
  ○ Both Ceph and Gluster have good RDMA interfaces
● Extend the protocol to further improve latency? (see the sketch below)
  ○ Proposed protocol extensions could shrink latency to ~3 us
  ○ An RDMA write completion does not indicate the data was persisted ("ship and pray")
  ○ An ACK in a higher-level protocol works, but adds overhead
  ○ Add a "commit bit", perhaps combined with the last write?
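For reference, a hedged sketch of posting a one-sided RDMA write with libibverbs, to show why completion is "ship and pray": the work completion only says the transfer finished from the initiator's point of view, not that the data is persistent on the target. It assumes the protection domain, queue pair, completion queue, registered buffer (lkey), and the peer's remote_addr/rkey were set up and exchanged beforehand, and it is not Gluster's actual RDMA transport code.

/* Post one RDMA write and wait for its local completion. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq,
                    void *local_buf, uint32_t len, uint32_t lkey,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll for the completion.  Success here means "shipped", not
     * "persisted": a higher-level ACK or a future "commit" extension is
     * still needed before the write can be considered durable. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}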

IOPS, RDMA vs. 10 Gbps: GlusterFS 2x replication, 2 clients

Biggest gain with reads; little gain for small I/O.

[Charts: sequential and random I/O, 1024-byte transfers]

Improving Network Latency: Other techniques

● Reduce protocol traffic (discussed more in the next section)
● Coalesce protocol operations (see the sketch below)
  ○ With this, observed a 10% gain in small-file creates on Gluster
● Pipelining
  ○ In Ceph, on two updates to the same object, start replicating the second before the first completes
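A toy sketch of the coalescing idea: queue small operations and flush them as one batched request, paying the network round trip once. The structures and batch size are invented for illustration and do not reflect Gluster's wire protocol.

/* Coalescing: many small ops, few round trips. */
#include <stdio.h>
#include <string.h>

#define MAX_BATCH 32

struct op    { char path[64]; int type; };       /* e.g. CREATE, SETATTR */
struct batch { struct op ops[MAX_BATCH]; int n; };

/* Stand-in for one round trip over the wire. */
static void send_request(const struct batch *b) {
    printf("1 round trip carrying %d ops\n", b->n);
}

static void flush(struct batch *b) {
    if (b->n > 0) { send_request(b); b->n = 0; }
}

/* Queue an op; flush only when the batch is full (or explicitly,
 * e.g. before a dependent read). */
static void submit(struct batch *b, const char *path, int type) {
    struct op *o = &b->ops[b->n++];
    snprintf(o->path, sizeof(o->path), "%s", path);
    o->type = type;
    if (b->n == MAX_BATCH)
        flush(b);
}

int main(void) {
    struct batch b = { .n = 0 };
    for (int i = 0; i < 100; i++) {
        char path[64];
        snprintf(path, sizeof(path), "/dir/file%03d", i);
        submit(&b, path, 0);
    }
    flush(&b);          /* 100 ops end up in 4 round trips instead of 100 */
    return 0;
}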

Adding SCM to Parts of the System: Kernel- and application-level tiering

[Diagrams: dm-cache (kernel level) and Ceph tiering (application level)]

Gluster Tiering: Illustration of the network problem

● Heterogeneous storage in a single volume
  ○ Fast/expensive storage acts as a cache for slower storage
  ○ Fast "hot tier" (e.g. SSD, SCM)
  ○ Slow "cold tier" (e.g. erasure coded)
● A database tracks files
  ○ Pro: easy metadata manipulation
  ○ Con: very slow O(n) enumeration of objects for promotion/demotion scans
● Policies (see the sketch below)
  ○ Data is placed on the hot tier until it is "full"
  ○ Once "full", data is "promoted/demoted" based on access frequency
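A hedged sketch of a watermark-based promote/demote scan in the spirit of the policies above. The thresholds and record layout are invented for illustration; this is not the Gluster tiering translator's code, and the full scan it performs is exactly the O(n) cost noted above.

/* Promote hot files, demote cold ones, respect a hot-tier watermark. */
#include <stdbool.h>
#include <stdio.h>

struct file_record {
    const char *path;
    unsigned    heat;        /* access count within the current window */
    bool        on_hot_tier;
};

/* Invented thresholds for illustration. */
#define PROMOTE_HEAT   8
#define DEMOTE_HEAT    1
#define HI_WATERMARK   90     /* percent full */

static void scan(struct file_record *db, int n, int hot_tier_pct_full) {
    for (int i = 0; i < n; i++) {
        struct file_record *f = &db[i];
        if (!f->on_hot_tier && f->heat >= PROMOTE_HEAT &&
            hot_tier_pct_full < HI_WATERMARK) {
            printf("promote %s\n", f->path);    /* migrate cold -> hot */
            f->on_hot_tier = true;
        } else if (f->on_hot_tier && f->heat <= DEMOTE_HEAT) {
            printf("demote  %s\n", f->path);    /* migrate hot -> cold */
            f->on_hot_tier = false;
        }
        f->heat = 0;    /* start a new measurement window */
    }
}

int main(void) {
    struct file_record db[] = {
        { "/vol/a", 12, false }, { "/vol/b", 0, true }, { "/vol/c", 3, false },
    };
    scan(db, 3, 40);    /* this scan itself is the O(n) cost cited above */
    return 0;
}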

Gluster Tiering

[Figure]

Gluster's "Small File" Tiering Problem: Analysis

● Tiering helped large I/Os, not small ones
● The same pattern is seen elsewhere
  ○ RDMA tests
  ○ Customer feedback, overall GlusterFS reputation
● Observed many LOOKUP operations over the network
● Hypothesis: metadata transfers dominate data transfers for small files
  ○ So speeding up small-file data transfers fails to help overall I/O latency

Understanding LOOKUPs in Gluster. Problem: Path Traversal

● Each directory in the path is tested on an open() by the client's VFS layer (see the sketch below)
  ○ Full path traversal: d1/d2/d3/f1
  ○ Existence
  ○ Permission

[Diagram: path components d1 -> d2 -> d3 -> f1]
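An illustrative sketch of why opening d1/d2/d3/f1 generates one LOOKUP per path component. The lookup() function is a stand-in for the network FOP the Gluster client would send, not a real API.

/* Resolve every ancestor of the target, as the VFS layer does on open(). */
#include <stdio.h>
#include <string.h>

/* Pretend network LOOKUP: checks existence and permissions of one component. */
static int lookup(const char *path) {
    printf("LOOKUP %s\n", path);
    return 0;
}

static int resolve_path(const char *path) {
    char partial[256] = "";
    char copy[256];
    snprintf(copy, sizeof(copy), "%s", path);

    for (char *save, *tok = strtok_r(copy, "/", &save); tok;
         tok = strtok_r(NULL, "/", &save)) {
        if (partial[0] != '\0')
            strncat(partial, "/", sizeof(partial) - strlen(partial) - 1);
        strncat(partial, tok, sizeof(partial) - strlen(partial) - 1);
        if (lookup(partial) != 0)
            return -1;
    }
    return 0;
}

int main(void) {
    /* Four components -> four LOOKUPs before the file itself can be opened. */
    return resolve_path("d1/d2/d3/f1");
}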

Understanding LOOKUPs in Gluster. Problem: Coalescing Distributed Hash Ranges

● The distributed hash space is split into parts (see the sketch below)
  ○ A unique hash space for each directory
  ○ Each node owns a piece of this "layout"
  ○ The layout is stored in extended attributes
  ○ When new nodes are added to the cluster, the layouts are rebuilt
● When a file is opened, the entire layout is rechecked, for each directory
  ○ Each node receives a LOOKUP to retrieve its part of the layout
● Work is underway to improve this
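A hedged sketch of the layout idea: each brick owns a contiguous range of a 32-bit hash space for a directory, and a file name hashes to the brick whose range contains it. The hash function and range boundaries here are invented for illustration; Gluster's DHT uses its own hash and stores the real ranges in extended attributes on the directory.

/* Map a file name to the brick whose hash range contains it. */
#include <stdint.h>
#include <stdio.h>

struct brick_range { const char *brick; uint32_t start, end; };

/* Simple FNV-1a, standing in for the real filename hash. */
static uint32_t hash_name(const char *name) {
    uint32_t h = 2166136261u;
    for (; *name; name++) { h ^= (uint8_t)*name; h *= 16777619u; }
    return h;
}

static const char *owning_brick(const struct brick_range *layout, int n,
                                const char *name) {
    uint32_t h = hash_name(name);
    for (int i = 0; i < n; i++)
        if (h >= layout[i].start && h <= layout[i].end)
            return layout[i].brick;
    return NULL;   /* a hole in the layout would force a broader lookup */
}

int main(void) {
    /* Four bricks, each owning a quarter of the 32-bit space. */
    struct brick_range layout[] = {
        { "server1:/brick", 0x00000000u, 0x3fffffffu },
        { "server2:/brick", 0x40000000u, 0x7fffffffu },
        { "server3:/brick", 0x80000000u, 0xbfffffffu },
        { "server4:/brick", 0xc0000000u, 0xffffffffu },
    };
    printf("f1 lives on %s\n", owning_brick(layout, 4, "f1"));
    return 0;
}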

LOOKUP Amplification

d1/d2/d3/f1: four LOOKUPs, four servers (S1-S4), 16 LOOKUPs total in the worst case.

[Diagram: client VFS layer -> Gluster client -> Gluster servers S1-S4]

Path used in Ben England's "smallfile" utility: /mnt/p6.1/file_dstdir/192.168.1.2/thrd_00/d_000

Client Metadata Cache (Tiering): Gluster's "md-cache" translator

● Cache file metadata at the client long term (see the sketch below)
  ○ WIP, under development
● Invalidate a cache entry on another client's change
  ○ Invalidate intelligently, not spuriously
  ○ Some attributes may change a lot (ctime, ...)

[Chart: tiering results with md-cache; red is cached]
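A hedged sketch of the idea behind a client-side metadata cache with server-driven invalidation. The structures and function names are invented for illustration; this is not md-cache's code or Gluster's upcall API.

/* Tiny client-side metadata cache with explicit invalidation. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 128

struct md_entry {
    char     path[128];
    unsigned mode, uid, gid;     /* a few cached attributes */
    bool     valid;
};

static struct md_entry cache[CACHE_SLOTS];

static unsigned slot_of(const char *path) {
    unsigned h = 5381;
    for (; *path; path++) h = h * 33 + (unsigned char)*path;
    return h % CACHE_SLOTS;
}

/* Hit: answer the LOOKUP locally, no network round trip. */
static struct md_entry *md_lookup(const char *path) {
    struct md_entry *e = &cache[slot_of(path)];
    return (e->valid && strcmp(e->path, path) == 0) ? e : NULL;
}

static void md_store(const char *path, unsigned mode, unsigned uid, unsigned gid) {
    struct md_entry *e = &cache[slot_of(path)];
    snprintf(e->path, sizeof(e->path), "%s", path);
    e->mode = mode; e->uid = uid; e->gid = gid; e->valid = true;
}

/* Called when the server notifies us that another client changed the file. */
static void md_invalidate(const char *path) {
    struct md_entry *e = &cache[slot_of(path)];
    if (e->valid && strcmp(e->path, path) == 0)
        e->valid = false;
}

int main(void) {
    md_store("/vol/d1/f1", 0644, 1000, 1000);
    printf("cached? %s\n", md_lookup("/vol/d1/f1") ? "yes" : "no");
    md_invalidate("/vol/d1/f1");             /* another client wrote to it */
    printf("cached? %s\n", md_lookup("/vol/d1/f1") ? "yes" : "no");
    return 0;
}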

Client Metadata Cache, RDMA: one client, 16K 64-byte files, 8 threads, 14 Gbps Mellanox IB

[Chart: results; red is cached]

Tiering: Development notes

● Near term (2016)
  ○ Add/remove brick, partially in review
  ○ md-cache integration, working with the SMB team
  ○ Better performance counters
● Longer term (2017)
  ○ Policy-based tiering
  ○ More than two tiers?
  ○ Replace the database with "hitsets" (Ceph-like Bloom filters)?

Summary and Discussion: Distributed storage and SCM pose unique latency problems

● Network latency reductions
  ○ Use RDMA
  ○ Reduce round trips by streamlining the protocol, coalescing, etc.
  ○ Cache at the client
● CPU latency reductions
  ○ Aggressively optimize / shrink the stack
  ○ Remove/replace large components
● Consider
  ○ SCM as a tier/cache
  ○ 2-way replication

THANK YOU

plus.google.com/+RedHat

linkedin.com/company/red-hat

youtube.com/user/RedHatVideos

facebook.com/redhatinc

twitter.com/RedHatNews
