
GlusterFS Internals and Directions

Jeff Darcy, Principal Engineer, Red Hat
13 June, 2013

GlusterFS is not a filesystem

Wait . . . what?

● GlusterFS is a scalable general purpose storage platform

● We handle common storage tasks
  ● cluster management and configuration
  ● data distribution and replication
  ● common control and data structures

● That platform can be used many different ways

Interface Possibilities

[Diagram: interface possibilities. Access interfaces (qemu, NFS, SMB, Hadoop, FUSE, Cinder, Swift (UFO), libgfapi, whatever) expose files, blocks, and objects; transports are IP and RDMA; back ends are plain files, block devices (BD), or a DB]

OpenStack and GlusterFS – Current Integration

[Diagram: logical view (Nova nodes, Glance images, Swift objects; Cinder, Glance, and Swift data behind the Swift API) mapped to physical view (KVM compute nodes and GlusterFS storage servers)]

● Separate Compute and Storage Pools

● GlusterFS directly provides Swift object service

● Integration with Keystone
● GeoReplication for multi-site support
● Swift data also available via other protocols
● Supports non-OpenStack use in addition to OpenStack use

OpenStack and GlusterFS - Future Direction

[Diagram: Nova compute nodes, each host running a Gluster guest alongside Hadoop and other guests]

OpenStack and GlusterFS - Future Direction

● POC based on the proposed OpenStack FaaS (File as a Service) specification

● Cinder-like virtual NAS service
● Tenant-specific file shares
● Hypervisor mediated for security
  ● avoids exposing servers to the Quantum tenant network
● Optional multi-site or multi-zone GeoReplication
● FaaS data optionally available to non-OpenStack nodes

● Initial focus on Linux guests
● Windows (SMB) and NFS shares also under consideration

Making Hard Stuff Easier

● Distributed filesystems are notoriously hard to set up
  ● multiple experts for multiple weeks is “normal”

● How about four CLI commands? (example below)
  ● probe peer, create volume, start volume, mount

● We handle cluster membership, process management, port mapping, dynamic configuration changes, etc.
  ● add/remove nodes on the fly
  ● add/remove features on the fly
  ● rolling upgrades
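For concreteness, the four-command setup on a fresh two-node cluster looks roughly like this; hostnames, volume name, brick paths, and the "replica 2" layout are illustrative choices, not the only option:

    # from server1: add server2 to the trusted pool
    gluster peer probe server2

    # create a two-way replicated volume, one brick per server
    gluster volume create myvol replica 2 server1:/bricks/b1 server2:/bricks/b1

    # start serving the volume
    gluster volume start myvol

    # from any client: mount it over FUSE
    mount -t glusterfs server1:/myvol /mnt/myvol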

Q: How Do We Do It?

[Diagram: a GlusterFS volume is a stack of translators: one of the access layers (FUSE, libgfapi, ...), plus all of the client-side translators (distribution, replication, ..., RPC client), plus all of the server-side translators (RPC server, ..., local storage on a local FS)]

A: Modularity!

Deep Dive: Distribution

[Diagram: the translator stack again, with the distribution translator highlighted]

Elastic Hashing

[Diagram: files X and Y hash-mapped onto servers A, B, and C]

● Deterministic mapping: object hash → server
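A minimal sketch of that idea, assuming a made-up hash function and a hard-coded layout (the real DHT translator uses its own 32-bit hash and per-directory layout ranges, but the lookup logic is this simple): each brick owns a contiguous slice of the hash space, and a file's name hash picks the owning brick.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy stand-in for the real name hash; FNV-1a here, for illustration only. */
    static uint32_t name_hash(const char *name)
    {
            uint32_t h = 2166136261u;
            for (; *name; name++)
                    h = (h ^ (uint8_t)*name) * 16777619u;
            return h;
    }

    /* Each brick owns one contiguous range of the 32-bit hash space. */
    struct range { uint32_t start, stop; const char *brick; };

    static const char *pick_brick(const struct range *layout, int n, const char *name)
    {
            uint32_t h = name_hash(name);
            for (int i = 0; i < n; i++)
                    if (h >= layout[i].start && h <= layout[i].stop)
                            return layout[i].brick;
            return NULL;    /* unreachable if the ranges cover the whole space */
    }

    int main(void)
    {
            struct range layout[] = {
                    { 0x00000000u, 0x55555555u, "server-a:/brick" },
                    { 0x55555556u, 0xAAAAAAAAu, "server-b:/brick" },
                    { 0xAAAAAAABu, 0xFFFFFFFFu, "server-c:/brick" },
            };
            printf("file-x -> %s\n", pick_brick(layout, 3, "file-x"));
            printf("file-y -> %s\n", pick_brick(layout, 3, "file-y"));
            return 0;
    }

Because the mapping is deterministic, any client computes the same answer without asking a metadata server. When a brick is added, only files whose hashes fall in the reassigned slices need to move, which is the point of the next two slides.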

Adding a Node

[Diagram: server D joins servers A, B, and C; files X and Y largely stay where they were]

● Minimize reassignment when server set changes

Rebalancing

● Goal: optimal layout with minimal data movement

● Greatly improved algorithms in 3.4

Future: Tiering and Topology Awareness

● General deterministic matching function: file attributes to storage attributes

● Currently both attributes are hashes, but...
  ● file attribute could be account ID, age, ...
  ● storage attribute could be disk type (SSD), replication level, ...
  ● either could be an arbitrary tag

● Rebalance etc. “just work” regardless

● Algorithms can be stacked on top of one another

Tiering Example

[Diagram: a volume selects by path between Development (random placement) and Production; Production selects by age between SSDs (random placement) and Replicated storage (random placement)]
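Reading that example as code, a hedged sketch of stacked deterministic matching functions; the attribute names, path prefix, and age threshold are invented for illustration and are not GlusterFS parameters:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical file attributes a matching function might inspect. */
    struct file_attrs {
            const char *path;
            int         age_days;
    };

    /* Second level: within production, select by age. */
    static const char *select_by_age(const struct file_attrs *f)
    {
            return (f->age_days < 30) ? "ssd-pool (random placement)"
                                      : "replicated-pool (random placement)";
    }

    /* Top level: select by path, then hand off to the next rule in the stack. */
    static const char *select_by_path(const struct file_attrs *f)
    {
            if (strncmp(f->path, "/development/", 13) == 0)
                    return "development-pool (random placement)";
            return select_by_age(f);
    }

    int main(void)
    {
            struct file_attrs hot  = { "/production/db/hot.log",   3 };
            struct file_attrs cold = { "/production/db/old.log", 400 };
            struct file_attrs dev  = { "/development/scratch.tmp", 1 };
            printf("%s -> %s\n", hot.path,  select_by_path(&hot));
            printf("%s -> %s\n", cold.path, select_by_path(&cold));
            printf("%s -> %s\n", dev.path,  select_by_path(&dev));
            return 0;
    }

Because each rule is deterministic, rebalancing can recompute placements exactly the way the original placement did.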

Deep Dive: Replication

[Diagram: the translator stack again, with the replication translator highlighted]

Replicated Writes

[Diagram: the client sends each write to servers A and B; every write is bracketed by lock, xattr+ (mark pending), write, xattr- (clear pending), unlock]

● Many optimizations avoid the lock/xattr ops
  ● especially for sequential writes

● Still synchronous
  ● don't try this on a high-latency network
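A rough sketch of that sequence, with stub functions standing in for the real RPC calls (in the actual replication translator the xattr+/xattr- steps are changelog extended attributes, and most of these round trips are optimized away):

    #include <stdio.h>

    struct replica { const char *host; };

    /* Stand-ins that just log what would happen on each server. */
    static void lock_range(struct replica *r)    { printf("%s: lock\n",   r->host); }
    static void mark_pending(struct replica *r)  { printf("%s: xattr+\n", r->host); }
    static int  do_write(struct replica *r)      { printf("%s: write\n",  r->host); return 0; }
    static void clear_pending(struct replica *r) { printf("%s: xattr-\n", r->host); }
    static void unlock_range(struct replica *r)  { printf("%s: unlock\n", r->host); }

    /* The synchronous replicated-write sequence from the slide. */
    static int replicated_write(struct replica *reps, int n)
    {
            int ok = 0;
            for (int i = 0; i < n; i++) lock_range(&reps[i]);
            for (int i = 0; i < n; i++) mark_pending(&reps[i]);
            for (int i = 0; i < n; i++) if (do_write(&reps[i]) == 0) ok++;
            for (int i = 0; i < n; i++) clear_pending(&reps[i]);
            for (int i = 0; i < n; i++) unlock_range(&reps[i]);
            return ok > 0 ? 0 : -1;   /* replicas that missed the write get repaired by self-heal */
    }

    int main(void)
    {
            struct replica reps[] = { { "server-a" }, { "server-b" } };
            return replicated_write(reps, 2);
    }

Every step waits for all replicas, which is why the slide warns against high-latency links.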

Self Heal

● Generation 1: on demand

● Generation 2: full manual scan

● Generation 3: parallel, automatic repair
  ● index based
  ● GlusterFS 3.3, RHS 2.0

● Future: journal based
  ● even more precise (i.e. faster)
  ● lower overhead

Split Brain

[Diagram: during a network partition, client 1 writes “foo” to server A while client 2 writes “bar” to server B]

Split Brain (continued)

● In 3.3: basic quorum enforcement
  ● client side, replica-set level
  ● poor approach for N=2

● In 3.4: advanced quorum enforcement
  ● server side, cluster level

● In 3.5: hyper-advanced (?) quorum enforcement
  ● volume level
  ● arbiters (best approach for N=2)
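The heart of any of these quorum checks is small. A hedged sketch of the usual rule (allow writes only when a majority of replicas is reachable, with some tie-breaker for even counts); the exact policy and where it runs (client, server, per volume) is what differs between the 3.3, 3.4, and 3.5 variants:

    #include <stdbool.h>
    #include <stdio.h>

    /* Decide whether writes are allowed given how many of the N replicas
     * are reachable. For an even split, one common policy is to side with
     * the half containing a designated member (first brick or arbiter);
     * that choice is modeled here as a simple flag. */
    static bool have_quorum(int reachable, int total, bool tie_breaker_reachable)
    {
            if (2 * reachable > total)
                    return true;                  /* strict majority */
            if (2 * reachable == total)
                    return tie_breaker_reachable; /* even split: use the tie-breaker */
            return false;                         /* minority: refuse writes, avoid split brain */
    }

    int main(void)
    {
            printf("2 of 3 up              -> %d\n", have_quorum(2, 3, false));
            printf("1 of 2 up, tie-breaker -> %d\n", have_quorum(1, 2, true));
            printf("1 of 3 up              -> %d\n", have_quorum(1, 3, false));
            return 0;
    }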

Access Methods (past)

[Diagram: FUSE, NFS, Samba, Swift, and Hadoop access methods layered on the client translator stack (distribution, replication, ..., RPC client)]

Access Methods (present)

[Diagram: the same access methods, plus qemu going through libgfapi]

Access Methods (future)

[Diagram: the same access methods, plus qemu through libgfapi and your own API]

What is libgfapi?

● User-space library for accessing data in GlusterFS

● Filesystem-like API

● Runs in application process
  ● no FUSE, no copies, no context switches
  ● ...but same volfiles, translators, etc.

● Could be used for Apache/nginx modules, MPI I/O (maybe), Ganesha, etc. ad infinitum

● BTW it's usable from Python too :)
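For a flavour of the API, a minimal C sketch using calls from glfs.h; the hostname, volume name, and header path are illustrative, and error handling is trimmed:

    #include <glusterfs/api/glfs.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* Connect to volume "myvol" served by host "server1" (made-up names). */
            glfs_t *fs = glfs_new("myvol");
            glfs_set_volfile_server(fs, "tcp", "server1", 24007);
            if (glfs_init(fs) != 0) {
                    fprintf(stderr, "glfs_init failed\n");
                    return 1;
            }

            /* Same operations a mounted client would do, with no FUSE in the path. */
            glfs_fd_t *fd = glfs_creat(fs, "/hello.txt", O_RDWR, 0644);
            const char *msg = "hello from libgfapi\n";
            glfs_write(fd, msg, strlen(msg), 0);
            glfs_close(fd);

            glfs_fini(fs);
            return 0;
    }

Link against the library (e.g. -lgfapi); the Python bindings wrap these same calls.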

Translator API

● If libgfapi isn't enough, you can write your own translators (including glupy for Python)

● Most of what we already do is in translators

● It's a public (though not well documented) API
  ● “Translator 101” series, forge.gluster.org

● Translators are right in the I/O path

● Current examples: encryption, erasure coding

● Other possibilities: dedup/compression, format translation, indexing
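To give a sense of the shape, a heavily hedged sketch of a do-nothing translator that passes one fop through to its child. The fop signature and helpers shown match the 3.3/3.4-era API; they change between releases, so treat this as an outline and check xlator.h in your tree. Build boilerplate (config.h handling, the Makefile that produces the loadable shared object, an options table) is omitted.

    /* null.c: pass-through translator sketch, built against the GlusterFS tree. */
    #include "xlator.h"     /* core translator types: xlator_t, call_frame_t, ... */
    #include "defaults.h"   /* default_*_cbk helpers */

    int32_t
    null_open (call_frame_t *frame, xlator_t *this, loc_t *loc,
               int32_t flags, fd_t *fd, dict_t *xdata)
    {
            /* A real translator would inspect or rewrite the call here. */
            STACK_WIND (frame, default_open_cbk,
                        FIRST_CHILD (this), FIRST_CHILD (this)->fops->open,
                        loc, flags, fd, xdata);
            return 0;
    }

    int32_t
    init (xlator_t *this)
    {
            return 0;       /* nothing to set up */
    }

    void
    fini (xlator_t *this)
    {
    }

    struct xlator_fops fops = {
            .open = null_open,   /* fops left out of this table fall through to defaults */
    };

    struct xlator_cbks cbks = { };

Once loaded from a volfile, the translator sits directly in the I/O path, which is what makes things like encryption and erasure coding possible at this layer.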

http://www.gluster.org

● Modularity makes it all possible

● Expect:
  ● OpenStack, Hadoop, OpenStack, Hadoop, ...
    ● marketing made me say that
  ● more front-end protocols
  ● more back-end storage options
  ● more functionality within the I/O path
  ● more performance enhancements

● Make the storage system you want
