Let’s talk about Ceph – TERENA TF-Storage Meeting – February 12th, 2015 – University of Vienna
! First part: 13:00 to 15:00
  ! Introduction to Ceph
  ! Technical deep dive: why is Ceph so interesting and popular
  ! Q&A about Ceph
! Coffee break
! Second part: 15:30 to 17:30
  ! Notable use cases, deployment options, management and monitoring
AGENDA
Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE FORECAST
By 2020, over 15 ZB of data will be stored; 1.5 ZB are stored today.
5 Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE PROBLEM
! Existing systems don’t scale ! Increasing cost and complexity
! Need to invest in new platforms ahead of time
(Chart: growth of data vs. IT storage budget, 2010 to 2020)
6 Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE SOLUTION
PAST: SCALE UP
FUTURE: SCALE OUT
7 Copyright © 2015 Red Hat, Inc. | Private and Confidential
EXAMPLES IN NATURE
DO YOU STILL REMEMBER?
Copyright © 2015 Red Hat, Inc. | Private and Confidential
STORAGE APPLIANCES
! RAID: Redundant Array of Inexpensive Disks ! Enhanced Reliability
! RAID-1 mirroring ! RAID-5/6 parity (reduced overhead) ! Automated recovery
! Enhanced Performance ! RAID-0 striping ! SAN interconnects ! Enterprise SAS drives ! Proprietary H/W RAID controllers
END OF RAID AS WE KNOW IT
! Economical Storage Solutions ! Software RAID implementations ! iSCSI and JBODs
! Enhanced Capacity ! Logical volume concatenation
RAID CHALLENGES: CAPACITY/SPEED
! Storage economies in disks come from more GB per spindle ! NRE rates are flat (typically estimated at 10^-15 per bit)
! ~4% chance of an NRE while recovering a 4+1 RAID-5 set, and it goes up with the number of volumes in the set; many RAID controllers fail the recovery after an NRE
! Access speed has not kept up with density increases ! 27 hours to rebuild a 4+1 RAID-5 set at 20 MB/s, during which time a second drive can fail
! Managing the risk of second failures requires hot-spares ! Defeating some of the savings from parity redundancy
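To see where figures like these come from (a rough sketch, assuming 1.5 TB drives): recovering a 4+1 RAID-5 set means reading the four surviving drives, about 4 x 1.5 TB ≈ 4.8 x 10^13 bits; at an NRE rate of 10^-15 per bit the expected number of unrecoverable read errors during the rebuild is roughly 0.048, i.e. about a 4-5% chance of hitting one. Likewise, rebuilding 2 TB at 20 MB/s takes about 10^5 seconds, roughly 27 hours.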
RAID CHALLENGES: EXPANSION
! The next generation of disks will be larger and cost less per GB; we would like to use these as we expand
! Most RAID replication schemes require identical disks, meaning new disks cannot be added to an old set and failed disks must be replaced with identical units
! Proprietary appliances may require replacements from manufacturer (at much higher than commodity prices)
! Many storage systems reach a limit beyond which they cannot be further expanded (e.g. fork-lift upgrade)
! Re-balancing existing data over new volumes is non-trivial
RELIABILITY/AVAILABILITY
! RAID-5 can only survive a single disk failure ! The odds of an NRE during recovery are significant ! Odds of a second failure during recovery are non-negligible ! Annual petabyte durability for RAID-5 is only 3 nines
! RAID-6 redundancy protects against two disk failures ! Odds of an NRE during recovery are still significant ! Client data access will be starved out during recovery ! Throttling recovery increases the risk of data loss
RELIABILITY/AVAILABILITY
! Even RAID-6 can’t protect against: ! Server failures ! NIC failures ! Switch failures ! OS crashes ! Facility or regional disasters
RAID CHALLENGES: EXPENSE
! Capital Expenses … good RAID costs ! Significant mark-up for enterprise hardware ! High performance RAID controllers can add $50-100/disk ! SANs further increase costs ! Expensive equipment, much of which is often poorly utilized ! Software RAID is much less expensive, and much slower
RAID CHALLENGES: EXPENSE
! Operating Expenses … RAID doesn’t manage itself ! RAID group, LUN and pool management ! Lots of application-specific tunable parameters ! Difficult expansion and migration ! When a recovery goes bad, it goes very bad ! Don’t even think about putting off replacing a failed drive
! Ceph is a massively scalable, open source, software-defined storage system that runs on commodity hardware.
! Ceph delivers object, block, and file system storage in one self-managing, self-healing platform with no single point of failure.
! Ceph ideally replaces legacy storage and provides a single solution for the cloud.
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHAT IS CEPH?
22
CEPH UNIFIED STORAGE
FILE SYSTEM: Linux Kernel, POSIX, Distributed Metadata, CIFS/NFS, HDFS
BLOCK STORAGE: OpenStack, Linux Kernel, Tiering, Clones, Snapshots
OBJECT STORAGE: Multi-tenant S3 & Swift, Keystone, Geo-Replication, Erasure Coding
Copyright © 2015 Red Hat, Inc. | Private and Confidential
TRADITIONAL STORAGE VS. CEPH
TRADITIONAL ENTERPRISE STORAGE
24 Copyright © 2015 Red Hat, Inc. | Private and Confidential
“Run storage the way Google does”
Copyright © 2015 Red Hat, Inc. | Private and Confidential
VISION
25
! Failure is normal ! Scale out on commodity hardware ! Everything runs on software ! Solution must be 100% open source
Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE CEPH PHILOSOPHY
26
HISTORICAL TIMELINE
Copyright © 2015 Red Hat, Inc. | Private and Confidential 27
2004 Project starts at UCSC
2006 Open sourced
2010 Mainline Linux kernel
2011 OpenStack integration
MAY 2012 Launch of Inktank
2012 CloudStack integration
SEPT 2012 Production-ready Ceph
2013 Xen integration
OCT 2013 Inktank Ceph Enterprise launch
FEB 2014 RHEL-OSP certification
APR 2014 Inktank acquired by Red Hat
STRONG & GROWING COMMUNITY
28 Copyright © 2015 Red Hat, Inc. | Private and Confidential
COMMUNITY METRICS DASHBOARD
METRICS.CEPH.COM
ARCHITECTURE
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
30
APP HOST/VM CLIENT
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
31
APP HOST/VM CLIENT
RADOS COMPONENTS
32
OSDs: ! 10s to 10000s in a cluster ! One per disk (or one per SSD, RAID group…)
! Serve stored objects to clients ! Intelligently peer for replication & recovery
Monitors: ! Maintain cluster membership and state ! Provide consensus for distributed decision-making
! Small, odd number ! These do not serve stored objects to clients
Copyright © 2015 Red Hat, Inc. | Private and Confidential
OBJECT STORAGE DAEMONS
33
btrfs xfs ext4
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHERE DO OBJECTS LIVE?
34
Copyright © 2015 Red Hat, Inc. | Private and Confidential
A METADATA SERVER?
35
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CALCULATED PLACEMENT?
36
(Diagram: objects mapped to storage targets by name range: A-G, H-N, O-T, U-Z)
Copyright © 2015 Red Hat, Inc. | Private and Confidential
EVEN BETTER: CRUSH!
37
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CRUSH: DYNAMIC DATA PLACEMENT
38
CRUSH: ! Pseudo-random placement algorithm
! Fast calculation, no lookup
! Repeatable, deterministic ! Statistically uniform distribution ! Stable mapping
! Limited data migration on change ! Rule-based configuration
! Infrastructure topology aware ! Adjustable replication ! Weighting
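As a concrete illustration of rule-based, topology-aware placement, the default replicated rule in a decompiled CRUSH map of this era looks roughly like the sketch below (rule name, number, and the "default" root are per-cluster and shown here only as examples):

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default                     # start at the root of the hierarchy
    step chooseleaf firstn 0 type host    # pick one OSD from each of N distinct hosts
    step emit
}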
Copyright © 2015 Red Hat, Inc. | Private and Confidential
DATA IS ORGANIZED INTO POOLS
39
CLUSTER POOLS (CONTAINING PGs): POOL A, POOL B, POOL C, POOL D
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
40
APP HOST/VM CLIENT
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ACCESSING A RADOS CLUSTER
41
RADOS CLUSTER
socket
Copyright © 2015 Red Hat, Inc. | Private and Confidential
LIBRADOS: RADOS ACCESS FOR APPS
42
LIBRADOS: ! Direct access to RADOS for applications ! C, C++, Python, PHP, Java, Erlang ! Direct access to storage nodes ! No HTTP overhead
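A minimal sketch of what direct RADOS access looks like in practice, using the rados CLI (which is built on librados) against a hypothetical pool called app-data; an application would issue the same operations through one of the language bindings listed above:

$ ceph osd pool create app-data 128                # pool with 128 placement groups
$ echo "hello rados" > /tmp/hello.txt
$ rados -p app-data put greeting /tmp/hello.txt    # store an object named "greeting"
$ rados -p app-data ls                             # list objects in the pool
$ rados -p app-data get greeting /tmp/out.txt      # read it back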
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
43
APP HOST/VM CLIENT
THE RADOS GATEWAY
44
RADOS CLUSTER
socket
REST
Copyright © 2015 Red Hat, Inc. | Private and Confidential
RADOSGW MAKES RADOS WEBBY
45
RADOSGW: ! REST-based object storage proxy ! Uses RADOS to store objects ! API supports buckets, accounts ! Usage accounting for billing ! Compatible with S3 and Swift applications
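A minimal sketch of exercising the S3-compatible API, assuming a running radosgw and an s3cmd already configured to point at the gateway (the user name and bucket below are made up for illustration):

$ radosgw-admin user create --uid=demo --display-name="Demo User"   # returns an access/secret key pair
$ s3cmd mb s3://backups                     # create a bucket through the gateway
$ s3cmd put database.dump s3://backups/     # upload an object with any S3 tool
$ s3cmd ls s3://backups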
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
46
APP HOST/VM CLIENT
RBD STORES VIRTUAL DISKS
47
RADOS BLOCK DEVICE: ! Storage of disk images in RADOS ! Decouples VMs from host ! Images are striped across the cluster (pool) ! Snapshots ! Copy-on-write clones ! Support in:
! Mainline Linux Kernel (2.6.39+) ! Qemu/KVM, native Xen coming soon ! OpenStack, CloudStack, Nebula, Proxmox
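A small sketch of the block workflow with the rbd CLI (pool and image names are arbitrary; clones require format 2 images):

$ rbd create vms/disk1 --size 10240 --image-format 2    # 10 GB image in pool "vms"
$ rbd snap create vms/disk1@base
$ rbd snap protect vms/disk1@base
$ rbd clone vms/disk1@base vms/disk1-clone              # copy-on-write clone
$ sudo rbd map vms/disk1                                # expose through the kernel module as /dev/rbdX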
Copyright © 2015 Red Hat, Inc. | Private and Confidential
STORING VIRTUAL DISKS
48
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SEPARATE COMPUTE FROM STORAGE
49
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
KERNEL MODULE FOR MAX FLEXIBILITY
50
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
51
APP HOST/VM CLIENT
SCALABLE METADATA SERVERS
52
METADATA SERVER ! Manages metadata for a POSIX-compliant shared
filesystem ! Directory hierarchy ! File metadata (owner, timestamps, mode, etc.)
! Stores metadata in RADOS ! Does not serve file data to clients ! Only required for shared filesystem
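For completeness, mounting the shared filesystem with the kernel client looks roughly like this (the monitor address, mount point and credential paths are placeholders):

$ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret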
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SEPARATE METADATA SERVER
53
RADOS CLUSTER
data metadata
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH ADVANTAGE: DECLUSTERED PLACEMENT
! Consider a failed 2TB RAID mirror ! We must copy 2TB from the survivor to the successor ! Survivor and successor are likely in same failure zone
! Consider two RADOS objects clustered on the same primary ! Surviving copies are declustered (on different secondaries) ! New copies will be declustered (on different successors) ! Copy 10GB from each of 200 survivors to 200 successors ! Survivors and successors are in different failure zones
CEPH ADVANTAGE: DECLUSTERED PLACEMENT
! Benefits ! Recovery is parallel and 200x faster ! Service can continue during the recovery process ! Exposure to 2nd failures is reduced by 200x ! Zone aware placement protects against higher level failures ! Recovery is automatic and does not await new drives ! No idle hot-spares are required
CEPH ADVANTAGE: OBJECT GRANULARITY
! Consider a failed 2TB RAID mirror ! To recover it we must read and write (at least) 2TB ! Successor must be same size as failed volume ! An error in recovery will probably lose the file system
! Consider a failed RADOS OSD ! To recover it we must read and write thousands of objects ! Successor OSDs must each have some free space ! An error in recovery will probably lose one object
CEPH ADVANTAGE: OBJECT GRANULARITY
! Benefits ! Heterogeneous commodity disks are easily supported ! Better and more uniform space utilization ! Per-object updates always preserve causality ordering ! Object updates are more easily replicated over WAN links ! Greatly reduced data loss if errors do occur
CEPH ADVANTAGE: INTELLIGENT STORAGE
! Intelligent OSDs automatically rebalance data ! When new nodes are added ! When old nodes fail or are decommissioned ! When placement policies are changed
! The resulting rebalancing is very good: ! Even distribution of data across all OSDs ! Uniform mix of old and new data across all OSDs ! Moves only as much data as required
CEPH ADVANTAGE: INTELLIGENT STORAGE
! Intelligent OSDs continuously scrub their objects ! To detect and correct silent write errors before another failure
! This architecture scales from Petabytes to exabytes ! A single pool of thin provisioned, self-managing storage ! Serving a wide range of block, object, and file clients
QUESTIONS?
USE CASES
USING CEPH
DOES ONE SOLUTION FIT ALL?
! Remember this?
Ceph ideally replaces legacy storage and provides a single solution for the cloud.
! One simple rule: if the performance is not high enough, add more nodes …
Copyright © 2015 Red Hat, Inc. | Private and Confidential
The most widely deployed technology for OpenStack storage [1]
CLOUD INFRASTRUCTURE WITH OPENSTACK
63
Features:
• Full integration with Nova, Cinder & Glance
• Single storage for images and ephemeral & persistent volumes
• Copy-on-write provisioning
• Swift-compatible object storage gateway
• Full integration with Red Hat Enterprise Linux OpenStack Platform
Benefits:
• Provides both volume storage and object storage for tenant applications
• Reduces provisioning time for new virtual machines
• No data transfer of images between storage and compute nodes required
• Unified installation experience with Red Hat Enterprise Linux OpenStack Platform
[1] http://superuser.openstack.org/articles/openstack-user-survey-insights-november-2014
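As an illustration of the Cinder and Glance integration, the relevant configuration is only a few lines; the pool and user names below (volumes, images, cinder, glance) are the conventional examples from the upstream documentation, not fixed values, and option names reflect Juno-era OpenStack:

# cinder.conf
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>

# glance-api.conf
default_store = rbd
rbd_store_pool = images
rbd_store_user = glance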
Copyright © 2015 Red Hat, Inc. | Private and Confidential
OPENSTACK SURVEY NOVEMBER 2014
source: http://superuser.openstack.org/articles/openstack-user-survey-insights-november-2014
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH AND OPENSTACK
65
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
S3-compatible, on-premise cloud storage with uncompromised cost and performance
WEB-SCALE OBJECT STORAGE FOR APPLICATIONS
66
Features:
• Compatibility with S3 APIs
• Fully-configurable replicated storage backend
• Pluggable erasure coded storage backend
• Cache tiering pools
• Multi-site failover
Benefits:
• Supports broad ecosystem of tools and applications built for the S3 API
• Provides a modern hot/warm/cold storage topology that offers cost-efficient performance
• Advanced durability at optimal price
• Delivers a foundation for the next generation of cloud-native applications using object storage
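The erasure-coded backend and cache tiering mentioned above are set up with a handful of commands; a rough sketch (pool names and PG counts are illustrative only):

$ ceph osd pool create cold-ec 256 256 erasure     # erasure-coded base pool
$ ceph osd pool create hot-cache 128               # replicated cache pool
$ ceph osd tier add cold-ec hot-cache
$ ceph osd tier cache-mode hot-cache writeback
$ ceph osd tier set-overlay cold-ec hot-cache      # client I/O is redirected through the cache tier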
S3 API
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WEB APPLICATION STORAGE
67
S3/Swift
Copyright © 2015 Red Hat, Inc. | Private and Confidential
MULTI-SITE OBJECT STORAGE
68 Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHIVE / COLD STORAGE
69
CEPH STORAGE CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHIVE / COLD STORAGE
70
Site A Site B
CEPH STORAGE CLUSTER CEPH STORAGE CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WEBSCALE APPLICATIONS
71
Native Protocol
Copyright © 2015 Red Hat, Inc. | Private and Confidential
DATABASES
72
Native Protocol
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Why do people ask about iSCSI?
! There is currently NO supported iSCSI solution for Ceph
! We’ve tested a TGT proxy with mixed results
! Goal is a hardened iSCSI solution
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHAT ABOUT iSCSI?
CEPH FOR HADOOP
! CephFS can be used as a drop-in replacement for HDFS
! Overcomes limitations of HDFS
! Very experimental at this time
! Not supported
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Customers often ask about VMware compatibility ! We have tried via third-party iSCSI target software (stgt), but results were mixed
! Pre-acquisition, some large banks requested that VMware incorporate Ceph support into their hypervisor ! VMware has indicated they will evaluate incorporating Ceph support into their next major release (scheduled 2015)
! If you encounter companies using VMware that want this, please let them know they should request it from VMware
! We will be looking at VVOL support in a later version as well
Copyright © 2014 Red Hat, Inc. | Private and Confidential
VMWARE
75
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH ON DRIVES
76
! Object model ! Pools
! 1s to 100s ! Independent namespaces for object collections ! Replication level, placement policy
! Objects ! Bazillions ! Blob of data (bytes to gigabytes) ! Attributes (e.g. “version=12”) ! Key/value bundle
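A quick sketch of the attribute part of that model, continuing with the hypothetical app-data pool and greeting object from the earlier librados example:

$ rados -p app-data setxattr greeting version 12    # attach an attribute to the object
$ rados -p app-data getxattr greeting version
$ rados -p app-data listxattr greeting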
CLOSER LOOK AT LIBRADOS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ATOMIC TRANSACTIONS
! Client operations sent to the OSD cluster ! Operate on a single object ! Can contain a sequence of operations; e.g.
! Truncate object ! Write new object data ! Set attribute
ATOMIC TRANSACTIONS
! Atomicity ! All operations commit or do not commit atomically
! Conditional ! ‘guard’ operations can control
whether operation is performed ! Verify xattr has specific value ! Assert object is a specific version
! Allows atomic compare-and-swap etc.
KEY/VALUE STORAGE ! store key/value pairs in an object
! independent from object attrs or byte data payload
! based on google's leveldb ! efficient random and range insert/query/removal ! based on BigTable SSTable design
! exposed via key/value API ! insert, update, remove ! individual keys or ranges of keys
! avoid read/modify/write cycle for updating complex objects ! e.g., file system directory objects
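The rados CLI exposes the same key/value (omap) API; a small sketch against the hypothetical greeting object, with made-up keys:

$ rados -p app-data setomapval greeting owner alice        # set a key/value pair
$ rados -p app-data setomapval greeting mtime 1423742400
$ rados -p app-data listomapkeys greeting
$ rados -p app-data getomapval greeting owner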
WATCH/NOTIFY ! establish stateful 'watch' on an object
! client interest persistently registered with object ! client keeps session to OSD open
! send 'notify' messages to all watchers ! notify message (and payload) is distributed to all watchers ! variable timeout ! notification on completion
! all watchers got and acknowledged the notify ! use any object as a communication/synchronization channel
! locking, distributed coordination (ala ZooKeeper), etc.
WATCH/NOTIFY
WATCH/NOTIFY EXAMPLE
! radosgw cache consistency ! radosgw instances watch a single object (.rgw/notify) ! locally cache bucket metadata ! on bucket metadata changes (removal, ACL changes) ! write change to relevant bucket object ! send notify with bucket name to other radosgw
instances ! on receipt of notify
! invalidate relevant portion of cache
RADOS CLASSES
! dynamically loaded .so ! /var/lib/rados-classes/* ! implement new object “methods” using existing methods ! part of I/O pipeline ! simple internal API
RADOS CLASSES
! reads ! can call existing native or class methods ! do whatever processing is appropriate ! return data
! writes ! can call existing native or class methods ! do whatever processing is appropriate ! generates a resulting transaction to be applied
atomically
CLASS EXAMPLES ! Grep
! read an object, filter out individual records, and return those
! Sha1 ! read object, generate fingerprint, return that
! Images ! rotate, resize, crop image stored in object ! remove red-eye
! crypto ! encrypt/decrypt object data with provided key
IDEAS
! rados mailbox (RMB?) ! plug librados backend into dovecot, postfix, etc. ! key/value object for each mailbox
! key = message id ! value = headers
! object for each message or attachment ! watch/notify to delivery notification
IDEAS ! distributed key/value table
! aggregate many k/v objects into one big 'table’ ! working prototype exists
! lua rados class ! embed lua interpreter in a rados class ! ship semi-arbitrary code for operations
! json class ! parse, manipulate json structures
● RBD is a relatively thin layer on top of RADOS ● many RBD problems are RADOS problems ● Most RBD-specific issues are configuration/
integration with other software (libvirt, OpenStack, CloudStack, QEMU, etc.)
● Most of these issues are during initial setup
CLOSER LOOK AT RBD
To connect, need: ● monitor addresses ● client name (--name client.foo, or --id foo) ● secret key (from keyring, plain key file, or command
line) Capabilities needed after connecting: ● mon 'allow r' (every client) ● osd 'allow rwx pool foo' (RBD rw access to images in pool foo) ● osd 'allow r class-read pool bar' (RBD ro access to pool bar) ● osd 'allow class-read object_prefix rbd_children'
o for RBD format 2 snapshot unprotection ● ceph-authtool man page has more
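Creating a client with exactly those capabilities is one command; a sketch using a hypothetical client.foo and pool names:

$ ceph auth get-or-create client.foo \
      mon 'allow r' \
      osd 'allow rwx pool=foo, allow r class-read pool=bar' \
      -o /etc/ceph/ceph.client.foo.keyring
$ ceph auth list                                   # verify what was granted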
RADOS AUTH
● hang connecting o is ceph.conf being read for the monitor addresses? o correct keyring being read? o are monitors up and accessible from this machine? o --debug-ms 1 will show if it's getting any response o strace useful for checking the correct files are being
read ● error connecting
o permission denied ! is keyring and ceph.conf readable? ! are capabilities correct? (ceph auth list) ! is the right client name being used?
COMMON RADOS CLIENT PROBLEMS
Format 2 is not supported yet:
$ sudo modprobe rbd
$ rbd create -s 20 --format 2 foo
$ sudo rbd map foo
rbd: add failed: (2) No such file or directory
$ dmesg | tail
[17246768.484421] libceph: client0 fsid c78b75ac-c0a1-4a2a-9401-f1cc74f541f0
[17246768.486924] libceph: mon0 10.214.131.34:6789 session established
Providing a secret that does not match the rados user (client.admin):
$ sudo rbd map foo --keyfile keyfile
rbd: add failed: (1) Operation not permitted
$ dmesg | tail
[17247003.169925] libceph: client0 fsid c78b75ac-c0a1-4a2a-9401-f1cc74f541f0
[17247003.170480] libceph: auth method 'x' error -1
KERNEL RBD
What if the cluster stops?
$ mkfs.xfs /dev/rbd0 &
$ ps aux | grep mkfs
root 32059 0.0 0.0 18800 1084 pts/0 D 16:17 0:00 mkfs.ext4 /dev/rbd/rbd/test
$ dmesg | tail
[17174479.039082] libceph: mon0 10.214.131.34:6789 socket closed
[17174479.049735] libceph: mon0 10.214.131.34:6789 session lost, hunting for new mon
[17174479.049803] libceph: mon0 10.214.131.34:6789 connection failed
[17174511.047618] libceph: osd1 10.214.131.34:6803 socket closed
[17174511.058221] libceph: osd1 10.214.131.34:6803 connection failed
Newer kernels will not let /dev/rbd0 be unmapped at this point, since it is still in use by mkfs.xfs. rbd unmap will show 'device is busy'.
KERNEL RBD
Crush tunables are supported in 3.6+. Before then:
$ rbd map foo &
(hangs)
$ dmesg | tail
[17248284.205022] libceph: mon0 10.214.131.34:6789 feature set mismatch, my 2 < server's 2040002, missing 2040000
[17248284.219332] libceph: mon0 10.214.131.34:6789 missing required protocol features
KERNEL RBD
Most errors should give a useful error message, i.e.:
$ rbd snap rm foo@snap
rbd: snapshot 'snap' is protected from removal.
2013-02-22 13:08:02.603716 7fde490e3780 -1 librbd: removing snapshot from header failed: (16) Device or resource busy
$ rbd rm foo
2013-02-22 13:06:52.836368 7fd46ae54780 -1 librbd: image has snapshots - not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
USERSPACE RBD
Same problems with auth can occur:
$ rbd --auth-supported none create -s 1 image
2013-02-22 13:14:29.796404 7ffa74c26780 0 librados: client.admin authentication error (95) Operation not supported
rbd: couldn't connect to the cluster!

$ rbd --name client.foo create -s 1 image
2013-02-22 13:14:01.944670 7fda89844780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2013-02-22 13:14:01.944675 7fda89844780 0 librados: client.foo initialization error (2) No such file or directory
rbd: couldn't connect to the cluster!
Use --debug-ms 1, --debug-rbd 20 to see lots of detail.
USERSPACE RBD
Libvirt is often used to configure QEMU to use librbd. ● same issues, info harder to reach ● check instance logs in /var/log/libvirt/qemu/* ● logging, admin socket often complicated by
apparmor, selinux, and unix permissions ● QEMU will read /etc/ceph/ceph.conf by default, but
it must be accessible
MOVING UP THE STACK
● userspace only ● useful for showing current config ● unlike other parts of ceph, provides help for every command
automatically ● if there are performance issues, can help debug with perf dump ● disabled for clients (like librbd) by default ● enable by adding to client section in ceph.conf, or in libvirt
appending to disk name (i.e. name="pool/image:admin_socket=/path/")
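A few concrete invocations (the daemon id and socket path are examples; the path follows whatever admin socket setting is in ceph.conf):

$ ceph daemon osd.3 help                                            # list every command the socket supports
$ ceph daemon osd.3 perf dump                                       # performance counters
$ ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok config show     # same mechanism via the socket path

To enable it for librbd clients, something like this in ceph.conf:
[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.asok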
ADMIN SOCKET
Error handling is still not easy for users - you need to look at the service logs. The request path is:
cinder-api -> cinder-scheduler -> cinder-volume
Nova uses libvirt to configure QEMU, so same issues here (but QEMU log may be in nova's instance directory).
OPENSTACK
● check rados performance ● is rbd_cache enabled (and QEMU configured to send flushes)? ● perf dump from admin socket may give coarse clues ● more infrastructure to measure needed,
should be added soon
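Checking the first item is a two-line ceph.conf sketch; the second option guards against guests that never send flushes (the values shown are the common recommendation, not a universal tuning):

[client]
    rbd cache = true
    rbd cache writethrough until flush = true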
PERFORMANCE ISSUES
QUESTIONS?
RELIABILITY BEST PRACTICES
USING CEPH
! Block & Object are rock-solid and are in use at many important production sites
! CephFS cannot be used reliably yet as it is still under heavy development
! As features are added new bugs might be introduced
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HOW RELIABLE IS CEPH?
103
Striping, erasure coding, or replication across nodes
! Enjoy data durability, high availability, and high performance
Dynamic block resizing
! Expand or shrink block devices with no or minimal downtime
Data placement
! No need for lookups - every client can calculate where data is stored
! Policy-defined SLAs, performance tiers, and failure domains
Automatic failover
! Prevent failures from impacting integrity or availability
WHY IS CEPH RELIABLE?
104
RELIABILITY & AVAILABILITY
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Use multiple RADOSGWs behind HAProxy (see the sketch below)
! Monitor the mailing lists for critical updates and fixes (or purchase support for Ceph from Red Hat and run Inktank Ceph Enterprise)
! Deploy Ceph in an automated fashion, using a configuration management/scripting system like Puppet
! Monitor all Ceph nodes like you would monitor any Linux server; only the response procedures might differ
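A minimal HAProxy sketch for fronting two gateways; the addresses and the civetweb port 7480 are assumptions, adjust to your deployment:

frontend rgw_frontend
    bind *:80
    mode http
    default_backend rgw_backend
backend rgw_backend
    mode http
    balance roundrobin
    option httpchk GET /
    server rgw1 10.0.0.11:7480 check
    server rgw2 10.0.0.12:7480 check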
RELIABILITY BEST PRACTICES
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HARD- & SOFTWARE ARCHITECTURE
USING CEPH
! Good news: Ceph can run on any hardware ! Bad news: Ceph can run on any hardware ! Paying attention to hardware up front can help eliminate problems later
! You do not need, and should not use, a RAID controller ! Always check if the controller supports pass-through (JBOD)
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH HARDWARE REQUIREMENTS
107
Copyright © 2015 Red Hat, Inc. | Private and Confidential
STORAGE NODE SIZING
108
Block Storage • Smaller nodes, <24 HDDs
Object Storage • Larger nodes, up to 72 HDDs
! Storage Node Processor Requirements ! 1 GHz of 1 core per OSD ! Example: a single-socket 8-core 2.0 GHz processor will support 16 OSDs
! Storage Node Memory Requirements ! 2-3 GB per OSD
! Monitor Requirements ! Lower Requirements ! E5-2603, 16GB RAM, 1GbE
Copyright © 2015 Red Hat, Inc. | Private and Confidential
MEMORY AND PROCESSOR RULES
109
! To improve write performance, moving write journals to SSD drives is recommended
! One SSD drive per 5-6 spinning drives ! A write journal requires approx. 10 GB per journal, so
80-120 GB drives are sufficient
! A typical Ceph storage node has 12 spinning disks and 2 SSD disks (for example, a Dell R515)
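With ceph-deploy the disk-to-journal pairing is expressed directly; a sketch assuming data disks sdb and sdc and a shared journal SSD sdm on a node called node1:

$ ceph-deploy osd create node1:sdb:/dev/sdm node1:sdc:/dev/sdm
# each OSD keeps its data on a spinning disk and gets a journal partition on the shared SSD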
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SSD JOURNALS
110
! Want to know what hardware to select for validating Ceph in your environment? This document provides high-level guidelines and recommendations for selecting commodity hardware infrastructure from major server vendors, enabling you to quickly deploy Ceph-based storage solutions.
Download at: https://engage.redhat.com/inktank-hardware-selection-guide-s-201409080912
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HARDWARE SELECTION GUIDE
111
NETWORK ARCHITECTURE
USING CEPH
BASIC NETWORK CONSIDERATIONS
! Separate client and backend network
! Backend network preferably 10 GbE
! Client facing network speed depends on requirements
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 -v tcp_lat tcp_bw rc_rdma_write_lat rc_rdma_write_bw rc_rdma_write_poll_lat
! fio --filename=$RBD|--directory=$MDIR --direct=1 --rw=$io --bs=$bs --size=10G --numjobs=$threads --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
! RBD=/dev/rbdX, MDIR=/cephfs/fio-test-dir
! io=write,randwrite,read,randread
! bs=4k,8k,4m,8m
! threads=1,64,128
Latency is defined by the network
Copyright © 2015 Red Hat, Inc. | Private and Confidential
qperf - latency
Copyright © 2015 Red Hat, Inc. | Private and Confidential
qperf - bandwidth
Copyright © 2015 Red Hat, Inc. | Private and Confidential
qperf - explanation ! tcp_lat almost the same for blocks <= 128 bytes ! tcp_lat very similar between 40GbE, 40Gb IPoIB and 56Gb IPoIB ! Significant difference between 1 / 10GbE only for blocks >= 4k ! Better latency on IB can only be achieved with rdma_write /
rdma_write_poll ! Bandwidth on 1 / 10GbE very stable on possible maximum ! 40GbE implementation (MLX) with unexpected fluctuation ! Under IB only with RDMA the maximum transfer rate can be achieved ! Use Socket over RDMA ! Options without big code changes are: SDP, rsocket, SMC-R
Copyright © 2015 Red Hat, Inc. | Private and Confidential
PERFORMANCE EXPECTATIONS
USING CEPH
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHAT PERFORMANCE TO EXPECT?
119
! Not easy to calculate; it really depends on the workload and the chosen hardware components
! But: performance can be enhanced by adding more nodes
! Generally speaking: to calculate IOPS: # drives * IOPS/drive * 80%
! Many factors affect performance
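As a rough illustration with assumed numbers: 72 SATA drives at ~100 IOPS each gives 72 * 100 * 0.8 ≈ 5,760 IOPS for the cluster, before accounting for replication overhead on writes.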
Copyright © 2014 Red Hat, Inc. | Private and Confidential
WHAT PERFORMANCE TO EXPECT?
120
MY CUSTOMER REQUIRES X IOPS/GB
! Some customers have very specific requirements ! X IOPS/GB is a common requirement ! Although the total number of IOPS increases when a cluster grows, the capacity grows as well, so your IOPS/GB does not go up
! Not for all GBs anyway ! In these cases, recommend small and fast hard drives (or better: SSDs)
! Ceph relies heavily on a tuned OS ! Network subsystem tuning ! Disk I/O subsystem tuning
! Bottlenecks can ! Jeopardize Ceph’s overall performance ! Jeopardize overall balance
WHAT IS AT STAKE?
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Assuming tuning has been achieved:
! How can we check the path is clear for performance?
! How can we check Ceph’s performance?
WHAT IS AT STAKE?
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Fast journal drives – preferably SSDs ! Data drive choice must match SLAs
! Life can’t be reduced to just costs ! Business drives both costs and SLAs
! 10GbE or Infiniband cluster network ! 2-3GB RAM per OSD daemon ! 1 GHz of a CPU core per OSD daemon
GENERAL BEST PRACTICES
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SYSTEM METRICS
! Network subsystem ! Public side network must be monitored ! Cluster side network must be monitored
! Metrics ! Frames per second ! Bandwidth
SYSTEM METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Disk IO subsystem ! Journal drives must be monitored ! Data drives must be equally monitored
! Metrics ! General IO wait time ! Device IO wait queue ! Device utilization rate & IOPS ! Device bandwidth usage ! Device error rates
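Standard Linux tools cover most of these metrics; for example (device names are placeholders):

$ sar -n DEV 5            # per-interface frames/packets and bandwidth
$ iostat -x 5             # per-device utilization, wait queue, IOPS and throughput
$ smartctl -a /dev/sdb    # drive error counters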
SYSTEM METRICS
CLUSTER HEALTH
! General cluster health ! OSDs status ! PG status ! Cluster health
CLUSTER HEALTH
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! General cluster health ! ceph health and ceph health detail ! ceph -s and ceph status ! ceph -w
CLUSTER HEALTH
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH METRICS
! Data drive usage ! ceph health detail
! Check for x nearfull osds ! Check for x full osds
! df output ! Important to remember
! Writes will get denied on full OSD ! Backfilling will get denied on full OSD ! Recovery will get denied on full OSD
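A quick way to check those numbers on a live cluster (the OSD mount path shown is the default location and may differ in your deployment):

$ ceph health detail            # calls out any nearfull or full OSDs
$ ceph df                       # global and per-pool usage
$ df -h /var/lib/ceph/osd/*     # raw filesystem usage on an OSD node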
CEPH METRICS
! Admin socket (Check primary PG assignment) ! ceph daemon osd.$id perf dump
! numpg Total PGs on this OSD ! numpg_primary Primary PGs on this OSD ! numpg_replica Secondary (replica) PGs on this OSD
CEPH METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Admin socket (Check OSD usage) ! ceph daemon osd.$id perf dump
! stat_bytes Total bytes this OSD ! stat_bytes_used Bytes used this OSD ! stat_bytes_avail Bytes available this OSD
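Both sets of counters come from the same perf dump; a small sketch that pretty-prints the JSON and pulls out just the fields mentioned above:

$ ceph daemon osd.0 perf dump | python -m json.tool | grep -E 'numpg|stat_bytes'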
CEPH METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Admin socket (Checking ops and subops live) ! ceph daemon osd.$id dump_ops_in_flight
! "flag_point" : "waiting for subops" ! "events"
! "time": "2014-03-17 17:55:57.274529", ! "event": "xxxxxxxxxxxxxxx"
CEPH METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Admin socket (Checking ops and subops live) ! waiting_for_osdmap ! reached_pg ! started ! waiting for subops from [x] Only on primary OSD ! commit_queued_for_journal_write ! write_thread_in_journal_buffer ! journaled_completion_queued ! op_commit ! op_applied or sub_op_applied ! sub_op_commit_rec Only on primary OSD ! commit_sent ! done
CEPH METRICS
PERFORMANCE OPTIMIZATION
USING CEPH
GEO-REPLICATION
USING CEPH
LET’S START WITH AN EXAMPLE • CLIMB: a collaboration between
Birmingham, Cardiff, Swansea and Warwick Universities, to develop and deploy a world leading cyber infrastructure for microbial bioinformatics, supported by three world class Medical Research Fellows, a comprehensive training program and two newly refurbished Bioinformatics facilities in Swansea and Warwick.
Copyright © 2015 Red Hat, Inc. | Private and Confidential
QUOTE FROM WWW.CLIMB.AC.UK ! “For longer term data storage, to share datasets and VMs, and
to provide block storage for running VMs, we will be deploying a storage solution based on Ceph. Each site has 27 Dell R730XD servers, with each server containing 16x 4TB HDDs, giving a total raw storage capacity of 6912TB. All data stored in this system is replicated 3x, which gives us a usable capacity of 2304TB.”
Copyright © 2015 Red Hat, Inc. | Private and Confidential
TIME FOR OTHER EXAMPLES ! A large American cable company streams all their Video on
Demand content from 17 datacenters across the United States with Ceph.
! Another large American media sharing service has the largest “single” Ceph cluster that we know about: 300(!) PB. The cluster spans multiple datacenters.
! And there are so many more …
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SECURITY CONSIDERATIONS
USING CEPH
! The weakest link will fail eventually ! Although Ceph has no single point of failure, the administrator can cause serious trouble
! The storage system’s operating system should not be accessed by ‘normal’ administrators
! Consider storage nodes as ‘appliances’
SECURE YOUR SYSTEM LIKE ANY OTHER
MONITORING & MAINTENANCE
USING CEPH
! Ceph was designed to be self healing ! Cluster rebalancing is automatic ! Drives do not need to be swapped immediately ! Data distribution is handled automatically
! Calamari monitoring allows you to keep an eye on the cluster ! New Calamari management features allow cluster management through a GUI ! More management features are coming in the future
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HOW DIFFICULT IS IT TO MANAGE?
146
CALAMARI
Copyright © 2015 Red Hat, Inc. | Private and Confidential 147
COMMERCIAL BREAK
USING CEPH
Copyright © 2015 Red Hat, Inc. | Private and Confidential 149
INKTANK CEPH ENTERPRISE: WHAT’S INSIDE?
! DEPLOYMENT TOOLS: New “Wizard” for easier installation of Calamari; Local repository with dependencies; Cluster bootstrapping tool
! CALAMARI: On-premise, web-based application; Cluster monitoring and management; RESTful API; Unlimited Instances
! CEPH FOR OBJECT & BLOCK
! SUPPORT SERVICES: SLA-backed Technical Support; Bug Escalation; “Hot Patches”; Roadmap Input
BENEFITS
TCO FUTURE PROOFED EXPERTISE ENTERPRISE READY
" Very Low TCO " Operational
Efficiency
" Scalable Tools
150
" Long-Term Support " Single, predictable
price
" Use-Case Flexibility
" Backed by the Ceph experts
" Support from the developers
" Use Existing Infrastructure
" Extend use-case options
Copyright © 2015 Red Hat, Inc. | Private and Confidential
RELEASE SCHEDULE
Copyright © 2015 Red Hat, Inc. | Private and Confidential 151
(Release timeline by quarter, Q3 2013 through Q2 2015)
SUPPORT MATRIX
152
(Support matrix: currently supported and future supported platforms)
Copyright © 2015 Red Hat, Inc. | Private and Confidential
QUESTIONS?