System Professional / Red Hat Software Defined Storage - Past, Present & Future
TRANSCRIPT
IT Services Portfolio - Summary
Systems Integrator of Infrastructure Technology

Support Services
• 25+ Service Desk
• ITIL Framework
• Remote Monitoring
• 50+ contracts, 2K calls

Managed Services
• IaaS
• DRaaS (on demand)
• On-line Backup

Consultancy Services
• IT Specialists
• Design - Deploy
• PRINCE 2 Framework

Storage & Virtualisation · Microsoft Solutions · Backup & Disaster Recovery
Red Hat Software Defined Storage
Past, Present and Future

Nick Fisk
Tarquin Dunn
SysPro Cloud Team

London, UK, 26th February 2016
Introductions
Tarquin Dunn, SysPro CTO
@Tarqs

Nick Fisk, SysPro Senior Cloud Architect
Today’s discussion

• Software Defined Storage – Evolution
• Industry Uptake
• Key Architecture
• Case Study – us!
• Current Challenges & Issues
• Gluster Roadmap, Use Cases / Benefits
• Ceph Roadmap, Use Cases / Benefits
• Red Hat benefits
• SysPro benefits & next steps

past | present | future
A (VERY) BRIEF HISTORY OF EVERYTHING STORAGE
FROM BLOCKS
TO OBJECTS
Raw volumes accessed by client O/S
block
• Evenly sized “blocks” of data to emulate a physical hard drive (with sectors and tracks)
• No metadata
• Directly accessed from client OS (e.g. as a mounted drive)
• Faster writes (when compared to file/object)
• Geographically sensitive (the further the app is from the block storage, the higher the latency)
Storage based on files (group of blocks)
file
• Simple to implement & use
• Clients & Systems see same information
• File metadata stored with data
• Accessed by protocol such as NFS or SMB/CIFS
object
Every object contains 3 things:
• Data (photo, doc, database etc.)
• Expandable metadata
  • Defined by the creator
  • Contextual information (what, why, status, how used)
• Globally unique identifier
  • Address and location discovery
• Manipulated as a whole unit
• Good for web, archive/storage, high read / low write
• E.g. Facebook, Spotify, Amazon S3
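To make those three parts concrete, here is a small, hedged sketch using the Amazon S3 API mentioned above (the bucket, key and metadata values are made-up examples; any S3-compatible store accepts the same call):

```python
import boto3

# Connect to S3 (or any S3-compatible store by passing endpoint_url=...)
s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-bucket",                    # flat namespace, no directory tree
    Key="photos/2016/conference.jpg",           # globally unique, addressable identifier
    Body=open("conference.jpg", "rb"),          # the data itself
    Metadata={"camera": "x100t", "event": "sds-talk"},  # expandable, creator-defined metadata
)
```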
In the beginning there were just file servers…
Files stored locally on direct-attached hard disks
Shared storage requirements appeared/increased
Centrally managed storage e.g. for HA or multiple access
…then block storage became more prevalent
The ability to make remote storage look and work like local disks
… now object storage
Big data, hyper-scale computing
Automatic provisioning
Increased element management
Evolution from DAS to NAS/SAN
Solved a challenge
• Central management of storage
• Efficiency
• Allowed shared storage
• HA etc.

Created a challenge *
• Scale out
• Cost
• Flexibility
• Multiple diverse clients

* Eventually!
Requirements have become more complex
• File share, Virtualisation, Analytics etc.
• File, object, block
• Multiple different workloads
• Virtualisation
• Self-service
• Data growth (structured & unstructured)
• Snapshots etc.
Open Source - why is this important?
Open Source =
• Internal skills
• Shape and develop products
• Innovation rather than profit
• Open Standards driven
• Big vendors – big contributors
Current Challenges (in more detail…!)
the present
Potential Problems
• RAID rebuilds with larger disks
• Scale up reaching limits
• Expensive fork-lift upgrades
• Proprietary
• More complex ways of consuming storage
What can I do today to address my current (or potentially future) issues?
Many ways of consuming storage
iSCSI · Fibre Channel · SMB/CIFS · NFS · Object S3 · Distributed File System

Is your storage agile enough to deliver?
[Chart: RAID rebuild time (days) against disk size (TB)]
RAID Rebuild Times: disk capacity increases every year, but speeds have altered little
Expect an active array rebuild rate of ≤ 10MB/s
∴ a 10TB drive at a 10MB/s rebuild rate ≈ 12 days, and well over 2 weeks at lower real-world rates!
More data + same speed = longer rebuild times
Performance is degraded and data is at risk during this period!
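As a quick sanity check on that figure, a back-of-envelope calculation (assuming decimal terabytes and a constant rebuild rate):

```python
TB = 10 ** 12           # decimal terabyte, in bytes
MB = 10 ** 6

drive_size = 10 * TB    # 10TB drive
rebuild_rate = 10 * MB  # 10MB/s sustained rebuild rate (optimistic)

days = drive_size / rebuild_rate / 86400
print(days)             # ~11.6 days at 10MB/s; at 5MB/s it is well over 3 weeks
```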
How often do disks fail?
…about 5% Annual Failure Rate
Why does that 5% matter?
Probability of data loss with 10TB drives in RAID6 over a 5-year period: 25%
(driven by 2-week rebuild times + unrecoverable read errors)
With RAID5 it is 99%... but you’re not using that with large drives… are you?
This is only going to get worse as drive capacities increase!
Don’t take our word for it – run the numbers yourself: http://wintelguy.com/raidmttdl.pl
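The linked calculator does this properly; the sketch below is only a simplified back-of-envelope model of the same idea, with illustrative inputs and the unrecoverable-read-error term left out:

```python
# Rough RAID5 data-loss estimate: data is lost if any further disk fails
# while a rebuild is already in progress. All inputs below are assumptions.
n_disks = 12            # disks in the array
afr = 0.05              # 5% annual failure rate per disk
rebuild_days = 14       # rebuild window for a large drive
years = 5               # mission time

expected_first_failures = n_disks * afr * years
p_second_failure = (n_disks - 1) * afr * rebuild_days / 365
p_loss = 1 - (1 - p_second_failure) ** expected_first_failures
print(f"~{p_loss:.0%} chance of data loss over {years} years")   # ~6% for these inputs
```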
So, can you afford to gamble with those odds?
5 out of 6 Russian Roulette players believe it’s completely safe…
Question is… well, do ya, ****?
Rebuilding RAID vs Software Defined Storage

RAID
Every remaining disk reads its entire capacity. The lost data is calculated from parity, then written to the replacement disk.
! The more disks, the longer it takes to read all the data.

SDS
Every disk reads, writes and communicates with all other disks to replace just the degraded data.
The more disks, the faster it can recover.
Upgrades and Expansion
Recognise this model?
Every 3-5 years replace entire SAN with a faster larger model
• Expensive
• Fork lift upgrade model
• Require lengthy data migration period
• Extended Vendor support conveniently equals cost of upgrading to new model?
Upgrades and Expansion
Or the other way…?
Upgrade by adding more nodes
• Choose your own hardware and support terms
• Retire old nodes as and when you see fit
• Data migration is an intrinsic part of the system
I’ve got 99 problems, but a dead server ain’t one …
Design to fail. Use cost-effective commodity hardware
Resilience is provided by the software layer
No single point of failure – Spot the difference
Software Defined Solutions
Gluster and Ceph
Gluster
Gluster Overview: scale-out file storage for petabyte-scale workloads
• Purpose-built as a scale-out file store
• Straightforward architecture suitable for public, private and hybrid cloud
• Simple to install and configure
• Minimal hardware footprint
• Offers mature NFS, SMB and HDFS interfaces for enterprise use
TARGET USE CASES

Analytics
• Machine analytics with Splunk
• Big data analytics with Hadoop

Enterprise File Sharing

Enterprise Virtualisation

Rich Media & Archival
• Media Streaming
• Active Archives
Gluster Big Data Analytics
In-place Hadoop analytics in a POSIX compatible environment
Gluster Machine Data Analytics
High-performance, scale-out, online cold storage for Splunk Enterprise
Gluster Rich Media
Massively-scalable, flexible, and cost-effective storage for image, video and audio content
Gluster Active Archives
Open-source, capacity-optimised archival storage on commodity hardware
Gluster File Sync and Share
Powerful, software-defined, scale-out, on-premises storage for file sync and share with ownCloud
Gluster Storage Concepts
Gluster Bricks
• A brick is the combination of a node and a file system: hostname:/dir
• Each brick inherits limits of the underlying file system (XFS)
• Red Hat Storage Server operates at the brick level, not at the node level
• Ideally, each brick in a cluster should be the same size
Gluster Volumes
• A volume is some number of bricks (two or more), clustered and exported with Gluster
• Volumes have administrator assigned names (= export names)
• A brick is a member of only one volume
• A global namespace can have a mix of replicated and distributed volumes
• Data in different volumes physically exists on different bricks
• Volumes can be sub-mounted on clients using NFS, CIFS and/or Glusterfs clients
• The directory structure of the volume exists on every brick in the volume
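Putting bricks and volumes together, a minimal sketch of creating a two-brick replicated volume (hostnames, brick paths and the volume name are made-up examples; the standard Gluster CLI is driven from Python here purely for illustration):

```python
import subprocess

def gluster(*args):
    # Run a Gluster CLI command and fail loudly if it returns an error.
    subprocess.run(["gluster", *args], check=True)

# Hypothetical two-node setup: one brick per node, combined into one replicated volume.
gluster("peer", "probe", "server2")
gluster("volume", "create", "vol01", "replica", "2",
        "server1:/bricks/brick1", "server2:/bricks/brick1")
gluster("volume", "start", "vol01")
```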
Gluster Elastic Hash Algorithm
• No central metadata, no performance bottleneck, eliminates risk scenarios
• Location hashed on file name; unique identifiers, similar to md5sum
• The elastic part:
  • Files assigned to virtual volumes
  • Virtual volumes assigned to multiple bricks
  • Volumes easily reassigned on the fly
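Purely to illustrate the idea (this is a toy, not GlusterFS's actual distributed hash translator):

```python
import hashlib

def brick_for(path, bricks):
    # Hash the file path; every client computes the same answer independently,
    # so no metadata server has to be asked where the file lives.
    digest = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return bricks[digest % len(bricks)]

bricks = ["server1:/bricks/b1", "server2:/bricks/b1", "server3:/bricks/b1"]
print(brick_for("/photos/cat.jpg", bricks))   # deterministic placement from the name alone
```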
Gluster Data Placement Strategies (GlusterFS volume types and their characteristics)

Distributed
• Distributes files across bricks in the volume
• Used where scaling and redundancy requirements are not important, or are provided by other hardware or software layers

Replicated
• Replicates files across bricks in the volume
• Used in environments where high availability and high reliability are critical

Distributed Replicated
• Offers improved read performance in most environments
• Used in environments where high reliability and scalability are critical
Gluster Default Data Placement (distributed volume)
Gluster Fault-tolerant data placement (distributed replicated volume)
Gluster Erasure Coding
Storing more data with less hardware
• Standard replication back-ends are very durable, can recover quickly, but have inherently large capacity overheads
• Erasure coding back-ends reconstruct corrupted or lost data by using information about the data stored elsewhere in the system
• Providing failure protection with erasure coding:
  • Eliminates the need for RAID
  • Consumes far less space than replication
  • Can be appropriate for capacity-optimised use cases
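A small worked example of the capacity argument (the 100TB figure and the 4+2 layout are illustrative assumptions):

```python
usable_tb = 100                      # usable data we want to protect (example figure)

# 3-way replication: every byte is stored three times.
replica_raw = usable_tb * 3          # 300 TB raw, tolerates loss of 2 copies

# Erasure coding, e.g. a 4+2 layout: 4 data fragments + 2 parity fragments.
k, m = 4, 2
ec_raw = usable_tb * (k + m) / k     # 150 TB raw, still tolerates loss of any 2 bricks

print(replica_raw, ec_raw)           # 300 vs 150 – half the hardware for the same protection
```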
Gluster Tiering
Cost-effective flash acceleration
• Optimally:
  • Frequently accessed data can be served from faster, more expensive systems
  • Infrequently accessed data is served from less expensive storage systems
• Manually moving data between storage tiers can be time-consuming and expensive
• Gluster supports automated promotion and demotion of data between ‘hot’ and ‘cold’ sub-volumes
Gluster Bit Rot Detection
Detection of silent data corruption
• A mechanism that detects data corruption resulting from silent hardware failures, leading to deterioration in performance and integrity
• Gluster provides a mechanism to scan data periodically and detect bit-rot
• Using SHA256 algorithm:
• Checksums are computed when files are accessed
• Compared against previously stored values
• Unmatched value logged as error for storage admin
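The mechanism is easy to picture. A toy sketch of the same checksum-and-compare idea (not Gluster's implementation; paths are made up, and a real scrubber only re-checksums files that have not been legitimately modified since the last signature):

```python
import hashlib, json, os

DB = "checksums.json"   # hypothetical store of previously computed signatures

def sha256(path):
    # Stream the file so large files don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(files):
    db = json.load(open(DB)) if os.path.exists(DB) else {}
    for path in files:
        current = sha256(path)
        if path in db and db[path] != current:
            print(f"bit-rot suspected: {path}")   # log for the storage admin
        db[path] = current
    json.dump(db, open(DB, "w"))

scrub(["/bricks/brick1/photos/cat.jpg"])
```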
Gluster Multi-protocol access
GlusterFS Native Client (FUSE)
• Based on FUSE kernel module, which allows the filesystem to operate entirely in userspace
• Specify mount to any GlusterFS server
• Recommended for high concurrency and high write performance
• Load is inherently balanced across distributed volumes
• Native Client fetches volfile from mount server, then communicates directly with all nodes to access data
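A minimal example of such a mount (server and volume names are hypothetical; in practice this usually lives in /etc/fstab):

```python
import subprocess

# Mount a GlusterFS volume with the native FUSE client. "server1" only supplies
# the volume file; subsequent I/O goes directly to all bricks in the volume.
subprocess.run(["mount", "-t", "glusterfs", "server1:/vol01", "/mnt/vol01"], check=True)
```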
Gluster NFS
• Standard NFS v3 clients connect to GlusterFS NFS server process (user space) on storage node
• Mount GlusterFS volume from any storage node
• Better performance for reading many small files from a single client
• Load balancing must be managed externally
• GlusterFS NFS server includes network lock manager (NLM) to synchronize locks across clients
• Standard automounter is supported
Gluster SMB/CIFS
• Storage node uses Samba with winbind to connect with Active Directory environments
• Samba uses the libgfapi library to communicate directly with the GlusterFS server process without going through FUSE
• SMB version 2.0 supported
• Load balancing must be managed externally
• SMB clients can connect to any storage node running Samba
• CTDB is required for Samba clustering
Gluster Object access of GlusterFS volume
• Built upon OpenStack’s Swift object storage
• GlusterFS is the back-end file system for Swift
• Accounts are implemented as GlusterFS volumes
• Store and retrieve files using the REST interface
• Implements objects as files and directories under the container
• Supports integration with SWAuth and the Keystone authentication service
Gluster Hadoop plug-in for HDFS access
Red Hat Storage Server now offers a Hadoop file system plug-in
• Benefit: run in-place analytics on data stored in a Red Hat Storage Server without the overhead of preparing and moving data into a file system that is built for running Hadoop workloads
Supports Hortonworks Data Platform (HDP) 2.1, which includes the management tool Apache Ambari 1.6
Benefits of using Red Hat Storage Server for Hadoop analytics workloads:
• Data ingest via NFS & FUSE
• No single point of failure
• POSIX compliance
• Co-location of compute and data
• Ability to run Hadoop across multiple namespaces using multiple volumes
• Strong disaster recovery capabilities
Gluster Roadmap
• Improved small file performance with Samba
• Improved ACL support
• Further Red Hat integration
Ceph: a 1 minute history…
Ceph Architectural components
Ceph RADOS components
Ceph Object Storage Daemons
Ceph Where do objects live?
Ceph A metadata server?
Ceph Calculated placement
Ceph Even better – Crush!
Ceph Crush: Dynamic data placement
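Conceptually, placement is computed rather than looked up. A toy illustration of that idea (this is not the real CRUSH algorithm, which also weights OSDs and respects failure domains):

```python
import hashlib

def placement(obj_name, pg_count, osds, replicas=3):
    # Step 1: hash the object name into a placement group (PG).
    pg = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % pg_count
    # Step 2: map the PG to an ordered set of OSDs. Real Ceph uses CRUSH, a
    # pseudo-random, weighted, failure-domain-aware function of the cluster map;
    # this toy just uses another hash so every client computes the same answer.
    start = int(hashlib.md5(str(pg).encode()).hexdigest(), 16) % len(osds)
    return pg, [osds[(start + i) % len(osds)] for i in range(replicas)]

print(placement("vm-disk-1.chunk42", pg_count=128, osds=list(range(12))))
```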
Ceph Data is organised into pools
Ceph Accessing a RADOS cluster
Ceph LIBRADOS: RADOS access for apps
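librados ships with bindings for C, C++, Python, Java and more. A minimal Python sketch (assuming a reachable cluster, a valid /etc/ceph/ceph.conf and an existing pool, here called "mypool"):

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")               # pool name is an example
    ioctx.write_full("greeting", b"hello rados")       # store an object
    print(ioctx.read("greeting"))                      # read it back
    ioctx.set_xattr("greeting", "owner", b"syspro")    # attach a piece of metadata
    ioctx.close()
finally:
    cluster.shutdown()
```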
Ceph The RADOS gateway
Ceph RADOSGW makes RADOS webby
Ceph RBD stores virtual disks
Ceph Storing virtual disks
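A minimal sketch of creating and writing an RBD image from Python (assuming the python-rbd binding is installed and a pool named "rbd" exists; image name and size are examples):

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                    # pool name is an example

rbd.RBD().create(ioctx, "vm-disk-1", 10 * 1024**3)   # 10GiB thin-provisioned image

image = rbd.Image(ioctx, "vm-disk-1")
image.write(b"hello from the guest", 0)              # normally the hypervisor does the I/O
print(image.size())
image.close()

ioctx.close()
cluster.shutdown()
```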
Ceph Kernel module for max flexible!
Ceph Use Cases
Ceph and OpenStack
Ceph Web application storage
Ceph Multi-site object storage
Ceph Archive / cold storage
Ceph Erasure coding
Ceph Cache tiering
Ceph Cache tiering
Ceph Roadmap
• Improved small IO performance for erasure-coded pools
• Improved cache tiering performance
• Improved automated bit rot detection and healing
• Lower IO latency
• Improved performance on high-performance NVMe SSDs
• New backend object store for OSDs (replaces XFS) – 2x performance
• Quality of Service for IO operations
RADOS
Ceph Roadmap
• Global Active/Active clusters
• LDAP/AD authentication
• Access objects via NFS
• Various Swift API enhancements
RADOS Gateway (S3/Object)
Ceph Roadmap
• Async block device mirroring between two clusters
• HA iSCSI support
• Persistent client-side caching with SSDs
• Snapshot improvements
• Userspace RBD driver, which tracks Ceph development faster than the kernel driver
RBD (Block Device)
Ceph Roadmap
• CephFS (Distributed File System) production ready – community release only at this stage
• Tech Preview in Red Hat Ceph Storage 2.0
• Active/Active Metadata Server support
• Fsck tool
• Multiple namespaces per cluster
• Manila – File as a Service in OpenStack
CephFS
Unified Storage Management Console
• Developed by Red Hat to allow a single pane of management for both Gluster and Ceph
• Foreman (Puppet) and Satellite to install and configure clusters

System Professional and Open Source Storage
Hosting Methodology
Gluster for Highly Available Web Services (including PaaS)
Ceph for DR storage of IaaS workloads
present

• Cost per GB
• Scalability (with an unknown predicted growth curve)
• Flexibility (relatively hardware agnostic, allows for best-of-breed upgrade paths)
• Multi-characteristic storage requirement, e.g. bulk storage of DR/Backup VM images with the ability to run these in a high-performing mode if required
• Separated vendor risk (e.g. use different vendors)
System Professional and Open Source Storage
• We replicate all VMs in our hosting environment to a 2nd Data Centre
• We needed a large amount of bulk storage to store these replicas
• The storage needed to be Highly Available and Resilient
why Ceph?
• We needed the storage to be very dense and power efficient
• In the event of having to invoke our DR, the storage needed to be capable of providing sufficient performance
• As the above is hopefully unlikely, the storage needed to be cost effective for its role
• We had experience of Ceph through our R&D team and we were very interested in it
• Although not the easiest solution it would give us extensive knowledge into installing and running Ceph
• Attended “Ceph Days” which increased our interest
System Professional and Open Source Storage
• It was a bit of an unknown technology. Would it be flaky or lose our data?
• We also use ESXi; how well could we present Ceph block devices (RBDs) to ESXi?
• Would it require a lot of learning for our support & operations teams?
why not Ceph?
• Would it require a lot of implementation effort compared to a drop-in legacy array (Nimble, EMC, NetApp etc.)?
• What about our reputation internally in the company if it went wrong?
System Professional and Open Source Storage
• What about on balance?
• In 4U
• 48x 3.5” Disks
• 8x 2.5” Disks
• Shared, 95%-efficient PSUs
• Dual CPUs
• Onboard 10Gb-T
what we built
System Professional and Open Source Storage
• Ceph as a technology is awesome!!!
Thank you, have a safe journey home
System Professional and Open Source Storage
what have we learnt?
• Our fears around presenting block devices to ESXi were realised
• Linux iSCSI Target (LIO) doesn’t work with ESXi and RBDs
• Erasure Coding performed very poorly
• 10Gb Networking is a must
• Minimal outages and all have been caused by administrative error
• No data loss
• Overall a big success
• Recovery from failed disks is fast
• Very resilient
• Caching was severely broken (More on this later)
Outages
System Professional and Open Source Storage
• Limit Ceph so it won’t try and recover from a whole node loss. Unless you have hundreds of nodes, this will cause a bigger impact than the node going offline
• Be careful when splitting PGs; it can cause large performance dips
what have we learnt?
Cache tiering
System Professional and Open Source Storage
• Initially we turned it on and everything slowed down by a significant amount.
• Through the next couple of releases performance improved, but cached pools were still only half the speed of non-cached pools.
• System Professional submitted a patch to fix promotion logic. Performance suddenly increased tenfold.
• Other patches tweaked flushing logic and allowed large block IO’s to skip the cache.
what have we learnt?
System Professional and Open Source Storage
what have we learnt?
System Professional and Open Source Storage
Effect of CPU Frequency on Latency

CPU MHz | 4KB Write IOs | Min Latency (µs) | Avg Latency (µs)
1600    | 797           | 886              | 1250
2000    | 815           | 746              | 1222
2400    | 1161          | 630              | 857
2800    | 1227          | 549              | 812
3300    | 1320          | 482              | 755
4300    | 1548          | 437              | 644
what have we learnt
• Expand cluster
• Use Ceph Erasure Coding for storing backups
• Further development improvements to cache tiering
• Client side caching of block devices to improve sync write latency
System Professional and Open Source Storage
the future
Red Hat benefits
• Enterprise Support, Stability, Security, Reference Architecture
• Subscription/Consumption based
• Free trials
• Existing relationships with many HE/FE clients
• Vetted Partner Community
• Open Source but Vendor-backed – credibility of an established vendor
• Training & Certification programme
• Integrated Management Tools (e.g. to manage Ceph/Gluster)
Summary

• Large environments have BIG challenges
• Recognise your storage journey: past | present | future
• Analyse what you’ve got (CSA of your environment/organisation)
• Capture as many requirements as possible (not just IT related)
• Figure out growth, complexity, business drivers etc.
• Don’t ignore the disruptive technology – it’s going to happen
next steps: System Professional and Open Source Storage

• Do a PoC!
• Do a Tech workshop!
• Do a Pilot!
• Measure, Measure!
• Do a Project!

If Enterprise support will ever be required, make sure the solution is aligned to the Red Hat Reference Architecture (even if this is part of your DR plan)
next steps: System Professional and Open Source Storage
Beer & Pizza Tech Evenings – small numbers (<6)
Email [email protected]
Free beer, free pizza, free techies *
* before beer
Summary

learn | try | join | move forward

• Learn: Ceph Days – organised by Red Hat + Community (CERN, June 14th 2016)
• Try: Red Hat Trial Subscription, Open Source resources etc.
• Join: Join the community