System Professional / Red Hat Software Defined Storage - Past, Present & Future
TRANSCRIPT
IT Services Portfolio - Summary
Systems Integrator of Infrastructure Technology

Support Services
• 25+ Service Desk
• ITIL Framework
• Remote Monitoring
• 50+ contracts, 2K calls

Managed Services
• IaaS
• DRaaS (on demand)
• On-line Backup

Consultancy Services
• IT Specialists
• Design - Deploy
• PRINCE 2 Framework

Storage & Virtualisation · Microsoft Solutions · Backup & Disaster Recovery
Red Hat Software Defined Storage
Past, Present and Future

Nick Fisk
Tarquin Dunn
SysPro Cloud Team

London, UK, 26th February 2016
Introductions
Tarquin Dunn, SysPro CTO
@Tarqs

Nick Fisk, SysPro Senior Cloud Architect
Today’s discussion

• Software Defined Storage – Evolution
• Industry Uptake
• Key Architecture
• Case Study – us!
• Current Challenges & Issues
• Gluster Roadmap, Use Cases / Benefits
• Ceph Roadmap, Use Cases / Benefits
• Red Hat benefits
• SysPro benefits & next steps

past | present | future
A (VERY) BRIEF HISTORY OF EVERYTHING STORAGE
FROM BLOCKS
TO OBJECTS
Raw volumes accessed by client O/S
block
• Evenly sized “blocks” of data to emulate a physical hard drive (with sectors and tracks)
• No metadata
• Directly accessed from client OS (e.g. as a mounted drive)
• Faster writes (when compared to file/object)
• Geographically sensitive (the further the app is from the block storage, the higher the latency)
Storage based on files (group of blocks)
file
• Simple to implement & use
• Clients & Systems see same information
• File metadata stored with data
• Accessed by protocol such as NFS or SMB/CIFS
object
Every object contains 3 things:
• Data (photo, doc, database etc.)
• Expandable metadata
  • Defined by the creator
  • Contextual information (what, why, status, how used)
• Globally unique identifier
  • Address and location discovery
• Manipulated as a whole unit
• Good for web, archive/storage, high read / low write
• E.g. Facebook, Spotify, Amazon S3
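To make those three parts concrete, here is a small, hedged sketch using the Amazon S3 API mentioned above (the bucket, key and metadata values are made-up examples; any S3-compatible store accepts the same call):

```python
import boto3

# Connect to S3 (or any S3-compatible store by passing endpoint_url=...)
s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-bucket",                    # flat namespace, no directory tree
    Key="photos/2016/conference.jpg",           # globally unique, addressable identifier
    Body=open("conference.jpg", "rb"),          # the data itself
    Metadata={"camera": "x100t", "event": "sds-talk"},  # expandable, creator-defined metadata
)
```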
In the beginning there were just file servers…
Files stored locally on direct-attached hard disks
Shared storage requirements appeared/increased
Centrally managed storage e.g. for HA or multiple access
…then block storage became more prevalent
The ability to make remote storage look and work like local disks
… now object storage
Big data, hyper-scale computing
Automatic provisioning
Increased element management
Evolution from DAS to NAS/SAN
Solved a challenge
• Central management of storage
• Efficiency
• Allowed shared storage
• HA etc.

Created a challenge *
• Scale out
• Cost
• Flexibility
• Multiple diverse clients

* Eventually!
Requirements have become more complex
• File share, Virtualisation, Analytics etc.
• File, object, block
• Multiple different workloads
• Virtualisation
• Self-service
• Data growth (structured & unstructured)
• Snapshots etc.
Open Source - why is this important?
Open Source =
• Internal skills
• Shape and develop products
• Innovation rather than profit
• Open Standards driven
• Big vendors – big contributors
Current Challenges (in more detail…!)
the present
Potential Problems
• RAID rebuilds with larger disks
• Scale up reaching limits
• Expensive fork-lift upgrades
• Proprietary
• More complex ways of consuming storage
What can I do today to address my current (or potentially future) issues?
Many ways of consuming storage
iSCSI · Fibre Channel · SMB/CIFS · NFS · Object S3 · Distributed File System

Is your storage agile enough to deliver?
[Chart: RAID rebuild time (days) against disk size (TB)]
RAID Rebuild Times: disk capacity increases every year, but speeds have altered little
Expect an active array rebuild rate of ≤ 10MB/s
∴ a 10TB drive at a 10MB/s rebuild rate ≈ 12 days, and well over 2 weeks at lower real-world rates!
More data + same speed = longer rebuild times
Performance is degraded and data is at risk during this period!
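As a quick sanity check on that figure, a back-of-envelope calculation (assuming decimal terabytes and a constant rebuild rate):

```python
TB = 10 ** 12           # decimal terabyte, in bytes
MB = 10 ** 6

drive_size = 10 * TB    # 10TB drive
rebuild_rate = 10 * MB  # 10MB/s sustained rebuild rate (optimistic)

days = drive_size / rebuild_rate / 86400
print(days)             # ~11.6 days at 10MB/s; at 5MB/s it is well over 3 weeks
```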
How often do disks fail?
…about 5% Annual Failure Rate
Why does that 5% matter?
Probability of data loss with 10TB drives in RAID6 over a 5-year period: 25%
(driven by 2-week rebuild times + unrecoverable read errors)
With RAID5 it is 99%... but you’re not using that with large drives… are you?
This is only going to get worse as drive capacities increase!
Don’t take our word for it – run the numbers yourself: http://wintelguy.com/raidmttdl.pl
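The linked calculator does this properly; the sketch below is only a simplified back-of-envelope model of the same idea, with illustrative inputs and the unrecoverable-read-error term left out:

```python
# Rough RAID5 data-loss estimate: data is lost if any further disk fails
# while a rebuild is already in progress. All inputs below are assumptions.
n_disks = 12            # disks in the array
afr = 0.05              # 5% annual failure rate per disk
rebuild_days = 14       # rebuild window for a large drive
years = 5               # mission time

expected_first_failures = n_disks * afr * years
p_second_failure = (n_disks - 1) * afr * rebuild_days / 365
p_loss = 1 - (1 - p_second_failure) ** expected_first_failures
print(f"~{p_loss:.0%} chance of data loss over {years} years")   # ~6% for these inputs
```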
So, can you afford to gamble with those odds?
5 out of 6 Russian Roulette players believe it’s completely safe…
Question is… well, do ya, ****?
Rebuilding RAID vs Software Defined Storage

RAID
Every remaining disk reads its entire capacity. The lost data is calculated from parity, then written to the replacement disk.
! The more disks, the longer it takes to read all the data.

SDS
Every disk reads, writes and communicates with all other disks to replace just the degraded data.
The more disks, the faster it can recover.
Upgrades and Expansion
Recognise this model?
Every 3-5 years replace entire SAN with a faster larger model
• Expensive
• Fork lift upgrade model
• Require lengthy data migration period
• Extended Vendor support conveniently equals cost of upgrading to new model?
Upgrades and Expansion
Or the other way…?
Upgrade by adding more nodes
• Choose your own hardware and support terms
• Retire old nodes as and when you see fit
• Data migration is an intrinsic part of the system
I’ve got 99 problems, but a dead server ain’t one …
Design to fail. Use cost-effective commodity hardware
Resilience is provided by the software layer
No single point of failure – Spot the difference
Software Defined Solutions
Gluster and Ceph
Gluster
Gluster Overview: scale-out file storage for petabyte-scale workloads
• Purpose-built as a scale-out file store
• Straightforward architecture suitable for public, private and hybrid cloud
• Simple to install and configure
• Minimal hardware footprint
• Offers mature NFS, SMB and HDFS interfaces for enterprise use
TARGET USE CASES

Analytics
• Machine analytics with Splunk
• Big data analytics with Hadoop

Enterprise File Sharing

Enterprise Virtualisation

Rich Media & Archival
• Media Streaming
• Active Archives
Gluster Big Data Analytics
In-place Hadoop analytics in a POSIX compatible environment
Gluster Machine Data Analytics
High-performance, scale-out, online cold storage for Splunk Enterprise
Gluster Rich Media
Massively-scalable, flexible, and cost-effective storage for image, video and audio content
Gluster Active Archives
Open-source, capacity-optimised archival storage on commodity hardware
Gluster File Sync and Share
Powerful, software-defined, scale-out, on-premises storage for file sync and share with ownCloud
Gluster Storage Concepts
Gluster Bricks
• A brick is the combination of a node and a file system: hostname:/dir
• Each brick inherits limits of the underlying file system (XFS)
• Red Hat Storage Server operates at the brick level, not at the node level
• Ideally, each brick in a cluster should be the same size
Gluster Volumes
• A volume is some number of bricks (two or more), clustered and exported with Gluster
• Volumes have administrator assigned names (= export names)
• A brick is a member of only one volume
• A global namespace can have a mix of replicated and distributed volumes
• Data in different volumes physically exists on different bricks
• Volumes can be sub-mounted on clients using NFS, CIFS and/or Glusterfs clients
• The directory structure of the volume exists on every brick in the volume
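Putting bricks and volumes together, a minimal sketch of creating a two-brick replicated volume (hostnames, brick paths and the volume name are made-up examples; the standard Gluster CLI is driven from Python here purely for illustration):

```python
import subprocess

def gluster(*args):
    # Run a Gluster CLI command and fail loudly if it returns an error.
    subprocess.run(["gluster", *args], check=True)

# Hypothetical two-node setup: one brick per node, combined into one replicated volume.
gluster("peer", "probe", "server2")
gluster("volume", "create", "vol01", "replica", "2",
        "server1:/bricks/brick1", "server2:/bricks/brick1")
gluster("volume", "start", "vol01")
```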
Gluster Elastic Hash Algorithm
• No central metadata, no performance bottleneck, eliminates risk scenarios
• Location hashed on file name; unique identifiers, similar to md5sum
• The elastic part:
  • Files assigned to virtual volumes
  • Virtual volumes assigned to multiple bricks
  • Volumes easily reassigned on the fly
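Purely to illustrate the idea (this is a toy, not GlusterFS's actual distributed hash translator):

```python
import hashlib

def brick_for(path, bricks):
    # Hash the file path; every client computes the same answer independently,
    # so no metadata server has to be asked where the file lives.
    digest = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return bricks[digest % len(bricks)]

bricks = ["server1:/bricks/b1", "server2:/bricks/b1", "server3:/bricks/b1"]
print(brick_for("/photos/cat.jpg", bricks))   # deterministic placement from the name alone
```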
Gluster Data Placement Strategies (GlusterFS volume types and their characteristics)

Distributed
• Distributes files across bricks in the volume
• Used where scaling and redundancy requirements are not important, or are provided by other hardware or software layers

Replicated
• Replicates files across bricks in the volume
• Used in environments where high availability and high reliability are critical

Distributed Replicated
• Offers improved read performance in most environments
• Used in environments where high reliability and scalability are critical
Gluster Default Data Placement (distributed volume)
Gluster Fault-tolerant data placement (distributed replicated volume)
Gluster Erasure Coding
Storing more data with less hardware
• Standard replication back-ends are very durable, can recover quickly, but have inherently large capacity overheads
• Erasure coding back-ends reconstruct corrupted or lost data by using information about the data stored elsewhere in the system
• Providing failure protection with erasure coding:
  • Eliminates the need for RAID
  • Consumes far less space than replication
  • Can be appropriate for capacity-optimised use cases
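A small worked example of the capacity argument (the 100TB figure and the 4+2 layout are illustrative assumptions):

```python
usable_tb = 100                      # usable data we want to protect (example figure)

# 3-way replication: every byte is stored three times.
replica_raw = usable_tb * 3          # 300 TB raw, tolerates loss of 2 copies

# Erasure coding, e.g. a 4+2 layout: 4 data fragments + 2 parity fragments.
k, m = 4, 2
ec_raw = usable_tb * (k + m) / k     # 150 TB raw, still tolerates loss of any 2 bricks

print(replica_raw, ec_raw)           # 300 vs 150 – half the hardware for the same protection
```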
Gluster Tiering
Cost-effective flash acceleration
• Optimally:
  • Frequently accessed data can be served from faster, more expensive systems
  • Infrequently accessed data is served from less expensive storage systems
• Manually moving data between storage tiers can be time-consuming and expensive
• Gluster supports automated promotion and demotion of data between ‘hot’ and ‘cold’ sub-volumes
Gluster Bit Rot Detection
Detection of silent data corruption
• A mechanism that detects data corruption resulting from silent hardware failures, leading to deterioration in performance and integrity
• Gluster provides a mechanism to scan data periodically and detect bit-rot
• Using SHA256 algorithm:
• Checksums are computed when files are accessed
• Compared against previously stored values
• Unmatched value logged as error for storage admin
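The mechanism is easy to picture. A toy sketch of the same checksum-and-compare idea (not Gluster's implementation; paths are made up, and a real scrubber only re-checksums files that have not been legitimately modified since the last signature):

```python
import hashlib, json, os

DB = "checksums.json"   # hypothetical store of previously computed signatures

def sha256(path):
    # Stream the file so large files don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(files):
    db = json.load(open(DB)) if os.path.exists(DB) else {}
    for path in files:
        current = sha256(path)
        if path in db and db[path] != current:
            print(f"bit-rot suspected: {path}")   # log for the storage admin
        db[path] = current
    json.dump(db, open(DB, "w"))

scrub(["/bricks/brick1/photos/cat.jpg"])
```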
Gluster Multi-protocol access
GlusterFS Native Client (FUSE)
• Based on FUSE kernel module, which allows the filesystem to operate entirely in userspace
• Specify mount to any GlusterFS server
• Recommended for high concurrency and high write performance
• Load is inherently balanced across distributed volumes
• Native Client fetches volfile from mount server, then communicates directly with all nodes to access data
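A minimal example of such a mount (server and volume names are hypothetical; in practice this usually lives in /etc/fstab):

```python
import subprocess

# Mount a GlusterFS volume with the native FUSE client. "server1" only supplies
# the volume file; subsequent I/O goes directly to all bricks in the volume.
subprocess.run(["mount", "-t", "glusterfs", "server1:/vol01", "/mnt/vol01"], check=True)
```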
Gluster NFS
• Standard NFS v3 clients connect to GlusterFS NFS server process (user space) on storage node
• Mount GlusterFS volume from any storage node
• Better performance for reading many small files from a single client
• Load balancing must be managed externally
• GlusterFS NFS server includes network lock manager (NLM) to synchronize locks across clients
• Standard automounter is supported
Gluster SMB/CIFS
• Storage node uses Samba with winbind to connect with Active Directory environments
• Samba uses the libgfapi library to communicate directly with the GlusterFS server process without going through FUSE
• SMB version 2.0 supported
• Load balancing must be managed externally
• SMB clients can connect to any storage node running Samba
• CTDB is required for Samba clustering
Gluster Object access of GlusterFS volume
• Built upon OpenStack’s Swift object storage
• GlusterFS is the back-end file system for Swift
• Accounts are implemented as GlusterFS volumes
• Store and retrieve files using the REST interface
• Implements objects as files and directories under the container
• Supports integration with SWAuth and the Keystone authentication service
Gluster Hadoop plug-in for HDFS access
Red Hat Storage Server now offers a Hadoop file system plug-in
• Benefit: run in-place analytics on data stored in a Red Hat Storage Server without the overhead of preparing and moving data into a file system that is built for running Hadoop workloads
Supports Hortonworks Data Platform (HDP) 2.1, which includes the management tool Apache Ambari 1.6
Benefits of using Red Hat Storage Server for Hadoop analytics workloads:
• Data ingest via NFS & FUSE
• No single point of failure
• POSIX compliance
• Co-location of compute and data
• Ability to run Hadoop across multiple namespaces using multiple volumes
• Strong disaster recovery capabilities
Gluster Roadmap
• Improved small file performance with Samba
• Improved ACL support
• Further Red Hat integration
Ceph: a 1 minute history…
Ceph Architectural components
Ceph RADOS components
Ceph Object Storage Daemons
Ceph Where do objects live?
Ceph A metadata server?
Ceph Calculated placement
Ceph Even better – Crush!
Ceph Crush: Dynamic data placement
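Conceptually, placement is computed rather than looked up. A toy illustration of that idea (this is not the real CRUSH algorithm, which also weights OSDs and respects failure domains):

```python
import hashlib

def placement(obj_name, pg_count, osds, replicas=3):
    # Step 1: hash the object name into a placement group (PG).
    pg = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % pg_count
    # Step 2: map the PG to an ordered set of OSDs. Real Ceph uses CRUSH, a
    # pseudo-random, weighted, failure-domain-aware function of the cluster map;
    # this toy just uses another hash so every client computes the same answer.
    start = int(hashlib.md5(str(pg).encode()).hexdigest(), 16) % len(osds)
    return pg, [osds[(start + i) % len(osds)] for i in range(replicas)]

print(placement("vm-disk-1.chunk42", pg_count=128, osds=list(range(12))))
```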
Ceph Data is organised into pools
Ceph Accessing a RADOS cluster
Ceph LIBRADOS: RADOS access for apps
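librados ships with bindings for C, C++, Python, Java and more. A minimal Python sketch (assuming a reachable cluster, a valid /etc/ceph/ceph.conf and an existing pool, here called "mypool"):

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")               # pool name is an example
    ioctx.write_full("greeting", b"hello rados")       # store an object
    print(ioctx.read("greeting"))                      # read it back
    ioctx.set_xattr("greeting", "owner", b"syspro")    # attach a piece of metadata
    ioctx.close()
finally:
    cluster.shutdown()
```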
Ceph The RADOS gateway
Ceph RADOSGW makes RADOS webby
Ceph RBD stores virtual disks
Ceph Storing virtual disks
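A minimal sketch of creating and writing an RBD image from Python (assuming the python-rbd binding is installed and a pool named "rbd" exists; image name and size are examples):

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                    # pool name is an example

rbd.RBD().create(ioctx, "vm-disk-1", 10 * 1024**3)   # 10GiB thin-provisioned image

image = rbd.Image(ioctx, "vm-disk-1")
image.write(b"hello from the guest", 0)              # normally the hypervisor does the I/O
print(image.size())
image.close()

ioctx.close()
cluster.shutdown()
```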
Ceph Kernel module for max flexible!
Ceph Use Cases
Ceph and OpenStack
Ceph Web application storage
Ceph Multi-site object storage
Ceph Archive / cold storage
Ceph Erasure coding
Ceph Cache tiering
Ceph Cache tiering
Ceph Roadmap
• Improved small IO performance for erasure-coded pools
• Improved cache tiering performance
• Improved automated bit rot detection and healing
• Lower IO latency
• Improved performance on high-performance NVMe SSDs
• New backend object store for OSDs (replaces XFS) – 2x performance
• Quality of Service for IO operations
RADOS
Ceph Roadmap
• Global Active/Active clusters
• LDAP/AD authentication
• Access objects via NFS
• Various Swift API enhancements
RADOS Gateway (S3/Object)
Ceph Roadmap
• Async block device mirroring between two clusters
• HA iSCSI support
• Persistent client-side caching with SSDs
• Snapshot improvements
• Userspace RBD driver, which tracks Ceph development faster than the kernel driver
RBD (Block Device)
Ceph Roadmap
• CephFS (Distributed File System) production ready – community release only at this stage
• Tech Preview in Red Hat Ceph Storage 2.0
• Active/Active Metadata Server support
• Fsck tool
• Multiple namespaces per cluster
• Manila – File as a Service in OpenStack
CephFS
Unified Storage Management Console
• Developed by Red Hat to allow a single pane of management for both Gluster and Ceph
• Foreman (Puppet) and Satellite to install and configure clusters

System Professional and Open Source Storage
Hosting Methodology
Gluster for Highly Available Web Services (including PaaS)
Ceph for DR storage of IaaS workloads
present

• Cost per GB
• Scalability (with an unknown predicted growth curve)
• Flexibility (relatively hardware agnostic, allows for best-of-breed upgrade paths)
• Multi-characteristic storage requirement, e.g. bulk storage of DR/Backup VM images with the ability to run these in a high-performing mode if required
• Separated vendor risk (e.g. use different vendors)
System Professional and Open Source Storage
• We replicate all VMs in our hosting environment to a 2nd Data Centre
• We needed a large amount of bulk storage to store these replicas
• The storage needed to be Highly Available and Resilient
why Ceph?
• We needed the storage to be very dense and power efficient
• In the event of having to invoke our DR, the storage needed to be capable of providing sufficient performance
• As the above is hopefully unlikely, the storage needed to be cost effective for its role
• We had experience of Ceph through our R&D team and we were very interested in it
• Although not the easiest solution it would give us extensive knowledge into installing and running Ceph
• Attended “Ceph Days” which increased our interest
System Professional and Open Source Storage
• It was a bit of an unknown technology. Would it be flaky or lose our data?
• We also use ESXi; how well could we present Ceph block devices (RBDs) to ESXi?
• Would it require a lot of learning for our support & operations teams?
why not Ceph?
• Would it require a lot of implementation effort compared to a drop-in legacy array (Nimble, EMC, NetApp etc.)?
• What about our reputation internally in the company if it went wrong?
System Professional and Open Source Storage
• What about on balance?
• In 4U
• 48x 3.5” Disks
• 8x 2.5” Disks
• Shared, 95%-efficient PSUs
• Dual CPUs
• Onboard 10Gb-T
what we built
System Professional and Open Source Storage
• Ceph as a technology is awesome!!!
Thank you, have a safe journey home
System Professional and Open Source Storage
what have we learnt?
• Our fears around presenting block devices to ESXi were realised
• Linux iSCSI Target (LIO) doesn’t work with ESXi and RBDs
• Erasure Coding performed very poorly
• 10Gb Networking is a must
• Minimal outages and all have been caused by administrative error
• No data loss
• Overall a big success
• Recovery from failed disks is fast
• Very resilient
• Caching was severely broken (More on this later)
Outages
System Professional and Open Source Storage
• Limit Ceph so it won’t try and recover from a whole node loss. Unless you have hundreds of nodes, this will cause a bigger impact than the node going offline
• Be careful when splitting PGs; it can cause large performance dips
what have we learnt?
Cache tiering
System Professional and Open Source Storage
• Initially we turned it on and everything slowed down by a significant amount.
• Through the next couple of releases performance improved, but cached pools were still only half the speed of non-cached pools.
• System Professional submitted a patch to fix promotion logic. Performance suddenly increased tenfold.
• Other patches tweaked flushing logic and allowed large block IO’s to skip the cache.
what have we learnt?
System Professional and Open Source Storage
what have we learnt?
System Professional and Open Source Storage
Effect of CPU Frequency on Latency

CPU MHz | 4KB Write IOs | Min Latency (µs) | Avg Latency (µs)
1600    | 797           | 886              | 1250
2000    | 815           | 746              | 1222
2400    | 1161          | 630              | 857
2800    | 1227          | 549              | 812
3300    | 1320          | 482              | 755
4300    | 1548          | 437              | 644
what have we learnt
• Expand cluster
• Use Ceph Erasure Coding for storing backups
• Further development improvements to cache tiering
• Client side caching of block devices to improve sync write latency
System Professional and Open Source Storage
the future
Red Hat benefits
• Enterprise Support, Stability, Security, Reference Architecture
• Subscription/Consumption based
• Free trials
• Existing relationships with many HE/FE clients
• Vetted Partner Community
• Open Source but Vendor-backed – credibility of an established vendor
• Training & Certification programme
• Integrated Management Tools (e.g. to manage Ceph/Gluster)
Summary

• Large environments have BIG challenges
• Recognise your storage journey: past | present | future
• Analyse what you’ve got (CSA of your environment/organisation)
• Capture as many requirements as possible (not just IT related)
• Figure out growth, complexity, business drivers etc.
• Don’t ignore the disruptive technology – it’s going to happen
next steps: System Professional and Open Source Storage

• Do a PoC!
• Do a Tech workshop!
• Do a Pilot!
• Measure, Measure!
• Do a Project!

If Enterprise support will ever be required, make sure the solution is aligned to the Red Hat Reference Architecture (even if this is part of your DR plan)
next steps: System Professional and Open Source Storage
Beer & Pizza Tech Evenings – small numbers (<6)
Email [email protected]
Free beer, free pizza, free techies *
* before beer
Summary

learn | try | join | move forward

• Learn: Ceph Days – organised by Red Hat + Community (CERN, June 14th 2016)
• Try: Red Hat Trial Subscription, Open Source resources etc.
• Join: Join the community