Building AuroraObjects - Ceph Day Frankfurt
DESCRIPTION
Wido den Hollander, 42on.com
TRANSCRIPT
Building AuroraObjects
Who am I?
● Wido den Hollander (1986)
● Co-owner and CTO of PCextreme B.V., a Dutch hosting company
● Ceph trainer and consultant at 42on B.V.
● Part of the Ceph community since late 2009
– Wrote the Apache CloudStack integration
– libvirt RBD storage pool support
– PHP and Java bindings for librados
PCextreme?
● Founded in 2004
● Medium-sized ISP in the Netherlands
● 45,000 customers
● Started as a shared hosting company
● Datacenter in Amsterdam
What is AuroraObjects?
● Under the name "Aurora" my hosting company PCextreme B.V. has two services:
  – AuroraCompute, a CloudStack-based public cloud backed by Ceph's RBD
  – AuroraObjects, a public object store using Ceph's RADOS Gateway
● AuroraObjects is a public RADOS Gateway service (S3 only) running in production
The RADOS Gateway (RGW)
● Serves objects using either Amazon's S3 or OpenStack's Swift protocol
● All objects are stored in RADOS; the gateway is just an abstraction between HTTP/S3 and RADOS
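Because RGW speaks plain S3 over HTTP, any stock S3 client can talk to it; only the endpoint changes. A minimal sketch with a made-up endpoint and bucket (both hypothetical):

# Anonymous GET of a public-read object; no S3 signature involved,
# so a request like this is exactly what Varnish can cache later on.
curl http://objects.example.com/mybucket/hello.txt

For signed requests an S3 client such as s3cmd only needs its endpoint changed (host_base/host_bucket in ~/.s3cfg).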
The RADOS Gateway
Our ideas
● We wanted to cache frequently accessed objects using Varnish
  – Only possible with anonymous clients
● SSL should be supported
● Storage shared between the Compute and Objects services
● 3x replication
Varnish
● A caching reverse HTTP proxy
  – Very fast
    ● Up to 100k requests/s
  – Configurable using the Varnish Configuration Language (VCL); see the sketch below
  – Used by Facebook and eBay
● Not a part of Ceph, but can be used with the RADOS Gateway
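Since VCL is what enforces the "anonymous clients only" rule from the previous slide, here is a minimal Varnish 3 style sketch of that idea; an illustration, not the production config:

sub vcl_recv {
    # Signed S3 requests carry an Authorization header and must
    # always reach RGW directly; never serve them from cache.
    if (req.http.Authorization) {
        return (pass);
    }
    # Only anonymous GET/HEAD requests are cacheable.
    if (req.request != "GET" && req.request != "HEAD") {
        return (pass);
    }
    return (lookup);
}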
The Gateways
● SuperMicro 1U
  – AMD Opteron 6200 series CPU
  – 128GB RAM
● 20Gbit LACP trunk
● 4 nodes
● Varnish runs locally with RGW on each node
  – Uses the RAM to cache objects (see the startup sketch below)
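As a rough sketch of "Varnish runs locally and caches in RAM": the listen port, cache size and local RGW backend address below are assumptions for illustration, not the production flags.

# Serve HTTP on port 80, keep the object cache in RAM (malloc storage),
# and forward cache misses to the RGW instance on the same node.
varnishd -a :80 -s malloc,96G -b 127.0.0.1:8080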
The Ceph cluster
● SuperMicro 2U chassis
  – AMD Opteron 4334 CPU
  – 32GB RAM
  – Intel S3500 80GB SSD for OS
  – Intel S3700 200GB SSD for journaling
  – 6x Seagate 3TB 7200RPM drives for OSDs
● 2Gbit LACP trunk
● 18 nodes
● ~320TB of raw storage
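As a sanity check on that number: 18 nodes × 6 drives × 3TB = 324TB of raw capacity, matching the quoted ~320TB; with 3x replication that works out to roughly 108TB of usable storage.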
Our problems
● When we cache objects in Varnish, they don't show up in the usage accounting of the RGW
  – The HTTP request never reaches RGW
● When an object changes we have to purge all caches to maintain cache consistency
  – A user might change an ACL or modify an object with a PUT request
● We wanted to make cached requests cheaper than non-cached requests
Our solution: Logstash
● All requests go from Varnish into Logstash and into ElasticSearch
  – From ElasticSearch we do the usage accounting
● When Logstash sees a PUT or DELETE request it makes a local request, which sends out a multicast to all other RGW nodes to purge that specific object (see the purge sketch below)
● We also store bucket storage usage in ElasticSearch so we have an average over the month
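The multicast listener itself is custom tooling, but the per-node invalidation it triggers can follow the stock Varnish 3 purge pattern; a sketch, assuming purges arrive as HTTP PURGE requests from localhost:

acl purgers {
    "127.0.0.1";
}

sub vcl_recv {
    if (req.request == "PURGE") {
        # Only the local purge tooling may invalidate objects.
        if (!client.ip ~ purgers) {
            error 405 "Not allowed";
        }
        return (lookup);
    }
}

sub vcl_hit {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged";
    }
}

sub vcl_miss {
    if (req.request == "PURGE") {
        purge;
        error 404 "Not in cache";
    }
}

# e.g.: curl -X PURGE http://127.0.0.1/mybucket/myobject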
LogStash and ElasticSearch
● varnishncsa → logstash → redis → elasticsearch

input {
  pipe {
    command => "/usr/local/bin/varnishncsa.logstash"
    type => "http"
  }
}

● And we simply execute varnishncsa:

varnishncsa -F '%{VCL_Log:client}x %{VCL_Log:proto}x %{VCL_Log:authorization}x %{Bucket}o %m %{Host}i %U %b %s %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x'
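For illustration, a log line produced by that format string could look as follows (all values made up); the fields are client IP, protocol, authorization, bucket, method, host, URL, bytes sent, status, time to first byte, and cache hit/miss:

192.0.2.10 http - ceph GET ceph.objects.example.com /logo.png 51234 200 0.000087 hit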
%{Bucket}o?
● With %{<header>}o you can display the value of the response header <header>:
  – %{Server}o: Apache 2
  – %{Content-Type}o: text/html
● We patched RGW (the patch is in master) so that it can optionally return the bucket name in the response:

200 OK
Connection: close
Date: Tue, 25 Feb 2014 14:42:31 GMT
Server: AuroraObjects
Content-Length: 1412
Content-Type: application/xml
Bucket: "ceph"
X-Cache-Hit: No
● Setting 'rgw expose bucket = true' in ceph.conf enables the Bucket header; a config sketch follows below
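A sketch of the corresponding ceph.conf fragment; the client section name is an assumption and depends on how the gateway instance is named:

[client.radosgw.gateway]
    rgw expose bucket = true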
Usage accounting
● We only query RGW for storage usage and also store that in ElasticSearch
● ElasticSearch is used for all traffic accounting
  – Allows us to differentiate between cached and non-cached traffic (see the query sketch below)
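A hypothetical ElasticSearch query for that differentiation, summing bytes per cache result; the hitmiss and bytes field names follow the varnishncsa format shown earlier and are assumptions:

curl -s 'http://localhost:9200/logstash-*/_search' -d '{
  "size": 0,
  "aggs": {
    "by_cache": {
      "terms": { "field": "hitmiss" },
      "aggs": { "traffic": { "sum": { "field": "bytes" } } }
    }
  }
}'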
Back to Ceph: CRUSHMap
● A good CRUSHMap design should reflect the physical topology of your Ceph cluster
  – All machines have a single power supply
  – The datacenter has an A and a B power circuit
    ● We use an STS (Static Transfer Switch) to create a third power circuit
● With CRUSH we store each replica on a different power circuit
  – When a circuit fails, we lose only 1/3 of the Ceph cluster and each object still has two replicas online
  – Each power circuit has its own switching/network
The CRUSHMap

type 7 powerfeed

host ceph03 {
alg straw
hash 0
item osd.12 weight 1.000
item osd.13 weight 1.000
..
}
powerfeed powerfeed-a {
alg straw
hash 0
item ceph03 weight 6.000
item ceph04 weight 6.000
..
}
root ams02 {
alg straw
hash 0
item powerfeed-a
item powerfeed-b
item powerfeed-c
}
rule powerfeed {
ruleset 4
type replicated
min_size 1
max_size 3
step take ams02
step chooseleaf firstn 0 type powerfeed
step emit
}
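Compiling and testing this map is shown on the next slides; activating it on a running cluster is then a matter of injecting the compiled map, along these lines:

# compile the text CRUSHMap and load it into the cluster
crushtool -c ceph.zone01.ams02.crushmap.txt -o /tmp/crushmap
ceph osd setcrushmap -i /tmp/crushmap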
The CRUSHMap
Testing the CRUSHMap
● With crushtool you can test your CRUSHMap
● $ crushtool -c ceph.zone01.ams02.crushmap.txt -o /tmp/crushmap
● $ crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 --show-statistics
● This shows you the result of the CRUSHMap:

rule 4 (powerfeed), x = 0..1023, numrep = 3..3
CRUSH rule 4 x 0 [36,68,18]
CRUSH rule 4 x 1 [21,52,67]
..
CRUSH rule 4 x 1023 [30,41,68]
rule 4 (powerfeed) num_rep 3 result size == 3: 1024/1024
● Manually verify those locations are correct; 'result size == 3: 1024/1024' means every one of the 1024 sampled inputs was mapped to exactly three OSDs
A summary
● We cache anonymously accessed objects with Varnish
  – Allows us to process thousands of requests per second
– Saves us I/O on the OSDs
● We use LogStash and ElasticSearch to store all requests and do usage accounting
● With CRUSH we store each replica on a different power circuit
Resources
● LogStash: http://www.logstash.net/
● ElasticSearch: http://www.elasticsearch.net/
● Varnish: http://www.varnish-cache.org/
● CRUSH: http://ceph.com/docs/master/
● E-Mail: [email protected]
● Twitter: @widodh