Email storage with Ceph
TRANSCRIPT
Telekom Mail
DT's mail platform for customers
dovecot
Network-Attached Storage (NAS), NFS (sharded)
~1.3 petabyte net storage
~39 million accounts
NFS Statistics
~42% usable raw space
NFS IOPS
max: ~835,000, avg: ~390,000
relevant IO (max / avg):
WRITE: 107,700 / 50,000
READ: 65,700 / 30,900
Email Statistics
6.7 billion emails
1.2 petabyte net (after compression)
1.2 billion index/cache/metadata files
avg: 24 KiB, max: ~600 MiB
How are emails stored?
Emails are written once, read many (WORM)
Usage depends on:
protocol (IMAP vs POP3)
user frontend (mailer vs webmailer)
metadata, caches and indexes are usually kept separate
loss of metadata/indexes is critical
without attachments, emails are easy to compress
RADOS
A reliable, autonomous, distributed object store composed of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby and PHP
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
Motivation
Scale-out vs scale-up
Fast self-healing
Commodity hardware
Prevent vendor lock-in
Open Source where feasible
Reduce Total Cost of Ownership (TCO)
Where to store in Ceph?
CephFS
same issues as NFS
mail storage on POSIX layer adds complexity
no option for emails
usable for metadata/caches/indexes
[Diagram: Linux host with CephFS kernel module; metadata and data stored in the RADOS cluster (M = monitor)]
Where to store in Ceph?
RBD
needs sharding and large RBDs
needs account migration
needs RBD/fs extend scenarios
no sharing between clients
impracticable
[Diagram: VM on a hypervisor accessing the RADOS cluster via librbd]
Where to store in Ceph?
RadosGW
can store emails as objects
extra network hops
potential bottleneck
very likely not fast enough
[Diagram: application talking REST over a socket to RadosGW, which accesses the RADOS cluster via librados]
Where to store in Ceph?
Librados
direct access to RADOS
parallel I/O (see the sketch below)
not optimized for emails
how to handle metadata/caches/indexes?
[Diagram: applications linking librados to talk directly to the RADOS cluster]
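To make "direct access to RADOS" concrete, here is a minimal librados sketch (pool name, object ID and mail content are made-up examples; error handling omitted). The parallelism comes from issuing asynchronous operations and keeping many in flight:

    #include <rados/librados.hpp>
    #include <string>

    int main() {
        librados::Rados cluster;
        cluster.init("admin");                    // connect as client.admin
        cluster.conf_read_file(nullptr);          // read /etc/ceph/ceph.conf
        cluster.connect();

        librados::IoCtx io;
        cluster.ioctx_create("mail_storage", io); // pool name is an assumption

        librados::bufferlist bl;
        bl.append(std::string("Subject: test\r\n\r\nmail body"));

        // Asynchronous write; many of these can be in flight in parallel.
        librados::AioCompletion *c = librados::Rados::aio_create_completion();
        io.aio_write_full("mail-guid-0001", c, bl);
        c->wait_for_complete();                   // block only when the result is needed
        c->release();

        io.close();
        cluster.shutdown();
        return 0;
    }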
Dovecot
Open source project (LGPL 2.1, MIT)
72% market share (openemailsurvey.org, 02/2017)
Object store plugin available (obox):
supports only REST APIs like S3/Swift
not open source, requires Dovecot Pro
large impact on TCO
Dovecot Pro obox Plugin
[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot obox backend (metacache of RFC 5322 mails) → fs API → fs cache backend and object store backend; the object store holds RFC 5322 objects and index & cache bundles; the local index & cache is synced with the object store and written to local storage, alongside a mail cache on local storage]
DT's approach
no open source solution on the market
closed source is no option
develop/sponsor a solution, open source it
partner with:
Wido den Hollander (42on.com) for consulting
Tallence AG for development
SUSE for Ceph
Ceph plugin for Dovecot
First step: hybrid approach
Emails
Store in RADOS cluster
Metadata and indexes
Store in CephFS
Be as generic as possible
Split out code into libraries
Integrate into corresponding upstream projects
[Diagram: Mail User Agent → IMAP/POP → Dovecot on a Ceph client; the rbox storage plugin uses librmb/librados for mails and CephFS via the Linux kernel for metadata, both backed by the RADOS cluster]
Librados mailbox (librmb)
Generic email abstraction on top of librados
Out of scope:
User data and credential storage: targets are huge installations, where solutions are usually already in place
Full-text indexes: solutions are already available and working outside email storage
Librados mailbox (librmb)
[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot rbox backend; librmb/librados store RFC 5322 mails as objects in the RADOS cluster, while dovecot lib-index keeps index & metadata on CephFS via the Linux kernel; a cache lives on local storage]
librmb - Mail Object Format
Mails are immutable regarding the RFC 5322 content
RFC 5322 content is stored in RADOS directly
Immutable attributes used by Dovecot are stored as RADOS xattrs:
rbox format version
GUID
received and saved date
POP3 UIDL and POP3 order
mailbox GUID
physical and virtual size
mail UID
Writable attributes are stored in Dovecot index files
Dump email details from RADOS
$> rmb -p mail_storage -N t1 ls M=ad54230e65b49a59381100009c60b9f7
mailbox_count: 1
MAILBOX: M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7, mail_total=2, mails_displayed=2, mailbox_size=5539 bytes
MAIL: U(uid)=4, oid=a2d69f2868b49a596a1d00009c60b9f7, R(receive_time)=Tue Jan 14 00:18:11 2003, S(save_time)=Mon Aug 21 12:22:32 2017, Z(phy_size)=2919, V(v_size)=2919, stat_size=2919, M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7, G(mail_guid)=a3d69f2868b49a596a1d00009c60b9f7, I(rbox_version): 0.1 [..]
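In librados terms, storing a mail like the one dumped above boils down to one full-object write plus one xattr per immutable attribute. A minimal sketch, not librmb's actual code; the single-letter key mirrors the dump output:

    #include <rados/librados.hpp>
    #include <string>

    // Store one mail: the RFC 5322 body becomes the object, immutable
    // attributes become xattrs. Assumes an open IoCtx for the mail pool.
    void store_mail(librados::IoCtx& io,
                    const std::string& mail_oid,   // e.g. the mail GUID
                    const std::string& rfc5322,
                    const std::string& mailbox_guid) {
        librados::bufferlist body;
        body.append(rfc5322);
        io.write_full(mail_oid, body);             // written once, never modified

        librados::bufferlist mbox;
        mbox.append(mailbox_guid);
        io.setxattr(mail_oid, "M", mbox);          // M = mailbox GUID, as in the dump
    }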
RADOS Dictionary Plugin
Makes use of the Ceph omap key/value store
RADOS namespaces:
shared/<key>
priv/<key>
Used by Dovecot to store metadata, quota, ...
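Through librados, such a dictionary write is an omap update on a dictionary object. A sketch assuming an open IoCtx; the object name and the mapping of priv/ keys to a per-user RADOS namespace are assumptions of this illustration:

    #include <rados/librados.hpp>
    #include <map>
    #include <string>

    // Store a Dovecot dictionary entry as an omap key/value pair.
    void dict_set(librados::IoCtx& io, const std::string& ns,
                  const std::string& key, const std::string& value) {
        io.set_namespace(ns);                 // e.g. per-user namespace (assumed)

        librados::bufferlist bl;
        bl.append(value);
        std::map<std::string, librados::bufferlist> kv;
        kv[key] = bl;
        io.omap_set("dovecot_dict", kv);      // dictionary object name (assumed)
    }

A call like dict_set(io, "t1", "priv/quota/storage", "1048576") would then record a private quota value for one user.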
It's open source!
License: LGPLv2.1
Language: C++
Location: github.com/ceph-dovecot/
Supported Dovecot versions: 2.2 (>= 2.2.21) and 2.3
Ceph Requirements
Performance
write performance for emails is critical
metadata/index read/write performance
Cost
Erasure Coding (EC) for emails (see the worked example after this list)
Replication for CephFS
Reliability
MUST survive failure of disk, server, rack and even fire compartments
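Why EC for emails but replication for CephFS? A worked example with an assumed profile (the talk does not name the exact EC parameters): an EC pool with k = 4 data chunks and m = 2 coding chunks has a raw-space overhead of (k + m) / k = 6 / 4 = 1.5x and survives the loss of any two chunks, whereas three-way replication costs 3.0x raw for comparable failure tolerance. Large, immutable mail objects fit EC well; the small, frequently rewritten CephFS metadata/index files are better served by replication.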
Which Ceph Release?
Required features:
BlueStore: should be at least 2x faster than FileStore
CephFS: stable release, multi-MDS
Erasure coding
Hardware
Commodity x86_64 servers
HPE ProLiant DL380 Gen9, dual socket
Intel® Xeon® E5 v4
2x Intel® X710-DA2 dual-port 10G
2x boot SSDs, SATA
HBA, no separate RAID controller
For CephFS, RADOS, MDS and MON nodes
Storage Nodes
CephFS SSD nodes
CPU: 2x Intel Xeon E5 v4, 6 cores, turbo 3.7 GHz
RAM: 256 GByte, DDR4, ECC
SSD: 8x 1.6 TB SSD, 3 DWPD, SAS, RR/RW 125k/92k IOPS
RADOS HDD nodes
CPU: 2x Intel Xeon E5 v4, 10 cores, turbo 3.4 GHz
RAM: 128 GByte, DDR4, ECC
SSD: 2x 400 GByte, 3 DWPD, SAS, RR/RW 108k/49k IOPS (for BlueStore database etc.)
HDD: 10x 4 TByte, 7.2K, 128 MB cache, SAS
Compute Nodes
MDS
CPU: 2x Intel Xeon E5 v4, 6 cores, turbo 3.7 GHz
RAM: 256 GByte, DDR4, ECC
MON / SUSE admin
CPU: 2x Intel Xeon E5 v4, 10 cores, turbo 3.4 GHz
RAM: 64 GByte, DDR4, ECC
Why this specific HW?
Community recommendations?
OSD: 1x 64-bit AMD-64, 1 GB RAM per 1 TB of storage, 2x 1 GBit NICs
MDS: 1x 64-bit AMD-64 quad-core, 1 GB RAM minimum per MDS, 2x 1 GBit NICs
NUMA, high-clocked CPUs and large RAM overkill?
Vendor did not offer single-CPU nodes for this number of drives
MDS performance is mostly CPU clock bound and partly single-threaded
High-clocked CPUs for fast single-threaded performance
Large RAM: better caching!
Issues
Datacenter
usually two independent fire compartments (FCs)
may have additional virtual FCs
Requirements
Loss of customer data MUST be prevented
Any server, switch or rack can fail
One FC can fail
Data replicated at least 3 times (or equivalent)
Issues
Questions
How to place 3 copies in two FCs? (see the CRUSH sketch below)
How independent and reliable are the virtual FCs?
Network architecture? Network bandwidth?
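A common answer to the first question is a CRUSH rule that picks both fire compartments and then two hosts in each, producing four candidate OSDs of which the first three are used, so every FC is guaranteed at least one copy. A sketch, assuming the FCs are modeled as room buckets under the default root (names and IDs are illustrative, not DT's actual map):

    rule mail_two_fcs {
        id 1
        type replicated
        min_size 3
        max_size 4
        step take default
        # select both fire compartments (rooms)
        step choose firstn 2 type room
        # then two hosts in each room -> 4 candidate OSDs, first 3 used
        step chooseleaf firstn 2 type host
        step emit
    }

With pool size 3, one FC ends up holding two copies and the other one copy; how much safety this buys depends on how independent the virtual FC really is, which is exactly the next question.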
Fire Compartment A
[Rack diagram: switches, MON, MDS, 7 HDD nodes, 3 SSD nodes]
Fire Compartment B
[Rack diagram: switches, MON, MDS, 7 HDD nodes, 3 SSD nodes, SUSE admin node]
Fire Compartment C (third room)
[Rack diagram: switches, MON, MDS, 3 SSD nodes]
Network
10G network
2 NICs / 4 ports per node, SFP+ DAC
Multi-chassis Link Aggregation (MC-LAG / M-LAG)
for aggregation and fail-over
Spine-leaf architecture
interconnect does not have to match the theoretical rack/FC bandwidth
L2: terminated in rack
L3: TOR <-> spine / spine <-> spine, Border Gateway Protocol (BGP)
[Diagram: spine-leaf network spanning FC1, FC2 and the vFC: N x 10G SFP+ MC-LAG within each rack (L2 terminated), 40G QSFP LAG uplinks (L3, BGP) to the spine switches, 2x 40G QSFP L3 crosslink between spines, and a DC-R]
Status
Dovecot Ceph Plugin
open sourced on GitHub
still under development
still includes librmb
planned: move to the Ceph project
Testing
Functional testing
set up a small 5-node cluster
SLES 12 SP3 GMC, SES 5 Beta
run Dovecot functional tests against Ceph
Proof-of-Concept
Hardware
9 SSD nodes for CephFS
12 HDD nodes
3 MDS / 3 MON
2 FCs + 1 vFC
Testing
run load tests
run failure scenarios against Ceph
improve and tune the Ceph setup
verify and optimize hardware
Further Development
Goal: pure RADOS backend, store metadata/index in Ceph omap
[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot rbox backend → librmb/librados; RFC 5322 objects and the index are both stored in the RADOS cluster, with only a cache on local storage]
Next Steps
Production
verify that all requirements are fulfilled
integrate into production
migrate users step by step
extend to final size:
128 HDD nodes, 1,200 OSDs, 4.7 PiB
15 SSD nodes, 120 OSDs, 175 TiB
Summary and conclusions
Ceph can replace NFS:
mails in RADOS
metadata/indexes in CephFS
BlueStore, EC
librmb and dovecot rbox:
Open Source, LGPLv2.1, no license costs
librmb can be used in non-Dovecot systems
still under development
PoC with Dovecot in progress
Performance optimization
You are invited: participate!
Try it, test it, give feedback and report bugs! Contribute!
Thank you.
github.com/ceph-dovecot/
https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb