SCALABLE DATA STORAGE GETTING YOU DOWN? TO THE CLOUD!
Web 2.0 Expo SF 2011. Mike Malone, Mike Panchenko, Derek Smith, Paul Lathrop.



DESCRIPTION

This was a three hour workshop given at the 2011 Web 2.0 Expo in San Francisco. Due to the length of the presentation and the number of presenters, portions of the slide deck may appear disjoint without the accompanying narrative. Abstract: "The hype cycle is at a high for cloud computing, distributed “NoSQL” data storage, and high availability map-reducing eventually consistent distributed data processing frameworks everywhere. Back in the real world we know that these technologies aren’t a cure-all. But they’re not worthless, either. We’ll take a look behind the curtains and share some of our experiences working with these systems in production at SimpleGeo. Our stack consists of Cassandra, HBase, Hadoop, Flume, node.js, rabbitmq, and Puppet. All running on Amazon EC2. Tying these technologies together has been a challenge, but the result really is worth the work. The rotten truth is that our ops guys still wake up in the middle of the night sometimes, and our engineers face new and novel challenges. Let us share what’s keeping us busy—the folks working in the wee hours of the morning—in the hopes that you won’t have to do so yourself."

TRANSCRIPT

1. SCALABLE DATA STORAGE GETTING YOU DOWN? TO THE CLOUD! Web 2.0 Expo SF 2011. Mike Malone, Mike Panchenko, Derek Smith, Paul Lathrop.

2. THE CAST: Mike Malone, Infrastructure Engineer, @mjmalone. Mike Panchenko, Infrastructure Engineer, @mihasya. Derek Smith, Infrastructure Engineer, @dsmitts. Paul Lathrop, Operations, @greytalyn.
3. SIMPLEGEO: We originally began as a mobile gaming startup, but quickly discovered that the location services and infrastructure needed to support our ideas didn't exist. So we took matters into our own hands and began building it ourselves. Matt Galligan, CEO & co-founder; Joe Stump, CTO & co-founder.
4. THE STACK [architecture diagram: www, AWS auth/proxy, AWS RDS, ELB, HTTP, data centers, API servers, record storage, index storage, reads, writes, queues, geocoder, reverse geocoder, GeoIP, pushpin, Apache Cassandra]
5. BUT WHY NOT POSTGIS?
6. DATABASES: WHAT ARE THEY GOOD FOR? DATA STORAGE: durably persist system state. CONSTRAINT MANAGEMENT: enforce data integrity constraints. EFFICIENT ACCESS: organize data and implement access methods for efficient retrieval and summarization.
7. DATA INDEPENDENCE: Data independence shields clients from the details of the storage system and data structure. LOGICAL DATA INDEPENDENCE: clients that operate on a subset of the attributes in a data set should not be affected later when new attributes are added. PHYSICAL DATA INDEPENDENCE: clients that interact with a logical schema remain the same despite physical data structure changes like file organization, compression, and indexing strategy.
8. TRANSACTIONAL RELATIONAL DATABASE SYSTEMS: HIGH DEGREE OF DATA INDEPENDENCE. Logical structure: SQL Data Definition Language. Physical structure: managed by the DBMS. OTHER GOODIES: they're theoretically pure, well understood, and mostly standardized behind a relatively clean abstraction; they provide robust contracts that make it easy to reason about the structure and nature of the data they contain; they're ubiquitous, battle hardened, robust, durable, etc.
9. ACID: These terms are not formally defined; they're a framework, not mathematical axioms. ATOMICITY: either all of a transaction's actions are visible to another transaction, or none are. CONSISTENCY: application-specific constraints must be met for the transaction to succeed. ISOLATION: two concurrent transactions will not see one another's transactions while in flight. DURABILITY: the updates made to the database in a committed transaction will be visible to future transactions.
10. ACID HELPS: ACID is a sort-of-formal contract that makes it easy to reason about your data, and that's good. IT DOES SOMETHING HARD FOR YOU: with ACID, you're guaranteed to maintain a persistent global state as long as you've defined proper constraints and your logical transactions result in a valid system state.
11. CAP THEOREM: At PODC 2000, Eric Brewer told us there were three desirable DB characteristics, but we can only have two. CONSISTENCY: every node in the system contains the same data (e.g., replicas are never out of date). AVAILABILITY: every request to a non-failing node in the system returns a response. PARTITION TOLERANCE: system properties (consistency and/or availability) hold even when the system is partitioned and data is lost.
12-18. CAP THEOREM IN 30 SECONDS [sequence diagrams: a client writes to a server, the server replicates to a replica, the replica acks, and the server accepts the write; when replication fails, the server must either refuse the write (UNAVAILABLE!) or accept it anyway (INCONSISTENT!)]
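The choice in the slide 12-18 diagrams can be written down in a few lines. The following is a minimal sketch of my own (the Replica/Server classes and the prefer_availability flag are illustrative, not SimpleGeo code): when replication fails, the server either refuses the write and stays consistent but unavailable, or acknowledges it locally and stays available but inconsistent.

    # Hypothetical sketch of the write / replicate / ack flow and the failure choice.
    class Replica:
        def __init__(self):
            self.data = {}
            self.up = True

        def replicate(self, key, value):
            if not self.up:
                raise ConnectionError("replica unreachable")
            self.data[key] = value          # replica applies the write and acks
            return "ack"

    class Server:
        def __init__(self, replica, prefer_availability):
            self.data = {}
            self.replica = replica
            self.prefer_availability = prefer_availability

        def write(self, key, value):
            try:
                self.replica.replicate(key, value)           # replicate and wait for the ack
            except ConnectionError:
                if not self.prefer_availability:
                    return "FAIL: unavailable"               # refuse the write: consistent, but unavailable
                # fall through: accept locally, the replica is now stale (inconsistent)
            self.data[key] = value
            return "accept"

    replica = Replica()
    replica.up = False                                       # simulate the failed replication
    print(Server(replica, prefer_availability=False).write("alice", 23))   # FAIL: unavailable
    print(Server(replica, prefer_availability=True).write("alice", 23))    # accept, but inconsistent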
19. ACID HURTS: Certain aspects of ACID encourage (require?) implementors to do bad things. "Unfortunately, ANSI SQL's definition of isolation... relies in subtle ways on an assumption that a locking scheme is used for concurrency control, as opposed to an optimistic or multi-version concurrency scheme. This implies that the proposed semantics are ill-defined." Joseph M. Hellerstein and Michael Stonebraker, Anatomy of a Database System.
20. BALANCE: IT'S A QUESTION OF VALUES. For traditional databases, CAP consistency is the holy grail: it's maximized at the expense of availability and partition tolerance. At scale, failures happen: when you're doing something a million times a second, a one-in-a-million failure happens every second. We're witnessing the birth of a new religion... CAP consistency is a luxury that must be sacrificed at scale in order to maintain availability when faced with failures.
21. NETWORK INDEPENDENCE: A distributed system must also manage the network; if it doesn't, the client has to. CLIENT APPLICATIONS ARE LEFT TO HANDLE: partitioning data across multiple machines; working with loosely defined replication semantics; detecting, routing around, and correcting network and hardware failures.
22. WHAT'S WRONG WITH MYSQL..? TRADITIONAL RELATIONAL DATABASES: they are from an era (er, one of the eras) when Big Iron was the answer to scaling up; in general, the network was not considered part of the system. NEXT GENERATION DATABASES: deconstructing and decoupling the beast; trying to create a loosely coupled structured storage system; something that the current generation of database systems never quite accomplished.
23. UNDERSTANDING CASSANDRA
24. APACHE CASSANDRA: A DISTRIBUTED STRUCTURED STORAGE SYSTEM EMPHASIZING extremely large data sets, high transaction volumes, and high value data that necessitates high availability. TO USE CASSANDRA EFFECTIVELY IT HELPS TO UNDERSTAND WHAT'S GOING ON BEHIND THE SCENES.
25. APACHE CASSANDRA: A DISTRIBUTED HASH TABLE WITH SOME TRICKS. Peer-to-peer architecture with no distinguished nodes, and therefore no single points of failure. Gossip-based cluster management. A generic distributed data placement strategy maps data to nodes. Pluggable partitioning. Pluggable replication strategy. Quorum-based consistency, tunable on a per-request basis. Keys map to sparse, multi-dimensional sorted maps. Append-only commit log and SSTables for efficient disk utilization.
26. NETWORK MODEL: DYNAMO INSPIRED. CONSISTENT HASHING: a simple random partitioning mechanism for distribution; low-fuss online rebalancing when operational requirements change. GOSSIP PROTOCOL: simple decentralized cluster configuration and fault detection; the core protocol for determining cluster membership and providing resilience to partial system failure.
27. CONSISTENT HASHING: Improves on the shortcomings of modulo-based hashing [diagram: hash(alice) % 3 => 23 % 3 => 2, with nodes 1, 2, 3].
28. CONSISTENT HASHING: With modulo hashing, a change in the number of nodes reshuffles the entire data set [diagram: hash(alice) % 4 => 23 % 4 => 3, with nodes 1, 2, 3, 4].
29. CONSISTENT HASHING: Instead, the range of the hash function is mapped to a ring, with each node responsible for a segment [diagram: hash(alice) => 23 on a ring with nodes at 0, 42, and 84].
30. CONSISTENT HASHING: When nodes are added (or removed), most of the data mappings remain the same [diagram: hash(alice) => 23 on a ring with nodes at 0, 42, 64, and 84].
31. CONSISTENT HASHING: Rebalancing the ring requires a minimal amount of data shuffling [diagram: hash(alice) => 23 on a ring with nodes at 0, 32, 64, and 96] (see the sketch below).
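As a companion to slides 27-31, here is a small self-contained sketch (mine, not from the deck; it uses a 0-99 ring and MD5 only to mimic the hash(alice) => 23 example) showing that modulo hashing remaps most keys when a node is added, while the ring only remaps keys that fall in the new node's segment.

    import hashlib
    from bisect import bisect

    def h(key):
        # Map a key onto a 0..99 ring, like the slide's hash(alice) => 23.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

    def modulo_node(key, n_nodes):
        return h(key) % n_nodes

    def ring_node(key, ring):
        # Each node owns the segment of the ring up to its position; wrap past the last node.
        positions = sorted(ring)
        return positions[bisect(positions, h(key)) % len(positions)]

    keys = ["alice", "bob", "carol", "dave", "eve"]

    # Modulo hashing: going from 3 to 4 nodes usually moves most keys.
    moved = sum(modulo_node(k, 3) != modulo_node(k, 4) for k in keys)
    print("modulo: moved", moved, "of", len(keys))

    # Consistent hashing: adding a node at 64 to the ring {0, 42, 84} only moves keys in (42, 64].
    before, after = [0, 42, 84], [0, 42, 64, 84]
    moved = sum(ring_node(k, before) != ring_node(k, after) for k in keys)
    print("ring:   moved", moved, "of", len(keys))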
32. GOSSIP: DISSEMINATES CLUSTER MEMBERSHIP AND RELATED CONTROL STATE. Gossip is initiated by an interval timer. At each gossip tick a node will randomly select a live node in the cluster, sending it a gossip message, and attempt to contact cluster members that were previously marked as down. If the gossip message is unacknowledged for some period of time (statistically adjusted based on the inter-arrival time of previous messages), the remote node is marked as down.
33. REPLICATION: The REPLICATION FACTOR determines how many copies of each piece of data are created in the system [diagram: RF=3, hash(alice) => 23 replicated around a ring with nodes at 0, 32, 64, and 96].
34. CONSISTENCY MODEL: DYNAMO INSPIRED. QUORUM-BASED CONSISTENCY: W+R>N [diagram: with W=2 and R=2, the write and read quorums for hash(alice) => 23 overlap, so the read is consistent] (see the sketch below).
35. TUNABLE CONSISTENCY, WRITES: ZERO: don't bother waiting for a response. ANY: wait for some node (not necessarily a replica) to respond. ONE: wait for one replica to respond. QUORUM: wait for a quorum (N/2+1) to respond. ALL: wait for all N replicas to respond.
36. TUNABLE CONSISTENCY, READS: ONE: wait for one replica to respond. QUORUM: wait for a quorum (N/2+1) to respond. ALL: wait for all N replicas to respond.
37. CONSISTENCY MODEL: DYNAMO INSPIRED. READ REPAIR, HINTED HANDOFF, ANTI-ENTROPY [diagram: W=2, the write to one replica fails].
38. CONSISTENCY MODEL: READ REPAIR asynchronously checks replicas during reads and repairs any inconsistencies [diagram: W=2, write, then read + fix].
39. CONSISTENCY MODEL: HINTED HANDOFF sends failed writes to another node with a hint to re-replicate when the failed node returns [diagram: write, replica].
40. CONSISTENCY MODEL: HINTED HANDOFF, continued [diagram: check and repair once the failed node returns].
41. CONSISTENCY MODEL: ANTI-ENTROPY is a manual repair process where nodes generate Merkle trees (hash trees) to detect and repair data inconsistencies.
42. DATA MODEL: BIGTABLE INSPIRED. SPARSE MATRIX: it's a hash-map (associative array), a simple, versatile data structure. The SCHEMA-FREE data model introduces new freedom and new responsibilities. COLUMN FAMILIES blend row-oriented and column-oriented structure, providing a high level mechanism for clients to manage on-disk and inter-node data locality.
43. DATA MODEL TERMINOLOGY: KEYSPACE: a named collection of column families (similar to a database in MySQL); you only need one and you can mostly ignore it. COLUMN FAMILY: a named mapping of keys to rows. ROW: a named sorted map of columns or supercolumns. COLUMN: a (name, value, timestamp) triple. SUPERCOLUMN: a named collection of columns, for people who want to get fancy.
44. DATA MODEL [annotated example: users is a column family, alice is a key, its row is a set of columns, each a (name, value, timestamp)]:
{
  users: {
    alice: {
      city: [St. Louis, 1287040737182],
      name: [Alice, 1287080340940],
    },
    ...
  },
  locations: {},
  ...
}
45. IT'S A DISTRIBUTED HASH TABLE WITH A TWIST... Columns in a row are stored together on one node, identified by key [same users/alice example; keys bob and alice hash to positions s3b and 3e8 on the ring].
46. HASH TABLE SUPPORTED QUERIES: EXACT MATCH. Not supported: RANGE, PROXIMITY, anything that's not exact match.
47. COLUMNS SUPPORTED QUERIES: EXACT MATCH and RANGE; not PROXIMITY [example: row alice with columns city, friend-1: Bob, friend-2: Joe, friend-3: Meg, and name: Alice].
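To make the W+R>N rule from slides 34-36 concrete, here is a toy model of my own (a real Cassandra client just sets a consistency level per request): with N=3, W=2, R=2, any read quorum overlaps the last write quorum, and the copy with the newest timestamp wins, which is also the hook for read repair.

    N, W, R = 3, 2, 2                       # replication factor and quorum sizes
    assert W + R > N                        # the overlap condition from slide 34

    replicas = [{} for _ in range(N)]       # each replica holds key -> (value, timestamp)

    def write(key, value, ts):
        # The write is acknowledged once W replicas have it; here, the first W.
        for replica in replicas[:W]:
            replica[key] = (value, ts)

    def read(key, chosen):
        # Read from any R replicas; since W + R > N they overlap the written set,
        # so at least one response carries the newest value. Latest timestamp wins.
        responses = [replica[key] for replica in chosen if key in replica]
        return max(responses, key=lambda vt: vt[1])

    write("alice", "St. Louis", ts=1)
    write("alice", "San Francisco", ts=2)    # the newer write reaches only W replicas
    print(read("alice", replicas[-R:]))      # ('San Francisco', 2)
    # A real system would now push the newer value to the stale replica (read repair).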
48. LOG-STRUCTURED MERGE: MEMTABLES are in-memory data structures that contain newly written data. COMMIT LOGS are append-only files where new data is durably written. SSTABLES are serialized memtables, persisted to disk. COMPACTION periodically merges multiple SSTables to improve system performance.
49. CASSANDRA CONCEPTUAL SUMMARY... IT'S A DISTRIBUTED HASH TABLE: a gossip-based peer-to-peer ring with no distinguished nodes and no single point of failure; consistent hashing distributes the workload, and a simple replication strategy provides fault tolerance and improved throughput. WITH TUNABLE CONSISTENCY: based on a quorum protocol to ensure consistency, and simple repair mechanisms to stay available during partial system failures. AND A SIMPLE, SCHEMA-FREE DATA MODEL: it's just a key-value store whose values are multi-dimensional sorted maps.
50. ADVANCED CASSANDRA - A case study - SPATIAL DATA IN A DHT
51. A FIRST PASS: THE ORDER PRESERVING PARTITIONER. CASSANDRA'S PARTITIONING STRATEGY IS PLUGGABLE: the partitioner maps keys to nodes; the random partitioner destroys locality by hashing; the order preserving partitioner retains locality, storing keys in natural lexicographical order around the ring [diagram: keys alice, bob, and sam placed in order around the ring].
52. ORDER PRESERVING PARTITIONER: EXACT MATCH, RANGE; PROXIMITY, on a single dimension?
53. SPATIAL DATA: IT'S INHERENTLY MULTIDIMENSIONAL [diagram: point x at (2, 2) on a two-dimensional grid].
54. DIMENSIONALITY REDUCTION WITH SPACE-FILLING CURVES [diagram: a first-iteration Z-curve over cells 1-4].
55. Z-CURVE, SECOND ITERATION [diagram].
56. Z-VALUE [diagram: point x has Z-value 14].
57. GEOHASH: SIMPLE TO COMPUTE: interleave the bits of the decimal coordinates (equivalent to a binary encoding of the pre-order traversal!), then Base32 encode the result (see the sketch below). AWESOME CHARACTERISTICS: arbitrary precision, human readable, sorts lexicographically [diagram: bits 01101 encode to e].
58. DATA MODEL [record-index keys are geohash:record name; records hold the coordinates]:
{
  record-index: {
    9yzgcjn0:moonrise hotel: { : [, 1287040737182] },
    ...
  },
  records: {
    moonrise hotel: {
      latitude: [38.6554420, 1287040737182],
      longitude: [-90.2992910, 1287040737182],
      ...
    }
  }
}
59. BOUNDING BOX, E.G., MULTIDIMENSIONAL RANGE [diagram: "Gimme a bounding box!" decomposes into range queries over the grid cells that cover it].
60. SPATIAL DATA: STILL MULTIDIMENSIONAL. DIMENSIONALITY REDUCTION ISN'T PERFECT: clients must pre-process to compose multiple queries and post-process to filter and merge results; degenerate cases can be bad, particularly for nearest-neighbor queries.
61-64. Z-CURVE LOCALITY [diagrams: points that are close on the curve are usually close in space, but points that are close in space can end up far apart on the curve].
65. THE WORLD IS NOT BALANCED [satellite image] Credit: C. Mayhew & R. Simmon (NASA/GSFC), NOAA/NGDC, DMSP Digital Archive.
66-69. TOO MUCH LOCALITY [diagrams: San Francisco lands in a single cell, so one overloaded node says "I'm sad" while its idle neighbors say "I'm bored", "Me too", "Let's play Xbox"].
70. A TURNING POINT
71. HELLO, DRAWING BOARD: SURVEY OF DISTRIBUTED P2P INDEXING: an overlay-dependent index works directly with the nodes of the peer-to-peer network, defining its own overlay; an over-DHT index overlays a more sophisticated data structure on top of a peer-to-peer distributed hash table.
72. ANOTHER LOOK AT POSTGIS: MIGHT WORK, BUT the relational transaction management system (which we'd want to change) and the access methods (which we'd have to change) are tightly coupled (necessarily?) to other parts of the system. We could work at a higher level and treat PostGIS as a black box, but then we're back to implementing a peer-to-peer network with failure recovery, fault detection, etc., and Cassandra already had all that. It's probably clear by now that I think these problems are more difficult than actually storing structured data on disk.
73. LET'S TAKE A STEP BACK
74-77. EARTH [diagrams].
78. EARTH, TREE, RING [diagram].
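Slide 57's recipe (interleave the coordinate bits, then Base32 encode) is short enough to sketch in full. This is my own implementation of the standard geohash algorithm, not SimpleGeo code; for the Moonrise Hotel coordinates on slide 58 it should produce a hash starting with 9yzg, matching the record-index key 9yzgcjn0.

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

    def geohash(lat, lon, precision=8):
        lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
        bits, even = [], True                      # even bit positions refine longitude
        while len(bits) < precision * 5:
            rng, value = (lon_range, lon) if even else (lat_range, lat)
            mid = (rng[0] + rng[1]) / 2
            if value >= mid:
                bits.append(1)
                rng[0] = mid                       # keep the upper half
            else:
                bits.append(0)
                rng[1] = mid                       # keep the lower half
            even = not even
        # Group the interleaved bits into 5-bit chunks and Base32 encode them.
        chars = []
        for i in range(0, len(bits), 5):
            index = int("".join(map(str, bits[i:i + 5])), 2)
            chars.append(BASE32[index])
        return "".join(chars)

    # Coordinates of the Moonrise Hotel from slide 58.
    print(geohash(38.6554420, -90.2992910))        # expect a hash starting with 9yzg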
79-81. DATA MODEL [the same index shown three times with different parts highlighted; tree nodes are named by layer and bounding box, and record-index-meta tracks splits]:
{
  record-index: {
    layer-name:37.875,-90:40.25,-101.25: {
      38.6554420,-90.2992910:moonrise hotel: [, 1287040737182],
      ...
    },
  },
  record-index-meta: {
    layer-name:37.875,-90:40.25,-101.25: {
      split: [false, 1287040737182],
    },
    layer-name:37.875,-90:42.265,-101.25: {
      split: [true, 1287040737182],
      child-left: [layer-name:37.875,-90:40.25,-101.25, 1287040737182],
      child-right: [layer-name:40.25,-90:42.265,-101.25, 1287040737182],
    }
  }
}
82. SPLITTING: IT'S PRETTY MUCH JUST A CONCURRENT TREE. Splitting shouldn't lock the tree for reads or writes, and failures shouldn't cause corruption. Splits are optimistic, idempotent, and fail-forward. Instead of locking, writes are replicated to the splitting node and the relevant child[ren] while a split operation is taking place. Cleanup occurs after the split is completed and all interested nodes are aware that the split has occurred. Cassandra writes are idempotent, so splits are too: if a split fails, it is simply retried. Split size is a tunable knob for balancing locality and distributedness. The other hard problem with concurrent trees is rebalancing; we just don't do it! (more on this later; see the sketch below).
83. THE ROOT IS HOT: MIGHT BE A DEAL BREAKER. For a tree to be useful, it has to be traversed, and tree traversal typically starts at the root. The root is the only discoverable node in our tree, and traversing through the root meant reading the root for every read or write below it: unacceptable. There are lots of academic solutions; the most promising was a skip graph, but that required O(n log(n)) data: also unacceptable. A minimum tree depth was proposed, but then you just get multiple hot spots at your minimum-depth nodes.
84. BACK TO THE BOOKS: LOTS OF ACADEMIC WORK ON THIS TOPIC. But academia is obsessed with provable, deterministic, asymptotically optimal algorithms, and we only need something that is probably fast enough most of the time (for some value of "probably" and "most of the time"). And if the probably-good-enough algorithm is, you know... tractable... one might even consider it qualitatively better!
85. THE ROOT: A HOT SPOT AND A SPOF.
86. We have / We want [diagram].
87. THINKING HOLISTICALLY: WE OBSERVED THAT once a node in the tree exists, it doesn't go away, and that node state may change but only really matters locally: thinking a node is a leaf when it really has children is not fatal. SO... WHAT IF WE JUST CACHED NODES THAT WERE OBSERVED IN THE SYSTEM!?
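A rough sketch of the fail-forward split described on slide 82. Everything here is hypothetical (the in-memory dictionaries, the pick_child routing rule, and the key names merely mirror the record-index / record-index-meta layout above); the point is that writes go to both the splitting node and the relevant child, so a failed split can simply be retried.

    # Toy in-memory stand-ins for the record-index and record-index-meta column families.
    ROOT = "layer:37.875,-90:42.265,-101.25"
    index = {ROOT: {}}                                   # node key -> {record key: record}
    index_meta = {ROOT: {"split": False, "children": None}}

    def pick_child(children, record_key):
        # Toy routing rule standing in for a real bounding-box containment test.
        left_key, right_key = children
        return left_key if record_key < "m" else right_key

    def write_record(node_key, record_key, record):
        # While a split is under way, writes are replicated to the splitting node
        # AND to the relevant child, so no locking is needed.
        index[node_key][record_key] = record
        meta = index_meta[node_key]
        if meta["split"]:
            index[pick_child(meta["children"], record_key)][record_key] = record

    def split(node_key, left_key, right_key):
        # Optimistic, idempotent, fail-forward: re-running this after a failure is safe.
        for child in (left_key, right_key):
            index.setdefault(child, {})
            index_meta.setdefault(child, {"split": False, "children": None})
        index_meta[node_key] = {"split": True, "children": (left_key, right_key)}
        for record_key, record in list(index[node_key].items()):
            index[pick_child((left_key, right_key), record_key)][record_key] = record
        # Cleanup of the parent's records happens only after every interested node
        # knows the split occurred.

    write_record(ROOT, "moonrise hotel", {"lat": 38.6554420, "lon": -90.2992910})
    split(ROOT, "layer:37.875,-90:40.25,-101.25", "layer:40.25,-90:42.265,-101.25")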
88. CACHE IT: STUPID SIMPLE SOLUTION. Keep an LRU cache of nodes that have been traversed. Start traversals at the most selective relevant node; if that node doesn't satisfy you, traverse up the tree. Along with your result set, return a list of nodes that were traversed so the caller can add them to its cache (see the sketch below).
89-91. TRAVERSAL: NEAREST NEIGHBOR [diagrams: the search widens out from the query point x until enough neighboring points o are found].
92. KEY CHARACTERISTICS: PERFORMANCE: the best case on the happy path (everything cached) has zero read overhead; the worst case, with nothing cached, has O(log(n)) read overhead. RE-BALANCING SEEMS UNNECESSARY! It makes the worst case more worser, but so far so good.
93-94. DISTRIBUTED TREE SUPPORTED QUERIES: EXACT MATCH, RANGE, PROXIMITY, something else I haven't even heard of... in MULTIPLE DIMENSIONS!
95. LIFE OF A REQUEST
96-103. THE BIRDS N THE BEES [architecture diagrams, built up one layer at a time: ELB (load balancing; AWS service), gate (authentication; forwarding), service (business logic: basic validation), cass (record storage), worker pool (business logic: storage/indexing), index (awesome sauce for querying)].
104. ELB: Traffic management: control which AZs are serving traffic. Upgrades without downtime: able to remove an AZ, upgrade, test, replace. API-level failure scenarios: periodically runs healthchecks on nodes and removes nodes that fail.
105. GATE: Basic auth. HTTP proxy to specific services: services are independent of one another, and auth is decoupled from business logic. First line of defense: very fast, very cheap; keeps services from being overwhelmed by poorly authenticated requests.
106. RABBITMQ: Decouples accepting writes from performing the heavy lifting: don't block the client while we write to the db/index. Flexibility in the event of degradation further down the stack: queues can hold a lot, and can keep accepting writes throughout an incident. Heterogeneous consumers: pass the same message through multiple code paths easily.
107. AWS
108. EC2: Security groups. Static IPs. Choose your data center. Choose your instance type. On-demand vs. Reserved.
109. ELASTIC BLOCK STORE: Storage devices that can be anywhere from 1GB to 1TB. Snapshotting. Automatic replication. Mobility.
110. ELASTIC LOAD BALANCING: Distribute incoming traffic. Automatic scaling. Health checks.
111. SIMPLE STORAGE SERVICE: File sizes can be up to 5TB. Unlimited number of files. Individual read/write access credentials.
112. RELATIONAL DATABASE SERVICE: MySQL in the cloud. Manages replication. Specify instance types based on need. Snapshots.
113. CONFIGURATION MANAGEMENT
114. WHY IS THIS NECESSARY? Easier DevOps integration. Reusable modules. Infrastructure as code. One word: WINNING.
115. POSSIBLE OPTIONS
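The "cache it" idea from slide 88 fits in a few lines. This is a hypothetical sketch (the node keys, parent map, and satisfied predicate are placeholders for the real bounding-box tree): keep an LRU of nodes seen in earlier responses, start at the most selective cached node that covers the query, and walk up toward the root only when it cannot satisfy the query, returning every traversed node so the caller can cache it.

    from collections import OrderedDict

    class NodeCache:
        # LRU cache of tree nodes observed in previous responses (node key -> parent key).
        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.nodes = OrderedDict()

        def add(self, node_key, parent_key):
            self.nodes[node_key] = parent_key
            self.nodes.move_to_end(node_key)
            if len(self.nodes) > self.capacity:
                self.nodes.popitem(last=False)           # evict the least recently used node

        def most_selective(self, covering_keys):
            # Deepest cached node covering the query (key length as a stand-in for depth),
            # falling back to the root when nothing relevant is cached.
            cached = [key for key in covering_keys if key in self.nodes]
            return max(cached, key=len) if cached else "root"

    def query(cache, covering_keys, satisfied):
        # Start at the most selective relevant node; if it can't satisfy the query,
        # traverse up the tree. Return the result node plus every node touched so
        # the caller can add them to its cache.
        node = cache.most_selective(covering_keys)
        traversed = [node]
        while not satisfied(node) and node != "root":
            node = cache.nodes.get(node, "root")         # walk toward the root
            traversed.append(node)
        return node, traversed

On the happy path everything relevant is already cached and the query never touches the root, which is the zero-read-overhead best case from slide 92.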
116. SAMPLE MANIFEST:
# /root/learning-manifests/apache2.pp
package { 'apache2':
  ensure => present,
}
file { '/etc/apache2/apache2.conf':
  ensure => file,
  mode   => '600',
  notify => Service['apache2'],
  source => '/root/learning-manifests/apache2.conf',
}
service { 'apache2':
  ensure    => running,
  enable    => true,
  subscribe => File['/etc/apache2/apache2.conf'],
}
117. AN EXAMPLE [terminal demo].
118. CONTINUOUS INTEGRATION
119. GET 'ER DONE: Revision control. Automate the build process. Automate the testing process. Automate deployment. Local code changes should result in production deployments.
120. DON'T FORGET TO DEBIANIZE: All codebases must be debianized. Open source project not debianized yet? Fork the repo and do it yourself! Take the time to teach others. Debian directories can easily be reused after a simple search and replace.
121. REPOMAN: HTTPS://GITHUB.COM/SYNACK/REPOMAN.GIT
repoman upload myrepo sg-example.tar.gz
repoman promote development/sg-example staging
122. HUDSON
123. MAINTAINING MULTIPLE ENVIRONMENTS: Run unit tests in a development environment. Promote to staging. Run system tests in a staging environment. Run consumption tests in a staging environment. Promote to production. Congrats, you have now just automated yourself out of a job (see the sketch below).
124. THE MONEY MAKER: GitHub Plugin.
125. TYING IT ALL TOGETHER [terminal demo].
126. FLUME: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Syslog on steroids.
127. DATA-FLOWS [diagram].
128. AGENTS [physical host i-192df98 runs logical nodes, each with a source and a sink]:
twitter_stream: twitter(dsmitts, mypassword, url) -> agentSink(35853)
tail: tail(/var/log/nginx/access.log) -> agentSink(35853)
hdfs_writer: collectorSource(35853) -> collectorSink("hdfs://namenode.sg.com/bogus/", "logs")
129. RELIABILITY: END-TO-END, STORE ON FAILURE, BEST EFFORT.
130. GETTIN' JIGGY WIT IT: Custom decorators.
131. HOW DO WE USE IT?
132. PERSONAL EXPERIENCES: #flume. Automation was gnarly. It's never a good day when Eclipse is involved. Resource hog (at first).
133. We're building kick-ass tools for developers to explore, transport, and consume data connected to a location. MARCH 2011
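Purely as an illustration of the slide 119-123 pipeline (test, debianize, promote), here is a hedged sketch that shells out to the repoman commands shown on slide 121. The test runner, the packaging command, and the staging-to-production promotion target are my assumptions, not something the deck specifies.

    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)                  # stop the pipeline on the first failure

    def pipeline(package="sg-example"):
        run(["python", "-m", "pytest"])                             # unit tests in development (assumed runner)
        run(["dpkg-buildpackage", "-us", "-uc"])                    # build the debianized package
        run(["repoman", "upload", "myrepo", package + ".tar.gz"])   # from slide 121
        run(["repoman", "promote", "development/" + package, "staging"])
        # system and consumption tests would run against staging here
        run(["repoman", "promote", "staging/" + package, "production"])  # assumed final promotion step

    if __name__ == "__main__":
        pipeline()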