Scalable, Consistent, and Elastic Database Systems for Cloud Platforms. Sudipto Das, Computer Science, UC Santa Barbara (sudipto@cs.ucsb.edu).

TRANSCRIPT

  • Slide 1
  • Scalable, Consistent, and Elastic Database Systems for Cloud Platforms. Sudipto Das, Computer Science, UC Santa Barbara (sudipto@cs.ucsb.edu).
  • Slide 2
  • Web replacing Desktop
  • Slide 3
  • Paradigm shift in Infrastructure
  • Slide 4
  • Cloud computing: computing infrastructure and solutions delivered as a service; an industry worth USD 150 billion by 2014.* Contributors to success: economies of scale, and elasticity with pay-per-use pricing. Popular paradigms: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). *http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
  • Slide 5
  • Databases for cloud platforms: data is central to applications, and DBMSs are a mission-critical component of the cloud software stack; they manage petabytes of data, drive revenue, and serve a variety of applications (multitenancy). Data needs of cloud applications: OLTP systems store and serve data; data analysis systems provide decision support and intelligence.
  • Slide 6
  • Application landscape: social gaming, rich content and mash-ups, managed applications, and cloud application platforms.
  • Slide 7
  • Challenges for OLTP systems: Scalability, while ensuring efficient transaction execution; Lightweight Elasticity, to scale on demand; Self-Manageability, intelligence without a human controller.
  • Slide 8
  • Two approaches to scalability. Scale-up: preferred in the classical enterprise setting (RDBMS); flexible ACID transactions; transactions access a single node. Scale-out: cloud friendly (key-value stores); execution at a single server; limited functionality and guarantees; no multi-row or multi-step transactions.
  • Slide 9
  • Why care about transactions? Simplicity in application design with ACID transactions:
        confirm_friend_request(user1, user2) {
          begin_transaction();
          update_friend_list(user1, user2, status.confirmed);
          update_friend_list(user2, user1, status.confirmed);
          end_transaction();
        }
  • Slide 10
  • It gets too complicated with reduced consistency guarantees:
        confirm_friend_request_A(user1, user2) {
          try {
            update_friend_list(user1, user2, status.confirmed);
          } catch (exception e) {
            report_error(e);
            return;
          }
          try {
            update_friend_list(user2, user1, status.confirmed);
          } catch (exception e) {
            revert_friend_list(user1, user2);
            report_error(e);
            return;
          }
        }

        confirm_friend_request_B(user1, user2) {
          try {
            update_friend_list(user1, user2, status.confirmed);
          } catch (exception e) {
            report_error(e);
            add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
          }
          try {
            update_friend_list(user2, user1, status.confirmed);
          } catch (exception e) {
            report_error(e);
            add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
          }
        }
  • Slide 11
  • Challenge: Transactions at Scale. (Figure: key-value stores offer scale-out, RDBMSs offer ACID transactions; the goal is both.)
  • Slide 12
  • Challenge: Lightweight Elasticity. Provision on demand, not for peak; optimize operating cost! (Figure: resources vs. time for traditional infrastructures and deployment in the cloud, showing demand, provisioned capacity, and unused resources. Slide credits: Berkeley RAD Lab.)
  • Slide 13
  • Challenge: Self-Manageability. Managing a large distributed system: detecting failures and recovering, coordination and synchronization, provisioning, and capacity planning; a large distributed system is a zoo. Cloud platforms are inherently multitenant and must balance conflicting goals: minimize operating cost while ensuring good performance.
  • Slide 14
  • Contributions for OLTP systems. Transactions at Scale: ElasTraS [HotCloud 2009, UCSB TR 2010], G-Store [SoCC 2010]. Lightweight Elasticity: Albatross [VLDB 2011], Zephyr [SIGMOD 2011]. Self-Manageability: Pythia [in progress].
  • Slide 15
  • Contributions. Transaction processing (this talk): static partitioning with ElasTraS [HotCloud 09, TR 10]; dynamic partitioning with G-Store [SoCC 10]; live migration with Albatross [VLDB 11] and Zephyr [SIGMOD 11]; Pythia [in progress]. Data management and analytics: Ricardo [SIGMOD 10], MD-HBase [MDM 11] (Best Paper Runner-up), CoTS [ICDE 09, VLDB 09], Anonimos [ICDE 10, TKDE]. Novel architectures: Hyder [CIDR 11] (Best Paper), TCAM [DaMoN 08].
  • Slide 16
  • Transactions at Scale. (Figure repeated from Slide 11: key-value stores offer scale-out, RDBMSs offer ACID transactions; the goal is both.)
  • Slide 17
  • Scale-out with static partitioning. Table-level partitioning (range, hash) leads to distributed transactions. Partitioning the database schema co-locates data items accessed together; goal: minimize distributed transactions.
  • Slide 18
  • Scale-out with static partitioning (continued). Table-level partitioning (range, hash) leads to distributed transactions; partitioning the database schema co-locates data items accessed together, with the goal of minimizing distributed transactions (see the routing sketch below). Systems scaling out with static partitioning: ElasTraS [HotCloud 2009, TR 2010], Cloud SQL Server [ICDE 2011], Megastore [CIDR 2011], Relational Cloud [CIDR 2011].
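    To make the schema-partitioning idea concrete, here is a minimal, hypothetical sketch (not taken from any of the systems above) of a partition-aware router: all rows that share a partitioning key, such as a tenant or user id, are mapped to the same node, so transactions that stay within one such key never become distributed.

        // Hypothetical sketch: route every row to a node based on its partitioning key,
        // so data items accessed together are co-located and transactions stay local.
        import java.util.List;

        public class StaticPartitionRouter {
            private final List<String> nodes;   // database servers hosting the partitions

            public StaticPartitionRouter(List<String> nodes) {
                this.nodes = nodes;
            }

            // Hash partitioning: all rows with the same partitioning key land on one node.
            public String nodeForKey(String partitioningKey) {
                int bucket = Math.floorMod(partitioningKey.hashCode(), nodes.size());
                return nodes.get(bucket);
            }
        }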
  • Slide 19
  • Dynamically formed partitions. Access patterns change, often rapidly, in online multi-player gaming, collaboration-based, and scientific computing applications, which are not amenable to static partitioning. How do we get the benefit of partitioning when accesses do not statically partition? Ours is the first solution to allow that.
  • Slide 20
  • Online Multi-player Games. (Figure: a player profile table with columns ID, Name, $$$, and Score.)
  • Slide 21
  • Online Multi-player Games: execute transactions on player profiles while the game is in progress.
  • Slide 22
  • Online Multi-player Games: partitions/groups are dynamic.
  • Slide 23
  • Online Multi-player Games: hundreds of thousands of concurrent groups.
  • Slide 24
  • Data Fusion for dynamic partitions [G-Store, SoCC 2010]: transactional access to a group of data items formed on demand; challenge: avoid distributed transactions! The Key Group abstraction: groups are small, execute a non-trivial number of transactions, are dynamic and on-demand, and are dynamically formed tenant databases (a client-facing sketch follows below).
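    A minimal client-side sketch of how the Key Group abstraction might be exposed; the interface and method names here are hypothetical, not the actual G-Store API.

        // Hypothetical client interface for the Key Group abstraction: a group of keys
        // is formed on demand, multi-key transactions run at the group's leader, and
        // the group is dissolved when the application no longer needs it.
        import java.util.Set;

        public interface KeyGroupClient {
            // Form a group; ownership of all member keys is collected at one leader node.
            String createGroup(Set<String> memberKeys);

            // Execute a multi-key ACID transaction entirely at the group's leader.
            void executeTransaction(String groupId, Runnable transactionBody);

            // Dissolve the group; ownership of keys returns to their original nodes.
            void deleteGroup(String groupId);
        }

        // Usage sketch: group the profiles of the players in one game instance.
        //   String gid = client.createGroup(Set.of("player:42", "player:77", "player:91"));
        //   client.executeTransaction(gid, () -> { /* update scores atomically */ });
        //   client.deleteGroup(gid);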
  • Slide 25
  • Transactions on Groups, without distributed transactions: ownership of all keys in a Key Group is collected at a single node; one key is selected as the leader, and the followers transfer ownership of their keys to the leader via the grouping protocol.
  • Slide 26
  • Why is group formation hard? Guarantee the contract between leaders and followers in the presence of: leader and follower failures; lost, duplicated, or re-ordered messages; dynamics of the underlying system. How to ensure efficient and ACID execution of transactions?
  • Slide 27
  • Grouping protocol: conceptually akin to locking, with locks held by groups. (Figure: a timeline of the protocol between the leader and followers, showing the create and delete requests, the join and delete acknowledgement messages, the corresponding log entries, and the Creating, Joining, Joined, Deleting, Free, and Deleted states.)
  • Slide 28
  • Efficient transaction processing: how does the leader execute transactions? It caches data for the group members, treating the underlying data store as the equivalent of a disk; transaction logging provides durability, and the cache is asynchronously flushed to propagate updates, with guaranteed update propagation. (Figure: the leader's transaction manager, cache manager, and log, with asynchronous update propagation to the followers; a simplified sketch follows below.)
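    A simplified sketch of the leader's execution path under the assumptions above; the class and collaborator names are invented for illustration and do not reflect the actual G-Store code.

        // Assumed structure (not the actual G-Store code): reads hit a local cache backed
        // by the key-value store, committed writes are logged for durability, and the
        // cache is flushed asynchronously to propagate updates back to the store.
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class GroupLeader {
            private final Map<String, String> cache = new ConcurrentHashMap<>();
            private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
            private final KeyValueStore store;   // underlying key-value store (acts like a disk)
            private final TransactionLog log;    // durability via transaction logging

            public GroupLeader(KeyValueStore store, TransactionLog log) {
                this.store = store;
                this.log = log;
                // Asynchronously propagate committed updates back to the underlying store.
                flusher.scheduleAtFixedRate(this::flushCache, 1, 1, TimeUnit.SECONDS);
            }

            public synchronized void commit(Map<String, String> writes) {
                log.append(writes);              // force the log before acknowledging the commit
                cache.putAll(writes);            // install updates in the leader's cache
            }

            public String read(String key) {
                return cache.computeIfAbsent(key, store::get);   // cache miss: fetch from the store
            }

            private void flushCache() {
                cache.forEach(store::put);       // simplified stand-in for guaranteed update propagation
            }
        }

        // Hypothetical collaborators assumed by this sketch.
        interface KeyValueStore { String get(String key); void put(String key, String value); }
        interface TransactionLog { void append(Map<String, String> writes); }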
  • Slide 29
  • Prototype: G-Store [SoCC 2010], an implementation over key-value stores: a grouping middleware layer resident on top of a key-value store. (Figure: application clients issue transactional multi-key accesses to G-Store nodes, each consisting of a transaction manager and a grouping layer over the key-value store logic, backed by distributed storage.)
  • Slide 30
  • G-Store Evaluation: implemented using HBase with the added middleware layer (~10,000 LOC); experiments on Amazon EC2 with an online multi-player game benchmark, a cluster of 10 nodes, and ~1 billion rows (>1 TB) of data. For groups with 100 keys: group creation latency of ~10 to 100 ms, with more than 10,000 groups created concurrently.
  • Slide 31
  • G-Store Evaluation. (Figures: group creation latency and group creation throughput.)
  • Slide 32
  • Lightweight Elasticity: provision on demand, not for peak; optimize operating cost! (Figure repeated from Slide 12: resources vs. time for traditional infrastructures and cloud deployment, showing demand, capacity, and unused resources. Slide credits: Berkeley RAD Lab.)
  • Slide 33
  • Elasticity in the Database tier. (Figure: a load balancer in front of the application/web/caching tier, which in turn sits in front of the database tier.)
  • Slide 34
  • Live database migration: migrate a database partition (or tenant) in a live system, to optimize operating cost and enable resource orchestration in multitenant systems. Different from migration between software versions or migration in case of schema evolution.
  • Slide 35
  • VM migration for DB elasticity. One DB partition per VM: pros: allows fine-grained load balancing; cons: performance overhead and a poor consolidation ratio [Curino et al., CIDR 2011]. Multiple DB partitions in a VM: pros: good performance; cons: all partitions must be migrated together, so load balancing is coarse-grained.
  • Slide 36
  • Live database migration: multiple partitions share the same database process (shared process multitenancy); migrate individual partitions on demand in a live system, providing virtualization in the database tier. The straightforward solution: stop serving the partition at the source, copy it to the destination, and start serving at the destination. Expensive! (A sketch of this stop-and-copy baseline follows below.)
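    The following is a minimal sketch of that stop-and-copy baseline, with hypothetical helper interfaces; it makes explicit why the approach is expensive: the partition is unavailable for the entire duration of the copy.

        // Sketch of the straightforward stop-and-copy approach (hypothetical helpers):
        // the partition is down for the whole copy, which is what live migration
        // techniques such as Albatross and Zephyr avoid.
        public class StopAndCopyMigration {
            public void migrate(Partition p, Node source, Node destination) {
                source.stopServing(p);                    // downtime starts here
                byte[] snapshot = source.export(p);       // copy all persistent data (can be GBs)
                destination.importPartition(p, snapshot);
                destination.startServing(p);              // downtime ends only after the full copy
            }
        }

        // Hypothetical interfaces assumed by this sketch.
        interface Partition {}
        interface Node {
            void stopServing(Partition p);
            byte[] export(Partition p);
            void importPartition(Partition p, byte[] data);
            void startServing(Partition p);
        }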
  • Slide 37
  • Migration cost measures. Service unavailability: the time the partition is unavailable. Number of failed requests: operations failing or transactions aborting. Performance overhead: impact on response times, and additional data transferred.
  • Slide 38
  • Two common DBMS architectures. Decoupled storage architectures (ElasTraS, G-Store, Deuteronomy, Megastore): persistent data is not migrated; addressed by Albatross [VLDB 2011]. Shared-nothing architectures (SQL Azure, Relational Cloud, MySQL Cluster): persistent data must be migrated; addressed by Zephyr [SIGMOD 2011].
  • Slide 39
  • Why is live DB migration hard? Persistent data must be migrated (GBs): how to ensure no downtime? Nodes can fail during migration: how to guarantee correctness during failures, preserving transaction atomicity and durability and recovering the migration state after a failure? Transactions execute during migration: how to guarantee serializability, with transaction correctness equivalent to normal operation?
  • Slide 40
  • Our approach: Zephyr [SIGMOD 2011]. Migration is executed in phases, starting with the transfer of minimal information to the destination (the wireframe); database pages are the granule of migration, with unique page ownership. Source and destination concurrently execute transactions in one migration phase, with minimal transaction synchronization and guaranteed serializability, using logging and handshaking protocols. (A sketch of the phases follows below.)
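    A high-level sketch of the phased migration, using invented names rather than the actual Zephyr implementation; the details of transaction hand-off and page pulls appear in the following slides.

        // Assumed structure (not the actual Zephyr code): the destination starts with
        // only the index wireframe, both nodes execute transactions in the dual mode,
        // and each page migrates exactly once.
        public class ZephyrMigration {
            enum Mode { INIT, DUAL, FINISH, NORMAL }

            private Mode mode = Mode.INIT;

            public void run(SourceNode source, DestinationNode destination) {
                // Init: freeze indices and ship the wireframe (minimal metadata, no pages).
                destination.installWireframe(source.freezeAndExtractWireframe());

                mode = Mode.DUAL;      // both nodes execute transactions; pages pulled on demand
                // ... new transactions start at the destination, old ones finish at the source ...

                mode = Mode.FINISH;    // source pushes remaining pages; destination may still pull
                source.pushRemainingPagesTo(destination);

                mode = Mode.NORMAL;    // destination owns every page and resumes normal operation
                destination.unfreezeIndices();
            }
        }

        // Hypothetical collaborators assumed by this sketch.
        interface Wireframe {}
        interface SourceNode {
            Wireframe freezeAndExtractWireframe();
            void pushRemainingPagesTo(DestinationNode d);
        }
        interface DestinationNode {
            void installWireframe(Wireframe w);
            void unfreezeIndices();
        }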
  • Slide 41
  • Simplifying assumptions for this talk: transactions access a single partition; no replication; no structural changes to indices. Extensions in the paper [SIGMOD 2011] relax these assumptions.
  • Slide 42
  • Design overview. (Figure: the source owns pages P1 through Pn and runs active transactions TS1 through TSk; the destination owns nothing yet.)
  • Slide 43
  • Init mode: freeze the indices and migrate the wireframe. (Figure: the source still owns pages P1 through Pn and continues executing transactions TS1 through TSk; the destination holds the wireframe but owns no pages yet.)
  • Slide 44
  • What is an index wireframe? (Figure: the source and destination index structures side by side.)
  • Slide 45
  • Dual mode: requests for un-owned pages can block; index wireframes remain frozen. (Figure: old, still-active transactions TSk+1 through TSl run at the source while new transactions TD1 through TDm run at the destination; when page P3 is accessed by a destination transaction TDi, it is pulled from the source. See the page-pull sketch below.)
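    A small sketch of the on-demand page pull during dual mode, again with hypothetical types: a destination transaction that touches an un-owned page blocks while the page and its ownership are fetched from the source exactly once.

        // Hypothetical structure: every page has exactly one owner, a destination
        // transaction that touches an un-owned page blocks while the page is fetched,
        // and a migrated page is never pulled back by the source.
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        public class DestinationPageManager {
            private final Map<Integer, byte[]> ownedPages = new ConcurrentHashMap<>();
            private final PageFetcher source;      // fetches a page and transfers its ownership

            public DestinationPageManager(PageFetcher source) {
                this.source = source;
            }

            // Called by a destination transaction; blocks if the page must be pulled first.
            public byte[] readPage(int pageId) {
                return ownedPages.computeIfAbsent(pageId, source::pullAndTransferOwnership);
            }
        }

        interface PageFetcher {
            byte[] pullAndTransferOwnership(int pageId);   // the source gives up the page for good
        }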
  • Slide 46
  • Finish mode: remaining pages are pushed from the source, and pages can still be pulled by the destination if needed. (Figure: pages P1, P2, and others are pushed from the source while transactions TDm+1 through TDn execute at the destination; migration completes once the destination owns all pages.)
  • Slide 47
  • Normal operation: the index wireframe is un-frozen. (Figure: the destination owns all pages P1 through Pn and executes transactions TDn+1 through TDp.)
  • Slide 48
  • Artifacts of this design: once migrated, pages are never pulled back by the source, so transactions at the source accessing migrated pages are aborted; there are no structural changes to indices during migration, so transactions (at both nodes) that make structural changes to indices are aborted; the destination pulls pages on demand, so transactions at the destination experience higher latency compared to normal operation.
  • Slide 49
  • Serializability: the only concern is dual mode, since in Init and Finish only one node executes transactions. Local predicate locking of the internal index and exclusive page ownership prevent phantoms; strict 2PL makes transactions locally serializable; pages are transferred only once, so there is no Tdest-to-Tsource conflict dependency. Serializability is guaranteed.
  • Slide 50
  • Recovery. Transaction recovery: for every database page, source transactions are ordered before destination transactions (Tsrc before Tdst), and on recovery transactions are replayed in conflict order. Migration recovery: atomic transitions between migration modes, using logging and handshake protocols developed for this purpose; every page has exactly one owner, with bookkeeping at the index level.
  • Slide 51
  • Correctness: in the presence of arbitrary repeated failures, Zephyr ensures that updates made to database pages are consistent, that a failure does not leave a page without an owner, and that both source and destination are in the same migration mode; it also guarantees termination and starvation freedom.
  • Slide 52
  • Implementation: prototyped using H2, an open-source OLTP database that supports a standard SQL/JDBC API, a serializable isolation level, tree indices, and the relational data model. The database engine was modified (~6,000 LOC): support for freezing indices was added, and page migration status is maintained using the index. The Tungsten SQL Router migrates JDBC connections during migration.
  • Slide 53
  • Results Overview. Downtime (partition unavailability): stop-and-copy: 3 to 8 seconds (the time needed to migrate, during which the partition is unavailable for updates); Zephyr: no downtime, since either the source or the destination is always available. Service interruption (failed operations): stop-and-copy: ~100s to 1,000s of failed operations, since all transactions with updates are aborted; Zephyr: ~10s to 100s, an order of magnitude less interruption. Minimal operational and data transfer overhead.
  • Slide 54
  • Failed Operations. (Figure: Zephyr incurs an order of magnitude fewer failed operations than stop-and-copy.)
  • Slide 55
  • Concluding Remarks. Major enabling technologies: a scalable distributed database infrastructure (ElasTraS), dynamically formed data partitions (G-Store), and live database migration (Albatross, Zephyr).
  • Slide 56
  • Future Directions: a self-managing controller for large multitenant database infrastructures; novel data management architectures leveraging advances in hardware; convergence of transactional and analytics systems for real-time intelligence; putting the human in the loop by leveraging crowd-sourcing.
  • Slide 57
  • Thank you! Collaborators. UCSB: Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu, Shashank Agarwal, Shyam Antony, Aaron Elmore, Shoji Nishimura (NEC Japan). Microsoft Research Redmond: Phil Bernstein, Colin Reid. IBM Almaden: Yannis Sismanis, Kevin Beyer, Rainer Gemulla, Peter Haas, John McPherson.