Scalable, Consistent, and Elastic Database Systems for Cloud Platforms. Sudipto Das, Computer Science, UC Santa Barbara (sudipto@cs.ucsb.edu).

TRANSCRIPT

  • Slide 1
  • Scalable, Consistent, and Elastic Database Systems for Cloud Platforms. Sudipto Das, Computer Science, UC Santa Barbara (sudipto@cs.ucsb.edu).
  • Slide 2
  • Web replacing Desktop
  • Slide 3
  • Paradigm shift in Infrastructure
  • Slide 4
  • Cloud computing: computing infrastructure and solutions delivered as a service; an industry worth USD 150 billion by 2014.* Contributors to success: economies of scale, and elasticity with pay-per-use pricing. Popular paradigms: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). *http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
  • Slide 5
  • Databases for cloud platforms: data is central to applications, and DBMSs are a mission-critical component of the cloud software stack; they manage petabytes of data, drive revenue, and serve a variety of applications (multitenancy). Data needs of cloud applications: OLTP systems store and serve data; data analysis systems provide decision support and intelligence.
  • Slide 6
  • Application landscape: social gaming, rich content and mash-ups, managed applications, and cloud application platforms.
  • Slide 7
  • Challenges for OLTP systems: Scalability, while ensuring efficient transaction execution; Lightweight Elasticity, to scale on demand; Self-Manageability, intelligence without a human controller.
  • Slide 8
  • Two approaches to scalability. Scale-up: preferred in the classical enterprise setting (RDBMS); flexible ACID transactions; transactions access a single node. Scale-out: cloud friendly (key-value stores); execution at a single server; limited functionality and guarantees; no multi-row or multi-step transactions.
  • Slide 9
  • Why care about transactions? Simplicity in application design with ACID transactions:
        confirm_friend_request(user1, user2) {
          begin_transaction();
          update_friend_list(user1, user2, status.confirmed);
          update_friend_list(user2, user1, status.confirmed);
          end_transaction();
        }
  • Slide 10
  • It gets too complicated with reduced consistency guarantees:
        confirm_friend_request_A(user1, user2) {
          try {
            update_friend_list(user1, user2, status.confirmed);
          } catch (exception e) {
            report_error(e);
            return;
          }
          try {
            update_friend_list(user2, user1, status.confirmed);
          } catch (exception e) {
            revert_friend_list(user1, user2);
            report_error(e);
            return;
          }
        }

        confirm_friend_request_B(user1, user2) {
          try {
            update_friend_list(user1, user2, status.confirmed);
          } catch (exception e) {
            report_error(e);
            add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
          }
          try {
            update_friend_list(user2, user1, status.confirmed);
          } catch (exception e) {
            report_error(e);
            add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
          }
        }
  • Slide 11
  • Challenge: Transactions at Scale. (Figure: key-value stores offer scale-out, RDBMSs offer ACID transactions; the goal is both.)
  • Slide 12
  • Challenge: Lightweight Elasticity. Provision on demand, not for peak; optimize operating cost! (Figure: resources vs. time for traditional infrastructures and deployment in the cloud, showing demand, provisioned capacity, and unused resources. Slide credits: Berkeley RAD Lab.)
  • Slide 13
  • Challenge: Self-Manageability. Managing a large distributed system: detecting failures and recovering, coordination and synchronization, provisioning, and capacity planning; a large distributed system is a zoo. Cloud platforms are inherently multitenant and must balance conflicting goals: minimize operating cost while ensuring good performance.
  • Slide 14
  • Contributions for OLTP systems. Transactions at Scale: ElasTraS [HotCloud 2009, UCSB TR 2010], G-Store [SoCC 2010]. Lightweight Elasticity: Albatross [VLDB 2011], Zephyr [SIGMOD 2011]. Self-Manageability: Pythia [in progress].
  • Slide 15
  • Contributions. Transaction processing (this talk): static partitioning with ElasTraS [HotCloud 09, TR 10]; dynamic partitioning with G-Store [SoCC 10]; live migration with Albatross [VLDB 11] and Zephyr [SIGMOD 11]; Pythia [in progress]. Data management and analytics: Ricardo [SIGMOD 10], MD-HBase [MDM 11] (Best Paper Runner-up), CoTS [ICDE 09, VLDB 09], Anonimos [ICDE 10, TKDE]. Novel architectures: Hyder [CIDR 11] (Best Paper), TCAM [DaMoN 08].
  • Slide 16
  • Transactions at Scale. (Figure repeated from Slide 11: key-value stores offer scale-out, RDBMSs offer ACID transactions; the goal is both.)
  • Slide 17
  • Scale-out with static partitioning. Table-level partitioning (range, hash) leads to distributed transactions. Partitioning the database schema co-locates data items accessed together; goal: minimize distributed transactions.
  • Slide 18
  • Scale-out with static partitioning (continued). Table-level partitioning (range, hash) leads to distributed transactions; partitioning the database schema co-locates data items accessed together, with the goal of minimizing distributed transactions (see the routing sketch below). Systems scaling out with static partitioning: ElasTraS [HotCloud 2009, TR 2010], Cloud SQL Server [ICDE 2011], Megastore [CIDR 2011], Relational Cloud [CIDR 2011].
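    To make the schema-partitioning idea concrete, here is a minimal, hypothetical sketch (not taken from any of the systems above) of a partition-aware router: all rows that share a partitioning key, such as a tenant or user id, are mapped to the same node, so transactions that stay within one such key never become distributed.

        // Hypothetical sketch: route every row to a node based on its partitioning key,
        // so data items accessed together are co-located and transactions stay local.
        import java.util.List;

        public class StaticPartitionRouter {
            private final List<String> nodes;   // database servers hosting the partitions

            public StaticPartitionRouter(List<String> nodes) {
                this.nodes = nodes;
            }

            // Hash partitioning: all rows with the same partitioning key land on one node.
            public String nodeForKey(String partitioningKey) {
                int bucket = Math.floorMod(partitioningKey.hashCode(), nodes.size());
                return nodes.get(bucket);
            }
        }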
  • Slide 19
  • Dynamically formed partitions. Access patterns change, often rapidly, in online multi-player gaming, collaboration-based, and scientific computing applications, which are not amenable to static partitioning. How do we get the benefit of partitioning when accesses do not statically partition? Ours is the first solution to allow that.
  • Slide 20
  • Online Multi-player Games. (Figure: a player profile table with columns ID, Name, $$$, and Score.)
  • Slide 21
  • Online Multi-player Games: execute transactions on player profiles while the game is in progress.
  • Slide 22
  • Online Multi-player Games: partitions/groups are dynamic.
  • Slide 23
  • Online Multi-player Games: hundreds of thousands of concurrent groups.
  • Slide 24
  • Data Fusion for dynamic partitions [G-Store, SoCC 2010]: transactional access to a group of data items formed on demand; challenge: avoid distributed transactions! The Key Group abstraction: groups are small, execute a non-trivial number of transactions, are dynamic and on-demand, and are dynamically formed tenant databases (a client-facing sketch follows below).
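    A minimal client-side sketch of how the Key Group abstraction might be exposed; the interface and method names here are hypothetical, not the actual G-Store API.

        // Hypothetical client interface for the Key Group abstraction: a group of keys
        // is formed on demand, multi-key transactions run at the group's leader, and
        // the group is dissolved when the application no longer needs it.
        import java.util.Set;

        public interface KeyGroupClient {
            // Form a group; ownership of all member keys is collected at one leader node.
            String createGroup(Set<String> memberKeys);

            // Execute a multi-key ACID transaction entirely at the group's leader.
            void executeTransaction(String groupId, Runnable transactionBody);

            // Dissolve the group; ownership of keys returns to their original nodes.
            void deleteGroup(String groupId);
        }

        // Usage sketch: group the profiles of the players in one game instance.
        //   String gid = client.createGroup(Set.of("player:42", "player:77", "player:91"));
        //   client.executeTransaction(gid, () -> { /* update scores atomically */ });
        //   client.deleteGroup(gid);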
  • Slide 25
  • Transactions on Groups, without distributed transactions: ownership of all keys in a Key Group is collected at a single node; one key is selected as the leader, and the followers transfer ownership of their keys to the leader via the grouping protocol.
  • Slide 26
  • Why is group formation hard? Guarantee the contract between leaders and followers in the presence of: leader and follower failures; lost, duplicated, or re-ordered messages; dynamics of the underlying system. How to ensure efficient and ACID execution of transactions?
  • Slide 27
  • Grouping protocol: conceptually akin to locking, with locks held by groups. (Figure: a timeline of the protocol between the leader and followers, showing the create and delete requests, the join and delete acknowledgement messages, the corresponding log entries, and the Creating, Joining, Joined, Deleting, Free, and Deleted states.)
  • Slide 28
  • Efficient transaction processing: how does the leader execute transactions? It caches data for the group members, treating the underlying data store as the equivalent of a disk; transaction logging provides durability, and the cache is asynchronously flushed to propagate updates, with guaranteed update propagation. (Figure: the leader's transaction manager, cache manager, and log, with asynchronous update propagation to the followers; a simplified sketch follows below.)
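    A simplified sketch of the leader's execution path under the assumptions above; the class and collaborator names are invented for illustration and do not reflect the actual G-Store code.

        // Assumed structure (not the actual G-Store code): reads hit a local cache backed
        // by the key-value store, committed writes are logged for durability, and the
        // cache is flushed asynchronously to propagate updates back to the store.
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class GroupLeader {
            private final Map<String, String> cache = new ConcurrentHashMap<>();
            private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
            private final KeyValueStore store;   // underlying key-value store (acts like a disk)
            private final TransactionLog log;    // durability via transaction logging

            public GroupLeader(KeyValueStore store, TransactionLog log) {
                this.store = store;
                this.log = log;
                // Asynchronously propagate committed updates back to the underlying store.
                flusher.scheduleAtFixedRate(this::flushCache, 1, 1, TimeUnit.SECONDS);
            }

            public synchronized void commit(Map<String, String> writes) {
                log.append(writes);              // force the log before acknowledging the commit
                cache.putAll(writes);            // install updates in the leader's cache
            }

            public String read(String key) {
                return cache.computeIfAbsent(key, store::get);   // cache miss: fetch from the store
            }

            private void flushCache() {
                cache.forEach(store::put);       // simplified stand-in for guaranteed update propagation
            }
        }

        // Hypothetical collaborators assumed by this sketch.
        interface KeyValueStore { String get(String key); void put(String key, String value); }
        interface TransactionLog { void append(Map<String, String> writes); }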
  • Slide 29
  • Prototype: G-Store [SoCC 2010], an implementation over key-value stores: a grouping middleware layer resident on top of a key-value store. (Figure: application clients issue transactional multi-key accesses to G-Store nodes, each consisting of a transaction manager and a grouping layer over the key-value store logic, backed by distributed storage.)
  • Slide 30
  • G-Store Evaluation: implemented using HBase with the added middleware layer (~10,000 LOC); experiments on Amazon EC2 with an online multi-player game benchmark, a cluster of 10 nodes, and ~1 billion rows (>1 TB) of data. For groups with 100 keys: group creation latency of ~10 to 100 ms, with more than 10,000 groups created concurrently.
  • Slide 31
  • G-Store Evaluation. (Figures: group creation latency and group creation throughput.)
  • Slide 32
  • Lightweight Elasticity: provision on demand, not for peak; optimize operating cost! (Figure repeated from Slide 12: resources vs. time for traditional infrastructures and cloud deployment, showing demand, capacity, and unused resources. Slide credits: Berkeley RAD Lab.)
  • Slide 33
  • Elasticity in the Database tier. (Figure: a load balancer in front of the application/web/caching tier, which in turn sits in front of the database tier.)
  • Slide 34
  • Live database migration: migrate a database partition (or tenant) in a live system, to optimize operating cost and enable resource orchestration in multitenant systems. Different from migration between software versions or migration in case of schema evolution.
  • Slide 35
  • VM migration for DB elasticity. One DB partition per VM: pros: allows fine-grained load balancing; cons: performance overhead and a poor consolidation ratio [Curino et al., CIDR 2011]. Multiple DB partitions in a VM: pros: good performance; cons: all partitions must be migrated together, so load balancing is coarse-grained.
  • Slide 36
  • Live database migration: multiple partitions share the same database process (shared process multitenancy); migrate individual partitions on demand in a live system, providing virtualization in the database tier. The straightforward solution: stop serving the partition at the source, copy it to the destination, and start serving at the destination. Expensive! (A sketch of this stop-and-copy baseline follows below.)
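    The following is a minimal sketch of that stop-and-copy baseline, with hypothetical helper interfaces; it makes explicit why the approach is expensive: the partition is unavailable for the entire duration of the copy.

        // Sketch of the straightforward stop-and-copy approach (hypothetical helpers):
        // the partition is down for the whole copy, which is what live migration
        // techniques such as Albatross and Zephyr avoid.
        public class StopAndCopyMigration {
            public void migrate(Partition p, Node source, Node destination) {
                source.stopServing(p);                    // downtime starts here
                byte[] snapshot = source.export(p);       // copy all persistent data (can be GBs)
                destination.importPartition(p, snapshot);
                destination.startServing(p);              // downtime ends only after the full copy
            }
        }

        // Hypothetical interfaces assumed by this sketch.
        interface Partition {}
        interface Node {
            void stopServing(Partition p);
            byte[] export(Partition p);
            void importPartition(Partition p, byte[] data);
            void startServing(Partition p);
        }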
  • Slide 37
  • Migration cost measures. Service unavailability: the time the partition is unavailable. Number of failed requests: operations failing or transactions aborting. Performance overhead: impact on response times, and additional data transferred.
  • Slide 38
  • Two common DBMS architectures. Decoupled storage architectures (ElasTraS, G-Store, Deuteronomy, Megastore): persistent data is not migrated; addressed by Albatross [VLDB 2011]. Shared-nothing architectures (SQL Azure, Relational Cloud, MySQL Cluster): persistent data must be migrated; addressed by Zephyr [SIGMOD 2011].
  • Slide 39
  • Why is live DB migration hard? Persistent data must be migrated (GBs): how to ensure no downtime? Nodes can fail during migration: how to guarantee correctness during failures, preserving transaction atomicity and durability and recovering the migration state after a failure? Transactions execute during migration: how to guarantee serializability, with transaction correctness equivalent to normal operation?
  • Slide 40
  • Our approach: Zephyr [SIGMOD 2011]. Migration is executed in phases, starting with the transfer of minimal information to the destination (the wireframe); database pages are the granule of migration, with unique page ownership. Source and destination concurrently execute transactions in one migration phase, with minimal transaction synchronization and guaranteed serializability, using logging and handshaking protocols. (A sketch of the phases follows below.)
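    A high-level sketch of the phased migration, using invented names rather than the actual Zephyr implementation; the details of transaction hand-off and page pulls appear in the following slides.

        // Assumed structure (not the actual Zephyr code): the destination starts with
        // only the index wireframe, both nodes execute transactions in the dual mode,
        // and each page migrates exactly once.
        public class ZephyrMigration {
            enum Mode { INIT, DUAL, FINISH, NORMAL }

            private Mode mode = Mode.INIT;

            public void run(SourceNode source, DestinationNode destination) {
                // Init: freeze indices and ship the wireframe (minimal metadata, no pages).
                destination.installWireframe(source.freezeAndExtractWireframe());

                mode = Mode.DUAL;      // both nodes execute transactions; pages pulled on demand
                // ... new transactions start at the destination, old ones finish at the source ...

                mode = Mode.FINISH;    // source pushes remaining pages; destination may still pull
                source.pushRemainingPagesTo(destination);

                mode = Mode.NORMAL;    // destination owns every page and resumes normal operation
                destination.unfreezeIndices();
            }
        }

        // Hypothetical collaborators assumed by this sketch.
        interface Wireframe {}
        interface SourceNode {
            Wireframe freezeAndExtractWireframe();
            void pushRemainingPagesTo(DestinationNode d);
        }
        interface DestinationNode {
            void installWireframe(Wireframe w);
            void unfreezeIndices();
        }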
  • Slide 41
  • Simplifying assumptions for this talk: transactions access a single partition; no replication; no structural changes to indices. Extensions in the paper [SIGMOD 2011] relax these assumptions.
  • Slide 42
  • Design overview. (Figure: the source owns pages P1 through Pn and runs active transactions TS1 through TSk; the destination owns nothing yet.)
  • Slide 43
  • Init mode: freeze the indices and migrate the wireframe. (Figure: the source still owns pages P1 through Pn and continues executing transactions TS1 through TSk; the destination holds the wireframe but owns no pages yet.)
  • Slide 44
  • What is an index wireframe? (Figure: the source and destination index structures side by side.)
  • Slide 45
  • Dual mode: requests for un-owned pages can block; index wireframes remain frozen. (Figure: old, still-active transactions TSk+1 through TSl run at the source while new transactions TD1 through TDm run at the destination; when page P3 is accessed by a destination transaction TDi, it is pulled from the source. See the page-pull sketch below.)
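    A small sketch of the on-demand page pull during dual mode, again with hypothetical types: a destination transaction that touches an un-owned page blocks while the page and its ownership are fetched from the source exactly once.

        // Hypothetical structure: every page has exactly one owner, a destination
        // transaction that touches an un-owned page blocks while the page is fetched,
        // and a migrated page is never pulled back by the source.
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        public class DestinationPageManager {
            private final Map<Integer, byte[]> ownedPages = new ConcurrentHashMap<>();
            private final PageFetcher source;      // fetches a page and transfers its ownership

            public DestinationPageManager(PageFetcher source) {
                this.source = source;
            }

            // Called by a destination transaction; blocks if the page must be pulled first.
            public byte[] readPage(int pageId) {
                return ownedPages.computeIfAbsent(pageId, source::pullAndTransferOwnership);
            }
        }

        interface PageFetcher {
            byte[] pullAndTransferOwnership(int pageId);   // the source gives up the page for good
        }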
  • Slide 46
  • Finish mode: remaining pages are pushed from the source, and pages can still be pulled by the destination if needed. (Figure: pages P1, P2, and others are pushed from the source while transactions TDm+1 through TDn execute at the destination; migration completes once the destination owns all pages.)
  • Slide 47
  • Normal operation: the index wireframe is un-frozen. (Figure: the destination owns all pages P1 through Pn and executes transactions TDn+1 through TDp.)
  • Slide 48
  • Artifacts of this design: once migrated, pages are never pulled back by the source, so transactions at the source accessing migrated pages are aborted; there are no structural changes to indices during migration, so transactions (at both nodes) that make structural changes to indices are aborted; the destination pulls pages on demand, so transactions at the destination experience higher latency compared to normal operation.
  • Slide 49
  • Serializability: the only concern is dual mode, since in Init and Finish only one node executes transactions. Local predicate locking of the internal index and exclusive page ownership prevent phantoms; strict 2PL makes transactions locally serializable; pages are transferred only once, so there is no Tdest-to-Tsource conflict dependency. Serializability is guaranteed.
  • Slide 50
  • Recovery. Transaction recovery: for every database page, source transactions are ordered before destination transactions (Tsrc before Tdst), and on recovery transactions are replayed in conflict order. Migration recovery: atomic transitions between migration modes, using logging and handshake protocols developed for this purpose; every page has exactly one owner, with bookkeeping at the index level.
  • Slide 51
  • Correctness: in the presence of arbitrary repeated failures, Zephyr ensures that updates made to database pages are consistent, that a failure does not leave a page without an owner, and that both source and destination are in the same migration mode; it also guarantees termination and starvation freedom.
  • Slide 52
  • Implementation: prototyped using H2, an open-source OLTP database that supports a standard SQL/JDBC API, a serializable isolation level, tree indices, and the relational data model. The database engine was modified (~6,000 LOC): support for freezing indices was added, and page migration status is maintained using the index. The Tungsten SQL Router migrates JDBC connections during migration.
  • Slide 53
  • Results Overview. Downtime (partition unavailability): stop-and-copy: 3 to 8 seconds (the time needed to migrate, during which the partition is unavailable for updates); Zephyr: no downtime, since either the source or the destination is always available. Service interruption (failed operations): stop-and-copy: ~100s to 1,000s of failed operations, since all transactions with updates are aborted; Zephyr: ~10s to 100s, an order of magnitude less interruption. Minimal operational and data transfer overhead.
  • Slide 54
  • Failed Operations. (Figure: Zephyr incurs an order of magnitude fewer failed operations than stop-and-copy.)
  • Slide 55
  • Concluding Remarks. Major enabling technologies: a scalable distributed database infrastructure (ElasTraS), dynamically formed data partitions (G-Store), and live database migration (Albatross, Zephyr).
  • Slide 56
  • Future Directions: a self-managing controller for large multitenant database infrastructures; novel data management architectures leveraging advances in hardware; convergence of transactional and analytics systems for real-time intelligence; putting the human in the loop by leveraging crowd-sourcing.
  • Slide 57
  • Thank you! Collaborators. UCSB: Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu, Shashank Agarwal, Shyam Antony, Aaron Elmore, Shoji Nishimura (NEC Japan). Microsoft Research Redmond: Phil Bernstein, Colin Reid. IBM Almaden: Yannis Sismanis, Kevin Beyer, Rainer Gemulla, Peter Haas, John McPherson.