
Probabilistically Consistent Indranil Gupta (Indy) Department of Computer Science, UIUC [email protected] FuDiCo 2015 DPRG: 1


Probabilistically Consistent

Indranil Gupta (Indy)
Department of Computer Science, UIUC

[email protected]

FuDiCo 2015
DPRG:


Joint Work With

• Muntasir Rahman (Graduating PhD Student)
• Luke Leslie, Lewis Tseng
• Mayank Pundir (MS, now at Facebook)

• Work funded by Air Force Research Labs/AFOSR, National Science Foundation, Google, Yahoo!, and Microsoft


Hard Choices in Extensible Distributed Systems

• Users in extensible distributed systems desire
  • Timeliness and Correctness Guarantees
• But these are at odds with…
  • Unpredictability
  • Network Delays and Failures
• Research community and industry often tend to translate this into hard choices in systems design
• Examples
  1. CAP Theorem: choice between consistency and availability (or latency)
     • Either relational databases or eventually consistent NoSQL stores
     • (Maybe a convergence now?)
  2. Always get 100% answers in computation engines (batch or stream)
     • Use checkpointing


Hard Choices… Can in fact be Probabilistic Choices!

• Many of these are in fact probabilistic choices
  • One of the earliest examples: pbcast/Bimodal Multicast
• Examples
  1. CAP Theorem:
     • We derive a probabilistic CAP theorem that defines an achievable boundary between consistency and latency in any database system
     • We use this to incorporate probabilistic consistency and latency SLAs into Cassandra and Riak
  2. Always get 100% answers in computation engines (batch or stream)
     • In many systems, checkpointing results in 8-31x higher execution time!
     • We show that in systems like distributed graph processing systems
       • We can avoid checkpointing altogether
       • Instead, have a reactive approach: upon failure, reactively scrounge state (naturally replicated)
       • And achieve very high accuracy (95-99%)


Key-value/NoSQL Storage Systems

• Key-value/NoSQL stores: $3.4B sector by 2018
• Distributed storage in the cloud
  • Netflix: video position (Cassandra)
  • Amazon: shopping cart (DynamoDB)
  • And many others
• Necessary API operations: get(key) and put(key, value)
  • And some extended operations, e.g., “CQL” in Cassandra key-value store
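
The get/put contract named above can be illustrated with a minimal in-memory sketch. The class name `TinyKVStore` and the example key are hypothetical; this is not the API of Cassandra or DynamoDB, only the shape of the two necessary operations.

```python
from typing import Dict, Optional


class TinyKVStore:
    """Hypothetical single-node key-value store exposing only get and put."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        # Overwrite any previous value stored under the key.
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        # Return the stored value, or None if the key was never written.
        return self._data.get(key)


store = TinyKVStore()
store.put("user42:video_position", "00:17:05")
print(store.get("user42:video_position"))
```

A real store replicates each key across servers, which is where the latency and consistency questions in the following slides come from.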


Key-value/NoSQL Storage: Fast and Fresh

• Cloud clients expect both
  • Latency: Low latency for all operations (reads/writes)
    • 500ms latency increase at costs 20% drop in revenue
    • each extra ms $4 M revenue loss
    • Long latency -> User Cognitive Drift
  • Consistency: read returns value of one of latest writes
    • Freshness of data means accurate tracking and higher user satisfaction
    • Most KV stores only offer weak consistency (Eventual consistency)
    • Eventual consistency = if writes stop, all replicas converge, eventually


Hard vs. Soft Partitions

• CAP Theorem looks at hard partitions
• However, soft partitions may happen inside a data-center
  • Periods of elevated message delays
  • Periods of elevated loss rates
• Soft partitions are more frequent

[Figure: Data-center 1 (America) and Data-center 2 (Europe) separated by a hard partition; congestion at switches => soft partition]


Our work: From Impossibility to Possibility

• C -> Probabilistic C (Consistency)
• A -> Probabilistic A (Latency)
• P -> Probabilistic P (Partition Model)
• A probabilistic CAP theorem
• A system that validates how close we are to the achievable envelope
• (Goal is not: another consistency model, or NoSQL vs New/Yes SQL)




Probabilistic CAP

[Figure: timeline of writes W(1), W(2) and read R(1)]

Probabilistic Consistency (pic, tc)
• A read is tc-fresh if it returns the value of a write that starts at most tc time before the read
• pic is the likelihood that a read is NOT tc-fresh

Probabilistic Latency (pua, ta)
• pua is the likelihood that a read DOES NOT return an answer within ta time units

Probabilistic Partition (α, tp)
• α is the likelihood that a random path (client -> server -> client) has message delay exceeding tp time units

PCAP Theorem: Impossible to achieve both Probabilistic Consistency and Latency under Probabilistic Partitions if:

tc + ta < tp and pua + pic < α

• Bad network -> High (α, tp)
• To get better consistency -> lower (pic, tc)
• To get better latency -> lower (pua, ta)

Full proof in our arXiv paper:

Special case: Original CAP has α=1 and tp = ∞
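
The impossibility condition can be checked mechanically. The following is a minimal sketch that encodes only the inequality stated on this slide; the function name `pcap_rules_out` and the numbers are illustrative assumptions, not taken from the paper.

```python
import math


def pcap_rules_out(pic: float, tc: float, pua: float, ta: float,
                   alpha: float, tp: float) -> bool:
    """Return True when the PCAP theorem rules out the requested targets.

    The condition is the one on the slide: tc + ta < tp and pua + pic < alpha.
    """
    return (tc + ta < tp) and (pua + pic < alpha)


# Illustrative network (assumed numbers): 30% of paths exceed 10 ms.
alpha, tp = 0.3, 10.0

# Targets that are too tight for this network: 2 + 5 < 10 and 0.1 + 0.1 < 0.3.
print(pcap_rules_out(pic=0.1, tc=2.0, pua=0.1, ta=5.0, alpha=alpha, tp=tp))

# Relaxing the timeout to 9 ms breaks the first inequality (2 + 9 >= 10).
print(pcap_rules_out(pic=0.1, tc=2.0, pua=0.1, ta=9.0, alpha=alpha, tp=tp))

# Special case from the slide: alpha = 1 and tp = infinity.
print(pcap_rules_out(pic=0.0, tc=0.0, pua=0.0, ta=1.0, alpha=1.0, tp=math.inf))
```

A result of False only means the theorem does not forbid the targets; it does not by itself show that a given system achieves them.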



Towards Probabilistic SLAs

• Latency SLA: Similar to latency SLAs already existing in industry.
  • Meet a desired probability that client receives operation’s result within the timeout
  • Maximize freshness probability within given freshness interval
• Example: Amazon shopping cart
  • Doesn’t want to lose customers due to high latency
  • Only 10% operations can take longer than 300ms
    • SLA: (pua, ta) = (0.1, 300ms)
  • Minimize staleness (don’t want customers to lose items)
    • Minimize: pic (Given: tc)
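
One way to read the latency SLA (pua, ta) = (0.1, 300ms) is as a bound on the fraction of operations slower than ta. The sketch below checks that bound on a list of measured latencies; the function names and the sample values are made up for illustration.

```python
from typing import Sequence


def fraction_late(latencies_ms: Sequence[float], ta_ms: float) -> float:
    """Empirical estimate of pua: the share of operations slower than ta."""
    late = sum(1 for x in latencies_ms if x > ta_ms)
    return late / len(latencies_ms)


def meets_latency_sla(latencies_ms: Sequence[float], pua: float, ta_ms: float) -> bool:
    """True when at most a pua fraction of operations exceeds the timeout ta."""
    return fraction_late(latencies_ms, ta_ms) <= pua


# Hypothetical measurements in milliseconds; exactly one of ten exceeds 300 ms.
sample = [120, 80, 310, 95, 150, 200, 60, 90, 110, 130]
print(fraction_late(sample, 300))
print(meets_latency_sla(sample, pua=0.1, ta_ms=300))
```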



Towards Probabilistic SLAs (2)

• Consistency SLA: Goal is to
  • Meet a desired freshness probability (given freshness interval)
  • Maximize probability that client receives operation’s result within the timeout
• Example: Google search application/Twitter search
  • Wants users to receive “recent” data as search results
  • Only 10% results can be more than 5 min stale
    • SLA: (pic, tc) = (0.1, 5 min)
  • Minimize response time (fast response to query)
    • Minimize: pua (Given: ta)
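
The consistency SLA is a constrained optimization: keep pic at or below the target, then make pua as small as possible. A minimal sketch of that selection step follows, assuming a hypothetical table of measured (pic, pua) pairs for several settings of one knob; the names and numbers are invented for illustration.

```python
from typing import List, NamedTuple, Optional


class Measurement(NamedTuple):
    read_delay_ms: float  # hypothetical knob setting
    pic: float            # measured probability that a read is not tc-fresh
    pua: float            # measured probability that a read misses the timeout ta


def best_under_consistency_sla(measurements: List[Measurement],
                               pic_target: float) -> Optional[Measurement]:
    """Among settings that meet the pic target, pick the one with the lowest pua."""
    feasible = [m for m in measurements if m.pic <= pic_target]
    if not feasible:
        return None  # no setting satisfies the consistency SLA
    return min(feasible, key=lambda m: m.pua)


# Made-up measurements: more read delay lowers pic but raises pua.
table = [
    Measurement(0.0, 0.30, 0.02),
    Measurement(5.0, 0.12, 0.05),
    Measurement(10.0, 0.08, 0.09),
    Measurement(20.0, 0.03, 0.20),
]
choice = best_under_consistency_sla(table, pic_target=0.1)
print(choice)
```

The latency SLA of the previous slide is the mirror image: fix pua and minimize pic.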


Meeting these SLAs: PCAP Systems

Increased Knob      Latency      Consistency
Read Delay          Degrades     Improves
Read Repair Rate    Unaffected   Improves
Consistency Level   Degrades     Improves

Continuously adapt control knobs to always satisfy PCAP SLA

KV-store (Cassandra,





System assumptions:
• Client sends query to coordinator server which then forwards to replicas (answers reverse path)
• There exist background mechanisms to bring stale replicas up to date
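
The idea of continuously adapting a knob can be sketched with a fixed-step feedback rule on read delay. This is an illustrative assumption, not the controller used in the PCAP system: the step size, the function names, and the toy model relating read delay to pic are all made up. Only the direction of the knob (more read delay improves consistency) comes from the table above.

```python
def adapt_read_delay(read_delay_ms: float, measured_pic: float,
                     pic_target: float, step_ms: float = 1.0) -> float:
    """One control step: raise read delay when the consistency SLA is violated,
    otherwise lower it to win back latency (never below zero)."""
    if measured_pic > pic_target:
        return read_delay_ms + step_ms
    return max(0.0, read_delay_ms - step_ms)


def toy_pic(read_delay_ms: float) -> float:
    # Made-up system model (assumption): staleness probability shrinks as read delay grows.
    return 0.4 / (1.0 + read_delay_ms)


delay = 0.0
history = []
for _ in range(6):
    delay = adapt_read_delay(delay, toy_pic(delay), pic_target=0.135)
    history.append(delay)
print(history)
```

In this toy run the delay ends up alternating between two neighboring settings around the target, which is the kind of oscillation a smoother controller would try to reduce.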


Meeting Consistency SLA for PCAP Cassandra (pic=0.135)

Consistency always below target SLA

Setup
• 9 server Emulab cluster: each server has 4 Xeon + 12 GB RAM
• 100 Mbps Ethernet
• YCSB workload (144 client threads)
• Network delay: Log-normal distribution [Benson 2010]

Mean latency = 3 ms | 4 ms | 5 ms
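
For readers who want to reproduce a log-normal delay with a given mean, the relation mean = exp(mu + sigma^2 / 2) fixes mu once sigma is chosen. The sketch below uses Python's standard `random.lognormvariate`; the value of sigma is an assumption, since the slide only states the mean latencies.

```python
import math
import random


def lognormal_mu_for_mean(mean_ms: float, sigma: float) -> float:
    """Solve mean = exp(mu + sigma^2 / 2) for mu."""
    return math.log(mean_ms) - sigma ** 2 / 2.0


SIGMA = 0.5  # assumption: the spread is not given on the slide
rng = random.Random(0)  # fixed seed so the run is repeatable

for mean_ms in (3.0, 4.0, 5.0):
    mu = lognormal_mu_for_mean(mean_ms, SIGMA)
    samples = [rng.lognormvariate(mu, SIGMA) for _ in range(100_000)]
    # The empirical mean should land close to the requested mean.
    print(mean_ms, round(sum(samples) / len(samples), 2))
```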


Meeting Consistency SLA for PCAP Cassandra (pic=0.135)

Optimal envelopes under different Network conditions (based on PCAP theorems)

PCAP system satisfies the SLA and is close to the optimal envelope


Geo-Distributed PCAP


N(20, sqrt(2)) | N(22, sqrt(2.2))

Latency SLA met before and after jump

Consistency degrades after delay jump

Fast convergence initially, and after delay jump

Reduced oscillation, compared to multiplicative controller

PCAP multiplicative controller


Related Work

• Pileus/Tuba [Doug Terry et al]
  • Utility-based SLAs
  • Focus on wide-area
  • Can be used underneath our PCAP system (instead of our SLAs)
• Consistency Metrics: PBS [Peter Bailis et al]
  • Considers write end time (we consider write start time)
  • May not be able to define consistency for some read-write pairs (PCAP accommodates all combinations)
  • Can use it in PCAP system
• Approximate answers: Hadoop [ApproxHadoop], Querying [BlinkDB], Bimodal multicast



PCAP Summary

• CAP Theorem motivated NoSQL Revolution
• But apps need freshness + fast responses
  • Under soft partition
• We proposed
  • Probabilistic models for C, A, P
  • Probabilistic CAP theorem – generalizes classical CAP
  • PCAP system satisfies Latency/Consistency SLAs
  • Integrated into Apache Cassandra and Riak KV stores
• Riak has expressed interest in incorporating these into their mainline code



Distributed Graph Processing and Checkpointing

• Checkpointing: Proactively save state to persistent storage
  • If there’s a failure, recover 100% cost
• Used by:
  • PowerGraph [Gonzalez et al. OSDI 2012]
  • Giraph [Apache Giraph]
  • Distributed GraphLab [Low et al. VLDB 2012]
  • Hama [Seo et al. CloudCom 2010]



Checkpointing Bad


Graph Dataset   Vertex Count   Edge Count
CA-Road         1.96 M         2.77 M
Twitter         41.65 M        1.47 B
UK Web          105.9 M        3.74 B




8 – 31x Increased Per-Iteration Execution Time


Users Already Don’t (Use or Like) Checkpointing

• “While we could turn on checkpointing to handle some of these failures, in practice we choose to disable checkpointing.” [Ching et al. (Giraph @ Facebook) VLDB 2015]

• “Existing graph systems only support checkpoint-based fault tolerance, which most users leave disabled due to performance overhead.” [Gonzalez et al. (GraphX) OSDI 2014]

• “The choice of interval must balance the cost of constructing the checkpoint with the computation lost since the last checkpoint in the event of a failure.” [Low et al. (GraphLab) VLDB 2012]

• “Better performance can be obtained by balancing fault tolerance costs against that of a job restart.” [Low et al. (GraphLab) VLDB 2012]



Our Approach: Zorro

• No checkpointing. Common case is fast.
• When failure occurs, opportunistically scrounge state (from surviving servers) and continue computation
• Natural replication in distributed processing systems
  • A vertex’s data is present at its neighbor vertices
  • Each vertex assigned to one server, and its neighbors likely on other servers
• We get very high accuracy (95%+)
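
The natural-replication argument can be made concrete with a small sketch. It assumes an undirected graph in which the server owning a vertex also holds a copy of each neighbor's latest value; that model, the function name, and the example graph are illustrative assumptions and not Zorro's actual recovery protocol.

```python
from typing import Dict, Iterable, Set, Tuple


def recoverable_vertices(edges: Iterable[Tuple[str, str]],
                         owner: Dict[str, int],
                         failed: Set[int]) -> Tuple[Set[str], Set[str]]:
    """Return (lost, recovered): vertices whose owner failed, and the subset
    that has at least one neighbor on a surviving server."""
    lost = {v for v, server in owner.items() if server in failed}
    recovered: Set[str] = set()
    for u, v in edges:
        # Assumption: each endpoint's server caches the other endpoint's state.
        if u in lost and owner[v] not in failed:
            recovered.add(u)
        if v in lost and owner[u] not in failed:
            recovered.add(v)
    return lost, recovered


# Hypothetical graph: vertex f is only adjacent to d, and both live on server 1.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]
owner = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 2, "f": 1}

lost, recovered = recoverable_vertices(edges, owner, failed={1})
print(sorted(lost), sorted(recovered))
```

Vertices such as `f`, whose neighbors all sit on failed servers, are the ones whose state cannot be scrounged, which is why the approach trades a small loss in accuracy for the absence of checkpointing.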



Natural Replication => Can Retrieve a Lot of State


87 – 95% Graph State is Recoverable Even After Half the Servers Fail

[Figure panels: PowerGraph, LFGraph; annotated ranges: 92 – 95% and 87 – 91%]



Natural Replication => Low Inaccuracy

[Figure panels: PowerGraph, LFGraph]




Natural Replication => Low Inaccuracy

Algorithm                      PowerGraph   LFGraph
PageRank                       2 %          3 %
Single-Source Shortest Paths   0.0025 %     0.06 %
Connected Components           1.6 %        2.15 %
K-Core                         0.0054 %     1.4 %
Graph Coloring*                5.02 %       NA
Group-Source Shortest Paths*   0.84 %       NA
Triangle Count*                0 %          NA
Approximate Diameter*          0 %          NA



• Impossibility theorems and 100% correct answers are great
• But they entail
  • Inflexibility in design (NoSQL or SQL)
  • High overhead (Checkpointing)
• Important to explore
  • Probabilistic tradeoffs and Achievable envelopes
  • Leads to more flexibility in design
• Other applicable areas: stream processing, machine learning



Plug: MOOC on “Cloud Computing Concepts”

• Free course, On Coursera
  • Ran Feb-Apr 2015
  • 120K+ students
  • Next run: Spring 2016
• Covered distributed systems and algorithms used in cloud computing
• Free and Open to everyone
• Or do a search on Google for “Cloud Computing Course” (click on first



Backup Slides



PCAP Consistency Metric Is more Generic Than PBS





A read is tc-fresh if it returns the value of a write that starts at most tc time before the read starts

W(1) and R(1) can overlap

A read is tc-fresh if it returns the value of a write that starts at most tc time before the read ends

W(1) and R(1) cannot overlap




GeoPCAP: 2 Key Techniques

(1) Prob Composition Rules
• Given client C or L SLA:
  • QUICKEST: at least one DC satisfies SLA
  • ALL: each DC satisfies SLA

(2) Tune Geo-delay using PID Control

[Figure: Client Read, SLA arriving at the Local DC; per-DC models Prob C1, L1 | Prob C2, L2 | Prob C3, L3; Prob WAN Model; composed model Prob CC, LC; Δ Δ Δ]
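
A textbook discrete PID loop gives a feel for the second technique. The sketch below is a generic controller, not GeoPCAP's implementation: the gains, the `toy_pic` model relating geo-delay to staleness, and the target value are all assumptions chosen for illustration.

```python
class PIDController:
    """Generic discrete PID controller (textbook form)."""

    def __init__(self, kp: float, ki: float, kd: float) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error: float, dt: float = 1.0) -> float:
        # Accumulate the integral term and estimate the derivative from the last error.
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def toy_pic(geo_delay_ms: float) -> float:
    # Made-up model (assumption): more geo-delay means fewer stale reads.
    return 0.4 / (1.0 + geo_delay_ms)


PIC_TARGET = 0.135                                # illustrative target
pid = PIDController(kp=20.0, ki=2.0, kd=0.0)      # assumed gains
geo_delay_ms = 0.0
for _ in range(50):
    error = toy_pic(geo_delay_ms) - PIC_TARGET    # positive when reads are too stale
    geo_delay_ms = max(0.0, geo_delay_ms + pid.update(error))
print(round(geo_delay_ms, 2), round(toy_pic(geo_delay_ms), 3))
```

The proportional term reacts to the current SLA gap while the integral term removes the residual offset; the gains above are arbitrary and would need tuning for any real deployment.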


CAP Theorem => NoSQL Revolution

• Conjectured: [Brewer 00]
• Proved: [Gilbert Lynch 02]
• Kicked off NoSQL revolution
• Abadi’s PACELC
  • If P, choose A or C
  • Else, choose L (latency) or C

[Figure: CAP diagram; labels: Partition-tolerance, Availability (Latency); systems shown: RDBMSs (non-replicated); Cassandra, RIAK, Dynamo, Voldemort; HBase, HyperTable, BigTable, Spanner]
