
Post on 21-Mar-2021


Fault-tolerant and Transactional Stateful Serverless Workflows

Haoran Zhang, Adney Cardoza, Peter Baile Chen, Sebastian Angel, Vincent Liu

What is serverless?

Developer, Client, User

Cloud: API Gateway, Workers, Shared Database

Workers can fail!

How could serverless go wrong?

Client: Send Request → Receive Error/Timeout → Should I Retry?

Cloud: Start → N = Read("attendees") → Write("attendees", N+1) → End

Write Idempotent Functions!

Beldi makes stateful serverless functions idempotent automatically!
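The retry problem above can be illustrated with a minimal sketch. It assumes an in-memory dictionary in place of the shared database, and the names `register`, `register_idempotent`, and the request ID `req-1` are hypothetical; it is not Beldi's mechanism, only the property Beldi aims to provide.

```python
# Hypothetical in-memory stand-in for the shared database on the slides.

def register(db):
    """Non-idempotent handler: N = Read("attendees"); Write("attendees", N+1)."""
    n = db["attendees"]
    db["attendees"] = n + 1


def register_idempotent(db, applied, request_id):
    """Same handler, but a repeated request_id is treated as a no-op."""
    if request_id in applied:
        return
    n = db["attendees"]
    db["attendees"] = n + 1
    applied.add(request_id)


# The first attempt succeeds in the cloud, but the client only sees a
# timeout and retries the request.
naive_db = {"attendees": 10}
register(naive_db)
register(naive_db)
naive_count = naive_db["attendees"]  # the attendee is counted twice

safe_db = {"attendees": 10}
applied = set()  # request IDs whose write has already been applied
register_idempotent(safe_db, applied, "req-1")
register_idempotent(safe_db, applied, "req-1")
safe_count = safe_db["attendees"]  # the retry has no further effect
```

The sketch is single-threaded; the check and the write are not atomic here, which is one of the issues the later slides address.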

Outline

• Beldi’s Infrastructure
• Linked DAAL
• Invocation with exactly-once semantics
• Evaluation
• Conclusion

Beldi’s architecture

Worker: Start → N = Read("attendees") → Write("attendees", N+1) → End

Beldi Runtime: Database API, Invocation API, Transaction API

Other components: Progress Lambda, Garbage Collector

Storage, after N = Read("attendees"):

Key        Value
attendees  10

InstanceId  Done
d78590e     False

Operation   Value
d78590e-1   10

Storage, after Write("attendees", N+1):

Key        Value
attendees  11

Operation   Value
d78590e-1   10
d78590e-2

Problem: ➀ and ➁ (the data update and the write-log entry) must be done atomically
Solution: Collocate write log with the data!

Key        Value  RecentWrites
attendees  11     [d78590e-2]
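The read log and the collocated write log can be sketched as follows. This is an illustration under stated assumptions, not Beldi's implementation: the `Row` and `Runtime` classes are hypothetical, a `threading.Lock` stands in for the database's atomic single-row update, and operation IDs are formed as `InstanceId-step`, as on the slides.

```python
import threading


class Row:
    """One storage row: the value plus the IDs of recent writes (collocated log)."""

    def __init__(self, value):
        self.value = value
        self.recent_writes = []
        self.lock = threading.Lock()  # stands in for an atomic single-row update


class Runtime:
    """Hypothetical wrapper that numbers each operation of one instance."""

    def __init__(self, instance_id, table, read_log):
        self.instance_id = instance_id
        self.table = table
        self.read_log = read_log
        self.step = 0

    def _next_op(self):
        self.step += 1
        return f"{self.instance_id}-{self.step}"

    def read(self, key):
        op = self._next_op()
        if op not in self.read_log:  # first execution: log the value read
            self.read_log[op] = self.table[key].value
        return self.read_log[op]     # re-execution: replay the logged value

    def write(self, key, value):
        op = self._next_op()
        row = self.table[key]
        with row.lock:               # value and log entry change together
            if op in row.recent_writes:
                return               # this write was already applied
            row.value = value
            row.recent_writes.append(op)


def handler(rt):
    n = rt.read("attendees")
    rt.write("attendees", n + 1)


table = {"attendees": Row(10)}
read_log = {}
handler(Runtime("d78590e", table, read_log))  # first execution
handler(Runtime("d78590e", table, read_log))  # re-execution after a failure
```

Because both executions derive the same operation IDs, the second run replays the logged read and skips the write, so the counter is incremented only once.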

Technical Challenges

1. Limitation of databases

2. Federated setup

3. Transactions across multiple lambdas

Limitation of databases

Key        Value  RecentWrites
attendees  10     [d78590e-1, d78590e-2, …, d78590e-1000]

Solution: spread the log for a given key across multiple rows

RowId    Key        Value  RecentWrites                             NextRow
HEAD     attendees  10     [d78590e-1, d78590e-2, …, d78590e-1000]  f9cec2e
f9cec2e  attendees  11     [d78590e-1001]
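A minimal sketch of spreading the log across rows, assuming a per-row capacity `MAX_LOG` of 1000 entries (chosen only to match the slide), random row IDs, and a plain dictionary in place of the database. It is single-threaded and omits the atomic conditional updates a real store would need.

```python
import uuid

MAX_LOG = 1000  # hypothetical per-row log capacity, matching the slide's example


def new_row(key, value):
    return {"Key": key, "Value": value, "RecentWrites": [], "NextRow": None}


def write(rows, key, value, op_id):
    """Apply a write once, appending a new row when the tail's log is full."""
    row_id = "HEAD"
    while True:
        row = rows[row_id]
        if op_id in row["RecentWrites"]:
            return  # this operation was already applied
        if row["NextRow"] is None:
            break   # row is the tail
        row_id = row["NextRow"]
    if len(row["RecentWrites"]) >= MAX_LOG:
        new_id = uuid.uuid4().hex[:7]  # assumed ID scheme, for illustration only
        rows[new_id] = new_row(key, row["Value"])
        row["NextRow"] = new_id
        row = rows[new_id]
    row["Value"] = value
    row["RecentWrites"].append(op_id)


rows = {"HEAD": new_row("attendees", 10)}
for i in range(1, 1001):
    write(rows, "attendees", 10, f"d78590e-{i}")
write(rows, "attendees", 11, "d78590e-1001")  # overflows into a second row
write(rows, "attendees", 11, "d78590e-1001")  # replayed write is a no-op
tail = rows[rows["HEAD"]["NextRow"]]
```

In this sketch the latest value lives in the tail row, while earlier rows keep the value they held when they filled up, as in the two-row example above.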

Linked DAAL

RowId  Key  Value  RecentWrites  NextRow
HEAD   Key  Value  RecentWrites  NextRow
RowId  Key  Value  RecentWrites  NextRow
RowId  Key  Value  RecentWrites  NextRow

Primary Key: {RowId, Key}

How do we traverse to the tail?

Solution: Use scan and projection to download a skeleton version of Linked DAAL

RowId  NextRow
HEAD   NextRow   (256 bits)
RowId  NextRow
RowId  NextRow
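Traversing the skeleton can be sketched as below. The `project` helper is a hypothetical stand-in for a scan with a projection onto `RowId` and `NextRow`, and the row ID `3b1a0c4` is made up for the example.

```python
def project(rows):
    """Keep only RowId -> NextRow, as a projected scan would return."""
    return {row_id: row["NextRow"] for row_id, row in rows.items()}


def find_tail(skeleton):
    """Follow NextRow pointers from HEAD and return the last known RowId."""
    row_id = "HEAD"
    seen = {row_id}
    while skeleton.get(row_id) is not None:
        row_id = skeleton[row_id]
        if row_id in seen:
            raise ValueError("cycle in NextRow pointers")
        seen.add(row_id)
    return row_id


rows = {
    "HEAD": {"Key": "attendees", "Value": 10, "RecentWrites": [], "NextRow": "f9cec2e"},
    "f9cec2e": {"Key": "attendees", "Value": 11, "RecentWrites": [], "NextRow": "3b1a0c4"},
    "3b1a0c4": {"Key": "attendees", "Value": 12, "RecentWrites": [], "NextRow": None},
}
skeleton = project(rows)
tail_id = find_tail(skeleton)
```

Only the small skeleton is traversed, so the values and logs of intermediate rows are never downloaded. If a row was appended after the scan, the function returns the tail as of the scan, and the caller would continue from there.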


Invocation with exactly-once semantics

Lambda1 → Lambda2: Call Lambda2

Lambda1's log:

Operation  Callee
d78590e-1  b97bbe0

Lambda2's table:

InstanceId  Done
b97bbe0     False

Steps in Lambda2: Log in Progress Table → make some writes → Mark as Done (b97bbe0 becomes True)

Failure (X): Call Lambda2 is issued again. The entry b97bbe0 already has Done = True, and Lambda1 receives the response.

Problem: once GC removes b97bbe0 from the table, a repeated Call Lambda2 logs b97bbe0 with Done = False again and makes the writes a second time.

Solution: add a Result column to Lambda1's log

Operation  Callee   Result
d78590e-1  b97bbe0  result

Steps in Lambda2: Log in Intent Table → make some writes → Callback (stores the result in Lambda1's log) → Mark as Done
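The callback-based invocation can be sketched as follows. This is a simplified illustration, not Beldi's protocol as implemented: the `Store` class, the `applied` set standing in for the write log, and the `gc` function are hypothetical, and everything runs in one process.

```python
class Store:
    """Hypothetical in-memory stand-in for the tables on the slides."""

    def __init__(self):
        self.invoke_log = {}  # caller side: Operation -> {"Callee", "Result"}
        self.intent = {}      # callee side: InstanceId -> Done
        self.data = {"attendees": 10}
        self.applied = set()  # stands in for the callee's write log


def lambda2(store, instance_id, caller_op):
    store.intent.setdefault(instance_id, False)     # Log in Intent Table
    op = f"{instance_id}-1"
    if op not in store.applied:                     # make some writes, once
        store.data["attendees"] += 1
        store.applied.add(op)
    result = store.data["attendees"]
    store.invoke_log[caller_op]["Result"] = result  # Callback into caller's log
    store.intent[instance_id] = True                # Mark as Done
    return result


def call_lambda2(store, caller_op, new_callee_id):
    """Caller side: reuse the logged callee ID and result when they exist."""
    entry = store.invoke_log.setdefault(
        caller_op, {"Callee": new_callee_id, "Result": None}
    )
    if entry["Result"] is not None:
        return entry["Result"]                      # no second invocation
    return lambda2(store, entry["Callee"], caller_op)


def gc(store):
    """Drop finished instances together with their write-log entries."""
    for instance_id in [i for i, done in store.intent.items() if done]:
        del store.intent[instance_id]
        store.applied = {
            op for op in store.applied if not op.startswith(instance_id + "-")
        }


store = Store()
first = call_lambda2(store, "d78590e-1", "b97bbe0")
gc(store)                                             # callee state is collected
second = call_lambda2(store, "d78590e-1", "b97bbe0")  # Lambda1 re-executes
```

Since the result is stored in the caller's log before the callee is marked as done, collecting the callee's entry no longer causes a retried call to repeat the writes.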


Evaluation

1. What are the costs of Beldi’s API operations?

2. How does Beldi perform in real-world applications?

3. What is the effect of garbage collection?

What are the costs of Beldi’s API operations?

With 20 rows in Linked DAAL: 2-4× more expensive than baseline


How does Beldi perform in real-world applications?

Services: Client, Frontend, Search, Recommend, Reserve, User, Profile, Geo, Rate, Reserve Flight, Reserve Hotel

DeathStarBench (ASPLOS 19): open-source microservices benchmark
• Movie review service (cf. IMDB)
• Travel reservation (cf. Expedia)
• Social media site (cf. Twitter)


<400 req/s: 2× higher than baseline

700 req/s (saturation): 3.3× higher than baseline


Conclusion

1. A framework to write transactional and fault-tolerant applications on serverless.

2. A lock-free data structure (Linked DAAL) to support fast logging and exactly-once semantics.

3. A collaborative distributed transaction protocol across multiple lambdas

4. An efficient garbage collection algorithm that runs independently without affecting running lambdas or requiring any pauses.

https://github.com/eniac/beldi

Thank you!
