Post on 25-Jun-2020


TRANSCRIPT

Page 1: Robust Distributed System Nucleus (rDSN) · Program Test Debug Perf Tuning Deploy Operate Analysis Fault injection Model checking Global assertion Declarative Scenario test … Replay

Robust Distributed System Nucleus (rDSN)

https://github.com/Microsoft/rDSN

Zhenyu Guo (@imzhenyu)


Building robust distributed systems is important

• Our lives rely more and more on distributed (computer) systems
  • (Traditional) Reading, Entertainment, Communication, ...
  • (O2O) Taxi, Dating, Tickets, Doctor, ...
  • (IoT) Health/Security Monitoring, Smart Home/Car, IFTTT, …
  • …
• Alipay was offline for ~2 hrs starting 2015.5.27 5PM [1]
• …

[1] http://money.hexun.com/2015-05-29/176281908.html


Service robustness is far from being perfect


Much work has been done to enhance robustness

• Engineering process: Program, Test, Debug, Deploy, Operate, Reason
• Runtime: Fail-over, Scale-out, Replication

However, many of these efforts are ineffective, not adopted, or even do the opposite!

Example: replication w/ resource contention

[Figure: six nodes (A B C / D E F), each running Failure Detection, a Data Bus, and a replica (three X.Replica, three Y.Replica); one node is also marked X.Potential Replica]

Trying to enhance robustness with automated failure recovery can amplify a small cold into a contagious, deadly plague!

Example: What developers do in G2

Problem: An ERROR log entry is found. How do we find the relevant log entries, across thousands of machines, that may have led to this failure?

Case study: slice-based error log analysis using a simple query (Backward Slicing → Related Logs)
• Automatic focus on relevant log entries
• Declarative analysis without the headaches of distributed systems

Events.Where(e => e.Type == Event.ErrorLog && e.Payload.Contains("02/24/2009 07:20:27"))
      .Slicing(Slice.Backward)
      .Select(s => s.DumpHtml("AssertionLog"))
      .Count();
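The backward slicing step in the query can be illustrated with a small sketch. This is not G2's data model or API: the `Event` record, its `causes` field, and the sample events are all hypothetical, and the slice is computed as a plain breadth-first walk over "caused by" edges.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical event record; not G2's actual schema."""
    id: int
    kind: str                      # e.g. "ErrorLog", "Log", "Send", "Recv"
    payload: str
    causes: list = field(default_factory=list)  # ids of causally preceding events


def backward_slice(events, start_id):
    """Return ids of all events that may have led to `start_id` (inclusive)."""
    by_id = {e.id: e for e in events}
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        current = by_id[queue.popleft()]
        for cause in current.causes:
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    return seen


events = [
    Event(1, "Log", "machine A: request received"),
    Event(2, "Send", "A -> B prepare", causes=[1]),
    Event(3, "Log", "machine C: unrelated heartbeat"),
    Event(4, "Recv", "B got prepare", causes=[2]),
    Event(5, "ErrorLog", "02/24/2009 07:20:27 assertion failed", causes=[4]),
]

# Same filter as the query: error logs whose payload contains the timestamp.
errors = [e for e in events
          if e.kind == "ErrorLog" and "02/24/2009 07:20:27" in e.payload]
related = sorted(backward_slice(events, errors[0].id))
print(related)  # [1, 2, 4, 5]
```

The unrelated heartbeat (event 3) is excluded, which is the "automatic focus on relevant log entries" the slide refers to.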

What do we need to do to enable a given system for this tool?

• Redirect all the logs
• Instrument all asynchronous execution and message flows
  • Difficult to make complete (usually 10+, even 20+)
  • Sometimes unsafe (hidden assumptions)
  • Fragile under later system development
• Also required by other tools/frameworks, e.g., automated test, replay, replication, as illustrated later

So what we see today

[Figure: two images compared side by side (VS)]

What are the problems?

• Implicit contract
  • Hidden/volatile dependencies/assumptions
  • Unknown interference
  • Systems remain black boxes, so it is difficult (incomplete/unsafe/fragile) to build automation and robustness
• Reinventing the wheel
• Low quality
• Difficult to reuse

Problems (2)

• Lack of "Systems Thinking"
  • Local and global decisions do not align, which leads to robustness issues
  • e.g., resource contention, local failure recovery

Problems (3)

• Intrusive (afterthought) tooling support
  • Testing, debugging, and post-mortem analysis are difficult

This seems difficult to fix for legacy code. What if we come up with a new development framework?

Design goal

• Turn distributed systems from black boxes into white ones
  • Explicit dependency, non-determinism, and interference
  • At various granularities (e.g., task, module, service, …)
• Embrace Systems Thinking
  • Break the dilemma between modularity and global coordination
• Native tooling support
  • Build first-class tooling support into both system runtime and engineering process in a coherent way
• Inclusive
  • Support legacy components/languages/protocols/environments as much as we can

Robust Distributed System Nucleus (rDSN)

• Robustness is at the core of its design
• A cloud service development framework with pluggable modules
  • Applications => business logic
  • Frameworks => distributed system challenges
  • Devops tools => engineering and operation challenges
  • Local runtime/resource libraries
• Existing and growing modules benefiting each other
• An ecosystem for quickly building robust distributed systems

Key Design Decisions

Explicit contract via microkernel architecture

NO direct interaction among the frameworks, applications, tools, and local libraries, to ensure that dependencies and non-determinism are reliably captured and manipulated.

Explicit cross-service contract managed through declarative data and control flow

[Figure: service flow from "Search keyword" to "WebAnswer" involving Query Annotation (sQU2), Web Cache (sWebCache), L1 Static Ranking (sSaaS), L2 Dynamic Ranking (sRaaS), and Caption Store (sCDG)]

Late binding with deploy-time global view for modules, resources, and their connections

First-class tooling support

• Dedicated tool API for fast tool development
• Transparent and reliable integration with ALL upper apps and frameworks

[Figure: tools across the engineering process stages (Program, Test, Debug, Perf Tuning, Deploy, Operate, Analysis)]
• Fault injection, Model checking, Global assertion, Declarative Scenario test …
• Replay, Simulator
• Bottom-up analyzer, Profiler …
• Declarative deployment model, App store, Auto-deployment …
• Model-based simulation, Model-assisted static analysis …
• Code generator …
• Dashboard, Tracer, Profiler …

Common service model for auto scale-out and fail-over for ALL upper applications (micro service and storage)

All together: (1) enhance service robustness through both runtime support and engineering process; (2) integrate everything into a shared dev & op platform

[Figure: rDSN architecture overview, with parts numbered ❶–❺]
• Core: Service API (RPC, Task, AIO, Sync, Env), Tool API (Component Providers, Join Points, State Extensions), Tools, Runtime, Syntactic Libraries, Applications
• Service model: stateless/stateful, w/o or w/ replication, w/ or w/o partition
• Service flow: Request, Workflow Controller, Task Queue, Async Update, services S1, Service 2, S3, S4, S5
• Engineering process with integrated tools (Program, Test, Debug, Perf Tuning, Deploy, Operate, Analysis)
• Explicit contract and automated fail-over
• rDSN Studio

rDSN Deep Dive


Layer 1: a microkernel approach for independent development & seamless integration of applications, frameworks, tools, and local libraries

[Figure: rDSN architecture overview]

Deployment model with the microkernel architecture

• All module types (e.g., a service app or a tool) are materialized as dynamically linked libraries and registered into rDSN
• A loader (dsn.svchost) initiates the module instances

Application model (layer 1)

• Start is similar to the traditional main(argc, argv)
  • + service support: gpid (global service partition identity)
  • + tool support
• Multiple application states are allowed in the same process
• Application state can be cleaned up if necessary (in Destroy)

app = Create(name, gpid)
Start(app, argc, argv)
Destroy(app, is_cleanup)
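The three calls above can be mimicked in a few lines. This is only a sketch of the lifecycle, not the rDSN C API: `CounterApp`, the `APP_TYPES` registry, and the representation of gpid as an (app id, partition index) tuple are assumptions made for illustration.

```python
class CounterApp:
    """Hypothetical application type used only for illustration."""

    def __init__(self, name, gpid):
        self.name, self.gpid = name, gpid
        self.state = {}
        self.running = False

    def start(self, argv):
        self.state["args"] = list(argv)
        self.running = True

    def destroy(self, is_cleanup):
        self.running = False
        if is_cleanup:
            self.state.clear()   # drop application state on request


APP_TYPES = {"counter": CounterApp}   # registry of app types (assumed)


def create(name, gpid):
    return APP_TYPES[name](name, gpid)


def start(app, argv):
    app.start(argv)


def destroy(app, is_cleanup):
    app.destroy(is_cleanup)


# Two application instances (different partitions) in one process.
a = create("counter", gpid=(1, 0))
b = create("counter", gpid=(1, 1))
start(a, ["counter", "--port", "8001"])
start(b, ["counter", "--port", "8002"])
destroy(a, is_cleanup=True)
print(a.running, a.state, b.running)
```

Keeping all state inside the app object, rather than in process globals, is what allows several instances to share one process and be cleaned up individually.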


Task-centric (event-driven) execution model with fine-grained resource configuration

[Figure: tasks (events), i.e., Network Task, Local Task, and Disk Task, go through task queues into thread pools, e.g., Thread Pool One (Hashed) and Thread Pool Two (Shared)]

[task.RPC_PREPARE]
pool = THREAD_POOL_REPLICATION
is_trace = true
is_profile = false
; …

[thread_pool.THREAD_POOL_REPLICATION]
worker_count = 8
partitioned = true
priority = THREAD_PRIORITY_ABOVE_NORMAL
Affinity = 0x10
; …

Rethink the resource contention example at the beginning …
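To make the configuration concrete, the sketch below reads a trimmed copy of the two sections and decides where a task would run. It is an illustration only: the `pick_worker` helper is hypothetical, and it assumes that `partitioned = true` means each worker has its own queue selected by hashing a key (the "Hashed" pool in the figure).

```python
import configparser
import zlib

CONFIG = """
[task.RPC_PREPARE]
pool = THREAD_POOL_REPLICATION
is_trace = true
is_profile = false

[thread_pool.THREAD_POOL_REPLICATION]
worker_count = 8
partitioned = true
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG)


def pick_worker(task_code, hash_key):
    """Return (pool name, worker index or None) for a task."""
    pool = cfg[f"task.{task_code}"]["pool"]
    pool_cfg = cfg[f"thread_pool.{pool}"]
    workers = pool_cfg.getint("worker_count")
    if pool_cfg.getboolean("partitioned"):
        # Hashed pool: the same key always lands on the same worker queue.
        return pool, zlib.crc32(hash_key.encode()) % workers
    return pool, None   # shared queue: any idle worker may take the task


print(pick_worker("RPC_PREPARE", "partition-3"))
```

Because the task-to-pool binding lives in configuration, moving replication traffic to its own pool (and so isolating it from, say, failure detection) requires no change to application code.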

Service API

• Always a trade-off between programming generality and agility
• Considerations
  • Target service applications (no UI)
  • Follow common programming practice today
  • Be inclusive
    • API in C, wrapped in other languages for cross-language integration
  • + native tooling support
    • Non-deterministic behavior must go through this API so that it can be monitored and manipulated
    • With appropriate application-level semantics

Example: a simple counter service

[Figure: client and server code listings]

Tool abstractions and API

• Dedicated tool API for tool & local library development
• Three levels of tooling capability
  • Monitor
  • Tainting
  • Manipulation
• API: complete interposition for reliably controlling all dependencies and non-determinism in the system
  • Component providers
    • network, disk, task_queue, lock, perf_counter, logger, …
  • Join points (hooks)
    • monitor and manipulate all task-task and task-env interactions
  • State extensions
    • task, message, ...
• Two kinds of tools
  • "tools": only one tool exists in a process
  • "toollets": multiple toollets may co-exist
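A join point can be pictured as a named hook through which every interaction must pass. The sketch below is a simplification, not the rDSN tool API: the `JoinPoint` class, its `add`/`execute` methods, and the two toollets are hypothetical; only the hook name on_rpc_call comes from the slides. It shows two co-existing toollets, one that only monitors and one that manipulates.

```python
class JoinPoint:
    """A named hook; every registered advice runs when the point fires."""

    def __init__(self, name):
        self.name = name
        self._advices = []

    def add(self, advice):
        self._advices.append(advice)

    def execute(self, *args):
        allowed = True
        for advice in self._advices:
            if advice(*args) is False:   # a False result vetoes the action
                allowed = False
        return allowed


on_rpc_call = JoinPoint("on_rpc_call")
trace = []


def tracer_toollet(task, message):
    trace.append((task, message))      # monitor only


def drop_prepare_toollet(task, message):
    return message != "prepare"        # manipulate: drop "prepare" messages


on_rpc_call.add(tracer_toollet)
on_rpc_call.add(drop_prepare_toollet)


def rpc_call(task, message, network_send):
    """Service-API entry point: all sends pass through the join point."""
    if on_rpc_call.execute(task, message):
        network_send(message)
        return True
    return False


sent = []
rpc_call("client.task1", "ping", sent.append)
rpc_call("client.task1", "prepare", sent.append)
print(trace, sent)
```

The application only calls `rpc_call`; tracing and fault injection are attached from outside, which is why the interposition has to be complete for the tools to be reliable.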

Example: network providers and RPC tracing

[Figure: client and server stacks, each on top of a Service Kernel]
• Client: App: Client → Service API: dsn_rpc_call → Tool API: network::send, on_recv_reply → Native Network: boost::asio::async_write/async_read
• Server: App: Server → Service API: dsn_rpc_register, dsn_rpc_reply → Tool API: network::on_recv_request, send → Native Network: boost::asio::async_read/async_write
• A Tracer Toollet is attached on both sides through the join points on_rpc_call, on_rpc_request_enqueue, on_rpc_reply, and on_rpc_response_enqueue

Or, we use an emulated network, tracer, and fault injector

[Figure: the same client and server stacks on one Service Kernel, with the native network replaced by an Emulated Network]
• Client: App: Client → Service API: dsn_rpc_call → Tool API: network::send, on_recv_reply
• Server: App: Server → Service API: dsn_rpc_register, dsn_rpc_reply → Tool API: network::on_recv_request, send
• A Tracer/Fault Injector is attached on both sides through the join points on_rpc_call, on_rpc_request_enqueue, on_rpc_reply, and on_rpc_response_enqueue
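Swapping the network provider can be sketched as follows. This is an assumption-laden illustration rather than rDSN code: the `EmulatedNetwork` class, its `register`/`send` interface, the address string, and the drop rate are all hypothetical. The point is that the server handler is unchanged while the provider underneath delivers or drops messages in process.

```python
import random


class EmulatedNetwork:
    """In-process stand-in for a network provider (hypothetical interface)."""

    def __init__(self, seed, drop_rate=0.0):
        self._rng = random.Random(seed)
        self._drop_rate = drop_rate
        self._handlers = {}
        self.log = []

    def register(self, address, handler):
        self._handlers[address] = handler

    def send(self, address, message):
        if self._rng.random() < self._drop_rate:
            self.log.append(("dropped", address, message))
            return None                      # injected fault: message lost
        self.log.append(("delivered", address, message))
        return self._handlers[address](message)


counter = {"value": 0}


def on_add(amount):
    counter["value"] += amount
    return counter["value"]


net = EmulatedNetwork(seed=42, drop_rate=0.5)
net.register("server:counter.add", on_add)
replies = [net.send("server:counter.add", 1) for _ in range(10)]
delivered = sum(1 for kind, _, _ in net.log if kind == "delivered")
print(delivered, counter["value"])
```

The log doubles as a trace, and the seed makes the injected faults repeatable.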

[Figure: modules built around the rDSN Microkernel + configuration]
• emulator, model checking, distributed declarative testing, tracer/profiler/fault-injector
• network/disk/thread/timer … for Windows, Linux, Emulator
• Replication + Partitioning for stateless/stateful applications (discussed next!)
• Micro-service, Storage, …

Layer 2: turn single-box components into scalable and reliable services

[Figure: rDSN architecture overview]

Service model for all rDSN applications

[Figure: service model matrix]
• Applications are stateless or stateful
• Reliability & Availability: w/o replication or w/ replication
• Scalability: w/ partition or w/o partition

From single component to scalable and highly-available service (stateless)

[Figure: a single component is partitioned (P0 P1 P2 P3) and replicated; the framework components are listed below]
• URI Resolver
  • Request dispatch
  • Request failover
• Daemon Server
  • Per-node service management (start, stop)
  • Service node failure detection
• Meta Server
  • Service node membership
  • Service node failure detection and load balance
  • Machine pool management

From single component to scalable and highly-available service (stateful)

[Figure: a single component is partitioned (P0 P1 P2 P3) and replicated; the framework components are listed below]
• URI Resolver
  • Request dispatch
  • Request failover
• Replica Server
  • Service node hosting
  • Replication via Paxos
  • State failure recovery
• Meta Server
  • Service node membership
  • Service node failure detection and load balance
  • Machine pool management
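Request dispatch and failover in the URI resolver can be approximated by the sketch below. It is not the rDSN implementation: the `UriResolver` class, the static partition table, the node names, and the hash-based partition choice are assumptions; in the real system the membership information would come from the meta server.

```python
import zlib


class ServiceUnavailable(Exception):
    pass


class UriResolver:
    """Maps a key to a partition, then to a live replica of that partition."""

    def __init__(self, partitions):
        # partitions: list (index = partition id) of replica node names
        self._partitions = partitions
        self._dead = set()

    def mark_dead(self, node):
        self._dead.add(node)

    def call(self, key, send):
        pid = zlib.crc32(key.encode()) % len(self._partitions)
        for node in self._partitions[pid]:
            if node in self._dead:
                continue                     # known failure: skip
            try:
                return send(node, pid, key)
            except ConnectionError:
                self.mark_dead(node)         # fail over to the next replica
        raise ServiceUnavailable(f"no live replica for partition {pid}")


def send(node, pid, key):
    if node.startswith("A"):
        raise ConnectionError(node)          # pretend every A* node is down
    return f"{node} served P{pid} for {key}"


resolver = UriResolver([["A0", "B0"], ["A1", "B1"], ["A2", "B2"], ["A3", "B3"]])
print(resolver.call("user-17", send))
```

The client only names the service and a key; which node answers is decided at call time, so a node failure turns into a retry rather than a client-visible error.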

Application enhancement for becoming services

• Client: use the service URL as the target address
• Server: implement several helper functions (stateful service only)

Adoption example

• RocksDB + dsn.dist.service.stateful at Xiaomi, replacing the original HBase/Redis clusters
• Compared with HBase
  • No GC problems, which lead to latency spikes
  • Faster failure recovery due to replicated application state (instead of only on-disk state)
• Compared with Redis
  • Better consistency guarantees and reconfiguration support

Framework development

• Frameworks are special apps that can host other apps
• Frameworks get access to application models through their type
  • dsn_app_get_callbacks
• Application messages are sent to frameworks first
  • Frameworks implement on_rpc_request to receive app messages
  • Frameworks send requests to apps through dsn_hosted_app_commit_rpc_request

Integration into Xiaomi cluster management

• Xiaomi has its own way to do the following (as does Microsoft Bing)
  • Write and manage logs
  • Implement and monitor performance counters
• Wrap the following components and plug them into rDSN
  • Log providers: we can immediately use many existing log collection and search tools
  • Performance counters: we can immediately reuse the dashboard previously designed for HBase

Integration with legacy Java programs

• Use the Thrift-generated Java client
• Problem
  • The Thrift header is different from rDSN's
  • The Thrift data encoding is different from that in rDSN
• Solution
  • Plug in a customized message header parser (thrift_message_parser)
  • rDSN allows multiple payload encoding/decoding protocols simultaneously
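Supporting several wire formats at once amounts to choosing a parser per message. The sketch below illustrates that idea only: the 4-byte magic tags, both header layouts, and the registry are invented for the example and do not reflect the real rDSN or Thrift headers or the actual thrift_message_parser.

```python
import struct

PARSERS = {}   # 4-byte magic -> parser function


def parser(magic):
    """Decorator that registers a parser for messages starting with `magic`."""
    def register(fn):
        PARSERS[magic] = fn
        return fn
    return register


@parser(b"NATV")
def parse_native(buf):
    # Hypothetical native layout: magic, body length (u32 little-endian), body
    (length,) = struct.unpack_from("<I", buf, 4)
    return {"format": "native", "body": buf[8:8 + length]}


@parser(b"THFT")
def parse_thrift_like(buf):
    # Hypothetical Thrift-style layout: magic, body length (u32 big-endian), body
    (length,) = struct.unpack_from(">I", buf, 4)
    return {"format": "thrift", "body": buf[8:8 + length]}


def parse_message(buf):
    """Pick the parser from the first four bytes of the message."""
    fn = PARSERS.get(bytes(buf[:4]))
    if fn is None:
        raise ValueError("unknown message header")
    return fn(buf)


m1 = parse_message(b"NATV" + struct.pack("<I", 5) + b"hello")
m2 = parse_message(b"THFT" + struct.pack(">I", 3) + b"abc")
print(m1, m2)
```

New client ecosystems can then be supported by registering another parser, without touching the services behind it.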

Layer 3 (Tron): connect many services as flows to serve end-to-end user requests (in progress)

[Figure: rDSN architecture overview]

Language: declarative service composition

[Figure: declarative service flow from "Search keyword" to "WebAnswer" composing Query Annotation (sQU2), Web Cache (sWebCache), L1 Static Ranking (sSaaS), L2 Dynamic Ranking (sRaaS), and Caption Store (sCDG)]

Why?

• Explicit cross-service dependency
• Agile programming for new services or service changes
• Automated global coordination
  • Latency optimization
    • Parallel service access
    • Delayed computation (when the result is not needed for the client response)
    • …
  • Consistency
    • E.g., atomicity
  • Testing, debugging
    • E.g., service-level tracing, profiling
• A lot can be done during service flow compilation and code generation for better robustness and performance!
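The two latency optimizations named above, parallel service access and delayed computation, can be shown with a hand-written flow. This is a sketch under assumptions: the service names are borrowed from the slides, but the shape of the flow (which calls are independent, what is deferred) and the `call_service` stand-in are invented; in Tron such code would be generated from the declarative description rather than written by hand.

```python
import asyncio


async def call_service(name, request, delay):
    """Stand-in for an RPC to a layer 2 service."""
    await asyncio.sleep(delay)
    return f"{name}({request})"


async def web_answer(keyword, background):
    annotated = await call_service("sQU2", keyword, 0.01)
    # Parallel service access: independent calls are issued together.
    cached, ranked = await asyncio.gather(
        call_service("sWebCache", annotated, 0.02),
        call_service("sSaaS", annotated, 0.02),
    )
    # Delayed computation: not needed for the client response.
    background.append(asyncio.create_task(
        call_service("AsyncUpdate", ranked, 0.05)))
    return cached, ranked


async def main():
    background = []
    answer = await web_answer("rdsn", background)
    await asyncio.gather(*background)   # drain before the loop closes
    return answer, [t.result() for t in background]


answer, updates = asyncio.run(main())
print(answer, updates)
```

Because the dependencies are explicit, a compiler can find the independent calls and the deferrable work on its own instead of relying on each service author to do so.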

Deploy composed service as layer 2 services

[Figure: the declarative flow from "Search keyword" to "WebAnswer" deployed with a FlowController and an ITaskQueue in front of sQU2, sWebCache, sSaaS, sRaaS, and sCDG, plus Async Update as an asynchronous task; annotations: consistency guarantee, SLA enforcement; compiler-generated code is marked in the figure]

Enjoy auto-generated client libraries

Enhanced engineering process

[Figure: tools across the engineering process stages (Program, Test, Debug, Perf Tuning, Deploy, Operate, Analysis)]
• Fault injection, Model checking, Global assertion, Declarative Scenario test …
• Replay, Simulator
• Bottom-up analyzer, Profiler …
• Declarative deployment model, App store, Auto-deployment …
• Model-based simulation, Model-assisted static analysis …
• Code generator …
• Dashboard, Tracer, Profiler …

Example: model checking (test) (in dsn.tools.emulator)

• Observation
  • Test coverage is always low compared to all the possible scenarios in reality
  • It is difficult to test certain scenarios even if we know how they may happen
• Explore all possible scheduling and failure combinations
  • Thread interleaving
  • Message reordering, loss, delay
  • Slow disk, disk failure
• Due to the large state space, randomly select among the possible choices at each join point
  • e.g., on_rpc_request_enqueue, on_aio_completed, …
• Effectively exposes bugs that were difficult to find before
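Randomized exploration of schedules can be demonstrated on a toy race. The sketch below is not the rDSN emulator: the two generator-based "threads", the seeded scheduler, and the assertion are hypothetical. Each scheduling step picks the next runnable task at random, and the search stops at the first seed whose schedule loses an update.

```python
import random


def run_schedule(seed):
    """Run two racy increments under a seeded random scheduler."""
    rng = random.Random(seed)
    shared = {"counter": 0}
    trace = []

    def increment(name):
        local = shared["counter"]       # step 1: read
        yield f"{name}:read"
        shared["counter"] = local + 1   # step 2: write (may be stale)
        yield f"{name}:write"

    runnable = [increment("t1"), increment("t2")]
    while runnable:
        task = rng.choice(runnable)     # the choice made at each join point
        try:
            trace.append(next(task))
        except StopIteration:
            runnable.remove(task)
    return shared["counter"], trace


def explore(max_seeds=1000):
    """Return the first seed whose schedule violates the assertion."""
    for seed in range(max_seeds):
        counter, trace = run_schedule(seed)
        if counter != 2:                # global assertion: no lost update
            return seed, trace
    return None


print(explore())
```

The failing schedules are exactly those where both reads happen before either write, an interleaving that ordinary tests on real threads may hit only rarely.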

Example: replay (debug) (in dsn.tools.emulator)

• Observation: some subtle bugs are difficult to reproduce, making root cause analysis almost infeasible
• The random choices made in the model checker are determined by a random seed
• Using the same seed leads to the same sequence of decisions (e.g., order, failure, …), and therefore reproduces the same bug
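Seed-based replay follows directly from using a deterministic pseudo-random generator for every decision. A minimal sketch, with a hypothetical `simulate` function and made-up message names and drop probability:

```python
import random


def simulate(seed, messages=("prepare", "commit", "ack")):
    """Deliver messages under seeded random reordering and loss."""
    rng = random.Random(seed)
    pending = list(messages)
    decisions = []
    while pending:
        msg = pending.pop(rng.randrange(len(pending)))        # reordering
        fate = "drop" if rng.random() < 0.3 else "deliver"    # failure
        decisions.append((msg, fate))
    return decisions


first = simulate(seed=2016)
replayed = simulate(seed=2016)
print(first == replayed)   # True: same seed, same decision sequence
```

This only holds if every source of non-determinism draws from the seeded generator, which is why all non-deterministic behavior has to go through the interposed API.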

Example: task explorer (dsn.tools.explorer) (debug)

• Observation: when the system gets large, it is difficult for developers to figure out how requests are processed across threads and machines
• This tool automatically extracts such information and visualizes it as task dependency graphs

Example: global state inspection (debug) (in dsn.tools.emulator)

• Observation: it is annoying when unwanted timeouts fire after pausing in a debugger

Example: progressive system complexity (debug and performance tuning) (dsn.core + dsn.tools.common)

• Observation: it is difficult for developers to reason about big changes (which lead to either correctness or performance issues)
• Debug correctness and performance issues in controlled and gradually changing environments
  • Thread number
  • Network failure
  • Disk failure
  • Disk performance (delay, …)
  • Network performance (delay, …)
  • Node number
  • …

Example: declarative distributed testing (test, not generalized yet)

All of this is done by the tool modules, without changing any code in the upper applications and/or frameworks!

rDSN Studio: towards a one-stop IDOE for service development, sharing, discovery, deployment, and operation

[Figure: rDSN architecture overview]

Service registration

Cluster management (dsn.dist.service.stateless)

Service deployment

Configurations as scenarios
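
The slide content is not recoverable beyond its title, so this is only one plausible reading: a scenario is a shared base configuration plus a small set of overrides that selects tools and environment settings. The section names, keys, and scenario names below are illustrative assumptions and are not taken from rDSN's actual configuration files.

```python
import configparser

# Base configuration shared by every scenario (all names are hypothetical).
BASE = """
[core]
tool = native
toollets =

[network]
delay_ms = 0
"""

# Each scenario is only the difference from the base.
SCENARIOS = {
    "production": {},
    "trace": {"core": {"toollets": "tracer, profiler"}},
    "fault-test": {
        "core": {"tool": "simulator", "toollets": "fault_injector"},
        "network": {"delay_ms": "50"},
    },
}


def load_scenario(name):
    """Return the base configuration with the named scenario's overrides applied."""
    cfg = configparser.ConfigParser()
    cfg.read_string(BASE)
    cfg.read_dict(SCENARIOS[name])
    return cfg


def toollets(cfg):
    """Parse the comma-separated toollet list into clean names."""
    return [t.strip() for t in cfg["core"]["toollets"].split(",") if t.strip()]


if __name__ == "__main__":
    for name in SCENARIOS:
        cfg = load_scenario(name)
        print(name, cfg["core"]["tool"], toollets(cfg), cfg["network"]["delay_ms"])
```

Under this reading, switching from a production run to a fault-injection run is a configuration change only, with the same application binary.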

Service operation (stateless)

Service operation (stateful)

Share and Benefit

•  Configuration
•  Emulator, model checking, distributed declarative testing, tracer/profiler/fault-injector, web studio (visualization)
•  Network/disk/thread/timer … for Windows, Linux, Emulator
•  Replication + partitioning for stateless/stateful applications
•  Micro services, storages

Key takeaways

What is rDSN?

• Robustness is far from being perfect in current systems due to
  •  Implicit contract
  •  Lack of Systems Thinking
  •  After-thought tooling support

• rDSN advocates
  •  Explicit task and service level dependency, non-determinism, and interference
  •  Flexible configurations that break the dilemma between local decisions and global coordination
  •  Native tooling support

Effort goes into (1) both the runtime and the engineering process, and (2) a coherent way to integrate the two seamlessly

[Figure: rDSN architecture, annotated with "Engineering process with integrated tools" and "Explicit contract and automated fail-over". Layers shown: Applications (stateless, stateful, w/o replication, w/ replication, w/ partition, w/o partition), Syntactic Libraries, Service API (RPC, Task, AIO, Sync, Env), Core (Task Queue), Tool API (Component Providers, Join Points, State Extensions), Tools Runtime, rDSN Studio. Also labeled: Workflow Controller, Async Update, Request, services S1 to S5. Engineering stages: Program, Test, Debug, Perf Tuning, Deploy, Operate, Analysis. Tools listed: fault injection, model checking, global assertion, declarative scenario test, …; replay, simulator; bottom-up analyzer, profiler, …; declarative deployment model, app store, auto-deployment, …; model-based simulation, model-assisted static analysis, …; code generator, …; dashboard, tracer, profiler, …]

rDSN Studio: towards a one-stop IDOE for service applications/frameworks/tools/local libraries development, sharing, discovery, deployment, and operation

Our vision: things work, and we know why, by turning distributed systems from black boxes into white boxes.

Thanks! Questions? https://github.com/Microsoft/rDSN