
  • Confluo: Distributed Monitoring and Diagnosis Stack for High-speed Networks

    Anurag Khandelwal, Rachit Agarwal, Ion Stoica

  • Motivation

    • Managing large scale networks is increasingly complex:
    Network Misconfigurations, Network Failures, Load Imbalance, Network Congestion

    • Network issues ⇒ Performance degradation, loss in revenue

  • Opportunity: Networks can capture a lot of data…

    • In-network techniques, e.g., Marple [SIGCOMM'17], FlowRadar [NSDI'16], UnivMon [SIGCOMM'16]; but switches have limited storage

    • In-band Network Telemetry (INT): embed a wide range of telemetry data within packet headers:
    ‣ Packet trajectory
    ‣ Hop latency
    ‣ Queue lengths
    ‣ Link utilization, and many more…

    • Analyze telemetry data at end-hosts,
    e.g., Trumpet [SIGCOMM'16], PathDump [OSDI'16], SwitchPointer [NSDI'18]

  • Example: Checking Path Conformance

    • Embed wide range of telemetry data within packet headers:
    ‣ Packet trajectory
    ‣ Hop latency
    ‣ Queue lengths
    ‣ Link utilization, and many more…

    • Does the packet pass through switch S1? [PathDump, OSDI'16]

    (Figure: a packet traverses switches S1, S2, S3, accumulating its trajectory (S1, S2, …) in the packet header.)

  • Goals for end-host stack design

    End-host stacks need to support:

    Real-time monitoring of rich telemetry data

    Low-overhead distributed diagnosis of network events

    Highly-concurrent reads & writes of headers using minimal CPU

  • Challenge: …Networks capture a lot of data

    Line-rate for 10Gbps links ⇒ 0.9-16 million packets/second,
    ~50 nanoseconds budget per packet header!

    (Figure: write throughput (packets/s, 0M-16M) for Storm+Kafka, Flink+Kafka, Kafka, BTrDB, CorfuDB, TimescaleDB, each on 32 cores; all fall short of the max packet rate @ 10Gbps, with the systems offering transactional semantics furthest behind.)
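    As a back-of-the-envelope check on this budget (the exact figures depend on how much inter-frame overhead one counts; 20B assumed here for the Ethernet preamble and inter-packet gap):

$$
\text{packet rate} = \frac{10 \times 10^{9}\ \text{bit/s}}{8 \times (\text{frame bytes} + 20)}
\approx
\begin{cases}
14.9\ \text{Mpps} & \text{(64B frames)}\\
0.8\ \text{Mpps} & \text{(1500B frames)}
\end{cases}
\qquad
\text{per-packet budget} = \frac{1}{\text{packet rate}} \approx 67\ \text{ns to } 1.2\ \mu\text{s}.
$$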

  • Existing Approaches

    (Figure: functionality vs. performance trade-off space)

    Traditional End-host Stacks: Tribeca [VLDB'96], Gigascope [SIGMOD'03], Time Machine [SIGCOMM'08]

    Stacks using external data-processing systems (stream-processing systems / key-value stores): OpenSOC, Tigon, PathDump [OSDI'16], SwitchPointer [NSDI'18]
    ‣ Monitor rich telemetry data, but at limited performance

    Custom-designed monitoring stacks: FloSIS [USENIX ATC'15], Trumpet [SIGCOMM'16]
    ‣ Handle 10-40 Gbps links, but with limited functionality or precision

    Ideal: Can we achieve both simultaneously?

  • Confluo

    (Figure: write throughput (Ops, 0M-28M): Storm+Kafka, Flink+Kafka, Kafka, BTrDB, CorfuDB, TimescaleDB each use 32 cores and fall below the max packet rate @ 10Gbps; the Atomic MultiLog exceeds it using a single core.)

    Confluo achieves this using a new data structure: the Atomic MultiLog.

  • Atomic MultiLog

    New data structure that exploits structure in telemetry data to meet all goals:

    Attributes of interest are fixed-sized (32-bit IP addresses, timestamps, 16-bit port numbers, switchIDs, queue-lengths, etc.)
    ⇒ Low-overhead indexing with specialized perfect k-ary trees

    Data once written is not updated; aggregated only at coarse-grained timescales
    ⇒ Append-only, write-efficient data structures

    Does not require serializable transactions; linearizability is sufficient
    (Linearizability: single-operation, single-object. Serializability: multi-operation, multi-object.)
    ⇒ Trim down concurrency mechanisms to updating 2 integers

  • Atomic MultiLog: Write Efficient Storage

    Traditional data stores use complex data structures to support general workloads, compromising on write efficiency. Telemetry data, in contrast, is never updated once written.

    The Atomic MultiLog is built entirely from concurrent, append-only logs:
    ‣ Header Log, with Reference Logs pointing into it
    ‣ Attribute Indexes (e.g., srcIP)
    ‣ Time-Indexed Filters (e.g., srcIP=10.0.0.1 && dstPort=90)
    ‣ Time-Indexed Aggregates (e.g., min(CWND))

    Append-only logs provide write efficiency; they do not support in-place updates, which telemetry data does not need. A layout sketch follows below.
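    To make the component relationships concrete, here is a minimal layout sketch, with assumed field names and with simple containers standing in for Confluo's actual concurrent logs (an unordered_map replaces the perfect k-ary tree purely for brevity):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative layout only, not Confluo's code. Every structure is
// append-only; indexes, filters, and aggregates store *references*
// (offsets into the header log) rather than copies of the data.
struct PacketHeader {
  uint32_t src_ip, dst_ip;
  uint16_t src_port, dst_port;
  uint64_t ts_ns, cwnd;
};

struct AtomicMultiLog {
  std::vector<PacketHeader> header_log;  // raw headers, appended in order

  // Attribute index: srcIP -> reference log of header-log offsets.
  std::unordered_map<uint32_t, std::vector<uint64_t>> src_ip_index;

  // Time-indexed filter (e.g., srcIP=10.0.0.1 && dstPort=90): per time
  // bucket, a reference log of offsets of the matching packets.
  std::unordered_map<uint64_t, std::vector<uint64_t>> filter_buckets;

  // Time-indexed aggregate (e.g., min(CWND)) per time bucket.
  std::unordered_map<uint64_t, uint64_t> min_cwnd;
};
```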

  • Atomic MultiLog Consistency

    Do not require serializable transactions; linearizability is sufficient.

    Database transactions. User: Read, Write, Write, Read, … Write, Read, Write … (every client freely interleaves multi-object reads and writes)

    Network monitoring & diagnosis. Network: Write, Write, Write, …; Network Operator: Read, Read, Read (writers only append, readers only query)

  • Efficient Linearizability for Logs

    Support for concurrent appends & reads: each log maintains a Read-Tail and a Write-Tail.

    ‣ An Append reserves space by advancing the Write-Tail
    ‣ Everything below the Read-Tail is safe for concurrent READS
    ‣ The Read-Tail advances only once the append completes

    Linearizable reads & appends, with lock-free techniques for efficiency. A sketch follows below.
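    A minimal sketch of the read-tail/write-tail discipline, assuming a single pre-allocated buffer (Confluo's real logs grow in blocks; this is an illustration, not its implementation):

```cpp
#include <atomic>
#include <cstring>
#include <vector>

class Log {
 public:
  explicit Log(size_t capacity) : buf_(capacity) {}

  // Append: reserve space by atomically advancing the write tail, copy
  // the record, then publish it by advancing the read tail past it.
  size_t append(const void* data, size_t len) {
    size_t off = write_tail_.fetch_add(len);    // reserve
    std::memcpy(buf_.data() + off, data, len);  // write
    size_t expect = off;                        // publish in reservation order
    while (!read_tail_.compare_exchange_weak(expect, off + len))
      expect = off;                             // spin until earlier appends finish
    return off;
  }

  // Reads are safe anywhere below the read tail.
  bool read(size_t off, void* out, size_t len) const {
    if (off + len > read_tail_.load()) return false;  // not yet visible
    std::memcpy(out, buf_.data() + off, len);
    return true;
  }

 private:
  std::vector<char> buf_;
  std::atomic<size_t> write_tail_{0};
  std::atomic<size_t> read_tail_{0};
};
```

    A reader only ever sees offsets below the Read-Tail, so a half-finished append is never visible; the CAS loop forces appends to publish in reservation order, which is what makes reads and appends linearizable without locks.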

  • Atomic MultiLog Linearizability

    An Atomic MultiLog write touches the Header Log, the Attribute Indexes, the Time-Indexed Filters, and the Time-Indexed Aggregates; Reference Logs record header-log offsets (0, 50, 100, …).

    Rather than per-log tails, the MultiLog keeps a Global Read Tail and a Global Write Tail over header-log offsets. A write first reserves its offsets (e.g., [200, 250)) on the Global Write Tail, then updates the individual logs, and only then advances the Global Read Tail (200 → 250), making the whole write visible.

    Relax linearizability for individual logs; ensure linearizability only for end-to-end operations. (No support for transactions.)

    Result: significant performance gains while preserving linearizability at high degrees of concurrency. A sketch of this end-to-end write path follows below.
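    A sketch of that end-to-end write path under the same assumptions (helper bodies elided to stubs; names are illustrative, not Confluo's):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: one pair of global tails makes a multi-structure
// write atomic without locking each individual log.
struct AtomicMultiLog {
  std::atomic<uint64_t> global_write_tail{0};
  std::atomic<uint64_t> global_read_tail{0};

  // Stand-ins for the real per-attribute structures (bodies elided).
  void write_header_log(uint64_t, const void*, size_t) {}
  void update_indexes_filters_aggregates(uint64_t, const void*) {}

  uint64_t append(const void* hdr, size_t len) {
    // 1. Reserve: a single atomic add orders this write among all writes.
    uint64_t off = global_write_tail.fetch_add(len);
    // 2. Update individual logs in any order; none of it is visible yet,
    //    so the per-log updates need no ordering of their own.
    write_header_log(off, hdr, len);
    update_indexes_filters_aggregates(off, hdr);
    // 3. Publish: advance the read tail once all earlier writes are done,
    //    exposing header, index, filter, and aggregate entries at once.
    uint64_t expect = off;
    while (!global_read_tail.compare_exchange_weak(expect, off + len))
      expect = off;
    return off;
  }
};
```

    The "2 integers" from the earlier slide are exactly these two global tails.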

  • Atomic MultiLog Indexing

    Traditional indexes: expensive to ensure atomicity, high-overhead write paths, etc.

    Attributes of interest are fixed-sized; header fields have fixed domain sizes, e.g., a 16-bit port lies in [0, 2^16).

    Perfect k-ary tree: every internal node has exactly k children (unused children stay NULL) and the tree has fixed depth. Leaf nodes point to Reference Logs.

    ‣ Efficient write path and write conflict resolution: 2.2x faster (1 core), 7.8x faster (48 cores)
    ‣ Ordered access via inexpensive range queries: ~5.1x faster
    ‣ Limitation: only supports fixed-sized attributes

    A sketch of the tree follows below.
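    A minimal sketch of such an index for a 16-bit attribute with k = 256, so the depth is fixed at two levels (parameters and names are assumed for illustration, not taken from Confluo):

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <vector>

// Leaf: a reference log of header-log offsets for one attribute value.
struct RefLog {
  std::vector<uint64_t> offsets;  // append-only in the real system
};

// Internal node: exactly k = 256 children, all initially NULL.
struct Node {
  std::array<std::atomic<void*>, 256> child{};
};

class PortIndex {
 public:
  // Fixed-depth walk: one byte of the key per level, no rebalancing.
  RefLog* get_or_create(uint16_t port) {
    Node* n = install<Node>(&root_, uint8_t(port >> 8));
    return install<RefLog>(n, uint8_t(port & 0xff));
  }

 private:
  // Missing nodes are installed with a CAS, which is the entire write
  // conflict resolution: the losing writer just frees its allocation.
  template <typename T>
  T* install(Node* n, uint8_t b) {
    void* c = n->child[b].load();
    if (c == nullptr) {
      T* fresh = new T();
      if (n->child[b].compare_exchange_strong(c, fresh))
        c = fresh;     // we installed it
      else
        delete fresh;  // another writer raced us; c now holds theirs
    }
    return static_cast<T*>(c);
  }

  Node root_;
};
```

    Because children are laid out by key value, a range query over ports reduces to walking consecutive children, which is what makes ordered access cheap.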

  • Confluo End-host Architecture

    Each end-host runs an End-host Module alongside the hypervisor, VMs (VM1, VM2, …, VMk), and native apps:
    ‣ A Mirror Module (MM) and a Spray Module (SM) at the NIC place packet headers into Ring Buffers
    ‣ Writers drain the ring buffers into the Atomic MultiLog
    ‣ Monitor, Diagnoser, and Archiver components run over the Atomic MultiLog

    A Coordinator spans the End-host Modules across machines, providing consistent analysis of network-wide events using Linearizable Snapshots.

  • Consistency in Distributed Diagnosis

    Diagnostic queries Q1 and Q2 execute against snapshots S1 and S2.

    • Confluo provides linearizable snapshots, i.e.,
    ‣ Each snapshot is atomic
    ‣ Snapshots are totally ordered, i.e., if S1 "happens before" S2, S2 must contain all the changes in S1
    • Limitation: does not consider stack delays in ordering packets across end-hosts

    Please see our paper for details on the snapshot algorithm!

  • Evaluation

    • Setup:
    ‣ Servers: 12-core 2.3 GHz Xeon CPUs, 252GB RAM
    ‣ Network: 10Gbps links, Pica8 P-3297 switches

    • Summary of Results:
    ‣ Capture packet headers at line rate > 10Gbps while evaluating 1000s of triggers & 10s of filters with minimal CPU %
    ‣ Exploit rich telemetry data in packet headers to enable a large class of network monitoring and diagnosis applications

    Please see our paper for the detailed evaluation!

  • Atomic MultiLog Performance

    (Figure: throughput (packets/s, 0-40M) vs. #indexes (0, 1, 2, 4) for 1 and 16 filters; throughput (0-100M) vs. #cores (1, 2, 4, 8) for 1 and 16 filters; and throughput (0-100M) vs. #cores for 1 and 4 indexes.)

    Indexes: srcIP, srcPort, dstIP, dstPort, timestamp
    Filter Templates: (f1) packets from VM A to VM B; (f2) packets to VM A; (f3) packets from VM A on destination port P; (f4) packets between (IP1, P1) and (IP2, P2); and (f5) packets to or from VM A.

    Takeaway: Packet write-rate degrades gracefully on adding more filters and indexes.
    Takeaway: Confluo's write throughput scales well with #cores due to inexpensive concurrency control.

  • Atomic MultiLog Performance

    (Figure: CPU utilization (%, 0-100) for processing packets at line rate on a 10Gbps link vs. packet size (64-1500 bytes), for 1 vs. 16 filters and for 1 vs. 4 indexes.)

    Indexes: srcIP, srcPort, dstIP, dstPort, timestamp
    Filter Templates: (f1)-(f5) as above.

    Takeaway: Confluo can capture packets at line rate on 10Gbps links for a wide range of packet sizes using a single CPU core.

    Many more results in the paper…

  • General Applicability and Impact

    • Confluo exploits three properties: fixed-sized attributes, append-only, non-transactional

    • Many other applications exhibit similar properties:
    ‣ Distributed messaging, e.g., Apache Kafka, Amazon Kinesis, etc.
    ‣ Time-series databases, e.g., OpenTSDB, InfluxDB, etc.

    • We are actively exploring Confluo's applicability beyond network monitoring and debugging

    Open Source: https://www.github.com/ucbrise/confluo

  • Confluo Summary

    Distributed monitoring and diagnosis stack for high-speed networks.

    Introduces a new data structure, the Atomic MultiLog, that exploits structure in network telemetry data to support:
    • Rich monitoring
    • Low-overhead diagnosis
    • High-concurrency reads and writes

    Thank You!

    https://www.github.com/ucbrise/confluo

  • Backup Slides

  • Atomic MultiLog Performance

    (Figure: throughput (packets/s, 0-40M) vs. #indexes (0, 1, 2, 4) for 1/4/16/64 filters; throughput (0-100M) vs. #cores (1, 2, 4, 8) for 1/4/16/64 filters; and throughput (0-100M) vs. #cores for 0/1/2/4 indexes.)

    Indexes: srcIP, srcPort, dstIP, dstPort, timestamp
    Filter Templates: (f1)-(f5) as above.

    Takeaway: Packet write-rate degrades gracefully on adding more filters and indexes.
    Takeaway: Confluo's write throughput scales well with #cores owing to its inexpensive lock-free concurrency.

  • Atomic MultiLog Performance

    (Figure: CPU utilization (%, 0-100) vs. packet size (64-1500 bytes), for 1/4/16/64 filters and for 0/1/2/4 indexes.)

    Indexes: srcIP, srcPort, dstIP, dstPort, timestamp
    Filter Templates: (f1)-(f5) as above.

    Takeaway: At a line rate of 10Gbps, Confluo can handle average packet sizes as small as 128B with 16 filters and 2 indexes on a single core.

  • Atomic MultiLog Performance

    (Figure: CPU utilization (%, 0-4) vs. #triggers (1-1000) at 1/5/10/20 ms intervals; trigger latency (10μs-100ms) vs. #triggers (1-1000); diagnostic query latency (0-250 ms) vs. #captured packets (50-300 million) for queries q1-q5.)

    Query Templates: (q1) packets from VM A to VM B; (q2) packets to VM A; (q3) packets from VM A on destination port P; (q4) packets between (IP1, P1) and (IP2, P2); and (q5) packets to or from VM A.
    Trigger Templates: aggregate > threshold, aggregate in {sum(pktSize), min(priority), max(CWND), count(pkts), …}

    Takeaway: Confluo can evaluate 1000s of trigger queries with less than 4% CPU utilization at 1ms intervals, and with latency less than 70μs.
    Takeaway: Diagnostic query latency increases linearly with the number of captured packets.

  • Monitoring & Diagnosis Scenario

    "Detect TCP packet losses, determine if it is due to difference in flow priorities."

    Flow1 and Flow2 traverse switch 'S', with priority(flow1) > priority(flow2).

    Monitoring:
    • Filter TCP retransmissions as pkt_drops
    • Aggregate drop_count on pkt_drops
    • Trigger alert if drop_count > T in 1ms interval

    Diagnosis: Check if:
    • priority(flow1) > priority(flow2)
    • drop_count(flow1) < drop_count(flow2)

    A sketch of these rules appears below.
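    One way these rules might be expressed, written against a hypothetical client interface: the method names (add_filter, add_aggregate, install_trigger) and the expression syntax are assumptions for illustration, not Confluo's documented API (see the repository for that):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical client facade; a real deployment would issue RPCs to the
// end-host module instead of printing the installed rules.
struct TelemetryClient {
  void add_filter(const std::string& name, const std::string& expr) {
    std::cout << "filter " << name << " := " << expr << "\n";
  }
  void add_aggregate(const std::string& name, const std::string& filter,
                     const std::string& fn) {
    std::cout << "aggregate " << name << " := " << fn
              << " over " << filter << "\n";
  }
  void install_trigger(const std::string& name, const std::string& cond,
                       int64_t period_ms) {
    std::cout << "trigger " << name << " : " << cond
              << " every " << period_ms << "ms\n";
  }
};

int main() {
  TelemetryClient c;
  const int64_t T = 100;  // assumed drop threshold

  // Monitoring: mark TCP retransmissions as drops, count them, and alert.
  c.add_filter("pkt_drops", "tcp.retransmission == true");
  c.add_aggregate("drop_count", "pkt_drops", "count(pkt)");
  c.install_trigger("loss_alert", "drop_count > " + std::to_string(T),
                    /*period_ms=*/1);

  // Diagnosis then compares priority(flow1) vs. priority(flow2) and
  // drop_count(flow1) vs. drop_count(flow2) across the affected hosts.
  return 0;
}
```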

  • Diagnosing Losses due to Flow Priorities

    "Detect TCP packet losses, determine if it is due to difference in flow priorities." (priority(flow1) > priority(flow2), both crossing switch 'S')

    Setup: 15 low-priority flows & 1 high-priority flow w/ 10Gbps links

    (Figure: drop_count (0-1800) per FlowID 1-16; FlowID 1 is high priority, FlowIDs 2-16 are low priority.)

    Takeaway: Confluo is able to diagnose packet drops due to flow priorities.

  • Diagnosing Losses due to Flow Priorities

    "Detect TCP packet losses, determine if it is due to difference in flow priorities."

    Setup: k low-priority flows & 1 high-priority flow w/ 10Gbps links

    (Figure: latency (0-5 ms), split into atomic snapshot and query execution, vs. #low-priority flows k = 1-128.)

    Takeaway: Confluo can diagnose issues across 100s of VMs in a few ms.

  • Consistency in Distributed Diagnosis

    (Figure: packet writes P11, P12 to MultiLog#1, P21, P22 to MultiLog#2, and P31, P32 to MultiLog#3 along wall-clock time; each write has a begin point and a completion point, after which it becomes visible.)

    Naive approach to snapshot: obtain the globalReadTails of all MultiLogs.

    Snapshot 1 contains P22 but not P12; Snapshot 2 contains P12 but not P22. Not consistent!

  • Consistency in Distributed Diagnosis

    Existing approaches:
    • A centralized sequencer orders all writes to the system (e.g., a DBMS)
    • Algorithms with weaker consistency (e.g., Chandy-Lamport)

    Infeasible!

  • Linearizable Snapshots in Confluo

    • Impose order on some writes during query execution rather than during writes
    • How do we make the naive snapshots consistent?
    • Key insight: Delay the visibility of P12 and P22 to all queries until after the snapshots
    • P12 and P22 are now excluded from both snapshots

  • Linearizable Snapshots in Confluo

    Idea: delay packet writes that happen during a snapshot operation. A Coordinator runs the protocol across Server#1, Server#2, …, Server#n:

    1. Broadcast a FreezeAndGet request to all servers; each server atomically freezes its readTail and gets its value
    2. Receive the readTail values from all servers
    3. Send an Unfreeze request to all servers; if no other snapshot is in progress, each server atomically un-freezes its readTail
    4. Receive ACKs from all servers

    The visibility of any packet write that would have completed during the snapshot is delayed by freezing the globalReadTail (Step 1); such writes are made visible (in Step 3) only after the snapshot(s) have been collected (in Step 2). A sketch of the coordinator and server sides follows below.
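    A minimal sketch of the freeze/unfreeze exchange, with local stubs standing in for the RPCs (the names and the freeze-count mechanism are assumptions for illustration):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Server side: while freeze_count > 0, writers stop advancing read_tail,
// which delays the visibility of in-flight packet writes.
struct ServerStub {
  std::atomic<uint64_t> read_tail{0};
  std::atomic<int> freeze_count{0};

  uint64_t freeze_and_get() {   // Step 1, executed on each server
    freeze_count.fetch_add(1);
    return read_tail.load();
  }
  void unfreeze() {             // Step 3; the tail resumes at zero count,
    freeze_count.fetch_sub(1);  // so overlapping snapshots stay safe
  }
};

// Coordinator side: the collected tails form one linearizable snapshot,
// since no write can become visible anywhere between Steps 1 and 3.
std::vector<uint64_t> take_snapshot(std::vector<ServerStub>& servers) {
  std::vector<uint64_t> tails;
  tails.reserve(servers.size());
  for (auto& s : servers) tails.push_back(s.freeze_and_get());  // Steps 1-2
  for (auto& s : servers) s.unfreeze();                         // Steps 3-4
  return tails;
}
```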