Seastar at Linux Foundation Collaboration Summit


Seastar

Millions of transactions per second, with an advanced new programming model

How multifarious and how mutually complicated are the considerations which the working of such an engine involve. There are frequently several distinct sets of effects going on simultaneously; all in a manner independent of each other, and yet to a greater or less degree exercising a mutual influence. To adjust each to every other, and indeed even to perceive and trace them out with perfect correctness and success, entails difficulties whose nature partakes to a certain extent of those involved in every question where conditions are very numerous and inter-complicated.

(Ada Lovelace, notes on the Analytical Engine)

Hardware outgrowing software

+ CPU clocks are not getting faster.

+ More cores, but they are hard to use.

+ Locks have costs even when there is no contention (see the sketch below).

+ Data is allocated on one core, then copied and used on others.

+ Result: software can't keep up with new hardware (SSD, 10Gbps networking…).
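To make the "locks cost even without contention" bullet concrete, here is a minimal, hypothetical micro-benchmark sketch (not from the talk; names and numbers are illustrative): even on a single thread with zero contention, an atomic read-modify-write pays for ordering and cache-coherency work that a plain increment avoids.

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr long N = 100000000;
        std::atomic<long> locked{0};  // stands in for a lock-protected counter
        volatile long plain = 0;      // volatile so the loop isn't optimized away

        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < N; i++) locked.fetch_add(1);  // uncontended atomic RMW
        auto t1 = std::chrono::steady_clock::now();
        for (long i = 0; i < N; i++) plain = plain + 1;    // ordinary increment
        auto t2 = std::chrono::steady_clock::now();

        std::printf("atomic: %lld ms, plain: %lld ms\n",
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
    }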

[Diagram: the traditional stack. Application threads run on top of the kernel's scheduler and TCP/IP stack, with per-thread queues feeding the NIC queues; memory is shared across cores.]

Workloads changing

+ Complex, multi-layered applications

+ NoSQL data stores

+ More users

+ Lower latencies needed

+ Microservices

- 81% of Redis processing is in the kernel.

- If a page needs 100 requests, the "99% latency" affects 63% of pageviews (since 1 − 0.99¹⁰⁰ ≈ 0.63).


7 Million IOPS

Benchmark hardware

■ 2x Xeon E5-2695v3, 2.3GHz, 35MB cache, 14 cores each (28 cores total, 56 HT)

■ 8x 8GB DDR4 Micron memory

■ Intel Ethernet CNA XL710-QDA1

A new model

Threads

- Costly locking (example: POSIX requires that multiple threads be able to use the same socket)

+ Uses available skills and tools

Shared-nothing

+ Fewer wasted cycles

- Cross-core communication must be explicit, so it is harder to program

How

■ A single-threaded async engine runs on each CPU

■ No threads

■ No shared data

■ All inter-CPU communication is by message passing (see the sketch below)
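As a hedged illustration of the message-passing model, using Seastar's API as it exists today (header paths and the seastar namespace postdate this talk), one core can ask another to run a lambda and receive the result back as a future, with no locks and no shared data:

    #include <seastar/core/app-template.hh>
    #include <seastar/core/smp.hh>
    #include <seastar/core/future.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            // Run a lambda on core 1 (assumes at least two cores); the
            // request and the reply travel over the smp message queues.
            return seastar::smp::submit_to(1, [] {
                return 42;
            }).then([] (int v) {
                std::cout << "core 1 replied: " << v << "\n";
            });
        });
    }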

Linear scaling

+ One engine runs on each core

+ Shared-nothing per-core design

+ Fits the existing model of shared-nothing distributed applications

+ Full kernel bypass; supports zero-copy

+ No threads, no context switches and no locks!

+ Instead: asynchronous lambda invocation

[Diagram: SeaStar's sharded stack. On each core, the application, a task scheduler and a userspace TCP/IP stack; cores exchange messages over smp queues; DPDK drives the NIC queue directly from userspace, and the kernel isn't involved.]

Comparison with old school

[Diagram: the traditional stack (application threads over the kernel's scheduler, TCP/IP stack and NIC queues, with shared memory) side by side with SeaStar's sharded stack (per-core engines, userspace TCP/IP over DPDK, smp queues, kernel not involved), scaling to millions of connections.]

[Diagram: on every CPU, a scheduler runs chains of promise/task pairs. A promise is a pointer to an eventually computed value; a task is a pointer to a lambda function. Contrast the threaded model, where each CPU's scheduler juggles thread/stack pairs: a thread is a function pointer, and a stack is a byte array from 64k to megabytes.]

But how can you program it?

■ Ada Lovelace's problem, today

■ Need the maximum possible "easy" without giving up any "fast"

If the answer were "no", would this book be 467 pages long?

Basic model

■ Futures

■ Promises

■ Continuations

F-P-C defined: Future

A future is the result of a computation that may not be available yet, for example:

■ a data buffer that we are reading from the network

■ the expiration of a timer

■ the completion of a disk write

■ the result of a computation that requires the values from one or more other futures (see the sketch below)
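A hedged sketch of that last case, assuming Seastar's current when_all_succeed / then_unpack combinators (both postdate this talk in their present form): the combined computation becomes ready only once every input future has resolved.

    #include <seastar/core/future.hh>
    #include <seastar/core/when_all.hh>

    // Hypothetical future-returning computations, for illustration only.
    seastar::future<int> compute_a();
    seastar::future<int> compute_b();

    seastar::future<int> sum_two() {
        return seastar::when_all_succeed(compute_a(), compute_b())
            .then_unpack([] (int a, int b) {
                return a + b;   // runs only after both inputs are ready
            });
    }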

F-P-C defined: Promise

A promise is an object or function that provides you with a future, with the expectation that it will fulfill the future.
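A minimal sketch of the producer side, again assuming today's Seastar API: the producer keeps the promise, hands out its future, and makes it ready later.

    #include <seastar/core/future.hh>

    seastar::future<int> slow_get() {
        seastar::promise<int> p;
        seastar::future<int> f = p.get_future();  // hand this to the consumer
        // ... later, when the value is actually available (timer, I/O, ...):
        p.set_value(42);  // fulfills the future; its continuations can now run
        return f;
    }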

Basic future/promise

future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        put(value + 1).then([] {
            std::cout << "value stored successfully\n";
        });
    });
}

Chaining

future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        return put(value + 1);   // returning a future chains it
    }).then([] {
        std::cout << "value stored successfully\n";
    });
}
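One consequence of chaining worth spelling out (a hedged sketch; handle_exception is today's Seastar API and is an assumption here, not shown in the talk): a failure anywhere in the chain skips the remaining continuations and propagates until something handles it.

    future<> g() {
        return get().then([] (int value) {
            return put(value + 1);   // if this fails...
        }).then([] {
            std::cout << "value stored successfully\n";   // ...this is skipped
        }).handle_exception([] (std::exception_ptr ep) {
            std::cerr << "store failed\n";   // ...and the error surfaces here
        });
    }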

Zero copy friendly

future<temporary_buffer> socket::read(size_t n);

■ temporary_buffer points at driver-provided pages if possible

■ the stack can linearize scatter-gather buffers using page tables

■ discarded after use
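A hedged usage sketch for the signature above (sock and parse() are hypothetical; get()/size() match today's temporary_buffer accessors): the data is consumed in place, and the pages return to the driver when the buffer goes out of scope.

    sock.read(4096).then([] (temporary_buffer buf) {
        parse(buf.get(), buf.size());   // read in place; no copy was made
        // buf is destroyed here, releasing the driver-provided pages
    });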

Zero copy friendly (2)

pair<future<size_t>, future<temporary_buffer>> socket::write(temporary_buffer);

■ First future becomes ready when the TCP window allows sending more data (usually immediately)

■ Second future becomes ready when the buffer can be discarded (after the TCP ACK)

■ May complete in any order
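A matching hedged sketch (sock and buf are again hypothetical): because the two futures are independent, the sender can pipeline the next buffer before the previous one is ACKed.

    auto r = sock.write(std::move(buf));
    r.first.then([] (size_t) {
        // TCP window open: safe to queue the next buffer now
    });
    r.second.then([] (temporary_buffer b) {
        // ACK received: b is safe to discard; freed when it leaves scope
    });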

Fully async filesystem

No threads:

read_metadata().then([] {
    return lock_pages();   // runs once the metadata read completes
}).then([] {
    return read_data();    // runs once the pages are locked
});

Shared state: networking

■ No shared state, except the index of net channels (one per CPU)

■ No migration of existing TCP connections

Handling shared state: block

■ Each CPU is responsible for handling specific files/directories/free blocks (assigned by hash)

■ Access can be delegated to another CPU for locality, but there is no concurrent shared access (see the sketch below)

■ Flash optimized: no fancy layout

■ DMA only
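A hedged sketch of that ownership-by-hash idea (handle_open() and the hashing scheme are illustrative, not Seastar's actual filesystem code): route each operation to the owning core instead of taking a lock.

    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <functional>
    #include <string>

    // Hypothetical per-core handler, serialized by design on its owner core.
    seastar::future<> handle_open(const std::string& path);

    seastar::future<> open_on_owner(std::string path) {
        // Every path hashes to exactly one owning core; no locks needed.
        unsigned owner = std::hash<std::string>{}(path) % seastar::smp::count;
        return seastar::smp::submit_to(owner, [p = std::move(path)] {
            return handle_open(p);
        });
    }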

Deployment models

■ Seastar TCP over DPDK, in a Linux process

■ Seastar TCP over virtio or raw device access, on OSv

■ Linux sockets, in a Linux process

Licensing

■ Apache

■ Goals: compatibility and contributor safety

Performance results

■ Linear scaling to 20 cores and beyond

■ 250,000 transactions/second per core (memcached)

■ Currently limited by the client; more client development is in progress.

Applications

■ HTTP server

■ NoSQL system

■ Distributed filesystem

■ Object store

■ Transparent proxy

■ Cache (memcached, CDN, …)

■ NFV

Thank you

http://www.seastar-project.org/

@CloudiusSystems