GenStage and Flow - Jose Valim

@elixirlang / elixir-lang.org

Upload: elixir-meetup

Post on 15-Feb-2017


TRANSCRIPT

Page 1: GenStage and Flow - Jose Valim

@elixirlang / elixir-lang.org

Page 2: GenStage and Flow - Jose Valim

GenStage & Flow github.com/elixir-lang/gen_stage

Page 3: GenStage and Flow - Jose Valim

Prelude: From eager, to lazy, to concurrent, to distributed

Page 4: GenStage and Flow - Jose Valim

Example problem: word counting

Page 5: GenStage and Flow - Jose Valim

"roses are red\n violets are blue\n"

%{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Page 6: GenStage and Flow - Jose Valim

Eager

Page 7: GenStage and Flow - Jose Valim

File.read!("source")

"roses are red\n violets are blue\n"

Page 8: GenStage and Flow - Jose Valim

File.read!("source") |> String.split("\n")

["roses are red", "violets are blue"]

Page 9: GenStage and Flow - Jose Valim

File.read!("source")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)

["roses", "are", "red", "violets", "are", "blue"]

Page 10: GenStage and Flow - Jose Valim

File.read!("source")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

%{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Page 11: GenStage and Flow - Jose Valim

Eager

• Simple
• Efficient for small collections
• Inefficient for large collections with multiple passes
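As a rough illustration of the eager pipeline above, the sketch below runs the same steps on an in-memory string instead of a file, so it can be pasted into iex. The module name EagerWordCount and its count/1 function are made up for this example and are not from the talk. Each Enum step builds a complete intermediate list before the next one starts, which is where the cost for large inputs comes from.

```elixir
defmodule EagerWordCount do
  # Hypothetical helper (not part of the talk): counts words in a string
  # with the same eager steps as the slides.
  def count(text) do
    text
    # Pass 1: the whole text is split into a list of lines.
    |> String.split("\n")
    # Pass 2: every line is split, producing a full list of words.
    |> Enum.flat_map(&String.split/1)
    # Pass 3: the word list is folded into a map of counts.
    |> Enum.reduce(%{}, fn word, map ->
      Map.update(map, word, 1, &(&1 + 1))
    end)
  end
end

counts = EagerWordCount.count("roses are red\nviolets are blue\n")
IO.inspect(counts)
```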

Page 12: GenStage and Flow - Jose Valim

File.read!("really large file")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)

Page 13: GenStage and Flow - Jose Valim

Lazy

Page 14: GenStage and Flow - Jose Valim

File.stream!("source", :line)

#Stream<...>

Page 15: GenStage and Flow - Jose Valim

File.stream!("source", :line) |> Stream.flat_map(&String.split/1)

#Stream<...>

Page 16: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Stream.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

%{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Page 17: GenStage and Flow - Jose Valim

Lazy

• Folds computations, goes item by item
• Less memory usage at the cost of computation
• Allows us to work with large or infinite collections
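To make the "infinite collections" point concrete, here is a minimal sketch using only the standard Stream and Enum modules. The variable name even_squares is just for this example. Nothing is computed until Enum.take/2 asks for items, and each item passes through every step before the next item is produced.

```elixir
even_squares =
  # An infinite stream: 1, 2, 3, ...
  Stream.iterate(1, &(&1 + 1))
  # Still lazy: describes the squaring, computes nothing yet.
  |> Stream.map(&(&1 * &1))
  # Still lazy: keeps only even squares.
  |> Stream.filter(&(rem(&1, 2) == 0))
  # Forces the stream, item by item, until three results exist.
  |> Enum.take(3)

IO.inspect(even_squares)
```

Replacing Stream with Enum in the first step is not possible here, since an eager version would have to build the infinite list first.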

Page 18: GenStage and Flow - Jose Valim

Concurrent

Page 19: GenStage and Flow - Jose Valim

File.stream!("source", :line)

#Stream<...>

Page 20: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

#Flow<...>

Page 21: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})

%{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Page 22: GenStage and Flow - Jose Valim

Flow

• We give up ordering and process locality for concurrency

• Tools for working with bounded and unbounded data

Page 23: GenStage and Flow - Jose Valim

Flow

• It is not magic! There is an overhead when data flows through processes

• Requires volume and/or cpu/io-bound work to see benefits
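A minimal way to try the Flow pipeline from the earlier slides without a source file is sketched below. It assumes a recent Elixir with Mix.install/1 and network access to fetch the flow package; the version requirement and the in-memory list of lines are choices made for this example, not something the talk specifies. With an input this small the process overhead mentioned above dominates, so this only demonstrates the API, not a speedup.

```elixir
# Assumption: fetches the flow package from Hex when run as a script.
Mix.install([{:flow, "~> 1.0"}])

lines = ["roses are red", "violets are blue"]

counts =
  lines
  |> Flow.from_enumerable()
  |> Flow.flat_map(&String.split/1)
  # Route each word to a fixed stage so all counts for it end up together.
  |> Flow.partition()
  |> Flow.reduce(fn -> %{} end, fn word, map ->
    Map.update(map, word, 1, &(&1 + 1))
  end)
  # Collect the per-stage maps into a single map.
  |> Enum.into(%{})

IO.inspect(counts)
```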

Page 24: GenStage and Flow - Jose Valim

Flow stats

• ~1200 lines of code (LOC)
• ~1300 lines of documentation (LOD)

Page 25: GenStage and Flow - Jose Valim

Topics

• 1200 LOC: How is Flow implemented?
• 1300 LOD: How to reason about flows?

Page 26: GenStage and Flow - Jose Valim

GenStage

Page 27: GenStage and Flow - Jose Valim

GenStage

• It is a new behaviour
• Exchanges data between stages transparently with back-pressure
• Breaks into producers, consumers and producer_consumers

Page 28: GenStage and Flow - Jose Valim

GenStage

[Diagram: stages connected in a pipeline, from a producer, through producer_consumers, to a consumer.]

Page 29: GenStage and Flow - Jose Valim

GenStage: Demand-driven

[Diagram: A (Producer) and B (Consumer). B subscribes to A, B asks for 10, A sends at most 10.]

1. consumer subscribes to producer
2. consumer sends demand
3. producer sends events
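The ask/send exchange above can be imitated with two plain processes. This is a toy sketch of the idea only: the message shapes {:ask, n, pid} and {:events, list}, and the DemandSketch module, are invented for this example and are not the messages GenStage actually exchanges. The point it shows is that the producer never sends more than was asked for.

```elixir
defmodule DemandSketch do
  # Toy producer: keeps a counter and answers each {:ask, demand, consumer}
  # message with exactly `demand` consecutive integers, never more.
  def producer(counter) do
    receive do
      {:ask, demand, consumer} when demand > 0 ->
        events = Enum.to_list(counter..(counter + demand - 1))
        send(consumer, {:events, events})
        producer(counter + demand)
    end
  end

  # Toy consumer: asks for `demand` events, waits for them, and repeats
  # `rounds` times, returning everything it received in order.
  def consume(producer, demand, rounds) do
    Enum.flat_map(1..rounds, fn _round ->
      send(producer, {:ask, demand, self()})

      receive do
        {:events, events} -> events
      end
    end)
  end
end

producer = spawn(fn -> DemandSketch.producer(0) end)
received = DemandSketch.consume(producer, 10, 2)
IO.inspect(received)
```

Because the consumer only asks again after handling a batch, a slow consumer automatically slows the producer down.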

Page 30: GenStage and Flow - Jose Valim

GenStage: Demand-driven

[Diagram: three stages A, B and C in a chain. Each downstream stage asks its upstream stage for 10; each upstream stage sends at most 10.]

Page 31: GenStage and Flow - Jose Valim

GenStage: Demand-driven

• It is a message contract
• It pushes back-pressure to the boundary
• GenStage is one implementation of this contract

Page 32: GenStage and Flow - Jose Valim

GenStage example

Page 33: GenStage and Flow - Jose Valim

[Diagram: A, the Counter (Producer), connected to B, the Printer (Consumer).]

Page 34: GenStage and Flow - Jose Valim

defmodule Producer do
  use GenStage

  def init(counter) do
    {:producer, counter}
  end

  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..counter+demand-1)
    {:noreply, events, counter + demand}
  end
end

Page 35: GenStage and Flow - Jose Valim

state   demand   handle_demand
0       10       {:noreply, [0, 1, …, 9], 10}
10      5        {:noreply, [10, 11, 12, 13, 14], 15}
15      5        {:noreply, [15, 16, 17, 18, 19], 20}

Page 36: GenStage and Flow - Jose Valim

defmodule Consumer do
  use GenStage

  def init(:ok) do
    {:consumer, :the_state_does_not_matter}
  end

  def handle_events(events, _from, state) do
    Process.sleep(1000)
    IO.inspect(events)
    {:noreply, [], state}
  end
end

Page 37: GenStage and Flow - Jose Valim

{:ok, counter} = GenStage.start_link(Producer, 0)

{:ok, printer} = GenStage.start_link(Consumer, :ok)

GenStage.sync_subscribe(printer, to: counter)

(wait 1 second)
[0, 1, 2, ..., 499] (500 events)
(wait 1 second)
[500, 501, 502, ..., 999] (500 events)

Page 38: GenStage and Flow - Jose Valim

Subscribe options

• max_demand: the maximum number of events to ask for (default 1000)

• min_demand: when reached, ask for more events (default 500)

Page 39: GenStage and Flow - Jose Valim

max_demand: 10, min_demand: 0

1. the consumer asks for 10 items
2. the consumer receives 10 items
3. the consumer processes 10 items
4. the consumer asks for 10 more
5. the consumer waits

Page 40: GenStage and Flow - Jose Valim

max_demand: 10, min_demand: 5

1. the consumer asks for 10 items
2. the consumer receives 10 items
3. the consumer processes 5 of 10 items
4. the consumer asks for 5 more
5. the consumer processes the remaining 5
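The subscribe options can be passed to GenStage.sync_subscribe/2 alongside to:. The script below is a sketch under a few assumptions: it fetches gen_stage with Mix.install/1 (so it needs network access), and the Counter and Collector modules are illustrative stand-ins for the Producer and Consumer from the earlier slides. Collector forwards each batch to the calling process instead of printing, so the first batch can be inspected.

```elixir
# Assumption: fetches the gen_stage package from Hex when run as a script.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule Counter do
  use GenStage

  def init(counter), do: {:producer, counter}

  # Emits exactly as many consecutive integers as were demanded.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Collector do
  use GenStage

  # The state is the pid that should be told about each batch.
  def init(parent), do: {:consumer, parent}

  def handle_events(events, _from, parent) do
    send(parent, {:batch, events})
    {:noreply, [], parent}
  end
end

{:ok, counter} = GenStage.start_link(Counter, 0)
{:ok, collector} = GenStage.start_link(Collector, self())

# Ask for at most 10 events at a time; ask for more once 5 remain.
{:ok, _subscription} =
  GenStage.sync_subscribe(collector, to: counter, max_demand: 10, min_demand: 5)

first_batch =
  receive do
    {:batch, events} -> events
  after
    5_000 -> []
  end

IO.inspect(first_batch)
```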

Page 41: GenStage and Flow - Jose Valim

GenStage dispatchers

Page 42: GenStage and Flow - Jose Valim

Dispatchers

• Per producer
• Effectively receive the demand and send data
• Allow a producer to dispatch to multiple consumers at once

Page 43: GenStage and Flow - Jose Valim

DemandDispatcher

[Diagram: prod emits 1,2,3,4,5; its four consumers receive 1 / 2,5 / 3 / 4.]

Page 44: GenStage and Flow - Jose Valim

BroadcastDispatcher

[Diagram: prod emits 1,2,3; each of its four consumers receives 1,2,3.]

Page 45: GenStage and Flow - Jose Valim

PartitionDispatcher

rem(event, 4)

[Diagram: prod emits 1,2,3,4,5,6; its four consumers receive 1,5 / 2,6 / 3 / 4.]
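The routing rule in the PartitionDispatcher slide, rem(event, 4), can be checked with plain Enum code. This sketch only reproduces the arithmetic of which event lands in which partition; it does not start any stages, and the variable name partitions is just for this example.

```elixir
# Group events 1..6 by the partition that rem(event, 4) assigns them to.
partitions = Enum.group_by(1..6, fn event -> rem(event, 4) end)

IO.inspect(partitions)
```

Events 1 and 5 share a partition, as do 2 and 6, which matches the grouping shown on the slide.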

Page 46: GenStage and Flow - Jose Valim

http://bit.ly/genstage

Page 47: GenStage and Flow - Jose Valim

Topics

• 1200 LOC: How is Flow implemented?
  • Its core is an 80 LOC stage
• 1300 LOD: How to reason about flows?

Page 48: GenStage and Flow - Jose Valim

Flow

Page 49: GenStage and Flow - Jose Valim

"roses are red\n violets are blue\n"

%{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Page 50: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

#Flow<...>

Page 51: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})

%{"are" => 2, "blue" => 1, "red" => 1, "roses" => 1, "violets" => 1}

Page 52: GenStage and Flow - Jose Valim

File.stream!("source", :line) |> Flow.from_enumerable()

Producer

Page 53: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)

Producer

Stage 1 Stage 2 Stage 3 Stage 4

DemandDispatcher

Page 54: GenStage and Flow - Jose Valim

[Diagram: the Producer feeds Stages 1-4. The stage that receives "roses are red" emits "roses", "are", "red"; the stage that receives "violets are blue" emits "violets", "are", "blue".]

Page 55: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)

Producer

Stage 1 Stage 2 Stage 3 Stage 4

Page 56: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

Producer

Stage 1 Stage 2 Stage 3 Stage 4

Page 57: GenStage and Flow - Jose Valim

[Diagram: the Producer feeds Stages 1-4. The stage that receives "roses are red" holds %{"are" => 1, "red" => 1, "roses" => 1}; the stage that receives "violets are blue" holds %{"are" => 1, "blue" => 1, "violets" => 1}.]

Page 58: GenStage and Flow - Jose Valim

Producer

Stage 1 Stage 2 Stage 3 Stage 4

%{"are" => 1, "red" => 1, "roses" => 1}

%{"are" => 1, "blue" => 1, "violets" => 1}

Page 59: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

Producer

Stage 1 Stage 2 Stage 3 Stage 4

Stage A Stage B Stage C Stage D

Page 60: GenStage and Flow - Jose Valim

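[Diagram: the Producer sends "roses are red" and "violets are blue" to Stages 1-4, which split them into words and route each word to one of Stages A-D. "roses" reaches one stage, whose state becomes %{"roses" => 1}. "are" and "red" reach another stage, whose state goes from %{"are" => 1} to %{"are" => 1, "red" => 1} and, once the second "are" arrives, to %{"are" => 2, "red" => 1}.]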

Page 61: GenStage and Flow - Jose Valim

[Diagram: the Producer sends lines to Mappers 1-4 through a DemandDispatcher; the Mappers send words to Reducers A-D through a PartitionDispatcher.]
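The mapper/reducer layout above can be approximated with standard-library tasks to show why partitioning matters. This is only a sketch of the idea: PartitionSketch is a made-up module, the number of partitions is fixed at 4 for the example, and hashing with :erlang.phash2/2 is an assumption chosen for illustration rather than a statement about Flow's internals. Because a given word always hashes to the same partition, each "reducer" owns a disjoint set of words and the partial maps can be merged without adding counts together.

```elixir
defmodule PartitionSketch do
  @partitions 4

  def word_count(lines) do
    lines
    # "Mappers": split lines into words, one task per line.
    |> Task.async_stream(&String.split/1)
    |> Enum.flat_map(fn {:ok, words} -> words end)
    # Routing: the same word always lands in the same partition.
    |> Enum.group_by(&:erlang.phash2(&1, @partitions))
    # "Reducers": each partition counts only the words it owns.
    |> Task.async_stream(fn {_partition, words} -> Enum.frequencies(words) end)
    # Keys are disjoint across partitions, so a plain merge is enough.
    |> Enum.reduce(%{}, fn {:ok, counts}, acc -> Map.merge(acc, counts) end)
  end
end

counts = PartitionSketch.word_count(["roses are red", "violets are blue"])
IO.inspect(counts)
```

Without the routing step, two reducers could each hold a partial count for "are", as in the slides that show Flow.reduce without Flow.partition.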

Page 62: GenStage and Flow - Jose Valim

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})

• reduce/3 collects all data into maps
• when it is done, the maps are streamed
• into/2 collects the state into a map

Page 63: GenStage and Flow - Jose Valim

Windows and triggers

• If reduce/3 runs until all data is processed… what happens on unbounded data?

• See Flow.Window documentation.
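To get a feel for the problem, here is a toy analogue using only the standard Stream module; it does not use Flow.Window, whose actual API is described in its documentation. On an endless stream a single reduce would never finish, so the sketch emits a result after every 3 events instead, which is roughly what a count-based trigger does. The chunk size of 3 and the cycled word list are arbitrary choices for this example.

```elixir
window_counts =
  ["roses", "are", "red", "violets", "are", "blue"]
  # An unbounded source: the words repeat forever.
  |> Stream.cycle()
  # "Trigger" after every 3 events.
  |> Stream.chunk_every(3)
  # Reduce each group on its own, starting from an empty state.
  |> Stream.map(&Enum.frequencies/1)
  # Only look at the first two emitted results.
  |> Enum.take(2)

IO.inspect(window_counts)
```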

Page 64: GenStage and Flow - Jose Valim

Postlude: From eager, to lazy, to concurrent, to distributed

Page 65: GenStage and Flow - Jose Valim

Enum (eager)

File.read!("source")
|> String.split("\n")
|> Enum.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

Page 66: GenStage and Flow - Jose Valim

Stream (lazy)

File.stream!("source", :line)
|> Stream.flat_map(&String.split/1)
|> Enum.reduce(%{}, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)

Page 67: GenStage and Flow - Jose Valim

Flow (concurrent)

File.stream!("source", :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split/1)
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, map ->
  Map.update(map, word, 1, & &1 + 1)
end)
|> Enum.into(%{})

Page 68: GenStage and Flow - Jose Valim

Flow features

• Provides map and reduce operations, partitions, flow merge, flow join
• Configurable batch size (max & min demand)
• Data windowing with triggers and watermarks

Page 69: GenStage and Flow - Jose Valim

Distributed?

• Flow API has feature parity with frameworks like Apache Spark

• However, there are no distribution or execution guarantees

Page 70: GenStage and Flow - Jose Valim

CHEN, Y., ALSPAUGH, S., AND KATZ, R. Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads.

“small inputs are common in practice: 40–80% of Cloudera customers’ MapReduce jobs and 70% of jobs in a Facebook trace have ≤ 1GB of input”

Page 71: GenStage and Flow - Jose Valim

GOG, I., SCHWARZKOPF, M., CROOKS, N., GROSVENOR, M. P., CLEMENT, A., AND HAND, S. Musketeer: all for one, one for all in data processing systems.

“For between 40-80% of the jobs submitted to MapReduce systems, you’d be better off just running them on a single machine”

Page 72: GenStage and Flow - Jose Valim

Distributed?

• Single machine matters - try it!
• The gap between concurrent and distributed in Elixir is small
• Durability concerns will be tackled next

Page 73: GenStage and Flow - Jose Valim

Thank-yous

Page 74: GenStage and Flow - Jose Valim

Library authors

• Give GenStage a try
• Build your own producers
• TIP: use @behaviour GenStage if you want to make it an optional dep
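The tip about @behaviour could look roughly like the sketch below. OptionalCounter is a hypothetical module name. Unlike use GenStage, the @behaviour attribute does not need GenStage to be loaded at compile time: if the package is absent, the compiler only warns that the behaviour does not exist. The trade-off, as far as this sketch goes, is that none of the default callbacks that use GenStage would inject are defined here, so a real library module would have to write any it needs.

```elixir
defmodule OptionalCounter do
  # Declares the contract without `use GenStage`, so this module still
  # compiles (with a warning) when gen_stage is not installed.
  @behaviour GenStage

  def init(counter), do: {:producer, counter}

  # Same counting producer as in the GenStage example slides.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

# The callbacks are ordinary functions and can be called directly.
result = OptionalCounter.handle_demand(3, 0)
IO.inspect(result)
```

When gen_stage is available, the module could be started with GenStage.start_link(OptionalCounter, 0), as in the earlier example.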

Page 75: GenStage and Flow - Jose Valim

Inspirations

• Akka Streams - back pressure contract
• Apache Spark - map reduce API
• Apache Beam - windowing model
• Microsoft Naiad - stage notifications

Page 76: GenStage and Flow - Jose Valim

The Team

• The Elixir Core Team
• Especially James Fish (fishcakez)
• And Eric and Chris for therapy sessions

Page 77: GenStage and Flow - Jose Valim

consulting and software engineering

Built and designed at

Page 78: GenStage and Flow - Jose Valim

consulting and software engineering

Elixir coaching

Elixir design review

Custom development

Page 79: GenStage and Flow - Jose Valim

@elixirlang / elixir-lang.org