hadoop summit europe 2014: apache storm architecture

© Hortonworks Inc. 2011

P. Taylor Goetz Apache Storm Committer [email protected] @ptgoetz

Apache Storm Architecture and Integration

Real-Time Big Data

Shedding Light on Big Data In Real Time

What is Storm?

• Storm is Streaming: Key enabler of the Lambda Architecture
• Storm is Fast: Clocked at 1M+ messages per second per node
• Storm is Scalable: Thousands of workers per cluster
• Storm is Fault Tolerant: Failure is expected, and embraced
• Storm is Reliable: Guaranteed message delivery, Exactly-once semantics

Conceptual Model

Tuple

{…}

• Core Unit of Data
• Immutable Set of Key/Value Pairs

Streams

{…} {…} {…} {…} {…} {…} {…}

Unbounded Sequence of Tuples

Spouts

• Source of Streams
• Wraps a streaming data source and emits Tuples

{…} {…} {…} {…} {…} {…} {…}

Spout API

public interface ISpout extends Serializable {
    // Lifecycle API
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void activate();
    void deactivate();

    // Core API
    void nextTuple();

    // Reliability API
    void ack(Object msgId);
    void fail(Object msgId);
}
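The reliability methods of a spout (ack and fail) only help if the spout remembers what it has emitted. The following is a minimal, Storm-free sketch of that bookkeeping, assuming an in-memory message source. `PendingTracker` and all of its method names are hypothetical and are not part of the Storm API; they only mirror the roles of nextTuple(), ack(msgId), and fail(msgId).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper, not part of Storm: the bookkeeping a reliable spout
// could keep behind nextTuple(), ack(msgId) and fail(msgId).
final class PendingTracker {
    private final Queue<String> source = new ArrayDeque<>();   // messages waiting to be emitted
    private final Map<Long, String> pending = new HashMap<>(); // emitted but not yet acked
    private long nextId = 0;

    // Adds a message to the (assumed in-memory) source.
    void offer(String message) { source.add(message); }

    // Plays the role of nextTuple(): take one message, remember it under a
    // fresh id, and return that id (or -1 if there is nothing to emit).
    long emitNext() {
        String message = source.poll();
        if (message == null) return -1;
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    // Plays the role of ack(msgId): processing completed, forget the message.
    void ack(long id) { pending.remove(id); }

    // Plays the role of fail(msgId): queue the message again for replay.
    void fail(long id) {
        String message = pending.remove(id);
        if (message != null) source.add(message);
    }

    int pendingCount() { return pending.size(); }
    int queuedCount() { return source.size(); }
}
```

A replayed message gets a new id in this sketch; a real spout might instead reuse an id taken from the upstream data source.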

Bolts

• Core functions of a streaming computation
• Receive tuples and do stuff
• Optionally emit additional tuples
• Write to a data store
• Read from a data store
• Perform arbitrary computation
• (Optionally) Emit additional streams

Bolt API

public interface IBolt extends Serializable {
    // Lifecycle API
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
    void cleanup();

    // Core API
    void execute(Tuple input);
}

Bolt Output API

public interface IOutputCollector extends IErrorReporter {
    // Core API
    List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple);
    void emitDirect(int taskId, String streamId, Collection<Tuple> anchors, List<Object> tuple);

    // Reliability API
    void ack(Tuple input);
    void fail(Tuple input);
}

Topologies

• DAG of Spouts and Bolts
• Data Flow Representation
• Streaming Computation
• Storm executes spouts and bolts as individual Tasks that run in parallel on multiple machines.

Stream Groupings

Stream Groupings determine how Storm routes Tuples between tasks in a topology.

• Shuffle: Randomized round-robin.
• LocalOrShuffle: Randomized round-robin, with a preference for intra-worker Tasks.
• Fields Grouping: Ensures all Tuples with the same field value(s) are always routed to the same task (this is a simple hash of the field values, modulo the number of tasks).
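The "hash of the field values, modulo the number of tasks" rule can be sketched in a few lines. This is an illustration only: `FieldsGroupingSketch` and `chooseTask` are hypothetical names, and using `List.hashCode()` as the hash is an assumption, not necessarily the function Storm uses internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of fields-grouping routing; not Storm's actual code.
final class FieldsGroupingSketch {
    // Picks a target task index from the grouping field values:
    // hash of the values, modulo the number of tasks.
    static int chooseTask(List<Object> fieldValues, int numTasks) {
        int hash = fieldValues.hashCode();     // assumed hash function
        return Math.floorMod(hash, numTasks);  // floorMod keeps the index non-negative
    }

    public static void main(String[] args) {
        int numTasks = 4;
        List<Object> key = Arrays.asList("user-42");
        // The same field values always map to the same task index.
        System.out.println(chooseTask(key, numTasks) == chooseTask(key, numTasks));
    }
}
```

Because the result depends on the number of tasks, the mapping of values to tasks stays stable only as long as the task count does.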

Physical View

ZooKeeper   Nimbus

Supervisor Supervisor Supervisor Supervisor

Worker* Worker* Worker* Worker*

Topology Deployment

1. Topology Submitter uploads topology ($ bin/storm jar):
   • topology.jar
   • topology.ser
   • conf.ser
2. Nimbus calculates assignments and sends them to ZooKeeper.
3. Supervisor nodes receive assignment information via ZooKeeper watches.
4. Supervisor nodes download topology from Nimbus:
   • topology.jar
   • topology.ser
   • conf.ser
5. Supervisors spawn workers (JVM processes) to start the topology.

Fault Tolerance

• Workers heartbeat back to Supervisors and Nimbus via ZooKeeper, as well as locally.
• If a worker dies (fails to heartbeat), the Supervisor will restart it.
• If a worker dies repeatedly, Nimbus will reassign the work to other nodes in the cluster.
• If a supervisor node dies, Nimbus will reassign the work to other nodes.
• If Nimbus dies, topologies will continue to function normally, but won’t be able to perform reassignments.
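Heartbeat-based failure detection comes down to comparing each worker's last heartbeat time against a timeout. The sketch below shows only that comparison, under stated assumptions: `HeartbeatMonitor`, the string worker ids, and the in-memory map are all hypothetical, whereas Storm keeps heartbeats in ZooKeeper and on local disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of heartbeat timeout detection; not Storm's actual code.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastBeat = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Records that a worker reported in at the given time.
    void heartbeat(String workerId, long nowMillis) { lastBeat.put(workerId, nowMillis); }

    // Returns the workers whose last heartbeat is older than the timeout.
    // A supervisor-like component would restart these.
    List<String> deadWorkers(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastBeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) dead.add(entry.getKey());
        }
        return dead;
    }
}
```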

Parallelism

Scaling a Distributed Computation

Each Worker (JVM) runs one or more Executors (Threads), and each Executor runs one or more Tasks.

[Diagrams: 1 Worker, Parallelism = 1 | 1 Worker, Parallelism = 2 | 1 Worker, Parallelism = 2, NumTasks = 2 | 3 Workers, Parallelism = 1, NumTasks = 1]
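To make the relationship between executors (threads) and tasks concrete, the sketch below spreads a component's tasks over its executors. It is an illustration under assumptions: `TaskAssignmentSketch` is a hypothetical name, and the round-robin placement is only one plausible even split; Storm's own assignment logic may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spreading a component's tasks over its executors.
final class TaskAssignmentSketch {
    // Splits task ids 0..numTasks-1 across numExecutors threads as evenly as
    // possible, using round-robin placement (an assumption, see lead-in).
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need at least one task per executor");
        }
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            executors.get(task % numExecutors).add(task);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Four tasks on two executors: each thread runs two tasks.
        System.out.println(assign(4, 2));
    }
}
```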

Internal Messaging

Worker Mechanics

Worker Internal Messaging

[Diagram: Worker Port → Worker Receive Thread → Receive Buffer (List<List<Tuple>>) → Executor Thread* (Inbound Queue → Task(s) (Spout/Bolt) → Outbound Queue → Router Send Thread) → Transfer Buffer (List<List<Tuple>>) → Worker Transfer Thread → To Other Workers]

Reliable Processing

At Least Once

• Bolts may emit Tuples Anchored to one received. Tuple “B” is a descendant of Tuple “A”.
• Multiple Anchorings form a Tuple tree (bolts not shown).
• Bolts can Acknowledge that a tuple has been processed successfully.
• Acks are delivered via a system-level bolt (the Acker Bolt).
• Bolts can also Fail a tuple to trigger a spout to replay the original.
• Any failure in the Tuple tree will trigger a replay of the original tuple.

How to track a large-scale tuple tree efficiently?

A single 64-bit integer.

XOR Magic

Random random = new Random();
long a = random.nextLong(), b = random.nextLong(), c = random.nextLong();

(a ^ a) == 0
(a ^ a ^ b) != 0        // unless b happens to be 0
(a ^ a ^ b ^ b) == 0
(a ^ (a ^ b) ^ c ^ (b ^ c)) == 0

Acks can arrive asynchronously, in any order
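The XOR identities are what allow a whole tuple tree to be tracked in a single 64-bit value: every tuple id is XORed in once when the tuple is created and once when it is acked, so the value returns to zero exactly when everything has been acked. The following is a minimal sketch of that idea. `AckTrackerSketch` is a hypothetical class, not Storm's acker implementation, and the scenario in `main` is an assumed example.

```java
import java.util.Random;

// Hypothetical sketch of XOR-based tuple tree tracking; not Storm's acker code.
final class AckTrackerSketch {
    private long ackVal = 0L; // the single 64-bit value for one tuple tree

    // XORs an update into the tracked value. Updates may arrive in any order.
    void xor(long update) { ackVal ^= update; }

    // The tree is fully processed once every id has been XORed in twice.
    boolean isComplete() { return ackVal == 0L; }

    public static void main(String[] args) {
        Random random = new Random();
        long a = random.nextLong(), b = random.nextLong(), c = random.nextLong();

        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.xor(a);         // spout emits tuple A
        tracker.xor(a ^ b ^ c); // a bolt acks A and emits B and C anchored to A
        tracker.xor(c);         // C is acked first ...
        tracker.xor(b);         // ... then B: order does not matter
        System.out.println(tracker.isComplete());
    }
}
```

With random 64-bit ids, the value reaching zero before the tree is really complete is possible in principle but extremely unlikely.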

Trident

High-level abstraction built on Storm’s core primitives.

Built-in support for:

• Merges and Joins
• Aggregations
• Groupings
• Functions
• Filters

Stateful, incremental processing on top of any persistence store.

Trident is Storm

Fluent, Stream-Oriented API

TridentTopology topology = new TridentTopology();
FixedBatchSpout spout = new FixedBatchSpout(…);
Stream stream = topology.newStream("words", spout);

stream.each(…, new MyFunction())   // user-defined function
      .groupBy(…)
      .each(…, new MyFilter())     // user-defined filter
      .persistentAggregate(…);

Micro-Batch Oriented

Tuple Micro-Batch: {…} {…} {…} {…}

Trident Batches are Ordered

[Diagram: two Tuple Micro-Batches, labeled Batch #1 and Batch #2]

Trident Batches can be Partitioned

[Diagram: a Partition Operation splits a Tuple Micro-Batch into Partition A, Partition B, Partition C, and Partition D]

Trident Operation Types

1. Local Operations (Functions/Filters)

2. Repartitioning Operations (Stream Groupings, etc.)

3. Aggregations

4. Merges/Joins

Trident Topologies

[Diagram: each (Function) → each (Filter) → shuffle → partition persist]

Partitioning operations define the boundaries between bolts, and thus network transfer and parallelism.

[Diagram: each (Function) and each (Filter) run in Bolt 1; the shuffle partitioning operation becomes a shuffleGrouping(); partition persist runs in Bolt 2]
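The way partitioning operations cut a chain of Trident operations into bolts can be sketched as a simple grouping step. This is an illustration only: `BoltBoundarySketch`, the operation names as plain strings, and the linear chain are assumptions, not how Trident's planner is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how repartitioning operations split a chain into bolts.
final class BoltBoundarySketch {
    // Groups a linear chain of operation names into "bolts": each
    // repartitioning operation closes the current group and starts a new one.
    static List<List<String>> splitIntoBolts(List<String> operations,
                                             List<String> repartitioningOps) {
        List<List<String>> bolts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String op : operations) {
            if (repartitioningOps.contains(op)) {
                // Boundary: tuples cross the network here (a stream grouping).
                if (!current.isEmpty()) {
                    bolts.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(op); // local operation, stays in the same bolt
            }
        }
        if (!current.isEmpty()) bolts.add(current);
        return bolts;
    }

    public static void main(String[] args) {
        List<String> chain = Arrays.asList(
            "each(Function)", "each(Filter)", "shuffle", "partitionPersist");
        System.out.println(splitIntoBolts(chain, Arrays.asList("shuffle")));
    }
}
```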

Trident Batch Coordination

[Diagram: Master Batch Coordinator, Trident Spout, and User Logic; a "next batch" signal, a Tuple Micro-Batch of {…} tuples, and a "commit" signal]

Controlling Deployment

How do you control where spouts and bolts get deployed in a cluster?

• Pluggable Schedulers
• Isolation Scheduler

Wait… Nimbus, Supervisor, Schedulers…

Doesn’t that sound kind of like resource negotiation?

Storm on YARN

HADOOP 2.0: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)

• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• MapReduce (batch)
• Tez (interactive)
• Apache STORM (streaming)

What Storm on YARN offers:

• Batch and real-time on the same cluster
• Security and Multi-tenancy
• Elasticity

Storm on YARN

Storm’s resource management system maps very naturally to the YARN model.

Storm:
• Nimbus: Resource Management, Scheduling
• Supervisor: Node and Process management
• Workers: Runs topology tasks

YARN:
• YARN RM: Resource Management
• Storm AM: Manage Topology
• Containers: Runs topology tasks
• YARN NM: Process Management

Storm on YARN

• High Availability
• Detect and scale around bottlenecks
• Optimize for available resources

Shameless Plug

https://www.packtpub.com/storm-distributed-real-time-computation-blueprints/book

Thank You!

Contributions welcome.

Join the Storm community at: http://storm.incubator.apache.org

P. Taylor Goetz [email protected] @ptgoetz
