storm - as deep into real-time data processing as you can get in 30 minutes

Storm: As deep into real-time data processing as you can get* (*in 30 minutes). Dan Lynn, [email protected], @danklynn

Upload: dan-lynn

Post on 10-May-2015

DESCRIPTION

My slides from GlueCon 2013

TRANSCRIPT

Page 1: Storm - As deep into real-time data processing as you can get in 30 minutes

Storm

Dan Lynn [email protected]

@danklynn

As deep into real-time data processing as you can get* (*in 30 minutes).

Page 2: Storm - As deep into real-time data processing as you can get in 30 minutes

Keeps Contact Information Current and Complete

Based in Denver, Colorado

CTO & [email protected]

@danklynn

Page 3: Storm - As deep into real-time data processing as you can get in 30 minutes

Turn Partial Contacts Into Full Contacts

Page 4: Storm - As deep into real-time data processing as you can get in 30 minutes

Storm

Page 5: Storm - As deep into real-time data processing as you can get in 30 minutes

Storm: Distributed and fault-tolerant real-time computation

Page 9: Storm - As deep into real-time data processing as you can get in 30 minutes

THE HARD WAY

Queues

Workers
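For a sense of what "the hard way" involves, here is a minimal sketch of a hand-rolled queue-and-worker pipeline in plain Java. It is an illustration only: the class name `HardWayPipeline`, the sentinel value, and the word-counting task are assumptions, not something taken from the talk. Even this toy version has to handle worker shutdown and shared state by hand, and it still has no fault tolerance or distribution across machines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a hand-rolled queue + worker pipeline.
class HardWayPipeline {
    // Sentinel message that tells a worker to stop (an assumption of this sketch).
    private static final String STOP = "\u0000STOP";

    // Counts words in the given sentences using `workers` threads fed by one queue.
    static Map<String, Integer> countWords(List<String> sentences, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String sentence = queue.take();
                        if (sentence.equals(STOP)) {
                            return;
                        }
                        for (String word : sentence.split("\\s+")) {
                            if (!word.isEmpty()) {
                                counts.merge(word, 1, Integer::sum);
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        queue.addAll(sentences);
        for (int i = 0; i < workers; i++) {
            queue.add(STOP); // one sentinel per worker
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("the cow jumped", "over the moon"), 3));
    }
}
```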

Page 10: Storm - As deep into real-time data processing as you can get in 30 minutes

THE HARD WAY

Page 11: Storm - As deep into real-time data processing as you can get in 30 minutes

Key Concepts

Page 12: Storm - As deep into real-time data processing as you can get in 30 minutes

Tuples: Ordered list of elements

("search-01384", "e:[email protected]")
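A tuple can be pictured as an ordered list of values whose positions carry field names. The sketch below is a plain-Java stand-in, not Storm's own `Tuple` interface: the class `SimpleTuple` and the field names `searchId` and `query` are assumptions made for illustration, and the e-mail value is a placeholder because the address on the slide is redacted.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tuple: ordered values plus a parallel list of field names.
class SimpleTuple {
    private final List<String> fields;
    private final List<Object> values;

    SimpleTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field");
        }
        this.fields = fields;
        this.values = values;
    }

    // Positional access: element order is part of the tuple's meaning.
    String getString(int index) {
        return (String) values.get(index);
    }

    // Named access: look up the position of the field, then read by position.
    String getStringByField(String field) {
        int index = fields.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return getString(index);
    }

    public static void main(String[] args) {
        SimpleTuple tuple = new SimpleTuple(
                Arrays.asList("searchId", "query"),                               // assumed field names
                Arrays.<Object>asList("search-01384", "e:user@example.com"));     // placeholder address
        System.out.println(tuple.getString(0));
        System.out.println(tuple.getStringByField("query"));
    }
}
```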

Page 14: Storm - As deep into real-time data processing as you can get in 30 minutes

Streams: Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple

Page 16: Storm - As deep into real-time data processing as you can get in 30 minutes

Spouts: Source of streams

Tuple Tuple Tuple Tuple Tuple Tuple

Page 19: Storm - As deep into real-time data processing as you can get in 30 minutes

Spouts can talk with

•Queues

•Web logs

•API calls

•Event data

some images from http://commons.wikimedia.org

Page 20: Storm - As deep into real-time data processing as you can get in 30 minutes

Bolts: Process tuples and create new streams

Page 21: Storm - As deep into real-time data processing as you can get in 30 minutes

Bolts

•Apply functions / transforms

•Filter

•Aggregation

•Streaming joins

•Access DBs, APIs, etc...

Page 22: Storm - As deep into real-time data processing as you can get in 30 minutes

Bolts

Tuple Tuple Tuple Tuple Tuple Tuple

[Figure: tuple streams entering and leaving a bolt]

Page 23: Storm - As deep into real-time data processing as you can get in 30 minutes

Topologies: A directed graph of Spouts and Bolts

Page 24: Storm - As deep into real-time data processing as you can get in 30 minutes

This is a Topology


Page 25: Storm - As deep into real-time data processing as you can get in 30 minutes

This is also a topology


Page 26: Storm - As deep into real-time data processing as you can get in 30 minutes

Tasks: Execute Spouts or Bolts

Page 27: Storm - As deep into real-time data processing as you can get in 30 minutes

Running a Topology

$ storm jar my-code.jar com.example.MyTopology arg1 arg2

Page 28: Storm - As deep into real-time data processing as you can get in 30 minutes

Storm Cluster

Nathan Marz

If this were Hadoop...

Job Tracker

Task Trackers

Coordinates everything

But it’s not Hadoop

Page 33: Storm - As deep into real-time data processing as you can get in 30 minutes

Example: Streaming Word Count

Page 34: Storm - As deep into real-time data processing as you can get in 30 minutes

Streaming Word Count

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

Page 36: Storm - As deep into real-time data processing as you can get in 30 minutes

Streaming Word Count

public static class SplitSentence extends ShellBolt implements IRichBolt {

    public SplitSentence() {
        // Delegate tuple processing to an external Python script
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Each emitted tuple has a single field named "word"
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // No component-specific configuration
        return null;
    }
}

SplitSentence.java

splitsentence.py

Page 40: Storm - As deep into real-time data processing as you can get in 30 minutes

Streaming Word Count

public static class WordCount extends BaseBasicBolt {
    // Running count per word, held in this task's memory
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        // Emit the updated running total for this word
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

WordCount.java

Page 41: Storm - As deep into real-time data processing as you can get in 30 minutes

Streaming Word Count

Groupings control how tuples are routed

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

Page 42: Storm - As deep into real-time data processing as you can get in 30 minutes

Shuffle grouping: Tuples are randomly distributed across all of the tasks running the bolt
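The idea behind a shuffle grouping can be sketched in a few lines of plain Java. This is a simplified model, not Storm's implementation: the class `ShuffleGroupingSketch` and its methods are hypothetical, and it assumes each tuple is sent to a uniformly random task index.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical model of shuffle grouping: every tuple goes to a random task.
class ShuffleGroupingSketch {
    private final int numTasks;
    private final Random random;

    ShuffleGroupingSketch(int numTasks, long seed) {
        this.numTasks = numTasks;
        this.random = new Random(seed);
    }

    // Picks the index of the task that receives the next tuple.
    int chooseTask() {
        return random.nextInt(numTasks);
    }

    // Routes numTuples tuples and reports how many each task received.
    static int[] distribute(int numTuples, int numTasks, long seed) {
        ShuffleGroupingSketch grouping = new ShuffleGroupingSketch(numTasks, seed);
        int[] perTask = new int[numTasks];
        for (int i = 0; i < numTuples; i++) {
            perTask[grouping.chooseTask()]++;
        }
        return perTask;
    }

    public static void main(String[] args) {
        // Eight tasks, matching the parallelism hint of the "split" bolt.
        System.out.println(Arrays.toString(distribute(8000, 8, 42L)));
    }
}
```

With enough tuples, each task ends up with roughly the same share of the load, which is why this grouping suits stateless work such as splitting sentences.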

Page 43: Storm - As deep into real-time data processing as you can get in 30 minutes

Fields grouping: Groups tuples by specific named fields and routes them to the same task

Analogous to Hadoop’s partitioning behavior
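One common way to get this behavior is to hash the values of the grouping fields and take the result modulo the number of tasks. The sketch below assumes that approach; the class `FieldsGroupingSketch` and the method `chooseTask` are hypothetical names, and Storm's own hashing details may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of fields grouping: hash the grouping values, then mod by task count.
class FieldsGroupingSketch {

    // Maps the values of the grouping fields to a task index in [0, numTasks).
    static int chooseTask(List<?> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 12; // matches the parallelism hint of the "count" bolt
        for (String word : Arrays.asList("storm", "hadoop", "storm", "trident", "storm")) {
            int task = chooseTask(Arrays.asList(word), numTasks);
            System.out.println(word + " -> task " + task);
        }
    }
}
```

Because equal values always hash to the same index, every occurrence of a given word reaches the same task, so that task's in-memory count for the word stays complete.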

Page 45: Storm - As deep into real-time data processing as you can get in 30 minutes

Trending Topics

Page 46: Storm - As deep into real-time data processing as you can get in 30 minutes

Twitter Trending Topics

(tweets)
  -> TwitterStreamingTopicSpout, parallelism = 1 (unless you use GNip)
(word)
  -> RollingCountsBolt, parallelism = n
(word, count)
  -> IntermediateRankingsBolt, parallelism = n
(rankings)
  -> TotalRankingsBolt, parallelism = 1
(rankings)
  -> RankingsReportBolt, parallelism = 1
(JSON rankings)

Page 47: Storm - As deep into real-time data processing as you can get in 30 minutes

Live Coding!

Page 49: Storm - As deep into real-time data processing as you can get in 30 minutes

Tips

Page 50: Storm - As deep into real-time data processing as you can get in 30 minutes

Use a log aggregator

•loggly.com

•Graylog2

•logstash

Page 51: Storm - As deep into real-time data processing as you can get in 30 minutes

Rolling Deploys

Topology name: "$topologyName-$buildNumber"

Page 52: Storm - As deep into real-time data processing as you can get in 30 minutes

Rolling Deploys

1. Launch new topology

2. Wait for it to be healthy

3. Kill the old one

Page 53: Storm - As deep into real-time data processing as you can get in 30 minutes

Rolling Deploys

These are under active development

Page 54: Storm - As deep into real-time data processing as you can get in 30 minutes

Tune your parallelism

TopologyBuilder builder = new TopologyBuilder();

// The last argument of setSpout/setBolt is the parallelism hint
builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

see: https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology

Page 55: Storm - As deep into real-time data processing as you can get in 30 minutes

Tune your parallelism

Supervisor
    Worker Process (JVM)
        Executor (thread)
            Task
            Task
        Executor (thread)
            Task
            Task
    Worker Process (JVM)
        Executor (thread)
            Task
            Task
        Executor (thread)
            Task
            Task

Parallelism hints control the number of Executors
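The relationship between tasks and executors can be worked through with a little arithmetic. The sketch below assumes tasks are spread as evenly as possible over the executors of one component; the class `ParallelismSketch` and the method `tasksPerExecutor` are hypothetical names, and the real scheduler's assignment may differ in detail. By default Storm runs one task per executor, so a layout like the one above, with two tasks per executor, only appears when the task count is set explicitly (for example with `setNumTasks`).

```java
import java.util.Arrays;

// Hypothetical helper: how many tasks each executor (thread) of a component runs.
class ParallelismSketch {

    // Spreads numTasks over numExecutors as evenly as possible.
    static int[] tasksPerExecutor(int numTasks, int numExecutors) {
        if (numExecutors <= 0 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need at least one task per executor");
        }
        int[] sizes = new int[numExecutors];
        for (int task = 0; task < numTasks; task++) {
            sizes[task % numExecutors]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Four tasks on two executors: two tasks per thread.
        System.out.println(Arrays.toString(tasksPerExecutor(4, 2))); // prints [2, 2]
        // Default case: as many tasks as executors, one task per thread.
        System.out.println(Arrays.toString(tasksPerExecutor(8, 8)));
    }
}
```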

Page 56: Storm - As deep into real-time data processing as you can get in 30 minutes

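Anchor your tuples (or not)

Unanchored: collector.emit(new Values(word, count));

Anchored: collector.emit(tuple, new Values(word, count));

Passing the input tuple as the first argument anchors the emitted tuple to it, so a downstream failure causes the original tuple to be replayed from the spout.

see: https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology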

Page 57: Storm - As deep into real-time data processing as you can get in 30 minutes

But Dan, you left out Trident!

Page 58: Storm - As deep into real-time data processing as you can get in 30 minutes

if (storm == hadoop) { trident = pig / cascading}

Page 59: Storm - As deep into real-time data processing as you can get in 30 minutes

A little taste of Trident

TridentState urlToTweeters =
        topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers =
        topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
        .stateQuery(urlToTweeters, new Fields("args"), new MapGet(),
                    new Fields("tweeters"))
        .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
        .shuffle()
        .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(),
                    new Fields("followers"))
        .parallelismHint(200)
        .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
        .groupBy(new Fields("follower"))
        .aggregate(new One(), new Fields("one"))
        .parallelismHint(20)
        .aggregate(new Count(), new Fields("reach"));

https://github.com/nathanmarz/storm/wiki/Trident-tutorial