Hadoop and Storm - AJUG talk

©MapR Technologies Hadoop and Storm AJUG 5/21/2013

Upload: boorad

Post on 27-Jan-2015

DESCRIPTION

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.

TRANSCRIPT

Page 1: Hadoop and Storm - AJUG talk

©MapR Technologies

Hadoop and Storm
AJUG 5/21/2013

Page 2: Hadoop and Storm - AJUG talk

whoami

• Brad Anderson

• Solutions Architect at MapR (Atlanta)

• ATLHUG co-chair

• NoSQL East Conference 2009

• “boorad” most places (twitter, github)

[email protected]

Page 3: Hadoop and Storm - AJUG talk

Hadoop: A Paradigm Shift

Distributed computing platform
– Large clusters
– Commodity hardware

Pioneered at Google
– Google File System, MapReduce and BigTable

Commercially available as Hadoop

Page 4: Hadoop and Storm - AJUG talk

Ship the Function to the Data

[Figure: two panels. Traditional Architecture: data held on SAN/NAS, function running in an RDBMS. Distributed Computing: a grid of nodes, each pairing data with its own copy of the function.]

Page 5: Hadoop and Storm - AJUG talk

MapReduce Flow

[Figure: Input -> Map -> Combine -> Shuffle and sort -> Reduce -> Output]
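The stages named on this slide can be walked through in memory with a word count. This is only a sketch of the data flow, assuming a word-count job; it is plain Java rather than Hadoop code, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory walk-through of the MapReduce stages; not Hadoop code. */
final class MapReduceFlowSketch {

    /** Map + combine: tokenize one input split and pre-sum counts locally. */
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : split.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                partial.merge(word, 1, Integer::sum);
            }
        }
        return partial;
    }

    /** Shuffle and sort: gather every partial count under its key, keys in sorted order. */
    static Map<String, List<Integer>> shuffleAndSort(List<Map<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : mapOutputs) {
            partial.forEach((word, count) ->
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(count));
        }
        return grouped;
    }

    /** Reduce: collapse the list of values for each key into one total. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((word, counts) ->
                totals.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    /** Runs all stages over the given input splits. */
    static Map<String, Integer> run(List<String> splits) {
        List<Map<String, Integer>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(mapAndCombine(split));
        }
        return reduce(shuffleAndSort(mapOutputs));
    }

    public static void main(String[] args) {
        // prints {and=2, hadoop=3, more=1, storm=1}
        System.out.println(run(List.of("storm and hadoop", "hadoop and more hadoop")));
    }
}
```

In a real cluster each split is mapped on the node that holds it, and only the combined partial counts cross the network during the shuffle.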

Page 6: Hadoop and Storm - AJUG talk

Variation: No Reduce Necessary
Example: Batch File Transformation

[Figure: Input -> Map -> Output, with MPG files on the input side and M4V files on the output side]

Page 7: Hadoop and Storm - AJUG talk

Variation: Multiple MapReduces
Example: Fraud Detection in User Transactions

[Figure: chained MapReduce jobs. Labels: Transaction data; LDA training; LDA scoring; HBase / MapR M7 Edition; G2 score; 95th-percentile LDA anomaly; Candidate events for analyst review]

http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Page 8: Hadoop and Storm - AJUG talk

Pig

Page 9: Hadoop and Storm - AJUG talk

MR Equivalent to Pig Script

Page 10: Hadoop and Storm - AJUG talk

Hive

Page 11: Hadoop and Storm - AJUG talk

MapR Distribution for Apache Hadoop

Complete Hadoop distribution

Comprehensive management suite

Industry-standard interfaces

Enterprise-grade dependability

Enterprise-grade security (US Intelligence Agency)

Patents - IP

Higher performance

[Figure: distribution stack. Components: Pig, Hive, HBase, Mahout, Oozie, Whirr, MapReduce, Cascading, Nagios, Ganglia, Flume, Sqoop, HCatalog, Zookeeper, Avro, layered over the MapR Control System and the MapR Data Platform]

Page 12: Hadoop and Storm - AJUG talk

Hadoop Use Cases

ETL/EDW Offload

Sensor / Telemetry Data

Recommendation Engine

Search • ML algorithms • eDiscovery

Fleet Management

Fraud Detection / Risk Management

Traffic Decongestion

Page 13: Hadoop and Storm - AJUG talk

One Platform for Big Data

99.999% HA

Data Protection

Disaster Recovery

Scalability & Performance

Enterprise Integration

Multi-tenancy

Workloads: MapReduce, File-Based Applications, SQL, Database, Search, Stream Processing

Batch: Log file Analysis, Data Warehouse Offload, Fraud Detection, Clickstream Analytics

Interactive: Forensic Analysis, Analytic Modeling, BI User Focus

Realtime: Sensor Analysis, “Twitterscraping”, Telematics, Process Optimization

Page 14: Hadoop and Storm - AJUG talk

Storm

“Hadoop for Realtime”

Page 15: Hadoop and Storm - AJUG talk

Before Storm

Queues Workers

Page 16: Hadoop and Storm - AJUG talk

Example

(simplified)

Page 17: Hadoop and Storm - AJUG talk

Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

No intermediate message brokers!

Higher level abstraction than message passing

“Just works”

Page 18: Hadoop and Storm - AJUG talk

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Streams

Page 19: Hadoop and Storm - AJUG talk

Source of streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Spouts

Page 20: Hadoop and Storm - AJUG talk

public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}

Spouts
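The nextTuple / ack / fail trio is what lets a spout replay messages that were not fully processed. The sketch below models that contract without depending on Storm: the class name is hypothetical, and a BiConsumer stands in for SpoutOutputCollector. A real spout would implement ISpout and emit through the collector it receives in open().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiConsumer;

/** Storm-free model of the nextTuple/ack/fail contract. */
final class ReplayingSpoutSketch {
    private final Queue<String> ready = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    // Hypothetical stand-in for SpoutOutputCollector: receives (message, msgId).
    private final BiConsumer<String, Long> emitter;
    private long nextId = 0;

    ReplayingSpoutSketch(List<String> messages, BiConsumer<String, Long> emitter) {
        this.ready.addAll(messages);
        this.emitter = emitter;
    }

    /** Emits at most one message, tagged with a fresh id and remembered until acked. */
    void nextTuple() {
        String msg = ready.poll();
        if (msg == null) {
            return; // nothing to emit right now
        }
        long id = nextId++;
        pending.put(id, msg);
        emitter.accept(msg, id);
    }

    /** Processing completed downstream: forget the message. */
    void ack(Object msgId) {
        pending.remove(msgId);
    }

    /** Processing failed or timed out: put the message back for replay. */
    void fail(Object msgId) {
        String msg = pending.remove(msgId);
        if (msg != null) {
            ready.add(msg);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        ReplayingSpoutSketch spout =
                new ReplayingSpoutSketch(List.of("a", "b"), (msg, id) -> emitted.add(msg + "#" + id));
        spout.nextTuple();  // emits a#0
        spout.nextTuple();  // emits b#1
        spout.ack(0L);      // "a" is done
        spout.fail(1L);     // "b" goes back on the queue
        spout.nextTuple();  // emits b#2, a replay under a new id
        System.out.println(emitted); // prints [a#0, b#1, b#2]
    }
}
```

Keeping the message until ack arrives is what gives at-least-once delivery; the cost is that downstream code may see the same message twice.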

Page 21: Hadoop and Storm - AJUG talk

Processes input streams and produces new streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple

Bolts

Page 22: Hadoop and Storm - AJUG talk

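// Storm 0.8/0.9-era packages; Apache Storm 1.0 and later use org.apache.storm.*
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;

public class DoubleAndTripleBolt extends BaseRichBolt {
    // BaseRichBolt.prepare receives an OutputCollector (not OutputCollectorBase).
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        int val = input.getInteger(0);
        // Anchor the output to the input tuple, then ack the input.
        _collector.emit(input, new Values(val * 2, val * 3));
        _collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("double", "triple"));
    }
}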

Bolts

Page 23: Hadoop and Storm - AJUG talk

Network of spouts and bolts

Topologies
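As a toy illustration of tuples flowing through a network, the sketch below wires a list of spout values through a linear chain of bolts, the simplest possible topology. None of the types are Storm's: the Bolt interface and the run method are hypothetical, everything runs on one thread, and the first bolt only mirrors the logic of DoubleAndTripleBolt from the earlier slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Single-threaded toy model of a topology; none of these types are Storm's. */
final class TopologySketch {

    /** Hypothetical stand-in for a bolt: consume one tuple, emit zero or more. */
    interface Bolt {
        void execute(List<Object> tuple, Consumer<List<Object>> emit);
    }

    /** Same logic as DoubleAndTripleBolt: (val) -> (val*2, val*3). */
    static final Bolt DOUBLE_AND_TRIPLE = (tuple, emit) -> {
        int val = (Integer) tuple.get(0);
        emit.accept(List.of(val * 2, val * 3));
    };

    /** Downstream bolt: (double, triple) -> (double + triple). */
    static final Bolt SUM_FIELDS = (tuple, emit) -> {
        int doubled = (Integer) tuple.get(0);
        int tripled = (Integer) tuple.get(1);
        emit.accept(List.of(doubled + tripled));
    };

    /** Feeds each spout value through the bolts in order and returns the final stream. */
    static List<List<Object>> run(List<Integer> spoutValues, List<Bolt> bolts) {
        List<List<Object>> stream = new ArrayList<>();
        for (Integer value : spoutValues) {
            stream.add(List.of(value));
        }
        for (Bolt bolt : bolts) {
            List<List<Object>> next = new ArrayList<>();
            for (List<Object> tuple : stream) {
                bolt.execute(tuple, next::add);
            }
            stream = next;
        }
        return stream;
    }

    public static void main(String[] args) {
        // prints [[5], [10], [15]]
        System.out.println(run(List.of(1, 2, 3), List.of(DOUBLE_AND_TRIPLE, SUM_FIELDS)));
    }
}
```

Storm itself runs each spout and bolt as parallel tasks across machines and lets a topology branch and merge; the chain here only shows how one component's output stream becomes the next one's input.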

Page 24: Hadoop and Storm - AJUG talk

TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);

Trident

Cascading for Storm
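To make the result of the Trident pipeline concrete, here is a plain-Java model of split, group by word, and a count that persists across batches. It is a sketch only and not Trident code: the class name is hypothetical, and a TreeMap plays the role that MemoryMapState plays on the slide.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of split -> groupBy(word) -> persistent count; not Trident code. */
final class RunningWordCountSketch {
    // Plays the role of the in-memory state: counts survive across batches.
    private final Map<String, Long> counts = new TreeMap<>();

    /** Processes one batch of sentences and folds the result into the running state. */
    void processBatch(List<String> sentences) {
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {      // the Split() step
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);      // groupBy + Count() step
                }
            }
        }
    }

    /** Returns a copy of the current counts, sorted by word. */
    Map<String, Long> snapshot() {
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        RunningWordCountSketch state = new RunningWordCountSketch();
        state.processBatch(List.of("the cow jumped", "the moon"));
        state.processBatch(List.of("the cow"));
        // prints {cow=2, jumped=1, moon=1, the=3}
        System.out.println(state.snapshot());
    }
}
```

The second batch updates the counts left by the first, which is the difference from a one-shot batch word count: the aggregate is always current for everything seen so far.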

Page 25: Hadoop and Storm - AJUG talk

Parallel Cluster Ingest

[Figure: architecture diagram. Labels: Raw Data; Queue (Kafka); Storm (realtime processes); Hadoop (batch processes); Apps; Business Value]

Page 26: Hadoop and Storm - AJUG talk

[Figure: architecture diagram. Labels: Raw Data; Queue (Kafka); Franz; TailSpout; Storm (realtime processes); Hadoop (batch processes); Apps; Business Value]

Page 27: Hadoop and Storm - AJUG talk

[Figure: architecture diagram. Labels: Twitter; Twitter API; TweetLogger; Kafka Cluster (x3); Kafka API; Kafka; Storm; Web Service; NAS; Web Data; Flume; Hadoop; HDFS Data]

Page 28: Hadoop and Storm - AJUG talk

[Figure: architecture diagram. Labels: Twitter; Twitter API; TweetLogger; Catcher; Topic Queue; Storm; Web-server; http; Web Data; MapR]

Page 29: Hadoop and Storm - AJUG talk

Scaling Estimates
Twitter Firehose

Old School – 8+ separate clusters, 20-25 nodes
• >3 Kafka nodes
• >2 TweetLoggers
• 5-10 Hadoop
• >2 Catcher nodes
• >3 Storm
• 3 zookeepers
• NAS for web storage
• >2 web servers

MapR – One Platform
• 5-10 nodes total
• Any node does any job
• Full HA included
• Backups included

Page 30: Hadoop and Storm - AJUG talk

github

• Watch TailSpout & Franz development

• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout

• And our example Twitter implementation

• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout-test

Page 31: Hadoop and Storm - AJUG talk

Demo