hadoop and storm - ajug talk

Post on 27-Jan-2015

107 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.

TRANSCRIPT

©MapR Technologies

Hadoop and StormAJUG 5/21/2013

whoami

• Brad Anderson

• Solutions Architect at MapR (Atlanta)

• ATLHUG co-chair

• NoSQL East Conference 2009

• “boorad” most places (twitter, github)

• banderson@maprtech.com

Hadoop: A Paradigm Shift

Distributed computing platform– Large clusters– Commodity hardware

Pioneered at Google– Google File System, MapReduce and BigTable

Commercially available as Hadoop

Ship the Function to the Data

SAN/NAS

data data data

data data data

data data data

data data data

data data data

function

RDBMS

Traditional Architecture

data

function

data

function

data

function

data

function

data

function

data

function

data

function

data

function

data

function

data

function

data

function

data

function

Distributed Computing

MapReduce Flow

Input

Map Combine

Shuffleand sort

Reduce

Output

Reduce

Variation: No Reduce NecessaryExample: Batch File Transformation

Input

Map Output

MPG M4V

Variation: Multiple MapReducesExample: Fraud Detection in User Transactions

LDA training

Transaction data

LDA scoring

HBase /MapR M7 Edition

G2 score

Candidate events for

analyst review

95 %-ile LDA anomaly

MapReduce

http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Pig

MR Equivalent to Pig Script

Hive

MapR Distribution for Apache Hadoop

Complete Hadoop distribution

Comprehensive management suite

Industry-standard interfaces

Enterprise-grade dependability

Enterprise-grade security (US Intelligence Agency)

Patents - IP

Higher performance

Pig

Hive

HBase

Mahout

Oozie

Whirr

Map Reduce

Cascading

Nagios

Ganglia

MapR Control System

MapR Data Platform

MapR Control System

MapR Data Platform

Flume

Sqoop

HCatalog

Zookeeper

Avro

Map

Reduc

e

Hadoop Use Cases

ETL/EDW Offload

Sensor / Telemetry Data

Recommendation Engine

Search•ML algorithms•eDiscovery

Fleet Management

Fraud Detection / Risk Management

Traffic Decongestion

One Platform for Big Data

99.999% HA

Data Protection

Disaster Recovery

Scalability &

PerformanceEnterprise Integration

Multi-tenancy

MapReduce

File-Based Applications SQL Database Search Stream

Processing

Batch

Interactive

Realtime

BatchLog file Analysis

Data Warehouse OffloadFraud Detection

Clickstream Analytics

RealtimeSensor Analysis

“Twitterscraping”Telematics

Process Optimization

InteractiveForensic Analysis

Analytic ModelingBI User Focus

©MapR Technologies

Storm

“Hadoop for Realtime”

©MapR Technologies

Before Storm

Queues Workers

©MapR Technologies

Example

(simplified)

©MapR Technologies

Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

No intermediate message brokers!

Higher level abstraction than message passing

“Just works”

©MapR Technologies

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Streams

©MapR Technologies

Source of streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Spouts

©MapR Technologies

public interface ISpout extends Serializable { void open(Map conf, TopologyContext context, SpoutOutputCollector collector); void close(); void nextTuple(); void ack(Object msgId); void fail(Object msgId);}

Spouts

©MapR Technologies

Processes input streams and produces new streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple

Bolts

©MapR Technologies

public class DoubleAndTripleBolt extends BaseRichBolt { private OutputCollectorBase _collector;

public void prepare(Map conf, TopologyContext context, OutputCollectorBase collector) { _collector = collector; }

public void execute(Tuple input) { int val = input.getInteger(0); _collector.emit(input, new Values(val*2, val*3)); _collector.ack(input); }

public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("double", "triple")); } }

Bolts

©MapR Technologies

Network of spouts and bolts

Topologies

©MapR Technologies

TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6);

Trident

Cascading for Storm

Storm

©MapR Technologies

Hadoop

batchprocesses

Apps

Business

Value

RawData

realtime processes

Queue (

Kafk

a)

Parallel Cluster Ingest

©MapR Technologies

Hadoop

batchprocesses

Apps

Business

Value

RawData

realtime processes

Storm

TailSpoutFr

anzQueue (

Kafk

a)

StormKafka

Twitter

Twitter API

TweetLoggerKafka

ClusterKafka

ClusterKafka

Cluster

Kafka API

Storm

Web Service NAS

Web Data

Hadoop

Flume

HDFS Data

Twitter

TwitterAPI

Catcher Storm

Topic Queue

Web-server

http

Web Data

MapR

TweetLogger

Scaling EstimatesTwitter Firehose

Old School – 8+ separate clusters, 20-25 nodes• >3 Kafka nodes• >2 TweetLoggers• 5-10 Hadoop• >2 Catcher nodes• >3 Storm• 3 zookeepers• NAS for web storage• >2 web servers

MapR – One Platform• 5-10 nodes total• Any node does any job• Full HA included• Backups included

©MapR Technologies

github

• Watch TailSpout & Franz development

• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout

• And our example Twitter implementation

• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout-test

Demo

top related