hadoop and storm - ajug talk

©MapR Technologies

Hadoop and Storm

AJUG 5/21/2013

whoami

• Brad Anderson

• Solutions Architect at MapR (Atlanta)

• ATLHUG co-chair

• NoSQL East Conference 2009

• “boorad” most places (twitter, github)

• banderson@maprtech.com

Hadoop: A Paradigm Shift

Distributed computing platform
– Large clusters
– Commodity hardware

Pioneered at Google
– Google File System, MapReduce and BigTable

Commercially available as Hadoop

Ship the Function to the Data

SAN/NAS

data data data

function

Traditional Architecture

function

Distributed Computing

MapReduce Flow

Map Combine

Shuffle and sort

Reduce

Output

Reduce
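The flow above can be sketched in plain Java. This is only a single-process simulation of the map, combine, shuffle-and-sort and reduce stages using word count as the job; it uses no Hadoop classes, and the class and method names (`MapReduceFlowSketch`, `mapAndCombine`, `shuffleAndSort`, `reduce`, `run`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process simulation of the MapReduce stages; not Hadoop API code.
final class MapReduceFlowSketch {

    // Map + combine: count words inside one input split (local pre-aggregation).
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : split.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                partial.merge(word, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Shuffle and sort: bring all partial counts for a key together, keys in sorted order.
    static Map<String, List<Integer>> shuffleAndSort(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    // Each string stands in for one input split handled by one mapper.
    static Map<String, Integer> run(List<String> splits) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String split : splits) {
            partials.add(mapAndCombine(split));
        }
        return reduce(shuffleAndSort(partials));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("ship the function", "to the data")));
    }
}
```

On a real cluster each split is mapped on the node that holds the data, and only the combined partial counts cross the network during the shuffle.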

Variation: No Reduce Necessary

Example: Batch File Transformation

Map Output

MPG M4V

Variation: Multiple MapReduces

Example: Fraud Detection in User Transactions

LDA training

Transaction data

LDA scoring

HBase / MapR M7 Edition

G2 score

Candidate events for analyst review

95th-percentile LDA anomaly

MapReduce

http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
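The "95th-percentile" step in the pipeline above amounts to a threshold on anomaly scores. The slide does not say how the cutoff is computed, so the following is a minimal sketch that assumes a nearest-rank percentile over an in-memory array of scores; `AnomalyThresholdSketch`, `percentile` and `candidates` are illustrative names, not part of the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Assumption: scores fit in memory and "95th percentile" means nearest-rank.
final class AnomalyThresholdSketch {

    // Nearest-rank percentile: the smallest score with at least p% of scores at or below it.
    static double percentile(double[] scores, double p) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Indices of events whose score lies strictly above the p-th percentile.
    static List<Integer> candidates(double[] scores, double p) {
        List<Integer> indices = new ArrayList<>();
        if (scores.length == 0) {
            return indices;
        }
        double threshold = percentile(scores, p);
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > threshold) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.2, 0.15, 0.9, 0.12, 0.3};
        System.out.println(candidates(scores, 95));
    }
}
```

In the pipeline shown, the scoring itself would run as a MapReduce job; only the thresholding idea is illustrated here.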

MR Equivalent to Pig Script

MapR Distribution for Apache Hadoop

Complete Hadoop distribution

Comprehensive management suite

Industry-standard interfaces

Enterprise-grade dependability

Enterprise-grade security (US Intelligence Agency)

Patents - IP

Higher performance

Mahout

Map Reduce

Cascading

Nagios

Ganglia

MapR Control System

MapR Data Platform

HCatalog

Zookeeper

Hadoop Use Cases

ETL/EDW Offload

Sensor / Telemetry Data

Recommendation Engine

Search

• ML algorithms

• eDiscovery

Fleet Management

Fraud Detection / Risk Management

Traffic Decongestion

One Platform for Big Data

99.999% HA

Data Protection

Disaster Recovery

Scalability & Performance

Enterprise Integration

Multi-tenancy

MapReduce

File-Based Applications

SQL

Database

Search

Stream Processing

Interactive

Realtime

Batch
– Log file Analysis
– Data Warehouse Offload
– Fraud Detection
– Clickstream Analytics

Realtime
– Sensor Analysis
– “Twitterscraping”
– Telematics
– Process Optimization

Interactive
– Forensic Analysis
– Analytic Modeling
– BI User Focus

“Hadoop for Realtime”

Before Storm

Queues Workers
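A minimal sketch of the hand-wired queues-and-workers style this slide refers to, using only `java.util.concurrent`. The two-stage word-counting job and the names `QueueWorkerSketch`, `countWords` and `STOP` are assumptions made for illustration. Note how routing, shutdown and parallelism all have to be coded by hand, and nothing here replays a message if a worker dies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hand-wired pipeline: sentence queue -> splitter workers -> word queue -> counter worker.
final class QueueWorkerSketch {
    // Shutdown marker compared by identity, so no real message can collide with it.
    private static final String STOP = new String("stop");

    static long countWords(List<String> sentences, int splitters) {
        BlockingQueue<String> sentenceQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> wordQueue = new LinkedBlockingQueue<>();
        AtomicLong total = new AtomicLong();

        // Stage 1: workers that split sentences into words.
        List<Thread> splitterThreads = new ArrayList<>();
        for (int i = 0; i < splitters; i++) {
            Thread t = new Thread(() -> {
                try {
                    for (String s = sentenceQueue.take(); s != STOP; s = sentenceQueue.take()) {
                        for (String w : s.trim().split("\\s+")) {
                            if (!w.isEmpty()) {
                                wordQueue.put(w);
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            splitterThreads.add(t);
        }

        // Stage 2: a single worker that counts words.
        Thread counter = new Thread(() -> {
            try {
                while (wordQueue.take() != STOP) {
                    total.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        counter.start();

        try {
            for (String s : sentences) {
                sentenceQueue.put(s);
            }
            for (int i = 0; i < splitters; i++) {
                sentenceQueue.put(STOP); // one marker per splitter
            }
            for (Thread t : splitterThreads) {
                t.join();
            }
            wordQueue.put(STOP); // all words are queued, so the counter can finish
            counter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining queues", e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("ship the function", "to the data"), 2));
    }
}
```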

Example

(simplified)

Guaranteed data processing

Horizontal scalability

Fault-tolerance

No intermediate message brokers!

Higher level abstraction than message passing

“Just works”

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Streams

Source of streams

Spouts

public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}

Spouts
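The `ack` and `fail` callbacks are what make guaranteed processing possible: a spout keeps each emitted message until it is acknowledged and re-queues it on failure. The following plain-Java sketch shows that bookkeeping only. It deliberately does not implement the Storm `ISpout` interface, and `ReplayingSourceSketch`, `payload` and `isDone` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Stand-in that mirrors nextTuple/ack/fail; it has no dependency on Storm.
final class ReplayingSourceSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Map<Long, String> inFlight = new HashMap<>();
    private long nextId = 0;

    ReplayingSourceSketch(Iterable<String> messages) {
        for (String m : messages) {
            pending.add(m);
        }
    }

    // Like nextTuple(): hand out one message tagged with a message id, or -1 if none is pending.
    long nextTuple() {
        String message = pending.poll();
        if (message == null) {
            return -1;
        }
        long id = nextId++;
        inFlight.put(id, message); // remember it until it is acked or failed
        return id;
    }

    // The message body for an id that is still in flight (null otherwise).
    String payload(long id) {
        return inFlight.get(id);
    }

    // Like ack(msgId): the message was fully processed, so forget it.
    void ack(long id) {
        inFlight.remove(id);
    }

    // Like fail(msgId): processing failed, so queue the message for replay.
    void fail(long id) {
        String message = inFlight.remove(id);
        if (message != null) {
            pending.add(message);
        }
    }

    boolean isDone() {
        return pending.isEmpty() && inFlight.isEmpty();
    }

    public static void main(String[] args) {
        ReplayingSourceSketch source = new ReplayingSourceSketch(Arrays.asList("a", "b"));
        long first = source.nextTuple();
        source.fail(first); // "a" goes back to the end of the pending queue
        for (long id = source.nextTuple(); id != -1; id = source.nextTuple()) {
            System.out.println("processed " + source.payload(id));
            source.ack(id);
        }
    }
}
```

Replaying on failure gives at-least-once delivery, so downstream processing has to tolerate seeing the same message twice.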

Processes input streams and produces new streams

Tuple Tuple Tuple Tuple

Bolts

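// Package names are those of pre-Apache Storm releases (backtype.storm);
// Apache Storm 1.0 and later use org.apache.storm instead.
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;

public class DoubleAndTripleBolt extends BaseRichBolt {
    // BaseRichBolt.prepare receives an OutputCollector, not OutputCollectorBase.
    private OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple input) {
        int val = input.getInteger(0);
        // Anchor the output to the input tuple, then ack the input.
        _collector.emit(input, new Values(val * 2, val * 3));
        _collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("double", "triple"));
    }
}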

Network of spouts and bolts

Topologies

TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);

Trident

Cascading for Storm

Hadoop

batch processes

Business

Raw Data

realtime processes

Queue

Parallel Cluster Ingest

Hadoop

batch processes

Business

Raw Data

realtime processes

TailSpout

Franz

Queue

Storm

Kafka

Twitter

Twitter API

TweetLogger

Kafka Cluster

Kafka Cluster

Kafka API

Web Service NAS

Web Data

Hadoop

HDFS Data

Twitter

Twitter API

Catcher Storm

Topic Queue

Web-server

Web Data

TweetLogger

Scaling Estimates

Twitter Firehose

Old School – 8+ separate clusters, 20-25 nodes
• >3 Kafka nodes
• >2 TweetLoggers
• 5-10 Hadoop
• >2 Catcher nodes
• >3 Storm
• 3 zookeepers
• NAS for web storage
• >2 web servers

MapR – One Platform
• 5-10 nodes total
• Any node does any job
• Full HA included
• Backups included

github

• Watch TailSpout & Franz development

• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout

• And our example Twitter implementation

• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout-test
