more than websites: php and the firehose @datasift (2013)

Upload: stuart-herbert

Post on 09-May-2015

DESCRIPTION

PHP is the world's #1 programming language for creating websites. But it's capable of so much more. How about real-time processing the social firehose? :)

TRANSCRIPT

Page 1: More Than Websites: PHP And The Firehose @DataSift (2013)

More Than Websites

PHP And The Firehose @ DataSift

Saturday, 23 March 13

Page 2: More Than Websites: PHP And The Firehose @DataSift (2013)

Introduce Yourselves

Page 3: More Than Websites: PHP And The Firehose @DataSift (2013)

@stuherbert

Page 4: More Than Websites: PHP And The Firehose @DataSift (2013)

What is DataSift?

Page 5: More Than Websites: PHP And The Firehose @DataSift (2013)

Sift through social data

Twitter firehose, Facebook, bitly clicks, news, videos, comments and more

Page 6: More Than Websites: PHP And The Firehose @DataSift (2013)

Gain insights using augmentations

Language, gender, trends, links, sentiment, salience & entity analysis and more

Page 7: More Than Websites: PHP And The Firehose @DataSift (2013)

Realtime

Get matching data within seconds of it being posted

Page 8: More Than Websites: PHP And The Firehose @DataSift (2013)

Historics

Search our social data archive going back to January 2010

Page 9: More Than Websites: PHP And The Firehose @DataSift (2013)

Pull the data from our servers

via HTTP/1.1 streaming or websockets
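
As an illustration of what pulling data over a long-lived streaming connection involves on the client side, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and it assumes one JSON object per line with blank keep-alive lines in between; the slide does not state the wire format, so that framing, the function name `read_interactions` and the sample payloads are all hypothetical.

```python
import io
import json
from typing import BinaryIO, Iterator


def read_interactions(stream: BinaryIO, chunk_size: int = 16) -> Iterator[dict]:
    """Yield one decoded JSON object per line from a long-lived byte stream."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:              # the server closed the connection
            break
        buffer += chunk
        # A chunk may end mid-message, so only complete lines are decoded.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():       # skip blank keep-alive lines (assumed)
                yield json.loads(line)


# Stand-in for a streaming HTTP response body (hypothetical payloads).
fake_body = io.BytesIO(
    b'{"id": 1, "content": "hello"}\n\n{"id": 2, "content": "world"}\n'
)
received = list(read_interactions(fake_body))
```

The small `chunk_size` is deliberate: it forces messages to arrive split across reads, which is the case a streaming client has to handle.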

Page 10: More Than Websites: PHP And The Firehose @DataSift (2013)

Let us push data to you

Have the data delivered directly to your servers or into your databases

Page 11: More Than Websites: PHP And The Firehose @DataSift (2013)

DataSift in numbers

Page 12: More Than Websites: PHP And The Firehose @DataSift (2013)

30

Sources of social data and data augmentations

Page 13: More Than Websites: PHP And The Firehose @DataSift (2013)

Up to 20,000

Number of new pieces of data ingested into DataSift every second

Page 14: More Than Websites: PHP And The Firehose @DataSift (2013)

3 Terabytes

Amount of new data added to the Historics archive every week
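
To put the last two figures on a common scale, the arithmetic below converts them to a per-day ceiling and a per-second average. It is only a back-of-the-envelope sketch: it assumes the peak rate of 20,000 items per second were sustained for a full day (the slide says "up to"), and that a terabyte means 10^12 bytes. The constant names are illustrative.

```python
PEAK_ITEMS_PER_SECOND = 20_000          # "up to" figure from the slide
ARCHIVE_BYTES_PER_WEEK = 3 * 10**12     # 3 TB, assuming decimal terabytes

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

# Upper bound on items per day if the peak rate never dropped.
daily_ceiling = PEAK_ITEMS_PER_SECOND * SECONDS_PER_DAY

# Average archive growth, in megabytes per second.
archive_mb_per_second = ARCHIVE_BYTES_PER_WEEK / SECONDS_PER_WEEK / 10**6

print(f"{daily_ceiling:,} items/day at most")
print(f"{archive_mb_per_second:.2f} MB/s added to the archive on average")
```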

Page 15: More Than Websites: PHP And The Firehose @DataSift (2013)

12

Different ways we can deliver data to you

Page 16: More Than Websites: PHP And The Firehose @DataSift (2013)

1

Average number of seconds to pass the data through DataSift

Page 17: More Than Websites: PHP And The Firehose @DataSift (2013)

12

Number of services data passes through inside DataSift

Page 18: More Than Websites: PHP And The Firehose @DataSift (2013)

25

Number of engineers who write code for the DataSift platform

Page 19: More Than Websites: PHP And The Firehose @DataSift (2013)

5

Primary programming languages: C++, Node, PHP, Python, Scala

Page 20: More Than Websites: PHP And The Firehose @DataSift (2013)

154

Private GitHub repos

Page 21: More Than Websites: PHP And The Firehose @DataSift (2013)

Our GitHub Repositories

[Bar chart: number of repositories per language. Bars, in the order shown: PHP, Java & Scala, C & C++, JS & Node, Unclassified, Python, Shell Script, Ruby, C#, VimL. Axis runs from 0 to 60.]

Page 22: More Than Websites: PHP And The Firehose @DataSift (2013)

Architecture

Page 23: More Than Websites: PHP And The Firehose @DataSift (2013)

Three major data pipelines

+ supporting services

Page 24: More Than Websites: PHP And The Firehose @DataSift (2013)

Data Archiving

Adds new data to the Historics Archive

Page 25: More Than Websites: PHP And The Firehose @DataSift (2013)

Filtering Pipeline

Filtering and delivery of data in realtime

Page 26: More Than Websites: PHP And The Firehose @DataSift (2013)

Playback Pipeline

Filtering and delivery of data from the Historics Archive

Page 27: More Than Websites: PHP And The Firehose @DataSift (2013)

DataSift Technical Architecture

[Diagram: "DataSift Architecture 2.2", credited to @lorenzoalberton. Labels that survive extraction, roughly grouped:

Input Streams: Twitter, Facebook, Wikipedia, Reddit, LexisNexis, Meltwater, Estimize, Digg, NewsCred, BoardReader, MySpace, SuperFeeder, Bit.ly

Data ingestion + Augmentation: Goblin Head, Goblin Tail, Interaction Generation, Ogre, Augmentation Pipeline (3rd party APIs, Demographics, Trends Analysis, Sentiment Analysis, Named Entities, Topics Analysis, Language Detection, Klout Score + Profile), Links Resolution + OpenGraph + Twitter Cards + Metadata, Deletes Processor, Stream Splitter/Joiner, Deduper, Msg splitter, Redis, Kafka

Filtering: Pickle Filtering Engine, Pickle Node shards, Prism, ACL (with interaction counter), Stream Recorder, Control Channels, Hardware Load Balancer

Historics: Ultrahose Archiver, HDFS Archiver, Hadoop, Titan Historics, Map/Reduce, HBase Cluster (Region 1 to Region N), Data Nodes, Interaction Targets Mapping, Tardis Pickle filtering, Time Machine + Insights, jobs DB, chunks DB, chunk selector, job tracker

Delivery: Real-time Streams (HTTP Streaming, WebSockets), Meteor, Node, Worker, Snapshotter, Buffered Streams (Redis), PUSH Scheduler, PUSH Producer, PUSH Delivery, Subscriptions DB, kafka-consumer; destinations HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, Microsoft Azure, MongoDB, Oracle, CouchDB, IBM Cognos, Google BigQuery, cloud storage, DBs, BI tools

Supporting services: Web, API, Stream Manager, Definition Manager, CSDL Compiler, Validator, Normaliser, Historics Scheduler, Recording Scheduler, Push Scheduler, Limit Manager, Authentication Manager, Notification Service, License Manager, Billing Pipeline, Mask Manager, Connection Manager, Monitoring Aggregator, Monitoring Kafka Queue, Events Storage, EDRs (licensed content metrics), Audit]

Page 28: More Than Websites: PHP And The Firehose @DataSift (2013)

Filtering Pipeline

[Diagram: the DataSift Architecture 2.2 diagram again, apparently with the filtering pipeline highlighted. Labels in the highlighted portion include: Input Streams, Data ingestion + Augmentation (Goblin Head, Goblin Tail, Interaction Generation, Ogre, Augmentation Pipeline, Links Resolution), Stream Splitter/Joiner, Deduper, Msg splitter, Pickle Filtering Engine (Pickle Node shards, Prism, ACL with interaction counter), Stream Recorder, Control Channels, Hardware Load Balancer, Worker, Snapshotter, Buffered Streams (Redis), Meteor, Real-time Streams (HTTP Streaming, WebSockets), PUSH Scheduler, PUSH Producer, PUSH Delivery, Subscriptions DB, Kafka, kafka-consumer, and the delivery destinations.]

Page 29: More Than Websites: PHP And The Firehose @DataSift (2013)

Data Archiving Pipeline

[Diagram: the DataSift Architecture 2.2 diagram again, apparently with the data archiving pipeline highlighted. Labels in the highlighted portion include: Input Streams, Data ingestion + Augmentation (Goblin Head, Goblin Tail, Interaction Generation, Ogre, Augmentation Pipeline, Links Resolution), Stream Splitter/Joiner, Deduper, Msg splitter, Kafka, Ultrahose Archiver, HDFS Archiver, HBase Cluster (Region 1 to Region N).]

Page 30: More Than Websites: PHP And The Firehose @DataSift (2013)

Playback Pipeline

[Diagram: the DataSift Architecture 2.2 diagram again, apparently with the playback pipeline highlighted. Labels in the highlighted portion include: HDFS Archiver, Hadoop, Titan Historics, Map/Reduce, HBase Cluster (Region 1 to Region N), Data Nodes, Interaction Targets Mapping, Tardis Pickle filtering, Time Machine + Insights (post-processing, stream analytics), jobs DB, chunks DB, chunk selector, job tracker, Historical Queries, Exports and Analytics, ACL (with interaction counter), Hardware Load Balancer, Worker, Snapshotter, Buffered Streams (Redis), Meteor, Real-time Streams (HTTP Streaming, WebSockets), PUSH Scheduler, PUSH Producer, PUSH Delivery, Subscriptions DB, kafka-consumer, and the delivery destinations.]

Page 31: More Than Websites: PHP And The Firehose @DataSift (2013)

Written In PHP

[Diagram: the DataSift Architecture 2.2 diagram again, apparently with the components written in PHP highlighted. Labels in the highlighted portion include: Input Streams (Facebook, Wikipedia, Reddit, LexisNexis, Meltwater, Estimize, Digg, Bit.ly), Interaction Generation, Ogre, Augmentation Pipeline (3rd party APIs, Demographics, Trends Analysis, Sentiment Analysis, Named Entities, Topics Analysis, Language Detection, Klout Score + Profile), Links Resolution + OpenGraph + Twitter Cards + Metadata, Deletes Processor, Worker, Snapshotter, Buffered Streams (Redis), PUSH Scheduler, job queue, Subscriptions DB, PUSH Delivery and its destinations, kafka-consumer, Monitoring Aggregator, Limit Manager, Notification Service, License Manager, Billing Pipeline, Mask Manager, Authentication Manager, Stream Manager, Definition Manager, Recording Scheduler.]

Page 32: More Than Websites: PHP And The Firehose @DataSift (2013)

100%

Every piece of data is handled by our PHP code in realtime

Page 33: More Than Websites: PHP And The Firehose @DataSift (2013)

What we do in PHP

Page 34: More Than Websites: PHP And The Firehose @DataSift (2013)

Marketing website

Runs on Drupal

Page 35: More Than Websites: PHP And The Firehose @DataSift (2013)

Our main webapp

Customer signup, stream creation, account management

Page 36: More Than Websites: PHP And The Firehose @DataSift (2013)

Our external API

Our main interface with customers

Page 37: More Than Websites: PHP And The Firehose @DataSift (2013)

Boring!

That’s all very standard stuff, well understood

The interesting uses are behind the scenes

Page 38: More Than Websites: PHP And The Firehose @DataSift (2013)

Behind the scenes?

Are you mad?!?

Everyone knows that PHP is only for building websites!

Page 39: More Than Websites: PHP And The Firehose @DataSift (2013)

Internal services

APIs that support our data pipelines

User management, billing, data security

Page 40: More Than Websites: PHP And The Firehose @DataSift (2013)

Data assembly

Convert incoming data into common ‘interaction’ structure
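
A rough sketch of what converting incoming data into one common structure can look like follows. It is in Python for illustration only, not the PHP the talk describes, and everything specific in it is an assumption: the `Interaction` fields, the two payload shapes, and the `ASSEMBLERS` registry are hypothetical and are not DataSift's actual interaction format.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Interaction:
    """Hypothetical common structure; these field names are assumptions."""
    source: str
    author: str
    content: str


def from_tweet(payload: Dict[str, Any]) -> Interaction:
    # Assumed tweet-like shape: {"user": {"screen_name": ...}, "text": ...}
    return Interaction("twitter", payload["user"]["screen_name"], payload["text"])


def from_comment(payload: Dict[str, Any]) -> Interaction:
    # Assumed comment-like shape: {"commenter": ..., "body": ...}
    return Interaction("comment", payload["commenter"], payload["body"])


# One assembler per source keeps source-specific knowledge in one place.
ASSEMBLERS: Dict[str, Callable[[Dict[str, Any]], Interaction]] = {
    "twitter": from_tweet,
    "comment": from_comment,
}


def assemble(source: str, payload: Dict[str, Any]) -> Interaction:
    """Convert a source-specific payload into the common structure."""
    try:
        return ASSEMBLERS[source](payload)
    except KeyError as missing:
        # Raised for an unknown source or a payload missing an expected key.
        raise ValueError(f"cannot assemble {source!r} payload: {missing}") from None


tweet = assemble("twitter", {"user": {"screen_name": "stuherbert"}, "text": "hello"})
comment = assemble("comment", {"commenter": "alice", "body": "nice talk"})
```

Everything downstream of `assemble` then only needs to understand one shape, whatever the original source was.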

Page 41: More Than Websites: PHP And The Firehose @DataSift (2013)

100%

Every piece of data is handled by our PHP code in realtime

Page 42: More Than Websites: PHP And The Firehose @DataSift (2013)

Push delivery

Outbound delivery of data to customers’ servers and into their databases

Page 43: More Than Websites: PHP And The Firehose @DataSift (2013)

1 MP3/sec

How much data we can deliver to a single EC2 micro-instance

Page 44: More Than Websites: PHP And The Firehose @DataSift (2013)

500

Number of simultaneous deliveries to customers every second

Page 45: More Than Websites: PHP And The Firehose @DataSift (2013)

Hornet

Our EvilTestTool(tm)

Designed to melt the data centre

Page 46: More Than Websites: PHP And The Firehose @DataSift (2013)

Storyteller

Our functional test tool

Brings user stories to life

Fires up VMs, deploys code, tests services

Reproducibly

Page 47: More Than Websites: PHP And The Firehose @DataSift (2013)

Why PHP?

Page 48: More Than Websites: PHP And The Firehose @DataSift (2013)

Our History

DataSift grew out of TweetMeme

Page 49: More Than Websites: PHP And The Firehose @DataSift (2013)

Our Product

PHP is superb at handling unstructured data

Page 50: More Than Websites: PHP And The Firehose @DataSift (2013)

Our Customers

PHP can talk to any server, database / datastore that we want to deliver data to

Page 51: More Than Websites: PHP And The Firehose @DataSift (2013)

Our People

Several ‘names’ from PHP community

PHP is a language most engineers know

Page 52: More Than Websites: PHP And The Firehose @DataSift (2013)

Our Time

PHP is a great language to build high-quality code very very quickly

Page 53: More Than Websites: PHP And The Firehose @DataSift (2013)

Our Performance

PHP is fast enough for data assembly work

and is getting faster with every major release

Page 54: More Than Websites: PHP And The Firehose @DataSift (2013)

Our Sanity

Our PHP applications require less Ops time than any of the others

Page 55: More Than Websites: PHP And The Firehose @DataSift (2013)

frameworks

Page 56: More Than Websites: PHP And The Firehose @DataSift (2013)

Rolled our own

Frink & Stone

Page 57: More Than Websites: PHP And The Firehose @DataSift (2013)

Right choice for us

We’re not part of the target demographic for the major PHP frameworks

(nor the minor ones, tbh)

Page 58: More Than Websites: PHP And The Firehose @DataSift (2013)

Frink

TweetMeme’s framework

built to handle millions of tweeted links a day

Page 59: More Than Websites: PHP And The Firehose @DataSift (2013)

Built for speed

Stripped down to the bare essentials

a reaction to experience with early Zend Framework

Page 60: More Than Websites: PHP And The Firehose @DataSift (2013)

Jobqueues

Long-running daemon processes

Worker processes handle data queues

Manager process monitors workers
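
The worker and manager arrangement on this slide can be sketched roughly as below. This is not Frink's code: it is Python for illustration only, threads stand in for the separate daemon processes, and the `POISON` job that makes a worker die is a hypothetical device for showing the manager noticing a dead worker and replacing it.

```python
import queue
import threading
import time

POISON = "crash"  # hypothetical job that makes a worker die


def worker(jobs, results, lock):
    """Handle queued jobs until told to stop; die if a job is poisonous."""
    while True:
        job = jobs.get()
        if job is None:                # sentinel: clean shutdown
            return
        if job == POISON:
            with lock:
                results["failed"] += 1
            return                     # simulate an unexpected worker death
        with lock:
            results["done"].append(job.upper())


def run_manager(job_list, worker_count=1):
    """Start workers, restart any that die, stop when every job is handled."""
    jobs = queue.Queue()
    for job in job_list:
        jobs.put(job)
    results = {"done": [], "failed": 0}
    lock = threading.Lock()

    def spawn():
        thread = threading.Thread(
            target=worker, args=(jobs, results, lock), daemon=True
        )
        thread.start()
        return thread

    workers = [spawn() for _ in range(worker_count)]
    restarts = 0
    while True:
        for index, thread in enumerate(workers):
            if not thread.is_alive():  # the monitoring step
                workers[index] = spawn()
                restarts += 1
        with lock:
            handled = len(results["done"]) + results["failed"]
        if handled == len(job_list):
            break
        time.sleep(0.01)

    for _ in workers:                  # ask the surviving workers to exit
        jobs.put(None)
    for thread in workers:
        thread.join(timeout=1)
    return results["done"], restarts


done, restarts = run_manager([POISON, "tweet", "comment"])
```

With a single worker and the poisonous job first, the remaining jobs can only complete after the manager has replaced the dead worker, which is the behaviour the slide's third line describes.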

Page 61: More Than Websites: PHP And The Firehose @DataSift (2013)

Stone

Foundation of our in-house test tools

Hornet and Storyteller

Page 62: More Than Websites: PHP And The Firehose @DataSift (2013)

Built for speed

Powers our fake Twitter firehose used for testing

Page 63: More Than Websites: PHP And The Firehose @DataSift (2013)

Built for inspection

Allows us to measure activity normally hidden by libraries and PHP extensions

Page 64: More Than Websites: PHP And The Firehose @DataSift (2013)

tools & utilities

Page 65: More Than Websites: PHP And The Firehose @DataSift (2013)

PHP 5.3.latest

Compiled in-house

Extensions statically-linked for performance

Page 66: More Than Websites: PHP And The Firehose @DataSift (2013)

ZeroMQ extension

Transport layer for our pipelines

Page 67: More Than Websites: PHP And The Firehose @DataSift (2013)

APC extension

Shared memory for app metrics

PHP is too slow without an opcache

Lack of APC has prevented us moving to PHP 5.4

Page 68: More Than Websites: PHP And The Firehose @DataSift (2013)

XHProf extension

For profiling code

Skews the results less than Xdebug

Page 69: More Than Websites: PHP And The Firehose @DataSift (2013)

Redis extension

Buffering and queueing

(being phased out)

Page 70: More Than Websites: PHP And The Firehose @DataSift (2013)

Xdebug

For code coverage metrics (and readable var_dump()s!)

Page 71: More Than Websites: PHP And The Firehose @DataSift (2013)

PHPUnit

For all our unit tests

Page 72: More Than Websites: PHP And The Firehose @DataSift (2013)

phpdoc2

For code documentation

(although nobody reads it - code is king)

Page 73: More Than Websites: PHP And The Firehose @DataSift (2013)

Maven

For building all release RPM packages

Page 74: More Than Websites: PHP And The Firehose @DataSift (2013)

Jenkins

Continuous integration

Page 75: More Than Websites: PHP And The Firehose @DataSift (2013)

RPM

Packages for deployment into dev, test, staging, and production

Page 76: More Than Websites: PHP And The Firehose @DataSift (2013)

Thank you

PS: We’re hiring :-)