more than websites: php and the firehose @datasift (2013)

Post on 09-May-2015

DESCRIPTION

PHP is the world's #1 programming language for creating websites. But it's capable of so much more. How about processing the social firehose in real time? :)

TRANSCRIPT

More Than Websites: PHP And The Firehose @ DataSift

Saturday, 23 March 13

Introduce Yourselves

@stuherbert

What is DataSift

Sift through social data
Twitter firehose, Facebook, bitly clicks, news, videos, comments and more

Gain insights using augmentations
Language, gender, trends, links, sentiment, salience & entity analysis and more

Realtime
Get matching data within seconds of it being posted

Historics
Search our social data archive going back to January 2010

Pull the data from our servers
via HTTP/1.1 streaming or websockets
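A client that pulls data over a long-lived streaming connection has to cope with messages arriving split across network reads. The sketch below shows only that buffering step. It is written in Python for illustration and assumes newline-delimited JSON framing; the slides do not say how the stream is framed, and `iter_messages` and the sample fields are hypothetical names.

```python
import json


def iter_messages(chunks):
    """Yield one decoded JSON object per newline-terminated line.

    `chunks` is any iterable of bytes, standing in for the pieces read
    from a long-lived HTTP response. Newline framing is an assumption.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the remainder.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():  # skip blank keep-alive lines
                yield json.loads(line)


# Simulated network reads: the second message is split across two chunks.
chunks = [b'{"id": 1, "text": "hello"}\n{"id": 2, ', b'"text": "world"}\n', b"\n"]
messages = list(iter_messages(chunks))
```

Keeping the incomplete tail in `buffer` means a message is only parsed once its terminating newline has arrived, however the bytes were split in transit.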

Let us push data to you
Have the data delivered directly to your servers or into your databases

DataSift in numbers

30
Sources of social data and data augmentations

Up to 20,000
Number of new pieces of data ingested into DataSift every second

3 Terabytes
Amount of new data added to the Historics archive every week

12
Different ways we can deliver data to you

1
Average number of seconds to pass the data through DataSift

12
Number of services data passes through inside DataSift

25
Number of engineers who write code for the DataSift platform

5
Primary programming languages: C++, Node, PHP, Python, Scala

154
Private GitHub repos

Our GitHub Repositories

[Bar chart: repositories by language, on an axis running from 0 to 60. Categories, in the order listed: PHP, Java & Scala, C & C++, JS & Node, Unclassified, Python, Shell Script, Ruby, C#, VimL.]

Architecture

Three major data pipelines
+ supporting services

Data Archiving
Adds new data to the Historics Archive

Filtering Pipeline
Filtering and delivery of data in realtime

Playback Pipeline
Filtering and delivery of data from the Historics Archive

DataSift Technical Architecture

[Diagram: "DataSift Architecture 2.2", credited on the slide to @lorenzoalberton. Section labels recoverable from the diagram: Input Streams; Data ingestion + Augmentation; Augmentation Pipeline; Pickle Filtering Engine; Historical Queries; Time Machine + Insights; Real-time Streams; Delivery Subscriptions; Exports and Analytics. Named components include Goblin Head, Goblin Tail, Ogre, Prism, Pickle Node, Node Shard, Tardis, Ultrahose Archiver, HDFS Archiver, HBase Cluster, Kafka, Redis, Stream Recorder, PUSH Scheduler, PUSH Producer and PUSH Delivery.]

Filtering Pipeline

[Diagram: the DataSift Architecture 2.2 diagram. Labels repeated for this slide include Input Streams, Augmentation Pipeline, Data ingestion + Augmentation, Pickle Filtering Engine, Prism, Real-time Streams, Delivery Subscriptions, WebSockets, HTTP Streaming and PUSH Delivery.]

Data Archiving Pipeline

[Diagram: the DataSift Architecture 2.2 diagram. Labels repeated for this slide include Input Streams, Augmentation Pipeline, Data ingestion + Augmentation, Kafka, Ultrahose Archiver, HDFS Archiver and HBase Cluster.]

Playback Pipeline

[Diagram: the DataSift Architecture 2.2 diagram. Labels repeated for this slide include Historical Queries, Hadoop, Titan Historics, Map/Reduce, HBase Cluster, HDFS Archiver, Tardis, Time Machine + Insights, Exports and Analytics, Real-time Streams, Delivery Subscriptions and PUSH Delivery.]

Written In PHP

[Diagram: the DataSift Architecture 2.2 diagram. Labels repeated for this slide include Input Streams, Augmentation Pipeline, Data ingestion + Augmentation, Interaction Generation, Links Resolution, Deletes Processor, Ogre, Worker, Snapshotter, PUSH Scheduler, PUSH Delivery, Delivery Subscriptions, Monitoring Aggregator, Limit Manager, Notification Service, License Manager, Billing Pipeline, Mask Manager, Authentication Manager, Stream Manager, Definition Manager and Recording Scheduler.]

100%
Every piece of data is handled by our PHP code in realtime

What we do in PHP

Marketing website
Runs on Drupal

Our main webapp
Customer signup, stream creation, account management

Our external API
Our main interface with customers

Boring!
That’s all very standard stuff, well understood
The interesting uses are behind the scenes

Behind the scenes?
Are you mad?!?
Everyone knows that PHP is only for building websites!

Internal services
APIs that support our data pipelines
User management, billing, data security

Data assembly
Convert incoming data into common ‘interaction’ structure
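As a rough illustration of what data assembly involves, the sketch below maps two differently shaped source payloads onto one common structure. It is written in Python rather than the PHP the talk describes, and every field name in it (`type`, `author`, `content`, `created_at`, and the keys of the source payloads) is hypothetical; the slides do not show the real ‘interaction’ schema.

```python
def from_tweet(raw):
    """Map a tweet-like payload (hypothetical shape) to the common structure."""
    return {
        "type": "twitter",
        "author": raw["user"]["screen_name"],
        "content": raw["text"],
        "created_at": raw["created_at"],
    }


def from_facebook(raw):
    """Map a Facebook-like payload (hypothetical shape) to the common structure."""
    return {
        "type": "facebook",
        "author": raw["from"]["name"],
        "content": raw["message"],
        "created_at": raw["created_time"],
    }


# One assembler per source; adding a source means adding one entry here.
ASSEMBLERS = {"twitter": from_tweet, "facebook": from_facebook}


def assemble(source, raw):
    """Convert a raw payload from `source` into the common structure."""
    try:
        assembler = ASSEMBLERS[source]
    except KeyError:
        raise ValueError(f"no assembler for source {source!r}") from None
    return assembler(raw)


tweet = {"user": {"screen_name": "stuherbert"}, "text": "hi", "created_at": "2013-03-23"}
post = {"from": {"name": "Stu"}, "message": "hello", "created_time": "2013-03-23"}
interactions = [assemble("twitter", tweet), assemble("facebook", post)]
```

Once every source is reduced to the same keys, downstream filtering and delivery code only has to understand one shape.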

100%
Every piece of data is handled by our PHP code in realtime

Push delivery
Outbound delivery of data to customers’ servers and into their databases

1 MP3/sec
How much data we can deliver to a single EC2 micro-instance

500
Number of simultaneous deliveries to customers every second

Hornet
Our EvilTestTool(tm)
Designed to melt the data centre

Storyteller
Our functional test tool
Brings user stories to life
Fires up VMs, deploys code, tests services
Reproducibly

Why PHP

Our History
DataSift grew out of TweetMeme

Our Product
PHP is superb at handling unstructured data

Our Customers
PHP can talk to any server, database / datastore that we want to deliver data to

Our People
Several ‘names’ from PHP community
PHP is a language most engineers know

Our Time
PHP is a great language to build high-quality code very very quickly

Our Performance
PHP is fast enough for data assembly work
and is getting faster with every major release

Our Sanity
Our PHP applications require less Ops time than any of the others

frameworks

Rolled our own
Frink & Stone

Right choice for us
We’re not part of the target demographic for the major PHP frameworks
(nor the minor ones, tbh)

Frink
TweetMeme’s framework built to handle millions of tweeted links a day

Built for speed
Stripped down to the bare essentials
a reaction to experience with early Zend Framework

Jobqueues
Long-running daemon processes
Worker processes handle data queues
Manager process monitors workers
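The manager/worker arrangement described on the Jobqueues slide can be sketched as follows. This is not Frink's code: it is a small Python illustration in which threads stand in for the long-running daemon processes, a worker "crash" is simulated by an early return, and all names (`worker`, `run_manager`, `STOP`) are hypothetical.

```python
import queue
import threading
import time

STOP = None  # sentinel telling a worker to exit cleanly


def worker(name, jobs, results, clean_exits):
    """Handle jobs until STOP arrives; die early on a 'bad' job."""
    while True:
        job = jobs.get()
        if job is STOP:
            clean_exits.append(name)
            return
        if job == "bad":
            return  # simulated crash: exits without recording a clean exit
        results.append(job.upper())


def run_manager(job_list, n_workers=2):
    """Start workers, restart any that die unexpectedly, return the results."""
    jobs, results, clean_exits = queue.Queue(), [], []
    for job in job_list:
        jobs.put(job)
    for _ in range(n_workers):
        jobs.put(STOP)  # one sentinel per worker slot

    def spawn(index, generation):
        name = f"worker-{index}.{generation}"
        thread = threading.Thread(
            target=worker, args=(name, jobs, results, clean_exits), name=name
        )
        thread.start()
        return thread

    workers = [spawn(i, 0) for i in range(n_workers)]
    restarts = 0
    # Monitoring loop: a dead worker with no clean exit is treated as crashed.
    while len(clean_exits) < n_workers:
        for i, thread in enumerate(workers):
            if not thread.is_alive() and thread.name not in clean_exits:
                restarts += 1
                workers[i] = spawn(i, restarts)
        time.sleep(0.01)
    return results, restarts


results, restarts = run_manager(["a", "bad", "b", "c"])
```

The manager never touches the jobs itself; it only watches worker liveness and replaces casualties, so the queue keeps draining after a failure.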

Stone
Foundation of our in-house test tools Hornet and Storyteller

Built for speed
Powers our fake Twitter firehose used for testing

Built for inspection
Allows us to measure activity normally hidden by libraries and PHP extensions

tools & utilities

PHP 5.3.latest
Compiled in-house
Extensions statically-linked for performance

ZeroMQ extension
Transport layer for our pipelines

APC extension
Shared memory for app metrics
PHP is too slow without an opcache
Lack of APC has prevented us moving to PHP 5.4

XHProf extension
For profiling code
Skews the results less than Xdebug

Redis extension
Buffering and queueing
(being phased out)

Xdebug
For code coverage metrics (and readable var_dump()s!)

PHPUnit
For all our unit tests

phpdoc2
For code documentation
(although nobody reads it - code is king)

Maven
For building all release RPM packages

Jenkins
Continuous integration

RPM
Packages for deployment into dev, test, staging, and production

Thank you
PS: We’re hiring :-)
