More Than Websites: PHP and the Firehose @DataSift (2013)
DESCRIPTION
PHP is the world's #1 programming language for creating websites. But it's capable of so much more. How about processing the social firehose in real time? :)
TRANSCRIPT
More Than Websites: PHP And The Firehose @ DataSift
Saturday, 23 March 13

Introduce Yourselves

@stuherbert

What is DataSift

Sift through social data
Twitter firehose, Facebook, bitly clicks, news, videos, comments
and more

Gain insights using augmentations
Language, gender, trends, links, sentiment, salience & entity analysis
and more

Realtime
Get matching data within seconds of it being posted

Historics
Search our social data archive going back to January 2010

Pull the data from our servers
via HTTP/1.1 streaming or websockets
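
A minimal sketch of what consuming a pulled stream can look like on the client side. It is written in Python purely for illustration and is not DataSift's client code. It assumes the stream carries one JSON object per line (the slides do not state the wire format), and `iter_interactions` is a hypothetical helper name. The point it shows is that network chunks rarely line up with message boundaries, so the reader has to buffer until a full line has arrived.

```python
import json
from typing import Iterable, Iterator


def iter_interactions(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded JSON object per newline-terminated line.

    `chunks` stands in for the body of a long-lived HTTP response, which
    arrives in arbitrary-sized pieces. Newline-delimited JSON is an
    assumption made for this sketch.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the tail.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            line = line.strip()
            if line:  # blank lines are skipped (treated as keep-alives)
                yield json.loads(line)


# Simulated stream: two messages split across three network chunks.
simulated = [
    b'{"id": 1, "text": "hel',
    b'lo"}\n\n{"id": 2,',
    b' "text": "world"}\n',
]
received = list(iter_interactions(simulated))
print(received)
```

A message is only emitted once its terminating newline has been seen, so a partial message at the end of a dropped connection is never passed on half-parsed.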

Let us push data to you
Have the data delivered directly to your servers
or into your databases

DataSift in numbers

30
Sources of social data
and data augmentations

Up to 20,000
Number of new pieces of data
ingested into DataSift every second

3 Terabytes
Amount of new data added
to the Historics archive every week
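
To put the last two figures on a common scale, here is a small back-of-the-envelope calculation (in Python, for illustration only). It assumes the peak rate of 20,000 items per second is sustained for a whole day, which the slides do not claim, and that a terabyte means 10^12 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

peak_items_per_second = 20_000        # "up to" figure from the slide
archive_bytes_per_week = 3 * 10**12   # assumes decimal terabytes

# Upper bound on daily volume if the peak rate never dropped.
peak_items_per_day = peak_items_per_second * SECONDS_PER_DAY

# Average archive growth, spread evenly over the week.
average_archive_mb_per_second = archive_bytes_per_week / SECONDS_PER_WEEK / 10**6

print(f"{peak_items_per_day:,} items/day at sustained peak")
print(f"{average_archive_mb_per_second:.2f} MB/s average archive growth")
```

Under those assumptions the ceiling is about 1.7 billion items a day, and the archive grows at roughly 5 MB every second on average.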

12
Different ways we can deliver
data to you

1
Average number of seconds
to pass the data through DataSift

12
Number of services data passes through
inside DataSift

25
Number of engineers who write code for
the DataSift platform

5
Primary programming languages: C++, Node, PHP, Python, Scala

154
Private GitHub repos

Our GitHub Repositories
[Bar chart: number of repositories per language, on an axis running from 0 to 60. Categories: PHP, Java & Scala, C & C++, JS & Node, Unclassified, Python, Shell Script, Ruby, C#, VimL.]

Architecture

Three major data pipelines
+ supporting services

Data Archiving
Adds new data to the
Historics Archive

Filtering Pipeline
Filtering and delivery of data
in realtime

Playback Pipeline
Filtering and delivery of data
from the Historics Archive

DataSift Technical Architecture
[Diagram: "DataSift Architecture 2.2", credited to @lorenzoalberton. Labels in the diagram include: input streams (Facebook, Wikipedia, Reddit, LexisNexis, Meltwater, Estimize, Digg, NewsCred, BoardReader, MySpace, SuperFeeder, Bit.ly); HTTP streaming, PuSH and search inputs; links resolution (+ OpenGraph, Twitter Cards, metadata); deletes processor; the augmentation pipeline (interaction generation, 3rd party APIs, demographics, trends analysis, sentiment analysis, named entities, topics analysis, language detection, Klout score + profile); Goblin heads and tails; Ogres; stream splitter/joiner, deduper and msg splitter; the Pickle filtering engine with Pickle nodes grouped into node shards; Prism; ACL (with interaction counter); stream recorder; control channels; hardware load balancer; Kafka queues and Redis; Ultrahose archiver, HDFS archiver, HBase cluster and Hadoop (Titan Historics, Map/Reduce); Tardis; Time Machine + Insights (jobs DB, chunks DB, chunk selector, job tracker); Meteor nodes, WebSockets and HTTP streaming; push scheduler, producer and delivery (HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, Microsoft Azure, MongoDB); Oracle, CouchDB, IBM Cognos, Google BigQuery, cloud storage, DBs and BI tools; and supporting services (web, API, stream manager, definition manager, CSDL compiler/validator/normaliser, Historics, recording and push schedulers, limit manager, authentication manager, license manager, mask manager, connection manager, billing pipeline, notification service, monitoring aggregator, EDRs).]

Filtering Pipeline
[Diagram: the "DataSift Architecture 2.2" diagram again, followed by a subset of its labels: input streams, links resolution, deletes processor, Redis, the augmentation pipeline (Goblin heads and tails, interaction generation, 3rd party APIs, Ogres), stream splitter/joiner, deduper, msg splitter, Pickle nodes and node shards, ACL (with interaction counter), stream recorder, control channels, hardware load balancer, Prism, the Pickle filtering engine, worker snapshotter, buffered streams (Redis), workers, Meteor nodes, real-time streams, push scheduler, push producer, subscriptions DB, push delivery (HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, Microsoft Azure, MongoDB), WebSockets, HTTP streaming, delivery subscriptions, kafka-consumer, Kafka, Oracle, CouchDB, IBM Cognos, Google BigQuery, cloud storage, DBs and BI tools.]

Data Archiving Pipeline
[Diagram: the "DataSift Architecture 2.2" diagram again, followed by a subset of its labels: Ultrahose archiver, Kafka, input streams, links resolution, deletes processor, Redis, the augmentation pipeline (Goblin heads and tails, interaction generation, 3rd party APIs, Ogres), stream splitter/joiner, deduper, msg splitter, HBase cluster (regions 1 to N) and HDFS archiver.]

Playback Pipeline
[Diagram: the "DataSift Architecture 2.2" diagram again, followed by a subset of its labels: ACL (with interaction counter), hardware load balancer, worker snapshotter, buffered streams (Redis), workers, Meteor nodes, real-time streams, push scheduler, push producer, subscriptions DB, push delivery (HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, Microsoft Azure, MongoDB), WebSockets, HTTP streaming, delivery subscriptions, kafka-consumer, Oracle, CouchDB, IBM Cognos, Google BigQuery, cloud storage, DBs, BI tools, interaction targets mapping, filtering (Tardis, Pickle), Hadoop (Titan Historics, Map/Reduce), HBase cluster and data nodes, Time Machine + Insights (post-processing, stream analytics), jobs DB, chunks DB, chunk selector, job tracker, exports and analytics, stream results, historical queries and HDFS archiver.]

Written In PHP
[Diagram: the "DataSift Architecture 2.2" diagram again, followed by a subset of its labels: HTTP streaming, PuSH and search inputs, input streams, links resolution (+ OpenGraph, Twitter Cards, metadata), deletes processor, the augmentation pipeline (interaction generation, 3rd party APIs, demographics, trends analysis, sentiment analysis, named entities, topics analysis, language detection, Klout score + profile, Ogres), Bit.ly, worker snapshotter, buffered streams (Redis), workers, push scheduler, job queue, subscriptions DB, push delivery (HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, Microsoft Azure, MongoDB), delivery subscriptions, kafka-consumer, Oracle, CouchDB, IBM Cognos, Google BigQuery, cloud storage, DBs, BI tools, monitoring aggregator, limit manager, notification service, license manager, billing pipeline, mask manager, authentication manager, stream manager, definition manager and recording scheduler.]

100%
Every piece of data
is handled by our PHP code
in realtime

What we do in PHP

Marketing website
Runs on Drupal

Our main webapp
Customer signup, stream creation,
account management

Our external API
Our main interface with customers

Boring!
That’s all very standard stuff,
well understood
The interesting uses are behind the scenes

Behind the scenes?
Are you mad?!?
Everyone knows that PHP is only for building websites!

Internal services
APIs that support our data pipelines
User management, billing, data security

Data assembly
Convert incoming data
into common ‘interaction’ structure
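
The data assembly idea can be sketched as one small converter per source, all producing the same set of keys. This is an illustration in Python, not DataSift's PHP code. The field names (`type`, `author`, `content`, `link`), the payload shapes and the function names are all hypothetical, since the slides do not describe the real 'interaction' structure.

```python
from typing import Any, Callable, Dict

Interaction = Dict[str, Any]


def from_tweet(raw: Dict[str, Any]) -> Interaction:
    """Map a (hypothetical) tweet payload onto the common structure."""
    return {
        "type": "twitter",
        "author": raw["user"]["screen_name"],
        "content": raw["text"],
        "link": None,
    }


def from_click(raw: Dict[str, Any]) -> Interaction:
    """Map a (hypothetical) link-click payload onto the common structure."""
    return {
        "type": "bitly",
        "author": None,   # assumed: click events carry no author
        "content": None,
        "link": raw["long_url"],
    }


# One assembler per source; supporting a new source means adding one entry.
ASSEMBLERS: Dict[str, Callable[[Dict[str, Any]], Interaction]] = {
    "twitter": from_tweet,
    "bitly": from_click,
}


def assemble(source: str, raw: Dict[str, Any]) -> Interaction:
    """Convert a raw payload from `source` into the common structure."""
    if source not in ASSEMBLERS:
        raise ValueError(f"no assembler registered for source {source!r}")
    return ASSEMBLERS[source](raw)


tweet = {"user": {"screen_name": "example_user"}, "text": "PHP is more than websites"}
click = {"long_url": "http://example.com/article"}
for interaction in (assemble("twitter", tweet), assemble("bitly", click)):
    print(interaction)
```

Because every converter returns the same keys, everything downstream (filtering, delivery) only has to understand one shape, whatever the source.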

100%
Every piece of data
is handled by our PHP code
in realtime

Push delivery
Outbound delivery of data
to customers’ servers and into their databases

1 MP3/sec
How much data we can deliver to a single EC2 micro-instance

500
Number of simultaneous deliveries to customers
every second

Hornet
Our EvilTestTool(tm)
Designed to melt the data centre

Storyteller
Our functional test tool
Brings user stories to life
Fires up VMs, deploys code,
tests services
Reproducibly

Why PHP

Our History
DataSift grew out of TweetMeme

Our Product
PHP is superb
at handling unstructured data

Our Customers
PHP can talk to
any server, database / datastore that we want to deliver data to

Our People
Several ‘names’ from the PHP community
PHP is a language most engineers know

Our Time
PHP is a great language
to build high-quality code very very quickly

Our Performance
PHP is fast enough
for data assembly work
and is getting faster with every major release

Our Sanity
Our PHP applications require
less Ops time than any of the others

PHP frameworks

Rolled our own
Frink & Stone

Right choice for us
We’re not part of the target demographic
for the major PHP frameworks
(nor the minor ones, tbh)

Frink
TweetMeme’s framework
built to handle millions of tweeted links
a day

Built for speed
Stripped down to the bare essentials
a reaction to experience with early Zend Framework

Jobqueues
Long-running daemon processes
Worker processes handle data queues
Manager process monitors workers
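
The manager/worker arrangement can be sketched roughly as follows. This is an illustration in Python, not Frink's PHP implementation. Threads stand in for the separate daemon processes so the example stays self-contained, and `worker`, `run_manager` and the `CRASH` job are hypothetical names invented for the sketch.

```python
import queue
import threading
import time
from typing import List, Tuple

CRASH = "crash"  # hypothetical job value that kills the worker handling it


def worker(jobs: queue.Queue, results: List[str], handled: List[str]) -> None:
    """Handle jobs forever; stop (simulating a crash) on a CRASH job."""
    while True:
        job = jobs.get()
        try:
            if job == CRASH:
                return  # the thread ends here, as a crashed process would
            results.append(job.upper())
        finally:
            handled.append(job)  # recorded whether the job succeeded or not


def run_manager(job_list: List[str], worker_count: int = 2) -> Tuple[List[str], int]:
    """Start workers, restart any that die, return (results, restart count)."""
    jobs: queue.Queue = queue.Queue()
    for job in job_list:
        jobs.put(job)
    results: List[str] = []
    handled: List[str] = []
    restarts = 0

    def spawn() -> threading.Thread:
        thread = threading.Thread(
            target=worker, args=(jobs, results, handled), daemon=True
        )
        thread.start()
        return thread

    workers = [spawn() for _ in range(worker_count)]
    # Monitoring loop: poll worker health until every job has been handled.
    while len(handled) < len(job_list):
        for index, thread in enumerate(workers):
            if not thread.is_alive():
                workers[index] = spawn()
                restarts += 1
        time.sleep(0.01)
    return results, restarts


results, restarts = run_manager(["a", CRASH, "b", "c"], worker_count=1)
print(results, restarts)
```

The design point is that workers contain no recovery logic at all. A worker that dies simply disappears, and the manager notices and replaces it, so the jobs still waiting in the queue are picked up by the replacement.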

Stone
Foundation of our in-house test tools
Hornet and Storyteller

Built for speed
Powers our fake Twitter firehose
used for testing

Built for inspection
Allows us to measure
activity normally hidden by libraries and PHP extensions

PHP tools & utilities

PHP 5.3.latest
Compiled in-house
Extensions statically-linked for performance

ZeroMQ extension
Transport layer for our pipelines

APC extension
Shared memory for app metrics
PHP is too slow without an opcache
Lack of APC has prevented us moving to PHP 5.4

XHProf extension
For profiling code
Skews the results less than Xdebug

Redis extension
Buffering and queueing
(being phased out)

Xdebug
For code coverage metrics (and readable var_dump()s!)

PHPUnit
For all our unit tests

phpdoc2
For code documentation
(although nobody reads it - code is king)

Maven
For building all
release RPM packages

Jenkins
Continuous integration

RPM
Packages for deployment
into dev, test, staging, and production

Thank you
PS: We’re hiring :-)