(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis | AWS re:Invent 2014
DESCRIPTION
Organizations processing mission-critical, high-volume data must be able to achieve high levels of throughput and durability in data processing workflows. In this session, we will learn how DataXu is using Amazon Kinesis, Amazon S3, and Amazon EMR for its patented approach to programmatic marketing. Every second, the DataXu Marketing Cloud processes over 1 million ad requests and makes more than 40 billion decisions to select and bid on ad impressions that are most likely to convert. In addition to addressing the scalability and availability of the platform, we will explore Amazon Kinesis producer and consumer applications that support high levels of scalability and durability in mission-critical record processing.
TRANSCRIPT
Big data
• Collect: AWS Import/Export, AWS Direct Connect, Amazon Kinesis
• Store: Amazon S3, Amazon DynamoDB, Amazon Glacier
• Analyze: Amazon EMR, Amazon Redshift, Amazon EC2
• Hourly server logs: were your systems misbehaving 1 hr ago → what went wrong now
• Weekly/monthly bill: what you spent this billing cycle → prevent overspending now
• Daily customer-preferences report from your website's click stream: what deal or ad to try next time → what to offer the current customer now
• Daily fraud reports: was there fraud yesterday → block fraudulent use now
Sending: HTTP Post, AWS SDK, LOG4J, Flume, Fluentd
Reading: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting
DataXu
DataXu Records
Confirmation Record:
tx_id: "AFTfN0uAWZ"
exchange: "APPNEXUS"
request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9"
…
adslot: {slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537"}
…
time_stamp: 1415393474434
serviced_by_host: "cr02.us-east-01"
Fraud Record:
[- 69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-"
[Diagram: DataXu pipeline. Stages: Producers → Aggregator → Continuous Processing → Storage → Analytics. Components shown: CDN, Real-time Bidding, Retargeting Platform, Amazon Kinesis, KCL Apps, Real Time Apps, Archiver, Event Replay, Amazon S3, Reporting, Qubole, Redshift]
https://github.com/awslabs/kinesis-log4j-appender
Amazon Kinesis storage is replicated across Availability Zones
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Aggregate and archive to S3
• Millions of sources producing 100s of terabytes per hour
• Front end: authentication, authorization
• Ordered stream of events supports multiple readers
• Real-time dashboards and alarms
• Machine learning algorithms or sliding-window analytics
• Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts
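To put the quoted PUT price in perspective, here is a small worked calculation. It is a sketch only: it covers the per-PUT charge quoted on the slide and nothing else (any per-shard or other charges are ignored), and the function name and the example rate of 1 million puts per second are illustrative assumptions.

```python
# PUT price quoted on the slide, in USD per million PUT operations.
PRICE_PER_MILLION_PUTS = 0.028


def put_cost_usd(puts_per_second: float, seconds: float) -> float:
    """Return the PUT charge alone for a steady ingest rate over a time span."""
    total_puts = puts_per_second * seconds
    return total_puts / 1_000_000 * PRICE_PER_MILLION_PUTS


# Illustrative rate (an assumption): 1 million puts per second for one hour.
hourly = put_cost_usd(1_000_000, 3600)
print(f"PUT charge for one hour: ${hourly:.2f}")
```

At that assumed rate, one hour is 3,600 million puts, so the PUT charge alone comes to roughly $100.80.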
[Chart: 1 KB messages/sec (y-axis, 0 to 1,200,000) vs. shards (x-axis, 0 to 1,100)]
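The chart plots message throughput against shard count. A rough sizing sketch follows, assuming the commonly documented per-shard write limits of 1,000 records per second and 1 MB per second. These limits are not stated on the slides, so they appear as explicit, adjustable parameters, and the function name is hypothetical.

```python
# Assumed per-shard write limits (not taken from the slides; adjust as needed).
MAX_RECORDS_PER_SHARD = 1_000          # records per second
MAX_BYTES_PER_SHARD = 1024 * 1024      # bytes per second


def shards_needed(records_per_second: int, record_bytes: int) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_records = -(-records_per_second // MAX_RECORDS_PER_SHARD)  # ceiling division
    by_bytes = -(-(records_per_second * record_bytes) // MAX_BYTES_PER_SHARD)
    return max(by_records, by_bytes, 1)


# 1 KB messages at one million per second: the record-count limit dominates.
print(shards_needed(1_000_000, 1024))
```

Under these assumptions, small records are bounded by the record-count limit and large records by the byte limit, which is why the estimate takes the larger of the two.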
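[Diagram: an Amazon Kinesis shard (Shard-i) read by Host A and Host B. Lock table columns: Shard ID, Lock, Seq num (sample values: Shard-i, 1417182123, 235810). Archive table columns: Shard ID, Last Archived. Sequence numbers shown in the shard: 0, 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. Sample record: {Event 10, …}]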
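The diagram appears to show two hosts sharing one shard, with a table that tracks the last recorded sequence number per shard. Below is a minimal simulation of that checkpoint-and-resume idea, assuming at-least-once semantics. The in-memory dictionary stands in for the real checkpoint table, and the function name, the checkpoint interval, and the crash point are all hypothetical; this is not the Kinesis Client Library's actual API.

```python
from typing import Dict, List, Optional

# Sequence numbers as they appear in the slide's shard illustration.
SHARD = [0, 2, 3, 5, 8, 10, 14, 17, 18, 21, 23]


def run_worker(shard: List[int], checkpoints: Dict[str, int], shard_id: str,
               checkpoint_every: int = 3,
               crash_after: Optional[int] = None) -> List[int]:
    """Process records after the last checkpoint; optionally simulate a crash."""
    last = checkpoints.get(shard_id, -1)
    processed: List[int] = []
    for seq in shard:
        if seq <= last:
            continue  # already covered by an earlier checkpoint
        if crash_after is not None and len(processed) == crash_after:
            return processed  # crash: work since the last checkpoint is not recorded
        processed.append(seq)
        if len(processed) % checkpoint_every == 0:
            checkpoints[shard_id] = seq  # periodic checkpoint
    if processed:
        checkpoints[shard_id] = processed[-1]  # final checkpoint at end of shard
    return processed


checkpoints: Dict[str, int] = {}
host_a = run_worker(SHARD, checkpoints, "Shard-i", crash_after=5)
host_b = run_worker(SHARD, checkpoints, "Shard-i")
print("Host A processed:", host_a)
print("Host B processed:", host_b)
```

In this run Host B resumes from the last checkpoint rather than from Host A's true position, so a few records are processed twice. Downstream consumers in such a design would need to tolerate or deduplicate replays.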
• Unordered processing
– Randomize the partition key to distribute events over many shards, and use multiple workers.
• Exact order processing
– Control the partition key to ensure events are grouped onto the same shard and read by the same worker.
• Need both? Get a global sequence number.
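The effect of the partition key on shard placement can be sketched as follows. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash-key range contains it. The sketch assumes the shards split that range evenly, and the function name and sample keys are hypothetical.

```python
import hashlib
import uuid


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash-key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")      # 128-bit integer
    range_size = 2 ** 128 // shard_count
    return min(hash_key // range_size, shard_count - 1)


# Exact-order processing: a fixed key always lands on the same shard.
ordered = {shard_for("campaign-C0513n7", 8) for _ in range(100)}

# Unordered processing: random keys spread events across the shards.
unordered = {shard_for(str(uuid.uuid4()), 8) for _ in range(1000)}

print("shards used with a fixed key:", len(ordered))
print("shards used with random keys:", len(unordered))
```

A fixed key preserves ordering at the cost of concentrating load on one shard, while random keys spread load but give up ordering across events.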
[Diagram: Producer, Get Global Sequence, Get Event Metadata; streams: Unordered Stream, Campaign Centric Stream, Fraud Inspection Stream]
Id   Event          Stream                    Partition key
1    confirmation   Campaign-centric stream   UUID
2    fraud          Fraud-inspection stream   sessionid
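One way to read the routing table above is sketched below: the producer stamps each event with a global sequence number, then writes it to a type-specific stream using that stream's partition key. Several parts are assumptions rather than anything the slides confirm: the local counter standing in for the source of global sequence numbers, the rule that every event also goes to the unordered stream with a random key, and the in-memory lists used instead of real streams.

```python
import itertools
import uuid
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# Stand-in for whatever component issues global sequence numbers (assumption).
_global_seq = itertools.count(1)

# Event type -> (stream name, field used as partition key), per the table.
ROUTES: Dict[str, Tuple[str, str]] = {
    "confirmation": ("campaign-centric", "uuid"),
    "fraud": ("fraud-inspection", "sessionid"),
}

Streams = DefaultDict[str, List[Tuple[str, dict]]]


def publish(event: dict, streams: Streams) -> dict:
    """Stamp an event with a global sequence number and fan it out."""
    stamped = dict(event, global_seq=next(_global_seq))
    # Assumed: every event also goes to the unordered stream with a random key.
    streams["unordered"].append((str(uuid.uuid4()), stamped))
    stream, key_field = ROUTES[stamped["event"]]
    streams[stream].append((str(stamped[key_field]), stamped))
    return stamped


streams: Streams = defaultdict(list)
publish({"event": "confirmation", "uuid": "uuid-abc"}, streams)   # sample values
publish({"event": "fraud", "sessionid": "session-123"}, streams)  # are made up

for name, records in streams.items():
    print(name, [(key, rec["global_seq"]) for key, rec in records])
```

With a global sequence number attached, a consumer of the unordered stream can still restore the original order by sorting on `global_seq`.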
Sending: HTTP Post, AWS SDK, LOG4J, Flume, Fluentd
Reading: Get* APIs, Apache Storm, Amazon Elastic MapReduce
[Diagram also shows: Archiver, Amazon S3, Playback, Amazon EMR]