Massive Message Processing with Amazon SQS and Amazon DynamoDB (ARC301) | AWS re:Invent 2013
DESCRIPTION
Amazon Simple Queue Service (SQS) and Amazon DynamoDB together form a fast, reliable, and scalable layer for receiving and processing high volumes of messages, thanks to their distributed, highly available architectures. We propose a complete system that can handle any volume of data, at any level of throughput, without losing messages and without requiring other services to be always available. It also enables applications to process messages asynchronously and to add compute resources based on the number of messages enqueued. The architecture helps applications meet predefined SLAs, since we can add workers to improve overall performance, and it lowers total cost because the new workers run only briefly and only when they are required.
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
ARC301 - Controlling the Flood: Massive Message
Processing with Amazon SQS and Amazon DynamoDB
Ari Dias Neto, Ecosystem Solution Architect
November 14, 2013
Who am I?
• Ari Dias Neto – Ecosystem Solutions Architect
• The Mailman from Brazil – delivering messages around the world!
• Returning all the messages… How many mailmen? When? How long?
What are we going to do?
• We are going to design and build an application to handle any volume of messages. Right now!
Scenario – Super Bowl
Promotion: who is going to win?
Promotion – Requirements
• Subscription based on SMS – the cellphone number is the key
• We cannot lose any message
• We need to process all the valid messages
• Log all the invalid messages and errors
• Beautiful dashboard at the end
• We must process all the messages during the event!
Who is going to be in the front-line?
Fast!
Scalable!
Reliable!
Simple!
Fully managed!
Amazon Simple Queue Service (SQS)
• Fully managed queue service
• Any volume of data, at any level of throughput
• We cannot lose any message
• No up-front or fixed expenses
Architecture – Starting with SQS
SQS
We have received all the messages.
Now we need to process all of them.
Architecture – Amazon EC2 Instances
SQS
But: how many instances?
Architecture – Multithread application
Reduce the costs and increase performance: fewer EC2 instances, each running many threads
Architecture
SQS
Threads Workers
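The slides run many worker threads per EC2 instance so each machine drains the queue concurrently. A minimal sketch of that pattern, using an in-memory `BlockingQueue` as a stand-in for SQS (the class and method names are ours, for illustration only):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class MultiThreadWorkers {
    // Drain the queue with `threads` concurrent workers; returns how many messages were processed.
    static int drain(BlockingQueue<String> queue, int threads) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                // Poll until the queue is empty
                while (queue.poll() != null) {
                    // A real worker would validate the vote and persist it here
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1_000; i++) queue.add("vote-" + i);
        System.out.println(drain(queue, 8)); // prints 1000
    }
}
```

Running many threads on one instance keeps each machine busy while it waits on network I/O, which is why the talk prefers multithreading over adding more instances.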
But how many instances do we need?
EC2 m1.xlarge:
• 1 instance ≈ 100K msgs/minute
• 10 instances ≈ 1M msgs/minute
• 10 instances process 5M messages in 5 minutes
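The sizing above is simple ceiling division. A worked sketch (the 100K msgs/minute per m1.xlarge figure comes from the talk; the helper name is ours):

```java
public class Sizing {
    // Throughput per m1.xlarge instance, from the talk: 100,000 msgs/minute
    static final int MSGS_PER_INSTANCE_PER_MINUTE = 100_000;

    // Instances needed to drain `backlog` messages within `minutes` (ceiling division)
    static int instancesNeeded(long backlog, int minutes) {
        long perInstance = (long) MSGS_PER_INSTANCE_PER_MINUTE * minutes;
        return (int) ((backlog + perInstance - 1) / perInstance);
    }

    public static void main(String[] args) {
        // 5M messages in 5 minutes -> 10 instances, matching the slide
        System.out.println(instancesNeeded(5_000_000L, 5)); // prints 10
    }
}
```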
Architecture
SQS
Auto Scaling
Group
Auto Scaling based on the number of messages in the queue
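Scaling on queue depth means mapping the queue's `ApproximateNumberOfMessages` metric to a fleet size. A minimal sketch of that policy as a pure function (the target window and fleet cap are illustrative assumptions, not from the talk):

```java
public class QueueBasedScaling {
    static final int MSGS_PER_INSTANCE_PER_MINUTE = 100_000; // per-instance rate from the talk
    static final int TARGET_MINUTES = 5;                     // assumed SLA: drain backlog in 5 minutes
    static final int MAX_INSTANCES = 20;                     // illustrative fleet cap

    // Desired fleet size for a given value of the ApproximateNumberOfMessages metric
    static int desiredCapacity(long approximateNumberOfMessages) {
        long perInstance = (long) MSGS_PER_INSTANCE_PER_MINUTE * TARGET_MINUTES;
        int needed = (int) ((approximateNumberOfMessages + perInstance - 1) / perInstance);
        return Math.max(1, Math.min(needed, MAX_INSTANCES)); // keep at least one worker, cap the fleet
    }
}
```

In practice this logic lives in CloudWatch alarms on the queue-depth metric that trigger Auto Scaling policies; the function above only shows the arithmetic behind the thresholds.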
Architecture
SQS
Auto Scaling
Group
Where should we save all the messages? We need high throughput.
Amazon DynamoDB
DynamoDB
Two tables: valid-votes and invalid-votes
Architecture
SQS
Auto scaling Group
valid-votes
invalid-votes
The Dashboard
Final Architecture
SQS DynamoDB
Auto Scaling Group
Workers
Web
Dashboard
AWS Elastic
Beanstalk Container
Benefits
• Ready for any level of throughput
• SQS
• Ready for any required SLA
• Auto Scaling and EC2
• Low Cost
• Fully managed queue service
• Infrastructure is based on the required SLA
• Infrastructure needed only for a short period of time
The challenge!
Process all the
messages from the
queue
in 10 minutes!
Let’s go deep!
Let's code!
Each thread:
1. Connect to the SQS queue
2. Read up to 10 messages
3. Validate each message
4. Save it as valid or invalid (DynamoDB)
5. Set the message as "read" (in SQS terms: delete it from the queue)
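The per-thread loop above can be sketched as follows. This is an illustration, not the talk's actual code: the SQS and DynamoDB clients are replaced by tiny interfaces so the sketch stays self-contained, and the validation rule (a 10-11 digit cellphone number, a colon, and a team name) is our assumption about what a "valid vote" looks like:

```java
import java.util.List;
import java.util.regex.Pattern;

public class VoteWorker {
    // Minimal stand-ins for the SQS and DynamoDB clients (real code would use the AWS SDK)
    interface Queue { List<String> receive(int max); void delete(String msg); }
    interface Table { void put(String key, String value); }

    // Assumed format: "<cellphone>:<team>", e.g. "11987654321:eagles"
    static final Pattern VALID = Pattern.compile("\\d{10,11}:\\w+");

    static boolean isValid(String msg) { return VALID.matcher(msg).matches(); }

    // One polling iteration of a worker thread
    static void poll(Queue queue, Table validVotes, Table invalidVotes) {
        for (String msg : queue.receive(10)) {        // read up to 10 messages per call
            if (isValid(msg)) {
                String[] parts = msg.split(":");
                validVotes.put(parts[0], parts[1]);   // cellphone number is the key
            } else {
                invalidVotes.put(msg, "invalid");     // log invalid messages and errors too
            }
            queue.delete(msg);                        // delete = done (SQS has no "read" flag)
        }
    }
}
```

Deleting only after the save succeeds is what makes the pipeline lose no messages: if a worker dies mid-batch, the undeleted messages become visible again after the SQS visibility timeout and another worker picks them up.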
Steps to deploy it on AWS
1. Create the queue. Queue name: votes
2. Upload the application to S3: s3-sa-east-1.amazonaws.com/arineto/processor.jar
3. Create an AMI with the JRE. Image ID: ami-05355a6c
4. Create the bootstrap script: userdata.txt
5. Create the launch configuration
6. Create the Auto Scaling group
7. Create the alarms
8. Launch it!
The Company
• BigData Corp. was founded to help companies
solve the challenges associated with big data,
from collection to processing to information and
knowledge extraction.
The Challenge
• "How many e-commerce websites exist in your continent? Can we monitor them on a consistent basis?"
– Build a crawling process that can answer this question in a cost-effective and speedy manner.
Architecture
• Spot Instances + SQS + S3 = Magic
– Spot Instances allow us to optimize processing costs
– Amazon SQS allows us to orchestrate the process in a distributed and asynchronous manner
– Amazon Simple Storage Service (S3) facilitates the storage of intermediate and final processing results
Architecture (2)
[Diagram]
• Maestro (reserved instance) publishes the list of crawl URLs
• Main workers (Spot Instances) execute crawling and process data
• Secondary workers (Spot Instances, queue listeners) reprocess data, query additional services, and store data in a MongoDB cluster
• Secondary work queues carry the processed data
• A command-and-control queue coordinates the fleet
Architecture (3)
• Message Volumes
– Processing starts by uploading 10MM+ messages
– Each processed message may generate up to 10 new intermediate messages
– Peak processing of 70K messages/second
• Command & Control Queue
– This queue enables us to adjust processing as we go and request status checks from instances
Results (1)
[Chart: estimated cost without AWS vs. cost with AWS, by month (0–12); y-axis from $0 to $900,000]
Results (2)
• 2+ PB of data processed
• 40+ billion web pages visited and parsed
• 500+ services and technologies mapped
• A completely new view of the web market