Massive Message Processing with Amazon SQS and Amazon DynamoDB (ARC301) | AWS re:Invent 2013
DESCRIPTION
Amazon Simple Queue Service (SQS) and Amazon DynamoDB together form a fast, reliable, and scalable layer for receiving and processing high volumes of messages, thanks to their distributed, highly available architectures. We propose a complete system that can handle any volume of data, at any level of throughput, without losing messages and without requiring other services to be always available. It also enables applications to process messages asynchronously and to add compute resources based on the number of messages enqueued. The architecture helps applications meet predefined SLAs, since we can add workers to improve overall performance, and it lowers total cost because the new workers run only briefly and only when they are required.
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
ARC301 - Controlling the Flood: Massive Message
Processing with Amazon SQS and Amazon DynamoDB
Ari Dias Neto, Ecosystem Solution Architect
November 14, 2013
Who am I?
• Ari Dias Neto – Ecosystem Solutions Architect
• The Mailman from Brazil – delivering messages around the world!
• Returning all the messages… How many mailmen? When? How long?
What are we going to do?
• We are going to design and build an application to handle any volume of messages. Right now!
Scenario – Super Bowl
Promotion: who is going to win?
Promotion – Requirements
• Subscription based on SMS – the cellphone number is the key
• We cannot lose any message
• We need to process all the valid messages
• Log all the invalid messages and errors
• Beautiful dashboard at the end
• We must process all the messages during the event!
Who is going to be in the front-line?
Fast!
Scalable!
Reliable!
Simple!
Fully managed!
Amazon Simple Queue Service (SQS)
• Fully managed queue service
• Any volume of data, at any level of throughput
• We cannot lose any message
• No up-front or fixed expenses
Architecture – Starting with SQS
SQS
We have received all the messages.
Now we need to process all of them.
Architecture – Amazon EC2 Instances
SQS
But: how many instances?
Architecture – Multithread application
Reduce the costs and increase performance: fewer EC2 instances, each running many threads
Architecture
SQS
Threads Workers
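The slides run many worker threads per EC2 instance so each machine drains the queue concurrently. A minimal sketch of that pattern, using an in-memory `BlockingQueue` as a stand-in for SQS (the class and method names are ours, for illustration only):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class MultiThreadWorkers {
    // Drain the queue with `threads` concurrent workers; returns how many messages were processed.
    static int drain(BlockingQueue<String> queue, int threads) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                // Poll until the queue is empty
                while (queue.poll() != null) {
                    // A real worker would validate the vote and persist it here
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1_000; i++) queue.add("vote-" + i);
        System.out.println(drain(queue, 8)); // prints 1000
    }
}
```

Running many threads on one instance keeps each machine busy while it waits on network I/O, which is why the talk prefers multithreading over adding more instances.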
But how many instances do we need?
EC2 m1.xlarge:
• 1 instance ≈ 100K msgs/minute
• 10 instances ≈ 1M msgs/minute
• 10 instances process 5M messages in 5 minutes
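The sizing above is simple ceiling division. A worked sketch (the 100K msgs/minute per m1.xlarge figure comes from the talk; the helper name is ours):

```java
public class Sizing {
    // Throughput per m1.xlarge instance, from the talk: 100,000 msgs/minute
    static final int MSGS_PER_INSTANCE_PER_MINUTE = 100_000;

    // Instances needed to drain `backlog` messages within `minutes` (ceiling division)
    static int instancesNeeded(long backlog, int minutes) {
        long perInstance = (long) MSGS_PER_INSTANCE_PER_MINUTE * minutes;
        return (int) ((backlog + perInstance - 1) / perInstance);
    }

    public static void main(String[] args) {
        // 5M messages in 5 minutes -> 10 instances, matching the slide
        System.out.println(instancesNeeded(5_000_000L, 5)); // prints 10
    }
}
```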
Architecture
SQS
Auto Scaling
Group
Auto Scaling based on the number of messages in the queue
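Scaling on queue depth means mapping the queue's `ApproximateNumberOfMessages` metric to a fleet size. A minimal sketch of that policy as a pure function (the target window and fleet cap are illustrative assumptions, not from the talk):

```java
public class QueueBasedScaling {
    static final int MSGS_PER_INSTANCE_PER_MINUTE = 100_000; // per-instance rate from the talk
    static final int TARGET_MINUTES = 5;                     // assumed SLA: drain backlog in 5 minutes
    static final int MAX_INSTANCES = 20;                     // illustrative fleet cap

    // Desired fleet size for a given value of the ApproximateNumberOfMessages metric
    static int desiredCapacity(long approximateNumberOfMessages) {
        long perInstance = (long) MSGS_PER_INSTANCE_PER_MINUTE * TARGET_MINUTES;
        int needed = (int) ((approximateNumberOfMessages + perInstance - 1) / perInstance);
        return Math.max(1, Math.min(needed, MAX_INSTANCES)); // keep at least one worker, cap the fleet
    }
}
```

In practice this logic lives in CloudWatch alarms on the queue-depth metric that trigger Auto Scaling policies; the function above only shows the arithmetic behind the thresholds.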
Architecture
SQS
Auto Scaling
Group
Where should we save all the messages? We need high throughput.
Amazon DynamoDB
DynamoDB
Two tables: valid-votes and invalid-votes
Architecture
SQS
Auto scaling Group
valid-votes
invalid-votes
The Dashboard
Final Architecture
SQS DynamoDB
Auto Scaling Group
Workers
Web
Dashboard
AWS Elastic
Beanstalk Container
Benefits
• Ready for any level of throughput
• SQS
• Ready for any required SLA
• Auto Scaling and EC2
• Low Cost
• Fully managed queue service
• Infrastructure is based on the required SLA
• Infrastructure needed only for a short period of time
The challenge!
Process all the
messages from the
queue
in 10 minutes!
Let’s go deep!
Let's code!
Each thread:
1. Connect to the SQS queue
2. Read up to 10 messages
3. Validate each message
4. Save it as valid or invalid (DynamoDB)
5. Set the message as "read" (in SQS terms: delete it from the queue)
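The per-thread loop above can be sketched as follows. This is an illustration, not the talk's actual code: the SQS and DynamoDB clients are replaced by tiny interfaces so the sketch stays self-contained, and the validation rule (a 10-11 digit cellphone number, a colon, and a team name) is our assumption about what a "valid vote" looks like:

```java
import java.util.List;
import java.util.regex.Pattern;

public class VoteWorker {
    // Minimal stand-ins for the SQS and DynamoDB clients (real code would use the AWS SDK)
    interface Queue { List<String> receive(int max); void delete(String msg); }
    interface Table { void put(String key, String value); }

    // Assumed format: "<cellphone>:<team>", e.g. "11987654321:eagles"
    static final Pattern VALID = Pattern.compile("\\d{10,11}:\\w+");

    static boolean isValid(String msg) { return VALID.matcher(msg).matches(); }

    // One polling iteration of a worker thread
    static void poll(Queue queue, Table validVotes, Table invalidVotes) {
        for (String msg : queue.receive(10)) {        // read up to 10 messages per call
            if (isValid(msg)) {
                String[] parts = msg.split(":");
                validVotes.put(parts[0], parts[1]);   // cellphone number is the key
            } else {
                invalidVotes.put(msg, "invalid");     // log invalid messages and errors too
            }
            queue.delete(msg);                        // delete = done (SQS has no "read" flag)
        }
    }
}
```

Deleting only after the save succeeds is what makes the pipeline lose no messages: if a worker dies mid-batch, the undeleted messages become visible again after the SQS visibility timeout and another worker picks them up.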
Steps to deploy it on AWS
1. Create the queue. Queue name: votes
2. Upload the application to S3: s3-sa-east-1.amazonaws.com/arineto/processor.jar
3. Create an AMI with the JRE. Image ID: ami-05355a6c
4. Create the bootstrap script: userdata.txt
5. Create the launch configuration
6. Create the Auto Scaling group
7. Create the alarms
8. Launch it!
The Company
• BigData Corp. was founded to help companies
solve the challenges associated with big data,
from collection to processing to information and
knowledge extraction.
The Challenge
• "How many e-commerce websites exist in your continent? Can we monitor them on a consistent basis?"
– Build a crawling process that can answer this question in a cost-effective and speedy manner.
Architecture
• Spot Instances + SQS + S3 = Magic
– Spot Instances allow us to optimize processing costs
– Amazon SQS allows us to orchestrate the process in a distributed and asynchronous manner
– Amazon Simple Storage Service (S3) facilitates the storage of intermediate and final processing results
Architecture (2)
[Diagram]
• Maestro (reserved instance) publishes the list of crawl URLs
• Main workers (Spot Instances) execute crawling and process data
• Secondary workers (Spot Instances, queue listeners) reprocess data, query additional services, and store data in a MongoDB cluster
• Secondary work queues carry the processed data
• A command-and-control queue coordinates the fleet
Architecture (3)
• Message Volumes
– Processing starts by uploading 10MM+ messages
– Each processed message may generate up to 10 new intermediate messages
– Peak processing of 70K messages/second
• Command & Control Queue
– This queue enables us to adjust processing as we go and request status checks from instances
Results (1)
[Chart: estimated cost without AWS vs. cost with AWS, by month (0–12); y-axis from $0 to $900,000]
Results (2)
• 2+ PB of data processed
• 40+ billion web pages visited and parsed
• 500+ services and technologies mapped
• A completely new view of the web market