BDT201 AWS Data Pipeline - AWS re:Invent 2012
DESCRIPTION
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
TRANSCRIPT
[Diagram: Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, On Premise, HDFS (Amazon EMR)]
[Diagram: Amazon DynamoDB, Amazon S3]
[Diagram: Input Datanode, Activity, [Output Datanode]]
[Diagram: Input Datanode with precondition check, Activity with failure & delay notifications, Output Datanode]
[Diagram: Compute Resources, Data, Data Stores]
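The building blocks named on these slides (data nodes, preconditions, activities, failure notifications) can be pictured with a small Python model. This is only an illustrative sketch, not the service's API: the class names, the `on_fail` hook, and the status strings `WAITING`, `FAILED`, and `FINISHED` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DataNode:
    """A named data location, optionally guarded by precondition checks."""
    name: str
    preconditions: List[Callable[[], bool]] = field(default_factory=list)

    def is_ready(self) -> bool:
        # A node with no preconditions is always considered ready.
        return all(check() for check in self.preconditions)


@dataclass
class Activity:
    """A unit of work that reads an input node and may write an output node."""
    name: str
    input_node: DataNode
    output_node: Optional[DataNode] = None  # the slide marks the output as optional
    on_fail: Optional[Callable[[str], None]] = None  # hypothetical notification hook

    def run(self, work: Callable[[], None]) -> str:
        # Do not start until every precondition on the input holds.
        if not self.input_node.is_ready():
            return "WAITING"
        try:
            work()
        except Exception as exc:
            # Report the failure through the notification hook, if one is set.
            if self.on_fail:
                self.on_fail(f"{self.name} failed: {exc}")
            return "FAILED"
        return "FINISHED"
```

A caller would build a `DataNode` with one or more check functions, wrap it in an `Activity`, and pass the actual work as a callable to `run`.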
[Diagram: Start, Interval, [End]]
[Diagram: Noon Today, 1 hour; periods 12-1pm, 1-2pm, 2-3pm, …..]
[Diagram: periods 12-1pm, 1-2pm, 2-3pm, …..; 1 day; three items marked X]
[Diagram: Hourly, Daily, Weekly, Monthly, Yearly, Quarterly]
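The schedule slides describe a start time, an interval, and an optional end, which together produce periods such as 12-1pm, 1-2pm, 2-3pm. A minimal sketch of that expansion follows; the function name `periods`, its `limit` parameter, and the example date are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional, Tuple


def periods(start: datetime, interval: timedelta,
            end: Optional[datetime] = None,
            limit: int = 3) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (period_start, period_end) windows of length `interval`.

    Stops after `limit` windows, or earlier if the next window would
    run past `end` (when an end is given).
    """
    current = start
    produced = 0
    while produced < limit and (end is None or current + interval <= end):
        yield current, current + interval
        current += interval
        produced += 1


noon = datetime(2012, 11, 28, 12, 0)  # arbitrary example date
for begin, finish in periods(noon, timedelta(hours=1)):
    print(begin.strftime("%H:%M"), "-", finish.strftime("%H:%M"))
```

Passing `timedelta(days=1)` instead would model the daily case in the same way.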
[Diagram: S3 logs (hourly), Geolocation data; Per-geography usage computation (hourly); Redshift results]
[Diagram: S3 logs (hourly), Precondition: files exist; Geolocation data, Precondition: ./geo_available; Per-geography usage computation (hourly); Redshift results]
[Diagram: Dynamo event data, RDS demographics; Hive-based analysis (hourly); Redshift results]
[Diagram: Hourly click updates, Hourly event analysis, Daily reporting SQL]
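The two preconditions on this slide ("files exist" and a custom check, `./geo_available`) can be sketched as plain Python predicates. This is a local stand-in under stated assumptions: it checks the local filesystem rather than S3, and it assumes the custom check is a command that signals success with exit code 0. All function names here are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Iterable, Sequence


def files_exist(paths: Iterable[str]) -> bool:
    """Return True only if every path points at an existing file."""
    return all(Path(p).is_file() for p in paths)


def command_succeeds(command: Sequence[str]) -> bool:
    """Run a custom check command; exit code 0 is treated as success."""
    try:
        result = subprocess.run(command, capture_output=True)
    except OSError:
        # The command could not be started at all (for example, not found).
        return False
    return result.returncode == 0


def inputs_ready(log_paths: Iterable[str], geo_check: Sequence[str]) -> bool:
    """Both preconditions must hold before the hourly computation starts."""
    return files_exist(log_paths) and command_succeeds(geo_check)
```

For example, `inputs_ready(["logs/12.log"], ["./geo_available"])` would gate the hourly job on both conditions; the log path is made up for the example.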
[Diagram: Amazon S3 logs, Custom Precondition, EMR usage-by-geo job, Amazon EC2 report generation, Amazon DynamoDB event data, Amazon RDS demographics, Amazon Redshift DW table, Amazon Redshift DW table, Hive script]
We Manage: EC2 Instances, EMR Clusters
You Manage: On Premise Resources, EC2 Instances, EMR Clusters
{
  "objects" : [
    {
      "name" : "My Copy",
      "type" : "Copy Action",
      "input" : { "ref" : "My RDS Data" },
      "output" : { "ref" : "My S3 Data" },
      "runsOn" : { "ref" : "My Instance" },
      "schedule" : { "ref" : "My Schedule" }
    },
    {
      "name" : "My Instance",
      "type" : "EC2Instance",
      "instanceType" : "m1.small",
      "schedule" : { "ref" : "My Schedule" }
    },
    …..
  ]
}
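Objects in the definition point at each other by name through `{"ref": ...}` values. As a rough illustration of how those references could be checked, the sketch below loads only the two objects shown on the slide and lists the names that are referenced but not defined in this excerpt. The helper `collect_refs` is hypothetical, and the elided objects are deliberately left out rather than guessed.

```python
import json

# Only the two objects visible on the slide; the elided ones are omitted.
definition = json.loads("""
{
  "objects": [
    {
      "name": "My Copy",
      "type": "Copy Action",
      "input": {"ref": "My RDS Data"},
      "output": {"ref": "My S3 Data"},
      "runsOn": {"ref": "My Instance"},
      "schedule": {"ref": "My Schedule"}
    },
    {
      "name": "My Instance",
      "type": "EC2Instance",
      "instanceType": "m1.small",
      "schedule": {"ref": "My Schedule"}
    }
  ]
}
""")


def collect_refs(value):
    """Recursively yield every name used in a {"ref": name} value."""
    if isinstance(value, dict):
        if set(value) == {"ref"}:
            yield value["ref"]
        else:
            for child in value.values():
                yield from collect_refs(child)
    elif isinstance(value, list):
        for child in value:
            yield from collect_refs(child)


by_name = {obj["name"]: obj for obj in definition["objects"]}
referenced = set(collect_refs(definition))
unresolved = sorted(referenced - set(by_name))
print(unresolved)  # ['My RDS Data', 'My S3 Data', 'My Schedule']
```

Here `My Instance` resolves because it is defined in the excerpt, while the data nodes and the schedule would have to come from the objects the slide elides.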
                  On AWS        On Premise
High Frequency    $1/month      $2.50/month
Low Frequency     $0.60/month   $1.50/month
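The pricing grid lends itself to a quick estimate. The sketch below assumes each price applies per scheduled item per month; the slide does not state the unit, so treat that as an assumption. The item counts in the example are made up.

```python
from decimal import Decimal

# Monthly prices from the slide, keyed by (frequency, location).
PRICES = {
    ("high", "aws"): Decimal("1.00"),
    ("high", "on_premise"): Decimal("2.50"),
    ("low", "aws"): Decimal("0.60"),
    ("low", "on_premise"): Decimal("1.50"),
}


def monthly_cost(items):
    """Sum price * count over (frequency, location, count) tuples."""
    return sum(
        (PRICES[(freq, loc)] * count for freq, loc, count in items),
        Decimal("0"),
    )


# Hypothetical mix: 3 high-frequency items on AWS, 2 low-frequency on premise.
example = [("high", "aws", 3), ("low", "on_premise", 2)]
print(monthly_cost(example))  # 3 * 1.00 + 2 * 1.50 = 6.00
```

`Decimal` is used instead of `float` so that currency amounts add up exactly.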
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.