Orchestrate Your Big Data Workflows with AWS Data Pipeline


DESCRIPTION

Amazon offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of these specialized services has historically required custom, error-prone data transformation and transport. Now, users can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying Amazon EC2 instances and Amazon EMR clusters to process and transform data. In this session, we teach you how to use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all their data and using resources efficiently. Consequently, Swipely launches novel product features with less development time and less operational complexity. With AWS Data Pipeline, it's easier to reap the benefits of Big Data technology.

TRANSCRIPT

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline

Jon Einkauf (Sr. Product Manager, AWS)
Anthony Accardi (Head of Engineering, Swipely)

November 14, 2013


What are some of the challenges in dealing with data?


1. Data is stored in different formats and locations, making it hard to integrate

Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, and on-premises data stores

2. Data workflows require complex dependencies

For example, a data processing step may depend on:
• Input data being ready
• A prior step completing
• Time of day
• Etc.

3. Things go wrong - you must handle exceptions

• For example, do you want to:

• Retry in the case of failure?

• Wait if a dependent step is taking longer than expected?

• Be notified if something goes wrong?


4. Existing tools are not a good fit

• Expensive upfront licenses
• Scaling issues
• Don’t support scheduling
• Not designed for the cloud
• Don’t support newer data stores (e.g., Amazon DynamoDB)

Introducing AWS Data Pipeline


A simple pipeline (sketched below) consists of:

• An input DataNode with a precondition check
• An activity with failure and delay notifications
• An output DataNode
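A minimal sketch of what that definition might look like, not taken from the talk: all ids, bucket paths, and referenced resources are placeholders, and the schedule, EC2 resource, and SnsAlarm objects are omitted for brevity.

{
  "objects": [
    { "id": "InputData", "type": "S3DataNode",
      "directoryPath": "s3://example-bucket/input/",
      "precondition": { "ref": "InputReady" } },

    { "id": "InputReady", "type": "S3PrefixNotEmpty",
      "s3Prefix": "s3://example-bucket/input/" },

    { "id": "CopyData", "type": "CopyActivity",
      "input": { "ref": "InputData" },
      "output": { "ref": "OutputData" },
      "runsOn": { "ref": "MyEc2Instance" },
      "onFail": { "ref": "FailureAlert" },
      "onLateAction": { "ref": "DelayAlert" } },

    { "id": "OutputData", "type": "S3DataNode",
      "directoryPath": "s3://example-bucket/output/" }
  ]
}

FailureAlert and DelayAlert would be SnsAlarm objects of the kind sketched in the alerting section below.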

Activities manage scheduled data movement and processing across AWS services: Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, and Amazon RDS.

• Copy
• MapReduce
• Hive
• Pig (new)
• SQL (new)
• Shell command
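As one illustration of the activity types above, here is roughly what a Hive activity could look like. Everything named here (data nodes, cluster, schedule, script) is hypothetical; with staging enabled, ${input1} and ${output1} refer to Hive tables staged from the input and output data nodes.

{ "id": "SummarizeOrders", "type": "HiveActivity",
  "schedule": { "ref": "Daily" },
  "runsOn": { "ref": "MyEmrCluster" },
  "input": { "ref": "RawOrdersOnS3" },
  "output": { "ref": "OrderSummariesOnS3" },
  "stage": "true",
  "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT customer_id, sum(amount) FROM ${input1} GROUP BY customer_id;" }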

Facilitates periodic data movement to/from AWS: Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, and on-premises data stores.

Supports dependencies (preconditions):

• Amazon DynamoDB table exists / has data
• Amazon S3 key exists
• Amazon S3 prefix is not empty
• Success of a custom Unix/Linux shell command
• Success of other pipeline tasks
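Two of these as they might appear in a definition; the S3 key, the trivially succeeding shell command, and the activity wiring are placeholders for illustration.

{ "id": "LogsExist", "type": "S3KeyExists",
  "s3Key": "s3://example-bucket/logs/2013-11-14.gz" },

{ "id": "UpstreamReady", "type": "ShellCommandPrecondition",
  "command": "exit 0" },

{ "id": "CopyLogs", "type": "CopyActivity",
  "precondition": [ { "ref": "LogsExist" }, { "ref": "UpstreamReady" } ],
  "input": { "ref": "LogsOnS3" },
  "output": { "ref": "LogsArchive" },
  "runsOn": { "ref": "MyEc2Instance" } }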

Alerting and exception handling

• Notification on failure or on delay
• Automatic retry logic

[Diagram: each task alerts on failure; on success, the next task runs.]
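In definition form, notifications and retries hang off the activity. The topic ARN, ids, and retry count below are placeholders, and the activity's other required fields are omitted.

{ "id": "Task1", "type": "EmrActivity",
  "maximumRetries": "3",
  "onFail": { "ref": "FailureAlert" },
  "onLateAction": { "ref": "DelayAlert" },
  "runsOn": { "ref": "MyEmrCluster" } },

{ "id": "FailureAlert", "type": "SnsAlarm",
  "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
  "subject": "Task1 failed",
  "message": "Task1 failed after retries; see the pipeline console.",
  "role": "DataPipelineDefaultRole" }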

Flexible scheduling

• Choose a schedule: run every 15 minutes, hour, day, week, etc., or user defined
• Backfill support: start the pipeline on a past date and it rapidly backfills to the present day
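A Schedule object captures both ideas. With a startDateTime in the past (the date below is illustrative), the service backfills each elapsed period and then continues on schedule; activities and data nodes reference it with "schedule": { "ref": "Hourly" }.

{ "id": "Hourly", "type": "Schedule",
  "period": "1 hour",
  "startDateTime": "2013-10-01T00:00:00" }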

Massively scalable

• Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data

• Manages resources in multiple regions

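The resources themselves are just more pipeline objects, created when needed and terminated afterwards. The instance types, count, regions, and timeouts below are assumptions for illustration.

{ "id": "MyEmrCluster", "type": "EmrCluster",
  "masterInstanceType": "m1.large",
  "coreInstanceType": "m1.large",
  "coreInstanceCount": "4",
  "region": "us-west-2",
  "terminateAfter": "2 hours" },

{ "id": "MyEc2Instance", "type": "Ec2Resource",
  "instanceType": "m1.small",
  "region": "us-east-1",
  "terminateAfter": "1 hour" }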

Easy to get started

• Templates for common use cases
• Graphical interface
• Natively understands CSV and TSV
• Automatically configures Amazon EMR clusters

Inexpensive

• Free tier
• Pay per activity/precondition
• No commitment
• Simple pricing

An ETL example (1 of 2)

• Combine logs in Amazon S3 with customer data in Amazon RDS
• Process using Hive on Amazon EMR
• Put output in Amazon S3
• Load into Amazon Redshift
• Run SQL query and load table for BI tools

An ETL example (2 of 2)

• Run on a schedule (e.g., hourly)
• Use a precondition to make the Hive activity depend on the Amazon S3 logs being available
• Set up an Amazon SNS notification on failure
• Change the default retry logic
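A hedged sketch of how the second slide's requirements could be wired around the Hive step; the names, path, and retry count are placeholders. The Hive activity's inputs, outputs, and script would follow the earlier Hive sketch, and the Redshift load and BI query would be further activities (for example, a RedshiftCopyActivity or SqlActivity) chained with dependsOn.

{ "id": "Hourly", "type": "Schedule",
  "period": "1 hour",
  "startDateTime": "2013-11-01T00:00:00" },

{ "id": "LogsAvailable", "type": "S3PrefixNotEmpty",
  "s3Prefix": "s3://example-logs/latest/" },

{ "id": "CombineLogsAndCustomers", "type": "HiveActivity",
  "schedule": { "ref": "Hourly" },
  "precondition": { "ref": "LogsAvailable" },
  "onFail": { "ref": "FailureAlert" },
  "maximumRetries": "1",
  "runsOn": { "ref": "EtlEmrCluster" } }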

Swipely


1 TB

How big is your data?

Do you have a big data problem?

Don’t use Hadoop: your data isn’t that big.

Keep your data small and manageable.

Get ahead of your Big Data: don’t wait for data to become a problem.

Build novel product features with a batch architecture.

Decrease development time by easily backfilling data.

Vastly simplify operations with scalable on-demand services.


Swipely must innovate by making payments data actionable,

and rapidly iterate, deploying multiple times a day,

with a lean team: we have 2 ops engineers.

Swipely uses AWS Data Pipeline to

build batch analytics,

backfilling all our data,

using resources efficiently.


Fast, dynamic reports by mashing up data from facts.

Generate fast, dynamic reports


AWS Data Pipeline orchestrates building of documents from facts.

[Diagram: Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor → insert → Sales by Day Documents, all coordinated by AWS Data Pipeline.]



Mash up data for efficient processing

Transactions (merchant, date, card ID, amount):
Cafe 3/30 4980 $72
Spa 5/11 8278 $140
Cafe 5/11 2472 $57

Sales by Day (built from Transactions via EMR):
Cafe 5/10: $4030, 60 new
Cafe 5/11: $5432, 80 new
Cafe 5/12: $6292, 135 new

Visits (built from Transactions via EMR):
Cafe 2472 5/11: $57, 0 new
Cafe 4980 3/30: $72, 1 new
Cafe 4980 5/11: $49, 0 new

Card Opt-In (card ID → customer):
2472 Bob
8278 Mary

Customer Spend (Visits joined with Card Opt-In via Hive on EMR):
Mary 5/11: $309
4980 5/11: $218
Bob 5/11: $198


Regularly rebuild to rapidly iterate, using an agile process.

Regularly rebuild to avoid backfilling

[Diagram: the web service records transactions and card opt-ins in a Fact Store; daily, Analytics Documents are rebuilt from the Fact Store, with a Recent Activity feed alongside.]

Minor changes require little work: change accounting rules without a migration.

Rapidly iterate your product: redefine “best”.

Leverage agile development process:

• Wrap pipeline definition
• Quickly diagnose failures
• Automate common tasks
• Reduce variability

Wrap pipeline definition

{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "/.../hadoop-streaming.jar,
    -input, s3n://<%= s3_data_path %>/indexed_swipes.csv,
    -output, s3://<%= s3_data_path %>/sales_by_day,
    -mapper, s3n://<%= s3_code_path %>/sales_by_day_mapper.rb,
    -reducer, s3n://<%= s3_code_path %>/sales_by_day_reducer.rb"
}

The same definition with the Hadoop streaming step wrapped in a template helper:

{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "<%= streaming_hadoop_step(
    input: '/indexed_swipes.csv',
    output: '/sales_by_day',
    mapper: '/sales_by_day_mapper.rb',
    reducer: '/sales_by_day_reducer.rb'
  ) %>"
}

Reduce variability

• No small instances: "coreInstanceType": "m1.large"
• Lock versions: "installHive": "0.8.1.8"
• Security groups by database: "securityGroups": [ "customerdb" ]

Quickly diagnose failures

• Turn on logging: "enableDebugging", "logUri", "emrLogUri"
• Namespace your logs: "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"
• Log into dev instances: "keyPair"
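Putting the settings from these two slides together, the Swipely-style resources might look roughly like this. The ids match the definition shown earlier; the log URI, key pair, and any field value not named on the slides are placeholders.

{ "id": "SalesByDayEMRCluster", "type": "EmrCluster",
  "coreInstanceType": "m1.large",
  "installHive": "0.8.1.8",
  "enableDebugging": "true",
  "emrLogUri": "s3://example-logs-bucket/sales-by-day/2013-11-14/SalesByDayEMRLogs",
  "keyPair": "dev-keypair",
  "terminateAfter": "6 hours" },

{ "id": "SalesByDayEC2Resource", "type": "Ec2Resource",
  "instanceType": "m1.large",
  "securityGroups": [ "customerdb" ],
  "keyPair": "dev-keypair",
  "terminateAfter": "6 hours" }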

Automate common tasks

• Clean up: "terminateAfter": "6 hours"
• Bootstrap your environment:

{
  "id": "BootstrapEnvironment",
  "type": "ShellCommandActivity",
  "scriptUri": ".../bootstrap_ec2.sh",
  "runsOn": { "ref": "SalesByDayEC2Resource" }
}

Scale horizontally, backfilling in 50 min, storing all your data.

Scale Amazon EMR pipelines horizontally

Cost vs. latency sweet spot at 50 min

• Use the smallest capable on-demand instance type: fixed hourly cost, no idle time
• Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)
• Target < 1 hour, with ~10 min of runtime variability
• Crunch 50 GB of facts in 50 min using 40 instances for < $10
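To sanity-check that last figure: EC2 and EMR capacity was billed by the instance-hour in 2013, so 40 instances running under an hour is roughly 40 instance-hours, and staying under $10 implies a rate of about $0.25 per instance-hour or less, which fits the smaller on-demand instance types of the era. Because cost(1 instance, N hours) ≈ cost(N instances, 1 hour), spreading the same work across 40 machines buys the 50-minute latency at essentially no extra cost.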


Store all your data - it’s cheap

• Store all your facts in Amazon S3 (your source of truth): 50 GB, $5 / month
• Store your analytics documents in Amazon RDS for indexed queries: 20 GB, $250 / month
• Retain intermediate data in Amazon S3 for diagnosis: 1.1 TB (60 days), $100 / month
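For context on those S3 figures: at roughly $0.09–0.10 per GB-month (2013 S3 standard pricing), 50 GB comes to about $5 / month and 1.1 TB to about $100 / month, matching the slide. The Amazon RDS cost, by contrast, is presumably dominated by the database instance rather than the 20 GB of storage.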

Swipely uses AWS Data Pipeline to

build batch analytics,

backfilling all our data,

using resources efficiently.

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

BDT207 Thank You
