Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline

Anthony Accardi (Head of Engineering, Swipely)
Jon Einkauf (Sr. Product Manager, AWS)

November 14, 2013


DESCRIPTION

AWS offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of these specialized services has historically required custom, error-prone data transformation and transport. Now, users can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying EC2 instances and EMR clusters to process and transform data. In this session, we demonstrate how you can use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your data integration architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all their data while using resources efficiently. As a result, Swipely launches novel product features with less development time and less operational complexity.

TRANSCRIPT

Page 1: Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline

Anthony Accardi (Head of Engineering, Swipely)
Jon Einkauf (Sr. Product Manager, AWS)

November 14, 2013

Friday, November 15, 13

Page 2:

What are some of the challenges in dealing with data?

Page 3:

1. Data is stored in different formats and locations, making it hard to integrate

Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, On-Premises

Page 4:

2. Data workflows require complex dependencies

(Diagram: Input data ready? No → wait; Yes → run)

• For example, a data processing step may depend on:
  • Input data being ready
  • Prior step completing
  • Time of day
  • Etc.

Page 5:

3. Things go wrong - you must handle exceptions

• For example, do you want to:
  • Retry in the case of failure?
  • Wait if a dependent step is taking longer than expected?
  • Be notified if something goes wrong?

Page 6:

4. Existing tools are not a good fit

• Expensive upfront licenses
• Scaling issues
• Don't support scheduling
• Not designed for the cloud
• Don't support newer data stores (e.g., Amazon DynamoDB)

Page 7:

Introducing AWS Data Pipeline

Page 8:

A simple pipeline

(Diagram: Input DataNode with PreCondition check → Activity with failure & delay notifications → Output DataNode)

Page 9:

Activities: manages scheduled data movement and processing across AWS services (Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS)

• Copy
• MapReduce
• Hive
• Pig (New)
• SQL (New)
• Shell command

Page 10:

Facilitates periodic data movement to/from AWS (Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, On-Premises)

Page 11:

Supports dependencies (preconditions)

• Amazon DynamoDB table exists/has data
• Amazon S3 key exists
• Amazon S3 prefix is not empty
• Success of custom Unix/Linux shell command
• Success of other pipeline tasks

(Diagram: S3 key exists? No → wait; Yes → copy)

Page 12:

Alerting and exception handling

• Notification
  • On failure
  • On delay
• Automatic retry logic

(Diagram: Task 1 → success → Task 2; failure at either task → alert)
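The retry-then-alert flow on this slide can be pictured as a small Ruby loop. This is only a behavioral sketch, not Data Pipeline API: in the real service you configure automatic retries on the activity and point its onFail at an Amazon SNS notification. The method name and arguments here are hypothetical.

```ruby
# Sketch of Data Pipeline's retry-then-alert behavior: run a task up to
# `max_attempts` times, invoking the failure handler only when every
# attempt has failed (e.g. publishing to an Amazon SNS topic).
def run_with_retries(max_attempts:, on_fail:)
  attempts = 0
  begin
    attempts += 1
    return yield
  rescue StandardError => e
    retry if attempts < max_attempts
    on_fail.call(e) # all retries exhausted: alert
    raise
  end
end
```

The service does this orchestration for you; the point is only that retries happen before any alert fires.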

Page 13:

Flexible scheduling

• Choose a schedule
  • Run every: 15 minutes, hour, day, week, etc.
  • User defined
• Backfill support
  • Start pipeline on past date
  • Rapidly backfills to present day
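Backfill amounts to enumerating every scheduled period from a past start date up to the present and running the pipeline once per period. A minimal sketch of that arithmetic (`scheduled_runs` is a hypothetical helper, not Data Pipeline API):

```ruby
require 'time'

# Enumerate the start time of every scheduled period from `start` up to `now`.
# A pipeline whose start date is in the past gets one run per period,
# which is exactly what backfilling to the present day means.
def scheduled_runs(start, now, period_seconds)
  runs = []
  t = start
  while t <= now
    runs << t
    t += period_seconds
  end
  runs
end
```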

Page 14:

Massively scalable

• Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
• Manages resources in multiple regions

Page 15:

Easy to get started

• Templates for common use cases
• Graphical interface
• Natively understands CSV and TSV
• Automatically configures Amazon EMR clusters

Page 16:

Inexpensive

• Free tier
• Pay per activity/precondition
• No commitment
• Simple pricing

Page 17:

An ETL example (1 of 2)

• Combine logs in Amazon S3 with customer data in Amazon RDS
• Process using Hive on Amazon EMR
• Put output in Amazon S3
• Load into Amazon Redshift
• Run SQL query and load table for BI tools

Page 18:

An ETL example (2 of 2)

• Run on a schedule (e.g., hourly)
• Use a precondition to make the Hive activity depend on Amazon S3 logs being available
• Set up Amazon SNS notification on failure
• Change default retry logic

Page 19:

Swipely

Page 20:

How big is your data? (1 TB)

Page 21:

How big is your data?

Do you have a big data problem?

Page 22:

How big is your data?

Do you have a big data problem?

Don't use Hadoop: your data isn't that big.

Page 23:

How big is your data?

Do you have a big data problem?

Don't use Hadoop: your data isn't that big.

Keep your data small and manageable.

Page 24:

Get ahead of your Big Data: don't wait for data to become a problem

Page 25:

Get ahead of your Big Data: don't wait for data to become a problem

Build novel product features with a batch architecture

Page 26:

Get ahead of your Big Data: don't wait for data to become a problem

Build novel product features with a batch architecture

Decrease development time by easily backfilling data

Page 27:

Get ahead of your Big Data: don't wait for data to become a problem

Build novel product features with a batch architecture

Decrease development time by easily backfilling data

Vastly simplify operations with scalable on-demand services


Page 29:

must innovate by making payments data actionable

Page 30:

must innovate by making payments data actionable

and rapidly iterate, deploying multiple times a day

Page 31:

must innovate by making payments data actionable

and rapidly iterate, deploying multiple times a day

with a lean team: we have 2 ops engineers

Page 32:

Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.

Page 33:

Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.

Fast, dynamic reports by mashing up data from facts.

Page 34:

Generate fast, dynamic reports


Page 36:

AWS Data Pipeline orchestrates building of documents from facts

(Diagram: Transaction Facts → EMR → Intermediate S3 Bucket → Sales by Day Documents)

Page 37:

AWS Data Pipeline orchestrates building of documents from facts

(Diagram: Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor → Sales by Day Documents)

Page 38:

AWS Data Pipeline orchestrates building of documents from facts

(Diagram: Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor → Sales by Day Documents, all orchestrated by AWS Data Pipeline)


Page 40:

Mash up data for efficient processing

Transactions → EMR → Sales by Day

  Transactions: Cafe 3/30 4980 $72 | Spa 5/11 8278 $140 | Cafe 5/11 2472 $57
  Sales by Day: Cafe 5/10: $4030 | Cafe 5/11: $5432 | Cafe 5/12: $6292
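The transactions-to-sales-by-day roll-up shown here is a classic streaming map/reduce. Below is a hypothetical Ruby pair in the spirit of the sales_by_day_mapper.rb / sales_by_day_reducer.rb scripts referenced later in the deck; the real Swipely scripts are not public, and the transaction field layout (merchant, date, card id, amount) is assumed from the slide:

```ruby
# Map a transaction line "merchant date card_id $amount" to a
# [merchant-day key, amount] pair, as a streaming mapper would emit.
def map_transaction(line)
  merchant, date, _card_id, amount = line.split
  ["#{merchant},#{date}", amount.delete('$').to_f]
end

# Reduce mapper output by summing amounts per merchant-day key,
# yielding one sales-by-day total per key.
def reduce_sales_by_day(pairs)
  pairs.each_with_object(Hash.new(0.0)) do |(key, amount), totals|
    totals[key] += amount
  end
end
```

On EMR the same logic runs as separate mapper and reducer processes over stdin/stdout, with Hadoop doing the grouping by key between them.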


Page 42:

Mash up data for efficient processing

Transactions → EMR → Visits → EMR → Sales by Day

  Transactions: Cafe 3/30 4980 $72 | Spa 5/11 8278 $140 | Cafe 5/11 2472 $57
  Visits: Cafe 2472 5/11: $57, 0 new | Cafe 4980 3/30: $72, 1 new | Cafe 4980 5/11: $49, 0 new
  Sales by Day: Cafe 5/10: $4030, 60 new | Cafe 5/11: $5432, 80 new | Cafe 5/12: $6292, 135 new


Page 44:

Mash up data for efficient processing

Transactions → EMR → Visits → EMR → Sales by Day; Visits + Card Opt-In → Hive (EMR) → Customer Spend

  Transactions: Cafe 3/30 4980 $72 | Spa 5/11 8278 $140 | Cafe 5/11 2472 $57
  Visits: Cafe 2472 5/11: $57, 0 new | Cafe 4980 3/30: $72, 1 new | Cafe 4980 5/11: $49, 0 new
  Sales by Day: Cafe 5/10: $4030, 60 new | Cafe 5/11: $5432, 80 new | Cafe 5/12: $6292, 135 new
  Card Opt-In: 2472 Bob | 8278 Mary
  Customer Spend: Mary 5/11: $309 | 4980 5/11: $218 | Bob 5/11: $198

Page 45:

AWS Data Pipeline orchestrates building of documents from facts

(Diagram: Transaction Facts → EMR Data Transformer → Intermediate S3 Bucket → Data Post-Processor → Sales by Day Documents, all orchestrated by AWS Data Pipeline)

Page 46:

Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.

Page 47:

Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.

Regularly rebuild to rapidly iterate, using agile process.

Page 48:

Regularly rebuild to avoid backfilling

(Diagram: web service → transactions, card opt-in → Fact Store → daily rebuild → Analytics Documents)

Page 49:

Regularly rebuild to avoid backfilling

(Diagram: web service → transactions, card opt-in → Fact Store → daily rebuild → Analytics Documents, plus Recent Activity)

Page 50:

Minor changes require little work

Page 51:

Minor changes require little work: change accounting rules without a migration

Page 52:

Rapidly iterate your product

Page 53:

Rapidly iterate your product: redefine "best"

Page 54:

Leverage agile development process

• Wrap pipeline definition
• Quickly diagnose failures
• Automate common tasks
• Reduce variability

Page 55:

Wrap pipeline definition

{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "/.../hadoop-streaming.jar,
           -input,   s3n://<%= s3_data_path %>/indexed_swipes.csv,
           -output,  s3://<%= s3_data_path %>/sales_by_day,
           -mapper,  s3n://<%= s3_code_path %>/sales_by_day_mapper.rb,
           -reducer, s3n://<%= s3_code_path %>/sales_by_day_reducer.rb"
}

Page 56:

Wrap pipeline definition

{
  "id": "GenerateSalesByDay",
  "type": "EmrActivity",
  "onFail": { "ref": "FailureNotify" },
  "schedule": { "ref": "Nightly" },
  "runsOn": { "ref": "SalesByDayEMRCluster" },
  "dependsOn": { "ref": "GenerateIndexedSwipes" },
  "step": "<%= streaming_hadoop_step(
             input:   '/indexed_swipes.csv',
             output:  '/sales_by_day',
             mapper:  '/sales_by_day_mapper.rb',
             reducer: '/sales_by_day_reducer.rb'
           ) %>"
}
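streaming_hadoop_step is a Swipely-internal ERB helper; the slides show only its call site. A plausible sketch that expands the relative paths into a hadoop-streaming step string like the one on the previous slide (every parameter name and the jar/bucket values are assumptions for illustration):

```ruby
# Hypothetical version of Swipely's ERB helper: build the comma-separated
# hadoop-streaming step string for an EmrActivity from S3 data/code
# prefixes and the relative paths of input, output, mapper, and reducer.
def streaming_hadoop_step(jar:, data:, code:, input:, output:, mapper:, reducer:)
  [jar,
   '-input',   "s3n://#{data}#{input}",
   '-output',  "s3://#{data}#{output}",
   '-mapper',  "s3n://#{code}#{mapper}",
   '-reducer', "s3n://#{code}#{reducer}"].join(',')
end
```

Centralizing the step construction this way is what lets the pipeline definition on this slide shrink to four relative paths.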

Page 57:

Reduce variability

• No small instances: "coreInstanceType": "m1.large"
• Lock versions: "installHive": "0.8.1.8"
• Security groups by database: "securityGroups": [ "customerdb" ]

Page 58: Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

Turn on logging "enableDebugging", "logUri", "emrLogUri"

Namespace your logs "s3://#{LOGS_BUCKET}/#{@s3prefix}/#{START_TIME}/SalesByDayEMRLogs"

Log into dev instances "keyPair"

Quickly diagnose failures

Friday, November 15, 13
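Namespacing the log URI by prefix and start time, as in the interpolation above, is plain Ruby string building; each run's EMR logs land under their own prefix, so a failed run is easy to locate. The bucket value below is an assumed placeholder:

```ruby
# Build a per-run S3 log URI so each pipeline run's EMR logs are isolated
# under prefix/start-time, matching the interpolation shown on the slide.
LOGS_BUCKET = 'example-pipeline-logs' # assumed placeholder bucket name

def emr_log_uri(s3prefix, start_time, name)
  "s3://#{LOGS_BUCKET}/#{s3prefix}/#{start_time}/#{name}"
end
```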

Page 59: Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) | AWS re:Invent 2013

Clean up "terminateAfter": "6 hours"

Bootstrap your environment

Automate common tasks

{

"id": "BootstrapEnvironment",

"type": "ShellCommandActivity",

"scriptUri": ".../bootstrap_ec2.sh",

"runsOn": { "ref": "SalesByDayEC2Resource" }

}

Friday, November 15, 13

Page 60:

Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.

Page 61:

Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.

Scale horizontally, backfilling in 50 min, storing all your data.

Page 62:

Scale Amazon EMR pipelines horizontally


Page 64:

Cost vs latency sweet spot at 50 min

Page 65:

Cost vs latency sweet spot at 50 min

Use smallest capable on-demand instance type: fixed hourly cost, no idle time

Page 66:

Cost vs latency sweet spot at 50 min

Use smallest capable on-demand instance type: fixed hourly cost, no idle time

Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)
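The cost identity holds because on-demand instances bill per full instance-hour, so total instance-hours (and therefore cost) are unchanged when you trade duration for parallelism. A quick sketch of the arithmetic (the hourly rate is an assumed placeholder, not a real price):

```ruby
# On-demand cost model of the era: each instance is billed for every
# started hour, so a 50-minute run bills as one full hour per instance.
def cluster_cost(instances, hours, hourly_rate)
  instances * hours.ceil * hourly_rate
end
```

This is why targeting a runtime just under one hour, as the deck recommends, wastes the least of the billed hour.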

Page 67:

Cost vs latency sweet spot at 50 min

Use smallest capable on-demand instance type: fixed hourly cost, no idle time

Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)

Target < 1 hour (~10 min runtime variability)

Page 68:

Cost vs latency sweet spot at 50 min

Use smallest capable on-demand instance type: fixed hourly cost, no idle time

Scale EMR-heavy jobs horizontally: cost(1 instance, N hours) = cost(N instances, 1 hour)

Target < 1 hour (~10 min runtime variability)

Crunch 50 GB of facts in 50 min using 40 instances for < $10

Page 69:

Store all your data - it's cheap

Page 70:

Store all your data - it's cheap

Store all your facts in Amazon S3 (your source of truth): 50 GB, $5/month

Page 71:

Store all your data - it's cheap

Store all your facts in Amazon S3 (your source of truth): 50 GB, $5/month

Store your analytics documents in Amazon RDS for indexed queries: 20 GB, $250/month

Page 72:

Store all your data - it's cheap

Store all your facts in Amazon S3 (your source of truth): 50 GB, $5/month

Store your analytics documents in Amazon RDS for indexed queries: 20 GB, $250/month

Retain intermediate data in Amazon S3 for diagnosis: 1.1 TB (60 days), $100/month

Page 73:

Swipely uses AWS Data Pipeline to build batch analytics, backfilling all our data, using resources efficiently.



Page 76:

Please give us your feedback on this presentation. As a thank you, we will select prize winners daily for completed surveys!

BDT207 Thank You