aws enterprise summit netherlands - big data architectural patterns & best practices

60
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Arie Leeuwesteijn, Solutions Architect, AWS September 2016 Big Data Architectural Patterns and Best Practices on AWS

Upload: amazon-web-services

Post on 18-Jan-2017

390 views

Category:

Technology


0 download

TRANSCRIPT

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Arie Leeuwesteijn, Solutions Architect, AWS

September 2016

Big Data Architectural Patterns and Best Practices on AWS

Agenda

Big data challengesHow to simplify big data processingWhat technologies should you use?

• Why?• How?

Reference architectureDesign patterns

Ever Increasing Big Data

Volume

Velocity

Variety

Big Data Evolution

Batch processing

Streamprocessing

Machinelearning

Plethora of Tools

Amazon Glacier

S3 DynamoDB

RDS

EMR

Amazon Redshift

Data PipelineAmazon Kinesis

Kinesis-enabled app

Lambda ML

SQS

ElastiCache

DynamoDBStreams

Amazon ElasticsearchService

Big Data Challenges

Is there a reference architecture?

What tools should I use?

How?

Why?

Decoupled “data bus”• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas• Immutable (append-only) log, batch/speed/serving layer

Leverage AWS managed services• Scalable/elastic, available, reliable, secure, no/low admin

Big data ≠ big cost

Architectural Principles

Simplify Big Data Processing

COLLECT STORE PROCESS/ANALYZE CONSUME

Time to answer (Latency)Throughput

Cost

COLLECT

Types of Data

Database records

Search documents

Log files

Messaging events

Devices / sensors / IoT streamDevices

Sensors & IoT platforms

AWS IoT STREAMS

StreamstorageIo

T

COLLECT STORE

Mobile apps

Web apps

Data centersAWS Direct

Connect

RECORDSDatabase

Appl

icat

ions

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWSCloudTrail

DOCUMENTS

FILES

Search

File store

Logg

ing

Tran

spor

t

MessagingMessage MESSAGES Queue

Mes

sagi

ng

Store

Amazon Kinesis Firehose

Amazon KinesisStreams/Analytics

Apache Kafka

Amazon DynamoDB Streams

Amazon SQS

Amazon SQS• Managed message queue service

Apache Kafka• High throughput distributed messaging system

Amazon Kinesis Streams• Managed stream storage + processing

Amazon Kinesis Analytics• Process streaming data in real time

Amazon Kinesis Firehose• Managed data delivery

Amazon DynamoDB• Managed NoSQL database• Tables can be stream-enabled

Message & Stream Storage

Devices

Sensors & IoT platforms

AWS IoT STREAMS

IoT

COLLECT STORE

Mobile apps

Web apps

Data centersAWS Direct

Connect

RECORDSDatabase

Appl

icat

ions

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWSCloudTrail

DOCUMENTS

FILES

Search

File store

Logg

ing

Tran

spor

t

MessagingMessage MESSAGES

Mes

sagi

ng

Que

ueSt

ream

Why Stream Storage?

Decouple producers & consumersPersistent bufferCollect multiple streams

Preserve client orderingStreaming MapReduceParallel consumption

4 4 3 3 2 2 1 14 3 2 1

4 3 2 1

4 3 2 1

4 3 2 14 4 3 3 2 2 1 1

shard 1 / partition 1

shard 2 / partition 2

Consumer 1Count of red = 4

Count of violet = 4

Consumer 2Count of blue = 4

Count of green = 4

DynamoDB stream Amazon Kinesis stream Kafka topic

What About Queues & Pub/Sub ? • Decouple producers &

consumers/subscribers• Persistent buffer• Collect multiple streams• No client ordering• No parallel consumption for

Amazon SQS• Amazon SNS can route

to multiple queues or ʎ functions

• No streaming MapReduce

Consumers

Producers

Producers

Amazon SNS

Amazon SQS

queue

topic

function

ʎ

AWS Lambda

Amazon SQSqueue

Subscriber

What Stream Storage should I use?AmazonDynamoDBStreams

AmazonKinesisStreams

AmazonKinesis Firehose

ApacheKafka

AmazonSQS

AWS managed service

Yes Yes Yes No Yes

Guaranteedordering

Yes Yes Yes Yes No

Delivery exactly-once at-least-once exactly-once at-least-once at-least-once

Data retention period

24 hours 7 days N/A Configurable 14 days

Availability 3 AZ 3 AZ 3 AZ Configurable 3 AZ

Scale / throughput

No limit /~ table IOPS

No limit /~ shards

No limit /automatic

No limit /~ nodes

No limits /automatic

Parallel clients Yes Yes No Yes No

Stream MapReduce Yes Yes N/A Yes N/A

Record/object size 400 KB 1 MB Amazon Redshift row size Configurable 256 KB

Cost Higher (table cost) Low Low Low (+admin) Low-medium

Hot Warm

COLLECT STORE

Mobile apps

Web apps

Data centersAWS Direct

Connect

RECORDSDatabase

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWSCloudTrail

DOCUMENTS

FILES

Search

MessagingMessage MESSAGES

Devices

Sensors & IoT platforms

AWS IoT STREAMS

Apache Kafka

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB Streams

Hot

Stre

am

Amazon S3

Amazon SQS

Mes

sage

Amazon S3File

Logg

ing

IoT

Appl

icat

ions

Tran

spor

tM

essa

ging

File Storage

Why is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS)• Can run transient Hadoop clusters & Amazon EC2 Spot Instances• Multiple distinct (Spark, Hive, Presto) clusters can use the same data• Unlimited number of objects • Very high bandwidth – no aggregate throughput limit• Highly available – can tolerate AZ failure• Designed for 99.999999999% durability• Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policy• Secure – SSL, client/server-side encryption at rest• Low cost

What about HDFS & Amazon Glacier?

• Use HDFS for very frequently accessed (hot) data

• Use Amazon S3 Standard for frequently accessed data

• Use Amazon S3 Standard – IA for infrequently accessed data

• Use Amazon Glacier for archiving cold data

Cache, database, search

COLLECT STORE

Mobile apps

Web apps

Data centersAWS Direct

Connect

RECORDS

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWSCloudTrail

DOCUMENTS

FILES

MessagingMessage MESSAGES

Devices

Sensors & IoT platforms

AWS IoT STREAMS

Apache Kafka

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB Streams

Hot

Stre

am

Amazon SQS

Mes

sageAmazon Elasticsearch Service

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Sear

ch

SQL

N

oSQ

L

Cac

heFi

le

Logg

ing

IoT

Appl

icat

ions

Tran

spor

tM

essa

ging

Database Anti-pattern

Data TierDatabase tier

Best Practice - Use the Right Tool for the Job

Data TierSearch

Amazon Elasticsearch

and CloudSearch

Cache

AmazonElastiCache

RedisMemcached

SQL

Amazon AuroraMySQL

PostgreSQLOracle

SQL ServerMariaDB

NoSQL

AmazonDynamoDBCassandra

HBaseMongoDB

Database tier

Materialized Views

Amazon Elasticsearch Service

KCL

What Data Store Should I Use?

Data structure → Fixed schema, JSON, key-value

Access patterns → Store data in the format you will access it

Data / access characteristics → Hot, warm, cold

Cost → Right cost

Data Structure and Access PatternsAccess Patterns What to use?Put/Get (key, value) Cache, NoSQLSimple relationships → 1:N, M:N NoSQLCross table joins, transaction, SQL SQLFaceting, search, free text Search

Data Structure What to use?Fixed schema SQL, NoSQLSchema-free (JSON) NoSQL, Search(Key, value) Cache, NoSQL

What Is the Temperature of Your Data / Access ?

Hot Warm ColdVolume MB–GB GB–TB PBItem size B–KB KB–MB KB–TBLatency ms ms, sec min, hrsDurability Low – high High Very highRequest rate Very high High LowCost/GB $$-$ $-¢¢ ¢

Hot data Warm data Cold data

Data / Access Characteristics: Hot, Warm, Cold

Cache SQL

Request rateHigh Low

Cost/GBHigh Low

LatencyLow High

Data volumeLow High

Amazon Glacier

Stru

ctur

e

NoSQL

Hot data Warm data Cold data

Low

High

Search

Amazon ElastiCache AmazonDynamoDB

AmazonRDS/Aurora

AmazonElasticsearch

Amazon S3 Amazon Glacier

Average latency

ms ms ms, sec ms,sec ms,sec,min(~ size)

hrs

Typicaldata stored

GB GB–TBs(no limit)

GB–TB(64 TB max)

GB–TB MB–PB(no limit)

GB–PB(no limit)

Typicalitem size

B-KB KB(400 KB max)

KB(64 KB max)

KB(2 GB max)

KB-TB(5 TB max)

GB(40 TB max)

Request Rate

High – very high Very high(no limit)

High High Low – high(no limit)

Very low

Storage costGB/month

$$ ¢¢ ¢¢ ¢¢ ¢ ¢/10

Durability Low - moderate Very high Very high High Very high Very high

Availability High2 AZ

Very high 3 AZ

Very high3 AZ

High2 AZ

Very high3 AZ

Very high3 AZ

Hot data Warm data Cold data

What Data Store Should I Use?

Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate (Writes/sec)

Object size(Bytes)

Total size(GB/month)

Objects per month

300 2048 1483 777,600,000

Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?

https://calculator.s3.amazonaws.com/index.html

Simple Monthly Calculator

Request rate (Writes/sec)

Object size(Bytes)

Total size(GB/month)

Objects per month

300 2,048 1,483 777,600,000

Amazon S3 orAmazon DynamoDB?

Request rate (Writes/sec)

Object size(Bytes)

Total size(GB/month)

Objects per month

Scenario 1300 2,048 1,483 777,600,000

Scenario 2300 32,768 23,730 777,600,000

Amazon S3

Amazon DynamoDB

use

use

COLLECT STORE PROCESS / ANALYZE

Amazon Elasticsearch Service

Apache Kafka

Amazon SQS

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon DynamoDB Streams

Hot

Hot

War

m

Sear

ch

SQL

N

oSQ

L

Cac

heFi

leM

essa

geSt

ream

Mobile apps

Web apps

Devices

MessagingMessage

Sensors & IoT platforms

AWS IoT

Data centersAWS Direct

Connect

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWSCloudTrail

RECORDS

DOCUMENTS

FILES

MESSAGES

STREAMS

Logg

ing

IoT

Appl

icat

ions

Tran

spor

tM

essa

ging

Process /analyze

PROCESS / ANALYZE

Process / Analyze

• Batch - Minutes or hours on cold data• Daily/weekly/monthly reports

• Interactive – Seconds on warm/cold data• Self-service dashboards

• Messaging – Milliseconds or seconds on hot data • Message/event buffering

• Streaming - Milliseconds or seconds on hot data• Billing/fraud alerts, 1 minute metrics

What Analytics Technology Should I Use?Amazon Redshift

Amazon EMR

Presto Spark Hive

Query latency

Low Low Low High

Durability High High High High

Data volume 1.6 PB max ~Nodes ~Nodes ~Nodes

AWS managed

Yes Yes Yes Yes

Storage Native HDFS / S3 HDFS / S3 HDFS / S3

SQL compatibility

High High Low (SparkSQL) Medium (HQL)

Slow

Spark Streaming Apache StormAWS Lambda

KCL appsAmazon Redshift

AmazonRedshift

Hive

Spark Presto

Amazon KinesisApache Kafka

Amazon DynamoDB Amazon S3data

Hot ColdData temperature

Proc

essi

ng s

peed

Fast

Slow Answers

Hive

Native appsKCL appsAWS Lambda

Data Temperature vs. Processing Speed

• Replicate within, to, or from Amazon EC2 or RDS

• Move data to the same or different database engine

• Keep your apps running during the migration

• Start your first migration in 10 minutes or less.

AWSDatabase Migration

Service

• Migrate from Oracle, SQL Server and PostgreSQL

• Converts your database schema including tables, stored procedures, views, and DML

• Downloadable client install for Microsoft

Windows, Mac and Linux desktops

• Highlight where manual edits are needed

AWSSchema

Conversion Tool

Predictions via Machine Learning

ML gives computers the ability to learn without being explicitly programmed

Machine learning algorithms:Supervised learning ← “teach” program

- Classification ← Is this transaction fraud? (yes/no) - Regression ← Customer life-time value?

Unsupervised learning ← Let it learn by itself- Clustering ← Market segmentation

Tools and FrameworksMachine Learning

• Amazon ML, Amazon EMR (Spark ML)Interactive

• Amazon Redshift, Amazon EMR (Presto, Spark)Batch

• Amazon EMR (MapReduce, Hive, Pig, Spark)Messaging

• Amazon SQS application on Amazon EC2Streaming

• Micro-batch: Spark Streaming, KCL• Real-time: Amazon Kinesis Analytics, Storm,

AWS Lambda, KCL

Amazon SQS apps

Streaming

Amazon Kinesis Analytics

Amazon KCLapps

AWS Lambda

Amazon Redshift

PROCESS / ANALYZE

Amazon Machine Learning

Presto

AmazonEMR

Fast

Slow

Fast

Batc

hM

essa

geIn

tera

ctiv

eSt

ream

ML

Amazon EC2

Amazon EC2

What Streaming / Messaging Technology Should I Use?Spark Streaming

Apache Storm

Kinesis KCL Application

AWS Lambda Amazon SQS Apps

Scale ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes

Micro-batch or Real-time

Micro-batch Real-time Near-real-time Near-real-time Near-real-time

AWS managed service

Yes (EMR) No (EC2) No (KCL + EC2 + Auto Scaling)

Yes No (EC2 + AutoScaling)

Scalability No limits ~ nodes

No limits~ nodes

No limits~ nodes

No limits No limits

Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ

Programminglanguages

Java, Python, Scala

Any language via Thrift

Java, via MultiLangDaemon (.NET,Python, Ruby, Node.js)

Node.js, Java, Python

AWS SDK languages (Java, .NET, Python, …)

What About ETL?

https://aws.amazon.com/big-data/partner-solutions/

ETLSTORE PROCESS / ANALYZE

Amazon SQS apps

Streaming

Amazon Kinesis Analytics

Amazon KCLapps

AWS Lambda

Amazon Redshift

COLLECT STORE CONSUMEPROCESS / ANALYZE

Amazon Machine Learning

Presto

AmazonEMR

Amazon Elasticsearch Service

Apache Kafka

Amazon SQS

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon DynamoDB Streams

Hot

Hot

War

m

Fast

Slow

Fast

Batc

hM

essa

geIn

tera

ctiv

eSt

ream

ML

Sear

ch

SQL

N

oSQ

L

Cac

heFi

leM

essa

geSt

ream

Amazon EC2

Amazon EC2

Mobile apps

Web apps

Devices

MessagingMessage

Sensors & IoT platforms

AWS IoT

Data centersAWS Direct

Connect

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWSCloudTrail

RECORDS

DOCUMENTS

FILES

MESSAGES

STREAMS

Logg

ing

IoT

Appl

icat

ions

Tran

spor

tM

essa

ging

ETL

CONSUME

STORE CONSUMEPROCESS / ANALYZE

Amazon QuickSight

Apps & Services

Anal

ysis

& v

isua

lizat

ion

Not

eboo

ksID

EAP

I

Applications & API

Analysis and visualization

Notebooks

IDE

Business users

Data scientist, developers

COLLECT ETL

Putting It All Together

Amazon SQS apps

Streaming

Amazon Kinesis Analytics

Amazon KCLapps

AWS Lambda

Amazon Redshift

COLLECT STORE CONSUMEPROCESS / ANALYZE

Amazon Machine Learning

Presto

AmazonEMR

Amazon Elasticsearch Service

Apache Kafka

Amazon SQS

Amazon KinesisStreams

Amazon Kinesis Firehose

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon DynamoDB Streams

Hot

Hot

War

m

Fast

Slow

Fast

Batc

hM

essa

geIn

tera

ctiv

eSt

ream

ML

Sear

ch

SQL

N

oSQ

L

Cac

heFi

leQ

ueue

Stre

am

Amazon EC2

Amazon EC2

Mobile apps

Web apps

Devices

MessagingMessage

Sensors & IoT platforms

AWS IoT

Data centersAWS Direct

Connect

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWSCloudTrail

RECORDS

DOCUMENTS

FILES

MESSAGES

STREAMS

Amazon QuickSight

Apps & Services

Anal

ysis

& v

isua

lizat

ion

Not

eboo

ksID

EAP

I

Reference architecture

Logg

ing

IoT

Appl

icat

ions

Tran

spor

tM

essa

ging

ETL

Design Patterns

Common issues

• Storage type• Does not match data type• Access patterns• Price/capacity, deletion of data

• Processing models• Tight coupling

• Over capacity• Producer bottleneck• Performance, throughput

• Tooling

Primitive: Multi-Stage Decoupled “Data Bus”

Multiple stagesStorage decouples multiple processing stages

Store Process Store Process

processstore

Primitive: Multiple Stream Processing Applications Can Read from Amazon Kinesis

Amazon Kinesis

AWS Lambda

Amazon DynamoDB

Amazon Kinesis S3Connector

processstore

Amazon S3

Amazon EMR

Primitive: Analysis Frameworks Could Read from Multiple Data Stores

Amazon Kinesis

AWS Lambda

Amazon S3

Amazon DynamoDB

Spark Streaming

Amazon Kinesis S3Connector

process store

Spark SQL

Amazon EMR

Real-time Analytics

ApacheKafka

KCL

AWS Lambda

SparkStreaming

Apache Storm

Amazon SNS

AmazonML

Notifications

AmazonElastiCache

(Redis)Amazon

DynamoDB

AmazonRDS

AmazonES

Alert

App state

Real-time prediction

KPI

processstore

DynamoDBStreams

Amazon Kinesis

Message / EventProcessing

Amazon SQS

Amazon SQS App

Amazon SQS App

Amazon SNS Subscribers

AmazonElastiCache

(Redis)Amazon

DynamoDB

AmazonRDS

AmazonES

Publish

App state

KPI

processstore

Amazon SQS App

Amazon SQSApp

Auto Scaling group

Amazon SQSPriority queue Messages /

events

Interactive & BatchAnalytics

Amazon S3

Amazon EMR

Hive

Pig

Spark

AmazonML

processstore

Consume

Amazon Redshift

Amazon EMR

Presto

Spark

Batch

Interactive

Batch prediction

Real-time predictionAmazon Kinesis

Firehose

Batch layer

AmazonKinesis

Datastream

processstore

Amazon Kinesis

FirehoseAmazon S3

Applications

Amazon Redshift

Amazon EMR

Presto

Hive

Pig

Spark answer

Speed layer

answer

Serving layer

AmazonElastiCache

AmazonDynamoDBAmazon

RDSAmazon

ES

answer

AmazonML

KCL

AWS Lambda

Storm

Lambda Architecture

Spark Streaming on Amazon EMR

Customer Cases

Summary

Decoupled “data bus”• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas• Immutable (append-only) log, batch/speed/serving layer

Leverage AWS managed services• Scalable/elastic, available, reliable, secure, no/low admin

Big data ≠ big cost

Thank you!aws.amazon.com/big-data