big data architectural patterns and best practices on aws · pdf filereference architecture...

Antoine Généreux, Solutions Architect, AWS

April 4th , 2017

Big Data Architectural Patterns and Best Practices on AWSBigDataMontréal(BDM52)

What to Expect from the Session

Big data challengesArchitectural principlesHow to simplify big data processingWhat technologies should you use?

• Why?• How?

Reference architectureDesign patterns

Ever-Increasing Big Data

Volume

Velocity

Variety

Big Data Evolution

Batch processing

Streamprocessing

Machinelearning

Cloud Services Evolution

Virtual machines

Managed services

Serverless

Plethora of Tools

Amazon Glacier

S3 DynamoDB

Amazon Redshift

Data PipelineAmazon Kinesis

Amazon Kinesis Streams app

Lambda Amazon ML

ElastiCache

DynamoDBStreams

Amazon ElasticsearchService

Amazon Kinesis Analytics

Big Data Challenges

What tools should I use?

Is there a reference architecture?

Architectural Principles

Build decoupled systems• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job• Data structure, latency, throughput, access patterns

Leverage AWS managed services• Scalable/elastic, available, reliable, secure, no/low admin

Use log-centric design patterns• Immutable logs, materialized views

Be cost-conscious• Big data ≠ big cost

Simplify Big Data Processing

COLLECT STORE PROCESS/ANALYZE CONSUME

Time to answer (Latency)Throughput

COLLECT

Types of DataCOLLECT

Mobile apps

Web apps

Data centersAWS Direct

Connect

RECORDS

ions In-memory data structures

Database records

AWS Import/ExportSnowball

Logging

Amazon CloudWatch

AWS CloudTrail

DOCUMENTS

Search documents

Log files

MessagingMessage MESSAGES

Messages

Devices

Sensors & IoT platforms

AWS IoT STREAMS

IoT Data streams

Transactions

Events

What Is the Temperature of Your Data ?

Hot Warm ColdVolume MB–GB GB–TB PB–EBItem size B–KB KB–MB KB–TBLatency ms ms, sec min, hrsDurability Low–high High Very highRequest rate Very high High LowCost/GB $$-$ $-¢¢ ¢

Hot data Warm data Cold data

Data Characteristics: Hot, Warm, Cold

Devices

AWS IoT STREAMS

COLLECT

Logging

Amazon CloudWatch

AWS CloudTrail

DOCUMENTS

Mobile apps

Web apps

Connect

RECORDS

Types of Data Stores

Database SQL & NoSQL databases

Search Search engines

File store File systems

Queue Message queues

Streamstorage

Pub/sub message queues

In-memory Caches, data structure servers

In-memory

Amazon Kinesis Firehose

Amazon KinesisStreams

Apache Kafka

Amazon DynamoDB Streams

Amazon SQS

Amazon SQS• Managed message queue service

Apache Kafka• High throughput distributed streaming

platform

Amazon Kinesis Streams• Managed stream storage + processing

Amazon Kinesis Firehose• Managed data delivery

Amazon DynamoDB• Managed NoSQL database• Tables can be stream-enabled

Message & Stream Storage

Devices

AWS IoT STREAMS

COLLECT STORE

Mobile apps

Web apps

Connect

RECORDSDatabaseAp

Logging

Amazon CloudWatch

AWS CloudTrail

DOCUMENTS

Search

File store

Why Stream Storage?

Decouple producers & consumersPersistent bufferCollect multiple streams

Preserve client orderingParallel consumptionStreaming MapReduce

4 4 3 3 2 2 1 14 3 2 1

4 3 2 1

4 3 2 14 4 3 3 2 2 1 1

shard 1 / partition 1

shard 2 / partition 2

Consumer 1Count of red = 4

Count of violet = 4

Consumer 2Count of blue = 4

Count of green = 4

DynamoDB stream Amazon Kinesis stream Kafka topic

What About Amazon SQS? • Decouple producers & consumers• Persistent buffer• Collect multiple streams• No client ordering (Standard)

• FIFO queue preserves client ordering

• No streaming MapReduce• No parallel consumption

• Amazon SNS can publish to multiple SNS subscribers (queues or ʎ functions)

Publisher

Amazon SNS

function

AWS Lambdafunction

Amazon SQSqueue

Subscriber

4 3 2 1

12344 3 2 1

Standard

Which Stream/Message Storage Should I Use?AmazonDynamoDBStreams

AmazonKinesisStreams

AmazonKinesis Firehose

ApacheKafka

AmazonSQS (Standard)

Amazon SQS (FIFO)

AWS managed Yes Yes Yes No Yes Yes

Guaranteed ordering Yes Yes No Yes No Yes

Delivery (deduping) Exactly-once At-least-once At-least-once At-least-once At-least-once Exactly-once

Data retention period 24 hours 7 days N/A Configurable 14 days 14 days

Availability 3 AZ 3 AZ 3 AZ Configurable 3 AZ 3 AZ

Scale / throughput

No limit /~ table IOPS

No limit /~ shards

No limit /automatic

No limit /~ nodes

No limits /automatic

300 TPS / queue

Parallel consumption Yes Yes No Yes No No

Stream MapReduce Yes Yes N/A Yes N/A N/A

Row/object size 400 KB 1 MB Destination row/object size

Configurable 256 KB 256 KB

Cost Higher (table cost)

Low Low Low (+admin) Low-medium Low-medium

Hot Warm

In-memory

COLLECT STORE

Mobile apps

Web apps

Connect

RECORDSDatabase

Logging

Amazon CloudWatch

AWS CloudTrail

DOCUMENTS

Search

Devices

AWS IoT STREAMS

Apache Kafka

Amazon S3

Amazon SQS

Amazon S3File

File Storage

Why Is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS)• Can run transient Hadoop clusters & Amazon EC2 Spot Instances• Multiple & heterogeneous analysis clusters can use the same data• Unlimited number of objects and volume of data• Very high bandwidth – no aggregate throughput limit• Designed for 99.99% availability – can tolerate zone failure• Designed for 99.999999999% durability• No need to pay for data replication• Native support for versioning• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies• Secure – SSL, client/server-side encryption at rest• Low cost

What About HDFS & Data Tiering?

• Use HDFS for very frequently accessed (hot) data

• Use Amazon S3 Standard for frequently accessed data

• Use Amazon S3 Standard – IA for less frequently accessed data

• Use Amazon Glacier for archiving cold data

In-memory

COLLECT STORE

Mobile apps

Web apps

Connect

RECORDS Database

Logging

Amazon CloudWatch

AWS CloudTrail

DOCUMENTS

Search

Devices

AWS IoT STREAMS

Apache Kafka

Amazon SQS

Amazon S3File

In-memory, Database, Search

Anti-Pattern

Data Tier

Applications

Best Practice: Use the Right Tool for the Job

Data TierSearch

Amazon Elasticsearch Service

In-memory

Amazon ElastiCacheRedisMemcached

Amazon AuroraAmazon RDS

MySQLPostgreSQLOracleSQL Server

Amazon DynamoDBCassandraHBaseMongoDB

Applications

COLLECT STORE

Mobile apps

Web apps

Connect

RECORDS

Logging

Amazon CloudWatch

AWS CloudTrail

DOCUMENTS

Devices

AWS IoT STREAMS

Apache Kafka

Amazon SQS

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon ElastiCache• Managed Memcached or Redis service

Amazon DynamoDB• Managed NoSQL database service

Amazon RDS• Managed relational database service

Amazon Elasticsearch Service• Managed Elasticsearch service

Which Data Store Should I Use?

Data structure → Fixed schema, JSON, key-value

Access patterns → Store data in the format you will access it

Data characteristics → Hot, warm, cold

Cost → Right cost

Data Structure and Access PatternsAccess Patterns What to use?Put/Get (key, value) In-memory, NoSQLSimple relationships → 1:N, M:N NoSQLMulti-table joins, transaction, SQL SQLFaceting, search Search

Data Structure What to use?Fixed schema SQL, NoSQLSchema-free (JSON) NoSQL, Search(Key, value) In-memory, NoSQL

In-memory SQL

Request rateHigh Low

Cost/GBHigh Low

LatencyLow High

Data volumeLow High

Amazon Glacier

Amazon ElastiCache AmazonDynamoDB

AmazonRDS/Aurora

AmazonES

Amazon S3 Amazon Glacier

Average latency

ms ms ms, sec ms,sec ms,sec,min(~ size)

Typicaldata stored

GB GB–TBs(no limit)

GB–TB(64 TB max)

GB–TB MB–PB(no limit)

GB–PB(no limit)

Typicalitem size

B-KB KB(400 KB max)

KB(64 KB max)

B-KB(2 GB max)

KB-TB(5 TB max)

GB(40 TB max)

Request Rate

High – very high Very high(no limit)

High High Low – high(no limit)

Very low

Storage costGB/month

$$ ¢¢ ¢¢ ¢¢ ¢ ¢4/10

Durability Low - moderate Very high Very high High Very high Very high

Availability High2 AZ

Very high 3 AZ

Very high3 AZ

High2 AZ

Very high3 AZ

Which Data Store Should I Use?

Cost-Conscious Design

Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate (Writes/sec)

Object size(Bytes)

Total size(GB/month)

Objects per month

300 2048 1483 777,600,000

https://calculator.s3.amazonaws.com/index.html

Simple Monthly Calculator

Cost-Conscious Design

Example: Should I use Amazon S3 or Amazon DynamoDB?

Object size(Bytes)

Objects per month

300 2,048 1,483 777,600,000

Amazon S3 orDynamoDB?

Object size(Bytes)

Objects per month

Scenario 1300 2,048 1,483 777,600,000

Scenario 2300 32,768 23,730 777,600,000

Amazon S3

Amazon DynamoDB

PROCESS / ANALYZE

BatchTakes minutes to hours Example: Daily/weekly/monthly reportsAmazon EMR (MapReduce, Hive, Pig, Spark)

InteractiveTakes secondsExample: Self-service dashboardsAmazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)

MessageTakes milliseconds to secondsExample: Message processingAmazon SQS applications on Amazon EC2

StreamTakes milliseconds to secondsExample: Fraud alerts, 1 minute metricsAmazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda

Machine LearningTakes milliseconds to minutesExample: Fraud detection, forecast demandAmazon ML, Amazon EMR (Spark ML)

Analytics Types & Frameworks PROCESS / ANALYZE

Amazon Machine LearningM

Amazon SQS appsAmazon EC2

Streaming

KCLapps

AWS Lambda

Amazon EC2

Amazon EMR

Amazon Redshift

Presto

AmazonEMR

Amazon Athena

Which Stream & Message Processing Technology Should I Use?Amazon EMR (Spark Streaming)

Apache Storm

KCL Application Amazon Kinesis Analytics

AWS Lambda

Amazon SQS Application

AWS managed

Yes (Amazon EMR)

No (Do it yourself)

No ( EC2 + AutoScaling)

Yes Yes No (EC2 + AutoScaling)

Serverless No No No Yes Yes No

Scale / throughput

No limits /~ nodes

Up to 8 KPU / automatic

No limits /automatic

No limits /~ nodes

Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ Multi-AZ

Programminglanguages

Java, Python, Scala

Almost any language viaThrift

Java, others via MultiLangDaemon

ANSI SQL with extensions

Node.js, Java, Python

AWS SDK languages (Java, .NET, Python, …)

Uses Multistage processing

Multistage processing

Single stage processing

Multistage processing

Simple event-based triggers

Simple event based triggers

Reliability KCL and Spark checkpoints

Framework managed

Managed by KCL Managed by Amazon Kinesis Analytics

Managed by AWS Lambda

Managed by SQS Visibility Timeout

Which Analysis Tool Should I Use?Amazon Redshift Amazon Athena Amazon EMR

Presto Spark Hive

Use case Optimized for data warehousing

Ad-hoc Interactive Queries

Interactive Query

General purpose (iterative ML, RT, ..)

Scale/throughput ~Nodes Automatic / No limits ~ Nodes

AWS ManagedService

Yes Yes, Serverless Yes

Storage Local storage Amazon S3 Amazon S3, HDFS

Optimization Columnar storage, data compression, and zone

CSV, TSV, JSON, Parquet, ORC, Apache

Web log

Framework dependent

Metadata Amazon Redshift managed Athena Catalog Manager

Hive Meta-store

BI tools supports Yes (JDBC/ODBC) Yes (JDBC) Yes (JDBC/ODBC & Custom)

Access controls Users, groups, and access controls

AWS IAM Integration with LDAP

UDF support Yes (Scalar) No Yes

What About ETL?

https://aws.amazon.com/big-data/partner-solutions/

ETLSTORE PROCESS / ANALYZE

Data Integration PartnersReduce the effort to move, cleanse, synchronize, manage, and automatize data related processes. AWS Glue (Preview)

AWSGlueisafullymanagedETLservicethatmakesiteasytounderstandyourdatasources,preparethedata,andmoveitreliablybetweendatastores

CONSUME

COLLECT STORE CONSUMEPROCESS / ANALYZE

Apache Kafka

Amazon SQS

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Mobile apps

Web apps

Devices

MessagingMessage

AWS IoT

Connect

Logging

Amazon CloudWatch

AWS CloudTrail

RECORDS

DOCUMENTS

MESSAGES

STREAMS

Streaming

KCLapps

AWS Lambda

Amazon EC2

Amazon EMR

Amazon SQS apps

Amazon Redshift

Amazon Machine Learning

Presto

AmazonEMR

Amazon EC2

Amazon Athena

STORE CONSUMEPROCESS / ANALYZE

Amazon QuickSight

Apps & Services

Applications & API

Analysis and visualization

Notebooks

Business users

Data scientist, developers

COLLECT ETL

Putting It All Together

Streaming

KCLapps

AWS Lambda

Apache Kafka

Amazon SQS

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon EC2

Mobile apps

Web apps

Devices

MessagingMessage

AWS IoT

Connect

Logging

Amazon CloudWatch

AWS CloudTrail

RECORDS

DOCUMENTS

MESSAGES

STREAMS

Amazon QuickSight

Apps & Services

Reference architecture

Amazon EMR

Amazon SQS apps

Amazon Redshift

Presto

AmazonEMR

Amazon EC2

Amazon Athena

Design Patterns

Primitive: Decoupled Data Bus

Storage decoupled from processingMultiple stages

Store Process Store Process

processstore

Primitive: Pub/Sub

Amazon Kinesis

AWS Lambda

Amazon Kinesis Connector Library

processstore

ApacheSpark

Parallel stream consumption/processing

process store

Primitive: Materialized Views

Amazon EMR

Amazon Kinesis

AWS Lambda

Amazon S3

Amazon DynamoDB

Spark Streaming

Amazon Kinesis Connector Library

Spark SQL

Analysis framework reads from or writes to multiple data stores

Spark Streaming Apache StormAWS Lambda

KCL appsAmazon Redshift

AmazonRedshift

Spark Presto

Amazon Kinesis Amazon DynamoDB Amazon S3data

Hot ColdData temperature

Slow Answers

Native appsKCL appsAWS Lambda

Amazon Athena

Amazon EMR

Real-time Analytics

AmazonKinesis

KCL app

AWS Lambda

SparkStreaming

Amazon SNS

AmazonML

Notifications

AmazonElastiCache

(Redis)

AmazonDynamoDB

AmazonRDS

AmazonES

Alerts

App state

Real-time prediction

processstore

Amazon KinesisAnalytics

AmazonS3

Amazon KinesisFan out

Interactive & BatchAnalytics

Amazon S3

Amazon EMR

AmazonML

processstore

Consume

Amazon Redshift

Amazon EMR

PrestoSpark

Interactive

Batch prediction

Real-time predictionAmazon Kinesis

Firehose

Amazon Athena

Interactive&Batch

Amazon S3

Amazon Redshift

Amazon EMRPresto

AmazonElastiCache

AmazonDynamoDB

AmazonRDS

AmazonES

AWS Lambda

Spark Streaming on Amazon EMR

Amazon Kinesis

App state

AmazonML

Real-time

AmazonDynamoDB

AmazonRDS

Change Data Capture

Transactions

Stream

Data Lake

Amazon Athena

Amazon Kinesis

Firehose

Summary

Build decoupled systems• Data → Store → Process → Store → Analyze → Answers

Use the right tool for the job• Data structure, latency, throughput, access patterns

Leverage AWS managed services• Scalable/elastic, available, reliable, secure, no/low admin

Use log-centric design patterns• Immutable log, batch, interactive & real-time views

Be cost-conscious• Big data ≠ big cost

Amazon SQS apps

Streaming

KCLapps

Amazon Redshift

Presto

AmazonEMR

Apache Kafka

Amazon SQS

Amazon DynamoDB

Amazon S3

Amazon ElastiCache

Amazon RDS

Amazon EC2

Mobile apps

Web apps

Devices

MessagingMessage

AWS IoT

Connect

Logging

Amazon CloudWatch

AWS CloudTrail

RECORDS

DOCUMENTS

MESSAGES

STREAMS

Amazon QuickSight

Apps & Services

Amazon EMR

AWS Lambda

Amazon Athena

AWS Free Tier (aws.amazon.com/free)

We’re Hiring! (jobs.amazon.com)Montreal• Solutions Architect (486946)

Toronto• Solutions Architect (493577,493576,493575,493574)• Security Architect (480504)• Cloud Infrastructure Architect (456134)• Senior Solutions Architect (416514)• Technical Account Manager (403410)

Vancouver• Solutions Architect (493572,493571)• SI Solutions Architect (473414)• Senior Solutions Architect (458043)

Thank you! Questions?

Antoine Généreux, genereux@amazon.com

Resources

Blogs, Whitepapers and Video Sessions(Blog) Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learninghttps://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/(Blog) Implement Serverless Log Analytics Using Kinesis Analyticshttps://aws.amazon.com/blogs/big-data/implement-serverless-log-analytics-using-amazon-kinesis-analytics/(Blog) Harmonize, Search and Analyze Loosely Coupled Datasets on AWShttps://aws.amazon.com/blogs/big-data/harmonize-search-and-analyze-loosely-coupled-datasets-on-aws/(Blog) Build an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakawayhttps://aws.amazon.com/blogs/big-data/building-an-event-based-analytics-pipeline-for-amazon-game-studios-breakaway/(Whitepaper) Big Data Analytics Options on AWShttps://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf

re:Invent 2016 Breakout Sessions | Big Data Mini Con (13 videos)https://www.youtube.com/playlist?list=PLhr1KZpdzukdi1XNNdkDwQqOWObl5FDsV

re:Invent 2016 Breakout Sessions | Databases (26 videos)https://www.youtube.com/playlist?list=PLhr1KZpdzukc97mkL2-TOE-vJtp22c1yc

re:Invent 2016 Breakout Sessions | Machine Learning Mini Con (14 videos)https://www.youtube.com/playlist?list=PLhr1KZpdzukexYSNcIj9iBbmn9jYKu2pu

Code Samples (https://github.com/awslabs)Harmonize, Search and Analyze Loosely Coupled Datasets on AWS (code related to blog post)https://github.com/awslabs/harmonize-search-analyzeAWS Data Lake Solutionhttps://github.com/awslabs/aws-data-lake-solutionServerless MapReduce Reference Architecturehttps://github.com/awslabs/lambda-refarch-mapreduceStreaming Analytics Pipelinehttps://github.com/awslabs/streaming-analytics-pipelineReal-time analytics using Kinesis and Spark Streaminghttps://github.com/awslabs/real-time-analytics-spark-streamingServerless Reference Architecture for Real-time Stream Processinghttps://github.com/awslabs/lambda-refarch-streamprocessingAmazon Kinesis Data Generatorhttps://github.com/awslabs/amazon-kinesis-data-generatorMachine Learning Sampleshttps://github.com/awslabs/machine-learning-samplesCFN Cluster for Deep Learning AMIshttps://github.com/awslabs/deeplearning-cfnAmazon Simple Beer Servicehttps://github.com/awslabs/simplebeerservice

big data architectural patterns and best practices on aws · pdf filereference architecture...

Documents

amazon elasticache deep dive - aws

scanned by camscanner - devopsschool.com - 10.pdf · 260...

(dat407) amazon elasticache: deep dive

amazon elasticache (dan zamansky) - aws db day

big data patterns with mahout

amazon elasticache para redisamazon elasticache para redis...

amazon elasticache - api reference · amazon elasticache...

amazon elasticache - elasticache for memcached...

aws re:invent 2016: elasticache deep dive: best practices...

use case for using the elasticache for redis in production

amazon elasticache deep dive - march 2017 aws online tech...

patterns in big data

elastic cache - quality thoughtindia) pvt. amazon...

fast data at scale with amazon elasticache for redis

big data: the science of patterns -...

elastic cache - home - quality thought...elasticache is an...

s2 l5 monitoring elasticache

patterns of big data forrester

accelerating application performance with amazon elasticache...

amazon elasticache for redis - elasticache for redis … ·...