AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Platform (BDA206)

Post on 11-Jan-2017


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Matt Yanchyshyn

Sr. Manager, Solutions Architecture, AWS

November 30, 2016

Building Big Data Applications with the AWS Big Data Platform

BDA206

[Diagram: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights]

START HERE WITH A BUSINESS CASE

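[Diagram: AWS services shown across the Collect, Store, and Analyze stages]

• AWS Data Pipeline

• AWS Database Migration Service

• AWS Direct Connect

• AWS Snowball

• AWS IoT

• Amazon Kinesis

• Amazon S3

• Amazon Glacier

• Amazon DynamoDB

• Amazon EMR

• Amazon EC2

• Amazon Redshift

• Amazon Machine Learning

• Amazon Elasticsearch Service

• Amazon Athena

• Amazon QuickSight

• AWS Lambda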

Building a Big Data Application: Build a data warehouse with Amazon Redshift

[Diagram: web clients; mobile clients; DBMS (corporate data center); Amazon Redshift (AWS Cloud)]

Structured Data Processing

• Petabyte-scale relational, MPP, data warehousing

• Fully managed with SSD and HDD platforms

• Built-in end-to-end security, including customer-managed keys

• Fault-tolerant. Automatically recovers from disk and node failures

• Data automatically backed up to Amazon S3 with cross-region backup capability for global disaster recovery

• Over 140 new features added since launch

• $1,000/TB/Year; start at $0.25/hour. Provision in minutes; scale from 160 GB to 2 PB of compressed data with just a few clicks

Amazon Redshift

How do you get your (big) data into AWS?

Building a Big Data Application: Migrate your data to AWS

[Diagram: web clients; mobile clients; DBMS (corporate data center); Amazon Redshift (AWS Cloud); migration options: AWS Database Migration Service, AWS Direct Connect, AWS Import/Export & Snowball]

Start your first migration in 10 minutes or less

Keep your apps running during the migration

Migrate to databases running on Amazon EC2, Amazon RDS, or Amazon Redshift

AWS Database Migration Service

AWS Snowball: PB-scale Data Transport

• E-ink shipping label

• Ruggedized case, “8.5G Impact”

• All data encrypted end-to-end

• 50 TB & 80 TB

• 10G network

• Rain & dust resistant

• Tamper-resistant case & electronics
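As a rough feel for what the capacity and network figures above imply, the sketch below estimates how long it would take to move a given volume of data over a link at full line rate. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbit/s = 10^9 bits per second) and ignores protocol overhead and disk throughput; the function name is illustrative and not part of any AWS tool.

```python
def transfer_hours(terabytes: float, gigabits_per_second: float) -> float:
    """Ideal time, in hours, to move `terabytes` over a link of the given speed."""
    bits = terabytes * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (gigabits_per_second * 1e9)     # line rate, no overhead
    return seconds / 3600


if __name__ == "__main__":
    for size_tb in (50, 80):
        hours = transfer_hours(size_tb, 10)
        print(f"{size_tb} TB at 10 Gbit/s: about {hours:.1f} hours")
```

Under these idealized assumptions, filling either device over a 10 Gbit/s link takes well under a day; the same volume over a slower wide-area link scales up in direct proportion.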

Your CEO doesn’t want to look at raw SQL query output

Business Intelligence

• Fast and cloud-powered

• Easy to use, no infrastructure to manage

• Scales to hundreds of thousands of users

• Quick calculations with SPICE

• 1/10th the cost of legacy BI software

Amazon QuickSight

Building a Big Data Application: Visualize your data with Amazon QuickSight

[Diagram: web clients; mobile clients; DBMS (corporate data center); Amazon Redshift and Amazon QuickSight (AWS Cloud); migration options: AWS Database Migration Service, AWS Direct Connect, AWS Import/Export & Snowball]

What if your data isn’t structured?

What if you don’t need all the raw data?

What if you need to combine multiple data sets?

Serverless Event Processing

• Serverless compute service that runs your code in response to events

• Extend AWS services with user-defined custom logic

• Write custom code in Node.js, Python, and Java

• Pay only for the requests served and compute time required; billing in increments of 100 milliseconds

AWS Lambda
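The 100-millisecond billing increment means each invocation's run time is rounded up before it is charged. A minimal sketch of that rounding follows; the slide gives no price, so `price_per_increment` is a caller-supplied placeholder, and the function names are illustrative. Per-request charges are not modeled.

```python
import math

BILLING_INCREMENT_MS = 100  # from the slide: billed in 100 ms increments


def billed_duration_ms(actual_ms: float) -> int:
    """Round an invocation's run time up to the next billing increment."""
    return math.ceil(actual_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


def compute_cost(durations_ms, price_per_increment):
    """Compute-time charge for a batch of invocations.

    `price_per_increment` is a placeholder; no real price is hard-coded.
    """
    increments = sum(
        billed_duration_ms(d) // BILLING_INCREMENT_MS for d in durations_ms
    )
    return increments * price_per_increment
```

For example, an invocation that runs for 101 ms is billed as 200 ms, which is one reason short, predictable functions fit this pricing model well.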

Building a Big Data Application: Event-driven data transformations with AWS Lambda

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: raw data in Amazon S3, AWS Lambda, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight]
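One way to picture the event-driven transformation is a Lambda function that fires when a raw object lands in S3 and writes a structured copy elsewhere. The sketch below is an assumption-laden illustration, not the presenter's code: the raw data is assumed to be newline-delimited JSON, the output is CSV, the `structured/` key prefix is hypothetical, and the S3 reads and writes are passed in as plain functions so the logic can run without AWS access.

```python
import csv
import io
import json
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def json_lines_to_csv(raw_text, fields):
    """Turn newline-delimited JSON into CSV, keeping only `fields`."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in raw_text.splitlines():
        if line.strip():                      # skip blank lines
            writer.writerow(json.loads(line))
    return out.getvalue()


def make_handler(read_object, write_object, fields, output_bucket):
    """Build a Lambda-style handler(event, context).

    read_object(bucket, key) -> str and write_object(bucket, key, text) are
    hypothetical helpers; in a deployed function they would wrap S3 calls.
    """
    def handler(event, context):
        written = []
        for bucket, key in s3_objects_from_event(event):
            csv_text = json_lines_to_csv(read_object(bucket, key), fields)
            out_key = f"structured/{key}.csv"  # hypothetical naming scheme
            write_object(output_bucket, out_key, csv_text)
            written.append(out_key)
        return {"written": written}
    return handler
```

Writing to a separate output bucket (or a prefix the trigger does not watch) keeps the function from re-triggering itself on its own output.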

How will this work at scale?

What if the data processing exceeds the timeout?

Semi-structured/Unstructured Data Processing

• Hadoop, Hive, Presto, Spark, Tez, Impala, etc.

• Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink

• New applications added within 30 days of their open source release

• Fully managed, Auto Scaling clusters with support for On-Demand and Spot pricing

• Support for HDFS and S3 file systems, enabling separated compute and storage; multiple clusters can run against the same data in S3

• HIPAA-eligible. Support for end-to-end encryption, IAM/VPC, and S3 client-side encryption with customer-managed keys and AWS KMS

Amazon EMR

Building a Big Data Application: Transform and explore your data at scale with Amazon EMR

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: raw data in Amazon S3, Amazon EMR, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight]

What about ad-hoc queries when you are exploring new data?

Serverless Query Processing

• Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage

• No data loading required; query directly from Amazon S3

• Use standard ANSI SQL queries with support for joins, JSON, and window functions

• Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet

• Pay per query, based on data scanned, and only when you’re running queries. If you compress your data, you pay less and your queries run faster

Amazon Athena
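To make "query directly from Amazon S3" concrete, the sketch below submits a SQL statement through the Athena `start_query_execution` API as exposed by boto3. The client is passed in rather than created here, so the function can be exercised without AWS access; the table name `clickstream`, its columns, and the database and bucket names in the usage note are all hypothetical.

```python
def run_athena_query(athena_client, sql, database, output_location):
    """Submit `sql` to Athena and return the query execution ID.

    `athena_client` is expected to behave like boto3.client("athena").
    `output_location` is an S3 URI where Athena writes the query results.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical table and columns, used only for illustration.
TOP_PAGES_SQL = """
SELECT page, COUNT(*) AS views
FROM clickstream
WHERE year = '2016'
GROUP BY page
ORDER BY views DESC
LIMIT 10
"""
```

With boto3 installed and credentials configured, a call might look like `run_athena_query(boto3.client("athena"), TOP_PAGES_SQL, "weblogs", "s3://example-bucket/athena-results/")`. The call returns immediately; results are fetched separately once the query finishes.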

Building a Big Data Application: Extend your data warehouse to S3 with Amazon Athena

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: raw data in Amazon S3, Amazon EMR, staging data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight]

Building a Big Data Application: Extend your data warehouse to S3 with Amazon Athena

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: raw data in Amazon S3, Amazon EMR (shown twice), staging data in Amazon S3, ORC/Parquet in Amazon S3 (columnar data format), Amazon Redshift, Amazon Athena, Amazon QuickSight]
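The reason a columnar format helps a scan-priced query engine can be shown with a toy model: when a query needs one column, a row layout must read every full record, while a column layout reads only that column's values. The sketch below uses JSON text length as a stand-in for bytes on disk; real ORC and Parquet files add encoding, compression, and statistics that are not modeled here, and the sample records are invented.

```python
import json

# Invented sample records; every row is assumed to have the same fields.
ROWS = [
    {"user": "a", "page": "/home", "ms": 120},
    {"user": "b", "page": "/search", "ms": 95},
    {"user": "c", "page": "/listing/42", "ms": 300},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def bytes_to_read(rows, wanted):
    """Return (row_layout_bytes, column_layout_bytes) for reading one column."""
    row_bytes = sum(len(json.dumps(r)) for r in rows)        # every full row
    col_bytes = len(json.dumps(to_columns(rows)[wanted]))    # one column only
    return row_bytes, col_bytes


if __name__ == "__main__":
    row_bytes, col_bytes = bytes_to_read(ROWS, "ms")
    print(f"row layout: {row_bytes} bytes, column layout: {col_bytes} bytes")
```

The gap widens as records gain more fields, which is why converting raw data to a columnar format before querying it tends to reduce the data scanned.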

What if I want to run custom code or multiple frameworks?

Building a Big Data Application: Extend your data warehouse to S3 with Presto, Spark SQL, etc. on Amazon EMR

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: raw data in Amazon S3, staging data in Amazon S3, ORC/Parquet in Amazon S3 (columnar data format), Amazon EMR (shown three times), Amazon Redshift, Amazon QuickSight]

What about real-time data?

Stream Processing

• Real-time stream processing

• High throughput; elastic

• Highly available; data replicated across multiple Availability Zones with configurable retention

• S3, Amazon Redshift, DynamoDB integrations

• Amazon Kinesis Streams for custom streaming applications; Amazon Kinesis Firehose for easy integration with Amazon S3 and Amazon Redshift; Amazon Kinesis Analytics for streaming SQL

Amazon Kinesis
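For the "custom streaming applications" case, a producer typically writes records to a stream with the Kinesis `put_record` API. The sketch below assumes a client that behaves like `boto3.client("kinesis")` and is passed in by the caller; the event shape and the choice of `user_id` as the partition key are illustrative assumptions.

```python
import json


def send_click_event(kinesis_client, stream_name, event):
    """Publish one event to a Kinesis stream and return its sequence number.

    `event` is a dict with at least a "user_id" field (an assumed schema).
    """
    response = kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records sharing a partition key are routed to the same shard.
        PartitionKey=str(event["user_id"]),
    )
    return response["SequenceNumber"]
```

Partitioning by user keeps each user's events in order within a shard, at the cost of uneven load if a few users dominate traffic; a different key may suit other workloads better.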

Building a Big Data Application: Add a real-time layer with Amazon Kinesis + Spark on Amazon EMR

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: Amazon Kinesis Streams, Amazon EMR (shown three times), raw data in Amazon S3, staging data in Amazon S3, ORC/Parquet (columnar data format), Amazon Redshift, Amazon Athena, Amazon QuickSight]

Building a Big Data Application: React to real-time data with Amazon Kinesis Analytics and AWS Lambda

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon SNS, reference data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight]
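The "react" step can be sketched as a Lambda function that consumes records from a Kinesis stream and sends a notification when a value crosses a threshold. Lambda delivers Kinesis record payloads base64-encoded under `Records[].kinesis.data`; everything else here is an assumption for illustration: the payload is taken to be JSON with an `error_count` field, and `publish` is a hypothetical callable that in a deployed function would wrap an SNS publish call.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode the JSON payloads of a Lambda Kinesis event."""
    payloads = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads


def make_alert_handler(publish, threshold):
    """Build a handler(event, context) that alerts on high error counts.

    publish(message: str) is a hypothetical notification hook.
    """
    def handler(event, context):
        alerts = 0
        for payload in decode_kinesis_records(event):
            # "error_count" is an assumed field name in the stream's records.
            if payload.get("error_count", 0) > threshold:
                publish(json.dumps(payload))
                alerts += 1
        return {"alerts": alerts}
    return handler
```

Keeping the threshold check in the function is the simplest arrangement; when the aggregation itself is done upstream in streaming SQL, the function only needs to forward the records it receives.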

Building a Big Data Application: React intelligently in real-time with Amazon Machine Learning

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon Machine Learning, Amazon SNS, reference data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight]

What if you need encryption and network isolation to meet industry regulations?

Building a Big Data Application: Add encryption at rest with AWS KMS

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: AWS KMS, Amazon Kinesis Streams, Amazon EMR (shown twice), raw data in S3, staging data in S3, ORC/Parquet in Amazon S3 (columnar data), Amazon Redshift, Amazon QuickSight]
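For data stored in S3, encryption at rest with AWS KMS can be requested per object at write time. The sketch below builds the arguments for an S3 `put_object` call with SSE-KMS enabled, using the `ServerSideEncryption` and `SSEKMSKeyId` request parameters; the bucket, object key, and KMS key ID shown in the usage note are placeholders.

```python
def encrypted_put_kwargs(bucket, key, body, kms_key_id):
    """Build S3 put_object arguments that request SSE-KMS encryption at rest."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",   # encrypt with a KMS-managed key
        "SSEKMSKeyId": kms_key_id,           # which KMS key to use
    }
```

With boto3 installed and credentials configured, the call would be along the lines of `boto3.client("s3").put_object(**encrypted_put_kwargs("example-bucket", "raw/events.json", data, "example-key-id"))`. A bucket-level default or policy can enforce the same thing without relying on every writer to pass these parameters.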

Building a Big Data Application: Protect data in transit & add network isolation

[Diagram: web clients; mobile clients; DBMS (corporate data center); SSL/TLS connections; in the AWS Cloud: VPC subnet, AWS KMS, Amazon Kinesis Streams, raw data in S3, staging data in S3, ORC/Parquet in Amazon S3 (columnar data), Amazon Redshift, Amazon QuickSight]

Which customers are doing this?

Redfin

[Diagram: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights]

• Services: Amazon Kinesis, Amazon S3 Data Lake, Amazon EMR, Amazon Redshift, Amazon DynamoDB

• Diagram labels: Users, Properties, Agents, Real Time Data, User Profile, Recommendation, Hot Homes, Similar Homes, Agent Follow-up, Agent Scorecard, Marketing, A/B Testing, BI / Reporting

NORDSTROM

[Diagram: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Outcomes & insights]

• Services: Amazon Kinesis, AWS Lambda (shown twice), Amazon DynamoDB, Amazon Redshift, Amazon S3 Data Storage

• Diagram labels: Mobile Users, Desktop Users, Online Stylist, Analytics Tools

Outcomes & insights:

• Personalized recommendations within seconds (from 15-20 min)

• Scale the expertise of stylists to all shoppers

• Reduce costs by 2X order of magnitude

[Diagram: customer architecture; >5 PB, up to 75 billion events per day]

• Users; Data Providers

• Auto Scaling EC2: Analytics App (shown twice), Data Ingestion Services, Data Catalog & Lineage Services, Cluster Mgt & Workflow Services

• Amazon EMR: Normalization ETL Clusters, Optimization ETL Clusters, Batch Analytic Clusters, Ad Hoc Query Cluster, Query Cluster (shown twice)

• Data Marts (Amazon Redshift)

• Shared Data Services: Source of Truth (S3), Query Optimized (S3), Shared Metastore (RDS), Reference Data (RDS)

<YOUR COMPANY NAME HERE>

[Diagram: web clients; mobile clients; DBMS (corporate data center); in the AWS Cloud: Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon Machine Learning, Amazon SNS, reference data in Amazon S3, Amazon Redshift, Amazon Athena, Amazon QuickSight]

Thank you!

Remember to complete your evaluations!
