AWS January 2016 Webinar Series - Getting Started with Big Data on AWS

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Erik Swensson, Shree Kenghe & Erick Dame

January 26, 2016

Getting Started with Big Data: Analytic Options on AWS & Common Use Cases

Table of Contents

• Big Data Introduction for AWS

• Big Data Analytics Options on AWS
  • Usage Patterns & Anti-Patterns
  • Performance & Cost
  • Durability & Scalability
  • Interfaces

• Building Big Data Analytic Solutions – The AWS Approach

• Example Scenarios

Big Data on AWS

Immediate Availability. Deploy instantly. No hardware to procure, no infrastructure to maintain & scale

Trusted & Secure. Designed to meet the strictest requirements. Continuously audited, including certifications such as ISO 27001, FedRAMP, DoD CSM, and PCI DSS.

Broad & Deep Capabilities. Over 50 services and 100s of features to support virtually any big data application & workload

Hundreds of Partners & Solutions. Get help from a consulting partner or choose from hundreds of tools and applications across the entire data management stack.

Collect, Store, Process & Analyze, Visualize

• Real-time: Amazon Kinesis Firehose
• Object Storage: Amazon S3
• RDBMS: Amazon RDS
• NoSQL: DynamoDB
• Hadoop Ecosystem: Amazon EMR
• Real-time: AWS Lambda, Amazon Kinesis Analytics
• Data Warehousing: Amazon Redshift
• Machine Learning: Amazon Machine Learning
• Business Intelligence & Data Visualization: Amazon QuickSight
• Real-time: Amazon Kinesis Streams
• Elastic Search Analytics: Amazon Elasticsearch
• Data Import: Amazon Import/Export Snowball
• IoT: Amazon IoT

Broad, Tightly Integrated Capabilities

Petabyte scale

Massively parallel

Relational data warehouse

Fully managed, zero admin

As low as $1,000/TB/Year

A lot faster, a lot cheaper, a whole lot simpler

Amazon Redshift

Amazon Redshift
• Ideal Usage Patterns - Analyze
  • Sales data
  • Historical data
  • Gaming data
  • Social trends
  • Ad data
• Performance
  • Massively Parallel Processing
  • Columnar Storage
  • Data Compression
  • Zone Maps
  • Direct-attached Storage
• Cost model
  • No upfront costs or long term commitments
  • Free backup storage equivalent to 100% of provisioned storage

With columnar storage, you only read the data you need
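The columnar-storage point can be illustrated with a toy sketch. This is not how Redshift is implemented; the sample rows, the function names, and the "values read" counter are all assumptions made up for illustration. It only shows why summing one column touches less data in a column layout than in a row layout.

```python
# Toy comparison of a row-oriented and a column-oriented layout.
# The data and helper names are hypothetical, for illustration only.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(records, column):
    """Row layout: every field of every row is read to sum one column."""
    values_read = 0
    total = 0.0
    for record in records:
        values_read += len(record)  # the whole row comes off disk
        total += record[column]
    return total, values_read


def sum_column_store(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_reads = sum_row_store(rows, "amount")
col_total, col_reads = sum_column_store(columns, "amount")
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts return the same total; the column layout reads one value per row instead of every field.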

Amazon Redshift

• Scalability & Elasticity
  • Resize or scale - Number or type of nodes can be changed with a few clicks
• Durability and Availability
  • Replication
  • Backup
  • Automated recovery from failed drives & nodes
• Interfaces
  • JDBC/ODBC interface with BI/ETL tools
  • Amazon S3 or DynamoDB
• Anti-patterns
  • Small datasets
  • OLTP
  • Unstructured Data
  • Blob Data

[Diagram: Amazon Redshift architecture. SQL Clients/BI Tools connect over JDBC/ODBC to the Leader Node; Compute Nodes (128GB RAM, 16TB disk, 16 cores each) are linked by 10 GigE (HPC); Ingestion, Backup, and Restore go through Amazon S3.]

Ingest streaming data

Process data in real-time

Store terabytes of data per hour

Amazon Kinesis

Amazon Kinesis Streams
• Ideal Usage Patterns – Streaming data ingestion and processing
  • Real-time data analytics
  • Data feed intake and processing, e.g. logs
  • Real-time metrics and reporting
• Performance
  • Throughput capacity in terms of shards
• Cost model
  • No upfront costs or long term commitments
  • Pay as you go pricing
  • Hourly charge per shard
  • Charge for 1 million PUT transactions
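Since throughput and the hourly charge are both expressed in shards, a rough sizing calculation helps. The sketch below assumes the commonly documented per-shard limits (1 MB/s and 1,000 records/s for writes, 2 MB/s for reads); those constants, the function names, and the price argument are assumptions to verify against current documentation, not figures from these slides.

```python
import math

# Assumed per-shard limits; check the current Kinesis documentation.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000
SHARD_READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies every limit."""
    by_write_bytes = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_write_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / SHARD_READ_MB_PER_SEC)
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


def hourly_shard_cost(shards, price_per_shard_hour):
    """Hourly shard charge; the price is a caller-supplied assumption."""
    return shards * price_per_shard_hour


# 5 MB/s in, 3,000 records/s in, 12 MB/s out: reads are the bottleneck.
print(shards_needed(5, 3000, 12))
```

The binding constraint decides the shard count, so a stream with many consumers can need more shards than its write rate alone suggests.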

Amazon Kinesis Streams

• Scalability & Elasticity
  • Scale – increase number of shards
• Durability and Availability
  • Replication
  • Cursor preservation
• Interfaces
  • Input – data coming in
  • Output – data going out
  • Kinesis Firehose
• Anti-patterns
  • Small scale consistent throughput
  • Long term data storage and analytics
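Scaling by shard count works because each record's partition key decides which shard receives it. Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch below assumes the shards split the hash space evenly, which a real stream need not do; the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key, shard_count):
    """Pick a shard index, assuming evenly split hash key ranges."""
    range_size = HASH_SPACE // shard_count
    # Clamp so keys in the leftover tail of the space go to the last shard.
    return min(hash_key(partition_key) // range_size, shard_count - 1)


for key in ("sensor-1", "sensor-2", "sensor-3"):
    print(key, shard_for(key, 4))
```

A key that is identical for every record sends all traffic to one shard, so keys with many distinct values spread load better.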

Launch a cluster in minutes

Pay by the hour and save with spot

MapReduce, Apache Spark, Presto

Amazon EMR

Amazon EMR
• Ideal Usage Patterns
  • Log processing and analytics
  • Large ETL and data movement
  • Risk modeling and threat analytics
  • Ad targeting and click stream analytics
  • Genomics
  • Predictive analytics
  • Ad-hoc data mining and analytics
• Performance – driven by
  • Type of instance
  • Number of instances
• Cost model
  • Only pay for hours the cluster is up
  • EC2 instance and EMR price
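The cost model above reduces to simple arithmetic: instances times hours times the sum of the two hourly prices. The sketch below assumes partial hours are billed as full hours and uses made-up prices; both are assumptions for illustration, not quoted rates.

```python
import math


def emr_cluster_cost(instance_count, hours_up, ec2_price_per_hour,
                     emr_price_per_hour):
    """Estimate cluster cost for the time it is up.

    Assumption: a partial hour is billed as a full hour.
    """
    billed_hours = math.ceil(hours_up)
    hourly_rate = ec2_price_per_hour + emr_price_per_hour
    return instance_count * billed_hours * hourly_rate


# Hypothetical prices: 10 instances up for 2.5 hours.
print(emr_cluster_cost(10, 2.5, 0.25, 0.0625))
```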

Amazon EMR

• Scalability & Elasticity
  • Resize a running cluster
  • Add more core or task nodes
• Durability and Availability
  • Fault tolerant for slave node (HDFS)
  • Backup to S3 for resilience against master node failures
• Interfaces
  • Hive, Pig, Spark, HBase, Impala, Hunk, Presto, other popular tools
• Anti-patterns
  • Small data sets
  • ACID (Atomicity, Consistency, Isolation and Durability)
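For readers new to the MapReduce model that EMR runs, here is a single-process sketch of a log-processing job. It only mimics the map, shuffle, and reduce phases in plain Python; the log format (`<path> <status>`) and all function names are hypothetical, and a real job would run distributed across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (status_code, 1) for one log line in the assumed format."""
    fields = line.split()
    if len(fields) >= 2:  # skip malformed lines
        yield fields[1], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


log_lines = ["/index.html 200", "/missing 404", "/about 200", "malformed"]
status_counts = run_job(log_lines)
print(status_counts)
```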

Amazon EMR Cluster

Fully managed NoSQL database

Single-Digit Millisecond latency at scale

Supports document and key-value

Amazon DynamoDB

Amazon DynamoDB
• Ideal Usage Patterns
  • Mobile apps, gaming, digital ad serving, live voting, sensor networks, log ingestion
  • Access control for web-based content, e-commerce shopping carts
  • Web session management
• Performance
  • SSD
  • Provision throughput by table
• Scalability & Elasticity
  • No limit to the amount of data stored
  • Dial-up or dial-down the read and write capacity of a table
• Cost Model
  • Pay as you go
  • Provisioned throughput capacity (per hour)
  • Indexed data storage (per GB per month)
  • Data transfer in or out (per GB per month)

Provisioned read/write performance per table. Predictable high performance scaled via console or API
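Provisioned throughput is specified in capacity units, so sizing a table is a small calculation. The sketch assumes the usual definitions: one read unit covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one write unit covers one write per second of an item up to 1 KB. The function names are hypothetical.

```python
import math


def read_capacity_units(reads_per_sec, item_size_kb, strongly_consistent=True):
    """Read units needed, rounding item size up to 4 KB blocks."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = reads_per_sec * units_per_read
    if not strongly_consistent:
        total = math.ceil(total / 2)  # eventually consistent reads cost half
    return total


def write_capacity_units(writes_per_sec, item_size_kb):
    """Write units needed, rounding item size up to 1 KB blocks."""
    return writes_per_sec * math.ceil(item_size_kb)


print(read_capacity_units(80, 3))
print(read_capacity_units(80, 6, strongly_consistent=False))
print(write_capacity_units(100, 1.5))
```

Item size is rounded up per operation, so many small items can be cheaper to read than a few large ones.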

Amazon DynamoDB

• Durability and Availability
  • Three Availability Zones (AZ)
• Interfaces
  • AWS Management Console
  • APIs
  • SDKs
• Anti-patterns
  • Application tied to traditional relational database
  • Joins and/or complex transactions
  • BLOB data
  • Large data with low I/O rate

[Diagram: three Availability Zones, AZ-A, AZ-B, AZ-C]

Managed service designed to make it easy for developers of all levels to use machine learning

Based on the same ML technology used for years by Amazon’s internal data scientists

Amazon Machine Learning uses scalable and robust implementations of industry-standard ML algorithms

Amazon Machine Learning

Amazon Machine Learning

• Ideal Usage Patterns
  • Enable applications that flag suspicious transactions
  • Personalize application content
  • Predict user activity
  • Listen to social media
• Cost Model
  • Pay for what you use
  • No need to manage instances, only pay for the service
• Performance
  • Real-time predictions designed to return within 100ms
  • 200 transactions can be handled per second by default (can be raised)

Amazon Machine Learning
• Durability and Availability
  • No maintenance windows or scheduled downtimes
  • Designed across multiple availability zones
• Scalability & Elasticity
  • Model training up to 100GB
  • Multiple ML jobs can run simultaneously
• Interfaces
  • Create a data source from S3, RDS and Redshift
  • Interact with ML via console, SDKs, and the ML API
• Anti-patterns
  • Massive Data Sets for modeling > 100GB
  • Sequence prediction or unsupervised clustering task
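The "flag suspicious transactions" use case usually comes down to comparing a binary model's score against a cut-off. The sketch below makes no service call: the scores are made-up sample data standing in for model output, and the function name and 0.5 cut-off are assumptions for illustration.

```python
def flag_suspicious(scored_transactions, cutoff=0.5):
    """Return the ids whose score is strictly above the cut-off."""
    return [tx_id for tx_id, score in scored_transactions if score > cutoff]


# Hypothetical (transaction id, score) pairs standing in for predictions.
scores = [("tx-1", 0.12), ("tx-2", 0.91), ("tx-3", 0.50), ("tx-4", 0.77)]
flagged = flag_suspicious(scores, cutoff=0.5)
print(flagged)
```

Raising the cut-off flags fewer transactions and lowers false alarms; lowering it catches more at the cost of more manual review.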

Event driven, fully managed compute

No Infrastructure to Manage

Automatic Scaling

AWS Lambda

AWS Lambda
• Ideal Usage Patterns
  • Real-time file processing
  • Extract, Transform, Load
• Performance
  • Process events within milliseconds
• Cost Model
  • Pay for what you use
  • No need to manage instances, only pay for the service
  • Lambda free tier includes 1M free requests

1. Serverless  2. Event-Driven Scale  3. Subsecond Billing
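Real-time file processing typically means a function triggered by an S3 event. The sketch below shows the shape of such a handler in Python. The handler name, the bucket, and the key are hypothetical, and the sample event is trimmed to only the fields the code reads; actual processing is left out.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the S3 objects named in an S3 notification event."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded, with "+" standing for a space.
        key = unquote_plus(record["s3"]["object"]["key"])
        processed.append("s3://{}/{}".format(bucket, key))
    return processed


# Trimmed, hypothetical event for a local dry run.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2016-01-26/app+server.log"}}}
    ]
}
result = handler(sample_event, None)
print(result)
```

Keeping the handler a plain function of `event` makes it easy to exercise locally before wiring up a trigger.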

AWS Lambda
• Durability and Availability
  • No maintenance windows or scheduled downtime
  • Async functions are retried 3 times if there is a failure
• Scalability & Elasticity
  • Any number of concurrent functions can be run
  • AWS Lambda will dynamically allocate capacity to match the rate of incoming events
• Interfaces
  • Lambda supports Java, Node.js, and Python
  • Trigger via event or schedule
• Anti-patterns
  • Long running applications
  • Stateful applications in Lambda

Set up an Elasticsearch cluster in minutes

Integrated with Logstash and Kibana

Scale Elasticsearch cluster seamlessly

Amazon Elasticsearch Service

Amazon Elasticsearch
• Ideal Usage Patterns
  • Analyze logs
  • Analyze data stream updates from other AWS services
  • Provide customers a rich search and navigation experience
  • Usage monitoring for mobile applications
• Performance
  • Depends on multiple factors including instance type, workload, index, number of shards used, read replicas
  • Storage configurations – instance storage or EBS storage
• Cost Model
  • Pay as you go
  • Only pay for compute and storage

Amazon Elasticsearch
• Durability and Availability
  • Zone Awareness
  • Automatic and Manual snapshots
• Scalability & Elasticity
  • Add or remove instances
  • Modify EBS volumes for data growth
• Interfaces
  • AWS Management Console
  • APIs
  • SDKs
  • Kibana and Logstash (ELK Stack)
• Anti-patterns
  • OLTP
  • Workloads needing more than 5TB of storage

Elasticsearch + Logstash + Kibana = real-time analytics & visualization

Build visualizations

Perform ad-hoc analysis

Share and collaborate via storyboards

Native access on major mobile platforms

Amazon QuickSight

Introducing Amazon QuickSight

Cloud-Powered Business Intelligence Service For 1/10th the Cost of Traditional BI Software

No IT effort. No dimensional modeling

Auto-discovery of all AWS data sources

Super-fast, Parallel, In-memory Calculation Engine (SPICE)

Fully Managed

Available in Preview: aws.amazon.com/quicksight

Scale up or down as needed

Pay for what you use

Multiple options

Do-it-yourself big data applications

Amazon EC2

The AWS Approach

• Flexible: Use the best tool for the job
  • Data structure, latency, throughput, access patterns
• Scalable: Immutable (append-only)
  • Batch/speed/serving layer
• Minimum Admin Overhead: Leverage AWS managed services
  • No or very low admin
• Low Cost: Big data ≠ big cost
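The "immutable (append-only)" and "batch/speed/serving layer" points can be sketched in a few lines. This is a single-process toy, not an AWS architecture: the class, its methods, and the sensor data are all hypothetical. It shows an append-only log, a batch view recomputed from the full log, a speed view over events newer than the last batch run, and a serving query that merges the two.

```python
class EventStore:
    """Toy batch/speed/serving split over an append-only event log."""

    def __init__(self):
        self._log = []         # master dataset: only ever appended to
        self._batch_view = {}  # totals as of the last batch run
        self._batch_upto = 0   # number of events the batch view covers

    def append(self, sensor_id, value):
        self._log.append((sensor_id, value))

    def run_batch(self):
        """Batch layer: recompute the view from the full log."""
        view = {}
        for sensor_id, value in self._log:
            view[sensor_id] = view.get(sensor_id, 0) + value
        self._batch_view = view
        self._batch_upto = len(self._log)

    def _speed_view(self):
        """Speed layer: cover only events newer than the last batch run."""
        view = {}
        for sensor_id, value in self._log[self._batch_upto:]:
            view[sensor_id] = view.get(sensor_id, 0) + value
        return view

    def query(self, sensor_id):
        """Serving layer: merge the batch and speed views."""
        recent = self._speed_view().get(sensor_id, 0)
        return self._batch_view.get(sensor_id, 0) + recent


store = EventStore()
store.append("s1", 10)
store.append("s2", 5)
store.run_batch()
store.append("s1", 7)  # arrives after the batch run
print(store.query("s1"))
```

Because events are never updated in place, the batch view can always be rebuilt from scratch, which is what makes the design tolerant of bugs in view logic.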

Big Data Scenarios

Scenario 1: Enterprise Data Warehouse

Scenario 2: Capturing and Analyzing Sensor Data

Scenario 3: Sentiment Analysis of Social Media

Scenario 1: Enterprise Data Warehouse

Data Warehouse Architecture

[Diagram: Data Sources, Amazon S3, Amazon EMR, Amazon S3, Amazon Redshift, Amazon QuickSight]

Scenario 2: Capturing and Analyzing Sensor Data

[Diagram, steps 1-9: Data Sources, Amazon Kinesis, two Amazon Kinesis-enabled apps, Amazon S3, Amazon Redshift, Amazon QuickSight, Amazon DynamoDB, Reporting Dashboard, Customer Access]

Scenario 3: Sentiment Analysis of Social Media

[Diagram, steps 1-7: Social Media Data, Amazon EC2, AWS Lambda, Amazon ML, Amazon Kinesis, Amazon S3, Amazon SNS]

Next Steps
• Subscribe to the AWS Big Data Blog: blogs.aws.amazon.com/bigdata
• Learn more, check the tutorials, guides, and self-paced labs: aws.amazon.com/big-data
• Register for the next Big Data Webinar: Building Smart Applications with Amazon Machine Learning, Thu, Jan 28 2016 | 9AM PST, aws.amazon.com/about-aws/events/monthlywebinarseries
