Intro to Big Data on AWS

Igor Roiter – Big Data Cloud Solution Architect

Post on 22-Sep-2019

Igor Roiter – Big Data Cloud Solution Architect

• Working as a Data Specialist for the last 11 years

• 9 of them as a consultant specializing in the ad-tech, fin-tech and gaming industries

• Certified NoSQL & Big Data Cloud Engineer

• IoT enthusiast in my spare time

Building a Big Data Application on AWS

[Diagram: AWS services across the Ingest, Store and Analyze stages: Amazon Kinesis, AWS IoT, AWS Snowball, AWS Database Migration Service, AWS Data Pipeline, Amazon S3, Amazon Glacier, DynamoDB, Amazon Redshift, Amazon EMR, Amazon Athena, EC2, Amazon Elasticsearch Service, Lambda, Amazon QuickSight]

Building a Big Data Application on AWS

How to get corporate data into the cloud

[Diagram: Corporate DBMS, application hosts, AWS Cloud]

Building a Big Data Application on AWS

Getting data into AWS Cloud

[Diagram: Corporate DBMS, application hosts, AWS Database Migration Service, AWS Cloud]

AWS Database Migration Service

• Simple to use

• No data loss, no downtime

• Low cost – only pay for compute resources
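To make "simple usage" a little more concrete, here is a minimal sketch of the table-mapping document a DMS replication task uses to decide which tables to migrate. The schema name `sales` and the helper name are hypothetical; the resulting JSON string is the kind of value passed as `TableMappings` when creating a replication task (for example through boto3's `create_replication_task`), which is not called here.

```python
import json


def build_table_mappings(schema: str, table: str = "%") -> str:
    """Return a DMS table-mapping document that includes the matching tables.

    "%" is the DMS wildcard, so the default selects every table in the schema.
    """
    rules = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-" + schema,
                "object-locator": {"schema-name": schema, "table-name": table},
                "rule-action": "include",
            }
        ]
    }
    return json.dumps(rules)


# "sales" is a made-up schema name used only for illustration.
mappings = build_table_mappings("sales")
print(mappings)
```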

Building a Big Data Application on AWS

Where do we migrate the data to?

[Diagram: Corporate DBMS, application hosts, AWS Database Migration Service, AWS Cloud]

Building a Big Data Application on AWS

Data Warehouse

[Diagram: Corporate DBMS, application hosts, AWS Database Migration Service, Amazon Redshift, AWS Cloud]

Amazon Redshift – Structured Data Processing

• Fully Managed columnar data warehouse

• Standard SQL support

• Petabyte-scale

• Fault-tolerant
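As a rough illustration of the standard SQL interface, the sketch below builds a Redshift `COPY` statement that bulk-loads CSV files from S3. The table name, bucket path and IAM role ARN are placeholders, and the statement is only assembled as text; running it would need a SQL client connected to a cluster.

```python
def build_copy_statement(table: str, s3_path: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads CSV files from S3.

    Plain string formatting is used for brevity; do not feed it untrusted input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# All three arguments are hypothetical values.
statement = build_copy_statement(
    "events",
    "s3://example-bucket/structured/events/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```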

Building a Big Data Application on AWS

How about some BI?

[Diagram: Corporate DBMS, application hosts, AWS Database Migration Service, Amazon Redshift, AWS Cloud]

Building a Big Data Application on AWS

BI tool

[Diagram: Corporate DBMS, application hosts, AWS Database Migration Service, Amazon Redshift, Amazon QuickSight, AWS Cloud]

Amazon QuickSight

• Managed BI tool

• Scales to hundreds of users

• Auto-suggests the best visualizations for your data

• 1/10th the cost of other popular BI software

What About Unstructured Data?

• What if your data is unstructured?

• What if you don’t need all the raw data?

• What if you need to combine multiple data sets?

Building a Big Data Application on AWS

Handling unstructured data

[Diagram: Corporate DBMS, application hosts, AWS Database Migration Service, Amazon Redshift, Amazon QuickSight, AWS Cloud]

Building a Big Data Application on AWS

AWS Lambda – Event driven data transformation

[Diagram: Application hosts, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Lambda trigger, AWS Lambda, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight]

Amazon Simple Storage Service (S3)

• Managed object storage

• No administration

• No capacity limit

• Data resilience

• $0.02 per GB/month
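A quick back-of-the-envelope sketch of what the quoted storage price means in practice. It uses only the $0.02 per GB/month figure from the slide and ignores request, transfer and storage-class differences, so treat it as an approximation.

```python
PRICE_PER_GB_MONTH = 0.02  # USD, the figure quoted on the slide


def monthly_storage_cost(size_gb: float, price_per_gb: float = PRICE_PER_GB_MONTH) -> float:
    """Estimate the monthly S3 storage bill for a given data size in GB."""
    return size_gb * price_per_gb


for size_gb in (100, 1_024, 10_240):
    print(f"{size_gb:>6} GB -> ${monthly_storage_cost(size_gb):,.2f} per month")
```

At this rate, 1 TB (1,024 GB) of raw data costs roughly $20 per month to keep.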

AWS Lambda – Serverless Event Processing

• Function as a service

• Write code in Node.js, Python or Java

• Event driven

• Low cost
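The event-driven flow from the earlier diagram (raw object lands in S3, Lambda is triggered, structured data comes out) could look roughly like the sketch below. The log-line format and the `parse_log_line` helper are hypothetical; the event layout follows the S3 notification structure. Reading and writing the actual objects is left out so the example runs without AWS access.

```python
from urllib.parse import unquote_plus


def parse_log_line(line: str) -> dict:
    """Turn a hypothetical 'timestamp level message' line into a structured record."""
    timestamp, level, message = line.split(" ", 2)
    return {"timestamp": timestamp, "level": level, "message": message}


def handler(event, context):
    """Entry point that Lambda invokes with an S3 object-created notification."""
    objects = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append({"bucket": bucket, "key": key})
    # A real function would now read each object (for example with boto3),
    # apply parse_log_line to every line, and write the result back to S3.
    return objects
```

Keeping the parsing logic in its own function makes it easy to test locally before wiring up the S3 trigger.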

Need More Than a Lambda?

What can we use when further throughput is required?

Building a Big Data Application on AWS

Need more throughput…

[Diagram: Application hosts, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Lambda trigger, AWS Lambda, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight]

Building a Big Data Application on AWS

Unstructured data transformation using EMR

[Diagram: Application hosts, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Amazon EMR, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight]

Amazon EMR – Unstructured Data Processing

• Fully managed, cloud pre-tuned Hadoop ecosystem

• Hadoop 2.7.3, Hive 2.3.1, Spark 2.2.0, HBase 1.3.1 + HBase on S3, Presto, Flink…

• On-demand and Spot Instances

• Fully integrated with S3

• Provision a cluster for a job, then terminate it
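The "provision for a job, then terminate" pattern can be sketched as a request for a transient cluster. The dictionary below follows the shape of boto3's EMR `run_job_flow` parameters; the cluster name, instance types, script path and release label are all assumptions, and nothing is sent to AWS here.

```python
def transient_cluster_request(job_script_s3: str, release_label: str) -> dict:
    """Describe a cluster that runs one Spark step and then shuts itself down."""
    return {
        "Name": "nightly-transform",  # hypothetical cluster name
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m4.large", "InstanceCount": 1},
                # Core nodes on Spot, bidding up to the On-Demand price.
                {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
                 "BidPrice": "OnDemandPrice",
                 "InstanceType": "m4.large", "InstanceCount": 2},
            ],
            # False means the cluster terminates once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "transform",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", job_script_s3],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# The release label and script location are assumed values, not taken from the slides.
request = transient_cluster_request("s3://example-bucket/jobs/transform.py", "emr-5.10.0")
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With credentials in place, such a request would typically be submitted with `boto3.client("emr").run_job_flow(**request)`.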

Semi-structured / Unstructured data

What about ad-hoc queries for exploring new data?

Building a Big Data Application on AWS

Ad-hoc query on Raw data

[Diagram: Application hosts, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Amazon EMR, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight]

Building a Big Data Application on AWS

Ad-hoc query service on S3 buckets

[Diagram: Application hosts, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Amazon EMR, structured data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight]

Amazon Athena – Serverless Query Processing

• Serverless query service for querying data in S3

• No data load/ETL, data is queried on S3

• Pay-per-query – based on scanned data amount

• Standard SQL
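Because billing is based on the amount of data scanned, a small estimator helps reason about query cost. The default rate of 5 USD per TB scanned is an assumption (it is not stated on the slide and can vary by region), and per-query minimums are ignored.

```python
def athena_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of one query from the bytes it scanned.

    usd_per_tb is an assumed rate; check current pricing for your region.
    """
    tb_scanned = bytes_scanned / 1024**4
    return tb_scanned * usd_per_tb


one_tb = 1024**4
print(athena_query_cost(one_tb))        # full scan of a 1 TB data set
print(athena_query_cost(one_tb // 10))  # same query touching a tenth of the data
```

The second call is the motivation for the next step in the architecture: the less data a query has to read, the less it costs.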

Building a Big Data Application on AWS

Pre-process data to columnar data format

[Diagram: Application hosts, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Amazon EMR, structured data in Amazon S3, Amazon EMR, Parquet (columnar) data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight]
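Why a columnar format helps can be shown with a toy example: once records are pivoted into one list per column, an aggregate over a single field no longer has to read the others. This is only an in-memory illustration with made-up records; real Parquet files add encoding, compression and row groups on top of the same idea.

```python
rows = [
    {"user_id": 1, "country": "IL", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 25.5},
    {"user_id": 3, "country": "IL", "amount": 7.25},
]


def to_columnar(records: list) -> dict:
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)
# A query such as SUM(amount) only has to touch one column.
total_amount = sum(columns["amount"])
print(total_amount)
```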

All Batch - No Fun

What about some real-time processing?

Building a Big Data Application on AWS

No more batch

[Diagram: Application hosts, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Amazon EMR, structured data in Amazon S3, Amazon EMR, Parquet (columnar) data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight]

Building a Big Data Application on AWS

Add a real-time layer – Kinesis Streams

[Diagram: Application hosts, KCL, Kinesis Streams, Corporate DBMS, AWS Database Migration Service, unstructured raw data in S3, Amazon EMR, structured data in Amazon S3, Amazon EMR, Parquet (columnar) data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight]

Amazon Kinesis Streams

• Managed Real-time stream processing

• Dynamically adjust throughput of the stream

• Data resilience

• Produce data into the stream using the Kinesis Producer Library (KPL), read it with the Kinesis Client Library (KCL)
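Stream throughput scales with the number of shards, and each record is routed to a shard by the MD5 hash of its partition key. The sketch below mimics that routing under two assumptions: the shards split the 128-bit hash space into equal ranges, and the digest is read as a big-endian integer. The device names are made up.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 yields a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming equal-sized hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


for key in ("device-1", "device-2", "device-3"):
    print(key, "-> shard", shard_for_key(key, shard_count=4))
```

Records sharing a partition key always land on the same shard, which is what preserves per-key ordering for consumers.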

Amazon Kinesis Firehose

• Load streaming data into AWS data stores: S3, Redshift

• Fully managed, auto-scaled

• Integrated with Lambda
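The Lambda integration lets a function transform records in flight before delivery. A minimal sketch of such a transformation handler follows: each incoming record carries a `recordId` and base64-encoded `data`, and the function returns one result per record. The JSON payload and the `processed` flag are hypothetical.

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation entry point: one output per input record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # hypothetical enrichment step
        # Newline-delimit the records so the delivered file is easy to parse.
        data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": data.decode("utf-8"),
        })
    return {"records": output}
```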

Summary – Big Data on AWS

• Pay only for what you use

• Avoid over- and under-provisioning

• Dynamic scaling for everything

• Go to production faster than ever

re:Invent 2017 – So what's new?

S3/Glacier Select Preview

• Pulls out only the data you need from an object

• Offloads filter processing from application to S3 service

• S3/Glacier Select SDK supports Java & Python
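To illustrate the filter offloading, the sketch below assembles the parameters for an S3 Select call over a CSV object with a header row. The shape follows boto3's `select_object_content`; the bucket, key and column names are made up, and no request is sent.

```python
def select_request(bucket: str, key: str, sql: str) -> dict:
    """Build the parameters for an S3 Select query over a CSV object with headers."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # "USE" lets the query refer to columns by their header names.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical object and columns: only matching rows and one column come back.
params = select_request(
    "example-bucket",
    "raw/purchases.csv",
    "SELECT s.user_id FROM S3Object s WHERE s.country = 'IL'",
)
print(params["Expression"])
```

The application receives only the selected column for the matching rows instead of downloading and filtering the whole object.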

Amazon Kinesis Video Streams

• Fully managed video ingestion and storage service

• Secure SDKs for devices to stream video to AWS

• APIs to access and retrieve indexed video fragments based on tags and timestamps

AWS Lambda updates…

• Added the ability to shift traffic between two AWS Lambda versions based on pre-assigned weights

• Doubled the maximum memory for a function from 1536 MB to 3008 MB

• Added the ability to set a concurrency limit on a Lambda function

• The AWS Lambda console has been updated with enhancements: Cloud9-based editor, improved monitoring, and more…
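Weighted traffic shifting can be pictured with a small simulation: an alias points at a primary version and sends a configured fraction of invocations to a second one. The function below is a local model of that behaviour, not the service's implementation; in the real API the weights are set on the alias routing configuration (`AdditionalVersionWeights`). Version names and the 10% weight are examples.

```python
import random


def pick_version(primary: str, additional_weights: dict, rng: random.Random) -> str:
    """Choose a version the way a weighted alias would, given extra weights."""
    roll = rng.random()
    cumulative = 0.0
    for version, weight in additional_weights.items():
        cumulative += weight
        if roll < cumulative:
            return version
    return primary


# Send roughly 10% of invocations to version "2", the rest to version "1".
rng = random.Random(42)
picks = [pick_version("1", {"2": 0.1}, rng) for _ in range(10_000)]
share_v2 = picks.count("2") / len(picks)
print(f"share of traffic on version 2: {share_v2:.3f}")
```

Raising the weight step by step gives a simple canary rollout for a new function version.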

AWS IoT Analytics Preview

• Built-in IoT Analytics SQL query engine

• Stores the processed device data in a time-series data store

• Scales automatically to support up to petabytes of IoT data

• Apply machine learning to your IoT data with hosted Jupyter notebooks, right from the IoT console

Get in touch at [email protected]

Thank You!