Post on 22-Feb-2020

Page 1

© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Building Your Data Lake on AWS

Luke Anderson

Business Development, AWS

Page 2

What to expect from the session

1. Defining the Data Lake

2. Reducing Costs

3. Increasing Performance

4. Planning for the Future

Page 3

Rethink how to become a data-driven business

• Business outcomes

• Experimentation

• Agile and timely

Page 4

Traditionally, Analytics looked like this (Duplication & Sprawl)

[Diagram labels: Raw Data, Structured Data, ETL, Storage Arrays, Databases, Data Warehouse, SQL, Hadoop, Spark, NoSQL, Advanced Analytics]

Page 5

Defining the AWS data lake

A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets.

Key data lake attributes

• Decoupled storage and compute

• Rapid ingest and transformation

• Secure multi-tenancy

• Query in place

• Schema on read
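
"Schema on read" can be illustrated with a minimal Python sketch: the raw object is stored untouched, and a schema is applied only when a consumer reads it. The sample records, the column names, and the `read_with_schema` helper are hypothetical and are not part of the original deck.

```python
import csv
import io

# Raw data lands in the lake exactly as produced; nothing is enforced on write.
RAW_OBJECT = "10.0.0.1,2018-03-01T12:00:00,200\n10.0.0.2,2018-03-01T12:00:05,404\n"

# Hypothetical schema chosen by one consumer at read time: (column name, parser).
ACCESS_LOG_SCHEMA = [("src_ip", str), ("timestamp", str), ("status", int)]


def read_with_schema(raw_text, schema):
    """Apply a schema to raw delimited text while reading it."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append({name: parse(value) for (name, parse), value in zip(schema, fields)})
    return rows


rows = read_with_schema(RAW_OBJECT, ACCESS_LOG_SCHEMA)
errors = [r for r in rows if r["status"] >= 400]
print(errors)
```

A different consumer could read the same raw object with a different schema, which is the point of deferring the schema to read time.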

Page 6

AWS Data Lake Components

Any analytic workload, any scale, at the lowest possible cost

• Insights: QuickSight, SageMaker, Comprehend

• Analytics: Redshift + Spectrum (DW), EMR (big data processing), Athena (interactive), Elasticsearch Service and Kinesis Data Analytics (real-time)

• Data Lake: S3/Glacier (storage), Glue (ETL & Data Catalog)

• Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams

Page 7

Reasons to choose Amazon S3 for data lake

• Unmatched durability, availability, and scalability

• Best security, compliance, and audit capability

• Object-level control at any scale

• Business insight into your data

• Twice as many partner integrations

• Most ways to bring data in

Page 8

Reducing Data Lake Costs

Page 9

Optimize costs with data tiering

Tiers, hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Infrequent Access, Amazon Glacier

• Use EMR/Hadoop with local HDFS for hottest data sets

• Store cooler data in S3 and cold in Glacier to reduce costs

• Use S3 Analytics to optimize tiering strategy
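
One way to implement the tiering above is an S3 lifecycle configuration. The sketch below builds such a configuration and, in a function that is not executed here, applies it with boto3's `put_bucket_lifecycle_configuration`. The prefix, the 30- and 90-day thresholds, and both helper names are assumptions for illustration; real thresholds would come from access patterns such as those reported by S3 Analytics.

```python
def build_tiering_rules(prefix, ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration that moves objects to colder tiers."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_tiering(bucket, prefix):
    """Apply the rules with boto3 (needs AWS credentials; not run in this sketch)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=build_tiering_rules(prefix)
    )


config = build_tiering_rules("logs/")
print(config["Rules"][0]["Transitions"])
```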

Page 10

Process data in place…

[Diagram: Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue on top of Amazon S3]

Page 11

Amazon EMR: Decouple compute & storage

• Highly distributed processing frameworks such as Hadoop/Spark

• Compress datasets

• Columnar file formats

• Aggregate small files

• S3DistCp “group-by” clause
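
The idea behind grouping small files can be sketched in Python: keys whose regex capture groups match are assigned to the same output group. This simulates the grouping concept only; it does not run S3DistCp, and the key names, the regex, and the `group_keys` helper are hypothetical. On an EMR cluster the corresponding step would, as far as I know, pass a similar regex through the `--groupBy` option of `s3-dist-cp`, together with `--src` and `--dest`.

```python
import re
from collections import defaultdict


def group_keys(keys, group_by):
    """Group object keys by the concatenated capture groups of a regex."""
    pattern = re.compile(group_by)
    groups = defaultdict(list)
    for key in keys:
        match = pattern.match(key)
        if match:  # keys that do not match are left out of the result
            groups["".join(match.groups())].append(key)
    return dict(groups)


keys = [
    "logs/2018-03-01-00.log",
    "logs/2018-03-01-01.log",
    "logs/2018-03-02-00.log",
]
# One output group per day: the hourly files of a day end up together.
grouped = group_keys(keys, r".*/(\d{4}-\d{2}-\d{2})-\d{2}\.log")
print(grouped)
```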

Page 12

Amazon Redshift Spectrum: Exabyte-scale query-in-place

• Structured data w/ joins

• Multiple on-demand clusters scale concurrency

• Columnar file formats

• Data partitioning

• Better query performance with predicate pushdown
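
Why partitioning helps can be shown with a small sketch: when partition values are encoded in the object keys (Hive-style `name=value` path segments), a filter on those values lets an engine skip whole objects before reading any data. This illustrates the general idea, not how Redshift Spectrum is implemented; the key layout and both helpers are assumptions.

```python
def parse_partitions(key):
    """Extract Hive-style partition values (name=value path segments) from a key."""
    parts = {}
    for segment in key.split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys, **wanted):
    """Keep only keys whose partition values match every requested value."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(name) == value for name, value in wanted.items())
    ]


keys = [
    "sales/year=2017/month=12/part-0000.parquet",
    "sales/year=2018/month=01/part-0000.parquet",
    "sales/year=2018/month=02/part-0000.parquet",
]
to_scan = prune(keys, year="2018", month="01")
print(to_scan)
```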

Page 13

Amazon Athena: Query without ETL

Serverless service

Schema on read

Compress datasets

Columnar file formats

Optimize file sizes

Optimize querying (Presto backend)

Query data in Glacier (coming)

Page 14

Today: All of these tools…

retrieve a lot of data they don’t need and do the heavy lifting

Page 15

Today: You need to…

restore the entire object from Amazon Glacier to Amazon S3 and then use it.

[Diagram: Amazon Glacier, Amazon S3]

Page 16

Amazon S3 Select and Amazon Glacier Select

Select a subset of data from an object based on a SQL expression

Page 17

Motivation Behind S3 Select

GET all the data from S3 objects, and my application will filter the data that I need

Redshift Spectrum Example:

• Beta customer ran 50,000 queries

• Amount of data fetched from S3: 6 PB

• Amount of data used in Redshift: 650 TB

Data needed from S3: 10%
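
The roughly 10% figure follows directly from the two volumes on the slide. A quick check, assuming decimal units (1 PB = 1000 TB):

```python
fetched_tb = 6 * 1000  # 6 PB fetched from S3, assuming 1 PB = 1000 TB
used_tb = 650          # 650 TB actually used in Redshift

fraction_used = used_tb / fetched_tb
print(f"{fraction_used:.1%} of the fetched data was needed")  # prints 10.8%
```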

Page 18

Amazon S3 Select

SELECT a filtered set of data from within an object using standard SQL Statements

• First content-aware API within Amazon S3

• Unlike Amazon Athena and Spectrum, operates within the Amazon S3 system

• SQL statement operates on a per-object basis, not across a group of objects

• Works and scales like GET requests

• Accessible via SDK (Java, Python), AWS CLI and Presto Connector, with others to follow

• Who will use it?

    • Amazon Redshift Spectrum, Amazon Athena, Presto and other custom query engines

    • Everyone doing log mining
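
A minimal sketch of the per-object call through the Python SDK, using boto3's `select_object_content`. The response payload is an event stream, and a single record may be split across two `Records` events, so the sketch concatenates the chunks before splitting into lines. The helper names, the gzip-compressed headerless CSV input, and the hand-made event stream used for the offline check are assumptions.

```python
def collect_records(event_stream):
    """Concatenate the payload bytes of all Records events in the stream."""
    return b"".join(
        event["Records"]["Payload"] for event in event_stream if "Records" in event
    )


def select_rows(bucket, key, expression):
    """Run one S3 Select query against one gzip-compressed CSV object."""
    import boto3  # lazy import; calling this function needs AWS credentials

    response = boto3.client("s3").select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "GZIP"},
        OutputSerialization={"CSV": {}},
    )
    return collect_records(response["Payload"]).decode("utf-8").splitlines()


# Offline check of the event handling with a hand-made event stream.
fake_stream = [
    {"Records": {"Payload": b"10.0.0.1,2"}},
    {"Records": {"Payload": b"00\n10.0.0.2,404\n"}},
    {"Stats": {"Details": {"BytesScanned": 64, "BytesProcessed": 64, "BytesReturned": 29}}},
    {"End": {}},
]
lines = collect_records(fake_stream).decode("utf-8").splitlines()
print(lines)
```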

Page 19

Amazon S3 Select

Input

Format: delimited text (CSV, TSV), JSON …

Compression: GZIP …

Output

Format: delimited text (CSV, TSV), JSON …

Clauses   Data types                Operators           Functions
Select    String                    Conditional         String
From      Integer, Float, Decimal   Math                Cast
Where     Timestamp                 Logical             Math
          Boolean                   String (Like, ||)   Aggregate

Page 20

Amazon S3 Select: Simple pattern matches

…get-object …object… | awk -F',' '{ if ($4 == "x") print $1 }'

...select-object …object… "SELECT o._1 FROM S3Object o WHERE o._4 = 'x' …"
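
Both commands express the same filter: return column 1 of every row whose column 4 equals "x". The local Python equivalent below makes the column numbering explicit; awk's `$1`/`$4` and S3 Select's `_1`/`_4` are 1-based, while Python list indexes are 0-based. The sample data and the helper name are hypothetical.

```python
import csv
import io


def select_first_where_fourth(csv_text, wanted):
    """Return column 1 of every row whose column 4 equals `wanted`."""
    return [
        row[0]
        for row in csv.reader(io.StringIO(csv_text))
        if len(row) >= 4 and row[3] == wanted  # row[3] is column 4
    ]


sample = "a,1,foo,x\nb,2,bar,y\nc,3,baz,x\n"
matches = select_first_where_fourth(sample, "x")
print(matches)
```

With S3 Select the filtering happens inside S3, so only the matching values travel over the network instead of the whole object.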

Page 21

Amazon S3 Select: Serverless applications

[Diagram labels: Amazon S3, Lambda Trigger, AWS Lambda, S3 Select, Amazon SNS]
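
A hedged sketch of how the pieces in this diagram could fit together: an S3 event triggers a Lambda function, which runs an S3 Select query on the new object and publishes any matches to SNS. The topic ARN, the filter expression, and the function names are placeholders, not details from the deck. Only the event parsing is exercised here; the handler itself needs to run inside AWS.

```python
from urllib.parse import unquote_plus

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-topic"  # placeholder


def objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handler(event, context):
    """Hypothetical Lambda entry point: query each new object, publish matches."""
    import boto3  # lazy import; only needed when running inside AWS

    s3 = boto3.client("s3")
    sns = boto3.client("sns")
    for bucket, key in objects_from_event(event):
        response = s3.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType="SQL",
            Expression="SELECT * FROM S3Object s WHERE s._4 = 'x'",  # assumed filter
            InputSerialization={"CSV": {}},
            OutputSerialization={"CSV": {}},
        )
        body = b"".join(
            e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
        )
        if body:
            sns.publish(TopicArn=TOPIC_ARN, Message=body.decode("utf-8"))


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "logs/2018/app+log.csv"}}}
    ]
}
parsed = list(objects_from_event(sample_event))
print(parsed)
```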

Page 22

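Amazon S3 Select: Serverless MapReduce

Before

200 seconds and 11.2 cents

# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    # read() returns bytes; decode before splitting into lines
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            ….

After

95 seconds and 2.8 cents

# Select IP Address and Keys
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key,
        ExpressionType='SQL',
        Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj",
        InputSerialization={'CSV': {}},
        OutputSerialization={'CSV': {}})
    # The payload is an event stream; join the record chunks before splitting
    contents = b''.join(event['Records']['Payload']
                        for event in response['Payload'] if 'Records' in event)
    for line in contents.decode('utf-8').splitlines():
        line_count += 1
        try:
            ….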

2X Faster at 1/5 of the cost

Page 23

Demo – S3 Select Timing

Page 24

Amazon S3 Select with Presto

Works with your existing Hive Metastore

Automatically converts predicates into S3 Select requests

Amazon S3

S3 Select

Page 25

Amazon S3 Select: Accelerating big data

[Charts: Before and After]

5X Faster with 1/40 of the CPU

Page 26

Using Amazon Glacier Select

Page 27

How Amazon Glacier Select Works

Page 28

Delivering Results Faster

Page 29

Optimizing data lake performance

• Aggregate small files: EMR S3DistCp, Amazon Kinesis Firehose

• S3 Select: big data cheaper, faster; up to 400% faster

• Data formats: columnar formats

• EMRFS consistent view (Amazon S3 with Amazon DynamoDB)

Page 30

Amazon Kinesis: Real Time

Easily collect, process, and analyze video and data streams in real time

• Kinesis Video Streams (new): Capture, process, and store video streams for analytics

• Kinesis Data Streams: Build custom applications that analyze data streams

• Kinesis Data Firehose: Load data streams into AWS data stores

• Kinesis Data Analytics: Analyze data streams with SQL
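
Loading a stream into a data store with Kinesis Data Firehose can be sketched with boto3's `put_record_batch`, which accepts a limited number of records per call (500 at the time of writing, to my knowledge). The sketch serializes events as newline-delimited JSON and splits them into batches; the stream name, the event shape, and both helpers are assumptions, and only the batching is executed here.

```python
import json


def to_batches(events, batch_size=500):
    """Serialize events as newline-delimited JSON records, split into batches."""
    records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def send(stream_name, events):
    """Send the batches with boto3 (needs AWS credentials; not run in this sketch)."""
    import boto3  # lazy import so the sketch loads without boto3 installed

    firehose = boto3.client("firehose")
    for batch in to_batches(events):
        firehose.put_record_batch(DeliveryStreamName=stream_name, Records=batch)


batches = to_batches([{"id": n} for n in range(1200)])
print([len(b) for b in batches])
```

A production version would also inspect `FailedPutCount` in each response and retry the records that were not accepted.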

Page 31

Data preparation accounts for ~80% of the work

[Chart categories: Building training sets, Cleaning and organizing data, Collecting data sets, Mining data for patterns, Refining algorithms, Other]

Page 32

AWS Glue: Serverless Data Catalog & ETL service

• Data Catalog: discover data and extract schema

• ETL job authoring: auto-generates customizable ETL code in Python and Spark

• Automatically discovers data and stores schema

• Data searchable, and available for ETL

• Generates customizable code

• Schedules and runs your ETL jobs

• Serverless
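
Once a table is in the Glue Data Catalog, its schema can be read back programmatically. The sketch below extracts column names and types from the response shape returned by boto3's Glue `get_table` call; the database and table names, the sample response, and the helpers are hypothetical, and only the parsing is executed here.

```python
def columns_from_table(response):
    """Return (name, type) pairs from a Glue GetTable response."""
    descriptor = response["Table"]["StorageDescriptor"]
    return [(col["Name"], col["Type"]) for col in descriptor["Columns"]]


def fetch_columns(database, table):
    """Look a table up in the Data Catalog (needs AWS credentials; not run here)."""
    import boto3  # lazy import so the sketch loads without boto3 installed

    glue = boto3.client("glue")
    return columns_from_table(glue.get_table(DatabaseName=database, Name=table))


# Hand-made response with the same nesting as a real GetTable result.
sample_response = {
    "Table": {
        "Name": "access_logs",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "src_ip", "Type": "string"},
                {"Name": "status", "Type": "int"},
            ]
        },
    }
}
columns = columns_from_table(sample_response)
print(columns)
```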

Page 33

Amazon SageMaker (GA)

The quickest and easiest way to get ML models from idea to production

• End-to-end machine learning platform

• Zero setup

• Flexible model training

• Pay by the second

Page 34

Planning for the Future

Page 35

[Architecture diagram with four stages: Collect, Store, Analyze, Visualize]

Data types: Transactional Data, File Data, Stream Data, Search Data

Sources: iOS, Android, Web Apps, Mobile Apps, Logstash, IoT, Applications, Logging

Services: Amazon RDS, Amazon DynamoDB, Amazon ES, Amazon S3, Apache Kafka, Amazon Glacier, Amazon Kinesis, Amazon ElastiCache, Amazon Redshift, Impala, Pig, Amazon ML, Streaming, AWS Lambda, Amazon Elastic MapReduce, Amazon QuickSight

Categories: Search, SQL, NoSQL, Cache, File Storage, Stream Storage, Stream Processing, Batch, Interactive, ML, ETL, Analysis & Visualization, Notebooks, Predictions, Apps & APIs, IDE

Temperature and speed labels: Hot, Warm, Cold; Fast, Slow

Evolve As Needed!

Page 36

AWS Training Offer

Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:

• Implement core AWS Big Data services according to best practices

• Design and maintain Big Data

• Leverage tools to automate data analysis

Who should attend

• Enterprise solutions architects

• Data scientists

• Big Data solutions architects

• Data analysts

Learning path

• Free AWS digital training: Foundational knowledge

• Free AWS digital training: Big Data Technology Fundamentals

• Big Data on AWS – 3-day Classroom Training

• Certified Cloud Practitioner

• Associate-level Certification

• AWS Certified Big Data - Specialty

Visit www.aws.training to find out more.

Page 37

Thank You For Attending AWS Data Driven Decisions Webinar Series.

We hope you found it interesting! A kind reminder to complete the survey.

Let us know what you thought of today’s event and how we can improve the event experience for you in the future.

twitter.com/AWSCloud

facebook.com/AmazonWebServices

youtube.com/user/AmazonWebServices

slideshare.net/AmazonWebServices

twitch.tv/aws