AWS Summit Auckland - Building a Server-less Data Lake on AWS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Sebastien Menant, Enterprise Solutions Architect & Nam Je Cho, Enterprise Solutions Architect, Amazon Web Services

Chris Riddell, Senior Software Engineer, Parrot Analytics

Building a Server-less Data Lake on AWS

Session Depth: Technical 301

Agenda

• What is a Data Lake?

• Why You Need a Data Lake

• Building the Data Lake

• Demo

• Next Steps

What is a Data Lake?

Definition

“A data lake provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”

- Wikipedia

Characteristics of a Data Lake

• Collect Everything

• Dive in Anywhere

• Flexible Access

Why You Need a Data Lake

What About Modern Business Needs?

Big Data… and The Hadoop Ecosystem

But Both are Complementary

[Diagram: Amazon EMR and Amazon Redshift]

[Diagram: STORAGE (Amazon S3) and multiple COMPUTE blocks (Amazon EMR)]

New Business Outcomes and Capabilities

• Enable New Insights in Your Data

• Cost Savings of Compute and Storage

• Use the Right Tool for the Job

• Increase Durability of Data

• Charge Storage Costs to Owner

• Streaming and Real-time Analysis

Retain all your data, for years!

Building the Data Lake

Beware

Building Blocks of the Data Lake

Storage and Ingestion

Catalogue and Search

Security

API and UI

Storage and Ingestion

Requirements for Storage

• Multi-year Scalable Storage Capability

• High Durability

• Store Raw Data from Any Input Sources

• Support for Any Data Type

• Low Cost

Key Services for Storage

Amazon S3

1. Highly Scalable and Durable

2. Security and Encryption

3. Lifecycle Management

4. Event Notifications

5. Versioning

Amazon Glacier

1. Long-term Archival Storage

2. Lifecycle Integration with S3

3. Extremely Low-cost

4. Vault Lock

[Diagram: Storage and Ingestion with Amazon S3 and Amazon Glacier]

Recommendations #1

• S3 Buckets

  • Close to Users and Compute

  • Select Region for Regulatory Compliance

• Naming

  • Human-readable Path

  • Random Hash Prefix for Optimal Partitioning

• Format

  • Structured vs Unstructured + Compression

  • CSV, Parquet, ORC, JSON, XML, logs, etc

  • GZIP for small files, Avro, LZO, Snappy
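
The naming advice above (a human-readable path plus a random hash prefix) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_object_key`, the path layout `source/dataset/date/filename` and the four-character prefix length are hypothetical choices, not something the slides prescribe.

```python
import hashlib


def build_object_key(source: str, dataset: str, date: str, filename: str,
                     prefix_len: int = 4) -> str:
    """Return an S3 object key: short hash prefix, then a readable path.

    The layout and prefix length are illustrative assumptions.
    """
    readable_path = f"{source}/{dataset}/{date}/{filename}"
    # Hash the readable path so the prefix is deterministic per object
    # while still spreading keys across many different prefixes.
    digest = hashlib.sha256(readable_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{readable_path}"


key = build_object_key("crm", "orders", "2016-05-18", "part-0001.csv.gz")
print(key)
```

Because the prefix is derived from the path itself, the same object always maps to the same key, and the readable part remains visible after the first `/`.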

Recommendations #2

• Optimise

  • Store Everything

  • Use Large Files with Split-able Format

  • Lifecycle Policies for Cost-savings

  • Tagging for Cost Allocation

• Security

  • Encryption

  • Bucket Policies, ACL, Tagging, CloudTrail
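
As one possible reading of "Lifecycle Policies for Cost-savings", the sketch below builds an S3 lifecycle configuration that moves objects under a prefix to Amazon Glacier after a number of days. The helper name `glacier_lifecycle_rule`, the `raw/` prefix, the 90-day threshold and the bucket name in the comment are all assumptions for illustration; the dictionary shape follows the lifecycle configuration accepted by the S3 API.

```python
def glacier_lifecycle_rule(prefix: str, archive_after_days: int) -> dict:
    """Build one lifecycle rule that transitions objects to Glacier.

    No expiration action is added, in keeping with "Store Everything".
    """
    if archive_after_days < 0:
        raise ValueError("archive_after_days must not be negative")
    return {
        "ID": f"archive-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": archive_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical values: archive everything under raw/ after 90 days.
lifecycle_configuration = {"Rules": [glacier_lifecycle_rule("raw/", 90)]}

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration)
print(lifecycle_configuration)
```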

Requirements for Ingestion

• Batch File Support

• Traditional ETL

• Streaming Data

• Consumption of any Dataset as a Stream

• Low Latency Analytics

• Replay-ability from the Data Lake

• Server-less ETL Capabilities

Key Services for Ingestion

Amazon Kinesis Firehose

1. Easy to use with Agent

2. Automatic Elasticity

3. Near Real-time

4. Simultaneous Destinations

Amazon Kinesis Streams

1. Enables Custom Processing

2. Continuous Data Collection

3. Real-time

4. API Driven for Custom Apps

[Diagram: Data Sources, Amazon Kinesis Stream across three Availability Zones, AWS Lambda, KCL App, EMR, Elasticsearch, S3, DynamoDB, Redshift]

[Diagram: Storage and Ingestion with Amazon S3, Amazon Glacier and Amazon Kinesis]

Recommendations

• Reminder

  • Added Complexity needs Business Justification

• Select the Right Tools

  • Real-time Analysis: Apache Spark Streaming, Storm, Flink

  • Firehose to Redshift for BI and Dashboards

• Tips

  • AWS Lambda for ETL Transformation

  • Persist Streams into S3

http://amzn.to/23DWr5O

http://amzn.to/1SRk8wG
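
The tip "AWS Lambda for ETL Transformation" could look roughly like the following handler for a Kinesis Streams event source. Kinesis delivers each record's payload base64-encoded under `Records[].kinesis.data`; everything else here is an assumption for illustration, in particular the `transform` step and the idea that payloads are JSON objects.

```python
import base64
import json


def transform(record: dict) -> dict:
    """Hypothetical ETL step: lower-case field names, drop empty values."""
    return {k.lower(): v for k, v in record.items() if v not in (None, "")}


def handler(event, context):
    """Lambda entry point: decode, transform and batch Kinesis records."""
    transformed = []
    for rec in event["Records"]:
        payload = base64.b64decode(rec["kinesis"]["data"])
        transformed.append(transform(json.loads(payload)))
    # Newline-delimited JSON is a convenient shape for one S3 object per batch.
    return "\n".join(json.dumps(r, sort_keys=True) for r in transformed)


# Local usage example with a hand-built event.
sample = base64.b64encode(json.dumps({"User": "a", "Note": ""}).encode()).decode()
print(handler({"Records": [{"kinesis": {"data": sample}}]}, None))
```

To follow "Persist Streams into S3", the returned batch would be written to a bucket, for example with the S3 `put_object` call; that step is left out so the sketch runs without AWS credentials.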


Catalogue and Search

Requirements for Catalogue and Search

• Metadata Index

• Automated Metadata Processing

• Discovery and Search

• Data Classification

• Server-less and Event-driven

Key Services for Catalogue and Search

AWS Lambda

1. Server-less

2. Event Driven

3. Auto Scaling

4. Real-time

Amazon DynamoDB

1. NoSQL

2. Streams

3. Logstash Plugin

Amazon Elasticsearch Service

1. Deploy Simply

2. Easy Admin

3. Kibana

[Diagram: AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch]

Recommendations

• Tips

  • Start Small and Simple… add Capabilities

  • File names, size, state, dates, tags, owner

  • Region, versions, lineage, relationships

  • Search Metadata and Object Content

• Events

  • S3 Triggers Lambda

  • DynamoDB Streams

  • Logstash Plugin to Elasticsearch

http://amzn.to/23E9LUp

http://amzn.to/1TQVBwp
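
For the "S3 Triggers Lambda" step, a minimal sketch of a handler that turns an S3 event notification into metadata items might look like this. The field paths (`s3.bucket.name`, `s3.object.key`, `s3.object.size`, `awsRegion`, `eventTime`) follow the S3 event structure; the item layout and function names are assumptions, and attributes such as owner or tags are not part of the event and would need a separate lookup.

```python
from urllib.parse import unquote_plus


def metadata_from_s3_event(event: dict) -> list:
    """Extract one metadata item per object referenced in an S3 event."""
    items = []
    for rec in event.get("Records", []):
        obj = rec["s3"]["object"]
        items.append({
            "bucket": rec["s3"]["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(obj["key"]),
            "size": obj.get("size", 0),
            "region": rec.get("awsRegion"),
            "event": rec.get("eventName"),
            "event_time": rec.get("eventTime"),
        })
    return items


def handler(event, context):
    """Lambda entry point; a real deployment would write each item to the
    metadata index (for example a DynamoDB table) instead of returning it."""
    return metadata_from_s3_event(event)
```

Starting with only the fields the event already carries matches the "Start Small and Simple" tip; lineage and relationships can be added later as extra attributes.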

Security

Requirements for Security

• Data Encryption at Rest

• Authentication

• Authorisation

Key Services for Security

AWS IAM

1. Users and Roles

2. Identity Federation

3. Multi Factor Authentication

4. Granular Permissions

AWS KMS

1. Seamless Service Integration

2. Extensive Compliance

[Diagram: Security with AWS IAM, AWS KMS, AWS CloudHSM and SSE-S3]

Recommendations

• Start Early

  • Security Needs Practice!

  • Federate with your Corporate Directory

• Best Practice

  • Use CloudTrail and CloudWatch

  • Encrypt Where Possible

  • Select Bucket Region for Regulatory Compliance

• Tips

  • IAM Policies, S3 Versioning and MFA Delete

  • Lambda for Data Masking
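
The "Lambda for Data Masking" tip could be approached with a small masking function like the one below, which a Lambda function might call before data is served to users. This is a sketch under assumptions: the field names in `SENSITIVE_FIELDS`, the salt value and the 16-character truncation are hypothetical, and a salted hash is pseudonymisation rather than strong anonymisation.

```python
import hashlib

# Hypothetical list of fields to mask; a real one would come from the
# data classification held in the metadata index.
SENSITIVE_FIELDS = {"email", "phone"}


def mask_record(record: dict, salt: str = "example-salt") -> dict:
    """Return a copy of the record with sensitive values replaced by a
    truncated salted SHA-256 digest; other fields pass through unchanged."""
    masked = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS and value is not None:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
            masked[field] = digest.hexdigest()[:16]
        else:
            masked[field] = value
    return masked


print(mask_record({"email": "user@example.com", "country": "NZ"}))
```

Because the digest is deterministic for a given salt, masked values can still be joined or counted across datasets without exposing the original value.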

API and UI

Requirements for API and UI

• Serve Data and Capabilities to Customers

• Programmatically

• Search Catalogue

• Run Compute

• Extend Access Control Management

• And… Use of Familiar Visualisation Tools

Key Services for API and UI

Amazon API Gateway

1. Performance at Any Scale

2. Create RESTful Frontend

3. Managed API Lifecycle

AWS Lambda

1. Enables Server-less API

2. Custom Logic for Services

3. Automatic Scaling

[Diagram: API and UI with Amazon API Gateway and AWS Lambda]

Recommendations

• Tips

• Go Server-less!

• Extend Existing AWS Services and Build Custom Logic

• Data Management, Processing and Transformations

• API Gateway for Data Access

• Serve the Data, Search and Compute via RESTful APIs

• Distribute a Custom SDK

• Extend the Solution

• Build Advanced Security Controls using Metadata Index
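
To illustrate "Serve the Data, Search and Compute via RESTful APIs", here is a minimal sketch of a Lambda function behind API Gateway that answers catalogue searches. It assumes the Lambda proxy integration event shape (`queryStringParameters` in, `statusCode`/`headers`/`body` out); the in-memory `CATALOGUE`, the `q` parameter and the function names are hypothetical stand-ins for a real query against the search index.

```python
import json

# Hypothetical in-memory stand-in for the search index.
CATALOGUE = [
    {"key": "raw/orders/2016-05-01.csv", "owner": "sales", "size": 1024},
    {"key": "raw/clicks/2016-05-01.json", "owner": "web", "size": 2048},
]


def search(term: str) -> list:
    """Case-insensitive substring match on object key and owner."""
    term = term.lower()
    return [entry for entry in CATALOGUE
            if term in entry["key"].lower() or term in entry["owner"].lower()]


def respond(status: int, payload: dict) -> dict:
    """Build a response in the shape API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Lambda entry point for GET /search?q=term (hypothetical route)."""
    params = event.get("queryStringParameters") or {}
    term = params.get("q")
    if not term:
        return respond(400, {"error": "missing query parameter 'q'"})
    results = search(term)
    return respond(200, {"count": len(results), "results": results})
```

Access control based on the metadata index could be added inside `search` by filtering on the caller's identity, which is one way to read "Build Advanced Security Controls using Metadata Index".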

The Whole Picture…

[Diagram: the whole picture. Storage and Ingestion: Amazon S3, Amazon Glacier, Amazon Kinesis. Catalogue and Search: AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch. Security: AWS KMS, AWS IAM. API and UI: Amazon API Gateway, AWS Lambda, USERS. Also shown: Amazon EMR, Amazon RDS, Amazon Redshift]

A Data Lake is…

• Foundation of Data Storage and Streaming Data

• Metadata index to help Categorise and Govern

• Search Index to Enable Data Discovery

• Robust Set of Security Controls

• Governance Through Technology Not Policy

• Interface to Expose Data and Capabilities to Users

Building Catalogue and Search

[Diagram: Data Flow. Labels: Data Source, S3 Bucket, Lambda, DynamoDB, Metadata Index, Logstash, ElasticSearch]
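
The demo relies on Logstash to carry changes from DynamoDB to the search index. Purely to illustrate what that hop does, and not as a replacement for the Logstash plugin, the sketch below converts DynamoDB Streams records into plain documents ready for indexing. The typed attribute format (`{"S": ...}`, `{"N": ...}` and so on) is the DynamoDB one; the function names are assumptions.

```python
def unmarshal(attr: dict):
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    (type_tag, value), = attr.items()
    if type_tag == "S" or type_tag == "BOOL":
        return value
    if type_tag == "N":
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal(v) for v in value]
    if type_tag == "M":
        return {k: unmarshal(v) for k, v in value.items()}
    if type_tag == "SS":
        return list(value)
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def to_search_documents(stream_event: dict) -> list:
    """Turn INSERT/MODIFY stream records into documents; skip REMOVE."""
    docs = []
    for rec in stream_event["Records"]:
        if rec["eventName"] == "REMOVE":
            continue
        image = rec["dynamodb"]["NewImage"]
        docs.append({name: unmarshal(attr) for name, attr in image.items()})
    return docs
```

Handling of REMOVE events (deleting the document from the search index) is left out here to keep the sketch short.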

Next Steps

Proof of Concept

Next Steps

• How to Get Started

• AWS Documentation

• Getting Started Guide

• AWS Training & Certification

• Big Data on AWS

• AWS Partner Network

• AWS Professional Services

• Big Data Specialists

AWS Training & Certification

Intro Videos & Labs

Free videos and labs to help you learn to work with 30+ AWS services – in minutes!

Training Classes

In-person and online courses to build technical skills – taught by accredited AWS instructors

Online Labs

Practice working with AWS services in a live environment – learn how related services work together

AWS Certification

Validate technical skills and expertise – identify qualified IT talent or show you are AWS cloud ready

Learn more: aws.amazon.com/training

https://aws.amazon.com/training/

Your Training Next Steps:

Visit the AWS Training & Certification pod to discuss your training plan & AWS Summit training offer

Register & attend AWS instructor led training

Get Certified

AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag

Thank You!