AWS Summit Auckland - Building a Server-less Data Lake on AWS
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sebastien Menant, Enterprise Solutions Architect & Nam Je Cho, Enterprise Solutions Architect,
Amazon Web Services
Chris Riddell, Senior Software Engineer, Parrot Analytics
Building a Server-less Data Lake on AWS
Session Depth: Technical 301
(Levels: Business, 101 Technical, 201 Technical, 301 Technical, 401 Technical)
Agenda
• What is a Data Lake?
• Why You Need a Data Lake
• Building the Data Lake
• Demo
• Next Steps
What is a Data Lake?
Definition
“A data lake provides massive storage for
any kind of data, enormous processing
power and the ability to handle virtually
limitless concurrent tasks or jobs”
- Wikipedia
Characteristics of a Data Lake
• Collect Everything
• Dive in Anywhere
• Flexible Access
Why You Need a Data Lake
What About Modern Business Needs?
Big Data… and The Hadoop Ecosystem
But Both are Complementary
[Diagram: Amazon EMR and Amazon Redshift]
[Diagram: one STORAGE layer on Amazon S3 shared by multiple COMPUTE clusters on Amazon EMR]
New Business Outcomes and Capabilities
• Enable New Insights in Your Data
• Cost Savings of Compute and Storage
• Use the Right Tool for the Job
• Increase Durability of Data
• Charge Storage Costs to Owner
• Streaming and Real-time Analysis
Retain all your data, for years!
Building the Data Lake
Beware
Building Blocks of the Data Lake
Storage and Ingestion
Catalogue and Search
Security
API and UI
Storage and Ingestion
Requirements for Storage
• Multi-year Scalable Storage Capability
• High Durability
• Store Raw Data from Any Input Sources
• Support for Any Data Type
• Low Cost
Key Services for Storage
Amazon S3
1. Highly Scalable and Durable
2. Security and Encryption
3. Lifecycle Management
4. Event Notifications
5. Versioning
Amazon Glacier
1. Long-term Archival Storage
2. Lifecycle Integration with S3
3. Extremely Low-cost
4. Vault Lock
Recommendations #1
• S3 Buckets
  • Close to Users and Compute
  • Select Region for Regulatory Compliance
• Naming
  • Human-readable Path
  • Random Hash Prefix for Optimal Partitioning
• Format
  • Structured vs Unstructured + Compression
  • CSV, Parquet, ORC, JSON, XML, logs, etc
  • GZIP for small files, Avro, LZO, Snappy
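The naming recommendation (a human-readable path combined with a random hash prefix) could be sketched as below. This is only an illustration: the function name `hashed_key`, the use of MD5, the four-character prefix length and the sample path are assumptions, not something the slides specify.

```python
import hashlib


def hashed_key(readable_path: str, prefix_len: int = 4) -> str:
    """Prepend a short hash of the path to an otherwise human-readable key."""
    # The hash is derived from the path itself, so the same path always
    # maps to the same key and the key can be recomputed later.
    digest = hashlib.md5(readable_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{readable_path}"


# Hypothetical object path, used only for illustration.
print(hashed_key("sales/2016/05/18/orders.csv.gz"))
```

The readable part of the path is kept after the prefix, so the key stays meaningful to a person while the leading characters vary from object to object.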
Recommendations #2
• Optimise
  • Store Everything
  • Use Large Files with Split-able Format
  • Lifecycle Policies for Cost-savings
  • Tagging for Cost Allocation
• Security
  • Encryption
  • Bucket Policies, ACL, Tagging, CloudTrail
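As a rough illustration of "Lifecycle Policies for Cost-savings", the sketch below builds an S3 lifecycle configuration that transitions objects to Glacier. The `raw/` prefix, the 90-day threshold, the rule ID and the bucket name in the comment are assumptions chosen for the example.

```python
import json


def lifecycle_config(prefix: str, archive_after_days: int) -> dict:
    """Build a lifecycle configuration that moves objects under `prefix` to Glacier."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Hypothetical prefix and retention period.
config = lifecycle_config("raw/", 90)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the same dict could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```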
Requirements for Ingestion
• Batch File Support
• Traditional ETL
• Streaming Data
• Consumption of any Dataset as a Stream
• Low Latency Analytics
• Replay-ability from the Data Lake
• Server-less ETL Capabilities
Key Services for Ingestion
Amazon Kinesis Firehose
1. Easy to use with Agent
2. Automatic Elasticity
3. Near Real-time
4. Simultaneous Destinations
Amazon Kinesis Streams
1. Enables Custom Processing
2. Continuous Data Collection
3. Real-time
4. API Driven for Custom Apps
[Diagram: Data Sources, an Amazon Kinesis Stream across three Availability Zones, AWS Lambda, KCL App, EMR, S3, DynamoDB, Redshift, Elasticsearch]
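The "API Driven for Custom Apps" point for Amazon Kinesis Streams might look like the following from a producer's side. The sketch only builds the arguments for a PutRecord call; the stream name, the event fields and the choice of JSON encoding are assumptions for illustration.

```python
import json


def put_record_kwargs(stream_name: str, event: dict, partition_field: str) -> dict:
    """Build the keyword arguments for a Kinesis Streams PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis stores opaque bytes; JSON is one common choice of encoding.
        "Data": json.dumps(event).encode("utf-8"),
        # Records sharing a partition key land on the same shard.
        "PartitionKey": str(event[partition_field]),
    }


# Hypothetical stream and event.
kwargs = put_record_kwargs("clickstream", {"user_id": 42, "page": "/home"}, "user_id")
print(kwargs["PartitionKey"], len(kwargs["Data"]), "bytes")

# With boto3 installed and credentials configured:
#   boto3.client("kinesis").put_record(**kwargs)
```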
Recommendations
• Reminder
  • Added Complexity needs Business Justification
• Select the Right Tools
  • Real-time Analysis: Apache Spark Streaming, Storm, Flink
  • Firehose to Redshift for BI and Dashboards
• Tips
  • AWS Lambda for ETL Transformation
  • Persist Streams into S3
http://amzn.to/23DWr5O
http://amzn.to/1SRk8wG
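The tip "AWS Lambda for ETL Transformation" could be sketched as a Lambda handler that decodes the records of a Kinesis event and applies a transformation. The `transform` rule (lower-casing field names and dropping empty values) and the sample record are hypothetical; a real function would also persist the result, for example into S3.

```python
import base64
import json


def transform(record: dict) -> dict:
    """Hypothetical transformation: lower-case field names and drop empty values."""
    return {k.lower(): v for k, v in record.items() if v not in (None, "")}


def handler(event, context):
    """Decode each Kinesis record in the Lambda event and transform it."""
    out = []
    for rec in event["Records"]:
        # Kinesis record payloads arrive base64-encoded inside the event.
        payload = base64.b64decode(rec["kinesis"]["data"])
        out.append(transform(json.loads(payload)))
    # A real function would persist `out`; here it is simply returned.
    return out


# Local invocation with a hand-built sample event.
sample = {"User": "a", "Amount": 3, "Note": ""}
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
print(handler({"Records": [{"kinesis": {"data": encoded}}]}, None))
```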
Catalogue and Search
Requirements for Catalogue and Search
• Metadata Index
• Automated Metadata Processing
• Discovery and Search
• Data Classification
• Server-less and Event-driven
Key Services for Catalogue and Search
AWS Lambda
1. Server-less
2. Event Driven
3. Auto Scaling
4. Real-time
Amazon DynamoDB
1. NoSQL
2. Streams
3. Logstash Plugin
Amazon Elasticsearch Service
1. Deploy Simply
2. Easy Admin
3. Kibana
Recommendations
• Tips
  • Start Small and Simple… add Capabilities
  • File names, size, state, dates, tags, owner
  • Region, versions, lineage, relationships
  • Search Metadata and Object Content
• Events
  • S3 Triggers Lambda
  • DynamoDB Streams
  • Logstash Plugin to Elasticsearch
http://amzn.to/23E9LUp
http://amzn.to/1TQVBwp
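The "S3 Triggers Lambda" step of the catalogue could be sketched as a handler that pulls basic metadata (name, size, region, date) out of an S3 event notification. The item field names are assumptions, and `put_item` is a hypothetical stand-in for the write into the DynamoDB metadata index.

```python
from urllib.parse import unquote_plus


def metadata_items(event: dict) -> list:
    """Extract one catalogue item per object from an S3 event notification."""
    items = []
    for rec in event.get("Records", []):
        obj = rec["s3"]["object"]
        items.append({
            "bucket": rec["s3"]["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(obj["key"]),
            "size": obj.get("size", 0),
            "etag": obj.get("eTag"),
            "region": rec.get("awsRegion"),
            "event_time": rec.get("eventTime"),
        })
    return items


def handler(event, context, put_item=print):
    """Lambda entry point; `put_item` stands in for a DynamoDB write."""
    items = metadata_items(event)
    for item in items:
        put_item(item)
    return len(items)
```

Starting with these few attributes matches the "Start Small and Simple" tip; tags, owner and lineage could be added to the item later.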
Security
Requirements for Security
• Data Encryption at Rest
• Authentication
• Authorisation
Key Services for Security
AWS IAM
1. Users and Roles
2. Identity Federation
3. Multi Factor Authentication
4. Granular Permissions
AWS KMS
1. Seamless Service Integration
2. Extensive Compliance
[Diagram: AWS IAM, AWS KMS, AWS CloudHSM, SSE-S3]
Recommendations
• Start Early
  • Security Needs Practice!
  • Federate with your Corporate Directory
• Best Practice
  • Use CloudTrail and CloudWatch
  • Encrypt Where Possible
  • Select Bucket Region for Regulatory Compliance
• Tips
  • IAM Policies, S3 Versioning and MFA Delete
  • Lambda for Data Masking
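One way the "Lambda for Data Masking" tip might be approached is to replace sensitive values with a salted digest before data is served. The field names in `SENSITIVE_FIELDS`, the salt handling and the 16-character truncation are assumptions for this sketch, not a method described in the slides.

```python
import hashlib

SENSITIVE_FIELDS = {"email", "phone"}  # hypothetical field names


def mask_record(record: dict, salt: str) -> dict:
    """Replace sensitive values with a salted SHA-256 digest; keep other fields."""
    masked = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS and value is not None:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            # Truncated for readability; equal inputs still mask to equal outputs.
            masked[field] = digest[:16]
        else:
            masked[field] = value
    return masked


# Hypothetical record and salt.
print(mask_record({"email": "user@example.com", "country": "NZ"}, salt="demo-salt"))
```

Because the digest is deterministic for a given salt, masked values can still be used to join or count records without exposing the original value.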
API and UI
Requirements for API and UI
• Serve Data and Capabilities to Customers
• Programmatically
• Search Catalogue
• Run Compute
• Extend Access Control Management
• And… Use of Familiar Visualisation Tools
Key Services for API and UI
Amazon API Gateway
1. Performance at Any Scale
2. Create RESTful Frontend
3. Managed API Lifecycle
AWS Lambda
1. Enables Server-less API
2. Custom Logic for Services
3. Automatic Scaling
Recommendations
• Tips
  • Go Server-less!
  • Extend Existing AWS Services and Build Custom Logic
  • Data Management, Processing and Transformations
• API Gateway for Data Access
  • Serve the Data, Search and Compute via RESTful APIs
  • Distribute a Custom SDK
• Extend the Solution
  • Build Advanced Security Controls using Metadata Index
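Serving catalogue search through a RESTful API could be sketched as a Lambda function behind API Gateway. The sketch assumes the Lambda proxy integration event and response shape, a hypothetical `q` query parameter, and an in-memory `CATALOGUE` standing in for the metadata index.

```python
import json

# Hypothetical in-memory catalogue; a real one would live in the metadata index.
CATALOGUE = [
    {"key": "raw/sales/2016/orders.csv", "owner": "finance"},
    {"key": "raw/web/2016/clicks.json", "owner": "marketing"},
]


def handler(event, context):
    """Answer a search request in the Lambda proxy integration response shape."""
    # queryStringParameters is None when the request carries no query string.
    params = event.get("queryStringParameters") or {}
    query = params.get("q", "").lower()
    hits = [item for item in CATALOGUE if query in item["key"].lower()]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"count": len(hits), "results": hits}),
    }


# Local invocation with a hand-built event.
print(handler({"queryStringParameters": {"q": "sales"}}, None)["body"])
```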
The Whole Picture…
[Diagram: the four building blocks combined.
Storage and Ingestion: Amazon S3, Amazon Glacier, Amazon Kinesis.
Catalogue and Search: AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch.
Security: AWS KMS, AWS IAM.
API and UI: Amazon API Gateway, AWS Lambda, Users.
Also shown: Amazon EMR, Amazon RDS, Amazon Redshift.]
A Data Lake is…
• Foundation of Data Storage and Streaming Data
• Metadata Index to Help Categorise and Govern
• Search Index to Enable Data Discovery
• Robust Set of Security Controls
• Governance Through Technology Not Policy
• Interface to Expose Data and Capabilities to Users
Demo
Building Catalogue and Search
[Diagram: Data Flow. Labels: Data Source, S3 Bucket, Lambda, DynamoDB, Logstash, Elasticsearch, Metadata Index]
Next Steps
Proof of Concept
Next Steps
• How to Get Started
• AWS Documentation
• Getting Started Guide
• AWS Training & Certification
• Big Data on AWS
• AWS Partner Network
• AWS Professional Services
• Big Data Specialists
AWS Training & Certification
Intro Videos & Labs
Free videos and labs to help you learn to work with 30+ AWS services – in minutes!
Training Classes
In-person and online courses to build technical skills – taught by accredited AWS instructors
Online Labs
Practice working with AWS services in a live environment – learn how related services work together
AWS Certification
Validate technical skills and expertise – identify qualified IT talent or show you are AWS cloud ready
Learn more: aws.amazon.com/training
Your Training Next Steps:
Visit the AWS Training & Certification pod to discuss your
training plan & AWS Summit training offer
Register & attend AWS instructor led training
Get Certified
AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag
Thank You!