(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift


TRANSCRIPT

Page 1: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

© 2015 Nasdaq, Inc. All rights reserved.

“Nasdaq” and the Nasdaq logo are the trademarks of Nasdaq, Inc. and its affiliates in the U.S. and other countries.

“Amazon” and the Amazon Web Services logo are the trademarks of Amazon Web Services, Inc. or its affiliates in the U.S. and other countries

Nate Sammons, Principal Architect, Nasdaq, Inc.

October 2015

BDT314

Running a Big Data and Analytics

Application on Amazon EMR and Amazon

Redshift with a Focus on Security

Page 2: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

• Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands

• More than US$1 trillion in notional value is tied to our library of more than 41,000 global indexes

• Nasdaq technology is used to power more than 100 marketplaces in 50 countries

• Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds

• We own and operate 26 markets, 1 clearinghouse, and 5 central securities depositories, across asset classes and geographies

Page 3: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

What to Expect from the Session

• Motivations for extending an Amazon Redshift warehouse with Amazon EMR

• How our data ingest workflow operates

• How to query encrypted data in Amazon S3 using Presto and other Hadoop-ecosystem tools

• How to manage schemas and data migrations

• Future direction for our data warehouse

Page 4: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Current State

Page 5: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Amazon Redshift as Nasdaq’s Main Data Warehouse

• Transitioned from an on-premises warehouse to

Amazon Redshift

• Over 1,000 tables migrated

• More data sources added as needed

• Nearly two years of data

• Average daily ingest of over 7B rows

Page 6: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Never Throw Anything Away

• 23-node ds2.8xlarge Amazon Redshift cluster

• 828 vCPUs, 5.48 TB of RAM

• 368 TB of DB storage capacity, over 1 PB of local disk!

• 92 GB/sec aggregate disk I/O

• Resize once per quarter

• 2.7 trillion rows: 1.8T from sources, 900B derived

Page 7: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Many Data Sources

• Internal DBs, CSV files, stream captures, etc.

• Data from all 7 exchanges operated by Nasdaq

• Orders, quotes, trade executions

• Market “tick” data

• Security master

• Membership

• All highly structured and consistent row-oriented data

Page 8: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Data Corollary to the Ideal Gas Law

Page 9: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Motivations for Extending to Amazon EMR and

Amazon S3

• Resizing a 300+ TB Amazon Redshift cluster isn’t

instantaneous

• Continuing to grow the cluster is expensive

• Paying for CPU and disk to support infrequently accessed

data doesn’t make sense

• Data will expand to fill any container

Page 10: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Extending Our Warehouse

Page 11: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Goals

• Build a secure, cost effective, long-term data store

• Provide a SQL interface to all data

• Support new MPP analytics workloads (Spark, ML, etc.)

• Cap the size of our Amazon Redshift cluster

• Manage storage and compute resources separately

Page 12: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

High Level Overview

Page 13: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Amazon Redshift’s Continuing Role

• All data lands in Amazon Redshift first

• Amazon Redshift clients have strict SLAs on data availability

• Must ensure data loads are finished quickly

• Aggregations and transformations performed in SQL

• SQL is easy and we have a lot of SQL expertise

• Transformed data is then unloaded to Amazon S3 for conversion

Page 14: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Decouple Storage and Compute Resources

Scale each independently as needed, run multiple different

apps on top of a common storage system

Especially for old, infrequently accessed data, no need to

run compute 24/7 to support it; we can keep data “forever”

Access needs drop off dramatically over time

• Yesterday >> last month >> last quarter >> last year

Page 15: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Account Structure and Cost Allocations

• Separate AWS accounts for each client / department

• Departments can run as much or as little compute as

they need; use different query tools, experiments

• No competition for compute resources across clients

• Amazon S3 costs are shared, compute costs are passed

through to each department

Page 16: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Data Ingest Workflow

Page 17: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Data Ingest Overview

Page 18: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Nasdaq Workflow Engine

• MySQL-backed workflow engine developed in-house

• Orchestrates over 40K operations daily

• Flexible scheduling and dependency management

• Ops GUI for retrying failed steps, root cause analysis

• Moving to Amazon Aurora + Amazon EC2 in 2016

• Clustered operation using Amazon S3 as temp storage

space

Page 19: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Amazon Redshift Data Ingest Workflow

• Data is pulled from various sources

• Validate data, convert to CSVs + manifest

• Store compressed, encrypted data in Amazon S3 temp

space

• Load into Amazon Redshift using COPY SQL statements

• Further transformation performed using SQL

• UNLOAD transformed data back to Amazon S3

• Notifications to other systems using Amazon SQS
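The COPY/UNLOAD steps above can be sketched as plain SQL builders. This is a hypothetical illustration, not Nasdaq's actual code: the table, bucket, and IAM role names are invented, and the exact COPY options used in production are not stated on the slide.

```python
# Hypothetical sketch of the SQL the ingest workflow issues against
# Amazon Redshift: COPY from an S3 manifest, then UNLOAD transformed
# results back to S3. All identifiers below are invented examples.

def build_copy_sql(table, manifest_url, iam_role):
    """Render a COPY statement loading gzip-compressed CSVs listed
    in a manifest file."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "MANIFEST GZIP CSV"
    )

def build_unload_sql(query, dest_prefix, iam_role):
    """Render an UNLOAD statement writing query results back to S3."""
    return (
        f"UNLOAD ('{query}') TO '{dest_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "MANIFEST GZIP"
    )

copy_sql = build_copy_sql(
    "trades",
    "s3://ingest-temp/trades/20151008.manifest",
    "arn:aws:iam::123456789012:role/ingest",
)
unload_sql = build_unload_sql(
    "SELECT * FROM trades_agg WHERE trade_date = 20151008",
    "s3://ingest-temp/unload/trades_agg/20151008/",
    "arn:aws:iam::123456789012:role/ingest",
)
```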

Page 20: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Amazon EMR / Amazon S3 Data Ingest Workflow

• Automatically executed after Amazon Redshift loads and

transformations complete

• Uses Amazon Redshift schema metadata and manifest file

to drive conversions to Parquet

• Detects schema changes and bumps Hive schema version

• Alters schema in Hive Metastore to add new tables,

partitions as needed
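The schema-change detection can be reduced to a comparison of column lists. A minimal sketch, with invented names (the slide doesn't show how the check is implemented):

```python
# Minimal sketch of the schema-change check the conversion step
# performs: compare the column list from Amazon Redshift metadata
# against the columns behind the current Hive schema version, and
# bump the version when they differ.

def next_schema_version(current_version, hive_columns, redshift_columns):
    """Return (version, changed): bump the version when the Redshift
    column set no longer matches the Hive schema."""
    if hive_columns == redshift_columns:
        return current_version, False
    return current_version + 1, True

# A new column appeared in Redshift, so the Hive schema version bumps.
version, changed = next_schema_version(
    3,
    [("symbol", "string"), ("price", "double")],
    [("symbol", "string"), ("price", "double"), ("venue", "string")],
)
```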

Page 21: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Data Security and Encryption

Page 22: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

VPC or Nothing

• Security is our #1 priority at all times

• All instances run in a VPC

• Locked down security groups, network ACLs, etc.

• Least-privilege IAM roles for each app and human

• See SEC302 – IAM Best Practices from Anders

• EC2 instance roles in Amazon EMR

• VPC endpoint for Amazon S3

• 10 Gbps private AWS Direct Connect circuits into AWS

Page 23: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Encryption Key Management

• On-premises Safenet LUNA HSM cluster for key storage

• Amazon Redshift is directly integrated with our HSMs

• Nasdaq KMS:

• Internally known as “Vinz Clortho”

• Roots encryption keys in the HSM cluster

• Allows us full control over where keys are stored, used

Page 24: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Transparent Encryption in Amazon S3 and EMRFS

Amazon S3 SDK EncryptionMaterialsProvider

interface:

• Adapter to retrieve keys from our KMS

• Used when reading or writing data in Amazon S3

• User metadata to encode encryption key tokens

Page 25: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Encryption Performance with Amazon S3

• Roughly 25% slower than unencrypted

• Seek within an encrypted object works:

• Critical for performance

• Handled automatically

• Seeks are relative to the unencrypted size

• Create a new HTTP request at an offset within the object

• Encryption offset work is handled in the AWS SDK itself

• Worst case, we must read two extra blocks of AES data
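The "two extra blocks" worst case falls out of AES block arithmetic. A sketch of the offset calculation, assuming CBC-style decryption where the preceding ciphertext block is needed as the IV (this illustrates the idea; the real logic lives inside the AWS SDK):

```python
AES_BLOCK = 16  # AES block size in bytes

def ciphertext_range_start(plaintext_offset):
    """Byte offset at which to start the ranged S3 GET so the AES
    layer can decrypt from `plaintext_offset`: back up to the block
    boundary, then fetch one more block to serve as the IV. At worst
    this reads two extra blocks of AES data before the target byte."""
    aligned = (plaintext_offset // AES_BLOCK) * AES_BLOCK
    return max(0, aligned - AES_BLOCK)

# Seeking to byte 1000: it falls in the block starting at 992, so the
# ranged request begins one block earlier, at 976.
start = ciphertext_range_start(1000)
```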

Page 26: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Local disk encryption with Amazon EMR

• Bootstrap action to encrypt ephemeral disks

• Specifically to encrypt Presto’s local temp storage

• Standard Linux LUKS configuration

• Integrated with the Nasdaq KMS

• Retrieves key and mounts disks on startup using init.d

Page 27: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

SELinux on Amazon EMR

• Bootstrap action to install SELinux packages

• Adds kernel command line arguments

• Rebuilds initrd image

• Reboots the node and re-labels the filesystem

• Increases cluster boot time

• Currently only working on Amazon EMR 3.8

• Working to refine SELinux policy files for Presto

Page 28: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Presto on Amazon EMR

Page 29: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

What is Presto?

• https://prestodb.io

• Open Source MPP SQL database from Facebook

• Flexible data sources through Connector API

• JDBC, ODBC drivers

• Nice GUI from Airbnb: http://nerds.airbnb.com/airpal/

• Hive Connector:

• Table schemas defined in a Hive Metastore as external tables

• Data files stored in Amazon S3

Page 30: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Presto Overview

Page 31: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Running Presto on Amazon EMR

• Bootstrap action to download and install Java 8 & Presto

• Based on the Amazon EMR team’s Presto bootstrap action

• Adds support for custom encryption materials provider jars

• Configures Presto to use a remote Hive Metastore

• Currently using Amazon EMR 3.8, working towards 4.0

Page 32: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Data Encryption in Presto

• Presto doesn’t use EMRFS for access to Amazon S3

• We added support for Amazon S3

EncryptionMaterialsProvider to PrestoS3FileSystem.java

• Code available at github.com/nasdaq

• Working with Facebook to integrate these changes

Page 33: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Data Storage Formats

Page 34: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

File Formats: Parquet vs. ORC

The two most widely used structured data file formats:

• Compressed, columnar record storage

• Structured, schema-validated data

• Supported by a variety of Hadoop-ecosystem apps

• Arbitrary user metadata encoded at the file level

Page 35: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

ORC

Pros:

• DATE and TIMESTAMP type support in Hive, Presto

Cons:

• Rigid column ordering requirements

• Clunky Java API

• Unacceptable performance when encrypted in

Amazon S3

• 15-18x slower during our testing (!)

Page 36: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

The Winner: Parquet

• Wide project support: Presto, Spark, Drill, etc.

• Actively developed project

• Adoption increasing

• Column referenced by name instead of position

• Set hive.parquet.use-column-names=true in Presto config

• Good performance when encrypted (~27% slower)

• Clean Java API

Page 37: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Parquet Schema Workarounds

DATE not supported in Hive or Presto

• Instead, convert DATEs to INTs

• 2015-10-08 becomes 20151008

• Timestamps become a BIGINT (64-bit integer in Hive)

• For nanosecond resolution records, we use a DATE and

a separate nanos-since-midnight column
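The encodings above are simple arithmetic. A sketch of the conversions (helper names are invented; the slide only describes the target representations):

```python
from datetime import datetime

def date_to_int(d):
    """Encode a date as a YYYYMMDD integer: 2015-10-08 -> 20151008."""
    return d.year * 10000 + d.month * 100 + d.day

def split_nanos(ts):
    """Encode a timestamp as (YYYYMMDD int, nanoseconds since
    midnight), the two-column scheme used for nanosecond-resolution
    records. datetime only carries microseconds, so the nanos column
    here is microseconds * 1000."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    nanos = seconds * 1_000_000_000 + ts.microsecond * 1000
    return date_to_int(ts.date()), nanos

# 09:30:00.000500 on 2015-10-08:
day, nanos = split_nanos(datetime(2015, 10, 8, 9, 30, 0, 500))
```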

Page 38: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Schema and Data Management

Page 39: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Hive Metastore

• Amazon EMR 4.0 cluster for the Metastore

• Easier for remote access from Presto

• Reachable through VPC peering with client accounts

• The “source of truth” for Hive schemas

• Metastore DB on Amazon RDS for MySQL

• Easy backups, encrypted storage

• Data ingest system creates/alters tables

• Alters tables to add new data partitions each day

• Detects newly changed schemas

Page 40: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Managing Versioned, Partitioned Tables in S3

• Store versions of a table in directories in Amazon S3:

s3://schema/table/version/date=YYYYMMDD/*.parquet

Works with “msck repair table” commands

• When a schema change is detected, increment the

version. New data is written to the new location, and

alerts are generated so humans can review the changes.

• Data is migrated in Amazon S3 and old versions are

kept for now
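The versioned layout above is just a key-building convention. A sketch (the bucket name is an invented placeholder; the slide shows the prefix shape without one):

```python
def partition_key(bucket, schema, table, version, day):
    """Build the Amazon S3 prefix for one daily partition of a
    versioned table: schema/table/version/date=YYYYMMDD/."""
    return f"s3://{bucket}/{schema}/{table}/{version}/date={day}/"

# A schema change bumps the version, so new data lands under a new
# prefix while the old version's files stay where they are.
old = partition_key("warehouse", "equities", "trades", 3, "20151008")
new = partition_key("warehouse", "equities", "trades", 4, "20151008")
```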

Page 41: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Logical vs. Physical Schemas

• Track a “logical” and “physical” schema for each table

• Logical is compared with Amazon Redshift to detect changes

• Physical schema used to produce Hive DDL for Presto

• Schema definitions stored in MySQL

• Version management and change detection

• Amazon S3 location for each table

• Tools to export these schemas as .sql files

• Hive schema and table create statements

• “msck repair table” scripts
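Producing Hive DDL from the logical schema amounts to applying the type adjustments and emitting a CREATE EXTERNAL TABLE statement. A sketch under assumptions: the table, columns, and S3 location are invented, and the real tool reads its schema definitions from MySQL rather than a literal list.

```python
# Map logical (Redshift-facing) types to physical (Parquet/Hive)
# types, per the workarounds described earlier: DATE -> INT,
# TIMESTAMP -> BIGINT. Other types pass through unchanged.
PHYSICAL_TYPE = {"DATE": "INT", "TIMESTAMP": "BIGINT"}

def hive_ddl(table, logical_columns, location):
    """Render external-table DDL: backticked, lower-cased column
    names (required for Parquet) with physical types applied."""
    cols = ",\n  ".join(
        f"`{name.lower()}` {PHYSICAL_TYPE.get(ltype, ltype)}"
        for name, ltype in logical_columns
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "PARTITIONED BY (`date` INT)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )

ddl = hive_ddl(
    "trades",
    [("Symbol", "STRING"), ("TradeDate", "DATE"), ("Ts", "TIMESTAMP")],
    "s3://warehouse/equities/trades/4/",
)
```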

Page 42: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

File-level Metadata

We encode information in file-level metadata:

• Partition column definition

• Time zone in which the file was parsed

• Current & original schema name and version number

• Column data type adjustments (DATE -> INT, etc.)

Allows us to always recreate logical schema representations

from physical files, re-migrate files if a data migration step

had a bug, etc.
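The round trip from file metadata back to a logical schema can be sketched as follows. The keys and the `"DATE->INT"` notation are invented for illustration; Parquet itself just stores arbitrary key/value strings in the file footer.

```python
import json

def file_metadata(schema, version, original_version, tz, adjustments):
    """Serialize the file-level metadata described above as JSON, the
    kind of string that can be stored as Parquet footer metadata."""
    return json.dumps({
        "schema": schema,
        "version": version,
        "original_version": original_version,
        "timezone": tz,
        "type_adjustments": adjustments,  # e.g. {"trade_date": "DATE->INT"}
    })

def logical_type(meta_json, column, physical_type):
    """Recover a column's logical type from the recorded adjustment,
    falling back to the physical type when none was applied."""
    adj = json.loads(meta_json)["type_adjustments"].get(column)
    return adj.split("->")[0] if adj else physical_type

meta = file_metadata("equities", 4, 3, "America/New_York",
                     {"trade_date": "DATE->INT"})
```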

Page 43: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Table Partitioning and Data Management

• Partition hive tables by date

• We have mostly timeseries data and are on a daily cadence

• Partitioning helps query performance

• Use `backticks` when defining column names in SQL

• Column names must be lower case in Parquet

• Correct bad data in Amazon Redshift through SQL, then

UNLOAD partitions for encoding to Parquet

• Our tools and automation make it easy to replace

modified partitions of data in Hive tables

Page 44: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Working with data in S3 and Amazon Redshift

Custom tools developed to make life easier:

• Extract CSV data from various DBs, or UNLOAD from

Amazon Redshift in whole or in segments

• Encode CSVs as Parquet files using a Hive schema

• Write data into the correct directory structure in

Amazon S3

• Allows us to move data between Amazon Redshift and

Amazon S3 easily, and in bulk

Page 45: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Custom Parquet Data Migration Tools

• Read records from previous version of a table

• Reads from the old location in Amazon S3

• Write records using the current version of a table

• Writes to the new location in Amazon S3

• Most migrations are trivial:

• Add new column with some default value (or null)

• Rename columns

• More complicated migrations require Java code

• Track original and current version in file metadata

Page 46: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Review & Future Enhancements

Page 47: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Review

• Motivations for extending an Amazon Redshift warehouse with Amazon EMR

• How our data ingest system operates

• How to query encrypted data in Amazon S3 using Presto and other Hadoop-ecosystem tools

• How to manage schemas and data migrations

Page 48: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Lessons Learned: TL;DR

• Manage storage and compute separately

• It’s OK to be paranoid about data loss!

• Amazon S3 encryption is easy and seek() works

• Parquet vs. ORC

• Partition and version your tables

• Manage logical and physical table schemas

• Data management tools & automation are important

Page 49: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Future Enhancements

• Archive original source data for SEC Rule 17a-4 compliance

(using Amazon Glacier Vault Lock)

• Decouple data retrieval and processing tasks

• Move ingest processing to Amazon EC2/Amazon ECS

• Move workflow engine DB to Amazon Aurora

• Leveraging other query frameworks: Spark, ML, etc.

• Near real-time streaming ingest

• More data sources

Page 50: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Related Sessions

Page 51: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Remember to complete

your evaluations!

Page 52: (BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift

Thank you!