Big Data and Analytics on AWS

Big Data & Analytics Randall Barnes - Bill Moritz - Kevin Dillon

Upload: 2nd-watch

Post on 13-Feb-2017

Category:

Technology


TRANSCRIPT

Page 1: Big data and Analytics on AWS

Big Data & Analytics

Randall Barnes - Bill Moritz - Kevin Dillon

Page 2: Big data and Analytics on AWS

Today’s Session Objectives

• Describe core concepts, common objectives and lessons learned

• Present specific platforms and products available in AWS

• Provide live, hands-on deployment experience

Page 3: Big data and Analytics on AWS

Big Data applications are those whose data volume, variety or velocity renders traditional tools and processes impractical

Great potential…

• Keep pace with the accelerating information explosion

• New insights and analytics to improve business decisions

• Create new applications requiring massive real-time data processing

…and at times challenging

• Unpredictable resource demand

• Job orchestration and management complexities

• Geo-distribution of data sources

Page 4: Big data and Analytics on AWS

Why Big Data solutions have worked well in the cloud

• Reduce costs per workload, saving money and creating opportunities

• Extremely flexible: ability to provide answers to analytics questions that don't yet exist

Page 5: Big data and Analytics on AWS

Collect: Kinesis, Import/Export, Direct Connect

Store: Glacier, S3, DynamoDB

Analyze: Redshift, EMR, EC2, Machine Learning

Page 6: Big data and Analytics on AWS

Three types of data-driven development

Retrospective analysis and reporting: Amazon Redshift, Amazon EMR

Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda

Predictions to enable smart applications: Amazon Machine Learning, Amazon EMR

Page 7: Big data and Analytics on AWS

Core Principles for Successful Implementations

Elastic resource capacity

• Data storage, I/O and computing resources scale on demand

• Dynamically support multiple ephemeral environments such as dev, test and QA validations

• No up-front capital expenses; pay only for what you use

Streamlined management of platforms and solutions

• Raw infrastructure resources

• Application stacks

Well-supported ecosystem of tools and applications

• Data integration tools

• Analytics and reporting applications

• Resource and job orchestration

Page 8: Big data and Analytics on AWS

What is changing…

Diverse and non-traditional workloads

• Using big data strategies, tools and products to solve problems that have not traditionally been viewed as big data

Leverage managed solutions to reduce complexity and staff constraints

• AWS-managed platforms

• 3rd-party frameworks

Making The Cloud Work For Your Enterprise

Page 9: Big data and Analytics on AWS

Most common implementation challenges:

• Managing distributed data sets

• Application platform migration – limited resources

• ETL integration, especially leveraging existing IP and business logic

Making The Cloud Work For Your Enterprise

Page 10: Big data and Analytics on AWS

TCO Mistakes

Overprovisioning

• High I/O storage space for non-active data sets

• Non-linear cost increase for certain instance types

Static resources

• Low overall utilization

• Not leveraging spot instance pricing

Not leveraging Reserved Instance (RI) price strategies
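The cost of static resources with low utilization can be made concrete with a little arithmetic. The sketch below compares a cluster left running all month against one that runs only while it has work to do. The hourly rate and the number of busy hours are hypothetical figures chosen for illustration, not AWS prices.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(hourly_rate, hours_running):
    """Cost of one instance that is billed for every hour it runs."""
    return hourly_rate * hours_running


# Hypothetical figures, for illustration only.
on_demand_rate = 0.50  # USD per instance-hour (assumed)
busy_hours = 146       # hours per month the instance does useful work (assumed)

static_cost = monthly_cost(on_demand_rate, HOURS_PER_MONTH)   # always on
elastic_cost = monthly_cost(on_demand_rate, busy_hours)       # on only when busy
utilization = busy_hours / HOURS_PER_MONTH

print(f"utilization of the static instance: {utilization:.0%}")
print(f"static cost:  ${static_cost:.2f}")
print(f"elastic cost: ${elastic_cost:.2f}")
```

With these assumed numbers the always-on instance is busy 20% of the time, so four fifths of its bill pays for idle capacity. Spot or Reserved Instance pricing would change the hourly rate rather than the hours billed.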

Page 11: Big data and Analytics on AWS

Example 1: MPP Data Warehouse

Bulk Transfer

Reporting and Analytics

Data Stage and Archive

Data Warehouse

Page 12: Big data and Analytics on AWS

Example 1: Elastic Data Warehouse

S3 Glacier

Redshift

Import/Export Service

Direct Connect

Page 13: Big data and Analytics on AWS

Data Collection, Ingestion and Consumption

Import/Export Service

• Ship storage devices directly to Amazon

• Transfer to EBS or S3

• Up to 4TB per device

Direct Connect

• Higher bandwidth, more consistent performance

• 1Gbps and 10Gbps ports [network providers may offer slice]
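Whether to ship a device or send data over a dedicated link often comes down to transfer time. The following is a rough sketch of that estimate for the port speeds mentioned on this slide. The `efficiency` parameter is an assumption added for illustration (real links rarely sustain their nominal rate), and the function name is hypothetical.

```python
TB = 10**12   # bytes in a terabyte (decimal)
GBPS = 10**9  # bits per second in one gigabit per second


def transfer_hours(data_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move data_bytes over a link at the given usable fraction."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 3600


# A 4TB data set over the two port sizes, at the nominal line rate.
hours_1g = transfer_hours(4 * TB, 1 * GBPS)
hours_10g = transfer_hours(4 * TB, 10 * GBPS)

print(f"4TB over 1Gbps:  {hours_1g:.1f} hours")
print(f"4TB over 10Gbps: {hours_10g:.1f} hours")
```

At the nominal rate, 4TB takes roughly 8.9 hours on a 1Gbps port and under an hour on a 10Gbps port. Passing a lower `efficiency` gives a more conservative estimate.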

Page 14: Big data and Analytics on AWS

Amazon Simple Storage Service (S3)

Object storage container with virtually unlimited capacity

• Store files (objects) in containers (buckets)

• Redundant copies for high durability and reliability

• Available on the internet via REST requests directly or through SDK

• Multiple strategies to secure contents

• Set permissions, access policies and optionally require MFA

• Encryption: server-side (simplified) or client-side

• Audit logging (optional) records all access requests made via the API

• Built-in tools for managing versioning, object lifecycle and creating static websites

• Low pay-as-you-go pricing: a function of the amount stored (~$0.03/GB/month) plus metered I/O requests
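The pricing bullet above describes a two-part bill: storage plus requests. A minimal sketch of that calculation follows. The storage rate is the approximate figure from the slide; the per-request rate is a placeholder assumption, since the slide gives no number for it, and actual rates vary by region and request type.

```python
def s3_monthly_cost(stored_gb, requests,
                    storage_rate=0.03,             # USD per GB-month, from the slide (approximate)
                    rate_per_1000_requests=0.005): # USD, hypothetical placeholder
    """Estimate a monthly bill as storage charge plus metered request charge."""
    storage_charge = stored_gb * storage_rate
    request_charge = (requests / 1000) * rate_per_1000_requests
    return storage_charge + request_charge


# Example: 1,000 GB stored and 2 million requests in a month.
estimate = s3_monthly_cost(stored_gb=1000, requests=2_000_000)
print(f"estimated monthly cost: ${estimate:.2f}")
```

Data transfer out of AWS is billed separately and is left out of this sketch.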

Page 15: Big data and Analytics on AWS

Amazon Redshift

• High performance, massively parallel columnar storage architecture providing streamlined scalability

• Mainstream SQL query syntax (PostgreSQL) allowing for rapid platform adoption

• Flexible node type and RI options allowing for workload alignment and cost efficiency

• Integrated with other AWS Big Data Platforms (S3, EMR, DynamoDB, Data Pipeline)

• Streamlined administrative tasks (snapshot/restore, Node increase/decrease)

Scalable, fully-managed Data Warehouse
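Why columnar storage helps analytic queries can be shown with a toy example. The sketch below is not Redshift's implementation; it only contrasts a row layout with a column layout for the same hypothetical order data, and shows that an aggregate over one column needs to touch only that column.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "west", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 75.5},
    {"order_id": 3, "region": "west", "amount": 210.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# SUM(amount) over rows has to walk every record and pick out one field.
total_from_rows = sum(row["amount"] for row in rows)

# SUM(amount) over columns reads a single list and ignores the others.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals are identical; the difference is how much data has to be read to compute them, which matters when a table has many wide columns and a query uses only a few.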

Page 16: Big data and Analytics on AWS

Recap: Elastic Data Warehouse

S3 Glacier

Redshift

Import/Export Service

Direct Connect

Page 17: Big data and Analytics on AWS

Example 2: Real-Time Data Streaming and NoSQL

Data Warehouse

Application Tier

Backend Apps

Real-Time Processing NoSQL

Page 18: Big data and Analytics on AWS

Example 2: Real-Time Data Streaming and NoSQL

Data Warehouse

Application Tier

Backend Apps

DynamoDB Kinesis

Page 19: Big data and Analytics on AWS

Amazon Kinesis

• Fully managed service

• Real-time log/application data ingestion and transformations

• Real-time reporting and analytics

• Data ordering, deterministic routing and replay (up to 24 hours)

• Records: Partition Key, Sequence Number, Data Blob (payload)

• Shards: Units of incremental throughput capacity

• Use SDK APIs for PUT/GET operations

Scalable real-time diverse data processing
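The deterministic routing mentioned above comes from hashing: Kinesis maps each partition key to a 128-bit integer with MD5, and each shard owns a range of that hash key space. The sketch below imitates that mapping locally. Splitting the space evenly across shards is an assumption made for illustration (a real stream's ranges depend on how its shards were split or merged), and the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 yields 128-bit values


def shard_ranges(shard_count):
    """Split the hash key space evenly into inclusive (start, end) ranges."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def route(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key falls outside every shard range")


ranges = shard_ranges(4)
for key in ["user-1", "user-2", "user-1"]:
    print(key, "-> shard", route(key, ranges))
```

Records that share a partition key always land on the same shard, which is what preserves their relative order. Adding shards increases throughput but changes the ranges, so keys may move.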

Page 20: Big data and Analytics on AWS

Amazon DynamoDB

• Seamless and virtually unlimited scalability; managed automatically

• Ability to define specific resource allocation limits

• Easy administration and well-supported development model

• Integration with other core Amazon data services

• GET/PUT operations with a user-defined Primary Key

• Tables contain items (PK + Attributes) up to 400KB

• Data Types: Scalar, Set (collections), key-values, documents

• Secondary Indexes (Global and Local)

• Provisioned read- and write-throughput, SSD storage

Challenge: Proprietary API via AWS SDKs (e.g. Java, .NET)
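Provisioned throughput is sized in capacity units. The slide does not spell out the unit sizes; the figures used below come from the DynamoDB documentation rather than from this deck: one read unit covers one strongly consistent read per second of an item up to 4KB (an eventually consistent read costs half), and one write unit covers one write per second of an item up to 1KB. The helper names are hypothetical.

```python
import math

MAX_ITEM_KB = 400  # item size limit stated on the slide


def read_capacity_units(item_kb, reads_per_second, strongly_consistent=True):
    """Read units needed for a steady rate of reads of one item size."""
    if item_kb > MAX_ITEM_KB:
        raise ValueError("item exceeds the 400KB limit")
    units_per_read = math.ceil(item_kb / 4)  # reads are metered in 4KB steps
    total = units_per_read * reads_per_second
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_kb, writes_per_second):
    """Write units needed for a steady rate of writes of one item size."""
    if item_kb > MAX_ITEM_KB:
        raise ValueError("item exceeds the 400KB limit")
    return math.ceil(item_kb) * writes_per_second  # writes are metered in 1KB steps


# Example: 6KB items, 10 reads and 10 writes per second.
print(read_capacity_units(6, 10))          # strongly consistent reads
print(read_capacity_units(6, 10, False))   # eventually consistent reads
print(write_capacity_units(6, 10))
```

For 6KB items at 10 operations per second this works out to 20 read units (10 if eventual consistency is acceptable) and 60 write units.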

Page 21: Big data and Analytics on AWS

Recap: Real-Time Data Streaming and NoSQL

Data Warehouse

Application Tier

Backend Apps

DynamoDB Kinesis

Page 22: Big data and Analytics on AWS

Example 3: Hadoop Workloads

Data Warehouse

Application and Data Stage Tiers

Analytics

Hadoop Processing

Page 23: Big data and Analytics on AWS

Example 3: Hadoop Workloads

Data Warehouse

Application and Data Stage Tiers

Analytics

EMR

Page 24: Big data and Analytics on AWS

Amazon Elastic MapReduce (EMR)

• Semi-managed service (access to underlying OS)

• Apache Hadoop Framework

• Robust, streamlined management for MapReduce jobs

• Simple API for popular extensions, e.g. Hive, Pig, Spark

• Spot Instance pricing available

• HDFS or S3 storage
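For readers new to the MapReduce model that EMR manages, the classic word count shows its three steps. This is a single-process sketch in plain Python, not Hadoop or EMR code; on a real cluster the map and reduce phases run in parallel across nodes and the shuffle moves data between them.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big analytics"]))
```

Because each map call sees only one line and each reduce call sees only one key, the work can be divided among as many nodes as the cluster has.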

Page 25: Big data and Analytics on AWS

Analytics and Reporting

• Broad Vendor Integration

• Reports, Dashboards, BI

Page 26: Big data and Analytics on AWS

Analytics and Reporting

Fast and easy to create

• Reports

• Dashboards

• Near-time analytics decisions

Page 27: Big data and Analytics on AWS

Recap: Hadoop Workloads

Data Warehouse

Application and Data Stage Tiers

Analytics

EMR

Page 28: Big data and Analytics on AWS

Example 4: Machine learning

Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available

Your Data + Machine Learning = Smart Applications
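The definition above (find patterns in existing data, then predict for new data points) can be illustrated with a very small classifier. The sketch below uses a nearest-centroid rule, chosen here only because it fits in a few lines; it is not the algorithm Amazon Machine Learning uses. The features (order amount, account age in days) and the training examples are invented for illustration.

```python
def train(examples):
    """Learn one centroid (mean feature vector) per label from labeled examples."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the new data point."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical history: (order amount in USD, account age in days) -> outcome.
history = [
    ((20.0, 400.0), "legitimate"),
    ((35.0, 900.0), "legitimate"),
    ((900.0, 2.0), "fraudulent"),
    ((1200.0, 1.0), "fraudulent"),
]

model = train(history)
print(predict(model, (1000.0, 3.0)))  # large order from a brand-new account
print(predict(model, (25.0, 500.0)))  # small order from an established account
```

The training step is the "finds patterns" half of the definition and the predict step is the "new data points" half. A production model would scale the features and be evaluated on held-out data.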

Page 29: Big data and Analytics on AWS

Easy to use, managed machine learning service built for developers

Robust, powerful machine learning technology based on Amazon’s internal systems

Create models using your data already stored in the AWS cloud (S3 files, Redshift query, MySQL RDS query)

Deploy models to production in seconds

Amazon ML

Page 30: Big data and Analytics on AWS

Smart applications by example

Based on what you know about an order:

Is this order fraudulent?

Based on what you know about the user:

Will they use your product?

Based on what you know about a news article:

What other articles are interesting?

Page 31: Big data and Analytics on AWS

And a few more examples…

Fraud detection: Detecting fraudulent transactions, filtering spam emails, flagging suspicious reviews, …

Personalization: Recommending content, predictive content loading, improving user experience, …

Targeted marketing: Matching customers and offers, choosing marketing campaigns, cross-selling and up-selling, …

Content classification: Categorizing documents, matching hiring managers and resumes, …

Churn prediction: Finding customers who are likely to stop using the service, free-tier upgrade targeting, …

Customer support: Predictive routing of customer emails, social media listening, …

Page 32: Big data and Analytics on AWS

Securing Data in the Cloud

• Secure your AWS console root account

• Use complex passwords and rotate regularly

• Secure data stage locations

Page 33: Big data and Analytics on AWS

Thank You. Questions?

Page 34: Big data and Analytics on AWS

Contact Us

Contact Info

Randall Barnes

Principal Architect, 2nd [email protected]

Bill Moritz

Sr Cloud Engineer, 2nd [email protected]

2nd Watch, [email protected]

Locations

SEATTLE, NEW YORK, VIRGINIA, ATLANTA, PHILADELPHIA, HOUSTON, LIBERTY LAKE, LOS ANGELES, CHICAGO