
Big Data & Analytics

Randall Barnes - Bill Moritz - Kevin Dillon

Today’s Session Objectives

• Describe core concepts, common objectives and lessons learned

• Present specific platforms and products available in AWS

• Provide live, hands-on deployment experience

Big Data applications are defined as those whose data volume, variety, or velocity render traditional tools and processes impractical

Great potential…
• Keep pace with the accelerating information explosion
• New insights and analytics to improve business decisions
• Create new applications requiring massive real-time data processing

…and at times challenging
• Unpredictable resource demand
• Job orchestration and management complexities
• Geo-distribution of data sources

Why Big Data solutions have worked well in the cloud

• Reduce costs per workload, saving money and creating opportunities

• Extremely flexible: the ability to provide answers to analytics questions that don't yet exist

[Diagram: Collect (Kinesis, Import/Export, Direct Connect) → Store (S3, Glacier, DynamoDB) → Analyze (EMR, EC2, Redshift, Machine Learning)]

Three types of data-driven development

• Retrospective analysis and reporting: Amazon Redshift, Amazon EMR

• Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda

• Predictions to enable smart applications: Amazon Machine Learning, Amazon EMR

Core Principles for Successful Implementations

Elastic resource capacity

• Data storage, I/O and computing resources scale on demand
• Dynamically support multiple ephemeral environments such as dev, test and QA validations
• No up-front capital expenses; pay only for what you use

Streamlined management of platforms and solutions
• Raw infrastructure resources
• Application stacks

Well-supported ecosystem of tools and applications
• Data integration tools
• Analytics and reporting applications
• Resource and job orchestration

What is changing…

Diverse and non-traditional workloads
• Using big data strategies, tools and products to solve problems that have not traditionally been viewed as big data

Leverage managed solutions to reduce complexity and staff constraints

• AWS-managed platforms
• 3rd-party frameworks

Making The Cloud Work For Your Enterprise

Most common implementation challenges:

• Managing distributed data sets

• Application platform migration – limited resources

• ETL integration, especially leveraging existing IP and business logic

Making The Cloud Work For Your Enterprise

TCO Mistakes

Overprovisioning
• High I/O storage space for non-active data sets
• Non-linear cost increase for certain instance types

Static resources
• Low overall utilization
• Not leveraging Spot Instance pricing

Not leveraging Reserved Instance (RI) price strategies
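To make the Spot point concrete, here is a minimal sketch of requesting Spot capacity with boto3; the bid price, AMI ID, key pair and instance type are placeholders, not values from the talk.

import boto3

ec2 = boto3.client("ec2")

# Request a one-time Spot Instance instead of paying the On-Demand rate;
# the request is fulfilled only while the market price stays below the bid.
response = ec2.request_spot_instances(
    SpotPrice="0.10",                      # maximum hourly bid in USD (placeholder)
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-12345678",         # placeholder AMI
        "InstanceType": "m3.xlarge",
        "KeyName": "example-keypair",      # placeholder key pair
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])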

Example 1: MPP Data Warehouse

[Diagram: Bulk Transfer → Data Stage and Archive → Data Warehouse → Reporting and Analytics]

Example 1: Elastic Data Warehouse

[Diagram: Import/Export Service, Direct Connect, S3, Glacier, Redshift]

Data Collection, Ingestion and Consumption

AWS Import/Export Service
• Ship storage devices directly to Amazon
• Transfer to EBS or S3
• Up to 4TB per device

AWS Direct Connect
• Higher bandwidth, more consistent performance
• 1Gbps and 10Gbps ports (network providers may offer slices)

Amazon Simple Storage Service (S3)
Object storage container with virtually unlimited capacity

• Store files (objects) in containers (buckets)

• Redundant copies for high durability and reliability

• Available on the internet via REST requests directly or through the SDKs

• Multiple strategies to secure contents

• Set permissions, access policies and optionally require MFA

• Encryption: Server (simplified) or Client-side

• Optional audit logging records all access requests via the API

• Built-in tools for managing versioning, object lifecycle and creating static websites

• Low pay-as-you-go pricing, a function of storage amount (~$0.03/GB/month) plus metering of I/O requests
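As a minimal sketch of the PUT/GET model described above (the bucket and key names are placeholders, not values from the talk):

import boto3

s3 = boto3.client("s3")

# Upload a file with server-side encryption (SSE-S3) enabled.
with open("events.json", "rb") as f:
    s3.put_object(
        Bucket="example-analytics-bucket",   # placeholder bucket
        Key="raw/events/2017-02-13.json",
        Body=f,
        ServerSideEncryption="AES256",
    )

# Read the object back with a GET request.
obj = s3.get_object(Bucket="example-analytics-bucket",
                    Key="raw/events/2017-02-13.json")
data = obj["Body"].read()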

Amazon Redshift

• High performance, massively parallel columnar storage architecture providing streamlined scalability

• Mainstream SQL query syntax (PostgreSQL) allowing for rapid platform adoption

• Flexible node type and RI options allowing for workload alignment and cost efficiency

• Integrated with other AWS Big Data Platforms (S3, EMR, DynamoDB, Data Pipeline)

• Streamlined administrative tasks (snapshot/restore, Node increase/decrease)

Scalable, fully-managed Data Warehouse
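Because Redshift speaks the PostgreSQL wire protocol and integrates with S3, a typical load looks like the sketch below; the cluster endpoint, table, bucket and IAM role are assumptions for illustration.

import psycopg2  # any PostgreSQL driver works against Redshift

# Connect to the cluster over its PostgreSQL-compatible endpoint (placeholder values).
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="********",
)

# Bulk-load staged S3 files with the COPY command, then query with plain SQL.
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY page_views
        FROM 's3://example-analytics-bucket/stage/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """)
    cur.execute("SELECT COUNT(*) FROM page_views;")
    print(cur.fetchone()[0])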

Recap: Elastic Data Warehouse

[Diagram: Import/Export Service, Direct Connect, S3, Glacier, Redshift]

Example 2: Real-Time Data Streaming and NoSQL

[Diagram: Application Tier, Real-Time Processing, NoSQL, Backend Apps, Data Warehouse]

Example 2: Real-Time Data Streaming and NoSQL

[Diagram: Application Tier, Kinesis, DynamoDB, Backend Apps, Data Warehouse]

Amazon Kinesis
• Fully managed service
• Real-time log/application data ingestion and transformations
• Real-time reporting and analytics
• Data ordering, deterministic routing and replay (up to 24 hours)
• Records: Partition Key, Sequence Number, Data Blob (payload)
• Shards: units of incremental throughput capacity
• Use SDK APIs for PUT/GET operations

Scalable, real-time, diverse data processing
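A minimal sketch of those PUT/GET operations with boto3 follows; the stream name, shard ID and partition-key scheme are assumptions.

import json
import boto3

kinesis = boto3.client("kinesis")

# PUT: the partition key determines which shard receives the record.
kinesis.put_record(
    StreamName="example-clickstream",          # placeholder stream
    Data=json.dumps({"user_id": 42, "action": "page_view"}).encode("utf-8"),
    PartitionKey="user-42",
)

# GET: read one shard from its oldest retained record (replay window is limited).
shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-clickstream",
    ShardId="shardId-000000000000",            # placeholder shard
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)["Records"]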

Amazon DynamoDB
• Seamless and virtually unlimited scalability; managed automatically
• Ability to define specific resource allocation limits
• Easy administration and well-supported development model
• Integration with other core Amazon data services
• GET/PUT operations with a user-defined Primary Key
• Tables contain items (PK + Attributes) up to 400KB
• Data Types: Scalar, Set (collections), key-values, documents
• Secondary Indexes (Global and Local)
• Provisioned read- and write-throughput, SSD storage

Challenge: Proprietary API via AWS SDKs (e.g. Java, .NET)
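For example, with the Python SDK the GET/PUT model looks like the sketch below; the table name and key schema are assumptions.

import boto3

# Assumes a table "example-user-sessions" with partition key "user_id" already exists.
table = boto3.resource("dynamodb").Table("example-user-sessions")

# PUT: an item is the primary key plus arbitrary attributes (up to 400KB).
table.put_item(Item={
    "user_id": "42",
    "session_start": "2017-02-13T10:00:00Z",
    "pages_viewed": 7,
})

# GET by primary key (returns no "Item" key if nothing matches).
item = table.get_item(Key={"user_id": "42"}).get("Item")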

Recap: Real-Time Data Streaming and NoSQL

[Diagram: Application Tier, Kinesis, DynamoDB, Backend Apps, Data Warehouse]

Example 3: Hadoop Workloads

[Diagram: Application and Data Stage Tiers, Hadoop Processing, Analytics, Data Warehouse]

Example 3: Hadoop Workloads

[Diagram: Application and Data Stage Tiers, EMR, Analytics, Data Warehouse]

Amazon Elastic MapReduce (EMR)

• Semi-managed service (access to underlying OS)

• Apache Hadoop Framework

• Robust, streamlined management for Map-Reduce jobs

• Simple APIs for popular extensions, e.g. Hive, Pig, Spark

• Spot Instance pricing available

• HDFS or S3 storage
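A minimal sketch of launching a cluster with a Spark step through boto3; the release label, instance sizing and the S3 script location are assumptions.

import boto3

emr = boto3.client("emr")

# Launch a transient cluster that runs one Spark step and then terminates.
response = emr.run_job_flow(
    Name="example-analytics-cluster",
    ReleaseLabel="emr-4.3.0",                       # placeholder release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,       # terminate when the step finishes
    },
    Steps=[{
        "Name": "Example Spark job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-analytics-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])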


Analytics and Reporting
• Broad vendor integration
• Reports, dashboards, BI

Fast and easy to create
• Reports
• Dashboards
• Near-time analytics decisions

Recap: Hadoop Workloads

[Diagram: Application and Data Stage Tiers, EMR, Analytics, Data Warehouse]

Example 4: Machine learning

Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available

Your Data + Machine Learning = Smart Applications

Easy to use, managed machine learning service built for developers

Robust, powerful machine learning technology based on Amazon’s internal systems

Create models using your data already stored in the AWS cloud (S3 files, Redshift query, MySQL RDS query)

Deploy models to production in seconds

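Once a model is trained, scoring from an application is a single API call. A minimal, hedged sketch with boto3; the model ID, endpoint URL and feature names are placeholders.

import boto3

ml = boto3.client("machinelearning")

# Score one record against a real-time endpoint, e.g. for the fraud example below.
prediction = ml.predict(
    MLModelId="ml-EXAMPLEMODELID",                 # placeholder model ID
    Record={"order_total": "149.99", "country": "US", "account_age_days": "3"},
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(prediction["Prediction"])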

Smart applications by example

Based on what you know about an order:

Is this order fraudulent?

Based on what you know about the user:

Will they use your product?

Based on what you know about a news article:

What other articles are interesting?

And a few more examples…

• Fraud detection: detecting fraudulent transactions, filtering spam emails, flagging suspicious reviews, …

• Personalization: recommending content, predictive content loading, improving user experience, …

• Targeted marketing: matching customers and offers, choosing marketing campaigns, cross-selling and up-selling, …

• Content classification: categorizing documents, matching hiring managers and resumes, …

• Churn prediction: finding customers who are likely to stop using the service, free-tier upgrade targeting, …

• Customer support: predictive routing of customer emails, social media listening, …

Securing Data in the Cloud

• Secure your AWS console root account

• Use complex passwords and rotate regularly

• Secure data stage locations
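As one concrete example of the password guidance, an account-wide policy can be set with boto3's IAM client; the specific thresholds below are assumptions, not recommendations from the talk.

import boto3

iam = boto3.client("iam")

# Enforce complex passwords and regular rotation for all IAM users in the account.
iam.update_account_password_policy(
    MinimumPasswordLength=14,
    RequireSymbols=True,
    RequireNumbers=True,
    RequireUppercaseCharacters=True,
    RequireLowercaseCharacters=True,
    MaxPasswordAge=90,           # force rotation every 90 days
    PasswordReusePrevention=5,   # block reuse of the last 5 passwords
)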

Thank You. Questions?

Contact Us

Contact Info
Randall Barnes, Principal Architect, 2nd Watch: rbarnes@2ndwatch.com
Bill Moritz, Sr Cloud Engineer, 2nd Watch: bmoritz@2ndwatch.com
2nd Watch, Inc. | 1-888-317-7920 | info@2ndwatch.com | www.2ndwatch.com

Locations: Seattle, New York, Virginia, Atlanta, Philadelphia, Houston, Liberty Lake, Los Angeles, Chicago
