Big Data and Analytics on AWS
Big Data & Analytics
Randall Barnes - Bill Moritz - Kevin Dillon
Today’s Session Objectives
• Describe core concepts, common objectives and lessons learned
• Present specific platforms and products available in AWS
• Provide live, hands-on deployment experience
Big Data applications are defined as having data volume, variety or velocity characteristics that render traditional tools and processes impractical
Great potential…
• Keep pace with the accelerating information explosion
• New insights and analytics to improve business decisions
• Create new applications requiring massive real-time data processing
…and at times challenging
• Unpredictable resource demand
• Job orchestration and management complexities
• Geo-distribution of data sources
Why Big Data solutions have worked well in the cloud
Reduce costs per workload, saving money and creating opportunities
Extremely flexible: ability to provide answers to analytics questions that don't yet exist
Collect: Kinesis, Import/Export, Direct Connect
Store: S3, Glacier, DynamoDB
Analyze: Redshift, EMR, EC2, Machine Learning
Three types of data-driven development
• Retrospective: analysis and reporting (Amazon Redshift, Amazon EMR)
• Here-and-now: real-time processing and dashboards (Amazon Kinesis, Amazon EC2, AWS Lambda)
• Predictions: to enable smart applications (Amazon Machine Learning, Amazon EMR)
Core Principles for Successful Implementations
Elastic resource capacity
• Data storage, I/O and computing resources scale on demand
• Dynamically support multiple ephemeral environments such as dev, test and QA validations
• No up-front capital expenses; pay only for what you use
Streamlined management of platforms and solutions
• Raw infrastructure resources
• Application stacks
Well-supported ecosystem of tools and applications
• Data integration tools
• Analytics and reporting applications
• Resource and job orchestration
What is changing…
Diverse and non-traditional workloads
• Using big data strategies, tools and products to solve problems that have not traditionally been viewed as big data
Leverage managed solutions to reduce complexity and staff constraints
• AWS-managed platforms
• 3rd-party frameworks
Making The Cloud Work For Your Enterprise
Most common implementation challenges:
• Managing distributed data sets
• Application platform migration – limited resources
• ETL integration, especially leveraging existing IP and business logic
Making The Cloud Work For Your Enterprise
TCO Mistakes
Overprovisioning
• High I/O storage space for non-active data sets
• Non-linear cost increase for certain instance types
Static resources
• Low overall utilization
• Not leveraging spot instance pricing
Not leveraging Reserved Instance (RI) price strategies
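The "static resources" and "low overall utilization" points above can be illustrated with simple arithmetic. The sketch below compares a fleet sized statically for peak demand against one that follows demand hour by hour. The hourly rate, the demand profile and all function names are hypothetical values chosen for illustration, not AWS prices.

```python
def static_cost(peak_instances, hours, hourly_rate):
    """Cost of keeping a peak-sized fleet running for every hour."""
    return peak_instances * hours * hourly_rate


def elastic_cost(demand_by_hour, hourly_rate):
    """Cost when the fleet size matches demand in each hour."""
    return sum(demand_by_hour) * hourly_rate


def utilization(demand_by_hour, peak_instances):
    """Fraction of the static fleet's capacity that is actually used."""
    return sum(demand_by_hour) / (peak_instances * len(demand_by_hour))


# Hypothetical day: 2 instances needed for 20 hours, 10 for a 4-hour batch window.
demand = [2] * 20 + [10] * 4
rate = 0.50  # hypothetical USD per instance-hour

print(static_cost(max(demand), len(demand), rate))
print(elastic_cost(demand, rate))
print(utilization(demand, max(demand)))
```

Under these assumed numbers the static fleet is used about a third of the time, which is the gap that on-demand scaling, spot pricing or reserved capacity is meant to close.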
Example 1: MPP Data Warehouse
Bulk Transfer
Reporting and Analytics
Data Stage and Archive
Data Warehouse
Example 1: Elastic Data Warehouse
S3 Glacier
Redshift
Import/Export Service
Direct Connect
Data Collection, Ingestion and Consumption
Import/Export Service
• Ship storage devices directly to Amazon
• Transfer to EBS or S3
• Up to 4TB per device
Direct Connect
• Higher bandwidth, more consistent performance
• 1Gbps and 10Gbps ports [network providers may offer slice]
Amazon Simple Storage Service (S3)
Object storage container with virtually unlimited capacity
• Store files (objects) in containers (buckets)
• Redundant copies for high durability and reliability
• Available on the internet via REST requests directly or through SDK
• Multiple strategies to secure contents
• Set permissions, access policies and optionally require MFA
• Encryption: server-side (simplified) or client-side
• Audit logging (optional) will record all access requests made via the API
• Built-in tools for managing versioning, object lifecycle and creating static websites
• Low pay-as-you-go pricing: a function of storage amount (~$0.03/GB/month) plus metered I/O requests
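As a rough illustration of the pricing bullet above, a monthly bill can be sketched as a storage term plus a request term. The ~$0.03/GB/month figure is from the slide. The per-request rate and the function name are placeholders assumed for illustration, not published prices.

```python
def estimate_s3_monthly_cost(stored_gb, requests,
                             storage_rate_per_gb=0.03,
                             request_rate_per_1000=0.005):
    """Return an approximate monthly cost in USD.

    storage_rate_per_gb is the slide's approximate figure.
    request_rate_per_1000 is a hypothetical placeholder rate.
    """
    if stored_gb < 0 or requests < 0:
        raise ValueError("stored_gb and requests must be non-negative")
    storage_cost = stored_gb * storage_rate_per_gb
    request_cost = (requests / 1000) * request_rate_per_1000
    return round(storage_cost + request_cost, 2)


# 5 TB stored and 2 million requests in one month, under the assumed rates.
print(estimate_s3_monthly_cost(5 * 1024, 2_000_000))
```

The sketch ignores data transfer charges and storage tiers; it is only meant to show that cost tracks what is stored and requested rather than what is provisioned.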
Amazon Redshift
Scalable, fully-managed Data Warehouse
• High performance, massively parallel columnar storage architecture providing streamlined scalability
• Mainstream SQL query syntax (PostgreSQL) allowing for rapid platform adoption
• Flexible node type and RI options allowing for workload alignment and cost efficiency
• Integrated with other AWS Big Data Platforms (S3, EMR, DynamoDB, Data Pipeline)
• Streamlined administrative tasks (snapshot/restore, node increase/decrease)
Recap: Elastic Data Warehouse
S3 Glacier
Redshift
Import/Export Service
Direct Connect
Example 2: Real-Time Data Streaming and NoSQL
Data Warehouse
Application Tier
Backend Apps
Real-Time Processing NoSQL
Example 2: Real-Time Data Streaming and NoSQL
Data Warehouse
Application Tier
Backend Apps
DynamoDB
Kinesis
Amazon Kinesis
Scalable real-time diverse data processing
• Fully managed service
• Real-time log/application data ingestion and transformations
• Real-time reporting and analytics
• Data ordering, deterministic routing and replay (up to 24 hours)
• Records: Partition Key, Sequence Number, Data Blob (payload)
• Shards: Units of incremental throughput capacity
• Use SDK APIs for PUT/GET operations
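The deterministic routing and per-shard ordering mentioned above can be sketched without calling AWS at all. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The in-memory `Stream` class, the even split of the hash space and the shard names below are assumptions for illustration, and the per-shard counter is a simplification of real sequence numbers.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def build_shards(count):
    """Split the hash space into `count` contiguous ranges (assumed even split)."""
    step = HASH_SPACE // count
    shards = []
    for i in range(count):
        start = i * step
        end = HASH_SPACE - 1 if i == count - 1 else (i + 1) * step - 1
        shards.append((f"shard-{i}", start, end))
    return shards


def route(partition_key, shards):
    """Return the name of the shard whose range contains the key's MD5 hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for name, start, end in shards:
        if start <= hashed <= end:
            return name
    raise RuntimeError("hash ranges do not cover the hash space")


class Stream:
    """Hypothetical in-memory stand-in for a stream: ordered records per shard."""

    def __init__(self, shard_count):
        self.shards = build_shards(shard_count)
        self.records = {name: [] for name, _, _ in self.shards}

    def put(self, partition_key, data):
        shard = route(partition_key, self.shards)
        sequence_number = len(self.records[shard])  # simplified: counter per shard
        self.records[shard].append((sequence_number, partition_key, data))
        return shard, sequence_number


stream = Stream(shard_count=4)
print(stream.put("user-42", b"login"))
print(stream.put("user-42", b"click"))  # same shard, next sequence number
```

Because the same key always hashes to the same shard, all records for one key stay in order relative to each other, which is what makes replay meaningful.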
Amazon DynamoDB
• Seamless and virtually unlimited scalability; managed automatically
• Ability to define specific resource allocation limits
• Easy administration and well-supported development model
• Integration with other core Amazon data services
• GET/PUT operations with a user-defined Primary Key
• Tables contain items (PK + Attributes) up to 400KB
• Data Types: Scalar, Set (collections), key-values, documents
• Secondary Indexes (Global and Local)
• Provisioned read- and write-throughput, SSD storage
Challenge: Proprietary API via AWS SDKs (e.g. Java, .NET)
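Provisioned throughput is easier to reason about with a small calculation. The sketch below estimates read and write capacity units from item size and request rate. The 400KB item limit is from the slide. The 4 KB read unit, 1 KB write unit and the halving for eventually consistent reads reflect DynamoDB's documented capacity-unit definitions as best understood here, so treat them as assumptions and check current documentation before sizing a table. Function names are hypothetical.

```python
import math

MAX_ITEM_KB = 400   # item size limit stated on the slide
READ_UNIT_KB = 4    # assumed: one read unit covers up to 4 KB per second
WRITE_UNIT_KB = 1   # assumed: one write unit covers up to 1 KB per second


def _check(item_kb):
    if not 0 < item_kb <= MAX_ITEM_KB:
        raise ValueError(f"item size must be in (0, {MAX_ITEM_KB}] KB")


def read_capacity_units(item_kb, reads_per_second, strongly_consistent=True):
    """Estimate read units; eventually consistent reads cost half as much."""
    _check(item_kb)
    units = math.ceil(item_kb / READ_UNIT_KB) * reads_per_second
    if not strongly_consistent:
        units = math.ceil(units / 2)
    return units


def write_capacity_units(item_kb, writes_per_second):
    """Estimate write units for items of the given size."""
    _check(item_kb)
    return math.ceil(item_kb / WRITE_UNIT_KB) * writes_per_second


print(read_capacity_units(6, 100))    # 6 KB items rounded up to two 4 KB units
print(write_capacity_units(2.5, 50))  # 2.5 KB items rounded up to three 1 KB units
```

Item sizes round up to the next unit boundary, so keeping items small has a direct effect on the throughput that must be provisioned.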
Recap: Real-Time Data Streaming and NoSQL
Data Warehouse
Application Tier
Backend Apps
DynamoDB
Kinesis
Example 3: Hadoop Workloads
Data Warehouse
Application and Data Stage Tiers
Analytics
Hadoop Processing
Example 3: Hadoop Workloads
Data Warehouse
Application and Data Stage Tiers
Analytics
EMR
Amazon Elastic MapReduce (EMR)
• Semi-managed service (access to underlying OS)
• Apache Hadoop Framework
• Robust, streamlined management for MapReduce jobs
• Simple API for popular extensions, e.g. Hive, Pig, Spark
• Spot Instance pricing available
• HDFS or S3 storage
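For readers new to the MapReduce model that EMR manages, the classic word count can be sketched in plain Python. This runs in a single process and only mimics the map, shuffle and reduce phases; it does not use Hadoop or any EMR API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big analytics"]))
```

On a cluster the map and reduce calls run in parallel across nodes and the input lines would come from HDFS or S3, but the shape of the job is the same.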
Your Data + Machine Learning = Smart Applications
Analytics and Reporting
• Broad Vendor Integration
• Reports, Dashboards, BI
Analytics and Reporting
Fast and easy to create
• Reports
• Dashboards
• Near-time analytics decisions
Recap: Hadoop Workloads
Data Warehouse
Application and Data Stage Tiers
Analytics
EMR
Example 4: Machine learning
Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available
Your Data + Machine Learning = Smart Applications
Easy to use, managed machine learning service built for developers
Robust, powerful machine learning technology based on Amazon’s internal systems
Create models using your data already stored in the AWS cloud (S3 files, Redshift query, MySQL RDS query)
Deploy models to production in seconds
Amazon ML
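To make "finds patterns in your data and uses them to make predictions" concrete, here is a minimal sketch of a binary classifier trained on a tiny made-up data set. It uses plain logistic regression with stochastic gradient descent. The slides do not describe how Amazon ML works internally, so this is a generic illustration and not Amazon's method or API; the features, labels and function names are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=2000, learning_rate=0.5):
    """Fit weights and a bias by stochastic gradient descent on log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            score = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(model, features):
    """Return the estimated probability that the label is 1."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Invented orders: [scaled order amount, new-account flag]; label 1 means fraudulent.
orders = [[0.1, 0.0], [0.2, 0.0], [0.9, 1.0], [0.8, 1.0]]
is_fraud = [0, 0, 1, 1]

model = train(orders, is_fraud)
print(predict(model, [0.15, 0.0]))  # small order, established account
print(predict(model, [0.85, 1.0]))  # large order, new account
```

The training step is the "find patterns" half and `predict` is the "new data points" half; a managed service wraps both behind data source, model and endpoint concepts.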
Smart applications by example
Based on what you know about an order:
Is this order fraudulent?
Based on what you know about the user:
Will they use your product?
Based on what you know about a news article:
What other articles are interesting?
And a few more examples…
Fraud detection: Detecting fraudulent transactions, filtering spam emails, flagging suspicious reviews, …
Personalization: Recommending content, predictive content loading, improving user experience, …
Targeted marketing: Matching customers and offers, choosing marketing campaigns, cross-selling and up-selling, …
Content classification: Categorizing documents, matching hiring managers and resumes, …
Churn prediction: Finding customers who are likely to stop using the service, free-tier upgrade targeting, …
Customer support: Predictive routing of customer emails, social media listening, …
Securing Data in the Cloud
• Secure your AWS console root account
• Use complex passwords and rotate regularly
• Secure data stage locations
Thank You. Questions?
Contact Us
Contact Info
Randall Barnes, Principal Architect, 2nd [email protected]
Bill Moritz, Sr Cloud Engineer, 2nd [email protected]
2nd Watch, [email protected]
Locations
SEATTLE, NEW YORK, VIRGINIA, ATLANTA, PHILADELPHIA, HOUSTON, LIBERTY LAKE, LOS ANGELES, CHICAGO