B3 - Business Intelligence Apps on AWS
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Business Intelligence Applications on AWS
Steffen Krause, Amazon Web Services
@sk_bln
Overview
Designing BI & big data solutions in the cloud
Not the only way to do it (but one that we have seen)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Data App App
http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
Data has gravity
Compute Storage Big Data
Data App App
latency Throughput
…and inertia at volume…
Compute Storage Big Data
Data
…easier to move applications to the data
Compute Storage Big Data
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
S3
Getting your Data into AWS
Amazon S3
Corporate Data Center
• Console Upload
• FTP
• AWS Import Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd Party Commercial Apps
• Tsunami UDP
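Which of the options above fits best depends largely on how much data there is and how fast the link is. The following is a minimal back-of-the-envelope sketch in Python; the function name, the 80% link utilization, and the sample sizes are illustrative assumptions, not figures from the talk.

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Days needed to move data_tb (decimal terabytes) over a link of link_mbps megabits/s.

    utilization is an assumed fraction of the nominal bandwidth that is
    actually available for the transfer.
    """
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    effective_bps = link_mbps * 1e6 * utilization  # usable bits per second
    return bits / effective_bps / 86_400           # seconds -> days


# Illustrative volumes and link speeds (assumptions).
for tb, mbps in [(1, 100), (10, 100), (100, 1000)]:
    print(f"{tb:>4} TB over {mbps:>5} Mbit/s: {transfer_days(tb, mbps):6.1f} days")
```

Under these assumptions, 1 TB over a 100 Mbit/s line already takes more than a day, which is why shipping disks (AWS Import Export) or a dedicated line (Direct Connect) can be the better choice at volume.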
Write directly to a data source
Your application Amazon S3
DynamoDB
Any other data store
Amazon S3
Amazon EC2
Queue, pre-process and then write
Amazon Simple Queue Service (SQS)
Amazon S3
DynamoDB
Any other data store
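The queue, pre-process, then write pattern can be sketched as follows. This is a minimal illustration only: it uses Python's in-memory `queue.Queue` as a stand-in for Amazon SQS and a plain list as a stand-in for S3, DynamoDB, or another store. The event fields (`user_id`, `action`) and the function names are hypothetical.

```python
import json
import queue
from typing import Optional

# In-memory stand-ins (assumptions): a real system would use Amazon SQS
# for the queue and Amazon S3, DynamoDB, or another store as the sink.
events: "queue.Queue[str]" = queue.Queue()
data_store: list = []


def preprocess(raw: str) -> Optional[dict]:
    """Parse one raw JSON event; return None for records that should be dropped."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or "user_id" not in event:
        return None
    # Normalize the action name so downstream queries can group on it.
    event["action"] = str(event.get("action", "unknown")).lower()
    return event


def drain(q: "queue.Queue[str]", sink: list) -> int:
    """Consume everything currently queued and append cleaned events to sink."""
    written = 0
    while True:
        try:
            raw = q.get_nowait()
        except queue.Empty:
            return written
        cleaned = preprocess(raw)
        if cleaned is not None:
            sink.append(cleaned)
            written += 1


# Producer side: the application only enqueues and moves on.
events.put('{"user_id": 1, "action": "CLICK"}')
events.put("not json")             # malformed, dropped
events.put('{"action": "view"}')   # missing user_id, dropped
events.put('{"user_id": 2}')

print(drain(events, data_store), "events written:", data_store)
```

The point of the queue is decoupling: the application does not wait for the data store, and the pre-processing step can be scaled or changed independently.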
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
Choose depending upon design
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Hadoop based Analysis
Amazon S3 Amazon EMR
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
EMR is Hadoop in the Cloud
Amazon Elastic MapReduce (EMR)?
How does EMR work?
EMR Cluster
S3
Put the data into S3
Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
Launch the cluster using the EMR console, CLI, SDK, or APIs
Get the output from S3
You can also store everything in HDFS
Resize Nodes
EMR Cluster
You can easily add and remove nodes
1 instance for 100 hours =
100 instances for 1 hour
Small instance = $5.50 (including EMR – without: $4.40)
1 instance for 1000 hours =
1000 instances for 1 hour
Small instance = $55 (including EMR – without: $44)
When you turn off your cloud resources, you actually stop paying for them
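The arithmetic behind these figures can be checked with a short Python sketch. The hourly rates are not stated on the slides; they are derived here from the totals ($4.40 and $5.50 per 100 instance-hours) and reflect the prices used in the talk, not necessarily current pricing. The function name is an illustrative assumption.

```python
# Hourly rates derived from the slide totals (assumption: simple per-hour billing).
EC2_RATE = 0.044  # USD per small-instance hour: $4.40 / 100 h
EMR_RATE = 0.011  # USD per instance hour for EMR: ($5.50 - $4.40) / 100 h


def cost(instances: int, hours: int, include_emr: bool = True) -> float:
    """Total cost in USD for a number of instances running a number of hours."""
    rate = EC2_RATE + (EMR_RATE if include_emr else 0.0)
    return round(instances * hours * rate, 2)


# Cost depends only on instance-hours, not on how they are spread out.
print(cost(1, 100), cost(100, 1))                  # same total either way
print(cost(1, 1000), cost(1000, 1))                # same total either way
print(cost(1, 100, include_emr=False), cost(1, 1000, include_emr=False))
```

Because only the product of instances and hours matters, running 100 instances for one hour costs the same as one instance for 100 hours, but the result arrives 100 times sooner.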
SQL based processing
Amazon S3 Amazon EMR
Amazon Redshift
Pre-processing framework
Petabyte-scale columnar data warehouse
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
What is Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud.
Easy to provision and scale
No upfront costs, pay as you go
High performance at a low price
Open and flexible with support for popular BI tools
Demo: Amazon Redshift
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Your choice of BI Tools
Amazon S3 Amazon EMR
Amazon Redshift
Pre-processing framework
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
Demo Jaspersoft as a BI Frontend
Sharing results and visualizations
Amazon S3 Amazon EMR
Amazon Redshift
Web App Server Visualization tools
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
Sharing results and visualizations
Amazon S3 Amazon EMR
Amazon Redshift
Business Intelligence Tools
Business Intelligence Tools
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
Geospatial Visualizations
Amazon S3 Amazon EMR
Amazon Redshift
Business Intelligence Tools
Business Intelligence Tools
GIS tools on Hadoop
GIS tools
Visualization tools
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
Rinse and Repeat
Amazon S3 Amazon EMR
Amazon Redshift
Visualization tools
Business Intelligence Tools
Business Intelligence Tools
GIS tools on Hadoop
GIS tools
AWS Data Pipeline
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
The complete architecture
Amazon S3 Amazon EMR
Amazon Redshift
Visualization tools
Business Intelligence Tools
Business Intelligence Tools
GIS tools on Hadoop
GIS tools
AWS Data Pipeline
Amazon SQS
DynamoDB
Any SQL or NoSQL Store
Log aggregation tools
Real Time
Amazon Kinesis
• Real-time processing
• Massive scale
• Integrated
• Use cases:
  • Real-time log analysis
  • Real-time data analytics
  • Social media monitoring
  • Financial transactions
  • Online machine learning
Amazon Kinesis Data Flow
Data Sources
AWS Endpoint
Shard 1, Shard 2, … Shard N (across Availability Zones)
App.1 [Aggregate & De-Duplicate]
App.2 [Metric Extraction]
App.3 [Sliding Window Analysis]
App.4 [Machine Learning]
S3, DynamoDB, Redshift
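How records from the data sources end up on Shard 1 to Shard N can be sketched in a few lines of Python. Kinesis maps each record's partition key to a 128-bit integer with an MD5 hash and routes the record to the shard whose hash key range contains that value. The even split of the hash space, the function names, and the sample keys below are assumptions for illustration; this is a local simulation and makes no calls to the service.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_ranges(shard_count: int) -> list:
    """Split the hash key space into contiguous, equally sized ranges (an assumption)."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges: list) -> int:
    """Return the zero-based index of the shard whose range contains the key's MD5 hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash outside all ranges")


ranges = shard_ranges(4)
for key in ["user-1", "user-2", "user-3"]:  # hypothetical partition keys
    print(key, "-> shard", shard_for(key, ranges) + 1)
```

Records with the same partition key always land on the same shard, so the choice of key determines both ordering per key and how evenly load is spread.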
Use cases
SkillPages
Customer Use Case
Everyone Needs Skilled People
At Home At Work In Life
Repeatedly
Who they are
What they can do
Your real life connections to them
Examples of what they can do
Data Architecture
Data Analyst
Raw Data
Get Data
Join via Facebook
Add a Skill Page
Invite Friends
Web Servers Amazon S3 User Action Trace Events
EMR Hive Scripts Process Content
• Process log files with regular expressions to parse out the info we need.
• Process cookies into useful, searchable data such as Session, UserId, and API security token.
• Filter out surplus info such as internal Varnish logging.
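The three processing steps above can be illustrated with a small Python sketch. It is not the Hive scripts used by SkillPages: the log line layout, the cookie names (`Session`, `UserId`, `ApiToken`), and the "varnish" marker used for filtering are all hypothetical and only show the shape of the technique.

```python
import re
from typing import Optional

# Hypothetical log layout: ip [time] "METHOD path protocol" status "cookies"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) "(?P<cookies>[^"]*)"'
)


def parse_cookies(header: str) -> dict:
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}, skipping malformed parts."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


def parse_line(line: str) -> Optional[dict]:
    """Return a flat record for one log line, or None if it should be dropped."""
    if "varnish" in line.lower():  # hypothetical marker for internal cache logging
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    cookies = parse_cookies(record.pop("cookies"))
    # Promote the cookie values we care about to searchable columns.
    record["session"] = cookies.get("Session")
    record["user_id"] = cookies.get("UserId")
    record["api_token"] = cookies.get("ApiToken")
    return record


sample = ('203.0.113.7 [12/Mar/2014:10:15:32 +0000] "GET /skills/plumber HTTP/1.1" '
          '200 "Session=abc123; UserId=42; ApiToken=xyz"')
print(parse_line(sample))
print(parse_line("varnish: internal health probe"))
```

In the architecture described here, the equivalent logic runs as Hive scripts on EMR over the raw events in S3, and the flattened records are written back to S3 as aggregated data.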
Amazon S3
Aggregated Data
Raw Events
Internal Web
Excel Tableau
Amazon Redshift
We found that Amazon Redshift offers the performance we needed while freeing us from the licensing costs of our previous solution. With Amazon Redshift and Tableau, anyone in the company can set up any queries they like, from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts have had in different areas. It’s very flexible.
Jon Hoffman, Software Engineer, Foursquare
[Figure: charts by Gender (Female, Male) and by Age]
Foursquare
Gorilla Coffee
Gray's Papaya
Amorino
When do people go to a place?
Stack – analysis and sharing
Application Stack
Scala/Liftweb API Machines WWW Machines Batch Jobs
Scala Application code
Mongo/Postgres/Flat Files Databases Logs
Data Stack
Amazon S3 Database Dumps Log Files
Hadoop Elastic Map Reduce
Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs
mongoexport postgres dump Flume
Everything that was a limited resource
is now a programmable resource
Resources
• Hadoop Technology and Use Cases: http://www.powerof60.com/
• http://aws.amazon.com/de
• Start with the Free Tier: http://aws.amazon.com/de/free/
• 25 US$ credits for new German customers: http://aws.amazon.com/de/campaigns/account/
• Twitter: @AWS_Aktuell
• Facebook: http://www.facebook.com/awsaktuell
• Webinars: http://aws.amazon.com/de/about-aws/events/