Download - AWS Segment XO Group Joint webinar
Loading and Analyzing Behavioral Data in Amazon Redshift
Presented by Segment, AWS & XO Group Inc.
March 3, 2015
Jon Hawkins Dir. of Analytics & SEO
XO Group Inc.
Scott Ward Solutions Architect
AWS
Peter Reinhardt CEO
Segment
Today’s Speakers
Today’s agenda
• Amazon Redshift Architecture
• Segment SQL Quick Demo
• XO Group’s Story
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift is Easy to Use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built in security
• Automatic backups
Amazon Redshift Architecture • Leader Node
– SQL endpoint – Stores metadata – Coordinates query execution
• Compute Nodes – Local, columnar storage – Execute queries in parallel – Load, backup, restore via Amazon S3 – Parallel load from Amazon DynamoDB, Amazon
EMR, Amazon S3, HDFS/SSH
• Two hardware platforms – Optimized for data processing – DW1: HDD; scale from 2TB to 1.6PB – DW2: SSD; scale from 160GB to 256TB
10 GigE (HPC)
Ingestion Backup Restore
SQL Clients/BI Tools
128GB RAM
16TB disk
16 cores
Amazon S3
JDBC/ODBC
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
Leader Node
• Column storage
• Data compression
• Zone maps
• Direct-attached storage • With row storage you do unnecessary I/O
• To get total amount, you have to read everything
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
Amazon Redshift Dramatically Reduces I/O
• With column storage, you only read the data you need
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
Amazon Redshift Dramatically Reduces I/O
analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw
• Column storage
• Data compression
• Zone maps
• Direct-attached storage • COPY compresses automatically
• You can analyze and override
• More performance, less cost
Amazon Redshift Dramatically Reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
10 | 13 | 14 | 26 |… … | 100 | 245 | 324
375 | 393 | 417…
… 512 | 549 | 623
637 | 712 | 809 …
… | 834 | 921 | 959
10
324
375
623
637
959
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
Amazon Redshift Dramatically Reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
128 GB RAM
16 cores
16 TB disk
DW.HS1.8XL:
• > 2 GB/s scan rate
• Optimized for data processing
• High disk density
16 GB RAM 2 cores
2 TB disk
DW.HS1.XL:
Amazon Redshift Dramatically Reduces I/O
• Query
• Load
• Backup/Restore
• Resize
Amazon Redshift Parallelizes and Distributes Everything
Amazon S3/DynamoDB
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
• Query
• Load
• Backup/Restore
• Resize
• Parallel load from Amazon DynamoDB, Amazon EMR, Amazon S3, HDFS/SSH
• Kinesis integration
• Data automatically distributed and sorted according to DDL
• Scales linearly with number of nodes
Amazon Redshift Parallelizes and Distributes Everything
Amazon S3
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
• Query
• Load
• Backup/Restore
• Resize
• Backups to Amazon S3 are automatic, continuous and incremental
• Backup your cluster to a second region
• Configurable system snapshot retention period; take user snapshots on-demand
• Streaming restores enable you to resume querying faster
Amazon Redshift Parallelizes and Distributes Everything
SQL Clients/BI Tools
128GB RAM
48TB disk
16 cores Compute Node
128GB RAM
48TB disk
16 cores Compute Node
128GB RAM
48TB disk
16 cores Compute Node
128GB RAM
48TB disk
16 cores Leader Node
128GB RAM
48TB disk
16 cores Compute Node
128GB RAM
48TB disk
16 cores Compute Node
128GB RAM
48TB disk
16 cores Compute Node
128GB RAM
48TB disk
16 cores Compute Node
128GB RAM
48TB disk
16 cores Leader Node
• Query
• Load
• Backup/Restore
• Resize
• Add/remove nodes or change node type while remaining online
• Provision a new cluster and copy data in parallel from node to node
• Only charged for source cluster until SQL endpoint has automatically been switched over via DNS
Amazon Redshift Parallelizes and Distributes Everything
• SSL to secure data in transit
• Encryption to secure data at rest – AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute nodes
• Amazon VPC support
10 GigE (HPC)
Ingestion Backup Restore
SQL Clients/BI Tools
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Amazon S3 / Amazon DynamoDB
Customer VPC
Internal Security Group
JDBC/ODBC
Leader Node
Compute Node
Compute Node
Compute Node
Amazon Redshift Has Security Built In
Segment SQL
A single hub to store, transform, and send your customer data.
All these tools run on three types of data.
IDENTIFY
Who are your users?
TRACK
What are they doing?
PAGE
Where are they?
analytics.track({ 'Added Product' , { name: 'Monopoly: 3rd Edition', price: 18.99, quantity: 3});
12345
Replace all your tags and tracking with a single API.
A single hub to store, transform, and send your customer data.
That same data can flow effortlessly (for you) into
a petabyte scale data warehouse.
Segment SQL
All of your data in a hosted, schematized Amazon Redshift instance.
Segment SQL
Query user, page, and event tables.
Segment SQL Partners
XO Group Inc.
XO Group Inc.
Our Analytics Journey
XO Group Inc. + Segment
Individual product teams wanted isolated access to their own analytics.
Segment + Mixpanel + Customer.io + Optimizely + Uservoice
XO Group Inc. + Segment
Still needed a solution to connect Segment data from multiple products and platforms into a single view.
Segment SQL + Mode Analytics
Version support.
Sharing functionality.
Membership analysis.
3 Case Studies
Membership Analysis
Existing database didn’t give any context into how a user became a member in the app.
Couldn’t analyze which features were driving membership upgrades, marketing attribution, or cross-product pollination.
SELECT userID, source,
reason FROM xogrp.signed_up
1234
Isolated Views
Membership Analysis
App A App B
Membership Analysis
A desire to save our content drives our membership.
Experiencing certain apps before downloading a second significantly improved LTV.
We made it easier to save and organize our content.
We optimized promotion strategy to advertise the stickiest apps first to current members.
Share Functionality
Facebook & Twitter drove the most inbound traffic.
So, we guessed they’d also be the most popular sharing options.
SELECT title, shareOption,
quantity FROM xogrp.shared_article
1234
Share Functionality
More considerate “Share” options
SMS a dress from desktop or mobile web browser.
One click email yourself the details of a venue.
Version Support
Couldn’t answer very simple question about which roll up browser versions, devices, and operating systems customers were using.
Caused a lot of debate over which versions to support, which test devices to buy.
Version Support
Chrome 40 + Chrome 40 + Chrome 40 = No, just no!
Version Support
Now we can easily see what browser versions and OS’s are most popular.
Product and engineering teams can make data-driven decisions on which versions to support.
Set up your tracking and analytics database to answer any question.
Save valuable engineering time with prebuilt software.
Listen to each team’s needs and create the analytics view they want.
Takeaways
Thank you!
Schedule a demo today.
segment.com/contact/sql