AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics

Greg Khairallah, Business Development Manager, AWS

Adam Savitzky, Software Development Engineer, Yahoo!

Scott Hoover, Data Scientist, Looker

July 23, 2015

Best Practices: Amazon Redshift Reporting and Advanced Analytics

Amazon Redshift – Resources

Getting Started – June Webinar Series: https://www.youtube.com/watch?v=biqBjWqJi-Q

Best Practices – July Webinar Series:

Optimizing Performance – July 21, 2015

Migration and Data Loading – July 22, 2015

Reporting and Advanced Analytics – July 23, 2015

Agenda

• Connecting to Amazon Redshift
• Case Study – Redshift Analytics at Yahoo
• Case Study – Redshift Optimizations at Looker
• Questions and Answers

Amazon Redshift

• Petabyte scale; massively parallel
• Relational data warehouse
• Fully managed; zero admin
• SSD & HDD platforms
• As low as $1,000/TB/Year

Common Customer Use Cases

Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business

Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude

Custom ODBC and JDBC Drivers

Up to 35% higher performance than open source drivers

Supported by most Business Intelligence tools

Will continue to support PostgreSQL open source drivers

Download drivers from console
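As a rough illustration of what a BI tool is configured with, the sketch below builds the JDBC connection URL for either driver family. The `jdbc:redshift://` and `jdbc:postgresql://` prefixes are the ones the Amazon Redshift and PostgreSQL JDBC drivers accept; the helper name, the cluster endpoint, and the database name are hypothetical, and port 5439 is assumed because it is the Redshift default.

```python
def jdbc_url(endpoint: str, database: str, port: int = 5439,
             driver: str = "redshift") -> str:
    """Build a JDBC URL for a Redshift cluster (hypothetical helper).

    driver="redshift" targets the Amazon Redshift JDBC driver;
    driver="postgresql" targets the open source PostgreSQL driver.
    """
    if driver not in ("redshift", "postgresql"):
        raise ValueError(f"unsupported driver: {driver}")
    return f"jdbc:{driver}://{endpoint}:{port}/{database}"


# Hypothetical endpoint and database, for illustration only.
ENDPOINT = "examplecluster.abc123xyz.us-west-2.redshift.amazonaws.com"
redshift_url = jdbc_url(ENDPOINT, "dev")
postgres_url = jdbc_url(ENDPOINT, "dev", driver="postgresql")
```

The same endpoint, port, and database name apply to ODBC; only the connection-string syntax differs.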

Amazon Redshift Partners

Redshift for Analytics at Yahoo

Adam Savitzky
Tech Yahoo, Software Development Engineer

Introduction

Who am I?
• Yahoo growth team
• Supporting analytics for 6 products in Yahoo’s mobile portfolio

In the past:

Introduction

What do we do?
▪ Real-time ad-hoc analytics
▪ Mobile properties
▪ What do we care about?
  › Engagement and Activity
  › User demographics
  › Experimentation
  › Funnel analysis
  › Modeling revenue and user Lifetime Value
  › Cohort analysis and retention

High Level Architecture

[Diagram: Mobile App, Hadoop, S3, Redshift]

▪ On an average day
  › 1 billion events
  › 25 million devices
  › 2 billion parameter key/value pairs
▪ Planned Capacity
  › 21 dc1.8xlarge nodes
  › 80 billion events
  › 100 million devices
  › 50 TB (compressed!)

Data Model

Performance Optimizations

▪ Heavy use of summarization where appropriate
▪ Sort keys and partitioning
▪ Data encoding

Event Schema

[Diagram: event schema, with tables grouped as Raw Tables, Summary Tables, and Derived Tables. Table labels shown: event_raw (mail, flickr, homerun, stark, arrow); param (mail, flickr, homerun, stark, arrow); param keys; event hourly; event daily; telemetry daily; revenue daily; install; install attribution; user retention; funnel; first_event; is_active]

Case Study: User Retention Analysis

Definitions

▪ Cohort - A group of product users that share one or more attributes
  › Example: All users who installed on Monday with Android devices
▪ Retention - How many members of a cohort continue to use the product over time
  › Example: 100 users installed on Monday with Android devices. 7 days later, 50 of those users returned to the product. We would say the 7-day retention for this cohort is 50%.

Why Study User Retention?

▪ Quantifies how “sticky” your product is
▪ Allows us to measure Customer Lifetime Value (CLV or

[Charts: % Retained and Total Users, each comparing Asymptotic Retention with No Retention]

Calculating User Retention

Definition: For each possible combination of cohort dimensions, for every possible event date, how many devices belong to that cohort, and how many devices from that cohort were active on that day

event_date  product  install_date  os_name  active_users  cohort_size
monday      mail     monday        android  100           100
tuesday     mail     monday        android  83            100
monday      mail     monday        ios      75            75
tuesday     mail     monday        ios      62            75


Example with one dimension, os_name: What’s my 1 day retention for users who installed on Monday?

Aggregate retention across both ios and android is (83 + 62) / (100 + 75) = 145 / 175 ≈ 83%
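The aggregation above can be sketched in a few lines of Python. The row values are the ones from the example table; the `RetentionRow` type and the `aggregate_retention` helper are names made up for this illustration. The point it shows is that retention is aggregated by summing active users and cohort sizes separately and then dividing, not by averaging the per-OS percentages.

```python
from typing import List, NamedTuple


class RetentionRow(NamedTuple):
    event_date: str
    product: str
    install_date: str
    os_name: str
    active_users: int
    cohort_size: int


ROWS = [
    RetentionRow("monday", "mail", "monday", "android", 100, 100),
    RetentionRow("tuesday", "mail", "monday", "android", 83, 100),
    RetentionRow("monday", "mail", "monday", "ios", 75, 75),
    RetentionRow("tuesday", "mail", "monday", "ios", 62, 75),
]


def aggregate_retention(rows: List[RetentionRow], event_date: str,
                        install_date: str) -> float:
    """Retention across all cohorts that share an install date."""
    selected = [r for r in rows
                if r.event_date == event_date and r.install_date == install_date]
    cohort_total = sum(r.cohort_size for r in selected)
    if cohort_total == 0:
        raise ValueError("no matching cohorts")
    # Sum numerators and denominators separately, then divide.
    return sum(r.active_users for r in selected) / cohort_total


one_day = aggregate_retention(ROWS, "tuesday", "monday")
print(f"{one_day:.0%}")  # prints 83%
```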

Steps:
1. For each day, determine whether each device was active or not

device_id  date        is_active
1          2015-01-01  1
1          2015-01-02  0
2          2015-01-01  1

2. Join device attributes to results of Step 1

device_id  date        is_active  os   install_date
1          2015-01-01  1          ios  2015-01-01
1          2015-01-02  0          ios  2015-01-01
2          2015-01-01  1          ios  2015-01-01

3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)

date        active_user_count  os   install_date
2015-01-01  2                  ios  2015-01-01
2015-01-02  1                  ios  2015-01-01

4. Join the size of each cohort to the result of Step 3

date        active_user_count  os   install_date  cohort_size
2015-01-01  2                  ios  2015-01-01    2
2015-01-02  1                  ios  2015-01-01    2
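The four steps can be sketched as a single SQL query. This is only an illustration: it runs against an in-memory SQLite database through Python's standard `sqlite3` module rather than against Redshift, and the `device` and `event` table names, their columns, and the sample rows are assumptions chosen to reproduce the small tables shown in the steps. The production schema and query are not given in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (device_id INTEGER, os TEXT, install_date TEXT);
    CREATE TABLE event (device_id INTEGER, event_date TEXT);
    INSERT INTO device VALUES (1, 'ios', '2015-01-01'), (2, 'ios', '2015-01-01');
    INSERT INTO event VALUES
        (1, '2015-01-01'), (2, '2015-01-01'), (2, '2015-01-01'), (2, '2015-01-02');
""")

RETENTION_SQL = """
WITH days AS (
    SELECT DISTINCT event_date AS activity_date FROM event
),
activity AS (  -- Step 1: one row per device per day, flagged active or not
    SELECT d.device_id,
           days.activity_date,
           CASE WHEN EXISTS (
               SELECT 1 FROM event e
               WHERE e.device_id = d.device_id
                 AND e.event_date = days.activity_date
           ) THEN 1 ELSE 0 END AS is_active
    FROM device d CROSS JOIN days
),
joined AS (    -- Step 2: attach device attributes
    SELECT a.activity_date, a.is_active, d.os, d.install_date
    FROM activity a JOIN device d ON d.device_id = a.device_id
),
active AS (    -- Step 3: sum is_active per date and cohort
    SELECT activity_date, os, install_date, SUM(is_active) AS active_user_count
    FROM joined
    GROUP BY activity_date, os, install_date
),
cohort AS (    -- cohort sizes, used in Step 4
    SELECT os, install_date, COUNT(*) AS cohort_size
    FROM device
    GROUP BY os, install_date
)
SELECT a.activity_date, a.active_user_count, a.os, a.install_date, c.cohort_size
FROM active a
JOIN cohort c ON c.os = a.os AND c.install_date = a.install_date  -- Step 4
ORDER BY a.activity_date
"""

rows = conn.execute(RETENTION_SQL).fetchall()
for row in rows:
    print(row)
```

Dividing `active_user_count` by `cohort_size` at query time, after summing each across the selected cohorts, gives the retention rate for any combination of dimensions.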

Demo using Looker

Lessons Learned

▪ Summarize data for optimal query performance (hourly or daily rollups)

▪ Think carefully about data model ahead of time. Choose the right sort keys.

▪ Invest in a good tool for ETL (we use Airflow)
▪ Invest in a good tool for query building and sharing (we use Looker)
▪ Reserve plenty of spare capacity (at least 40% free)
▪ Reserved nodes are much cheaper
▪ DC nodes are faster, but have much smaller capacity
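To make the first lesson concrete, here is a minimal sketch of an hourly and daily rollup in plain Python. The event tuples and the `rollup` helper are invented for the example; in the warehouse the same idea would be a scheduled aggregate query writing to a summary table, so that reports scan a handful of pre-counted rows instead of every raw event.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (device_id, timestamp).
RAW_EVENTS = [
    ("device-1", "2015-01-01 09:15:00"),
    ("device-2", "2015-01-01 09:45:00"),
    ("device-1", "2015-01-01 10:05:00"),
    ("device-1", "2015-01-02 08:30:00"),
]


def rollup(events, bucket_format):
    """Count events per time bucket; bucket_format is a strftime pattern."""
    counts = Counter()
    for _device_id, timestamp in events:
        parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        counts[parsed.strftime(bucket_format)] += 1
    return dict(counts)


hourly = rollup(RAW_EVENTS, "%Y-%m-%d %H:00")
daily = rollup(RAW_EVENTS, "%Y-%m-%d")
print(hourly)
print(daily)
```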

Scott Hoover, Data Scientist

Redshift and Looker

Introduction

• We use Redshift to power our own implementation of Looker, which serves every department with business intelligence and data for analytics.

• I have worked at Looker for just over two years, doing everything from Sales Engineering to Professional Services to Data Engineering. I currently head up our internal analytics efforts.

Agenda

• How Looker uses Redshift to supply business intelligence and drive analytics internally.

• How a few Looker customers use Redshift for reporting and analytics.

Looker and Redshift

At Looker, two major use cases drove our decision to go with Redshift:

• fast analysis of usage data (300+ million events);

• centralizing multiple data sources into a single warehouse.

What We Care About Most

• Customer Health:
  - MoM/WoW percent change in usage
  - Users added/removed
  - User engagement (developer, explorer, consumer, occasional consumer)
  - LookML contributions and contributors

• Product Usage:
  - Features used/not used
  - Release pain points
  - GitHub issue/feature tracking

• Reporting for Sales and Marketing:
  - Usage in trial
  - Performance to quota (sales, meetings, leads, etc.)
  - Lead/prospect fit
  - Campaign attribution
  - SaaS metrics: MRR, cMRR, Churn

Redshift Data Pipeline

Pinger

License

Real-Time RDS

Data Model
Event Data & Everything Else

Event Schema

{
  "event_id": "1",
  "event_type": "view_connection",
  "created_at": "2015-07-08 20:04:08 +0000",
  "attrs": {
    "country": "US",
    "state": "CA",
    "browser": "Safari/537.36",
    "uri": "%2Fadmin%2Fconnections"
  }
}

{
  "event_id": "2",
  "event_type": "save_look",
  "created_at": "2015-07-08 20:04:12 +0000",
  "attrs": {
    "country": "US",
    "state": "CA",
    "browser": "Safari/537.36",
    "look_id": "32"
  }
}

Event Schema

id  type             created_at                 country  state  uri                     browser        error  …  k
1   view_connection  2015-07-08 20:04:08 +0000  US       CA     %2Fadmin%2Fconnections  Safari/537.36  ø      …  k1
2   save_look        2015-07-08 20:04:12 +0000  US       CA     ø                       Safari/537.36  ø      …  k2
N   run_query        2015-07-08 22:01:16 +0000  UK       ø      %2Ffields=              Chrome         ø      …  kN
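The move from nested JSON events to one wide table can be sketched as follows. This is an illustrative flattening, not Looker's actual loader: the `flatten_event` helper and the fixed column list are assumptions, and a missing attribute becomes `None`, which plays the role of ø in the table.

```python
import json

RAW_EVENTS = """
[
  {"event_id": "1", "event_type": "view_connection",
   "created_at": "2015-07-08 20:04:08 +0000",
   "attrs": {"country": "US", "state": "CA",
             "browser": "Safari/537.36", "uri": "%2Fadmin%2Fconnections"}},
  {"event_id": "2", "event_type": "save_look",
   "created_at": "2015-07-08 20:04:12 +0000",
   "attrs": {"country": "US", "state": "CA",
             "browser": "Safari/537.36", "look_id": "32"}}
]
"""

# Columns taken from the "attrs" object; every row gets all of them.
ATTR_COLUMNS = ["country", "state", "uri", "browser", "error"]


def flatten_event(event, attr_columns=ATTR_COLUMNS):
    """Turn one nested event into a flat row with a fixed set of columns."""
    row = {
        "id": event["event_id"],
        "type": event["event_type"],
        "created_at": event["created_at"],
    }
    attrs = event.get("attrs", {})
    for column in attr_columns:
        row[column] = attrs.get(column)  # None when the event lacks the attribute
    return row


rows = [flatten_event(event) for event in json.loads(RAW_EVENTS)]
for row in rows:
    print(row)
```

A wide, sparse table like this suits a columnar store, since a query that touches only a few columns does not read the rest.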

- explore: events
  extends: license_base
  label: 'Pinger'
  always_filter:
    events.created_date: '30 days'
  joins:
    - join: license
      sql_on: ${events.license_slug} = ${license.new_slug}
      relationship: many_to_one
    - join: license_users
      sql_on: ${events.user_id} = ${license_users.id}
      relationship: many_to_many
    - join: client
      sql_on: ${client.id} = ${events.client_id}
      relationship: many_to_one
    - join: account
      sql_on: ${client.salesforce_account_id} = ${account.id}
      relationship: many_to_one
    - join: opportunity
      sql_on: ${account.id} = ${opportunity.account_id}
      relationship: many_to_one
    [...]
    - join: sessions
      sql_on: ${sessions.event_id} = ${events.id}
      relationship: many_to_one

Event Schema

Everything Else

company_id  account_id       opportunity_id   trial_id            license_id  lead_id             campaign_id         campaign_member_at         …  k
1           E000000zD0IFIA0  E000000Oi9mxIAB  0000014uTRGMA2      1423        00QE000000NqLsvMAF  701E00000006MC7IAM  2013-09-23 23:03:05 +0000  …  k1
                                              0000014uTRGMA2      1423        00QE000000e0ZsYMAU  701E00000006OAaIAM  2014-02-20 22:39:25 +0000  …  k2
                                              0000014uTRGMA2      1423        00QE000000e0ZsYMAU  701E00000008XEbIAM  2015-02-18 00:06:09 +0000  …  k3
2           E000000zrbTgIAI  E000000VuLHhIAN  a06E000000aNOcVIAW  1601        00QE000000XJVJiMAP  701E00000006OB9IAM  2015-04-01 22:04:05 +0000  …  k4
N                                                                                                                                                …  kN

- explore: company
  joins:
    - join: account
      sql_on: ${company.account_id} = ${account.id}
      relationship: many_to_one
    - join: opportunity
      sql_on: ${company.opportunity_id} = ${opportunity.id}
      relationship: many_to_one
    - join: lead
      sql_on: ${company.lead_id} = ${lead.id}
      relationship: many_to_one
    - join: contact
      sql_on: ${contact.id} = ${company.contact_id}
      relationship: many_to_one
      fields: [export_set*]
    - join: campaign
      sql_on: ${company.campaign_id} = ${campaign.id}
      relationship: many_to_one
    - join: trial
      sql_on: ${company.trial_id} = ${trial.id}
      relationship: many_to_one
    - join: account_representative
      from: user
      sql_on: ${opportunity.owner_id} = ${account_representative.id}
      fields: [name, count]
      relationship: many_to_one
    - join: license
      sql_on: ${company.account_id} = ${license.salesforce_account_id}
      relationship: one_to_one

Everything Else

Explore and Visualize

Analyze - Lead Scoring

API 3.0

• Construct historical data set or “Look.”
• GET “Look” using Looker API.
• Train/test model in R.
• Output PMML file.
• EC2 hosts Openscoring REST service + PMML.
• Hit Salesforce API for new leads; score leads; update each lead record.
• View prioritized lists in Looker.
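The scoring loop in the last steps (fetch new leads, score them, write the score back, rank them) can be sketched in Python. Everything here is a stand-in: the three callables are hypothetical placeholders for the Salesforce lead query, the Openscoring REST call, and the Salesforce update, and the toy scoring rule is not the model trained in R.

```python
from typing import Callable, Dict, Iterable, List

Lead = Dict[str, object]


def score_and_prioritize(fetch_new_leads: Callable[[], Iterable[Lead]],
                         score: Callable[[Lead], float],
                         update_lead: Callable[[Lead], None]) -> List[Lead]:
    """Score each new lead, write the score back, return leads ranked high to low."""
    scored = []
    for lead in fetch_new_leads():
        scored_lead = dict(lead, score=score(lead))
        update_lead(scored_lead)  # stand-in for the "UPDATE lead" call
        scored.append(scored_lead)
    return sorted(scored, key=lambda item: item["score"], reverse=True)


# In-memory stand-ins so the sketch runs without any external service.
NEW_LEADS = [
    {"id": "lead-a", "employees": 20, "used_trial": False},
    {"id": "lead-b", "employees": 500, "used_trial": True},
    {"id": "lead-c", "employees": 80, "used_trial": True},
]
updated: Dict[str, float] = {}


def toy_score(lead: Lead) -> float:
    # Hypothetical rule: trial usage and company size raise the score.
    return (0.5 if lead["used_trial"] else 0.0) + min(lead["employees"], 1000) / 2000


def record_update(lead: Lead) -> None:
    updated[lead["id"]] = lead["score"]


ranked = score_and_prioritize(lambda: NEW_LEADS, toy_score, record_update)
print([lead["id"] for lead in ranked])
```

Passing the three operations in as functions keeps the loop independent of any particular API client, which also makes it easy to test offline.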

GET lead

UPDATE lead

GET look

Why Our Customers Use Redshift

• Scale/Performance
  - Transactional databases are not ideal for analytics (slow).
  - Redshift scales quickly and is incredibly fast.
• Accessibility
  - SQL is in many analysts’ wheelhouse and is easy to adopt.
  - Obvious choice for those in the AWS ecosystem or who prefer managed offerings.
• Centralization of data
  - Needed when it comes time to tie top-of-funnel actions to bottom-of-funnel behavior.

How Our Customers Use Redshift

• Backstage/Sonicbids: They built an artist search tool that uses social data from Facebook, Twitter, YouTube, and Soundcloud to inform booking agents on what sort of draw they could expect from a certain artist. They used Snowplow, Redshift, the Looker API, and Elasticsearch to build this system.

• Smartling: Sources website translation snippets from translators the world over. They maintain a database of translated snippets, like “the car is red” in Turkish, in order to validate incoming translations. So, when a request for “the car is blue” in Turkish comes in, they can make an assessment on the syntactic validity of the translation.

Learn more at www.looker.com
