(DEV309) Large-Scale Metrics Analysis in Ruby

Upload: amazon-web-services

Post on 12-Apr-2017

TRANSCRIPT

Page 1: (DEV309) Large-Scale Metrics Analysis in Ruby

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Alex Wood, AWS SDKs and Tools Team

October 2015

Large-Scale Metrics Analysis in Ruby

Data Processing from Scratch

Page 2: (DEV309) Large-Scale Metrics Analysis in Ruby

Data Is Valuable

Page 3: (DEV309) Large-Scale Metrics Analysis in Ruby

Many Shapes, Sizes, and Sources

Page 4: (DEV309) Large-Scale Metrics Analysis in Ruby

From Reactive to Proactive

Page 5: (DEV309) Large-Scale Metrics Analysis in Ruby

This Talk Is For Me, 2 Years Ago

Page 6: (DEV309) Large-Scale Metrics Analysis in Ruby

What to expect from the session

• High-level overview

• Writing a log-processing job

• Log-processing automation

• Amazon Redshift ingestion

• Building reports

• Finer points and advanced techniques

• Conclusion

Page 7: (DEV309) Large-Scale Metrics Analysis in Ruby

From web logs

10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:55:36 "GET /admin.html" 200 2326

337.899.380.827 5bb3ee4186 osvaldohuels "IE6" 7/Oct/2015 13:55:41 "GET /products/141.html" 200 1214

510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:55:51 "GET /" 200 4132

205.67.420.496 080c8f7a44 - "Safari" 7/Oct/2015 13:56:01 "GET /" 200 4123

510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:56:14 "GET /products/23.html" 200 1315

10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:11 "POST /admin.html" 204 34

10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:13 "GET /admin.html" 200 2312

510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:57:29 "GET /" 200 4139

Page 8: (DEV309) Large-Scale Metrics Analysis in Ruby

To digestible output

Page 10: (DEV309) Large-Scale Metrics Analysis in Ruby

To digestible output

Reports:

Date        Request Count
2015-10-01  26,781
2015-10-02  26,864
2015-10-03  20,310
2015-10-04  14,409
2015-10-05  29,029
2015-10-06  26,545
2015-10-07  27,940

Ad hoc queries:

SELECT REQUEST,
       SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
  AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1

{ "REQUEST" => "GET /",
  "VISITS" => "14505" }

Page 11: (DEV309) Large-Scale Metrics Analysis in Ruby

Log-processing system

Logs in Amazon S3 -> Amazon Elastic MapReduce -> Redshift -> Reports

Page 12: (DEV309) Large-Scale Metrics Analysis in Ruby

Writing a Log-Processing Job

Page 13: (DEV309) Large-Scale Metrics Analysis in Ruby

Log-processing system

Logs in S3 -> EMR -> Redshift -> Reports

Page 15: (DEV309) Large-Scale Metrics Analysis in Ruby

Example S3 objects

log/2015-10-06/22h.log

log/2015-10-06/23h.log

log/2015-10-07/0h.log

log/2015-10-07/1h.log

log/2015-10-07/2h.log

log/2015-10-07/3h.log

log/2015-10-07/4h.log

log/2015-10-07/5h.log

Separate logs with prefixes

EMR with input prefix:

"-input", "s3://bucket/log/2015-10-07/"

Page 16: (DEV309) Large-Scale Metrics Analysis in Ruby

Example Log Storage

Page 17: (DEV309) Large-Scale Metrics Analysis in Ruby

Log-processing system

Logs in S3 -> EMR -> Redshift -> Reports

Page 18: (DEV309) Large-Scale Metrics Analysis in Ruby

Amazon Elastic MapReduce overview

[Diagram: master node (job tracker) and worker nodes (mappers, reducers)]

Page 19: (DEV309) Large-Scale Metrics Analysis in Ruby

Streaming jobs

[Diagram: master node (job tracker) and worker nodes (mappers, reducers)]

• Built-in streaming JAR

• Bring your own mapper

• Bring your own reducer

• Hadoop does orchestration

Page 21: (DEV309) Large-Scale Metrics Analysis in Ruby

Mapper

[Diagram: master node (job tracker) and worker nodes (mappers, reducers)]

• Input by line from STDIN

o Ruby ARGF

• Output to STDOUT

• Bottom line: Filter values

Page 22: (DEV309) Large-Scale Metrics Analysis in Ruby

Mapper Walkthrough
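
A minimal sketch of a streaming mapper along these lines follows. It is not the code shown in the talk: the field layout is inferred from the sample web logs, the output columns are assumed to follow the order of the fact table, and `run_mapper` and `LOG_PATTERN` are hypothetical names.

```ruby
require "stringio"

# Assumed layout, inferred from the sample web logs:
# ip session_id username "user agent" date time "request" status bytes
LOG_PATTERN = /\A\S+ (\S+) (\S+) "([^"]*)" (\S+) \S+ "([^"]*)" (\d{3}) \d+\z/

# Reads log lines from `input`, keeps only the fields we report on, and
# writes one tab-delimited record per valid line to `output`.
def run_mapper(input, output)
  input.each_line do |line|
    match = LOG_PATTERN.match(line.strip)
    next unless match # skip lines that do not look like log entries

    session_id, username, user_agent, date, request, status = match.captures
    output.puts [username, session_id, user_agent, date, request, status].join("\t")
  end
end

# In a real streaming job this would be `run_mapper(ARGF, $stdout)`;
# a StringIO stands in for STDIN here so the sketch runs on its own.
sample = StringIO.new(<<~LOG)
  10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:55:36 "GET /admin.html" 200 2326
  510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:55:51 "GET /" 200 4132
  not a log line
LOG
mapped = StringIO.new
run_mapper(sample, mapped)
puts mapped.string
```

The IP address, time, and byte count are dropped here, which is one way to read "filter values": the mapper passes along only what the later report needs.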

Page 24: (DEV309) Large-Scale Metrics Analysis in Ruby

Reducer

[Diagram: master node (job tracker) and worker nodes (mappers, reducers)]

• Sorted by Hadoop

• Mapper output line by line

o Again using STDIN

• Transform output

• Count duplicates

• Output to STDOUT

Page 25: (DEV309) Large-Scale Metrics Analysis in Ruby

Reducer Walkthrough
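
A minimal sketch of a duplicate-counting reducer follows. It is not the code shown in the talk. It assumes that identical records reach the reducer next to each other, which holds for fully sorted input such as the output of `sort` in a local test; if a job configuration does not guarantee that, a Hash-based tally is the safer alternative. `run_reducer` is a hypothetical name.

```ruby
require "stringio"

# Counts runs of identical lines and appends the count as a last column.
# Only the current record is held in memory, which is why adjacent
# duplicates (sorted input) are assumed.
def run_reducer(input, output)
  current = nil
  count = 0
  input.each_line do |line|
    record = line.chomp
    if record == current
      count += 1
    else
      output.puts "#{current}\t#{count}" unless current.nil?
      current = record
      count = 1
    end
  end
  output.puts "#{current}\t#{count}" unless current.nil? # flush the last run
end

# In a real streaming job this would be `run_reducer(ARGF, $stdout)`;
# a StringIO stands in for STDIN here so the sketch runs on its own.
sorted = StringIO.new([
  "-\t9dae697a8e\tChrome\t7/Oct/2015\tGET /\t200",
  "-\t9dae697a8e\tChrome\t7/Oct/2015\tGET /\t200",
  "mckenzieheathcote\t7c667f5dcd\tFirefox\t7/Oct/2015\tGET /admin.html\t200"
].join("\n") + "\n")
reduced = StringIO.new
run_reducer(sorted, reduced)
puts reduced.string
```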

Page 26: (DEV309) Large-Scale Metrics Analysis in Ruby

Summary

• Streaming mappers and reducers are executable scripts.

• Hadoop manages streaming orchestration.

• Input comes through STDIN.

• Output sent to STDOUT.

• Can test locally:

o cat input.txt | ruby mapper.rb | sort | ruby reducer.rb > result.out

Page 27: (DEV309) Large-Scale Metrics Analysis in Ruby

Automation

Page 28: (DEV309) Large-Scale Metrics Analysis in Ruby

Concepts: Streaming step

• Mapper and reducer source files

• Input files

• Output destination

Page 29: (DEV309) Large-Scale Metrics Analysis in Ruby

Streaming Step Live Code
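
The three ingredients of a streaming step can be sketched as a plain Ruby hash in the shape accepted by the `steps:` option of `Aws::EMR::Client#run_job_flow`. This is an illustration, not the live code from the talk. The step name, the `s3://bucket/code/` prefix, and the helper `streaming_step` are hypothetical, and the use of `command-runner.jar` assumes an EMR 4.x release.

```ruby
# Builds one streaming step definition. All names and paths passed in
# below are hypothetical placeholders.
def streaming_step(name:, code_prefix:, input:, output:)
  {
    name: name,
    action_on_failure: "CONTINUE",
    hadoop_jar_step: {
      # Assumption: an EMR 4.x release, where command-runner.jar launches
      # the built-in streaming JAR. Older AMIs reference the JAR path directly.
      jar: "command-runner.jar",
      args: [
        "hadoop-streaming",
        "-files", "#{code_prefix}mapper.rb,#{code_prefix}reducer.rb",
        "-mapper", "mapper.rb",
        "-reducer", "reducer.rb",
        "-input", input,
        "-output", output
      ]
    }
  }
end

step = streaming_step(
  name: "daily-requests-2015-10-07",
  code_prefix: "s3://bucket/code/",
  input: "s3://bucket/log/2015-10-07/",
  output: "s3://bucket/output-prefix/"
)
puts step[:hadoop_jar_step][:args].join(" ")
```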

Page 30: (DEV309) Large-Scale Metrics Analysis in Ruby

Concepts: Instance configuration

• How many? How big?

• Master vs. worker

Page 31: (DEV309) Large-Scale Metrics Analysis in Ruby

Instance Config Live Code
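
"How many? How big?" and "master vs. worker" map onto the `instances:` option of `run_job_flow`. The sketch below is an assumption-laden illustration rather than the live code from the talk: the instance types and counts are placeholders, and `instance_config` is a hypothetical helper.

```ruby
# Builds the `instances:` portion of a run_job_flow request with one
# master and a configurable number of workers (EMR calls these CORE nodes).
# The default instance types are placeholders, not recommendations.
def instance_config(worker_count:, master_type: "m3.xlarge", worker_type: "m3.xlarge")
  {
    instance_groups: [
      { name: "Master", instance_role: "MASTER",
        instance_type: master_type, instance_count: 1 },
      { name: "Workers", instance_role: "CORE",
        instance_type: worker_type, instance_count: worker_count }
    ],
    # Shut the cluster down once its steps have finished.
    keep_job_flow_alive_when_no_steps: false,
    termination_protected: false
  }
end

config = instance_config(worker_count: 4)
config[:instance_groups].each do |group|
  puts "#{group[:instance_role]}: #{group[:instance_count]} x #{group[:instance_type]}"
end
```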

Page 33: (DEV309) Large-Scale Metrics Analysis in Ruby

Console vs. SDK

Console

AWS SDK for Ruby

@client = Aws::EMR::Client.new
@client.run_job_flow(opts)

Page 36: (DEV309) Large-Scale Metrics Analysis in Ruby

End state

Cluster A: Step 1, Step 2

Cluster B: Step 3, Step 4, Step 5

Cluster C: Step 6, Step 7

Page 37: (DEV309) Large-Scale Metrics Analysis in Ruby

Batching Example
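
One simple way to spread steps across several clusters is to slice the step list into batches. The sketch below is not the example from the talk: it assumes a fixed batch size (the slide shows uneven batches), and `batch_into_clusters` and the cluster names are hypothetical.

```ruby
# Splits a list of step definitions into one request per cluster.
# The fixed batch size is an assumption made for illustration.
def batch_into_clusters(steps, steps_per_cluster)
  steps.each_slice(steps_per_cluster).each_with_index.map do |batch, index|
    { name: "cluster-#{index + 1}", steps: batch }
  end
end

steps = (1..7).map { |n| { name: "Step #{n}" } }
clusters = batch_into_clusters(steps, 3)
clusters.each do |cluster|
  puts "#{cluster[:name]}: #{cluster[:steps].map { |s| s[:name] }.join(', ')}"
end
```

Each resulting hash could then be merged with shared cluster settings before being passed to `run_job_flow`.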

Page 38: (DEV309) Large-Scale Metrics Analysis in Ruby

Summary

• AWS SDKs enable automation at scale.

• Getting started is simple.

• Separate common configuration from job-specific.
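
Separating common from job-specific configuration can be as small as a frozen hash plus a merge. The sketch below is one possible arrangement, not the author's: the release label, role names, and log location are placeholder values, and `job_flow_options` is a hypothetical helper.

```ruby
# Settings shared by every cluster. All values are hypothetical.
COMMON_OPTIONS = {
  release_label: "emr-4.1.0",
  service_role: "EMR_DefaultRole",
  job_flow_role: "EMR_EC2_DefaultRole",
  log_uri: "s3://bucket/emr-logs/"
}.freeze

# Combines the shared settings with what changes from job to job.
def job_flow_options(name:, instances:, steps:)
  COMMON_OPTIONS.merge(name: name, instances: instances, steps: steps)
end

opts = job_flow_options(
  name: "daily-requests-2015-10-07",
  instances: { instance_groups: [] }, # stand-in for a real instance config
  steps: []                           # stand-in for real streaming steps
)
puts opts.keys.sort.join(", ")

# With the aws-sdk gem loaded and credentials configured, the request
# would be sent as: Aws::EMR::Client.new.run_job_flow(opts)
```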

Page 39: (DEV309) Large-Scale Metrics Analysis in Ruby

Amazon Redshift Ingestion

Page 40: (DEV309) Large-Scale Metrics Analysis in Ruby

Log-processing system

Logs in S3 -> EMR -> Redshift -> Reports

Page 41: (DEV309) Large-Scale Metrics Analysis in Ruby

Amazon Redshift

Page 43: (DEV309) Large-Scale Metrics Analysis in Ruby

Key concepts

• Redshift ingestion uses a SQL COPY command.

• Input fields map one-to-one to table columns, separated by a delimiter.

o Must be in the same order as table columns.

o Default delimiter is the pipe "|" character, but you can specify your own.
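
The ordering rule can be sketched in Ruby as follows. The column list is taken from the FACT_DAILY_REQUESTS table in this talk; `to_copy_line` is a hypothetical helper, and rejecting values that contain the delimiter is a simplification (escaping is another option).

```ruby
# Column order of FACT_DAILY_REQUESTS; each output line must follow it.
COLUMNS = %w[
  username session_id user_agent end_date
  request response_code request_count
].freeze

# Renders one record as a delimited line in table-column order.
def to_copy_line(record, delimiter = "|")
  COLUMNS.map do |column|
    value = record.fetch(column).to_s
    if value.include?(delimiter)
      raise ArgumentError, "value contains delimiter: #{value.inspect}"
    end
    value
  end.join(delimiter)
end

record = {
  "request" => "GET /", "username" => "-", "request_count" => 3,
  "session_id" => "9dae697a8e", "user_agent" => "Chrome",
  "end_date" => "7/Oct/2015", "response_code" => 200
}
puts to_copy_line(record)       # default pipe delimiter
puts to_copy_line(record, "\t") # tab delimiter
```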

Page 44: (DEV309) Large-Scale Metrics Analysis in Ruby

Our FACT Table

CREATE TABLE FACT_DAILY_REQUESTS(
  USERNAME VARCHAR(30) NOT NULL DISTKEY,
  SESSION_ID VARCHAR(10),
  USER_AGENT VARCHAR(256) NOT NULL,
  END_DATE DATE NOT NULL,
  REQUEST VARCHAR(128) NOT NULL,
  RESPONSE_CODE INTEGER NOT NULL,
  REQUEST_COUNT INTEGER NOT NULL
)
INTERLEAVED SORTKEY(END_DATE,REQUEST,RESPONSE_CODE)

Page 46: (DEV309) Large-Scale Metrics Analysis in Ruby

Copying from S3 to Redshift

-- The authorization clause (CREDENTIALS or IAM_ROLE) is not shown on the slide.
COPY FACT_DAILY_REQUESTS
FROM 's3://bucket/output-prefix/part-'
DATEFORMAT AS 'DD/MON/YYYY'
DELIMITER '\t'
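
For automation, the COPY statement can be assembled in Ruby before being sent over a database connection. The sketch below only builds the SQL string. The `IAM_ROLE` clause and the role ARN are assumptions (the slide omits authorization), and `copy_statement` is a hypothetical helper.

```ruby
# Builds a Redshift COPY statement. The arguments are interpolated
# directly, so they must come from trusted configuration, not user input.
def copy_statement(table:, source:, iam_role:,
                   date_format: "DD/MON/YYYY", delimiter: "\\t")
  <<~SQL
    COPY #{table}
    FROM '#{source}'
    IAM_ROLE '#{iam_role}'
    DATEFORMAT AS '#{date_format}'
    DELIMITER '#{delimiter}'
  SQL
end

statement = copy_statement(
  table: "FACT_DAILY_REQUESTS",
  source: "s3://bucket/output-prefix/part-",
  # Hypothetical role ARN; substitute whatever authorization you use.
  iam_role: "arn:aws:iam::123456789012:role/HypotheticalCopyRole"
)
puts statement
```

The `part-` prefix matches every output file the reducers wrote, so a single COPY loads the whole job result.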

Page 47: (DEV309) Large-Scale Metrics Analysis in Ruby

Ingestion Walkthrough

Page 48: (DEV309) Large-Scale Metrics Analysis in Ruby

Summary

• Amazon Redshift is queried through a SQL interface.

• You can point COPY at an S3 prefix as the source, just as with EMR input.

• If delimited, EMR's output structure is ready to load.

Page 49: (DEV309) Large-Scale Metrics Analysis in Ruby

Report Generation

Page 50: (DEV309) Large-Scale Metrics Analysis in Ruby

Log-processing system

Logs in S3 -> EMR -> Redshift -> Reports

Page 51: (DEV309) Large-Scale Metrics Analysis in Ruby

Amazon Redshift

Page 52: (DEV309) Large-Scale Metrics Analysis in Ruby

Simple Count

SELECT COUNT(DISTINCT USERNAME)
FROM FACT_DAILY_REQUESTS

Page 53: (DEV309) Large-Scale Metrics Analysis in Ruby

Date-range queries

SELECT END_DATE, SUM(REQUEST_COUNT)
FROM FACT_DAILY_REQUESTS
WHERE END_DATE BETWEEN '2015-10-06' AND '2015-10-09'
GROUP BY END_DATE
ORDER BY END_DATE DESC
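
Rows returned by a query like this can be turned into a plain-text report with a few lines of Ruby. The sketch assumes each row arrives as a hash of strings, in the style of the ad hoc query result shown earlier, and that the sum has been aliased as TOTAL; `with_commas` and `format_report` are hypothetical helpers.

```ruby
# Inserts thousands separators into a non-negative integer.
def with_commas(number)
  number.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Renders rows as a two-column report with a header line.
def format_report(rows)
  lines = [format("%-12s%s", "Date", "Request Count")]
  rows.each do |row|
    lines << format("%-12s%s", row["END_DATE"], with_commas(row["TOTAL"]))
  end
  lines.join("\n")
end

# Stand-in rows; a real report would read these from the database.
rows = [
  { "END_DATE" => "2015-10-07", "TOTAL" => "27940" },
  { "END_DATE" => "2015-10-06", "TOTAL" => "26545" }
]
report = format_report(rows)
puts report
```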

Page 54: (DEV309) Large-Scale Metrics Analysis in Ruby

Advanced query – New user behavior

SELECT REQUEST, SUM(REQUEST_COUNT) AS TOTAL
FROM FACT_DAILY_REQUESTS f, DIM_USERS u
WHERE f.USERNAME = u.USERNAME
  AND f.END_DATE BETWEEN '2015-10-01' AND '2015-10-07'
  AND u.REGISTRATION_DATE >= '2015-10-01'
GROUP BY REQUEST
ORDER BY TOTAL DESC
LIMIT 10

Page 55: (DEV309) Large-Scale Metrics Analysis in Ruby

Supports planned and ad hoc reports

Reports:

Date        Request Count
2015-10-01  26,781
2015-10-02  26,864
2015-10-03  20,310
2015-10-04  14,409
2015-10-05  29,029
2015-10-06  26,545
2015-10-07  27,940

Ad hoc queries:

SELECT REQUEST,
       SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
  AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1

{ "REQUEST" => "GET /",
  "VISITS" => "14505" }

Page 56: (DEV309) Large-Scale Metrics Analysis in Ruby

Summary

• Programmatic reporting with SQL

• Query logic not tied to Redshift

• Columnar storage optimized for common DW queries

• Can use S3 to store reports

• Can take advantage of PostgreSQL features:

o Window functions

o Common table expressions

Page 57: (DEV309) Large-Scale Metrics Analysis in Ruby

Finer Points

Page 59: (DEV309) Large-Scale Metrics Analysis in Ruby

Nice toy. Can it scale?

Page 60: (DEV309) Large-Scale Metrics Analysis in Ruby

1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes.

Page 61: (DEV309) Large-Scale Metrics Analysis in Ruby

Got 5,000,000,000,000,000 problems

Page 64: (DEV309) Large-Scale Metrics Analysis in Ruby

What did we learn?

• Master instance selection matters

o jobtracker-heap-size

• Worker memory matters

o mapreduce.map.memory.mb

o mapreduce.reduce.memory.mb

o mapred.tasktracker.map.tasks.maximum

o mapred.tasktracker.reduce.tasks.maximum

• Elasticity is AWESOME!
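
As an illustration of where the worker memory settings above could live, the sketch below builds a value for the `configurations:` option of `run_job_flow`. It assumes an EMR 4.x release that accepts the `mapred-site` classification. The memory values are placeholders, not tuning advice, and `worker_memory_configuration` is a hypothetical helper.

```ruby
# Builds a `configurations:` entry that sets per-task memory for
# mappers and reducers. EMR expects property values as strings.
def worker_memory_configuration(map_mb:, reduce_mb:)
  [
    {
      classification: "mapred-site",
      properties: {
        "mapreduce.map.memory.mb" => map_mb.to_s,
        "mapreduce.reduce.memory.mb" => reduce_mb.to_s
      }
    }
  ]
end

# Placeholder values chosen only to show the shape of the result.
configurations = worker_memory_configuration(map_mb: 2048, reduce_mb: 4096)
configurations.first[:properties].each do |key, value|
  puts "#{key}=#{value}"
end
```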

Page 65: (DEV309) Large-Scale Metrics Analysis in Ruby

Production lessons learned

• Repeated manual tasks == Evil

• Multiple sources of truth

• Understand storage ramifications of table design

• Automate validation

Page 66: (DEV309) Large-Scale Metrics Analysis in Ruby

Validation Example
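
One possible automated check, sketched under stated assumptions rather than taken from the talk: if every input line is a valid log entry, the REQUEST_COUNT values in the reducer output should add up to the number of input lines. `validate_counts` is a hypothetical helper, and the count is assumed to be the last tab-delimited field.

```ruby
# Compares the number of raw log lines with the total of the last
# tab-delimited field (the request count) in the reducer output.
def validate_counts(input_line_count, reduced_lines)
  total = reduced_lines.inject(0) do |sum, line|
    sum + Integer(line.chomp.split("\t").last)
  end
  { expected: input_line_count, actual: total, valid: total == input_line_count }
end

reduced_lines = [
  "-\t9dae697a8e\tChrome\t7/Oct/2015\tGET /\t200\t2\n",
  "mckenzieheathcote\t7c667f5dcd\tFirefox\t7/Oct/2015\tGET /admin.html\t200\t1\n"
]
result = validate_counts(3, reduced_lines)
puts result[:valid] ? "counts match" : "expected #{result[:expected]}, got #{result[:actual]}"
```

A mismatch does not say what went wrong, only that lines were dropped or double-counted somewhere between the raw logs and the reducer output.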

Page 67: (DEV309) Large-Scale Metrics Analysis in Ruby

You don't have to do it yourself

• Related services

• AWS Data Pipeline

• Amazon Machine Learning

• Amazon Kinesis

• Amazon Simple Email Service

• Amazon Simple Notification Service

• AWS Marketplace

Page 68: (DEV309) Large-Scale Metrics Analysis in Ruby

Conclusion

Page 69: (DEV309) Large-Scale Metrics Analysis in Ruby

Now you can:

• Write a streaming Amazon Elastic MapReduce job.

• Automate cluster creation with the AWS SDK for Ruby.

• Format results and ingest into Amazon Redshift.

• Create useful reports from Amazon Redshift.

• Start thinking about scaling and production deployment.

Page 70: (DEV309) Large-Scale Metrics Analysis in Ruby

Resources

• Sample Code

• https://github.com/awslabs/reinvent2015-dev309

• Amazon Elastic MapReduce documentation

• http://aws.amazon.com/documentation/elasticmapreduce/

• Amazon Redshift documentation

• http://aws.amazon.com/documentation/redshift/

• AWS SDK for Ruby documentation

• http://docs.aws.amazon.com/sdkforruby/api/index.html

• Twitter: @alexwwood

Page 71: (DEV309) Large-Scale Metrics Analysis in Ruby

Thank you!

Page 72: (DEV309) Large-Scale Metrics Analysis in Ruby

Remember to complete

your evaluations!

Page 73: (DEV309) Large-Scale Metrics Analysis in Ruby

Related sessions

• BDT305 - Amazon EMR Deep Dive and Best Practices

• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices

• DAT201 - Introduction to Amazon Redshift