DEV309: Large-Scale Metrics Analysis in Ruby
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Alex Wood, AWS SDKs and Tools Team
October 2015
Large-Scale Metrics Analysis in Ruby
Data Processing from Scratch
Data Is Valuable
Many Shapes, Sizes, and Sources
From Reactive to Proactive
This Talk Is For Me, 2 Years Ago
What to expect from the session
• High-level overview
• Writing a log-processing job
• Log-processing automation
• Amazon Redshift ingestion
• Building reports
• Finer points and advanced techniques
• Conclusion
From web logs
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:55:36 "GET /admin.html" 200 2326
337.899.380.827 5bb3ee4186 osvaldohuels "IE6" 7/Oct/2015 13:55:41 "GET /products/141.html" 200 1214
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:55:51 "GET /" 200 4132
205.67.420.496 080c8f7a44 - "Safari" 7/Oct/2015 13:56:01 "GET /" 200 4123
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:56:14 "GET /products/23.html" 200 1315
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:11 "POST /admin.html" 204 34
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:13 "GET /admin.html" 200 2312
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:57:29 "GET /" 200 4139
To digestible output
Reports:
Date Request Count
2015-10-01 26,781
2015-10-02 26,864
2015-10-03 20,310
2015-10-04 14,409
2015-10-05 29,029
2015-10-06 26,545
2015-10-07 27,940
To digestible output
Ad hoc queries:
SELECT REQUEST,
SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
"VISITS" => "14505" }
Log-processing system
Logs in Amazon S3 → Amazon Elastic MapReduce → Amazon Redshift → Reports
Writing a Log-Processing Job
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Example S3 objects
log/2015-10-06/22h.log
log/2015-10-06/23h.log
log/2015-10-07/0h.log
log/2015-10-07/1h.log
log/2015-10-07/2h.log
log/2015-10-07/3h.log
log/2015-10-07/4h.log
log/2015-10-07/5h.log
Separate logs with prefixes
Example S3 objects
log/2015-10-06/22h.log
log/2015-10-06/23h.log
log/2015-10-07/0h.log
log/2015-10-07/1h.log
log/2015-10-07/2h.log
log/2015-10-07/3h.log
log/2015-10-07/4h.log
log/2015-10-07/5h.log
Separate logs with prefixes
EMR w/ input prefix
"-input",
"s3://bucket/log/2015-10-07/"
Example Log Storage
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Amazon Elastic MapReduce overview
[Diagram: a master node running the job tracker orchestrates worker nodes, which run the mappers and reducers]
Streaming jobs
• Built-in streaming JAR
• Bring your own mapper
• Bring your own reducer
• Hadoop does orchestration
Mapper
• Input by line from STDIN
o Ruby ARGF
• Output to STDOUT
• Bottom line: Filter values
Mapper Walkthrough
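The mapper walkthrough was live code, not captured in the transcript. A minimal sketch of what such a mapper could look like, assuming the field layout of the sample log lines earlier in the talk (the regex and key layout are illustrative, not the speaker's exact code):

```ruby
#!/usr/bin/env ruby
# mapper.rb - reads raw log lines line by line via ARGF and emits one
# tab-separated key to STDOUT per line; malformed lines are dropped.
# Field layout assumed: IP, session id, username, "agent", date, time,
# "request", status, bytes.
LOG_LINE = /^(\S+) (\S+) (\S+) "([^"]*)" (\S+) (\S+) "([^"]*)" (\d+) (\d+)$/

# Turn one raw log line into a tab-separated key, or nil if it doesn't parse.
def map_line(line)
  m = LOG_LINE.match(line.chomp)
  return nil unless m
  _ip, session, user, agent, date, _time, request, status, _bytes = m.captures
  # Keep the DD/MON/YYYY date so a later Redshift COPY DATEFORMAT can parse it.
  [user, session, agent, date, request, status].join("\t")
end

if __FILE__ == $PROGRAM_NAME
  ARGF.each_line { |line| (key = map_line(line)) && puts(key) }
end
```

The "bottom line: filter values" idea shows up in the `return nil` path: anything that isn't a well-formed log line never reaches the reducer.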
Reducer
• Sorted by Hadoop
• Reads mapper output line by line
o Again using STDIN
• Transform output
• Count duplicates
• Output to STDOUT
Reducer Walkthrough
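The reducer walkthrough was also live code. A minimal sketch under the same assumptions: Hadoop has already sorted the mapper output, so counting duplicates reduces to counting runs of identical lines and appending the count as an extra column:

```ruby
#!/usr/bin/env ruby
# reducer.rb - counts runs of identical keys in sorted mapper output and
# appends the count as a final tab-separated column. A minimal sketch.

# lines: any enumerable of strings (STDIN in production, an array in tests).
def reduce(lines)
  rows = []
  current, count = nil, 0
  lines.each do |line|
    key = line.chomp
    if key == current
      count += 1
    else
      rows << "#{current}\t#{count}" unless current.nil?
      current, count = key, 1
    end
  end
  rows << "#{current}\t#{count}" unless current.nil?
  rows
end

if __FILE__ == $PROGRAM_NAME
  reduce(ARGF).each { |row| puts row }
end
```

Because input arrives sorted, the reducer only ever has to hold one key in memory, which is what lets this scale.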
Summary
• Streaming mappers and reducers are executable scripts.
• Hadoop manages streaming orchestration.
• Input comes through STDIN.
• Output sent to STDOUT.
• Can test locally:
• cat input.txt | ruby mapper.rb | sort | ruby reducer.rb > result.out
Automation
Concepts: Streaming step
• Mapper and reducer source files
• Input files
• Output destination
Streaming Step Live Code
Concepts: Instance configuration
• How many? How big?
• Master vs. worker
Instance Config Live Code
Console vs. SDK
AWS SDK for Ruby
@client = Aws::EMR::Client.new
@client.run_job_flow(opts)
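The `opts` passed to `run_job_flow` were built up in live code. A hedged sketch of what they might contain, tying together the streaming step and instance configuration concepts: the bucket name, script paths, roles, and instance types are placeholder assumptions, and `command-runner.jar` assumes a release-label (emr-4.x) cluster:

```ruby
# Sketch of launching a one-off streaming cluster with the AWS SDK for Ruby.
# Every name below (bucket, roles, instance types) is a placeholder.
def job_flow_opts(date)
  {
    name: "log-processing-#{date}",
    release_label: "emr-4.1.0",
    service_role: "EMR_DefaultRole",
    job_flow_role: "EMR_EC2_DefaultRole",
    instances: {                            # How many? How big?
      master_instance_type: "m3.xlarge",
      slave_instance_type: "m3.xlarge",
      instance_count: 3,                    # 1 master + 2 workers
      keep_job_flow_alive_when_no_steps: false
    },
    steps: [{
      name: "streaming-step-#{date}",
      action_on_failure: "TERMINATE_CLUSTER",
      hadoop_jar_step: {
        jar: "command-runner.jar",   # runs hadoop-streaming on emr-4.x
        args: [
          "hadoop-streaming",
          "-files", "s3://bucket/scripts/mapper.rb,s3://bucket/scripts/reducer.rb",
          "-mapper", "mapper.rb",
          "-reducer", "reducer.rb",
          "-input",  "s3://bucket/log/#{date}/",
          "-output", "s3://bucket/output/#{date}/"
        ]
      }
    }]
  }
end

# require "aws-sdk"   # v2 of the SDK, current at the time of this talk
# @client = Aws::EMR::Client.new(region: "us-east-1")
# @client.run_job_flow(job_flow_opts("2015-10-07"))
```

Keeping the hash in a method like this is one way to separate common configuration from the job-specific pieces (here, only the date varies).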
End state
Cluster A
Step 1 Step 2
Cluster B
Step 3 Step 4 Step 5
Cluster C
Step 6 Step 7
Batching Example
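The batching demo isn't reproduced in the transcript. A sketch of the idea, spreading a backlog of streaming steps across a few clusters as in the "End state" picture (the step hashes and `base_opts` stand in for real EMR configuration):

```ruby
# Sketch: distribute a backlog of steps across clusters, a few per cluster.
# With 7 steps and 3 per cluster, three clusters run 3, 3, and 1 steps.
def batch_steps(steps, per_cluster)
  steps.each_slice(per_cluster).to_a
end

dates = %w[2015-10-01 2015-10-02 2015-10-03 2015-10-04
           2015-10-05 2015-10-06 2015-10-07]
steps = dates.map { |d| { name: "streaming-step-#{d}" } }
batches = batch_steps(steps, 3)

# batches.each do |batch|
#   @client.run_job_flow(base_opts.merge(steps: batch))   # one cluster per batch
# end
```

Each cluster runs its steps serially, so the batch size is a lever for trading cluster count against end-to-end latency.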
Summary
• AWS SDKs enable automation at scale.
• Getting started is simple.
• Separate common configuration from job-specific.
Amazon Redshift Ingestion
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Amazon Redshift
Key concepts
• Redshift ingestion uses a SQL COPY command.
• One-to-one mapping with table columns, separated by a
delimiter.
o Must be in the same order as table columns.
o Default delimiter is the pipe "|" character, but you can specify
your own.
Our FACT Table
CREATE TABLE FACT_DAILY_REQUESTS(
USERNAME VARCHAR(30) NOT NULL DISTKEY,
SESSION_ID VARCHAR(10),
USER_AGENT VARCHAR(256) NOT NULL,
END_DATE DATE NOT NULL,
REQUEST VARCHAR(128) NOT NULL,
RESPONSE_CODE INTEGER NOT NULL,
REQUEST_COUNT INTEGER NOT NULL
)
INTERLEAVED SORTKEY(END_DATE,REQUEST,RESPONSE_CODE)
Copying from S3 to Redshift
COPY FACT_DAILY_REQUESTS
FROM 's3://bucket/output-prefix/part-'
DATEFORMAT AS 'DD/MON/YYYY'
DELIMITER '\t'
Ingestion Walkthrough
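The ingestion walkthrough was live; a sketch of driving that COPY from Ruby with the `pg` gem (Redshift speaks the PostgreSQL wire protocol). The host, database, and IAM role ARN are placeholder assumptions, and note that a real COPY also needs credentials, which the slide's version omits:

```ruby
# Sketch: issue the COPY from Ruby via the pg gem. Connection settings and
# the IAM role ARN are placeholders; a real COPY needs credentials.
def copy_sql(prefix, iam_role)
  <<-SQL
    COPY FACT_DAILY_REQUESTS
    FROM '#{prefix}'
    CREDENTIALS 'aws_iam_role=#{iam_role}'
    DATEFORMAT AS 'DD/MON/YYYY'
    DELIMITER '\\t'
  SQL
end

# require "pg"
# conn = PG.connect(host: "mycluster.example.us-east-1.redshift.amazonaws.com",
#                   port: 5439, dbname: "metrics",
#                   user: "admin", password: ENV["REDSHIFT_PASSWORD"])
# conn.exec(copy_sql("s3://bucket/output-prefix/part-",
#                    "arn:aws:iam::ACCOUNT_ID:role/RedshiftCopyRole"))
```

The `part-` prefix is why EMR's output is "ready to load": COPY treats the S3 prefix as an alias for every reducer output file beneath it.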
Summary
• Amazon Redshift speaks familiar SQL.
• You can reference an S3 source by prefix, just as with EMR input.
• If delimited, EMR's output structure is ready to load.
Report Generation
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Simple Count
SELECT COUNT(DISTINCT USERNAME)
FROM FACT_DAILY_REQUESTS
Date-range queries
SELECT END_DATE, SUM(REQUEST_COUNT)
FROM FACT_DAILY_REQUESTS
WHERE END_DATE BETWEEN '2015-10-06' AND '2015-10-09'
GROUP BY END_DATE
ORDER BY END_DATE DESC
Advanced query – New user behavior
SELECT REQUEST, SUM(REQUEST_COUNT) AS TOTAL
FROM FACT_DAILY_REQUESTS f, DIM_USERS u
WHERE f.USERNAME = u.USERNAME
AND f.END_DATE BETWEEN '2015-10-01' AND '2015-10-07'
AND u.REGISTRATION_DATE >= '2015-10-01'
GROUP BY REQUEST
ORDER BY TOTAL DESC
LIMIT 10
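Queries like these can be run from Ruby the same way the ingestion is driven, again via the `pg` gem. A sketch, with an illustrative formatting helper; `PG::Result` rows come back as hashes of strings, which is why the earlier output shows `"VISITS" => "14505"` rather than an integer:

```ruby
# Sketch: format report rows returned by a Redshift query. PG::Result rows
# are hashes with string values, so counts are cast explicitly.
def format_report(rows)
  rows.map { |row| "#{row['END_DATE']}  #{row['TOTAL'].to_i}" }
end

# require "pg"
# conn = PG.connect(host: "...", port: 5439, dbname: "...", user: "...")
# sql = <<-SQL
#   SELECT END_DATE, SUM(REQUEST_COUNT) AS TOTAL
#   FROM FACT_DAILY_REQUESTS
#   GROUP BY END_DATE ORDER BY END_DATE DESC
# SQL
# puts format_report(conn.exec(sql))
```

Because the query logic lives in plain SQL strings, the same report code could later point at another PostgreSQL-compatible store.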
Reports:
Date Request Count
2015-10-01 26,781
2015-10-02 26,864
2015-10-03 20,310
2015-10-04 14,409
2015-10-05 29,029
2015-10-06 26,545
2015-10-07 27,940
Supports planned and ad hoc reports
Ad hoc queries:
SELECT REQUEST,
SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
"VISITS" => "14505" }
Summary
• Programmatic reporting with SQL
• Query logic not tied to Redshift
• Columnar storage optimized for common DW queries
• Can use S3 to store reports
• Can take advantage of PostgreSQL features:
• Window functions
• Common table expressions
Finer Points
Nice toy. Can it scale?
1 PB = 1,000,000,000,000,000 bytes = 10^15 bytes = 1,000 terabytes.
Got 5,000,000,000,000,000 problems
What did we learn?
• Master instance selection matters
o jobtracker-heap-size
• Worker memory matters
o mapreduce.map.memory.mb
o mapreduce.reduce.memory.mb
o mapred.tasktracker.map.tasks.maximum
o mapred.tasktracker.reduce.tasks.maximum
• Elasticity is AWESOME!
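One place settings like these can go: Hadoop's streaming jar accepts them as generic `-D` options, which must precede the streaming-specific arguments. A sketch, with arbitrary example values rather than recommendations:

```ruby
# Sketch: pass memory settings to hadoop-streaming as generic -D options.
# Generic options must come before the streaming options; the values here
# are arbitrary examples, not tuning advice.
TUNING = [
  "-D", "mapreduce.map.memory.mb=2048",
  "-D", "mapreduce.reduce.memory.mb=4096"
].freeze

def streaming_args(input, output)
  ["hadoop-streaming", *TUNING,
   "-mapper", "mapper.rb", "-reducer", "reducer.rb",
   "-input", input, "-output", output]
end
```

An args builder like this slots directly into the `hadoop_jar_step` of a `run_job_flow` call, so tuning changes stay in one place.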
Production lessons learned
• Repeated manual tasks == Evil
• Multiple sources of truth
• Understand storage ramifications of table design
• Automate validation
Validation Example
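The validation example was demoed live. One sketch of the idea: after each load, compare the row count EMR emitted with the row count Redshift ingested for the same date, and fail loudly on a mismatch. The counting helpers in the comments are hypothetical stand-ins for an S3 listing and a Redshift query:

```ruby
# Sketch of automated load validation: the numbers from two sources of
# truth must agree, or the pipeline raises instead of silently drifting.
def validate_load!(date, emr_rows, redshift_rows)
  return true if emr_rows == redshift_rows
  raise "Load mismatch for #{date}: EMR wrote #{emr_rows} rows, " \
        "Redshift has #{redshift_rows}"
end

# emr_rows      = count_lines_in_s3_output(date)   # hypothetical helper
# redshift_rows = conn.exec("SELECT COUNT(*) FROM FACT_DAILY_REQUESTS " \
#                           "WHERE END_DATE = '#{date}'")[0]["count"].to_i
# validate_load!(date, emr_rows, redshift_rows)
```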
You don't have to do it yourself
• Related services
• AWS Data Pipeline
• Amazon Machine Learning
• Amazon Kinesis
• Amazon Simple Email Service
• Amazon Simple Notification Service
• AWS Marketplace
Conclusion
Now you can:
• Write a streaming Amazon Elastic MapReduce job.
• Automate cluster creation with the AWS SDK for Ruby.
• Format results and ingest into Amazon Redshift.
• Create useful reports from Amazon Redshift.
• Start thinking about scaling and production deployment.
Resources
• Sample Code
• https://github.com/awslabs/reinvent2015-dev309
• Amazon Elastic MapReduce documentation
• http://aws.amazon.com/documentation/elasticmapreduce/
• Amazon Redshift documentation
• http://aws.amazon.com/documentation/redshift/
• AWS SDK for Ruby documentation
• http://docs.aws.amazon.com/sdkforruby/api/index.html
• Twitter: @alexwwood
Thank you!
Remember to complete
your evaluations!
Related sessions
• BDT305 - Amazon EMR Deep Dive and Best Practices
• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices
• DAT201 - Introduction to Amazon Redshift