DEV309: Large-Scale Metrics Analysis in Ruby
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Alex Wood, AWS SDKs and Tools Team
October 2015
Large-Scale Metrics Analysis in Ruby
Data Processing from Scratch
Data Is Valuable
Many Shapes, Sizes, and Sources
From Reactive to Proactive
This Talk Is For Me, 2 Years Ago
What to expect from the session
• High-level overview
• Writing a log-processing job
• Log-processing automation
• Amazon Redshift ingestion
• Building reports
• Finer points and advanced techniques
• Conclusion
From web logs
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:55:36 "GET /admin.html" 200 2326
337.899.380.827 5bb3ee4186 osvaldohuels "IE6" 7/Oct/2015 13:55:41 "GET /products/141.html" 200 1214
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:55:51 "GET /" 200 4132
205.67.420.496 080c8f7a44 - "Safari" 7/Oct/2015 13:56:01 "GET /" 200 4123
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:56:14 "GET /products/23.html" 200 1315
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:11 "POST /admin.html" 204 34
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:13 "GET /admin.html" 200 2312
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:57:29 "GET /" 200 4139
To digestible output
Reports:
Date Request Count
2015-10-01 26,781
2015-10-02 26,864
2015-10-03 20,310
2015-10-04 14,409
2015-10-05 29,029
2015-10-06 26,545
2015-10-07 27,940
To digestible output
Ad hoc queries:
SELECT REQUEST,
SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
"VISITS" => "14505" }
Log-processing system
Logs in Amazon S3 → Amazon Elastic MapReduce → Amazon Redshift → Reports
Writing a Log-Processing Job
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Example S3 objects
log/2015-10-06/22h.log
log/2015-10-06/23h.log
log/2015-10-07/0h.log
log/2015-10-07/1h.log
log/2015-10-07/2h.log
log/2015-10-07/3h.log
log/2015-10-07/4h.log
log/2015-10-07/5h.log
Separate logs with prefixes
Example S3 objects
log/2015-10-06/22h.log
log/2015-10-06/23h.log
log/2015-10-07/0h.log
log/2015-10-07/1h.log
log/2015-10-07/2h.log
log/2015-10-07/3h.log
log/2015-10-07/4h.log
log/2015-10-07/5h.log
Separate logs with prefixes
EMR w/ input prefix
"-input",
"s3://bucket/log/2015-10-07/"
Example Log Storage
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Amazon Elastic MapReduce overview
[Diagram: a master node running the job tracker orchestrates worker nodes, which run the mappers and reducers]
Streaming jobs
• Built-in streaming JAR
• Bring your own mapper
• Bring your own reducer
• Hadoop does orchestration
Mapper
• Input by line from STDIN
o Ruby ARGF
• Output to STDOUT
• Bottom line: Filter values
Mapper Walkthrough
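The mapper walkthrough was live code, not captured in the transcript. A minimal sketch of what such a mapper could look like, assuming the field layout of the sample log lines earlier in the talk (the regex and key layout are illustrative, not the speaker's exact code):

```ruby
#!/usr/bin/env ruby
# mapper.rb - reads raw log lines line by line via ARGF and emits one
# tab-separated key to STDOUT per line; malformed lines are dropped.
# Field layout assumed: IP, session id, username, "agent", date, time,
# "request", status, bytes.
LOG_LINE = /^(\S+) (\S+) (\S+) "([^"]*)" (\S+) (\S+) "([^"]*)" (\d+) (\d+)$/

# Turn one raw log line into a tab-separated key, or nil if it doesn't parse.
def map_line(line)
  m = LOG_LINE.match(line.chomp)
  return nil unless m
  _ip, session, user, agent, date, _time, request, status, _bytes = m.captures
  # Keep the DD/MON/YYYY date so a later Redshift COPY DATEFORMAT can parse it.
  [user, session, agent, date, request, status].join("\t")
end

if __FILE__ == $PROGRAM_NAME
  ARGF.each_line { |line| (key = map_line(line)) && puts(key) }
end
```

The "bottom line: filter values" idea shows up in the `return nil` path: anything that isn't a well-formed log line never reaches the reducer.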
Reducer
• Sorted by Hadoop
• Reads mapper output line by line
o Again using STDIN
• Transform output
• Count duplicates
• Output to STDOUT
Reducer Walkthrough
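The reducer walkthrough was also live code. A minimal sketch under the same assumptions: Hadoop has already sorted the mapper output, so counting duplicates reduces to counting runs of identical lines and appending the count as an extra column:

```ruby
#!/usr/bin/env ruby
# reducer.rb - counts runs of identical keys in sorted mapper output and
# appends the count as a final tab-separated column. A minimal sketch.

# lines: any enumerable of strings (STDIN in production, an array in tests).
def reduce(lines)
  rows = []
  current, count = nil, 0
  lines.each do |line|
    key = line.chomp
    if key == current
      count += 1
    else
      rows << "#{current}\t#{count}" unless current.nil?
      current, count = key, 1
    end
  end
  rows << "#{current}\t#{count}" unless current.nil?
  rows
end

if __FILE__ == $PROGRAM_NAME
  reduce(ARGF).each { |row| puts row }
end
```

Because input arrives sorted, the reducer only ever has to hold one key in memory, which is what lets this scale.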
Summary
• Streaming mappers and reducers are executable scripts.
• Hadoop manages streaming orchestration.
• Input comes through STDIN.
• Output sent to STDOUT.
• Can test locally:
• cat input.txt | ruby mapper.rb | sort | ruby reducer.rb > result.out
Automation
Concepts: Streaming step
• Mapper and reducer source files
• Input files
• Output destination
Streaming Step Live Code
Concepts: Instance configuration
• How many? How big?
• Master vs. worker
Instance Config Live Code
Console vs. SDK
AWS SDK for Ruby
@client = Aws::EMR::Client.new
@client.run_job_flow(opts)
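The `opts` passed to `run_job_flow` were built up in live code. A hedged sketch of what they might contain, tying together the streaming step and instance configuration concepts: the bucket name, script paths, roles, and instance types are placeholder assumptions, and `command-runner.jar` assumes a release-label (emr-4.x) cluster:

```ruby
# Sketch of launching a one-off streaming cluster with the AWS SDK for Ruby.
# Every name below (bucket, roles, instance types) is a placeholder.
def job_flow_opts(date)
  {
    name: "log-processing-#{date}",
    release_label: "emr-4.1.0",
    service_role: "EMR_DefaultRole",
    job_flow_role: "EMR_EC2_DefaultRole",
    instances: {                            # How many? How big?
      master_instance_type: "m3.xlarge",
      slave_instance_type: "m3.xlarge",
      instance_count: 3,                    # 1 master + 2 workers
      keep_job_flow_alive_when_no_steps: false
    },
    steps: [{
      name: "streaming-step-#{date}",
      action_on_failure: "TERMINATE_CLUSTER",
      hadoop_jar_step: {
        jar: "command-runner.jar",   # runs hadoop-streaming on emr-4.x
        args: [
          "hadoop-streaming",
          "-files", "s3://bucket/scripts/mapper.rb,s3://bucket/scripts/reducer.rb",
          "-mapper", "mapper.rb",
          "-reducer", "reducer.rb",
          "-input",  "s3://bucket/log/#{date}/",
          "-output", "s3://bucket/output/#{date}/"
        ]
      }
    }]
  }
end

# require "aws-sdk"   # v2 of the SDK, current at the time of this talk
# @client = Aws::EMR::Client.new(region: "us-east-1")
# @client.run_job_flow(job_flow_opts("2015-10-07"))
```

Keeping the hash in a method like this is one way to separate common configuration from the job-specific pieces (here, only the date varies).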
End state
Cluster A
Step 1 Step 2
Cluster B
Step 3 Step 4 Step 5
Cluster C
Step 6 Step 7
Batching Example
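The batching demo isn't reproduced in the transcript. A sketch of the idea, spreading a backlog of streaming steps across a few clusters as in the "End state" picture (the step hashes and `base_opts` stand in for real EMR configuration):

```ruby
# Sketch: distribute a backlog of steps across clusters, a few per cluster.
# With 7 steps and 3 per cluster, three clusters run 3, 3, and 1 steps.
def batch_steps(steps, per_cluster)
  steps.each_slice(per_cluster).to_a
end

dates = %w[2015-10-01 2015-10-02 2015-10-03 2015-10-04
           2015-10-05 2015-10-06 2015-10-07]
steps = dates.map { |d| { name: "streaming-step-#{d}" } }
batches = batch_steps(steps, 3)

# batches.each do |batch|
#   @client.run_job_flow(base_opts.merge(steps: batch))   # one cluster per batch
# end
```

Each cluster runs its steps serially, so the batch size is a lever for trading cluster count against end-to-end latency.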
Summary
• AWS SDKs enable automation at scale.
• Getting started is simple.
• Separate common configuration from job-specific.
Amazon Redshift Ingestion
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Amazon Redshift
Key concepts
• Redshift ingestion uses a SQL COPY command.
• One-to-one mapping with table columns, separated by a
delimiter.
o Must be in the same order as table columns.
o Default delimiter is the pipe "|" character, but you can specify
your own.
Our FACT Table
CREATE TABLE FACT_DAILY_REQUESTS(
USERNAME VARCHAR(30) NOT NULL DISTKEY,
SESSION_ID VARCHAR(10),
USER_AGENT VARCHAR(256) NOT NULL,
END_DATE DATE NOT NULL,
REQUEST VARCHAR(128) NOT NULL,
RESPONSE_CODE INTEGER NOT NULL,
REQUEST_COUNT INTEGER NOT NULL
)
INTERLEAVED SORTKEY(END_DATE,REQUEST,RESPONSE_CODE)
Copying from S3 to Redshift
COPY FACT_DAILY_REQUESTS
FROM 's3://bucket/output-prefix/part-'
DATEFORMAT AS 'DD/MON/YYYY'
DELIMITER '\t'
Ingestion Walkthrough
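The ingestion walkthrough was live; a sketch of driving that COPY from Ruby with the `pg` gem (Redshift speaks the PostgreSQL wire protocol). The host, database, and IAM role ARN are placeholder assumptions, and note that a real COPY also needs credentials, which the slide's version omits:

```ruby
# Sketch: issue the COPY from Ruby via the pg gem. Connection settings and
# the IAM role ARN are placeholders; a real COPY needs credentials.
def copy_sql(prefix, iam_role)
  <<-SQL
    COPY FACT_DAILY_REQUESTS
    FROM '#{prefix}'
    CREDENTIALS 'aws_iam_role=#{iam_role}'
    DATEFORMAT AS 'DD/MON/YYYY'
    DELIMITER '\\t'
  SQL
end

# require "pg"
# conn = PG.connect(host: "mycluster.example.us-east-1.redshift.amazonaws.com",
#                   port: 5439, dbname: "metrics",
#                   user: "admin", password: ENV["REDSHIFT_PASSWORD"])
# conn.exec(copy_sql("s3://bucket/output-prefix/part-",
#                    "arn:aws:iam::ACCOUNT_ID:role/RedshiftCopyRole"))
```

The `part-` prefix is why EMR's output is "ready to load": COPY treats the S3 prefix as an alias for every reducer output file beneath it.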
Summary
• Amazon Redshift speaks familiar SQL.
• You can reference an S3 source by prefix, just as with EMR input.
• If delimited, EMR's output structure is ready to load.
Report Generation
Log-processing system
Logs in S3 → EMR → Redshift → Reports
Simple Count
SELECT COUNT(DISTINCT USERNAME)
FROM FACT_DAILY_REQUESTS
Date-range queries
SELECT END_DATE, SUM(REQUEST_COUNT)
FROM FACT_DAILY_REQUESTS
WHERE END_DATE BETWEEN '2015-10-06' AND '2015-10-09'
GROUP BY END_DATE
ORDER BY END_DATE DESC
Advanced query – New user behavior
SELECT REQUEST, SUM(REQUEST_COUNT) AS TOTAL
FROM FACT_DAILY_REQUESTS f, DIM_USERS u
WHERE f.USERNAME = u.USERNAME
AND f.END_DATE BETWEEN '2015-10-01' AND '2015-10-07'
AND u.REGISTRATION_DATE >= '2015-10-01'
GROUP BY REQUEST
ORDER BY TOTAL DESC
LIMIT 10
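Queries like these can be run from Ruby the same way the ingestion is driven, again via the `pg` gem. A sketch, with an illustrative formatting helper; `PG::Result` rows come back as hashes of strings, which is why the earlier output shows `"VISITS" => "14505"` rather than an integer:

```ruby
# Sketch: format report rows returned by a Redshift query. PG::Result rows
# are hashes with string values, so counts are cast explicitly.
def format_report(rows)
  rows.map { |row| "#{row['END_DATE']}  #{row['TOTAL'].to_i}" }
end

# require "pg"
# conn = PG.connect(host: "...", port: 5439, dbname: "...", user: "...")
# sql = <<-SQL
#   SELECT END_DATE, SUM(REQUEST_COUNT) AS TOTAL
#   FROM FACT_DAILY_REQUESTS
#   GROUP BY END_DATE ORDER BY END_DATE DESC
# SQL
# puts format_report(conn.exec(sql))
```

Because the query logic lives in plain SQL strings, the same report code could later point at another PostgreSQL-compatible store.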
Reports:
Date Request Count
2015-10-01 26,781
2015-10-02 26,864
2015-10-03 20,310
2015-10-04 14,409
2015-10-05 29,029
2015-10-06 26,545
2015-10-07 27,940
Supports planned and ad hoc reports
Ad hoc queries:
SELECT REQUEST,
SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
"VISITS" => "14505" }
Summary
• Programmatic reporting with SQL
• Query logic not tied to Redshift
• Columnar storage optimized for common DW queries
• Can use S3 to store reports
• Can take advantage of PostgreSQL features:
• Window functions
• Common table expressions
Finer Points
Nice toy. Can it scale?
1 PB = 1,000,000,000,000,000 bytes = 10^15 bytes = 1,000 terabytes.
Got 5,000,000,000,000,000 problems
What did we learn?
• Master instance selection matters
o jobtracker-heap-size
• Worker memory matters
o mapreduce.map.memory.mb
o mapreduce.reduce.memory.mb
o mapred.tasktracker.map.tasks.maximum
o mapred.tasktracker.reduce.tasks.maximum
• Elasticity is AWESOME!
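One place settings like these can go: Hadoop's streaming jar accepts them as generic `-D` options, which must precede the streaming-specific arguments. A sketch, with arbitrary example values rather than recommendations:

```ruby
# Sketch: pass memory settings to hadoop-streaming as generic -D options.
# Generic options must come before the streaming options; the values here
# are arbitrary examples, not tuning advice.
TUNING = [
  "-D", "mapreduce.map.memory.mb=2048",
  "-D", "mapreduce.reduce.memory.mb=4096"
].freeze

def streaming_args(input, output)
  ["hadoop-streaming", *TUNING,
   "-mapper", "mapper.rb", "-reducer", "reducer.rb",
   "-input", input, "-output", output]
end
```

An args builder like this slots directly into the `hadoop_jar_step` of a `run_job_flow` call, so tuning changes stay in one place.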
Production lessons learned
• Repeated manual tasks == Evil
• Multiple sources of truth
• Understand storage ramifications of table design
• Automate validation
Validation Example
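The validation example was demoed live. One sketch of the idea: after each load, compare the row count EMR emitted with the row count Redshift ingested for the same date, and fail loudly on a mismatch. The counting helpers in the comments are hypothetical stand-ins for an S3 listing and a Redshift query:

```ruby
# Sketch of automated load validation: the numbers from two sources of
# truth must agree, or the pipeline raises instead of silently drifting.
def validate_load!(date, emr_rows, redshift_rows)
  return true if emr_rows == redshift_rows
  raise "Load mismatch for #{date}: EMR wrote #{emr_rows} rows, " \
        "Redshift has #{redshift_rows}"
end

# emr_rows      = count_lines_in_s3_output(date)   # hypothetical helper
# redshift_rows = conn.exec("SELECT COUNT(*) FROM FACT_DAILY_REQUESTS " \
#                           "WHERE END_DATE = '#{date}'")[0]["count"].to_i
# validate_load!(date, emr_rows, redshift_rows)
```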
You don't have to do it yourself
• Related services
• AWS Data Pipeline
• Amazon Machine Learning
• Amazon Kinesis
• Amazon Simple Email Service
• Amazon Simple Notification Service
• AWS Marketplace
Conclusion
Now you can:
• Write a streaming Amazon Elastic MapReduce job.
• Automate cluster creation with the AWS SDK for Ruby.
• Format results and ingest into Amazon Redshift.
• Create useful reports from Amazon Redshift.
• Start thinking about scaling and production deployment.
Resources
• Sample Code
• https://github.com/awslabs/reinvent2015-dev309
• Amazon Elastic MapReduce documentation
• http://aws.amazon.com/documentation/elasticmapreduce/
• Amazon Redshift documentation
• http://aws.amazon.com/documentation/redshift/
• AWS SDK for Ruby documentation
• http://docs.aws.amazon.com/sdkforruby/api/index.html
• Twitter: @alexwwood
Thank you!
Remember to complete
your evaluations!
Related sessions
• BDT305 - Amazon EMR Deep Dive and Best Practices
• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices
• DAT201 - Introduction to Amazon Redshift