Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
ARC 306: Lumberjacking on AWS
Cutting Through Logs to Find What Matters
Guy Ernest, Solutions Architecture
November 15, 2013
Progress Is Not Evenly Distributed

Disk storage, 1980 vs. today:

             1980             Today      Change
Price        $14,000,000/TB   $30/TB     ÷ 450,000
Capacity     100 MB           3 TB       × 30,000
Throughput   4 MB/s           200 MB/s   × 50
Solution: More Spindles (photo by Kheel Center, Cornell University)
Case Study – Foursquare
The Challenge
“…Foursquare streams hundreds
of millions of application logs
each day. The company relies on
analytics to report on its daily
usage, evaluate new offerings,
and perform long-term trend
analysis—and with millions of
new check-ins each day, the
workload is only growing…”
“Real” Project Requirements Example
Cost Analysis
Data transfer
• By date/time
• By edge location
• By date/time within an edge location
• By top X URLs
• By HTTP vs. HTTPS
Marketing
Top URLs
• As-is count
• By content type
• By edge location
• By edge location and content type
Requests served
• By edge location
Revenue
• By edge location
Top games
• By age
• By income
• By gender
Operations
Error rates
• By top X URLs
• By edge location
• By edge location and content type
Revenue
Top games
• By revenue
• By edge location and revenue
Top ads
• That lead to a game purchase
Viable Business
[Chart: $ money vs. # of users, plotting operation costs against revenues]
Available Data Sources

Metric | Sources
Data transfer by date/time | CloudFront logs
Data transfer by edge location | CloudFront logs
Data transfer by date/time within an edge location | CloudFront logs
Data transfer by top X URLs | CloudFront logs, web server logs
Data transfer by HTTP vs. HTTPS | CloudFront logs
Top URLs | CloudFront logs, web server logs
Top URLs by content type | CloudFront logs
Top URLs by edge location | CloudFront logs
Top URLs by edge location and content type | CloudFront logs
Error rates by top X URLs | CloudFront logs, web server logs
Error rate by edge location | CloudFront logs
Error rate by edge location and content type | CloudFront logs
Requests served by edge location | CloudFront logs
Revenue by edge location | CloudFront logs, OrdersDB, app server logs
Top games segmented by age | CloudFront logs, user profile
Top games segmented by income | CloudFront logs, user profile
Top games segmented by gender | CloudFront logs, user profile
Top games by revenue | CloudFront logs, OrdersDB
Top games by edge location and revenue | CloudFront logs, OrdersDB
Top game revenue segmented by age | CloudFront logs, OrdersDB, user profile
CloudFront Access Log Format

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfD\
dlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20M\
SIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL05\
0LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20M\
SIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlW\
R3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%2\
0Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X\
5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;\
%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
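To make the field layout concrete, here is a minimal Python sketch (not from the talk) that splits one unwrapped, tab-separated log line into a dict keyed by the #Fields header above; the sample values are shortened from the slide.

```python
# Field names taken from the #Fields header of the CloudFront access log.
FIELDS = ["date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method",
          "cs(Host)", "cs-uri-stem", "sc-status", "cs(Referer)",
          "cs(User-Agent)", "cs-uri-query"]

def parse_line(line):
    """Split one tab-delimited log line into a field dict."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, values))
    record["sc-bytes"] = int(record["sc-bytes"])    # numeric fields
    record["sc-status"] = int(record["sc-status"])
    return record

# Shortened sample line (values abridged from the slide above).
sample = "\t".join(["2012-05-25", "22:01:30", "AMS1", "4448", "94.212.249.78",
                    "GET", "d1234567890213.cloudfront.net", "/Simutrans.exe",
                    "200", "http://example.com/", "Mozilla/5.0",
                    "uid=100&oid=108625181"])
rec = parse_line(sample)
print(rec["x-edge-location"], rec["sc-status"])  # → AMS1 200
```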
Sample Your Data with R
> library(ggplot2)  # needed for ggplot/geom_histogram below
> sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header=F)
> sample_data <- sample_data[-1:-2,]  # drop the #Version and #Fields header lines
> View(sample_data)
> m <- ggplot(sample_data, aes(x = factor(V9)))  # V9 = sc-status
> m + geom_histogram() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')
Need a Lot of Memory?
OpenRefine Running on an EC2 Instance
[Diagram: web, CRM, database, and log sources (OLTP) feed an ETL process that loads an OLAP data warehouse queried by the analyst]
Log Shipping Swedish public domain photo taken in 1918
“Poor Man’s Log Shipping”
Embedding an invisible tracking pixel:
http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-7019765-1&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
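Once pixel requests land in the access logs, the tracking parameters can be recovered from the query string with the standard library; a minimal Python sketch (the shortened URL below reuses parameter names from the example above):

```python
from urllib.parse import urlparse, parse_qs

# Shortened version of the tracking-pixel URL above.
url = ("http://www.poor-man-analytics.com/__track.gif"
       "?idt=5.1.5&idc=5&utmsr=1440x900&utmul=en-us")

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlparse(url).query)
print(params["utmsr"][0])   # screen resolution → 1440x900
print(params["utmul"][0])   # browser language → en-us
```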
Open Source Frameworks
Input Output
+--------------------------------------------+
| |
| Web Apps ---+ +--> File |
| | | |
| +--> ---+ |
| /var/log ------> Fluentd ------> Mail |
| +--> ---+ |
| | | |
| Apache ---+ +--> S3 |
| |
+--------------------------------------------+
Web Server
+---------+
| Fluentd -------+
+---------+ |
|
Proxy Server |
+---------+ +--> +---------+
| Fluentd ----------> | Fluentd |
+---------+ +--> +---------+
|
Database Server |
+---------+ |
| Fluentd -------+
+---------+
Fluentd
Flume
Scribe
Chukwa
…
Fluentd ASCII diagrams
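As a concrete illustration of the flow diagrammed above, a minimal td-agent (Fluentd) configuration might tail an Apache access log and buffer it to S3. This is a hedged sketch: the bucket name, file paths, and credential placeholders are hypothetical, and exact option names vary by fluent-plugin-s3 version.

```
<source>
  type tail
  path /var/log/httpd/access_log          # input: the web server log
  pos_file /var/log/td-agent/access_log.pos
  tag apache.access
  format apache2
</source>

<match apache.access>
  type s3                                 # output: fluent-plugin-s3
  aws_key_id <key_id>
  aws_sec_key <secret_key>
  s3_bucket my-log-bucket
  path logs/
  buffer_path /var/log/td-agent/buffer/s3
  time_slice_format %Y%m%d%H              # one S3 object per hour
</match>
```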
Use Amazon Kinesis (new) to Ship Your Logs
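A sketch of shipping log lines through Kinesis, assuming a hypothetical stream named "log-stream". The helpers below only shape the records (data blob plus partition key); the actual API call, via the later boto3 SDK, is shown commented out so the sketch stays self-contained.

```python
import hashlib

def partition_key(line):
    """Hash the line's first field (here, the log date) to spread lines across shards."""
    first = line.split(" ", 1)[0] or "empty"
    return hashlib.md5(first.encode("utf-8")).hexdigest()

def build_records(lines):
    """Shape each non-empty line as a Kinesis record."""
    return [{"Data": line.encode("utf-8"), "PartitionKey": partition_key(line)}
            for line in lines if line.strip()]

records = build_records(["2012-05-25 22:01:30 AMS1 4448 ...",
                         "2012-05-25 22:01:31 AMS1 4952 ..."])

# Actual shipping (requires AWS credentials and an existing stream):
# import boto3
# boto3.client("kinesis").put_records(StreamName="log-stream", Records=records)
```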
Aggregation with S3DistCp
• Aggregated
• Even-sized
• Compressed
S3DistCp on EMR Job Sample

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args \
'--src,s3://myawsbucket/cf,\
--dest,s3://myoutputbucket/aggregate,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128,\
--outputCodec,lzo,\
--deleteOnSuccess'
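The --groupBy regex is what drives the aggregation: every source file whose name yields the same captured group (the YYYY-MM-DD-HH fragment of the CloudFront log file name) is merged into one output file. A small Python sketch with hypothetical file names illustrates the grouping:

```python
import re

# Same regex as the --groupBy argument above.
pattern = re.compile(r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")

# Hypothetical CloudFront log object keys (two from hour 22, one from hour 23).
keys = ["cf/XABCD12345678.2012-05-25-22.NEfbhLN3.gz",
        "cf/XABCD12345678.2012-05-25-22.aBcDeF01.gz",
        "cf/XABCD12345678.2012-05-25-23.XyZ98765.gz"]

# Bucket keys by captured group, as S3DistCp does before concatenating.
groups = {}
for key in keys:
    m = pattern.match(key)
    if m:
        groups.setdefault(m.group(1), []).append(key)

print(sorted(groups))  # → ['2012-05-25-22', '2012-05-25-23']
```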
Pig for Access Log Analysis

RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';
LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
url,
DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') as dt,
SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') ,0, 10 ) as day,
…
status,
REGEX_EXTRACT(url, '^GET /([^\\?]+)', 1) AS action: chararray,
REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt: chararray,
REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc: chararray;
I1 = FILTER LOGS_BASE_F_W_PARAM by action == 'clic' or action == 'display';
LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp ,idc;
G1 = GROUP LOGS_SHORT BY (uuid,idc);
STORE G1 INTO 's3://mybucket/sessions/';
Load and Filter (cat / grep) → Parse (awk) → Store (>)
Pig vs. Hive
• Pig is geared toward sequentially transforming data
– ETL
– Shell in scale (from local mode to any scale)
• Hive is for querying data
– Data analysis / HQL
– Some transformation, typically as a means to an end, e.g., temporary tables
Monitoring Pig
https://github.com/netflix/lipstick
Another Monitoring Tool
https://github.com/twitter/ambrose
Optimize Your EMR Cluster
Monitor Your EMR Cluster
Bootstrap Actions

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia
Management Console
Customer Tools

Gathering information about EMR jobs from multiple sources and presenting it in textual and graphical views
github.com/Hi-Media/EmrMonitoring
Completed Job View
Spot Bidding Strategies
• Most savings
• Not paying more
• Fewer interruptions
Jeff Bezos (early Amazon days)
Data Sources + Queries → Value
More Trends to Consider
Transactional Processing | Analytical Processing
Transactional context    | Global context
Latency                  | Throughput
Indexed access           | Full table scans
Random IO                | Sequential IO
Disk seek times          | Disk transfer rate
COPY into Amazon Redshift

create table cf_logs
( d date, t char(8), edge char(4), bytes int, cip varchar(15),
verb char(3), distro varchar(MAX), object varchar(MAX), status int,
Referer varchar(MAX), agent varchar(MAX), qs varchar(MAX) )
copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD'
COPY into Amazon Redshift with AWS Data Pipeline
Time for Data Visualization
Charles Minard's flow map of Napoleon's March (1869)
Choose Your Favorite Visualization Tool
Tableau (Windows instance)
R
Jaspersoft
QlikView
MicroStrategy
SiSense
…
Snapshot before Delete
Unload Data from Amazon Redshift

unload ('select * from cf_logs where d between \'2013-11-03\' and \'2013-11-10\'')
to 's3://mybucket/unload_cf_logs_week_46'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
delimiter as '\t'
GZIP;
Reference Architecture
Partner Services
Loggly
Splunk
Stratalux (Logstash)
…
Loggly AWS Marketplace Page
What Else Can You Do with Log Analysis?
Finally, a Small Warning
Abraham Wald (1902-1950)
Would You Like to Know More?
Further reading:
http://aws.amazon.com/architecture
http://aws.amazon.com/articles
http://aws.typepad.com
re:Invent sessions:
DAT205 - Amazon Redshift in Action: Enterprise, Big Data, and SaaS
DAT305 - Getting Maximum Performance from Amazon Redshift
BDT301 - Scaling your Analytics with Amazon Elastic MapReduce
Please give us your feedback on this presentation. As a thank you, we will select prize winners daily for completed surveys!
ARC306