Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013
DESCRIPTION
Want to learn how to build your own Google Analytics? Learn how to build a scalable architecture using node.js, Amazon DynamoDB, and Amazon EMR. This architecture is used by ScribbleLive to track billions of engagement minutes per month. In this session, we go over the code in node.js, how to store the data in Amazon DynamoDB, and how to roll up the data using Hadoop and Hive.
TRANSCRIPT
![Page 1: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/1.jpg)
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Building Your Own Web Analytics Service with node.js, Amazon DynamoDB, and Amazon Elastic MapReduce
Jonathan Keebler - Founder, CTO - ScribbleLive
November 13, 2013
![Page 2: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/2.jpg)
Who Am I?
•Jonathan Keebler @keebler
•Built video player for all CTV properties
–Worked on news sites like CTV, TSN, CP24
•CTO, Founder of ScribbleLive
•Bootstrapped a high scalability startup
–Credit card limit wasn’t that high, had to find cheap ways to handle the load of top tier news sites
![Page 3: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/3.jpg)
What is ScribbleLive?
•Leading provider of real-time engagement management solutions
•We enable real-time publication and syndication of digital content
•Our platform is transforming the way the world’s largest brands and media approach communication and content creation, creating true real-time engagement
![Page 4: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/4.jpg)
Some of our customers
![Page 5: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/5.jpg)
Today
•Learn to build your own analytics service
– Seriously, we’re going to do it
•node.js on Amazon EC2: web servers
•Amazon DynamoDB: database
•Hadoop/Hive on Amazon Elastic MapReduce (EMR): roll-up data
![Page 6: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/6.jpg)
Why would we do this?
•ScribbleLive tracks “engagement minutes” (EMs) across all customer sites
– e.g., ESPN.com, CNN.com, Reuters.com
– EM = 1 minute of a user watching a webpage
– 2.5B per month, 120M+ per hour
•Big analytics providers couldn’t do it
– Didn’t have the features
– Too inaccurate
![Page 7: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/7.jpg)
How are we going to do this?
Elastic Load Balancing
Visitors
node.js node.js node.js node.js
DynamoDB
![Page 8: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/8.jpg)
DynamoDB: data structure
•Separate tables by timeframe
– Minute (written by node.js directly)
– Hour (EMR from minute data)
– Day (EMR from hour data)
– Month (EMR from day data)
•Structure
– Hash: Item (page id)
– Range: Time (rounded to min, hour, day)
– { Hits: 1 }
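As a rough sketch of the structure above, the snippet below builds one record per page and time bucket. It assumes Time is stored as Unix seconds (which is what the Hive script later in the deck operates on); the helper names `roundDown` and `makeRecord` are hypothetical. Months are left out because they are not a fixed number of seconds.

```javascript
// Sketch of the key layout: one record per (page id, rounded time).
// Helper names are illustrative; Time is assumed to be Unix seconds.
const BUCKET_SECONDS = { minute: 60, hour: 3600, day: 86400 };

// Round a Unix timestamp (seconds) down to the start of its bucket.
function roundDown(unixSeconds, granularity) {
  const size = BUCKET_SECONDS[granularity];
  if (!size) throw new Error('unknown granularity: ' + granularity);
  return Math.floor(unixSeconds / size) * size;
}

// Build the record for one page in one time bucket.
function makeRecord(pageId, unixSeconds, granularity, hits) {
  return {
    Item: pageId,                              // hash key
    Time: roundDown(unixSeconds, granularity), // range key
    Hits: hits,
  };
}

const sampleTime = Date.UTC(2014, 0, 1, 1, 23, 45) / 1000; // 2014-01-01 01:23:45 UTC
console.log(makeRecord('abcd', sampleTime, 'minute', 1));
console.log(makeRecord('abcd', sampleTime, 'hour', 1));
```

The same record shape can then be reused for the Minute, Hour, and Day tables, with only the rounding changed.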
![Page 9: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/9.jpg)
Elastic Load Balancing: AMI setup
•Custom AMI
– Loads source from SVN
– Launches node.js
![Page 10: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/10.jpg)
Elastic Load Balancing: Load balancing
•1 load balancer
•Cookies keep unique user on same instance
•Auto-scaling
– CPU > 50% or network-in of 50M bytes triggers new servers to come online and be added to Elastic Load Balancing
![Page 11: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/11.jpg)
node.js: Overview of code
•Accepts GET /?item={ID}&uid={UserID}
•Dictionary/Array of how many GETs per item in this minute
– Hits[Minute]["{ID}"]++
– Example: Hits["1/1/2014 1:23:00"]["abcd"]++
•Dictionary/Array of Users already counted in Item:Minute (prevent double-counting)
•At end of minute, write data back to DynamoDB
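The counting logic above could be sketched roughly as follows. This is a minimal illustration, not ScribbleLive's code: `createCounter`, `record`, `drain`, and `parseHit` are hypothetical names, and minutes are keyed as Unix seconds rather than the date string shown on the slide.

```javascript
// In-memory counters for one node.js process (illustrative sketch).
function createCounter() {
  const hits = new Map(); // minute -> Map(itemId -> count)
  const seen = new Map(); // minute -> Set of item/user pairs already counted

  // Round a millisecond clock reading down to the minute, in Unix seconds.
  function minuteOf(nowMs) {
    return Math.floor(nowMs / 60000) * 60;
  }

  // Count one GET; returns false if this user was already counted
  // for this item in this minute.
  function record(itemId, userId, nowMs) {
    const minute = minuteOf(nowMs);
    if (!hits.has(minute)) {
      hits.set(minute, new Map());
      seen.set(minute, new Set());
    }
    const pair = JSON.stringify([itemId, userId]);
    if (seen.get(minute).has(pair)) return false;
    seen.get(minute).add(pair);
    const counts = hits.get(minute);
    counts.set(itemId, (counts.get(itemId) || 0) + 1);
    return true;
  }

  // Remove and return the counts for a finished minute.
  function drain(minute) {
    const counts = hits.get(minute) || new Map();
    hits.delete(minute);
    seen.delete(minute);
    return counts;
  }

  return { record, drain, minuteOf };
}

// Pull item and uid out of a request URL such as "/?item=abcd&uid=u1".
function parseHit(requestUrl) {
  const params = new URL(requestUrl, 'http://localhost').searchParams;
  return { itemId: params.get('item'), userId: params.get('uid') };
}
```

An HTTP handler would call `parseHit(req.url)` and then `record(...)` with the current time; a timer would call `drain(...)` for the minute that just ended and hand the counts to the DynamoDB writer.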
![Page 12: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/12.jpg)
node.js: Bulk writing to DynamoDB
•Writing all data back immediately in a loop = BAD!
– Throughput would spike in that ~second
– Would have to use higher throughput limit
– More $$$$
•Instead, spread the writes over the next minute: (number of writes that need to happen) / 60 seconds = how many writes per second you should do
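One way to pace the writes, sketched under the assumption that each pending write is an element of an array and that `sendFn` is whatever function actually issues the DynamoDB call (both are placeholders, as are the function names):

```javascript
// Writes per second needed to finish `totalWrites` within the window.
function writesPerSecond(totalWrites, windowSeconds = 60) {
  return Math.ceil(totalWrites / windowSeconds);
}

// Split the pending writes into per-second batches of that size.
function planBatches(writes, windowSeconds = 60) {
  const rate = writesPerSecond(writes.length, windowSeconds);
  const batches = [];
  for (let i = 0; i < writes.length; i += rate) {
    batches.push(writes.slice(i, i + rate));
  }
  return batches;
}

// Send batch 0 now, batch 1 after one second, and so on.
function schedule(batches, sendFn) {
  batches.forEach((batch, second) => {
    setTimeout(() => batch.forEach(sendFn), second * 1000);
  });
}

// Example: 150 pending writes become 50 batches of 3 writes each.
const pending = Array.from({ length: 150 }, (_, i) => 'item-' + i);
console.log(writesPerSecond(pending.length), planBatches(pending).length);
```

Rounding the rate up means the last batches may be empty seconds rather than overflowing past the minute, which keeps the peak throughput close to the average.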
![Page 13: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/13.jpg)
node.js: Bulk writing to DynamoDB
•Call to DynamoDB per item:
– update: (atomic) add X to {ID}:{Minute}
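The atomic add can be expressed as a DynamoDB UpdateItem request. The sketch below only builds the request parameters; it uses the expression-based form of UpdateItem, which may differ from what the talk's code used in 2013. The table name `Metrics` and the function name are assumptions; the attribute names follow the Hive script in this deck.

```javascript
// Build UpdateItem parameters that atomically add `count` to Hits
// for one (item, minute) pair. Nothing is sent here; pass the result
// to a DynamoDB client's UpdateItem call.
function buildAtomicAdd(tableName, itemId, minute, count) {
  return {
    TableName: tableName,
    Key: {
      Item: { S: itemId },         // hash key: page id
      Time: { N: String(minute) }, // range key: minute as Unix seconds
    },
    // ADD creates the attribute if it is missing, otherwise increments it.
    UpdateExpression: 'ADD #hits :inc',
    ExpressionAttributeNames: { '#hits': 'Hits' },
    ExpressionAttributeValues: { ':inc': { N: String(count) } },
  };
}

console.log(JSON.stringify(buildAtomicAdd('Metrics', 'abcd', 1388539380, 5), null, 2));
```

Because ADD is applied server-side, several node.js instances can update the same item in the same minute without reading it first.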
![Page 14: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/14.jpg)
Hadoop: What we map and reduce
•To go from minute to hourly data
– Round every minute down to the nearest hour (floor( Minute / 3600 ) * 3600)
– Sum the # of “Hits” from each data point
•Just look at the past 24 hours to save time
•Do the same for hourly to daily, daily to monthly
![Page 15: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/15.jpg)
Hadoop: Hive scripts
INSERT OVERWRITE TABLE MetricsHourly
SELECT
  Item,
  (floor( Time / 3600 ) * 3600) AS Time,
  SUM(Hits) AS Hits,
  from_unixtime(floor( Time / 3600 ) * 3600 ) AS TimeFriendly
FROM Metrics
WHERE Time >= floor( unix_timestamp() / 86400 ) * 86400 - ( 86400 * 1 )
GROUP BY Item, floor( Time / 3600 ) * 3600;
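The cutoff in the WHERE clause is the start of the previous UTC day, so each run rescans between 24 and 48 hours of minute data depending on when it starts. The same arithmetic in JavaScript, with a hypothetical function name:

```javascript
// Mirror of the Hive cutoff: midnight UTC today minus `daysBack` days,
// expressed in Unix seconds.
function rollupCutoff(nowSeconds, daysBack = 1) {
  return Math.floor(nowSeconds / 86400) * 86400 - 86400 * daysBack;
}

const now = Math.floor(Date.now() / 1000);
console.log(new Date(rollupCutoff(now) * 1000).toISOString());
```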
![Page 16: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/16.jpg)
Hadoop: Setting Up EMR
![Page 17: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/17.jpg)
Hadoop: Setting Up EMR
• “Start an Interactive Hive Session”
• Run a cron job every 15 minutes to check if the Hive job is complete
• If complete, download the newest Hive script and restart the job
• Amazon CloudWatch alarms if jobs are taking longer than 12 hours
![Page 18: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/18.jpg)
Hadoop: Cron Job
#!/bin/sh
# Look for a Hadoop job that is still running.
JOBID=$(hadoop job -list | grep job_ | cut -f1)
if [ -n "$JOBID" ]; then
  echo "Another job already running"
else
  echo "Starting Hive job..."
  echo "$(date) starting" >> /var/log/metricsdaily_starting
  # Fetch the newest roll-up script and run it only if the download succeeds.
  wget -qO- http://DEPLOY/metrics/rollups.sql > /tmp/rollups.sql && hive -f /tmp/rollups.sql
fi
![Page 19: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/19.jpg)
Application API
•RESTful API in the language of your choice
•Calls to DynamoDB:
–query: Hash:{ID} w/ Range:{Time A}-{Time B}
•Since M-R could take a day to run, need to reconstruct hourly data from minutes for most recent 24 hours
–e.g., if you want hourly data for the last 2 days, take 24 hourly data pts from yesterday and 24*60 minute data pts from today (convert to hourly data pts in code)
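The reconstruction step could look something like the sketch below. It assumes both query results arrive as arrays of `{ Time, Hits }` objects with Time in Unix seconds, and that `cutoff` marks where the rolled-up hourly data ends; the function names are hypothetical.

```javascript
// Collapse minute data points into hourly points by summing Hits.
function toHourly(minutePoints) {
  const byHour = new Map();
  for (const { Time, Hits } of minutePoints) {
    const hour = Math.floor(Time / 3600) * 3600;
    byHour.set(hour, (byHour.get(hour) || 0) + Hits);
  }
  return [...byHour.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([Time, Hits]) => ({ Time, Hits }));
}

// Use rolled-up hourly points before `cutoff` and rebuild everything
// from `cutoff` onward out of minute data.
function mergeHourly(hourlyPoints, minutePoints, cutoff) {
  const older = hourlyPoints.filter((p) => p.Time < cutoff);
  const recent = toHourly(minutePoints.filter((p) => p.Time >= cutoff));
  return older.concat(recent);
}

const hourly = [{ Time: 0, Hits: 10 }, { Time: 3600, Hits: 20 }];
const minutes = [{ Time: 7200, Hits: 1 }, { Time: 7260, Hits: 2 }, { Time: 10800, Hits: 4 }];
console.log(mergeHourly(hourly, minutes, 7200));
```

Filtering on the cutoff on both sides avoids counting an hour twice if the roll-up has already covered part of the recent window.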
![Page 20: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/20.jpg)
Performance
![Page 21: Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013](https://reader034.vdocument.in/reader034/viewer/2022042613/540dd8a48d7f72747e8b4ba3/html5/thumbnails/21.jpg)
Performance