Data Lessons Learned at Scale - Big Data DC
DESCRIPTION
Half of the work that it takes to do data science is plumbing and wrangling. I'll discuss some tricks we've learned while building AddThis over the years to collect and process data at web scale.

TRANSCRIPT
Charlie Reverte, VP Engineering
@numbakrrunch
Data Lessons Learned at Scale
Topic
Half of the work that it takes to do data science is plumbing and wrangling
I’ll discuss some tricks we’ve learned over the years to collect and process data at web scale
About AddThis
We make tools for websites: sharing, following, and personalized recommendations.
Our Data
We process tool data:
● Sharing
● Following
● Visitation
● Content Classification

And feed it back to sites:
● Analytics
● Trending Content
● Personalized Recommendations
At Scale...
● 14 million domains
● 100 billion views/month
● 45k events/sec
● 160k concurrent firewall sessions
● 500k unique metrics in Ganglia
Counting Things
Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency

● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus (see the sketch below)
  ○ Distributed counting
  ○ Checkpointing
Stream-lib: https://github.com/clearspring/stream-lib
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
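As a quick sketch of why mergeability matters, here is a minimal example using stream-lib's HyperLogLog; the shard counts and log2m value are illustrative, not numbers from the talk:

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class UniqueCounter {
    public static void main(String[] args) throws Exception {
        // Each shard counts unique visitor IDs independently with a
        // fixed-size sketch; log2m = 14 gives roughly 0.8% relative error.
        HyperLogLog shardA = new HyperLogLog(14);
        HyperLogLog shardB = new HyperLogLog(14);

        for (int i = 0; i < 1_000_000; i++)       shardA.offer("uid-" + i);
        for (int i = 500_000; i < 1_500_000; i++) shardB.offer("uid-" + i);

        // Mergeability: per-shard sketches combine into a global estimate,
        // which is what makes distributed counting and checkpointing cheap.
        HyperLogLog global = (HyperLogLog) shardA.merge(shardB);
        System.out.println("estimated uniques: " + global.cardinality()); // ~1,500,000
    }
}
```

Each sketch stays a few kilobytes no matter how many IDs it has seen, and merging two sketches gives the same estimate as counting the combined stream.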
Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
  ○ Expected collisions ≈ (m choose 2) / n = m(m-1)/(2n) ≈ 0.142 collisions/sec at m = 35k req/sec, with n = 2^32 random values
● Naturally time-ordered, with a built-in DoB (each ID encodes when it was created)
Compare to Twitter Snowflake:
https://github.com/twitter/snowflake/
[Bit layout: 64-bit ID; bits 63-32 = time, bits 31-0 = rand]
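A minimal sketch of the time-plus-random scheme, assuming the 32/32 bit split shown in the layout above (hypothetical code, not the production generator):

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical session-ID generator: 32 bits of epoch seconds concatenated
// with 32 random bits, giving time-ordered IDs with a built-in date of birth.
public final class SessionId {
    private static final SecureRandom RNG = new SecureRandom();

    public static String next() {
        long seconds = System.currentTimeMillis() / 1000L;  // bits 63..32
        long rand = RNG.nextInt() & 0xFFFFFFFFL;            // bits 31..0
        long id = (seconds << 32) | rand;
        byte[] raw = ByteBuffer.allocate(Long.BYTES).putLong(id).array();
        // 8 bytes encode to an 11-character URL-safe string, like the
        // slide's base64 example.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }
}
```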
Joining Data
● Value of data increases with higher dimensionality
  ○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest (see the sketch below)
  ○ Disk is cheap
● Join your data in client-side storage
  ○ Browsers as a lossy distributed database
● Mutability?
“The value is in the join” (or something like that)
https://github.com/stewartoallen
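A sketch of the join-at-ingest idea; GeoService and PageStore are hypothetical lookup services, not AddThis APIs:

```java
// Minimal sketch of join-at-ingest, not the actual pipeline. Each event is
// de-normalized once at write time, so downstream jobs read flat records
// with no joins ("disk is cheap").
interface GeoService { String countryFor(String ip); }
interface PageStore  { String categoryFor(String url); }

record RawEvent(String uid, String url, String ip, long time) {}
record EnrichedEvent(String uid, String url, String ip, long time,
                     String country, String pageCategory) {}

class Ingest {
    static EnrichedEvent enrich(RawEvent e, GeoService geo, PageStore pages) {
        return new EnrichedEvent(
            e.uid(), e.url(), e.ip(), e.time(),
            geo.countryFor(e.ip()),        // geo dimension joined at ingest
            pages.categoryFor(e.url()));   // page attributes joined at ingest
    }
}
```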
Sharding and Sampling
● Choose your shard keys wisely
  ○ High-cardinality field to reduce lumpiness
  ○ What do you need to co-locate?
● Shards are also useful for sampling (see the sketch below)
  ○ Law of large numbers
● Can yield statistical significance
  ○ Depending on the question
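A sketch of how one hash-based shard function can serve both purposes, assuming a high-cardinality key like a session ID (hypothetical code):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hash-based sharding over a high-cardinality key. Because the hash spreads
// keys uniformly, any single shard is an unbiased ~1/numShards sample of all
// traffic, often enough to answer a statistical question without scanning
// every shard.
class Sharding {
    static int shardFor(String key, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        // Read only shard 0 to get a ~1/64 sample keyed by session ID.
        boolean inSample = shardFor("4f6934b6f54bd7c1", 64) == 0;
        System.out.println("in sample: " + inSample);
    }
}
```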
Tunable QoS
● URL metadata stored in a 90-node Cassandra cluster
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
  ○ Depending on write rate per record
● Global TTL knob (see the sketch below)
  ○ Turn up to reduce load for maintenance
  ○ Turn down to improve responsiveness
[Diagram: CDN cache]
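A sketch of what such a tunable TTL policy could look like; the knob semantics and constants here are assumptions, not details from the talk:

```java
// Hypothetical tunable cache-TTL policy. Records that are rewritten often
// get short TTLs; a global multiplier is the ops knob: raise it to shed
// read load on the cluster during maintenance, lower it for freshness.
class TtlPolicy {
    private static final long MIN_TTL_SEC = 60;      // floor: 1 minute
    private static final long MAX_TTL_SEC = 86_400;  // ceiling: 1 day

    private volatile double globalKnob = 1.0;        // tunable at runtime

    void setGlobalKnob(double knob) { this.globalKnob = knob; }

    long ttlSeconds(double writesPerDayForRecord) {
        // Per-record TTL inversely proportional to its write rate.
        double base = MAX_TTL_SEC / Math.max(1.0, writesPerDayForRecord);
        double scaled = base * globalKnob;
        return (long) Math.min(MAX_TTL_SEC, Math.max(MIN_TTL_SEC, scaled));
    }
}
```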
Deployment
● Continuous deploy?
● Deploying our JavaScript costs $3k
  ○ Have to invalidate 1.4B browser caches
  ○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDoSed ourselves
  ○ Very interesting bugs
● Simulation is weak
  ○ The internet is a dirty place
  ○ Embrace incremental deploys (see the sketch below)
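One way to make deploys incremental is a stable hash-based rollout; this sketch is hypothetical, not AddThis's deploy tooling:

```java
// Serve the new JavaScript build to a stable fraction of users, watch for
// the "very interesting bugs", then ramp the fraction up instead of
// flushing 1.4B browser caches at once.
class Rollout {
    static boolean useNewBuild(String uid, double fraction) {
        // Stable per-user bucket in [0, 1) derived from the UID's hash.
        double bucket = (uid.hashCode() & 0x7FFFFFFF) / (double) Integer.MAX_VALUE;
        return bucket < fraction;
    }

    public static void main(String[] args) {
        System.out.println(useNewBuild("4f6934b6f54bd7c1", 0.05)); // 5% rollout
    }
}
```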
Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● By @abramsm
[Diagram: input rows with Time, IP, UID, URL, and Geo fields transposed into per-column blocks in stored data, with a configurable block size]
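A sketch of the transpose-then-compress idea (illustrative only; not the actual implementation):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Rows are buffered into a block, each field is written out contiguously as
// its own column, and each column is compressed separately, so the
// compressor sees homogeneous data (all URLs together, all geos together)
// and can be chosen per column.
class ColumnarBlock {
    record Event(long time, String ip, String uid, String url, String geo) {}

    static byte[] deflate(String column) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(
                out, new Deflater(Deflater.BEST_COMPRESSION))) {
            z.write(column.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    static byte[][] store(List<Event> block) throws Exception {
        StringBuilder ips = new StringBuilder(), urls = new StringBuilder();
        for (Event e : block) {            // transpose row data into columns
            ips.append(e.ip()).append('\n');
            urls.append(e.url()).append('\n');
        }
        // In a real store each column could get a different codec; both use
        // deflate here for brevity.
        return new byte[][] { deflate(ips.toString()), deflate(urls.toString()) };
    }
}
```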
Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency toward perfection
  ○ I'm still struggling with this
Questions?
@numbakrrunch