Data Lessons Learned at Scale - Big Data DC
DESCRIPTION
Half of the work that it takes to do data science is plumbing and wrangling. I'll discuss some tricks we've learned while building AddThis over the years to collect and process data at web scale.

TRANSCRIPT
Charlie Reverte, VP Engineering
@numbakrrunch
Data Lessons Learned at Scale
Topic
Half of the work that it takes to do data science is plumbing and wrangling
I’ll discuss some tricks we’ve learned over the years to collect and process data at web scale
About AddThis
We make tools for websites: sharing, following, and personalized recommendations.
Our Data
We process tool data:
● Sharing
● Following
● Visitation
● Content Classification

And feed it back to sites:
● Analytics
● Trending Content
● Personalized Recommendations
At Scale...
● 14 million domains
● 100 billion views/month
● 45k events/sec
● 160k concurrent firewall sessions
● 500k unique metrics in Ganglia
Counting Things
Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency

● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus (see the sketch below)
  ○ Distributed counting
  ○ Checkpointing
Stream-lib: https://github.com/clearspring/stream-lib
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
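As a quick sketch of why mergeability matters, here is a minimal example using stream-lib's HyperLogLog; the shard counts and log2m value are illustrative, not numbers from the talk:

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class UniqueCounter {
    public static void main(String[] args) throws Exception {
        // Each shard counts unique visitor IDs independently with a
        // fixed-size sketch; log2m = 14 gives roughly 0.8% relative error.
        HyperLogLog shardA = new HyperLogLog(14);
        HyperLogLog shardB = new HyperLogLog(14);

        for (int i = 0; i < 1_000_000; i++)       shardA.offer("uid-" + i);
        for (int i = 500_000; i < 1_500_000; i++) shardB.offer("uid-" + i);

        // Mergeability: per-shard sketches combine into a global estimate,
        // which is what makes distributed counting and checkpointing cheap.
        HyperLogLog global = (HyperLogLog) shardA.merge(shardB);
        System.out.println("estimated uniques: " + global.cardinality()); // ~1,500,000
    }
}
```

Each sketch stays a few kilobytes no matter how many IDs it has seen, and merging two sketches gives the same estimate as counting the combined stream.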
Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
  ○ Expected collisions ≈ (m choose 2) / n = m(m-1)/(2n) ≈ 0.142 collisions/sec at m = 35k req/sec, with n = 2^32 random values
● Naturally time-ordered, with a built-in DoB (each ID encodes when it was created)
Compare to Twitter Snowflake:
https://github.com/twitter/snowflake/
[Bit layout: 64-bit ID; bits 63-32 = time, bits 31-0 = rand]
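A minimal sketch of the time-plus-random scheme, assuming the 32/32 bit split shown in the layout above (hypothetical code, not the production generator):

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical session-ID generator: 32 bits of epoch seconds concatenated
// with 32 random bits, giving time-ordered IDs with a built-in date of birth.
public final class SessionId {
    private static final SecureRandom RNG = new SecureRandom();

    public static String next() {
        long seconds = System.currentTimeMillis() / 1000L;  // bits 63..32
        long rand = RNG.nextInt() & 0xFFFFFFFFL;            // bits 31..0
        long id = (seconds << 32) | rand;
        byte[] raw = ByteBuffer.allocate(Long.BYTES).putLong(id).array();
        // 8 bytes encode to an 11-character URL-safe string, like the
        // slide's base64 example.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }
}
```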
Joining Data
● Value of data increases with higher dimensionality
  ○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest (see the sketch below)
  ○ Disk is cheap
● Join your data in client-side storage
  ○ Browsers as a lossy distributed database
● Mutability?
“The value is in the join” (or something like that)
https://github.com/stewartoallen
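A sketch of the join-at-ingest idea; GeoService and PageStore are hypothetical lookup services, not AddThis APIs:

```java
// Minimal sketch of join-at-ingest, not the actual pipeline. Each event is
// de-normalized once at write time, so downstream jobs read flat records
// with no joins ("disk is cheap").
interface GeoService { String countryFor(String ip); }
interface PageStore  { String categoryFor(String url); }

record RawEvent(String uid, String url, String ip, long time) {}
record EnrichedEvent(String uid, String url, String ip, long time,
                     String country, String pageCategory) {}

class Ingest {
    static EnrichedEvent enrich(RawEvent e, GeoService geo, PageStore pages) {
        return new EnrichedEvent(
            e.uid(), e.url(), e.ip(), e.time(),
            geo.countryFor(e.ip()),        // geo dimension joined at ingest
            pages.categoryFor(e.url()));   // page attributes joined at ingest
    }
}
```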
Sharding and Sampling
● Choose your shard keys wisely
  ○ High-cardinality field to reduce lumpiness
  ○ What do you need to co-locate?
● Shards are also useful for sampling (see the sketch below)
  ○ Law of large numbers
● Can yield statistical significance
  ○ Depending on the question
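A sketch of how one hash-based shard function can serve both purposes, assuming a high-cardinality key like a session ID (hypothetical code):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hash-based sharding over a high-cardinality key. Because the hash spreads
// keys uniformly, any single shard is an unbiased ~1/numShards sample of all
// traffic, often enough to answer a statistical question without scanning
// every shard.
class Sharding {
    static int shardFor(String key, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        // Read only shard 0 to get a ~1/64 sample keyed by session ID.
        boolean inSample = shardFor("4f6934b6f54bd7c1", 64) == 0;
        System.out.println("in sample: " + inSample);
    }
}
```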
Tunable QoS
● URL metadata stored in a 90-node Cassandra cluster
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
  ○ Depending on write rate per record
● Global TTL knob (see the sketch below)
  ○ Turn up to reduce load for maintenance
  ○ Turn down to improve responsiveness
[Diagram: CDN cache]
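A sketch of what such a tunable TTL policy could look like; the knob semantics and constants here are assumptions, not details from the talk:

```java
// Hypothetical tunable cache-TTL policy. Records that are rewritten often
// get short TTLs; a global multiplier is the ops knob: raise it to shed
// read load on the cluster during maintenance, lower it for freshness.
class TtlPolicy {
    private static final long MIN_TTL_SEC = 60;      // floor: 1 minute
    private static final long MAX_TTL_SEC = 86_400;  // ceiling: 1 day

    private volatile double globalKnob = 1.0;        // tunable at runtime

    void setGlobalKnob(double knob) { this.globalKnob = knob; }

    long ttlSeconds(double writesPerDayForRecord) {
        // Per-record TTL inversely proportional to its write rate.
        double base = MAX_TTL_SEC / Math.max(1.0, writesPerDayForRecord);
        double scaled = base * globalKnob;
        return (long) Math.min(MAX_TTL_SEC, Math.max(MIN_TTL_SEC, scaled));
    }
}
```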
Deployment
● Continuous deploy?
● Deploying our JavaScript costs $3k
  ○ Have to invalidate 1.4B browser caches
  ○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDoSed ourselves
  ○ Very interesting bugs
● Simulation is weak
  ○ The internet is a dirty place
  ○ Embrace incremental deploys (see the sketch below)
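One way to make deploys incremental is a stable hash-based rollout; this sketch is hypothetical, not AddThis's deploy tooling:

```java
// Serve the new JavaScript build to a stable fraction of users, watch for
// the "very interesting bugs", then ramp the fraction up instead of
// flushing 1.4B browser caches at once.
class Rollout {
    static boolean useNewBuild(String uid, double fraction) {
        // Stable per-user bucket in [0, 1) derived from the UID's hash.
        double bucket = (uid.hashCode() & 0x7FFFFFFF) / (double) Integer.MAX_VALUE;
        return bucket < fraction;
    }

    public static void main(String[] args) {
        System.out.println(useNewBuild("4f6934b6f54bd7c1", 0.05)); // 5% rollout
    }
}
```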
Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● By @abramsm
[Diagram: input rows with Time, IP, UID, URL, and Geo fields transposed into per-column blocks in stored data, with a configurable block size]
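A sketch of the transpose-then-compress idea (illustrative only; not the actual implementation):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Rows are buffered into a block, each field is written out contiguously as
// its own column, and each column is compressed separately, so the
// compressor sees homogeneous data (all URLs together, all geos together)
// and can be chosen per column.
class ColumnarBlock {
    record Event(long time, String ip, String uid, String url, String geo) {}

    static byte[] deflate(String column) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(
                out, new Deflater(Deflater.BEST_COMPRESSION))) {
            z.write(column.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    static byte[][] store(List<Event> block) throws Exception {
        StringBuilder ips = new StringBuilder(), urls = new StringBuilder();
        for (Event e : block) {            // transpose row data into columns
            ips.append(e.ip()).append('\n');
            urls.append(e.url()).append('\n');
        }
        // In a real store each column could get a different codec; both use
        // deflate here for brevity.
        return new byte[][] { deflate(ips.toString()), deflate(urls.toString()) };
    }
}
```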
Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency toward perfection
  ○ I'm still struggling with this
Questions?
@numbakrrunch