Data Lessons Learned at Scale - Big Data DC

Charlie Reverte, VP Engineering, @numbakrrunch: Data Lessons Learned at Scale

Upload: charlie-reverte

Post on 24-Jun-2015


Category: Technology



DESCRIPTION

Half of the work that it takes to do data science is plumbing and wrangling. I’ll discuss some tricks we’ve learned while building AddThis over the years to collect and process data at web scale.

TRANSCRIPT

Page 1: Data Lessons Learned at Scale - Big Data DC

Charlie Reverte, VP Engineering

@numbakrrunch

Data Lessons Learned at Scale

Page 2: Data Lessons Learned at Scale - Big Data DC

Topic

Half of the work that it takes to do data science is plumbing and wrangling

I’ll discuss some tricks we’ve learned over the years to collect and process data at web scale

Page 3: Data Lessons Learned at Scale - Big Data DC

About AddThis

We make tools for websites:

Page 4: Data Lessons Learned at Scale - Big Data DC

Our Data

We process tool data
● Sharing
● Following
● Visitation
● Content Classification

And feed it back to sites
● Analytics
● Trending Content
● Personalized Recommendations

Page 5: Data Lessons Learned at Scale - Big Data DC

At Scale...

● 14 million domains
● 100 billion views/month
● 45k events/sec
● 160k concurrent firewall sessions
● 500k unique metrics in Ganglia

Page 6: Data Lessons Learned at Scale - Big Data DC

Counting Things

Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency

● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
  ○ Distributed counting
  ○ Checkpointing

Stream-lib: https://github.com/clearspring/stream-lib

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
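To illustrate why mergeability helps with distributed counting, here is a minimal sketch of a K-Minimum-Values (KMV) cardinality estimator. It is not taken from the talk or from stream-lib; the class name `KMVSketch`, its methods, and the default `k` are assumptions made for illustration. Each node keeps its own small sketch, and merging two sketches gives the same estimate as one sketch that saw all of the data.

```python
import hashlib
import heapq


class KMVSketch:
    """Estimate distinct counts by keeping only the k smallest hash values."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        self._add_hash(self._hash(item))

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        """Return a new sketch equivalent to having seen both input streams."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        return (self.k - 1) / -self._heap[0]
```

The relative error of this estimator shrinks roughly as 1/sqrt(k), so `k` trades memory for accuracy. Because a sketch is just a bounded set of hashes, it can also be written out as a checkpoint and merged back in later.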

Page 7: Data Lessons Learned at Scale - Big Data DC

Distributed ID Generation

● Session IDs are generated in the browser
● We concatenate time and a random value
  Hex: 4f6934b6f54bd7c1
  Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
  ○ $\binom{m}{2} / n$ = 0.142 collisions/sec (at 35k rq/sec)
● Naturally time ordered, built-in DoB

Compare to Twitter Snowflake: https://github.com/twitter/snowflake/

[Figure: 64-bit ID layout. time: bits 63-32; rand: bits 31-0]
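A minimal sketch of the ID scheme on this slide. The talk generates these IDs in the browser; Python is used here only for illustration, and the function names are hypothetical. It assumes the time field is a 32-bit Unix timestamp in seconds in the high bits and the random field is the low 32 bits, which is consistent with the sample hex value and the bit layout but is not confirmed by the talk. The collision estimate takes m as the number of IDs generated within one second and n as the 2^32 possible random values, which reproduces the slide's 0.142 at 35k requests/sec.

```python
import secrets
import time

RAND_BITS = 32  # assumed width of the random field


def make_id(now=None):
    """Return a 64-bit ID: Unix seconds in the high 32 bits, random low 32 bits."""
    seconds = int(time.time() if now is None else now) & 0xFFFFFFFF
    return (seconds << RAND_BITS) | secrets.randbits(RAND_BITS)


def id_timestamp(id64):
    """Recover the creation time (the "built-in DoB") from an ID."""
    return id64 >> RAND_BITS


def to_hex(id64):
    """Render the ID as 16 lowercase hex digits."""
    return format(id64, "016x")


def expected_collisions_per_sec(m, n=2**RAND_BITS):
    """Expected colliding pairs among m IDs that share one timestamp: C(m, 2) / n."""
    return (m * (m - 1) // 2) / n
```

Because the timestamp sits in the high bits, sorting IDs numerically also sorts them by creation second, and uniqueness only has to hold among IDs created within the same second.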

Page 8: Data Lessons Learned at Scale - Big Data DC

Joining Data

● Value of data increases with higher dimensionality
  ○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest
  ○ Disk is cheap
● Join your data in client-side storage
  ○ Browsers as a lossy distributed database
● Mutability?

“The value is in the join” (or something like that)

https://github.com/stewartoallen
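As a rough illustration of joining and de-normalizing at ingest, the sketch below copies geo and page attributes onto each event as it arrives, so later queries need no lookup. The function name `enrich`, the lookup tables, and the field names are all hypothetical; the talk does not describe the actual pipeline.

```python
def enrich(event, geo_by_ip, page_by_url):
    """Return a flat copy of event with geo and page attributes joined in."""
    row = dict(event)  # copy so the raw event is left untouched
    geo = geo_by_ip.get(event.get("ip"), {})
    page = page_by_url.get(event.get("url"), {})
    for key, value in geo.items():
        row["geo_" + key] = value
    for key, value in page.items():
        row["page_" + key] = value
    return row
```

The stored row is wider than the raw event, which is the "disk is cheap" trade-off. It also freezes the joined values at ingest time, which is where the mutability question comes from.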

Page 9: Data Lessons Learned at Scale - Big Data DC

Sharding and Sampling

● Choose your shard keys wisely
  ○ High cardinality field to reduce lumpiness
  ○ What do you need to co-locate
● Shards also useful for sampling
  ○ Law of large numbers
● Can yield statistical significance
  ○ Depending on the question
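A small sketch of hash-based sharding and of using one shard as a sample. The function names and the choice of SHA-256 are assumptions for illustration, not the talk's implementation. Hashing a high-cardinality key such as a user ID spreads records evenly, keeps all records for one key on the same shard, and makes any single shard an approximately uniform 1/N sample of the keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash (same key, same shard)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def estimate_total_from_shard(keys, num_shards, sample_shard=0):
    """Estimate the number of distinct keys by scaling up one shard's count."""
    in_sample = {k for k in keys if shard_for(k, num_shards) == sample_shard}
    return len(in_sample) * num_shards
```

Whether one shard is enough depends on the question: counts over many keys scale up well, while rare events may not appear in the sampled shard at all.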

Page 10: Data Lessons Learned at Scale - Big Data DC

Tunable QoS

● URL metadata stored in a 90-node Cassandra cluster
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
  ○ Depending on write rate per record
● Global TTL knob
  ○ Turn up to reduce load for maintenance
  ○ Turn down to improve responsiveness
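One way the variable TTL and the global knob could fit together is sketched below. The formula, the function name `cache_ttl`, and every parameter and default are assumptions for illustration; the talk does not give the actual rule. Here a record's base TTL is its mean time between writes, clamped to a range, and the global knob scales every TTL at once.

```python
SECONDS_PER_DAY = 86400.0


def cache_ttl(writes_per_day, global_knob=1.0, min_ttl=60.0, max_ttl=SECONDS_PER_DAY):
    """Return a cache TTL in seconds for one record.

    Frequently written records get short TTLs, rarely written ones get long
    TTLs. global_knob > 1 lengthens all TTLs (less backend load); < 1 shortens
    them (fresher data).
    """
    if writes_per_day <= 0:
        base = max_ttl
    else:
        base = SECONDS_PER_DAY / writes_per_day  # mean seconds between writes
        base = min(max(base, min_ttl), max_ttl)
    return base * global_knob
```

For example, `cache_ttl(24)` gives one hour, and `cache_ttl(24, global_knob=2.0)` doubles it to two hours while the cluster is under maintenance.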

[Figure: CDN cache]

Page 11: Data Lessons Learned at Scale - Big Data DC

Deployment

● Continuous Deploy?
● Deploying our JavaScript costs $3k
  ○ Have to invalidate 1.4B browser caches
  ○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDoSed ourselves
  ○ Very interesting bugs
● Simulation is weak
  ○ The internet is a dirty place
  ○ Embrace incremental deploys

Page 12: Data Lessons Learned at Scale - Big Data DC

Columnar Compression

● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● by @abramsm

[Figure: Input Data rows (Time, IP, UID, URL, Geo) regrouped by column (Time, IP, UID, URL, Geo) as Stored Data; Block Size marked on the diagram]
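A minimal sketch of the idea: take a block of rows, regroup the values by column, and compress each column on its own, with the option of a different compressor per column. The column names follow the slide's figure; the function names, the JSON encoding, and the choice of zlib and bz2 are assumptions for illustration and are not the implementation credited on the slide.

```python
import bz2
import json
import zlib

COLUMNS = ("time", "ip", "uid", "url", "geo")

# Hypothetical per-column choice: (compress, decompress) pairs.
DEFAULT_CODEC = (zlib.compress, zlib.decompress)
CODECS = {"url": (bz2.compress, bz2.decompress)}


def encode_block(rows):
    """Regroup a block of row dicts by column and compress each column separately."""
    block = {}
    for name in COLUMNS:
        compress, _ = CODECS.get(name, DEFAULT_CODEC)
        values = [row[name] for row in rows]
        block[name] = compress(json.dumps(values).encode("utf-8"))
    return block


def decode_block(block):
    """Decompress each column and stitch the original rows back together."""
    columns = []
    for name in COLUMNS:
        _, decompress = CODECS.get(name, DEFAULT_CODEC)
        columns.append(json.loads(decompress(block[name]).decode("utf-8")))
    return [dict(zip(COLUMNS, values)) for values in zip(*columns)]
```

Values within one column tend to resemble each other more than values within one row, which is why a compressor usually does better on the regrouped data. The number of rows per block trades compression ratio against how much must be decoded to read a single row.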

Page 13: Data Lessons Learned at Scale - Big Data DC

Summary

● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
  ○ I’m still struggling with this

Page 14: Data Lessons Learned at Scale - Big Data DC

Questions?

@numbakrrunch