piano media - approach to data gathering and processing

nearly three years of continuous changes of approach to data gathering and processing

(Martin Strycek, Juraj Sottnik)@rubyslava 2014

We'd better get it right the first time!

Starting point

● we had two developers
● we had one live server
● we had one cold backup
● we can’t store all the data
● we can’t process all the data

Batch processing - the downsides

● batch every 3 hours
  ○ delete old data
● updating counters
  ○ you need to define them upfront
● throwing away old data
  ○ developer point of view
    ■ you have no way to correct your mistake
  ○ business
    ■ you lose your data

Batch processing - the benefits

● you will learn
  ○ profiler is your best friend
  ○ optimizing can be hard and can take time
● what good access logs are good for
  ○ reconstruct your deleted data
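As a rough illustration of the last point: if the access logs record every impression, counters that were thrown away can be rebuilt by replaying the logs. The sketch below assumes a Common Log Format style line and a hypothetical `rebuild_daily_counts` helper. The slides do not describe the actual log layout, so treat both as assumptions.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed Common Log Format; the real log layout is not given in the slides.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)

def rebuild_daily_counts(lines):
    """Recount impressions per (day, path) by replaying access log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like requests
        day = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z").date()
        counts[(day.isoformat(), match["path"])] += 1
    return counts

# Hypothetical sample lines, only for illustration.
sample = [
    '10.0.0.1 - - [03/Mar/2014:10:00:00 +0100] "GET /article/1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [03/Mar/2014:10:05:00 +0100] "GET /article/1 HTTP/1.1" 200 512',
    '10.0.0.1 - - [04/Mar/2014:09:00:00 +0100] "GET /article/2 HTTP/1.1" 200 256',
]
daily = rebuild_daily_counts(sample)
```

This only works for whatever the logs actually captured, which is why the slide stresses good access logs.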

Business says: save all data

Big Data

● It’s not only about the volume
● What are we going to do with it?
  ○ We had NO idea!
● We rented more servers.
  ○ We needed a place to store the data

Big Data

● We went the NoSQL way
  ○ MongoDB
    ■ easy replication, possible sharding
    ■ upsert
    ■ rich document-based queries - we still had one foot in the SQL world
    ■ fast prototyping
● We were still doing batch processing
● ~15m impressions per day, resulting in ~5GB raw data per day

Big Data

● each day as a collection
  ○ easy for batch processing
● each impression as a document
● adding processed parameters over time
● pulling data from 30 collections
  ○ server is not responding
  ○ virtual memory is low
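To make the "30 collections" point concrete: with one collection per day, any monthly question has to visit every daily collection in the range. The sketch below only generates the collection names. The `impressions_YYYYMMDD` naming scheme and the `daily_collections` helper are assumptions for illustration, not the naming used in the original system.

```python
from datetime import date, timedelta

def daily_collections(end_day, days, prefix="impressions_"):
    """Names of the per-day collections covering the last `days` days, oldest first."""
    return [
        f"{prefix}{(end_day - timedelta(days=offset)):%Y%m%d}"
        for offset in range(days - 1, -1, -1)
    ]

# A monthly report ending on 2014-03-31 has to read all of these.
names = daily_collections(date(2014, 3, 31), 30)
```

Each name in `names` would mean a separate query whose results then have to be merged, which is where the memory pressure mentioned on the slide comes from.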

Big Data - analytics

● Visitor counts on website/section
  ○ active - with subscription
  ○ inactive - without subscription
  ○ anonymous
● Content consumption
  ○ how many pageviews
    ■ active
    ■ inactive
    ■ anonymous
● and others
● We were still doing batch processing
● ~15m impressions per day, resulting in ~5GB raw data per day

Business asks: how many UNIQUE users did … in a month?

What we really need

● COUNT(* || DISTINCT ...) GROUP BY
  ○ entities
  ○ date periods (day, week, month)
  ○ combination of entities and date periods [and some other flags]
● Special demands from analytics team
  ○ Not too hard to implement with SQL magic
● As fast as possible
  ○ At least as fast as the data comes in
● Still store all historical raw data
  ○ Ideally compressed
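For reference, the kind of query the first bullet describes can be sketched with an in-memory SQLite table. The `impressions` table, its columns and the sample rows are hypothetical. They only show the difference between COUNT(*) (pageviews) and COUNT(DISTINCT ...) (unique users) grouped by entity and date period.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per impression.
conn.execute("CREATE TABLE impressions (user_id TEXT, section TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [
        ("u1", "sport", "2014-03-01"),
        ("u1", "sport", "2014-03-01"),
        ("u2", "sport", "2014-03-01"),
        ("u1", "news", "2014-03-02"),
    ],
)

# Pageviews and unique users per (section, day).
rows = conn.execute(
    """
    SELECT section, day,
           COUNT(*) AS pageviews,
           COUNT(DISTINCT user_id) AS unique_users
    FROM impressions
    GROUP BY section, day
    ORDER BY section, day
    """
).fetchall()
```

The query is easy to write. The hard part the following slides deal with is getting the same answers continuously, as fast as the impressions arrive.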

What to do

● Processing raw data?
  ○ Uses a lot of space before you get a result
    ■ We need to store historical data anyway
    ■ You can store compressed files (LZO) in Hadoop
● Sharding
  ○ For how long?
  ○ How to properly determine sharding key(s)?
● Do you really have a big amount of data?
● Do you have the hardware to run Hadoop? Really?
● What does overnight batch processing really mean?

Naive solution

● Separate counter for each needed combination, updated for each impression, possibly touching the DB
  ○ Fast to generate a unique key for a combination
    ■ md5([entityType, entityId, day, dayId].join("|"))
  ○ Really fast to get a value
    ■ Always primary key
    ■ Multiget
  ○ Need to define all GROUP BY combinations at the beginning
  ○ Failure while processing one impression
    ■ Need to increment counters in a transaction
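A minimal sketch of this counter-per-combination idea, using an in-memory dictionary in place of a real key-value store. The `counter_key` and `record_impression` helpers, the impression fields and the three combinations are assumptions chosen for illustration. Only the md5-over-joined-parts key follows the slide.

```python
import hashlib
from collections import defaultdict

def counter_key(entity_type, entity_id, period, period_id):
    """Build the lookup key the same way as md5([...].join("|")) on the slide."""
    raw = "|".join(str(part) for part in (entity_type, entity_id, period, period_id))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# Stand-in for the real store; every read is a primary-key lookup.
counters = defaultdict(int)

def record_impression(impression):
    """Increment every predefined counter that this impression belongs to."""
    day = impression["day"]
    keys = [
        counter_key("website", impression["website"], "day", day),
        counter_key("website", impression["website"], "month", day[:7]),
        counter_key("section", impression["section"], "day", day),
    ]
    # In a real store these increments would need to be applied atomically,
    # otherwise a failure halfway through leaves the counters inconsistent.
    for key in keys:
        counters[key] += 1
    return keys

record_impression({"website": "w1", "section": "sport", "day": "2014-03-01"})
record_impression({"website": "w1", "section": "news", "day": "2014-03-01"})
day_views = counters[counter_key("website", "w1", "day", "2014-03-01")]
```

The list of keys inside `record_impression` is the weak spot: any GROUP BY combination that is not listed there up front simply never gets counted.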

Real world solution

● Kafka
  ○ Buffering incoming data
  ○ Web workers as producers
● Storm / Trident
  ○ Consuming data from Kafka
  ○ Processing incoming data
  ○ Using Cassandra as storage backend
● Cassandra
  ○ Holding counters and helper information to determine uniqueness

Storm

● Real time processing of unbounded streams of data
  ○ Processing data as they come
  ○ You still need to have computing power
  ○ Need to transform every COUNT(* || DISTINCT ...) GROUP BY into a series of counter updates
  ○ Java, but bolts can be written in different languages
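The step of turning COUNT(* || DISTINCT ...) GROUP BY into counter updates can be sketched in plain Python. This is not Storm or Trident code. The `UniqueCounter` class is a hypothetical illustration where an in-memory set plays the role of the helper information used to decide whether a user was already counted.

```python
from collections import defaultdict

class UniqueCounter:
    """Maintain COUNT(*) and COUNT(DISTINCT user) per group, one event at a time."""

    def __init__(self):
        self.seen = set()                # helper data: (group, user) pairs already counted
        self.totals = defaultdict(int)   # running COUNT(*) per group
        self.uniques = defaultdict(int)  # running COUNT(DISTINCT user) per group

    def update(self, group, user_id):
        self.totals[group] += 1
        if (group, user_id) not in self.seen:
            self.seen.add((group, user_id))
            self.uniques[group] += 1

counter = UniqueCounter()
events = [("sport/2014-03", "u1"), ("sport/2014-03", "u1"), ("sport/2014-03", "u2")]
for group, user in events:
    counter.update(group, user)
```

Reading a result is then a single lookup such as `counter.uniques["sport/2014-03"]`, at the price of storing one membership record per (group, user) pair.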

Storm

● Spouts
● Bolts

Trident

● High level abstraction over Storm
  ○ Joins
  ○ Aggregations
  ○ Grouping
  ○ Filtering
  ○ Functions

Trident

● Operating in transactions
● Persistent aggregation
  ○ “Memcached”
  ○ Cassandra
● DRPC calls
  ○ No need to touch Cassandra
● Local cluster for development
● Easy to learn basics
● Hard to discover advanced stuff
    ■ Lack of documentation
    ■ Need to tune configuration

Trident

● Functions
  ○ You can do everything you want
    ■ Touch DB, read emails, …
  ○ Stay with Java
    ■ No dependency problems
    ■ No performance penalty
● Topology
  ○ Good to define at the beginning
    ■ Spend time on a detailed diagram
    ■ It will save you during implementation and future updates
  ○ Don’t make it too complex
    ■ Problems with loading it

Trident

Cassandra

● Already in our production on a different project
● No SPOF
● Multi Master
● Scalable
● More good stuff
● Lot of new features in 2.x
  ○ Lightweight transactions
  ○ Lot of fixes
    ■ Good old times on 0.8
    ■ Our bug report from 2011 - Double load of commit log on node start :)

Kafka

● A high-throughput distributed messaging system
● Something like a distributed commit log
  ○ You can set retention
  ○ You can move reading offset back
    ■ Used by Trident transactions
● Cluster
● Ideal to use with Trident
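The commit log idea (retention plus the ability to move the reading offset back) can be illustrated with a toy in-memory log. This is not the Kafka API. `CommitLog` is a hypothetical class, and it expresses retention as a record count for simplicity, whereas Kafka's retention is configured by time or size.

```python
class CommitLog:
    """Toy append-only log with count-based retention and offset-based reads."""

    def __init__(self, retention):
        self.retention = retention  # maximum number of records kept
        self.records = []
        self.base_offset = 0        # offset of records[0]

    def append(self, record):
        """Append a record and return its offset; drop the oldest past retention."""
        self.records.append(record)
        overflow = len(self.records) - self.retention
        if overflow > 0:
            del self.records[:overflow]
            self.base_offset += overflow
        return self.base_offset + len(self.records) - 1

    def read(self, offset, max_records=10):
        """Read from any offset still retained; the consumer owns its position."""
        if offset < self.base_offset:
            raise LookupError("offset expired by retention")
        start = offset - self.base_offset
        return self.records[start:start + max_records]

log = CommitLog(retention=3)
offsets = [log.append(item) for item in ["a", "b", "c", "d"]]
first_pass = log.read(2)  # consumer reads from offset 2
replay = log.read(1)      # after a failure it moves back and re-reads
```

Because reading does not remove anything, a consumer that fails mid-batch can rewind and process the same records again, which is the property the slide says Trident transactions rely on.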

Business asks: are you ready for ~250m impressions per day?
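A back-of-the-envelope calculation puts that number in perspective, taking the ~15m impressions and ~5GB of raw data per day quoted earlier. It assumes traffic spread evenly over the day, 1 GB = 10^9 bytes, and raw data size growing linearly with impressions. None of these assumptions come from the slides, and real peaks would be higher than the averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(impressions_per_day):
    """Average impressions per second, assuming an even spread over the day."""
    return impressions_per_day / SECONDS_PER_DAY

current_rate = per_second(15_000_000)    # about 174 per second
target_rate = per_second(250_000_000)    # about 2894 per second

# Average raw size of one impression, from ~5GB for ~15m impressions.
bytes_per_impression = 5e9 / 15_000_000  # about 333 bytes

# Projected raw data per day at the target volume, assuming linear growth.
projected_gb_per_day = 250_000_000 * bytes_per_impression / 1e9  # about 83 GB
```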

Thank you.
