piano media - approach to data gathering and processing
DESCRIPTION
Lessons learned when changing our mindset from batch processing to real-time processing of unbound stream of data.TRANSCRIPT
nearly three years of continuous changes of approach to data gathering and processing
(Martin Strycek, Juraj Sottnik)@rubyslava 2014
We better get it right first time!
Starting point● we had two developers
● we had one live server● we had one cold backup
● we can’t store all the data● we can’t process all the data
Batch processing - the downsides
● batch every 3 hours○ delete old data
● updating counters○ you need to define them upfront
● throwing away old data○ developer point of view
■ you have no way to correct your mistake○ business
■ you lose your data
Batch processing - the benefits
● you will learn ○ profiler is your best friend○ optimizing can be hard and can take time
● what are good access logs good for○ reconstruct your deleted data
Business says: save all data
Big Data
● It’s not only about the volume
● What we gonna do with it?○ We had NO idea!
● We rent more servers.○ We needed place where to store the data
Big Data
● We went the NoSQL way○ MongoDB
■ easy replication, possible sharding■ upsert
■ rich document based queries - we still were one foot in the SQL world
■ fast prototype
● We were still doing batch processing● ~15m impressions per day ending with
~5GB raw data per day
Big Data
● each day as collection○ easy for batch processing
● each impression as a document● adding processed parameters over
time● pulling data from 30 collections
○ server is not responding○ virtual memory is low
Big Data - analytics
● Visitors counts on website/section○ active - with subscription○ inactive - without subscription○ anonymous
● Content consumption ○ how many pageviews
■ active ■ inactive■ anonymouse
● and others
● We were still doing batch processing● ~15m impressions per day ending with
~5GB raw data per day
Business asks: how many UNIQUE users
did … in month
What we really need● COUNT(* || DISTINCT ...) GROUP BY
○ entities○ date periods (day, week, month)○ combination of entities and date periods [and
some other flags]● Special demands from analytics team
○ Not too hard to implement with SQL magic● As fast as possible
○ Minimally as fast as data are incoming● Still store all historical raw data
○ Ideally compressed
What to do● Processing raw data?
○ Use lot of space, before getting result■ We need to store historical data anyway■ You can store compressed files (LZO) in Hadoop
● Sharding○ For how long?○ How to properly determine sharding key(s)?
● Do you have really big amount of data?● Do you have hardware for running
Hadoop? Really?● What overnight batch processing really
means?
Naive solution● Separate counter for each needed
combination, updated for each impression, maybe with touching DB○ Fast to generate unique key for combination
■ md5([entityType, entityId, day, dayId].join("|"))○ Really fast to get value
■ Always primary key■ Multiget
○ Need to define all GROUP BY combinations on beginning
○ Failure during processing one impression■ Need to increment counters in transaction
Real world solution● Kafka
○ Buffering incoming data○ Web workers as producers
● Storm / Trident○ Consuming data from Kafka○ Processing incoming data○ Using cassandra as storage backend
● Cassandra○ Holding counters and helper informations to
determine uniquity
Storm● Real time processing of unbounded
streams of data○ Processing data as they come○ You still need to have computing power○ Need to transform COUNT(* || DISTINCT ...)
GROUP BY everything to steps of updates of counters
○ Java, but bolts can be written in different languages
Storm● Spouts● Bolts
Trident● High level abstraction over Storm
○ Joins○ Aggregations○ Grouping○ Filtering○ Functions
Trident● Operating in transactions● Persistent aggregation
○ “Memcached”○ Cassandra
● DRPC calls○ No need to touch Cassandra
● Local cluster for development● Easy to learn basics● Hard to discover advanced stuff
■ Lack of documentation■ Need to tune configuration
Trident● Functions
○ You can do everything you want■ Touch DB, read emails, …
○ Stay with java■ No dependencies problem■ No performance penalty
● Topology○ Good to define on beginning
■ Spend time on detailed diagram■ Save you during implementation and future updates
○ Don’t do it too much complex■ Problem with loading it
Trident
Cassandra● Already in our production on different
project● No SPOF● Multi Master● Scalable● More good stuff● Lot of new features in 2.x
○ Lite transactions○ Lot of fixes
■ Good old times on 0.8■ Our bug report from 2011 - Double load of commit log
on node start :)
Kafka● A high-throughput distributed
messaging system● Something like distributed commit log
○ You can set retention○ You can move reading offset back
■ Used by Trident transactions● Cluster● Ideally to use with Trident
Business asks: are you ready for ~250m
impressions per day?
Thank you.