piano media - approach to data gathering and processing

nearly three years of continuous changes of approach to data gathering and processing (Martin Strycek, Juraj Sottnik) @rubyslava 2014

Upload: martinstrycek

Post on 06-May-2015


DESCRIPTION

Lessons learned when changing our mindset from batch processing to real-time processing of an unbounded stream of data.

TRANSCRIPT

Page 1: Piano Media - approach to data gathering and processing

nearly three years of continuous changes of approach to data gathering and processing

(Martin Strycek, Juraj Sottnik) @rubyslava 2014

Page 2: Piano Media - approach to data gathering and processing

We’d better get it right the first time!

Page 3: Piano Media - approach to data gathering and processing

Starting point
● we had two developers
● we had one live server
● we had one cold backup
● we can’t store all the data
● we can’t process all the data

Page 4: Piano Media - approach to data gathering and processing

Batch processing - the downsides

● batch every 3 hours
  ○ delete old data
● updating counters
  ○ you need to define them upfront
● throwing away old data
  ○ developer point of view
    ■ you have no way to correct your mistake
  ○ business
    ■ you lose your data

Page 5: Piano Media - approach to data gathering and processing

Batch processing - the benefits

● you will learn
  ○ profiler is your best friend
  ○ optimizing can be hard and can take time
● what good access logs are good for
  ○ reconstruct your deleted data
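The access-log point can be illustrated with a small sketch: if processed impressions were deleted after a batch run, per-day page counts can be recounted by replaying the web server's access log. The log layout (Common Log Format), the rule "count only GET requests answered with 200" and the name `rebuild_daily_counts` are assumptions made for illustration, not details from the talk.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed log line shape (Common Log Format); the real logs may differ.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def rebuild_daily_counts(lines):
    """Recount successful GET requests per (day, path) from raw log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not parse
        if match["method"] != "GET" or match["status"] != "200":
            continue  # assumed definition of a countable impression
        day = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z").date()
        counts[(day.isoformat(), match["path"])] += 1
    return counts

sample = [
    '10.0.0.1 - - [06/May/2014:10:00:00 +0200] "GET /article/1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [06/May/2014:10:05:00 +0200] "GET /article/1 HTTP/1.1" 200 512',
    '10.0.0.1 - - [07/May/2014:09:00:00 +0200] "GET /article/2 HTTP/1.1" 404 0',
]
daily = rebuild_daily_counts(sample)
```

This only works if the log carries every field the counters depend on, which is one reading of what "good" access logs means here.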

Page 6: Piano Media - approach to data gathering and processing

Business says: save all data

Page 7: Piano Media - approach to data gathering and processing

Big Data

● It’s not only about the volume

● What are we going to do with it?
  ○ We had NO idea!
● We rented more servers.
  ○ We needed a place to store the data

Page 8: Piano Media - approach to data gathering and processing

Big Data

● We went the NoSQL way
  ○ MongoDB
    ■ easy replication, possible sharding
    ■ upsert
    ■ rich document based queries - we still were one foot in the SQL world
    ■ fast prototype
● We were still doing batch processing
● ~15m impressions per day ending with ~5GB raw data per day

Page 9: Piano Media - approach to data gathering and processing

Big Data

● each day as collection
  ○ easy for batch processing
● each impression as a document
● adding processed parameters over time
● pulling data from 30 collections
  ○ server is not responding
  ○ virtual memory is low

Page 10: Piano Media - approach to data gathering and processing

Big Data - analytics

● Visitor counts on website/section
  ○ active - with subscription
  ○ inactive - without subscription
  ○ anonymous
● Content consumption
  ○ how many pageviews
    ■ active
    ■ inactive
    ■ anonymous
● and others
● We were still doing batch processing
● ~15m impressions per day ending with ~5GB raw data per day

Page 11: Piano Media - approach to data gathering and processing

Business asks: how many UNIQUE users did … in month

Page 12: Piano Media - approach to data gathering and processing

What we really need
● COUNT(* || DISTINCT ...) GROUP BY
  ○ entities
  ○ date periods (day, week, month)
  ○ combination of entities and date periods [and some other flags]
● Special demands from analytics team
  ○ Not too hard to implement with SQL magic
● As fast as possible
  ○ At least as fast as the data comes in
● Still store all historical raw data
  ○ Ideally compressed
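To make the requirement concrete, here is a minimal batch sketch of COUNT(*) and COUNT(DISTINCT user) grouped by entity and date period. The record layout, the entity names and the helper names `period_keys` and `group_counts` are hypothetical; the slides do not show the actual schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical impression records: (entity_id, user_id, day).
impressions = [
    ("section:news", "u1", date(2014, 5, 5)),
    ("section:news", "u1", date(2014, 5, 6)),
    ("section:news", "u2", date(2014, 5, 6)),
    ("section:sport", "u1", date(2014, 5, 12)),
]

def period_keys(day):
    """Return the day, ISO week and month buckets a date falls into."""
    iso_year, iso_week, _ = day.isocalendar()
    return {
        "day": day.isoformat(),
        "week": f"{iso_year}-W{iso_week:02d}",
        "month": f"{day.year}-{day.month:02d}",
    }

def group_counts(records):
    """COUNT(*) and COUNT(DISTINCT user) grouped by (entity, period, bucket)."""
    totals = defaultdict(int)
    users = defaultdict(set)
    for entity, user, day in records:
        for period, bucket in period_keys(day).items():
            key = (entity, period, bucket)
            totals[key] += 1
            users[key].add(user)
    return {key: (totals[key], len(users[key])) for key in totals}

report = group_counts(impressions)
monthly_news = report[("section:news", "month", "2014-05")]  # (pageviews, unique users)
```

Note that the distinct count needs the full set of users per group, which is what makes the "UNIQUE users" question more expensive than a plain counter.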

Page 13: Piano Media - approach to data gathering and processing

What to do
● Processing raw data?
  ○ Uses a lot of space before you get a result
    ■ We need to store historical data anyway
    ■ You can store compressed files (LZO) in Hadoop
● Sharding
  ○ For how long?
  ○ How to properly determine sharding key(s)?
● Do you really have a big amount of data?
● Do you have hardware for running Hadoop? Really?
● What does overnight batch processing really mean?

Page 14: Piano Media - approach to data gathering and processing

Naive solution
● Separate counter for each needed combination, updated for each impression, maybe with touching DB
  ○ Fast to generate unique key for combination
    ■ md5([entityType, entityId, day, dayId].join("|"))
  ○ Really fast to get value
    ■ Always primary key
    ■ Multiget
  ○ Need to define all GROUP BY combinations at the beginning
  ○ Failure during processing one impression
    ■ Need to increment counters in transaction
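The naive approach can be sketched as follows. The key function mirrors the md5 expression on the slide; the `CounterStore` class is an in-memory stand-in for a key-value database, and the list of combinations in `keys_for_impression` is invented for illustration.

```python
import hashlib

def counter_key(entity_type, entity_id, period, period_id):
    """Build a fixed-length key from one GROUP BY combination, as on the slide."""
    raw = "|".join([entity_type, str(entity_id), period, period_id])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

class CounterStore:
    """In-memory stand-in for a key-value database (an assumption)."""

    def __init__(self):
        self._data = {}

    def multiget(self, keys):
        """Fetch many counters at once; every lookup is by primary key."""
        return {key: self._data.get(key, 0) for key in keys}

    def increment_all(self, keys):
        """Apply all increments or none, mimicking a transaction."""
        staged = dict(self._data)
        for key in keys:
            staged[key] = staged.get(key, 0) + 1
        self._data = staged  # swap in only after every increment succeeded

def keys_for_impression(impression):
    """Every combination must be listed up front; this is the naive part."""
    return [
        counter_key("website", impression["website"], "day", impression["day"]),
        counter_key("website", impression["website"], "month", impression["month"]),
        counter_key("section", impression["section"], "day", impression["day"]),
    ]

store = CounterStore()
hit = {"website": 7, "section": 42, "day": "2014-05-06", "month": "2014-05"}
store.increment_all(keys_for_impression(hit))
store.increment_all(keys_for_impression(hit))
values = store.multiget(keys_for_impression(hit))
```

Copying the whole dictionary per impression is only a way to show all-or-nothing behaviour in a few lines; a real store would rely on its own transaction mechanism.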

Page 15: Piano Media - approach to data gathering and processing

Real world solution
● Kafka
  ○ Buffering incoming data
  ○ Web workers as producers
● Storm / Trident
  ○ Consuming data from Kafka
  ○ Processing incoming data
  ○ Using Cassandra as storage backend
● Cassandra
  ○ Holding counters and helper information to determine uniqueness
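One way to read "helper information to determine uniqueness" is a record of which users have already been counted for a given key. The sketch below keeps that record in plain Python sets; the class name `UniqueCounter` and the key format are hypothetical, and the slides do not say how the data is actually laid out in Cassandra.

```python
class UniqueCounter:
    """Counts impressions and unique users per key, with a 'seen' set as helper data."""

    def __init__(self):
        self.impressions = {}  # key -> total impressions
        self.uniques = {}      # key -> number of distinct users
        self._seen = {}        # key -> user ids already counted (helper data)

    def record(self, key, user_id):
        """Update both counters for one incoming impression."""
        self.impressions[key] = self.impressions.get(key, 0) + 1
        seen = self._seen.setdefault(key, set())
        if user_id not in seen:
            seen.add(user_id)
            self.uniques[key] = self.uniques.get(key, 0) + 1

counter = UniqueCounter()
for user in ["u1", "u2", "u1"]:
    counter.record("section:news|month|2014-05", user)
```

The unique count is updated per impression, so answering "how many unique users in a month" becomes a single key lookup instead of a scan over raw data.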

Page 16: Piano Media - approach to data gathering and processing

Storm
● Real time processing of unbounded streams of data
  ○ Processing data as they come
  ○ You still need to have computing power
  ○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything to steps of updates of counters
  ○ Java, but bolts can be written in different languages

Page 17: Piano Media - approach to data gathering and processing

Storm
● Spouts
● Bolts
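As a rough analogy only (this is not Storm's API), a spout can be thought of as a source of tuples and a bolt as a processing step that consumes tuples and may emit new ones. The sketch below chains Python generators in that shape; all function names and the event format are made up.

```python
from collections import Counter

def impression_spout(raw_events):
    """Spout: the source that emits tuples into the topology."""
    for event in raw_events:
        yield event

def parse_bolt(stream):
    """Bolt: turns each raw 'user,section' string into a (user, section) tuple."""
    for line in stream:
        user, section = line.split(",")
        yield user, section

def count_bolt(stream, counters):
    """Bolt: updates a per-section counter for every tuple, then passes it on."""
    for user, section in stream:
        counters[section] += 1
        yield user, section

counters = Counter()
events = ["u1,news", "u2,news", "u1,sport"]
topology = count_bolt(parse_bolt(impression_spout(events)), counters)
processed = list(topology)  # draining the generator drives the pipeline
```

Each tuple updates the counters as it passes through, which is the "steps of updates of counters" form that a GROUP BY query has to be rewritten into.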

Page 18: Piano Media - approach to data gathering and processing

Trident
● High level abstraction over Storm
  ○ Joins
  ○ Aggregations
  ○ Grouping
  ○ Filtering
  ○ Functions
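To give a feel for what such a high-level abstraction buys, here is a tiny list-backed stream class offering filtering, functions, grouping and aggregation as chained calls. It is an analogy written for this transcript, not the Trident API: the class and method names are invented, it works on a finite list rather than an unbounded stream, and joins are left out.

```python
from collections import defaultdict

class Stream:
    """Tiny list-backed stand-in for a high-level stream API (not Trident)."""

    def __init__(self, tuples):
        self._tuples = list(tuples)

    def filter(self, predicate):
        """Keep only the tuples for which predicate returns True."""
        return Stream(t for t in self._tuples if predicate(t))

    def each(self, function):
        """Apply a function to every tuple, producing a new stream."""
        return Stream(function(t) for t in self._tuples)

    def group_by(self, key):
        """Partition tuples by the value returned from key."""
        groups = defaultdict(list)
        for t in self._tuples:
            groups[key(t)].append(t)
        return GroupedStream(groups)

class GroupedStream:
    def __init__(self, groups):
        self._groups = groups

    def aggregate(self, reducer):
        """Reduce each group to a single value."""
        return {key: reducer(items) for key, items in self._groups.items()}

impressions = [
    {"user": "u1", "section": "news", "bot": False},
    {"user": "crawler", "section": "news", "bot": True},
    {"user": "u2", "section": "sport", "bot": False},
    {"user": "u1", "section": "news", "bot": False},
]

pageviews = (
    Stream(impressions)
    .filter(lambda t: not t["bot"])
    .each(lambda t: (t["section"], t["user"]))
    .group_by(lambda t: t[0])
    .aggregate(len)
)
```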

Page 19: Piano Media - approach to data gathering and processing

Trident
● Operating in transactions
● Persistent aggregation
  ○ “Memcached”
  ○ Cassandra
● DRPC calls
  ○ No need to touch Cassandra
● Local cluster for development
● Easy to learn basics
● Hard to discover advanced stuff
    ■ Lack of documentation
    ■ Need to tune configuration
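A common way to make persistent aggregation safe under replays is to store, next to each counter, the id of the last batch applied to it, and to skip a batch whose id is already recorded. The sketch below shows that general idea under two stated assumptions: a replayed batch has the same id and the same content, and batches are applied in order. The class name is hypothetical and this is not Trident's implementation.

```python
class TransactionalCounterState:
    """Stores each counter together with the id of the last batch applied to it."""

    def __init__(self):
        self._rows = {}  # key -> (count, last_txid)

    def apply_batch(self, txid, increments):
        """Apply one batch of {key: delta}; replays of the same txid are ignored."""
        for key, delta in increments.items():
            count, last_txid = self._rows.get(key, (0, None))
            if last_txid == txid:
                continue  # this batch was already applied to this key
            self._rows[key] = (count + delta, txid)

    def get(self, key):
        """Return the current count for key, or 0 if it was never updated."""
        return self._rows.get(key, (0, None))[0]

state = TransactionalCounterState()
state.apply_batch(1, {"news": 3})
state.apply_batch(1, {"news": 3})  # replay after a simulated failure
state.apply_batch(2, {"news": 2, "sport": 1})
news_total = state.get("news")
```

Because the replayed batch is skipped, the counter ends up as if every impression had been processed exactly once.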

Page 20: Piano Media - approach to data gathering and processing

Trident
● Functions
  ○ You can do everything you want
    ■ Touch DB, read emails, …
  ○ Stay with Java
    ■ No dependency problems
    ■ No performance penalty
● Topology
  ○ Good to define at the beginning
    ■ Spend time on detailed diagram
    ■ Saves you during implementation and future updates
  ○ Don’t make it too complex
    ■ Problem with loading it

Page 21: Piano Media - approach to data gathering and processing

Trident

Page 22: Piano Media - approach to data gathering and processing

Cassandra
● Already in our production on a different project
● No SPOF
● Multi Master
● Scalable
● More good stuff
● Lot of new features in 2.x
  ○ Lightweight transactions
  ○ Lot of fixes
    ■ Good old times on 0.8
    ■ Our bug report from 2011 - Double load of commit log on node start :)

Page 23: Piano Media - approach to data gathering and processing

Kafka
● A high-throughput distributed messaging system
● Something like distributed commit log
  ○ You can set retention
  ○ You can move reading offset back
    ■ Used by Trident transactions
● Cluster
● Ideal to use with Trident
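The commit-log idea can be sketched with a minimal in-memory log: messages keep stable offsets, old messages expire under a retention policy, and a reader can move its offset back to re-read a batch. This is a single-process toy, not Kafka's API; the class name is hypothetical and retention is simplified to "keep the last N messages".

```python
class CommitLog:
    """Minimal append-only log: messages keep their offsets, readers pick theirs."""

    def __init__(self, retention):
        self._retention = retention  # max number of messages kept (assumed policy)
        self._messages = []
        self._base_offset = 0        # offset of the oldest retained message

    def append(self, message):
        """Append a message, expire old ones, and return the new message's offset."""
        self._messages.append(message)
        overflow = len(self._messages) - self._retention
        if overflow > 0:
            del self._messages[:overflow]
            self._base_offset += overflow
        return self._base_offset + len(self._messages) - 1

    def read(self, offset, max_messages):
        """Return messages starting at offset; expired offsets raise an error."""
        if offset < self._base_offset:
            raise LookupError("offset is older than the retention window")
        start = offset - self._base_offset
        return self._messages[start:start + max_messages]

log = CommitLog(retention=3)
for i in range(5):
    last_offset = log.append(f"impression-{i}")

first_read = log.read(offset=2, max_messages=2)
replayed = log.read(offset=2, max_messages=2)  # moving the offset back re-reads the same batch
```

Re-reading the same offsets returns the same messages, which is the property a transactional consumer needs in order to replay a failed batch.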

Page 24: Piano Media - approach to data gathering and processing

Business asks: are you ready for ~250m impressions per day?
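For a sense of scale, the figures quoted earlier (~15m impressions and ~5GB of raw data per day) can be extrapolated to 250m impressions. The calculation below assumes raw data grows linearly with impressions, traffic is spread evenly over the day, and GB means 10^9 bytes; none of these assumptions come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

current_impressions = 15_000_000   # per day, from the slides
current_raw_bytes = 5 * 10**9      # ~5GB per day, from the slides
target_impressions = 250_000_000   # per day, from the business question

# Average raw size of one impression at the current volume.
bytes_per_impression = current_raw_bytes / current_impressions

# Linear extrapolation of daily raw data and the average incoming rate.
target_raw_gb = bytes_per_impression * target_impressions / 10**9
average_rate = target_impressions / SECONDS_PER_DAY
```

Under these assumptions the target works out to roughly 83 GB of raw data per day and about 2,900 impressions per second on average, with peaks presumably higher.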

Page 25: Piano Media - approach to data gathering and processing

Thank you.