Real World Cassandra
the prospect engine for brands.
Cassandra in Online Advertising: Real Time Bidding
Who are we?
Costa Sevdinoglou & Edward Capriolo
Impressions look like…
A High Level look at RTB
1. Browsers visit Publishers and create impressions.
2. Publishers sell impressions via Exchanges.
3. Exchanges serve as auction houses for the impressions.
4. On behalf of the marketer, m6d bids on the impressions via the
auction house. If m6d wins, we display our ad in the
browser.
Performance and Data
• Billions and billions of bid requests a day
• A single request can result in multiple Cassandra Operations!
• One cluster is just under 10TB and growing
• Low latency requirement: typically below 120 ms
• Limited data available to m6d via the exchange
Segment Data
Segments are how we assign product or service
affinity to a group of users. Users we consider to be
like-minded with respect to a given brand will be
placed in the same segment.
Segment Data is just one component of our
overarching data model.
Segments help to reduce the number of calculations
we do in real time.
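As a rough illustration of why precomputed segments cut down real-time work, the sketch below keeps segment membership per user and answers a bid-time question with a simple set lookup. The names SegmentStore, eligible_campaigns, and the sample segment and campaign labels are hypothetical; an in-memory dictionary stands in for the real store, and this is not m6d's actual data model.

```python
from collections import defaultdict


class SegmentStore:
    """In-memory stand-in for a segment store keyed by user id (hypothetical)."""

    def __init__(self):
        self._segments_by_user = defaultdict(set)

    def add(self, user_id, segment):
        # Membership is assigned ahead of time, not during the bid request.
        self._segments_by_user[user_id].add(segment)

    def segments_for(self, user_id):
        # Unknown users simply have no segments.
        return set(self._segments_by_user.get(user_id, ()))


def eligible_campaigns(store, user_id, campaign_segments):
    """Return the campaigns whose target segment the user already belongs to."""
    user_segments = store.segments_for(user_id)
    return sorted(
        campaign
        for campaign, segment in campaign_segments.items()
        if segment in user_segments
    )
```

At bid time the only work left is one lookup plus a membership check per campaign, which is the kind of saving the slide describes.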
Old Approach for Segment Data
Limitations
• Periodically updated.
• Only a subsection of the data.
• Cluster performance is affected during a data push.
Application Nodes (Tomcat + MySQL )
Event Logs
Hadoop Aggregation
MySQL Data Push
Cassandra Approach for Segment Data
Better!
• Updating in real time is now
possible
• Distributed not duplicated
• Less complexity to manage
• Storing more information
• We can now bid on users
sooner!
Application Nodes (Tomcat + Less MySQL Usage)
Cassandra
One Ring to rule them all
http://askyyy.blog.163.com/blog/static/1234575992010428819399/
Peer-to-peer, per-operation replication
Fail fast, self-healing
Each write goes to all natural endpoints
Hinted handoff if destination is down
Repair on Read
No more: STOP SLAVE; SET GLOBAL
SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
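The replication bullets on this slide can be sketched with a toy ring. This is a simplified model under stated assumptions, not Cassandra's implementation: the Ring class, its method names, and the MD5-based token function are all hypothetical, replicas are chosen by walking clockwise from the key's position, and read repair is not modeled.

```python
import bisect
import hashlib


def token(value):
    """Map a string to an integer ring position (illustrative MD5 hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: every write goes to all natural endpoints of its key."""

    def __init__(self, nodes, replication_factor):
        self.rf = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]
        self.down = set()                 # nodes currently unreachable
        self.data = {n: {} for n in nodes}
        self.hints = []                   # (target_node, key, value) awaiting delivery

    def natural_endpoints(self, key):
        # First node at or after the key's token, then the next ones clockwise.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        count = min(self.rf, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def write(self, key, value):
        for node in self.natural_endpoints(key):
            if node in self.down:
                # Hinted handoff: remember the write for the unavailable replica.
                self.hints.append((node, key, value))
            else:
                self.data[node][key] = value

    def recover(self, node):
        # Replay stored hints once the node is reachable again.
        self.down.discard(node)
        remaining = []
        for target, key, value in self.hints:
            if target == node:
                self.data[node][key] = value
            else:
                remaining.append((target, key, value))
        self.hints = remaining
```

No node is special in this model, so there is no master to fail over and no slave to restart by hand; a replica that was down catches up when its hints are replayed.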
Multi Data Center
No designing and managing complex replication topologies
create keyspace world
with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options={1:3, 2:3, 3:3};
The same process as single data center
No log shipping, or separate processes to run
Monitoring & Management
Many, many things to monitor with JMX
Nice command line tools
Most values can be tweaked at run time
Capacity Planning
How many
Rows
Columns
Size of Average Column
Latency requirements
Throughput: reads and writes per second
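The questions above feed a back-of-the-envelope estimate along these lines. This is a minimal sketch: the function names are hypothetical, and the storage figure is a lower bound that ignores per-column overhead (names, timestamps), indexes, and compaction headroom.

```python
SECONDS_PER_DAY = 86_400


def raw_data_bytes(rows, columns_per_row, avg_column_bytes, replication_factor):
    """Lower-bound estimate of stored bytes across all replicas."""
    return rows * columns_per_row * avg_column_bytes * replication_factor


def ops_per_second(requests_per_day, ops_per_request):
    """Average Cassandra operations per second for a given daily request volume."""
    return requests_per_day * ops_per_request / SECONDS_PER_DAY
```

For example, with made-up inputs, one million rows of ten 100-byte columns at a replication factor of 3 comes to about 3 GB of raw column data, and 86.4 million requests a day at two operations each averages 2,000 operations per second. Peak traffic will be higher than the daily average, so the throughput number is a floor for planning, not a target.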
Unit Tests FTW!
Max 2 billion columns per row
Awesome
Unless you accidentally write 2 billion columns to a row key named “null”
Check maxRowSize via JMX
Watch logs for messages about compacting large rows
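One way to catch the “null” row key before it reaches the cluster is a small guard with a unit test around it. The sketch below is an assumption about how such a guard might look, not the presenters' code; make_row_key is a hypothetical helper. In Java, concatenating a null reference into a string produces the text "null", and in Python str(None) produces "None", so the guard rejects both spellings.

```python
def make_row_key(user_id):
    """Build a row key, refusing values that would pile every write into one row."""
    if user_id is None:
        raise ValueError("user_id is missing")
    key = str(user_id).strip()
    # A stringified null or an empty id would become one giant shared row.
    if key == "" or key.lower() in ("null", "none"):
        raise ValueError("suspicious row key: %r" % key)
    return key
```

A test that feeds a missing id through the write path and expects a failure is cheap, and it fails long before a row grows large enough to show up in the compaction logs.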
Local (NYC) Meetups
www.meetup.com/NYC-Cassandra-User-Group/