Real Time Bidding with Apache Cassandra
Introducing RTB
RTB @ Kenshoo:
- Concepts
- Architecture
- Challenges
Real Time Bidding (RTB)
● Real-time bidding is a dynamic auction process where each
impression is bid for in (near) real time, as opposed to a static auction
● Kenshoo is engaged in Facebook Exchange (FBX)
● In FBX, each bid has a lifetime of 120ms. All transactions have to
complete within that period, and the winning ad is presented to the
user.
● Kenshoo employs ad re-targeting, where search engine campaigns
are extended to the social network, giving a much higher ROI for
our customers
Flow
WebSite → RTB
RTB Logical Architecture
● RTB Front: Bidder, Win, Error, Opt Out, Pixel Matcher
● RTB Backend: RTB Brain, RTB Reporter
● Cassandra: Cookie to Segment(s), Bid Decision Trees, Campaigns Metadata
Focus on RTB Cassandra
RTB @ Kenshoo:
- Architecture
- Challenges
Requirements
● Handle 25K+ requests within the 120ms bid time-frame, including network latencies
● Ability to scale up to 1M requests per minute while keeping the
current latency
● Handle ~10K writes/second with low latency
● Multi-DC configuration; all nodes must be synced in real time
● Seamless operations: compactions and repairs
● High security
C* Physical Architecture
● (US) West Region: App nodes, connected via VPN (FBX WEST)
● (US) East Region: App nodes, connected via VPN (FBX EAST)
● Regions connected over the Internet via a GRE tunnel
C* Cluster Information
● Cassandra version 1.2.6
● Oracle Java 7
● Manual tokens, vnodes are coming soon
● Multi-DC configuration
● Network Topology
● DC connectivity between VPCs via Linux GRE
● Amazon c3.2xlarge instance type
● Ubuntu 13.10 with EXT4
● SSD (ephemeral)
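A multi-DC keyspace of this kind is typically defined with NetworkTopologyStrategy, which places replicas per data center. A sketch in CQL via cqlsh; the keyspace name, DC names, and replication factors are illustrative assumptions, not Kenshoo's actual settings:

```shell
# Illustrative only: 'rtb', 'us-east', 'us-west', and RF=3 are assumptions.
# NetworkTopologyStrategy keeps a configurable replica count in each DC,
# matching the multi-DC (US East / US West) layout described above.
cqlsh -e "CREATE KEYSPACE rtb
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'us-west': 3
  };"
```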
The Ring
C* Cluster Network Between Sites
● For security reasons we:
○ Do not use EC2Snitch or EC2MultiRegionSnitch
○ Connected the nodes via VPN (Linux GRE)
● Linux GRE is fast, reliable and provides high throughput
(~1Gb/s)
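A Linux GRE tunnel between two nodes can be brought up roughly like this (all interface names and addresses are made up for illustration; AWS VPC routing and security-group rules are omitted):

```shell
# On the west-side node; addresses are placeholders for illustration.
# local = this node's public IP, remote = the peer's public IP.
ip tunnel add gre1 mode gre local 54.0.0.1 remote 54.0.0.2 ttl 255
# Assign a private point-to-point address inside the tunnel.
ip addr add 10.0.0.1/30 dev gre1
ip link set gre1 up
# Mirror the setup on the east node with local/remote swapped
# and the tunnel address 10.0.0.2/30.
```

Note that plain GRE does not encrypt the traffic, which is the trade-off called out in the lessons-learned slide.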
C* Cluster Storage
● We started with Amazon EBS:
○ With a small number of nodes (up to 4): you want persistent storage, to avoid running repairs if you lose a node
○ 4x EBS devices in a RAID10 configuration: provide up to 1000 IOPS, with bursts of up to 2000 IOPS
○ Cheap in AWS
● 8 nodes with ephemeral devices:
○ Lower risk: if you lose a node, recovery isn’t as heavy on the whole cluster
○ We used RAID0
○ Higher performance (double that of EBS)
○ Free, bundled with the instances
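Striping the ephemeral disks into RAID0 is typically done with mdadm; a sketch assuming two ephemeral devices at /dev/xvdb and /dev/xvdc (device names vary by instance type, and the mount point is an assumption):

```shell
# Build a RAID0 stripe across the two ephemeral disks (assumed names).
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
# Format with EXT4, as used on this cluster, and mount for Cassandra data.
mkfs.ext4 /dev/md0
mkdir -p /var/lib/cassandra
mount /dev/md0 /var/lib/cassandra
```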
C* Cluster Storage continued
● 16 nodes with ephemeral devices:
○ When load became heavy we grew to 16 nodes
○ Compactions and repairs harmed the cluster latency
○ We had to use Provisioned IOPS devices for C* maintenance
● C3 instance type with SSD:
○ Came just in time, providing ephemeral SSD storage
○ Solved our performance problems and enabled seamless compactions and repairs
○ Amazon currently has scarce deployment of this hardware and nodes are not stable
○ Not yet available in all regions
○ C3 node deployment is not always a possibility due to AWS capacity issues
○ Amazon promised to resolve the C3 issues next month
C* Cluster Performance
Monitoring
● We rely heavily on DataStax OpsCenter
● We export OpsCenter metrics for graphing
● We wrote our own read/write speed test on a separate, dedicated keyspace on
each node to detect bottlenecks and problematic nodes
● We sample the data separately from the application to determine whether
problems originate in C* or in the application
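The per-node speed test itself is not shown in the slides; a minimal, driver-agnostic sketch of the idea is to time an operation repeatedly and report latency percentiles. Here `op` stands in for a real read or write against the dedicated test keyspace (the real version would issue it via a Cassandra driver):

```python
import time
import statistics

def sample_latency(op, samples=100):
    """Time `op` repeatedly and return latency percentiles in milliseconds.

    `op` would be a read or write against the dedicated per-node test
    keyspace; here it is any zero-argument callable, so the sampler
    stays independent of the driver.
    """
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        op()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "median_ms": statistics.median(latencies),
        "p95_ms": latencies[int(len(latencies) * 0.95) - 1],
        "max_ms": latencies[-1],
    }

# Usage sketch: swap the no-op for a real C* read/write on each node,
# and alert when one node's percentiles drift from the cluster baseline.
stats = sample_latency(lambda: None, samples=50)
print(stats)
```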
What have we learned
● Storage:
○ Use SSD:
■ It provides high and stable disk performance
■ It neutralizes the effect of compactions and repairs on the cluster
■ Worth the money
● Network:
○ Use the highest-bandwidth VPN possible
○ GRE is great (lacks encryption, but provides the best bandwidth)
● Maintenance:
○ Run compact daily: it does miracles for performance under heavy load
○ If you are not on SSD, disable thrift on the node before running compaction
○ Do compactions in sequence, node by node
○ On high-load systems, avoid repair as much as possible; it’s better to decommission
and recommission a node than to run repair!
○ If you have to repair, always use the “-pr” flag and, if possible, use the
incremental repair option (requires heavy scripting)
● Monitoring:
○ Write a sampler and speed tester for each node to detect bottlenecks and the sources of performance issues
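The maintenance routine above maps onto nodetool roughly as follows (the hostname is a placeholder, and the incremental-repair scripting mentioned in the slides is not shown):

```shell
# Run on one node at a time, in sequence around the ring.
HOST=node1.example.com   # placeholder hostname

# If the node is not on SSD, take it out of the client path first.
nodetool -h "$HOST" disablethrift

# Major compaction (the daily "compact" step from the slides).
nodetool -h "$HOST" compact

# Re-enable client traffic once compaction finishes.
nodetool -h "$HOST" enablethrift

# When repair is unavoidable, repair only this node's primary ranges.
nodetool -h "$HOST" repair -pr
```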
Thank you