C* Summit EU 2013: Being Closer to Cassandra at Ok.ru

Oleg Anastasyev, Lead Platform Developer, Odnoklassniki.ru: Being Closer to Cassandra

Upload: planet-cassandra

Post on 20-Jun-2015






Speaker: Oleg Anastasyev, Lead Platform Developer at Odnoklassniki.ru

Video: http://www.youtube.com/watch?v=1ThzjoWJsaQ&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e

Odnoklassniki uses Cassandra for business data that doesn't fit into RAM. This data typically grows fast, is frequently accessed by our users, and must always be available, because it constitutes our primary business as a social network. The way we use Cassandra is somewhat unusual: we don't use Thrift or the Netty-based native protocol to communicate with Cassandra nodes remotely. Instead, we co-locate Cassandra nodes in the same JVM as the business service logic and expose a business-level interface remotely rather than generic data manipulation. This way we avoid extra network round trips within a single business transaction and use internal calls to Cassandra classes to get information faster. It also lets us make many small hacks on Cassandra's internals, with huge gains in efficiency and ease of distributed server development.


Page 1: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru

Oleg Anastasyev, lead platform developer, Odnoklassniki.ru

Being Closer to Cassandra

Page 2: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Top 10 of the world's social networks

40M DAU, 80M MAU, 7M peak

~ 300 000 www req/sec, 20 ms render latency

>240 Gbit out

> 5 800 iron servers in 5 DCs

99.9% Java

* Odnoklassniki means “classmates” in English

Page 3: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Cassandra @ Odnoklassniki

* Since 2010
  - branched 0.6
  - aiming at: full operation on DC failure, scalability, ease of operations

* Now
  - 23 clusters
  - 418 nodes in total
  - 240 TB of stored data
  - survived several DC failures

Page 4: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Case #1. The fast

Page 5: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Like! 103 927 You and 103 927

Page 6: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru



Like! widget

* It's everywhere
  - On every page, by the dozen
  - On feeds (AKA timeline)
  - On 3rd party websites elsewhere on the internet

* It's on everything
  - Pictures and albums
  - Videos
  - Posts and comments
  - 3rd party shared URLs

Like! 103 927

Page 7: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru



Like! widget

* High load
  - 1 000 000 reads/sec, 3 000 writes/sec

* Hard load profile
  - Read mostly
  - Long tail (40% of reads are random)
  - Sensitive to latency variations
  - 3 TB total dataset (9 TB with RF) and growing
  - ~ 60 billion likes for ~6 billion entities

Like! 103 927

Page 8: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Classic solution

SQL table:

RefId:long    RefType:byte    UserId:long    Created
9999999999    PICTURE(2)      11111111111    11:00

To render “You and 4256”:

SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,? (98% are NONE)

SELECT COUNT (*) WHERE RefId,RefType=?,? (80% are 0)

SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

Slide annotations: = N >=1, = M>N, = N*140

Page 9: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Cassandra solution

LikeByRef (
    refType byte,
    refId bigint,
    userId bigint,
    PRIMARY KEY ( (RefType,RefId), UserId)
)

LikeCount (
    refType byte,
    refId bigint,
    likers counter,
    PRIMARY KEY ( (RefType,RefId))
)

So, to render “You and 4256”:

SELECT FROM LikeCount WHERE RefId,RefType=?,? (80% are 0)

SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,? (98% are NONE)

= N*20%
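The two lookups above can be sketched as a render step that reads the counter first and only checks the viewer's own like when the counter is non-zero. This is one reading of the "= N*20%" note, not something the slides confirm. `LikeSummary` is named on a later slide, but its fields, `LikeRenderer`, and both lookup parameters are hypothetical stand-ins for the real storage calls.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical result type: total like count plus "did the viewer like it".
final class LikeSummary {
    final long count;
    final boolean likedByViewer;

    LikeSummary(long count, boolean likedByViewer) {
        this.count = count;
        this.likedByViewer = likedByViewer;
    }
}

final class LikeRenderer {
    // countLookup stands in for the LikeCount read,
    // viewerLikedLookup for the LikeByRef existence check.
    static LikeSummary render(LongSupplier countLookup, BooleanSupplier viewerLikedLookup) {
        long count = countLookup.getAsLong();
        if (count == 0) {
            // Nobody liked the entity, so the viewer did not either:
            // the second lookup is skipped entirely.
            return new LikeSummary(0, false);
        }
        return new LikeSummary(count, viewerLikedLookup.getAsBoolean());
    }
}
```

With 80% of counters at zero, this ordering would leave the existence check for roughly one entity in five.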

Page 10: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


>11 M iops

LikeByRef (
    refType byte,
    refId bigint,
    userId bigint,
    PRIMARY KEY ( (RefType,RefId, UserId) )
)

* Quick workaround?

SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

  - Forces Order Preserving Partitioner (random does not scale)
  - Key range scans
  - More network overhead
  - Partitions count >10x, dataset size >2x

Page 11: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


By column bloom filter

* What it does
  - Includes pairs of (PartKey, ColumnKey) in SSTable *-Filter.db

* The good
  - Eliminated 98% of reads
  - Less false positives

* The bad
  - They become too large: GC promotion failures... but fixable (CASSANDRA-2466)
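A minimal sketch of the idea: a bloom filter whose entries are (partition key, column key) pairs rather than partition keys alone, so a lookup for a like that does not exist can usually be rejected without touching the SSTable. The class name, the FNV-1a hash, and the double-hashing scheme are illustrative assumptions, not Cassandra's filter implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration, not Cassandra's own filter code.
final class PairBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    PairBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // FNV-1a over the combined key bytes, with a seed mixed into the basis.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    private static byte[] pair(String partKey, String columnKey) {
        // The separator keeps ("ab", "c") distinct from ("a", "bc").
        return (partKey + '\u0000' + columnKey).getBytes(StandardCharsets.UTF_8);
    }

    // Double hashing: the i-th probe is h1 + i * h2, reduced to the bit range.
    private int index(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String partKey, String columnKey) {
        byte[] data = pair(partKey, columnKey);
        for (int i = 0; i < hashes; i++) {
            bits.set(index(data, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String partKey, String columnKey) {
        byte[] data = pair(partKey, columnKey);
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The filter grows with the number of columns rather than the number of partitions, which is where the "too large" problem on the slide comes from.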

Page 12: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Are we there yet ?

- min 2 roundtrips per render (COUNT+RR)
- THRIFT is slow, esp. with a lot of connections
- EXISTS() is 200 Gbit/sec (140*8*1Mps*20%)



application server> 400

1. COUNT()


Page 13: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru



- one-nio remoting (faster than java nio)
- topology aware clients



get() : LikeSummary

Remote Business Intf

Counters Cache

Social Graph Cache

Page 14: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Co-location wins

* Fast TOP N friend likers query
  1. Take friends from graph cache
  2. Check them with the in-memory bloom filter
  3. Read some until N friends found

* Custom caches
  - Tuned for application

* Custom data merge logic
  - ... so you can detect and resolve conflicts
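The three steps of the TOP N friend likers query can be sketched as a single loop. This is an illustration under stated assumptions: `FriendLikers`, `mightHaveLiked` (standing in for the in-memory bloom filter) and `hasLiked` (standing in for the actual storage read) are hypothetical names, and user ids are plain `long` values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

final class FriendLikers {
    // friends:        friend ids taken from the social graph cache (step 1)
    // mightHaveLiked: bloom filter check, may return false positives (step 2)
    // hasLiked:       authoritative read from storage (step 3)
    static List<Long> topN(List<Long> friends,
                           LongPredicate mightHaveLiked,
                           LongPredicate hasLiked,
                           int n) {
        List<Long> found = new ArrayList<>();
        for (long friend : friends) {
            if (found.size() >= n) {
                break; // N friends found, stop reading
            }
            if (!mightHaveLiked.test(friend)) {
                continue; // filter says "definitely not", skip the read
            }
            if (hasLiked.test(friend)) {
                found.add(friend);
            }
        }
        return found;
    }
}
```

Because everything runs inside the storage node's JVM, each step here is a local call rather than a network round trip.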

Page 15: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Listen for mutations

// Implement it
interface StoreApplyListener {
    boolean preapply(String key, ColumnFamily data);
}

* Register it between commit log replay and gossip

* RowMutation.apply() extend original mutation
  + Replica, hints, ReadRepairs

// and register with CFS
store = Table.open(..)
    .getColumnFamilyStore(..);
store.setListener(myListener);

Page 16: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Like! optimized counters

LikeCount (
    refType byte,
    refId bigint,
    ip inet,
    counter int,
    PRIMARY KEY ( (RefType,RefId), ip)
)

* Counters cache
  - Off heap (sun.misc.Unsafe)
  - Compact (30M in 1G RAM)
  - Read cached local node only

* Replicated cache state
  - cold replica cache problem
  - making (NOP) mutations

less reads
  - long tail aware

Page 17: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Read latency variations

* CS read behavior
  1. Choose 1 node for data and N for digest
  2. Wait for data and digest
  3. Compare and return (or RR)

* Nodes suddenly slow down
  - SEDA hiccup, commit log rotation, sudden IO saturation, network hiccup or partition, page cache miss

* The bad
  - You have spikes.
  - You have to wait (and timeout)

Page 18: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Read latency leveling

* “Parallel” read handler
  1. Ask all replicas for data in parallel
  2. Wait for CL responses and return

* The good
  - Minimal latency response
  - Constant load when DC fails

* The (not so) bad
  - “Additional” work and traffic
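The “parallel” read handler can be sketched with `CompletableFuture`: every replica is asked at once, and the caller is released as soon as the first CL responses arrive. This is a minimal sketch, assuming each replica read is already represented as a started future. `ParallelRead` is a hypothetical name, and merging the collected responses is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class ParallelRead {
    // replicaReads: one in-flight data read per replica (all started in parallel)
    // consistencyLevel: how many responses to wait for before returning
    static <T> CompletableFuture<List<T>> read(List<CompletableFuture<T>> replicaReads,
                                               int consistencyLevel) {
        CompletableFuture<List<T>> result = new CompletableFuture<>();
        List<T> responses = new ArrayList<>();
        AtomicInteger failures = new AtomicInteger();

        for (CompletableFuture<T> replicaRead : replicaReads) {
            replicaRead.whenComplete((value, error) -> {
                if (error != null) {
                    // Fail only when the remaining replicas cannot reach CL any more.
                    if (failures.incrementAndGet() > replicaReads.size() - consistencyLevel) {
                        result.completeExceptionally(error);
                    }
                    return;
                }
                synchronized (responses) {
                    if (result.isDone()) {
                        return; // a slow replica answered after CL was reached
                    }
                    responses.add(value);
                    if (responses.size() == consistencyLevel) {
                        result.complete(new ArrayList<>(responses));
                    }
                }
            });
        }
        return result;
    }
}
```

A replica that hiccups no longer delays the response, because the fastest CL replicas decide the latency. The price is that every replica does a full data read on every request.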

Page 19: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


More tiny tricks

* On SSD IO
  - Deadline IO elevator
  - 64k -> 4k read request size

* HintLog
  - Commit log for hints
  - Wait for all hints on startup

* Selective compaction
  - Compacts most read CFs more often

Page 20: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Case #2. The fat

Page 21: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


* Messages in chats
  - Last page is accessed on open
  - long tail (80%) for rest
  - 150 billion, 100 TB in storage
  - Read mostly (120k reads/sec, 8k writes/sec)

Page 22: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Messages have structure

Message (
    chatId, msgId,
    created, type, userIndex, deletedBy, ...
    text
)

MessageCF (
    chatId, msgId,
    data blob,
    PRIMARY KEY ( chatId, msgId )
)

- All of a chat's messages in a single partition
- Single blob for message data, to reduce overhead
- The bad: conflicting modifications can happen (users, anti-spam, etc.)

Page 23: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


LW conflict resolution

Messages (
    chatId, msgId,
    version timestamp,
    data blob,
    PRIMARY KEY ( chatId, msgId, version )
)

Writer 1 reads (version:ts1, data:d1), then write( ts1, data2 ):
    delete(version:ts1)
    insert(version: ts2=now(), data2)

Writer 2 reads (version:ts1, data:d1), then write( ts1, data3 ):
    delete(version:ts1)
    insert(version: ts3=now(), data3)

Result: (ts2, data2)(ts3, data3) - merged on read
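The write path above can be sketched in a few lines: every update deletes the version it was based on and inserts a new one, so two concurrent updates of the same version both survive and are merged by the reader. `VersionedMessage` is a hypothetical in-memory stand-in for one (chatId, msgId) row group, and the merge function is supplied by the caller because the slides do not describe the actual merge rules.

```java
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for the versions stored under one (chatId, msgId).
final class VersionedMessage {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    void insert(long version, String data) {
        versions.put(version, data);
    }

    // An update is "delete the version I read" plus "insert my new version".
    void update(long readVersion, long newVersion, String data) {
        versions.remove(readVersion);
        versions.put(newVersion, data);
    }

    // More than one surviving version means concurrent writers collided.
    boolean hasConflict() {
        return versions.size() > 1;
    }

    // Surviving versions are folded in timestamp order with the caller's merge logic.
    String read(BinaryOperator<String> merge) {
        return versions.values().stream().reduce(merge).orElse(null);
    }
}
```

No locks or read-before-write coordination are needed: a conflict simply shows up as two rows instead of one.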

Page 24: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Specialized cache

* Again. Because we can
  - Off-heap (Unsafe)
  - Caches only the freshest chat page
  - Saves its state to local (AKA system) CF: keys AND values, seq read, much faster startup
  - In-memory compression: 2x more memory almost free

Page 25: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Disk mgmt

* 4U, HDD x24, up to 4 TB/node
  - Size tiered compaction = 4 TB sstable file
  - RAID10? LCS?

* Split CF to 256 pieces

* The good
  - Smaller, more frequent memtable flushes
  - Same compaction work in smaller sets
  - Can distribute across disks
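The slides do not say how rows are assigned to the 256 pieces. One simple possibility, shown purely as an assumption, is to hash the partition key and use the result to pick the piece. `CfSplitter`, `pieceOf`, and the `base_N` naming are all hypothetical.

```java
// Hypothetical routing of a partition key to one of 256 column family pieces.
final class CfSplitter {
    static final int PIECES = 256;

    // floorMod keeps the result in 0..255 even for negative hash codes.
    static int pieceOf(long chatId) {
        return Math.floorMod(Long.hashCode(chatId), PIECES);
    }

    // Assumed naming scheme: base name plus piece index.
    static String cfName(String base, long chatId) {
        return base + "_" + pieceOf(chatId);
    }
}
```

Each piece then has its own memtable and its own, much smaller, sstables, and those files can be placed on different disks.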

Page 26: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Disk Allocation Policies

* Default is
  - “Take disk with most free space”

* Some disks have
  - Too much read iops

* Generational policy
  - Each disk has same # of same gen files
  - Works better for HDD
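A generational policy can be sketched as "put the new file on the disk that currently holds the fewest files of the same generation". This is a minimal sketch with hypothetical names. It assumes "generation" is an integer tag attached to each sstable, since the slides do not define how the generation is derived.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GenerationalAllocator {
    // For each disk: generation -> number of files of that generation on the disk.
    private final List<Map<Integer, Integer>> countsByDisk = new ArrayList<>();

    GenerationalAllocator(int disks) {
        for (int i = 0; i < disks; i++) {
            countsByDisk.add(new HashMap<>());
        }
    }

    // Returns the index of the disk chosen for a new file of the given generation.
    int allocate(int generation) {
        int best = 0;
        int bestCount = Integer.MAX_VALUE;
        for (int disk = 0; disk < countsByDisk.size(); disk++) {
            int count = countsByDisk.get(disk).getOrDefault(generation, 0);
            if (count < bestCount) { // ties go to the lowest disk index
                best = disk;
                bestCount = count;
            }
        }
        countsByDisk.get(best).merge(generation, 1, Integer::sum);
        return best;
    }
}
```

Unlike "most free space", this spreads files of similar age and read frequency evenly, so no single disk collects all the hot files.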

Page 27: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Case #3. The ugly

feed my Frankenstein

Page 28: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


* Chats overview
  - small dataset (230 GB)
  - has hot set, short tail (5%)
  - list reorders often
  - 130k read/s, 21k write/s

Page 29: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Conflicting updates

* List<Overview> is a single blob
  .. or you’ll have a lot of tombstones

* Lots of conflicts: updates of a single column

* Need conflict detection

* Has merge algorithm

Page 30: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Vector clocks

* Voldemort
  - byte[] key -> byte[] value + VC
  - Coordination logic on clients
  - Pluggable storage engines

* Plugged
  - CS 0.6 SSTables persistence
  - Fronted by specialized cache

we love caches
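Conflict detection with vector clocks comes down to comparing per-node counters: if each clock is ahead of the other on at least one node, the two values were written concurrently and need the merge algorithm. The sketch below is a generic textbook version, not Voldemort's own classes, and all names in it are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VectorClock {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    private final Map<String, Long> counters = new HashMap<>();

    // Called by the node that coordinates a write.
    VectorClock increment(String node) {
        counters.merge(node, 1L, Long::sum);
        return this;
    }

    VectorClock copy() {
        VectorClock c = new VectorClock();
        c.counters.putAll(counters);
        return c;
    }

    Order compare(VectorClock other) {
        boolean less = false;
        boolean greater = false;
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            long mine = counters.getOrDefault(node, 0L);
            long theirs = other.counters.getOrDefault(node, 0L);
            if (mine < theirs) less = true;
            if (mine > theirs) greater = true;
        }
        if (less && greater) return Order.CONCURRENT; // conflict: needs a merge
        if (less) return Order.BEFORE;
        if (greater) return Order.AFTER;
        return Order.EQUAL;
    }
}
```

A `CONCURRENT` result is the signal to run the application's merge logic on the two blobs instead of letting the latest timestamp win.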

Page 31: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Performance

* 3 node cluster, RF = 3
  - Intel Xeon CPU E5506 2.13GHz, RAM: 48 GB, 1x HDD, 1x SSD

* 8 byte key -> 1 KB byte value

* Results
  - 75k/sec reads, 15k/sec writes

Page 32: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Why Cassandra?

* Reusable distributed DB components: fast persistence, gossip, reliable async messaging, fail detectors, topology, seq scans, ...

* Has structure beyond byte[] key -> byte[] value

* Delivered promises

* Implemented in Java

Page 33: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru



one-nio: RMI faster than java nio, with fast and compact automagic java serialization

shared-memory-cache: Java off-heap cache using shared memory

Oleg [email protected]/oa@m0nstermind
