![Page 1: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/1.jpg)
Myths of big partitions
Robert Stupp
Solution Architect @ DataStax, C*-Committer
@snazy
![Page 2: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/2.jpg)
Issues with big partitions before 3.6
• Slow reads
• Compaction failures
• Repair failures
• java.lang.OutOfMemoryError: fail fast, node down (lots of org.apache.cassandra.io.sstable.IndexInfo on heap)
© DataStax, All Rights Reserved. 2
![Page 3: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/3.jpg)
SSTable Components
• Bloom Filter: determines whether an SSTable contains a partition (bloomFilterFpChance)
• Summary: partition samples (minIndexInterval / maxIndexInterval)
• Primary Index: all partition keys + index samples (column_index_size_in_kb)
• Data: all the data
![Page 4: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/4.jpg)
Read from an SSTable
Bloom Filter
1. Check whether partition is in SSTable
Summary
2. Find “nearest” partition key
3. Return offset in primary index
Primary Index
4. Find partition
5. Find clustering key
6. Return offset in data file
Data
7. Find, read and return data
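The seven steps above can be sketched as a toy model. This is a minimal illustration under stated assumptions, not Cassandra's on-disk format: the class name `SSTableSketch`, the `summary_interval` parameter, ordering by plain key instead of by token, and the use of a `set` in place of a real Bloom filter are all hypothetical simplifications.

```python
from bisect import bisect_right


class SSTableSketch:
    """Toy model of the four components; not Cassandra's real format."""

    def __init__(self, partitions, summary_interval=2):
        # partitions: {partition_key: {clustering_key: value}}
        self.keys = sorted(partitions)
        # Data: the list position stands in for an offset in the data file.
        self.data = [sorted(partitions[k].items()) for k in self.keys]
        # Primary index: every partition key with its data offset.
        self.primary_index = [(k, i) for i, k in enumerate(self.keys)]
        # Summary: every n-th key with its offset into the primary index.
        self.summary = [(k, i) for i, k in enumerate(self.keys)
                        if i % summary_interval == 0]
        # Bloom filter stand-in: a set never gives false positives,
        # a real Bloom filter does (bloomFilterFpChance).
        self.bloom = set(self.keys)

    def read(self, partition_key, clustering_key):
        # 1. Check whether the partition is in the SSTable.
        if partition_key not in self.bloom:
            return None
        # 2./3. Find the nearest sampled key and its primary index offset.
        sampled = [k for k, _ in self.summary]
        pos = bisect_right(sampled, partition_key) - 1
        index_offset = self.summary[pos][1]
        # 4. Scan the primary index forward from there to the partition.
        for key, data_offset in self.primary_index[index_offset:]:
            if key == partition_key:
                break
        else:
            return None
        # 5.-7. Simplified: look up the clustering key directly in the
        # partition's data instead of going through index samples.
        return dict(self.data[data_offset]).get(clustering_key)
```

For example, `SSTableSketch({"a": {1: "x"}, "b": {1: "y", 2: "z"}}).read("b", 2)` walks all four components and returns the stored value.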
![Page 5: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/5.jpg)
Before CASSANDRA-11206
![Page 6: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/6.jpg)
Evaluation of SSTable Components
• Bloom Filter: off-heap, small (fine)
• Summary: off-heap, small-ish (fine)
• Primary Index: on-heap, many small objects, nested structure (problematic)
• Data: for CQL since #8099 (fine)
![Page 7: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/7.jpg)
Primary Index File Layout
[Diagram: the primary index file is a sequence of entries, each consisting of a partition key followed by that partition's index samples. The Summary points into this file.]
![Page 8: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/8.jpg)
Sampling the Primary Index
[Diagram: an index entry holds the partition key and its offset in the SSTable data file. The partition in the data file is divided into blocks of column_index_size_in_kb (default: 64kB); each block is sampled with its FirstKey and LastKey.]
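The sampling shown on this slide can be sketched roughly as follows. It is a simplified illustration, not the actual Cassandra writer: `sample_partition` and its `rows` input are hypothetical, the `IndexInfo` dataclass only mirrors the field names from the slides (deletionInfo is left out), and clustering keys are plain strings.

```python
from dataclasses import dataclass

COLUMN_INDEX_SIZE = 64 * 1024  # column_index_size_in_kb default (64kB)


@dataclass
class IndexInfo:
    first_key: str
    last_key: str
    offset: int  # start of the block, relative to the partition start
    width: int   # block length in bytes


def sample_partition(rows, block_size=COLUMN_INDEX_SIZE):
    """rows: iterable of (clustering_key, serialized_size) in clustering order."""
    samples = []
    first = last = None
    block_start = position = 0
    for key, size in rows:
        if first is None:
            first = key
        last = key
        position += size
        # Close the block once it has reached the configured size.
        if position - block_start >= block_size:
            samples.append(IndexInfo(first, last, block_start, position - block_start))
            first, block_start = None, position
    if first is not None:  # trailing, partially filled block
        samples.append(IndexInfo(first, last, block_start, position - block_start))
    return samples
```

With 1,024 rows of 1kB each (a 1MB partition), this yields 16 samples, one per 64kB block.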
![Page 9: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/9.jpg)
How it looks on-heap
IndexedEntry: partitionKey*, offset, deletionInfo
• IndexInfo: firstKey, lastKey, offset, width, deletionInfo
• IndexInfo: firstKey, lastKey, offset, width, deletionInfo
• IndexInfo: firstKey, lastKey, offset, width, deletionInfo
• …
* = technically not in IndexedEntry
![Page 10: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/10.jpg)
Primary Index Structure
IndexedEntry extends RowIndexEntry
• DeletionTime
• ArrayList of IndexInfo (one per 64kB), each with:
  DeletionTime
  BufferClustering: Kind, ByteBuffer[], ByteBuffer, byte[], …
  BufferClustering: Kind, ByteBuffer[], ByteBuffer, byte[], …
# of Java objects:
• IndexedEntry: 4
• IndexInfo (per 64kB): 8 + 4 * clust-key-components
(primitive fields omitted)
![Page 11: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/11.jpg)
Primary Index - some numbers
Approximation for one 16-byte clustering value:

| Partition Size | Index Size (heap) | # of objects |
|---|---|---|
| 1MB | 3kB | > 200 objects |
| 4MB | 11kB | > 800 objects |
| 64MB | 180kB | > 13,000 objects |
| 512MB | 1.4MB | > 106,000 objects |
| 2048MB | 5.6MB | > 424,000 objects |
Disclaimer: numbers are examples and not representative
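As a rough cross-check of these figures, the sketch below derives the number of IndexInfo samples from the 64kB block size and applies the object-count formula from the previous slide (4 per IndexedEntry, 8 + 4 * clustering-key components per IndexInfo). The function names are hypothetical. The results land in the same order of magnitude as the table but somewhat below it (for example 393,220 versus "> 424,000" for 2048MB), which fits the slide's own disclaimer that the numbers are examples.

```python
COLUMN_INDEX_SIZE = 64 * 1024  # bytes, column_index_size_in_kb default
MB = 1024 * 1024


def index_info_count(partition_bytes, block_size=COLUMN_INDEX_SIZE):
    """Number of IndexInfo samples: one per started block (ceiling division)."""
    return -(-partition_bytes // block_size)


def estimated_objects(partition_bytes, clustering_components=1):
    """Object count using the slide's formula (primitive fields omitted)."""
    per_index_info = 8 + 4 * clustering_components
    return 4 + index_info_count(partition_bytes) * per_index_info


for size_mb in (1, 4, 64, 512, 2048):
    n = index_info_count(size_mb * MB)
    objects = estimated_objects(size_mb * MB)
    print(f"{size_mb:>5}MB  {n:>6} IndexInfo  ~{objects:,} objects")
```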
![Page 12: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/12.jpg)
Reads
SELECT foo, bar FROM big_partition_table WHERE ...
• Reads IndexedEntry w/ all IndexInfo
• 2GB partition means: 32,768 IndexInfo, 424,000 objects
• Binary search just needs: 15 IndexInfo (max), O(log n), ~200 objects
Disclaimer: numbers are examples and not representative
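To make the gap between "materialized" and "actually needed" concrete, here is a small sketch that counts how many entries a binary search touches. The `(first_key, last_key)` tuples with integer keys are a hypothetical stand-in for IndexInfo objects. A textbook binary search over 32,768 entries touches at most 16 of them, the same order as the slide's figure.

```python
def find_block(blocks, key):
    """Binary search over (first_key, last_key) blocks; returns (index, probes)."""
    lo, hi, probes = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        first, last = blocks[mid]
        if key < first:
            hi = mid - 1
        elif key > last:
            lo = mid + 1
        else:
            return mid, probes
    return -1, probes


# 32,768 blocks of 100 consecutive integer keys each (a "2GB partition").
blocks = [(i * 100, i * 100 + 99) for i in range(32_768)]
worst = max(find_block(blocks, i * 100 + 50)[1] for i in range(len(blocks)))
print(len(blocks), "blocks materialized,", worst, "probed in the worst case")
```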
![Page 13: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/13.jpg)
Writes – Flushes & Compactions
IndexedEntry constructed with all IndexInfo as Java object structure on heap first, then serialized to disk
![Page 14: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/14.jpg)
Compacting a 2GB partition
[Diagram: four input SSTables, each with 106,000 index objects, are compacted into one SSTable. Compaction constructs 424,000 objects for the result and adds them to the KeyCache, while the four sets of 106,000 objects are removed from it.]
Disclaimer: numbers are examples and not representative
![Page 15: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/15.jpg)
Reads of big partitions – on heap
• Primary index data deserialized
• Object structure added to key cache
• Other entries evicted from key cache
• Also applies to compaction & repair
![Page 16: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/16.jpg)
Flushes with big partitions – on heap
• Primary index data constructed
• Object structure added to key cache (for compactions)
• Also applies to compactions
![Page 17: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/17.jpg)
Trivia
How many 2GB partitions fit in the key cache?
• 2GB partition: 5.6MB of index on heap
• Key cache: 100MB
• 100/6 = 16
Disclaimer: numbers are examples and not representative
![Page 18: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/18.jpg)
Issues w/ big partitions – TL;DR
• Amount of Java objects
• Additions and evictions to/from key cache
![Page 19: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/19.jpg)
![Page 20: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/20.jpg)
Necessities – TL;DR
• Reduce amount of Java objects
• Reduce GC pressure
• No change in sstable format, i.e. files need to be binary compatible
![Page 21: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/21.jpg)
Approach
• Omit (most) IndexInfo on heap
• Read IndexInfo only when needed
• Serialize primary index via byte buffer
• Objects “never” promoted to Java old gen (hope so ;) )
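The idea of keeping the index serialized and reading IndexInfo only when needed can be sketched as follows. This is an illustration of the general technique, not Cassandra's implementation: the fixed-width record layout (`ENTRY`), integer keys and the `SerializedIndex` class are hypothetical. Real clustering keys are variable-length, so an actual format would need an offsets table or similar.

```python
import struct
from bisect import bisect_right

# Hypothetical fixed-width record: first_key, last_key, offset, width.
ENTRY = struct.Struct(">qqqq")


class SerializedIndex:
    """Holds only the serialized bytes; decodes single entries on demand."""

    def __init__(self, entries):
        self.buf = b"".join(ENTRY.pack(*e) for e in entries)
        self.decoded = 0  # number of entries materialized so far

    def __len__(self):
        return len(self.buf) // ENTRY.size

    def entry(self, i):
        self.decoded += 1
        return ENTRY.unpack_from(self.buf, i * ENTRY.size)

    def __getitem__(self, i):
        # bisect only needs the sort key, i.e. the block's first key.
        return self.entry(i)[0]

    def find(self, key):
        """Return (offset, width) of the block containing key, or None."""
        i = bisect_right(self, key) - 1
        if i < 0:
            return None
        first, last, offset, width = self.entry(i)
        return (offset, width) if key <= last else None
```

Looking up one key in an index of 1,024 entries decodes about a dozen short-lived tuples instead of 1,024 long-lived objects, which is the effect the bullets above aim for.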
![Page 22: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/22.jpg)
Small heap (3GB) test
org.apache.cassandra.io.sstable.LargePartitionsTest
Before #11206 – duration: 3h, lots of GC, exhausted heap (java.lang.OutOfMemoryError)
With #11206 – duration: 1h10, few GC, moderate heap usage
![Page 23: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/23.jpg)
Results
• Promising!
• But: performance regression w/ some workloads
![Page 24: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/24.jpg)
Better Approach
• Keep IndexInfo objects for “nicely” sized partitions on-heap
• Controlled via cassandra.yaml
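The size-based decision can be sketched as below. This is a hedged illustration only: the function name and return values are hypothetical, and the 2kB default is an assumption based on the `column_index_cache_size_in_kb` setting in Cassandra 3.6+ as far as I am aware; check the cassandra.yaml shipped with your version.

```python
DEFAULT_THRESHOLD_KB = 2  # assumed default; verify in your cassandra.yaml


def choose_representation(serialized_index: bytes,
                          threshold_kb: int = DEFAULT_THRESHOLD_KB) -> str:
    """Return which (hypothetical) in-memory form an index entry would take."""
    if len(serialized_index) <= threshold_kb * 1024:
        # "Nicely" sized partition: decode once and keep the objects on heap.
        return "on-heap objects"
    # Big partition: keep IndexInfo off the heap and read it on demand.
    return "serialized, read on demand"
```

A higher threshold trades heap usage for fewer index reads, which is why the slide suggests testing with your own workload.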
![Page 25: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/25.jpg)
Doesn’t this mean more disk I/O?
• “Hot” data already in buffer cache
• No change for “cold” partitions
![Page 26: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/26.jpg)
#11206 Benefits
• Reduced heap usage
• Reduced GC pressure
• Improved read and write paths
• Key cache can hold “more” entries
• Moved the bad partition size “barrier”
![Page 27: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/27.jpg)
#11206 Metrics
org.apache.cassandra.metrics: type=Index,scope=RowIndexEntry
• name=IndexInfoCountHistogram - # of IndexInfo per IndexedEntry
• name=IndexInfoGetsHistogram - # of ”gets” against single IndexedEntry
• name=IndexedEntrySizeHistogram - serialized size of IndexedEntry
![Page 28: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/28.jpg)
“After #11206, what’s the recommended partition size?”
• It still depends – sorry
• IMO we moved the “barrier”
Test with your data model and workload
![Page 29: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/29.jpg)
Bad usage of large partitions
• CQL SELECT without clustering key, i.e. materialize a large partition in memory
• Using the same partition key over a long time, i.e. access many sstables
![Page 30: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/30.jpg)
![Page 31: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/31.jpg)
#9754
• Changes on-disk primary index format
• Efficient on-disk representation
• Optimized for OS page size
• WIP !
• Fix-Version: 4.x
![Page 32: Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016](https://reader036.vdocument.in/reader036/viewer/2022062223/586f76121a28ab10258b62dd/html5/thumbnails/32.jpg)
Thank You!
Q & A
Come to the “experts stand”