Cassandra Read/Write Path, by Josh McKenzie



CORE COMPONENTS


Memtable: data in memory (R/W)
CommitLog: data on disk (W/O)
SSTable: data on disk (immutable, R/O)
CacheService (Row Cache and Key Cache): in-memory caches
ColumnFamilyStore: logical grouping of table data
DataTracker and View: provide atomicity and grouping of memtable/sstable data
ColumnFamily: collection of Cells (sorted map of columns)
Cell: name, value, timestamp (TS)
Tombstone: deletion marker indicating TS and deleted cell(s)

MemTable

In-memory data structure consisting of:
  Memory pools (on-heap, off-heap)
  Allocators for each pool
  Size and limit tracking
  CommitLog sentinels
Map of Key → AtomicBTreeColumns
  Atomic copy-on-write semantics for row data
Flush-to-disk logic is triggered when a pool's usage passes a ratio relative to a user-configurable threshold
The Memtable with the largest ratio of used space (either on or off heap) is flushed to disk
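As a rough illustration of the last point, the sketch below picks the memtable with the largest usage ratio. The class names (MemtableUsage, FlushPicker), the byte counts, and the limits are hypothetical stand-ins, not Cassandra's actual classes or accounting.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one memtable's memory accounting (not Cassandra's Memtable class).
final class MemtableUsage {
    final String table;
    final long onHeapBytes;
    final long offHeapBytes;

    MemtableUsage(String table, long onHeapBytes, long offHeapBytes) {
        this.table = table;
        this.onHeapBytes = onHeapBytes;
        this.offHeapBytes = offHeapBytes;
    }

    // Larger of the on-heap and off-heap usage ratios; both limits must be positive.
    double usageRatio(long onHeapLimit, long offHeapLimit) {
        return Math.max((double) onHeapBytes / onHeapLimit,
                        (double) offHeapBytes / offHeapLimit);
    }
}

final class FlushPicker {
    // Returns the memtable with the largest usage ratio, or null if the list is empty.
    static MemtableUsage pickLargest(List<MemtableUsage> memtables,
                                     long onHeapLimit, long offHeapLimit) {
        return memtables.stream()
                .max(Comparator.comparingDouble(
                        (MemtableUsage m) -> m.usageRatio(onHeapLimit, offHeapLimit)))
                .orElse(null);
    }

    public static void main(String[] args) {
        List<MemtableUsage> memtables = Arrays.asList(
                new MemtableUsage("users", 10, 0),
                new MemtableUsage("events", 5, 90),
                new MemtableUsage("metrics", 50, 10));
        // "events" has the largest ratio (90 of 100 off-heap bytes), so it is chosen.
        System.out.println(pickLargest(memtables, 100, 100).table);
    }
}
```

Taking the maximum of the two ratios means a memtable that is small on heap can still be chosen when it dominates the off-heap pool.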

On heap vs. Off heap Memtables: an overview

http://www.datastax.com/dev/blog/off-heap-memtables-in-cassandra-2-1
https://issues.apache.org/jira/browse/CASSANDRA-6689
https://issues.apache.org/jira/browse/CASSANDRA-6694

memtable_allocation_type options:

offheap_buffers moves the cell name and value to DirectBuffer objects. The values are still live Java buffers. This mode only reduces heap usage significantly when you are storing large strings or blobs.

offheap_objects moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it.

The default in 2.1 is heap_buffers.
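The on-heap vs. off-heap distinction can be seen with plain java.nio buffers. This is only a sketch of the general idea: BufferPlacement and its methods are hypothetical and are not Cassandra's allocators.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper contrasting heap-backed and direct (off-heap) buffers.
final class BufferPlacement {
    // Copies the value into a heap-backed buffer, in the spirit of heap_buffers.
    static ByteBuffer onHeap(byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(value.length);
        buf.put(value);
        buf.flip();
        return buf;
    }

    // Copies the value into a direct buffer whose contents live outside the Java heap.
    static ByteBuffer offHeap(byte[] value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(value.length);
        buf.put(value);
        buf.flip();
        return buf;
    }

    // Copies the readable bytes back on heap without moving the buffer's own position.
    static byte[] copyToHeap(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] value = "a large blob".getBytes(StandardCharsets.UTF_8);
        System.out.println(onHeap(value).isDirect());   // false
        System.out.println(offHeap(value).isDirect());  // true
        System.out.println(new String(copyToHeap(offHeap(value)), StandardCharsets.UTF_8));
    }
}
```

The copy in copyToHeap mirrors the cost noted for offheap_objects: data that lives off heap has to be brought back on heap temporarily to be read.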

On heap vs. Off heap: continued

Why?
  Reduces the size of objects in memory: no more ByteBuffer overhead
  More data fitting in memory == better performance
Code changes that support it:
  MemtablePools allow on- vs. off-heap allocation (and Slab, for that matter)
  MemtableAllocators allow differentiating between on-heap and off-heap allocation
  DecoratedKey and *Cells changed to interfaces so they can have different allocation implementations based on native vs. heap

CommitLog

Append-only file structure that provides interim durability for writes while they're living in Memtables and haven't been flushed to sstables
Has sync logic to determine the level of durability to disk you want: either PeriodicCommitLogService or BatchCommitLogService
  Periodic (default): checks to see if it hit the window limit; if so, blocks and waits for sync to catch up
  Batch: no ack until fsync to disk. Waits for a specific window before hitting fsync, to coalesce writes
Singleton facade for commit log operations
Consists of multiple components:
  CommitLog.java: interface to subsystem
  CommitLogSegment.java: files on disk
  CommitLogManager.java: segment allocation and management
  CommitLogArchiver.java: user-defined commands pre/post flush
  CommitLogMetrics.java
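The difference between the two sync modes can be sketched with a tiny append-only log. Everything here is hypothetical (TinyCommitLog, its length-prefixed record format, the inline period check) and is not Cassandra's CommitLog API. In particular, this sketch syncs inline on the writing thread and does not coalesce batch writes, which is a deliberate simplification.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical miniature commit log; not Cassandra's implementation.
final class TinyCommitLog implements AutoCloseable {
    enum SyncMode { PERIODIC, BATCH }

    private final FileChannel channel;
    private final SyncMode mode;
    private final long syncPeriodNanos;
    private long lastSyncNanos = System.nanoTime();
    private int syncCount = 0;

    TinyCommitLog(Path file, SyncMode mode, long syncPeriodMillis) {
        try {
            this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.mode = mode;
        this.syncPeriodNanos = syncPeriodMillis * 1_000_000L;
    }

    // Appends a length-prefixed record. In BATCH mode every append is forced to
    // disk before returning; in PERIODIC mode only when the sync period has elapsed.
    synchronized void append(byte[] record) {
        ByteBuffer framed = ByteBuffer.allocate(4 + record.length);
        framed.putInt(record.length);
        framed.put(record);
        framed.flip();
        try {
            while (framed.hasRemaining()) {
                channel.write(framed);
            }
            boolean periodElapsed = System.nanoTime() - lastSyncNanos >= syncPeriodNanos;
            if (mode == SyncMode.BATCH || periodElapsed) {
                channel.force(false); // ask the OS to flush file data to the device
                lastSyncNanos = System.nanoTime();
                syncCount++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized int syncCount() {
        return syncCount;
    }

    @Override
    public synchronized void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a long period, PERIODIC appends return without forcing data to disk, trading a window of possible loss for lower write latency. BATCH pays for a force on every acknowledged write.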

SSTable

Ordered map of key-value pairs (KVP)
Immutable
Consists of 3 files:
  Bloom Filter: optimization to determine if the Partition Key you're looking for is (probably) in this sstable
  Index file: contains offsets into the data file, generally memory mapped
  Data file: contains data, generally compressed
Read by SSTableReader
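The "(probably)" in the Bloom Filter line comes from how such filters work: a negative answer is definite, a positive one is not. The sketch below shows the idea with a BitSet. TinyBloomFilter and its hash functions are illustrative assumptions and differ from the hashing Cassandra actually uses.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch; the hash functions are illustrative only.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    TinyBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derives the i-th bit index from two base hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = new StringBuilder(key).reverse().toString().hashCode() | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false: the key was definitely never added; true: the key is possibly present.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3);
        filter.add("partition-42");
        System.out.println(filter.mightContain("partition-42")); // true
    }
}
```

A read can skip any sstable whose filter returns false, and only pays for an index lookup when the filter says the key may be there.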

CacheService.java

In-memory caching service to optimize lookups of hot data
Contains three caches:
  keyCache
  rowCache
  counterCache
See: AutoSavingCache.java, InstrumentingCache.java
Tunable per table, limits in cassandra.yaml:
  Keys to cache, size in MB
  Rows, size in MB
Defaults to keys only; the row cache can be enabled via CQL

ColumnFamilyStore.java

Contains logic for a table
Holds DataTracker
Creating and removing sstables on disk
Writing / reading data
Cache initialization
Secondary index(es)
Flushing memtables to sstables
Snapshots
And much more (just short of 3k LoC)

CFS: DataTracker and View

DataTracker allows for atomic operations on a view of a Table (ColumnFamilyStore)
Contains various logic surrounding Memtables and flushing, SSTables and compaction, and notification for subscribers on changes to SSTableReaders
1 DataTracker per CFS, 1 AtomicReference<View> per DataTracker
View consists of the current Memtable, Memtables pending flush, SSTables for the CFS, and SSTables being actively compacted
The currently active Memtable is atomically switched out in: DataTracker.switchMemtable(boolean truncating)
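A minimal sketch of the atomic switch, assuming an immutable view object that is swapped with compare-and-set. ViewSketch and TrackerSketch are hypothetical names, and memtables and sstables are represented as plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Immutable snapshot of what a reader may see (hypothetical, simplified View).
final class ViewSketch {
    final String currentMemtable;
    final List<String> pendingFlush;
    final List<String> sstables;

    ViewSketch(String currentMemtable, List<String> pendingFlush, List<String> sstables) {
        this.currentMemtable = currentMemtable;
        this.pendingFlush = Collections.unmodifiableList(new ArrayList<>(pendingFlush));
        this.sstables = Collections.unmodifiableList(new ArrayList<>(sstables));
    }

    // Builds a new view: the old memtable moves to pendingFlush, the new one becomes current.
    ViewSketch switchMemtable(String newMemtable) {
        List<String> pending = new ArrayList<>(pendingFlush);
        pending.add(currentMemtable);
        return new ViewSketch(newMemtable, pending, sstables);
    }
}

final class TrackerSketch {
    private final AtomicReference<ViewSketch> view;

    TrackerSketch(ViewSketch initial) {
        this.view = new AtomicReference<>(initial);
    }

    ViewSketch view() {
        return view.get();
    }

    // Swaps in the new view with a CAS loop and returns the memtable to be flushed.
    String switchMemtable(String newMemtable) {
        ViewSketch current;
        ViewSketch updated;
        do {
            current = view.get();
            updated = current.switchMemtable(newMemtable);
        } while (!view.compareAndSet(current, updated));
        return current.currentMemtable;
    }
}
```

Because each view is immutable, a reader that grabbed the reference before the switch keeps a consistent picture, while new readers see the new memtable and the old one as pending flush.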

ColumnFamily.java

A sorted map of columns
Abstract class, extended by:
  ArrayBackedSortedColumns
    Array backed
    Non-thread-safe
    Good for iteration, adding cells (especially if in sorted order)
  AtomicBTreeColumns (memtable only)
    Btree backed
    Thread-safe w/atomic CAS
    Logarithmic complexity on operations
Logic to add / retrieve columns, counters, tombstones, atoms

THE READ PATH

Read Path: Very High Level

[Diagram: overview of the read path. Labels: Coordinator, MessagingService, Keyspace, ColumnFamilyStore, Check Row Cache (hit / miss), CollationController, Memtable, SSTables, Key Cache (hit: seek to cached position; miss: binary scan index, update cache), read merge, ColumnFamily, Update Row Cache, Return results]

Read-specific primitive: QueryFilter

Contains an IDiskAtomFilter
  IDiskAtomFilter: used to get columns from a Memtable, SSTable, or SuperColumn
  IdentityQueryFilter, NamesQueryFilter, SliceQueryFilter
Contains a variety of iterators to collate on-disk contents, gather tombstones, reduce (merge) Cells with the same name, etc.
See:
  collateColumns()
  gatherTombstones()
  getReducer(final Comparator comparator)
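The reduce step (merging Cells with the same name across the memtable and sstables) can be sketched as follows. CellSketch and CollateSketch are hypothetical. The sketch merges eagerly into a map rather than through iterators, and it assumes the newest timestamp wins, with a tombstone winning a timestamp tie.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical cell: a name, a value, a timestamp, and a deletion flag.
final class CellSketch {
    final String name;
    final String value;
    final long timestamp;
    final boolean tombstone;

    private CellSketch(String name, String value, long timestamp, boolean tombstone) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
        this.tombstone = tombstone;
    }

    static CellSketch live(String name, String value, long timestamp) {
        return new CellSketch(name, value, timestamp, false);
    }

    static CellSketch deleted(String name, long timestamp) {
        return new CellSketch(name, null, timestamp, true);
    }
}

final class CollateSketch {
    // Assumed rule: higher timestamp wins; on a tie, the tombstone wins.
    static CellSketch reconcile(CellSketch a, CellSketch b) {
        if (a.timestamp != b.timestamp) {
            return a.timestamp > b.timestamp ? a : b;
        }
        if (a.tombstone != b.tombstone) {
            return a.tombstone ? a : b;
        }
        return a;
    }

    // Merges all sources by cell name, then drops cells shadowed by tombstones.
    static SortedMap<String, String> collate(List<List<CellSketch>> sources) {
        SortedMap<String, CellSketch> merged = new TreeMap<>();
        for (List<CellSketch> source : sources) {
            for (CellSketch cell : source) {
                merged.merge(cell.name, cell, CollateSketch::reconcile);
            }
        }
        SortedMap<String, String> live = new TreeMap<>();
        for (CellSketch cell : merged.values()) {
            if (!cell.tombstone) {
                live.put(cell.name, cell.value);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        List<CellSketch> memtable = Arrays.asList(CellSketch.live("a", "new", 10));
        List<CellSketch> sstable1 = Arrays.asList(
                CellSketch.live("a", "old", 5), CellSketch.live("b", "x", 3));
        List<CellSketch> sstable2 = Arrays.asList(CellSketch.deleted("b", 4));
        System.out.println(collate(Arrays.asList(memtable, sstable1, sstable2))); // {a=new}
    }
}
```

Here "a" resolves to the newer memtable value, and "b" disappears because the tombstone at timestamp 4 shadows the live cell at timestamp 3.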

Read-specific class: SSTableReader

Has 2 SegmentedFiles, ifile and dfile, for index and data respectively
Contains a Key Cache, caching positions of keys in the SSTR
Contains an IndexSummary with a sampling of the keys that are in the table
  Binary search is used to narrow down the location in the file via the IndexSummary
  See: getIndexScanPosition(RowPosition key)
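The binary search over a sampled index can be sketched like this. IndexSummarySketch is hypothetical. It orders keys as plain strings (Cassandra orders by partitioner token), and returning -1 for a key that sorts before the first sample is a choice made for this sketch only.

```java
import java.util.Arrays;

// Hypothetical sampled index: every Nth key of the index file plus its file offset.
final class IndexSummarySketch {
    private final String[] sampledKeys;   // sorted ascending
    private final long[] indexPositions;  // index-file offset of each sampled key

    IndexSummarySketch(String[] sampledKeys, long[] indexPositions) {
        if (sampledKeys.length != indexPositions.length) {
            throw new IllegalArgumentException("keys and positions must have equal length");
        }
        this.sampledKeys = sampledKeys.clone();
        this.indexPositions = indexPositions.clone();
    }

    // Returns the offset of the greatest sampled key <= key, which is where a
    // scan of the index file should start, or -1 if key precedes every sample.
    long scanPosition(String key) {
        int result = Arrays.binarySearch(sampledKeys, key);
        // A miss returns -(insertionPoint) - 1; the slot just before the
        // insertion point holds the greatest sampled key below the target.
        int slot = result >= 0 ? result : -result - 2;
        return slot < 0 ? -1 : indexPositions[slot];
    }

    public static void main(String[] args) {
        IndexSummarySketch summary = new IndexSummarySketch(
                new String[] {"b", "f", "m"}, new long[] {0L, 120L, 480L});
        System.out.println(summary.scanPosition("g")); // 120
    }
}
```

The summary only narrows the search: from the returned offset, the reader still scans index entries forward until it finds the exact key or passes it.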

Short-running operations are guarded by ColumnFamilyStore.readOrdering
  See OpOrder.java: a producer/consumer synchronization primitive to coordinate readers with flush operations
Resources are ref counted: see RefCounted.java, the Tidy interface, and the various private classes in SSTR that implement the Tidy interface to clean up resources
Provides methods to retrieve an SSTableScanner, which gives you access to OnDiskAtoms via iterators and holds RandomAccessReaders on the raw files on disk
  OnDiskAtom → Cell → *Cell
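The ref counting mentioned above can be illustrated with a small counter that runs a cleanup callback exactly once, when the last reference is released. RefCountedSketch and its nested Tidy interface are hypothetical simplifications and do not reproduce the API of Cassandra's RefCounted.java.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference counter with a cleanup hook.
final class RefCountedSketch {
    interface Tidy {
        void tidy();
    }

    // Starts at 1: the creator holds the initial reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final Tidy tidy;

    RefCountedSketch(Tidy tidy) {
        this.tidy = tidy;
    }

    // Takes a reference unless the resource has already been fully released.
    boolean tryRef() {
        while (true) {
            int n = refs.get();
            if (n <= 0) {
                return false;
            }
            if (refs.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    // Drops one reference; the thread that drops the last one runs the cleanup.
    void release() {
        int n = refs.decrementAndGet();
        if (n == 0) {
            tidy.tidy();
        } else if (n < 0) {
            throw new IllegalStateException("released more times than referenced");
        }
    }

    public static void main(String[] args) {
        RefCountedSketch resource = new RefCountedSketch(() -> System.out.println("closing files"));
        resource.tryRef();   // a reader takes a reference
        resource.release();  // the reader is done; nothing is cleaned up yet
        resource.release();  // the owner releases; cleanup runs now
    }
}
```

A reader that fails tryRef knows the sstable's resources are already gone and must pick another view, rather than touching closed files.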