cassandra read-write path by josh mckenzie

Upload: rajul-srivastava

Post on 16-Dec-2015

DESCRIPTION

Cassandra Read-Write Path Presentation

TRANSCRIPT

  • CASSANDRA READ/WRITE PATH Josh McKenzie

  • CORE COMPONENTS

  • Core Components

    Memtable: data in memory (R/W)
    CommitLog: data on disk (W/O)
    SSTable: data on disk (immutable, R/O)
    CacheService (Row Cache and Key Cache): in-memory caches
    ColumnFamilyStore: logical grouping of table data
    DataTracker and View: provide atomicity and grouping of memtable/sstable data
    ColumnFamily: collection of Cells (sorted map of Columns)
    Cell: Name, Value, TS
    Tombstone: deletion marker indicating TS and deleted cell(s)

  • MemTable

    In-memory data structure consisting of:
      Memory pools (on-heap, off-heap)
      Allocators for each pool
      Size and limit tracking and CommitLog sentinels
      Map of Key → AtomicBTreeColumns
      Atomic copy-on-write semantics for row data
    Flush-to-disk logic is triggered when a pool passes a ratio of usage relative to a user-configurable threshold
    The Memtable w/largest ratio of used space (either on or off heap) is flushed to disk

  • On heap vs. Off heap Memtables: an overview

    http://www.datastax.com/dev/blog/off-heap-memtables-in-cassandra-2-1
    https://issues.apache.org/jira/browse/CASSANDRA-6689
    https://issues.apache.org/jira/browse/CASSANDRA-6694
    memtable_allocation_type:
      offheap_buffers moves the cell name and value to DirectBuffer objects. The values are still live Java buffers. This mode only reduces heap significantly when you are storing large strings or blobs
      offheap_objects moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it
    Default in 2.1 is heap_buffers

  • On heap vs. Off heap: continued

    Why?
      Reduces sizes of objects in memory: no more ByteBuffer overhead
      More data fitting in memory == better performance
    Code changes that support it:
      MemtablePools allow on- vs. off-heap allocation (and Slab, for that matter)
      MemtableAllocators allow differentiating between on-heap and off-heap allocation
      DecoratedKey and *Cells changed to interfaces to have different allocation implementations based on native vs. heap

  • CommitLog

    Append-only file structure that provides interim durability for writes while they're living in Memtables and haven't been flushed to sstables
    Has sync logic to determine the level of durability to disk you want, either PeriodicCommitLogService or BatchCommitLogService:
      Periodic (default): checks to see if it hit the window limit; if so, blocks and waits for the sync to catch up
      Batch: no ack until fsync to disk. Waits for a specific window before hitting fsync, to coalesce writes
    Singleton facade for commit log operations
    Consists of multiple components:
      CommitLog.java: interface to subsystem
      CommitLogSegment.java: files on disk
      CommitLogSegmentManager.java: segment allocation and management
      CommitLogArchiver.java: user-defined commands pre/post flush
      CommitLogMetrics.java

  • SSTable

    Ordered map of KVP
    Immutable
    Consists of 3 files:
      Bloom Filter: optimization to determine if the Partition Key you're looking for is (probably) in this sstable
      Index file: contains offsets into the data file, generally memory mapped
      Data file: contains data, generally compressed
    Read by SSTableReader
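
    The "(probably) in this sstable" behaviour of a Bloom filter can be illustrated with a toy version. This is a minimal sketch and not Cassandra's implementation: the class name ToyBloomFilter, the String keys, and the double-hashing scheme are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Toy Bloom filter: false positives are possible, false negatives are not. */
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Derives the i-th bit index from two base hashes (double hashing). */
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String partitionKey) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(partitionKey, i));
        }
    }

    /** False means "definitely absent"; true means "probably present". */
    boolean mightContain(String partitionKey) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(partitionKey, i))) {
                return false;
            }
        }
        return true;
    }
}
```

    A reader would consult mightContain() first and skip the index and data files entirely when it returns false.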

  • CacheService.java

    In-memory caching service to optimize lookups of hot data
    Contains three caches:
      keyCache
      rowCache
      counterCache
    See:
      AutoSavingCache.java
      InstrumentingCache.java
    Tunable per table; limits in cassandra.yaml (keys to cache and size in MB, rows and size in MB)
    Defaults to keys only; the row cache can be enabled via CQL

  • ColumnFamilyStore.java

    Contains logic for a table:
      Holds DataTracker
      Creating and removing sstables on disk
      Writing / reading data
      Cache initialization
      Secondary index(es)
      Flushing memtables to sstables
      Snapshots
      And much more (just short of 3k LoC)

  • CFS: DataTracker and View

    DataTracker allows for atomic operations on a view of a Table (ColumnFamilyStore)
      Contains various logic surrounding Memtables and flushing, SSTables and compaction, and notification for subscribers on changes to SSTableReaders
      1 DataTracker per CFS, 1 AtomicReference per DataTracker
    View consists of the current Memtable, Memtables pending flush, SSTables for the CFS, and SSTables being actively compacted
    The currently active Memtable is atomically switched out in DataTracker.switchMemtable(boolean truncating)

  • ColumnFamily.java

    A sorted map of columns
    Abstract class, extended by:
      ArrayBackedSortedColumns
        Array backed
        Non-thread-safe
        Good for iteration, adding cells (especially if in sorted order)
      AtomicBTreeColumns (memtable only)
        Btree backed
        Thread-safe w/atomic CAS
        Logarithmic complexity on operations
    Logic to add / retrieve columns, counters, tombstones, atoms

  • THE READ PATH

  • Read Path: Very High Level

  • Overview the Read Path

    [Flow diagram: Coordinator, MessagingService, Keyspace, ColumnFamilyStore, Check Row Cache (hit / miss), CollationController, Memtable, SSTables, Key Cache (hit: seek to cached position; miss: binary scan index, update cache), read merge, ColumnFamily, Update Row Cache, Return results]

  • Read-specific primitive: QueryFilter

    Contains an IDiskAtomFilter
    IDiskAtomFilter: used to get columns from Memtable, SSTable, or SuperColumn
      IdentityQueryFilter, NamesQueryFilter, SliceQueryFilter
    Contains a variety of iterators to collate on-disk contents, gather tombstones, reduce (merge) Cells with the same name, etc.
    See:
      collateColumns()
      gatherTombstones()
      getReducer(final Comparator comparator)

  • Read-specific class: SSTableReader

    Has 2 SegmentedFiles, ifile and dfile, for index and data respectively
    Contains a Key Cache, caching positions of keys in the SSTR
    Contains an IndexSummary w/sampling of the keys that are in the table
      Binary search used to narrow down location in file via IndexSummary
      getIndexScanPosition(RowPosition key)
    Short-running operations guarded by ColumnFamilyStore.readOrdering
      See OpOrder.java: producer/consumer synchronization primitive to coordinate readers w/flush operations
    Resources are ref counted: see RefCounted.java, the Tidy interface, and the various private classes in SSTR that implement the Tidy interface to clean up resources
    Provides methods to retrieve an SSTableScanner, which gives you access to OnDiskAtoms via iterators and holds RandomAccessReaders on the raw files on disk
      OnDiskAtom → Cell → *Cell
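
    The IndexSummary lookup amounts to a binary search over a sorted sample of keys, returning where in the index file a scan should begin. The following is a minimal sketch under stated assumptions: IndexSummarySketch, scanPosition, and the use of String keys with explicit offsets are hypothetical simplifications (the real code compares RowPosition values).

```java
import java.util.Arrays;

/** Sketch of a sampled index: every Nth key of the index file plus its offset. */
final class IndexSummarySketch {
    private final String[] sampledKeys;  // sorted sample of partition keys
    private final long[] indexOffsets;   // index-file offset of each sampled key

    IndexSummarySketch(String[] sampledKeys, long[] indexOffsets) {
        if (sampledKeys.length == 0 || sampledKeys.length != indexOffsets.length) {
            throw new IllegalArgumentException("need one offset per sampled key");
        }
        this.sampledKeys = sampledKeys.clone();
        this.indexOffsets = indexOffsets.clone();
    }

    /**
     * Returns the index-file offset to start scanning from: the offset of the
     * greatest sampled key <= key, or the first offset if key sorts before
     * every sample.
     */
    long scanPosition(String key) {
        int pos = Arrays.binarySearch(sampledKeys, key);
        if (pos < 0) {
            int insertionPoint = -pos - 1;
            pos = Math.max(0, insertionPoint - 1);
        }
        return indexOffsets[pos];
    }
}
```

    From the returned offset, a reader scans the index file forward until it finds the exact key or passes where it would be.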

  • ReadVerbHandler and ReadCommands

    Messages are received by the MessagingService and passed to the ReadVerbHandler for appropriate verbs
    ReadCommands:
      SliceFromReadCommand
        Relies on SliceQueryFilter, uses a range of columns defined by a ColumnSlice
      SliceByNamesReadCommand
        Relies on NamesQueryFilter, uses a column name to retrieve a single column
    Both diverge in calls and converge back into implementers of ColumnFamily:
      ArrayBackedSortedColumns, AtomicBTreeColumns

    // Declared in Keyspace.java
    public Row getRow(QueryFilter filter)
    {
        // Look up the store for the table named in the filter
        ColumnFamilyStore cfStore = getColumnFamilyStore(filter.getColumnFamilyName());
        // Delegate the read to the store
        ColumnFamily columnFamily = cfStore.getColumnFamily(filter);
        return new Row(filter.key, columnFamily);
    }

  • RowCache

    CFS.getThroughCache(UUID cfId, QueryFilter filter)
    After retrieving our CFS, the first thing we check is our Row Cache, to see if the row is already merged, in memory, and ready to go
    If we get a cache hit on the key, we'll:
      Confirm it's not just a sentinel of someone else in flight. If it is, we query w/out caching
      If the data for the key is valid, we filter it down to the query we have in flight and return those results, as it'll have >= the count of Cells we're looking for
    On cache miss:
      Eventually cache all top-level columns for the key queried, if configured to do so (after Collation)
      Cache results of the user query if it satisfies the cache config params
      Extend the results of the query to satisfy the caching requirements of the system
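
    The sentinel handling can be sketched as a read-through cache in which the first reader of a key parks a marker object while it loads the row. This is an illustration only, not the body of CFS.getThroughCache: RowCacheSketch, the String row type, and the readFromStorage function are hypothetical, and filtering the cached row down to the query is left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class RowCacheSketch {
    /** Marker parked in the cache while one reader is loading the row. */
    private static final class Sentinel {}

    private final ConcurrentMap<String, Object> cache = new ConcurrentHashMap<>();
    private final Function<String, String> readFromStorage;

    RowCacheSketch(Function<String, String> readFromStorage) {
        this.readFromStorage = readFromStorage;
    }

    String getThroughCache(String key) {
        Object cached = cache.get(key);
        if (cached instanceof Sentinel) {
            // Someone else is loading this row: query without caching.
            return readFromStorage.apply(key);
        }
        if (cached != null) {
            return (String) cached;  // cache hit on real data
        }
        // Cache miss: try to claim the slot with our own sentinel.
        Sentinel sentinel = new Sentinel();
        boolean ownsSlot = cache.putIfAbsent(key, sentinel) == null;
        String row = null;
        try {
            row = readFromStorage.apply(key);
            return row;
        } finally {
            if (ownsSlot) {
                if (row != null) {
                    cache.replace(key, sentinel, row);  // only if still ours
                } else {
                    cache.remove(key, sentinel);        // nothing to cache
                }
            }
        }
    }

    boolean isCached(String key) {
        Object value = cache.get(key);
        return value != null && !(value instanceof Sentinel);
    }
}
```

    The conditional replace and remove calls only act when the sentinel is still the one this reader installed, so a concurrent invalidation is not overwritten.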

  • CollationController.collect*Data ()

    The data we're looking for may be in a Memtable, an SSTable, multiple of either, or a combination of all of them.
    The logic to query this data and merge our results exists in CollationController.java:
      collectAllData
      collectTimeOrderedData
    High level flow:
      1. Get data from memtables for the QueryFilter we're processing
      2. Get data from sstables for the QueryFilter we're processing
      3. Merge all the data together, keeping the most recent
      4. If we iterated across enough sstables, hoist up the now-defragmented data into a memtable, bypassing CommitLog and Index update (collectTimeOrderedData only)
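
    Step 3, merging while keeping the most recent data, comes down to last-write-wins by timestamp per cell name. A minimal sketch, assuming a simplified Cell with String name and value; CollationSketch and merge are hypothetical names, and tombstones are ignored here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CollationSketch {
    /** Simplified cell: name, value, and write timestamp. */
    static final class Cell {
        final String name;
        final String value;
        final long timestamp;

        Cell(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Merges cells from any number of sources (memtables, sstables) into a
     * map sorted by cell name, keeping the highest-timestamp cell per name.
     */
    static Map<String, Cell> merge(List<List<Cell>> sources) {
        Map<String, Cell> merged = new TreeMap<>();
        for (List<Cell> source : sources) {
            for (Cell cell : source) {
                Cell existing = merged.get(cell.name);
                if (existing == null || cell.timestamp > existing.timestamp) {
                    merged.put(cell.name, cell);
                }
            }
        }
        return merged;
    }
}
```

    Because the winner is chosen by timestamp alone, the order in which sources are visited does not change the result for cells with distinct timestamps.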

  • CollationController merging: memtables

    Fairly straightforward operations on memtables in the view:
      Check all memtables to see if they have a ColumnFamily that matches our filter.key
      Add all columns that match to our result ColumnFamily
      Keep a running tally of the mostRecentRowTombstone for use in the next step

  • CollationController merging: sstables

    We have a few optimizations available for merging in data from sstables:
      Sort the collection of SSTables by the max timestamp present
      Iterate across the SSTables
        Skipping any that are older than the most recent tombstone we've seen
        Create a reduced name filter by removing columns from our filter where we have fresher data than the SSTR's max timestamp
        Get an iterator from the SSTR for Atoms matching that reduced name filter
        Add any matching OnDiskAtoms to our result set (the BloomFilter excludes via the iterator with the SSTR.getPosition() call)
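
    The first two optimizations (newest-first ordering and skipping sstables older than the latest row tombstone) can be sketched as follows. The names SSTableSelectionSketch, SSTableInfo, and candidates are hypothetical, and the tombstone timestamp is treated as fixed up front, whereas the real loop can discover newer tombstones as it iterates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SSTableSelectionSketch {
    /** Only the metadata needed for the ordering decision. */
    static final class SSTableInfo {
        final String name;
        final long maxTimestamp;

        SSTableInfo(String name, long maxTimestamp) {
            this.name = name;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /**
     * Returns sstables newest first, dropping those whose newest data is
     * older than the most recent row tombstone seen so far.
     */
    static List<SSTableInfo> candidates(List<SSTableInfo> sstables,
                                        long mostRecentRowTombstone) {
        List<SSTableInfo> sorted = new ArrayList<>(sstables);
        sorted.sort(Comparator
                .comparingLong((SSTableInfo s) -> s.maxTimestamp)
                .reversed());
        List<SSTableInfo> result = new ArrayList<>();
        for (SSTableInfo sstable : sorted) {
            if (sstable.maxTimestamp < mostRecentRowTombstone) {
                break;  // every remaining sstable is older still
            }
            result.add(sstable);
        }
        return result;
    }
}
```

    Sorting newest first is what allows the loop to stop at the first sstable that predates the tombstone instead of checking each one.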

  • THE WRITE PATH

  • Write Path: Very High Level

  • Overview the Write Path

    [Flow diagram: MessagingService, Keyspace, CommitLog enabled for this mutation? (Yes: write CommitLog; No: skip), Write to Memtable, SecondaryIndexManager.Updater, Invalidate Row Cache]

  • MutationVerbHandler, Mutation.apply

    Mutation contains:
      Keyspace name
      DecoratedKey
      Map of cfId to ColumnFamily of modifications to perform
    MutationVerbHandler → Mutation.apply() → Keyspace.apply() → ColumnFamilyStore.apply()
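
    The order of operations on the write path (optional commit log append, memtable write, row cache invalidation) can be shown with in-memory stand-ins. This is a sketch under loud assumptions: WritePathSketch and its three collections are hypothetical placeholders for the real subsystems, and secondary index updates are omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WritePathSketch {
    // Stand-ins for the real subsystems, for illustration only.
    final List<String> commitLog = new ArrayList<>();
    final Map<String, String> memtable = new HashMap<>();
    final Map<String, String> rowCache = new HashMap<>();

    /** Applies one write in the order the write path describes. */
    void apply(String key, String value, boolean writeCommitLog) {
        if (writeCommitLog) {
            commitLog.add(key + "=" + value);  // durability first
        }
        memtable.put(key, value);              // then the in-memory write
        rowCache.remove(key);                  // cached merged row is now stale
    }
}
```

    Invalidating the row cache last means a subsequent read of the key misses the cache and re-merges from the memtable and sstables.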

  • The CommitLog ecosystem

    CommitLogSegment: file on disk
    CommitLogSegmentManager: allocation and recycling of CommitLogSegments
    CommitLogArchiver: allows user-defined archive and restore commands to be run
      Reference conf/commitlog_archiving.properties
    An AbstractCommitLogService, one of either:
      BatchCommitLogService: writer waits on sync to complete before returning
      PeriodicCommitLogService: check if sync is behind; if so, register w/signal and block until lastSyncedAt catches up

  • CommitLogSegmentManager (CLSM): overview

    Contains 2 collections of CommitLogSegments:
      availableSegments: segments ready to be used
      activeSegments: segments that are active and contain unflushed data
    Only 1 active CommitLogSegment is in use at any given time
    Manager thread is responsible for maintaining active vs. available CommitLogSegments and can be woken up by other contexts when maintenance is needed

  • CLSM: allocation on the write path

    During CommitLog.add(), a writer asks for allocated space for their mutation from the CommitLogSegmentManager
    This is passed to the active CommitLogSegment's allocate() method
    CommitLogSegmentManager.allocate(int size) spins non-blocking until the space in the segment is allocated, at which time it marks it dirty
    If the CLS.allocate() call returns null, indicating we need a new segment:
      CommitLogSegmentManager.advanceAllocatingFrom(CommitLogSegment old)
      Goal is to move a CLS from available to active segments so we have more CLS to work with
      If it fails to get an available segment, the manager thread is woken back up to do some maintenance, be it recycling or allocating a new CLS
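
    The allocate-or-advance loop can be sketched with a compare-and-set on a position counter. This is an illustration, not the real classes: SegmentAllocationSketch and its members are hypothetical, a full segment returns -1 where the slide describes null, and new segments are created inline instead of being handed over by a manager thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SegmentAllocationSketch {
    /** One fixed-size segment; writers reserve byte ranges with CAS. */
    static final class Segment {
        private final int capacity;
        private final AtomicInteger allocatePosition = new AtomicInteger(0);

        Segment(int capacity) {
            this.capacity = capacity;
        }

        /** Returns the start offset of the reserved range, or -1 if full. */
        int allocate(int size) {
            while (true) {
                int prev = allocatePosition.get();
                int next = prev + size;
                if (next > capacity) {
                    return -1;
                }
                if (allocatePosition.compareAndSet(prev, next)) {
                    return prev;
                }
                // Lost the race to another writer: spin and retry.
            }
        }
    }

    private final int segmentCapacity;
    private volatile Segment active;
    private int segmentsCreated = 1;

    SegmentAllocationSketch(int segmentCapacity) {
        this.segmentCapacity = segmentCapacity;
        this.active = new Segment(segmentCapacity);
    }

    /** Reserves space in the active segment, advancing when it is full. */
    int allocate(int size) {
        if (size > segmentCapacity) {
            throw new IllegalArgumentException("mutation larger than a segment");
        }
        while (true) {
            Segment segment = active;
            int offset = segment.allocate(size);
            if (offset >= 0) {
                return offset;
            }
            advanceAllocatingFrom(segment);
        }
    }

    /** Switches to a fresh segment unless another writer already did. */
    private synchronized void advanceAllocatingFrom(Segment old) {
        if (active == old) {
            active = new Segment(segmentCapacity);
            segmentsCreated++;
        }
    }

    synchronized int segmentsCreated() {
        return segmentsCreated;
    }
}
```

    Passing the old segment to advanceAllocatingFrom lets several writers that all hit a full segment agree on a single replacement.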

  • CLSM: manager thread, new segments, recycling

    Constructor creates a runnable that blocks on segmentManagementTasks
      Task can either be null, indicating we're out of space (allocate path), or a segment that's flushed and ready for recycle
    If there are no available segments, we create new CommitLogSegments and add them to availableSegments
      The hasAvailableSegments WaitQueue is signaled by this to awake any blocking writes waiting for allocation
    When our CommitLog usage is approaching our allowable limit:
      If our total used size is > the size allowed:
        CommitLogSegmentManager.flushDataFrom on a list of activeSegments
        Force flush on any CFS that's dirty
          Which switches Memtables and flushes to SSTable (more on this later)

  • Memtable writes

    We attempt to get the partition for the given key if it exists
      If not, we allocate space for a new key and put an empty entry in the memtable for it, backing that out if we race and someone else got there first on allocation
    Once we have space allocated, we call addAllWithSizeDelta
      Add the record to a new BTree and CAS it into the existing Holder
      Updates secondary indexes
      Finalize some heap tracking in the ColumnUpdater used by the BTree to perform updates
    Further reading:
      AtomicBTreeColumns.java (specifically addAllWithSizeDelta)
      BTree.java
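
    The "build a new tree, then CAS it into the Holder" pattern can be sketched with an AtomicReference. Note the swapped-in technique: this sketch copies a whole TreeMap per update, where the real code uses a persistent BTree that shares structure between versions. CopyOnWriteRowSketch and addAll are hypothetical names, and the returned delta counts cells, not bytes.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

final class CopyOnWriteRowSketch {
    // The holder always points at an immutable snapshot of the row.
    private final AtomicReference<SortedMap<String, String>> holder =
            new AtomicReference<>(
                    Collections.unmodifiableSortedMap(new TreeMap<String, String>()));

    /** Adds all cells atomically; returns the change in the number of cells. */
    int addAll(Map<String, String> cells) {
        while (true) {
            SortedMap<String, String> current = holder.get();
            TreeMap<String, String> updated = new TreeMap<>(current);
            updated.putAll(cells);
            SortedMap<String, String> next =
                    Collections.unmodifiableSortedMap(updated);
            if (holder.compareAndSet(current, next)) {
                return updated.size() - current.size();
            }
            // Another writer swapped the holder first: rebuild and retry.
        }
    }

    /** Readers see a consistent snapshot without taking any lock. */
    SortedMap<String, String> snapshot() {
        return holder.get();
    }
}
```

    Readers never observe a half-applied update, because each snapshot is immutable and only replaced as a whole.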

  • MemtablePool

    Single MEMORY_POOL instance across entire DB
    Get an allocator to the memory pool during construction of a memtable
    Interface covering management of an on-heap and off-heap pool via SubPool
      HeapPool: on-heap ByteBuffer allocations and release, subject to GC w/object overhead
      NativePool: blend of on and off heap based on limits passed in
        Off-heap allocations and release through NativeAllocator, calls to Unsafe
      SlabPool: blend of on and off heap based on limits passed in (used for memtables, reduced fragmentation)
        Allocated in large chunks by SlabAllocator (1024*1024)
    MemtablePool.SubPool / SubAllocator:
      Contains various atomically updated longs tracking:
        Limits on allocation
        Currently allocated amounts
        Currently reclaiming amounts
        Threshold for when to run Cleaner thread
      Spin and CAS for updates on the above on allocator calls in addAllWithSizeDelta

  • Secondary Indexes: an overview

    Essentially a separate table stored on disk / in memtable
    Contains a ConcurrentNavigableMap of ByteBuffer → SecondaryIndex
    There are quite a few SecondaryIndex implementations in the code base, e.g.:
      PerRowSecondaryIndex
      PerColumnSecondaryIndex
      KeysIndex
    On Write Path:
      SecondaryIndex updater passed down through to the ColumnUpdater ctor
      On ColumnUpdater.apply(), insert for secondary index is called
      Essentially amounts to a 2nd write on another table

  • FLUSHING MEMTABLES

  • Flushing Memtables

    [Flow diagram: in the ColumnFamilyStore, Memtable → SSTableWriter → SSTable → SSTableReader, followed by CommitLog.discardCompletedSegments(cfId, lastReplayPosition). CLSM.activeSegments holds CLS 1, CLS 2, and the active CLS. Outcomes shown per segment: actively allocating: skip; other cfs still dirty: remove flushed cfId; removed last dirty cf: recycle CLS; stop at position of flush]

  • MemtableCleanerThread: starting a flush

    When a MemtableAllocator adjusts the size of the data it has acquired, the MemtablePool checks whether or not we need to flush to free up space in memory
      If our used memory is > the total reclaiming memory + the limit * the ratio defined in conf.memtable_cleanup_threshold, a memtable needs to be cleaned
    Cleaner thread is currently ColumnFamilyStore.FlushLargestColumnFamily()
      We find the memtable with the largest Ownership ratio as determined by the currently owned memory vs. limit, taking the max of either on or off heap
    Signals to CommitLog to discard completed segments on PostFlush stage of flush
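
    Both decisions on this slide are small calculations: whether cleaning is needed, and which memtable to flush. A minimal sketch, assuming plain long counters; MemtableCleanerSketch, MemtableUsage, needsCleaning, and largest are hypothetical names, and a limit of 0 is treated as "pool unused".

```java
import java.util.List;

final class MemtableCleanerSketch {
    /** Memory owned by one memtable in each pool. */
    static final class MemtableUsage {
        final String name;
        final long onHeapOwned;
        final long offHeapOwned;

        MemtableUsage(String name, long onHeapOwned, long offHeapOwned) {
            this.name = name;
            this.onHeapOwned = onHeapOwned;
            this.offHeapOwned = offHeapOwned;
        }
    }

    /** used > reclaiming + limit * cleanupThreshold */
    static boolean needsCleaning(long used, long reclaiming, long limit,
                                 double cleanupThreshold) {
        return used > reclaiming + (long) (limit * cleanupThreshold);
    }

    private static double ratio(long owned, long limit) {
        return limit == 0 ? 0.0 : (double) owned / limit;
    }

    /** Picks the memtable with the largest max(on-heap, off-heap) ratio. */
    static MemtableUsage largest(List<MemtableUsage> memtables,
                                 long onHeapLimit, long offHeapLimit) {
        MemtableUsage best = null;
        double bestRatio = -1.0;
        for (MemtableUsage memtable : memtables) {
            double r = Math.max(ratio(memtable.onHeapOwned, onHeapLimit),
                                ratio(memtable.offHeapOwned, offHeapLimit));
            if (r > bestRatio) {
                bestRatio = r;
                best = memtable;
            }
        }
        return best;
    }
}
```

    Memory that is already being reclaimed raises the bar for triggering another flush, which avoids flushing again while an earlier flush is still freeing space.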

  • Memtable Flushing

    Reference ColumnFamilyStore$Flush
    1st, switch out memtables in CFS.DataTracker.View so new ops go to the new memtable
    Sets lifecycle in memtable to discarding
    Runs the FlushRunnable in the Memtable
      Memtable.writeSortedContents
        Uses SSTableWriter to write sorted contents to disk
        Returns the SSTableReader created by SSTableWriter.closeAndOpenReader
    Memtable.setDiscarded() → MemtableAllocator.setDiscarded()
      Lifecycle to Discarded
      Free up all memory from the allocator for this memtable

  • Memtable Flushing: the commit log

    ColumnFamilyStore$PostFlush
    All relative to a timestamp of the most recent data in the flushed memtable
    Record a sentinel for when this cf was cleaned (to be used later if it was active and we couldn't purge at time of flush)
    Walk through CommitLogSegments and remove the dirty cfId
      Unless it's actively being allocated from
      If the CLS is no longer in use:
        Remove it from our activeSegments
        Queue a task for the Management thread to wake up and recycle the segment
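
    The walk over the segments can be sketched as below. This is a simplified illustration: SegmentDiscardSketch, Segment, and discardCompleted are hypothetical names, the last list entry is assumed to be the segment being allocated from, and replay positions (which bound how far a real flush cleans) are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class SegmentDiscardSketch {
    /** A commit log segment and the tables with unflushed data in it. */
    static final class Segment {
        final String name;
        final Set<String> dirtyTables = new HashSet<>();

        Segment(String name, String... dirty) {
            this.name = name;
            this.dirtyTables.addAll(Arrays.asList(dirty));
        }
    }

    /**
     * Marks cfId clean in every segment except the one being allocated from
     * (assumed to be last in the list). Segments left with no dirty tables
     * are removed from activeSegments and returned for recycling.
     */
    static List<Segment> discardCompleted(List<Segment> activeSegments, String cfId) {
        List<Segment> recyclable = new ArrayList<>();
        if (activeSegments.isEmpty()) {
            return recyclable;
        }
        Segment allocating = activeSegments.get(activeSegments.size() - 1);
        for (Iterator<Segment> it = activeSegments.iterator(); it.hasNext();) {
            Segment segment = it.next();
            if (segment == allocating) {
                continue;  // still receiving writes: skip
            }
            segment.dirtyTables.remove(cfId);
            if (segment.dirtyTables.isEmpty()) {
                it.remove();
                recyclable.add(segment);
            }
        }
        return recyclable;
    }
}
```

    A segment that still holds unflushed data for another table stays active until that table is flushed too.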

  • Switching out memtables

    CFS.switchMemtableIfCurrent / CFS.switchMemtable
    There are some complex non-blocking write-barrier operations on Keyspace.writeOrder to allow us to wait for writes to finish in this context before swapping out with new memtables, regardless of dirty status
      Reference: OpOrder.java, OpOrder.Barrier
    Write sorted contents to disk (Memtable.FlushRunnable.runWith(File sstableDirectory))
    cfs.replaceFlushed, swapping the memtable with the new SSTableReader returned from writeSortedContents

  • 3.0

  • A couple of interesting things

    Sylvain has a herculean effort underway to refactor and modernize the storage engine, removing quite a bit of complexity from the querying process
      Various components renamed (names are in flux atm)
      https://issues.apache.org/jira/browse/CASSANDRA-8099
    CommitLogSegment recycling is pretty complex without much payoff, so there are plans to remove it:
      https://issues.apache.org/jira/browse/CASSANDRA-8771