Cassandra Day NY 2014: Apache Cassandra & Python for The New York Times ⨍aбrik Platform
DESCRIPTION
In this session, you’ll learn how Apache Cassandra is used with Python in the NY Times ⨍aбrik messaging platform. Michael will start his talk with an overview of the NYT⨍aбrik global message bus platform and its “memory” features, then discuss their use of the open source Apache Cassandra Python driver by DataStax. Progressive benchmarks testing features and performance will be presented, from naive and synchronous to asynchronous with multiple IO loops, all tailored to usage at the NY Times. Code snippets, followed by beer, for those who survive. All code available on GitHub!
TRANSCRIPT
Cassandra Python driver
Benchmarking concurrency for NYT ⨍aбrik
[email protected]
A Global Mesh with a Memory
Message-based: WebSocket, AMQP, SockJS
If in doubt:
• Resend
• Reconnect
• Reread
Idempotent:
• Replicating
• Racy
• Resolving
Classes of service:
• Gold: replicate/race
• Silver: prioritize
• Bronze: queueable
Millions of users
Message: an event with data
CREATE TABLE source_data (
    hash_key int,        -- real ones are more complex
    message_id timeuuid,
    body blob,           -- whatever
    metadata text,       -- JSON
    PRIMARY KEY (hash_key, message_id)
);
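The write that the later submit_query snippet binds against would pair with this schema. A minimal sketch; the column order here is inferred from the substitution_args tuple shown later, so treat it as an assumption:

```python
# Hypothetical INSERT matching the source_data table above; the bind
# order (metadata JSON, body blob, hash_key, message_id) mirrors the
# substitution_args tuple used by submit_query.
INSERT_QUERY = (
    "INSERT INTO source_data (metadata, body, hash_key, message_id) "
    "VALUES (%s, %s, %s, %s)"
)

# Four placeholders, one per bound column.
print(INSERT_QUERY.count("%s"))  # → 4
```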
[Diagram: push path carries 1–10 kb messages with acks back; pull path issues 1 kb requests for 10–150 kb responses]
Synchronous: C* Thrift or CQL Native
Concurrent degree = 3
(using the libev event loop)
Asynchronous: CQL Native only
More concurrency
Can also try:
• DC Aware
• Token Aware
• Subprocessing
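The synchronous-to-asynchronous progression above can be sketched without a cluster at all: the same batch of "queries" issued one at a time versus with a bounded concurrent degree, with sleeps standing in for Cassandra round-trips.

```python
# Driver-free sketch: serial submission vs. a concurrent degree of 3,
# simulated with sleeps instead of real Cassandra queries.
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.02   # pretend each query takes 20 ms
QUERIES = 9
CONCURRENCY = 3  # the "concurrent degree = 3" case

def fake_query(_):
    time.sleep(LATENCY)

start = time.time()
for i in range(QUERIES):
    fake_query(i)            # synchronous: next query waits for this one
sync_elapsed = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(fake_query, range(QUERIES)))  # 3 in flight at a time
concurrent_elapsed = time.time() - start

print(sync_elapsed > concurrent_elapsed)  # → True
```

The event-loop version in the benchmarks gets the same win without threads: the driver pipelines requests over the CQL native protocol.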
Build one
def build_message(self):
    message = {
        "message_id": str(uuid.uuid1()),
        "hash_key": randint(0, self._hash_key_range),  # int(e ** 8)
        "app_id": self._app_id,
        "timestamp": datetime.utcnow().isoformat() + 'Z',
        "content_type": "application/binary",
        "body": os.urandom(randint(1, self._body_range))  # int(e ** 9)
    }
    return message
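Pulled out of its class, the generator above can be exercised standalone. The e ** 8 and e ** 9 range defaults are taken from the slide's comments (note e is Euler's number, so these are a few thousand, not billions); the app_id value here is a stand-in:

```python
import os
import uuid
from datetime import datetime
from math import e
from random import randint

def build_message(hash_key_range=int(e ** 8),  # ~2980, per the slide
                  body_range=int(e ** 9),      # ~8103 bytes max body
                  app_id="bm_push"):           # stand-in app_id
    return {
        "message_id": str(uuid.uuid1()),
        "hash_key": randint(0, hash_key_range),
        "app_id": app_id,
        "timestamp": datetime.utcnow().isoformat() + 'Z',
        "content_type": "application/binary",
        "body": os.urandom(randint(1, body_range)),
    }

msg = build_message()
print(sorted(msg))
```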
Kick-off
def push_message(self):
    if self._submitted_count.next() < self._message_count:
        message = self.build_message()
        self.submit_query(message)
def push_initial_data(self):
    self._start_time = time()
    try:
        with self._lock:
            for i in range(0, min(CONCURRENCY, self._message_count)):
                self.push_message()
Put it in the pipeline
def submit_query(self, message):
    body = message.pop('body')
    substitution_args = (
        json.dumps(message, **JSON_DUMPS_ARGS),
        body,
        message['hash_key'],
        uuid.UUID(message['message_id'])
    )
    future = self._cql_session.execute_async(
        self._query, substitution_args
    )
    future.add_callback(self.push_or_finish)
    future.add_errback(self.note_error)
Maintain concurrency or finish
def push_or_finish(self, _):
    try:
        if (
            self._unfinished and
            self._confirmed_count.next() < self._message_count
        ):
            with self._lock:
                self.push_message()
        else:
            self.finish()
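The whole loop (seed a window of requests, then have each completion callback push the next) can be mocked without the driver. A sketch under assumptions: a thread pool stands in for execute_async, and the termination check is simplified relative to the talk's code:

```python
# Driver-free mock of the concurrency window: submitted/confirmed
# counters plus a lock keep CONCURRENCY fake "queries" in flight until
# MESSAGE_COUNT completions have been observed.
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 3
MESSAGE_COUNT = 10

pool = ThreadPoolExecutor(max_workers=CONCURRENCY)  # stand-in for the driver
submitted = itertools.count()
confirmed = itertools.count()
lock = threading.RLock()  # re-entrant: a callback may fire on this thread
done = threading.Event()

def push_message():
    if next(submitted) < MESSAGE_COUNT:
        future = pool.submit(lambda: None)        # the fake "query"
        future.add_done_callback(push_or_finish)  # ~ add_callback

def push_or_finish(_):
    if next(confirmed) < MESSAGE_COUNT - 1:  # more completions to come
        with lock:
            push_message()
    else:
        done.set()                           # ~ self.finish()

# Kick-off: seed the window, as in push_initial_data.
with lock:
    for _ in range(min(CONCURRENCY, MESSAGE_COUNT)):
        push_message()

done.wait(timeout=5)
print(done.is_set())  # → True
```

The counters bound total submissions; each completion either refills the window or, on the last confirmation, finishes the run.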
Push some messages
usage: bm_push.py [-h] [-c [CQL_HOST [CQL_HOST ...]]] [-d LOCAL_DC]
                  [--remote-dc-hosts REMOTE_DC_HOSTS] [-p PREFETCH_COUNT]
                  [-w WORKER_COUNT] [-a] [-t]
                  [-n {ONE,TWO,THREE,QUORUM,ALL,LOCAL_QUORUM,EACH_QUORUM,SERIAL,LOCAL_SERIAL,LOCAL_ONE}]
                  [-r] [-j] [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
Push messages from a RabbitMQ queue into a Cassandra table.
Push messages many times
usage: run_push.py [-h] [-c [CQL_HOST [CQL_HOST ...]]] [-i ITERATIONS]
                   [-d LOCAL_DC] [-w [worker_count [worker_count ...]]]
                   [-p [prefetch_count [prefetch_count ...]]]
                   [-n [level [level ...]]] [-a] [-t]
                   [-m MESSAGE_EXPONENT] [-b BODY_EXPONENT]
                   [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
Run multiple test cases based upon the product of worker_counts, prefetch_counts, and consistency_levels. Each test case may be run with up to 4 variations reflecting the use (or not) of the dc_aware and token_aware policies. The results are output to stdout as a JSON object.