Cassandra Day NY 2014: Apache Cassandra & Python for The New York Times ⨍aбrik Platform

Post on 15-Jan-2015


DESCRIPTION

In this session, you’ll learn how Apache Cassandra is used with Python in the New York Times ⨍aбrik messaging platform. Michael starts by diving into an overview of the NYT ⨍aбrik global message bus platform and its “memory” features, then discusses the team’s use of the open source Apache Cassandra Python driver by DataStax. He presents a progressive benchmark of features and performance, from naive and synchronous to asynchronous with multiple IO loops, tailored to usage at the NY Times. Code snippets, followed by beer, for those who survive. All code is available on GitHub!

TRANSCRIPT

Cassandra Python driver: benchmarking concurrency for NYT ⨍aбrik (Michael.Laing@nytimes.com)

A Global Mesh with a Memory

Message-based: WebSocket, AMQP, SockJS

If in doubt:
• Resend
• Reconnect
• Reread

Idempotent:
• Replicating
• Racy
• Resolving

Classes of service:
• Gold: replicate/race
• Silver: prioritize
• Bronze: queueable

Millions of users

Message: an event with data

CREATE TABLE source_data (
    hash_key int,        -- real ones are more complex
    message_id timeuuid,
    body blob,           -- whatever
    metadata text,       -- JSON
    PRIMARY KEY (hash_key, message_id)
);
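A row for this schema can be assembled with a small helper; the sketch below uses hypothetical names (`INSERT_CQL`, `make_row`) and assumes `?`-style bind markers as used by prepared statements in the DataStax Python driver. Metadata travels as JSON text, the body as raw bytes, and the timeuuid keeps messages time-ordered within a partition.

```python
import json
import os
import uuid

# Hypothetical prepared-statement text matching the schema above.
INSERT_CQL = (
    "INSERT INTO source_data (hash_key, message_id, metadata, body) "
    "VALUES (?, ?, ?, ?)"
)

def make_row(hash_key, body, **metadata):
    """Build the bind parameters for one message row."""
    message_id = uuid.uuid1()  # version-1 UUID: time-ordered within a partition
    metadata["message_id"] = str(message_id)
    return (hash_key, message_id, json.dumps(metadata), body)

row = make_row(7, os.urandom(64), app_id="bm_push")
```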

[Diagram: Push path carries 1-10kb messages with Acks; Pull path exchanges 1kb requests for 10-150kb responses]

Synchronous: C* Thrift or CQL Native

Concurrent: degree = 3

(using the libev event loop)

Asynchronous: CQL Native only

More Concurrency

Can also try:
• DC Aware
• Token Aware
• Subprocessing
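The DC-aware and token-aware options correspond to load-balancing policies in the DataStax Python driver. A minimal connection sketch follows; contact point, data center name, and keyspace are placeholders, and it requires a running cluster, so treat it as a configuration fragment rather than runnable benchmark code.

```python
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Route each request to a replica that owns the partition key (token aware),
# preferring nodes in the local data center (DC aware).
cluster = Cluster(
    contact_points=["127.0.0.1"],  # placeholder host
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")  # placeholder DC name
    ),
)
session = cluster.connect("my_keyspace")  # placeholder keyspace
```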

Build one

def build_message(self):
    message = {
        "message_id": str(uuid.uuid1()),
        "hash_key": randint(0, self._hash_key_range),      # int(e ** 8)
        "app_id": self._app_id,
        "timestamp": datetime.utcnow().isoformat() + 'Z',
        "content_type": "application/binary",
        "body": os.urandom(randint(1, self._body_range))   # int(e ** 9)
    }
    return message

Kick-off

def push_message(self):
    if self._submitted_count.next() < self._message_count:
        message = self.build_message()
        self.submit_query(message)

def push_initial_data(self):
    self._start_time = time()

    try:
        with self._lock:
            for i in range(0, min(CONCURRENCY, self._message_count)):
                self.push_message()
    except Exception:  # except body truncated on the original slide
        raise

Put it in the pipeline

def submit_query(self, message):
    body = message.pop('body')

    substitution_args = (
        json.dumps(message, **JSON_DUMPS_ARGS),
        body,
        message['hash_key'],
        uuid.UUID(message['message_id'])
    )

    future = self._cql_session.execute_async(
        self._query, substitution_args
    )

    future.add_callback(self.push_or_finish)
    future.add_errback(self.note_error)

Maintain concurrency or finish

def push_or_finish(self, _):
    try:
        if (
            self._unfinished
            and self._confirmed_count.next() < self._message_count
        ):
            with self._lock:
                self.push_message()
        else:
            self.finish()
    except Exception:  # except body truncated on the original slide
        raise
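The callback chain above keeps a fixed number of requests in flight: each completion either pushes the next message or finishes the run. The same pattern can be sketched self-contained, with ThreadPoolExecutor futures standing in for the driver's execute_async; all names and counts here are illustrative, not the talk's actual code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 3     # requests kept in flight
MESSAGE_COUNT = 10  # total messages to push

class PushBenchmark:
    """Keep CONCURRENCY requests in flight until MESSAGE_COUNT are confirmed."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=CONCURRENCY)
        # RLock: a future's callback can fire synchronously in the submitting
        # thread, re-entering while the lock is already held.
        self._lock = threading.RLock()
        self._submitted = 0
        self._confirmed = 0
        self._done = threading.Event()

    def _fake_execute_async(self, n):
        # Stand-in for session.execute_async(): returns a Future immediately.
        return self._executor.submit(lambda: n)

    def push_message(self):
        self._submitted += 1
        future = self._fake_execute_async(self._submitted)
        future.add_done_callback(self.push_or_finish)

    def push_or_finish(self, _future):
        # Each completion either refills the pipeline or finishes the run.
        with self._lock:
            self._confirmed += 1
            if self._submitted < MESSAGE_COUNT:
                self.push_message()
            elif self._confirmed >= MESSAGE_COUNT:
                self._done.set()

    def run(self):
        with self._lock:
            for _ in range(min(CONCURRENCY, MESSAGE_COUNT)):
                self.push_message()
        self._done.wait(timeout=5)
        self._executor.shutdown()
        return self._confirmed
```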


Push some messages

usage: bm_push.py [-h] [-c [CQL_HOST [CQL_HOST ...]]] [-d LOCAL_DC]
                  [--remote-dc-hosts REMOTE_DC_HOSTS] [-p PREFETCH_COUNT]
                  [-w WORKER_COUNT] [-a] [-t]
                  [-n {ONE,TWO,THREE,QUORUM,ALL,LOCAL_QUORUM,EACH_QUORUM,SERIAL,LOCAL_SERIAL,LOCAL_ONE}]
                  [-r] [-j] [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]

Push messages from a RabbitMQ queue into a Cassandra table.

Push messages many times

usage: run_push.py [-h] [-c [CQL_HOST [CQL_HOST ...]]] [-i ITERATIONS]
                   [-d LOCAL_DC] [-w [worker_count [worker_count ...]]]
                   [-p [prefetch_count [prefetch_count ...]]]
                   [-n [level [level ...]]] [-a] [-t]
                   [-m MESSAGE_EXPONENT] [-b BODY_EXPONENT]
                   [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]

Run multiple test cases based upon the product of worker_counts, prefetch_counts, and consistency_levels. Each test case may be run with up to 4 variations reflecting the use or not of the dc_aware and token_aware policies. The results are output to stdout as a JSON object.
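The test-case matrix described here is a Cartesian product, which itertools.product expresses directly. A minimal sketch with illustrative parameter values (the real script reads these from its CLI flags):

```python
from itertools import product

# Illustrative values; run_push.py takes these from -w, -p, and -n.
worker_counts = [1, 2, 4]
prefetch_counts = [10, 50]
consistency_levels = ["ONE", "LOCAL_QUORUM"]

# Up to 4 variations: each combination of the two policies on or off.
policy_variations = [
    {"dc_aware": d, "token_aware": t}
    for d, t in product([False, True], repeat=2)
]

test_cases = [
    {"workers": w, "prefetch": p, "consistency": c, **v}
    for w, p, c in product(worker_counts, prefetch_counts, consistency_levels)
    for v in policy_variations
]

# 3 * 2 * 2 base cases, each with 4 policy variations
assert len(test_cases) == 3 * 2 * 2 * 4
```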

