Cassandra Day NY 2014: Apache Cassandra & Python for The New York Times ⨍aбrik Platform
DESCRIPTION
In this session, you’ll learn how Apache Cassandra is used with Python in the NY Times ⨍aбrik messaging platform. Michael will start his talk with an overview of the NYT⨍aбrik global message bus platform and its “memory” features, then discuss their use of the open source Apache Cassandra Python driver by DataStax. Progressive benchmarks testing features and performance will be presented, from naive and synchronous to asynchronous with multiple IO loops, all tailored to usage at the NY Times. Code snippets, followed by beer, for those who survive. All code available on GitHub!
TRANSCRIPT
Cassandra Python driver
Benchmarking concurrency for NYT ⨍aбrik
[email protected]
A Global Mesh with a Memory
Message-based: WebSocket, AMQP, SockJS
If in doubt:
• Resend
• Reconnect
• Reread
Idempotent:
• Replicating
• Racy
• Resolving
Classes of service:
• Gold: replicate/race
• Silver: prioritize
• Bronze: queueable
Millions of users
Message: an event with data
CREATE TABLE source_data (
    hash_key int,        -- real ones are more complex
    message_id timeuuid,
    body blob,           -- whatever
    metadata text,       -- JSON
    PRIMARY KEY (hash_key, message_id)
);
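The write that the later submit_query snippet binds against would pair with this schema. A minimal sketch; the column order here is inferred from the substitution_args tuple shown later, so treat it as an assumption:

```python
# Hypothetical INSERT matching the source_data table above; the bind
# order (metadata JSON, body blob, hash_key, message_id) mirrors the
# substitution_args tuple used by submit_query.
INSERT_QUERY = (
    "INSERT INTO source_data (metadata, body, hash_key, message_id) "
    "VALUES (%s, %s, %s, %s)"
)

# Four placeholders, one per bound column.
print(INSERT_QUERY.count("%s"))  # → 4
```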
[Diagram: push path carries 1–10 kb messages with acks back; pull path issues 1 kb requests for 10–150 kb responses]
Synchronous: C* Thrift or CQL Native
Concurrent degree = 3
(using the libev event loop)
Asynchronous: CQL Native only
More concurrency
Can also try:
• DC Aware
• Token Aware
• Subprocessing
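The synchronous-to-asynchronous progression above can be sketched without a cluster at all: the same batch of "queries" issued one at a time versus with a bounded concurrent degree, with sleeps standing in for Cassandra round-trips.

```python
# Driver-free sketch: serial submission vs. a concurrent degree of 3,
# simulated with sleeps instead of real Cassandra queries.
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.02   # pretend each query takes 20 ms
QUERIES = 9
CONCURRENCY = 3  # the "concurrent degree = 3" case

def fake_query(_):
    time.sleep(LATENCY)

start = time.time()
for i in range(QUERIES):
    fake_query(i)            # synchronous: next query waits for this one
sync_elapsed = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(fake_query, range(QUERIES)))  # 3 in flight at a time
concurrent_elapsed = time.time() - start

print(sync_elapsed > concurrent_elapsed)  # → True
```

The event-loop version in the benchmarks gets the same win without threads: the driver pipelines requests over the CQL native protocol.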
Build one
def build_message(self):
    message = {
        "message_id": str(uuid.uuid1()),
        "hash_key": randint(0, self._hash_key_range),  # int(e ** 8)
        "app_id": self._app_id,
        "timestamp": datetime.utcnow().isoformat() + 'Z',
        "content_type": "application/binary",
        "body": os.urandom(randint(1, self._body_range))  # int(e ** 9)
    }
    return message
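Pulled out of its class, the generator above can be exercised standalone. The e ** 8 and e ** 9 range defaults are taken from the slide's comments (note e is Euler's number, so these are a few thousand, not billions); the app_id value here is a stand-in:

```python
import os
import uuid
from datetime import datetime
from math import e
from random import randint

def build_message(hash_key_range=int(e ** 8),  # ~2980, per the slide
                  body_range=int(e ** 9),      # ~8103 bytes max body
                  app_id="bm_push"):           # stand-in app_id
    return {
        "message_id": str(uuid.uuid1()),
        "hash_key": randint(0, hash_key_range),
        "app_id": app_id,
        "timestamp": datetime.utcnow().isoformat() + 'Z',
        "content_type": "application/binary",
        "body": os.urandom(randint(1, body_range)),
    }

msg = build_message()
print(sorted(msg))
```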
Kick-off
def push_message(self):
    if self._submitted_count.next() < self._message_count:
        message = self.build_message()
        self.submit_query(message)
def push_initial_data(self):
    self._start_time = time()
    try:
        with self._lock:
            for i in range(0, min(CONCURRENCY, self._message_count)):
                self.push_message()
Put it in the pipeline
def submit_query(self, message):
    body = message.pop('body')
    substitution_args = (
        json.dumps(message, **JSON_DUMPS_ARGS),
        body,
        message['hash_key'],
        uuid.UUID(message['message_id'])
    )
    future = self._cql_session.execute_async(
        self._query, substitution_args
    )
    future.add_callback(self.push_or_finish)
    future.add_errback(self.note_error)
Maintain concurrency or finish
def push_or_finish(self, _):
    try:
        if (
            self._unfinished and
            self._confirmed_count.next() < self._message_count
        ):
            with self._lock:
                self.push_message()
        else:
            self.finish()
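The whole loop (seed a window of requests, then have each completion callback push the next) can be mocked without the driver. A sketch under assumptions: a thread pool stands in for execute_async, and the termination check is simplified relative to the talk's code:

```python
# Driver-free mock of the concurrency window: submitted/confirmed
# counters plus a lock keep CONCURRENCY fake "queries" in flight until
# MESSAGE_COUNT completions have been observed.
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 3
MESSAGE_COUNT = 10

pool = ThreadPoolExecutor(max_workers=CONCURRENCY)  # stand-in for the driver
submitted = itertools.count()
confirmed = itertools.count()
lock = threading.RLock()  # re-entrant: a callback may fire on this thread
done = threading.Event()

def push_message():
    if next(submitted) < MESSAGE_COUNT:
        future = pool.submit(lambda: None)        # the fake "query"
        future.add_done_callback(push_or_finish)  # ~ add_callback

def push_or_finish(_):
    if next(confirmed) < MESSAGE_COUNT - 1:  # more completions to come
        with lock:
            push_message()
    else:
        done.set()                           # ~ self.finish()

# Kick-off: seed the window, as in push_initial_data.
with lock:
    for _ in range(min(CONCURRENCY, MESSAGE_COUNT)):
        push_message()

done.wait(timeout=5)
print(done.is_set())  # → True
```

The counters bound total submissions; each completion either refills the window or, on the last confirmation, finishes the run.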
Push some messages
usage: bm_push.py [-h] [-c [CQL_HOST [CQL_HOST ...]]] [-d LOCAL_DC]
                  [--remote-dc-hosts REMOTE_DC_HOSTS] [-p PREFETCH_COUNT]
                  [-w WORKER_COUNT] [-a] [-t]
                  [-n {ONE,TWO,THREE,QUORUM,ALL,LOCAL_QUORUM,EACH_QUORUM,SERIAL,LOCAL_SERIAL,LOCAL_ONE}]
                  [-r] [-j] [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
Push messages from a RabbitMQ queue into a Cassandra table.
Push messages many times
usage: run_push.py [-h] [-c [CQL_HOST [CQL_HOST ...]]] [-i ITERATIONS]
                   [-d LOCAL_DC] [-w [worker_count [worker_count ...]]]
                   [-p [prefetch_count [prefetch_count ...]]]
                   [-n [level [level ...]]] [-a] [-t]
                   [-m MESSAGE_EXPONENT] [-b BODY_EXPONENT]
                   [-l {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
Run multiple test cases based upon the product of worker_counts, prefetch_counts, and consistency_levels. Each test case may be run with up to 4 variations reflecting the use (or not) of the dc_aware and token_aware policies. The results are output to stdout as a JSON object.