cassandra for python developers

34
Cassandra for Tyler Hobbs Python Developers

Upload: tyler-hobbs

Post on 11-May-2015

2.928 views

Category:

Technology


4 download

DESCRIPTION

A high level introduction to Apache Cassandra followed by an introduction to pycassa, the Python client library for Cassandra.Presented at PyTexas 2011 by Tyler Hobbs.

TRANSCRIPT

Page 1: Cassandra for Python Developers

Cassandrafor

Tyler Hobbs

Python Developers

Page 2: Cassandra for Python Developers

Open-sourced by Facebook Apache Incubator Top-level Apache project DataStax founded

History

2008

2009

2010

2010

Page 3: Cassandra for Python Developers

Scalable– 2x Nodes == 2x Performance

Reliable (Available)– Replication that works

– Multi-DC support

– No single point of failure

Strengths

Page 4: Cassandra for Python Developers

Fast– 10-30k writes/sec, 1-10k reads/sec

Analytics– Integrated Hadoop support

Strengths

Page 5: Cassandra for Python Developers

No ACID transactions– Don't need these as often as you'd think

Limited support for ad-hoc queries– You'll give these up anyway when sharding an RDBMS

Generally complements another system– Not intended to be one-size-fits-all

Weaknesses

Page 6: Cassandra for Python Developers

Every node plays the same role– No masters, slaves, or special nodes

Clustering

Page 7: Cassandra for Python Developers

Clustering

0

10

20

30

40

50

Page 8: Cassandra for Python Developers

Clustering

0

10

20

30

40

50

Key: “www.google.com”

Page 9: Cassandra for Python Developers

Clustering

0

10

20

30

40

50

Key: “www.google.com”

14

md5(“www.google.com”)

Page 10: Cassandra for Python Developers

Clustering

0

10

20

30

40

50

14

Key: “www.google.com”

md5(“www.google.com”)

Page 11: Cassandra for Python Developers

Clustering

0

10

20

30

40

50

14

Key: “www.google.com”

md5(“www.google.com”)

Page 12: Cassandra for Python Developers

Clustering

0

10

20

30

40

50

14

Key: “www.google.com”

md5(“www.google.com”)

Replication Factor = 3

Page 13: Cassandra for Python Developers

Client can talk to any node

Clustering

Page 14: Cassandra for Python Developers

Keyspace– A collection of Column Families– Controls replication settings

Column Family– Kinda resembles a table

Data Model

Page 15: Cassandra for Python Developers

Static– Object data

Dynamic– Pre-calculated query results

ColumnFamilies

Page 16: Cassandra for Python Developers

Static Column Families

zznate

driftx

thobbs

jbellis

password: *

password: *

password: *

name: Nate

name: Brandon

name: Tyler

password: * name: Jonathan site: riptano.com

Users

Page 17: Cassandra for Python Developers

Dynamic Column Families

zznate

driftx

thobbs

jbellis

driftx: thobbs:

driftx: thobbs:mdennis: zznate

Following

zznate:

pcmanus xedin:

Page 18: Cassandra for Python Developers

Dynamic Column Families Timeline of tweets by a user Timeline of tweets by all of the people a

user is following List of comments sorted by score List of friends grouped by state

Page 19: Cassandra for Python Developers

Python client library for Cassandra

Open Source (MIT License)– www.github.com/pycassa/pycassa

Users– Reddit

– ~10k github downloads of every version

Pycassa

Page 20: Cassandra for Python Developers

easy_install pycassa– or pip

Installing Pycassa

Page 21: Cassandra for Python Developers

pycassa.pool– Connection pooling

pycassa.columnfamily– Primary module for the data API

pycassa.system_manager– Schema management

Basic Layout

Page 22: Cassandra for Python Developers

RPC-based API

Rows are like a sorted list of (name,value)

tuples– Like a dict, but sorted by the names– OrderedDicts are used to preserve sorting

The Data API

Page 23: Cassandra for Python Developers

Inserting Data

>>> from pycassa.pool import ConnectionPool>>> from pycassa.columnfamily import ColumnFamily>>>>>> pool = ConnectionPool(“MyKeyspace”)>>> cf = ColumnFamily(pool, “MyCF”)>>>>>> cf.insert(“key”, {“col_name”: “col_value”})>>> cf.get(“key”){“col_name”: “col_value”}

Page 24: Cassandra for Python Developers

Inserting Data

>>> columns = {“aaa”: 1, “ccc”: 3}>>> cf.insert(“key”, columns)>>> cf.get(“key”){“aaa”: 1, “ccc”: 3}>>>>>> # Updates are the same as inserts>>> cf.insert(“key”, {“aaa”: 42})>>> cf.get(“key”){“aaa”: 42, “ccc”: 3}>>>>>> # We can insert anywhere in the row>>> cf.insert(“key”, {“bbb”: 2, “ddd”: 4})>>> cf.get(“key”){“aaa”: 42, “bbb”: 2, “ccc”: 3, “ddd”: 4}

Page 25: Cassandra for Python Developers

Fetching Data

>>> cf.get(“key”){“aaa”: 42, “bbb”: 2, “ccc”: 3, “ddd”: 4}>>>>>> # Get a set of columns by name>>> cf.get(“key”, columns=[“bbb”, “ddd”]){“bbb”: 2, “ddd”: 4}

Page 26: Cassandra for Python Developers

Fetching Data

>>> # Get a slice of columns>>> cf.get(“key”, column_start=”bbb”,... column_finish=”ccc”){“bbb”: 2, “ccc”: 3}>>>>>> # Slice from “ccc” to the end>>> cf.get(“key”, column_start=”ccc”){“ccc”: 3, “ddd”: 4}>>>>>> # Slice from “bbb” to the beginning>>> cf.get(“key”, column_start=”bbb”,... column_reversed=True){“bbb”: 2, “aaa”: 42}

Page 27: Cassandra for Python Developers

Fetching Data

>>> # Get the first two columns in the row>>> cf.get(“key”, column_count=2){“aaa”: 42, “bbb”: 2}>>>>>> # Get the last two columns in the row>>> cf.get(“key”, column_reversed=True,... column_count=2){“ddd”: 4, “ccc”: 3}

Page 28: Cassandra for Python Developers

Fetching Multiple Rows

>>> columns = {“col”: “val”}>>> cf.batch_insert({“k1”: columns,... “k2”: columns,... “k3”: columns})>>>>>> # Get multiple rows by name>>> cf.multiget([“k1”,“k2”]){“k1”: {”col”: “val”}, “k2”: {“col”: “val”}}

>>> # You can get slices of each row, too>>> cf.multiget([“k1”,“k2”], column_start=”bbb”) …

Page 29: Cassandra for Python Developers

Fetching a Range of Rows

>>> # Get a generator over all of the rows>>> for key, columns in cf.get_range():... print key, columns“k1” {”col”: “val”}“k2” {“col”: “val”}“k3” {“col”: “val”}

>>> # You can get slices of each row>>> cf.get_range(column_start=”bbb”) …

Page 30: Cassandra for Python Developers

Fetching Rows by Secondary Index

>>> from pycassa.index import *>>>>>> # Build up our index clause to match>>> exp = create_index_expression(“name”, “Joe”)>>> clause = create_index_clause([exp])>>> matches = users.get_indexed_slices(clause)>>>>>> # results is a generator over matching rows>>> for key, columns in matches:... print key, columns“13” {”name”: “Joe”, “nick”: “thatguy2”}“257” {“name”: “Joe”, “nick”: “flowers”}“98” {“name”: “Joe”, “nick”: “fr0d0”}

Page 31: Cassandra for Python Developers

Deleting Data

>>> # Delete a whole row>>> cf.remove(“key1”)>>>>>> # Or selectively delete columns>>> cf.remove(“key2”, columns=[“name”, “date”])

Page 32: Cassandra for Python Developers

Connection Management

pycassa.pool.ConnectionPool– Takes a list of servers

• Can be any set of nodes in your cluster

– pool_size, max_retries, timeout– Automatically retries operations against other nodes

• Writes are idempotent!

– Individual node failures are transparent– Thread safe

Page 33: Cassandra for Python Developers

Async Options

eventlet– Just need to monkeypatch socket and threading

Twisted– Use Telephus instead of Pycassa– www.github.com/driftx/telephus– Less friendly, documented, etc

Page 34: Cassandra for Python Developers

Tyler Hobbs@tylhobbs

[email protected]