python-couchdb training at pycon pl 2012
Using CouchDB with Python
Stefan Kögl
@skoegl
What we will cover
● What is CouchDB?
– Access from Python through couchdbkit
– Key-value Store Functionality
– MapReduce Queries
– HTTP API
● When is CouchDB useful and when not?
– Multi-Master Replication
– Scaling up and down
● Pointers to other resources, CouchDB ecosystem
What we won't cover
● CouchApps – browser-based apps that are served by CouchDB
● Detailed Security, Scaling and other operational issues
● Other functionality that didn't fit
Training Modes
● Code-Along
– Follow Examples, write your own code
– Small Scripts or REPL
● Learning-by-Watching
– Example Application at https://github.com/stefankoegl/python-couchdb-examples
– Slides at https://slideshare.net/skoegl/couch-db-pythonpyconpl2012
– Use example scripts and see what happens
– Submit Pull-Requests!
Contents
● Intro
– Contents
– CouchDB
– Example Application
● DB Initialization
● Key-Value Store
● Simple MapReduce Queries
● The _changes Feed
● Complex MapReduce Queries
● Replication
● Additional Features and the Couch Ecosystem
CouchDB
● Apache Project
● https://couchdb.apache.org/
● Current Version: 1.2
● Apache CouchDB™ is a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API
Example Application
● Lending Database
– Stores Items that you might want to lend
– Stores when you have lent what to whom
● Stand-alone or distributed
● Small Scripts that do one task each
● Look at HTTP traffic
Contents
● Intro
● DB Initialization
– Setting Up CouchDB
– Installing couchdbkit
– Creating a Database
● Key-Value Store
● Simple MapReduce Queries
● The _changes Feed
● Complex MapReduce Queries
● Replication
● Additional Features and the Couch Ecosystem
Getting Set Up: CouchDB
● Provided by me (no longer valid after the training)
● http://couch.skoegl.net:5984/<yourname>
● Authentication: username training, password training
● Set up your DB_URL in settings.py
● If you want to install your own
– Tutorials: https://wiki.apache.org/couchdb/Installation
– Ubuntu: https://launchpad.net/~longsleep/+archive/couchdb
– Mac, Windows: https://couchdb.apache.org/#download
Getting Set Up: couchdbkit
● http://couchdbkit.org/
● Python client library
# install with pip
pip install couchdbkit
# or from source
git clone git://github.com/benoitc/couchdbkit.git
cd couchdbkit
sudo python setup.py install
# and then you should be able to import
import couchdbkit
Contents
● Intro
● DB Initialization
– Setting Up CouchDB
– Installing couchdbkit
– Creating a Database
● Key-Value Store
● Simple MapReduce Queries
● Complex MapReduce Queries
● The _changes Feed
● Replication
● Additional Features and the Couch Ecosystem
Creating a Database
● What we have: a CouchDB server and its URL
e.g. http://127.0.0.1:5984
● What we want: a database there
e.g. http://127.0.0.1:5984/myname
● http://wiki.apache.org/couchdb/HTTP_database_API
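Creating a database comes down to one HTTP request: a PUT on the database URL. The following is a minimal sketch using only the Python standard library. It builds the request without sending it, so it runs without a server. The function name build_create_db_request is made up for this illustration; host, database name and credentials are the placeholders used on the slides.

```python
import base64
from urllib.request import Request


def build_create_db_request(server_url, db_name, username, password):
    """Build (but do not send) the PUT request that creates a database."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return Request(
        url="{}/{}".format(server_url.rstrip("/"), db_name),
        method="PUT",
        headers={"Authorization": "Basic " + token},
    )


req = build_create_db_request(
    "http://127.0.0.1:5984", "myname", "training", "training"
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it. CouchDB answers 201 when the database was created and 412 when it already exists.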
A note on Debugging
● Apache-style log files
● Locally
– $ tail -f /var/log/couchdb/couch.log
● HTTP
– http://127.0.0.1:5984/_log?bytes=5000
– http://wiki.apache.org/couchdb/HttpGetLog
Creating a Database
# ldb-init.py
from restkit import BasicAuth
from couchdbkit import Database
from couchdbkit.exceptions import ResourceNotFound
# dburl is the URL of your database, e.g. DB_URL from settings.py
auth_filter = BasicAuth('username', 'pwd')
db = Database(dburl, filters=[auth_filter])
server = db.server
# start from scratch: drop the database if it already exists
try:
    server.delete_db(db.dbname)
except ResourceNotFound:
    pass
db = server.get_or_create_db(db.dbname)
Creating a Database
[Thu, 06 Sep 2012 16:44:30 GMT] [info] [<0.1435.0>] 127.0.0.1 - - DELETE /myname/ 200
[Thu, 06 Sep 2012 16:44:30 GMT] [info] [<0.1435.0>] 127.0.0.1 - - HEAD /myname/ 404
[Thu, 06 Sep 2012 16:44:30 GMT] [info] [<0.1440.0>] 127.0.0.1 - - PUT /myname/ 201
Contents
● Intro
● DB Initialization
● Key-Value Store
– Modelling Documents
– Storing and Retrieving Documents
– Updating Documents
● Simple MapReduce Queries
● Complex MapReduce Queries
● The _changes Feed
● Replication
● Additional Features and the Couch Ecosystem
Key-Value Store
● Core of CouchDB
● Keys (_id): any valid JSON string
● Values (documents): any valid JSON objects
● Stored in B+-Trees
● http://guide.couchdb.org/draft/btree.html
Modelling a Thing
● A thing that we want to lend
– Name
– Owner
– Dynamic properties like
  ● Description
  ● Movie rating
  ● etc
Modelling a Thing
● In CouchDB documents are JSON objects
● You can store any dict
– Wrapped in couchdbkit's Document classes for convenience
● Documents can be serialized to JSON …
mydict = mydoc.to_json()
● … and deserialized from JSON
mydoc = DocClass.wrap(mydict)
Modelling a Thing
# models.py
from couchdbkit import Database, Document, StringProperty
class Thing(Document):
    owner = StringProperty(required=True)
    name = StringProperty(required=True)
# DB_URL is the database URL configured in settings.py
db = Database(DB_URL)
Thing.set_db(db)
Storing a Document
● Document identified by _id
– Auto-assigned by Database (bad)
– Provided when storing the document (good)
– Think about lost responses
– couchdbkit does that for us
● couchdbkit adds property doc_type with value "Thing"
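Why a client-provided _id is the better choice when responses can get lost is easiest to see in a small simulation. The sketch below uses an in-memory stand-in (FakeServer, a made-up class, not a CouchDB client): retrying a request with a server-assigned id stores the document twice, while retrying with a client-chosen id hits the same key again.

```python
import uuid


class FakeServer:
    """In-memory stand-in for a database; not a CouchDB client."""

    def __init__(self):
        self.docs = {}

    def post(self, doc):
        # The server picks the _id, so every call creates a new document.
        doc_id = uuid.uuid4().hex
        self.docs[doc_id] = dict(doc)
        return doc_id

    def put(self, doc_id, doc):
        # The client picks the _id, so repeating the call hits the same key.
        if doc_id in self.docs:
            return "conflict"
        self.docs[doc_id] = dict(doc)
        return "created"


server = FakeServer()
thing = {"owner": "stefan", "name": "CouchDB The Definitive Guide"}

# Response lost, client retries with a server-assigned id: two copies.
server.post(thing)
server.post(thing)
copies_after_post = len(server.docs)

# Response lost, client retries with its own id: no additional copy.
thing_id = uuid.uuid4().hex
first = server.put(thing_id, thing)
retry = server.put(thing_id, thing)
print(copies_after_post, first, retry, len(server.docs))
```

The "conflict" on the retry tells the client that its first attempt already went through.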
Internal Storage
● Database File /var/lib/couchdb/dbname.couch
● B+-Tree of _id
● Access: O(log n)
● Append-only storage
● Accessible in historic order (we'll come to that later)
Storing a Document
# ldb-new-thing.py
couchguide = Thing(owner='stefan', name='CouchDB The Definitive Guide')
couchguide.publisher = "O'Reilly"
couchguide.to_json()
# {'owner': u'stefan', 'doc_type': 'Thing',
#  'name': u'CouchDB The Definitive Guide',
#  'publisher': u"O'Reilly"}
couchguide.save()
print couchguide._id
# 448aaecfe9bc1cde5d6564a4c93f79c2
Storing a Document
[Thu, 06 Sep 2012 19:40:26 GMT] [info] [<0.962.0>] 127.0.0.1 - - GET /_uuids?count=1000 200
[Thu, 06 Sep 2012 19:40:26 GMT] [info] [<0.962.0>] 127.0.0.1 - - PUT /lendb/8f14ef7617b8492fdbd800f1101ebb35 201
Retrieving a Document
● Retrieve a Document by its _id
– Limited use
– Does not allow queries by other properties
# ldbgetthing.py
thing = Thing.get(thing_id)
Retrieving a Document
[Thu, 06 Sep 2012 19:45:30 GMT] [info] [<0.962.0>] 127.0.0.1 - - GET /lendb/8f14ef7617b8492fdbd800f1101ebb35 200
Updating a Document
● Optimistic Concurrency Control
● Each Document has a revision
● Each Operation includes revision
● Operation fails if revision doesn't match
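The revision check can be modelled in a few lines. This is a sketch of the idea only: TinyStore and ConflictError are made-up names, and revisions are plain counters here, whereas CouchDB uses strings of the form N-hash.

```python
class ConflictError(Exception):
    pass


class TinyStore:
    """In-memory model of revision checking; revisions are plain counters."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def save(self, doc):
        current = self._docs.get(doc["_id"])
        current_rev = current["_rev"] if current else None
        # The update must name the revision it was based on.
        if doc.get("_rev") != current_rev:
            raise ConflictError("Document update conflict.")
        stored = dict(doc, _rev=(current_rev or 0) + 1)
        self._docs[doc["_id"]] = stored
        return stored["_rev"]


store = TinyStore()
store.save({"_id": "thing"})

# Two clients read the same revision.
first = store.get("thing")
second = store.get("thing")

first["something"] = True
first["_rev"] = store.save(first)  # succeeds and bumps the revision

second["conflicting"] = "test"
try:
    store.save(second)  # still based on the old revision
    outcome = "saved"
except ConflictError:
    outcome = "conflict"
print(first["_rev"], outcome)
```

A client that runs into the conflict would typically fetch the document again, reapply its change and retry.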
Updating a Document
Success:
>>> thing1 = Thing.get(some_id)
>>> thing1._rev
'1-110e1e46bcde6ed3c2d9b1073f0b26'
>>> thing1.something = True
>>> thing1.save()
>>> thing1._rev
'2-3f800dffa62f4414b2f8c84f7cb1a1'
Failed (thing2 was fetched before thing1 was saved and still holds the old revision):
>>> thing2 = Thing.get(some_id)
>>> thing2._rev
'1-110e1e46bcde6ed3c2d9b1073f0b26'
>>> thing2.conflicting = 'test'
>>> thing2.save()
couchdbkit.exceptions.ResourceConflict: Document update conflict.
Updating a Document
[Thu, 13 Sep 2012 06:16:52 GMT] [info] [<0.7977.0>] 127.0.0.1 - - GET /lendb/d46d311d9a0f64b1f7322d20721f9f1d 200
[Thu, 13 Sep 2012 06:16:55 GMT] [info] [<0.7977.0>] 127.0.0.1 - - GET /lendb/d46d311d9a0f64b1f7322d20721f9f1d 200
[Thu, 13 Sep 2012 06:17:34 GMT] [info] [<0.7977.0>] 127.0.0.1 - - PUT /lendb/d46d311d9a0f64b1f7322d20721f9f1d 201
[Thu, 13 Sep 2012 06:17:48 GMT] [info] [<0.7977.0>] 127.0.0.1 - - PUT /lendb/d46d311d9a0f64b1f7322d20721f9f1d 409
Contents
● Intro
● DB Initialization
● Key-Value Store
● Simple MapReduce Queries
– Create a View
– Query the View
● Complex MapReduce Queries
● The _changes Feed
● Replication
● Additional Features and the Couch Ecosystem
Views
● A specific "view" on (parts of) the data in a database
● Indexed incrementally
● Query is just reading a range of a view sequentially
● Generated using MapReduce
MapReduce Views
● Map Function
– Called for each document
– Has to be side-effect free
– Emits zero or more intermediate key-value pairs
● Reduce Function (optional)
– Aggregates intermediate pairs
● View Results stored in B+-Tree
– Incrementally pre-computed at query-time
– Queries are just an O(log n) lookup
List all Things
● Implemented as MapReduce View
● Contained in a Design Document
– Create
– Store
– Query
Create a Design Document
● Regular document, interpreted by the database
● Views mapped to the filesystem by directory structure
_design/<ddoc name>/views/<view name>/{map,reduce}.js
● Written in JavaScript or Erlang
● Pluggable View Servers
– http://wiki.apache.org/couchdb/View_server
– http://packages.python.org/CouchDB/views.html
– Lisp, PHP, Ruby, Python, Clojure, Perl, etc
Design Document
# _design/things/views/by_owner_name/map.js
function(doc) {
    if(doc.doc_type == "Thing") {
        emit([doc.owner, doc.name], null);
    }
}
Intermediate Results
Key                               Value
["stefan", "couchguide"]          null
["stefan", "Polish Dictionary"]   null
["marek", "robot"]                null
Design Document
# _design/things/views/by_owner_name/reduce.js
_count
Reduced Results
● Result depends on group level
● group_level=2 (full key)
Key                               Value
["stefan", "couchguide"]          1
["stefan", "Polish Dictionary"]   1
["marek", "robot"]                1
● group_level=1
Key                               Value
["stefan"]                        2
["marek"]                         1
● group_level=0 (everything reduced to a single row)
Key                               Value
null                              3
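How the group level changes the result can be reproduced in plain Python. The sketch below is a simulation, not CouchDB code: it runs a Python version of the by_owner_name map function over three sample documents, sorts the emitted rows by key and counts the rows per key prefix, which is what the built-in _count reduce does. The function names are made up, and Python's list ordering stands in for CouchDB's collation.

```python
from itertools import groupby

docs = [
    {"doc_type": "Thing", "owner": "stefan", "name": "couchguide"},
    {"doc_type": "Thing", "owner": "stefan", "name": "Polish Dictionary"},
    {"doc_type": "Thing", "owner": "marek", "name": "robot"},
]


def map_by_owner_name(doc):
    """Python rendering of the map function: emit [owner, name] -> null."""
    if doc.get("doc_type") == "Thing":
        yield [doc["owner"], doc["name"]], None


def query_count(docs, group_level):
    """Count emitted rows per key prefix of length group_level."""
    rows = sorted(
        (row for doc in docs for row in map_by_owner_name(doc)),
        key=lambda row: row[0],
    )
    if group_level == 0:
        return [(None, len(rows))]

    def prefix(row):
        return row[0][:group_level]

    return [(key, len(list(group))) for key, group in groupby(rows, key=prefix)]


for level in (2, 1, 0):
    print(level, query_count(docs, level))
```

The rows come out sorted by key, so "marek" is listed before "stefan".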
Synchronize Design Docs
● Upload the design document
● _id: _design/<ddoc name>
● couchdbkit syncs ddocs from filesystem
● We'll need this a few more times
– Put the following in its own script
– or run
$ ./ldbsyncddocs.py
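Synchronize Design Docs
# ldbsyncddocs.py
from restkit import BasicAuth
from couchdbkit import Database
from couchdbkit.loaders import FileSystemDocsLoader
auth_filter = BasicAuth('username', 'pwd')
db = Database(dburl, filters=[auth_filter])
# upload the design documents found below the _design directory
loader = FileSystemDocsLoader('_design')
loader.sync(db, verbose=True)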
View things/by_owner_name
● Emitted key-value pairs
● Sorted by key http://wiki.apache.org/couchdb/View_collation
● Keys can be complex (lists, dicts)
● Query
http://127.0.0.1:5984/myname/_design/things/_view/by_owner_name?reduce=false
Key                               Value   _id (implicit)   Document (implicit)
["stefan", "couchguide"]          null    …                { … }
["stefan", "Polish Dictionary"]   null    …                { … }
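Because keys are JSON values, they have to be JSON-encoded when they are passed as query parameters. The sketch below builds a query URL for the by_owner_name view defined earlier that selects only the rows of one owner. The helper name view_url is made up. It JSON-encodes every value, which suits keys, booleans and numbers; other kinds of parameters would need different handling. In view collation objects sort after strings, so the range from ["stefan"] to ["stefan", {}] covers every key that starts with "stefan".

```python
import json
from urllib.parse import urlencode


def view_url(db_url, ddoc, view, **params):
    """Build a view query URL with JSON-encoded parameter values."""
    encoded = {name: json.dumps(value) for name, value in sorted(params.items())}
    return "{}/_design/{}/_view/{}?{}".format(db_url, ddoc, view, urlencode(encoded))


url = view_url(
    "http://127.0.0.1:5984/myname",
    "things",
    "by_owner_name",
    reduce=False,
    startkey=["stefan"],
    endkey=["stefan", {}],
)
print(url)
```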
Query a View
# ldblistthings.py
things = Thing.view('things/by_owner_name', include_docs=True, reduce=False)
for thing in things:
    print thing._id, thing.name, thing.owner
Query a View – Reduced
# ldboverview.py
owners = Thing.view('things/by_owner_name', group_level=1)
for owner_status in owners:
    owner = owner_status['key'][0]
    count = owner_status['value']
    print owner, count
Break
From the Break
● Filtering by Price
– startkey = 5
– endkey = 10
● Structure: ddoc name / view name
– Logical Grouping
– Performance
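A range query with startkey and endkey is a lookup in a sorted list of keys. The sketch below illustrates this with the standard bisect module on a hypothetical by_price view; the rows and item names are invented for the example. Both ends are inclusive, which matches CouchDB's default behaviour.

```python
from bisect import bisect_left, bisect_right

# Rows of a hypothetical by_price view, already sorted by key (the price).
rows = [(2, "pen"), (5, "mug"), (7, "book"), (10, "lamp"), (15, "robot")]
keys = [key for key, _ in rows]


def query_range(startkey, endkey):
    """Return the rows with startkey <= key <= endkey."""
    lo = bisect_left(keys, startkey)
    hi = bisect_right(keys, endkey)
    return rows[lo:hi]


print(query_range(5, 10))
```

Finding the two ends takes O(log n); after that the result is read sequentially.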
Contents
● Intro
● DB Initialization
● Key-Value Store
● Simple MapReduce Queries
● The _changes Feed
– Accessing the _changes Feed
– Lending Objects
● Advanced MapReduce Queries
● Replication
● Additional Features and the Couch Ecosystem
Changes Sequence
● With every document update, a change is recorded
● Local history, ordered by _seq value
● For each document, only the latest _seq is kept
Changes Feed
● List of all documents, in the order they were last modified
● Possibility to
– React on changes
– Process all documents without skipping any
– Continue at some point with since parameter
● CouchDB as a distributed, persistent MQ
● http://guide.couchdb.org/draft/notifications.html
● http://wiki.apache.org/couchdb/HTTP_database_API#Changes
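The behaviour of the feed, including the since parameter, can be modelled with a counter and a dict. This is a sketch of the idea, not of CouchDB's implementation; ChangesLog is a made-up class.

```python
class ChangesLog:
    """In-memory model of a changes sequence."""

    def __init__(self):
        self._seq = 0
        self._latest = {}  # doc id -> seq of its most recent update

    def record(self, doc_id):
        self._seq += 1
        # An older entry for the same document is overwritten.
        self._latest[doc_id] = self._seq
        return self._seq

    def changes(self, since=0):
        entries = sorted((seq, doc_id) for doc_id, seq in self._latest.items())
        return [{"seq": seq, "id": doc_id} for seq, doc_id in entries if seq > since]


log = ChangesLog()
for doc_id in ["couchguide", "robot", "couchguide", "dictionary"]:
    log.record(doc_id)

print(log.changes())
print(log.changes(since=3))
```

A consumer that remembers the last seq it processed can stop and later continue with since set to that value without missing a document.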
Changes Feed
# ldbchangeslog.py
from couchdbkit import Consumer
def callback(line):
    seq = line['seq']
    doc = line['doc']
    # get obj according to doc['doc_type'];
    # the raw dict is used here as a stand-in
    obj = doc
    print seq, obj
consumer = Consumer(db)
consumer.wait(callback, since=since, include_docs=True)
"Lending" Objects
● Thing that is lent
● Who lent it (i.e. who is the owner of the thing)
● To whom it is lent
● When it was lent
● When it was returned
Modelling a "Lend" Object
# models.py
from datetime import datetime
from couchdbkit import Document, StringProperty, DateTimeProperty
class Lending(Document):
    thing = StringProperty(required=True)
    owner = StringProperty(required=True)
    to_user = StringProperty(required=True)
    lent = DateTimeProperty(default=datetime.now)
    returned = DateTimeProperty()
Lending.set_db(db)
Lending a Thing
# ldblendthing.py
lending = Lending(thing=thing_id, owner=username, to_user=to_user)
lending.save()
Returning a Thing
# ldbreturnthing.py
lending = Lending.get(lend_id)
lending.returned = datetime.now()
lending.save()
Contents
● Intro
● DB Initialization
● Key-Value Store
● Simple MapReduce Queries
● The _changes Feed
● Advanced MapReduce Queries
– Imitating Joins with „Mixed“ Views
● Replication
● Additional Features and the Couch Ecosystem
Current Thing Status
● View to get the current status of a thing
● No Joins
● We emit rows with keys that group together
Complex View
# _design/things/views/status/map.js
function(doc) {
    if(doc.doc_type == "Thing") {
        emit([doc.owner, doc._id, 1], doc.name);
    }
    if(doc.doc_type == "Lending") {
        if(doc.lent && !doc.returned) {
            emit([doc.owner, doc.thing, 2], doc.to_user);
        }
    }
}
Intermediate View Results
Key                       Value
["stefan", 12345, 1]      "couchguide"
["stefan", 12345, 2]      ["someone", "2012-09-12"]
["marek", 34544, 1]       "robot"
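Reduce Intermediate Results
# _design/things/views/status/reduce.js
/* use with group_level = 2 */
function(keys, values, rereduce) {
    if(rereduce) {
        /* keys is null here; values are earlier reduce results */
        return values.indexOf("lent") >= 0 ? "lent" : "available";
    }
    /* each entry of keys is [key, doc id]; marker 2 is a "Lending" row */
    for(var i = 0; i < keys.length; i++) {
        if(keys[i][0][2] == 2) {
            return "lent";
        }
    }
    return "available";
}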
● Don't forget to synchronize your design docs!
● Group Level: 2
● Reduce Function receives rows with same grouped value
Reduce Intermediate Results
Intermediate – not reduced
Key                       Value
["stefan", 12345, 1]      "couchguide"
["stefan", 12345, 2]      ["someone", "2012-09-12"]
["marek", 34544, 1]       "robot"
Reduced
Key                       Value
["owner", 12345]          "lent"
["owner", 34544]          "available"
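The whole "mixed" view can be replayed in plain Python to see how the grouping replaces a join. The sketch below is a simulation with invented sample documents, not code that runs inside CouchDB. It emits one row per Thing and one row per open Lending, sorts the rows, groups them by the first two key elements (group level 2) and reports "lent" when a group contains a Lending row.

```python
from itertools import groupby

docs = [
    {"_id": "12345", "doc_type": "Thing", "owner": "stefan", "name": "couchguide"},
    {"_id": "34544", "doc_type": "Thing", "owner": "marek", "name": "robot"},
    {"_id": "l1", "doc_type": "Lending", "owner": "stefan", "thing": "12345",
     "to_user": "someone", "lent": "2012-09-12", "returned": None},
]


def map_status(doc):
    """Python rendering of the map function; the third key element marks the row type."""
    if doc["doc_type"] == "Thing":
        yield [doc["owner"], doc["_id"], 1], doc["name"]
    if doc["doc_type"] == "Lending" and doc.get("lent") and not doc.get("returned"):
        yield [doc["owner"], doc["thing"], 2], doc["to_user"]


def status(docs):
    """Group rows by [owner, thing id] and derive a status per group."""
    rows = sorted(
        (row for doc in docs for row in map_status(doc)),
        key=lambda row: row[0],
    )
    result = {}
    for key, group in groupby(rows, key=lambda row: row[0][:2]):
        markers = [row[0][2] for row in group]
        result[tuple(key)] = "lent" if 2 in markers else "available"
    return result


print(status(docs))
```

Setting returned on the Lending document makes its row disappear from the view, and the thing shows up as "available" again.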
Get Status
# ldbstatus.py
things = Thing.view('things/status', group_level=2)
for result in things:
    owner = result['key'][0]
    thing_id = result['key'][1]
    status = result['value']
    print owner, thing_id, status
Contents
● Intro
● DB Initialization
● Key-Value Store
● Simple MapReduce Queries
● The _changes Feed
● Advanced MapReduce Queries
● Replication
– Setting up filters
– Find Friends and Replicate from them
– Eventual Consistency and Conflicts
● Additional Features and the Couch Ecosystem
Replication
● Replicate Things and their status from friends
● Don't replicate things from friends of friends
– we don't want to borrow anything from them
Replication
● Pull replication
– Pull documents from our friends, and store them locally
● There's also Push replication, but we won't use it
● Goes through the source's _changes feed
● Compares with local documents, updates or creates conflicts
Set up a Filter
● A Filter is a JavaScript function that takes
– a document
– a request object
● and returns
– true, if the document passes the filter
– false otherwise
● A filter is evaluated at the source
Replication Filter
# _design/things/filters/from_friend.js
/* doc is the document,
   req is the request that uses the filter */
function(doc, req)
{
    /* Allow only if entry is owned by the friend */
    return (doc.owner == req.query.friend);
}
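The effect of the filter can be tried out without a second database. The sketch below is a Python rendering of the same predicate, applied to a few invented documents; the names and the shape of the req dict are simplified for the illustration.

```python
def from_friend(doc, req):
    """Pass only documents that are owned by the requested friend."""
    return doc.get("owner") == req["query"]["friend"]


# Documents in a friend's database: his own things plus one he
# replicated from someone else (a friend of a friend).
source_docs = [
    {"_id": "a", "owner": "marek", "name": "robot"},
    {"_id": "b", "owner": "anna", "name": "bike"},
    {"_id": "c", "owner": "marek", "name": "tent"},
]

req = {"query": {"friend": "marek"}}
replicated = [doc["_id"] for doc in source_docs if from_friend(doc, req)]
print(replicated)
```

Document "b" stays behind, which is how things from friends of friends are kept out.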
Replication
● Sync design docs to your own database!
● Find friends to borrow from
– Post your nickname and Database URL to http://piratepad.net/pycouchpl
– Pick at least two friends
Replication
● _replicator database
● Objects describe Replication tasks
– Source
– Target
– Continuous
– Filter
– etc
● http://wiki.apache.org/couchdb/Replication
Replication
# ldbreplicatefriend.py
from restkit import BasicAuth
from couchdbkit import Database
auth_filter = BasicAuth(username, password)
db = Database(db_url, filters=[auth_filter])
replicator_db = db.server['_replicator']
replication_doc = {
    "source": friend_db_url,
    "target": db_url,
    "continuous": True,
    "filter": "things/from_friend",
    "query_params": {"friend": friend_name},
}
# _id of the replication document; the separator was lost in the
# slide text, "-" is assumed here
replicator_db[username + "-" + friend_name] = replication_doc
Replication
● Documents should be propagated into own database
● Views should contain both own and friends' things
Dealing with Conflicts
● Conflicts are introduced by
– Replication
– „forcing“ a document update
● _rev calculated based on
– Previous _rev
– document content
● Conflict when two documents have
– The same _id
– Distinct _rev
Dealing with Conflicts
● Select a Winner
● Database can't do this for you
● Automatic strategy selects a (temporary) winner
– Deterministic: always the same winner on each node
– leaves them in conflict state
● View that contains all conflicts
● Resolve conflict programmatically
● http://guide.couchdb.org/draft/conflicts.html
● http://wiki.apache.org/couchdb/Replication_and_conflicts
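What "deterministic" means for the temporary winner can be shown with a small function. The sketch assumes the rule given in the CouchDB documentation: the revision with the longest history wins, and ties are broken by comparing the revision strings. Deleted revisions are not modelled, the _rev values are invented, and pick_winner is a made-up name.

```python
def pick_winner(revisions):
    """Pick the same winner on every node from conflicting _rev strings."""

    def sort_key(rev):
        generation, _, digest = rev.partition("-")
        return (int(generation), digest)

    return max(revisions, key=sort_key)


conflicting = ["2-b91bb807", "2-7051cbe5", "3-00a1f2c4"]
winner = pick_winner(conflicting)
losers = sorted(rev for rev in conflicting if rev != winner)
print(winner, losers)
```

Since the choice depends only on the revisions themselves, every node arrives at the same winner without talking to the others. The losing revisions stay in the database until the application resolves the conflict.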
Contents
● Intro
● DB Initialization
● Key-Value Store
● Simple MapReduce Queries
● The _changes Feed
● Advanced MapReduce Queries
● Replication
● Additional Features and the Couch Ecosystem
– Scaling and related Projects
– Fulltext Search
– Further Reading
Scaling Up / Out
● BigCouch
– Cluster of CouchDB nodes that appears as a single server
– http://bigcouch.cloudant.com/
– will be merged into CouchDB soon
● refuge
– Fully decentralized data platform based on CouchDB
– Includes fork of GeoCouch for spatial indexing
– http://refuge.io/
Scaling Down
● CouchDB-compatible Databases on a smaller scale
● PouchDB
– JavaScript library http://pouchdb.com/
● TouchDB
– iOS: https://github.com/couchbaselabs/TouchDB-iOS
– Android: https://github.com/couchbaselabs/TouchDB-Android
Fulltext and Relational Search
● http://wiki.apache.org/couchdb/Full_text_search
● CouchDB Lucene
– http://www.slideshare.net/martin.rehfeld/couchdblucene
– https://github.com/rnewson/couchdb-lucene
● Elastic Search
– http://www.elasticsearch.org/
Operations Considerations
● Append Only Storage
● Your backup tools: cp, rsync
● Regular Compaction needed
Further Features
● Update Handlers: JavaScript code that carries out updates in the database server
● External Processes: use CouchDB as a proxy to other processes (eg search engines)
● Attachments: attach binary files to documents
● Update Validation: JavaScript code to validate doc updates
● CouchApps: Web-Apps served directly by CouchDB
● Bulk APIs: Several Updates in one Request
● List and Show Functions: Transforming responses before serving them
Summing Up
● Apache CouchDB™ is a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API
● couchdbkit is a Python library providing access to Apache CouchDB
Thanks!
Time for Questions and Discussion
Downloads
https://slideshare.net/skoegl/couch-db-pythonpyconpl2012
https://github.com/stefankoegl/python-couchdb-examples
Stefan Kögl
@skoegl