2010 11-02-documents
Exploring Document Databases
ZendCon 2010
Documents, Documents, Documents
Matthew Weier O'Phinney, Project Lead, Zend Framework
Writing the
typical PHP app
design the schema
(http://musicbrainz.org/)
take input, and shove it in a DB
(http://musicbrainz.org/)
write queries to pull from the DB
$result = mysql_query(
    "SELECT * FROM sometable"
);
$rows = false;
if (mysql_num_rows($result) > 0) {
    $rows = array();
    while ($row = mysql_fetch_assoc($result)) {
        $rows[] = $row;
    }
}
spit data onto a page
(http://www.irs.gov/)
Profit!
Things
That Happen
Things
That Go Wrong
SQL Injection
performance issues
Expensive queries
Potentially ORM induced resource issues
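One common guard against the SQL injection risk listed above is a parameterized query. The following is a minimal sketch, assuming PDO with an in-memory SQLite database; the name column and the sample rows are hypothetical and not part of the original slides.

```php
<?php
// Hypothetical table and data, held in an in-memory SQLite database via PDO.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE sometable (id INTEGER PRIMARY KEY, name TEXT)');
$pdo->exec("INSERT INTO sometable (name) VALUES ('matthew')");
$pdo->exec("INSERT INTO sometable (name) VALUES ('evan')");

// Hostile input that would change a query built by string concatenation.
$input = "matthew' OR '1'='1";

// The placeholder keeps the input as data; it is never parsed as SQL.
$stmt = $pdo->prepare('SELECT * FROM sometable WHERE name = :name');
$stmt->execute(array(':name' => $input));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// The same statement with a legitimate value.
$stmt->execute(array(':name' => 'matthew'));
$match = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

With the bound parameter, the hostile string is compared as a literal name and matches nothing, while the legitimate value matches its single row.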
Design
Issues
Prague's Dancing House
design issues?
Many 1:1 or 1:N relationships
Non-trivial insert/update operations
Harder to hit indexes on read operations
For trivial stuff like tags, addresses, etc!
design issues?
Worse: Changing requirements
Additional columns needed?
Additional tables needed?
Occasional data needed?
design issues?
Entity-Attribute-Value Anti-Pattern
Often added after the fact, as requirements change or expand
Support arbitrary data for any record of any table
Can't do type enforcement
Leads to complex joins that often cannot hit table indexes
Leads to complex application logic to support retrieval and insertion of such metadata
Bill Karwin has written about this on his blog and in his book SQL Antipatterns
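The application logic that EAV pushes onto the developer can be sketched as follows. This is an illustration only: the row layout (entity_id, attribute, value) and the sample values are assumptions, not taken from the talk.

```php
<?php
// Hypothetical rows as an EAV metadata table might return them:
// one row per attribute, every value stored as a string.
$eavRows = array(
    array('entity_id' => 1, 'attribute' => 'published',   'value' => '1'),
    array('entity_id' => 1, 'attribute' => 'reviewed_by', 'value' => 'matthew'),
    array('entity_id' => 2, 'attribute' => 'published',   'value' => '0'),
);

// The application has to pivot the rows back into one record per entity...
$entities = array();
foreach ($eavRows as $row) {
    $entities[$row['entity_id']][$row['attribute']] = $row['value'];
}

// ...and restore types by hand, because the table cannot enforce them.
foreach ($entities as $id => $attributes) {
    if (isset($attributes['published'])) {
        $entities[$id]['published'] = ($attributes['published'] === '1');
    }
}
```

Every new attribute means more of this pivoting and casting code, which is the complexity the slide warns about.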
Design First
Use Domain Driven Design (DDD), or Behavior Driven Development (BDD)
Eric Evans has written the classic text on DDD, and performs DDD immersion classes regularly.
Develop your application logic first, in order to determine what needs to be persisted.
the primary rule
define your application entities
use Plain Old PHP Objects
class User
{
    public function getId() {}
    public function setId($value) {}
    public function getRealname() {}
    public function setRealname($value) {}
    public function getEmail() {}
    public function setEmail($value) {}
}
write tests
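class PostTest extends PHPUnit_Framework_TestCase
{
    private $post;

    public function setUp()
    {
        // Create a fresh Post before each test
        $this->post = new Post();
    }

    public function testRaisesExceptionOnInvalidDate()
    {
        $this->setExpectedException('InvalidArgumentException');
        $this->post->setDate('foo bar');
    }
}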
implement behaviors
class Post
{
    private $date;
    private $timezone = 'America/New_York';

    public function setDate($date)
    {
        if (false === strtotime($date)) {
            throw new InvalidArgumentException();
        }
        // DateTime expects a DateTimeZone object, not a timezone name
        $this->date = new DateTime($date, new DateTimeZone($this->timezone));
        return $this;
    }
}
NOW
determine
what data
you need
to persist.
Define a schema based on
the objects you use
(http://musicbrainz.org/)
map entities to data store
public function fromArray(array $data)
{
    $filter = new OptionsFilter();
    foreach ($data as $key => $value) {
        $method = 'set' . $filter($key);
        if (method_exists($this, $method)) {
            $this->$method($value);
        }
    }
}

public function toArray()
{
    return array(
        '_id'       => $this->getId(),
        'timestamp' => $this->getTimestamp(),
        'title'     => $this->getTitle(),
approaches
Transaction Scripts
Object Relational Maps (ORM)
Transaction scripts do not need to be strictly procedural; the pattern can also apply to OOP code using such patterns as Strategy, Visitor, etc.
use mappers or transaction scripts
to translate objects to data & back
$user = new User();
$user->setId('matthew')
     ->setRealname("Matthew Weier O'Phinney");
$mapper->save($user);
$user = $repository->find('matthew');
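What a mapper does behind save() and find() can be sketched as below. This is a minimal illustration under stated assumptions: UserMapper is a hypothetical class, a PHP array stands in for the real data store, and the same object plays the role of both $mapper and $repository.

```php
<?php
class User
{
    private $id;
    private $realname;

    public function getId() { return $this->id; }
    public function setId($value) { $this->id = $value; return $this; }
    public function getRealname() { return $this->realname; }
    public function setRealname($value) { $this->realname = $value; return $this; }

    public function toArray()
    {
        return array('_id' => $this->id, 'realname' => $this->realname);
    }

    public function fromArray(array $data)
    {
        $this->id       = isset($data['_id']) ? $data['_id'] : null;
        $this->realname = isset($data['realname']) ? $data['realname'] : null;
        return $this;
    }
}

// Hypothetical mapper: translates objects to stored arrays and back.
class UserMapper
{
    private $storage = array();

    public function save(User $user)
    {
        $this->storage[$user->getId()] = $user->toArray();
    }

    public function find($id)
    {
        if (!isset($this->storage[$id])) {
            return null;
        }
        $user = new User();
        return $user->fromArray($this->storage[$id]);
    }
}

$mapper = new UserMapper();
$user   = new User();
$user->setId('matthew')->setRealname("Matthew Weier O'Phinney");
$mapper->save($user);

$found = $mapper->find('matthew');
```

The entity knows nothing about storage; swapping the array for a relational or document back end would only change the mapper.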
additions
Service Layers
Interacts with domain entities
Good place for caching, ACLs, etc.
use service objects
to manipulate entities
namespace Blog\Service;

class Entries
{
    public function fetchEntry($permalink) {}
    public function fetchCommentCount($permalink) {}
    public function fetchComments($permalink) {}
    public function fetchTrackbacks($permalink) {}
    public function addComment($permalink, array $comment) {}
    public function addTrackback($permalink, array $comment) {}
    public function fetchTagCloud() {}
}
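As a rough illustration of why the service layer is a good place for caching, the sketch below wraps a single fetchEntry() method around a per-request cache. It is an assumption-laden example: the namespace is omitted for brevity, the constructor argument is a hypothetical callable standing in for a mapper, and the cache is a plain array.

```php
<?php
// Hypothetical, trimmed-down service: only fetchEntry() is shown.
class Entries
{
    private $fetcher;          // callable that loads an entry from storage
    private $cache = array();  // simple per-request cache

    public function __construct($fetcher)
    {
        $this->fetcher = $fetcher;
    }

    public function fetchEntry($permalink)
    {
        // Hit storage only on a cache miss.
        if (!array_key_exists($permalink, $this->cache)) {
            $this->cache[$permalink] = call_user_func($this->fetcher, $permalink);
        }
        return $this->cache[$permalink];
    }
}

$calls   = 0;
$service = new Entries(function ($permalink) use (&$calls) {
    $calls++; // count how often storage is actually consulted
    return array('_id' => $permalink, 'title' => 'Stub entry');
});

$first  = $service->fetchEntry('blog-post-stub');
$second = $service->fetchEntry('blog-post-stub');
```

Callers only see fetchEntry(); whether the result came from the cache or the data store is a detail of the service.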
Data
Persistence
you have a choice
Before, relational databases were the only choice
you have a choice
Today, relational databases are only one choice
have your domain dictate storage
Do you have many arbitrary, row-specific fields in the design?
Do you need many pivot tables to describe a single entity?
Is transactional integrity part of your requirements?
Do changes need to be immediately available?
defining by what it isn't?
still defining by what it isn't
types: key/value stores
Each record is a key/value pair (though the value may be non-scalar)
Interesting, but not what we're going to look at today.
types: document databases
Each document can define its own structure
Typically a document consists of many key/value pairs
This is what we'll look at!
{
    _id: "weierophinney",
    realname: "Matthew Weier O'Phinney",
    email: "[email protected]",
    roles: [
        "admin",
        "user"
    ]
}
document dbs are plentiful
Also mention Azure Tables
document dbs solve web problems
Data can expand and add properties over time
without requiring schema changes!
Different content types can
co-exist in the same general storage
document dbs solve web problems
Aggregate related content in the document that owns it
Tags
Comments
Addresses
Eventual consistency
Updates often don't need to propagate in real-time
types of problems documents solve
Blog and News Posts
Product Entries
Content Management documents
what don't they solve?
identifiers are king
Most are optimized for fetching via identifier
Provide your own IDs
Fall back on system-generated IDs (usually UUIDs)
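The "provide your own ID, otherwise fall back" rule can be sketched in application code as follows. Most document databases generate the fallback ID themselves; the helpers generateUuid() and ensureId() here are hypothetical and only illustrate the idea with a random (version 4) UUID.

```php
<?php
// Hypothetical helper: build a random (version 4) UUID string.
function generateUuid()
{
    $bytes    = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set RFC 4122 variant
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

// Keep a caller-supplied ID; otherwise fall back on a generated one.
function ensureId(array $document)
{
    if (empty($document['_id'])) {
        $document['_id'] = generateUuid();
    }
    return $document;
}

$named   = ensureId(array('_id' => 'weierophinney'));
$unnamed = ensureId(array('title' => 'Untitled draft'));
```

Meaningful IDs such as a username or permalink make the fast fetch-by-identifier path usable directly from application code.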
mapping documents to objects
Many utilize JSON
If they don't, abstractions let you sling PHP arrays
$result = $cxn->fetch($id);
$user = new User();
$user->fromArray($result);
$cxn->save($user->toArray());
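Because many of these stores speak JSON, the array produced by a toArray() method maps onto a document almost directly. A minimal sketch using PHP's built-in json_encode() and json_decode(); the sample document is illustrative.

```php
<?php
// The kind of array a toArray() method might produce.
$document = array(
    '_id'      => 'weierophinney',
    'realname' => "Matthew Weier O'Phinney",
    'roles'    => array('admin', 'user'),
);

// Serialize to the JSON a document database would store...
$json = json_encode($document);

// ...and decode back to a PHP array, ready to pass to fromArray().
$restored = json_decode($json, true);
```

Passing true as the second argument to json_decode() returns associative arrays rather than stdClass objects, which matches the array-based mapping used in the slides.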
aggregate metadata
Instead of EAV tables, store metadata in the document
{
    "_id" : "blog-post-stub",
    "published" : true,
    "reviewed" : true,
    "reviewed_by" : "matthew"
}
no pivot tables required!
Instead of pivot tables, aggregate data inside values
{ "_id" : "blog-post-stub", "tags" : [
"zend framework",
"presentations"
]}
It's not all walks in the park
increased disk usage
Each document contains its schema
Silver lining: most solutions can cluster and/or provide sharding.
schema differences
How do you keep schemas in sync between documents when requirements change?
If you have multiple schemas for the same document type, what do you query on?
firstName or FIRST_NAME?
Did you remember to create new indexes?
managing schema changes
Handle the differences in your application code
switch ($user->schema_version) {
    case '2010-01-31':
        // ...
        break;
    case '2010-11-02':
        // ...
        break;
}
Meh.
managing schema changes
Do a batch conversion
Copy all records to a new database or collection
Migrate all records to the new schema
Point your application to the new database/collection
Meh.
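The batch conversion steps above can be sketched as a simple loop. This is an illustration under assumptions: collections are modelled as PHP arrays keyed by document ID, migrateDocument() is a hypothetical helper, and the METADATA to metadata rename is borrowed from the talk's later example.

```php
<?php
// Hypothetical old collection, keyed by document ID.
$oldCollection = array(
    'blog-post-stub' => array(
        '_id'            => 'blog-post-stub',
        'schema_version' => '2010-01-31',
        'METADATA'       => array('reviewed' => true),
    ),
    'second-post' => array(
        '_id'            => 'second-post',
        'schema_version' => '2010-01-31',
    ),
);

// Hypothetical helper: bring one document up to the latest schema.
function migrateDocument(array $doc, $latest)
{
    if (isset($doc['METADATA'])) {
        $doc['metadata'] = $doc['METADATA'];
        unset($doc['METADATA']);
    }
    $doc['schema_version'] = $latest;
    return $doc;
}

// Copy every record into a new collection, migrating as we go.
$latest        = '2010-11-02';
$newCollection = array();
foreach ($oldCollection as $id => $doc) {
    $newCollection[$id] = migrateDocument($doc, $latest);
}
// The application would then be pointed at $newCollection.
```

The old collection is left untouched, so the switch-over can be reverted until the application is repointed.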
managing schema changes
Version the document schema
Update when fetched
{
    "_id" : "blog-post-stub",
    "schema_version" : "2010-11-02"
}

if ($post->schema_version != $latest) {
    $post->metadata = $post->METADATA;
    $post->schema_version = $latest;
    unset($post->METADATA);
    $mapper->save($post);
}
Use AOP-like practices such as SignalSlot, Subject/Observer, etc., to help automate this.
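One way to automate the upgrade-on-fetch step is to let the mapper notify listeners after every fetch. The sketch below is a simplified observer arrangement, not a specific SignalSlot library: PostMapper, onFetch() and raw() are hypothetical names, posts are plain arrays, and an array stands in for the data store.

```php
<?php
// Hypothetical mapper that notifies listeners after each fetch.
class PostMapper
{
    private $storage;
    private $listeners = array();

    public function __construct(array $storage) { $this->storage = $storage; }

    // Register a listener to run on every fetched post.
    public function onFetch($listener) { $this->listeners[] = $listener; }

    public function save(array $post) { $this->storage[$post['_id']] = $post; }

    public function find($id)
    {
        $post = $this->storage[$id];
        foreach ($this->listeners as $listener) {
            $post = call_user_func($listener, $post, $this);
        }
        return $post;
    }

    // Expose the stored form, for inspection only.
    public function raw($id) { return $this->storage[$id]; }
}

$latest = '2010-11-02';
$mapper = new PostMapper(array(
    'blog-post-stub' => array(
        '_id'            => 'blog-post-stub',
        'schema_version' => '2010-01-31',
        'METADATA'       => array('reviewed' => true),
    ),
));

// The listener upgrades stale documents and writes them back.
$mapper->onFetch(function (array $post, PostMapper $mapper) use ($latest) {
    if ($post['schema_version'] != $latest) {
        $post['metadata'] = $post['METADATA'];
        unset($post['METADATA']);
        $post['schema_version'] = $latest;
        $mapper->save($post);
    }
    return $post;
});

$post = $mapper->find('blog-post-stub');
```

Calling code never sees the old schema, and each document is rewritten at most once, the first time it is read after a schema change.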
benefits you may enjoy
Easier mapping of document concepts to data persistence
Easier scaling
Most support clustering and sharding natively
Easier migration to cloud-based storage
Closing Notes
Don't start your development from the wrong end. Start with objects.
Be aware of all the options you have for persisting data; choose appropriately.
Consider document data stores when your objects represent content; store metadata in the document.
Thank you
Feedback?
http://joind.in/2233
http://twitter.com/weierophinney
http://framework.zend.com/