nosql presentation

Post on 07-Dec-2014

DESCRIPTION

Presentation given at NoSql EU conference describing architectures past, present & future for guardian.co.uk

TRANSCRIPT

NoSql at guardian.co.uk

Matthew Wall

Simon Willison

Not only SQL

Guardian journalism online: 1995

Guardian journalism online: 1999

Guardian journalism online: 2000

Guardian journalism online: 2010

Read all about it!

I bring you NEWS!!!

[Architecture diagram: three web servers, three app servers, Memcached (20Gb), Oracle, CMS, data feeds]

Why RDBMS?

5 years ago, fewer alternatives

Understand operations procedures

Can easily recruit DBAs / devs

Developer/ops tools

Business critical system: a safe choice

Related content from search engine

Introduction of memcached

Big traffic spike

Distributed memcached

Protects database from peak load

Entities explicitly decached

Queries given TTL

memcached = database supercharger
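The caching rules on this slide can be sketched roughly as follows. This is an illustration only, not the guardian.co.uk implementation: `TTLCache` is a hypothetical in-process stand-in for a memcached client, and the key names, the `db` dictionary and the 60-second TTL are all assumptions.

```python
import time


class TTLCache:
    """Hypothetical in-process stand-in for a memcached client."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time or None)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # entry has outlived its TTL
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def delete(self, key):
        self._data.pop(key, None)


def load_article(cache, db, article_id):
    """Entities are cached without a TTL and explicitly decached on write."""
    key = f"article:{article_id}"
    article = cache.get(key)
    if article is None:
        article = db["articles"][article_id]  # only cache misses reach the database
        cache.set(key, article)
    return article


def save_article(cache, db, article_id, article):
    db["articles"][article_id] = article
    cache.delete(f"article:{article_id}")  # explicit decache


def latest_headlines(cache, db, ttl=60):
    """Query results are cached with a TTL instead of being invalidated."""
    key = "query:latest_headlines"
    result = cache.get(key)
    if result is None:
        result = sorted(a["headline"] for a in db["articles"].values())
        cache.set(key, result, ttl=ttl)
    return result
```

The split reflects the two bullets: a single entity has an obvious key to delete when it changes, whereas it is hard to know which cached queries a change affects, so those are simply allowed to expire.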

Now we have a stable “broadcast” platform

We know how to scale it

SQL running effectively at core

We’ve finished, right?

Digital journalism is changing

We can’t cover everything

We can’t compete with everyone

Need to be “part of the web” not just “on the web”

Mutualise the news!

Mutualised news!

Mutualisation of journalism

No longer only broadcasting content

User engagement & contribution: journalism, data, software

Data curation / linked data

Support engaged developers with data and APIs

Be a part of the data fabric of the internet

Platform strategy

Out: Release our data to the world via APIs

In: Rapidly build new functionality outside the core

Write: Ingest, store & present arbitrary data

Data Out

Content API

Content API

Delivered using Apache Solr

Document oriented search engine

Loose schema: records, fields, facets

Fields can be multi-value

Supports dynamic field generation

Can apply multiple facets in queries faster than RDBMS
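As a rough illustration of applying several facets in one query, the sketch below builds a Solr select URL using Solr's standard `q`, `fq`, `facet`, `facet.field` and `wt` request parameters. The base URL, the core name `content` and the field names (`section`, `tag`, `contributor`) are assumptions, not the Content API's real schema, and values containing spaces would need Solr query quoting that this sketch does not handle.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, filters, facet_fields, rows=10):
    """Build a Solr select URL with filter queries and several facets."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    # Each fq narrows the result set, much like a condition in a WHERE clause.
    params += [("fq", f"{field}:{value}") for field, value in filters.items()]
    # Each facet.field asks Solr for value counts over the matching documents.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical usage: politics articles matching "election", faceted two ways.
example_url = build_facet_query(
    "http://localhost:8983/solr/content",
    "election",
    {"section": "politics"},
    ["tag", "contributor"],
)
```

A single request like this returns both the matching documents and the per-field counts, which in an RDBMS would usually take one query per facet.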

Is Solr a database?

Can perform complex queries, including full text search

Can filter results with facets (WHERE clause)

ANYTHING can be a facet. Very powerful.

On our dataset most queries are of a similar cost

Scales very well horizontally

Handles millions of documents

No transactions

Excellent for certain types of queries

Not truly general purpose

Schema design very important

Search index not really persistence

[Architecture diagram: web servers, app server, CMS, Memcached (20Gb) and rdbms in the core; Solr; M/Q; Solr instances and API in the cloud (EC2)]

API

Currently powering iPad app

Site components

External applications

Editors tools

More to follow

Data In

Application framework

Application framework

Simple REST/HTTP framework allows lightweight development

Applications proxied for performance

Apps generally hosted in the cloud, hot deployment into production

No RDBMS provided for storage

Can develop in news timeline
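The slides do not describe the framework itself, so the following is only a sketch of the general shape: a small HTTP application exposing one read-only resource, with a cache header so that a front-end proxy can absorb repeat requests. It uses Python's standard WSGI calling convention; the path `/poll-results`, the payload and the 60-second `max-age` are assumptions.

```python
import json


def app(environ, start_response):
    """Tiny WSGI app: one read-only JSON resource, cacheable by a proxy."""
    is_get = environ.get("REQUEST_METHOD") == "GET"
    if not is_get or environ.get("PATH_INFO") != "/poll-results":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    body = json.dumps({"yes": 10, "no": 4}).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # Lets the proxy serve repeat requests without hitting the app.
        ("Cache-Control", "public, max-age=60"),
    ]
    start_response("200 OK", headers)
    return [body]
```

For local experiments it could be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library.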

[Architecture diagram: web servers, app server, CMS, Memcached (20Gb) and rdbms in the core; M/Q; proxy; apps on external hosting (App Engine etc.)]

NoSQL for journalism

Some useful characteristics

• Scale down as well as up

• Support rapid production-ready prototyping: turn projects around in hours or days

• Handle massive traffic spikes

Desktop analysis

• Leaked BNP membership list

• Load postcodes to constituencies mapping into Redis

• Generate heatmaps by looking up all 12,000 postcodes
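The lookup step might look something like the sketch below. `KeyValueStore` is a hypothetical in-memory stand-in exposing Redis-style `set` and `get`, so the example runs without a server; the key prefix `postcode:` and the sample postcodes and constituency names are made up for illustration.

```python
from collections import Counter


class KeyValueStore:
    """Hypothetical in-memory stand-in exposing Redis-style SET and GET."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def normalise(postcode):
    """Strip spaces and upper-case so 'aa1 1aa' and 'AA11AA' share a key."""
    return postcode.replace(" ", "").upper()


def load_mapping(store, rows):
    """Load (postcode, constituency) pairs, one key per postcode."""
    for postcode, constituency in rows:
        store.set(f"postcode:{normalise(postcode)}", constituency)


def members_per_constituency(store, member_postcodes):
    """Count list entries per constituency; these counts would drive a heatmap."""
    counts = Counter()
    for postcode in member_postcodes:
        constituency = store.get(f"postcode:{normalise(postcode)}")
        if constituency is not None:  # skip postcodes missing from the mapping
            counts[constituency] += 1
    return counts
```

Each lookup is a single key read, so a list of around 12,000 postcodes is a small job for a key-value store on a desktop machine.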

MPs’ expenses

SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()

v2 used Redis

Set difference: Labour MP pages - reviewed pages

SRANDMEMBER
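A rough sketch of the v2 idea: keep page ids in sets, subtract the reviewed set, and pick a random member of what is left. `SetStore` is a hypothetical in-memory stand-in whose methods mirror the Redis commands SADD, SDIFFSTORE and SRANDMEMBER; the key names are assumptions.

```python
import random


class SetStore:
    """Hypothetical in-memory stand-in for the Redis set commands used here."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        self._sets.setdefault(key, set()).update(members)

    def sdiffstore(self, dest, first, *others):
        # Store (first - others...) under dest and return its size.
        result = set(self._sets.get(first, set()))
        for key in others:
            result -= self._sets.get(key, set())
        self._sets[dest] = result
        return len(result)

    def srandmember(self, key):
        members = self._sets.get(key)
        return random.choice(sorted(members)) if members else None


def next_page_to_review(store):
    # Unreviewed = Labour MP pages minus pages already reviewed.
    store.sdiffstore("pages:labour:unreviewed", "pages:labour", "pages:reviewed")
    return store.srandmember("pages:labour:unreviewed")
```

Compared with the `ORDER BY RAND()` query, which typically has to order every candidate row just to return one, picking a random set member avoids sorting anything.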

BigTable: Zeitgeist

Zeitgeist stores pre-calculated results in BigTable

• Data comes in from stats system, comments system and OneRiot real-time search API

• AppEngine cron tasks populate task queues

• Task queues recalculate hotness levels

• “Live” BigTable queries are simple SELECT / SORT
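The split between background recalculation and simple live reads can be sketched as below. The slides do not give the scoring formula, so the weighted sum, the signal names and the weights are all assumptions; plain dictionaries stand in for the stored results.

```python
def recalculate_hotness(signals, weights):
    """Background task: combine raw signals into one hotness score per item."""
    return {
        item: sum(weights[name] * value for name, value in item_signals.items())
        for item, item_signals in signals.items()
    }


def hottest(precalculated, limit=10):
    """Live query: a plain sort over stored scores, no calculation at read time."""
    ranked = sorted(precalculated.items(), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:limit]]


# Hypothetical signals for three articles.
signals = {
    "article-a": {"views": 100, "comments": 2},
    "article-b": {"views": 10, "comments": 50},
    "article-c": {"views": 1, "comments": 0},
}
scores = recalculate_hotness(signals, {"views": 1, "comments": 5})
top_two = hottest(scores, limit=2)
```

All the expensive work happens in `recalculate_hotness`, which would run from a task queue; the page request only ever calls something like `hottest`.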

Live debate poll

• Over a million votes cast in an hour

• Stretched limits of BigTable / AppEngine

• Sharded counter pattern to handle writes
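The sharded counter pattern spreads writes for one logical counter over several independent records, so that no single record has to absorb every vote. The sketch below is an in-memory illustration only: each list slot stands in for a separate datastore entity, and the shard count of 20 is an arbitrary assumption.

```python
import random


class ShardedCounter:
    """In-memory sketch; each shard stands in for a separate datastore entity."""

    def __init__(self, name, num_shards=20):
        self.name = name
        self.shards = [0] * num_shards

    def increment(self, amount=1):
        # Writes land on a random shard, so no single entity takes every write.
        index = random.randrange(len(self.shards))
        self.shards[index] += amount

    def total(self):
        # Reads pay the cost of summing every shard.
        return sum(self.shards)
```

The trade-off is cheaper, less contended writes in exchange for a more expensive read, which suits a poll where votes arrive far faster than totals need refreshing.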

Spreadsheets are NoSQL too...

Google Docs powered infographics

The Datablog

• Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets

• Retrieve data as CSV, XLS, JSON, Atom...

• “Make a copy” and run your own analysis

Write

Arbitrary data

Create schema-free database alongside RDBMS

Index in Solr

Provide access in API

Investigating: CouchDB

[Architecture diagram: web servers, app server, CMS, data feeds, Memcached (20Gb) and rdbms in the core; M/Q; Out: Solr instances in the cloud (EC2); In: apps behind a proxy on external hosting (App Engine etc.); CouchDB? alongside the rdbms]
