torodb: scaling postgresql like mongodb

43
ToroDB SCALING PostgreSQL LIKE MongoDB @NoSQLonSQL Álvaro Hernández <[email protected]>

Upload: 8kdata-technology

Post on 16-Apr-2017

2.718 views

Category:

Software


3 download

TRANSCRIPT

Page 1: ToroDB: scaling PostgreSQL like MongoDB

ToroDBSCALING PostgreSQL

LIKE MongoDB

@NoSQLonSQL

Álvaro Hernández <[email protected]>

Page 2: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

About *8Kdata*

● Research & Development in databases

● Consulting, Training and Support in PostgreSQL

● Founders of PostgreSQL España, 5th largest PUG in the world (>500 members as of today)

● About myself: CTO at 8Kdata:@ahachetehttp://linkd.in/1jhvzQ3

www.8kdata.com

Page 3: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB in brief

Page 4: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB in one slide

● Document-oriented, JSON, NoSQL db

● Open source (AGPL)

● MongoDB compatibility (wire protocol level)

● Uses PostgreSQL as a storage backend

Page 5: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB storage

● Data is stored in tables. No blobs

● JSON documents are split by hierarchy levels into “subdocuments”, which contain no nested structures. Each subdocument level is stored separately

● Subdocuments are classified by “type”. Each “type” maps to a different table

Page 6: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB storage (II)

● A “structure” table keeps the subdocument “schema”

● Keys in JSON are mapped to attributes, which retain the original name

● Tables are created dinamically and transparently to match the exact types of the documents

Page 7: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB storage internals

{ "name": "ToroDB", "data": { "a": 42, "b": "hello world!" }, "nested": { "j": 42, "deeper": { "a": 21, "b": "hello" } }}

Page 8: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB storage internals

The document is split into the following subdocuments:

{ "name": "ToroDB", "data": {}, "nested": {} }

{ "a": 42, "b": "hello world!"}

{ "j": 42, "deeper": {}}

{ "a": 21, "b": "hello"}

Page 9: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB storage internals

select * from demo.t_3┌─────┬───────┬────────────────────────────┬────────┐│ did │ index │ _id │ name │├─────┼───────┼────────────────────────────┼────────┤│ 0 │ ¤ │ \x5451a07de7032d23a908576d │ ToroDB │└─────┴───────┴────────────────────────────┴────────┘select * from demo.t_1┌─────┬───────┬────┬──────────────┐│ did │ index │ a │ b │├─────┼───────┼────┼──────────────┤│ 0 │ ¤ │ 42 │ hello world! ││ 0 │ 1 │ 21 │ hello │└─────┴───────┴────┴──────────────┘select * from demo.t_2┌─────┬───────┬────┐│ did │ index │ j │├─────┼───────┼────┤│ 0 │ ¤ │ 42 │└─────┴───────┴────┘

Page 10: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB storage internals

select * from demo.structures┌─────┬────────────────────────────────────────────────────────────────────────────┐│ sid │ _structure │├─────┼────────────────────────────────────────────────────────────────────────────┤│ 0 │ {"t": 3, "data": {"t": 1}, "nested": {"t": 2, "deeper": {"i": 1, "t": 1}}} │└─────┴────────────────────────────────────────────────────────────────────────────┘

select * from demo.root;┌─────┬─────┐│ did │ sid │├─────┼─────┤│ 0 │ 0 │└─────┴─────┘

Page 11: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB storage and I/O savings

29% - 68% storage required,compared to Mongo 2.6

Page 12: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB: query “by structure”

● ToroDB is effectively partitioning by type

● Structures (schemas, partitioning types) are cached in ToroDB memory

● Queries only scan a subset of the data

● Negative queries are served directly from memory

Page 13: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ScalingToroDB like MongoDB

Page 14: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Big Data: NoSQL vs SQL

vs

http://www.networkworld.com/article/2226514/tech-debates/what-s-better-for-your-big-data-application--sql-or-nosql-.html

Page 15: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Scalability?

Page 16: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Ways to scale

● Vertical scalability➔ Concurrency scalability➔ Hardware scalability➔ Query scalability

● Read scalability (replication)

● Write scalability (horizontal, sharding)

Page 17: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Vertical scalability

Concurrency scalability

● SQL is usually better (e.g. PostgreSQL):➔ Finer locking➔ MVCC➔ better caching

● NoSQL often needs sharding within the same host to scale

Page 18: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Vertical scalability

Hardware scalability● Scaling with the number of cores?● Process/threading model?

Query scalability● Use of indexes? Use of more than one?● Table/collection partitioning?● ToroDB “by-type” partitioning

Page 19: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Read scalability: replication

● Replicate data to slave nodes, available read-only: scale-out reads

● Both NoSQL and SQL support it

● Binary replication usually faster (e.g. PostgreSQL's Streaming Replication)

● Not free from undesirable phenomena

Page 20: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads

Dirty and stale reads(“call me maybe”)

Page 21: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

MongoDB write acknowledge

https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads

Page 22: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

MongoDB's dirty and stale reads

Dirty readsA primary in minority accepts a write that other clients see, but it later steps down, write is rolled back (fixed in 3.2?)

Stale readsA primary in minority serves a value that ought to be current, but a newer value was written to the other primary in minority

Page 23: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Write scalability(sharding)

● NoSQL better prepared than SQL

● But many compromises in data modeling (schema design): no FKs

● There are also solutions for SQL:➔ Shared-disk, limited scalability (RAC)➔ Sharding (like pg_shard)➔ PostgreSQL's FDWs

Page 24: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Read scalability (replication)in MongoDB and ToroDB

Page 25: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Replication protocol choice

● ToroDB is based on PostgreSQL

● PostgreSQL has either binary streaming replication (async or sync) or logical replication

● MongoDB has logical replication

● ToroDB uses MongoDB's protocol

Page 26: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

MongoDB's replication protocol

● Every change is recorded in JSON documents, idempotent format (collection: local.oplog.rs)

● Slaves pull these documents from master (or other slaves) asynchronously

● Changes are applied and feedback is sent upstream

Page 27: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

MongoDB replication implementation

https://github.com/stripe/mosql

Page 28: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Announcing ToroDB v0.4 (snap)Supporting MongoDB replication

Page 29: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB v0.4● ToroDB works as a secondary slave of a MongoDB master (or slave)

● Implements the full replication protocol (not an oplog tailable query)

● Replicates from Mongo to a PostgreSQL

● Open source github.com/torodb/torodb (repl branch, version 0.4-SNAPSHOT)

Page 30: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Advantages of ToroDB w/ replication

● Native SQL

● Query “by type”

● Better SQL scaling● Less concurrency contention● Better hardware utilization

● No need for ETL from Mongo to PG!

Page 31: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

● NoSQL is trying to get back to SQL

● ToroDB is SQL native!

● Insert with Mongo, query with SQL!

● Powerful PostgreSQL SQL: window functions, recursive queries, hypothetical aggregates, lateral joins, CTEs, etc

ToroDB: native SQL

Page 32: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

ToroDB: native SQL

Page 33: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

db.bds15.insert({id:5, person: {

name: "Alvaro", surname: "Hernandez",contact: { email: "[email protected]", verified: true }}

})

db.bds15.insert({id:6, person: {

name: "Name", surname: "Surname", age: 31,contact: { email: "[email protected]" }}

})

Introducing ToroDB VIEWs

Page 34: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

torodb$ select * from bds15.person ;┌─────┬───────────┬────────┬─────┐│ did │ surname │ name │ age │├─────┼───────────┼────────┼─────┤│ 0 │ Hernandez │ Alvaro │ ¤ ││ 1 │ Surname │ Name │ 31 │└─────┴───────────┴────────┴─────┘(2 rows)

torodb$ select * from bds15."person.contact";┌─────┬──────────┬────────────────────────┐│ did │ verified │ email │├─────┼──────────┼────────────────────────┤│ 0 │ t │ [email protected] ││ 1 │ ¤ │ [email protected] │└─────┴──────────┴────────────────────────┘(2 rows)

Introducing ToroDB VIEWs

Page 35: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

VIEWs, ToroDB from any SQL tool

Page 36: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Enabling real DWon MongoDB:

ToroDB + Greenplum

Page 37: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

● GreenPlum was open sourced 6 days ago

● ToroDB already runs on GP!

● Goal: ToroDB v0.4 replicates from a MongoDB master onto Toro-Greenplum and run the reporting using distributed SQL

ToroDB on GP: SQL DW for MongoDB

Page 38: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

SELECT count( distinct( "reviewerID" ))FROM reviews;

Queries: which one is easier?

db.reviews.aggregate([{ $group: { _id: "reviewerID"}},{ $group: {_id: 1, count: { $sum: 1}}}])

Page 39: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

SELECT "reviewerName", count(*) as reviews FROM reviews GROUP BY "reviewerName" ORDER BY reviews DESC LIMIT 10;

Queries: which one is easier?

db.reviews.aggregate([ { $group : { _id : '$reviewerName', r : { $sum : 1 } } }, { $sort : { r : -1 } }, { $limit : 10 } ], {allowDiskUse: true})

Page 40: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

● Amazon reviews datasetImage-based recommendations on styles and substitutesJ. McAuley, C. Targett, J. Shi, A. van den HengelSIGIR, 2015

● AWS c4.xlarge (4vCPU, 8GB RAM) 4KIOPS SSD EBS

● 4x shards, 3x config; 4x segments GP

● 83M records, 65GB plain json

Benchmark

Page 41: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Disk usage

Mongo 3.0, WT, SnappyGP columnar, zlib level 9

table size index size total size0

10000000000

20000000000

30000000000

40000000000

50000000000

60000000000

70000000000

80000000000

Storage requirements

MongoDB vs ToroDB on Greenplum

Mongo

ToroDB on GP

byt

es

Page 42: ToroDB: scaling PostgreSQL like MongoDB

ToroDB @NoSQLonSQL

Query times

3 different queriesQ3 on MongoDB: aggregate fails

27,95 74,87 00

200

400

600

800

1000

1200

9691007

035 13 31

Query duration (s)

MongoDB vs ToroDB on Greenplum

MongoDB

ToroDB on GP

speedup

seco

nd

s

Page 43: ToroDB: scaling PostgreSQL like MongoDB