Fractal Tree Indexes: From Theory to Practice

Fractal Tree® Indexes: Theory to Practice

Percona Live London 2013

Tim Callaghan, Tokutek

tim@tokutek.com

@tmcallaghan

Tuesday, November 12, 2013

Ever seen this?

IO Utilization Graph, performance is IO limited

Who is Tokutek?

Tokutek builds high-performance database software!

TokuDB - storage engine for MySQL and MariaDB

TokuMX - storage engine for MongoDB

Developer Interface
Storage Engine
Storage (HDD & SSD)

Who am I?

• 17 year database consumer
  – schema design, development, deployment
  – database administration + infrastructure
  – mostly Oracle

• 5 year database producer
  – 2 years @ VoltDB
  – 2+ years @ Tokutek

Housekeeping

• Feedback is important to me
  – Ideas for Webinars or Presentations?

• Who’s using MongoDB?

• Anyone using TokuDB or TokuMX?

• Please ask questions

Agenda

• Why Fractal Tree indexes are cool
• What they enable in MySQL® (TokuDB)
• What they enable in MongoDB® (TokuMX)

• Q+A

Indexing: B-trees and Fractal Tree Indexes

B-trees

B-tree Overview - vocabulary

Internal Nodes - Path to data

Leaf Nodes - Actual Data - Sorted

Pointers

Pivots

B-tree Overview - example

Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]

* Pivot Rule is >=

B-tree Overview - search

Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]

“Find 25”
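The search can be sketched in a few lines of Python. This is a minimal illustration, not TokuDB or InnoDB code: the leaf contents come from the slide, but the pivot values (22 at the root, 10 and 99 below it) and the class names are assumptions made for the example.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys              # sorted data

class Internal:
    def __init__(self, pivots, children):
        self.pivots = pivots          # sorted; len(children) == len(pivots) + 1
        self.children = children

def search(node, key):
    """Return (found, nodes_visited) for a root-to-leaf search."""
    visited = 1
    while isinstance(node, Internal):
        # Pivot rule ">=": a key equal to or above a pivot goes to its right.
        node = node.children[bisect_right(node.pivots, key)]
        visited += 1
    return key in node.keys, visited

# Assumed pivots over the slide's leaves.
tree = Internal([22], [
    Internal([10], [Leaf([2, 3, 4]), Leaf([10, 20])]),
    Internal([99], [Leaf([22, 25]), Leaf([99])]),
])

found, visited = search(tree, 25)     # "Find 25"
```

Each node visited stands in for a potential IO when the node is not already in memory.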

B-tree Overview - insert

Leaf nodes: [2, 3, 4] [10, 15, 20] [22, 25] [99]

“Insert 15”

B-tree Overview - performance

Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]

Performance is IO limited when data > RAM: one IO is needed for each insert/update (actually, one IO for every index on the table).

Fractal Tree Indexes

similar to B-trees
• store data in leaf nodes
• use index key for ordering

different than B-trees
• message buffers
• big nodes (4MB vs. ~16KB)

All internal nodes have message buffers

As buffers overflow, they cascade down the tree

Messages are eventually applied to leaf nodes

Fractal Tree Indexes - sample data

Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]

Looks a lot like a b-tree!

insert 15;

Fractal Tree Indexes - insert

Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]

insert (15)

• search operations must consider messages along the way
• messages cascade down the tree as buffers fill up
• they are eventually applied to the leaf nodes, hundreds or thousands of operations for a single IO
• CPU and cache are conserved as important data is not ejected
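A toy model makes the buffering idea concrete. The sketch below is a simplification under stated assumptions, not the Fractal Tree implementation: it has a single buffer at the root instead of one per internal node, the pivot values and `buffer_limit` are made up, and `leaf_writes` is only a stand-in for IO.

```python
from bisect import bisect_right

class BufferedTree:
    """Toy two-level tree: one root message buffer over sorted leaves."""

    def __init__(self, pivots, leaves, buffer_limit=4):
        self.pivots = pivots              # assumed pivot values
        self.leaves = leaves              # list of sorted key lists
        self.buffer = []                  # pending ("insert", key) messages
        self.buffer_limit = buffer_limit  # arbitrary small limit for the demo
        self.leaf_writes = 0              # stand-in for IO count

    def insert(self, key):
        # An insert is just a message appended to the buffer.
        self.buffer.append(("insert", key))
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Group messages by target leaf, then write each touched leaf once.
        by_leaf = {}
        for _op, key in self.buffer:
            by_leaf.setdefault(bisect_right(self.pivots, key), []).append(key)
        for idx, keys in by_leaf.items():
            self.leaves[idx] = sorted(set(self.leaves[idx]) | set(keys))
            self.leaf_writes += 1
        self.buffer.clear()

    def contains(self, key):
        # A search must consider messages still sitting in the buffer.
        if any(k == key for _op, k in self.buffer):
            return True
        return key in self.leaves[bisect_right(self.pivots, key)]

tree = BufferedTree([10, 22, 99], [[2, 3, 4], [10, 20], [22, 25], [99]])
tree.insert(15)
pending = tree.contains(15) and tree.leaf_writes == 0  # visible, no leaf IO yet
for k in (16, 17, 18):
    tree.insert(k)                                     # fourth insert flushes
```

In this run four inserts cost one leaf write. The slide's "hundreds or thousands of operations for a single IO" comes from the same effect with much larger buffers and nodes.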

Fractal Tree Indexes - other operations

Leaf nodes: [2, 3, 4] [10, 20] [22, 25] [99]

Buffered messages: add_column(c4 bigint), delete(99), increment(22,+5), ...

Buffered messages: insert(100), delete(8), delete(2), insert(8)

Lots of operations can be messages!
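As a rough illustration of different operations travelling as messages, the sketch below applies a queue of insert, delete, and increment messages to a leaf in arrival order. The tuple encoding, the arrival order, and the use of a dict of key -> counter for the leaf are assumptions for the example. Schema messages such as add_column are left out.

```python
def apply_messages(leaf, messages):
    """Apply buffered messages to a leaf (dict of key -> value) in order."""
    for msg in messages:
        op, key = msg[0], msg[1]
        if op == "insert":
            leaf[key] = msg[2]
        elif op == "delete":
            leaf.pop(key, None)       # deleting an absent key is a no-op
        elif op == "increment":
            if key in leaf:
                leaf[key] += msg[2]
        else:
            raise ValueError(f"unknown message type: {op}")
    return leaf

# Keys from the slide, each with a counter value of 0.
leaf = {k: 0 for k in (2, 3, 4, 10, 20, 22, 25, 99)}
messages = [
    ("insert", 100, 0),
    ("delete", 8),                    # 8 is absent at this point
    ("delete", 2),
    ("insert", 8, 0),                 # arrives after delete(8), so 8 survives
    ("increment", 22, 5),
    ("delete", 99),
]
apply_messages(leaf, messages)
```

Order matters: delete(8) followed by insert(8) leaves key 8 present, which is why messages have to be applied in the order they were issued.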

TokuDB

Fractal Tree Indexing + MySQL/MariaDB

What is TokuDB?

• Transactional MySQL Storage Engine - think InnoDB
• Available for MySQL 5.5 and MariaDB 5.5
• ACID and MVCC
• Free/OSS Community Edition
  – http://github.com/Tokutek/ft-engine
• Enterprise Edition
  – Commercial support + hot backup

Performance + Compression + Agility

TokuDB Performance

Warning - Benchmarks Ahead!

Indexed Insertion Performance

• High-performance insert/update/delete for large databases (> RAM) while maintaining indexes

* old numbers, now > 25K/sec

Sysbench Performance

Sysbench read/write workload, > RAM

Performance Advantages

The fastest IO is the one you never have to do (compression)

• Efficient index maintenance, especially secondary indexes
• Clustered secondary indexes
  – Additional copy of the row is stored in the index
  – No additional IO to get row data from primary key
  – Think better covering index (all non-indexed columns)
  – Compression eliminates size concerns
• Big blocks = sequential IO for range scans
  – Basement nodes are always co-located
• Multi-threaded bulk loader

TokuDB Compression

Compression: TokuDB vs. InnoDB

• InnoDB compression misses force node splits, which greatly reduces performance
  – MySQL 5.6 “dynamic padding” (from FB), less cache
• Larger block size and flexible on-disk size wins!
• Multiple compression algorithms (lzma, quicklz, zlib)
• Larger, less frequent writes (much less IO)
• Why it matters on spinning disks:
  – Compressed reads and amortized compressed writes overcome IO limitations
• Why it matters on flash/SSD:
  – Buy less: 250GB at 10x compression holds the same data as 2.5TB
  – Large/less frequent writes are flash friendly
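The flash sizing arithmetic is simple enough to check directly. The helper names below are hypothetical, and the 10x ratio is the slide's figure, not a guarantee. Actual ratios depend on the data.

```python
def effective_capacity_gb(raw_gb: float, compression_ratio: float) -> float:
    """Logical data that fits on raw_gb of storage at the given ratio."""
    return raw_gb * compression_ratio

def raw_needed_gb(logical_gb: float, compression_ratio: float) -> float:
    """Physical storage needed to hold logical_gb of uncompressed data."""
    return logical_gb / compression_ratio

flash_gb = 250
ratio = 10                                   # the slide's 10x figure
capacity = effective_capacity_gb(flash_gb, ratio)
```

Here `capacity` is 2500 GB, i.e. 2.5TB of logical data on a 250GB device.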

Compression + IO Reduction

• Server was at 90% IO utilization with InnoDB, 10% IO utilization with TokuDB

Compression Performance

• iiBench benchmark

Compression Achieved

• log data (extremely compressible)

TokuDB Agility

The Challenge of MySQL Schema Changes

• Common schema changes can take hours in MySQL
  – Adding, dropping, or expanding a column
  – Adding an index
• And the table is unavailable for writes during the process
• As a workaround, people generally
  – Use a replication slave, then swap with master
  – Use helper tools: Percona OSC, MySQL 5.6
    o These have IO, CPU, RAM consequences

Schema Changes Without Downtime

• In TokuDB, column add/drop/expand is instantaneous
  – “it’s just a message”
• Indexes can be created in the background while table is fully available
  – TokuDB just builds the index, it does not rebuild the table (MySQL getting better)

TokuMX

Fractal Tree Indexing + MongoDB

What is TokuMX?

• TokuMX = MongoDB with improved storage (Fractal Tree indexes)

• Drop in replacement for MongoDB v2.2 applications
  – Including replication and sharding
  – Same data model
  – Same query language
  – Drivers just work
• Open Source
  – http://github.com/Tokutek/mongo

Performance + Compression + Transactions

MongoDB Storage

PK index (_id + pointer)
pivots: 4, 5555
entries: (1,ptr5) (4,ptr1) (12,ptr8) (19,ptr7) (10000,ptr2)

Secondary index (foo + pointer)
pivots: 40, 120
entries: (2,ptr5) (22,ptr6) (50,ptr4) (100,ptr7) (222,ptr3)

memory mapped heap

The “pointer” tells MongoDB where to look in the heap for the requested document (another IO)

db.test.insert({foo:55})
db.test.ensureIndex({foo:1})

TokuMX Storage

PK index (_id + document)
pivots: 4, 5555
entries: (1,doc) (4,doc) (12,doc) (19,doc) (10000,doc)

Secondary index (foo + _id)
pivots: 40, 120
entries: (2,4) (22,12) (50,19) (100,10000) (222,1)

db.test.insert({foo:55})
db.test.ensureIndex({foo:1})

memory mapped heap

One less IO per _id lookup, document is clustered in the index
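The difference between the two layouts can be sketched with plain dictionaries. This is a toy model only: the (foo, _id) pairs are taken from the TokuMX slide, the heap positions standing in for pointers are made up, and the returned step counts are structural hops, not measured IO.

```python
docs = [
    {"_id": 1, "foo": 222},
    {"_id": 4, "foo": 2},
    {"_id": 12, "foo": 22},
    {"_id": 19, "foo": 50},
    {"_id": 10000, "foo": 100},
]

# MongoDB v2.2-style layout: both indexes point into a heap of documents.
heap = list(docs)                     # list position = hypothetical pointer
mongo_pk = {d["_id"]: pos for pos, d in enumerate(heap)}
mongo_foo = {d["foo"]: pos for pos, d in enumerate(heap)}

# TokuMX-style layout: PK index holds the document, secondary holds the _id.
toku_pk = {d["_id"]: d for d in docs}
toku_foo = {d["foo"]: d["_id"] for d in docs}

def mongo_by_id(_id):
    return heap[mongo_pk[_id]], 2     # index lookup + heap fetch

def toku_by_id(_id):
    return toku_pk[_id], 1            # document is clustered in the PK index

def mongo_by_foo(foo):
    return heap[mongo_foo[foo]], 2    # secondary lookup + heap fetch

def toku_by_foo(foo):
    return toku_pk[toku_foo[foo]], 2  # secondary lookup + PK lookup
```

In this model the saving shows up on `_id` lookups. A TokuMX secondary lookup still takes a second step through the PK index unless the secondary index is itself clustered.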

TokuMX Performance

Performance - Indexed Insertion

• 100 million inserts into a collection with 3 secondary indexes

Performance - Inserts on Indexed Arrays

• Indexed Insertion: Multikey (100 inserts per doc)

Performance - Replication

• TokuMX replication allows secondary servers to process replication without IO
  – Simply injecting messages into the Fractal Tree Indexes on the secondary server
  – The “Hard Work” was done on the primary
    o Uniqueness checking
    o Transactional locking
    o Update effort (read-before-write)
  – Elimination of replication lag
• Your secondaries are fully available for read scaling!
  – Wasn’t that the point?

Performance - Lock Refinement

• TokuMX performs locking at the document level
  – Extreme concurrency!

Lock granularity (instance > database > collection > document):
• MongoDB v2.0: instance
• MongoDB v2.2: database
• TokuMX: document

Performance - Lock Refinement

• Sysbench benchmark (> RAM)

Performance - Lock Refinement + Reduced IO

– Indexed insertion benchmark

Performance - Reduced IO

Performance - Clustered Indexes

• Clustered secondary indexes
  – Additional copy of the document is stored in the index
  – No additional IO to get row data from primary key
  – Think better covered index (all non-indexed fields)
  – Good for point queries, great for range scans
  – Compression eliminates size concerns

Performance - Memory Management

• Two approaches to memory management
  – MongoDB = memory-mapped files
    o Operating system determines what data is important
  – TokuMX = managed cache
    o User defined size
    o TokuMX determines what data is important
• Run multiple TokuMX instances on a single server
  – Each has its own fixed cache size

TokuMX Compression

Compression

• MongoDB does not offer compression
  – Compressed file systems?
  – Shortened field names?
    o Remember: each field name is stored in every single document
• TokuMX easily achieves 5x-10x compression
  – Buy less disk or flash
  – Compressed reads and writes reduce overall IO
• TokuMX supports 3 compression types
  – zlib, quicklz, lzma (size vs. speed)
  – all data is compressed
• Use descriptive field names!
  – They are easy to compress

Compression

• 31 million documents, BitTorrent peer data
  – http://cs.brown.edu/~pavlo/torrent/

TokuMX Transactions

ACID + MVCC

• ACID
  – In MongoDB, multi-insertion operations allow for partial success
    o Asked to store 5 documents, 3 succeeded
  – We offer “all or nothing” behavior
  – Document level locking
• MVCC
  – In MongoDB, queries can be interrupted by writers.
    o The effects of these writers are visible to the reader
  – TokuMX offers MVCC
    o Reads are consistent as of the operation start
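The "partial success" versus "all or nothing" contrast can be illustrated with a toy collection. Nothing below is MongoDB or TokuMX code: the collection is a plain dict keyed by _id, a duplicate _id is the assumed failure, and the function names are hypothetical.

```python
def insert_partial(collection, docs):
    """Stop at the first failure but keep what was already stored."""
    stored = 0
    for doc in docs:
        if doc["_id"] in collection:  # duplicate key: this insert fails
            break
        collection[doc["_id"]] = doc
        stored += 1
    return stored

def insert_atomic(collection, docs):
    """Stage the batch on a copy and publish only if every insert succeeds."""
    staged = dict(collection)
    for doc in docs:
        if doc["_id"] in staged:
            return 0                  # roll back: collection is untouched
        staged[doc["_id"]] = doc
    collection.clear()
    collection.update(staged)
    return len(docs)

# Five documents, the fourth repeats _id 3.
batch = [{"_id": i} for i in (1, 2, 3, 3, 5)]

partial_store = {}
partial_count = insert_partial(partial_store, batch)

atomic_store = {}
atomic_count = insert_atomic(atomic_store, batch)
```

With this batch the partial version stores 3 of the 5 documents, and the atomic version stores none.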

Multi-statement Transactions

• TokuMX brings the following to MongoDB
  – db.runCommand({"beginTransaction": 1, "isolation": "mvcc"})
  – ... perform 1 or more operations
  – db.runCommand("rollbackTransaction") | db.runCommand("commitTransaction")
• Not allowed in sharded environments
  – mongos will reject

Tim Callaghan
VP/Engineering, Tokutek

tim@tokutek.com
@tmcallaghan

Questions?
