Introduction to Riak · nosqlroadshow.com/dl/nosql-berlin-2013/presentations/... · 2013-06-28 · ©...
TRANSCRIPT
© Copyright 2012-2013, Basho Technologies Inc. All rights reserved. For Internal Use Only
Concepts, Architecture and Functionality
Introduction to Riak
Wednesday, 22 May 13
WHAT IS RIAK?
• Key-Value store + extras
• Distributed and horizontally scalable
• Fault-tolerant
• Highly available
• Built for the web
• Based on Amazon Dynamo
KEY-VALUE STORE
• Simple operations - GET, PUT, DELETE
• Value is opaque (mostly), with metadata
• Extras, e.g.
• Secondary Indexes (2i)
• MapReduce
• Commit Hooks
HORIZONTALLY SCALABLE
• Default configuration is optimized for a cluster
• Query load and data are spread evenly
• Add more nodes and get more:
• ops/second
• storage capacity
• compute power (for Map/Reduce)
FAULT TOLERANT
• All nodes participate equally - no single point of failure (SPOF)
• All data is replicated
• Cluster transparently survives...
• node failure
• network partitions
• Built on Erlang/OTP (designed for fault tolerance)
HIGHLY AVAILABLE
• Any node can serve client requests
• Fallbacks are used when nodes are down
• Always accepts read and write requests
• Per-request quorums
CAP THEOREM
• C = Consistency
• A = Availability
• P = Partition Tolerance
• The CAP theorem states that a distributed shared-data system can support at most two of these three properties
[Figure: three DB nodes and two clients separated by a network/data partition]
CORE CONCEPTS (1)
• Node - An Erlang VM running an instance of Riak
• Cluster - A collection of connected Riak nodes
• Bucket - Logical grouping of objects that share configuration
• Key - An identifier for a record/object
• Value - Opaque binary representation of data stored with key
CORE CONCEPTS (2)
• Metadata - Additional data linked to a record, not part of the value
• Vector clock - Tracks updates and establishes causality between actions. Helps Riak resolve conflicts.
• Riak Object - Bucket, Key, Value and Metadata. Unit of replication.
• Consistent hashing - Keys are hashed with SHA-1 (a cryptographic hash) into a space of 2^160 values
CORE CONCEPTS (3)
• Partition - Logical division of storage
• Vnode - Process handling requests and managing a partition
• Ownership Handoff - Transfer of data on cluster change
• Hinted Handoff - Transfer of data on node/network failure
• Quorum - Set of nodes required to participate in a transaction
THE RING
REPLICATION
• Replicates to the designated vnode plus the following (n_val - 1) vnodes
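The placement rule can be sketched in a few lines of Python. This is a simplified illustration and not Riak's actual implementation: the ring size of 64, the way bucket and key are joined before hashing, and the function names are all assumptions made for the example.

```python
import hashlib
from typing import List

RING_TOP = 2 ** 160  # size of the SHA-1 hash space


def partition_for(bucket: str, key: str, ring_size: int = 64) -> int:
    """Map a bucket/key pair to one of ring_size equal partitions."""
    # Assumption: bucket and key are simply joined with "/" before hashing.
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    return position // (RING_TOP // ring_size)


def preference_list(bucket: str, key: str, n_val: int = 3,
                    ring_size: int = 64) -> List[int]:
    """Return the designated partition plus the following (n_val - 1)."""
    first = partition_for(bucket, key, ring_size)
    # Wrap around at the end of the ring.
    return [(first + i) % ring_size for i in range(n_val)]


print(preference_list("users", "user_id"))
```

Because the hash is deterministic, any node can compute the same list for a key without consulting a central directory.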
DISASTER SCENARIO
• Node fails
• Request goes to fallback
• Node comes back
• Handoff - data returned to recovered node
• Normal operations resume
[Figure: ring diagram for hash(“user_id”) with failed vnodes marked X]
REQUEST QUORUMS
• Every request contacts all replicas of key
• N - number of replicas (default 3)
• R - read quorum
• W/DW - write quorum
• Quorum: The number of replicas that must respond to a read or write request before it is considered successful (default 2, calculated as floor(n_val / 2) + 1)
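The quorum formula is easy to check by hand. The snippet below is a minimal sketch of it; the function names are illustrative and not part of any Riak client, and the failure count ignores fallback vnodes.

```python
def quorum(n_val: int) -> int:
    """Default quorum: floor(n_val / 2) + 1 replicas must respond."""
    return n_val // 2 + 1


def tolerated_failures(n_val: int) -> int:
    """Replicas that may be slow or down while a quorum can still be met.

    Assumption: fallbacks are ignored, only the n_val replicas are counted.
    """
    return n_val - quorum(n_val)


for n in (1, 3, 5):
    print(f"n_val={n}: quorum={quorum(n)}, tolerated={tolerated_failures(n)}")
```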
ANATOMY OF A REQUEST
[Figure: the client sends get(“user_id”) to the coordinating node in the Riak cluster. The Get Handler (FSM) computes hash(“user_id”) == 10, 11, 12 on the ring (partitions 6 to 16 shown) and, with R=2, collects replica versions v1 and v2.]
READ REPAIR
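[Figure: the client sends get(“user_id”) to the coordinating node in the Riak cluster. With R=2, the Get Handler (FSM) receives v1 and v2 from the replicas on the ring (partitions 6 to 16 shown), v2 is returned, and the replica holding v1 is updated to v2.]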
ANTI-ENTROPY
• Read repair corrects inconsistencies only when an object is read.
• Active Anti-Entropy is new in 1.3.0 and uses Merkle trees to compare data in partitions and periodically ensure consistency.
• Active Anti-Entropy runs as a background process
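The idea of comparing partitions through hash trees can be sketched briefly. This is a deliberately simplified two-level tree and not Riak's actual Active Anti-Entropy implementation; the segment count, the hashing scheme and the function names are assumptions for illustration.

```python
import hashlib
from typing import Dict, List


def _h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def build_tree(objects: Dict[str, bytes], segments: int = 4) -> List[bytes]:
    """Return [root, leaf_0, ..., leaf_{segments-1}] for one replica's data."""
    buckets: List[List[bytes]] = [[] for _ in range(segments)]
    for key in sorted(objects):
        # Assumption: a key's segment is chosen by hashing the key itself.
        seg = int.from_bytes(_h(key.encode()), "big") % segments
        buckets[seg].append(_h(key.encode() + b"\x00" + objects[key]))
    leaves = [_h(b"".join(entries)) for entries in buckets]
    return [_h(b"".join(leaves))] + leaves


def differing_segments(tree_a: List[bytes], tree_b: List[bytes]) -> List[int]:
    """Compare roots first; look at the leaves only if the roots differ."""
    if tree_a[0] == tree_b[0]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[1:], tree_b[1:])) if a != b]


replica_a = {"k1": b"v2", "k2": b"v2", "k3": b"v2"}
replica_b = {"k1": b"v2", "k2": b"v1", "k3": b"v2"}  # k2 is stale here
print(differing_segments(build_tree(replica_a), build_tree(replica_b)))
```

Identical replicas are confirmed with a single root comparison, and only the segments that differ need their keys exchanged and repaired.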
CONFLICT RESOLUTION
• Network partitions and concurrent actors modifying the same data cause data divergence.
• Riak provides two ways to manage this, configurable at the bucket level:
• Last Write Wins - Naive approach but works for some use cases
• Vector Clocks - Retain “sibling” copies of data for merging
VECTOR CLOCKS
• Every node has an ID
• Send last-seen vector clock in every “put” request
• Can be viewed as ‘commit history’
• Auto-resolves stale versions
• Lets you decide conflicts
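A minimal sketch of how vector clocks separate stale versions from real conflicts, assuming a clock is a mapping from actor ID to update counter (written [{a,2},{b,1}] in Erlang notation). The function names are illustrative and not Riak's API.

```python
from typing import Dict

VClock = Dict[str, int]  # actor ID -> update counter, e.g. {"a": 2, "b": 1}


def descends(a: VClock, b: VClock) -> bool:
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: VClock, b: VClock) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"  # b is stale and can be discarded automatically
    if descends(b, a):
        return "b is newer"
    return "concurrent"      # neither descends the other: keep both as siblings


print(compare({"a": 2}, {"a": 1}))          # a is newer
print(compare({"a": 3}, {"a": 2, "b": 1}))  # concurrent
```

The first case is the automatic resolution of a stale version; the second is a conflict that is left for the application to decide.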
VECTOR CLOCK EXAMPLE
[Figure: four panels on a ring of nodes 0 to 3. 1) Object v0 with vector clock [{a,1}]. 2) Two copies of Object v0, each with [{a,1}]. 3) Object v1 with [{a,2}] alongside Object v0 with [{a,1}]. 4) Object v1 with [{a,2}].]
SIBLING CREATION
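[Figure: two panels on a ring of nodes 0 to 3. 1) Two copies of Object v1 with divergent vector clocks [{a,3}] and [{a,2},{b,1}]. 2) Both versions of Object v1, [{a,3}] and [{a,2},{b,1}], are stored side by side as siblings.]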
• Siblings can be created by:
• Simultaneous writes (based on same object version)
• Network partitions
• Writes to existing key without submitting vector clock
SIBLING RESOLUTION
• If data can be represented as a monotonic set of unique data items or operations, resolution can be done through a set union, e.g. a shopping cart
• Store information that helps resolve conflicts in or with the object.
• Convergent / Commutative Replicated Data Types are emerging to help address this problem.
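Set-union resolution for the shopping cart case can be sketched as follows. The cart representation (a set of item names) and the function name are assumptions made for the example.

```python
from typing import Iterable, Set


def merge_siblings(siblings: Iterable[Set[str]]) -> Set[str]:
    """Resolve sibling values by taking the union of all items."""
    merged: Set[str] = set()
    for value in siblings:
        merged |= value
    return merged


# Two clients added different items to the same cart concurrently.
sibling_a = {"book", "lamp"}
sibling_b = {"book", "pen"}
print(sorted(merge_siblings([sibling_a, sibling_b])))  # ['book', 'lamp', 'pen']
```

A plain union only works while the set grows monotonically: an item removed in one sibling reappears after the merge, which is why removals need extra information stored in or with the object.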
SIBLING EXPLOSION
• Without sibling resolution, the number of stored versions will continually grow, resulting in degraded performance across the cluster in the form of extremely high per-operation latencies or apparent unresponsiveness.
• Frequent updates of the same object can lead to sibling explosion.
• Inserting without first checking existence through read can lead to sibling explosion if objects are not unique.
STORAGE BACKENDS
• Bitcask
• LevelDB
• Memory
• Multi
BITCASK
• A fast, append-only key-value store
• In-memory key lookup table (key_dir)
• Closed files are immutable
• Merging cleans up old data
• Developed by Basho Technologies
• Suitable for bounded data, e.g. reference data
LEVELDB
• Key-Value storage developed by Google
• Append-only
• Multiple levels of SSTable-like data structures
• Allows for more advanced querying (2i)
• Open Source (BSD License)
• Suitable for unbounded data or advanced querying
MEMORY
• Data is never persisted to disk
• Typically used for “test” databases (unit tests, etc.)
• Definable memory limits per vnode
• Configurable object expiry
• Useful for highly transient data
MULTI
• Configure multiple storage engines for different types of data
• Configure the “default” storage engine
• Choose storage engine on per bucket basis
• No reason not to use it
CLIENT TYPES
• Riak supports two main client types:
• REST based HTTP Interface
• Easy to use from command line and simple scripts
• Useful if using intermediate caching layer, e.g. Varnish
• Protocol Buffers
• Optimized binary encoding standard developed by Google
• More efficient/performant than HTTP interface
CLIENT LIBRARIES
• Client libraries supported by Basho:
• Community supported languages and frameworks:
• C/C++, Clojure, Common Lisp, Dart, Django, Go, Grails, Griffon, Groovy, Erlang, Haskell, Java, .NET, Node.js, OCaml, Perl, PHP, Play, Python, Racket, Ruby, Scala, Smalltalk
BUCKET PROPERTIES
• ‘n_val’ - number of copies of each object to be stored
• ‘allow_mult’ / ‘last_write_wins’: Boolean conflict resolution parameters (true/false)
• Tunable consistency parameters: ‘r’, ‘w’, ‘dw’, ‘rw’, ‘pw’ and ‘pr’
• Allowed values: ‘all’, ‘quorum’, ‘one’ or an integer (default: ‘quorum’ for r/w/dw/rw, 0 for pw/pr)
• ‘precommit’, ‘postcommit’ and ‘backend’
CONSISTENCY PARAMETERS (1)
• R - Number of vnodes that need to agree when retrieving the object before returning a response
• W - Number of vnodes that must confirm receiving writes before returning a successful response
• DW - Number of replicas to commit to durable storage before returning a successful response (minimum 1 from Riak 1.3 onwards)
CONSISTENCY PARAMETERS (2)
• RW - Quorum for both operations (get and put) involved in deleting an object
• PR - Number of nodes read from that must not be fallback nodes. Setting this > 0 MAY cause reads to fail under certain network partitioning scenarios.
• PW - Number of replicas to commit to primary nodes before returning a successful response. Setting this > 0 MAY cause writes to fail under certain network partitioning scenarios.
TUNABLE CONSISTENCY
• R, W, DW, RW, PR, PW tunable per bucket as well as per request
• R + W > n_val provides consistency in a fully operational cluster.
• n_val = 3 and R, W = ‘quorum’ (2) means one node that is slow or down can be tolerated (default setting)
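The R + W > n_val rule can be checked mechanically. The sketch below resolves the symbolic values ‘all’, ‘quorum’ and ‘one’ to replica counts and then tests the overlap condition; the function names are illustrative and not part of Riak.

```python
from typing import Union

QuorumValue = Union[int, str]


def resolve(value: QuorumValue, n_val: int) -> int:
    """Turn 'all', 'quorum', 'one' or an integer into a replica count."""
    symbolic = {"all": n_val, "quorum": n_val // 2 + 1, "one": 1}
    if isinstance(value, str):
        return symbolic[value]
    return value


def quorums_overlap(n_val: int, r: QuorumValue, w: QuorumValue) -> bool:
    """R + W > n_val: every read set shares a replica with every write set."""
    return resolve(r, n_val) + resolve(w, n_val) > n_val


print(quorums_overlap(3, "quorum", "quorum"))  # True  (2 + 2 > 3)
print(quorums_overlap(3, "one", "one"))        # False (1 + 1 <= 3)
```

With the default settings the read and write sets always share at least one replica, so a read in a healthy cluster sees the latest acknowledged write.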
GETTING BUCKET PROPERTIES
• Bucket properties are best retrieved via the HTTP interface
• URL path is /buckets/<bucket_name>/props
curl -X GET http://127.0.0.1:8098/buckets/test/props
{"props":{"name":"test","allow_mult":false,"basic_quorum":false,"big_vclock":50,"chash_keyfun":{"mod":"riak_core_util","fun":"chash_std_keyfun"},"dw":"quorum","last_write_wins":false,"linkfun":{"mod":"riak_kv_wm_link_walker","fun":"mapreduce_linkfun"},"n_val":3,"notfound_ok":true,"old_vclock":86400,"postcommit":[],"pr":0,"precommit":[],"pw":0,"r":"quorum","rw":"quorum","small_vclock":50,"w":"quorum","young_vclock":20}}
SETTING BUCKET PROPERTIES
• Default bucket properties can be specified in the app.config file:
{default_bucket_props, [ {n_val,3}, {allow_mult,true}, {last_write_wins,false}]}
• Override of default bucket properties is best done via the HTTP interface, as protocol buffers do not support all parameters:
curl -X PUT -H "Content-Type: application/json" -d '{"props":{"allow_mult":true,"dw":1}}' http://127.0.0.1:8098/buckets/test/props
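The same override can be issued from a script. The following is a minimal Python sketch that mirrors the curl command, assuming a node listening on 127.0.0.1:8098 as in the slide. The helper name build_props_request is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_props_request(bucket: str, props: dict,
                        base_url: str = "http://127.0.0.1:8098") -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets bucket properties."""
    body = json.dumps({"props": props}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/buckets/{bucket}/props",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_props_request("test", {"allow_mult": True, "dw": 1})
print(req.get_method(), req.full_url)
# To send it against a running node: urllib.request.urlopen(req)
```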
WHAT ARE SECONDARY INDEXES?
• Non-primary key lookups
• Defined as metadata
• Requires memory or LevelDB backend
• Two index types: integer and binary
• Two query modes: exact match or range query
• An index can have multiple values for an object
SECONDARY INDEXES (1)
• There are two special indexes automatically available: ‘$bucket’ and ‘$key’
• Indexes must follow naming convention: ‘<name>_int’ for integer indexes and ‘<name>_bin’ for binary indexes
• Queries return keys, not objects
• Indexes can be created by adding metadata to Riak objects.
• 2i query can be used as input to Map/Reduce
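The naming convention and the two query modes can be sketched as small helpers. The x-riak-index- header prefix and the /buckets/<bucket>/index/... path are not shown in the slides; they are given here as they appear in Riak's HTTP interface to the best of my knowledge and should be checked against the documentation for your version. The helper names are hypothetical.

```python
from typing import Dict, Optional, Union

IndexValue = Union[int, str]


def index_name(name: str, value: IndexValue) -> str:
    """Apply the naming convention: <name>_int or <name>_bin."""
    return f"{name}_int" if isinstance(value, int) else f"{name}_bin"


def index_headers(indexes: Dict[str, IndexValue]) -> Dict[str, str]:
    """HTTP headers that attach index entries to an object when it is stored."""
    return {f"x-riak-index-{index_name(n, v)}": str(v) for n, v in indexes.items()}


def query_path(bucket: str, index: str, start: IndexValue,
               end: Optional[IndexValue] = None) -> str:
    """Exact match with one value, range query with a start and an end."""
    path = f"/buckets/{bucket}/index/{index}/{start}"
    return path if end is None else f"{path}/{end}"


print(index_headers({"age": 32, "email": "jane@example.com"}))
print(query_path("users", "age_int", 30, 40))
```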
SECONDARY INDEXES (2)
• Secondary indexes can only be queried one at a time. Work around this by creating composite indexes, e.g. <customer_id>_<date>, that suit exact match or range queries for your query patterns
• Uses document-based partitioning: index entries are stored locally with the object.
• All queries require a covering set of vnodes (ring_size / n_val)
DATA MODELING
• Content-Types
• Identify and plan for your query patterns
• Use natural and meaningful, possibly composite, keys that allow retrieval by key, as this is by far the most efficient query method and enhances scalability.
• De-normalize data
• Time-boxing
MODELING TOOLS
• Key-Value
• Secondary Indexes (2i)
• Full-text search
• Map/Reduce
• Links
FULL-TEXT SEARCH
• Designed for searching prose/JSON/XML
• Lucene/Solr-like query interface
• Automatically indexes k/v pairs
• Can be used as input to Map/Reduce
• Customizable index schemas
• Flexible and puts less load on system than MapReduce
MAP/REDUCE
• For more involved queries
• Specify the input keys and process data in a sequence of “map” and “reduce” functions
• JavaScript or Erlang (JavaScript not recommended for heavy production use)
• Not designed for real-time processing
• Requires a covering set of vnodes to participate
LINKS
• Lightweight relationships, like <a>
• Includes a “tag”
• Built-in traversal operation (“walking”)
GET /riak/b/k/[bucket],[tag],[keep]
• Limited in number (part of metadata)
• Built on top of Map/Reduce
ACCESS PATTERNS (1)
• Analyze access and query patterns, including frequency, in the design phase and optimize the data model for them.
• Unlike in relational databases, modifying records already in the database is relatively expensive. Adding or modifying data/indexes requires a read and a write of every object.
• Try to perform the majority of data access directly through keys for performance and scalability whenever possible
ACCESS PATTERNS (2)
• For data that will never be updated, enable last_write_wins for increased performance
• Consider Bitcask as a backend if records need to be expired from the system within a fixed amount of time. Deleting objects in Riak is relatively expensive as it involves a read and a write.
SEMANTIC KEYS
• Always try to pick keys that are natural and contain information about the record.
• Avoid UUIDs if this means direct key access cannot be used
• Semantic keys allow for efficient, direct lookups.
• Semantic keys allow use of key filters, which can make migrations and bulk processing through MapReduce easier if necessary.
SIBLING MANAGEMENT
• Enable siblings wherever data is updated and loss of data is not acceptable
• Determine a strategy for sibling resolution for all applicable data
• If possible, consider serializing writes in the application layer to avoid/reduce sibling creation for frequently updated objects
DENORMALIZATION
[Figure: normalized entities (Customer, CustomerContact, CustomerAddress, Order, Order Item, InvoiceCopy) denormalized into three Riak objects]
• Customer [Customer Info] [Contacts] [Addresses] [Invoices] [Orders] - Key: <customer_id>
• Order [Order Details] [Order Items] [Invoice] - Key: <customer_id>_<date>_<order_id>
• InvoiceCopy - Key: <customer_id>_<date>_<invoice_number>
INDEX OBJECTS
[Figure: a CustomerOrders index object pointing to several Order objects]
Index object with IDs of order objects and information that can be used for filtering and identifying
SECONDARY INDEX USAGE
• Stored with object and can therefore be maintained easily and updated with changing object values
• Create multiple indexes to allow for different types of searches.
• Create composite indexes to work around the limitation that only exact match or simple range queries are possible.
• Avoid using secondary indexes as the main access method
MULTIGETS
• Retrieve multiple records most efficiently by using several connections/threads to issue GET requests in parallel. This is the most scalable method.
• Do not use MapReduce for retrieving multiple records, as this does not scale as well as direct KV access and does not allow for quorum reads.
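Issuing GET requests in parallel can be sketched with a thread pool. The fetch argument stands for whatever single-key GET your client library provides; it is a placeholder here, and the worker count of 8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def multiget(fetch: Callable[[str], Optional[bytes]], keys: Iterable[str],
             max_workers: int = 8) -> Dict[str, Optional[bytes]]:
    """Fetch many keys concurrently; each key is an ordinary GET."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(fetch, keys))  # results keep the order of keys
    return dict(zip(keys, values))


# Stand-in for a real client call; replace with a GET against Riak.
fake_store = {"GOOG_20130213": b"day 1", "GOOG_20130214": b"day 2"}
print(multiget(fake_store.get, ["GOOG_20130213", "GOOG_20130214", "missing"]))
```

Each request is a normal key lookup, so per-request quorum settings still apply, which is the property the MapReduce route lacks.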
TIME BOXING AND ROLLUPS (1)
[Figure: writes and updates go to per-minute objects, which an hourly batch rollup aggregates into per-hour objects and a daily batch rollup aggregates into per-day objects]
• Minute: GOOG_20130213_0900 ... GOOG_20130213_0959, GOOG_20130213_1000 ... GOOG_20130213_1059, GOOG_20130214_0900 ... GOOG_20130214_0959
• Hour: GOOG_20130213_09, GOOG_20130213_10, GOOG_20130214_09
• Day: GOOG_20130213, GOOG_20130214
TIME BOXING AND ROLLUPS (2)
• Sensible key choices allow for direct access based on data
• Several rollups of base data can be performed in order to allow different access or query patterns
• Bulk updates can be done by external application or possibly even through MapReduce
• Application logic or commit hooks can be used to catch out of bounds data and update rollups
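Time-boxed keys of the form GOOG_20130213_0900 can be generated directly from a timestamp, which is what makes direct access by key possible. A minimal sketch follows; the function names and the choice of a plain datetime as input are assumptions for the example.

```python
from datetime import datetime
from typing import List


def minute_key(symbol: str, ts: datetime) -> str:
    return f"{symbol}_{ts:%Y%m%d_%H%M}"


def hour_key(symbol: str, ts: datetime) -> str:
    return f"{symbol}_{ts:%Y%m%d_%H}"


def day_key(symbol: str, ts: datetime) -> str:
    return f"{symbol}_{ts:%Y%m%d}"


def minute_keys_for_hour(symbol: str, ts: datetime) -> List[str]:
    """Keys an hourly rollup job would read, each fetched directly by key."""
    return [minute_key(symbol, ts.replace(minute=m)) for m in range(60)]


ts = datetime(2013, 2, 13, 9, 59)
print(minute_key("GOOG", ts), hour_key("GOOG", ts), day_key("GOOG", ts))
# GOOG_20130213_0959 GOOG_20130213_09 GOOG_20130213
```

An hourly rollup job can compute its 60 input keys and its output key without any query, then read the inputs with parallel GETs.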
MULTI DATACENTER REPLICATION (MDC)
• Allows data to be replicated between clusters in different data centers. Can handle larger latencies.
• Two synchronization modes that can be used together: real-time and full sync
• Set up as uni-directional links. 2 links can be set up for bi-directional replication.
• Can be used for backing up data
RIAK-CS
• Built on top of Riak and supports MDC
• Exposes an Amazon S3 API-compatible interface
• Supports multi-tenancy
• Per-tenant usage data and statistics on network I/O
• Supports objects of arbitrary content type up to 5 GB
• Often used to build private cloud storage
QUESTIONS?
Christian Dahlqvist, [email protected]