couchbase 2.0: indexing and querying
Couchbase Server 2.0 Indexing and Querying
Dipti Borkar, Director, Product Management
Couchbase Server 2.0 - Webinar Series
Couchbase Server 2.0 and Indexing/Querying
Couchbase Server 2.0 and Incremental Map Reduce for Real-Time Analytics
Couchbase Server 2.0 and Cross Data Center Replication
Couchbase Server 2.0 and Full-Text Search Integration
Couchbase Server 2.0 Use Cases Overview
Introducing Couchbase Server 2.0
http://www.couchbase.com/webinars
New in Two
JSON support
Indexing and Querying
Cross data center replication
Incremental Map Reduce
What we’ll talk about
• Records vs Documents
• Lifecycle of a view
– Index definition, build, and query phase
– Indexing details
• Replica indexes, failover and compaction
• Patterns
– Primary and Secondary indexes
– Basic aggregations (avg ratings by brewery)
– Time-based analytics with group_level
– Leaderboard
Relational vs Document data model
Relational data model: Highly-structured table organization with rigidly-defined data formats and record structure.
Document data model: Collection of complex documents with arbitrary, nested data formats and varying “record” format.
Example: User Profile
User Info
KEY  First  Last    ZIP_id
1    Dipti  Borkar  2
2    Joe    Smith   2
3    Ali    Dodson  2
4    John   Doe     3

Address Info
ZIP_id  CITY  STATE  ZIP
1       DEN   CO     30303
2       MV    CA     94040
3       CHI   IL     60609
4       NY    NY     10010

To get information about a specific user, you perform a join across two tables.
Document Example: User Profile
All data in a single document
{ "ID": 1, "FIRST": "Dipti", "LAST": "Borkar", "ZIP": "94040", "CITY": "MV", "STATE": "CA" }
Making the Same Change with a Document Database
{ "ID": 1, "FIRST": "Dipti", "LAST": "Borkar", "ZIP": "94040", "CITY": "MV", "STATE": "CA", "STATUS": { "TEXT": "At Conf", "GEO_LOC": "134" }, "COUNTRY": "USA" }
Just add information to a document
JSON Documents
• Maps more closely to objects or entities
• CRUD Operations, lightweight schema
• Stored under an identifier key
{ "fields": ["with basic types", 3.14159, true], "like": "your favorite language" }
client.set("mydocumentid", myDocument);
mySavedDocument = client.get("mydocumentid");
Couchbase Server 2.0: Views
What are Views?
• Extract fields from JSON documents and produce an index of the selected information
• Views can cover a few different use cases
– Primary Index
– Simple secondary indexes (the most common)
– Complex secondary, tertiary and composite indexes
– Aggregation functions (reduction), for example counting the number of North American Ales
– Organizing related data
• Built using Map/Reduce
– Map function creates a matrix from document fields
– Reduce function summarizes (reduces) information
– Written using superfast Javascript
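As a rough illustration of the "count the number of North American Ales" use case, here is a minimal sketch of a map function plus a count. The emit() stand-in, the sample documents and the "type" and "category" field names are assumptions made for this example; in the server, emit() is provided by the view engine and the count would come from the built-in _count reduce.

```javascript
// Stand-in for the view engine's emit(); collects index rows (assumption for this sketch).
const aleRows = [];
function emit(key, value) {
  aleRows.push({ key, value });
}

// Map function: pick out beers in one category.
const aleMap = function (doc, meta) {
  if (doc.type === "beer" && doc.category === "North American Ale") {
    emit(doc.category, null);
  }
};

// Hypothetical sample documents.
const aleDocs = [
  { id: "beer_1", doc: { type: "beer", name: "Pale Ale", category: "North American Ale" } },
  { id: "beer_2", doc: { type: "beer", name: "Tripel", category: "Belgian and French Ale" } },
  { id: "beer_3", doc: { type: "beer", name: "IPA", category: "North American Ale" } },
];
for (const { id, doc } of aleDocs) {
  aleMap(doc, { id });
}

// A count reduce over the emitted rows is just the number of rows.
const aleCount = aleRows.length;
console.log(aleCount);
```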
Development vs. Production Views
• Development views index a subset of the data.
• Publishing a view builds the index across the entire cluster.
• Queries on production views are scattered to all cluster members and results are gathered and returned to the client.
View Lifecycle
Define -> Build -> Query
Buckets & Design docs & Views
• Create design documents on a bucket
• Create views within a design document
BUCKET 1
– Design document 1: View 1, View 2, View 3
– Design document 2: View 4, View 5
– Design document 3: View 6, View 7
BUCKET 2
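A design document is itself a JSON document that groups view definitions. The sketch below builds one with two views; the view names by_city and count_by_city and the "city" field are hypothetical, chosen only for illustration.

```javascript
// Map function shared by both views. emit() is supplied by the view engine at
// index time; here the function is only converted to source text, never called.
const cityMap = function (doc, meta) {
  if (doc.city) {
    emit(doc.city, null);
  }
};

// One design document holding two views (names are illustrative).
const designDoc = {
  views: {
    by_city: { map: cityMap.toString() },
    count_by_city: { map: cityMap.toString(), reduce: "_count" },
  },
};

// JSON body that would be stored as the design document.
const designDocBody = JSON.stringify(designDoc, null, 2);
console.log(designDocBody);
```

Because every view in a design document is built together, grouping related views this way lets them share one pass over the data.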
Distributed Indexing and Querying
Diagram: a Couchbase Server cluster of three servers with a user-configured replica count of 1. Each server holds active documents and replica documents, and two app servers, each with a Couchbase client library and cluster map, send "Create Index / View" and "Query" requests to the cluster.
• Indexing work is distributed amongst nodes
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
DEFINE Index / View Definition in JavaScript
CREATE INDEX City ON Brewery.City;
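A view definition roughly equivalent to the SQL statement above might look like the following sketch. The emit() stand-in, the "type" and "city" field names and the sample documents are assumptions for illustration.

```javascript
// Stand-in for the view engine's emit(); collects index rows (harness assumption).
const cityRows = [];
function emit(key, value) {
  cityRows.push({ key, value });
}

// Map function: index brewery documents by city, with no stored value.
function byCity(doc, meta) {
  if (doc.type === "brewery" && doc.city) {
    emit(doc.city, null);
  }
}

// Hypothetical sample documents: only the brewery is indexed.
byCity({ type: "brewery", name: "Alaskan Brewing", city: "Juneau" }, { id: "alaskan_brewing" });
byCity({ type: "beer", name: "Amber" }, { id: "alaskan_brewing-amber" });

console.log(cityRows);
```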
BUILD Distributed Index Build Phase
• Optimized for lookups, in-order access and aggregations
• All view reads come from disk (different performance profile)
• View builds against every document on every node
– This is why you should group them in a design document
• Automatically kept up to date
QUERY Dynamic Queries with Optional Aggregation
• Efficiently fetch a document or group of similar documents
• Queries use cached values from B-tree inner nodes when possible
• Take advantage of in-order tree traversal with group_level queries
Query ?startkey="J"&endkey="K"
{ "rows": [ { "key": "Juneau", "value": null } ] }
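The key-range query above can be pictured as a scan over index rows that are kept sorted by key. This is a simplified sketch: the rows are hypothetical and plain JavaScript string comparison stands in for the server's collation rules.

```javascript
// Hypothetical index rows, already sorted by key (as a view index keeps them).
const sortedCityRows = [
  { key: "Anchorage", value: null },
  { key: "Juneau", value: null },
  { key: "Kodiak", value: null },
];

// Return rows with startkey <= key <= endkey (the end key is inclusive here).
function queryRange(rows, startkey, endkey) {
  return rows.filter((row) => row.key >= startkey && row.key <= endkey);
}

const rangeResult = { rows: queryRange(sortedCityRows, "J", "K") };
console.log(JSON.stringify(rangeResult));
```

"Kodiak" sorts after "K", so only "Juneau" falls inside the range.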
Index building details
– All the views within a design document are incrementally updated when the view is accessed or auto-indexing kicks in
– Automatic view updates
• In addition to forcing an index build, active & replica indexes are updated every 3 seconds of inactivity if there are at least 5000 new changes (configurable)
– The entire view is recreated if the view definition has changed
– Views can be conditionally updated by specifying the “stale” argument to the view query
– The index information stored on disk consists of the combination of both the key and value information defined within your view.
Queries run against stale indexes by default
• stale=update_after (default if nothing is specified)
– always get fastest response
– can take two queries to read your own writes
• stale=ok
– auto update will trigger eventually
– might not see your own writes for a few minutes
– least frequent updates -> least resource impact
• stale=false
– Use with “set with persistence” if data needs to be included in view results
– BUT be aware of delay it adds, only use when really required
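As a sketch of how the stale argument travels with a query, the helper below assembles a view query URL. The host, port, bucket, design document and view names are assumptions for illustration, and viewUrl() is a hypothetical helper, not part of any client library.

```javascript
// Build a view query URL. Key parameters are JSON-encoded; stale is a bare token.
function viewUrl(base, bucket, designDoc, view, params) {
  const query = new URLSearchParams();
  for (const [name, value] of Object.entries(params)) {
    query.set(name, name === "stale" ? value : JSON.stringify(value));
  }
  return `${base}/${bucket}/_design/${designDoc}/_view/${view}?${query}`;
}

// Ask for an up-to-date index before results are returned (slower, but consistent).
const freshUrl = viewUrl("http://localhost:8092", "beer-sample", "beer", "by_city", {
  stale: "false",
  startkey: "J",
  endkey: "K",
});
console.log(freshUrl);
```

Leaving stale out of the parameters gives the default update_after behaviour described above.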
Views and Replica indexes
• In addition to replicas for data (up to 3 copies), optionally create replica for indexes
– Set at a bucket level
– Replica index populated from replica data
– Replica index is used after a failover
• Each node manages replica index data structures
– Index structure optimized to handle sharded data (vBuckets)
– One for all active shards on the node
– One for all replica shards on the node
Views and failover
• Replica indexes enabled on failover
• Replica indexes are rebuilt on replica nodes
– Automatically incrementally built based on replica data
– Updated every 3 seconds of inactivity if there are at least 5000 new changes
– Not copied/moved to be consistent with persisted replica data
View Compaction
• Compaction is ONLINE
• Reclaims empty allocated space from disk
• Indexes are stored on disk for active vBuckets on each node and updated in append-only manner
• Auto-compaction performed in the background
– Set the database fragmentation levels
– Set the index fragmentation levels
– Choose a schedule
– Global and bucket specific settings
Primary and Secondary Indexing
Example Document
Document ID
Define a primary index on the bucket
• Look up the document ID / key by key, range, prefix, suffix
Index definition
Define a secondary index on the bucket
• Look up an attribute by value, range, prefix, suffix
Index definition
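The two index definitions can be sketched as map functions: a primary index emits the document ID, a secondary index emits an attribute. The buildIndex() harness, the "name" attribute and the sample documents are assumptions for this example.

```javascript
// The real view engine provides emit() globally; this harness assigns it per build.
let emit;
function buildIndex(mapFn, docs) {
  const rows = [];
  emit = (key, value) => rows.push({ key, value });
  docs.forEach(({ id, doc }) => mapFn(doc, { id }));
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Primary index: keyed by document ID.
const primaryIndex = function (doc, meta) {
  emit(meta.id, null);
};

// Secondary index: keyed by an attribute (here, name).
const secondaryIndex = function (doc, meta) {
  if (doc.name) {
    emit(doc.name, null);
  }
};

// Hypothetical sample documents.
const sampleDocs = [
  { id: "beer_pale_ale", doc: { type: "beer", name: "Pale Ale" } },
  { id: "brewery_acme", doc: { type: "brewery", name: "Acme Brewing" } },
  { id: "beer_amber", doc: { type: "beer", name: "Amber" } },
];

const primaryRows = buildIndex(primaryIndex, sampleDocs);
const secondaryRows = buildIndex(secondaryIndex, sampleDocs);

// Prefix lookup on the primary index: every ID starting with "beer_".
const beerIds = primaryRows.filter((row) => row.key.startsWith("beer_")).map((row) => row.key);
console.log(beerIds, secondaryRows.map((row) => row.key));
```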
Query Patterns: Find by Attribute
Find documents by a specific attribute
• Let's find beers by brewery_id!
The index definition
Key, Value
The result set: beers keyed by brewery_id
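A possible index definition for this pattern is sketched below. The choice of doc.name as the emitted value, the "type" field and the sample documents are assumptions for illustration; the slide's original definition is not reproduced in the transcript.

```javascript
// Stand-in for the view engine's emit(); collects index rows (harness assumption).
const beerRows = [];
function emit(key, value) {
  beerRows.push({ key, value });
}

// Map function: key each beer by the brewery it belongs to.
function byBreweryId(doc, meta) {
  if (doc.type === "beer" && doc.brewery_id) {
    emit(doc.brewery_id, doc.name);
  }
}

// Hypothetical sample documents.
byBreweryId({ type: "beer", name: "1554 Enlightened Black Ale", brewery_id: "new_belgium_brewing" }, { id: "b1" });
byBreweryId({ type: "beer", name: "Fat Tire", brewery_id: "new_belgium_brewing" }, { id: "b2" });
byBreweryId({ type: "beer", name: "Amber", brewery_id: "alaskan_brewing" }, { id: "b3" });

// Equivalent of querying the view with key="new_belgium_brewing".
const newBelgiumBeers = beerRows.filter((row) => row.key === "new_belgium_brewing");
console.log(newBelgiumBeers);
```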
Query Pattern: Basic Aggregations
Use a built-in reduce function with a group query
• Let's find average abv for each brewery!
We are reducing doc.abv with _stats
Group reduce (reduce by unique key)
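To make the group reduce concrete, the sketch below imitates what a _stats reduce produces per unique key and derives the average from it. The stats() and groupReduce() helpers and the sample rows are stand-ins written for this example, not the server's implementation.

```javascript
// Summary statistics over numeric values, in the shape of a _stats result.
function stats(values) {
  return {
    sum: values.reduce((a, b) => a + b, 0),
    count: values.length,
    min: Math.min(...values),
    max: Math.max(...values),
    sumsqr: values.reduce((a, b) => a + b * b, 0),
  };
}

// Group rows by key and reduce each group separately (reduce by unique key).
function groupReduce(rows, reduceFn) {
  const groups = new Map();
  for (const { key, value } of rows) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups].map(([key, values]) => ({ key, value: reduceFn(values) }));
}

// Hypothetical rows as a map function emitting (brewery_id, abv) would produce.
const abvRows = [
  { key: "brewery_a", value: 5 },
  { key: "brewery_a", value: 7 },
  { key: "brewery_b", value: 4 },
];

const abvStats = groupReduce(abvRows, stats);
// The average is not stored; the application derives it as sum / count.
const abvAverages = abvStats.map((row) => ({ key: row.key, avg: row.value.sum / row.value.count }));
console.log(abvAverages);
```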
Query Pattern: Time-based Rollups
Find patterns in beer comments by time
{ "type": "comment", "about_id": "beer_Enlightened_Black_Ale", "user_id": 525, "text": "tastes like college!", "updated": "2010-07-22 20:00:20" }
{ "id": "f1e62" }
timestamp: the "updated" field
Query with group_level=2 to get monthly rollups
dateToArray() is your friend
• String or Integer based timestamps
• Output optimized for group_level queries
• array of JSON numbers: [2012,9,21,11,30,44]
group_level=2 results
• Monthly rollup
• Sorted by time
– sort the query results in your application if you want to rank by value
– no chained map-reduce
group_level=3 - daily results - great for graphing
• Daily, hourly, minute or second rollup all possible with the same index.
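The rollups above can be sketched as follows. toDateArray() is a hypothetical stand-in for the built-in dateToArray() that only handles the "YYYY-MM-DD hh:mm:ss" strings used in this deck, and groupLevelCount() imitates a count reduce queried with group_level; the timestamps are made up.

```javascript
// Stand-in for dateToArray(): split a timestamp string into numeric parts.
function toDateArray(timestamp) {
  return timestamp.match(/\d+/g).map(Number);
}

// Count rows per key prefix of the given length, like group_level=N with a count reduce.
function groupLevelCount(keys, level) {
  const counts = new Map();
  for (const key of keys) {
    const prefix = JSON.stringify(key.slice(0, level));
    counts.set(prefix, (counts.get(prefix) || 0) + 1);
  }
  return [...counts].map(([prefix, value]) => ({ key: JSON.parse(prefix), value }));
}

// Hypothetical comment timestamps, already in time order (as the index keeps them).
const commentKeys = [
  "2010-07-22 20:00:20",
  "2010-07-23 09:15:00",
  "2010-08-01 12:00:00",
].map(toDateArray);

const monthlyRollup = groupLevelCount(commentKeys, 2); // keys like [2010, 7]
const dailyRollup = groupLevelCount(commentKeys, 3);   // keys like [2010, 7, 22]
console.log(JSON.stringify(monthlyRollup), JSON.stringify(dailyRollup));
```

The same emitted keys serve every granularity; only the group_level changes.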
Query Pattern: Leaderboard
Aggregate value stored in a document
• Let's find the top-rated beers!
{ "brewery": "New Belgium Brewing", "name": "1554 Enlightened Black Ale", "abv": 5.5, "description": "Born of a flood...", "category": "Belgian and French Ale", "style": "Other Belgian-Style Ales", "updated": "2010-07-22 20:00:20", "ratings": { "jchris": 5, "scalabl3": 4, "damienkatz": 1 }, "comments": [ "f1e62", "6ad8c" ] }
Sort each beer by its average rating
• Let's find the top-rated beers!
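One way to sketch the leaderboard: emit each beer's average rating as the key, then read the index from the highest key down. The emit() stand-in and the two extra sample beers are assumptions; the descending sort and slice imitate a query with descending=true and a limit.

```javascript
// Stand-in for the view engine's emit(); collects index rows (harness assumption).
const ratingRows = [];
function emit(key, value) {
  ratingRows.push({ key, value });
}

// Map function: average the ratings stored in the document and emit it as the key.
function byAverageRating(doc, meta) {
  if (doc.ratings) {
    let total = 0;
    let count = 0;
    for (const user in doc.ratings) {
      total += doc.ratings[user];
      count += 1;
    }
    if (count > 0) {
      emit(total / count, doc.name);
    }
  }
}

// The first document follows the deck's example; the other two are hypothetical.
byAverageRating({ name: "1554 Enlightened Black Ale", ratings: { jchris: 5, scalabl3: 4, damienkatz: 1 } }, { id: "b1" });
byAverageRating({ name: "Sample Stout", ratings: { a: 5, b: 4 } }, { id: "b2" });
byAverageRating({ name: "Sample Lager", ratings: { a: 2 } }, { id: "b3" });

// Highest average first, top two only.
const topRated = ratingRows.slice().sort((a, b) => b.key - a.key).slice(0, 2);
console.log(topRated);
```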
What NOT to Write
Most common mistakes
• Reduces that don’t reduce
• Trying to do too many things with one view
• Emitting too much data into a view value
• Expecting view query performance to be as fast as get/set
• Recursive queries require application code.
Full Text index
Elastic Search Adapter
• Elastic Search is good for ad-hoc queries and faceted browsing
• Our adapter is aware of changing Couchbase topology
• Indexed by Elastic Search after being stored to disk in Couchbase
Couchbase Server 2.0 and Full-Text Search Integration
Geographic index
Experimental Status
• Not yet using Superstar trees (only fast on large clusters)
• Optimized for bulk loading
Questions?
The B-tree Index
• Helps to know what's efficient
• Superstar
http://damienkatz.net/2012/05/stabilizing_couchbase_server_2.html
Logical View B-tree
• Incremental reduce values are stored in the tree
Diagram: a B-tree whose lower-level nodes carry the reduce values 7, 5, 5, 3, 2 and 3, and whose root carries their total, 25.
Reduce!
• Incremental reduce values are stored in the tree
_count
function(keys, values) { return keys ? values.length : sum(values); }
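The _count function shown in this deck handles two cases: with keys it counts rows, and without keys it sums counts already computed lower in the tree. The sketch below walks through both using the node sizes from the diagram; sum() is defined locally as a stand-in for the view engine's helper, and the row keys are made up.

```javascript
// Local stand-in for the view engine's sum() helper.
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}

// The count reduce from the slide: count rows, or add up earlier counts.
const countReduce = function (keys, values) {
  return keys ? values.length : sum(values);
};

// Six lower-level nodes holding 7, 5, 5, 3, 2 and 3 rows, as in the diagram.
const nodeSizes = [7, 5, 5, 3, 2, 3];
const nodes = nodeSizes.map((size, i) =>
  Array.from({ length: size }, (_, j) => ({ key: `k${i}_${j}`, value: null }))
);

// First pass: each node reduces its own rows (keys are present).
const nodeCounts = nodes.map((rows) =>
  countReduce(rows.map((row) => row.key), rows.map((row) => row.value))
);

// Second pass: the root combines stored counts (keys are absent).
const rootCount = countReduce(null, nodeCounts);
console.log(nodeCounts, rootCount);
```

Storing the intermediate counts is what lets a range query reuse them instead of re-reading every row.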
Dynamic Queries
• You can query that tree dynamically
• Lots of the patterns are about pulling value from this data structure
?startkey="abba"&endkey="robot"
{ "value": 19 }
Lookup via key-range
• Find tables during yesterday's lunch shift
• Find shifts owned by which manager
?startkey="abba"&endkey="robot"
{ "value": 19 }