nosql: an analysis
TRANSCRIPT
April 10-12 | Chicago, IL
NoSQL: An Analysis
Andrew J. Brust, Founder and CEO, Blue Badge Insights
Meet Andrew
CEO and Founder, Blue Badge Insights
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
• http://www.msbinyc.com
Co-moderator, NYC .NET Developers Group
• http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust
Read all about it!
Agenda
Why NoSQL?
Concepts
NoSQL Categories
Provisioning, market, applicability
Take-aways
NoSQL Data Fodder
Addresses
Preferences
Notes
Friends, Followers
Documents
“Web Scale”
This is the term used to justify NoSQL
The scenario is simple needs, “made up for in volume”
• Millions of concurrent users
Think of sites like Amazon or Google
Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences
NoSQL Common Traits
Non-relational
Non-schematized/schema-free
Open source
Distributed
Eventual consistency
“Web scale”
Developed at big Internet companies
CONCEPTS
Consistency
CAP Theorem
• Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance
NoSQL does not offer “ACID” guarantees
• Atomicity, consistency, isolation and durability
Instead offers “eventual consistency”
Similar to DNS propagation
Things like inventory, account balances should be consistent
• Imagine a server in Seattle is updated to show that stock was depleted
• Imagine the server in NY is not updated
• Customer in NY goes to order 50 pieces of the item
• Order processed even though no stock
Things like catalog information don’t have to be, at least not immediately
• If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ server knows about it
But catalog info must come up quickly
• Therefore don’t lock data in one location while waiting to update the other
Therefore, OK to sacrifice consistency for speed, in some cases
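The inventory scenario above can be sketched as a toy simulation. This is only a minimal illustration under stated assumptions, not how any particular NoSQL product replicates data: the `Replica` class, the `replicate` helper, the item name and the stock figures are all hypothetical.

```python
class Replica:
    """A toy replica holding stock counts; every name here is hypothetical."""

    def __init__(self, name):
        self.name = name
        self.stock = {}
        self.pending = []  # updates accepted locally but not yet sent to peers

    def write(self, item, quantity):
        # Accept the write locally and return at once (no cross-site lock).
        self.stock[item] = quantity
        self.pending.append((item, quantity))

    def can_order(self, item, quantity):
        return self.stock.get(item, 0) >= quantity


def replicate(source, target):
    """Push queued updates from source to target (the 'eventual' part)."""
    for item, quantity in source.pending:
        target.stock[item] = quantity
    source.pending.clear()


seattle = Replica("Seattle")
new_york = Replica("New York")
for replica in (seattle, new_york):
    replica.stock["widget"] = 100

seattle.write("widget", 0)  # stock depleted, recorded in Seattle only

# Before replication, New York still accepts an order for 50 pieces.
stale_order_accepted = new_york.can_order("widget", 50)

replicate(seattle, new_york)

# After replication, both servers agree that nothing is left.
order_accepted_after_sync = new_york.can_order("widget", 50)

print(stale_order_accepted, order_accepted_after_sync)
```

The window between `write` and `replicate` is where the stale order slips through, which is why data such as inventory and account balances is a poor fit for this model while catalog data tolerates it.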
Consistency
CAP Theorem (diagram): Consistency, Availability and Partition Tolerance, with Relational and NoSQL databases placed among them
Indexing
Most NoSQL databases are indexed by key
Some allow so-called “secondary” indexes
Often the primary key indexes are clustered
HBase uses HDFS (the Hadoop Distributed File System), which is append-only
• Writes are logged
• Logged writes are batched
• File is re-created and sorted
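The three write steps listed for HBase can be sketched roughly as follows. This is a simplified in-memory illustration, assuming a batch size of three; the `AppendOnlyStore` class and its methods are hypothetical and do not reflect the HBase or HDFS APIs.

```python
class AppendOnlyStore:
    """Toy model of log, batch, then rewrite-sorted; not the HBase API."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.log = []          # step 1: every write is appended here
        self.sorted_file = []  # snapshot of (key, value) pairs, sorted by key

    def put(self, key, value):
        self.log.append((key, value))
        if len(self.log) >= self.batch_size:
            self.flush()

    def flush(self):
        # Steps 2 and 3: merge the batch with the old file and produce a
        # brand-new sorted file instead of updating the old one in place.
        merged = dict(self.sorted_file)
        merged.update(self.log)  # later writes win
        self.sorted_file = sorted(merged.items())
        self.log = []

    def get(self, key):
        # Check unflushed writes first (newest first), then the sorted file.
        for logged_key, value in reversed(self.log):
            if logged_key == key:
                return value
        return dict(self.sorted_file).get(key)


store = AppendOnlyStore(batch_size=3)
store.put("row3", "c")
store.put("row1", "a")
store.put("row2", "b")   # third write triggers a flush
store.put("row1", "a2")  # stays in the log until the next flush

print(store.sorted_file)
print(store.get("row1"))
```

Reads have to consult both the unflushed log and the sorted file, which is one reason write buffering counts as a compromise.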
Queries
Typically no query language
Instead, create procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used…
MapReduce
This is not Hadoop’s MapReduce, but it’s conceptually related
Map step: pre-processes data
Reduce step: summarizes/aggregates data
Will show a MapReduce code sample for Mongo soon
Will demo map code on CouchDB
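The map and reduce steps can be illustrated in plain Python. This is a sketch of the concept only, not the JavaScript MapReduce interface of MongoDB or CouchDB; the sample orders and the function names `map_order`, `reduce_prices` and `map_reduce` are hypothetical.

```python
from collections import defaultdict

# Made-up sample documents for the illustration.
orders = [
    {"customer": "Andrew", "price": 300},
    {"customer": "Jane", "price": 2500},
    {"customer": "Andrew", "price": 120},
]


def map_order(order):
    """Map step: emit a (key, value) pair for each input document."""
    yield order["customer"], order["price"]


def reduce_prices(key, values):
    """Reduce step: summarize all values emitted for one key."""
    return sum(values)


def map_reduce(documents, map_fn, reduce_fn):
    # Group every emitted value under its key, then reduce each group.
    grouped = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


totals = map_reduce(orders, map_order, reduce_prices)
print(totals)
```

Because the map step looks at one document at a time, it can run on each partition independently before the reduce step combines the results.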
Sharding
A partitioning pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication also provided
• Good for disaster recovery
Since “shards” can be geographically distributed, sharding can act like a CDN
Good for keeping data close to processing
• Reduces network traffic when MapReduce splitting takes place
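A minimal sketch of sharding by key, with a fan-out query across all shards, might look like this. It assumes simple hash-based placement, which is only one of several possible partitioning schemes; the `ShardedStore` class is hypothetical and each dictionary stands in for a separate server.

```python
import hashlib


class ShardedStore:
    """Toy sharded store: each dict stands in for a separate server."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash so the same key always lands on the same shard.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        # Key lookups touch exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        # Queries on anything other than the key must ask every shard
        # and merge the partial results.
        results = []
        for shard in self.shards:
            results.extend(row for row in shard.values() if predicate(row))
        return results


store = ShardedStore(shard_count=3)
store.put(101, {"Last_Name": "Brust", "Last_Order": 1501})
store.put(202, {"Last_Name": "Doe", "Last_Order": 1502})

print(store.get(101))
print(store.fan_out(lambda row: row["Last_Order"] > 1501))
```

Replication is left out of the sketch; duplicating each shard on a second server would add the disaster-recovery benefit noted above.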
NOSQL CATEGORIES
Key-Value Stores
The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
• Schema may differ from row to row
Common on cloud platforms
• e.g. Amazon SimpleDB, Azure Table Storage
MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak
Key-Value Stores
Database
Table: Customers
Row ID: 101
• First_Name: Andrew
• Last_Name: Brust
• Address: 123 Main Street
• Last_Order: 1501
Row ID: 202
• First_Name: Jane
• Last_Name: Doe
• Address: 321 Elm Street
• Last_Order: 1502
Table: Orders
Row ID: 1501
• Price: 300 USD
• Item1: 52134
• Item2: 24457
Row ID: 1502
• Price: 2500 GBP
• Item1: 98456
• Item2: 59428
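The Customers and Orders sample above maps naturally onto nested dictionaries. The sketch below is plain Python rather than the API of any of the products named; the extra `Item3` key and its value are made up to show that the schema may differ from row to row.

```python
# Each "table" maps a row ID to a dictionary of key-value pairs.
database = {
    "Customers": {
        101: {"First_Name": "Andrew", "Last_Name": "Brust",
              "Address": "123 Main Street", "Last_Order": 1501},
        202: {"First_Name": "Jane", "Last_Name": "Doe",
              "Address": "321 Elm Street", "Last_Order": 1502},
    },
    "Orders": {
        1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
        1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
    },
}

# No schema is enforced, so one row can carry a key the others lack.
# "Item3" and its value are made up for this illustration.
database["Orders"][1502]["Item3"] = 11111

# There are no joins: following Last_Order is a second lookup by key.
customer = database["Customers"][101]
last_order = database["Orders"][customer["Last_Order"]]
print(last_order["Price"])

# The two order rows now expose different sets of keys.
row_keys = {row_id: sorted(row) for row_id, row in database["Orders"].items()}
print(row_keys)
```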
Wide Column Stores
Has tables with declared column families
• Each column family has “columns” which are KV pairs that can vary from row to row
These are the most foundational for large sites
• BigTable (Google)
• HBase (Originally part of Yahoo-dominated Hadoop project)
• Cassandra (Facebook)
• Cassandra calls column families “super columns” and tables “super column families”
They are the most “Big Data”-ready
• Especially HBase + Hadoop
Wide Column Stores
Table: Customers
Row ID: 101
Super Column: Name
• Column: First_Name: Andrew
• Column: Last_Name: Brust
Super Column: Address
• Column: Number: 123
• Column: Street: Main Street
Super Column: Orders
• Column: Last_Order: 1501
Row ID: 202
Super Column: Name
• Column: First_Name: Jane
• Column: Last_Name: Doe
Super Column: Address
• Column: Number: 321
• Column: Street: Elm Street
Super Column: Orders
• Column: Last_Order: 1502
Table: Orders
Row ID: 1501
Super Column: Pricing
• Column: Price: 300 USD
Super Column: Items
• Column: Item1: 52134
• Column: Item2: 24457
Row ID: 1502
Super Column: Pricing
• Column: Price: 2500 GBP
Super Column: Items
• Column: Item1: 98456
• Column: Item2: 59428
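The difference from a plain key-value store is that column families are declared up front while the columns inside them stay free-form. A rough sketch of that rule, using a hypothetical `WideColumnTable` class that is not the HBase or Cassandra API:

```python
class WideColumnTable:
    """Toy table: column families are declared, columns inside them are not."""

    def __init__(self, name, column_families):
        self.name = name
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_id, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"undeclared column family: {family}")
        row = self.rows.setdefault(row_id, {})
        row.setdefault(family, {})[column] = value

    def get(self, row_id, family):
        return self.rows.get(row_id, {}).get(family, {})


customers = WideColumnTable("Customers", ["Name", "Address", "Orders"])
customers.put(101, "Name", "First_Name", "Andrew")
customers.put(101, "Name", "Last_Name", "Brust")
customers.put(101, "Address", "Number", 123)
customers.put(101, "Address", "Street", "Main Street")
customers.put(202, "Name", "First_Name", "Jane")
# Columns may vary from row to row; "Middle_Initial" is a made-up column.
customers.put(202, "Name", "Middle_Initial", "Q")

print(customers.get(101, "Name"))
print(customers.get(202, "Name"))

try:
    customers.put(101, "Payments", "Card", "none")  # family never declared
    rejected = False
except KeyError:
    rejected = True
print(rejected)
```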
Demo: Wide Column Stores
Document Stores
Have “databases,” which are akin to tables
Have “documents,” akin to rows
• Documents are typically JSON objects
• Each document has properties and values
• Values can be scalars, arrays, links to documents in other databases, or sub-documents (i.e. contained JSON objects, which allow for hierarchical storage)
• Can have attachments as well
Old versions are retained
• So Doc Stores work well for content management
Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:
• CouchDB
• Derivatives
• MongoDB
Document Store Application Orientation
Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON
• Documents are JSON objects
• CouchDB/MongoDB use JavaScript as native language
In CouchDB, “view functions” also have unique URIs and they return HTML
• So you can build entire applications in the database
Document Stores
Database: Customers
Document ID: 101
• First_Name: Andrew
• Last_Name: Brust
• Address:
    • Number: 123
    • Street: Main Street
• Orders:
    • Most_recent: 1501
Document ID: 202
• First_Name: Jane
• Last_Name: Doe
• Address:
    • Number: 321
    • Street: Elm Street
• Orders:
    • Most_recent: 1502
Database: Orders
Document ID: 1501
• Price: 300 USD
• Item1: 52134
• Item2: 24457
Document ID: 1502
• Price: 2500 GBP
• Item1: 98456
• Item2: 59428
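A document from the sample above can be written as a JSON object with contained sub-documents, and addressed by a URI built from the database name and document ID. The sketch below makes no network calls. The `document_uri` helper, the base URL (a local server on port 5984, assumed here) and the lowercase database names are assumptions for illustration, and `_id` is used as the ID field in the style of CouchDB and MongoDB.

```python
import json

# A customer document with contained sub-documents (hierarchical storage).
customer_101 = {
    "_id": "101",
    "First_Name": "Andrew",
    "Last_Name": "Brust",
    "Address": {"Number": 123, "Street": "Main Street"},
    "Orders": {"Most_recent": "1501"},
}


def document_uri(base_url, database, doc_id):
    """Build the URI at which a REST-style document store exposes a document."""
    return f"{base_url}/{database}/{doc_id}"


BASE_URL = "http://localhost:5984"  # assumed local server, not from the slides

body = json.dumps(customer_101, indent=2)
uri = document_uri(BASE_URL, "customers", customer_101["_id"])

# The link to a document in another database is just an ID to look up there.
order_uri = document_uri(BASE_URL, "orders", customer_101["Orders"]["Most_recent"])

print(uri)
print(order_uri)
print(body)
```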
Demo: Document Stores
Graph Databases
Great for social network applications and others where relationships are important
Nodes and edges
• Edge like a join
• Nodes like rows in a table
Nodes can also have properties and values
Neo4j is a popular graph db
Graph Databases
Database (diagram)
Nodes:
• Joe Smith
• Jane Doe
• Andrew Brust
• George Washington
• Street: 123 Main Street, City: New York, State: NY, Zip: 10014
• ID: 252, Total Price: 300 USD
• ID: 52134, Type: Dress, Color: Blue
• ID: 24457, Type: Shirt, Color: Red
Edge labels:
• Sent invitation to
• Commented on photo by
• Friend of
• Address
• Placed order
• Item1
• Item2
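Nodes with properties and labeled edges can be sketched in a few lines of Python. The `Graph` class is hypothetical and is not the Neo4j API. The node names and edge labels come from the diagram, but which node connects to which is assumed here, as is treating node 252 as an order.

```python
class Graph:
    """Toy property graph: nodes have properties, edges have a label."""

    def __init__(self):
        self.nodes = {}
        self.edges = []  # (source, label, target) triples

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def neighbors(self, source, label):
        # Following an edge plays the role a join plays in a relational query.
        return [target for src, lbl, target in self.edges
                if src == source and lbl == label]


graph = Graph()
graph.add_node("Andrew Brust")
graph.add_node("Jane Doe")
graph.add_node("order-252", total_price="300 USD")
graph.add_node("item-52134", type="Dress", color="Blue")
graph.add_node("item-24457", type="Shirt", color="Red")

# Which node connects to which is assumed for this illustration.
graph.add_edge("Andrew Brust", "Friend of", "Jane Doe")
graph.add_edge("Andrew Brust", "Placed order", "order-252")
graph.add_edge("order-252", "Item1", "item-52134")
graph.add_edge("order-252", "Item2", "item-24457")

orders = graph.neighbors("Andrew Brust", "Placed order")
item_types = [
    graph.nodes[item]["type"]
    for order in orders
    for label in ("Item1", "Item2")
    for item in graph.neighbors(order, label)
]
print(item_types)
```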
PROVISIONING, MARKET, APPLICABILITY
NoSQL + BI
NoSQL databases are bad for ad hoc query and data warehousing
BI applications involve models; models rely on schema
Extract, transform and load (ETL) may be your friend
Wide-column stores, however, are good for “Big Data”
• See next slide
Wide-column stores and column-oriented databases are similar technologically
NoSQL + Big Data
Big Data and NoSQL are interrelated
Typically, Wide-Column stores used in Big Data scenarios
Prime example:
• HBase and Hadoop
Why?
• Lack of indexing not a problem
• Consistency not an issue
• Fast reads very important
• Distributed file systems important too
• Commodity hardware and disk assumptions also important
• Not Web scale but massive scale-out, so similar concerns
Going “NoSQL-Like” on the MS Cloud
Azure Table Storage (a key-value store)
SQL Azure XML columns (supports variable schema, hierarchy)
SQL Azure Federation (a sharding implementation)
OData (HTTP/JSON data APIs)
Running NoSQL database products using Azure VMs…
NoSQL on Windows Azure
Platform as a Service
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/
MongoDB, DIY:
• On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
  http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
NoSQL on Windows Azure
Others, DIY (Linux VMs):
• Couchbase: http://blog.couchbase.com/couchbase-server-new-windows-azure
• CouchDB: http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure
• Riak: http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/
• Redis: http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx
• Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/
And With MS On-Premise Technologies
SQL Server 2008/2008R2/2012 “Beyond Relational” Features
• Sparse columns (like Wide Column Stores)
• Geospatial (geometry, geography data types)
• FILESTREAM, FileTable (like Document Store attachments)
• Full Text Search, Semantic Similarity Search
• HierarchyID (can simulate Graph Database functionality)
SQL Server Parallel Data Warehouse Edition (PDW)
• Distributed architecture (like MapReduce/Hadoop)
• PolyBase in PDW v2 (interfaces PDW and HDFS)
TAKE-AWAYS
Compromises
Eventual consistency
Write buffering
Only primary keys can be indexed
Queries must be written as programs
Tooling
• Productivity (= money)
Summing Up
• Line of Business -> Relational
• Large, public (consumer)-facing sites -> NoSQL
• Complex data structures -> Relational
• Big Data -> NoSQL
• Transactional -> Relational
• Content Management -> NoSQL
• Enterprise -> Relational
• Consumer Web -> NoSQL
Thank you
• [email protected]
• @andrewbrust on twitter
• Want to get on Blue Badge Insights’ list? Text “bluebadge” to 22828