NoSQL Seminar
NOSQL
Agenda
Introduction to NOSQL
Objective
Examples of NOSQL databases
NOSQL vs SQL
Conclusion
Basic Concepts
Database: an organized collection of data.
Database Management System (DBMS): a software package of computer programs that controls the creation, maintenance and use of a database. We interact with a DBMS through a structured language. Ex. Oracle, IBM DB2, MS Access, MySQL, FoxPro.
Relational DBMS - A relational database is a collection of data items organized as a set of formally described tables from which data can be accessed easily. A relational database is created using the relational model. The software used in a relational database is called a relational database management system (RDBMS).
SQL
Structured Query Language
Special-purpose programming language designed for managing data in an RDBMS.
Originally based upon relational algebra and tuple relational calculus.
SQL's scope includes data insert, update and delete, schema creation and modification, and data access control.
Its typing discipline is static and strong.
The most widely used database language.
Query is the most important operation in SQL.
Ex. SELECT * FROM Book WHERE price > 100.00 ORDER BY title;
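The example query can be tried from Python against an in-memory SQLite database. This is a minimal sketch: the Book table layout and the sample rows are assumptions made up for illustration; only the SELECT statement comes from the slide.

```python
import sqlite3

# In-memory database; the Book schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Book (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO Book (title, price) VALUES (?, ?)",
    [
        ("SQL Basics", 80.0),
        ("Databases in Depth", 150.0),
        ("Advanced Queries", 120.0),
    ],
)

# The query from the slide: books costing more than 100.00, sorted by title.
rows = conn.execute(
    "SELECT * FROM Book WHERE price > 100.00 ORDER BY title"
).fetchall()

for title, price in rows:
    print(title, price)
```

The query is declarative: it states which rows are wanted and in what order, and the database decides how to find them.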
NOSQL
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema, nor do they use the concept of joins
All NOSQL offerings relax one or more of the ACID properties: Atomicity, Consistency, Isolation, Durability
“NOSQL” = “Not Only SQL” = not only using a traditional relational DBMS
NOSQL
• Alternative to traditional relational DBMS
• Flexible schema
• Quicker/cheaper to set up
• Massive scalability
• Relaxed consistency → higher performance & availability
* No declarative query language → more programming
* Relaxed consistency → fewer guarantees
Why NOSQL?
Not every problem can be solved by a traditional relational database system alone.
Handles huge databases.
Redundancy: data is pretty safe on commodity hardware
Super flexible queries using map/reduce
Rapid development (no fixed schema, yeah!)
Very fast for common use cases
Contd..
Inspired by distributed data storage problems
Scale easily by adding servers
Not suited to all problem types, but super-suited to certain large problem types
  High-write situations (e.g. activity tracking or timeline rendering for millions of users)
  A lot of relational uses are really dumbed down (e.g. fetch by PK with update)
Architecture
How does it work?
Clients know how to:
  Send items to servers (consistent hashing)
  Handle a server failure
  Fetch keys from servers
  Weight the distribution according to server capacities
Servers know how to:
  Store items they receive
  Expire them from the cache
No inter-server comms: servers are unaware of each other
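The client-side logic can be sketched with a small consistent-hashing ring. This is an illustration only, not the code of any particular client library: the class name HashRing, the server names, the weights and the points_per_weight parameter are all assumptions.

```python
import bisect
import hashlib


def ring_point(value: str) -> int:
    """Hash a string to a position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Client-side ring; names, weights and points_per_weight are illustrative."""

    def __init__(self, servers, points_per_weight=100):
        # servers maps a server name to its relative capacity ("weight").
        ring = []
        for name, weight in servers.items():
            for i in range(weight * points_per_weight):
                ring.append((ring_point(f"{name}#{i}"), name))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._names = [name for _, name in ring]

    def server_for(self, key: str) -> str:
        """Return the server that owns the first ring point at or after the key."""
        index = bisect.bisect_left(self._points, ring_point(key))
        return self._names[index % len(self._names)]


servers = {"cache-a": 1, "cache-b": 1, "cache-c": 2}
ring = HashRing(servers)
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.server_for(key) for key in keys}

# Simulate cache-b failing: the client rebuilds the ring without it.
survivors = {name: w for name, w in servers.items() if name != "cache-b"}
after_failure = HashRing(survivors)
moved = [key for key in keys if after_failure.server_for(key) != before[key]]
print(f"{len(moved)} of {len(keys)} keys moved after the failure")
```

Only keys that lived on the failed server are reassigned; every other key keeps its server, which is why no coordination between servers is needed.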
Performance
RDBMS uses buffers to ensure ACID properties
NoSQL does not guarantee ACID and can therefore be much faster
We don’t need ACID everywhere!
Ex. Data processing (every minute) is 4x faster with MongoDB, despite being a lot more detailed (due to much simpler development)
Why is NOSQL faster than SQL? - Scaling
Simple web application with not much traffic
  Application server, database server all on one machine
Scaling contd..
More traffic comes in
  Application server
  Database server
Even more traffic comes in
  Load balancer
  Application server x2
  Database server
Scaling contd..
Even more traffic comes in
  Load balancer x N: easy
  Application server x N: easy
  Database server x N: hard for SQL databases
SQL Slowdown
Not linear!
Scaling contd..
NoSQL Scaling
Need more storage? Add more servers!
Need higher performance? Add more servers!
Need better reliability? Add more servers!
Scaling Summary
You can scale SQL databases (Oracle, MySQL, SQL Server…)
  This will cost you dearly
  If you don’t have a lot of money, you will reach limits quickly
You can scale NoSQL databases
  Very easy horizontal scaling
  Lots of open-source solutions
  Scaling is one of the basic incentives for design, so it is well handled
  Scaling is the cause of trade-offs that force you to use map/reduce
Characteristics
Almost infinite horizontal scaling
Very fast
Performance doesn’t deteriorate with growth (much)
No fixed table schemas
No join operations
Ad-hoc queries difficult or impossible
Structured storage
Almost everything happens in RAM
NOSQL Types
Wide Column Store / Column Families
Document Store
Key Value / Tuple Store
Graph Databases
Object Databases
XML Databases
Multivalue Databases
Main types -
Key-Value Stores
Map Reduce Framework
Document Databases
Graph Databases
Key Value Stores
Lineage: Amazon's Dynamo paper and Distributed Hash Tables.
Data model: A global collection of key-value pairs
Example systems: Google BigTable, Amazon Dynamo, Cassandra, Voldemort, HBase, …
Implementation: efficiency, scalability, fault-tolerance
  Records distributed to nodes based on key
  Replication
  Single-record transactions, “eventual consistency”
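The idea of eventual consistency can be sketched with a toy replicated store. Everything here is an assumption for illustration (the class name, three replicas, a manual sync() step, last-version-wins merging); real systems replicate in the background and use more careful versioning.

```python
import zlib


class EventuallyConsistentStore:
    """Toy replicated key-value store; replica count and sync() are illustrative."""

    def __init__(self, replicas=3):
        self.nodes = [{} for _ in range(replicas)]  # each node: key -> (version, value)
        self._pending = []  # writes accepted but not yet replicated
        self._version = 0

    def primary_for(self, key):
        """Pick the node that accepts writes for this key, based on the key's hash."""
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key, value):
        """Single-record write: applied on the primary, queued for other replicas."""
        self._version += 1
        record = (self._version, value)
        self.nodes[self.primary_for(key)][key] = record
        self._pending.append((key, record))

    def get(self, key, replica):
        """Read from one specific replica, which may be behind."""
        record = self.nodes[replica].get(key)
        return None if record is None else record[1]

    def sync(self):
        """Replicate queued writes; on each node the highest version wins."""
        for key, record in self._pending:
            for node in self.nodes:
                if key not in node or node[key][0] < record[0]:
                    node[key] = record
        self._pending.clear()


store = EventuallyConsistentStore()
store.put("user:1", "alice")
primary = store.primary_for("user:1")
other = (primary + 1) % len(store.nodes)

stale = store.get("user:1", replica=other)  # this replica has not seen the write yet
store.sync()
fresh = store.get("user:1", replica=other)  # after replication the replicas agree
```

Between put() and sync() a reader can see old data on some replicas; once replication catches up, all replicas return the same value.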
Document Databases
Lineage: Inspired by Lotus Notes.
Data model: Collections of documents, which contain key-value collections (called "documents").
Example: CouchDB, MongoDB, Riak
Graph Database
Lineage: Draws from Euler and graph theory.
Data model: Nodes & relationships, both of which can hold key-value pairs
Example: AllegroGraph, InfoGrid, Neo4j
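The graph data model can be sketched in plain Python. The node names, relationship types and properties are made up for illustration, and this is not the API of Neo4j or any other product listed.

```python
from collections import deque

# Nodes and relationships both carry key-value properties (sample data is made up).
nodes = {
    "alice": {"kind": "person", "city": "Pune"},
    "bob": {"kind": "person", "city": "Mumbai"},
    "carol": {"kind": "person", "city": "Delhi"},
}
relationships = [
    ("alice", "KNOWS", "bob", {"since": 2009}),
    ("bob", "KNOWS", "carol", {"since": 2011}),
    ("carol", "FOLLOWS", "alice", {}),
]


def neighbours(node, rel_type):
    """Nodes reachable from `node` over one outgoing relationship of `rel_type`."""
    return [dst for src, rel, dst, _props in relationships
            if src == node and rel == rel_type]


def reachable(start, rel_type):
    """Breadth-first traversal that follows a single relationship type."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


friends_of_friends = reachable("alice", "KNOWS")
print(friends_of_friends)
```

Traversals like this replace the multi-way joins a relational schema would need for the same question.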
Map Reduce Framework
Google’s framework for processing highly distributable problems across huge datasets using a large number of computers
Let’s define large number of computers
  Cluster if all of them have same hardware
  Grid unless Cluster (if !Cluster for old-style programmers)
Process split into two phases
Map
  Take the input, partition it, delegate to other machines
  Other machines can repeat the process, leading to tree structure
  Each machine returns results to the machine that gave it the task
Map Reduce Framework contd..
Reduce
  Collect results from machines you gave the tasks
  Combine results and return them to the requester
Slower than sequential data processing, but massively parallel
Sort a petabyte of data in a few hours
Input, Map, Shuffle, Reduce, Output
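The Input, Map, Shuffle, Reduce, Output pipeline can be sketched as a single-process word count. This is only an illustration of the phases: the function names and input chunks are assumptions, and a real framework would run each chunk on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one input chunk into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Input: each chunk stands in for the part of the data given to one machine.
chunks = ["NoSQL scales out", "SQL scales up", "NoSQL and SQL coexist"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)
```

Because each map call looks at one chunk and each reduce call at one key, both phases can be spread across many machines without sharing state.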
Popular NoSQL
Hadoop / HBase
Cassandra
Amazon SimpleDB
MongoDB
CouchDB
Redis
MemcacheDB
Voldemort
Hypertable
Cloudata
IBM Lotus/Domino
Real World Use
Cassandra
  Facebook (original developer, used it till late 2010)
  Twitter
  Digg
  Reddit
  Rackspace
  Cisco
BigTable
  Google (open-source version is HBase)
MongoDB
  Foursquare
  Craigslist
  Bit.ly
  SourceForge
  GitHub
MONGODB
Document store
Basic support for dynamic (ad hoc) queries
Query by example (nice!)
Conditional operators
  <, <=, >, >=
  $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type
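Query by example can be illustrated with a tiny matcher written in plain Python. This is not MongoDB's implementation or driver API: the matches and find functions are hypothetical, they only look at top-level fields, and they support just a handful of the operators ($lt, $lte, $gt, $gte, $in, $exists). The sample documents are made up.

```python
import operator

# Supported comparison operators; a deliberately small subset.
OPERATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$in": lambda value, options: value in options,
}


def matches(document, query):
    """True if the document satisfies every field condition in the query."""
    for field, condition in query.items():
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op == "$exists":
                    if (field in document) != operand:
                        return False
                elif field not in document:
                    return False
                elif not OPERATORS[op](document[field], operand):
                    return False
        elif document.get(field) != condition:
            return False  # plain value: the "example" must match exactly
    return True


def find(collection, query):
    """Return the documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


books = [
    {"title": "SQL Basics", "price": 80},
    {"title": "Databases in Depth", "price": 150, "tags": ["db"]},
    {"title": "Advanced Queries", "price": 120},
]

expensive = find(books, {"price": {"$gt": 100}})
tagged = find(books, {"tags": {"$exists": True}})
by_example = find(books, {"title": "SQL Basics"})
```

The query is itself a document, so it can be built and passed around like any other data structure. Note that the documents do not share a fixed set of fields.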
MONGODB
Data is stored as BSON (binary JSON)
  Makes it very well suited for languages with native JSON support
Map/Reduce written in JavaScript
  Slow! There is one single thread of execution in JavaScript
Master/slave replication (auto failover with replica sets)
Sharding built-in
Uses memory mapped files for data storage
Performance over features
On 32-bit systems, limited to ~2.5 GB
An empty database takes up 192 MB
GridFS to store big data + metadata (not actually an FS)
CASSANDRA
Written in: Java
Protocol: Custom, binary (Thrift)
Tunable trade-offs for distribution and replication (N, R, W)
Querying by column, range of keys
BigTable-like features: columns, column families
Writes are much faster than reads (!)
Constant write time regardless of database size
Map/reduce possible with Apache Hadoop
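The (N, R, W) trade-off can be checked with a little arithmetic. The sketch assumes the usual Dynamo-style reading, which the slide does not spell out: N is the number of replicas of a record, W the number of replicas that must acknowledge a write, and R the number that must answer a read. A read is guaranteed to reach at least one replica holding the latest write when R + W > N; the brute-force check below confirms that rule for small N.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Rule of thumb: read and write quorums must intersect when r + w > n."""
    return r + w > n


def always_overlaps(n, r, w):
    """Brute force: does every read set of size r meet every write set of size w?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3)]:
    print(n, r, w, quorums_overlap(n, r, w), always_overlaps(n, r, w))
```

Small R and W make operations faster but allow stale reads; larger values trade speed for stronger guarantees.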
Some more info about Cassandra in Facebook
Cassandra is an open-source DBMS from the Apache Software Foundation.
Cassandra provides a structured key-value store with tunable consistency
Cassandra is a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure
It is a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010
HBASE
Written in: Java
Main point: Billions of rows X millions of columns
Modeled after BigTable
Map/reduce with Hadoop
Query predicate push down via server side scan and get filters
Optimizations for real time queries
A high performance Thrift gateway
HTTP supports XML, Protobuf, and binary
Cascading, Hive, and Pig source and sink modules
No single point of failure
While Hadoop streams data efficiently, it has overhead for starting map/reduce jobs. HBase is a column-oriented key/value store and allows for low-latency reads and writes.
Random access performance is like MySQL
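The column-oriented key/value model can be pictured as a sorted map from row key to sparse columns. The sketch below is an in-memory illustration only, not the HBase client API: the class WideColumnTable, the "family:qualifier" column naming and the sample rows are assumptions.

```python
import bisect


class WideColumnTable:
    """Toy table: sorted row keys, each row holding only the columns it uses."""

    def __init__(self):
        self._rows = {}  # row key -> {"family:qualifier": value}
        self._keys = []  # row keys kept sorted to support range scans

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Random access to one cell; None if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start, stop, family=None):
        """Return rows with start <= key < stop, optionally filtered by family."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        results = []
        for key in self._keys[lo:hi]:
            cells = self._rows[key]
            if family is not None:
                prefix = family + ":"
                cells = {c: v for c, v in cells.items() if c.startswith(prefix)}
            if cells:
                results.append((key, cells))
        return results


table = WideColumnTable()
table.put("user#001", "info:name", "Ann")
table.put("user#001", "stats:logins", 3)
table.put("user#002", "info:name", "Bob")
table.put("user#010", "info:name", "Cy")

info_rows = table.scan("user#001", "user#010", family="info")
```

Rows store only the columns they actually have, which is what makes "millions of columns" practical, and keeping row keys sorted is what makes range scans cheap.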
COUCHDB
Written in: Erlang
Main point: DB consistency, ease of use
Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)
MVCC - write operations do not block reads
Previous versions of documents are available
Crash-only (reliable) design
Needs compacting from time to time
Views: embedded map/reduce
Formatting views: lists & shows
Server-side document validation possible
Authentication possible
Real-time updates via _changes (!)
Attachment handling
CouchApps (standalone JS apps)
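Revision-based conflict detection can be sketched as follows. This is a simplified illustration, not CouchDB's HTTP API: the RevisionedStore class is hypothetical and revisions are plain integers rather than CouchDB's revision strings.

```python
class ConflictError(Exception):
    """Raised when an update is based on an outdated revision."""


class RevisionedStore:
    """Toy document store that keeps every version and rejects stale updates."""

    def __init__(self):
        self._history = {}  # doc id -> list of (revision, body)

    def get(self, doc_id):
        """Return the latest (revision, body); readers are never blocked."""
        revision, body = self._history[doc_id][-1]
        return revision, dict(body)

    def put(self, doc_id, body, rev=None):
        """Store a new version; `rev` must name the current revision."""
        versions = self._history.get(doc_id, [])
        current = versions[-1][0] if versions else None
        if rev != current:
            raise ConflictError(f"{doc_id}: expected rev {current}, got {rev}")
        new_rev = (current or 0) + 1
        self._history[doc_id] = versions + [(new_rev, dict(body))]
        return new_rev

    def previous_versions(self, doc_id):
        """Older versions remain available until compaction removes them."""
        return [dict(body) for _rev, body in self._history[doc_id][:-1]]


store = RevisionedStore()
rev1 = store.put("note", {"text": "draft"})
rev2 = store.put("note", {"text": "final"}, rev=rev1)

try:
    store.put("note", {"text": "late edit"}, rev=rev1)  # based on an old revision
    conflict_detected = False
except ConflictError:
    conflict_detected = True
```

A writer never overwrites data it has not seen: it either builds on the latest revision or gets a conflict to resolve.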
HADOOP
Apache project
A framework that allows for the distributed processing of large data sets across clusters of computers
Designed to scale up from single servers to thousands of machines
Designed to detect and handle failures at the application layer, instead of relying on hardware for it
Created by Doug Cutting, who named it after his son's toy elephant
Hadoop-related Apache projects
  Cassandra
  HBase
  Pig
Hive was a Hadoop subproject, but is now a top-level Apache project
HADOOP contd..
Scales to hundreds or thousands of computers, each with several processor cores
Designed to efficiently distribute large amounts of work across a set of machines
Hundreds of gigabytes of data constitute the low end of Hadoop-scale
Built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes
Uses Java, but allows streaming so other languages can easily send and accept data items to/from Hadoop
HADOOP contd..
Uses distributed file system (HDFS)
  Designed to hold very large amounts of data (terabytes or even petabytes)
  Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications
  Data organized into directories and files
  Files are divided into blocks (64 MB by default) and distributed across nodes
Design of HDFS is based on the design of the Google File System
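The block arithmetic can be worked through in a few lines. The 64 MB block size comes from the slide; the replication factor of 3, the node names and the round-robin placement are assumptions for illustration (real HDFS placement also takes racks into account).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def block_count(file_size):
    """Number of blocks needed for a file of `file_size` bytes (ceiling division)."""
    return -(-file_size // BLOCK_SIZE)


def place_blocks(file_size, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (simplified)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(block_count(file_size))
    }


one_gb = 1024 ** 3
nodes = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical node names
placement = place_blocks(one_gb, nodes)

print(block_count(one_gb), "blocks,", len(placement[0]), "replicas each")
```

A 1 GB file therefore becomes 16 blocks, and with three replicas per block the cluster stores 48 block copies in total, so losing one node does not lose any block.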
HIVE
A petabyte-scale data warehouse system for Hadoop
Easy data summarization, ad-hoc queries
Query the data using a SQL-like language called HiveQL
Hive compiler generates map-reduce jobs for most queries
Conclusion
NoSQL is a great problem solver if you need it
Choose your NoSQL platform carefully as each is designed for specific purpose
Get used to Map/Reduce
It’s not a sin to use NoSQL alongside a (yes) SQL database
THANK YOU..!!