no sql studies

Upload: adityaa2064

Post on 03-Apr-2018

222 views

Category:

Documents


1 download

TRANSCRIPT

  • 7/28/2019 NO SQL STUDIES

    1/4

    International Journal of Computer Applications (0975 888)Volume 48 No.20, June 2012

    1

    Comparative Study of the New Generation, Agile,Scalable, High Performance NOSQL Databases

    Clarence J M TauroCentre for Research,

    Christ University,Hosur Road, Bangalore, India

    Aravindh SDept. of Computer Science,

    Christ University,Hosur Road, Bangalore, India

    Shreeharsha A.BDept. of Computer Science,

    Christ University,Hosur Road, Bangalore, India

    ABSTRACT Relational database is widely used in most of the applicationto store and retrieve data. They work best when they handle alimited set of data. Handling real time huge volume of datalike internet was inefficient in relation database systems. Toovercome this problem the "NO-SQL" or "Not Only SQL"

    Database came into existence. This paper discusses about problems with relation databases and how different types of NOSQL Databases are used to efficiently handle the realworld problems.

    Keywords NoSQL, Key-Value Store, BigTable, Document Databases,Graph Databases, Berkeley DB, Tokyo Tryant, VoldemartDB, Neo4j, Hadoop

    1. INTRODUCTIONThe problems with Relation database is that it lacked handlingexponential growth of data. Many organizations collect vastamounts of customer, scientific, sales, and other data for future analysis. Traditionally, most of these organizationshave stored structured data in relational databases for subsequent access and analysis. However, a growing number of developers and users have begun turning to various types of non-relational, now frequently called NOSQL databases [2].Their primary advantage of NOSQL Database is that, unlikerelational databases they handle unstructured data such asdocuments, e-mail, multimedia and social media efficiently.The common features of NOSQL databases can besummarized as high scalability and reliability, very simpledata model, very simple (primitive) query language, lack of mechanism for handling and managing data consistency andintegrity constraints maintenance [1].

    2. PROBLEMS WITH RELATIONDATABASESThere are three major problems with Relation Databaseswhich makes it inefficient.

    First is the data set size. There is a huge growth of information in internet. International Data Corporation (IDC)analysis shown in Figure 1 shows the growth of data from theyear 2007 to 2010 to be almost 25 times.

    Figure. 1 Scalable of data size from 2007 2010

    Second is the problem on connectivity. Over time, the

    information is getting more and more connected.Figure 2shows the growth of connectivity over the years.

    Fugure. 2 Growth of connectivity

    Third problem is about the semi structure. Semi structureinformation is information which has few mandatoryattributes but many optional attributes. As the informationgrew, there was need to increase the columns of table, whichwould lead to sparse tables.

    3. NOSQL CATEGORIES NOSQL can be broken into 4 different categories.

  • 7/28/2019 NO SQL STUDIES

    2/4

    International Journal of Computer Applications (0975 888)Volume 48 No.20, June 2012

    2

    Key Value Stores

    Big Table

    Document Databases

    Graph Databases

    Each database is individually good at dealing with size andcomplexities.

    3.1 Key Value StoresKey value data model means that a value corresponds to aKey. Although the structure is simpler, the query speed ishigher than relational database, supports mass storage andhigh concurrency, etc., It provided support for query andmodify operations for data through the primary key [3]. Keyvalues represent bucket of data. For example, in case of ashopping cart mentioned in Figure 3, each shopping cart arerepresented in individual buckets and represented using a keyvalue which could be user id.

    The key values can be serialized using either java serialization

    or XML. This way is very fast to store as it just writes bits tothe discs. Some of key value stores available in market areBerkeley DB, Tokyo Tyrant, Voldemart, Crassandra.

    Figure. 2 Shopping cart example of Key Value Store

    3.1.1 Berkeley DBOracle Berkeley DB is a high-performance embeddabledatabase providing key value storage. Berkeley DB offersadvanced features including transactional data storage, highlyconcurrent access, replication for high availability, and fault

    tolerance in a self-contained, small footprint software library[4].

    3.1.2 Tokyo TyrantTokyo Tyrant is a high performance storage engine, while theTT provides multi-threaded high-concurrency servers, it canhandle 4-5 million times read and write operations per second[3]. While ensure the high performance of read and writeconcurrent, it uses a reliable data persistence mechanism [3].

    3.1.3 Searching in key value storeSince everything is stored in a bucket, searching is donethrough manual indexing function where all essentialattributes are added to an index table and searched. This maynot be as popular, because it expects the user to haveknowledge of index. Search was made easy in key value store

    using lucene search engine, which retrieves number of key backs to look up in the key value stores.

    3.1.4 Scaling to sizeThe biggest challenge to NOSQL databases is scaling to size.

    Scaling to size in today s world is scaling horizontally, that isadding new machines. There are number of techniques toachieve this.

    a. Master Slave replication

    b. Sharding

    c. Dynamo model

    a. Master Slave Replication

    In Master Slave Replication, one server is designated asmaster and all write happens only to the server. Every updateis propagated to all other servers. For example, if there are 1master and 4 slaves machines, the write can happen only on 1master server where as there would 5 machines whichresponds to read requests.

    Figure. 3 Master slave replication of read write

    b. Sharding

    The problem with Master Slave system is, if there more writerequest than what one machine can handle. This leads to nextsystem called Sharding. In Sharding, completely isolatedstructures are put into one server according to alphabeticalorder. For example- Figure 5 shows the example of allcustomer names starting from A-M are put on one server and

    N-Z are put on another server. This certainly can handletwice the workload. Similarly there can be n number of machines added to ease the work load.

    Figure. 4 Shopping cart example for Sharding model.

    The problem with sharding is that when a new machine has to be added. It requires the online system to be shut down to

    shuffle the data across the new server.

  • 7/28/2019 NO SQL STUDIES

    3/4

    International Journal of Computer Applications (0975 888)Volume 48 No.20, June 2012

    3

    c. Dynamo Model

    Amazon came up with solution to the problem with shardingwhich is a dynamo model. In this model, not the entire onlinesystem is brought down to add a new machine. Instead, only a

    portion of the online system is brought down. There are threedifferent approaches of DYNAMO model. The first approach

    is called Basically Available. This method basically hasAmazon online selling few products like books, where as itwould not be able to sell music. This is better than havingAmazon offline completely. The next method is called SoftState where it checks most recent value. For example, it therewere 20 copies of a book about one hr ago. It assumes thereare copies available and allows the customer to order for the

    book. The third method is the most effective DYNAMOapproach called Eventually Available which means photouploaded on a social network server in India may not show upimmediately in U.S. Whereas, eventually it will be availablein U.S server in sometime.

    Linkedin developed database called Voldemart is based onDynamo model. This is another example of a key value store.

    It is developed in Java.

    3.2 BIGTABLESearch engine Zvents develop open source distributed datastorage system hyper table [3] by drawing big table.

    A BigTable is a light, scattered, constant multidimensionalsorted map. Indexing of the map is done by a row key, columnkey, and a timestamp. In BigTable, un-interpreted arrays of

    bytes are used as values. BigTable stores structured data. Anytype of data from text to serialized objects can be stored byapplications. It does not impose any size constraint for eachvalue. A table is allowed to have limitless number of columns.Data is indexed using row and column names that can bearbitrary strings [5].

    Figure.6Normalizedtable VS Big table

    Google developed its own big table model called GoogleBigTable. Google BigTable has been designed to scale intothe petabyte range across hundreds or even thousands of computers, and also to ease the addition of more machineswithout much reconfiguration, thereby making the fullest useof the resources [5].

    Google BigTable is built on top of the Google File System,Chubby and stored in an immutable data structure calledSSTable which facilitates the storage of log and data files [5].Chubby is used by BigTable to store the root tablet, schemadetails, access control lists, coordinate and identify tabletservers [5].

    Hbase is the open source version of BigTable. Hbase emulatesmost of the functionalities provided by BigTable. Like mostnon SQL database systems, Hbase is written in Java. Hbase isan Apache open source project and aims to provide a storagesystem similar to BigTable in the Hadoop distributedcomputing environment. Hadoop Distributed File System(HDFS) is a distributed file system structure for operating oncommon hardware structures (commodity computers)characterized by low cost implementation [6]. Figure 7explains Architecture of HDFS.

    Figure. 5 Architecture of HBASE

    3.3 DOCUMENT DATABASEDocument database is not concerned about high performance

    read and write concurrent, but rather to ensure that big datastorage and good query performance [3]. Typical documentdatabase are Mongo DB, Couch DB.

    Couch DB is one most popular document database which isflexible, fault-tolerant database, which supports data formatcalled JSON. It provides REST-style API to ensure dataconsistency, Couch DB comply with ACID properties. Inaddition, Couch DB provides a P2P-based distributeddatabase solution that supports bidirectional replication.However, it also has some limitations, such as only providingan interface based on HTTP REST, concurrent read and write

    performance is not ideal and so on [3].

    Mongo DB is non-relational database, which features the

    richest and most like the relational database. It supportscomplex data types which uses BJSON data structures to storecomplex data type [3]. It uses powerful query language whichallows most of functions like query in single-table of relational databases, and also support index. High-speedaccess to mass data: when the data exceeds 50GB, Mongo DBaccess speed is 10 times than My SQL. Because of thesecharacteristics of Mongo DB, many projects with increasingdata are considering using Mongo DB instead of relationaldatabase.

    3.4 GRAPH DATABASEGraph Database works with 3 core abstraction, Node,relationships between nodes and key value pairs which can beattached to nodes and relationships.

    Neo4j is a high-performance, NOSQL graph database with allthe features of a mature and robust database [7]. It is a graph

  • 7/28/2019 NO SQL STUDIES

    4/4

    International Journal of Computer Applications (0975 888)Volume 48 No.20, June 2012

    4

    database which is supported by neo technologies. It isdeveloped in Java and it supports master-slave replication.

    Allegrograph is another graph database which is developed byFranz Inc [8]. Allegrograph supports ad-doc query languagecalled SPARQL. It provides REST protocol architecture.

    4.

    CONCLUSIONThis paper describes the limitations of traditional databasesand the advantages of NO SQL database. According to eachtype of data models, introduce the current mainstream NOSQL database, and objective analysis of their strengthsandweaknesses respectively, which will help user to choose

    NO SQL database in practice.

    Companies need to consider the following options whendeciding which properties NO SQL:

    Data Model CAP Support Multi Data Center Support Capacity

    Performance Query API Reliability Data Persistence Business Support

    5. ACKNOWLEDGMENTWe are heartily thankful to Prof. Jibrael Jos, Prof. JoyPauloseand Dr. N. Ganeshan whose encouragement, guidance

    and support enabled us to develop an understanding of thesubject.

    6. REFERENCES[1] Okman, Lior; Gal-Oz, Nurit; Gonen, Yaron; Gudes,

    Ehud; Abramov, Jenny; , "Security Issues in NoSQLDatabases," Trust, Security and Privacy in Computing

    and Communications (TrustCom), 2011 IEEE 10thInternational Conference on , vol., no., pp.541-547, 16-18 Nov. 2011 doi : 10.1109/TrustCom.2011.70

    [2] Leavitt, N.; , "Will NoSQL Databases Live Up to Their Promise?," Computer , vol.43, no.2, pp.12-14, Feb.2010doi: 10.1109/MC.2010.58

    [3] Jing Han; Haihong, E.; Guan Le; Jian Du; , "Survey on NoSQL database," Pervasive Computing andApplications (ICPCA), 2011 6th InternationalConference on , vol., no., pp.363-366, 26-28 Oct. 2011doi: 10.1109/ ICPCA.2011.6106531

    [4] Oracle Corporation, [online],http://www.oracle.com/us/products/database/berkeley-db/db/overview/index.html (Accessed: 9 February2012).

    [5] Ramanathan, Shalini; Goel, Savita; Alagumalai,Subramanian; , "Comparison of Cloud database:Amazon's SimpleDB and Google's Bigtable," RecentTrends in Information Systems (ReTIS), 2011International Conference on , vol., no., pp.165-168, 21-23 Dec. 2011doi: 10.1109/ReTIS.2011.614686

    http://www.oracle.com/us/products/database/berkeley-db/db/overview/index.htmlhttp://www.oracle.com/us/products/database/berkeley-db/db/overview/index.htmlhttp://www.oracle.com/us/products/database/berkeley-db/db/overview/index.htmlhttp://www.oracle.com/us/products/database/berkeley-db/db/overview/index.html