NoSQL DB Comparison
Zhang Gang, 2012.9.27
DIRAC Accounting system needs:
Big data, high scalability, write once / read many times …….(to be added)
Features
Riak
- Written in: Erlang & C, some JavaScript
- Main point: fault tolerance
- Principles from Amazon's Dynamo paper
- Tunable trade-offs for distribution and replication (N, R, W)
- Map/reduce in JavaScript or Erlang
- Masterless multi-site replication
- Language support: includes Python
- Supports full-text search, indexing, and querying with the Riak Search server
Riak
Best used: If you want something Cassandra-like (Dynamo-like), but without the bloat and complexity. If you need very good single-site scalability, availability, and fault tolerance, but are ready to pay extra for multi-site replication.
For example: Point-of-sale data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-updateable web server.
CouchDB
- Written in: Erlang
- Main point: embrace the web, ease of use
- Document-oriented; data format: JSON
- Bi-directional replication and off-line operation in mind
- MVCC - write operations do not block reads
- Needs compacting from time to time
- Views: embedded map/reduce
- Built for off-line use
- Automatically replicates all the data to all servers
- Supports ACID transactions
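To make the MVCC point concrete, here is a toy Python sketch (illustrative only, not the CouchDB API) of revision-based writes: each write appends a new revision instead of overwriting, so readers are never blocked, and old revisions accumulate until a compaction pass discards them.

```python
# Toy MVCC store: writes create new revisions, reads never block.
# Class and method names are illustrative, not CouchDB's API.

class MVCCStore:
    def __init__(self):
        self._revs = {}  # key -> list of (rev_number, value)

    def write(self, key, value):
        """Append a new revision; older revisions stay readable."""
        revs = self._revs.setdefault(key, [])
        rev = len(revs) + 1
        revs.append((rev, value))
        return rev

    def read(self, key, rev=None):
        """Read the latest revision, or a specific older one."""
        revs = self._revs[key]
        if rev is None:
            return revs[-1]
        return revs[rev - 1]

    def compact(self, key):
        """Drop all but the latest revision (why CouchDB needs compacting)."""
        self._revs[key] = self._revs[key][-1:]

store = MVCCStore()
store.write("doc", {"n": 1})
store.write("doc", {"n": 2})
print(store.read("doc"))         # (2, {'n': 2})
print(store.read("doc", rev=1))  # (1, {'n': 1})
```

The trade-off shown here is exactly the one on the slide: reads are cheap and non-blocking, at the cost of periodic compaction.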
CouchDB
Best used: The replication and synchronization capabilities of CouchDB make it ideal for mobile devices, where a network connection is not guaranteed but the application must keep working offline.
For accumulating, occasionally changing data on which pre-defined queries are to be run. Places where versioning is important.
For example: CRM and CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.
Cassandra
- Written in: Java
- Main point: best of BigTable and Dynamo
- Tunable trade-offs for distribution and replication (N, R, W)
- Querying by column, range of keys
- BigTable-like features: columns, column families
- Has secondary indices
- Writes are much faster than reads (!)
- Map/reduce possible with Apache Hadoop
- All nodes are similar, as opposed to Hadoop/HBase
- Gossip protocol, multi data center, no single point of failure
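The (N, R, W) trade-off mentioned for both Riak and Cassandra can be stated in one line: with N replicas, a read quorum of R and a write quorum of W overlap whenever R + W > N, so a read is guaranteed to see the latest acknowledged write. A minimal sketch of that Dynamo-style rule:

```python
# Dynamo-style tunable consistency: N replicas, R readers, W writers.
# If R + W > N, every read quorum overlaps every write quorum, so
# reads are strongly consistent; otherwise consistency is eventual.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """Read and write quorums overlap iff R + W > N."""
    return r + w > n

print(is_strongly_consistent(3, 2, 2))  # True  - quorum reads and writes
print(is_strongly_consistent(3, 1, 1))  # False - eventual consistency only
```

Lowering R or W trades consistency for latency and availability, which is what "tunable trade-offs" means on these slides.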
Cassandra
Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")
For example: Banking, financial industry. Writes are faster than reads, so one natural niche is real-time data analysis.
Hadoop HBase
- Written in: Java
- Main point: billions of rows x millions of columns
- Modeled after Google's BigTable
- Uses Hadoop's HDFS as storage
- Map/reduce with Hadoop
- Optimizations for real-time queries
- A high-performance Thrift gateway (access interface)
- Cascading, Hive, and Pig source and sink modules
- Random-access performance is like MySQL
- A cluster consists of several different types of nodes (Master/RegionServer)
- Does not scale down to small installations
Hadoop HBase
Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.
For example: Analysing log data.
Comparison
Cassandra VS CouchDB
Points that favor CouchDB:
- A document store
- Offline replication
- Embraces the web
But CouchDB automatically replicates all the data to all servers, which is impractical for a very large number of replicas and for very large databases.
These features may be unsuitable for the DIRAC Accounting System.
So, compared with CouchDB, I think Cassandra wins.
Cassandra VS Riak
Both are architecturally strongly influenced by Dynamo. Both also go beyond Dynamo in providing a "richer than pure K/V" data model.
Points that favor Cassandra:
- Speed
- Support for clusters spanning multiple data centers
- Big names using it (Digg, Twitter, Facebook, WebEx, ...)
- Map/reduce support out of the box (Cassandra can do it with Hadoop map/reduce)
So, I think Cassandra wins again.
HBase VS Cassandra
- Cassandra has only one type of node; all nodes are similar. HBase consists of several different types of nodes (Master/RegionServer).
- HBase must be deployed over HDFS; compared with this, Cassandra is much simpler.
- Data consistency in Cassandra is tunable (N, W, R).
- HBase better supports map/reduce.
- HBase provides the developer with row-locking facilities, whereas Cassandra does not; Cassandra just uses timestamps.
- Cassandra has better I/O performance and better scalability, but is not good at range scans.
- CAP: Cassandra focuses on AP and HBase focuses on CP.
- HBase has an SQL-compatible interface (Hive), so HBase supports SQL.
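The "Cassandra just uses timestamps" point refers to last-write-wins conflict resolution: instead of locking a row, each write carries a timestamp and the highest timestamp wins. A toy sketch of the idea (illustrative only, not the Cassandra API):

```python
# Last-write-wins (LWW) cell: no locks; each write carries a timestamp
# and only a newer timestamp replaces the stored value.
# Class name is illustrative, not part of Cassandra's API.

import time

class LWWCell:
    def __init__(self):
        self.value = None
        self.ts = -1.0

    def write(self, value, ts=None):
        ts = time.time() if ts is None else ts
        if ts > self.ts:  # newer timestamp wins; stale writes are dropped
            self.value, self.ts = value, ts

cell = LWWCell()
cell.write("old", ts=1.0)
cell.write("new", ts=2.0)
cell.write("stale", ts=1.5)  # arrives late, loses to the ts=2.0 write
print(cell.value)            # new
```

This is why Cassandra avoids the locking machinery HBase needs, at the cost of concurrent writers silently overwriting each other.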
HBase VS Cassandra
The structure of Cassandra is simple, and its deployment and maintenance are simple (saving money and time); by comparison, HBase is much more complex to deploy and maintain. But we already have a Hadoop cluster here.
HBase may be more suitable for data warehousing and large-scale data processing and analysis, and Cassandra more suitable for real-time transaction processing and the serving of interactive data.
HBase VS Cassandra
HBase has been recognized by the WLCG Database Technical Evolution Group as having the greatest potential impact on the LHC experiments out of all NoSQL technologies. The CERN IT organization is setting up a cluster to try it.
So, for an Accounting system, I think HBase may be a good choice.
A few company use cases
Use in production for CMS and ATLAS
CouchDB: CMS uses CouchDB in production for parts of its Data and Workflow Management systems, in particular for some queues and for the job state machine. The installation has 3 replicas of a CouchDB database at CERN and 4 replicas of the same database at Fermilab.
HBase: HBase is used in production by ATLAS in its Distributed Data Manager, DQ2, for both log analysis and accounting on a 12-node cluster. Their original method for the accounting summary ran 8 to 20 times faster than the same method on the shared Oracle system they had, depending on the HDFS replication level.
Cassandra: Cassandra is used in production by ATLAS PanDA monitoring. They chose to host it at BNL on only 3 nodes that were quite high-powered: each node has 24 cores and 1 terabyte of RAID0 solid-state disks (SSDs).
Script
Use the records in the type_* tables to draw some pie plots.
Script
FUNCTION:
def generatePlotByTime(groupby, generate, keyTableName, startTime, endTime):
    "the main function, generate a plot by parameters"
def getTrueValue(keyTableName, index):
    "select the key tables to get the true value by index"
Calling like this:
generatePlotByTime('Site', 'CPUTime', 'ac_key_Lhcb-Production_job_Site', '2010-6-20', '2012-6-20')
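The core of such a plot is the aggregation step: sum a metric (e.g. CPUTime) per distinct value of a key (e.g. Site), then feed the totals to a pie chart. A hedged sketch of that step, with an assumed flat record layout (the real script reads from the type_* tables instead):

```python
# Sketch of the aggregation behind generatePlotByTime: sum `metric`
# per distinct value of `groupby`. The record layout is assumed, not
# taken from the real DIRAC tables.

from collections import defaultdict

def aggregate(records, groupby, metric):
    """Return {group_value: summed metric} over the records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[groupby]] += rec[metric]
    return dict(totals)

records = [
    {"Site": "LCG.CERN.ch", "CPUTime": 120.0},
    {"Site": "LCG.CNAF.it", "CPUTime": 80.0},
    {"Site": "LCG.CERN.ch", "CPUTime": 40.0},
]
totals = aggregate(records, "Site", "CPUTime")
print(totals)  # {'LCG.CERN.ch': 160.0, 'LCG.CNAF.it': 80.0}
```

The resulting totals can then be passed to a plotting library (e.g. matplotlib's `pie(totals.values(), labels=totals.keys())`) to produce the pie plots timed on the next slides.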
Script
- DiskSpace groupby Site: cost about 97.39 s
- CPUTime groupby User: cost about 97.03 s
- CPUTime groupby UserGroup: cost about 93.62 s
- DiskSpace groupby ProcessingType: cost about 94.82 s
Script-bytime
- processing time: 86.64 s
- processing time: 97.69 s
Thanks