Evaluation of NoSQL and Array Databases for Scientific Applications
Lavanya Ramakrishnan, Pradeep Mantha, Yushu Yao, Richard Shane Canon
Lawrence Berkeley National Lab
New types of databases are increasingly being used for science applications
Advanced Light Source Data Rates
2009 65 TB/yr
2011 312 TB/yr
2013 1900 TB/yr
Schemaless database
[Figure: www.materialsproject.org architecture diagram. Labels: NERSC, Portal, MongoDB, HPSS, GPFS, manager.x, register, clean, monitor, metadata.]
Source: Michael Kocher, Daniel Gunter
More details: Big Data Analytics: Challenges and Opportunities (BDAC-13) Workshop at 2:50 pm
Schema-less or NoSQL databases
• Dynamic schema evolution
• Typically …
– do not use SQL as the query language
– may not give full ACID guarantees (eventual consistency)
– distributed, fault-tolerant architecture
• Significant performance and scalability
– Highly optimized for retrieve and append
• Schema-less databases fall closer to the BASE end of the continuous ACID-to-BASE spectrum
ACID: Atomicity, Consistency, Isolation, Durability
BASE: Basic Availability, Soft-state, Eventual Consistency
Schema-less databases are roughly classified into four categories
• Key/Value Store
– e.g., Dynamo, Project Voldemort
• Columnar or extensible record
– e.g., BigTable, HBase, Cassandra, Hypertable
• Document Stores
– store documents/objects
– e.g., CouchDB, MongoDB
• Graph databases
– e.g., Neo4j
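To make the four categories concrete, here is a minimal sketch of how one record might look under each data model, using plain Python structures. The record, keys, and field names are all hypothetical; none of this is the API of Dynamo, Cassandra, MongoDB, or Neo4j.

```python
# Hypothetical record, shown in each of the four data models.
# Every key, field name, and value below is made up for illustration.

# Key/value store: the store sees only an opaque value under one key.
key_value = {"seq:42": b"ACGT|9f1c"}

# Columnar / extensible record: row key -> column family -> column -> value.
wide_column = {
    "seq:42": {
        "data": {"sequence": "ACGT"},
        "meta": {"checksum": "9f1c"},
    }
}

# Document store: a nested, self-describing document; fields may vary per document.
document = {"_id": 42, "sequence": "ACGT", "meta": {"checksum": "9f1c"}}

# Graph database: nodes with properties plus typed edges between nodes.
graph = {
    "nodes": {42: {"sequence": "ACGT"}, 43: {"sequence": "ACGA"}},
    "edges": [(42, "SIMILAR_TO", 43)],
}


def neighbors(g, node_id):
    """Return the ids reachable from node_id over one outgoing edge."""
    return [dst for src, _label, dst in g["edges"] if src == node_id]


print(key_value["seq:42"])                        # opaque bytes; the client must parse them
print(wide_column["seq:42"]["meta"]["checksum"])  # addressable down to a single column
print(document["meta"]["checksum"])               # addressable by nested field
print(neighbors(graph, 42))                       # traversal instead of lookup
```

The practical difference is how much structure the store itself can see: a key/value store can only fetch the whole value, while the other three models expose columns, fields, or relationships to queries.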
SciDB
• Open-source array database
• Provides scalability
• Provides support for slicing, aggregation, join of multi-dimensional arrays
• Linear Algebra
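The slicing and aggregation operations named above can be illustrated on a small two-dimensional array. This is a plain-Python sketch of the concepts only; it is not SciDB query syntax, and the helper names `slice_2d` and `column_means` are hypothetical.

```python
def slice_2d(array, rows, cols):
    """Return the sub-array selected by a row slice and a column slice."""
    return [row[cols] for row in array[rows]]


def column_means(array):
    """Aggregate along the row dimension: one mean per column."""
    return [sum(col) / len(col) for col in zip(*array)]


# A 3 x 4 array holding the values 0..11 in row-major order.
grid = [[r * 4 + c for c in range(4)] for r in range(3)]

window = slice_2d(grid, slice(0, 2), slice(1, 3))  # rows 0-1, columns 1-2
total = sum(sum(row) for row in window)            # aggregate over the slice

print(window)
print(total)
print(column_means(grid))
```

An array database performs the same kind of operation, but over arrays that are chunked and distributed across nodes rather than held in one process.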
Evaluation Scope
• Comparison of Cassandra, HBase and MongoDB using the Yahoo! Cloud Serving Benchmark (YCSB)
• Comparison of Cassandra, SciDB and PostgreSQL using two scientific datasets
• Detailed study of Cassandra: various factors impacting performance
Experiment Setup
• 160-node Jesup testbed
– Quad-core Intel Xeon X5550, 2.67 GHz processors
– 24 GB of memory per node
• Yahoo! Cloud Serving Benchmark (YCSB)
– Supports diverse workload types
– 100 million records where each record is 1 KB (10 fields × 100 bytes)
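The YCSB record sizing above implies a raw dataset size that is worth working out against the 24 GB of memory per node. The sketch below treats 1 KB as 1000 bytes, as the 10 × 100 arithmetic suggests. The replication factor and the choice of 12 data nodes are assumptions for illustration; the slides do not state how the YCSB data was replicated or distributed.

```python
FIELDS_PER_RECORD = 10
BYTES_PER_FIELD = 100
RECORD_COUNT = 100_000_000
NODE_MEMORY_GB = 24

record_bytes = FIELDS_PER_RECORD * BYTES_PER_FIELD  # payload bytes per record
raw_bytes = record_bytes * RECORD_COUNT             # payload only, no storage overhead
raw_gb = raw_bytes / 10**9                          # decimal gigabytes


def per_node_gb(data_nodes, replication_factor=1):
    """Raw payload per node, assuming an even spread (replication factor is an assumption)."""
    return raw_gb * replication_factor / data_nodes


print(record_bytes)
print(raw_gb)
print(round(per_node_gb(12), 2))
print(raw_gb > NODE_MEMORY_GB)  # does the raw payload exceed one node's memory?
```

Under these assumptions the raw payload is larger than a single node's memory but fits in memory once spread evenly over 12 nodes, before any index or storage overhead is counted.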
Scientific Datasets
• Bioinformatics database
– 16 million entries
– each entry had sequence id, sequence, checksum
– Query: get all fields for a particular sequence id
• Astronomy database
– 55K spectra produced from simulations of different supernovae with different input parameters
– 2500 wavelength steps and two record types
– 275 million records
– Query: calculates the chi-square distance between the given observed spectrum and every spectrum in the set, and returns the ID of the spectrum that has the minimal chi-square distance
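The astronomy query can be sketched as a linear scan that scores every stored spectrum against the observed one. The slides do not give the exact chi-square formula, so this sketch assumes the common form, the sum over wavelength steps of (observed - model)² / model. The spectrum ids and flux values are made up. The record count check at the end shows that 55K spectra × 2500 wavelength steps × 2 record types appears consistent with the 275 million records quoted above.

```python
def chi_square_distance(observed, model):
    """Sum of (o - m)^2 / m over wavelength steps, skipping zero-valued model bins."""
    if len(observed) != len(model):
        raise ValueError("spectra must have the same number of wavelength steps")
    return sum((o - m) ** 2 / m for o, m in zip(observed, model) if m != 0)


def closest_spectrum(observed, spectra):
    """Return (spectrum_id, distance) for the spectrum with minimal chi-square distance."""
    scored = ((sid, chi_square_distance(observed, flux)) for sid, flux in spectra.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical three-step spectra; the real dataset has 2500 wavelength steps each.
spectra = {
    "sn-a": [1.0, 2.0, 4.0],
    "sn-b": [1.0, 2.0, 3.0],
    "sn-c": [2.0, 2.0, 2.0],
}
observed = [1.0, 2.0, 3.5]

best_id, best_distance = closest_spectrum(observed, spectra)
print(best_id, best_distance)

# Consistency check on the dataset size quoted in the slides (55K taken as 55,000).
expected_records = 55_000 * 2_500 * 2
print(expected_records)
```

Because every stored spectrum must be read for each query, this workload is a full scan with a reduction, which is a very different access pattern from the single-key lookup of the bioinformatics query.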
YCSB Results
[Figure: transactions per second (0 to 18,000) by workload (A, B, C, D, E, F, G, NR) for Cassandra, HBase, and MongoDB]
Workloads:
• A: Update
• B: Read heavy
• C: Read only
• D: Read latest workload
• E: Short ranges
• F: Read-modify-write
• G: Custom (100% write)
• NR: Record size matching bioinformatics workload
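For readers unfamiliar with YCSB, the workload letters correspond to different operation mixes. The sketch below uses the proportions of the standard YCSB core workloads A to F; the slides do not confirm that the talk used these defaults unchanged. Workload G is taken from the slide's "100% write" description, and the operation label "write" is a placeholder because the slide does not say whether these were inserts or updates. NR is omitted because its mix is not given.

```python
import random

# Operation mixes: A-F follow the standard YCSB core workloads (an assumption
# about this talk's settings); G follows the slide's "100% write" description.
WORKLOAD_MIX = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},
    "E": {"scan": 0.95, "insert": 0.05},
    "F": {"read": 0.50, "read_modify_write": 0.50},
    "G": {"write": 1.00},  # placeholder label; insert vs. update is not stated
}


def choose_operation(workload, rng):
    """Pick the next operation for a workload according to its mix."""
    mix = WORKLOAD_MIX[workload]
    operations = list(mix)
    weights = [mix[op] for op in operations]
    return rng.choices(operations, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs draw the same sequence
sample = [choose_operation("B", rng) for _ in range(1000)]
print(sample.count("read"), sample.count("update"))
```

A benchmark driver would call something like `choose_operation` once per request from each client thread, so the mix determines how much of the measured throughput comes from reads, writes, or scans.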
Bioinformatics: Cassandra and SciDB
[Figure: transactions per second (log scale, 1 to 100,000) versus number of data nodes (1 and 12) for SciDB and Cassandra]
Astronomy: Cassandra, SciDB and PostgreSQL
[Figure: transactions per second (0 to 7,000) versus number of data nodes (1 and 12) for SciDB, Cassandra, and PostgreSQL]
Cassandra: Impact of Client Side Tools
[Figure: transactions per second (0 to 140,000) versus number of client threads (32, 64, 128, 256, 512) for YCSB, Pycassa, and Cassandra Client]
Cassandra: Loading time
[Figure: transactions per second (0 to 80,000) versus number of data nodes (1, 2, 4, 8, 12) for YCSB, Pycassa, and Bulk Loader]
Cassandra: Scaling effects
[Figure: transactions per second (0 to 60,000) versus number of client threads (32, 64, 128, 256, 512) for 1, 2, 4, 6, 8, 10, and 12 data nodes (DNs)]
Cassandra: Querying for Missing Reads
[Figure: time in milliseconds (0.00 to 1.00) versus number of data nodes (1 and 12)]
Conclusions
• Understand the query load distribution
• Determine the right client side tools
• This is a constantly evolving space
Questions?
• [email protected]
This work was supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors would like to thank Elif Dede.