Page 1 (source: datasys.cs.iit.edu/events/DataCloud2013/Lavanya_NoSQL.pdf)

Evaluation of NoSQL and Array Databases for Scientific Applications

Lavanya Ramakrishnan, Pradeep Mantha, Yushu Yao, Richard Shane Canon

Lawrence Berkeley National Lab

[email protected]  

Page 2

New types of databases are increasingly being used for science applications

Advanced Light Source data rates
  2009      65 TB/yr
  2011     312 TB/yr
  2013   1,900 TB/yr

[Diagram: labels "Schemaless database", "manager.x" (three instances), "NERSC", "HPSS", "MongoDB", "Portal", "GPFS", "register", "clean", "monitor", "metadata". www.materialsproject.org. Source: Michael Kocher, Daniel Gunter]

More details: Big Data Analytics: Challenges and Opportunities (BDAC-13) Workshop at 2:50 pm

Page 3

Schema-less or NoSQL databases

• Dynamic schema evolution
• Typically …
  – do not use SQL as the query language
  – may not give full ACID guarantees (eventual consistency)
  – use a distributed, fault-tolerant architecture
• Significant performance and scalability
  – Highly optimized for retrieve and append
• Schema-less databases fall on the continuous spectrum between ACID and BASE, closer to BASE

ACID: Atomicity, Consistency, Isolation, Durability
BASE: Basic Availability, Soft-state, Eventual Consistency
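
To make the ACID versus BASE contrast concrete, here is a minimal toy sketch of eventual consistency. The class and method names (`EventuallyConsistentStore`, `write`, `read`, `sync`) are hypothetical and are not taken from any of the databases discussed here; real systems propagate writes with far more elaborate protocols.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on one replica and reach the others later."""

    def __init__(self, n_replicas=2):
        # Each replica is just a dict; replica 0 accepts all writes.
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated to the other replicas

    def write(self, key, value):
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # A read may hit any replica, so it can return stale data.
        return self.replicas[replica].get(key)

    def sync(self):
        # Propagate pending writes in order; afterwards all replicas agree.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("run-1", "registered")
stale = store.read("run-1", replica=1)   # not propagated yet
store.sync()
fresh = store.read("run-1", replica=1)   # replicas have converged
print(stale, fresh)
```

The store stays available for reads on every replica, but a reader can briefly see an older state, which is the trade-off the BASE end of the spectrum accepts.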

Page 4

Schema-less databases are roughly classified into four categories

• Key/Value Store
  – e.g., Dynamo, Project Voldemort
• Columnar or extensible record
  – e.g., BigTable, HBase, Cassandra, Hypertable
• Document Stores
  – store documents/objects
  – e.g., CouchDB, MongoDB
• Graph databases
  – e.g., Neo4j
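
The four categories differ mainly in the shape of the data they store. The sketch below expresses one made-up sequence record in each shape using plain Python structures. It is an illustration only: the keys, field names and values are invented, and none of this is the actual API of the systems listed above.

```python
# Key/value store: an opaque value looked up by a single key.
key_value = {"seq:42": b"sequence=ACGT;checksum=abc"}

# Columnar / extensible record: row key -> column family -> column -> value.
columnar = {"42": {"data": {"sequence": "ACGT", "checksum": "abc"}}}

# Document store: a nested, self-describing document.
document = {"_id": 42, "sequence": "ACGT", "checksum": "abc", "tags": ["demo"]}

# Graph database: nodes plus labelled edges between them.
nodes = {42: {"sequence": "ACGT"}, 43: {"sequence": "ACGA"}}
edges = [(42, "similar_to", 43)]

# The same question ("what is the sequence of record 42?") in each model.
from_key_value = key_value["seq:42"].decode().split(";")[0].split("=")[1]
from_columnar = columnar["42"]["data"]["sequence"]
from_document = document["sequence"]
from_graph = nodes[42]["sequence"]

# A question only the graph model answers directly: who is 42 linked to?
neighbours = [dst for src, rel, dst in edges if src == 42 and rel == "similar_to"]
print(from_key_value, from_columnar, from_document, from_graph, neighbours)
```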

Page 5

SciDB

• Open-source array database
• Provides scalability
• Provides support for slicing, aggregation and join of multi-dimensional arrays
• Linear algebra
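
The array operations named above can be sketched on a small two-dimensional array. This is a pure-Python stand-in under stated assumptions, not SciDB's query language: the helper names (`slice_2d`, `aggregate_sum`, `join_cells`, `matmul`) are hypothetical.

```python
# A 3x4 two-dimensional array as nested lists (rows x columns).
a = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
b = [[10 * v for v in row] for row in a]


def slice_2d(arr, rows, cols):
    """Return the sub-array covering the given row and column indices."""
    return [[arr[i][j] for j in cols] for i in rows]


def aggregate_sum(arr, axis):
    """Sum along axis 0 (down each column) or axis 1 (across each row)."""
    if axis == 0:
        return [sum(col) for col in zip(*arr)]
    return [sum(row) for row in arr]


def join_cells(x, y):
    """Pair up cells that share the same (row, column) coordinates."""
    return [list(zip(x_row, y_row)) for x_row, y_row in zip(x, y)]


def matmul(x, y):
    """Matrix product of x (n x k) and y (k x m)."""
    return [[sum(xv * yv for xv, yv in zip(row, col)) for col in zip(*y)]
            for row in x]


window = slice_2d(a, range(0, 2), range(1, 3))
column_sums = aggregate_sum(a, axis=0)
joined = join_cells(a, b)
gram = matmul(a, [list(col) for col in zip(*a)])  # a times its transpose
print(window, column_sums, joined[0][0], gram[0][0])
```

An array database runs these operations over chunks distributed across nodes; the sketch only shows what each operation computes.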

Page 6

Evaluation Scope

• Comparison of Cassandra, HBase and MongoDB using the Yahoo! Cloud Serving Benchmark (YCSB)
• Comparison of Cassandra, SciDB and PostgreSQL using two scientific datasets
• Detailed study of Cassandra: various factors impacting performance

Page 7

Experiment Setup

• 160-node Jesup testbed
  – Quad-core Intel Xeon X5550 2.67 GHz processors
  – 24 GB of memory per node
• Yahoo! Cloud Serving Benchmark (YCSB)
  – Supports diverse workload types
  – 100 million records, each 1 KB (10 fields × 100 bytes)
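
The record layout above implies a raw dataset size that can be checked with a few lines of arithmetic. This sketch assumes field names `field0` to `field9` and decimal units (1 GB = 10^9 bytes). It counts field payload only and ignores keys, replication and on-disk overhead, none of which the slides specify.

```python
import random
import string

FIELD_COUNT = 10             # from the slide
FIELD_LENGTH = 100           # bytes per field, from the slide
RECORD_COUNT = 100_000_000   # from the slide
NODE_MEMORY_BYTES = 24 * 10**9  # 24 GB per node, decimal units assumed


def make_record(rng):
    """Build one synthetic record of FIELD_COUNT random ASCII fields."""
    return {
        f"field{i}": "".join(rng.choices(string.ascii_letters, k=FIELD_LENGTH))
        for i in range(FIELD_COUNT)
    }


record = make_record(random.Random(0))
record_bytes = sum(len(value) for value in record.values())
total_bytes = record_bytes * RECORD_COUNT
memory_ratio = total_bytes / NODE_MEMORY_BYTES

print(f"{record_bytes} bytes per record")
print(f"{total_bytes / 10**9:.0f} GB of raw field data")
print(f"{memory_ratio:.2f} times the memory of a single node")
```

Under these assumptions the raw data alone is roughly four times the memory of one node, so a single data node cannot hold the whole dataset in memory.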

Page 8

Scientific Datasets

• Bioinformatics database
  – 16 million entries
  – Each entry has a sequence id, sequence and checksum
  – Query: get all fields for a particular sequence id
• Astronomy database
  – 55K spectra produced from simulations of different supernovae with different input parameters
  – 2,500 wavelength steps and two record types
  – 275 million records
  – Query: calculate the chi-square distance between a given observed spectrum and every spectrum in the set, and return the ID of the spectrum with the minimal chi-square distance
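
The astronomy query is a nearest-match search, sketched below on toy data. The slides do not define the distance, so this assumes the common form chi^2 = sum(((observed_i - model_i) / sigma_i)^2) with unit uncertainties by default. The function names and the three short spectra are made up for illustration.

```python
def chi_square(observed, model, sigma=None):
    """Chi-square distance between two spectra of equal length.

    Assumed form: sum(((o - m) / s) ** 2); sigma defaults to 1 everywhere.
    """
    if len(observed) != len(model):
        raise ValueError("spectra must have the same number of wavelength steps")
    if sigma is None:
        sigma = [1.0] * len(observed)
    return sum(((o - m) / s) ** 2 for o, m, s in zip(observed, model, sigma))


def best_match(observed, spectra):
    """Return the ID of the spectrum with the minimal chi-square distance."""
    return min(spectra, key=lambda spectrum_id: chi_square(observed, spectra[spectrum_id]))


# Toy stand-ins for simulated spectra (real ones have 2,500 wavelength steps).
spectra = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [1.0, 2.5, 3.5],
    "s3": [0.0, 0.0, 0.0],
}
observed = [1.0, 2.4, 3.4]
match = best_match(observed, spectra)
print(match)

# The stated record count appears consistent with the other figures on the slide:
# 55,000 spectra x 2,500 wavelength steps x 2 record types.
records = 55_000 * 2_500 * 2
print(records)
```

Every query touches every stored spectrum, so this workload is a full scan with a reduction rather than a keyed lookup like the bioinformatics query.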

Page 9

YCSB Results

[Figure: bar chart of transactions per second (y-axis, 0 to 18,000) for Cassandra, HBase and MongoDB on workloads A, B, C, D, E, F, G and NR, with a "Better" direction marker.]

Workloads:
• A: Update
• B: Read heavy
• C: Read only
• D: Read latest workload
• E: Short ranges
• F: Read-modify-write
• G: Custom (100% write) workload
• NR: Record size matching bioinformatics workload
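
Workloads A to F correspond to the YCSB core workloads, which are defined by their mix of operations. The sketch below uses the proportions documented by the YCSB project for A to F and the slide's "100% write" for G. NR is left out because the slides give only its record size. The sampler itself (`sample_operations`) is a hypothetical helper, not part of YCSB.

```python
import random

# Operation mixes: A to F follow the YCSB core workload definitions;
# G follows the slide. Operation names here are illustrative labels.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},             # reads favour recent inserts
    "E": {"scan": 0.95, "insert": 0.05},             # short range scans
    "F": {"read": 0.50, "read_modify_write": 0.50},
    "G": {"write": 1.00},                            # custom workload from the slide
}


def sample_operations(workload, count, seed=0):
    """Draw `count` operation names according to the workload's mix."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)
    return rng.choices(list(mix), weights=list(mix.values()), k=count)


ops = sample_operations("B", 10_000)
read_share = ops.count("read") / len(ops)
print(f"workload B read share: {read_share:.3f}")
```

Because the mixes differ this much, a database's ranking can change from one workload to the next.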

Page 10

Bioinformatics: Cassandra and SciDB

[Figure: transactions per second (log scale, 1 to 100,000) versus number of data nodes (1 and 12) for SciDB and Cassandra, with a "Better" direction marker.]

Page 11

Astronomy: Cassandra, SciDB and PostgreSQL

[Figure: transactions per second (0 to 7,000) versus number of data nodes (1 and 12) for SciDB, Cassandra and PostgreSQL, with a "Better" direction marker.]

Page 12

Cassandra: Impact of Client Side Tools

[Figure: transactions per second (0 to 140,000) versus number of client threads (32, 64, 128, 256, 512) for YCSB, Pycassa and Cassandra Client, with a "Better" direction marker.]

Page 13

Cassandra: Loading time

[Figure: transactions per second (0 to 80,000) versus number of data nodes (1, 2, 4, 8, 12) for YCSB, Pycassa and Bulk Loader, with a "Better" direction marker.]

Page 14

Cassandra: Scaling effects

[Figure: transactions per second (0 to 60,000) versus number of client threads (32, 64, 128, 256, 512), one series each for 1, 2, 4, 6, 8, 10 and 12 data nodes (DNs), with a "Better" direction marker.]

Page 15

Cassandra: Querying for Missing Reads

[Figure: time in milliseconds (0.00 to 1.00) versus number of data nodes (1 and 12), with a "Better" direction marker.]

Page 16

Conclusions

• Understand the query load distribution
• Determine the right client side tools
• This is a constantly evolving space

Page 17

Questions?

• [email protected]

This work was supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors would like to thank Elif Dede.