Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis
Elif Dede, Madhusudhan Govindaraju
Lavanya Ramakrishnan, Dan Gunter, Shane Canon
Department of Computer Science, Binghamton University (SUNY) Lawrence Berkeley National Laboratory
Computation and Data are critical parts of the scientific process
Experiment
Theory
Computation
Data (Fourth Paradigm)
Advanced Light Source Data Rates
2009 65 TB/yr
2011 312 TB/yr
2013 1900 TB/yr
Three Pillars of Science
Materials Project
Schemaless database
[Diagram: multiple "manager.x" processes feeding a central "Brain" database]
www.materialsproject.org (Source: Michael Kocher, Daniel Gunter)
Data is “Big”
Processing “Big Data”: MapReduce
• Introduced in OSDI 2004 by Dean and Ghemawat from Google
• Programming model for processing large data sets
• Exploits a large set of commodity machines
• Characteristics of the model:
  • Relaxed synchronization constraints
  • Locality optimization
  • Fault-tolerance
  • Load balancing
![Page 6: Performance Evaluation of a MongoDB and Hadoop Platform ... · set and emits a series of intermediate key/value pairs • All values associated with a given intermediate key are grouped](https://reader033.vdocument.in/reader033/viewer/2022042212/5eb58d43e9749732416029cd/html5/thumbnails/6.jpg)
Map and Reduce
• The map() function is called on every item in the input set and emits a series of intermediate key/value pairs
• All values associated with a given intermediate key are grouped together
• The reduce() function is called on every unique intermediate key, and its value list, and emits a final output value
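The three steps above (map, group by key, reduce) can be sketched in a few lines of Python. This is an illustrative in-memory simulation of the programming model, not Hadoop's actual distributed implementation:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # map(): called on every input item, emits intermediate key/value pairs
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce(): called once per unique intermediate key with its value list
    return (key, sum(values))

def map_reduce(inputs):
    # Grouping step: all values for a given intermediate key are
    # brought together (sort, then group adjacent equal keys)
    intermediate = sorted(kv for line in inputs for kv in map_fn(line))
    grouped = groupby(intermediate, key=itemgetter(0))
    return dict(reduce_fn(k, [v for _, v in vals]) for k, vals in grouped)

print(map_reduce(["to be or not to be"]))
# → {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In the real framework the map and reduce calls run on different machines and the grouping step is a distributed shuffle; the data flow, however, is exactly this.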
Apache Hadoop
• Open-source MapReduce implementation in Java
• Easy scalability
• Built-in I/O management
• Hadoop Distributed File System (HDFS)
  • Data distribution, management and replication
• Load balancing
  • Handles stragglers
• Fault tolerance
  • Commodity hardware
  • Heartbeats
  • Speculative execution and data replication
• Hadoop Streaming
  • Create and run MapReduce jobs with any executable or script as the mapper and/or the reducer
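A streaming mapper and reducer can be any programs that read lines on stdin and write lines on stdout. The sketch below is a minimal word-count pair in Python; the tab-separated key/value convention follows common streaming practice, and the stream parameters are there only so the functions are easy to exercise:

```python
def stream_mapper(stdin, stdout):
    # Mapper: reads raw input lines, emits one "key<TAB>value" line per word
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def stream_reducer(stdin, stdout):
    # Reducer: the framework delivers lines sorted by key, so equal keys
    # arrive adjacent; sum the counts over each run of identical keys
    current, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current and current is not None:
            stdout.write(f"{current}\t{total}\n")
            total = 0
        current = key
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")
```

In a real job these would be two standalone scripts reading `sys.stdin` and writing `sys.stdout`, passed to the streaming jar via its mapper/reducer flags (exact invocation depends on the Hadoop version).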
Scientific Computing and Hadoop
Hadoop provides:
• Data Flow Parallelism
  • Data goes through different steps of processing
• Similar Job Phases
  • Data preparation, transformation and reduction
  • MapReduce: maps (transformation) and reduces (reduction)
  • Number of maps >>> number of reduces
  • Data transformation is typically more parallel than data reduction
• Fault Tolerance and Data Locality
  • Data-intensive loads
  • Long-running scientific jobs
Scientific Computing and Hadoop (Cont.)
Hadoop does not provide:
• Java implementation
  • Legacy scientific code is mostly not in Java and is hard to rewrite as map and reduce functions
  • Hadoop Streaming allows other modes
• HDFS is a non-POSIX file system
  • HDFS Java library calls are needed to create, read and write files
  • HDFS data locality is good but does not handle applications that might have multiple data sets
• Scientific data formats do not fit the line/block-oriented inputs of typical Hadoop jobs
  • Scientific applications often work with files where the logical division of work is per file
  • New file formats require additional Java programming to define the format and the appropriate split for a single map task
Scientific Computing and Hadoop (Cont.)
Hadoop does not provide:
• Maps and reduces are considered identical (executables/arguments)
  • Implementing different tasks requires logic in the tasks that differentiates the functionality
  • This can cause worker processing times to vary widely and lead to timeouts and restarted tasks due to speculative execution in Hadoop
• No built-in dynamic and iterative application support
New Generation Data
• Dynamic Data
  • Size and Content
• Structured?
  • Semi-structured, unstructured
• Relational?
  • Not always
NoSQL
A broad class of data management systems where the data is partitioned across a set of servers, where no server plays a privileged role
• NoSQL has emerged as an alternative approach for this new non-relational data.
• Addresses the "Big Data" challenge by providing horizontal scalability.
• Offers lower maintenance costs and greater flexibility.
• Various data models are represented under NoSQL, including key-value, column-oriented and document-oriented stores.
• Each of these models has its own interpretation of data storage and makes different tradeoffs among consistency, availability and performance.
What is MongoDB?
• Open source document-oriented database
• Data is not in tables with rows and columns
• Data is stored as "documents", each of which is an associative array of scalar values or nested associative arrays
• JavaScript Object Notation (JSON) format
  • Stored as BSON
• MongoDB uses sharding to split the data evenly across the cluster to parallelize access
  • This is done through front-end "routing servers" and back-end "data servers"
• Provides a built-in MapReduce
• Drawbacks:
  • The MapReduce scripts must be written in JavaScript
  • Slow and poor analytics libraries
  • The JavaScript implementation used by MongoDB is not thread-safe
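As a concrete illustration of the document model, a record is just nested key/value data serialized as JSON. The field names below are invented for illustration, not the actual schema of any project mentioned here:

```python
import json

# Hypothetical materials record: a "document" is an associative array whose
# values are scalars, nested associative arrays, or lists -- no fixed
# table schema is declared anywhere
doc = {
    "material_id": "mp-1234",   # illustrative field names only
    "formula": "Fe2O3",
    "properties": {             # nested associative array
        "band_gap": 2.2,
        "density": 5.24,
    },
    "tags": ["oxide", "magnetic"],
}

# Documents travel as JSON; MongoDB stores them in BSON (binary JSON)
print(json.dumps(doc, indent=2))
```

Because no schema is enforced, the next document inserted into the same collection may carry different or additional fields, which is what makes the model attractive for constantly evolving data stores.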
Why MongoDB?
Materials Project: • A community-accessible data store of calculated materials.
• Data store is complex with hundreds of attributes and constantly evolving.
• MongoDB provides an appropriate data model and query language.
• The project also needs to perform complex statistical data mining to discover patterns in materials and validate/verify correctness.
• These tasks are difficult with MongoDB but natural for MapReduce.
ALS: • The Advanced Light Source's Tomography beamline uses MongoDB to store metadata from experiments (Summer '12, LBNL)
Hadoop-MongoDB Connector
• Input splits are retrieved from one or more MongoDB servers
• Each mapper can read its splits in parallel
• Results are written back to MongoDB by the Hadoop reducer(s)
• It works with a single MongoDB server or with a sharded setup
• The user determines the split size
MongoDB: Overhead of multiple connections
• Tests the ability to handle a large number of simultaneous connections
• 768 tasks with different checkpoint intervals, compared to a run with no checkpointing
• Overhead grows as connections increase from 154 to 768 per second and write volume increases to 768 MB/s
MongoDB: Overhead when using more nodes and tasks
• 10 min per task
• All tasks run in parallel
• 10 sec checkpoint interval
• Overhead observed after 1000 parallel tasks
  • The large number of connections, more than the data volume, is the bottleneck
MongoDB MapReduce vs. Hadoop-MongoDB Read/Write Performance Comparison
• Data is stored on a single MongoDB server
• Hadoop cluster consists of 2 worker nodes
• The mongo-hadoop plug-in provides roughly five times better performance than MongoDB's native MapReduce.
Hadoop-MongoDB: Choosing the Split Size
• Processing 9.3 million input records with Hadoop
• Each mapper reads an input split from the MongoDB server, does processing and sends its intermediate output to the reducer
• Split size varies: 16, 32, 64, 128, 256 MB
• Sweet spot: 128 MB
• With the default split size of 8 MB, Hadoop schedules over 500 mappers; increasing the split size drops this number to around 40
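Since each input split becomes one map task, the mapper count is roughly the collection size divided by the split size. A back-of-the-envelope sketch (the 4800 MB collection size below is an illustrative assumption chosen to be consistent with the reported mapper counts, not a figure from the slides):

```python
import math

def num_mappers(collection_mb, split_mb):
    # One map task per input split, so the split size directly
    # controls how many mappers the framework schedules
    return math.ceil(collection_mb / split_mb)

COLLECTION_MB = 4800  # assumed collection size, for illustration only

for split_mb in (8, 16, 64, 128):
    print(f"{split_mb:>3} MB splits -> {num_mappers(COLLECTION_MB, split_mb)} mappers")
# 8 MB (the default) yields 600 mappers; 128 MB yields 38
```

Too many small splits mean per-task scheduling overhead dominates; too few large splits reduce parallelism, which is why an intermediate value such as 128 MB can be the sweet spot.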
Hadoop-MongoDB: Increasing Data
• For 4.6 million input records, Hadoop-HDFS is two times faster than MongoDB, and at 37.2 million records it is five times faster
• At 37.2 million input records mongo-hadoop is more than three times slower in reading and more than nine times slower in writing than Hadoop-HDFS
• In a sharded setup, mongo-hadoop reading times improve considerably
Setup: 2-node Hadoop cluster and 2 MongoDB servers.
Hadoop-MongoDB: Sharding and processing on local nodes vs. different nodes
• The performance slightly worsened compared to running the servers on different machines.
• MongoDB uses mmap to aggressively cache data from disk into memory
• With increasing input size, growing memory and CPU usage is observed on the worker/server nodes
• This affects the performance of the MapReduce job
The performance bottleneck is due to memory contention; locality has minimal effect.
Hadoop-MongoDB: Increasing #Workers
• Performance over cluster sizes increasing from 16 to 64 cores
• Single to two sharded MongoDB servers
• The write time is bound by the reduce phase for this MapReduce job
  • Number of mappers >> number of reducers
• The write performance of MongoDB remains a bottleneck, along with the overhead of routing data to be written between sharding servers.
Write performance of MongoDB is a bottleneck.
Hadoop-MongoDB: Different Setups (given that the data is in MongoDB)
• Best performance achieved reading from MongoDB and writing the output to HDFS
• Downloading the data to HDFS before running the analysis is the slowest.
Hadoop-HDFS provides the best performance.
Hadoop-MongoDB: Different Setups
• Increasing cluster size (from 8 cores to 64) for 37.2 million input records
• With an increasing number of worker nodes, the concurrency of the map phase increases
• The map times get considerably faster
Hadoop-MongoDB: Fault Tolerance
• 32 node Hadoop cluster processing ~37 million input records
• After eight worker-node failures, Hadoop-HDFS loses too many data nodes and fails to complete the MapReduce job
• Mongo-hadoop gets the input splits from the MongoDB server, so losing worker nodes does not lead to loss of input data
Conclusions
• Sharding helps to improve MongoDB's performance, especially for reads.
  • In a sharded setup, mongo-hadoop reading times improve considerably, as there are multiple servers to respond to parallel worker requests.
• In cases where data is stored in MongoDB and needs to be analyzed, the mongo-hadoop connector is a convenient way to use Hadoop.
  • Performance improves when output is written to HDFS.
• MongoDB performance degradation is observed with an increasing number of connections, increasing write requests per second, and an increasing total write volume.
• The mongo-hadoop plug-in provides roughly five times better performance compared to using MongoDB's native MapReduce implementation.
  • The performance gain from using mongo-hadoop increases linearly with input size.
Contact
Madhu Govindaraju [email protected] Binghamton University State University of New York (SUNY)
Dan Gunter [email protected] Lawrence Berkeley National Laboratory