Scaling MongoDB - Presentation at MTP

April MTP - Scalability Scaling MongoDB

Upload: darkdata

Post on 17-Jul-2015




April MTP - Scalability

Scaling MongoDB

Contents

1. Introduction
2. Key concepts
3. Types of scaling
4. Use cases
5. Architecture for scaling
6. Implementation
7. Scaling hardware
8. Choosing a shard key

Introduction

• Scaling a database can be defined as the ability to handle very large amounts of data while maintaining a high level of availability.

• MongoDB achieves scaling using two concepts:
  – Sharding for horizontal scaling
  – Replication for high availability

Key concepts

• Document - A record in a MongoDB collection and the basic unit of data in MongoDB. Documents are analogous to JSON objects but exist in the database in a more type-rich format known as BSON. BSON is a binary representation of JSON (JavaScript Object Notation). Example:

    {
      "employees": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Anna", "lastName": "Smith"},
        {"firstName": "Peter", "lastName": "Jones"}
      ]
    }
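To make "more type-rich" concrete, here is a minimal Python sketch using only the standard library. It does not use BSON itself; it only shows that plain JSON has no date type, which is one of the types BSON can store natively. The `hiredAt` field and the helper names are assumptions made up for this illustration.

```python
import json
from datetime import datetime, timezone

# A document maps naturally onto a Python dict (field names are illustrative).
employee = {
    "firstName": "John",
    "lastName": "Doe",
    "hiredAt": datetime(2015, 4, 1, tzinfo=timezone.utc),  # a date value
}

def json_rejects_dates(doc):
    """Return True if the standard JSON encoder refuses the document as-is."""
    try:
        json.dumps(doc)
    except TypeError:
        return True
    return False

def to_plain_json(doc):
    """Serialize to JSON by downgrading datetimes to ISO 8601 strings."""
    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"cannot encode {type(value).__name__}")
    return json.dumps(doc, default=encode, sort_keys=True)
```

In plain JSON the date survives only as a string, so the reader has to know to parse it back; a type-rich format avoids that step.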

Key concepts

• Collection - A grouping of MongoDB documents, like an RDBMS table.

• Replication - A feature that lets multiple database servers hold the same data, providing redundancy and facilitating load balancing.

• Replica set - A cluster of MongoDB servers that implements master-slave replication and automated failover.

• Primary - In a replica set, the primary (master) receives all write operations.

• Secondary - A replica set member that replicates the contents of the primary. A secondary does not accept writes; it can serve read requests when the client's read preference allows it.
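The roles above can be sketched with a toy model in Python. This is a deliberately simplified illustration, not MongoDB's replication protocol: real replication is asynchronous and a new primary is chosen by an election, whereas this sketch copies writes immediately and simply promotes the next surviving member. The class and method names are hypothetical.

```python
class ToyReplicaSet:
    """Simplified model of a replica set (assumption: synchronous copying)."""

    def __init__(self, member_names):
        # Each member holds its own copy of the data.
        self.data = {name: [] for name in member_names}
        self.primary = member_names[0]

    def secondaries(self):
        return [name for name in self.data if name != self.primary]

    def write(self, member, doc):
        """Accept a write only on the primary, then copy it to every secondary."""
        if member != self.primary:
            raise PermissionError(f"{member} is not the primary")
        self.data[self.primary].append(doc)
        for name in self.secondaries():
            self.data[name].append(doc)

    def read(self, member):
        """Any live member can return its copy of the data."""
        return list(self.data[member])

    def fail(self, member):
        """Remove a member; if it was the primary, promote a survivor.

        Stand-in for automated failover; assumes at least one member remains.
        """
        del self.data[member]
        if member == self.primary:
            self.primary = next(iter(self.data))
```

Because every member holds the full data set, losing the primary loses no data in this model, and writes resume once a secondary has been promoted.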

Key concepts

• Sharding - Sharding is the method MongoDB uses to split a large collection across several servers (called a cluster).

• Shard - A single mongod instance or replica set that stores some portion of a sharded cluster’s total data set.

• Shard key - The field MongoDB uses to distribute documents among members of a sharded cluster.

• Chunk - A contiguous range of shard key values within a particular shard.

• Balancer - An internal MongoDB process that runs in the context of a sharded cluster and manages the migration of chunks.
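How shard keys, chunks and the balancer relate can be sketched in a few lines of Python. This is a toy model under stated assumptions: chunks are defined by sorted split points, all chunks start on the first shard, and the balancer just evens out chunk counts. MongoDB's real balancer uses its own thresholds and migration protocol; the names below are hypothetical.

```python
import bisect

class ToyShardedCollection:
    """Range chunks over a shard key, with a naive count-based balancer."""

    def __init__(self, split_points, shards):
        # Split points ["g", "n"] define chunks (-inf, "g"), ["g", "n"), ["n", +inf).
        self.split_points = sorted(split_points)
        self.shards = list(shards)
        # Assumption: every chunk initially lives on the first shard.
        self.chunk_owner = [self.shards[0]] * (len(self.split_points) + 1)

    def chunk_for(self, key):
        """Index of the chunk whose key range contains this shard key value."""
        return bisect.bisect_right(self.split_points, key)

    def shard_for(self, key):
        return self.chunk_owner[self.chunk_for(key)]

    def chunks_per_shard(self):
        counts = {shard: 0 for shard in self.shards}
        for owner in self.chunk_owner:
            counts[owner] += 1
        return counts

    def balance(self):
        """Migrate one chunk at a time from the fullest shard to the emptiest
        until chunk counts differ by at most one."""
        while True:
            counts = self.chunks_per_shard()
            most = max(counts, key=counts.get)
            least = min(counts, key=counts.get)
            if counts[most] - counts[least] <= 1:
                return
            index = self.chunk_owner.index(most)
            self.chunk_owner[index] = least
```

Note that balancing only changes which shard owns a chunk; the key ranges themselves stay the same, so `chunk_for` gives the same answer before and after.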

Key concepts

• CAP Theorem - Given three properties of computing systems (consistency, availability, and partition tolerance), a distributed computing system can provide any two of these properties, but never all three. MongoDB offers availability and partition tolerance, with eventual consistency for reads served by secondaries.

Types of scaling

• Cluster scale - Distributing the database across 100+ nodes, often in multiple data centers

• Performance scale - Sustaining 100,000+ database reads and writes per second

• Data scale - Storing 1 billion+ documents in the database

Cluster scale

Performance scale

Data scale

Aadhaar data store

Architecture


• mongod - The database server.

• mongos - The routing process / load balancer. mongos sits in front of your cluster and looks like an ordinary mongod server to anything that connects to it. It forwards requests to the correct server or servers in the cluster, then assembles their responses and sends them back to the client. As a result, a client generally does not need to know that it is talking to a cluster rather than a single server.

• config server - A mongod instance that stores all the metadata associated with a sharded cluster. A production sharded cluster requires three config servers, each on a separate machine.

Implementation


Adding hardware

Choosing a shard key

• The shard key is either an indexed field or an indexed compound field, and it must exist in every document in the collection.

• The shard key value of a document is immutable: it cannot be changed after the document is inserted.

• The index on the shard key cannot be a multikey index.

• Choose a shard key that has high cardinality (e.g. username) and that distributes write operations across the entire cluster.
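The two criteria in the last bullet can be checked on sample data. The Python sketch below is an illustration under assumptions, not a MongoDB tool: it counts how inserts spread over range chunks for a given set of split points, and uses the number of distinct key values as an upper bound on the number of chunks, since a chunk cannot be split below a single shard key value. The sample keys and function names are invented for the example.

```python
import bisect

def writes_per_chunk(keys, split_points):
    """Count how many inserted keys land in each range chunk."""
    counts = [0] * (len(split_points) + 1)
    for key in keys:
        counts[bisect.bisect_right(split_points, key)] += 1
    return counts

def max_chunks(keys):
    """Upper bound on chunk count: one chunk per distinct shard key value."""
    return len(set(keys))

# High-cardinality key: writes spread over all four chunks.
usernames = ["anna", "bob", "carol", "dave", "erin", "frank", "gina", "hal"]
spread = writes_per_chunk(usernames, ["c", "e", "g"])

# Ever-increasing key (e.g. a counter): every new write hits the last chunk.
new_ids = list(range(301, 309))
hotspot = writes_per_chunk(new_ids, [100, 200, 300])

# Low-cardinality key: at most two chunks, however many shards are added.
countries = ["IN", "US", "IN", "IN", "US", "IN", "US", "IN"]
```

The counter example shows why high cardinality alone is not enough: every value is distinct, yet all new writes still go to a single chunk.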

Choosing a shard key

• Generally, the fastest queries in a sharded environment are those that mongos can route to a single shard, using the shard key and the cluster metadata from the config server. For queries that don’t include the shard key, mongos must query all shards, wait for their responses and then return the result to the application. These “scatter/gather” queries can be long-running operations.

• If your query includes the shard key (for a simple key) or the first component of the shard key (for a compound key), mongos can route the query directly to a single shard, or to a small number of shards, which gives better performance.
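The difference between a targeted query and a scatter/gather query can be sketched with a toy router in Python. This is a stand-in for the behaviour described above, not the mongos implementation: it supports only equality matches, keeps chunk metadata as a list of split points, and holds each shard's documents in memory. All names and sample documents are hypothetical.

```python
import bisect

class ToyRouter:
    """Decides which shards to ask, then merges their answers."""

    def __init__(self, shard_key, split_points, chunk_owner, shard_data):
        self.shard_key = shard_key
        self.split_points = split_points  # sorted chunk boundaries
        self.chunk_owner = chunk_owner    # chunk index -> shard name
        self.shard_data = shard_data      # shard name -> list of documents

    def target_shards(self, query):
        """One shard if the query contains the shard key, otherwise all shards."""
        if self.shard_key in query:
            chunk = bisect.bisect_right(self.split_points, query[self.shard_key])
            return [self.chunk_owner[chunk]]
        return sorted(set(self.chunk_owner))

    def find(self, query):
        """Return (matching documents, shards contacted)."""
        contacted = self.target_shards(query)
        results = []
        for shard in contacted:
            for doc in self.shard_data[shard]:
                if all(doc.get(field) == value for field, value in query.items()):
                    results.append(doc)
        return results, contacted

router = ToyRouter(
    shard_key="username",
    split_points=["m"],
    chunk_owner=["shard0", "shard1"],
    shard_data={
        "shard0": [{"username": "anna", "city": "Pune"}],
        "shard1": [{"username": "peter", "city": "Pune"}],
    },
)
targeted = router.find({"username": "anna"})  # contacts shard0 only
scattered = router.find({"city": "Pune"})     # contacts every shard
```

With two shards the extra cost of the scattered query is small; it grows with the number of shards, because the router has to wait for all of them before it can answer.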

Choosing a shard key

• The types of queries depend on the application requirements, so selecting the shard key also depends on the application requirements. The application requirements in turn depend on how we model the view. So ultimately, selecting the shard key depends on how we model the view.

• To summarize, scaling MongoDB has to be viewed in the context of a specific application, NOT in isolation. The application has to be built from the ground up with scaling in mind. It can’t be an afterthought. However, implementation can happen progressively.

THANK YOU

Anand George, founder, Dark Data Consulting

Shameless pitch: Looking to expand my team. Everyone’s invited

anandgeor

darkdata_ darkdatain

http://www.darkdataconsult.in