app server
Post on 23-Feb-2016
© 2011 Cisco All rights reserved. Cisco Confidential 1
[Diagram: Couchbase Server Node write path. The app server's client library sends a Write over the NIC and receives a Response before the write reaches disk; the item lands in memory (managed cache), then passes through the disk-write queue to disk, while the replication queue ships replicas to other nodes within the cluster.]
Data is persisted to disk by Couchbase Server asynchronously, based on rules in the system.
Great performance, but if a node fails, all data still in its queues is lost permanently, with no way to recover it. This is not acceptable for some of the Videoscape apps.
[Diagram: the same Couchbase Server Node write path, but the Response is sent after writing to disk.]
[Diagram: the same Couchbase Server Node write path, but the Response is sent after writing to disk and to one or more replicas(?).]
• Trading performance for better reliability, on a per-operation basis.
• Couchbase supports this, but there is a performance impact that needs to be quantified (assumed to be significant until proven otherwise).
• A node failure can still result in the loss of data in the replication queue, meaning local writes to disk may not yet have replicas on other nodes.
• Q1. Is the above mode supported by Couchbase (i.e. respond to a write only after writing to local disk and ensuring that at least one replica has been successfully copied to another node within the cluster)?
• Q2. If supported, can this be done on a per-operation basis?
• Q3. Do you have any performance data on this and the previous case?
Performance vs. reliability (fault tolerance and recoverability). Please see the notes section below.
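The two acknowledgement modes asked about in Q1/Q2 can be sketched with a toy model (this is an illustration, not the real SDK; class and return-value names are invented): the default path acks once the write is in memory, while the "durable" path acks only after the local disk write and one replica copy have completed.

```python
# Toy model of the two ack modes (illustrative names, not the real SDK).
class CouchbaseNodeModel:
    def __init__(self):
        self.memory = {}    # managed cache
        self.disk = {}      # locally persisted items
        self.replica = {}   # copy held by another node in the cluster

    def write(self, key, value, durable=False):
        self.memory[key] = value          # every write lands in memory first
        if not durable:
            return "acked-from-memory"    # fast path: queues drain later
        self.disk[key] = value            # wait for the disk-write queue
        self.replica[key] = value         # wait for at least one replica
        return "acked-after-disk-and-replica"

node = CouchbaseNodeModel()
print(node.write("doc1", "fast"))                # acked-from-memory
print(node.write("doc2", "safe", durable=True))  # acked-after-disk-and-replica
```

The per-operation `durable` flag mirrors the per-operation trade-off the bullets describe: the fast path risks losing queued data on node failure, the durable path pays latency for safety.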
Replication vs. Persistence
[Diagram: an App Server writing to Server 1, Server 2, and Server 3; each server has a managed cache, a disk queue, a disk, and replication queues linking the nodes.]
• Replication allows us to block the write while the node copies it into the memcached (memory) layer of 2 other nodes. This is typically a very quick operation, which means we can return control to the app server almost immediately instead of waiting on a write to disk.
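The point above can be sketched in a few lines (a simulation only; the two dicts stand in for the memcached layer of two other nodes): the ack waits only for peer caches, not for any disk.

```python
# Sketch: blocking on memory-to-memory replication is cheap because the
# ack waits only for peer in-memory caches, never for a disk write.
peer_caches = [{}, {}]  # stand-ins for two other nodes' memcached layers

def write_with_memory_replication(key, value):
    for cache in peer_caches:
        cache[key] = value   # in-memory copy to each peer
    return "acked"           # disk persistence continues asynchronously

write_with_memory_replication("doc1", 42)
print(all(c.get("doc1") == 42 for c in peer_caches))  # True
```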
[Diagram: the Couchbase Server Node write path extended with an XDCR queue. Alongside the replication queue (replicas to other nodes within the cluster), the XDCR queue ships replicas over a NIC to other DCs.]
When are the replicas to other DCs put in the XDCR queue?
• Based on the user manual, this is done after writing to local disk! This will obviously add major latencies.
• Q5. Is there another option to accelerate this, as in the case of local replicas? My understanding is that when the replication queue is used for local replication, Couchbase puts the data in the replication queue before writing to disk. Is this correct? Why is this different for the XDCR queue, i.e. write to disk first?
XDCR: Cross Data Center Replication
(Response sent after writing the replica to a remote DC?)
• Q4. Is this mode supported by Couchbase (i.e. respond to the client only after the replica has been successfully copied to a remote datacenter node/cluster)?
Please see notes section below
XDCR: Cross Data Center Replication
Q4. Is this mode supported by Couchbase (i.e. respond to the client only after the replica has been successfully copied to a remote datacenter node/cluster)?
As a workaround from the application side, the write operation can be overloaded so that before it returns control to the app it performs two gets, one against the current cluster and one against the remote cluster, for the item that was written, to determine when that write has persisted to the remote cluster (data-center write). See the pseudo-code in the notes.
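The "two gets" workaround above can be sketched as follows. This is a hedged simulation, not the pseudo-code from the notes: plain dicts stand in for the local and remote cluster connections, and `xdcr_ship` simulates background XDCR transfer; a real version would issue SDK get() calls against each cluster.

```python
import time

# Simulation of the two-gets workaround: write locally, then poll both
# clusters until the written value is visible in the remote DC too.
local_cluster, remote_cluster = {}, {}

def xdcr_ship():
    remote_cluster.update(local_cluster)   # simulated XDCR transfer

def write_and_wait_for_remote(key, value, timeout_s=1.0):
    local_cluster[key] = value
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        xdcr_ship()                        # in reality XDCR runs in the background
        # the two gets: check both clusters for the value we wrote
        if local_cluster.get(key) == value and remote_cluster.get(key) == value:
            return True                    # durable across data centers
        time.sleep(0.01)
    return False                           # remote copy not confirmed in time

print(write_and_wait_for_remote("doc1", "v1"))  # True
```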
We will support synchronous writes to a remote datacenter from the client side in the 2.0.1 release of Couchbase Server, available Q1 2013.
Lastly, XDCR is only necessary to span AWS Regions; a Couchbase cluster without XDCR configured can span AWS zones without issue.
Q5. Is there another option to accelerate this, as in the case of local replicas? My understanding is that when the replication queue is used for local replication, Couchbase puts the data in the replication queue before writing to disk. Is this correct? Why is this different for the XDCR queue, i.e. write to disk first?
XDCR replication to a remote DC is built on a different technology from the in-memory intra-cluster replication. XDCR is done along with writing data to disk so that it is more efficient: the flusher that writes to disk de-duplicates data (that is, it writes only the last mutation for each document it is updating on disk), and XDCR benefits from this. This is particularly helpful for write-heavy / update-heavy workloads. The XDCR queue sends less data over the wire and is hence more efficient.
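The de-duplicating flusher described above can be sketched in a few lines (a simulation; the queue and function names are illustrative): the disk-write queue keeps only the last mutation per document, so anything that feeds off this path, such as XDCR, ships less data for update-heavy workloads.

```python
from collections import OrderedDict

# Sketch of a de-duplicating disk-write queue: only the latest mutation
# per key survives, so repeated updates to one doc cost one flush.
disk_queue = OrderedDict()

def enqueue_mutation(key, value):
    disk_queue.pop(key, None)   # discard any earlier mutation of this doc
    disk_queue[key] = value     # keep only the latest version

for k, v in [("doc1", 1), ("doc2", 2), ("doc1", 3), ("doc1", 4)]:
    enqueue_mutation(k, v)

print(list(disk_queue.items()))  # [('doc2', 2), ('doc1', 4)]
```

Four mutations collapse into two queue entries, which is exactly why the XDCR queue built on this path sends less over the wire.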
Lastly, we are also looking at a prioritized disk-write queue as a roadmap item for 2013. This feature could be used to accelerate a write's persistence to disk for the purpose of reducing latencies for indexes and XDCR.
Q6. – per command
Data Center
[Diagram: Rack #1, Rack #2, … Rack #n, holding Couchbase nodes 1-7 spread across the racks.]
• Q6. Can Couchbase support rack-aware replication (as in Cassandra)?
• If we can't control where the replicas are placed, a rack failure could lose all replicas (i.e. docs become unavailable until the rack recovers)!
• Q7. How does Couchbase deal with that today? We need at least one of the replicas to be on a different rack.
• Note that in actual deployments we can't always assume that Couchbase nodes will be placed in different racks.
• See the following link for an AWS/Hadoop use case, for example: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Understanding_Hadoop_Clusters_and_the_Network-bradhedlund_com.pdf
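The placement Q6/Q7 ask for can be sketched as a Cassandra-style rule (this is a hypothetical illustration, not Couchbase's actual placement logic; the rack names and node picks are invented): choose replica nodes so that no copy shares a rack with the primary, guaranteeing at least one copy survives a rack failure.

```python
# Hypothetical rack-aware replica placement: one replica per distinct
# rack, never the rack that already holds the primary copy.
def place_replicas(racks, primary_rack, n_replicas):
    """racks: {rack_name: [node, ...]}; returns nodes for the replicas."""
    chosen = []
    for rack, nodes in racks.items():
        if rack == primary_rack or not nodes:
            continue                       # skip the primary's rack
        chosen.append(nodes[0])            # one replica per distinct rack
        if len(chosen) == n_replicas:
            break
    return chosen

racks = {"rack1": ["n1", "n2"], "rack2": ["n3"], "rack3": ["n4"]}
print(place_replicas(racks, "rack1", 2))   # ['n3', 'n4']
```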
[Diagram: the app server's client library writes File-1; Replica-1 and Replica-2 of File-1 should land on different racks. Rack-aware replication?]
Please see notes section below
Couchbase Server Sharding Approach
[Diagram: a five-server Couchbase Server cluster, each server holding an ACTIVE and a REPLICA document set (e.g. Server 1 holds Doc 5, Doc 2, Doc 4, and Doc 1 across its active and replica sets); one node acts as the fail-over node.]
• App servers accessing docs
• Requests to Server 3 fail
• Cluster detects the server has failed, promotes replicas of its docs to active, and updates the cluster map
• Requests for docs now go to appropriate server
• Typically rebalance would follow
[Diagram labels: App Server 1 and App Server 2, each running the Couchbase client library with a cluster map; user-configured replica count = 1; Couchbase Server cluster.]
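The failover steps above can be sketched end to end (an illustrative model only: the 1024-vBucket count matches Couchbase's usual design, but the CRC32 hashing detail, server names, and map layout here are assumptions, not taken from a live cluster): keys hash to vBuckets, the cluster map routes each vBucket to a server, and failover rewrites the map so each promoted replica's owner becomes active.

```python
import zlib

# Sketch of key -> vBucket -> server routing plus failover promotion.
NUM_VBUCKETS = 1024
active = {vb: f"server{vb % 5 + 1}" for vb in range(NUM_VBUCKETS)}
replica = {vb: f"server{(vb + 1) % 5 + 1}" for vb in range(NUM_VBUCKETS)}

def server_for(key):
    vb = zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS
    return active[vb]                      # cluster-map lookup

def fail_over(failed_server):
    # promote the replica owner of each vBucket the failed server held
    for vb, owner in list(active.items()):
        if owner == failed_server:
            active[vb] = replica[vb]

owner_before = server_for("doc-123")
fail_over(owner_before)
print(server_for("doc-123") != owner_before)  # True: requests now reroute
```

After `fail_over`, the updated map sends requests for the affected docs to the promoted servers, which is the "requests now go to appropriate server" step; a rebalance would then rebuild the missing replicas.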
Rack-aware replication.
[Diagram: a data center with Racks #1-#4. vBuckets (e.g. Vb1, Vb256, Vb277, Vb900, Vb1000) have their active copy and replicas 1-3 spread across different racks, with the app server's client library writing into the cluster.]
• Couchbase supports rack-aware replication through the number of replica copies combined with limiting the number of Couchbase nodes in a given rack.
Availability Zone-aware replication (AWS EAST)
[Diagram: Zone A with nodes 1-3, Zone B with nodes 4-6, and Zone C with nodes 7-9, all in one Couchbase cluster.]
• Q8. If all the nodes of a single Couchbase cluster (nodes 1-9) are distributed among three availability zones (AZs) as shown, can Couchbase support AZ-aware replication? That is, can it ensure that the replicas of a doc are distributed across different zones, so that a zone failure does not result in doc unavailability?
Assume inter-AZ latency ~1.5 ms
Availability Zone-aware replication (AWS EAST)
[Diagram: the same three-zone layout (Zone A: nodes 1-3, Zone B: nodes 4-6, Zone C: nodes 7-9); assume inter-AZ latency ~1.5 ms.]
• Couchbase currently supports zone affinity through the number of replicas combined with limiting the number of Couchbase nodes in a given zone.
• The replica factor is applied at the bucket level, and up to 3 replicas can be specified. Each replica is one full copy of the data set, distributed across the available nodes in the cluster. With 3 replicas the cluster contains 4 full copies of the data: 1 active + 3 replicas.
• By limiting the number of Couchbase nodes in a given zone to the replica count, losing a zone does not result in data loss, as a full copy of the data still exists in another zone.
• In the example above, consider a worst-case scenario with 3 replicas enabled: the active copy lives on node 1, replica 1 on node 2, replica 2 on node 3, and replica 3 on node 4. If Zone A goes down, those nodes can be automatically failed over, which promotes replica 3 on node 4 to active.
• Explicit zone-aware replication (affinity) is a roadmap item for 2013.
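The sizing rule above reduces to simple arithmetic, sketched below (the zone layout is the one from the diagram; the function name is illustrative): with R replicas there are R + 1 copies on distinct nodes, so if no zone holds more than R nodes, losing any single zone still leaves at least one full copy elsewhere.

```python
# Check the "nodes per zone <= replica count" rule for zone-loss safety.
def survives_zone_loss(nodes_by_zone, replica_count):
    # A zone with more than replica_count nodes could hold every copy
    # of some document; at or below the limit, one copy always survives.
    return all(len(nodes) <= replica_count
               for nodes in nodes_by_zone.values())

zones = {"A": ["n1", "n2", "n3"],
         "B": ["n4", "n5", "n6"],
         "C": ["n7", "n8", "n9"]}
print(survives_zone_loss(zones, 3))   # True: 3 nodes per zone, 3 replicas
print(survives_zone_loss(zones, 1))   # False: one zone could hold both copies
```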
Throughput / latency with scale
We need performance tests showing scaling from 3 nodes to 50-80 nodes in a single cluster (performance tests showing throughput and latencies in 5- or 10-node increments). During these tests, the following configurations/assumptions must be used:
• Nodes are physically distributed across 3 availability zones (e.g. AWS EAST zones).
• No data loss (of any acknowledged write) when a single node fails in any zone, or when an entire availability zone fails. It is OK to lose unacknowledged writes, since clients can deal with that. To achieve this, we need:
o Durable writes enabled (i.e. don't ack the client's write request until the write is physically done to disk on the local node and at least one more replica is written to the disks of other nodes in different availability zones).
Even though the shared performance tests look great, unfortunately the test assumptions used (lack of reliable writes) are unrealistic for our Videoscape deployments/use-case scenarios.
We need performance test results that are close to our use case! Please see Netflix's tests and their write-durability / multi-zone assumptions, which are very close to our use case (http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html)
Q9. Please advise if you are willing to conduct such tests!
Throughput / latency with scale
We need performance test results that are close to our use case! Please see Netflix's tests and their write-durability / multi-zone assumptions, which are very close to our use case (http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html)
The Netflix test describes that writes can be durable across regions, but does not specify whether the latency and throughput they saw were from quorum writes or single writes (single writes are faster and map to an unacknowledged write). This is similar to our benchmark numbers, in that although we have the ability to do durable writes across regions and racks, we do not specify it in the data. Can you confirm that the Netflix benchmark numbers for latency/throughput/CPU utilization were done wholly with quorum writes?