
Technical white paper

Couchbase Server 2.0 performance on HP ProLiant DL380p Gen8 Server
An Emerging Database Lab Reference Architecture

Table of contents

Executive summary
Couchbase overview
  Schema-free documents
  Queries and views
  Programmer interface
  Replication and sharding
Test load server – HP DL380p Gen8 Server
Load test – Yahoo! Cloud Serving Benchmark
  YCSB is a cloud serving data system load test and measurement tool
  Benchmark value
  Workloads used
  Test scenarios
HP DL380p Gen8 and Couchbase 2.0 performance
  Test setup
  Client setup properties
  DL380p Gen8 setup
  Interrupt assignments and memcache processor affinity
  Memory
  Storage
  Scale-out
  Replication
Couchbase and Hadoop
  Installing the Couchbase Hadoop Connector
  Exporting data from Hadoop to Couchbase Server
  Importing data from Couchbase Server to Hadoop
  HP Insight Cluster Management Utility
Summary
Appendices
  Appendix A: Increasing memcache threads
  Appendix B: Setting processor affinity
  Appendix C: Java jar files required to support the YCSB client
  Appendix D: Performance comparison NIC processor affinity data
  Appendix E: Hyper-Threading throughput comparison
  Appendix F: Comparison of operations per second and average read latency
  Appendix G: Performance implications of varying memory bucket size
  Appendix H: Scale-out performance of two and four nodes
  Appendix I: Replication tables
  Appendix J: JSON output for Hadoop
For more information


Simpler, Faster, Better – Making it Matter

We at HP have tested the performance of Couchbase Server 2.0 on our ProLiant DL380p Gen8 server using the Yahoo! Cloud Serving Benchmark, and we have verified how to connect this solution with Apache Hadoop. We take the guesswork out of a new implementation by showing you how we implemented this solution in our test laboratory. Get a jump start on your application and system deployment by reviewing our test configurations and results – and save yourself some time and money.

Executive summary

“When all you have is a hammer, everything looks like a nail.”

– Bernard Baruch

This has certainly been true for most enterprise applications; we have used the hammer of the traditional relational database management system (RDBMS) to solve most database design problems. In this paper, we are not suggesting eliminating the RDBMS from the data center, but recognizing that a new class of database management tools is becoming mainstream; NoSQL database management systems (DBMSs) are among this class.

NoSQL DBMSs have a different operational model that can enhance throughput and increase application flexibility when compared with an RDBMS, leaving many traditional expectations on the cutting room floor. Many have no explicit table structure, no support for data joins, limited transaction support, and a security model delegated almost completely to the application tier, yet they offer demonstrable value for many of today’s most demanding applications. Among the first we have tested is Couchbase Server 2.0, an open-source NoSQL1 document-oriented DBMS.

In document-oriented DBMSs, documents do not have fixed structures like relational tables; they are simply collections of key-value pairs that can be accessed via unique document identifiers or indexed by fields in the documents. Many such DBMSs support JSON2 documents. There is no rigid metadata associated with JSON documents, so application designers can add new fields to a logical schema without rebuilding the entire database. This flexibility may increase developer productivity and assist in keeping pace with the rapid advance of modern web applications.

The design paradigm of the NoSQL DBMS necessitates different performance tuning strategies than for an RDBMS. Couchbase Server is designed with a scale-out architecture, including built-in replication and sharding mechanisms to ensure data integrity, availability, horizontal scale and efficient remote client access. Proper configuration of NoSQL DBMS clusters and system-level tuning for optimal memory, disk, and network performance is critical for enterprise-class deployments.

Many companies are looking for solutions with Big Data in mind, so we have included an example configuration linking this solution to a Hadoop cluster.

1 NoSQL has multiple meanings at this time, but we will go forward with Not a Relational Database Management System.
2 JavaScript Object Notation (JSON) is a compact, yet human-readable object-oriented data structure language that is useful for the sharing of information between heterogeneous computer systems and languages.


To assist our customers in evaluating NoSQL solutions, HP has configured and tested Couchbase Server on the latest HP ProLiant DL380p Gen8 server. This reference architecture documents these efforts and highlights several unique advantages HP platforms offer when deploying this DBMS. Following is a small sample of DL380p features.

• Flexible network solutions including advanced LAN-On-Motherboard options that enable customer-tailored network fabrics without consuming PCI-E expansion slots.

• Storage solutions ranging from direct-access hard disks and embedded RAID solutions to high-performance solid-state disk drives and storage area networks, enabling customers to select the appropriate price and performance point for their individual needs.

• HP Active Health System, which provides continuous, proactive monitoring of over 1,600 system parameters including hardware, operating system, and some application software via the HP Integrated Lights-Out (iLO 4) management processor.

Additionally, cluster management capabilities are provided by HP Insight Cluster Management Utility (CMU). CMU provides push-button scale-out and provisioning with industry-leading provisioning performance (deployment of 800 nodes in 30 minutes), reducing deployments from days to hours. In addition, CMU provides real-time and historical infrastructure and Hadoop monitoring with 3D visualizations, allowing customers to easily characterize Hadoop workloads and cluster performance; this reduces complexity and improves system optimization, leading to improved performance and reduced cost. HP Insight Management and HP Service Pack for ProLiant allow for easy management of the server and its firmware.

In the following pages, we’ll discuss the basics of Couchbase Server deployments on the HP ProLiant platform. We use industry quasi-standard workloads to demonstrate the impact of hardware and software configuration choices on server performance. These configuration choices are:

• A basic Couchbase system implementation

• A replication implementation for a highly-available configuration

• A sharded3 implementation for a scale-out, write- and update-heavy configuration

Target audience: This document is intended for decision makers, system and solution architects, system administrators and experienced users who are interested in reducing time to design, purchase, and implement an HP ProLiant and Couchbase Server solution with an optional Hadoop integration.

This white paper describes testing performed November 2012 through January 2013.

DISCLAIMER OF WARRANTY

This document may contain the following HP or other software: XML, CLI statements, scripts, parameter files. These are provided as a courtesy, free of charge, “AS-IS” by Hewlett-Packard Company (“HP”). HP shall have no obligation to maintain or support this software. HP MAKES NO EXPRESS OR IMPLIED WARRANTY OF ANY KIND REGARDING THIS SOFTWARE INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. HP SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, WHETHER BASED ON CONTRACT, TORT OR ANY OTHER LEGAL THEORY, IN CONNECTION WITH OR ARISING OUT OF THE FURNISHING, PERFORMANCE OR USE OF THIS SOFTWARE.

Couchbase overview

Couchbase Server 2.0 is an open-source NoSQL DBMS optimized for interactive applications that store data in key-value or document format. It has support for JSON, binary data, indexing and querying, replication for high availability (including cross-datacenter replication), and sharding that provides horizontal scaling. Couchbase is easily scalable and highly performant, with a driving philosophy of “always on”. In this section we present several high-level features of Couchbase, and we direct you to more detail in the footnotes and references.

3 Sharding is a partitioning of the database across two or more systems with the intent to increase throughput by enlarging the number of systems, disk spindles, and other resources working on a sharable task. Documents are partitioned based upon a document identifier or key.


Schema-free documents

Documents in Couchbase Server can be stored in JSON or binary format. Documents contain key and value pairs that are sent to the server and returned to the application by client libraries (APIs). Using Couchbase SDKs, the document ID is hashed and documents are uniformly distributed across partitions in the cluster. On each node in the cluster, documents are stored in data containers called buckets; a bucket is similar in concept to a database in an RDBMS. Each bucket can hold multiple documents, and documents with different structures can be collocated in the same bucket. Buckets provide a logical grouping of physical resources within a cluster. Based on your configuration settings for the bucket, you can replicate a bucket up to three times within a cluster.

In contrast with RDBMSs where tables are normalized to minimize duplication of information, NoSQL database management systems typically store data in a de-normalized way. There are no joins.

Agile programming enablement is a key focus for a NoSQL DBMS. Its schema-free document model makes coding more flexible and adaptable to the needs of the application and to the availability of information provided as less-structured content.

Schema-free means that the database doesn’t need to know the content of the documents, and that documents with different structures may be collocated in the same bucket if desirable. This doesn’t mean the contents are not used by the DBMS, as indexes are based on fields (key-value pairs) in the documents; it is simply a statement that the fields are not required to conform to a fixed schema as they would be with an RDBMS. If a field does not exist, or does not have the required value, the document simply isn’t included in the results.

Performance is enhanced by including relevant data together in a document. Typically in an RDBMS environment, tables are normalized to minimize duplication of information, but even in these shops, selective denormalization is used to enhance application performance. NoSQL database management systems typically include redundant information in documents, based upon application requirements and information availability, to enhance performance – especially in a web environment – by minimizing the number of round trips needed to access information.
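For illustration only (these documents and field names are hypothetical, not part of our test data), two differently structured documents can share a bucket, and related data can be embedded directly in a document to avoid extra round trips:

{ "type": "customer", "name": "Pat Example", "email": "pat@example.com" }

{ "type": "order", "customer": "customer::1001", "status": "shipped",
  "items": [ { "sku": "DL380P-G8", "qty": 1 } ] }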

Queries and views

“Couchbase Server incorporates the ability to summarize and query the information that is stored in the database through the use of views. Views in Couchbase are written as a JavaScript Map Reduce function. Views define the structure and content of the response generated from the stored key and value pairs. From the information generated by the view, you can query and select specific rows, ranges of rows, and summaries of this information such as counts or sums.”

Views create indexes based upon attribute values that are emitted. The index stores a key and value for each qualified document. A view key defines the search parameters, and a view value provides the fields emitted in the view. This is significant in that the entire document isn’t accessed or returned, only the attributes defined in a view. A view also provides the key to the original document, so if additional information beyond the value fields is required, the original document can be obtained. To provide information in various combinations, multiple views may be used.
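As a brief illustration (the document fields are hypothetical), a view’s map function is written in JavaScript, as noted above, and emits the key and value that form the index:

function (doc, meta) {
  // Index orders by status; emit only the fields needed, not the whole document.
  if (doc.type == "order" && doc.status) {
    emit(doc.status, doc.customer);
  }
}

Querying this view for the key "shipped", for example, returns the matching keys, emitted values, and document IDs; the IDs can then be used to fetch the full documents when fields beyond those emitted are required.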

Programmer interface

Programmers interface with the DBMS via Software Development Kits (SDKs); these are also known as client libraries or APIs. These language-specific APIs, provided by Couchbase, Inc. and the Couchbase Server open source community, are responsible for communicating with the server to perform database operations. Client libraries are cluster-aware, maintain database topology and node status in a dynamic environment, distribute read and write requests to the appropriate nodes, and compensate for failed nodes automatically by redirecting requests as appropriate.

Along with the standard get and set operations, Couchbase Server provides some integrity checking operations. The add() operation assures a record is not already in the database, the replace() operation assures the record already exists, and the cas() (compare and swap) operation allows us to verify that a document has not changed before an update is applied. Couchbase also provides atomic operations on binary values for incrementing (incr) or decrementing (decr) the value stored under a given key.
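As a sketch of these operations, assuming the Couchbase Java client library of that generation (1.x, built on spymemcached); the connection details, keys, and values are illustrative, not our test configuration:

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.CASResponse;
import net.spy.memcached.CASValue;
import java.net.URI;
import java.util.Arrays;

// Illustrative sketch: the connection details, keys, and values are hypothetical.
public class OpsSketch {
    public static void main(String[] args) throws Exception {
        // The client library discovers the cluster topology from this node.
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://127.0.0.1:8091/pools")), "default", "");

        client.set("user::1001", 0, "{\"name\":\"Pat\"}").get();          // create or overwrite
        Object doc = client.get("user::1001");                            // read

        boolean added = client.add("user::1001", 0, "{}").get();          // false: key already exists
        client.replace("user::1001", 0, "{\"name\":\"Pat E.\"}").get();   // succeeds only if key exists

        // Compare and swap: update only if the document has not changed since we read it.
        CASValue<Object> current = client.gets("user::1001");
        CASResponse cas = client.cas("user::1001", current.getCas(), "{\"name\":\"Pat Example\"}");

        long hits = client.incr("counter::hits", 1, 0);                   // atomic counter, seeded to 0
        client.decr("counter::hits", 1);

        client.shutdown();
    }
}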

Replication and sharding

High availability comes with replication, and horizontal scale comes with sharding. Replication and sharding may be combined to provide a horizontal scale deployment that provides high availability with large-scale storage capacity.

Data stored in Couchbase is also partitioned using consistent hash partitioning. By design, each Couchbase data bucket is split into 1,024 partitions called vBuckets. Documents are auto-sharded and evenly distributed across these partitions.
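The exact mapping is implemented by the client libraries and the cluster map, but the idea can be sketched in a few lines (an illustrative simplification in Java, not the actual Couchbase hashing code):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative simplification only; the real client libraries and cluster map
// implement the actual vBucket hashing and node lookup.
public class VBucketSketch {
    static final int NUM_VBUCKETS = 1024;  // fixed number of partitions per bucket

    static int vbucketFor(String documentId) {
        CRC32 crc = new CRC32();
        crc.update(documentId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_VBUCKETS);
    }

    public static void main(String[] args) {
        // The cluster map then tells the client which node currently owns this vBucket.
        System.out.println("user::1001 -> vBucket " + vbucketFor("user::1001"));
    }
}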

Replication is a feature that allows copies of a database bucket to reside on alternate nodes; this is the basis of high availability. The system internally manages the replication of partitions for each bucket. The client library manages this for the developer, and based on the document ID referenced, routes requests to the appropriate node that manages the partition to which the document belongs. For Couchbase Server, the number of replicas for a bucket must be established when the bucket is created.

From a user or developer perspective, the client library manages the failover of a partition automatically when a node becomes unresponsive. From a server perspective, the replica partition is promoted to active status and the administrator may rebalance the cluster to establish another replica partition for the data4.

When replication is applied, we must also recognize the eventually-consistent paradigm of NoSQL DBMSs. Relational database management systems normally use an ACID model (atomicity, consistency, isolation, durability) to ensure data consistency, whereas NoSQL DBMSs, including Couchbase Server, typically use a BASE model (basically available, soft state, eventually consistent). When replication is in use, we must recognize that write durability (persistence) is an important consideration regarding the availability of writes and updates to the replicas. Couchbase Server 2.0 provides the observe() operation5 that allows us to assure a document is persisted to disk and, if replication is active, that the document is replicated.
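A minimal sketch of requesting write durability, again assuming the 1.x Java client library, which offers store overloads that use the observe mechanism internally; the key and value are illustrative:

import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;
import net.spy.memcached.ReplicateTo;
import java.net.URI;
import java.util.Arrays;

// Illustrative sketch: key, value, and connection details are hypothetical.
public class DurabilitySketch {
    public static void main(String[] args) throws Exception {
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://127.0.0.1:8091/pools")), "default", "");

        // Returns only after the document is persisted to disk on the node that owns it
        // and replicated to at least one replica (the observe mechanism is used internally).
        client.set("order::42", 0, "{\"status\":\"paid\"}",
                PersistTo.MASTER, ReplicateTo.ONE).get();

        client.shutdown();
    }
}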

Nodes may be added to an active cluster to increase storage, and the administrator may rebalance the buckets as appropriate.6 These same nodes may also contain replication vBuckets.

Next, we will review the system configuration used for the performance load testing.

Test load server – HP DL380p Gen8 Server

The server hardware for the Couchbase Server test configuration is an HP ProLiant DL380p Gen8 server. Each HP DL380p Gen8 server is outfitted with two Intel Xeon E5-2690 8-core 2.9GHz CPUs, 192GB of RAM7, 24Gbps of network bandwidth (48Gbps bidirectional), and eight 300GB small form factor hard disk drives for a total of 2.4TB. When used in the scale-out configuration, a four-node cluster includes 64 cores, 128 threads, 768GB of RAM, 96Gbps of network bandwidth, and 9.6TB of disk storage. This is just the server test configuration. The HP DL380p Gen8 server has much more capacity than shown here; please see the data sheet for more information8.

Figure 1. HP DL380p Gen8 Server

4 See administration of node failover: couchbase.com/docs/couchbase-manual-2.0/couchbase-admin-tasks-failover.html
5 See observe in the SDK manual: couchbase.com/docs/couchbase-devguide-2.0/couchbase-sdk-when-to-observe.html
6 For more information on sharding, see: couchbase.com/docs/couchbase-devguide-2.0/brief-info-on-vBuckets.html
7 For the load test, we limit physical RAM to 192GB; we also decrease the available RAM to the database to assist in gathering load performance statistics when comparing database size with RAM availability.
8 HP DL380p Gen8 server data sheet: http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-9615ENW.


Table 1. HP DL380p Gen8 server configuration

Quantity Description

1 HP DL380p Gen8 8-SFF CTO Server

2 HP DL380p Gen8 Intel® Xeon® E5-2690 Performance Processors

1 HP Smart Array P420i SAS controller

1 HP 1GB P-series Smart Array Flash Backed Write Cache

12 HP 16GB 2Rx4 PC3-12800R-11

1 HP 1GbE 4 Port 331FLR Adapter

1 HP NC523SFP 10Gb 2-port Server Adapter

2 HP 750W Common Slot Gold Hot Plug Power Supply Kit

8 HP 300GB 6G SAS 15K 2.5in SC Enterprise Hard Drive

Next, we review the load generating benchmark.

Load test – Yahoo! Cloud Serving Benchmark

For every solution, we need to define the applicable areas for which the system may be used, and to show configuration trade-offs of the solution (like the amount of memory required, network bandwidth consumed, CPU usage). We have chosen the Yahoo! Cloud Serving Benchmark (YCSB) workload and its framework to characterize the performance of the NoSQL DBMS on the HP DL380p.

In this section we present a brief overview of YCSB and the workloads used in our tests. More information is available in the footnotes and references.

YCSB is a cloud serving data system load test and measurement tool

Yahoo!’s research arm developed a framework and a common set of workloads to help understand the performance characteristics of various DBMSs under various types of load. The stated purpose of YCSB is to use the workloads in evaluating the performance of "key-value" and "cloud" serving stores.9 This allows us to use this common benchmark to test the NoSQL database systems and determine their performance characteristics on a server. We may also use this performance information to compare with other DBMS solutions. For further information, there is an original Association for Computing Machinery paper, “Benchmarking Cloud Serving Systems with YCSB”,10 published by the Yahoo! research team.

The research team recognized a new class of DBMSs that didn’t follow the traditional ACID approach to transactions and write durability, nor the OLTP types of workloads. Therefore a new approach was proposed to facilitate the comparison of similar “cloud data serving systems”11. The YCSB framework is extensible to allow expansion of the workloads and DBMSs as desired.

The YCSB framework has a client that manages the workload and maintains and reports the statistics. This client has multiple threads that are used to send requests to the database and monitor the responses using a driver specific to the DBMS under test. This same client is also used to load the test documents into the database.
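To give a sense of how a new DBMS driver plugs into the framework, the sketch below follows the shape of the com.yahoo.ycsb.DB class as it existed in this era of YCSB (integer status codes, ByteIterator values); the class name is hypothetical, the method bodies are stubs, and later YCSB versions changed these signatures:

import com.yahoo.ycsb.ByteIterator;
import com.yahoo.ycsb.DB;
import java.util.HashMap;
import java.util.Set;
import java.util.Vector;

// Hypothetical skeleton of a YCSB database binding (method bodies are stubs).
public class ExampleBinding extends DB {
    @Override
    public void init() {
        // One DB instance is created per client thread; read connection
        // properties via getProperties() and open the client connection here.
    }

    @Override
    public int read(String table, String key, Set<String> fields,
                    HashMap<String, ByteIterator> result) {
        return 0;  // 0 means OK in this generation of YCSB
    }

    @Override
    public int update(String table, String key, HashMap<String, ByteIterator> values) {
        return 0;
    }

    @Override
    public int insert(String table, String key, HashMap<String, ByteIterator> values) {
        return 0;
    }

    @Override
    public int delete(String table, String key) {
        return 0;
    }

    @Override
    public int scan(String table, String startkey, int recordcount, Set<String> fields,
                    Vector<HashMap<String, ByteIterator>> result) {
        return 0;
    }
}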

Source code for this benchmark is maintained at GitHub.12 Additional published benchmark results are also available (dated 2010-03-31, ycsb-v4.pdf13).

9 The benchmark purpose may be reviewed at: http://research.yahoo.com/Web_Information_Management/YCSB
10 Information about the paper (2010): http://research.yahoo.com/node/3202, paper download: http://research.yahoo.com/files/ycsb.pdf
11 http://research.yahoo.com/node/3202
12 Source code may be obtained at: http://github.com/brianfrankcooper/YCSB
13 http://research.yahoo.com/files/ycsb-v4.pdf


Benchmark value

The value of running these tests is that the results help us position the solution, understand best practices, and understand the benefits of the various configuration options. Latency and throughput are two results from these tests that indicate the value of our choices. There is a significant difference when running workloads where the working set of data is all cached in memory compared with workloads having a working set that is too large to be fully contained in memory, thus forcing disk I/O. We may also observe the impact of write durability options that force data to be preserved on disk before returning to the client, or replication options that require documents to be preserved across the network on another node. These results may also be compared with other solutions.

• Latency is the time it takes to return from a request.

• Throughput is the number of requests per second the solution is capable of achieving.

Workloads used

There are two YCSB workloads we use to test the NoSQL database management systems; they are called Workload A and Workload C.14

• Workload A is an update-heavy workload with operations split between 50% read and 50% write.

• Workload C is a read-only workload with 100% read operations.

The data in a document, for this benchmark, is approximately 1,000 bytes. Documents consist of ten fields with 100 bytes in each field, and a key. The entire document is brought in for each read and update.
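A minimal sketch of one such record, assuming YCSB's usual field0 through field9 naming convention:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: builds a record shaped like a YCSB document
// (ten 100-byte fields plus a key, roughly 1,000 bytes of data).
public class RecordSketch {
    public static void main(String[] args) {
        String key = "user1000000";                    // hypothetical document key
        Map<String, String> record = new HashMap<>();
        for (int i = 0; i < 10; i++) {
            char[] field = new char[100];
            Arrays.fill(field, 'x');                   // YCSB generates the field contents
            record.put("field" + i, new String(field));
        }
        System.out.println(key + " -> " + record.size() + " fields, "
                + (record.size() * 100) + " bytes of field data");
    }
}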

When updating or writing to the disk, Couchbase Server processes 250k15 documents at a time, and then commits the writes to the disk. It only writes the latest update to a document, pruning multiple writes to the same document.

For both the update-record and read-record selection, we use the uniform distribution algorithm, giving any record an even chance of being selected. Zipfian is the default test algorithm, but for our tests, we choose the uniform algorithm to lessen the potential performance improvement that caching of hot documents would provide.16 When a document is selected, the entire document is returned to the client.

Test scenarios

The test database is loaded with 25 million documents per server node, each containing 10 fields of 100 bytes; therefore, a four-node cluster has 100 million documents.

Our experience indicates that NoSQL DBMSs prefer to have enough memory to hold the entire working set in memory, thus we have two sub-scenarios for each workload to document performance. In the first scenario, enough memory to hold the working set is provided, and in the second, half the memory required to hold the entire working set is provided.

• Working set cached in memory

• Working set twice the size of available memory

Now, let us review some of the performance results.

HP DL380p Gen8 and Couchbase 2.0 performance

Test setup

The test environment includes four HP ProLiant DL380p Gen8 servers, configured as previously described, as the Couchbase servers, and sixteen HP ProLiant DL380p servers as the clients (or drivers) – a mixture of twelve Gen8 servers with sixteen cores each and four G7 servers with twelve cores each. We utilize one 10GbE NIC on each node.

14 More information regarding workloads is available at: https://github.com/brianfrankcooper/YCSB/wiki/core-workloads 15 couchbase.com/docs/couchbase-manual-2.0/couchbase-monitoring-diskwritequeue.html 16 More information can be found in the YCSB document: http://research.yahoo.com/files/ycsb.pdf

Page 9: Couchbase 2.0 performance.hp-dl380p.gen8

Technical white paper | Couchbase Server 2.0 performance on HP ProLiant DL380p Gen8 Server

9

The test setup consists of the following software:

• OS Version: Red Hat Enterprise Linux Server release 6.1 (Santiago), 2.6.32-131.0.15.el6.x86_64

• Oracle Java version: 1.7.0_03 (see footnote 17)

• YCSB – commit version 8afcc6e from https://github.com/Altoros/YCSB

• Couchbase version: Couchbase Community Server 2.0.0 Beta, Build 1723

The test setup consists of the following hardware:

• Sixteen ProLiant DL380p Gen8 with 2 Intel Xeon 2.9 GHz E5-2690 Sandy Bridge processors

• Four ProLiant DL380p G7 with 2 Intel Xeon 3.46 GHz X5690 Westmere processors

• One HP 6600-24XG Switch (see footnote 18)

Figure 2. Test setup

Client setup properties

During the test runs, we ran from 1 to 32 YCSB clients (drivers) per client node, each client running 128 threads, for a maximum of 512 clients.

JVM options

The Java Virtual Machine on the client systems is run with the following parameters:

-Xms2048m -Xmx2048m -XX:MaxDirectMemorySize=2048m

-XX:+UseConcMarkSweepGC -XX:MaxGCPauseMillis=850

YCSB run parameters

Table 2 describes the YCSB parameters used for the test runs.

17 Appendix C provides a list of Java jar files required to support the YCSB client.
18 Because of the heavy network traffic, we suggest upgrading the test configuration network switch to HP 5920AF high-performance 24 port top-of-rack 10GbE switch with ultra-deep packet buffering.


Table 2. YCSB client run parameters

Parameter Value Description

couchbase.timeout 60,000

couchbase.readBufferSize -1

insertorder ordered

requestdistribution uniform

recordcount 25 million

maxexecutiontime 360

operationcount 999,999,999

writeallfields true

readallfields true

threads 128

readproportion 1 or 0.5 1 for workload ‘C’, or 0.5 for workload ‘A’

updateproportion 0 or 0.5 0 for workload ‘C’, or 0.5 for workload ‘A’

scanproportion 0

insertproportion 0

DL380p Gen8 setup

We recommend the DL380p Gen8 nodes be provisioned with 10G Ethernet ports.19

We recommend leaving Hyper-Threading disabled.20

We recommend the DL380p Gen8 nodes be provisioned with a minimum of 128GB of RAM. See the Memory section that follows. A two processor DL380p has 4 memory channels per processor with 3 DIMM slots in each channel for a total of 24 slots. Memory configuration is subject to DIMM population rules and guidelines described in the white paper “Configuring and using DDR3 memory with HP ProLiant Gen8 Servers”.21 The maximum number of DIMMs per channel (DPC) to utilize the highest supported DIMM speed of 1600 MHz is 2 DPC.

We recommend using the suggested settings from the white paper “Configuring and Tuning HP ProLiant Servers for Low-Latency Applications”22. For the tests we used all the recommended settings except for the Intel Turbo Boost Technology setting, which we enabled; during testing we were careful not to exceed the processor temperature limit, which would cause the processor to transition out of turbo boost mode.

19 See Appendix F for a comparison of operations per second and average read latency.
20 See Appendix E for Hyper-Threading throughput comparison.
21 http://h20000.www2.hp.com/bc/docs/support/SupportManual/c03293145/c03293145.pdf
22 Configuring and Tuning HP ProLiant Servers for Low-Latency Applications, http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf


Table 3. BIOS settings

Parameter Value

Intel Virtualization Technology Disabled

Intel Hyper-Threading Options Disabled

Intel Turbo Boost Technology Enabled

Intel VT-d Disabled

Thermal Configuration Maximum Cooling

Processor Power and Utilization Monitoring Disabled

Memory Pre-Failure Notification Disabled

HP Power Profile Maximum Performance

HP Power Regulator Static High Performance

Intel QPI Link Power Management Disabled

Minimum Processor Idle Power Core State No C-states

Minimum Processor Idle Power Package State No Package State

Energy/Performance Bias Maximum Performance

Collaborative Power Control Disabled

DIMM Voltage Preference Optimized for Performance

Interrupt assignments and memcache processor affinity

The Intel Sandy Bridge processors have an integrated PCI Express 3.0 controller. In our test configuration, we have two of these processors in each DL380p Gen8 server. Therefore, to best support maximum performance, we localize PCIe interrupt processing (interrupts from the 10GbE NICs) to the processor hosting the task requesting the network traffic. If we allow the interrupts to be hosted by the other processor, we incur an extra hop between processors to handle the interrupt, and given our test results, the impact is significant.23

Table 4. DL380p PCIe slot to processor assignments24

PCIe Riser Slot Processor

Standard 1 1

Standard 2 1

Onboard LOM n/a 1

Expansion 4 2

Expansion 5 2

Expansion 6 2

23 More information on the E5-2600 processor is available at intel.com/p/en_US/embedded/hwsw/hardware/xeon-e5-c604/overview
24 For more information on the DL380p PCIe slot assignments, see the quickspecs at http://h18004.www1.hp.com/products/quickspecs/14212_div/14212_div.html


In the Couchbase server process, the Couchbase-memcache worker threads handle the network I/O for Couchbase client requests. For the system under test, the PCIe controller for the 10Gb NIC is connected to Processor 1; therefore, we set the processor affinity of these threads to Processor 1. By having all the worker threads running on a single processor, we also gain additional performance due to less memcache thread synchronization overhead between the processor chips.

By default, the Couchbase Server process starts with four Couchbase-memcache worker threads per Couchbase server instance. To improve performance, and to fully utilize all cores of the processor, we recommend starting one Couchbase-memcache worker thread per core. For the system under test, with eight cores per processor available, eight Couchbase-memcache worker threads were started.

The second processor provides additional compute capacity for all the other Couchbase processes and threads (the disk writer threads, the disk compaction process, and other overhead tasks).

Appendix A provides an example C program to increase the number of Couchbase-memcache worker threads, and Appendix B provides an example shell script to set processor affinity.

Figure 3 shows a throughput comparison when the NIC interrupts are processed by processor 1 and the memcache threads have varying processor affinities. The throughput is measured for YCSB benchmark ‘C’. The Couchbase Server was started with eight Couchbase-memcache worker threads. The detail data is provided in Appendix D, Table D-1.

Figure 3. Performance comparison with varying memcache thread processor affinity

[Chart: throughput in millions of operations per second versus number of YCSB clients (1 to 512) for three configurations: 8 memcache threads with Processor 1 affinity, 8 memcache threads with Processor 2 affinity, and 8 memcache threads split equally between Processors 1 and 2.]


Figure 4 shows average document read latency comparison when the NIC interrupts are processed by processor 1 and the memcache threads have varying processor affinities. The latency is measured for YCSB benchmark ‘C’. The Couchbase server was started with eight Couchbase-memcache worker threads. The detail data is provided in Appendix D, Table D-4.

Figure 4. Average document read latency comparison with varying memcache thread processor affinity

[Chart: average read latency in milliseconds versus number of YCSB clients (1 to 512) for the same three memcache thread affinity configurations as Figure 3.]

Memory

System dynamic RAM is a determining factor when calculating throughput performance. There are two portions to this section:

• Populating the DIMM slots for maximum performance

• Sizing memory correctly for Couchbase server data buckets

Populating DIMM slots

For maximizing Couchbase performance, DL380p Gen8 nodes should be provisioned with a minimum of 128GB RAM configured as eight 16GB DDR3-1600 RDIMMs, one DIMM per channel. The memory configuration may be expanded to 256GB by adding 8 x 16GB DIMMs, or to 192GB by adding 8 x 8GB DIMMs, in the second DIMM slot.25

25 A two processor DL380p has 4 memory channels per processor with 3 DIMM slots in each channel for a total of 24 slots. Memory configuration is subject to DIMM population rules and guidelines described in the white paper “Configuring and using DDR3 memory with HP ProLiant Gen8 Servers”, http://h20000.www2.hp.com/bc/docs/support/SupportManual/c03293145/c03293145.pdf. To utilize the highest supported DIMM speed of 1600 MHz, a maximum of two DIMMs per channel may be occupied. Page 20 states “For optimal throughput and latency, populate all four channels of each installed CPU identically.”, and that in the second slot “There are no performance implications for mixing sets of different capacity DIMMs at the same operating speed.” Therefore, the second DIMM slot may be populated with 8GB or 16GB modules while still maintaining optimal performance.



Figure 5: Memory bandwidth when populating slots, from “Configuring and Using DDR3 Memory”

Couchbase memory recommendations

Sizing the memory working set correctly has a significant impact on Couchbase server performance.

Buckets are used to compartmentalize data within the Couchbase Server and per node RAM Quota can be set for each bucket. When the memory usage exceeds a threshold of 65% of Couchbase data bucket allocated RAM (the default low water mark) and replication is enabled, eviction of replica data items begins. When memory usage exceeds a threshold of 75% (the default high water mark) Couchbase starts evicting items from the cache using a random eviction policy until the memory usage falls below the low water mark. When memory usage reaches 90% of bucket allocated RAM, Couchbase will return temporary out of memory errors to clients when storing data.
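As a quick arithmetic sketch, for a hypothetical 43GB per-node bucket quota those default thresholds fall at roughly the following points:

// Arithmetic sketch only: default Couchbase memory thresholds for a
// hypothetical 43GB per-node bucket quota.
public class WaterMarkSketch {
    public static void main(String[] args) {
        double quotaGB = 43.0;  // assumed per-node bucket RAM quota
        System.out.printf("Low water mark  (65%%): %.1f GB (replica eviction begins)%n", quotaGB * 0.65);
        System.out.printf("High water mark (75%%): %.1f GB (random eviction until low water mark)%n", quotaGB * 0.75);
        System.out.printf("Temporary OOM   (90%%): %.1f GB (stores rejected temporarily)%n", quotaGB * 0.90);
    }
}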

To get to the appropriate memory recommendation, we need to calculate the working set size of the application using Couchbase Server. You will need to perform a calculation for your application similar to the one we perform for YCSB. The formulas and values below follow the recommendations Couchbase Inc. provides in the sizing guidelines available on their website.26

The behavior of our test with the YCSB client is to evenly distribute its requests across the complete target data set, as configured in the YCSB requestdistribution parameter. Your application may only have a small percentage of “hot” or working set data. In our test, we are using the entire database as the working set to assist with our performance analysis. These performance results may then be used as a baseline to assist with specifying a configuration and estimating real-world performance expectations. For example, a 2.5TB dataset with a 1% working set would be expected to perform similarly to the systems tested due to the comparable active dataset size of 25GB. Remember, since each application has its own system usage characteristics, your mileage may vary.

For our test, we use the following sizing input variables.

• Number of Documents: 25 million, the total number of documents in our working set – per server node

• Number of Replicas: zero or one, the number of copies of the original data we want to keep

• Document key size: 10 bytes, the size of the document IDs

• Document value size: 1,000 bytes, the size of the values

• Working set percentage: 100%, or less to assist with performance analysis, the percentage of data we want in memory

There are also several sizing constants.

• Metadata per document, memory resident storage overhead for each document: 64 bytes.

• Headroom, an additional 25-30% space overhead over the size of the dataset required by the cluster

• High water mark, the default is set to 75% of allocated memory

26 Couchbase Sizing guidelines, couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractice-sizing.html

Page 15: Couchbase 2.0 performance.hp-dl380p.gen8

Technical white paper | Couchbase Server 2.0 performance on HP ProLiant DL380p Gen8 Server

15

Following are the formulas to calculate RAM required.

• Total Size of Metadata = (Number of documents) * ( metadata per document + document key size) * (1 + Number of replicas)

• Total Size of Dataset = (Number of documents) * (document value size) * (1 + Number of replicas)

• RAM required = (Total size of Metadata + (Total size of dataset * (working set percentage/100))) * (1 + headroom)/ (high water mark)

Table 5: YCSB RAM size estimation

Number of Documents 25,000,000

Number of replicas 0 or 1, doubling the RAM required

Metadata per document 64 bytes

Document key size 10 bytes

Document value size 1000 bytes

Total Metadata 1,850,000,000 bytes

Total Dataset 25,000,000,000 bytes

Working set % 100

Working set size 25,000,000,000

Headroom 0.3

High Water Mark 0.75

RAM Required 43.34 GB
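The sketch below simply reproduces the calculation above from the table’s inputs; it assumes the GB figure in Table 5 is computed with 1,073,741,824 (1024^3) bytes per GB:

// Reproduces the Table 5 calculation from its inputs.
public class RamSizingSketch {
    public static void main(String[] args) {
        long documents = 25_000_000L;
        int replicas = 0;                 // one replica would double the result
        int metadataPerDoc = 64;          // bytes
        int keySize = 10;                 // bytes
        int valueSize = 1000;             // bytes
        double workingSetPct = 100.0;
        double headroom = 0.3;
        double highWaterMark = 0.75;

        double metadata = documents * (double) (metadataPerDoc + keySize) * (1 + replicas);
        double dataset  = documents * (double) valueSize * (1 + replicas);
        double ramBytes = (metadata + dataset * (workingSetPct / 100.0)) * (1 + headroom) / highWaterMark;

        // Prints approximately 43.34 GB, matching Table 5.
        System.out.printf("RAM required: %.2f GB%n", ramBytes / Math.pow(1024, 3));
    }
}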


Performance implications of varying memory configurations

Varying the memory assigned to the bucket size, we observe the following results. Supporting tabular data is in Appendix G, Table G-1 and Table G-2.

With a data bucket size set to 43GB, which was the minimum required to hold all 25 million data items in memory, we find the benchmark ‘C’ performance comparable to the performance at 64GB. However, once the allocated RAM is reduced by 5% to 41GB – i.e., only 95% of the working set now resides in memory – the performance drops considerably, as shown in Figure 6.

Figure 6: YCSB workload ‘C’ with varying bucket memory configurations

[Chart: maximum throughput in thousands of operations per second versus number of YCSB clients (1 to 256) for data bucket memory sizes of 64GB, 53GB, 43GB, 41GB, and 32GB.]

In the case of YCSB workload ‘A’, the disk write queues required to hold the mutations to the dataset before they are persisted to disk require additional memory. So even though 43GB was sufficient for a read-only workload (YCSB ‘C’), in the case of YCSB ‘A’ we see a performance degradation because the additional memory usage exceeds the high water mark threshold, causing items to be evicted from memory until the memory usage drops to the low water mark. This causes additional cache misses. In Figure 7, we see an additional memory requirement of approximately 10GB. Therefore, with the data bucket size set to 53GB, the YCSB workload ‘A’ performance is comparable to 64GB.

Figure 7: YCSB workload ‘A’ with varying bucket memory configurations

[Chart: maximum throughput in thousands of operations per second versus number of YCSB clients (1 to 256) for the same data bucket memory sizes.]



Figure 8 plots the Couchbase disk write queues for a 60 minute YCSB workload ‘A’ test run using one Couchbase server with 25 million documents configured with a 64GB data bucket. There are 32 YCSB clients driving the workload and updates are occurring at approximately 250,000 updates per second for the duration of the run.

Figure 8: YCSB workload ‘A’ showing memory consumption and disk write queue

[Chart: disk write queue size in millions of items, write queue drain rate, and memory used (MB) plotted against time in minutes over the 60-minute run.]

Storage

Couchbase’s replication scheme addresses redundancy requirements. Drives may be striped (RAID 0) for maximum performance. The DL380p Gen8 provides a maximum internal storage of 25TB using 25 small form factor (SFF) serial attached SCSI (SAS) hot plug drives, or 36TB using larger SATA form factor drives.

Scale-out

For scale-out (sharding) performance tests, we increase the number of server nodes to two and four Couchbase Servers, comparing the results to a single node using the same number of clients. We loaded 25 million documents on each server, reaching a total of 100 million documents with four nodes. Figure 9 plots the throughput and Figure 10 plots the average read latency for YCSB workloads ‘A’ and ‘C’ using 32 test clients. See Appendix H, Table H-1, for tabular data. The results show near-linear scaling in throughput with a comparable decrease in read latency.



Figure 9: Scale-out performance of 2 and 4 nodes compared with a single node

[Chart: throughput in millions of operations per second versus number of nodes (1, 2, 4) for YCSB workloads ‘C’ and ‘A’ at 32 clients.]

Figure 10: Scale-out read latency of 2 and 4 nodes compared with a single node

[Chart: average read latency per operation in milliseconds versus number of nodes (1, 2, 4) for YCSB workloads ‘C’ and ‘A’ at 32 clients.]

Replication

For high availability, we recommend a three node configuration with one replica.

The following graphs provide some general statistics for three-node clusters with and without a replication factor of one. At moderate update rates, replication keeps pace with write and update rates. As we exceed six clients (~230k writes or updates per second per node), replication backoff is requested. At high update rates, most of the document mutations are in memory and are slowly drained to disk, and replication is delayed. Tables in Appendix I provide more detail on replication statistics and backlog.

As the number of items in the disk write and replication queues exceeds the throughput capability of a node, replication is asked to back off. As a node receives requests, both the update and write data it receives from client applications and the items to be replicated that it receives from other servers are placed on a disk write queue. If there are too many items waiting in the disk write queue at any given destination, Couchbase Server will request that the other servers reduce the rate at which replication data is sent to that destination. By default, this limit is very low and needs to be modified (see issue report couchbase.com/issues/browse/MB-4358). For our tests this limit is approximately 230k writes or updates per second per cluster node; your mileage will vary.

For our testing, we set the queue size to 100 million items27.

Figure 11 shows that throughput continues to grow beyond six clients; what it doesn’t show is that replication is no longer keeping up with that throughput. During bursts of traffic this may be acceptable, but if this volume is sustained, more nodes should be added to the cluster. We must observe the TAP28 queues to know whether the documents are being replicated at an acceptable rate.

Figure 11: Throughput with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica

[Chart: throughput in operations per second versus number of YCSB clients (1 to 32), with and without replication.]

To demonstrate the observation that replication is backing off, observe the difference in drain rate when compared with the set rate in Table 6. At six clients, with consolidation of writes, the drain rate is keeping up with sets; at eight clients, there is a consistent backlog that drains after the test stops; and at 32 clients, the client write and update load is so heavy that the drain rate is reduced significantly during the test, delaying replication. To optimize replication resource usage, sets to the same document are consolidated before transmission.

Table 6: Replication set rate compared with drain rate for 6, 8, and 32 clients; 3 Couchbase nodes, 1 replica

6 Clients 8 Clients 32 Clients

Set Rate 323k 536k 734k

Drain Rate 232k 237k 40.1k

27 /opt/couchbase/bin/cbepctl <couchbase server nodes>:11210 set -b <bucket name> tap_param tap_throttle_queue_cap 100000000
28 The TAP protocol is an internal part of the Couchbase Server system and is used in a number of different areas to exchange data throughout the system. TAP provides a stream of data of the changes that are occurring within the system. Source: couchbase.com/docs/couchbase-manual-2.0/couchbase-introduction-architecture-tap.html

[Chart for Figure 11: throughput (operations per second) versus number of YCSB clients (1 to 32), with and without replication.]


Table 7 shows the replication backlog over a sixty-minute test run using three Couchbase nodes with one replica. Notice that with six clients the backlog is cleared by the end of the test, while with eight and thirty-two clients the backlog takes several minutes to clear after the test stops.

Table 7: Replication backlog during sixty minute test run for 6, 8, and 32 clients; 3 Couchbase nodes, 1 replica

Minutes 6 Clients 8 Clients 32 Clients

Test Starts 0 0 0

15 13 13,969,771 35,458,754

30 20 28,124,147 26,007,996

45 18 17,937,968 41,194,417

60 0 14,131,188 32,997,405

Test Stops - - -

TS+1 0 11,743,104 26,887,516

TS+2 0 3,592,056 15,294,458

TS+3 0 36,723 2,552,896

TS+4 0 0 0

Figure 12 shows that the consumption of processor one stabilizes between six and eight clients; all memcached threads run on this processor. These memcached threads process all the sets for updates, writes, and replication.

Figure 12: Processor usage with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica

[Chart for Figure 12: processor usage (%CPU) versus number of YCSB clients (1 to 32), showing Processor 1 and Processor 2 with and without replication.]


Figure 13 shows that network traffic stabilizes at approximately eight clients. This is where replication is asked to back off: the processor running the memcached threads is wholly consumed between six and eight clients. Again, watch your TAP queues.

Figure 13: Network usage with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica

Figure 14 shows a general trend of greater disk write rate with replication. The observation is that each node has greater disk requirements due to replication.

Figure 14: Disk usage with and without replication for YCSB ‘A’ for 3 Couchbase nodes, 1 replica

[Chart for Figure 13: total network usage (Gbps) versus number of YCSB clients (1 to 32), with no replication and with 1 replica.]

[Chart for Figure 14: disk usage (transactions per second) versus number of YCSB clients (1 to 32), with no replication and with 1 replica.]


Couchbase and Hadoop

NoSQL DBMSs such as Couchbase Server are designed for interactive, OLTP-like applications. For batch processing or large-scale analytics, other solutions such as Hadoop can be used. Couchbase and Hadoop are complementary: over time, users may move data to Hadoop and the Hadoop Distributed File System (HDFS) for historical or pattern analysis. Couchbase provides a Hadoop Connector and is Cloudera certified, so it can be used for integration with existing HP Solutions for Apache Hadoop29. The Connector is a sqoop plugin that streams data between Couchbase and Hadoop clusters.

Installing the Couchbase Hadoop Connector

The currently distributed Couchbase Hadoop Connector works only with Cloudera CDH3; the next release of the Couchbase Hadoop Connector will support Cloudera CDH4. At the time of writing, the Couchbase Hadoop Connector for Cloudera CDH4 is not generally available, and the version we used was a preview provided by Couchbase Inc. At the HP Emerging Database Lab, we have tested with both Cloudera CDH3 and CDH4. Once you have installed Cloudera CDH3 or CDH4, together with sqoop, download the Couchbase Hadoop Connector from couchbase.com/develop/connectors/hadoop.

The Connector is packaged as a zip file. Unzip it and run the install.sh script inside, pointing it at the location of the sqoop libraries installed on the system (the default is /usr/lib/sqoop). For example, to install the Couchbase Hadoop Connector:

./install.sh /usr/lib/sqoop
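As a hedged sanity check (our own assumption, not a step from the connector documentation), you can confirm that the installer placed the connector jar on sqoop's library path and that sqoop itself still runs:

ls /usr/lib/sqoop/lib | grep -i couchbase
sqoop version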

Exporting data from Hadoop to Couchbase Server

To get data from Hadoop into the Couchbase cluster, we use sqoop to export the data from HDFS to Couchbase. As an example, we can use the Hadoop WordCount example from the map-reduce tutorial30.

The Hadoop WordCount example reads text files from HDFS and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Each mapper takes a line as input and breaks it into words. It then emits a key-value pair of the word and 1. Each reducer sums the counts for each word and emits a single key-value with the word and the sum.

Assuming the text files are stored in the HDFS directory /wordcount-input, we can kick off a Hadoop map-reduce job as follows.

/usr/bin/hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar wordcount /wordcount-input /wordcount-output
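Before exporting, it can be useful to confirm that the job produced the expected tab-separated word counts. The commands below are a hedged sketch; the part-file names depend on the Hadoop release and job configuration.

/usr/bin/hadoop fs -ls /wordcount-output
/usr/bin/hadoop fs -cat /wordcount-output/part-* | head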

To export these key-value pairs from HDFS so that they can be queried from Couchbase applications, we invoke sqoop as follows.

sqoop export --connect http://<ip address of Couchbase node>:8091/pools \

--table dummy_argument --export-dir /wordcount-output \

--fields-terminated-by '\t' --lines-terminated-by '\n'

The table parameter is required, but ignored. The export-dir parameter is the output directory from our Hadoop word count map-reduce job. The fields-terminated-by and lines-terminated-by parameters describe the format of the files.
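One hedged way to confirm that the exported key-value pairs arrived is to check the bucket's item count through the Couchbase REST interface; this sketch assumes the default bucket and the standard administration port 8091.

curl -s http://<ip address of Couchbase node>:8091/pools/default/buckets/default | \
  python -mjson.tool | grep itemCount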

To export the data as JSON documents, you can modify the reducer in the word count example to emit JSON formatted data from the reduce phase of the processing. See Appendix J for emitting JSON formatted data.

29 HP Solutions for Apache Hadoop, hp.com/go/hadoop

30 Hadoop word count example, http://hadoop.apache.org/docs/r0.19.1/mapred_tutorial.html and http://wiki.apache.org/hadoop/WordCount


Importing data from Couchbase Server to Hadoop

Similar to exporting data out of Hadoop to Couchbase, we can import data from Couchbase into Hadoop to perform map-reduce processing in the Hadoop cluster.

sqoop import --connect http://<ip address of Couchbase node>:8091/pools \

--table DUMP --username default --target-dir /exporteddatafromcouch
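Once the import completes, the records land in the target HDFS directory given above and can be listed as a quick, hedged check:

/usr/bin/hadoop fs -ls /exporteddatafromcouch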

For importing JSON documents from Couchbase to Hadoop with the current released version of the Couchbase Hadoop Connector, users must write their own class that implements the SqoopRecord abstract class and use it when invoking sqoop import with the jar-file and class-name parameters.

sqoop import --connect http://<ip address of Couchbase node>:8091/pools \

--table DUMP --jar-file /userdir/DUMP.jar --class-name DUMP --username default \

--target-dir /exporteddatafromcouch

The jar-file and class-name parameters in the preceding command are required until the upcoming release of the Couchbase Hadoop Connector.

HP Insight Cluster Management Utility

HP Insight Cluster Management Utility31 (CMU) is an efficient and robust hyperscale cluster lifecycle management framework and suite of tools for large Linux clusters such as those found in High Performance Computing (HPC) and Big Data environments. A simple graphical interface enables an ‘at-a-glance’ view of the entire cluster across multiple metrics, provides frictionless scalable remote management and analysis, and allows rapid provisioning of software to all the nodes of the system. Insight CMU makes the management of a cluster more user friendly, efficient, and error free than if it were being managed by scripts, or on a node-by-node basis. Insight CMU offers full support for iLO 2, iLO 3, iLO 4 and LO100i adapters on all HP ProLiant servers in the cluster.

Insight CMU is highly flexible and customizable, offers both GUI and CLI interfaces, and is being used to deploy a range of software environments, from simple compute farms to highly customized, application-specific configurations. Insight CMU is available for HP ProLiant and HP BladeSystem servers with Linux operating systems, including Red Hat Enterprise Linux, SUSE Linux Enterprise, CentOS, and Ubuntu. Insight CMU also includes options for monitoring Graphical Processing Units (GPUs) and for installing GPU drivers and software. Figure 15 provides an example throughput graphic available through CMU.

Figure 15: CMU example throughput graphics

31 For more information on HP Insight Cluster Management Utility, hp.com/go/cmu


Summary

The HP DL380p Gen8 server is an excellent host for Couchbase Server 2.0 and provides state-of-the-art capabilities and capacities in a compact 2U chassis.

• Two Intel Xeon E5-2690 8-core processors

• Up to 768GB RAM

• Up to 16 SFF HP SmartDrives with the latest HP Smart Array Controller

• Two 10Gbps Ethernet or FlexFabric LOM

• Six PCIe Gen3 slots

• HP Active Health System for always-on diagnostics

• HP iLO 4th generation management processor

As shown in Table 8, an HP DL380p Gen8 server paired with Couchbase 2.0 in a single-node configuration running YCSB delivers 796,044 read (workload ‘C’) or 505,634 update-heavy (workload ‘A’) operations per second. In a scale-out four-node configuration with a sharded database, YCSB throughput is near linear, at 2,680,010 read and 1,777,368 update-heavy operations per second. Read latency decreased near-linearly as nodes were added.

Table 8: YCSB workloads ‘A’ and ‘C’, one to four nodes, 25 to 100 million documents; throughput in operations per second, and latency in microseconds

Nodes ‘C’ Throughput ‘C’ Read Latency ‘A’ Throughput ‘A’ Read Latency

1 796,044 5,141 505,634 16,194

2 1,387,095 2,975 952,431 8,591

4 2,680,010 1,543 1,777,368 4,600

Appendices

Appendix A: Increasing memcache threads

Currently there is no simple way to increase the number of Couchbase-memcache worker threads. See Couchbase issue # MB-5519 couchbase.com/issues/browse/MB-5519.

For our tests, we increased the number of Couchbase-memcache worker threads by creating a ‘stub’ that invokes the real executable binary with the required parameters. To set the number of memcached worker threads, the original memcached binary (memcached.orig) is invoked with the parameter string “-t 8”.

Move the original “/opt/couchbase/bin/memcached” to “/opt/couchbase/bin/memcached.orig”.

Compile the source for the following stub, move the executable to /opt/couchbase/bin/memcached, and set the file permissions to the original memcached values; an example build-and-install sequence follows the listing.

#include <unistd.h>

#include <stdio.h>

int main(int argc, char** argv)
{
    /* Replace this process with the real memcached binary, forcing 8 worker threads (-t 8). */
    execl("/opt/couchbase/bin/memcached.orig","/opt/couchbase/bin/memcached.orig",
          "-t","8","-X", "/opt/couchbase/lib/memcached/stdin_term_handler.so",
          "-X","/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;"
          "sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
          "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p","11210",
          "-E","/opt/couchbase/lib/memcached/bucket_engine.so","-B","binary","-r",
          "-c","10000","-e","admin=_admin;default_bucket_name=default;auto_create=false",
          (char *) 0);
    perror("execl /opt/couchbase/bin/memcached.orig"); /* reached only if execl fails */
    return 1;
}
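The following is a hedged example of building and installing the stub. The compiler invocation, source file name, file ownership, and the service restart are our assumptions for a typical Red Hat-style installation; match the ownership and permissions that ls -l reports for memcached.orig.

gcc -O2 -o memcached-stub memcached-stub.c    # memcached-stub.c holds the listing above
mv /opt/couchbase/bin/memcached /opt/couchbase/bin/memcached.orig
cp memcached-stub /opt/couchbase/bin/memcached
chown couchbase:couchbase /opt/couchbase/bin/memcached
chmod 755 /opt/couchbase/bin/memcached
service couchbase-server restart              # restart so the stub is picked up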


Appendix B: Setting processor affinity

The following is a sample script to set processor affinity.

#!/bin/bash

TMP1=`mktemp`

TMP2=`mktemp`

TMP3=`mktemp`

#get a stack trace of the couchbase memcached process

pstack `ps -u couchbase | grep memcached | sed 's/^[ ]*//' | \

cut -d' ' -f1` > $TMP1

#the memcached worker threads all have 'worker_libevent' in the thread stack frame

grep -B 4 "worker_libevent" $TMP1 | grep LWP | sort > $TMP2

grep LWP $TMP1 | sort > $TMP3

#set the affinities of the non-memcached worker threads to the second processor

comm -3 $TMP2 $TMP3 | cut -d\( -f3 | cut -d\) -f1 | sed 's/LWP//' | \

sed 's/^/taskset -c -p 8-15/' > $TMP1

echo "will execute this script:"

cat $TMP1

chmod +x $TMP1

$TMP1

#set the affinities of the memcached worker threads to the first processor

cat $TMP2 | cut -d\( -f3 | cut -d\) -f1 | sed 's/LWP//' | \

sed 's/^/taskset -c -p 0-7/' > $TMP1

echo "will execute this script:"

cat $TMP1

chmod +x $TMP1

$TMP1

pstack `ps -u couchbase | grep beam.smp | sed 's/^[ ]*//' | cut -d' ' -f1` | \

grep LWP > $TMP1

cat $TMP1 | cut -d\( -f3 | cut -d\) -f1 | sed 's/LWP//' | \

sed 's/^/taskset -c -p 8-15/' > $TMP2

chmod +x $TMP2

$TMP2

rm -rf $TMP1 $TMP2 $TMP3
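As a hedged follow-up check (not part of the script above), the resulting CPU affinity of every memcached thread can be listed; this sketch assumes the procps ps and util-linux taskset utilities found on our Red Hat test systems.

#!/bin/bash
# find the memcached process (matches memcached.orig as well) and print each thread's affinity
PID=$(ps -u couchbase -o pid=,comm= | awk '$2 ~ /memcached/ {print $1; exit}')
for tid in $(ls /proc/$PID/task); do
    taskset -c -p "$tid"
done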

Appendix C: Java jar files required to support the YCSB client

A list of jar files used by the YCSB test client:

core-1.0.jar

couchbase-2.0-1.0.jar

couchbase-client-1.0.2.jar

slf4j-api-1.5.2.jar

slf4j-log4j12-1.5.2.jar

log4j-1.2.14.jar

jackson-core-asl-1.9.2.jar

jackson-mapper-asl-1.9.2.jar

spymemcached-2.8.0.jar

jettison-1.1.jar

netty-3.2.0.Final.jar

commons-codec-1.5.jar


Appendix D: Performance comparison NIC processor affinity data

All tests in this appendix are run with YCSB workload ‘C’ on one Couchbase server node with:

• 64GB bucket size

• No Hyper-Threading

• 25 million documents

• 8 Couchbase-memcache threads

The information related to each test displayed in Figure 3 follows.

Table D-1: Affinity set to Processor 1

YCSB Clients  Throughput (ops/sec)  Average Latency (micro secs)  Total CPU%  Processor 1 %CPU  Processor 2 %CPU  Network Usage (Gbps)

1 87,165 1,464 5 8 1 1

2 162,491 1,571 9 17 1 2

4 290,953 1,759 17 33 1 3

8 535,220 1,912 33 65 1 6

16 716,585 2,856 44 87 1 7

32 796,044 5,141 48 95 2 8

64 856,588 9,566 48 95 2 9

128 898,709 18,350 49 95 2 9

256 924,019 35,920 48 95 2 9

512 945,230 69,420 48 94 2 9

Table D-2: Affinity set to Processor 2

YCSB Clients  Throughput (ops/sec)  Average Latency (micro secs)  Total CPU%  Processor 1 %CPU  Processor 2 %CPU  Network Usage (Gbps)

1 77,558 1,646 5 1 8 1

2 147,474 1,732 9 1 17 2

4 270,725 1,894 17 1 32 3

8 515,503 1,983 32 1 61 5

16 684,837 2,991 42 1 79 7

32 728,545 5,625 45 1 84 8

64 715,887 11,427 44 1 84 8

128 711,053 23,016 45 1 84 8

256 712,483 45,687 45 1 84 8

512 700,544 91,975 45 1 85 7


Table D-3: Affinity equally split between Processor 1 and 2

YCSB Clients  Throughput (ops/sec)  Average Latency (micro secs)  Total CPU%  Processor 1 %CPU  Processor 2 %CPU  Network Usage (Gbps)

1 87,562 1,457 5 9 1 1

2 143,029 1,800 10 9 10 2

4 263,978 1,937 18 9 25 3

8 466,853 2,220 35 35 35 5

16 574,819 3,591 45 44 46 6

32 593,565 6,935 47 46 48 6

64 588,088 13,969 46 46 47 6

128 599,402 27,302 47 46 47 6

256 599,154 54,445 47 46 47 6

512 611,754 105,377 48 47 48 6

Table D-4: Average document read latency comparison with varying memcache thread processor affinity; latency in microseconds

YCSB Clients  Processor 1 Affinity Packet Latency  Processor 2 Affinity Packet Latency  Split Processor Affinity Packet Latency

1 1,464 1,646 1,457

2 1,571 1,732 1,800

4 1,759 1,894 1,937

8 1,912 1,983 2,220

16 2,856 2,991 3,591

32 5,141 5,625 6,935

64 9,566 11,427 13,969

128 18,350 23,016 27,302

256 35,920 45,687 54,445

512 69,420 91,975 105,377


Appendix E: Hyper-Threading throughput comparison

Figure E-1: Operations per second with Hyper-Threading enabled and disabled

Table E-1: Operations per second with Hyper-Threading enabled and disabled

YCSB Clients  Hyper-Threading Off (Disabled)  Hyper-Threading On (Enabled)

1 87,165 87,721

2 162,491 153,645

4 290,953 246,990

8 535,220 471,060

16 716,985 696,002

32 796,044 797,855

64 856,588 865,621

128 898,709 902,177

256 924,019 923,406

512 945,230 922,442

[Chart for Figure E-1: throughput (millions of operations per second) versus number of YCSB clients (1 to 512), comparing HT ‘off’ with 8 memcache threads and HT ‘on’ with 16 memcache threads.]


Appendix F: Comparison of operations per second and average read latency

For maximum performance, the DL380p nodes should be provisioned with 10G Ethernet ports. The throughput shown in Table F-1 is for one DL380p Gen8 server node.

Table F-1: Comparison of operations per second and average read latency with 10GbE NIC

YCSB Clients  Throughput (ops per second)  Average Read Latency (microseconds)

1 87,165 1,464

2 162,491 1,571

4 290,953 1,759

8 535,220 1,912

16 716,585 2,856

32 796,044 5,141

64 856,588 9,566

128 898,709 18,350

256 924,019 35,920

512 945,230 69,420

Appendix G: Performance implications of varying memory bucket size

Table G-1: YCSB workload ‘C’, one node, 25 million documents, with varying bucket memory configurations; throughput in operations per second

YCSB Clients  64GB  53GB  43GB  41GB  32GB

1 87,358 89,393 90,679 70 5

2 164,010 170,792 167,272 162 14,293

4 292,671 302,591 299,893 35,534 26,885

8 540,882 548,807 547,101 92,717 26,561

16 714,945 730,184 720,451 94,775 30,873

32 802,232 809,897 805,374 101,311 34,666

64 871,037 888,026 875,130 100,629 33,072

128 913,413 937,368 918,457 95,732 35,525

256 946,124 969,239 949,576 103,886 35,840


Table G-2: YCSB workload ‘A’, one node, 25 million documents, with varying bucket memory configurations; throughput in operations per second

YCSB Clients  64GB  53GB  43GB  41GB  32GB

1 87,047 87,204 5,645 155 10

2 156,855 157,582 160 178 1,906

4 278,978 281,590 24,982 1,727 23,884

8 504,167 494,053 109,627 75,093 24,628

16 515,176 508,931 98,790 70,338 25,715

32 521,661 519,218 90,692 67,388 25,769

64 523,903 521,885 95,937 67,359 26,384

128 521,993 522,801 101,818 67,384 27,073

256 518,300 519,793 102,405 71,504 29,416

Appendix H: Scale-out performance of two and four nodes

Table H-1: YCSB workloads ‘A’ and ‘C’, one to four nodes, 25 to 100 million documents; throughput in ops per second, and latency in microseconds

Nodes ‘C’ Throughput ‘C’ Read Latency ‘A’ Throughput ‘A’ Read Latency

1 796,044 5,141 505,634 16,194

2 1,387,095 2,975 952,431 8,591

4 2,680,010 1,543 1,777,368 4,600

Appendix I: Replication tables

The first set of tables below provides some general statistics for two- and three-node clusters with and without replication. At high update rates, most document mutations are in memory and are slowly drained to disk. For high availability, we recommend a three-node configuration with one replica; however, we recognize that some sites may use only a two-node cluster. With that in mind, we include the results and statistics for both two- and three-node configurations with and without one replica.

The second set of tables provides replication backlog information. As the number of items in the disk write and replication queues exceeds the throughput capability of the cluster, replication is asked to back off. As a node receives requests, both the updates and writes it receives from client applications and the items to be replicated from other servers are placed on a disk write queue. If there are too many items waiting in the disk write queue at any given destination, Couchbase Server will request that the other servers reduce the rate at which data is sent to that destination. By default, this limit is very low and needs to be modified (see issue report couchbase.com/issues/browse/MB-4358). For our tests this limit is approximately 230k writes or updates per second per cluster node; your mileage will vary.

For our testing, we use a three-node cluster and set the queue size to 100 million items. The following command is an example of setting the queue size.

/opt/couchbase/bin/cbepctl <couchbase server nodes>:11210 set -b <bucket name> tap_param tap_throttle_queue_cap 100000000
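To confirm that the new cap is in effect, the engine parameters reported by cbstats can be inspected. This is a hedged sketch; the exact stat name (assumed here to appear as ep_tap_throttle_queue_cap) can differ between builds.

/opt/couchbase/bin/cbstats <couchbase server nodes>:11210 all | grep tap_throttle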

Replication statistics for two and three nodes with and without replication

To demonstrate the replication capability of the cluster, we only use the update benchmark YCSB ‘A’. Tables I-1 and I-2 are the results and statistics for two Couchbase nodes with one replica; Tables I-3 and I-4 are the results and statistics for three Couchbase nodes with one replica.


Table I-1: Throughput and resource consumption without replication – YCSB workload ‘A’, 2 Couchbase nodes, 1 replica

YCSB Clients  Throughput (ops/sec)  Average Latency (micro secs)  Total CPU%  Processor 1 %CPU  Processor 2 %CPU  Memory %  Network Usage (Gbps)  Disk (tps)

1 167,302 1,495 13 12 14 42 0.88 3

2 300,916 1,669 19 24 13 42 1.58 2

4 528,127 1,908 30 45 13 42 2.78 2

6 737,731 2,051 40 66 13 44 3.74 2

8 944,832 2,135 51 86 14 42 4.65 2

16 955,568 4,266 51 86 14 40 4.67 2

32 947,329 8,667 53 84 19 42 4.58 1

Table I-2: Throughput and resource consumption with replication – YCSB workload ‘A’, 2 Couchbase nodes, 1 replica

YCSB Clients  Throughput (ops/sec)  Average Latency (micro secs)  Total CPU%  Processor 1 %CPU  Processor 2 %CPU  Memory %  Network Usage (Gbps)  Disk (tps)

1 153,800 1,633 20 22 17 74 1.58 2

2 278,665 1,805 34 43 24 78 2.68 2

4 397,187 2,946 37 50 24 77 3.18 12

6 617,684 2,509 44 60 28 75 3.73 17

8 832,199 2,507 54 82 25 76 4.70 24

16 902,888 4,589 56 84 26 76 4.66 24

32 955,981 8,580 57 87 24 76 4.85 32

Table I-3: Throughput and resource consumption without replication – YCSB workload ‘A’, 3 Couchbase nodes, 1 replica

YCSB Clients  Throughput (ops/sec)  Average Latency (micro secs)  Total CPU%  Processor 1 %CPU  Processor 2 %CPU  Memory %  Network Usage (Gbps)  Disk (tps)

1 187,912 1,328 16 8 23 64 0.64 19

2 366,488 1,361 16 18 13 65 1.25 2

4 682,691 1,502 31 36 25 63 2.29 12

6 975,603 1,583 35 56 13 62 3.35 2

8 1,246,078 1,647 50 74 24 61 4.35 21

16 1,393,160 2,899 50 82 15 62 4.61 6

32 1,384,244 5,901 52 81 21 64 4.52 16


Table I-4: Throughput and resource consumption with replication – YCSB workload ‘A’, 3 Couchbase nodes, 1 replica

YCSB Clients  Throughput (ops/sec)  Average Latency (micro secs)  Total CPU%  Processor 1 %CPU  Processor 2 %CPU  Memory %  Network Usage (Gbps)  Disk (tps)

1 186,958 1,335 17 16 17 78 1.24 3

2 231,865 2,683 27 25 29 84 1.58 35

4 293,408 6,463 39 44 33 79 1.99 32

6 666,862 9,378 51 61 41 78 3.98 21

8 1,031,963 2,077 50 80 18 80 5.11 8

16 1,273,242 3,215 54 81 25 76 4.83 31

32 1,364,619 5,985 53 79 25 76 4.69 27

Replication backlog

The following tables show the replication backlog during 60-minute test runs with 6, 8, and 32 YCSB clients on a three-node cluster. Replication backlog is obtained using the cbstats tool.
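For reference, the backlog figure reported below can be sampled on a node with a command of the following form; this is a hedged sketch that assumes the default install path and the Ep_tap_total_backlog_size stat name shown in the table headings.

/opt/couchbase/bin/cbstats <couchbase server node>:11210 all | grep -i ep_tap_total_backlog_size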

Tables I-5, I-6, and I-7 show the replication backlog. The database and its replication are eventually consistent, yet once we pass the approximately 230k updates per second per node mark, the replication queue starts to grow and replication is asked to back off. When the clients consume most of the processor capacity (as at 32 clients), replication approaches zero until processing cycles become available to handle the replication queue. Once the load falls below maximum capacity, replication continues. Document writes and updates are consolidated; multiple writes to the same document are purged and only the last write or update is replicated and written to disk.

Notice in Table I-5 that there is no backlog when the test stops. At rates above approximately 237k writes and updates per second a backlog occurs.

Table I-5: Replication backlog 6 clients

Time Replication backlog (Ep_tap_total_backlog_size)

05:00 0

05:15 13

05:30 20

05:45 18

06:00 0

Test Stops


Table I-6: Replication backlog 8 clients

Time Replication backlog (Ep_tap_total_backlog_size)

06:46 0

07:01 13,969,771

07:16 28,124,147

07:31 17,937,968

07:46 14,131,188

Test Stops

07:47 11,743,104

07:48 3,592,056

07:49 36,723

07:50 0

Table I-7: Replication backlog 32 clients

Time Replication backlog (Ep_tap_total_backlog_size)

08:51 0

09:06 35,458,754

09:21 26,007,996

09:36 41,194,417

09:50 32,997,405

Test Stops

09:51 26,887,516

09:52 15,294,458

09:53 2,552,896

09:54 0

Appendix J: JSON output for Hadoop

The map-reduce word count job outputs its results in a key-value format. If you export that to Couchbase, the “value” part will be stored in binary format; we want the value part in JSON document format (new for Couchbase 2.0). To do this, modify the map-reduce example to use a JSON library of your choice. http://hadoop.apache.org/docs/r0.19.1/mapred_tutorial.html is an example of exporting as key-value, and ibm.com/developerworks/opensource/library/ba-hadoop-couchbase/index.html is an example of exporting values as JSON documents.

Several Java JSON libraries are available; the following are two.

• http://code.google.com/p/google-gson/

• http://jackson.codehaus.org/
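For illustration only (this is not the approach described above, which is to modify the reducer), the tab-separated WordCount output can be previewed in the one-JSON-document-per-line shape the modified reducer should emit by using a quick shell one-liner; the field names word and count here are hypothetical.

/usr/bin/hadoop fs -cat /wordcount-output/part-* | \
  awk -F'\t' '{printf "{\"word\":\"%s\",\"count\":%s}\n", $1, $2}' | head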


For more information

HP DL380p Gen8 server datasheet: http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-9615ENW

HP Insight Cluster Management Utility: hp.com/go/cmu

Configuring and Tuning HP ProLiant Servers for Low-Latency Applications: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf

Configuring and using DDR3 memory with HP ProLiant Gen8 Servers: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c03293145/c03293145.pdf

Couchbase Server Overview: couchbase.com/couchbase-server/overview

Couchbase Server 2.0 white paper: couchbase.com/sites/default/files/uploads/all/whitepapers/Introducing-Couchbase-Server-2_0.pdf

Couchbase 2.0 server manual: couchbase.com/docs/couchbase-manual-2.0/

Couchbase 2.0 developer’s guide: couchbase.com/docs/couchbase-devguide-2.0/

Couchbase Server technical overview: couchbase.com/sites/default/files/uploads/all/whitepapers/Couchbase-Server-Technical-Whitepaper.pdf

Couchbase important UI stats: couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractice-ongoing-ui.html

Couchbase Sizing guidelines: couchbase.com/docs/couchbase-manual-2.0/couchbase-bestpractice-sizing.html

YCSB source code: http://github.com/brianfrankcooper/YCSB

Additional YCSB published benchmark results, dated 2010-03-31: http://research.yahoo.com/files/ycsb-v4.pdf

To help us improve our documents, please provide feedback at hp.com/solutions/feedback.

Sign up for updates

hp.com/go/getupdated

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Intel and Xeon are trademarks of Intel Corporation in the U.S. and other countries. Oracle and Java are registered trademarks of Oracle and/or its affiliates.

4AA4-6203ENW, June 2013, Rev. 1