Durability for Memory-Based Key-Value Stores
Kiarash Rezahanjani
Dissertation for European Master in Distributed Computing Programme
Supervisor: Flavio Junqueira
Tutor: Yolanda Becerra
Jury
President: Felix Freitag (UPC)
Secretary: Jordi Guitart (UPC)
Vocal: Johan Montelius (KTH)
July 4, 2012
Acknowledgements
I would like to thank Flavio Junqueira, Vincent Leroy and Yolanda Becerra who helped me in
this work, especially when my steps faltered. Moreover, I owe my gratitude to my parents, Souri
and Mohammad, who have been a constant source of love, motivation, support and strength all
these years.
Hosting Institution
Yahoo! Inc. is the world’s largest global online network of integrated services with more
than 500 million users worldwide. Yahoo! Inc. provides Internet services to users, advertisers,
publishers, and developers worldwide. The company owns and operates online properties and
services, and provides advertising offerings and access to Internet users through its distribution
network of third-party entities, as well as offers marketing services to advertisers and publishers.
Its social media sites include Yahoo! Groups, Yahoo! Answers, and Flickr, which let users organize into
groups and share knowledge and photos. Search products comprise Yahoo! Search, Yahoo!
Local, Yahoo! Yellow Pages, and Yahoo! Maps to navigate through the Internet and search for
information. Yahoo! also provides a large number of specific communication, information and
life-style services. In the business domain, Yahoo! HotJobs provides solutions for employers,
staffing firms, and job seekers, and Yahoo! Small Business offers an integrated suite of
fee-based online services, including web hosting, business mail and an e-commerce platform.
Yahoo! Research Barcelona is a research lab hosted in the Barcelona Media Innovation Center
that focuses on scalable computing, web retrieval, data mining and social media, including
distributed and semantic search. This work was done in the Scalable Computing group of
Yahoo! Research Barcelona.
Barcelona, July 4, 2012
Kiarash Rezahanjani
Abstract
The emergence of multicore architecture as well as larger, less expensive RAM has made it
possible to leverage the performance superiority of main memory for large databases. Increas-
ingly, large scale applications demanding high performance have also made RAM an appealing
candidate for primary storage. However, conventional DRAM is volatile, meaning that hardware
or software crashes result in the loss of data. The existing solutions to this, such as write-ahead
logging and replication, result in either partial loss of data or significant performance reduction.
We propose an approach to provide durability to memory databases, with a negligible overhead
and a low probability of data loss. We exploit known techniques such as chain replication,
write-ahead logging and sequential writes to disk to provide durability while maintaining the
high throughput and the low latency of main memory.
Contents
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Structure of the Document
2 Background and Related Work
2.1 Background
2.1.1 Memory Database
2.1.2 Stable Storage
2.1.3 Recovery
2.1.3.1 Checkpoint
2.1.3.2 Message Logging
2.1.3.3 Pessimistic vs. Optimistic Logging
2.1.4 Replication
2.1.4.1 Replication Through Atomic Broadcast
2.1.4.2 Chain Replication
2.1.5 Disk vs RAM
2.2 Related Work
2.2.1 Redis
2.2.2 RAMCloud
2.2.3 Bookkeeper
2.2.4 HDFS
2.3 Discussion
3 Design and Architecture
3.1 Durability
3.2 Target Systems
3.3 Design Goals
3.4 Design Decisions
3.5 System Properties
3.5.1 Fault Tolerance Model
3.5.2 Availability
3.5.3 Scalability
3.5.4 Safety
3.5.4.1 Consistent Replicas and Correct Recovered State
3.5.4.2 Integrity
3.5.5 Operational Constraints
3.6 Architecture
3.6.1 Abstractions
3.6.2 Coordination of Distributed Processes
3.6.3 Server Components
3.6.3.1 Coordination Protocol
3.6.3.2 Concurrency
3.6.4 Stable Storage Unit (SSU)
3.6.5 Load Balancing
3.6.6 Failover
3.6.7 API
3.7 Implementation
4 Experimental Evaluation
4.1 Network Latency
4.2 Stable Storage Performance
4.2.1 Impact of Log Entry Size
4.2.2 Impact of Replication Factor
4.2.3 Impact of Persistence on Disk
4.3 Load Test
4.4 Durability and Performance Comparison
5 Conclusions
5.1 Conclusions
5.2 Future Work
References
List of Figures
2.1 Buffered logging in RAMCloud. Based on (1).
2.2 Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2).
2.3 Pipeline during block construction. Based on (3).
3.1 System entities.
3.2 Leader states.
3.3 Follower states.
3.4 Log server operation.
3.5 Storage unit.
3.6 Clustering decision based on the servers' available resources.
3.7 Failover.
4.1 Throughput vs. latency of our stable storage unit for different entry sizes with a replication factor of three.
4.2 Throughput vs. latency of a stable storage unit with replication factors of two and three for a log entry size of 200 bytes.
4.3 Throughput vs. latency of a stable storage unit for log entries of 200 bytes, with persistence to local disk enabled and disabled.
4.4 Throughput of a stable storage unit under sustained load.
4.5 Latency of a stable storage unit under sustained load.
4.6 Performance comparison of a stable storage unit and a hard disk.
List of Tables
4.1 RPC latency for different packet sizes within a datacenter.
4.2 Latency and throughput for a single client synchronously writing to a stable storage unit.
1 Introduction

1.1 Motivation
In the past decades, disk has been the primary storage medium. Magnetic disks offer
reliable storage and large capacity at a low cost. Although disk capacity has improved
dramatically over that time, the access latency and bandwidth of disks have not shown similar
improvements. Disk bandwidth can be increased by aggregating the bandwidth of several disks
(e.g. RAID), but the high access latency remains an issue.
To mitigate these shortcomings and improve the performance of disk-based approaches, a
number of techniques are employed, such as adding caching layers and data striping. However,
these techniques complicate large scale application development and often become costly.
In comparison to disk, RAM (referring to DRAM) offers hundreds of times higher bandwidth
and thousands of times lower latency. In today's datacenters, commodity machines with up to
32 gigabytes of DRAM are common, and it is cost-effective to have up to 64 GB of DRAM (1).
This makes it possible to hold terabytes of data entirely in a few dozen commodity machines
by aggregating their RAM. The superior performance of RAM and its dropping cost have made
it an attractive storage medium for applications demanding low latency and high throughput.
As examples of such applications, the Google search engine keeps its entire index in RAM
(4), the social network LinkedIn stores the social graph of all its members in memory, and
Google Bigtable holds SSTables' block indexes in memory (5). This trend can also be seen in
the appearance of many in-memory databases, such as Redis (6) and Couchbase (7), that use
memory as their primary storage.
Despite its advantages over disk, RAM is subject to one major issue: volatility and,
consequently, non-durability. In the event of a power outage or a hardware or software failure,
data stored in RAM is lost. In memory-based storage systems operating on commodity machines,
providing durability while maintaining good performance is a major challenge. The majority of
existing techniques for providing durability, such as checkpointing and write-ahead logging,
either do not guarantee persistence of the entire data or result in significant performance
degradation. For example, with periodic checkpointing, committed updates in the interval
between the last checkpoint and the failure point are lost, while with write-ahead logging to
disk, the write latency is tied to the disk access latency.
This work proposes a solution to provide durability for memory databases while preserving
their high performance.
1.2 Contributions
We propose an approach to provide durability for a cluster of memory databases running on
a set of commodity servers, with negligible impact on database performance. We have designed
and implemented a highly available stable storage system that provides low-latency, high-
throughput write operations, allowing a memory database to log its state changes. This enables
durable writes with low latency and recovery of the latest database state in case of failure.
Our stable storage consists of a set of storage units that collectively provide fault-tolerance
and load-balancing. Each storage unit consists of a set of servers; each server performs asyn-
chronous message logging to record changes of the database state. Log entries are replicated in
the memory of all the servers in the storage unit through chain replication. This minimizes the
possibility of data loss caused by asynchronous writes in the case of server failures and increases
the availability of logs for the purpose of recovery. Each server exploits the maximum throughput
of the hard disk by sequentially writing the log entries.
Our solution is tailored for a large cluster of memory-based databases that store data in
the form of key-value pairs and comply with the characteristics of social network platforms. The
evaluation results indicate our approach enables durable write operations with latency of less
than one millisecond while providing a good level of durability. The results also indicate that
our storage solution is able to outperform the conventional write-ahead logging on local disk
in terms of latency. In addition to low response time, the system is designed to achieve high
availability and read throughput through replication of log entries on several servers. The design
also accommodates scalability by minimizing the interactions amongst the servers and utilizing
local resources.
1.3 Structure of the Document
The rest of this document is organized as follows. Chapter 2 provides a brief introduction
on several techniques and concepts related to this work. Further in this chapter, we review
four systems that have influenced the design and discuss the approach used by each one of the
systems. In Chapter 3 we present our solution to the durability problem. We describe the
properties of our system as well as the architecture and the implementation. Chapter 4 presents
the results of the experimental evaluation and analyzes the results. Finally, Chapter 5 concludes
this thesis by summarizing its main points and presenting the future work.
2 Background and Related Work
2.1 Background
2.1.1 Memory Database
In-memory or main memory database systems store data permanently in main memory, and
disk is usually used only for backup. In disk-oriented databases, data is stored on disk and
may be cached in memory for faster access. Memcached (8) is an in-memory key-value store
that is widely used for such a purpose. For example, Facebook uses Memcached to put data
from its MySQL databases into memory (9), and consistency between the Memcached and MySQL
servers is managed by application software. In both kinds of systems an object can be kept in
memory or on disk, but the major difference is that in main memory databases the primary copy
of an object lives in memory, whereas in disk-oriented databases the primary copy lives on disk.
Main memory databases exhibit several properties that differ from disk-oriented databases, and
here we mention those most relevant to this project.
The layout of data stored on disk is important: for example, sequential and random access
to data on disk cause a major performance difference, while the access pattern in memory is of
little importance. Memory databases use data structures that allow leveraging the performance
benefits of main memory. For example, T-trees are mainly used for indexing in memory
databases, while B-trees are preferred for indexes of disk-based relational databases (10).
Main memory databases are able to provide far faster access times and higher throughput
than disk-oriented databases. The latter, however, provide stronger durability, since main memory
is volatile and data residing in memory is lost in the case of a process crash or power outage
(11). To mitigate this issue, disk is used as a backup for memory databases; hence, after a
crash the database can be recovered. We will discuss several approaches to providing durability
of data and recovery of the system state.
2.1.2 Stable Storage
There are three storage categories (12):
1. Memory storage, which loses data upon a process or machine failure or a power outage.
2. Disk storage, which survives power outages and process failures, but not disk-related
crashes such as a disk head crash or bad sectors.
3. Stable storage, which survives any type of failure and provides a high degree of fault
tolerance, usually achieved through replication. This storage model suits applications that
require reading back the correct data after writing it, with a very small probability of data
loss.
2.1.3 Recovery
Recovery techniques in a distributed environment can become complicated when a globally
consistent state has to be recovered at several nodes and there are several writers or readers.
Our approach is based on a single-writer single-reader model that simplifies recovery; hence
we discuss the recovery techniques under this model.
2.1.3.1 Checkpoint
Checkpointing (taking a snapshot) is a technique used in fault-tolerant distributed systems
to enable backward recovery by saving the system state onto stable storage from time to time.
Checkpointing is a suitable option for backup and disaster recovery, as it allows keeping
different versions of the system state from different points in time. Since checkpointing
produces a single file that can be compressed, the state can easily and quickly be transferred
over the network to other datacenters to enhance the availability and recovery of the service.
A checkpoint is a ready state of the system: reconstructing the state only requires reading
the snapshot, with no further processing.
The downside of this approach is that it stores snapshots of the server state at discrete
points in time, so a failure at any moment loses all the changes made between the last
snapshot and the failure point. This characteristic makes the method undesirable when the
latest state needs to be recovered.
In practice, checkpointing is implemented by forking a child process (with copy-on-write
semantics) to persist the state (13). This can significantly slow down a parent process serving
a large dataset, or interrupt the service for hundreds of milliseconds, particularly on a machine
with poor CPU performance. It can especially become an issue when the system is at its peak load.
2.1.3.2 Message Logging
Snapshots alone cannot always recover the latest state of a database: recovering a more
recent state requires more frequent snapshots, which incurs a high cost in terms of the
operations needed to write the entire state to stable storage. To reduce the number of
checkpoints and enable recovery of the latest state, the message logging technique can
be used.
In message logging, messages are recorded onto stable storage together with an associated
sequence number. The underlying idea is to use the logs stored in stable storage
and a checkpointed state (as a starting point) to reconstruct the latest state by replaying the
logs on top of the given checkpoint. The checkpoint is only needed to limit the number of logs,
hence shortening the recovery time.
Message logging requires that no orphan processes exist after recovery completes. An
orphan process is a process that survived the crash but whose state is inconsistent with that
of the recovered process (14). In Chapter 3 we discuss this property in our design.
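As a concrete illustration, the following is a minimal sketch in Java (with hypothetical types, not the implementation presented later) of rebuilding a key-value state from a checkpoint plus sequenced log entries:

import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;

public class LogReplay {
    // A log record carrying a sequence number, as in message logging.
    record LogEntry(long seq, String key, String value) {}

    // Rebuild the latest state: start from the checkpointed state and replay
    // every log entry whose sequence number is greater than the checkpoint's.
    static Map<String, String> recover(Map<String, String> checkpoint,
                                       long checkpointSeq,
                                       SortedMap<Long, LogEntry> log) {
        Map<String, String> state = new HashMap<>(checkpoint);
        for (LogEntry e : log.tailMap(checkpointSeq + 1).values()) {
            state.put(e.key(), e.value());   // replay updates in sequence order
        }
        return state;                        // latest state before the crash
    }
}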
2.1.3.3 Pessimistic vs. Optimistic Logging
Message logging falls into two categories: optimistic logging and pessimistic
logging (14). Logging takes time, and methods are categorized by whether a process waits
to ensure every event is safely logged before it can impact the rest of the system.
Processes that do not wait for the logging of an event to complete are optimistic, while
processes that block sending a message until the previous message has been logged are
pessimistic. Pessimistic logging sacrifices performance during failure-free runs
for the guarantee of recovering a state consistent with that of the crashed process.
In conclusion, optimistic logging is desirable from a performance point of view and is suitable
for systems with a low failure rate. Pessimistic logging is suitable for systems with a high
failure rate, or for systems where reliability is critical.
Write-ahead logging (WAL) can be considered an example of the pessimistic method: the
logs must be persisted before the changes take place. WAL is widely used in databases to
implement roll-forward (redo) recovery.
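As an illustration, here is a minimal write-ahead logging sketch in Java (a generic example, not tied to any particular database) that forces each log record to disk before applying the change in memory:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class WalStore {
    private final FileChannel log;
    private final Map<String, String> table = new HashMap<>();

    public WalStore(String path) throws IOException {
        log = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public void set(String key, String value) throws IOException {
        byte[] rec = ("SET " + key + " " + value + "\n")
                .getBytes(StandardCharsets.UTF_8);
        log.write(ByteBuffer.wrap(rec));  // append the redo record
        log.force(false);                 // pessimistic: block until it is on disk
        table.put(key, value);            // apply the change only after logging
    }
}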
2.1.4 Replication
There are two main reasons for replication: scalability and reliability. Replication enables
fault tolerance, since in the event of a crash the system can continue working using the other
available replicas. Replication can also improve performance and scalability: when many
processes access a service provided by a single server, replication can be used to divide the load
among several servers.
There is a variety of replication techniques with different consistency models. In this
document we explain two major replication techniques, and later we describe how our system
benefits from replication to improve its reliability and minimize data loss.
2.1.4.1 Replication Through Atomic Broadcast
Atomic broadcast, or total order broadcast, is a well-known approach that guarantees that all
messages are received reliably and in the same order by all participants (15). Using atomic
broadcast, all updates can be delivered and processed in order; this property can be used to
build a replicated data store in which all replicas have consistent states.
2.1.4.2 Chain Replication
Chain replication is a simple, straightforward replication protocol intended to support high
throughput and high availability without sacrificing strong consistency guarantees. In chain
replication, servers are linearly ordered to form a chain.
The first server in the chain, which is the entry point for update requests, is called the head,
and the last server, which sends the replies, is called the tail. Each update request enters at
the head of the chain; after being processed by the head, the state changes are forwarded along
a reliable FIFO channel to the next node in the chain, and so on until they reach the tail.
Queries are handled by forwarding them to the tail of the chain.
This method is not tolerant of network partitions, but in exchange it offers high throughput,
scalability and consistency (16).
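A minimal in-process sketch of the protocol in Java (hypothetical classes; a real deployment connects the nodes with reliable FIFO channels such as TCP):

import java.util.ArrayList;
import java.util.List;

class ChainNode {
    private final ChainNode successor;          // null for the tail
    final List<String> log = new ArrayList<>();

    ChainNode(ChainNode successor) { this.successor = successor; }

    // An update is recorded locally and forwarded down the chain; only the
    // tail acknowledges, so an acknowledged update exists on every node.
    boolean update(String entry) {
        log.add(entry);
        return successor == null || successor.update(entry);
    }
}

public class ChainDemo {
    public static void main(String[] args) {
        ChainNode tail = new ChainNode(null);
        ChainNode head = new ChainNode(new ChainNode(tail)); // chain of three
        System.out.println("acked: " + head.update("SET k v"));
        System.out.println("tail log: " + tail.log);         // queries go to the tail
    }
}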
2.1.5 Disk vs RAM
Magnetic disk and RAM have several well-known differences. The access time of RAM is orders
of magnitude lower than that of magnetic disk, and its throughput is orders of magnitude higher.
The access time for a record on magnetic disk consists of seek time, rotational latency and
transfer time.
Among the three, seek time is dominant when records are not large (megabytes). The seek time
of disk is several milliseconds and the transfer time varies depending on the bandwidth. For
instance, for 1 MB the transfer time is 10ms for a disk with bandwidth of 100 MB/s. On the
other hand, the access latency of a record in memory is a few nanoseconds and its bandwidth is
several gigabytes per second (17), (18). This means RAM performs orders of magnitude better
in terms of latency and throughput.
The access pattern and the way data is structured make little difference to the performance
of RAM, but this is not the case for disk. Sequential writes to disk provide far better latency
and throughput than random writes, because they eliminate the need for constant seek
operations (19). Everest (20) is an example of a system that uses sequential writes to disk
in order to increase throughput.
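The effect is easy to observe with a small experiment. The sketch below is purely illustrative (arbitrary file name and sizes, and the absolute numbers depend heavily on the drive): it compares writing blocks in order against writing them at random offsets in the same file.

import java.io.RandomAccessFile;
import java.util.Random;

public class SeekCost {
    public static void main(String[] args) throws Exception {
        final long size = 1L << 30;             // 1 GB scratch file
        byte[] block = new byte[4096];
        try (RandomAccessFile f = new RandomAccessFile("bench.dat", "rwd")) {
            f.setLength(size);
            long t0 = System.nanoTime();
            for (int i = 0; i < 1000; i++) {    // sequential writes
                f.seek((long) i * block.length);
                f.write(block);
            }
            long seq = System.nanoTime() - t0;
            Random rnd = new Random(42);
            t0 = System.nanoTime();
            for (int i = 0; i < 1000; i++) {    // random-offset writes
                f.seek(Math.floorMod(rnd.nextLong(), size - block.length));
                f.write(block);
            }
            long ran = System.nanoTime() - t0;
            System.out.printf("sequential: %.1f ms, random: %.1f ms%n",
                    seq / 1e6, ran / 1e6);
        }
    }
}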
The other difference between RAM and magnetic disk is volatility. Memory is volatile, and
data will be lost on a power outage or a crash of the process referencing the data. Magnetic
disk is non-volatile storage, and data written to disk survives power outages and process crashes.
However, writing to disk (forcing the data to disk) does not guarantee that the data is persisted
immediately. Disks have a cache layer which is volatile; therefore, a loss of power to the cache
results in the loss of data on its way to disk. One solution is to disable the cache, though this
is not practical, as it significantly degrades the performance of the disk, and hence of the
application writing to it. Other solutions are to use non-volatile RAM, as in NetApp filers (21),
or disks with a battery-backed write cache, such as HP SmartArray RAID controllers; this provides
a power source independent of the external environment that maintains the data for a short time,
allowing it to be written to disk after a power outage (22). These latter options, however, are
not considered commodity hardware.
2.2 Related Work
In this section we present some of the existing systems related to this work that have
influenced our solution in one way or another. Another reason for selecting the following systems
is that the collection of approaches they implement represents a comprehensive set of the
common methods used to provide durability for main memory databases.
We describe:
• Redis (23), an in-memory database that uses writes to local disk as well as replication
to achieve durability.
• Bookkeeper (24), which provides write-ahead logging as a reliable, fault-tolerant
distributed service.
• RAMCloud (1), a new approach to datacenter storage that keeps data entirely in the
DRAM of thousands of commodity servers.
• HDFS (3), a highly available distributed file system with an append-only write capability.
At the end we discuss the pros and cons of each approach taken by these systems.
2.2.1 Redis
Redis (23) is an in-memory key-value store that aims at providing low latency. To meet
this objective, the Redis server holds the entire dataset in memory, avoiding page swapping
between memory and disk and, consequently, the serialization/deserialization process. Redis
provides a comprehensive set of options for durability, as follows.
1. Replication of the full state in memory. Redis applies a master-slave replication model
in which all slave servers synchronize their states with the master server. The synchronization
process uses non-blocking operations on both master and slaves, so they are able to serve
clients' queries while synchronizing. This implies the eventual consistency model of Redis,
meaning that slave servers might reply to clients' queries with an old version of the data
while synchronization is in progress. MongoDB is another example of a database system that
uses a similar replication technique (25).
Redis implements a single-writer multiple-readers model in which clients are able to read
from any replica but are only permitted to write to one server. This model, along with eventual
consistency, ensures that all replicas will eventually be in the same state, while maintaining
good performance in terms of latency and read throughput.
2. The other durability method of Redis is persisting the data to local disk using point-
in-time snapshots (checkpoints) at specified intervals. In this method, the Redis server stores
the entire state of the database onto local disk every T seconds or every W write operations.
Copy-on-write semantics are used to avoid interrupting the service while the data is persisted
to disk.
3. Asynchronous logging is another approach taken by Redis to provide durability. Write
operations are buffered in memory and flushed to disk by a background process in an
append-only fashion. When the data reaches disk depends on the sync policy specified
in the configuration (flush to disk every second, or on every write) (26).
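For reference, these options map onto a handful of directives in redis.conf; the values below are representative examples, not a recommendation:

# Point-in-time snapshots (RDB): dump after 900 s if at least 1 key changed,
# or after 60 s if at least 10000 keys changed.
save 900 1
save 60 10000
# Append-only logging (AOF) of write operations.
appendonly yes
# Sync policy: always (on every write), everysec, or no (leave it to the OS).
appendfsync everysec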
2.2.2 RAMCloud
RAMCloud (1) is a large scale storage system designed for cloud scale data-intensive
applications requiring very low latency and high throughput. It stores a large volume of data
entirely in DRAM by aggregating the main memory of hundreds or thousands of commodity
machines, and it aims at providing the same level of durability as disk by using a mixture of
replication and backup techniques.
RAMCloud applies a buffered logging method for durability that utilizes both memory
replication and logging to disk. In RAMCloud, only one copy of every object is kept in memory,
and backups are stored on the disks of several machines. The primary server updates its state
upon receiving a write query and forwards the log to the backup servers; an acknowledgement is
sent by a backup server once the log is stored in its memory. A write operation returns from
the primary server once all the backup servers have acknowledged. Backup servers write the
logs to disk asynchronously and then remove them from memory.
Figure 2.1: Buffered logging in RAMCloud. Based on (1).
To recover quickly and avoid disruption of the service, RAMCloud applies two optimizations.
The first is truncating the logs to reduce the amount of data that must be read during recovery.
This can be achieved by creating frequent checkpoints and discarding the logs up to that point,
or by occasionally cleaning stale logs to reduce the size of the log file. The second optimization
is to divide the DRAM of each primary server into hundreds of shards and assign each shard
to one backup server. At the time of a crash, each backup server reads the logs and acts as a
temporary primary server until the full state of the failed server can be reconstructed.
2.2.3 Bookkeeper
Bookkeeper (24) provides write-ahead logging as a reliable distributed service (D-WAL). It
is designed to tolerate failures by replicating the logs in several locations. It ensures that
write-ahead logs are durable and available to other servers, so that in the event of a failure
other servers can take over and resume the service.
Bookkeeper implements WAL by replicating log entries across remote servers using a simple
quorum protocol. A write is successful if the entry is successfully written to all the servers in
a quorum. A quorum of size f+1 is needed to tolerate the concurrent failure of f servers.
Bookkeeper allows aggregating disk bandwidth by striping logs across multiple servers. An
application using the Bookkeeper service is able to choose the quorum size as well as the
number of servers used for logging. When the number of selected servers is greater than the
quorum size, Bookkeeper stripes the logs among the servers. Figure 2.2 shows the Bookkeeper
write operation and how it takes advantage of striping.
Figure 2.2: Bookkeeper write operation. Extracted from a Bookkeeper presentation slide (2).
In Figure 2.2, a ledger corresponds to a log file of an application, a bookie is a storage
server storing ledgers, and the BK client is used by an application to process requests and
interact with bookies.
Assume a client selects three bookies and a quorum size of two. Bookkeeper performs striping
by switching quorums and spreading the load among the three bookies. This distributes the load
among the servers, and if a server crashes the service continues without interruption. A client
can read different entries from different bookies, which allows a higher read throughput by
aggregating the read throughput of the individual servers. Bookkeeper also writes sequentially
to disk by interleaving the entries into a single file, and it stores an index of the entries in
order to locate and read them. This maximizes disk bandwidth utilization and throughput.
Bookkeeper follows a single-writer multiple-reader model and guarantees that once a ledger
is closed by a client, all readers read the same data.
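In code, client-side usage is roughly as follows; this is a hedged sketch against the BookKeeper client API of the time (host names and values are placeholders, and signatures may differ across versions):

import java.util.Enumeration;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class BkExample {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk-host:2181"); // bookies found via ZooKeeper

        // Ensemble of 3 bookies, quorum size 2: entries are striped across quorums.
        LedgerHandle lh = bk.createLedger(3, 2,
                BookKeeper.DigestType.MAC, "secret".getBytes());
        long entryId = lh.addEntry("SET k v".getBytes()); // durable once a quorum acks
        long ledgerId = lh.getId();
        lh.close();

        // Reading back (e.g. during recovery): any bookie holding an entry serves it.
        LedgerHandle rh = bk.openLedger(ledgerId,
                BookKeeper.DigestType.MAC, "secret".getBytes());
        Enumeration<LedgerEntry> entries = rh.readEntries(0, entryId);
        while (entries.hasMoreElements()) {
            System.out.println(new String(entries.nextElement().getEntry()));
        }
        rh.close();
        bk.close();
    }
}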
2.2.4 HDFS
HDFS (3) is a scalable distributed file system for reliable storage of large datasets that
delivers data at high bandwidth to applications. What makes HDFS interesting with
regard to our work is the way it performs I/O operations and achieves high reliability and
availability.
HDFS allows an application to create a new file and write to it. HDFS implements
a single-writer multiple-reader model: when a client opens a file for writing, no other client is
permitted to write to the same file until the file is closed. After a file is closed its content
cannot be altered, although new bytes can be appended. HDFS splits a file into
large blocks and stores the replicas of each block on different DataNodes. The NameNode stores
the namespace tree and the mapping of file blocks to DataNodes.
When writing to a file, if a new block is needed, the NameNode allocates one and
assigns a set of DataNodes to store its replicas; these DataNodes then form a
pipeline (chain). Data is buffered at the client side, and when the buffer fills, the bytes are
pushed through the pipeline; this amortizes the overhead of packet headers. The DataNodes are
ordered in such a way as to minimize the distance from the client to the last node in the
pipeline, thereby minimizing latency. The HDFSFileSink operator in the DataNodes buffers the
writes, and the buffer is written to disk only when adding the next tuple would exceed the
buffer size. Thereby, each server writes to disk asynchronously, which enables low write
latency in HDFS.
Figure 2.3: Pipeline during block construction. Based on (3).
Placement of block replicas is critical for reliability, availability, and network bandwidth
utilization. HDFS applies an interesting strategy for placing replicas, providing a tradeoff
between minimizing the write cost and maximizing reliability, availability and read bandwidth.
HDFS places the first replica of each block on the same node as the writer, and the second
and third on two different nodes in two different racks. HDFS enforces two restrictions:
a DataNode stores at most one replica of any block, and, provided there are sufficient
racks in the cluster, no rack stores more than two replicas of any block. In this way,
HDFS reduces the probability of correlated failures, since the failure of two nodes in the same
rack is more likely than the failure of two nodes in different racks; this maximizes availability
and read bandwidth (3).
2.3 Discussion
We summarize the approaches towards durability into four major categories.
• Replication of the full state into several locations.
• Periodic snapshots of the system state.
• Asynchronous logging of writes onto a stable storage.
• Synchronous logging of writes onto a stable storage.
The full replication approach along with eventual consistency (e.g. Redis) ensures that all
replicas will eventually be in the same state, while maintaining good performance in terms of
latency and read throughput.
This approach provides low latency and high read throughput that scales linearly with the
number of slave servers, because all read and write queries can be served from memory
without involving disk. However, it is subject to one major drawback: a large memory
requirement.
This method becomes costly in terms of hardware and, more importantly, utility cost when
we have a large cluster of servers. DRAM is volatile and requires constant electric power,
meaning the machines need to be powered at all times. For example, in today's
datacenters the largest amount of DRAM that is cost-effective is 64 GB (1); in such a
datacenter, storing 1 TB of data requires 16 machines. To have a replication factor of three,
which is considered the norm for a good level of durability (3), we need 32 extra servers. Even
though this approach offers great benefits, it is not a proper choice for a large cluster of
in-memory databases, as it becomes costly.
The other drawback is the possibility of data loss. For example, in the case of Redis, the
master server replies to updates before replication on the slave servers has completed (for
lower latency); hence, if the master fails between replying to an update and sending that
update to the replicas, the data can be lost. To prevent such a risk, the update should not
return until all the replicas have received it, although this increases the latency. This is a
tradeoff that must be made between high performance and durability. The other risk associated
with this approach is that in the case of a concurrent failure of all the servers holding the
replicas (a datacenter power outage), the entire data will be permanently lost. To mitigate this
issue, data can be replicated in multiple datacenters; however, this results in a high update
latency (hundreds of milliseconds) for blocking calls, or partial loss of data for asynchronous
calls.
Redis provides periodic snapshots. This is a good choice for backup and disaster recovery,
as it allows keeping different versions of the system state at different points in time. Since the
full state is contained in a single file, it can be compressed and transferred to other datacenters
to enhance the availability and recovery of the service.
A periodic snapshot stores the server state at one point in time; a failure at any later point
loses all the updates from the last snapshot up to the failure point. This property makes the
method undesirable when the latest state needs to be recovered. The other point to consider is
that forking a child process to persist the state can significantly slow down a parent process
serving a large dataset, or interrupt the service.
In comparison to snapshots, logging provides better durability, as every write operation
can be written to disk. To improve performance, write operations are batched in memory
before being written to disk; thus, a failure results in the loss of the buffered data. Logging
performance can be improved by writing the updates to disk in an append-only fashion, which
avoids the long latency of seek operations (on dedicated disks) by writing the logs sequentially.
Therefore, if the sink thread is the only thread writing to the disk (in an append-only fashion),
it can achieve better write throughput.
Logging provides stronger durability than snapshots, but it results in a larger log
file and a slower recovery process, since all the logs must be replayed in order to rebuild the
full state of the dataset. To accelerate the recovery process, the number of logs required to
rebuild the state should be reduced. Two major techniques for truncating the log file are as
follows. The system state can be checkpointed frequently, so that the logs before the checkpoint
can be removed. The other technique, implemented by Redis, is cleaning old logs: Redis
rewrites the log file in the background to drop unneeded logs and minimize the log file size.
Asynchronous logging to disk provides better performance than synchronous logging;
however, it increases the possibility of losing updates, so it is usually combined with
replication to mitigate this issue. RAMCloud takes this approach by replicating the
logs through broadcast, referring to it as buffered logging; this allows writes (and reads) to
proceed at the speed of RAM along with good durability and availability. Buffered logging
allows a high write throughput; however, if the write throughput continues at a sustained rate
higher than the disk throughput, it eventually fills the entire OS memory and the throughput
drops to that of the disk. Therefore, buffered logging provides good performance as long
as free memory is available.
Moreover, buffered logging does not always guarantee durability, as in the case of a sudden
power outage the buffered data will be lost. Therefore, it is suitable for applications that can
afford to lose some updates. To deal with such scenarios, cross-datacenter replication can be
used; however, the latency of writes is then expected to increase significantly.
HDFS provides an append-only operation that can be used for logging; HBase
is an example of an application using this capability of HDFS for its write-ahead log (27). The
idea is similar to RAMCloud, though the major difference is that the replication model
applied in HDFS is similar to chain replication, which enables high write throughput. HDFS
buffers the bytes in memory and writes a big chunk of data to disk when the buffer is full. HDFS
creates one file per client on each machine hosting replicas, which means that if multiple clients
concurrently write to blocks located on the same machine, the write performance degrades,
as writing to several files on the same disk requires frequent seek operations. HDFS addresses
correlated failures through a smart replication strategy, placing replicas on different machines
in multiple racks.
In the case of Bookkeeper, the quorum approach consumes more resources from one of the
participants, namely the one performing the multicast: the client multicasts the log entries to
several bookies, which consumes more of the client's bandwidth and CPU power. One way to
resolve this could be outsourcing the replication responsibility to the server ensemble, creating
a more balanced replication strategy. For example, Zookeeper (28), a coordination system for
distributed processes, applies a quorum-based approach for replication on the server side by
implementing a totally ordered broadcast protocol called Zab (29); however, this complicates
the server implementation. Our design decisions for approaching the durability problem in
memory databases are mostly influenced by the approaches described above. In the next
chapter, we describe our solution in detail.
3 Design and Architecture
In this chapter, we define durability with respect to this work and describe how we approach
the durability problem in memory-based key-value stores. We explain the system design and
its properties, and finally how the system is built.
3.1 Durability
For the purpose of this work, "durability" means that if the value v corresponding to the
key k is updated to u at time t, then a read for key k at time t' such that t' > t must return
u, provided no updates occurred between times t and t'. We assume that the durability condition
holds for a memory database as long as no crash has occurred. This work addresses the durability
of a memory database (in our case a key-value store) such that the latest committed value of
every key can be retrieved after a crash.
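Stated compactly, the condition above can be written as:

\[
\mathit{write}(k, u)\ \text{at}\ t \;\land\; \nexists\, \mathit{write}(k, \cdot)\ \text{in}\ (t, t') \;\Longrightarrow\; \mathit{read}(k)\ \text{at}\ t'\ \text{returns}\ u, \qquad \text{for all}\ t' > t.
\]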
3.2 Target Systems
The proposed system design is tailored to provide durability for a cluster of in-memory
databases storing data in the form of key-value pairs, which complies with the following
specifications.
1. The dataset is large, and the cluster of in-memory key-value stores consists of at least
dozens of machines.
2. The size of a write query (update/insert/delete) varies from a few hundred bytes to a few
KB (an example of a write query is "SET K V", which sets the value of key K to V).
3. The workload is read dominant (10-20% of queries are writes).
4. High availability of the service is important.
The above specification is common for social networking platforms such as Facebook, Twitter
and Yahoo! News Activity, which store large amounts of data in main memory and process large
numbers of events. For example, in 2008 alone, Facebook was serving 28 terabytes of
data from memory (30), and this number is increasing. Based on (31), in the Facebook cluster
less than 6 percent of the queries are writes. In social network platforms, users' write queries
are generally small (less than 4 KB) (32); for example, Twitter messages are limited to 140
characters (33).
3.3 Design Goals
• In our design we aim to provide a high level of durability, such that in the event of a
crash the latest state of the system is recoverable with a low probability of data loss. The
objective is to achieve this with minimal impact on the performance of the memory database
(read operations do not change the database state; only write operations need to be made
durable).
• We need to ensure that our system is highly available so that changes to database state can
be reliably recorded to stable storage and the records can be read at the time of recovery.
• The system needs to scale with an increasing number of databases and write operations.
• Maximizing the utilization of the local resources of the database cluster is another objective;
we try to avoid additional dependencies on external systems and to create a self-contained
application.
• Any guarantee about durability of a write should be provided before acknowledging the
success of the write operation to the writer.
• Our durability mechanism should enable a low recovery time to enhance the availability
of the database service.
3.4 Design Decisions
In Section 2.2 we described and discussed the common approaches to durability in
memory-based databases. In this section, we explain our design decisions with respect to the
target systems and the objectives.
Checkpoint vs. Logging. As checkpointing consumes a considerable amount of resources
and always leaves the possibility of data loss, we choose message logging to persist
the changes of the database state; the state can then be reconstructed by replaying the logs.
(To reduce the recovery time and limit the number of logs, a snapshot of the system state is
needed, or the unneeded logs should be truncated before recovery. To eliminate the cost of
this process during operation, a background process can be assigned to reconstruct the system
state and store it in stable storage when the system is not under stress. This is part of the
future work.)
Pessimistic vs. Optimistic Logging. We choose pessimistic logging to ensure that
changes take place only after they are durable in a stable storage system. Low latency
is one of our main objectives. To achieve it, we create stable storage through a mixture of
in-memory replication and asynchronous logging of the changes of the database state.
This allows storing log entries in several locations while providing a low response time. We name
the set of servers cooperating to perform replication and logging a stable storage unit, or SSU.
Asynchronous vs. Synchronous. Asynchronous logging is the core of our design for
providing a low response time; it eliminates the latency of writing the logs to disk. However,
since DRAM is volatile, this method carries the risk of losing the logs upon a crash. To address
this issue, we replicate the logs in the memory of several machines before acknowledging the
durability of a write. In this way we significantly reduce the probability of data loss, as it is
very unlikely that all the machines crash at the same time (3). The design targets low latency
and high throughput for write operations by trading guaranteed durability for a low probability
of data loss. Later in this chapter, we discuss the possibility of losing data and the reliability
of this method.
Chain Replication vs. Broadcast. To replicate the logs we choose chain replication,
for two main reasons. 1) Chain replication puts nearly the same load on the resources of each
server, while in broadcast one of the participants utilizes more resources than the others;
this provides implicit load balancing. 2) Chain replication enables high-throughput logging,
as the symmetric load on the servers allows utilizing the maximum resources of each server
and minimizes the chance of a bottleneck appearing. We also performed an experiment to help
us with the decision, measuring the latency caused by network transmission under either
approach; we discuss it in Chapter 4.
Local Disk vs. Remote File System. Logs can be persisted either on the local disks of
the servers or in an existing reliable remote file system (e.g. NFS, HDFS). We choose the local
disk of each server to maximize the utilization of local resources, reduce dependencies and avoid
using network bandwidth for persistence. As the logs are replicated in the memory of several
machines and all the machines persist the logs to their disks, we will have replicas of the
logs on several hard disks. This enhances the availability of the logs at recovery time and
accelerates the recovery process, since different partitions of the logs can be read from different
servers (thus aggregating the disk bandwidth of the replicas).
Faster Recovery vs. Higher Write Performance. Under sustained intensive load, if the
write throughput to the storage is higher than the write throughput to disk, the servers'
buffers eventually become saturated and performance degrades significantly. It is therefore
important to fully utilize the disk bandwidth and minimize the write latency in order to prevent
saturation of the buffers as much as possible. We write the logs in an append-only fashion,
sequentially, into a single file, to eliminate the seek time and maximize disk throughput
utilization. Therefore, we need to interleave the logs from all the writers into a single file.
As opposed to having one file per writer, this method (sequential writes to a single file)
makes the recovery process slower, since to recover the logs belonging to one writer we need to
read all the logs in the file sequentially. However, recovery needs to be done only at the time
of a crash, which is rare, whereas logging must be performed constantly. Thus, we choose faster
logging over faster recovery. The read performance could still be improved by indexing the
log entries (Bookkeeper (24) implements this method).
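A minimal sketch of such a sink in Java follows; the entry framing (clientId, logId, payload) is illustrative only, not the actual on-disk format of the system:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LogSink implements Runnable {
    record Entry(int clientId, long logId, byte[] payload) {}

    private final BlockingQueue<Entry> queue = new LinkedBlockingQueue<>();
    private final FileChannel file;

    public LogSink(String path) throws IOException {
        file = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Writers from all clients hand their entries to the single sink thread.
    public void submit(Entry e) { queue.add(e); }

    @Override public void run() {
        try {
            while (true) {
                Entry e = queue.take();
                ByteBuffer buf = ByteBuffer.allocate(12 + e.payload().length);
                buf.putInt(e.clientId()).putLong(e.logId()).put(e.payload());
                buf.flip();
                file.write(buf);    // strictly sequential, append-only write
            }
        } catch (InterruptedException | IOException ex) {
            Thread.currentThread().interrupt();  // shut down the sink
        }
    }
}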
Transport Layer Protocol. We choose TCP/IP for communication, as we want to
deliver messages in order and reliably, to provide a consistent view of the logs (and the stored
files) among all the servers in a chain.
3.5 System Properties
Our stable storage system consists of a set of stable storage units (storage units or SSUs).
Every stable storage unit consists of several servers, each persisting log entries onto its local
disk. A writer process writes to only one stable storage unit. A storage unit follows a
fail-stop model; upon an SSU crash, its clients write to another storage unit. The system
environment allows the detection of failures through a membership management service provided
by an external system.
Our solution follows a single-writer single-reader model. The log entries of a database
application are written to the stable storage by only one process, and the process that writes
the logs is the same process (same identifier) that reads them back. The read operation needs
to be performed only at recovery time; therefore, read and write operations on the same data
item are never performed simultaneously.
A reader can read the logs from more than one server within the storage unit, as all the
servers store an identical set of data (the acknowledged log entries). A process writes to a
different storage unit if its storage unit fails or if the storage unit decides to disconnect the
process.
3.5.1 Fault Tolerance Model
The system needs to be fault tolerant to continue its service in the presence of failures. We
achieve fault tolerance through replication. In our system, the persistence of an acknowledged
log entry is guaranteed under f simultaneous server failures if we have f+1 servers in the
replication chain. However, to guarantee stable storage of a log entry, we require f+2 servers
to tolerate f simultaneous failures.
We implement a fail-stop model: a server halts in response to failure, and we assume the
server's crash can be detected by all the other servers in the storage unit. In the event of a
server crash, the storage unit stops serving all its writers and only persists the remaining logs
in its servers' buffers onto disk (the writers connect to another storage unit to continue
logging). Once all the logs are persisted to disk, all the servers restart and become available
to form a new storage unit.
An alternative way to deal with failures is to repair the storage unit. However, we prefer
to re-create a storage unit and avoid repairing it, for the following reason: repairing a storage
unit requires addressing many failure scenarios, which complicates the implementation. In
addition, the possibility of corner cases that have not been taken into account, as well as the
possibility of additional failures during the repair, further complicates matters.
3.5.2 Availability
The system allows the creation of many stable storage units, and each storage unit can
provide a different replication factor. A larger replication factor (the number of servers in the
replication chain within a storage unit) provides three advantages:
• higher availability of the stored entries, since every server within the storage unit hosts
a replica of the logs;
• a lower probability of data loss when correlated failures occur, as it is more likely that at
least one server holding the buffered data survives the failure and persists it to disk;
• higher read bandwidth, by aggregating the bandwidth of the servers hosting the replicas.
Therefore, a higher replication factor for a storage unit provides stronger durability and
enhances data availability and read throughput. However, in the case of a catastrophic failure
of all the servers in a storage unit, data can be lost. A larger number of storage units enhances
the availability of write operations, since writers can continue logging after a storage unit
crash.
3.5.3 Scalability
In our storage system, every storage unit is independent of every other unit, and there are
no shared resources or coordination amongst them. The independent nature of the storage units
allows adding new units without impacting service performance. The load is divided by
assigning each set of writers to a different unit. Furthermore, to create a storage unit, a set of
servers with the closest resource usage is selected, in order to prevent a single server from
becoming a bottleneck in the chain. This allows maximum resource utilization within a storage
unit.
3.5.4 Safety
3.5.4.1 Consistent Replicas and Correct Recovered State
We rely on the TCP protocol to transfer messages in order, reliably, and without duplication
between the nodes. The servers in a chain are connected by a single TCP channel, and messages
are forwarded and persisted in the same order in which they have been received. This ensures
that all the servers in a chain view and store the logs in the same order in which the messages
were sent by the writer (the writer also writes through a single TCP channel). In our system it
is not possible to recover an incorrect or stale database state without the knowledge of the
recovery processor (the reader). Every writer is represented by a unique id (the client id), and
every log entry is uniquely identified by the combination of the client id and a log id. The log
id increases by one for every new log entry. During recovery of a database state, the logs are
read and replayed in order, and in the case of a missing (or duplicate) log entry, the recovery
processor is able to detect it.
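A sketch of this check in Java (a hypothetical helper, assuming log ids start at 1): during replay, the log id of each client must increase by exactly one, so both gaps and duplicates surface immediately.

import java.util.HashMap;
import java.util.Map;

public class SequenceChecker {
    private final Map<Integer, Long> lastSeen = new HashMap<>();

    // Call for every replayed entry; throws if the sequence is broken.
    public void check(int clientId, long logId) {
        long expected = lastSeen.getOrDefault(clientId, 0L) + 1;
        if (logId < expected)
            throw new IllegalStateException("duplicate log entry " + logId);
        if (logId > expected)
            throw new IllegalStateException("missing log entry " + expected);
        lastSeen.put(clientId, logId);
    }
}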
3.5.4.2 Integrity
During recovery, we need to ensure that an object being read is not corrupted. This requires
adding a checksum to every object stored, to enable verification of the data being read (this
feature is not part of the current implementation). A possible framing is sketched below.
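Purely as an illustration (checksums are not implemented in the prototype), the sketch below frames each entry with a CRC32 checksum before it is written and verifies the checksum when the entry is read back; the [length | crc32 | payload] layout is an assumption of this sketch.

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class Checksums {
    // Frames an entry as [length | crc32 | payload] before writing to disk.
    static byte[] frame(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + payload.length);
        buf.putInt(payload.length).putLong(crc.getValue()).put(payload);
        return buf.array();
    }

    // Verifies a frame read back during recovery and returns the payload.
    static byte[] verify(ByteBuffer frame) {
        int length = frame.getInt();
        long stored = frame.getLong();
        byte[] payload = new byte[length];
        frame.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != stored)
            throw new IllegalStateException("corrupt log entry");
        return payload;
    }
}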
3.5.5 Operational Constraints
The availability of the Zookeeper quorum is essential for the availability and operation of
the system, since we rely on Zookeeper for failure detection and for accessing metadata about
the nodes. The availability of the write operation depends on the availability of a storage unit,
and the availability of read operations requires at least one server that stores the requested logs.
In order to operate continuously, at least two storage units should be available, so that
service can quickly resume upon a storage unit crash. For example, in the current implementation
we require six servers to have two storage units with a replication factor of three. This requirement
could be reduced by repairing a storage unit upon a crash, replacing the failed server from
a pool of available servers; however, this complicates the implementation.
3.6 Architecture
The main idea is to create several storage units, each capable of reliably storing and retrieving
a stream of logs. Each storage unit consists of a number of coordinated servers that perform
chain replication and asynchronously persist the logs onto their local disks to provide a lower
response time. Each log entry is acknowledged only after it has been replicated on all the servers
within the storage unit; hence, we ensure that the logs survive the failure of some of the servers.
The number of servers in each storage unit is equal to the replication factor it provides. In this
section, we describe the architecture of the system (with respect to the write operation).
3.6.1 Abstractions
Our system consists of three types of processes. Figure 3.1 illustrates the processes and
below we describe their functions.
• Log server processes (server) form a storage unit and asynchronously store the log
entries on local disks (in append-only fashion). They also read and stream the requested
logs from the local disk upon the request of a client process at recovery time. In
Figure 3.1, the head, tail and middle nodes are log server processes.
• Stable Storage Unit or storage unit (SSU) provides stable storage of log entries. It
consists of a number of machines hosting two types of processes: a log server process
(one per machine) that replicates and stores logs, and a state builder process. The number of
machines is equal to the replication factor provided by the stable storage unit.
• Client process (writer/reader) processes requests (writes) from an application and
creates log entries. It streams the entries to an appropriate storage unit and responds to
the application. The client process also reads the logs and reconstructs the database state
at the time of failure (the read operation is future work).
• State builder processes are the background processes that read the logs from the local
disk to compute the latest value of each key. Once the values are computed, they are stored
on disk and the old logs are removed. The purpose of this process is to reduce the recovery
time by eagerly preparing the latest state of the key-value store. It takes place whenever the
system is not under stress (part of the future work).
Figure 3.1: System entities.
3.6.2 Coordination of Distributed Processes
Zookeeper is a coordination system for distributed processes (28). We use Zookeeper for
membership management and storing metadata about the server processes, storage units, and
client processes. Data in Zookeeper is organized in a tree structure, and each node of the tree is
called a znode. There are two types of znodes: ephemeral and permanent. An ephemeral znode
exists as long as the session of the Zookeeper client that created it is alive, so ephemeral znodes
can be used to detect the failure of a process. A permanent znode stores its data permanently
and ensures it remains available. We use this metadata and the Zookeeper membership service
to coordinate server processes for creating storage units and to detect failures. Client processes
also use the Zookeeper service to locate storage units and detect their failures. Below we describe
the metadata and the types of znodes used in our system; a registration sketch follows the list.
MetaData
• Log Server znode (ephemeral)
– IP/Port for coordination protocol
– IP/Port for streaming
– Rack: the rack in which the server is located
– Status: accept or reject storage join request
– Load status: updated resource utilization status
• Storage unit znode (ephemeral)
– Replication factor
– Status: accept/reject new clients
– List of log servers
– Load status: load of the log server with highest resource utilization
• File map znode (permanent)
– Mapping of logs to servers
• Client znode (ephemeral)
– Only used for failure detection
• Global view znode (permanent)
– List of servers and their roles (leader/follower) used to form stable storage units
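To make the znode layout concrete, the sketch below shows how a log server could register itself using the standard Zookeeper client API. The path layout and the status encoding are assumptions of this sketch, not the prototype's actual format; only the use of an ephemeral znode follows the design above.

import org.apache.zookeeper.*;
import java.nio.charset.StandardCharsets;

final class ServerRegistration {
    // Assumes the parent znode /servers already exists.
    static String register(ZooKeeper zk, String ip, int coordPort,
                           int streamPort, String rack) throws Exception {
        String status = String.join(",",
                ip + ":" + coordPort,  // IP/Port for the coordination protocol
                ip + ":" + streamPort, // IP/Port for streaming
                rack,                  // rack in which the server is located
                "ACCEPT",              // accept/reject storage join requests
                "load=0.0");           // resource utilization, updated periodically
        // Ephemeral: the znode disappears when the server's session dies,
        // which is how other processes detect its failure.
        return zk.create("/servers/" + ip + ":" + coordPort,
                status.getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}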
3.6.3 Server Components
A log server process creates an ephemeral znode in Zookeeper upon its start and it constantly
updates its status data at this node. This process follows a protocol that allows it to cooperate
with other processes to form a storage unit. We first describe this protocol, and then we explain
how an individual log server operates within a storage unit.
3.6.3.1 Coordination Protocol
This protocol is used to form a replication chain (storage unit) and operates very similarly
to two-phase commit (12). The protocol defines two roles for servers: leader and follower. The
leader is responsible for contacting the followers and manages the creation of a storage unit.
Followers act as passive processes and only respond to the leader. Figures 3.2 and 3.3 describe
the state transitions of the leader and the followers.
If a server process is not part of a storage unit, it sets its state to the listening state, in which
it frequently checks the global view data. If the process is listed as a leader, it reads the list of
its followers' addresses, sends the followers a join-request message and sets a failure detector
(a watch flag on their ephemeral znodes) to detect their failures.
Figure 3.2: Leader states.
Figure 3.3: Follower states.
Followers are able to accept or reject the join request depending on their available resources.
If a follower fails or rejects the request, the leader triggers the abort process and all processes
return to their initial state. In order to abort, the leader sends an abort message to all the
followers. Upon receiving an abort message, each follower (and the leader itself) cleans all its
data structures and returns to the initial state (Figure 3.3).
Each follower sets a failure detector for the leader before accepting the join request, so that
in case of a leader failure it can detect the failure and resume its previous state. If all
the followers accept the join request, the leader sends a connection-request message carrying an
ordered list of servers (including the leader). Each server connects to the previous and next
servers in the list as its predecessor and successor in the chain. Once a server is connected and
ready to stream data, it sends a connect-completion signal to the leader. If a server fails to connect
or crashes, the leader aborts the process. Otherwise, a complete chain of servers is ready; the
leader creates a znode for the new storage unit and sends a start signal, along with the znode
path of the storage unit, to all the followers to start the service (Figure 3.2). The sketch below
summarizes the leader's side of this flow.
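The compact sketch below summarizes the leader's side of this two-phase flow. The Follower interface and the synchronous calls behind it are assumptions standing in for the Zookeeper watches and TCP messages described above; creating the storage-unit znode is elided.

import java.util.List;

final class ChainLeader {
    interface Follower {
        boolean join();                      // phase 1: join-request, accept/reject
        boolean connect(List<String> chain); // phase 2: connect to predecessor/successor
        void abort();
        void start(String storageUnitZnodePath);
    }

    // Returns true if a complete chain was formed, false if the process aborted.
    static boolean form(List<Follower> followers, List<String> chainOrder,
                        String znodePath) {
        // Phase 1: every follower must accept the join request.
        for (Follower f : followers)
            if (!f.join()) { followers.forEach(Follower::abort); return false; }
        // Phase 2: every server must connect to its neighbours and report completion.
        for (Follower f : followers)
            if (!f.connect(chainOrder)) { followers.forEach(Follower::abort); return false; }
        // Complete chain: publish the storage-unit znode path and start the service.
        for (Follower f : followers) f.start(znodePath);
        return true;
    }
}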
3.6.3.2 Concurrency
Each server process consists of three main threads operating concurrently according to a
producer-consumer model. Figure 3.4 shows how the threads in a single server operate and interact through
shared data structures.
The three data structures shared among the threads are:
• DataBuffer stores the log entries in memory.
• SenderQueue keeps the ordered index of the log entries that should be either sent to the
next server (head or middle server) or acknowledged to client (tail server).
• PersistQueue holds an ordered index of log entries that should be written to disk.
The receiver thread reads entries from the TCP buffer and inserts them into the DataBuffer;
it also inserts the index of each entry into the SenderQueue. The DataBuffer has a pre-specified
size (in number of entries), and if the DataBuffer is full, the receiver thread must wait until an
entry is removed from it.
Figure 3.4: Log server operation.
The sender thread waits until an index exists in the SenderQueue. It reads the index from
the SenderQueue to locate and read the corresponding entry in the DataBuffer. If the server
is the tail for the entry, it sends an acknowledgment to the corresponding client, indicating
that the entry has been replicated on all the servers. If the server is not the tail, it simply sends
the entry to the next server in the chain (its successor). Once the message is sent to the next
hop, the sender thread puts the index of the entry in the PersistQueue.
The persister thread waits until an index exists in the PersistQueue. It reads the corresponding
entry from the DataBuffer and persists it to disk in append-only fashion. This is the only
thread persisting entries, and the entries from different clients are interleaved into a single file.
Once an entry is written to disk, it is removed from the DataBuffer by this thread. The sketch
below illustrates this pipeline.
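The sketch below renders this three-thread pipeline with standard java.util.concurrent primitives. The hash map mirrors the hashing-based DataBuffer of the prototype, the semaphore enforces its bounded capacity, and the send/persist stubs stand in for the Netty and file I/O machinery.

import java.util.concurrent.*;

final class LogServerPipeline {
    static final int BUFFER_CAPACITY = 100_000;
    final ConcurrentHashMap<Long, byte[]> dataBuffer = new ConcurrentHashMap<>();
    final BlockingQueue<Long> senderQueue  = new ArrayBlockingQueue<>(BUFFER_CAPACITY);
    final BlockingQueue<Long> persistQueue = new ArrayBlockingQueue<>(BUFFER_CAPACITY);
    final Semaphore freeSlots = new Semaphore(BUFFER_CAPACITY); // blocks receiver when full

    // Receiver thread: read an entry off the wire, buffer it, index it.
    void onReceive(long index, byte[] entry) throws InterruptedException {
        freeSlots.acquire();          // wait while the DataBuffer is full
        dataBuffer.put(index, entry);
        senderQueue.put(index);
    }

    // Sender thread: forward to the successor (head/middle) or ack the client (tail).
    void senderLoop(boolean isTail) throws InterruptedException {
        while (true) {
            long index = senderQueue.take();
            sendDownstreamOrAck(dataBuffer.get(index), isTail);
            persistQueue.put(index);  // hand over for asynchronous persistence
        }
    }

    // Persister thread: append to disk, then remove the entry and free its slot.
    void persisterLoop() throws InterruptedException {
        while (true) {
            long index = persistQueue.take();
            appendToDisk(dataBuffer.get(index));
            dataBuffer.remove(index);
            freeSlots.release();
        }
    }

    void sendDownstreamOrAck(byte[] entry, boolean isTail) { /* Netty channel write */ }
    void appendToDisk(byte[] entry) { /* append-only file write */ }
}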
3.6.4 Stable Storage Unit (SSU)
A stable storage unit consists of a set of servers forming a replication chain. It ensures the
replication and availability of delivered entries. One of the servers acts as a leader and holds
the lease on the znode of the storage unit. The storage unit is considered failed when one or
more of its servers crash. Upon a crash, the storage unit stops serving and only persists the
entries remaining in memory to disk. When the leader crashes, the znode is removed
automatically; when another server fails, the leader removes the znode. Either way, all
clients are notified of the storage failure.
In a storage unit, every server can act as head, tail or middle server. Clients can connect to
any server in the chain. The server acting as the entry point is the head of the chain for that
client, and the last server in the chain (which sends the acknowledgment) is the tail. Figure 3.5
shows how several clients can stream to the storage unit.

Figure 3.5: Storage unit.
3.6.5 Load Balancing
We make load balancing decisions at three points.
• We ensure that the set of servers selected to create a storage unit (perform chain replication)
has nearly the same load. This minimizes the chance of a bottleneck appearing in
the chain and maximizes the resource utilization of each server. Figure 3.6 shows how servers
are clustered to form a storage unit.
• One of the servers within the storage unit constantly updates the available resources and
status of the storage unit in its Zookeeper znode. This enables clients to select the storage
unit with the lowest load by reading this data from the Zookeeper servers.
• In a storage unit, every server can act as head, tail or middle server. The tail consumes less
bandwidth, since it only sends acknowledgments to the client, while the head and middle servers
must also transfer each entry to the next server in the chain. Hence, if all the clients chose
the same server as the head, the tail server would consume half the bandwidth of the
rest of the servers. To mitigate this imbalance, the clients of one storage unit connect to different
servers. In the current implementation, clients randomly choose one server in the storage
unit as the head. However, this could be improved by connecting clients to different servers
in a round-robin fashion, so that every server serves a nearly equal number of clients as the
head (and consequently as the tail). Figure 3.5 also shows the distribution of load through
the selection of different head servers by the clients.
Figure 3.6: Clustering decision based on the servers' available resources.

The decision to cluster the servers is based on their available resources. One of the log
server processes is in charge of compiling the servers' data and making the clustering decisions.
It reads the servers' data from Zookeeper and sorts the available servers by their free resources.
Using this information, servers with similar amounts of available resources are grouped, and
the server with the largest amount of free resources in each group is chosen as the leader. This
process is performed frequently, and the output is written to the "Global View" znode in Zookeeper.
Each process frequently reads this data to determine its group and its role. Upon a crash of the
process responsible for updating the "Global View" znode, another process takes over the job.
Our current implementation does not provide dynamic load balancing; this is part of the future
work. A minimal version of the clustering step is sketched below.
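ServerInfo and the scalar load metric in the sketch are assumptions; grouping adjacent servers in the sorted order approximates "similar amounts of available resources".

import java.util.*;

final class Clusterer {
    record ServerInfo(String id, double freeResources) {}
    record Group(ServerInfo leader, List<ServerInfo> followers) {}

    static List<Group> cluster(List<ServerInfo> servers, int replicationFactor) {
        List<ServerInfo> sorted = new ArrayList<>(servers);
        // Most free resources first, so neighbours have similar loads.
        sorted.sort(Comparator.comparingDouble(ServerInfo::freeResources).reversed());
        List<Group> groups = new ArrayList<>();
        for (int i = 0; i + replicationFactor <= sorted.size(); i += replicationFactor) {
            List<ServerInfo> g = sorted.subList(i, i + replicationFactor);
            // The first server of each group has the most free resources: the leader.
            groups.add(new Group(g.get(0), List.copyOf(g.subList(1, g.size()))));
        }
        return groups;
    }
}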
3.6.6 Failover
In the event of a storage unit failure (of any of its servers), the clients of that unit are able
to detect the failure and find another storage unit by querying Zookeeper. The client connects
to another storage unit (if one is available) to continue writing its logs. An alternative that
shortens the service disruption is to let the client hold connections to two storage units; upon
the crash of the one it is writing to, it immediately resumes operation by switching to the
other storage unit. A sketch of the detection mechanism follows.
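The sketch below illustrates the client-side detection with a one-shot Zookeeper watch on the storage unit's znode, which, as described above, disappears when the unit fails; the reconnect hook is an assumption of this sketch.

import org.apache.zookeeper.*;

final class FailoverWatcher implements Watcher {
    private final ZooKeeper zk;
    private final Runnable reconnect; // e.g. query Zookeeper and pick another unit

    FailoverWatcher(ZooKeeper zk, Runnable reconnect) {
        this.zk = zk;
        this.reconnect = reconnect;
    }

    void watchStorageUnit(String unitZnodePath) throws Exception {
        zk.exists(unitZnodePath, this); // one-shot watch on the unit's znode
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            reconnect.run(); // switch to another storage unit, if one is available
        }
    }
}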
Figure 3.7: Failover.
3.6.7 API
The following APIs can be used by an application (a memory-based datastore) to write into
the stable storage system.
• connect(replicationFactor); finds a storage unit with the specified replication factor
and randomly connects to one of the servers within that storage unit.
• addEntry(entry); submits an entry to the storage unit and blocks when the number of
unacknowledged entries reaches the specified window size.
• setWindowSize(windowSize); sets the number of entries that can be sent without receiving
acknowledgments. Setting the window size to one is equivalent to synchronous
communication (the default window size is one).
• close(); closes the connection to the service. A usage sketch follows the list.
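A hypothetical use of this API from the write path of a memory-based datastore might look as follows; the StableStorageClient class name and the entry encoding are placeholders, while the four calls mirror the API above.

public class WriteAheadExample {
    public static void main(String[] args) throws Exception {
        StableStorageClient client = new StableStorageClient(); // placeholder name
        client.connect(3);        // find a unit with replication factor three
        client.setWindowSize(10); // allow ten unacknowledged entries in flight
        for (int i = 0; i < 1000; i++) {
            byte[] entry = ("SET key" + i + " value" + i).getBytes();
            client.addEntry(entry); // blocks once ten entries are outstanding
        }
        client.close();
    }
}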
3.7 Implementation
We have implemented a prototype of the storage system in Java; the source code is
publicly available in a github repository (34). For network communication, we use Netty
(35), an asynchronous event-driven network application framework, and for efficient serializa-
tion/deserialization, we use Protocol Buffers (36). We have implemented a cluster of stable
storage units (SSUs) and an SSU client with a failover mechanism. Parameters such as
replication factors, buffer size and client window size are configurable. We have not implemented read
operations and recovery, and they are part of the future work.
4 Experimental Evaluation
This chapter presents a set of experimental evaluations. All the experiments are conducted
on a set of identical machines with the following specifications:
• Two Intel(R) Xeon(R) L5420 processors, 2.50 GHz, 4 cores each
• SATA drive of 1 TB with spin speed of 7200 RPM
• 1 Gb/s network interface
• DRAM of 16 GB
Each experiment has a specific configuration that is explained in the respective section. We
begin with an experiment that enables us to estimate the expected network latency of chain
replication and atomic broadcast. Then, we evaluate our stable storage in terms of throughput
and latency. We show the system's behavior under a sustained entry rate, and finally we compare
the performance of our storage with that of a hard disk, the most common means of storage.
4.1 Network Latency
The goal of this experiment is to determine the lowest possible latency that can be achieved
using chain replication and atomic broadcast. We account only for network delay in our computation,
as the processing time at each node depends on the implementation. We measure
the average transfer time for different entry sizes between two machines located within the same
rack, and between machines in different racks of a datacenter. We conducted this experiment
before the design stage to inform our design decisions.
To estimate the network latency of each replication technique, we first measure the transfer
time from one machine to another. Each latency cell in Table 4.1 contains two numbers
(first/second). The first number is the elapsed time from the point at which the application
starts writing to the socket until it receives an acknowledgement from the receiver (the receiver
sends the acknowledgement after reading and discarding the entry). The second number excludes
the time for the acknowledgement from the receiver to the sender. To compute the second
number, we need to know the transmission time of an acknowledgement message; we estimate
it as half the round-trip time of an ICMP packet pinged from one server to the other. The
transmission time of a 50-byte ICMP packet between two servers within the same rack is 45
microseconds, and between servers in different racks it is 110 microseconds.

Size     Latency within one rack (µs)    Latency in different racks (µs)
256 B    107/62                          240/130
512 B    138/93                          285/175
1 KB     178/133                         357/247
Table 4.1: RPC latency for different packet sizes within a datacenter.
Using the latencies measured in Table 4.1 and the estimated time for an acknowledgement,
we now analytically compute the expected network latency of chain replication and atomic
broadcast. Our computation assumes a replication factor of three and a packet size of 256 bytes
(the typical write query in social network platforms is limited to a few hundred bytes; for
example, Twitter limits tweets to 140 characters (33)). For chain replication, we assume the tail
is in a different rack; for broadcast, one of the slaves resides in a different rack from the other
servers. In a chain of three servers, an entry is sent from the client to the head server (62µs);
the head server then reads the entry and sends it to the middle server (62µs). Similarly, the
middle server reads the entry and sends it to the tail (130µs). Finally, the client receives an
acknowledgement from the tail (110µs). In total, chain replication with a replication factor of
3 (spanning 2 racks) and an entry size of 256B yields a latency of 364 microseconds (excluding
the computation time at each node).
In atomic broadcast, an entry is first transferred to the master (62µs), and the master then
broadcasts the entry to the slaves (130µs). Once the master has received acknowledgements from
all the slaves (110µs), it acknowledges the message to the client (45µs). This yields a total
latency of 347 microseconds.
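Using the one-way times from Table 4.1 and the measured acknowledgement times, the two estimates can be summarized as:

\begin{align*}
T_{\mathrm{chain}} &= 62 + 62 + 130 + 110 = 364~\mu\mathrm{s}
  && \text{(client${}\to{}$head, head${}\to{}$middle, middle${}\to{}$tail, tail ack)}\\
T_{\mathrm{broadcast}} &= 62 + 130 + 110 + 45 = 347~\mu\mathrm{s}
  && \text{(client${}\to{}$master, broadcast, slave acks, client ack)}
\end{align*}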
The network latency of chain replication and atomic broadcast is nearly the same for replica-
tion factor of three. However, the bandwidth utilization of the master node in atomic broadcast
is 50% higher, since the master receives an entry and sends it to two slaves. For higher replication
factors, the master node can easily become a bottleneck, as it consumes more bandwidth and
CPU power to transmit the packets than any node does under chain replication. In chain
replication, every node consumes a similar amount of resources, and increasing the replication
factor only increases the latency without impacting the throughput.
4.2 Stable Storage Performance
As explained in Chapter 3, our stable storage consists of several stable storage units. The
following experiments evaluate the performance of a single stable storage unit. In all of them,
we set TCP_NODELAY to true, to ensure TCP does not batch small packets and that they are
sent immediately; this lets us measure the actual performance of the system for small packets.
In the experiments, large chunks of entries are periodically written to disk, so the maximum
disk bandwidth is utilized. This prevents the disk from becoming a bottleneck (the disk
bandwidth is higher than the available network bandwidth used by our servers).
To measure the performance, we use a single client and increase the load on the storage
unit by increasing the client window size, from a window size of one (synchronous) up to one
hundred outstanding entries. The results are averaged over three runs of 200,000 write operations
each. All the following experiments are performed within a single rack, with no cross-rack
traffic; placing servers in different racks would add a constant delay to the latency.
4.2.1 Impact of Log Entry Size
The objective of this experiment is to measure the performance of our stable storage unit
and the performance that a single client can expect from the stable storage.
In order to measure the lowest latency a client can expect from the system, we run an
experiment using a single client with a window size of one and a replication factor of three. As
can be seen in Table 4.2, a client can write entries ranging from 200 bytes to 4 KB with a latency
of less than one millisecond. Figure 4.1 shows the performance of the storage unit for different
log entry sizes as the window size increases. For an entry size of 200 bytes, throughput increases
up to 34600 entries/sec, and for 4 KB entries throughput increases up to 7200 entries/sec.
Moreover, depending on the throughput, latencies vary from 0.45 ms to 2.5 ms for 200-byte
entries and from 1 ms to 3 ms for 4 KB entries.
The results indicate that for smaller log entries (a few bytes to hundreds of bytes),
performance does not vary significantly, but larger packet sizes notably degrade both throughput
and latency. For a packet size of 200 bytes at its highest throughput, 34600 entries/sec, the
CPU utilization of each server rises to nearly 300 percent (three threads running on each
server), while the network bandwidth utilization of each server stays below 100 Mb/s. This shows
that our system is CPU-bound for small packet sizes, due to the high overhead of packet processing.
Another contributor to the high CPU usage is the way the buffer has been implemented:
the current implementation uses hashing to locate entries and does not pre-allocate memory. A
more sophisticated buffer implementation would likely increase the throughput.
For larger packet sizes (4 KB and 16 KB), the CPU utilization comes close to the saturation
point, although it does not become fully saturated. The reason is that our system is network-bound
for such large entries. At the highest throughput for 4 KB entries, 7900 entries/sec, the
network bandwidth usage of each server (except the tail) exceeds 500 Mb/s. This is the highest
network bandwidth each of our servers can utilize with a single thread writing to a single
channel. In the current implementation, servers are connected by a single socket connection,
and a single connection is not able to utilize the full bandwidth (37). To fully utilize the
available bandwidth, we would need to increase the number of concurrent connections between
the servers; however, this would cause the servers in a chain to receive and persist messages in
different orders, and might complicate the read operation. Another observation from Figure 4.1
is that for large entry sizes, throughput in terms of the number of entries is lower, although
the network bandwidth is utilized more efficiently. To benefit from this property of the system,
a client can batch small entries before sending them to the storage unit. This also lowers the
overhead of TCP/IP packet headers (at the cost of higher latency).
Entry size (bytes)    Throughput (entries/sec)    Latency (ms)
5                     2440                        0.39
200                   2200                        0.45
1024                  1600                        0.62
4096                  971.52                      0.99
Table 4.2: Latency and throughput for a single client synchronously writing to a stable storage unit.

Figure 4.1: Throughput vs. latency of our stable storage unit for different entry sizes, with a replication factor of three.

4.2.2 Impact of Replication Factor
This experiment determines the behavior of the system under different replication factors
and the additional latency caused by an extra replica. Figure 4.2 shows the impact of differing
replication factors on throughput and latency. As can be seen from the figure, increasing the
replication factor from two to three does not impact the throughput but only the latency. For
an entry size of 200 bytes and throughputs ranging from 5000 to 34000 entries/sec, the latency
difference lies between 100µs and 350µs. This behavior is expected, since in chain replication
each server handles the same load regardless of the chain's length; thus, the number of servers
in a chain does not impact the throughput (as long as none becomes a bottleneck). Adding an
extra server to a chain only increases the latency of a request by a sub-millisecond amount, since
the distance of the tail from the client increases. A larger number of replicas provides higher
reliability and availability, as well as higher read throughput. These benefits come at a
sub-millisecond additional latency and without degrading the write throughput.
Figure 4.2: Throughput vs. latency for a stable storage unit with replication factors of two and three, for a log entry size of 200 bytes.
4.2.3 Impact of Persistence on Disk
This experiment shows how the system benefits from asynchronous logging to improve
performance. Figure 4.3 shows how our asynchronous approach eliminates the impact of the
disk's low performance on the write performance of stable storage units. The red line shows
the storage unit's performance when it only performs in-memory chain replication, meaning that
entries are not persisted on disk and every server removes a log entry from its memory once it
has been sent to the next hop. The blue line shows the storage unit's performance when it removes
entries only after persisting them to disk. As can be seen, the storage unit's performance is
identical in both cases. This also indicates a correct implementation of asynchronous logging,
as disabling persistence on disk does not impact latency. Nevertheless, this high performance
lasts only as long as none of the servers' buffers in the chain fills up. The next experiment
investigates the system's behavior when the servers' buffers are full.

Figure 4.3: Throughput vs. latency of a stable storage unit for log entries of 200 bytes, with persistence to local disk enabled and disabled.
4.3 Load Test
Asynchronous logging of entries provides a significant performance gain. However, under a
sustained load with an average throughput higher than the hard disk's write throughput, the buffer
eventually fills up and the performance drops to that of the slowest disk in the chain.
In this experiment, we explore the behavior of our system in such a situation.
To conduct the experiment, the replication factor of the storage unit is set to three, each
server has a buffer size of 100,000 entries, and writes are forced to disk after every five writes.
As a side note, in the current implementation the buffer capacity is defined in terms of the
number of entries, and this number is configurable. This is fine for prototyping purposes; in a
more sophisticated system, however, this limit should be imposed on the number of bytes.
Figures 4.4 and 4.5 illustrate how the latency and throughput of the system change when
the buffer fills up under sustained load. The horizontal axis indicates the number of entries
sent to the storage unit, starting from a point in time before the buffer is saturated; it can be
thought of as the passage of time. As can be seen in Figure 4.4, when the buffer is filled the
throughput drops dramatically: for entry sizes of 200 bytes, 1 KB and 4 KB, it drops from
11100, 8770 and 5540 entries/sec to 2741, 2328 and 1479 entries/sec, respectively. The latter
throughput numbers (after the buffer is saturated) nearly equal the throughput of the hard disk
for entries whose size equals the total size of a batch (each batch consists of five entries in
this experiment). Figure 4.5 illustrates a similar effect on latency: for entry sizes of 200 bytes,
1 KB and 4 KB, latency increases from 0.75, 1.17 and 3.01 ms to 3.01, 4.02 and 15.10 ms once
the buffer is filled. The reason for this significant rise in latency after the buffer saturates is
that an entry can be written to the buffer only when another entry (the head of the PersistQueue)
is written to disk and removed from the buffer. In short, the write latency of the stable
storage unit converges to the latency of the hard disk once the buffer is filled.
Figure 4.4: Throughput of stable storage unit under sustained load.
Figure 4.5: Latency of stable storage unit under sustained load.
4.4 Durability and Performance Comparison
We described several existing systems and approaches to durability in Section 2.2. One
common approach is to persist log entries on a local hard disk (as Redis does). To ensure every
entry is written to disk, we need to force a sync upon every write. However, even this does not
guarantee the durability of data on commodity hardware in all failure cases: hard disks have a
cache layer that buffers the data and writes it to the platters periodically. Data written to the
disk's buffer survives a process crash; however, it is lost in the case of a power outage or hardware
failure. To guarantee the persistence of data in all cases, the disk cache must be disabled, but
this is not acceptable, as disk performance then drops significantly. Using a redundant array
of inexpensive disks (RAID), write/read throughput can be improved (38) and disk failures can
be tolerated (e.g., RAID level 1), but this still does not address the power-outage problem.
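For reference, the sketch below shows an append-only write path with a periodic forced sync, matching the load test of Section 4.3 (a sync after every five writes). Note that FileChannel.force() flushes operating-system buffers; as discussed above, data held in the disk's own cache can still be lost on power failure unless that cache is disabled.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class AppendOnlyLog implements AutoCloseable {
    private final FileChannel channel;
    private final int forceEvery;
    private int pending = 0;

    AppendOnlyLog(Path file, int forceEvery) throws Exception {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        this.forceEvery = forceEvery;
    }

    void append(byte[] entry) throws Exception {
        channel.write(ByteBuffer.wrap(entry));
        if (++pending >= forceEvery) { // batch several appends per sync
            channel.force(false);      // false: sync data only, skip metadata
            pending = 0;
        }
    }

    @Override
    public void close() throws Exception {
        channel.force(false);
        channel.close();
    }
}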
Our stable storage persists entries to disk and, in order to tolerate process crashes and
power failures, replicates the entries on several servers to guarantee the durability of every
entry.
To establish a baseline, we conducted an experiment to measure the performance of a
hard disk in terms of the latency of append-only operations. Using RAID, throughput can be
multiplied, but latency does not improve. Figure 4.6 compares the latency of our stable storage
unit with that of the hard disk. As can be seen, for an entry size of 200 bytes the storage unit
outperforms the disk by a factor of four even with the disk cache enabled, and by a factor of
two for entry sizes of 1 KB and 4 KB. Disabling the disk cache increases the disk latency from
2 ms to almost 50 ms; with the cache disabled, the disk performs dozens of times slower than
the stable storage unit. Moreover, our stable storage provides higher reliability and availability
by replicating data on several remote servers. This can also be used to increase the read
throughput by aggregating the bandwidth of the servers hosting the replicas, hence accelerating
the recovery process.
Figure 4.6: Performance comparison of stable storage unit and hard disk.
5 Conclusions
5.1 Conclusions
Memory-based databases outperform disk-based databases, but their superior performance
is hindered by the volatility of DRAM and the consequent loss of data at the time of a failure.
Existing techniques to provide data durability, such as checkpointing and write-ahead logging to
hard disk, either do not guarantee the persistence of all data or result in significant performance
degradation. Moreover, full in-memory state replication of databases becomes costly for large
deployments. We have proposed an approach to provide durability to memory-based key-value
stores by creating a high-performance stable storage system. The system consists of a set of
stable storage units capable of storing a log entry for every write operation to a memory database,
with low response time. Each stable storage unit ensures the persistence of each log entry by
replicating the entry in the memory of several servers prior to acknowledging the operation.
Entries are stored in the memory of each server and asynchronously written to disk in append-only
fashion. We apply chain replication by forming a pipeline of servers, achieving high write
throughput. Our stable storage units follow the fail-stop model and, in case of a failure, their
clients simply switch to another storage unit without significant disruption of operation.
The evaluation results demonstrate that our stable storage system can provide durable
writes with low latency while maintaining high data availability. For log entry sizes of 200 bytes
to 4 KB and a replication factor of three, we achieve write latencies of less than one millisecond,
with a maximum throughput of 7900 entries/sec for 4 KB entries and 34600 entries/sec for
200-byte entries. An additional replica can be added to a stable storage unit with a negligible
increase in latency and no impact on system throughput. The evaluation results also show
that, in terms of latency, our approach outperforms conventional write-ahead logging to disk.
We believe our solution enables memory-based databases to achieve durability while preserving
their high performance.
5.2 Future Work
The current implementation supports the write operation to stable storage; the design and
implementation of the read operation, for recovery purposes, is part of the future work. There
are several possible extensions to the current system. Rack-awareness can be incorporated
into the server clustering decisions (used to create stable storage units) to enhance the reliability
and availability of the storage, as well as to provide more efficient bandwidth utilization. The
current load-balancing decisions are made before the creation of storage units and through the
distribution of clients amongst the stable storage units; an extension would be to enable the
system to adapt to changes in server load when servers share resources. Another extension is
a self-healing mechanism for the stable storage unit, such that a failed server within a storage
unit can be replaced and the service resumed. Lastly, we are exploring memory databases that
could benefit from our system to gain durability while maintaining performance.
References
[1] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazieres, S. Mi-
tra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and
R. Stutsman, “The case for ramclouds: scalable high-performance storage entirely in
dram,” SIGOPS Oper. Syst. Rev., vol. 43, pp. 92–105, Jan. 2010.
[2] “Apache BookKeeper presentation - apache software foundation.”
https://cwiki.apache.org/confluence/display/bookkeeper/BookKeeper+presentations,
June 2012.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,”
in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on,
pp. 1–10, 2010.
[4] D. Ongaro, S. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, “Fast crash recov-
ery in ramcloud,” in Proceedings of the Twenty-Third ACM Symposium on Operating
Systems Principles, pp. 29–41, ACM, 2011.
[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra,
A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured
data,” ACM Trans. Comput. Syst., vol. 26, pp. 4:1–4:26, June 2008.
[6] “Redis: Remote dictionary server.” http://redis.io, June 2012.
[7] “Couchbase.” http://www.couchbase.com/, June 2012.
[8] “memcached - a distributed memory object caching system.” http://memcached.org/, June
2012.
[9] J. Sobel, “Scaling out.” http://www.facebook.com/note.php?note_id=23844338919, June
2012.
[10] H. Lu, Y. Y. Ng, and Z. Tian, “T-tree or b-tree: Main memory database index structure
revisited,” in Database Conference, 2000. ADC 2000. Proceedings. 11th Australasian,
pp. 65–73, 2000.
[11] H. Garcia-Molina and K. Salem, “Main memory database systems: An overview,” Knowl-
edge and Data Engineering, IEEE Transactions on, vol. 4, no. 6, pp. 509–516, 1992.
[12] A. S. Tanenbaum and M. Van Steen, Distributed systems: principles and paradigms.
Second ed., 2002.
[13] J. Duell, “The design and implementation of berkeley lab’s linux checkpoint/restart,” 2005.
[14] L. Alvisi and K. Marzullo, “Message logging: Pessimistic, optimistic, causal, and optimal,”
Software Engineering, IEEE Transactions on, vol. 24, no. 2, pp. 149–159, 1998.
[15] X. Defago, A. Schiper, and P. Urban, “Total order broadcast and multicast algorithms:
Taxonomy and survey,” ACM Computing Surveys (CSUR), vol. 36, no. 4, pp. 372–421,
2004.
[16] R. van Renesse and F. B. Schneider, “Chain replication for supporting high throughput and
availability,” in Proceedings of the 6th conference on Symposium on Opearting Systems
Design & Implementation - Volume 6, OSDI’04, (Berkeley, CA, USA), pp. 7–7, USENIX
Association, 2004.
[17] “DRAM and memory system trends.” www.research.ibm.com/ismm04/slides/woo.pdf,
June 2012.
[18] J. Dean, “Large-scale distributed systems at google: Current systems and future directions,”
2009.
[19] A. Jacobs, “The pathologies of big data,” Communications of the ACM, vol. 52, no. 8,
pp. 36–44, 2009.
[20] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron, “Everest: Scaling
down peak loads through I/O off-loading,” in Proceedings of the 8th USENIX conference
on Operating systems design and implementation, pp. 15–28, 2008.
[21] “Introduction to NetApp infinite volume.” http://www.netapp.com/templates/mediaView?m=tr-
4037.pdf&cc=us&wid=159193675&mid=78187885, June 2012.
[22] “HP smart array controller technology.” http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00687518/c00687518.pdf.
[23] “Documentation of redis.” http://redis.io/documentation, June 2012.
[24] “Apache BookKeeper.” http://zookeeper.apache.org/bookkeeper/, June 2012.
[25] “Replication - MongoDB.” http://www.mongodb.org/display/DOCS/Replication, June
2012.
[26] “Redis persistence a redis.” http://redis.io/topics/persistence, June 2012.
[27] “HBase - Hdfs Sync Support - hadoop wiki.” http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport,
June 2012.
[28] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, “Zookeeper: wait-free coordination
for internet-scale systems,” in Proceedings of the 2010 USENIX conference on USENIX
annual technical conference, USENIXATC’10, (Berkeley, CA, USA), pp. 11–11, USENIX
Association, 2010.
[29] B. Reed and F. P. Junqueira, “A simple totally ordered broadcast protocol,” in Proceedings
of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, LADIS ’08,
(New York, NY, USA), pp. 2:1–2:6, ACM, 2008.
[30] P. Saab, “Scaling memcache at facebook.” http://www.facebook.com/note.php?note_id=39391378919,
Dec. 2008.
[31] M. Kwiatkowski, “Memcache at facebook.” qcontokyo.com/, 2010.
[32] S. Ding, K. Lai, and D. Wang, “A study on the characteristics of the data traffic of online
social networks,” in Communications (ICC), 2011 IEEE International Conference on,
pp. 1–5, 2011.
[33] J. K. Todd Fast, “Twitter data - a simple, open proposal for embedding data in twitter
messages - home.” http://twitterdata.org/, June 2009.
[34] K. Rezahanjani, “Source code - high performance stable storage.”
https://github.com/Kiarashrezahanjani/in-memory-chain-replication/tree/complete,
June 2012.
[35] “Netty - the java NIO client server socket framework - JBoss community.”
http://www.jboss.org/netty, June 2012.
[36] “Protocol buffers - google’s data interchange format.” http://code.google.com/p/protobuf/,
June 2012.
[37] “Performance comparison between NIO frameworks.” http://gleamynode.net/articles/2232/,
Oct. 2008.
[38] D. Patterson, P. Chen, G. Gibson, and R. Katz, “Introduction to redundant arrays of inex-
pensive disks (raid),” in COMPCON Spring’89. Thirty-Fourth IEEE Computer Society
International Conference: Intellectual Leverage, Digest of Papers., pp. 112–117, IEEE,
1989.