Can we design an efficient Byzantine Fault Tolerant mechanism for a cloud computing environment?
Karan Chhabra
Submitted as part of the requirements for the degree
of MSc in Cloud Computing
at the School of Computing,
National College of Ireland
Dublin, Ireland.
April 2015
Supervisor Dr. Adriana Chis
Abstract
The substantial advancements in Information Technology (IT) over the last century
have given rise to a vision of computing becoming the 5th utility, after water, gas,
electricity and telephony. To deliver this vision, various computing paradigms have
been proposed, of which Cloud computing is the latest. Cloud computing has benefits
of its own, including on-demand service and pay-per-use pricing, but along with these
benefits come various challenges.
One of the major research challenges in Cloud Computing is to ensure the reliability and
availability of the resources it provides. This is only possible if the cloud computing
environment is not prone to faults and if a proper fault tolerant system or mechanism is
in place to handle them. The need becomes even stronger with the introduction of
federated computing, a type of computing in which customers using cloud services can
scale applications across many domains. Building such reliable clouds is an area of
concern because of faults such as byzantine faults, which are arbitrary in nature.
This research work focuses on designing an efficient Byzantine Fault Tolerant
mechanism, BFS (Byzantine Fault Solution), for a cloud computing environment by
creating a robust Fault Tolerant (FT) system that prevents byzantine faults in that
environment. This research problem is essential because arbitrary, or byzantine, faults
can occur in any running process within any cloud computing environment at any point
in time, and a proper fault tolerant mechanism is required to prevent or overcome them.
Keywords: Cloud computing, Fault tolerance, Byzantine faults, Byzantine fault
tolerance.
Contents
Abstract
1 Introduction
2 Background
2.1 Cloud computing
2.2 Fault Tolerance: A challenge in cloud computing
2.3 Byzantine Fault Tolerance (BFT)
2.4 Current Byzantine Fault Tolerance (BFT) work in cloud
2.5 Hypothesis
2.6 Expected Contribution
Bibliography
Chapter 1
Introduction
Cloud computing is a term used to describe a category of on-demand (pay-as-you-go)
computing services, which were first offered by providers such as Google, Amazon and
Microsoft. Cloud computing acts as a model in which the computing infrastructure
is viewed as a cloud through which individuals access applications on demand. Cloud
computing is considered a form of utility computing and has enabled computing
infrastructures that are highly scalable and flexible. Such highly scalable compute
resources have led to substantial capital savings for both consumers and service
providers.
Before we proceed further to the research problem, it is important to understand the
concepts of cloud computing along with the different areas of cloud computing related
to the research problem.
In Cloud computing, the major emphasis is on providing computing as a service, i.e.
on demand. This is achieved by providing a virtualized infrastructure consisting of data
centers that are maintained and monitored by content providers. The services offered
by cloud computing vendors are based on the fundamental models of cloud computing:
IaaS (Infrastructure as a Service), SaaS (Software as a Service) and PaaS (Platform
as a Service). Cloud computing stands out from other computing paradigms such
as grid computing, utility computing, mainframe computing and peer-to-peer computing
because of characteristics such as agility, Application Programming Interface (API)
access, cost, device and location independence, maintenance, multitenancy, performance,
productivity, reliability, scalability, elasticity and security.
Cloud Computing provides benefits such as cost efficiency, on-demand services and
multitenancy, but some risks are associated with it. One of the major research challenges
in Cloud Computing is to ensure the reliability and availability of the resources it
provides, which is only possible if the cloud computing environment is not prone to faults
and if a proper fault tolerant system or mechanism is in place to handle them. There is
therefore a requirement for a robust Fault Tolerant (FT) system in Cloud Computing.
Faults can be of many types, but this research problem focuses on byzantine
faults, which can occur at any time and can destroy an ongoing process in a cloud
computing environment. This document also focuses on the measures needed to overcome
this problem in order to provide good quality of service.
[9] say that a fundamental problem in distributed systems is constructing
robust network services that can withstand a wide range of failure types. The most
general approach to masking arbitrary failures is byzantine fault tolerance, but
it is considered too expensive to deploy in practice, and many solutions are not resilient
to performance attacks. According to [1], a major requirement for clients dealing
with clouds is data security, as clouds may fail due to faults in the software or
hardware, or due to attacks from malicious insiders. Hence, the construction of a
highly reliable and consistent cloud system has become a vital research requirement.
According to [19], cloud computing is becoming a highly popular and efficient solution
for constructing dependable applications on dispersed resources. However, assuring the
dependability of applications within the system is a critical challenge because of
the very dynamic environment. This document discusses the problems caused by
byzantine faults in cloud computing environments, along with the different approaches
to overcoming these faults by applying byzantine fault tolerant mechanisms in a cloud
environment.
The structure of this document is as follows:
Chapter 2 discusses the related work in the domain of the research problem, along with
the hypothesis and expected contribution.
Chapter 2
Background
This chapter discusses the background of Cloud computing and gives an overview of
the domain of the research problem: fault tolerance as a challenge in cloud computing,
Byzantine Fault Tolerance (BFT), current Byzantine Fault Tolerance (BFT) work in the
cloud, the hypothesis and the expected contribution.
2.1 Cloud computing
This section discusses the background of cloud computing and the main issues that oc-
cur in a cloud computing environment. The first paragraph discusses the background
of cloud computing while the second paragraph discusses the main issues in a cloud
computing environment.
[2] define cloud computing as referring to both the applications delivered as services
over the internet and the hardware and systems software in the data centers that provide
those services. The services themselves have long been referred to as Software as a
Service (SaaS). Some vendors use terms such as IaaS (Infrastructure as a Service) and
PaaS (Platform as a Service) to describe their products, but the authors eschew these
terms because accepted definitions for them still vary widely.
[2] and [3] describe cloud computing as a computing revolution that has the potential
to transform a large part of the IT industry by making software even more attractive as
a service and by changing the way IT hardware is designed and purchased. They also
compare cloud computing with several other computing paradigms that have promised to
deliver utility computing, such as Grid computing and Cluster computing. According to
NIST, Cloud Computing is a model for enabling ubiquitous, convenient, on-demand network
access to a shared pool of configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly provisioned and released with
minimal management effort or service provider interaction [16].
The advent of Cloud Computing has brought a new dimension to the domain of Information
Technology (IT), but along with its many benefits there are some issues. One
of the major research challenges in Cloud Computing is to make sure that resources
are reliable and continuously available. [11] argue for a robust Fault Tolerant (FT)
system in Cloud Computing. Building robust fault tolerant systems is itself a big
challenge, but it plays an important role in improving quality of service (QoS) in
cloud computing.
In the next section we will discuss Fault Tolerance in more detail.
2.2 Fault Tolerance: A challenge in cloud computing
This section discusses fault tolerance, its resilience and its management in cloud
computing, along with some fault tolerant techniques applied to build a fault tolerant
cloud computing environment. It also outlines the number, nature and kinds of faults
that appear in cloud computing infrastructures, the impact of these faults on users'
applications, and the measures for handling them in a cost-effective and efficient
manner. The first paragraph discusses fault tolerance and how different journals and
conferences describe it in cloud computing, while the second paragraph discusses the
different approaches to fault tolerance in different computing environments.
[17] classify the faults that appear as failures to end users into two types:
crash faults, which cause system components to stop functioning completely or to
remain inactive during failures, and byzantine faults, which cause the components
of a system to act arbitrarily at the time of failure, so that the system behaves
incorrectly in unpredictable ways. [11] say that a single Cloud consists of different
layers, each of which can be affected by various types of faults, so these layers require
different levels of fault tolerant techniques in order to provide seamless service. Both
sets of authors share common ground on the need for robust fault tolerance in order to
overcome faults. [17] define fault tolerance as the capability of a system to achieve
its purpose even in the presence of failures, and describe it as one of the
mechanisms for improving the overall dependability of the system.
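The distinction between the two fault classes can be made concrete with a small
simulation. The sketch below is illustrative only: the FaultMode enum, the Replica class
and the doubling workload are hypothetical names that do not come from [17]. It assumes
that a crashed component simply stops replying, while a byzantine component keeps
replying but with an arbitrary value.

```python
import random
from enum import Enum
from typing import Optional


class FaultMode(Enum):
    """Hypothetical fault classes, following the crash/byzantine split above."""
    CORRECT = "correct"
    CRASH = "crash"
    BYZANTINE = "byzantine"


class Replica:
    """A toy service replica that doubles its input (assumed workload)."""

    def __init__(self, mode: FaultMode, seed: int = 0) -> None:
        self.mode = mode
        self._rng = random.Random(seed)  # seeded so that runs are repeatable

    def handle(self, request: int) -> Optional[int]:
        correct = 2 * request
        if self.mode is FaultMode.CRASH:
            # A crashed replica stops responding; None models a timeout.
            return None
        if self.mode is FaultMode.BYZANTINE:
            # A byzantine replica still answers, but with an arbitrary value
            # that is deliberately different from the correct one.
            return correct + self._rng.randint(1, 100)
        return correct


if __name__ == "__main__":
    for mode in FaultMode:
        print(mode.value, Replica(mode).handle(21))
```

A crash is visible to the caller as a missing reply, whereas the byzantine reply looks
like a normal answer, which is why timeouts alone cannot detect it.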
Fault tolerance is a key factor in achieving good quality of service in a computing
environment: a system that is prone to faults cannot provide good service quality,
so a robust fault tolerance mechanism is required. For this reason, different authors
have proposed different approaches to fault tolerance in order to achieve quality of
service in a cloud computing environment. [14] discuss an innovative, integrated
perspective on creating and managing fault tolerance in Clouds. They present a conceptual
framework called Fault Tolerance Manager (FTM), which provides a basis for the service
provider to offer fault tolerance as a service, and propose a comprehensive approach that
hides the implementation details of fault tolerance techniques from developers and users
by means of a dedicated service layer. [20] treat the building of reliable cloud
applications as a critical research problem and propose the FTCloud framework for
building cloud applications that can tolerate faults and hence improve reliability. [11]
agree with [14] on the requirement for a robust fault tolerance framework. [11] discuss
the elementary concepts of fault tolerance by considering different fault tolerance
policies, such as the Reactive Fault Tolerance policy and the Proactive Fault Tolerance
policy, and the related fault tolerance techniques applied to diverse types of faults.
[15] discuss the major problems that arise as data sets continue to grow in size:
meeting the processing needs requires added parallelism (by including a greater number
of nodes and/or cores in the system), but this exposes the system to more frequent
failures during processing. To address this problem, they present a possible design and
implementation of a fault-tolerant environment for processing large queries on huge
datasets.
To conclude, fault tolerance is becoming a crucial area of research because faults that
occur during processing strongly affect the performance of a system, which in turn
affects the service quality it offers. [17] discuss the importance of fault tolerance in
a cloud computing environment for handling faults such as byzantine faults, and develop
an efficient fault tolerance model to prevent this kind of fault.
In the next section we will be discussing Byzantine fault tolerance in further detail.
2.3 Byzantine Fault Tolerance (BFT)
This section discusses Byzantine Fault Tolerance (BFT), a fault tolerant mechanism
required to prevent byzantine faults in a cloud computing environment. The first
paragraph discusses byzantine fault tolerance and how different journals and conferences
describe it in a cloud computing environment, while the second paragraph concludes the
discussion with a glimpse of different approaches to byzantine fault tolerance in a
cloud computing environment.
[19] describe byzantine faults as arbitrary faults which, when they occur in a cloud
computing environment, may damage a process or an entire application; to build
dependable cloud applications on a cloud infrastructure, it is therefore vital to
design a fault tolerance structure for handling this type of fault. Normally, the
dependability of cloud applications is affected by different types of faults, such as
network faults (disconnection), node faults (crashing) and byzantine faults (arbitrary
faults). The research problem focuses on one of these, byzantine faults, and on the
application of a byzantine fault tolerant mechanism in a cloud computing environment.
[4] discuss the arbitrary behavior caused by byzantine faults resulting from malicious
attacks, software errors and operator mistakes, and point to the need for highly
available systems so that the growing number of online services can be provided without
interruption. [10] discuss the need for a byzantine fault tolerant mechanism to overcome
hardware and software errors and the arbitrary/byzantine failures generated by malicious
attacks in modern distributed systems. Byzantine faults have a large impact on quality
of service in a cloud computing environment because a faulty component may send
inconsistent responses to a request, which can force a process to crash. [9] agree with
[19] that byzantine fault tolerance is a necessary framework for overcoming byzantine
faults. Both sets of authors note that, despite significant progress in making BFT
practical, it has not been widely adopted, mainly because of the complexity of the
techniques involved and the high overheads. In addition, [9] point out that BFT is not
a panacea, since there are a variety of attacks, such as various performance attacks,
that BFT does not handle well. [18] describes byzantine faults as malicious behavior or
arbitrary faults that have become an important issue in a cloud computing environment,
so that an efficient fault tolerant structure is required to prevent them. [13] agree
with [10] on the requirement for a robust fault tolerant mechanism in a cloud computing
environment, and discuss the requirement for byzantine fault tolerant protocols in order
to tolerate arbitrary/byzantine failures of hardware and software components. Byzantine
fault tolerance has thus attracted many researchers ever since byzantine faults were
introduced but, despite being a large research area, it suffers from limited practical
adoption in real-time systems such as those in the aerospace industry.
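One common way to mask the inconsistent responses described above is to replicate the
service and vote on the replies. The sketch below is a simplified illustration rather
than the protocol of any cited paper. It assumes n = 3f + 1 replicas, of which at most f
are faulty, and accepts a value once f + 1 replicas report it, since at least one of
those replicas must then be correct. The function name vote and the sample replies are
hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(replies: Sequence[Optional[Hashable]], f: int) -> Optional[Hashable]:
    """Return the value reported by at least f + 1 replicas, or None.

    `replies` holds one entry per replica; None stands for a missing reply
    (for example, from a crashed replica). With at most f faulty replicas,
    any value reported f + 1 times was sent by at least one correct replica.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    counts = Counter(r for r in replies if r is not None)
    for value, count in counts.most_common():
        if count >= f + 1:
            return value
    return None  # not enough matching replies to decide


if __name__ == "__main__":
    # n = 4 replicas tolerate f = 1 fault: one arbitrary reply is outvoted.
    print(vote([42, 42, 17, 42], f=1))  # prints 42
```

Voting of this kind only masks faulty replies; the replicas still need an agreement
protocol to process requests in the same order, which is where much of the overhead
mentioned above comes from.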
To conclude, byzantine fault tolerance is an approach for overcoming the effects of
byzantine, or arbitrary, faults in a cloud computing environment in order to enhance
quality of service. Because it is an area of concern, many attempts have been made to
apply byzantine fault tolerance in cloud computing environments. The impact that
byzantine faults have on a cloud computing environment makes the research problem even
more important, and the remainder of this review discusses the different approaches to
applying byzantine fault tolerance in a cloud computing environment.
In the next section we will analyze the different approaches to Byzantine fault tolerance
in detail.
2.4 Current Byzantine Fault Tolerance (BFT) work in cloud
This section discusses the current work in the domain of byzantine fault tolerance
(BFT) in a cloud computing environment and analyses the different methods adopted by
different authors to provide byzantine fault tolerance in cloud computing. The first
paragraph analyses the diverse approaches to byzantine fault tolerance in a cloud
computing environment, while the second paragraph concludes the discussion of these
methodologies.
[1] consider data security a significant requirement in clouds, because a cloud may fail
due to faults in its hardware or software, which makes it a critical research problem.
They propose a practical model, BFT-MCDB (Byzantine Fault Tolerance Multi-Clouds
Database), which provides byzantine fault tolerance in a multi-cloud environment. The
model combines Shamir's secret sharing approach (to detect Byzantine failures) with
Byzantine Agreement protocols in order to ensure the security of data stored in the
cloud. [19] agree with [1] that byzantine fault tolerance is an important research area
in a cloud computing environment. [19] discuss the importance of building highly
dependable applications in a cloud computing environment. They say that guaranteeing the
dependability of such applications is a big challenge, especially in a
voluntary-resource cloud, because of the highly dynamic environment. They therefore
propose BFTCloud (Byzantine Fault Tolerant Cloud), a byzantine fault tolerance framework
for constructing robust systems in a voluntary-resource cloud environment, which
guarantees the robustness of systems when up to f out of 3f + 1 resource providers are
faulty, including with arbitrary faults. [4] propose a new replication algorithm to
tolerate byzantine faults and produce highly available systems. [19] also say that
BFTCloud guarantees high reliability of systems built on top of a voluntary-resource
cloud infrastructure and ensures good performance of these systems. [12] support [19]
and [1] in proposing a fault tolerant system that can overcome arbitrary, or byzantine,
faults. They also discuss how this need is strengthened by the rise of federated
computing clouds, which help to meet Quality of Service targets by allowing users to
scale applications across various domains. They analyze the application of byzantine
fault tolerance to federated clouds in detail through an experiment in which a cloud
framework called FT-FC is built. FT-FC allows them to create diversity-based byzantine
fault tolerant systems and apply them to federated Clouds in order to examine the
efficiency of byzantine fault tolerance in that setting. [5] describe how to build an
interoperable heterogeneous cloud environment in a horizontally federated configuration,
where clouds cooperate with each other in order to build trust and provide new business
opportunities, including power saving, reduced asset costs and on-demand provisioning of
resources. [7] discuss the importance of MapReduce for running scientific data analysis
and how the results of MapReduce jobs are affected by arbitrary faults. They say that
MapReduce runtimes such as Hadoop tolerate crash faults, but not arbitrary or Byzantine
faults, and propose a MapReduce algorithm that tolerates this type of fault. Both [8]
and [6] agree with [7]: they describe byzantine faults in a cloud and propose a
MapReduce runtime that can tolerate these faults at a low cost in terms of execution
time.
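The secret sharing step that [1] rely on can be illustrated with a short sketch of
Shamir's scheme. This is a general textbook version, not the BFT-MCDB implementation:
the function names, the choice of prime field and the way a corrupted share is exposed
(by comparing reconstructions from different subsets of shares) are all assumptions made
for illustration.

```python
import random
from typing import List, Tuple

PRIME = 2**61 - 1  # a Mersenne prime; the field size is an assumption

Share = Tuple[int, int]


def split_secret(secret: int, k: int, n: int, rng: random.Random) -> List[Share]:
    """Split `secret` into n shares so that any k of them reconstruct it."""
    if not (1 <= k <= n < PRIME and 0 <= secret < PRIME):
        raise ValueError("need 1 <= k <= n and 0 <= secret < PRIME")
    # Random polynomial of degree k - 1 whose constant term is the secret.
    # random.Random keeps the sketch repeatable; real use needs the secrets module.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares: List[Share]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8 or later).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    shares = split_secret(123456789, k=3, n=5, rng=random.Random(7))
    print(reconstruct(shares[:3]) == 123456789)
    # Corrupt the share held by one (hypothetical) faulty cloud.
    bad = [(shares[0][0], (shares[0][1] + 1) % PRIME)] + shares[1:]
    # Subsets with and without the corrupted share now disagree.
    print(reconstruct(bad[:3]) == reconstruct(bad[1:4]))
```

Because any k correct shares give the same secret, a disagreement between subsets
signals that at least one share has been altered, which is one way in which secret
sharing can help to expose a byzantine storage provider.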
To conclude, byzantine fault tolerance is a large domain in which a lot of work has
been done. Besides the approaches discussed above, there are various other approaches
to applying byzantine fault tolerance in a cloud computing environment, which makes
byzantine fault tolerance a wide area for research.
In the next section we will present the hypothesis of this document.
2.5 Hypothesis
This document focuses on the following research problem:
Can we design an efficient Byzantine Fault Tolerant mechanism for a cloud
computing environment?
This research problem is important because arbitrary, or byzantine, faults
can occur in any running process within any cloud computing environment at any point
in time, and a proper fault tolerant mechanism is required to prevent or overcome them.
In this research work, we propose an efficient byzantine fault tolerant framework
named BFS (Byzantine Fault Solution) for a cloud computing environment.
The next section describes the expected contribution for this research problem.
2.6 Expected Contribution
This paper addresses byzantine/arbitrary faults, continues the novel
approach of [12] and proposes an effective fault tolerant framework, Byzantine Fault
Solution (BFS), for a cloud computing environment by trying to:
• Inject complex faults into the application layer,
• Explain the reasons for failures caused by byzantine/arbitrary faults,
• Increase the cloud dynamicity and record results for the same framework, and
• Compare Byzantine Fault Solution (BFS) with the framework proposed by [12].
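The first of these steps, injecting faults into the application layer, could be
prototyped along the following lines. This is only a sketch under stated assumptions and
not the BFS design: the inject_faults wrapper, the fixed injection rate and the stats
dictionary used to record results are hypothetical, and an arbitrary fault is modelled
simply as a corrupted return value.

```python
import random
from typing import Callable, Dict


def inject_faults(service: Callable[[int], int], rate: float,
                  rng: random.Random, stats: Dict[str, int]) -> Callable[[int], int]:
    """Wrap `service` so that a fraction `rate` of calls return a corrupted result."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")

    def wrapped(request: int) -> int:
        result = service(request)
        stats["calls"] = stats.get("calls", 0) + 1
        if rng.random() < rate:
            stats["injected"] = stats.get("injected", 0) + 1
            return result ^ 1  # flip the lowest bit: an arbitrary wrong value
        return result

    return wrapped


if __name__ == "__main__":
    stats: Dict[str, int] = {}
    faulty_double = inject_faults(lambda x: 2 * x, 0.3, random.Random(1), stats)
    wrong = sum(1 for i in range(1000) if faulty_double(i) != 2 * i)
    # Every injected fault is visible as a wrong answer, so the counts match.
    print(stats, wrong == stats.get("injected", 0))
```

Keeping the random generator seeded makes each experiment repeatable, which helps when
the same fault pattern has to be replayed against two frameworks for comparison.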
Bibliography
[1] M.A. AlZain, B. Soh, and E. Pardede. A byzantine fault tolerance model for a multi-cloud
computing. In Computational Science and Engineering (CSE), 2013 IEEE 16th International
Conference on, pages 130–137, Dec 2013.
[2] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski,
Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. A view of cloud
computing. Commun. ACM, 53(4):50–58, April 2010.
[3] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona Brandic. Cloud
computing and emerging IT platforms: Vision, hype, and reality for delivering computing as
the 5th utility. Future Generation Computer Systems, 25(6):599–616, 2009.
[4] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery.
ACM Trans. Comput. Syst., 20(4):398–461, November 2002.
[5] A. Celesti, F. Tusa, M. Villari, and A. Puliafito. Three-phase cross-cloud federation model: The
cloud SSO authentication. In Advances in Future Internet (AFIN), 2010 Second International
Conference on, pages 94–101, July 2010.
[6] M. Correia, P. Costa, M. Pasin, A. Bessani, F. Ramos, and P. Verissimo. On the feasibility of
byzantine fault-tolerant MapReduce in clouds-of-clouds. In Reliable Distributed Systems (SRDS),
2012 IEEE 31st Symposium on, pages 448–453, Oct 2012.
[7] P. Costa, M. Pasin, A.N. Bessani, and M. Correia. Byzantine fault-tolerant MapReduce: Faults
are not just crashes. In Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third
International Conference on, pages 32–39, Nov 2011.
[8] P. Costa, M. Pasin, A.N. Bessani, and M.P. Correia. On the performance of byzantine
fault-tolerant MapReduce. Dependable and Secure Computing, IEEE Transactions on, 10(5):301–313,
Sept 2013.
[9] S. Duan, K. Levitt, Hein Meling, S. Peisert, and Haibin Zhang. ByzID: Byzantine fault tolerance
from intrusion detection. In Reliable Distributed Systems (SRDS), 2014 IEEE 33rd International
Symposium on, pages 253–264, Oct 2014.
[10] S. Duan, S. Peisert, and K.N. Levitt. hBFT: Speculative byzantine fault tolerance with minimum
cost. Dependable and Secure Computing, IEEE Transactions on, 12(1):58–70, Jan 2015.
[11] A. Ganesh, M. Sandhya, and S. Shankar. A study on fault tolerance methods in cloud computing.
In Advance Computing Conference (IACC), 2014 IEEE International, pages 844–849, Feb 2014.
[12] P. Garraghan, P. Townend, and Jie Xu. Byzantine fault-tolerance in federated cloud computing.
In Service Oriented System Engineering (SOSE), 2011 IEEE 6th International Symposium on,
pages 280–285, Dec 2011.
[13] Rachid Guerraoui and Maysam Yabandeh. Independent faults in the cloud. In Proceedings of
the 4th International Workshop on Large Scale Distributed Systems and Middleware, LADIS ’10,
pages 12–17, New York, NY, USA, 2010. ACM.
[14] R. Jhawar, V. Piuri, and M. Santambrogio. Fault tolerance management in cloud computing: A
system-level perspective. Systems Journal, IEEE, 7(2):288–297, June 2013.
[15] M.C. Kurt and G. Agrawal. A fault-tolerant environment for large-scale query processing. In High
Performance Computing (HiPC), 2012 19th International Conference on, pages 1–10, Dec 2012.
[16] Peter Mell and Timothy Grance. The NIST definition of cloud computing. September 2011. [Online;
accessed 26-December-2014].
[17] Ravi Jhawar and Vincenzo Piuri. Fault tolerance and resilience in cloud computing environment. In
Cyber Security and IT Infrastructure Protection, pages 1–28, Boston: Syngress, 2014.
[18] Marko Vukolic. The byzantine empire in the intercloud. SIGACT News, 41(3):105–111, September
2010.
[19] Yilei Zhang, Zibin Zheng, and M.R. Lyu. BFTCloud: A byzantine fault tolerance framework for
voluntary-resource cloud computing. In Cloud Computing (CLOUD), 2011 IEEE International
Conference on, pages 444–451, July 2011.
[20] Zibin Zheng, T.C. Zhou, M.R. Lyu, and I. King. FTCloud: A component ranking framework for
fault-tolerant cloud applications. In Software Reliability Engineering (ISSRE), 2010 IEEE 21st
International Symposium on, pages 398–407, Nov 2010.