Cloud Operations and Analytics: Improving Distributed Systems Reliability Using Fault Injection
TRANSCRIPT
December 12, 2016
Technical University of Munich
www.tum.de
Dr. Jorge Cardoso ([email protected])
Chief Architect for Cloud Operations and Analytics
IT R&D Division
About Me
Jorge Cardoso http://jorge-cardoso.github.io/
Interests
• Cloud Computing
• Service Science and Internet of Services
• Business Process Management
• Semantic Web
Short Bio
Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal. He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and the Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
Contents
1. Cloud Computing
2. Cloud Operating Systems
3. Cloud Reliability
4. Fault Injection Techniques
5. Butterfly Effect Project
From Virtualization to Clouds
Cloud Computing Deployment Stages of Enterprises
1. Virtualization (focus on resources):
• Computing virtualization
• Storage virtualization
• Network and security virtualization
• Automatic management
2. Private cloud / data center consolidation (gradually focus on business):
• Elastic resource scheduling
• HA based on large clusters
• Consolidation of multiple DCs
• Multi-level backup and DR
3. Hybrid cloud (focus on global business; flexible and service-driven):
• Software-defined networking (SDN)
• Unified management
• Optimal resource allocation
• Flexible service migration
Deployment models: private, public, and hybrid cloud.
Virtualization
Server virtualization is the partitioning of a physical server into smaller virtual servers to maximize resources. The resources of the server are hidden from users. Software is used to divide the physical server into multiple virtual environments.
Communications of the ACM, vol. 17, no. 7, 1974, pp. 412-421.
(Figure: four dedicated x86 servers (Windows XP, Windows 2003, SUSE, Red Hat), each running at 10-18% hardware utilization, consolidated as VMs onto a single multi-core, multi-processor x86 host at 70% hardware utilization.)
Contents
Fault Injection Techniques 4
Cloud Reliability 3
Cloud Operating Systems 2
Cloud Computing 1
Butterfly Effect Project 5
6
Cloud Operating Systems
• Examples: Azure, Amazon, Google, Oracle, OpenStack, SoftLayer, etc.
• Transform datacenters into pools of resources
• Provide a management layer for controlling, automating, and efficiently allocating resources
• Adopt a self-service model
• Enable developers to build cloud-aware applications via standard APIs
OpenStack History
• Started by Rackspace and NASA (2010)
• Driven by the emergence of virtualization
• Rackspace wanted to rewrite its cloud servers offering
• NASA had published code for Nova, a Python-based cloud computing controller
Series    Status                                       Initial Release Date   EOL Date
Queens    Future                                       TBD                    TBD
Pike      Future                                       TBD                    TBD
Ocata     Under development                            2017-02-22 (planned)   TBD
Newton    Current stable release, security-supported   2016-10-06             TBD
Mitaka    Security-supported                           2016-04-07             2017-04-10
Liberty   Security-supported                           2015-10-15             2016-11-17
Kilo      EOL                                          2015-04-30             2016-05-02
Juno      EOL                                          2014-10-16             2015-12-07
Icehouse  EOL                                          2014-04-17             2015-07-02
Havana    EOL                                          2013-10-17             2014-09-30
Grizzly   EOL                                          2013-04-04             2014-03-29
Folsom    EOL                                          2012-09-27             2013-11-19
Essex     EOL                                          2012-04-05             2013-05-06
Diablo    EOL                                          2011-09-22             2013-05-06
Cactus    Deprecated                                   2011-04-15             -
Bexar     Deprecated                                   2011-02-03             -
Austin    Deprecated                                   2010-10-21             -
https://www.nextplatform.com/2016/11/03/building-stack-openstack/
OpenStack Community
• 1,500+ active participants
• 17 countries represented at the Design Summit
• 60,000+ downloads
• Worldwide network of user groups (North America, South America, Europe, Asia, and Africa)
OpenStack Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
OpenStack User Survey: A snapshot of OpenStack users' attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
Compute Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
Adopters
Apr 6, 2016
http://cloud.telekom.de/Deutsche-Cloud
Deploying OpenStack
$ sudo yum install -y centos-release-openstack-newton
$ sudo yum update -y
$ sudo yum install -y openstack-packstack
$ packstack --allinone
https://www.rdoproject.org/install/quickstart/
Contents
1. Cloud Computing
2. Cloud Operating Systems
3. Cloud Reliability
4. Fault Injection Techniques
5. Butterfly Effect Project
Why does using a cloud infrastructure require advanced approaches to resiliency?
One reason [Netflix]: it's the lack of control over the underlying hardware, the inability to configure it to ensure 100% uptime.
Unplanned downtime is caused by*:
• software bugs: 27%
• hardware: 23%
• human error: 18%
• network failures: 17%
• natural disasters: 8%
* Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Google's 2007 study found annualized failure rates (AFRs) for drives of 1.7% for 1-year-old drives and >8.6% for 3-year-old drives.
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc. of the 5th USENIX Conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
Amazon AWS: GameDay
A program designed to increase resilience by purposely injecting major failures, to discover flaws and subtle dependencies.
"That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously [...] We've had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico."
Google: DiRT (Disaster Recovery Test)
Annual disaster recovery and testing exercise, 8 years since inception: a multi-day exercise triggering (controlled) failures in systems and processes.
• Premise: 30-day incapacitation of headquarters following a disaster; other offices and facilities may be affected.
• When: "big disaster" annually for 3-5 days; continuous testing year-round.
• Who: 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities) and business units (Human Resources, Finance, Safety, Crisis Response, etc.).
Source: http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
Netflix: Chaos Monkey
• Fewer alerts for the ops team
• Amazon EC2 and Amazon RDS service disruption in the US East region (April 29, 2011)
• September 20, 2015: Amazon's DynamoDB service experienced an availability issue in US-EAST-1; response: transfer traffic to the east region
Reliability
Dependability: concepts, techniques, and tools developed over the past four decades, including the attributes:
• Availability: readiness for correct service.
• Reliability: continuity of correct service.
• Safety: absence of catastrophic consequences on the user(s) and the environment.
• Integrity: absence of improper system alterations.
• Maintainability: ability to undergo modifications and repairs.
Means to attain dependability:
• Fault prevention: means to prevent the occurrence or introduction of faults.
• Fault tolerance: means to avoid service failures in the presence of faults [Voas98].
• Fault removal: means to reduce the number and severity of faults.
• Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults.
A. Avizienis, J.C. Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33.
J. Voas, G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. Wiley, USA, 1998.
Threats
• Fault: the adjudged or hypothesized cause of an error.
• Error: a discrepancy between a computed, observed, or measured value or condition and a true, specified, or theoretically correct value or condition. An error is a consequence of a fault.
• Failure: the deviation of the delivered service from fulfilling the system function.
M. Cinque, D. Cotroneo, A. Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach. IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 806-821.
(Figure: fault → error → failure propagation chain.)
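The fault → error → failure chain can be made concrete with a minimal, hypothetical sketch: a dormant bug (fault) produces a wrong internal value once executed (error), which then crosses the service boundary as a visible failure. The function names below are invented for illustration only.

```python
# Minimal illustration of the fault -> error -> failure chain.
# All names here are hypothetical, chosen only for this example.

def remaining_capacity(used_gb, total_gb):
    # FAULT: a dormant bug -- the operands are swapped.
    return used_gb - total_gb  # should be total_gb - used_gb

def can_schedule_vm(used_gb, total_gb, needed_gb):
    # ERROR: once the faulty code runs, the internal value is wrong.
    free = remaining_capacity(used_gb, total_gb)
    # FAILURE: the wrong value reaches the service boundary and the
    # scheduler rejects a VM that would actually fit.
    return free >= needed_gb

# A node with 100 GB total and 20 GB used should fit a 40 GB VM,
# but the fault propagates into a visible failure:
print(can_schedule_vm(20, 100, 40))  # -> False (correct answer: True)
```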
Contents
1. Cloud Computing
2. Cloud Operating Systems
3. Cloud Reliability
4. Fault Injection Techniques
5. Butterfly Effect Project
Fault Injection Techniques
Fault injection techniques introduce faults to perturb the normal flow of a program, to extend test coverage or to stress-test the system.
• FI on simulated models
  - VHDL simulation models
  - Other languages
• FI on prototypes
  - Hardware-implemented fault injection (HWIFI)
    · External: HWIFI at pin level, electromagnetic perturbations
    · Internal: heavy-ion radiation, laser radiation, scan chain
  - Software-implemented fault injection (SWIFI)
    · Time: static, dynamic
    · Level: high level, machine language
Injection objectives: prediction, elimination.
Software-implemented fault injection (SWIFI) injects a fault into a software system at run time.
Advantages:
• Experiments can be run in near real time
• No model development needed
• Can be extended for new classes of faults
Limitations:
• Limited set of injection instants
• Cannot inject faults into locations that are inaccessible to software
• Requires modification of the source code to support the fault injection
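As an illustration of SWIFI at run time, the hypothetical sketch below wraps an existing function so that calls fail with a given probability. `read_block` and the wrapper are invented for this example, not part of any real injector; note how it also exhibits the limitation above: faults can only be injected at instants where the wrapper is invoked.

```python
import random

def inject_fault(func, probability, exc=IOError("injected fault")):
    """Wrap `func` so that each call raises `exc` with the given
    probability. This is software-implemented fault injection (SWIFI):
    the fault is introduced at run time, no simulation model needed."""
    def wrapper(*args, **kwargs):
        if random.random() < probability:
            raise exc
        return func(*args, **kwargs)
    return wrapper

# Hypothetical service call to perturb:
def read_block(block_id):
    return b"data-%d" % block_id

faulty_read = inject_fault(read_block, probability=0.3)

random.seed(42)
failures = 0
for i in range(1000):
    try:
        faulty_read(i)
    except IOError:
        failures += 1
print(failures)  # roughly 300 of 1000 calls fail
```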
Huawei: Butterfly Effect
The Butterfly Effect system automatically tests and repairs OpenStack and cloud applications (running on Huawei FusionSphere). It works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively (Failure → Test → Repair).
In chaos theory, the butterfly effect is the sensitive dependence on initial conditions, in which a small change in one state of a deterministic nonlinear system can result in large differences in a later state. [Wikipedia]
The Strategy
1. VM failures: send a VM creation request; find the compute node where the request was scheduled; damage the compute server; check whether the VM creation was re-scheduled to another node.
2. Disk temporarily unavailable: unmount a disk; wait for replicas to regenerate; remount the disk with the data intact; the extra replicas on handoff nodes should get removed.
3. Disk replacement: unmount a disk; wait for replicas to regenerate; delete the disk and remount it; wait for replicas to regenerate; the extra replicas on handoff nodes should get removed.
4. Replication: damage three disks at the same time (more if the replica count is higher); check that the replicas did not regenerate even after some time period; fail if the replicas regenerated (this tests whether the tests themselves are correct).
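A test like scenario 2 can be sketched as a small driver loop. Everything below is hypothetical: `FakeCluster` is a toy stand-in for a Swift cluster (a real test would issue ssh/bash commands against the storage nodes), and the timing model is trivially simplified.

```python
import time

def wait_until(predicate, timeout_s=5.0, poll_s=0.01):
    """Poll `predicate` until it is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return False

class FakeCluster:
    """Toy stand-in for a Swift cluster with 3 replicas per object;
    regeneration is instantaneous here, unlike on real hardware."""
    def __init__(self, replica_count=3):
        self.replica_count = replica_count
        self.disk_mounted = True
    def unmount_disk(self):
        self.disk_mounted = False
    def remount_disk(self):
        self.disk_mounted = True
    def replicas(self):
        # In this toy model the cluster always converges back to
        # the target replica count (via handoff nodes while a disk
        # is down, via cleanup after remount).
        return self.replica_count

def disk_temporarily_unavailable(cluster):
    """Scenario 2: unmount, wait for regeneration, remount, wait
    for the handoff replicas to be cleaned up again."""
    cluster.unmount_disk()
    assert wait_until(lambda: cluster.replicas() >= cluster.replica_count)
    cluster.remount_disk()
    assert wait_until(lambda: cluster.replicas() == cluster.replica_count)
    return True

print(disk_temporarily_unavailable(FakeCluster()))  # -> True
```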
Approach and Test Environment
• Fully automated and customizable
• Simple, using ssh and bash scripting
• Deploy and destroy: 2 hours to deploy an OpenStack infrastructure with 32 VMs... 32 seconds to destroy it
Stack (Huawei RH2288 + Fedora → Vagrant → VirtualBox → VMs):
• FusionServer RH2288 running Fedora
• Vagrant: provides easy-to-configure, reproducible, and portable environments for OpenStack; interfaces to VirtualBox, VMware, AWS, and other providers
• VirtualBox: free open-source hypervisor for x86 computers from Oracle; management of virtual machines
• RDO: freely available distribution of OpenStack from Red Hat (OpenStack Mitaka)
Service to Destroy
• Database
• Message Queue
• Authentication
• Hypervisor (Nova-Compute)
• Hard drive
Targets include nodes, services, processes, network, hypervisor, storage, etc.
The main testing framework of OpenStack is called Tempest, an open-source project with more than 2,000 tests: only black-box testing (tests only access the public interfaces).
Scenario Driven
Create Server scenario:
1. Request provisioning via UI/CLI
2. Validate auth data
3. Send API request to Nova API
4. Validate API token
5. Process API request
6. Publish provisioning request
7. Schedule provisioning
8. Start VM provisioning
9. Configure network
10. Request volume
11. Request VM image from Glance
12. Get image URL from Glance
13. Direct image file copy
14. Start VM rendering via hypervisor
Inject faults during the scenario using OpenStack operations such as: flavor create, flavor delete, flavor list, host list, hypervisor list, hypervisor show, image add project, image create, image delete, image list, image show, ip fixed add, ...
$ openstack server create --flavor m1.medium --image "fedora-23" --key-name ayoung-pubkey --security-group default --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
http://www.slideshare.net/mirantis/openstack-architecture-43160012
http://docs.openstack.org/developer/tempest/field_guide/scenario.html
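Scenario-driven injection can be sketched as running the provisioning steps in order and perturbing one chosen step. The step names and the driver below are hypothetical stand-ins, not the actual Butterfly Effect code.

```python
class InjectedFault(Exception):
    """Raised by the driver to simulate a failure at a chosen step."""
    pass

def run_scenario(steps, inject_at=None):
    """Run `steps` (name, action pairs) in order; raise InjectedFault
    just before the step named `inject_at`, mimicking a fault plan."""
    completed = []
    for name, action in steps:
        if name == inject_at:
            raise InjectedFault(name)
        action()
        completed.append(name)
    return completed

# Hypothetical, heavily simplified "create server" scenario:
steps = [
    ("validate_token", lambda: None),
    ("schedule",       lambda: None),
    ("fetch_image",    lambda: None),
    ("boot_vm",        lambda: None),
]

print(run_scenario(steps))            # all four steps complete
try:
    run_scenario(steps, inject_at="fetch_image")
except InjectedFault as f:
    # validate_token and schedule ran; fetch_image and boot_vm did not
    print("fault injected at:", f)
```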
Localized Injection
Faults are injected in a localized way, triggered either by system state (state-based) or at specific instants (time-based).
Faults to Inject
• Bit-flips in CPU registers/memory
• Memory errors: memory corruption, leaks, lack of memory
• Disk faults: read/write errors, lack of disk space
• Network faults: packet loss, network congestion, etc.
• Terminate instances
• Introduce delays in message delivery
• Corrupt data in the DB
• Service, process, and application crashes
• Reboot nodes
• Configuration errors
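One way to organize such a fault library is a simple catalog mapping fault names to the injection actions a driver would run on the target node. The catalog and the shell commands below are illustrative placeholders, not a tested fault plan.

```python
# Hypothetical fault catalog: each entry maps a fault name to the
# shell command a driver would execute on the target node over ssh.
FAULT_CATALOG = {
    "kill_service": "pkill -9 -f nova-compute",
    "fill_disk":    "fallocate -l 100G /tmp/balloon",
    "packet_loss":  "tc qdisc add dev eth0 root netem loss 30%",
    "reboot_node":  "systemctl reboot",
}

def build_fault_plan(fault_names):
    """Resolve a list of fault names into concrete commands,
    failing fast on unknown faults."""
    unknown = [f for f in fault_names if f not in FAULT_CATALOG]
    if unknown:
        raise KeyError("unknown faults: %s" % ", ".join(unknown))
    return [(f, FAULT_CATALOG[f]) for f in fault_names]

plan = build_fault_plan(["packet_loss", "kill_service"])
for name, cmd in plan:
    print(name, "->", cmd)
```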
Detect Failures
• Network tests: create keypairs, create security groups, create networks
• Compute tests: create a keypair, create a security group, boot an instance
• Swift tests: create a volume, get the volume, delete the volume
• Identity tests: ...
• Cinder tests: ...
• Glance tests: ...
$ tempest init cloud-01
$ cp tempest/etc/tempest.conf cloud-01/etc/
$ cd cloud-01
# Full test suite:
$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'
# Minimum basic test:
$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest.scenario.test_minimum_basic)'
Detect Failures
Running the full Tempest suite (Tempest 0: ~1,400 tests, 45 min-2 h) is too slow for damage detection. Successive reduced suites (Tempest 1, 2, 3) shrink the workload: from 100% (100 tests) down to 40% (40 tests) with overlapping tests, and, with mutually exclusive tests, down to ~5% (log2 40) and ~4% (log2 20) using branch and bound.
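The branch-and-bound idea above amounts to a binary search over groups of mutually exclusive tests: each round tests half of the remaining candidates, so a single broken component is found in log2(N) group runs instead of N individual runs. The sketch below uses a fabricated `oracle` in place of a real group test.

```python
def bisect_failure(components, is_healthy):
    """Binary-search for the single broken component by testing
    halves of the component list: log2(N) group tests instead of N."""
    lo, hi = 0, len(components)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_healthy(components[lo:mid]):
            lo = mid          # fault must be in the upper half
        else:
            hi = mid          # fault is in the lower half
    return components[lo]

# Toy oracle: pretend component "rabbitmq" is the broken one.
services = ["nova-api", "nova-scheduler", "rabbitmq", "glance-api",
            "keystone", "cinder-api", "neutron-server", "swift-proxy"]
oracle = lambda group: "rabbitmq" not in group
print(bisect_failure(services, oracle))  # -> rabbitmq
```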
Limitations of Integration Tests
• Side effects: integration tests often have side effects and require specific setups, so they often cannot be used in production systems. For example, integration tests which delete all the virtual machines running on a production platform cannot be run in production.
• Reuse: integration tests are composed of many types of tests (e.g., unit tests, API tests, integration tests, scenario tests, and positive and negative tests). Reusing test code for damage detection is useful, but the selection can be difficult.
• Filtering: most tests are not relevant for damage detection on production systems. While damage detection looks for components, services, and processes which are no longer working properly, tests determine whether code commits generate errors. When software code is tested, many functional tests are irrelevant in production.
• Specificity: new code for damage detection always needs to be developed, since testing does not typically look for problems that can happen when a system is in a particular operational state.
Butterfly Effect: Example of Fault Injection
International Industry-Academia Workshop on Cloud Reliability and Resilience
The International Industry-Academia Workshop on Cloud Reliability and Resilience was held in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from industry (Intel, Red Hat, Cisco, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade, T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon, U Coimbra).
(Photos: Dmitri Zimine (Brocade) giving his speech on workflows for auto-remediation (credits to Johannes Weingart); Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, with the workshop organizer Jorge Cardoso (Huawei).)
Current Team: Cloud Operations and Analytics
Objective
• Planet-scale distributed systems = automation
• Highly complex systems = AI and machine learning
Skills and knowledge
• OpenStack software development
• Machine learning and real-time analysis
• Reliability for cloud-native applications
• Large-scale distributed systems
Topics
• Working student: Distributed Execution Graphs (DEG) for OpenStack.
• Master students: Efficient diagnosis in cloud platforms; DEG-driven fault injection for cloud platforms.
• PhD students: Risk-aware cloud recovery using machine learning (automation + AI).
• Internship for PhD student: Next generation of DEG-driven systems beyond Google's Dapper and Twitter's Zipkin.
Open Positions
• Working & MSc students: fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, ...
• PhD students: rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months...
• Postdocs: solving difficult challenges of real problems using quick-and-dirty prototyping.
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive
statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time
without notice.
HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY
Executive Summary
The complexity and dynamicity of large-scale cloud platforms require automated solutions to reduce the risks of eventual failures.
Fault injection mechanisms make it possible to determine (and repair), under controlled conditions, the types of failures that platforms cannot tolerate, rather than taking a passive approach and waiting for Murphy's law to come into play on a Sunday at 2 am when engineers are off duty.
Pioneers such as Amazon, Google, and Netflix have already developed fault injection mechanisms and have also changed their mindset with respect to the importance of the resiliency of cloud platforms.
As an innovation topic, we take one step further towards fault-tolerant platforms by exploring not only fault injection but also the automated recovery of platforms.
SW Fault Injection Tools
• FIAT: Fault Injection-based Automated Testing environment, Carnegie Mellon University.
• EFI, PROFI: Processor Fault Injector, Dortmund University.
• FERRARI: Fault and ERRor Automatic Real-time Injector, University of Texas.
• SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn enviRonment, University of Michigan.
• FINE: Fault Injection and moNitoring Environment, University of Illinois.
• FTAPE: Fault Tolerance And Performance Evaluator, University of Illinois.
• Xception: University of Coimbra.
• MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
• BALLISTA: Carnegie Mellon University.