Cloud Operations and Analytics: Improving Distributed Systems Reliability Using Fault Injection
TRANSCRIPT
December 12, 2016
Technical University of Munich
www.tum.de
Dr. Jorge Cardoso ([email protected])
Chief Architect for Cloud Operations and Analytics
IT R&D Division
About Me
Jorge Cardoso http://jorge-cardoso.github.io/
Interests
• Cloud Computing
• Service Science and Internet of Services
• Business Process Management
• Semantic Web
Short Bio
Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal. He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and the Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
Contents
1. Cloud Computing
2. Cloud Operating Systems
3. Cloud Reliability
4. Fault Injection Techniques
5. Butterfly Effect Project
From Virtualization to Clouds
Cloud Computing Deployment Stages of Enterprises
1. Virtualization (focus on resources):
• Computing virtualization
• Storage virtualization
• Network and security virtualization
• Automatic management
2. Private cloud / data center consolidation (gradually focus on business):
• Elastic resource scheduling
• HA based on large clusters
• Consolidation of multiple DCs
• Multi-level backup and DR
3. Hybrid cloud (focus on global business; flexible and service-driven):
• Software-defined networking (SDN)
• Unified management
• Optimal resource allocation
• Flexible service migration
Deployment models: private, public, and hybrid cloud.
Virtualization
Server virtualization is the partitioning of a physical server into smaller virtual servers to maximize resources. The resources of the server are hidden from users. Software is used to divide the physical server into multiple virtual environments.
Communications of the ACM, vol. 17, no. 7, 1974, pp. 412-421.
(Figure: four dedicated x86 servers (Windows XP, Windows 2003, SUSE, Red Hat), each running at 10-18% hardware utilization, consolidated as VMs onto a single multi-core, multi-processor x86 host at 70% hardware utilization.)
Contents
Fault Injection Techniques 4
Cloud Reliability 3
Cloud Operating Systems 2
Cloud Computing 1
Butterfly Effect Project 5
6
Cloud Operating Systems
• Examples: Azure, Amazon, Google, Oracle, OpenStack, SoftLayer, etc.
• Transform datacenters into pools of resources
• Provide a management layer for controlling, automating, and efficiently allocating resources
• Adopt a self-service model
• Enable developers to build cloud-aware applications via standard APIs
OpenStack History
• Started by Rackspace and NASA (2010)
• Driven by the emergence of virtualization
• Rackspace wanted to rewrite its cloud servers offering
• NASA had published code for Nova, a Python-based cloud computing controller
Series    Status                                       Initial Release Date   EOL Date
Queens    Future                                       TBD                    TBD
Pike      Future                                       TBD                    TBD
Ocata     Under development                            2017-02-22 (planned)   TBD
Newton    Current stable release, security-supported   2016-10-06             TBD
Mitaka    Security-supported                           2016-04-07             2017-04-10
Liberty   Security-supported                           2015-10-15             2016-11-17
Kilo      EOL                                          2015-04-30             2016-05-02
Juno      EOL                                          2014-10-16             2015-12-07
Icehouse  EOL                                          2014-04-17             2015-07-02
Havana    EOL                                          2013-10-17             2014-09-30
Grizzly   EOL                                          2013-04-04             2014-03-29
Folsom    EOL                                          2012-09-27             2013-11-19
Essex     EOL                                          2012-04-05             2013-05-06
Diablo    EOL                                          2011-09-22             2013-05-06
Cactus    Deprecated                                   2011-04-15             -
Bexar     Deprecated                                   2011-02-03             -
Austin    Deprecated                                   2010-10-21             -
https://www.nextplatform.com/2016/11/03/building-stack-openstack/
OpenStack Community
• 1,500+ active participants
• 17 countries represented at the Design Summit
• 60,000+ downloads
• Worldwide network of user groups (North America, South America, Europe, Asia, and Africa)
OpenStack Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
OpenStack User Survey: A snapshot of OpenStack users' attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
Compute Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
Adopters
Apr 6, 2016
http://cloud.telekom.de/Deutsche-Cloud
Deploying OpenStack
$ sudo yum install -y centos-release-openstack-newton
$ sudo yum update -y
$ sudo yum install -y openstack-packstack
$ packstack --allinone
https://www.rdoproject.org/install/quickstart/
Contents
1. Cloud Computing
2. Cloud Operating Systems
3. Cloud Reliability
4. Fault Injection Techniques
5. Butterfly Effect Project
Why does using a cloud infrastructure require advanced approaches to resiliency?
One reason [Netflix]: it's the lack of control over the underlying hardware, the inability to configure it to ensure 100% uptime.
Unplanned downtime is caused by*:
• software bugs: 27%
• hardware: 23%
• human error: 18%
• network failures: 17%
• natural disasters: 8%
* Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Google's 2007 study found annualized failure rates (AFRs) for drives of 1.7% for 1-year-old drives and >8.6% for 3-year-old drives.
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc. of the 5th USENIX Conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
Amazon AWS: GameDay
A program designed to increase resilience by purposely injecting major failures, to discover flaws and subtle dependencies.
"That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously [...] We've had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico."
Google: DiRT (Disaster Recovery Test)
Annual disaster recovery and testing exercise, 8 years since inception: a multi-day exercise triggering (controlled) failures in systems and processes.
• Premise: 30-day incapacitation of headquarters following a disaster; other offices and facilities may be affected.
• When: "big disaster" annually for 3-5 days; continuous testing year-round.
• Who: 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities) and business units (Human Resources, Finance, Safety, Crisis Response, etc.).
Source: http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
Netflix: Chaos Monkey
• Fewer alerts for the ops team
• Amazon EC2 and Amazon RDS service disruption in the US East region (April 29, 2011)
• September 20, 2015: Amazon's DynamoDB service experienced an availability issue in US-EAST-1; response: transfer traffic to the east region
Reliability
Dependability: concepts, techniques, and tools developed over the past four decades, including the attributes:
• Availability: readiness for correct service.
• Reliability: continuity of correct service.
• Safety: absence of catastrophic consequences on the user(s) and the environment.
• Integrity: absence of improper system alterations.
• Maintainability: ability to undergo modifications and repairs.
Means to attain dependability:
• Fault prevention: means to prevent the occurrence or introduction of faults.
• Fault tolerance: means to avoid service failures in the presence of faults [Voas98].
• Fault removal: means to reduce the number and severity of faults.
• Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults.
A. Avizienis, J.C. Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33.
J. Voas, G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. Wiley, USA, 1998.
Threats
• Fault: the adjudged or hypothesized cause of an error.
• Error: a discrepancy between a computed, observed, or measured value or condition and a true, specified, or theoretically correct value or condition. An error is a consequence of a fault.
• Failure: the deviation of the delivered service from fulfilling the system function.
M. Cinque, D. Cotroneo, A. Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach. IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 806-821.
(Figure: fault → error → failure propagation chain.)
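The fault → error → failure chain can be made concrete with a minimal, hypothetical sketch: a dormant bug (fault) produces a wrong internal value once executed (error), which then crosses the service boundary as a visible failure. The function names below are invented for illustration only.

```python
# Minimal illustration of the fault -> error -> failure chain.
# All names here are hypothetical, chosen only for this example.

def remaining_capacity(used_gb, total_gb):
    # FAULT: a dormant bug -- the operands are swapped.
    return used_gb - total_gb  # should be total_gb - used_gb

def can_schedule_vm(used_gb, total_gb, needed_gb):
    # ERROR: once the faulty code runs, the internal value is wrong.
    free = remaining_capacity(used_gb, total_gb)
    # FAILURE: the wrong value reaches the service boundary and the
    # scheduler rejects a VM that would actually fit.
    return free >= needed_gb

# A node with 100 GB total and 20 GB used should fit a 40 GB VM,
# but the fault propagates into a visible failure:
print(can_schedule_vm(20, 100, 40))  # -> False (correct answer: True)
```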
Contents
1. Cloud Computing
2. Cloud Operating Systems
3. Cloud Reliability
4. Fault Injection Techniques
5. Butterfly Effect Project
Fault Injection Techniques
Fault injection techniques introduce faults to perturb the normal flow of a program, to extend test coverage or to stress-test the system.
• FI on simulated models
  - VHDL simulation models
  - Other languages
• FI on prototypes
  - Hardware-implemented fault injection (HWIFI)
    · External: HWIFI at pin level, electromagnetic perturbations
    · Internal: heavy-ion radiation, laser radiation, scan chain
  - Software-implemented fault injection (SWIFI)
    · Time: static, dynamic
    · Level: high level, machine language
Injection objectives: prediction, elimination.
Software-implemented fault injection (SWIFI) injects a fault into a software system at run time.
Advantages:
• Experiments can be run in near real time
• No model development needed
• Can be extended for new classes of faults
Limitations:
• Limited set of injection instants
• Cannot inject faults into locations that are inaccessible to software
• Requires modification of the source code to support the fault injection
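As an illustration of SWIFI at run time, the hypothetical sketch below wraps an existing function so that calls fail with a given probability. `read_block` and the wrapper are invented for this example, not part of any real injector; note how it also exhibits the limitation above: faults can only be injected at instants where the wrapper is invoked.

```python
import random

def inject_fault(func, probability, exc=IOError("injected fault")):
    """Wrap `func` so that each call raises `exc` with the given
    probability. This is software-implemented fault injection (SWIFI):
    the fault is introduced at run time, no simulation model needed."""
    def wrapper(*args, **kwargs):
        if random.random() < probability:
            raise exc
        return func(*args, **kwargs)
    return wrapper

# Hypothetical service call to perturb:
def read_block(block_id):
    return b"data-%d" % block_id

faulty_read = inject_fault(read_block, probability=0.3)

random.seed(42)
failures = 0
for i in range(1000):
    try:
        faulty_read(i)
    except IOError:
        failures += 1
print(failures)  # roughly 300 of 1000 calls fail
```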
Huawei: Butterfly Effect
The Butterfly Effect system automatically tests and repairs OpenStack and cloud applications (running on Huawei FusionSphere). It works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively (Failure → Test → Repair).
In chaos theory, the butterfly effect is the sensitive dependence on initial conditions, in which a small change in one state of a deterministic nonlinear system can result in large differences in a later state. [Wikipedia]
The Strategy
1. VM failures: send a VM creation request; find the compute node where the request was scheduled; damage the compute server; check whether the VM creation was re-scheduled to another node.
2. Disk temporarily unavailable: unmount a disk; wait for replicas to regenerate; remount the disk with the data intact; the extra replicas on handoff nodes should get removed.
3. Disk replacement: unmount a disk; wait for replicas to regenerate; delete the disk and remount it; wait for replicas to regenerate; the extra replicas on handoff nodes should get removed.
4. Replication: damage three disks at the same time (more if the replica count is higher); check that the replicas did not regenerate even after some time period; fail if the replicas regenerated (this tests whether the tests themselves are correct).
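A test like scenario 2 can be sketched as a small driver loop. Everything below is hypothetical: `FakeCluster` is a toy stand-in for a Swift cluster (a real test would issue ssh/bash commands against the storage nodes), and the timing model is trivially simplified.

```python
import time

def wait_until(predicate, timeout_s=5.0, poll_s=0.01):
    """Poll `predicate` until it is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return False

class FakeCluster:
    """Toy stand-in for a Swift cluster with 3 replicas per object;
    regeneration is instantaneous here, unlike on real hardware."""
    def __init__(self, replica_count=3):
        self.replica_count = replica_count
        self.disk_mounted = True
    def unmount_disk(self):
        self.disk_mounted = False
    def remount_disk(self):
        self.disk_mounted = True
    def replicas(self):
        # In this toy model the cluster always converges back to
        # the target replica count (via handoff nodes while a disk
        # is down, via cleanup after remount).
        return self.replica_count

def disk_temporarily_unavailable(cluster):
    """Scenario 2: unmount, wait for regeneration, remount, wait
    for the handoff replicas to be cleaned up again."""
    cluster.unmount_disk()
    assert wait_until(lambda: cluster.replicas() >= cluster.replica_count)
    cluster.remount_disk()
    assert wait_until(lambda: cluster.replicas() == cluster.replica_count)
    return True

print(disk_temporarily_unavailable(FakeCluster()))  # -> True
```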
Approach and Test Environment
• Fully automated and customizable
• Simple, using ssh and bash scripting
• Deploy and destroy: 2 hours to deploy an OpenStack infrastructure with 32 VMs... 32 seconds to destroy it
Stack (Huawei RH2288 + Fedora → Vagrant → VirtualBox → VMs):
• FusionServer RH2288 running Fedora
• Vagrant: provides easy-to-configure, reproducible, and portable environments for OpenStack; interfaces to VirtualBox, VMware, AWS, and other providers
• VirtualBox: free open-source hypervisor for x86 computers from Oracle; management of virtual machines
• RDO: freely available distribution of OpenStack from Red Hat (OpenStack Mitaka)
Service to Destroy
• Database
• Message Queue
• Authentication
• Hypervisor (Nova-Compute)
• Hard drive
Targets include nodes, services, processes, network, hypervisor, storage, etc.
The main testing framework of OpenStack is called Tempest, an open-source project with more than 2,000 tests: only black-box testing (tests only access the public interfaces).
Scenario Driven
Create Server scenario:
1. Request provisioning via UI/CLI
2. Validate auth data
3. Send API request to Nova API
4. Validate API token
5. Process API request
6. Publish provisioning request
7. Schedule provisioning
8. Start VM provisioning
9. Configure network
10. Request volume
11. Request VM image from Glance
12. Get image URL from Glance
13. Direct image file copy
14. Start VM rendering via hypervisor
Inject faults during the scenario using OpenStack operations such as: flavor create, flavor delete, flavor list, host list, hypervisor list, hypervisor show, image add project, image create, image delete, image list, image show, ip fixed add, ...
$ openstack server create --flavor m1.medium --image "fedora-23" --key-name ayoung-pubkey --security-group default --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
http://www.slideshare.net/mirantis/openstack-architecture-43160012
http://docs.openstack.org/developer/tempest/field_guide/scenario.html
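Scenario-driven injection can be sketched as running the provisioning steps in order and perturbing one chosen step. The step names and the driver below are hypothetical stand-ins, not the actual Butterfly Effect code.

```python
class InjectedFault(Exception):
    """Raised by the driver to simulate a failure at a chosen step."""
    pass

def run_scenario(steps, inject_at=None):
    """Run `steps` (name, action pairs) in order; raise InjectedFault
    just before the step named `inject_at`, mimicking a fault plan."""
    completed = []
    for name, action in steps:
        if name == inject_at:
            raise InjectedFault(name)
        action()
        completed.append(name)
    return completed

# Hypothetical, heavily simplified "create server" scenario:
steps = [
    ("validate_token", lambda: None),
    ("schedule",       lambda: None),
    ("fetch_image",    lambda: None),
    ("boot_vm",        lambda: None),
]

print(run_scenario(steps))            # all four steps complete
try:
    run_scenario(steps, inject_at="fetch_image")
except InjectedFault as f:
    # validate_token and schedule ran; fetch_image and boot_vm did not
    print("fault injected at:", f)
```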
Localized Injection
Faults are injected in a localized way, triggered either by system state (state-based) or at specific instants (time-based).
Faults to Inject
• Bit-flips in CPU registers/memory
• Memory errors: memory corruption, leaks, lack of memory
• Disk faults: read/write errors, lack of disk space
• Network faults: packet loss, network congestion, etc.
• Terminate instances
• Introduce delays in message delivery
• Corrupt data in the DB
• Service, process, and application crashes
• Reboot nodes
• Configuration errors
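One way to organize such a fault library is a simple catalog mapping fault names to the injection actions a driver would run on the target node. The catalog and the shell commands below are illustrative placeholders, not a tested fault plan.

```python
# Hypothetical fault catalog: each entry maps a fault name to the
# shell command a driver would execute on the target node over ssh.
FAULT_CATALOG = {
    "kill_service": "pkill -9 -f nova-compute",
    "fill_disk":    "fallocate -l 100G /tmp/balloon",
    "packet_loss":  "tc qdisc add dev eth0 root netem loss 30%",
    "reboot_node":  "systemctl reboot",
}

def build_fault_plan(fault_names):
    """Resolve a list of fault names into concrete commands,
    failing fast on unknown faults."""
    unknown = [f for f in fault_names if f not in FAULT_CATALOG]
    if unknown:
        raise KeyError("unknown faults: %s" % ", ".join(unknown))
    return [(f, FAULT_CATALOG[f]) for f in fault_names]

plan = build_fault_plan(["packet_loss", "kill_service"])
for name, cmd in plan:
    print(name, "->", cmd)
```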
Detect Failures
• Network tests: create keypairs, create security groups, create networks
• Compute tests: create a keypair, create a security group, boot an instance
• Swift tests: create a volume, get the volume, delete the volume
• Identity tests: ...
• Cinder tests: ...
• Glance tests: ...
$ tempest init cloud-01
$ cp tempest/etc/tempest.conf cloud-01/etc/
$ cd cloud-01
# Full test suite:
$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'
# Minimum basic test:
$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest.scenario.test_minimum_basic)'
Detect Failures
Running the full Tempest suite (Tempest 0: ~1,400 tests, 45 min-2 h) is too slow for damage detection. Successive reduced suites (Tempest 1, 2, 3) shrink the workload: from 100% (100 tests) down to 40% (40 tests) with overlapping tests, and, with mutually exclusive tests, down to ~5% (log2 40) and ~4% (log2 20) using branch and bound.
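The branch-and-bound idea above amounts to a binary search over groups of mutually exclusive tests: each round tests half of the remaining candidates, so a single broken component is found in log2(N) group runs instead of N individual runs. The sketch below uses a fabricated `oracle` in place of a real group test.

```python
def bisect_failure(components, is_healthy):
    """Binary-search for the single broken component by testing
    halves of the component list: log2(N) group tests instead of N."""
    lo, hi = 0, len(components)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_healthy(components[lo:mid]):
            lo = mid          # fault must be in the upper half
        else:
            hi = mid          # fault is in the lower half
    return components[lo]

# Toy oracle: pretend component "rabbitmq" is the broken one.
services = ["nova-api", "nova-scheduler", "rabbitmq", "glance-api",
            "keystone", "cinder-api", "neutron-server", "swift-proxy"]
oracle = lambda group: "rabbitmq" not in group
print(bisect_failure(services, oracle))  # -> rabbitmq
```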
Limitations of Integration Tests
• Side effects: integration tests often have side effects and require specific setups, so they often cannot be used in production systems. For example, integration tests which delete all the virtual machines running on a production platform cannot be run in production.
• Reuse: integration tests are composed of many types of tests (e.g., unit tests, API tests, integration tests, scenario tests, and positive and negative tests). Reusing test code for damage detection is useful, but the selection can be difficult.
• Filtering: most tests are not relevant for damage detection on production systems. While damage detection looks for components, services, and processes which are no longer working properly, tests determine whether code commits generate errors. When software code is tested, many functional tests are irrelevant in production.
• Specificity: new code for damage detection always needs to be developed, since testing does not typically look for problems that can happen when a system is in a particular operational state.
Butterfly Effect: Example of Fault Injection
International Industry-Academia Workshop on Cloud Reliability and Resilience
The International Industry-Academia Workshop on Cloud Reliability and Resilience was held in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from industry (Intel, Red Hat, Cisco, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade, T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon, U Coimbra).
(Photos: Dmitri Zimine (Brocade) giving his speech on workflows for auto-remediation (credits to Johannes Weingart); Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, with the workshop organizer Jorge Cardoso (Huawei).)
Current Team: Cloud Operations and Analytics
Objective
• Planet-scale distributed systems = automation
• Highly complex systems = AI and machine learning
Skills and knowledge
• OpenStack software development
• Machine learning and real-time analysis
• Reliability for cloud-native applications
• Large-scale distributed systems
Topics
• Working student: Distributed Execution Graphs (DEG) for OpenStack.
• Master students: Efficient diagnosis in cloud platforms; DEG-driven fault injection for cloud platforms.
• PhD students: Risk-aware cloud recovery using machine learning (automation + AI).
• Internship for PhD student: Next generation of DEG-driven systems beyond Google's Dapper and Twitter's Zipkin.
Open Positions
• Working & MSc students: fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, ...
• PhD students: rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months...
• Postdocs: solving difficult challenges of real problems using quick-and-dirty prototyping.
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive
statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time
without notice.
HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY
Executive Summary
The complexity and dynamicity of large-scale cloud platforms require automated solutions to reduce the risks of eventual failures.
Fault injection mechanisms make it possible to determine (and repair), under controlled conditions, the types of failures that platforms cannot tolerate, rather than taking a passive approach and waiting for Murphy's law to come into play on a Sunday at 2 am when engineers are off duty.
Pioneers such as Amazon, Google, and Netflix have already developed fault injection mechanisms and have also changed their mindset with respect to the importance of the resiliency of cloud platforms.
As an innovation topic, we take one step further towards fault-tolerant platforms by exploring not only fault injection but also the automated recovery of platforms.
SW Fault Injection Tools
• FIAT: Fault Injection-based Automated Testing environment, Carnegie Mellon University.
• EFI, PROFI: Processor Fault Injector, Dortmund University.
• FERRARI: Fault and ERRor Automatic Real-time Injector, University of Texas.
• SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn enviRonment, University of Michigan.
• FINE: Fault Injection and moNitoring Environment, University of Illinois.
• FTAPE: Fault Tolerance And Performance Evaluator, University of Illinois.
• Xception: University of Coimbra.
• MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
• BALLISTA: Carnegie Mellon University.