recapitulation workshop cloud reliability resilience 2016

Berlin, 7-8 November 2016

Ernst-Reuter-Platz 7, 10587 Berlin, Germany

International Industry-Academia Workshop on Cloud Reliability and Resilience


The International Industry-Academia Workshop on Cloud Reliability and Resilience was held in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from industry (Intel, Red Hat, Cisco, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade, T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon, U Coimbra).

Over the two days, participants discussed modern approaches to managing the reliability and resilience of cloud platforms and large-scale systems. The main topics were site reliability engineering, fault injection testing, auto-remediation, cloud standards, and dependable cloud storage.

Recapitulation

Dmitri Zimine (Brocade) giving his talk on workflows for auto-remediation (photo credit: Johannes Weingart).

During the workshop, participants agreed that reliability and resilience are increasingly important for companies operating large-scale or planet-scale systems. Well-known service providers such as LinkedIn, Google, Uber, Dropbox, Salesforce, Netflix, and New Relic are all adopting concepts borrowed from Site Reliability Engineering (SRE) and have developed homegrown frameworks, techniques, and tools to run numerous services as efficiently and reliably as possible. Techniques discussed include fault injection testing, workflows for auto-remediation, and the use of multiple clouds to build dependable storage. Areas that still need to be explored include service level objectives and agreements, real-time monitoring and analytics, failure prediction, predictive maintenance, and automated recovery.


Building Blocks for Site Reliability

Sebastian Kirsch, Google, Switzerland.

Breaking Azure for Fun and Profit

Pavel Michailov, Microsoft, US.

Using Event-driven Automation and Workflows for Auto-remediation

Dmitri Zimine, Brocade, US.

High Availability and Disaster Recovery in OpenStack: From humble beginnings to enterprise reliability

Florian Haas, Hastexo, Austria.

A Tale of Ice and Fire, or: The Cloud and The Standards

Michel Drescher, University of Oxford, UK.

I’m No Hero: Full Stack Reliability at LinkedIn

Todd Palino, LinkedIn, US.

Resilient Cloud Storage – The Consistency View

Neeraj Suri, TU Darmstadt, Germany.

A Cloud is Not Enough, Reliable Delivery Matters More

Ajay Gulati, ZeroStack, US.

Dependable Storage and Computing using Multiple Cloud Providers

Alysson Neves Bessani, University of Lisbon, Portugal.

Cloud Based Fault Injection for Anomaly Detection

Craig Sheridan, flexiOPS, UK.


The workshop also featured a panel titled “What’s next for Cloud Reliability and Resilience: Challenges, Opportunities, Technologies, Theories, and More...”, moderated by Dr. Goetz Brasche (Huawei), with the participation of Dr. Goetz Reinhaeckel (T-Systems), Dmitri Zimine (Brocade), Prof. Alysson Bessani (U Lisbon), and Sebastian Kirsch (Google).


Panel “What’s next for Cloud Reliability and Resilience: Challenges, Opportunities, Technologies, Theories, and More...”

We would like to express our thanks, first of all, to the invited speakers for sharing their expertise with the audience. They were the keystones of a program that we believe was exciting and of high quality. Next, we would like to thank the steering committee for its strategic guidance. Special thanks go to EIT Digital and the Huawei German Research Center for organizing and sponsoring the event.

The General Chairs

Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, and the workshop organizer Jorge Cardoso (Huawei).


Workshop Description

With the increasing adoption of and reliance on cloud platforms and services, it is undeniable that cloud computing is becoming a utility like water, energy, transportation, or telecommunications. This status brings with it the responsibility for providers to deliver highly available services.

Nonetheless, a study from Gartner [1], which analyzed outages over a period of 10 years, found that 47% of all documented problems were caused by cloud service outages. The duration of cloud outages ranged from 40 minutes to five days, with an average of 17 hours. The Ponemon Institute studied the financial impact of downtime at 41 data centers in the US [2] and found that an outage cost US$ 690,204 on average (ranging from US$ 74,223 to US$ 1,734,433). On average, data center downtime costs about US$ 6,828 per minute. These results highlight the economic impact of unplanned outages on cloud operations.

Thus, developing new strategies, techniques, and methods to evaluate and increase the reliability and resilience of cloud platforms from a software perspective is fundamental.

This workshop intends to bring together industry, academia, and regulators to identify the most relevant requirements in the field of cloud reliability and resilience, on the one hand, and existing state-of-the-art solutions, on the other. We invite engineers, scientists, and experts to discuss and contribute to the creation of a new generation of highly reliable cloud platforms.


Topics Covered

Challenges of data center reliability

Methods and algorithms for failure prediction

Damage detection and problem diagnosis

Automated repair and recovery of cloud systems

Disaster recovery in cloud computing

Fault injection as an approach to reliability

Evaluation of cloud platform reliability

Cloud reliability metrics and benchmarks

Service Level Agreements (SLAs) and reliability

Quality of Service (QoS) in the cloud

Standards, regulations, and legislation


Organization

General Chairs

Prof. Dr. Jorge Cardoso, Huawei GRC, Germany.

Henrik Abramowicz, EIT Digital, Sweden.

Steering Committee

Dr. Götz Reinhäckel, Head of Cloud Engineering, T-Systems International, Germany.

Dr. Jeff Voas, US National Institute of Standards and Technology (NIST), US.

Prof. Paulo Esteves Veríssimo, University of Luxembourg, Luxembourg.

Michel Drescher, Cloud Computing Standards Specialist, University of Oxford, UK.

Dr. Valentina Salapura, Chief Architect, Resiliency and Business Continuity, IBM, US.



