recapitulation workshop cloud reliability resilience 2016
Post on 14-Apr-2017
BERLIN 7-8 NOVEMBER 2016 6, Ernst-Reuter-Platz 7
10587 Berlin
Germany
International Industry-Academia Workshop on Cloud Reliability and Resilience
The International Industry-Academia Workshop on Cloud
Reliability and Resilience was held in Berlin on 7-8 November
2016. The workshop gathered close to 50 participants from
industry (Intel, Red Hat, Cisco, SAP, Google, LinkedIn,
Microsoft, Mirantis, Brocade, T-Systems, SysEleven, Deutsche
Telekom, Flexiant, Hastexo) and academia (TU Wien, TU
Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam,
TU Darmstadt, U Lisbon, U Coimbra).
Over two days, participants discussed modern approaches for
managing the reliability and resilience of cloud platforms and
large-scale systems. The main topics were site reliability
engineering, fault injection testing, auto-remediation, cloud
standards, and dependable cloud storage.
Recapitulation
Dmitri Zimine (Brocade) giving his speech on workflows
for auto-remediation (credits to Johannes Weingart).
During the workshop, participants agreed that reliability and resilience are increasingly important for companies
operating large-scale or planet-scale systems. Well-known service providers such as LinkedIn, Google, Uber, Dropbox,
Salesforce, Netflix, and New Relic are all adopting concepts borrowed from Site Reliability Engineering (SRE) and have
developed homegrown frameworks, techniques, and tools to run numerous services as efficiently and reliably as possible.
The techniques discussed included fault injection testing, workflows for auto-remediation, and the use of
multiple clouds to build dependable storage. Areas that still need to be explored include service
level objectives and agreements, real-time monitoring and analytics, failure prediction, predictive maintenance, and
automated recovery.
Building Blocks for Site Reliability
Sebastian Kirsch, Google, Switzerland.
Breaking Azure for Fun and Profit
Pavel Michailov, Microsoft, US.
Using Event-driven Automation and Workflows for Auto-remediation
Dmitri Zimine, Brocade, US.
High Availability and Disaster Recovery in OpenStack: From humble beginnings to enterprise reliability
Florian Haas, Hastexo, Austria.
A Tale of Ice and Fire, or: The Cloud and The Standards
Michel Drescher, University of Oxford, UK.
I’m No Hero: Full Stack Reliability at LinkedIn
Todd Palino, LinkedIn, US.
Resilient Cloud Storage – The Consistency View
Neeraj Suri, TU Darmstadt, Germany.
A Cloud is Not Enough, Reliable Delivery Matters More
Ajay Gulati, ZeroStack, US.
Dependable Storage and Computing using Multiple Cloud Providers
Alysson Neves Bessani, University of Lisbon, Portugal.
Cloud Based Fault Injection for Anomaly Detection
Craig Sheridan, flexiOPS, UK.
The workshop also featured a panel titled “What’s next
for Cloud Reliability and Resilience: Challenges,
Opportunities, Technologies, Theories, and More...”,
moderated by Dr. Götz Brasche (Huawei), with the
participation of Dr. Götz Reinhäckel (T-Systems), Dmitri
Zimine (Brocade), Prof. Alysson Bessani (U Lisbon), and
Sebastian Kirsch (Google).
Panel “What’s next for Cloud Reliability and Resilience:
Challenges, Opportunities, Technologies, Theories, and More...”
We would like to thank, first of all, the invited speakers for
sharing their expertise with the audience. They were the
keystones of a program that we believe was exciting and of
high quality. Next, we would like to thank the steering
committee for its strategic guidance. Special thanks go to
EIT Digital and the Huawei German Research Center for
organizing and sponsoring the event.
The General Chairs
Sebastian Kirsch (Google), co-author of the bestselling
book Site Reliability Engineering from Google, and the
workshop organizer Jorge Cardoso (Huawei).
Workshop Description
With the increasing adoption of, and reliance on, cloud platforms and services, cloud
computing is becoming a utility comparable to water, energy, transportation, or telecommunications.
This status makes providers responsible for ensuring that their services are highly available.
Nonetheless, a study from Gartner [1], which analyzed outages over a period of 10 years, found that 47% of
all documented problems were caused by cloud service outages. The duration of cloud outages ranged
between 40 minutes and five days, with an average duration of 17 hours. The Ponemon Institute studied
the financial impact of downtime at 41 data centers in the US [2] and found that an outage cost
US$ 690,204 on average (ranging from US$ 74,223 to US$ 1,734,433). On average, data center
downtime costs about US$ 6,828 per minute. These results underline the economic impact of
unplanned outages on cloud operations.
Thus, developing new strategies, techniques, and methods to evaluate and increase the reliability
and resilience of cloud platforms from a software perspective is fundamental.
This workshop intends to bring together industry, academia, and regulators to identify the most relevant
requirements in the field of cloud reliability and resilience, on the one hand, and existing state-of-the-art
solutions, on the other. We invite engineers, scientists, and experts to discuss and contribute to the creation
of a new generation of highly reliable cloud platforms.
Topics Covered
Challenges of data center reliability
Methods and algorithms for failure prediction
Damage detection and problem diagnosis
Automated repair and recovery of cloud systems
Disaster recovery in cloud computing
Fault injection as an approach for reliability
Evaluation of cloud platform reliability
Cloud reliability metrics and benchmarks
Service Level Agreement (SLA) and reliability
Quality of Service (QoS) in the cloud
Standards, regulations, and legislation
Organization
General Chairs
Prof. Dr. Jorge Cardoso, Huawei GRC, Germany
Henrik Abramowicz, EIT Digital, Sweden
Steering Committee
Dr. Götz Reinhäckel, Head of Cloud Engineering, T-Systems International, Germany.
Dr. Jeff Voas, US National Institute of Standards and Technology (NIST), US.
Prof. Paulo Esteves Veríssimo, University of Luxembourg, Luxembourg.
Michel Drescher, Cloud Computing Standards Specialist, University of Oxford, UK.
Dr. Valentina Salapura, Chief Architect, Resiliency and Business Continuity, IBM, US.
Workshop Location
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.