adrian cockcroft’s architecture

67

Upload: others

Post on 18-Dec-2021

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Adrian Cockcroft’s architecture
Page 2: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Adrian Cockcroft’s architecture trends and topics for 2021

Adrian Cockcroft

VP Cloud Architecture Strategy

AWS

A R C 2 1 3

Page 3: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Serverless first

Chaos engineering and failing over without falling over

Wardley Mapping for technology evolution

Hardware trends

Sustainability

Agenda

Page 4: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Serverless first

Page 5: Adrian Cockcroft’s architecture

Trends: Microservices

MicroXchg Berlin event where a group of speakers including @adrianco adopted the term

Page 6: Adrian Cockcroft’s architecture

Start with serverless microservices

Definition from a talk by @adrianco in 2014

Page 7: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

We chose microservices to get speed and agility . . .

So how long does it really take to build a new application from scratch?

Page 8: Adrian Cockcroft’s architecture

AWS re:InventN O N - P R O F I T H A C K A T H O N F O R S O C I A L G O O D

SkyTruth

Best Friends Animal Society

Several teams

of ~four people

Non-profit organization

with a problem

Monday, 8 AM

Here’s a set of

problems. Pick

one and build a

solution by 8 PM

this evening.

“ “+ =

Page 9: Adrian Cockcroft’s architecture

“ “SkyTruth

Best Friends Animal Society

Several teams

of ~four people

Non-profit organization

with a problem

Monday, 8 AM

Here’s a set of

problems. Pick

one and build a

solution by 8 PM

this evening.+ =

AWS re:InventN O N - P R O F I T H A C K A T H O N F O R S O C I A L G O O D

Where would you start?

Page 10: Adrian Cockcroft’s architecture

Every team used

AWS Lambda 1 Less than

a day total2

Extremely functional

and scalable prototypes4Many team

members’ first time

using AWS Lambda

3

2016 AWS re:Invent HackathonA L I G H T B U L B M O M E N T

Page 11: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Note: See re:Invent 2019, SVS343

So . . . why doesn’t everyone use serverless first?

Page 12: Adrian Cockcroft’s architecture

Language support

Scalability (new: 10 GB/6 vCPU)

Startup and network latency

Databases / storage interfacing

Security

State handling, event processing

Run duration

Complex configs (new: AWS Proton)

Portability (new: container support) Getting started

See my Serverless-First Function talk for a deep dive https://aws.amazon.com/blogs/compute/icymi-serverless-first-function/

Common concerns

Page 13: Adrian Cockcroft’s architecture

Update S E R V E R L E S S S C A L E

AWS Lambda per-function

maximum disk space

Local

512 MB

Amazon EFS

Petabytes

47.9 TiB per file

June 2020 introduction of support for NFS mounting an

Amazon EFS file system – up to 25,000 concurrent connectionshttps://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/

Page 14: Adrian Cockcroft’s architecture

UpdateS E R V E R L E S S - F I R S T C O N T A I N E R S

New support for

common build pipelines

delivering containers for

both runtimes

Serverless AWS Fargate

re:Invent 2020 introduction of AWS Lambda support for container packaging

and AWS Proton delivery tooling makes serverless first an even better choice!

Easier transition from spiky

workloads to Always On capacity

Page 15: Adrian Cockcroft’s architecture

Bottom line

Serverless is the fastest way

to build a modern application

If you object, let us know what

we should work on next!

Page 16: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Chaos engineering and failing over without falling over

Page 17: Adrian Cockcroft’s architecture

Chaos engineering definition(Adrian’s concise version)

Experiment to ensure that the impact of failures is mitigated

Page 18: Adrian Cockcroft’s architecture

Trends – chaos engineering – 10 years on . . .

2010 – First Netflix blog post about Chaos Monkey

Page 19: Adrian Cockcroft’s architecture

Past Present Future

?

Disasterrecovery

Chaosengineering

Continuousresilience

Resilience

Page 20: Adrian Cockcroft’s architecture

We build redundancy into systems so that if

something fails, we can fail over to an alternative

However, our ability to fail over is complex and

hard to test, so often the whole system falls over

Page 21: Adrian Cockcroft’s architecture

How can we do better?

Page 22: Adrian Cockcroft’s architecture

The last strand that breaks is not the cause of a failure!

Build resilient systems like a rope, not a chain, but make sure you know how

much margin you have and how “frayed” your system is

Page 23: Adrian Cockcroft’s architecture

Infrastructure

Switching

Application

People

Chaos

engineering

team

To

ols

Security

red

team

To

ols

Four layers, two teams, an attitude

Experiment to ensure that the effect of failures is mitigated

Chaos architecture

New! AWS Fault Injection Simulator

Run experiments to ensure that both availability and performance impact of failures are mitigated

Page 24: Adrian Cockcroft’s architecture

“You can’t legislateagainst failure; focus on

fast detection and response.”

—Chris Pinkham

Observability Control

Page 25: Adrian Cockcroft’s architecture

Let’s see what we can learn from experts who have been

working on controlling safety critical systems for decades

Page 26: Adrian Cockcroft’s architecture

Engineering a Safer WorldSystems Thinking Applied to Safety – 2012

Professor Nancy G. Leveson – MIT

STPA – Systems Theoretic Process Analysis

STAMP – Systems Theoretic Accident Model & Processes

http://psas.scripts.mit.edu for handbook and talks

Page 27: Adrian Cockcroft’s architecture

STPA model

control diagramUnderstand hazards that

could disrupt successful

application processing

Observabilityand control

Data plane

Throughput

Control plane

Human controller

Model of automation

Model of controlled process

DisplaysControls

Control algorithmModel of

controlled process

Web service

SensorsActuators

Control action generation

Environmental inputs

Written / trained procedures

Completed actions

Customerrequests

Disturbances

Page 28: Adrian Cockcroft’s architecture

Data plane

Throughput

Control plane

Human controller

Model of automation

Model of controlled process

DisplaysControls

Control algorithmModel of

controlled process

Web service

SensorsActuators

Control action generation

Environmental inputs

Written / trained procedures

Completed actions

Customerrequests

Disturbances

What happens

if there is a

big enough

disturbance

to break the

web service?

Large-scale failure

Page 29: Adrian Cockcroft’s architecture

Application is “out of control,” many possibilities

Automation has failed

Network partition, no route connecting application to customers

Application crashed or corrupted, not easily restartable

Anything else you didn’t think of

Trigger failover to alternative system

Large-scale failures

Page 30: Adrian Cockcroft’s architecture

Human controllers

Router control plane

Routing service

Control plane A

Web service A

Availability Zone A

Control plane B

Web service B

Availability Zone B

Control plane C

Web service C

Availability Zone C

Symmetry

and assertions• Services and data

are consistent across

three zones

• Zone failure modes

are independent

• Application should

work normally with

any zone offline

• Routing service

manages failover

ScenarioA W S A V A I L A B I L I T Y Z O N E S

Page 31: Adrian Cockcroft’s architecture

Human controllers

Routing service

Control plane A

Web service A

Availability Zone A

Control plane B

Web service B

Availability Zone B

• Router control plane

detects offline AZ, stops

routing traffic

to it, and retries requests

on the online AZs

• Automated response;

what could go wrong?

Router control plane

ScenarioA W S A V A I L A B I L I T Y Z O N E S

Page 32: Adrian Cockcroft’s architecture

Human controllers

Router control plane

Routing service

Control plane A

Web service A

Availability Zone A

Control plane B

Web service B

Availability Zone B

• Missing updates

• Zeroed

• Overflowed

• Corrupted

• Out of order

• Updates too rapid

• Updates infrequent

• Updates delayed

• Coordination problems

• Degradation over time

STPA hazardsS E N S O R M E T R I C S C H E C K L I S T

Routing control

plane doesn’t clearly

inform humans that

everything is taken

care of, and offline

zone delays and

breaks other metrics

with a flood of errors

Page 33: Adrian Cockcroft’s architecture

Human controllers

Router control plane

Routing service

Control plane A

Web service A

Availability Zone A

Control plane B

Web service B

Availability Zone B

• Model mismatch

• Missing inputs

• Missing updates

• Updates too rapid

• Updates infrequent

• Updates delayed

• Coordination problems

• Degradation over time

STPA hazardsM O D E L P R O B L E M S C H E C K L I S T

Confused human

controllers disagree

among themselves

about whether

they need to do

something or not,

with floods of errors,

displays that lag

reality by several

minutes, and out-of-

date runbooks

Page 34: Adrian Cockcroft’s architecture

Human controllers

Router control plane

Routing service

Control plane A

Web service A

Availability Zone A

Control plane B

Web service B

Availability Zone B

• Not provided

• Unsafe action

• Safe but too early

• Safe but too late

• Wrong sequence

• Stopped too soon

• Applied too long

• Conflicts

• Coordination problems

• Degradation over time

STPA hazardsH U M A N C O N T R O L A C T I O N C H E C K L I S T

Human controllers

should not need

to do anything!

However, they

are confused and

working separately,

trying to fix different

problems – some of

their tools don’t get

used often, and

are broken or

misconfigured to do

the wrong thing

Page 35: Adrian Cockcroft’s architecture

Human controllers

Instead of failing

over, system

falls over

Most likelyresult

In-rush of extra

traffic from failed

zone, and extra work

from a cross zone

request retry storm

causes zones A and B

to struggle and

triggers a complete

failure of the

application –

meanwhile, the

routing service also

has a retry storm

and is impacted

Page 36: Adrian Cockcroft’s architecture

How to fail over without falling over

Alert correlation

Floods of alerts need to be reduced to actionable insights (new: Amazon DevOps Guru)

Observability system needs to cope with floods without failing

Run regular chaos engineering experiments (new: AWS Fault Injection Simulator)

Retry storms – prevent work amplification

Reduce retries to zero except at subsystem entry and exit points

Reduce timeouts to drop orphaned requests

Route calls within the same zone

Page 37: Adrian Cockcroft’s architecture

Symmetries

High level of automation, consistent configuration as code

Consistent instance types, services, versions, zones, and Regions

Principles

If it can be the same, make it look and act identically

If it’s different, make that clearly visible

Test your assumptions continuously

Page 38: Adrian Cockcroft’s architecture

What about multi-Region?

Get AZ failover solid

before attempting

multi-Region

Use STPA to

analyze multi-Region-

specific hazards

Follow

Well-Architected

guide patterns

Page 39: Adrian Cockcroft’s architecture

As data centers migrate to the cloud, fragile and manual disaster recovery processes have been standardized and automated

Page 40: Adrian Cockcroft’s architecture

Testing failure mitigation has moved from a scary annual experience to automated

continuous resilience

Page 41: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Wardley Maps

Page 42: Adrian Cockcroft’s architecture

Early adopters are using Wardley Maps

Page 43: Adrian Cockcroft’s architecture

Maps always allow you to define position and movement

What is a Wardley Map?

Wardley Maps help define strategic context, barriers, and actions

Examples

• Evolution of technology from custom to utility• Technology stack migrations from data center to cloud• Make vs. buy decisions• Open-source product strategy

Page 44: Adrian Cockcroft’s architecture

Mapping open source for Kubernetes

Too many

projects in the cloud-

native landscape

All in

different stages

of evolution

So let’s try and

put some of them

on a map!

Page 45: Adrian Cockcroft’s architecture

CNCF landscapeH T T P S : / / L A N D S C A P E . C N C F . I O – N O T A M A P , A N E Y E C H A R T !

Page 46: Adrian Cockcroft’s architecture

Projects outside CNCF CNCF projects Services

Mapping value chain against evolutionT H I S I S A M A P ! C L O U D - N A T I V E K U B E R N E T E S C O R E P R O J E C T S

Page 47: Adrian Cockcroft’s architecture

Projects outside CNCF CNCF projects Services

New! AWS Grafana and Prometheus services R E M E M B E R : A L L M A P S A R E W R O N G , B U T S O M E M A P S A R E U S E F U L – U S E T O F A C I L I T A T E D I S C U S S I O N

Page 48: Adrian Cockcroft’s architecture

Results from mapping

Successful products and services due to better:

Shared understanding of moves

Situational awareness

Application of doctrines

Strategic decision-making

Page 49: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Hardware trends

Page 50: Adrian Cockcroft’s architecture

Data that’s too big to process on one machine

What is “big data”?

But the machines keep getting bigger

Page 51: Adrian Cockcroft’s architecture

EC2 UltraClusters4 , 0 0 0 + A 1 0 0 G P U S O N 4 0 0 G B I T L I N K S I N T O P E T A B I T - S C A L E F A B R I C

Page 52: Adrian Cockcroft’s architecture

Gigabits

bandwidth

10,000

100

10

320 GB 40 GB1,100 GB24,000 GB4,400,000 GB

1,000

1

Memory capacity vs. bandwidth

Engulf your data

in memory to

reduce overhead

Internal memory bandwidth of A100

Terabyte+/second

Network bandwidth per node

comparable to Intel Cascade

Lake main memory bandwidth

EC2 UltraCluster

UltraCluster 4,000+ P4d

400 Gbit Ethernet

Network bandwidth

per node

AWS u24-tb1

8x Intel

Cascade Lake

main memory

Single

NVIDIA A100

1.2 TB/s

GPU memory

AWS P4d

8x NVIDIA A100

NVLINK 600 GB/s

GPU memory

AWS P4d

2x Intel Cascade Lake

main memory

Memory capacity

Page 53: Adrian Cockcroft’s architecture

How can we take advantage of more memory?

In-memory analytics

(e.g., SAP HANA) is

mainstream today

Very large

graph databases(it’s hard to partition graphs)

Use the memory

interconnect as a low-

latency network?

Put all the pods

and instances on

the same node?

Fast replication to

another node, in a

different AZ, for

persistence?

Use shared

memory to avoid

serialization overhead?

Page 54: Adrian Cockcroft’s architecture

Interesting work to watch

Twizzler

UC Santa Cruz / Daniel

Bittman – data-oriented

operating system

Brytlyt

Database that runs

completely inside the GPU

MemVerge

Distributed

memory objects

SQream

GPU accelerated

data warehouse

Zenotech

GPU-based CFD simulation

Page 55: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Sustainability

Page 56: Adrian Cockcroft’s architecture

The Climate Pledge

Reach the Paris Agreement

BY 2040

Climate Pledge fund

$2 BILLION

Signatories

https://www.theclimatepledge.com

Page 57: Adrian Cockcroft’s architecture

Renewable Energy

https://sustainability.aboutamazon.com/environment/sustainable-operations/renewable-energy

Page 58: Adrian Cockcroft’s architecture

Reducing carbon by moving to AWS

https://sustainability.aboutamazon.com/environment/the-cloud

Page 59: Adrian Cockcroft’s architecture

Amazon Sustainability Data Initiative significantly reduces the

cost, time, and technical barriers

associated with analyzing large datasets to

generate sustainability insights

Data Insights

Information Actionable knowledge

Page 60: Adrian Cockcroft’s architecture

We

ath

er

& c

lim

ate

Re

mo

te s

en

sin

g

CMIP6 climate projections

LOCA downscaled climate projections

Atmospheric Models from Météo-France

NOAA’s Global Surface Summary of Day

HIRLAM Weather Model

NOAA Global Ensemble Forecast System (GEFS)

NOAA Global Forecast System (GFS)

NOAA High-Resolution Rapid Refresh (HRRR) Model

JMA Himawari-8

Landsat 8

NOAA GOES 16 & 17

Sentinel-1

MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4

Terra Fusion Data Sampler

CBERS

Enabling easy access to petabytes of foundational data

Hosted free of charge in public Amazon S3 buckets

Managed and updated by data owners

Amazon Sustainability Data Initiative – ASDI

Page 61: Adrian Cockcroft’s architecture

OS-Climate is an open-source platform to enable access to critical

data and models to support climate risk assessment

https://www.os-climate.org/

Climate risk: OS-Climate

Page 62: Adrian Cockcroft’s architecture

AWS contributions to OS-Climate

https://www.os-climate.org/

Page 63: Adrian Cockcroft’s architecture

CALL TO ACTION

Use AWS services and open data programs to architect solutions and achieve your sustainability goals

Look for opportunities to optimize your architecture

Share the solutions you build to support sustainability goals and collaborate with others in your industry and cross-sector

Page 64: Adrian Cockcroft’s architecture

Resources

Related re:Invent 2020 talks

ARC207 – Architecting resiliency, recovery, and reliability: Capital One

ARC307 – Going global in minutes: Multi-Region expansion simplified (Autodesk)

ARC316 – Testing resiliency using chaos engineering

Related blog posts

“Failing over without falling over”:https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery

Full Serverless-First talks: https://aws.amazon.com/blogs/compute/icymi-serverless-first-function

Response time variation: https://dev.to/aws/why-are-services-slow-sometimes-mn3

Retries and timeouts: https://dev.to/aws/if-at-first-you-don-t-get-an-answer-3e85

Paper

“Building Mission-Critical Financial Services Applications on AWS,” by Pawan Agnihotri with contributions by Adrian Cockcroft: https://d1.awsstatic.com/Financial%20Services/Resilient%20Applications%20on%20AWS%20for%20Financial%20Services.pdf

Book recommendations

Reading list at: http://a.co/79CGMfB

Adrian Cockcroft@adrianco

Page 65: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AWS Well-Architected Framework

The official best practices for architecting in the AWS Cloud

aws.amazon.com/architecture/well-architected

AWS Well-Architected Labs

Hands-on labs to help you learn, measure, and build using architectural best practices

wellarchitectedlabs.com

AWS Architecture Center

Official AWS repository for all architecture resources

aws.amazon.com/architecture

AWS Solutions Library

Vetted reference implementations and Well-Architected patterns

aws.amazon.com/solutions

Architecture resources

Page 66: Adrian Cockcroft’s architecture

Thank you!

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Adrian Cockcroft

@adrianco

Page 67: Adrian Cockcroft’s architecture

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.