in this talk - qcon · @nora_js “we ran a chaos experiment which verifies that our fallback path...

69

Upload: others

Post on 12-Mar-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 2: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

In this talk

●●●●

@nora_js

Page 3: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

In this talk

●●●●

@nora_js

Page 4: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Known ways of testing for availability

●●●●

Page 5: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 6: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 7: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

You can’t keep blaming your cloud provider

@nora_js

Page 8: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Why is there a fear of Chaos when it’s inevitable?

Page 9: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 10: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Meet “Chaos Carol”

Page 11: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Where is Carol starting her Chaos?

Page 12: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 13: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Start with a steady state

@nora_js

Page 14: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 15: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

There isn’t always money in microservices

Page 16: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Randomly turn things off?

Recreate things that already happened?

Page 17: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 18: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Let people know? Let the Chaos run automatically?

Page 19: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 20: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 21: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 22: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 23: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Socialization

●●

Page 24: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 25: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 26: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 27: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 28: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

● Focus more on asking the questions, rather than answering them.

● Find customers willing to try first. Then share their stories.● Be honest. Don’t make false promises about what Chaos

will do.

Page 29: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 30: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 31: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

●●

Page 32: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 33: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 34: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 35: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 36: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 37: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 38: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 39: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 40: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 41: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 42: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 43: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 44: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 45: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 46: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 47: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 48: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 49: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 50: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 51: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 52: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 53: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 54: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

ChAP

●●●

@nora_js

Page 55: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 56: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 57: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 58: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 59: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 60: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Targeted Chaos: Kafka Problems

●●

@nora_js

Page 61: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Targeted Chaos: Kafka Ideas

●●●●●●

@nora_js

Page 62: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 63: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback
Page 64: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

@nora_js

“We ran a chaos experiment which verifies that our fallback path works

(crucial for our availability) and it successfully caught a issue in the fallback path and the issue was

resolved before it resulted in any availability incident!”

Page 65: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. This likely

means that whoever was consuming the fallback was retrying the call, causing an increase in

license requests.”

@nora_js

Page 66: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

●●

@nora_js

Page 67: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Should you develop experiments for the service teams?

Let them do it on their own?

Page 68: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback

Takeaways

@nora_js

Page 69: In this talk - QCon · @nora_js “We ran a chaos experiment which verifies that our fallback path works (crucial for our availability) and it successfully caught a issue in the fallback