JUST EAT: Embracing DevOps
Or: How we make a Windows-based ecommerce platform work (with AWS)
@petemounce & @justeat_tech

Uploaded by peter-mounce on 25-Jun-2015

TRANSCRIPT

Page 1: JUST EAT: Embracing DevOps

JUST EAT: Who are we?

● In business since 2001 in DK, 2005 in UK
● Tech team is ~50 people in UK, ~20 people in Ukraine
● Cloud native in AWS
  ○ Except for the bits that aren’t (yet)
● Very predictable load
● ~900 orders/minute at peak in UK
● We’re recruiting!
  ○ http://tech.just-eat.com/jobs/
  ○ http://tech.just-eat.com/jobs/senior-software-engineer-platform-services/
  ○ Lots of other roles

Oh, yeah - we do online takeaway.

We’re an extra sales channel for our restaurant partners.

We do the online part.

Challenging!

We make this work. On Windows.

What are we?

We do high-volume ecommerce.

Windows platform.

Most production code is C#, .NET 4 or 4.5.

Most automation is Ruby 1.9.x, plus some PowerShell.

Ongoing legacy transformation; no big rewrites.

Splitting up a monolithic system into SOA/APIs, incrementally.

Architecture, before AWS

Data centre life, pre-2013

Physical hardware

Snowflake servers - no configuration management tooling

Manual deployments, done by operations team

No real-time monitoring: SQL queries only

Monolithic applications, not much fast-running test coverage

… But at least we had source control and decent continuous integration! (since 2010)

Architecture, post AWS migration

Estate & High Availability by default

At peak, we run ~500-600 EC2 instances

We migrated from the single data centre in DK to eu-west-1.

We run everything multi-AZ and auto-scaling by default. (Almost.)

Delivery pipeline

Very standard. Nothing to see here.

Multi-tenant.

Tenants are isolated from each other to avoid bad-neighbour issues, and are individually scalable.

This basically means our tools take a tenant parameter as well as an environment parameter.
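
A minimal sketch of what that can look like for a command-line tool, using Python's standard `argparse`. The tenant and environment names and the stack-naming scheme are hypothetical, for illustration only.

```python
import argparse

# Hypothetical tenant and environment names, for illustration only.
TENANTS = ["uk", "dk"]
ENVIRONMENTS = ["qa", "staging", "production"]


def parse_args(argv=None):
    """Every tool takes a tenant as well as an environment."""
    parser = argparse.ArgumentParser(description="Act on one platform feature")
    parser.add_argument("feature", help="platform feature name, e.g. publicweb")
    parser.add_argument("--tenant", required=True, choices=TENANTS)
    parser.add_argument("--environment", required=True, choices=ENVIRONMENTS)
    return parser.parse_args(argv)


def stack_name(feature, tenant, environment):
    """Hypothetical naming scheme: one isolated stack per feature, tenant and environment."""
    return f"{feature}-{tenant}-{environment}"
```

For example, `parse_args(["publicweb", "--tenant", "uk", "--environment", "qa"])` followed by `stack_name(...)` yields `publicweb-uk-qa`. Putting the tenant in the name gives each tenant its own stack, which is one way to get the isolation and independent scaling described above.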

Tech organisation structure

We stole from AWS: “two-pizza teams” (we understand metrics couched in terms of food).

We have a team each for:
● consumer web app
● consumer native apps (one iOS, one Android)
● restaurant apps
● business-support apps
● APIs (actually, four teams in one unit)
● PaaS
  ○ responsible for internal services: monitoring/alerting/logs
  ○ systems automation

Tech culture

“You ship it, you operate it”

Each team owns their own features, infrastructure-up.

Minimise dependencies between teams.

Each team has autonomy to work on what they want within some constraints.

Rules:
● don’t break backwards compatibility
● use what you want, but operate it yourself
● other teams must be able to launch & verify your stuff in their environments
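
“Don’t break backwards compatibility” is easier to live by when it can be checked mechanically. A minimal Python sketch, assuming JSON-style payloads: a newer response may add fields, but must not remove or change the type of fields that older consumers rely on. The function and the sample payloads are hypothetical, not JUST EAT’s tooling.

```python
def breaking_changes(old, new, path=""):
    """List fields of `old` that `new` removes or changes the type of.

    Adding fields is allowed; removing or retyping them is a breaking change.
    Nested dicts are compared recursively.
    """
    problems = []
    for key, old_value in old.items():
        where = f"{path}.{key}" if path else key
        if key not in new:
            problems.append(f"removed: {where}")
        elif type(new[key]) is not type(old_value):
            problems.append(f"retyped: {where}")
        elif isinstance(old_value, dict):
            problems.extend(breaking_changes(old_value, new[key], where))
    return problems
```

Run in CI against a recorded response from the current version and one from the candidate version, an empty list suggests existing consumers should keep working.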

But how?

Table stakes for this to work (well):

1. Persistent group chat

2. Real-time monitoring

3. Real-time alerting

4. Centralised logging

Make it easier to debug in production without a debugger.

Persistent group chat

We use HipChat.

You could use IRC / Campfire / Hangouts.

● Persistent - jump in, read up

● Searchable history

● Integrate other tools to it

● hubot for fun and profit
  ○ @jebot trg pd emergency with msg “we’re out of champagne in the office fridge”
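
Hubot scripts are normally written in CoffeeScript or JavaScript and register a regular expression per command. To show the idea without claiming to reproduce `@jebot`, this Python sketch parses the trigger command from the slide. The grammar (a service name, then a quoted message) is inferred from that single example and is hypothetical; the actual PagerDuty call is left out.

```python
import re

# Grammar inferred from the one example above; hypothetical.
TRIGGER = re.compile(
    r'^@jebot\s+trg\s+pd\s+(?P<service>\S+)\s+with\s+msg\s+["“](?P<message>.+?)["”]\s*$'
)


def parse_trigger(text):
    """Return (service, message) for a 'trigger PagerDuty' chat command, else None."""
    match = TRIGGER.match(text)
    if match is None:
        return None
    return match.group("service"), match.group("message")
```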

Real-time monitoring

Microsoft’s SCOM requires an Active Directory (AD), which we don’t have.

Publish OS-level performance counters with perftap, a Windows analogue of collectd that we found and customised

Receive metrics into statsd

Visualise time-series data with graphite
  ○ 10s granularity retained for 13 months
  ○ AWS’ CloudWatch gives you 1min / 2 weeks

Addictive!
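
perftap’s own configuration isn’t covered here, so as a hedged illustration of what “receive metrics into statsd” involves on the wire, this Python sketch formats metrics in the statsd plain-text line protocol (`<name>:<value>|<type>`) and sends them over UDP. Metric names are hypothetical; 8125 is statsd’s conventional default port.

```python
import socket


def format_metric(name, value, metric_type):
    """Render one metric in the statsd line protocol: <name>:<value>|<type>.

    metric_type is "c" (counter), "g" (gauge) or "ms" (timer).
    """
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported statsd type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP, as statsd clients usually do."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

UDP keeps publishing cheap and non-blocking: if the statsd server is down, the application loses a data point rather than stalling.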

Real-time alerting

This is the 21st century; emailing someone their server is down doesn’t cut it.

seyren runs our checks.

Publishes to:
● HipChat
● PagerDuty
● SMS
● statsd event metrics (coming soon, hopefully)
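
seyren-style checks compare a Graphite metric against a warn threshold and an error threshold, and notify subscribers when a check changes state. The Python sketch below is loosely modelled on that behaviour, not taken from seyren; the check name is hypothetical, and plain callables stand in for the HipChat, PagerDuty and SMS integrations.

```python
from enum import Enum


class State(Enum):
    OK = 0
    WARN = 1
    ERROR = 2


def evaluate(value, warn, error):
    """Classify a value against warn/error thresholds, where higher is worse."""
    if value >= error:
        return State.ERROR
    if value >= warn:
        return State.WARN
    return State.OK


def run_check(name, value, warn, error, previous, notifiers):
    """Notify every channel only when the state changes; return the new state."""
    current = evaluate(value, warn, error)
    if current != previous:
        message = f"{name} is {current.name} (value={value})"
        for notify in notifiers:
            notify(message)
    return current
```

Alerting on state changes rather than on every evaluation is what keeps a persistently broken metric from paging someone every ten seconds.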

Centralised logging

Windows doesn’t have syslog.

The out-of-the-box EventLog isn’t a good enough substitute.

Publish logs via nxlog agent.

Receive logs into logstash cluster.

Filter, transform and enrich into elasticsearch cluster.

Query, visualise and dashboard via kibana.
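
In this pipeline the “enrich” step happens in logstash filters, which use logstash’s own configuration syntax. As an illustration of the idea only, this Python sketch parses one JSON log line and adds instance context (tenant, environment, feature) so events can be sliced in kibana. The field names are hypothetical.

```python
import json


def enrich(raw_line, context):
    """Parse one JSON log line and add instance context fields.

    Fields already present in the event win over the context, so an
    application can override them. Non-JSON lines are wrapped as a message.
    """
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        event = {"message": raw_line}
    if not isinstance(event, dict):
        event = {"message": raw_line}
    return {**context, **event}
```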

Without these things, operating a distributed system on Windows is hard.

Windows at scale assumes that you have an Active Directory. We don’t.

● No Windows network load-balancing.
● No centrally trusted authentication.
● No central monitoring (SCOM) to harvest performance counters.
● No easy remote command execution (WinRM wants an AD, too).
● Other stuff; these are the highlights.

Open source & build vs buy

We treat Microsoft as just another third-party vendor dependency.

We lean on open-source libraries and tools a lot.

Anatomy of a feature

We decompose the platform into its component parts.

Imaginatively, we call these “platform features”.

For example:
● consumer web app == publicweb
● back office tools == handle, guard
● etc.

Platform features

Features are defined by AWS CloudFormation.

● Everything is pull-deployed from S3.

● No state is kept (for long) on the instance itself.

● No external actor can tell an instance to do something, beyond what the feature itself allows.

Instances boot, then bootstrap themselves from content in S3, driven by AWS::CloudFormation::Init metadata.
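
To make the pull-deployment model concrete, here is a minimal Python sketch that builds a LaunchConfiguration resource. Its AWS::CloudFormation::Init metadata mirrors the four steps this deck lists for the code package (unzip, install dependencies, deploy, warm up), and its UserData runs cfn-init on first boot. The bucket URL, install directory, script names and configSet name are hypothetical. The logical ID passed to `-r` must match the resource’s key in the template, and a private bucket would also need an AWS::CloudFormation::Authentication block plus an instance role, both omitted here.

```python
import json

FEATURE_DIR = "C:\\feature"  # hypothetical install directory on the instance


def _command(command):
    """One cfn-init command config, run from the unpacked package directory."""
    return {
        "commands": {
            "01": {
                "command": command,
                "cwd": FEATURE_DIR,
                # Windows-only key: don't wait (default is 60s) after the command.
                "waitAfterCompletion": "0",
            }
        }
    }


def launch_config(package_url, ami_id, instance_type="m3.medium"):
    """LaunchConfiguration whose instances pull and deploy a code package from S3."""
    return {
        "Type": "AWS::AutoScaling::LaunchConfiguration",
        "Metadata": {
            "AWS::CloudFormation::Init": {
                "configSets": {"bootstrap": ["unzip", "dependencies", "deploy", "warmup"]},
                # cfn-init downloads the zip and extracts it into FEATURE_DIR.
                "unzip": {"sources": {FEATURE_DIR: package_url}},
                # Hypothetical commands and script names.
                "dependencies": _command("bundle install"),
                "deploy": _command("bundle exec ruby deploy.rb"),
                "warmup": _command("bundle exec ruby warmup.rb"),
            }
        },
        "Properties": {
            "ImageId": ami_id,
            "InstanceType": instance_type,
            # On first boot, run cfn-init against this resource's own metadata.
            "UserData": {"Fn::Base64": {"Fn::Join": ["", [
                "<script>\n",
                "cfn-init.exe -v -c bootstrap -r LaunchConfig",
                " -s ", {"Ref": "AWS::StackName"},
                " --region ", {"Ref": "AWS::Region"}, "\n",
                "</script>",
            ]]}},
        },
    }


def template_json(package_url, ami_id):
    """Wrap the resource under the logical ID that cfn-init's -r flag names."""
    resources = {"LaunchConfig": launch_config(package_url, ami_id)}
    return json.dumps({"Resources": resources}, indent=2)
```

Nothing pushes to the instance: it reads its own stack’s metadata and pulls the package, which matches the “no external actor can tell an instance to do something” rule.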

Platform feature: Servers

We have several “baseline” AMIs.

These contain required system dependencies, such as the .NET Framework, Ruby and 7-Zip.

Periodically, we update them with OS-level patches, roll out new baseline AMIs, and deprecate the older ones.
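
Picking “the current baseline” can be automated. A minimal Python sketch, assuming image records shaped like the `Images` list returned by EC2’s DescribeImages (`ImageId`, `Name`, `CreationDate`) and a hypothetical `<family>-<date>` naming convention: it returns the newest AMI in a family and the older ones that are candidates for deprecation.

```python
def pick_baseline(images, family):
    """Split one baseline family into the newest AMI and older ones to deprecate."""
    candidates = [i for i in images if i["Name"].startswith(family + "-")]
    if not candidates:
        raise LookupError(f"no baseline AMIs for {family}")
    # ISO 8601 timestamps in one fixed format sort correctly as strings.
    ordered = sorted(candidates, key=lambda i: i["CreationDate"], reverse=True)
    newest, older = ordered[0], ordered[1:]
    return newest["ImageId"], [i["ImageId"] for i in older]
```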

Platform feature: Infrastructure

Each feature’s infrastructure is defined by CloudFormation. Each template stands up everything that feature needs to run, excluding cross-cutting dependencies (such as DNS and firewall rules).

Mostly standard:
● ELB
● Auto Scaling group + launch configuration
● IAM as necessary
● … anything else required by the feature
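
A hedged sketch of that “mostly standard” stack as a Python dict that serialises to CloudFormation JSON: an ELB, an IAM role and instance profile, a launch configuration, and a multi-AZ Auto Scaling group. Sizes, the instance type, the health-check path and the parameter name are hypothetical; DNS and firewall rules are left out, as described above.

```python
import json

EC2_TRUST = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": ["ec2.amazonaws.com"]},
        "Action": ["sts:AssumeRole"],
    }],
}


def feature_stack(feature, min_size=2, max_size=6):
    """Skeleton CloudFormation template for one platform feature (hypothetical shape)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"{feature} platform feature",
        "Parameters": {"BaselineAmi": {"Type": "AWS::EC2::Image::Id"}},
        "Resources": {
            "LoadBalancer": {
                "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "Listeners": [{"LoadBalancerPort": "80", "InstancePort": "80",
                                   "Protocol": "HTTP"}],
                    # Hypothetical health endpoint.
                    "HealthCheck": {"Target": "HTTP:80/health", "Interval": "10",
                                    "Timeout": "5", "HealthyThreshold": "2",
                                    "UnhealthyThreshold": "3"},
                },
            },
            "Role": {"Type": "AWS::IAM::Role",
                     "Properties": {"AssumeRolePolicyDocument": EC2_TRUST, "Path": "/"}},
            "InstanceProfile": {"Type": "AWS::IAM::InstanceProfile",
                                "Properties": {"Path": "/", "Roles": [{"Ref": "Role"}]}},
            "LaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {"ImageId": {"Ref": "BaselineAmi"},
                               "InstanceType": "m3.medium",
                               "IamInstanceProfile": {"Ref": "InstanceProfile"}},
            },
            "AutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    # Spread across every AZ in the region: multi-AZ by default.
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "LaunchConfigurationName": {"Ref": "LaunchConfig"},
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    "LoadBalancerNames": [{"Ref": "LoadBalancer"}],
                    # Also replace instances that fail the ELB health check.
                    "HealthCheckType": "ELB",
                    "HealthCheckGracePeriod": 600,
                },
            },
        },
    }
```

`json.dumps(feature_stack("publicweb"), indent=2)` produces a template body; giving each feature its own stack means it can be stood up, scaled or torn down without touching its neighbours.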

Platform feature: code package

● A standardised package containing
  ○ built code (website, service, combinations)
  ○ configuration + deltas to run any tenant/environment
  ○ automation to deploy the feature

● AWS::CloudFormation::Init metadata has a configSet to
  ○ unzip
  ○ install automation dependencies
  ○ execute the deployment automation
  ○ warm up the feature, post-install
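
“Configuration + deltas to run any tenant/environment” can be read as layered overrides. A minimal Python sketch, assuming nested-dict configuration and hypothetical delta keys: apply the tenant delta over the base, then the tenant-plus-environment delta. Later layers win, and the base is never mutated.

```python
from copy import deepcopy


def merge(base, delta):
    """Recursively overlay `delta` onto a copy of `base`; delta values win."""
    result = deepcopy(base)
    for key, value in delta.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def config_for(base, deltas, tenant, environment):
    """Apply the tenant delta, then the tenant/environment delta (hypothetical layering)."""
    config = merge(base, deltas.get(tenant, {}))
    return merge(config, deltas.get(f"{tenant}-{environment}", {}))
```

Shipping one package with every delta inside means the same artefact can be promoted unchanged from environment to environment; only the selected layers differ.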

What have we gained?

Instances are disposable and short-lived.

● Enables “shoot it in the head” debugging

● Disks never fill up any more

● Minimal environmental differences

● New environment == mostly automated

● Infrastructure as code == testable and repeatable (and we do test it!)
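
“Shoot it in the head” means replacing a misbehaving instance instead of nursing it back to health; with centralised logs and no state kept on the box, little is lost. A minimal sketch using the Auto Scaling `TerminateInstanceInAutoScalingGroup` call as exposed by boto3. The client is passed in so the function can be exercised without AWS credentials; any instance ID you see in an example is hypothetical.

```python
def shoot_in_the_head(autoscaling, instance_id):
    """Terminate a misbehaving instance and let its Auto Scaling group replace it.

    `autoscaling` is expected to be a boto3 Auto Scaling client, e.g.
    boto3.client("autoscaling", region_name="eu-west-1").
    """
    # Keeping desired capacity unchanged makes the group launch a fresh instance.
    response = autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )
    return response["Activity"]["ActivityId"]
```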

Culture again: On-call

Teams are on-call for their features.

Teams decide their own rota, with coverage minimums for peak time.

But: teams (must!) have autonomy to improve their features so they don’t get called as often.

Otherwise, it’s constant fire-fighting.

Things still break!

Page me once, shame on you. Page me twice, shame on me.

Teams do root-cause analysis of their own incidents. An operations team / NOC does not.

Warn the call centre proactively

Take action proactively

Automate mitigation steps!

Feature toggles: not just for launching new stuff.
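
Using a feature toggle as a kill switch is one way to automate a mitigation step. A minimal Python sketch, assuming an in-memory toggle store and a hypothetical non-essential feature called `recommendations`; a real store would need to be shared across instances, and the error-rate threshold is an arbitrary example.

```python
class FeatureToggles:
    """In-memory toggle store; unknown toggles are treated as off."""

    def __init__(self, **initial):
        self._state = dict(initial)

    def is_enabled(self, name):
        return self._state.get(name, False)

    def set(self, name, enabled):
        self._state[name] = enabled


def mitigate(toggles, error_rate, threshold=0.05):
    """Switch off a non-essential feature when errors spike; True if it acted."""
    if error_rate > threshold and toggles.is_enabled("recommendations"):
        toggles.set("recommendations", False)
        return True
    return False
```

Because the mitigation only flips a toggle, it is safe to run automatically from an alert, and a human can flip it back once the root cause is fixed.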

The role of our PaaS team

Enablement.

● Run monitoring & alerting

● Run centralised logging

● Run deployment service

● Apply security updates

Why not Azure / OpenStack et al?

The decision to migrate to AWS was made in late 2011.

AWS was more mature than the alternatives at the time. It offered many hosted services on top of the IaaS offering.

It still is, even accounting for Azure’s recent advances.

The future

Immutable/golden instances; faster provisioning.

Failover to a secondary region (we operate in CA).

Always: more test coverage, more confidence.

Publish some of our tools as OSS: https://github.com/justeat

The most important things

● Culture

● Principles that everyone lives by

● Devolve autonomy down to people on the ground

● (Tools)

Did we mention we’re hiring?

We’re pragmatic.

We’re successful.

We support each other.

We use sharp tools that we pick ourselves based on merit.

Join us!
  ○ http://tech.just-eat.com/jobs/
  ○ http://tech.just-eat.com/jobs/senior-software-engineer-platform-services/
  ○ Lots of other roles

Any questions?