DB Migrations = Pain

Uploaded by: eugen-oskin

Posted on 14-Apr-2017

Category: Software


TRANSCRIPT

DB Migrations = Pain

Context

● Look is an application for live video streaming

● Backend, iOS and Android clients, admin page, frontend for customers

● Good management

● Good architecture

Context

● 3 environments: develop, qa, production (and local)

● 3 core services:

– web (aka api)

– rtmp (video streaming)

– cent (realtime messaging)

Context

● There are 2 backend developers

● We think about code quality:

– very strict linter

– tests: unit and behave

– deploy in 1 command

Story

● Deployment after 3 months of development

● DB redesign: changed one of the core models to fit business logic

– Schema migration

– Data migration

● Statistics on the admin page

● Successfully deployed to dev and qa

Story

● The data migrations ran for 40 minutes:

– I was ready for that

● Production was down for 5 hours

– Kernel Panic!

● I deployed the previous version and restored the DB from a snapshot

– lost the last 3 hours of data

Plan

● Analyze

● Fix

● Learn the lesson

What were the symptoms?

● Django was not responding to requests at all

● Memory usage was fine

● CPU was fine

● Network was fine

● Actually, Django was responding with HUGE latency

– the best case was 5 minutes, for the simplest request!

How did we investigate?

● Find bottlenecks:

– analyze latencies locally

– django-silk is the best

● Fix them one by one

● Test the fixes on the develop environment

How did we fix it?

● Speed up data migrations: 40 minutes → 7 minutes

– select_related

● Move all long-running work to Celery tasks

● To prevent a race between Celery and Django, we run them on separate instances
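
The select_related speed-up typically comes from avoiding one extra query per row. The sketch below illustrates that effect with the standard-library sqlite3 module instead of the Django ORM, so it runs on its own. The tables `author` and `stream`, the two functions and the query counter are hypothetical and not taken from the Look code base.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stream (id INTEGER PRIMARY KEY, title TEXT,
                         author_id INTEGER REFERENCES author(id));
""")
conn.executemany("INSERT INTO author VALUES (?, ?)",
                 [(i, f"author-{i}") for i in range(1, 4)])
conn.executemany("INSERT INTO stream VALUES (?, ?, ?)",
                 [(i, f"stream-{i}", (i % 3) + 1) for i in range(1, 10)])


def per_row_lookup(conn):
    """N+1 pattern: one query for the streams, then one more per stream."""
    queries = 1
    result = []
    for title, author_id in conn.execute(
            "SELECT title, author_id FROM stream ORDER BY id").fetchall():
        name = conn.execute("SELECT name FROM author WHERE id = ?",
                            (author_id,)).fetchone()[0]
        queries += 1
        result.append((title, name))
    return result, queries


def joined_lookup(conn):
    """What select_related does in spirit: fetch related rows in one JOIN."""
    rows = conn.execute("""
        SELECT stream.title, author.name
        FROM stream JOIN author ON author.id = stream.author_id
        ORDER BY stream.id
    """).fetchall()
    return rows, 1


slow_rows, slow_queries = per_row_lookup(conn)
fast_rows, fast_queries = joined_lookup(conn)
```

Both functions return the same rows; the difference is only the number of round trips to the database, which grows with the table size in the first case and stays constant in the second.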

How did we fix it?

● Simplify the admin page:

– Calculate metrics in a periodic Celery task (every 10 minutes, with a 1-hour timeout)

– Keep them in the DB

– Join with the metric table
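
The idea of precomputing metrics and joining them in at request time can be sketched as follows. It uses sqlite3 so that it is self-contained; in the setup described here, `refresh_metrics` would be the body of the periodic Celery task. The table and column names (`app_user`, `stream`, `user_metric`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE user_metric (user_id INTEGER PRIMARY KEY,
                              stream_count INTEGER);
    INSERT INTO app_user VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO stream VALUES (1, 1), (2, 1), (3, 2);
""")


def refresh_metrics(conn):
    """Periodic job: recompute the metric table in one transaction."""
    with conn:
        conn.execute("DELETE FROM user_metric")
        conn.execute("""
            INSERT INTO user_metric (user_id, stream_count)
            SELECT app_user.id, COUNT(stream.id)
            FROM app_user LEFT JOIN stream ON stream.user_id = app_user.id
            GROUP BY app_user.id
        """)


def admin_rows(conn):
    """Admin page query: a cheap join, no aggregation at request time."""
    return conn.execute("""
        SELECT app_user.name, COALESCE(user_metric.stream_count, 0)
        FROM app_user
        LEFT JOIN user_metric ON user_metric.user_id = app_user.id
        ORDER BY app_user.id
    """).fetchall()


refresh_metrics(conn)
rows = admin_rows(conn)
```

The trade-off is staleness: the admin page shows values as of the last refresh, not as of the request.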

What do we need to do?

● Zero-downtime deployment, aka Continuous Deployment

Continuous Deployment

● Blue Green Deployment

Our way

● Use 2 web instances:

– Current

– Staging

● Use 2 DB instances:

– Current

– Staging

Our way

● Deployment steps:

– Deploy to staging

– Run migrations

– Wait

– Swap the DNS
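
The deployment steps above can be modelled in a few lines. This is only a toy model: the `Environment` and `Cluster` classes, the `deploy` function and the health-check callback are hypothetical stand-ins for real provisioning, migration and DNS tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Environment:
    name: str
    version: str
    applied_migrations: List[str] = field(default_factory=list)


@dataclass
class Cluster:
    current: Environment   # receives production traffic
    staging: Environment   # idle copy that gets the new release


def deploy(cluster: Cluster, version: str, migrations: List[str],
           is_healthy: Callable[[Environment], bool]) -> bool:
    """Deploy to staging, migrate, verify, and only then swap."""
    staging = cluster.staging
    staging.version = version                      # 1. deploy to staging
    for migration in migrations:                   # 2. run migrations
        if migration not in staging.applied_migrations:
            staging.applied_migrations.append(migration)
    if not is_healthy(staging):                    # 3. wait and verify
        return False                               # current keeps serving
    # 4. swap the DNS: staging becomes current and vice versa
    cluster.current, cluster.staging = cluster.staging, cluster.current
    return True


cluster = Cluster(current=Environment("blue", "1.0"),
                  staging=Environment("green", "1.0"))
ok = deploy(cluster, "1.1", ["0002_redesign"],
            lambda env: env.version == "1.1")
```

The point of the model is the early return: a slow or failed migration only affects the staging pair, and production traffic stays on the untouched current pair until the swap.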

The fixes deployment

● Production was down for 4 hours

– Panic!

● The same symptoms!

The guess

● Look at the whole stack:

– The DB floods the disk space

– The free disk space metric has a reverse sawtooth form

● Super hot fix: turn off the metric task

– The free disk space metric has the same period as the periodic task for calculating metrics

Investigation

● Use a clone of the production DB

● Run the raw query that collects metrics

– It ran for 1 hour!

● This is the reason!

How did we fix it?

● The raw query looks like:

– SELECT DISTINCT

– 8 LEFT OUTER JOINs

– 5 COUNTs

– 3 CASEs

– GROUP BY user.id

● Use EXPLAIN
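
As an illustration of the EXPLAIN step, here is a sketch using SQLite's `EXPLAIN QUERY PLAN` through the standard-library sqlite3 module. The slides do not say which database engine Look uses, and the EXPLAIN output differs between engines, so treat the schema, the index name and the `query_plan` helper as hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER);
""")

QUERY = """
    SELECT app_user.id, COUNT(stream.id)
    FROM app_user LEFT OUTER JOIN stream ON stream.user_id = app_user.id
    GROUP BY app_user.id
"""


def query_plan(conn, sql):
    """Return the human-readable detail column of each plan step."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = query_plan(conn, QUERY)
conn.execute("CREATE INDEX stream_user_idx ON stream (user_id)")
plan_after = query_plan(conn, QUERY)

for step in plan_after:
    print(step)
```

Comparing the plan before and after a change (here, adding an index) is the usual way to check whether the planner actually picks up the change.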

How did we fix it?

● We did not try to use the raw query in Django

– There is no reason to do so

● Attempts:

– Remove metrics that require CASEs

– Reduce the number of COUNTs and JOINs

– Remove DISTINCT

– Fetch row by row

– Use one query for each metric

How did we fix it?

● The fix is:

– Use one query for each metric

● It gave the best performance in the production case
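
One plausible reason separate queries win can be reproduced on a small scale: joining several one-to-many tables in a single statement multiplies the intermediate rows, which then have to be de-duplicated. The sketch below compares both query shapes, with a hypothetical schema and sqlite3 standing in for the production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY);
    CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE comment (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO app_user VALUES (1);
""")
conn.executemany("INSERT INTO stream (user_id) VALUES (?)", [(1,)] * 20)
conn.executemany("INSERT INTO comment (user_id) VALUES (?)", [(1,)] * 30)

# One big query: 20 x 30 joined rows for a single user,
# so every COUNT needs DISTINCT to stay correct.
joined_rows = conn.execute("""
    SELECT COUNT(*) FROM app_user
    LEFT OUTER JOIN stream ON stream.user_id = app_user.id
    LEFT OUTER JOIN comment ON comment.user_id = app_user.id
""").fetchone()[0]

combined = conn.execute("""
    SELECT app_user.id, COUNT(DISTINCT stream.id), COUNT(DISTINCT comment.id)
    FROM app_user
    LEFT OUTER JOIN stream ON stream.user_id = app_user.id
    LEFT OUTER JOIN comment ON comment.user_id = app_user.id
    GROUP BY app_user.id
""").fetchall()


def metric(conn, table):
    """One small aggregate per metric; `table` is a trusted literal here."""
    return dict(conn.execute(
        f"SELECT user_id, COUNT(*) FROM {table} GROUP BY user_id"))


streams = metric(conn, "stream")
comments = metric(conn, "comment")
```

With 8 joins instead of 2, the intermediate result grows multiplicatively, while the per-metric queries each scan one table once.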

Did it help?

Yes

The lesson

● Good management and good architecture matter

● Deploy more frequently

● Do not use data migrations as is

– Use commands

● Django admin is not efficient for aggregation queries

● Analysis and synthesis matter
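
One reading of the "use commands" lesson is a Django management command that backfills data in small batches, separately from the schema migration, so that it can be rerun and does not block the deploy. The sketch below shows only that batching logic, with sqlite3 in place of the Django ORM; the `stream` table, its columns and `backfill_slugs` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stream (id INTEGER PRIMARY KEY, title TEXT, slug TEXT);
""")
conn.executemany("INSERT INTO stream (title) VALUES (?)",
                 [(f"Live Show {i}",) for i in range(1, 26)])
conn.commit()


def backfill_slugs(conn, batch_size=10):
    """Fill the new `slug` column in batches; safe to rerun after a crash."""
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT id, title FROM stream WHERE slug IS NULL "
            "ORDER BY id LIMIT ?", (batch_size,)).fetchall()
        if not rows:
            return batches
        with conn:  # one short transaction per batch
            conn.executemany(
                "UPDATE stream SET slug = ? WHERE id = ?",
                [(title.lower().replace(" ", "-"), pk)
                 for pk, title in rows])
        batches += 1


first_run = backfill_slugs(conn)    # processes all 25 rows in 3 batches
second_run = backfill_slugs(conn)   # nothing left to do
```

Because each batch commits on its own and the selection only picks unprocessed rows, an interrupted run can simply be started again.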

A proof

● I have refactored another core model:

– A schema migration

– A command for data migration

● I have deployed it without downtime

● The Look production environment is still alive

Summary

● Analyze

● Fix

● Learn the lesson

References

● https://crystalnix.com/works/look/

● http://martinfowler.com/bliki/BlueGreenDeployment.html

● https://gist.github.com/EvgeneOskin/99880b7b7e0cd2d0115f87b7eeb5ae57
