DB migrations equal pain
TRANSCRIPT
Context
● Look is an application for live video streaming
● Backend, iOS and Android clients, admin page, frontend for customers
● Good management
● Good architecture
Context
● 3 environments: develop, qa, production (and local)
● 3 core services:
– web (aka api)
– rtmp (video streaming)
– cent (realtime messaging)
Context
● There are 2 backend developers
● We think about code quality:
– very strict linter
– tests: unit and behave
– deploy in 1 command
Story
● Deployment after 3 months of development
● DB redesign: changed one of the core models to fit the business logic
– Schema migration
– Data migration
● Statistics on the admin page
● Successfully deployed to dev and qa
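The schema + data migration pair above can be sketched as a single Django migration file. This is a fragment, not runnable on its own, and all app, model, and field names are hypothetical, not from the real project:

```python
# Hypothetical sketch of a schema change plus a data migration shipped
# in one deploy. App/model/field names are illustrative only.
from django.db import migrations, models


def copy_legacy_data(apps, schema_editor):
    # Use the historical model via apps.get_model, not a direct import.
    Stream = apps.get_model("streams", "Stream")
    for stream in Stream.objects.all():
        stream.title = stream.legacy_name
        stream.save(update_fields=["title"])


class Migration(migrations.Migration):
    dependencies = [("streams", "0007_previous")]

    operations = [
        # Schema migration: add the new column first.
        migrations.AddField(
            "stream", "title",
            models.CharField(max_length=255, default=""),
        ),
        # Data migration: backfill it row by row.
        migrations.RunPython(copy_legacy_data,
                             reverse_code=migrations.RunPython.noop),
    ]
```

Running the backfill inside `migrate` like this is exactly what the later slides warn against: the whole deploy blocks until `RunPython` finishes.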
Story
● The data migration ran for 40 minutes
– I was prepared for that
● Production was down for 5 hours
– Kernel panic!
● I deployed the previous version and restored the DB from a snapshot
– we lost the last 3 hours of data
What were the symptoms?
● Django was not responding to requests at all
● Memory usage was fine
● CPU was fine
● Network was fine
● Actually, Django was responding, but with HUGE latency
– in the best case, 5 minutes for the simplest request!
How did we investigate?
● Find the bottlenecks:
– analyze latencies locally – django-silk is the best tool for this
● Fix them one by one
● Test the fixes on the develop environment
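A minimal django-silk setup, following the library's standard installation instructions; this is a generic settings fragment, not the project's actual configuration:

```python
# settings.py additions for django-silk (per its standard setup).
# Silk records every request's latency and SQL queries.
INSTALLED_APPS += ["silk"]
MIDDLEWARE = ["silk.middleware.SilkyMiddleware"] + MIDDLEWARE

# Then expose the profiling UI in urls.py:
# urlpatterns += [path("silk/", include("silk.urls", namespace="silk"))]
# and browse per-request latencies and queries at /silk/.
```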
How did we fix it?
● Speed up the data migration: 40 minutes → 7 minutes
– select_related
● Move all long-running work into Celery tasks
● To prevent races between Celery and Django, we run them on separate instances
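select_related collapses the N+1 pattern (one extra query per row) into a single JOIN, which is where the 40 → 7 minute speedup comes from. A self-contained sketch of the same idea in raw SQL, with hypothetical tables:

```python
import sqlite3

# Self-contained sketch of what select_related changes: instead of one
# extra query per row (N+1), fetch the related rows with a single JOIN.
# Table and column names are illustrative, not from the real project.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stream (id INTEGER PRIMARY KEY, title TEXT,
                         author_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO stream VALUES (1, 'demo', 1), (2, 'live', 2);
""")

# N+1 pattern: one extra query per stream (slow during the migration).
slow = []
for stream_id, title, author_id in conn.execute("SELECT * FROM stream"):
    (name,) = conn.execute("SELECT name FROM author WHERE id = ?",
                           (author_id,)).fetchone()
    slow.append((title, name))

# select_related-style pattern: a single JOIN fetches everything at once.
fast = conn.execute("""
    SELECT stream.title, author.name
    FROM stream JOIN author ON author.id = stream.author_id
    ORDER BY stream.id
""").fetchall()

print(fast == slow)  # True: same data, one query instead of N+1
```

In Django ORM terms this is `Stream.objects.select_related("author")` instead of touching `stream.author` inside a loop.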
How did we fix it?
● Simplify the admin page:
– Calculate the metrics in a periodic Celery task (every 10 minutes, with a 1-hour timeout)
– Keep the results in the DB
– Join with the metrics table
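The periodic-metrics setup above might look like this with Celery beat; the app, task, and module names are hypothetical:

```python
# Sketch of the periodic-metrics setup: compute metrics every 10 minutes
# in a Celery task with a 1-hour hard timeout, and store the results so
# the admin page can JOIN against them instead of aggregating live.
# Names are hypothetical, not from the real project.
from celery import Celery

app = Celery("look")

app.conf.beat_schedule = {
    "calculate-metrics": {
        "task": "metrics.tasks.calculate_metrics",
        "schedule": 600.0,  # every 10 minutes
    },
}


@app.task(time_limit=3600)  # hard timeout of 1 hour
def calculate_metrics():
    # Compute each metric and write the results into a metrics table.
    ...
```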
The guess
● Look at the whole stack:
– The DB was flooding the disk space
– The free-disk-space graph had a reverse sawtooth shape
● Super hot fix: turn off the metrics task
– the free-disk-space graph had the same period as the periodic task that calculates the metrics
Investigation
● Use a clone of the production DB
● Run the raw query that collects the metrics
– It ran for 1 hour!
● That was the reason!
How did we fix it?
● The raw query looked like:
– SELECT DISTINCT
– 8 LEFT OUTER JOINs
– 5 COUNTs
– 3 CASEs
– GROUP BY user.id
● Use EXPLAIN
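On the real database this would be EXPLAIN (e.g. EXPLAIN ANALYZE on PostgreSQL) against the actual metrics query. The sketch below uses SQLite's EXPLAIN QUERY PLAN on hypothetical tables only to stay self-contained:

```python
import sqlite3

# EXPLAIN shows how the DB executes a query: which tables are scanned,
# which indexes are used. SQLite's EXPLAIN QUERY PLAN stands in here for
# PostgreSQL's EXPLAIN ANALYZE; tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY,
                           user_id INTEGER REFERENCES user(id));
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT user.id, COUNT(purchase.id)
    FROM user LEFT OUTER JOIN purchase ON purchase.user_id = user.id
    GROUP BY user.id
""").fetchall()

for row in plan:
    print(row[-1])  # e.g. "SCAN user", then how purchase is joined
```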
How did we fix it?
● We had not tried to use the raw query in Django
– There was no reason to do so
● Attempts:
– Remove the metrics that require CASEs
– Reduce the number of COUNTs and JOINs
– Remove DISTINCT
– Fetch row by row
– Use one query for each metric
How did we fix it?
● The fix: use one query for each metric
● It gave the best performance in the production case
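A self-contained sketch of the before/after, with hypothetical tables: the combined query needs DISTINCT counts because stacked one-to-many JOINs multiply rows, while one cheap aggregate per metric avoids that entirely:

```python
import sqlite3

# Sketch of the fix: replace one huge query (many LEFT JOINs, COUNTs,
# DISTINCT) with one simple aggregate per metric, merged in Python.
# Tables and metrics are hypothetical, not from the real project.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY);
    CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE follow (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO user VALUES (1), (2);
    INSERT INTO stream VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO follow VALUES (1, 1);
""")

# Before: one combined query. JOINing two one-to-many relations
# multiplies rows, so DISTINCT is required to keep the counts correct.
combined = conn.execute("""
    SELECT user.id,
           COUNT(DISTINCT stream.id),
           COUNT(DISTINCT follow.id)
    FROM user
    LEFT OUTER JOIN stream ON stream.user_id = user.id
    LEFT OUTER JOIN follow ON follow.user_id = user.id
    GROUP BY user.id
    ORDER BY user.id
""").fetchall()

# After: one cheap query per metric, merged in Python.
streams = dict(conn.execute(
    "SELECT user_id, COUNT(*) FROM stream GROUP BY user_id"))
follows = dict(conn.execute(
    "SELECT user_id, COUNT(*) FROM follow GROUP BY user_id"))
per_metric = [(uid, streams.get(uid, 0), follows.get(uid, 0))
              for (uid,) in conn.execute("SELECT id FROM user ORDER BY id")]

print(per_metric == combined)  # True: same numbers, far cheaper queries
```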
The lesson
● Good management and good architecture matter
● Deploy more frequently
● Do not use data migrations as-is
– Use management commands
● The Django admin is not efficient for aggregation queries
● Analysis and synthesis both matter
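The "use commands" lesson, sketched as a hypothetical Django management command: the schema migration stays in `migrate` and runs fast, while the backfill runs separately (and can be retried) after the deploy. All names are illustrative:

```python
# Hypothetical management command: run the data backfill after deploy
# (e.g. `python manage.py backfill_stream_titles`) instead of inside
# `migrate`, so the deploy itself does not block for 40 minutes.
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Backfill the new field on Stream (illustrative example)"

    def handle(self, *args, **options):
        from streams.models import Stream  # hypothetical app/model

        qs = Stream.objects.filter(title="")
        for stream in qs.iterator():
            stream.title = stream.legacy_name
            stream.save(update_fields=["title"])
        self.stdout.write(self.style.SUCCESS("Backfill done"))
```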
A proof
● I refactored another core model:
– A schema migration
– A management command for the data migration
● I deployed it without downtime
● The Look production environment is still alive
%93
References
● https://crystalnix.com/works/look/
● http://martinfowler.com/bliki/BlueGreenDeployment.html
● https://gist.github.com/EvgeneOskin/99880b7b7e0cd2d0115f87b7eeb5ae57