PLNOG14: Automation at Brainly - Paweł Rozlach

Automation at Brainly … or how to enter the world of automation in a “different way”.

Upload: proidea

Post on 21-Jul-2015


TRANSCRIPT


OPS stack:

● ~80 servers, heavy usage of LXC containers (~1000)

● 99.9% Debian, 1 Ubuntu host :)

● Nginx / Apache2, 2k reqs per sec

● 200 million page views monthly

● 700 Mbps peak traffic

● Python is dominant

About Brainly

World's largest homework help social network, connecting over 40 million users monthly

DEV stack:

● PHP
- Symfony 2
- SOA projects
- 200 reqs per sec on the Russian version

● Erlang
- 55k concurrent users
- 22k events per sec

● Native Apps
- iOS
- Android

● Puppet was not feasible for us
- *lots* of dependencies, which make containers bigger/heavier
- problems with Puppet's declarative language
- seemed incoherent, lacking integration of orchestration
- steep learning curve
- YMMV

● "packaging as automation" as an intermediate solution
- dependency hell: installing one package could result in uninstalling others
- inflexible, lots of code duplication in debian/rules files
- LOTS of custom bash and PHP scripts, usually very hard to reuse and not standardized
- this was a dead end :(

● Ansible
- initially used only for orchestration
- maintaining it required keeping an up-to-date inventory, which later simplified and helped with lots of things

Starting point

● we decided to move forward with Ansible and use it for setting up machines as well

● first project was the Nagios monitoring plugins setup

● turned out to be ideal for containers and our needs in general
- very few dependencies to begin with (python2, python-apt), and a small footprint
- "configured" Python modules are transferred directly to the machine, no need for local repositories
- very light, no compilation on the destination host is needed
- easy to understand: tasks/playbooks map directly to the actions an ops/devops would have done by hand
- compatible with "automation by packages"; we were able to migrate from the old system in small steps
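Such a setup can be sketched as a minimal playbook; tasks map one-to-one to what an ops person would do by hand. Package and path names below are hypothetical, not taken from the talk:

```yaml
# nagios-plugins.yml - sketch; package/path names hypothetical
- hosts: all
  tasks:
    - name: Install the few dependencies the plugins need
      apt:
        name: python-apt
        state: present

    - name: Copy "configured" plugins directly to the host (no local repo)
      copy:
        src: plugins/
        dest: /usr/lib/nagios/plugins/
        mode: "0755"
```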

First steps with Ansible

● all policies, rules, and good practices written down in the automation repo's main directory

● helps with introducing new people to the team or with the devops approach
- newbies are able to start committing to the repo quickly
- what's in GUIDELINES.md is law; changing it requires wider consensus
- gives examples of how to deal with certain problems in a standardized way

● a few examples:
- limit the number of tags; each of them should be self-contained, with no cross-dependencies
- do not include roles/tasks inside other roles; this creates hard-to-follow dependencies
- NEVER subset the list of hosts inside a role, do it in site.yml. Otherwise debugging roles/hosts will become difficult
- think twice before adding a new role and esp. new groups. As the infrastructure grows, it becomes hard to manage and/or creates "dead" code/roles
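The "subset hosts in site.yml, never inside a role" guideline can be illustrated like this (group and role names are hypothetical):

```yaml
# site.yml - host selection happens here, so it is always obvious
# which machines a role touches (group/role names hypothetical)
- hosts: webservers
  roles:
    - base
    - site_us

- hosts: dbservers
  roles:
    - base
    - postgres
```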

Avoiding regressions

● one of the policies introduced was storing one-off scripts in a separate directory in our automation repo.

● most of them are Ansible playbooks used just for one particular task (e.g. the Squeeze->Wheezy migration)

● version-control everything!

● turned out to be very useful; some of the scripts proved useful enough to be rewritten into a proper role or tool

Ugly-hacks reusability

● available on GitHub and Ansible Galaxy:
https://galaxy.ansible.com/list#/roles/940
https://galaxy.ansible.com/list#/roles/941

● "base" role:

- is reused across 8 different production roles we have ATM
- contains basic monitoring, log rotation, package installation, etc.
- includes PHP setup in modphp/prefork configuration
- PHP disabled-functions control
- basic security setup
- does not include any site-specific stuff

● "site" role:
- contains all site-specific stuff and dependencies (vhosts, additional packages, etc.)
- usually very simple
- more than one site role is possible, only one base role though

● It is an example of how we make our roles reusable
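The base/site split can be sketched as a play applying one base role plus one or more site roles (the role split follows the slide; variable names are hypothetical):

```yaml
# sketch of the base/site pattern (variable names hypothetical)
- hosts: apache_hosts
  roles:
    # shared across ~8 production roles: monitoring, log rotation,
    # packages, modphp/prefork setup, basic security
    - apache2_base
    # site-specific: vhosts, extra packages; more than one site
    # role may be applied, but only one base role
    - role: apache2_site
      vhost_domain: example.com
```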

Apache2 automation

● automatically sets up monitoring based on the inventory and host groups

● implements the devops approach - if a dev has root on a machine, they also have access to all the monitoring related to that system

● automatic host dependencies based on host groups

● provisioning new hosts is no longer so painful ("auto-discovery")

● all service configuration is stored as YAML files and used in templates

● the role uses DNS data directly from the inventory in order to make monitoring independent of DNS failures

Icinga
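Rendering monitoring config straight from inventory data could look roughly like this hypothetical template sketch for Icinga host definitions; using ansible_ssh_host keeps checks working when DNS is down:

```jinja
{# hosts.cfg.j2 - sketch; group name is hypothetical #}
{% for host in groups['webservers'] %}
define host {
    use        generic-host
    host_name  {{ host }}
    {# address comes from the inventory, not from DNS lookups,
       so monitoring survives DNS failures #}
    address    {{ hostvars[host]['ansible_ssh_host'] }}
}
{% endfor %}
```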

DNS migration

● at the beginning:
- dozens of authoritative name servers, each of them with a customized configuration, running ~100 zones, all created by hand
- the main reason for that was using DNS for switching between primary/secondary servers/services

● three phases:
- slurping the configuration into Ansible
- normalizing the configuration
- improving the setup

● a Python script which uses the Ansible API to fetch normalized zone configuration from each server
- results available in a neat hash, with per-host, per-zone keys!
- normalization using the named-checkconf tool

● use the slurped configuration to re-generate all configs, this time using only the data available to Ansible

● "push-button" migration, after all recipes were ready :)

● secure: all zone transfers are signed with individual keys, ACLs are tight

● playbooks use DNS data directly from the inventory

● changing/migrating slaves/masters is easy, NS records are auto-generated

● updates to zones automatically bump the serial, while still preserving the YYYYMMDDxx format

● CRM records are auto-generated as well
* see next slide about CRM automation

● DNS entries are always up to date thanks to some custom action modules
- ansible_ssh_host variables are harvested and processed into zones
- only custom entries and zone primary/secondary server names are now stored in YAML
- new hosts are automatically added to zones, decommissioned ones removed
- auto-generation of reverse zones

DNS automation
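The serial-bump logic from the slide (preserve the YYYYMMDDxx format while always increasing the serial) can be sketched like this; the function name is hypothetical:

```python
from datetime import date

def bump_serial(current, today=None):
    """Return the next zone serial in YYYYMMDDxx format.

    If the current serial was issued today, increment the two-digit
    revision counter; otherwise start a fresh serial for today's date.
    """
    today = today or date.today()
    date_part = int(today.strftime("%Y%m%d"))
    if current // 100 == date_part:
        return current + 1       # same day: bump the xx revision
    return date_part * 100       # new day: revision starts at 00
```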

● we have ~130 CRM clusters

● setting them up by hand would be "difficult" at best, impossible at worst

● available on Ansible Galaxy:

- https://galaxy.ansible.com/list#/roles/956
- https://galaxy.ansible.com/list#/roles/979

● follows the pattern from apache2_base
- "base" role suitable for manually set up clusters
- "cluster" role provides the service on top of base, with a few reusable snippets and a possibility for more complex configurations

● automatic membership based on the Ansible inventory (no multicasts!)

● the most difficult part was providing synchronous handlers

● a few simple configurations are provided, like single service - single VIP

Corosync & Pacemaker

● initially we had neither the time nor the resources to set up a full-fledged LDAP

● we needed:
- users should be able to log in even during a network outage
- removing/adding users, ssh keys, custom settings, etc. all had to be supported
- it had to be reusable/accessible in other roles (e.g. Icinga/monitoring)
- different privileges for dev, production and other environments
- UID/GID unification

● turned out to be simpler than we thought - users are managed using a few simple tasks and group_vars data; the rest is handled via variable precedence

● migration/standardization required some effort though

User management automation
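A minimal sketch of the group_vars-driven approach; all file, variable and user names here are hypothetical, and per-environment overrides rely on Ansible's variable precedence as the slide describes:

```yaml
# group_vars/all.yml - baseline users for every host (names hypothetical)
managed_users:
  - name: jdoe
    uid: 2001
    groups: "ops"
    ssh_key: "ssh-rsa AAAA... jdoe@laptop"

# roles/users/tasks/main.yml - a few simple tasks do the work
- name: Create managed users with unified UIDs
  user:
    name: "{{ item.name }}"
    uid: "{{ item.uid }}"
    groups: "{{ item.groups }}"
  with_items: "{{ managed_users }}"

- name: Install ssh keys (logins keep working during network outages)
  authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.ssh_key }}"
  with_items: "{{ managed_users }}"
```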

● standard Ansible inventory management becomes a bit cumbersome with 100's of hosts:
- each host has to have ansible_ssh_host defined
- adding/removing a large number of hosts/groups required editing lots of files and/or one-off scripts
- IP address management using Google Docs does not scale ;)

● Ansible has a well-defined dynamic inventory API, with scripts available for AWS, Cobbler, Rackspace, Docker, and many others

● we wrote our own, based on a YAML file, version-controlled with git:
- a Python API allows manipulating the inventory easily
- logic and syntax checking of the inventory

● available as open source: https://github.com/brainly/inventory_tool

Inventory management
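Ansible's dynamic inventory API just expects a script that prints JSON when invoked with --list. A minimal sketch of the idea; the real inventory_tool parses a git-versioned YAML file, whereas here the data is inlined to keep the example self-contained, and all host names and addresses are hypothetical:

```python
#!/usr/bin/env python
import json

# In inventory_tool this data comes from a version-controlled YAML
# file; it is inlined here for the sketch.
INVENTORY = {
    "webservers": {"hosts": ["web1", "web2"]},
    "_meta": {
        "hostvars": {
            # ansible_ssh_host kept per host, so playbooks and
            # generated configs do not depend on DNS
            "web1": {"ansible_ssh_host": "10.0.0.11"},
            "web2": {"ansible_ssh_host": "10.0.0.12"},
        }
    },
}

def list_inventory():
    """Return the full inventory as Ansible's --list JSON."""
    return json.dumps(INVENTORY)

if __name__ == "__main__":
    print(list_inventory())
```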

● we are leasing our servers from Hetzner, no direct Layer 2 connectivity

● all tunnel setups are done using Ansible; a new server is automatically added to our network

● firewalls are set up by Ansible as well:
- OPS contribute the base firewall, DEVs can open the ports of interest for their application
- ferm at its base, for easy rule-making and for keeping the in-kernel firewall in sync with the on-disk rules
- rules are auto-generated based on the inventory; adding/removing hosts automatically reconfigures the firewall

Networking
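Generating ferm rules from the inventory could look roughly like this hypothetical template sketch (group and variable names are assumptions, not from the talk):

```jinja
{# ferm.conf.j2 - sketch; group/variable names hypothetical #}
table filter {
    chain INPUT {
        policy DROP;
        mod state state (ESTABLISHED RELATED) ACCEPT;

        {# base firewall contributed by OPS: ssh from every
           inventory host, addresses taken from hostvars #}
        proto tcp dport ssh saddr (
        {% for host in groups['all'] %}
            {{ hostvars[host]['ansible_ssh_host'] }}
        {% endfor %}
        ) ACCEPT;

        {# DEVs open their application ports via a simple list #}
        {% for port in app_open_ports | default([]) %}
        proto tcp dport {{ port }} ACCEPT;
        {% endfor %}
    }
}
```

Because the address lists come straight from the inventory, adding or removing a host regenerates the rules on the next run.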

● based on Bareos, an open-source Bacula fork

● new hosts are automatically set up for backup; extending storage space is no longer a problem

● authentication using certificates - a PITA without Ansible

Backups

● deployment done by Python script calling Ansible API

● simple tasks implemented using ansible playbooks

● complex logic implemented in Python

Deployments

● Jinja2 template error messages are "difficult" to interpret

● templates sometimes grow to huge complexity

● Jinja2 is designed for speed, but with tradeoffs - some Python operators are missing, and creating custom plugins/filters poses some problems

● multi-inheritance, problems with 2-headed trees

● speed, improved with "pipelining=True"; containerization in the long run

● some useful functionality requires a paid subscription (Ansible Tower)
- RESTful API, useful if you want to push a new application version to production via e.g. Jenkins
- schedules - currently we need to push the changes ourselves

Not everything is perfect

● developers by default have RO access to the repo, RW on a case-by-case basis

● changes to systems owned by developers are done by developers; OPS only provide the platform and tools

● all non-trivial changes require a Pull Request and a review from OPS

● mission-critical data is encrypted with Ansible Vault and pushed directly to the repo
- *strong* encryption
- available to Ansible without prior decryption (a password is still required though)
- all security-sensitive stuff can be skipped by developers with the "--skip-tags" option to ansible-playbook

Dev,DevOps,Ops
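The Vault workflow sketched above boils down to a few commands; the file name and the "security" tag are hypothetical examples:

```shell
# Encrypt a secrets file before committing it (file name hypothetical)
ansible-vault encrypt group_vars/production/secrets.yml

# OPS run the full playbook, supplying the vault password
ansible-playbook site.yml --ask-vault-pass

# DEVs without the vault password skip the tagged security tasks
ansible-playbook site.yml --skip-tags security
```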

● some of the things we mentioned can be found on our GitHub account

● we are working on open-sourcing more stuff

https://github.com/brainly

Opensource! Opensource! Opensource!

● time needed to deploy new markets dropped considerably

● increased productivity

● better cooperation with developers

● more capacity - devs are no longer blocked so much, and we can push tasks to them

● infrastructure as code

● versioning

● code-reuse, less copy-pasting

Conclusions

We are hiring!
http://brainly.co/jobs/

Questions?

Thank you!